---
title: "textclean"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output:
md_document:
toc: true
---
```{r, echo=FALSE}
desc <- suppressWarnings(readLines("DESCRIPTION"))
regex <- "(^Version:\\s+)(\\d+\\.\\d+\\.\\d+)"
loc <- grep(regex, desc)
ver <- gsub(regex, "\\2", desc[loc])
pacman::p_load_current_gh('trinker/numform')
verbadge <- sprintf('<a href="https://img.shields.io/badge/Version-%s-orange.svg"><img src="https://img.shields.io/badge/Version-%s-orange.svg" alt="Version"/></a></p>', ver, ver)
```
[![Project Status: Active - The project has reached a stable, usable
state and is being actively
developed.](http://www.repostatus.org/badges/0.1.0/active.svg)](http://www.repostatus.org/#active)
[![Build Status](https://travis-ci.org/trinker/textclean.svg?branch=master)](https://travis-ci.org/trinker/textclean)
[![Coverage Status](https://coveralls.io/repos/trinker/textclean/badge.svg?branch=master)](https://coveralls.io/r/trinker/textclean?branch=master)
[![](http://cranlogs.r-pkg.org/badges/textclean)](https://cran.r-project.org/package=textclean)
`r verbadge`
<img src="inst/textclean_logo/r_textclean2.png" width="200" alt="textclean Logo">
**textclean** is a collection of tools to clean and process text. Many of these tools have been taken from the **qdap** package and revamped to be more intuitive, better named, and faster.
# Functions
The main functions, task category, & descriptions are summarized in the table below:
| Function | Task | Description |
|---------------------------|-------------|---------------------------------------|
| `mgsub` | subbing | Multiple `gsub` |
| `sub_holder` | subbing | Hold a value prior to a `strip` |
| `swap` | subbing | Simultaneously swap patterns 1 & 2 |
| `strip`                  | deletion    | Remove all non-word characters        |
| `filter_empty_row` | filter rows | Remove empty rows |
| `filter_row` | filter rows | Remove rows matching a regex |
| `filter_NA` | filter rows | Remove `NA` text rows |
| `filter_element` | filter elements | Remove matching elements from a vector |
| `replace_contractions` | replacement | Replace contractions with both words |
| `replace_emoticon`       | replacement | Replace emoticons with word equivalent |
| `replace_grade`          | replacement | Replace grades (e.g., "A+") with word equivalent |
| `replace_html` | replacement | Replace HTML tags and symbols |
| `replace_incomplete` | replacement | Replace incomplete sentence end-marks |
| `replace_non_ascii` | replacement | Replace non-ascii with equivalent or remove |
| `replace_number` | replacement | Replace common numbers |
| `replace_ordinal` | replacement | Replace common ordinal number form |
| `replace_rating`         | replacement | Replace ratings (e.g., "10 out of 10", "3 stars") with word equivalent |
| `replace_symbol` | replacement | Replace common symbols |
| `replace_white` | replacement | Replace regex white space characters |
| `replace_token` | replacement | Remove or replace a vector of tokens with a single value |
| `add_comma_space` | replacement | Replace non-space after comma |
| `add_missing_endmark` | replacement | Replace missing endmarks with desired symbol |
| `check_text` | check | Text report of potential issues |
| `has_endmark` | check | Check if an element has an end-mark |
# Installation
To download the development version of **textclean**:
Download the [zip ball](https://github.com/trinker/textclean/zipball/master) or [tar ball](https://github.com/trinker/textclean/tarball/master), decompress and run `R CMD INSTALL` on it, or use the **pacman** package to install the development version:
```r
if (!require("pacman")) install.packages("pacman")
pacman::p_load_gh(
"trinker/lexicon",
"trinker/textclean"
)
```
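**textclean** is also on CRAN (as the download badge above indicates), so the stable release can be installed in the usual way:

```r
install.packages("textclean")
```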
# Contact
You are welcome to:
- submit suggestions and bug-reports at: <https://github.com/trinker/textclean/issues>
- send a pull request on: <https://github.com/trinker/textclean/>
- compose a friendly e-mail to: <tyler.rinker@gmail.com>
# Demonstration
## Load the Packages/Data
```{r, message=FALSE}
if (!require("pacman")) install.packages("pacman")
pacman::p_load(dplyr)
pacman::p_load_gh("trinker/textshape", "trinker/lexicon", "trinker/textclean")
```
## Check Text
One of the most useful tools in **textclean** is `check_text`, which scans text variables and reports potential problems. Not every potential problem is a definite problem for analysis, but the report provides an overview of what may need further preparation, along with suggested functions for each reported issue. The report provides information on the following:
1. **non_character** - Text that is `factor`.
2. **missing_ending_punctuation** - Text with no endmark at the end of the string.
3. **empty** - Text that contains an empty element (i.e., `""`).
4. **double_punctuation** - Text that contains two punctuation marks in the same string.
5. **non_space_after_comma** - Text that contains commas with no space after them.
6. **no_alpha** - Text that contains string elements with no alphabetic characters.
7. **non_ascii** - Text that contains non-ASCII characters.
8. **missing_value** - Text that contains missing values (i.e., `NA`).
9. **containing_escaped** - Text that contains escaped characters (see `?Quotes`).
10. **containing_digits** - Text that contains digits.
11. **indicating_incomplete** - Text that contains endmarks that are indicative of incomplete/trailing sentences (e.g., `...`).
12. **potentially_misspelled** - Text that contains potentially misspelled words.
Here is an example:
```{r}
x <- c("i like", "i want. thet them ther .", "I am ! that|", "", NA,
"they,were there", ".", " ", "?", "3;", "I like goud eggs!",
"bi\xdfchen Z\xfcrcher", "i 4like...", "\\tgreat", "She said \"yes\"")
Encoding(x) <- "latin1"
x <- as.factor(x)
check_text(x)
```
And if all is well the user should be greeted by a cow:
```{r}
y <- c("A valid sentence.", "yet another!")
check_text(y)
```
## Row Filtering
It is useful to filter/remove empty rows or unwanted rows (for example, the researcher dialogue from a transcript). The `filter_empty_row` & `filter_row` functions do just this. First I'll demo the removal of empty rows.
```{r}
## create a data set with empty rows
(dat <- rbind.data.frame(DATA[, c(1, 4)], matrix(rep(" ", 4),
    ncol = 2, dimnames = list(12:13, colnames(DATA)[c(1, 4)]))))
filter_empty_row(dat)
```
Next we filter out unwanted rows. The `filter_row` function takes a data set, a column (named or numeric position), and regex terms to search for. The `terms` argument takes regex(es), allowing for partial matching. Matching is case sensitive by default but can be changed via the `ignore.case` argument.
```{r}
filter_row(dataframe = DATA, column = "person", terms = c("sam", "greg"))
filter_row(DATA, 1, c("sam", "greg"))
filter_row(DATA, "state", c("Comp"))
filter_row(DATA, "state", c("I "))
filter_row(DATA, "state", c("you"), ignore.case = TRUE)
```
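Conceptually, `filter_row` keeps only the rows whose column does not match any of the supplied regexes. Here is a minimal base-R sketch of that idea (an illustration only, not the package's actual implementation):

```r
## Keep rows whose `column` does NOT match any of the regex `terms`
filter_row_sketch <- function(dataframe, column, terms, ignore.case = FALSE) {
    pattern <- paste(terms, collapse = "|")   # combine terms into one regex
    keep <- !grepl(pattern, dataframe[[column]], ignore.case = ignore.case)
    dataframe[keep, , drop = FALSE]
}
```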
## Stripping
Often it is useful to strip a text of all non-relevant symbols and case, retaining only letters, spaces, and apostrophes. The `strip` function accomplishes this. The `char.keep` argument allows the user to retain additional characters.
```{r}
strip(DATA$state)
strip(DATA$state, apostrophe.remove = TRUE)
strip(DATA$state, char.keep = c("?", "."))
```
## Subbing
### Multiple Subs
`gsub` is a great tool but often the user wants to replace a vector of elements with another vector. `mgsub` allows for a vector of patterns and replacements. Note that the first argument of `mgsub` is the data, not the `pattern` as is standard with base R's `gsub`. This allows `mgsub` to be used in a **magrittr** pipeline more easily. Also note that by default `fixed = TRUE`. This means the search `pattern` is not treated as a regex, which makes the replacement much faster when a regex search is not needed. `mgsub` also reorders the patterns so that shorter patterns contained within longer patterns don't overwrite them. For example, if the pattern `c('i', 'it')` is given, the longer `'it'` is replaced first (`order.pattern = FALSE` can be used to disable this feature).
```{r}
mgsub(DATA$state, c("it's", "I'm"), c("<<it is>>", "<<I am>>"))
mgsub(DATA$state, "[[:punct:]]", "<<PUNCT>>", fixed = FALSE)
mgsub(DATA$state, c("i", "it"), c("<<I>>", "[[IT]]"))
mgsub(DATA$state, c("i", "it"), c("<<I>>", "[[IT]]"), order.pattern = FALSE)
```
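The longest-first ordering can be illustrated with a base-R sketch (a simplified stand-in for `mgsub` with `fixed = TRUE`, ignoring the placeholder machinery the real function uses):

```r
mgsub_sketch <- function(x, pattern, replacement) {
    ord <- order(nchar(pattern), decreasing = TRUE)  # longest patterns first
    pattern <- pattern[ord]
    replacement <- replacement[ord]
    for (i in seq_along(pattern)) {
        x <- gsub(pattern[i], replacement[i], x, fixed = TRUE)
    }
    x
}

mgsub_sketch("i like it", c("i", "it"), c("<<I>>", "[[IT]]"))
## "<<I>> l<<I>>ke [[IT]]"
```

Because `"it"` is replaced before `"i"`, the longer pattern survives intact; a shorter-first ordering would mangle `"it"` into `"<<I>>t"`.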
### Stashing Character Pre-Sub
There are times the user may want to stash a set of characters before subbing them out and then return the stashed characters. An example of this is when a researcher wants to remove punctuation but not emoticons. The `sub_holder` function provides tooling to stash the emoticons, allow a punctuation stripping, and then return the emoticons. First I'll create some fake text data with emoticons, then stash the emoticons (using a unique text key to hold their place), then strip out the punctuation, and last put the stashed emoticons back.
```{r}
(fake_dat <- paste(hash_emoticons[1:11, 1, with=FALSE][[1]], DATA$state))
(m <- sub_holder(fake_dat, hash_emoticons[[1]]))
(m_stripped <- strip(m$output))
m$unhold(m_stripped)
```
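The `swap` function rounds out the subbing tools by exchanging two patterns in a single pass (chaining two `gsub` calls would clobber the first replacement). The call below is a sketch that assumes the signature `swap(x, pattern1, pattern2)`:

```r
## Swap "it's" and "I'm" simultaneously
swap(DATA$state, "it's", "I'm")
```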
## Replacement
**textclean** contains tools to replace substrings within text with other substrings that may be easier to analyze. This section outlines the uses of these tools.
### Contractions
Some analysis techniques require contractions to be replaced with their multi-word forms (e.g., "I'll" -> "I will"). `replace_contraction` provides this functionality.
```{r}
x <- c("Mr. Jones isn't going.",
"Check it out what's going on.",
"He's here but didn't go.",
"the robot at t.s. wasn't nice",
"he'd like it if i'd go away")
replace_contraction(x)
```
### Emoticons
Some analysis techniques examine words, meaning emoticons may be ignored. `replace_emoticon` replaces emoticons with their word equivalents.
```{r}
x <- c(
"text from: http://www.webopedia.com/quick_ref/textmessageabbreviations_02.asp",
"... understanding what different characters used in smiley faces mean:",
"The close bracket represents a sideways smile )",
"Add in the colon and you have sideways eyes :",
"Put them together to make a smiley face :)",
"Use the dash - to add a nose :-)",
"Change the colon to a semi-colon ; and you have a winking face ;) with a nose ;-)",
"Put a zero 0 (halo) on top and now you have a winking, smiling angel 0;) with a nose 0;-)",
"Use the letter 8 in place of the colon for sunglasses 8-)",
"Use the open bracket ( to turn the smile into a frown :-("
)
replace_emoticon(x)
```
### Grades
In analyses where grades may be discussed, it may be useful to convert the letter forms into word meanings. The `replace_grade` function can be used for this task.
```{r}
text <- c(
"I give an A+",
"He deserves an F",
"It's C+ work",
"A poor example deserves a C!"
)
replace_grade(text)
```
### HTML
Sometimes HTML tags and symbols stick around like pesky gnats. The `replace_html` function makes light work of them.
```{r}
x <- c(
"<bold>Random</bold> text with symbols: < > & " '",
"<p>More text</p> ¢ £ ¥ € © ®"
)
replace_html(x)
```
### Incomplete Sentences
Sometimes an incomplete sentence is denoted with multiple end marks or no punctuation at all. `replace_incomplete` standardizes these sentences with a pipe (`|`) endmark (or one of the user's choice).
```{r}
x <- c("the...", "I.?", "you.", "threw..", "we?")
replace_incomplete(x)
replace_incomplete(x, '...')
```
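Relatedly, the table above lists `has_endmark` and `add_missing_endmark` for checking and repairing missing endmarks; the calls below are a sketch based on the table's descriptions (argument defaults are assumptions):

```r
x <- c("the...", "you.", "threw")
has_endmark(x)            ## logical: does each element end with an endmark?
add_missing_endmark(x)    ## append an endmark symbol where one is missing
```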
### Non-ASCII Characters
R can choke on non-ASCII characters. They can be re-encoded, but the new encoding may lack interpretability (e.g., ¢ may be converted to `\xA2`, which is not easily understood or likely to be matched in a hash lookup). `replace_non_ascii` attempts to replace common non-ASCII characters with a text representation (e.g., ¢ becomes "cent"). Unrecognized non-ASCII characters are simply removed (unless `remove.nonconverted = FALSE`).
```{r}
x <- c(
"Hello World", "6 Ekstr\xf8m", "J\xf6reskog", "bi\xdfchen Z\xfcrcher",
'This is a \xA9 but not a \xAE', '6 \xF7 2 = 3', 'fractions \xBC, \xBD, \xBE',
'cows go \xB5', '30\xA2'
)
Encoding(x) <- "latin1"
x
replace_non_ascii(x)
replace_non_ascii(x, remove.nonconverted = FALSE)
```
### Numbers
Some analyses require numbers to be converted to text form. `replace_number` attempts to perform this task and handles comma-separated numbers as well.
```{r}
x <- c("I like 346,457 ice cream cones.", "They are 99 percent good")
y <- c("I like 346457 ice cream cones.", "They are 99 percent good")
replace_number(x)
replace_number(y)
replace_number(x, num.paste = TRUE)
replace_number(x, remove=TRUE)
```
### Ratings
Some texts use ratings to convey satisfaction with a particular object. The `replace_rating` function replaces the more abstract rating with word equivalents.
```{r}
x <- c("This place receives 5 stars for their APPETIZERS!!!",
"Four stars for the food & the guy in the blue shirt for his great vibe!",
"10 out of 10 for both the movie and trilogy.",
"* Both the Hot & Sour & the Egg Flower Soups were absolutely 5 Stars!",
"For service, I give them no stars.", "This place deserves no stars.",
"10 out of 10 stars.",
"My rating: just 3 out of 10.",
"If there were zero stars I would give it zero stars.",
"Rating: 1 out of 10.",
"I gave it 5 stars because of the sound quality.",
"If it were possible to give them 0/10, they'd have it."
)
replace_rating(x)
```
### Ordinal Numbers
Again, some analyses require numbers, including ordinal numbers, to be converted to text form. `replace_ordinal` attempts to perform this task for ordinal numbers 1 through 100 (i.e., 1st - 100th).
```{r}
x <- c(
"I like the 1st one not the 22nd one.",
"For the 100th time stop those 3 things!",
"I like the 3rd 1 not the 12th 1."
)
replace_ordinal(x)
replace_ordinal(x, TRUE)
replace_ordinal(x, remove = TRUE)
replace_number(replace_ordinal(x))
```
### Symbols
Text often contains short-hand representations of words/phrases. These symbols may contain analyzable information, but in symbolic form they cannot be parsed. The `replace_symbol` function attempts to replace the symbols `c("$", "%", "#", "@", "&", "w/")` with their word equivalents.
```{r}
x <- c("I am @ Jon's & Jim's w/ Marry",
"I owe $41 for food",
"two is 10% of a #"
)
replace_symbol(x)
```
### White Space
Regex white space characters (e.g., `\n`, `\t`, `\r`) matched by `\s` may impede analysis. These can be replaced with a single space `" "` via the `replace_white` function.
```{r}
x <- "I go \r
to the \tnext line"
x
cat(x)
replace_white(x)
```
### Tokens
Often an analysis requires converting tokens of a certain type into a common form or removing them entirely. The `mgsub` function can do this task, but it is regex based and time consuming when the number of tokens to replace is large. For example, one may want to replace all proper nouns that are first names with the word "name". The `replace_tokens` function provides a fast way to replace a group of tokens with a single replacement value.
This example shows a use case for `replace_tokens`:
```{r}
## Set Up the Tokens to Replace
nms <- gsub("(^.)(.*)", "\\U\\1\\L\\2", common_names, perl = TRUE)
head(nms)
## Set Up the Data
x <- split_portion(sample(c(sample(grady_augmented, 5000),
sample(nms, 10000, TRUE))), n.words = 12)
x$text.var <- paste0(x$text.var, sample(c('.', '!', '?'), length(x$text.var), TRUE))
head(x$text.var)
head(replace_tokens(x$text.var, nms, 'NAME'))
```
This demonstration shows how fast token replacement with `replace_tokens` can be compared to `mgsub`:
```{r}
tic <- Sys.time()
head(replace_tokens(x$text.var, nms, "<<NAME>>"))
(toc <- Sys.time() - tic)
tic <- Sys.time()
head(mgsub(x$text.var, nms, "<<NAME>>"))
(toc <- Sys.time() - tic)
```
```{r, include=FALSE}
tic <- Sys.time()
out <- replace_tokens(rep(x$text.var, 20), nms, "<<NAME>>")
(toc <- Sys.time() - tic)
```
Now let's amp it up with 20x more text data. That's `r f_comma(length(x$text.var) * 20)` rows of text (`r f_comma(sum(stringi::stri_count_words(x$text.var))*20)` words) and `r f_comma(length(nms))` tokens in `r round(toc, 2)` seconds.
```
tic <- Sys.time()
out <- replace_tokens(rep(x$text.var, 20), nms, "<<NAME>>")
(toc <- Sys.time() - tic)
```
```{r, echo=FALSE}
toc
```