
Commit

* replace_word_elongation added to replace word elongations (a.k.a. "word
  lengthening") with the most likely normalized word form. See
  http://www.aclweb.org/anthology/D11-1052 for details.

Related to trinker/sentimentr#66
trinker committed Dec 31, 2017
1 parent 5897de3 commit 0a00cff
Showing 11 changed files with 317 additions and 78 deletions.
2 changes: 1 addition & 1 deletion DESCRIPTION
@@ -10,7 +10,7 @@ Description: Tools to clean and process text. Tools are geared at checking for
easily handled by analysis algorithms. The 'replace_emoticon()' function replaces
emoticons with word equivalents.
Depends: R (>= 3.2.3)
Imports: english(>= 1.0-2), lexicon (>= 0.7.2), qdapRegex, stringi, textshape(>= 1.0.1), utils
Imports: data.table, english(>= 1.0-2), lexicon (>= 0.7.2), qdapRegex, stringi, textshape(>= 1.0.1), utils
Suggests: testthat
Date: 2017-12-30
License: GPL-2
1 change: 1 addition & 0 deletions NAMESPACE
@@ -43,6 +43,7 @@ export(replace_rating)
export(replace_symbol)
export(replace_tokens)
export(replace_white)
export(replace_word_elongation)
export(strip)
export(sub_holder)
export(swap)
4 changes: 4 additions & 0 deletions NEWS
@@ -33,6 +33,10 @@ NEW FEATURES
* `replace_internet_slang` added to replace Internet acronyms and abbreviations
with machine friendly word equivalents.

* `replace_word_elongation` added to replace word elongations (a.k.a. "word
lengthening") with the most likely normalized word form. See
  http://www.aclweb.org/anthology/D11-1052 for details.

* `fgsub` added for the ability to match, extract, operate a function over the
  extracted strings, & replace the original matches with the transformed strings.
  This provides functionality similar to `gsubfn::gsubfn` but is less powerful.
4 changes: 4 additions & 0 deletions NEWS.md
@@ -33,6 +33,10 @@ textclean 0.6.0 -
* `replace_internet_slang` added to replace Internet acronyms and abbreviations
with machine friendly word equivalents.

* `replace_word_elongation` added to replace word elongations (a.k.a. "word
lengthening") with the most likely normalized word form. See
  http://www.aclweb.org/anthology/D11-1052 for details.

* `fgsub` added for the ability to match, extract, operate a function over the
  extracted strings, & replace the original matches with the transformed strings.
  This provides functionality similar to `gsubfn::gsubfn` but is less powerful.
100 changes: 100 additions & 0 deletions R/replace_word_elongation.R
@@ -0,0 +1,100 @@
#' Replace Word Elongations
#'
#' In informal writing people may use a form of text embellishment to emphasize
#' or alter word meanings called elongation (a.k.a. "word lengthening"). For
#' example, the use of "Whyyyyy" conveys frustration. Other times the usage may
#' be meant to sound sexy (e.g., "Heyyyy there"). Still other times it may be
#' used for emphasis (e.g., "This is so gooood"). This function uses an
#' augmented form of Brody & Diakopoulos's (2011) algorithm. The algorithm
#' first attempts to replace the elongation with known semantic replacements
#' (optional; default is \code{FALSE}). After this the algorithm locates all
#' places where the same letter (case insensitive) appears 3 or more times
#' consecutively. These elements are then further processed. The matches are
#' replaced via \code{fgsub} by first reducing the elongation to its canonical
#' form (collapsing every run of repeated letters to a single letter) and then
#' replacing it with the most common word in Google's 2008 ngram data set that
#' shares that canonical form. If the canonical form is not found in the Google
#' data set then the canonical form itself is used as the replacement.
#'
#' @param x The text variable.
#' @param impart.meaning logical. If \code{TRUE}, known elongation semantics
#' are used as replacements (see \code{textclean:::meaning_elongations} for
#' known elongation semantics and replacements).
#' @param \ldots ignored.
#' @return Returns a vector with word elongations replaced.
#' @references
#' Brody, S., & Diakopoulos, N. (2011). Cooooooooooooooollllllllllllll!!!!!!!!!!!!!!:
#' Using word lengthening to detect sentiment in microblogs. Proceedings of the
#' 2011 Conference on Empirical Methods in Natural Language Processing (pp.
#' 562-570). Edinburgh, Scotland. Retrieved from
#' http://www.aclweb.org/anthology/D11-1052 \cr \cr
#' \url{http://storage.googleapis.com/books/ngrams/books/datasetsv2.html} \cr \cr
#' \url{https://www.theatlantic.com/magazine/archive/2013/03/dragging-it-out/309220} \cr \cr
#' \url{https://english.stackexchange.com/questions/189517/is-there-a-name-term-for-multiplied-vowels}
#' @export
#' @examples
#' x <- c('look', 'noooooo!', 'real coooool!', "it's sooo goooood", 'fsdfds',
#' 'fdddf', 'as', "aaaahahahahaha", "aabbccxccbbaa", 'I said heyyy!',
#' "I'm liiiike whyyyyy me?", "Wwwhhatttt!", NA)
#'
#' replace_word_elongation(x)
#' replace_word_elongation(x, impart.meaning = TRUE)
replace_word_elongation <- function(x, impart.meaning = FALSE, ...){

## replace with meaningful
if (isTRUE(impart.meaning)){
x <- mgsub(x, meaning_elongations[['x']], meaning_elongations[['y']],
fixed = FALSE, perl = TRUE, ignore.case = TRUE)
}

## consider only groupings with a triple letter
locs <- stringi::stri_detect_regex(x, elongation_search_pattern,
opts_regex = list(case_insensitive = TRUE))

locs[is.na(locs)] <- FALSE

if (sum(locs) == 0) return(x)

txt <- x[locs]
canonical <- data.table::data.table(canonical)

## replace triple-letter words with the most common form or else the canonical form
x[locs] <- .fgsub(txt, elongation_pattern, function(x, can = canonical){

y <- gsub("([a-z])(\\1+)", '\\1', tolower(x), perl = TRUE)

z <- data.table::data.table(canonical = y)
out <- merge(z, can, by = 'canonical')$word

if (length(out) == 0 || is.na(out)) out <- y
out

})

x

}



# Known with meaning
b2 <- "(?<=^|[^A-Za-z'-])(%s)(?=$|[^A-Za-z'-])"
meaning_elongations <- data.frame(
x = sprintf(b2, c('hey{2,}', 'fi{3,}ne', 'no{3,}', 'sor{3,}y|sory{2,}|sor{3,}y{2,}',
'thanks{2,}', 'tha{2,}nks', 'ri{3,}ght', 'why{3,}', 'real{2,}y'
)),
y = c('hey sexy', 'not fine', 'sarcastic', 'not sorry',
'not thankful', 'very thankful', 'not correct', 'frustration', 'surprised'
),
stringsAsFactors = FALSE
)

elongation_search_pattern <- "(?i)([a-z])(\\1{2,})"
elongation_pattern <- "\\b\\w*([a-z])(\\1{2,})\\w*\\b"
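
As a side note, the canonicalization step described in the roxygen block above can be sketched in base R. The `canonicalize` helper below is illustrative only (it is not part of this commit) and mirrors the `gsub` call inside the replacement function:

```r
## Collapse every run of 2+ identical letters to a single letter, yielding
## the "canonical form" that the ngram lookup table is keyed on
canonicalize <- function(x) {
  gsub("([a-z])(\\1+)", "\\1", tolower(x), perl = TRUE)
}

canonicalize("noooooo")     # "no"
canonicalize("Wwwhhatttt")  # "what"
canonicalize("coooool")     # "col" -- the lookup table then maps canonical
                            # forms like this back to a common full word
```

Note that the canonical form need not itself be a word ("coooool" collapses to "col", not "cool"), which is exactly why the merge against the Google ngram table is needed.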







Binary file modified R/sysdata.rda
Binary file not shown.
13 changes: 13 additions & 0 deletions R/utils.R
@@ -119,7 +119,20 @@ to_byte <- function(x){
}


.fgsub <- function(x, pattern, fun, ...){

## extract the unique pattern matches and transform each with `fun`; matches
## are first swapped for unique placeholder tokens so that the final
## replacements cannot collide with one another
hits <- stringi::stri_extract_all_regex(x, pattern)
pats <- unique(unlist(hits))
reps <- paste0('textcleanholder', seq_along(pats), 'textcleanholder')
freps <- unlist(lapply(pats, fun))

x <- mgsub(x, pats, reps)

mgsub(x, reps, freps)

}
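
The placeholder strategy used here (extract, transform, then swap through unique tokens so later replacements cannot clobber earlier ones) can be illustrated with a self-contained base-R sketch. `fgsub_sketch` below is a hypothetical stand-in that avoids the package-internal `mgsub`:

```r
fgsub_sketch <- function(x, pattern, fun) {
  ## extract every regex match and keep the unique ones
  hits <- regmatches(x, gregexpr(pattern, x, perl = TRUE))
  pats <- unique(unlist(hits))
  if (length(pats) == 0) return(x)

  ## unique placeholder tokens guard against replacement collisions
  reps <- paste0("textcleanholder", seq_along(pats), "textcleanholder")
  freps <- vapply(pats, fun, character(1))

  for (i in seq_along(pats)) x <- gsub(pats[i], reps[i], x, fixed = TRUE)
  for (i in seq_along(pats)) x <- gsub(reps[i], freps[i], x, fixed = TRUE)
  x
}

fgsub_sketch("I said heyyy!", "[a-z]*([a-z])\\1{2,}[a-z]*", toupper)
# "I said HEYYY!"
```

The two-pass swap is the point: replacing matches directly with `fun`'s output could re-trigger matches on already-replaced text, while the placeholder tokens are inert with respect to the pattern.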



54 changes: 37 additions & 17 deletions README.Rmd
@@ -45,8 +45,8 @@ The main functions, task category, & descriptions are summarized in the table be
| `drop_NA` | filter rows | Remove `NA` text rows |
| `drop_element`/`keep_element` | filter elements | Filter matching elements from a vector |
| `replace_contractions` | replacement | Replace contractions with both words |
| `replace_emoji` | replacement | Replace emojis with word equivalent or unique identifier |
| `replace_emoticon` | replacement | Replace emoticons with word equivalent |
| `replace_emoji` | replacement | Replace emojis with word equivalent or unique identifier |
| `replace_grade` | replacement | Replace grades (e.g., "A+") with word equivalent |
| `replace_html` | replacement | Replace HTML tags and symbols |
| `replace_incomplete` | replacement | Replace incomplete sentence end-marks |
@@ -59,7 +59,8 @@ The main functions, task category, & descriptions are summarized in the table be
| `replace_rating` | replacement | Replace ratings (e.g., "10 out of 10", "3 stars") with word equivalent |
| `replace_symbol` | replacement | Replace common symbols |
| `replace_white` | replacement | Replace regex white space characters |
| `replace_token` | replacement | Remove or replace a vector of tokens with a single value |
| `replace_token` | replacement | Remove or replace a vector of tokens with a single value |
| `replace_word_elongation` | replacement | Replace word elongations with shortened form |
| `add_comma_space` | replacement | Replace non-space after comma |
| `add_missing_endmark` | replacement | Replace missing endmarks with desired symbol |
| `make_plural` | replacement | Add plural endings to singular noun forms |
@@ -230,6 +231,23 @@ x <- c("Mr. Jones isn't going.",
replace_contraction(x)
```


### Emojis

Similar to emoticons, emoji tokens may be ignored if they are not in a computer readable form. `replace_emoji` replaces emojis with their word-form equivalents.

```{r}
x <- read.delim(system.file("docs/r_tweets.txt", package = "textclean"),
stringsAsFactors = FALSE)[[3]][1:3]
x
```

```{r}
replace_emoji(x)
```


### Emoticons

Some analysis techniques examine words, meaning emoticons may be ignored. `replace_emoticon` replaces emoticons with their word-form equivalents.
@@ -251,21 +269,6 @@ x <- c(
replace_emoticon(x)
```

### Emojis

Similar to emoticons, emoji tokens may be ignored if they are not in a computer readable form. `replace_emoji` replaces emojis with their word-form equivalents.

```{r}
x <- read.delim(system.file("docs/r_tweets.txt", package = "textclean"),
stringsAsFactors = FALSE)[[3]][1:3]
x
```

```{r}
replace_emoji(x)
```


### Grades

@@ -352,6 +355,7 @@ x <- c(
replace_names(x)
replace_names(x, replacement = '<<NAME>>')
```

### Non-ASCII Characters

R can choke on non-ASCII characters. They can be re-encoded, but the new encoding may lack interpretability (e.g., ¢ may be converted to `\xA2`, which is not easily understood or likely to be matched in a hash lookup). `replace_non_ascii` attempts to replace common non-ASCII characters with a text representation (e.g., ¢ becomes "cent"). Unrecognized non-ASCII characters are simply removed (unless `remove.nonconverted = FALSE`).
@@ -447,6 +451,22 @@ cat(x)
replace_white(x)
```

### Word Elongation

In informal writing people may use a form of text embellishment to emphasize or alter word meanings called elongation (a.k.a. "word lengthening"). For example, the use of "Whyyyyy" conveys frustration. Other times the usage may be meant to sound sexy (e.g., "Heyyyy there"). Still other times it may be used for emphasis (e.g., "This is so gooood").

The `replace_word_elongation` function replaces these un-normalized forms with the most likely normalized form. The `impart.meaning` argument can replace a short list of known elongations with semantic replacements.

```{r}
x <- c('look', 'noooooo!', 'real coooool!', "it's sooo goooood", 'fsdfds',
'fdddf', 'as', "aaaahahahahaha", "aabbccxccbbaa", 'I said heyyy!',
"I'm liiiike whyyyyy me?", "Wwwhhatttt!", NA)
replace_word_elongation(x)
replace_word_elongation(x, impart.meaning = TRUE)
```



### Tokens

