
Commit

* replace_word_elongation added to replace word elongations (a.k.a. "word
  lengthening") with the most likely normalized word form. See
  http://www.aclweb.org/anthology/D11-1052 for details.

Related to trinker/sentimentr#66
trinker committed Dec 31, 2017
1 parent 5897de3 commit 0a00cff
Showing 11 changed files with 317 additions and 78 deletions.
2 changes: 1 addition & 1 deletion DESCRIPTION
@@ -10,7 +10,7 @@ Description: Tools to clean and process text. Tools are geared at checking for
easily handled by analysis algorithms. The 'replace_emoticon()' function replaces
emoticons with word equivalents.
Depends: R (>= 3.2.3)
Imports: english(>= 1.0-2), lexicon (>= 0.7.2), qdapRegex, stringi, textshape(>= 1.0.1), utils
Imports: data.table, english(>= 1.0-2), lexicon (>= 0.7.2), qdapRegex, stringi, textshape(>= 1.0.1), utils
Suggests: testthat
Date: 2017-12-30
License: GPL-2
1 change: 1 addition & 0 deletions NAMESPACE
@@ -43,6 +43,7 @@ export(replace_rating)
export(replace_symbol)
export(replace_tokens)
export(replace_white)
export(replace_word_elongation)
export(strip)
export(sub_holder)
export(swap)
4 changes: 4 additions & 0 deletions NEWS
@@ -33,6 +33,10 @@ NEW FEATURES
* `replace_internet_slang` added to replace Internet acronyms and abbreviations
with machine friendly word equivalents.

* `replace_word_elongation` added to replace word elongations (a.k.a. "word
lengthening") with the most likely normalized word form. See
  http://www.aclweb.org/anthology/D11-1052 for details.

* `fgsub` added for the ability to match, extract, operate a function over the
  extracted strings, & replace the original matches with the transformed strings.
  This provides functionality similar to `gsubfn::gsubfn` but is less powerful.
4 changes: 4 additions & 0 deletions NEWS.md
@@ -33,6 +33,10 @@ textclean 0.6.0 -
* `replace_internet_slang` added to replace Internet acronyms and abbreviations
with machine friendly word equivalents.

* `replace_word_elongation` added to replace word elongations (a.k.a. "word
lengthening") with the most likely normalized word form. See
  http://www.aclweb.org/anthology/D11-1052 for details.

* `fgsub` added for the ability to match, extract, operate a function over the
  extracted strings, & replace the original matches with the transformed strings.
  This provides functionality similar to `gsubfn::gsubfn` but is less powerful.
100 changes: 100 additions & 0 deletions R/replace_word_elongation.R
@@ -0,0 +1,100 @@
#' Replace Word Elongations
#'
#' In informal writing people may use a form of text embellishment to emphasize
#' or alter word meanings called elongation (a.k.a. "word lengthening"). For
#' example, the use of "Whyyyyy" conveys frustration. Other times the usage may
#' be meant to sound sexy (e.g., "Heyyyy there"). Still other times it may be
#' used for emphasis (e.g., "This is so gooood"). This function uses an
#' augmented form of Brody & Diakopoulos's (2011) algorithm. The algorithm
#' first attempts to replace the elongation with known semantic replacements
#' (optional; default is \code{FALSE}). After this the algorithm locates all
#' places where the same letter (case insensitive) appears 3 or more times
#' consecutively. These elements are then further processed. The matches are
#' replaced via \code{fgsub} by first reducing the elongation to its canonical
#' form (collapsing every run of repeated letters to a single letter) and then
#' replacing it with the most common word in Google's 2008 ngram data set that
#' shares that canonical form. If the canonical form is not found in the Google
#' data set then the canonical form itself is used as the replacement.
#'
#' @param x The text variable.
#' @param impart.meaning logical. If \code{TRUE}, known elongation semantics
#' are used as replacements (see \code{textclean:::meaning_elongations} for
#' known elongation semantics and replacements).
#' @param \ldots ignored.
#' @return Returns a vector with word elongations replaced.
#' @references
#' Brody, S., & Diakopoulos, N. (2011). Cooooooooooooooollllllllllllll!!!!!!!!!!!!!!:
#' Using word lengthening to detect sentiment in microblogs. Proceedings of the
#' 2011 Conference on Empirical Methods in Natural Language Processing (pp.
#' 562-570). Edinburgh, Scotland. Retrieved from
#' http://www.aclweb.org/anthology/D11-1052 \cr \cr
#' \url{http://storage.googleapis.com/books/ngrams/books/datasetsv2.html} \cr \cr
#' \url{https://www.theatlantic.com/magazine/archive/2013/03/dragging-it-out/309220} \cr \cr
#' \url{https://english.stackexchange.com/questions/189517/is-there-a-name-term-for-multiplied-vowels}
#' @export
#' @examples
#' x <- c('look', 'noooooo!', 'real coooool!', "it's sooo goooood", 'fsdfds',
#' 'fdddf', 'as', "aaaahahahahaha", "aabbccxccbbaa", 'I said heyyy!',
#' "I'm liiiike whyyyyy me?", "Wwwhhatttt!", NA)
#'
#' replace_word_elongation(x)
#' replace_word_elongation(x, impart.meaning = TRUE)
replace_word_elongation <- function(x, impart.meaning = FALSE, ...){

## replace with meaningful
if (isTRUE(impart.meaning)){
x <- mgsub(x, meaning_elongations[['x']], meaning_elongations[['y']],
fixed = FALSE, perl = TRUE, ignore.case = TRUE)
}

## consider only groupings with a triple letter
locs <- stringi::stri_detect_regex(x, elongation_search_pattern,
opts_regex = list(case_insensitive = TRUE))

locs[is.na(locs)] <- FALSE

if (sum(locs) == 0) return(x)

txt <- x[locs]
canonical <- data.table::data.table(canonical)

## replace triple-letter words with the most common form or else the canonical form
x[locs] <- .fgsub(txt, elongation_pattern, function(x, can = canonical){

y <- gsub("([a-z])(\\1+)", '\\1', tolower(x), perl = TRUE)

z <- data.table::data.table(canonical = y)
out <- merge(z, can, by = 'canonical')$word

if (length(out) == 0 || is.na(out)) out <- y
out

})

x

}



# Known with meaning
b2 <- "(?<=^|[^A-Za-z'-])(%s)(?=$|[^A-Za-z'-])"
meaning_elongations <- data.frame(
x = sprintf(b2, c('hey{2,}', 'fi{3,}ne', 'no{3,}', 'sor{3,}y|sory{2,}|sor{3,}y{2,}',
'thanks{2,}', 'tha{2,}nks', 'ri{3,}ght', 'why{3,}', 'real{2,}y'
)),
y = c('hey sexy', 'not fine', 'sarcastic', 'not sorry',
'not thankful', 'very thankful', 'not correct', 'frustration', 'surprised'
),
stringsAsFactors = FALSE
)

elongation_search_pattern <- "(?i)([a-z])(\\1{2,})"
elongation_pattern <- "\\b\\w*([a-z])(\\1{2,})\\w*\\b"
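
As a side note, the canonicalization step described in the roxygen block above can be sketched in base R. The `canonicalize` helper below is illustrative only (it is not part of this commit) and mirrors the `gsub` call inside the replacement function:

```r
## Collapse every run of 2+ identical letters to a single letter, yielding
## the "canonical form" that the ngram lookup table is keyed on
canonicalize <- function(x) {
  gsub("([a-z])(\\1+)", "\\1", tolower(x), perl = TRUE)
}

canonicalize("noooooo")     # "no"
canonicalize("Wwwhhatttt")  # "what"
canonicalize("coooool")     # "col" -- the lookup table then maps canonical
                            # forms like this back to a common full word
```

Note that the canonical form need not itself be a word ("coooool" collapses to "col", not "cool"), which is exactly why the merge against the Google ngram table is needed.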







Binary file modified R/sysdata.rda
Binary file not shown.
13 changes: 13 additions & 0 deletions R/utils.R
@@ -119,7 +119,20 @@ to_byte <- function(x){
}


.fgsub <- function(x, pattern, fun, ...){

## extract the unique pattern matches and transform each with `fun`; matches
## are first swapped for unique placeholder tokens so that the final
## replacements cannot collide with one another
hits <- stringi::stri_extract_all_regex(x, pattern)
pats <- unique(unlist(hits))
reps <- paste0('textcleanholder', seq_along(pats), 'textcleanholder')
freps <- unlist(lapply(pats, fun))

x <- mgsub(x, pats, reps)

mgsub(x, reps, freps)

}
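
The placeholder strategy used here (extract, transform, then swap through unique tokens so later replacements cannot clobber earlier ones) can be illustrated with a self-contained base-R sketch. `fgsub_sketch` below is a hypothetical stand-in that avoids the package-internal `mgsub`:

```r
fgsub_sketch <- function(x, pattern, fun) {
  ## extract every regex match and keep the unique ones
  hits <- regmatches(x, gregexpr(pattern, x, perl = TRUE))
  pats <- unique(unlist(hits))
  if (length(pats) == 0) return(x)

  ## unique placeholder tokens guard against replacement collisions
  reps <- paste0("textcleanholder", seq_along(pats), "textcleanholder")
  freps <- vapply(pats, fun, character(1))

  for (i in seq_along(pats)) x <- gsub(pats[i], reps[i], x, fixed = TRUE)
  for (i in seq_along(pats)) x <- gsub(reps[i], freps[i], x, fixed = TRUE)
  x
}

fgsub_sketch("I said heyyy!", "[a-z]*([a-z])\\1{2,}[a-z]*", toupper)
# "I said HEYYY!"
```

The two-pass swap is the point: replacing matches directly with `fun`'s output could re-trigger matches on already-replaced text, while the placeholder tokens are inert with respect to the pattern.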



54 changes: 37 additions & 17 deletions README.Rmd
@@ -45,8 +45,8 @@ The main functions, task category, & descriptions are summarized in the table be
| `drop_NA` | filter rows | Remove `NA` text rows |
| `drop_element`/`keep_element` | filter elements | Filter matching elements from a vector |
| `replace_contractions` | replacement | Replace contractions with both words |
| `replace_emoji` | replacement | Replace emojis with word equivalent or unique identifier |
| `replace_emoticon` | replacement | Replace emoticons with word equivalent |
| `replace_emoji` | replacement | Replace emojis with word equivalent or unique identifier |
| `replace_grade` | replacement | Replace grades (e.g., "A+") with word equivalent |
| `replace_html` | replacement | Replace HTML tags and symbols |
| `replace_incomplete` | replacement | Replace incomplete sentence end-marks |
@@ -59,7 +59,8 @@ The main functions, task category, & descriptions are summarized in the table be
| `replace_rating` | replacement | Replace ratings (e.g., "10 out of 10", "3 stars") with word equivalent |
| `replace_symbol` | replacement | Replace common symbols |
| `replace_white` | replacement | Replace regex white space characters |
| `replace_token` | replacement | Remove or replace a vector of tokens with a single value |
| `replace_token` | replacement | Remove or replace a vector of tokens with a single value |
| `replace_word_elongation` | replacement | Replace word elongations with shortened form |
| `add_comma_space` | replacement | Replace non-space after comma |
| `add_missing_endmark` | replacement | Replace missing endmarks with desired symbol |
| `make_plural` | replacement | Add plural endings to singular noun forms |
@@ -230,6 +231,23 @@ x <- c("Mr. Jones isn't going.",
replace_contraction(x)
```


### Emojis

Similar to emoticons, emoji tokens may be ignored if they are not in a computer readable form. `replace_emoji` replaces emojis with their word-form equivalents.

```{r}
x <- read.delim(system.file("docs/r_tweets.txt", package = "textclean"),
stringsAsFactors = FALSE)[[3]][1:3]
x
```

```{r}
replace_emoji(x)
```


### Emoticons

Some analysis techniques examine words, meaning emoticons may be ignored. `replace_emoticon` replaces emoticons with their word-form equivalents.
@@ -251,21 +269,6 @@ x <- c(
replace_emoticon(x)
```

### Emojis

Similar to emoticons, emoji tokens may be ignored if they are not in a computer readable form. `replace_emoji` replaces emojis with their word-form equivalents.

```{r}
x <- read.delim(system.file("docs/r_tweets.txt", package = "textclean"),
stringsAsFactors = FALSE)[[3]][1:3]
x
```

```{r}
replace_emoji(x)
```


### Grades

@@ -352,6 +355,7 @@ x <- c(
replace_names(x)
replace_names(x, replacement = '<<NAME>>')
```

### Non-ASCII Characters

R can choke on non-ASCII characters. They can be re-encoded, but the new encoding may lack interpretability (e.g., ¢ may be converted to `\xA2`, which is not easily understood or likely to be matched in a hash lookup). `replace_non_ascii` attempts to replace common non-ASCII characters with a text representation (e.g., ¢ becomes "cent"). Unrecognized non-ASCII characters are simply removed (unless `remove.nonconverted = FALSE`).
@@ -447,6 +451,22 @@ cat(x)
replace_white(x)
```

### Word Elongation

In informal writing people may use a form of text embellishment to emphasize or alter word meanings called elongation (a.k.a. "word lengthening"). For example, the use of "Whyyyyy" conveys frustration. Other times the usage may be meant to sound sexy (e.g., "Heyyyy there"). Still other times it may be used for emphasis (e.g., "This is so gooood").

The `replace_word_elongation` function replaces these un-normalized forms with the most likely normalized form. The `impart.meaning` argument can replace a short list of known elongations with semantic replacements.

```{r}
x <- c('look', 'noooooo!', 'real coooool!', "it's sooo goooood", 'fsdfds',
'fdddf', 'as', "aaaahahahahaha", "aabbccxccbbaa", 'I said heyyy!',
"I'm liiiike whyyyyy me?", "Wwwhhatttt!", NA)
replace_word_elongation(x)
replace_word_elongation(x, impart.meaning = TRUE)
```



### Tokens

