
Change to post processing #1801

Closed · wants to merge 5 commits

Conversation

@koheiw (Collaborator) commented Dec 13, 2019

  • remove preserve_special
  • add special handling for hyphens and twitter using tokens_compound()

This is mainly for #1503, but can also address #1446

@koheiw (Collaborator, Author) commented Dec 13, 2019

It is failing tests due to #1477. remove_hyphens needs to become split_hyphens in order to get the same result from the different tokens functions and tokenization modes (word/fasterword/fastestword).
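A minimal sketch of that idea (editor's illustration, not code from this PR), assuming the tokens_split() function available in recent quanteda versions: if the hyphen is split off as a post-tokenization step, the result should no longer depend on which tokenizer ("word", "fasterword", etc.) produced the tokens.

    library("quanteda")

    txt <- "Pre- and post-processing"

    # split hyphenated words after tokenization, keeping "-" as its own token,
    # so the outcome does not depend on the tokenization mode used upstream
    tokens(txt, what = "word") %>%
      tokens_split(separator = "-", valuetype = "fixed", remove_separator = FALSE)

    tokens(txt, what = "fasterword") %>%
      tokens_split(separator = "-", valuetype = "fixed", remove_separator = FALSE)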

@koheiw mentioned this pull request Dec 13, 2019
@koheiw requested a review from kbenoit December 13, 2019 10:12
Review comment on R/tokens.R (outdated, resolved)
@kbenoit (Collaborator) commented Dec 15, 2019

It's failing a number of tests now because of changes in how hyphens are processed. In the former behaviour, remove_hyphens = TRUE with remove_punct = FALSE both split a hyphenated word and kept the hyphen; in the PR, the inner hyphen is removed.

So this passed before the PR, but fails in the PR (from test-tokens.R:381):

    expect_equal(as.character(tokens(txt, remove_hyphens = TRUE, remove_punct = FALSE)[[1]]),
                 c("a", "b", "-", "c", "d", ".", "!"))

For consistency, I think we should keep the older behaviour, and consider making the change later as a policy decision once we work out the full consistency of all of the token processing rules, across different tokenization "engines".

@koheiw (Collaborator, Author) commented Dec 15, 2019

My plan is to call the same internal function from tokens.tokens() and tokens.corpus() to make their behavior strictly the same. Let's decide what the desired behavior is, or leave this branch until we are clear about this. I think v2.0 is the best time to switch to the new behavior, although it would not be too different.

My suggestion is adding

  • separate_hyphens
  • separate_twitter

and using remove_punct to remove separated hyphens or Twitter tags.

The current remove_hyphens = TRUE will be the same as separate_hyphens = TRUE with remove_punct = TRUE. The difference is that it becomes possible to keep the hyphen as a separate token "-" when separate_hyphens = TRUE and remove_punct = FALSE. The same goes for remove_twitter.

expect_equal(as.character(tokens(txt, separate_hyphens = TRUE, remove_punct = FALSE)[[1]]),
                 c("a", "b", "-", "c", "d", ".", "!"))

@koheiw (Collaborator, Author) commented Dec 15, 2019

The new argument could be compound_hyphens or join_hyphens, to make it clear that users can do it manually using tokens_compound().

@kbenoit (Collaborator) commented Dec 15, 2019

Not foolproof though...

> tokens("Pre- and post-processing") %>%
+     tokens_compound("-", window = 1)
tokens from 1 document.
text1 :
[1] "Pre_-_and"        "post-processing"

@koheiw (Collaborator, Author) commented Dec 15, 2019

Good point 🤔

@kbenoit (Collaborator) commented Dec 15, 2019

Other packages do this differently, although I would rather match our previous behaviour than adopt this rule (which I don't think makes sense):

> tm::removePunctuation("I can't zig-zag.")
[1] "I cant zigzag"
> tm::removePunctuation("I can't zig-zag.", preserve_intra_word_contractions = TRUE)
[1] "I can't zigzag"
> tm::removePunctuation("I can't zig-zag.", preserve_intra_word_dashes = TRUE)
[1] "I cant zig-zag"

Overall, I think this would be a good case for

  • adding a new argument called separate_hyphens = FALSE which turns "zig-zag" into "zig", "-", "zag", just as remove_hyphens = TRUE does now;
  • deprecating remove_hyphens;
  • not automatically splitting on the first tokenization step, since it's probably faster to split post-tokenization than to concatenate post-tokenization, and it avoids the problem above (see the sketch below).
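As a rough sketch of the split-in-post direction (editor's illustration, not part of the PR), assuming tokens_split() is available: splitting after tokenization only ever divides existing tokens, so it cannot join the wrong neighbours the way tokens_compound(..., window = 1) did on "Pre- and post-processing" above.

    library("quanteda")

    # splitting in post keeps "Pre-" and "and" as separate tokens, avoiding
    # the "Pre_-_and" compound produced by the window-based approach above
    tokens("Pre- and post-processing") %>%
      tokens_split(separator = "-", valuetype = "fixed", remove_separator = FALSE)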

Let's keep the existing functionality now for Twitter since this is a very regular case of always having one of two special characters at the beginning of the token, and compounding will always work for this.

Can we implement the new processing workflow for v2 and still match this behaviour?

library("quanteda")
## Package version: 1.5.2

txt <- "Ex-post, @quantedainit #rstats."

tokens(txt, remove_punct = TRUE, remove_twitter = TRUE, remove_hyphens = TRUE)
## tokens from 1 document.
## text1 :
## [1] "Ex"           "post"         "quantedainit" "rstats"
tokens(txt, remove_punct = TRUE, remove_twitter = TRUE, remove_hyphens = FALSE)
## tokens from 1 document.
## text1 :
## [1] "Ex-post"      "quantedainit" "rstats"
tokens(txt, remove_punct = TRUE, remove_twitter = FALSE, remove_hyphens = TRUE)
## tokens from 1 document.
## text1 :
## [1] "Ex"            "post"          "@quantedainit" "#rstats"
tokens(txt, remove_punct = TRUE, remove_twitter = FALSE, remove_hyphens = FALSE)
## tokens from 1 document.
## text1 :
## [1] "Ex-post"       "@quantedainit" "#rstats"
tokens(txt, remove_punct = FALSE, remove_twitter = TRUE, remove_hyphens = TRUE)
## Warning in tokens_internal(texts(x), ...): remove_twitter reset to FALSE when
## remove_punct = FALSE
## tokens from 1 document.
## text1 :
## [1] "Ex"            "-"             "post"          ","            
## [5] "@quantedainit" "#rstats"       "."
tokens(txt, remove_punct = FALSE, remove_twitter = TRUE, remove_hyphens = FALSE)
## Warning in tokens_internal(texts(x), ...): remove_twitter reset to FALSE when
## remove_punct = FALSE
## tokens from 1 document.
## text1 :
## [1] "Ex-post"       ","             "@quantedainit" "#rstats"      
## [5] "."
tokens(txt, remove_punct = FALSE, remove_twitter = FALSE, remove_hyphens = TRUE)
## tokens from 1 document.
## text1 :
## [1] "Ex"            "-"             "post"          ","            
## [5] "@quantedainit" "#rstats"       "."
tokens(txt, remove_punct = FALSE, remove_twitter = FALSE, remove_hyphens = FALSE)
## tokens from 1 document.
## text1 :
## [1] "Ex-post"       ","             "@quantedainit" "#rstats"      
## [5] "."

@koheiw (Collaborator, Author) commented Dec 15, 2019

In tokens.corpus(), we are fine in this way:

> stri_split_boundaries("Pre- and post-processing", type = "word", skip_word_none = FALSE) %>% 
+   as.tokens() %>% 
+   tokens_remove("^[\\p{Z}\\p{C}]+$", valuetype = "regex", padding = TRUE) %>% 
+   tokens_compound("-", window = 1, concatenator = "")
tokens from 1 document.
text1 :
[1] "Pre-"            ""                "and"             ""                "post-processing"

but I'm not sure about tokens.tokens()...

@kbenoit (Collaborator) commented Dec 15, 2019

Ah right, we actually split then preserve now. So we already get the above case wrong, unless we use the whitespace splitter, in which case the behaviour is inconsistent.

> tokens("Pre- and post-processing", remove_hyphens = FALSE)
tokens from 1 document.
text1 :
[1] "Pre"             "-"               "and"             "post-processing"

> tokens("Pre- and post-processing", remove_hyphens = FALSE, what = "fasterword")
tokens from 1 document.
text1 :
[1] "Pre-"            "and"             "post-processing"

For total consistency we would not use any of the stri_split_boundaries() options at all, but just split whitespace, and then do the operations in post, but this might limit our applicability to non-English or at least non-Western languages.

I'm not sure of the solution, but we should be maximising flexibility to choose different tokenisation engines, while also ensuring consistency of our remove_* options in applying to them after splitting.

Worth a discussion probably, or solve it in January.

@koheiw (Collaborator, Author) commented Dec 15, 2019

We can get closer.

require(stringi)
#> Loading required package: stringi
require(quanteda)
#> Loading required package: quanteda
#> Package version: 1.9.9004
#> Parallel computing: 2 of 4 threads used.
#> See https://quanteda.io for tutorials and examples.
#> 
#> Attaching package: 'quanteda'
#> The following object is masked from 'package:utils':
#> 
#>     View

txt <- "Ex-post, @quantedainit #rstats."

tokens2 <- function(x, remove_punct, remove_twitter, remove_hyphens) {
  result <- as.tokens(stri_split_boundaries(x, type = "word", skip_word_none = FALSE))
  result <- tokens_remove(result, "^[\\p{Z}\\p{C}]+$", valuetype = "regex", padding = TRUE)
  if (!remove_twitter)
    result <- tokens_compound(result, "^[#@]$", window = c(0, 1), valuetype = "regex", concatenator = "")
  if (!remove_hyphens)
    result <- tokens_compound(result, "-", window = 1, valuetype = "regex", concatenator = "")
  if (remove_punct)
    result <- tokens_remove(result, "^[\\p{P}]+$", valuetype = "regex")
  return(result)
}
tokens2(txt, remove_punct = TRUE, remove_twitter = TRUE, remove_hyphens = TRUE)
#> tokens from 1 document.
#> text1 :
#> [1] "Ex"           "post"         ""             "quantedainit" ""            
#> [6] "rstats"
tokens2(txt, remove_punct = TRUE, remove_twitter = FALSE, remove_hyphens = TRUE)
#> tokens from 1 document.
#> text1 :
#> [1] "Ex"            "post"          ""              "@quantedainit"
#> [5] ""              "#rstats"
tokens2(txt, remove_punct = TRUE, remove_twitter = FALSE, remove_hyphens = FALSE)
#> tokens from 1 document.
#> text1 :
#> [1] "Ex-post"       ""              "@quantedainit" ""             
#> [5] "#rstats"
tokens2(txt, remove_punct = FALSE, remove_twitter = TRUE, remove_hyphens = TRUE)
#> tokens from 1 document.
#> text1 :
#>  [1] "Ex"           "-"            "post"         ","            ""            
#>  [6] "@"            "quantedainit" ""             "#"            "rstats"      
#> [11] "."
tokens2(txt, remove_punct = FALSE, remove_twitter = FALSE, remove_hyphens = TRUE)
#> tokens from 1 document.
#> text1 :
#> [1] "Ex"            "-"             "post"          ","            
#> [5] ""              "@quantedainit" ""              "#rstats"      
#> [9] "."
tokens2(txt, remove_punct = FALSE, remove_twitter = FALSE, remove_hyphens = FALSE)
#> tokens from 1 document.
#> text1 :
#> [1] "Ex-post"       ","             ""              "@quantedainit"
#> [5] ""              "#rstats"       "."

Created on 2019-12-15 by the reprex package (v0.3.0)

@kbenoit added this to the v2.0 essentials milestone Jan 2, 2020
@kbenoit added this to To do in Innsbruck work plan Jan 11, 2020
Merge branch 'master' into post-processing

# Conflicts:
#	NEWS.md
#	R/tokens.R