
Change to post processing #1801

Closed · wants to merge 5 commits

Conversation

@koheiw (Collaborator) commented Dec 13, 2019

  • remove preserve_special
  • add special handling for hyphens and twitter using tokens_compound()

This is mainly for #1503, but can also address #1446

@koheiw (Collaborator, Author) commented Dec 13, 2019

It is failing tests due to #1477. remove_hyphens needs to become split_hyphens in order to get the same result from the different tokens functions and tokenization modes (word/fasterword/fastestword).
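A minimal sketch of that idea (editor's illustration, not code from this PR), assuming the tokens_split() function available in recent quanteda versions: if the hyphen is split off as a post-tokenization step, the result should no longer depend on which tokenizer ("word", "fasterword", etc.) produced the tokens.

    library("quanteda")

    txt <- "Pre- and post-processing"

    # split hyphenated words after tokenization, keeping "-" as its own token,
    # so the outcome does not depend on the tokenization mode used upstream
    tokens(txt, what = "word") %>%
      tokens_split(separator = "-", valuetype = "fixed", remove_separator = FALSE)

    tokens(txt, what = "fasterword") %>%
      tokens_split(separator = "-", valuetype = "fixed", remove_separator = FALSE)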

@koheiw mentioned this pull request Dec 13, 2019
@koheiw requested a review from kbenoit December 13, 2019 10:12
Review comment on R/tokens.R (outdated, resolved)
@kbenoit (Collaborator) commented Dec 15, 2019

It's failing a number of tests now because of changes in how hyphens are processed. In the former behaviour, remove_hyphens = TRUE with remove_punct = FALSE both split a hyphenated word and kept the hyphen; in the PR, the inner hyphen is removed.

So this passed before the PR, but fails in the PR (from test-tokens.R:381):

    expect_equal(as.character(tokens(txt, remove_hyphens = TRUE, remove_punct = FALSE)[[1]]),
                 c("a", "b", "-", "c", "d", ".", "!"))

For consistency, I think we should keep the older behaviour, and consider making the change later as a policy decision once we work out the full consistency of all of the token processing rules, across different tokenization "engines".

@koheiw (Collaborator, Author) commented Dec 15, 2019

My plan is to call the same internal function from tokens.tokens() and tokens.corpus() to make their behavior strictly the same. Let's decide what the desired behavior is, or leave this branch until we are clear about this. I think v2.0 is the best time to switch to the new behavior, although it would not be too different.

My suggestion is adding

  • separate_hyphens
  • separate_twitter

and using remove_punct to remove separated hyphens or Twitter tags.

The current remove_hyphens = TRUE will be the same as separate_hyphens = TRUE with remove_punct = TRUE. The difference is that it becomes possible to keep the hyphen as a separate token "-" when separate_hyphens = TRUE and remove_punct = FALSE. The same goes for remove_twitter.

expect_equal(as.character(tokens(txt, separate_hyphens = TRUE, remove_punct = FALSE)[[1]]),
                 c("a", "b", "-", "c", "d", ".", "!"))

@koheiw (Collaborator, Author) commented Dec 15, 2019

The new argument could be compound_hyphens or join_hyphens, to make it clear that users can do it manually using tokens_compound().

@kbenoit (Collaborator) commented Dec 15, 2019

Not foolproof though...

> tokens("Pre- and post-processing") %>%
+     tokens_compound("-", window = 1)
tokens from 1 document.
text1 :
[1] "Pre_-_and"        "post-processing"

@koheiw (Collaborator, Author) commented Dec 15, 2019

Good point 🤔

@kbenoit (Collaborator) commented Dec 15, 2019

Other packages do this differently, although I would rather match our previous behaviour than adopt this rule (which I don't think makes sense):

> tm::removePunctuation("I can't zig-zag.")
[1] "I cant zigzag"
> tm::removePunctuation("I can't zig-zag.", preserve_intra_word_contractions = TRUE)
[1] "I can't zigzag"
> tm::removePunctuation("I can't zig-zag.", preserve_intra_word_dashes = TRUE)
[1] "I cant zig-zag"

Overall, I think this would be a good case for

  • adding a new argument called separate_hyphens = FALSE which turns "zig-zag" into "zig", "-", "zag", just as remove_hyphens = TRUE does now;
  • deprecating remove_hyphens;
  • not automatically splitting on the first tokenization step, since it's probably faster to split post-tokenization than to concatenate post-tokenization, and it avoids the problem above (see the sketch below).
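As a rough sketch of the split-in-post direction (editor's illustration, not part of the PR), assuming tokens_split() is available: splitting after tokenization only ever divides existing tokens, so it cannot join the wrong neighbours the way tokens_compound(..., window = 1) did on "Pre- and post-processing" above.

    library("quanteda")

    # splitting in post keeps "Pre-" and "and" as separate tokens, avoiding
    # the "Pre_-_and" compound produced by the window-based approach above
    tokens("Pre- and post-processing") %>%
      tokens_split(separator = "-", valuetype = "fixed", remove_separator = FALSE)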

Let's keep the existing functionality now for Twitter since this is a very regular case of always having one of two special characters at the beginning of the token, and compounding will always work for this.

Can we implement the new processing workflow for v2 and still match this behaviour?

library("quanteda")
## Package version: 1.5.2

txt <- "Ex-post, @quantedainit #rstats."

tokens(txt, remove_punct = TRUE, remove_twitter = TRUE, remove_hyphens = TRUE)
## tokens from 1 document.
## text1 :
## [1] "Ex"           "post"         "quantedainit" "rstats"
tokens(txt, remove_punct = TRUE, remove_twitter = TRUE, remove_hyphens = FALSE)
## tokens from 1 document.
## text1 :
## [1] "Ex-post"      "quantedainit" "rstats"
tokens(txt, remove_punct = TRUE, remove_twitter = FALSE, remove_hyphens = TRUE)
## tokens from 1 document.
## text1 :
## [1] "Ex"            "post"          "@quantedainit" "#rstats"
tokens(txt, remove_punct = TRUE, remove_twitter = FALSE, remove_hyphens = FALSE)
## tokens from 1 document.
## text1 :
## [1] "Ex-post"       "@quantedainit" "#rstats"
tokens(txt, remove_punct = FALSE, remove_twitter = TRUE, remove_hyphens = TRUE)
## Warning in tokens_internal(texts(x), ...): remove_twitter reset to FALSE when
## remove_punct = FALSE
## tokens from 1 document.
## text1 :
## [1] "Ex"            "-"             "post"          ","            
## [5] "@quantedainit" "#rstats"       "."
tokens(txt, remove_punct = FALSE, remove_twitter = TRUE, remove_hyphens = FALSE)
## Warning in tokens_internal(texts(x), ...): remove_twitter reset to FALSE when
## remove_punct = FALSE
## tokens from 1 document.
## text1 :
## [1] "Ex-post"       ","             "@quantedainit" "#rstats"      
## [5] "."
tokens(txt, remove_punct = FALSE, remove_twitter = FALSE, remove_hyphens = TRUE)
## tokens from 1 document.
## text1 :
## [1] "Ex"            "-"             "post"          ","            
## [5] "@quantedainit" "#rstats"       "."
tokens(txt, remove_punct = FALSE, remove_twitter = FALSE, remove_hyphens = FALSE)
## tokens from 1 document.
## text1 :
## [1] "Ex-post"       ","             "@quantedainit" "#rstats"      
## [5] "."

@koheiw (Collaborator, Author) commented Dec 15, 2019

In tokens.corpus(), we are fine in this way:

> stri_split_boundaries("Pre- and post-processing", type = "word", skip_word_none = FALSE) %>% 
+   as.tokens() %>% 
+   tokens_remove("^[\\p{Z}\\p{C}]+$", valuetype = "regex", padding = TRUE) %>% 
+   tokens_compound("-", window = 1, concatenator = "")
tokens from 1 document.
text1 :
[1] "Pre-"            ""                "and"             ""                "post-processing"

but I'm not sure about tokens.tokens()...

@kbenoit (Collaborator) commented Dec 15, 2019

Ah right, we actually split then preserve now. So we already get the above case wrong, unless we use the whitespace splitter, in which case the behaviour is inconsistent.

> tokens("Pre- and post-processing", remove_hyphens = FALSE)
tokens from 1 document.
text1 :
[1] "Pre"             "-"               "and"             "post-processing"

> tokens("Pre- and post-processing", remove_hyphens = FALSE, what = "fasterword")
tokens from 1 document.
text1 :
[1] "Pre-"            "and"             "post-processing"

For total consistency we would not use any of the stri_split_boundaries() options at all, but just split whitespace, and then do the operations in post, but this might limit our applicability to non-English or at least non-Western languages.

I'm not sure of the solution, but we should be maximising flexibility to choose different tokenisation engines, while also ensuring consistency of our remove_* options in applying to them after splitting.

Worth a discussion probably, or solve it in January.

@koheiw (Collaborator, Author) commented Dec 15, 2019

We can get closer.

require(stringi)
#> Loading required package: stringi
require(quanteda)
#> Loading required package: quanteda
#> Package version: 1.9.9004
#> Parallel computing: 2 of 4 threads used.
#> See https://quanteda.io for tutorials and examples.
#> 
#> Attaching package: 'quanteda'
#> The following object is masked from 'package:utils':
#> 
#>     View

txt <- "Ex-post, @quantedainit #rstats."

tokens2 <- function(x, remove_punct, remove_twitter, remove_hyphens) {
  result <- as.tokens(stri_split_boundaries(x, type = "word", skip_word_none = FALSE))
  result <- tokens_remove(result, "^[\\p{Z}\\p{C}]+$", valuetype = "regex", padding = TRUE)
  if (!remove_twitter)
    result <- tokens_compound(result, "^[#@]$", window = c(0, 1), valuetype = "regex", concatenator = "")
  if (!remove_hyphens)
    result <- tokens_compound(result, "-", window = 1, valuetype = "regex", concatenator = "")
  if (remove_punct)
    result <- tokens_remove(result, "^[\\p{P}]+$", valuetype = "regex")
  return(result)
}
tokens2(txt, remove_punct = TRUE, remove_twitter = TRUE, remove_hyphens = TRUE)
#> tokens from 1 document.
#> text1 :
#> [1] "Ex"           "post"         ""             "quantedainit" ""            
#> [6] "rstats"
tokens2(txt, remove_punct = TRUE, remove_twitter = FALSE, remove_hyphens = TRUE)
#> tokens from 1 document.
#> text1 :
#> [1] "Ex"            "post"          ""              "@quantedainit"
#> [5] ""              "#rstats"
tokens2(txt, remove_punct = TRUE, remove_twitter = FALSE, remove_hyphens = FALSE)
#> tokens from 1 document.
#> text1 :
#> [1] "Ex-post"       ""              "@quantedainit" ""             
#> [5] "#rstats"
tokens2(txt, remove_punct = FALSE, remove_twitter = TRUE, remove_hyphens = TRUE)
#> tokens from 1 document.
#> text1 :
#>  [1] "Ex"           "-"            "post"         ","            ""            
#>  [6] "@"            "quantedainit" ""             "#"            "rstats"      
#> [11] "."
tokens2(txt, remove_punct = FALSE, remove_twitter = FALSE, remove_hyphens = TRUE)
#> tokens from 1 document.
#> text1 :
#> [1] "Ex"            "-"             "post"          ","            
#> [5] ""              "@quantedainit" ""              "#rstats"      
#> [9] "."
tokens2(txt, remove_punct = FALSE, remove_twitter = FALSE, remove_hyphens = FALSE)
#> tokens from 1 document.
#> text1 :
#> [1] "Ex-post"       ","             ""              "@quantedainit"
#> [5] ""              "#rstats"       "."

Created on 2019-12-15 by the reprex package (v0.3.0)

@kbenoit added this to the v2.0 essentials milestone Jan 2, 2020
@kbenoit added this to To do in Innsbruck work plan Jan 11, 2020
Merge branch 'master' into post-processing

# Conflicts:
#	NEWS.md
#	R/tokens.R