Change to post processing #1801
Conversation
- remove preserve_special
- add special handling for hyphens and twitter using tokens_compound
It is failing tests due to #1477.
It's failing a number of tests now because of changes in how hyphens are processed. In the former behaviour this passed, but it fails in the PR:

```r
expect_equal(as.character(tokens(txt, remove_hyphens = TRUE, remove_punct = FALSE)[[1]]),
             c("a", "b", "-", "c", "d", ".", "!"))
```

For consistency, I think we should keep the older behaviour, and consider making the change later as a policy decision once we work out the full consistency of all of the token processing rules across different tokenization "engines".
My plan is to call the same internal function. My suggestion is adding a new `separate_hyphens` argument, so that this passes:

```r
expect_equal(as.character(tokens(txt, separate_hyphens = TRUE, remove_punct = FALSE)[[1]]),
             c("a", "b", "-", "c", "d", ".", "!"))
```
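A rough base-R sketch of what a token-level hyphen separator could do (hypothetical `separate_hyphens_sketch()` helper; not quanteda's actual implementation):

```r
# Hypothetical illustration only -- not quanteda's implementation.
# Split any token containing "-" into parts, keeping the hyphen as its own token.
separate_hyphens_sketch <- function(toks) {
  unlist(lapply(toks, function(tok) {
    if (grepl("-", tok, fixed = TRUE) && nchar(tok) > 1) {
      # zero-width split before and after each hyphen, e.g. "b-c" -> "b" "-" "c"
      parts <- strsplit(tok, "(?<=-)|(?=-)", perl = TRUE)[[1]]
      parts[nzchar(parts)]  # drop empty strings from zero-width splits
    } else {
      tok
    }
  }))
}

separate_hyphens_sketch(c("a", "b-c", "d", ".", "!"))
# [1] "a" "b" "-" "c" "d" "." "!"
```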
Not foolproof though...

```r
> tokens("Pre- and post-processing") %>%
+   tokens_compound("-", window = 1)
tokens from 1 document.
text1 :
[1] "Pre_-_and"       "post-processing"
```
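The failure mode can be reproduced with a naive base-R sketch of window-based compounding (illustration only, not `tokens_compound()` itself): any token adjacent to a bare `-` gets absorbed, including the unrelated `and`.

```r
# Naive sketch: join every "-" token with its immediate neighbours.
# Illustration of the failure mode only -- not quanteda's tokens_compound().
compound_hyphens_naive <- function(toks, concatenator = "_") {
  for (j in rev(which(toks == "-"))) {
    lo <- max(1, j - 1)
    hi <- min(length(toks), j + 1)
    joined <- paste(toks[lo:hi], collapse = concatenator)
    toks <- append(toks[-(lo:hi)], joined, after = lo - 1)
  }
  toks
}

# "Pre- and" tokenises as c("Pre", "-", "and"), so "and" is wrongly absorbed:
compound_hyphens_naive(c("Pre", "-", "and", "post-processing"))
# [1] "Pre_-_and"       "post-processing"
```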
Good point 🤔
Other packages do this differently, although I would rather match our previous behaviour than adopt this rule (which I don't think makes sense):

```r
> tm::removePunctuation("I can't zig-zag.")
[1] "I cant zigzag"
> tm::removePunctuation("I can't zig-zag.", preserve_intra_word_contractions = TRUE)
[1] "I can't zigzag"
> tm::removePunctuation("I can't zig-zag.", preserve_intra_word_dashes = TRUE)
[1] "I cant zig-zag"
```

Overall, I think this would be a good case for
Let's keep the existing functionality for now for Twitter, since this is a very regular case of always having one of two special characters at the beginning of the token, and compounding will always work for this. Can we implement the new processing workflow for v2 and still match this behaviour?

```r
library("quanteda")
## Package version: 1.5.2

txt <- "Ex-post, @quantedainit #rstats."

tokens(txt, remove_punct = TRUE, remove_twitter = TRUE, remove_hyphens = TRUE)
## tokens from 1 document.
## text1 :
## [1] "Ex"           "post"         "quantedainit" "rstats"

tokens(txt, remove_punct = TRUE, remove_twitter = TRUE, remove_hyphens = FALSE)
## tokens from 1 document.
## text1 :
## [1] "Ex-post"      "quantedainit" "rstats"

tokens(txt, remove_punct = TRUE, remove_twitter = FALSE, remove_hyphens = TRUE)
## tokens from 1 document.
## text1 :
## [1] "Ex"            "post"          "@quantedainit" "#rstats"

tokens(txt, remove_punct = TRUE, remove_twitter = FALSE, remove_hyphens = FALSE)
## tokens from 1 document.
## text1 :
## [1] "Ex-post"       "@quantedainit" "#rstats"

tokens(txt, remove_punct = FALSE, remove_twitter = TRUE, remove_hyphens = TRUE)
## Warning in tokens_internal(texts(x), ...): remove_twitter reset to FALSE when
## remove_punct = FALSE
## tokens from 1 document.
## text1 :
## [1] "Ex"            "-"             "post"          ","
## [5] "@quantedainit" "#rstats"       "."

tokens(txt, remove_punct = FALSE, remove_twitter = TRUE, remove_hyphens = FALSE)
## Warning in tokens_internal(texts(x), ...): remove_twitter reset to FALSE when
## remove_punct = FALSE
## tokens from 1 document.
## text1 :
## [1] "Ex-post"       ","             "@quantedainit" "#rstats"
## [5] "."

tokens(txt, remove_punct = FALSE, remove_twitter = FALSE, remove_hyphens = TRUE)
## tokens from 1 document.
## text1 :
## [1] "Ex"            "-"             "post"          ","
## [5] "@quantedainit" "#rstats"       "."

tokens(txt, remove_punct = FALSE, remove_twitter = FALSE, remove_hyphens = FALSE)
## tokens from 1 document.
## text1 :
## [1] "Ex-post"       ","             "@quantedainit" "#rstats"
## [5] "."
```
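The "special character at the beginning of the token" rule the comment relies on can be sketched in base R (hypothetical helper, not quanteda's internals): re-attach a bare `@` or `#` token to the token that follows it.

```r
# Sketch only -- not quanteda's internals. Re-attach a lone "@" or "#"
# to the following token; this always works because the marker always
# precedes the handle or hashtag in the token stream.
compound_twitter_sketch <- function(toks) {
  out <- character(0)
  k <- 1
  while (k <= length(toks)) {
    if (toks[k] %in% c("@", "#") && k < length(toks)) {
      out <- c(out, paste0(toks[k], toks[k + 1]))
      k <- k + 2  # consume the marker and the token it attaches to
    } else {
      out <- c(out, toks[k])
      k <- k + 1
    }
  }
  out
}

compound_twitter_sketch(c("Ex", "-", "post", ",", "@", "quantedainit", "#", "rstats", "."))
# [1] "Ex"            "-"             "post"          ","
# [5] "@quantedainit" "#rstats"       "."
```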
With `stri_split_boundaries()` we can do:

```r
> stri_split_boundaries("Pre- and post-processing", type = "word", skip_word_none = FALSE) %>%
+   as.tokens() %>%
+   tokens_remove("^[\\p{Z}\\p{C}]+$", valuetype = "regex", padding = TRUE) %>%
+   tokens_compound("-", window = 1, concatenator = "")
tokens from 1 document.
text1 :
[1] "Pre-"            ""                "and"             ""                "post-processing"
```

but I'm not sure.
Ah right, we actually split then preserve now. So we already get the above case wrong, unless we use the whitespace splitter, in which case the behaviour is inconsistent.

For total consistency we would not use any of the built-in removal options at tokenisation time. I'm not sure of the solution, but we should be maximising flexibility to choose different tokenisation engines, while also ensuring consistency of our post-processing. Worth a discussion probably, or we can solve it in January.
We can get closer.

```r
require(stringi)
#> Loading required package: stringi
require(quanteda)
#> Loading required package: quanteda
#> Package version: 1.9.9004
#> Parallel computing: 2 of 4 threads used.
#> See https://quanteda.io for tutorials and examples.
#>
#> Attaching package: 'quanteda'
#> The following object is masked from 'package:utils':
#>
#>     View

txt <- "Ex-post, @quantedainit #rstats."

tokens2 <- function(x, remove_punct, remove_twitter, remove_hyphens) {
  result <- as.tokens(stri_split_boundaries(x, type = "word", skip_word_none = FALSE))
  result <- tokens_remove(result, "^[\\p{Z}\\p{C}]+$", valuetype = "regex", padding = TRUE)
  if (!remove_twitter)
    result <- tokens_compound(result, "^[#@]$", window = c(0, 1), valuetype = "regex", concatenator = "")
  if (!remove_hyphens)
    result <- tokens_compound(result, "-", window = 1, valuetype = "regex", concatenator = "")
  if (remove_punct)
    result <- tokens_remove(result, "^[\\p{P}]+$", valuetype = "regex")
  return(result)
}

tokens2(txt, remove_punct = TRUE, remove_twitter = TRUE, remove_hyphens = TRUE)
#> tokens from 1 document.
#> text1 :
#> [1] "Ex"           "post"         ""             "quantedainit" ""
#> [6] "rstats"

tokens2(txt, remove_punct = TRUE, remove_twitter = FALSE, remove_hyphens = TRUE)
#> tokens from 1 document.
#> text1 :
#> [1] "Ex"            "post"          ""              "@quantedainit"
#> [5] ""              "#rstats"

tokens2(txt, remove_punct = TRUE, remove_twitter = FALSE, remove_hyphens = FALSE)
#> tokens from 1 document.
#> text1 :
#> [1] "Ex-post"       ""              "@quantedainit" ""
#> [5] "#rstats"

tokens2(txt, remove_punct = FALSE, remove_twitter = TRUE, remove_hyphens = TRUE)
#> tokens from 1 document.
#> text1 :
#> [1] "Ex"           "-"            "post"         ","            ""
#> [6] "@"            "quantedainit" ""             "#"            "rstats"
#> [11] "."

tokens2(txt, remove_punct = FALSE, remove_twitter = FALSE, remove_hyphens = TRUE)
#> tokens from 1 document.
#> text1 :
#> [1] "Ex"            "-"             "post"          ","
#> [5] ""              "@quantedainit" ""              "#rstats"
#> [9] "."

tokens2(txt, remove_punct = FALSE, remove_twitter = FALSE, remove_hyphens = FALSE)
#> tokens from 1 document.
#> text1 :
#> [1] "Ex-post"       ","             ""              "@quantedainit"
#> [5] ""              "#rstats"       "."
```

Created on 2019-12-15 by the reprex package (v0.3.0)
Merge branch 'master' into post-processing

# Conflicts:
#   NEWS.md
#   R/tokens.R
tokens_compound()
This is mainly for #1503, but can also address #1446.