Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## master #2336 +/- ##
==========================================
- Coverage 95.97% 95.94% -0.04%
==========================================
Files 96 96
Lines 5620 5625 +5
==========================================
+ Hits 5394 5397 +3
- Misses 226 228 +2
We can compute how many tokens were removed easily even when
require(quanteda)
#> Loading required package: quanteda
#> Package version: 4.0.0
#> Unicode version: 13.0
#> ICU version: 69.1
#> Parallel computing: 8 of 8 threads used.
#> See https://quanteda.io for tutorials and examples.
toks <- tokens(data_corpus_inaugural, remove_numbers = TRUE)
toks2 <- tokens_remove(toks, phrase("a *"), verbose = TRUE, padding = TRUE)
#> removed 20,180 features
#>
sum(ntoken(toks)) - sum(ntoken(toks2))
#> [1] 0
sum(ntoken(toks)) - sum(ntoken(toks2, remove_padding = TRUE))
#> [1] 4584
Nice addition. On removing
kbenoit
left a comment
Should we add this to ntype() for consistency? (It's only a 1 token difference, but still, why not be consistent?)
There are three types of arguments in quanteda:
We need a design policy on when and how we support 2 and 3 to keep our packages easy to use and maintain. We can decide 2 based on the speed-up gained (e.g. the code becomes more than x times faster); 3 can be based on the number of steps saved (e.g. users need to type x fewer commands).
We can but we probably need to add
Thinking about this, it might even make sense to include it in the methods for dfms too.
This is more consistent now, since formerly, it was:
> tokens("a b c") |>
+ tokens_remove("b", padding = TRUE) |>
+ ntype()
text1
2
> tokens("a b c") |>
+ tokens_remove("b", padding = TRUE) |>
+ ntoken()
text1
3
Now it's 3 for both. And for dfms created from this, it's also consistent, although we don't have a removal option.
> tokens("a b c") |>
+ tokens_remove("b", padding = TRUE) |>
+ dfm() |>
+ ntoken()
text1
3
> tokens("a b c") |>
+ tokens_remove("b", padding = TRUE) |>
+ dfm() |>
+ ntoken(remove_padding = TRUE)
text1
3
With ntoken(x, remove_padding = TRUE), the function returns the number of tokens ignoring paddings. The results are the same as ntoken(tokens_remove(x, pattern = "")) but much more efficient. This will be useful in showing the number of tokens removed in verbose messages (#2329).

I kept ... in ntoken() but I think the argument should be deprecated as it does not work in ntoken.dfm(). Users should just run ntoken(tokens(x)) if necessary. The documentation also says "... not used".
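
The equivalence described above can be checked on a small example. This is a minimal sketch, assuming an installed quanteda version that already includes the remove_padding argument proposed in this PR; the two-document input and the variable names are made up for illustration.

```r
library(quanteda)

# Hypothetical toy input: two short documents
toks <- tokens(c(d1 = "a b c a d", d2 = "a a b"))

# Remove "a" but leave an empty pad in each removed position
toks_pad <- tokens_remove(toks, "a", padding = TRUE)

# Default count: pads are still counted as tokens
n_all <- ntoken(toks_pad)

# Proposed shortcut: count tokens while ignoring pads
n_fast <- ntoken(toks_pad, remove_padding = TRUE)

# Equivalent two-step route: drop the pads first, then count
n_slow <- ntoken(tokens_remove(toks_pad, pattern = ""))

# Both routes should agree document by document
stopifnot(all(n_fast == n_slow))

# The difference is the number of pads, i.e. the number of tokens removed
n_removed <- sum(n_all) - sum(n_fast)
print(n_removed)
```

The shortcut avoids building an intermediate tokens object, which is presumably where the efficiency gain mentioned in the description comes from.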