Skip to content

Add remove_padding to ntoken()#2336

Merged
kbenoit merged 9 commits intomasterfrom
dev-ntoken
Jan 11, 2024
Merged

Add remove_padding to ntoken()#2336
kbenoit merged 9 commits intomasterfrom
dev-ntoken

Conversation

@koheiw
Copy link
Copy Markdown
Collaborator

@koheiw koheiw commented Jan 4, 2024

If ntoken(x, remove_padding = TRUE), it returns the number of tokens ignoring paddings. The results are the same as ntoken(tokens_remove(x, pattern = "")) but much more efficient. This will be useful in showing the number of tokens removed in verbose messagess (#2329).

I kept ... in ntoken() but I think the argument should be deprecated as it does not work in ntoken.dfm(). User should just run ntoken(tokens(x)) if necessary. The documentation also says "... not used".

@codecov
Copy link
Copy Markdown

codecov Bot commented Jan 4, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (7baa4ee) 95.97% compared to head (0c6f5b9) 95.94%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2336      +/-   ##
==========================================
- Coverage   95.97%   95.94%   -0.04%     
==========================================
  Files          96       96              
  Lines        5620     5625       +5     
==========================================
+ Hits         5394     5397       +3     
- Misses        226      228       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@koheiw
Copy link
Copy Markdown
Collaborator Author

koheiw commented Jan 4, 2024

We can compute how many tokens were removed easily even when padding = TRUE.

require(quanteda)
#> Loading required package: quanteda
#> Package version: 4.0.0
#> Unicode version: 13.0
#> ICU version: 69.1
#> Parallel computing: 8 of 8 threads used.
#> See https://quanteda.io for tutorials and examples.
toks <- tokens(data_corpus_inaugural, remove_numbers = TRUE)

toks2 <- tokens_remove(toks, phrase("a *"), verbose = TRUE, padding = TRUE)
#> removed 20,180 features
#> 

sum(ntoken(toks)) - sum(ntoken(toks2))
#> [1] 0
sum(ntoken(toks)) - sum(ntoken(toks2, remove_padding = TRUE))
#> [1] 4584

@kbenoit
Copy link
Copy Markdown
Collaborator

kbenoit commented Jan 8, 2024

Nice addition.

On removing ...: Right now the ... in ntoken() and ntype() for char and corpus are already deprecated because those entire functions are. Calling it on-the-fly in ntoken.tokens() is not nearly as expensive since it's a removal only, not re-tokenising. And keeping the default remove options via ... would be consistent with adding the new remove_padding option. Just as remove_padding adds the capability of computing these quickly if a count is desired, the other remove options provide the chance to do this if a user wants to compare the token counts with and without punctuation, for instance.

@kbenoit kbenoit self-requested a review January 8, 2024 06:09
Copy link
Copy Markdown
Collaborator

@kbenoit kbenoit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add this to ntype() for consistency? (It's only a 1 token difference, but still, why not be consistent?)

@koheiw
Copy link
Copy Markdown
Collaborator Author

koheiw commented Jan 8, 2024

On removing ...: Right now the ... in ntoken() and ntype() for char and corpus are already deprecated because those entire functions are. Calling it on-the-fly in ntoken.tokens() is not nearly as expensive since it's a removal only, not re-tokenising. And keeping the default remove options via ... would be consistent with adding the new remove_padding option. Just as remove_padding adds the capability of computing these quickly if a count is desired, the other remove options provide the chance to do this if a user wants to compare the token counts with and without punctuation, for instance.

There are three types of arguments in quanteda:

  1. essential (e.g. pattern and modify_if in tokens_*())
  2. speed up (e.g. ntoken(x, remove_padding) is much faster than ntoken(tokens_remove(x, ""))
  3. short cut (e.g. ntoken(x, ...) is the same as ntoken(tokens(x, ...)))

We need a design policy on when and how we support 2 and 3 to keep our packages easy to use and maintain. We can decide 2 based on speed up gain (e.g. the code become more than x times faster); 3 can be based on reduced steps (e.g. users need to type x commands less).

@koheiw
Copy link
Copy Markdown
Collaborator Author

koheiw commented Jan 8, 2024

Should we add this to ntype() for consistency? (It's only a 1 token difference, but still, why not be consistent?)

We can but we probably need to add types(remove_padding = TRUE) if we do so. The default value will be different between ntype() and ntoken() 🤔

@kbenoit
Copy link
Copy Markdown
Collaborator

kbenoit commented Jan 9, 2024

Thinking about this, remove_padding = FALSE ought to be the default for both. If a dfm creates the pad as a "feature" by default, then it should be considered a token/type by default. So I'd say include it for both and make the default false for both.

Might even make sense then to include it in the methods for dfms too then.

@kbenoit
Copy link
Copy Markdown
Collaborator

kbenoit commented Jan 10, 2024

This is more consistent now, since formerly, it was:

> tokens("a b c") |>
+     tokens_remove("b", padding = TRUE) |>
+     ntype()
text1 
    2 
> tokens("a b c") |>
+     tokens_remove("b", padding = TRUE) |>
+     ntoken()
text1 
    3 

Now it's 3 for both.

And for dfms created from this, it's also consistent, although we don't have a removal option.

> tokens("a b c") |>
+     tokens_remove("b", padding = TRUE) |>
+     dfm() |>
+     ntoken()
text1 
    3 
> tokens("a b c") |>
+     tokens_remove("b", padding = TRUE) |>
+     dfm() |>
+     ntoken(remove_padding = TRUE)
text1 
    3 

@kbenoit kbenoit merged commit 069e313 into master Jan 11, 2024
@kbenoit kbenoit deleted the dev-ntoken branch January 11, 2024 00:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants