Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update dfm #2251

Merged
merged 44 commits into from
Apr 21, 2023
Merged

Update dfm #2251

merged 44 commits into from
Apr 21, 2023

Conversation

koheiw
Copy link
Collaborator

@koheiw koheiw commented Apr 13, 2023

Call dfm.tokens_xptr() from dfm.tokens() for #2249. There are many changes and tests accordingly.

@codecov
Copy link

codecov bot commented Apr 13, 2023

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.47 🎉

Comparison is base (2b4d07a) 95.66% compared to head (620360e) 96.13%.

❗ Current head 620360e differs from pull request most recent head a8c2ea2. Consider uploading reports for the commit a8c2ea2 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2251      +/-   ##
==========================================
+ Coverage   95.66%   96.13%   +0.47%     
==========================================
  Files          97       97              
  Lines        5600     5435     -165     
==========================================
- Hits         5357     5225     -132     
+ Misses        243      210      -33     
Impacted Files Coverage Δ
R/dfm-methods.R 97.05% <ø> (ø)
R/dfm_lookup.R 95.16% <ø> (ø)
R/dfm_match.R 100.00% <ø> (ø)
R/dfm_replace.R 91.66% <ø> (ø)
R/dfm_sort.R 78.57% <ø> (ø)
R/dfm_trim.R 93.18% <ø> (ø)
R/dfm_weight.R 95.76% <ø> (ø)
R/dictionaries.R 94.24% <ø> (ø)
R/docnames.R 100.00% <ø> (ø)
R/bootstrap_dfm.R 100.00% <100.00%> (ø)
... and 15 more

... and 4 files with indirect coverage changes

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@koheiw koheiw requested a review from kbenoit April 13, 2023 09:15
@koheiw
Copy link
Collaborator Author

koheiw commented Apr 16, 2023

@kbenoit can you review this?

Copy link
Collaborator

@kbenoit kbenoit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a nice improvement and makes the underlying engine all based on xptr. It also removes the option of not doing that. Since the dfm is the same object, however, this should not affect compatibility, since a v3 dfm and a v4 dfm will be the same objects. (But do correct me if I'm wrong here!)

Two comments.

  1. As I always emphasise, deprecations should be separate from functionality changes, whenever they are not integral to functionality change. Pruning is satisfying (I know!) but creates untold headaches for me in particular. The most time-consuming part of releasing v3 was me notifying other package maintainers of breaking changes and often, issuing my own PRs to fix code in their packages that would be broken by even our warnings. There is also the issue of the user base and their code. I agree we will defunct anything deprecated in v3, but there are some nifty tools for doing this now, and they are far more informative than just .Defunct(). (See https://r-pkgs.org/lifecycle.html) So I prefer to do these in a separate series of PRs because this is the most delicate surgery. That means I'd prefer the removals not be removed here. There is a lot that is good that you tidied here, including some tests we suppressed the warnings on that I agree should be removed here, but not simply removing code for v3 deprecations.

  2. I still don't like the nomenclature of tokens_xptr. It is not consistent with our single-word object names, and the use of the underscore. It's not a function we have to call, I know, so I'm not 100% convinced it's a problem, but it's inconsistent - and consistency is one of the beautiful things about quanteda.

So I don't like seeing this for instance:

Error: dfm() only works on dfm, tokens, tokens_xptr objects.

Can we name this child class of tokens something that is a single word? tokensx or even just remove the underscore so it's tokensxptr? (I prefer the first.)

But it's not a deal-breaker.

R/data-documentation.R Show resolved Hide resolved
R/bootstrap_dfm.R Show resolved Hide resolved
tests/testthat/test-bootstrap.R Show resolved Hide resolved
tests/testthat/test-default-methods.R Show resolved Hide resolved
tests/testthat/test-dfm_group.R Show resolved Hide resolved
tests/testthat/test-dfm_lookup.R Show resolved Hide resolved
R/dfm.R Show resolved Hide resolved
@koheiw
Copy link
Collaborator Author

koheiw commented Apr 20, 2023

Thanks for reviewing. I decided to remove deprecated functions to to reduce the workload. It does not make sense to write code for functions that will be removed. Introduction of the tokens_xptr object as you can see in the number of commits that I made in the last three months.

@koheiw
Copy link
Collaborator Author

koheiw commented Apr 20, 2023

There should not be any function that tokenize texts on the fly like bootstrap_dfm() and nsentecens() etc. This is why I removed. Uses should pass a DFM to bootstrap_dfm().

@koheiw
Copy link
Collaborator Author

koheiw commented Apr 20, 2023

More generally, we should reduced the number of functions more aggressively. It is becoming too hard to maintain. I am the only person who is doing the substantive development. Otherwise, I will really create quanteda.core package and leave functions that I think redundant with you to maintain.

@koheiw
Copy link
Collaborator Author

koheiw commented Apr 20, 2023

Please also note that despite the removed tests, the test coverage is higher significantly. We should keep removing redundant tests and examples too. I don't delete important tests.

@kbenoit
Copy link
Collaborator

kbenoit commented Apr 20, 2023

@koheiw I'm so deeply impressed by how much time that you are putting into the package development - and the spectacular performance gains they have brought - that I almost messaged you out of concern whether you still have a job! (I hope so...)

If there are functions you think should be removed, propose them in an issue. But please don't remove them from a performance-related PR because this slows down the assessment of the PR. We can discuss and remove them separately. Anything that generates errors or warnings in rev deps is one of the hardest things to test for CRAN compliance. And most time-consuming to fix. I don't mind doing it but I prefer to minimise the post-surgery recovery time!

@koheiw
Copy link
Collaborator Author

koheiw commented Apr 20, 2023

I choose the name toknes_xptr for the following reasons when I started developing.

  1. The new tokens object inherits properties of tokens and externalptr objects, so the class label could be c("externalptr", "tokens"), but it triggers methods for externalptr when I don't want to. If c("tokens", "externalptr"), I have to check the second class label in all the S3 methods for tokens.
  2. I need a special class label for the new label that tell the users the nature of the object like "tokens_externalptr". It could be "tokensx", "xtokens" to make it short, but then I have to make a functions called tokensx_*() or xtokens_*(). It could be tokens-xptr or tokens.xptr but we know that the underscore is the best for S3 class labels.
  3. It cannot be "tokens2" because "tokens" will co-exist with the new object.
  4. We already have class labels with underscores like "textstat_proxy", "textstat_simil_sparse".

@koheiw
Copy link
Collaborator Author

koheiw commented Apr 20, 2023

By the way, this is not a PR for performance improvement. I fixed the C++ function on this branch because replacing R code with C++ code broke many tests.

@koheiw
Copy link
Collaborator Author

koheiw commented Apr 21, 2023

I decided to develop v4.0 because the data that I analyze on my job is getting larger and larger.

@kbenoit
Copy link
Collaborator

kbenoit commented Apr 21, 2023

OK, all good points, let's do this then.

  1. We stick with the class name, since what you say makes sense, and I was already thinking about the underscored classes in textstats and textmodels as precedents, so let's stick with tokens_xptr.
  2. I'll restore the removed items that are not integral to the PR, but remove them in a separate PR so that I can test their effects, and not hold up things further. Then we can merge this PR and move on to the rest of v4 plan.

Kenneth Benoit added 3 commits April 21, 2023 07:53
Also adds an argument to check_class() so that we can list methods not to described as in the valid set.  Default is NULL so that existing calls to check_class() are unaffected.
We should be doing this for every change, every PR as we move into v4.
Copy link
Collaborator

@kbenoit kbenoit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to start running revdep_check() with each PR but let's merge it first and I'll start the process then.

@koheiw koheiw merged commit 892acc2 into master Apr 21, 2023
6 checks passed
@koheiw koheiw deleted the update-dfm branch April 21, 2023 13:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants