Update dfm #2251
Merge branch 'update-pattern2id' into update-dfm # Conflicts: # man/search_glob.Rd
Codecov Report
Patch coverage:
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2251      +/-   ##
==========================================
+ Coverage   95.66%   96.13%   +0.47%
==========================================
  Files          97       97
  Lines        5600     5435     -165
==========================================
- Hits         5357     5225     -132
+ Misses        243      210      -33
... and 4 files with indirect coverage changes
@kbenoit can you review this?
This is a nice improvement and makes the underlying engine all based on xptr. It also removes the option of not doing that. Since the dfm is the same object, however, this should not affect compatibility, since a v3 dfm and a v4 dfm will be the same objects. (But do correct me if I'm wrong here!)
Two comments.
- As I always emphasise, deprecations should be separate from functionality changes whenever they are not integral to the functionality change. Pruning is satisfying (I know!) but creates untold headaches for me in particular. The most time-consuming part of releasing v3 was notifying other package maintainers of breaking changes and, often, issuing my own PRs to fix code in their packages that would be broken even by our warnings. There is also the issue of the user base and their code. I agree we will defunct anything deprecated in v3, but there are some nifty tools for doing this now, and they are far more informative than just .Defunct(). (See https://r-pkgs.org/lifecycle.html) So I prefer to do these in a separate series of PRs, because this is the most delicate surgery. That means I'd prefer the removals not be made here. There is a lot that is good in what you tidied here, including some tests we suppressed the warnings on that I agree should be removed here, but not simply removing code for v3 deprecations.
- I still don't like the nomenclature of tokens_xptr. With its underscore, it is not consistent with our single-word object names. It's not a function we have to call, I know, so I'm not 100% convinced it's a problem, but it's inconsistent, and consistency is one of the beautiful things about quanteda.
So I don't like seeing this, for instance:
Error: dfm() only works on dfm, tokens, tokens_xptr objects.
Can we name this child class of tokens something that is a single word? tokensx, or even just remove the underscore so it's tokensxptr? (I prefer the first.)
But it's not a deal-breaker.
Thanks for reviewing. I decided to remove deprecated functions to reduce the workload. It does not make sense to write code for functions that will be removed. Introduction of the
There should not be any function that tokenizes texts on the fly like
More generally, we should reduce the number of functions more aggressively. It is becoming too hard to maintain. I am the only person who is doing the substantive development. Otherwise, I will really create a quanteda.core package and leave the functions that I think are redundant with you to maintain.
Please also note that despite the removed tests, the test coverage is significantly higher. We should keep removing redundant tests and examples too. I don't delete important tests.
@koheiw I'm so deeply impressed by how much time you are putting into the package development, and by the spectacular performance gains it has brought, that I almost messaged you out of concern whether you still have a job! (I hope so...) If there are functions you think should be removed, propose them in an issue. But please don't remove them in a performance-related PR, because this slows down the assessment of the PR. We can discuss and remove them separately. Anything that generates errors or warnings in rev deps is one of the hardest things to test for CRAN compliance, and the most time-consuming to fix. I don't mind doing it, but I prefer to minimise the post-surgery recovery time!
I choose the name
By the way, this is not a PR for performance improvement. I fixed the C++ function on this branch because replacing R code with C++ code broke many tests.
I decided to develop v4.0 because the data that I analyze in my job is getting larger and larger.
OK, all good points, let's do this then.
Also adds an argument to check_class() so that we can list methods that should not be described as part of the valid set. The default is NULL, so existing calls to check_class() are unaffected.
We should be doing this for every change, every PR as we move into v4.
I'm going to start running revdep_check() with each PR, but let's merge it first and I'll start the process then.
Call dfm.tokens_xptr() from dfm.tokens() for #2249. There are many changes, with tests updated accordingly.