Update dfm #2251

koheiw · 2023-04-13T05:33:16Z

Call dfm.tokens_xptr() from dfm.tokens() for #2249. Many functions and tests are changed accordingly.

Merge branch 'update-pattern2id' into update-dfm # Conflicts: # man/search_glob.Rd

codecov · 2023-04-13T05:48:02Z

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.47 🎉

Comparison is base (2b4d07a) 95.66% compared to head (620360e) 96.13%.

❗ Current head 620360e differs from pull request most recent head a8c2ea2. Consider uploading reports for the commit a8c2ea2 to get more accurate results

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #2251      +/-   ##
==========================================
+ Coverage   95.66%   96.13%   +0.47%     
==========================================
  Files          97       97              
  Lines        5600     5435     -165     
==========================================
- Hits         5357     5225     -132     
+ Misses        243      210      -33

Impacted Files	Coverage Δ
R/dfm-methods.R	`97.05% <ø> (ø)`
R/dfm_lookup.R	`95.16% <ø> (ø)`
R/dfm_match.R	`100.00% <ø> (ø)`
R/dfm_replace.R	`91.66% <ø> (ø)`
R/dfm_sort.R	`78.57% <ø> (ø)`
R/dfm_trim.R	`93.18% <ø> (ø)`
R/dfm_weight.R	`95.76% <ø> (ø)`
R/dictionaries.R	`94.24% <ø> (ø)`
R/docnames.R	`100.00% <ø> (ø)`
R/bootstrap_dfm.R	`100.00% <100.00%> (ø)`
... and 15 more

... and 4 files with indirect coverage changes

koheiw · 2023-04-16T10:00:33Z

@kbenoit can you review this?

kbenoit

This is a nice improvement and makes the underlying engine all based on xptr. It also removes the option of not doing that. Since the dfm is the same object, however, this should not affect compatibility, since a v3 dfm and a v4 dfm will be the same objects. (But do correct me if I'm wrong here!)

Two comments.

1. As I always emphasise, deprecations should be separate from functionality changes whenever they are not integral to the functionality change. Pruning is satisfying (I know!) but creates untold headaches for me in particular. The most time-consuming part of releasing v3 was notifying other package maintainers of breaking changes and, often, issuing my own PRs to fix code in their packages that would be broken by even our warnings. There is also the issue of the user base and their code. I agree we will make defunct anything deprecated in v3, but there are some nifty tools for doing this now, and they are far more informative than just .Defunct(). (See https://r-pkgs.org/lifecycle.html) So I prefer to do these in a separate series of PRs because this is the most delicate surgery. That means I'd prefer the removals not be made here. There is a lot that is good that you tidied here, including some tests we suppressed the warnings on that I agree should be removed here, but I would not simply remove code for v3 deprecations.
2. I still don't like the nomenclature of tokens_xptr. It is not consistent with our single-word object names or with how we use the underscore. It's not a function we have to call, I know, so I'm not 100% convinced it's a problem, but it's inconsistent - and consistency is one of the beautiful things about quanteda.

So I don't like seeing this for instance:

Error: dfm() only works on dfm, tokens, tokens_xptr objects.

Can we name this child class of tokens something that is a single word? tokensx or even just remove the underscore so it's tokensxptr? (I prefer the first.)

But it's not a deal-breaker.
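As a minimal sketch of the staged approach described in the first comment, the base R example below contrasts a deprecated code path (still works, but warns) with a defunct one (stops with an error). The names tokenize_simple(), count_words_deprecated() and count_words_defunct() are hypothetical and are not quanteda functions; only .Deprecated() and .Defunct() are real base R calls. The lifecycle tools linked above are not used here.

```r
# Hypothetical helper: a naive whitespace tokenizer (not a quanteda function).
tokenize_simple <- function(x) strsplit(x, " ", fixed = TRUE)

# Stage 1 (deprecated): character input still works but raises a warning.
count_words_deprecated <- function(x) {
    if (is.character(x)) {
        .Deprecated(msg = "Character input is deprecated; tokenize first.")
        x <- tokenize_simple(x)
    }
    lengths(x)
}

# Stage 2 (defunct): character input now stops with an error.
count_words_defunct <- function(x) {
    if (is.character(x)) {
        .Defunct(msg = "Character input is defunct; tokenize first.")
    }
    lengths(x)
}

txt <- c("a b c", "d e")

# Capture the condition messages instead of letting them propagate.
res_warn <- tryCatch(count_words_deprecated(txt),
                     warning = function(w) conditionMessage(w))
res_stop <- tryCatch(count_words_defunct(txt),
                     error = function(e) conditionMessage(e))

# The supported path (already tokenized input) is unaffected.
res_ok <- count_words_defunct(tokenize_simple(txt))
```

Keeping the two stages in separate releases gives reverse dependencies one cycle in which their code still runs while the warning tells them what to change.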

R/data-documentation.R

R/bootstrap_dfm.R

tests/testthat/test-bootstrap.R

tests/testthat/test-default-methods.R

tests/testthat/test-dfm_group.R

tests/testthat/test-dfm_lookup.R

R/dfm.R

koheiw · 2023-04-20T13:26:56Z

Thanks for reviewing. I decided to remove deprecated functions to reduce the workload. It does not make sense to write code for functions that will be removed. Introducing the tokens_xptr object has been a lot of work, as you can see from the number of commits that I have made in the last three months.

koheiw · 2023-04-20T13:30:22Z

There should not be any functions that tokenize texts on the fly, like bootstrap_dfm() and nsentence() etc. This is why I removed them. Users should pass a DFM to bootstrap_dfm().

koheiw · 2023-04-20T13:34:44Z

More generally, we should reduce the number of functions more aggressively. The package is becoming too hard to maintain, and I am the only person doing the substantive development. Otherwise, I will really create a quanteda.core package and leave the functions that I think are redundant with you to maintain.

koheiw · 2023-04-20T13:54:07Z

Please also note that, despite the removed tests, the test coverage is significantly higher. We should keep removing redundant tests and examples too. I don't delete important tests.

kbenoit · 2023-04-20T16:22:36Z

@koheiw I'm so deeply impressed by how much time you are putting into the package development - and the spectacular performance gains it has brought - that I almost messaged you out of concern whether you still have a job! (I hope so...)

If there are functions you think should be removed, propose them in an issue. But please don't remove them from a performance-related PR because this slows down the assessment of the PR. We can discuss and remove them separately. Anything that generates errors or warnings in rev deps is one of the hardest things to test for CRAN compliance. And most time-consuming to fix. I don't mind doing it but I prefer to minimise the post-surgery recovery time!

koheiw · 2023-04-20T23:51:47Z

I chose the name tokens_xptr for the following reasons when I started developing it.

1. The new tokens object inherits properties of tokens and externalptr objects, so the class label could be c("externalptr", "tokens"), but that triggers methods for externalptr when I don't want it to. If it were c("tokens", "externalptr"), I would have to check the second class label in all the S3 methods for tokens.
2. I need a special class label for the new object that tells users the nature of the object, like "tokens_externalptr". It could be "tokensx" or "xtokens" to make it short, but then I would have to make functions called tokensx_*() or xtokens_*(). It could be tokens-xptr or tokens.xptr, but we know that the underscore is the best for S3 class labels.
3. It cannot be "tokens2" because "tokens" will co-exist with the new object.
4. We already have class labels with underscores like "textstat_proxy", "textstat_simil_sparse".
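The dispatch behaviour behind the first reason can be illustrated with a toy S3 generic. This is only a sketch: describe() and its methods are hypothetical, the objects are plain lists standing in for real external pointers, and nothing here reflects how quanteda actually constructs its objects.

```r
# Hypothetical generic and methods, for illustration only.
describe <- function(x, ...) UseMethod("describe")
describe.externalptr <- function(x, ...) "externalptr method"
describe.tokens <- function(x, ...) {
    # With c("tokens", "externalptr"), every tokens method needs this check.
    if (inherits(x, "externalptr")) "tokens method, pointer branch" else "tokens method"
}
describe.tokens_xptr <- function(x, ...) "tokens_xptr method"

# Plain lists stand in for the real objects; only the class vectors matter.
ptr_first  <- structure(list(), class = c("externalptr", "tokens"))
toks_first <- structure(list(), class = c("tokens", "externalptr"))
dedicated  <- structure(list(), class = "tokens_xptr")
plain      <- structure(list(), class = "tokens")

# S3 dispatch walks the class vector from left to right.
res_ptr_first  <- describe(ptr_first)   # picks the externalptr method
res_toks_first <- describe(toks_first)  # picks the tokens method, which must branch
res_dedicated  <- describe(dedicated)   # picks its own method
res_plain      <- describe(plain)       # ordinary tokens method
```

With a dedicated label, methods for the existing tokens class stay untouched and the pointer-backed object gets its own set of methods.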

koheiw · 2023-04-20T23:58:20Z

By the way, this is not a PR for performance improvement. I fixed the C++ function on this branch because replacing R code with C++ code broke many tests.

koheiw · 2023-04-21T00:01:03Z

I decided to develop v4.0 because the data that I analyze in my job is getting larger and larger.

kbenoit · 2023-04-21T06:44:36Z

OK, all good points, let's do this then.

1. We stick with the class name, since what you say makes sense, and I was already thinking about the underscored classes in textstats and textmodels as precedents, so let's stick with tokens_xptr.
2. I'll restore the removed items that are not integral to the PR, and remove them in a separate PR so that I can test their effects and not hold things up further. Then we can merge this PR and move on to the rest of the v4 plan.

kbenoit

I'm going to start running revdep_check() with each PR but let's merge it first and I'll start the process then.

koheiw added 14 commits April 13, 2023 08:38

Use dfm.token_xptr and remove deprecated

7047908

Fix verbose messages

e2d9735

Merge branch 'update-pattern2id' into update-dfm # Conflicts: # man/search_glob.Rd

Add more tests for tokens_lookup()

0a61b50

Make cpp_dfm more reliable

ceb035e

Update tests

cb0478d

Merege updates of patter2fixed

bfc7336

Only allow bootstrap dfm

41005d0

Fix verbose message

f1b9736

Make index_types internal

f28c46d

Update tests

8cc29d3

Update tests

234e681

Improve verbose message

e3217e4

Build

bd85287

Merge branch 'master' into update-dfm

2343ada

koheiw added 5 commits April 13, 2023 14:51

Fix dfm() examples

2260999

Fix dfm examples

09fd688

Build

fc434fa

Merge branch 'update-dfm' of https://github.com/quanteda/quanteda int…

2be3baa

…o update-dfm

Add tests

c17a20a

koheiw requested a review from kbenoit April 13, 2023 09:15

koheiw added 9 commits April 13, 2023 18:33

Update benchmarking

44dfc51

Save Rmd

972ec1a

Ignore html

c362793

Delete html

9d7ad2d

Update

a0b7814

Fix test

402df8f

Change to unsigned int

788fcd5

Build

89b95c3

Build

1be5314

koheiw added 2 commits April 16, 2023 18:05

Specify name space

7302bf4

Ensure that objects are deep copied when ndoc changes

b1458d0

koheiw added 8 commits April 16, 2023 19:44

Deep copy when [] is used without i

8035fac

Add tests

1545c66

Use join_strings()

ce64837

tidy up tests

30d8381

Pass tpes by reference

32192d3

Add benchmarking on ngrams

52be503

Update and enable tests

5a10d52

Add tests for conbining

760f07b

koheiw mentioned this pull request Apr 20, 2023

Add a performance vignette #2248

Closed

kbenoit reviewed Apr 20, 2023

Kenneth Benoit added 3 commits April 21, 2023 07:53

Restore test for #389

b199bc1

Defunct dfm character and corpus methods

7fdea44

Also adds an argument to check_class() so that we can list methods not to described as in the valid set. Default is NULL so that existing calls to check_class() are unaffected.

Update NEWS for changes in this PR

66e0b41

We should be doing this for every change, every PR as we move into v4.

kbenoit approved these changes Apr 21, 2023

kbenoit mentioned this pull request Apr 21, 2023

Test bootstrap_dfm() removals from #2251 #2260

Closed

Merge branch 'master' into update-dfm

a8c2ea2

koheiw merged commit 892acc2 into master Apr 21, 2023
6 checks passed

koheiw deleted the update-dfm branch April 21, 2023 13:22
