Dev collocations #753
Disable `ordered` for now
…en [min_size, max_size]
Codecov Report
@@ Coverage Diff @@
## master #753 +/- ##
==========================================
+ Coverage 79.13% 79.31% +0.18%
==========================================
Files 80 81 +1
Lines 5810 5822 +12
==========================================
+ Hits 4598 4618 +20
+ Misses 1212 1204 -8
Codecov Report
@@ Coverage Diff @@
## master #753 +/- ##
=========================================
+ Coverage 79.13% 79.3% +0.17%
=========================================
Files 80 81 +1
Lines 5810 5824 +14
=========================================
+ Hits 4598 4619 +21
+ Misses 1212 1205 -7
This looks very good, thanks. The p-values on the bigrams (`max_size = 2`) are correct and I think the nesting behaviour is more correct. Overall the results are very similar to the previous code, but this is because of the strong influence of frequency on the results, regardless of the way the statistics are computed.
Suggest you do the following:
- change `sequences()` to `sequences_old()`, and put it in its own file
- change the `sequences2` to `sequences()` (note: this will mean `textstat_collocations(x, method = "bj")` will call the new `sequences()` rather than the old)
- update the tests as appropriate
- we ask @koheiw to review before merging into `master`
From there we can use the bit-wise masking approach in C++ to reimplement the other association measures.
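As a rough illustration of what a bit-wise masking formulation could look like, here is a minimal sketch of the interaction statistic for an n-word candidate. It assumes that `counts[mask]` already holds the number of n-word windows whose match pattern against the candidate equals `mask` (bit i set means position i matches), and that a smoothing constant of 0.5 is added to every cell. The names `LambdaResult` and `interaction_lambda` are hypothetical and are not taken from the package source.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical result type: interaction parameter, its standard error, and z.
struct LambdaResult {
    double lambda;
    double sigma;
    double z;
};

// counts must have 2^n entries, indexed by match-pattern bitmask.
// Assumption: each cell is smoothed by adding `smoothing` before taking logs.
LambdaResult interaction_lambda(const std::vector<double>& counts,
                                unsigned n,
                                double smoothing = 0.5) {
    double lambda = 0.0;
    double variance = 0.0;
    for (std::size_t mask = 0; mask < counts.size(); ++mask) {
        // Count the set bits, i.e. how many positions match in this pattern.
        unsigned bits = 0;
        for (std::size_t m = mask; m != 0; m >>= 1) {
            bits += static_cast<unsigned>(m & 1u);
        }
        const double cell = counts[mask] + smoothing;
        // The sign alternates with the number of non-matching positions.
        const double sign = ((n - bits) % 2 == 0) ? 1.0 : -1.0;
        lambda += sign * std::log(cell);
        variance += 1.0 / cell;
    }
    const double sigma = std::sqrt(variance);
    return {lambda, sigma, lambda / sigma};
}
```

For `n = 2` this reduces to the log odds ratio of the 2x2 table, so the bigram case gives a quick sanity check against the existing p-values.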
Ultimately too I'd like to figure out a parsimonious recommended workflow for selecting tokens using `tokens_select(x, padding = TRUE)` before sending this to `textstat_collocations()`.
Ultimately we will make `sequences()` internal-only, and just call it through `textstat_collocations()`.
@HaiyanLW I'm happy for you to merge this. Then you can make a new branch and work on implementing the LR, Chi2, Dice, and PMI measures using this method.
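For reference while porting, the four measures can be sketched for the bigram case from a 2x2 contingency table. This uses the textbook definitions with natural logarithms and no smoothing, which is an assumption: the package may choose different conventions. The names `Table2x2`, `Assoc`, and `bigram_association` are hypothetical.

```cpp
#include <cmath>

// Observed counts: n11 = both words, n12 = first word only,
// n21 = second word only, n22 = neither.
struct Table2x2 {
    double n11, n12, n21, n22;
};

struct Assoc {
    double dice;
    double pmi;
    double chi2;
    double lr;
};

// One term of the likelihood-ratio sum; an empty cell contributes zero.
static double lr_term(double observed, double expected) {
    return observed > 0.0 ? observed * std::log(observed / expected) : 0.0;
}

// Assumes all four marginal totals are positive.
Assoc bigram_association(const Table2x2& t) {
    const double r1 = t.n11 + t.n12;  // row totals
    const double r2 = t.n21 + t.n22;
    const double c1 = t.n11 + t.n21;  // column totals
    const double c2 = t.n12 + t.n22;
    const double N = r1 + r2;

    Assoc a;
    a.dice = 2.0 * t.n11 / (r1 + c1);
    a.pmi = std::log(t.n11 * N / (r1 * c1));
    const double det = t.n11 * t.n22 - t.n12 * t.n21;
    a.chi2 = N * det * det / (r1 * r2 * c1 * c2);
    a.lr = 2.0 * (lr_term(t.n11, r1 * c1 / N) + lr_term(t.n12, r1 * c2 / N) +
                  lr_term(t.n21, r2 * c1 / N) + lr_term(t.n22, r2 * c2 / N));
    return a;
}
```

Extending these beyond bigrams would need a decision on which expected counts to use for longer sequences, which this sketch does not attempt.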
Merge dev_collocations to master with corrected sequences()
Modified the estimation functions in `sequence.R` as discussed with @kbenoit (new file is `sequences2.R`).
Made changes in some arguments.