Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dev collocations #753

Merged
merged 39 commits into from Jun 1, 2017
Merged

Dev collocations #753

merged 39 commits into from Jun 1, 2017

Conversation

HaiyanLW
Copy link
Collaborator

Modified the estimation functions in sequence.R as discussed with @kbenoit (new file is sequences2.R)
Made changes in some arguments.

@HaiyanLW HaiyanLW requested a review from kbenoit May 25, 2017 17:17
@codecov
Copy link

codecov bot commented May 25, 2017

Codecov Report

Merging #753 into master will increase coverage by 0.18%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #753      +/-   ##
==========================================
+ Coverage   79.13%   79.31%   +0.18%     
==========================================
  Files          80       81       +1     
  Lines        5810     5822      +12     
==========================================
+ Hits         4598     4618      +20     
+ Misses       1212     1204       -8

@codecov
Copy link

codecov bot commented May 25, 2017

Codecov Report

Merging #753 into master will increase coverage by 0.17%.
The diff coverage is 95.45%.

@@            Coverage Diff            @@
##           master    #753      +/-   ##
=========================================
+ Coverage   79.13%   79.3%   +0.17%     
=========================================
  Files          80      81       +1     
  Lines        5810    5824      +14     
=========================================
+ Hits         4598    4619      +21     
+ Misses       1212    1205       -7

Copy link
Collaborator

@kbenoit kbenoit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks very good, thanks. The p-values on the bigrams (max_size = 2) are correct and I think the nesting behaviour is more correct. Overall the results are very similar to the previous code, but this is because of the strong influence of frequency on the results, regardless of the way the statistics are computed.

Suggest you do the following:

  • change sequences() to sequences_old(), and put it in its own file
  • change the sequences2 to sequences() (note: this will mean textstat_collocations(x, method = "bj") will call the new sequences() rather than the old)
  • update the tests as appropriate
  • we ask @koheiw to review before merging into master

From there we can use the bit-wise masking approach in C++ to reimplement the other association measures.

Ultimately too I'd like to figure out a parsimonious recommended workflow for selecting tokens using tokens_select(x, padding = TRUE) before sending this to textstat_collocations().

Ultimately we will make sequences() internal-only, and just call it through textstat_collocations().

@kbenoit
Copy link
Collaborator

kbenoit commented Jun 1, 2017

@HaiyanLW I'm happy for you to merge this into master. Please update NEWS.md with a note of the changes, and increment the version by .01 first.

Then you can make a new branch and work on implementing the LR, Chi2, Dice, and PMI using this method.

@HaiyanLW
Copy link
Collaborator Author

HaiyanLW commented Jun 1, 2017

Merge dev_collocations to master with corrected sequences()

@HaiyanLW HaiyanLW merged commit ca79e4c into master Jun 1, 2017
@HaiyanLW HaiyanLW deleted the dev-collocations branch June 1, 2017 15:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants