Dev collocations #753

HaiyanLW · 2017-05-25T17:17:13Z

Modified the estimation functions in sequences.R as discussed with @kbenoit (the new file is sequences2.R).
Made changes to some arguments.

Disable `ordered` for now

codecov · 2017-05-25T17:17:16Z

Codecov Report

Merging #753 into master will increase coverage by 0.18%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #753      +/-   ##
==========================================
+ Coverage   79.13%   79.31%   +0.18%     
==========================================
  Files          80       81       +1     
  Lines        5810     5822      +12     
==========================================
+ Hits         4598     4618      +20     
+ Misses       1212     1204       -8

codecov · 2017-05-25T17:17:16Z

Codecov Report

Merging #753 into master will increase coverage by 0.17%.
The diff coverage is 95.45%.

@@            Coverage Diff            @@
##           master    #753      +/-   ##
=========================================
+ Coverage   79.13%   79.3%   +0.17%     
=========================================
  Files          80      81       +1     
  Lines        5810    5824      +14     
=========================================
+ Hits         4598    4619      +21     
+ Misses       1212    1205       -7

kbenoit

This looks very good, thanks. The p-values on the bigrams (max_size = 2) are correct and I think the nesting behaviour is more correct. Overall the results are very similar to the previous code, but this is because of the strong influence of frequency on the results, regardless of the way the statistics are computed.

Suggest you do the following:

- Rename `sequences()` to `sequences_old()`, and put it in its own file.
- Rename `sequences2()` to `sequences()` (note: this means `textstat_collocations(x, method = "bj")` will call the new `sequences()` rather than the old one).
- Update the tests as appropriate.
- We ask @koheiw to review before merging into master.

From there we can use the bit-wise masking approach in C++ to reimplement the other association measures.
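To make the bit-wise masking idea concrete, here is a minimal Python sketch (not the package's C++ code). It assumes that each observed n-gram is compared position by position with a target n-gram and that the match pattern is encoded as an integer bitmask, so one array of length 2^n holds the counts for every subtuple pattern. The names `match_mask` and `counts_by_mask` are hypothetical.

```python
def match_mask(target, observed):
    """Return an integer whose bit i is set when observed[i] == target[i]."""
    mask = 0
    for i, (t, o) in enumerate(zip(target, observed)):
        if t == o:
            mask |= 1 << i
    return mask


def counts_by_mask(target, ngrams):
    """Count observed n-grams by their match pattern against `target`.

    The result has 2**n entries; index 0 means no position matched and
    index 2**n - 1 means every position matched (the n-gram itself).
    """
    n = len(target)
    counts = [0] * (1 << n)
    for ng in ngrams:
        if len(ng) == n:
            counts[match_mask(target, ng)] += 1
    return counts


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Toy data, invented for illustration only.
tokens = "new york is not new jersey but new york city".split()
counts = counts_by_mask(("new", "york"), ngrams(tokens, 2))
print(counts)  # index 3 = both words match, index 1 = only the first word matches
```

For a bigram the four entries form the cells of a 2x2 contingency table, which is what the other association measures need as input.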

I'd also like to figure out a parsimonious recommended workflow for selecting tokens using tokens_select(x, padding = TRUE) before sending the result to textstat_collocations().
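The reason padding matters for collocations can be shown with a small Python analogue (this is not quanteda code). It assumes removed tokens are replaced by an empty-string placeholder, and that any pair touching a placeholder is skipped. The helpers `remove_tokens` and `adjacent_pairs` are hypothetical names used only here.

```python
PAD = ""  # placeholder left behind where a token was removed (assumption for this sketch)


def remove_tokens(tokens, stopwords, padding=False):
    """Drop stopwords; with padding=True, leave PAD in their place."""
    out = []
    for tok in tokens:
        if tok in stopwords:
            if padding:
                out.append(PAD)
        else:
            out.append(tok)
    return out


def adjacent_pairs(tokens):
    """Adjacent token pairs, skipping any pair that touches a pad."""
    return [(a, b) for a, b in zip(tokens, tokens[1:]) if a != PAD and b != PAD]


tokens = ["secretary", "of", "state", "visits", "the", "capital"]
stop = {"of", "the"}

# Without padding, "secretary state" and "visits capital" look adjacent.
no_pad = adjacent_pairs(remove_tokens(tokens, stop))

# With padding, only pairs that were adjacent in the original text remain.
with_pad = adjacent_pairs(remove_tokens(tokens, stop, padding=True))

print(no_pad)
print(with_pad)
```

Keeping the placeholders means removed words cannot create spurious collocation candidates between words that never occurred next to each other.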

Ultimately we will make sequences() internal-only, and just call it through textstat_collocations().

kbenoit · 2017-06-01T11:09:09Z

@HaiyanLW I'm happy for you to merge this into master. Please update NEWS.md with a note of the changes, and increment the version by .01 first.

Then you can make a new branch and work on implementing the LR, Chi2, Dice, and PMI using this method.
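For reference, the four measures named above can be sketched for a bigram's 2x2 contingency table. This is a Python illustration using common textbook definitions (likelihood-ratio G², Pearson chi-squared without continuity correction, the Dice coefficient, and pointwise mutual information with the natural log). The exact variants the package implements may differ, and the function name `association_measures` is hypothetical.

```python
import math


def association_measures(n11, n12, n21, n22):
    """Association measures for a bigram (w1, w2) from its 2x2 table.

    n11: w1 followed by w2        n12: w1 followed by another word
    n21: another word before w2   n22: neither word in its position
    Assumes n11 > 0 and that no row or column total is zero.
    """
    n = n11 + n12 + n21 + n22
    r1, r2 = n11 + n12, n21 + n22  # row totals
    c1, c2 = n11 + n21, n12 + n22  # column totals
    observed = [n11, n12, n21, n22]
    expected = [r1 * c1 / n, r1 * c2 / n, r2 * c1 / n, r2 * c2 / n]

    # G^2 = 2 * sum(O * log(O / E)); empty cells contribute zero.
    lr = 2.0 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)
    chi2 = n * (n11 * n22 - n12 * n21) ** 2 / (r1 * r2 * c1 * c2)
    dice = 2.0 * n11 / (r1 + c1)
    pmi = math.log(n11 * n / (r1 * c1))
    return {"lr": lr, "chi2": chi2, "dice": dice, "pmi": pmi}


# Toy counts, invented for illustration only.
m = association_measures(2, 1, 0, 6)
print(m)
```

All four statistics depend only on the four cell counts, so a single pass that fills the table is enough to compute any of them.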

HaiyanLW · 2017-06-01T15:19:26Z

Merge dev_collocations to master with corrected sequences()

HaiyanLW added 28 commits May 25, 2017 12:07

Remove valuetype from sequences.R

ff5df4c

remove features

7272c05

Remove case_insensitive

2860799

Remove features

2075ed3

change cpp function name

fe5f27d

change functions name

f23966b

Change the size of counts_bit for non-ordered

0d88fe0

Count cn in B-J algorithm

1bd105b

Added unigram subtuple method for calculating lambda and sigma

8fb0424

Disable `ordered` for now

Add all subtuples method

e476094

Fix some errors

0341146

Add dynamic link to quanteda_qatd_cpp_sequences2

5857a60

correct function name

4762f11

update examples

8669a81

Add sequences2

ddd1d0b

tidy up

985b79d

merge master

3929067

update init.c

617b7c9

Remove argument ordered from sequence2

0b723c8

Add argument min_size to allow returning collocations of size between [min_size, max_size]

46f0436

Add test on min_size for sequence2

a7a19a8

update markdown file according to argument change

fed5e78

typo

8298fb5

Add tests for sequences2

3ed3461

Rename the returned class of sequences2

8a4b8a4

Add test for as.tokens.sequences

05d2ac8

Add test for is.sequences

90e632a

Update NAMESPACE etc.

d276ed9

HaiyanLW requested a review from kbenoit May 25, 2017 17:17

HaiyanLW self-assigned this May 25, 2017

HaiyanLW added bug design labels May 25, 2017

kbenoit added 5 commits May 31, 2017 07:20

Comment out redundant supporting code for sequences objects

1859c98

Update man page

25ebcb0

Add Haiyan to author list

b9e1e0c

Update Rcpp

f55babb

Merge branch 'master' into dev-collocations

e0ecb15

kbenoit requested changes May 31, 2017

HaiyanLW added 4 commits May 31, 2017 18:13

Rename sequences.R etc. to sequences_old.R

3475074

Rename sequences2.R etc. to sequences.R

1d7ceae

Update textstat_collocations.R according to the change of sequences.R

d3f81dd

Updates configuration functions

eecc204

kbenoit approved these changes Jun 1, 2017

HaiyanLW added 2 commits June 1, 2017 15:33

Add bj_uni and bj_all to methods

1cb9284

Updates NEWS.md

ca79e4c

HaiyanLW merged commit ca79e4c into master Jun 1, 2017

HaiyanLW deleted the dev-collocations branch June 1, 2017 15:25
