
check calculations of textstat_collocations() #803

Closed · kbenoit opened this issue Jun 15, 2017 · 16 comments

kbenoit commented Jun 15, 2017

Here's a stab at a log-linear model equivalency. The focus is on "capital gains tax".

require(quanteda)

txt <- "A gains capital B C capital gains A B capital C capital gains tax gains tax gains B gains C capital gains tax"
toks <- as.character(tokens(txt, ngrams = 3, concatenator = " "))

require(data.table)
toks_df <- data.table(do.call(rbind, strsplit(toks, " ")), stringsAsFactors = FALSE)
names(toks_df) <- paste0("word", 1:3)

# replace non-target words with "other"
targets <- c("capital", "gains", "tax")
toks_df[word1 != targets[1], word1 := "other"]
toks_df[word2 != targets[2], word2 := "other"]
toks_df[word3 != targets[3], word3 := "other"]
toks_df[, n := 1]
toks_df_n <- toks_df[, list(n = sum(n)), by = c("word1", "word2", "word3")]
toks_df_n
     # word1 word2 word3  n
# 1:   other gains other  3
# 2:   other other other 12
# 3: capital other other  2
# 4: capital gains other  1
# 5: capital gains   tax  2
# 6:   other gains   tax  1


# MASS::loglm(n ~ word1:word2:word3, data = toks_df)

stats::loglin(table(toks_df[, 1:3]), margin = 1:3)
# 2 iterations: deviation 1.776357e-15 
# $lrt
# [1] 10.91587
# 
# $pearson
# [1] 16.93125
# 
# $df
# [1] 4
# 
# $margin
# [1] "word1" NA      NA  

textstat_collocations(tokens(txt), method = "lr", min_size = 3, max_size = 3)[1:3, ]
#         collocation length count       G2
# 1   C capital gains      3     3 20.17961
# 2   gains tax gains      3     2 11.18707
# 3 capital gains tax      3     2 10.91587

textstat_collocations(tokens(txt), method = "chi2", min_size = 3, max_size = 3)[1:3, ]
#         collocation length count       X2
# 1   C capital gains      3     3 43.95563
# 2   gains tax gains      3     2 23.64158
# 3 capital gains tax      3     2 16.93125

HaiyanLW commented:

For the Dice coefficient, McInnes uses 2 as the coefficient for trigrams and Petrovic uses 3 (i.e. n for an n-gram collocation). Currently we have adopted Petrovic's approach in sequences.R and McInnes's in collocations2.R.
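
For concreteness, here is a minimal sketch (not the package code) of the two variants, assuming the generalized Dice form in which the denominator is the sum of the component word counts and only the numerator coefficient differs (2 vs. n). In the toy text above, "capital gains tax" occurs 2 times and "capital", "gains", "tax" occur 5, 7, and 3 times, so the Petrovic version gives 0.40, consistent with the dice value that the updated sequences() reports for this trigram later in the thread (with smoothing = 0).

dice_mcinnes  <- function(n_gram, n_words) 2 * n_gram / sum(n_words)                # coefficient 2
dice_petrovic <- function(n_gram, n_words) length(n_words) * n_gram / sum(n_words)  # coefficient n

dice_mcinnes(2, c(5, 7, 3))   # 0.2666667
dice_petrovic(2, c(5, 7, 3))  # 0.4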

kbenoit commented Jun 15, 2017

OK - it turns out that we were computing the model that included the lower-order interactions, which means that the likelihood ratios we are computing for ngrams with n > 2 are for the wrong model. What we should be computing (for trigrams) is the likelihood of seeing the third word after having seen the first two words in order.

In contrast to the above, this has the following equivalent formulations in log-linear models:

MASS::loglm(~ word1*word2 + word2*word3 + word1*word3, xtabs(~ ., toks_df[, 1:3]))
# Call:
#     MASS::loglm(formula = ~word1 * word2 + word2 * word3 + word1 * 
#                     word3, data = xtabs(~., toks_df[, 1:3]))
# 
# Statistics:
#     X^2 df  P(> X^2)
# Likelihood Ratio 0.0006825132  1 0.9791577
# Pearson                   NaN  1       NaN

stats::loglin(table(toks_df[, 1:3]), margin = list(1:2, 2:3, c(1,3)))
# 4 iterations: deviation 0.05924525 
# $lrt
# [1] 0.0006825132
# 
# $pearson
# [1] NaN
# 
# $df
# [1] 1
# 
# $margin
# $margin[[1]]
# [1] "word1" "word2"
# 
# $margin[[2]]
# [1] "word2" "word3"
# 
# $margin[[3]]
# [1] "word1" "word3"

As a binary logistic regression, what we are computing is the coefficient for the interaction term when the data are expressed this way:

summary(glm(I(word3=="tax") ~ I(word1=="capital") * I(word2=="gains"), 
            family = "binomial", data = toks_df_n))

# Call:
#     glm(formula = I(word3 == "tax") ~ I(word1 == "capital") * I(word2 == 
#                                                                     "gains"), family = "binomial", data = toks_df_n)
# 
# Deviance Residuals: 
#        1         2         3         4         5         6  
# -1.17741  -0.00008  -0.00008  -1.17741   1.17741   1.17741  
# 
# Coefficients:
#                                                     Estimate Std. Error z value Pr(>|z|)
# (Intercept)                                       -1.957e+01  1.075e+04  -0.002    0.999
# I(word1 == "capital")TRUE                         -9.633e-09  1.521e+04   0.000    1.000
# I(word2 == "gains")TRUE                            1.957e+01  1.075e+04   0.002    0.999
# I(word1 == "capital")TRUE:I(word2 == "gains")TRUE  9.633e-09  1.521e+04   0.000    1.000
# 
# (Dispersion parameter for binomial family taken to be 1)
# 
#     Null deviance: 7.6382  on 5  degrees of freedom
# Residual deviance: 5.5452  on 2  degrees of freedom
# AIC: 13.545
# 
# Number of Fisher Scoring iterations: 18

Since Blaheta and Johnson smooth their counts, I've tried to approximate their model:

# add zeroes for the two unobserved categories
toks_df_n_smoothed <- rbind(toks_df_n, 
                            data.table(word1 = c("other", "other"),
                                       word2 = c("capital", "gains"),
                                       word3 = c("tax", "tax"),
                                       n = 0))
# smooth the frequencies as per B&J
toks_df_n_smoothed$n <- toks_df_n_smoothed$n + 0.5
setorder(toks_df_n_smoothed, word3, word2, word1)
toks_df_n_smoothed
#      word1   word2 word3    n
# 1: capital   gains other  1.5
# 2:   other   gains other  3.5
# 3: capital   other other  2.5
# 4:   other   other other 12.5
# 5:   other capital   tax  0.5
# 6: capital   gains   tax  2.5
# 7:   other   gains   tax  1.5
# 8:   other   gains   tax  0.5

summary(glm(I(word3=="tax") ~ I(word1=="capital") * I(word2=="gains"), 
            family = "binomial", data = toks_df_n_smoothed))

# Call:
#     glm(formula = I(word3 == "tax") ~ I(word1 == "capital") * I(word2 == 
#                                                                     "gains"), family = "binomial", data = toks_df_n_smoothed)
# 
# Deviance Residuals: 
#        1         2         3         4         5         6         7         8  
# -1.17741  -1.48230  -0.00022  -1.17741   1.17741   1.17741   0.90052   0.90052  
# 
# Coefficients:
#                                                     Estimate Std. Error z value Pr(>|z|)
# (Intercept)                                        2.064e-15  1.414e+00   0.000    1.000
# I(word1 == "capital")TRUE                         -1.757e+01  3.956e+03  -0.004    0.996
# I(word2 == "gains")TRUE                            6.931e-01  1.871e+00   0.371    0.711
# I(word1 == "capital")TRUE:I(word2 == "gains")TRUE  1.687e+01  3.956e+03   0.004    0.997
# 
# (Dispersion parameter for binomial family taken to be 1)
# 
#     Null deviance: 11.0904  on 7  degrees of freedom
# Residual deviance:  9.3643  on 4  degrees of freedom
# AIC: 17.364
# 
# Number of Fisher Scoring iterations: 16

kbenoit commented Jun 15, 2017

From sequences():

seqs <- sequences(tokens(txt), size = 3, min_count = 1)
seqs[seqs$collocation == "capital gains tax", ]
#         collocation count length   lambda    sigma      dice      pmi  logratio     chi2        z           p
# 6 capital gains tax     2      3 4.491842 1.845458 0.3571429 1.932499 0.9867511 16.09538 2.433999 0.007466527

HaiyanLW commented Jun 19, 2017

According to the 2x2 table in Slack last Friday:

         C     ~C
     +-----+----+
 (AB)|     |    | N_(AB)
     +-----+----+
~(AB)|     |    | N_(~(AB))
     +-----+----+ 
       N_C  N_~C  N

My understanding is as follows:

tt <- toks_df[, cg := paste(word1, word2)]
tt <- tt[cg != "capital gains", cg := "other"]
toks_df_2x2 <- table(tt[, c(3, 5)])
toks_df_2x2
#        cg
# word3   capital gains other
#   other             1    17
#   tax               2     1
stats::loglin(toks_df_2x2, margin=1:2)
#2 iterations: deviation 0 
#$lrt
#[1] 5.681671
#
#$pearson
#[1] 7.842593
#
#$df
#[1] 1
#
#$margin
#[1] "word3" NA 

and from the updated sequences() now,

sequences(tokens(txt), size = 3, smoothing = 0)

HaiyanLW commented Jun 19, 2017

#         collocation count length lambda sigma      dice      pmi logratio     chi2   z
#5    gains tax gains     2      3    NaN   Inf 0.4285714 1.252763 5.570502 5.526316 NaN
#13   C capital gains     3      3    NaN   Inf 0.6428571 1.252763 8.907119 8.750000 NaN
#15 capital gains tax     2      3    NaN   Inf 0.4000000 1.540445 5.681671 7.842593 NaN

kbenoit commented Jun 19, 2017

That's it. There are faster algebraic methods to compute all of these, including lambda and sigma, without the iterative approach in stats::loglin(). See my code in textstat_keyness(), for instance, for the quick computation of chi^2 and G^2 for 2x2 tables. (They might be using the Yates correction of adding 0.5 to each cell...)

It's all so much easier with 2x2 tables!
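
For reference, here is a quick sketch of that direct algebraic computation (not the textstat_keyness() code itself, just the same idea) applied to the 2x2 table above; it reproduces the loglin() figures without iteration.

O <- matrix(c(1, 2, 17, 1), nrow = 2)        # observed counts from the 2x2 table above
E <- outer(rowSums(O), colSums(O)) / sum(O)  # expected counts under independence
X2 <- sum((O - E)^2 / E)                     # Pearson chi-squared
G2 <- 2 * sum(O * log(O / E))                # likelihood ratio statistic
c(G2 = G2, X2 = X2)
#       G2       X2 
# 5.681671 7.842593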

HaiyanLW commented:

Do you think we should move the calculation of chi^2 and G^2 (and maybe pmi) out of the C++ code? I mean, these are computed in parallel now. And if we move them to R, we still need to calculate them separately for 2-, 3-, ... n-grams.

kbenoit commented Jun 19, 2017

If the calculation is vectorised in R, it will be as fast as in C++, and easier to maintain within the package, so I'm in favour. This would mean we are just using the C++ code to form the matches and return the tables.
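
As a rough illustration only (a sketch with a hypothetical column layout, not the package implementation), suppose the C++ side returned a data.frame with one row per candidate collocation and the observed 2x2 counts in columns n11, n10, n01, n00; the statistics can then be computed for all candidates at once:

# vectorised Pearson chi-squared over all candidate collocations at once
chi2_vectorised <- function(d) {
    N   <- d$n11 + d$n10 + d$n01 + d$n00
    e11 <- (d$n11 + d$n10) * (d$n11 + d$n01) / N   # expected counts under independence
    e10 <- (d$n11 + d$n10) * (d$n10 + d$n00) / N
    e01 <- (d$n01 + d$n00) * (d$n11 + d$n01) / N
    e00 <- (d$n01 + d$n00) * (d$n10 + d$n00) / N
    (d$n11 - e11)^2 / e11 + (d$n10 - e10)^2 / e10 +
        (d$n01 - e01)^2 / e01 + (d$n00 - e00)^2 / e00
}
# e.g. the "capital gains tax" table from above:
chi2_vectorised(data.frame(n11 = 2, n10 = 1, n01 = 1, n00 = 17))  # 7.842593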

HaiyanLW commented:

Now chi^2, G^2, pmi, and Dice are returned along with lambda and lambda1, and are tested against loglin() for trigrams. Shall we leave it as such until we decide how to update textstat_collocations.R?

kbenoit commented Jun 19, 2017

Fine with me.

HaiyanLW commented:

:) Seems I need a break from this branch.

HaiyanLW commented:

Do you want me to do a PR, or leave it for now?

kbenoit commented Jun 19, 2017

If the tests are there to verify the calculations, we can use them to test other implementations, such as the one that @koheiw is working on. So if it's complete, go ahead and issue a PR. Merge the latest master changes into the branch first, if you have not yet done that.

kbenoit commented Aug 2, 2017

@HaiyanLW here is some code that computes the expected values, in a slow R way. The figures are taken from Jouni's example, and I have not implemented smoothing, but I think that will already be part of the "observed" values that I have computed, so it should work as is. If you can integrate this temporarily into textstat_collocations() on the collocations_verify branch, we can use it to verify the figures.

df2 <- data.frame(collocation = "united states", n00 = 120722, n01 = 174, n10 = 42, n11 = 153)
df3 <- data.frame(
    collocation = c("bill of rights", "men and women"),
    n000 = c(102006, 10539),
    n001 = c(124, 5),
    n010 = c(6947, 3630),
    n011 = c(5, 0),
    n100 = c(7, 76),
    n101 = c(0, 0),
    n110 = c(0, 7),
    n111 = c(3, 25)
)

# function to get lower-order interactions for k-grams
marginalfun <- function(k) {
    utils::combn(k, k-1, simplify = FALSE)
}

get_expected_values <- function(df, size) {
    # get the columns of the data.frame that are the n* counts
    counts <- df[, grep("^n\\d+", names(df))]
    # sort the counts alphabetically
    counts <- df[, sort(names(counts))]
    
    expected_counts_list <- apply(counts, 1, function(x) {
        countsnum <- as.numeric(x)
        names(countsnum) <- names(counts)
        array_dimnames <- c(rep(list(c("0", "1")), size))
        names(array_dimnames) <- paste0("W", size:1)
        counts_table <- array(countsnum, dim = rep(2, size), dimnames = array_dimnames)
        counts_expected <- stats::loglin(counts_table,
                                         margin =  marginalfun(size),
                                         fit = TRUE, print = FALSE)$fit
        counts_expected <- as.numeric(counts_expected)
        names(counts_expected) <- gsub("^n", "e", names(counts))  # label expected counts e000, e001, ...
        counts_expected
    })
    
    data.frame(t(expected_counts_list))
}

get_expected_values(df3, size = 3)
#        e000        e001     e010     e011      e100     e101      e110       e111
# 1 102008.17 121.8345128 6944.835 7.165488  4.836326 2.163674  2.163756  0.8362441
# 2  10543.53   0.4733728 3625.473 4.526712 71.467282 4.532718 11.578213 20.4217869

cbind(df2, get_expected_values(df2, size = 2))
#     collocation    n00 n01 n10 n11      e00      e01      e10       e11
# 1 united states 120722 174  42 153 120569.5 326.4734 194.4734 0.5265874

HaiyanLW commented Aug 2, 2017

For now, size must be specified as a scalar and only 2 or 3 are handled; I haven't changed G2 etc. for 3-, 4-, 5-grams:

> seqs <- textstat_collocations(toks, size = 3, show_counts = TRUE)
> head(seqs)
                 collocation count length     lambda    sigma       dice     gensim      pmi       G2
1     United States Congress     2      3  -2.114393 2.123664 0.01366120 0.01275545 2.994408 10.31363
2 President Carter President     2      3  -3.435144 2.565448 0.04464286 2.33411598 7.212004 34.35112
3    Chief Justice President     2      3  -3.459484 2.190133 0.06000000 1.50927800 5.671559 23.92940
4   President Bush President     2      3  -5.686073 2.191910 0.04360465 1.16705799 6.364706 27.93465
5   President Vice President     2      3  -6.637474 2.163159 0.04213483 0.66689028 7.212004 34.35112
6        Vice President Bush     2      3 -11.454414 2.590716 0.06818182 5.41773356 7.935923 36.11702
        chi2       LFMD     n000 n001  n010  n011  n100 n101 n110 n111     e000 e001  e010  e011  e100
1   45.14942 -18.261501 124848.5 11.5 150.5 131.5 114.5  0.5  0.5  2.5 124848.2 11.8 150.8 131.2 114.8
2 3386.17964 -10.745883 125104.5 76.5   0.5   0.5  68.5  6.5  0.5  2.5 125104.2 76.8   0.8   0.2  68.8
3  721.75044 -11.374900 125152.5 13.5   4.5  11.5  74.5  0.5  0.5  2.5 125152.1 13.9   4.9  11.1  74.8
4 1448.41065 -11.745883 125104.5 72.5   0.5   4.5  68.5  6.5  0.5  2.5 125104.0 73.0   1.0   4.0  69.0
5 3386.17964 -12.553238 125104.5 76.5   0.5   0.5  58.5  6.5 10.5  2.5 125104.0 77.0   1.0   0.0  59.0
6 6986.12739  -9.531069 125173.5  0.5  64.5  13.5   0.5  0.5  4.5  2.5 125173.0  1.0  65.0  13.0   1.0
  e101 e110 e111          z         p
1  0.2  0.2  2.8 -0.9956346 0.8402862
2  6.2  0.1  2.9 -1.3390036 0.9097153
3  0.2  0.1  2.9 -1.5795770 0.9428981
4  6.0  0.0  3.0 -2.5941184 0.9952583
5  6.0 10.0  3.0 -3.0684166 0.9989240
6  0.0  4.0  3.0 -4.4213319 0.9999951

kbenoit commented Aug 2, 2017

OK, when you push it to the branch I will run some tests. Please merge master into your branch first.
