
check calculations of textstat_collocations() #803

Closed · kbenoit opened this issue Jun 15, 2017 · 16 comments

kbenoit commented Jun 15, 2017

Here's a stab at a log-linear model equivalency. The focus is on "capital gains tax".

require(quanteda)

txt <- "A gains capital B C capital gains A B capital C capital gains tax gains tax gains B gains C capital gains tax"
toks <- as.character(tokens(txt, ngrams = 3, concatenator = " "))

require(data.table)
toks_df <- data.table(do.call(rbind, strsplit(toks, " ")), stringsAsFactors = FALSE)
names(toks_df) <- paste0("word", 1:3)

# replace non-target words with "other"
targets <- c("capital", "gains", "tax")
toks_df[word1 != targets[1], word1 := "other"]
toks_df[word2 != targets[2], word2 := "other"]
toks_df[word3 != targets[3], word3 := "other"]
toks_df[, n := 1]
toks_df_n <- toks_df[, list(n = sum(n)), by = c("word1", "word2", "word3")]
toks_df_n
     # word1 word2 word3  n
# 1:   other gains other  3
# 2:   other other other 12
# 3: capital other other  2
# 4: capital gains other  1
# 5: capital gains   tax  2
# 6:   other gains   tax  1


# MASS::loglm(n ~ word1:word2:word3, data = toks_df)

stats::loglin(table(toks_df[, 1:3]), margin = 1:3)
# 2 iterations: deviation 1.776357e-15 
# $lrt
# [1] 10.91587
# 
# $pearson
# [1] 16.93125
# 
# $df
# [1] 4
# 
# $margin
# [1] "word1" NA      NA  

textstat_collocations(tokens(txt), method = "lr", min_size = 3, max_size = 3)[1:3, ]
#         collocation length count       G2
# 1   C capital gains      3     3 20.17961
# 2   gains tax gains      3     2 11.18707
# 3 capital gains tax      3     2 10.91587

textstat_collocations(tokens(txt), method = "chi2", min_size = 3, max_size = 3)[1:3, ]
#         collocation length count       X2
# 1   C capital gains      3     3 43.95563
# 2   gains tax gains      3     2 23.64158
# 3 capital gains tax      3     2 16.93125

HaiyanLW commented:

For the Dice coefficient, McInnes uses 2 as the coefficient for trigrams and Petrovic uses 3 (i.e. n for an n-gram collocation). Currently we have adopted Petrovic's approach in sequences.R and McInnes's in collocations2.R.
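
For concreteness, here is a minimal sketch (not the package code) of the two variants, assuming the generalized Dice form in which the denominator is the sum of the component word counts and only the numerator coefficient differs (2 vs. n). In the toy text above, "capital gains tax" occurs 2 times and "capital", "gains", "tax" occur 5, 7, and 3 times, so the Petrovic version gives 0.40, consistent with the dice value that the updated sequences() reports for this trigram later in the thread (with smoothing = 0).

dice_mcinnes  <- function(n_gram, n_words) 2 * n_gram / sum(n_words)                # coefficient 2
dice_petrovic <- function(n_gram, n_words) length(n_words) * n_gram / sum(n_words)  # coefficient n

dice_mcinnes(2, c(5, 7, 3))   # 0.2666667
dice_petrovic(2, c(5, 7, 3))  # 0.4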

kbenoit commented Jun 15, 2017

OK - it turns out that we were computing the model that included the lower-order interactions, which means that the likelihood ratios we are computing for ngrams with n > 2 are for the wrong model. What we should be computing (for trigrams) is the likelihood of seeing the third word after having seen the first two words in order.

In contrast to the above, this has the following equivalent formulations in log-linear models:

MASS::loglm(~ word1*word2 + word2*word3 + word1*word3, xtabs(~ ., toks_df[, 1:3]))
# Call:
#     MASS::loglm(formula = ~word1 * word2 + word2 * word3 + word1 * 
#                     word3, data = xtabs(~., toks_df[, 1:3]))
# 
# Statistics:
#     X^2 df  P(> X^2)
# Likelihood Ratio 0.0006825132  1 0.9791577
# Pearson                   NaN  1       NaN

stats::loglin(table(toks_df[, 1:3]), margin = list(1:2, 2:3, c(1,3)))
# 4 iterations: deviation 0.05924525 
# $lrt
# [1] 0.0006825132
# 
# $pearson
# [1] NaN
# 
# $df
# [1] 1
# 
# $margin
# $margin[[1]]
# [1] "word1" "word2"
# 
# $margin[[2]]
# [1] "word2" "word3"
# 
# $margin[[3]]
# [1] "word1" "word3"

As a binary logistic regression, what we are computing is the coefficient for the interaction term when the data are expressed this way:

summary(glm(I(word3=="tax") ~ I(word1=="capital") * I(word2=="gains"), 
            family = "binomial", data = toks_df_n))

# Call:
#     glm(formula = I(word3 == "tax") ~ I(word1 == "capital") * I(word2 == 
#                                                                     "gains"), family = "binomial", data = toks_df_n)
# 
# Deviance Residuals: 
#        1         2         3         4         5         6  
# -1.17741  -0.00008  -0.00008  -1.17741   1.17741   1.17741  
# 
# Coefficients:
#                                                     Estimate Std. Error z value Pr(>|z|)
# (Intercept)                                       -1.957e+01  1.075e+04  -0.002    0.999
# I(word1 == "capital")TRUE                         -9.633e-09  1.521e+04   0.000    1.000
# I(word2 == "gains")TRUE                            1.957e+01  1.075e+04   0.002    0.999
# I(word1 == "capital")TRUE:I(word2 == "gains")TRUE  9.633e-09  1.521e+04   0.000    1.000
# 
# (Dispersion parameter for binomial family taken to be 1)
# 
#     Null deviance: 7.6382  on 5  degrees of freedom
# Residual deviance: 5.5452  on 2  degrees of freedom
# AIC: 13.545
# 
# Number of Fisher Scoring iterations: 18

Since Blaheta and Johnson smooth their counts, I've tried to approximate their model:

# add zeroes for the two unobserved categories
toks_df_n_smoothed <- rbind(toks_df_n, 
                            data.table(word1 = c("other", "other"),
                                       word2 = c("capital", "gains"),
                                       word3 = c("tax", "tax"),
                                       n = 0))
# smooth the frequencies as per B&J
toks_df_n_smoothed$n <- toks_df_n_smoothed$n + 0.5
setorder(toks_df_n_smoothed, word3, word2, word1)
toks_df_n_smoothed
#      word1   word2 word3    n
# 1: capital   gains other  1.5
# 2:   other   gains other  3.5
# 3: capital   other other  2.5
# 4:   other   other other 12.5
# 5:   other capital   tax  0.5
# 6: capital   gains   tax  2.5
# 7:   other   gains   tax  1.5
# 8:   other   gains   tax  0.5

summary(glm(I(word3=="tax") ~ I(word1=="capital") * I(word2=="gains"), 
            family = "binomial", data = toks_df_n_smoothed))

# Call:
#     glm(formula = I(word3 == "tax") ~ I(word1 == "capital") * I(word2 == 
#                                                                     "gains"), family = "binomial", data = toks_df_n_smoothed)
# 
# Deviance Residuals: 
#        1         2         3         4         5         6         7         8  
# -1.17741  -1.48230  -0.00022  -1.17741   1.17741   1.17741   0.90052   0.90052  
# 
# Coefficients:
#                                                     Estimate Std. Error z value Pr(>|z|)
# (Intercept)                                        2.064e-15  1.414e+00   0.000    1.000
# I(word1 == "capital")TRUE                         -1.757e+01  3.956e+03  -0.004    0.996
# I(word2 == "gains")TRUE                            6.931e-01  1.871e+00   0.371    0.711
# I(word1 == "capital")TRUE:I(word2 == "gains")TRUE  1.687e+01  3.956e+03   0.004    0.997
# 
# (Dispersion parameter for binomial family taken to be 1)
# 
#     Null deviance: 11.0904  on 7  degrees of freedom
# Residual deviance:  9.3643  on 4  degrees of freedom
# AIC: 17.364
# 
# Number of Fisher Scoring iterations: 16

kbenoit commented Jun 15, 2017

From sequences():

seqs <- sequences(tokens(txt), size = 3, min_count = 1)
seqs[seqs$collocation == "capital gains tax", ]
#         collocation count length   lambda    sigma      dice      pmi  logratio     chi2        z           p
# 6 capital gains tax     2      3 4.491842 1.845458 0.3571429 1.932499 0.9867511 16.09538 2.433999 0.007466527

HaiyanLW commented Jun 19, 2017

According to the 2x2 table in Slack last Friday:

         C     ~C
     +-----+----+
 (AB)|     |    | N_(AB)
     +-----+----+
~(AB)|     |    | N_(~(AB))
     +-----+----+ 
       N_C  N_~C  N

My understanding is as follows:

tt <- toks_df[, cg := paste(word1, word2)]
tt <- tt[cg != "capital gains", cg := "other"]
toks_df_2x2 <- table(tt[, c(3, 5)])
toks_df_2x2
#        cg
# word3   capital gains other
#   other             1    17
#   tax               2     1
stats::loglin(toks_df_2x2, margin=1:2)
#2 iterations: deviation 0 
#$lrt
#[1] 5.681671
#
#$pearson
#[1] 7.842593
#
#$df
#[1] 1
#
#$margin
#[1] "word3" NA 

and from the updated sequences() now,

sequences(tokens(txt), size = 3, smoothing = 0)

HaiyanLW commented Jun 19, 2017

#         collocation count length lambda sigma      dice      pmi logratio     chi2   z
#5    gains tax gains     2      3    NaN   Inf 0.4285714 1.252763 5.570502 5.526316 NaN
#13   C capital gains     3      3    NaN   Inf 0.6428571 1.252763 8.907119 8.750000 NaN
#15 capital gains tax     2      3    NaN   Inf 0.4000000 1.540445 5.681671 7.842593 NaN

kbenoit commented Jun 19, 2017

That's it. There are faster algebraic methods to compute all of these, including lambda and sigma, without the iterative approach in stats::loglin(). See my code in textstat_keyness(), for instance, for the quick computation of chi^2 and G^2 for 2x2 tables. (They might be using the Yates correction of adding 0.5 to each cell...)

It's all so much easier with 2x2 tables!
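
For reference, here is a quick sketch of that direct algebraic computation (not the textstat_keyness() code itself, just the same idea) applied to the 2x2 table above; it reproduces the loglin() figures without iteration.

O <- matrix(c(1, 2, 17, 1), nrow = 2)        # observed counts from the 2x2 table above
E <- outer(rowSums(O), colSums(O)) / sum(O)  # expected counts under independence
X2 <- sum((O - E)^2 / E)                     # Pearson chi-squared
G2 <- 2 * sum(O * log(O / E))                # likelihood ratio statistic
c(G2 = G2, X2 = X2)
#       G2       X2 
# 5.681671 7.842593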

HaiyanLW commented:

Do you think we should move the calculation of chi^2 and G^2 (and maybe pmi) out of the C++ code? I mean, these are computed in parallel now. And if we move them to R, we still need to calculate them separately for 2-, 3-, ... n-grams.

kbenoit commented Jun 19, 2017

If the calculation is vectorised in R, it will be as fast as in C++, and easier to maintain within the package, so I'm in favour. This would mean we are just using the C++ code to form the matches and return the tables.
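
As a rough illustration only (a sketch with a hypothetical column layout, not the package implementation), suppose the C++ side returned a data.frame with one row per candidate collocation and the observed 2x2 counts in columns n11, n10, n01, n00; the statistics can then be computed for all candidates at once:

# vectorised Pearson chi-squared over all candidate collocations at once
chi2_vectorised <- function(d) {
    N   <- d$n11 + d$n10 + d$n01 + d$n00
    e11 <- (d$n11 + d$n10) * (d$n11 + d$n01) / N   # expected counts under independence
    e10 <- (d$n11 + d$n10) * (d$n10 + d$n00) / N
    e01 <- (d$n01 + d$n00) * (d$n11 + d$n01) / N
    e00 <- (d$n01 + d$n00) * (d$n10 + d$n00) / N
    (d$n11 - e11)^2 / e11 + (d$n10 - e10)^2 / e10 +
        (d$n01 - e01)^2 / e01 + (d$n00 - e00)^2 / e00
}
# e.g. the "capital gains tax" table from above:
chi2_vectorised(data.frame(n11 = 2, n10 = 1, n01 = 1, n00 = 17))  # 7.842593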

HaiyanLW commented:

Now chi^2, G^2, pmi, and Dice are returned along with lambda and lambda1, and are tested against loglin() for trigrams. Shall we leave it as such until we decide how to update textstat_collocations.R?

kbenoit commented Jun 19, 2017

Fine with me.

HaiyanLW commented:

:) Seems I need a break from this branch.

HaiyanLW commented:

Do you want me to do a PR, or leave it for now?

kbenoit commented Jun 19, 2017

If the tests are there to verify the calculations, we can use them to test other implementations, such as the one that @koheiw is working on. So if it's complete, go ahead and issue a PR. Merge the latest master changes into the branch first, if you have not yet done that.

kbenoit commented Aug 2, 2017

@HaiyanLW here is some code that computes the expected values, in a slow R way. The figures are taken from Jouni's example, and I have not implemented smoothing, but I think that will already be part of the "observed" values that I have computed, so it should work as is. If you can integrate this temporarily into textstat_collocations() on the collocations_verify branch, we can use it to verify the figures.

df2 <- data.frame(collocation = "united states", n00 = 120722, n01 = 174, n10 = 42, n11 = 153)
df3 <- data.frame(
    collocation = c("bill of rights", "men and women"),
    n000 = c(102006, 10539),
    n001 = c(124, 5),
    n010 = c(6947, 3630),
    n011 = c(5, 0),
    n100 = c(7, 76),
    n101 = c(0, 0),
    n110 = c(0, 7),
    n111 = c(3, 25)
)

# function to get lower-order interactions for k-grams
marginalfun <- function(k) {
    utils::combn(k, k-1, simplify = FALSE)
}

get_expected_values <- function(df, size) {
    # get the columns of the data.frame that are the n* counts
    counts <- df[, grep("^n\\d+", names(df))]
    # sort the counts alphabetically
    counts <- df[, sort(names(counts))]
    
    expected_counts_list <- apply(counts, 1, function(x) {
        countsnum <- as.numeric(x)
        names(countsnum) <- names(counts)
        array_dimnames <- c(rep(list(c("0", "1")), size))
        names(array_dimnames) <- paste0("W", size:1)
        counts_table <- array(countsnum, dim = rep(2, size), dimnames = array_dimnames)
        counts_expected <- stats::loglin(counts_table,
                                         margin =  marginalfun(size),
                                         fit = TRUE, print = FALSE)$fit
        counts_expected <- as.numeric(counts_expected)
        names(counts_expected) <- gsub("^n", "e", names(counts))  # label expected counts e000, e001, ...
        counts_expected
    })
    
    data.frame(t(expected_counts_list))
}

get_expected_values(df3, size = 3)
#        e000        e001     e010     e011      e100     e101      e110       e111
# 1 102008.17 121.8345128 6944.835 7.165488  4.836326 2.163674  2.163756  0.8362441
# 2  10543.53   0.4733728 3625.473 4.526712 71.467282 4.532718 11.578213 20.4217869

cbind(df2, get_expected_values(df2, size = 2))
#     collocation    n00 n01 n10 n11      e00      e01      e10       e11
# 1 united states 120722 174  42 153 120569.5 326.4734 194.4734 0.5265874

HaiyanLW commented Aug 2, 2017

For now, size must be specified as a scalar and only 2 or 3 are handled; I haven't changed G2 etc. for 3-, 4-, 5-grams:

> seqs <- textstat_collocations(toks, size = 3, show_counts = TRUE)
> head(seqs)
                 collocation count length     lambda    sigma       dice     gensim      pmi       G2
1     United States Congress     2      3  -2.114393 2.123664 0.01366120 0.01275545 2.994408 10.31363
2 President Carter President     2      3  -3.435144 2.565448 0.04464286 2.33411598 7.212004 34.35112
3    Chief Justice President     2      3  -3.459484 2.190133 0.06000000 1.50927800 5.671559 23.92940
4   President Bush President     2      3  -5.686073 2.191910 0.04360465 1.16705799 6.364706 27.93465
5   President Vice President     2      3  -6.637474 2.163159 0.04213483 0.66689028 7.212004 34.35112
6        Vice President Bush     2      3 -11.454414 2.590716 0.06818182 5.41773356 7.935923 36.11702
        chi2       LFMD     n000 n001  n010  n011  n100 n101 n110 n111     e000 e001  e010  e011  e100
1   45.14942 -18.261501 124848.5 11.5 150.5 131.5 114.5  0.5  0.5  2.5 124848.2 11.8 150.8 131.2 114.8
2 3386.17964 -10.745883 125104.5 76.5   0.5   0.5  68.5  6.5  0.5  2.5 125104.2 76.8   0.8   0.2  68.8
3  721.75044 -11.374900 125152.5 13.5   4.5  11.5  74.5  0.5  0.5  2.5 125152.1 13.9   4.9  11.1  74.8
4 1448.41065 -11.745883 125104.5 72.5   0.5   4.5  68.5  6.5  0.5  2.5 125104.0 73.0   1.0   4.0  69.0
5 3386.17964 -12.553238 125104.5 76.5   0.5   0.5  58.5  6.5 10.5  2.5 125104.0 77.0   1.0   0.0  59.0
6 6986.12739  -9.531069 125173.5  0.5  64.5  13.5   0.5  0.5  4.5  2.5 125173.0  1.0  65.0  13.0   1.0
  e101 e110 e111          z         p
1  0.2  0.2  2.8 -0.9956346 0.8402862
2  6.2  0.1  2.9 -1.3390036 0.9097153
3  0.2  0.1  2.9 -1.5795770 0.9428981
4  6.0  0.0  3.0 -2.5941184 0.9952583
5  6.0 10.0  3.0 -3.0684166 0.9989240
6  0.0  4.0  3.0 -4.4213319 0.9999951

kbenoit commented Aug 2, 2017

OK, when you push it to the branch I will run some tests. Please merge master into your branch first.
