check calculations of textstat_collocations() #803
OK - It turns out that we were computing the model that included lower-order interactions, which means that the likelihood ratio we are computing for ngrams with n > 2 is for the wrong model. What we should be computing (for trigrams) is the likelihood of seeing the third word after having seen the first two words in order. In contrast to the above, this has the following equivalent formulations in log-linear models:

MASS::loglm(~ word1*word2 + word2*word3 + word1*word3, xtabs(~ ., toks_df[, 1:3]))
# Call:
# MASS::loglm(formula = ~word1 * word2 + word2 * word3 + word1 *
# word3, data = xtabs(~., toks_df[, 1:3]))
#
# Statistics:
# X^2 df P(> X^2)
# Likelihood Ratio 0.0006825132 1 0.9791577
# Pearson NaN 1 NaN
stats::loglin(table(toks_df[, 1:3]), margin = list(1:2, 2:3, c(1,3)))
# 4 iterations: deviation 0.05924525
# $lrt
# [1] 0.0006825132
#
# $pearson
# [1] NaN
#
# $df
# [1] 1
#
# $margin
# $margin[[1]]
# [1] "word1" "word2"
#
# $margin[[2]]
# [1] "word2" "word3"
#
# $margin[[3]]
# [1] "word1" "word3" As a binary logistic regression, what we are computing is the coefficient for the interaction term when ther data are expressed this way: summary(glm(I(word3=="tax") ~ I(word1=="capital") * I(word2=="gains"),
family = "binomial", data = toks_df_n))
# Call:
# glm(formula = I(word3 == "tax") ~ I(word1 == "capital") * I(word2 ==
# "gains"), family = "binomial", data = toks_df_n)
#
# Deviance Residuals:
# 1 2 3 4 5 6
# -1.17741 -0.00008 -0.00008 -1.17741 1.17741 1.17741
#
# Coefficients:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) -1.957e+01 1.075e+04 -0.002 0.999
# I(word1 == "capital")TRUE -9.633e-09 1.521e+04 0.000 1.000
# I(word2 == "gains")TRUE 1.957e+01 1.075e+04 0.002 0.999
# I(word1 == "capital")TRUE:I(word2 == "gains")TRUE 9.633e-09 1.521e+04 0.000 1.000
#
# (Dispersion parameter for binomial family taken to be 1)
#
# Null deviance: 7.6382 on 5 degrees of freedom
# Residual deviance: 5.5452 on 2 degrees of freedom
# AIC: 13.545
#
# Number of Fisher Scoring iterations: 18

Since Blaheta and Johnson smooth their counts, I've tried to approximate their model:

library(data.table)  # for data.table() and setorder() below
# add zeroes for the two unobserved categories
toks_df_n_smoothed <- rbind(toks_df_n,
data.table(word1 = c("other", "other"),
word2 = c("capital", "gains"),
word3 = c("tax", "tax"),
n = 0))
# smooth the frequencies as per B&J
toks_df_n_smoothed$n <- toks_df_n_smoothed$n + 0.5
setorder(toks_df_n_smoothed, word3, word2, word1)
toks_df_n_smoothed
# word1 word2 word3 n
# 1: capital gains other 1.5
# 2: other gains other 3.5
# 3: capital other other 2.5
# 4: other other other 12.5
# 5: other capital tax 0.5
# 6: capital gains tax 2.5
# 7: other gains tax 1.5
# 8: other gains tax 0.5
summary(glm(I(word3=="tax") ~ I(word1=="capital") * I(word2=="gains"),
family = "binomial", data = toks_df_n_smoothed))
# Call:
# glm(formula = I(word3 == "tax") ~ I(word1 == "capital") * I(word2 ==
# "gains"), family = "binomial", data = toks_df_n_smoothed)
#
# Deviance Residuals:
# 1 2 3 4 5 6 7 8
# -1.17741 -1.48230 -0.00022 -1.17741 1.17741 1.17741 0.90052 0.90052
#
# Coefficients:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) 2.064e-15 1.414e+00 0.000 1.000
# I(word1 == "capital")TRUE -1.757e+01 3.956e+03 -0.004 0.996
# I(word2 == "gains")TRUE 6.931e-01 1.871e+00 0.371 0.711
# I(word1 == "capital")TRUE:I(word2 == "gains")TRUE 1.687e+01 3.956e+03 0.004 0.997
#
# (Dispersion parameter for binomial family taken to be 1)
#
# Null deviance: 11.0904 on 7 degrees of freedom
# Residual deviance: 9.3643 on 4 degrees of freedom
# AIC: 17.364
#
# Number of Fisher Scoring iterations: 16
From:

seqs <- sequences(tokens(txt), size = 3, min_count = 1)
seqs[seqs$collocation == "capital gains tax", ]
# collocation count length lambda sigma dice pmi logratio chi2 z p
# 6 capital gains tax 2 3 4.491842 1.845458 0.3571429 1.932499 0.9867511 16.09538 2.433999 0.007466527
According to the
My understanding is as follows:

tt <- toks_df[, cg := paste(word1, word2)]
tt <- tt[cg !="capital gains", cg:="other"]
toks_df_2x2 <- table(tt[,c(3,5)])
toks_df_2x2
# cg
#word3 capital gains other
# other 1 17
# tax 2 1

stats::loglin(toks_df_2x2, margin = 1:2)
#2 iterations: deviation 0
#$lrt
#[1] 5.681671
#
#$pearson
#[1] 7.842593
#
#$df
#[1] 1
#
#$margin
#[1] "word3" NA and from the updated sequences(tokens(txt), size =3 , smoothing = 0) |
# collocation count length lambda sigma dice pmi logratio chi2 z
#5 gains tax gains 2 3 NaN Inf 0.4285714 1.252763 5.570502 5.526316 NaN
#13 C capital gains 3 3 NaN Inf 0.6428571 1.252763 8.907119 8.750000 NaN
#15 capital gains tax 2 3 NaN Inf 0.4000000 1.540445 5.681671 7.842593 NaN
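My guess at what is happening there: lambda is a signed sum of logged cell counts and sigma sums their reciprocals, so any zero cell in the unsmoothed table makes both non-finite. A minimal sketch with made-up 2x2 counts (the cell values are illustrative only):

# any zero cell makes the log odds-ratio and its s.e. non-finite
n <- c(n00 = 17, n01 = 1, n10 = 0, n11 = 2)
log(n["n11"]) - log(n["n10"]) - log(n["n01"]) + log(n["n00"])  # Inf
sqrt(sum(1 / n))                                               # Inf
# Blaheta & Johnson's +0.5 smoothing keeps every term finite
ns <- n + 0.5
log(ns["n11"]) - log(ns["n10"]) - log(ns["n01"]) + log(ns["n00"])
sqrt(sum(1 / ns))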
That's it. There are faster algebraic methods to compute all of these, including lambda and sigma, without the iterative approach in stats::loglin().

It's all so much easier with 2x2 tables!
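For a 2x2 table those closed forms are short. Here is a sketch of one way to write them (the n00/n01/n10/n11 cell naming matches the counts used further down; the function name is just for illustration):

# closed-form 2x2 statistics, no iterative fitting required
stats_2x2 <- function(n00, n01, n10, n11) {
    n <- c(n00, n01, n10, n11)
    N <- sum(n)
    # expected cell counts under independence of the two margins
    e <- c((n00 + n01) * (n00 + n10), (n00 + n01) * (n01 + n11),
           (n10 + n11) * (n00 + n10), (n10 + n11) * (n01 + n11)) / N
    list(lambda = log(n11) - log(n10) - log(n01) + log(n00),  # log odds ratio
         sigma  = sqrt(sum(1 / n)),                           # its asymptotic s.e.
         pmi    = log(n11 / e[4]),
         chi2   = sum((n - e)^2 / e),
         G2     = 2 * sum(ifelse(n > 0, n * log(n / e), 0)))
}
# reproduces the "capital gains tax" 2x2 figures above:
# G2 = 5.681671, chi2 = 7.842593, pmi = 1.540445
stats_2x2(n00 = 17, n01 = 1, n10 = 1, n11 = 2)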
Do you think we should move the calculation of the chi^2 and G^2 (and maybe pmi) outside the C++ code? I mean, these are computed in parallel now. And once we move them to R, we would still need to calculate them separately for 2-, 3-, ... n-grams.
If the calculation is vectorised in R, it will be as fast as in C++, and easier to maintain within the package, so I'm in favour. This would mean we are just using the C++ code to form the matches and return the tables.
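To make that concrete, a rough sketch of the vectorised version, assuming the C++ side hands back a data.frame with one row per candidate and the n00..n11 count columns (names here are illustrative, not a final API):

# vectorised chi2 and G2 over all candidates at once
vec_stats <- function(df) {
    N   <- with(df, n00 + n01 + n10 + n11)
    e00 <- with(df, (n00 + n01) * (n00 + n10)) / N
    e01 <- with(df, (n00 + n01) * (n01 + n11)) / N
    e10 <- with(df, (n10 + n11) * (n00 + n10)) / N
    e11 <- with(df, (n10 + n11) * (n01 + n11)) / N
    obs  <- as.matrix(df[, c("n00", "n01", "n10", "n11")])
    expd <- cbind(e00, e01, e10, e11)
    df$chi2 <- rowSums((obs - expd)^2 / expd)
    df$G2   <- 2 * rowSums(ifelse(obs > 0, obs * log(obs / expd), 0))
    df
}
# first row is the "united states" table, second the "capital gains tax" one
vec_stats(data.frame(n00 = c(120722, 17), n01 = c(174, 1),
                     n10 = c(42, 1), n11 = c(153, 2)))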
Now the
Fine with me.
:) Seems I need a break from this branch.
Do you want me to do a PR or leave it for now?
If the tests are there to verify the calculations, we can use them to test other implementations, such as the one that @koheiw is working on. So if it's complete, go ahead and issue a PR. Merge the latest
@HaiyanLW here is some code that computes, using a slow R approach, the expected values. The figures are taken from Jouni's example, and I have not implemented smoothing, but I think this will already be a part of the "observed" values that I have computed, so it should work as is. If you can integrate this temporarily into the

df2 <- data.frame(collocation = "united states", n00 = 120722, n01 = 174, n10 = 42, n11 = 153)
df3 <- data.frame(
collocation = c("bill of rights", "men and women"),
n000 = c(102006, 10539),
n001 = c(124, 5),
n010 = c(6947, 3630),
n011 = c(5, 0),
n100 = c(7, 76),
n101 = c(0, 0),
n110 = c(0, 7),
n111 = c(3, 25)
)
# function to get lower-order interactions for k-grams
marginalfun <- function(k) {
utils::combn(k, k-1, simplify = FALSE)
}
get_expected_values <- function(df, size) {
# get the columns of the data.frame that are the n* counts
counts <- df[, grep("^n\\d+", names(df))]
# sort the counts alphabetically
counts <- df[, sort(names(counts))]
expected_counts_list <- apply(counts, 1, function(x) {
countsnum <- as.numeric(x)
names(countsnum) <- names(counts)
array_dimnames <- c(rep(list(c("0", "1")), size))
names(array_dimnames) <- paste0("W", size:1)
counts_table <- array(countsnum, dim = rep(2, size), dimnames = array_dimnames)
counts_expected <- stats::loglin(counts_table,
margin = marginalfun(size),
fit = TRUE, print = FALSE)$fit
counts_expected <- as.numeric(counts_expected)
names(counts_expected) <- gsub("e", "n", names(counts))
counts_expected
})
data.frame(t(expected_counts_list))
}
get_expected_values(df3, size = 3)
# n000 n001 n010 n011 n100 n101 n110 n111
# 1 102008.17 121.8345128 6944.835 7.165488 4.836326 2.163674 2.163756 0.8362441
# 2 10543.53 0.4733728 3625.473 4.526712 71.467282 4.532718 11.578213 20.4217869
cbind(df2, get_expected_values(df2, size = 2))
# collocation n00 n01 n10 n11 n00 n01 n10 n11
# 1 united states 120722 174 42 153 120569.5 326.4734 194.4734 0.5265874
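With the expected counts in hand, the test statistics should drop out without any further iteration; a sketch of that last step for the df2 example above:

obs  <- unlist(df2[1, c("n00", "n01", "n10", "n11")])
expd <- unlist(get_expected_values(df2, size = 2))
c(G2   = 2 * sum(obs * log(obs / expd)),   # likelihood-ratio statistic
  chi2 = sum((obs - expd)^2 / expd))       # Pearson statistic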
For now, size must be specified as a scalar, and only 2 or 3 are supported; otherwise the call hasn't changed:

> seqs <- textstat_collocations(toks, size = 3, show_counts = TRUE)
> head(seqs)
collocation count length lambda sigma dice gensim pmi G2
1 United States Congress 2 3 -2.114393 2.123664 0.01366120 0.01275545 2.994408 10.31363
2 President Carter President 2 3 -3.435144 2.565448 0.04464286 2.33411598 7.212004 34.35112
3 Chief Justice President 2 3 -3.459484 2.190133 0.06000000 1.50927800 5.671559 23.92940
4 President Bush President 2 3 -5.686073 2.191910 0.04360465 1.16705799 6.364706 27.93465
5 President Vice President 2 3 -6.637474 2.163159 0.04213483 0.66689028 7.212004 34.35112
6 Vice President Bush 2 3 -11.454414 2.590716 0.06818182 5.41773356 7.935923 36.11702
chi2 LFMD n000 n001 n010 n011 n100 n101 n110 n111 e000 e001 e010 e011 e100
1 45.14942 -18.261501 124848.5 11.5 150.5 131.5 114.5 0.5 0.5 2.5 124848.2 11.8 150.8 131.2 114.8
2 3386.17964 -10.745883 125104.5 76.5 0.5 0.5 68.5 6.5 0.5 2.5 125104.2 76.8 0.8 0.2 68.8
3 721.75044 -11.374900 125152.5 13.5 4.5 11.5 74.5 0.5 0.5 2.5 125152.1 13.9 4.9 11.1 74.8
4 1448.41065 -11.745883 125104.5 72.5 0.5 4.5 68.5 6.5 0.5 2.5 125104.0 73.0 1.0 4.0 69.0
5 3386.17964 -12.553238 125104.5 76.5 0.5 0.5 58.5 6.5 10.5 2.5 125104.0 77.0 1.0 0.0 59.0
6 6986.12739 -9.531069 125173.5 0.5 64.5 13.5 0.5 0.5 4.5 2.5 125173.0 1.0 65.0 13.0 1.0
e101 e110 e111 z p
1 0.2 0.2 2.8 -0.9956346 0.8402862
2 6.2 0.1 2.9 -1.3390036 0.9097153
3 0.2 0.1 2.9 -1.5795770 0.9428981
4 6.0 0.0 3.0 -2.5941184 0.9952583
5 6.0 10.0 3.0 -3.0684166 0.9989240
6 0.0 4.0 3.0 -4.4213319 0.9999951 |
OK, when you push it to the branch I will run some tests. Please merge
Here's a stab at a log-linear model equivalency. The focus is on "capital gains tax".