New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove Chi-square and Hamming distance measure #1442
Comments
Good points. The chi-squared distance varies with the matrix size because across the dimension not computed, there is a quantity in the formula called the "average profile" that is the average proportion for that dimension. So if we compare the three- and four-row subsets above, for computing the distance between documents, then the average feature proportion varies when the fourth document is included. > dfm_weight(mt, "prop")
Document-feature matrix of: 4 documents, 5 features (50% sparse).
4 x 5 sparse Matrix of class "dfm"
features
docs a b c d e
text1 0.50 0.25 0.25 0 0
text2 0.25 0.50 0 0.25 0
text3 0 0.20 0 0.40 0.4
text4 0 0 0 0 1.0
> dfm_weight(mt[1:3, ], "prop") %>% colSums()
a b c d e
0.75 0.95 0.25 0.65 0.40
> dfm_weight(mt[1:4, ], "prop") %>% colSums()
a b c d e
0.75 0.95 0.25 0.65 1.40 That's why the quantities are different. But this function is important in correspondence analysis and anyone using it ought to be aware of this "feature" of its behaviour. So, it's not a bug. For the Hamming distance, you are right in that this not appropriate.
docs <- c(d1 = "a b c d e",
d2 = "a b c e f",
d3 = "b a d e c")
toks <- tokens(docs)
d <- dfm(toks)
d
## Document-feature matrix of: 3 documents, 6 features (16.7% sparse).
## 3 x 6 sparse Matrix of class "dfm"
## features
## docs a b c d e f
## d1 1 1 1 1 1 0
## d2 1 1 1 0 1 1
## d3 1 1 1 1 1 0
# correct hamming distances
sum(toks[["d1"]] != toks[["d2"]])
## [1] 2
sum(toks[["d1"]] != toks[["d3"]])
## [1] 5
# wrong
textstat_dist(d, method = "hamming")
## d1 d2
## d2 2
## d3 0 2
e1071::hamming.distance(1:3, rep(1:3, 2))
## [1] 0
e1071::hamming.distance(1:3, 1:4)
## [1] 1
## Warning message:
## In x != y : longer object length is not a multiple of shorter object length So I propose we solve this by removing this measure altogether, rather than write a method for tokens objects. |
I know that it isn't a bug, but seems better to compute chi-square disntace in two steps: (1) perform average profile weighting by We can drop |
OK so agreed on Hamming. On chi-squared distance however I am not sure what you mean by separating the steps, since the average profile depends on the dimensions of the matrix, which means that two documents will have a different distance on this measure if a third document is included or not. I don't see a way around that for this measure - but anyone using it (which is probably very few, for text!) should know this. |
First step: the shape of entire corpus matters (e.g. Lines 322 to 335 in e2d8005
Second step: operations are on pairs of document/features Lines 361 to 365 in e2d8005
|
Here is an example of two-step approach: dfm_avgp <- function(x, margin = "documents") { # should be dfm_weight
if (margin == "features")
x <- t(x)
name <- colnames(x)
# compute average-profile
avg <- sqrt(colSums(x) / sum(x))
x <- x / rowSums(x)
x <- x %*% diag(1 / avg)
colnames(x) <- name
if (margin == "features")
x <- t(x)
return(as.dfm(x))
}
mt <- dfm(c("a a b c", "a b b d", "b d d e e", "e e e"))
textstat_dist(mt, method = "chisquared")
# text1 text2 text3
# text2 1.916667
# text3 3.708667 1.325333
# text4 5.783333 4.866667 2.165333
mt2 <- dfm_avgp(mt)
textstat_simil2(mt2, method = "chisquared")
# 4 x 4 sparse Matrix of class "dsTMatrix"
# text1 text2 text3 text4
# text1 . 1.916667 3.708667 5.783333
# text2 1.916667 . 1.325333 4.866667
# text3 3.708667 1.325333 . 2.165333
# text4 5.783333 4.866667 2.165333 .
textstat_simil2(mt2[1:2,], method = "chisquared")
# 2 x 2 sparse Matrix of class "dsTMatrix"
# text1 text2
# text1 . 1.916667
# text2 1.916667 . If a DFM is weighted before being passed to |
Above code replicates current ExPosition::chi2Dist(as.matrix(mt))$D
docs
docs text1 text2 text3 text4
text1 0.000000 1.916667 3.708667 5.783333
text2 1.916667 0.000000 1.325333 4.866667
text3 3.708667 1.325333 0.000000 2.165333
text4 5.783333 4.866667 2.165333 0.000000 Another mystery is that the as.matrix(proxy::dist(as.matrix(mt), method = "Chi-squared", by_rows = TRUE))
text1 text2 text3 text4
text1 0.000 -2.750 -5.250 -0.875
text2 -2.750 0.000 -4.000 -0.875
text3 -5.250 -4.000 0.000 -0.875
text4 -0.875 -0.875 -0.875 0.000 Better to check. |
The proxy function is a different formula, and it's a similarity rather than a distance measure. > pr_DB$get_entry("Chi-squared")
names Chi-squared
FUN pr_ChiSquared
distance FALSE
PREFUN NA
POSTFUN NA
convert pr_simil2dist
type nominal
loop TRUE
C_FUN FALSE
PACKAGE proxy
abcd FALSE
formula sum_ij (o_i - e_i)^2 / e_i
reference Anderberg, M.R. (1973). Cluster Analysis for Applicaitons. Academic Press.
description Sum of standardized squared deviations from observed and expected values in
a cross-tab for x and y. |
Not sure why the Greenacre pdf has a different value, but it does. mt <- dfm(c("a a b c", "a b b d", "b d d e e", "e e e"))
mtprof <- dfm_weight(mt, "prop")
avgp <- colMeans(mtprof)
# canned methods
#
textstat_dist(mt, method = "chisquared")
## text1 text2 text3
## text2 1.916667
## text3 3.708667 1.325333
## text4 5.783333 4.866667 2.165333
as.dist(ExPosition::chi2Dist(as.matrix(mt))$D)
## text1 text2 text3
## text2 1.916667
## text3 3.708667 1.325333
## text4 5.783333 4.866667 2.165333
# based on http://www.econ.upf.edu/~michael/stanford/maeb4.pdf Eq 4.8
#
# difference between text1 and text2
sqrt(sum((mtprof[1, ] - mtprof[2, ])^2 / avgp))
## [1] 1.407518
# difference between text1 and text3
sqrt(sum((mtprof[1, ] - mtprof[3, ])^2 / avgp))
## [1] 1.945666
# difference between text1 and text4
sqrt(sum((mtprof[1, ] - mtprof[4, ])^2 / avgp))
## [1] 2.335302 |
Moved to #1444. |
From Greenacre handout chrome-extension://oemmndcbldboiebfnladdacbdfmadadm/http://www.econ.upf.edu/~michael/stanford/maeb4.pdf In case we wish to return to #1442
In relation o #1210, we should remove Chi-square and Hamming measures.
Chi-square
I don't think this is a good measure, because distance between the same pairs of documents changes depending on size of the corpus. This does not seem to be a bug as
ExPosition::chi2Dist()
behaves the same.Hamming
According to https://en.wikipedia.org/wiki/Hamming_distance, positions of elements matter in Hamming distance (similar to Levenshtein distance). This is more suitable to compute on tokens objects by making
textstat_dist.tokens()
method.The text was updated successfully, but these errors were encountered: