
Remove Chi-square and Hamming distance measure #1442

Closed
koheiw opened this issue Oct 3, 2018 · 9 comments

@koheiw
Collaborator

koheiw commented Oct 3, 2018

In relation to #1210, we should remove the Chi-square and Hamming distance measures.

Chi-square

I don't think this is a good measure, because the distance between the same pair of documents changes depending on the size of the corpus. This does not seem to be a bug, as ExPosition::chi2Dist() behaves the same way.

> mt <- dfm(c("a a b c", "a b b d", "b d d e e", "e e e"))
> textstat_dist(mt[1:3,], method = "chisquared")
         text1    text2
text2 1.557292         
text3 3.637292 1.700833

> textstat_dist(mt[1:4,], method = "chisquared")
         text1    text2    text3
text2 1.916667                  
text3 3.708667 1.325333         
text4 5.783333 4.866667 2.165333

> as.matrix(ExPosition::chi2Dist(as.matrix(mt[1:3,]))$D)
       docs
docs       text1    text2    text3
  text1 0.000000 1.557292 3.637292
  text2 1.557292 0.000000 1.700833
  text3 3.637292 1.700833 0.000000

> as.matrix(ExPosition::chi2Dist(as.matrix(mt[1:4,]))$D)
       docs
docs       text1    text2    text3    text4
  text1 0.000000 1.916667 3.708667 5.783333
  text2 1.916667 0.000000 1.325333 4.866667
  text3 3.708667 1.325333 0.000000 2.165333
  text4 5.783333 4.866667 2.165333 0.000000

Hamming

According to https://en.wikipedia.org/wiki/Hamming_distance, the positions of elements matter in Hamming distance (as in Levenshtein distance). This measure is more suitable to compute on tokens objects by adding a textstat_dist.tokens() method.

@kbenoit
Collaborator

kbenoit commented Oct 3, 2018

Good points. The chi-squared distance varies with the matrix size because, across the dimension not being computed, the formula contains a quantity called the "average profile": the average proportion for that dimension. So if we compare the three- and four-row subsets above when computing distances between documents, the average feature proportion changes when the fourth document is included.

> dfm_weight(mt, "prop")
Document-feature matrix of: 4 documents, 5 features (50% sparse).
4 x 5 sparse Matrix of class "dfm"
       features
docs       a    b    c    d   e
  text1 0.50 0.25 0.25 0    0  
  text2 0.25 0.50 0    0.25 0  
  text3 0    0.20 0    0.40 0.4
  text4 0    0    0    0    1.0
> dfm_weight(mt[1:3, ], "prop") %>% colSums()
   a    b    c    d    e 
0.75 0.95 0.25 0.65 0.40 
> dfm_weight(mt[1:4, ], "prop") %>% colSums()
   a    b    c    d    e 
0.75 0.95 0.25 0.65 1.40 

That's why the quantities are different. But this measure is important in correspondence analysis, and anyone using it ought to be aware of this "feature" of its behaviour. So, it's not a bug.
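To make the dependence concrete outside quanteda, here is a small NumPy re-implementation (my own sketch, not quanteda's code) of the chi-squared distance with column masses as the average profile, run on the same toy counts. The text1/text2 value shifts exactly as in the R output above when the fourth document enters the corpus:

```python
import numpy as np

# Toy counts for "a a b c", "a b b d", "b d d e e", "e e e"
# over features a, b, c, d, e -- matching the dfm above.
mt = np.array([
    [2, 1, 1, 0, 0],
    [1, 2, 0, 1, 0],
    [0, 1, 0, 2, 2],
    [0, 0, 0, 0, 3],
], dtype=float)

def chisq_dist(x):
    """Squared chi-squared distances between rows, using the column
    masses of the *whole* matrix as the average profile."""
    col_mass = x.sum(axis=0) / x.sum()           # average profile
    prof = x / x.sum(axis=1, keepdims=True)      # row profiles
    diff = prof[:, None, :] - prof[None, :, :]   # all pairwise differences
    return (diff ** 2 / col_mass).sum(axis=2)

d3 = chisq_dist(mt[:3])   # three-document corpus
d4 = chisq_dist(mt)       # four-document corpus

print(round(d3[0, 1], 6))  # 1.557292
print(round(d4[0, 1], 6))  # 1.916667
```

Same pair of documents, two different distances, purely because the column masses changed.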

For the Hamming distance, you are right that it is not appropriate here.

  1. The vectors in a dfm are unordered, and therefore do not capture the essence of the Hamming distance, which is the number of positions at which two vectors differ. (In the textstat_dist() code, we convert all features to binary.) So below, we get the Hamming distance between d1 and d3 completely wrong.
docs <- c(d1 = "a b c d e", 
          d2 = "a b c e f",
          d3 = "b a d e c")
toks <- tokens(docs)
d <- dfm(toks)
d
## Document-feature matrix of: 3 documents, 6 features (16.7% sparse).
## 3 x 6 sparse Matrix of class "dfm"
##     features
## docs a b c d e f
##   d1 1 1 1 1 1 0
##   d2 1 1 1 0 1 1
##   d3 1 1 1 1 1 0

# correct hamming distances 
sum(toks[["d1"]] != toks[["d2"]])
## [1] 2
sum(toks[["d1"]] != toks[["d3"]])
## [1] 5

# wrong
textstat_dist(d, method = "hamming")
##    d1 d2
## d2  2   
## d3  0  2
  2. Another constraint on the Hamming distance is that the two vectors be of the same length. Token streams almost never are, although the dfm construction makes all dimensions equal in length. The "R" way to fix this would be recycling the shorter vector, but that can produce really oddball results.
e1071::hamming.distance(1:3, rep(1:3, 2))
## [1] 0
e1071::hamming.distance(1:3, 1:4)
## [1] 1
## Warning message:
## In x != y : longer object length is not a multiple of shorter object length

So I propose we solve this by removing this measure altogether, rather than write a method for tokens objects.
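For reference, a positional Hamming distance on token sequences is a one-liner once the equal-length constraint is enforced. This is a hedged sketch of what a tokens-based method would have to compute, not quanteda code; it reproduces the "correct" values from the R example above:

```python
def hamming(a, b):
    """Positional Hamming distance between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(x != y for x, y in zip(a, b))

d1 = "a b c d e".split()
d2 = "a b c e f".split()
d3 = "b a d e c".split()

print(hamming(d1, d2))  # 2
print(hamming(d1, d3))  # 5  (the dfm-based version returned 0)
```

Raising an error on unequal lengths avoids the recycling surprise shown with e1071::hamming.distance() above.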

@koheiw
Collaborator Author

koheiw commented Oct 3, 2018

I know that it isn't a bug, but it seems better to compute the chi-squared distance in two steps: (1) perform average-profile weighting in dfm_weight(), and (2) compute the distance between pairs of documents/features in textstat_simil2(). This is not only to keep the structure of the parallelized similarity function simple, but also to make the measure's dependency on the shape of the entire corpus explicit.

We can drop Hamming entirely. I am interested in those positional similarity measures, so I will bring them back along with Levenshtein distance etc. in quanteda or an extension package in the future.

@kbenoit
Collaborator

kbenoit commented Oct 3, 2018

OK so agreed on Hamming.

On chi-squared distance, however, I am not sure what you mean by separating the steps, since the average profile depends on the dimensions of the matrix, which means that two documents will have a different distance on this measure depending on whether a third document is included. I don't see a way around that for this measure, but anyone using it (which is probably very few, for text!) should know this.

@koheiw
Collaborator Author

koheiw commented Oct 3, 2018

First step: the shape of the entire corpus matters (e.g. sum(x))

quanteda/R/textstat_dist.R

Lines 322 to 335 in e2d8005

avg <- if (margin == 2) sqrt(rowSums(x) / sum(x)) else sqrt(colSums(x) / sum(x))
n <- if (margin == 2) ncol(x) else nrow(x)
rowname <- func_name(x)
colname <- func_name(x)
if (margin == 1) {
    # convert into profiles
    x <- x / func_sum(x)
    # weighted by the average profiles
    x <- x %*% diag(1 / avg)
} else {
    x <- x %*% diag(1 / func_sum(x))
    x <- x / avg
}

Second step: operations are on pairs of document/features

quanteda/R/textstat_dist.R

Lines 361 to 365 in e2d8005

an <- func_sum(x ^ 2)
tmp <- matrix(rep(an, n), nrow = n)
tmp <- tmp + matrix(rep(an, n), nrow = n, byrow = TRUE)
chimat <- tmp - 2 * as.matrix(func_cp(x))
#chimat <- sqrt(round(chimat, 2))

@koheiw
Collaborator Author

koheiw commented Oct 4, 2018

Here is an example of two-step approach:

dfm_avgp <- function(x, margin = "documents") { # should be dfm_weight
    
    if (margin == "features")
        x <- t(x)
    name <- colnames(x)
    
    # compute average-profile
    avg <- sqrt(colSums(x) / sum(x))
    x <- x / rowSums(x)
    x <- x %*% diag(1 / avg)
    
    colnames(x) <- name
    if (margin == "features")
        x <- t(x)
    return(as.dfm(x))
}

mt <- dfm(c("a a b c", "a b b d", "b d d e e", "e e e"))
textstat_dist(mt, method = "chisquared")

#          text1    text2    text3
# text2 1.916667                  
# text3 3.708667 1.325333         
# text4 5.783333 4.866667 2.165333

mt2 <- dfm_avgp(mt)
textstat_simil2(mt2, method = "chisquared")
# 4 x 4 sparse Matrix of class "dsTMatrix"
#          text1    text2    text3    text4
# text1 .        1.916667 3.708667 5.783333
# text2 1.916667 .        1.325333 4.866667
# text3 3.708667 1.325333 .        2.165333
# text4 5.783333 4.866667 2.165333 .       

textstat_simil2(mt2[1:2,], method = "chisquared")
# 2 x 2 sparse Matrix of class "dsTMatrix"
#          text1    text2
# text1 .        1.916667
# text2 1.916667 .       

If a DFM is weighted before being passed to textstat_simil2(), the distance between documents does not change under subsetting. We can ensure prior weighting by checking x@weightTf[["scheme"]].
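The invariance is easy to demonstrate outside R as well. This NumPy sketch (my own, mirroring the dfm_avgp() function above) weights once on the full matrix and then applies a plain pairwise squared Euclidean distance, so subsetting the weighted matrix cannot change pairwise values:

```python
import numpy as np

# Same toy counts as dfm(c("a a b c", "a b b d", "b d d e e", "e e e"))
mt = np.array([
    [2, 1, 1, 0, 0],
    [1, 2, 0, 1, 0],
    [0, 1, 0, 2, 2],
    [0, 0, 0, 0, 3],
], dtype=float)

def avgp_weight(x):
    """Step 1: average-profile weighting (mirrors dfm_avgp above)."""
    avg = np.sqrt(x.sum(axis=0) / x.sum())
    return (x / x.sum(axis=1, keepdims=True)) / avg

def sq_euclidean(x):
    """Step 2: plain pairwise squared Euclidean distances between rows."""
    diff = x[:, None, :] - x[None, :, :]
    return (diff ** 2).sum(axis=2)

w = avgp_weight(mt)
print(round(sq_euclidean(w)[0, 1], 6))      # 1.916667
print(round(sq_euclidean(w[:2])[0, 1], 6))  # 1.916667 -- unchanged by subsetting
```

All of the corpus-shape dependence lives in step 1; step 2 is an ordinary pairwise operation.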

kbenoit added a commit that referenced this issue Oct 4, 2018
kbenoit added a commit that referenced this issue Oct 4, 2018
@kbenoit kbenoit closed this as completed Oct 4, 2018
@kbenoit kbenoit reopened this Oct 4, 2018
@koheiw
Collaborator Author

koheiw commented Oct 5, 2018

The above code replicates the current textstat_dist() and ExPosition::chi2Dist(), but I am not sure why we are taking the square root of the column proportions... There is nothing like that in the computation of profiles, as far as I understand from http://www.econ.upf.edu/~michael/stanford/maeb4.pdf.

ExPosition::chi2Dist(as.matrix(mt))$D
       docs
docs       text1    text2    text3    text4
  text1 0.000000 1.916667 3.708667 5.783333
  text2 1.916667 0.000000 1.325333 4.866667
  text3 3.708667 1.325333 0.000000 2.165333
  text4 5.783333 4.866667 2.165333 0.000000

Another mystery is that proxy::dist() produces completely different scores:

as.matrix(proxy::dist(as.matrix(mt), method = "Chi-squared", by_rows = TRUE))
       text1  text2  text3  text4
text1  0.000 -2.750 -5.250 -0.875
text2 -2.750  0.000 -4.000 -0.875
text3 -5.250 -4.000  0.000 -0.875
text4 -0.875 -0.875 -0.875  0.000 

Better to check.

@kbenoit
Collaborator

kbenoit commented Oct 5, 2018

The proxy function is a different formula, and it's a similarity rather than a distance measure.

> pr_DB$get_entry("Chi-squared")
      names Chi-squared
        FUN pr_ChiSquared
   distance FALSE
     PREFUN NA
    POSTFUN NA
    convert pr_simil2dist
       type nominal
       loop TRUE
      C_FUN FALSE
    PACKAGE proxy
       abcd FALSE
    formula sum_ij (o_i - e_i)^2 / e_i
  reference Anderberg, M.R. (1973). Cluster Analysis for Applicaitons. Academic Press.
description Sum of standardized squared deviations from observed and expected values in
            a cross-tab for x and y.

@kbenoit
Collaborator

kbenoit commented Oct 5, 2018

Not sure why the Greenacre pdf has a different value, but it does.

mt <- dfm(c("a a b c", "a b b d", "b d d e e", "e e e"))
mtprof <- dfm_weight(mt, "prop")
avgp <- colMeans(mtprof)

# canned methods
#
textstat_dist(mt, method = "chisquared")
##          text1    text2    text3
## text2 1.916667                  
## text3 3.708667 1.325333         
## text4 5.783333 4.866667 2.165333
as.dist(ExPosition::chi2Dist(as.matrix(mt))$D)
##          text1    text2    text3
## text2 1.916667                  
## text3 3.708667 1.325333         
## text4 5.783333 4.866667 2.165333

# based on http://www.econ.upf.edu/~michael/stanford/maeb4.pdf Eq 4.8
#
# difference between text1 and text2
sqrt(sum((mtprof[1, ] - mtprof[2, ])^2 / avgp))
## [1] 1.407518
# difference between text1 and text3
sqrt(sum((mtprof[1, ] - mtprof[3, ])^2 / avgp))
## [1] 1.945666
# difference between text1 and text4
sqrt(sum((mtprof[1, ] - mtprof[4, ])^2 / avgp))
## [1] 2.335302
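Redoing both computations in NumPy (my own sketch, following the R code above) shows the two places where they diverge: the ExPosition/quanteda version uses column masses (colSums/total) as the average profile and reports the squared distance, while the Eq 4.8 computation above uses the column means of the row profiles and takes the square root:

```python
import numpy as np

# Same toy counts as dfm(c("a a b c", "a b b d", "b d d e e", "e e e"))
mt = np.array([
    [2, 1, 1, 0, 0],
    [1, 2, 0, 1, 0],
    [0, 1, 0, 2, 2],
    [0, 0, 0, 0, 3],
], dtype=float)

prof = mt / mt.sum(axis=1, keepdims=True)   # row profiles

# ExPosition / quanteda: average profile = column masses, no square root
mass = mt.sum(axis=0) / mt.sum()
d_expo = ((prof[0] - prof[1]) ** 2 / mass).sum()
print(round(d_expo, 6))                     # 1.916667

# Eq 4.8 as computed above: average profile = colMeans(prof), with sqrt
avgp = prof.mean(axis=0)
d_green = np.sqrt(((prof[0] - prof[1]) ** 2 / avgp).sum())
print(round(d_green, 6))                    # 1.407518
```

Note that the square root alone does not account for the gap (sqrt(1.916667) is about 1.384, not 1.407518); the two definitions of the average profile also differ whenever documents have unequal lengths.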

@kbenoit
Collaborator

kbenoit commented Oct 5, 2018

Moved to #1444.

@kbenoit kbenoit closed this as completed Oct 5, 2018
kbenoit added a commit that referenced this issue Oct 17, 2018
From Greenacre handout http://www.econ.upf.edu/~michael/stanford/maeb4.pdf

In case we wish to return to #1442