
Remove Chi-square and Hamming distance measure #1442

Closed
koheiw opened this issue Oct 3, 2018 · 9 comments

@koheiw
Collaborator

koheiw commented Oct 3, 2018

In relation to #1210, we should remove the Chi-square and Hamming distance measures.

Chi-square

I don't think this is a good measure, because the distance between the same pair of documents changes depending on the size of the corpus. This does not seem to be a bug, as ExPosition::chi2Dist() behaves the same way.

> mt <- dfm(c("a a b c", "a b b d", "b d d e e", "e e e"))
> textstat_dist(mt[1:3,], method = "chisquared")
         text1    text2
text2 1.557292         
text3 3.637292 1.700833

> textstat_dist(mt[1:4,], method = "chisquared")
         text1    text2    text3
text2 1.916667                  
text3 3.708667 1.325333         
text4 5.783333 4.866667 2.165333

> as.matrix(ExPosition::chi2Dist(as.matrix(mt[1:3,]))$D)
       docs
docs       text1    text2    text3
  text1 0.000000 1.557292 3.637292
  text2 1.557292 0.000000 1.700833
  text3 3.637292 1.700833 0.000000

> as.matrix(ExPosition::chi2Dist(as.matrix(mt[1:4,]))$D)
       docs
docs       text1    text2    text3    text4
  text1 0.000000 1.916667 3.708667 5.783333
  text2 1.916667 0.000000 1.325333 4.866667
  text3 3.708667 1.325333 0.000000 2.165333
  text4 5.783333 4.866667 2.165333 0.000000

Hamming

According to https://en.wikipedia.org/wiki/Hamming_distance, the positions of elements matter in Hamming distance (as in Levenshtein distance). This measure is more suitable to compute on tokens objects by adding a textstat_dist.tokens() method.

@kbenoit
Collaborator

kbenoit commented Oct 3, 2018

Good points. The chi-squared distance varies with the matrix size because, across the dimension not being computed, the formula contains a quantity called the "average profile": the average proportion for that dimension. So if we compare the three- and four-row subsets above when computing distances between documents, the average feature proportion changes when the fourth document is included.

> dfm_weight(mt, "prop")
Document-feature matrix of: 4 documents, 5 features (50% sparse).
4 x 5 sparse Matrix of class "dfm"
       features
docs       a    b    c    d   e
  text1 0.50 0.25 0.25 0    0  
  text2 0.25 0.50 0    0.25 0  
  text3 0    0.20 0    0.40 0.4
  text4 0    0    0    0    1.0
> dfm_weight(mt[1:3, ], "prop") %>% colSums()
   a    b    c    d    e 
0.75 0.95 0.25 0.65 0.40 
> dfm_weight(mt[1:4, ], "prop") %>% colSums()
   a    b    c    d    e 
0.75 0.95 0.25 0.65 1.40 

That's why the quantities are different. But this measure is important in correspondence analysis, and anyone using it ought to be aware of this "feature" of its behaviour. So, it's not a bug.
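To make the dependence concrete outside quanteda, here is a small NumPy re-implementation (my own sketch, not quanteda's code) of the chi-squared distance with column masses as the average profile, run on the same toy counts. The text1/text2 value shifts exactly as in the R output above when the fourth document enters the corpus:

```python
import numpy as np

# Toy counts for "a a b c", "a b b d", "b d d e e", "e e e"
# over features a, b, c, d, e -- matching the dfm above.
mt = np.array([
    [2, 1, 1, 0, 0],
    [1, 2, 0, 1, 0],
    [0, 1, 0, 2, 2],
    [0, 0, 0, 0, 3],
], dtype=float)

def chisq_dist(x):
    """Squared chi-squared distances between rows, using the column
    masses of the *whole* matrix as the average profile."""
    col_mass = x.sum(axis=0) / x.sum()           # average profile
    prof = x / x.sum(axis=1, keepdims=True)      # row profiles
    diff = prof[:, None, :] - prof[None, :, :]   # all pairwise differences
    return (diff ** 2 / col_mass).sum(axis=2)

d3 = chisq_dist(mt[:3])   # three-document corpus
d4 = chisq_dist(mt)       # four-document corpus

print(round(d3[0, 1], 6))  # 1.557292
print(round(d4[0, 1], 6))  # 1.916667
```

Same pair of documents, two different distances, purely because the column masses changed.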

For the Hamming distance, you are right that it is not appropriate here.

  1. The vectors in a dfm are unordered, and therefore do not capture the essence of the Hamming distance, which is the number of positions at which two vectors differ. (In the textstat_dist() code, we convert all features to binary.) So below, we get the Hamming distance between d1 and d3 completely wrong.
docs <- c(d1 = "a b c d e", 
          d2 = "a b c e f",
          d3 = "b a d e c")
toks <- tokens(docs)
d <- dfm(toks)
d
## Document-feature matrix of: 3 documents, 6 features (16.7% sparse).
## 3 x 6 sparse Matrix of class "dfm"
##     features
## docs a b c d e f
##   d1 1 1 1 1 1 0
##   d2 1 1 1 0 1 1
##   d3 1 1 1 1 1 0

# correct hamming distances 
sum(toks[["d1"]] != toks[["d2"]])
## [1] 2
sum(toks[["d1"]] != toks[["d3"]])
## [1] 5

# wrong
textstat_dist(d, method = "hamming")
##    d1 d2
## d2  2   
## d3  0  2
  2. Another constraint on the Hamming distance is that the two vectors be of the same length. Token streams almost never are, although the dfm construction makes all dimensions equal in length. The "R" way to fix this would be recycling the shorter vector, but that can produce really oddball results.
e1071::hamming.distance(1:3, rep(1:3, 2))
## [1] 0
e1071::hamming.distance(1:3, 1:4)
## [1] 1
## Warning message:
## In x != y : longer object length is not a multiple of shorter object length

So I propose we solve this by removing this measure altogether, rather than write a method for tokens objects.
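For reference, a positional Hamming distance on token sequences is a one-liner once the equal-length constraint is enforced. This is a hedged sketch of what a tokens-based method would have to compute, not quanteda code; it reproduces the "correct" values from the R example above:

```python
def hamming(a, b):
    """Positional Hamming distance between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(x != y for x, y in zip(a, b))

d1 = "a b c d e".split()
d2 = "a b c e f".split()
d3 = "b a d e c".split()

print(hamming(d1, d2))  # 2
print(hamming(d1, d3))  # 5  (the dfm-based version returned 0)
```

Raising an error on unequal lengths avoids the recycling surprise shown with e1071::hamming.distance() above.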

@koheiw
Collaborator Author

koheiw commented Oct 3, 2018

I know that it isn't a bug, but it seems better to compute the chi-squared distance in two steps: (1) perform average-profile weighting in dfm_weight(), and (2) compute the distance between pairs of documents/features in textstat_simil2(). This is not only to keep the structure of the parallelized similarity function simple, but also to make the measure's dependency on the shape of the entire corpus explicit.

We can drop Hamming entirely. I am interested in those positional similarity measures, so I will bring them back along with Levenshtein distance etc. in quanteda or an extension package in the future.

@kbenoit
Collaborator

kbenoit commented Oct 3, 2018

OK so agreed on Hamming.

On chi-squared distance, however, I am not sure what you mean by separating the steps, since the average profile depends on the dimensions of the matrix, which means that two documents will have a different distance on this measure depending on whether a third document is included. I don't see a way around that for this measure, but anyone using it (which is probably very few, for text!) should know this.

@koheiw
Collaborator Author

koheiw commented Oct 3, 2018

First step: the shape of the entire corpus matters (e.g. sum(x))

quanteda/R/textstat_dist.R

Lines 322 to 335 in e2d8005

avg <- if (margin == 2) sqrt(rowSums(x) / sum(x)) else sqrt(colSums(x) / sum(x))
n <- if (margin == 2) ncol(x) else nrow(x)
rowname <- func_name(x)
colname <- func_name(x)
if (margin == 1) {
    # convert into profiles
    x <- x / func_sum(x)
    # weighted by the average profiles
    x <- x %*% diag(1 / avg)
} else {
    x <- x %*% diag(1 / func_sum(x))
    x <- x / avg
}

Second step: operations are on pairs of document/features

quanteda/R/textstat_dist.R

Lines 361 to 365 in e2d8005

an <- func_sum(x ^ 2)
tmp <- matrix(rep(an, n), nrow = n)
tmp <- tmp + matrix(rep(an, n), nrow = n, byrow = TRUE)
chimat <- tmp - 2 * as.matrix(func_cp(x))
#chimat <- sqrt(round(chimat, 2))

@koheiw
Collaborator Author

koheiw commented Oct 4, 2018

Here is an example of two-step approach:

dfm_avgp <- function(x, margin = "documents") { # should be dfm_weight
    
    if (margin == "features")
        x <- t(x)
    name <- colnames(x)
    
    # compute average-profile
    avg <- sqrt(colSums(x) / sum(x))
    x <- x / rowSums(x)
    x <- x %*% diag(1 / avg)
    
    colnames(x) <- name
    if (margin == "features")
        x <- t(x)
    return(as.dfm(x))
}

mt <- dfm(c("a a b c", "a b b d", "b d d e e", "e e e"))
textstat_dist(mt, method = "chisquared")

#          text1    text2    text3
# text2 1.916667                  
# text3 3.708667 1.325333         
# text4 5.783333 4.866667 2.165333

mt2 <- dfm_avgp(mt)
textstat_simil2(mt2, method = "chisquared")
# 4 x 4 sparse Matrix of class "dsTMatrix"
#          text1    text2    text3    text4
# text1 .        1.916667 3.708667 5.783333
# text2 1.916667 .        1.325333 4.866667
# text3 3.708667 1.325333 .        2.165333
# text4 5.783333 4.866667 2.165333 .       

textstat_simil2(mt2[1:2,], method = "chisquared")
# 2 x 2 sparse Matrix of class "dsTMatrix"
#          text1    text2
# text1 .        1.916667
# text2 1.916667 .       

If a DFM is weighted before being passed to textstat_simil2(), the distance between documents does not change under subsetting. We can ensure prior weighting by checking x@weightTf[["scheme"]].
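The invariance is easy to demonstrate outside R as well. This NumPy sketch (my own, mirroring the dfm_avgp() function above) weights once on the full matrix and then applies a plain pairwise squared Euclidean distance, so subsetting the weighted matrix cannot change pairwise values:

```python
import numpy as np

# Same toy counts as dfm(c("a a b c", "a b b d", "b d d e e", "e e e"))
mt = np.array([
    [2, 1, 1, 0, 0],
    [1, 2, 0, 1, 0],
    [0, 1, 0, 2, 2],
    [0, 0, 0, 0, 3],
], dtype=float)

def avgp_weight(x):
    """Step 1: average-profile weighting (mirrors dfm_avgp above)."""
    avg = np.sqrt(x.sum(axis=0) / x.sum())
    return (x / x.sum(axis=1, keepdims=True)) / avg

def sq_euclidean(x):
    """Step 2: plain pairwise squared Euclidean distances between rows."""
    diff = x[:, None, :] - x[None, :, :]
    return (diff ** 2).sum(axis=2)

w = avgp_weight(mt)
print(round(sq_euclidean(w)[0, 1], 6))      # 1.916667
print(round(sq_euclidean(w[:2])[0, 1], 6))  # 1.916667 -- unchanged by subsetting
```

All of the corpus-shape dependence lives in step 1; step 2 is an ordinary pairwise operation.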

kbenoit added a commit that referenced this issue Oct 4, 2018
kbenoit added a commit that referenced this issue Oct 4, 2018
@kbenoit kbenoit closed this as completed Oct 4, 2018
@kbenoit kbenoit reopened this Oct 4, 2018
@koheiw
Collaborator Author

koheiw commented Oct 5, 2018

The above code replicates the current textstat_dist() and ExPosition::chi2Dist(), but I am not sure why we are taking the square root of the column proportions... There is nothing like that in the computation of profiles, as far as I understand from http://www.econ.upf.edu/~michael/stanford/maeb4.pdf.

ExPosition::chi2Dist(as.matrix(mt))$D
       docs
docs       text1    text2    text3    text4
  text1 0.000000 1.916667 3.708667 5.783333
  text2 1.916667 0.000000 1.325333 4.866667
  text3 3.708667 1.325333 0.000000 2.165333
  text4 5.783333 4.866667 2.165333 0.000000

Another mystery is that proxy::dist() produces completely different scores:

as.matrix(proxy::dist(as.matrix(mt), method = "Chi-squared", by_rows = TRUE))
       text1  text2  text3  text4
text1  0.000 -2.750 -5.250 -0.875
text2 -2.750  0.000 -4.000 -0.875
text3 -5.250 -4.000  0.000 -0.875
text4 -0.875 -0.875 -0.875  0.000 

Better to check.

@kbenoit
Collaborator

kbenoit commented Oct 5, 2018

The proxy function is a different formula, and it's a similarity rather than a distance measure.

> pr_DB$get_entry("Chi-squared")
      names Chi-squared
        FUN pr_ChiSquared
   distance FALSE
     PREFUN NA
    POSTFUN NA
    convert pr_simil2dist
       type nominal
       loop TRUE
      C_FUN FALSE
    PACKAGE proxy
       abcd FALSE
    formula sum_ij (o_i - e_i)^2 / e_i
  reference Anderberg, M.R. (1973). Cluster Analysis for Applicaitons. Academic Press.
description Sum of standardized squared deviations from observed and expected values in
            a cross-tab for x and y.

@kbenoit
Collaborator

kbenoit commented Oct 5, 2018

Not sure why the Greenacre pdf has a different value, but it does.

mt <- dfm(c("a a b c", "a b b d", "b d d e e", "e e e"))
mtprof <- dfm_weight(mt, "prop")
avgp <- colMeans(mtprof)

# canned methods
#
textstat_dist(mt, method = "chisquared")
##          text1    text2    text3
## text2 1.916667                  
## text3 3.708667 1.325333         
## text4 5.783333 4.866667 2.165333
as.dist(ExPosition::chi2Dist(as.matrix(mt))$D)
##          text1    text2    text3
## text2 1.916667                  
## text3 3.708667 1.325333         
## text4 5.783333 4.866667 2.165333

# based on http://www.econ.upf.edu/~michael/stanford/maeb4.pdf Eq 4.8
#
# difference between text1 and text2
sqrt(sum((mtprof[1, ] - mtprof[2, ])^2 / avgp))
## [1] 1.407518
# difference between text1 and text3
sqrt(sum((mtprof[1, ] - mtprof[3, ])^2 / avgp))
## [1] 1.945666
# difference between text1 and text4
sqrt(sum((mtprof[1, ] - mtprof[4, ])^2 / avgp))
## [1] 2.335302
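Redoing both computations in NumPy (my own sketch, following the R code above) shows the two places where they diverge: the ExPosition/quanteda version uses column masses (colSums/total) as the average profile and reports the squared distance, while the Eq 4.8 computation above uses the column means of the row profiles and takes the square root:

```python
import numpy as np

# Same toy counts as dfm(c("a a b c", "a b b d", "b d d e e", "e e e"))
mt = np.array([
    [2, 1, 1, 0, 0],
    [1, 2, 0, 1, 0],
    [0, 1, 0, 2, 2],
    [0, 0, 0, 0, 3],
], dtype=float)

prof = mt / mt.sum(axis=1, keepdims=True)   # row profiles

# ExPosition / quanteda: average profile = column masses, no square root
mass = mt.sum(axis=0) / mt.sum()
d_expo = ((prof[0] - prof[1]) ** 2 / mass).sum()
print(round(d_expo, 6))                     # 1.916667

# Eq 4.8 as computed above: average profile = colMeans(prof), with sqrt
avgp = prof.mean(axis=0)
d_green = np.sqrt(((prof[0] - prof[1]) ** 2 / avgp).sum())
print(round(d_green, 6))                    # 1.407518
```

Note that the square root alone does not account for the gap (sqrt(1.916667) is about 1.384, not 1.407518); the two definitions of the average profile also differ whenever documents have unequal lengths.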

@kbenoit
Collaborator

kbenoit commented Oct 5, 2018

Moved to #1444.

@kbenoit kbenoit closed this as completed Oct 5, 2018
kbenoit added a commit that referenced this issue Oct 17, 2018
From Greenacre handout http://www.econ.upf.edu/~michael/stanford/maeb4.pdf

In case we wish to return to #1442