
tf-idf weighting needs careful cross-checking #997

Closed

kbenoit opened this issue Sep 26, 2017 · 5 comments

@kbenoit (Collaborator) commented Sep 26, 2017

Our computation of tf-idf needs to be carefully cross-checked, since implementations differ in how the inverse document frequency is computed.

We also had an error in the tfidf.dfm() signature, but this is now corrected in the hotfix-tfidf branch. Previously, the generic's default was scheme_tf = "prop" while tfidf.dfm()'s was scheme_tf = "count". There might be a reason why I defaulted this to "count"; please check the first source below.
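
For context, here is a toy illustration (a hypothetical generic, not quanteda's actual code) of why the mismatch matters: under S3 dispatch the method's own default wins, so a user reading the generic's signature would expect "prop" but actually get "count":

f <- function(x, scheme_tf = "prop") UseMethod("f")      # documented default
f.default <- function(x, scheme_tf = "count") scheme_tf  # method's default wins
f(1)
# [1] "count"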

See:

library("quanteda")
packageVersion("quanteda")

data_dfm_ukimmig2010 <- dfm(data_char_ukimmig2010)

# quanteda tf-idf
head(tfidf(data_dfm_ukimmig2010, scheme_tf = "prop"))
# Document-feature matrix of: 6 documents, 6 features (44.4% sparse).
# 6 x 6 sparse Matrix of class "dfm"
#               features
# docs           immigration            :           an unparalleled       crisis        which
#   BNP                    0 0.0009339238 2.339292e-04 0.0002909276 0.0005818552 0.0017923514
#   Coalition              0 0            1.967405e-04 0            0            0           
#   Conservative           0 0.0010231363 2.050201e-04 0            0            0           
#   Greens                 0 0.0007519072 7.533508e-05 0            0            0           
#   Labour                 0 0            1.497878e-04 0            0            0.0009563873
#   LibDem                 0 0.0010570290 1.059058e-04 0            0            0           
head(tfidf(data_dfm_ukimmig2010, scheme_tf = "prop", k = 1, smoothing = 1))
# Document-feature matrix of: 6 documents, 6 features (44.4% sparse).
# 6 x 6 sparse Matrix of class "dfm"
#               features
# docs           immigration           :           an unparalleled       crisis        which
#   BNP          0.001784703 0.001455878 0.0013766616 0.0002257203 0.0004514407 0.0016519939
#   Coalition    0.006432775 0           0.0011578077 0            0            0           
#   Conservative 0.001675873 0.001594950 0.0012065330 0            0            0           
#   Greens       0.003284284 0.001172136 0.0004433431 0            0            0           
#   Labour       0.005305705 0           0.0008814934 0            0            0.0008814934
#   LibDem       0.002885648 0.001647785 0.0006232505 0            0            0

# tm tf-idf
library("tm")
data_DTM_ukimmig2010 <- convert(data_dfm_ukimmig2010, to = "tm")
as.matrix(weightTfIdf(data_DTM_ukimmig2010))[1:6, 1:6]
#               Terms
# Docs           immigration           :           an unparalleled      crisis       which
#   BNP                    0 0.003102428 0.0007770960 0.0009664405 0.001932881 0.005954063
#   Coalition              0 0.000000000 0.0006535577 0.0000000000 0.000000000 0.000000000
#   Conservative           0 0.003398785 0.0006810621 0.0000000000 0.000000000 0.000000000
#   Greens                 0 0.002497782 0.0002502577 0.0000000000 0.000000000 0.000000000
#   Labour                 0 0.000000000 0.0004975842 0.0000000000 0.000000000 0.003177050
#   LibDem                 0 0.003511374 0.0003518116 0.0000000000 0.000000000 0.000000000
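
For what it's worth, the nonzero cells of the two matrices appear to differ by a constant factor of log2(10), which would point at the log base, rather than the tf scheme, as the only difference in this comparison:

# e.g. the ":" value for BNP in tm vs quanteda (scheme_tf = "prop"):
0.003102428 / 0.0009339238
# [1] 3.321928
log(10, base = 2)  # = log2(10)
# [1] 3.321928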
@HaiyanLW (Collaborator)

It seems we just need to set the base of the logarithm to 2:

> head(tfidf(data_dfm_ukimmig2010, scheme_tf = "prop",base=2))
Document-feature matrix of: 6 documents, 6 features (44.4% sparse).
6 x 6 sparse Matrix of class "dfmSparse"
              features
docs           immigration           :           an unparalleled      crisis       which
  BNP                    0 0.003102428 0.0007770960 0.0009664405 0.001932881 0.005954063
  Coalition              0 0           0.0006535577 0            0           0          
  Conservative           0 0.003398785 0.0006810621 0            0           0          
  Greens                 0 0.002497782 0.0002502577 0            0           0          
  Labour                 0 0           0.0004975842 0            0           0.003177050
  LibDem                 0 0.003511374 0.0003518116 0            0           0          
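
Since log2(x) = log10(x) / log10(2), changing the base should just rescale every cell by a constant; a quick sanity check along these lines (a sketch against the same data, assuming as.matrix() coercion of the dfm) should confirm it:

all.equal(
  as.matrix(tfidf(data_dfm_ukimmig2010, scheme_tf = "prop", base = 2)),
  as.matrix(tfidf(data_dfm_ukimmig2010, scheme_tf = "prop")) / log10(2)
)
# expected: TRUE, since the base enters only through the idf logarithm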

@kbenoit (Collaborator, Author) commented Sep 26, 2017

OK, so tm::weightTfIdf() uses relative term frequencies and log base 2, while quanteda::tfidf() by default uses term counts and log base 10. Good to have that settled!
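
To make the two conventions concrete, here is a minimal base-R sketch on a toy matrix (not the internals of either package; all object names are made up):

m  <- matrix(c(2, 0, 1,
               0, 3, 1),
             nrow = 2, byrow = TRUE,
             dimnames = list(c("doc1", "doc2"), c("a", "b", "c")))
df <- colSums(m > 0)   # document frequencies
N  <- nrow(m)          # number of documents

# quanteda-style default: raw counts x log10(N / df)
tfidf_counts <- sweep(m, 2, log10(N / df), "*")

# tm-style: relative frequencies x log2(N / df)
tfidf_rel <- sweep(m / rowSums(m), 2, log2(N / df), "*")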

Based on your reading of the literature, do our defaults, including those passed to docfreq() (for smoothing and k), look defensible?

@HaiyanLW (Collaborator)

Yes, k = 0 is defensible for sure. And I saw that you have already implemented plenty of options for smoothing (and k) in dfm_weight().

@kbenoit (Collaborator, Author) commented Sep 26, 2017

Great, thanks. What’s the “consensus” (if any!) about the logarithm base for idf, and for the form of the term frequency? Is term frequency typically a relative frequency, or a count, in most formulations of tf-idf?

@HaiyanLW (Collaborator)

I think tf is typically a count. And I doubt there is any "consensus" on the log base; I don't think it matters for comparisons as long as the same base is used across a collection.
