
tf-idf weighting needs careful cross-checking #997

Closed

kbenoit opened this issue Sep 26, 2017 · 5 comments

@kbenoit (Collaborator) commented Sep 26, 2017

Our computation of tf-idf needs to be carefully cross-checked, since implementations differ in how the inverse document frequency is computed.

We also had an error in the tfidf.dfm() signature, but this is now corrected in the hotfix-tfidf branch. Previously, the generic's default was scheme_tf = "prop" while tfidf.dfm()'s was scheme_tf = "count". There might be a reason why I defaulted this to "count"; please check the first source below.
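
For context, here is a toy illustration (a hypothetical generic, not quanteda's actual code) of why the mismatch matters: under S3 dispatch the method's own default wins, so a user reading the generic's signature would expect "prop" but actually get "count":

f <- function(x, scheme_tf = "prop") UseMethod("f")      # documented default
f.default <- function(x, scheme_tf = "count") scheme_tf  # method's default wins
f(1)
# [1] "count"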

See:

library("quanteda")
packageVersion("quanteda")

data_dfm_ukimmig2010 <- dfm(data_char_ukimmig2010)

# quanteda tf-idf
head(tfidf(data_dfm_ukimmig2010, scheme_tf = "prop"))
# Document-feature matrix of: 6 documents, 6 features (44.4% sparse).
# 6 x 6 sparse Matrix of class "dfm"
#               features
# docs           immigration            :           an unparalleled       crisis        which
#   BNP                    0 0.0009339238 2.339292e-04 0.0002909276 0.0005818552 0.0017923514
#   Coalition              0 0            1.967405e-04 0            0            0           
#   Conservative           0 0.0010231363 2.050201e-04 0            0            0           
#   Greens                 0 0.0007519072 7.533508e-05 0            0            0           
#   Labour                 0 0            1.497878e-04 0            0            0.0009563873
#   LibDem                 0 0.0010570290 1.059058e-04 0            0            0           
head(tfidf(data_dfm_ukimmig2010, scheme_tf = "prop", k = 1, smoothing = 1))
# Document-feature matrix of: 6 documents, 6 features (44.4% sparse).
# 6 x 6 sparse Matrix of class "dfm"
#               features
# docs           immigration           :           an unparalleled       crisis        which
#   BNP          0.001784703 0.001455878 0.0013766616 0.0002257203 0.0004514407 0.0016519939
#   Coalition    0.006432775 0           0.0011578077 0            0            0           
#   Conservative 0.001675873 0.001594950 0.0012065330 0            0            0           
#   Greens       0.003284284 0.001172136 0.0004433431 0            0            0           
#   Labour       0.005305705 0           0.0008814934 0            0            0.0008814934
#   LibDem       0.002885648 0.001647785 0.0006232505 0            0            0

# tm tf-idf
library("tm")
data_DTM_ukimmig2010 <- convert(data_dfm_ukimmig2010, to = "tm")
as.matrix(weightTfIdf(data_DTM_ukimmig2010))[1:6, 1:6]
#               Terms
# Docs           immigration           :           an unparalleled      crisis       which
#   BNP                    0 0.003102428 0.0007770960 0.0009664405 0.001932881 0.005954063
#   Coalition              0 0.000000000 0.0006535577 0.0000000000 0.000000000 0.000000000
#   Conservative           0 0.003398785 0.0006810621 0.0000000000 0.000000000 0.000000000
#   Greens                 0 0.002497782 0.0002502577 0.0000000000 0.000000000 0.000000000
#   Labour                 0 0.000000000 0.0004975842 0.0000000000 0.000000000 0.003177050
#   LibDem                 0 0.003511374 0.0003518116 0.0000000000 0.000000000 0.000000000
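
For what it's worth, the nonzero cells of the two matrices appear to differ by a constant factor of log2(10), which would point at the log base, rather than the tf scheme, as the only difference in this comparison:

# e.g. the ":" value for BNP in tm vs quanteda (scheme_tf = "prop"):
0.003102428 / 0.0009339238
# [1] 3.321928
log(10, base = 2)  # = log2(10)
# [1] 3.321928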
@HaiyanLW (Collaborator)

It seems we just need to set the base of the logarithm to 2:

> head(tfidf(data_dfm_ukimmig2010, scheme_tf = "prop",base=2))
Document-feature matrix of: 6 documents, 6 features (44.4% sparse).
6 x 6 sparse Matrix of class "dfmSparse"
              features
docs           immigration           :           an unparalleled      crisis       which
  BNP                    0 0.003102428 0.0007770960 0.0009664405 0.001932881 0.005954063
  Coalition              0 0           0.0006535577 0            0           0          
  Conservative           0 0.003398785 0.0006810621 0            0           0          
  Greens                 0 0.002497782 0.0002502577 0            0           0          
  Labour                 0 0           0.0004975842 0            0           0.003177050
  LibDem                 0 0.003511374 0.0003518116 0            0           0          
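
Since log2(x) = log10(x) / log10(2), changing the base should just rescale every cell by a constant; a quick sanity check along these lines (a sketch against the same data, assuming as.matrix() coercion of the dfm) should confirm it:

all.equal(
  as.matrix(tfidf(data_dfm_ukimmig2010, scheme_tf = "prop", base = 2)),
  as.matrix(tfidf(data_dfm_ukimmig2010, scheme_tf = "prop")) / log10(2)
)
# expected: TRUE, since the base enters only through the idf logarithm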

@kbenoit (Collaborator, Author) commented Sep 26, 2017

OK, so tm::weightTfIdf() uses relative term frequencies and log base 2, while quanteda::tfidf() by default uses term counts and log base 10. Good to have that settled!
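
To make the two conventions concrete, here is a minimal base-R sketch on a toy matrix (not the internals of either package; all object names are made up):

m  <- matrix(c(2, 0, 1,
               0, 3, 1),
             nrow = 2, byrow = TRUE,
             dimnames = list(c("doc1", "doc2"), c("a", "b", "c")))
df <- colSums(m > 0)   # document frequencies
N  <- nrow(m)          # number of documents

# quanteda-style default: raw counts x log10(N / df)
tfidf_counts <- sweep(m, 2, log10(N / df), "*")

# tm-style: relative frequencies x log2(N / df)
tfidf_rel <- sweep(m / rowSums(m), 2, log2(N / df), "*")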

Based on your reading of the literature, do our defaults, including those passed to docfreq() (for smoothing and k), look defensible?

@HaiyanLW (Collaborator)

Yes, k = 0 is defensible for sure. And I saw that you have already implemented plenty of options for smoothing (and k) in dfm_weight().

@kbenoit (Collaborator, Author) commented Sep 26, 2017

Great, thanks. What’s the “consensus” (if any!) about the logarithm base for idf, and for the form of the term frequency? Is term frequency typically a relative frequency, or a count, in most formulations of tf-idf?

@HaiyanLW (Collaborator)

I think tf is typically a count. And I doubt there is any "consensus" on the log base; I don't think it matters for comparisons as long as the same base is used across a collection.
