tf-idf weighting needs careful cross-checking #997

@kbenoit

Our computation of tf-idf needs careful checking: the way we compute inverse document frequency differs from other implementations, as the comparison with tm below shows.
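
To make the comparison concrete, here is a minimal base-R sketch of the inverse document frequency variants that may be in play. The helper names `idf_log10`, `idf_log2`, and `idf_smoothed` are hypothetical (they are not quanteda or tm functions), and the formulas are assumptions about what each implementation does, which is exactly the part that needs checking.

```r
# Hypothetical helpers for comparing idf variants; not part of quanteda or tm.
# N  = number of documents in the collection
# df = number of documents in which a feature occurs

# Plain inverse document frequency with a base-10 logarithm
idf_log10 <- function(N, df) log10(N / df)

# The same quantity with a base-2 logarithm
idf_log2 <- function(N, df) log2(N / df)

# Smoothed variant (assumed form): log10(k + N / (smoothing + df))
idf_smoothed <- function(N, df, k = 1, smoothing = 1) {
    log10(k + N / (smoothing + df))
}

N  <- 9           # assumed collection size, for illustration
df <- c(9, 5, 1)  # illustrative document frequencies

idf_log10(N, df)     # zero for a feature found in every document
idf_log2(N, df)      # differs from the log10 version by a constant factor
idf_smoothed(N, df)  # stays positive even when df == N
```

The unsmoothed variants give a feature that occurs in every document a weight of zero, which would match the all-zero "immigration" column in the first quanteda output below, while the smoothed variant keeps it positive.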

We also had an error in the tfidf.dfm() signature, which is now corrected in the hotfix-tfidf branch. Previously, the generic's default was scheme_tf = "prop", but the tfidf.dfm() method used scheme_tf = "count". There might be a reason why I defaulted this to "count"; please check the first source below.
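
As a reminder of why the mismatch matters, the sketch below uses a hypothetical generic `demo_weight()` (not a quanteda function) whose method declares a different default from the generic. In S3 dispatch, an argument the caller omits is filled in from the method's default, so the default shown in the generic's signature is not the one that takes effect.

```r
# Hypothetical generic and method, used only to illustrate the mismatch.
demo_weight <- function(x, scheme_tf = "prop") {
    UseMethod("demo_weight")
}

# The method declares a different default from the generic.
demo_weight.default <- function(x, scheme_tf = "count") {
    scheme_tf
}

# Argument omitted: the method's default ("count") is the one that applies.
demo_weight(1)

# Argument supplied: the explicit value is passed through unchanged.
demo_weight(1, scheme_tf = "prop")
```

A user reading only the generic's signature would expect "prop" from the first call, which is the kind of silent disagreement the hotfix removes.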

See:

library("quanteda")
packageVersion("quanteda")

data_dfm_ukimmig2010 <- dfm(data_char_ukimmig2010)

# quanteda tf-idf
head(tfidf(data_dfm_ukimmig2010, scheme_tf = "prop"))
# Document-feature matrix of: 6 documents, 6 features (44.4% sparse).
# 6 x 6 sparse Matrix of class "dfm"
#               features
# docs           immigration            :           an unparalleled       crisis        which
#   BNP                    0 0.0009339238 2.339292e-04 0.0002909276 0.0005818552 0.0017923514
#   Coalition              0 0            1.967405e-04 0            0            0           
#   Conservative           0 0.0010231363 2.050201e-04 0            0            0           
#   Greens                 0 0.0007519072 7.533508e-05 0            0            0           
#   Labour                 0 0            1.497878e-04 0            0            0.0009563873
#   LibDem                 0 0.0010570290 1.059058e-04 0            0            0           
head(tfidf(data_dfm_ukimmig2010, scheme_tf = "prop", k = 1, smoothing = 1))
# Document-feature matrix of: 6 documents, 6 features (44.4% sparse).
# 6 x 6 sparse Matrix of class "dfm"
#               features
# docs           immigration           :           an unparalleled       crisis        which
#   BNP          0.001784703 0.001455878 0.0013766616 0.0002257203 0.0004514407 0.0016519939
#   Coalition    0.006432775 0           0.0011578077 0            0            0           
#   Conservative 0.001675873 0.001594950 0.0012065330 0            0            0           
#   Greens       0.003284284 0.001172136 0.0004433431 0            0            0           
#   Labour       0.005305705 0           0.0008814934 0            0            0.0008814934
#   LibDem       0.002885648 0.001647785 0.0006232505 0            0            0           



# tm tf-idf
library("tm")
data_DTM_ukimmig2010 <- convert(data_dfm_ukimmig2010, to = "tm")
as.matrix(weightTfIdf(data_DTM_ukimmig2010))[1:6, 1:6]
#               Terms
# Docs           immigration           :           an unparalleled      crisis       which
#   BNP                    0 0.003102428 0.0007770960 0.0009664405 0.001932881 0.005954063
#   Coalition              0 0.000000000 0.0006535577 0.0000000000 0.000000000 0.000000000
#   Conservative           0 0.003398785 0.0006810621 0.0000000000 0.000000000 0.000000000
#   Greens                 0 0.002497782 0.0002502577 0.0000000000 0.000000000 0.000000000
#   Labour                 0 0.000000000 0.0004975842 0.0000000000 0.000000000 0.003177050
#   LibDem                 0 0.003511374 0.0003518116 0.0000000000 0.000000000 0.000000000
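
A quick check on the numbers above: dividing the tm values by the unsmoothed quanteda values for the BNP row gives a constant ratio. The sketch below copies those printed values by hand (the variable names are mine) and compares the ratio with log2(10). A match would be consistent with tm using a base-2 logarithm where quanteda uses base 10, but that is an inference from the output, not something confirmed against either package's source.

```r
# Values copied from the BNP row of the outputs above
# (features ":", "an", "unparalleled", "crisis", "which").
quanteda_bnp <- c(0.0009339238, 2.339292e-04, 0.0002909276,
                  0.0005818552, 0.0017923514)
tm_bnp       <- c(0.003102428, 0.0007770960, 0.0009664405,
                  0.001932881, 0.005954063)

# Element-wise ratio of the two weightings
ratio <- tm_bnp / quanteda_bnp
ratio

# Conversion factor between base-2 and base-10 logarithms
log2(10)

# TRUE if every ratio agrees with log2(10) up to print rounding
all(abs(ratio - log2(10)) < 1e-3)
```

If the ratio holds across the other rows as well, the two unsmoothed weightings differ only by the log base, and the remaining question is which default we want.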
