Our tf-idf computation needs to be checked carefully: implementations differ in how they compute inverse document frequency.
We also had an error in the tfidf.dfm() signature, which is now corrected in branch hotfix-tfidf. Previously, the generic default was scheme_tf = "prop", but the tfidf.dfm() method used scheme_tf = "count". There may have been a reason why I defaulted this to "count"; please check the first source below.
See:
```r
library("quanteda")
packageVersion("quanteda")
data_dfm_ukimmig2010 <- dfm(data_char_ukimmig2010)

# quanteda tf-idf
head(tfidf(data_dfm_ukimmig2010, scheme_tf = "prop"))
# Document-feature matrix of: 6 documents, 6 features (44.4% sparse).
# 6 x 6 sparse Matrix of class "dfm"
#               features
# docs           immigration            :           an unparalleled       crisis        which
#   BNP                    0 0.0009339238 2.339292e-04 0.0002909276 0.0005818552 0.0017923514
#   Coalition              0            0 1.967405e-04            0            0            0
#   Conservative           0 0.0010231363 2.050201e-04            0            0            0
#   Greens                 0 0.0007519072 7.533508e-05            0            0            0
#   Labour                 0            0 1.497878e-04            0            0 0.0009563873
#   LibDem                 0 0.0010570290 1.059058e-04            0            0            0

head(tfidf(data_dfm_ukimmig2010, scheme_tf = "prop", k = 1, smoothing = 1))
# Document-feature matrix of: 6 documents, 6 features (44.4% sparse).
# 6 x 6 sparse Matrix of class "dfm"
#               features
# docs           immigration           :           an unparalleled       crisis        which
#   BNP          0.001784703 0.001455878 0.0013766616 0.0002257203 0.0004514407 0.0016519939
#   Coalition    0.006432775           0 0.0011578077            0            0            0
#   Conservative 0.001675873 0.001594950 0.0012065330            0            0            0
#   Greens       0.003284284 0.001172136 0.0004433431            0            0            0
#   Labour       0.005305705           0 0.0008814934            0            0 0.0008814934
#   LibDem       0.002885648 0.001647785 0.0006232505            0            0            0

# tm tf-idf
library("tm")
data_DTM_ukimmig2010 <- convert(data_dfm_ukimmig2010, to = "tm")
as.matrix(weightTfIdf(data_DTM_ukimmig2010))[1:6, 1:6]
#               Terms
# Docs           immigration           :           an unparalleled      crisis       which
#   BNP                    0 0.003102428 0.0007770960 0.0009664405 0.001932881 0.005954063
#   Coalition              0 0.000000000 0.0006535577 0.0000000000 0.000000000 0.000000000
#   Conservative           0 0.003398785 0.0006810621 0.0000000000 0.000000000 0.000000000
#   Greens                 0 0.002497782 0.0002502577 0.0000000000 0.000000000 0.000000000
#   Labour                 0 0.000000000 0.0004975842 0.0000000000 0.000000000 0.003177050
#   LibDem                 0 0.003511374 0.0003518116 0.0000000000 0.000000000 0.000000000
```
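The printed values above look consistent with the two packages differing only in the base of the logarithm: each nonzero tm value is the matching default quanteda value times log2(10), roughly 3.32 (for BNP and ":", 0.0009339238 × 3.3219 ≈ 0.0031024). Below is a minimal base-R sketch of that comparison. It assumes proportional term frequency and an idf of the form log(k + N / (df + smoothing)); that formula is inferred from the outputs, not read from either package's source, and the helper `tfidf_sketch()` and the toy counts are hypothetical.

```r
# Toy document-feature counts (hypothetical data, not the UK manifestos)
counts <- matrix(c(2, 1, 0,
                   1, 0, 3,
                   4, 0, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(docs = c("d1", "d2", "d3"),
                                 features = c("a", "b", "c")))

# Hypothetical helper: proportional tf times idf,
# with idf assumed to be log(k + N / (df + smoothing)) in the given base
tfidf_sketch <- function(x, base = 10, k = 0, smoothing = 0) {
  tf <- x / rowSums(x)        # term frequency as a within-document proportion
  df <- colSums(x > 0)        # number of documents containing each feature
  idf <- log(k + nrow(x) / (df + smoothing), base = base)
  sweep(tf, 2, idf, `*`)      # scale each feature column by its idf
}

w10 <- tfidf_sketch(counts, base = 10)  # base 10, as the quanteda output suggests
w2  <- tfidf_sketch(counts, base = 2)   # base 2, as the tm output suggests

# Changing the base only rescales every weight by the constant log2(10)
all.equal(w2, w10 * log2(10))

# Same check on one value from the outputs above (BNP, ":")
0.0009339238 * log2(10)
```

If that reading is right, the unsmoothed results agree up to a constant factor, and the remaining question is which base and which smoothing defaults we want.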