tf-idf weighting needs careful cross-checking #997
Comments
It seems we just need to set the base of the logarithm to 2:
> head(tfidf(data_dfm_ukimmig2010, scheme_tf = "prop", base = 2))
Document-feature matrix of: 6 documents, 6 features (44.4% sparse).
6 x 6 sparse Matrix of class "dfmSparse"
               features
docs           immigration           :           an unparalleled      crisis       which
  BNP                    0 0.003102428 0.0007770960 0.0009664405 0.001932881 0.005954063
  Coalition              0           0 0.0006535577            0           0           0
  Conservative           0 0.003398785 0.0006810621            0           0           0
  Greens                 0 0.002497782 0.0002502577            0           0           0
  Labour                 0           0 0.0004975842            0           0 0.003177050
  LibDem                 0 0.003511374 0.0003518116            0           0           0
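To illustrate what the log base does and does not change, here is a minimal base-R sketch. It assumes the plain idf form log(N / df) and uses a made-up count matrix; `tfidf_sketch()` is a hypothetical helper written for this example, not the package's `tfidf()`.

```r
# Toy document-feature count matrix (hypothetical data, not from the package).
counts <- matrix(
  c(2, 0, 1,
    0, 3, 1,
    1, 1, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(docs = c("d1", "d2", "d3"),
                  features = c("crisis", "which", "the"))
)

# Hypothetical helper: tf-idf with a selectable tf scheme and log base.
# Assumes idf = log(N / df); other idf variants would give other values.
tfidf_sketch <- function(x, scheme_tf = c("count", "prop"), base = 10) {
  scheme_tf <- match.arg(scheme_tf)
  # "prop" divides each row by its total; "count" keeps raw counts.
  tf <- if (scheme_tf == "prop") x / rowSums(x) else x
  df <- colSums(x > 0)                    # documents containing each feature
  idf <- log(nrow(x) / df, base = base)
  sweep(tf, 2, idf, `*`)                  # scale each column by its idf
}

w10 <- tfidf_sketch(counts, "prop", base = 10)
w2  <- tfidf_sketch(counts, "prop", base = 2)

# Changing the base multiplies every weight by the same constant, log2(10).
all.equal(w2, w10 * log2(10))
```

Under this assumed idf form, the base only rescales the weights, so it affects comparisons against other software's output but not the ranking of features within this matrix. A feature that occurs in every document (here `the`) gets a weight of 0 regardless of base or tf scheme.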
OK, so based on your reading of the literature, does it look like our defaults, including those passed to
Yes,
Great, thanks. What’s the “consensus” (if any!) about the logarithm base for idf, and for the form of the term frequency? Is term frequency typically a relative frequency, or a count, in most formulations of tf-idf?
I think it's typically a count for
Our computation of tf-idf needs to be carefully checked. There are differences in how inverse document frequency is computed.
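For reference when cross-checking, the differences mostly come down to which idf formula is used. Three commonly cited forms are sketched here, with $N$ the number of documents, $\mathrm{df}(t)$ the number of documents containing term $t$, and $b$ the log base. Which of these (if any) matches our implementation is not established in this thread, and the subscript names are labels chosen for this note.

$$
\mathrm{idf}_{\text{plain}}(t) = \log_b \frac{N}{\mathrm{df}(t)}, \qquad
\mathrm{idf}_{\text{prob}}(t) = \log_b \frac{N - \mathrm{df}(t)}{\mathrm{df}(t)}, \qquad
\mathrm{idf}_{\text{smooth}}(t) = \log_b \frac{1 + N}{1 + \mathrm{df}(t)} + 1
$$

In each case the weight is $\mathrm{tf}(t, d) \cdot \mathrm{idf}(t)$, where $\mathrm{tf}$ may be a raw count or a within-document proportion. For a term that appears in every document, the plain form gives 0 while the smoothed form gives 1, so the variants can disagree even on which entries are zero.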
We also had an error in the `tfidf.dfm()` signature, but this is now corrected in branch `hotfix-tfidf`. Previously, the generic default was `scheme_tf = "prop"` but in `tfidf.dfm()` it was `scheme_tf = "count"`. There might be a reason why I defaulted this to "count"; please check the first source below.
See: