tf*idf is non-standard in scikit-learn #2998
Several months ago @tpeng pointed me to https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/text.py#L966 - why is 1.0 added to idf?
This 1.0 makes idf positive when n_samples == df, and the comment suggests it is there to avoid some division errors. What I don't understand is what these division errors are: idf is a multiplier, not a divisor, in tf*idf, and we're computing a logarithm for idf, so why would we divide by the logarithm?
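For reference, here is a minimal numpy sketch of the computation at the linked line (variable names and numbers are mine, for illustration only), showing that without the +1 summand a term occurring in every document gets an idf of exactly zero:

```python
import numpy as np

# Toy corpus summary: 4 documents; term 0 occurs in all of them,
# term 1 occurs in two of them.
n_samples = 4
df = np.array([4.0, 2.0])

# The formula at the linked line (no smoothing): log(n_samples / df) + 1
idf = np.log(n_samples / df) + 1.0
print(idf)                      # [1.         1.69314718] -- strictly positive

# Without the 1.0 summand, a term present in every document gets idf == 0
print(np.log(n_samples / df))   # [0.         0.69314718]
```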
When this 1.0 summand is commented out, some tests start failing. What fails are the normalization checks and the inverse transform test. If I comment out the normalization checks, the rest of test_text.test_tfidf_no_smoothing passes, as do the other affected tests.
By the way, the SkipTest exceptions in these tests are likely useless, because they should be raised before the assertions that can fail.
It is not clear to me what these failing normalization tests mean. But the comment about zero division errors doesn't explain why the formula is non-standard: there are smoothing terms inside the logarithm, but what is the +1 outside the logarithm for? Maybe it is explained in Yates2011, but I don't have access to it; in that case it would be better to add some more notes to the source code.
I'm not really sure why this was done, but I see two effects. One is that we actually compute tf * (idf + 1) = tf + tf * idf, rather than the standard tf * idf. The other is that tf-idf is never exactly zero when tf is non-zero, so terms occurring in every document (zero idf) are not discarded entirely, and the inverse transform can still recover them.
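A quick worked example of both effects (the numbers are made up for illustration):

```python
import numpy as np

tf = np.array([3.0, 1.0])           # raw term counts in one document
idf = np.array([0.0, np.log(2.0)])  # term 0 occurs in every document, so idf == 0

standard = tf * idf                 # [0.         0.69314718] -- term 0 vanishes
with_plus_one = tf * (idf + 1.0)    # [3.         1.69314718] -- term 0 survives

# The identity behind the first effect: tf * (idf + 1) == tf + tf * idf
assert np.allclose(with_plus_one, tf + tf * idf)
```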
We currently have two mechanisms in place that modify idf:

- smooth_idf, which adds one to the document counts inside the logarithm, as if an extra document containing every term had been seen once;
- the 1.0 discussed here, added outside the logarithm.
I pushed some documentation for the latter in fbe974b.
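Both mechanisms can be observed through TfidfTransformer's idf_ attribute. A sketch, assuming a scikit-learn version that exposes idf_ (the input matrix is a made-up example):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Toy term-count matrix: 3 documents; term 0 occurs in all of them.
X = np.array([[2, 0],
              [1, 3],
              [1, 0]])

for smooth in (False, True):
    idf = TfidfTransformer(smooth_idf=smooth).fit(X).idf_
    print(smooth, idf)

# smooth_idf=False: idf = log(n / df) + 1             -> [1.0, log(3) + 1]
# smooth_idf=True:  idf = log((1 + n) / (1 + df)) + 1 -> [1.0, log(2) + 1]
```

In both cases the trailing +1 keeps the idf of term 0 at 1.0 rather than 0, which is exactly the summand this issue is about.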