DOC Correct TF-IDF formula in TfidfTransformer comments. (#13054)

vishaalkapoor authored and jnothman committed Jan 29, 2019
1 parent 1deb95a commit fdf2f3834e4772b401728ef8ebd280792dfdaf65
Showing with 27 additions and 25 deletions.
  1. +15 −14 doc/modules/feature_extraction.rst
  2. +12 −11 sklearn/feature_extraction/text.py
doc/modules/feature_extraction.rst

@@ -436,11 +436,12 @@ Using the ``TfidfTransformer``'s default settings,
the term frequency, the number of times a term occurs in a given document,
is multiplied with idf component, which is computed as

-:math:`\text{idf}(t) = log{\frac{1 + n_d}{1+\text{df}(d,t)}} + 1`,
+:math:`\text{idf}(t) = \log{\frac{1 + n}{1+\text{df}(t)}} + 1`,

-where :math:`n_d` is the total number of documents, and :math:`\text{df}(d,t)`
-is the number of documents that contain term :math:`t`. The resulting tf-idf
-vectors are then normalized by the Euclidean norm:
+where :math:`n` is the total number of documents in the document set, and
+:math:`\text{df}(t)` is the number of documents in the document set that
+contain term :math:`t`. The resulting tf-idf vectors are then normalized by the
+Euclidean norm:

:math:`v_{norm} = \frac{v}{||v||_2} = \frac{v}{\sqrt{v{_1}^2 +
v{_2}^2 + \dots + v{_n}^2}}`.
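
As a quick sanity check of the corrected formula, here is a minimal NumPy
sketch; the ``counts`` array is an assumed 6-document, 3-term matrix chosen
to be consistent with the worked example further down this page::

    import numpy as np

    # Assumed illustrative counts: 6 documents x 3 terms.
    counts = np.array([[3, 0, 1],
                       [2, 0, 0],
                       [3, 0, 0],
                       [4, 0, 0],
                       [3, 2, 0],
                       [3, 0, 2]], dtype=float)

    n = counts.shape[0]                   # total number of documents
    df = (counts > 0).sum(axis=0)         # df(t): documents containing each term
    idf = np.log((1 + n) / (1 + df)) + 1  # idf(t) = log((1 + n) / (1 + df(t))) + 1
    tfidf = counts * idf                  # tf is the raw term count here
    tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)  # v / ||v||_2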
@@ -455,14 +456,14 @@ computed in scikit-learn's :class:`TfidfTransformer`
and :class:`TfidfVectorizer` differ slightly from the standard textbook
notation that defines the idf as

-:math:`\text{idf}(t) = log{\frac{n_d}{1+\text{df}(d,t)}}.`
+:math:`\text{idf}(t) = \log{\frac{n}{1+\text{df}(t)}}.`


In the :class:`TfidfTransformer` and :class:`TfidfVectorizer`
with ``smooth_idf=False``, the
"1" count is added to the idf instead of the idf's denominator:

-:math:`\text{idf}(t) = log{\frac{n_d}{\text{df}(d,t)}} + 1`
+:math:`\text{idf}(t) = \log{\frac{n}{\text{df}(t)}} + 1`

This normalization is implemented by the :class:`TfidfTransformer`
class::
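
(The literal block that follows this line in the source file is cut off by the
hunk boundary. As a hedged stand-in, the snippet below shows the same idea
with the real ``TfidfTransformer`` API and the assumed counts matrix from
above)::

    from sklearn.feature_extraction.text import TfidfTransformer

    # smooth_idf=False applies idf(t) = log(n / df(t)) + 1, then L2-normalizes.
    counts = [[3, 0, 1], [2, 0, 0], [3, 0, 0],
              [4, 0, 0], [3, 2, 0], [3, 0, 2]]
    transformer = TfidfTransformer(smooth_idf=False)
    tfidf = transformer.fit_transform(counts)  # returns a sparse matrix
    print(tfidf.toarray())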
@@ -509,21 +510,21 @@ v{_2}^2 + \dots + v{_n}^2}}`
For example, we can compute the tf-idf of the first term in the first
document in the `counts` array as follows:

-:math:`n_{d} = 6`
+:math:`n = 6`

-:math:`\text{df}(d, t)_{\text{term1}} = 6`
+:math:`\text{df}(t)_{\text{term1}} = 6`

-:math:`\text{idf}(d, t)_{\text{term1}} =
-log \frac{n_d}{\text{df}(d, t)} + 1 = log(1)+1 = 1`
+:math:`\text{idf}(t)_{\text{term1}} =
+\log \frac{n}{\text{df}(t)} + 1 = \log(1)+1 = 1`

:math:`\text{tf-idf}_{\text{term1}} = \text{tf} \times \text{idf} = 3 \times 1 = 3`

Now, if we repeat this computation for the remaining 2 terms in the document,
we get

-:math:`\text{tf-idf}_{\text{term2}} = 0 \times (log(6/1)+1) = 0`
+:math:`\text{tf-idf}_{\text{term2}} = 0 \times (\log(6/1)+1) = 0`

-:math:`\text{tf-idf}_{\text{term3}} = 1 \times (log(6/2)+1) \approx 2.0986`
+:math:`\text{tf-idf}_{\text{term3}} = 1 \times (\log(6/2)+1) \approx 2.0986`

and the vector of raw tf-idfs:

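(The raw vector itself falls outside this hunk; a short sketch that
reproduces the three per-term values computed above)::

    import numpy as np

    # Worked example, smooth_idf=False, before normalization.
    n = 6
    tf = np.array([3.0, 0.0, 1.0])  # counts of the three terms in document 1
    df = np.array([6.0, 1.0, 2.0])  # document frequencies used above
    print(tf * (np.log(n / df) + 1))  # -> [3.0, 0.0, ~2.0986]
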
@@ -540,12 +541,12 @@ Furthermore, the default parameter ``smooth_idf=True`` adds "1" to the numerator
and denominator as if an extra document was seen containing every term in the
collection exactly once, which prevents zero divisions:

-:math:`\text{idf}(t) = log{\frac{1 + n_d}{1+\text{df}(d,t)}} + 1`
+:math:`\text{idf}(t) = \log{\frac{1 + n}{1+\text{df}(t)}} + 1`

Using this modification, the tf-idf of the third term in document 1 changes to
1.8473:

-:math:`\text{tf-idf}_{\text{term3}} = 1 \times log(7/3)+1 \approx 1.8473`
+:math:`\text{tf-idf}_{\text{term3}} = 1 \times \log(7/3)+1 \approx 1.8473`

And the L2-normalized tf-idf changes to

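(The normalized vector is cut off by the hunk boundary; it can be recomputed
from the numbers already shown)::

    import numpy as np

    # Worked example with smooth_idf=True:
    # idf(t) = log((1 + n) / (1 + df(t))) + 1.
    n = 6
    tf = np.array([3.0, 0.0, 1.0])
    df = np.array([6.0, 1.0, 2.0])
    tfidf = tf * (np.log((1 + n) / (1 + df)) + 1)
    print(tfidf[2])                       # ~1.8473, as above
    print(tfidf / np.linalg.norm(tfidf))  # the L2-normalized tf-idf vector
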
sklearn/feature_extraction/text.py

@@ -1146,17 +1146,18 @@ class TfidfTransformer(BaseEstimator, TransformerMixin):
informative than features that occur in a small fraction of the training
corpus.
-The formula that is used to compute the tf-idf of term t is
-tf-idf(d, t) = tf(t) * idf(d, t), and the idf is computed as
-idf(d, t) = log [ n / df(d, t) ] + 1 (if ``smooth_idf=False``),
-where n is the total number of documents and df(d, t) is the
-document frequency; the document frequency is the number of documents d
-that contain term t. The effect of adding "1" to the idf in the equation
-above is that terms with zero idf, i.e., terms that occur in all documents
-in a training set, will not be entirely ignored.
-(Note that the idf formula above differs from the standard
-textbook notation that defines the idf as
-idf(d, t) = log [ n / (df(d, t) + 1) ]).
+The formula that is used to compute the tf-idf for a term t of a document d
+in a document set is tf-idf(t, d) = tf(t, d) * idf(t), and the idf is
+computed as idf(t) = log [ n / df(t) ] + 1 (if ``smooth_idf=False``), where
+n is the total number of documents in the document set and df(t) is the
+document frequency of t; the document frequency is the number of documents
+in the document set that contain the term t. The effect of adding "1" to
+the idf in the equation above is that terms with zero idf, i.e., terms
+that occur in all documents in a training set, will not be entirely
+ignored.
+(Note that the idf formula above differs from the standard textbook
+notation that defines the idf as
+idf(t) = log [ n / (df(t) + 1) ]).
If ``smooth_idf=True`` (the default), the constant "1" is added to the
numerator and denominator of the idf as if an extra document was seen

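To double-check the corrected docstring formulas against the implementation,
one can compare a fitted transformer's ``idf_`` attribute with both variants;
the counts matrix is the assumed toy example used above::

    import numpy as np
    from sklearn.feature_extraction.text import TfidfTransformer

    counts = np.array([[3, 0, 1], [2, 0, 0], [3, 0, 0],
                       [4, 0, 0], [3, 2, 0], [3, 0, 2]])
    n = counts.shape[0]
    df = (counts > 0).sum(axis=0)

    for smooth in (False, True):
        t = TfidfTransformer(smooth_idf=smooth).fit(counts)
        expected = (np.log((1 + n) / (1 + df)) + 1 if smooth
                    else np.log(n / df) + 1)
        assert np.allclose(t.idf_, expected)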