**TF-IDF for a word in a document is calculated by multiplying two different metrics:**

    Step 1. The term frequency (TF) of a word t in a document d.
    TF is specific to each document-word pair, so we can formulate it as follows.

    tf(t,d) = count of t in d / number of words in d

    Step 2. The inverse document frequency (IDF) of the word across a set of documents.
    It measures how common or rare a word is in the entire document set.
    The closer it is to 0, the more common the word is.

    This metric is calculated by dividing the total number of documents by the number of documents
    that contain the word, and taking the logarithm of the result.
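The two steps above can be sketched in plain Python. This is a minimal illustration of the textbook definitions only; the helper names `term_frequency` and `inverse_document_frequency` and the three-sentence corpus are made up for this example. scikit-learn's default differs in detail: it uses raw counts for TF, a smoothed IDF, and L2 normalization.

```python
import math

def term_frequency(term, doc_tokens):
    """tf(t, d) = count of t in d / number of words in d."""
    return doc_tokens.count(term) / len(doc_tokens)

def inverse_document_frequency(term, corpus_tokens):
    """idf(t) = ln(N / df(t)); assumes t occurs in at least one document."""
    df = sum(1 for doc in corpus_tokens if term in doc)
    return math.log(len(corpus_tokens) / df)

# Hypothetical toy corpus, tokenized by whitespace
corpus = ["the cat saw the mouse", "the mouse ran away", "the end"]
tokens = [doc.split() for doc in corpus]

tf_cat = term_frequency("cat", tokens[0])             # 1 occurrence out of 5 words
idf_cat = inverse_document_frequency("cat", tokens)   # ln(3 / 1): rare word
idf_the = inverse_document_frequency("the", tokens)   # ln(3 / 3) = 0: appears everywhere
tfidf_cat = tf_cat * idf_cat
```

A word that occurs in every document ("the") gets an IDF of exactly 0 under this definition, which is why its TF-IDF score vanishes regardless of how often it appears.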

**CountVectorizer**

In [1]:
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
docs = ["the house had a tiny little mouse", 
        "the cat saw the mouse", 
        "the mouse ran away from the house", 
        "the cat finally ate the mouse", 
        "the end of the mouse story"
        ]

In [3]:
cv = CountVectorizer(stop_words='english')

In [4]:
cv

In [5]:
cv.fit_transform(docs)

<5x12 sparse matrix of type '<class 'numpy.int64'>'
	with 18 stored elements in Compressed Sparse Row format>

In [6]:
sparse_matrix = cv.fit_transform(docs)

In [7]:
sparse_matrix.todense()

matrix([[0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1],
        [0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0],
        [0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0],
        [1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0],
        [0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0]])

In [8]:
import pandas as pd

In [9]:
# cv.vocabulary_.items()
cv.get_feature_names_out()

array(['ate', 'away', 'cat', 'end', 'finally', 'house', 'little', 'mouse',
       'ran', 'saw', 'story', 'tiny'], dtype=object)

In [10]:
pd.DataFrame(sparse_matrix.todense(), columns= cv.get_feature_names_out())

Unnamed: 0,ate,away,cat,end,finally,house,little,mouse,ran,saw,story,tiny
0,0,0,0,0,0,1,1,1,0,0,0,1
1,0,0,1,0,0,0,0,1,0,1,0,0
2,0,1,0,0,0,1,0,1,1,0,0,0
3,1,0,1,0,1,0,0,1,0,0,0,0
4,0,0,0,1,0,0,0,1,0,0,1,0


**TfidfTransformer**

In [11]:
from sklearn.feature_extraction.text import TfidfTransformer

In [12]:
tfidf_transformer = TfidfTransformer()

In [13]:
tfidf_transformer

In [14]:
tfidf_transformer.fit_transform(sparse_matrix)

<5x12 sparse matrix of type '<class 'numpy.float64'>'
	with 18 stored elements in Compressed Sparse Row format>

In [15]:
tfidf_transformer.idf_

array([2.09861229, 2.09861229, 1.69314718, 2.09861229, 2.09861229,
       1.69314718, 2.09861229, 1.        , 2.09861229, 2.09861229,
       2.09861229, 2.09861229])

In [16]:
pd.DataFrame(tfidf_transformer.idf_, index = cv.get_feature_names_out(), columns=['idf_weights'])\
.sort_values(by='idf_weights', ascending=False)

Unnamed: 0,idf_weights
ate,2.098612
away,2.098612
end,2.098612
finally,2.098612
little,2.098612
ran,2.098612
saw,2.098612
story,2.098612
tiny,2.098612
cat,1.693147


**Notice that the word ‘mouse’ has the lowest IDF value. This is expected, as it appears in every document in our collection. The lower the IDF value of a word, the less unique it is to any particular document.**
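The IDF weights printed above can be reproduced by hand. The sketch below assumes scikit-learn's default smoothed formula, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and uses document frequencies counted from the five example sentences; the variable names are illustrative.

```python
import numpy as np

n_docs = 5  # number of documents in the collection

# Document frequency: in how many of the five documents each word appears
doc_freq = {"mouse": 5, "cat": 2, "house": 2, "tiny": 1}

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
idf = {term: np.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

for term, weight in idf.items():
    print(f"{term}: {weight:.6f}")
```

Because ‘mouse’ has df = n, the logarithm is ln(1) = 0 and its weight bottoms out at 1, the minimum possible under this formula.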

#### TFIDF score for your documents
In practice, you may be computing tf-idf scores on a set of new, unseen documents. In that case, first call cv.transform(your_new_docs) to generate the count matrix.

In [17]:
sparse_matrix = cv.transform(docs)

Then, by invoking tfidf_transformer.transform(sparse_matrix), you compute the tf-idf scores for your docs.

Internally, this computes tf * idf, weighting each term frequency by the term's IDF value, and then L2-normalizes each document vector.

In [18]:
tfidf_transformer.transform(sparse_matrix)

<5x12 sparse matrix of type '<class 'numpy.float64'>'
	with 18 stored elements in Compressed Sparse Row format>

In [19]:
tfidf_transformer.transform(sparse_matrix).T.todense()

matrix([[0.        , 0.        , 0.        , 0.58946308, 0.        ],
        [0.        , 0.        , 0.58946308, 0.        , 0.        ],
        [0.        , 0.58873218, 0.        , 0.4755751 , 0.        ],
        [0.        , 0.        , 0.        , 0.        , 0.67009179],
        [0.        , 0.        , 0.        , 0.58946308, 0.        ],
        [0.4755751 , 0.        , 0.4755751 , 0.        , 0.        ],
        [0.58946308, 0.        , 0.        , 0.        , 0.        ],
        [0.28088232, 0.34771471, 0.28088232, 0.28088232, 0.31930233],
        [0.        , 0.        , 0.58946308, 0.        , 0.        ],
        [0.        , 0.72971837, 0.        , 0.        , 0.        ],
        [0.        , 0.        , 0.        , 0.        , 0.67009179],
        [0.58946308, 0.        , 0.        , 0.        , 0.        ]])

In [20]:
tfidf_transformer.transform(sparse_matrix)[0].T.todense()

matrix([[0.        ],
        [0.        ],
        [0.        ],
        [0.        ],
        [0.        ],
        [0.4755751 ],
        [0.58946308],
        [0.28088232],
        [0.        ],
        [0.        ],
        [0.        ],
        [0.58946308]])

**The first document is “the house had a tiny little mouse”. Every non-stop word in this document (house, tiny, little, mouse) has a tf-idf score, and every other vocabulary term shows up as zero.**

In [21]:
pd.DataFrame(tfidf_transformer.transform(sparse_matrix)[0].T.todense(), index=cv.get_feature_names_out(), columns=['tfidf'])\
.sort_values(by='tfidf', ascending=False)

Unnamed: 0,tfidf
little,0.589463
tiny,0.589463
house,0.475575
mouse,0.280882
ate,0.0
away,0.0
cat,0.0
end,0.0
finally,0.0
ran,0.0


In [22]:
pd.DataFrame(tfidf_transformer.transform(sparse_matrix).T.todense(), index=cv.get_feature_names_out(), 
             columns = ['d1', 'd2', 'd3', 'd4', 'd5'])

Unnamed: 0,d1,d2,d3,d4,d5
ate,0.0,0.0,0.0,0.589463,0.0
away,0.0,0.0,0.589463,0.0,0.0
cat,0.0,0.588732,0.0,0.475575,0.0
end,0.0,0.0,0.0,0.0,0.670092
finally,0.0,0.0,0.0,0.589463,0.0
house,0.475575,0.0,0.475575,0.0,0.0
little,0.589463,0.0,0.0,0.0,0.0
mouse,0.280882,0.347715,0.280882,0.280882,0.319302
ran,0.0,0.0,0.589463,0.0,0.0
saw,0.0,0.729718,0.0,0.0,0.0


#### TfidfVectorizer

In [23]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [24]:
tfidf_vectorizer = TfidfVectorizer(stop_words='english')

In [25]:
tfidf_vectorizer

In [26]:
tfidf_vectorizer.fit_transform(docs)

<5x12 sparse matrix of type '<class 'numpy.float64'>'
	with 18 stored elements in Compressed Sparse Row format>

In [27]:
# tfidf_vectorizer.fit_transform(docs).T.A

In [28]:
tfidf_vectorizer.idf_

array([2.09861229, 2.09861229, 1.69314718, 2.09861229, 2.09861229,
       1.69314718, 2.09861229, 1.        , 2.09861229, 2.09861229,
       2.09861229, 2.09861229])

In [29]:
tfidf_vectorizer.fit_transform(docs).T.todense()

matrix([[0.        , 0.        , 0.        , 0.58946308, 0.        ],
        [0.        , 0.        , 0.58946308, 0.        , 0.        ],
        [0.        , 0.58873218, 0.        , 0.4755751 , 0.        ],
        [0.        , 0.        , 0.        , 0.        , 0.67009179],
        [0.        , 0.        , 0.        , 0.58946308, 0.        ],
        [0.4755751 , 0.        , 0.4755751 , 0.        , 0.        ],
        [0.58946308, 0.        , 0.        , 0.        , 0.        ],
        [0.28088232, 0.34771471, 0.28088232, 0.28088232, 0.31930233],
        [0.        , 0.        , 0.58946308, 0.        , 0.        ],
        [0.        , 0.72971837, 0.        , 0.        , 0.        ],
        [0.        , 0.        , 0.        , 0.        , 0.67009179],
        [0.58946308, 0.        , 0.        , 0.        , 0.        ]])

In [30]:
tfidf_vectorizer.get_feature_names_out()

array(['ate', 'away', 'cat', 'end', 'finally', 'house', 'little', 'mouse',
       'ran', 'saw', 'story', 'tiny'], dtype=object)

In [31]:
# pd.DataFrame(tfidf_vectorizer.fit_transform(docs).todense(), columns = tfidf_vectorizer.get_feature_names_out(),
#              index=['d1', 'd2', 'd3', 'd4', 'd5'])

In [32]:
pd.DataFrame(tfidf_vectorizer.fit_transform(docs).T.todense(), index = tfidf_vectorizer.get_feature_names_out(),
             columns=['d1', 'd2', 'd3', 'd4', 'd5'])

Unnamed: 0,d1,d2,d3,d4,d5
ate,0.0,0.0,0.0,0.589463,0.0
away,0.0,0.0,0.589463,0.0,0.0
cat,0.0,0.588732,0.0,0.475575,0.0
end,0.0,0.0,0.0,0.0,0.670092
finally,0.0,0.0,0.0,0.589463,0.0
house,0.475575,0.0,0.475575,0.0,0.0
little,0.589463,0.0,0.0,0.0,0.0
mouse,0.280882,0.347715,0.280882,0.280882,0.319302
ran,0.0,0.0,0.589463,0.0,0.0
saw,0.0,0.729718,0.0,0.0,0.0


**With TfidfTransformer, you first compute word counts using CountVectorizer, then compute the Inverse Document Frequency (IDF) values, and only then compute the tf-idf scores.**

**With TfidfVectorizer, by contrast, all three steps happen at once. Under the hood, it computes the word counts, IDF values, and tf-idf scores using the same dataset.**
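One way to check that the two routes agree is to run both on the same corpus and compare the resulting matrices. This is a small sketch with default settings apart from `stop_words`; the variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

corpus = ["the house had a tiny little mouse",
          "the cat saw the mouse",
          "the mouse ran away from the house",
          "the cat finally ate the mouse",
          "the end of the mouse story"]

# Route 1: word counts first, then IDF weighting and normalization
counts = CountVectorizer(stop_words="english").fit_transform(corpus)
two_step = TfidfTransformer().fit_transform(counts)

# Route 2: everything in a single estimator
one_step = TfidfVectorizer(stop_words="english").fit_transform(corpus)

same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```

The two-step route is handy when the raw counts are needed elsewhere as well; the one-step route is shorter when only the tf-idf scores matter.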


In [33]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

In [34]:
train = ['The sky is blue.','The sun is bright.']
test = ['The sun in the sky is bright', 'We can see the shining sun, the bright sun.']

In [35]:
cv = CountVectorizer(stop_words='english')

In [36]:
cv.fit_transform(train)

<2x4 sparse matrix of type '<class 'numpy.int64'>'
	with 4 stored elements in Compressed Sparse Row format>

In [37]:
cv.fit_transform(train).todense()

matrix([[1, 0, 1, 0],
        [0, 1, 0, 1]])

In [38]:
cv.transform(test).todense()

matrix([[0, 1, 1, 1],
        [0, 1, 0, 2]])

In [39]:
tfidf_transformer = TfidfTransformer()

In [40]:
tfidf_transformer

In [41]:
tfidf_transformer.fit_transform(cv.transform(test)).todense()

matrix([[0.        , 0.50154891, 0.70490949, 0.50154891],
        [0.        , 0.4472136 , 0.        , 0.89442719]])

**Let's understand the calculation of idf**

In [42]:
import numpy as np

In [43]:
np.log(3)    # Natural Log

1.0986122886681098

d3 - 'The sun in the sky is bright',  
d4 - 'We can see the shining sun, the bright sun.'


| | blue | bright | sky | sun |
|-|------|--------|-----|-----|
| |t1|t2|t3|t4|
|d3|0|1|1|1|
|d4|0|1|0|2|
|df(t)|0|2|1|2|


**Inverse-Document-Frequency:      idf(t) = ln((1 + n) / (1 + df(t))) + 1**

n = 2 here (the number of documents)  
df(t) - number of documents that contain the term t


    idf(t1) = ln((1 + 2) / (1 + 0)) + 1 = ln(3/1) + 1 => 1.098612288 + 1 = 2.098612288

    idf(t2) = ln((1 + 2) / (1 + 2)) + 1 = ln(3/3) + 1 => 0 + 1 = 1

    idf(t3) = ln((1 + 2) / (1 + 1)) + 1 = ln(3/2) + 1 => 0.405465108 + 1 = 1.405465108

    idf(t4) = ln((1 + 2) / (1 + 2)) + 1 = ln(3/3) + 1 => 0 + 1 = 1
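The four hand calculations can also be done in one vectorized step. A short sketch, assuming the document frequencies from the df(t) row for blue, bright, sky and sun:

```python
import numpy as np

n = 2                         # number of documents (d3 and d4)
df = np.array([0, 2, 1, 2])   # df(t) for blue, bright, sky, sun

# Smoothed IDF applied element-wise: ln((1 + n) / (1 + df)) + 1
idf = np.log((1 + n) / (1 + df)) + 1
print(idf)
```

The "+ 1" inside the fraction is what keeps the formula defined for "blue", which has df = 0 in these two documents.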







In [44]:
tfidf_transformer.idf_

array([2.09861229, 1.        , 1.40546511, 1.        ])

In [45]:
tfidf_transformer.idf_.shape

(4,)

In [46]:
cv.transform(test).toarray()           # 2 * 4

array([[0, 1, 1, 1],
       [0, 1, 0, 2]])

In [47]:
cv.transform(test).toarray() * tfidf_transformer.idf_     # broadcasting multiplies each column (term) by its idf weight

array([[0.        , 1.        , 1.40546511, 1.        ],
       [0.        , 1.        , 0.        , 2.        ]])

**L2 - Normalization**

In [48]:
tfidf_transformer.transform(cv.transform(test)).toarray()    # After L2 - Normalization        

array([[0.        , 0.50154891, 0.70490949, 0.50154891],
       [0.        , 0.4472136 , 0.        , 0.89442719]])

**Let's figure out how we are getting the above matrix**

In [49]:
from numpy.linalg import norm

In [50]:
tf_idf_matrix = cv.transform(test).toarray() * tfidf_transformer.idf_ 

In [51]:
tf_idf_matrix                   # Term frequency * Idf

array([[0.        , 1.        , 1.40546511, 1.        ],
       [0.        , 1.        , 0.        , 2.        ]])

In [52]:
norm(tf_idf_matrix, axis=1)    # L2 norm of each row, i.e. one norm per document

array([1.99382351, 2.23606798])

In [53]:
tf_idf_matrix.T        

array([[0.        , 0.        ],
       [1.        , 1.        ],
       [1.40546511, 0.        ],
       [1.        , 2.        ]])

In [54]:
(1 ** 2 + 1.40546511 ** 2 + 1 ** 2) ** 0.5,  (1 ** 2 + 2 ** 2) ** 0.5   

(1.9938235065891143, 2.23606797749979)

In [55]:
tf_idf_matrix.T / norm(tf_idf_matrix, axis=1)    # transposed so broadcasting divides each document by its own norm

array([[0.        , 0.        ],
       [0.50154891, 0.4472136 ],
       [0.70490949, 0.        ],
       [0.50154891, 0.89442719]])
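The division above yields the transposed layout (terms as rows). An equivalent row-wise form that keeps documents as rows is sketched below, using `keepdims=True` so that each row is divided by its own norm. The matrix values are copied from the tf * idf output, and `sklearn.preprocessing.normalize` is included only as a cross-check.

```python
import numpy as np
from sklearn.preprocessing import normalize

# tf * idf values for d3 and d4 (columns: blue, bright, sky, sun)
tf_idf = np.array([[0.0, 1.0, 1.40546511, 1.0],
                   [0.0, 1.0, 0.0,        2.0]])

# One L2 norm per row, kept as a column vector of shape (2, 1)
row_norms = np.linalg.norm(tf_idf, axis=1, keepdims=True)
manual = tf_idf / row_norms

# Cross-check with scikit-learn's row-wise L2 normalization
via_sklearn = normalize(tf_idf, norm="l2", axis=1)
print(manual)
```

After normalization every document vector has unit length, so a dot product between two rows is their cosine similarity.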

In [56]:
pd.DataFrame(tfidf_transformer.transform(cv.transform(test)).todense(), columns=cv.get_feature_names_out(), index = ['d3', 'd4'])

Unnamed: 0,blue,bright,sky,sun
d3,0.0,0.501549,0.704909,0.501549
d4,0.0,0.447214,0.0,0.894427
