<b>TfidfTransformer</b>:

* Transform a count matrix to a normalized tf or tf-idf representation.
* Tf means term frequency, while tf-idf means term frequency times inverse document frequency. This is a common term-weighting scheme in information retrieval that has also found good use in document classification.

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer

* A <b>Term Frequency</b> is a count of how many times a word occurs in a given document (the same counts a bag-of-words model uses).
* The <b>Inverse Document Frequency</b> is a weight that shrinks as the number of documents in the corpus containing the word grows, so words that appear in many documents count for less than rare ones.
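
To make the idf definition concrete, here is a minimal sketch of the smoothed formula that scikit-learn documents for `TfidfTransformer` with its default `smooth_idf=True`: idf(t) = ln((1 + n) / (1 + df(t))) + 1. The helper name `smoothed_idf` and the 7-document corpus size are illustrative assumptions, not part of the library.

```python
import math


def smoothed_idf(n_docs: int, doc_freq: int) -> float:
    """Smoothed idf: ln((1 + n_docs) / (1 + doc_freq)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1.0


# Assumed corpus of 7 documents; a term that occurs in 1, 2, or 3 of them.
for doc_freq in (1, 2, 3):
    print(doc_freq, round(smoothed_idf(7, doc_freq), 6))
# prints:
# 1 2.386294
# 2 1.980829
# 3 1.693147
```

The more documents a term appears in, the smaller its weight. The added 1 keeps the weight above zero even for a term found in every document.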

The first step is to create our training document set and compute the term frequency matrix.

https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting

### Creating a count vector



In [3]:
from sklearn.feature_extraction.text import CountVectorizer

train_text = ["A bird in hand is worth two in the bush.",
              "Good things come to those who wait.",
              "These watches cost $1500! ",
              "There are other fish in the sea.",
              "The ball is in your court.",
              "Mr. Smith Goes to Washington ",
              "Doogie Howser M.D."]

In [12]:
count_vectorizer = CountVectorizer()

frequency_term_matrix = count_vectorizer.fit_transform(train_text)

In [13]:
len(count_vectorizer.vocabulary_)

33

In [14]:
frequency_term_matrix.shape

(7, 33)

In [15]:
frequency_term_matrix.toarray()

array([[0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 2, 1, 0, 0, 0, 0, 1, 0,
        0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0],
       [1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0,
        0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int64)
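
To see where these counts come from, the sketch below rebuilds the matrix in plain Python. It assumes `CountVectorizer`'s default settings: lowercasing, the token pattern `(?u)\b\w\w+\b` (tokens of two or more word characters), and an alphabetically sorted vocabulary. The names `TOKEN`, `tokenized`, `vocabulary`, and `counts` are illustrative only, and this is not how the library is implemented internally.

```python
import re
from collections import Counter

train_text = ["A bird in hand is worth two in the bush.",
              "Good things come to those who wait.",
              "These watches cost $1500! ",
              "There are other fish in the sea.",
              "The ball is in your court.",
              "Mr. Smith Goes to Washington ",
              "Doogie Howser M.D."]

# Default token pattern: runs of two or more word characters.
TOKEN = re.compile(r"(?u)\b\w\w+\b")

# Lowercase and tokenize every document.
tokenized = [TOKEN.findall(doc.lower()) for doc in train_text]

# Column index of each term: position in the sorted set of all tokens.
all_terms = sorted({term for tokens in tokenized for term in tokens})
vocabulary = {term: index for index, term in enumerate(all_terms)}

# Dense count matrix: one row per document, one column per term.
counts = [[0] * len(vocabulary) for _ in train_text]
for row, tokens in zip(counts, tokenized):
    for term, n in Counter(tokens).items():
        row[vocabulary[term]] = n

print(len(vocabulary))              # 33
print(counts[0][vocabulary["in"]])  # 2, since "in" occurs twice in the first sentence
```

Single-character tokens such as "A", "M", and "D" are dropped by the pattern, which is why the last sentence contributes only `doogie` and `howser`.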

### Building the tf-idf matrix



In [41]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer = TfidfTransformer()

In [42]:
tfidf_vector1 = tfidf_transformer.fit_transform(frequency_term_matrix)

tfidf_vector1.shape

(7, 33)

In [43]:
tfidf_vector1.toarray()

array([[0.        , 0.        , 0.        , 0.34908308, 0.34908308,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.34908308, 0.        , 0.49536976,
        0.28976893, 0.        , 0.        , 0.        , 0.        ,
        0.24768488, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.34908308, 0.        , 0.        , 0.        ,
        0.        , 0.34908308, 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ,
        0.38665001, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.38665001, 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.38665001, 0.38665001,
        0.32095271, 0.        , 0.38665001, 0.        , 0.        ,
        0.38665001, 0.        , 0.        ],
       [0.5       , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.5   

In [44]:
print(tfidf_vector1.toarray())

[[0.         0.         0.         0.34908308 0.34908308 0.
  0.         0.         0.         0.         0.         0.
  0.34908308 0.         0.49536976 0.28976893 0.         0.
  0.         0.         0.24768488 0.         0.         0.
  0.         0.         0.34908308 0.         0.         0.
  0.         0.34908308 0.        ]
 [0.         0.         0.         0.         0.         0.38665001
  0.         0.         0.         0.         0.         0.38665001
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.38665001
  0.38665001 0.32095271 0.         0.38665001 0.         0.
  0.38665001 0.         0.        ]
 [0.5        0.         0.         0.         0.         0.
  0.5        0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.5        0.
  0.         0.         0.         0.         0.         0.5
  0

## TfidfVectorizer = CountVectorizer + TfidfTransformer

In [23]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()

In [45]:
tfidf_vector2 = tfidf_vectorizer.fit_transform(train_text)

tfidf_vectorizer.vocabulary_

{'bird': 3,
 'in': 14,
 'hand': 12,
 'is': 15,
 'worth': 31,
 'two': 26,
 'the': 20,
 'bush': 4,
 'good': 11,
 'things': 23,
 'come': 5,
 'to': 25,
 'those': 24,
 'who': 30,
 'wait': 27,
 'these': 22,
 'watches': 29,
 'cost': 6,
 '1500': 0,
 'there': 21,
 'are': 1,
 'other': 17,
 'fish': 9,
 'sea': 18,
 'ball': 2,
 'your': 32,
 'court': 7,
 'mr': 16,
 'smith': 19,
 'goes': 10,
 'washington': 28,
 'doogie': 8,
 'howser': 13}

In [46]:
tfidf_vector2.shape

(7, 33)

In [47]:
tfidf_vectorizer.idf_

array([2.38629436, 2.38629436, 2.38629436, 2.38629436, 2.38629436,
       2.38629436, 2.38629436, 2.38629436, 2.38629436, 2.38629436,
       2.38629436, 2.38629436, 2.38629436, 2.38629436, 1.69314718,
       1.98082925, 2.38629436, 2.38629436, 2.38629436, 2.38629436,
       1.69314718, 2.38629436, 2.38629436, 2.38629436, 2.38629436,
       1.98082925, 2.38629436, 2.38629436, 2.38629436, 2.38629436,
       2.38629436, 2.38629436, 2.38629436])

In [48]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
dict(zip(tfidf_vectorizer.get_feature_names_out(), tfidf_vectorizer.idf_))

{'1500': 2.386294361119891,
 'are': 2.386294361119891,
 'ball': 2.386294361119891,
 'bird': 2.386294361119891,
 'bush': 2.386294361119891,
 'come': 2.386294361119891,
 'cost': 2.386294361119891,
 'court': 2.386294361119891,
 'doogie': 2.386294361119891,
 'fish': 2.386294361119891,
 'goes': 2.386294361119891,
 'good': 2.386294361119891,
 'hand': 2.386294361119891,
 'howser': 2.386294361119891,
 'in': 1.6931471805599454,
 'is': 1.9808292530117262,
 'mr': 2.386294361119891,
 'other': 2.386294361119891,
 'sea': 2.386294361119891,
 'smith': 2.386294361119891,
 'the': 1.6931471805599454,
 'there': 2.386294361119891,
 'these': 2.386294361119891,
 'things': 2.386294361119891,
 'those': 2.386294361119891,
 'to': 1.9808292530117262,
 'two': 2.386294361119891,
 'wait': 2.386294361119891,
 'washington': 2.386294361119891,
 'watches': 2.386294361119891,
 'who': 2.386294361119891,
 'worth': 2.386294361119891,
 'your': 2.386294361119891}

### Final tf-idf score of each word in each document
* Each row (document) is L2-normalized by default (`norm='l2'`), so every score lies between 0 and 1
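
As a rough check of how the normalization works, the sketch below recomputes the scores of the first sentence by hand: multiply each count by its idf, then divide by the Euclidean (L2) length of the row. The counts and idf values are copied from this notebook's outputs. The variable names are illustrative, and L2 is assumed because it is the default `norm` setting.

```python
import math

# Term counts for "A bird in hand is worth two in the bush."
counts = {"bird": 1, "bush": 1, "hand": 1, "in": 2,
          "is": 1, "the": 1, "two": 1, "worth": 1}

# idf values as printed by tfidf_vectorizer.idf_ in this notebook.
idf = {"in": 1.6931471805599454,
       "the": 1.6931471805599454,
       "is": 1.9808292530117262}
SINGLE_DOC_IDF = 2.386294361119891  # terms found in exactly one document

# Unnormalized tf-idf: count times idf.
weighted = {term: n * idf.get(term, SINGLE_DOC_IDF) for term, n in counts.items()}

# L2 normalization: divide by the Euclidean length of the row.
norm = math.sqrt(sum(w * w for w in weighted.values()))
tfidf = {term: w / norm for term, w in weighted.items()}

for term in ("bird", "in", "is", "the"):
    print(term, round(tfidf[term], 8))
# prints:
# bird 0.34908308
# in 0.49536976
# is 0.28976893
# the 0.24768488
```

After this step the squared scores of a row sum to 1, which is why a document with four equally weighted words, like the third sentence, gets 0.5 for each.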

In [49]:
tfidf_vector2.toarray()

array([[0.        , 0.        , 0.        , 0.34908308, 0.34908308,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.34908308, 0.        , 0.49536976,
        0.28976893, 0.        , 0.        , 0.        , 0.        ,
        0.24768488, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.34908308, 0.        , 0.        , 0.        ,
        0.        , 0.34908308, 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ,
        0.38665001, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.38665001, 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.38665001, 0.38665001,
        0.32095271, 0.        , 0.38665001, 0.        , 0.        ,
        0.38665001, 0.        , 0.        ],
       [0.5       , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.5   

In [50]:
print(tfidf_vector2)

  (0, 3)	0.3490830767264469
  (0, 14)	0.49536975552604884
  (0, 12)	0.3490830767264469
  (0, 15)	0.2897689326921819
  (0, 31)	0.3490830767264469
  (0, 26)	0.3490830767264469
  (0, 20)	0.24768487776302442
  (0, 4)	0.3490830767264469
  (1, 11)	0.386650005027498
  (1, 23)	0.386650005027498
  (1, 5)	0.386650005027498
  (1, 25)	0.32095270940344806
  (1, 24)	0.386650005027498
  (1, 30)	0.386650005027498
  (1, 27)	0.386650005027498
  (2, 22)	0.5
  (2, 29)	0.5
  (2, 6)	0.5
  (2, 0)	0.5
  (3, 14)	0.2894987873064995
  (3, 20)	0.2894987873064995
  (3, 21)	0.40801492725049476
  (3, 1)	0.40801492725049476
  (3, 17)	0.40801492725049476
  (3, 9)	0.40801492725049476
  (3, 18)	0.40801492725049476
  (4, 14)	0.3274243027464032
  (4, 15)	0.38305685676572565
  (4, 20)	0.3274243027464032
  (4, 2)	0.4614665377636916
  (4, 32)	0.4614665377636916
  (4, 7)	0.4614665377636916
  (5, 25)	0.38333717539523177
  (5, 16)	0.4618042361109319
  (5, 19)	0.4618042361109319
  (5, 10)	0.4618042361109319
  (5, 28)	0.461804236

In [51]:
print(tfidf_vector1)

  (0, 31)	0.34908307672644684
  (0, 26)	0.34908307672644684
  (0, 20)	0.2476848777630244
  (0, 15)	0.2897689326921819
  (0, 14)	0.4953697555260488
  (0, 12)	0.34908307672644684
  (0, 4)	0.34908307672644684
  (0, 3)	0.34908307672644684
  (1, 30)	0.386650005027498
  (1, 27)	0.386650005027498
  (1, 25)	0.32095270940344806
  (1, 24)	0.386650005027498
  (1, 23)	0.386650005027498
  (1, 11)	0.386650005027498
  (1, 5)	0.386650005027498
  (2, 29)	0.5
  (2, 22)	0.5
  (2, 6)	0.5
  (2, 0)	0.5
  (3, 21)	0.40801492725049476
  (3, 20)	0.2894987873064995
  (3, 18)	0.40801492725049476
  (3, 17)	0.40801492725049476
  (3, 14)	0.2894987873064995
  (3, 9)	0.40801492725049476
  (3, 1)	0.40801492725049476
  (4, 32)	0.4614665377636916
  (4, 20)	0.3274243027464032
  (4, 15)	0.38305685676572565
  (4, 14)	0.3274243027464032
  (4, 7)	0.4614665377636916
  (4, 2)	0.4614665377636916
  (5, 28)	0.4618042361109319
  (5, 25)	0.38333717539523177
  (5, 19)	0.4618042361109319
  (5, 16)	0.4618042361109319
  (5, 10)	0.461804
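
The two sparse printouts list the same entries in a different order. One way to confirm that the two routes agree is to compare the dense arrays, as in the sketch below. It assumes default parameters everywhere; `two_step`, `one_step`, and `same` are illustrative names.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

train_text = ["A bird in hand is worth two in the bush.",
              "Good things come to those who wait.",
              "These watches cost $1500! ",
              "There are other fish in the sea.",
              "The ball is in your court.",
              "Mr. Smith Goes to Washington ",
              "Doogie Howser M.D."]

# Route 1: count the terms, then reweight the counts.
term_counts = CountVectorizer().fit_transform(train_text)
two_step = TfidfTransformer().fit_transform(term_counts)

# Route 2: do both in a single estimator.
one_step = TfidfVectorizer().fit_transform(train_text)

# Compare the dense arrays within floating-point tolerance.
same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)  # True
```

`np.allclose` is used rather than exact equality because the two routes can differ in the last floating-point digit, as the printed values above show.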