**What does tf-idf mean?**

Tf-idf stands for term frequency-inverse document frequency. It is a weighting scheme widely used in information retrieval and text mining: a statistical measure of how important a word is to a document within a collection (corpus). The weight increases in proportion to the number of times the word appears in the document, and is offset by how many documents in the corpus contain the word, so terms that occur almost everywhere score low. Variations of the tf-idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query.

One of the simplest ranking functions is computed by summing the tf-idf for each query term; many more sophisticated ranking functions are variants of this simple model.
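
As a rough illustration of that simplest ranking function, the sketch below scores each document by summing tf-idf over the query terms. It is only a sketch under stated assumptions: the four toy documents are the ones used in this notebook, tokens are split on whitespace, tf is the raw count, and idf uses the smoothed form `ln((1 + N) / (1 + df)) + 1`. The names `score` and `rank` are hypothetical helpers, not part of any library.

```python
import math
from collections import Counter

# Toy documents (the same four sentences used in this notebook).
docs = [
    'this is sam',
    'sam is bad in boys',
    'girls love bad boys',
    'sam and girls',
]

tokenized = [d.split() for d in docs]  # assumption: whitespace tokenization
n_docs = len(tokenized)

# Document frequency: in how many documents each term occurs.
df = Counter(term for doc in tokenized for term in set(doc))

def idf(term):
    # Assumption: smoothed idf, ln((1 + N) / (1 + df)) + 1.
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def score(query, doc_tokens):
    # Sum of tf * idf over the query terms; tf is the raw count in the document.
    tf = Counter(doc_tokens)
    return sum(tf[t] * idf(t) for t in query.split())

def rank(query):
    # Document indices ordered by descending score; ties keep corpus order.
    scores = [(score(query, doc), i) for i, doc in enumerate(tokenized)]
    return [i for s, i in sorted(scores, key=lambda p: (-p[0], p[1]))]

print(rank('girls love'))
```

For the query `'girls love'`, the third document ('girls love bad boys') contains both terms and should rank first, followed by 'sam and girls', which contains only one of them.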

Tf-idf can also be used for stop-word filtering in various fields, including text summarization and classification.

Corpus

In [0]:
corpus = [
    'this is sam',
    'sam is bad in boys',
    'girls love bad boys',
    'sam and girls',
]

Implementation

In [0]:
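from sklearn.feature_extraction.text import TfidfVectorizer

# fit() learns the vocabulary and the idf weights from the corpus;
# transform() then builds the tf-idf matrix (one row per document).
vectorizer = TfidfVectorizer()
vectorizer.fit(corpus)
skl_output = vectorizer.transform(corpus)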

In [0]:
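# sklearn feature names; they are sorted in alphabetical order by default.
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() (available since 1.0) replaces it.

print(vectorizer.get_feature_names_out().tolist())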

['and', 'bad', 'boys', 'girls', 'in', 'is', 'love', 'sam', 'this']


In [0]:
# Print the idf values that the sklearn tfidf vectorizer learned in the fit method.
# After fitting on the corpus, the vocabulary has 9 words, each with its own idf value.

print(vectorizer.idf_)

[1.91629073 1.51082562 1.51082562 1.51082562 1.91629073 1.51082562
 1.91629073 1.22314355 1.91629073]
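
The idf values printed above can be reproduced by hand. The sketch below assumes the vectorizer's default smoothed formula, `idf(t) = ln((1 + N) / (1 + df(t))) + 1`, where `N` is the number of documents and `df(t)` is the number of documents containing term `t`. It also assumes that splitting on whitespace yields the same tokens as the vectorizer's default tokenizer, which holds for this corpus because every word is lowercase and at least two characters long.

```python
import math

corpus = [
    'this is sam',
    'sam is bad in boys',
    'girls love bad boys',
    'sam and girls',
]

# Vocabulary in alphabetical order, matching the feature order shown earlier.
vocab = sorted({word for doc in corpus for word in doc.split()})
n_docs = len(corpus)

idf_values = []
for term in vocab:
    # Document frequency: number of documents that contain the term.
    df = sum(term in doc.split() for doc in corpus)
    # Smoothed idf: ln((1 + N) / (1 + df)) + 1
    idf_values.append(math.log((1 + n_docs) / (1 + df)) + 1)

for term, value in zip(vocab, idf_values):
    print(f'{term:>6}  df-based idf = {value:.8f}')
```

For example, 'sam' occurs in 3 of the 4 documents, so its idf is `ln(5 / 4) + 1 ≈ 1.22314355`, the smallest value in the array, while words that occur in a single document get `ln(5 / 2) + 1 ≈ 1.91629073`.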


In [0]:
# shape of sklearn tfidf vectorizer output after applying transform method.

skl_output.shape

(4, 9)

In [0]:
# sklearn tfidf values for first line of the above corpus.
# Here the output is a sparse matrix

print(skl_output[0])

  (0, 8)	0.7020348194149619
  (0, 7)	0.4480997313625986
  (0, 5)	0.5534923152870045


In [0]:
# The same first row, converted from sparse to dense so that all 9 vocabulary positions are visible.
# Notice that the row is L2-normalized (its squared values sum to 1); sklearn does this by default.

print(skl_output[0].toarray())

[[0.         0.         0.         0.         0.         0.55349232
  0.         0.44809973 0.70203482]]
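
To see where these numbers come from, the following sketch rebuilds a row without sklearn: multiply each term count by its idf, then divide the row by its Euclidean (L2) norm. It assumes raw counts for tf, the smoothed idf `ln((1 + N) / (1 + df)) + 1`, and whitespace tokenization, which coincides with the vectorizer's defaults for this corpus. `tfidf_row` is a hypothetical helper name.

```python
import math

corpus = [
    'this is sam',
    'sam is bad in boys',
    'girls love bad boys',
    'sam and girls',
]

tokenized = [doc.split() for doc in corpus]
vocab = sorted({term for doc in tokenized for term in doc})
n_docs = len(corpus)

# Smoothed idf per term: ln((1 + N) / (1 + df)) + 1
idf = {
    term: math.log((1 + n_docs) / (1 + sum(term in doc for doc in tokenized))) + 1
    for term in vocab
}

def tfidf_row(doc_tokens):
    """Return the L2-normalized tf-idf vector of one document, in vocab order."""
    raw = [doc_tokens.count(term) * idf[term] for term in vocab]
    norm = math.sqrt(sum(w * w for w in raw))  # Euclidean length of the row
    return [w / norm for w in raw]

row0 = tfidf_row(tokenized[0])  # 'this is sam'
print([round(w, 8) for w in row0])
```

Up to rounding, the non-zero entries should agree with the dense row above: 'is' at index 5, 'sam' at index 7, and 'this' at index 8.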


In [0]:
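# sklearn tfidf values for the second line of the corpus ('sam is bad in boys'), in sparse format.
# The row index prints as 0 because slicing returns a 1-row matrix.

print(skl_output[1])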

  (0, 7)	0.3528554929793508
  (0, 5)	0.43584673254990375
  (0, 4)	0.5528163151092931
  (0, 2)	0.43584673254990375
  (0, 1)	0.43584673254990375
