In [1]:
# import required module
from sklearn.feature_extraction.text import TfidfVectorizer
# scikit-learn provides many text-processing tools; TfidfVectorizer computes TF-IDF (term frequency-inverse document frequency) scores

In [8]:
# assign documents
d0 = 'This is about text mining'
d1 = 'Text Mining is about text'
d2 = 'This is all about text'

# merge documents into a single corpus
string = [d0, d1, d2]

# The three documents d0, d1 and d2 are merged into a single list, which is then used to compute the scores

In [6]:
# create object
tfidf = TfidfVectorizer()

# get tf-idf values
result = tfidf.fit_transform(string)

In [9]:
# get idf values
print('idf values:')
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
for ele1, ele2 in zip(tfidf.get_feature_names_out(), tfidf.idf_):
	print(ele1, ':', ele2)

idf values:
about : 1.0
all : 1.6931471805599454
is : 1.0
mining : 1.2876820724517808
text : 1.0
this : 1.2876820724517808


The IDF values are computed for each unique word in the corpus. IDF measures how informative a word is by looking at how many documents contain it. Words that appear in every document, such as 'about', 'is' and 'text' here, get the lowest IDF score (1.0), because they do little to distinguish one document from another. Conversely, words that appear in fewer documents get higher scores: 'mining' and 'this' each occur in two of the three documents (about 1.29), and 'all' occurs only in d2, so it has the highest score (about 1.69).
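The values printed above can be reproduced by hand. The sketch below assumes the vectorizer's default smoothed formula, idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t. It also assumes plain lowercase whitespace splitting, which is a simplification of the real tokenizer that happens to be sufficient for these three documents.

```python
import math

# Same three documents as in the notebook
docs = ['This is about text mining',
        'Text Mining is about text',
        'This is all about text']

# Assumption: lowercase + whitespace split stands in for the real tokenizer
tokens = [doc.lower().split() for doc in docs]
n_docs = len(docs)

# Document frequency: in how many documents does each term occur?
terms = sorted({t for doc in tokens for t in doc})
doc_freq = {t: sum(t in doc for doc in tokens) for t in terms}

# Smoothed IDF (assumed to match the default smooth_idf=True setting)
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in terms}

for term in terms:
    print(term, ':', idf[term])
```

A term found in all three documents gives ln(4/4) + 1 = 1.0, which is why the floor is 1.0 rather than 0.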

In [10]:
# get indexing
print('\nWord indexes:')
print(tfidf.vocabulary_)


Word indexes:
{'this': 5, 'is': 2, 'about': 0, 'text': 4, 'mining': 3, 'all': 1}


The vocabulary_ attribute of the TfidfVectorizer provides the word-to-index mapping. Each unique word in the corpus is assigned an index in alphabetical order of the words, even though the dictionary itself is printed in a different order. The index is the word's position in the idf_ array and its column in the TF-IDF matrix.
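To read the columns in order, the mapping can be inverted by sorting the words on their index. This is a small sketch that uses the dictionary printed above as a literal, so it runs without fitting the vectorizer again; the name terms_by_column is only illustrative.

```python
# Mapping copied from the output above
vocabulary = {'this': 5, 'is': 2, 'about': 0, 'text': 4, 'mining': 3, 'all': 1}

# Sort the words by their assigned index to get the column order
terms_by_column = sorted(vocabulary, key=vocabulary.get)

for index, term in enumerate(terms_by_column):
    print(index, term)
```

The resulting order is the same as the alphabetical order of the words, which is also the order in which the IDF values were printed earlier.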

In [11]:
# display tf-idf values
print('\ntf-idf value:')
print(result)


tf-idf value:
  (0, 3)	0.5123644459248041
  (0, 4)	0.39789669894933666
  (0, 0)	0.39789669894933666
  (0, 2)	0.39789669894933666
  (0, 5)	0.5123644459248041
  (1, 3)	0.46531539401123007
  (1, 4)	0.7227178260317894
  (1, 0)	0.3613589130158947
  (1, 2)	0.3613589130158947
  (2, 1)	0.6172273175654565
  (2, 4)	0.3645443967613799
  (2, 0)	0.3645443967613799
  (2, 2)	0.3645443967613799
  (2, 5)	0.4694172843223779


The TF-IDF values are stored as a sparse matrix, where each row corresponds to a document and each column corresponds to a word index from the vocabulary. Each printed line has the form (document index, word index) followed by the TF-IDF score of that word in that document. Only non-zero entries are stored, so every listed pair is a word that actually appears in the corresponding document.
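The scores themselves can be checked with a few lines of plain Python. The sketch below assumes the vectorizer's default settings: raw term counts as TF, the smoothed IDF ln((1 + n) / (1 + df)) + 1, and L2 normalization of each document row. Whitespace tokenization is again a simplification that is adequate only because these documents contain no punctuation or one-letter words.

```python
import math

docs = ['This is about text mining',
        'Text Mining is about text',
        'This is all about text']

# Assumption: lowercase + whitespace split stands in for the real tokenizer
tokens = [doc.lower().split() for doc in docs]
vocab = sorted({t for doc in tokens for t in doc})
n_docs = len(docs)

# Smoothed IDF per term (assumed default behaviour)
idf = {t: math.log((1 + n_docs) / (1 + sum(t in doc for doc in tokens))) + 1
       for t in vocab}

def tfidf_row(doc_tokens):
    # Raw score: term count in the document times the term's IDF
    raw = [doc_tokens.count(t) * idf[t] for t in vocab]
    # L2-normalize so that each document vector has unit length
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

rows = [tfidf_row(doc) for doc in tokens]
for row in rows:
    print([round(v, 8) for v in row])
```

Because 'text' occurs twice in d1, its raw score there is doubled before normalization, which is why entry (1, 4) is the largest value in the matrix.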

In [12]:
# in matrix form
print('\ntf-idf values in matrix form:')
print(result.toarray())


tf-idf values in matrix form:
[[0.3978967  0.         0.3978967  0.51236445 0.3978967  0.51236445]
 [0.36135891 0.         0.36135891 0.46531539 0.72271783 0.        ]
 [0.3645444  0.61722732 0.3645444  0.         0.3645444  0.46941728]]


Here, the same TF-IDF values are shown in dense matrix form (a NumPy array) for better readability. Each row corresponds to a document, and each column corresponds to a word index from the vocabulary. The values are the TF-IDF scores for each word in each document. A zero means that the word does not appear in the corresponding document; for example, column 1 ('all') is zero for the first two documents.