## Demystifying TF-IDF Vectorization
TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure used in natural language processing (NLP) and information retrieval to determine how important a word is in a document relative to a collection (or corpus) of documents. It helps identify key terms that best represent the content of a document.

- TF (Term Frequency) measures how often a word appears in a document.
- IDF (Inverse Document Frequency) reduces the weight of words that are common across the corpus and highlights rare ones.
- TF-IDF Score = TF × IDF. A higher score means the word is frequent in that document but rare across the others.
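
For reference, the weights that scikit-learn's `TfidfVectorizer` produces with its default settings (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`) follow a smoothed and normalized variant of the product above. The symbols here are notation introduced for this note, not names from the library: $n$ is the number of documents, $\mathrm{df}(t)$ is the number of documents containing term $t$, and $\mathrm{tf}(t, d)$ is the raw count of $t$ in document $d$.

$$
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1, \qquad
w(t, d) = \mathrm{tf}(t, d)\cdot\mathrm{idf}(t), \qquad
\mathrm{tfidf}(t, d) = \frac{w(t, d)}{\sqrt{\sum_{t'} w(t', d)^2}}
$$

The final division rescales every document vector to unit Euclidean length, so weights are comparable between short and long documents.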


In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
corpus = [
    'Term Frequency measures how often a word appears in a document',
    'Inverse Document Frequency reduces the weight of common words and highlights unique ones',
    'Higher scores mean the word is important in that document but rare across others',
]

In [3]:
tfidf = TfidfVectorizer()

In [4]:
# Fit the vectorizer on the corpus and transform each document into a TF-IDF vector
# The result is a SciPy sparse matrix in Compressed Sparse Row (CSR) format
csr_transformed_vector = tfidf.fit_transform(corpus)

In [5]:
# Each unique word is assigned an index from 0 to n - 1 (n = number of unique words in the corpus).
# Words are lowercased first, and the indices follow the alphabetical order of the words.
print(tfidf.vocabulary_)

{'term': 23, 'frequency': 6, 'measures': 15, 'how': 9, 'often': 17, 'word': 28, 'appears': 2, 'in': 11, 'document': 5, 'inverse': 12, 'reduces': 21, 'the': 25, 'weight': 27, 'of': 16, 'common': 4, 'words': 29, 'and': 1, 'highlights': 8, 'unique': 26, 'ones': 18, 'higher': 7, 'scores': 22, 'mean': 14, 'is': 13, 'important': 10, 'that': 24, 'but': 3, 'rare': 20, 'across': 0, 'others': 19}
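
The alphabetical numbering can be checked by hand. The sketch below uses only the standard library and assumes the vectorizer's default behavior: lowercase the text and keep tokens of two or more word characters (which is why the word "a" never appears in the vocabulary). The names `tokens` and `vocabulary` are illustrative, not part of scikit-learn.

```python
import re

corpus = [
    'Term Frequency measures how often a word appears in a document',
    'Inverse Document Frequency reduces the weight of common words and highlights unique ones',
    'Higher scores mean the word is important in that document but rare across others',
]

# Assumed default tokenization: lowercase, then keep runs of 2+ word characters
tokens = {
    token
    for doc in corpus
    for token in re.findall(r"(?u)\b\w\w+\b", doc.lower())
}

# Sort the unique tokens and number them from 0 to n - 1
vocabulary = {word: index for index, word in enumerate(sorted(tokens))}

print(len(vocabulary))
print(vocabulary['across'], vocabulary['term'], vocabulary['words'])
```

The indices printed for `across`, `term`, and `words` should agree with the entries in `tfidf.vocabulary_` above.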


In [6]:
# Coords => (row, column): the row is the sentence's position in the corpus, the column is the word's index in the vocabulary
# Values => the weight assigned to that word by the TF-IDF calculation
print(csr_transformed_vector[:1])

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 9 stored elements and shape (1, 30)>
  Coords	Values
  (0, 23)	0.3757162113174268
  (0, 6)	0.28574186296253085
  (0, 15)	0.3757162113174268
  (0, 9)	0.3757162113174268
  (0, 17)	0.3757162113174268
  (0, 28)	0.28574186296253085
  (0, 2)	0.3757162113174268
  (0, 11)	0.28574186296253085
  (0, 5)	0.221904046872743
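
The values above can be reproduced without scikit-learn, which helps show where each number comes from. This is a minimal sketch, assuming the vectorizer's defaults: smoothed IDF `ln((1 + n) / (1 + df)) + 1`, raw term counts, and L2 normalization of each row. The helper names `tokenize` and `tfidf_rows` are hypothetical.

```python
import math
import re
from collections import Counter

corpus = [
    'Term Frequency measures how often a word appears in a document',
    'Inverse Document Frequency reduces the weight of common words and highlights unique ones',
    'Higher scores mean the word is important in that document but rare across others',
]

def tokenize(text):
    # Assumed default tokenization: lowercase, tokens of 2+ word characters
    return re.findall(r"(?u)\b\w\w+\b", text.lower())

def tfidf_rows(docs):
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        raw = {term: count * idf[term] for term, count in tf.items()}
        # L2 normalization: scale the row to unit Euclidean length
        norm = math.sqrt(sum(weight * weight for weight in raw.values()))
        rows.append({term: weight / norm for term, weight in raw.items()})
    return rows

rows = tfidf_rows(corpus)
for term, weight in rows[0].items():
    print(f"{term} -> {weight:.6f}")
```

In the first sentence, "document" occurs in all three sentences and so gets the lowest weight, "frequency", "word", and "in" occur in two, and the remaining words occur only here and get the highest weight.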


### Print the CSR Transformed Vector
The steps below only display the transformed vector in a readable form, to make the relationship between the words and the vectors in the corpus easier to see.

In [7]:
# Swap the keys and values of the vocabulary so that each column index maps back to its word
reversed_vocabulary_dict = {v: k for k, v in tfidf.vocabulary_.items()}
print(reversed_vocabulary_dict)

{23: 'term', 6: 'frequency', 15: 'measures', 9: 'how', 17: 'often', 28: 'word', 2: 'appears', 11: 'in', 5: 'document', 12: 'inverse', 21: 'reduces', 25: 'the', 27: 'weight', 16: 'of', 4: 'common', 29: 'words', 1: 'and', 8: 'highlights', 26: 'unique', 18: 'ones', 7: 'higher', 22: 'scores', 14: 'mean', 13: 'is', 10: 'important', 24: 'that', 3: 'but', 20: 'rare', 0: 'across', 19: 'others'}


In [8]:
# Convert CSR to COO (coordinate) format, which exposes row, col, and data as parallel arrays
coo = csr_transformed_vector.tocoo()
# Print each nonzero element as (row, word) -> weight
for r, c, v in zip(coo.row, coo.col, coo.data):
    print(f"({r}, {reversed_vocabulary_dict[c]}) -> {v}")

(0, term) -> 0.3757162113174268
(0, frequency) -> 0.28574186296253085
(0, measures) -> 0.3757162113174268
(0, how) -> 0.3757162113174268
(0, often) -> 0.3757162113174268
(0, word) -> 0.28574186296253085
(0, appears) -> 0.3757162113174268
(0, in) -> 0.28574186296253085
(0, document) -> 0.221904046872743
(1, frequency) -> 0.22421197567463094
(1, document) -> 0.1741206004737842
(1, inverse) -> 0.2948118037695924
(1, reduces) -> 0.2948118037695924
(1, the) -> 0.22421197567463094
(1, weight) -> 0.2948118037695924
(1, of) -> 0.2948118037695924
(1, common) -> 0.2948118037695924
(1, words) -> 0.2948118037695924
(1, and) -> 0.2948118037695924
(1, highlights) -> 0.2948118037695924
(1, unique) -> 0.2948118037695924
(1, ones) -> 0.2948118037695924
(2, word) -> 0.2187802511223572
(2, in) -> 0.2187802511223572
(2, document) -> 0.16990238180903908
(2, the) -> 0.2187802511223572
(2, higher) -> 0.28766973873039386
(2, scores) -> 0.28766973873039386
(2, mean) -> 0.28766973873039386
(2, is) -> 0.28766973