# Python Implementation of TF-IDF (Term Frequency – Inverse Document Frequency)

### Step 1: Import necessary libraries

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

### Step 2: Define a small corpus (3 documents)

In [2]:
docs = [
    "Cancer is a deadly disease",
    "Cancer treatment is expensive",
    "Regular exercise prevents disease"
]



### Step 3: Initialize TF-IDF Vectorizer

In [9]:
vect = TfidfVectorizer()
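
The default settings matter for this small corpus. As documented, `TfidfVectorizer()` lowercases the text and uses the token pattern `(?u)\b\w\w+\b`, which keeps only tokens of two or more word characters. The sketch below approximates that tokenization with the standard `re` module (it is not the vectorizer's actual code path) to show why the single-letter word "a" never becomes a feature.

```python
import re

# Same three documents as in Step 2.
docs = [
    "Cancer is a deadly disease",
    "Cancer treatment is expensive",
    "Regular exercise prevents disease",
]

# Documented default token_pattern: tokens of two or more word characters.
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

# Lowercase first (default lowercase=True), then extract the tokens.
tokens_per_doc = [re.findall(TOKEN_PATTERN, doc.lower()) for doc in docs]

# The vocabulary is the sorted set of all tokens across the corpus.
vocabulary = sorted({tok for tokens in tokens_per_doc for tok in tokens})

print(tokens_per_doc[0])            # "a" is absent from the first document
print(len(vocabulary), vocabulary)  # number of features and their order
```

This is why the fitted matrix has 9 columns rather than 10, even though the corpus contains 10 distinct words.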

### Step 4: Fit and transform the text

In [10]:
X = vect.fit_transform(docs)

In [12]:
X

<3x9 sparse matrix of type '<class 'numpy.float64'>'
	with 12 stored elements in Compressed Sparse Row format>
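
To see what the sparse matrix summary means, it can help to inspect a few attributes directly. A minimal sketch, assuming scikit-learn and SciPy are installed and reusing the variable names from the cells above; `shape`, `nnz`, and `vocabulary_` are existing attributes of the matrix and the fitted vectorizer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Cancer is a deadly disease",
    "Cancer treatment is expensive",
    "Regular exercise prevents disease",
]

vect = TfidfVectorizer()
X = vect.fit_transform(docs)

print(X.shape)  # (number of documents, number of vocabulary terms)
print(X.nnz)    # number of stored non-zero entries

# vocabulary_ maps each term to its column index in X.
for term, col in sorted(vect.vocabulary_.items(), key=lambda kv: kv[1]):
    print(col, term)
```

Each document contributes four tokens after preprocessing, so three documents give the 12 stored elements reported above.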

### Step 5: Convert to DataFrame for better readability

In [13]:
df = pd.DataFrame(X.toarray(), columns=vect.get_feature_names_out())

### Step 6: Display TF-IDF values

In [14]:
print(df.round(2))

   cancer  deadly  disease  exercise  expensive    is  prevents  regular  \
0    0.46     0.6     0.46      0.00       0.00  0.46      0.00     0.00   
1    0.43     0.0     0.00      0.00       0.56  0.43      0.00     0.00   
2    0.00     0.0     0.40      0.53       0.00  0.00      0.53     0.53   

   treatment  
0       0.00  
1       0.56  
2       0.00  
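
The numbers in this table can be reproduced by hand. The sketch below is a plain-Python approximation, assuming the documented scikit-learn defaults: raw term counts as TF, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The helper name `tfidf_row` and the variable `doc_freq` are made up for illustration.

```python
import math
from collections import Counter

# Tokens after default preprocessing (lowercased, single-letter "a" dropped).
docs_tokens = [
    ["cancer", "is", "deadly", "disease"],
    ["cancer", "treatment", "is", "expensive"],
    ["regular", "exercise", "prevents", "disease"],
]
n_docs = len(docs_tokens)

# Document frequency: the number of documents each term occurs in.
doc_freq = Counter(term for tokens in docs_tokens for term in set(tokens))

# Smoothed IDF, as documented for smooth_idf=True.
idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in doc_freq.items()}


def tfidf_row(tokens):
    """Hypothetical helper: TF * IDF per term, then L2-normalize the row."""
    tf = Counter(tokens)
    raw = {term: count * idf[term] for term, count in tf.items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {term: v / norm for term, v in raw.items()}


row0 = tfidf_row(docs_tokens[0])
print({term: round(v, 2) for term, v in row0.items()})
```

For the first document this gives about 0.46 for "cancer", "is", and "disease" and about 0.60 for "deadly", which matches row 0 of the table.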


In [15]:
docs

['Cancer is a deadly disease',
 'Cancer treatment is expensive',
 'Regular exercise prevents disease']


### 🔍 What You Can Observe

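Words like "cancer" appear in multiple documents

🔸 Even though "cancer" is an important word, its IDF is lower because it is not rare: it occurs in two of the three documents. The same holds for "is" and "disease".

🔸 This results in a moderate TF-IDF score (0.46 and 0.43 for "cancer" in the table above).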

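Words like "deadly", "expensive", and "prevents" appear only once

🔸 Each occurs in a single document, which makes it highly specific to that document.

🔸 Their IDF is high. Every term in this corpus has a term frequency of 1, so IDF alone decides the ranking within a row, and these words get the highest scores in their rows (0.60, 0.56, and 0.53).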