# CS2103 / Lab-05 - Assignment 06 - `01-09-2025`

**Topic**: TF-IDF Document Processing

**Instructions**: Complete all the tasks below and submit your code as a notebook file named `A06.ipynb`.

---


## Objective:
To understand and apply TF-IDF (Term Frequency–Inverse Document Frequency) for extracting keywords and measuring similarity between documents.

## Tasks:

1. Dataset:
   * Use the 20 Newsgroups dataset (`sklearn.datasets.fetch_20newsgroups`), OR
   * Create a small dataset of 10–15 text documents (news articles, reviews, or research abstracts).

2. Preprocessing:
   * Convert text to lowercase.
   * Remove stopwords and punctuation.
   * (Optional) Apply stemming/lemmatization.

3. TF-IDF Implementation:
   * Use `TfidfVectorizer` from sklearn to transform the documents into TF-IDF vectors.
   * Print the top 10 keywords with the highest TF-IDF scores for any 2 selected documents.

4. Document Similarity:
   * Compute cosine similarity between 3 different pairs of documents.
   * Identify which pair of documents is most similar and which is least similar.

5. Mini-Analysis: In 5–6 sentences, explain:
   * Why some words got higher TF-IDF scores.
   * How similarity changes when documents are from the same topic vs. different topics.
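
For reference, the weights that `TfidfVectorizer` produces with its default settings (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`) can be written as below. The symbols are notation introduced here, not part of the assignment text: $n$ is the number of documents, $\mathrm{df}(t)$ the number of documents containing term $t$, and $\mathrm{tf}(t, d)$ the raw count of $t$ in document $d$.

$$
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1,
\qquad
w(t, d) = \frac{\mathrm{tf}(t, d)\,\mathrm{idf}(t)}{\sqrt{\sum_{t'} \bigl(\mathrm{tf}(t', d)\,\mathrm{idf}(t')\bigr)^2}}
$$

$$
\cos(d_i, d_j) = \frac{\mathbf{w}_i \cdot \mathbf{w}_j}{\lVert \mathbf{w}_i \rVert \, \lVert \mathbf{w}_j \rVert}
$$

Because each row $\mathbf{w}_i$ is already scaled to unit length, the cosine similarity of two TF-IDF rows reduces to their dot product, so only terms shared by both documents contribute.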



In [1]:
# Assignment A06: TF-IDF Based Text Analysis
# Duration: 2 Hours

import string
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Download necessary resources
nltk.download('stopwords')
nltk.download('wordnet')

# -------------------------
# Step 1: Create Dataset
# -------------------------
documents = [
    "The new iPhone was released today with advanced features and better camera quality.",
    "Samsung launched a new smartphone with high resolution display and long battery life.",
    "The cricket world cup is starting next month with teams preparing for the tournament.",
    "India won the cricket match after a thrilling performance by the batting lineup.",
    "Artificial Intelligence is transforming industries and shaping the future of technology.",
    "Machine Learning algorithms are widely used in data analysis and predictions.",
    "The movie received positive reviews for its storyline and strong acting performances.",
    "Critics appreciated the film for its visuals and emotional depth.",
    "Climate change is one of the biggest challenges facing the world today.",
    "Global warming and environmental issues require urgent attention from all countries.",
]

# -------------------------
# Step 2: Preprocessing
# -------------------------
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    tokens = [word for word in text.split() if word not in stop_words]
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    return " ".join(tokens)

processed_docs = [preprocess(doc) for doc in documents]

# -------------------------
# Step 3: TF-IDF Implementation
# -------------------------
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(processed_docs)
feature_names = vectorizer.get_feature_names_out()

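def print_top_keywords(doc_index, top_n=10):
    """Print the top_n highest-scoring TF-IDF terms for one document.

    If the document has fewer than top_n distinct terms, the remaining
    slots are filled with terms whose score is 0.0000.
    """
    vector = tfidf_matrix[doc_index].toarray().flatten()  # dense 1-D score array
    top_indices = vector.argsort()[-top_n:][::-1]  # indices by descending score
    top_keywords = [(feature_names[i], vector[i]) for i in top_indices]
    print(f"\nTop {top_n} keywords for Document {doc_index+1}:")
    for word, score in top_keywords:
        print(f"{word}: {score:.4f}")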

# Print top 10 keywords for two selected documents
print_top_keywords(0, 10)  # iPhone release
print_top_keywords(4, 10)  # AI transforming industries

# -------------------------
# Step 4: Document Similarity
# -------------------------
similarity_matrix = cosine_similarity(tfidf_matrix)

pairs = [(0, 1), (2, 3), (4, 5)]  # selecting 3 pairs for comparison
print("\nDocument Similarities:")
for (i, j) in pairs:
    sim = similarity_matrix[i, j]
    print(f"Similarity between Document {i+1} and Document {j+1}: {sim:.4f}")

pair_sims = {pair: similarity_matrix[pair] for pair in pairs}
most_similar = max(pair_sims, key=pair_sims.get)
least_similar = min(pair_sims, key=pair_sims.get)

print(f"\nMost Similar Pair: Document {most_similar[0]+1} and Document {most_similar[1]+1}")
print(f"Least Similar Pair: Document {least_similar[0]+1} and Document {least_similar[1]+1}")

# -------------------------
# Step 5: Mini-Analysis
# -------------------------
print("\nMini-Analysis:")
print("""
1. Words with the highest TF-IDF scores were those unique to a single document and relevant to its topic, such as 'iphone' and 'camera' in Document 1 or 'artificial' and 'intelligence' in Document 5.
2. Because these words appear in no other document, the inverse document frequency component boosts their weight.
3. Terms shared with another document, such as 'new' and 'today', scored lower (0.2925 versus 0.3441 in Document 1).
4. The cricket pair (Documents 3 and 4) was the most similar at 0.0979 because both contain 'cricket', and the smartphone pair (Documents 1 and 2) scored 0.0798 because both contain 'new'.
5. The two AI documents (5 and 6) scored 0.0000 even though they cover the same topic, because they share no terms after preprocessing.
6. Overall, TF-IDF highlights topic-specific terms, but cosine similarity on TF-IDF vectors measures vocabulary overlap rather than meaning, so same-topic documents score high only when they reuse the same words.
""")




[nltk_data] Downloading package stopwords to
[nltk_data]     /home/student/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/student/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!



Top 10 keywords for Document 1:
iphone: 0.3441
released: 0.3441
quality: 0.3441
feature: 0.3441
advanced: 0.3441
camera: 0.3441
better: 0.3441
new: 0.2925
today: 0.2925
widely: 0.0000

Top 10 keywords for Document 5:
transforming: 0.3780
technology: 0.3780
shaping: 0.3780
industry: 0.3780
future: 0.3780
intelligence: 0.3780
artificial: 0.3780
warming: 0.0000
widely: 0.0000
used: 0.0000
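
The scores above can be reproduced by hand, which makes it easier to see why unique words outrank shared ones. The sketch below is a minimal pure-Python check, assuming the default smoothed IDF and L2 normalization of `TfidfVectorizer`; the helper names and the document-frequency tables are illustrative and were read off the ten sentences manually, not taken from the notebook.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """IDF with +1 smoothing, as used by TfidfVectorizer when smooth_idf=True."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

def l2_normalized_weights(doc_freqs, n_docs):
    """TF-IDF weights for one document in which every term occurs exactly once.

    doc_freqs maps each term to the number of documents that contain it.
    """
    raw = {term: smoothed_idf(n_docs, df) for term, df in doc_freqs.items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {term: v / norm for term, v in raw.items()}

# Document 1 after preprocessing (assumed): seven terms unique to it, plus
# 'new' (also in Document 2) and 'today' (also in Document 9).
doc1 = {"iphone": 1, "released": 1, "quality": 1, "feature": 1, "advanced": 1,
        "camera": 1, "better": 1, "new": 2, "today": 2}

# Document 5 after preprocessing (assumed): seven terms, all unique to it.
doc5 = {term: 1 for term in ["transforming", "technology", "shaping", "industry",
                             "future", "intelligence", "artificial"]}

w1 = l2_normalized_weights(doc1, n_docs=10)
w5 = l2_normalized_weights(doc5, n_docs=10)
print(f"iphone: {w1['iphone']:.4f}, new: {w1['new']:.4f}, "
      f"artificial: {w5['artificial']:.4f}")
```

A term found in two documents gets a smaller IDF than a term found in one, which is why 'new' and 'today' trail the other Document 1 keywords. The trailing `0.0000` entries in the lists are terms that do not occur in the document at all; they appear only because fewer than ten terms have a non-zero score.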

Document Similarities:
Similarity between Document 1 and Document 2: 0.0798
Similarity between Document 3 and Document 4: 0.0979
Similarity between Document 5 and Document 6: 0.0000

Most Similar Pair: Document 3 and Document 4
Least Similar Pair: Document 5 and Document 6
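
The similarity values follow directly from which terms each pair shares. The sketch below recomputes them in pure Python under the same assumptions as `TfidfVectorizer`'s defaults; the token lists and the table of terms that occur in two documents were derived by hand from the ten sentences and are assumptions, and `cosine` and `tfidf` are illustrative helpers rather than notebook functions.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {term: weight} dicts."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def tfidf(terms, doc_freq, n_docs=10):
    """Unnormalized smoothed TF-IDF for a document whose terms each occur once."""
    return {t: math.log((1 + n_docs) / (1 + doc_freq.get(t, 1))) + 1 for t in terms}

# Terms assumed to occur in two of the ten documents; all others occur in one.
in_two_docs = {"new": 2, "today": 2, "cricket": 2, "world": 2, "performance": 2}

doc1 = tfidf("new iphone released today advanced feature better camera quality".split(), in_two_docs)
doc2 = tfidf("samsung launched new smartphone high resolution display long battery life".split(), in_two_docs)
doc3 = tfidf("cricket world cup starting next month team preparing tournament".split(), in_two_docs)
doc4 = tfidf("india cricket match thrilling performance batting lineup".split(), in_two_docs)
doc5 = tfidf("artificial intelligence transforming industry shaping future technology".split(), in_two_docs)
doc6 = tfidf("machine learning algorithm widely used data analysis prediction".split(), in_two_docs)

sim_12 = cosine(doc1, doc2)  # only 'new' is shared
sim_34 = cosine(doc3, doc4)  # only 'cricket' is shared
sim_56 = cosine(doc5, doc6)  # no shared terms
print(sim_12, sim_34, sim_56)
```

The printed values should land close to the 0.0798, 0.0979, and 0.0000 reported above. Documents 5 and 6 are both about AI, yet they have no word in common after preprocessing, so their dot product is exactly zero.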


In [2]:
import string
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Download necessary resources
nltk.download('stopwords')
nltk.download('wordnet')

# -------------------------
# Step 1: Load 20 Newsgroups Dataset (subset for speed)
# -------------------------
categories = ['comp.graphics', 'rec.sport.baseball', 'sci.med', 'talk.politics.misc']
newsgroups = fetch_20newsgroups(subset='train', categories=categories, remove=('headers', 'footers', 'quotes'))

# For demo, select first 10 documents
documents = newsgroups.data[:10]

# -------------------------
# Step 2: Preprocessing
# -------------------------
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    tokens = [word for word in text.split() if word not in stop_words]
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    return " ".join(tokens)

processed_docs = [preprocess(doc) for doc in documents]

# -------------------------
# Step 3: TF-IDF Implementation
# -------------------------
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(processed_docs)
feature_names = vectorizer.get_feature_names_out()

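def print_top_keywords(doc_index, top_n=10):
    """Print the top_n highest-scoring TF-IDF terms for one document.

    If the document has fewer than top_n distinct terms, the remaining
    slots are filled with terms whose score is 0.0000.
    """
    vector = tfidf_matrix[doc_index].toarray().flatten()  # dense 1-D score array
    top_indices = vector.argsort()[-top_n:][::-1]  # indices by descending score
    top_keywords = [(feature_names[i], vector[i]) for i in top_indices]
    print(f"\nTop {top_n} keywords for Document {doc_index+1}:")
    for word, score in top_keywords:
        print(f"{word}: {score:.4f}")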

# Print top 10 keywords for 2 selected documents (for example, docs 0 and 3)
print_top_keywords(0, 10)
print_top_keywords(3, 10)

# -------------------------
# Step 4: Document Similarity
# -------------------------
similarity_matrix = cosine_similarity(tfidf_matrix)

# Select 3 pairs for similarity comparison
pairs = [(0, 1), (2, 3), (4, 5)]

print("\nDocument Similarities:")
for (i, j) in pairs:
    sim = similarity_matrix[i, j]
    print(f"Similarity between Document {i+1} and Document {j+1}: {sim:.4f}")

pair_sims = {pair: similarity_matrix[pair] for pair in pairs}
most_similar = max(pair_sims, key=pair_sims.get)
least_similar = min(pair_sims, key=pair_sims.get)

print(f"\nMost Similar Pair: Document {most_similar[0]+1} and Document {most_similar[1]+1}")
print(f"Least Similar Pair: Document {least_similar[0]+1} and Document {least_similar[1]+1}")

# -------------------------
# Step 5: Mini-Analysis
# -------------------------
print("\nMini-Analysis:")
print("""
1. Words that received higher TF-IDF scores were typically unique to a document and relevant to its topic, reflecting key terms that distinguish it from others.
2. These terms are less common across the dataset, increasing their inverse document frequency and thus their weight.
3. Words that appear across many documents receive lower TF-IDF scores because their inverse document frequency is smaller.
4. Among the three pairs compared, only Documents 3 and 4 shared any vocabulary (cosine similarity 0.0337); the other two pairs scored 0.0000.
5. Documents from the same newsgroup category would be expected to share more vocabulary than documents from different categories, but the pairs here were chosen by position, so their categories were not checked.
6. Overall, TF-IDF highlights important, discriminative words, and cosine similarity on TF-IDF vectors measures vocabulary overlap between documents.
""")


[nltk_data] Downloading package stopwords to
[nltk_data]     /home/student/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/student/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!



Top 10 keywords for Document 1:
speculum: 0.4399
gynecologist: 0.2199
familiar: 0.2199
little: 0.2199
otoscope: 0.2199
female: 0.2199
fit: 0.2199
cone: 0.2199
vaginal: 0.2199
bank: 0.1870

Top 10 keywords for Document 4:
corel: 0.5502
bitmap: 0.2358
via: 0.2358
solution: 0.2358
scodal: 0.1572
traced: 0.1572
trace: 0.1572
using: 0.1572
hand: 0.1572
utility: 0.1572

Document Similarities:
Similarity between Document 1 and Document 2: 0.0000
Similarity between Document 3 and Document 4: 0.0337
Similarity between Document 5 and Document 6: 0.0000

Most Similar Pair: Document 3 and Document 4
Least Similar Pair: Document 1 and Document 2
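
The three pairs above were picked by position, so the output does not show how similarity differs between same-topic and cross-topic documents. One way to check is to average the similarity over all pairs, grouped by whether the two labels match. The sketch below is a minimal version with a hypothetical helper, `mean_similarity_by_topic`, and made-up toy numbers; it is not part of the original notebook.

```python
from itertools import combinations

def _mean(values):
    """Arithmetic mean, or NaN for an empty list."""
    return sum(values) / len(values) if values else float("nan")

def mean_similarity_by_topic(sim, labels):
    """Average pairwise similarity for same-label and different-label pairs.

    sim is a square matrix of pairwise similarities (list of lists or a
    2-D array) and labels[i] is the topic label of document i.
    """
    same, different = [], []
    for i, j in combinations(range(len(labels)), 2):
        bucket = same if labels[i] == labels[j] else different
        bucket.append(sim[i][j])
    return _mean(same), _mean(different)

# Toy example with invented values: documents 0-1 share a topic, as do 2-3.
toy_sim = [
    [1.0, 0.6, 0.1, 0.0],
    [0.6, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.4],
    [0.0, 0.1, 0.4, 1.0],
]
toy_labels = ["a", "a", "b", "b"]

same_mean, diff_mean = mean_similarity_by_topic(toy_sim, toy_labels)
print(f"same topic: {same_mean:.2f}, different topics: {diff_mean:.2f}")
```

In the notebook itself, the same helper could presumably be called as `mean_similarity_by_topic(similarity_matrix, newsgroups.target[:10])`, since `fetch_20newsgroups` returns the category index of each document in `target`.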
