
Welcome to today’s session on text processing using NLTK and Gensim. We’ll cover core preprocessing steps, feature extraction, and building Word2Vec embeddings.

In [None]:
# Install libraries (run once per session)
!pip install --quiet nltk gensim
!pip install --quiet numpy scipy scikit-learn
# Download NLTK data
import nltk
nltk.download('punkt_tab')                    # Pre-trained tokenizer models for splitting text into sentences and words
nltk.download('stopwords')                    # Lists of common stopwords in many languages (e.g., "the", "and", "is")
nltk.download('wordnet')                      # Lexical database used for lemmatization (e.g., "running" → "run" when treated as a verb)
nltk.download('averaged_perceptron_tagger')   # POS tagger: labels words as nouns, verbs, adjectives, etc., which helps lemmatization
nltk.download('tagsets')                      # Descriptions of POS tag symbols (e.g., "NN" = noun, "VB" = verb)
# Recent NLTK releases load the English tagger from this separately named resource
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package tagsets to /root/nltk_data...
[nltk_data]   Package tagsets is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


True

# 2. NLTK Basics: Tokenization & Text Cleanup
Tokenization splits text into meaningful units. We’ll then remove stopwords, apply stemming/lemmatization, and tag parts‑of‑speech.

In [None]:
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk import pos_tag

text = """
Natural Language Processing (NLP) enables computers to understand human language.
In this tutorial, we'll see how to preprocess text with NLTK.
"""

# Sentence & word tokenization
sentences = sent_tokenize(text)
words = word_tokenize(text)
print("Sentences:\n", sentences)
print("Words:\n", words)

# Remove stopwords
stop_words = set(stopwords.words('english'))
filtered = [w for w in words if w.lower() not in stop_words and w.isalnum()]
print("Filtered:\n", filtered)

# Stemming vs Lemmatization
ps = PorterStemmer()
lemmatizer = WordNetLemmatizer()

stemmed = [ps.stem(w) for w in filtered]
lemmatized = [lemmatizer.lemmatize(w) for w in filtered]
print("Stemmed:\n", stemmed)
print("Lemmatized:\n", lemmatized)

# POS tagging
pos_tags = pos_tag(filtered)      # Tags each word with its grammatical role (noun, verb, adjective, etc.)
print("POS Tags:\n", pos_tags)

Sentences:
 ['\nNatural Language Processing (NLP) enables computers to understand human language.', "In this tutorial, we'll see how to preprocess text with NLTK."]
Words:
 ['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'enables', 'computers', 'to', 'understand', 'human', 'language', '.', 'In', 'this', 'tutorial', ',', 'we', "'ll", 'see', 'how', 'to', 'preprocess', 'text', 'with', 'NLTK', '.']
Filtered:
 ['Natural', 'Language', 'Processing', 'NLP', 'enables', 'computers', 'understand', 'human', 'language', 'tutorial', 'see', 'preprocess', 'text', 'NLTK']
Stemmed:
 ['natur', 'languag', 'process', 'nlp', 'enabl', 'comput', 'understand', 'human', 'languag', 'tutori', 'see', 'preprocess', 'text', 'nltk']
Lemmatized:
 ['Natural', 'Language', 'Processing', 'NLP', 'enables', 'computer', 'understand', 'human', 'language', 'tutorial', 'see', 'preprocess', 'text', 'NLTK']
POS Tags:
 [('Natural', 'JJ'), ('Language', 'NNP'), ('Processing', 'NNP'), ('NLP', 'NNP'), ('enables', 'VBZ'), ('comp

In [None]:
import nltk
from nltk.data import load

# Load the tagset
tagdict = load('help/tagsets/upenn_tagset.pickle')

# Print the list
for tag in sorted(tagdict):
    print(f"{tag}: {tagdict[tag][0]}")

$: dollar
'': closing quotation mark
(: opening parenthesis
): closing parenthesis
,: comma
--: dash
.: sentence terminator
:: colon or ellipsis
CC: conjunction, coordinating
CD: numeral, cardinal
DT: determiner
EX: existential there
FW: foreign word
IN: preposition or conjunction, subordinating
JJ: adjective or numeral, ordinal
JJR: adjective, comparative
JJS: adjective, superlative
LS: list item marker
MD: modal auxiliary
NN: noun, common, singular or mass
NNP: noun, proper, singular
NNPS: noun, proper, plural
NNS: noun, common, plural
PDT: pre-determiner
POS: genitive marker
PRP: pronoun, personal
PRP$: pronoun, possessive
RB: adverb
RBR: adverb, comparative
RBS: adverb, superlative
RP: particle
SYM: symbol
TO: "to" as preposition or infinitive marker
UH: interjection
VB: verb, base form
VBD: verb, past tense
VBG: verb, present participle or gerund
VBN: verb, past participle
VBP: verb, present tense, not 3rd person singular
VBZ: verb, present tense, 3rd person singular
WDT: WH-deter

In [None]:
for word, tag in pos_tags:
    print(word, "----->", tagdict[tag][0])

Natural -----> adjective or numeral, ordinal
Language -----> noun, proper, singular
Processing -----> noun, proper, singular
NLP -----> noun, proper, singular
enables -----> verb, present tense, 3rd person singular
computers -----> noun, common, plural
understand -----> verb, present tense, not 3rd person singular
human -----> adjective or numeral, ordinal
language -----> noun, common, singular or mass
tutorial -----> adjective or numeral, ordinal
see -----> noun, common, singular or mass
preprocess -----> adjective or numeral, ordinal
text -----> noun, common, singular or mass
NLTK -----> noun, proper, singular


In [None]:
example="The quick brown fox jumps over the lazy dog."
# Sentence & word tokenization
sentences = sent_tokenize(example)
words = word_tokenize(example)
print("Sentences:\n", sentences)
print("Words:\n", words)

# Remove stopwords
stop_words = set(stopwords.words('english'))
filtered = [w for w in words if w.lower() not in stop_words and w.isalnum()]
print("Filtered:\n", filtered)

# Stemming vs Lemmatization
ps = PorterStemmer()
lemmatizer = WordNetLemmatizer()

stemmed = [ps.stem(w) for w in filtered]
lemmatized = [lemmatizer.lemmatize(w) for w in filtered]
print("Stemmed:\n", stemmed)
print("Lemmatized:\n", lemmatized)

# POS tagging
pos_tags = pos_tag(filtered)
print("POS Tags:\n", pos_tags)
print("\nPOS Tags in a bit detail:\n")
for tup in pos_tags:
  print(tup[0],"----->",tagdict[tup[1]][0])

Sentences:
 ['The quick brown fox jumps over the lazy dog.']
Words:
 ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']
Filtered:
 ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
Stemmed:
 ['quick', 'brown', 'fox', 'jump', 'lazi', 'dog']
Lemmatized:
 ['quick', 'brown', 'fox', 'jump', 'lazy', 'dog']
POS Tags:
 [('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'NNS'), ('lazy', 'JJ'), ('dog', 'NN')]

POS Tags in a bit detail:

quick -----> adjective or numeral, ordinal
brown -----> noun, common, singular or mass
fox -----> noun, common, singular or mass
jumps -----> noun, common, plural
lazy -----> adjective or numeral, ordinal
dog -----> noun, common, singular or mass


# Difference between Lemmatization & Stemming

In [None]:
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import wordnet

# Sample text
text = "The striped bats are hanging on their feet for best"

# Tokenize the text
words = nltk.word_tokenize(text)

# Initialize the stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Apply stemming
stemmed_words = [stemmer.stem(word) for word in words]

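# Map a word's Penn Treebank POS tag to the WordNet POS constant the lemmatizer expects.
# Note: each word is tagged on its own here, so sentence context is ignored.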
def get_wordnet_pos(word):
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ, "N": wordnet.NOUN, "V": wordnet.VERB, "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

# Apply lemmatization
lemmatized_words = [lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in words]

# Print results
print("Original Text: ", text)
print("Tokenized Words: ", words)
print("Stemmed Words: ", stemmed_words)
print("Lemmatized Words: ", lemmatized_words)


Original Text:  The striped bats are hanging on their feet for best
Tokenized Words:  ['The', 'striped', 'bats', 'are', 'hanging', 'on', 'their', 'feet', 'for', 'best']
Stemmed Words:  ['the', 'stripe', 'bat', 'are', 'hang', 'on', 'their', 'feet', 'for', 'best']
Lemmatized Words:  ['The', 'strip', 'bat', 'be', 'hang', 'on', 'their', 'foot', 'for', 'best']


# When to Use Lemmatization vs. Stemming
The choice between lemmatization and stemming depends on the specific requirements of the NLP task at hand:

## Use Lemmatization When:
* Accuracy and context are crucial.
* The task involves complex language understanding, such as chatbots, sentiment analysis, and machine translation.
* The computational resources are sufficient to handle the additional complexity.
## Use Stemming When:
* Speed and efficiency are more important than accuracy.
* The task involves simple text normalization, such as search engines and information retrieval systems.
* The computational resources are limited.
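
To make the trade-off concrete, here is a deliberately tiny sketch in plain Python. It is not NLTK's `PorterStemmer` or `WordNetLemmatizer`: the suffix list and the lookup table are made up for this illustration. The point is only that a stemmer applies cheap string rules, while a lemmatizer needs a dictionary and therefore handles irregular forms such as "feet".

```python
# Toy illustration only: the suffix list and lookup table below are
# hypothetical and far smaller than what NLTK actually uses.
SUFFIXES = ("ing", "ed", "s")

LEMMA_TABLE = {
    "feet": "foot",
    "are": "be",
    "bats": "bat",
    "hanging": "hang",
}

def toy_stem(word):
    """Strip the first matching suffix; no dictionary is consulted."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so very short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in a dictionary; unknown words come back unchanged."""
    word = word.lower()
    return LEMMA_TABLE.get(word, word)

for w in ["bats", "hanging", "feet", "are"]:
    print(f"{w:8} stem={toy_stem(w):8} lemma={toy_lemmatize(w)}")
```

The rule-based function never touches "feet" or "are" because no suffix rule applies, whereas the lookup resolves them at the cost of maintaining (and searching) a dictionary.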

# 3. Text Vectorization (NLTK & sklearn)
This section converts text into numeric features using Bag‑of‑Words counts and TF‑IDF weights. NLTK offers TextCollection for TF‑IDF queries, while sklearn provides CountVectorizer and TfidfVectorizer.
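
Before turning to the libraries, it may help to see both representations computed by hand. The sketch below uses only the standard library and three made-up documents. It assumes the plain textbook weighting, tf = count / document length and idf = ln(N / df). sklearn's `TfidfVectorizer` uses a smoothed idf and L2-normalizes each row by default, so its numbers will differ from these.

```python
import math
import re
from collections import Counter

# Hypothetical toy documents, chosen only for this illustration.
toy_docs = [
    "Solar energy reduces emissions",
    "Wind energy reduces emissions too",
    "Coal increases emissions",
]

def tokenize(text):
    """Lowercase and keep alphabetic runs (a simple stand-in for word_tokenize)."""
    return re.findall(r"[a-z]+", text.lower())

tokenized = [tokenize(d) for d in toy_docs]

# Bag-of-Words: one Counter of term counts per document.
bow = [Counter(tokens) for tokens in tokenized]

def tf(term, tokens):
    """Term frequency: share of the document's tokens equal to `term`."""
    return tokens.count(term) / len(tokens)

def idf(term, all_docs):
    """Inverse document frequency: ln(N / df), or 0 if the term never occurs."""
    df = sum(1 for tokens in all_docs if term in tokens)
    return math.log(len(all_docs) / df) if df else 0.0

def tf_idf(term, tokens, all_docs):
    return tf(term, tokens) * idf(term, all_docs)

print("BoW for doc 1:", dict(bow[0]))
for term in ["solar", "energy", "emissions"]:
    print(term, round(tf_idf(term, tokenized[0], tokenized), 4))
```

A term that appears in every document ("emissions" here) gets idf = ln(1) = 0, which is how TF‑IDF downweights words that do not discriminate between documents.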

In [None]:
!pip install --upgrade numpy==1.23.5 scikit-learn==1.1.3



In [None]:
from nltk.text import TextCollection         # TextCollection turns a list of documents into an NLTK-readable corpus and allows tf-idf queries
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Sample documents
docs = [
    "Climate change is affecting global temperatures and weather patterns.",
    "Renewable energy sources like solar and wind reduce carbon emissions.",
    "Fossil fuels contribute heavily to greenhouse gas emissions.",
    "Investing in solar energy helps combat climate change effectively.",
]

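# --- NLTK TextCollection for TF-IDF ---
# Caveat: TextCollection is designed for tokenized texts (lists of tokens). Given raw strings, as here,
# it matches case-sensitive substrings and divides by the character count, so the scores printed below
# are character-based. Tokenize and lowercase each document first to get word-level TF-IDF.
corpus = TextCollection(docs)
# Check TF-IDF of selected terms in document 4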
target_doc = docs[3]
for word in ['climate', 'solar', 'emissions']:
    tfidf = corpus.tf_idf(word, target_doc)
    print(f"NLTK TF‑IDF('{word}') in doc4: {tfidf:.4f}")

# --- sklearn CountVectorizer & TfidfVectorizer ---
cv = CountVectorizer(stop_words='english')
X_counts = cv.fit_transform(docs)      # Builds a vocabulary of words and converts each document into a Bag-of-Words count vector.
print("Vocabulary:\n", cv.vocabulary_)
print("Count Matrix:\n", X_counts.toarray())

tfv = TfidfVectorizer(stop_words='english')
X_tfidf = tfv.fit_transform(docs)
print("TF‑IDF Matrix:\n", X_tfidf.toarray())
# print((X_tfidf.toarray()[0]))

NLTK TF‑IDF('climate') in doc4: 0.0210
NLTK TF‑IDF('solar') in doc4: 0.0105
NLTK TF‑IDF('emissions') in doc4: 0.0000
Vocabulary:
 {'climate': 3, 'change': 2, 'affecting': 0, 'global': 12, 'temperatures': 23, 'weather': 24, 'patterns': 18, 'renewable': 20, 'energy': 8, 'sources': 22, 'like': 17, 'solar': 21, 'wind': 25, 'reduce': 19, 'carbon': 1, 'emissions': 7, 'fossil': 9, 'fuels': 10, 'contribute': 5, 'heavily': 14, 'greenhouse': 13, 'gas': 11, 'investing': 16, 'helps': 15, 'combat': 4, 'effectively': 6}
Count Matrix:
 [[1 0 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 1 0]
 [0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 1 1 1 1 0 0 1]
 [0 0 0 0 0 1 0 1 0 1 1 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 1 1 1 0 1 0 1 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0]]
TF‑IDF Matrix:
 [[0.40021825 0.         0.31553666 0.31553666 0.         0.
  0.         0.         0.         0.         0.         0.
  0.40021825 0.         0.         0.         0.         0.
  0.40021825 0.         0.         0.         0.        

# 4. Gensim Overview: Corpus, TF‑IDF, Word2Vec

Gensim excels at unsupervised topic modeling and vector embedding techniques. We’ll create a Dictionary & Corpus, apply TF‑IDF, and train a Word2Vec model.
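
As a rough mental model of the Dictionary and Bag‑of‑Words corpus, the plain-Python sketch below mimics the two steps. It is not Gensim's implementation. It assumes ids are handed out document by document in sorted token order, which is consistent with the `token2id` output shown in this section. The helper names `build_token2id` and `doc2bow` are illustrative.

```python
from collections import Counter

def build_token2id(tokenized_docs):
    """Assign integer ids to new tokens, document by document, in sorted order."""
    token2id = {}
    for tokens in tokenized_docs:
        for token in sorted(set(tokens)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def doc2bow(tokens, token2id):
    """Return sorted (token_id, count) pairs for tokens found in the vocabulary."""
    counts = Counter(t for t in tokens if t in token2id)
    return sorted((token2id[t], c) for t, c in counts.items())

# The first two preprocessed documents from this section.
sample = [
    ["climate", "change", "is", "affecting", "global",
     "temperatures", "and", "weather", "patterns"],
    ["renewable", "energy", "sources", "like", "solar",
     "and", "wind", "reduce", "carbon", "emissions"],
]

token2id = build_token2id(sample)
print(token2id)
print([doc2bow(doc, token2id) for doc in sample])
```

Each document ends up as a sparse list of `(token_id, count)` pairs, so words that do not occur in a document take no space at all.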

In [None]:
from gensim import corpora, models
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize

docs = [
    "Climate change is affecting global temperatures and weather patterns.",
    "Renewable energy sources like solar and wind reduce carbon emissions.",
    "Fossil fuels contribute heavily to greenhouse gas emissions.",
    "Investing in solar energy helps combat climate change effectively.",
]

# Preprocess docs for Gensim
processed = [[w.lower() for w in word_tokenize(doc) if w.isalpha()] for doc in docs]
print(processed)

# Create Dictionary & Corpus
dictionary = corpora.Dictionary(processed)
corpus = [dictionary.doc2bow(text) for text in processed]
print("Dictionary token2id:\n", dictionary.token2id)
print("Corpus (BoW):\n", corpus)      # Each document is a list of (token_id, count) pairs

# TF-IDF Model
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
print("TF‑IDF Corpus:\n", list(corpus_tfidf))

# Word2Vec Model
w2v_model = Word2Vec(sentences=processed, vector_size=50, window=5, min_count=1, workers=2)

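# Check if the word is in the vocabulary before accessing it.
# 'language' does not occur in these four documents, so the else branch runs; try 'climate' to see a vector.
target_word = 'language'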
if target_word in w2v_model.wv:
    print(f"Vector for '{target_word}':\n", w2v_model.wv[target_word])
    print(f"Most similar to '{target_word}':\n", w2v_model.wv.most_similar(target_word))
else:
    print(f"'{target_word}' not found in the Word2Vec vocabulary.")

# Save & load model
w2v_model.save('w2v.model')
loaded = Word2Vec.load('w2v.model')
print("Loaded model vocab size:", len(loaded.wv))

[['climate', 'change', 'is', 'affecting', 'global', 'temperatures', 'and', 'weather', 'patterns'], ['renewable', 'energy', 'sources', 'like', 'solar', 'and', 'wind', 'reduce', 'carbon', 'emissions'], ['fossil', 'fuels', 'contribute', 'heavily', 'to', 'greenhouse', 'gas', 'emissions'], ['investing', 'in', 'solar', 'energy', 'helps', 'combat', 'climate', 'change', 'effectively']]
Dictionary token2id:
 {'affecting': 0, 'and': 1, 'change': 2, 'climate': 3, 'global': 4, 'is': 5, 'patterns': 6, 'temperatures': 7, 'weather': 8, 'carbon': 9, 'emissions': 10, 'energy': 11, 'like': 12, 'reduce': 13, 'renewable': 14, 'solar': 15, 'sources': 16, 'wind': 17, 'contribute': 18, 'fossil': 19, 'fuels': 20, 'gas': 21, 'greenhouse': 22, 'heavily': 23, 'to': 24, 'combat': 25, 'effectively': 26, 'helps': 27, 'in': 28, 'investing': 29}
Corpus (BoW):
 [[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1)], [(1, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17,

# Hands‑On Lab

Let’s build a small end-to-end pipeline:

* Load a sample corpus (e.g., 5 news headlines).

* Preprocess (tokenize, remove stopwords, lemmatize).

* Vectorize with TF‑IDF.

* Train a Word2Vec model and explore embeddings.

In [None]:
# 1. Sample data
headlines = [
    "Stock markets rally as tech shares surge",
    "New species of bird discovered in Amazon rainforest",
    "Global climate summit ends with landmark agreement",
    "Breakthrough in quantum computing announced",
    "Major cybersecurity breach affects millions"
]

# 2. Preprocessing
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess(docs):
    """Tokenize each document, drop stopwords and non-alphabetic tokens, then lemmatize."""
    tokens = []
    for doc in docs:
        ws = word_tokenize(doc)
        filt = [w.lower() for w in ws if w.isalpha() and w.lower() not in stop_words]
        lem = [lemmatizer.lemmatize(w) for w in filt]
        tokens.append(lem)
    return tokens

processed_headlines = preprocess(headlines)
print("Processed Headlines:\n", processed_headlines)

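# 3. TF‑IDF Vectorization
# Note: the vectorizer is fit on the raw headlines, so it applies its own tokenization
# and keeps stopwords; the preprocessed tokens from step 2 are not used here.
from sklearn.feature_extraction.text import TfidfVectorizer
tfv = TfidfVectorizer()
X = tfv.fit_transform(headlines)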
print("TF‑IDF Features shape:", X.shape)

# 4. Word2Vec
from gensim.models import Word2Vec
lab_w2v = Word2Vec(sentences=processed_headlines, vector_size=20, window=3, min_count=1)
print("Embedding for 'quantum':", lab_w2v.wv['quantum'])

Processed Headlines:
 [['stock', 'market', 'rally', 'tech', 'share', 'surge'], ['new', 'specie', 'bird', 'discovered', 'amazon', 'rainforest'], ['global', 'climate', 'summit', 'end', 'landmark', 'agreement'], ['breakthrough', 'quantum', 'computing', 'announced'], ['major', 'cybersecurity', 'breach', 'affect', 'million']]
TF‑IDF Features shape: (5, 31)
Embedding for 'quantum': [-0.03570098  0.00620068 -0.03589604 -0.01122253  0.01859439  0.0291792
  0.0059977   0.01050102 -0.02054807  0.03614411 -0.03154423  0.0232349
 -0.04109633  0.01017275 -0.02488567 -0.02123807 -0.01554217  0.02826661
  0.02899998 -0.02488009]


# Case Study 1 : Document Clustering

## What is document clustering?
- Unsupervised grouping of documents so that texts in the same cluster are more similar to each other than to those in other clusters.

## Why use it?

* Organize large corpora (e.g., news articles, customer feedback) into coherent themes
* Enable fast browsing, topic exploration, or downstream labeling

## This case study shows:
* Preprocessing text with NLTK
* Building TF–IDF vectors with Gensim
* Clustering with scikit‑learn’s KMeans
* Interpreting clusters via top terms and sample docs
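
To show what the clustering step does under the hood, here is a minimal k-means sketch in plain Python on six made-up 2‑D points. It is not scikit‑learn's implementation: `KMeans` uses a smarter initialization (k-means++ by default) and optimized numerics. The function names, the toy points, and the fixed iteration count are assumptions made for this illustration.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, k, n_iter=20, seed=0):
    """Plain Lloyd iterations: assign to nearest centroid, then recompute centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)       # random initial centroids
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:                     # keep the old centroid if a cluster is empty
                centroids[c] = mean_point(members)
    return labels, centroids

# Hypothetical toy data: two well-separated groups of three points each.
toy_points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
              (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
labels, centroids = kmeans(toy_points, k=2)
print(labels)
print(centroids)
```

In the case study the "points" are TF–IDF document vectors with thousands of dimensions, but the assign-then-update loop is the same idea.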

In [None]:
import nltk
nltk.download('reuters')
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

from nltk.corpus import reuters, stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

from gensim import corpora, models
from gensim.matutils import corpus2csc

from sklearn.cluster import KMeans

import numpy as np
import random


[nltk_data] Downloading package reuters to /root/nltk_data...
[nltk_data]   Package reuters is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


# Load & Sample the Reuters Corpus

In [None]:
# We'll take a random sample of 500 Reuters articles for speed.
# No random seed is set, so the sample (and the clusters found later) changes from run to run.
all_ids = reuters.fileids()         # NLTK's Reuters corpus contains 10,788 documents in total.
sample_ids = random.sample(all_ids, 500)
raw_docs = [reuters.raw(fid) for fid in sample_ids]
print(f"Loaded {len(raw_docs)} documents.")


Loaded 500 documents.


In [None]:
for i in range(5):
  print(f"Document {i+1}:")
  print(raw_docs[i])

Document 1:
SILICON SYSTEMS INC &lt;SLCN> 2ND QTR MARCH 28
  Shr profit five cts vs profit two cts
      Net profit 325,000 vs profit 105,000
      Revs 19.5 mln vs 16.1 mln
      Six Mths
      Shr profit nine cts vs loss 35 cts
      Net profit 627,000 vs loss 2,280,000
      Revs 36.9 mln vs 27.4 mln
  


Document 2:
JAPAN REJECTS U.S. OBJECTIONS TO FAIRCHILD SALE
  A Foreign Ministry official dismissed
  arguments made by senior U.S. Government officials seeking to
  block the sale of a U.S. Microchip maker to a Japanese firm.
      "They appear to be linking completely unrelated issues,"
  Shuichi Takemoto of the Foreign Ministry's North American
  Division told Reuters.
      U.S. Commerce Secretary Malcolm Baldrige has asked the
  White House to consider blocking the sale of &lt;Fairchild
  Semiconductor Corp> to Japan's Fujitsu Ltd &lt;ITSU.T>, U.S.
  Officials said yesterday.
      Baldrige expressed concern that the sale would leave the
  U.S. Military dependent on a foreign 

# Preprocessing with NLTK

In [None]:
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    tokens = word_tokenize(text.lower())
    tokens = [t for t in tokens if t.isalpha() and t not in stop_words]
    tokens = [lemmatizer.lemmatize(t) for t in tokens]
    return tokens

tokenized_docs = [preprocess(doc) for doc in raw_docs]


In [None]:
for i in range(5):
  print(f"Document {i+1}:")
  print(tokenized_docs[i])

Document 1:
['silicon', 'system', 'inc', 'lt', 'slcn', 'qtr', 'march', 'shr', 'profit', 'five', 'ct', 'v', 'profit', 'two', 'ct', 'net', 'profit', 'v', 'profit', 'rev', 'mln', 'v', 'mln', 'six', 'mths', 'shr', 'profit', 'nine', 'ct', 'v', 'loss', 'ct', 'net', 'profit', 'v', 'loss', 'rev', 'mln', 'v', 'mln']
Document 2:
['japan', 'reject', 'objection', 'fairchild', 'sale', 'foreign', 'ministry', 'official', 'dismissed', 'argument', 'made', 'senior', 'government', 'official', 'seeking', 'block', 'sale', 'microchip', 'maker', 'japanese', 'firm', 'appear', 'linking', 'completely', 'unrelated', 'issue', 'shuichi', 'takemoto', 'foreign', 'ministry', 'north', 'american', 'division', 'told', 'reuters', 'commerce', 'secretary', 'malcolm', 'baldrige', 'asked', 'white', 'house', 'consider', 'blocking', 'sale', 'lt', 'fairchild', 'semiconductor', 'corp', 'japan', 'fujitsu', 'ltd', 'lt', 'official', 'said', 'yesterday', 'baldrige', 'expressed', 'concern', 'sale', 'would', 'leave', 'military', 'depe

# Build a Gensim TF–IDF Representation

In [None]:
# 1. Create Dictionary (the optional frequency filter below is left disabled)
dictionary = corpora.Dictionary(tokenized_docs)
# dictionary.filter_extremes(no_below=5, no_above=0.5)  # drops words that appear in fewer than 5 documents or in more than 50% of documents
print("Vocabulary Size",len(dictionary))
# 2. Convert to Bag‑of‑Words
bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

# 3. Fit TF–IDF model
tfidf_model = models.TfidfModel(bow_corpus)  # TF-IDF highlights words that are distinctive for a document and downplays words common across many documents
tfidf_corpus = tfidf_model[bow_corpus]

# 4. Create document‑term matrix (sparse CSC) and transpose to (n_docs × n_terms)
sparse_matrix = corpus2csc(tfidf_corpus, num_terms=len(dictionary)).transpose()     #  converts the list of TF-IDF vectors into a Compressed Sparse Column (CSC) matrix
X = sparse_matrix.toarray()  # dense for KMeans
print("Feature matrix shape:", X.shape)


Vocabulary Size 5484
Feature matrix shape: (500, 5484)


# K‑Means Clustering

In [None]:
n_clusters = 8
km = KMeans(n_clusters=n_clusters, random_state=42)
labels = km.fit_predict(X)


# Inspecting Clusters
## Top Terms per Cluster

In [None]:

order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = dictionary

for i in range(n_clusters):
    # Avoid shadowing the built-in `id`; each index maps back to a term via the dictionary
    top_terms = [terms[term_id] for term_id in order_centroids[i, :8]]
    print(f"Cluster {i+1} top terms: {', '.join(top_terms)}")


Cluster 1 top terms: share, company, pct, dlrs, stock, inc, offer, quarter
Cluster 2 top terms: v, net, ct, mln, shr, qtr, rev, dlrs
Cluster 3 top terms: qtly, div, ct, april, prior, record, pay, payout
Cluster 4 top terms: tonne, export, trade, wheat, corn, official, shipment, sugar
Cluster 5 top terms: gold, mine, ounce, acre, copper, property, mining, mineral
Cluster 6 top terms: bank, billion, rate, pct, fed, reserve, mark, dollar
Cluster 7 top terms: loss, v, profit, ct, net, rev, shr, mln
Cluster 8 top terms: oil, gas, crude, barrel, bpd, well, texas, exploration
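
The top-terms listing relies on one idea: each centroid is a vector of average term weights, so sorting its entries in descending order surfaces the terms that characterize the cluster. A small plain-Python sketch, with a made-up five-word vocabulary and two made-up centroids, shows the same ranking without NumPy (the helper name `top_term_ids` is illustrative):

```python
def top_term_ids(centroid, n_top):
    """Indices of the n_top largest weights, largest first.

    In the absence of ties this matches centroid.argsort()[::-1][:n_top].
    """
    ranked = sorted(range(len(centroid)), key=lambda j: centroid[j], reverse=True)
    return ranked[:n_top]

# Hypothetical vocabulary and centroid weights, for illustration only.
vocab = ["oil", "bank", "crude", "rate", "barrel"]
toy_centroids = [
    [0.60, 0.02, 0.45, 0.01, 0.30],
    [0.03, 0.55, 0.00, 0.40, 0.02],
]

for i, centroid in enumerate(toy_centroids, start=1):
    names = [vocab[j] for j in top_term_ids(centroid, 3)]
    print(f"Cluster {i} top terms: {', '.join(names)}")
```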


## Sample Documents per Cluster

In [None]:
for i in range(n_clusters):
    print(f"\n--- Cluster {i+1} Examples ---")
    # pick up to 2 random docs from this cluster (a cluster may hold fewer than 2)
    doc_indices = np.where(labels == i)[0]
    for idx in random.sample(list(doc_indices), min(2, len(doc_indices))):
        snippet = raw_docs[idx][:200].replace('\n',' ')
        print(f"• …{snippet}…")



--- Cluster 1 Examples ---
• …ROTTERDAM GRAIN HANDLER SAYS PORT BALANCE ROSE   Graan Elevator Mij, GEM, said its   balance in port of grains, oilseeds and derivatives rose to   146,000 tonnes on April 11 from 111,000 a week earlie…
• …EC EXPORT LICENCES FOR 59,000 TONNES WHITE  SUGAR AT REBATE 45.678 ECUS - FRENCH TRADERS    EC EXPORT LICENCES FOR 59,000 TONNES WHITE  SUGAR AT REBATE 45.678 ECUS - FRENCH TRADERS     …

--- Cluster 2 Examples ---
• …INTELLICORP &lt;INAI.O> 1ST QTR SEPT 30 LOSS   Shr loss nine cts vs loss 12 cts       Net loss 649,000 vs loss 850,000       Revs 5,059,000 vs 4,084,000       Avg shrs 7,041,000 vs 6,900,000       NOT…
• …RPC ENERGY SERVICES INC &lt;RES> 1ST QTR SEPT 30   Shr profit one cent vs loss 29 cts       Net profit 116,000 vs loss 4,195,000       Revs 20.2 mln vs 6,393,000     …

--- Cluster 3 Examples ---
• …J.W. MAYS INC &lt;MAYS> 2ND QTR JAN 31 NET   Shr 2.27 dlrs vs 74 cts       Net 4,945,989 vs 1,612,624       Revs 28.2 mln vs 27.9 mln       Si

# Case Study 2 : Topic Modeling and Word Embeddings for Movie Reviews

* Goal: Uncover the main themes (topics) in a large set of movie reviews and explore semantic relationships between words.
* Why it matters: Topic modeling helps condense thousands of customer opinions into a handful of interpretable themes (e.g., “acting,” “plot,” “special effects”). Word embeddings (Word2Vec) let us zoom into nuanced semantic clusters (e.g., “thrilling” ≈ “suspenseful”).

* Takeaways:
  * How to preprocess text at scale with NLTK
  * How to build TF–IDF and LDA topic models with Gensim
  * How to train and query a Word2Vec model
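
The "≈" between words such as “thrilling” and “suspenseful” is usually measured with cosine similarity, which is also the score Gensim reports from `most_similar`. The sketch below computes it by hand on made-up 3‑dimensional vectors. Real Word2Vec embeddings are learned from data and have many more dimensions, and this `most_similar` helper is a hypothetical stand-in, not Gensim's method.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up "embeddings", for illustration only.
toy_vectors = {
    "thrilling":   [0.9, 0.8, 0.1],
    "suspenseful": [0.8, 0.9, 0.2],
    "boring":      [-0.7, -0.6, 0.3],
}

def most_similar(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`, highest first."""
    scores = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

for other, score in most_similar("thrilling", toy_vectors):
    print(f"{other}: {score:.3f}")
```

Scores close to 1 mean the vectors point in nearly the same direction; negative scores mean they point in opposing directions.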

# 2. Dataset

In [None]:
import nltk
nltk.download('movie_reviews')
from nltk.corpus import movie_reviews

# Load documents as lists of words
docs = [movie_reviews.words(fileid) for fileid in movie_reviews.fileids()]
print(f"Loaded {len(docs)} reviews, average length ~{sum(len(d) for d in docs)/len(docs):.0f} tokens.")


[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.


Loaded 2000 reviews, average length ~792 tokens.


# 3. Preprocessing with NLTK

In [None]:
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess(doc):
    # join tokens to string, then re-tokenize to handle punctuation properly
    text = " ".join(doc).lower()
    tokens = word_tokenize(text)
    tokens = [
        lemmatizer.lemmatize(tok)
        for tok in tokens
        if tok.isalpha() and tok not in stop_words
    ]
    return tokens

processed_docs = [preprocess(doc) for doc in docs]


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


# 4. TF–IDF Analysis

In [None]:
from gensim import corpora, models

# 1. Build dictionary and filter extremes
dictionary = corpora.Dictionary(processed_docs)
dictionary.filter_extremes(no_below=5, no_above=0.5)

# 2. Convert to BOW corpus
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]

# 3. Fit TF–IDF model
tfidf = models.TfidfModel(bow_corpus)
tfidf_corpus = tfidf[bow_corpus]

# Display top‑5 TF–IDF words in the first review
doc0 = tfidf_corpus[0]
top5 = sorted(doc0, key=lambda x: -x[1])[:5]
print("Top 5 TF–IDF in review #1:")
for term_id, score in top5:
    print(f"  {dictionary[term_id]} — {score:.3f}")

Top 5 TF–IDF in review #1:
  strangeness — 0.229
  teen — 0.211
  fuck — 0.190
  highway — 0.178
  crow — 0.169
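
The `filter_extremes(no_below=5, no_above=0.5)` call prunes the vocabulary by document frequency before TF–IDF is computed. A plain-Python sketch of that idea on four made-up reviews follows. It is an approximation for illustration, not Gensim's code: the real method also caps the vocabulary size with `keep_n` (100,000 by default), and the helper names here are hypothetical.

```python
from collections import Counter

def document_frequencies(tokenized_docs):
    """Count in how many documents each token appears (once per document)."""
    dfs = Counter()
    for tokens in tokenized_docs:
        dfs.update(set(tokens))
    return dfs

def filter_by_doc_freq(tokenized_docs, no_below, no_above):
    """Keep tokens found in at least `no_below` documents and in at most
    a fraction `no_above` of all documents."""
    n_docs = len(tokenized_docs)
    dfs = document_frequencies(tokenized_docs)
    return {
        token
        for token, df in dfs.items()
        if df >= no_below and df <= no_above * n_docs
    }

# Hypothetical toy reviews, for illustration only.
toy_reviews = [
    ["film", "great", "plot"],
    ["film", "weak", "plot"],
    ["film", "great", "cast"],
    ["film", "slow", "pacing"],
]

kept = filter_by_doc_freq(toy_reviews, no_below=2, no_above=0.5)
print(sorted(kept))
```

Here "film" is dropped because it occurs in every review, and the one-off words are dropped as too rare, which is the same reasoning behind the thresholds used on the movie-review corpus.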


# 5. Topic Modeling with LDA

In [None]:
# Train a small LDA model
lda = models.LdaModel(
    corpus=bow_corpus,
    id2word=dictionary,
    num_topics=6,
    passes=10,
    random_state=42
)

# Show the 6 topics
for tid in range(6):
    terms = lda.show_topic(tid, topn=8)
    print(f"Topic {tid+1}: " + ", ".join([term for term, _ in terms]))


Topic 1: thing, really, life, bad, plot, know, take, could
Topic 2: bad, life, plot, really, director, actor, look, big
Topic 3: life, performance, man, u, work, many, action, come
Topic 4: star, action, effect, plot, bad, life, u, take
Topic 5: funny, comedy, thing, know, little, people, come, day
Topic 6: life, people, little, really, come, alien, never, know


# 6. Semantic Similarity with Word2Vec

In [None]:
from gensim.models import Word2Vec

# Train Word2Vec on the processed docs
w2v = Word2Vec(
    sentences=processed_docs,
    vector_size=100,
    window=5,
    min_count=10,
    epochs=20,
    seed=42
)

# Find similar words
for target in ["plot", "actor", "love", "horror"]:
    if target in w2v.wv:
        sims = w2v.wv.most_similar(target, topn=5)
        print(f"\nTop words similar to '{target}':")
        for word, score in sims:
            print(f"  {word} — {score:.3f}")
    else:
        print(f"'{target}' not in vocabulary.")



Top words similar to 'plot':
  storyline — 0.542
  gaping — 0.517
  premise — 0.487
  confusing — 0.484
  coincidence — 0.443

Top words similar to 'actor':
  performer — 0.642
  actress — 0.639
  talented — 0.522
  cast — 0.507
  acting — 0.460

Top words similar to 'love':
  asleep — 0.638
  flat — 0.546
  apart — 0.528
  prey — 0.473
  category — 0.455

Top words similar to 'horror':
  slasher — 0.690
  genre — 0.534
  scary — 0.532
  exorcist — 0.504
  erotic — 0.500
