In [1]:
!pip install nltk

import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')

from nltk.corpus import wordnet as wn




[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [36]:
word = "dog"
synsets = wn.synsets(word)

print(f"All senses of '{word}':\n")
for s in synsets:
    print(f"{s.name()}  -->  {s.definition()}")


All senses of 'dog':

dog.n.01  -->  a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds
frump.n.01  -->  a dull unattractive unpleasant girl or woman
dog.n.03  -->  informal term for a man
cad.n.01  -->  someone who is morally reprehensible
frank.n.02  -->  a smooth-textured sausage of minced beef or pork usually smoked; often served on a bread roll
pawl.n.01  -->  a hinged catch that fits into a notch of a ratchet to move a wheel forward or prevent it from moving backward
andiron.n.01  -->  metal supports for logs in a fireplace
chase.v.01  -->  go after with the intent to catch


In [3]:
s = wn.synset('bank.n.01')
print("Definition:", s.definition())
print("Examples:", s.examples())


Definition: sloping land (especially the slope beside a body of water)
Examples: ['they pulled the canoe up on the bank', 'he sat on the bank of the river and watched the currents']


In [4]:
s = wn.synset('dog.n.01')

print("Hypernyms:", s.hypernyms())
print("Hyponyms:", s.hyponyms())
cat = wn.synset('cat.n.01')
dog = wn.synset('dog.n.01')
car = wn.synset('car.n.01')

print("cat vs dog similarity :", cat.path_similarity(dog))
print("cat vs car similarity :", cat.path_similarity(car))


Hypernyms: [Synset('domestic_animal.n.01'), Synset('canine.n.02')]
Hyponyms: [Synset('mexican_hairless.n.01'), Synset('puppy.n.01'), Synset('poodle.n.01'), Synset('newfoundland.n.01'), Synset('corgi.n.01'), Synset('dalmatian.n.02'), Synset('leonberg.n.01'), Synset('cur.n.01'), Synset('lapdog.n.01'), Synset('pooch.n.01'), Synset('pug.n.01'), Synset('griffon.n.02'), Synset('spitz.n.01'), Synset('toy_dog.n.01'), Synset('basenji.n.01'), Synset('great_pyrenees.n.01'), Synset('working_dog.n.01'), Synset('hunting_dog.n.01')]
cat vs dog similarity : 0.2
cat vs car similarity : 0.05555555555555555


In [5]:
def print_tree(synset, depth=0):
    print("  " * depth + synset.name())
    for hyper in synset.hypernyms():
        print_tree(hyper, depth+1)

print_tree(wn.synset("dog.n.01"))


dog.n.01
  domestic_animal.n.01
    animal.n.01
      organism.n.01
        living_thing.n.01
          whole.n.02
            object.n.01
              physical_entity.n.01
                entity.n.01
  canine.n.02
    carnivore.n.01
      placental.n.01
        mammal.n.01
          vertebrate.n.01
            chordate.n.01
              animal.n.01
                organism.n.01
                  living_thing.n.01
                    whole.n.02
                      object.n.01
                        physical_entity.n.01
                          entity.n.01


In [6]:
# install & downloads
!pip install -q nltk scikit-learn

import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt')       # tokenizer for contexts
nltk.download('stopwords')   # optional stopwords for features
nltk.download('punkt_tab')

from nltk.corpus import wordnet as wn
from nltk.wsd import lesk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

print("Setup done")


Setup done


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [7]:
# quick sanity: show WordNet senses for 'interest' and 'bank'
for w in ['interest','bank']:
    print(f"\nSenses for '{w}':")
    for s in wn.synsets(w):
        print(f" - {s.name():20} : {s.definition()}")



Senses for 'interest':
 - interest.n.01        : a sense of concern with and curiosity about someone or something
 - sake.n.01            : a reason for wanting something done
 - interest.n.03        : the power of attracting or holding one's attention (because it is unusual or exciting etc.)
 - interest.n.04        : a fixed charge for borrowing money; usually a percentage of the amount borrowed
 - interest.n.05        : (law) a right or legal share of something; a financial involvement with something
 - interest.n.06        : (usually plural) a social group whose members control some field of activity and who have common aims
 - pastime.n.01         : a diversion that occupies one's time and thoughts (usually pleasantly)
 - interest.v.01        : excite the curiosity of; engage the interest of
 - concern.v.02         : be on the mind of
 - matter_to.v.01       : be of importance or consequence

Senses for 'bank':
 - bank.n.01            : sloping land (especially the slope beside a 

In [8]:
# Lesk algorithm usage examples
sentences = [
    "She paid 3% interest on the loan.",
    "He showed a great interest in astronomy.",
    "I sat on the bank and watched the river flow.",
    "He went to the river bank to fish."
]

for sent in sentences:
    # simple lesk expects tokenized context and the ambiguous word
    tok = word_tokenize(sent)
    sense = lesk(tok, 'interest' if 'interest' in sent else 'bank')
    print("\nSentence:", sent)
    if sense:
        print("Lesk sense:", sense.name())
        print("Definition:", sense.definition())
    else:
        print("Lesk returned: None")



Sentence: She paid 3% interest on the loan.
Lesk sense: interest.v.01
Definition: excite the curiosity of; engage the interest of

Sentence: He showed a great interest in astronomy.
Lesk sense: sake.n.01
Definition: a reason for wanting something done

Sentence: I sat on the bank and watched the river flow.
Lesk sense: depository_financial_institution.n.01
Definition: a financial institution that accepts deposits and channels the money into lending activities

Sentence: He went to the river bank to fish.
Lesk sense: bank.v.07
Definition: cover with ashes so to control the rate of burning


In [33]:
!pip install pywsd
!python -m pywsd.download   # downloads WordNet data for pywsd

import nltk
nltk.download('averaged_perceptron_tagger_eng')


Warming up PyWSD (takes ~10 secs)... Traceback (most recent call last):
  File "<frozen runpy>", line 189, in _run_module_as_main
  File "<frozen runpy>", line 112, in _get_module_details
  File "/usr/local/lib/python3.12/dist-packages/pywsd/__init__.py", line 34, in <module>
    simple_lesk('This is a foo bar sentence', 'bar')
  File "/usr/local/lib/python3.12/dist-packages/pywsd/lesk.py", line 241, in simple_lesk
    ambiguous_word = lemmatize(ambiguous_word, pos=pos)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/pywsd/utils.py", line 92, in lemmatize
    pos = pos if pos else penn2morphy(pos_tag([ambiguous_word])[0][1],
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/nltk/tag/__init__.py", line 168, in pos_tag
    tagger = _get_tagger(lang)
             ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/nltk/tag/__init__.py", line 110, in _get_tagg

[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


True

In [34]:
from pywsd.lesk import adapted_lesk

sentences = [
    "She paid 3% interest on the loan.",
    "He showed a great interest in astronomy.",
    "I sat on the bank and watched the river flow.",
    "He went to the river bank to fish."
]

def get_target_word(sentence):
    if "interest" in sentence.lower():
        return "interest"
    return "bank"

for sent in sentences:
    target = get_target_word(sent)
    sense = adapted_lesk(sent, target)
    print("\nSentence:", sent)
    print("Predicted Sense:", sense.name())
    print("Definition:", sense.definition())



Sentence: She paid 3% interest on the loan.
Predicted Sense: interest.n.04
Definition: a fixed charge for borrowing money; usually a percentage of the amount borrowed

Sentence: He showed a great interest in astronomy.
Predicted Sense: interest.n.03
Definition: the power of attracting or holding one's attention (because it is unusual or exciting etc.)

Sentence: I sat on the bank and watched the river flow.
Predicted Sense: bank.n.01
Definition: sloping land (especially the slope beside a body of water)

Sentence: He went to the river bank to fish.
Predicted Sense: bank.n.10
Definition: a flight maneuver; aircraft tips laterally about its longitudinal axis (especially in turning)




### **1. What does the code demonstrate about the word “bank”?**

The code retrieves **all WordNet synsets** for *bank*, prints their definitions, and shows how the same word has multiple senses (river bank, financial bank). This demonstrates **polysemy** and sense inventory in WordNet.

### **2. How does the code compute semantic similarity between words?**

It uses **path similarity** between synsets (e.g., *cat.n.01* vs *dog.n.01*). Path similarity measures semantic closeness based on the distance between concepts in the WordNet hierarchy.
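As a rough illustration, path similarity can be sketched on a hand-made taxonomy. The `TOY_HYPERNYMS` graph and the helper names below are hypothetical and far smaller than WordNet. The score follows the usual definition 1 / (1 + shortest path length) over is-a links, and treating those links as undirected edges is a simplification.

```python
from collections import deque

# Hypothetical toy taxonomy: child -> list of parents (is-a links).
TOY_HYPERNYMS = {
    "dog": ["canine"],
    "cat": ["feline"],
    "canine": ["carnivore"],
    "feline": ["carnivore"],
    "carnivore": ["animal"],
    "animal": ["entity"],
    "car": ["vehicle"],
    "vehicle": ["entity"],
    "entity": [],
}

def neighbors(node):
    """Treat is-a links as undirected edges for path counting."""
    parents = TOY_HYPERNYMS.get(node, [])
    children = [c for c, ps in TOY_HYPERNYMS.items() if node in ps]
    return parents + children

def shortest_path_length(a, b):
    """Breadth-first search; returns None if the nodes are not connected."""
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in neighbors(node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

def path_similarity(a, b):
    """1 / (1 + number of edges on the shortest path between a and b)."""
    dist = shortest_path_length(a, b)
    return None if dist is None else 1.0 / (1 + dist)

print(path_similarity("cat", "dog"))  # cat and dog are 4 edges apart
print(path_similarity("cat", "car"))  # cat and car are 6 edges apart
```

In this toy graph, cat and dog meet at `carnivore`, while cat and car only meet at the root, so the second score is lower. The WordNet results in the notebook show the same pattern.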

### **3. What is the purpose of the custom `print_tree()` function?**

`print_tree()` recursively prints the **hypernym hierarchy** for a synset (e.g., dog → canine → carnivore → …). This visualizes where a concept sits in WordNet’s semantic taxonomy.
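Because a synset can have more than one hypernym, a recursive walk like this visits shared ancestors once per path, which is why `animal.n.01` and everything above it appear twice in the printed tree. A minimal sketch on a hypothetical toy graph (the `TOY` dictionary and the `hypernym_paths` helper are illustrative, not part of NLTK) lists each upward path explicitly:

```python
# Hypothetical toy graph: node -> list of hypernyms (parents).
TOY = {
    "dog": ["domestic_animal", "canine"],
    "domestic_animal": ["animal"],
    "canine": ["carnivore"],
    "carnivore": ["animal"],
    "animal": ["entity"],
    "entity": [],
}

def hypernym_paths(node):
    """Return every path from `node` up to a root, one list per path."""
    parents = TOY[node]
    if not parents:               # a root has exactly one path: itself
        return [[node]]
    paths = []
    for parent in parents:
        for tail in hypernym_paths(parent):
            paths.append([node] + tail)
    return paths

for path in hypernym_paths("dog"):
    print(" -> ".join(path))
```

Each printed line corresponds to one branch of the indented tree.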

### **4. How does the code perform Word Sense Disambiguation (WSD)?**

It applies the **Lesk algorithm** from NLTK, which disambiguates a word (interest/bank) based on overlapping words between context and dictionary definitions.
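The overlap idea can be sketched without WordNet. The sense labels, glosses, and stopword list below are invented for illustration, and this is a simplified Lesk variant rather than NLTK's implementation.

```python
# Hypothetical mini sense inventory: sense label -> gloss.
GLOSSES = {
    "bank_river": "sloping land beside a body of water such as a river",
    "bank_money": "a financial institution that accepts deposits and lends money",
}
STOP = {"a", "an", "the", "of", "and", "that", "such", "as", "on", "to", "i", "he"}

def content_words(text):
    """Lowercase, strip simple punctuation, and drop stopwords."""
    tokens = (t.strip(".,!?").lower() for t in text.split())
    return {t for t in tokens if t and t not in STOP}

def simplified_lesk(sentence, glosses):
    """Pick the sense whose gloss shares the most content words with the sentence."""
    context = content_words(sentence)
    scores = {sense: len(context & content_words(gloss))
              for sense, gloss in glosses.items()}
    best = max(scores, key=scores.get)
    return best, scores

print(simplified_lesk("He went to the river bank to fish.", GLOSSES))
print(simplified_lesk("She deposits money at the bank.", GLOSSES))
```

With glosses this short, a single shared word decides the outcome, which hints at why the real Lesk runs in this notebook are easily misled.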

### **5. Why does the code download NLTK resources like 'wordnet', 'punkt', and 'stopwords'?**

* **wordnet** → sense inventory
* **omw-1.4** → multilingual WordNet
* **punkt** → tokenizer models used by `word_tokenize` to split each sentence before it is passed to Lesk
* **stopwords** → useful for preprocessing context

These resources support semantic lookup and WSD.

-------------------------------------------

### **1. Why is WordNet useful for understanding lexical semantics?**

WordNet provides a structured semantic network (synonyms, hypernyms, hyponyms), helping learners understand how meaning is represented and related across words.

### **2. What is the significance of studying hypernyms and hyponyms?**

They encode **is-a relationships** (e.g., dog → animal). Understanding these hierarchies is essential for tasks like ontology building, text classification, and semantic reasoning.

### **3. Why is path similarity meaningful?**

Path similarity quantifies how conceptually close two words are. It reflects **semantic distance** and serves as a classical baseline for similarity tasks before neural embeddings.

### **4. Why do we still teach the Lesk WSD algorithm?**

Lesk is simple, interpretable, and demonstrates the **core idea of sense disambiguation via context overlap**. It provides conceptual grounding before introducing transformer-based WSD.

### **5. How does this demo connect to modern NLP / LLMs?**

WordNet relations (synonymy, hypernymy) and WSD logic form the foundation of semantic modeling. Modern LLMs implicitly learn these relations, and your demo shows the explicit classical version.




In [28]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop = set(stopwords.words('english'))

# --- Dataset ---
examples = [
    ("She pays 3% interest on the loan.", "interest_money"),
    ("The bank charges interest on late payments.", "interest_money"),
    ("Interest on the mortgage is high this month.", "interest_money"),

    ("He showed great interest in astronomy.", "interest_curiosity"),
    ("I have an interest in modern art.", "interest_curiosity"),
    ("Playing chess is one of my interests.", "interest_curiosity"),

    ("Microsoft purchased a controlling interest in the company.", "interest_stake"),
    ("They sold their interest in the factory.", "interest_stake"),
    ("Business interests lobbied for the legislation.", "interest_stake"),
]

# --- Better featurizer ---
def featurize(sent, focus='interest', window=4):
    toks = word_tokenize(sent.lower())
    idxs = [i for i,t in enumerate(toks) if t == focus]

    features = {}
    for idx in idxs:
        start = max(0, idx - window)
        end = min(len(toks), idx + window + 1)
        ctx = toks[start:end]
        for t in ctx:
            if t.isalpha() and t not in stop and t != focus:
                features[f"ctx({t})"] = 1

    if not idxs:  # fallback
        for t in toks:
            if t.isalpha() and t not in stop:
                features[f"bow({t})"] = 1

    return features

# --- Vectorize ---
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score

X_dict = [featurize(s) for s, lbl in examples]
y = [lbl for s,lbl in examples]

vec = DictVectorizer(sparse=False)
X = vec.fit_transform(X_dict)

# --- Stratified Split (critical) ---
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y
)

# --- Train ---
clf = LogisticRegression(max_iter=500, multi_class='ovr')
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nReport:\n", classification_report(y_test, y_pred))


Accuracy: 0.3333333333333333

Report:
                     precision    recall  f1-score   support

interest_curiosity       0.00      0.00      0.00         1
    interest_money       0.33      1.00      0.50         1
    interest_stake       0.00      0.00      0.00         1

          accuracy                           0.33         3
         macro avg       0.11      0.33      0.17         3
      weighted avg       0.11      0.33      0.17         3



  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


In [29]:
def predict_sentence(sent):
    f = featurize(sent)
    xv = vec.transform([f])  # wrap the single feature dict in a list -> one row of shape (1, n_features)
    pred = clf.predict(xv)
    return pred[0]

tests = [
    "I'm interested in learning guitar.",
    "The bank increased interest rates.",
    "They sold their interest in the partnership."
]

for t in tests:
    print(t, "->", predict_sentence(t))


I'm interested in learning guitar. -> interest_money
The bank increased interest rates. -> interest_money
They sold their interest in the partnership. -> interest_stake


In [11]:
import nltk
from nltk.tokenize import word_tokenize
from collections import defaultdict, Counter
import math
import numpy as np

nltk.download('punkt')

corpus = [
    "the cat drank milk",
    "the dog drank water",
    "a cat chased a mouse",
    "the dog chased the cat",
    "milk and water are liquids",
    "the mouse drank water",
]

# tokenize
tokenized = [word_tokenize(sent.lower()) for sent in corpus]

tokenized


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


[['the', 'cat', 'drank', 'milk'],
 ['the', 'dog', 'drank', 'water'],
 ['a', 'cat', 'chased', 'a', 'mouse'],
 ['the', 'dog', 'chased', 'the', 'cat'],
 ['milk', 'and', 'water', 'are', 'liquids'],
 ['the', 'mouse', 'drank', 'water']]

In [12]:
from nltk.corpus import stopwords
nltk.download('stopwords')
stop = set(stopwords.words('english'))

# collect vocabulary
vocab = sorted({w for sent in tokenized for w in sent if w.isalpha() and w not in stop})
vocab


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['cat', 'chased', 'dog', 'drank', 'liquids', 'milk', 'mouse', 'water']

In [37]:
window_size = 2

# initialize co-occurrence counts
cooc = defaultdict(lambda: Counter())

for sent in tokenized:
    for i, w in enumerate(sent):
        if w not in vocab:
            continue
        # context window
        left = max(0, i-window_size)
        right = min(len(sent), i+window_size+1)
        # context = [c for j, c in enumerate(sent[left:right]) if j != i and c in vocab]
        context = [sent[j] for j in range(left, right) if j != i and sent[j] in vocab]
        for c in context:
            cooc[w][c] += 1

# show some counts
for w in ["cat", "dog", "milk", "water"]:
    print(w, dict(cooc[w]))




cat {'drank': 1, 'milk': 1, 'chased': 2}
dog {'drank': 1, 'water': 1, 'chased': 1}
milk {'cat': 1, 'drank': 1, 'water': 1}
water {'dog': 1, 'drank': 2, 'milk': 1, 'liquids': 1, 'mouse': 1}


In [38]:
# matrix with rows = words, columns = contexts
word2idx = {w:i for i,w in enumerate(vocab)}
idx2word = {i:w for w,i in word2idx.items()}

count_matrix = np.zeros((len(vocab), len(vocab)))

for w in vocab:
    for c,count in cooc[w].items():
        count_matrix[word2idx[w], word2idx[c]] = count

count_matrix


array([[0., 2., 0., 1., 0., 1., 0., 0.],
       [2., 0., 1., 0., 0., 0., 1., 0.],
       [0., 1., 0., 1., 0., 0., 0., 1.],
       [1., 0., 1., 0., 0., 1., 1., 2.],
       [0., 0., 0., 0., 0., 0., 0., 1.],
       [1., 0., 0., 1., 0., 0., 0., 1.],
       [0., 1., 0., 1., 0., 0., 0., 1.],
       [0., 0., 1., 2., 1., 1., 1., 0.]])

In [39]:
# total co-occurrences
total = count_matrix.sum()

# compute probabilities
p_wc = count_matrix / total
p_w = p_wc.sum(axis=1, keepdims=True)
p_c = p_wc.sum(axis=0, keepdims=True)

PMI = np.log((p_wc) / (p_w * p_c + 1e-12) + 1e-12)  # add eps to avoid div/0
PPMI = np.maximum(PMI, 0)

PMI, PPMI


(array([[-27.63102112,   1.32175584, -27.63102112,   0.22314355,
         -27.63102112,   0.91629073, -27.63102112, -27.63102112],
        [  1.32175584, -27.63102112,   0.91629073, -27.63102112,
         -27.63102112, -27.63102112,   0.91629073, -27.63102112],
        [-27.63102112,   0.91629073, -27.63102112,   0.51082562,
         -27.63102112, -27.63102112, -27.63102112,   0.51082562],
        [  0.22314355, -27.63102112,   0.51082562, -27.63102112,
         -27.63102112,   0.51082562,   0.51082562,   0.51082562],
        [-27.63102112, -27.63102112, -27.63102112, -27.63102112,
         -27.63102112, -27.63102112, -27.63102112,   1.60943791],
        [  0.91629073, -27.63102112, -27.63102112,   0.51082562,
         -27.63102112, -27.63102112, -27.63102112,   0.51082562],
        [-27.63102112,   0.91629073, -27.63102112,   0.51082562,
         -27.63102112, -27.63102112, -27.63102112,   0.51082562],
        [-27.63102112, -27.63102112,   0.51082562,   0.51082562,
           1.60943

In [40]:
def cosine(v1, v2):
    if np.linalg.norm(v1)==0 or np.linalg.norm(v2)==0:
        return 0
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

def sim(word1, word2, M=PPMI):
    return cosine(M[word2idx[word1]], M[word2idx[word2]])


In [41]:
pairs = [
    ("cat", "dog"),
    ("cat", "mouse"),
    ("milk", "water"),
    ("dog", "water"),
    ("cat", "milk"),
]

for w1, w2 in pairs:
    print(f"{w1:5} – {w2:5} : {sim(w1, w2):.4f}")


cat   – dog   : 0.6994
cat   – mouse : 0.6994
milk  – water : 0.1173
dog   – water : 0.1173
cat   – milk  : 0.0602


In [42]:
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=5)
dense = svd.fit_transform(PPMI)

def sim_dense(w1, w2):
    return cosine(dense[word2idx[w1]], dense[word2idx[w2]])

for w1, w2 in pairs:
    print(f"(SVD) {w1:5} – {w2:5} : {sim_dense(w1, w2):.4f}")


(SVD) cat   – dog   : 0.7559
(SVD) cat   – mouse : 0.7559
(SVD) milk  – water : 0.1250
(SVD) dog   – water : 0.1205
(SVD) cat   – milk  : 0.0115




### **1. How does the code perform Word Sense Disambiguation (WSD)?**

Your demo includes **two WSD approaches**:

1. **Lesk algorithm** → dictionary-overlap-based unsupervised WSD.
2. **Logistic Regression classifier** → supervised WSD trained on a small labeled dataset using bag-of-words features.

This allows students to compare classical rule-based WSD with data-driven machine learning WSD.

---

### **2. What features are used in the supervised WSD classifier?**

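The classifier uses a simple **bag-of-words context representation**:

* tokenized words (lowercased)
* only tokens within a window of 4 positions on either side of the target word
* stopwords and non-alphabetic tokens removed
* target word (“interest”) excluded

Each remaining word becomes a binary feature `ctx(word)=1`. If the exact token “interest” does not occur in the sentence (e.g., only “interests” does), the code falls back to `bow(word)=1` features over the whole sentence.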

These are converted into vectors via `DictVectorizer`.
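A minimal sketch of the windowed binary features, assuming whitespace tokenization and a tiny hand-written stopword set instead of NLTK's tokenizer and stopword list. Both are simplifications, `window_features` is a hypothetical name, and the fallback branch is omitted.

```python
STOP = {"the", "on", "in", "a", "an", "of", "is", "she", "he"}

def window_features(sentence, focus="interest", window=4):
    """Binary features for content words within `window` tokens of `focus`."""
    toks = [t.strip(".,!?%").lower() for t in sentence.split()]
    feats = {}
    for i, tok in enumerate(toks):
        if tok != focus:
            continue
        lo, hi = max(0, i - window), min(len(toks), i + window + 1)
        for t in toks[lo:hi]:
            # keep alphabetic, non-stopword tokens other than the target itself
            if t.isalpha() and t not in STOP and t != focus:
                feats[f"ctx({t})"] = 1
    return feats

print(window_features("She paid 3% interest on the loan."))
```

A list of such dictionaries can be passed to `DictVectorizer.fit_transform`, which assigns one column per distinct feature name.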

---

### **3. How is the co-occurrence matrix constructed for PMI?**

The code:

* tokenizes each sentence
* defines a **window size = 2**
* counts how often each vocabulary word (stopwords excluded) occurs with every other vocabulary word within that window
* stores counts in a matrix (rows = target words, columns = context words)

This produces the raw statistics needed for PMI and PPMI.
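The counting step can be sketched as a small standalone function. It mirrors the notebook cell but runs on only two sentences from the corpus. The function name `cooccurrence` and the reduced vocabulary are assumptions made for the example.

```python
from collections import Counter, defaultdict

def cooccurrence(sentences, vocab, window=2):
    """For each vocabulary word, count vocabulary words within `window` positions."""
    counts = defaultdict(Counter)
    for sent in sentences:
        for i, w in enumerate(sent):
            if w not in vocab:
                continue
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i and sent[j] in vocab:
                    counts[w][sent[j]] += 1
    return counts

toy = [["the", "cat", "drank", "milk"], ["the", "dog", "drank", "water"]]
vocab = {"cat", "dog", "drank", "milk", "water"}
counts = cooccurrence(toy, vocab)
print(dict(counts["drank"]))
```

Note that the window is measured over the raw token positions, so a stopword such as "the" still occupies a slot even though it is never counted.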

---

### **4. How are PMI and PPMI computed?**

PMI = log( P(w,c) / (P(w) * P(c)) )
PPMI = max(PMI, 0)

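The code adds a small epsilon (1e-12) to the denominator and inside the logarithm so that zero counts do not cause a division by zero or log(0). Cells with no co-occurrence therefore get a PMI of about -27.63 (the log of 1e-12), and clipping negative values to zero yields PPMI.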
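A small worked example of the two formulas, written in plain Python. This sketch skips zero-count cells explicitly instead of adding an epsilon, which is a variant of the notebook's approach. The `ppmi` function and the `toy_counts` matrix are made up for illustration.

```python
import math

def ppmi(counts):
    """counts: list-of-lists of co-occurrence counts. Returns a PPMI matrix."""
    total = sum(sum(row) for row in counts)
    row_sums = [sum(row) for row in counts]
    col_sums = [sum(col) for col in zip(*counts)]
    out = []
    for i, row in enumerate(counts):
        out_row = []
        for j, n in enumerate(row):
            if n == 0:
                out_row.append(0.0)   # log(0) is undefined; PPMI treats it as 0
                continue
            p_wc = n / total
            p_w = row_sums[i] / total
            p_c = col_sums[j] / total
            out_row.append(max(math.log(p_wc / (p_w * p_c)), 0.0))
        out.append(out_row)
    return out

toy_counts = [
    [0, 2, 1],
    [2, 0, 0],
    [1, 0, 0],
]
for row in ppmi(toy_counts):
    print([round(v, 4) for v in row])
```

For cell (0, 1): P(w,c) = 2/6, P(w) = 3/6, P(c) = 2/6, so the ratio is 2 and the PMI is log 2.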

---

### **5. What is the purpose of applying SVD to the PPMI matrix?**

SVD reduces the high-dimensional co-occurrence space to dense semantic vectors (like early word embeddings).
This demonstrates how **distributional similarity → low-dimensional embeddings → semantic similarity**.
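The reduction can be sketched with `numpy.linalg.svd` in place of scikit-learn's `TruncatedSVD`. The matrix `M` is a hypothetical PPMI-like matrix, and `truncated_svd_embed` is an illustrative helper name. Keeping the first `k` columns of U scaled by the singular values gives the dense row vectors.

```python
import numpy as np

def truncated_svd_embed(M, k):
    """Return k-dimensional row embeddings U_k * S_k from a dense matrix M."""
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :k] * S[:k]

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return 0.0 if denom == 0 else float(np.dot(u, v) / denom)

# Hypothetical PPMI-like matrix: rows 0 and 1 share contexts, row 2 does not.
M = np.array([
    [1.0, 0.9, 0.0, 0.0],
    [0.9, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.8],
])
emb = truncated_svd_embed(M, k=2)
print(emb.shape)
print(round(cosine(emb[0], emb[1]), 4), round(cosine(emb[0], emb[2]), 4))
```

Rows that share contexts end up pointing in nearly the same direction in the reduced space, while the unrelated row stays close to orthogonal.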

----------------------------------------------

### **1. Why is it important to compare WordNet similarity with PMI similarity?**

It shows the difference between:

* **knowledge-based semantics** (WordNet hierarchy)
* **distributional semantics** (“you shall know a word by the company it keeps”)

Students see how meaning can come from structured knowledge vs. raw corpus statistics.

---

### **2. Why do we train a supervised WSD model when unsupervised methods exist?**

Lesk is interpretable but brittle.
Supervised models adapt to real usage and generally achieve higher accuracy, showing the **evolution from rule-based → statistical → neural WSD**.

---

### **3. Why is PMI/PPMI important for understanding modern embeddings?**

PMI is mathematically related to **Skip-gram Negative Sampling (SGNS)** and **GloVe**.
Your demo helps students understand how classical distributional statistics inspired modern word embeddings and even transformer semantics.

---

### **4. What is the educational value of building a tiny labeled WSD dataset?**

It illustrates the **full ML pipeline** on a small scale:

* feature extraction
* vectorization
* train/test split
* training
* evaluation
* prediction on unseen sentences
  Students understand WSD without needing huge datasets.

---

### **5. Why reduce dimensionality with SVD?**

SVD demonstrates the core idea behind embeddings:

* compress co-occurrence information
* preserve key semantic dimensions
* produce dense vectors that enable smooth similarity computations

This is the “classical ancestor” of word2vec/GloVe.




In [19]:
import nltk
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

nltk.download('punkt')

# small sample corpus (you can replace with your own)
documents = [
    "The cat drinks milk.",
    "Dogs and cats are common household pets.",
    "Milk is a white liquid produced by mammals.",
    "The dog chased the cat near the river.",
    "Fresh milk and water are essential for health.",
]

documents


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


['The cat drinks milk.',
 'Dogs and cats are common household pets.',
 'Milk is a white liquid produced by mammals.',
 'The dog chased the cat near the river.',
 'Fresh milk and water are essential for health.']

In [20]:
vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(documents)

print("Shape (documents × terms):", tfidf_matrix.shape)
print("Vocabulary sample:", list(vectorizer.vocabulary_.keys())[:10])


Shape (documents × terms): (5, 20)
Vocabulary sample: ['cat', 'drinks', 'milk', 'dogs', 'cats', 'common', 'household', 'pets', 'white', 'liquid']


In [21]:
def search(query, top_k=3):
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, tfidf_matrix).flatten()

    ranked = scores.argsort()[::-1]
    print(f"\nQuery: '{query}'\n")

    for idx in ranked[:top_k]:
        print(f"Score: {scores[idx]:.4f}  |  Doc {idx}: {documents[idx]}")

# Test queries
search("cat milk")
search("dog chased")
search("white liquid")



Query: 'cat milk'

Score: 0.7237  |  Doc 0: The cat drinks milk.
Score: 0.2879  |  Doc 3: The dog chased the cat near the river.
Score: 0.2028  |  Doc 4: Fresh milk and water are essential for health.

Query: 'dog chased'

Score: 0.6558  |  Doc 3: The dog chased the cat near the river.
Score: 0.0000  |  Doc 4: Fresh milk and water are essential for health.
Score: 0.0000  |  Doc 2: Milk is a white liquid produced by mammals.

Query: 'white liquid'

Score: 0.6705  |  Doc 2: Milk is a white liquid produced by mammals.
Score: 0.0000  |  Doc 4: Fresh milk and water are essential for health.
Score: 0.0000  |  Doc 3: The dog chased the cat near the river.


In [22]:
from sklearn.feature_extraction.text import CountVectorizer

tf_vectorizer = CountVectorizer(stop_words='english')
tf_matrix = tf_vectorizer.fit_transform(documents)

def search_tf(query, top_k=3):
    query_vec = tf_vectorizer.transform([query])
    scores = cosine_similarity(query_vec, tf_matrix).flatten()
    ranked = scores.argsort()[::-1]
    print(f"\n[TF ONLY] Query: '{query}'\n")
    for idx in ranked[:top_k]:
        print(f"Score: {scores[idx]:.4f} | Doc {idx}: {documents[idx]}")

search_tf("white liquid")
search("white liquid")  # TF-IDF version



[TF ONLY] Query: 'white liquid'

Score: 0.6325 | Doc 2: Milk is a white liquid produced by mammals.
Score: 0.0000 | Doc 4: Fresh milk and water are essential for health.
Score: 0.0000 | Doc 3: The dog chased the cat near the river.

Query: 'white liquid'

Score: 0.6705  |  Doc 2: Milk is a white liquid produced by mammals.
Score: 0.0000  |  Doc 4: Fresh milk and water are essential for health.
Score: 0.0000  |  Doc 3: The dog chased the cat near the river.


In [23]:
import pandas as pd
feature_names = vectorizer.get_feature_names_out()

def show_weights(doc_id):
    vec = tfidf_matrix[doc_id].T.todense()
    df = pd.DataFrame(vec, index=feature_names, columns=['tfidf'])
    print(f"\nTF-IDF weights for Document {doc_id}:")
    display(df.sort_values('tfidf', ascending=False).head(10))

show_weights(2)  # example: milk document



TF-IDF weights for Document 2:


Unnamed: 0,tfidf
mammals,0.474125
produced,0.474125
white,0.474125
liquid,0.474125
milk,0.317527
chased,0.0
cats,0.0
cat,0.0
common,0.0
dog,0.0


In [24]:
import numpy as np

sim_matrix = cosine_similarity(tfidf_matrix)
pd.DataFrame(sim_matrix, columns=[f"D{i}" for i in range(len(documents))], index=[f"D{i}" for i in range(len(documents))])


Unnamed: 0,D0,D1,D2,D3,D4
D0,1.0,0.0,0.146763,0.208308,0.146763
D1,0.0,1.0,0.0,0.0,0.0
D2,0.146763,0.0,1.0,0.0,0.100823
D3,0.208308,0.0,0.0,1.0,0.0
D4,0.146763,0.0,0.100823,0.0,1.0


In [25]:
!pip install wikipedia

import requests
import time
import wikipedia
from datetime import datetime

wikipedia.set_lang("en")

HEADERS = {
    "User-Agent": "Raghvendra-Kumar-Demo/1.0 (contact: raghvendra.kumar@1004.com)"
}

def get_random_titles_batch(batch_size=50):
    URL = "https://en.wikipedia.org/w/api.php"
    PARAMS = {
        "action": "query",
        "format": "json",
        "list": "random",
        "rnlimit": batch_size,
        "rnnamespace": 0
    }

    for _ in range(5):  # retry up to 5 times
        try:
            r = requests.get(URL, params=PARAMS, headers=HEADERS, timeout=10)
            data = r.json()
            return [item["title"] for item in data["query"]["random"]]
        except Exception:  # a bare except would also swallow KeyboardInterrupt
            time.sleep(1)  # wait briefly, then retry

    return []


def get_random_wiki_pages(n_pages=100):
    pages = []
    last_print = time.time()

    print(f"Starting download of {n_pages} Wikipedia pages...\n")

    while len(pages) < n_pages:
        batch_start = time.time()
        titles = get_random_titles_batch(50)
        if not titles:
            print("⚠️ API failed, retrying...")
            time.sleep(1)
            continue

        for title in titles:
            try:
                page = wikipedia.page(title)
                text = page.content
                if len(text) > 500:
                    pages.append((title, text))

                    # Print progress every 50 pages added
                    if len(pages) % 50 == 0:
                        print(f"[{datetime.now().strftime('%H:%M:%S')}] "
                              f"Collected {len(pages)} / {n_pages}")
            except Exception:  # skip pages that fail to load (e.g., disambiguation or missing pages)
                continue

            if len(pages) >= n_pages:
                break

        batch_time = time.time() - batch_start
        print(f"   ✔ Batch processed in {batch_time:.2f} sec "
              f"(Total: {len(pages)} pages)")

        time.sleep(0.1)

    print("\n🎉 Completed downloading all pages!")
    return pages


wiki_docs = get_random_wiki_pages(100)
len(wiki_docs)


Starting download of 100 Wikipedia pages...

   ✔ Batch processed in 49.44 sec (Total: 43 pages)




  lis = BeautifulSoup(html).find_all('li')


[04:41:06] Collected 50 / 100
   ✔ Batch processed in 46.97 sec (Total: 78 pages)
[04:42:11] Collected 100 / 100
   ✔ Batch processed in 26.37 sec (Total: 100 pages)

🎉 Completed downloading all pages!


100

In [26]:
titles = [t for t, _ in wiki_docs]
texts = [txt for _, txt in wiki_docs]

vectorizer = TfidfVectorizer(stop_words='english', max_features=20000)
tfidf_matrix = vectorizer.fit_transform(texts)

print("TF-IDF matrix shape:", tfidf_matrix.shape)


TF-IDF matrix shape: (100, 11604)


In [27]:
def search_wiki(query, top_k=5):
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, tfidf_matrix).flatten()
    ranked = scores.argsort()[::-1]

    print(f"\n🔍 Query: {query}\n")
    for idx in ranked[:top_k]:
        print(f"📘 {titles[idx]}  | score={scores[idx]:.4f}\n")

# example searches
search_wiki("quantum mechanics")
search_wiki("Indian classical music")
search_wiki("machine learning algorithms")



🔍 Query: quantum mechanics

📘 Alexios Polychronakos  | score=0.3439

📘 Li Shigu  | score=0.0000

📘 Community gardening in Singapore  | score=0.0000

📘 Tom Allin  | score=0.0000

📘 Rarotonga International Airport  | score=0.0000


🔍 Query: Indian classical music

📘 Indian Congress (Socialist)  | score=0.2105

📘 Juan Orrego-Salas  | score=0.0982

📘 Celebrate (Whitney Houston and Jordin Sparks song)  | score=0.0625

📘 L.A.X (musician)  | score=0.0467

📘 Death by Design (album)  | score=0.0415


🔍 Query: machine learning algorithms

📘 Damir Mirvić  | score=0.0808

📘 Meet Wally Sparks  | score=0.0313

📘 Juan Orrego-Salas  | score=0.0227

📘 SpyHunter (security software)  | score=0.0214

📘 Katy Croff Bell  | score=0.0207



In [30]:
!pip install wikipedia beautifulsoup4 lxml
import requests
import time
import wikipedia
from datetime import datetime
from bs4 import BeautifulSoup   # needed to fix parser warning
import re

# ---------------------------------------------
# FIX-1: Clean text (removes citations, spacing)
# ---------------------------------------------
def clean_text(t):
    t = re.sub(r'\[[0-9]+\]', '', t)      # remove numeric citations
    t = re.sub(r'\s+', ' ', t).strip()    # normalize whitespace
    return t

# ---------------------------------------------
# FIX-2: patch wikipedia library to avoid warning
# ---------------------------------------------
def safe_html_parser(html):
    return BeautifulSoup(html, features="lxml")

import wikipedia.wikipedia as wk
wk.BeautifulSoup = safe_html_parser    # monkey-patch

# ---------------------------------------------
# Set language + user agent
# ---------------------------------------------
wikipedia.set_lang("en")

HEADERS = {
    "User-Agent": "Raghvendra-Kumar-Demo/1.0 (contact: raghvendra.kumar@iitp.ac.in)"
}

# ---------------------------------------------
# Fetch random titles
# ---------------------------------------------
def get_random_titles_batch(batch_size=50):
    URL = "https://en.wikipedia.org/w/api.php"
    PARAMS = {
        "action": "query",
        "format": "json",
        "list": "random",
        "rnlimit": batch_size,
        "rnnamespace": 0
    }

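    for _ in range(5):  # up to 5 attempts
        try:
            r = requests.get(URL, params=PARAMS, headers=HEADERS, timeout=10)
            r.raise_for_status()  # treat HTTP errors as failures so they are retried
            data = r.json()
            return [item["title"] for item in data["query"]["random"]]
        except (requests.RequestException, KeyError, ValueError):
            time.sleep(1)  # brief pause before the next attempt

    return []  # all attempts failed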


# ---------------------------------------------
# Fetch random Wikipedia pages
# ---------------------------------------------
def get_random_wiki_pages(n_pages=100):
    pages = []

    print(f"Starting download of {n_pages} Wikipedia pages...\n")

    while len(pages) < n_pages:
        batch_start = time.time()
        titles = get_random_titles_batch(50)

        if not titles:
            print("⚠️ API failed, retrying...")
            time.sleep(1)
            continue

        for title in titles:
            try:
                page = wikipedia.page(title)
                raw_text = page.content
                text = clean_text(raw_text)

                if len(text) > 500:
                    pages.append((title, text))

                    if len(pages) % 50 == 0:
                        print(f"[{datetime.now().strftime('%H:%M:%S')}] "
                              f"Collected {len(pages)} / {n_pages}")

            except Exception:
                # skip pages that fail to load (e.g. disambiguation or missing-page errors)
                continue

            if len(pages) >= n_pages:
                break

        batch_time = time.time() - batch_start
        print(f"   ✔ Batch processed in {batch_time:.2f} sec "
              f"(Total: {len(pages)} pages)")

        time.sleep(0.1)

    print("\n🎉 Completed downloading all pages!")
    return pages


# ----------------------------------------------------------------------
# Download random Wikipedia pages
# ----------------------------------------------------------------------
wiki_docs = get_random_wiki_pages(100)
print(len(wiki_docs))

titles = [t for t, _ in wiki_docs]
texts  = [txt for _, txt in wiki_docs]

# ----------------------------------------------------------------------
# FIX-3: Stricter TF-IDF settings (prune rare/common terms, add bigrams)
# ----------------------------------------------------------------------
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

vectorizer = TfidfVectorizer(
    stop_words="english",
    min_df=2,          # drop terms that occur in fewer than 2 documents
    max_df=0.8,        # drop terms that occur in more than 80% of documents
    ngram_range=(1, 2) # index unigrams and bigrams
)

tfidf_matrix = vectorizer.fit_transform(texts)
print("TF-IDF matrix shape:", tfidf_matrix.shape)

# ----------------------------------------------------------------------
# Search function
# ----------------------------------------------------------------------
def search_wiki(query, top_k=5):
    query = clean_text(query)  # normalize
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, tfidf_matrix).flatten()
    ranked = scores.argsort()[::-1]

    print(f"\n🔍 Query: {query}\n")
    for idx in ranked[:top_k]:
        print(f"📘 {titles[idx]}  | score={scores[idx]:.4f}\n")


# ----------------------------------------------------------------------
# Example searches
# ----------------------------------------------------------------------
search_wiki("quantum mechanics")
search_wiki("Indian classical music")
search_wiki("machine learning algorithms")


Starting download of 100 Wikipedia pages...

   ✔ Batch processed in 47.51 sec (Total: 35 pages)
[04:56:10] Collected 50 / 100
   ✔ Batch processed in 47.76 sec (Total: 65 pages)
[04:57:12] Collected 100 / 100
   ✔ Batch processed in 41.06 sec (Total: 100 pages)

🎉 Completed downloading all pages!
100
TF-IDF matrix shape: (100, 4277)

🔍 Query: quantum mechanics

📘 Sing Again  | score=0.0000

📘 Scott C. Fergus  | score=0.0000

📘 Yang Shengshi  | score=0.0000

📘 Pterophylla racemosa  | score=0.0000

📘 American River 50 Mile Endurance Run  | score=0.0000


🔍 Query: Indian classical music

📘 Sintilimab  | score=0.2146

📘 CNR Music  | score=0.1233

📘 Tamanna (1942 film)  | score=0.0849

📘 Sean Ardoin  | score=0.0767

📘 Cry Along with the Babies  | score=0.0723


🔍 Query: machine learning algorithms

📘 Velachery taluk  | score=0.1910

📘 Information Systems Professional  | score=0.0805

📘 Panaeolus semiovatus var. semiovatus  | score=0.



### **1. How does the TF-IDF-based search function work?**

Your `search(query)` function:

1. Converts the query into a TF-IDF vector using the same vocabulary learned from documents.
2. Computes cosine similarity between the query and each document.
3. Ranks documents from most to least relevant.

This demonstrates a classical vector-space information retrieval model.
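
The three steps above can be sketched on a tiny corpus. This is a minimal illustration, not the notebook's own `search` function: `toy_docs`, `toy_vectorizer`, and `toy_search` are hypothetical names chosen for this example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus (not the notebook's documents)
toy_docs = [
    "the cat chased the mouse",
    "the dog chased the cat",
    "milk is a white liquid",
]

# Learn the vocabulary and IDF weights from the documents
toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

def toy_search(query, top_k=2):
    # Step 1: map the query into the same TF-IDF space as the documents
    query_vec = toy_vectorizer.transform([query])
    # Step 2: cosine similarity between the query and every document row
    scores = cosine_similarity(query_vec, toy_matrix).flatten()
    # Step 3: document indices ordered from highest to lowest score
    ranked = scores.argsort()[::-1][:top_k]
    return [(int(i), float(scores[i])) for i in ranked]

results = toy_search("white milk")
for doc_id, score in results:
    print(doc_id, round(score, 4), toy_docs[doc_id])
```

Only the third document shares any term with the query, so it is ranked first and the remaining documents score exactly zero.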

---

### **2. Why is TF search (`search_tf`) included alongside TF-IDF search?**

You implemented both to show the difference:

* **TF (raw term frequency)** → favors documents that repeat words, even if common.
* **TF-IDF** → gives higher weight to rare, discriminative words.

The comparison helps students understand **why TF-IDF improves retrieval quality**.
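
A small sketch of that contrast, assuming raw-count search is done with `CountVectorizer` plus cosine similarity (the notebook's `search_tf` may differ). The corpus, the query, and the helper `best_match` are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical corpus: "the" occurs in every document, "milk" in only one
contrast_docs = [
    "the the the cat",
    "the milk spilled on floor",
    "the dog barked",
]
query = "the milk"

def best_match(vectorizer):
    # Fit on the corpus, score the query, return the index of the top document
    matrix = vectorizer.fit_transform(contrast_docs)
    scores = cosine_similarity(vectorizer.transform([query]), matrix).flatten()
    return int(scores.argmax())

tf_best = best_match(CountVectorizer())     # raw term counts
tfidf_best = best_match(TfidfVectorizer())  # counts reweighted by IDF

print("TF picks:    ", contrast_docs[tf_best])
print("TF-IDF picks:", contrast_docs[tfidf_best])
```

With raw counts, the document that repeats "the" wins even though it never mentions milk. Once IDF down-weights the ubiquitous "the", the document that actually contains "milk" moves to the top.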

---

### **3. What does `show_weights(doc_id)` demonstrate?**

This function prints the **top TF-IDF weights** for any document.
Students can see which words uniquely represent that document (e.g., “white”, “liquid” for the milk document).
It teaches **interpretability** of TF-IDF vectors.

---

### **4. How is document–document similarity computed in the code?**

You compute a full **cosine similarity matrix** on all TF-IDF vectors:

```
sim_matrix = cosine_similarity(tfidf_matrix)
```

This reveals which documents share the most heavily weighted vocabulary (e.g., the “cat–dog chasing” sentences cluster together).
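
One way to read such a matrix is to look up each document's nearest neighbour. The sketch below uses three made-up sentences; `pair_docs`, `sim`, and `nearest` are illustrative names.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical sentences: the first two contain the same words in a different order
pair_docs = [
    "the cat chased the dog",
    "the dog chased the cat",
    "milk is a white liquid",
]

pair_matrix = TfidfVectorizer().fit_transform(pair_docs)
sim = cosine_similarity(pair_matrix)   # shape (3, 3), symmetric
cat_dog_score = float(sim[0, 1])

np.fill_diagonal(sim, -1.0)            # mask self-similarity before taking the argmax
nearest = sim.argmax(axis=1)           # index of the most similar other document

print("similarity(0, 1) =", round(cat_dog_score, 4))
print("nearest neighbours:", nearest.tolist())
```

Because TF-IDF is a bag-of-words representation, the two chasing sentences get identical vectors and a similarity of 1 even though who chases whom is reversed.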

---

### **5. How does the Wikipedia download module work?**

Your code:

* Uses the **Wikipedia Random API** to fetch random page titles in batches of 50.
* Downloads full page content with retries + progress printing.
* Builds a corpus of 100 articles for large-scale TF-IDF search.
* Creates a high-dimensional TF-IDF matrix (max 20,000 features).

This enables **real search** over real Wikipedia articles, not toy examples.
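
With only 100 random pages, a query can easily contain terms that never made it into the fitted vocabulary. Such a query transforms to an all-zero vector, so every cosine score is 0. The sketch below shows one way to check for this; the corpus and the helper `missing_terms` are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for a small set of random articles
small_corpus = [
    "The festival features folk music and street food.",
    "The airport serves domestic and international flights.",
    "The album was recorded by a jazz musician.",
]

oov_vectorizer = TfidfVectorizer(stop_words="english")
oov_vectorizer.fit(small_corpus)

# The analyzer applies the same lowercasing, tokenization and stop-word removal as fit()
analyze = oov_vectorizer.build_analyzer()

def missing_terms(query):
    # Query tokens that are absent from the learned vocabulary
    return [tok for tok in analyze(query) if tok not in oov_vectorizer.vocabulary_]

print(missing_terms("quantum mechanics"))
print(missing_terms("jazz music theory"))
print(oov_vectorizer.transform(["quantum mechanics"]).nnz)  # stored non-zero entries
```

Settings such as `min_df=2` shrink the vocabulary further, since any term found in only one page is discarded. That is one plausible reason for a result list in which every score is 0.0000.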

---


### **1. Why is TF-IDF still taught when modern search uses embeddings?**

TF-IDF is:

* simple
* interpretable
* fast
* surprisingly strong for domain-specific or small corpora

It shows students **the mathematical foundation of vector search**, which embeddings later extend.

---

### **2. Why is cosine similarity used for search and clustering?**

Cosine focuses on **direction**, not magnitude, which makes it well suited to text:

* two documents are similar if they share important terms
* the score does not depend on document length

With TF-IDF vectors it serves as a proxy for topical closeness in sparse vector spaces.
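
The length independence can be checked with a few lines of plain Python. The vectors below are made up; `cosine` is a hand-written helper, not the scikit-learn function.

```python
import math

def cosine(u, v):
    # cos(u, v) = (u . v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

short_doc = [1.0, 2.0, 0.0]    # hypothetical term weights
long_doc = [10.0, 20.0, 0.0]   # same proportions, ten times the magnitude
other_doc = [0.0, 0.0, 3.0]    # no terms in common with short_doc

same_direction = cosine(short_doc, long_doc)
no_overlap = cosine(short_doc, other_doc)
print(same_direction, no_overlap)
```

Scaling a vector leaves its cosine with any other vector unchanged, while vectors with no shared terms score 0. `TfidfVectorizer` L2-normalizes each row by default (`norm="l2"`), so for its output the cosine reduces to a dot product.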

---

### **3. Why download 100+ Wikipedia pages for the TF-IDF search demo?**

Realistic examples illustrate:

* high-vocabulary retrieval
* noise in real-world documents
* genuine search results (e.g., queries like "quantum mechanics")

This transitions the demo from a **toy example → practical IR system**.

---

### **4. What does the feature-weight inspection teach?**

It highlights that TF-IDF vectors are **explainable**:

* each weight = how important a term is
* you can open the vector and interpret the score

This contrasts with modern embedding models, where features are opaque.

---

### **5. Why is comparing TF vs TF-IDF searches educational?**

Because it clearly demonstrates:

* TF returns documents with repeated words (even if irrelevant)
* TF-IDF favors documents that match the query's rarer, more discriminative terms

Students learn *why* weighting by inverse document frequency is necessary.
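
For reference, the smoothed IDF that scikit-learn applies by default (`smooth_idf=True`) is `ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` the number of documents containing the term. The sketch below evaluates it for made-up document frequencies in a 100-document corpus.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    # idf(t) = ln((1 + n) / (1 + df(t))) + 1
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1.0

n_docs = 100
# Hypothetical document frequencies, chosen only for illustration
for term, df in [("the", 100), ("music", 20), ("quantum", 1)]:
    print(f"{term:8s} df={df:3d}  idf={smoothed_idf(n_docs, df):.3f}")
```

A term that appears in every document keeps a weight of exactly 1, and the weight grows as the term becomes rarer. This is what lets a single rare query term outweigh many repetitions of a common one.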


