## Lab 5: Text Classification

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
import numpy as np

In [3]:
# -------------------------
# 1) Toy dataset: Movie reviews
# -------------------------
docs = [
    "The movie had an amazing plot and great acting",    # Positive
    "I loved the film, it was thrilling and exciting",   # Positive
    "The movie was boring and too long",                 # Negative
    "I did not enjoy the film, it was a waste of time",  # Negative
    "Average movie, some good scenes but overall dull",  # Neutral
    "The film had stunning visuals but weak storyline",  # Neutral
]
labels = ["Positive", "Positive", "Negative", "Negative", "Neutral", "Neutral"]

In [4]:
# -------------------------
# 2) TF-IDF representation
# -------------------------
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)
y = np.array(labels)


In [5]:
# -------------------------
# 3) Models: Train once
# -------------------------
nb = MultinomialNB().fit(X, y)
dt = DecisionTreeClassifier(random_state=0).fit(X, y)
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)

In [7]:
# -------------------------
# 4) Prediction function
# -------------------------
def categorize_text(sample_text):
    q_vec = vectorizer.transform([sample_text])

    print("Sample Text:", sample_text)
    print("\n--- Model Predictions ---")
    print("Naive Bayes:", nb.predict(q_vec)[0])
    print("Decision Tree:", dt.predict(q_vec)[0])
    print("KNN:", knn.predict(q_vec)[0])


In [8]:
sample_text = "The film had excellent acting but a slow plot"
categorize_text(sample_text)

Sample Text: The film had excellent acting but a slow plot

--- Model Predictions ---
Naive Bayes: Positive
Decision Tree: Neutral
KNN: Positive




This lab demonstrates text classification of **movie reviews** using **TF-IDF features** and three machine learning models: **Naive Bayes**, **Decision Tree**, and **K-Nearest Neighbors (KNN)**.

---

### 1. Dataset
We define a small **toy dataset** of movie reviews:

| Review | Label |
|--------|-------|
| The movie had an amazing plot and great acting | Positive |
| I loved the film, it was thrilling and exciting | Positive |
| The movie was boring and too long | Negative |
| I did not enjoy the film, it was a waste of time | Negative |
| Average movie, some good scenes but overall dull | Neutral |
| The film had stunning visuals but weak storyline | Neutral |

Labels: `Positive`, `Negative`, `Neutral`

---

### 2. TF-IDF Representation
We convert the text reviews into **numerical feature vectors** using `TfidfVectorizer`.  
- TF-IDF captures **term importance** in each document relative to the corpus.
- Each review becomes a **vector of weighted word frequencies**.

```python
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

```
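
To make "term importance relative to the corpus" concrete, here is a minimal standard-library sketch that computes TF-IDF weights by hand. It assumes the smoothed formula `idf(t) = ln((1 + n) / (1 + df(t))) + 1` followed by L2 normalization, and a tokenizer that lowercases and keeps tokens of two or more word characters, which is how scikit-learn documents the `TfidfVectorizer` defaults. The names `tokenize`, `tfidf_vectors`, and `toy_corpus` are illustrative and not part of the lab code. In the first review, `boring` should receive a higher weight than `movie`, because `movie` appears in both reviews.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep tokens of 2+ word characters (assumed to mirror
    # TfidfVectorizer's default token pattern; "I" and "a" are dropped).
    return re.findall(r"\b\w\w+\b", text.lower())


def tfidf_vectors(corpus):
    """Return (vocabulary, list of L2-normalized TF-IDF vectors)."""
    tokenized = [tokenize(doc) for doc in corpus]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(corpus)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        vectors.append([w / norm for w in raw])
    return vocab, vectors


# Two reviews taken from the lab's dataset.
toy_corpus = [
    "The movie was boring and too long",
    "The movie had an amazing plot and great acting",
]
vocab, vectors = tfidf_vectors(toy_corpus)

# Print the non-zero weights of the first review.
for term, weight in zip(vocab, vectors[0]):
    if weight > 0:
        print(f"{term:8s} {weight:.3f}")
```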
---
### 3. Machine Learning Models

We train three models once on the TF-IDF features:

- **Naive Bayes** (`MultinomialNB`): assumes features are conditionally independent given the class; fast and a common baseline for text.
- **Decision Tree** (`DecisionTreeClassifier`): splits on feature values hierarchically to reach a class.
- **K-Nearest Neighbors** (`KNeighborsClassifier`): classifies by majority vote among the nearest review vectors in feature space.

```python
nb = MultinomialNB().fit(X, y)
dt = DecisionTreeClassifier(random_state=0).fit(X, y)
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
```
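
The nearest-neighbor vote can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not scikit-learn's implementation: it uses Euclidean distance, hypothetical two-dimensional vectors in place of real TF-IDF features, and an invented helper name `knn_predict`. It also shows a limitation of this lab's setup: with two reviews per class and `k=3`, at least one neighbor always comes from another class, so the winning vote is never better than 2 to 1.

```python
import math
from collections import Counter


def knn_predict(train_vecs, train_labels, query_vec, k=3):
    """Majority vote among the k training vectors closest to query_vec."""
    dists = [
        (math.dist(vec, query_vec), label)
        for vec, label in zip(train_vecs, train_labels)
    ]
    dists.sort(key=lambda pair: pair[0])  # nearest first
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0], dict(votes)


# Hypothetical 2-D stand-ins for the six TF-IDF review vectors.
train_vecs = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (0.2, 0.8), (0.5, 0.5), (0.6, 0.5)]
train_labels = ["Positive", "Positive", "Negative", "Negative", "Neutral", "Neutral"]

prediction, votes = knn_predict(train_vecs, train_labels, (0.85, 0.1), k=3)
print(prediction, votes)
```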

--- 
### 4. Prediction Function

The `categorize_text()` function takes a new review, transforms it into TF-IDF features with the already-fitted vectorizer, and prints the prediction from each of the three models.

```python
def categorize_text(sample_text):
    q_vec = vectorizer.transform([sample_text])

    print("Sample Text:", sample_text)
    print("\n--- Model Predictions ---")
    print("Naive Bayes:", nb.predict(q_vec)[0])
    print("Decision Tree:", dt.predict(q_vec)[0])
    print("KNN:", knn.predict(q_vec)[0])
```

---
### 5. Example

```python
sample_text = "The film had excellent acting but a slow plot"
categorize_text(sample_text)
```

Output (example):

```
Sample Text: The film had excellent acting but a slow plot

--- Model Predictions ---
Naive Bayes: Positive
Decision Tree: Neutral
KNN: Positive
```

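### 6. Notes

- This lab demonstrates basic text classification with multiple models.
- With only six training reviews, the models can disagree, as they do in the example output; a larger dataset makes predictions more reliable.
- TF-IDF gives more weight to terms that are frequent in a review but rare across the corpus, so distinctive words contribute more to the classification decision.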



## Rocchio Algorithm


In [9]:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rocchio(query, docs, relevant_idxs, irrelevant_idxs, alpha=1, beta=0.75, gamma=0.15):
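    """
    Rocchio relevance feedback: move the query vector toward the centroid
    of the relevant documents and away from the centroid of the irrelevant
    ones, then print each document's cosine similarity before and after.
    """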
    # TF-IDF encoding
    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform([query] + docs).toarray()

    q_vec = tfidf[0]
    doc_vecs = tfidf[1:]

    # Centroids of the relevant and irrelevant documents; a zero vector
    # stands in when a feedback set is empty, so that term drops out.
    zero = np.zeros_like(q_vec)
    rel_centroid = doc_vecs[relevant_idxs].mean(axis=0) if len(relevant_idxs) else zero
    irrel_centroid = doc_vecs[irrelevant_idxs].mean(axis=0) if len(irrelevant_idxs) else zero

    # Update query using Rocchio formula
    q_new = alpha * q_vec + beta * rel_centroid - gamma * irrel_centroid

    # Cosine similarities
    initial_sims = cosine_similarity([q_vec], doc_vecs)[0]
    updated_sims = cosine_similarity([q_new], doc_vecs)[0]

    # Print comparison
    for i, doc in enumerate(docs):
        print(f"Doc {i}: {doc}")
        print(f"  Initial Sim: {initial_sims[i]:.4f}")
        print(f"  Updated Sim: {updated_sims[i]:.4f}")
        print()

# -------------------------
# Movie Reviews Dataset
# -------------------------
docs = [
    "I loved the movie, the acting was fantastic",
    "The film was boring and too long",
    "Amazing plot and great visuals",
    "Waste of time, not recommended",
    "Average movie, some good scenes but dull overall",
]
query = "great movie"
relevant = [0, 2]      # Indexes of reviews labeled as relevant
irrelevant = [1, 3]    # Indexes of reviews labeled as irrelevant

print("Query is:", query)
rocchio(query, docs, relevant, irrelevant)


Query is: great movie
Doc 0: I loved the movie, the acting was fantastic
  Initial Sim: 0.1707
  Updated Sim: 0.4101

Doc 1: The film was boring and too long
  Initial Sim: 0.0000
  Updated Sim: 0.0724

Doc 2: Amazing plot and great visuals
  Initial Sim: 0.3006
  Updated Sim: 0.5229

Doc 3: Waste of time, not recommended
  Initial Sim: 0.0000
  Updated Sim: -0.0589

Doc 4: Average movie, some good scenes but dull overall
  Initial Sim: 0.1633
  Updated Sim: 0.1480
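
Doc 3's updated similarity is negative because subtracting the irrelevant centroid leaves negative weights in the updated query for terms that occur only in irrelevant documents. A common practice is to clip such weights to zero. The standard-library sketch below works the update `q_new = alpha * q + beta * mean(relevant) - gamma * mean(irrelevant)` on a hypothetical three-term vocabulary. The helper `rocchio_update` and its `clip` flag are illustrative additions, not part of the cell above.

```python
def rocchio_update(q, relevant, irrelevant,
                   alpha=1.0, beta=0.75, gamma=0.15, clip=False):
    """Return alpha*q + beta*mean(relevant) - gamma*mean(irrelevant)."""
    def centroid(vectors):
        # An empty feedback set contributes a zero vector.
        if not vectors:
            return [0.0] * len(q)
        return [sum(col) / len(vectors) for col in zip(*vectors)]

    rel_c = centroid(relevant)
    irr_c = centroid(irrelevant)
    q_new = [alpha * qi + beta * r - gamma * s
             for qi, r, s in zip(q, rel_c, irr_c)]
    if clip:
        # Optional: drop negative term weights.
        q_new = [max(0.0, w) for w in q_new]
    return q_new


# Hypothetical vocabulary: ["great", "plot", "boring"]
q = [1.0, 0.0, 0.0]
relevant = [[0.6, 0.8, 0.0], [1.0, 0.0, 0.0]]
irrelevant = [[0.0, 0.0, 1.0]]

# Unclipped: roughly [1.6, 0.3, -0.15]; "boring" gets a negative weight.
print(rocchio_update(q, relevant, irrelevant))
# Clipped: the negative "boring" weight becomes 0.0.
print(rocchio_update(q, relevant, irrelevant, clip=True))
```

With clipping, a document made up only of terms from irrelevant documents would score 0 rather than a negative value.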

