# OPTIONAL: Handling Text - TF-IDF and Linear Models

**Module**: ML700 Advanced Topics (Optional)  
**Notebook**: 03 - Handling Text: TF-IDF and Linear Models  
**Status**: OPTIONAL - This notebook covers advanced material beyond the core curriculum.

---

## Learning Objectives

By the end of this notebook, you will be able to:

1. Explain the bag-of-words concept and how text is converted to numerical features
2. Use `CountVectorizer` to create a term-count matrix
3. Use `TfidfVectorizer` to create TF-IDF features
4. Train a Logistic Regression classifier on text data
5. Inspect top features per class to interpret the model
6. Build a text classification pipeline using scikit-learn `Pipeline`

## Prerequisites

- Understanding of Logistic Regression (Module ML300)
- Familiarity with scikit-learn `fit`/`predict`/`transform` API
- Basic understanding of matrices and sparse data

## Table of Contents

1. [Text as Features: Bag of Words](#1.-Text-as-Features:-Bag-of-Words)
2. [CountVectorizer](#2.-CountVectorizer)
3. [TF-IDF](#3.-TF-IDF)
4. [Hands-On: Text Classification](#4.-Hands-On:-Text-Classification)
5. [Inspecting Top Features Per Class](#5.-Inspecting-Top-Features-Per-Class)
6. [Pipeline: TfidfVectorizer + LogisticRegression](#6.-Pipeline:-TfidfVectorizer-+-LogisticRegression)
7. [Tuning Text Features](#7.-Tuning-Text-Features)
8. [Common Mistakes](#8.-Common-Mistakes)
9. [Summary](#9.-Summary)

---

## 1. Text as Features: Bag of Words

ML models need numerical input. The simplest way to convert text to numbers is the
**bag-of-words** (BoW) representation:

1. Build a **vocabulary** of all unique words across all documents
2. Represent each document as a **vector** of word counts (or frequencies)

This ignores word order (hence "bag") but works surprisingly well for many classification tasks.

**Example**:
- Document 1: "the movie was great"  
- Document 2: "the movie was terrible"  
- Vocabulary: [great, movie, terrible, the, was]  
- Doc 1 vector: [1, 1, 0, 1, 1]  
- Doc 2 vector: [0, 1, 1, 1, 1]
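
The two steps above can be sketched in plain Python. This is a minimal illustration that assumes whitespace tokenization and an alphabetically sorted vocabulary; the helper `to_vector` is hypothetical and is not part of scikit-learn.

```python
# Hand-rolled bag-of-words for the two example documents (illustration only).
docs = ["the movie was great", "the movie was terrible"]

# Step 1: vocabulary = sorted set of all unique whitespace-separated words.
vocabulary = sorted({word for doc in docs for word in doc.split()})

# Step 2: each document becomes a vector of per-word counts, in vocabulary order.
def to_vector(doc, vocab):
    words = doc.split()
    return [words.count(term) for term in vocab]

vectors = [to_vector(doc, vocabulary) for doc in docs]

print(vocabulary)
for doc, vec in zip(docs, vectors):
    print(vec, "<-", doc)

# Word order is discarded: these two sentences get identical vectors.
order_vocab = ["bites", "dog", "man"]
same = to_vector("dog bites man", order_vocab) == to_vector("man bites dog", order_vocab)
print("Same vector regardless of order:", same)
```

The last check shows the cost of the "bag" assumption: sentences with opposite meanings can map to the same vector.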

## 2. CountVectorizer

Scikit-learn's `CountVectorizer` implements the bag-of-words approach:
- Tokenizes text (splits into words)
- Builds a vocabulary
- Produces a sparse count matrix

In [None]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, accuracy_score
import matplotlib.pyplot as plt

In [None]:
# Simple CountVectorizer demo
sample_docs = [
    "the movie was great and fun",
    "the movie was terrible and boring",
    "great acting and a fun story",
    "terrible plot and boring dialogue",
]

count_vec = CountVectorizer()
X_counts = count_vec.fit_transform(sample_docs)

print("Vocabulary:", count_vec.get_feature_names_out())
print("\nCount matrix (dense):")
print(X_counts.toarray())
print(f"\nShape: {X_counts.shape} ({X_counts.shape[0]} documents, {X_counts.shape[1]} unique words)")
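
Two details of `CountVectorizer` are easy to miss: the learned `vocabulary_` maps each token to a column index, and tokens that were not seen during `fit` are dropped at `transform` time. The sketch below uses a reduced two-document corpus and a made-up new sentence chosen for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the movie was great and fun", "the movie was terrible and boring"]
vec = CountVectorizer()
X = vec.fit_transform(docs)

# vocabulary_ maps each learned token to its column index.
print(vec.vocabulary_)

# The result is a SciPy sparse matrix: only non-zero counts are stored.
print(type(X).__name__, X.shape, "non-zeros:", X.nnz)

# Tokens never seen during fit ("superb", "with", "plot") are dropped at transform time.
X_new = vec.transform(["a superb movie with a great plot"])
seen = [tok for tok, col in vec.vocabulary_.items() if X_new[0, col] > 0]
print(sorted(seen))
```

Note that the default tokenizer also ignores single-character tokens such as "a".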

## 3. TF-IDF

**TF-IDF** (Term Frequency - Inverse Document Frequency) improves on raw counts by
down-weighting words that appear in many documents (common words like "the") and
up-weighting words that are distinctive to specific documents.

$$\text{tfidf}(t, d) = \text{tf}(t, d) \cdot \log\frac{N}{\text{df}(t)}$$

Where:
- $\text{tf}(t, d)$ = frequency of term $t$ in document $d$
- $N$ = total number of documents
- $\text{df}(t)$ = number of documents containing term $t$

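**Intuition**: A word that appears in every document (like "the") gets an IDF of $\log 1 = 0$ under this formula, so it contributes nothing. A word that appears in only one document gets the highest IDF, $\log N$.

**Note**: `TfidfVectorizer` uses a smoothed variant by default (`smooth_idf=True`): $\text{idf}(t) = \ln\frac{1 + N}{1 + \text{df}(t)} + 1$, followed by L2 normalization of each document vector (`norm='l2'`). Ubiquitous words therefore keep a small non-zero weight instead of being zeroed out.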
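
As a rough check of the formula, the sketch below computes document frequencies by hand and compares two IDF variants: the textbook $\log\frac{N}{\text{df}(t)}$ and the smoothed form that scikit-learn documents as its default. It then rebuilds the TF-IDF matrix manually, assuming default `TfidfVectorizer` settings (raw counts as TF, smoothed IDF, L2 row normalization).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "the movie was great and fun",
    "the movie was terrible and boring",
    "great acting and a fun story",
    "terrible plot and boring dialogue",
]

count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs).toarray()
n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)  # number of documents containing each term

# Textbook IDF: log(N / df). A term present in every document gets 0.
idf_textbook = np.log(n_docs / df)

# scikit-learn default (smooth_idf=True): ln((1 + N) / (1 + df)) + 1.
idf_smooth = np.log((1 + n_docs) / (1 + df)) + 1

tfidf_vec = TfidfVectorizer()
X = tfidf_vec.fit_transform(docs).toarray()

# Rebuild the matrix by hand: tf * idf, then L2-normalize each row.
manual = counts * idf_smooth
manual = manual / np.linalg.norm(manual, axis=1, keepdims=True)

and_col = count_vec.vocabulary_["and"]
print("textbook idf('and'):", idf_textbook[and_col])
print("idf matches sklearn:", np.allclose(idf_smooth, tfidf_vec.idf_))
print("matrix matches sklearn:", np.allclose(manual, X))
```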

In [None]:
# TfidfVectorizer demo on the same sample docs
tfidf_vec = TfidfVectorizer()
X_tfidf = tfidf_vec.fit_transform(sample_docs)

print("TF-IDF matrix (rounded):")
tfidf_df = pd.DataFrame(
    X_tfidf.toarray().round(3),
    columns=tfidf_vec.get_feature_names_out(),
    index=[f'Doc {i+1}' for i in range(len(sample_docs))]
)
print(tfidf_df.to_string())
print()
print("Notice: common words like 'and' have lower TF-IDF scores.")
print("Distinctive words like 'great', 'terrible' have higher scores.")

## 4. Hands-On: Text Classification

Let us build a simple sentiment classifier using synthetic movie reviews.

In [None]:
# Synthetic movie review dataset
positive_reviews = [
    "This movie was absolutely wonderful and heartwarming",
    "Brilliant acting and a fantastic storyline throughout",
    "I loved every minute of this beautiful film",
    "An excellent movie with great performances from the cast",
    "The best movie I have seen this year truly amazing",
    "Superb direction and outstanding cinematography made this a joy",
    "A masterpiece of storytelling with incredible depth and emotion",
    "Funny charming and thoroughly entertaining from start to finish",
    "The performances were stellar and the script was brilliant",
    "A delightful film that exceeded all my expectations",
    "Wonderfully crafted with amazing attention to detail",
    "Exceptional acting and a truly moving story",
    "One of the finest films of the decade absolutely loved it",
    "A beautiful and inspiring movie that everyone should see",
    "Gripping suspenseful and deeply satisfying throughout",
    "The humor was perfect and the characters were lovable",
    "An outstanding achievement in cinema great work",
    "Uplifting and powerful with a wonderful message",
    "The best performance I have ever seen truly remarkable",
    "A fantastic journey with superb writing and acting",
]

negative_reviews = [
    "This movie was absolutely terrible and a waste of time",
    "Awful acting and a boring storyline throughout",
    "I hated every minute of this dreadful film",
    "A horrible movie with bad performances from the cast",
    "The worst movie I have seen this year truly awful",
    "Poor direction and dull cinematography made this painful",
    "A disaster of storytelling with no depth or emotion",
    "Boring predictable and thoroughly disappointing from start to finish",
    "The performances were weak and the script was terrible",
    "A dreadful film that failed all my expectations",
    "Poorly crafted with no attention to detail whatsoever",
    "Terrible acting and a truly depressing waste of talent",
    "One of the worst films of the decade absolutely hated it",
    "A boring and uninspiring movie that nobody should see",
    "Slow confusing and deeply unsatisfying throughout",
    "The humor was forced and the characters were annoying",
    "An embarrassing failure in cinema bad work",
    "Depressing and pointless with a confused message",
    "The worst performance I have ever seen truly forgettable",
    "A terrible journey with awful writing and acting",
]

# Combine into dataset
texts = positive_reviews + negative_reviews
labels = [1] * len(positive_reviews) + [0] * len(negative_reviews)  # 1=positive, 0=negative

X_train_text, X_test_text, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, random_state=42, stratify=labels
)

print(f"Training samples: {len(X_train_text)}, Test samples: {len(X_test_text)}")
print(f"Class distribution (train): {sum(y_train)} positive, {len(y_train) - sum(y_train)} negative")

In [None]:
# Vectorize with TF-IDF and train Logistic Regression
tfidf = TfidfVectorizer()
X_train_tfidf = tfidf.fit_transform(X_train_text)
X_test_tfidf = tfidf.transform(X_test_text)

print(f"TF-IDF matrix shape (train): {X_train_tfidf.shape}")
print(f"TF-IDF matrix shape (test):  {X_test_tfidf.shape}")

# Train classifier
lr = LogisticRegression(max_iter=1000, random_state=42)
lr.fit(X_train_tfidf, y_train)

# Evaluate
y_pred = lr.predict(X_test_tfidf)
print(f"\nAccuracy: {accuracy_score(y_test, y_pred):.2f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['Negative', 'Positive']))

## 5. Inspecting Top Features Per Class

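One advantage of linear models on text is **interpretability**: each vocabulary term gets one coefficient, so we can inspect which words are most associated with each class.
For this binary model, `lr.coef_[0]` holds the weights for class 1 (positive): large positive values push a prediction toward positive, large negative values toward negative.
With only 28 training reviews, treat the resulting ranking as illustrative.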

In [None]:
# Show top features per class
feature_names = tfidf.get_feature_names_out()
coefficients = lr.coef_[0]

# argsort is ascending: the last 10 indices have the largest (most positive) coefficients,
# the first 10 have the smallest (most negative) ones
top_positive_idx = np.argsort(coefficients)[-10:]
top_negative_idx = np.argsort(coefficients)[:10]

print("Top 10 words associated with POSITIVE reviews:")
for idx in reversed(top_positive_idx):
    print(f"  {feature_names[idx]:20s}  coef: {coefficients[idx]:+.4f}")

print("\nTop 10 words associated with NEGATIVE reviews:")
for idx in top_negative_idx:
    print(f"  {feature_names[idx]:20s}  coef: {coefficients[idx]:+.4f}")

In [None]:
# Visualize top features
n_top = 10
top_pos_idx = np.argsort(coefficients)[-n_top:]
top_neg_idx = np.argsort(coefficients)[:n_top]
top_idx = np.concatenate([top_neg_idx, top_pos_idx])

fig, ax = plt.subplots(figsize=(8, 6))
colors = ['#d32f2f' if c < 0 else '#388e3c' for c in coefficients[top_idx]]
ax.barh(range(len(top_idx)), coefficients[top_idx], color=colors)
ax.set_yticks(range(len(top_idx)))
ax.set_yticklabels(feature_names[top_idx])
ax.set_xlabel('Coefficient Value')
ax.set_title('Top Words by Logistic Regression Coefficient')
ax.axvline(x=0, color='black', linewidth=0.5)
plt.tight_layout()
plt.show()

print("Green bars (positive coef) = words associated with positive sentiment")
print("Red bars (negative coef) = words associated with negative sentiment")

## 6. Pipeline: TfidfVectorizer + LogisticRegression

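A `Pipeline` bundles the vectorizer and the classifier into a single estimator. During
cross-validation the vectorizer is then refit on each training fold only, so vocabulary and IDF
statistics from held-out documents never leak into training.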

In [None]:
# Build a Pipeline
text_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LogisticRegression(max_iter=1000, random_state=42))
])

# Cross-validate the pipeline on the full dataset
cv_scores = cross_val_score(text_pipeline, texts, labels, cv=5, scoring='accuracy')

print("Pipeline Cross-Validation (5-fold):")
for i, score in enumerate(cv_scores):
    print(f"  Fold {i+1}: {score:.2f}")
print(f"  Mean Accuracy: {cv_scores.mean():.2f} (+/- {cv_scores.std():.2f})")
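
To make explicit what the pipeline does inside each fold, the sketch below writes the loop out by hand and compares it with `cross_val_score`. The toy corpus is invented for illustration; the same `StratifiedKFold` object is passed to both versions so the splits are identical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

toy_texts = np.array([
    "great fun film", "wonderful great cast", "brilliant fun story", "wonderful brilliant acting",
    "awful dull film", "terrible awful cast", "boring dull story", "terrible boring acting",
])
toy_labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])
folds = StratifiedKFold(n_splits=4)

manual_scores = []
for train_idx, test_idx in folds.split(toy_texts, toy_labels):
    vec = TfidfVectorizer()
    X_tr = vec.fit_transform(toy_texts[train_idx])  # fit on the training fold only
    X_te = vec.transform(toy_texts[test_idx])       # reuse that vocabulary and IDF
    clf = LogisticRegression(max_iter=1000).fit(X_tr, toy_labels[train_idx])
    manual_scores.append(clf.score(X_te, toy_labels[test_idx]))

pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LogisticRegression(max_iter=1000))])
pipe_scores = cross_val_score(pipe, toy_texts, toy_labels, cv=folds, scoring="accuracy")

print("manual:  ", np.round(manual_scores, 2))
print("pipeline:", np.round(pipe_scores, 2))
```

Both rows should agree, which is the point: the pipeline is shorthand for the leak-free manual loop.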

In [None]:
# Fit the pipeline on the training split, then predict on new, hand-written examples
text_pipeline.fit(X_train_text, y_train)

# Predict on new unseen examples
new_reviews = [
    "This was a great movie with wonderful acting",
    "Terrible film I was bored the entire time",
    "The story was okay but nothing special",
]

predictions = text_pipeline.predict(new_reviews)
probabilities = text_pipeline.predict_proba(new_reviews)

print("Predictions on new reviews:")
for review, pred, prob in zip(new_reviews, predictions, probabilities):
    sentiment = "Positive" if pred == 1 else "Negative"
    print(f"  '{review}'")
    print(f"    -> {sentiment} (P(positive)={prob[1]:.2f})")
    print()

## 7. Tuning Text Features

Several parameters in `TfidfVectorizer` can significantly affect performance:

| Parameter | Description | Typical Values |
|-----------|-------------|----------------|
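| `max_features` | Keep only the N most frequent terms in the corpus | 5000, 10000, 50000 |
| `min_df` | Ignore terms appearing in fewer than N documents (integer) or in less than that fraction of documents (float) | 2, 5, 0.01 |
| `max_df` | Ignore terms appearing in more than N documents (integer) or in more than that fraction of documents (float) | 0.9, 0.95 |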
| `ngram_range` | Include n-grams (word combinations) | (1,1), (1,2), (1,3) |
| `sublinear_tf` | Apply sublinear TF scaling (1 + log(tf)) | True/False |

**N-grams**: Instead of single words only, also include contiguous word pairs or triples. For the text "not good enough":  
- `(1, 1)` unigrams only: "not", "good", "enough"  
- `(1, 2)` unigrams + bigrams: adds "not good", "good enough"  
- `(1, 3)` unigrams through trigrams: adds "not good enough"

In [None]:
# Demo: effect of n-grams and max_features
configs = [
    {'ngram_range': (1, 1), 'max_features': None, 'label': 'Unigrams (no limit)'},
    {'ngram_range': (1, 2), 'max_features': None, 'label': 'Uni+Bigrams (no limit)'},
    {'ngram_range': (1, 2), 'max_features': 50, 'label': 'Uni+Bigrams (max_features=50)'},
]

for config in configs:
    pipe = Pipeline([
        ('tfidf', TfidfVectorizer(
            ngram_range=config['ngram_range'],
            max_features=config['max_features']
        )),
        ('clf', LogisticRegression(max_iter=1000, random_state=42))
    ])
    scores = cross_val_score(pipe, texts, labels, cv=3, scoring='accuracy')
    
    # Fit to see vocabulary size
    pipe.fit(texts, labels)
    vocab_size = len(pipe.named_steps['tfidf'].vocabulary_)
    
    print(f"{config['label']:40s} | vocab: {vocab_size:4d} | accuracy: {scores.mean():.2f}")

## 8. Common Mistakes

1. **Not using pipelines for text**: Fitting `TfidfVectorizer` on the full dataset before splitting causes data leakage. Always use `Pipeline` so the vectorizer is fit only on training data.
2. **Too many features without `max_features`**: With large corpora, the vocabulary can be enormous. Use `max_features`, `min_df`, and `max_df` to control vocabulary size.
3. **Ignoring n-grams**: Unigrams alone miss important phrases like "not good" or "very bad". Try `ngram_range=(1, 2)` or `(1, 3)`.
4. **Using TF-IDF on very short texts**: For very short texts (tweets, single words), most term frequencies are 0 or 1, so the TF part adds little and the weighting is driven almost entirely by IDF. TF-IDF may be less effective there.
5. **Forgetting to use the same vectorizer for prediction**: Always use `transform` (not `fit_transform`) on new data, or use a `Pipeline` which handles this automatically.

## 9. Summary

- **Bag of words** converts text to numerical features by counting word occurrences
- **CountVectorizer** creates raw count matrices; **TfidfVectorizer** adds IDF weighting
- TF-IDF formula (textbook form): $\text{tfidf}(t,d) = \text{tf}(t,d) \cdot \log\frac{N}{\text{df}(t)}$; `TfidfVectorizer` defaults to a smoothed IDF and L2-normalized rows
- **Logistic Regression** on TF-IDF features is a strong baseline for text classification
- Inspect **model coefficients** to see which words drive predictions (interpretability)
- Always use **Pipeline** (TfidfVectorizer + classifier) to prevent data leakage
- Tune `ngram_range`, `max_features`, `min_df`, `max_df` for better performance