# Lab 3: Natural Language Processing

**Duration:** 45-60 minutes | **Difficulty:** Intermediate

---

## Overview

This lab covers NLP fundamentals through hands-on implementation: cleaning raw text, converting it to numeric features, and training a sentiment classifier.

### Lab Structure

| Section | Topic | Key Concepts |
|---------|-------|---------------|
| **6.1** | Text Preprocessing | Cleaning, tokenization, normalization |
| **6.2** | Bag of Words | CountVectorizer, document-term matrix |
| **6.3** | TF-IDF | Term frequency, inverse document frequency |
| **6.4** | Sentiment Analysis | Classification dataset |
| **6.5** | Naive Bayes | Text classification |
| **6.6** | Prediction | Classifying new text |

### Instructions

- Read each markdown cell carefully
- Write your code in the empty code cells
- Run cells with `Shift+Enter`

## Setup

Run the cell below to import the required libraries.

In [None]:
import numpy as np
import pandas as pd
import re
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

print("Setup complete!")

---
# Part 6: Natural Language Processing

Learn to process text data for machine learning: tokenization, vectorization, and text classification.

## 6.1 Text Preprocessing

Before feeding text to ML models, we must clean and normalize it.

Common preprocessing steps:

| Step | Description | Example |
|------|-------------|---------|
| Lowercase | Convert to lowercase | "Hello World" → "hello world" |
| Remove punctuation | Strip special characters | "Hello!" → "Hello" |
| Tokenization | Split into words | "hello world" → ["hello", "world"] |
| Remove stopwords | Drop very common words ("the", "is", "a") | ["the", "cat", "is", "here"] → ["cat", "here"] |
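
The last two rows of the table can be sketched in a few lines. This is a minimal illustration, not a production tokenizer: the `STOPWORDS` set and the helper names `tokenize` and `remove_stopwords` are made up for this example, and real projects usually rely on a much longer stopword list.

```python
import re

# Tiny illustrative stopword list (an assumption for this sketch)
STOPWORDS = {"the", "is", "a", "an", "and", "of"}

def tokenize(text):
    """Lowercase, strip punctuation, and split on whitespace."""
    cleaned = re.sub(r"[^a-z0-9\s]", "", text.lower())
    return cleaned.split()

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only the tokens that are not in the stopword set."""
    return [t for t in tokens if t not in stopwords]

tokens = tokenize("The cat is on a mat!")
print(tokens)
print(remove_stopwords(tokens))
```

Splitting on whitespace is the simplest possible tokenizer; it is enough for this lab because punctuation has already been removed.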

**Your Task:** Write a function `preprocess_text(text)` that:
1. Converts text to lowercase
2. Removes all non-alphanumeric characters (keep spaces)
3. Returns the cleaned text

Test it on: `"Hello, World! This is NLP 101."`

**Expected Output:**
```
Original: Hello, World! This is NLP 101.
Cleaned: hello world this is nlp 101
```

**Sample Code:**
```python
import re

def clean_example(text):
    text = text.lower()
    text = re.sub(r'[^a-z0-9\s]', '', text)  # Keep only alphanumeric and spaces
    return text

result = clean_example("Test! 123")
print(result)  # "test 123"
```

In [None]:
# Your code here


## 6.2 Bag of Words with CountVectorizer

Bag of Words converts text to numerical vectors by counting word occurrences.

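```python
from sklearn.feature_extraction.text import CountVectorizer

# Placeholder input: any list of strings works here
documents = ["first example text", "second example text"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)  # Returns a sparse matrix
print(vectorizer.get_feature_names_out())  # Vocabulary, sorted alphabetically
print(X.toarray())  # Dense document-term matrix
```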
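
To see what the counting amounts to, the same idea can be sketched by hand with the standard library. This is only an illustration on two made-up sentences; it splits on whitespace, whereas `CountVectorizer` uses its own tokenizer, so results can differ on real text.

```python
from collections import Counter

docs = ["the cat sat", "the cat saw the dog"]

# Tokenize each document and build a sorted vocabulary of unique words
tokenized = [d.lower().split() for d in docs]
vocab = sorted({word for doc in tokenized for word in doc})

# One row per document, one column per vocabulary word
matrix = []
for doc in tokenized:
    counts = Counter(doc)
    matrix.append([counts[term] for term in vocab])

print(vocab)
for row in matrix:
    print(row)
```

Each row has the same length as the vocabulary, and word order within a document is lost, which is where the name "bag" of words comes from.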

**Your Task:**
1. Create a list of 3 documents:
   - "I love machine learning"
   - "Machine learning is great"
   - "I love programming"
2. Create a `CountVectorizer` and fit_transform the documents
3. Print the vocabulary (feature names)
4. Print the document-term matrix as an array

**Expected Output:**
```
Vocabulary: ['great' 'is' 'learning' 'love' 'machine' 'programming']

Document-Term Matrix:
[[0 0 1 1 1 0]
 [1 1 1 0 1 0]
 [0 0 0 1 0 1]]
```

In [None]:
# Your code here


## 6.3 TF-IDF Vectorization

TF-IDF (Term Frequency - Inverse Document Frequency) weights words by importance:
- **TF**: How often a word appears in a document
- **IDF**: How rare a word is across all documents
- Words common in one doc but rare overall get high scores

```python
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(documents)
print(X.toarray())  # TF-IDF weighted matrix
```
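
The weighting can also be worked through by hand. The sketch below assumes the smoothed formula that scikit-learn documents for its default settings, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalization of each row. The whitespace tokenizer that drops one-character tokens is a simplification made for this example.

```python
import math
from collections import Counter

docs = [
    "i love machine learning",
    "machine learning is great",
    "i love programming",
]

# Simplified tokenizer (assumption): split on spaces, drop 1-character tokens
tokenized = [[w for w in d.split() if len(w) > 1] for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
df = {t: sum(t in doc for doc in tokenized) for t in vocab}

# Smoothed inverse document frequency
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(tokens):
    """Raw count times IDF, then scale the row to unit L2 length."""
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm if norm else 0.0 for v in raw]

matrix = [tfidf_row(doc) for doc in tokenized]
for row in matrix:
    print([round(v, 2) for v in row])
```

Terms that appear in only one document ("great", "is", "programming") get a larger IDF than terms shared by two documents, so they dominate their rows after normalization.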

**Your Task:**
1. Use the same 3 documents from 6.2
2. Create a `TfidfVectorizer` and fit_transform the documents
3. Print the TF-IDF matrix (rounded to 2 decimals)
4. Observe: Which words have higher weights? Why?

**Expected Output:**
```
TF-IDF Matrix:
[[0.   0.   0.58 0.58 0.58 0.  ]
 [0.56 0.56 0.43 0.   0.43 0.  ]
 [0.   0.   0.   0.61 0.   0.8 ]]

Note: 'programming' has high weight in doc 3 (unique to it)
      'love' has lower weight (appears in docs 1 and 3)
```

In [None]:
# Your code here


## 6.4 Sentiment Analysis Dataset

Run the cell below to create a simple sentiment analysis dataset.

In [None]:
# Sentiment Analysis Dataset
reviews = [
    "This movie was amazing and wonderful",
    "I loved this film, it was great",
    "Excellent movie, highly recommended",
    "Best film I have ever seen",
    "Wonderful story and great acting",
    "This movie was terrible and boring",
    "I hated this film, it was awful",
    "Worst movie ever, do not watch",
    "Boring and disappointing film",
    "Terrible acting and bad story",
    "The movie was okay, nothing special",
    "It was an average film",
    "Not bad but not great either",
    "Mediocre movie with some good moments",
    "Fantastic cinematography and brilliant performances",
    "Absolutely dreadful, waste of time",
]

# Labels: 1 = positive (mixed/neutral reviews are also labeled 1 here), 0 = negative
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

# Split into train and test
X_train_text, X_test_text, y_train_nlp, y_test_nlp = train_test_split(
    reviews, labels, test_size=0.25, random_state=42
)

print(f"Training samples: {len(X_train_text)}")
print(f"Test samples: {len(X_test_text)}")
print("\nSample reviews:")
for i in range(3):
    sentiment = "Positive" if y_train_nlp[i] == 1 else "Negative"
    print(f"  [{sentiment}] {X_train_text[i]}")

## 6.5 Text Classification with Naive Bayes

Naive Bayes is a probabilistic classifier that works well with text data.

```python
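from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Vectorize text: fit on the training text only, then reuse for the test text
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train_text)
X_test_vec = vectorizer.transform(X_test_text)

# Train classifier on the training labels from 6.4
clf = MultinomialNB()
clf.fit(X_train_vec, y_train_nlp)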

# Predict
predictions = clf.predict(X_test_vec)
```
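
For intuition about what the classifier computes, here is a minimal from-scratch sketch of multinomial Naive Bayes with add-one (Laplace) smoothing on raw word counts. The four training sentences and the helper names `log_posterior` and `predict` are invented for this illustration; it is not how scikit-learn implements `MultinomialNB` internally.

```python
import math
from collections import Counter, defaultdict

# Tiny made-up training set: (text, label) with 1 = positive, 0 = negative
train = [
    ("great wonderful film", 1),
    ("wonderful acting", 1),
    ("terrible boring film", 0),
    ("boring story", 0),
]

class_docs = Counter()              # number of documents per class
word_counts = defaultdict(Counter)  # word counts per class
for text, label in train:
    class_docs[label] += 1
    word_counts[label].update(text.split())

vocab = {w for counts in word_counts.values() for w in counts}

def log_posterior(text, label, alpha=1.0):
    """log P(label) + sum of log P(word | label), with additive smoothing."""
    score = math.log(class_docs[label] / sum(class_docs.values()))
    total = sum(word_counts[label].values())
    for word in text.split():
        if word not in vocab:
            continue  # words never seen in training carry no evidence
        score += math.log(
            (word_counts[label][word] + alpha) / (total + alpha * len(vocab))
        )
    return score

def predict(text):
    """Return the class with the highest log posterior."""
    return max(class_docs, key=lambda label: log_posterior(text, label))

print(predict("wonderful film"))
print(predict("boring film"))
```

The "naive" part is the assumption that words are independent given the class, which is why the per-word log probabilities are simply added. Smoothing keeps a word that never occurred in one class from driving that class's probability to zero.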

**Your Task:**
1. Create a `TfidfVectorizer` and vectorize the training and test text
2. Create a `MultinomialNB` classifier and train it
3. Make predictions on the test set
4. Calculate and print the accuracy
5. Print the classification report

**Expected Output:**
```
Accuracy: ~0.75 (75%)

Classification Report:
              precision    recall  f1-score   support
           0       X.XX      X.XX      X.XX         X
           1       X.XX      X.XX      X.XX         X
```
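
The per-class columns of the report can be reproduced by hand. The sketch below uses made-up labels rather than the lab's data, and the helper name `precision_recall_f1` is hypothetical.

```python
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]

def precision_recall_f1(y_true, y_pred, positive):
    """Compute precision, recall, and F1 for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1

# Accuracy: fraction of predictions that match the true label
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"Accuracy: {accuracy:.2f}")

for cls in (0, 1):
    p, r, f = precision_recall_f1(y_true, y_pred, cls)
    print(f"class {cls}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

With only a handful of test samples, as in this lab, a single misclassified review moves these numbers a lot, so treat them as rough indicators.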

**Sample Code:**
```python
# Complete pipeline example
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', MultinomialNB())
])
pipeline.fit(X_train_text, y_train_nlp)
accuracy = pipeline.score(X_test_text, y_test_nlp)
print(f"Accuracy: {accuracy:.2f}")
```

In [None]:
# Your code here


## 6.6 Predict on New Text

Use your trained model to classify new reviews.

**Your Task:**
1. Create a list of 3 new reviews (make up your own!)
2. Vectorize them using the same vectorizer (use `.transform()`, not `.fit_transform()`)
3. Use your classifier to predict the sentiment
4. Print each review with its predicted sentiment

**Expected Output:**
```
Review: "This was the best experience ever!"
Predicted: Positive

Review: "Horrible waste of my time"
Predicted: Negative

Review: "It was pretty good overall"
Predicted: Positive
```

**Sample Code:**
```python
new_reviews = ["Your review here", "Another review"]
new_vectors = vectorizer.transform(new_reviews)
predictions = clf.predict(new_vectors)

for review, pred in zip(new_reviews, predictions):
    sentiment = "Positive" if pred == 1 else "Negative"
    print(f"Review: {review}")
    print(f"Predicted: {sentiment}\n")
```

In [None]:
# Your code here


---
# Lab Complete!

## Summary

You learned:
- **Text Preprocessing**: Lowercasing, removing punctuation, tokenization
- **Bag of Words**: `CountVectorizer` and the document-term matrix
- **TF-IDF**: Weighting terms by frequency and rarity
- **Text Classification**: Training `MultinomialNB` and predicting on new text
- **Evaluation**: Accuracy, precision, recall, F1-score