---

## **Section 1: Setup and Understanding the Data**

### 🎯 Objective
Import necessary libraries and load the IMDB movie review dataset.

### 📝 Your Tasks

1. **Import libraries:**
   - `torch` and `torch.nn`
   - `numpy`
   - `pandas` (for loading CSV data)
   - `matplotlib.pyplot`
   - `sklearn.model_selection` (train_test_split)
   - `sklearn.metrics` (accuracy_score, classification_report, confusion_matrix)
   - `re` (for text cleaning)

2. **Check device availability** (GPU or CPU)

3. **Load the IMDB movie review dataset**:
   - Load data from `IMDB Dataset.csv`
   - The CSV has two columns: `review` (text) and `sentiment` (positive/negative)
   - Convert sentiment labels to numeric: 0=negative, 1=positive
   - Explore the dataset structure

### 💡 Hints
- Use `torch.cuda.is_available()` to check for GPU
- Use `pandas.read_csv()` to load the CSV file
- Labels: 0 = negative sentiment, 1 = positive sentiment
- The IMDB dataset contains 50,000 real movie reviews from IMDB
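
A minimal sketch of the device check and the label mapping, assuming the two-column layout described above. To stay self-contained it reads a three-row in-memory stand-in (`sample_csv`, a made-up sample, not real IMDB data) instead of `IMDB Dataset.csv`; swap in the real file path when running the notebook.

```python
import io

import pandas as pd
import torch

# Pick the GPU when one is visible to PyTorch, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical stand-in for 'IMDB Dataset.csv' with the same two columns.
sample_csv = io.StringIO(
    "review,sentiment\n"
    '"A wonderful, moving film.",positive\n'
    '"Dull and far too long.",negative\n'
    '"I loved every minute.",positive\n'
)
df = pd.read_csv(sample_csv)  # with the real data: pd.read_csv('IMDB Dataset.csv')

reviews = df["review"].tolist()
labels = df["sentiment"].map({"positive": 1, "negative": 0}).tolist()

print(f"Device: {device}")
print(f"Total reviews: {len(reviews)}")
print(f"Positive: {sum(labels)}, negative: {len(labels) - sum(labels)}")
```

Because the labels are 0/1 integers, `sum(labels)` counts the positive reviews directly.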

### 🤔 Think About It
What makes a review positive or negative? Largely the words it uses. That is why embeddings fit this task: they encode word meaning as vectors the model can work with.

In [None]:
# TODO: Import all necessary libraries
# Import torch, numpy, pandas, matplotlib, sklearn utilities, and re
# Your code here

In [None]:
# TODO: Check device availability
# Your code here

In [None]:
# TODO: Load IMDB movie review dataset
# Load data from 'IMDB Dataset.csv'
# Convert sentiment labels: 'positive' -> 1, 'negative' -> 0
# Your code here

# Example structure:
# df = pd.read_csv('IMDB Dataset.csv')
# reviews = df['review'].tolist()
# labels = df['sentiment'].map({'positive': 1, 'negative': 0}).tolist()

In [None]:
# TODO: Print dataset statistics
# Print total number of reviews, number of positive, number of negative
# Print a few example reviews from the dataset
# Your code here

---

## **Section 2: Text Preprocessing**

### 🎯 Objective
Clean and prepare the text data for our model.

### 📝 Your Tasks

1. **Create a preprocessing function** that:
   - Converts text to lowercase
   - Removes punctuation, digits, and other non-letter characters (keeps letters and spaces)
   - Removes extra whitespace
   - Splits into words (tokens)
   - Returns a list of clean tokens

2. **Process all reviews**:
   - Apply your preprocessing function to each review
   - Store the tokenized reviews in a list

3. **Build vocabulary**:
   - Collect all unique words from all reviews
   - Create `word_to_idx` and `idx_to_word` dictionaries, reserving index 0 for padding so that real words start at index 1 (Section 4 pads with 0)
   - Calculate vocabulary size

### 💡 Hints
- Use `text.lower()` for lowercase
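- Use `re.sub(r'[^a-z\s]', '', text)` to remove punctuation (this also removes digits)
- The reviews may contain HTML line breaks (`<br />`); replacing them with a space first avoids leaving stray `br` tokens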
- Use `text.split()` to tokenize
- Get unique words: `set()` all tokens from all reviews
- Sort vocabulary for consistency: `sorted(vocab_set)`

### 🤔 Think About It
Why preprocess? Because "Movie", "movie", and "movie!" should all be treated as the same word!
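
The claim above can be checked with a small sketch of the preprocessing and vocabulary steps. The helper name `build_vocab` and the `<PAD>` token are assumptions made for illustration; the notebook only requires `word_to_idx` and `idx_to_word`.

```python
import re


def preprocess_text(text):
    """Lowercase, keep only letters and whitespace, then split into tokens."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", "", text)
    return text.split()  # split() with no argument also collapses extra whitespace


def build_vocab(tokenized_reviews):
    """Map each unique word to an index, keeping index 0 free for padding."""
    vocab = sorted({token for tokens in tokenized_reviews for token in tokens})
    word_to_idx = {"<PAD>": 0}  # assumed padding token; cannot clash with a-z tokens
    for i, word in enumerate(vocab, start=1):
        word_to_idx[word] = i
    idx_to_word = {i: w for w, i in word_to_idx.items()}
    return word_to_idx, idx_to_word


sample_reviews = ["Movie", "movie", "movie!", "Great   movie, great cast."]
tokenized = [preprocess_text(r) for r in sample_reviews]
word_to_idx, idx_to_word = build_vocab(tokenized)

print(tokenized[:3])   # [['movie'], ['movie'], ['movie']]
print(word_to_idx)     # {'<PAD>': 0, 'cast': 1, 'great': 2, 'movie': 3}
```

Sorting the vocabulary before assigning indices keeps the mapping identical from run to run, which matters when a saved model is reloaded later.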

In [None]:
# TODO: Create preprocessing function
# Your code here

# def preprocess_text(text):
#     """
#     Clean and tokenize text
#     Args: text (string)
#     Returns: list of tokens
#     """
#     # Your implementation here

In [None]:
# TODO: Process all reviews
# Apply preprocessing to each review and store in a list
# Your code here

In [None]:
# TODO: Build vocabulary
# Create word_to_idx and idx_to_word dictionaries
# Your code here

In [None]:
# TODO: Print vocabulary statistics
# Print vocabulary size, show sample words and their indices
# Your code here

---

## **Section 3: Loading Pre-trained GloVe Embeddings**

### 🎯 Objective
Load pre-trained word embeddings that were trained on billions of words!

### 📝 About GloVe Embeddings

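**GloVe (Global Vectors for Word Representation)** learns word vectors like Word2Vec does, but from global word co-occurrence counts over the whole corpus rather than from local context windows. About the vectors used here: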
- Trained on billions of tokens from Wikipedia + Gigaword corpus
- Contains hundreds of thousands of words
- Each word is represented by 50 numbers (50-dimensional vectors)
- File format: Each line is: `word number1 number2 ... number50`

**Why use pre-trained embeddings?**
- ✅ Trained on massive datasets
- ✅ Captures rich semantic relationships
- ✅ Works well even with limited training data
- ✅ Saves training time

### 📝 Your Tasks

1. **Load GloVe embeddings from file**:
   - File location: `wiki_giga_2024_50_MFT20_vectors_seed_123_alpha_0.75_eta_0.075_combined.txt`
   - Read line by line
   - Each line: first item is word, rest are 50 numbers
   - Store in a dictionary: `{word: numpy_array_of_50_numbers}`

2. **Create embedding matrix for YOUR vocabulary**:
   - For each word in your vocabulary:
     - If word exists in GloVe: use GloVe vector
     - If word NOT in GloVe: use random vector
   - Create a matrix of shape: `(vocab_size, 50)`
   - Row i = embedding for word with index i

3. **Print statistics**:
   - How many GloVe vectors loaded?
   - How many of your words have GloVe embeddings?
   - How many words need random initialization?

### 💡 Hints
```python
# Loading GloVe
glove_embeddings = {}
with open('wiki_giga_2024_50_MFT20_vectors_seed_123_alpha_0.75_eta_0.075_combined.txt', 'r', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        vector = np.array(values[1:], dtype='float32')
        glove_embeddings[word] = vector
```

- Embedding dimension: 50
- Random initialization: `np.random.randn(50) * 0.01`
- Convert to float32 for PyTorch compatibility
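
The matrix-building step can be sketched on toy data as follows. Everything here is a stand-in: `toy_glove_file` is a made-up three-word file in the same line format, the vectors have 4 dimensions instead of 50, and index 0 is assumed to be a padding row that stays at zero. The random fallback uses `rng.normal(0.0, 0.01, ...)`, which plays the same role as `np.random.randn(50) * 0.01`.

```python
import io

import numpy as np

EMBEDDING_DIM = 4  # 50 in the notebook; shortened so the toy vectors fit on a line

# Hypothetical three-word file in the "word v1 v2 ..." format described above.
toy_glove_file = io.StringIO(
    "good 0.1 0.2 0.3 0.4\n"
    "bad -0.1 -0.2 -0.3 -0.4\n"
    "movie 0.5 0.0 0.5 0.0\n"
)

glove_embeddings = {}
for line in toy_glove_file:
    values = line.split()
    glove_embeddings[values[0]] = np.array(values[1:], dtype="float32")

# Toy vocabulary; index 0 is assumed to be reserved for padding.
word_to_idx = {"<PAD>": 0, "bad": 1, "good": 2, "zzyzx": 3}

rng = np.random.default_rng(42)
embedding_matrix = np.zeros((len(word_to_idx), EMBEDDING_DIM), dtype="float32")
found = 0
for word, idx in word_to_idx.items():
    if idx == 0:
        continue  # padding row stays all zeros
    if word in glove_embeddings:
        embedding_matrix[idx] = glove_embeddings[word]
        found += 1
    else:
        # Small random vector for words GloVe does not know.
        embedding_matrix[idx] = rng.normal(0.0, 0.01, EMBEDDING_DIM)

missing = len(word_to_idx) - 1 - found  # exclude the padding entry
print(f"GloVe vectors loaded: {len(glove_embeddings)}")
print(f"Vocabulary words found in GloVe: {found}, randomly initialized: {missing}")
```

Row `i` of `embedding_matrix` lines up with index `i` in `word_to_idx`, which is what lets the embedding layer look vectors up by index.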

### 🤔 Think About It
GloVe knows that "excellent" and "fantastic" are similar because it was trained on billions of words. These pre-trained embeddings give us powerful semantic representations!

In [None]:
# TODO: Load GloVe embeddings from file
# Your code here

In [None]:
# TODO: Create embedding matrix for your vocabulary
# For each word in word_to_idx, get its GloVe embedding or create random vector
# Your code here

In [None]:
# TODO: Print embedding statistics
# How many words have GloVe embeddings vs random initialization?
# Your code here

---

## **Section 4: Preparing Data for Training**

### 🎯 Objective
Convert reviews to fixed-length sequences and split into train/test sets.

### 📝 Your Tasks

1. **Convert reviews to index sequences**:
   - For each tokenized review, convert words to indices using `word_to_idx`
   - Handle different length reviews by:
     - **Padding**: If review is shorter than max_length, add 0s at the end
     - **Truncating**: If review is longer than max_length, cut it off
   - Choose a reasonable `max_length` (e.g., 200 or 300 for IMDB reviews)

2. **Create PyTorch tensors**:
   - Convert padded sequences to torch.LongTensor
   - Convert labels to torch.LongTensor

3. **Split into train and test sets**:
   - Use 80% for training, 20% for testing
   - Use `train_test_split` from sklearn
   - Set `random_state=42` for reproducibility

4. **Print dataset information**:
   - Shape of train/test data
   - Number of train/test samples
   - Show example of padded sequence

### 💡 Hints
```python
# Padding example
def pad_sequence(sequence, max_length):
    if len(sequence) < max_length:
        # Add zeros at the end
        sequence = sequence + [0] * (max_length - len(sequence))
    else:
        # Cut off at max_length
        sequence = sequence[:max_length]
    return sequence
```

- Use 0 as padding index
- `train_test_split(X, y, test_size=0.2, random_state=42)`
- IMDB reviews tend to be longer, so consider max_length of 200-300

### 🤔 Think About It
Why fixed length? Batched tensors need a consistent input shape. Padding lets short reviews fit that shape without losing information, while truncation drops the tail of any review longer than `max_length`.
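
A short sketch of the index-conversion and padding steps on toy data. The helper name `encode_review`, the toy vocabulary, and `MAX_LENGTH = 4` are illustrative assumptions; the notebook suggests 200-300 for real reviews.

```python
def pad_sequence(sequence, max_length, pad_idx=0):
    """Right-pad with pad_idx, or truncate, so the result has exactly max_length items."""
    if len(sequence) < max_length:
        return sequence + [pad_idx] * (max_length - len(sequence))
    return sequence[:max_length]


def encode_review(tokens, word_to_idx, max_length):
    """Turn tokens into a fixed-length list of indices (unknown words map to 0)."""
    indices = [word_to_idx.get(token, 0) for token in tokens]
    return pad_sequence(indices, max_length)


# Toy vocabulary and reviews, for illustration only.
word_to_idx = {"<PAD>": 0, "a": 1, "bad": 2, "film": 3, "good": 4, "really": 5}
tokenized_reviews = [
    ["good", "film"],                          # shorter than MAX_LENGTH: padded
    ["a", "really", "really", "bad", "film"],  # longer than MAX_LENGTH: truncated
]
MAX_LENGTH = 4

sequences = [encode_review(t, word_to_idx, MAX_LENGTH) for t in tokenized_reviews]
print(sequences)  # [[4, 3, 0, 0], [1, 5, 5, 2]]
```

Every row now has the same length, so `torch.LongTensor(sequences)` yields a tensor of shape `(num_reviews, MAX_LENGTH)`. Note that the second review lost its final word, `film`, to truncation.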

In [None]:
# TODO: Define max_length and create padding function
# Your code here

In [None]:
# TODO: Convert all reviews to padded index sequences
# Your code here

In [None]:
# TODO: Convert to PyTorch tensors
# Your code here

In [None]:
# TODO: Split into train and test sets
# Your code here

In [None]:
# TODO: Print dataset information
# Print shapes, sizes, show example
# Your code here

---

## **Section 5: Building the Sentiment Classifier**

### 🎯 Objective
Create a neural network that uses embeddings to classify sentiment.

### 📝 Architecture Overview

Our model will have these layers:

```
Input: Review as indices [batch_size, max_length]
   ↓
Embedding Layer: Lookup word vectors [batch_size, max_length, 50]
   ↓
Average Pooling: Take mean across words [batch_size, 50]
   ↓
Fully Connected Layer 1: [batch_size, 128]
   ↓
ReLU Activation
   ↓
Dropout (0.3)
   ↓
Fully Connected Layer 2: [batch_size, 2]
   ↓
Output: Logits for [negative, positive]
```

### 📝 Your Tasks

Create a class `SentimentClassifier` that inherits from `nn.Module`:

1. **`__init__` method** - Initialize layers:
   - `self.embedding`: Use `nn.Embedding.from_pretrained()`
     - Pass your embedding matrix (convert to torch tensor)
     - Set `freeze=False` to allow fine-tuning
     - Set `padding_idx=0`
   - `self.fc1`: Linear layer (50 → 128)
   - `self.fc2`: Linear layer (128 → 2)
   - `self.dropout`: Dropout(0.3)
   - `self.relu`: ReLU activation

2. **`forward` method**:
   - Input: `x` of shape `(batch_size, max_length)`
   - Steps:
     1. Get embeddings: `embedded = self.embedding(x)`  # (batch, max_length, 50)
     2. Average pool: `pooled = torch.mean(embedded, dim=1)`  # (batch, 50)
     3. First layer: `x = self.relu(self.fc1(pooled))`
     4. Dropout: `x = self.dropout(x)`
     5. Output layer: `out = self.fc2(x)`  # (batch, 2)
   - Return: `out`

3. **Instantiate the model**:
   - Create model and move to device
   - Print model architecture
   - Count parameters

### 💡 Hints
- Convert numpy array to tensor: `torch.FloatTensor(embedding_matrix)`
- `nn.Embedding.from_pretrained(embeddings, freeze=False, padding_idx=0)`
- Average pooling across dimension 1: `torch.mean(x, dim=1)`
- Model to device: `model.to(device)`

### 🤔 Key Concepts

**Why average pooling?**
- We have embeddings for each word: `[word1_vec, word2_vec, ..., wordN_vec]`
- We need ONE vector for the entire review
- Average pooling: Take the mean of all word vectors
- Result: A single 50-dimensional vector representing the review
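
A tiny worked example of the averaging step, using one toy review with three 2-dimensional word vectors (real vectors here have 50 dimensions):

```python
import torch

# One toy review: shape (batch=1, words=3, dim=2).
embedded = torch.tensor([[[1.0, 2.0],
                          [3.0, 4.0],
                          [5.0, 6.0]]])

pooled = torch.mean(embedded, dim=1)  # average over the word axis
print(pooled.shape)  # torch.Size([1, 2])
print(pooled)        # tensor([[3., 4.]])
```

In this plain version, padded positions also count toward the mean, so heavily padded reviews are pulled toward the padding vector. Averaging only over real tokens (a masked mean) is a common refinement, though it is not required for this exercise.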

**Why freeze=False?**
- Starts with GloVe embeddings (good general knowledge)
- Fine-tunes them for sentiment analysis specifically
- Best of both worlds!
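
One possible shape for the class, as a sketch that follows the layer list above. The constructor arguments `hidden_dim`, `num_classes`, and `dropout` are assumptions added for flexibility, and `toy_matrix` is a random stand-in for the GloVe-based embedding matrix.

```python
import numpy as np
import torch
import torch.nn as nn


class SentimentClassifier(nn.Module):
    def __init__(self, embedding_matrix, hidden_dim=128, num_classes=2, dropout=0.3):
        super().__init__()
        weights = torch.as_tensor(embedding_matrix, dtype=torch.float32)
        # freeze=False keeps the pre-trained vectors trainable (fine-tuning).
        self.embedding = nn.Embedding.from_pretrained(weights, freeze=False, padding_idx=0)
        self.fc1 = nn.Linear(weights.shape[1], hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, num_classes)
        self.dropout = nn.Dropout(dropout)
        self.relu = nn.ReLU()

    def forward(self, x):
        embedded = self.embedding(x)          # (batch, max_length, emb_dim)
        pooled = torch.mean(embedded, dim=1)  # (batch, emb_dim)
        hidden = self.dropout(self.relu(self.fc1(pooled)))
        return self.fc2(hidden)               # (batch, num_classes) logits


# Toy stand-in for the real embedding matrix: 10 words, 50 dimensions.
toy_matrix = np.random.randn(10, 50).astype("float32")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = SentimentClassifier(toy_matrix).to(device)

batch = torch.randint(0, 10, (4, 6), device=device)  # 4 reviews, 6 token indices each
logits = model(batch)
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(model)
print(logits.shape, num_params)
```

With `freeze=False` the embedding rows count as trainable parameters, so on the real vocabulary most of the parameter count comes from the embedding table rather than the two linear layers.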

In [None]:
# TODO: Create SentimentClassifier class
# Your code here

In [None]:
# TODO: Instantiate the model and print architecture
# Your code here

---

## **Section 6: Training Setup**

### 🎯 Objective
Set up loss function, optimizer, and evaluation metrics.

### 📝 Your Tasks

1. **Define loss function**:
   - Use `nn.CrossEntropyLoss()`
   - With two output logits (one per class), it handles binary classification and takes integer class labels (0 or 1)

2. **Define optimizer**:
   - Use `torch.optim.Adam(model.parameters(), lr=0.001)`
   - Learning rate: 0.001 is a good starting point

3. **Print training configuration**:
   - Loss function
   - Optimizer
   - Learning rate
   - Number of epochs you plan to train
   - Device being used

### 💡 Hints
- `criterion = nn.CrossEntropyLoss()`
- `optimizer = torch.optim.Adam(...)`
- Recommended epochs: 3-5 (since IMDB dataset is large, training takes longer)
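
To make the loss less of a black box, this sketch compares `nn.CrossEntropyLoss()` with the same quantity computed by hand on two toy reviews. The logits and labels are made-up values for illustration.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

# Two toy reviews: raw logits for [negative, positive] and their true labels.
logits = torch.tensor([[2.0, 0.5],
                       [0.1, 1.5]])
labels = torch.tensor([0, 1])

loss = criterion(logits, labels)

# The same value by hand: mean negative log-probability of the true class.
log_probs = torch.log_softmax(logits, dim=1)
manual = -(log_probs[0, 0] + log_probs[1, 1]) / 2

print(loss.item(), manual.item())
```

The loss takes raw logits, so the model should not apply softmax before it; softmax is only needed later when reporting probabilities.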

In [None]:
# TODO: Define loss function and optimizer
# Your code here

In [None]:
# TODO: Print training configuration
# Your code here

---

## **Section 7: The Training Loop**

### 🎯 Objective
Train your sentiment classifier and track performance.

### 📝 Your Tasks

Create a training function with this structure:

1. **Function signature**: `train_model(model, X_train, y_train, X_test, y_test, criterion, optimizer, num_epochs)`

2. **Training loop**:
   ```
   for each epoch:
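       # Training phase
       model.train()
       for each mini-batch of (X_train, y_train):
           1. Move batch to device
           2. Zero gradients
           3. Forward pass
           4. Calculate loss
           5. Backward pass
           6. Update weights
           7. Accumulate loss and correct predictions
       Compute epoch train loss and accuracy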
       
       # Validation phase
       model.eval()
       with torch.no_grad():
           1. Forward pass on test data
           2. Calculate test loss
           3. Calculate test accuracy
   ```

3. **Track metrics**:
   - Store train loss, train accuracy per epoch
   - Store test loss, test accuracy per epoch
   - Print progress every epoch

4. **Return**:
   - Dictionary with loss and accuracy histories

5. **Train the model**:
   - Call your training function
   - Use 3-5 epochs (IMDB is large, so fewer epochs needed)

### 💡 Hints
```python
# Calculate accuracy
_, predicted = torch.max(outputs, 1)
correct = (predicted == labels).sum().item()
accuracy = correct / labels.size(0)
```

- Use `model.train()` before training
- Use `model.eval()` before testing
- Use `torch.no_grad()` during validation
- Move data: `X.to(device)`, `y.to(device)`

### 🤔 What to Expect
- Training accuracy should increase over epochs
- Test accuracy should also increase
- With the full IMDB dataset, you can expect 85-90% accuracy or higher
- Each epoch will take some time due to dataset size
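
A sketch of one way to structure the training function. It assumes mini-batch updates, since a single full-batch update per epoch would give only 3-5 weight updates in total; the extra `batch_size` and `device` arguments are additions to the signature given above. The toy run uses random data and a stand-in model, so its accuracy is meaningless.

```python
import torch
import torch.nn as nn


def train_model(model, X_train, y_train, X_test, y_test,
                criterion, optimizer, num_epochs, batch_size=64, device="cpu"):
    history = {"train_loss": [], "train_acc": [], "test_loss": [], "test_acc": []}
    X_test, y_test = X_test.to(device), y_test.to(device)
    n_train = X_train.size(0)

    for epoch in range(num_epochs):
        # Training phase: one optimizer step per mini-batch.
        model.train()
        order = torch.randperm(n_train)  # reshuffle every epoch
        total_loss, total_correct = 0.0, 0
        for start in range(0, n_train, batch_size):
            idx = order[start:start + batch_size]
            xb, yb = X_train[idx].to(device), y_train[idx].to(device)

            optimizer.zero_grad()
            outputs = model(xb)
            loss = criterion(outputs, yb)
            loss.backward()
            optimizer.step()

            total_loss += loss.item() * xb.size(0)
            total_correct += (outputs.argmax(dim=1) == yb).sum().item()

        # Validation phase: no gradients, dropout disabled.
        model.eval()
        with torch.no_grad():
            test_outputs = model(X_test)
            test_loss = criterion(test_outputs, y_test).item()
            test_acc = (test_outputs.argmax(dim=1) == y_test).float().mean().item()

        history["train_loss"].append(total_loss / n_train)
        history["train_acc"].append(total_correct / n_train)
        history["test_loss"].append(test_loss)
        history["test_acc"].append(test_acc)
        print(f"Epoch {epoch + 1}/{num_epochs} | "
              f"train loss {history['train_loss'][-1]:.4f}, "
              f"acc {history['train_acc'][-1]:.3f} | "
              f"test loss {test_loss:.4f}, acc {test_acc:.3f}")
    return history


# Toy run: random index sequences and labels with a stand-in model.
torch.manual_seed(0)
toy_model = nn.Sequential(nn.Embedding(20, 8), nn.Flatten(), nn.Linear(8 * 5, 2))
X = torch.randint(0, 20, (100, 5))
y = torch.randint(0, 2, (100,))
history = train_model(toy_model, X[:80], y[:80], X[80:], y[80:],
                      nn.CrossEntropyLoss(),
                      torch.optim.Adam(toy_model.parameters(), lr=0.001),
                      num_epochs=2, batch_size=16)
```

The model is expected to already be on `device`; only the data is moved inside the function. Evaluating the whole test set in one forward pass works for the toy data, but a batched loop may be needed if memory is tight.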

In [None]:
# TODO: Create training function
# Your code here

In [None]:
# TODO: Train the model
# Call your training function with 3-5 epochs
# Your code here

---

## **Section 8: Visualizing Training Progress**

### 🎯 Objective
Plot training curves to understand model performance.

### 📝 Your Tasks

1. **Plot Loss Curves**:
   - Create a figure with training loss and test loss
   - X-axis: Epochs
   - Y-axis: Loss
   - Use different colors for train vs test
   - Add legend, labels, title, grid

2. **Plot Accuracy Curves**:
   - Create a figure with training accuracy and test accuracy
   - X-axis: Epochs
   - Y-axis: Accuracy
   - Use different colors for train vs test
   - Add legend, labels, title, grid

3. **Print final metrics**:
   - Final training accuracy
   - Final test accuracy
   - Best test accuracy achieved

### 💡 Hints
```python
plt.figure(figsize=(12, 4))

# Loss plot
plt.subplot(1, 2, 1)
plt.plot(epochs, train_losses, label='Train Loss', marker='o')
plt.plot(epochs, test_losses, label='Test Loss', marker='s')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Loss Curves')
plt.legend()
plt.grid(True, alpha=0.3)

# Accuracy plot
plt.subplot(1, 2, 2)
# ... similar for accuracy

plt.tight_layout()
plt.show()
```

### 🤔 What to Look For
- **Loss should decrease** over time
- **Accuracy should increase** over time
- **Gap between train and test**: Small gap = good, large gap = overfitting
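
The final-metrics task and the train/test gap can be sketched in a few lines. The `history` values below are hypothetical numbers chosen for illustration, not results from this model, and the key names are assumptions.

```python
# Hypothetical accuracy histories, for illustration only (not real results).
history = {
    "train_acc": [0.70, 0.80, 0.88],
    "test_acc": [0.69, 0.82, 0.81],
}

final_train = history["train_acc"][-1]
final_test = history["test_acc"][-1]
best_test = max(history["test_acc"])
best_epoch = history["test_acc"].index(best_test) + 1  # epochs are 1-based
gap = final_train - final_test

print(f"Final training accuracy: {final_train:.2%}")
print(f"Final test accuracy:     {final_test:.2%}")
print(f"Best test accuracy:      {best_test:.2%} (epoch {best_epoch})")
print(f"Train/test gap:          {gap:.2%}")
```

In this made-up run the best test accuracy comes at epoch 2 while training accuracy keeps climbing, which is the pattern to watch for as a sign of overfitting.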

In [None]:
# TODO: Plot training curves
# Your code here

In [None]:
# TODO: Print final metrics
# Your code here

---

## **Section 9: Detailed Evaluation**

### 🎯 Objective
Evaluate model performance with detailed metrics.

### 📝 Your Tasks

1. **Get predictions on test set**:
   - Set model to eval mode
   - Get predictions for all test samples
   - Convert predictions and labels to numpy arrays

2. **Print classification report**:
   - Use `classification_report` from sklearn
   - Shows precision, recall, F1-score for each class
   - Target names: ['Negative', 'Positive']

3. **Create confusion matrix**:
   - Use `confusion_matrix` from sklearn
   - Visualize as a heatmap
   - Show how many predictions were correct/incorrect

4. **Analyze errors**:
   - Find misclassified reviews
   - Print a few examples
   - Try to understand why the model made mistakes

### 💡 Hints
```python
# Get predictions
model.eval()
with torch.no_grad():
    outputs = model(X_test.to(device))
    _, predicted = torch.max(outputs, 1)
    y_pred = predicted.cpu().numpy()
    y_true = y_test.cpu().numpy()

# Classification report
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(y_true, y_pred, target_names=['Negative', 'Positive']))

# Confusion matrix
cm = confusion_matrix(y_true, y_pred)
```

### 🤔 Understanding Metrics
- **Precision**: Of all predicted positive, how many were actually positive?
- **Recall**: Of all actual positives, how many did we find?
- **F1-Score**: Harmonic mean of precision and recall
- **Confusion Matrix**: Shows where the model confuses classes
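
These definitions can be worked through by hand on ten toy predictions. The labels below are made up for illustration; in the notebook, `classification_report` and `confusion_matrix` compute the same quantities.

```python
# Toy labels (0 = negative, 1 = positive), for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # positives found
fp = sum(t == 0 and p == 1 for t, p in pairs)  # negatives flagged as positive
fn = sum(t == 1 and p == 0 for t, p in pairs)  # positives missed
tn = sum(t == 0 and p == 0 for t, p in pairs)  # negatives correctly rejected

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Rows are true classes, columns are predicted classes (sklearn's layout).
cm = [[tn, fp],
      [fn, tp]]

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(cm)  # [[5, 1], [1, 3]]
```

These numbers describe the positive class only; the classification report repeats the calculation with the negative class treated as the target.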

In [None]:
# TODO: Get predictions on test set
# Your code here

In [None]:
# TODO: Print classification report
# Your code here

In [None]:
# TODO: Create and visualize confusion matrix
# Your code here

In [None]:
# TODO: Analyze errors - find and print misclassified reviews
# Your code here

---

## **Section 10: Testing on New Reviews**

### 🎯 Objective
Create a function to predict sentiment for any new movie review.

### 📝 Your Tasks

1. **Create prediction function**:
   - Function signature: `predict_sentiment(review_text, model, word_to_idx, max_length)`
   - Steps:
     1. Preprocess the review text
     2. Convert to indices
     3. Pad/truncate to max_length
     4. Convert to tensor
     5. Get model prediction
     6. Return predicted class and probability

2. **Test on new reviews**:
   - Create 5-10 new movie reviews (not in training data)
   - Mix of positive and negative
   - Use your prediction function
   - Print review, prediction, and confidence

3. **Interactive testing** (optional):
   - Allow user to input their own review
   - Predict and display sentiment

### 💡 Hints
```python
def predict_sentiment(review_text, model, word_to_idx, max_length):
    model.eval()
    
    # Preprocess
    tokens = preprocess_text(review_text)
    
    # Convert to indices
    indices = [word_to_idx.get(token, 0) for token in tokens]  # 0 for unknown words
    
    # Pad
    indices = pad_sequence(indices, max_length)
    
    # To tensor
    x = torch.LongTensor([indices]).to(device)
    
    # Predict
    with torch.no_grad():
        output = model(x)
        probs = torch.softmax(output, dim=1)
        predicted = torch.argmax(probs, dim=1).item()
        confidence = probs[0][predicted].item()
    
    sentiment = "Positive 😊" if predicted == 1 else "Negative 😞"
    return sentiment, confidence
```

### 🤔 Think About It
Does your model work well on new reviews? Can it handle different writing styles? What kinds of reviews does it struggle with?

In [None]:
# TODO: Create prediction function
# Your code here

In [None]:
# TODO: Test on new reviews
# Create list of new reviews and test them
# Your code here

In [None]:
# TODO: Interactive testing (optional)
# Allow user to input their own review
# Your code here

In [None]:
# TODO: Find similar words for sentiment keywords
# Your code here

In [None]:
# TODO: Visualize embeddings in 2D
# Your code here

---

## **Section 12: Saving Your Model**

### 🎯 Objective
Save the trained model for future use.

### 📝 Your Tasks

1. **Create checkpoint dictionary**:
   - Model state dict
   - Optimizer state dict
   - Vocabulary mappings
   - Max length
   - Training history
   - Any other important information

2. **Save using torch.save()**:
   - Filename: 'sentiment_classifier.pth'

3. **Create a function to load the model**:
   - Load checkpoint
   - Recreate model
   - Load weights
   - Return ready-to-use model

### 💡 Hints
```python
# Save
checkpoint = {
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'word_to_idx': word_to_idx,
    'idx_to_word': idx_to_word,
    'max_length': max_length,
    'vocab_size': vocab_size,
    'embedding_dim': 50,
    'history': history
}
torch.save(checkpoint, 'sentiment_classifier.pth')

# Load
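checkpoint = torch.load('sentiment_classifier.pth', map_location=device)
# Rebuild the model with an embedding matrix of the same shape as at save time
model = SentimentClassifier(embedding_matrix).to(device)
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()  # disable dropout for inference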
```
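
For the loading task, one option is to wrap the steps in a function. This is a sketch under assumptions: `load_checkpoint` and its `build_model` callback are hypothetical names, and the round trip below uses a small `nn.Linear` and a temporary file as stand-ins for the real classifier and `sentiment_classifier.pth`.

```python
import os
import tempfile

import torch
import torch.nn as nn


def load_checkpoint(path, build_model, device="cpu"):
    """Rebuild a model via build_model(checkpoint) and restore its saved weights."""
    checkpoint = torch.load(path, map_location=device)
    model = build_model(checkpoint).to(device)
    model.load_state_dict(checkpoint["model_state_dict"])
    model.eval()  # ready for inference: dropout off
    return model, checkpoint


# Toy round trip with a stand-in model and a temporary file.
original = nn.Linear(50, 2)
path = os.path.join(tempfile.mkdtemp(), "toy_checkpoint.pth")
torch.save({"model_state_dict": original.state_dict(), "max_length": 200}, path)

restored, checkpoint = load_checkpoint(path, lambda ckpt: nn.Linear(50, 2))
print(torch.equal(original.weight, restored.weight), checkpoint["max_length"])
```

Passing a `build_model` callback keeps the function independent of how the model is constructed. For the real classifier, the callback could create a zero matrix of shape `(vocab_size, embedding_dim)` from the checkpoint, since `load_state_dict` overwrites the embedding weights anyway.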

In [None]:
# TODO: Save the model
# Your code here

In [None]:
# TODO: Create function to load the model
# Your code here