### Step 1️⃣ — Setup & Import

#### What you’ll do:

- Import libraries.
- Download corpus if needed.
- Inspect the data.

#### Why:
Knowing how the raw data is organized (file IDs, categories) tells you how to process it.

In [None]:
import nltk
import numpy as np
from nltk.corpus import movie_reviews
from collections import defaultdict
import random

# Download corpus if not done yet
nltk.download('movie_reviews')

# Inspect data
print("Total files:", len(movie_reviews.fileids()))
print("Sample file IDs:", movie_reviews.fileids()[:5])
print("Categories:", movie_reviews.categories())


### Step 2️⃣ — Split Data

#### What you’ll do:

- Make training & testing sets.
- Shuffle to avoid order bias.

#### Why:
Evaluating on reviews the model never saw during training gives an honest estimate of how well it generalizes.

In [None]:
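# Get pos & neg file IDs (copied into new lists so in-place shuffling is safe)
pos_files = list(movie_reviews.fileids('pos'))
neg_files = list(movie_reviews.fileids('neg'))

# Shuffle for randomness; the seed value is arbitrary and only makes the split reproducible
random.seed(42)
random.shuffle(pos_files)
random.shuffle(neg_files)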

# Split: 80% train, 20% test
train_pos = pos_files[:800]
test_pos = pos_files[800:]

train_neg = neg_files[:800]
test_neg = neg_files[800:]

print(f"Training samples: {len(train_pos) + len(train_neg)}")
print(f"Testing samples: {len(test_pos) + len(test_neg)}")
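
The hard-coded 800 works because the corpus holds 1,000 reviews per class. As a rough sketch, the same split can be written as a ratio so it adapts to any list length. The helper name `split_ids`, its `train_ratio` and `seed` parameters, and the toy file IDs are illustrative assumptions, not part of the notebook.

```python
import random

def split_ids(ids, train_ratio=0.8, seed=0):
    """Shuffle a copy of ids and cut it into (train, test) lists."""
    shuffled = list(ids)                   # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # private RNG, reproducible per seed
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical file IDs standing in for movie_reviews.fileids('pos')
toy_ids = [f"pos/cv{i:03d}.txt" for i in range(10)]
train_ids, test_ids = split_ids(toy_ids)
print(len(train_ids), len(test_ids))
```

Calling `split_ids` twice with the same seed returns the same split, which makes later accuracy numbers comparable between runs.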


### Step 3️⃣ — Write a Text Preprocessor

#### What you’ll do:

- Take the pre-tokenized words from the corpus.
- Lowercase.
- Remove punctuation.
- Remove stopwords.

#### Why:
Removing punctuation and stopwords drops tokens that carry little sentiment, so the word counts reflect more meaningful terms.

In [None]:
from nltk.corpus import stopwords

nltk.download('stopwords')

# Build the stopword set once instead of on every call
STOP_WORDS = set(stopwords.words('english'))

def process_review(words):
    """Lowercase tokens; keep only alphabetic, non-stopword ones."""
    # movie_reviews.words() is already tokenized, so no tokenizer is needed
    return [w.lower() for w in words if w.isalpha() and w.lower() not in STOP_WORDS]

# Test
sample_words = movie_reviews.words(train_pos[0])
print(process_review(sample_words)[:10])
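
To see what the filter keeps and drops without downloading anything, here is a small sketch on a hand-written token list. `TOY_STOP_WORDS` is a tiny stand-in for NLTK's English stopword list and `clean_tokens` is a hypothetical name; both are assumptions made for this demo only.

```python
# A tiny stand-in for NLTK's English stopword list (assumption for this demo)
TOY_STOP_WORDS = {"the", "was", "a", "and", "it", "is"}

def clean_tokens(words, stop_words=TOY_STOP_WORDS):
    """Same filter as process_review, with the stopword set passed in."""
    return [w.lower() for w in words
            if w.isalpha() and w.lower() not in stop_words]

tokens = ["The", "plot", "was", "thin", ",", "and", "the",
          "ending", "is", "n't", "great", "!"]
print(clean_tokens(tokens))  # ['plot', 'thin', 'ending', 'great']
```

Note that the negation token `n't` is dropped because it is not purely alphabetic, so in this toy sentence only `great` survives from "is n't great". A bag-of-words filter like this one cannot see negation.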


### Step 4️⃣ — Build Word Frequency Dictionary

#### What you’ll do:

- For all training data, count (word, label) pairs.

#### Why:
You’ll use these counts for your features.

In [None]:
freqs = defaultdict(int)

# Process positive reviews
for fileid in train_pos:
    words = process_review(movie_reviews.words(fileid))
    for word in words:
        freqs[(word, 1.0)] += 1

# Process negative reviews
for fileid in train_neg:
    words = process_review(movie_reviews.words(fileid))
    for word in words:
        freqs[(word, 0.0)] += 1

print(list(freqs.items())[:5])
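
A toy run makes the dictionary layout concrete. The two reviews below are made up and already cleaned; the loop is the same counting pattern as above, collapsed into one pass over (words, label) pairs.

```python
from collections import defaultdict

# Two hypothetical, already-cleaned reviews with labels (1.0 = pos, 0.0 = neg)
toy_reviews = [
    (["great", "fun", "great"], 1.0),
    (["boring", "fun"], 0.0),
]

toy_freqs = defaultdict(int)
for words, label in toy_reviews:
    for word in words:
        toy_freqs[(word, label)] += 1  # key is the (word, label) pair

print(dict(toy_freqs))
# {('great', 1.0): 2, ('fun', 1.0): 1, ('boring', 0.0): 1, ('fun', 0.0): 1}
```

The word `fun` appears under both labels, so it gets two separate entries, one per class.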


### Step 5️⃣ — Create extract_features

#### What you’ll do:

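Given a review, compute three features:

- A bias term, always 1.
- The sum, over the review’s words, of each word’s count in positive training reviews.
- The sum, over the review’s words, of each word’s count in negative training reviews.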

#### Why:
Transforms raw text → numeric feature vector [1, pos_count, neg_count].

In [None]:
def extract_features(words, freqs):
    x = np.zeros(3)  # bias + pos + neg
    x[0] = 1  # bias

    for word in words:
        x[1] += freqs.get((word, 1.0), 0)
        x[2] += freqs.get((word, 0.0), 0)
    return x
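
A worked example with made-up counts shows how the three numbers arise. This sketch restates the function with plain Python lists so it runs on its own; `toy_features` and the counts in `toy_freqs` are hypothetical.

```python
# Hypothetical counts, in the same (word, label) -> count layout as freqs
toy_freqs = {
    ("great", 1.0): 2, ("fun", 1.0): 1,
    ("boring", 0.0): 1, ("fun", 0.0): 1,
}

def toy_features(words, freqs):
    """Return [bias, summed positive counts, summed negative counts]."""
    pos = sum(freqs.get((w, 1.0), 0) for w in words)
    neg = sum(freqs.get((w, 0.0), 0) for w in words)
    return [1, pos, neg]

print(toy_features(["fun", "great", "unseen"], toy_freqs))  # [1, 3, 1]
```

Here `fun` contributes 1 to each side, `great` contributes 2 to the positive side, and `unseen` contributes nothing because it never occurred in training.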


### Step 6️⃣ — Create Training Matrix

#### What you’ll do:

- Loop through all training reviews.
- For each, get its [1, pos_count, neg_count] feature vector.

#### Why:
- This is your X.
- Labels are your y.

In [None]:
# All train files + labels
train_files = train_pos + train_neg
train_labels = np.append(np.ones(len(train_pos)), np.zeros(len(train_neg)))

# Matrix X
X_train = np.zeros((len(train_files), 3))
y_train = train_labels.reshape(-1, 1)

for i, fileid in enumerate(train_files):
    words = process_review(movie_reviews.words(fileid))
    X_train[i, :] = extract_features(words, freqs)


### Step 7️⃣ — Write Logistic Regression Functions

#### What you’ll do:

- 1️⃣ Sigmoid
- 2️⃣ Cost (computed inside the training loop)
- 3️⃣ Gradient Descent

#### Why:
Core math for logistic regression.

In [None]:
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

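def gradient_descent(X, y, theta, alpha, num_iters):
    m = X.shape[0]
    eps = 1e-12  # keeps log() away from 0 when the sigmoid saturates

    for i in range(num_iters):
        z = np.dot(X, theta)
        h = sigmoid(z)
        h_safe = np.clip(h, eps, 1 - eps)  # used for the cost only
        # Cross-entropy cost, averaged over the m training examples
        J = -(1/m) * (np.dot(y.T, np.log(h_safe)) + np.dot((1-y).T, np.log(1 - h_safe)))
        gradient = (1/m) * np.dot(X.T, (h - y))
        theta -= alpha * gradient  # note: updates the caller's array in place

        if i % 100 == 0:
            # J has shape (1, 1); .item() extracts the scalar
            print(f"Iter {i}: Cost {J.item()}")

    return theta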
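
The cost printed during training is the average cross-entropy, J(θ) = -(1/m) Σ [y log h + (1 - y) log(1 - h)] with h = sigmoid(Xθ). As a sketch, it can be pulled out into a standalone function and checked on a tiny matrix. `compute_cost`, the `eps` clipping, the toy data, and the step size of 0.1 are assumptions for illustration, not values from the notebook.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def compute_cost(X, y, theta, eps=1e-12):
    """Average cross-entropy of sigmoid(X @ theta) against labels y."""
    h = np.clip(sigmoid(X @ theta), eps, 1 - eps)  # avoid log(0)
    return float(-np.mean(y * np.log(h) + (1 - y) * np.log(1 - h)))

# Toy design matrix: bias column plus two hypothetical count features
X = np.array([[1.0, 3.0, 1.0],
              [1.0, 0.0, 2.0]])
y = np.array([[1.0], [0.0]])

theta0 = np.zeros((3, 1))
print(compute_cost(X, y, theta0))  # every prediction is 0.5, so the cost is ln 2

# One gradient step with a hypothetical alpha of 0.1
grad = X.T @ (sigmoid(X @ theta0) - y) / len(y)
theta1 = theta0 - 0.1 * grad
print(compute_cost(X, y, theta1))  # lower than the starting cost
```

Starting at ln 2 (about 0.693) for an all-zero theta is a handy sanity check: if the first printed cost differs, the features or labels are probably misaligned.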


### Step 8️⃣ — Train!

In [None]:
theta = np.zeros((3, 1))
theta = gradient_descent(X_train, y_train, theta, alpha=1e-7, num_iters=1500)

print("Final theta:", theta)


### Step 9️⃣ — Evaluate Accuracy

In [None]:
# Prepare test data
test_files = test_pos + test_neg
test_labels = np.append(np.ones(len(test_pos)), np.zeros(len(test_neg)))
X_test = np.zeros((len(test_files), 3))

for i, fileid in enumerate(test_files):
    words = process_review(movie_reviews.words(fileid))
    X_test[i, :] = extract_features(words, freqs)

# Predict
z = np.dot(X_test, theta)
preds = sigmoid(z)
pred_labels = preds >= 0.5

accuracy = np.mean(pred_labels.flatten() == test_labels)
print(f"Test accuracy: {accuracy:.4f}")
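
The `.flatten()` call in the accuracy line matters because of NumPy broadcasting. The sketch below uses four made-up predictions to show what happens with and without it; the arrays are hypothetical stand-ins for `preds` and `test_labels`.

```python
import numpy as np

# Column vector, shaped like sigmoid(np.dot(X_test, theta))
preds = np.array([[0.9], [0.2], [0.7], [0.4]])
# Flat vector, shaped like test_labels
labels = np.array([1.0, 0.0, 0.0, 0.0])

pred_labels = preds >= 0.5

# Without flatten, (4, 1) == (4,) broadcasts to a 4x4 grid of comparisons
print((pred_labels == labels).shape)  # (4, 4)

# With flatten, the comparison is element by element
matches = pred_labels.flatten() == labels
print(matches.shape, matches.mean())  # (4,) 0.75
```

Taking the mean of the 4x4 grid would still return a number, just not the accuracy, so this kind of shape mismatch fails silently.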
