# Movie Review Classification (IMDB) — `num_words=200`
This notebook is a revised version of `3.5-classifying-movie-reviews.ipynb` that:
- limits the vocabulary to the **200 most frequent words** (`num_words=200`),
- trains a simple dense network,
- plots training & validation accuracy vs epochs,
- selects the best epoch by validation accuracy, retrains on train+val, and reports **test accuracy**.

In [None]:
# Setup
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from tensorflow import keras
from tensorflow.keras import layers


In [None]:
# 1) Load IMDB with vocabulary capped at 200 most frequent words
NUM_WORDS = 200

# The IMDB dataset comes tokenized as integer sequences of word indices (by frequency).
(x_train, y_train), (x_test, y_test) = keras.datasets.imdb.load_data(num_words=NUM_WORDS)

# Optional: peek at the first review's length (in tokens) and its label (0 = negative, 1 = positive)
len(x_train[0]), y_train[0]
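
### What does the `num_words` cap do to a sequence?
To make the effect of the cap concrete, the sketch below mimics on a toy sequence what the loader is understood to do with its default settings: any index at or above `num_words` is replaced by the out-of-vocabulary index `2`, so the sequence keeps its length. The helper `cap_vocabulary` and the toy indices are hypothetical; they are not part of Keras and are not taken from the dataset.

```python
def cap_vocabulary(seq, num_words, oov_char=2):
    """Replace every index >= num_words with the out-of-vocabulary index.

    Assumption: this mirrors the loader's default behaviour (oov_char=2).
    """
    return [w if w < num_words else oov_char for w in seq]

# Made-up indices for illustration only
toy_review = [1, 14, 22, 1622, 65, 458, 4468, 66]
capped = cap_vocabulary(toy_review, num_words=200)
print(capped)  # [1, 14, 22, 2, 65, 2, 2, 66]
```

With a cap as low as 200, a large share of tokens in a real review is likely to collapse into the single out-of-vocabulary index, which is one reason to expect lower accuracy than with a larger vocabulary.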


In [None]:
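# 2) Vectorize sequences via multi-hot encoding into dimension NUM_WORDS
def vectorize(seqs, dimension=NUM_WORDS):
    """Return a (len(seqs), dimension) float32 matrix where X[i, j] = 1.0 if index j occurs in seqs[i]."""
    X = np.zeros((len(seqs), dimension), dtype="float32")
    for i, seq in enumerate(seqs):
        for idx in seq:
            # Guard against any index outside the capped vocabulary
            if 0 <= idx < dimension:
                X[i, idx] = 1.0  # presence only; repeated words are not counted
    return X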

X_train = vectorize(x_train, NUM_WORDS)
X_test  = vectorize(x_test,  NUM_WORDS)

X_train.shape, X_test.shape
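
### Multi-hot encoding on a toy example
The encoding above can be checked by hand on two tiny sequences. The sketch below is a variant of `vectorize` that uses NumPy index arrays instead of the inner Python loop; the name `multi_hot` and the toy sequences are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np

def multi_hot(seqs, dimension):
    """Multi-hot encode integer sequences; repeated indices collapse to a single 1."""
    X = np.zeros((len(seqs), dimension), dtype="float32")
    for i, seq in enumerate(seqs):
        idx = np.asarray(seq, dtype=np.int64)
        idx = idx[(idx >= 0) & (idx < dimension)]  # ignore out-of-range indices
        X[i, idx] = 1.0
    return X

# Made-up sequences: index 4 repeats, and 12 is out of range for dimension=10
toy = [[1, 4, 4, 7], [1, 2, 9, 12]]
encoded = multi_hot(toy, dimension=10)
print(encoded[0])           # [0. 1. 0. 0. 1. 0. 0. 1. 0. 0.]
print(encoded.sum(axis=1))  # [3. 3.]
```

Word order and word counts are discarded by this representation. Assuming the loader's defaults, every review begins with the start marker `1`, so that column is set in every row and carries no information for the classifier.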


In [None]:
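# 3) Train/validation split: hold out 10,000 reviews for validation (we'll later retrain on train+val using the best epoch)
# shuffle=False keeps the original order, so the last 10,000 reviews form the validation set
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=10000, shuffle=False)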

# 4) Build the model (simple baseline; small capacity due to tiny vocab)
def build_model():
    model = keras.Sequential([
        layers.Input(shape=(NUM_WORDS,)),
        layers.Dense(32, activation='relu'),
        layers.Dense(16, activation='relu'),
        layers.Dense(1, activation='sigmoid')
    ])
    model.compile(optimizer=keras.optimizers.Adam(1e-3),
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model

model = build_model()
history = model.fit(
    X_tr, y_tr,
    epochs=30,
    batch_size=512,
    validation_data=(X_val, y_val),
    verbose=1
)

# 5) Plot training & validation accuracy
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
epochs = range(1, len(acc) + 1)

plt.figure(figsize=(8,4))
plt.plot(epochs, acc, label='Training acc')
plt.plot(epochs, val_acc, label='Validation acc')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.title('Training vs Validation Accuracy (num_words=200)')
plt.legend()
plt.tight_layout()
plt.show()

# Determine best epoch by validation accuracy
best_epoch = int(np.argmax(val_acc)) + 1
print(f"Best epoch by validation accuracy: {best_epoch}")
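
### How is the best epoch picked?
The selection rule is an `argmax` over the per-epoch validation accuracies, converted from a 0-based list position to a 1-based epoch number. The sketch below shows this on made-up accuracy values (they are not results from this notebook) and illustrates that ties resolve to the earliest epoch.

```python
import numpy as np

# Hypothetical validation accuracies for epochs 1..5; epochs 3 and 4 tie
toy_val_acc = [0.71, 0.74, 0.76, 0.76, 0.75]

# np.argmax returns the position of the first maximum; +1 turns it into an epoch number
best_epoch = int(np.argmax(toy_val_acc)) + 1
print(best_epoch)  # 3
```

Preferring the earliest of several tied epochs means less training for the same validation score. Because weights are initialized randomly, the selected epoch may differ from run to run.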

In [None]:
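# 6) Retrain a fresh model on (train + val) for the best number of epochs, then evaluate on test
# Note: train+val has more samples (and so more weight updates) per epoch than the training split alone,
# so best_epoch is an approximate guide rather than an exact optimum for this run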
model2 = build_model()
X_full = np.concatenate([X_tr, X_val], axis=0)
y_full = np.concatenate([y_tr, y_val], axis=0)

model2.fit(X_full, y_full, epochs=best_epoch, batch_size=512, verbose=1)
test_loss, test_acc = model2.evaluate(X_test, y_test, verbose=0)
print(f"\nTest accuracy with best-epoch ({best_epoch}) model: {test_acc:.4f}")

### What does `x_train[0]` represent?
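`x_train[0]` is the **first training review**, stored as a list of **integer word indices**. Each index encodes a word's frequency rank in the training corpus, shifted by the loader's default offset (`index_from=3`), because the lowest integers are reserved: `0` for padding, `1` for the start-of-sequence marker, and `2` for out-of-vocabulary words. Because we set `num_words=200`, every index in the sequence lies in `[1, 199]`. Rarer words are not dropped: with the default settings they are replaced by the out-of-vocabulary index `2`, so the sequence keeps its original length. With the default offset, indices `4` through `199` cover roughly the 196 most frequent words rather than a full 200. The indices are frequency-based IDs, not the words themselves, and the sequence length varies from review to review.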
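
The mapping from indices back to words can be sketched on a toy vocabulary. The example assumes the loader's default offset of 3 (`index_from=3`), with indices `1` and `2` acting as start and out-of-vocabulary markers. The helper `decode_review` and the four-word `toy_word_index` are hypothetical. The real word-to-rank dictionary can be obtained from `keras.datasets.imdb.get_word_index()`, which is not called here so that the sketch runs without a download.

```python
def decode_review(seq, word_index, index_from=3):
    """Map shifted indices back to words; reserved or unknown indices become '?'."""
    # word_index maps word -> frequency rank (1 = most frequent)
    reverse_index = {rank + index_from: word for word, rank in word_index.items()}
    return " ".join(reverse_index.get(i, "?") for i in seq)

# Hypothetical ranks for illustration; not the real IMDB vocabulary
toy_word_index = {"the": 1, "and": 2, "a": 3, "movie": 4}

# start marker, "the", "movie", out-of-vocabulary marker, "and"
toy_seq = [1, 4, 7, 2, 5]
print(decode_review(toy_seq, toy_word_index))  # ? the movie ? and
```

Passing `x_train[0]` together with the real word index to a helper like this would be expected to show many `?` placeholders, since most words fall outside the capped vocabulary.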