In [1]:
# Import libraries  #RNN
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.datasets import imdb

# Parameters
vocab_size = 10000   # Only consider the top 10,000 words
maxlen = 200         # Cut texts after 200 words
embedding_dim = 64

# Load IMDb dataset
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)

# Pad sequences (to make all sequences the same length)
x_train = pad_sequences(x_train, maxlen=maxlen)
x_test = pad_sequences(x_test, maxlen=maxlen)

# Build the RNN model
model = Sequential([
    Embedding(vocab_size, embedding_dim, input_length=maxlen),
    LSTM(128, return_sequences=False),
    Dropout(0.5),
    Dense(1, activation='sigmoid')
])

# Compile model
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

# Summary
model.summary()

# Train model
history = model.fit(
    x_train, y_train,
    epochs=5,
    batch_size=64,
    validation_split=0.2
)

# Evaluate on test data
loss, acc = model.evaluate(x_test, y_test)
print(f"Test Accuracy: {acc:.3f}")

# Example prediction
sample = x_test[0].reshape(1, -1)
prediction = model.predict(sample)[0][0]
print("Predicted Sentiment:", "Positive" if prediction > 0.5 else "Negative")



AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'


Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 200, 64)           640000    
                                                                 
 lstm (LSTM)                 (None, 128)               98816     
                                                                 
 dropout (Dropout)           (None, 128)               0         
                                                                 
 dense (Dense)               (None, 1)                 129       
                                                                 
Total params: 738945 (2.82 MB)
Trainable params: 738945 (2.82 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Test Accura

Detailed explanation, line by line
Imports

import tensorflow as tf
Imports TensorFlow, the machine learning framework, under its conventional alias tf. This gives access to Keras (tf.keras), its high-level API, and to the lower-level TensorFlow APIs.

from tensorflow.keras.models import Sequential
Imports the Sequential model class from Keras. Sequential is a container for stacking layers in order (a linear stack).

from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
Imports layer types used in the model:

Embedding: maps integer word indices → dense vectors (learned embeddings).

LSTM: Long Short-Term Memory recurrent layer (captures sequence patterns).

Dense: fully connected (classification) layer.

Dropout: regularization layer that randomly zeroes activations to reduce overfitting.

from tensorflow.keras.preprocessing.sequence import pad_sequences
Utility to pad (or truncate) lists of token IDs so every input sequence has the same length (required for batching).

from tensorflow.keras.datasets import imdb
Imports Keras’s built-in IMDB reviews dataset (preprocessed as integer token indices).

Hyperparameters / configuration

vocab_size = 10000 # Only consider the top 10,000 words
Use only the top 10k most frequent words in the dataset. Words outside this set are replaced with an "out-of-vocab" token. Limits vocabulary size and memory.

maxlen = 200 # Cut texts after 200 words
All sequences will be exactly 200 tokens long after padding/truncation. Longer reviews are truncated; shorter ones are padded.

embedding_dim = 64
Each word will be represented by a 64-dimensional embedding vector. This controls embedding layer output size and thus number of parameters.

Load the dataset

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)
Loads IMDB data already converted to integer sequences.

x_train and x_test are lists/arrays of integer lists: each review is a list of word indices (integers from 1..vocab_size-1; 0 is usually reserved for padding).

y_train and y_test are binary labels: 0 for negative, 1 for positive.

num_words=vocab_size keeps only top vocab_size words; rarer words are replaced by a reserved index.

Shape note: x_train is a one-dimensional NumPy object array holding 25,000 Python lists (for IMDB). Each list has a different length before padding.
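To make the num_words behavior concrete, here is a minimal pure-Python sketch of the index handling described above. It assumes the default load_data settings (start_char=1, oov_char=2, index_from=3); the function name encode_review and the toy ranks are hypothetical and only illustrate the idea, they are not Keras code.

```python
# Sketch of the index handling applied to each review by imdb.load_data
# (assumed defaults: start_char=1, oov_char=2, index_from=3).
START_CHAR, OOV_CHAR, INDEX_FROM = 1, 2, 3

def encode_review(raw_ranks, num_words):
    """raw_ranks: 1-based frequency ranks of the words in one review."""
    # Shift every rank past the reserved ids and prepend the start marker.
    shifted = [START_CHAR] + [rank + INDEX_FROM for rank in raw_ranks]
    # Any id at or above num_words is replaced by the out-of-vocabulary id.
    return [tok if tok < num_words else OOV_CHAR for tok in shifted]

# Toy review (hypothetical ranks): two frequent words and one rare word.
encoded = encode_review([1, 20000, 5], num_words=10000)
print(encoded)  # [1, 4, 2, 8]
```

The rare word (rank 20000) becomes 2, which is why decoded reviews show an unknown-word marker wherever a word fell outside the top 10,000.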

Pad sequences

x_train = pad_sequences(x_train, maxlen=maxlen)
Pads or truncates every sequence in x_train so each becomes length maxlen. By default pad_sequences pads with zeros at the beginning (padding='pre') and truncates longer sequences from the start (truncating='pre'). You can change to padding='post' if you prefer trailing zeros.

x_test = pad_sequences(x_test, maxlen=maxlen)
Same as above for the test set.

Shape after padding: x_train.shape == (num_examples, maxlen); with IMDB typically (25000, 200).
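The default pad/truncate behavior can be sketched for a single sequence in plain Python. The helper pad_pre below is a hypothetical stand-in that mimics padding='pre' and truncating='pre'; it is not the Keras implementation.

```python
def pad_pre(seq, maxlen, value=0):
    """Pure-Python sketch of pad_sequences' default behaviour for one sequence."""
    seq = list(seq)[-maxlen:]                      # truncating='pre': keep the last maxlen tokens
    return [value] * (maxlen - len(seq)) + seq     # padding='pre': zeros go in front

print(pad_pre([5, 6, 7], maxlen=5))                # [0, 0, 5, 6, 7]
print(pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5))    # [3, 4, 5, 6, 7]
```

Note that with the defaults a long review loses its beginning, not its end.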

Build the model
model = Sequential([
    Embedding(vocab_size, embedding_dim, input_length=maxlen),
    LSTM(128, return_sequences=False),
    Dropout(0.5),
    Dense(1, activation='sigmoid')
])


Sequential([...])
Create a sequential model and pass the ordered layers.

Embedding(vocab_size, embedding_dim, input_length=maxlen)

Input: integer sequences shaped (batch_size, maxlen) where each integer is in [0, vocab_size-1].

Output: a 3D tensor shaped (batch_size, maxlen, embedding_dim) — each token replaced by a learned embedding_dim vector.

vocab_size is the input dimension (size of the vocabulary + reserved indices). The embedding matrix size will be (vocab_size, embedding_dim) and is learned during training.

LSTM(128, return_sequences=False)

An LSTM layer with 128 hidden units (the dimensionality of the LSTM output/state).

return_sequences=False (default) means the layer returns only the final hidden state for the whole sequence (shape (batch_size, 128)), not a sequence of outputs at each timestep.

If you set return_sequences=True, the LSTM would return an output for each timestep (shape (batch_size, maxlen, 128)), useful for stacked RNNs or sequence-to-sequence tasks.
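The difference between the two return_sequences settings can be illustrated with a tiny scalar recurrent loop. This is a deliberate simplification (a plain tanh RNN with made-up weights w_x and w_h, not an LSTM); it only shows which states are returned.

```python
import math

def tiny_rnn(inputs, w_x=0.5, w_h=0.8, return_sequences=False):
    """Scalar tanh RNN (a simplification, not an LSTM) run over one sequence."""
    h = 0.0
    states = []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)   # new state depends on the input and the previous state
        states.append(h)
    return states if return_sequences else states[-1]

seq = [1.0, 0.0, -1.0, 2.0]
all_states = tiny_rnn(seq, return_sequences=True)   # one state per timestep
final_state = tiny_rnn(seq)                          # only the last state
print(len(all_states), final_state == all_states[-1])  # 4 True
```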

Dropout(0.5)

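During training, randomly sets 50% of its inputs to zero and scales the remaining ones by 1/(1 - 0.5) = 2 so the expected total is unchanged; at inference time it passes values through untouched. Here it is applied to the LSTM output before the Dense layer.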

Note: this is standard Dropout applied to layer activations. LSTMs also support recurrent_dropout inside the LSTM for recurrent connections.
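As a rough illustration of what Dropout does to a vector of activations, the sketch below implements "inverted" dropout on a Python list. The function name and signature are hypothetical; Keras applies the same idea to tensors.

```python
import random

def dropout(activations, rate, training, rng):
    """Inverted dropout on a list of floats (sketch of the idea, not Keras code)."""
    if not training:
        return list(activations)            # inference: values pass through unchanged
    keep_scale = 1.0 / (1.0 - rate)         # rescale survivors so the expected sum is preserved
    return [0.0 if rng.random() < rate else a * keep_scale for a in activations]

rng = random.Random(0)
acts = [1.0, 2.0, 3.0, 4.0]
print(dropout(acts, rate=0.5, training=True, rng=rng))   # some entries zeroed, the rest doubled
print(dropout(acts, rate=0.5, training=False, rng=rng))  # [1.0, 2.0, 3.0, 4.0]
```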

Dense(1, activation='sigmoid')

Final classification layer with a single output neuron.

sigmoid squashes the output to (0, 1), giving the probability of the positive class. Pairing a single sigmoid unit with binary_crossentropy loss is the standard setup for binary classification.

Compile the model
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])


loss='binary_crossentropy'

Appropriate loss for binary classification with a single sigmoid output. It measures how close predicted probabilities are to true labels.

optimizer='adam'

Adam optimizer — adaptive learning rate method that works well out of the box.

metrics=['accuracy']

Track accuracy during training and evaluation.
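To see how the sigmoid output and the binary_crossentropy loss interact, here is a small worked example for a single prediction. The raw score 2.0 is an arbitrary value chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def binary_crossentropy(y_true, p):
    """Loss for one example; p is the predicted probability of class 1."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

p = sigmoid(2.0)                              # about 0.88
print(round(p, 3))
print(round(binary_crossentropy(1, p), 3))    # small loss: confident and correct
print(round(binary_crossentropy(0, p), 3))    # large loss: confident and wrong
```

The same probability is cheap when the label agrees with it and expensive when it does not, which is what pushes the model toward calibrated outputs.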

Model summary

model.summary()
Prints a table with each layer, output shapes, and number of trainable parameters. Useful to verify dimension flow and parameter counts.
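The parameter counts in the summary table can be reproduced by hand. The arithmetic below uses the standard formulas for these layer types and the hyperparameters from this notebook.

```python
vocab_size, embedding_dim, lstm_units = 10000, 64, 128

# One embedding_dim-sized vector per vocabulary entry.
embedding_params = vocab_size * embedding_dim
# Four weight blocks (three gates plus the candidate cell update), each with
# input weights, recurrent weights, and a bias vector.
lstm_params = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)
# One weight per LSTM output plus one bias.
dense_params = lstm_units * 1 + 1

print(embedding_params, lstm_params, dense_params)     # 640000 98816 129
print(embedding_params + lstm_params + dense_params)   # 738945
```

These match the Param # column and the total of 738945 printed by model.summary() above.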

Train the model
history = model.fit(
    x_train, y_train,
    epochs=5,
    batch_size=64,
    validation_split=0.2
)


x_train, y_train — training data.

epochs=5 — run through the entire training set 5 times. Increase for better fit; watch for overfitting.

batch_size=64: number of samples per gradient update. Smaller batches give noisier updates that sometimes generalize better; larger batches run faster on GPUs but need more memory.

validation_split=0.2 — hold out 20% of the training data for validation (Keras will split the last 20% of x_train and y_train). The model evaluates loss/accuracy on this split after each epoch, which helps monitor overfitting.

history stores training/validation loss & metrics per epoch (useful to plot learning curves).
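The numbers behind these fit arguments can be worked out directly. This sketch assumes the 25,000 IMDB training reviews and that the validation portion is sliced off the end of the arrays, as described above.

```python
import math

num_samples, validation_split, batch_size = 25000, 0.2, 64

split_at = int(num_samples * (1 - validation_split))   # index where the validation slice starts
num_train = split_at
num_val = num_samples - split_at
steps_per_epoch = math.ceil(num_train / batch_size)     # the last batch is smaller than 64

print(num_train, num_val, steps_per_epoch)   # 20000 5000 313
```

So each epoch performs 313 gradient updates on 20,000 reviews and then evaluates on the remaining 5,000.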

Notes:

If your dataset is already pre-shuffled, validation_split is fine. For some datasets you'd want to use a dedicated validation_data=(x_val, y_val) instead of splitting.

Training on CPU may be slow — GPU accelerates RNN training significantly.

Evaluate on test data

loss, acc = model.evaluate(x_test, y_test)
Runs the trained model on the held-out test set and returns loss and the metrics defined during compile (accuracy here). Gives an unbiased estimate of final performance.

print(f"Test Accuracy: {acc:.3f}")
Prints test accuracy to 3 decimal places.

Single sample prediction
sample = x_test[0].reshape(1, -1)
prediction = model.predict(sample)[0][0]
print("Predicted Sentiment:", "Positive" if prediction > 0.5 else "Negative")


x_test[0] selects the first padded test review — shape (maxlen,).

.reshape(1, -1) changes it to (1, maxlen) to form a batch of size 1, because model.predict expects a batch dimension.

model.predict(sample) returns an array of shape (1, 1) (batch_size, output_dim). [0][0] extracts the scalar probability.

The if prediction > 0.5 threshold converts probability into a binary class (0.5 is the common default). You can change threshold depending on precision/recall preferences.

Extra tips, common changes, and gotchas

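Padding direction: By default pad_sequences uses padding='pre' and truncating='pre'. For this model, which reads only the LSTM's final state and does not mask padding, pre-padding keeps the real tokens next to the output step. Use pad_sequences(..., padding='post', truncating='post') if you prefer trailing zeros, ideally together with mask_zero=True on the Embedding layer so the LSTM skips the padded steps.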

Embedding index 0 = padding: Keras’s pad_sequences uses 0 for padding. If your dataset uses index 0 for a real token, adjust accordingly. With imdb.load_data() index 0 is reserved.

Return_sequences True: If you want to stack another RNN after this LSTM (e.g., a second LSTM), set return_sequences=True on the first LSTM.

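Bidirectional LSTM: For better performance on many text tasks, wrap the LSTM with tf.keras.layers.Bidirectional(...), e.g.
Bidirectional(LSTM(128)). This runs one LSTM forward and one backward over the sequence and concatenates their outputs, so the prediction sees context from both directions; the layer's output size becomes 256 and its parameter count doubles.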

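Regularization: You can add recurrent_dropout=0.2 to the LSTM to regularize the recurrent connections, or add L2 penalties (for example via kernel_regularizer). Note that a non-zero recurrent_dropout makes the LSTM fall back from the fast cuDNN kernel to the slower generic implementation on GPU.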

Changing model capacity: Increase/decrease LSTM(128) units or embedding_dim to change model capacity. More units = more expressive but more prone to overfitting.

Training length: More epochs often help but monitor validation loss — if validation loss rises while training loss falls, you’re overfitting; use early stopping:
tf.keras.callbacks.EarlyStopping(patience=2, restore_best_weights=True).

Performance: LSTMs are slower than simpler RNNs and slower than Transformer architectures for long sequences. For production/performance consider GRU (faster) or Transformer-based models for large tasks.
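The early-stopping tip above boils down to a small piece of bookkeeping. The function below is a simplified sketch of that logic (patience counting on validation loss), with hypothetical loss values; it is not the Keras callback itself.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (stop_epoch, best_epoch), both 0-based, for per-epoch validation losses."""
    best_loss, best_epoch, wait = float("inf"), -1, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0   # improvement: reset the counter
        else:
            wait += 1
            if wait >= patience:                           # no improvement for `patience` epochs
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Hypothetical validation losses: improvement stalls after the third epoch.
print(early_stopping_epoch([0.50, 0.40, 0.35, 0.37, 0.39, 0.45]))  # (4, 2)
```

With restore_best_weights=True, the weights from the best epoch (index 2 here) would be the ones kept after training stops.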

If you want, I can:

Annotate the code inline with comments instead of separate explanations.

Convert this to a GRU or SimpleRNN.

Add preprocessing (like mapping raw text → tokens) so you can use your own dataset (not just IMDB).

Add callbacks (ModelCheckpoint, EarlyStopping) and plotting of history.

In [None]:
“First, we import TensorFlow — that’s the main deep learning library we’ll use.
Inside TensorFlow, we use Keras — the high-level API — to easily build our model.

Sequential means we’ll stack layers one after another.

Embedding converts words into numerical vectors.

LSTM is our main RNN layer that can remember long-term patterns.

Dense and Dropout handle final classification and regularization.

Finally, we import imdb, a built-in dataset of movie reviews.”

“Here, Keras automatically downloads the IMDb dataset.
It comes preprocessed — every review is already converted into a list of numbers.
For example, the word ‘amazing’ might be number 52, ‘movie’ might be 37, and so on.”

🧠 (Visual idea: Show a sentence like “The movie was amazing” → [37, 24, 15, 52].)

“Each review is a different length — some are 50 words, some are 300.
But neural networks need fixed-length inputs.
So pad_sequences adds zeros at the beginning of shorter reviews,
and cuts off extra words if they’re too long — all reviews become 200 words long.”

🟦 (Visual idea: Rows of sequences with 0-padding on the left to make all equal length.)

“Now we build the neural network.

The Embedding layer takes those numbers (word IDs) and turns them into word vectors — 64 numbers per word, learned automatically during training.

The LSTM layer has 128 memory cells. It reads the sequence one word at a time and tries to understand the meaning of the whole sentence.

The Dropout layer randomly turns off half the neurons while training — that prevents overfitting.

Finally, the Dense layer outputs a single number between 0 and 1, using a sigmoid activation — 0 means negative review, 1 means positive.”

🎥 (Visual idea: Show arrows connecting “Embedding → LSTM → Dropout → Dense → output”.)


“Now we tell the model how to learn.

We use binary_crossentropy since there are only two classes: positive or negative.

adam optimizer automatically adjusts learning speed.

And we track accuracy to see how well it performs.”

“This prints a summary of each layer: the shapes, and how many trainable parameters there are.
It’s a good sanity check before training.”

“Let’s train the model!
We’ll train for 5 rounds — called epochs.
Each round, the model trains on 20,000 of the 25,000 training reviews and tries to minimize the loss.
The data is divided into batches of 64 samples each.

We also use 20% of the data for validation — that helps us check if the model is overfitting.”

📈 (Visual idea: Loss curve falling and accuracy curve rising over the epochs.)

“After training, we test the model on completely unseen reviews.
The test accuracy tells us how well it generalizes — ideally above 80%.”

“Finally, we test one example manually.
We reshape it into a batch of 1, and the model predicts a number between 0 and 1.
If it’s greater than 0.5, it’s positive — otherwise, negative.
Simple and powerful!”

🏁 Wrap-Up

🎙️ Voiceover:

“So that’s how an LSTM-based RNN reads movie reviews and predicts their sentiment!

Remember the flow:
Text → Tokenization → Padding → Embedding → LSTM → Dense Output.

You can use this same idea for any kind of sequence data — like chat messages, stock prices, or even music notes.”

“Let’s start by importing the libraries.
TensorFlow gives us access to Keras — the easy-to-use high-level API for deep learning.

Sequential helps us stack layers in order.

Embedding converts words to numerical vectors.

LSTM is our main RNN layer that captures sequence patterns.

Dense and Dropout are for the output and regularization.

And the imdb dataset contains 25,000 movie reviews ready for sentiment analysis.”


“Before building the model, we define a few hyperparameters.

vocab_size = 10,000: We’ll only use the 10,000 most frequent words in the IMDb reviews.

maxlen = 200: Each review will be trimmed or padded to exactly 200 words.

embedding_dim = 64: Every word will be represented by a 64-dimensional learned vector.”

“Keras makes loading the IMDb dataset extremely simple.
Each review is already converted into a list of integers — where each number represents a specific word.

Here, we print out how many reviews we have and inspect one example.
Notice that the words aren’t plain English yet — just numbers that point to words in the vocabulary.”

“Each review has a different length, but neural networks require fixed-size inputs.

We use pad_sequences to make all reviews 200 words long.
Shorter reviews are padded with zeros at the beginning, longer ones are truncated.

The shape printed here confirms each review now has exactly 200 tokens.”

“Now, let’s build our model!

We start with an Embedding layer — this layer learns a dense vector for every word.

Next is the LSTM layer with 128 memory cells.
LSTM stands for Long Short-Term Memory.
It reads the sequence word by word and remembers important information from earlier words.

return_sequences=False means we only want the final output — perfect for classification.

After that, a Dropout layer randomly turns off 50% of neurons during training to prevent overfitting.

Finally, a Dense layer with a sigmoid activation gives a probability between 0 and 1 — predicting whether the review is positive or negative.

When we call model.summary(), we can see each layer’s shape and the number of trainable parameters.”

“Before training, we compile the model.

binary_crossentropy is the loss function for two-class problems.

The adam optimizer adapts the learning rate automatically.

And we’ll track accuracy as our metric.”

“Now we’re ready to train!
We’ll run for five epochs, meaning the model will see the entire training set five times.

Each batch contains 64 samples, and we hold out 20% of the data for validation.

As it trains, you’ll see loss and accuracy improving with each epoch.

Usually, by the 4th or 5th epoch, accuracy reaches around 85% on validation data — which is quite good for this small network.”

Visual cue: Simulate training output lines appearing, highlight accuracy increasing.

“Once training finishes, we test the model on completely unseen data: the 25,000 test reviews.

This gives us a realistic measure of how well our model generalizes.
A test accuracy above 80% means it’s doing a solid job understanding review sentiment.”

“Let’s try predicting one review manually.

We take the first example from our test set, reshape it into a batch of one, and feed it to the model.

The model outputs a number between 0 and 1 — if it’s above 0.5, we call it positive; otherwise negative.

And that’s how our neural network interprets a piece of text!”

[10:15 – 11:30] Concept Recap

Visual: Slide/markdown cell summarizing the flow:

Text → Tokenization → Padding → Embedding → LSTM → Dense → Sentiment


Voiceover:

“Let’s quickly review the workflow:

First, we convert text into tokens.

Then we pad to equal length.

The Embedding layer learns relationships between words.

The LSTM processes them sequentially to find meaning.

Finally, the Dense layer makes a binary prediction.

That’s the full pipeline of text sentiment classification using RNNs.”

[11:30 – 12:00] Closing and Next Steps

Visual: Outro markdown cell:

✅ You learned:
- How to build an RNN (LSTM) with Keras
- How to train on IMDb movie reviews
- How to test and predict sentiment
Next step: try GRU or bidirectional LSTM for better accuracy!


Voiceover:

“And that wraps up our tutorial!
You’ve learned how an RNN processes sequential text data, how embeddings work, and how to train your own sentiment classifier.

Next, you can experiment with GRUs or Bidirectional LSTMs for even better results.

Thanks for watching — and happy coding!”

Visual: Fade-out with “Subscribe for more Deep Learning Tutorials”.

🎧 Optional: Voice-Over Production Settings

To create the actual video:

Tool suggestion: HeyGen or Pika Labs.

Voice style: Female or male, calm “teacher” tone.

Script speed: ~120–130 words per minute → 12-minute total length.

Background: Plain Jupyter Notebook screen capture; use subtle zooms on code blocks and highlight important lines with a yellow rectangle overlay.

Add light background music (low volume instrumental) for a polished effect.

In [3]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, LSTM, GRU, Dense, Dropout
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.datasets import imdb
vocab_size = 10000
maxlen = 200

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)
x_train = pad_sequences(x_train, maxlen=maxlen)
x_test = pad_sequences(x_test, maxlen=maxlen)
simple_rnn_model = Sequential([
    Embedding(vocab_size, 64, input_length=maxlen),
    SimpleRNN(64),
    Dropout(0.5),
    Dense(1, activation='sigmoid')
])

simple_rnn_model.compile(
    loss='binary_crossentropy',
    optimizer='adam',
    metrics=['accuracy']
)

history_rnn = simple_rnn_model.fit(
    x_train, y_train,
    epochs=3,
    batch_size=64,
    validation_split=0.2
)
loss, acc = simple_rnn_model.evaluate(x_test, y_test)
print(f"Simple RNN Test Accuracy: {acc:.3f}")
print("Simple RNN:", round(history_rnn.history['val_accuracy'][-1], 3))
# Example prediction
sample1= x_test[0].reshape(1, -1)
prediction = simple_rnn_model.predict(sample1)[0][0]
print("Predicted Sentiment:", "Positive" if prediction > 0.5 else "Negative")




Epoch 1/3
Epoch 2/3
Epoch 3/3
Simple RNN Test Accuracy: 0.835
Simple RNN: 0.847
Predicted Sentiment: Negative


In [4]:
imdb

<module 'keras.api._v2.keras.datasets.imdb' from 'C:\\Users\\Welcome\\anaconda3\\envs\\mycorrectenv\\lib\\site-packages\\keras\\api\\_v2\\keras\\datasets\\imdb\\__init__.py'>

In [5]:
#to see the dataset

The history saving thread hit an unexpected error (OperationalError('database or disk is full')).History will not be written to the database.


In [7]:
import os
os.environ['KERAS_HOME'] = 'D:/keras_datasets'


In [8]:
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences

vocab_size = 10000
maxlen = 200

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)

print("Train:", len(x_train), "Test:", len(x_test))
print("Example tokens:", x_train[0][:20], "Label:", y_train[0])

word_index = imdb.get_word_index()
word_index = {k: (v + 3) for k, v in word_index.items()}
word_index["<PAD>"], word_index["<START>"], word_index["<UNK>"], word_index["<UNUSED>"] = 0, 1, 2, 3
reverse_word_index = {v:k for k,v in word_index.items()}

def decode_review(encoded):
    return ' '.join([reverse_word_index.get(i, '?') for i in encoded])

for i in range(3):
    print("\n" + "="*60)
    print("Review", i+1)
    print(decode_review(x_train[i][:100]))
    print("Sentiment:", "Positive" if y_train[i]==1 else "Negative")

x_train = pad_sequences(x_train, maxlen=maxlen)
x_test  = pad_sequences(x_test,  maxlen=maxlen)
print("Padded shapes:", x_train.shape, x_test.shape)


Train: 25000 Test: 25000
Example tokens: [1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25] Label: 1
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json
   8192/1641221 [..............................] - ETA: 0s

OSError: [Errno 28] No space left on device

Excellent — you caught a very common but important issue 💡

The error:

OSError: [Errno 28] No space left on device


means that your disk or the temporary folder TensorFlow/Keras uses to download datasets is full — not that your code is wrong.

Let’s break this down clearly 👇

🧩 1️⃣ Why This Happens

When you call:

word_index = imdb.get_word_index()


Keras tries to download a small JSON file:

https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json


and store it in your local cache folder, usually:

~/.keras/datasets/


or
on Windows:

C:\Users\<YourName>\.keras\datasets\


If that folder (or your disk) is full, Python cannot write the file → “No space left on device”.

🧩 2️⃣ How to Fix It (Windows / Anaconda users)
✅ Step 1 — Check your Keras cache location

Open a terminal or Anaconda Prompt and run:

echo %USERPROFILE%\.keras\datasets


Then open that folder in File Explorer.

You’ll likely see files like:

imdb.npz
imdb_word_index.json
mnist.npz


If the folder is too big or corrupted, delete old files:

imdb.npz
imdb_word_index.json


Then rerun your notebook — Keras will automatically re-download clean copies.
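If you would rather inspect the cache from Python than from File Explorer, a short script along these lines should work. It assumes the default cache location (~/.keras/datasets); the helper name list_cached_datasets is hypothetical, and it only lists files, it does not delete anything.

```python
from pathlib import Path

def list_cached_datasets(cache_dir=None):
    """Return (file name, size in MB) pairs for files in the Keras dataset cache."""
    cache_dir = Path(cache_dir) if cache_dir else Path.home() / ".keras" / "datasets"
    if not cache_dir.is_dir():
        return []                      # nothing cached yet (or a different cache location)
    return sorted(
        (p.name, round(p.stat().st_size / 1e6, 2))
        for p in cache_dir.iterdir() if p.is_file()
    )

for name, size_mb in list_cached_datasets():
    print(f"{name}: {size_mb} MB")
```

A file with a suspiciously small size next to its usual size can indicate a partial download left behind by the failed write.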

✅ Step 2 — Clean temporary folders

Run this in Command Prompt (close Jupyter first):

del /q/f/s %TEMP%\*


This removes temporary cache files that often eat space.

✅ Step 3 — (Optional) Change Keras dataset storage location

If your C: drive is low on space, you can make Keras use another folder, e.g. D:\keras_datasets.

Before loading the dataset, run:

import os
os.environ['KERAS_HOME'] = 'D:/keras_datasets'


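Then, in Keras versions that honor KERAS_HOME, downloads go to D:\keras_datasets\datasets instead of C:\Users\<name>\.keras\datasets. Set the variable before importing TensorFlow/Keras (restart the kernel if they are already imported), because some versions read it only at import time.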

✅ Step 4 — Verify space

Make sure you have at least 500MB free on the drive where Anaconda and .keras are located.
You can check by right-clicking your drive → Properties.
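The same check can be done from Python with the standard library. This sketch reports free space on the drive that holds your home folder (where .keras lives by default); pass a different path, such as the drive used for KERAS_HOME, to check another drive.

```python
import shutil
from pathlib import Path

def free_megabytes(path):
    """Free space, in MB, on the drive that holds `path`."""
    return shutil.disk_usage(path).free / 1e6

home = Path.home()
print(f"Free space on the drive holding {home}: {free_megabytes(home):.0f} MB")
```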

🧩 3️⃣ After Cleaning

Once you free space, rerun these lines:

from tensorflow.keras.datasets import imdb
word_index = imdb.get_word_index()


✅ You should see it download quickly:

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json


and then continue normally.

🧠 Quick Recap
| Issue | Meaning | Fix |
|---|---|---|
| OSError: [Errno 28] No space left on device | Disk or cache folder full | Delete old Keras datasets / free space |
| Where data is stored | C:\Users\<name>\.keras\datasets | Clean this folder |
| Avoid future issues | Set KERAS_HOME to another drive | os.environ['KERAS_HOME'] = 'D:/keras_datasets' |

Would you like me to give you a one-line cleanup script you can safely run in Anaconda to automatically delete all old Keras datasets and free up space (without affecting your environment)?

Line-by-line explanation — IMDb decoding & preview (complete, detailed)

Below I’ll explain every line (and small groups of lines) from your IMDb preview + decoding + padding section. I’ll say what each line does, what the types/shapes are, why it’s there, and common pitfalls or alternatives. Read it as if I’m walking you through the notebook while pointing at each line.

🧾 Section 1 — Load and Preview IMDb Dataset
# Import libraries
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences


from tensorflow.keras.datasets import imdb
Imports the built-in IMDb dataset loader from Keras. This gives you the function imdb.load_data() which returns tokenized review data (lists of integers) and labels.

from tensorflow.keras.preprocessing.sequence import pad_sequences
Imports a utility that will convert lists of variable-length token sequences into fixed-length arrays by padding or truncating. You use this before feeding data to a neural network.

# Set parameters
vocab_size = 10000   # only keep top 10,000 words
maxlen = 200         # cut or pad all reviews to 200 words


vocab_size = 10000
We'll only keep the top 10,000 most frequent words from the dataset. Any word with an index ≥ 10000 will be treated as "out of vocabulary" (mapped to the UNK or unknown token when loading with num_words).

maxlen = 200
All reviews will be forced to length 200 tokens: shorter reviews are padded (with zeros), longer ones are truncated. This produces arrays of shape (num_samples, 200) required by Keras models.

# Load the IMDb dataset
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)


imdb.load_data(num_words=vocab_size)
Downloads (if not cached) and loads the IMDb dataset. Returns two tuples: training and test sets.

x_train: a NumPy object array of length 25000 in which each element is a Python list of integers (word indices).

y_train — a list/array of 25000 binary labels (0 or 1).

Same for x_test, y_test (also 25000 samples).

num_words=vocab_size tells Keras to keep only tokens with index < vocab_size. Any rarer word is replaced by the out-of-vocabulary index (oov_char=2 by default).

Type / shape before padding:

type(x_train) == numpy.ndarray (dtype=object), len(x_train) == 25000

type(x_train[0]) == list, len(x_train[0]) varies (e.g., 50–500)

print("Training samples:", len(x_train))
print("Testing samples:", len(x_test))
print("Example tokenized review:", x_train[0][:20])
print("Label:", y_train[0])


len(x_train) / len(x_test) — print number of samples (should be 25000 each).

x_train[0][:20] — print first 20 tokens (word indices) of the first review to show it’s numeric, not plain text.

y_train[0] — print the label (0 or 1) for that first review.

Why: quick sanity check: data loaded, labels exist, and tokens are integer indices.

🔁 Section 2 — Decode Reviews into Readable Text

We now convert integer tokens back into words so you can read the review.

# Get the word index mapping
word_index = imdb.get_word_index()


imdb.get_word_index() returns a dictionary mapping words → integer index.
Example: {'the': 1, 'and': 2, ...} (indices are Keras’s internal mapping, but note we’ll shift them in the next step).

Type: word_index is a dict with str keys and int values.

# Adjust indices (Keras reserves first 3)
word_index = {k: (v + 3) for k, v in word_index.items()}
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNK>"] = 2  # unknown
word_index["<UNUSED>"] = 3


word_index = {k: (v + 3) for k, v in word_index.items()}
Keras reserves indices 0–3 for special tokens. The original word_index maps words to indices starting at 1 — we shift them by +3 so we can insert special tokens at indices 0..3.

word_index["<PAD>"] = 0
Reserve index 0 for padding. This matches the default fill value (0) that pad_sequences uses.

word_index["<START>"] = 1
Index 1 marks the start of a sequence; imdb.load_data() prepends it to every review (start_char=1).

word_index["<UNK>"] = 2
Index 2 used for unknown words (words excluded by num_words).

word_index["<UNUSED>"] = 3
Reserved unused token (Keras historically included it).

Why shift indices: imdb.load_data() offsets every word index by 3 (index_from=3) to make room for the reserved ids 0–3, so the reverse map must apply the same offset.

Pitfall: If you skip the +3 shift you’ll decode wrong words (off-by-index errors).

# Reverse the mapping to get words back
reverse_word_index = {value: key for (key, value) in word_index.items()}


Creates reverse_word_index where keys are indices → values are words.
Example: reverse_word_index[1] == "<START>", reverse_word_index[4] == "the", etc.

Type: dict mapping int → str.

# Function to decode reviews
def decode_review(encoded_review):
    return ' '.join([reverse_word_index.get(i, '?') for i in encoded_review])


def decode_review(encoded_review): defines a helper function.

' '.join([...]) builds a single string from a list of tokens.

reverse_word_index.get(i, '?') — for each token index i, return the corresponding word; if i not found, return '?'.

Why: safer than direct indexing — handles unexpected indices.

# Display a few decoded reviews
for i in range(3):
    print("\n" + "="*80)
    print(f"Review {i+1}:")
    print(decode_review(x_train[i][:100]))
    print("Sentiment:", "Positive" if y_train[i] == 1 else "Negative")


for i in range(3): — loop to print first 3 reviews.

x_train[i][:100] — decode only the first 100 tokens to avoid giant prints.

decode_review(...) — prints readable text approximating the original review.

"Positive" if y_train[i] == 1 else "Negative" — map label 1 → Positive, 0 → Negative.

Output: A readable snippet of reviews with labels.

Note: The decoded text will include tokens like <START> or <UNK> for special/unknown tokens.
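The effect of the +3 shift is easy to see on a toy vocabulary. The four-word raw_index below is hypothetical and merely stands in for the dictionary returned by imdb.get_word_index(); the decoding logic is the same as in the notebook.

```python
# Toy word ranks (hypothetical, standing in for imdb.get_word_index()).
raw_index = {"the": 1, "movie": 2, "was": 3, "great": 4}

shifted = {word: rank + 3 for word, rank in raw_index.items()}
shifted.update({"<PAD>": 0, "<START>": 1, "<UNK>": 2, "<UNUSED>": 3})
id_to_word = {idx: word for word, idx in shifted.items()}

def decode(ids, table):
    return " ".join(table.get(i, "?") for i in ids)

encoded = [1, 4, 5, 6, 2]   # <START> the movie was <UNK>
print(decode(encoded, id_to_word))    # <START> the movie was <UNK>

# Without the +3 shift the same ids land on the wrong words.
unshifted = {rank: word for word, rank in raw_index.items()}
print(decode(encoded, unshifted))     # the great ? ? movie
```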

🧩 Section 3 — Pad Sequences for Model Input
# Pad all sequences to fixed length
x_train = pad_sequences(x_train, maxlen=maxlen)
x_test = pad_sequences(x_test, maxlen=maxlen)

print("Padded training shape:", x_train.shape)
print("Padded test shape:", x_test.shape)


pad_sequences(x_train, maxlen=maxlen)
Converts the list of variable-length sequences into a 2D NumPy array with shape (num_samples, maxlen):

If a sequence is shorter than maxlen, it is padded with zeros (by default at the start; padding='post' changes this to the end).

If longer, it is truncated (by default at the start; truncating='post' cuts the end instead).

After these lines:

x_train.shape typically (25000, 200)

x_test.shape typically (25000, 200)

Why padding: Neural networks process fixed-size tensors in batches. RNNs expect sequences to be same length per batch.

Alternatives / options:

pad_sequences(..., padding='post') pads at the end instead. Without masking, an RNN then processes the trailing zeros last, so pair it with mask_zero=True on the Embedding layer.

truncating='post' — prefer to keep the beginning of long reviews instead of the end.

Use dtype='int32' if you want to ensure dtype.

✅ Final notebook flow (what happens next)

You will typically follow the preview/padding steps with:

Train SimpleRNN — Embedding → SimpleRNN → Dense

Good to illustrate conceptually, but often weaker performance.

Train LSTM — Embedding → LSTM → Dense

Handles long-term dependencies using gates (input, forget, output).

Train GRU — Embedding → GRU → Dense

Simpler than LSTM with comparable performance and fewer parameters.

Compare validation/test accuracy to see differences.
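One way to make the "fewer parameters" comparison concrete is to count the weights of each recurrent layer at the same size. The formulas below are the standard ones; the GRU count assumes two bias vectors per gate (the reset_after=True variant, which is an assumption about the layer's configuration), and the sizes 64/64 are chosen only for illustration.

```python
def simple_rnn_params(input_dim, units):
    # One block: input weights, recurrent weights, bias.
    return units * (input_dim + units + 1)

def lstm_params(input_dim, units):
    # Four blocks: input, forget, and output gates plus the candidate cell update.
    return 4 * units * (input_dim + units + 1)

def gru_params(input_dim, units):
    # Three blocks; assumes two bias vectors per block (reset_after=True).
    # With a single bias per block, use "+ 1" instead of "+ 2".
    return 3 * units * (input_dim + units + 2)

for name, fn in [("SimpleRNN", simple_rnn_params), ("GRU", gru_params), ("LSTM", lstm_params)]:
    print(name, fn(input_dim=64, units=64))
# SimpleRNN 8256
# GRU 24960
# LSTM 33024
```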

⚠️ Common pitfalls & tips

Index shift errors: If you don’t shift word_index by +3 (or you load the data with a different index_from), your reverse_word_index will be misaligned and the decoded text will be gibberish.

Padding direction matters: pad_sequences defaults to padding='pre'. padding='post' is easier to read when inspecting arrays, but without masking the RNN's final state is then computed after a run of zeros.

Token 0 is padding: After padding, 0 means “no word”. By default the Embedding layer treats index 0 like any other token and learns a vector for it; set mask_zero=True in Embedding if you want RNN layers that support masking to skip padded positions.

Memory & speed: Loading full dataset and training can be slow on CPU. Use GPU if available. Use fewer epochs for quick experiments (3–5), then increase.

Unseen words: num_words=vocab_size removes rare words; when decoding you’ll see <UNK> for unknown tokens.