
# Next-Word Prediction with LSTM (Keras/TensorFlow)

This notebook trains a small **LSTM language model** to predict the **next word** given a text prefix.
It is self-contained and runs on CPU or GPU (e.g., Google Colab).

**What you'll do:**
1. Install TensorFlow (if needed)
2. Prepare a tiny corpus (you can replace it with your own text)
3. Tokenize and create training sequences
4. Build and train an LSTM model
5. Use `generate_next_words()` to predict continuations with adjustable temperature


In [None]:

# If running locally and TensorFlow is not installed, uncomment the next line.
# In Google Colab this typically isn't required, but it's safe to run.
!pip -q install tensorflow


In [None]:

import random

import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from tensorflow.keras.utils import to_categorical

# Reproducibility
seed = 42
tf.random.set_seed(seed)
np.random.seed(seed)
random.seed(seed)

print("TensorFlow:", tf.__version__)
print("GPU Available:", tf.config.list_physical_devices('GPU'))


TensorFlow: 2.19.0
GPU Available: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]



## 1) Prepare a Text Corpus

Replace `corpus_text` with your own dataset for better results. You can paste paragraphs of text or load a file.


In [None]:

# A small public-domain snippet (Lewis Carroll, Alice's Adventures in Wonderland, short excerpt)
corpus_text = """
Alice was beginning to get very tired of sitting by her sister on the bank,
and of having nothing to do: once or twice she had peeped into the book her sister was reading,
but it had no pictures or conversations in it, 'and what is the use of a book,' thought Alice 'without pictures or conversation?'
So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid),
whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies,
when suddenly a White Rabbit with pink eyes ran close by her.
"""

# (Optional) Load your own text file instead
# with open('/path/to/your/text.txt', 'r', encoding='utf-8') as f:
#     corpus_text = f.read()

corpus_text = corpus_text.lower()
print("Corpus length (chars):", len(corpus_text))
print("\nSample:\n", corpus_text[:400], "...")


Corpus length (chars): 593

Sample:
 
alice was beginning to get very tired of sitting by her sister on the bank,
and of having nothing to do: once or twice she had peeped into the book her sister was reading,
but it had no pictures or conversations in it, 'and what is the use of a book,' thought alice 'without pictures or conversation?'
so she was considering in her own mind (as well as she could, for the hot day made her feel very  ...



## 2) Tokenize & Create Sequences

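We build one training example per growing prefix of the corpus: the model sees every word of the prefix except the last and learns to predict that last word. Prefixes are left-padded with zeros so they all share the same length.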


In [None]:

# Tokenize
tokenizer = Tokenizer(oov_token="<oov>")
tokenizer.fit_on_texts([corpus_text])
word_index = tokenizer.word_index
vocab_size = len(word_index) + 1  # +1 for padding
print("Vocab size:", vocab_size)

# Convert text to token list
tokens = tokenizer.texts_to_sequences([corpus_text])[0]

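# Build input-output sequences from growing prefixes of the corpus.
# After the X/y split below, tokens[:i] gives input tokens[:i-1] and label tokens[i-1]:
# [w1] -> w2 ; [w1, w2] -> w3 ; etc.
# Note: range() stops at len(tokens) - 1, so the final token is never used as a label.
sequences = [tokens[:i] for i in range(2, len(tokens))]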

max_len = max(len(s) for s in sequences)

# Pad sequences and split into X (inputs) and y (labels = last token)
padded = pad_sequences(sequences, maxlen=max_len, padding='pre')
X, y = padded[:, :-1], padded[:, -1]
y = to_categorical(y, num_classes=vocab_size)

print("Number of sequences:", len(sequences))
print("Max sequence length:", max_len)
X.shape, y.shape


Vocab size: 81
Number of sequences: 113
Max sequence length: 114


((113, 113), (113, 81))
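
To make the shapes above concrete, here is a minimal pure-Python sketch of the same prefix-and-pad construction on a toy sentence. The ids in `toy_tokens` and the helper `left_pad` are hypothetical, introduced only for illustration; `left_pad` imitates what `pad_sequences(..., padding='pre')` does for sequences that are shorter than the target length.

```python
# Hypothetical word ids for a five-word sentence (0 is reserved for padding).
toy_tokens = [5, 2, 9, 4, 7]

# Same prefix construction as the notebook cell: tokens[:2], tokens[:3], ...
toy_sequences = [toy_tokens[:i] for i in range(2, len(toy_tokens))]
toy_max_len = max(len(s) for s in toy_sequences)


def left_pad(seq, length, pad_id=0):
    """Left-pad seq with pad_id up to length (imitates padding='pre')."""
    return [pad_id] * (length - len(seq)) + seq


toy_padded = [left_pad(s, toy_max_len) for s in toy_sequences]

# Inputs are all columns but the last; the label is the last column.
toy_X = [row[:-1] for row in toy_padded]
toy_y = [row[-1] for row in toy_padded]

for x_row, label in zip(toy_X, toy_y):
    print(x_row, "->", label)
# [0, 0, 5] -> 2
# [0, 5, 2] -> 9
# [5, 2, 9] -> 4
```

Note that the last id (`7`) never shows up as a label, because the loop stops one prefix short of the full token list. The notebook cell behaves the same way.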


## 3) Build the LSTM Model

A simple Embedding → LSTM → Dense softmax architecture.
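
To get a feel for the size of this architecture, the trainable parameter counts can be worked out by hand. The sketch below assumes the vocabulary size of 81 reported earlier, the hyperparameters used in the next cell (100-dimensional embeddings, 128 LSTM units), and standard layers with biases enabled. The function and variable names are illustrative, not part of Keras.

```python
def embedding_params(vocab, dim):
    # One dim-sized vector per vocabulary entry.
    return vocab * dim


def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias.
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim, output_dim):
    # Weight matrix plus one bias per output unit.
    return input_dim * output_dim + output_dim


example_vocab = 81
example_embedding_dim = 100
example_lstm_units = 128

n_embedding = embedding_params(example_vocab, example_embedding_dim)
n_lstm = lstm_params(example_embedding_dim, example_lstm_units)
n_dense = dense_params(example_lstm_units, example_vocab)
n_total = n_embedding + n_lstm + n_dense

print("Embedding:", n_embedding)  # 8100
print("LSTM     :", n_lstm)       # 117248
print("Dense    :", n_dense)      # 10449
print("Total    :", n_total)      # 135797
```

The LSTM layer dominates the count, and the whole model has far more parameters than the 113 training examples available, which is one reason a larger corpus matters.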


In [None]:

embedding_dim = 100
lstm_units = 128
dropout_rate = 0.2

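model = Sequential([
    # Declare the input shape explicitly; Embedding's `input_length` argument is deprecated in Keras 3
    tf.keras.Input(shape=(max_len - 1,)),
    Embedding(input_dim=vocab_size, output_dim=embedding_dim),
    LSTM(lstm_units, return_sequences=False),
    Dropout(dropout_rate),
    Dense(vocab_size, activation='softmax')
])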

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()





## 4) Train

Twenty epochs on a corpus this small is only a smoke test. Increase `epochs` and provide a larger corpus for better results.
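
As a reference point for reading the training log, it helps to know the loss of a model that guesses uniformly. With categorical cross-entropy (natural log) and the 81-entry vocabulary reported earlier, that baseline is ln(81). The helper name below is illustrative.

```python
import math


def uniform_baseline_loss(num_classes):
    """Cross-entropy (in nats) of a uniform guess over num_classes."""
    return -math.log(1.0 / num_classes)


baseline = uniform_baseline_loss(81)
print(f"Uniform-guess loss for 81 classes: {baseline:.4f}")
# Uniform-guess loss for 81 classes: 4.3944
```

A loss close to this value means the model is still near random guessing. Progress shows up as the loss dropping clearly below it.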


In [None]:

epochs = 20
batch_size = 64

history = model.fit(X, y, epochs=epochs, batch_size=batch_size, verbose=1)


Epoch 1/20
[1m2/2[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m7s[0m 38ms/step - accuracy: 0.0111 - loss: 4.3930
Epoch 2/20
[1m2/2[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 31ms/step - accuracy: 0.0784 - loss: 4.3844 
Epoch 3/20
[1m2/2[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 32ms/step - accuracy: 0.1014 - loss: 4.3765
Epoch 4/20
[1m2/2[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 31ms/step - accuracy: 0.1014 - loss: 4.3679
Epoch 5/20
[1m2/2[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 30ms/step - accuracy: 0.0798 - loss: 4.3541
Epoch 6/20
[1m2/2[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 32ms/step - accuracy: 0.0739 - loss: 4.3334 
Epoch 7/20
[1m2/2[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 30ms/step - accuracy: 0.0621 - loss: 4.2945
Epoch 8/20
[1m2/2[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 31ms/step - accuracy: 0.0843 - loss: 4.2337
Epoch 9/20
[1m2/2[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m 


## 5) Predict Next Words

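Use temperature sampling to control randomness: values below 1 sharpen the predicted distribution, values above 1 flatten it (more random, more "creative"), and `temperature <= 0` falls back to greedy (argmax) decoding.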
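
The effect of temperature is easiest to see on a small hand-made distribution. The sketch below uses only the standard library and applies the same rescaling as `sample_from_probs` (log, divide by temperature, softmax). The probabilities in `example_probs` and the helper name `apply_temperature` are made up for illustration.

```python
import math


def apply_temperature(probs, temperature):
    """Rescale a distribution to p_i ** (1 / temperature), renormalised."""
    scaled = [math.log(p + 1e-9) / temperature for p in probs]
    peak = max(scaled)  # subtract the max before exp for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


example_probs = [0.6, 0.3, 0.1]  # hypothetical next-word probabilities

rescaled_by_temp = {}
for t in (0.5, 1.0, 2.0):
    rescaled_by_temp[t] = apply_temperature(example_probs, t)
    print(f"T={t}:", [round(p, 3) for p in rescaled_by_temp[t]])
```

At `T=0.5` the most likely word gains probability mass, at `T=1.0` the distribution is unchanged, and at `T=2.0` the three options move closer together.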


In [None]:

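def sample_from_probs(probs, temperature=1.0):
    """Sample a token id from `probs`, rescaled by `temperature`."""
    probs = np.asarray(probs).astype('float64')
    if temperature <= 0:
        # Greedy: always pick the most likely token
        return int(np.argmax(probs))
    # Temperature scaling in log space, then renormalise (softmax)
    logits = np.log(probs + 1e-9) / temperature
    logits -= np.max(logits)  # subtract the max for numerical stability
    scaled = np.exp(logits)
    scaled /= np.sum(scaled)
    return int(np.random.choice(len(scaled), p=scaled))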

# Reverse lookup (id -> word), built once instead of scanning word_index at every step
index_to_word = {idx: w for w, idx in word_index.items()}

def generate_next_words(seed_text, num_words=5, temperature=0.8):
    text = seed_text.lower()
    for _ in range(num_words):
        seq = tokenizer.texts_to_sequences([text])[0]
        seq = pad_sequences([seq], maxlen=max_len-1, padding='pre')
        preds = model.predict(seq, verbose=0)[0]
        next_id = int(sample_from_probs(preds, temperature=temperature))
        next_word = index_to_word.get(next_id)  # None for the padding id 0
        if not next_word or next_word == "<oov>":
            # Fall back to the most likely real word, skipping padding (0) and OOV
            masked = np.array(preds, dtype='float64')
            masked[[0, word_index["<oov>"]]] = -1.0
            next_word = index_to_word[int(np.argmax(masked))]
        text += " " + next_word
    return text

# Quick test after training
seed_text = "alice was beginning"  # separate name so the integer RNG `seed` set earlier is not overwritten
print("Seed:", seed_text)
print("Greedy  :", generate_next_words(seed_text, num_words=8, temperature=0.0))
print("Creative:", generate_next_words(seed_text, num_words=8, temperature=0.9))


Seed: alice was beginning
Greedy  : alice was beginning to to to to to to to to
Creative: alice was beginning to conversation tired ' her her of do



## 6) Tips to Improve
- Use a **much larger corpus** (millions of tokens) for meaningful predictions.
- Increase **epochs** and **model size** (more LSTM units, stacked layers).
- Try **GRU** instead of LSTM for speed.
- For modern state-of-the-art results, consider **Transformers**.
