# Next-Word Prediction with LSTM

This notebook demonstrates a simple next-word prediction pipeline using an LSTM-based language model (TensorFlow / Keras). The project uses a cleaned text file (`metamorphosis_clean.txt`) as the training corpus and shows the full flow: tokenization, sequence creation, padding, model definition, training, and inference (generating the next words).

Sections in this notebook:
- Imports and setup
- Data loading and preprocessing
- Building the LSTM model
- Training and evaluation
- Simple inference loop to generate next words

Note: to train on a different dataset, replace `metamorphosis_clean.txt` with your own plain-text corpus. For reproducible runs, pin TensorFlow to a compatible version (see README).

In [None]:
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
import matplotlib.pyplot as plt

## Data loading and preprocessing

We load the full text file into a single string, then use Keras' Tokenizer to build a word index. The notebook splits the text into lines and builds incremental n-gram sequences from each line (e.g. `[w1, w2] -> w3`, `[w1, w2, w3] -> w4`, ...). Sequences are padded to the same length so they can be batched.

Key preprocessing steps:
- Tokenize the text to integers (word -> index).
- Build input sequences of increasing prefix lengths for next-word prediction.
- Pad sequences to a fixed length with `pad_sequences` (pre-padding).
- Split sequences into `X` (prefixes) and `y` (next word), then one-hot encode `y` with `to_categorical`.
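
To make these steps concrete, here is a dependency-free sketch of the prefix building and pre-padding on a single toy line. The helper names (`build_prefix_sequences`, `pre_pad`), the variable names ending in `_demo`, and the four-word vocabulary are illustrative assumptions and not part of the notebook; the real cells use Keras' `Tokenizer` and `pad_sequences` for the same job.

```python
def build_prefix_sequences(token_ids):
    """Return every prefix of length >= 2; the last id of each prefix is the label."""
    return [token_ids[: i + 1] for i in range(1, len(token_ids))]


def pre_pad(seq, length, pad_id=0):
    """Left-pad with pad_id, mimicking pad_sequences(..., padding='pre')."""
    return [pad_id] * (length - len(seq)) + seq


# Hypothetical toy vocabulary (index 0 is left free for padding)
word_index_demo = {"one": 1, "morning": 2, "gregor": 3, "woke": 4}
line_demo = [word_index_demo[w] for w in "one morning gregor woke".split()]

prefixes_demo = build_prefix_sequences(line_demo)
max_len_demo = max(len(p) for p in prefixes_demo)
padded_demo = [pre_pad(p, max_len_demo) for p in prefixes_demo]

# Everything except the last column is the input; the last column is the label
X_demo = [row[:-1] for row in padded_demo]
y_demo = [row[-1] for row in padded_demo]

for inputs, label in zip(X_demo, y_demo):
    print(inputs, "->", label)
```

Each input row is one token shorter than the padded sequence it came from, which is why the model's input length is the maximum sequence length minus one.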

In [None]:
with open("metamorphosis_clean.txt", "r") as f:
    input_text = f.read()
print(input_text[:500])  # preview only; printing the whole corpus floods the output


In [None]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts([input_text])
tokenizer.word_index

In [None]:
len(tokenizer.word_index)

In [None]:
sequences = []
for line in input_text.split('\n'):
    # Convert one line of text to a list of word indices
    tokenized_line = tokenizer.texts_to_sequences([line])[0]
    # Emit every prefix of length >= 2; the last index in each prefix is the label
    for i in range(1, len(tokenized_line)):
        sequences.append(tokenized_line[:i + 1])


In [None]:
max_len = max([len(x) for x in sequences])
sequences = pad_sequences(sequences, maxlen=max_len,padding='pre')

In [None]:
X = sequences[:,:-1]
y = sequences[:,-1]
y = to_categorical(y,num_classes=len(tokenizer.word_index)+1)

In [None]:
X.shape

In [None]:
y.shape

## Model architecture and hyperparameters

A straightforward LSTM language model is defined using Keras' Sequential API. The notebook uses an Embedding layer to learn dense word vectors, followed by a single LSTM layer and a Dense softmax output over the vocabulary.

Things to consider and tune:
- Embedding size (currently 100).
- LSTM units (currently 200).
- Vocabulary size: use `len(tokenizer.word_index) + 1` (the `+1` accounts for index 0, which is reserved for padding) rather than a hardcoded constant, which only matches one specific corpus.
- Input length: `max_len - 1`, because the last token of each padded sequence is split off as the label. This value is used by the Embedding layer.
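
To see how these hyperparameters drive model size, the parameter count of an Embedding -> LSTM -> Dense stack can be worked out by hand. The sketch below is plain arithmetic under the assumption of default Keras layers with biases enabled; the function name `lstm_lm_param_count` is hypothetical, and the values 2618, 100, and 200 are the ones used in this notebook.

```python
def lstm_lm_param_count(vocab_size, embed_dim, lstm_units):
    """Trainable parameters per layer for Embedding -> LSTM -> Dense(softmax)."""
    embedding = vocab_size * embed_dim
    # Four gates, each with input weights, recurrent weights, and a bias vector
    lstm = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)
    # One weight per (LSTM unit, word) pair plus one bias per word
    dense = lstm_units * vocab_size + vocab_size
    return {
        "embedding": embedding,
        "lstm": lstm,
        "dense": dense,
        "total": embedding + lstm + dense,
    }


counts = lstm_lm_param_count(vocab_size=2618, embed_dim=100, lstm_units=200)
print(counts)
```

With these values the output layer accounts for roughly half of all parameters, so shrinking the vocabulary reduces model size more than shrinking the LSTM. The totals should agree with `model.summary()`.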

## Training and notes

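The model is compiled with categorical cross-entropy loss and the Adam optimizer, then trained for 50 epochs with a batch size of 64 and `validation_split=0.2`. Keras takes the validation samples from the end of `X` and `y` before shuffling, so the held-out 20% here is the last 20% of the generated sequences. For larger corpora or better performance, consider adding callbacks (ModelCheckpoint, EarlyStopping) and experimenting with learning rates and optimizers.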

Training tips:
- Save the model weights at the epoch with the best validation metric (highest validation accuracy or lowest validation loss).
- Use batch sizes appropriate for your GPU/CPU memory.
- For faster training, reduce the vocabulary (filter rare words) or use subword tokenization.
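
The callback suggestion above could look roughly like the following. This is a minimal sketch, not the notebook's actual training setup: the helper `make_callbacks`, the file name `best_model.keras`, and the patience of 5 epochs are assumptions chosen for illustration.

```python
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint


def make_callbacks(checkpoint_path="best_model.keras", patience=5):
    """Hypothetical helper: stop early and keep the best checkpoint by val_loss."""
    # Stop once val_loss has not improved for `patience` epochs,
    # then roll the model back to its best weights
    early_stop = EarlyStopping(
        monitor="val_loss", patience=patience, restore_best_weights=True
    )
    # Overwrite the checkpoint file only when val_loss improves
    checkpoint = ModelCheckpoint(
        checkpoint_path, monitor="val_loss", save_best_only=True
    )
    return [early_stop, checkpoint]


callbacks = make_callbacks()
# Usage (not run here):
# model.fit(X, y, epochs=50, batch_size=64, validation_split=0.2, callbacks=callbacks)
```

Both callbacks monitor `val_loss`, so they only make sense when `fit` is given validation data.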

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Embedding, Dense

In [None]:
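vocab_size = len(tokenizer.word_index) + 1  # +1: index 0 is reserved for padding
model = Sequential()
# Inputs are prefixes of length max_len - 1 (the last token of each sequence is the label)
model.add(Embedding(vocab_size, 100, input_length=max_len - 1))
model.add(LSTM(200))
model.add(Dense(vocab_size, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])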

In [None]:
model.summary()

In [None]:
history = model.fit(
    X, y,
    epochs=50,
    batch_size=64,
    validation_split=0.2
)


## Inference / Generating text

A small example loop below demonstrates how to seed the model with a short phrase and iteratively predict the next word. Notes:
- The notebook uses argmax on the softmax output which picks the single most likely word; sampling from the distribution (temperature sampling) can produce more varied, creative outputs.
- Make sure to preprocess the seed text the same way as training (tokenization and padding).
- Save and reload both the trained model and the tokenizer for reproducible inference outside the notebook.
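
As an alternative to argmax, temperature sampling can be sketched with NumPy alone. The function name `sample_with_temperature` and its `rng` parameter are assumptions for illustration; it expects one row of softmax probabilities, such as a single row of `model.predict(...)`.

```python
import numpy as np


def sample_with_temperature(probs, temperature=1.0, rng=None):
    """Sample an index from `probs` after rescaling by `temperature`.

    temperature < 1 sharpens the distribution (closer to argmax);
    temperature > 1 flattens it (more varied output).
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    rng = np.random.default_rng() if rng is None else rng
    probs = np.asarray(probs, dtype=np.float64)
    # Work in log space; the small constant avoids log(0)
    logits = np.log(probs + 1e-12) / temperature
    logits -= logits.max()  # numerical stability before exponentiating
    scaled = np.exp(logits)
    scaled /= scaled.sum()
    return int(rng.choice(len(scaled), p=scaled))


demo_probs = [0.1, 0.7, 0.2]
rng = np.random.default_rng(0)
near_greedy = sample_with_temperature(demo_probs, temperature=0.01, rng=rng)
varied = [sample_with_temperature(demo_probs, temperature=1.5, rng=rng) for _ in range(10)]
print(near_greedy, varied)
```

In the generation loop, the argmax call could be swapped for something like `sample_with_temperature(model.predict(padded_seq, verbose=0)[0], temperature=0.8)`. A sampled index of 0 corresponds to padding and has no word, so it should be skipped or resampled.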

In [None]:
text = "random setting of the house"
for _ in range(5):
    token_text = tokenizer.texts_to_sequences([text])
    # The model was trained on prefixes of length max_len - 1, so pad to that length
    padded_seq = pad_sequences(token_text, maxlen=max_len - 1, padding='pre')
    pos = int(np.argmax(model.predict(padded_seq, verbose=0), axis=-1)[0])
    # Reverse lookup (index -> word); index 0 is padding and has no word
    word = tokenizer.index_word.get(pos)
    if word is None:
        break
    text = text + ' ' + word
    print(text)


In [None]:
plt.plot(history.history['loss'], label='Training loss')
plt.plot(history.history['val_loss'], label='Validation loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()

## Conclusion

The training loss stays low while validation loss rises. In simple terms: the model learns the training examples well but does worse on unseen data (overfitting).

Here is how we can fix this:
* Shuffle and split the data properly (keep a separate test set).
* Use EarlyStopping and save the best model (ModelCheckpoint).
* Regularize or reduce model size, and check example predictions by hand.
* Gather **more training data** for better generalization.
* Try **hyperparameter tuning** (embedding size, LSTM units, learning rate, dropout).
* Explore **advanced architectures** such as stacked/bidirectional LSTMs or Transformer-based models.
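
The first suggestion, shuffling before splitting, can be sketched with NumPy. `validation_split` holds out the tail of the arrays without shuffling, so an explicit shuffled split gives a more representative validation set. The function name `shuffled_split`, the seed, and the demo arrays are assumptions for illustration.

```python
import numpy as np


def shuffled_split(X, y, val_fraction=0.2, seed=42):
    """Shuffle X and y with the same permutation, then split off a validation set."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_val = int(len(X) * val_fraction)
    val_idx, train_idx = order[:n_val], order[n_val:]
    return X[train_idx], y[train_idx], X[val_idx], y[val_idx]


# Toy arrays in which row i of X_toy is [2*i, 2*i + 1] and y_toy[i] is i
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)
X_train, y_train, X_val, y_val = shuffled_split(X_toy, y_toy)
print(X_train.shape, X_val.shape)
```

The result could be passed to `model.fit(X_train, y_train, validation_data=(X_val, y_val), ...)` in place of `validation_split`. Prefixes built from the same line overlap heavily, so splitting the lines before generating sequences would give a stricter estimate of generalization.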

