# Next Word Prediction Using LSTM
#### This notebook demonstrates how to build and train an LSTM model for next word prediction using Shakespeare's "Hamlet" as the dataset.

## Table of Contents
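1. [Data Collection](#data-collection)
2. [Data Preprocessing](#data-preprocessing)
3. [Creating Input Sequences](#creating-input-sequences)
4. [Preparing Predictors and Labels](#preparing-predictors-and-labels)
5. [Building the LSTM Model](#building-the-lstm-model)
6. [Training the Model](#training-the-model)
7. [Evaluating the Model](#evaluating-the-model)
8. [Saving the Model and Tokenizer](#saving-the-model-and-tokenizer)
9. [Testing the Model](#testing-the-model)
10. [Conclusion](#conclusion)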

## 1. Data Collection <a id="data-collection"></a>
- We will use the NLTK Gutenberg corpus to obtain the text of Shakespeare's "Hamlet".

In [1]:
import nltk
nltk.download('gutenberg')
from nltk.corpus import gutenberg

# Load the dataset
data = gutenberg.raw('shakespeare-hamlet.txt')

# Save the data to a text file
with open('hamlet.txt', 'w') as file:
    file.write(data)

[nltk_data] Downloading package gutenberg to
[nltk_data]     /Users/vineethsai/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


## 2. Data Preprocessing <a id="data-preprocessing"></a>
- We need to preprocess the text data to make it suitable for training our model.

In [2]:
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Load the dataset
with open('hamlet.txt', 'r') as file:
    text = file.read().lower()


- Tokenization: We tokenize the text to create indices for each word.

In [3]:
# Initialize the tokenizer and fit on the text
tokenizer = Tokenizer()
tokenizer.fit_on_texts([text])

# Total number of words in the vocabulary
total_words = len(tokenizer.word_index) + 1
print(f"Total words: {total_words}")


Total words: 4818
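
To make the `+ 1` in `total_words` concrete, here is a minimal pure-Python sketch of frequency-ranked word indexing. It is a simplified stand-in, not the Keras `Tokenizer` itself: the helper `build_word_index` is hypothetical, it splits on whitespace only, and it does not filter punctuation. The point it illustrates is that indices start at 1, so index 0 stays free for padding.

```python
from collections import Counter


def build_word_index(text):
    """Map each word to an integer rank by descending frequency, starting at 1."""
    counts = Counter(text.lower().split())
    # most_common() sorts by count; ties keep their order of first appearance
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}


toy_index = build_word_index("to be or not to be")
# Index 0 is never assigned to a word, so the vocabulary size is one larger
toy_total_words = len(toy_index) + 1
print(toy_index, toy_total_words)
```

The same reasoning explains why the output layer later needs `total_words` units rather than `len(tokenizer.word_index)`.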


## 3. Creating Input Sequences <a id="creating-input-sequences"></a>
- We create input sequences to train the model to predict the next word in a sequence.

In [4]:
# Create input sequences using the tokens
input_sequences = []
for line in text.split('\n'):
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)

print(f"Total input sequences: {len(input_sequences)}")


Total input sequences: 25732
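
The loop above turns each line into a set of growing prefixes. As a small illustration on made-up token indices (the helper name `ngram_prefixes` is an assumption for this sketch), a line of `n` tokens yields `n - 1` training sequences:

```python
def ngram_prefixes(token_list):
    """Return every prefix of token_list that has at least two tokens."""
    return [token_list[:i + 1] for i in range(1, len(token_list))]


toy_tokens = [1, 2, 3, 4]  # made-up indices for a four-word line
toy_sequences = ngram_prefixes(toy_tokens)
for sequence in toy_sequences:
    print(sequence)
```

In each prefix the last token is the word to predict and the tokens before it are the context, which is why lines with fewer than two tokens contribute nothing.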


- Padding Sequences: Pad the sequences to ensure uniform length.

In [5]:
# Determine the maximum sequence length
max_sequence_len = max([len(x) for x in input_sequences])
print(f"Maximum sequence length: {max_sequence_len}")

# Pad sequences
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))


Maximum sequence length: 14
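
As a rough picture of what pre-padding does, the sketch below reimplements it for a single sequence in plain Python. The helper `pre_pad` is hypothetical and assumes `maxlen > 0`; it is meant to mirror `pad_sequences(..., padding='pre')` with its default front truncation, not to replace it.

```python
def pre_pad(sequence, maxlen, pad_value=0):
    """Left-pad sequence with pad_value to maxlen, keeping only the last maxlen tokens."""
    trimmed = sequence[-maxlen:]  # drop tokens from the front if too long
    return [pad_value] * (maxlen - len(trimmed)) + trimmed


short_padded = pre_pad([1, 2], maxlen=4)
long_padded = pre_pad([1, 2, 3, 4, 5], maxlen=4)
print(short_padded, long_padded)
```

Padding on the left keeps the target word in the last column of every row, which the next section relies on when slicing off the label.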


## 4. Preparing Predictors and Labels <a id="preparing-predictors-and-labels"></a>
- We split the data into predictors (X) and labels (y).

In [6]:
from sklearn.model_selection import train_test_split
import tensorflow as tf

# Split into predictors and label
X, y = input_sequences[:, :-1], input_sequences[:, -1]

# One-hot encode the labels
y = tf.keras.utils.to_categorical(y, num_classes=total_words)
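
The slicing and one-hot step can be sketched on two toy padded rows without TensorFlow. The function name `split_and_one_hot` and the toy values are assumptions for illustration; the real notebook uses NumPy slicing and `to_categorical` as shown above.

```python
def split_and_one_hot(padded_rows, num_classes):
    """Split each row into (context, one-hot label) using the last token as the label."""
    features = [row[:-1] for row in padded_rows]
    labels = []
    for row in padded_rows:
        one_hot = [0.0] * num_classes
        one_hot[row[-1]] = 1.0  # mark the position of the target word index
        labels.append(one_hot)
    return features, labels


toy_rows = [[0, 0, 1, 2], [0, 1, 2, 3]]
toy_features, toy_labels = split_and_one_hot(toy_rows, num_classes=4)
print(toy_features)
print(toy_labels)
```

Each label vector has one entry per vocabulary index, so its length must equal `total_words`.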


- Training and Testing Split: Divide the data into training and testing sets.

In [7]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
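
The split sizes can be checked with a little arithmetic. The sketch below assumes the test share is rounded up and that training later uses the Keras default batch size of 32, since `model.fit` is called without `batch_size`. Under those assumptions the step count works out to 644, which matches the `644/644` counter in the training log.

```python
import math

n_sequences = 25732  # "Total input sequences" reported earlier
test_size = 0.2
batch_size = 32      # assumed: Keras default when batch_size is not passed

n_test = math.ceil(test_size * n_sequences)
n_train = n_sequences - n_test
steps_per_epoch = math.ceil(n_train / batch_size)
print(n_train, n_test, steps_per_epoch)
```

Because no `random_state` is passed to `train_test_split`, the rows that land in each set change from run to run; passing a fixed `random_state` would make the split reproducible.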

## 5. Building the LSTM Model <a id="building-the-lstm-model"></a>
- We define an LSTM model to predict the next word in a sequence.

In [8]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout

# Define the model architecture
model = Sequential()
model.add(Embedding(total_words, 100, input_length=max_sequence_len - 1))
model.add(LSTM(150, return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(100))
model.add(Dense(total_words, activation='softmax'))

# Compile the model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
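
The output of `model.summary()` is not reproduced here, so the following sketch estimates the parameter counts by hand. It uses the standard formulas for embedding, LSTM (four gates, with bias), and dense layers; the helper names are hypothetical and the figures are computed, not copied from the notebook's output.

```python
def embedding_params(vocab_size, embed_dim):
    """One embed_dim-sized vector per vocabulary index."""
    return vocab_size * embed_dim


def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """Weight matrix plus one bias per output unit."""
    return input_dim * units + units


vocab_size = 4818  # total_words reported earlier
layer_params = {
    "embedding": embedding_params(vocab_size, 100),
    "lstm_150": lstm_params(100, 150),
    "lstm_100": lstm_params(150, 100),
    "dense": dense_params(100, vocab_size),
}
total_params = sum(layer_params.values())  # Dropout adds no parameters
print(layer_params, total_params)
```

By this estimate the embedding and output layers hold most of the weights, because both scale with the vocabulary size.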




## 6. Training the Model <a id="training-the-model"></a>
- We train the model on the prepared data.

In [9]:
# Train the model
history = model.fit(X_train, y_train, epochs=50, validation_data=(X_test, y_test), verbose=1)


Epoch 1/50
644/644 ━━━━━━━━━━━━━━━━━━━━ 13s 19ms/step - accuracy: 0.0324 - loss: 7.1518 - val_accuracy: 0.0334 - val_loss: 6.7421
Epoch 2/50
644/644 ━━━━━━━━━━━━━━━━━━━━ 12s 18ms/step - accuracy: 0.0358 - loss: 6.4584 - val_accuracy: 0.0418 - val_loss: 6.8131
Epoch 3/50
644/644 ━━━━━━━━━━━━━━━━━━━━ 12s 19ms/step - accuracy: 0.0451 - loss: 6.3135 - val_accuracy: 0.0521 - val_loss: 6.8497
Epoch 4/50
644/644 ━━━━━━━━━━━━━━━━━━━━ 12s 19ms/step - accuracy: 0.0501 - loss: 6.1602 - val_accuracy: 0.0519 - val_loss: 6.8836
Epoch 5/50
644/644 ━━━━━━━━━━━━━━━━━━━━ 12s 18ms/step - accuracy: 0.0526 - loss: 6.0379 - val_accuracy: 0.0540 - val_loss: 6.9345
Epoch 6/50
644/644 ━━━━━━━━━━━━━━━━━━━━ 12s 19ms/step - accuracy: 0.0638 - loss: 5.8746 - val_accuracy: 0.0604 - val_loss: 6.9660
Epoch 7/50
...
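
In the logged epochs, training loss falls while validation loss rises after the first epoch, a pattern usually read as overfitting. One common response is early stopping. The sketch below replays the six logged `val_loss` values through a simple patience rule in plain Python; the function name and the patience of 3 are illustrative assumptions, and in Keras the `EarlyStopping` callback is the usual way to get this behaviour.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch), 1-based; stop_epoch is None if never triggered."""
    best_loss = float("inf")
    best_epoch = None
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return None, best_epoch


# Validation losses for epochs 1 to 6, copied from the training log
logged_val_loss = [6.7421, 6.8131, 6.8497, 6.8836, 6.9345, 6.9660]
stop_epoch, best_epoch = early_stopping_epoch(logged_val_loss, patience=3)
print(stop_epoch, best_epoch)
```

On these values the rule would halt at epoch 4 and treat epoch 1 as the best checkpoint, well short of the 50 epochs requested.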

## 7. Evaluating the Model <a id="evaluating-the-model"></a>
- We define a function to predict the next word given an input sequence.

In [10]:
# Function to predict the next word
def predict_next_word(model, tokenizer, text, max_sequence_len):
    # Convert the input text to a list of token indices
    token_list = tokenizer.texts_to_sequences([text])[0]
    # Pad on the left; pad_sequences also truncates from the front if the input is too long
    token_list = pad_sequences([token_list], maxlen=max_sequence_len - 1, padding='pre')
    predicted = model.predict(token_list, verbose=0)
    # Take the scalar index of the most probable word for the single input row
    predicted_word_index = int(np.argmax(predicted, axis=1)[0])
    # index_word is the tokenizer's reverse mapping; index 0 (padding) has no word, so None is returned
    return tokenizer.index_word.get(predicted_word_index)
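
Returning only the single most probable word hides how confident the model is. A possible extension is to list the top few candidates with their probabilities. The sketch below works on a plain list of probabilities and a toy reverse mapping; `top_k_words`, the toy vocabulary, and the probability values are all assumptions for illustration.

```python
def top_k_words(probabilities, index_word, k=3):
    """Return up to k (word, probability) pairs, most probable first."""
    ranked = sorted(range(len(probabilities)), key=lambda i: probabilities[i], reverse=True)
    # Skip indices with no word attached, such as the padding index 0
    candidates = [i for i in ranked if i in index_word]
    return [(index_word[i], probabilities[i]) for i in candidates[:k]]


toy_index_word = {1: "to", 2: "be", 3: "or", 4: "not"}
toy_probabilities = [0.05, 0.10, 0.50, 0.05, 0.30]  # position = word index
top_words = top_k_words(toy_probabilities, toy_index_word, k=2)
print(top_words)
```

With the trained model, one row of `model.predict(...)` could be passed as `probabilities` and the tokenizer's `index_word` dictionary as the mapping.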


## 8. Saving the Model and Tokenizer <a id="saving-the-model-and-tokenizer"></a>
- We save the trained model and tokenizer for future use.

In [11]:
# Save the model
model.save('next_word_lstm.h5')

# Save the tokenizer
import pickle
with open('tokenizer.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)
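
Reading the tokenizer back follows the same pattern in reverse. The sketch below shows the pickle round trip with a plain dictionary standing in for the tokenizer, written to a temporary directory so it does not touch the notebook's files; the stand-in data is an assumption for illustration.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted tokenizer; any picklable object behaves the same way
toy_word_index = {"to": 1, "be": 2, "or": 3, "not": 4}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tokenizer.pickle")
    with open(path, "wb") as handle:
        pickle.dump(toy_word_index, handle, protocol=pickle.HIGHEST_PROTOCOL)
    with open(path, "rb") as handle:
        restored = pickle.load(handle)

print(restored == toy_word_index)
```

The saved model would typically be reloaded with `tensorflow.keras.models.load_model('next_word_lstm.h5')`. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.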




## 9. Testing the Model <a id="testing-the-model"></a>
- We test the model with some input sequences.

In [12]:
# First test input
input_text = "To be or not to be"
print(f"Input text: {input_text}")
max_sequence_len = model.input_shape[1] + 1
next_word = predict_next_word(model, tokenizer, input_text, max_sequence_len)
print(f"Next word prediction: {next_word}")


Input text: To be or not to be
Next word prediction: our


In [13]:
# Second test input
input_text = "Barn. Last night of all, When yond same"
print(f"\nInput text: {input_text}")
next_word = predict_next_word(model, tokenizer, input_text, max_sequence_len)
print(f"Next word prediction: {next_word}")



Input text: Barn. Last night of all, When yond same
Next word prediction: queen
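
Single-word predictions can be chained to generate a longer continuation by feeding each predicted word back into the input. The sketch below shows that greedy loop with a toy lookup table in place of the trained model; `generate_words`, `toy_predict`, and the bigram table are hypothetical names introduced for this example.

```python
def generate_words(seed_text, predict_fn, n_words):
    """Append up to n_words predictions to seed_text, one word at a time."""
    words = seed_text.split()
    for _ in range(n_words):
        next_word = predict_fn(" ".join(words))
        if next_word is None:  # stop if the predictor has no answer
            break
        words.append(next_word)
    return " ".join(words)


# Toy stand-in for the model: predicts from the last word only
toy_bigrams = {"to": "be", "be": "or", "or": "not", "not": "to"}


def toy_predict(text):
    return toy_bigrams.get(text.split()[-1].lower())


generated = generate_words("To", toy_predict, 5)
print(generated)
```

With the trained model, `predict_fn` could be `lambda t: predict_next_word(model, tokenizer, t, max_sequence_len)`.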


## 10. Conclusion <a id="conclusion"></a>
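- In this notebook, we built and trained an LSTM model that predicts the next word in a sequence, using Shakespeare's "Hamlet" as the training text. In the epochs logged above, validation accuracy reaches only about 6% and validation loss rises from 6.74 to 6.97 while training loss falls, which suggests the model is overfitting. The model can be further improved by adding early stopping or stronger regularization, adjusting hyperparameters, expanding the dataset, or experimenting with different model architectures.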