# Next Word Prediction Model using Python

Next word prediction means predicting the most likely word or phrase to come next in a sentence or text, much like the built-in feature that suggests the next word as you type or speak. Next word prediction models are used in messaging apps, search engines, virtual assistants, and the autocorrect features on smartphones.

Next word prediction is a language modelling task in machine learning that aims to predict the most probable word or sequence of words following a given input context. The model learns statistical patterns and linguistic structure from a body of text and uses them to rank candidate words for the context provided.
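To make the idea of "statistical patterns" concrete, here is a minimal sketch of the simplest possible next word predictor: a bigram counter that predicts whichever word most often followed the current one. This is not the LSTM approach built in this notebook; the function names and the toy corpus are made up for illustration.

```python
from collections import Counter, defaultdict

def train_bigram_model(text):
    """Count, for every word, which words follow it and how often."""
    words = text.lower().split()
    followers = defaultdict(Counter)
    for current_word, next_word in zip(words, words[1:]):
        followers[current_word][next_word] += 1
    return followers

def predict_next(followers, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    counts = followers.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Hypothetical toy corpus, only for demonstration
toy_corpus = "the dog barks . the dog sleeps . the dog barks loudly . the cat sleeps"
bigram_model = train_bigram_model(toy_corpus)

print(predict_next(bigram_model, "the"))  # "dog" follows "the" three times, "cat" once
print(predict_next(bigram_model, "dog"))  # "barks" follows "dog" twice, "sleeps" once
```

A bigram model only looks one word back. The LSTM used below is meant to take a longer context into account.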


Steps to build a Next Word Prediction Model:

1. Start by collecting a diverse dataset of text documents.
2. Preprocess the data by cleaning and tokenizing it.
3. Prepare the data by creating input-output pairs.
4. Engineer features such as word embeddings.
5. Select an appropriate model like an LSTM or GPT.
6. Train the model on the dataset while adjusting hyperparameters.
7. Improve the model by experimenting with different techniques and architectures.

Dataset source - https://sherlock-holm.es/ascii/

In [5]:
# Importing necessary libraries

import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer # Converts words into numbers, which the model can understand
from tensorflow.keras.preprocessing.sequence import pad_sequences # Ensures all sentences have the same length by adding extra padding if needed
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

# Reading the text file
with open('book.txt','r',encoding = 'utf-8') as file:
  text = file.read()

In [6]:
# Tokenizing the text to create a sequence of words

tokenizer = Tokenizer()
tokenizer.fit_on_texts([text])
total_words = len(tokenizer.word_index) + 1

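The fit_on_texts method analyzes the text and builds a vocabulary of unique words, assigning each word a numerical index. Indices start at 1 because 0 is reserved for padding, which is why total_words is the vocabulary size plus 1.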
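As a rough illustration of what building a vocabulary involves, the sketch below does something similar in plain Python: it lowercases the text, strips punctuation, counts words, and numbers them from 1 with the most frequent word first. It is a simplification under those assumptions, not the Keras implementation, and build_word_index is a hypothetical helper name.

```python
from collections import Counter
import string

def build_word_index(texts):
    """Map each unique word to an integer, most frequent word first.

    Index 0 is deliberately left unused so it can serve as the padding value.
    """
    strip_punct = str.maketrans("", "", string.punctuation)
    counts = Counter()
    for text in texts:
        counts.update(text.lower().translate(strip_punct).split())
    # most_common() sorts by descending count; ties keep first-seen order
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

toy_index = build_word_index(["The dog saw the cat.", "The cat ran."])
toy_total_words = len(toy_index) + 1  # +1 accounts for the reserved index 0

print(toy_index)
print(toy_total_words)
```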

In [7]:
total_words

8200

In [8]:
# Creating input-output pairs: each line is converted to token IDs and expanded into n-gram prefixes (subsequences),
# so that the model can learn how words in a sentence are related and predict the next word.

input_sequences = []
for line in text.split('\n'):
  token_list = tokenizer.texts_to_sequences([line])[0]
  for i in range(1, len(token_list)):
    n_gram_sequence = token_list[:i+1]
    input_sequences.append(n_gram_sequence)


# Example, line - "I love coding"

# [
#   [1, 2],      # "I love"
#   [1, 2, 3]    # "I love coding"
# ]

Each n-gram sequence holds the input context followed by its target: every token except the last is the context, and the last token is the word to predict. These input-output sequences will be used for training the next word prediction model.
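The prefix expansion and the context/target split can be sketched in plain Python, using the token IDs from the "I love coding" example. The helper names are hypothetical and exist only for this illustration.

```python
def make_prefix_sequences(token_list):
    """Return every prefix of length >= 2, mirroring the loop in the cell above."""
    return [token_list[:i + 1] for i in range(1, len(token_list))]

def split_context_target(sequence):
    """All tokens but the last form the context; the last token is the target."""
    return sequence[:-1], sequence[-1]

prefixes = make_prefix_sequences([1, 2, 3])          # "I love coding"
pairs = [split_context_target(seq) for seq in prefixes]

print(prefixes)
print(pairs)
```

A line with a single token produces no prefixes at all, since there is no following word to predict.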

In [9]:
# Padding (adding zeros to) the input sequences so that they all have equal length

max_sequence_len = max(len(seq) for seq in input_sequences) # Length of the longest sequence, used as the target length for padding
input_sequences = np.array(pad_sequences(input_sequences, maxlen = max_sequence_len, padding = 'pre'))
# padding = 'pre' adds zeros to the beginning of each sequence until it reaches maxlen,
# so the most recent words (the ones that matter most for prediction) stay at the end of the sequence.

In [10]:
max_sequence_len

18
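For intuition, pre-padding can be written in a few lines of plain Python. This is a sketch of the behaviour described above, not the Keras function itself; pad_pre is a hypothetical name, and it assumes maxlen is at least 1.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad each sequence with `value` up to `maxlen`.

    Sequences longer than `maxlen` keep only their last `maxlen` tokens,
    so the most recent words are never dropped.
    """
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

toy_padded = pad_pre([[1, 2], [1, 2, 3]], maxlen=4)
print(toy_padded)
```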

In [11]:
input_sequences

array([[   0,    0,    0, ...,    0,    1, 1561],
       [   0,    0,    0, ...,    1, 1561,    5],
       [   0,    0,    0, ..., 1561,    5,  129],
       ...,
       [   0,    0,    0, ...,    1, 8198, 8199],
       [   0,    0,    0, ..., 8198, 8199, 3187],
       [   0,    0,    0, ..., 8199, 3187, 3186]])

In [12]:
# Splitting the sequences into input and output
X = input_sequences[:,:-1]
y = input_sequences[:,-1]

In [13]:
# Converting the output to one-hot encoded vectors
y = np.array(tf.keras.utils.to_categorical(y, num_classes = total_words))

In [14]:
total_words

8200

In [15]:
y

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
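One-hot encoding turns each target index into a vector that is zero everywhere except at that index. The following plain-Python sketch shows the idea on a tiny five-word vocabulary; one_hot is a hypothetical helper, not the Keras utility used above.

```python
def one_hot(index, num_classes):
    """Return a list of zeros with a single 1.0 at position `index`."""
    vector = [0.0] * num_classes
    vector[index] = 1.0
    return vector

toy_labels = [2, 0, 4]
toy_encoded = [one_hot(label, num_classes=5) for label in toy_labels]
# The original label is simply the position of the 1.0
toy_decoded = [row.index(1.0) for row in toy_encoded]

print(toy_encoded)
print(toy_decoded)
```

With the real vocabulary, every row of y has one entry per word, which is why the printed array above is almost entirely zeros.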

In [16]:
model = Sequential()
model.add(Embedding(total_words, 100))
# Embedding layer converts each word index into a vector of size 100
# (input_length is deprecated in Keras 3; the input shape is set by model.build() below)
model.add(LSTM(150))  # 150 = number of LSTM units (size of the hidden state)
model.add(Dense(total_words, activation='softmax'))
# Softmax ensures the outputs are probabilities (all add up to 1)
model.build(input_shape=(None, max_sequence_len - 1))
# Tells the model what shape to expect as input
model.summary()
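As a sanity check on the architecture, the number of trainable parameters per layer can be worked out by hand. The sketch below assumes the standard formulas for these layer types with biases enabled (the Keras defaults) and the vocabulary size of 8200 reported earlier; the helper functions are hypothetical.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per vocabulary entry
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias vector
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, output_dim):
    # Weight matrix plus one bias per output
    return input_dim * output_dim + output_dim

vocab_size, embed_dim, lstm_units = 8200, 100, 150
param_counts = {
    "embedding": embedding_params(vocab_size, embed_dim),
    "lstm": lstm_params(embed_dim, lstm_units),
    "dense": dense_params(lstm_units, vocab_size),
}
total_params = sum(param_counts.values())

print(param_counts)
print(total_params)
```

Under these assumptions most of the parameters sit in the embedding and output layers, both of which scale with the vocabulary size.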



In [18]:
# Compiling and training the model

model.compile(loss = 'categorical_crossentropy',optimizer = 'adam', metrics = ['accuracy'])
# categorical_crossentropy -> Measures the difference between the predicted
# probabilities and the actual target (the correct next word).
# adam -> Adaptive Moment Estimation, adjusts learning rates automatically
model.fit(X, y, epochs = 100, verbose = 1)

Epoch 1/100
3010/3010 ━━━━━━━━━━━━━━━━━━━━ 55s 18ms/step - accuracy: 0.0614 - loss: 6.5567
Epoch 2/100
3010/3010 ━━━━━━━━━━━━━━━━━━━━ 55s 18ms/step - accuracy: 0.1180 - loss: 5.5524
Epoch 3/100
3010/3010 ━━━━━━━━━━━━━━━━━━━━ 54s 18ms/step - accuracy: 0.1432 - loss: 5.1339
Epoch 4/100
3010/3010 ━━━━━━━━━━━━━━━━━━━━ 52s 17ms/step - accuracy: 0.1660 - loss: 4.7728
Epoch 5/100
3010/3010 ━━━━━━━━━━━━━━━━━━━━ 53s 18ms/step - accuracy: 0.1864 - loss: 4.4402
Epoch 6/100
3010/3010 ━━━━━━━━━━━━━━━━━━━━ 52s 17ms/step - accuracy: 0.2074 - loss: 4.1463
Epoch 7/100
3010/3010 ━━━━━━━━━━━━━━━━━━━━ 53s 17ms/step - accuracy: 0.2335 - loss: 3.8686
Epoch 8/100
3010/3010 ━━━━━━━━━━━━━━━━━━━━ 53s 18ms/step - accuracy: 0.2693 - loss: 3.5964


<keras.src.callbacks.history.History at 0x1cf61302ed0>
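To put the loss values in context, here is a small plain-Python sketch of softmax and categorical cross-entropy. It also computes the loss of a model that guesses uniformly over a vocabulary of 8200 words, which comes out at about 9.01 (the natural log of 8200). These are textbook definitions written for illustration, not the Keras internals, and the target index 42 is arbitrary.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(one_hot_target, predicted_probs, eps=1e-12):
    """Negative log of the probability assigned to the correct class."""
    return -sum(t * math.log(max(p, eps))
                for t, p in zip(one_hot_target, predicted_probs))

toy_probs = softmax([2.0, 1.0, 0.1])
toy_loss = categorical_crossentropy([1.0, 0.0, 0.0], toy_probs)

# Baseline: a model that spreads probability evenly over all 8200 words
vocab_size = 8200
uniform_probs = [1.0 / vocab_size] * vocab_size
target = [0.0] * vocab_size
target[42] = 1.0  # arbitrary "correct" word
baseline_loss = categorical_crossentropy(target, uniform_probs)

print(toy_probs, toy_loss)
print(baseline_loss)
```

Any loss below this baseline means the model is doing better than guessing uniformly at random.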

In [46]:
seed_text = "I will leave if they"
next_words = 4

for _ in range(next_words):
    # Encode the current text and pad it to the length the model was trained on
    token_list = tokenizer.texts_to_sequences([seed_text])[0]
    token_list = pad_sequences([token_list], maxlen = max_sequence_len - 1, padding = 'pre')
    # Index of the most probable next word (greedy choice)
    predicted = int(np.argmax(model.predict(token_list), axis = -1)[0])
    # index_word is the reverse of word_index; index 0 (padding) maps to no word
    output_word = tokenizer.index_word.get(predicted, "")
    seed_text += " " + output_word

print(seed_text)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 29ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 26ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 29ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 27ms/step
I will leave if they are not too late
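The loop above always takes the single most probable word. Applications such as keyboards usually offer a few suggestions instead, which can be sketched as a top-k selection over the predicted probabilities. The function name, the toy index_word mapping, and the probabilities below are all made up for illustration; in the notebook, the probabilities would come from one row of the model's prediction and the mapping from the tokenizer.

```python
def top_k_words(probabilities, index_word, k=3):
    """Return the k most probable (word, probability) pairs.

    Only indices present in `index_word` are considered, so the
    padding index 0 is never suggested.
    """
    ranked = sorted(index_word, key=lambda i: probabilities[i], reverse=True)
    return [(index_word[i], probabilities[i]) for i in ranked[:k]]

# Hypothetical values for demonstration
toy_index_word = {1: "the", 2: "are", 3: "were", 4: "late"}
toy_probabilities = [0.05, 0.10, 0.50, 0.30, 0.05]  # position 0 is padding

suggestions = top_k_words(toy_probabilities, toy_index_word, k=2)
print(suggestions)
```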
