<a href="https://colab.research.google.com/github/tproffen/ORCSGirlsPython/blob/master/LLMs/NextWordPrediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

From <a href="https://www.geeksforgeeks.org/next-word-prediction-with-deep-learning-in-nlp/">this article</a>.

## Importing Modules

Here we simply import all the modules we need for our code. Make sure you run this cell.

In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np
import re  # Python's built-in regular expression module
import requests

## Tokenizing the text

Remember the example: we need to turn words into numbers and organize the text so that one part of the text is the input and the next word is the label (the correct answer). The routine below does exactly that, and we will use it later to explore how the text is turned into numbers.

In [None]:
def tokenize_text(text):

  # Splitting the text into sentences using delimiters like '.', '?', and '!'
  sentences = [sentence.strip() for sentence in re.split(r'(?<=[.!?])\s+', text) if sentence.strip()]

  # Tokenize the text data (turning each different word into a number)
  # Note: this uses the global `tokenizer`, so create it with Tokenizer() before calling this routine
  tokenizer.fit_on_texts(sentences)
  total_words = len(tokenizer.word_index) + 1

  # Create input sequences
  input_sequences = []
  for line in sentences:
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
      n_gram_sequence = token_list[:i+1]
      input_sequences.append(n_gram_sequence)

  # Pad sequences and split into predictors and label
  # Because of the math, all the number lists need to have the same length, so we are adding 0's to make them the same length
  max_sequence_len = max([len(seq) for seq in input_sequences])
  input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))

  # X - the input sequences: each row is the start of a sentence as a list of tokens (word numbers)
  # y - the label for each row: the token of the word that actually comes next
  # Together they are used for training
  X, y = input_sequences[:, :-1], input_sequences[:, -1]

  # Convert target data to one-hot encoding
  y = tf.keras.utils.to_categorical(y, num_classes=total_words)

  return (X,y,max_sequence_len,total_words)
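
To see the idea without any TensorFlow, here is a minimal sketch of the same steps in plain Python. It is not the Keras tokenizer: the helper name `build_training_pairs` is made up for this illustration, and it numbers words in the order they first appear (Keras numbers them by how often they occur), so the numbers can differ for longer texts.

```python
import re

def build_training_pairs(text):
    # Split into sentences after '.', '!' or '?' (same pattern as in tokenize_text)
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

    # Give every distinct lowercase word a number, starting at 1 (0 is kept for padding)
    word_index = {}
    tokenized = []
    for sentence in sentences:
        words = re.findall(r"[a-z']+", sentence.lower())
        for word in words:
            if word not in word_index:
                word_index[word] = len(word_index) + 1
        tokenized.append([word_index[w] for w in words])

    # Every sentence start with at least two words becomes one training example
    sequences = [tokens[:i + 1] for tokens in tokenized for i in range(1, len(tokens))]

    # Add zeros on the left so all examples have the same length
    max_len = max(len(seq) for seq in sequences)
    padded = [[0] * (max_len - len(seq)) + seq for seq in sequences]

    # The last number is the label, everything before it is the input
    X = [seq[:-1] for seq in padded]
    y = [seq[-1] for seq in padded]
    return word_index, X, y

word_index, X_demo, y_demo = build_training_pairs("Math is super cool. So am I!")
print(word_index)
for inputs, label in zip(X_demo, y_demo):
    print(inputs, "->", label)
```

Each printed row shows the padded input on the left and the number of the next word on the right.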

## Trying out the tokenizer

With our routine, we can now see how the tokenizer works.

In [None]:
tokenizer = Tokenizer()

In [None]:
# This is our text - you can change it if you like!
text = "Math is super cool. So am I!"

# Let's tokenize :)
(X,y,max_sequence_len,total_words) = tokenize_text(text)

Let's see what we have. Feel free to modify the code below to print different sentences or values.

In [None]:
# Just printing X and y
print (X, y)
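
The `y` printed above is one-hot encoded: instead of a single word number, each label is a row of zeros with one 1 at the position of the correct word. The sketch below shows that idea in plain Python. The helper `one_hot` and the example labels are made up for illustration; the notebook itself uses `tf.keras.utils.to_categorical`.

```python
def one_hot(label, num_classes):
    # A list of zeros with a single 1 at the position of the label
    vector = [0] * num_classes
    vector[label] = 1
    return vector

labels = [2, 3, 4, 6, 7]   # example next-word numbers (assumed for this sketch)
num_classes = 8            # 7 different words + 1 for the padding value 0

encoded = [one_hot(label, num_classes) for label in labels]
for label, row in zip(labels, encoded):
    # row.index(1) recovers the word number, like argmax does on the real y
    print(label, row, row.index(1))
```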

In [None]:
# Hmmm, not so useful. Which word is which?
for i in range(1,len(tokenizer.index_word)+1):
  print (i,tokenizer.index_word[i])

In [None]:
# Loop over the sequences and print them with tokens and words
for s in range(len(X)):
  out = ''
  for word in X[s]:
      if word > 0:
        out+=tokenizer.index_word[word]+' '
  print(f"{out} -- \033[31m {tokenizer.index_word[y[s].argmax()]}\033[0m")

In [None]:
# Space for your exploration code ..

## Training on the pizza text

Next we read and tokenize the <a href="https://raw.githubusercontent.com/tproffen/ORCSGirlsPython/refs/heads/master/LLMs/pizza.txt">pizza input text</a> and train our small language model on it. <b>Note that training will take some time.</b>

In [None]:
tokenizer = Tokenizer()

In [None]:
url = "https://raw.githubusercontent.com/tproffen/ORCSGirlsPython/refs/heads/master/LLMs/pizza.txt"
response = requests.get(url)
text = response.text

(X,y,max_sequence_len,total_words) = tokenize_text(text)

In [None]:
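# Define the model
model = Sequential()
# Embedding: turns each word number into a list of 10 learned numbers
model.add(Embedding(total_words, 10, input_length=max_sequence_len-1))
# LSTM: reads the sequence word by word and keeps track of context (128 units)
model.add(LSTM(128))
# Dense + softmax: one probability for every word in the vocabulary
model.add(Dense(total_words, activation='softmax'))
model.compile(loss='categorical_crossentropy',optimizer='adam', metrics=['accuracy'])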
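
The last layer uses the `softmax` activation, which turns raw scores into probabilities that add up to 1. Here is a minimal sketch of that calculation in plain Python. The function and the three scores are made up for illustration; Keras does this for us inside the `Dense` layer.

```python
import math

def softmax(scores):
    # Subtract the largest score first so the exponentials stay small
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    # Divide by the total so all values add up to 1
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.1]   # made-up raw scores for three words
probs = softmax(scores)
print(probs)
print(sum(probs))
```

The word with the highest score also gets the highest probability, which is why the prediction code later can simply pick the largest value.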

In [None]:
# Train the model
model.fit(X, y, epochs=50, verbose=1)
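
While training, Keras prints a `loss` value that should go down over time. With one-hot labels, the `categorical_crossentropy` loss for a single example comes down to the negative logarithm of the probability the model gave to the correct word. The sketch below illustrates this with made-up probabilities; it leaves out details of the real Keras version such as averaging over many examples.

```python
import math

def cross_entropy(one_hot_label, predicted_probs):
    # Only the entry where the label is 1 contributes to the sum
    return -sum(t * math.log(p)
                for t, p in zip(one_hot_label, predicted_probs) if t > 0)

label = [0, 0, 1, 0]                  # the correct word is number 2
confident = [0.05, 0.05, 0.85, 0.05]  # model is fairly sure and right
unsure = [0.25, 0.25, 0.25, 0.25]     # model has no idea
wrong = [0.85, 0.05, 0.05, 0.05]      # model is fairly sure and wrong

loss_confident = cross_entropy(label, confident)
loss_unsure = cross_entropy(label, unsure)
loss_wrong = cross_entropy(label, wrong)
print(loss_confident, loss_unsure, loss_wrong)
```

The better the guess, the smaller the loss, so a falling loss means the model is putting more probability on the right next word.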

## Using the trained model

Now we can use the model to predict the next words, based on the pizza text we used for training :)

In [None]:
# Generate next word predictions - feel free to change these
seed_text = "The best pizza is"
next_words = 20

for _ in range(next_words):
  # Turn the current text into numbers and pad it to the length the model expects
  token_list = tokenizer.texts_to_sequences([seed_text])[0]
  token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
  # The model returns one probability per word; pick the most likely one
  predicted_probs = model.predict(token_list, verbose=0)
  predicted_word = tokenizer.index_word[np.argmax(predicted_probs)]
  seed_text += " " + predicted_word

print("Next predicted words:", seed_text)
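
The loop above always takes the single most likely word. To explore which other words the model considered, one option is to sort the probabilities and look at the top few. Below is a minimal sketch of that idea; the helper `top_k_words`, the tiny dictionary and the probabilities are all made up for illustration.

```python
def top_k_words(probs, index_word, k=3):
    # Pair every word with its probability, skipping numbers without a word (0 is padding)
    pairs = [(index_word[i], p) for i, p in enumerate(probs) if i in index_word]
    # Sort from most to least likely and keep the first k
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)[:k]

index_word_demo = {1: "pizza", 2: "cheese", 3: "oven", 4: "dough"}
probs_demo = [0.01, 0.10, 0.55, 0.04, 0.30]   # position 0 is the padding value

best = top_k_words(probs_demo, index_word_demo)
print(best)
```

In this notebook the same helper could be tried with `predicted_probs[0]` and `tokenizer.index_word` after running the prediction cell.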