**Vijay Panchal - 7225949**

This project outlines how a recurrent neural network (RNN) works by building a model that predicts the next word of a sentence from a text dataset. Using TensorFlow, we preprocess the data, train the model, and test it.


In [None]:
with open('Data.txt', 'r', encoding='utf-8') as file:
    textData = file.read()

# Data Preprocessing

## Clean Up

Lowercases the text, removes every character that is not a letter, whitespace, or a period, and collapses runs of spaces and newlines into a single space. Periods are kept so that the text can later be split into sentences.

In [None]:
import re

def cleanText(text):
    # lowercase
    text = text.lower()

    # Keep only letters and whitespace and period
    text = re.sub(r'[^a-z.\s]', '', text)

    # Replace multiple spaces/newlines with a single space
    text = re.sub(r'\s+', ' ', text).strip()
    return text
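
As a rough illustration of these steps, the sketch below applies the same three operations to a made-up sample string (the `sample` text is an assumption for demonstration and is not taken from Data.txt). Note that the period in an abbreviation such as "mr." survives the clean-up, so it would later be treated as a sentence boundary.

```python
import re

# Hypothetical sample text, not from the dataset
sample = "Mr. Holmes, it's 9 o'clock!\n\nCome  at once."

lowered = sample.lower()
# Drop everything except lowercase letters, periods, and whitespace
lettersOnly = re.sub(r'[^a-z.\s]', '', lowered)
# Collapse runs of spaces/newlines into a single space
collapsed = re.sub(r'\s+', ' ', lettersOnly).strip()

print(collapsed)  # mr. holmes its oclock come at once.
```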

## Split Into Sentences

Using the delimiter (a period by default), we split the text and treat each piece as a sentence. Each piece is stripped of surrounding whitespace, and empty pieces are dropped so that they are not counted as sentences.

In [None]:
def splitIntoSentences(cleanedText, delimiter='.'):
    sentences = cleanedText.split(delimiter)
    # strip whitespace and remove empties
    sentences = [s.strip() for s in sentences if s.strip() != '']
    return sentences
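
A small sketch of why the empty-string filter matters, using a made-up cleaned string (the `cleaned` value is an assumption for illustration). Because cleaned text normally ends with a period, a plain split leaves an empty trailing piece.

```python
# Hypothetical cleaned text that ends with a period
cleaned = "the game is afoot. come at once."

# A plain split keeps leading spaces and a trailing empty string
rawParts = cleaned.split('.')

# Stripping and filtering leaves only real sentences
sentences = [s.strip() for s in rawParts if s.strip() != '']

print(rawParts)   # ['the game is afoot', ' come at once', '']
print(sentences)  # ['the game is afoot', 'come at once']
```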

## Build and Tokenize

Find every unique word and build a vocabulary of words. This lets us convert every word in the book to an integer ID, so each sentence becomes a sequence of integers.

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer

def buildAndTokenize(listOfSentences):
    t = Tokenizer()
    t.fit_on_texts(listOfSentences)
    tokenizedSentences = t.texts_to_sequences(listOfSentences)

    return t, tokenizedSentences
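
To make the word-to-ID idea concrete, here is a plain-Python sketch that mimics the general behaviour (more frequent words get smaller IDs, and ID 0 is left free for padding). It is only an approximation for illustration, not the Keras implementation, and the two sample sentences are made up.

```python
from collections import Counter

# Hypothetical mini-corpus, already cleaned
sentences = ["the dog saw the cat", "the cat ran"]

# Count how often each word appears across all sentences
counts = Counter(word for s in sentences for word in s.split())

# Most frequent word gets ID 1; ID 0 is reserved for padding
wordIndex = {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}

# Replace every word by its integer ID
sequences = [[wordIndex[w] for w in s.split()] for s in sentences]

print(wordIndex)  # {'the': 1, 'cat': 2, 'dog': 3, 'saw': 4, 'ran': 5}
print(sequences)  # [[1, 3, 4, 1, 2], [1, 2, 5]]
```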

## Pad Sequences

Using a maximum length determined before calling this method, each sequence is either truncated or padded to that size. In this case we use the length of the longest sequence, so nothing needs to be truncated.

In [None]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

def padTokenizedSequences(tokenizedSentences, maxLength, padding='pre', truncating='pre'):
    padded = pad_sequences(tokenizedSentences, maxlen=maxLength, padding=padding, truncating=truncating)
    return padded
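
The effect of 'pre' padding and 'pre' truncating can be sketched in plain Python. The helper `prePad` below is a hypothetical stand-in written for illustration (it assumes `maxLength` is at least 1) and is not part of Keras.

```python
def prePad(seq, maxLength, value=0):
    """Left-pad (or left-truncate) a list of token IDs to maxLength."""
    # 'pre' truncation drops tokens from the front, keeping the last ones
    seq = seq[-maxLength:]
    # 'pre' padding puts the filler values in front of the real tokens
    return [value] * (maxLength - len(seq)) + seq

print(prePad([1, 2, 5], 5))           # [0, 0, 1, 2, 5]
print(prePad([1, 3, 4, 1, 2], 5))     # [1, 3, 4, 1, 2]
print(prePad([1, 2, 3, 4, 5, 6], 5))  # [2, 3, 4, 5, 6]
```

With pre-padding, the real words always sit at the end of the row, so the last column holds the final word of each sentence.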

## Running the Data Preprocessing methods

1. First *Clean Up* the text.
2. Then make a list of sentences by splitting every time there is a period.
3. *Build and Tokenize* each sentence.
4. Determine the maxLength of the sequences so it can be used for padding.
5. *Pad* each sentence/sequence.

In [None]:
cleanedText = cleanText(textData)
listOfSentences = cleanedText.split('.')
print("Number of sentences:", len(listOfSentences))

t, tokenizedSentences = buildAndTokenize(listOfSentences)

maxLength = max(len(s) for s in tokenizedSentences)
print(maxLength)
paddedSentences = padTokenizedSequences(tokenizedSentences, maxLength)

Number of sentences: 6434
101


# Building and Training the Model

## Creating The Unique Word Dictionary
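The tokenizer's word_index dictionary maps each unique word to an index. The vocabulary size is the number of entries plus one, because index 0 is reserved for padding.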

In [None]:
from tensorflow.keras.utils import to_categorical

vocabSize = len(t.word_index) + 1

print(vocabSize)

8599


## Building the Model

1. *Embedding Dimension* represents each word as a dense vector of 200 learned values.
2. *LSTM* is a layer of 128 hidden units which allows us to capture sequence patterns between words.
3. *Dense* gives us a probability distribution over every possible word in the vocabulary. The softmax activation turns the layer's outputs into probabilities, so we can see which word is most likely given the text.
4. Using all these layers, we build the model.

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

embeddingDim = 200

model = Sequential()
model.add(Embedding(input_dim=vocabSize, output_dim=embeddingDim, input_length=(maxLength-1)))
model.add(LSTM(128))
model.add(Dense(vocabSize, activation='softmax'))
model.build(input_shape=(None, maxLength - 1))
model.summary()
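
As a sanity check on what model.summary() reports, the parameter counts can be worked out by hand. The sketch below uses the standard formulas for embedding, LSTM (with bias), and dense layers, together with the vocabulary size of 8599 printed earlier. The variable names are chosen for illustration.

```python
vocabSize = 8599     # printed earlier in the notebook
embeddingDim = 200
lstmUnits = 128

# Embedding: one vector of embeddingDim values per vocabulary entry
embeddingParams = vocabSize * embeddingDim

# LSTM: 4 gates, each with input weights, recurrent weights, and a bias
lstmParams = 4 * (lstmUnits * (embeddingDim + lstmUnits) + lstmUnits)

# Dense: one weight per (LSTM unit, word) pair plus one bias per word
denseParams = lstmUnits * vocabSize + vocabSize

totalParams = embeddingParams + lstmParams + denseParams
print(embeddingParams, lstmParams, denseParams, totalParams)
# 1719800 168448 1109271 2997519
```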




## Compile The Model
1. *Adaptive Moment Estimation* (Adam) is the optimizer we used. It combines per-parameter adaptive learning rates with bias-corrected moment estimates, which generally lets training progress more efficiently.
2. *Cross Entropy* tells us how close our predicted probabilities are to the true distribution (here, the one-hot target word).
3. We use the accuracy metric to help us determine whether training is progressing correctly.


In [None]:
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
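
To give a feel for the cross-entropy values, the sketch below computes the loss for a single prediction by hand. With a one-hot target, categorical cross-entropy reduces to the negative log of the probability assigned to the true word. The helper name `crossEntropy` and the toy probabilities are assumptions for illustration.

```python
import math

def crossEntropy(predictedProbs, trueIndex):
    """Loss for one sample with a one-hot target at trueIndex."""
    return -math.log(predictedProbs[trueIndex])

# Toy 3-word vocabulary: the model puts 70% on the correct word
confidentLoss = crossEntropy([0.1, 0.7, 0.2], trueIndex=1)

# A model guessing uniformly over 8599 words scores -log(1/8599)
uniformLoss = math.log(8599)

print(round(confidentLoss, 3))  # 0.357
print(round(uniformLoss, 2))    # 9.06
```

The uniform-guess value is a useful reference point: a loss well below roughly 9.06 means the model is doing better than picking words at random.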

- X takes all columns except the last one as the input features.
- y takes the last column as the target label. Because the sequences are pre-padded, this is the final word of each sentence.

In [None]:
X = paddedSentences[:, :-1]
y = paddedSentences[:, -1]
y = to_categorical(y, num_classes=vocabSize)
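
The split into inputs and one-hot targets can be sketched without NumPy or Keras. The two padded rows and the toy vocabulary size below are made-up values for illustration.

```python
# Hypothetical pre-padded sequences (0 is the padding ID)
padded = [[0, 0, 1, 2, 5],
          [1, 3, 4, 1, 2]]
toyVocabSize = 6  # IDs 0..5

# Inputs: every column except the last
X = [row[:-1] for row in padded]

# Targets: the last column, i.e. the final word ID of each sentence
yIds = [row[-1] for row in padded]

# One-hot encode each target ID
y = [[1.0 if i == wordId else 0.0 for i in range(toyVocabSize)]
     for wordId in yIds]

print(X)     # [[0, 0, 1, 2], [1, 3, 4, 1]]
print(yIds)  # [5, 2]
print(y[0])  # [0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
```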

## Train The Model
1. I trained for 30 epochs: the model accuracy improved most drastically between epochs 10 and 25, and the extra epochs were run just in case it improved further.
2. The validation split is 10%, which holds out 10% of the training data as a validation set. This lets us monitor for overfitting during training.
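
The effect of the validation split on the number of training steps can be checked with a little arithmetic. This sketch assumes Keras holds out the final fraction of the samples and rounds the training count down. The sample count of 6434 is the sentence count printed earlier, and 64 is the batch size used for training.

```python
import math

numSamples = 6434        # "Number of sentences" printed earlier
validationSplit = 0.1
batchSize = 64

# Samples kept for training (assumed to be rounded down)
numTrain = math.floor(numSamples * (1 - validationSplit))
numVal = numSamples - numTrain

# Batches per epoch; the last batch may be smaller than batchSize
stepsPerEpoch = math.ceil(numTrain / batchSize)

print(numTrain, numVal, stepsPerEpoch)  # 5790 644 91
```

Under these assumptions the result agrees with the 91 steps per epoch shown in the training log.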

In [None]:
history = model.fit(X, y, batch_size = 64, epochs=30, validation_split=0.1)

Epoch 1/75
91/91 ━━━━━━━━━━━━━━━━━━━━ 31s 311ms/step - accuracy: 0.0301 - loss: 8.4999 - val_accuracy: 0.0543 - val_loss: 8.1660
Epoch 2/75
91/91 ━━━━━━━━━━━━━━━━━━━━ 29s 319ms/step - accuracy: 0.0386 - loss: 6.9669 - val_accuracy: 0.0543 - val_loss: 8.4580
Epoch 3/75
91/91 ━━━━━━━━━━━━━━━━━━━━ 28s 309ms/step - accuracy: 0.0436 - loss: 6.7249 - val_accuracy: 0.0543 - val_loss: 8.5072
Epoch 4/75
91/91 ━━━━━━━━━━━━━━━━━━━━ 41s 308ms/step - accuracy: 0.0465 - loss: 6.5862 - val_accuracy: 0.0528 - val_loss: 8.6343
Epoch 5/75
91/91 ━━━━━━━━━━━━━━━━━━━━ 28s 309ms/step - accuracy: 0.0476 - loss: 6.5506 - val_accuracy: 0.0435 - val_loss: 8.8131
Epoch 6/75
91/91 ━━━━━━━━━━━━━━━━━━━━ 42s 323ms/step - accuracy: 0.0498 - loss: 6.3970 - val_accuracy: 0.0450 - val_loss: 8.7570
Epoch 7/75
91/91

# Test The Next Word Program

## Creating the Predict Next Word Method

1. Preprocess the user input by tokenizing and padding it.
2. Feed the token list into the model and use it to predict the probability distribution over the vocabulary.
3. Find the index of the highest probability.
4. Convert the index back into the actual word using the word index and return that word.

In [None]:
import numpy as np

def predict_next_word(model, tokenizer, userText, maxLength):
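    # 1. Tokenize the input and pad it to the model's input length
    tokenList = tokenizer.texts_to_sequences([userText])[0]
    tokenList = pad_sequences([tokenList], maxlen=maxLength - 1, padding='pre')

    # 2. Predict the probability distribution over the vocabulary
    predictedProbs = model.predict(tokenList, verbose=0)[0]

    # 3. Find the index with the highest probability
    predictedIndex = int(np.argmax(predictedProbs))

    # 4. Map the index back to its word; the result is None when no word
    #    has that index (for example the padding index 0)
    nextWord = tokenizer.index_word.get(predictedIndex)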

    return nextWord
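
Steps 3 and 4 can be tried in isolation with a toy vocabulary. The word index and probability vectors below are made up for illustration. The second case shows how a prediction of the padding index 0 yields no word at all.

```python
# Hypothetical word index (ID 0 is reserved for padding)
wordIndex = {"the": 1, "cat": 2, "dog": 3}

# Reverse mapping from ID back to word
indexWord = {idx: word for word, idx in wordIndex.items()}

def lookupNextWord(probs):
    """Return the word with the highest probability, or None."""
    predictedIndex = probs.index(max(probs))
    return indexWord.get(predictedIndex)

print(lookupNextWord([0.05, 0.15, 0.6, 0.2]))    # cat
print(lookupNextWord([0.9, 0.05, 0.03, 0.02]))   # None
```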

## Running The Program

Runs a loop that takes the user input (a sentence that the user wants to complete), shows the predicted next word, and keeps prompting for more sentences until the user types quit.

In [None]:
while True:
    userInput = input("Enter a partial sentence (or 'quit'): ")
    if userInput.lower() == 'quit':
        break

    resultWord = predict_next_word(model, t, userInput, maxLength)
    if resultWord:
        print("Next word might be:", resultWord)
    else:
        print("Could not predict a next word.")

Enter a partial sentence (or 'quit'): I am 
Next word might be: armed
Enter a partial sentence (or 'quit'): by train from
Next word might be: waterloo
Enter a partial sentence (or 'quit'): I would have endured
Next word might be: means
Enter a partial sentence (or 'quit'): quit


# Conclusion

In this project we created an RNN model to predict which word comes after an input sentence. First, we preprocessed the Data.txt file. Then, we built the RNN model using TensorFlow's Keras API. After continuous tweaking and grasping of hyperparameters such as the embedding dimension and the optimizer, we were able to train our model to above 90% accuracy. This model was used to create a program that predicts the next word for text from the dataset, the book "The Adventures of Sherlock Holmes" by Arthur Conan Doyle.