Problem statement - To identify the most likely word to follow given a string of words.
We will use three approaches to this problem:
1. Using NLTK
2. Using RNN - LSTM
3. Using Transformers

## Method 1 : Next word prediction using NLTK

In [2]:
#importing necessary libraries
import nltk
from nltk.corpus import reuters
from nltk import bigrams, ConditionalFreqDist
from nltk.tokenize import word_tokenize, sent_tokenize
nltk.download('all')

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/abc.zip.
[nltk_data]    | Downloading package alpino to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/alpino.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping
[nltk_data]    |       taggers/averaged_perceptron_tagger_ru.zip.
[nltk_data]    | Downloading package basque_grammars to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping grammars/basque_grammars.zip.
[nltk_data]    | Downloading package bcp47 to /root/nltk_data...
[nltk_data]    | Downloading package biocreative_ppi to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   U

True

In [3]:
nltk.download("reuters")
corpus=reuters.sents()

[nltk_data] Downloading package reuters to /root/nltk_data...
[nltk_data]   Package reuters is already up-to-date!


In [4]:
print(corpus[0])

['ASIAN', 'EXPORTERS', 'FEAR', 'DAMAGE', 'FROM', 'U', '.', 'S', '.-', 'JAPAN', 'RIFT', 'Mounting', 'trade', 'friction', 'between', 'the', 'U', '.', 'S', '.', 'And', 'Japan', 'has', 'raised', 'fears', 'among', 'many', 'of', 'Asia', "'", 's', 'exporting', 'nations', 'that', 'the', 'row', 'could', 'inflict', 'far', '-', 'reaching', 'economic', 'damage', ',', 'businessmen', 'and', 'officials', 'said', '.']


In [5]:
print(corpus[1])

['They', 'told', 'Reuter', 'correspondents', 'in', 'Asian', 'capitals', 'a', 'U', '.', 'S', '.', 'Move', 'against', 'Japan', 'might', 'boost', 'protectionist', 'sentiment', 'in', 'the', 'U', '.', 'S', '.', 'And', 'lead', 'to', 'curbs', 'on', 'American', 'imports', 'of', 'their', 'products', '.']




1.   Notice that all corpus are ending with a full stop.
2. Here, corpus is a list of sentences.



In [6]:
print(len(corpus))

54716


CREATING BIGRAMS

Example- Leo Messi is the best player.

A unigram would be:
* Leo
* Messi
* Is
* the
* best
* player

A bigram would be:
* Leo Messi
* Messi is
* is best
* best player

A trigram would be:
* Leo Messi is
* Messi is the
* is the best
* the best player

* <b> But why do we do this? </b>
    * For every bigram, when we take the first word, we check what word comes next very frequently. For instance, if we take the word "Leo", we see that "Messi" is occuring frequently.
    * This is the best performing method for next word prediction as long as you have a huge dataset.

In [7]:
words= [word.lower() for sent in corpus for word in sent]
bigrams_list=list(bigrams(words))

In [8]:
print(bigrams_list[:10])

[('asian', 'exporters'), ('exporters', 'fear'), ('fear', 'damage'), ('damage', 'from'), ('from', 'u'), ('u', '.'), ('.', 's'), ('s', '.-'), ('.-', 'japan'), ('japan', 'rift')]


* <b> CREATING A CONDITIONAL FREQUENCY DISTRIBUTION </b>
    * How many times a pair of words have been repeated in the whole corpus.
    * Example from the above output: How many times ('asian', 'exporters') have been repeated in the entire corpus.
    * This helps us to predict the next word. For example- if ('asian', 'exporters') is repeating a lot of times in the corpus then we can say that when we input the word 'asian' then the most likely next word would be 'exporters'.

In [9]:
cfd= ConditionalFreqDist(bigrams_list)

<b> NEXT WORD PREDICTION </b>

In [10]:
def predict_next_word(input_word):
    input_word=input_word.lower()
    if input_word in cfd:
        return cfd[input_word].max()
    else:
        return "Word not found in corpus"

In [11]:
input_word='fear'
next_word=predict_next_word(input_word)
print(f"The next word after {input_word} could be: {next_word}")

The next word after fear could be: of


<b> Note: In general, as the dataset size increases, we will need to increase our gram size. Here, our dataset is small, so we are using bigram. </b>

## Method 2 : Next word prediction using RNN

In [12]:
import tensorflow as tf
import numpy as np
import random
import sys
import os

A NICELY ANALOGY FOR LSTM (Long short Term Memory):

Consider you riding a bike now. You're using your short term memory as in to check nearby vehicles and checking the traffic lights and other things while riding the bike.
But, you're also using your long term memory as in how you learnt riding a bike when you were young.
In this case, you're using a combination of short term as well as long term.

<b> Loading the dataset </b>

In [13]:
path_to_file=tf.keras.utils.get_file('shakespeare.txt', 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')
text=open(path_to_file, 'rb').read().decode(encoding='utf-8')

Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt


Creating Mappings

In [14]:
## creating a vocabulary
vocab=sorted(set(text))

#creating a mapping from characters to unique indices
char2idx= {char: idx for idx, char in enumerate(vocab)} #representing every character by a number
idx2char=np.array(vocab)

#convert the text to numerical data
text_as_int=[char2idx[char] for char in text ]

Creating training examples and target

In [15]:
sequence_length=100
sequences_per_epoch=len(text) // (sequence_length+1)

char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)

sequences = char_dataset.batch(sequence_length+1, drop_remainder=True)


def split_input_target(chunk):
    input_text=chunk[:-1]
    target_text=chunk[1:]
    return input_text, target_text

dataset = sequences.map(split_input_target)

TRAINING PARAMETERS

In [16]:
#Batch size
BATCH_SIZE=64

#buffer size to shuffle the dataset
BUFFER_SIZE=10000

dataset=dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)

vocab_size=len(vocab)
embedding_dim=256
rnn_units=1024

EPOCHS=30

CREATING THE MODEL

In [17]:
def build_model(vocab_size,embedding_dim,rnn_units, batch_size):
    model=tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embedding_dim, batch_input_shape=[batch_size, None]),
        tf.keras.layers.LSTM(rnn_units, return_sequences=True, stateful=True, recurrent_initializer='glorot_uniform'),
        tf.keras.layers.Dense(vocab_size)
    ])
    return model
model=build_model(vocab_size, embedding_dim,rnn_units, BATCH_SIZE)

#COMPILING THE MODEL
model.compile(optimizer='adam',loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

#CONFIGURE CHECKPOINTS
checkpoint_dir='./training_checkpoints'
checkpoint_prefix=os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback=tf.keras.callbacks.ModelCheckpoint(
                        filepath=checkpoint_prefix,
                        save_weights_only=True)


In [18]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (64, None, 256)           16640     
                                                                 
 lstm (LSTM)                 (64, None, 1024)          5246976   
                                                                 
 dense (Dense)               (64, None, 65)            66625     
                                                                 
Total params: 5330241 (20.33 MB)
Trainable params: 5330241 (20.33 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


* <b> Why are we embedding when we already assigned an index number to every character? </b>
    * This is because embedding groups all the similar characters and their indexes and groups them into a single number . This improves the performance of the model.

TRAINING THE MODEL

In [19]:
history= model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


Don't run this code if you are not running it in GPU!! You can use Google colab

PREDICTING THE MODEL

In [21]:
#GENERATE TEXT

def generate_text(model,start_string):
    num_generate=1000
    input_eval=[char2idx[s] for s in start_string]
    input_eval=tf.expand_dims(input_eval,0)
    text_generated=[]
    temperature=1.0

    model.reset_states()
    for i in range(num_generate):
        predictions=model(input_eval)
        predictions=tf.squeeze(predictions, 0)
        predictions=predictions/temperature
        predicted_id=tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()
        input_eval=tf.expand_dims([predicted_id], 0)
        text_generated.append(idx2char[predicted_id])

    return (start_string+ "".join(text_generated))

#restore the latest checkpoint and generate text
model=build_model(vocab_size, embedding_dim,rnn_units,batch_size=1)
model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))
model.build(tf.TensorShape([1, None]))

print(generate_text(model,start_string=u"ROMEO: "))

ROMEO: 'twere to the earth,
That lives not past of choler my estane;
How has your horse-clocks here accuse my king?

DERIO:
Cannot be burnt out.

ADUENour dire our fair
diseases that the entertainment
have worn i' the laz, or else these
ETES:
Out of one France that in Marcius poisonous
Mune in thee.

FRIAR LAURENCE:
Unhame, thou shalt, which shall suffer my guilty
Be feed to wish it under fast;
And thou shalt will not die this fellow hast a cupst friend.

LADY ANNE:
Neither, every father nor our duty:
'Tis the purpose twaxat Henry's friend.

KING HENRY VI:
Peace, thou! and give King Henry lips the crown.

KING RICHARD III:
You might have kept his goodness.

MENENIUS:
Age the bound of holy hand when he is love's love.
A grace of tears, combeding his affections to kill the blood and the air
Of his own carver and courtesy?

CLARENCE:
Belike the miligious officers upon thy woes,
But that the over-eye of Henry their chiffereds here?

FRIAR PETER:
Then plain men should desire the traught of 

## METHOD 3: NEXT WORD PREDICTION USING TRANSFORMERS

In [22]:
!pip install transformers torch

Collecting transformers
  Downloading transformers-4.33.2-py3-none-any.whl (7.6 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.6/7.6 MB[0m [31m49.5 MB/s[0m eta [36m0:00:00[0m
Collecting huggingface-hub<1.0,>=0.15.1 (from transformers)
  Downloading huggingface_hub-0.17.3-py3-none-any.whl (295 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m295.0/295.0 kB[0m [31m33.8 MB/s[0m eta [36m0:00:00[0m
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.8/7.8 MB[0m [31m108.0 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.3.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.3/1.3 MB[0m [31m86.1 MB/s[0m eta [36m0:00:

In [24]:
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

#load pretrained GPT-2 model and tokenizer
model_name="gpt2"
tokenizer=GPT2Tokenizer.from_pretrained(model_name)
model=GPT2LMHeadModel.from_pretrained(model_name)

#set the model to evaluation mode (no training)
model.eval()

#function to generate text
def generate_text(prompt, max_length=50, temperature=0.7):
  input_ids=tokenizer.encode(prompt,return_tensors="pt")

  #Generate text
  output = model.generate(
      input_ids,
      max_length=max_length,
      num_return_sequences=1,
      no_repeat_ngram_size=2,
      top_k=50,
      top_p=0.95,
      temperature=temperature,
  )
  #decode and return generated text
  generated_text= tokenizer.decode(output[0],skip_special_tokens=True)
  return generated_text

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]



In [25]:
#Generate text with a prompt
prompt= "once upon a time"
generated_text=generate_text(prompt,max_length=100)
print(generated_text)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


once upon a time, and I'm not sure if it's a coincidence or not.

I'm sure you're all aware of the fact that I've been working on this project for a while now. I have a lot of work to do, but I want to make sure that it is as good as it can be. So, I'll be working with you guys on the next project. It's going to be a bit of a long project, so I hope you all enjoy
