# Natural Language Processing

### Ruthu S Sanketh

The central idea of this tutorial is to explore LSTM based models. We will explore how the size of the model effects the sequence generated. We will see both character based and word based models.

# Word Based LSTM model

In [6]:
import numpy
import re
import pandas as pd
import numpy as np
import keras
import string
import nltk

from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.layers import Embedding
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils
from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

string.punctuation = string.punctuation + '“' + '”' +'-' + '’' + '‘' + '—'
string.punctuation = string.punctuation.replace('.', '')

We do basic pre processing which includes lowering etc. after looking at the dataset.

In [7]:
# Loads the data and preprocesses data and stores corpus in raw_text
raw_text = open('/content/sample_data/corpus.txt', encoding = 'utf8').read()

file_nl_removed = ""
for line in raw_text:
  line_nl_removed = line.replace("\n", " ")           #removes newlines
  file_nl_removed += line_nl_removed

file_p = "".join([char for char in file_nl_removed if char not in string.punctuation])   #removes all special characters
sents = nltk.sent_tokenize(file_p)
print("The number of sentences is", len(sents)) #prints the number of sentences

string.punctuation = string.punctuation + '.'
file_q = "".join([char for char in file_p if char not in string.punctuation])   #removes even periods.
words = nltk.word_tokenize(file_q)
print("The number of tokens is", len(words)) #prints the number of tokens

average_tokens = round(len(words)/len(sents))
print("The average number of tokens per sentence is", average_tokens) #prints the average number of tokens per sentence

unique_tokens = set(words)
print("The number of unique tokens are", len(unique_tokens)) #prints the number of unique tokens

preprocessed_text = file_p.lower()       #converts corpus into lowercase

The number of sentences is 981
The number of tokens is 26381
The average number of tokens per sentence is 27
The number of unique tokens are 3037


In [8]:
# Hyperparameters of the model
vocab_size = 2750  #chosen based on statistics of the model
oov_tok = '<OOV>'
embedding_dim = 100
padding_type='post'
trunc_type='post'

In [9]:
# tokenizes sentences
tokenizer = Tokenizer(num_words = vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts([preprocessed_text])
word_index = tokenizer.word_index

In [10]:
seq_length = 50
tokens = tokenizer.texts_to_sequences([preprocessed_text])[0]

In [11]:
dataX = []
dataY = []

for i in range(0, len(tokens) - seq_length-1 , 1):
  seq_in = tokens[i:i + seq_length]
  seq_out = tokens[i + seq_length]

  if seq_out==1: #Skip samples where target word is OOV
    continue
    
  dataX.append(seq_in)
  dataY.append(seq_out)
 
N = len(dataX)
print ("Total training data size is -", N)

Total training data size is - 26333


In [12]:
X = numpy.array(dataX)

# one hot encodes the output variable
y = numpy.array(dataY)
y = np_utils.to_categorical(dataY)

In [13]:
# with embedding
model = keras.Sequential([
    keras.layers.Embedding(vocab_size, embedding_dim, input_length=seq_length),
    keras.layers.Bidirectional(keras.layers.LSTM(64)),
    keras.layers.Dense(vocab_size, activation='softmax')
])

# compiles model
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

# model summary
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 50, 100)           275000    
_________________________________________________________________
bidirectional (Bidirectional (None, 128)               84480     
_________________________________________________________________
dense (Dense)                (None, 2750)              354750    
Total params: 714,230
Trainable params: 714,230
Non-trainable params: 0
_________________________________________________________________


In [14]:
# Uses validation split of 0.2 while training
num_epochs = 5
history = model.fit(X, y, epochs=num_epochs, batch_size = 128, verbose=1, validation_split=0.2)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [15]:
#Creates word to idx map using tokenizer.word_index
reverse_word_map = dict(map(reversed, tokenizer.word_index.items()))

In [16]:
# Returns the next n words greedily
def next_tokens(input_str, n):
    print ("Seed -",  input_str, sep = '\n\n')
    final_string = ''
    for i in range(n):
        token = tokenizer.texts_to_sequences([input_str])[0]
        prediction = model.predict(token, verbose=0)
        final_string = final_string + reverse_word_map[numpy.argmax(prediction[0])] + ' ' 
        input_str = input_str + ' ' + reverse_word_map[numpy.argmax(prediction[0])]
        input_str = ' '.join(input_str.split(' ')[1:])
    return final_string

In [17]:
# picks a random seed
start = numpy.random.randint(0, len(dataX)-1)
pattern = dataX[start]
input_str = ' '.join([reverse_word_map[value] for value in pattern])

output = next_tokens(input_str, 10)
print("\nGenerated string -\n\n", output)

Seed -

must remember remarked the king or ill have you executed the miserable hatter dropped his teacup and breadandbutter and went down on one knee im a poor man your majesty he began youre a very poor speaker said the king here one of the guineapigs cheered and was immediately suppressed

Generated string -

 the the little little be little little little little the 


In [18]:
input_str = "The boy laughed at the fright he had caused. This time, the villagers left angrily. The third day, as the boy went up\
 the small hill, he suddenly saw a wolf attacking his sheep. He cried as hard as he could, “Wolf! Wolf! Wolf!”, but not \
 a single villager came to help him. The villagers thought that he was trying to fool them again and did not come to rescue \
 him or his sheep."

# Uses first 50 tokens from given input_str as input. Since the seq_length is 50, only 50 tokens are taken using the tokenizer.
output = next_tokens(input_str, 10)
print("\nGenerated string -\n\n", output)

Seed -

The boy laughed at the fright he had caused. This time, the villagers left angrily. The third day, as the boy went up the small hill, he suddenly saw a wolf attacking his sheep. He cried as hard as he could, “Wolf! Wolf! Wolf!”, but not  a single villager came to help him. The villagers thought that he was trying to fool them again and did not come to rescue  him or his sheep.

Generated string -

 little the the the little little little little the little 


# Character based LSTM Model 1

In [19]:
# Uses the preprocessed data and create raw_text
raw_text = preprocessed_text   #periods have not been removed for better results

# creates mapping of unique characters to integers
chars = sorted(list(set(raw_text)))
char_to_int = dict((c, i) for i, c in enumerate(chars))
int_to_char = dict((i, c) for i, c in enumerate(chars))

In [20]:
# Prints the total characters and character vocab size
n_chars = len(raw_text)
n_vocab = len(chars)

print("The number of total characters are", n_chars)
print("\nThe character vocab size is", n_vocab)

The number of total characters are 135030

The character vocab size is 29


In [21]:
#Prepares dataset where the input is sequence of 100 characters and target is next character.
seq_length = 100

dataX = []
dataY = []

for i in range(0, n_chars - seq_length, 1):
  seq_in = raw_text[i:i + seq_length]
  seq_out = raw_text[i + seq_length]

  dataX.append([char_to_int[char] for char in seq_in])
  dataY.append(char_to_int[seq_out])

n_patterns = len(dataX)
print ("Total Patterns: ", n_patterns)

Total Patterns:  134930


In [22]:
# reshapes X to be [samples, time steps, features]
X = numpy.reshape(dataX, (n_patterns, seq_length, 1))

# one hot encodes the output variable
y = np_utils.to_categorical(dataY)

In [23]:
embedding_dim =100
max_length =100

In [24]:
model = Sequential()
model.add(Embedding(n_vocab, embedding_dim, input_length=max_length))
model.add(LSTM(256))
model.add(Dropout(0.2))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 100, 100)          2900      
_________________________________________________________________
lstm_1 (LSTM)                (None, 256)               365568    
_________________________________________________________________
dropout (Dropout)            (None, 256)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 29)                7453      
Total params: 375,921
Trainable params: 375,921
Non-trainable params: 0
_________________________________________________________________


In [25]:
model.fit(X, y, epochs = 20, batch_size=128)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<tensorflow.python.keras.callbacks.History at 0x7f3aaab017f0>

In [26]:
#Gets the generated string using the model.
def predict_next_n_chars(pattern, n):
    for i in range(n):
      x = numpy.reshape(pattern, (1, len(pattern), 1))
      prediction = model.predict(x, verbose=0)
      print (int_to_char[numpy.argmax(prediction)], end = '')   #get next char index.
      seq_in = [int_to_char[value] for value in pattern]
      pattern.append(numpy.argmax(prediction))
      pattern = pattern[1:len(pattern)]

In [27]:
#picks a random seed
start = numpy.random.randint(0, len(dataX)-1)
pattern = dataX[start]
input_str = ''.join([int_to_char[value] for value in pattern])
print ("Seed -",  input_str, sep = '\n\n')
print ("\nGenerated string -\n")

predict_next_n_chars(pattern, 200)

Seed -

er sister as well as she could remember them all these strange adventures of hers that you have just

Generated string -

 been that said the caterpillar. well i cant see the rest of the court and the mock turtle said and the mock turtle said and the mock turtle said and the mock turtle said and the mock turtle said and 

In [28]:
input_str = "The boy laughed at the fright he had caused. This time, the villagers left angrily. The third day, as the boy went up\
 the small hill, he suddenly saw a wolf attacking his sheep. He cried as hard as he could, “Wolf! Wolf! Wolf!”, but not \
 a single villager came to help him. The villagers thought that he was trying to fool them again and did not come to rescue \
 him or his sheep."

#Uses the first 100 characters from given input_str as input to generate next 200 characters. 
input_str = input_str.lower()
input_string = ''
for each in input_str:
  if each in chars:
    if (len (input_string)<100):
      input_string += each

pattern = []
pattern.append([char_to_int[char] for char in input_string])

print ("Seed -",  input_str, sep = '\n\n')
print ("\nGenerated string -\n")
predict_next_n_chars(pattern[0], 200)

Seed -

the boy laughed at the fright he had caused. this time, the villagers left angrily. the third day, as the boy went up the small hill, he suddenly saw a wolf attacking his sheep. he cried as hard as he could, “wolf! wolf! wolf!”, but not  a single villager came to help him. the villagers thought that he was trying to fool them again and did not come to rescue  him or his sheep.

Generated string -

if she was a little before she had not the queen said to the mock turtle said and alice was so much alice thought to herself i wonder what a curious this time and she went on all the rest of the court

## Character based LSTM Model 2


In [29]:
model1 = Sequential()
model1.add(Embedding(n_vocab, embedding_dim, input_length=max_length))
model1.add(LSTM(256, input_shape=(X.shape[1], embedding_dim),return_sequences=True))
model1.add(Dropout(0.2))
model1.add(LSTM(256))
model1.add(Dropout(0.2))
model1.add(Dense(y.shape[1], activation='softmax'))
model1.compile(loss='categorical_crossentropy', optimizer='adam')
model1.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 100, 100)          2900      
_________________________________________________________________
lstm_2 (LSTM)                (None, 100, 256)          365568    
_________________________________________________________________
dropout_1 (Dropout)          (None, 100, 256)          0         
_________________________________________________________________
lstm_3 (LSTM)                (None, 256)               525312    
_________________________________________________________________
dropout_2 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 29)                7453      
Total params: 901,233
Trainable params: 901,233
Non-trainable params: 0
________________________________________________

In [30]:
model1.fit(X, y, epochs = 20, batch_size=64)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<tensorflow.python.keras.callbacks.History at 0x7f3a154a4e80>

In [31]:
# Generates the sequence similar to above methods. Gets the generated string using the model.
def predict_next_n_chars(pattern, n):
    for i in range(n):
      x = numpy.reshape(pattern, (1, len(pattern), 1))
      prediction = model1.predict(x, verbose=0)
      print (int_to_char[numpy.argmax(prediction)], end = '')   #get next char index.
      seq_in = [int_to_char[value] for value in pattern]
      pattern.append(numpy.argmax(prediction))
      pattern = pattern[1:len(pattern)]

In [32]:
#picks a random seed
start = numpy.random.randint(0, len(dataX)-1)
pattern = dataX[start]
input_str = ''.join([int_to_char[value] for value in pattern])
print ("Seed -",  input_str, sep = '\n\n')
print ("\nGenerated string -\n")

predict_next_n_chars(pattern, 200)

Seed -

it in time said the caterpillar and it put the hookah into its mouth and began smoking again. this t

Generated string -

ime the mouse was a large cat said the mouse to herself it was a little thing and the moral of that is that the mouse was a little birds with a tree in a low voice of the edge of the mouse who was not

In [33]:
input_str = "The boy laughed at the fright he had caused. This time, the villagers left angrily. The third day, as the boy went up\
 the small hill, he suddenly saw a wolf attacking his sheep. He cried as hard as he could, “Wolf! Wolf! Wolf!”, but not \
 a single villager came to help him. The villagers thought that he was trying to fool them again and did not come to rescue \
 him or his sheep."

#Uses the first 100 characters from given input_str as input to generate next 200 characters. 
input_str = input_str.lower()
input_string = ''
for each in input_str:
  if each in chars:
    if (len (input_string)<100):
      input_string += each

pattern = []
pattern.append([char_to_int[char] for char in input_string])

print ("Seed -",  input_str, sep = '\n\n')
print ("\nGenerated string -\n")
predict_next_n_chars(pattern[0], 200)

Seed -

the boy laughed at the fright he had caused. this time, the villagers left angrily. the third day, as the boy went up the small hill, he suddenly saw a wolf attacking his sheep. he cried as hard as he could, “wolf! wolf! wolf!”, but not  a single villager came to help him. the villagers thought that he was trying to fool them again and did not come to rescue  him or his sheep.

Generated string -

she spoke and the moral of that is that the mouse was a little birds with a tree in a low voice of the edge of the mouse who was not a mouse that was the mouse doesnt matter the white rabbit while she