<h1>Introduction</h1>

This guide will cover the basics of deep learning for NLP tasks. We will first cover classification of data as spam/not-spam using various deep learing frameworks like RNNs and LSTMs. We will then also cover how to predict the next word in a given word-sequence using RNNs.

In [53]:
import warnings
warnings.filterwarnings('ignore')

In [70]:
from nltk.corpus import stopwords
from collections import Counter
from nltk import *
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem import WordNetLemmatizer
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

In [51]:
import numpy as np
import random
import pandas as pd
import sys
import os
import time
import codecs
import collections
import numpy
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils
from nltk.tokenize import sent_tokenize, word_tokenize
import scipy
from scipy import spatial
from nltk.tokenize.toktok import ToktokTokenizer
import re
tokenizer = ToktokTokenizer()

In [2]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.layers import Dense, Input, LSTM, Embedding,Dropout, Activation
from keras.layers import Bidirectional, GlobalMaxPool1D,Conv1D, SimpleRNN
from keras.models import Model
from sklearn.preprocessing import LabelEncoder
from keras.models import Sequential
from keras import initializers, regularizers, constraints,optimizers, layers
from keras.layers import Dense, Input, Flatten, Dropout,BatchNormalization
from keras.layers import Conv1D, MaxPooling1D, Embedding
from keras.models import Sequential
from keras.layers.recurrent import SimpleRNN
import sklearn
from sklearn.metrics import precision_recall_fscore_support as score

In [3]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/sms-spam-collection-dataset/spam.csv


<h2>Text classification using deep learning</h2>

Our main objective here is to build a text classifier using neural networks. The basic NLP pipeline will be the same, followed by a new process of building deep learning models:

In [4]:
# Importing data and checking it out
df = pd.read_csv('../input/sms-spam-collection-dataset/spam.csv', encoding='ISO-8859-1')
df.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


In [5]:
df.shape

(5572, 5)

In [6]:
# Checking null values
df.isnull().sum()

v1               0
v2               0
Unnamed: 2    5522
Unnamed: 3    5560
Unnamed: 4    5566
dtype: int64

In [7]:
# Extracting required columns
df = df[['v1', 'v2']]
df.head()

Unnamed: 0,v1,v2
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [8]:
# Renaming columns
df.rename(columns={'v1':'label', 'v2':'text'}, inplace=True)

In [9]:
df.head()

Unnamed: 0,label,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [10]:
df['text']

0       Go until jurong point, crazy.. Available only ...
1                           Ok lar... Joking wif u oni...
2       Free entry in 2 a wkly comp to win FA Cup fina...
3       U dun say so early hor... U c already then say...
4       Nah I don't think he goes to usf, he lives aro...
                              ...                        
5567    This is the 2nd time we have tried 2 contact u...
5568                Will Ì_ b going to esplanade fr home?
5569    Pity, * was in mood for that. So...any other s...
5570    The guy did some bitching but I acted like i'd...
5571                           Rofl. Its true to its name
Name: text, Length: 5572, dtype: object

In [11]:
# Removing stop words and converting it all to lowercase
stop = stopwords.words('english')
df['text'] = df['text'].apply(lambda x: " ".join(x for x in x.split() if x.lower() not in stop))
df['text'].head()

0    Go jurong point, crazy.. Available bugis n gre...
1                        Ok lar... Joking wif u oni...
2    Free entry 2 wkly comp win FA Cup final tkts 2...
3            U dun say early hor... U c already say...
4              Nah think goes usf, lives around though
Name: text, dtype: object

In [12]:
# Converting to lowercase
df['text'] = df['text'].apply(lambda x: " ".join(x.lower() for x in x.split()))
df['text'].head()

0    go jurong point, crazy.. available bugis n gre...
1                        ok lar... joking wif u oni...
2    free entry 2 wkly comp win fa cup final tkts 2...
3            u dun say early hor... u c already say...
4              nah think goes usf, lives around though
Name: text, dtype: object

In [13]:
# Removing symbols
df['text'] = df['text'].apply(lambda x:re.sub('[!@#$:).;,?&]', "", x.lower()))
df['text'].head()

0    go jurong point crazy available bugis n great ...
1                              ok lar joking wif u oni
2    free entry 2 wkly comp win fa cup final tkts 2...
3                  u dun say early hor u c already say
4               nah think goes usf lives around though
Name: text, dtype: object

In [14]:
df.isnull().sum()

label    0
text     0
dtype: int64

In [15]:
# We can give only two arguments if we're working with a dataframe
training, testing = train_test_split(df, test_size=0.2)

In [16]:
print(training.shape)
print(testing.shape)

(4457, 2)
(1115, 2)


In [17]:
# Finding max sentence length - 300
np.max(df['text'].apply(lambda x: len(x)))

511

In [18]:
# We will take the top 200000 frequently occuring words
words = 20000
tokenizer = Tokenizer(num_words=words)

`fit_on_texts` - Updates internal vocabulary based on a list of texts. This method creates the vocabulary index based on word frequency. So if you give it something like, "The cat sat on the mat." It will create a dictionary s.t. `word_index["the"] = 1; word_index["cat"] = 2` it is word -> index dictionary so every word gets a unique integer value. 0 is reserved for padding. So lower integer means more frequent word (often the first few are stop words because they appear a lot). Through num_words, we are picking the most frequent words i.e. the ones with the lower integer values.

`texts_to_sequences` Transforms each text in texts to a sequence of integers. So it basically takes each word in the text and replaces it with its corresponding integer value from the word_index dictionary. Nothing more, nothing less, certainly no magic involved.

In [19]:
tokenizer.fit_on_texts(training.text)

In [20]:
train_seq = tokenizer.texts_to_sequences(training.text)
test_seq = tokenizer.texts_to_sequences(testing.text)

In [21]:
import itertools

In [22]:
# Dictionary for the words and the index
word_index = tokenizer.word_index
print(dict(itertools.islice(word_index.items(), 50)))
print()
print('Found %s unique tokens '%len(word_index))

{'u': 1, 'call': 2, '2': 3, 'get': 4, "i'm": 5, 'ur': 6, '4': 7, 'ltgt': 8, 'go': 9, 'ok': 10, 'free': 11, 'know': 12, 'good': 13, 'come': 14, 'like': 15, 'got': 16, 'now': 17, 'day': 18, 'time': 19, 'send': 20, 'you': 21, 'love': 22, 'want': 23, 'text': 24, 'home': 25, 'going': 26, 'one': 27, "i'll": 28, 'see': 29, 'me': 30, 'lor': 31, 'need': 32, 'txt': 33, 'r': 34, 'still': 35, 'today': 36, 'stop': 37, 'sorry': 38, 'later': 39, 'back': 40, 'dont': 41, 'n': 42, 'it': 43, 'tell': 44, 'think': 45, 'new': 46, 'da': 47, 'hi': 48, 'take': 49, 'phone': 50}

Found 8470 unique tokens 


`pad_sequences` is used to ensure that all sequences in a list have the same length. By default this is done by padding 0 in the beginning of each sequence until each sequence has the same length as the longest sequence.
Sequences longer than num_timesteps are truncated so that they fit the desired length.
The position where padding or truncation happens is determined by the arguments padding and truncating, respectively. Pre-padding or removing values from the beginning of the sequence is the default.

In [23]:
# Padding data for equal lengths, for our models
training_data = pad_sequences(train_seq, maxlen=300)
testing_data = pad_sequences(test_seq, maxlen=300)

In [24]:
print(training_data.shape)
print(testing_data.shape)

(4457, 300)
(1115, 300)


In [25]:
y_train = training['label']
y_test = testing['label']

In [26]:
le = LabelEncoder()
le.fit(y_train)
y_train = le.transform(y_train)
y_test = le.transform(y_test)
print(le.classes_)

['ham' 'spam']


In [27]:
y_train.shape

(4457,)

In [28]:
y_train

array([0, 0, 1, ..., 0, 1, 0])

In [29]:
y_test.shape

(1115,)

In [30]:
# Converting the labels to categorical
# To pass through our model
y_train_cat = to_categorical(np.asarray(y_train))
y_test_cat = to_categorical(np.asarray(y_test))
print('Shape of data tensor', training_data.shape)
print('Shape of label tensors (training)', y_train_cat.shape)
print('Shape of label tensors (testing)', y_test_cat.shape)

Shape of data tensor (4457, 300)
Shape of label tensors (training) (4457, 2)
Shape of label tensors (testing) (1115, 2)


In [31]:
y_train_cat

array([[1., 0.],
       [1., 0.],
       [0., 1.],
       ...,
       [1., 0.],
       [0., 1.],
       [1., 0.]], dtype=float32)

In [32]:
# Defining our embedding dimension
embeds = 100

<h2>Model building and predicting</h2>

We are building the models using different deep learning approaches
like CNN, RNN, LSTM, and Bidirectional LSTM and comparing the
performance of each model using different accuracy metrics.
We can now define our CNN model.
Here we define a single hidden layer with 128 memory units. The
network uses a dropout with a probability of 0.5. The output layer is a
dense layer using the softmax activation function to output a probability
prediction.

In [33]:
print('Training CNN 1D model')
model = Sequential()
# 20000 was our maximum word number in the tokenizer
model.add(Embedding(20000,
 embeds,
 input_length=300
 ))
model.add(Dropout(0.5))
model.add(Conv1D(128, 5, activation='relu'))
model.add(MaxPooling1D(5))
model.add(Dropout(0.5))
model.add(BatchNormalization())
model.add(Conv1D(128, 5, activation='relu'))
model.add(MaxPooling1D(5))
model.add(Dropout(0.5))
model.add(BatchNormalization())
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy',
 optimizer='rmsprop',
 metrics=['acc'])

Training CNN 1D model


In [34]:
model.fit(training_data, y_train_cat, batch_size=64, epochs=5, validation_data = (testing_data, y_test_cat))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7fd55b1a6ad0>

In [35]:
predicted=model.predict(testing_data)
predicted

array([[0.7227747 , 0.27722535],
       [0.7149998 , 0.28500015],
       [0.7137979 , 0.2862021 ],
       ...,
       [0.733541  , 0.26645893],
       [0.7143932 , 0.2856068 ],
       [0.7145982 , 0.2854018 ]], dtype=float32)

In [36]:
# Metrics
precision, recall, fscore, support = score(y_test_cat,predicted.round())
print('precision: {}'.format(precision))
print('recall: {}'.format(recall))
print('fscore: {}'.format(fscore))
print('support: {}'.format(support))
print("############################")
print(sklearn.metrics.classification_report(y_test_cat,predicted.round()))

precision: [0.9334638 1.       ]
recall: [1.         0.57763975]
fscore: [0.96558704 0.73228346]
support: [954 161]
############################
              precision    recall  f1-score   support

           0       0.93      1.00      0.97       954
           1       1.00      0.58      0.73       161

   micro avg       0.94      0.94      0.94      1115
   macro avg       0.97      0.79      0.85      1115
weighted avg       0.94      0.94      0.93      1115
 samples avg       0.94      0.94      0.94      1115



<h2>RNN model</h2>

In [37]:
print('Training SIMPLERNN model.')
model = Sequential()
model.add(Embedding(20000,
 embeds,
 input_length=300
 ))
model.add(SimpleRNN(2, input_shape=(None,1)))
model.add(Dense(2,activation='softmax'))
model.compile(loss = 'binary_crossentropy',
optimizer='adam',metrics = ['accuracy'])
model.fit(training_data, y_train_cat,
 batch_size=16,
 epochs=5,
 validation_data=(testing_data, y_test_cat))

Training SIMPLERNN model.
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7fd55b12ea50>

In [38]:
# probabilities
predicted_Srnn=model.predict(testing_data)
predicted_Srnn

array([[9.9948907e-01, 5.1089545e-04],
       [9.9779910e-01, 2.2008889e-03],
       [9.9943000e-01, 5.7000760e-04],
       ...,
       [9.9918503e-01, 8.1497722e-04],
       [9.9882549e-01, 1.1744860e-03],
       [9.9968708e-01, 3.1298236e-04]], dtype=float32)

In [39]:
precision, recall, fscore, support = score(y_test_cat, predicted_Srnn.round())
print('precision: {}'.format(precision))
print('recall: {}'.format(recall))
print('fscore: {}'.format(fscore))
print('support: {}'.format(support))
print("############################")
print(sklearn.metrics.classification_report(y_test_cat,predicted_Srnn.round()))

precision: [0.95213849 0.85714286]
recall: [0.98008386 0.70807453]
fscore: [0.96590909 0.7755102 ]
support: [954 161]
############################
              precision    recall  f1-score   support

           0       0.95      0.98      0.97       954
           1       0.86      0.71      0.78       161

   micro avg       0.94      0.94      0.94      1115
   macro avg       0.90      0.84      0.87      1115
weighted avg       0.94      0.94      0.94      1115
 samples avg       0.94      0.94      0.94      1115



<h2>LSTM model</h2>

In [40]:
print('Training LSTM model.')
model = Sequential()
model.add(Embedding(20000,
 embeds,
 input_length=300
 ))
model.add(LSTM(16, activation='relu',return_sequences=True))
model.add(Dropout(0.2))
model.add(BatchNormalization())
model.add(Flatten())
model.add(Dense(2,activation='softmax'))

model.compile(loss = 'binary_crossentropy',
optimizer='adam',metrics = ['accuracy'])
model.fit(training_data, y_train_cat,
 batch_size=16,
 epochs=5,
 validation_data=(testing_data, y_test_cat))

Training LSTM model.
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7fd5419936d0>

In [41]:
predicted_lstm=model.predict(testing_data)
predicted_lstm

array([[1.0000000e+00, 1.1004146e-09],
       [1.0000000e+00, 1.4025856e-16],
       [9.9999952e-01, 4.7674393e-07],
       ...,
       [1.0000000e+00, 2.9936931e-23],
       [1.0000000e+00, 1.5904758e-11],
       [1.0000000e+00, 9.1839727e-09]], dtype=float32)

In [42]:
precision, recall, fscore, support = score(y_test_cat, predicted_lstm.round())
print('precision: {}'.format(precision))
print('recall: {}'.format(recall))
print('fscore: {}'.format(fscore))
print('support: {}'.format(support))
print("############################")
print(sklearn.metrics.classification_report(y_test_cat,predicted_lstm.round()))

precision: [0.98553719 1.        ]
recall: [1.         0.91304348]
fscore: [0.99271592 0.95454545]
support: [954 161]
############################
              precision    recall  f1-score   support

           0       0.99      1.00      0.99       954
           1       1.00      0.91      0.95       161

   micro avg       0.99      0.99      0.99      1115
   macro avg       0.99      0.96      0.97      1115
weighted avg       0.99      0.99      0.99      1115
 samples avg       0.99      0.99      0.99      1115



<h2>Bidirectional LSTM</h2>

In [43]:
print('Training Bidirectional LSTM model.')
model = Sequential()
model.add(Embedding(20000,
 embeds,
 input_length=300
 ))
model.add(Bidirectional(LSTM(16, return_sequences=True, dropout=0.1, recurrent_dropout=0.1)))
model.add(Conv1D(16, kernel_size = 3, padding = "valid", kernel_initializer = "glorot_uniform"))
model.add(GlobalMaxPool1D())
model.add(Dense(50, activation="relu"))
model.add(Dropout(0.1))
model.add(Dense(2,activation='softmax'))
model.compile(loss = 'binary_crossentropy',
optimizer='adam',metrics = ['accuracy'])
model.fit(training_data, y_train_cat,
 batch_size=16,
 epochs=3,
 validation_data=(testing_data, y_test_cat))

Training Bidirectional LSTM model.
Epoch 1/3
Epoch 2/3
Epoch 3/3


<tensorflow.python.keras.callbacks.History at 0x7fd528472b50>

In [44]:
predicted_blstm=model.predict(testing_data)
predicted_blstm

array([[1.0000000e+00, 1.4269067e-11],
       [1.0000000e+00, 1.4120646e-11],
       [1.0000000e+00, 3.2728870e-08],
       ...,
       [1.0000000e+00, 1.6519794e-11],
       [1.0000000e+00, 1.0365188e-11],
       [1.0000000e+00, 3.4784989e-08]], dtype=float32)

In [45]:
precision, recall, fscore, support = score(y_test_cat, predicted_blstm.round())
print('precision: {}'.format(precision))
print('recall: {}'.format(recall))
print('fscore: {}'.format(fscore))
print('support: {}'.format(support))
print("############################")
print(sklearn.metrics.classification_report(y_test_cat,predicted_blstm.round()))

precision: [0.98651452 0.98013245]
recall: [0.99685535 0.91925466]
fscore: [0.99165798 0.94871795]
support: [954 161]
############################
              precision    recall  f1-score   support

           0       0.99      1.00      0.99       954
           1       0.98      0.92      0.95       161

   micro avg       0.99      0.99      0.99      1115
   macro avg       0.98      0.96      0.97      1115
weighted avg       0.99      0.99      0.99      1115
 samples avg       0.99      0.99      0.99      1115



<h2>Next word prediction</h2>

Mechanisms such as autofills can help us understand the potential sequence of words that can be filled in front of an incomplete sentence. This technique is leveraged in different formats, mostly for email writing.

We will build an LSTM model to learn sequences of data from our spam texts.

In [48]:
df.head()

Unnamed: 0,label,text
0,ham,go jurong point crazy available bugis n great ...
1,ham,ok lar joking wif u oni
2,spam,free entry 2 wkly comp win fa cup final tkts 2...
3,ham,u dun say early hor u c already say
4,ham,nah think goes usf lives around though


In [50]:
df_listing = df.text.tolist()
df_listing[:10]

['go jurong point crazy available bugis n great world la e buffet cine got amore wat',
 'ok lar joking wif u oni',
 "free entry 2 wkly comp win fa cup final tkts 21st may 2005 text fa 87121 receive entry question(std txt ratetc's apply 08452810075over18's",
 'u dun say early hor u c already say',
 'nah think goes usf lives around though',
 "freemsg hey darling 3 week's word back i'd like fun still tb ok xxx std chgs send å£150 rcv",
 'even brother like speak me treat like aids patent',
 "per request 'melle melle (oru minnaminunginte nurungu vettam' set callertune callers press *9 copy friends callertune",
 'winner valued network customer selected receivea å£900 prize reward claim call 09061701461 claim code kl341 valid 12 hours only',
 'mobile 11 months more u r entitled update latest colour mobiles camera free call mobile update co free 08002986030']

In [54]:
# Convert the given list to strings
from collections import Iterable

def reduce_dims(items):
    for x in items:
        if isinstance(x, Iterable) and not isinstance(x, (str, bytes)):
            for sub_x in reduce_dims(x):
                yield sub_x
        else:
            yield x

In [56]:
string_final = ''.join(df_listing)

In [57]:
string_final = string_final.replace('\n', '')
string_final = string_final.lower()

In [60]:
pattern = r'[^a-zA-z0-9\s]'
string_final = re.sub(pattern, "", string_final)

In [61]:
tokens = tokenizer.tokenize(string_final)
tokens = [token.strip() for token in tokens]

In [74]:
total_words = Counter(tokens)
len(total_words)

13386

In [84]:
total_words.most_common()[:10]

[('u', 951),
 ('call', 525),
 ('2', 467),
 ('ur', 356),
 ('get', 346),
 ('im', 344),
 ('4', 282),
 ('go', 265),
 ('ltgt', 244),
 ('free', 221)]

In [86]:
words = [x[0] for x in total_words.most_common()]
words[:10]

['u', 'call', '2', 'ur', 'get', 'im', '4', 'go', 'ltgt', 'free']

In [91]:
sorted_words = list(sorted(words))
sorted_words[:10]

['0',
 '008704050406',
 '0089my',
 '0121',
 '01223585236',
 '01223585334',
 '0125698789',
 '02',
 '020603',
 '0207']

In [93]:
word_ind = {x: i for i, x in enumerate(sorted_words)}

In [94]:
# Decide on a sentence length
sentence_length = 25

<h2>Data preparation for modeling</h2>

We will be dividing all the data in our text column into sequences of words with fixed length of 10 words (we can modify this according to our requirements). We will be splititng the text based on word sequences, when we create the sequence, we can slide the window across the whole document one word at a time, allowing to learn from its predeceding one.

In [98]:
# Prepare input to output pairs encoded as integers
# input - sentence input 
# output - model output with index
inp = []
out = []
# As we need 11 words (10 words for sentence, 1 for output)
# We will set the for loop like this
for i in range(0, len(total_words) - sentence_length, 1):
    x = tokens[i:i+sentence_length]
    y = tokens[i+sentence_length]
    # Creating a vector
    inp.append([word_ind[char] for char in x])
    out.append(word_ind[y])

In [131]:
# Inverse dictionary
inv_dict = dict(map(reversed, word_ind.items()))

In [100]:
out[:1]

[12832]

Now that we have our input and output data in numerical format, we can proceed with one-hot encoding the target variables and training our model.

In [101]:
X = numpy.reshape(inp, (len(inp), sentence_length, 1))
# to_categorical for one-hot encoding
Y = np_utils.to_categorical(out)

In [102]:
Y

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)

<h2>Model building</h2>

We will be using LSTMs. We are using a single layer with 256 memory units. The model will use a dropout of 0.2. Softmax activation function is used alongside the ADAM optimizer.

In [103]:
model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2])))
model.add(Dropout(0.2))
model.add(Dense(Y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy',optimizer='adam')

In [104]:
file_name_path="weights-improvement-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(file_name_path, monitor='loss',
verbose=1, save_best_only=True, mode='min')
callbacks = [checkpoint]

**NOTE** - We have not split the training and testing data here, as we are not interested in the accuracy. Deep learning models require huge amounts of data and time to train, so we will be using all the data we have access to.

In [105]:
model.fit(X, Y, epochs=5, batch_size=128, callbacks=callbacks) 

Epoch 1/5

Epoch 00001: loss improved from inf to 8.67458, saving model to weights-improvement-01-8.6746.hdf5
Epoch 2/5

Epoch 00002: loss improved from 8.67458 to 8.05159, saving model to weights-improvement-02-8.0516.hdf5
Epoch 3/5

Epoch 00003: loss improved from 8.05159 to 7.88771, saving model to weights-improvement-03-7.8877.hdf5
Epoch 4/5

Epoch 00004: loss improved from 7.88771 to 7.68709, saving model to weights-improvement-04-7.6871.hdf5
Epoch 5/5

Epoch 00005: loss improved from 7.68709 to 7.59327, saving model to weights-improvement-05-7.5933.hdf5


<tensorflow.python.keras.callbacks.History at 0x7fd51c2d9350>

<h2>Generating random input to predict next word</h2>

In [163]:
# Generate random sequence
rand_val = numpy.random.randint(0, len(inp))
rand_val

10546

In [164]:
input_sentence = inp[rand_val]
input_sentence

[6502,
 7687,
 7925,
 4119,
 11551,
 5122,
 13023,
 4553,
 3612,
 1791,
 9489,
 4555,
 5592,
 2370,
 5434,
 644,
 11443,
 11999,
 9649,
 8315,
 7123,
 9943,
 7119,
 3379,
 2182]

In [165]:
X = numpy.reshape(input_sentence, (1, len(input_sentence), 1))

In [166]:
predict_word = model.predict(X, verbose=0)
index = numpy.argmax(predict_word)

In [170]:
result = inv_dict[index]
sent_in = [inv_dict[value] for value in input_sentence]
print(sent_in)
print ("\n")
print(result)

['knowal', 'moan', 'n', 'e', 'thin', 'goes', 'wrong', 'faultal', 'de', 'arguments', 'r', 'faultfed', 'himso', 'bother', 'hav', '2go', 'thanxxxneft', 'transaction', 'reference', 'number', 'ltgt', 'rs', 'ltdecimalgt', 'credited', 'beneficiary']


u


So, given the 25 input words, it's predicting the word “u” as the next
word. Of course, its not making much sense, since it has been trained on
much less data and epochs. Make sure you have great computation power
and train on huge data with high number of epochs.

Through this, we were successful in creating a model that can predict the next word based on a given sequence. This can further be improved with much larger corpus of text and bigger networks.