<h1>Introduction</h1>

This guide covers the basics of deep learning for NLP tasks. We first classify messages as spam or not spam using several deep learning architectures, including CNNs, RNNs, and LSTMs. We then show how to predict the next word in a given word sequence using a recurrent network (an LSTM).

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
from nltk.corpus import stopwords
from collections import Counter
from nltk import *
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem import WordNetLemmatizer
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

In [None]:
import numpy as np
import random
import pandas as pd
import sys
import os
import time
import codecs
import collections
import numpy
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils
from nltk.tokenize import sent_tokenize, word_tokenize
import scipy
from scipy import spatial
from nltk.tokenize.toktok import ToktokTokenizer
import re
tokenizer = ToktokTokenizer()

In [None]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.layers import Dense, Input, LSTM, Embedding,Dropout, Activation
from keras.layers import Bidirectional, GlobalMaxPool1D,Conv1D, SimpleRNN
from keras.models import Model
from sklearn.preprocessing import LabelEncoder
from keras.models import Sequential
from keras import initializers, regularizers, constraints,optimizers, layers
from keras.layers import Dense, Input, Flatten, Dropout,BatchNormalization
from keras.layers import Conv1D, MaxPooling1D, Embedding
from keras.models import Sequential
import sklearn
from sklearn.metrics import precision_recall_fscore_support as score

<h2>Text classification using deep learning</h2>

Our main objective here is to build a text classifier using neural networks. The basic NLP preprocessing pipeline stays the same; the new part is building the deep learning models on top of it:

In [None]:
# Importing data and checking it out
df = pd.read_csv('../input/sms-spam-collection-dataset/spam.csv', encoding='ISO-8859-1')
df.head()

In [None]:
df.shape

In [None]:
# Checking null values
df.isnull().sum()

In [None]:
# Extracting required columns
df = df[['v1', 'v2']]
df.head()

In [None]:
# Renaming columns
df.rename(columns={'v1':'label', 'v2':'text'}, inplace=True)

In [None]:
df.head()

In [None]:
df['text']

In [None]:
# Removing stop words (compared case-insensitively; lowercasing happens in the next cell)
stop = set(stopwords.words('english'))
df['text'] = df['text'].apply(lambda x: " ".join(w for w in x.split() if w.lower() not in stop))
df['text'].head()

In [None]:
# Converting to lowercase
df['text'] = df['text'].apply(lambda x: " ".join(x.lower() for x in x.split()))
df['text'].head()

In [None]:
# Removing symbols
df['text'] = df['text'].apply(lambda x:re.sub('[!@#$:).;,?&]', "", x.lower()))
df['text'].head()
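
The three cleaning steps above can also be folded into a single function. The following is a minimal sketch, not the notebook's own code: `clean_text` and the tiny `STOP_WORDS` set are hypothetical stand-ins (the notebook uses NLTK's full English stop-word list), while the symbol pattern is the one used above.

```python
import re

# Hypothetical mini stop-word list; the notebook uses NLTK's full English list.
STOP_WORDS = {"the", "is", "a", "to", "you", "have"}

def clean_text(text, stop_words=STOP_WORDS):
    """Drop stop words, lowercase the rest, then strip the listed symbols."""
    kept = [w for w in text.split() if w.lower() not in stop_words]
    lowered = " ".join(kept).lower()
    return re.sub(r"[!@#$:).;,?&]", "", lowered)

cleaned = clean_text("You have WON a prize! Call now.")
print(cleaned)
```

With pandas, such a function could be applied in one pass as `df['text'].apply(clean_text)`.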

In [None]:
df.isnull().sum()

In [None]:
# Passing the whole DataFrame returns just two splits (training and testing rows)
training, testing = train_test_split(df, test_size=0.2)

In [None]:
print(training.shape)
print(testing.shape)

In [None]:
# Longest message length in characters (not words); we pad sequences to 300 tokens below
np.max(df['text'].apply(lambda x: len(x)))

In [None]:
# We will take the top 20,000 most frequently occurring words
words = 20000
tokenizer = Tokenizer(num_words=words)

`fit_on_texts` updates the tokenizer's internal vocabulary from a list of texts, building a word -> index dictionary ordered by word frequency. Given "The cat sat on the mat.", it produces `word_index["the"] = 1; word_index["cat"] = 2`, and so on, so every word gets a unique integer. Index 0 is reserved for padding. A lower integer means a more frequent word (the first few are often stop words because they appear a lot). Through `num_words`, we keep only the most frequent words, i.e. the ones with the lowest integer values.

`texts_to_sequences` transforms each text in texts into a sequence of integers: it takes each word in the text and replaces it with its corresponding integer from the `word_index` dictionary. Nothing more, nothing less, certainly no magic involved.
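
To make the two steps concrete, here is a minimal pure-Python sketch of the same idea. It is not the Keras implementation: `build_word_index` and `to_sequences` are hypothetical helper names, punctuation handling is omitted, and ties between equally frequent words are broken by first appearance.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency: index 1 is the most frequent, 0 stays free for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() orders by count; equal counts keep first-seen order
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index, num_words=None):
    """Replace each known word by its index; unknown or too-rare words are skipped."""
    sequences = []
    for text in texts:
        ids = [word_index[w] for w in text.lower().split() if w in word_index]
        if num_words is not None:
            ids = [i for i in ids if i < num_words]
        sequences.append(ids)
    return sequences

corpus = ["the cat sat on the mat", "the dog sat"]
word_index = build_word_index(corpus)
sequences = to_sequences(["the cat barked"], word_index)
print(word_index)
print(sequences)
```

Here "barked" never appeared in the corpus, so it is simply dropped from the resulting sequence.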

In [None]:
tokenizer.fit_on_texts(training.text)

In [None]:
train_seq = tokenizer.texts_to_sequences(training.text)
test_seq = tokenizer.texts_to_sequences(testing.text)

In [None]:
import itertools

In [None]:
# Dictionary for the words and the index
word_index = tokenizer.word_index
print(dict(itertools.islice(word_index.items(), 50)))
print()
print('Found %s unique tokens '%len(word_index))

`pad_sequences` ensures that all sequences in a list have the same length. By default it pads with 0 at the beginning of each sequence until every sequence is as long as the longest one.
When `maxlen` is given, shorter sequences are padded to that length and longer ones are truncated to fit.
Where padding and truncation happen is controlled by the `padding` and `truncating` arguments, respectively. The default for both is `'pre'`, i.e. padding at, or removing values from, the beginning of the sequence.
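
The default behaviour described above can be sketched in a few lines of plain Python. `pad_pre` is a hypothetical helper written for illustration only; it mimics pre-padding and pre-truncation and is not the Keras function itself.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences with `value`; keep only the last `maxlen` items of long ones."""
    padded = []
    for seq in sequences:
        trimmed = list(seq)[-maxlen:]  # pre-truncation drops values from the front
        padded.append([value] * (maxlen - len(trimmed)) + trimmed)
    return padded

result = pad_pre([[5, 2], [7, 1, 4, 9, 3]], maxlen=4)
print(result)
```

The short sequence gains two leading zeros, and the long one loses its first element.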

In [None]:
# Padding data for equal lengths, for our models
training_data = pad_sequences(train_seq, maxlen=300)
testing_data = pad_sequences(test_seq, maxlen=300)

In [None]:
print(training_data.shape)
print(testing_data.shape)

In [None]:
y_train = training['label']
y_test = testing['label']

In [None]:
le = LabelEncoder()
le.fit(y_train)
y_train = le.transform(y_train)
y_test = le.transform(y_test)
print(le.classes_)

In [None]:
y_train.shape

In [None]:
y_train

In [None]:
y_test.shape

In [None]:
# Converting the labels to categorical
# To pass through our model
y_train_cat = to_categorical(np.asarray(y_train))
y_test_cat = to_categorical(np.asarray(y_test))
print('Shape of data tensor', training_data.shape)
print('Shape of label tensors (training)', y_train_cat.shape)
print('Shape of label tensors (testing)', y_test_cat.shape)

In [None]:
y_train_cat
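
As a rough illustration of what the label encoding and one-hot steps produce, the sketch below reimplements both in plain Python. `encode_labels` and `one_hot` are hypothetical helpers, assumed here to behave like `LabelEncoder` (classes in sorted order) and `to_categorical` on this two-class problem.

```python
def encode_labels(labels):
    """Map each distinct label to an integer, in sorted order of the label names."""
    classes = sorted(set(labels))
    lookup = {name: i for i, name in enumerate(classes)}
    return classes, [lookup[name] for name in labels]

def one_hot(indices, num_classes):
    """Turn each class index into a vector with a single 1.0 at that index."""
    return [[1.0 if i == idx else 0.0 for i in range(num_classes)] for idx in indices]

classes, ids = encode_labels(["ham", "spam", "ham"])
encoded = one_hot(ids, len(classes))
print(classes, ids)
print(encoded)
```

Each row has one column per class, which is the shape a two-unit softmax output layer expects.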

In [None]:
# Defining our embedding dimension
embeds = 100

<h2>Model building and predicting</h2>

We build models using several deep learning approaches
(CNN, RNN, LSTM, and Bidirectional LSTM) and compare their
performance using precision, recall, and F1 score.
We start with the CNN model.
It stacks two 1D convolutional layers with 128 filters each (kernel
size 5), each followed by max pooling, dropout with a probability of 0.5,
and batch normalization. A dense layer with 128 units follows, and the
output layer is a dense layer using the softmax activation function to
output a probability prediction.

In [None]:
print('Training CNN 1D model')
model = Sequential()
# 20000 was our maximum word number in the tokenizer
model.add(Embedding(20000,
 embeds,
 input_length=300
 ))
model.add(Dropout(0.5))
model.add(Conv1D(128, 5, activation='relu'))
model.add(MaxPooling1D(5))
model.add(Dropout(0.5))
model.add(BatchNormalization())
model.add(Conv1D(128, 5, activation='relu'))
model.add(MaxPooling1D(5))
model.add(Dropout(0.5))
model.add(BatchNormalization())
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy',
 optimizer='rmsprop',
 metrics=['acc'])
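
It can help to trace how the sequence length shrinks through this stack. The sketch below does the arithmetic by hand, assuming the Keras defaults used above (`padding='valid'`, convolution stride 1, pooling stride equal to the pool size); the helper names are hypothetical.

```python
def conv1d_valid(length, kernel_size):
    """Output length of a stride-1 convolution without padding."""
    return length - kernel_size + 1

def max_pool1d(length, pool_size):
    """Output length of max pooling whose stride equals the pool size."""
    return (length - pool_size) // pool_size + 1

steps = 300                    # padded sequence length
steps = conv1d_valid(steps, 5) # first Conv1D
steps = max_pool1d(steps, 5)   # first MaxPooling1D
steps = conv1d_valid(steps, 5) # second Conv1D
steps = max_pool1d(steps, 5)   # second MaxPooling1D
flattened = steps * 128        # 128 filters per remaining time step
print(steps, flattened)
```

Calling `model.summary()` on the compiled model is the usual way to check these shapes against the real layers.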

In [None]:
model.fit(training_data, y_train_cat, batch_size=64, epochs=5, validation_data = (testing_data, y_test_cat))

In [None]:
predicted=model.predict(testing_data)
predicted

In [None]:
# Metrics
precision, recall, fscore, support = score(y_test_cat,predicted.round())
print('precision: {}'.format(precision))
print('recall: {}'.format(recall))
print('fscore: {}'.format(fscore))
print('support: {}'.format(support))
print("############################")
print(sklearn.metrics.classification_report(y_test_cat,predicted.round()))
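
For intuition about the numbers reported above, the sketch below computes precision, recall, and F1 by hand for one class. It uses made-up toy probabilities rather than model output, and it picks the predicted class with an argmax instead of rounding; the helper names are hypothetical.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for a single class treated as the positive one."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy softmax outputs (columns: ham, spam) and true class indices
toy_probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]
toy_true = [0, 1, 1, 0]
toy_pred = [argmax(row) for row in toy_probs]
precision, recall, f1 = precision_recall_f1(toy_true, toy_pred, positive=1)
print(toy_pred, precision, recall, f1)
```

Precision asks how many predicted spam messages really were spam, and recall asks how many real spam messages were caught.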

<h2>RNN model</h2>

In [None]:
print('Training SIMPLERNN model.')
model = Sequential()
model.add(Embedding(20000,
 embeds,
 input_length=300
 ))
model.add(SimpleRNN(2))  # input shape is inferred from the Embedding layer
model.add(Dense(2,activation='softmax'))
model.compile(loss = 'binary_crossentropy',
optimizer='adam',metrics = ['accuracy'])
model.fit(training_data, y_train_cat,
 batch_size=16,
 epochs=5,
 validation_data=(testing_data, y_test_cat))

In [None]:
# probabilities
predicted_Srnn=model.predict(testing_data)
predicted_Srnn

In [None]:
precision, recall, fscore, support = score(y_test_cat, predicted_Srnn.round())
print('precision: {}'.format(precision))
print('recall: {}'.format(recall))
print('fscore: {}'.format(fscore))
print('support: {}'.format(support))
print("############################")
print(sklearn.metrics.classification_report(y_test_cat,predicted_Srnn.round()))

<h2>LSTM model</h2>

In [None]:
print('Training LSTM model.')
model = Sequential()
model.add(Embedding(20000,
 embeds,
 input_length=300
 ))
model.add(LSTM(16, activation='relu',return_sequences=True))
model.add(Dropout(0.2))
model.add(BatchNormalization())
model.add(Flatten())
model.add(Dense(2,activation='softmax'))

model.compile(loss = 'binary_crossentropy',
optimizer='adam',metrics = ['accuracy'])
model.fit(training_data, y_train_cat,
 batch_size=16,
 epochs=5,
 validation_data=(testing_data, y_test_cat))

In [None]:
predicted_lstm=model.predict(testing_data)
predicted_lstm

In [None]:
precision, recall, fscore, support = score(y_test_cat, predicted_lstm.round())
print('precision: {}'.format(precision))
print('recall: {}'.format(recall))
print('fscore: {}'.format(fscore))
print('support: {}'.format(support))
print("############################")
print(sklearn.metrics.classification_report(y_test_cat,predicted_lstm.round()))

<h2>Bidirectional LSTM</h2>

In [None]:
print('Training Bidirectional LSTM model.')
model = Sequential()
model.add(Embedding(20000,
 embeds,
 input_length=300
 ))
model.add(Bidirectional(LSTM(16, return_sequences=True, dropout=0.1, recurrent_dropout=0.1)))
model.add(Conv1D(16, kernel_size = 3, padding = "valid", kernel_initializer = "glorot_uniform"))
model.add(GlobalMaxPool1D())
model.add(Dense(50, activation="relu"))
model.add(Dropout(0.1))
model.add(Dense(2,activation='softmax'))
model.compile(loss = 'binary_crossentropy',
optimizer='adam',metrics = ['accuracy'])
model.fit(training_data, y_train_cat,
 batch_size=16,
 epochs=3,
 validation_data=(testing_data, y_test_cat))

In [None]:
predicted_blstm=model.predict(testing_data)
predicted_blstm

In [None]:
precision, recall, fscore, support = score(y_test_cat, predicted_blstm.round())
print('precision: {}'.format(precision))
print('recall: {}'.format(recall))
print('fscore: {}'.format(fscore))
print('support: {}'.format(support))
print("############################")
print(sklearn.metrics.classification_report(y_test_cat,predicted_blstm.round()))

<h2>Next word prediction</h2>

Autocomplete features suggest the words most likely to follow an incomplete sentence. The technique is used in many forms, most visibly in email writing.

We will build an LSTM model to learn word sequences from our SMS texts.

In [None]:
df.head()

In [None]:
df_listing = df.text.tolist()
df_listing[:10]

In [None]:
# Helper that flattens arbitrarily nested lists into a flat stream of items
from collections.abc import Iterable

def reduce_dims(items):
    for x in items:
        if isinstance(x, Iterable) and not isinstance(x, (str, bytes)):
            for sub_x in reduce_dims(x):
                yield sub_x
        else:
            yield x

In [None]:
# Join with a space so the last word of one message does not fuse with the first word of the next
string_final = ' '.join(df_listing)

In [None]:
string_final = string_final.replace('\n', '')
string_final = string_final.lower()

In [None]:
pattern = r'[^a-zA-Z0-9\s]'
string_final = re.sub(pattern, "", string_final)

In [None]:
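# `tokenizer` was rebound to the Keras Tokenizer earlier, so create a fresh Toktok tokenizer here
tokens = ToktokTokenizer().tokenize(string_final)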
tokens = [token.strip() for token in tokens]

In [None]:
total_words = Counter(tokens)
len(total_words)

In [None]:
total_words.most_common()[:10]

In [None]:
words = [x[0] for x in total_words.most_common()]
words[:10]

In [None]:
sorted_words = list(sorted(words))
sorted_words[:10]

In [None]:
word_ind = {x: i for i, x in enumerate(sorted_words)}

In [None]:
# Decide on a sentence length
sentence_length = 25

<h2>Data preparation for modeling</h2>

We will divide all the text in our text column into word sequences with a fixed length of 25 words (the `sentence_length` set above; adjust it as required). To create the sequences we slide a window across the whole token stream one word at a time, so that each window of 25 words is paired with the word that follows it and the model learns to predict a word from its preceding words.

In [None]:
# Prepare input to output pairs encoded as integers
# input - sentence input 
# output - model output with index
inp = []
out = []
# Each pair needs 26 words (25 for the input sentence, 1 for the output),
# so the window stops sentence_length tokens before the end of `tokens`
for i in range(0, len(tokens) - sentence_length, 1):
    x = tokens[i:i+sentence_length]
    y = tokens[i+sentence_length]
    # Encoding each word as its integer index
    inp.append([word_ind[word] for word in x])
    out.append(word_ind[y])
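
The sliding window is easier to see on a short example. The sketch below uses a window of 3 words on a made-up six-word token list and keeps the words as strings instead of integer indices; `make_windows` is a hypothetical helper, not part of the notebook.

```python
def make_windows(tokens, window):
    """Pair every run of `window` consecutive tokens with the token that follows it."""
    inputs, targets = [], []
    for start in range(len(tokens) - window):
        inputs.append(tokens[start:start + window])
        targets.append(tokens[start + window])
    return inputs, targets

toy_tokens = "free entry in a weekly competition".split()
toy_inputs, toy_targets = make_windows(toy_tokens, window=3)
for words, target in zip(toy_inputs, toy_targets):
    print(words, "->", target)
```

Six tokens with a window of 3 give three training pairs, one for each position where a full window and a following word both exist.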

In [None]:
# Inverse dictionary
inv_dict = dict(map(reversed, word_ind.items()))

In [None]:
out[:1]

Now that we have our input and output data in numerical format, we can proceed with one-hot encoding the target variables and training our model.

In [None]:
X = numpy.reshape(inp, (len(inp), sentence_length, 1))
# to_categorical for one-hot encoding
Y = to_categorical(out)

In [None]:
Y

<h2>Model building</h2>

We will use an LSTM with a single layer of 256 memory units, followed by dropout of 0.2. The output layer uses the softmax activation function, and the model is trained with the Adam optimizer.

In [None]:
model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2])))
model.add(Dropout(0.2))
model.add(Dense(Y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy',optimizer='adam')
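
To get a feel for the size of this network, the trainable parameters can be counted by hand. The sketch below assumes the standard LSTM layout (four gates, each with an input kernel, a recurrent kernel, and a bias). The helper names are hypothetical, and the vocabulary size of 1000 is an arbitrary illustrative value, not the real `Y.shape[1]`.

```python
def lstm_param_count(units, input_dim):
    """Four gates, each with input weights, recurrent weights and a bias vector."""
    return 4 * (units * (input_dim + units) + units)

def dense_param_count(inputs, outputs):
    """One weight per input-output pair plus one bias per output."""
    return inputs * outputs + outputs

lstm_params = lstm_param_count(units=256, input_dim=1)
# Hypothetical vocabulary size, for illustration only
dense_params = dense_param_count(inputs=256, outputs=1000)
print(lstm_params, dense_params)
```

Because the output layer has one unit per vocabulary word, it grows linearly with the vocabulary and can easily hold more weights than the LSTM itself. `model.summary()` reports the actual counts.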

In [None]:
file_name_path="weights-improvement-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(file_name_path, monitor='loss',
verbose=1, save_best_only=True, mode='min')
callbacks = [checkpoint]

**NOTE** - We have not split the training and testing data here, as we are not interested in the accuracy. Deep learning models require huge amounts of data and time to train, so we will be using all the data we have access to.

In [None]:
model.fit(X, Y, epochs=5, batch_size=128, callbacks=callbacks) 

<h2>Generating random input to predict next word</h2>

In [None]:
# Generate random sequence
rand_val = numpy.random.randint(0, len(inp))
rand_val

In [None]:
input_sentence = inp[rand_val]
input_sentence

In [None]:
X = numpy.reshape(input_sentence, (1, len(input_sentence), 1))

In [None]:
predict_word = model.predict(X, verbose=0)
index = numpy.argmax(predict_word)

In [None]:
result = inv_dict[index]
sent_in = [inv_dict[value] for value in input_sentence]
print(sent_in)
print ("\n")
print(result)
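
The argmax above keeps only the single most likely word. A small variation is to look at the few most likely candidates instead. The sketch below does this on a made-up probability vector and a made-up four-word vocabulary; `top_k_words`, `toy_vocab`, and `toy_probs` are hypothetical and stand in for the model output and `inv_dict`.

```python
def top_k_words(probabilities, index_to_word, k=3):
    """Return the k most probable words with their probabilities, best first."""
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: probabilities[i], reverse=True)
    return [(index_to_word[i], probabilities[i]) for i in ranked[:k]]

# Toy stand-ins for the inverse dictionary and one row of softmax output
toy_vocab = {0: "call", 1: "free", 2: "now", 3: "u"}
toy_probs = [0.10, 0.25, 0.05, 0.60]
best = top_k_words(toy_probs, toy_vocab, k=2)
print(best)
```

With the real model, the first row of `predict_word` would take the place of `toy_probs`.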

So, given the 25 input words, the model predicts the word “u” as the next
word. Of course, this does not make much sense, since the model was trained
on little data and for only a few epochs. Better predictions need far more
training data, more epochs, and correspondingly more compute.

With this, we have built a model that can predict the next word for a given sequence. It can be improved further with a much larger corpus of text and bigger networks.