
# Deep Learning for Text Classification

Deep learning is a family of machine learning algorithms in which learning happens through different kinds of multilayered neural network architectures. Over the past few years, it has shown remarkable improvements on standard machine learning tasks such as:

- Image classification
- Speech recognition
- Machine translation

This has resulted in widespread interest in using deep learning for various tasks, including text classification. So far, we’ve seen how to train different machine learning classifiers using BoW and different kinds of embedding representations.

Now, let’s look at how to use deep learning architectures for text classification.

Two of the most commonly used neural network architectures for text classification are:

- Convolutional neural networks (CNNs)
- Recurrent neural networks (RNNs)

Long short-term memory (LSTM) networks are a popular form of RNN. Recent approaches also involve starting with large, pre-trained language models and fine-tuning them for the task at hand.

In this notebook, we’ll learn how to train CNNs and LSTMs and how to tune a pre-trained language model for text classification using the IMDB sentiment classification dataset.

## Setup

In [1]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.layers import Dense, Input, GlobalMaxPooling1D
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Embedding, LSTM
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.initializers import Constant

import os
import sys
import numpy as np

Here we download the external datasets and models used in this notebook: the [GloVe](https://nlp.stanford.edu/projects/glove/) embeddings and the [IMDB reviews dataset](http://ai.stanford.edu/~amaas/data/sentiment/).

In [2]:
!wget http://nlp.stanford.edu/data/glove.6B.zip
!wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz

--2020-11-06 09:44:14--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2020-11-06 09:44:14--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2020-11-06 09:44:14--  http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


2020-1

In [None]:
# unzip glove file
!unzip glove.6B.zip
# untar IMDB dataset
!tar xvzf aclImdb_v1.tar.gz

In [11]:
# move the IMDB data into the data directory (-p avoids an error if the directory already exists)
!mkdir -p data
!mv aclImdb  data


In [12]:
# delete "unsup" directory
!rm -rf data/aclImdb/train/unsup

In [6]:
# move the GloVe files into the data directory
!mkdir -p data/glove.6B
!mv *.txt data/glove.6B

In [7]:
BASE_DIR = 'data'                                        # change this to the local folder that holds the datasets below
GLOVE_DIR = os.path.join(BASE_DIR, 'glove.6B')           # source: https://nlp.stanford.edu/projects/glove/
TRAIN_DATA_DIR = os.path.join(BASE_DIR, 'aclImdb/train') # source: http://ai.stanford.edu/~amaas/data/sentiment/
TEST_DATA_DIR = os.path.join(BASE_DIR, 'aclImdb/test')

# Each of these directories contains only a pos/ and a neg/ folder of text files
MAX_SEQUENCE_LENGTH = 1000
MAX_NUM_WORDS = 20000 
EMBEDDING_DIM = 100 
VALIDATION_SPLIT = 0.2

In [8]:
# Load the data from the dataset into the notebook. Called twice: once for train, once for test.
def get_data(data_dir):
  # list of text samples
  texts = []
  # dictionary mapping label name to numeric id
  labels_index = {"pos": 1, "neg": 0}
  # list of label ids
  labels = []

  for name in sorted(os.listdir(data_dir)):
    path = os.path.join(data_dir, name)
    # only the pos/ and neg/ folders are expected here (unsup/ was deleted above)
    if os.path.isdir(path):
      label_id = labels_index[name]
      for fname in sorted(os.listdir(path)):
        fpath = os.path.join(path, fname)
        # read as UTF-8 and close the file handle as soon as we are done
        with open(fpath, encoding="utf-8") as f:
          texts.append(f.read())
        labels.append(label_id)
  return texts, labels

In [13]:
train_texts, train_labels = get_data(TRAIN_DATA_DIR)
test_texts, test_labels = get_data(TEST_DATA_DIR)

labels_index = {"pos": 1, "neg": 0}

# Print a couple of samples to see what the data looks like.
print(train_texts[0])
print(train_labels[0])
print(test_texts[24999])
print(test_labels[24999])

Story of a man who has unnatural feelings for a pig. Starts out with a opening scene that is a terrific example of absurd comedy. A formal orchestra audience is turned into an insane, violent mob by the crazy chantings of it's singers. Unfortunately it stays absurd the WHOLE time with no general narrative eventually making it just too off putting. Even those from the era should be turned off. The cryptic dialogue would make Shakespeare seem easy to a third grader. On a technical level it's better than you might think with some good cinematography by future great Vilmos Zsigmond. Future stars Sally Kirkland and Frederic Forrest can be seen briefly.
0
I've seen this story before but my kids haven't. Boy with troubled past joins military, faces his past, falls in love and becomes a man. The mentor this time is played perfectly by Kevin Costner; An ordinary man with common everyday problems who lives an extraordinary conviction, to save lives. After losing his team he takes a teaching posi

## Loading and Preprocessing

The first step toward training any ML or DL model is to define a feature representation. This step has been relatively straightforward in the approaches we’ve seen so far, with BoW or embedding vectors. 

However, for neural networks, we need further processing of the input vectors. Let’s quickly recap the steps involved in converting training and test data into a format suitable for the neural network input layers:

1. Tokenize the texts and convert them into word index vectors.
2. Pad the text sequences so that all text vectors are of the same length.
3. Map every word index to an embedding vector. We do that by looking up the row of the embedding matrix that matches each word index (equivalent to multiplying a one-hot vector by the matrix). The embedding matrix can either be populated using pre-trained embeddings or be trained on this corpus.
4. Use the output from Step 3 as the input to a neural network architecture.

Once these are done, we can proceed with the specification of neural network architectures and training classifiers with them.

Let's vectorize these text samples into a 2D integer tensor using the Keras `Tokenizer`.

The tokenizer is fit on the training data only, and the same fitted tokenizer is then used to convert both the train and test data.
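To make the effect of fitting on training data only more concrete, here is a minimal plain-Python sketch. It is not the Keras implementation: `fit_word_index` and `texts_to_sequences` are hypothetical helpers that only mimic the general idea (words ranked by frequency, index 0 left free for padding, unknown words dropped), and the toy sentences are made up.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; index 1 is the most frequent word (0 is left for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() keeps first-seen order among words with equal counts
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index, num_words=None):
    """Map each text to word indices, dropping words that are unknown or ranked too low."""
    sequences = []
    for text in texts:
        seq = [word_index[w] for w in text.lower().split()
               if w in word_index and (num_words is None or word_index[w] < num_words)]
        sequences.append(seq)
    return sequences

train_docs = ["good movie", "bad movie", "good good plot"]
test_docs = ["good acting bad plot"]

word_index = fit_word_index(train_docs)
print(word_index)                                 # {'good': 1, 'movie': 2, 'bad': 3, 'plot': 4}
print(texts_to_sequences(test_docs, word_index))  # [[1, 3, 4]]  ("acting" was never seen in training)
```

Because the vocabulary comes from the training texts alone, a word that appears only in the test data simply disappears from its sequence.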

In [14]:
# Step-1: Tokenize the texts and convert them into word index vectors.
tokenizer = Tokenizer(num_words=MAX_NUM_WORDS)
tokenizer.fit_on_texts(train_texts)

# Converting text to a vector of word indexes
train_sequences = tokenizer.texts_to_sequences(train_texts)  
test_sequences = tokenizer.texts_to_sequences(test_texts)

word_index = tokenizer.word_index
print("Found %s unique tokens." % len(word_index))

Found 88582 unique tokens.


Next, we pad these sequences so they can be fed into a neural network. The maximum sequence length is 1000, as set earlier: shorter sequences get initial padding of 0s until each vector is of size `MAX_SEQUENCE_LENGTH`, and longer ones are truncated to that length.
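As a rough illustration of what initial padding means, the sketch below pads on the left and, for sequences that are too long, keeps only the last `maxlen` items. `pad_pre` is a hypothetical helper written for this example, and it assumes `maxlen` is at least 1.

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items if the sequence is too long."""
    trimmed = sequence[-maxlen:]
    return [value] * (maxlen - len(trimmed)) + trimmed

print(pad_pre([5, 8, 2], maxlen=5))           # [0, 0, 5, 8, 2]
print(pad_pre([1, 2, 3, 4, 5, 6], maxlen=4))  # [3, 4, 5, 6]
```

Every row ends up with the same length, which is what lets the whole dataset be stored as one 2D integer array.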

In [25]:
# Step-2: Pad the text sequences so that all text vectors are of the same length.
trainvalid_data = pad_sequences(train_sequences, maxlen=MAX_SEQUENCE_LENGTH)
test_data = pad_sequences(test_sequences, maxlen=MAX_SEQUENCE_LENGTH)

trainvalid_labels = to_categorical(np.asarray(train_labels))
test_labels = to_categorical(np.asarray(test_labels))

# split the training data into a training set and a validation set
indices = np.arange(trainvalid_data.shape[0])
np.random.shuffle(indices)

trainvalid_data = trainvalid_data[indices]
trainvalid_labels = trainvalid_labels[indices]

num_validation_samples = int(VALIDATION_SPLIT * trainvalid_data.shape[0])

# This is the data we will use for CNN and RNN training
x_train = trainvalid_data[:-num_validation_samples]
y_train = trainvalid_labels[:-num_validation_samples]
x_val = trainvalid_data[-num_validation_samples:]
y_val = trainvalid_labels[-num_validation_samples:]

print("Training set : ", (x_train.shape, y_train.shape))
print("Validation set : ", (x_val.shape, y_val.shape))
print("Splitting the train data into train and valid is done")

Training set :  ((20000, 1000), (20000, 2))
Validation set :  ((5000, 1000), (5000, 2))
Splitting the train data into train and valid is done
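The second dimension of 2 in the label shapes comes from one-hot encoding the integer labels with `to_categorical`. A minimal plain-Python sketch of that encoding follows; `one_hot_labels` is a hypothetical stand-in for illustration, not the Keras function.

```python
def one_hot_labels(labels, num_classes):
    """Turn integer class ids into one-hot rows, e.g. 1 -> [0.0, 1.0] for two classes."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0
        rows.append(row)
    return rows

encoded = one_hot_labels([0, 1, 1], num_classes=2)
print(encoded)  # [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
```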


**Step 3**: If we want to use pre-trained embeddings to convert the train and test data into an embedding matrix like we did in the earlier examples with Word2vec and fastText, we have to download them and use them to convert our data into the input format for the neural networks.

GloVe embeddings come with multiple dimensionalities, and we chose 100 as our dimension here. The value of dimensionality is a hyperparameter, and we can experiment with other dimensions as well.

In [26]:
# Step-3: Map every word index to an embedding vector by looking up the matching row of the embedding matrix.
# The embedding matrix can either be populated using pre-trained embeddings or be trained on this corpus.

# first, build index mapping words in the embeddings set to their embedding vector
embeddings_index = {}
with open(os.path.join(GLOVE_DIR, "glove.6B.100d.txt"), encoding="utf-8") as f:
  for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype="float32")
    embeddings_index[word] = coefs
print("Found %s word vectors in Glove embeddings." % len(embeddings_index))
print(embeddings_index["google"])

# prepare embedding matrix - rows are the words from word_index, columns are the embeddings of that word from glove.
num_words = min(MAX_NUM_WORDS, len(word_index)) + 1
embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))
for word, i in word_index.items():
  if i > MAX_NUM_WORDS:
    continue
  embedding_vector = embeddings_index.get(word)
  if embedding_vector is not None:
    # words not found in embedding index will be all-zeros.
    embedding_matrix[i] = embedding_vector

Found 400000 word vectors in Glove embeddings.
[ 0.22575  -0.56253  -0.05156  -0.079389  1.1876   -0.48397  -0.23342
 -0.85278   0.97495  -0.33344   0.71692   0.12644   0.31962  -1.4136
 -0.57903  -0.037286 -0.0164    0.45155  -0.29005   0.52599  -0.22534
 -0.29556  -0.032407  1.5608   -0.013499 -0.064558  0.26625   0.78595
 -0.71693  -0.93025   0.80461   1.6035   -0.30602  -0.34764   0.93872
  0.38137  -0.26743  -0.56519   0.58899  -0.14554  -0.34324   0.21291
 -0.39887   0.090042 -0.8495    0.38803  -0.5045   -0.22488   1.0644
 -0.2624    1.0334    0.06348  -0.39989   0.24236  -0.65636  -1.8107
 -0.061801  0.13795   1.1658   -0.30046  -0.50143   0.16509   0.039835
  0.62541   0.56935   0.64125   0.21308   0.30276   0.39673   0.38973
  0.28183   0.79481  -0.11962  -0.49598  -0.53195  -0.14897   0.51254
 -0.39208  -0.58535  -0.078509  0.81721  -0.73497  -0.68131   0.099243
 -0.87608   0.029632  0.33402  -0.14305   0.16964  -0.035178  0.39777
  0.71769   0.25867  -0.36201   0.45698  -0.

The input layer for textual input is typically an embedding layer. The output layer, especially in the context of text classification, is a softmax layer with categorical output. If we want to train the input layer instead of using pre-trained embeddings, the easiest way is to call the Embedding layer class in Keras, specifying the input and output dimensions.

However, since we want to use pre-trained embeddings, we should create a custom
embedding layer that uses the embedding matrix we just built.
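Conceptually, an embedding layer is a row lookup into the embedding matrix, which gives the same result as multiplying a one-hot vector by that matrix. The sketch below checks this on a made-up 4 x 2 matrix in plain Python; the values and helper names are illustrative assumptions, not part of the notebook's pipeline.

```python
# Toy embedding matrix: one row per word index, row 0 reserved for padding.
embedding_matrix = [
    [0.0, 0.0],
    [0.1, 0.2],
    [0.3, 0.4],
    [0.5, 0.6],
]

def one_hot(index, size):
    """Vector of zeros with a single 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def row_times_matrix(row, matrix):
    """Multiply a 1 x n row vector by an n x d matrix."""
    dims = len(matrix[0])
    return [sum(row[i] * matrix[i][d] for i in range(len(matrix))) for d in range(dims)]

padded_sequence = [0, 2, 3]
by_lookup = [embedding_matrix[i] for i in padded_sequence]
by_matmul = [row_times_matrix(one_hot(i, len(embedding_matrix)), embedding_matrix)
             for i in padded_sequence]
print(by_lookup == by_matmul)  # True
```

The lookup form is what makes embedding layers cheap: no multiplication by a vocabulary-sized vector is ever carried out.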

In [27]:
# Step-4: load the pre-trained word embeddings into an Embedding layer; trainable=False keeps the embeddings fixed
embedding_layer = Embedding(num_words, EMBEDDING_DIM, embeddings_initializer=Constant(embedding_matrix),
                            input_length=MAX_SEQUENCE_LENGTH, trainable=False)
print("Preparing of embedding matrix is done")

Preparing of embedding matrix is done


This will serve as the input layer for any neural network we want to use (CNN or LSTM). Now that we know how to pre-process the input and define an input layer, let’s move on to specifying the rest of the neural network architecture using CNNs and LSTMs.

## CNNs for Text Classification

CNNs typically consist of a series of convolution and pooling layers as the hidden layers. In the context of text classification, CNNs can be thought of as learning the most useful bag-of-words/n-grams features instead of taking the entire collection of words/n-grams as features. 
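One way to picture a convolution filter as an n-gram feature detector is the toy sketch below. It assumes hypothetical 2-dimensional word vectors and a hand-picked bigram filter; in a real model, both the embeddings and the filter weights are learned.

```python
def conv1d_relu(embedded, kernel):
    """Slide `kernel` (k word-sized weight vectors) over the sequence; one ReLU score per window."""
    k = len(kernel)
    scores = []
    for start in range(len(embedded) - k + 1):
        window = embedded[start:start + k]
        score = sum(x * w for vec, wvec in zip(window, kernel) for x, w in zip(vec, wvec))
        scores.append(max(0.0, score))
    return scores

# Hypothetical 2-d embeddings, chosen by hand for this example.
toy_embeddings = {"not": [1.0, 0.0], "good": [0.0, 1.0], "a": [0.0, 0.0], "movie": [0.0, 0.0]}
bigram_filter = [[1.0, 0.0], [0.0, 1.0]]  # responds most strongly to "not" followed by "good"

def feature(sentence):
    """Global max pooling: keep the strongest filter response anywhere in the sentence."""
    embedded = [toy_embeddings[w] for w in sentence.split()]
    return max(conv1d_relu(embedded, bigram_filter))

print(feature("a not good movie"))  # 2.0
print(feature("a good movie"))      # 1.0
```

The filter fires hardest wherever its bigram occurs, and the max pooling step keeps that response regardless of position in the text.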

Since our dataset has only two classes—positive and negative—the output layer has two outputs, with the softmax activation function. We’ll define a CNN with three convolution-pooling layers using the Sequential model class in Keras, which allows us to specify DL models as a sequential stack of layers—one after another. 

Once the layers and their activation functions are specified, the next task is to define other important parameters, such as the optimizer, loss function, and the evaluation metric to tune the hyperparameters of the model. Once all this is done, the next step is to train and evaluate the model.
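To make the output layer and the loss function less abstract, here is a small plain-Python sketch of softmax and categorical cross-entropy for a two-class problem. The logits are made-up numbers, and the functions are simplified illustrations rather than the Keras implementations.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(logits)                              # subtract the max for numerical stability
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(one_hot_target, probs):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probs))

probs = softmax([2.0, 0.0])                                  # the model favors class 0
loss_if_class0 = categorical_crossentropy([1.0, 0.0], probs)
loss_if_class1 = categorical_crossentropy([0.0, 1.0], probs)
print(probs, loss_if_class0, loss_if_class1)
```

The loss is small when the model puts high probability on the true class and grows when it is confidently wrong, which is the signal the optimizer uses to update the weights.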

In [None]:
cnnmodel = Sequential()
cnnmodel.add(embedding_layer)

cnnmodel.add(Conv1D(128, 5, activation="relu"))
cnnmodel.add(MaxPooling1D(5))

cnnmodel.add(Conv1D(128, 5, activation="relu"))
cnnmodel.add(MaxPooling1D(5))

cnnmodel.add(Conv1D(128, 5, activation="relu"))
cnnmodel.add(GlobalMaxPooling1D())

cnnmodel.add(Dense(128, activation="relu"))
cnnmodel.add(Dense(len(labels_index), activation="softmax"))

cnnmodel.compile(loss="categorical_crossentropy", optimizer="rmsprop", metrics=["accuracy"])

print(f"Train on {len(x_train)} samples, validate on {len(x_val)} samples")
# Train the model. Tune to validation set. 
cnnmodel.fit(x_train, y_train, batch_size=128, epochs=1, validation_data=(x_val, y_val))

# Evaluate on test set:
score, acc = cnnmodel.evaluate(test_data, test_labels)
print("Test accuracy with CNN:", acc)
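The sequence length shrinks as it passes through the convolution and pooling layers above, and the arithmetic can be worked out by hand. The sketch below assumes the Keras defaults for these layers (`padding="valid"`, convolution stride 1, pooling stride equal to the pool size); the helper names are hypothetical.

```python
def conv1d_out_len(length, kernel_size, stride=1):
    """Output length of a 'valid' 1D convolution."""
    return (length - kernel_size) // stride + 1

def maxpool1d_out_len(length, pool_size):
    """Output length of 'valid' max pooling with stride equal to the pool size."""
    return (length - pool_size) // pool_size + 1

length = 1000  # MAX_SEQUENCE_LENGTH
shapes = []
for layer, size in [("conv", 5), ("pool", 5), ("conv", 5), ("pool", 5), ("conv", 5)]:
    length = conv1d_out_len(length, size) if layer == "conv" else maxpool1d_out_len(length, size)
    shapes.append((layer, length))

print(shapes)  # [('conv', 996), ('pool', 199), ('conv', 195), ('pool', 39), ('conv', 35)]
```

The final `GlobalMaxPooling1D` then collapses the remaining 35 positions into a single 128-dimensional vector, one value per filter.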

A good approach while building your models is to experiment with different settings (i.e., hyperparameters). Keep in mind that all these decisions come with some associated cost. 

For example, in practice the number of epochs is usually set to 10 or more, but that also increases the time it takes to train the model.
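A back-of-the-envelope way to see this cost is to count weight updates. The sketch below uses the 20,000 training samples and the batch sizes that appear in this notebook; `updates_for` is a hypothetical helper, and actual wall-clock time also depends on the model and hardware.

```python
import math

def updates_for(num_samples, batch_size, epochs):
    """Total number of gradient updates: batches per epoch times number of epochs."""
    steps_per_epoch = math.ceil(num_samples / batch_size)
    return steps_per_epoch * epochs

print(updates_for(20000, 128, 1))   # 157
print(updates_for(20000, 128, 10))  # 1570
print(updates_for(20000, 32, 10))   # 6250
```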

## CNN Model with a Trainable Embedding Layer

Another thing to note is that, if you want to train an embedding layer instead of using pre-trained embeddings in this model, the only thing that changes is the line `cnnmodel.add(embedding_layer)`. Instead, we can add a new, trainable embedding layer, for example `cnnmodel.add(Embedding(Param1, Param2))`, where `Param1` is the vocabulary size and `Param2` is the embedding dimension.
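Training the embedding layer adds a sizeable number of weights to learn. The rough count below compares the trainable layer `Embedding(MAX_NUM_WORDS, 128)` with the frozen GloVe layer defined earlier, taking 20,001 rows and 100 dimensions from the values set in this notebook; `embedding_param_count` is a hypothetical helper.

```python
def embedding_param_count(vocab_size, embedding_dim):
    """An embedding layer stores one vector of size `embedding_dim` per vocabulary entry."""
    return vocab_size * embedding_dim

trainable_params = embedding_param_count(20000, 128)  # Embedding(MAX_NUM_WORDS, 128), learned from scratch
frozen_params = embedding_param_count(20001, 100)     # pre-trained GloVe matrix, kept fixed

print(trainable_params)  # 2560000
print(frozen_params)     # 2000100
```

With the pre-trained layer none of those weights are updated, which is one reason it can be the safer choice when training data is scarce.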

In [None]:
print("Defining and training a CNN model, training embedding layer on the fly instead of using pre-trained embeddings")
cnnmodel = Sequential()
cnnmodel.add(Embedding(MAX_NUM_WORDS, 128))

cnnmodel.add(Conv1D(128, 5, activation="relu"))
cnnmodel.add(MaxPooling1D(5))

cnnmodel.add(Conv1D(128, 5, activation="relu"))
cnnmodel.add(MaxPooling1D(5))

cnnmodel.add(Conv1D(128, 5, activation="relu"))
cnnmodel.add(GlobalMaxPooling1D())

cnnmodel.add(Dense(128, activation="relu"))
cnnmodel.add(Dense(len(labels_index), activation="softmax"))

cnnmodel.compile(loss="categorical_crossentropy", optimizer="rmsprop", metrics=["accuracy"])

# Train the model. Tune to validation set. 
cnnmodel.fit(x_train, y_train, batch_size=128, epochs=1, validation_data=(x_val, y_val))

# Evaluate on test set:
score, acc = cnnmodel.evaluate(test_data, test_labels)
print("Test accuracy with CNN:", acc)

We’ll notice that, in this case, training the embedding layer on our own dataset seems to result in better classification on test data.

However, if the training data were substantially smaller, sticking to the pre-trained embeddings or using domain adaptation techniques would be a better choice.

## LSTMs for Text Classification

LSTMs and other variants of RNNs in general have become the go-to way of doing neural language modeling in the past few years. This is primarily because language is sequential in nature and RNNs are specialized in working with sequential data. The current word in the sentence depends on its context: the words before and after.

However, when we model text using CNNs, this crucial fact is not taken into account. RNNs work on the principle of using this context while learning the language representation or a model of language. Hence, they’re known to work well for NLP tasks.

In this section, we’ll see an example of using RNNs for text classification. Now that we’ve already seen one neural network in action, it’s relatively easy to train another! Just replace the convolutional and pooling parts with an LSTM in the prior two code examples.
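Before swapping in the Keras layer, it may help to see what a single LSTM step computes. The sketch below is a deliberately tiny scalar version with made-up weights: a real `LSTM(128)` layer uses weight matrices and 128-dimensional state vectors, and its weights are learned rather than fixed by hand.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step for a scalar input and scalar hidden/cell state."""
    pre = {}
    for name in ("i", "f", "o", "g"):
        w_x, w_h, b = params[name]
        pre[name] = w_x * x + w_h * h_prev + b
    i, f, o = sigmoid(pre["i"]), sigmoid(pre["f"]), sigmoid(pre["o"])  # input, forget, output gates
    g = math.tanh(pre["g"])                                            # candidate cell value
    c = f * c_prev + i * g       # forget part of the old memory, write part of the new
    h = o * math.tanh(c)         # expose a gated view of the memory
    return h, c

# Hypothetical weights (input weight, recurrent weight, bias) shared by all four gates.
params = {name: (0.5, 0.1, 0.0) for name in ("i", "f", "o", "g")}

h, c = 0.0, 0.0
states = []
for x in [1.0, -1.0, 0.5]:       # the state is carried from one time step to the next
    h, c = lstm_step(x, h, c, params)
    states.append(h)
print(states)
```

The hidden state after the last step summarizes the whole sequence, and in the models that follow it is this final state that is passed to the dense output layer.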

In [None]:
print("Defining and training an LSTM model, training embedding layer on the fly")

rnnmodel = Sequential()
rnnmodel.add(Embedding(MAX_NUM_WORDS, 128))

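rnnmodel.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
# Labels are one-hot with two columns, so this layer has two sigmoid units;
# with binary cross-entropy, each column is scored as an independent yes/no output.
rnnmodel.add(Dense(len(labels_index), activation="sigmoid"))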

rnnmodel.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

# Train the model. Tune to validation set. 
rnnmodel.fit(x_train, y_train, batch_size=32, epochs=1, validation_data=(x_val, y_val))

# Evaluate on test set:
score, acc = rnnmodel.evaluate(test_data, test_labels, batch_size=32)
print("Test accuracy with RNN:", acc)

## LSTM Model Using a Pre-trained Embedding Layer

In [None]:
print("Defining and training an LSTM model, using pre-trained embedding layer")

rnnmodel2 = Sequential()
rnnmodel2.add(embedding_layer)

rnnmodel2.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
rnnmodel2.add(Dense(len(labels_index), activation="sigmoid"))

rnnmodel2.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

# Train the model. Tune to validation set. 
rnnmodel2.fit(x_train, y_train, batch_size=32, epochs=1, validation_data=(x_val, y_val))

# Evaluate on test set:
score, acc = rnnmodel2.evaluate(test_data, test_labels, batch_size=32)
print("Test accuracy with RNN:", acc)

While LSTMs are more powerful in utilizing the sequential nature of text, they’re much more data hungry than CNNs. Thus, the relatively lower performance of the LSTM on a dataset need not be interpreted as a shortcoming of the model itself. It’s possible that the amount of data we have is not sufficient to utilize the full potential of an LSTM.