# Tokenization & Feature Extraction using Keras' Tokenizer

After data preprocessing, we use Keras' Tokenizer class to tokenize the new text corpus, which contains the most common phrases, and to create numeric features from it.

We will use the following Keras utilities:

> fit_on_texts (Builds the tokenizer's internal vocabulary from a text corpus. Each token is assigned an integer index in a dictionary)

> texts_to_sequences (Transforms each text in texts to a sequence of integers)

> pad_sequences (A utility function, not a Tokenizer method, that pads or truncates sequences to a common length)

## Import Modules

In [1]:
import time
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

from gensim.models import Phrases
from collections import defaultdict, Counter, OrderedDict

from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer
from gensim.models import Word2Vec

## Load Train and Test Data

In [2]:
df_train = pd.read_csv('Datasets/SST5_master_train.csv')
df_train.Processed_Reviews = df_train.Processed_Reviews.astype(str)
df_train.head()

| Unnamed: 0.1 | Unnamed: 0 | label | review | type | Processed_Reviews |
|---|---|---|---|---|---|
| 0 | 0 | 4 | The Rock is destined to be the 21st Century 's... | train | the rock is destined to be the 21st century ne... |
| 1 | 1 | 5 | The gorgeously elaborate continuation of `` Th... | train | the gorgeously elaborate continuation of the l... |
| 2 | 2 | 4 | Singer/composer Bryan Adams contributes a slew... | train | singer composer bryan adam contributes slew of... |
| 3 | 3 | 3 | You 'd think by now America would have had eno... | train | you think by now america would have had enough... |
| 4 | 4 | 4 | Yet the act is still charming here . | train | yet the act is still charming here |


In [3]:
df_test = pd.read_csv('Datasets/SST5_master_test.csv')
df_test.head()

| Unnamed: 0.1 | Unnamed: 0 | label | review | type | Processed_Reviews |
|---|---|---|---|---|---|
| 0 | 9645 | 4 | It 's a lovely film with lovely performances b... | test | it lovely film with lovely performance by buy ... |
| 1 | 9646 | 3 | No one goes unindicted here , which is probabl... | test | no one go unindicted here which is probably fo... |
| 2 | 9647 | 4 | And if you 're not nearly moved to tears by a ... | test | and if you re not nearly moved to tear by coup... |
| 3 | 9648 | 5 | A warm , funny , engaging film . | test | warm funny engaging film |
| 4 | 9649 | 5 | Uses sharp humor and insight into human nature... | test | us sharp humor and insight into human nature t... |


## Tokenize the text

Keras' Tokenizer class vectorizes a text corpus by turning each text into a sequence of integers, where each integer is the index of a token in a dictionary.
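To make the "index of a token in a dictionary" idea concrete, here is a minimal pure-Python sketch of how a frequency-ranked word index can be built. It is a simplified illustration, not the Keras implementation: the helper `build_word_index` and the toy corpus are hypothetical, and it only lowercases and splits on whitespace. It assumes, as in the cells below, that an OOV token is configured and therefore takes index 1, while index 0 stays unused so that it can serve as the padding value.

```python
from collections import Counter

def build_word_index(texts, oov_token="<OOV>"):
    """Rank words by frequency; index 0 is left unused for padding."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # most_common() orders by descending count; ties keep first-seen order.
    ranked = [word for word, _ in counts.most_common()]
    vocab = [oov_token] + ranked  # the OOV token takes index 1
    return {word: i for i, word in enumerate(vocab, start=1)}

# Hypothetical toy corpus, for illustration only.
toy_corpus = [
    "the film is charming",
    "the act is still charming here",
    "the film",
]
toy_word_index = build_word_index(toy_corpus)
print(toy_word_index)
# {'<OOV>': 1, 'the': 2, 'film': 3, 'is': 4, 'charming': 5, 'act': 6, 'still': 7, 'here': 8}
```

The most frequent word receives the lowest index after the OOV token, which is why a very common word such as "the" ends up at index 2.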

In [4]:
%%time

'''
Tokenizing the text

- num_words: the maximum number of words to keep, based on word frequency
- oov_token: if given, it will be added to word_index and used to replace out-of-vocabulary words during texts_to_sequences calls
'''
# Keep only the highest-frequency tokens: tokens whose index is below max_words are kept,
# and all other tokens are replaced by the OOV token.
# Note: this caps the vocabulary used in the sequences, not the length of the feature vectors.
max_words = 100

tokenizer = Tokenizer(num_words = max_words, oov_token = '<OOV>')

# Fit the Tokenizer object on the training data.
# This updates internal vocabulary based on a list of tokenized texts.
tokenizer.fit_on_texts(df_train['Processed_Reviews'])

Wall time: 209 ms


In [5]:
'''
The full list of words is available through the "word_index" property of the tokenizer.
It returns a dictionary of key-value pairs, in which each word is a key
and its index is a value.
word_index contains every word seen during fitting, regardless of num_words.
'''
word_index = tokenizer.word_index
print("Number of unique words (tokens): %d" % len(word_index))

# Print the index of the word "the"
print("\nIndex of the word 'the':", word_index.get("the"))

vocab_size = len(word_index) + 1
print("\nSize of vocabulary: ", vocab_size)

Number of unique words (tokens): 14796

Index of the word 'the': 2

Size of vocabulary:  14797


In [6]:
'''
Transforms each text in texts to a sequence of integers.
'''

# Transforms each training text in texts to a sequence of integers.
sequences_train = tokenizer.texts_to_sequences(df_train['Processed_Reviews']) 

# Transforms each test text in texts to a sequence of integers.
sequences_test = tokenizer.texts_to_sequences(df_test['Processed_Reviews']) 
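The conversion step can be sketched in plain Python to show how `num_words` and the OOV token interact. This is an illustrative sketch rather than the Keras source: `to_sequence` and `toy_index` are hypothetical names. It assumes that a word is replaced by the OOV index either when it is missing from the index or when its index is at or beyond `num_words`.

```python
def to_sequence(text, word_index, num_words=None, oov_token="<OOV>"):
    """Map words to indices; out-of-vocabulary words get the OOV index."""
    oov_index = word_index[oov_token]
    sequence = []
    for word in text.lower().split():
        index = word_index.get(word)
        # Unknown words, and known words ranked at or beyond num_words,
        # are both replaced by the OOV index.
        if index is None or (num_words is not None and index >= num_words):
            index = oov_index
        sequence.append(index)
    return sequence

# Hypothetical frequency-ranked index (1 = OOV, lower index = more frequent).
toy_index = {"<OOV>": 1, "the": 2, "film": 3, "is": 4, "charming": 5, "act": 6}

print(to_sequence("the act is charming", toy_index))               # [2, 6, 4, 5]
print(to_sequence("the act is charming", toy_index, num_words=5))  # [2, 1, 4, 1]
print(to_sequence("the plot is charming", toy_index))              # [2, 1, 4, 5]
```

With `num_words = 100`, as set above, only indices 1 to 99 can appear in the sequences, so every rarer word collapses into the OOV index even though `word_index` still lists it.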

In [7]:
'''
Add padding to the beginning of a text (sentence).
Because maxlen is set below, every sequence is padded or truncated to exactly maxlen tokens.

The "pad_sequences" function transforms a list (of length num_samples) of sequences (lists of integers)
into a 2D NumPy array of shape (num_samples, num_timesteps).
num_timesteps is either the maxlen argument, if provided, or the length of the longest sequence in the list.
Sequences that are shorter than num_timesteps are padded with value (0 by default) until they are num_timesteps long.
Sequences that are longer than num_timesteps are truncated to fit.

Arguments (defaults):
- maxlen=None
- dtype='int32'
- padding='pre' (padding is added at the beginning)
- truncating='pre' (if a sentence is longer than "maxlen", tokens are removed from the beginning)

'''

maxlen = 100 # Keeps at most 100 tokens per review; with truncating='pre', longer reviews keep their last 100 tokens

padded_data_train = pad_sequences(sequences_train, maxlen=maxlen)

padded_data_test = pad_sequences(sequences_test, maxlen=maxlen)


# Transform the labels as a numpy array
labels_train = np.asarray(df_train['label'])
labels_test = np.asarray(df_test['label'])

# Show output array shapes
print("\nShape of the Padded Training Data Tensor: ", padded_data_train.shape)
print("Shape of the Training Label Tensor: ", labels_train.shape)
print("\nShape of the Padded Test Data Tensor: ", padded_data_test.shape)
print("Shape of the Test Label Tensor: ", labels_test.shape)


Shape of the Padded Training Data Tensor:  (9645, 100)
Shape of the Training Label Tensor:  (9645,)

Shape of the Padded Test Data Tensor:  (1101, 100)
Shape of the Test Label Tensor:  (1101,)
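The default 'pre' padding and 'pre' truncation described in the cell above can be sketched for a single sequence in plain Python. The helper `pad_pre` is a hypothetical name used only for this illustration; the real function operates on a whole list of sequences and returns a NumPy array.

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad short sequences and keep the last maxlen items of long ones."""
    if len(sequence) >= maxlen:
        # 'pre' truncation drops items from the start of the sequence.
        return sequence[len(sequence) - maxlen:]
    # 'pre' padding adds the fill value in front of the sequence.
    return [value] * (maxlen - len(sequence)) + sequence

print(pad_pre([7, 8, 9], maxlen=5))              # [0, 0, 7, 8, 9]
print(pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5))  # [3, 4, 5, 6, 7]
```

Every output has length `maxlen`, which is why both padded tensors above have 100 columns.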


## Create Train and Test sets

In [8]:
X_train = padded_data_train 
y_train = labels_train

print("\nTraining Data: ", X_train.shape)
print("Training Label: ", y_train.shape)

X_test = padded_data_test 
y_test = labels_test

print("\nTest Data: ", X_test.shape)
print("Test Label: ", y_test.shape)


Training Data:  (9645, 100)
Training Label:  (9645,)

Test Data:  (1101, 100)
Test Label:  (1101,)


## Create Validation set

In [9]:
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)
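As a quick sanity check on the split sizes, the arithmetic can be worked out by hand. This sketch assumes that a fractional `test_size` is rounded up to a whole number of samples and that the remaining samples stay in the training split.

```python
import math

n_samples = 9645  # rows in the padded training tensor above
test_size = 0.2

# Assumption: the held-out fraction is rounded up to a whole number of
# samples, and everything else remains in the training split.
n_val = math.ceil(test_size * n_samples)
n_train = n_samples - n_val

print(n_train, n_val)  # 7716 1929
```

Printing `X_train.shape` and `X_val.shape` after the split is an easy way to confirm these numbers in the notebook itself.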

## Create Embedding matrix

In [11]:
import tensorflow as tf
import tensorflow_hub as hub
# url = "https://tfhub.dev/google/elmo/2"
# embed = hub.Module(url)

In [None]:
# tfhub_dir = '/data/jupyter/common/model/text/tfhub'

In [None]:
# import sys, os
# def add_aion(curr_path=None):
#     if curr_path is None:
#         dir_path = os.getcwd()
#         target_path = os.path.dirname(dir_path)
#         if target_path not in sys.path:
#             print('Added %s into sys.path.' % (target_path))
#             sys.path.insert(0, target_path)
            
# add_aion()

In [None]:
# from aion.embeddings.elmo import ELMoEmbeddings

# import tensorflow as tf
# from keras import backend as K
# from keras.layers import Input, Lambda, Dense, Embedding, BatchNormalization, Concatenate, LSTM
# from keras.models import Model

# elmo_embs = ELMoEmbeddings(layer='elmo', verbose=20)
# elmo_embs.load(dest_dir=tfhub_dir)

In [12]:
# # Input Layers
# word_input_layer = Input(shape=(None, ), dtype='int32')
# elmo_input_layer = Input(shape=(None, ), dtype=tf.string)

# # Output Layers
# word_output_layer = Embedding(input_dim=vocab_size, output_dim=256)(word_input_layer)

# elmo_output_layer = Lambda(elmo_embs.to_keras_layer, output_shape=(None, 1024))(elmo_input_layer)

# output_layer = Concatenate()([word_output_layer, elmo_output_layer])
# output_layer = BatchNormalization()(output_layer)
# output_layer = LSTM(256, dropout=0.2, recurrent_dropout=0.2)(output_layer)
# output_layer = Dense(4, activation='sigmoid')(output_layer)

In [13]:
# # The pretrained embedding vectors has length (dimension) 300.
# embedding_dim = 300

# embedding_matrix = np.zeros((vocab_size, embedding_dim))

# pretrained_embeddings = 0

# # Create a weight matrix for words in training docs
# for word, i in word_index.items():
#     embedding_vector = getVector(word)
#     if embedding_vector is not None:
#         embedding_matrix[i] = embedding_vector
#         pretrained_embeddings +=1

In [14]:
# print("Number of vocabulary words that are not present in the pre-trained dictionary: ", 
#       vocab_size - pretrained_embeddings )

# print("\nPercentage of pre-trained vectors used: %.2f" % ((pretrained_embeddings*100.0)/vocab_size))

# print("\nWeight Matrix shape: ", embedding_matrix.shape)

## Train the Classifier to Fine-tune & Learn Embeddings

We train the classifier to fine-tune the pretrained embeddings and to learn embeddings for the words that were not present in the pretrained dictionary.

To do this, we need to configure the Keras Embedding layer.

**Set the Embedding Layer using Pretrained Embeddings**


The Embedding layer has two mandatory arguments:
- input_dim: vocab_size. The size of the vocabulary, i.e. the number of unique words in the input dataset plus one, because index 0 is reserved for padding.
- output_dim: embedding_dim. The size of the embedding word vectors. For the pretrained Word2vec embeddings, embedding_dim is 300.

To use pre-trained word vectors, we need to set two more parameters.
- weights: the embedding_matrix is passed as the initial weights
- trainable: set it to False to keep the embeddings fixed, or to True to fine-tune them during training
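The shape and contents of such an embedding matrix can be illustrated with a small pure-Python sketch. Everything in it is hypothetical: the word index, the 4-dimensional "pretrained" vectors, and the variable names are made up for illustration, and nested lists stand in for the NumPy array used in the notebook.

```python
toy_embedding_dim = 4  # illustrative; the text above uses 300 for Word2vec

# Hypothetical word index and pretrained vectors, for illustration only.
toy_word_index = {"<OOV>": 1, "the": 2, "film": 3, "charming": 4}
toy_pretrained = {
    "the": [0.1, 0.2, 0.3, 0.4],
    "film": [0.5, 0.6, 0.7, 0.8],
}

# +1 because index 0 is reserved for padding.
toy_vocab_size = len(toy_word_index) + 1

# One row per index; rows without a pretrained vector stay all-zero.
toy_embedding_matrix = [[0.0] * toy_embedding_dim for _ in range(toy_vocab_size)]
found = 0
for word, i in toy_word_index.items():
    vector = toy_pretrained.get(word)
    if vector is not None:
        toy_embedding_matrix[i] = list(vector)
        found += 1

print(len(toy_embedding_matrix), len(toy_embedding_matrix[0]))  # 5 4
print(found, "of", len(toy_word_index), "words have a pretrained vector")
```

Looking up a token is then just row indexing: index 2 returns the vector for "the". Rows that remain all-zero, such as the one for "charming" here, are the embeddings that can only be learned when the layer is trainable.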

In [15]:
# Import our dependencies
import tensorflow as tf
import pandas as pd
import tensorflow_hub as hub
import os
import re
from keras import backend as K
import keras.layers as layers
from keras.models import Model, load_model
from keras.engine import Layer
import numpy as np


# Create a custom layer that allows us to update weights (lambda layers do not have trainable parameters!)

class ElmoEmbeddingLayer(Layer):
    def __init__(self, **kwargs):
        self.dimensions = 32
        self.trainable=True
        super(ElmoEmbeddingLayer, self).__init__(**kwargs)

    def build(self, input_shape):
        self.elmo = hub.Module('https://tfhub.dev/google/elmo/2', trainable=self.trainable,
                               name="{}_module".format(self.name))

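        # K.tf does not exist in this Keras version (it raises the AttributeError recorded in the run output below),
        # so the module's variables are collected through TensorFlow directly.
        self.trainable_weights += tf.compat.v1.trainable_variables(scope="^{}_module/.*".format(self.name))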
        super(ElmoEmbeddingLayer, self).build(input_shape)

    def call(self, x, mask=None):
        result = self.elmo(K.squeeze(K.cast(x, tf.string), axis=1),
                      as_dict=True,
                      signature='default',
                      )['default']
        return result

    def compute_mask(self, inputs, mask=None):
        return K.not_equal(inputs, '--PAD--')

    def compute_output_shape(self, input_shape):
        return (input_shape[0], self.dimensions)

Using TensorFlow backend.


In [16]:
# Clear the Keras session before creating a new model; otherwise state from earlier models accumulates and can exhaust memory.
keras.backend.clear_session()

# Set random seed for reproducible results
np.random.seed(42)
tf.random.set_seed(42)

In [17]:

# Function to build model
def build_model(): 
    input_text = layers.Input(shape=(1,), dtype="string")
    
    embedding = ElmoEmbeddingLayer()(input_text)
    
    dense = layers.Dense(100, activation='relu')(embedding)
    
    pred = layers.Dense(1, activation='sigmoid')(dense)

    model = Model(inputs=[input_text], outputs=pred)

    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    
    model.summary()

    return model

## Create biLSTM Model

In [18]:
# model = keras.models.Sequential()
# model.add(keras.layers.Embedding(vocab_size, embedding_dim, weights=[embedding_matrix], trainable=False))
# model.add(keras.layers.Bidirectional(keras.layers.LSTM(100)))
# model.add(keras.layers.Dense(1, activation='sigmoid'))

# model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

In [19]:
# # Create a path for the log sub-directory as curdir + Logs + currdatetime + modelname
# model_name = "Embedding-biLSTM"
# model_name_format = "Embedding-Dense-biLSTM.h5"
# run_logdir = os.path.join(os.curdir, "Logs", time.strftime("run_%Y_%m_%d-%H_%M_%S"), model_name)

In [28]:
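# Build and fit
# Note: train_text, train_label, test_text and test_label are not defined in the cells above.
# The ELMo layer takes string input of shape (1,), so they should hold the raw review strings and their labels.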
model = build_model()
model.fit(train_text, 
          train_label,
          validation_data=(test_text, test_label),
          epochs=1,
          batch_size=128)

AttributeError: module 'keras.backend' has no attribute 'tf'

In [30]:
tf.config.list_physical_devices('GPU') 

[]

In [None]:
# batch_size = 128
# epochs = 5

# params={
#     "batch_size":batch_size,
#     "epochs":epochs
# }

# # experiment.log_parameters(params)

# t0 = time.time()

# # with experiment.train():
# history = model.fit(X_train, y_train,
#                         epochs = epochs,
#                         batch_size = batch_size,
#                         verbose = True,
#                         validation_data = (X_val, y_val),
#                         callbacks= [tensorboard_cb, loss_history_cb]) 

# t1 = time.time()

# model.save(model_name_format)

# duration_Pretraining_sec = t1-t0
# duration_Pretraining = convertTime(t1 - t0)

# print("\nTraining Time: ", duration_Pretraining)

## Evaluate Classifier on Test Data

In [None]:
numOfEpochs = 5
print("Epochs: ", numOfEpochs)

# It will log metrics with the prefix 'test_'
test_loss_mlp, test_accuracy_mlp = model.evaluate(X_test, y_test, verbose=0)

metrics = {
    'loss':test_loss_mlp,
    'accuracy':test_accuracy_mlp
}

#experiment.log_metrics(metrics)

print("\nTest Accuracy: {:.3f}".format(test_accuracy_mlp))
print("Test Loss: {:.3f}".format(test_loss_mlp))

# Predicted probabilities from the sigmoid output, shape (num_samples, 1)
y_test_predicted_proba = model.predict(X_test)

# Threshold at 0.5 to obtain 0/1 class predictions
y_test_predicted = (y_test_predicted_proba.ravel() > 0.5).astype(int)

# Count the correct predictions
true = int(np.sum(y_test == y_test_predicted))

In [None]:
print("Test: Correct Predictions: {}".format(true))
print("Test: Incorrect Predictions: {}".format(len(y_test_predicted) - true))

print("\nTest Confusion Matrix:")
print(confusion_matrix(y_test, y_test_predicted))
# experiment.log_confusion_matrix(y_test.ravel(), y_test_predicted)

print(classification_report(y_test, y_test_predicted))
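The thresholding and counting steps above can be checked by hand on a tiny example. The sketch below uses made-up labels and probabilities and two hypothetical helpers, `threshold` and `binary_confusion`. It assumes binary 0/1 labels and the usual confusion-matrix layout, with true classes as rows and predicted classes as columns.

```python
def threshold(probabilities, cutoff=0.5):
    """Turn sigmoid outputs into 0/1 class labels."""
    return [1 if p > cutoff else 0 for p in probabilities]

def binary_confusion(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]]: rows are true labels, columns predictions."""
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

# Made-up labels and probabilities, for illustration only.
y_true = [1, 0, 1, 1, 0, 0]
y_proba = [0.9, 0.2, 0.4, 0.7, 0.6, 0.1]

y_pred = threshold(y_proba)
matrix = binary_confusion(y_true, y_pred)
correct = matrix[0][0] + matrix[1][1]

print(y_pred)   # [1, 0, 0, 1, 1, 0]
print(matrix)   # [[2, 1], [1, 2]]
print("Correct: %d of %d" % (correct, len(y_true)))  # Correct: 4 of 6
```

The correct predictions are the diagonal of the matrix. Note that this only works for 0/1 labels: the label column shown earlier holds values such as 3, 4 and 5, so those labels would first have to be mapped to two classes before a 0.5 threshold is meaningful.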

## Visualize Learning Curves

In [None]:
def plot_learning_curves(history, numOfEpochs, savePlot=False, plotName=None):
    '''Function For Generating Learning Curves (Accuracy & Loss)'''
    
    plt.figure(figsize=(18,6))

    plt.subplot(121)
    plt.plot(range(1,numOfEpochs+1),history.history['val_accuracy'],label='validation')
    plt.plot(range(1,numOfEpochs+1),history.history['accuracy'],label='training')
    plt.legend(loc=0)
    plt.xlabel('Epochs')
    plt.ylabel('Accuracy')
    plt.xlim([1,numOfEpochs])
    plt.grid(True)

    
    plt.subplot(122)
    plt.plot(range(1,numOfEpochs+1),history.history['val_loss'],label='validation')
    plt.plot(range(1,numOfEpochs+1),history.history['loss'],label='training')
    plt.legend(loc=0)
    plt.xlabel('Epochs')
    plt.ylabel('Loss')
    plt.xlim([1,numOfEpochs])
    plt.grid(True)
    
    if savePlot:
        plt.savefig(plotName, dpi=300)

    
    plt.show() 


def plot_learning_rate(loss_history_lschedule, numOfEpochs, momentumPlot=False):
    '''Function to plot learning rate and momentum'''
    plt.figure(figsize=(10,6))
    plt.plot(range(1,numOfEpochs+1),loss_history_lschedule.lr,label='learning rate')
    plt.xlabel("Epoch")
    plt.xlim([1,numOfEpochs+1])
    plt.ylabel("Learning rate")
    
    if momentumPlot:
        plt.plot(range(1,numOfEpochs+1),loss_history_lschedule.mom,'r-', label='momentum')
        plt.ylabel("Learning rate & Momentum")
    
    
    plt.legend(loc=0)
    plt.grid(True)
    plt.show()

In [None]:
plot_learning_curves(history, numOfEpochs, savePlot=False)

In [None]:
plot_learning_rate(loss_history_cb, numOfEpochs,  momentumPlot=False)
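Alongside the plots, the same per-epoch metrics can be inspected numerically. The sketch below uses a made-up dictionary shaped like the `history.history` object read by `plot_learning_curves`; the values are invented for illustration and are not results from this notebook.

```python
# Made-up metric values, shaped like the history.history dictionary
# read by plot_learning_curves above.
history_dict = {
    "loss": [0.69, 0.61, 0.55, 0.50, 0.46],
    "val_loss": [0.68, 0.63, 0.60, 0.61, 0.64],
    "accuracy": [0.52, 0.66, 0.72, 0.76, 0.79],
    "val_accuracy": [0.55, 0.64, 0.68, 0.67, 0.66],
}

# Epochs are numbered from 1, matching the x-axis of the plots.
val_loss = history_dict["val_loss"]
best_epoch = min(range(len(val_loss)), key=lambda i: val_loss[i]) + 1

# A growing gap between validation and training loss suggests overfitting.
gap = history_dict["val_loss"][-1] - history_dict["loss"][-1]

print("Lowest validation loss at epoch", best_epoch)      # epoch 3
print("Final validation/training loss gap: %.2f" % gap)   # 0.18
```

In this invented example the validation loss bottoms out at epoch 3 while the training loss keeps falling, which is the pattern the learning curves are meant to reveal.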

## Results and Observation