
In this notebook, we will implement an RNN with Keras to classify text sentences.

I.   **Firstly**, we'll import useful packages.

II.   **Then**, we'll load the data and create a word embedding matrix using GloVe.

III.  **We'll try a simple RNN model** and then evaluate its performance.

IV. **Finally**, we'll use techniques to increase our model's accuracy.

**Task 1:** Set up the free GPU in this Google Colab notebook.

## Mounting Google Drive locally
**Task 2:** Mount Google Drive in the Google Colab runtime.


In [1]:
## TYPE YOUR CODE for task 2 here:

from google.colab import drive
drive.mount('gdrive')

Mounted at gdrive


# I. Let's import all useful packages.

In [2]:
!pip install tensorflow-addons

Collecting tensorflow-addons
  Downloading tensorflow_addons-0.15.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.1 MB)

In [84]:
import sys, os, re, csv, codecs, numpy as np, pandas as pd
import tensorflow.keras
import datetime
from tensorflow.keras import backend as K
import tensorflow.keras.optimizers as Optimizer
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.callbacks import ModelCheckpoint, TensorBoard, ReduceLROnPlateau, EarlyStopping, Callback
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation
from tensorflow.keras.layers import Bidirectional, GlobalMaxPool1D
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.metrics import TruePositives, FalsePositives, FalseNegatives
import tensorflow_addons as tfa

from tensorflow.keras import initializers, regularizers, constraints, optimizers, layers
from sklearn.metrics import confusion_matrix as CM
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score
import matplotlib.pyplot as plot
import seaborn as sn

**Task 3**: Copy the dataset from Google Drive into Colab

In [4]:
## TYPE YOUR CODE for task 3 here:
!cp gdrive/MyDrive/dataset/asm2/train.csv .
!cp gdrive/MyDrive/dataset/asm2/glove.6B.50d.txt .

# II. Load the data.

## About dataset.
An invalid question is defined as a question intended to make a statement rather than look for helpful answers. Some characteristics that can signify that a question is invalid:

* Has a non-neutral tone.
* Is disparaging or inflammatory.
* Isn't grounded in reality.
* Uses sexual content (incest, bestiality, pedophilia) for shock value, and not to seek genuine answers

The data includes the question that was asked, and whether it was identified as invalid (target = 1). 

**Task 4**: Load the dataset.
* Load the data from the CSV file.
* Remove all rows with NA values.
* Split the data into 3 sets: training, validation and test (0.9/0.05/0.05, random_seed = 9), keeping the same class ratio in each set.
* Print out a description of each dataset.
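To make the stratification requirement concrete, here is a dependency-free sketch of a 0.9/0.05/0.05 split that keeps the class ratio in every subset. It is only an illustration: `stratified_split` and the toy labels are hypothetical names and data, and the solution in this notebook relies on scikit-learn's `train_test_split` with `stratify=` instead.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.9, 0.05, 0.05), seed=9):
    """Return three lists of row indices whose class ratios match `labels`."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    splits = ([], [], [])
    for indices in by_class.values():
        # Shuffle inside each class, then cut the class with the same fractions
        rng.shuffle(indices)
        n_train = round(len(indices) * fractions[0])
        n_val = round(len(indices) * fractions[1])
        splits[0].extend(indices[:n_train])
        splits[1].extend(indices[n_train:n_train + n_val])
        splits[2].extend(indices[n_train + n_val:])
    return splits

# Toy data: 1000 rows, 6% of them labelled 1
labels = [1] * 60 + [0] * 940
train_idx, val_idx, test_idx = stratified_split(labels)
for name, idx in (("train", train_idx), ("validation", val_idx), ("test", test_idx)):
    positives = sum(labels[i] for i in idx)
    print(name, len(idx), round(positives / len(idx), 3))
```

Because every class is cut with the same fractions, the share of positive rows is the same in all three subsets. That is what the nearly identical `mean` values in the dataset descriptions indicate.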




In [5]:
from sklearn.model_selection import train_test_split

def load_data(data_link):
    '''
    input: data link.
    output:
        train_set, validation_set and test_set (0.9/0.05/0.05) without NA values.
    '''
    ## TYPE YOUR CODE for task 4 here:
    df = pd.read_csv(data_link).dropna().iloc[:, 1:]  # drop id
    df.columns = ['text', 'label']

    # Split 0.9 for train
    train, validation_n_test = train_test_split(df, train_size=0.9, random_state=9, stratify=df['label'])
    # Split half for validation and half test (0.05 each)
    validation, test = train_test_split(validation_n_test, test_size=0.5, random_state=9, stratify=validation_n_test['label'])

    return train, validation, test

train_set, validation_set, test_set = load_data('train.csv')
print(train_set['label'].describe())
print(validation_set['label'].describe())
print(test_set['label'].describe())

count    1.175509e+06
mean     6.187022e-02
std      2.409198e-01
min      0.000000e+00
25%      0.000000e+00
50%      0.000000e+00
75%      0.000000e+00
max      1.000000e+00
Name: label, dtype: float64
count    65306.000000
mean         0.061863
std          0.240908
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max          1.000000
Name: label, dtype: float64
count    65307.000000
mean         0.061877
std          0.240934
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max          1.000000
Name: label, dtype: float64


# Encoding text data.
Let's declare some fundamental parameters first:

In [6]:
embed_size = 50 # how big is each word vector
max_features = 20000 # how many unique words to use (i.e. number of rows in the embedding matrix)
max_len = 50 # max number of words in a question to use

**Task 5:** Encode the dataset using Tokenizer and one-hot encoding.
* Encode the text (question_text column) by turning each question into a list of word indexes using [Tokenizer](https://stackoverflow.com/questions/51956000/what-does-keras-tokenizer-method-exactly-do) with **max_features**, fitted on all the sentences from the training and validation sets.
* Bring each list of word indexes to the same length, **max_len** (truncating or padding as needed), using [pad_sequences](https://keras.io/preprocessing/sequence/).
* Encode the label (label column) using the [to_categorical](https://keras.io/utils/) function in Keras.
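The three encoding steps can be pictured with a small pure-Python sketch. It is a simplified stand-in, not the Keras implementation: the helper names (`fit_word_index`, `texts_to_sequences`, `pad_pre`, `one_hot`) are hypothetical, words are split on whitespace only, and padding/truncation are done at the front of the sequence, which is the `'pre'` behaviour the solution below relies on.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index, num_words):
    """Map words to indexes, dropping unknown words and indexes >= num_words."""
    return [[word_index[w] for w in text.lower().split()
             if w in word_index and word_index[w] < num_words]
            for text in texts]

def pad_pre(sequence, max_len):
    """Keep the last max_len indexes and left-pad with zeros."""
    trimmed = sequence[-max_len:]
    return [0] * (max_len - len(trimmed)) + trimmed

def one_hot(label, num_classes=2):
    """Turn a class index into a one-hot vector."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]

texts = ["why is the sky blue", "why is the grass green", "is the sky falling"]
word_index = fit_word_index(texts)
sequences = texts_to_sequences(texts, word_index, num_words=6)
padded = [pad_pre(seq, max_len=4) for seq in sequences]
print(word_index)
print(padded)
print(one_hot(1))
```

Index 0 is never assigned to a word, so it can safely serve as the padding value.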

In [7]:
def encoding_textdata(train_set, validation_set, test_set, max_features, max_len):
    '''
    Input:
    - Train/validation/test dataset.
    - max_features, max_len.
    Output:
    - X train/validation/test, y train/validation/test.
    - Tokenizer.
    '''
    ## TYPE YOUR CODE for task 5 here:
    tokenizer = Tokenizer(num_words=max_features)
    tokenizer.fit_on_texts(pd.concat([train_set['text'], validation_set['text']]))
    X_tr = tokenizer.texts_to_sequences(train_set['text'])
    X_tr = pad_sequences(X_tr, maxlen=max_len)    # use default padding and truncation i.e. 'pre'
    y_tr = to_categorical(train_set['label'])
    X_va = tokenizer.texts_to_sequences(validation_set['text'])
    X_va = pad_sequences(X_va, maxlen=max_len)
    y_va = to_categorical(validation_set['label'])
    X_te = tokenizer.texts_to_sequences(test_set['text'])
    X_te = pad_sequences(X_te, maxlen=max_len)
    y_te = to_categorical(test_set['label'])

    return (X_tr, y_tr), (X_va, y_va), (X_te, y_te), tokenizer

(X_tr, y_tr), (X_va, y_va), (X_te, y_te), tokenizer = encoding_textdata(train_set, validation_set, test_set, max_features, max_len)

In [8]:
# Check result
print(X_tr.shape)
X_tr

(1175509, 50)


array([[    0,     0,     0, ...,  3767,   391,   258],
       [    0,     0,     0, ..., 18261,    46,  1864],
       [    0,     0,     0, ...,     4, 16538,   562],
       ...,
       [    0,     0,     0, ...,  6150,    27,   286],
       [    0,     0,     0, ...,    23,   951,  2184],
       [    0,     0,     0, ...,     1,   933, 11291]], dtype=int32)

In [9]:
print(y_tr.shape)
y_tr

(1175509, 2)


array([[1., 0.],
       [1., 0.],
       [1., 0.],
       ...,
       [1., 0.],
       [0., 1.],
       [1., 0.]], dtype=float32)

**Task 6**: Create word embedding matrix.
* First, write a function to [load the GloVe dictionary.](https://medium.com/analytics-vidhya/basics-of-using-pre-trained-glove-vectors-in-python-d38905f356db)
* Then, create a word embedding matrix from the GloVe dictionary with these parameters:
    - Word embedding matrix shape: (number of words, embed_size).
    - Embed size: 50.
    - Number of words: the minimum of (max_features, len(word_index)), where word_index is the tokenizer's word-to-index dictionary.
    - If a word occurs in the GloVe dictionary, use its GloVe vector as the initial value. Otherwise, draw a random vector from a normal distribution whose mean and std match those of the GloVe values.
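The rules above can be tried on a toy vocabulary first. The sketch below is an assumption-laden miniature: `build_embedding_matrix`, the three-dimensional vectors and the made-up word `zzyzx` are all hypothetical, and a seeded NumPy generator is used so the random rows are reproducible.

```python
import numpy as np

def build_embedding_matrix(glove, word_index, max_features, seed=0):
    """Rows follow tokenizer indexes; row 0 is left at zero for padding."""
    rng = np.random.default_rng(seed)
    all_values = np.stack(list(glove.values()))
    mean, std = all_values.mean(), all_values.std()
    embed_size = all_values.shape[1]

    num_words = min(max_features, len(word_index))
    matrix = np.zeros((num_words, embed_size), dtype="float32")
    for word, idx in word_index.items():
        if idx >= num_words:
            continue  # index falls outside the matrix
        vector = glove.get(word)
        if vector is None:
            # Unknown to GloVe: sample from the overall GloVe distribution
            vector = rng.normal(mean, std, embed_size)
        matrix[idx] = vector
    return matrix

glove = {
    "the": np.array([0.1, 0.2, 0.3], dtype="float32"),
    "sky": np.array([0.4, 0.5, 0.6], dtype="float32"),
}
word_index = {"the": 1, "sky": 2, "zzyzx": 3, "blue": 4}
matrix = build_embedding_matrix(glove, word_index, max_features=10)
print(matrix.shape)
print(matrix)
```

With four words and `max_features=10`, the matrix has four rows: the zero padding row, two rows copied from the toy GloVe dictionary, and one randomly initialised row for `zzyzx`. The word with index 4 does not fit and is skipped.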
    



In [10]:
def get_coefs(word,*arr): 
    return word, np.asarray(arr, dtype='float32')
def get_GloVe_dict(GloVe_link):
    '''
    input: GloVe link.
    output: GloVe dictionary.
    '''
    ## TYPE YOUR CODE for task 6 here:
    GloVe_dict = {}
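    with open(GloVe_link, encoding='utf-8') as f:
        for line in f:
            # Each line is "<word> <v1> <v2> ... <vN>"; strip the trailing newline first
            arr = line.rstrip().split(' ')
            key, val = get_coefs(arr[0], *arr[1:])
            GloVe_dict[key] = val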
    return GloVe_dict

    
GloVe_link = 'glove.6B.50d.txt'
GloVe_dict = get_GloVe_dict(GloVe_link)
GloVe_dict['the']  # Check result with word 'the'

array([ 4.1800e-01,  2.4968e-01, -4.1242e-01,  1.2170e-01,  3.4527e-01,
       -4.4457e-02, -4.9688e-01, -1.7862e-01, -6.6023e-04, -6.5660e-01,
        2.7843e-01, -1.4767e-01, -5.5677e-01,  1.4658e-01, -9.5095e-03,
        1.1658e-02,  1.0204e-01, -1.2792e-01, -8.4430e-01, -1.2181e-01,
       -1.6801e-02, -3.3279e-01, -1.5520e-01, -2.3131e-01, -1.9181e-01,
       -1.8823e+00, -7.6746e-01,  9.9051e-02, -4.2125e-01, -1.9526e-01,
        4.0071e+00, -1.8594e-01, -5.2287e-01, -3.1681e-01,  5.9213e-04,
        7.4449e-03,  1.7778e-01, -1.5897e-01,  1.2041e-02, -5.4223e-02,
       -2.9871e-01, -1.5749e-01, -3.4758e-01, -4.5637e-02, -4.4251e-01,
        1.8785e-01,  2.7849e-03, -1.8411e-01, -1.1514e-01, -7.8581e-01],
      dtype=float32)

In [52]:
def create_embedding_matrix(GloVe_dict, tokenizer, max_features):
    '''
    input: GloVe dictionary, tokenizer fitted on the training and validation datasets, maximum number of features.
    output: Word embedding matrix.
    '''
    ## TYPE YOUR CODE for task 6 here:
    word_num = min(max_features, len(tokenizer.index_word))

    # Calculate mean and std
    df = pd.DataFrame.from_dict(GloVe_dict, orient='index')
    embed_size = df.shape[1]
    coefs = df.stack()
    mean = coefs.mean()
    std = coefs.std()

    # Words known to the tokenizer, ordered by index (most frequent first)
    words = [tokenizer.index_word[idx] for idx in range(1, word_num)]

    # Create the embedding matrix with word indices from the tokenizer and coefs from GloVe_dict
    embedding_matrix = np.zeros((word_num, embed_size))
    # Row 0 stays zero: index 0 is reserved for padding and never assigned to a word
    embedding_matrix[0, :] = 0
    for idx in range(1, word_num):
        embedding_matrix[idx, :] = GloVe_dict.get(words[idx-1], np.random.normal(mean, std, embed_size))

    return embedding_matrix

embedding_matrix = create_embedding_matrix(GloVe_dict, tokenizer, max_features)
print(embedding_matrix.shape)
embedding_matrix


(20000, 50)


array([[ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.41800001,  0.24968   , -0.41242   , ..., -0.18411   ,
        -0.11514   , -0.78580999],
       [ 0.45322999,  0.059811  , -0.10577   , ...,  0.53240001,
        -0.25103   ,  0.62546003],
       ...,
       [-0.10721   , -1.36220002,  1.33169997, ...,  0.64011002,
         0.063936  , -1.74650002],
       [ 0.18455   , -0.68822998, -0.20072   , ..., -0.95519   ,
        -0.030273  , -0.31542   ],
       [-0.34103999,  0.17725   , -0.54510999, ..., -1.19029999,
        -0.20367999, -0.169     ]])

# III. Modelling
There are some steps we need to finish:
* Build the model.
* Compile the model.
* Train / fit the data to the model.
* Evaluate the model on the testing set.

## Build the model
**Task 7:** We can build a simple model composed of the following layers:
* [Embedding](https://keras.io/layers/embeddings/) layer with max_features, embed_size and embedding_matrix.
* [Bidirectional LSTM layer](https://keras.io/examples/nlp/bidirectional_lstm_imdb/?fbclid=IwAR3fEd6aWyeIDEhZSspjtCRiP0c0Jnz5-XdnUHQYwX8Tp8k9Ni4I8Q5tP9o) with number of hidden units = 50, dropout_rate = 0.1 and recurrent_dropout_rate = 0.1.
* GlobalMaxPool1D.
* Dense with number of units = 50, activation = 'relu'.
* Dropout with rate = 0.1.
* Final dense with number of units = number of classes, activation = 'sigmoid'.

In [137]:
def create_model(max_len, max_features, embed_size):
    '''
    input: max_len, max_features, embed_size
    output: model.
    '''
    ## TYPE YOUR CODE for task 7 here:
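    i = Input(shape=(max_len,))
    # Frozen embedding layer initialised from the global GloVe-based embedding_matrix
    x = Embedding(max_features, embed_size, mask_zero=True,
                  embeddings_initializer=tensorflow.keras.initializers.constant(embedding_matrix),
                  trainable=False)(i)
    # 25 units per direction, so the concatenated output has 50 features per time step
    x = Bidirectional(LSTM(units=25, dropout=0.1, recurrent_dropout=0.1, return_sequences=True))(x)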
    x = GlobalMaxPool1D()(x)
    x = Dense(50, activation='relu')(x)
    x = Dropout(0.1)(x)
    o = Dense(2, activation='sigmoid')(x)
    model = Model(i, o)
    return model

model = create_model(max_len, max_features, embed_size)



**Task 8:** Compile the model and set up the callback. Then print out the model summary.
* [Compile](https://keras.io/models/model/#compile) the model with the Adam optimizer, lr = 1e-2, a suitable loss for a binary classification problem and ["F1-score"](https://github.com/tensorflow/addons/issues/825) as the metric.
* Print out the model summary.

In [147]:
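class F1Callback(Callback):
    '''Adds the weighted validation F1-score to the epoch logs as 'val_f1_score'.'''
    def __init__(self, X_va, y_va):
        super().__init__()
        self.X_va = X_va
        self.y_va = y_va.argmax(1)  # one-hot labels -> class indexes
    def on_epoch_end(self, epoch, logs=None):
        # Must run before any callback that monitors 'val_f1_score'
        logs['val_f1_score'] = f1_score(self.y_va, self.model.predict(self.X_va).argmax(1), average='weighted')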

class PrintCallback(Callback):
    def on_epoch_begin(self, epoch, logs=None):
        print("Epoch {}:".format(epoch), end='')
    def on_epoch_end(self, epoch, logs=None):
        print(" loss: {:.3}, acc: {:.3}, val_acc: {:.3}, val_f1: {:.3}".format(logs['loss'],
                                                                               logs['accuracy'],
                                                                               logs['val_accuracy'],
                                                                               logs['val_f1_score']))

In [148]:
def optimize(model):
    '''
    Input: 
        Model.
    Return: 
        Compiled model.
    '''
    ## TYPE YOUR CODE for task 8 here:
    model.compile(optimizer=optimizers.Adam(learning_rate=1e-2),
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model

model = optimize(model)
print(model.summary())

Model: "model_13"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_14 (InputLayer)       [(None, 50)]              0         
                                                                 
 embedding_13 (Embedding)    (None, 50, 50)            1000000   
                                                                 
 bidirectional_13 (Bidirecti  (None, 50, 50)           15200     
 onal)                                                           
                                                                 
 global_max_pooling1d_13 (Gl  (None, 50)               0         
 obalMaxPooling1D)                                               
                                                                 
 dense_26 (Dense)            (None, 50)                2550      
                                                                 
 dropout_13 (Dropout)        (None, 50)                0  
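The `Param #` column of the summary can be checked by hand. The sketch below assumes the standard parameter counts for an embedding table, an LSTM with biases (four gates) and a dense layer with a bias; the three helper functions are hypothetical names introduced only for this check.

```python
def embedding_params(vocab_size, embed_size):
    # One embed_size-dimensional vector per vocabulary row
    return vocab_size * embed_size

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit
    return input_dim * units + units

embedding_total = embedding_params(20000, 50)
bidirectional_total = 2 * lstm_params(50, 25)  # forward + backward LSTM
dense_total = dense_params(50, 50)
print(embedding_total, bidirectional_total, dense_total)
```

The 15,200 parameters of the bidirectional layer correspond to 25 units per direction, so its 50 output features come from concatenating the forward and backward states. The embedding weights are counted in the summary even though the layer is frozen.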

**Task 9**: Set up the callbacks.
* Create the [tensorboard callback](https://www.tensorflow.org/tensorboard/tensorboard_in_notebooks) to save the logs.
* Create the [checkpoint callback](https://machinelearningmastery.com/check-point-deep-learning-models-keras/) to save the checkpoint with the best accuracy after each epoch.
* Create the [ReduceLROnPlateau](https://keras.io/callbacks/#reducelronplateau) callback with factor=0.3, patience=1 and "Validation F1-score" monitor.
* Create the [early stopping callback](https://keras.io/callbacks/#earlystopping) with patience=7, mode = 'max' and "Validation F1-score" monitor.
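To get a feel for how the two patience settings interact, the following pure-Python simulation replays a list of validation F1-scores. It is a deliberately simplified model, not the Keras code: `simulate_callbacks` is a hypothetical helper, both rules share one `min_delta` and one "best" value, and features such as cooldown or a minimum learning rate are left out.

```python
def simulate_callbacks(val_f1_history, lr=1e-2, factor=0.3,
                       lr_patience=1, stop_patience=7, min_delta=0.0):
    """Return the learning rate used after each epoch and the stopping epoch (or None)."""
    best = float("-inf")
    lr_wait = 0
    stop_wait = 0
    lrs = []
    for epoch, score in enumerate(val_f1_history):
        if score > best + min_delta:
            # Improvement: remember it and reset both counters
            best = score
            lr_wait = 0
            stop_wait = 0
        else:
            lr_wait += 1
            stop_wait += 1
            if lr_wait >= lr_patience:
                lr *= factor  # plateau detected: shrink the learning rate
                lr_wait = 0
        lrs.append(lr)
        if stop_wait >= stop_patience:
            return lrs, epoch  # early stopping triggers here
    return lrs, None

# Toy history: two improvements, a plateau, then a new best score
history = [0.950, 0.955, 0.955, 0.954, 0.956]
lrs, stopped_at = simulate_callbacks(history)
print(lrs)
print(stopped_at)
```

With patience=1 the learning rate is multiplied by 0.3 after every single epoch without improvement, so a long plateau shrinks it quickly, well before the early-stopping patience of 7 is reached.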



In [149]:
def callback_model(checkpoint_name, logs_name):
    '''
    Input: 
        Best checkpoint name, logs name.
    Return: 
        Callback list, which contains tensorboard callback and checkpoint callback.
    '''
    ## TYPE YOUR CODE for task 9 here:
    f1_cb = F1Callback(X_va, y_va)
    pr_cb = PrintCallback()
    ts_cb = TensorBoard(logs_name)
    cp_cb = ModelCheckpoint(checkpoint_name,
                            monitor='val_f1_score',
                            save_best_only=True,
                            save_weights_only=True,
                            mode='max',
                            save_freq='epoch')
    lr_cb = ReduceLROnPlateau(monitor='val_f1_score', factor=0.3, patience=1, mode='max')
    es_cb = EarlyStopping(monitor='val_f1_score', patience=7, mode='max', min_delta=0.002)

    return [f1_cb, pr_cb, ts_cb, cp_cb, lr_cb, es_cb]

checkpoint_name = 'weights.best.hdf5'
logs_name = 'training_logs'
callbacks_list = callback_model(checkpoint_name, logs_name)

**Task 10:** Train the model.

* Train the model with 20 epochs with batch_size = 4096.
* Return the model with best-checkpoint weights.

*Hint*: Fit the model first, then reload the model (load_model function) with best-checkpoint weights.

In [None]:
def train_model(model, callbacks_list):
    '''
    Input: 
        Model and callback list,
    Return: 
        Model with best-checkpoint weights.
    '''
    ## TYPE YOUR CODE for task 10 here:
    model.fit(X_tr, y_tr,
              epochs=20,
              batch_size=4096,
              validation_data=(X_va, y_va),
              callbacks=callbacks_list)
    # callbacks_list[3] is the ModelCheckpoint callback that stored the best weights
    model.load_weights(callbacks_list[3].filepath)
    return model

model = train_model(model, callbacks_list)


Epoch 0:Epoch 1/20
Epoch 1:Epoch 2/20
Epoch 2:Epoch 3/20
Epoch 3:Epoch 4/20
Epoch 4:Epoch 5/20
Epoch 5:Epoch 6/20
Epoch 6:Epoch 7/20
Epoch 7:Epoch 8/20
Epoch 8:Epoch 9/20
Epoch 9:Epoch 10/20
Epoch 10:Epoch 11/20

In [None]:
!cp weights.best.hdf5 gdrive/MyDrive/dataset/asm2/

In [None]:
!cp -r training_logs gdrive/MyDrive/dataset/asm2/

**Task 11:** Show the tensorboard in the notebook.

In [None]:
## TYPE YOUR CODE for task 11 here:
%load_ext tensorboard
%tensorboard --logdir training_logs

**Task 12:** Prediction on test set.

* Complete the get_prediction_classes function.
* Print out the precision, recall and F1 score.
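Before reading the scores, it helps to see how they behave on imbalanced data like this dataset, where roughly 6% of the questions are invalid. The sketch below computes per-class precision, recall and F1 in plain Python and then a support-weighted F1, which mirrors the idea behind `average='weighted'`. The helper names and the toy labels are hypothetical.

```python
def per_class_f1(y_true, y_pred, positive):
    """Precision, recall and F1 when `positive` is treated as the class of interest."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def weighted_f1(y_true, y_pred, classes=(0, 1)):
    """Average the per-class F1 scores, weighted by how often each class occurs."""
    total = len(y_true)
    return sum(per_class_f1(y_true, y_pred, c)[2] * y_true.count(c) / total
               for c in classes)

# Toy case: 94 valid and 6 invalid questions, and a model that always answers "valid"
y_true = [0] * 94 + [1] * 6
always_valid = [0] * 100
invalid_scores = per_class_f1(y_true, always_valid, positive=1)
overall = weighted_f1(y_true, always_valid)
print(invalid_scores)
print(round(overall, 3))
```

A model that never flags an invalid question still reaches a weighted F1 of about 0.91 here, while its F1 on the invalid class is 0. It is therefore worth looking at the per-class scores as well as the weighted average.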

In [None]:
def get_prediction_classes(model, X, y):
    ## TYPE YOUR CODE for task 12 here:
    '''
    Input: 
        Model and prediction dataset.
    Return: 
        Prediction list and groundtruth list with predicted classes.
    '''
    groundtruths = y.argmax(1)
    predictions = model.predict(X).argmax(1)
    return predictions, groundtruths

test_predictions, test_groundtruths = get_prediction_classes(model,  X_te, y_te)
# scikit-learn metrics expect (y_true, y_pred); `average` belongs to the metric, not to print
print(precision_score(test_groundtruths, test_predictions, average='weighted'))
print(recall_score(test_groundtruths, test_predictions, average='weighted'))
print(f1_score(test_groundtruths, test_predictions, average='weighted'))

**Task 13:** Visualize the predictions on the test set with a confusion matrix. Remember to show the class names in the confusion matrix.
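As a reminder of what the plot encodes, here is a small text-only sketch that counts the four cells by hand, with rows for the true class and columns for the predicted class (the same orientation scikit-learn uses). `confusion_counts`, `format_matrix` and the toy labels are hypothetical and only serve the illustration.

```python
def confusion_counts(y_true, y_pred, num_classes=2):
    """matrix[t][p] = number of samples with true class t predicted as class p."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

def format_matrix(matrix, class_names):
    """Render the counts as a small text table labelled with the class names."""
    width = max(len(name) for name in class_names) + 2
    header = " " * width + "".join(name.rjust(width) for name in class_names)
    rows = [name.ljust(width) + "".join(str(v).rjust(width) for v in row)
            for name, row in zip(class_names, matrix)]
    return "\n".join([header] + rows)

y_true = [0, 0, 0, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 0]
counts = confusion_counts(y_true, y_pred)
table = format_matrix(counts, ["valid", "invalid"])
print(table)
```

The off-diagonal cells are the errors: valid questions flagged as invalid (top right) and invalid questions that slipped through (bottom left).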

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay

def plot_confusion_matrix(predictions, groundtruth, class_names):
    ## TYPE YOUR CODE for task 13 here:
    cm = confusion_matrix(groundtruth, predictions)
    # display_labels puts the class names on both axes
    ConfusionMatrixDisplay(cm, display_labels=class_names).plot()
    plot.show()

class_names = ['valid', 'invalid']
plot_confusion_matrix(test_predictions, test_groundtruths, class_names)

**Task 14**: Model fine-tuning - fine-tune the model using some of these approaches:
* Increase max epochs, change batch size.
* Replace LSTM by GRU units and check if it changes anything.
* Add another layer of LSTM/GRU, see if things improve.
* Play around with Dense layers (add/# units/etc).
* Find preprocessing rules you could add to improve the quality of the data.
* Find another GloVe dictionary.

Requirement: The F1 score should increase by 2-3%.
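For the preprocessing idea, one possible starting point is a small cleaning function applied to the text column before tokenization. The rules below (lowercasing, normalising apostrophes, expanding a handful of contractions, dropping other symbols) are assumptions chosen for illustration, not a verified recipe for this dataset, and `CONTRACTIONS` and `clean_question` are hypothetical names.

```python
import re

# A few illustrative contractions; a real list would be much longer
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "what's": "what is",
    "i'm": "i am",
}

def clean_question(text):
    """Lowercase, expand known contractions and strip symbols from a question."""
    text = text.lower().replace("\u2019", "'")  # curly apostrophe -> straight
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    text = re.sub(r"[^a-z0-9'\s]", " ", text)  # drop punctuation and other symbols
    return re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace

cleaned = clean_question("What's the BEST way to learn Python?!")
print(cleaned)
print(clean_question("I can\u2019t  believe it"))
```

Whether such rules help can be checked by counting how many tokenizer words are found in the GloVe dictionary before and after cleaning, and then comparing the validation F1-score.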

In [None]:
## TYPE YOUR CODE for task 14 here: