# Models

We will cover the following models:
* Embedding => Class
* Embedding => Simple RNN => Class
* Embedding => Bi-directional RNN => Class

### Load data
Load the Toxic Comment Classification Challenge dataset
and split it into training, validation, and test sets.
#### Input text for training
1. Get the training data
    * read the csv data file using pandas
    * tokenize the data
    * assign a dimension to each word
    * convert into embeddings

#### Read CSV

In [1]:
import pandas as pd
train_csv = './data/toxic-comments/train.csv'
train_df = pd.read_csv(train_csv)
# ToDo : sort the df based on size of comments (no. of words in comment)

In [2]:
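# Sum the toxicity flag columns (every column after the first two)
rowsums = train_df.iloc[:, 2:].sum(axis=1)
# A comment is "clean" (non-toxic) when none of the flags is set
train_df['clean'] = (rowsums == 0)
train_texts = train_df['comment_text']
train_labels = train_df['clean']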

#### Output labels (class/target) for training
2. Get the training labels
    * read the labels and collapse them into a single binary label
    * We will focus on the 2-class problem: toxic vs. non-toxic comments
    * All the different types of toxic comment are merged into one toxic label:
        * 0 for toxic comments
        * 1 for non-toxic comments
    * Later we can explore how to turn this into a multi-class classifier
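
The labelling rule above can be tried on a toy array. This is only a minimal sketch: `toy_flags` is made-up data, and it assumes one 0/1 column per toxicity type, as in the dataset.

```python
import numpy as np

# Hypothetical toy data: one row per comment, one 0/1 column per toxicity type.
toy_flags = np.array([
    [0, 0, 0, 0, 0, 0],   # no flag set       -> non-toxic
    [1, 0, 1, 0, 1, 0],   # several flags set -> toxic
    [0, 0, 0, 1, 0, 0],   # a single flag set -> toxic
])

# Same rule as the notebook cell: a comment is "clean" when no flag is set.
row_sums = toy_flags.sum(axis=1)
clean = (row_sums == 0).astype(int)   # 1 = non-toxic, 0 = toxic

print(clean)
```

Any comment with at least one flag ends up in the single toxic class, whichever flag it was.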

### Tokenization
Now we have the training data in two separate arrays: an ordered array of comments (the input to the network) and an array of class labels in the same order (the output of the network).

We have to transform this data into network input format and output format. This step is called pre-processing.  
Steps of preprocessing:

1. Tokenize the text into words
2. Assign each word a dimension


To accomplish steps 1 and 2 we will use the built-in Keras `Tokenizer`.
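
To make steps 1 and 2 concrete, here is a simplified pure-Python sketch of the idea. It is not the Keras implementation: the helper names `fit_word_index` and `texts_to_sequences` and the toy texts are illustrative, and punctuation handling is left out.

```python
from collections import Counter

def fit_word_index(texts):
    """Map each word to an integer index; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by descending frequency; ties keep first-seen order (the sort is stable).
    ranked = sorted(counts.items(), key=lambda item: -item[1])
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def texts_to_sequences(texts, word_index, num_words):
    """Replace words by their index, keeping only indices below num_words."""
    return [
        [word_index[w] for w in text.lower().split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]

toy_texts = ["you are great", "you are not so great"]
toy_index = fit_word_index(toy_texts)
toy_sequences = texts_to_sequences(toy_texts, toy_index, num_words=5)

print(toy_index)
print(toy_sequences)
```

Rare words whose index falls outside the vocabulary limit are simply dropped from the sequence, which is why a sequence can be shorter than the original comment.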

In [3]:
from keras.preprocessing.text import Tokenizer
max_vocab_size = 10000
tokenizer = Tokenizer(num_words=max_vocab_size)
tokenizer.fit_on_texts(train_texts)
sequences = tokenizer.texts_to_sequences(train_texts)
print(sequences[0])

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

Using TensorFlow backend.

[688, 75, 1, 126, 130, 177, 29, 672, 4511, 1116, 86, 331, 51, 2278, 50, 6864, 15, 60, 2756, 148, 7, 2937, 34, 117, 1221, 2825, 4, 45, 59, 244, 1, 365, 31, 1, 38, 27, 143, 73, 3462, 89, 3085, 4583, 2273, 985]
Found 210337 unique tokens.


### Batching and Preprocessing (padding) for Embedding
Once we have the tokens, we take the following steps to create word embeddings:  

3. Use this dimension assignment to define the embedding
4. Use the word embeddings to create the word vectors for a comment


We will use a specific type of layer for this, called the Embedding layer. The token sequences generated above are the input to the Embedding layer, which outputs word embeddings to the next layer:  

   **Input**: 2D tensor of integers, of shape (samples, sequence_length), where each row is a sequence of integers (output of the above code).  
   **Output**: 3D floating-point tensor of shape (samples, sequence_length, embedding_dimensionality).  
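
The input and output shapes can be checked with a small NumPy sketch of an embedding lookup. The table here is random and the sizes are toy values chosen for illustration; in a real model the table is learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)

toy_vocab_size, toy_embedding_dim = 10, 4
# One row vector per word index (random stand-in for learned weights).
embedding_table = rng.normal(size=(toy_vocab_size, toy_embedding_dim))

# Input: 2 samples, each a sequence of 3 word indices -> (samples, sequence_length)
token_ids = np.array([[1, 5, 2],
                      [7, 0, 3]])

# Lookup: every integer is replaced by its row of the table.
embedded = embedding_table[token_ids]

print(token_ids.shape)   # (2, 3)
print(embedded.shape)    # (2, 3, 4)
```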

Sequence length can vary from batch to batch, but within a single batch all sequences must have the same length.  

So we have to group the data into batches of sequences of similar length and then pad or truncate each sequence so that all sequences in a batch have the same length. Each batch can then be used as a training input for the Embedding layer.  
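
Padding and truncating can be sketched in a few lines of plain Python. The helper `pad_or_truncate` is hypothetical; it assumes the behaviour `pad_sequences` uses by default, which is to pad at the front with zeros and to truncate from the front.

```python
def pad_or_truncate(sequence, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items."""
    if len(sequence) >= maxlen:
        return sequence[-maxlen:]                           # drop items from the front
    return [value] * (maxlen - len(sequence)) + sequence    # pad at the front

toy_batch = [[4, 9], [3, 1, 7, 2, 8, 5], [6, 6, 6, 6]]
padded = [pad_or_truncate(seq, maxlen=4) for seq in toy_batch]

print(padded)   # [[0, 0, 4, 9], [7, 2, 8, 5], [6, 6, 6, 6]]
```

After this step every sequence has the same length, so the batch can be stored as one 2D integer array.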

For this sample case we take the first 10k of the ~160k sequences for training and the next 1k for testing, and we pad or truncate every sequence to a maximum length of 20 words.


In [4]:
from keras import preprocessing
training_sequences = sequences[:10000]
training_labels = train_labels[:10000]
seq_max_len = 20
# training padded sequences
train_seq_pad = preprocessing.sequence.pad_sequences(sequences=training_sequences, maxlen=seq_max_len)

# testing padded sequences
testing_sequences = sequences[10000:11000]
testing_labels = train_labels[10000:11000]
test_seq_pad = preprocessing.sequence.pad_sequences(sequences=testing_sequences, maxlen=seq_max_len)

### Model 1. : Embedding to Class

#### Define the model 1
Model 1 is made of 4 layers:

- Layer 0 is input layer
- Layer 1 is Embedding layer (Hidden Layer)
- Layer 2 is Flatten Layer (Flattens the embedding layer)
- Layer 3 is Dense Layer (output layer)

**Embedding Layer**: This layer creates the word embeddings (discussed in the Sequence Representation section). For a single input (a sentence that arrives as a sequence of integers) its output is 2D: each integer (representing a word) is transformed into a vector, so a sequence of integers yields a 2D matrix.

**Flatten Layer**: The Embedding layer outputs a 2D matrix per input. To use that output in a Dense layer downstream, it needs to be transformed into 1D, which is what the Flatten layer does.
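
The shape changes through Flatten and Dense can be followed in NumPy. This is a sketch with random stand-in values and toy sizes, not the Keras layers themselves.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_batch_size, toy_seq_len, toy_emb_dim = 3, 5, 2

# Stand-in for the Embedding output: one (seq_len, emb_dim) matrix per sample.
embedded = rng.normal(size=(toy_batch_size, toy_seq_len, toy_emb_dim))

# Flatten: keep the batch axis, merge all remaining axes into one.
flattened = embedded.reshape(toy_batch_size, -1)

# Dense(1, sigmoid): one weight per flattened feature plus one bias.
weights = rng.normal(size=(toy_seq_len * toy_emb_dim, 1))
bias = np.zeros(1)
probabilities = 1.0 / (1.0 + np.exp(-(flattened @ weights + bias)))

print(flattened.shape)            # (3, 10)
print(probabilities.shape)        # (3, 1)
print(weights.size + bias.size)   # 11
```

With `seq_max_len = 20` and `embedding_dim = 8`, the same arithmetic gives 160 flattened features and 160 + 1 = 161 Dense parameters, which is what `model_1.summary()` reports.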

In [5]:
from keras.models import Sequential
from keras.layers import Dense, Embedding, Flatten

model_1 = Sequential()

# size of the vocabulary kept by the tokenizer (max_vocab_size above); each word in the vocab is assigned an index (dimension).
vocab_size = 10000 

# max length of single input data point i.e. count of words present in an input sentence
# short seq are padded and long ones are truncated, done above
# input of the network
seq_max_len = 20 

# dimension of word embedding model (output dimension of embedding layer)
embedding_dim = 8 
# input to layer 0 is data of shape: [batch_size, seq_max_len]
# add layer 1 in the network
model_1.add(Embedding(vocab_size, embedding_dim, input_length=seq_max_len))
# output of layer 1 is data of shape: [batch_size, seq_max_len, embedding_dim]

## layer 2: flatten the input of shape [batch_size, seq_max_len, embedding_dim]
#          to output of shape [batch_size, seq_max_len * embedding_dim]
model_1.add(Flatten())

## layer 3 (output layer): Dense layer - every node of the previous layer is connected to every node of this layer
#          this has 1 unit/node for classification (toxic/non-toxic)
#          and the activation for 2 classes: sigmoid
model_1.add(Dense(1, activation='sigmoid'))

## compile: configure the model for training
# optimizer: the method used to update the network,
#            generally a variant of stochastic gradient descent (SGD);
#            it is applied iteratively to update the network weights
# loss:      the (objective) function that will be minimised
# metrics:   used to measure the performance of the network
model_1.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])


In [6]:
# prints the summary of the model
model_1.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 20, 8)             80000     
_________________________________________________________________
flatten_1 (Flatten)          (None, 160)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 161       
Total params: 80,161
Trainable params: 80,161
Non-trainable params: 0
_________________________________________________________________


In [7]:
# fit: trains the network for a fixed no. of epochs
history_1 = model_1.fit(train_seq_pad, training_labels, epochs=10, batch_size=32, validation_split=0.2)

Train on 8000 samples, validate on 2000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<img src="img/m11.png" alt="Visual representation of one hot encodding and word embedding" style="width: 600px;"/>  


Created using [NN SVG Tool](http://alexlenail.me/NN-SVG/index.html)  

For the above diagram, the following configuration is used (1/4 of the sizes used in the code):
1. seq_max_len = 5
2. embedding_dim = 2
3. flatten layer = 5x2 = 10
4. dense output layer = 1




#### Test the model 1

We will take a small test set from the unused training data to test our basic model.  

The `model_1.evaluate` method is used to evaluate the model. For evaluation we pass the test data in the same format as the training data, together with the labels of the test data to compare against.
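
The two numbers that evaluation returns, loss and accuracy, can be computed by hand for a few predictions. This is a NumPy sketch under assumptions: the toy labels and probabilities are made up, and the 0.5 decision threshold is assumed rather than taken from the notebook.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of 0/1 labels under predicted probabilities."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)   # avoid log(0)
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(losses))

def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded prediction equals the label."""
    return float(np.mean((y_prob > threshold).astype(int) == y_true))

toy_labels = np.array([1, 0, 1, 1])
toy_probs = np.array([0.9, 0.2, 0.4, 0.8])

print(binary_crossentropy(toy_labels, toy_probs))
print(binary_accuracy(toy_labels, toy_probs))   # 0.75
```

The third sample is predicted at 0.4 for a true label of 1, so it counts as an error for accuracy and also contributes the largest term to the loss.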

Ref: Listing 6.7 Deep Learning with Python book  

In [8]:
print(model_1.metrics_names)
model_1.evaluate(x=test_seq_pad, y=testing_labels)

['loss', 'acc']


[0.2028051826953888, 0.925]

### Model 2: Embedding => RNN => Output
In model 2 we extend model 1 by adding an RNN layer between the Embedding layer and the output layer.

#### Define the model 2
Model 2 is made of 4 layers:

- Layer 0 is input layer
- Layer 1 is Embedding layer
- Layer 2 is RNN layer
- Layer 3 is Dense Layer (output/classification layer)

**RNN : Recurrent Neural Network**


<img src="img/rnn.png" alt="Recurrent Neural Network" style="width: 300px;"/>  

Source: [Deep Learning with Python, Book by François Chollet](https://www.manning.com/books/deep-learning-with-python)

An RNN is a neural network with the following properties:
 - it processes each element (word) of a sequence (sentence) one by one,
 - the output of each intermediate element is fed back in together with the next element,
 - the state of the RNN is reset between two independent sequences.
 
The input to the Dense layer is only the output at the end of the sequence.

![Unrolled RNN](img/RNN-unrolled.png)  


Source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

In this model we don't need a Flatten layer because, by default, the SimpleRNN layer outputs only the last element of the processed sequence (h_t).
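
The recurrence described above can be written out in NumPy. This is a minimal sketch, assuming a tanh activation and random weights; the names `w_in`, `w_rec` and `bias` are illustrative and not Keras attributes.

```python
import numpy as np

def simple_rnn_forward(inputs, w_in, w_rec, bias):
    """Run one sequence of shape (timesteps, input_dim) through a tanh RNN cell.

    Returns the hidden state after every timestep, shape (timesteps, units).
    """
    state = np.zeros(w_rec.shape[0])     # state starts from zero for each new sequence
    states = []
    for x_t in inputs:                   # one element (word vector) at a time
        state = np.tanh(x_t @ w_in + state @ w_rec + bias)
        states.append(state)
    return np.stack(states)

rng = np.random.default_rng(0)
timesteps, input_dim, units = 20, 16, 32
w_in = rng.normal(scale=0.1, size=(input_dim, units))
w_rec = rng.normal(scale=0.1, size=(units, units))
bias = np.zeros(units)

sequence = rng.normal(size=(timesteps, input_dim))
all_states = simple_rnn_forward(sequence, w_in, w_rec, bias)
last_state = all_states[-1]              # the only part passed on to the Dense layer

print(all_states.shape, last_state.shape)    # (20, 32) (32,)
print(w_in.size + w_rec.size + bias.size)    # 1568
```

With an input dimension of 16 and 32 units, the weight count is 16*32 + 32*32 + 32 = 1568, the same figure that `model_2.summary()` shows for the SimpleRNN layer.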

In [9]:
from keras.models import Sequential
from keras.layers import Dense, Embedding, SimpleRNN

# model configurations
vocab_size = 10000
seq_max_len = 20 # this can be removed as it is not required for next layer which is RNN
embedding_dim = 16

# model definition
model_2 = Sequential()
model_2.add(Embedding(vocab_size, embedding_dim, input_length=seq_max_len))
model_2.add(SimpleRNN(32))
model_2.add(Dense(1, activation='sigmoid'))
model_2.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])

In [10]:
model_2.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 20, 16)            160000    
_________________________________________________________________
simple_rnn_1 (SimpleRNN)     (None, 32)                1568      
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 33        
Total params: 161,601
Trainable params: 161,601
Non-trainable params: 0
_________________________________________________________________


#### Train the model 2

In [11]:
history_2 = model_2.fit(train_seq_pad, training_labels, epochs=10, batch_size=32, validation_split=0.2)

Train on 8000 samples, validate on 2000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


#### Test the model 2

In [12]:
print(model_2.metrics_names)
model_2.evaluate(x=test_seq_pad, y=testing_labels)

['loss', 'acc']


[0.5610463897287845, 0.854]

We see that the above model has lower accuracy than the much simpler model 1. Likely reasons are that we used only a small part of the data (10k of the ~160k comments) and that every comment was cut down to 20 tokens.


We can extend the model by adding more RNN layers in between. Note that in the model above we did not use the intermediate outputs of the RNN layer.

#### Extended model 2
Extended model 2 is made of 6 layers:

- Layer 0 is input layer
- Layer 1 is Embedding layer
- Layer 2 is RNN layer (return full sequence)
- Layer 3 is RNN layer (return full sequence)
- Layer 4 is RNN layer (return last output)
- Layer 5 is Dense Layer (output/classification layer)

In this setup every RNN layer except the last has to pass on its full processed output sequence.  
The basic model 2 above is many-to-one.  
The intermediate RNN layers of the extended model 2 are many-to-many, as in the diagram below.


![RNN types](img/rnn.jpeg)  


Source: http://karpathy.github.io/2015/05/21/rnn-effectiveness/
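
The difference between returning the full sequence and returning only the last output can be seen from the shapes in a toy stack. The helper `rnn_layer` is a hypothetical NumPy stand-in with random weights, not the Keras layer.

```python
import numpy as np

def rnn_layer(inputs, units, rng, return_sequences=False):
    """Toy tanh RNN layer with random weights for one sequence (timesteps, input_dim)."""
    input_dim = inputs.shape[1]
    w_in = rng.normal(scale=0.1, size=(input_dim, units))
    w_rec = rng.normal(scale=0.1, size=(units, units))
    state = np.zeros(units)
    states = []
    for x_t in inputs:
        state = np.tanh(x_t @ w_in + state @ w_rec)
        states.append(state)
    # Full sequence for a following RNN layer, last state only for a Dense layer.
    return np.stack(states) if return_sequences else state

rng = np.random.default_rng(0)
embedded = rng.normal(size=(20, 16))                        # (timesteps, embedding_dim)

h1 = rnn_layer(embedded, 32, rng, return_sequences=True)    # many-to-many
h2 = rnn_layer(h1, 64, rng, return_sequences=True)          # many-to-many
h3 = rnn_layer(h2, 32, rng)                                 # many-to-one

print(h1.shape, h2.shape, h3.shape)   # (20, 32) (20, 64) (32,)
```

An intermediate layer that returned only its last state would hand the next RNN layer a single vector instead of a sequence, which is why the full sequence is needed everywhere except in the final RNN layer.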

In [13]:
model_2_ext = Sequential()
model_2_ext.add(Embedding(vocab_size, embedding_dim))
# for intermediate layers, we want to return output of each cell of RNN, 
# so that it forms a seq. which is processed by next RNN layer
model_2_ext.add(SimpleRNN(32, return_sequences=True))
model_2_ext.add(SimpleRNN(64, return_sequences=True))
# in the final RNN layer we do not return the sequence but only the final output,
# which is used by the next non-RNN layer, e.g. the Dense layer in this case
model_2_ext.add(SimpleRNN(32))
model_2_ext.add(Dense(1, activation='sigmoid'))
model_2_ext.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])

In [14]:
model_2_ext.summary()

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, None, 16)          160000    
_________________________________________________________________
simple_rnn_2 (SimpleRNN)     (None, None, 32)          1568      
_________________________________________________________________
simple_rnn_3 (SimpleRNN)     (None, None, 64)          6208      
_________________________________________________________________
simple_rnn_4 (SimpleRNN)     (None, 32)                3104      
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 33        
Total params: 170,913
Trainable params: 170,913
Non-trainable params: 0
_________________________________________________________________


#### Train the ext. model 2

In [15]:
history_2_ext = model_2_ext.fit(train_seq_pad, training_labels, epochs=10, batch_size=32, validation_split=0.2)

Train on 8000 samples, validate on 2000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


#### Test the ext. model 2

In [16]:
print(model_2_ext.metrics_names)
model_2_ext.evaluate(x=test_seq_pad, y=testing_labels)

['loss', 'acc']


[1.6925468402430415, 0.755]

### Model 3: Embedding => Bidirectional RNN => Output
In model 3 we extend model 2 by wrapping the RNN layer in a Bidirectional wrapper.

#### Define the model 3
Model 3 is made of 4 layers:

- Layer 0 is input layer
- Layer 1 is Embedding layer
- Layer 2 is Bidirectional RNN layer (return last output)
- Layer 3 is Dense Layer (output/classification layer)

**Bidirectional Layer**  
A Bidirectional layer consists of two hidden layers that read the input in opposite directions: it processes the sequence both forward and backward and generates a combined output.  
For more details see:
 1. https://en.wikipedia.org/wiki/Bidirectional_recurrent_neural_networks
 2. https://d2l.ai/chapter_recurrent-neural-networks/bi-rnn.html


![Bidirectional RNN](img/bidirectional-rnn.png)  


Source: https://www.wandb.com/classes/intro/class-9-notes
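
The bidirectional idea can be sketched in NumPy: two independent RNNs, one reading the sequence as-is and one reading it reversed, with their final states concatenated. The helper `run_rnn` and the random weights are illustrative assumptions, and concatenation is assumed as the way of combining the two outputs.

```python
import numpy as np

def run_rnn(inputs, w_in, w_rec):
    """Return the last hidden state of a tanh RNN over one sequence."""
    state = np.zeros(w_rec.shape[0])
    for x_t in inputs:
        state = np.tanh(x_t @ w_in + state @ w_rec)
    return state

rng = np.random.default_rng(0)
timesteps, input_dim, units = 20, 16, 32
sequence = rng.normal(size=(timesteps, input_dim))

# Two independent weight sets, one per direction.
fwd_weights = (rng.normal(scale=0.1, size=(input_dim, units)),
               rng.normal(scale=0.1, size=(units, units)))
bwd_weights = (rng.normal(scale=0.1, size=(input_dim, units)),
               rng.normal(scale=0.1, size=(units, units)))

forward_state = run_rnn(sequence, *fwd_weights)          # first word -> last word
backward_state = run_rnn(sequence[::-1], *bwd_weights)   # last word -> first word

combined = np.concatenate([forward_state, backward_state])
print(combined.shape)   # (64,)
```

Because each direction has its own weights, wrapping a 32-unit RNN doubles both the output size (64) and the number of parameters.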

In [19]:
from keras.models import Sequential
from keras.layers import Bidirectional, Dense, Embedding, SimpleRNN

# model configurations
vocab_size = 10000
seq_max_len = 20 # this can be removed as it is not required for next layer which is RNN
embedding_dim = 16

# model definition
model_3 = Sequential()
model_3.add(Embedding(vocab_size, embedding_dim, input_length=seq_max_len))
# [1] This will create two copies of the hidden layer,
# one fit on the input sequences as-is and one on a reversed copy of the input sequence.
# By default, the output values from these two RNNs will be concatenated.
model_3.add(Bidirectional(SimpleRNN(32)))
model_3.add(Dense(1, activation='sigmoid'))
model_3.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])

In [20]:
model_3.summary()

Model: "sequential_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_5 (Embedding)      (None, 20, 16)            160000    
_________________________________________________________________
bidirectional_2 (Bidirection (None, 64)                3136      
_________________________________________________________________
dense_5 (Dense)              (None, 1)                 65        
Total params: 163,201
Trainable params: 163,201
Non-trainable params: 0
_________________________________________________________________


#### Train model 3

In [21]:
history_3 = model_3.fit(train_seq_pad, training_labels, epochs=10, batch_size=32, validation_split=0.2)

Train on 8000 samples, validate on 2000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


#### Testing model 3

In [22]:
print(model_3.metrics_names)
model_3.evaluate(x=test_seq_pad, y=testing_labels)

['loss', 'acc']


[0.38808986937999723, 0.9]

As with model 2, model 3 can be extended by adding more bidirectional layers in between.  

#### Extended model 3
Extended model 3 is made of 6 layers:

- Layer 0 is input layer
- Layer 1 is Embedding layer
- Layer 2 is Bidirectional RNN layer (return full sequence)
- Layer 3 is Bidirectional RNN layer (return full sequence)
- Layer 4 is Bidirectional RNN layer (return last output)
- Layer 5 is Dense Layer (output/classification layer)


In [23]:
from keras.models import Sequential
from keras.layers import Bidirectional, Dense, Embedding, SimpleRNN

# model configurations
vocab_size = 10000
seq_max_len = 20 # this can be removed as it is not required for next layer which is RNN
embedding_dim = 16

In [24]:
# model definition
model_3_ext = Sequential()
model_3_ext.add(Embedding(vocab_size, embedding_dim, input_length=seq_max_len))
model_3_ext.add(Bidirectional(SimpleRNN(32, return_sequences=True)))
model_3_ext.add(Bidirectional(SimpleRNN(64, return_sequences=True)))
model_3_ext.add(Bidirectional(SimpleRNN(32)))
model_3_ext.add(Dense(1, activation='sigmoid'))
model_3_ext.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])

In [25]:
model_3_ext.summary()

Model: "sequential_6"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_6 (Embedding)      (None, 20, 16)            160000    
_________________________________________________________________
bidirectional_3 (Bidirection (None, 20, 64)            3136      
_________________________________________________________________
bidirectional_4 (Bidirection (None, 20, 128)           16512     
_________________________________________________________________
bidirectional_5 (Bidirection (None, 64)                10304     
_________________________________________________________________
dense_6 (Dense)              (None, 1)                 65        
Total params: 190,017
Trainable params: 190,017
Non-trainable params: 0
_________________________________________________________________


#### Train ext. model 3

In [26]:
history_3_ext = model_3_ext.fit(train_seq_pad, training_labels, epochs=10, batch_size=32, validation_split=0.2)

Train on 8000 samples, validate on 2000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


#### Test ext. model 3

In [27]:
print(model_3_ext.metrics_names)
model_3_ext.evaluate(x=test_seq_pad, y=testing_labels)

['loss', 'acc']


[0.6692711886763573, 0.895]

##### Ref.:
1. https://machinelearningmastery.com/develop-bidirectional-lstm-sequence-classification-python-keras/
