# Model 2

Model 2 can be represented as follows:
* Input => Embedding => RNN => Class

### Steps for training:
* Load Data
    * Train Data
* Pre-processing: Tokenization
* Batching and Padding
* Model definition
* Training and validation
* Evaluation
* Exercise

All steps are similar to those in the Model 1 notebook.  
We can jump directly to the Model definition step.

### Load Data
Load the toxic comment classification challenge dataset
and split it into training, validation, and testing sets

#### Data for training
For training, we need the dataset as pairs of a comment and its corresponding output label:
1. __Input data:__ Wikipedia comments
2. __Output label:__ whether the comment is toxic or not


#### Read CSV
* read the csv data file using pandas

In [1]:
import pandas as pd
train_csv = './storage/dataset/train.csv'
train_df = pd.read_csv(train_csv)
# To Do: sort the df based on size of comments (no. of words in comment)

#### Training Data Preparation
* read the labels and convert them into a single binary label
* we will focus on the 2-class problem: toxic and non-toxic comments
* we will map all the different types of toxic comments to the same toxic label:
    * 0 for toxic comments
    * 1 for non-toxic comments
* later we can explore how to make it a multiclass classifier

In [2]:
# each toxic class is labelled as 1
toxic_row_sums = train_df.iloc[:,2:].sum(axis=1)
# if sum of toxic class is 0 then it is a clean comment
train_df['clean'] = (toxic_row_sums==0)
# Input Data
train_texts = train_df['comment_text']
# Output Label
train_labels = train_df['clean']
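
To make the 0/1 convention concrete, here is a toy version of the same labelling rule in plain Python. The three comments and the order of the six flag columns (toxic, severe_toxic, obscene, threat, insult, identity_hate) are made up for illustration, and `clean_label` is a hypothetical helper, not part of the notebook.

```python
# Toy rows (illustrative data, not from the real CSV):
# (comment, [toxic, severe_toxic, obscene, threat, insult, identity_hate])
rows = [
    ("thanks for the helpful edit", [0, 0, 0, 0, 0, 0]),
    ("you are an idiot",            [1, 0, 0, 0, 1, 0]),
    ("nice article",                [0, 0, 0, 0, 0, 0]),
]

def clean_label(flags):
    """Hypothetical helper: 1 for a clean comment, 0 if any toxic flag is set."""
    return int(sum(flags) == 0)

labels = [clean_label(flags) for _, flags in rows]
print(labels)  # [1, 0, 1]
```

A comment with any toxic flag gets label 0, which matches `train_df['clean']` being `False` for that row.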

### Pre-processing : Tokenization
Now we have the training data in two separate dataframe columns (arrays/lists): an ordered array of comments (input for the network) and another array of class labels in the same order (output of the network).

We have to transform this data into network input format and output format. This step is called pre-processing.  
Steps of pre-processing:

1. Tokenize the text into words
2. Assign each word a dimension (an integer index)


To accomplish steps 1 and 2 we will use the built-in __Tokenizer__ class

In [3]:
from keras.preprocessing.text import Tokenizer
# set size of vocabulary
# To Do: try different size 
max_vocab_size = 10000
tokenizer = Tokenizer(num_words=max_vocab_size)
tokenizer.fit_on_texts(train_texts)
sequences = tokenizer.texts_to_sequences(train_texts)
print(sequences[0])

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

Using TensorFlow backend.


[688, 75, 1, 126, 130, 177, 29, 672, 4511, 1116, 86, 331, 51, 2278, 50, 6864, 15, 60, 2756, 148, 7, 2937, 34, 117, 1221, 2825, 4, 45, 59, 244, 1, 365, 31, 1, 38, 27, 143, 73, 3462, 89, 3085, 4583, 2273, 985]
Found 210337 unique tokens.
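
As a rough illustration of steps 1 and 2, the sketch below ranks words by frequency and replaces each word with its rank. It is a simplified stand-in, not the Keras implementation: it only lowercases and splits on whitespace, and the helper names `build_word_index` and `to_sequences` are hypothetical. Dropping every word whose rank is `max_vocab_size` or higher is an assumption about how the vocabulary limit is applied.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency: the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() sorts by descending count; ties keep first-seen order
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index, max_vocab_size):
    """Replace words by their index, dropping words ranked max_vocab_size or higher."""
    return [
        [word_index[w] for w in text.lower().split()
         if w in word_index and word_index[w] < max_vocab_size]
        for text in texts
    ]

texts = ["the cat sat", "the dog sat on the mat"]
word_index = build_word_index(texts)
sequences = to_sequences(texts, word_index, max_vocab_size=4)

print(len(word_index))  # 6
print(sequences)        # [[1, 3, 2], [1, 2, 1]]
```

Note that the word index still holds every word even though the sequences only use the most frequent ones. This is consistent with the output above, where 210337 unique tokens are reported although `max_vocab_size` is 10000.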


### Batching and Padding for Embedding
Now that we have the tokens and each token (word) has a dimension assigned to it, we will do the following steps to create word embeddings:  

3. use these dimension assignments to define an embedding for each individual word
4. use the word embeddings to create a word vector for a comment


We will use a specific type of network layer for this, called the __Embedding Layer__. The tokens generated above (sequences of numbers) are the input to the Embedding layer, which outputs word embeddings to the next layer.  

Input and output of a neural network are processed in batches. A batch is a group of input data items that are fed to the network together. Because the network can process the individual data elements of a batch in parallel, training is faster.

In the case of the Embedding layer, the input and output of a batch look as follows:  

   **Input**: 2D tensor of integers, of shape (# seq. samples in a particular batch, sequence_length), where each entry is a sequence of integers (output of above code).  
   **Output**: 3D floating-point tensor of shape (# seq. samples in a particular batch, sequence_length, embedding_dimensionality).  
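
The shape change from 2D to 3D can be seen in a minimal plain-Python sketch of an embedding lookup. The table sizes and the random initial values are assumptions chosen for illustration; a real Embedding layer learns its table during training.

```python
import random

random.seed(0)
vocab_size, embedding_dim = 10, 4  # illustrative sizes, not the notebook's

# One row of floats per token id (random here; learned in a real layer).
embedding_table = [[random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
                   for _ in range(vocab_size)]

# Input: a batch of 2 sequences, each of length 3 (2D tensor of integers).
batch = [[1, 5, 2],
         [0, 0, 7]]

# Output: each token id is replaced by its row in the table (3D tensor of floats).
embedded = [[embedding_table[token] for token in seq] for seq in batch]

shape = (len(embedded), len(embedded[0]), len(embedded[0][0]))
print(shape)  # (2, 3, 4)
```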

Sequence length can vary from batch to batch, but within a single batch all sequences must have the same length.  

So from the data we have to create batches of sequences of the same length. To do that, we pad or truncate each sequence to a common sequence length. Each batch can then be used as training input for the Embedding layer.  

For this sample run, we take the first 10k of the roughly 160k sequences as training data and pad or truncate all of them to a maximum sequence length of 20 words.
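
Padding and truncating can be sketched in a few lines of plain Python. The helper `pad_or_truncate` is hypothetical; it pads and truncates at the front of the sequence, which is what `pad_sequences` does with its default `padding='pre'` and `truncating='pre'` settings.

```python
def pad_or_truncate(seq, maxlen, value=0):
    """Hypothetical helper: return seq with exactly maxlen items (maxlen >= 1)."""
    seq = seq[-maxlen:]                          # keep the last maxlen tokens
    return [value] * (maxlen - len(seq)) + seq   # left-pad shorter sequences

batch = [[5, 3], [7, 1, 4, 9, 2, 8], [6, 6, 6, 6]]
padded = [pad_or_truncate(s, maxlen=4) for s in batch]
print(padded)  # [[0, 0, 5, 3], [4, 9, 2, 8], [6, 6, 6, 6]]
```

After this step every sequence in the batch has the same length, so the batch can be stored as a 2D tensor.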


In [4]:
from keras import preprocessing
training_sequences = sequences[:10000]
training_labels = train_labels[:10000]
seq_max_len = 20
# training padded sequences
train_seq_pad = preprocessing.sequence.pad_sequences(sequences=training_sequences, maxlen=seq_max_len)

# testing padded sequences
testing_sequences = sequences[10000:11000]
testing_labels = train_labels[10000:11000]
test_seq_pad = preprocessing.sequence.pad_sequences(sequences=testing_sequences, maxlen=seq_max_len)

# To Do: try more training data, try different sequence max length

### Model 2: Embedding => RNN => Output
In Model 2 we extend Model 1 by adding an RNN layer between the Embedding layer and the output layer.

#### Define the model 2
Model 2 is made of 4 layers:

* Layer 0 is input layer
* Layer 1 is Embedding layer
* Layer 2 is RNN layer
* Layer 3 is Dense Layer (output/classification layer)
    
    
**RNN : Recurrent Neural Network**


<img src="img/rnn.png" alt="Recurrent Neural Network" style="width: 300px;"/>  

Source: [Deep Learning with Python, Book by François Chollet](https://www.manning.com/books/deep-learning-with-python)

An RNN is a neural network with the following properties:
 - it processes each element (word) of a sequence (sentence) one by one
 - the output for each intermediate element is fed back in together with the next element
 - the state of the RNN is reset between two independent sequences

The Dense layer receives as input only the output at the end of the sequence.
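
The properties above can be sketched as a forward pass in plain Python. This is a simplified illustration, not the Keras implementation: the weight names `W`, `U`, `b`, the tiny dimensions, and the random (untrained) weights are all assumptions, and the update rule used is the common form h_t = tanh(W x_t + U h_{t-1} + b).

```python
import math
import random

random.seed(0)
input_dim, state_dim, timesteps = 3, 2, 4  # illustrative sizes

def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

W = rand_matrix(state_dim, input_dim)   # input -> state weights
U = rand_matrix(state_dim, state_dim)   # previous state -> state weights
b = [0.0] * state_dim                   # bias

inputs = rand_matrix(timesteps, input_dim)  # one sequence of 4 "word vectors"

def rnn_step(x_t, h_prev):
    """Compute h_t = tanh(W x_t + U h_prev + b) for one element."""
    return [
        math.tanh(
            sum(W[i][j] * x_t[j] for j in range(input_dim))
            + sum(U[i][k] * h_prev[k] for k in range(state_dim))
            + b[i]
        )
        for i in range(state_dim)
    ]

h = [0.0] * state_dim      # state is reset at the start of each sequence
for x_t in inputs:         # elements are processed one by one
    h = rnn_step(x_t, h)   # the previous output is fed back with the next element

final_output = h           # only this last state is passed to the Dense layer
print(len(final_output))   # 2
```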

![Unrolled RNN](img/RNN-unrolled.png)  


Source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

In this model, we don't need a Flatten layer because, by default, the SimpleRNN layer outputs only the last element of the processed sequence (h_t).

In [5]:
from keras.models import Sequential
from keras.layers import Dense, Embedding, SimpleRNN

# model configurations
vocab_size = 10000
seq_max_len = 20 # this can be removed as it is not required for next layer which is RNN
embedding_dim = 16

# model definition
model_2 = Sequential()
model_2.add(Embedding(vocab_size, embedding_dim, input_length=seq_max_len))
model_2.add(SimpleRNN(32))
model_2.add(Dense(1, activation='sigmoid'))
model_2.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])

In [6]:
model_2.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 20, 16)            160000    
_________________________________________________________________
simple_rnn_1 (SimpleRNN)     (None, 32)                1568      
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 33        
Total params: 161,601
Trainable params: 161,601
Non-trainable params: 0
_________________________________________________________________
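
The parameter counts in the summary above can be checked by hand. The sketch below assumes the usual layout of these layers: one embedding vector per vocabulary entry, and for the RNN and Dense layers one weight per input plus one bias per unit.

```python
vocab_size, embedding_dim, rnn_units = 10000, 16, 32

# Embedding: one vector of size embedding_dim per vocabulary entry
embedding_params = vocab_size * embedding_dim

# SimpleRNN: each unit has weights for the input, the previous state, and a bias
rnn_params = rnn_units * (embedding_dim + rnn_units + 1)

# Dense(1): one weight per RNN unit plus one bias
dense_params = rnn_units * 1 + 1

total_params = embedding_params + rnn_params + dense_params
print(embedding_params, rnn_params, dense_params, total_params)
# 160000 1568 33 161601
```

Almost all parameters sit in the Embedding layer, so the vocabulary size dominates the size of this model.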


#### Train the model 2

In [7]:
history_2 = model_2.fit(train_seq_pad, training_labels, epochs=10, batch_size=32, validation_split=0.2)

Train on 8000 samples, validate on 2000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


#### Test the model 2

In [8]:
print(model_2.metrics_names)
model_2.evaluate(x=test_seq_pad, y=testing_labels)

['loss', 'acc']


[0.34354388737678526, 0.90500000000000003]

We see that the model above did not achieve good accuracy compared to the much simpler model. We used only a small part of the data (10k of roughly 160k comments), so the training set is very small, and every sequence was padded or truncated to a seq_max_len of only 20 words.

## Exercises
Try to improve performance of the model:

* Sort the comments by length after reading the CSV file, to group comments of similar size in a batch
* Try different vocab sizes during tokenization, for example by setting the size dynamically based on some logic such as selecting the top 90% most frequent words or the words with a frequency above some value
*