# Models

We will cover the following models:
* Embedding => Class
* Embedding => Simple RNN => Class
* Embedding => Bi-directional RNN => Class

### Load data
Load the Toxic Comment Classification Challenge dataset and split it into training, validation, and test sets.
#### Input text for training
1. Get the training data
    * read the csv data file
    * tokenize the data
    * assign a dimension to each word
    * convert into embeddings


#### Read CSV

In [3]:
import csv
data_folder = './data/toxic-comments/'
train_texts = []
with open(data_folder+'train.csv') as train_file:
    reader = csv.DictReader(train_file)
    for row in reader:
        train_texts.append(row['comment_text'])
print(len(train_texts))

159571


#### Output labels for training
2. Get the training labels
    * read the labels and convert them into a single binary label
    * We focus on a 2-class problem: toxic and non-toxic comments
    * All the different types of toxic comment are merged into one toxic label:
        * 0 for toxic comments
        * 1 for non-toxic comments
    * Later we can explore how to make it a multi-class classifier

In [4]:
train_labels = []
toxic_labels = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
with open(data_folder+'train.csv') as train_file:
    reader = csv.DictReader(train_file)
    for row in reader:
        not_toxic = True
        # check for toxic labels
        for label in toxic_labels:
            if(row[label] == '1'):
                train_labels.append(0)
                not_toxic = False
                break
        if not_toxic:
            train_labels.append(1)

print(len(train_labels))    

159571
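
As a quick sanity check of the labelling rule, the same logic can be run on a tiny in-memory CSV. This is only a sketch: the two sample rows and the helper name `binary_label` are made up for illustration, and the column layout is assumed to match `train.csv`.

```python
import csv
import io

toxic_labels = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

def binary_label(row):
    """Return 0 if any toxic column is '1' (toxic), else 1 (non-toxic)."""
    return 0 if any(row[label] == '1' for label in toxic_labels) else 1

# two made-up rows, assumed to use the same columns as train.csv
sample_csv = io.StringIO(
    "id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate\n"
    "a1,thanks for the help,0,0,0,0,0,0\n"
    "a2,some rude text,0,0,1,0,1,0\n"
)
demo_labels = [binary_label(row) for row in csv.DictReader(sample_csv)]
print(demo_labels)  # [1, 0]
```

The first row has no toxic flag set and gets label 1; the second has two flags set and gets label 0.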


### Tokenization
Now we have the training data in two separate arrays: an ordered array of comments (input) and another array of class labels in the same order (output).

We have to transform this data into network input format and output format.
Steps of preprocessing:

1. Tokenize the text into words
2. Assign each word a dimension


To accomplish steps 1 and 2 we will use the built-in Keras Tokenizer.

In [5]:
from keras.preprocessing.text import Tokenizer
max_vocab_size = 10000
tokenizer = Tokenizer(num_words=max_vocab_size)
tokenizer.fit_on_texts(train_texts)
sequences = tokenizer.texts_to_sequences(train_texts)
print(sequences[0])

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

[688, 75, 1, 126, 130, 177, 29, 672, 4511, 1116, 86, 331, 51, 2278, 50, 6864, 15, 60, 2756, 148, 7, 2937, 34, 117, 1221, 2825, 4, 45, 59, 244, 1, 365, 31, 1, 38, 27, 143, 73, 3462, 89, 3085, 4583, 2273, 985]
Found 210337 unique tokens.
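
Note that `word_index` holds every word seen in the text (hence over 200k entries), while `num_words` only limits which indices appear in the sequences. The sketch below imitates that behaviour in plain Python. It is a simplification and not the Keras implementation: the regular expression used for splitting and the helper names `fit_word_index` and `to_sequences` are assumptions made for illustration.

```python
import re
from collections import Counter

def split_words(text):
    # simplified splitting: lowercase, keep runs of letters, digits and apostrophes
    return re.findall(r"[a-z0-9']+", text.lower())

def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter()
    for text in texts:
        counts.update(split_words(text))
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index, num_words):
    """Map words to indices and drop every word whose index is >= num_words."""
    return [
        [word_index[w] for w in split_words(text) if word_index.get(w, num_words) < num_words]
        for text in texts
    ]

demo_texts = ["the cat sat", "the cat ran", "the dog"]
demo_index = fit_word_index(demo_texts)
print(len(demo_index))                                   # 5
print(to_sequences(demo_texts, demo_index, num_words=3)) # [[1, 2], [1, 2], [1]]
```

With `num_words=3` only the two most frequent words survive, while the index itself still lists all five words.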


### Batching and Preprocessing (padding) for Embedding
Once we have the tokens, we take the following steps to create word embeddings:  

3. Use this dimension assignment to define the embedding
4. Use the word embeddings to create the word vectors for a comment


We will use a specific type of layer for this, called the Embedding layer. The token sequences generated above are the input to the Embedding layer, which passes word embeddings on to the next layer:  

   **Input**: 2D tensor of integers, of shape (samples, sequence_length), where each entry is a sequence of integers (output of above code).  
   **Output**: 3D floating-point tensor of shape (samples, sequence_length, embedding_dimensionality).  
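
To make the two shapes concrete, here is a small sketch of an embedding lookup in plain Python. It assumes the layer behaves like a table with one row per word index; the helper names and the random initial values are illustrative only, and a real Embedding layer learns its table during training.

```python
import random

def make_embedding_table(vocab_size, embedding_dim, seed=0):
    """One row of embedding_dim floats per word index (random here, learned in practice)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
            for _ in range(vocab_size)]

def embed(batch, table):
    """Replace every integer in the batch by its row of the table."""
    return [[table[token_id] for token_id in seq] for seq in batch]

def shape(nested):
    """Shape of a regular nested list, e.g. (samples, sequence_length, ...)."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)

table = make_embedding_table(vocab_size=10, embedding_dim=4)
batch = [[0, 3, 7], [2, 2, 9]]   # (samples=2, sequence_length=3)
vectors = embed(batch, table)
print(shape(batch))    # (2, 3)
print(shape(vectors))  # (2, 3, 4)
```

The same index always maps to the same vector, so the two occurrences of index 2 in the second sample share one embedding.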

The sequence length can vary from batch to batch, but within a single batch all sequences must have the same length.  

So we have to pad or truncate the sequences so that all sequences in a batch share one length. Each batch can then be used as a training input for the Embedding layer.  

For this sample case we take the first 10k of the 160k sequences for training and pad or truncate all of them to a maximum sequence length of 20 words.


In [6]:
sample_sequences = sequences[:10000]
sample_labels = train_labels[:10000]
seq_max_len = 20

from keras import preprocessing

train_seq_pad = preprocessing.sequence.pad_sequences(sequences=sample_sequences, maxlen=seq_max_len)


In [7]:
print(train_seq_pad[1])

[   0    0    0   52 2635   13  555 3809   73 4556 2706   21   94   38
  803 2679  992  589 8377  182]
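
The leading zeros in the output above come from the defaults of `pad_sequences`, which pads and truncates at the front of a sequence. The sketch below reproduces that default behaviour in plain Python; `pad_pre` is a hypothetical helper written for illustration, not part of Keras, and it assumes `maxlen` is at least 1.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences with `value` and keep only the last `maxlen` items of long ones."""
    padded = []
    for seq in sequences:
        tail = list(seq)[-maxlen:]                      # truncate at the front, keep the tail
        padded.append([value] * (maxlen - len(tail)) + tail)
    return padded

print(pad_pre([[1, 2, 3], [1, 2, 3, 4, 5, 6]], maxlen=4))
# [[0, 1, 2, 3], [3, 4, 5, 6]]
```

Because truncation also happens at the front, a long comment is represented by its last 20 tokens, not its first 20.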


In [8]:
test_sequences = sequences[10000:11000]
test_labels = train_labels[10000:11000]
# pad the test data to the same length as the training data (seq_max_len = 20)
test_seq_pad = preprocessing.sequence.pad_sequences(sequences=test_sequences, maxlen=seq_max_len)

### Model 1: Embedding => Class

#### Define the model 1
Model 1 is made of 2 layers (with a weight-free Flatten step between them):

- Layer 1 is Embedding layer
- Layer 2 is classification (Dense) layer

In [14]:
from keras.models import Sequential
from keras.layers import Flatten, Dense, Embedding

model1 = Sequential()
# layer 1: add an embedding layer
vocab_size = 10000 # no. of words kept in the vocabulary; each word in the vocab is assigned an index (dimension)
embedding_dim = 8 # dimension of the word embedding, i.e. the size of the vector this layer outputs per word
max_len = 20 # max length of a single input, i.e. the number of word indices per comment after padding
model1.add(Embedding(vocab_size, embedding_dim, input_length=max_len))
# input to the above layer has shape: [batch_size, max_len]
# output of the above layer has shape: [batch_size, max_len, embedding_dim]
# flatten: reshape the input of shape [batch_size, max_len, embedding_dim]
#          to an output of shape [batch_size, max_len*embedding_dim]
model1.add(Flatten())
# layer 2: Dense layer - every node of the previous layer is connected to every node of this layer
#          it has 1 unit/node for classification, with a sigmoid activation for the 2 classes
model1.add(Dense(1, activation='sigmoid'))
# compile: configure the model for training
#   optimizer: the method used to update the network weights iteratively,
#              generally a variant of stochastic gradient descent (SGD)
#   loss: the (objective) function that will be minimised
#   metrics: used to measure the performance of the network
model1.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
# summary: print the layers with their output shapes and parameter counts
model1.summary()
# fit: trains the network for a fixed no. of epochs
history1 = model1.fit(train_seq_pad, sample_labels, epochs=10, batch_size=32, validation_split=0.2)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 20, 8)             80000     
_________________________________________________________________
flatten_2 (Flatten)          (None, 160)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 161       
Total params: 80,161
Trainable params: 80,161
Non-trainable params: 0
_________________________________________________________________
Train on 8000 samples, validate on 2000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
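
The parameter counts in the summary can be checked by hand. The arithmetic below is a small sketch using the same configuration values as model 1; the variable names are chosen for illustration.

```python
vocab_size, embedding_dim, max_len = 10000, 8, 20

# Embedding: one vector of embedding_dim weights per word in the vocabulary
embedding_params = vocab_size * embedding_dim

# Flatten has no weights; it only reshapes (max_len, embedding_dim) into one vector
flatten_units = max_len * embedding_dim

# Dense(1): one weight per flattened input plus one bias
dense_params = flatten_units * 1 + 1

print(embedding_params)                 # 80000
print(flatten_units)                    # 160
print(dense_params)                     # 161
print(embedding_params + dense_params)  # 80161
```

Almost all of the weights sit in the embedding table, so the vocabulary size dominates the size of this model.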


#### Test the model 1

We will take a small test set from the unused part of the training data to test our basic model.

In [15]:
print(model1.metrics_names)
model1.evaluate(x=test_seq_pad, y=test_labels)

['loss', 'acc']


[0.1986279919743538, 0.928]

Ref: Listing 6.7 Deep Learning with Python book  

//todo explain above code and add network diagram  
The `model1.evaluate` method is used to evaluate the model. We pass it the test data in the same format as the training data, together with the test labels to compare the predictions against. It returns the values in the order given by `metrics_names`, here the loss followed by the accuracy.
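
To give a feel for the two numbers, the sketch below computes a binary cross-entropy loss and an accuracy at a 0.5 threshold in plain Python. It is a simplified illustration, not the Keras code; the probabilities are made-up values, and `majority_baseline` is a hypothetical helper showing the accuracy of always predicting the most common label.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -(y*log(p) + (1-y)*log(1-p)), with p clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded prediction equals the label."""
    correct = sum(int(p > threshold) == y for y, p in zip(y_true, y_prob))
    return correct / len(y_true)

def majority_baseline(y_true):
    """Accuracy reached by always predicting the most frequent label."""
    ones = sum(y_true)
    return max(ones, len(y_true) - ones) / len(y_true)

y_true = [1, 1, 1, 0]          # made-up labels
y_prob = [0.9, 0.8, 0.4, 0.3]  # made-up sigmoid outputs
print(binary_crossentropy(y_true, y_prob))
print(accuracy(y_true, y_prob))    # 0.75
print(majority_baseline(y_true))   # 0.75
```

When one label is much more common than the other, it is worth comparing a model's accuracy with this baseline before reading too much into it.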


### Model 2: Embedding => RNN => Output
In model 2 we extend model 1 by adding an RNN layer between the Embedding layer and the output layer.

#### Define the model 2
Model 2 is made of 3 layers:

- Layer 1 is Embedding layer
- Layer 2 is RNN layer
- Layer 3 is classification (Dense) layer 

In [16]:
from keras.models import Sequential
from keras.layers import Dense, Embedding, SimpleRNN

# model configurations
vocab_size = 10000
seq_max_len = 20 # this can be removed as it is not required for next layer which is RNN
embedding_dim = 16

# model definition
model2 = Sequential()
model2.add(Embedding(vocab_size, embedding_dim, input_length=seq_max_len))
model2.add(SimpleRNN(32))
model2.add(Dense(1, activation='sigmoid'))
model2.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
model2.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 20, 16)            160000    
_________________________________________________________________
simple_rnn_1 (SimpleRNN)     (None, 32)                1568      
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 33        
Total params: 161,601
Trainable params: 161,601
Non-trainable params: 0
_________________________________________________________________
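
The 1568 parameters of the SimpleRNN layer follow from its three weight groups: input weights, recurrent weights and a bias per unit. The check below is a sketch; the helper names are made up for illustration.

```python
def simple_rnn_params(input_dim, units):
    """units * (input weights + recurrent weights + 1 bias)."""
    return units * (input_dim + units + 1)

def dense_params(input_dim, units):
    """One weight per input plus one bias, for every unit."""
    return units * (input_dim + 1)

embedding = 10000 * 16             # vocab_size * embedding_dim
rnn = simple_rnn_params(16, 32)    # input is the 16-dim embedding
dense = dense_params(32, 1)        # input is the 32-dim final RNN state

print(rnn)                      # 1568
print(dense)                    # 33
print(embedding + rnn + dense)  # 161601
```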


#### Train the model 2

In [17]:
history2 = model2.fit(train_seq_pad, sample_labels, epochs=10, batch_size=32, validation_split=0.2)

Train on 8000 samples, validate on 2000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


#### Test the model 2

In [18]:
print(model2.metrics_names)
model2.evaluate(x=test_seq_pad, y=test_labels)

['loss', 'acc']


[0.4412386049926281, 0.878]

Model 2 was less accurate than the much simpler model 1 (0.878 vs. 0.928). We trained on only a small part of the data, and every comment was cut down to 20 tokens for both training and testing, which may leave the RNN too little context to work with.


The model above used only the final output of the RNN layer, not its intermediate outputs. We can extend it by adding more RNN layers in between, with each intermediate layer passing its full output sequence to the next one.

#### Extended model 2
Extended model 2 is made of 5 layers:

- Layer 1 is Embedding layer
- Layer 2 is RNN layer (return full sequence)
- Layer 3 is RNN layer (return full sequence)
- Layer 4 is RNN layer (return last output)
- Layer 5 is classification (Dense) layer 


In [21]:
model2ext = Sequential()
model2ext.add(Embedding(vocab_size, embedding_dim))
# intermediate RNN layers return the output of every time step,
# so that the outputs form a sequence which the next RNN layer can process
model2ext.add(SimpleRNN(32, return_sequences=True))
model2ext.add(SimpleRNN(64, return_sequences=True))
# the final RNN layer does not return the sequence but only its last output,
# which is used by the next non-RNN layer, i.e. the Dense layer in this case
model2ext.add(SimpleRNN(32))
model2ext.add(Dense(1, activation='sigmoid'))
model2ext.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
model2ext.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (None, None, 16)          160000    
_________________________________________________________________
simple_rnn_2 (SimpleRNN)     (None, None, 32)          1568      
_________________________________________________________________
simple_rnn_3 (SimpleRNN)     (None, None, 64)          6208      
_________________________________________________________________
simple_rnn_4 (SimpleRNN)     (None, 32)                3104      
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 33        
Total params: 170,913
Trainable params: 170,913
Non-trainable params: 0
_________________________________________________________________
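
The difference between returning the full sequence and returning only the last output can be seen in a tiny forward pass. The sketch below is a plain-Python simple RNN with `tanh` activation and random weights; it is an illustration under those assumptions, not the Keras implementation, and all names in it are made up.

```python
import math
import random

def simple_rnn_forward(inputs, W, U, b, return_sequences=False):
    """Run h_t = tanh(W x_t + U h_(t-1) + b) over a list of time-step vectors."""
    units = len(b)
    h = [0.0] * units
    outputs = []
    for x in inputs:
        h = [
            math.tanh(
                sum(W[j][i] * x[i] for i in range(len(x)))
                + sum(U[j][k] * h[k] for k in range(units))
                + b[j]
            )
            for j in range(units)
        ]
        outputs.append(h)
    # full sequence for a following RNN layer, last state for a Dense layer
    return outputs if return_sequences else outputs[-1]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
input_dim, units, timesteps = 4, 3, 5
W = random_matrix(units, input_dim, rng)   # input weights
U = random_matrix(units, units, rng)       # recurrent weights
b = [0.0] * units                          # bias
sequence = random_matrix(timesteps, input_dim, rng)

all_states = simple_rnn_forward(sequence, W, U, b, return_sequences=True)
last_state = simple_rnn_forward(sequence, W, U, b)
print(len(all_states), len(all_states[0]))  # 5 3
print(len(last_state))                      # 3
```

This mirrors the summary above: layers with `return_sequences=True` keep the time dimension, while the last RNN layer reduces it to a single vector.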


#### Train the ext. model 2

In [22]:
history2ext = model2ext.fit(train_seq_pad, sample_labels, epochs=10, batch_size=32, validation_split=0.2)

Train on 8000 samples, validate on 2000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


#### Test the ext. model 2

In [23]:
print(model2ext.metrics_names)
model2ext.evaluate(x=test_seq_pad, y=test_labels)

['loss', 'acc']


[0.6179015170745552, 0.893]

### Model 3: Embedding => Bidirectional RNN => Output
In model 3 we extend model 2 by wrapping the RNN layer in a Bidirectional wrapper.

#### Define the model 3
Model 3 is made of 3 layers:

- Layer 1 is Embedding layer
- Layer 2 is Bidirectional RNN layer (return last output)
- Layer 3 is classification (Dense) layer 

In [8]:
from keras.models import Sequential
from keras.layers import Dense, Embedding, SimpleRNN, Bidirectional

# model configurations
vocab_size = 10000
seq_max_len = 20 # this can be removed as it is not required for next layer which is RNN
embedding_dim = 16

# model definition
model3 = Sequential()
model3.add(Embedding(vocab_size, embedding_dim, input_length=seq_max_len))
# [1] This creates two copies of the recurrent layer:
# one is fed the input sequence as-is and the other a reversed copy of the input sequence.
# By default, the outputs of these two SimpleRNNs are concatenated.
model3.add(Bidirectional(SimpleRNN(32)))
model3.add(Dense(1, activation='sigmoid'))
model3.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
model3.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 20, 16)            160000    
_________________________________________________________________
bidirectional_2 (Bidirection (None, 64)                3136      
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 65        
Total params: 163,201
Trainable params: 163,201
Non-trainable params: 0
_________________________________________________________________
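
The idea behind the wrapper can be shown with a toy recurrence. In the sketch below a decayed running sum stands in for the RNN; this is a deliberately simplified stand-in with no learned weights, whereas the real wrapper trains separate weights for each direction. The function names are made up for illustration.

```python
def decayed_sum_rnn(sequence, decay=0.5):
    """Toy recurrence h = decay * h + x; returns the final state."""
    h = 0.0
    for x in sequence:
        h = decay * h + x
    return h

def bidirectional(sequence):
    """Run the recurrence forwards and on the reversed sequence, then concatenate."""
    forward = decayed_sum_rnn(sequence)
    backward = decayed_sum_rnn(list(reversed(sequence)))
    return [forward, backward]

print(bidirectional([1.0, 2.0, 3.0]))  # [4.25, 2.75]
```

The forward state is dominated by the end of the sequence and the backward state by its start, so the concatenated output sees both. With `SimpleRNN(32)` each direction contributes 32 values, which gives the output width of 64 in the summary.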


#### Train model 3

In [9]:
history3 = model3.fit(train_seq_pad, sample_labels, epochs=10, batch_size=32, validation_split=0.2)

Train on 8000 samples, validate on 2000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


#### Testing model 3

In [13]:
print(model3.metrics_names)
model3.evaluate(x=test_seq_pad, y=test_labels)

['loss', 'acc']


[0.3804259589314461, 0.911]

As with model 2, model 3 can be extended by adding more bidirectional layers in between.  

#### Extended model 3
Extended model 3 is made of 5 layers:

- Layer 1 is Embedding layer
- Layer 2 is Bidirectional RNN layer (return full sequence)
- Layer 3 is Bidirectional RNN layer (return full sequence)
- Layer 4 is Bidirectional RNN layer (return last output)
- Layer 5 is classification (Dense) layer 


In [1]:
from keras.models import Sequential
from keras.layers import Dense, Embedding, SimpleRNN, Bidirectional

# model configurations
vocab_size = 10000
seq_max_len = 20 # this can be removed as it is not required for next layer which is RNN
embedding_dim = 16

Using TensorFlow backend.


In [2]:
# model definition
model3ext = Sequential()
model3ext.add(Embedding(vocab_size, embedding_dim, input_length=seq_max_len))
model3ext.add(Bidirectional(SimpleRNN(32, return_sequences=True)))
model3ext.add(Bidirectional(SimpleRNN(64, return_sequences=True)))
model3ext.add(Bidirectional(SimpleRNN(32)))
model3ext.add(Dense(1, activation='sigmoid'))
model3ext.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
model3ext.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 20, 16)            160000    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 20, 64)            3136      
_________________________________________________________________
bidirectional_2 (Bidirection (None, 20, 128)           16512     
_________________________________________________________________
bidirectional_3 (Bidirection (None, 64)                10304     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 65        
Total params: 190,017
Trainable params: 190,017
Non-trainable params: 0
_________________________________________________________________
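
The parameter counts of the stacked bidirectional layers can be reproduced by hand. The point to notice is that each bidirectional layer outputs twice its number of units, and that doubled width is the input size of the next layer. The helper names below are made up for this sketch.

```python
def simple_rnn_params(input_dim, units):
    """units * (input weights + recurrent weights + 1 bias)."""
    return units * (input_dim + units + 1)

def bidirectional_params(input_dim, units):
    """Two independent copies of the wrapped RNN."""
    return 2 * simple_rnn_params(input_dim, units)

embedding_dim = 16
layer_units = [32, 64, 32]

input_dim = embedding_dim
counts = []
for units in layer_units:
    counts.append(bidirectional_params(input_dim, units))
    input_dim = 2 * units          # forward and backward outputs are concatenated

dense = input_dim * 1 + 1          # Dense(1) on the 64-dim output
total = 10000 * embedding_dim + sum(counts) + dense

print(counts)  # [3136, 16512, 10304]
print(dense)   # 65
print(total)   # 190017
```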


#### Train ext. model 3

In [9]:
history3ext = model3ext.fit(train_seq_pad, sample_labels, epochs=10, batch_size=32, validation_split=0.2)

Train on 8000 samples, validate on 2000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


#### Test ext. model 3

In [10]:
print(model3ext.metrics_names)
model3ext.evaluate(x=test_seq_pad, y=test_labels)

['loss', 'acc']


[0.8440574564933777, 0.862]

### Plotting the above results

//ToDo: train the above m


In [None]:
import matplotlib.pyplot
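
One possible way to compare the models is to plot the validation accuracy per epoch from the `history` objects returned by `fit`. The sketch below works on plain dictionaries and assumes that they contain a `val_acc` key, which depends on the metric name passed to `compile`. The helper names and the placeholder numbers in the demo are illustrative only and are not results from the models above.

```python
def best_epoch(hist, metric='val_acc'):
    """Return (epoch number starting at 1, value) of the best recorded value."""
    values = hist[metric]
    best = max(range(len(values)), key=values.__getitem__)
    return best + 1, values[best]

def plot_histories(histories, metric='val_acc'):
    """Draw one curve per model; `histories` maps a model name to a history dict."""
    import matplotlib.pyplot as plt  # imported here so best_epoch works without matplotlib
    for name, hist in histories.items():
        epochs = range(1, len(hist[metric]) + 1)
        plt.plot(epochs, hist[metric], label=name)
    plt.xlabel('epoch')
    plt.ylabel(metric)
    plt.legend()
    plt.show()

# placeholder values, only to show the expected data layout
placeholder = {'val_acc': [0.5, 0.7, 0.6]}
print(best_epoch(placeholder))  # (2, 0.7)
```

With the trained models it could be called as `plot_histories({'model 1': history1.history, 'model 2': history2.history})`.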

##### Ref.:
1. https://machinelearningmastery.com/develop-bidirectional-lstm-sequence-classification-python-keras/
