# Models

We will cover the following models:
* Embedding => Class
* Embedding => Simple RNN => Class
* Embedding => Bi-directional RNN => Class

### Load data
Load the Toxic Comment Classification Challenge dataset
and split it into training, validation, and test sets.
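
As a rough illustration of the split described above, here is a minimal sketch in plain Python. The helper name `split_dataset` and the 10% validation and test fractions are assumptions made for illustration; the notebook itself simply slices the data and relies on Keras' `validation_split`.

```python
import random

def split_dataset(texts, labels, val_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle (text, label) pairs together and cut them into three parts.

    Hypothetical helper: the fractions and the seed are illustrative defaults.
    """
    pairs = list(zip(texts, labels))
    random.Random(seed).shuffle(pairs)  # shuffle pairs so texts and labels stay aligned
    n_test = int(len(pairs) * test_fraction)
    n_val = int(len(pairs) * val_fraction)
    test = pairs[:n_test]
    val = pairs[n_test:n_test + n_val]
    train = pairs[n_test + n_val:]
    return train, val, test

# toy usage with 10 fake comments
toy_texts = ["comment %d" % i for i in range(10)]
toy_labels = [0, 1] * 5
train, val, test = split_dataset(toy_texts, toy_labels)
print(len(train), len(val), len(test))
```
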
#### Training input text
1. Get the training data
    * read the csv data file
    * tokenize the data
    * assign a dimension to each word
    * convert into embeddings


In [10]:
import csv
data_folder = './data/toxic-comments/'
train_texts = []
with open(data_folder+'train.csv') as train_file:
    reader = csv.DictReader(train_file)
    for row in reader:
        train_texts.append(row['comment_text'])
print(len(train_texts))

159571


#### Training output labels
2. Get the training labels
    * read the labels and convert them into binary labels
    * We will focus on a 2-class problem: toxic and non-toxic comments
    * We will map all the different types of toxic comments to a single toxic label:
        * 0 for toxic comments
        * 1 for non-toxic comments
    * Later we can explore how to make it a multiclass classifier

In [11]:
train_labels = []
toxic_labels = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
with open(data_folder+'train.csv') as train_file:
    reader = csv.DictReader(train_file)
    for row in reader:
        not_toxic = True
        # check for toxic labels
        for label in toxic_labels:
            if(row[label] == '1'):
                train_labels.append(0)
                not_toxic = False
                break
        if not_toxic:
            train_labels.append(1)

print(len(train_labels))    

159571
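
The labelling rule can also be written as a small function, which makes the convention (0 = toxic, 1 = non-toxic) easy to check on a single row. The sketch below is only an illustration: the helper names `binary_label` and `multi_label` are hypothetical, and `multi_label` shows one possible starting point for the multi-label extension mentioned above, not something the notebook implements.

```python
TOXIC_LABELS = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

def binary_label(row):
    """Return 0 if any toxicity column is '1', otherwise 1 (the notebook's convention)."""
    return 0 if any(row[name] == '1' for name in TOXIC_LABELS) else 1

def multi_label(row):
    """Return one 0/1 flag per toxicity column (hypothetical multi-label target)."""
    return [int(row[name]) for name in TOXIC_LABELS]

# csv.DictReader yields string values, so the toy rows use '0' and '1'
clean_row = {name: '0' for name in TOXIC_LABELS}
toxic_row = dict(clean_row, toxic='1', insult='1')

print(binary_label(clean_row), binary_label(toxic_row))
print(multi_label(toxic_row))
```
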


#### Tokenization
We now have the training data in two separate arrays: an ordered array of comments (input) and an array of class labels in the same order (output).

We have to transform this data into the network's input and output formats.
Preprocessing steps:

1. Tokenize the text into words
2. Assign each word a dimension


To accomplish steps 1 and 2 we will use the built-in Keras Tokenizer

In [12]:
from keras.preprocessing.text import Tokenizer
max_vocab_size = 10000
tokenizer = Tokenizer(num_words=max_vocab_size)
tokenizer.fit_on_texts(train_texts)
sequences = tokenizer.texts_to_sequences(train_texts)
print(sequences[0])

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

[688, 75, 1, 126, 130, 177, 29, 672, 4511, 1116, 86, 331, 51, 2278, 50, 6864, 15, 60, 2756, 148, 7, 2937, 34, 117, 1221, 2825, 4, 45, 59, 244, 1, 365, 31, 1, 38, 27, 143, 73, 3462, 89, 3085, 4583, 2273, 985]
Found 210337 unique tokens.
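
To make the two steps concrete, here is a simplified sketch of the idea behind the Tokenizer in plain Python. It is not the Keras implementation: the real Tokenizer also strips punctuation, while this version only lowercases and splits on whitespace, and the function names are hypothetical. It does illustrate why the word index can hold far more entries than `num_words` while the sequences only contain indices below `num_words`.

```python
from collections import Counter

def build_word_index(texts):
    """Assign index 1 to the most frequent word, 2 to the next, and so on.

    Index 0 is left unused so that it can serve as the padding value later.
    """
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index, num_words):
    """Replace each word by its index, dropping words whose index is >= num_words."""
    return [
        [word_index[w] for w in text.lower().split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]

toy_texts = ["the cat sat", "the cat", "the"]
toy_index = build_word_index(toy_texts)
toy_sequences = texts_to_sequences(toy_texts, toy_index, num_words=3)
print(toy_index)
print(toy_sequences)
```
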


#### Batching and Preprocessing (padding) for Embedding
Now that we have the tokens, we will take the following steps to create word embeddings:  

3. Use this dimension assignment to define the embedding
4. Use the word embeddings to create word vectors for a comment


We will use a dedicated layer type for this, the Embedding layer. The token indices generated above are the input to the Embedding layer, which outputs word embeddings to the next layer:  

**Input**: 2D tensor of integers of shape (samples, sequence_length), where each row is a sequence of integers (the output of the code above).  
**Output**: 3D floating-point tensor of shape (samples, sequence_length, embedding_dimensionality).  
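
Conceptually, an embedding layer is a lookup table with one row per word index. The following sketch uses plain Python lists and random values to show how a 2D batch of integers becomes a 3D structure of floats; the function names and the toy sizes are assumptions for illustration, and a real layer learns the table values during training.

```python
import random

def make_embedding_table(vocab_size, embedding_dim, seed=0):
    """Create a vocab_size x embedding_dim table of small random floats."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
            for _ in range(vocab_size)]

def embed(batch, table):
    """Look up the vector of every index: (samples, seq_len) -> (samples, seq_len, dim)."""
    return [[table[index] for index in sequence] for sequence in batch]

toy_table = make_embedding_table(vocab_size=10, embedding_dim=4)
toy_batch = [[1, 2, 3], [4, 0, 0]]  # 2 samples, sequence_length 3
toy_output = embed(toy_batch, toy_table)
print(len(toy_output), len(toy_output[0]), len(toy_output[0][0]))
```
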

Sequence length may differ between batches, but within a single batch every sequence must have the same length.  

So we have to group the data into batches of sequences and then pad or truncate each sequence so that all sequences in a batch have the same length. Each batch can then be used as training input for the embedding layer.  

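For this sample case we take the first 10k of the roughly 160k sequences as training data and use a maximum sequence length of 20 words, which matches the `input_length` of the model defined below.


In [19]:
from keras import preprocessing

sample_sequences = sequences[:10000]
sample_labels = train_labels[:10000]
max_len = 20  # maximum number of tokens kept per comment

# pad (or truncate) every sequence to exactly max_len tokens
train_seq_pad = preprocessing.sequence.pad_sequences(sequences=sample_sequences, maxlen=max_len)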


In [22]:
print(train_seq_pad[1])

[   0    0    0   52 2635   13  555 3809   73 4556 2706   21   94   38
  803 2679  992  589 8377  182]
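
The leading zeros in the output above come from the default behaviour of `pad_sequences`, which pads at the front (`padding='pre'`) and, for sequences that are too long, also drops tokens from the front (`truncating='pre'`). A minimal pure-Python sketch of that behaviour for a single sequence, using the hypothetical helper name `pad_or_truncate` and assuming `maxlen >= 1`:

```python
def pad_or_truncate(sequence, maxlen, value=0):
    """Mimic the default 'pre' padding and 'pre' truncation for one sequence.

    Assumes maxlen >= 1.
    """
    kept = list(sequence)[-maxlen:]                  # keep only the last maxlen tokens
    return [value] * (maxlen - len(kept)) + kept     # fill the front with the pad value

short_example = pad_or_truncate([52, 2635, 13], maxlen=5)
long_example = pad_or_truncate([1, 2, 3, 4, 5, 6], maxlen=4)
print(short_example)
print(long_example)
```
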


### Model 1: Embedding => Class

#### Define the model
Our model has two trainable layers: an Embedding layer and a Dense classification layer.
A parameter-free Flatten layer sits between them to reshape the embedding output.

In [23]:
from keras.models import Sequential
from keras.layers import Flatten, Dense, Embedding

model = Sequential()
# layer 0: add an embedding layer:
vocab_size = 10000 # no. of word indices the layer accepts; matches max_vocab_size used by the tokenizer
embedding_dim = 8 # dimension of each word vector, output of this layer
max_len = 20 # length of a single input sequence, i.e. no. of tokens per comment after padding/truncation
model.add(Embedding(vocab_size, embedding_dim, input_length=max_len))
# input to above layer will be data of shape: [batch_size, max_len]
# output of above layer will be data of shape: [batch_size, max_len, embedding_dim]
# layer 1: flatten the input of shape [batch_size, max_len, embedding_dim]
#          to an output of shape [batch_size, max_len*embedding_dim]
model.add(Flatten())
# layer 2: Dense layer - every node of the previous layer is connected to every node of this layer
#          this has 1 unit/node for classification; and activation for 2 classes: sigmoid
model.add(Dense(1, activation='sigmoid'))
# compile: configure the model for training
#   optimizer: the method used to update the network, generally a variant of stochastic gradient descent (SGD)
#              this method is used iteratively to update the network weights
#   loss: the (objective) function that will be minimised
#   metrics: used to measure the performance of the network
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
# summary: print every layer with its output shape and parameter count
model.summary()
# fit: trains the network for a fixed no. of epochs
history = model.fit(train_seq_pad, sample_labels, epochs=10, batch_size=32, validation_split=0.2)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 20, 8)             80000     
_________________________________________________________________
flatten_2 (Flatten)          (None, 160)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 161       
Total params: 80,161
Trainable params: 80,161
Non-trainable params: 0
_________________________________________________________________
Train on 8000 samples, validate on 2000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
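
The parameter counts in the summary above can be checked by hand. The short calculation below reproduces them from the values used in the model definition; the variable names other than `vocab_size`, `embedding_dim`, and `max_len` are chosen here for illustration.

```python
vocab_size, embedding_dim, max_len = 10000, 8, 20

# Embedding: one vector of size embedding_dim per word index
embedding_params = vocab_size * embedding_dim

# Flatten: no weights, it only reshapes (max_len, embedding_dim) into one vector
flatten_units = max_len * embedding_dim

# Dense(1): one weight per flattened input plus one bias
dense_params = flatten_units * 1 + 1

total_params = embedding_params + dense_params
print(embedding_params, flatten_units, dense_params, total_params)
```
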


### Test Model 1

In [26]:
test_sequences = sequences[10000:11000]
test_labels = train_labels[10000:11000]
# pad/truncate with the same max_len (20) the model was trained with
test_seq_pad = preprocessing.sequence.pad_sequences(sequences=test_sequences, maxlen=max_len)

print(model.metrics_names)
model.evaluate(x=test_seq_pad, y=test_labels)


['loss', 'acc']


[0.1994455663561821, 0.927]
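
The second value returned by `evaluate` is the accuracy: with 1000 test comments, 0.927 corresponds to 927 correct predictions. As a rough sketch of how such a number is obtained, the sigmoid outputs can be thresholded at 0.5 and compared with the labels. The helper below is illustrative only; the 0.5 threshold is an assumption about the metric's default, and the probabilities are made-up toy values, not model outputs.

```python
def binary_accuracy(probabilities, labels, threshold=0.5):
    """Fraction of samples whose thresholded prediction equals the label."""
    predictions = [1 if p > threshold else 0 for p in probabilities]
    correct = sum(int(pred == label) for pred, label in zip(predictions, labels))
    return correct / len(labels)

# toy values: label 1 = non-toxic, label 0 = toxic (the notebook's convention)
toy_probabilities = [0.9, 0.2, 0.6, 0.4]
toy_labels = [1, 0, 0, 0]
toy_accuracy = binary_accuracy(toy_probabilities, toy_labels)
print(toy_accuracy)
```
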

Ref: Listing 6.7 Deep Learning with Python  

TODO: explain the code above and add a network diagram.

### Model 2: Embedding => RNN => Output
Model 2 extends Model 1 by adding an RNN layer between the Embedding layer and the output layer.