# Model 1

Model 1 can be represented as follows:
* Input => Embedding => Class

### Steps for training:
* Load Data
    * Train Data
* Pre-processing: Tokenization
* Batching and Padding
* Model definition
* Training and validation
* Evaluation
* Exercise

### Load Data
Load the Toxic Comment Classification Challenge dataset
and split it into training, validation, and test sets.

#### Data for training
For training, we need the data as pairs of a comment and its corresponding output label:
1. __Input data:__ Wikipedia comments
2. __Output label:__ whether the comment is toxic or not


#### Read CSV
* read the csv data file using pandas

In [1]:
import pandas as pd
train_csv = './storage/dataset/train.csv'
train_df = pd.read_csv(train_csv)
# To Do: sort the df based on size of comments (no. of words in comment)
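
The To Do above could be approached as in the following minimal sketch. It uses a tiny hypothetical DataFrame in place of `train.csv`; the `word_count` column name is an assumption made here for illustration, and the number of words is approximated by splitting on whitespace.

```python
import pandas as pd

# Hypothetical stand-in for train_df; the real data comes from train.csv
toy_df = pd.DataFrame({
    "comment_text": [
        "this is a longer comment with many words",
        "short one",
        "a medium sized comment",
    ]
})

# Approximate the number of words by splitting on whitespace
toy_df["word_count"] = toy_df["comment_text"].str.split().str.len()

# Sort so that comments of similar length end up next to each other
sorted_df = toy_df.sort_values("word_count", kind="stable").reset_index(drop=True)
print(sorted_df)
```

Sorting this way keeps comments of similar length close together, so less padding is needed when they are later grouped into batches.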

#### Training Data Preparation
* read the labels and convert them into a single binary label
* we will focus on a 2-class problem: toxic and non-toxic comments
* we will merge all the different types of toxic comments into a single toxic category and label the data as:
    * 0 for toxic comment
    * 1 for non-toxic comments
* later we can explore how to make it multiclass classifier

In [2]:
# each toxic class is labelled as 1
toxic_row_sums = train_df.iloc[:,2:].sum(axis=1)
# if sum of toxic class is 0 then it is a clean comment
train_df['clean'] = (toxic_row_sums==0)
# Input Data
train_texts = train_df['comment_text']
# Output Label
train_labels = train_df['clean']
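
To see what the label derivation does, here is a small sketch on made-up rows. The DataFrame below is hypothetical and only mimics the layout assumed by `iloc[:, 2:]` (two leading columns followed by one 0/1 column per toxicity type); only two toxicity columns are included to keep it short.

```python
import pandas as pd

# Hypothetical rows mimicking the layout of train.csv:
# id, comment_text, then one 0/1 column per toxicity type
toy_df = pd.DataFrame({
    "id": ["a1", "b2", "c3"],
    "comment_text": ["nice edit", "rude remark", "hostile rant"],
    "toxic": [0, 1, 1],
    "insult": [0, 0, 1],
})

# Sum the toxicity columns per row (everything after the first two columns)
toy_row_sums = toy_df.iloc[:, 2:].sum(axis=1)

# A row sum of 0 means no toxicity flag is set, so the comment is clean
toy_df["clean"] = (toy_row_sums == 0)

# Cast to int to see the 0 (toxic) / 1 (non-toxic) encoding explicitly
toy_labels = toy_df["clean"].astype(int)
print(toy_labels.tolist())
```

The boolean `clean` column therefore matches the encoding listed above: `True` (1) for non-toxic and `False` (0) for toxic comments.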

### Pre-processing : Tokenization
Now we have the training data in two separate dataframe columns (arrays/lists): an ordered array of comments (input for the network) and another array of class labels in the same order (output of the network).

We have to transform this data into the input and output formats the network expects. This step is called pre-processing.  
Steps of pre-processing:

1. Tokenize the text into words
2. Assign each word a dimension (an integer index)


To accomplish steps 1 and 2 we will use the built-in __Tokenizer__ class

In [3]:
from keras.preprocessing.text import Tokenizer
# set size of vocabulary
# To Do: try different size 
max_vocab_size = 10000
tokenizer = Tokenizer(num_words=max_vocab_size)
tokenizer.fit_on_texts(train_texts)
sequences = tokenizer.texts_to_sequences(train_texts)
print(sequences[0])

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

Using TensorFlow backend.


[688, 75, 1, 126, 130, 177, 29, 672, 4511, 1116, 86, 331, 51, 2278, 50, 6864, 15, 60, 2756, 148, 7, 2937, 34, 117, 1221, 2825, 4, 45, 59, 244, 1, 365, 31, 1, 38, 27, 143, 73, 3462, 89, 3085, 4583, 2273, 985]
Found 210337 unique tokens.
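
To make steps 1 and 2 more concrete, the following is a simplified pure-Python stand-in for what the tokenizer does conceptually. It is a sketch, not the Keras implementation: the helper names `fit_word_index` and `texts_to_sequences` are defined here for illustration, punctuation filtering is left out, and ties between equally frequent words are broken alphabetically as an assumption.

```python
from collections import Counter

def fit_word_index(texts):
    """Assign each word an integer index, most frequent word first (index 1)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Index 0 is left unused so it can serve as the padding value later
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def texts_to_sequences(texts, word_index, num_words):
    """Replace words by their index, dropping words outside the vocabulary."""
    return [
        [word_index[w] for w in text.lower().split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]

toy_texts = ["the cat sat", "the dog sat down", "the end"]
toy_index = fit_word_index(toy_texts)
toy_sequences = texts_to_sequences(toy_texts, toy_index, num_words=3)
print(toy_index)
print(toy_sequences)
```

In this sketch the index holds every word that was seen, while the vocabulary limit is only applied when texts are converted to sequences. This is consistent with the output above, where far more unique tokens are reported than `max_vocab_size`.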


### Batching and Padding for Embedding
Once we have the tokens and each token (word) has a dimension assigned to it, we take the following steps to create word embeddings:  

3. Use these dimension assignments to define an embedding for each individual word
4. Use the word embeddings to create a word vector for a comment


We will use a specific type of network layer for this, called the __Embedding Layer__. The tokens generated above (sequences of numbers) are the input to the Embedding layer, which passes word embeddings as output to the next layer.  

Input to and output from a neural network are processed in batches. A batch is a group of input data items that are fed to the network together. As the network can process the individual elements of a batch in parallel, training is faster.

In the case of the Embedding Layer, the input and output of a batch look as follows:  

   **Input**: 2D tensor of integers, of shape (# seq. samples in the particular batch, sequence_length), where each entry is a sequence of integers (output of the above code).  
   **Output**: 3D floating-point tensor of shape (# seq. samples in the particular batch, sequence_length, embedding_dimensionality).  

The sequence length can vary from batch to batch, but within a single batch all sequences must have the same length.  

So we have to create batches of sequences of equal length from the data, and to do that we pad or truncate each sequence to a common length. Each batch can then be used as a training input for the embedding layer.  

As a sample case, we take the first 10k of the roughly 160k sequences as training data and use a maximum sequence length of 20 words.
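
Padding and truncating can be sketched in plain NumPy as below. The helper `pad_or_truncate` is a hypothetical name used only here; it pads and truncates at the start of each sequence, which to our understanding is also the default behaviour of `pad_sequences`.

```python
import numpy as np

def pad_or_truncate(sequences, maxlen, value=0):
    """Left-pad short sequences and keep the last `maxlen` items of long ones."""
    batch = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        if not seq:
            continue  # an empty sequence stays all padding
        kept = seq[-maxlen:]               # truncate from the front
        batch[row, -len(kept):] = kept     # pad at the front
    return batch

toy_batch = pad_or_truncate([[5, 6], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(toy_batch)
```

The result is a rectangular 2D integer array of shape (number of sequences, maxlen), which is the input format the Embedding layer expects.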


In [4]:
from keras import preprocessing
training_sequences = sequences[:10000]
training_labels = train_labels[:10000]
seq_max_len = 20
# training padded sequences
train_seq_pad = preprocessing.sequence.pad_sequences(sequences=training_sequences, maxlen=seq_max_len)

# testing padded sequences
testing_sequences = sequences[10000:11000]
testing_labels = train_labels[10000:11000]
test_seq_pad = preprocessing.sequence.pad_sequences(sequences=testing_sequences, maxlen=seq_max_len)

# To Do: try more training data, try different sequence max length

### Model 1: Embedding to Class

#### Define Model 1
Model 1 is made of 4 layers:
* Layer 0 is the input layer
* Layer 1 is the Embedding layer (hidden layer)
* Layer 2 is the Flatten layer (flattens the output of the Embedding layer)
* Layer 3 is the Dense layer (output layer)
    
**Embedding Layer**: This layer helps us create word embeddings (discussed in the Sequence Representation section). For a single input (a sentence that arrives as a sequence of integers) its output is 2D: each integer (representing a word) is transformed into a vector, so a sequence of integers yields a 2D matrix.

**Flatten Layer**: The Embedding layer outputs a 2D matrix per input. To use this output in the Dense layer that follows, it needs to be transformed into 1D, which is what the Flatten layer does.
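
The shape changes described above can be illustrated with a small NumPy sketch. The sizes and the random `embedding_matrix` are made up for illustration; in the real model the matrix holds weights that are learned during training.

```python
import numpy as np

toy_vocab_size, toy_seq_len, toy_embedding_dim = 6, 3, 2

# One row of (normally learned) weights per word index; random here
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(toy_vocab_size, toy_embedding_dim))

# A batch of 2 padded sequences of word indices, shape (2, 3)
toy_batch = np.array([[0, 4, 1],
                      [2, 2, 5]])

# Embedding layer: look up one vector per index -> shape (2, 3, 2)
embedded = embedding_matrix[toy_batch]

# Flatten layer: merge everything except the batch axis -> shape (2, 6)
flattened = embedded.reshape(len(toy_batch), -1)

print(embedded.shape, flattened.shape)
```

Each sequence of 3 indices becomes a 3x2 matrix, and flattening lays its rows end to end so that a Dense layer can consume it as a single vector.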

In [5]:
from keras.models import Sequential
# Embedding is importable from keras.layers directly in old and new Keras releases
from keras.layers import Embedding, Flatten, Dense

model_1 = Sequential()

# no. of unique words in the text data, each word in vocab will be assigned an index (dimension).
# = max_vocab_size defined above
vocab_size = 10000 

# max length of single input data point i.e. count of words present in an input sentence
# short seq are padded and long ones are truncated, done above
# sequence size of single input for the network
seq_max_len = 20 

# dimension of word embedding model (output dimension of embedding layer)
embedding_dim = 8 

## layer 1: add Embedding Layer in the network
#  input to this layer is data of shape: [batch_size, seq_max_len]
model_1.add(Embedding(vocab_size, embedding_dim, input_length=seq_max_len))
#  output of layer 1 is data of shape: [batch_size, seq_max_len, embedding_dim]

## layer 2: flatten the input of shape [batch_size, seq_max_len, embedding_dim]
model_1.add(Flatten())
#  output of shape [batch_size, seq_max_len x embedding_dim]

## layer 3 (final/output layer): Dense layer 
#  all nodes from the previous layer are connected to each node of this layer
#  this has 1 unit/node for classification (toxic/non-toxic)
#  and the activation for 2 classes: sigmoid
model_1.add(Dense(1, activation='sigmoid'))

## compile:   configure the model for training
#  optimizer: the method used to update the network,
#             generally a variant of stochastic gradient descent (SGD);
#             it is applied iteratively to update the network weights
#  loss:      the (objective) function that will be minimised
#  metrics:   used to measure the performance of the network
model_1.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
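
As a rough illustration of what the sigmoid output, the `binary_crossentropy` loss and the `acc` metric compute, here is a NumPy sketch on made-up numbers. The functions below are written for this example and are not the Keras implementations; the clipping constant `eps` is an assumption.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of the true 0/1 labels."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)  # avoid log(0)
    return float(np.mean(-(y_true * np.log(y_prob)
                           + (1 - y_true) * np.log(1 - y_prob))))

# Hypothetical raw outputs of the Dense unit and their true labels
toy_logits = np.array([2.0, -1.0, 0.0])
toy_true = np.array([1, 0, 1])

toy_prob = sigmoid(toy_logits)
toy_loss = binary_crossentropy(toy_true, toy_prob)

# Accuracy: fraction of samples whose thresholded prediction matches the label
toy_acc = float(np.mean((toy_prob > 0.5) == toy_true))
print(toy_prob, toy_loss, toy_acc)
```

The loss is small when the predicted probability is close to the true label and grows quickly for confident wrong predictions, which is what the optimizer tries to reduce.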

In [6]:
# prints the summary of the model
model_1.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 20, 8)             80000     
_________________________________________________________________
flatten_1 (Flatten)          (None, 160)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 161       
Total params: 80,161
Trainable params: 80,161
Non-trainable params: 0
_________________________________________________________________
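
The parameter counts in the summary can be reproduced with simple arithmetic, using the values from the model definition above:

```python
vocab_size, seq_max_len, embedding_dim = 10000, 20, 8

# Embedding: one vector of length embedding_dim per vocabulary entry
embedding_params = vocab_size * embedding_dim

# Flatten only reshapes its input, so it has no weights
flatten_units = seq_max_len * embedding_dim

# Dense(1): one weight per flattened input plus a single bias
dense_params = flatten_units * 1 + 1

total_params = embedding_params + dense_params
print(embedding_params, flatten_units, dense_params, total_params)
```

Almost all of the trainable parameters sit in the Embedding layer, so the vocabulary size and the embedding dimension dominate the size of this model.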


In [7]:
# fit: trains the network for a fixed no. of epoch
history_1 = model_1.fit(train_seq_pad, training_labels, epochs=10, batch_size=100, validation_split=0.2)

Train on 8000 samples, validate on 2000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
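
The sample counts in the log follow from the arguments passed to `fit`. The sketch below only redoes that arithmetic; the variable names are chosen here for illustration.

```python
import math

n_samples, validation_split, batch_size = 10000, 0.2, 100

# validation_split holds back this fraction of the data for validation
n_val = int(n_samples * validation_split)
n_train = n_samples - n_val

# Number of weight updates (batches) per epoch
steps_per_epoch = math.ceil(n_train / batch_size)
print(n_train, n_val, steps_per_epoch)
```

With these settings each of the 10 epochs processes 8000 training samples in batches of 100 and then measures the metrics on the 2000 held-back samples.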


<img src="img/m11.png" alt="Visual representation of one hot encoding and word embedding" style="width: 600px;"/>  


Created using [NN SVG Tool](http://alexlenail.me/NN-SVG/index.html)  

For the above diagram, the following configuration is used (1/4 of the values used in the code):
1. seq_max_len = 5
2. embedding_dim = 2
3. flatten layer = 5x2 = 10
4. dense output layer = 1




#### Test Model 1

We will take a small test set from the unused training data to test our basic model.  

The `model_1.evaluate` method is used to evaluate the model. For evaluation we pass the test data in the same format as the training data, together with the corresponding labels to compare the predictions against.

Ref: Listing 6.7 Deep Learning with Python book  

In [8]:
print(model_1.metrics_names)
model_1.evaluate(x=test_seq_pad, y=testing_labels)

['loss', 'acc']


[0.1935289661884308, 0.92900000000000005]

In [10]:
test_csv = './storage/dataset/test.csv'
test_csv_df = pd.read_csv(test_csv)
test_csv_df.head()

Unnamed: 0,id,comment_text
0,00001cee341fdb12,Yo bitch Ja Rule is more succesful then you'll...
1,0000247867823ef7,== From RfC == \n\n The title is fine as it is...
2,00013b17ad220c46,""" \n\n == Sources == \n\n * Zawe Ashton on Lap..."
3,00017563c3f7919a,":If you have a look back at the source, the in..."
4,00017695ad8997eb,I don't anonymously edit articles at all.


In [11]:
test_csv_df.shape

(153164, 2)

In [12]:
# Input Data
test_csv_texts = test_csv_df['comment_text']
test_csv_texts_sequences = tokenizer.texts_to_sequences(test_csv_texts)
testing_csv_sequences = test_csv_texts_sequences[:1000]
seq_max_len = 20
# test padded sequences
test_csv_seq_pad = preprocessing.sequence.pad_sequences(sequences=testing_csv_sequences, maxlen=seq_max_len)
# make a prediction
ynew = model_1.predict_classes(test_csv_seq_pad)
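
Note that `predict_classes` exists only on `Sequential` models in older Keras releases and, to our knowledge, is no longer available in newer TensorFlow versions. For a single sigmoid output, the same class labels can be obtained by thresholding the predicted probabilities at 0.5, for example `(model_1.predict(test_csv_seq_pad) > 0.5).astype("int32")`. The sketch below shows the thresholding step on made-up probabilities:

```python
import numpy as np

# Hypothetical sigmoid outputs, shaped (n_samples, 1) like the Dense(1) layer
toy_probs = np.array([[0.91], [0.12], [0.50], [0.73]])

# Probability above 0.5 -> class 1 (non-toxic), otherwise class 0 (toxic)
toy_classes = (toy_probs > 0.5).astype("int32")
print(toy_classes.ravel())
```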


In [None]:
# show the inputs and predicted outputs
for i in range(len(test_csv_seq_pad)):
    print("Predicted=%s : Comment=%s" % (ynew[i], test_csv_texts[i]))
    print("--------------------------------------------------------")

## Exercises
Try to improve the performance of the model:
* Sort the comments by length after reading the CSV file, so that comments of similar size are grouped in a batch
* Try different vocabulary sizes during tokenization, e.g. set the size dynamically by keeping only the most frequent words that account for 90% of all word occurrences, or only words whose frequency exceeds some threshold
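
One possible reading of the second exercise is sketched below: pick the smallest vocabulary whose most frequent words cover a given share of all word occurrences. The function name `vocab_size_for_coverage` and the whitespace-based word splitting are assumptions made for this example.

```python
from collections import Counter

def vocab_size_for_coverage(texts, coverage=0.9):
    """Smallest number of most-frequent words covering `coverage` of all tokens."""
    counts = Counter(word for text in texts for word in text.lower().split())
    total = sum(counts.values())
    covered = 0
    for size, (_, count) in enumerate(counts.most_common(), start=1):
        covered += count
        if covered >= coverage * total:
            return size
    return len(counts)

toy_texts = ["a a a a b b b c c d", "a a b e"]
print(vocab_size_for_coverage(toy_texts, coverage=0.9))
```

The returned size could then be used to choose `max_vocab_size`. As far as we know, the Keras `Tokenizer` keeps only words with an index below `num_words`, so adding 1 to the returned size may be needed to keep all of them.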
