# SDS 203 Deep Learning Project 
IMT - Atlantique <br>
Brest, France <br>

Author : Raymond Klutse <br>
Date : 21 June , 2019


### Introduction

Text classification is an important part of Natural Language Processing. It is often used for sentiment analysis and for identifying harmful messages on social media networks such as Twitter and Facebook. The task remains difficult, which is why it is still an active research area. This is a final-semester project for SDS 203 Deep Learning at IMT - Atlantique, based on the article by Zhang et al., [Character-level Convolutional Networks for Text Classification](https://papers.nips.cc/paper/5782-character-level-convolutional-networks-for-text-classification.pdf), in which convolutional networks are used to extract information from text data. CNNs are usually applied to text at the word level, where words are vectorised before being fed into a neural network for training. In the article, Zhang et al. propose vectorising text at the character level instead. The network then learns directly from raw characters and does not require any prior knowledge of the words in the training data. <br> The implementation of this project follows the [CRISP-DM](https://docs.oracle.com/cd/B19306_01/datamine.102/b14339/5dmtasks.htm) methodology for data mining.<br>


### Business Understanding

The aim of this project is to implement a Convolutional Neural Network for Text Classification. The model trained should be able to classify new data into one of two classes.

###  Data Understanding

We first import the libraries used to explore and understand the data. The data used to implement the solution from the article is available on Kaggle as [Amazon Reviews for Sentiment Analysis](https://www.kaggle.com/bittlingmayer/amazonreviews). It is divided into a train set and a test set, and we load each into its own dataframe using pandas. The label `__label__1` represents negative reviews, while `__label__2` represents positive reviews.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [2]:
#Read train dataset
train = pd.read_csv('amazonreviews/train.ft.txt',delimiter="\n",header=None)
print("Train dataset shape : ",train.shape)
train.head()

Train dataset shape :  (3600000, 1)


Unnamed: 0,0
0,__label__2 Stuning even for the non-gamer: Thi...
1,__label__2 The best soundtrack ever to anythin...
2,__label__2 Amazing!: This soundtrack is my fav...
3,__label__2 Excellent Soundtrack: I truly like ...
4,"__label__2 Remember, Pull Your Jaw Off The Flo..."


In [3]:
#Read test dataset
test = pd.read_csv('amazonreviews/test.ft.txt',delimiter="\n",header=None)
print("Test dataset shape : ",test.shape)
test.head()

Test dataset shape :  (400000, 1)


Unnamed: 0,0
0,__label__2 Great CD: My lovely Pat has one of ...
1,__label__2 One of the best game music soundtra...
2,__label__1 Batteries died within a year ...: I...
3,"__label__2 works fine, but Maha Energy is bett..."
4,__label__2 Great for the non-audiophile: Revie...


We observe from the dataframe that each row consists of a label and its respective review.

In the article, an alphabet of 70 characters is used to represent the characters in a review. On inspection, the character '-' appears twice in that alphabet, so this implementation uses 69 characters. We first prepare the alphabet used by the model. We also create functions to clean the text data and to separate the label from the review text.

In [4]:
#Alphabets used for one hot encoding
ascii_lowercase = 'abcdefghijklmnopqrstuvwxyz'
digits = '0123456789'
punctuation = '-,;.!?:"/\\|_\'@#$%^&*~`+=<>[](){}'
whitespace = '\n'

In [5]:
print(ascii_lowercase), print("Number of english letters :",len(ascii_lowercase))

print(digits), print("Number of digits :",len(digits))

print(punctuation), print("Number of punctuations :",len(punctuation))

print(whitespace), print("Number of whitespace :",len(whitespace))

abcdefghijklmnopqrstuvwxyz
Number of english letters : 26
0123456789
Number of digits : 10
-,;.!?:"/\|_'@#$%^&*~`+=<>[](){}
Number of punctuations : 32


Number of whitespace : 1


(None, None)

In [6]:
import itertools
alphabet = list(itertools.chain(ascii_lowercase,digits, punctuation , whitespace))
print("Size of alphabet :",len(alphabet))
alphabet[40:50]

Size of alphabet : 69


['!', '?', ':', '"', '/', '\\', '|', '_', "'", '@']

In [7]:
#method to clean up string data
import re
def clean_str(string):
    s = string.replace(" ", "")
    #s = re.sub(r"[\t]", "", string)
    return s.strip().lower()

In [8]:
#method to convert label into a binary value
def convertlabeltobinary(label):
    label = 0 if label == '__label__1' else 1
    return label

We now split our text into label and text and store it in a dataframe

In [9]:
train = train.loc[:,0].str.split(' ', n=1)
train.head()

0    [__label__2, Stuning even for the non-gamer: T...
1    [__label__2, The best soundtrack ever to anyth...
2    [__label__2, Amazing!: This soundtrack is my f...
3    [__label__2, Excellent Soundtrack: I truly lik...
4    [__label__2, Remember, Pull Your Jaw Off The F...
Name: 0, dtype: object
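
As a plain-Python illustration of the split above, the sketch below parses a single line in the same `__label__N text` layout. The function name `parse_fasttext_line` and the sample review are hypothetical and only serve the example; the notebook itself does this with pandas.

```python
def parse_fasttext_line(line):
    """Split '__label__N review text' into (binary_label, text)."""
    # Split on the first space only, so spaces inside the review are kept.
    label, text = line.split(' ', 1)
    # Same convention as convertlabeltobinary: __label__1 -> 0, otherwise 1.
    return (0 if label == '__label__1' else 1), text


# Made-up sample line, not taken from the dataset.
print(parse_fasttext_line('__label__2 Great value: works as described'))
# (1, 'Great value: works as described')
```

Splitting with a limit of one matters because the review text itself contains many spaces.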

In [10]:
train = pd.DataFrame(list(train),columns = ['label','text'])

In [11]:
from sklearn.utils import shuffle
train= shuffle(train)
train = train.reset_index(drop=True)
train.head()

Unnamed: 0,label,text
0,__label__1,Rushed and incomplete: After reading this book...
1,__label__2,Superb Yet Darker: Now if you like me blast Th...
2,__label__1,Two Serious Flaws: I also found the the displa...
3,__label__2,very good book for young readers: THis was a v...
4,__label__2,Amazing second album from Billy Talent: Billy ...


We then read the label column and apply the binary function to it. 

In [12]:
train_label = train.loc[0:30000,'label'].apply(lambda x:convertlabeltobinary(x))
train_label = train_label.reset_index(drop=True)
train_label = pd.DataFrame(train_label)

In [13]:
print("Train label dataset shape : ",train_label.shape)
train_label.head()

Train label dataset shape :  (30001, 1)


Unnamed: 0,label
0,0
1,1
2,0
3,1
4,1


In [14]:
train_label.describe()

Unnamed: 0,label
count,30001.0
mean,0.50125
std,0.500007
min,0.0
25%,0.0
50%,1.0
75%,1.0
max,1.0


In [16]:
train_label.label.value_counts()

1    15038
0    14963
Name: label, dtype: int64

We now read the text column and apply the clean function to it.

In [None]:
train_text = train.loc[0:30000,'text'].apply(lambda x:clean_str(x))
train_text = train_text.reset_index(drop=True)
train_text = pd.DataFrame(train_text)

In [None]:
print("Train text dataset shape : ",train_text.shape)
train_text.head()

The network can only be fed numeric data, so we vectorise our text data. [One hot encoding](http://www.insightsbot.com/blog/zuyVu/python-one-hot-encoding-with-pandas-made-simple) is applied to each character, so every character becomes a vector whose length is the size of the alphabet, which in our case is 69. <br>
All samples must also have the same shape before they can be stacked into a single input array. The maximum length of a review text is taken to be 1014, the input length used in the article. To achieve this, we pad every review shorter than this length with rows of zeros.


In [None]:
#method to create padding vector
def zerolistmaker(n):
    listofzeros = [0] * n
    return listofzeros

In [None]:
#method to pad vector
def pad_vector(vector):
    while len(vector) < 1014 : vector.append(zerolistmaker(69))
    return vector

In [None]:
#method to vectorise text
import string
def string_vectorizer(string):
    vector = [[0 if char != letter else 1 for char in alphabet ] for letter in string]
    #vector = pd.get_dummies(pd.Series(list(string)).astype('category', categories=alphabet)).values.tolist()
    vector = pad_vector(vector)
    return np.asarray(vector)
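
The functions above pad short reviews but do not shorten long ones. The sketch below is one possible variant, not the notebook's own method: it uses a character-to-index lookup and both truncates and pads to a fixed length. The names `ALPHABET`, `CHAR_INDEX` and `encode_fixed_length` are hypothetical, and the alphabet is rebuilt here only so the example runs on its own.

```python
# Same 69 characters as the notebook's alphabet (letters, digits, punctuation, newline).
ALPHABET = list(
    "abcdefghijklmnopqrstuvwxyz"
    + "0123456789"
    + "-,;.!?:\"/\\|_'@#$%^&*~`+=<>[](){}"
    + "\n"
)
CHAR_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def encode_fixed_length(text, max_len=1014):
    """One-hot encode `text` into exactly `max_len` rows of len(ALPHABET) values."""
    rows = []
    for ch in text[:max_len]:          # truncate anything beyond max_len
        row = [0] * len(ALPHABET)
        idx = CHAR_INDEX.get(ch)       # characters outside the alphabet stay all-zero
        if idx is not None:
            row[idx] = 1
        rows.append(row)
    # Zero-pad short texts up to max_len rows.
    rows.extend([0] * len(ALPHABET) for _ in range(max_len - len(rows)))
    return rows


encoded = encode_fixed_length("ab!", max_len=5)
print(len(encoded), len(encoded[0]))  # 5 69
```

Truncation guarantees that every sample has the same shape, which stacking the samples into one array requires.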

We now vectorise our text using one hot encoding.

In [None]:
train_text_vec = train_text.loc[0:,'text'].apply(lambda x : string_vectorizer(x))
train_text_vec = pd.DataFrame(train_text_vec)

In [None]:
print("Train text vector shape : ",train_text_vec.shape)
train_text_vec.head()

We store the vectorised train text and the binary labels in a new dataframe. We then split it into train and validation data; with the split index of 26001 used below, the last 4,000 of the 30,001 sampled rows (about 13%) form the validation set.

In [None]:
train_numeric = pd.concat([train_label,train_text_vec], axis=1)
print("Train numeric shape : ",train_numeric.shape)
train_numeric.head()

In [None]:
#method to split train and validation 
def validation_train_split(train,split_index):
    #rows from split_index onwards form the validation set
    validation = train.iloc[split_index:,:]
    validation = validation.reset_index(drop=True)
    #rows before split_index form the train set, so no row is dropped
    train = train.iloc[0:split_index,:]
    return train,validation
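
If the validation set should be a fixed share of the data rather than everything after a fixed row index, the split point can be derived from a fraction. The sketch below is a minimal plain-Python illustration; `split_by_fraction` is a hypothetical helper and works on lists (for a dataframe the same indices could be used with `iloc`).

```python
def split_by_fraction(rows, validation_fraction=0.3):
    """Return (train, validation), holding out the last share of `rows`."""
    n_validation = int(round(len(rows) * validation_fraction))
    split_index = len(rows) - n_validation
    return rows[:split_index], rows[split_index:]


train_part, validation_part = split_by_fraction(list(range(10)), 0.3)
print(len(train_part), len(validation_part))  # 7 3
```

This assumes the rows have already been shuffled, as they are in this notebook.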

In [None]:
train_numeric= shuffle(train_numeric)
train_numeric = train_numeric.reset_index(drop=True)

train_numeric,validation_numeric = validation_train_split(train_numeric,26001)
print("Train numeric shape : ",train_numeric.shape)
train_numeric.head()

All the encoded text is stored in x_train for the network and the respective labels are stored in y_train. The shape of x_train has the form (samples, sample rows, sample columns).

In [None]:
from keras.utils.np_utils import to_categorical
x_train = np.stack(train_numeric.loc[:,'text'])
y_train = list(train_numeric.loc[:,'label'])
#y_train = to_categorical(y_train)
print('X Training data shape:' ,x_train.shape)
print('Y Training data shape:',len(y_train))

In [None]:
print("Validation numeric shape : ",validation_numeric.shape)
validation_numeric.head()

All the encoded text is stored in x_validation for the network and the respective labels are stored in y_validation. The shape of x_validation has the form (samples, sample rows, sample columns).

In [None]:
x_validation = np.stack(validation_numeric.loc[0:,'text'])
y_validation = list(validation_numeric.loc[0:,'label'])
print('X Validation data shape:' ,x_validation.shape)
print('Y Validation data shape:',len(y_validation))

We now prepare the test data by performing the same functions applied to the train and validation set.

In [None]:
test = test.loc[:,0].str.split(' ', n=1)
test.head()

In [None]:
test = pd.DataFrame(list(test),columns = ['label','text'])

In [None]:
print("Test data shape : ",test.shape)
test= shuffle(test)
test = test.reset_index(drop=True)
test.head()

In [None]:
test_label = test.loc[0:3000,'label'].apply(lambda x:convertlabeltobinary(x))
test_label = test_label.reset_index(drop=True)
test_label = pd.DataFrame(test_label)

In [None]:
print("Test label shape : ",test_label.shape)
test_label.head()

In [None]:
test_text = test.loc[0:3000,'text'].apply(lambda x:clean_str(x))
test_text = test_text.reset_index(drop=True)
test_text = pd.DataFrame(test_text)

In [None]:
print("Test text shape : ",test_text.shape)
test_text.head()

In [None]:
test_text_vec = test_text.loc[0:,'text'].apply(lambda x : string_vectorizer(x))
test_text_vec = pd.DataFrame(test_text_vec)

In [None]:
print("Test text vector shape : ",test_text_vec.shape)
test_text_vec.head()

In [None]:
test_numeric = pd.concat([test_label,test_text_vec], axis=1)
print("Test numeric shape : ",test_numeric.shape)
test_numeric.head()

In [None]:
x_test = np.stack(test_numeric.loc[:,'text'])
y_test = list(test_numeric.loc[:,'label'])
print('X Test data shape:' ,x_test.shape)
print('Y Test data shape:',len(y_test))

In [None]:
type(x_test)

### Modeling

The model design and training setup follow the article, whose key points are summarised here.

Gradients are obtained by back-propagation in order to perform optimization.

One key module that helped the authors train deeper models is temporal max-pooling.

The non-linearity used is the rectifier or thresholding function (ReLU).

The algorithm used is stochastic gradient descent (SGD) with a minibatch of size 128, using momentum 0.9 and an initial step size of 0.01, which is halved every 3 epochs, 10 times in total.

Each epoch takes a fixed number of random training samples uniformly sampled across classes.

The models accept a sequence of encoded characters as input.

Each character is quantized using one-hot encoding.

The alphabet used in all of the article's models consists of 70 characters, including 26 English letters, 10 digits, 33 other characters and the new line character. The non-space characters are:

               abcdefghijklmnopqrstuvwxyz0123456789
               -,;.!?:'"/\|_@#$%^&*~`+-=<>()[]{}
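
The step-size schedule quoted from the article (initial value 0.01, halved every 3 epochs, 10 times) can be written as a small function. This is a sketch under two assumptions that the text does not settle: epochs are counted from 0, and the first halving happens once 3 epochs have been completed. The name `halved_step_size` is hypothetical.

```python
def halved_step_size(epoch, initial=0.01, every=3, max_halvings=10):
    """Step size for a 0-based epoch, halved every `every` epochs up to a cap."""
    halvings = min(epoch // every, max_halvings)
    return initial * 0.5 ** halvings


# Step size at a few selected epochs.
for epoch in (0, 2, 3, 6, 30, 60):
    print(epoch, halved_step_size(epoch))
```

After the tenth halving the step size stays constant at its smallest value.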

##### Building the Network

In [None]:
import keras
keras.__version__

In [None]:
from keras.activations import relu
from keras.models import Sequential, Model
from keras.layers import Input, Dense, Embedding, Flatten, Dropout, concatenate
from keras.layers.convolutional import Conv1D, MaxPooling1D
from keras.utils.vis_utils import model_to_dot

input_dim = x_train[0].shape
print('Input Shape :' ,input_dim)

In [None]:
#functional API: the model is assembled from the input layer onwards
input_layer = Input(shape = input_dim)

cov_1 = Conv1D(filters=256, kernel_size=7, strides=1,activation='relu')(input_layer)
pool_1= MaxPooling1D(pool_size=3)(cov_1)

cov_2 = Conv1D(filters=256, kernel_size=7,strides=1,activation='relu')(pool_1)
pool_2= MaxPooling1D(pool_size=3)(cov_2)

cov_3 =Conv1D(filters=256, kernel_size=3,strides=1,activation='relu')(pool_2)

cov_4 =Conv1D(filters=256, kernel_size=3,strides=1,activation='relu')(cov_3)

cov_5 =Conv1D(filters=256, kernel_size=3,strides=1,activation='relu')(cov_4)

cov_6 =Conv1D(filters=256, kernel_size= 3,strides=1,activation='relu')(cov_5)
pool_6= MaxPooling1D(pool_size= 3)(cov_6)

flat = Flatten()(pool_6)

dense_1 = Dense(1024, activation='relu')(flat)
drop_1 = Dropout(0.5)(dense_1 )

dense_2 = Dense(1024, activation='relu')(drop_1)
drop_2 = Dropout(0.5)(dense_2)

dense_3 = Dense(1, activation='sigmoid')(drop_2)

model = Model(inputs= input_layer,outputs=dense_3)
model.summary()
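
The sequence lengths reported by `model.summary()` can be checked by hand. The sketch below assumes an input length of 1014, 'valid' padding, stride 1 for the convolutions, and a pooling stride equal to the pool size, which are the Keras defaults for the layers as written above. The helper names `conv_out` and `pool_out` are hypothetical.

```python
def conv_out(length, kernel_size, stride=1):
    """Output length of a 1D convolution with 'valid' padding."""
    return (length - kernel_size) // stride + 1


def pool_out(length, pool_size):
    """Output length of max pooling with stride equal to pool_size."""
    return (length - pool_size) // pool_size + 1


# (kernel_size, pool_size or None) for the six convolutional layers.
layers = [(7, 3), (7, 3), (3, None), (3, None), (3, None), (3, 3)]

length = 1014
for kernel_size, pool_size in layers:
    length = conv_out(length, kernel_size)
    if pool_size is not None:
        length = pool_out(length, pool_size)

flattened = length * 256  # 256 filters in the last convolutional layer
print(length, flattened)  # 34 8704
```

The flattened size of 8704 is the number of inputs to the first dense layer.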

[Loss function for binary classification](https://machinelearningmastery.com/how-to-choose-loss-functions-when-training-deep-learning-neural-networks/) <br>
[Choosing batch size](https://arxiv.org/pdf/1609.04836v1.pdf)

In [None]:
from keras.optimizers import SGD, Adam
learning_rate = 0.01
epochs=10
batch_size=128
decay_rate = learning_rate / epochs
momentum = 0.9

sgd = SGD(lr=learning_rate , decay=decay_rate, momentum=momentum, nesterov=False)
adam = Adam(lr=0.001) 
model.compile(loss='binary_crossentropy',optimizer= sgd, metrics=['accuracy'])
decay_rate
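
Note that the `decay` argument is not the same as halving the step size every 3 epochs. In the legacy Keras 2.x optimizers, to the best of my knowledge, the learning rate is updated after every batch as `lr / (1 + decay * iterations)`. The sketch below illustrates that formula only; the function name is hypothetical and the training-set size of roughly 26,000 samples is an assumption taken from the split used earlier.

```python
import math


def time_based_decayed_lr(initial_lr, decay, iteration):
    """Learning rate after `iteration` batch updates under time-based decay."""
    return initial_lr / (1.0 + decay * iteration)


initial_lr, epochs, batch_size = 0.01, 10, 128
decay = initial_lr / epochs

# Assumed number of training samples (about 26,000 after the validation split).
steps_per_epoch = math.ceil(26000 / batch_size)
total_iterations = steps_per_epoch * epochs

final_lr = time_based_decayed_lr(initial_lr, decay, total_iterations)
print(steps_per_epoch, total_iterations, final_lr)
```

Under these assumptions the learning rate shrinks smoothly to roughly a third of its initial value over 10 epochs, which is a gentler schedule than the one described in the article.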

In [None]:
#for epoch in range(epochs):
model_train = model.fit(x_train, y_train,
                        epochs=epochs,
                        batch_size= batch_size,
                        validation_data=(x_validation, y_validation))
    

In [None]:
validation_eval = model.evaluate(x_validation, y_validation, verbose=1)

In [None]:
print('Validation loss:', validation_eval[0])
print('Validation accuracy:', validation_eval[1])

In [None]:
test_eval = model.evaluate(x_test, y_test,batch_size=128, verbose=1)

In [None]:
print('Test loss:', test_eval[0])
print('Test accuracy:', test_eval[1])

In [None]:
model.save('model_53_perc.h5')


In [None]:
model.save_weights('model_53_weights.h5')

In [None]:
#loaded_model('model_53_perc.h5')

In [None]:
#model.load_weights('model_53_weights.h5')

In [None]:
history = model_train.history
#Keras < 2.3 logs 'acc' / 'val_acc'; later versions log 'accuracy' / 'val_accuracy'
accuracy = history.get('acc', history.get('accuracy'))
val_accuracy = history.get('val_acc', history.get('val_accuracy'))
loss = history['loss']
val_loss = history['val_loss']
#use a separate name so the epochs hyperparameter is not overwritten
epoch_range = range(len(accuracy))
plt.plot(epoch_range, accuracy, 'bo', label='Training accuracy')
plt.plot(epoch_range, val_accuracy, 'b', label='Validation accuracy')
plt.title('Training and validation accuracy')
plt.legend()
plt.figure()
plt.plot(epoch_range, loss, 'bo', label='Training loss')
plt.plot(epoch_range, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()

## Using a trained model to generate predictions on new data

In [None]:
predictions = model.predict(x_test)
#the sigmoid output is the probability of the positive class; threshold it at 0.5
predicted_label = int(predictions[5][0] > 0.5)
print("Review ->", test_text.loc[5,'text'], "Predicted label: " ,predicted_label)

In [None]:
x_test[5][2]

In [None]:
test_text.loc[5,'text']

In [None]:
test_label.loc[5,'label']
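
Because the network ends in a single sigmoid unit, each prediction is a probability for the positive class, and a label is obtained by thresholding it. The sketch below shows this on made-up probabilities. The helper names are hypothetical, and the 0.5 threshold is a common default rather than something fixed by the article.

```python
def probabilities_to_labels(probabilities, threshold=0.5):
    """Map sigmoid outputs to binary labels (1 = positive review)."""
    return [1 if p > threshold else 0 for p in probabilities]


def label_accuracy(predicted, actual):
    """Fraction of predicted labels that match the true labels."""
    matches = sum(int(p == a) for p, a in zip(predicted, actual))
    return matches / len(actual)


# Made-up sigmoid outputs and true labels, for illustration only.
probs = [0.91, 0.12, 0.55, 0.40]
truth = [1, 0, 0, 0]

labels = probabilities_to_labels(probs)
print(labels, label_accuracy(labels, truth))  # [1, 0, 1, 0] 0.75
```

With the real model, the array returned by `model.predict` has shape (samples, 1) and would need to be flattened, for example with `predictions.ravel()`, before being passed to such a helper.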

### Deployment 

### Conclusion