# SDS 203 Deep Learning Project 
IMT - Atlantique <br>
Brest, France <br>

Author : Raymond Klutse <br>
Date : 21 June 2019


### Introduction

This is the final semester project for SDS 203 Deep Learning at IMT - Atlantique. It is based on the article by Zhang et al., [Character-level Convolutional Networks for Text Classification](https://papers.nips.cc/paper/5782-character-level-convolutional-networks-for-text-classification.pdf).<br>
The data used to implement the solution described in that article is available on Kaggle as [Amazon Reviews for Sentiment Analysis](https://www.kaggle.com/bittlingmayer/amazonreviews). The implementation follows the [CRISP-DM](https://docs.oracle.com/cd/B19306_01/datamine.102/b14339/5dmtasks.htm) methodology for data mining.

### Business Understanding

The aim of this project is to implement a Convolutional Neural Network for Text Classification.

###  Data Understanding

We first import the libraries that will be used to understand and explore the data.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

The Amazon reviews data is divided into a train set and a test set. We first load each of them into its own pandas data frame.

In [2]:
#Explore the train dataset
train = pd.read_csv('amazonreviews/train.ft.txt',delimiter="\n",header=None)
print(train.shape)
train.head()

(3600000, 1)


|   | 0 |
|---|---|
| 0 | __label__2 Stuning even for the non-gamer: Thi... |
| 1 | __label__2 The best soundtrack ever to anythin... |
| 2 | __label__2 Amazing!: This soundtrack is my fav... |
| 3 | __label__2 Excellent Soundtrack: I truly like ... |
| 4 | __label__2 Remember, Pull Your Jaw Off The Flo... |


In [3]:
#Explore the test dataset
test = pd.read_csv('amazonreviews/test.ft.txt',delimiter="\n",header=None)
print(test.shape)
test.head()

(400000, 1)


|   | 0 |
|---|---|
| 0 | __label__2 Great CD: My lovely Pat has one of ... |
| 1 | __label__2 One of the best game music soundtra... |
| 2 | __label__1 Batteries died within a year ...: I... |
| 3 | __label__2 works fine, but Maha Energy is bett... |
| 4 | __label__2 Great for the non-audiophile: Revie... |


### Data Preparation

We first prepare the training data.

We insert a new column that will hold only the review text.

In [4]:
train.insert(1, "text", "Any")
train.columns = ['label', 'text']
train.head()

|   | label | text |
|---|---|---|
| 0 | __label__2 Stuning even for the non-gamer: Thi... | Any |
| 1 | __label__2 The best soundtrack ever to anythin... | Any |
| 2 | __label__2 Amazing!: This soundtrack is my fav... | Any |
| 3 | __label__2 Excellent Soundtrack: I truly like ... | Any |
| 4 | __label__2 Remember, Pull Your Jaw Off The Flo... | Any |


In [5]:
train.shape[0]

3600000

In [6]:
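import re

def clean_str(string):
    """Drop quote and caret characters, remove spaces and lowercase the text."""
    s = re.sub(r"[\"^]", "", string)
    # Note: this removes every space, so word boundaries are lost
    s = s.replace(" ", "")
    return s.strip().lower()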

In [None]:
# Split each raw line into its label and its text
# (vectorized: a row-by-row loop over 3.6 million rows is very slow)
parts = train['label'].str.split(' ', n=1, expand=True)
train['label'] = parts[0]
train['text'] = parts[1].map(clean_str)

In [None]:
# Map the raw label tokens to the integers 1 and 2
# (direct column assignment avoids pandas' chained-assignment pitfall)
train['label'] = train['label'].replace({'__label__1': 1, '__label__2': 2})
print(train.head())

In [None]:
train.loc[30,:]

In [None]:
train.head(10)

For each row, we copy the label into the `label` column and the review text into the `text` column.
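
As a minimal sketch of that split on a single raw line, the label token can be separated from the review text at the first space. The helper name `parse_review_line` and the sample string are hypothetical and not part of the notebook.

```python
def parse_review_line(line, prefix="__label__"):
    """Split one raw line into an integer label and the review text."""
    # Split only at the first space so the review text stays intact
    label_token, text = line.split(" ", 1)
    if not label_token.startswith(prefix):
        raise ValueError(f"unexpected label token: {label_token!r}")
    return int(label_token[len(prefix):]), text


# Hypothetical example line in the same format as the dataset
label, text = parse_review_line("__label__1 Poor quality: stopped working quickly")
print(label, text)
```

Raising an error on an unexpected prefix makes malformed lines visible early instead of silently producing wrong labels.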

In [None]:
train.loc[0,'text']

In [None]:
import string

def string_vectorizer(strng,
                      alphabet=string.ascii_lowercase + string.digits + string.punctuation):
    # One row per character of strng; each row is a one-hot vector over the alphabet
    vector = [[1 if char == letter else 0 for char in alphabet]
              for letter in strng]
    return vector

In [None]:
# Vectorize the first 40 reviews only (use train.shape[0] for the full set)
vector = [string_vectorizer(train.loc[i, 'text']) for i in range(40)]
print(len(vector))
len(vector[1])

In [None]:
test.iloc[0:10,0]

In [None]:
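from sklearn.preprocessing import LabelEncoder

# Encode a copy of the first 40 rows; the raw lines are left untouched
# so that they can still be split into label and text afterwards
label_encoder = LabelEncoder()
encoded = label_encoder.fit_transform(test.iloc[0:40, 0])
encoded[:10]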

In [None]:
len(train['label'].unique())

We will now prepare the test data

We insert a new column to contain only the text

In [None]:
test.insert(1, "text", "Any")
test.columns = ['label', 'text']
test.head()

In [None]:
test.loc[0,'text']

In [None]:
# Split each raw line of the test set into its label and its text
parts = test['label'].str.split(' ', n=1, expand=True)
test['label'] = parts[0]
test['text'] = parts[1]

In [None]:
test.head()

In [None]:
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
test.iloc[0:10,0] = label_encoder.fit_transform(test.iloc[0:10,0])
test.iloc[0:10,0].head()

We now tokenize our text into words

In [None]:
import nltk
row_text = train.iloc[0,1]
row_text

In [None]:
nltk_tokens = nltk.word_tokenize(row_text)
nltk_tokens

In [None]:
train.label.describe()

In [None]:
test.describe()

### Modeling

Gradients are obtained by back-propagation in order to perform optimization. 

One key module that helped us to train deeper models is temporal max-pooling.
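
To make temporal (one-dimensional) max-pooling concrete, here is a minimal pure-Python sketch. The function name `temporal_max_pool` and the sample values are illustrative assumptions; in the network itself this is done by a pooling layer.

```python
def temporal_max_pool(sequence, pool_size, stride=None):
    """Return the maximum of each window of `pool_size` consecutive values."""
    if stride is None:
        stride = pool_size  # non-overlapping windows by default
    last_start = len(sequence) - pool_size
    return [max(sequence[start:start + pool_size])
            for start in range(0, last_start + 1, stride)]


# Illustrative feature values along the time (character position) axis
features = [1, 3, 2, 5, 4, 0]
pooled = temporal_max_pool(features, pool_size=2)
print(pooled)
```

Each output value summarizes a short window of positions, so the sequence gets shorter while the strongest responses are kept.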

The non-linearity used in our model is the rectifier (thresholding) function.
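
The rectifier clips negative values to zero, that is h(x) = max(0, x). A minimal sketch with illustrative inputs (the function name `rectifier` is an assumption):

```python
def rectifier(x):
    """Thresholding non-linearity: negative inputs become zero."""
    return max(0.0, x)


# Illustrative pre-activation values
activations = [rectifier(v) for v in [-2.0, -0.5, 0.0, 1.5]]
print(activations)
```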

The algorithm used is stochastic gradient descent (SGD) with a minibatch of size 128, momentum 0.9 and an initial step size of 0.01, which is halved every 3 epochs, 10 times in total.
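
The step-size schedule can be sketched as a small function. This is one reading of the description, under two assumptions: epochs are counted from 0, and the step size stays constant once it has been halved 10 times. The function and parameter names are hypothetical.

```python
def step_size(epoch, initial=0.01, halve_every=3, max_halvings=10):
    """Step size for a 0-based epoch index under the halving schedule."""
    halvings = min(epoch // halve_every, max_halvings)
    return initial * 0.5 ** halvings


for epoch in (0, 2, 3, 6, 30, 40):
    print(epoch, step_size(epoch))
```

A function of this shape could be passed to a learning-rate scheduler callback instead of relying on a fixed decay value.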

Each epoch takes a fixed number of random training samples, uniformly sampled across classes.
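
One way to draw such an epoch is to pick the same number of random rows from every class. The sketch below is an assumption about how this could be done, not the authors' procedure. It samples with replacement, and the name `sample_epoch` and the toy labels are hypothetical.

```python
import random
from collections import defaultdict


def sample_epoch(labels, samples_per_class, rng):
    """Pick the same number of random row indices from every class."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    chosen = []
    for label in sorted(by_class):
        # Sample with replacement so small classes can still fill their quota
        chosen.extend(rng.choices(by_class[label], k=samples_per_class))
    rng.shuffle(chosen)
    return chosen


# Illustrative, imbalanced toy labels
labels = [1, 1, 1, 1, 1, 2, 2]
indices = sample_epoch(labels, samples_per_class=4, rng=random.Random(0))
print([labels[i] for i in indices])
```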

Our models accept a sequence of encoded characters as input.

Each character is quantized using one-hot encoding.

The alphabet used in all of our models consists of 70 characters, including 26 English letters, 10 digits, 33 other characters and the new line character. The non-space characters are:
               abcdefghijklmnopqrstuvwxyz0123456789
               -,;.!?:'"/\|_@#$%^&*~`+-=<>()[]{}

#### Building the Network

In [None]:
import keras
keras.__version__

In [None]:
from keras.activations import relu
from keras.models import Sequential, Model
from keras.layers import Input, Dense, Embedding, Flatten, Conv1D, MaxPooling1D
from keras.layers import Dropout, concatenate
from keras.utils.vis_utils import model_to_dot

# Assumed input: 1014 one-hot encoded characters over the 70-character alphabet
# (1014 is the input length used by Zhang et al.; adjust it to your data)
input_shape = (1014, 70)

#create model
model = Sequential()
#add model layers: six temporal convolution + max-pooling pairs
model.add(Conv1D(32, kernel_size=5, strides=1,
                 activation='relu', input_shape=input_shape))
model.add(MaxPooling1D(pool_size=2, strides=2))

model.add(Conv1D(64, kernel_size=5, activation='relu'))
model.add(MaxPooling1D(pool_size=2))

model.add(Conv1D(32, kernel_size=5, strides=1, activation='relu'))
model.add(MaxPooling1D(pool_size=2, strides=2))

model.add(Conv1D(64, kernel_size=5, activation='relu'))
model.add(MaxPooling1D(pool_size=2))

model.add(Conv1D(32, kernel_size=5, strides=1, activation='relu'))
model.add(MaxPooling1D(pool_size=2, strides=2))

model.add(Conv1D(64, kernel_size=5, activation='relu'))
model.add(MaxPooling1D(pool_size=2))

#classification head (assumed: two sentiment classes)
model.add(Flatten())
model.add(Dense(2, activation='softmax'))
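
It can help to check how the sequence length shrinks through the stack before training. The sketch below assumes six convolution layers of kernel width 5 without padding, each followed by max-pooling of size 2, and an input of 1014 characters, the length used by Zhang et al. (treat this as an assumption about your data). The helper names are hypothetical.

```python
def conv_output_length(length, kernel_size, stride=1):
    """Length after a 1D convolution without padding."""
    return (length - kernel_size) // stride + 1


def pool_output_length(length, pool_size, stride=None):
    """Length after 1D max-pooling without padding."""
    if stride is None:
        stride = pool_size
    return (length - pool_size) // stride + 1


length = 1014  # assumed input length (the value used by Zhang et al.)
for block in range(6):
    length = conv_output_length(length, kernel_size=5)
    length = pool_output_length(length, pool_size=2)
    print(f"after block {block + 1}: {length}")
```

If the length reaches zero or below for your input size, the stack is too deep for that input and a layer has to be removed or padded.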

In [None]:
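from sklearn.preprocessing import LabelEncoder

# Map the raw labels to the integers 0 and 1
# (assumes the `label` columns of `train` and `test` hold the same two class values)
label_encoder = LabelEncoder()
train_labels = label_encoder.fit_transform(train['label'])
test_labels = label_encoder.transform(test['label'])

In [None]:
from keras.utils import to_categorical

# One-hot encode the integer labels for the categorical cross-entropy loss
train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)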

In [None]:
train_labels

In [None]:
test_labels

In [None]:
from keras import optimizers

sgd = optimizers.SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='categorical_crossentropy',
              optimizer=sgd,
              metrics=['accuracy'])

In [None]:
# fit() returns the History object, so no callback is needed to record it
history = model.fit(x_train, y_train,
                    batch_size=128,
                    epochs=3,
                    verbose=1,
                    validation_data=(x_test, y_test))

In [None]:
test_loss, test_acc = model.evaluate(x_test, y_test)

In [None]:
print('test_acc:', test_acc)

#### Validating our approach
To monitor the model's accuracy during training on data it has never seen before, we create a validation set by setting apart 10,000 samples from the original training data:

In [None]:
x_val = x_train[:10000]
partial_x_train = x_train[10000:]

y_val = y_train[:10000]
partial_y_train = y_train[10000:]

In [None]:
history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=20,
                    batch_size=512,
                    validation_data=(x_val, y_val))

In [None]:
history_dict = history.history
history_dict.keys()

In [None]:
import matplotlib.pyplot as plt

acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(1, len(acc) + 1)

# "bo" is for "blue dot"
plt.plot(epochs, loss, 'bo', label='Training loss')
# b is for "solid blue line"
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

plt.show()

In [None]:
plt.clf()   # clear figure
acc_values = history_dict['acc']
val_acc_values = history_dict['val_acc']

plt.plot(epochs, acc_values, 'bo', label='Training acc')
plt.plot(epochs, val_acc_values, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()

plt.show()

#### Using a trained network to generate predictions on new data

In [None]:
model.predict(x_test)

### Evaluation

### Deployment 

### Conclusion