In [None]:
# Supervised learning (Convolutional Neural Network)

## Table of Contents


[Requirements](#Requirements)  
[Classification for Sentiment Analysis](#Classification-for-Sentiment-Analysis)  
[CNN](#CNN)  
[History](#History)  
[Model overview](#CNN-Model-overview)  
[Pros and Cons](#CNN-Proc-and-Cons)  
[Main params](#CNN-Main-params)  
[Practice](#CNN-Practice)  
[Useful links](#CNN-Useful-links)  


## Requirements

1. Python 3.x (or Anaconda3 for Python 3.5, https://www.continuum.io/downloads)
2. Scikit-learn 0.18.x (pip install scikit-learn==0.18.1, http://scikit-learn.org/)
3. Keras latest (https://keras.io/#installation)
4. Pandas latest (http://pandas.pydata.org/)
5. For datasets with more than 1M reviews, the minimum hardware requirement is 8 GB of RAM

[To the table of contents](#Table-of-Contents)

# Classification for Sentiment Analysis


Main tasks:
- supervised learning
- focus on the binary classification problem, in which the label y can take on only two values, 0 and 1
- predict the sentiment of a user's review text

We are trying to build a sentiment classifier for user reviews of movies (consumer goods, books, etc.).  
Here x(i) may be some features of a user review, and y(i) may be 1 if the review is positive, and 0 otherwise.  
0 is also called the negative class, and 1 the positive class,  
and they are sometimes also denoted by the symbols “-” and “+”. Given x(i),  
the corresponding y(i) is also called the label for the training example.


[To the table of contents](#Table-of-Contents)  

# CNN 

### History

1957, Frank Rosenblatt - perceptron, Cornell Aeronautical Laboratory

1959, Hubel & Wiesel found that cells in animal visual cortex are responsible for detecting light in receptive fields

1975, Kunihiko Fukushima - cognitron as an extension of original perceptron

1980, Kunihiko Fukushima - neocognitron - predecessor of CNN

1988, Yann LeCun showed that back-propagation and several of its generalizations could be derived rigorously using Lagrange functions [LeCun, 1988]

1990s to 2012: from the late 1990s to the early 2010s convolutional neural networks were in incubation. As more data and computing power became available, the tasks that convolutional neural networks could tackle became more and more interesting

2012, Alex Krizhevsky (and others) released AlexNet, a deeper and much wider version of LeNet, which won the difficult ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012 by a large margin. It was a significant breakthrough with respect to the previous approaches, and the current widespread application of CNNs can be attributed to this work

2013, Matthew Zeiler and Rob Fergus developed the ZFNet (short for Zeiler & Fergus Net). It improved on AlexNet by tweaking the architecture hyperparameters

2014, Szegedy (and others). GoogLeNet - the main contribution was the development of an Inception Module that dramatically reduced the number of parameters in the network (4M, compared to AlexNet with 60M)

2014, Karen Simonyan and Andrew Zisserman of the University of Oxford, VGGNet - its main contribution was in showing that the depth of the network (number of layers) is a critical component for good performance. It scored first place on the image localization task and second place on the image classification task. Localization is finding where in the image a certain object is, described by a bounding box. Classification is describing what the object in the image is, i.e. predicting a category label

2015, Kaiming He (and others), Residual Network. As of May 2016, ResNets were by far the state-of-the-art Convolutional Neural Network models and the default choice for using ConvNets in practice

2016, Xingyu Zeng (and others), Gated Bi-directional CNN for Object Detection

[To the table of contents](#Table-of-Contents)  

### CNN Model overview  

A basic CNN consists of several types of layers: convolutional, pooling and fully connected. Below we try to build an intuition for how each of them works under the hood.

** The theory of convolution **

Convolution is a technique widely used in image processing. It is the process of adding each element of the image to its local neighbors, weighted by the kernel. 
<img src="http://deeplearning.stanford.edu/wiki/images/6/6c/Convolution_schematic.gif" style="width: 100%;"/>
Depending on the filter, it produces a wide range of effects such as blurring, sharpening, embossing, edge detection and more.
<img src="http://cs231n.github.io/assets/cnn/convnet.jpeg" alt="hidden layers visualisation" style="width: 100%;"/>
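
The sliding-window operation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not an optimized implementation: the function name `convolve2d_valid` and the toy image and kernel values are assumptions made for the example. Like most CNN libraries, it does not flip the kernel (strictly speaking this is cross-correlation; for the symmetric kernel used here the result is the same).

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the weighted neighbours."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            region = image[i:i + kh, j:j + kw]      # local neighbourhood
            output[i, j] = np.sum(region * kernel)  # weighted sum
    return output

# Toy 5x5 binary "image" and 3x3 kernel (illustrative values only).
image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])

feature_map = convolve2d_valid(image, kernel)
print(feature_map)
# [[4. 3. 4.]
#  [2. 4. 3.]
#  [2. 3. 4.]]
```

A 5x5 input and a 3x3 kernel give a 3x3 feature map, because the kernel fits in only 5 - 3 + 1 = 3 positions along each axis.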

** Underlying biological context of a filter output **  

Depending on the filter weights and the input image, the summation can produce any positive or negative value, but the main goal is to check whether the current image region matches the filter pattern. In other words, we need to solve a simple binary classification problem, which is why an activation function is used. The most common one is ReLU (Rectified Linear Unit, f(x)=max(0,x)) because of its efficiency. This mechanism is based on the behavior of a real neuron.
<img src="http://cs231n.github.io/assets/nn1/neuron.png" style="width: 50%;"/>
<img src="http://cs231n.github.io/assets/nn1/neuron_model.jpeg" style="width: 50%;"/>
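
As a small illustration of the activation step, the sketch below applies ReLU to a handful of raw filter responses. The input values are hypothetical and chosen only to show both the negative and the positive case.

```python
import numpy as np

def relu(x):
    """Rectified Linear Unit: keep positive values, replace negative ones with 0."""
    return np.maximum(0, x)

# Hypothetical raw filter responses (weighted sums) for five image regions.
raw_responses = np.array([-3.0, -0.5, 0.0, 1.5, 4.0])
activated = relu(raw_responses)
print(activated)  # [0.  0.  0.  1.5 4. ]
```

Regions that do not match the filter pattern (negative responses) are silenced, while matching regions keep the strength of their response.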

** Filters **

In the regular sense a filter (aka kernel) is a matrix of predefined parameters (or weights). In a CNN, however, the filters are self-adjusting: their weights are learned with the backpropagation algorithm. 
As a first step the filter weights are initialized with relatively small random values, which breaks the symmetry among "neurons" and leads to better classification capability (zero initialisation is not recommended because every "neuron" in the network would compute the same output).
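
The effect of the initialisation can be checked with a tiny experiment. The sketch below is only an illustration under assumed values: one hypothetical input with three features, four "neurons", a `tanh` activation and a standard deviation of 0.01 for the random weights.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([0.5, -1.0, 2.0])   # one hypothetical input with 3 features

# Zero initialisation: all 4 "neurons" share the same weights -> identical outputs.
w_zero = np.zeros((3, 4))
out_zero = np.tanh(x @ w_zero)

# Small random initialisation: every neuron starts from different weights.
w_rand = rng.normal(loc=0.0, scale=0.01, size=(3, 4))
out_rand = np.tanh(x @ w_rand)

print(out_zero)  # four identical values
print(out_rand)  # four different small values
```

With identical outputs the neurons would also receive identical gradient updates, so they could never learn different patterns.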


** A magic behind the backprop **
<img src="https://cdn-images-1.medium.com/max/1000/1*-1trgA6DUEaafJZv3k0mGw.jpeg" style="width: 60%;"/>
To simplify, we can describe the backpropagation process in a few steps:

 1. Feed-forward computation
 2. Backpropagation to the output layer
 3. Backpropagation to the hidden layer
 4. Weight updates

The algorithm stops when the value of the error function has become sufficiently small.
Backprop computes the gradients of the error function, which gradient descent then uses to update the weights. More math:
(https://drive.google.com/open?id=0B_ZhtIWnWellNW5FOGhHdkYyaWM)
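
The four steps can be sketched with a tiny fully connected network in NumPy. Everything specific here is an assumption made for illustration: the toy dataset (logical OR), the single hidden layer with three sigmoid units, the mean squared error, the learning rate and the stopping tolerance. It is a sketch of the idea, not the training procedure Keras uses internally.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(42)

# Toy dataset: logical OR of two binary inputs.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [1.]])

# Small random initial weights, zero biases.
W1 = rng.normal(scale=0.5, size=(2, 3))
b1 = np.zeros((1, 3))
W2 = rng.normal(scale=0.5, size=(3, 1))
b2 = np.zeros((1, 1))

learning_rate = 1.0   # constant factors of the gradient are folded into this value
tolerance = 0.01      # stop once the error is "sufficiently small"
losses = []

for step in range(5000):
    # 1. Feed-forward computation
    hidden = sigmoid(X @ W1 + b1)
    output = sigmoid(hidden @ W2 + b2)
    loss = np.mean((output - y) ** 2)
    losses.append(loss)
    if loss < tolerance:
        break

    # 2. Backpropagation to the output layer (error times sigmoid derivative)
    delta_out = (output - y) * output * (1 - output)

    # 3. Backpropagation to the hidden layer
    delta_hidden = (delta_out @ W2.T) * hidden * (1 - hidden)

    # 4. Weight updates (one gradient descent step)
    W2 -= learning_rate * hidden.T @ delta_out
    b2 -= learning_rate * delta_out.sum(axis=0, keepdims=True)
    W1 -= learning_rate * X.T @ delta_hidden
    b1 -= learning_rate * delta_hidden.sum(axis=0, keepdims=True)

print("first loss: %.4f, last loss: %.4f" % (losses[0], losses[-1]))
```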


** Pooling **
 
The pooling layer is used to reduce the dimensionality of the intermediate data and to condense sparse activations. We simply choose a region size and pick the maximum value (MaxPooling) or the average (AveragePooling) over each region.


<img src="http://ufldl.stanford.edu/tutorial/images/Pooling_schematic.gif" style="width: 100%;"/>
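
Both pooling variants can be sketched with NumPy. The helper name `pool2d`, the non-overlapping regions (stride equal to the region size) and the toy feature map are assumptions made for this example.

```python
import numpy as np

def pool2d(feature_map, size, mode="max"):
    """Non-overlapping pooling over size x size regions (stride == size)."""
    h, w = feature_map.shape
    h_out, w_out = h // size, w // size
    # Crop so the map divides evenly, then group it into size x size blocks.
    blocks = feature_map[:h_out * size, :w_out * size].reshape(h_out, size, w_out, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))

feature_map = np.array([[1, 3, 2, 4],
                        [5, 6, 7, 8],
                        [3, 2, 1, 0],
                        [1, 2, 3, 4]])

max_pooled = pool2d(feature_map, 2, mode="max")
avg_pooled = pool2d(feature_map, 2, mode="average")
print(max_pooled)  # [[6 8]
                   #  [3 4]]
print(avg_pooled)  # [[3.75 5.25]
                   #  [2.   2.  ]]
```

The 4x4 map shrinks to 2x2: each output value summarizes one 2x2 region of the input.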

** Fully connected layer **

The final layer, which takes as input the data from a convolutional or pooling layer and outputs an N-dimensional vector, where N is the number of classes that the program has to choose from.
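
A fully connected layer is essentially a matrix multiplication followed by an activation. The sketch below assumes a hypothetical 128-dimensional feature vector, N = 2 classes, random stand-in weights and a softmax activation that turns the scores into probabilities.

```python
import numpy as np

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)   # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

rng = np.random.default_rng(1)
n_features, n_classes = 128, 2

features = rng.normal(size=n_features)                        # output of the previous layer
weights = rng.normal(scale=0.1, size=(n_features, n_classes))  # one weight column per class
bias = np.zeros(n_classes)

scores = features @ weights + bias      # N-dimensional vector of class scores
probabilities = softmax(scores)
print(probabilities.shape)              # (2,)
```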

** CNN for sentiment analysis **

Instead of image pixels, the input to most NLP tasks is a sentence or document represented as a matrix. Each row of the matrix corresponds to one token, typically a word, but it could be a character. That is, each row is a vector that represents a word. Typically, these vectors are word embeddings (low-dimensional representations) like word2vec or GloVe, but they could also be one-hot vectors that index the word into a vocabulary. For a 10-word sentence using a 100-dimensional embedding we would have a 10×100 matrix as our input. That’s our “image”.
<img src="http://d3kbpzbmcynnmx.cloudfront.net/wp-content/uploads/2015/11/Screen-Shot-2015-11-06-at-12.05.40-PM-1024x937.png" style="width: 100%;"/>
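
To make the "sentence as image" idea concrete, the sketch below slides filters over a 10×100 sentence matrix. The embeddings and filter weights are random stand-ins (not real word2vec or GloVe vectors), and the choice of 4 filters spanning 3 words each, followed by ReLU and max-over-time pooling, is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
sentence_length, embedding_dim = 10, 100
kernel_size, n_filters = 3, 4

# Random stand-in for the embedded sentence: one row per word.
sentence_matrix = rng.normal(size=(sentence_length, embedding_dim))

# Each filter covers `kernel_size` consecutive words and the full embedding width.
filters = rng.normal(scale=0.1, size=(n_filters, kernel_size, embedding_dim))

n_windows = sentence_length - kernel_size + 1
feature_maps = np.zeros((n_windows, n_filters))
for start in range(n_windows):
    window = sentence_matrix[start:start + kernel_size]          # shape (3, 100)
    feature_maps[start] = np.sum(filters * window, axis=(1, 2))  # one value per filter

feature_maps = np.maximum(0, feature_maps)   # ReLU
sentence_vector = feature_maps.max(axis=0)   # max-over-time pooling

print(feature_maps.shape)     # (8, 4)
print(sentence_vector.shape)  # (4,)
```

Unlike in image processing, the filter only moves in one direction (along the words), which is why a 1D convolution is used for text.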

[To the table of contents](#Table-of-Contents)

### CNN Proc and Cons

** Pros **

    + Accuracy
    + The best choice for image recognition and position-dependent data

** Cons **
    - Needs large datasets to show good results
    - Training can be time-consuming



[To the table of contents](#Table-of-Contents)

### CNN Main params

** keras.layers.Conv1D **

** filters **: Integer, the dimensionality of the output space (i.e. the number of output filters in the convolution).

** kernel_size **: An integer or tuple/list of a single integer, specifying the length of the 1D convolution window.

** activation **: Activation function to use (see activations). If you don't specify anything, no activation is applied (i.e. "linear" activation: a(x) = x).
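
How `filters` and `kernel_size` shape the output can be worked out by hand. The helper functions below are hypothetical (they are not part of Keras) and assume the Keras defaults: no padding (`padding='valid'`), a convolution stride of 1, and a pooling stride equal to the pool size. The numbers follow the model used later in this notebook (sequence length 50, embedding dimension 50).

```python
def conv1d_output_length(input_length, kernel_size, stride=1):
    """Number of positions where the window fits when no padding is used."""
    return (input_length - kernel_size) // stride + 1

def maxpool1d_output_length(input_length, pool_size):
    """Non-overlapping pooling: the stride equals the pool size."""
    return (input_length - pool_size) // pool_size + 1

def conv1d_param_count(input_channels, filters, kernel_size):
    """Weights of every filter plus one bias per filter."""
    return kernel_size * input_channels * filters + filters

length = 50   # MAX_SEQUENCE_LENGTH
lengths = []
for layer, size in [("conv", 3), ("conv", 3), ("pool", 3), ("conv", 3), ("conv", 3)]:
    if layer == "conv":
        length = conv1d_output_length(length, size)
    else:
        length = maxpool1d_output_length(length, size)
    lengths.append(length)

print(lengths)  # [48, 46, 15, 13, 11]

# First convolution: 64 filters of length 3 over 50-dimensional embeddings.
first_conv_params = conv1d_param_count(input_channels=50, filters=64, kernel_size=3)
print(first_conv_params)  # 9664
```

The sequence axis shrinks with every layer, while the size of the last axis is always equal to `filters`.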


[To the table of contents](#Table-of-Contents)

### CNN Practice

In [1]:
import pandas as pd
import re
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.utils import shuffle

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.layers import Dense, Input, Flatten
from keras.layers import Conv1D, MaxPooling1D, Embedding, Dropout, GlobalAveragePooling1D
from keras.models import Sequential


In [2]:
NB_EPOCH = 2
MAX_SEQUENCE_LENGTH = 50
MAX_NB_WORDS = 20000
EMBEDDING_DIM = 50
# 'fmeasure', 'precision' and 'recall' were removed from the built-in string metrics in Keras 2.0
METRICS = ['accuracy']
labels_index = {'negative': 0, 'positive': 1}

In [3]:
cached_stopwords = ['href','quot','amp','br',
                    'an','by','did','does','was',
                    'were','the','to','at','on',
                    'in','with','it','he','she',
                    'this','that','is']
def remove_stopwords(text):
    # Compare in lower case so that capitalized stopwords ("The", "It") are removed too
    return ' '.join([word for word in text.split() if word.lower() not in cached_stopwords])

In [5]:
data_rt = pd.read_csv("../data/reviews_rt_all.csv", sep="|")
data_imdb = pd.read_csv("../data/imdb_small.csv", sep="|")


In [6]:
data_df = pd.concat([data_rt, data_imdb], ignore_index=True, copy=False)
data_df = shuffle(data_df)


In [11]:
print(data_df.shape)

(152610, 2)


In [12]:
data_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 152610 entries, 48943 to 67704
Data columns (total 2 columns):
label    152610 non-null int64
text     152610 non-null object
dtypes: int64(1), object(1)
memory usage: 3.5+ MB


In [13]:
# data_df.describe()
# data_df.describe(include=['object'])
# data_df['label'].value_counts()
data_df['label'].value_counts(normalize=True)

1    0.587498
0    0.412502
Name: label, dtype: float64

In [28]:
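def create_seq_model(layer):
    """Build a 1D CNN on top of the given embedding layer."""
    model = Sequential()
    model.add(layer)
    # Two convolution blocks: 64 and then 128 filters, each spanning 3 tokens
    model.add(Conv1D(64, 3, activation='relu'))
    model.add(Conv1D(64, 3, activation='relu'))
    model.add(MaxPooling1D(3))
    model.add(Conv1D(128, 3, activation='relu'))
    model.add(Conv1D(128, 3, activation='relu'))
    # Average every feature map over the whole sequence
    model.add(GlobalAveragePooling1D())
    model.add(Dropout(0.5))
    # One output unit per class (labels are one-hot encoded with to_categorical)
    model.add(Dense(len(labels_index), activation='sigmoid'))
    return model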

In [29]:
def clean_str(string):
    string = re.sub(r"can\'t", "can not", str(string))
    string = re.sub(r"\'s", " is", string)
    string = re.sub(r"\'ve", " have", string)
    string = re.sub(r"n\'t", " not", string)
    string = re.sub(r"\'re", " are", string)
    string = re.sub(r"\'d", " would", string)
    string = re.sub(r"\'ll", " will", string)
    string = re.sub(r"\b[A-Za-z]{1}\b", ' ', string)
    string = re.sub(r"[^A-Za-z-_]", " ", string)
    string = re.sub(r'\.{1,10}', ' ', string)
    string = re.sub(r"\s{2,}", " ", string)
    return string.strip()

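def preprocessing(dataframe):
    # Keep only the first 500 rows so that the demo runs quickly
    dataframe = dataframe[:500]
    # Drop missing reviews before cleaning: clean_str() casts its input to str,
    # which would turn NaN into the literal text "nan". Copy to avoid SettingWithCopyWarning.
    dataframe = dataframe[dataframe['text'].notnull()].copy()
    dataframe['text'] = dataframe['text'].apply(clean_str)
    dataframe['text'] = dataframe['text'].apply(remove_stopwords)
    return dataframe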

In [16]:
data_rt = preprocessing(data_rt)
data_imdb = preprocessing(data_imdb)

In [21]:
X_train_rt, X_test_rt, y_train_rt, y_test_rt = train_test_split(
    data_rt.text,
    data_rt.label,
    test_size=0.2,
    random_state=42)


In [None]:
X_train_imdb, X_test_imdb, y_train_imdb, y_test_imdb = train_test_split(
    data_imdb.text,
    data_imdb.label,
    test_size=0.2,
    random_state=42)

In [None]:
X_train = pd.concat([X_train_rt, X_train_imdb])
X_test = pd.concat([X_test_rt, X_test_imdb])
y_train = pd.concat([y_train_rt, y_train_imdb])
y_test = pd.concat([y_test_rt, y_test_imdb])

In [30]:
tokenizer = Tokenizer(num_words=MAX_NB_WORDS)  # `nb_words` was renamed to `num_words` in Keras 2
tokenizer.fit_on_texts(X_train)


In [None]:
sequences = tokenizer.texts_to_sequences(X_train)
word_index = tokenizer.word_index
data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
labels = to_categorical(np.asarray(y_train))

In [None]:
sequences_test = tokenizer.texts_to_sequences(X_test)
data_test = pad_sequences(sequences_test, maxlen=MAX_SEQUENCE_LENGTH)
labels_test = to_categorical(np.asarray(y_test))
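
To see what the tokenizing and padding steps above produce, here is a simplified pure-Python sketch. The helper functions are hypothetical stand-ins written for this illustration, not the Keras implementation: words get integer ids by descending frequency (starting at 1), and sequences are padded with zeros and truncated at the front, which mirrors the default behaviour of `pad_sequences`.

```python
from collections import Counter

def build_word_index(texts, max_words):
    """Map the most frequent words to integer ids 1, 2, 3, ... (0 is reserved for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(max_words), start=1)}

def texts_to_sequences(texts, word_index):
    """Replace every known word with its id; unknown words are skipped."""
    return [[word_index[word] for word in text.lower().split() if word in word_index]
            for text in texts]

def pad_sequence(sequence, maxlen):
    """Keep the last `maxlen` ids and left-pad shorter sequences with zeros."""
    sequence = sequence[-maxlen:]
    return [0] * (maxlen - len(sequence)) + sequence

texts = ["great movie great cast", "boring movie"]
word_index = build_word_index(texts, max_words=10)
sequences = texts_to_sequences(texts, word_index)
padded = [pad_sequence(sequence, maxlen=3) for sequence in sequences]

print(word_index)  # {'great': 1, 'movie': 2, 'cast': 3, 'boring': 4}
print(sequences)   # [[1, 2, 1, 3], [4, 2]]
print(padded)      # [[2, 1, 3], [0, 4, 2]]
```

After this step every review is a fixed-length list of integers, which is the input format the Embedding layer expects.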

In [None]:
embedding_layer = Embedding(len(tokenizer.word_index) + 1,
                            EMBEDDING_DIM,
                            input_length=MAX_SEQUENCE_LENGTH)

### challenge

<details>
  <summary>Click to see answer</summary>
  <img alt="Smiley face" align="left" src="http://1.1m.yt/_FT_6m0.png">
  <p align='left'>Are you sure you tried to solve it on your own?</p>
      <pre>
          <code>
import os

# GLOVE_DIR must point to the directory that holds the unpacked GloVe files.
# The vector size of the file has to match EMBEDDING_DIM (50 in this notebook).
embeddings_index = {}
with open(os.path.join(GLOVE_DIR, 'glove.6B.%dd.txt' % EMBEDDING_DIM), encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))
# At this point we can leverage our embeddings_index dictionary and our word_index to compute our embedding matrix:

embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector
# We load this embedding matrix into an Embedding layer. Note that we set trainable=False to prevent the weights from being updated during training.

embedding_layer = Embedding(len(word_index) + 1,
                            EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)
          </code>
      </pre>

</details>

[To the table of contents](#Table-of-Contents)

In [None]:
model = create_seq_model(embedding_layer)

In [None]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=METRICS)

In [None]:
model.fit(data, labels, validation_data=(data_test, labels_test),
          epochs=NB_EPOCH, batch_size=128, shuffle=True)  # `nb_epoch` was renamed to `epochs` in Keras 2

### CNN Useful links

http://ufldl.stanford.edu/tutorial/supervised/ConvolutionalNeuralNetwork/

http://cs231n.github.io/neural-networks-1/ 

https://keras.io/getting-started/sequential-model-guide/

https://keras.io/models/sequential/

https://adeshpande3.github.io/adeshpande3.github.io/The-9-Deep-Learning-Papers-You-Need-To-Know-About.html 

https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html

http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/

[To the table of contents](#Table-of-Contents)