<!--<h1 style="font-size:40px; font-family:Verdana" align="center"> UDS-Club Workshop </h1> -->
<h2 style="font-size:34px; font-family:Verdana" align="center"> Convolutional Neural Networks </h2>
<img src='http://i.piccy.info/i9/666d78be04fbcf04fdb321d5953d1fa5/1492256847/123248/1137898/ua_parrots.jpg'/>
<h4 style="font-size:18px; font-family:Verdana" align="right"> by Iryna Melnyk <br> <pre>    2017-04-23</pre> </h4>

## Table of Contents


[Requirements](#Requirements)  
[Classification for Sentiment Analysis](#Classification-for-Sentiment-Analysis)  
[CNN](#CNN)  
[History](#History)  
[Model overview](#CNN-Model-overview)  
[Pros and Cons](#CNN-Proc-and-Cons)  
[Main params](#CNN-Main-params)  
[Practice](#CNN-Practice)   
[Conclusions](#Conclusions)   
[Useful links](#CNN-Useful-links)  



## Requirements

1. Python 3.x (or Anaconda3 for Python 3.5, https://www.continuum.io/downloads)
2. Scikit-learn 0.18.x (pip install scikit-learn==0.18.1, http://scikit-learn.org/)
3. Keras latest (https://keras.io/#installation)
4. Pandas latest (http://pandas.pydata.org/)
5. For datasets with more than 1M reviews, at least 8 GB of RAM

[To the table of contents](#Table-of-Contents)

# Classification for Sentiment Analysis


Main tasks:
- supervised learning
- focus on the binary classification problem, in which y can take only two values, 0 and 1
- predict the sentiment of a user's review text

We are building a sentiment classifier for user reviews of movies (consumer goods, books, etc.).  
Here x(i) is a set of features of a user review, and y(i) is 1 if the review is positive and 0 otherwise.  
0 is also called the negative class and 1 the positive class;  
they are sometimes denoted by the symbols “-” and “+”. Given x(i),  
the corresponding y(i) is also called the label for the training example.


[To the table of contents](#Table-of-Contents)  

# CNN 

### History

1957, Frank Rosenblatt - perceptron, Cornell Aeronautical Laboratory

1959, Hubel & Wiesel found that cells in animal visual cortex are responsible for detecting light in receptive fields

1975, Kunihiko Fukushima - cognitron as an extension of original perceptron

1980, Kunihiko Fukushima - neocognitron - predecessor of CNN

1988, LeCun et al. showed that back-propagation and several of its generalizations could be derived rigorously using Lagrange functions [LeCun, 1988]

1990s to 2012: from the late 1990s to the early 2010s, convolutional neural networks were in incubation. As more data and computing power became available, the tasks that convolutional neural networks could tackle became more and more interesting

2012, Alex Krizhevsky (and others) released AlexNet, a deeper and much wider version of LeNet, which won the difficult ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012 by a large margin. It was a significant breakthrough over previous approaches, and the current widespread application of CNNs can be attributed to this work

2013, Matthew Zeiler and Rob Fergus developed the ZFNet (short for Zeiler & Fergus Net). It improved on AlexNet by tweaking the architecture hyperparameters

2014, Szegedy (and others), GoogLeNet - the main contribution was the development of an Inception Module that dramatically reduced the number of parameters in the network (4M, compared to AlexNet with 60M)

2014, Karen Simonyan and Andrew Zisserman of the University of Oxford, VGGNet - its main contribution was in showing that the depth of the network (number of layers) is a critical component for good performance. It scored first place on the image localization task and second place on the image classification task. Localization is finding where in the image a certain object is, described by a bounding box. Classification is describing what the object in the image is, that is, predicting a category label

2015, Kaiming He (and others), Residual Network. ResNets are currently by far the state-of-the-art convolutional neural network models and are the default choice for using ConvNets in practice (as of May 2016)

2016, Xingyu Zeng (and others), Gated Bi-directional CNN for Object Detection

[To the table of contents](#Table-of-Contents)  

### CNN Model overview  

A basic CNN consists of several types of layers: convolutional, pooling and fully connected. Below we try to build an intuition for how they work under the hood.

**The theory of convolution**

Convolution is a technique widely used in image processing. It is the process of adding each element of the image to its local neighbors, weighted by the kernel. 
<img src="http://deeplearning.stanford.edu/wiki/images/6/6c/Convolution_schematic.gif" style="width: 60%;"/>
Depending on the filter, it produces a wide range of effects such as blurring, sharpening, embossing, edge detection and more.
<img src="http://cs231n.github.io/assets/cnn/convnet.jpeg" alt="hidden layers visualisation" style="width: 100%;"/>

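The sliding-window operation described above can be sketched in a few lines of NumPy. This is only an illustration: `convolve2d_valid` is a hypothetical helper (not a Keras function), the 5×5 image and 3×3 kernel are toy values, and no padding or stride is used. Like CNN layers, it does not flip the kernel, which makes no difference here because the kernel is symmetric.

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the weighted neighbours."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Element-wise product of the current patch with the kernel, then sum
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Toy binary image and kernel (illustrative values)
image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])

feature_map = convolve2d_valid(image, kernel)
print(feature_map)
# [[4. 3. 4.]
#  [2. 4. 3.]
#  [2. 3. 4.]]
```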
**Filters**

In the regular sense, a filter (aka kernel) is a matrix of predefined parameters (or weights). In a CNN, however, the filters are self-adjusting: their weights are learned with the backpropagation algorithm. 
At the first step the filter weights are initialized with relatively small random values, which breaks the symmetry among "neurons" and leads to better classification capability (zero initialisation is not recommended because every "neuron" in the network would compute the same output).
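
A minimal sketch of why zero initialisation is a problem, assuming a toy layer of three "neurons" with a tanh activation (the sizes, the activation and the 0.01 scale are illustrative choices, not values from this workshop):

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.rand(4)                                   # one input example with 4 features

w_zero = np.zeros((3, 4))                         # three neurons, all weights zero
w_rand = rng.normal(scale=0.01, size=(3, 4))      # three neurons, small random weights

out_zero = np.tanh(w_zero @ x)   # every neuron computes exactly the same value
out_rand = np.tanh(w_rand @ x)   # every neuron computes something different

print(out_zero)
print(out_rand)
```

With identical weights the neurons also receive identical gradient updates, so they stay copies of each other during training; small random values avoid that.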

**Pooling**
 
A pooling layer reduces the dimensionality of the intermediate data and condenses sparse activations. We simply choose a region size and take the maximum value (MaxPooling) or the average (AveragePooling) over each region.
<img src="https://ujwlkarn.files.wordpress.com/2016/08/screen-shot-2016-08-10-at-3-38-39-am.png?w=494" style="width: 80%;"/>

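Both pooling variants can be sketched with NumPy as follows. `pool2d` is a hypothetical helper written for this illustration; it assumes non-overlapping square regions and simply drops rows or columns that do not fill a whole region.

```python
import numpy as np

def pool2d(x, size, mode="max"):
    """Pool a 2D array over non-overlapping size x size regions."""
    h, w = x.shape
    # Trim to a multiple of `size`, then split into (rows, size, cols, size) blocks
    blocks = x[:h - h % size, :w - w % size].reshape(h // size, size, w // size, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))

x = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]])

max_pooled = pool2d(x, 2, mode="max")
avg_pooled = pool2d(x, 2, mode="mean")
print(max_pooled)   # [[6 8]
                    #  [3 4]]
print(avg_pooled)   # [[3.25 5.25]
                    #  [2.   2.  ]]
```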
**Fully connected layer**

The final layer takes the output of a convolutional or pooling layer as its input and outputs an N-dimensional vector, where N is the number of classes that the program has to choose from.
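
As a rough sketch, a fully connected layer is a matrix multiplication plus a bias, followed here by a softmax so that the N outputs can be read as class probabilities. The feature size (256), the random weights and the softmax choice are assumptions for illustration only.

```python
import numpy as np

def softmax(z):
    """Turn raw scores into probabilities that sum to 1."""
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

rng = np.random.RandomState(0)
features = rng.rand(256)                  # stand-in for pooled convolutional features
n_classes = 2                             # e.g. negative / positive

W = rng.normal(scale=0.01, size=(n_classes, features.size))   # one weight row per class
b = np.zeros(n_classes)

probs = softmax(W @ features + b)
print(probs.shape, probs.sum())
```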

**CNN for sentiment analysis**

Instead of image pixels, the input to most NLP tasks is a sentence or document represented as a matrix. Each row of the matrix corresponds to one token, typically a word, but it could be a character. That is, each row is a vector that represents a word. Typically, these vectors are word embeddings (low-dimensional representations) like word2vec or GloVe, but they could also be one-hot vectors that index the word into a vocabulary. For a 10 word sentence using a 100-dimensional embedding we would have a 10×100 matrix as our input. That’s our “image”.
<img src="http://d3kbpzbmcynnmx.cloudfront.net/wp-content/uploads/2015/11/Screen-Shot-2015-11-06-at-12.05.40-PM-1024x937.png" style="width: 100%;"/>

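The 10×100 "image" can be built with a simple table lookup. In this sketch the embedding table is filled with random numbers as a stand-in for real word2vec or GloVe vectors, and the example sentence and vocabulary are made up for illustration.

```python
import numpy as np

sentence = "i like this movie very much it is really great".split()   # 10 tokens

# Toy vocabulary: every distinct word gets a row index in the embedding table
vocab = {word: idx for idx, word in enumerate(sorted(set(sentence)))}

embedding_dim = 100
rng = np.random.RandomState(0)
# Random vectors stand in for pretrained embeddings (assumption for this sketch)
embedding_table = rng.normal(size=(len(vocab), embedding_dim))

# Look up one row per token: the result is the sentence matrix
sentence_matrix = embedding_table[[vocab[word] for word in sentence]]
print(sentence_matrix.shape)   # (10, 100)
```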
[To the table of contents](#Table-of-Contents)

<a id="CNN-Proc-and-Cons"></a>
### CNN Pros and Cons

**Pros**

- Accuracy is high when there is enough training data
- The best choice for computer vision tasks and position-dependent data

**Cons**

- Needs a large dataset to show good results
- Training can be time-consuming
- High computational cost



[To the table of contents](#Table-of-Contents)

### CNN Main params

**keras.layers.Conv1D**

**filters**: Integer, the dimensionality of the output space (i.e. the number of output filters in the convolution).

**kernel_size**: An integer or tuple/list of a single integer, specifying the length of the 1D convolution window.

**activation**: Activation function to use (see the Keras activations documentation). If you don't specify anything, no activation is applied (i.e. "linear" activation: a(x) = x).
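
To see how the three parameters interact, here is a NumPy sketch of a 1D convolution. `conv1d_valid` and `relu` are hypothetical helpers, not the Keras implementation; the sketch assumes no padding, stride 1 and a kernel laid out as (kernel_size, input channels, filters).

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0)

def conv1d_valid(x, kernels, bias, activation=None):
    """x: (steps, channels); kernels: (kernel_size, channels, filters); bias: (filters,)."""
    kernel_size, _, filters = kernels.shape
    out_steps = x.shape[0] - kernel_size + 1
    out = np.empty((out_steps, filters))
    for t in range(out_steps):
        window = x[t:t + kernel_size]                       # (kernel_size, channels)
        # Each filter sums over the whole window: one output value per filter
        out[t] = np.tensordot(window, kernels, axes=([0, 1], [0, 1])) + bias
    # No activation means the "linear" activation a(x) = x
    return out if activation is None else activation(out)

rng = np.random.RandomState(0)
x = rng.normal(size=(50, 50))                        # 50 tokens, 50-dimensional embeddings
kernels = rng.normal(scale=0.1, size=(3, 50, 64))    # kernel_size=3, filters=64
bias = np.zeros(64)

y = conv1d_valid(x, kernels, bias, activation=relu)
print(y.shape)   # (48, 64)
```

`filters` sets the last dimension of the output, `kernel_size` shortens the sequence by `kernel_size - 1` steps, and `activation` is applied element-wise at the end.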


[To the table of contents](#Table-of-Contents)

### CNN Practice

In [1]:
import pandas as pd
import re
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.utils import shuffle

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils.np_utils import to_categorical
from keras.layers import Dense, Input, Flatten
from keras.layers import Conv1D, MaxPooling1D, Embedding, Dropout, GlobalAveragePooling1D
from keras.models import Sequential

Using TensorFlow backend.


In [2]:
NB_EPOCH = 2
MAX_SEQUENCE_LENGTH = 50
MAX_NB_WORDS = 20000
EMBEDDING_DIM = 50

# Note: the 'fmeasure' metric exists only in Keras 1.x; it was removed in Keras 2.0
METRICS = ['accuracy', 'fmeasure']
labels_index = {'negative': 0, 'positive': 1}

In [3]:
data_train = pd.read_csv("../data/movie_reviews.csv", sep=',')
data_test = pd.read_csv("../data/test.csv", sep=',')

In [4]:
print(data_train.shape)
print(data_test.shape)

(152610, 2)
(10660, 2)


In [5]:
data_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 152610 entries, 0 to 152609
Data columns (total 2 columns):
label    152610 non-null int64
text     152610 non-null object
dtypes: int64(1), object(1)
memory usage: 2.3+ MB


In [6]:
data_train['label'].value_counts(normalize=True)

1    0.587498
0    0.412502
Name: label, dtype: float64

In [7]:
stopwords = ['href','quot','amp','br',
                    'an','by','did','does','was',
                    'were','the','to','at','on',
                    'in','with','it','he','she',
                    'this','that','is']
def remove_stopwords(text):
    return ' '.join([word for word in text.split() if word not in stopwords])

In [8]:
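def clean_str(string):
    """Expand common contractions, keep only letters, '-' and '_', and lowercase."""
    # Expand contractions before the apostrophes are stripped
    string = re.sub(r"can\'t", "can not", str(string))
    string = re.sub(r"\'s", " is", string)
    string = re.sub(r"\'ve", " have", string)
    string = re.sub(r"n\'t", " not", string)
    string = re.sub(r"\'re", " are", string)
    string = re.sub(r"\'d", " would", string)
    string = re.sub(r"\'ll", " will", string)
    # Drop single-letter words
    string = re.sub(r"\b[A-Za-z]{1}\b", ' ', string)
    # Replace every character except letters, '-' and '_' with a space
    string = re.sub(r"[^A-Za-z-_]", " ", string)
    # Dots were already replaced by the previous step, so this never matches
    string = re.sub(r'\.{1,10}', ' ', string)
    # Collapse repeated whitespace
    string = re.sub(r"\s{2,}", " ", string)
    return string.strip().lower()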

def preprocessing(dataframe):
#     dataframe = dataframe[:500]
    dataframe['text'] = dataframe.loc[:,'text'].apply(clean_str)
    dataframe['text'] = dataframe.loc[:,'text'].apply(remove_stopwords)
    dataframe = dataframe[dataframe['text'].notnull()]
    return dataframe

In [9]:
def create_seq_model(layer):
    model = Sequential()
    model.add(layer)
    model.add(Conv1D(64, 3, activation='relu'))
    model.add(MaxPooling1D(2))
    model.add(Conv1D(128, 3, activation='relu'))
    model.add(MaxPooling1D(2))
    model.add(Conv1D(256, 3, activation='relu'))
    model.add(GlobalAveragePooling1D())
    model.add(Dropout(0.5))
    model.add(Dense(len(labels_index), activation='softmax'))
    return model
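
The last two layers before the classifier can be sketched in NumPy. The input shape (9, 256) assumes the sequence length shrinks as 50 → 48 → 24 → 22 → 11 → 9 through the unpadded convolutions and poolings above; the random feature maps and the dropout mask are illustrative only.

```python
import numpy as np

rng = np.random.RandomState(0)

# Stand-in for the output of the last Conv1D layer: (steps, filters)
feature_maps = rng.rand(9, 256)

# Global average pooling: average every filter over all time steps
pooled = feature_maps.mean(axis=0)          # shape (256,)

# Inverted dropout with rate 0.5 (training time only): zero out units at random
# and rescale the survivors so that the expected activation stays the same
rate = 0.5
keep_mask = rng.rand(pooled.size) >= rate
dropped = pooled * keep_mask / (1.0 - rate)

print(pooled.shape, int(keep_mask.sum()))
```

At prediction time dropout is switched off and the pooled vector goes to the Dense layer unchanged.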

### Challenge

Let's try to make our own layered pie! Implement the model with the following parameters:

<details>
  <summary>Click to see more</summary>
  <img src="https://raw.githubusercontent.com/udsclub/workshop/master/pictures/model.png" style="width: 100%;"/>
  Don't forget to uncomment the model creation and recompile it
</details>

In [10]:
def create_another_model(layer):
    model = Sequential()
    model.add(layer)
    # place your code here
    return model

In [11]:
data_train = preprocessing(data_train)
data_test = preprocessing(data_test)

In [12]:
print (data_train[:5])

   label                                               text
0      1  entire generation of filmgoers just might repr...
1      1  pixar classic one of best kids movies of all time
2      1  apesar de representar um imenso avan o tecnol ...
3      1  when woody perks up opening scene not only toy...
4      1  introduced not one but two indelible character...


In [13]:
tokenizer = Tokenizer(nb_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(data_train.text)

In [14]:
sequences = tokenizer.texts_to_sequences(data_train.text)
word_index = tokenizer.word_index
data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
labels = to_categorical(np.asarray(data_train.label))
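
What the tokenizer and `pad_sequences` do can be approximated in pure Python. This is a simplified sketch, not the Keras code: `build_word_index` and `texts_to_padded` are hypothetical helpers, and the real `Tokenizer` additionally lowercases, filters punctuation and applies the word limit. Like the `pad_sequences` defaults, the sketch pads and truncates at the front of the sequence.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer rank; more frequent words get smaller indices."""
    counts = Counter(word for text in texts for word in text.split())
    # Index 0 is reserved for padding, so ranks start at 1
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_padded(texts, word_index, maxlen):
    """Convert texts to fixed-length integer sequences, zero-padded on the left."""
    padded = []
    for text in texts:
        seq = [word_index[w] for w in text.split() if w in word_index]
        seq = seq[-maxlen:]                              # keep the last maxlen tokens
        padded.append([0] * (maxlen - len(seq)) + seq)   # pad short sequences in front
    return padded

texts = ["great movie", "great great acting but slow movie"]
word_index = build_word_index(texts)
padded = texts_to_padded(texts, word_index, maxlen=4)
print(padded[0])   # [0, 0, 1, 2]
```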

In [15]:
X_train, X_test, y_train, y_test = train_test_split(
    data,
    labels,
    test_size=0.2,
    random_state=42)

In [16]:
embedding_layer = Embedding(len(tokenizer.word_index) + 1,
                            EMBEDDING_DIM,
                            input_length=MAX_SEQUENCE_LENGTH)

### Homework challenge

We can replace our embedding_layer using GloVe and retrain our model.

<details>
  <summary>Click to see more</summary>
  <p align='left'>You can use this code snippet to solve the challenge. For this task we need to download glove.6B.zip from https://nlp.stanford.edu/projects/glove/</p>
      <pre>
          <code>
import os

GLOVE_DIR = "path/to/glove"
embeddings_index = {}
# Use the GloVe file whose vector size matches EMBEDDING_DIM (50)
with open(os.path.join(GLOVE_DIR, 'glove.6B.50d.txt'), encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

# At this point we can leverage our embeddings_index dictionary and our word_index to compute our embedding matrix:

embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    # words not found in the embedding index stay all-zeros
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

# We load this embedding matrix into an Embedding layer. Note that we set trainable=False to prevent the weights from being updated during training.

embedding_layer = Embedding(len(word_index) + 1,
                            EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)
          </code>
      </pre>

</details>

In [17]:
model = create_seq_model(embedding_layer)
# model = create_another_model(embedding_layer)
model.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
embedding_1 (Embedding)          (None, 50, 50)        5811950     embedding_input_1[0][0]          
____________________________________________________________________________________________________
convolution1d_1 (Convolution1D)  (None, 48, 64)        9664        embedding_1[0][0]                
____________________________________________________________________________________________________
maxpooling1d_1 (MaxPooling1D)    (None, 24, 64)        0           convolution1d_1[0][0]            
____________________________________________________________________________________________________
convolution1d_2 (Convolution1D)  (None, 22, 128)       24704       maxpooling1d_1[0][0]             
___________________________________________________________________________________________
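
The shapes and parameter counts in this summary can be checked by hand. The sketch below assumes unpadded convolutions with stride 1 and non-overlapping pooling; the helper functions are hypothetical and only encode that arithmetic.

```python
def conv1d_output_length(steps, kernel_size):
    # No padding, stride 1: the window fits in steps - kernel_size + 1 positions
    return steps - kernel_size + 1

def max_pool_output_length(steps, pool_size):
    # Non-overlapping pooling windows
    return steps // pool_size

def conv1d_param_count(kernel_size, in_channels, filters):
    # One weight per (window position, input channel, filter) plus one bias per filter
    return kernel_size * in_channels * filters + filters

steps, channels = 50, 50                     # MAX_SEQUENCE_LENGTH, EMBEDDING_DIM

conv1_params = conv1d_param_count(3, channels, 64)
steps_conv1 = conv1d_output_length(steps, 3)
steps_pool1 = max_pool_output_length(steps_conv1, 2)
conv2_params = conv1d_param_count(3, 64, 128)
steps_conv2 = conv1d_output_length(steps_pool1, 3)

# The embedding layer has one 50-dimensional row per vocabulary entry
embedding_rows = 5811950 // channels

print(conv1_params, steps_conv1, steps_pool1)   # 9664 48 24
print(conv2_params, steps_conv2)                # 24704 22
print(embedding_rows)                           # 116239
```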

In [18]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=METRICS)  # alternative optimizer: SGD(lr=0.001, momentum=0.9)

In [19]:
model.fit(X_train, y_train, validation_data=(X_test, y_test),
          nb_epoch=NB_EPOCH, batch_size=128, shuffle=True)

Train on 122088 samples, validate on 30522 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7fe6f11c56d8>

In [20]:
data_test['label'].value_counts(normalize=True)

1    0.5
0    0.5
Name: label, dtype: float64

In [21]:
sequences = tokenizer.texts_to_sequences(data_test['text'])
data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
labels = to_categorical(np.asarray(data_test['label']))

In [22]:
scores = model.predict(data)

In [23]:
# Threshold the predicted probabilities at 0.5 to get binary class indicators
b_scores = (scores > 0.5).astype(int)
print(classification_report(labels, b_scores))

             precision    recall  f1-score   support

          0       0.86      0.67      0.75      5330
          1       0.73      0.89      0.80      5330

avg / total       0.79      0.78      0.78     10660
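
The f1-score column is the harmonic mean of precision and recall, which can be verified from the rounded values in the report. `f1_from` is a hypothetical helper for this check; small deviations are possible because the inputs are already rounded to two decimals.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

f1_negative = f1_from(0.86, 0.67)   # class 0
f1_positive = f1_from(0.73, 0.89)   # class 1

print(round(f1_negative, 2), round(f1_positive, 2))   # 0.75 0.8
```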



In [24]:
score = model.evaluate(data, labels)



In [25]:
print("%s: %.2f%%" % (model.metrics_names[0], score[0]*100))
print("%s: %.2f%%" % (model.metrics_names[1], score[1]*100))
print("%s: %.2f%%" % (model.metrics_names[2], score[2]*100))

loss: 45.73%
acc: 78.00%
fmeasure: 78.00%


### Conclusions

ConvNets are good at generalizing from position-dependent data, and they are useful not only in image classification and NLP. Google’s AlphaGo, which defeated Lee Sedol in their 2016 Go match, relied on CNNs at its core. Self-driving cars, which will arguably become a regular sight on our streets in the coming years, rely on CNNs for steering. 

[To the table of contents](#Table-of-Contents)

### CNN Useful links

http://ufldl.stanford.edu/tutorial/supervised/ConvolutionalNeuralNetwork/

http://cs231n.github.io/neural-networks-1/ 

https://keras.io/getting-started/sequential-model-guide/

https://keras.io/models/sequential/

https://adeshpande3.github.io/adeshpande3.github.io/The-9-Deep-Learning-Papers-You-Need-To-Know-About.html 

https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html

http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/

[To the table of contents](#Table-of-Contents)