## Introduction 
This tutorial introduces the basic ideas behind convolutional neural networks (CNNs) and shows how to implement a CNN model for a supervised learning problem, sentiment analysis of Twitter feeds, using TensorFlow's convolutional neural network module.
As the data generated in daily life increasingly includes high-dimensional data such as images and text rather than only structured data such as numbers, we need deep learning models to solve complex practical problems like image classification. With a traditional model such as linear regression, it is hard to project high-dimensional data into a vector without losing information, or to keep the number of free parameters at a reasonable level, whereas deep learning models are designed to handle this kind of input. The convolutional neural network is an important part of deep learning. The following picture illustrates how a CNN can classify images.
<img src="https://cdn-images-1.medium.com/max/1200/1*oB3S5yHHhvougJkPXuc8og.gif">
We can see from the image that there is a fixed input layer, a fixed output layer, and multiple hidden layers in between. Although CNNs were first introduced to solve image classification problems, they have subsequently also proved effective for semantic parsing in natural language processing (1).

### Tutorial content
This tutorial shows how to build a simple convolutional neural network model that identifies the sentiment of Twitter feeds, taking advantage of [TensorFlow](https://www.tensorflow.org/tutorials/deep_cnn) and the [pretrained Google Word2Vec](https://radimrehurek.com/gensim/models/word2vec.html) model. The pretrained word vector file can be downloaded [here](https://code.google.com/archive/p/word2vec/).

This tutorial uses the Twitter Sentiment Analysis Data Set from [Ibrahim Naji's blog](http://thinknook.com/twitter-sentiment-analysis-training-corpus-dataset-2012-09-22/). Because annotating tweets by hand would take a lot of manual work, I use this freely available annotated dataset. The CSV file contains nearly 1.6 million records, each with the text of a Twitter feed and a sentiment flag (0 stands for negative, 1 stands for positive).

We will cover the following topics in this tutorial:
- [Convolutional Neural Network Overview](#Convolutional-Neural-Network-Overview)
- [Libraries installation](#Installing-the-libraries)
- [Loading and preprocessing data](#Loading-data-and-preprocessing)
- [Model implementation details](#Model-implementation)
- [Code integration](#Code-integration)
- [Conclusion](#Conclusion)
- [References](#References)

### Convolutional Neural Network Overview

<img src="http://www.mdpi.com/information/information-07-00061/article_deploy/html/images/information-07-00061-g001.png">

The picture above gives a brief overview of a convolutional neural network.
First comes the input layer, which usually holds high-dimensional data; for example, several 2-D matrices can be stacked to form a 3-D input layer. Next comes the convolutional layer, which has a filter that works like a flashlight and, in our case, has a fixed size. The filter moves along the configured path over all the data in the input layer and produces a smaller layer of data, as in the GIF below. Then comes the pooling layer, which is mainly a downsampling strategy that further reduces the data. Unlike the convolutional layer, the pooling layer usually has no overlapping windows, and it also helps to avoid overfitting. After these layers, which are the most distinctive part of a CNN, there are two fully connected layers similar to those of an ordinary neural network. Note: a model may contain many convolutional and pooling layers depending on its design, but for tutorial purposes our model includes only one convolutional layer and one pooling layer.

<img src="http://deeplearning.stanford.edu/wiki/images/6/6c/Convolution_schematic.gif">
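
To make the sliding filter and the pooling step concrete, here is a minimal NumPy sketch. It is not the TensorFlow implementation used later; the function names `convolve2d_valid` and `max_pool`, the 6x6 input, and the all-ones filter are assumptions chosen only for illustration. As in most CNN libraries, the "convolution" here does not flip the filter.

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide `kernel` over `image` with stride 1 and no padding."""
    img_h, img_w = image.shape
    k_h, k_w = kernel.shape
    out = np.zeros((img_h - k_h + 1, img_w - k_w + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            # Element-wise product of the filter and the patch under it, summed.
            out[r, c] = np.sum(image[r:r + k_h, c:c + k_w] * kernel)
    return out

def max_pool(feature_map, size):
    """Non-overlapping max pooling with a square window of side `size`."""
    h, w = feature_map.shape
    trimmed = feature_map[:h - h % size, :w - w % size]
    blocks = trimmed.reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

image = np.arange(36, dtype=float).reshape(6, 6)   # toy 6x6 input
kernel = np.ones((3, 3))                           # toy 3x3 filter

feature_map = convolve2d_valid(image, kernel)      # smaller than the input
pooled = max_pool(feature_map, 2)                  # smaller again
print(feature_map.shape, pooled.shape)
```

The convolution windows overlap (stride 1), while the pooling windows do not, which matches the distinction drawn in the overview above.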

## Installing the libraries

Before building the model with TensorFlow, several libraries and packages need to be installed. You can install TensorFlow and gensim using Anaconda:

### To install TensorFlow, execute the following commands.
Create tensorflow conda environment:

$conda create -n tensorflow pip python=3.5 

Activate conda environment:

$activate tensorflow (the prompt will change to tensorflow)

Install tensorflow in conda environment (the first command installs the CPU-only version and the second installs the GPU version):

$pip install --ignore-installed --upgrade tensorflow 

$pip install --ignore-installed --upgrade tensorflow-gpu

The CPU-only version is enough for the model training and evaluation in this tutorial.

### To install gensim, execute the following command:
$ conda install -c anaconda gensim
Note: to use the pretrained word vector model, you still need to download the Google pretrained model from this [site](https://code.google.com/archive/p/word2vec/) and put that file in the same folder as this file.
After you have installed all the needed packages and files, make sure that the following code works for you:

In [None]:
import pandas as pd
import re
import nltk
import numpy as np
import tensorflow as tf
import os
import time
import datetime
import string
from tensorflow.contrib import learn
from gensim.models import KeyedVectors

## Loading data and preprocessing

Now that we have installed and loaded all the libraries we need, let's load our data set and do some preprocessing. Download the file from this [site](http://thinknook.com/twitter-sentiment-analysis-training-corpus-dataset-2012-09-22/) by clicking the Twitter Sentiment Analysis Dataset hyperlink on that page (at the beginning of the fourth paragraph). A Sentiment-Analysis-Dataset.zip file will be downloaded; unzip it, and you will find a single file, 'Sentiment Analysis Dataset.csv', inside. Put the CSV file in this directory so that we can load the data using the following command:

In [2]:
dataset = pd.read_csv('Sentiment Analysis Dataset.csv',error_bad_lines=False)

b'Skipping line 8836: expected 4 fields, saw 5\n'
b'Skipping line 535882: expected 4 fields, saw 7\n'


After running the code, we see two warning messages saying that two lines of bad data were skipped. Looking into the original dataset, we found that these lines cause parse errors. Since we still have more than enough training data, we can simply ignore the two records and go on with data exploration and preprocessing on the remaining data:
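
If you want to inspect such lines yourself, the sketch below uses only the standard `csv` module to report rows whose field count differs from the expected four. The helper name `find_bad_rows` and the miniature sample are hypothetical, and the stray unquoted comma in the sample is only one plausible cause of a "expected 4 fields, saw 5" message, not a confirmed diagnosis of the real file.

```python
import csv
import io

def find_bad_rows(lines, expected_fields):
    """Return (line_number, field_count) for rows with an unexpected field count."""
    bad = []
    reader = csv.reader(lines)
    for row in reader:
        if len(row) != expected_fields:
            # reader.line_num is the number of source lines read so far.
            bad.append((reader.line_num, len(row)))
    return bad

# Hypothetical miniature of the dataset; the third line has a stray comma.
sample = io.StringIO(
    "ItemID,Sentiment,SentimentSource,SentimentText\n"
    "1,0,Sentiment140,is so sad\n"
    "2,1,Sentiment140,omg, its already 7:30\n"
)
bad_rows = find_bad_rows(sample, expected_fields=4)
print(bad_rows)
```

To run the same check on the real file, pass an open file handle instead of the `io.StringIO` sample.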

In [3]:
print(dataset.head())

   ItemID  Sentiment SentimentSource  \
0       1          0    Sentiment140   
1       2          0    Sentiment140   
2       3          1    Sentiment140   
3       4          0    Sentiment140   
4       5          0    Sentiment140   

                                       SentimentText  
0                       is so sad for my APL frie...  
1                     I missed the New Moon trail...  
2                            omg its already 7:30 :O  
3            .. Omgaga. Im sooo  im gunna CRy. I'...  
4           i think mi bf is cheating on me!!!   ...  


From the first five records printed above we can see that only the second column ('Sentiment') and the last column ('SentimentText') are needed to train our model. In addition, the label format expected by TensorFlow for y is a list of length n, where n is the number of classes in the model. In our case we have just two classes (positive and negative), so we still need to encode the 'Sentiment' column as a list of length 2 ([0,1] stands for negative, [1,0] stands for positive).

In [4]:

# list that stores the cleaned text of every Twitter feed
feed_list = []
# list that stores the encoded label of every record
label = []
# length (in words) of the longest feed
twitter_length = 0
# regular expression that matches any punctuation character (compiled once, outside the loop)
regex = re.compile('[%s]' % re.escape(string.punctuation))
# Iterate through the dataset to get the columns we want and encode the label column
for index, row in dataset.iterrows():
    text = row['SentimentText']
    # first convert all the letters to lower case
    text = text.lower()

    # remove possessive "'s" and apostrophes, then replace the remaining punctuation with spaces
    text = text.replace("\'s", "")
    text = text.replace("\'", "")
    clean = regex.sub(' ', text)
    # tokenize, then join the words of one feed with spaces and append the result to feed_list
    words = nltk.word_tokenize(clean)
    twitter_length = max(twitter_length,len(words))
    feed_list.append(' '.join(words))
    # encode the label: [0,1] for negative, [1,0] for positive
    if row['Sentiment'] == 0:
        label.append([0,1])
    else:
        label.append([1,0])
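
To see what the cleaning steps do to a single feed, here is a small standalone sketch. The helper names `clean_tweet` and `encode_label` and the sample sentence are made up for illustration, and plain whitespace splitting is used as a stand-in for `nltk.word_tokenize`, so results on unusual input may differ slightly from the loop above.

```python
import re
import string

PUNCTUATION = re.compile('[%s]' % re.escape(string.punctuation))

def clean_tweet(text):
    """Lowercase, drop possessive 's and apostrophes, replace punctuation with spaces."""
    text = text.lower()
    text = text.replace("'s", "").replace("'", "")
    text = PUNCTUATION.sub(' ', text)
    # Assumption: whitespace splitting stands in for nltk.word_tokenize here.
    return ' '.join(text.split())

def encode_label(sentiment):
    """Map the 0/1 sentiment flag to the two-element list used as the label."""
    return [0, 1] if sentiment == 0 else [1, 0]

cleaned = clean_tweet("I missed the New Moon trailer... it's SO sad!!!")
print(cleaned)
print(encode_label(0), encode_label(1))
```

Removing apostrophes before the punctuation pass keeps contractions such as "don't" as one token ("dont") instead of splitting them in two.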

## Model implementation

### Input Layer

Because a CNN model can only deal with high-dimensional numeric data, we need to transform each word into a vector, stack the vectors into a 3-D array, and store it in a tensor (the basic data structure in TensorFlow). As mentioned above, we use the Google pretrained model to get the word vectors. If a word does not exist in the pretrained model, we replace it with a fixed filler word. In addition, because the input must have the same size for every record, we need to pad every feed to the same size, the largest text length 'twitter_length' (the number we obtained during data preprocessing). Again we use a fixed filler word to pad the sentence (in this tutorial we use 'the'). We are then ready to implement the code that produces the input tensor for our model:

In [5]:
# Initiate the pretrained word vector model
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

In [6]:
# model is the pretrained model. We pass it as parameter so we just need to load the model once
# feeds is a list of twitter feeds we get from data loading part
def get_vectors(model,feeds):
    # high dimensional data which will be assigned to input tensor
    final_input = []
    for sentence in feeds:
        #the matrix of one certain feed
        mid_input = []
        i = 0
        for word in sentence.split(' '):
            i = i + 1
            if word in model.vocab:
                mid_input.append(model.word_vec(word))
            else:
                mid_input.append(model.word_vec('the'))
        #fill the feed to the largest length
        while i < twitter_length:
            mid_input.append(model.word_vec('the'))
            i = i+1
        final_input.append(mid_input)
    return final_input
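
The same lookup-and-pad idea can be tried without downloading the pretrained model. The sketch below is only an illustration: `toy_vectors` is a hypothetical 4-dimensional vocabulary standing in for the 300-dimensional Word2Vec vectors, and `embed_and_pad` is a made-up helper name.

```python
import numpy as np

# Hypothetical 4-dimensional embeddings standing in for the real word vectors.
toy_vectors = {
    'the':  np.array([0.0, 0.0, 0.0, 0.0]),
    'good': np.array([1.0, 0.5, 0.0, 0.2]),
    'day':  np.array([0.3, 0.1, 0.9, 0.4]),
}

def embed_and_pad(sentence, vectors, max_length, filler='the'):
    """Look up each word, fall back to `filler` for unknown words, pad to `max_length`."""
    rows = [vectors.get(word, vectors[filler]) for word in sentence.split(' ')]
    rows += [vectors[filler]] * (max_length - len(rows))
    return np.stack(rows)

batch = np.stack([embed_and_pad(s, toy_vectors, max_length=5)
                  for s in ['good day', 'what a good day']])
print(batch.shape)   # (number of feeds, max_length, vector size)
```

Every feed ends up with the same number of rows, which is what allows the whole batch to be stacked into a single array.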

### Hidden Layers

Now that we have (hopefully) all the knowledge and input data we need to feed our CNN model, we configure the hidden layers inside our network, including the number of hidden layers, the filter size, and the pooling size. Based on the features of our data and on experience, we will have three hidden layers in total: one convolutional layer, one pooling layer, and one fully connected layer. For every layer, we also need to initialize the weights of every neuron. The configuration code is below:

#### Convolutional & Pooling Layer
In the convolutional layer, we apply the convolution to our input data. TensorFlow does this by flattening the filter into a 2-D weight matrix, extracting each input patch as a flattened vector, and right-multiplying the filter matrix by each patch vector. In our case, we set the filter length to 3, because a 3-gram seems enough to capture the sentiment of a sentence. To simplify the model and speed up training, we set the filter width to 300, the full width of a word vector, so the filter in the convolutional layer will look like this:<img src = "http://www.wildml.com/wp-content/uploads/2015/11/Screen-Shot-2015-11-06-at-8.03.47-AM.png">
Note that there are three filter sizes in the image (2, 3, and 4), while in our case we only consider a filter length of 3. We use ReLU to calculate the output of each filter.
For the pooling layer, just like in the image above, we use a filter length of 'twitter_length-2', which covers the whole output of the convolutional layer. The number of filters can be defined by the user and determines how many filters are applied to the input data. The following is the code sample for the two layers:

In [7]:
# Layer configuration
# You can set filter_number to any number you want.
# The larger filter_number is, the more complex the model becomes, and the better it will fit the training data
filter_number = 300
filter_length = 3
filter_width = 300
conv_layer = {'filter_length' : filter_length ,'filter_width': filter_width,'out_channel':filter_number}
pool_layer = {'filter_length':filter_number}

In [None]:
"""
DO NOT EXECUTE
"""
# NOTE: This chunk is only for illustrating the use of functions. Please do not run it.

# input_data is the high dimensional data we get from the get_vectors method,
# expanded to size [batch, twitter_length, 300, 1]
# weight is the randomly generated filter of size [filter_length, filter_width, in_channel, out_channel]
# (conv2d internally flattens it to [filter_length*filter_width*in_channel, out_channel])
conv_mid = tf.nn.conv2d(input_data,weight,strides=[1, 1, 1, 1],padding='VALID')
# After the conv2d function we get a smaller tensor than input_data
# conv_mid is of size [batch, twitter_length-filter_length+1, 1, filter_number]
# Apply ReLU after adding the bias
# bias is initialized as a constant of size [filter_number]
conv_out = tf.nn.relu(tf.nn.bias_add(conv_mid, bias))
# The max_pool function will generate a tensor of size [batch, 1, 1, filter_number]
pool_mid = tf.nn.max_pool(conv_out,ksize=[1,twitter_length-filter_length+1, 1, 1],strides=[1, 1, 1, 1],padding='VALID')
# Because many dimensions are redundant (many 1s in the shape), we flatten the tensor before the output layer
pool_out = tf.reshape(pool_mid, [-1, filter_number])
# Finally we get a tensor of size [batch, filter_number] and are ready to go to the output layer

Details of relevant functions can be found below:

 [tf.nn.conv2d](https://www.tensorflow.org/api_docs/python/tf/nn/conv2d)
 
 [tf.nn.relu](https://www.tensorflow.org/api_docs/python/tf/nn/relu)
 
 [tf.nn.max_pool](https://www.tensorflow.org/api_docs/python/tf/nn/max_pool)
 
 [tf.reshape](https://www.tensorflow.org/api_docs/python/tf/reshape)
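
The shapes described above can be checked on a toy example without TensorFlow. The following NumPy sketch is an assumption-laden illustration, not the library's implementation: `conv_relu_maxpool` is a hypothetical helper, it handles one sentence at a time (no batch dimension), and the sizes (7 words, 4-dimensional vectors, 6 filters) are arbitrary.

```python
import numpy as np

def conv_relu_maxpool(sentence_matrix, filters, bias):
    """sentence_matrix: [length, width]; filters: [n_filters, filter_length, width]."""
    length = sentence_matrix.shape[0]
    n_filters, filter_length, _ = filters.shape
    positions = length - filter_length + 1
    conv = np.empty((positions, n_filters))
    for p in range(positions):
        window = sentence_matrix[p:p + filter_length]   # one n-gram of word vectors
        # Each filter spans the full vector width, so it yields one number per position.
        conv[p] = np.tensordot(filters, window, axes=([1, 2], [0, 1])) + bias
    activated = np.maximum(conv, 0.0)                   # ReLU
    return conv.shape, activated.max(axis=0)            # max over all positions

rng = np.random.default_rng(0)
sentence = rng.normal(size=(7, 4))     # 7 words, 4-dimensional toy vectors
filters = rng.normal(size=(6, 3, 4))   # 6 filters, each spanning 3 words
conv_shape, pooled = conv_relu_maxpool(sentence, filters, np.zeros(6))
print(conv_shape, pooled.shape)
```

Because the pooling window covers every position, each filter contributes exactly one number per sentence, which is why the pooled output has one entry per filter.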

#### Output Layer
The output layer is a fully connected layer that produces two numbers per record, one score for each class given the input data. We then use the argmax function to get the predicted label of the input data. We also calculate the accuracy and error in this layer for model development and evaluation. The code sample is below:

In [None]:
"""
DO NOT EXECUTE
"""
# NOTE: This chunk is only for illustrating the use of functions. Please do not run it.

# This function will calculate the final results
# weight is randomly initialized with size [filter_number,2]
# bias is initialized with size [2]
results = tf.nn.xw_plus_b(pool_out, weight, bias)
# use argmax to get the predicted label (index) of the input data
# For a batch of data with n records, predictions will be a list of n indices
predictions = tf.argmax(results, 1, name='predictions')

Details of relevant functions can be found below:

 [tf.nn.xw_plus_b](https://www.tensorflow.org/api_docs/python/tf/nn/xw_plus_b)
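
As a quick sanity check of the arithmetic, the same step can be written in a few lines of NumPy. This is a sketch with made-up numbers: `predict` is a hypothetical helper, and three pooled features per record are used instead of `filter_number`.

```python
import numpy as np

def predict(pooled, weight, bias):
    """pooled: [batch, n_features]; weight: [n_features, 2]; bias: [2]."""
    scores = pooled @ weight + bias            # x * w + b for every record
    return scores, np.argmax(scores, axis=1)   # index of the larger score

pooled = np.array([[1.0, 0.0, 2.0],
                   [0.0, 3.0, 0.0]])
weight = np.array([[0.5, -0.5],
                   [-1.0, 1.0],
                   [0.25, 0.0]])
bias = np.array([0.1, 0.1])

scores, labels = predict(pooled, weight, bias)
print(scores)
print(labels)
```

With the label encoding used in this tutorial, a predicted index of 0 corresponds to positive ([1,0]) and an index of 1 to negative ([0,1]).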

### Training and Evaluation

In this part, we are going to build the training, development, and evaluation functions. The purpose of the development step is to evaluate the model during training so that we can tune it better. Before moving on to training and evaluation, we need to define the tensors that are optimized over time. In our case, we use gradient descent on batches of data (with the Adam optimizer) to do the optimization.

In [None]:
"""
DO NOT EXECUTE
"""
# NOTE: This chunk is only for illustrating the use of functions. Please do not run it.


# Here we use softmax cross entropy to calculate the loss of the model
# This function measures the probability error of our classification task
# First we calculate the loss of every record compared to its label
losses = tf.nn.softmax_cross_entropy_with_logits(labels = label, logits = results) 
# Take the mean loss, which is used to calculate the gradient of every trainable variable (weights)
loss = tf.reduce_mean(losses)
# Get a boolean tensor of the same size as predictions; True indicates a correct prediction
correct_number = tf.equal(predictions, tf.argmax(label, 1))

# Calculate the number of records classified correctly / total number of records
# The booleans are cast to floats (1.0 or 0.0) and then averaged
accuracy = tf.reduce_mean(tf.cast(correct_number, 'float'))

# Define the training process
# Variable used to track the step, starting from 0
step = tf.Variable(0, name="global_step", trainable=False)

# In this case we use AdamOptimizer, an adaptive variant of batch gradient descent
# The learning rate 1e-3 matches the full code in the Code integration section
optimizer = tf.train.AdamOptimizer(1e-3)

# Calculate the gradients according to the average error
gradients = optimizer.compute_gradients(loss)

# Optimize all the trainable variables using the gradients calculated above
training = optimizer.apply_gradients(gradients, global_step=step)

Details of relevant functions can be found below:

[tf.nn.softmax_cross_entropy_with_logits](https://www.tensorflow.org/api_docs/python/tf/nn/softmax_cross_entropy_with_logits)

[tf.reduce_mean](https://www.tensorflow.org/api_docs/python/tf/reduce_mean)

[tf.train.AdamOptimizer](https://www.tensorflow.org/api_docs/python/tf/train/AdamOptimizer)
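
To see what the loss and accuracy tensors compute, here is a small NumPy sketch of the same quantities. The function names and the three toy records are assumptions made for illustration; the numerically stable softmax shown here is one common way to write it and is not claimed to be TensorFlow's internal implementation.

```python
import numpy as np

def softmax_cross_entropy(labels, logits):
    """Per-record cross entropy between one-hot `labels` and softmax(`logits`)."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(labels * log_probs).sum(axis=1)

def accuracy(labels, logits):
    """Fraction of records whose largest logit matches the one-hot label."""
    return float(np.mean(np.argmax(logits, axis=1) == np.argmax(labels, axis=1)))

labels = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
logits = np.array([[2.0, 0.0], [0.0, 0.0], [1.0, 0.0]])

losses = softmax_cross_entropy(labels, logits)   # one loss value per record
print(losses.mean(), accuracy(labels, logits))
```

A record with equal logits, like the second one, gets a loss of log 2, the value you would expect from a model that cannot tell the two classes apart.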

After defining all the tensors needed in the training and evaluation process, we continue with the training, development, and evaluation functions:

In [9]:
# x is a batch of input data, y is the encoded label
# loss and accu are the tensors that compute the loss and accuracy of the model
# training is the training tensor that defines the training behavior
# step is the tensor that records the current step
def train(x, y,loss,accu,training,step):
    # Define the input data
    feed_dict = {input_x: x,input_y: y}
    
    # Execute the training process with input data defined above and get the output
    _, step, l, a = sess.run([training, step, loss, accu],feed_dict)
    
    # print the output(Summary)
    print("step {}, loss {:g}, accuracy {:g}".format(step, l, a))

def development(x, y, loss,accu,step):
     # Define the development data set
    feed_dict = {input_x: x,input_y: y}
    
    # Calculate the loss and accuracy on development dataset
    s, l, a = sess.run([step,loss,accu],feed_dict)
    
    # Print the summaries
    print("step {}, loss {:g}, acc {:g}".format(s,l,a))

# Predictions is the tensor which defines the predict behavior of the model
# x is the raw text of feeds
# y is the label of feeds(optional)
# model is the pretrained word vector model
def evaluation(x, predictions, model, y = None):
    # Calculate the prediction of the evaluation dataset
    preds = np.array(predictions.eval(feed_dict={input_x: np.array(get_vectors(model,x))}))
    
    # print the formatted output
    if y is None:
        print(np.column_stack((np.array(x), preds)))
    else:
        print(np.column_stack((np.array(x), preds, np.argmax(y,1))))


### Code integration

Now that we have all the modules we need for our model, we are going to integrate them and run them in a single session. A few things should be noted:
1: Training on a large data set takes a lot of time, so here we use a batch size of 1000 from the original dataset and run 50 batches in total. For training we therefore feed 50*1000 records into our model. For further use or extension, you can choose a different batch size and randomly select different batches from the original data set.
2: We hold out another 5000 records, separate from the training data, as the development set to get fair statistics on accuracy (training : development = 10 : 1).
3: For evaluation data, you can create your own or randomly choose records from the original data. When you enter data manually, remember to follow the same format as in the sample. The output is a readable table in which the first column is the raw text, the second is the prediction, and the last column is the true label of that record (if applicable).
4: Inside the graph, we have to build all the trainable variables, such as weights and biases, as tensors. There are usually two ways to create tensors: tf.Variable(), used for trainable values, and tf.placeholder(), used for data that is fed in at run time.
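
The batch bookkeeping in notes 1 and 2 can be sketched in plain Python. `batch_ranges` is a hypothetical helper, not part of the session code; the development slice (50000 to 55000) is the one used in the session code of this section.

```python
def batch_ranges(batch_size, n_batches, start=0):
    """Yield (begin, end) index pairs for consecutive, non-overlapping batches."""
    for i in range(n_batches):
        begin = start + i * batch_size
        yield begin, begin + batch_size

train_ranges = list(batch_ranges(batch_size=1000, n_batches=50))
dev_range = (50000, 55000)   # slice used for the development set

print(train_ranges[0], train_ranges[-1], dev_range)
```

Each pair can be used directly as a slice, for example `feed_list[begin:end]` and `label[begin:end]`, and the last training batch ends where the development slice begins, so the two sets do not overlap.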

In [11]:
# Initialize the model
with tf.Graph().as_default():
    # Configure the model
    session_conf = tf.ConfigProto(allow_soft_placement=True,log_device_placement=False)
    sess = tf.Session(config=session_conf)
    
    # Begin the session
    with sess.as_default():
        # Define the input data
        input_x = tf.placeholder(tf.float32, [None, twitter_length,300], name='input_x')
        input_y = tf.placeholder(tf.float32, [None, 2], name='input_y')
        # Extend the input data because Tensorflow CNN model will only accept Rank 4 input data 
        input_expanded = tf.expand_dims(input_x, -1)
       
        # Convolutional Layer
        filter_shape = [conv_layer['filter_length'], conv_layer['filter_width'], 1, filter_number]
        conv_w = tf.Variable(tf.truncated_normal(filter_shape, stddev=0.1),name='conv_w')
        conv_b = tf.Variable(tf.constant(0.1, shape=[filter_number]),name='conv_b')
        conv_mid = tf.nn.conv2d(input_expanded,conv_w,strides=[1, 1, 1, 1],padding='VALID')
        conv_out = tf.nn.relu(tf.nn.bias_add(conv_mid, conv_b))
        
        # Pooling Layer
        pool_mid = tf.nn.max_pool(conv_out,ksize=[1, twitter_length - conv_layer['filter_length'] + 1, 1, 1],strides=[1, 1, 1, 1],padding='VALID',name='pool')
        pool_out = tf.reshape(pool_mid, [-1, pool_layer['filter_length']])
       
        # Output Layer
        out_w = tf.get_variable('out_w',shape=[filter_number, 2],initializer=tf.contrib.layers.xavier_initializer())
        out_b = tf.Variable(tf.constant(0.1, shape=[2]), name='out_b')
        outputs = tf.nn.xw_plus_b(pool_out, out_w , out_b, name='outputs')
        predictions = tf.argmax(outputs, 1, name='predictions')
       
        # Tensors that help in the training and evaluation process
        losses = tf.nn.softmax_cross_entropy_with_logits(labels = input_y, logits = outputs) # only named arguments are accepted
        loss = tf.reduce_mean(losses)
        correct = tf.equal(predictions, tf.argmax(input_y, 1))
        accuracy = tf.reduce_mean(tf.cast(correct, 'float'), name='accuracy')
        step = tf.Variable(0, name="step", trainable=False)
        optimizer = tf.train.AdamOptimizer(1e-3)
        gradients = optimizer.compute_gradients(loss)
        training = optimizer.apply_gradients(gradients, global_step=step)
       
        # Start model running
        sess.run(tf.global_variables_initializer())
        echo = 50
        location = 0
        while echo > 0:
            end = location + 1000
            x_input = np.array(get_vectors(model,feed_list[location:end]))
            train(x_input, label[location:end],loss,accuracy,training,step)
            current_step = tf.train.global_step(sess,step)
            if current_step % 10 == 0:
                print("\nEvaluation:")
                development(np.array(get_vectors(model,feed_list[50000:55000])), label[50000:55000], loss, accuracy,step)
                print("")
            echo -= 1
            location = location + 1000
        evaluation(feed_list[:50],predictions,model,label[:50])
        
        # When you want to input your own data
        # evaluation(['i really like that cat'],predictions,model)

step 1, loss 0.672146, accuracy 0.64
step 2, loss 0.687244, accuracy 0.547
step 3, loss 0.591995, accuracy 0.716
step 4, loss 0.546992, accuracy 0.77
step 5, loss 0.742128, accuracy 0.665
step 6, loss 0.949542, accuracy 0.531
step 7, loss 0.899836, accuracy 0.433
step 8, loss 0.683185, accuracy 0.568
step 9, loss 0.68601, accuracy 0.563
step 10, loss 0.632335, accuracy 0.681

Evaluation:
step 10, loss 0.758563, acc 0.5758

step 11, loss 0.784145, accuracy 0.598
step 12, loss 0.778575, accuracy 0.569
step 13, loss 0.839259, accuracy 0.474
step 14, loss 0.677135, accuracy 0.563
step 15, loss 0.643198, accuracy 0.638
step 16, loss 0.646102, accuracy 0.62
step 17, loss 0.627565, accuracy 0.642
step 18, loss 0.711754, accuracy 0.583
step 19, loss 0.796025, accuracy 0.518
step 20, loss 0.762611, accuracy 0.544

Evaluation:
step 20, loss 0.676938, acc 0.6002

step 21, loss 0.684499, accuracy 0.596
step 22, loss 0.634603, accuracy 0.639
step 23, loss 0.613054, accuracy 0.67
step 24, loss 0.588

### Conclusion

We have now completed our model. However, looking at the development and evaluation summaries, there are still some problems:

The accuracy appears to improve over the batches, although it fluctuates considerably from batch to batch (the development accuracy moves from 0.5758 at step 10 to 0.6002 at step 20). Overfitting can still become a problem as training continues, especially when the model is too complex. There are many methods for avoiding overfitting. Dropout, which is widely used in CNNs, randomly drops a portion of the neurons during training, which makes the model effectively simpler and helps to avoid overfitting. Sample code for dropout inside the hidden layers can be found here: [code with dropout schema](https://www.tensorflow.org/tutorials/layers).

Beyond addressing overfitting, we can also improve model performance through random selection of training data or cross-validation, although these may place higher demands on hardware.

From the evaluation output, we can see that the feeds classified incorrectly usually contain words that do not exist in our pretrained model. It may therefore be better to use a non-static model (one that keeps training the word vectors over time), because the words that appear in Twitter feeds are often informal.

Even though this model has some limitations, I learned a lot from the research process, including a very efficient way to handle high-dimensional data. In the future we may extend our input to images, audio, and even video. Through CNNs we can find more possibilities for data science applications in our daily life.
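
For intuition, dropout can be sketched in a few lines of NumPy. This is a minimal illustration of the common "inverted dropout" formulation, assuming a keep probability of 0.5 and a toy array of ones in place of the pooled features; the helper name `dropout` and its signature are choices made here, not the API of any library.

```python
import numpy as np

def dropout(activations, keep_prob, rng):
    """Inverted dropout: zero each unit with probability 1 - keep_prob, rescale the rest."""
    mask = rng.random(activations.shape) < keep_prob
    # Dividing by keep_prob keeps the expected value of each unit unchanged.
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
pooled = np.ones((2, 8))                 # stand-in for the pooled features
dropped = dropout(pooled, keep_prob=0.5, rng=rng)
print(dropped)
```

Dropout is applied only during training; at evaluation time the activations are passed through unchanged, which corresponds to `keep_prob=1.0` in this sketch.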
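
One simple way to select training data at random is to draw record indices without replacement and cut them into batches. The sketch below uses only the standard library; `random_batches`, the seed, and the sizes are assumptions for illustration and are not part of the session code above.

```python
import random

def random_batches(n_records, batch_size, n_batches, seed=0):
    """Return `n_batches` lists of distinct record indices drawn without replacement."""
    rng = random.Random(seed)            # fixed seed so the selection is repeatable
    chosen = rng.sample(range(n_records), batch_size * n_batches)
    return [chosen[i * batch_size:(i + 1) * batch_size] for i in range(n_batches)]

batches = random_batches(n_records=50000, batch_size=1000, n_batches=5)
print(len(batches), len(batches[0]))
```

A batch of indices can then be turned into model input with a list comprehension such as `[feed_list[i] for i in batch]`, with the labels selected the same way.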
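
A rough way to quantify this effect is to measure how many word occurrences are missing from the vocabulary. The sketch below is an illustration under stated assumptions: `oov_rate` is a hypothetical helper, and `toy_vocabulary` is a made-up five-word set standing in for the vocabulary of the pretrained model.

```python
def oov_rate(feeds, vocabulary):
    """Fraction of word occurrences in `feeds` that are missing from `vocabulary`."""
    total = 0
    missing = 0
    for feed in feeds:
        for word in feed.split():
            total += 1
            if word not in vocabulary:
                missing += 1
    return missing / total if total else 0.0

toy_vocabulary = {'i', 'really', 'like', 'that', 'cat'}   # hypothetical vocabulary
feeds = ['i really like that cat', 'omg i luv that cat']
rate = oov_rate(feeds, toy_vocabulary)
print(rate)
```

Comparing this rate between correctly and incorrectly classified feeds would be one way to test the observation above on real data.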

### References

(1) Shen, Y., He, X., Gao, J., Deng, L., & Mesnil, G. (2014). Learning Semantic Representations Using Convolutional Neural Networks for Web Search. In Proceedings of WWW 2014.

(2) Kim, Y. (2014, September 03). Convolutional Neural Networks for Sentence Classification. Retrieved March 31, 2018, from https://arxiv.org/abs/1408.5882 (This tutorial is essentially a code implementation of this research paper.)

(3) Britz, D. (2016, February 05). Implementing a CNN for Text Classification in TensorFlow. Retrieved March 31, 2018, from http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/