# Twitter Sentiment Analysis

### Methodology
I first applied LSTM and BiLSTM, but both overfit badly. After some parameter tuning brought no significant improvement, I concluded that proper data preprocessing was critical.

I have implemented three methods to preprocess the data:
* Word2vec
* Embedding from GloVe
* Embedding on the given dataset

The last method, an embedding trained on the given dataset, improved the accuracy the most.

I then implemented LSTM and BiLSTM models to predict sentiment from the preprocessed tweets.

### LSTM vs BiLSTM Result
I used accuracy as the metric because the numbers of positive and negative tweets are roughly equal. Without further instruction, I assume we want overall classification accuracy rather than precision and recall.
After proper data preprocessing and embedding, the accuracies of the bi-directional LSTM and the vanilla LSTM are quite close.
* With word2vec, the accuracies for LSTM and BiLSTM are 0.6562 and 0.6666, respectively.
* With GloVe embedding, the accuracies for LSTM and BiLSTM are 0.5614 and 0.6024, respectively.
* With an embedding trained on the given dataset, the accuracies for LSTM and BiLSTM are 0.7671 and 0.7669, respectively.

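The difference is probably small because we predict a single sentiment label for the whole tweet: by its final step, a unidirectional LSTM has already read the entire text, so the reverse pass adds little new information.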
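
To make the metric choice above concrete, here is a minimal sketch of how accuracy, precision, and recall are computed for binary labels. The function names and the toy labels are illustrative assumptions and do not come from `preprocess_data` or from the dataset.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(y_true, y_pred):
    """Fraction of all predictions that are correct."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)


def precision(y_true, y_pred):
    """Fraction of predicted positives that are truly positive."""
    tp, _, fp, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(y_true, y_pred):
    """Fraction of true positives that were predicted positive."""
    tp, _, _, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Toy labels, made up for illustration only.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
print(accuracy(y_true, y_pred), precision(y_true, y_pred), recall(y_true, y_pred))
```

With balanced classes, as in this toy example, the three numbers tend to stay close to each other, which is why accuracy alone is a reasonable summary here.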

### LSTM vs RNN
The advantage of using an LSTM over a vanilla RNN is that an LSTM can retain long-term information in its memory cell. While both models train on sequential data, a vanilla RNN easily loses long-term information because of vanishing gradients. The sigmoid-activated forget, input, and output gates control what is kept in, written to, and read from the cell state. When the forget gate stays close to 1, gradients flow along the cell state largely undiminished, so the relationship between two highly correlated but distant words is not washed out by the small gradients in between. With these gates, an LSTM typically achieves higher prediction accuracy than a vanilla RNN in text analysis.
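
The vanishing-gradient argument can be illustrated with a small numeric sketch. It compares how much gradient survives 50 time steps in a scalar tanh RNN with how much survives along an LSTM cell state. The weight, pre-activation, and forget-gate values are arbitrary assumptions chosen for illustration, and the LSTM part only follows the direct path through the cell state, treating the gates as constants.

```python
import math


def rnn_gradient_factor(weight, pre_activations):
    """Product of dh_t/dh_{t-1} for a scalar RNN h_t = tanh(weight * h_{t-1} + x_t).

    Each step contributes weight * (1 - tanh(a_t)^2), where a_t is the
    pre-activation at that step.
    """
    factor = 1.0
    for a in pre_activations:
        factor *= weight * (1.0 - math.tanh(a) ** 2)
    return factor


def lstm_cell_gradient_factor(forget_gates):
    """Product of dc_t/dc_{t-1} along the cell state c_t = f_t * c_{t-1} + i_t * g_t.

    Only the direct path through the cell state is considered, so each
    step contributes just the forget gate value f_t.
    """
    factor = 1.0
    for f in forget_gates:
        factor *= f
    return factor


steps = 50
# Assumed values: recurrent weight 0.9, pre-activation 1.0 at every step.
rnn_factor = rnn_gradient_factor(weight=0.9, pre_activations=[1.0] * steps)
# Assumed value: forget gate held at 0.99 at every step.
lstm_factor = lstm_cell_gradient_factor([0.99] * steps)

print(f"vanilla RNN gradient factor after {steps} steps: {rnn_factor:.3e}")
print(f"LSTM cell-state gradient factor after {steps} steps: {lstm_factor:.3f}")
```

Under these assumptions the RNN factor shrinks by many orders of magnitude, while the cell-state factor stays of order one.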

### BiLSTM vs LSTM 
The advantage of using a bi-directional LSTM is that each position also learns from future tokens, so it sees the whole context. A BiLSTM combines the outputs of two LSTMs, one reading the sequence forward and one reading it in reverse, to make predictions. It therefore preserves information from both the past and the future, while a vanilla LSTM predicts based only on the past.
BiLSTM is applicable to this dataset because we are not processing the tweets online as a stream: the whole content of each tweet is available before training.
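
The forward and reverse passes described above can be sketched in a few lines of NumPy. To keep the example short, a plain tanh recurrent cell stands in for the LSTM cell, and all weights and inputs are random placeholders; only the wiring (two passes, concatenated per position) is the point.

```python
import numpy as np


def run_rnn(inputs, w_in, w_rec):
    """Run a simple tanh recurrent cell over `inputs`; return the state at every step."""
    state = np.zeros(w_rec.shape[0])
    states = []
    for x in inputs:
        state = np.tanh(w_in @ x + w_rec @ state)
        states.append(state)
    return np.stack(states)


def run_bidirectional(inputs, fwd_weights, bwd_weights):
    """Concatenate a forward pass with a reverse pass, aligned by position."""
    forward = run_rnn(inputs, *fwd_weights)
    # Read the sequence in reverse, then flip the states back to input order.
    backward = run_rnn(inputs[::-1], *bwd_weights)[::-1]
    return np.concatenate([forward, backward], axis=1)


rng = np.random.default_rng(0)
input_dim, hidden_dim, seq_len = 4, 3, 5
# Random placeholder weights, scaled down to avoid saturating tanh.
fwd_weights = (0.5 * rng.normal(size=(hidden_dim, input_dim)),
               0.5 * rng.normal(size=(hidden_dim, hidden_dim)))
bwd_weights = (0.5 * rng.normal(size=(hidden_dim, input_dim)),
               0.5 * rng.normal(size=(hidden_dim, hidden_dim)))

tweet = rng.normal(size=(seq_len, input_dim))
altered = tweet.copy()
altered[-1] += 1.0  # change only the last token

out_a = run_bidirectional(tweet, fwd_weights, bwd_weights)
out_b = run_bidirectional(altered, fwd_weights, bwd_weights)

# Forward half at the first position cannot see the last token.
print("forward half unchanged:", np.array_equal(out_a[0, :hidden_dim], out_b[0, :hidden_dim]))
# Backward half at the first position has already read the last token.
print("backward half changed:", not np.array_equal(out_a[0, hidden_dim:], out_b[0, hidden_dim:]))
```

Changing only the last token leaves the forward state at the first position untouched but alters the backward state there, which is the sense in which the bi-directional model uses future context.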

### Future Work
Further work to improve the accuracy includes training for more epochs, fine-tuning the hyperparameters, and examining the features more closely.

In [1]:
from preprocess_data import *
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

2018-11-12 11:50:17,044 : INFO : 'pattern' package not found; tag filters are not available for English
Using TensorFlow backend.


### Load And Process Data

In [2]:
df = pd.read_csv('data/sentiment_data.csv', encoding='latin1')
processed_data = preprocess_data(df)
processed_data.head()

progress-bar: 100%|██████████| 500000/500000 [00:00<00:00, 610496.70it/s]
2018-11-12 11:50:20,348 : INFO : Encoded Labels
progress-bar: 100%|██████████| 500000/500000 [02:42<00:00, 3067.75it/s]
2018-11-12 11:53:03,363 : INFO : Cleaned Tweets


| Unnamed: 0 | labels | text | sentiment | tokens |
|---|---|---|---|---|
| 0 | 0 | @mileycyrus http://twitpic.com/4fzo7 - noooo d... | 0 | [mileycyrus, 4fzo7, noooo, dont, dye, pretti, ... |
| 1 | 4 | @accentuations ty m'dear! | 1 | [accentu, mdear] |
| 2 | 0 | @Shauna_nkotb_ca I feel so bad for the Oz girl... | 0 | [shaunankotbca, feel, bad, girlsthat, realli, ... |
| 3 | 0 | I'm going to run away from home... I'm planing... | 0 | [go, run, away, home, plane, escap, cant, wait... |
| 4 | 4 | @seansparks hah that's awesome I had one of t... | 1 | [seanspark, hah, that, awesom, one, son, hehe] |


### Construct Training and Testing Sets

In [3]:
x_train, x_test, y_train, y_test = train_test_split(np.array(processed_data['tokens']),
                                                    np.array(processed_data['sentiment']), test_size=0.3)

### Training
We will use three encoding methods: (1) self-trained word2vec, (2) GloVe embedding, and (3) an embedding layer. For each method, we will train a single-layer LSTM and a bidirectional LSTM separately.

#### Self Trained word2vec

In [4]:
# train word2vec
tweet_w2v = word2vec_build(x_train, dim = 200)
# get word2vec representation for x_train and x_test
train_vecs = word2vec_generate(x_train, tweet_w2v, dim = 200)
test_vecs = word2vec_generate(x_test, tweet_w2v, dim = 200)

100%|██████████| 350000/350000 [00:00<00:00, 2300434.86it/s]
2018-11-12 11:53:03,657 : INFO : collecting all words and their counts
2018-11-12 11:53:03,657 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2018-11-12 11:53:03,690 : INFO : PROGRESS: at sentence #10000, processed 74016 words, keeping 17205 word types
2018-11-12 11:53:03,720 : INFO : PROGRESS: at sentence #20000, processed 147965 words, keeping 28629 word types
2018-11-12 11:53:03,752 : INFO : PROGRESS: at sentence #30000, processed 220670 words, keeping 38522 word types
2018-11-12 11:53:03,783 : INFO : PROGRESS: at sentence #40000, processed 294211 words, keeping 47802 word types
2018-11-12 11:53:03,815 : INFO : PROGRESS: at sentence #50000, processed 367756 words, keeping 56723 word types
2018-11-12 11:53:03,851 : INFO : PROGRESS: at sentence #60000, processed 441403 words, keeping 65215 word types
2018-11-12 11:53:03,879 : INFO : PROGRESS: at sentence #70000, processed 514603 words, keeping 735

In [5]:
single_layer_lstm_w2v(train_vecs, y_train, test_vecs, y_test)

Epoch 1/1


2018-11-12 12:13:54,739 : INFO : Test Accuracy 68.11800000063577%


<keras.engine.sequential.Sequential at 0x7fc8ba97d0b8>

In [6]:
bidirectional_lstm_w2v(train_vecs, y_train, test_vecs, y_test)

Epoch 1/1


2018-11-12 12:40:16,428 : INFO : Test Accuracy 69.5533333346049%


<keras.engine.sequential.Sequential at 0x7fc82478d7b8>

#### GloVe Embedding

In [12]:
glove2word2vec('data/glove.6B.100d.txt', 'data/gensim_glove_vectors.txt')
glove_embedding = KeyedVectors.load_word2vec_format('data/gensim_glove_vectors.txt')
train_vecs = word2vec_generate(x_train, glove_embedding, dim = 100)
test_vecs = word2vec_generate(x_test, glove_embedding, dim = 100)

2018-11-12 12:52:50,012 : INFO : converting 400000 vectors from data/glove.6B.100d.txt to data/gensim_glove_vectors.txt
2018-11-12 12:52:51,556 : INFO : loading projection weights from data/gensim_glove_vectors.txt
2018-11-12 12:53:32,059 : INFO : loaded (400000, 100) matrix from data/gensim_glove_vectors.txt
100%|██████████| 350000/350000 [00:18<00:00, 18951.02it/s]
2018-11-12 12:53:52,282 : INFO : word2vec generated
100%|██████████| 150000/150000 [00:07<00:00, 19266.74it/s]
2018-11-12 12:54:00,705 : INFO : word2vec generated
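
One thing that may be worth checking for the pretrained vectors is vocabulary coverage. The tokens in this notebook are stemmed (for example `pretti`, `awesom`, `escap` in the preview), and stems like these may be missing from a vocabulary built on unstemmed text. This is a hypothesis rather than something measured here. The sketch below uses a hypothetical helper, `vocabulary_coverage`, and a made-up toy vocabulary; with gensim, the `in` operator on the loaded `KeyedVectors` object should allow the same check against `glove_embedding`.

```python
def vocabulary_coverage(token_lists, vocabulary):
    """Return the fraction of token occurrences that are found in `vocabulary`."""
    total = 0
    known = 0
    for tokens in token_lists:
        for token in tokens:
            total += 1
            if token in vocabulary:
                known += 1
    return known / total if total else 0.0


# Tokens copied from the preview table; the vocabulary is a made-up stand-in
# for a pretrained embedding's word list.
stemmed_tweets = [
    ["accentu", "mdear"],
    ["go", "run", "away", "home", "plane", "escap", "cant", "wait"],
]
toy_vocabulary = {"go", "run", "away", "home", "plane", "wait"}

coverage = vocabulary_coverage(stemmed_tweets, toy_vocabulary)
print(f"coverage: {coverage:.0%}")
```

A low coverage figure on the real data would suggest feeding unstemmed tokens to the pretrained embedding before drawing conclusions about GloVe itself.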


In [13]:
single_layer_lstm_w2v(train_vecs, y_train, test_vecs, y_test)

Epoch 1/1


2018-11-12 13:04:38,774 : INFO : Test Accuracy 57.8873333346049%


<keras.engine.sequential.Sequential at 0x7fc7cfb8bda0>

In [14]:
bidirectional_lstm_w2v(train_vecs, y_train, test_vecs, y_test)

Epoch 1/1


2018-11-12 13:18:25,866 : INFO : Test Accuracy 62.37066666793824%


<keras.engine.sequential.Sequential at 0x7fc7b5ed7eb8>

#### Add embedding layer

In [15]:
x_train_pad = create_sequence(x_train)
x_test_pad = create_sequence(x_test)
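
The source of `create_sequence` is not shown in this notebook. As a rough sketch of what such a step usually involves, the example below maps tokens to integer ids and pads every tweet to a fixed length. The helper names, the reserved ids, and the choice to pad at the end are all assumptions, not the author's implementation.

```python
def build_vocabulary(token_lists):
    """Assign an integer id to every token; 0 is reserved for padding, 1 for unknown."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for tokens in token_lists:
        for token in tokens:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def encode_and_pad(token_lists, vocab, max_len):
    """Convert token lists to id lists, truncated or right-padded to `max_len`."""
    padded = []
    for tokens in token_lists:
        ids = [vocab.get(token, vocab["<unk>"]) for token in tokens][:max_len]
        padded.append(ids + [vocab["<pad>"]] * (max_len - len(ids)))
    return padded


# Toy data for illustration only.
train_tokens = [["go", "run", "away"], ["hah", "awesom"]]
test_tokens = [["go", "home"]]

vocab = build_vocabulary(train_tokens)          # built from training data only
train_pad = encode_and_pad(train_tokens, vocab, max_len=4)
test_pad = encode_and_pad(test_tokens, vocab, max_len=4)
print(train_pad)
print(test_pad)
```

Building the vocabulary once from the training tokens and reusing it for the test tokens keeps the ids consistent between the two splits, which an embedding layer relies on.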

In [16]:
single_layer_lstm_embedding(x_train_pad, y_train, x_test_pad, y_test)

Epoch 1/1


2018-11-12 13:29:30,527 : INFO : Test Accuracy 54.316000000635775%


<keras.engine.sequential.Sequential at 0x7fc83a19e780>

In [17]:
bidirectional_lstm_embedding(x_train_pad, y_train, x_test_pad, y_test)

Epoch 1/1


2018-11-12 13:43:46,946 : INFO : Test Accuracy 54.413999999364215%


<keras.engine.sequential.Sequential at 0x7fc836eda2b0>