#Data & minimal pre-processing

**Data description** : Sentiment analysis on tweets during first GOP debate 2016

**Data provided by** : [Data for Everyone](https://www.figure-eight.com/data-for-everyone/)

**Dataset Link** : [Here](https://drive.google.com/open?id=19aQHO2TImE0fdvk_m1G0cBvS1lJC-jNc)

**Why not whole twitter data**? 
Because political debates have polarised and more strictly classifiable sentiments, the property of which can help in bettter training.

(Machine learning is more about intuition than code)

**Description** : According to original dataset provider 

```
We looked through tens of thousands of tweets about the early August GOP debate in Ohio and asked contributors to do both sentiment analysis and data categorization. Contributors were asked if the tweet was relevant, which candidate was mentioned, what subject was mentioned, and then what the sentiment was for a given tweet. We've removed the non-relevant messages from the uploaded dataset.
```



In [1]:
from google.colab import drive
drive.mount('/content/gdrive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=email%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdocs.test%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive.photos.readonly%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fpeopleapi.readonly&response_type=code

Enter your authorization code:
··········
Mounted at /content/gdrive


In [2]:
#making all necesary imports

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

from sklearn.feature_extraction.text import CountVectorizer
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM, SpatialDropout1D
from sklearn.model_selection import train_test_split
from keras.utils.np_utils import to_categorical
import re

Using TensorFlow backend.


In [0]:
data = pd.read_csv('/content/gdrive/My Drive/Colab Data/Sentiment.csv')
# Keeping only the neccessary columns
data = data[['text','sentiment']]

Next, I am dropping the 'Neutral' sentiments as my goal was to only differentiate positive and negative tweets. 

After that, I am filtering the tweets so only valid texts and words remain. Then, I define the number of max features as 2000 and use Tokenizer to vectorize and convert text into Sequences so the Network can deal with it as input.

In [4]:
data = data[data.sentiment != "Neutral"]
data['text'] = data['text'].apply(lambda x: x.lower())
data['text'] = data['text'].apply((lambda x: re.sub('[^a-zA-z0-9\s]','',x)))

print(data[ data['sentiment'] == 'Positive'].size)
print(data[ data['sentiment'] == 'Negative'].size)

for idx,row in data.iterrows():
    row[0] = row[0].replace('rt',' ')
    
max_fatures = 2000
tokenizer = Tokenizer(num_words=max_fatures, split=' ')
tokenizer.fit_on_texts(data['text'].values)
X = tokenizer.texts_to_sequences(data['text'].values)
X = pad_sequences(X)

4472
16986


#Defining our LSTM RNN Model

Next, I compose the LSTM Network with softmax as an activation

In [5]:
embed_dim = 128
lstm_out = 196

model = Sequential()
model.add(Embedding(max_fatures, embed_dim,input_length = X.shape[1]))
model.add(SpatialDropout1D(0.4))
model.add(LSTM(lstm_out, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(2,activation='softmax'))
model.compile(loss = 'categorical_crossentropy', optimizer='adam',metrics = ['accuracy'])
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 28, 128)           256000    
_________________________________________________________________
spatial_dropout1d_1 (Spatial (None, 28, 128)           0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 196)               254800    
_________________________________________________________________
dense_1 (Dense)              (None, 2)                 394       
Total params: 511,194
Trainable params: 511,194
Non-trainable params: 0
_________________________________________________________________
None


Note that **embed_dim**, l**stm_out**, **batch_size**, **droupout_x** variables are hyperparameters, their values are somehow intuitive, can be and must be played with in order to achieve good results. 

#Declaring the train & test 

In [6]:
Y = pd.get_dummies(data['sentiment']).values
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.33, random_state = 42)
print(X_train.shape,Y_train.shape)
print(X_test.shape,Y_test.shape)

(7188, 28) (7188, 2)
(3541, 28) (3541, 2)


In [7]:
batch_size = 32
model.fit(X_train, Y_train, epochs = 25, batch_size=batch_size, verbose = 2)

Epoch 1/25
 - 18s - loss: 0.4384 - acc: 0.8171
Epoch 2/25
 - 16s - loss: 0.3234 - acc: 0.8645
Epoch 3/25
 - 16s - loss: 0.2775 - acc: 0.8840
Epoch 4/25
 - 16s - loss: 0.2524 - acc: 0.8952
Epoch 5/25
 - 16s - loss: 0.2259 - acc: 0.9094
Epoch 6/25
 - 16s - loss: 0.2079 - acc: 0.9167
Epoch 7/25
 - 16s - loss: 0.1939 - acc: 0.9233
Epoch 8/25
 - 16s - loss: 0.1723 - acc: 0.9327
Epoch 9/25
 - 16s - loss: 0.1517 - acc: 0.9391
Epoch 10/25
 - 16s - loss: 0.1440 - acc: 0.9406
Epoch 11/25
 - 16s - loss: 0.1362 - acc: 0.9427
Epoch 12/25
 - 16s - loss: 0.1225 - acc: 0.9498
Epoch 13/25
 - 16s - loss: 0.1230 - acc: 0.9517
Epoch 14/25
 - 16s - loss: 0.1098 - acc: 0.9538
Epoch 15/25
 - 16s - loss: 0.1118 - acc: 0.9540
Epoch 16/25
 - 16s - loss: 0.1011 - acc: 0.9610
Epoch 17/25
 - 16s - loss: 0.0971 - acc: 0.9602
Epoch 18/25
 - 16s - loss: 0.1018 - acc: 0.9592
Epoch 19/25
 - 16s - loss: 0.0934 - acc: 0.9622
Epoch 20/25
 - 16s - loss: 0.0873 - acc: 0.9630
Epoch 21/25
 - 16s - loss: 0.0880 - acc: 0.9633
E

<keras.callbacks.History at 0x7fecd89c44a8>

Extracting a validation set, and measuring score and accuracy.

In [8]:
validation_size = 1500

X_validate = X_test[-validation_size:]
Y_validate = Y_test[-validation_size:]
X_test = X_test[:-validation_size]
Y_test = Y_test[:-validation_size]
score,acc = model.evaluate(X_test, Y_test, verbose = 2, batch_size = batch_size)
print("score: %.2f" % (score))
print("acc: %.2f" % (acc))

score: 0.74
acc: 0.83


EDIT : After playing around with the model for a while, It is clear that finding negative tweets goes very well for the Network but deciding whether is positive is not up to the mark really. 

It is mainly because he positive training set is dramatically smaller than the negative, hence the "bad" results for positive tweet 

In [9]:
pos_cnt, neg_cnt, pos_correct, neg_correct = 0, 0, 0, 0
for x in range(len(X_validate)):
    
    result = model.predict(X_validate[x].reshape(1,X_test.shape[1]),batch_size=1,verbose = 2)[0]
   
    if np.argmax(result) == np.argmax(Y_validate[x]):
        if np.argmax(Y_validate[x]) == 0:
            neg_correct += 1
        else:
            pos_correct += 1
       
    if np.argmax(Y_validate[x]) == 0:
        neg_cnt += 1
    else:
        pos_cnt += 1



print("pos_acc", pos_correct/pos_cnt*100, "%")
print("neg_acc", neg_correct/neg_cnt*100, "%")

pos_acc 57.60517799352751 %
neg_acc 89.50461796809404 %


As is evident from the above result, my model can very nicely classify a negative tweet but has problems with positive tweets after testing and cross verifying my model over the validation set

To improve, we need 


*   More Positive tweet samples
*   Some more GPU to tackle that data



Trying out on custom tweets


In [19]:
twt = ['what the fuck']
#vectorizing the tweet by the pre-fitted tokenizer instance
twt = tokenizer.texts_to_sequences(twt)
#padding the tweet to have exactly the same shape as `embedding_2` input
twt = pad_sequences(twt, maxlen=28, dtype='int32', value=0)
print(twt)
sentiment = model.predict(twt,batch_size=1,verbose = 2)[0]
if(np.argmax(sentiment) == 0):
    print("negative")
elif (np.argmax(sentiment) == 1):
    print("positive")

[[  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0  48   1 543]]
negative


#END NOTES :-
Don't use my data. I used it to do things fast and easy. It is very imbalanced.

Use your own twitter data that you mailed me. Use my LSTM architecture and my methodology. It's industry standard and falls into the category of Deep learning.

Regression for NLP is outdated and not standard