# Sentiment Analysis Using RNN

We use a Sequential LSTM model to build a supervised learning approach for predicting the sentiment of an article. This notebook was adapted from https://www.kaggle.com/ngyptr/lstm-sentiment-analysis-keras.

#### Data and Packages Importing

In [1]:
import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM, SpatialDropout1D
from sklearn.model_selection import train_test_split
from keras.utils.np_utils import to_categorical
import re

training_data_folder = "./Sentiment Training Data/"

Using TensorFlow backend.


#### Initializing the Word Dictionaries
We use pandas to load the word-dictionary corpus from NTU and convert every word to lowercase.

In [238]:
vocabulary = pd.read_csv(training_data_folder + "vocabulary.csv")
vocabulary["Word"] = vocabulary["Word"].str.lower()

# Concatenate all words into one space-separated string.
vocabulary_string = " ".join(vocabulary["Word"])

# Length to which every token sequence is padded.
max_features = 2000

#### Tokenizer
We tokenize the word dictionary in order to train the RNN.

In [239]:
tokenizer = Tokenizer(num_words=len(vocabulary["Word"]), split=" ", char_level=False)
tokenizer.fit_on_texts(vocabulary["Word"].values)

In [240]:
X = tokenizer.texts_to_sequences(vocabulary["Word"].values)

# Sequence lengths before padding (kept for inspection).
temp = [len(seq) for seq in X]
X = pad_sequences(X, maxlen = max_features)
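
To make the tokenizing and padding steps concrete, here is a minimal pure-Python sketch of what word-index tokenization and left-padding do. It is an illustration only, not Keras's implementation: `build_word_index`, `to_sequences`, and `pad_left` are hypothetical helper names, the toy words are made up, and ties between equally frequent words are broken alphabetically just to keep the result deterministic.

```python
from collections import Counter

def build_word_index(texts):
    """Assign integer ids starting at 1, most frequent word first (0 is left for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by descending count, then alphabetically so the toy result is deterministic.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: idx for idx, word in enumerate(ordered, start=1)}

def to_sequences(texts, word_index):
    """Replace each known word with its id; unknown words are dropped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

def pad_left(sequences, maxlen):
    """Left-pad with zeros and keep only the last maxlen ids of longer sequences."""
    return [[0] * (maxlen - len(seq[-maxlen:])) + seq[-maxlen:] for seq in sequences]

toy_words = ["good", "bad", "good", "awful"]  # made-up vocabulary
word_index = build_word_index(toy_words)
padded = pad_left(to_sequences(["good", "awful bad"], word_index), maxlen=4)
print(word_index)
print(padded)
```

Because every vocabulary entry here is a single word, each training sequence contains one id preceded by zeros, which is also what the padded matrix `X` looks like when the inputs are individual words.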

In [241]:
Y = []

for i, row in vocabulary.iterrows():
    y = row["Sentiment"]
    Y.append(y)

#### Building the LSTM Model

In [242]:
embed_dim = 128
lstm_out = 196
# Embedding input size; X.shape[0] is the number of rows in X (one per vocabulary entry).
max_features = X.shape[0]

model = Sequential()
model.add(Embedding(max_features, embed_dim,input_length = X.shape[1]))
model.add(SpatialDropout1D(0.4))
model.add(LSTM(lstm_out, dropout=0.5, recurrent_dropout=0.5))
model.add(Dense(2,activation='softmax'))
model.compile(loss = 'categorical_crossentropy', optimizer='adam',metrics = ['accuracy'])
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_18 (Embedding)     (None, 2000, 128)         2433280   
_________________________________________________________________
spatial_dropout1d_18 (Spatia (None, 2000, 128)         0         
_________________________________________________________________
lstm_18 (LSTM)               (None, 196)               254800    
_________________________________________________________________
dense_18 (Dense)             (None, 2)                 394       
Total params: 2,688,474
Trainable params: 2,688,474
Non-trainable params: 0
_________________________________________________________________
None
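
The parameter counts in the summary can be checked by hand. The sketch below assumes the standard formulas for these layer types: an embedding table of rows × dimension, four LSTM gates that each have input weights, recurrent weights, and a bias vector, and a dense layer with a bias. The row count of 19010 is inferred from the summary (2,433,280 / 128) rather than read from the data.

```python
embed_dim = 128
lstm_out = 196
num_classes = 2
vocab_rows = 19010  # implied by 2,433,280 / 128 in the summary

embedding_params = vocab_rows * embed_dim
# Four gates, each with input weights, recurrent weights, and one bias vector.
lstm_params = 4 * (embed_dim * lstm_out + lstm_out * lstm_out + lstm_out)
dense_params = lstm_out * num_classes + num_classes

total = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total)
```

The embedding table accounts for roughly 90% of the trainable parameters, so the choice of `max_features` dominates the model size.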


#### Building a Training and Test Set

In [243]:
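Y_dummies = []

# Column 0 holds the negative-class weight and column 1 the positive-class
# weight, so that column 1 of the predictions can be read as the positive probability.
for y in Y:
    Y_dummies.append([1 - y, y])

Y_dummies = np.array(Y_dummies)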

X_train, X_test, Y_train, Y_test = train_test_split(X,Y_dummies, test_size = 0.5, random_state = 42)
print(X_train.shape,Y_train.shape)
print(X_test.shape,Y_test.shape)

(9505, 2000) (9505, 2)
(9505, 2000) (9505, 2)
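
A two-column target of this kind can be sketched in isolation. This minimal example assumes that column 0 is meant to hold the negative-class weight and column 1 the positive-class weight, which matches how the prediction cells read index 1 as the positive score. The function name `to_two_column` and the sample scores are hypothetical.

```python
def to_two_column(scores):
    """Map sentiment scores in [0, 1] to [negative, positive] soft labels."""
    return [[1.0 - s, s] for s in scores]

sample_scores = [0.0, 0.25, 1.0]  # made-up values for illustration
targets = to_two_column(sample_scores)
print(targets)
```

Each row sums to 1, so the targets are valid distributions for `categorical_crossentropy` with a two-unit softmax output.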


#### Training the RNN

Multiple resources recommended training the RNN for around 4000 iterations, where one iteration is one batch passed through the network. After experimentation, we found that a batch_size of 32 minimized the loss. We therefore derive the number of epochs from the batch size so that training reaches roughly 4000 iterations.
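
The epoch count follows from simple arithmetic, sketched here with the training-set size of 9505 rows reported by the split. The names `target_iterations`, `batches_per_epoch`, and `actual_updates` are introduced only for this illustration.

```python
import math

target_iterations = 4000   # rule of thumb quoted in the text
train_rows = 9505          # X_train.shape[0] reported by the split
batch_size = 32

# Same expression as the training cell: fractional batches per epoch, then truncate.
batches_per_epoch = train_rows / batch_size
epochs = int(target_iterations / batches_per_epoch)

# The last, partial batch of each epoch still counts as one weight update.
actual_updates = epochs * math.ceil(train_rows / batch_size)
print(batches_per_epoch, epochs, actual_updates)
```

Truncating to a whole number of epochs gives 13 epochs and 3874 weight updates, slightly under the 4000 target.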

In [245]:
batch_size = 32
epochs = int(4000 / (X_train.shape[0] / batch_size))
model.fit(X_train, Y_train, epochs = epochs, batch_size=batch_size, verbose = 1)

Epoch 1/13
Epoch 2/13
Epoch 3/13
Epoch 4/13
Epoch 5/13
Epoch 6/13
Epoch 7/13
Epoch 8/13
Epoch 9/13
Epoch 10/13
Epoch 11/13
Epoch 12/13
Epoch 13/13


<keras.callbacks.History at 0x169d575c0>

#### Evaluating the Model

In [253]:
score,acc = model.evaluate(X_test, Y_test, verbose = 1, batch_size = batch_size)
print("score: %.2f" % (score))
print("acc: %.2f" % (acc))

score: 0.69
acc: 0.55
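
For context on the score, the categorical cross-entropy of a two-class model that always outputs 0.5 for each class is ln 2. The short calculation below is offered as a reference point only; `cross_entropy` is a hypothetical helper and the calculation does not depend on the notebook's data.

```python
import math

def cross_entropy(target, predicted):
    """Categorical cross-entropy for one example: -sum(t * log(p))."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)

uniform_loss = cross_entropy([0.0, 1.0], [0.5, 0.5])
print(round(uniform_loss, 2))
```

The reported score of 0.69 is close to this value, and an accuracy of 0.55 is close to the 0.5 expected from guessing between two classes.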


#### Testing RNN Accuracy on Hand-Labelled Articles

In [254]:
import pandas as pd
import numpy as np

df1 = pd.read_csv("./Sentiment Analysis Data/Classified Articles.csv")

df2 = pd.read_csv("./Sentiment Analysis Data/Articles Reading Assignment.csv")
df2 = df2.dropna()
df2["Sentiment"] += 1
df2["Sentiment"] /= 2

df2["Content"] = ["" for i in range(len(df2))]
df2["Content Length"] = [0 for i in range(len(df2))]

for i, row in df2.iterrows():
    x = row["URL"]
    
    key_words = df1[df1["source_url"] == x][:1]["contents"].values[0]
    df2.at[i, "Content"] = str(key_words)
    df2.at[i, "Content Length"] = len(key_words)
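
The two arithmetic lines applied to the `Sentiment` column are a linear rescaling. Assuming the hand labels were recorded on a -1 to 1 scale (an assumption, since the raw file is not shown here), the mapping sends -1, 0, and 1 to 0, 0.5, and 1, which puts the labels on the same scale as the model's output. `rescale_sentiment` is a hypothetical name.

```python
def rescale_sentiment(label):
    """Map a label from the assumed [-1, 1] scale onto [0, 1]."""
    return (label + 1) / 2

raw_labels = [-1, 0, 1]  # assumed raw scale, for illustration only
rescaled = [rescale_sentiment(value) for value in raw_labels]
print(rescaled)
```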

In [255]:
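# Reuse the tokenizer fitted on the training vocabulary; fitting it again here
# would reassign word indices so they no longer match what the model was trained on.
X = tokenizer.texts_to_sequences(df2['Content'].values)
X = pad_sequences(X, maxlen = 2000)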

predictions = model.predict(X, batch_size=batch_size, verbose=1, steps=None)

numerical_predictions_assigned = []
qualitative_prediction_assigned = []

for prediction in predictions:
    p_positive = prediction[1]  # score the model assigns to the positive class
    if p_positive > 0.60:
        numerical_predictions_assigned.append(1)
    elif p_positive > 0.40:
        numerical_predictions_assigned.append(0.5)
    else:
        numerical_predictions_assigned.append(0)
    
    if p_positive > 0.8:
        qualitative_prediction_assigned.append("Strongly Positive")
    elif p_positive > 0.6:
        qualitative_prediction_assigned.append("Moderately Positive")
    elif p_positive > 0.4:
        qualitative_prediction_assigned.append("Neutral")
    elif p_positive > 0.2:
        qualitative_prediction_assigned.append("Moderately Negative")
    else:
        qualitative_prediction_assigned.append("Strongly Negative")

df2["Predicted Quantitative Sentiment"] = numerical_predictions_assigned
df2["Predicted Qualitative Sentiment"] = qualitative_prediction_assigned

df2.head()



Unnamed: 0,Name,URL,Sentiment,Content,Content Length,Predicted Quantitative Sentiment,Predicted Qualitative Sentiment
0,Sudarshan,https://www.wykop.pl/link/4223359/blockchain-a...,0.5,ciastka strona korzysta z plik w cookies w cel...,347,0.0,Moderately Negative
1,Sudarshan,http://www.computerweekly.com/news/252434855/C...,0.0,santiago silver fotolia criminals using crypto...,6472,0.0,Moderately Negative
2,Sudarshan,http://www.mcclatchydc.com/news/politics-gover...,0.0,franco ordo ez anita kumar fordonez mcclatchyd...,5233,0.0,Moderately Negative
3,Sudarshan,https://slashdot.org/submission/7844329/coinch...,0.0,catch stories past week beyond slashdot story ...,310,1.0,Moderately Positive
4,Sudarshan,https://cointelegraph.com/news/philippines-sen...,0.0,cointelegraph philippine senator leila de lima...,1750,1.0,Moderately Positive
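
The threshold logic in the cell above can be factored into two small functions, which makes the cut-offs easier to test on their own. This is a sketch using the same thresholds; `to_numeric`, `to_label`, and `LABEL_CUTOFFS` are hypothetical names, not part of the notebook.

```python
def to_numeric(p_positive):
    """Bucket a positive-class score into 1, 0.5, or 0."""
    if p_positive > 0.60:
        return 1
    if p_positive > 0.40:
        return 0.5
    return 0

# Cut-offs in descending order; the first one exceeded decides the label.
LABEL_CUTOFFS = [
    (0.8, "Strongly Positive"),
    (0.6, "Moderately Positive"),
    (0.4, "Neutral"),
    (0.2, "Moderately Negative"),
]

def to_label(p_positive):
    """Return the label of the first cut-off the score exceeds."""
    for cutoff, label in LABEL_CUTOFFS:
        if p_positive > cutoff:
            return label
    return "Strongly Negative"

examples = [0.95, 0.7, 0.5, 0.3, 0.1]  # made-up scores
print([(to_numeric(p), to_label(p)) for p in examples])
```

Both moderate and strong labels on the same side collapse to the same numeric bucket, so the numeric column carries three levels while the qualitative column carries five.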


In [256]:
success_count = 0
total_count = 0
for i, row in df2.iterrows():
    manual_sentiment = row["Sentiment"]
    predicted_sentiment = row["Predicted Quantitative Sentiment"]
    qualitative_sentiment = row["Predicted Qualitative Sentiment"]
    
    if manual_sentiment == predicted_sentiment:
        success_count += 1
    
    total_count += 1
    
#     print(manual_sentiment, predicted_sentiment, qualitative_sentiment)
    
print("The accuracy on the manually labelled data set is {}".format(float(success_count/total_count)))    

The accuracy on the manually labelled data set is 0.3252032520325203
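
The counting loop above computes plain accuracy: the share of rows where the bucketed prediction equals the hand label. A compact pure-Python equivalent, with made-up label lists and a hypothetical function name `accuracy`, might look like this.

```python
def accuracy(labels, predictions):
    """Fraction of positions where label and prediction are equal."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    matches = sum(1 for y, p in zip(labels, predictions) if y == p)
    return matches / len(labels)

manual = [0.5, 0.0, 0.0, 0.0]      # made-up labels
predicted = [0.0, 0.0, 0.0, 1.0]   # made-up predictions
print(accuracy(manual, predicted))
```

Exact equality works here only because both columns take values from the same small set (0, 0.5, 1); continuous scores would need a tolerance or a bucketing step first.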


#### Testing RNN Accuracy on Spring 2018 Data

In [257]:
sp18_df = pd.read_csv("./Sentiment Analysis Data/Classified Articles.csv")
print(sp18_df.shape)
sp18_df.head()

(40732, 13)


Unnamed: 0.1,Unnamed: 0,index,author,contents,description,publisher,source_url,title,date,time,label,c,marks
0,0,0,Stripe.com,complete payments platform engineered growth b...,"At Stripe, we’ve long been excited about the p...",Stripe.com,https://stripe.com/blog/ending-bitcoin-support,Ending Bitcoin Support,2018-01-23,00:00:00,0.0,"['complete', 'payments', 'platform', 'engineer...",0
1,31650,31650,fisco,bitcoin news price information analysis week s...,"In the week starting Monday, March 5, some top...",Newsbtc.com,https://www.newsbtc.com/2018/03/13/services-br...,Services Bridging Cryptocurrencies and Investo...,2018-03-13,07:39:30,0.0,"['bitcoin', 'news', 'price', 'information', 'a...",0
2,31651,31651,Apple,category video tutorial,Blockchain EOS - Discover How To Get & Send Et...,Gfxbing.com,http://gfxbing.com/video-tutorial/845272-block...,Blockchain EOS - Discover How To Get & Send Et...,2018-03-13,07:39:49,0.0,"['category', 'video', 'tutorial']",0
3,31652,31652,Steven Hay,last updated march th initial coin offerings i...,The post What are ERC-20 Tokens? A Beginner’s ...,99bitcoins.com,https://99bitcoins.com/what-are-erc-20-tokens/,What are ERC-20 Tokens? A Beginner’s Explanation,2018-03-13,07:44:37,0.0,"['last', 'updated', 'march', 'th', 'initial', ...",0
4,31653,31653,e27.co/elena.prokopets,future startups check tea talk week discussion...,"Blockchain can resolve a lot issues, for consu...",E27.co,https://e27.co/4-ways-blockchain-revolutionisi...,4 ways blockchain is revolutionising the trave...,2018-03-13,07:46:06,0.0,"['future', 'startups', 'check', 'tea', 'talk',...",0


In [258]:
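# Reuse the tokenizer fitted on the training vocabulary; fitting it again here
# would reassign word indices so they no longer match what the model was trained on.
X = tokenizer.texts_to_sequences(sp18_df['contents'].values)
X = pad_sequences(X, maxlen = 2000)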

predictions = model.predict(X, batch_size=batch_size, verbose=1, steps=None)

numerical_predictions_assigned = []
qualitative_prediction_assigned = []

for prediction in predictions:
    # Hard label: 1 when the positive-class score is the larger of the two.
    if prediction[1] > prediction[0]:
        numerical_predictions_assigned.append(1)
    else:
        numerical_predictions_assigned.append(0)
    
    p_positive = prediction[1]
    if p_positive > 0.8:
        qualitative_prediction_assigned.append("Strongly Positive")
    elif p_positive > 0.6:
        qualitative_prediction_assigned.append("Moderately Positive")
    elif p_positive > 0.4:
        qualitative_prediction_assigned.append("Neutral")
    elif p_positive > 0.2:
        qualitative_prediction_assigned.append("Moderately Negative")
    else:
        qualitative_prediction_assigned.append("Strongly Negative")

sp18_df["Predicted Quantitative Sentiment"] = numerical_predictions_assigned
sp18_df["Predicted Qualitative Sentiment"] = qualitative_prediction_assigned

sp18_df.head()



Unnamed: 0.1,Unnamed: 0,index,author,contents,description,publisher,source_url,title,date,time,label,c,marks,Predicted Quantitative Sentiment,Predicted Qualitative Sentiment
0,0,0,Stripe.com,complete payments platform engineered growth b...,"At Stripe, we’ve long been excited about the p...",Stripe.com,https://stripe.com/blog/ending-bitcoin-support,Ending Bitcoin Support,2018-01-23,00:00:00,0.0,"['complete', 'payments', 'platform', 'engineer...",0,0,Moderately Negative
1,31650,31650,fisco,bitcoin news price information analysis week s...,"In the week starting Monday, March 5, some top...",Newsbtc.com,https://www.newsbtc.com/2018/03/13/services-br...,Services Bridging Cryptocurrencies and Investo...,2018-03-13,07:39:30,0.0,"['bitcoin', 'news', 'price', 'information', 'a...",0,0,Moderately Negative
2,31651,31651,Apple,category video tutorial,Blockchain EOS - Discover How To Get & Send Et...,Gfxbing.com,http://gfxbing.com/video-tutorial/845272-block...,Blockchain EOS - Discover How To Get & Send Et...,2018-03-13,07:39:49,0.0,"['category', 'video', 'tutorial']",0,0,Neutral
3,31652,31652,Steven Hay,last updated march th initial coin offerings i...,The post What are ERC-20 Tokens? A Beginner’s ...,99bitcoins.com,https://99bitcoins.com/what-are-erc-20-tokens/,What are ERC-20 Tokens? A Beginner’s Explanation,2018-03-13,07:44:37,0.0,"['last', 'updated', 'march', 'th', 'initial', ...",0,0,Moderately Negative
4,31653,31653,e27.co/elena.prokopets,future startups check tea talk week discussion...,"Blockchain can resolve a lot issues, for consu...",E27.co,https://e27.co/4-ways-blockchain-revolutionisi...,4 ways blockchain is revolutionising the trave...,2018-03-13,07:46:06,0.0,"['future', 'startups', 'check', 'tea', 'talk',...",0,1,Moderately Positive


In [259]:
success_count = 0
total_count = 0
for i, row in sp18_df.iterrows():
    manual_sentiment = row["marks"]
    predicted_sentiment = row["Predicted Quantitative Sentiment"]
    qualitative_sentiment = row["Predicted Qualitative Sentiment"]
    
    if manual_sentiment == predicted_sentiment:
        success_count += 1
    
    total_count += 1
    
#     print(manual_sentiment, predicted_sentiment, qualitative_sentiment)
    
print("The accuracy on the manually labelled data set is {}".format(float(success_count/total_count)))    

The accuracy on the manually labelled data set is 0.4852204654816852
