# Sentiment Analysis Using RNN

We use a Keras `Sequential` model with an LSTM layer as a supervised learning approach for predicting the sentiment of an article. This notebook was adapted from https://www.kaggle.com/ngyptr/lstm-sentiment-analysis-keras.

#### Data and Packages Importing

Below, we import the required libraries and load the labeled article data. Currently we use the output of Elais' KMeans classification process as the labels for training our RNN; our next step would be to train the model on an industry-verified dataset and then use it to predict sentiment.

Of all the articles processed by Elais, we sample only 10,000 from the file in order to speed up RNN training.

In [3]:
import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM, SpatialDropout1D
from sklearn.model_selection import train_test_split
from keras.utils.np_utils import to_categorical
import re

data = pd.read_csv("Classified Articles.csv")
data = data[['c','marks']]
data = data.sample(10000)

#### Word Vector Tokenization

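The `Tokenizer` below keeps the most frequent words in the input data (capped by `max_fatures`), maps each of them to an integer index, and converts every article into a sequence of those indices. `pad_sequences` then pads or truncates each sequence to 100 tokens so that the RNN receives fixed-length input.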

In [4]:
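# Remove standalone "rt" tokens (likely a leftover from the tweet-based source notebook).
# Assigning to a row inside iterrows() only changes a copy, so the column is updated directly.
data['c'] = data['c'].str.replace(r'\brt\b', ' ', regex=True)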
    
max_fatures = 2000
tokenizer = Tokenizer(num_words=max_fatures, split=' ')
tokenizer.fit_on_texts(data['c'].values)
X = tokenizer.texts_to_sequences(data['c'].values)
X = pad_sequences(X, maxlen = 100)

X

array([[1059,  316,  660, ...,   37, 1104,  209],
       [ 233, 1054,  819, ..., 1190,  360,   41],
       [  93,   13, 1287, ...,  166,  670, 1860],
       ...,
       [ 913,  334,   34, ...,   47,  448,   69],
       [  49,   17,  333, ...,   22,  171,  252],
       [ 981,  785, 1301, ...,   41,  448,   69]], dtype=int32)
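
To make the tokenization step concrete, here is a minimal pure-Python sketch of roughly what `Tokenizer` and `pad_sequences` do with their default settings. The helper names `fit_word_index` and `to_padded_sequence` and the toy sentences are illustrative assumptions, not part of Keras or of this notebook's data. The real `Tokenizer` also lowercases text and filters punctuation, which is omitted here.

```python
from collections import Counter

def fit_word_index(texts, num_words):
    """Rank words by frequency; index 1 is the most frequent word."""
    counts = Counter(word for text in texts for word in text.split())
    ranked = [word for word, _ in counts.most_common()]
    # Only indices below num_words are kept; index 0 is reserved for padding.
    return {word: i for i, word in enumerate(ranked, start=1) if i < num_words}

def to_padded_sequence(text, word_index, maxlen):
    """Map known words to indices, drop unknown words, then left-pad with zeros."""
    seq = [word_index[w] for w in text.split() if w in word_index]
    seq = seq[-maxlen:]  # keep the last maxlen tokens (truncation from the front)
    return [0] * (maxlen - len(seq)) + seq

# Hypothetical toy corpus, used only for illustration.
toy_texts = ["bitcoin price rises", "bitcoin price falls sharply", "bitcoin news"]
word_index = fit_word_index(toy_texts, num_words=4)
padded = [to_padded_sequence(t, word_index, maxlen=5) for t in toy_texts]
print(word_index)
print(padded)
```

Words outside the kept vocabulary simply disappear from the sequence, which is why a small vocabulary cap can shorten articles considerably before padding.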

#### Building the LSTM Model

In [27]:
embed_dim = 128
lstm_out = 196

model = Sequential()
model.add(Embedding(max_fatures, embed_dim,input_length = X.shape[1]))
model.add(SpatialDropout1D(0.4))
model.add(LSTM(lstm_out, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(2,activation='softmax'))
model.compile(loss = 'categorical_crossentropy', optimizer='adam',metrics = ['accuracy'])
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_5 (Embedding)      (None, 100, 128)          256000    
_________________________________________________________________
spatial_dropout1d_5 (Spatial (None, 100, 128)          0         
_________________________________________________________________
lstm_5 (LSTM)                (None, 196)               254800    
_________________________________________________________________
dense_5 (Dense)              (None, 2)                 394       
Total params: 511,194
Trainable params: 511,194
Non-trainable params: 0
_________________________________________________________________
None
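
The parameter counts in the summary above can be checked by hand. The sketch below recomputes them from the layer sizes using the standard formulas for embedding, LSTM, and dense layers; the function names are hypothetical helpers written for this check, not Keras functions.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per vocabulary index.
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias.
    return 4 * ((input_dim + units + 1) * units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit.
    return input_dim * units + units

counts = [
    embedding_params(2000, 128),  # max_fatures x embed_dim
    lstm_params(128, 196),        # embed_dim -> lstm_out
    dense_params(196, 2),         # lstm_out -> 2 classes
]
print(counts, sum(counts))
```

The three values match the `Param #` column, and their sum matches the reported total of 511,194.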


#### Building a Training and Test Set

The training set is a 67% random sample of the 10,000 articles labeled by Elais' sentiment process; the remaining 33% is held out for testing.

In [7]:
Y = pd.get_dummies(data['marks']).values
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.33, random_state = 42)
print(X_train.shape,Y_train.shape)
print(X_test.shape,Y_test.shape)

(6700, 100) (6700, 2)
(3300, 100) (3300, 2)

#### Training the RNN

The cell below trains the RNN on the training data. Ideally we would train for more epochs, but for the sake of time we use 7.

In [29]:
batch_size = 32
model.fit(X_train, Y_train, epochs = 7, batch_size=batch_size, verbose = 2)

Epoch 1/7
 - 40s - loss: 0.5934 - acc: 0.7242
Epoch 2/7
 - 37s - loss: 0.5762 - acc: 0.7319
Epoch 3/7
 - 37s - loss: 0.5640 - acc: 0.7342
Epoch 4/7
 - 34s - loss: 0.5519 - acc: 0.7403
Epoch 5/7
 - 36s - loss: 0.5360 - acc: 0.7507
Epoch 6/7
 - 35s - loss: 0.5225 - acc: 0.7543
Epoch 7/7
 - 33s - loss: 0.5110 - acc: 0.7649


<keras.callbacks.History at 0x11fa21668>
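
For a rough sense of how much training 7 epochs represents, the sketch below counts the weight updates implied by the training-set size and batch size used above. `updates_per_run` is a hypothetical helper, and the calculation assumes one update per batch with a final partial batch.

```python
import math

def updates_per_run(n_samples, batch_size, epochs):
    # One gradient update per batch; the last batch of an epoch may be smaller.
    steps_per_epoch = math.ceil(n_samples / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs

steps, total = updates_per_run(n_samples=6700, batch_size=32, epochs=7)
print(steps, total)
```

Since the training loss is still decreasing at epoch 7, additional epochs would add updates at the same per-epoch rate.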

#### Evaluating the Model

Under the assumptions and sampling above, our model reaches an accuracy of about 0.71 on the held-out test data, measured against the labels from Elais' model. The next steps for improving the model are as follows:
- Finding better training data
- Increasing the number of epochs for better accuracy
- Allowing a larger maximum sequence length (more words per article)

These next steps will allow us to train a better RNN model and subsequently make a stronger prediction of sentiment.

In [30]:
validation_size = 1500

X_validate = X_test[-validation_size:]
Y_validate = Y_test[-validation_size:]
X_test = X_test[:-validation_size]
Y_test = Y_test[:-validation_size]
score,acc = model.evaluate(X_test, Y_test, verbose = 2, batch_size = batch_size)
print("score: %.2f" % (score))
print("acc: %.2f" % (acc))

score: 0.63
acc: 0.71
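
One way to put the 0.71 accuracy in context is to compare it with a majority-class baseline, that is, the accuracy obtained by always predicting the most common label. The sketch below is an illustrative addition; `majority_baseline_accuracy` and the toy labels are hypothetical and are not taken from the notebook's data.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of a classifier that always predicts the most frequent label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

# Hypothetical labels, for illustration only.
toy_labels = [1, 1, 1, 0]
baseline = majority_baseline_accuracy(toy_labels)
print(baseline)
```

In this notebook the same function could be applied to `Y_test.argmax(axis=1).tolist()`; a model is only adding value to the extent that it beats that number.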


#### Assign Sentiments
Since the RNN outputs a probability for each sentiment class, we convert these probabilities into a binary label (the more likely class) before exporting the result for the time-series analysis process.

In [50]:
output_data = pd.read_csv("Classified Articles.csv")
output_data = output_data.drop("marks",axis=1)

    
# Reuse the tokenizer fitted on the training sample. Fitting a new tokenizer on this
# file would map words to different indices than those the model was trained on.
X = tokenizer.texts_to_sequences(output_data['c'].values)
X = pad_sequences(X, maxlen = 100)

predictions = model.predict(X, batch_size=batch_size, verbose=2, steps=None)

# Label 1 unless class 0 is strictly more probable.
predictions_assigned = (predictions[:, 1] >= predictions[:, 0]).astype(int)

output_data["marks"] = predictions_assigned

output_data.to_csv("Articles with Sentiment.csv")
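
The assignment rule used in the cell above can be illustrated on a few made-up probability pairs. This is a minimal sketch; `assign_labels` and the toy values are hypothetical and do not come from the model's actual predictions.

```python
def assign_labels(probabilities):
    """Return 0 when class 0 is strictly more probable, otherwise 1."""
    return [0 if p0 > p1 else 1 for p0, p1 in probabilities]

# Hypothetical softmax outputs: (P(class 0), P(class 1)) per article.
toy_probabilities = [(0.8, 0.2), (0.3, 0.7), (0.5, 0.5)]
labels = assign_labels(toy_probabilities)
print(labels)
```

Note that an exact tie is assigned to class 1 under this rule.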


# APPENDIX: Older Attempts to build RNN

In [None]:
from keras.datasets import imdb

# Import libraries and packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import re
from datetime import datetime

# Plotting
import seaborn as sns
%matplotlib inline
import matplotlib.gridspec as gridspec

# Plot styling
sns.set(style='white', context='notebook', palette='deep')

import nltk

from nltk.cluster import KMeansClusterer
from nltk.cluster import euclidean_distance

from sklearn import cluster
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn import decomposition
from sklearn.metrics import adjusted_rand_score
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.pipeline import Pipeline
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

from gensim.models import Word2Vec
from gensim.models import word2vec

from collections import defaultdict

from sklearn.model_selection import train_test_split

import warnings
warnings.filterwarnings('ignore')

eng_stopwords = nltk.corpus.stopwords.words('english')

In [None]:
## modified the code from open source website http://nadbordrozd.github.io/blog/2016/05/20/text-classification-with-word2vec/ 
class MeanEmbeddingVectorizer(object):
    def __init__(self, word2vec):
        self.word2vec = word2vec
        # if a text is empty we should return a vector of zeros
        # with the same dimensionality as all the other vectors
        self.dim = len(next(iter(word2vec.values())))

    def fit(self, X, y):
        return self

    def transform(self, X):
        return np.array([
            np.mean([self.word2vec[w] for w in words if w in self.word2vec]
                    or [np.zeros(self.dim)], axis=0)
            for words in X
        ])
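
To show what the mean-embedding transform computes, here is a small pure-Python sketch of the same idea: average the vectors of the known words in a document, and fall back to a zero vector when none are known. The function `mean_embedding` and the two-dimensional toy vectors are hypothetical and chosen only for illustration.

```python
def mean_embedding(words, word_vectors, dim):
    """Average the vectors of known words; return zeros if no word is known."""
    known = [word_vectors[w] for w in words if w in word_vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) groups the i-th component of every vector together.
    return [sum(component) / len(known) for component in zip(*known)]

# Hypothetical 2-dimensional word vectors.
toy_vectors = {"bitcoin": [1.0, 0.0], "price": [0.0, 1.0]}
doc_vector = mean_embedding(["bitcoin", "price", "unknown"], toy_vectors, dim=2)
empty_vector = mean_embedding(["unknown"], toy_vectors, dim=2)
print(doc_vector, empty_vector)
```

Unknown words are skipped rather than counted as zeros, so they do not pull the average toward the origin.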

In [None]:
class TfidfEmbeddingVectorizer(object):
    def __init__(self, word2vec):
        self.word2vec = word2vec
        self.word2weight = None
        self.dim = len(next(iter(word2vec.values())))

    def fit(self, X):
        tfidf = TfidfVectorizer(analyzer=lambda x: x)
        tfidf.fit(X)
        # if a word was never seen - it must be at least as infrequent
        # as any of the known words - so the default idf is the max of 
        # known idf's
        max_idf = max(tfidf.idf_)
        self.word2weight = defaultdict(
            lambda: max_idf,
            [(w, tfidf.idf_[i]) for w, i in tfidf.vocabulary_.items()])

        return self

    def transform(self, X):
        return np.array([
                np.mean([self.word2vec[w] * self.word2weight[w]
                         for w in words if w in self.word2vec] or
                        [np.zeros(self.dim)], axis=0)
                for words in X
            ])

In [None]:
from keras.datasets import imdb
df = pd.read_csv("Classified Articles.csv")
df.head()

In [None]:
# vocabulary_size = 5000
# (X_train, y_train), (X_test, y_test) = imdb.load_data(num_words = vocabulary_size)
# print('Loaded dataset with {} training samples, {} test samples'.format(len(X_train), len(X_test)))

X_train, X_test, y_train, y_test = train_test_split(df["c"], df["marks"], test_size=0.25)

In [None]:
vectorized_x_train = []

model = word2vec.Word2Vec(X_train, min_count=15)
w2v = dict(zip(model.wv.index2word, model.wv.syn0))
t = TfidfEmbeddingVectorizer(w2v)
t.fit(X_train)
vectorized_x_train.append(t.transform(X_train))

In [None]:
vectorized_x_test = []

model = word2vec.Word2Vec(X_test, min_count=15)
w2v = dict(zip(model.wv.index2word, model.wv.syn0))
t = TfidfEmbeddingVectorizer(w2v)
t.fit(X_test)
vectorized_x_test.append(t.transform(X_test))

In [None]:
X_train = vectorized_x_train
X_test = vectorized_x_test

In [None]:
from keras.preprocessing import sequence
max_words = 500
X_train = sequence.pad_sequences(X_train, maxlen=max_words)
X_test = sequence.pad_sequences(X_test, maxlen=max_words)

In [None]:
from keras import Sequential
from keras.layers import Embedding, LSTM, Dense, Dropout
embedding_size=100
model=Sequential()
model.add(Embedding(5000, embedding_size, input_length=max_words))
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))
print(model.summary())

In [None]:
model.compile(loss='binary_crossentropy', 
             optimizer='adam', 
             metrics=['accuracy'])

In [None]:
# Understanding Metrics
print(max([len(x) for x in X_train]))


In [None]:
batch_size = 64
num_epochs = 3
X_valid, y_valid = X_train[:batch_size], y_train[:batch_size]
X_train2, y_train2 = X_train[batch_size:], y_train[batch_size:]
model.fit(X_train2, y_train2, validation_data=(X_valid, y_valid), batch_size=batch_size, epochs=num_epochs)

In [None]:
scores = model.evaluate(X_test, y_test, verbose=0)
print('Test accuracy:', scores[1])

In [49]:
df = pd.read_csv("Articles with Sentiment.csv")

df

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,index,author,contents,description,publisher,source_url,title,date,time,label,c,Sentiment
0,0,0,0,Stripe.com,complete payments platform engineered growth b...,"At Stripe, we’ve long been excited about the p...",Stripe.com,https://stripe.com/blog/ending-bitcoin-support,Ending Bitcoin Support,2018-01-23,00:00:00,0.0,"['complete', 'payments', 'platform', 'engineer...",1
1,1,31650,31650,fisco,bitcoin news price information analysis week s...,"In the week starting Monday, March 5, some top...",Newsbtc.com,https://www.newsbtc.com/2018/03/13/services-br...,Services Bridging Cryptocurrencies and Investo...,2018-03-13,07:39:30,0.0,"['bitcoin', 'news', 'price', 'information', 'a...",1
2,2,31651,31651,Apple,category video tutorial,Blockchain EOS - Discover How To Get & Send Et...,Gfxbing.com,http://gfxbing.com/video-tutorial/845272-block...,Blockchain EOS - Discover How To Get & Send Et...,2018-03-13,07:39:49,0.0,"['category', 'video', 'tutorial']",1
3,3,31652,31652,Steven Hay,last updated march th initial coin offerings i...,The post What are ERC-20 Tokens? A Beginner’s ...,99bitcoins.com,https://99bitcoins.com/what-are-erc-20-tokens/,What are ERC-20 Tokens? A Beginner’s Explanation,2018-03-13,07:44:37,0.0,"['last', 'updated', 'march', 'th', 'initial', ...",1
4,4,31653,31653,e27.co/elena.prokopets,future startups check tea talk week discussion...,"Blockchain can resolve a lot issues, for consu...",E27.co,https://e27.co/4-ways-blockchain-revolutionisi...,4 ways blockchain is revolutionising the trave...,2018-03-13,07:46:06,0.0,"['future', 'startups', 'check', 'tea', 'talk',...",1
5,5,31654,31654,Reuters,never miss great news story get instant notifi...,At what organisers claimed to be Britain's fir...,The Times of India,https://economictimes.indiatimes.com/markets/s...,"Rookie crypto investors look past risks, flock...",2018-03-13,07:54:56,0.0,"['never', 'miss', 'great', 'news', 'story', 'g...",1
6,6,31655,31655,HashChain Technology Inc.,news provided mar et share article vancouver m...,"VANCOUVER, March 13, 2018 /PRNewswire/ - HashC...",Prnewswire.com,https://www.prnewswire.com/news-releases/hashc...,HashChain Technology Enhances Cryptocurrency T...,2018-03-13,08:00:00,0.0,"['news', 'provided', 'mar', 'et', 'share', 'ar...",1
7,7,31649,31649,PTI,coincheck chief operating officer yusuke otsuk...,Thieves syphoned away 523 mn units of cryptocu...,Thehindubusinessline.com,https://www.thehindubusinessline.com/news/worl...,Hacked Japan crypto exchange Coincheck refunds...,2018-03-13,07:17:53,0.0,"['coincheck', 'chief', 'operating', 'officer',...",1
8,8,31656,31656,Aayush Jindal,bitcoin news price information analysis ethere...,Key Highlights Ethereum classic price is findi...,Newsbtc.com,https://www.newsbtc.com/2018/03/13/ethereum-cl...,Ethereum Classic Price Technical Analysis – Ca...,2018-03-13,08:00:42,0.0,"['bitcoin', 'news', 'price', 'information', 'a...",1
9,9,31658,31658,John McMahon,bitcoin news price information analysis coinch...,"Coincheck reported today that all 260,000 cust...",Newsbtc.com,https://www.newsbtc.com/2018/03/13/coincheck-c...,Coincheck Completes $430 Million Customer Refund,2018-03-13,08:08:11,0.0,"['bitcoin', 'news', 'price', 'information', 'a...",1
