## Sentiment Analysis using Recurrent Neural Network

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
from bs4 import BeautifulSoup
%matplotlib inline

In [2]:
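# quoting=3 is csv.QUOTE_NONE: double quotes inside the reviews are kept as literal characters
train = pd.read_csv('labeledTrainData.tsv', quoting=3, header=0, delimiter='\t')
test = pd.read_csv('testData.tsv', quoting=3, header=0, delimiter='\t')
# Unlabeled reviews; used below only when fitting the tokenizer vocabulary
all_data = pd.read_csv('unlabeledTrainData.tsv', quoting=3, header=0, delimiter='\t')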

In [3]:
train.head()

         id  sentiment  review
0  "5814_8"          1  "With all this stuff going down at the moment ...
1  "2381_9"          1  "\"The Classic War of the Worlds\" by Timothy ...
2  "7759_3"          0  "The film starts with a manager (Nicholas Bell...
3  "3630_4"          0  "It must be assumed that those who praised thi...
4  "9495_8"          1  "Superbly trashy and wondrously unpretentious ...


In [4]:
train.review[0]

'"With all this stuff going down at the moment with MJ i\'ve started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ\'s feeling towards the press and also the obvious message of drugs are bad m\'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally

In [5]:
def clean_review(movie_review):
    #Remove HTML tags
    review = BeautifulSoup(movie_review,"lxml").get_text()
    
    #Remove non-alphabets
    review = re.sub("[^a-zA-Z]"," ",review)
    return review.lower()
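
To see what the two cleaning steps do to a single string, here is a minimal standard-library sketch. It is not the function above: `clean_review_sketch` is a hypothetical name, and a simple regular expression stands in for `BeautifulSoup(...).get_text()`, which is only a rough approximation of real HTML parsing.

```python
import re

def clean_review_sketch(movie_review):
    # Assumption: a plain regex replaces BeautifulSoup for tag removal in this sketch
    review = re.sub(r"<[^>]+>", "", movie_review)
    # Replace every non-letter (digits, punctuation, quotes) with a space
    review = re.sub("[^a-zA-Z]", " ", review)
    return review.lower()

sample = "Drugs are bad m'kay.<br /><br />Visually impressive!"
cleaned = clean_review_sketch(sample)
print(cleaned.split())
```

Note that the apostrophe in `m'kay` becomes a space, so the contraction is split into two tokens (`m` and `kay`).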

In [6]:
trainReview = [clean_review(review) for review in train.review]
testReview = [clean_review(review) for review in test.review]
all_dataReview = [clean_review(review) for review in all_data.review]

In [7]:
from passage.models import RNN
from passage.updates import Adadelta
from passage.layers import Embedding, Dense, GatedRecurrent
from passage.preprocessing import Tokenizer

In [8]:
tokenizer = Tokenizer(min_df=10, max_features=100000)
tokenizer.fit(trainReview+testReview+all_dataReview)

<passage.preprocessing.Tokenizer at 0x11ff5dad0>

In [9]:
X_train = tokenizer.transform(trainReview)
y_train = train.sentiment.values
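
The tokenizer turns each review into a list of integer word ids, keeping only words that appear in at least `min_df` documents. The following is a simplified pure-Python illustration of that idea, not Passage's implementation: `fit_vocab`, `transform`, and the choice of id `0` for out-of-vocabulary words are all assumptions made for this sketch.

```python
from collections import Counter

def fit_vocab(docs, min_df=2, max_features=None):
    # Count in how many documents each word occurs (document frequency)
    df = Counter()
    for doc in docs:
        df.update(set(doc.split()))
    # Rank by frequency, break ties alphabetically so the result is deterministic
    ranked = sorted(df.items(), key=lambda kv: (-kv[1], kv[0]))
    kept = [word for word, count in ranked if count >= min_df][:max_features]
    # Assumption: id 0 is reserved for unknown words in this sketch
    return {word: i + 1 for i, word in enumerate(kept)}

def transform(docs, vocab):
    # Map every word to its id, falling back to 0 for rare or unseen words
    return [[vocab.get(word, 0) for word in doc.split()] for doc in docs]

docs = ["great movie", "bad movie", "great great cast"]
vocab = fit_vocab(docs, min_df=2)
encoded = transform(["great bad movie"], vocab)
print(vocab, encoded)
```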

In [10]:
layers = [
    Embedding(size=256, n_features=tokenizer.n_features),
    GatedRecurrent(size=512, activation='tanh', gate_activation='steeper_sigmoid',
                   init='orthogonal', seq_output=False, p_drop=0.75),
    Dense(size=1, activation='sigmoid', init='orthogonal')
]

model = RNN(layers=layers, cost='bce', updater=Adadelta(lr=0.5))
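
For intuition about what the recurrent layer computes, here is a NumPy sketch of a textbook gated recurrent unit run over one short sequence, followed by a sigmoid output unit. It is an illustration under stated assumptions rather than Passage's code: the gate equations follow the standard GRU formulation, the slope of 3.75 in `steeper_sigmoid` is assumed, the weights are random, the sizes are tiny, and dropout is omitted.

```python
import numpy as np

def steeper_sigmoid(x, slope=3.75):
    # Assumption: a sigmoid with a steeper slope; the exact constant may differ in Passage
    return 1.0 / (1.0 + np.exp(-slope * x))

def gru_step(x_t, h_prev, p):
    z = steeper_sigmoid(x_t @ p["W_z"] + h_prev @ p["U_z"])        # update gate
    r = steeper_sigmoid(x_t @ p["W_r"] + h_prev @ p["U_r"])        # reset gate
    h_tilde = np.tanh(x_t @ p["W_h"] + (r * h_prev) @ p["U_h"])    # candidate state
    return z * h_prev + (1.0 - z) * h_tilde                        # blend old and new

rng = np.random.default_rng(0)
n_in, n_hidden, seq_len = 4, 3, 5  # toy sizes, not the 256/512 used above
params = {}
for gate in ("z", "r", "h"):
    params["W_" + gate] = rng.normal(scale=0.1, size=(n_in, n_hidden))
    params["U_" + gate] = rng.normal(scale=0.1, size=(n_hidden, n_hidden))

sequence = rng.normal(size=(seq_len, n_in))  # stand-in for embedded word vectors
h = np.zeros(n_hidden)
for x_t in sequence:
    h = gru_step(x_t, h, params)

# Only the final state is kept (the role of seq_output=False), then mapped to a probability
w_out = rng.normal(scale=0.1, size=n_hidden)
prob = 1.0 / (1.0 + np.exp(-(h @ w_out)))
print(h.shape, float(prob))
```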

In [12]:
model.fit(X_train,y_train, n_epochs=1)

Epoch 0 Seen 24750 samples Avg cost 0.6484 Time elapsed 2826 seconds


[array(0.6835602692876249),
 array(0.6858440322438711),
 array(0.6886432352127233),
 array(0.6889696293159099),
 array(0.6875647772886956),
 array(0.6903984185872014),
 array(0.6913470969705788),
 array(0.6864612808493652),
 array(0.6959257388158249),
 array(0.689425637170128),
 array(0.6860785136647823),
 array(0.6873513337483851),
 array(0.6820968853550499),
 array(0.6970065640867107),
 array(0.6982562720138159),
 array(0.6867557036555055),
 array(0.6916175077136983),
 array(0.6888124025654861),
 array(0.6965631262252319),
 array(0.6904485235933381),
 array(0.6878855883662756),
 array(0.687480299109684),
 array(0.6910420431121765),
 array(0.6872921410456206),
 array(0.681380579887992),
 array(0.6903676966334397),
 array(0.69770800489216),
 array(0.6819219416628711),
 array(0.690200584553394),
 array(0.6955555312395942),
 array(0.686180943464286),
 array(0.6855235021445224),
 array(0.695169138611788),
 array(0.688357516622585),
 array(0.6931355771103304),
 array(0.6841040281099225),
 

In [23]:
X_test = tokenizer.transform(testReview)
result = model.predict(X_test)

In [24]:
rt = (result > 0.5).astype(int)

In [28]:
rt = rt.flatten()
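
The thresholding and flattening steps can be checked on a toy array. This sketch assumes the predictions arrive as a column of probabilities with shape `(n, 1)`, which is inferred from the `flatten()` call above rather than confirmed; the values are made up.

```python
import numpy as np

# Made-up probabilities standing in for model.predict output
probs = np.array([[0.91], [0.12], [0.50], [0.73]])

# Strictly greater than 0.5 counts as positive, so exactly 0.5 maps to 0
labels = (probs > 0.5).astype(int)

# Collapse the (n, 1) column into a 1-D vector suitable for a DataFrame column
labels = labels.flatten()
print(labels.shape, labels.tolist())
```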

In [29]:
output = pd.DataFrame(data={"id":test["id"],"sentiment":rt})

In [30]:
output.head()

           id  sentiment
0  "12311_10"          1
1    "8348_2"          0
2    "5828_4"          1
3    "7186_2"          1
4   "12128_7"          1
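
A natural last step is writing the predictions to disk. The notebook does not show this, so the following is only a sketch: the filename `rnn_submission.csv` is hypothetical, and a two-row toy frame stands in for `output`. Because the ids were read with `quoting=3` they already contain literal double quotes, so the same setting is used when writing to avoid quoting them a second time.

```python
import pandas as pd

# Toy stand-in for the `output` frame; the ids keep their literal quotes
output = pd.DataFrame({"id": ['"12311_10"', '"8348_2"'], "sentiment": [1, 0]})

# Hypothetical filename; quoting=3 is csv.QUOTE_NONE, so fields are written as-is
submission_path = "rnn_submission.csv"
output.to_csv(submission_path, index=False, quoting=3)

with open(submission_path) as f:
    print(f.read())
```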
