### Sentiment Analysis Using Deep Learning

This notebook trains a convolutional neural network to classify the sentiment of a piece of text. The data used here is the IMDB Large Movie Review Dataset, which is freely available online.

#### Loading raw data

The first step is to load the IMDB data into memory.

In [1]:
# CNN for sequence classification on the IMDB dataset
import numpy
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM, Convolution1D, Flatten, Dropout
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence
from keras.callbacks import TensorBoard

# Load the dataset with Keras, keeping only the top_words most frequent words
top_words = 10000
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=top_words)

Using TensorFlow backend.


In [4]:
# Pad (or truncate) every review to the same length
max_review_length = 1600
X_train = sequence.pad_sequences(X_train, maxlen=max_review_length)
X_test = sequence.pad_sequences(X_test, maxlen=max_review_length)

# Using embedding from Keras
embedding_vector_length = 300
model = Sequential()
model.add(Embedding(top_words, embedding_vector_length, input_length=max_review_length))

# Convolutional model (1x conv, flatten, 2x dense); the two extra conv layers below are disabled
model.add(Convolution1D(64, 3, padding='same'))
#model.add(Convolution1D(32, 3, padding='same'))
#model.add(Convolution1D(16, 3, padding='same'))
model.add(Flatten())
model.add(Dropout(0.2))
model.add(Dense(180,activation='sigmoid'))
model.add(Dropout(0.2))
model.add(Dense(1,activation='sigmoid'))

# Log to tensorboard
tensorBoardCallback = TensorBoard(log_dir='./logs', write_graph=True)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
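
As a rough sanity check on the architecture above, the trainable parameter count can be worked out by hand. The helper functions below are illustrative and not part of Keras. They assume stride 1 and `padding='same'`, so the convolution keeps all 1600 time steps. `model.summary()` remains the authoritative source.

```python
# Hypothetical helpers for counting weights; not a Keras API.
def embedding_params(vocab_size, dim):
    return vocab_size * dim

def conv1d_params(in_channels, filters, kernel_size):
    # One kernel of shape (kernel_size, in_channels) per filter, plus one bias each
    return kernel_size * in_channels * filters + filters

def dense_params(inputs, units):
    return inputs * units + units

top_words = 10000
embedding_dim = 300
max_review_length = 1600

counts = {
    "embedding": embedding_params(top_words, embedding_dim),
    "conv1d": conv1d_params(embedding_dim, 64, 3),
    # Flatten turns (1600, 64) into 1600 * 64 features
    "dense_180": dense_params(max_review_length * 64, 180),
    "dense_1": dense_params(180, 1),
}
total = sum(counts.values())

for name, n in counts.items():
    print("%-10s %12d" % (name, n))
print("%-10s %12d" % ("total", total))
```

Under these assumptions the first `Dense` layer holds about 18.4 million of the roughly 21.5 million parameters, because `Flatten` produces 102,400 input features.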

In [5]:
model.fit(X_train, y_train, epochs=3, callbacks=[tensorBoardCallback], batch_size=64)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x11fb87fd0>

In [6]:
# Evaluation on the test set
scores = model.evaluate(X_test, y_test, verbose=0)

In [8]:
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 88.10%
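
`scores[0]` is the loss and `scores[1]` is the accuracy, because `accuracy` is the only entry in `metrics`. For a single sigmoid output, accuracy amounts to thresholding each predicted probability at 0.5 and comparing it with the label. A minimal sketch in plain Python, using made-up labels and probabilities:

```python
def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded probability matches the label."""
    correct = sum(
        1 for t, p in zip(y_true, y_prob) if (p > threshold) == bool(t)
    )
    return correct / len(y_true)

# Made-up values for illustration only
y_true = [1, 0, 1, 1]
y_prob = [0.9, 0.2, 0.4, 0.7]

acc = binary_accuracy(y_true, y_prob)
print("Accuracy: %.2f%%" % (acc * 100))
```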


In [None]:
# serialize model to JSON
model_json = model.to_json()
with open("model.json", "w") as json_file:
    json_file.write(model_json)
# serialize weights to HDF5
model.save_weights("model.h5")
print("Saved model to disk")

In [5]:
from keras.preprocessing.text import one_hot

X = [one_hot('If you like adult comedy cartoons, like South Park, then this is nearly a similar format about the small adventures of three teenage girls at Bromwell High. Keisha, Natella and Latrina have given exploding sweets and behaved like bitches, I think Keisha is a good leader. There are also small stories going on with the teachers of the school. Theres the idiotic principal, Mr. Bip, the nervous Maths teacher and many others. The cast is also fantastic, Lenny Henrys Gina Yashere, EastEnders Chrissie Watts, Tracy-Ann Oberman, Smack The Ponys Doon Mackichan, Dead Ringers Mark Perry and Blunders Nina Conti. I didnt know this came from Canada, but it is very good. Very good!',top_words)]

# In the Keras IMDB labels, 1 is positive and 0 is negative
X = sequence.pad_sequences(X, maxlen=max_review_length)
model.predict(X)

Using TensorFlow backend.


NameError: name 'top_words' is not defined
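
The `NameError` appears because this cell was run in a fresh kernel, before `top_words` was defined. There is also a second issue: `one_hot` assigns indices by hashing each word, so they do not correspond to the frequency-ranked indices the model was trained on. The sketch below shows one way a review could be encoded in the same scheme as `imdb.load_data`, assuming its defaults (`start_char=1`, `oov_char=2`, `index_from=3`). The `encode_review` helper and the toy word index are hypothetical; in the notebook the real mapping would come from `imdb.get_word_index()`, and the whitespace tokenization here is a simplification.

```python
def encode_review(text, word_index, num_words,
                  index_from=3, start_char=1, oov_char=2):
    """Encode text with frequency ranks shifted by index_from (hypothetical helper)."""
    tokens = [t.strip(".,!?;:\"'()") for t in text.lower().split()]
    encoded = [start_char]
    for token in tokens:
        if not token:
            continue
        rank = word_index.get(token)
        if rank is None:
            encoded.append(oov_char)      # word never seen in the corpus
            continue
        idx = rank + index_from
        # Words outside the num_words vocabulary are mapped to oov_char
        encoded.append(idx if idx < num_words else oov_char)
    return encoded

# Toy stand-in for imdb.get_word_index(): word -> frequency rank
toy_word_index = {"the": 1, "movie": 2, "good": 3, "very": 4}

print(encode_review("Very good plot!", toy_word_index, num_words=10))
```

The resulting list can then be wrapped in a list and passed through `sequence.pad_sequences` before calling `model.predict`.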

### Twitter dataset

This section trains a deep neural network on the annotated Twitter dataset.

In [2]:
import pandas as pd

annotated = pd.read_csv('./annotated/bootstrapped.csv', encoding = "ISO-8859-1")

annotated.tail()

| Unnamed: 0 | favorite_count | id_str | in_reply_to_user_id_str | is_retweet | retweet_count | source | text | Sentiment |
|---|---|---|---|---|---|---|---|---|
| 32578 | 0 | 815449868739211264 | | True | 6847 | Twitter for iPhone | RT @DonaldJTrumpJr: Happy new year everyone. #... | P |
| 32579 | 0 | 815433444591304704 | | True | 6941 | Twitter for iPhone | RT @EricTrump: 2016 was such an incredible yea... | P |
| 32580 | 0 | 815433217595547648 | | True | 7144 | Twitter for iPhone | RT @Reince: Happy New Year + God's blessings t... | P |
| 32581 | 0 | 815432169464197120 | | True | 5548 | Twitter for iPhone | RT @DanScavino: On behalf of our next #POTUS &... | P |
| 32582 | 126230 | 815422340540547072 | | False | 32665 | Twitter for iPhone | TO ALL AMERICANS-\n#HappyNewYear &amp; many bl... | P |


In [3]:
print("%i annotated rows" % len(annotated))

32583 annotated rows


In [10]:
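from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# The first 1000 rows are held out; fit the tokenizer on the remaining rows only
documents = annotated[1000:][['text', 'Sentiment']]

# With num_words=1000, only the 999 most frequent words are kept in the sequences
tokenizer = Tokenizer(num_words=1000)
tokenizer.fit_on_texts(documents['text'])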

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

Found 42547 unique tokens.
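
Note that `word_index` always lists every word seen during fitting (42547 here), while `num_words=1000` only limits which indices survive when texts are converted to sequences. The plain-Python sketch below imitates that behaviour on toy data. The two functions are simplified stand-ins, not the Keras implementation: they split on whitespace only and break frequency ties alphabetically.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; rank 1 is the most frequent, 0 is left for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def texts_to_sequences(texts, word_index, num_words=None):
    """Convert texts to index lists, dropping words whose rank is >= num_words."""
    sequences = []
    for text in texts:
        seq = []
        for word in text.lower().split():
            idx = word_index.get(word)
            if idx is None:
                continue
            if num_words is not None and idx >= num_words:
                continue
            seq.append(idx)
        sequences.append(seq)
    return sequences

toy_texts = ["good good movie", "bad movie", "good plot"]
toy_index = build_word_index(toy_texts)

print("Found %s unique tokens." % len(toy_index))
print(texts_to_sequences(toy_texts, toy_index, num_words=3))
```

One consequence is that an `Embedding` layer sized with `len(word_index) + 1` allocates rows for words that never appear in the sequences; sizing it with `num_words` instead would likely be enough.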


In [12]:

documents = annotated[1000:][['text', 'Sentiment']]

X_train = []
Y_train = []
max_len = 0

for doc in documents.itertuples():
    if doc.Sentiment == 'Z':
        continue
    
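    # texts_to_sequences expects a list of texts, so wrap the tweet and unwrap the result
    hot = tokenizer.texts_to_sequences([doc.text])[0]
    X_train.append(hot)
    max_len = max(max_len, len(hot))
    # Label 'P' becomes 1; every remaining label becomes 0
    Y_train.append(1 if doc.Sentiment == 'P' else 0)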

print("%i to train" % len(X_train))
print('%i max len' % max_len)

27252 to train
49 max len


In [14]:
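import numpy as np
X_train = pad_sequences(X_train, maxlen=max_len)
Y_train = np.array(Y_train)  # pass the labels to Keras as an array, not a Python list

print('%i data X_train' % len(X_train))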

27252 data X_train
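
With its default arguments, `pad_sequences` pads with zeros at the front of each sequence and, when a sequence is too long, drops tokens from the front as well. The plain-Python function below is a simplified stand-in that mimics those defaults (`padding='pre'`, `truncating='pre'`); `pad_pre` is a hypothetical name, and it returns lists rather than a NumPy array.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad with `value` and keep only the last `maxlen` items (maxlen > 0)."""
    padded = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]                     # truncate from the front
        padded.append([value] * (maxlen - len(kept)) + kept)
    return padded

print(pad_pre([[1, 2], [3, 4, 5, 6, 7]], maxlen=4))
```

Since `max_len` here is the length of the longest training tweet (49), nothing is truncated in this cell; truncation only matters for the IMDB reviews, which are capped at 1600 tokens.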


In [19]:
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM, Convolution1D, Flatten, Dropout
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence
from keras.callbacks import TensorBoard

model = Sequential()
embedding_vector_length = 300
model.add(Embedding(len(word_index) + 1, embedding_vector_length, input_length=max_len))

# Convolutional model (3x conv, flatten, 2x dense)
model.add(Convolution1D(64, 3, padding='same'))
model.add(Convolution1D(32, 3, padding='same'))
model.add(Convolution1D(16, 3, padding='same'))
model.add(Flatten())
model.add(Dropout(0.2))
model.add(Dense(180,activation='sigmoid'))
model.add(Dropout(0.2))
model.add(Dense(1,activation='sigmoid'))

tensorBoardCallback = TensorBoard(log_dir='./logs', write_graph=True)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
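
The `Convolution1D` layers above are created without an `activation` argument, which in Keras means a linear activation. A stack of purely linear convolutions can be expressed as one wider convolution, so passing something like `activation='relu'` may be worth trying. The sketch below illustrates the collapse for the single-channel case in plain Python, ignoring biases and edge padding; the `convolve` helper and the toy kernels are assumptions made for illustration. Keras computes cross-correlation rather than flipped convolution, but the linearity argument is the same.

```python
def convolve(signal, kernel):
    """Full discrete convolution of two 1-D sequences."""
    out = [0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out

x = [1, 2, 3, 4]
k1 = [1, 0, -1]
k2 = [1, 2, 1]

stacked = convolve(convolve(x, k1), k2)   # two linear layers in sequence
merged_kernel = convolve(k1, k2)          # one equivalent kernel of width 5
merged = convolve(x, merged_kernel)

print(stacked == merged)
```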

In [20]:
model.fit(X_train, Y_train, epochs=3, callbacks=[tensorBoardCallback], batch_size=64)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x1201740b8>

In [101]:
test_docs = annotated[:1000]

In [21]:

# serialize model to JSON
model_json = model.to_json()
with open("model_tr_data.json", "w") as json_file:
    json_file.write(model_json)
# serialize weights to HDF5
model.save_weights("model_tr_data.h5")
print("Saved model to disk")

Saved model to disk


In [22]:
import pickle

# Save the fitted tokenizer so the same word index can be reused at prediction time
with open('tokenizer_cnn_tr.pickle', 'wb') as f:
    pickle.dump(tokenizer, f)
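
Restoring the tokenizer later is the mirror image of the dump. The sketch below round-trips a stand-in dictionary through a temporary file so that it runs without Keras; the file name and the stand-in object are assumptions made for illustration.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted Tokenizer; any picklable object round-trips the same way
stand_in_tokenizer = {"word_index": {"good": 1, "movie": 2}, "num_words": 1000}

path = os.path.join(tempfile.mkdtemp(), "tokenizer_demo.pickle")

with open(path, "wb") as f:
    pickle.dump(stand_in_tokenizer, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == stand_in_tokenizer)
```

In the notebook, the same `pickle.load` call on `tokenizer_cnn_tr.pickle` should return the fitted `Tokenizer`, provided a compatible Keras version is importable. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.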