### Sentiment Analysis Using Deep Learning

This notebook trains a convolutional neural network to classify the sentiment of movie reviews. The data is the IMDB Large Movie Review Dataset, which is freely available online.

#### Loading raw data

The first step is to load the IMDB data into memory.

In [1]:
# CNN for sequence classification on the IMDB dataset
import numpy
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM, Convolution1D, Flatten, Dropout
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence
from keras.callbacks import TensorBoard

# Load the dataset with Keras, keeping only the top_words most frequent words
top_words = 10000
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=top_words)

Using TensorFlow backend.


In [2]:
# Pad (or truncate) every review to the same length
max_review_length = 1600
X_train = sequence.pad_sequences(X_train, maxlen=max_review_length)
X_test = sequence.pad_sequences(X_test, maxlen=max_review_length)

# Learn word embeddings with a Keras Embedding layer
embedding_vector_length = 300
model = Sequential()
model.add(Embedding(top_words, embedding_vector_length, input_length=max_review_length))

# Convolutional model (3x conv, flatten, 2x dense)
model.add(Convolution1D(64, 3, padding='same'))
model.add(Convolution1D(32, 3, padding='same'))
model.add(Convolution1D(16, 3, padding='same'))
model.add(Flatten())
model.add(Dropout(0.2))
model.add(Dense(180, activation='sigmoid'))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))

# Log to tensorboard
tensorBoardCallback = TensorBoard(log_dir='./logs', write_graph=True)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
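
As a rough illustration of the padding step in this cell, the sketch below mimics what `sequence.pad_sequences` does with its default settings (zeros added at the front, overly long sequences cut from the front). The helper name `pad_pre` and the sample sequences are hypothetical and only serve the example; the real function works on whole batches and returns a NumPy array.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad or left-truncate one sequence to exactly maxlen items.

    Hypothetical helper: assumes maxlen > 0 and mirrors the Keras defaults
    padding='pre', truncating='pre', value=0.
    """
    seq = list(seq)[-maxlen:]                    # keep only the last maxlen tokens
    return [value] * (maxlen - len(seq)) + seq   # fill the front with the pad value


short_review = [11, 12, 13]
long_review = [1, 2, 3, 4, 5, 6, 7]

print(pad_pre(short_review, 5))  # [0, 0, 11, 12, 13]
print(pad_pre(long_review, 5))   # [3, 4, 5, 6, 7]
```

With `max_review_length = 1600`, most reviews are shorter than the limit, so they end up as a run of zeros followed by the word indices.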

In [3]:
model.fit(X_train, y_train, epochs=3, callbacks=[tensorBoardCallback], batch_size=64)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x12e613048>
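
To get a feel for the size of the model trained above, the trainable parameters can be counted by hand. The sketch below is plain arithmetic under the assumption that each layer uses a bias term and that `padding='same'` with stride 1 keeps the sequence length at 1600; the helper names are made up for this example. In the notebook itself, `model.summary()` reports the same kind of breakdown.

```python
def conv1d_params(in_channels, filters, kernel_size):
    """Weights plus one bias per filter for a 1D convolution."""
    return kernel_size * in_channels * filters + filters


def dense_params(in_units, out_units):
    """Weights plus one bias per output unit for a dense layer."""
    return in_units * out_units + out_units


top_words, embed_dim, seq_len = 10000, 300, 1600

layers = [
    ("embedding", top_words * embed_dim),
    ("conv1d_64", conv1d_params(embed_dim, 64, 3)),
    ("conv1d_32", conv1d_params(64, 32, 3)),
    ("conv1d_16", conv1d_params(32, 16, 3)),
    # Flatten turns (1600, 16) into 25600 features before the dense layer
    ("dense_180", dense_params(seq_len * 16, 180)),
    ("dense_1", dense_params(180, 1)),
]

for name, count in layers:
    print("%-10s %10d" % (name, count))

total = sum(count for _, count in layers)
print("total      %10d" % total)  # total         7673753
```

Most of the parameters sit in the embedding and in the first dense layer, because flattening a 1600-step sequence produces a very wide input.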

In [4]:
# Evaluation on the test set
scores = model.evaluate(X_test, y_test, verbose=0)

In [6]:
print("Accuracy: %.2f%%" % (scores[1]*100))
# serialize model to JSON
model_json = model.to_json()
with open("model.json", "w") as json_file:
    json_file.write(model_json)
# serialize weights to HDF5
model.save_weights("model.h5")
print("Saved model to disk")

Accuracy: 84.73%
Saved model to disk
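
The accuracy printed above is the fraction of test reviews whose predicted class matches the label. A minimal sketch of that computation follows, assuming the sigmoid output is turned into a class with a 0.5 threshold; the function name and the four sample values are invented for illustration and are not taken from the IMDB run.

```python
def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of examples where the thresholded probability equals the label."""
    preds = [1 if p > threshold else 0 for p in y_prob]
    correct = sum(int(p == t) for p, t in zip(preds, y_true))
    return correct / len(y_true)


# Made-up labels and sigmoid outputs, for illustration only
y_true = [1, 0, 1, 1]
y_prob = [0.91, 0.20, 0.35, 0.77]

acc = binary_accuracy(y_true, y_prob)
print("Accuracy: %.2f%%" % (acc * 100))  # Accuracy: 75.00%
```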


In [35]:
from keras.preprocessing.text import one_hot

X = [one_hot('If you like adult comedy cartoons, like South Park, then this is nearly a similar format about the small adventures of three teenage girls at Bromwell High. Keisha, Natella and Latrina have given exploding sweets and behaved like bitches, I think Keisha is a good leader. There are also small stories going on with the teachers of the school. Theres the idiotic principal, Mr. Bip, the nervous Maths teacher and many others. The cast is also fantastic, Lenny Henrys Gina Yashere, EastEnders Chrissie Watts, Tracy-Ann Oberman, Smack The Ponys Doon Mackichan, Dead Ringers Mark Perry and Blunders Nina Conti. I didnt know this came from Canada, but it is very good. Very good!',top_words)]

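# Keras IMDB labels: 1 is positive, 0 is negative.
# Caveat: one_hot hashes words to arbitrary indices that do not match the
# IMDB word index used for training, so this score is not a reliable prediction.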
X = sequence.pad_sequences(X, maxlen=max_review_length)
model.predict(X)

array([[ 0.00130955]], dtype=float32)
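
For a prediction on new text to be meaningful, the text has to be encoded with the same word-to-index mapping that the training data used. The sketch below shows one way this could look, assuming the default conventions of `imdb.load_data` (index 1 marks the start of a review, index 2 stands for out-of-vocabulary words, and word ranks are shifted by 3). The function `encode_review`, the tokenizing regex, and the tiny `toy_index` with its ranks are hypothetical; in the notebook the mapping would come from `imdb.get_word_index()`.

```python
import re

START, OOV, INDEX_FROM = 1, 2, 3  # assumed to match the imdb.load_data defaults


def encode_review(text, word_index, num_words):
    """Map raw text to the integer ids the IMDB-trained model expects."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())  # simple tokenizer (assumption)
    ids = [START]
    for tok in tokens:
        rank = word_index.get(tok)
        idx = rank + INDEX_FROM if rank is not None else OOV
        # Words outside the top num_words vocabulary also become OOV
        ids.append(idx if idx < num_words else OOV)
    return ids


# Toy mapping with made-up ranks, standing in for imdb.get_word_index()
toy_index = {"the": 1, "and": 2, "movie": 17, "good": 49, "bromwell": 50000}

print(encode_review("The movie: very good!", toy_index, num_words=10000))
# [1, 4, 20, 2, 52]
```

The resulting list can then be wrapped in an outer list, passed through `sequence.pad_sequences` with `maxlen=max_review_length`, and given to `model.predict`.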

### Twitter dataset

This section trains a deep neural network on the annotated Twitter dataset.

In [36]:
import glob
import pandas as pd

# Load the annotated data
paths = glob.glob("./annotated/*.csv")
a_frames = []

for path in paths:
    # Read as Latin-1 (ISO-8859-1) instead of the default UTF-8
    partial_df = pd.read_csv(path, encoding="ISO-8859-1")
    # Use the parsed creation timestamp as the index
    partial_df['created_at'] = pd.to_datetime(partial_df['created_at'])
    partial_df = partial_df.set_index('created_at')
    a_frames.append(partial_df)

annotated = pd.concat(a_frames)

In [38]:
annotated_n = len(annotated)

print("%i annotated rows" % annotated_n)

1000 annotated rows
