# Anime Twitter Sentiment Analysis with Keras

In [1]:
import pandas as pd
import numpy as np
import re
import pydotplus as pydot

import warnings
warnings.simplefilter("ignore", UserWarning)
from matplotlib import pyplot as plt
%matplotlib inline

from sklearn.model_selection import train_test_split
#from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score, roc_auc_score, classification_report, confusion_matrix

#import nltk
#from nltk.corpus import stopwords
#from nltk.tokenize import word_tokenize

from keras.models import Model

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.text import text_to_word_sequence
from keras.preprocessing.sequence import pad_sequences

from keras.layers import Input, Dense, Embedding, MaxPooling1D
from keras.layers import SpatialDropout1D, concatenate, Dropout, BatchNormalization
from keras.layers import LSTM, GRU, Bidirectional, GlobalAveragePooling1D, GlobalMaxPooling1D, Conv1D, SeparableConv1D

from keras.callbacks import ModelCheckpoint
from keras.models import load_model
from keras.utils.vis_utils import plot_model

Using TensorFlow backend.


## The Data

The Twitter Sentiment Analysis Dataset contains 1,578,627 classified tweets. Each row is marked 1 for positive sentiment or 0 for negative sentiment. The dataset is based on data from the following two sources:

* University of Michigan Sentiment Analysis competition on Kaggle
* Twitter Sentiment Corpus by Niek Sanders

It can be found at http://thinknook.com/twitter-sentiment-analysis-training-corpus-dataset-2012-09-22/

In [2]:
data = pd.read_csv('data/twitter_sentiment_dataset.csv', encoding='latin1', usecols=['Sentiment', 'SentimentText'])
data.columns = ['sentiment', 'text']

In [3]:
data.head()

| | sentiment | text |
|---|---|---|
| 0 | 0 | is so sad for my APL frie... |
| 1 | 0 | I missed the New Moon trail... |
| 2 | 1 | omg its already 7:30 :O |
| 3 | 0 | .. Omgaga. Im sooo im gunna CRy. I'... |
| 4 | 0 | i think mi bf is cheating on me!!! ... |


In [4]:
print(data.shape)

(1578614, 2)


## Preprocessing

First, we need to clean the text by removing @mentions, URLs, and all special characters.

In [5]:
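def clean_text(tweet):
    # Replace @mentions, non-alphanumeric characters, and URLs with spaces, then collapse whitespace.
    # A raw string keeps the regex escapes (\w, \S, \/) from being read as string escapes.
    return ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", tweet).split())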

In [6]:
data['text'] = data['text'].map(clean_text)
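
As a quick check of what the cleaning step does, here is a small sketch that applies the same regular expression to a made-up tweet. The sample text, the URL, and the helper name `clean_text_demo` are illustrative assumptions, not part of the dataset or the notebook.

```python
import re

# Same pattern as clean_text, written as a raw string.
PATTERN = r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"

def clean_text_demo(tweet):
    # Substitute every match with a space, then collapse runs of whitespace.
    return ' '.join(re.sub(PATTERN, " ", tweet).split())

sample = "@user I can't wait!!! http://example.com/a?b=1 #anime"  # hypothetical tweet
cleaned = clean_text_demo(sample)
print(cleaned)  # I can t wait anime
```

Note that apostrophes are replaced too, so contractions such as "can't" are split into "can t".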

It should be noted that I also tried removing stop words from the text, but that actually resulted in worse accuracy for my model.

Now that the data is clean, let's prepare the training and testing sets.

In [7]:
X_train, X_test, y_train, y_test = train_test_split(data['text'], 
                                                    data['sentiment'], 
                                                    test_size=0.1, 
                                                    random_state=42,
                                                    stratify=data['sentiment'])

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(1420752,) (157862,) (1420752,) (157862,)


In order to use Keras on text data, we need to tokenize the text first. This can be done with the Tokenizer class, specifying the maximum number of words to keep (the most frequent ones).

In [8]:
MAX_WORDS = 100000
tokenizer = Tokenizer(num_words=MAX_WORDS)

tokenizer.fit_on_texts(data['text'])

In [9]:
word_index = tokenizer.word_index

print('There are {} unique tokens.'.format(len(word_index)))

There are 288603 unique tokens.


## GloVe Embedding 

There are various precomputed databases of word embeddings that you can download and use in a Keras Embedding layer. Word2vec is one of them. Another popular one is called Global Vectors for Word Representation (GloVe, https://nlp.stanford.edu/projects/glove), which was developed by Stanford researchers in 2014. This embedding technique is based on factorizing a matrix of word co-occurrence statistics.

For my model, I use the pretrained 200-dimensional GloVe vectors trained on Twitter data (glove.twitter.27B.200d.txt).

In [10]:
embed_size = 200

def loadGloveModel(glove_file, embed_size=200):
    print("Loading Glove Model..")
    word_embedding_dict = {}

    # Each line holds a token followed by its vector components, separated by spaces.
    # The context manager closes the file even if parsing fails.
    with open(glove_file, 'r', encoding='utf-8') as file:
        for line in file:
            word_embedding = line.split()
            word = word_embedding[0]
            embedding = np.asarray([float(val) for val in word_embedding[1:]], dtype='float32')
            word_embedding_dict[word] = embedding

    # Drop entries whose vector does not have the expected size
    word_embedding_dict = {k: v for k, v in word_embedding_dict.items() if len(v) == embed_size}

    print("Done.",len(word_embedding_dict)," words loaded!")
    return word_embedding_dict

glove_embedding_dict = loadGloveModel('glove.twitter.27B.200d.txt', embed_size)

Loading Glove Model..
Done. 1193513  words loaded!


Here's an example of the embedding for the word 'heart'.

In [11]:
glove_embedding_dict['heart']

array([ 0.068954 , -0.064559 , -0.2532   ,  0.24135  ,  0.34572  ,
       -0.1799   ,  0.56555  , -0.12854  , -0.32679  , -0.24896  ,
        0.23996  , -0.19216  , -1.4679   ,  0.41168  , -0.48718  ,
        0.073335 , -0.16831  , -0.43334  ,  0.5277   , -0.2179   ,
       -0.087065 ,  0.21645  ,  0.10003  ,  0.29946  ,  0.23227  ,
        0.88475  , -0.51618  ,  0.13294  ,  0.49017  ,  0.48222  ,
       -0.33084  , -0.29034  ,  0.42365  ,  0.42157  ,  0.073902 ,
       -0.38198  , -0.20283  , -0.54664  , -0.24354  ,  0.40618  ,
       -0.54074  ,  0.16636  ,  0.45591  ,  0.26943  ,  0.0058961,
        0.15221  ,  0.5307   ,  0.20654  ,  0.11243  ,  0.20151  ,
       -0.2208   ,  0.45178  ,  0.16479  ,  0.095516 ,  0.45435  ,
       -0.31694  ,  0.45188  ,  0.58922  , -0.071485 ,  0.050712 ,
       -0.11368  ,  0.12427  ,  0.015246 ,  0.074834 ,  0.24083  ,
       -0.14761  ,  0.59176  ,  0.3154   ,  0.0070812,  0.19171  ,
       -0.14031  , -0.16693  ,  0.069561 , -0.020206 ,  0.4543

Now I build an embedding matrix to be used in my Keras model.

In [12]:
def genEmbeddingMatrix(embedding_dict, tokenizer, max_words=100000, embed_size=200):
    all_embeddings = np.stack(list(embedding_dict.values()))
    embedding_mean, embedding_std = all_embeddings.mean(), all_embeddings.std()
    embedding_matrix = np.random.normal(embedding_mean, embedding_std, (max_words, embed_size))
    
    word_index = tokenizer.word_index

    for word, i in word_index.items():
        if word in embedding_dict and i < max_words:
            embedding_matrix[i] = embedding_dict[word]
    
    return embedding_matrix

embedding_matrix = genEmbeddingMatrix(glove_embedding_dict, tokenizer, max_words=MAX_WORDS)
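
To make the behaviour of the embedding matrix concrete, here is a minimal sketch on a toy vocabulary. The three-dimensional vectors, the word indices, and the seed are made up for illustration. Words with a pretrained vector get that vector copied into their row, while unknown words and the unused row 0 keep random values drawn from the overall mean and standard deviation.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable

# Hypothetical pretrained vectors and tokenizer index
toy_vectors = {"good": np.array([1.0, 0.0, 0.0]),
               "bad": np.array([0.0, 1.0, 0.0])}
toy_word_index = {"good": 1, "zzz": 2, "bad": 3}  # "zzz" has no pretrained vector
max_words, dim = 4, 3

# Start from random rows that match the statistics of the pretrained vectors
stacked = np.stack(list(toy_vectors.values()))
matrix = rng.normal(stacked.mean(), stacked.std(), (max_words, dim))

# Overwrite the rows of the words that have a pretrained vector
for word, i in toy_word_index.items():
    if i < max_words and word in toy_vectors:
        matrix[i] = toy_vectors[word]

print(matrix.shape)
print(matrix[1], matrix[3])
```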

In order to feed the tweets to Keras, I convert each one into a sequence of integers using the Tokenizer fitted earlier.

In [13]:
X_train[15]

'lt This is the way i feel right now'

In [14]:
tokenizer.texts_to_sequences([X_train[15]])

[[159, 28, 9, 3, 131, 1, 110, 117, 29]]

Each word is mapped to an integer index. Indices are assigned in order of decreasing frequency, so the most frequent word gets the lowest index. For example, the word 'i' corresponds to index 1.
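
The mapping can be sketched in plain Python. This is a simplified stand-in for what the Keras Tokenizer does, not its actual implementation: the helper names and the toy corpus are assumptions, and punctuation filtering is left out. It also shows why the word index can hold more entries than the word limit: the limit is only applied when texts are converted to sequences.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by descending frequency; index 0 is left free for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = sorted(counts.items(), key=lambda kv: -kv[1])  # stable sort keeps first-seen order on ties
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def to_sequences(texts, word_index, num_words=None):
    """Map words to indices, dropping unknown words and indices >= num_words."""
    sequences = []
    for text in texts:
        ids = [word_index[w] for w in text.lower().split() if w in word_index]
        if num_words is not None:
            ids = [i for i in ids if i < num_words]
        sequences.append(ids)
    return sequences

toy_texts = ["i feel good", "i feel sad", "i am here"]  # hypothetical corpus
toy_index = fit_word_index(toy_texts)
toy_sequences = to_sequences(["i feel sad today"], toy_index, num_words=3)
print(toy_index)      # {'i': 1, 'feel': 2, 'good': 3, 'sad': 4, 'am': 5, 'here': 6}
print(toy_sequences)  # [[1, 2]]
```

Here "today" is dropped because it was never seen, and "sad" is dropped because its index (4) is not below the limit of 3.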

In [15]:
train_sequences = tokenizer.texts_to_sequences(X_train)
test_sequences = tokenizer.texts_to_sequences(X_test)

Now that the tweets are a list of integers, we need to make sure the lists are all the same size in order to stack them.

In [16]:
MAX_LENGTH = 35
padded_train_sequences = pad_sequences(train_sequences, maxlen=MAX_LENGTH)
padded_test_sequences = pad_sequences(test_sequences, maxlen=MAX_LENGTH)

In [17]:
padded_train_sequences

array([[    0,     0,     0, ...,   162,   356,   224],
       [    0,     0,     0, ...,   879,  1656,   661],
       [    0,     0,     0, ...,     0,   153,  6543],
       ...,
       [    0,     0,     0, ...,  1504,  1469, 26172],
       [    0,     0,     0, ...,    55,    94,   433],
       [    0,     0,     0, ...,   193,    13,     6]], dtype=int32)

In [18]:
padded_train_sequences.shape

(1420752, 35)
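
The zeros on the left of each row come from padding. Here is a minimal pure-Python sketch of the default behaviour of pad_sequences, which pads at the front and also truncates from the front. The helper name `pad_pre` and the toy sequences are assumptions for illustration.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences with `value`; keep only the last `maxlen` items of long ones."""
    padded = []
    for seq in sequences:
        trimmed = seq[-maxlen:]  # drop tokens from the front when the sequence is too long
        padded.append([value] * (maxlen - len(trimmed)) + trimmed)
    return padded

toy_padded = pad_pre([[1, 2, 3], [4, 5, 6, 7, 8, 9]], maxlen=5)
print(toy_padded)  # [[0, 0, 1, 2, 3], [5, 6, 7, 8, 9]]
```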

# Architecture

## Embedding Layer
* In an embedding, words are represented by dense vectors where a vector represents the projection of the word into a continuous vector space.
* The position of a word within the vector space is learned from text and is based on the words that surround the word when it is used.
* The position of a word in the learned vector space is referred to as its embedding.
* Keras offers an Embedding layer that can be used for neural networks on text data. It requires that the input data be integer encoded, so that each word is represented by a unique integer.  The Embedding layer is normally initialized with random weights and will learn an embedding for all of the words in the training dataset.
* In this case I use the pretrained **GloVe** embedding matrix.
* Setting trainable to True yielded better results for me.


## Bidirectional LSTM
* A bidirectional RNN exploits the idea that an RNN trained on reversed sequences will learn different representations than one trained on the original sequences.
* It looks at its input sequence both ways, obtaining potentially richer representations and capturing patterns that may have been missed by the chronological-order version alone.
* Recurrent dropout is used to reduce overfitting.
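
The two-directional reading can be illustrated with a toy recurrent cell in plain Python. This is only a sketch under simplifying assumptions: a scalar state with made-up weights `w` and `u` stands in for the LSTM, and both directions share the same weights, whereas a real bidirectional layer trains a separate set for each direction.

```python
import math

def run_rnn(inputs, w=0.5, u=0.8):
    """Toy recurrent cell: h_t = tanh(w * x_t + u * h_{t-1})."""
    h, states = 0.0, []
    for x in inputs:
        h = math.tanh(w * x + u * h)
        states.append(h)
    return states

def bidirectional(inputs):
    forward = run_rnn(inputs)
    # Process the reversed sequence, then flip the states back so they line up with the input order
    backward = run_rnn(inputs[::-1])[::-1]
    # Pair (concatenate) the two states at every timestep
    return list(zip(forward, backward))

outputs = bidirectional([1.0, 0.0, 0.0])
for step, (f, b) in enumerate(outputs):
    print(step, round(f, 3), round(b, 3))
```

At the last step the forward state still carries a trace of the first input, while the backward state there has only seen the final zero. Each timestep ends up with two states, which is why a bidirectional layer with 100 units outputs 200 features per step.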

## SeparableConv1D
* 1D convolution layers can recognize local patterns in a sequence. 
* Because the same input transformation is performed on every patch, a pattern learned at a certain position in a sentence can later be recognized at a different position.
* I opted for a Depthwise Separable Convolution Layer because it separates the learning of spatial features and the learning of channel-wise features.
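
One practical effect of the separable layer is a smaller weight count. The sketch below compares the number of parameters of a regular 1D convolution and a depthwise separable one for the shapes used in this notebook: kernel size 3, 200 input channels (two LSTM directions with 100 units each) and 128 filters. It assumes a depth multiplier of 1 and a bias term, and the helper names are made up.

```python
def conv1d_params(kernel_size, in_channels, filters):
    # One kernel of size kernel_size x in_channels per filter, plus one bias per filter
    return kernel_size * in_channels * filters + filters

def separable_conv1d_params(kernel_size, in_channels, filters):
    depthwise = kernel_size * in_channels  # one spatial kernel per input channel
    pointwise = in_channels * filters      # 1x1 convolution that mixes the channels
    return depthwise + pointwise + filters # plus one bias per filter

regular = conv1d_params(3, 200, 128)
separable = separable_conv1d_params(3, 200, 128)
print(regular, separable)  # 76928 26328
```

With these shapes the separable layer needs roughly a third of the weights of the regular convolution.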

## Combining RNN & CNN
Source: <a href="http://konukoii.com/blog/2018/02/19/twitter-sentiment-analysis-using-combined-lstm-cnn-models/">here</a>
* The idea behind combining an RNN and a CNN is that each output token of the RNN stores information about not only the token at that position but also any previous tokens. In other words, the LSTM layer generates a new encoding of the original input. The output of the LSTM layer is then fed into a convolution layer, which we expect will extract local features.
* Finally, the convolution layer's output is pooled to a smaller dimension and ultimately mapped to either a positive or a negative label.
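
The pooling and output step can be sketched with NumPy. The feature tensor below is random and stands in for the convolution output, and the dense weights are untrained, so the probabilities mean nothing. The point is only to show how the shapes change from (batch, timesteps, channels) to a single probability per tweet.

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(size=(2, 35, 128))  # stand-in for the conv output: (batch, timesteps, channels)

avg_pool = features.mean(axis=1)          # global average pooling over time -> (2, 128)
max_pool = features.max(axis=1)           # global max pooling over time     -> (2, 128)
concat = np.concatenate([avg_pool, max_pool], axis=1)  # -> (2, 256)

dense_weights = rng.normal(scale=0.05, size=(256, 1))  # hypothetical, untrained weights
dense_bias = np.zeros(1)
prob = 1.0 / (1.0 + np.exp(-(concat @ dense_weights + dense_bias)))  # sigmoid -> (2, 1)

print(concat.shape, prob.shape)
```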

## Batch Normalization
* Batch normalization is a type of layer (BatchNormalization in Keras) introduced in 2015 by Ioffe and Szegedy. It can adaptively normalize data even as the mean and variance change over time during training. It works by internally maintaining an exponential moving average of the batch-wise mean and variance of the data seen during training. 
* Data is usually normalized before being fed into a model, but normalization should also be taken care of after every transformation performed by the network.
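
Here is a rough NumPy sketch of one training step of batch normalization. It normalizes a batch with its own mean and variance and updates the moving averages that are used at inference time. The learnable scale and shift parameters are left out for brevity, the helper name is made up, and the momentum and epsilon values are the Keras defaults as far as I know.

```python
import numpy as np

def batch_norm_train_step(x, moving_mean, moving_var, momentum=0.99, eps=1e-3):
    batch_mean = x.mean(axis=0)
    batch_var = x.var(axis=0)
    # Normalize each feature with the statistics of the current batch
    normalized = (x - batch_mean) / np.sqrt(batch_var + eps)
    # Exponential moving averages, kept for use at inference time
    new_mean = momentum * moving_mean + (1 - momentum) * batch_mean
    new_var = momentum * moving_var + (1 - momentum) * batch_var
    return normalized, new_mean, new_var

batch = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])  # toy batch: 3 samples, 2 features
normalized, moving_mean, moving_var = batch_norm_train_step(batch, np.zeros(2), np.ones(2))
print(normalized.mean(axis=0))  # close to 0 for both features
print(moving_mean)              # moved 1% of the way towards the batch means [3, 30]
```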

In [23]:
def build_lstm_cnn_model(max_words, embedding_dim, embedding_matrix=None):
    if embedding_matrix is None:
        embedding_matrix = np.random.random((max_words, embedding_dim))
    
    inp = Input(shape=(MAX_LENGTH, ))
    
    # glove embedding
    x = Embedding(input_dim=max_words, output_dim=embedding_dim, input_length=MAX_LENGTH, 
                  weights=[embedding_matrix], trainable=True)(inp)
    x = Bidirectional(LSTM(units=100, return_sequences=True, recurrent_dropout=0.2))(x)
    x = SeparableConv1D(128, kernel_size=3, padding="same", kernel_initializer="random_uniform")(x)
    x = BatchNormalization()(x)

    avg_pool = GlobalAveragePooling1D()(x)
    max_pool = GlobalMaxPooling1D()(x)
    concat = concatenate([avg_pool, max_pool])
    
    outp = Dense(1, activation="sigmoid")(concat)
    
    model = Model(inputs=inp, outputs=outp)
    model.compile(loss='binary_crossentropy',
                  optimizer='RMSprop',
                  metrics=['accuracy'])
    return model

lstm_cnn_model = build_lstm_cnn_model(max_words=MAX_WORDS, embedding_dim=200, embedding_matrix=embedding_matrix)

The network can be visualized with Keras' plot_model function.

In [24]:
plot_model(lstm_cnn_model, to_file='rnn.png', show_shapes=True, show_layer_names=True)

![rnn](rnn.png)

In [25]:
filepath="./models/weights-improvement-{epoch:02d}-{val_acc:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True, mode='max')

batch_size = 512
epochs = 2

history = lstm_cnn_model.fit(x=padded_train_sequences, 
                             y=y_train, 
                             validation_data=(padded_test_sequences, y_test), 
                             batch_size=batch_size, 
                             callbacks=[checkpoint], 
                             epochs=epochs, 
                             verbose=1)

Train on 1420752 samples, validate on 157862 samples
Epoch 1/2

Epoch 00001: val_acc improved from -inf to 0.83408, saving model to ./models/weights-improvement-01-0.8341.hdf5
Epoch 2/2

Epoch 00002: val_acc improved from 0.83408 to 0.83999, saving model to ./models/weights-improvement-02-0.8400.hdf5


In [26]:
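# Note: the epoch number (02) is hard-coded, so this path is only valid when the best checkpoint came from epoch 2
best_lstm_cnn_model = load_model('./models/weights-improvement-02-{:0.4f}.hdf5'.format(checkpoint.best))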

y_pred_rnn_cnn = best_lstm_cnn_model.predict(padded_test_sequences, verbose=1, batch_size=2048)

y_pred_rnn_cnn = pd.DataFrame(y_pred_rnn_cnn, columns=['prediction'])
y_pred_rnn_cnn['prediction'] = y_pred_rnn_cnn['prediction'].map(lambda p: 1 if p >= 0.5 else 0)



In [27]:
def printClassificationErrors(y_test, y_pred):
    print('Confusion Matrix:')
    print(confusion_matrix(y_test, y_pred))
    print('Classification Report:')
    print(classification_report(y_test, y_pred))
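    # y_pred holds hard 0/1 labels here, so this value equals balanced accuracy;
    # pass predicted probabilities instead to get a threshold-independent ROC AUC
    print('ROC AUC score: {}'.format(roc_auc_score(y_test, y_pred)))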
    print('Accuracy Score: {}'.format(accuracy_score(y_test, y_pred)))

printClassificationErrors(y_test, y_pred_rnn_cnn['prediction'])

Confusion Matrix:
[[65959 12885]
 [12375 66643]]
Classification Report:
             precision    recall  f1-score   support

          0       0.84      0.84      0.84     78844
          1       0.84      0.84      0.84     79018

avg / total       0.84      0.84      0.84    157862

ROC AUC score: 0.8399830685925739
Accuracy Score: 0.8399868239348292


We obtained a validation accuracy of about 84%, with precision and recall of 0.84 for both classes.
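
The reported scores can be recomputed by hand from the confusion matrix above, which is a useful sanity check. The sketch assumes the scikit-learn layout (rows are true labels, columns are predicted labels). It also shows that, because the ROC AUC was computed on thresholded 0/1 predictions, it is simply the average of the two per-class recalls.

```python
# Values copied from the confusion matrix printed above
tn, fp = 65959, 12885   # true class 0: predicted 0, predicted 1
fn, tp = 12375, 66643   # true class 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tn + fp + fn + tp)
recall_negative = tn / (tn + fp)
recall_positive = tp / (tp + fn)
balanced_accuracy = (recall_negative + recall_positive) / 2

print('{:.5f}'.format(accuracy))           # 0.83999
print('{:.5f}'.format(balanced_accuracy))  # 0.83998
```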

# Grabbing Tweet Data using the Tweepy API

We can grab live Twitter data through the Twitter streaming API using the Tweepy library. I created a subclass of Tweepy's StreamListener class in order to add parameters like a time limit, a maximum number of tweets, whether to keep retweets, and a set of filter words.

In [28]:
from Modules.tweepy_streaming import saveTweepyTweets
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream
import json
import config

In [29]:
tweepy_listener = saveTweepyTweets(time_limit=240, 
                                   num_of_tweets=50, 
                                   save_file='twitter_stream_data.json', 
                                   retweets=False, 
                                   filter_set=config.FILTERED_WORDS)
auth = OAuthHandler(config.CONSUMER_KEY, config.CONSUMER_SECRET)
auth.set_access_token(config.ACCESS_TOKEN, config.ACCESS_TOKEN_SECRET)
stream = Stream(auth=auth, listener=tweepy_listener)

In [30]:
stream.filter(track=['anime'], languages=['en'])

Getting tweet #1...
Getting tweet #2...
Getting tweet #3...
Getting tweet #4...
Getting tweet #5...
Getting tweet #6...
Getting tweet #7...
Getting tweet #8...
Getting tweet #9...
Getting tweet #10...
Getting tweet #11...
Getting tweet #12...
Getting tweet #13...
Getting tweet #14...
Getting tweet #15...
Getting tweet #16...
Filtering...
Getting tweet #17...
Getting tweet #18...
Getting tweet #19...
Filtering...
Getting tweet #20...
Getting tweet #21...
Filtering...
Getting tweet #22...
Getting tweet #23...
Getting tweet #24...
Getting tweet #25...
Filtering...
Getting tweet #26...
Getting tweet #27...
Getting tweet #28...
Getting tweet #29...
Getting tweet #30...
Getting tweet #31...
Getting tweet #32...
Getting tweet #33...
Filtering...
Getting tweet #34...
Getting tweet #35...
Getting tweet #36...
Getting tweet #37...
Getting tweet #38...
Getting tweet #39...
Getting tweet #40...
Getting tweet #41...
Getting tweet #42...
Getting tweet #43...
Getting tweet #44...
Getting tweet #45...

Now we can compile the tweet data into a DataFrame, extract the text, and clean it.

In [31]:
def tweetsToDataFrame(json_file):
    data = []
    with open(json_file, 'r') as json_data:
        for line in json_data:
            tweet = json.loads(line) # load it as Python dict
            data.append(tweet)
    return pd.DataFrame(data)

tweet_df = tweetsToDataFrame('twitter_stream_data.json')
tweet_df.head()

Unnamed: 0,contributors,coordinates,created_at,display_text_range,entities,extended_entities,extended_tweet,favorite_count,favorited,filter_level,...,quoted_status_id_str,quoted_status_permalink,reply_count,retweet_count,retweeted,source,text,timestamp_ms,truncated,user
0,,,Sun Jul 15 04:15:29 +0000 2018,"[14, 40]","{'urls': [], 'symbols': [], 'hashtags': [], 'u...",,,0,False,low,...,,,0,0,False,"<a href=""http://twitter.com/download/android"" ...",@BucksMachine Tfw metal gear solid anime,1531628129029,False,"{'location': 'Michigan, USA', 'profile_sidebar..."
1,,,Sun Jul 15 04:15:29 +0000 2018,,"{'urls': [{'indices': [39, 62], 'expanded_url'...",,,0,False,low,...,,,0,0,False,"<a href=""https://www.google.com/"" rel=""nofollo...",I added a video to a @YouTube playlist https:/...,1531628129230,False,"{'location': 'Fontana, CA', 'profile_sidebar_f..."
2,,,Sun Jul 15 04:15:29 +0000 2018,"[0, 35]","{'urls': [{'indices': [36, 59], 'expanded_url'...",,,0,False,low,...,1.0176412362939636e+18,{'expanded': 'https://twitter.com/EGGCHlM/stat...,0,0,False,"<a href=""http://twitter.com/download/android"" ...",What did you do to Chimmy shdksbskj https://t....,1531628129464,False,"{'location': 'partly 🔞', 'profile_sidebar_fill..."
3,,,Sun Jul 15 04:15:30 +0000 2018,,"{'urls': [{'indices': [94, 117], 'expanded_url...",,,0,False,low,...,,,0,0,False,"<a href=""https://ifttt.com"" rel=""nofollow"">IFT...",Anime Expo Opens for Fans of Japanese Culture ...,1531628130783,False,"{'location': None, 'profile_sidebar_fill_color..."
4,,,Sun Jul 15 04:15:31 +0000 2018,"[0, 69]","{'urls': [{'indices': [46, 69], 'expanded_url'...","{'media': [{'sizes': {'thumb': {'h': 150, 'res...",,0,False,low,...,,,0,0,False,"<a href=""http://publicize.wp.com/"" rel=""nofoll...","cFreeze’s Anime Watching: Week 28, 2018 Recap ...",1531628131733,False,"{'location': 'United States', 'profile_sidebar..."


## Preprocessing the tweets

We can preprocess the tweets using the same process as earlier.

In [32]:
tweet_df['text'] = tweet_df['text'].map(clean_text)
tweet_df = tweet_df[['text']]

In [33]:
tweet_sequences = tokenizer.texts_to_sequences(tweet_df['text'])
padded_tweet_sequences = pad_sequences(tweet_sequences, maxlen=MAX_LENGTH)

In [34]:
y_pred_tweet = best_lstm_cnn_model.predict(padded_tweet_sequences, verbose=1, batch_size=2048)
y_pred_tweet = pd.DataFrame(y_pred_tweet, columns=['prediction_prob'])
y_pred_tweet['prediction'] = y_pred_tweet['prediction_prob'].map(lambda p: 1 if p >= 0.5 else 0)



In [36]:
tweet_df = pd.merge(tweet_df, y_pred_tweet, left_index=True, right_index=True, how='outer')

In [38]:
for row in tweet_df.itertuples():
    print('{} | Sentiment: {}'.format(row.text, row.prediction))

Tfw metal gear solid anime | Sentiment: 1
I added a video to a playlist THIS IS A CRAZY ANIME Reacting to Prison School | Sentiment: 1
What did you do to Chimmy shdksbskj | Sentiment: 1
Anime Expo Opens for Fans of Japanese Culture in Buenos Aires Latin American Herald Tribune | Sentiment: 1
cFreeze s Anime Watching Week 28 2018 Recap | Sentiment: 1
WOAHHHHH 95 LIKES WTH IM WATCHING ANIME AND I COME BACK TO THIS i m shook | Sentiment: 1
im all about the emotional anime twinks | Sentiment: 0
Binge watch some anime it ll help | Sentiment: 1
Sailor moon crystal crunchyroll the devil is a part timer netflix your name some anime websit | Sentiment: 1
all of the characters were individually so unique which means all of the relationships were complicated and meaning | Sentiment: 1
Apprill At least until she turns out t | Sentiment: 0
Deadass lost my composure bruh | Sentiment: 0
oooo lemme give it a watch | Sentiment: 1
jimin you re the sweetest | Sentiment: 1
A year ago I would never put a p