## Sentiment Analysis

We use a many-to-one architecture for sentiment classification: the network reads a whole sequence of words and emits a single label. There are many applications; for this notebook I chose movie review classification.

The RNN used in this notebook is an LSTM.
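
To make "many-to-one" concrete, here is a minimal NumPy sketch. It uses a plain tanh RNN cell rather than an LSTM, and the function name many_to_one_rnn, the weight names, and the toy sizes are all assumptions chosen for illustration. The point is only that every time step updates the hidden state, while a single probability is produced after the last step.

```python
import numpy as np

def many_to_one_rnn(inputs, W_x, W_h, b, w_out, b_out):
    """Run a plain tanh RNN over `inputs` and classify from the final state only."""
    h = np.zeros(W_h.shape[0])
    for x_t in inputs:                        # many inputs: one vector per time step
        h = np.tanh(W_x @ x_t + W_h @ h + b)  # the hidden state carries the context forward
    logit = w_out @ h + b_out                 # one output, computed after the last step
    return 1.0 / (1.0 + np.exp(-logit))       # sigmoid -> probability of "positive"

# Toy sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
seq_len, emb_dim, hidden = 6, 4, 3
inputs = rng.normal(size=(seq_len, emb_dim))
prob = many_to_one_rnn(
    inputs,
    rng.normal(size=(hidden, emb_dim)),
    rng.normal(size=(hidden, hidden)),
    np.zeros(hidden),
    rng.normal(size=hidden),
    0.0,
)
print(prob)
```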

In [1]:
import pandas as pd
import numpy as np
import tensorflow as tf
import bs4
import re
import collections

### Dataset

I've used the IMDB dataset, which contains movie reviews and the associated sentiment, i.e. whether the review is positive or negative. The dataset contains 50,000 reviews.

In [2]:
df = pd.read_csv('IMDB Dataset.csv')

In [3]:
df.head(10)

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |
| 5 | Probably my all-time favorite movie, a story o... | positive |
| 6 | I sure would like to see a resurrection of a u... | positive |
| 7 | This show was an amazing, fresh & innovative i... | negative |
| 8 | Encouraged by the positive comments about this... | negative |
| 9 | If you like original gut wrenching laughter yo... | positive |


### Preprocessing

Convert the sentiment column to binary labels for classification: 1 for positive, 0 for negative.


In [4]:
df['sentiment'] = df['sentiment'].apply(lambda x: 1 if x=='positive' else 0)

In [5]:
ls = list(df['review'])

Remove HTML tags, bracketed text, punctuation, and stop words from the review column, and lowercase the text.

In [6]:
stopwords = [ "a", "about", "above", "after", "again", "against", "all", "am", "an", "and", "any", "are", "as", "at", "be", "because", "been", "before", "being", "below", "between", "both", "but", "by", "could", "did", "do", "does", "doing", "down", "during", "each", "few", "for", "from", "further", "had", "has", "have", "having", "he", "he'd", "he'll", "he's", "her", "here", "here's", "hers", "herself", "him", "himself", "his", "how", "how's", "i", "i'd", "i'll", "i'm", "i've", "if", "in", "into", "is", "it", "it's", "its", "itself", "let's", "me", "more", "most", "my", "myself", "nor", "of", "on", "once", "only", "or", "other", "ought", "our", "ours", "ourselves", "out", "over", "own", "same", "she", "she'd", "she'll", "she's", "should", "so", "some", "such", "than", "that", "that's", "the", "their", "theirs", "them", "themselves", "then", "there", "there's", "these", "they", "they'd", "they'll", "they're", "they've", "this", "those", "through", "to", "too", "under", "until", "up", "very", "was", "we", "we'd", "we'll", "we're", "we've", "were", "what", "what's", "when", "when's", "where", "where's", "which", "while", "who", "who's", "whom", "why", "why's", "with", "would", "you", "you'd", "you'll", "you're", "you've", "your", "yours", "yourself", "yourselves" ]

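def clean_data(text):
    # Strip HTML markup such as <br /> and keep only the visible text
    soup = bs4.BeautifulSoup(text, "html.parser")
    text = soup.get_text()
    # Drop anything enclosed in square brackets
    text = re.sub(r'\[[^]]*\]', '', text)
    # Keep only letters, digits and whitespace (A-Z, not A-z, which also matches [ \ ] ^ _ `)
    pattern = r'[^a-zA-Z0-9\s]'
    text = re.sub(pattern, '', text)
    text = text.lower()
    # Remove stop words that are surrounded by spaces, then collapse double spaces
    for i in stopwords:
        text = text.replace(' ' + i + ' ', ' ')
        text = text.replace('  ', ' ')
    return text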
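
The replace loop above only matches a stop word that has a space on both sides, so a stop word at the very start or end of a review survives (the leading "a" and "i" in the cleaned output are examples). A token-based filter avoids that. The following is a minimal sketch of that alternative, not the method used in this notebook; remove_stopwords and the small demo_stopwords set are hypothetical names introduced only for illustration.

```python
import re

def remove_stopwords(text, stopwords):
    """Lowercase `text`, keep only letters/digits/whitespace, then drop stop words."""
    text = re.sub(r"[^a-z0-9\s]", "", text.lower())
    # Filtering token by token also catches stop words at the start or end of the text.
    return " ".join(tok for tok in text.split() if tok not in stopwords)

demo_stopwords = {"a", "i", "the", "this", "was", "to"}  # illustrative subset only
cleaned = remove_stopwords("A wonderful little production. I loved this!", demo_stopwords)
print(cleaned)  # wonderful little production loved
```

Using a set for the stop words keeps each membership check constant-time, which matters when the filter runs over 50,000 reviews.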

In [7]:
df['review'] = df['review'].apply(clean_data)

In [8]:
df.head()

| | review | sentiment |
|---|---|---|
| 0 | one reviewers mentioned watching just 1 oz epi... | 1 |
| 1 | a wonderful little production filming techniqu... | 1 |
| 2 | i thought wonderful way spend time hot summer ... | 1 |
| 3 | basically theres family little boy jake thinks... | 0 |
| 4 | petter matteis love time money visually stunni... | 1 |


In [9]:
# Split the data into train and test: 40000 for training and 10000 for testing
data = list(df['review'])
labels = list(df['sentiment'])

train_X = data[:40000]
test_X = data[40000:]
labels_X = labels[:40000]  # training labels
labels_Y = labels[40000:]  # test labels

word_cap = 25000  # Use only the top 25000 words to tokenize the text

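Tokenizer builds a word-to-index vocabulary from the corpus, which can be accessed through tokenizer.word_index. The texts_to_sequences() method then maps each text to a list of integer tokens. oov_token stands in for out-of-vocabulary words; Keras assigns it index 1, ahead of every real word. With num_words=word_cap, only words whose index is below word_cap keep their own token, and the rest are mapped to the OOV index.

To make sure all the sequences have the same length, we post-pad them with zeros.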

In [10]:
tokenizer = tf.keras.preprocessing.text.Tokenizer(oov_token='<OOV>', num_words=word_cap)
tokenizer.fit_on_texts(train_X)
train_idx = tokenizer.word_index
train_idx_rev = {i[1]:i[0] for i in train_idx.items()}

train_seq = tokenizer.texts_to_sequences(train_X)
test_seq = tokenizer.texts_to_sequences(test_X)

train_seq = tf.keras.preprocessing.sequence.pad_sequences(train_seq, padding='post')
test_seq = tf.keras.preprocessing.sequence.pad_sequences(test_seq, maxlen=train_seq.shape[1], padding='post')
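
The index assignment and post-padding can be imitated in a few lines of plain Python. This is a simplified sketch of the behaviour, not the Keras implementation: it skips the punctuation filtering and lowercasing that Tokenizer applies by default, and build_word_index, to_sequence, and pad_post are hypothetical helper names.

```python
from collections import Counter

def build_word_index(texts, oov_token="<OOV>"):
    """Rank words by frequency; the OOV token takes index 1 and index 0 is left for padding."""
    counts = Counter(word for text in texts for word in text.split())
    vocab = [oov_token] + [word for word, _ in counts.most_common()]
    return {word: idx for idx, word in enumerate(vocab, start=1)}

def to_sequence(text, word_index, num_words, oov_token="<OOV>"):
    """Map words to indices; unseen words and words ranked at or beyond num_words become OOV."""
    oov = word_index[oov_token]
    seq = []
    for word in text.split():
        idx = word_index.get(word)
        seq.append(idx if idx is not None and idx < num_words else oov)
    return seq

def pad_post(seqs, maxlen):
    """Append zeros up to maxlen; longer sequences keep their last maxlen tokens."""
    return [seq[-maxlen:] + [0] * (maxlen - len(seq)) for seq in seqs]

word_index = build_word_index(["good movie good acting", "bad movie"])
seqs = [to_sequence(t, word_index, num_words=4) for t in ["good movie", "bad acting movie plot"]]
padded = pad_post(seqs, maxlen=4)
print(word_index)
print(padded)  # [[2, 3, 0, 0], [1, 1, 3, 1]]
```

In the toy run, "acting" and "bad" exist in the vocabulary but rank at or beyond num_words, so they share index 1 with the unseen word "plot".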

In [11]:
train_seq.shape # (num of examples, length of each sentence)

(40000, 1437)

### Model

The first model starts with an Embedding layer that is not pretrained. It is trained on the fly, i.e. the embedding weight matrix is updated by backpropagation while training on the current data.

The next layer is a bidirectional LSTM with 32 units in each direction, followed by a dense layer with a sigmoid activation that classifies the output.

In [12]:
inputs = tf.keras.Input(shape=(train_seq.shape[1],))
x = tf.keras.layers.Embedding(word_cap, 100, input_length=train_seq.shape[1])(inputs)
x = tf.keras.layers.Dropout(0.5)(x)
x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32))(x)
x = tf.keras.layers.Dropout(0.5)(x)
outputs = tf.keras.layers.Dense(1, activation=tf.nn.sigmoid)(x)

model = tf.keras.Model(inputs=inputs, outputs=outputs)

In [13]:
model.summary()

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 1437)]            0         
_________________________________________________________________
embedding (Embedding)        (None, 1437, 100)         2500000   
_________________________________________________________________
dropout (Dropout)            (None, 1437, 100)         0         
_________________________________________________________________
bidirectional (Bidirectional (None, 64)                34048     
_________________________________________________________________
dropout_1 (Dropout)          (None, 64)                0         
_________________________________________________________________
dense (Dense)                (None, 1)                 65        
Total params: 2,534,113
Trainable params: 2,534,113
Non-trainable params: 0
___________________________________________________
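
The parameter counts in the summary can be checked by hand. The sketch below assumes the standard LSTM layout of four gates, each with an input kernel, a recurrent kernel, and a bias; lstm_params is a hypothetical helper name.

```python
def lstm_params(input_dim, units):
    """Four gates, each with an input kernel, a recurrent kernel and a bias vector."""
    return 4 * ((input_dim + units) * units + units)

vocab_size, embed_dim, units = 25000, 100, 32

embedding = vocab_size * embed_dim           # one 100-d vector per vocabulary index
bi_lstm = 2 * lstm_params(embed_dim, units)  # forward LSTM plus backward LSTM
dense = 2 * units * 1 + 1                    # 64 inputs -> 1 output, plus one bias
total = embedding + bi_lstm + dense

print(embedding, bi_lstm, dense, total)  # 2500000 34048 65 2534113
```

Almost all of the weights sit in the embedding matrix, which is why freezing or replacing that layer changes the number of trainable parameters so sharply.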

In [14]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [15]:
model.fit(train_seq, np.array(labels_X), validation_data=(test_seq, np.array(labels_Y)), epochs=3, batch_size=64)

Train on 40000 samples, validate on 10000 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<tensorflow.python.keras.callbacks.History at 0x7fd3ad356710>

In [17]:
model_json = model.to_json()

with open("model_many_one.json", "w") as json_file:
    json_file.write(model_json)

# serialize weights to HDF5
model.save_weights("model_many_one.h5")

A helper function to test the model on a few custom sentences:

In [18]:
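def custom_test(dat, mod):
    # Split into words without mutating the caller's list.
    # Note: clean_data is not applied here, so these inputs keep their stop words.
    tokens = [sentence.split() for sentence in dat]
    seqs = tokenizer.texts_to_sequences(tokens)
    # Pad to the length the model was trained on instead of hard-coding 1437
    seqs = tf.keras.preprocessing.sequence.pad_sequences(seqs, maxlen=train_seq.shape[1], padding='post')
    res = mod.predict(seqs)  # shape (n, 1): one sigmoid probability per sentence
    return ['positive' if p >= 0.5 else 'negative' for p in res.ravel()]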

In [19]:
result = custom_test(["I love the movie", "the movie is okayish", " Super film", "movie is not that great", "Not a bad film"], model)

In [21]:
result

['positive', 'negative', 'positive', 'positive', 'negative']

### Model 2

This model is the same as the one above, but here I'm using pretrained GloVe vectors for the Embedding layer and keeping them frozen during training.

In [22]:
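vecs = {}
with open('/home/srikar/datasets/glove/glove.6B.100d.txt', 'r', encoding='utf-8') as file:
    for line in file:
        temp = line.split()
        # First field is the word, the remaining 100 fields are its vector
        vecs[temp[0]] = np.asarray(temp[1:], dtype='float32')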

In [23]:
# Create the embedding matrix and fill it with the pretrained vectors.
# Words without a GloVe vector (and the padding index 0) keep an all-zero row.
embeddings_matrix = np.zeros((word_cap, 100))

for i in range(1, word_cap):
    embedding_vector = vecs.get(train_idx_rev[i])
    if embedding_vector is not None:
        embeddings_matrix[i] = embedding_vector
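
It is worth checking how many vocabulary words actually receive a pretrained vector, since every miss leaves a zero row that the frozen layer can never improve. The sketch below reproduces the two steps on toy data and counts the hits. The in-memory toy_file, the toy_index mapping, and the helper names load_vectors and build_matrix are assumptions for illustration; the real GloVe file has 100 values per line rather than 3.

```python
import io
import numpy as np

def load_vectors(lines):
    """Parse 'word v1 v2 ...' lines into a dict of float32 arrays."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return vectors

def build_matrix(index_to_word, vectors, num_words, dim):
    """Row i holds the vector of word i; rows without a pretrained vector stay zero."""
    matrix = np.zeros((num_words, dim), dtype="float32")
    hits = 0
    for idx in range(1, num_words):  # index 0 is reserved for padding
        vec = vectors.get(index_to_word.get(idx))
        if vec is not None:
            matrix[idx] = vec
            hits += 1
    return matrix, hits

# Toy stand-ins for the GloVe file and the reversed word index.
toy_file = io.StringIO("movie 0.1 0.2 0.3\ngood 0.4 0.5 0.6\n")
toy_index = {1: "<OOV>", 2: "movie", 3: "good", 4: "okayish"}
matrix, hits = build_matrix(toy_index, load_vectors(toy_file), num_words=5, dim=3)
print(hits, matrix[2])
```

Here "okayish" and the OOV token have no pretrained vector, so their rows remain zero and only two of the four indexed words are covered.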

In [24]:
inputs = tf.keras.Input(shape=(train_seq.shape[1],))
x = tf.keras.layers.Embedding(word_cap, 100, input_length=train_seq.shape[1], weights=[embeddings_matrix], trainable=False)(inputs)
x = tf.keras.layers.Dropout(0.5)(x)
x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32))(x)
x = tf.keras.layers.Dropout(0.5)(x)
outputs = tf.keras.layers.Dense(1, activation=tf.nn.sigmoid)(x)

model_new = tf.keras.Model(inputs=inputs, outputs=outputs)

In [26]:
model_new.summary() # Far fewer trainable parameters: the pretrained embedding is frozen (trainable=False)

Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         [(None, 1437)]            0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 1437, 100)         2500000   
_________________________________________________________________
dropout_2 (Dropout)          (None, 1437, 100)         0         
_________________________________________________________________
bidirectional_1 (Bidirection (None, 64)                34048     
_________________________________________________________________
dropout_3 (Dropout)          (None, 64)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 65        
Total params: 2,534,113
Trainable params: 34,113
Non-trainable params: 2,500,000
____________________________________________

In [27]:
model_new.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [28]:
model_new.fit(train_seq, np.array(labels_X), validation_data=(test_seq, np.array(labels_Y)), epochs=10, batch_size=64)

Train on 40000 samples, validate on 10000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7fd1f5f3dcf8>

In [33]:
result2 = custom_test(["I love the movie", "the movie is okayish", " Super film", "movie is not that great", "Not a bad film"], model_new)

In [34]:
result2

['positive', 'positive', 'positive', 'positive', 'negative']