# Sentiment Analysis

A first conceptual model built on scraped review data. The goal is to analyse a review and correctly predict its sentiment.

After letting Scrapy do its work I had a dataset of 50k reviews, all of which were already labeled.

I'll process the reviews file here before building a model to classify the sentiment. For simplicity this notebook uses only two classes: "positive" and "negative".

In [1]:
from __future__ import absolute_import, division, print_function
import os
import sys
import csv
import re
import statistics
import pandas as pd
import numpy as np
import tensorflow as tf
import tflearn
from tflearn.data_utils import to_categorical

# Helper Functions

In [2]:
def clean_str(string):
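    """
    Tokenization/string cleaning, adapted from
    https://github.com/yoonkim/CNN_sentence/blob/master/process_data.py
    Note: the first substitution below replaces every character other than
    letters, digits and a few accented characters with a space, so the
    punctuation-specific substitutions after it no longer match anything.
    """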
    string = re.sub(r"[^A-Za-z0-9ëéèáé]", " ", string)
    string = re.sub(r"\'s", " \'s", string)
    string = re.sub(r"\'ve", " \'ve", string)
    string = re.sub(r"n\'t", " n\'t", string)
    string = re.sub(r"\'re", " \'re", string)
    string = re.sub(r"\'d", " \'d", string)
    string = re.sub(r"\'ll", " \'ll", string)
    string = re.sub(r",", " , ", string)
    string = re.sub(r"!", " ! ", string)
    string = re.sub(r"\(", " \( ", string)
    string = re.sub(r"\)", " \) ", string)
    string = re.sub(r"\?", " \? ", string)
    string = re.sub(r"\s{2,}", " ", string)
    return string.strip().lower()

def getstats(format_pos_sampled,format_neg_sampled):
    poslen = [len(i) for i in format_pos_sampled]
    neglen = [len(i) for i in format_neg_sampled]
    print('\n \t\tpos\t\tneg')
    print("count :",'\t',int(len(format_pos_sampled)),'\t',int(len(format_neg_sampled)))
    print("mean  :",'\t',int(statistics.mean(poslen)),'\t\t',int(statistics.mean(neglen)))
    print('median:','\t',int(statistics.median(poslen)),'\t\t',int(statistics.median(neglen)))
    print('stdev :','\t',int(statistics.stdev(poslen)),'\t\t',int(statistics.stdev(neglen)))
    return("")
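To get a feel for what the cleaning step does, here is a minimal sketch that keeps only the substitutions that actually have an effect (the character filter and the whitespace collapse). The name `clean_sketch` and the sample line are made up for illustration and are not part of the real dataset.

```python
import re

def clean_sketch(text):
    # Replace everything except letters, digits and a few accented characters with a space
    text = re.sub(r"[^A-Za-z0-9ëéèáé]", " ", text)
    # Collapse runs of two or more whitespace characters into a single space
    text = re.sub(r"\s{2,}", " ", text)
    return text.strip().lower()

# Hypothetical review line: label first, then the review text
sample = "positive Heel goed!  Snelle levering, top-service (aanrader)"
cleaned = clean_sketch(sample)
print(cleaned)
```

Punctuation disappears completely and the label stays as the first word of the line, which is what the later steps rely on.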

# Preparing the Dataset

In [None]:
datasetneg = []
datasetpos = []
with open('data/review_text_labeled_bare.csv', 'r', encoding='utf8') as reviews:
    for review in reviews:
        label = review.split()[0]
        if label == 'positive':
            datasetpos.append(clean_str(str(review)))
        elif label == 'negative':
            datasetneg.append(clean_str(str(review)))
        else:
            pass

In [None]:
print("Here are 2 positive examples:")
print(datasetpos[123])
print(datasetpos[456],'\n')
print("Here are 2 negative examples:")
print(datasetneg[123])
print(datasetneg[456])

The model will train on this compiled dataset so that it can ultimately classify new, unseen reviews as either positive or negative.

In [None]:
import random
format_neg_sampled = [i for i in datasetneg if len(i) >= 150 and len(i) <= 500]
format_pos_sampled = [i for i in datasetpos if len(i) >= 150 and len(i) <= 500]
print("Statistics for the whole dataset:")
print(getstats(datasetpos,datasetneg))
print()
print("Statistics for the reviews between 150 and 500 characters long:")
print(getstats(format_pos_sampled,format_neg_sampled))

I'm taking 5000 random samples of each label. Then I alternately pop a positive and a negative review from the sampled sets to build the final dataset.

In [None]:
pos_sampled = random.sample(format_pos_sampled, k=5000)
neg_sampled = random.sample(format_neg_sampled, k=5000)

# This is the 'final' dataset our deep learning model will use
dataset = []
for i in range(10000):
    # Alternate: even positions get a positive review, odd positions a negative one
    if i % 2 == 0:
        dataset.append(pos_sampled.pop())
    else:
        dataset.append(neg_sampled.pop())
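The alternating pattern can also be written without an index loop. This is just a sketch on made-up toy lists (`pos_toy` and `neg_toy` are hypothetical names). It walks the lists from the front instead of popping from the back, which makes no difference for randomly sampled data.

```python
from itertools import chain

# Toy stand-ins for the sampled positive and negative reviews
pos_toy = ["positive a", "positive b", "positive c"]
neg_toy = ["negative x", "negative y", "negative z"]

# zip pairs one positive with one negative; chain.from_iterable flattens the pairs
interleaved = list(chain.from_iterable(zip(pos_toy, neg_toy)))
print(interleaved)
```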


In [None]:
# And saving the dataset to CSV, one review per line
with open('reviews_labels.csv', 'w', encoding='utf-8') as file:
    file.write("\n".join(dataset))

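Note: In the terminal I used:<br>
<code> cat reviews_labels.csv | sed -e 's/^positive /positive;/' | sed -e 's/^negative /negative;/' > review_dataset.csv </code><br>
to make it easier for Pandas to read. The patterns are anchored to the start of the line so that only the leading label gets the <code>;</code> separator.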
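The same label/text split could also be done in Python instead of in the terminal. A minimal sketch, assuming every line starts with the label followed by a single space. The helper `split_label` and the toy lines are hypothetical.

```python
def split_label(line):
    # Split only on the first space: label on the left, review text on the right
    label, _, text = line.partition(" ")
    return label, text

# Made-up lines in the same "label text" layout as the saved file
toy_lines = [
    "positive heel goed product",
    "negative slechte service helaas",
]
rows = [split_label(line) for line in toy_lines]
print(rows)
```

Splitting on the first space only means that a later occurrence of the word "positive" or "negative" inside a review is left alone.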

# Finally, modelling
Now that preprocessing is done it's time to start modelling. I'll load the CSV I just saved into a Pandas DataFrame and create one Series for the reviews and one for the labels.

In [None]:
print("Now creating Pandas DataFrame from csv")
reviewsall = pd.read_csv('review_dataset.csv', sep=';', header=None)

# Pandas Series for the reviews
reviews = reviewsall[1]

# Pandas Series for the labels of each of those reviews
labels = reviewsall[0]

print("Finished loading Reviews and Labels")

# Counting frequencies
To create a bag-of-words representation I'll need to count how often each word appears in the data. I'll use these frequencies to create a vocabulary for encoding the review data.

In [None]:
from collections import Counter
total_counts = Counter()
for _, row in reviews.items():  # Series.iteritems() was removed in pandas 2.0
    total_counts.update(row.split(' '))
print("Unique words in dataset: ", len(total_counts))
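A tiny illustration of the counting step on two made-up reviews (`toy_reviews` is a hypothetical stand-in for the real Series). Note that `len()` of a `Counter` gives the number of distinct words, while the sum of its values gives the total number of tokens.

```python
from collections import Counter

# Two made-up, already cleaned reviews
toy_reviews = ["goed product goed", "slecht product"]

counts = Counter()
for review in toy_reviews:
    counts.update(review.split(" "))

distinct_words = len(counts)          # how many different words occur
total_tokens = sum(counts.values())   # how many words occur in total
print(distinct_words, total_tokens)
```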

Vocab & Word2IDx
----
I'll initially use the 20000 most frequent words for the model; this will be my vocab.

I'll also create a dictionary called word2idx, which maps each word in the vocab to an index. 

In [None]:
vocab = sorted(total_counts, key=total_counts.get, reverse=True)[:20000]

word2idx = {word: i for i, word in enumerate(vocab)}

# Wordvectorization
The following function takes a string and returns a NumPy array that holds, for each word in the vocab, the number of times it occurs in that string.

In [None]:
def text_to_vector(text):
    word_vector = np.zeros(len(vocab), dtype=np.int_)
    for word in text.split(' '):
        idx = word2idx.get(word, None)
        if idx is None:
            continue
        else:
            word_vector[idx] += 1
    return np.array(word_vector)
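To make the encoding concrete, here is a sketch with a four-word toy vocabulary, written in plain Python lists so it runs on its own. All names prefixed with `toy_` are hypothetical. It shows two properties of this kind of vector: word order is lost, and words outside the vocab are skipped.

```python
toy_vocab = ["goed", "slecht", "product", "niet"]
toy_word2idx = {word: i for i, word in enumerate(toy_vocab)}

def toy_text_to_vector(text):
    vector = [0] * len(toy_vocab)
    for word in text.split(" "):
        idx = toy_word2idx.get(word)
        if idx is not None:  # words outside the vocabulary are skipped
            vector[idx] += 1
    return vector

a = toy_text_to_vector("niet slecht product")
b = toy_text_to_vector("product slecht niet")   # same words, different order
c = toy_text_to_vector("goed goed onbekend")    # "onbekend" is not in the vocab
print(a, b, c)
```

Because `a` and `b` are identical, a model fed with these vectors cannot tell the two sentences apart.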

Now I'll finally run through all of my final dataset and convert each review to a word vector with the function above

In [None]:
word_vectors = np.zeros((len(reviews), len(vocab)), dtype=np.int_)
for ii, (_, text) in enumerate(reviews.items()):
    word_vectors[ii] = text_to_vector(text)

# Train and Test Datasets
To keep it simple I'll use TFLearn's function <code>to_categorical</code> to one-hot encode the labels into two output units, which the network can classify with a softmax activation function. I don't split off a validation set by hand because TFLearn's <code>fit</code> can hold out part of the training data itself (the <code>validation_set=0.1</code> argument used further down) :)

In [None]:
Y = (labels=='positive').astype(np.int_)
records = len(labels)

In [None]:
shuffle = np.arange(records)
np.random.shuffle(shuffle)
train_fraction = 0.9  # fraction of the records used for training; the rest is the test set

In [None]:
train_split, test_split = shuffle[:int(records*train_fraction)], shuffle[int(records*train_fraction):]
trainX, trainY = word_vectors[train_split,:], to_categorical(Y.values[train_split], 2)
testX, testY = word_vectors[test_split,:], to_categorical(Y.values[test_split], 2)
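The split and the one-hot encoding can be sketched on ten made-up labels, again in plain Python so it runs without TFLearn. The list comprehension for `one_hot` is my stand-in for what I understand `to_categorical(..., 2)` to produce (class index 0 is negative, 1 is positive). The seed and all names here are assumptions for illustration only.

```python
import random

labels_toy = ["positive", "negative"] * 5            # 10 made-up labels
y = [1 if label == "positive" else 0 for label in labels_toy]

# One-hot encode: class 0 -> [1, 0] (negative), class 1 -> [0, 1] (positive)
one_hot = [[1 - value, value] for value in y]

# Shuffle the record indices, then cut them into a train and a test part
indices = list(range(len(y)))
random.seed(0)
random.shuffle(indices)

train_fraction = 0.9
cut = int(len(indices) * train_fraction)
train_idx, test_idx = indices[:cut], indices[cut:]
print(len(train_idx), len(test_idx))
```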

# Building our TrustPilot Net

<b>Input Layer</b><br>
The number of input units equals the length of the input vectors. I encode each review as a 20000-element vector, so the network has 20000 input units.

<b>More (Hidden!) Layers</b><br>
I add two hidden layers. Each is a fully connected layer, in which every unit in the previous layer is connected to every unit in this layer.

<b>Output Layer</b><br>
The output layer is what we actually want to see. Since I've turned this sentiment analysis problem into a classification problem with two labels, the output layer will have size 2. Softmax is an appropriate activation function for predicting which of two classes some input data belongs to, so that is what I use.

<b>Training</b><br>
With TFLearn the training setup is defined with <code> tflearn.regression </code>, and the model is then trained with <code> fit </code>. A usage example is:<br>
<code> net = tflearn.regression(net, optimizer='sgd', learning_rate=0.1, loss='categorical_crossentropy') </code><br>
<br>
The hyperparameters in this call are the following:
* `optimizer` sets the training method, here stochastic gradient descent.
* `learning_rate` sets the step size of each weight update.
* `loss` sets how the network error is calculated, here with categorical cross-entropy.
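To show what the softmax output and the categorical cross-entropy loss look like numerically, here is a small sketch using only the `math` module. It is a textbook formulation under my own naming, not TFLearn's internal implementation.

```python
import math

def softmax(logits):
    # Subtract the maximum first for numerical stability
    highest = max(logits)
    exps = [math.exp(z - highest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(probs, one_hot_target):
    # Only the probability assigned to the true class contributes to the loss
    return -sum(t * math.log(p) for p, t in zip(probs, one_hot_target))

undecided = softmax([0.0, 0.0])        # equal scores for both classes
confident = softmax([0.0, 3.0])        # clearly favours the second class

loss_undecided = categorical_crossentropy(undecided, [0, 1])
loss_confident = categorical_crossentropy(confident, [0, 1])
print(loss_undecided, loss_confident)
```

An undecided network scores a loss of ln 2 (about 0.693) on a two-class problem, so a training loss that stays around that value means the model has not learned much yet.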

In [None]:
def build_model():
    tf.reset_default_graph()

    # Input Layer
    net = tflearn.input_data([None, 20000])

    # More Layers (hidden)
    net = tflearn.fully_connected(net, 200, activation='ReLU')
    net = tflearn.fully_connected(net, 25, activation='ReLU')
    
    # Output layer > one of 2 classes
    net = tflearn.fully_connected(net, 2, activation='softmax')
    net = tflearn.regression(net, optimizer='sgd',learning_rate=0.01,loss='categorical_crossentropy')
    
    model = tflearn.DNN(net)
    return model
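As a rough sanity check on the size of this network, the number of trainable parameters can be counted by hand. The sketch assumes every fully connected layer has one bias per unit, which I believe is TFLearn's default but have not verified here.

```python
# Layer widths: input, two hidden layers, output
layer_sizes = [20000, 200, 25, 2]

# Each fully connected layer has n_in * n_out weights plus n_out biases
params_per_layer = [
    n_in * n_out + n_out
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
]
total_params = sum(params_per_layer)
print(params_per_layer, total_params)
```

Almost all parameters sit in the first hidden layer, which is a direct consequence of the 20000-wide input vectors.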

In [None]:
# Initialise the model (exciting!)
model = build_model()

In [None]:
# Training, this is where the action starts! 
# Action which unfortunately consists mostly of waiting.
# First 10 epochs with batch size 256
model.fit(trainX, trainY, validation_set=0.1, show_metric=True, batch_size=256, n_epoch=10)


In [None]:
# That did not increase accuracy, nor did it drop the loss. Now 20 epochs with batch size 128
model.fit(trainX, trainY, validation_set=0.1, show_metric=True, batch_size=128, n_epoch=20)

In [None]:
# Wow, increased accuracy, but loss is still high... Let's try 10 epochs with batch size 64
model.fit(trainX, trainY, validation_set=0.1, show_metric=True, batch_size=64, n_epoch=10)

# Testing with our Test Dataset
Note: I ran model.fit for 30 epochs with batch size 128 first. The accuracy was high (95%), but I was not satisfied with the decrease in loss.

That's why I ran model.fit for 20 more epochs with batch size 256. This decreased the total loss.

Now that the model has been trained on the Train Dataset, it has learned features for classifying a review into 1 of 2 classes, either positive or negative.

To put the accuracy to the test I'll let the model make a prediction for each of the Test Dataset's reviews. I'll then compare these predictions to the ground truth from the dataset itself.

In [None]:
predictions = (np.array(model.predict(testX))[:,0] >= 0.5).astype(np.int_)
testdataset_accuracy = np.mean(predictions == testY[:,0], axis=0)
print("Test Dataset accuracy: ", testdataset_accuracy)
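The accuracy calculation compares column 0 (the negative class) of the predictions with column 0 of the one-hot test labels. Here is a sketch of the same idea on four made-up predictions, in plain Python. The numbers are invented purely to show the mechanics.

```python
# Each row is [P(negative), P(positive)] for one made-up review
predicted_probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]
# One-hot ground truth in the same column order
true_one_hot = [[1, 0], [0, 1], [0, 1], [0, 1]]

# 1 if the model calls the review negative, else 0
predicted_negative = [1 if row[0] >= 0.5 else 0 for row in predicted_probs]
true_negative = [row[0] for row in true_one_hot]

matches = sum(p == t for p, t in zip(predicted_negative, true_negative))
accuracy = matches / len(true_negative)
print(accuracy)
```

With two classes, getting the negative column right is the same as getting the positive column right, so checking one column is enough.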

<b>94% Accuracy</b> is not bad :) for a first draft model, <i>if I may pat myself on the back prematurely</i>.

But how does the model work with sentences it hasn't seen before? 

Testing with trivial examples
--
Here I'll test the model's output with some trivial example sentences in Dutch. The model classifies them using the features it learned from the examples given in training. The probability calculation works as follows:

Each sentence is first transformed into a word-count vector. Because this is a bag-of-words representation, word order is ignored; only which vocab words occur, and how often, matters. For that vector the model outputs a probability for each of the two labels.

The function <code> test_sentence(zin) </code> below prints the probability of a sentence (<code>zin</code>, Dutch for "sentence") being positive. If that probability is above 0.5 the sentence is classified as positive, otherwise as negative.

Let's get started!

In [None]:
def test_sentence(zin):
    # Clean the sentence the same way as the training data (strips punctuation, lowercases)
    positive_prob = model.predict([text_to_vector(clean_str(zin))])[0][1]
    print('Sentence: {}'.format(zin))
    print('P(positive) = {:.3f} :'.format(positive_prob), 
          'Positive' if positive_prob > 0.5 else 'Negative')
    

In [None]:
zin = "Transformers is de zeker weten de beste film van 2016"
test_sentence(zin)
print()

zin = "Transformers is de niet de beste film van 2016"
test_sentence(zin)
print()

zin = "Het is ongelofelijk hoe iemand met talent iets zo spectaculair lelijk maakt"
test_sentence(zin)
print()

zin = "Het is ongelofelijk dat iemand van deze vieze rommel houdt."
test_sentence(zin)
print()

zin = "Het is niet te geloven dat iemand zoiets prachtigs vieze rommel noemt."
test_sentence(zin)
print()

zin = "Ik vind het product erg goed en het ziet er ook nog mooi uit. Aan te raden. Wel kopen."
test_sentence(zin)
print()

zin = "Ik vind het product erg slecht en het ziet er ook nog lelijk uit. Af te raden. Niet kopen."
test_sentence(zin)
print()

zin = "Ik raad dit bedrijf af, zeer slechte service en beroerd behandeld. Helemaal niet tevreden"
test_sentence(zin)
print()

zin = "Ik raad dit bedrijf aan, zeer goede service en keurig behandeld. Helemaal tevreden"
test_sentence(zin)
print()

zin = "Bij dit bedrijf ben ik vreselijk slecht te woord gestaan. Ik kwam voor een reparatie maar dat is niet goed gegaan. Uiteindelijk heeft het me handen vol geld gekost en ben ik niet opgeschoten. Rotbedrijf"
test_sentence(zin)
print()

zin = "Bij dit bedrijf ben ik fantastisch goed te woord gestaan. Ik kwam voor een reparatie en dat is helemaal goed gegaan. Uiteindelijk heeft het me weinig geld gekost en ben ik geholpen. Topbedrijf"
test_sentence(zin)
print()
