# Sentiment Analysis

We are interested in classifying extracts of "Allocine" movie reviews. The sentiment analysis literature mostly deals with English corpora, whereas here we attempt to classify French reviews. The task may differ somewhat, since the two languages have different grammar rules and sentence structures.

# Table of contents

* [1. Context and theoretical background](#1.-Context-and-theoretical-background)
    * [1.1 Sentiment Analysis](#1.1-Sentiment-Analysis)
    * [1.2 Language Modelling](#1.2-Language-Modelling)
    * [1.3 Support Vector Machine](#1.3-Support-Vector-Machine)
* [2. Application on Movie Reviews](#2.-Application-on-Movie-Reviews)
    * [2.1 Python Setup](#2.1-Python-Setup)
    * [2.2 Input Format](#2.2-Input-Format)
    * [2.3 Text Preprocessing](#2.3-Text-Preprocessing)
    * [2.4 Putting it all together](#2.4-Putting-it-all-together)
    * [2.5 Classification Model](#2.5-Classification-Model)
    * [2.6 Model Evaluation](#2.6-Model-Evaluation)

## 1. Context and theoretical background

### 1.1 Sentiment Analysis

<u> Positive or negative movie review :</u>
<UL TYPE="square">
<LI> Unbelievably disappointing <img src="../pictures/minus.png" width="30" height="30">
<LI> Full of zany characters and richly applied satire, and some great plot twists <img src="../pictures/plus.png" width="30" height="30">
<LI> this is the greatest screwball comedy ever filmed <img src="../pictures/plus.png" width="30" height="30">
<LI> It was pathetic. The worst part about it was the boxing scenes.<img src="../pictures/minus.png" width="30" height="30">
</UL>

<u> Google Product Search :</u>
<img src="../pictures/google_product.png" width="700" height="600">

<u> Twitter sentiment vs. Gallup Poll of Consumer Confidence </u>
<UL TYPE="square">
<LI> Gallup: a global performance-management consulting company, best known for its public opinion polls conducted worldwide
<LI> Twitter sentiment based on "Tweets to Polls: Linking Text Sentiment to Public Opinion Time Series" by Brendan O'Connor, Ramnath Balasubramanyan, Bryan R. Routledge, and Noah A. Smith. 2010.
</UL>
<img src="../pictures/gallup_v2.png" width="700" height="600">

### 1.2 Language Modelling

<b><u>Intuition:</u></b> The vocabulary is characteristic of the class. In movie reviews, for example, if words like "bad" or "boring" occur repeatedly, the review is very likely negative.

<b><u>Problem:</u></b> These words are much less frequent than common words like "be", "if", "but", etc., and can therefore be drowned out by those common words in the model.

<b><u>Solution:</u></b> Give weights to the words with TfIdf language modelling:
<UL TYPE="squares">
    <LI> Term frequency (Luhn 1957): $tf_{ij}$ is the number of times word $i$ occurs in document $j$ <br/><br/>
    <LI> Inverse document frequency (IDF) (Sparck Jones 1972)
    <UL TYPE="circles">
        <LI> $N$ is the total number of documents
        <LI> $df_i$ is the number of documents containing word $i$
        <LI> $idf_i = \log\left(\frac{N}{df_i}\right)$
    </UL> <br/><br/>
    <LI> Tf-Idf weight of word $i$ in document $j$: $$w_{ij} = tf_{ij} \, idf_i$$
    $$w_{ij} = tf_{ij} \log\left(\frac{N}{df_i}\right)$$
</UL>
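
The weighting above can be checked by hand on a tiny corpus. The sketch below is only an illustration: the three "documents" are made up, and `tf_idf` is a hypothetical helper that applies $w_{ij} = tf_{ij} \log(N / df_i)$ literally. Scikit-learn's `TfidfTransformer`, used later in this notebook, applies a smoothed IDF and L2 normalization by default, so its values will differ.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration): each document is a list of tokens
docs = [
    ["film", "boring", "bad"],
    ["film", "great", "film"],
    ["film", "bad", "acting"],
]

def tf_idf(docs):
    """Return one {word: weight} dict per document, with w_ij = tf_ij * log(N / df_i)."""
    n_docs = len(docs)
    # df_i: number of documents that contain word i at least once
    df = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw count of each word in this document
        weights.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return weights

weights = tf_idf(docs)
for doc_weights in weights:
    print(doc_weights)
```

"film" appears in every document, so its weight is zero everywhere, while "boring", which appears in a single document, gets the largest weight in the first review.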

### 1.3 Support Vector Machine

<b><u>Text Classification:</u></b>
<UL TYPE="squares">
    <LI> Input
    <UL Type="circles">
        <LI> a document d
        <LI> a fixed set of classes $ C = \{c_1, c_2, ..., c_J\} $
    </UL>
    <LI> Output: a predicted class $ c \in C $
</UL>

<b><u>Support vector machines (SVMs)</u></b> are a set of supervised learning methods used for classification, regression and outlier detection.

An SVM constructs a hyperplane, or a set of hyperplanes, in a high- or infinite-dimensional space, choosing the one with the largest distance to the nearest training data points of any class (functional margin). In general, the larger the margin, the lower the generalization error of the classifier.

<img src="../pictures/svm.png" height="600" width="600">
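
To make the notion of margin concrete, the sketch below computes the distance from a hyperplane $w \cdot x + b = 0$ to the closest training point (the geometric margin) for two candidate hyperplanes. The data points and both hyperplanes are made up for illustration, and `geometric_margin` is a hypothetical helper, not part of any library.

```python
import math

def geometric_margin(w, b, points, labels):
    """Smallest signed distance y * (w.x + b) / ||w|| over all points."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in zip(points, labels)
    )

# Toy 2-D data (made up): labels are +1 / -1
points = [(1.0, 1.0), (2.0, 2.0), (-1.0, -1.0), (-2.0, -1.0)]
labels = [1, 1, -1, -1]

# Two hyperplanes that both separate the data correctly
margin_a = geometric_margin((1.0, 1.0), 0.0, points, labels)
margin_b = geometric_margin((1.0, 0.0), 0.0, points, labels)
print(margin_a, margin_b)
```

Both hyperplanes classify every point correctly (both margins are positive), but the first one leaves more room around the closest points, which is the one a maximum-margin classifier would prefer.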

## 2. Application on Movie Reviews

### 2.1 Python Setup

We use Pandas for data manipulation and Scikit-learn for text classification. The NLTK package handles text processing and data cleaning (lemmatization, stemming, stopword and punctuation removal, etc.).

In [106]:
# Useful modules
import os, re, math, random
from string import punctuation
import numpy as np
from time import time
import matplotlib.pyplot as plt
%matplotlib inline

# Pandas
import pandas as pd

# Sklearn
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier, RidgeClassifier, Perceptron
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier

# Nltk
import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.tag import StanfordPOSTagger
from nltk.tokenize.stanford import StanfordTokenizer

%run setup_workshop.py
from setup_workshop import jar, model, tag_dict, lex_dict

st_tok = StanfordTokenizer(jar, encoding='utf-8')
st_tag = StanfordPOSTagger(model, jar, encoding = 'utf-8')

### 2.2 Input Format

In [2]:
# Read the data set
data_path = "../dataset/data_allocine_workshop.csv"
data = pd.read_csv(data_path, delimiter = ';')

# drop missing data
data.dropna(inplace = True)

# Head extract of the movie reviews dataset
data.head()

Unnamed: 0,review,rating
0,"Je n’aime pas utiliser cette phrase, je trouve...",1.0
1,"Speed and color in the eyes... Depuis Matrix, ...",1.0
2,Un an avant la consécration d'Avatar par la pr...,1.0
3,"Les frères Wachowski, à qui l’on doit l’impres...",0.0
4,S'il s'agissait seulement de partir du bon vie...,0.0


We use Allocine movie reviews that are already split into two classes: 0 for negative and 1 for positive. Negative reviews are those rated 1.5 stars or lower; positive reviews are those rated 4.0 stars or higher.
There are two columns: "review" (the input text) and "rating" (the class label).
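
The labelling rule can be written as a small function. This is only a sketch: the dataset used here is already labelled, `stars_to_label` is a hypothetical helper, and returning `None` for mid-range ratings is an assumption about how reviews between the two thresholds were handled.

```python
def stars_to_label(stars):
    """Map a star rating to 0 (negative), 1 (positive), or None (neither class)."""
    if stars <= 1.5:
        return 0
    if stars >= 4.0:
        return 1
    return None  # assumption: ratings strictly between 1.5 and 4.0 are not kept

labels = [stars_to_label(s) for s in [0.5, 1.5, 3.0, 4.0, 5.0]]
print(labels)  # [0, 0, None, 1, 1]
```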

In [3]:
from collections import Counter

for key, value in Counter(data.rating).items():
    print("%s : %s" % (key, value))

0.0 : 15303
1.0 : 18977


### 2.3 Text Preprocessing

In [90]:
# pick a random review; randrange excludes the upper bound, so the index is always valid
review = data.iloc[random.randrange(data.shape[0]), 0]
print(review)

Excellent !!!
Il faut vraiment voir le deux fois !
C'est encore mieux ainsi. Le sang coule 2 fois!
Tarantino nous ravi de ces scénarios !


In [91]:
# remove numbers
words = re.sub(r'\d+', '', review)

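# replace the elision apostrophe with "e " (e.g. "C'est" -> "Ce est")
words = words.replace("'", "e ")

# tokenize, then tag each token with its part of speech
# (filtering and lemmatization happen in french_stem below)
words = st_tok.tokenize(words)
tags = st_tag.tag(words)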

In [92]:
print(tags)

[('Excellent', 'ADJ'), ('!!!', 'N'), ('Il', 'CLS'), ('faut', 'V'), ('vraiment', 'ADV'), ('voir', 'VINF'), ('le', 'DET'), ('deux', 'DET'), ('fois', 'NC'), ('!', 'PUNC'), ('Ce', 'DET'), ('est', 'V'), ('encore', 'ADV'), ('mieux', 'ADV'), ('ainsi', 'ADV'), ('.', 'PUNC'), ('Le', 'DET'), ('sang', 'NC'), ('coule', 'ADJ'), ('fois', 'N'), ('!', 'PUNC'), ('Tarantino', 'NPP'), ('nous', 'CLO'), ('ravi', 'VPP'), ('de', 'P'), ('ces', 'DET'), ('scénarios', 'NC'), ('!', 'PUNC')]


In [85]:
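def french_stem(tags_words, lexicon = lex_dict):
    # Map each (word, tag) pair to its lemma, keeping only the categories in tags_crit.
    # tags_crit is not defined in this notebook; it is assumed to come from setup_workshop.py.
    result = []
    for word, tag in tags_words:
        if tag in tag_dict:
            tag_fr = tag_dict[tag]
            if (tag_fr in tags_crit) and word in lexicon[tag_fr]:
                result.append(lexicon[tag_fr][word])

    return result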
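
The lookup performed by `french_stem` can be illustrated on miniature dictionaries. Everything prefixed with `toy_` below is made up: the real `tag_dict`, `lex_dict` and `tags_crit` come from `setup_workshop.py`, and their exact contents are not shown in this notebook. The sketch assumes the structure suggested by the code above: a map from tagger tags to lexicon categories, and one word-to-lemma dictionary per category.

```python
# Hypothetical miniature versions of the objects loaded from setup_workshop.py
toy_tag_dict = {"V": "verb", "VINF": "verb", "NC": "noun", "ADV": "adverb", "DET": "det"}
toy_lex_dict = {
    "verb": {"faut": "falloir", "voir": "voir"},
    "noun": {"fois": "fois", "scénarios": "scénario"},
    "adverb": {"vraiment": "vraiment"},
    "det": {"le": "le"},
}
toy_tags_crit = {"verb", "noun", "adverb"}  # categories considered informative

def lemmatize(tagged_words, tag_map, lexicon, kept):
    """Keep words whose category is in `kept` and replace them by their lemma."""
    lemmas = []
    for word, tag in tagged_words:
        category = tag_map.get(tag)
        if category in kept and word in lexicon.get(category, {}):
            lemmas.append(lexicon[category][word])
    return lemmas

tagged = [("Il", "CLS"), ("faut", "V"), ("vraiment", "ADV"), ("voir", "VINF"),
          ("le", "DET"), ("scénarios", "NC")]
toy_lemmas = lemmatize(tagged, toy_tag_dict, toy_lex_dict, toy_tags_crit)
print(toy_lemmas)  # ['falloir', 'vraiment', 'voir', 'scénario']
```

Words with an unknown tag ("Il") or an uninformative category ("le") are dropped, which is how pronouns and determiners disappear from the cleaned review.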

### 2.4 Putting it all together

We can combine the steps above into the following function, which we can then apply to the whole data set.

In [12]:
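def review_to_words(review):
    # Returns the cleaned review as a string, or None if `review` is not a string
    try:
        words = re.sub(r'\d+', '', review)
        words = words.replace("'", "e ")

        words = word_tokenize(words)
        tags = st_tag.tag(words)

        clean_review = french_stem(tags)

        return " ".join(clean_review)

    # re.sub raises TypeError on non-string input, str methods raise AttributeError
    except (AttributeError, TypeError):
        return None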

In [93]:
review_to_words(review)

'falloir vraiment voir fois être encore mieux ainsi sang fois ravir scénario'

Before going further, we split the dataset into a training set and a validation set.

We fit the model on the training set and use the validation set to decide whether the model is reliable enough. The validation set contains reviews, and possibly vocabulary, that the model has never seen, so a good score on it indicates that the model generalizes beyond the training set.

In [14]:
train, test = train_test_split(data, test_size=0.33)
train = train.reset_index(drop = True)
test = test.reset_index(drop = True)

train_reviews = []
train_ratings = []

for i, example in enumerate(train.review):
    if (i%100 == 0):
        print("Training review : %s" %i)
    clean_review = review_to_words(example)
    # keep only the reviews that could be processed
    if isinstance(clean_review, str):
        train_reviews.append(clean_review)
        train_ratings.append(train.rating[i])

test_reviews = []
test_ratings = []

for i, example in enumerate(test.review):
    if (i%100 == 0):
        print("Test review : %s" %i)
    clean_review = review_to_words(example)
    if isinstance(clean_review, str):
        test_reviews.append(clean_review)
        test_ratings.append(test.rating[i])

Training review : 0
Training review : 100
Training review : 200
Training review : 300
Training review : 400
Training review : 500
Training review : 600
Training review : 700
Training review : 800
Training review : 900
Training review : 1000
Training review : 1100
Training review : 1200
Training review : 1300
Training review : 1400
Training review : 1500
Training review : 1600
Training review : 1700
Training review : 1800
Training review : 1900
Training review : 2000
Training review : 2100
Training review : 2200
Training review : 2300
Training review : 2400
Training review : 2500
Training review : 2600
Training review : 2700
Training review : 2800
Training review : 2900
Training review : 3000
Training review : 3100
Training review : 3200
Training review : 3300
Training review : 3400
Training review : 3500
Training review : 3600
Training review : 3700
Training review : 3800
Training review : 3900
Training review : 4000
Training review : 4100
Training review : 4200
Training review : 4300


### 2.5 Classification Model

Now that we have cleaned the data, we can proceed to the model.

We start by building the document-term matrix (one row per review, one column per word):

In [55]:
vectorizer = CountVectorizer(min_df = 5, max_df = .95)
tfidf = TfidfTransformer()

train_data_features = vectorizer.fit_transform(train_reviews)
train_data_features_tfidf = tfidf.fit_transform(train_data_features)

train_data_features = train_data_features.toarray()
train_data_features_tfidf = train_data_features_tfidf.toarray()

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
vocab = vectorizer.get_feature_names_out()

In [56]:
dist = np.sum(train_data_features, axis=0)
dist_tfidf = np.sum(train_data_features_tfidf, axis=0)

dict_features = dict(zip(vocab, dist))
dict_features_tfidf = dict(zip(vocab, dist_tfidf))

In [67]:
import operator
sorted_dict_features = sorted(dict_features.items(), key=operator.itemgetter(1), reverse = True)[:10]
print("The 10 most frequent terms in the corpus and their tfidf : \n")
print("Word: count, tfidf \n")
for word, count in sorted_dict_features:
    print("%s : %s, %s" %(word, count, round(dict_features_tfidf[word],0)))

The 10 most frequent terms in the corpus and their tfidf : 

Word: count, tfidf 

être : 67429, 2559.0
et : 44840, 1771.0
avoir : 36873, 1710.0
film : 35899, 1680.0
ne : 25551, 1236.0
pas : 22907, 1175.0
plus : 12625, 766.0
faire : 12216, 729.0
voir : 11920, 870.0
mais : 11524, 713.0


In [75]:
# Linear SVM (hinge loss) trained with stochastic gradient descent
svm = SGDClassifier(loss = 'hinge')
svm = svm.fit(train_data_features_tfidf, train_ratings)
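
For intuition about what `loss = 'hinge'` means, here is a simplified pure-Python sketch of stochastic gradient descent on the hinge loss with an L2 penalty. It is not scikit-learn's implementation, which differs in details such as the learning-rate schedule. The function names, the toy data and the hyperparameters (`lr`, `alpha`, `epochs`) are all made up for illustration, and labels are encoded as -1 / +1.

```python
import random

def hinge_loss(w, b, x, y):
    """Hinge loss max(0, 1 - y * (w.x + b)) for one example with label y in {-1, +1}."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(0.0, 1.0 - y * score)

def sgd_hinge(X, Y, epochs=50, lr=0.1, alpha=0.01, seed=0):
    """Minimise hinge loss + (alpha / 2) * ||w||^2 with plain SGD."""
    rng = random.Random(seed)
    w, b = [0.0] * len(X[0]), 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = X[i], Y[i]
            violated = hinge_loss(w, b, x, y) > 0.0
            # The gradient of the L2 penalty shrinks the weights at every step
            w = [wi * (1.0 - lr * alpha) for wi in w]
            if violated:
                # The sub-gradient of the hinge loss is -y * x when the margin is violated
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

# Toy linearly separable data (made up)
X = [(1.0, 1.0), (2.0, 2.0), (-1.0, -1.0), (-2.0, -1.0)]
Y = [1, 1, -1, -1]
w, b = sgd_hinge(X, Y)
predictions = [1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1 for x in X]
print(predictions)
```

Only examples that fall inside the margin (loss greater than zero) move the weights, which is why the final hyperplane depends on the points closest to the boundary.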

### 2.6 Model Evaluation

In [78]:
test_data_features = vectorizer.transform(test_reviews)
test_data_features = tfidf.transform(test_data_features)
test_data_features = test_data_features.toarray()

y_pred = svm.predict(test_data_features)

In [94]:
# roc_auc_score on hard 0/1 predictions is the balanced accuracy, not the plain accuracy
auc = roc_auc_score(test_ratings, y_pred)
print("ROC AUC on validation set : %.2f%%" % (auc * 100.0))

ROC AUC on validation set : 91.57%
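
A note on the metric: the score reported above comes from `roc_auc_score` applied to hard 0/1 predictions. In that case the ROC curve has a single operating point, and the area under it reduces to the balanced accuracy, i.e. the mean of the recall on each class. With unbalanced classes this is not the same number as the plain accuracy. The sketch below shows the difference on made-up labels; both helper functions are hypothetical and written in plain Python.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Mean of the recall on the positive class and the recall on the negative class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    return 0.5 * (tp / n_pos + tn / n_neg)

# Made-up labels: 2 negatives, 6 positives, one negative misclassified
y_true = [0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 1, 1, 1, 1, 1, 1, 1]
print(accuracy(y_true, y_pred))           # 0.875
print(balanced_accuracy(y_true, y_pred))  # 0.75
```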


In [117]:
def benchmark(clf, X_train, Y_train, X_test, Y_test):
    # Same vectorizer and tf-idf settings as above, wrapped in a pipeline
    pipeline = Pipeline([
        ('vect', CountVectorizer(min_df = 5, max_df = .95)),
        ('tfidf', TfidfTransformer()),
        ('clf', clf)
    ])
    pipeline.fit(X_train, Y_train)
    y_pred = pipeline.predict(X_test)
    # ROC AUC computed on hard predictions, as for the SVM above
    return roc_auc_score(Y_test, y_pred)

results = {}
roc_curve_dict = {}

classifiers = [(RidgeClassifier(tol = 1e-2, solver = 'sag'), 'Ridge'),
              (Perceptron(max_iter = 50), 'Perceptron'),  # n_iter was replaced by max_iter in recent scikit-learn versions
              (RandomForestClassifier(n_estimators = 100), 'Random Forest'),
              (MultinomialNB(alpha = .01), 'Naive Bayes')]
    
for clf, name in classifiers:
    score = benchmark(clf, train_reviews, train_ratings,
                     test_reviews, test_ratings)
    results[name] = score
    print("ROC AUC %s: %0.2f%%" % (name, score * 100))

ROC AUC Ridge: 91.28%
ROC AUC Perceptron: 88.20%
ROC AUC Random Forest: 88.27%
ROC AUC Naive Bayes: 90.95%
