- Kaggle : https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews/overview
- Maker notes : https://www.notion.so/maker-NLP-00d265601ad146e490bea30cda512756

# Installation

In [None]:
! make -C ../

In [None]:
import sys
sys.path.append('..')

from pathlib import Path
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

from maker_nlp.vizualisation import plot_top_k_words_per_sentiment, plot_top_k_words_in_corpus
from maker_nlp.preprocessing import remove_stop_words, convert_to_lowercase, remove_accents, \
    remove_punctuation_and_digits, normalize_text, lemmatize

DATA_FOLDER = Path('../data')

## Dataset

In [None]:
df = pd.read_csv(DATA_FOLDER / 'final_dataset.csv')
print(df.shape)
df.head()

In [None]:
phrase, sentiment = df.Phrase, df.Sentiment
print(f'Shape of Phrase = {phrase.shape}, Shape of Sentiment = {sentiment.shape}')

## Pre-processing & Feature engineering

### Vectorize Text Data

#### Transform dataset to a bag of words

Bag of words is a naive method for vectorizing text. 
The first step is to gather all the distinct words that appear across all the texts. We call this the ***dictionary***.
Then, for each text, we count the occurrences of each word in the dictionary: these counts form a ***vector***.  

To make this explanation easy, let's take an example:  
  
>*This car is amazing*  
>*My car is blue*  

First, let's create our ***dictionary***:

<img src = "../img/bow2.png">

Then, we *vectorize* the two sentences as in the following table (each row corresponds to a sentence):

|this|car|is|amazing|my|blue|
|----|---|--|-------|--|----|
|   1|  1| 1|      1| 0|   0|
|   0|  1| 1|      0| 1|   1|
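
The table above can be reproduced with a few lines of plain Python. This is only a minimal sketch of the idea, not how `CountVectorizer` is implemented: the helper names `build_dictionary` and `vectorize` are made up for illustration, and words are split on whitespace only.

```python
from collections import Counter


def build_dictionary(texts):
    """Collect every distinct word, in order of first appearance."""
    dictionary = []
    for text in texts:
        for word in text.lower().split():
            if word not in dictionary:
                dictionary.append(word)
    return dictionary


def vectorize(text, dictionary):
    """Count how many times each dictionary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in dictionary]


sentences = ["This car is amazing", "My car is blue"]
dictionary = build_dictionary(sentences)
vectors = [vectorize(sentence, dictionary) for sentence in sentences]

print(dictionary)  # ['this', 'car', 'is', 'amazing', 'my', 'blue']
for vector in vectors:
    print(vector)
```

Note that `CountVectorizer`, with its default settings, also lowercases the text but orders its columns alphabetically and ignores one-character tokens, so its output columns will not necessarily follow the order of the table.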




**Now, we apply this strategy to our data**

In [None]:
count_vectorizer = CountVectorizer()

In [None]:
count_vectorizer.fit(phrase)

In [None]:
phrase_count_features = count_vectorizer.transform(phrase)
phrase_count_features

#### Evaluation

In [None]:
plot_top_k_words_per_sentiment(phrase, sentiment, 10)

### Work with test dataset

**Exercise:**

Clean the text so that the plot becomes useful for analysis.

(Excerpts from 'Du pétrole à l’énergie verte, Total en transition ?', Ouest-France)

In [None]:
texte_1 = "À l'horizon 2025, la multinationale vise une capacité mondiale de production d’électricité \
    bas carbone de 25 gigawatts (contre 2,7 GW aujourd’hui), « ce qui commencerait à être relativement \
    important à l’échelle de la planète », déroule Patrick Pouyanné."

In [None]:
texte_2 = "Depuis 2015, Total a acquis pour presque six milliards d’euros d’actifs \
    dans les énergies dites nouvelles (Saft, Direct Energie, Total Eren, etc.). Une branche qui \
    concentre désormais 10 % du total de ses investissements et représente 10 000 employés, sur les \
    100 000 que compte le groupe."

In [None]:
texte_3 = "Le nombre de stations a presque été divisé par quatre ces quarante dernières \
    années en France, et dépassait à peine les 11 000 en 2018."

In [None]:
texte_4 = "À titre d’exemple, Patrick Pouyanné évoque le cas de l’Inde, un pays de « bientôt \
    1,5 milliard de personnes », soit « 25 % du problème du changement climatique »."

In [None]:
plot_top_k_words_in_corpus([texte_1, texte_2, texte_3, texte_4], 10)

### Remove Useless Words

In [None]:
text = """24 sept. 2020 14:52 - Le groupe Total a confirmé ce jeudi la fermeture de sa raffinerie de Grandpuits (Seine-et-Marne) pour la transformer en "plateforme zéro pétrole"."""
text

In [None]:
remove_stop_words(text)

#### Normalize Text

In [None]:
text = """24 sept. 2020 14:52 - Le groupe Total a confirmé ce jeudi la fermeture de sa raffinerie de Grandpuits (Seine-et-Marne) pour la transformer en "plateforme zéro pétrole"."""
text

- Convert text to lowercase

In [None]:
convert_to_lowercase(text)

- Remove accents

In [None]:
remove_accents(text)

- Remove punctuation and digits

In [None]:
remove_punctuation_and_digits(text)

**All these operations are grouped into one function**

In [None]:
normalized_text = normalize_text(text)
normalized_text
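
For readers who want to see what such a pipeline could look like, here is a sketch using only the standard library. It is an assumption about the behaviour, not the actual code of `maker_nlp.preprocessing`: every `sketch_*` function is a hypothetical stand-in, and the sample sentence is invented for the demonstration.

```python
import re
import unicodedata


def sketch_lowercase(text):
    return text.lower()


def sketch_remove_accents(text):
    # Decompose accented characters (e.g. "é" becomes "e" + a combining accent),
    # then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def sketch_remove_punctuation_and_digits(text):
    # Replace everything that is not an ASCII letter or whitespace with a space,
    # then collapse runs of whitespace.
    letters_only = re.sub(r"[^a-zA-Z\s]", " ", text)
    return re.sub(r"\s+", " ", letters_only).strip()


def sketch_normalize(text):
    # Order matters: accents are removed before the ASCII-only filter,
    # otherwise accented letters would be discarded entirely.
    text = sketch_lowercase(text)
    text = sketch_remove_accents(text)
    return sketch_remove_punctuation_and_digits(text)


sample = "Le groupe Total a confirmé, ce jeudi 24, la fermeture !"
print(sketch_normalize(sample))
```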

#### Stop words

In [None]:
useful_words = remove_stop_words(normalized_text)
useful_words

In [None]:
# Words present in the normalized text but dropped by stop-word removal
useless_words = set(normalized_text.split()) - set(useful_words.split())
useless_words
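
Stop-word removal itself boils down to filtering tokens against a fixed list. The sketch below illustrates that idea under stated assumptions: `SKETCH_STOP_WORDS` is a deliberately tiny, hypothetical list and `sketch_remove_stop_words` is not the `remove_stop_words` function imported above, which may rely on a much larger list.

```python
# Hypothetical, deliberately tiny French stop-word list (for illustration only).
SKETCH_STOP_WORDS = {"le", "la", "de", "sa", "ce", "a", "en", "pour"}


def sketch_remove_stop_words(text, stop_words=SKETCH_STOP_WORDS):
    """Keep only the whitespace-separated words that are not stop words."""
    return " ".join(word for word in text.split() if word not in stop_words)


sample = "le groupe total a confirme la fermeture de sa raffinerie"
kept = sketch_remove_stop_words(sample)
# The removed words are those in the input that no longer appear in the output.
removed = set(sample.split()) - set(kept.split())

print(kept)
print(sorted(removed))
```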

#### Evaluation

In [None]:
normalized_phrase = phrase.apply(normalize_text)
cleaned_phrase = normalized_phrase.apply(remove_stop_words)

In [None]:
plot_top_k_words_per_sentiment(cleaned_phrase, sentiment, 20)

### Lemmatization

#### What is text lemmatization?

Lemmatization is the process of reducing each word to a *canonical form* called a ***lemma***. The objective is to treat different forms of the same word as a single one.  
For example:  

> *better* -> good  
> *are* -> be  
> *running* -> run
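
The mapping above can be mimicked with a simple lookup table. This is a toy sketch only: `TOY_LEMMAS` and `toy_lemmatize` are hypothetical names, the table holds just the three examples listed, and real lemmatizers (including, presumably, the `lemmatize` function used in this notebook) rely on full vocabularies and grammatical context.

```python
# Toy lookup table containing only the three examples above.
TOY_LEMMAS = {"better": "good", "are": "be", "running": "run"}


def toy_lemmatize(text, lemmas=TOY_LEMMAS):
    """Replace each known word by its lemma; leave unknown words unchanged."""
    return " ".join(lemmas.get(word, word) for word in text.lower().split())


print(toy_lemmatize("They are running better"))
```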

In [None]:
sentence = "Apples and oranges are similar. Boots and hippos aren't."
lemmatize(sentence)

#### Application

In [None]:
lemmatized_text = cleaned_phrase.apply(lemmatize)
lemmatized_text

In [None]:
plot_top_k_words_per_sentiment(lemmatized_text, sentiment, 20)