# Sentiment analysis for the IMDB reviews dataset

In this notebook, we tackle sentiment analysis on the IMDB reviews dataset (download link available in the [README file](./README.md)). The dataset consists of 50000 reviews, each associated with a positive or negative sentiment. The goal of this notebook is to build a classifier that predicts the sentiment of a review with high accuracy.

In a first stage, we select a subset of the data to work on (e.g. 5000 samples, i.e. 10% of the whole dataset). The goal of this step is to experiment with different cleaning strategies and to perform hyperparameter selection.

Then, we divide the whole dataset into 25000 train samples and 25000 test samples:
* The train set is used to train a model with the hyperparameters obtained from the cross-validation procedure on the data subset used previously.
* The test set is used solely for model performance assessment.
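
The 50/50 split described above can be sketched without any dependency as follows. This is only an illustration under assumptions: `stratified_half_split` is a hypothetical helper, the labels are toy data, and in the notebook itself the same result would more likely come from `model_selection.train_test_split` with `stratify`, as used for the subset selection.

```python
import random
from collections import defaultdict

def stratified_half_split(labels, seed=0):
    """Return (train_idx, test_idx), putting half of each class in each part."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)            # random assignment within each class
        half = len(indices) // 2
        train_idx.extend(indices[:half])
        test_idx.extend(indices[half:])
    return sorted(train_idx), sorted(test_idx)

# Toy stand-in for the sentiment column (hypothetical data)
labels = ["positive", "negative"] * 10
train_idx, test_idx = stratified_half_split(labels)
```

Splitting per class keeps the proportion of positive and negative reviews identical in the train and test parts, so accuracy on the test set is not skewed by a class imbalance introduced by the split.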

## Libraries imports

If the nltk resources needed for data cleaning are not already present on the machine, this cell downloads them, which might take some time depending on the internet connection speed.

In [33]:
from sklearn import model_selection
import re
import pandas as pd
import numpy as np
import joblib
from itertools import chain
from pandarallel import pandarallel
pandarallel.initialize(progress_bar= True)


import utils, postprocess as postproc, preprocess as preproc, tuning
from os import path
from functools import partial
from tqdm.auto import tqdm
tqdm.pandas()
import pprint


INFO: Pandarallel will run on 4 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.


## Using a data subset
We select a subset of data points; half of them are used for training and the other half for testing.

In [34]:
if not path.exists("./data/data_raw_sub.csv"):
    df_raw_all = pd.read_csv("./data/data_raw.csv")
    df_raw_sub = model_selection.train_test_split(df_raw_all, train_size= 0.1, 
                                                  stratify = df_raw_all["sentiment"],
                                                  random_state= 0)[0]
    #df_raw_sub = df_raw
    df_raw_sub.to_csv("data/data_raw_sub.csv", index= False)

df = pd.read_csv("data/data_raw_sub.csv")
df_raw = pd.read_csv("data/data_raw.csv")
df["sentiment"] = (df["sentiment"]=="positive").values.astype("int8")
df.head()

Unnamed: 0,review,sentiment
0,Taran Adarsh a reputed critic praised such a d...,0
1,"Worth the entertainment value of a rental, esp...",0
2,"I liked Antz, but loved ""A Bug's Life"". The an...",1
3,This reboot is like a processed McDonald's mea...,0
4,"The working title was: ""Don't Spank Baby"". <br...",1


# Data reading and cleaning
In this part, we perform data cleaning. This is done once over the whole dataset and comprises processing steps that do not affect the task at hand (sentiment analysis in this case). In other words, these steps are not tunable. Such steps do not depend on the data distribution, but solely on the example at hand. They include:
* Removing html tags
* Removing punctuation
* Shortening/removing long runs of spaces and other characters (e.g. underscores)
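
The three steps listed above can be sketched with the standard `re` module. This is a minimal illustration, not the content of `preproc.clean_text`: the name `basic_clean` and the exact patterns are assumptions, and the sketch does not handle contractions, lemmatization, or proper nouns.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")     # HTML tags such as <br />
PUNCT_RE = re.compile(r"[^\w\s]")   # any character that is neither a word character nor whitespace
SPACE_RE = re.compile(r"[\s_]+")    # runs of whitespace and/or underscores

def basic_clean(text):
    """Apply the three example-level cleaning steps in order."""
    text = TAG_RE.sub(" ", text)    # 1. remove HTML tags
    text = PUNCT_RE.sub(" ", text)  # 2. remove punctuation
    text = SPACE_RE.sub(" ", text)  # 3. collapse long spaces and underscores
    return text.strip()

cleaned = basic_clean("A great film!<br /><br />Truly   one_of_a_kind...")
print(cleaned)
```

Each step looks only at the review being processed, which is why it can safely be applied once to the whole dataset before any train/test split.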

Once the cleaning is done (or if the cleaned file already exists), a subset of the training set is selected, on which most of the model selection experiments are carried out.

In [37]:
cleaning_steps = preproc.clean_steps_before_normalize
#print(cleaning_steps)
text_processor = partial(preproc.clean_text, check_spell= False, normalize= "lemmatize", rm_stop_words= False, 
                         steps = cleaning_steps)
df["review_clean"] = df["review"].parallel_apply(text_processor)
#df_cln["review"] = df_raw["review"].progress_apply(text_processor)
df.head()

Unnamed: 0,review,sentiment,review_clean
0,Taran Adarsh a reputed critic praised such a d...,0,a reputed critic praise such a dubba movie fil...
1,"Worth the entertainment value of a rental, esp...",0,worth the entertainment value of a rental espe...
2,"I liked Antz, but loved ""A Bug's Life"". The an...",1,i like but love a animation that be put into t...
3,This reboot is like a processed McDonald's mea...,0,this reboot be like a processed meal compare t...
4,"The working title was: ""Don't Spank Baby"". <br...",1,the work title be not go on to become a succes...


## Spell checking

What fraction of the words are unknown to the spell checker?

In [26]:
words = []
for s in tqdm(df["review_clean"]):
    words.extend(s.split())
words = set(words)


In [27]:
unknown_words = preproc.spell_checker.unknown(words)
print(f"{len(unknown_words)/len(set(words))*100:.2f} % of words are unknown !")

34.81 % of words are unknown !
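
Note that the figure above is computed over distinct words, since `words` is a set. The share of word occurrences that are unknown may differ considerably, because rare misspellings each count as one distinct word. The following sketch compares both measures; `unknown_fractions`, the toy texts, and the toy vocabulary are hypothetical and stand in for the notebook's spell checker.

```python
from collections import Counter

def unknown_fractions(texts, known_words):
    """Return (fraction of distinct words unknown, fraction of occurrences unknown).

    Assumes `texts` contains at least one word.
    """
    counts = Counter(word for text in texts for word in text.split())
    unknown = {w for w in counts if w not in known_words}
    type_frac = len(unknown) / len(counts)
    token_frac = sum(counts[w] for w in unknown) / sum(counts.values())
    return type_frac, token_frac

# Toy data (hypothetical): 7 distinct words, 11 occurrences, 2 unknown words
texts = ["the movie be great", "the movie be sloow", "the beeg dragon"]
known = {"the", "movie", "be", "great", "dragon"}
type_frac, token_frac = unknown_fractions(texts, known)
print(f"{type_frac:.2%} of distinct words, {token_frac:.2%} of occurrences")
```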


Let's print some of them:

In [28]:
for w in list(unknown_words)[:40]:
    print(w)

qu
strummer
voy
guignol
flagrante
vadim
anatomising
impresson
ballbusting
sloow
vierde
protegee
hugo
leeze
beeg
eightiesly
reunuin
zen
za
sexploitative
represntation
unsensationalized
dawson
riviting
cassarole
cum
imprisonement
louise
ga
cleansweep
pooja
epiphanous
meth
tonto
bejeebers
honoured
horts
camelot
kz
pardesi


In [36]:
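def print_error_source(unk_wrd, col):
    """Print column `col` of every review whose cleaned text contains `unk_wrd`.

    Returns the positions of the matching rows in `df`.
    """
    # Compare against whole tokens so that substrings do not match
    inds = np.where(df["review_clean"].apply(lambda s: unk_wrd in s.lower().split()).values)[0]
    for i in inds:
        pprint.pprint(df[col].iloc[i])
        print("----")
    return inds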

print_error_source("beeg", "review")

('Whew! What can one say about this bizarre, stupefying mock-u-mentory about '
 "Ed Wood's cross-dressing fantasies?? Well, one word that comes to mind is "
 'incoherent! Wood uses raw slabs of innocuous, incidental stock footage, and '
 'then builds a "story" around them - and what a story!! Wood himself stars as '
 'Glen, a regular Joe who just happens to enjoy lounging around in his '
 "fiancee's lingerie and sweaters. I think what Wood wanted was a plea for "
 'tolerance for all the Glens of this world by showing that Glen is just like '
 'all of us underneath, only in angora. Ummm...ok. But then, we get this very '
 'bizarre montage of some horny devil, a chick in bondage, some rude, pointing '
 'people, some moore stock footage, and finally an emaciated Bela '
 'Lugusi,playing some kind of twisted, invalid Puppetmaster. Lugosi is a howl, '
 'spouting out such rubbish as "Beeevaaare...the beeeg greeeen dragon that '
 'seeets at your doorstep: he eeeets leeetle boys, puppydog tails

array([1282])

In [13]:
preproc.proper_noun_re.match("Asia Argento")

We observe the following sources of unknown words:
* British vs. American spelling (e.g. "honoured")
* Adverbs that do not exist in the English language, formed for example by adding the "ly" suffix (e.g. "eightiesly")
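
One possible way to handle the first source is to map British spellings to their American form before spell checking. The sketch below assumes a hand-written mapping; `BRITISH_TO_AMERICAN` and `americanize` are hypothetical names, and a usable mapping would need to be far larger than these three entries.

```python
# Hypothetical, hand-picked mapping used only for illustration
BRITISH_TO_AMERICAN = {
    "honoured": "honored",
    "colour": "color",
    "favourite": "favorite",
}

def americanize(text, mapping=BRITISH_TO_AMERICAN):
    """Replace known British spellings word by word.

    Assumes `text` is already cleaned and whitespace-tokenizable.
    """
    return " ".join(mapping.get(word, word) for word in text.split())

print(americanize("an honoured guest in a colour film"))
```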

## What languages are there?

In [9]:
import langdetect
# df_raw["language"]= df_raw["review"].parallel_apply(lambda s: langdetect.detect(s))

In [10]:
# re.sub(r"([A-Z][a-z]+ *){2,}|.+ ([A-Z][a-z]+ *)+", " ", "hat is Happening")
re.sub(r"([A-Z][a-z]+ *){2,}|.+ [A-Z][a-z]+", " ", "hat is Happening")

' '

In [11]:
re.sub(r"([^\.])[A-Z][a-z]+", r"\1", "what Alex did")

'what  did'

TODO
* [ ] Write tests for every regular expression
* [ ] Take into account British and American spelling for the spell checker
* [ ] Examine the languages
* [ ] Take into account:
    * proper nouns of one word (needs a better pattern)
    * all-capitalized text, and mixtures of capitalized and non-capitalized text
    * exclamation points
    * emoticons
    * smileys
    * message abbreviations: tl;dr, lol, lmao, etc.
    * content between quotation marks
    * laughs: standardize to a common laugh, e.g. replace by "hhhh"? Does the length convey some sentiment?
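
As a starting point for the last item, laughs could be standardized with a regular expression. The sketch below is only one possible approach: `LAUGH_RE` and `standardize_laughs` are hypothetical, the pattern covers only a few laugh spellings, and replacing every laugh with the same token discards its length, which leaves the question about sentiment open.

```python
import re

# Assumed laugh spellings: "haha", "hahaha", "hehe", "hihi", "hhhh", ...
LAUGH_RE = re.compile(r"\b(?:(?:ha|he|hi){2,}h?|h{3,})\b", re.IGNORECASE)

def standardize_laughs(text, token="hhhh"):
    """Replace every matched laugh with a single common token."""
    return LAUGH_RE.sub(token, text)

print(standardize_laughs("hahahaha that was funny hhhhhh"))
```

The word boundaries keep ordinary words such as "the" or "he" untouched, since a laugh must form a whole token to match.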

In [45]:
preproc.spell_checker.correction("hhhhhh")