# Sentiment Analysis: From Naive Bayes to BERT 

**Introduction**

TODO TODO TODO

TODO TODO TODO

TODO TODO TODO

TODO TODO TODO


**Prerequisites**

1. Install requirements 

```
pip install -r requirements.txt
```

2. Download the [Google Word2Vec model](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing) (3M vocab, cased, 300d) to this directory and run:

```
gunzip GoogleNews-vectors-negative300.bin.gz
```

3. Download [Stanford GloVe Model](http://nlp.stanford.edu/data/glove.840B.300d.zip) (2.2M vocab, cased, 300d) to this directory and run the following commands.

```
unzip glove.840B.300d.zip
python -m gensim.scripts.glove2word2vec --input  glove.840B.300d.txt --output glove.840B.300d.w2vformat.txt
rm glove.840B.300d.zip glove.840B.300d.txt
```

Alternatively, GloVe vectors are available through spaCy's `en_core_web_md` model; see the [documentation](https://spacy.io/models/en#en_core_web_md). This notebook does not use spaCy's GloVe vectors because of their limitations.

4. Download the spaCy model by running this command in a terminal:

```
python -m spacy download en_vectors_web_sm
```



In [33]:
%load_ext autoreload
%autoreload

from dataset import download_tfds_imdb_as_text
from classical_ml_models import run_logistic_exp, run_multi_nb_exp, run_ber_nb_exp
from word_emb import run_logistic_word_emb_exp
from nlp_utils import print_stat

import gensim

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


**1. Prepare for experiment**

- Load the pretrained word-embedding models.
- Download the dataset and take a brief look at it.

In [22]:
word_emb_models = {
    "word2vec": gensim.models.KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True),
    "glove": gensim.models.KeyedVectors.load_word2vec_format('./glove.840B.300d.w2vformat.txt', binary=False) 
}

X_train, X_test, y_train, y_test = download_tfds_imdb_as_text()



number of training samples 25000
number of testing samples 25000


In [17]:
print_stat(X_train)

average number of char 1325.07
average number of tokens 272.43
total number of vocab without stop words 120455
most common: [('/><br', 50935), ('movie', 42423), ('film', 38841), ('like', 19414), ('good', 14327)]

example
Avoid this one, unless you want to watch an expensive but badly made movie. Example? The sound is good but the dialogue is not clear - a cardinal sin in a French film.<br /><br />This film attempts to combine western, drug intrigue and ancien regime costume epic. What? Well, consider this. The cowboy music is hilarious during sword fights. Or how about the woman in her underwear, holding a knife and jumping up and down on the bed?<br /><br />Someone should do a 'What's Up Tiger Lily' on this bomb. Rewrite the script and then either dub or subtitle it. Heck, it's almost that now. (BTW, Gerard Depardieu and Carole Bouquet, both known to American audiences, have roles.)

This movie is a half-documentary...and it is pretty interesting....<br /><br />This is a good movie...

In [18]:
print_stat(X_test)

average number of char 1293.79
average number of tokens 266.30
total number of vocab without stop words 119407
most common: [('/><br', 50039), ('movie', 42305), ('film', 38185), ('like', 19084), ('good', 13846)]

example
This movie was horrendous it was sorta like accidentally watching a gay porn waiting for the girls but they just don't come....I waited for almost 2 hours for the damn scarecrows....they just don't come...instead it's just some dumb ass wandering through a dead cornfield with a camera it's a mix of Blaire witch and some bad episode of the twilight zone. And the best part is that as of October 23 2005 they started filming a sequel please don't be fooled by the box even though it looks exactly the same as the first dark harvest it's not lions gate bought the rights to the Maize:the movie and had the brilliant idea to release it as the sequel to the original dark harvest;which i thought was funny........the only thing they had in common was they were both shot in a cornfi

**Discussion**

At a glance, we can see that:
- HTML tags such as `<br />` must be removed during text preprocessing; their residue `/><br` is the most common token.
- Punctuation may be useful because it can signal excitement.
- The average review has about 270 tokens. Keep this in mind when choosing models.
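
As a rough illustration of the first point, here is a minimal sketch of tag removal with a regular expression. The function name `strip_html` is hypothetical; it is not the notebook's `preprocess_remove_html_non_ascii`, whose implementation is not shown here.

```python
import re

# Matches any HTML tag, e.g. "<br />" or "<i>".
TAG_RE = re.compile(r"<[^>]+>")

def strip_html(text: str) -> str:
    """Replace HTML tags with a space, then collapse repeated whitespace."""
    no_tags = TAG_RE.sub(" ", text)
    return re.sub(r"\s+", " ", no_tags).strip()

review = "This is a good movie...<br /><br />This film is interesting."
cleaned = strip_html(review)
print(cleaned)  # This is a good movie... This film is interesting.
```

Replacing tags with a space rather than an empty string keeps the words on either side of `<br />` from being glued together.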

**2. Probabilistic Models**

In the first experiment, we use classical machine learning algorithms such as Logistic Regression and Naive Bayes.

`X = IMDB_review`

`y = 1 if the review is positive else 0`

To create the vector X that represents the IMDB reviews, we go through the following steps.

**Tokenizer**

I use the spaCy tokenizer with three different settings; see the [documentation](https://spacy.io/usage/linguistic-features#tokenization) to understand its algorithm.

1. spaCy Tokenizer
2. spaCy Tokenizer + Lowercase + Lemmatization
3. spaCy Tokenizer + Lowercase + Lemmatization + Remove Stop Words

The intuition behind lowercasing and lemmatization is that they group different forms of a word with the same meaning together. For example,

`It is a good movie.`

`It is the best movie.`

If we tokenize and lemmatize these two sentences, both results share the token `good`. If we tokenize without lemmatizing, `good` and `best` remain different tokens.

Lemmatization may or may not improve model accuracy; it depends on the NLP task we are working on. Let's run the experiment, see how lemmatization affects the accuracy of our models, and discuss why.

Another thing we can do is remove stop words. Words like `is` and `of` are likely to appear in almost every document, so they provide very little information for classifying a document. Again, removing stop words may or may not help: a model like logistic regression can itself assign low weights to word features that carry little information.
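
The three tokenizer settings can be sketched with toy stand-ins. This is only an illustration: `TOY_LEMMAS` and `TOY_STOP_WORDS` are made-up tables, and a real run would use spaCy's tokenizer, lemmatizer, and stop-word list instead.

```python
# Toy stand-ins (assumptions), not spaCy's actual lemma or stop-word data.
TOY_LEMMAS = {"is": "be", "best": "good"}
TOY_STOP_WORDS = {"it", "be", "a", "the"}

def tokenize(text):
    """Setting 1: detach periods, then split on whitespace."""
    return text.replace(".", " .").split()

def lower_lemma(tokens):
    """Setting 2: lowercase each token, then map it to its lemma."""
    return [TOY_LEMMAS.get(t.lower(), t.lower()) for t in tokens]

def remove_stop(tokens):
    """Setting 3: additionally drop stop words."""
    return [t for t in tokens if t not in TOY_STOP_WORDS]

a = lower_lemma(tokenize("It is a good movie."))
b = lower_lemma(tokenize("It is the best movie."))
print(a)  # ['it', 'be', 'a', 'good', 'movie', '.']
print(b)  # ['it', 'be', 'the', 'good', 'movie', '.']
print(remove_stop(a) == remove_stop(b))  # True
```

After lemmatization both sentences contain `good`, and after stop-word removal they become identical, which shows both the grouping effect and the information that these steps discard.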


**Vectorization**
There are several choices of text representation; see the [documentation](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_extraction.text).

Here I use two types of vectorizer:
1. CountVectorizer
2. TfidfVectorizer

Some settings we can play with are `min_df` and `max_df`, the minimum and maximum document frequency (the number or proportion of documents a word appears in) for a word to be kept in the vocabulary. Here I use the defaults: `min_df` is 1 and `max_df` is 1.0, so no word is filtered out.

Note that the intuition behind `max_df` is similar to that of removing stop words. Words like `is`, `I`, and `of` are likely to appear in almost every document, so a sufficiently low `max_df` filters them out, just as a stop-word list does.
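
To make the document-frequency idea concrete, here is a small standard-library sketch. The helper `filter_vocab` and the four toy documents are hypothetical; the helper only mimics the thresholding idea and is not scikit-learn's implementation.

```python
from collections import Counter

docs = [
    "this movie is good",
    "this film is boring",
    "the plot is thin",
    "great acting",
]

# Document frequency: in how many documents does each word appear?
df = Counter(word for doc in docs for word in set(doc.split()))

def filter_vocab(df, n_docs, min_df=1, max_df=1.0):
    """Keep words whose count is >= min_df and whose proportion is <= max_df.

    Here min_df is an absolute count and max_df a proportion, matching the
    types of scikit-learn's default values (min_df=1, max_df=1.0).
    """
    return sorted(w for w, c in df.items() if c >= min_df and c / n_docs <= max_df)

print(filter_vocab(df, len(docs)))              # defaults keep every word
print(filter_vocab(df, len(docs), max_df=0.5))  # drops "is" (3 of 4 documents)
```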

For each vectorizer, I try four different configurations (2 x 2):
1. Feature values: binary (presence or absence) or multinomial (counts)
2. N-gram range: 1-grams only, or both 1-grams and 2-grams
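
The first axis, binary versus count features, can be illustrated with a short sketch on a made-up token list (this is not the notebook's vectorizer code):

```python
from collections import Counter

tokens = "good good good but boring".split()

# Multinomial-style features: how many times each token occurs.
counts = dict(Counter(tokens))

# Binary features: 1 if the token occurs at all, regardless of how often.
binary = dict.fromkeys(tokens, 1)

print(counts)  # {'good': 3, 'but': 1, 'boring': 1}
print(binary)  # {'good': 1, 'but': 1, 'boring': 1}
```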

To understand the intuition behind n-grams, let's see this example.

`This movie is not good. It is boring.`

`This movie is not boring. It is good.` 

These two sentences have exactly the same 1-grams but opposite sentiment. We have to use 2-grams to tell them apart.
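
This can be checked with a few lines of standard-library Python. The `ngrams` helper is a hypothetical illustration, and the sentences are lowercased and split on whitespace for simplicity:

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

a = "this movie is not good . it is boring .".split()
b = "this movie is not boring . it is good .".split()

# Identical bags of 1-grams: a unigram model cannot separate the two reviews.
print(Counter(ngrams(a, 1)) == Counter(ngrams(b, 1)))  # True

# The 2-grams differ exactly where the sentiment is expressed.
bigrams_a, bigrams_b = set(ngrams(a, 2)), set(ngrams(b, 2))
print(sorted(bigrams_a - bigrams_b))  # [('is', 'boring'), ('not', 'good')]
print(sorted(bigrams_b - bigrams_a))  # [('is', 'good'), ('not', 'boring')]
```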


**Models**
1. Multinomial Naive Bayes, see [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB)
2. Bernoulli Naive Bayes, see [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html#sklearn.naive_bayes.BernoulliNB)
3. Logistic Regression, see [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)

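The first two are generative models based on the joint probability of documents and labels. Both estimate their parameters by maximum likelihood; the difference is how they define features: Multinomial Naive Bayes models how many times each term occurs, while Bernoulli Naive Bayes models only whether each term is present. The third model is discriminative: it models the conditional probability of the label directly and estimates its parameters with an iterative, gradient-based optimizer. Learn more in [Manning's Information Retrieval](https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf).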
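
For reference, the textbook decision rules of the three models can be written as follows. The notation is an assumption introduced here, not taken from the notebook: $V$ is the vocabulary, $n_t$ is the count of term $t$ in the document, $b_t \in \{0, 1\}$ indicates whether $t$ is present, $\theta_{t \mid c}$ is the probability of drawing term $t$ in class $c$, and $p_{t \mid c}$ is the probability that a document of class $c$ contains $t$.

Multinomial Naive Bayes:

$$\hat{y} = \arg\max_{c} \Big[ \log P(c) + \sum_{t \in V} n_t \log \theta_{t \mid c} \Big]$$

Bernoulli Naive Bayes:

$$\hat{y} = \arg\max_{c} \Big[ \log P(c) + \sum_{t \in V} \big( b_t \log p_{t \mid c} + (1 - b_t) \log (1 - p_{t \mid c}) \big) \Big]$$

Logistic Regression, with feature vector $x$, weights $w$, and intercept $w_0$:

$$P(y = 1 \mid x) = \sigma(w^\top x + w_0) = \frac{1}{1 + e^{-(w^\top x + w_0)}}$$

Note that the Bernoulli model also scores the absence of a term through the $(1 - b_t)$ part, whereas the multinomial model ignores terms that do not occur.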

Now let's start with the most classical model for sentiment analysis: Naive Bayes.


In [3]:
run_multi_nb_exp(X_train, X_test, y_train, y_test)

Unnamed: 0,vectorizer,preprocessing,tokenizer,ngram,binary/multinomial,F1
0,CountVectorizer,preprocess_remove_html_non_ascii,spacy_tokenizer,"(1, 1)",False,0.798752
1,TfidfVectorizer,preprocess_remove_html_non_ascii,spacy_tokenizer,"(1, 1)",False,0.821933
2,CountVectorizer,preprocess_remove_html_non_ascii,spacy_tokenizer,"(1, 2)",False,0.843808
3,TfidfVectorizer,preprocess_remove_html_non_ascii,spacy_tokenizer,"(1, 2)",False,0.853251
4,CountVectorizer,preprocess_remove_html_non_ascii,spacy_tokenizer_lower_lemma,"(1, 1)",False,0.792631
5,TfidfVectorizer,preprocess_remove_html_non_ascii,spacy_tokenizer_lower_lemma,"(1, 1)",False,0.810695
6,CountVectorizer,preprocess_remove_html_non_ascii,spacy_tokenizer_lower_lemma,"(1, 2)",False,0.840261
7,TfidfVectorizer,preprocess_remove_html_non_ascii,spacy_tokenizer_lower_lemma,"(1, 2)",False,0.849568
8,CountVectorizer,preprocess_remove_html_non_ascii,spacy_tokenizer_lower_lemma_remove_stop,"(1, 1)",False,0.804507
9,TfidfVectorizer,preprocess_remove_html_non_ascii,spacy_tokenizer_lower_lemma_remove_stop,"(1, 1)",False,0.817771


In [4]:
run_ber_nb_exp(X_train, X_test, y_train, y_test)

Unnamed: 0,vectorizer,preprocessing,tokenizer,ngram,binary/multinomial,F1
0,CountVectorizer,preprocess_remove_html_non_ascii,spacy_tokenizer,"(1, 1)",True,0.824464
1,CountVectorizer,preprocess_remove_html_non_ascii,spacy_tokenizer,"(1, 2)",True,0.846779
2,CountVectorizer,preprocess_remove_html_non_ascii,spacy_tokenizer_lower_lemma,"(1, 1)",True,0.816929
3,CountVectorizer,preprocess_remove_html_non_ascii,spacy_tokenizer_lower_lemma,"(1, 2)",True,0.842736
4,CountVectorizer,preprocess_remove_html_non_ascii,spacy_tokenizer_lower_lemma_remove_stop,"(1, 1)",True,0.798459
5,CountVectorizer,preprocess_remove_html_non_ascii,spacy_tokenizer_lower_lemma_remove_stop,"(1, 2)",True,0.800929


Now let's run the logistic regression experiment.

In [5]:
run_logistic_exp(X_train, X_test, y_train, y_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

Unnamed: 0,vectorizer,preprocessing,tokenizer,ngram,binary/multinomial,F1
0,CountVectorizer,preprocess_remove_html_non_ascii,spacy_tokenizer,"(1, 1)",True,0.887099
1,TfidfVectorizer,preprocess_remove_html_non_ascii,spacy_tokenizer,"(1, 1)",True,0.891876
2,CountVectorizer,preprocess_remove_html_non_ascii,spacy_tokenizer,"(1, 1)",False,0.885628
3,TfidfVectorizer,preprocess_remove_html_non_ascii,spacy_tokenizer,"(1, 1)",False,0.884538
4,CountVectorizer,preprocess_remove_html_non_ascii,spacy_tokenizer,"(1, 2)",True,0.90142
5,TfidfVectorizer,preprocess_remove_html_non_ascii,spacy_tokenizer,"(1, 2)",True,0.909439
6,CountVectorizer,preprocess_remove_html_non_ascii,spacy_tokenizer,"(1, 2)",False,0.899214
7,TfidfVectorizer,preprocess_remove_html_non_ascii,spacy_tokenizer,"(1, 2)",False,0.902088
8,CountVectorizer,preprocess_remove_html_non_ascii,spacy_tokenizer_lower_lemma,"(1, 1)",True,0.88055
9,TfidfVectorizer,preprocess_remove_html_non_ascii,spacy_tokenizer_lower_lemma,"(1, 1)",True,0.887447


**Discussion**

TODO TODO TODO TODO TODO TODO TODO TODO 

TODO TODO TODO TODO TODO TODO TODO TODO 

TODO TODO TODO TODO TODO TODO TODO TODO 

TODO TODO TODO TODO TODO TODO TODO TODO 

**Word Embeddings**


TODO TODO TODO TODO TODO TODO TODO TODO 

TODO TODO TODO TODO TODO TODO TODO TODO 

TODO TODO TODO TODO TODO TODO TODO TODO 

TODO TODO TODO TODO TODO TODO TODO TODO 

In [34]:
run_logistic_word_emb_exp(X_train, X_test, y_train, y_test, word_emb_models)

Unnamed: 0,word_emb_model,tfidf,tokenizer,polling,F1
0,word2vec,True,spacy_tokenizer_lower_lemma_remove_stop,norm,0.822629
1,glove,True,spacy_tokenizer_lower_lemma_remove_stop,norm,0.819563
2,word2vec,True,spacy_tokenizer,norm,0.832477
3,glove,True,spacy_tokenizer,norm,0.840142
4,word2vec,True,spacy_tokenizer_lower_lemma,norm,0.827647
5,glove,True,spacy_tokenizer_lower_lemma,norm,0.824966
6,word2vec,False,spacy_tokenizer_lower_lemma_remove_stop,norm,0.845811
7,glove,False,spacy_tokenizer_lower_lemma_remove_stop,norm,0.845131
8,word2vec,False,spacy_tokenizer,norm,0.856787
9,glove,False,spacy_tokenizer,norm,0.853586
