# **DSFM Exercise**: Sentiment Analysis - NLP, Text Embedding

Creator: [Data Science for Managers - EPFL Program](https://www.dsfm.ch)  
Source:  [https://github.com/dsfm-org/code-bank.git](https://github.com/dsfm-org/code-bank.git)  
License: [MIT License](https://opensource.org/licenses/MIT). See open source [license](LICENSE) in the Code Bank repository. 

-------------

## Overview

Sentiment analysis is a challenging subject in machine learning. People express their emotions in ways that can be ambiguous to both humans and computers. In this demo, we analyze the sentiment of a set of IMDB movie reviews. The dataset consists of review texts and a binary sentiment label (1: positive, 0: negative). 

<img src="http://barnraisersllc.com/wp-content/uploads/2017/01/Sentiment-Analysis.jpg" width="500" height="500" align="center"/>

Image source: http://barnraisersllc.com/wp-content/uploads/2017/01/Sentiment-Analysis.jpg

Dataset source: *Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). "Learning Word Vectors for Sentiment Analysis." The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).*

-------------

## **Part 0**: Setup

In [None]:
# Put all import statements at the top of your notebook

# Standard imports
import pandas as pd
import numpy  as np
from bs4 import BeautifulSoup 
import re

# Data science packages
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.model_selection         import KFold, GridSearchCV, cross_val_score
from sklearn.ensemble                import RandomForestClassifier
from sklearn.pipeline                import Pipeline
from sklearn.model_selection         import train_test_split

# Neural networks
from tensorflow.keras.models                 import Sequential, load_model
from tensorflow.keras.layers                 import Dense, Embedding, LSTM, SpatialDropout1D
from tensorflow.keras.preprocessing.text     import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils                  import model_to_dot

# Text processing packages
import nltk
import nltk.data
from nltk.corpus import stopwords 
from nltk.stem   import SnowballStemmer
from gensim.models import doc2vec, word2vec

# Visualization packages
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from IPython.display import SVG

import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')

%matplotlib inline


In [None]:
# Set constants 

# Set a seed for replication
SEED = 10

# Set performance metric
SCORE = 'accuracy'

In [None]:
# Nested cross validation helper function
def nested_cv(X, y, est_pipe, p_grid, p_score, n_splits_inner = 3, n_splits_outer = 3, n_cores = 1, seed = 0):

    # Cross-validation schema for inner and outer loops (shuffled KFold; note that it is not stratified)
    inner_cv = KFold(n_splits = n_splits_inner, shuffle = True, random_state = seed)
    outer_cv = KFold(n_splits = n_splits_outer, shuffle = True, random_state = seed)
    
    # Grid search to tune hyper parameters
    est = GridSearchCV(estimator = est_pipe, param_grid = p_grid, cv = inner_cv, scoring = p_score, n_jobs = n_cores)

    # Nested CV with parameter optimization
    nested_scores = cross_val_score(estimator = est, X = X, y = y, cv = outer_cv, scoring = p_score, n_jobs = n_cores)
    
    print('Average score: %0.4f (+/- %0.4f)' % (nested_scores.mean(), nested_scores.std() * 1.96))
    
    return nested_scores.mean(), nested_scores.std() * 1.96

# Define a function to split a review into clean list of words
def review_to_wordlist(review):
    
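    # Name the parser explicitly so the result does not depend on which parsers are installed
    review_text = BeautifulSoup(review, "html.parser").get_text()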
   
    review_text = re.sub("[^a-zA-Z]"," ", review_text)
    
    words = review_text.lower().split()
    
    stops = set(stopwords.words("english"))
    words = [w for w in words if w not in stops]
    
    stemmer = SnowballStemmer('english')
    words = [stemmer.stem(w) for w in words]
    
    return words

# Define a function to split a review into parsed sentences, where each sentence is a word list
def review_to_sentences(review, tokenizer):
    
    raw_sentences = tokenizer.tokenize(review.strip())  
    sentences = []
    for raw_sentence in raw_sentences:      
        if len(raw_sentence) > 0:           
            sentences.append( review_to_wordlist( raw_sentence ))
   
    return sentences

# Function to average all of the word vectors in a given paragraph
def makeFeatureVec(words, model, num_features):
    
    # Pre-initialize an empty numpy array (for speed)
    featureVec = np.zeros((num_features,), dtype='float32')
    nwords = 0.
     
    # Index2word is a list that contains the names of the words in 
    # the model's vocabulary. convert it to a set, for speed 
    index2word_set = set(model.wv.index2word)
    
    # Loop over each word in the review and, if it is in the model's
    # vocabulary, add its feature vector to the total
    for word in words:
        if word in index2word_set: 
            nwords = nwords + 1.
            featureVec = np.add(featureVec, model.wv[word])
    
    # Divide the result by the number of words to get the average
    # (skip the division if no word was in the vocabulary, to avoid NaNs)
    if nwords > 0:
        featureVec = np.divide(featureVec, nwords)
    return featureVec

# Given a set of reviews (each one a list of words), calculate 
# the average feature vector for each one and return a 2D numpy array
def getAvgFeatureVecs(reviews, model, num_features):
    
    # Initialize a counter
    counter = 0
    
    # Preallocate a 2D numpy array, for speed
    reviewFeatureVecs = np.zeros((len(reviews),num_features),dtype='float32')
     
    # Loop through the reviews
    for review in reviews:
       
        # Print a status message every 1000th review
        if counter % 1000 == 0:
            print('Review %d of %d' % (counter, len(reviews)))
       
        # Call the function (defined above) that makes average feature vectors
        reviewFeatureVecs[counter] = makeFeatureVec(review, model, num_features)
       
        # Increment the counter
        counter = counter + 1
        
    return reviewFeatureVecs

## **Part 1**: EDA

In the first part, we load the `sentiment_data.tsv` dataset. The dataset consists of `id`, `sentiment`, and `review` columns.

Hint: This is a *tab*-separated file, but we can still use pandas' `read_csv` function. 
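
As a minimal sketch of the hint above, the snippet below reads tab-separated text with `read_csv` by passing `sep="\t"`. The two-row sample and the names `sample_tsv` and `sample` are made up for illustration; they stand in for the real `sentiment_data.tsv` file.

```python
import io
import pandas as pd

# Hypothetical two-row sample with the same columns as the dataset
sample_tsv = (
    "id\tsentiment\treview\n"
    "1_a\t1\tGreat movie, I loved it.\n"
    "2_b\t0\tDull plot and weak acting.\n"
)

# sep="\t" tells read_csv to split fields on tabs instead of commas
sample = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

print(sample.shape)             # (rows, columns)
print(sample.columns.tolist())
```

For the exercise itself, the same call would take the path to the file instead of the `io.StringIO` object.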


**Q 1**: Load the data. What shape does it have?

**Q 2**: Extract the observation at index 0. How does the review text look?

**Q 3**: Plot the distribution of the target and plot a word cloud.

**Q 4**: Check for missing values and drop rows containing missing values (if any).
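
One possible way to inspect the target and handle missing values is sketched below on a toy frame. The frame `toy` and its contents are assumptions made for illustration; only the column names follow the dataset description.

```python
import pandas as pd

# Hypothetical toy frame with the same columns as the dataset
toy = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "sentiment": [1, 0, 1, 0],
    "review": ["Loved it", None, "Great cast", "Too long"],
})

# Class balance of the target (counts per label)
class_counts = toy["sentiment"].value_counts()

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# Drop every row that contains at least one missing value
toy_complete = toy.dropna().reset_index(drop=True)

print(class_counts)
print(missing_per_column)
print(len(toy), "->", len(toy_complete))
```

Calling `class_counts.plot(kind="bar")` would turn the counts into a simple bar chart of the target distribution.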

## **Part 2**: Text pre-processing

In the second part, we pre-process the raw review text in the following steps:

1. Remove HTML tags
2. Keep only alphabetical terms
3. Lowercase all words
4. Remove stop words
5. Stem words

IMPORTANT: Use the review at index 0 for each step to see how it changes. We will pre-process all reviews in the next step.
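
The five steps above could look roughly as follows on a single made-up review. This is a sketch, not the reference solution: `raw_review` is invented, and `toy_stops` is a tiny hand-written stop list used here as an assumption so the snippet runs without downloading NLTK data (the exercise itself uses NLTK's English stop words).

```python
import re
from bs4 import BeautifulSoup
from nltk.stem import SnowballStemmer

raw_review = "Great movie!<br /><br />I loved the acting and the story."

# Assumption: a tiny hand-written stop list standing in for NLTK's English list
toy_stops = {"i", "the", "and"}

# 1. Remove HTML tags
text = BeautifulSoup(raw_review, "html.parser").get_text()

# 2. Keep only alphabetical characters (everything else becomes a space)
letters_only = re.sub("[^a-zA-Z]", " ", text)

# 3. Lowercase and split on whitespace
tokens = letters_only.lower().split()

# 4. Remove stop words
kept = [w for w in tokens if w not in toy_stops]

# 5. Stem the remaining words
stemmer = SnowballStemmer("english")
stems = [stemmer.stem(w) for w in kept]

print(tokens)
print(kept)
print(stems)
```

Printing the intermediate lists makes it easy to see what each step removes or changes.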

**Q 1**: Remove HTML tags with the `BeautifulSoup` package.

**Q 2**: Keep only alphabetical terms/words.

**Q 3**: Lowercase all words and separate text by blank spaces.

**Q 4**: Remove stop words.

**Q 5**: Stem all words.

**Q 6**: Consolidate all steps into a single function called `review_to_words`.

**Q 7**: Pre-process all reviews with the new function.

## **Part 3**: Feature Extraction using TFIDF

In this short part, we transform the cleaned text into TFIDF representations.

**Q 1**: Transform cleaned reviews to TFIDF using the 5000 most frequent words in the corpus. What shape will the new representation have?

**Q 2**: Print the TFIDF values of the first 10 words in the vocabulary.
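
As a small illustration of the transformation, the sketch below applies `TfidfVectorizer` to three invented, already-cleaned reviews. The corpus `toy_clean_reviews` and the choice of `max_features=3` are assumptions for the toy case; the exercise asks for 5000.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical cleaned reviews (stemmed words joined by spaces)
toy_clean_reviews = ["good movi", "bad movi", "good act good plot"]

# max_features keeps only the most frequent terms across the corpus
vectorizer = TfidfVectorizer(max_features=3)
tfidf = vectorizer.fit_transform(toy_clean_reviews)

# One row per review, one column per kept term
print(tfidf.shape)

# Mapping from each kept term to its column index
print(vectorizer.vocabulary_)

# Dense view of the TFIDF weights of the first review
print(tfidf[0].toarray())
```

The result is a sparse matrix, so the number of columns is capped by `max_features` no matter how large the full vocabulary is.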

## **Part 4**: Binary classification using Random Forest with TFIDF Features

In this part, we will run a binary classification using Random Forest and the TFIDF features to predict the binary sentiment label.

**Q 1**: Extract the target vector from the data.

**Q 2**: Fit a `RandomForestClassifier` to our data and test different parameter values for `n_estimators` (e.g. from 10 to 50). 

What nested cross-validated accuracy do you get? How long does it take to test the different parameter values?

Hint 1: Use the `%%time` magic command at the beginning of the cell for timing.

Hint 2: Use the `nested_cv` helper function defined at the very top of this notebook. 
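
The idea behind nested cross-validation can be sketched on synthetic data: an inner loop (`GridSearchCV`) tunes `n_estimators`, and an outer loop (`cross_val_score`) scores the tuned model on folds it has not seen. The data from `make_classification`, the step name `rf`, and the small grid are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the TFIDF features and the sentiment labels
X_toy, y_toy = make_classification(n_samples=120, n_features=10, random_state=10)

# Pipeline with a single step named 'rf' (hypothetical name)
pipe = Pipeline([("rf", RandomForestClassifier(random_state=10))])

# Grid keys follow the '<step name>__<parameter>' convention
grid = {"rf__n_estimators": [10, 20]}

inner_cv = KFold(n_splits=3, shuffle=True, random_state=10)
outer_cv = KFold(n_splits=3, shuffle=True, random_state=10)

# Inner loop: pick the best n_estimators on each outer training fold
search = GridSearchCV(pipe, param_grid=grid, cv=inner_cv, scoring="accuracy")

# Outer loop: estimate the accuracy of the whole tuning procedure
outer_scores = cross_val_score(search, X_toy, y_toy, cv=outer_cv, scoring="accuracy")

print("Mean accuracy: %0.4f (std %0.4f)" % (outer_scores.mean(), outer_scores.std()))
```

A pipeline and a grid built this way can be passed to `nested_cv` as `est_pipe` and `p_grid`.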

## **Part 5**: Word Vectors - Word2Vec

A popular vectorization method for words is a technique known as Word2Vec, which is implemented in the `gensim` library. 

In Word2Vec, each word is assigned a low-dimensional vector, learned under the assumption that words appearing close to each other in a document are semantically related. These word vectors can then serve as a basis for representing whole documents in a low-dimensional space. You can read more about it here: https://cs224d.stanford.edu/lectures/CS224d-Lecture2.pdf.

**Q 1**: Extract all sentences using the `punkt` tokenizer in the `nltk` package. How many sentences are in the corpus?

Hint: Use the `review_to_sentences` helper function defined at the very top of this notebook to convert reviews to sentences.

**Q 2**: Train the word vectors using the `Word2Vec` function in the `gensim` package. Generate word vectors with 300 dimensions.

**Q 3**: Investigate how words relate to each other using the `doesnt_match` and `most_similar` functions on the trained Word2Vec model. What's the most similar word to "man"?

### Paragraph vectors - Average Word Vectors

Even though the word vectors show good semantic properties, using them to get the same sort of properties for sentences is not straightforward. The simplest solution is to average the word vectors of a document, which yields a paragraph vector with the same dimensionality as the word vectors.

Hint 1: Use the `review_to_wordlist` helper function defined at the very top of this notebook. 

Hint 2: Use the `getAvgFeatureVecs` helper function defined at the very top of this notebook to average word vectors.
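
The averaging idea can be illustrated without a trained model. In the sketch below, `toy_vectors` is a hypothetical dictionary of 3-dimensional word vectors and `average_vector` is an illustrative helper, not part of the notebook; words missing from the vocabulary are simply skipped.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors
toy_vectors = {
    "good": np.array([1.0, 0.0, 2.0]),
    "movi": np.array([3.0, 2.0, 0.0]),
}

def average_vector(words, vectors, num_features):
    """Average the vectors of all words that are in the vocabulary."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        # No known word: return zeros rather than dividing by zero
        return np.zeros(num_features, dtype="float32")
    return np.mean(known, axis=0).astype("float32")

# 'unseen' is not in the vocabulary, so only two vectors are averaged
review_vec = average_vector(["good", "movi", "unseen"], toy_vectors, 3)
print(review_vec)  # [2. 1. 1.]
```

The resulting vector has the same length as each word vector, so every review maps to a fixed-size feature row regardless of its length.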

**Q 4**: Fit a `RandomForestClassifier` to our data and test different parameter values for `n_estimators` (e.g. from 10 to 50). 

What nested cross-validated accuracy do you get? How long does it take to test the different parameter values?

## **Part 6**: Document Vectors - Doc2Vec

Another way to obtain review vectors is to learn them directly. Paragraph Vectors treat each document as if it were a word itself and learn a vector for it directly. You can read more about it here: https://cs.stanford.edu/~quocle/paragraph_vector.pdf . Again, the `gensim` library implements this method, in the `Doc2Vec` class of its `doc2vec` module.

The first two cells below take care of some "housekeeping" to prepare the data. Make sure the variable names match your own variable names.

In [None]:
# Doc2Vec needs each review to be tagged with some sort of ids
# Here we tag each review with the 'id' field

tagged_clean_data_reviews = []
for uid, review in zip(data['id'], clean_data_reviews):
    tagged_clean_data_reviews.append(doc2vec.TaggedDocument(words=review, tags=['%s' % uid[1:-1]]))

In [None]:
# An example of tagging

tagged_clean_data_reviews[0]

**Q 1**: Train the document vectors using the `Doc2Vec` function in the `gensim` package. Generate document vectors with 300 dimensions.

**Q 2**: Investigate how documents relate to each other using the `docvecs.similarity` function on the trained Doc2Vec model. How similar are documents d1 = '7759_3' and d2 = '5814_8'?

**Q 3**: Fit a `RandomForestClassifier` to our data and test different parameter values for `n_estimators` (e.g. from 10 to 50). Hint: use the `linspace` function in the `numpy` package to generate evenly spaced numbers in a given interval.

What nested cross-validated accuracy do you get? How long does it take to test the different parameter values?
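
As a sketch of the `linspace` hint, the snippet below generates five evenly spaced integers between 10 and 50 and places them in a parameter grid. The step name `rf` in the grid key is an assumption; it has to match the name used in your own pipeline.

```python
import numpy as np

# Five evenly spaced values from 10 to 50 (both ends included), cast to integers
n_estimators_values = np.linspace(10, 50, 5, dtype=int).tolist()
print(n_estimators_values)  # [10, 20, 30, 40, 50]

# Hypothetical grid for a pipeline whose random forest step is named 'rf'
p_grid = {"rf__n_estimators": n_estimators_values}
print(p_grid)
```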

## **SUMMARY OF ACCURACY SCORES**

In [None]:
width       = 50
models      = ['Baseline', 'Random Forest', 'Random Forest + word2vec', 'Random Forest + doc2vec']
result_acc  = [0.5, acc_rf, acc_rfW2V, acc_rfD2V]
result_std  = [0,   std_rf, std_rfW2V, std_rfD2V]

print('', '=' * width, '\n', 'Summary of Accuracy Scores'.center(width), '\n', '=' * width)  
for i in range(len(models)):
    print(models[i].center(width-18), '{0:.4f}'.format(result_acc[i]), '+/-{0:.4f}'.format(result_std[i]))