# Homework: Sentiment analysis of product reviews (Part 1)


In this notebook you will explore logistic regression and feature engineering with scikit-learn functions. You will use product review data from Amazon to predict whether the sentiment about a product, inferred from its review, is positive ($+1$) or negative ($-1$). 

More sophisticated representations exist, such as tf-idf, but for simplicity we'll use a bag-of-words representation as our feature matrix. If you need a refresher, feature extraction on text data (including tf-idf) was covered in the first module of the course.

Your job is to do the following:
* Perform some basic feature engineering to deal with text data
* Use scikit-learn to create a vocabulary for a corpus of reviews
* Create a bag of words representation based on the vocabulary shared by the reviews
* Train a logistic regression model to predict the sentiment of product reviews.
* Inspect the weights (coefficients) of a trained logistic regression model.
* Make a prediction (both class and probability) of sentiment for a new product review.
* Given the computed weights, predictors and ground truth labels, write a function to compute the accuracy of the model.
* Inspect the coefficients of the logistic regression model and interpret their meanings.
* Compare multiple logistic regression models.

As usual, we import a few libraries we need. Later, you'll need to import more.

In [None]:
import pandas as pd
import numpy as np
import math
import string
import warnings
warnings.filterwarnings("ignore")
# use WIDER CANVAS:
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

## Data preparation

We will use a dataset consisting of Amazon baby product reviews.

In [None]:
products = pd.read_csv('~/data/amazon_baby.gz')
print('We will work with',len(products),'reviews of baby products')

Let's take a peek at the data.

In [None]:
products

Let's examine the second review. In pandas you can access rows by integer position with `iloc`; positions start at 0.

In [None]:
an_entry=1
for col in products.columns:
    print(f'{col.upper()}: {products.iloc[an_entry][col]}')

## Build word count vector for each review

First, we perform two simple data transformations:

1. Remove punctuation using [Python's built-in](https://docs.python.org/2/library/string.html) string functionality.
2. Transform the reviews into a bag-of-words representation using scikit-learn's `CountVectorizer`.

*Note*. For the sake of simplicity, we replace all punctuation symbols (e.g., !, &, :, etc.) by blanks. A better approach would preserve composite words such as "would've", "hasn't", etc. If interested, see [this page](https://www.nltk.org/_modules/nltk/tokenize/treebank.html)
for scripts with better ways of handling punctuation.

Make sure to look up the details for `maketrans`, a method of the `str` class. 
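
As a minimal sketch of how `str.maketrans` and `str.translate` work together, the snippet below maps every punctuation symbol to a blank in a made-up sentence. The sample text and the variable names (`table`, `sample`, `cleaned`) are illustrative only and are not part of the assignment.

```python
import string

# Build a translation table that maps each punctuation symbol to a blank.
table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

sample = "Wow! It's great: soft & warm."   # made-up text, not from the dataset
cleaned = sample.translate(table)
print(cleaned)
```

Note that the apostrophe in "It's" is also replaced, which splits the word in two; this is the limitation mentioned in the note above.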

In [None]:
# These are the symbols we will replace
print(string.punctuation)

***Question 1.*** Complete a function `remove_punctuation(text)` to replace punctuation symbols by blanks in its `text` argument.

In [None]:
# YOUR CODE HERE
import string 
def remove_punctuation(text):
    ...

Let's test your function on the sample review displayed earlier, but first, we create a clean corpus of reviews without punctuation.

In [None]:
review_no_punctuation = products['review'].apply(remove_punctuation)
print(review_no_punctuation[an_entry])

## Create the feature matrix X
We need a feature matrix with one row for each review. Each row uses a bag-of-words representation over a vocabulary built from the entire corpus of reviews. This task can be easily carried out using the `CountVectorizer` class of sklearn.

The vectorizer works by creating a vocabulary (set of words) from a corpus, tokenizing the words in the vocabulary (assigning a unique integer to each word), and creating a bag-of-words representation for each document (review) in the corpus. The integers assigned to the words in the vocabulary become positions in a feature vector that counts the number of occurrences of each particular word. Since in practice feature vectors are huge and mostly zeros, each row is stored as a compressed sparse row matrix (`csr_matrix`); see scipy's Compressed Sparse Row matrix for more information.

***Question 2.***
- Create an instance `cv` of the CountVectorizer class that can remove three types of words: *stop words* (listed below), words that appear in only one review, and words that appear in more than 60% of the reviews.
- Using the `fit` function, tokenize the words that were not removed from the clean corpus of reviews. As a side effect, this step creates a dictionary (`vocabulary_`) that maps words to integer positions.
- Create a bag-of-words csr_matrix (feature vector) for each review and store it in an additional column `word_count_vec` of the products dataframe.

Use the following list of stop words: ['you','he','she','they','an','the','and','or','in','on','at','is','was','were','am','are'].
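
To make the three filters concrete, here is a small sketch on a made-up four-document corpus. The corpus, the shortened stop-word list, and the names `toy_docs`, `toy_cv`, and `toy_X` are assumptions for illustration; the parameters `stop_words`, `min_df`, and `max_df` are the `CountVectorizer` options that control the three kinds of removal.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up mini corpus; the real notebook uses the cleaned Amazon reviews.
toy_docs = [
    "baby crib is great",
    "great baby stroller great price",
    "poor quality crib",
    "the baby bottle leaks",
]

toy_cv = CountVectorizer(
    stop_words=['the', 'is'],  # words to drop outright
    min_df=2,                  # keep only words found in at least 2 documents
    max_df=0.6,                # drop words found in more than 60% of documents
)
toy_cv.fit(toy_docs)
print(toy_cv.vocabulary_)      # maps each surviving word to a column index

toy_X = toy_cv.transform(toy_docs)   # sparse matrix, one row per document
print(type(toy_X), toy_X.shape)
print(toy_X.toarray())
```

Here "baby" is dropped because it occurs in three of the four documents, and every word that occurs in a single document is dropped as well, so only "crib" and "great" remain.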

In [None]:
# YOUR CODE HERE
from sklearn.feature_extraction.text import CountVectorizer
...
products['word_count_vec'] = ...

***Question 3.*** How big are the feature vectors? This, of course, is the same for all samples. What are the feature vector locations of the words 'great' and 'poor'? *Hint.* The vocabulary is a Python dictionary. 

In [None]:
# YOUR CODE HERE

***Question 4.*** Print the review of the 28th entry in `products` (remember 0-indexing!). Write code to answer the following questions:
- How many distinct words from the dictionary appear in the cleaned review? 
- How many times does the word 'book' appear in the review? 

In [None]:
# YOUR CODE HERE

## Extract sentiments

We will **ignore** all reviews with *rating = 3*, under the assumption that they usually express a neutral sentiment.

In [None]:
products = products[products['rating'] != 3]
print(f'We are left with {len(products)} reviews with strong sentiment')

***Question 5.*** Consider reviews with a rating of 4 or higher to be *positive* reviews, and ones with rating of 2 or lower to be *negative*. Create a sentiment column, using $+1$ for the positive class label and $-1$ for the negative class label.
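
One possible way to turn ratings into $\pm 1$ labels is a vectorized condition, sketched below on a made-up frame. The frame `toy` and its ratings are hypothetical, and `np.where` is only one of several ways to do this.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'rating': [5, 4, 2, 1]})   # made-up ratings, no 3s left

# +1 where the rating is 4 or higher, -1 otherwise.
toy['sentiment'] = np.where(toy['rating'] >= 4, 1, -1)
print(toy)
```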

In [None]:
# YOUR CODE HERE

The dataset now contains an extra column called **sentiment** which is either positive (+1) or negative (-1).

In [None]:
# Let's take a look at the new column
products[['name','rating','sentiment']]

## Split data into training and test sets

Let's perform an 80-20 train/test split of the data. We'll use `random_state=0` so that everyone gets the same result.

In [None]:
len(products)

In [None]:
train_data = products.sample(frac=.8, random_state=0)
test_data = products.drop(train_data.index)
print(f'We will use N={len(train_data)} training samples')
print(f'and {len(test_data)} testing samples')
print(f'Total samples: {len(products)}')

## Train a sentiment classifier with logistic regression

We will now use logistic regression to create a sentiment classifier on the training data. This model will use the column **word_count_vec** as a feature and the column **sentiment** as the target.

***Question 6.*** Create a logistic regression model called `sentiment_model` with scikit-learn (similar to the one in the class demo) with $L_2$-regularization and $C=100$ (in scikit-learn, `C` is the inverse of the regularization strength). You will need to extract a feature matrix `X_train` and a vector of true labels `y_train` from your training data. To create `X_train`, stack the bag-of-words rows into a single matrix (you may want to check the function `vstack` in scipy).

*Note:* This may take a while on a big training set.
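
The stacking step can be sketched on toy data as follows. The four sparse rows, their hypothetical column meanings, and the names `rows`, `labels`, `X_toy`, and `toy_model` are made up for illustration; a plain list stands in for the `word_count_vec` column.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.linear_model import LogisticRegression

# Four made-up 1 x 3 sparse rows standing in for per-review count vectors.
# Hypothetical columns: counts of 'great', 'love', 'poor'.
rows = [
    sp.csr_matrix([[1, 1, 0]]),
    sp.csr_matrix([[2, 0, 0]]),
    sp.csr_matrix([[0, 0, 1]]),
    sp.csr_matrix([[0, 0, 2]]),
]
labels = np.array([1, 1, -1, -1])

# Stack the rows into a single N x d sparse matrix.
X_toy = sp.vstack(rows, format='csr')
print(X_toy.shape)

# L2 is scikit-learn's default penalty; C is the inverse regularization strength.
toy_model = LogisticRegression(C=100)
toy_model.fit(X_toy, labels)
print(toy_model.predict(X_toy))
```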

In [None]:
# YOUR CODE HERE
from sklearn.linear_model import LogisticRegression

X_train should now be a *compressed* feature matrix of size $N\times d$, where $d$ is the size of the vocabulary. Let's check it out.

In [None]:
print(type(X_train))
print(X_train.shape)

Now that we have fitted the model, we can extract the weights (coefficients) as a dictionary as follows:

***Question 7.*** Extract the weights of the words and store them in a dictionary `word_coef` that maps feature names (words) to coefficients. *Hint.* You can get the feature names using your vectorizer.

In [None]:
# YOUR CODE HERE
# Create an empty dictionary to store the word to coefficient mapping
word_coef = {}
...

There are thousands of coefficients in the model. Recall from the lecture that positive weights $w_j$ correspond to favorable reviews, while negative weights correspond to negative ones. 

Let's examine the coefficients of a few features as a sanity check. Did you get what you expected?

***Question 8.*** Find the coefficients of the following words: 'awesome', 'good', 'great', 'awful', 'terrible', 'poor'. How many words got a coefficient whose sign is inconsistent with the word's meaning?

In [None]:
# YOUR CODE HERE

***Question 9.*** Fill in the following block of code to compute the number `num_pos_weights` of positive (>0) weights and the number `num_neg_weights` of negative (<=0) weights. Print both counts and verify that they add to the total number of coefficients you computed earlier in the notebook.

In [None]:
# YOUR CODE HERE

## Making predictions with logistic regression

Now that a model is trained, we can make predictions on the **test data**. In this section, we will explore this in the context of 8 examples in the test dataset.  We refer to this set of examples as the **sample_test_data**.

In [None]:
sample_test_data = test_data.iloc[90:98]
sample_test_data

Let's dig deeper into the rows of the **sample_test_data**. Here are the ratings and reviews.

In [None]:
for i in range(len(sample_test_data)):
    print('\nrating:',sample_test_data.iloc[i]['rating'])
    print('review:\n',sample_test_data.iloc[i]['review'])

We will now make a **class** prediction for our **sample_test_data**. We hope that `sentiment_model` predicts **+1** if the true sentiment is positive and **-1** if the true sentiment is negative. Recall from lecture that the score $z$ for the logistic regression model  is defined as:
$$z_i = \mathbf{w}\cdot \mathbf{x}_i$$ 

where $\mathbf{x}_i$ represents the features (word counts) for sample $i$ and the corresponding score is a number in the range $(-\infty,\infty)$. We will write some code to obtain the **scores** using sklearn.

***Question 10.*** Using your model's `decision_function`, compute the **scores** (these are the $z$-values) of the reviews in `sample_test_data` and print the true sentiment. How many scores are compatible with the true sentiment values? Interpret your findings.
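
The sketch below, on made-up dense data, shows what `decision_function` returns. scikit-learn fits an intercept by default, so its score is $\mathbf{w}\cdot\mathbf{x}_i$ plus the intercept; this agrees with the formula above if the intercept is treated as a weight on a constant feature (an assumption about the lecture's convention). The `toy` names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X_toy = np.array([[2.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 2.0]])
y_toy = np.array([1, 1, -1, -1])
toy_model = LogisticRegression(C=100).fit(X_toy, y_toy)

scores_sklearn = toy_model.decision_function(X_toy)
# The same scores computed by hand: w . x plus the fitted intercept.
scores_manual = X_toy @ toy_model.coef_[0] + toy_model.intercept_[0]

print(scores_sklearn)
print(np.allclose(scores_sklearn, scores_manual))
```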

In [None]:
# YOUR CODE HERE
scores = []
...

### Predicting sentiment

These scores can now be used to make class predictions, as follows:

$$
\hat{y} = 
\left\{
\begin{array}{ll}
      +1 & \mathbf{w} \cdot \mathbf{x}_i > 0 \\
      -1 & \mathbf{w}\cdot \mathbf{x}_i \leq 0 \\
\end{array} 
\right.
$$

***Question 11.*** Using `scores`, write Python code to compute and print $\hat{y}$, the class predictions:

In [None]:
# YOUR CODE HERE

*Sanity check*. Run the following code to check whether the class predictions obtained using your code are the same as those produced by sklearn.

In [None]:
import scipy.sparse as sp  # provides vstack for stacking the sparse rows
print("Class predictions according to sklearn:")
print(sentiment_model.predict(sp.vstack(sample_test_data['word_count_vec'].values)))

### Probability predictions

Recall from the lectures that we can also calculate the probabilities that $y=+1$ from the scores using:
$$
\mbox{Pr}(y_i = +1 | \mathbf{x}_i,\mathbf{w}) = \frac{1}{1+e^{-\mathbf{w}\cdot\mathbf{x}_i}}
$$

***Question 12.*** Using the variable `scores` computed previously, write a single line of code to estimate the probability that a sentiment is positive using the above formula. Print the results. For each row, the probabilities should be a number in $[0, 1]$. Of the eight data points in **sample_test_data**, which one, classified as positive, has the *lowest probability* of being classified as a positive review? Was this prediction correct?
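
Both the class rule and the probability formula can be tried on a few made-up scores, as in this sketch. The array `toy_scores` and its values are hypothetical.

```python
import numpy as np

toy_scores = np.array([-2.0, 0.0, 0.5, 3.0])   # made-up z-values

# Class rule: +1 when the score is strictly positive, otherwise -1.
toy_classes = np.where(toy_scores > 0, 1, -1)

# Sigmoid link: probability that y = +1 given the score.
toy_probs = 1 / (1 + np.exp(-toy_scores))

print(toy_classes)
print(toy_probs.round(3))
```

A score of exactly 0 gives a probability of 0.5 and, under the rule above, the class $-1$.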

In [None]:
# YOUR CODE HERE

***Question 13.*** Now compute estimated probabilities with sklearn by using the function `predict_proba` on your model.

*Sanity check*: Make sure your probability predictions match the ones obtained from sklearn.

In [None]:
# YOUR CODE HERE

### Find the most positive and most negative reviews

We now turn to examining the test dataset, **test_data** and, for faster performance, use sklearn to form predictions on all of the test data points.

***Question 14.*** Using the `sentiment_model`, find the 20 reviews in the entire **test_data** with the **highest probability** of being classified as a **positive review**. We refer to these as the "most positive reviews." Recall that you can make probability predictions by using `.predict_proba` and you can select the $n$ largest values of a frame with the function `nlargest`.
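
For reference, this is how `nlargest` behaves on a made-up frame; the column name `prob_positive` and the values are hypothetical. Its counterpart `nsmallest` selects from the other end.

```python
import pandas as pd

toy = pd.DataFrame({
    'name': ['a', 'b', 'c', 'd'],
    'prob_positive': [0.91, 0.15, 0.99, 0.60],   # made-up probabilities
})

# Rows with the two largest and the two smallest values in the chosen column.
print(toy.nlargest(2, 'prob_positive'))
print(toy.nsmallest(2, 'prob_positive'))
```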

In [None]:
# YOUR CODE HERE

***Question 15.***
Now, repeat this exercise to find the 20 "most negative reviews" in the test data. Recall that a review is considered negative if it has a low probability of being positive.

In [None]:
# YOUR CODE HERE

## Compute accuracy of the classifier

We will now evaluate the accuracy of the trained classifier. Recall that the accuracy is given by


$$
\mbox{accuracy} = \frac{\mbox{# correctly classified examples}}{\mbox{# total examples}}
$$

***Question 16.*** Write a function `get_classification_accuracy` that takes in a model, a dataset, the true labels of the data set, and returns the accuracy of the model measured on the given data set.

You will need to:
1. Use the trained model to compute class predictions (you can use the `predict` method)
2. Count the number of data points when the predicted class labels match the true labels (the ground truth).
3. Divide the total number of correct predictions by the total number of data points in the dataset.
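
The counting and dividing steps can be sketched on two made-up label arrays (the arrays and variable names are hypothetical):

```python
import numpy as np

toy_true = np.array([1, -1, 1, 1, -1])   # made-up ground truth
toy_pred = np.array([1, -1, -1, 1, 1])   # made-up predictions

# Count the positions where prediction and truth agree, then divide by the total.
num_correct = np.sum(toy_pred == toy_true)
toy_accuracy = num_correct / len(toy_true)
print(toy_accuracy)
```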

Complete the function below to compute the classification accuracy:

In [None]:
# YOUR CODE HERE
def get_classification_accuracy(model, data, true_labels):
    ...

Now, let's check the accuracy of our sentiment_model.

***Question 17.*** What is the accuracy of the **sentiment_model** on the **train_data** and on the **test_data**? Round your answer to 4 decimal places. What does this tell you about the quality of your model?

In [None]:
# YOUR CODE HERE

There are lots of words in the model we trained above. We want to determine which ones are the most important.

***Question 18.*** Write code to find the 10 most positive and the 10 most negative weights in our learned model. Print both the feature name and the corresponding weight.

In [None]:
# YOUR CODE HERE

## Learn another classifier with fewer words

We will now train a simpler logistic regression model using only a subset of words that occur in the reviews. For this portion of the assignment, we selected 18 words to work with. These `significant_words` are shown in the cell below.

In [None]:
significant_words = ['love', 'great', 'easy', 'old', 'little', 'perfect', 'loves','wonderfully','lifesaver',
      'well', 'broke', 'less', 'waste', 'disappointed', 'unusable',
      'work', 'money', 'return']

***Question 19.*** Create a new instance of `CountVectorizer` that builds a feature vector from a given string by counting occurrences of our significant words. First, build the small vectorizer by passing the vocabulary to the `CountVectorizer` constructor. Then, transform each review into a bag-of-words using this new vectorizer, placing the new vectors in the column **word_count_subset_vec** of `products`.
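
A minimal sketch of a fixed-vocabulary vectorizer, using a made-up three-word vocabulary and two made-up strings (all `toy_` names are hypothetical):

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_vocab = ['love', 'great', 'broke']   # illustrative subset only
toy_cv = CountVectorizer(vocabulary=toy_vocab)

# With a fixed vocabulary no fitting is needed; columns follow the list order.
toy_X = toy_cv.transform(["great crib we love it great value", "it broke"])
print(toy_X.toarray())
```

Words outside the given vocabulary are simply ignored, so each row has exactly one column per listed word.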

In [None]:
# YOUR CODE HERE
...

Add the new column to our training and testing DataFrames.

In [None]:
train_data['word_count_subset_vec'] = products['word_count_subset_vec']
test_data['word_count_subset_vec'] = products['word_count_subset_vec']

Let's see what an example of the training dataset looks like:

In [None]:
an_entry=4
print(train_data.iloc[an_entry]['review'])
train_data.iloc[an_entry]

Since we are only working with a subset of the available words, only a few of the `significant_words` will be present in this review.

## Train a logistic regression model on a subset of words

***Question 20.*** Build a classifier called `simple_model` with **word_count_subset_vec** as the feature and **sentiment** as the target, using the same training parameters as for the full model.

In [None]:
# YOUR CODE HERE
from sklearn.linear_model import LogisticRegression
...

Now, we will inspect the weights (coefficients) of the **simple_model**.

***Question 21.*** Just as you did in **Question 7**, extract the weights of the words and store them in a dictionary `word_coef` that maps feature names (words) to coefficients. Print the words to coefficient mapping, sorting the coefficients (in descending order) by the **value** to obtain the coefficients with the most positive effect on the sentiment.

In [None]:
# YOUR CODE HERE
...

***Question 22.*** Consider the coefficients of **simple_model**. There should be 19 of them, an intercept term + one for each word in **significant_words**. How many of the coefficients (corresponding to the **significant_words** and *excluding the intercept term*) are positive for the `simple_model`? Write a single line of code to compute the answer (do not compute it by hand!)

In [None]:
# YOUR CODE HERE
...

***Question 23.*** Are the positive words in the **simple_model** (let us call them `positive_significant_words`) also positive words in the **sentiment_model**?

In [None]:
# YOUR CODE HERE
...

## Comparing models

We will now compare the accuracy of the **sentiment_model** and the **simple_model** using the `get_classification_accuracy` function you implemented above.

***Question 24.*** Compute the classification accuracy of the **sentiment_model** and of the **simple_model** on the **train_data**. Which model (**sentiment_model** or **simple_model**) has higher accuracy on the training set?

In [None]:
# YOUR CODE HERE
...

Now, we will repeat this exercise on the **test_data**.

***Question 25.*** Compute the classification accuracy of the **sentiment_model** and of the **simple_model** on the **test_data**. Which model (**sentiment_model** or **simple_model**) has higher accuracy on the testing set?

In [None]:
# YOUR CODE HERE
...

## Baseline: Majority class prediction

It is quite common to use the **majority class classifier** as a baseline (or reference) model for comparison with your classifier model. The majority classifier model predicts the majority class for all data points. At the very least, you should comfortably beat the majority class classifier; otherwise, the model is (usually) pointless.

***Question 26.*** Write a function `compute_majority_classifier(data,label)` that returns a majority classifier for the column `label` of frame `data` (***yes***, *I am asking you to write a function that returns a function*). You may assume that the labels are numeric $+1$ or $-1$. Test it using the sentiment of **train_data**. What does the majority classifier return in this case?
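
If functions that return functions are unfamiliar, the generic sketch below may help. It is not the majority classifier itself: `make_constant_predictor` is a hypothetical helper that returns a predictor which always outputs a fixed value chosen when the predictor is created.

```python
def make_constant_predictor(value):
    """Return a function that predicts `value` for every input row."""
    def predict(rows):
        # `value` is remembered from the enclosing call (a closure).
        return [value] * len(rows)
    return predict

always_minus_one = make_constant_predictor(-1)
print(always_minus_one(['review a', 'review b', 'review c']))
```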

In [None]:
# YOUR CODE HERE
...

***Question 27.*** Compute the accuracy of the majority classifier on the **test_data**. Round your answer to two decimal places.

In [None]:
# YOUR CODE HERE

***Question 28.*** Is the **sentiment_model** definitely better than the majority class classifier (the baseline)? Based on the gathered information, does the **sentiment_model** suffer from high bias or from high variance? What else would you try to improve performance? Explain.

YOUR ANSWER HERE: