# Lecture 7: Class demo on IMDB reviews

## Imports

In [None]:
import os
import sys

sys.path.append(os.path.join("code"))
from utils import *
import matplotlib.pyplot as plt
import mglearn
import numpy as np
import pandas as pd
from plotting_functions import *
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import cross_val_score, cross_validate, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
%matplotlib inline
URL_DATA_DIR = "https://github.com/firasm/bits/raw/refs/heads/master/imdb_small.csv"
pd.set_option("display.max_colwidth", 200)

## Demo: Model interpretation of linear classifiers

- One of the primary advantages of linear classifiers is that they are easy to interpret.
- For example, the sign and magnitude of the learned coefficients tell us which features push the prediction in which direction, and how strongly.

- We'll demonstrate this by training `LogisticRegression` on the well-known [IMDB movie review](https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews) dataset. The dataset is a bit large for demonstration purposes, so I am going to put a big portion of it in the test split to speed things up.

In [None]:
imdb_df = pd.read_csv(URL_DATA_DIR, encoding="ISO-8859-1")
imdb_df.head()

Let's clean up the data a bit. 

In [None]:
import re

def replace_tags(doc):
    doc = doc.replace("<br />", " ")
    doc = re.sub(r"https://\S*", "", doc)
    return doc

In [None]:
imdb_df["review_pp"] = imdb_df["review"].apply(replace_tags)
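To see what this cleanup does, here is a minimal sketch that applies the same two steps to a made-up review string (the sample text is hypothetical, not taken from the dataset). Note that the pattern only matches links starting with `https://`, so plain `http://` links would be left in place.

```python
import re


def replace_tags(doc):
    # Same steps as the function above: turn HTML line breaks into spaces,
    # then drop anything that looks like an https link.
    doc = doc.replace("<br />", " ")
    doc = re.sub(r"https://\S*", "", doc)
    return doc


# Hypothetical review text, made up for illustration.
sample = "Great movie!<br /><br />See https://example.com/review for more."
cleaned = replace_tags(sample)
print(repr(cleaned))
```

Each `<br />` becomes a single space, so consecutive tags leave consecutive spaces behind. The default tokenizer of `CountVectorizer` splits on non-word characters, so the extra whitespace does not affect the features.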

### Activity: Discuss the following questions in your group

- Are we breaking the Golden rule here?
- If so, how? If not, why not?

<br><br><br><br>

### Let's split the data and create a bag-of-words representation.

This is a very large dataset (and even now we're only working with a fifth of its actual size), so we're going to put most of it in the test set so that the calculations don't take forever!

In [None]:
train_df, test_df = train_test_split(imdb_df, test_size=0.7, random_state=123)
X_train, y_train = train_df["review_pp"], train_df["sentiment"]
X_test, y_test = test_df["review_pp"], test_df["sentiment"]
train_df.shape

In [None]:
vec = CountVectorizer(stop_words="english")
bow = vec.fit_transform(X_train)
bow
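As a rough illustration of what the bag-of-words matrix contains, the sketch below builds one by hand for a tiny made-up corpus. It is a simplified stand-in, not `CountVectorizer` itself: it lowercases the text, keeps tokens of two or more word characters, drops a small hypothetical stop-word list, and counts occurrences per document. The names `toy_docs`, `toy_stop_words`, `toy_vocab`, and `toy_bow` are invented for this example.

```python
import re
from collections import Counter

# Tiny made-up corpus and a hypothetical stop-word list (assumptions for illustration).
toy_docs = [
    "The acting was excellent, truly excellent!",
    "The plot was boring.",
]
toy_stop_words = {"the", "was"}


def tokenize(doc):
    # Lowercase and keep tokens with at least two word characters.
    tokens = re.findall(r"\b\w\w+\b", doc.lower())
    return [t for t in tokens if t not in toy_stop_words]


# Vocabulary: sorted unique tokens; the position in the list is the column index.
toy_vocab = sorted({tok for doc in toy_docs for tok in tokenize(doc)})

# One row per document, one column per vocabulary word, entries are counts.
toy_bow = []
for doc in toy_docs:
    counts = Counter(tokenize(doc))
    toy_bow.append([counts[word] for word in toy_vocab])

print(toy_vocab)
print(toy_bow)
```

With a real vocabulary of thousands of words, almost every entry in a row is zero, which is why `fit_transform` returns a sparse matrix rather than a dense array.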

### Examining the vocabulary

- The vocabulary (mapping from feature indices to actual words) can be obtained using `get_feature_names_out()` on the `CountVectorizer` object. 

In [None]:
vocab = vec.get_feature_names_out()

In [None]:
vocab[0:10]  # first few words

In [None]:
vocab[2000:2010]  # some middle words

In [None]:
vocab[::500]  # words with a step of 500

In [None]:
y_train.value_counts()

### Model building on the dataset 

First let's try `DummyClassifier` on the dataset. 

In [None]:
dummy = DummyClassifier()
scores = cross_validate(dummy, X_train, y_train, return_train_score=True)
pd.DataFrame(scores)

The dataset is balanced, so the `DummyClassifier` score is around 0.5.
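The 0.5 figure follows directly from the class balance: a baseline that always predicts the most frequent class is right exactly as often as that class occurs. Here is a minimal sketch with made-up labels (the label strings are assumptions, not necessarily those used in the dataset):

```python
from collections import Counter

# Made-up, perfectly balanced labels (assumption for illustration).
toy_labels = ["pos", "neg"] * 5

# A most-frequent baseline always predicts the majority class,
# so its accuracy equals the share of that class in the data.
majority_class, majority_count = Counter(toy_labels).most_common(1)[0]
baseline_accuracy = majority_count / len(toy_labels)

print(majority_class, baseline_accuracy)
```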

Now let's try logistic regression. 

In [None]:
pipe_lr = make_pipeline(
    CountVectorizer(stop_words="english"),
    LogisticRegression(max_iter=1000),
)
scores = cross_validate(pipe_lr, X_train, y_train, return_train_score=True)
pd.DataFrame(scores)

It looks like we are overfitting. Let's tune the hyperparameter `C` of `LogisticRegression`, keeping `max_features` of `CountVectorizer` fixed at 10,000.

In [None]:
scores_dict = {
    "C": 10.0 ** np.arange(-3, 3, 1),
    "mean_train_scores": list(),
    "mean_cv_scores": list(),
}
for C in scores_dict["C"]:
    pipe_lr = make_pipeline(CountVectorizer(max_features=10_000, stop_words="english"),
                        LogisticRegression(max_iter=1000, C=C)
                       )
    scores = cross_validate(pipe_lr, X_train, y_train, return_train_score=True)
    scores_dict["mean_train_scores"].append(scores["train_score"].mean())
    scores_dict["mean_cv_scores"].append(scores["test_score"].mean())

results_df = pd.DataFrame(scores_dict)
results_df

In [None]:
optimized_C = results_df["C"][results_df["mean_cv_scores"].idxmax()]
print(
    "The maximum validation score is %0.3f at C = %0.2f "
    % (
        np.max(results_df["mean_cv_scores"]),
        optimized_C,
    ))

In [None]:
pipe_lr = make_pipeline(
    CountVectorizer(max_features=10_000, stop_words="english"),
    LogisticRegression(max_iter=1000, C=0.10),
)
pipe_lr.fit(X_train, y_train)

### Examining learned coefficients 

- The learned coefficients are exposed by the `coef_` attribute of the [LogisticRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) object.

In [None]:
# Get feature names
feature_names = pipe_lr.named_steps['countvectorizer'].get_feature_names_out().tolist()

# Get coefficients 
coeffs = pipe_lr.named_steps["logisticregression"].coef_.flatten()

In [None]:
word_coeff_df = pd.DataFrame(coeffs, index=feature_names, columns=["Coefficient"])
word_coeff_df

- Let's sort the coefficients in descending order. 
- Interpretation
    - if $w_j > 0$ then increasing $x_{ij}$ moves us toward predicting $+1$ (here, the second class in `pipe_lr.classes_`).
    - if $w_j < 0$ then increasing $x_{ij}$ moves us toward predicting $-1$ (here, the first class in `pipe_lr.classes_`).


In [None]:
word_coeff_df.sort_values(by="Coefficient", ascending=False)

- The coefficients make sense!
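To make the interpretation above concrete, here is a minimal sketch with hypothetical coefficients (the words, weights, and intercept are invented, not the values learned above). The raw score of a linear classifier is the intercept plus the sum of each word's count times its coefficient, so each word's contribution can be read off directly.

```python
# Hypothetical coefficients and intercept (assumptions for illustration).
toy_coeffs = {"excellent": 0.8, "enjoyed": 0.5, "boring": -0.9}
toy_intercept = 0.1

# Word counts for a made-up review.
toy_counts = {"excellent": 1, "enjoyed": 1, "boring": 1}

# Contribution of each word: count times coefficient.
contributions = {w: toy_counts[w] * toy_coeffs[w] for w in toy_counts}
raw_score = toy_intercept + sum(contributions.values())

# Print contributions from most positive to most negative.
for word, contrib in sorted(contributions.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word:>10}: {contrib:+.2f}")
print(f"raw score: {raw_score:+.2f}")
```

A positive raw score means the words with positive coefficients outweigh those with negative coefficients, so the model leans toward the positive class.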

Let's visualize the top 20 features.

In [None]:
mglearn.tools.visualize_coefficients(coeffs, feature_names, n_top_features=20)

Let's explore the predictions for the following new reviews.

In [None]:
fake_reviews = ["It got a bit boring at times but the direction was excellent and the acting was flawless. Overall I enjoyed the movie and I highly recommend it!",
 "The plot was shallower than a kiddie pool in a drought, but hey, at least we now know emojis should stick to texting and avoid the big screen."
]

Let's get the predictions and prediction probabilities for the fake reviews.

In [None]:
pipe_lr.predict(fake_reviews)

In [None]:
# Get prediction probabilities for fake reviews 
pipe_lr.predict_proba(fake_reviews)

In [None]:
pipe_lr.classes_
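The columns of `predict_proba` follow the order of `classes_`. For binary logistic regression, the probability of the second class is the sigmoid of the raw score, and the first column is its complement. Here is a minimal sketch, assuming a hypothetical raw score and hypothetical class labels:

```python
import math


def sigmoid(z):
    # Logistic function: maps a raw score to a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical class labels and raw score (assumptions for illustration).
toy_classes = ["neg", "pos"]
toy_raw_score = 0.5

p_second = sigmoid(toy_raw_score)       # probability of toy_classes[1]
toy_proba = [1.0 - p_second, p_second]  # same column order as toy_classes

# The predicted label is the class with the larger probability.
predicted = toy_classes[toy_proba.index(max(toy_proba))]
print(dict(zip(toy_classes, toy_proba)), predicted)
```

A raw score of zero corresponds to a probability of exactly 0.5, which is the decision boundary between the two classes.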

We can find which of the vocabulary words are present in a given review and look at their coefficients:

In [None]:
def plot_coeff_example(model, review, coeffs, feature_names, n_top_feats=6):
    print(review)
    # Boolean mask over the vocabulary: True for words that occur in the review
    feat_vec = model.named_steps["countvectorizer"].transform([review])
    words_in_ex = feat_vec.toarray().ravel().astype(bool)

    # Coefficients of only the words present in the review
    ex_df = pd.DataFrame(
        data=coeffs[words_in_ex],
        index=np.array(feature_names)[words_in_ex],
        columns=["Coefficient"],
    )
    mglearn.tools.visualize_coefficients(
        coeffs[words_in_ex],
        np.array(feature_names)[words_in_ex],
        n_top_features=n_top_feats,
    )
    return ex_df.sort_values(by=["Coefficient"], ascending=False)

In [None]:
plot_coeff_example(pipe_lr, fake_reviews[0], coeffs, feature_names)

In [None]:
plot_coeff_example(pipe_lr, fake_reviews[1], coeffs, feature_names)

<br><br><br><br>

### Most positive review 

- Remember that you can look at the probabilities (confidence) of the classifier's prediction using the `model.predict_proba` method.
- Can we find the reviews where our classifier is most certain or least certain?

In [None]:
# only get probabilities associated with pos class
pos_probs = pipe_lr.predict_proba(X_train)[:, 1]
pos_probs

What's the index of the example where the classifier is most certain (highest `predict_proba` score for positive)?

In [None]:
most_positive_id = np.argmax(pos_probs)

In [None]:
print("True target: %s\n" % (y_train.iloc[most_positive_id]))
print("Predicted target: %s\n" % (pipe_lr.predict(X_train.iloc[[most_positive_id]])[0]))
print("Prediction probability: %0.4f" % (pos_probs[most_positive_id]))
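The same array of probabilities also answers the "least certain" part of the question: the least certain example is the one whose positive-class probability is closest to 0.5. Here is a minimal sketch on made-up probabilities (the values and the `toy_pos_probs` name are invented for illustration):

```python
# Made-up positive-class probabilities for five examples (assumption for illustration).
toy_pos_probs = [0.97, 0.52, 0.08, 0.65, 0.31]

indices = range(len(toy_pos_probs))

# Most confident positive: highest probability of the positive class.
most_positive = max(indices, key=lambda i: toy_pos_probs[i])

# Most confident negative: lowest probability of the positive class.
most_negative = min(indices, key=lambda i: toy_pos_probs[i])

# Least certain: probability closest to 0.5.
least_certain = min(indices, key=lambda i: abs(toy_pos_probs[i] - 0.5))

print(most_positive, most_negative, least_certain)
```

With the NumPy array `pos_probs` from above, the equivalent lookup would be `np.argmin(np.abs(pos_probs - 0.5))`.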

Let's examine the features associated with the review. 

In [None]:
plot_coeff_example(pipe_lr, X_train.iloc[most_positive_id], coeffs, feature_names)

The review has both positive and negative words but the words with **positive** coefficients win in this case! 

### Most negative review 

In [None]:
# only get probabilities associated with neg class
neg_probs = pipe_lr.predict_proba(X_train)[:, 0]
neg_probs

In [None]:
most_negative_id = np.argmax(neg_probs)

In [None]:
print("Review: %s\n" % (X_train.iloc[most_negative_id]))
print("True target: %s\n" % (y_train.iloc[most_negative_id]))
print("Predicted target: %s\n" % (pipe_lr.predict(X_train.iloc[[most_negative_id]])[0]))
print("Prediction probability: %0.4f" % (neg_probs[most_negative_id]))

In [None]:
plot_coeff_example(pipe_lr, X_train.iloc[most_negative_id], coeffs, feature_names)

The review has both positive and negative words but the words with negative coefficients win in this case! 

## Activity: Discuss the following questions in your group

- Is it possible to identify the most important features using $k$-NNs? What about decision trees?

<br><br><br><br>