# Sentiment Analysis of Customer Reviews: A Small Dataset

The exercise concerns classifying mobile phone reviews as positive, negative or neutral. The data consists of 2k reviews and is class-imbalanced: there are many more positive reviews than negative ones. Your task is to train two linear SVM classifiers: one on the unbalanced data and one after applying a class-balancing technique. Decide which technique to use and find out whether it improves classification accuracy on the test data.

Note that the project uses macroaveraged RMSE ([Baccianella et al. 2009](http://nmis.isti.cnr.it/sebastiani/Publications/ISDA09.pdf)) as the evaluation metric, rather than the macroaveraged F-score.
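
The sketch below illustrates in plain Python why a macroaveraged error suits imbalanced ratings. It follows the same recipe as the `rmse_macro` helper defined further down: squared errors are averaged within each true class, the per-class means are averaged, and the square root is taken last. The toy ratings and the helper names `macro_rmse` and `micro_rmse` are assumptions made for illustration only.

```python
import math
from collections import defaultdict

def macro_rmse(y_true, y_pred):
    """Average the squared error within each true class, then across classes."""
    squared_errors = defaultdict(list)
    for t, p in zip(y_true, y_pred):
        squared_errors[t].append((t - p) ** 2)
    per_class_mse = [sum(v) / len(v) for v in squared_errors.values()]
    return math.sqrt(sum(per_class_mse) / len(per_class_mse))

def micro_rmse(y_true, y_pred):
    """Ordinary RMSE: every review counts equally, so large classes dominate."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

# Hypothetical imbalanced ratings: eight 5-star reviews, one 3-star, one 1-star
y_true = [5] * 8 + [3, 1]
# A classifier that always predicts the majority class
y_pred = [5] * 10

micro = micro_rmse(y_true, y_pred)
macro = macro_rmse(y_true, y_pred)
print(f"micro RMSE: {micro:.3f}")
print(f"macro RMSE: {macro:.3f}")
```

On this toy data the macroaveraged value is the larger of the two, because the two minority classes contribute as much to the average as the majority class does.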

Propose a solution by replacing each "???" in the cells below with code or commentary.

The data can be obtained from this [page](https://jmcauley.ucsd.edu/data/amazon/). Download the file [reviews_Cell_Phones_and_Accessories_5.json.gz](http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Cell_Phones_and_Accessories_5.json.gz) and put it into the same folder as the notebook.

To run the code, you may first need to install the imbalanced-learn package (imported as `imblearn`):

`pip install -U imbalanced-learn`

In [None]:
import logging
logging.basicConfig()
logging.getLogger("SKLEARNEX").setLevel(logging.ERROR)

import json
import gzip
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.metrics import precision_recall_fscore_support, classification_report
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.model_selection import cross_val_predict

from IPython.display import display

In [None]:
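def groupby_labels(y, yhat):
    """Group the predictions in `yhat` by the true label in `y`.
    Returns the sorted unique labels and, for each label, an array of the
    predictions made for the examples with that true label.
    Based on https://stackoverflow.com/questions/38013778/is-there-any-numpy-group-by-function
    """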
    m = np.stack([y, yhat]).T
    m = m[m[:, 0].argsort()]
    grouped_preds = np.split(m[:, 1], np.unique(m[:, 0], return_index=True)[1])[1:]
    labels = np.unique(m[:, 0])
    return labels, grouped_preds

def mae_macro(y, yhat):
    """Macroaveraged MAE
    """
    labels, preds = groupby_labels(y, yhat)
    mean_diff = np.array([np.abs(label - pred).mean() for label, pred in zip(labels, preds)]).mean()
    return mean_diff

def rmse_macro(y, yhat):
    """Macroaveraged RMSE
    """
    labels, preds = groupby_labels(y, yhat)
    mean_diff = np.array([np.power(label - pred, 2).mean() for label, pred in zip(labels, preds)]).mean()
    return np.sqrt(mean_diff)

def evaluate_model(model, ytest, Xtest):
    """Given a trained model and test data, generate predictions
    and print a report with evaluation results
    """
    yhat = model.predict(Xtest)
    print(classification_report(ytest, yhat, zero_division=0))
    rmse = rmse_macro(ytest, yhat)
    print(f"{'Macro RMSE':18} {rmse:.3}")
    mae = mae_macro(ytest, yhat)
    print(f"{'Macro MAE':18} {mae:.3}")

def print_cv_results(grid_search, col_width=100, max_rows=10):
    """Given a grid search object, print a table with the 
    cross-validation results
    """
    results = pd.DataFrame(grid_search.cv_results_
                             )[['params', 'mean_train_score', 'mean_test_score']]
    
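    # the scorer returns the negated RMSE, so flip the sign to get the error back
    results["mean_train_score"] = -results["mean_train_score"]
    results["mean_test_score"] = -results["mean_test_score"]
    
    # relative gap between training and validation error, in percent
    results["diff, %"] = 100*(results["mean_train_score"]-results["mean_test_score"]
                                                         )/results["mean_train_score"]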

    pd.set_option('display.max_colwidth', col_width)
    pd.set_option('display.min_rows', max_rows)
    pd.set_option('display.max_rows', max_rows)
    display(results.sort_values('mean_test_score', ascending=True))

In [None]:
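# Create a scoring function for the grid search. scikit-learn maximises scores,
# so greater_is_better=False makes the scorer return the negated macro RMSE.
from sklearn.metrics import make_scorer

neg_rmse_macro = make_scorer(rmse_macro, greater_is_better=False)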

# Load the data

Each review comes with a 5-star rating. Use the first 2k reviews.

In [None]:
texts = []
targets = []
max_lines = 2000
lines = 0

# open in text mode and close the file automatically when done
with gzip.open("reviews_Cell_Phones_and_Accessories_5.json.gz", "rt", encoding="utf-8") as f:
    for line in f:
        d = json.loads(line)
        score = int(d['overall'])
        text = d['reviewText']
        texts.append(text)
        targets.append(score)

        # read only the first `max_lines` reviews
        lines += 1
        if lines >= max_lines:
            break

df = pd.DataFrame({"text": texts, "target": targets})

# Training-test split

In [None]:
from sklearn.model_selection import train_test_split

trainset, testset = train_test_split(df, test_size=0.1, stratify=df["target"], random_state=7)
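
As a rough illustration of what `stratify` does in the split above, the sketch below samples the test portion from each class separately, so that the class proportions are preserved. It is a simplified stand-in written in plain Python, not scikit-learn's actual allocation procedure; the function name `stratified_split` and the toy labels are assumptions.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size, seed):
    """Return (train_idx, test_idx), drawing the test share from each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        n_test = round(len(idx) * test_size)  # simple rounding, for illustration
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return train_idx, test_idx

# Hypothetical imbalanced labels: 70% five-star, 20% four-star, 10% one-star
labels = [5] * 70 + [4] * 20 + [1] * 10
train_idx, test_idx = stratified_split(labels, test_size=0.1, seed=7)

test_counts = Counter(labels[i] for i in test_idx)
print(test_counts)
```

Without stratification, a 10% test set drawn from a small imbalanced sample could easily contain no examples of the rarest class, which would make the per-class averages in the metric unreliable.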

# Data exploration and transformation

Check the distribution of the class label in the training data.

In [None]:
trainset["target"].value_counts().plot(kind="bar", rot=0)

In [None]:
Xtrain = trainset.drop("target", axis=1)
ytrain = trainset["target"].copy()

Xtest = testset.drop("target", axis=1)
ytest = testset["target"].copy()

## Construct document-by-word matrix

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer(
    strip_accents="unicode", # convert accented chars to non-accented versions
    lowercase=True,
    tokenizer=None,        # None - use the default tokenizer
    preprocessor=None,     # None - use the default preprocessor
    stop_words="english",
    ngram_range=(1, 3),    # min and max range of ngrams
    analyzer="word",       # split the document into words, rather than e.g. characters
    max_df=1.0,            # ignore words with df greater than the value (int represents count,
                           # float represents proportion of documents)
    min_df=3               # ignore words with df lower than the value (int represents count,
                           # float represents proportion)
)

In [None]:
docs_train_counts = count_vectorizer.fit_transform(Xtrain['text'])
docs_train_counts.shape

In [None]:
docs_test_counts = count_vectorizer.transform(Xtest['text'])
docs_test_counts.shape
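
To make the effect of `ngram_range` and `min_df` more concrete, here is a minimal plain-Python sketch of the vocabulary-building step. It mimics the default word tokenisation only approximately and leaves out stop-word removal and accent stripping; the helper `word_ngrams` and the three toy documents are assumptions.

```python
import re
from collections import Counter

def word_ngrams(text, ngram_range=(1, 3)):
    """Lowercase, keep tokens of two or more word characters, build n-grams."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    lo, hi = ngram_range
    return [" ".join(tokens[i:i + n])
            for n in range(lo, hi + 1)
            for i in range(len(tokens) - n + 1)]

docs = [
    "great phone great battery",
    "great battery life",
    "battery died fast",
]

# document frequency: in how many documents does each n-gram occur?
doc_freq = Counter()
for doc in docs:
    doc_freq.update(set(word_ngrams(doc)))

min_df = 2  # keep only n-grams that occur in at least two documents
vocabulary = sorted(term for term, df in doc_freq.items() if df >= min_df)
print(vocabulary)
```

Most bigrams and trigrams occur in a single document, so a `min_df` threshold removes the bulk of them and keeps the matrix manageable on a small dataset.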

## TFIDF weighting of document vectors

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer = TfidfTransformer()

docs_train_tfidf = tfidf_transformer.fit_transform(docs_train_counts)
docs_test_tfidf = tfidf_transformer.transform(docs_test_counts)
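
For reference, the weighting applied by `TfidfTransformer` with its default settings can be sketched in plain Python as follows: raw counts are multiplied by a smoothed inverse document frequency, $\ln\frac{1+n}{1+\mathrm{df}(t)} + 1$, and each document vector is then scaled to unit L2 length. The function name `tfidf` and the toy count matrix are assumptions for illustration.

```python
import math

def tfidf(counts):
    """Weight raw counts by smoothed idf, then L2-normalise each row."""
    n_docs, n_terms = len(counts), len(counts[0])
    # document frequency: number of documents containing each term
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

    rows = []
    for row in counts:
        weighted = [c * w for c, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        rows.append([v / norm if norm else 0.0 for v in weighted])
    return idf, rows

# Hypothetical counts: 3 documents x 3 terms; term 0 occurs in every document
counts = [
    [2, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
]
idf, rows = tfidf(counts)
print([round(w, 3) for w in idf])
for row in rows:
    print([round(v, 3) for v in row])
```

A term that appears in every document receives the minimum idf of 1, while rarer terms are weighted up. As in the notebook, the idf values should be estimated on the training data only and then reused for the test data.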

In [None]:
from sklearn.preprocessing import MaxAbsScaler 

scaler = MaxAbsScaler()

Xtrain = scaler.fit_transform(docs_train_tfidf)
Xtest = scaler.transform(docs_test_tfidf)
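
The scaling step above divides every feature by its largest absolute value in the training data, which preserves sparsity because zeros stay zeros. A minimal plain-Python sketch of the idea, with a hypothetical helper `max_abs_scale` and toy values:

```python
def max_abs_scale(train_rows, test_rows):
    """Scale each column by its maximum absolute value in the training rows."""
    n_cols = len(train_rows[0])
    # a column that is all zeros in training keeps a scale of 1.0
    scale = [max(abs(row[j]) for row in train_rows) or 1.0 for j in range(n_cols)]

    def apply(rows):
        return [[v / s for v, s in zip(row, scale)] for row in rows]

    return apply(train_rows), apply(test_rows)

train = [[0.5, 0.0], [0.25, 0.0]]
test = [[1.0, 0.2]]
train_scaled, test_scaled = max_abs_scale(train, test)
print(train_scaled)
print(test_scaled)
```

Because the scale is learned from the training data alone, scaled test values may fall outside the range [-1, 1].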

# Baseline

In [None]:
trainset["target"].value_counts()

In [None]:
from sklearn.dummy import DummyClassifier

dummy_clf = DummyClassifier(strategy="most_frequent")
dummy_clf.fit(Xtrain, ytrain)

# the baseline always predicts the most frequent class; evaluate it on the training data
evaluate_model(dummy_clf, ytrain, Xtrain)

# Training

## Unbalanced data

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

lsvm = LinearSVC(random_state=7, max_iter=10000)

param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10] 
}

lsvc_grid_search = GridSearchCV(lsvm, param_grid, cv=10,
                           scoring=neg_rmse_macro,
                           return_train_score=True) 
lsvc_grid_search.fit(Xtrain, ytrain)

print_cv_results(lsvc_grid_search, col_width=100, max_rows=150)

<mark>Comment:</mark>: ???

In [None]:
# cross-validation confusion matrix, training data
yhat = cross_val_predict(lsvc_grid_search.best_estimator_, Xtrain, ytrain, cv=10)
ConfusionMatrixDisplay.from_predictions(ytrain, yhat, 
                                        labels=lsvc_grid_search.best_estimator_.classes_, 
                                        normalize="true",
                                        cmap=plt.cm.Blues);

<mark>Comment:</mark>: ???

## Class balancing
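
As background, the sketch below shows the general idea behind one family of balancing techniques, random oversampling: minority-class examples are duplicated at random until every class is as large as the biggest one. It is plain Python on toy data and only one option among several (undersampling the majority class and generating synthetic examples are others); it is not meant as the intended answer to the exercise. The function name `random_oversample` and the example reviews are assumptions.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Duplicate random minority examples until all classes are equally large."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())

    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        extra = rng.choices(pool, k=target - count)  # sample with replacement
        X_out.extend(extra)
        y_out.extend([label] * len(extra))
    return X_out, y_out

# Hypothetical toy data: four 5-star reviews, one 1-star, one 3-star
X = ["good", "great", "fine", "love it", "bad", "meh"]
y = [5, 5, 5, 5, 1, 3]

X_bal, y_bal = random_oversample(X, y, seed=7)
print(Counter(y_bal))
```

Whatever technique is chosen, it should be applied to the training folds only, never to the validation or test data. This is the reason the cell below uses the `Pipeline` class from `imblearn` rather than the one from scikit-learn: it applies resampling steps during fitting only.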

In [None]:
from imblearn.pipeline import Pipeline
from imblearn ???

pipeline = Pipeline([
        ???
        ('lsvc', LinearSVC(random_state=7, max_iter=10000))
    ])

param_grid = [
    {
        ???
        'lsvc__C': [0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10],
    },
]

cb_grid_search = GridSearchCV(pipeline, param_grid, cv=10, 
                              scoring=neg_rmse_macro,
                              return_train_score=True)

cb_grid_search.fit(Xtrain, ytrain)

print_cv_results(cb_grid_search, col_width=100)

<mark>Comment:</mark>: ???

In [None]:
# cross-validation confusion matrix on the training data
yhat = cross_val_predict(cb_grid_search.best_estimator_, Xtrain, ytrain, cv=10)

ConfusionMatrixDisplay.from_predictions(ytrain, yhat, 
                                        labels=cb_grid_search.best_estimator_.classes_, 
                                        normalize="true",
                                        cmap=plt.cm.Blues);

<mark>Comment:</mark>: ???

# Evaluate on test

In [None]:
# Unbalanced data
evaluate_model(lsvc_grid_search.best_estimator_, ytest, Xtest)

In [None]:
# Class balancing
evaluate_model(cb_grid_search.best_estimator_, ytest, Xtest)

<mark>Comment:</mark>: ???

# Citing this notebook

If you use this notebook in your work, please cite it as follows:
    
Pekar, V. (2023). Big Data for Decision Making. Lecture examples and exercises. (Version 1.0.0). URL: https://github.com/vpekar/bd4dm