---
title: "Supervised Learning"
format:
    html: 
        code-fold: false
---

<br>
<br>

# Overview

In this section, I apply several supervised machine learning techniques to predict three different target variables. As a refresher, supervised machine learning is a suite of algorithms that are trained on labeled datasets to classify or predict outcomes for specific target variables[@IBMsupervised]. I use three subcategories of supervised learning: regression, binary classification, and multi-class classification. For training, I use `scikit-learn`'s `train_test_split` function to split my data into training and testing sets. After fitting each model, I examine its error metrics to gauge its performance on the prediction or classification task.

Before beginning the modeling process, I carry out feature extraction on our text data. For this, I use the TF-IDF embedding method that has been used in multiple parts of this study. In addition to TF-IDF, I will also experiment with Latent Dirichlet Allocation (LDA) as another form of embedding. From there, I will feed both sets of features into the modeling process to see which yields more promising results.

At the end of this section, I will present my findings, including a summary of each model's performance and some visualizations to supplement said findings.
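
The train/test split described above can be sketched on toy data as follows. The arrays `X` and `y`, the 80/20 ratio, the `random_state`, and the use of `stratify` are illustrative assumptions, not necessarily the settings used for the models in this project.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the real feature matrix and labels (hypothetical data)
X = np.arange(20).reshape(10, 2)   # 10 samples, 2 features
y = np.array([0, 1] * 5)           # balanced binary labels

# Hold out 20% of the rows for testing; stratify=y keeps the class
# ratio the same in the training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```

Fixing `random_state` makes the split reproducible, which matters when comparing several models on the same held-out rows.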

# Code 


## Importing Data and Packages

Let's begin by importing the relevant packages:

In [1]:
# Data loading and manipulation packages
import gzip
import pandas as pd
import numpy as np

# Data preprocessing and feature extraction packages
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from textblob import TextBlob

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from transformers import BertTokenizer, BertModel
import torch

# Model training packages
from sklearn.linear_model import LogisticRegression, LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC

# Model evaluation packages
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score,
    confusion_matrix,
    mean_squared_error,
    mean_absolute_error,
    r2_score,
    roc_curve,
    auc
)

# Visualization packages
import matplotlib.pyplot as plt
import seaborn as sns


Now, we can load in our data:

In [2]:
# Path to the processed data (gzip-compressed CSV)
data_path = "../../data/processed-data/reviews_short.csv.gz"

# Unzip the CSV file
with gzip.open(data_path, 'rb') as f:
    # Read the CSV file into a dataframe
    reviews = pd.read_csv(f)

reviews.head(1)

Unnamed: 0,reviewRating,vote,verified,reviewTime,reviewerID,productID,reviewerName,reviewText,summary,reviewTextClean,summaryClean,binary_target
0,5.0,2,False,2016-06-17,A7HY1CEDK0204,B00I9GYG8O,Jor El,If you're looking for Cinema 4K capabilities o...,Filmmakers will love this camera.,youre looking cinema k capabilities budget cam...,filmmakers love camera,positive


In [3]:
#| echo: false
reviews['reviewTextClean'] = reviews['reviewTextClean'].astype(str)

## Data Preprocessing

### Handling Zero-Vote Rows

To start, I turn to the heavily skewed `vote` column, where the vast majority of rows have zero votes. To handle this while keeping `vote` usable in our models, I convert it into a binary-encoded column called `vote_binary`, whose value is $1$ if `vote` is non-zero and $0$ otherwise.

In [4]:
# Checking % of zero-vote rows
print(f"Percent of reviews with 0 votes: {round((len(reviews.loc[reviews['vote'] == 0])/len(reviews)*100), 2)}%")

Percent of reviews with 0 votes: 86.02%


In [5]:
# Creating vote_binary column
reviews['vote_binary'] = (reviews['vote'] > 0).astype(int)
# Printing result
reviews[['vote', 'vote_binary']].head()

Unnamed: 0,vote,vote_binary
0,2,1
1,0,0
2,0,0
3,0,0
4,5,1


### Encoding `verified` Column

Next, let's do the same thing for `verified`, mapping `True` to 1 and `False` to 0.

In [6]:
# Creating verified_binary column
reviews['verified_binary'] = (reviews['verified'] == True).astype(int)
# Printing result
reviews[['verified', 'verified_binary']].head()

Unnamed: 0,verified,verified_binary
0,False,0
1,False,0
2,True,1
3,True,1
4,True,1


### Encoding `productID`

Here, I use sklearn's [`LabelEncoder`](https://scikit-learn.org/dev/modules/generated/sklearn.preprocessing.LabelEncoder.html) object to convert the `productID` column into integer codes. Because scikit-learn estimators expect numerical input, the encoded version is the one the models will consume. Keep in mind that the codes are arbitrary identifiers with no ordinal meaning: tree-based algorithms like Random Forests and Gradient Boosting can split on them fairly freely, while models such as Logistic Regression and Support Vector Machines (SVM) treat them as ordered magnitudes and may call for a different encoding. For now, both the raw and encoded versions of `productID` stay in the dataframe.

In [7]:
# Initializing LabelEncoder object
encoder = LabelEncoder()
# Fitting to our productID column
reviews['productID_encoded'] = encoder.fit_transform(reviews['productID'])
# Checking result
reviews[['productID', 'productID_encoded']].head()

Unnamed: 0,productID,productID_encoded
0,B00I9GYG8O,27766
1,B01DB6BK5I,42622
2,B00011KM3I,1415
3,B000EDOSFQ,3153
4,B01CVOLKKQ,42382
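
To make the behavior of the encoder concrete, here is a small sketch on a hypothetical list of product IDs (the IDs are borrowed from the output above, but the list itself is made up for illustration). It shows that the codes follow the sorted order of the unique IDs and that the mapping can be reversed.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical product IDs, including one repeat
product_ids = ["B01DB6BK5I", "B00011KM3I", "B00I9GYG8O", "B00011KM3I"]

encoder = LabelEncoder()
codes = encoder.fit_transform(product_ids)

# Codes are assigned by the sorted order of the unique IDs,
# not by any property of the products themselves
print(list(encoder.classes_))  # ['B00011KM3I', 'B00I9GYG8O', 'B01DB6BK5I']
print(codes.tolist())          # [2, 0, 1, 0]

# The mapping is reversible, so the raw IDs can always be recovered
restored = encoder.inverse_transform(codes)
print(list(restored) == product_ids)  # True
```

Since the raw IDs are recoverable from the fitted encoder, dropping the original column later does not lose information as long as the encoder object is kept.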


### Dropping Unnecessary Columns

With that out of the way, let's begin to drop unnecessary columns. For this section, we will drop:

- `vote`: Non-encoded vote column
- `productID`: Non-encoded product ID
- `verified`: Non-encoded values for the verified column
- `reviewTime`: Time of review, since we are not conducting any time series analysis here
- `reviewerID`: ID of reviewer
- `reviewerName`: Name of reviewer

I choose to leave in the uncleaned versions of the review text and summary in case I want to use some all-in-one embedding tools down the line like `BERT`.


In [8]:
# Setting columns to drop 
cols_to_drop = ['verified', 'vote','productID', 'reviewTime', 'reviewerID', 'reviewerName']
# Dropping
reviews = reviews.drop(columns=cols_to_drop)
# Printing result
reviews.head(1)

Unnamed: 0,reviewRating,reviewText,summary,reviewTextClean,summaryClean,binary_target,vote_binary,verified_binary,productID_encoded
0,5.0,If you're looking for Cinema 4K capabilities o...,Filmmakers will love this camera.,youre looking cinema k capabilities budget cam...,filmmakers love camera,positive,1,0,27766


### Renaming and Reordering Columns for Cleanliness

In [9]:
# Renaming columns 
renamed_columns = {
    'reviewRating': 'rating',
    'reviewText': 'text',
    'summary': 'summary',
    'reviewTextClean': 'text_clean',
    'summaryClean': 'summary_clean',
    'binary_target': 'binary_sentiment',
    'vote_binary': 'vote',
    'verified_binary': 'verified',
    'productID_encoded': 'product_id'
}

reviews = reviews.rename(columns=renamed_columns)

# Reordering columns for clarity
column_order = [
    'product_id',
    'rating',
    'vote',
    'text', 
    'summary', 
    'text_clean', 
    'summary_clean', 
    'binary_sentiment', 
    'verified', 
]

reviews = reviews[column_order]

# Printing result
reviews.head(5)

Unnamed: 0,product_id,rating,vote,text,summary,text_clean,summary_clean,binary_sentiment,verified
0,27766,5.0,1,If you're looking for Cinema 4K capabilities o...,Filmmakers will love this camera.,youre looking cinema k capabilities budget cam...,filmmakers love camera,positive,0
1,42622,2.0,0,"<div id=""video-block-R14IHTRCCNUS1P"" class=""a-...",Web-cams from 2002 packed in a non-discrete bu...,div idvideoblockrihtrccnusp classasection aspa...,webcams packed nondiscrete buggy package,negative,0
2,1415,5.0,0,Great products and excellent services!,Five Stars,great products excellent services,five stars,positive,1
3,3153,5.0,0,"Priced rivaled any one sided dvd case, easy to...",Satisfied,priced rivaled one sided dvd case easy open dv...,satisfied,positive,1
4,42382,1.0,1,"Only made it a year, then burned up. Had to cy...",Melted,made year burned cycle almost every day last m...,melted,negative,1


## Feature Extraction

### Polarity

Before moving on to more advanced feature extraction, I am going to quickly add back the `polarity` column that we worked with in the EDA section to serve as a target variable in our regression models. If you need a refresher on polarity, please head over to the [EDA](../eda/main.ipynb) section where I provide an overview.

In [11]:
# TextBlob allows us to feed raw text into it for polarity extraction, so I will do that here
reviews['polarity'] = reviews['text'].apply(lambda x: TextBlob(x).sentiment.polarity)
# printing result
reviews[['text', 'polarity']].head(1)

Unnamed: 0,text,polarity
0,If you're looking for Cinema 4K capabilities o...,0.321675




### TF-IDF

Term frequency-inverse document frequency (TF-IDF) is a process that measures the importance of a word (or pair of words in our case) by comparing its rate of appearance in a document to its rate of appearance in a whole collection of documents. In this approach, the weight of a given word or word-pair depends both on its frequency and rarity. The benefit of TF-IDF over a simple document term matrix is that TF-IDF will punish words/pairs that occur frequently across our corpus, and favor words that have importance to their respective documents but are found less frequently across the corpus[@TFIDF]. Recall the equations for TF-IDF that I include on the home page[@TFIDF]:

$$
\text{TF}(t,d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d}
$$
$$
\text{IDF}(t,D) = \log_{e}\left(\frac{\text{Total number of documents in corpus } D}{\text{Number of documents containing term } t}\right)
$$
$$
\text{TF-IDF}(t,d,D) = \text{TF}(t,d) \cdot \text{IDF}(t,D)
$$
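
As a worked example of the equations above, the following sketch computes TF, IDF, and TF-IDF by hand on a tiny, hypothetical tokenized corpus. The helper names `tf`, `idf`, and `tf_idf` are my own and are only meant to mirror the formulas; they are not how scikit-learn computes its weights.

```python
import math

# Hypothetical toy corpus of already-tokenized documents
corpus = [
    ["great", "camera", "great", "price"],
    ["camera", "broke"],
    ["great", "service"],
]

def tf(term, doc):
    # Share of the document's terms that are `term`
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Natural log of (number of documents / documents containing `term`).
    # Assumes the term appears in at least one document.
    n_containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "great" is frequent in document 0 but common across the corpus;
# "price" appears once but only in document 0
score_great = tf_idf("great", corpus[0], corpus)  # 0.5 * ln(3/2), about 0.203
score_price = tf_idf("price", corpus[0], corpus)  # 0.25 * ln(3), about 0.275

print(round(score_great, 3), round(score_price, 3))
```

Even though "great" appears twice as often as "price" in the first document, "price" receives the higher weight because it is rarer across the corpus. Note that scikit-learn's `TfidfVectorizer` uses raw counts for TF, a smoothed IDF, and L2 row normalization by default, so its values will differ from these textbook numbers.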


For its implementation here, I define the function `tfidf_embedding()`, which takes in our pandas dataframe, applies TF-IDF to the cleaned review text column, and returns a dataframe that contains all of the TF-IDF features alongside the original columns. Internally, the function relies on `TfidfVectorizer` from the sklearn library. In my implementation, I pass in the following parameters[@scikitTFIDF]:

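- `max_features`: Caps the vocabulary at the top `max_features` terms, ranked by their term frequency across the corpus.
- `ngram_range`: A tuple that controls the range of n-values for the n-grams to be extracted. In my case, I use (1,2) to include both unigrams and bigrams.
- `stop_words`: Set to `'english'` to drop scikit-learn's built-in English stop words, as a backstop for anything missed during text cleaning.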

In [12]:
def tfidf_embedding(df, text_column, max_features=1000, ngram_range=(1,2)):
    """
    This function:
    1) Takes in a pandas dataframe and target text column
    2) Applies TF-IDF embedding using the set parameters
    3) Returns pandas dataframe with tf-idf embeddings 
    """

    # Initialize TfidfVectorizer. stop_words='english' is a backstop for
    # stop words that the earlier text-cleaning step may have missed.
    tfidf = TfidfVectorizer(max_features=max_features, ngram_range=ngram_range, stop_words='english')

    # Fit tfidf to our target column
    tfidf_features = tfidf.fit_transform(df[text_column])

    # Create pandas dataframe for tfidf features
    tfidf_df = pd.DataFrame(tfidf_features.toarray(), columns=[f"tfidf_{i}" for i in range(tfidf_features.shape[1])])

    # Concatenate TF-IDF features df with our original df to maintain other features
    df_tfidf = pd.concat([df.reset_index(drop=True), tfidf_df.reset_index(drop=True)], axis = 1)

    return df_tfidf

In [13]:
# Applying this function to our reviews data set
reviews_tfidf = tfidf_embedding(reviews, text_column='text_clean', max_features=500, ngram_range=(1,2))
# Printing shape
reviews_tfidf.shape

(100000, 509)
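
The overview also mentions Latent Dirichlet Allocation (LDA) as an alternative embedding. A minimal sketch of what that could look like is shown below, assuming raw term counts from `CountVectorizer` as input and an arbitrary choice of two topics. The documents, the number of topics, and the `random_state` are all illustrative assumptions rather than the configuration used in this project.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical cleaned review snippets
docs = [
    "camera lens sharp picture quality",
    "battery died charger broke",
    "picture quality lens zoom",
    "charger cable battery replacement",
]

# LDA models word counts, so it is fit on a document-term count matrix
counts = CountVectorizer().fit_transform(docs)

# n_components is the number of topics (an arbitrary choice here)
lda = LatentDirichletAllocation(n_components=2, random_state=42)
doc_topics = lda.fit_transform(counts)

# Each row is a document's distribution over topics and sums to 1
print(doc_topics.shape)  # (4, 2)
print(np.allclose(doc_topics.sum(axis=1), 1.0))  # True
```

The resulting topic proportions could be concatenated onto the dataframe in the same way as the TF-IDF features, giving a much lower-dimensional representation of each review.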

## Model Selection

### Regression 

### Binary Classification

### Multiclass Classification