---
title: "Supervised Learning"
format:
    html: 
        code-fold: false
---

<br>
<br>

# Overview

In this section, I apply several supervised machine learning techniques to predict three different target variables. As a refresher, supervised machine learning is a suite of algorithms that are trained on labeled datasets to classify or predict outcomes for specific target variables[@IBMsupervised]. I use three subcategories of supervised learning: regression, binary classification, and multi-class classification. For training, I use `scikit-learn`'s `train_test_split` function to split my data into training and testing sets. After running each model, I look at its evaluation metrics to gauge its performance on the prediction or classification task.

Before beginning the modeling process, I carry out feature extraction on our text data. For this, I use the TF-IDF embedding method that has been used in multiple parts of this study. From there, I build three custom modeling pipelines, one for each subcategory. The pipelines are modular, allowing us to control which features are used in the modeling process. Each pipeline applies a set of methods suited to its goal, compares the results, and outputs the metrics of the highest-performing model.

At the end of this section, I present my findings, including a summary of each model's performance and some visualizations to supplement those findings.

# Code 


## Importing Data and Packages

Let's begin by importing the relevant packages.

In [1]:
# Data loading and manipulation packages
import gzip
import pandas as pd
import numpy as np

# Data preprocessing and feature extraction packages
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from textblob import TextBlob

# Model training packages
from sklearn.linear_model import LogisticRegression, LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC

# Model evaluation packages
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score,
    confusion_matrix,
    mean_squared_error,
    mean_absolute_error,
    r2_score,
    roc_curve,
    auc
)

# Visualization packages
import matplotlib.pyplot as plt
import seaborn as sns


Now, we can load in our data:

In [2]:
# Path to the processed data
data_path = "../../data/processed-data/reviews_short.csv.gz"

# Unzip the CSV file
with gzip.open(data_path, 'rb') as f:
    # Read the CSV file into a dataframe
    reviews = pd.read_csv(f)

reviews.head(1)

|   | reviewRating | vote | verified | reviewTime | reviewerID | productID | reviewerName | reviewText | summary | reviewTextClean | summaryClean | binary_target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5.0 | 2 | False | 2016-06-17 | A7HY1CEDK0204 | B00I9GYG8O | Jor El | If you're looking for Cinema 4K capabilities o... | Filmmakers will love this camera. | youre looking cinema k capabilities budget cam... | filmmakers love camera | positive |


In [3]:
#| echo: false
reviews['reviewTextClean'] = reviews['reviewTextClean'].astype(str)
reviews['reviewRating'] = reviews['reviewRating'].astype(int)

## Data Preprocessing

### Handling Zero-Vote rows

To start, I shift my focus to the heavily skewed `vote` column, where the vast majority of rows have zero votes. To handle this while keeping the `vote` column usable in our models, I convert it into a binary-encoded column called `vote_binary`, whose value is $1$ if the value for `vote` is non-zero and $0$ otherwise.

In [4]:
# Checking % of zero-vote rows
print(f"Percent of reviews with 0 votes: {round((len(reviews.loc[reviews['vote'] == 0])/len(reviews)*100), 2)}%")

Percent of reviews with 0 votes: 86.02%


In [5]:
# Creating vote_binary column
reviews['vote_binary'] = (reviews['vote'] > 0).astype(int)
# Printing result
reviews[['vote', 'vote_binary']].head()

|   | vote | vote_binary |
|---|---|---|
| 0 | 2 | 1 |
| 1 | 0 | 0 |
| 2 | 0 | 0 |
| 3 | 0 | 0 |
| 4 | 5 | 1 |


### Encoding `verified` Column

Next, let's do the same thing for `verified`, setting "True" to 1, and "False" to 0.

In [6]:
# Creating verified_binary column
reviews['verified_binary'] = (reviews['verified'] == True).astype(int)
# Printing result
reviews[['verified', 'verified_binary']].head()

|   | verified | verified_binary |
|---|---|---|
| 0 | False | 0 |
| 1 | False | 0 |
| 2 | True | 1 |
| 3 | True | 1 |
| 4 | True | 1 |


### Encoding `productID`

Here, I use sklearn's [`LabelEncoder`](https://scikit-learn.org/dev/modules/generated/sklearn.preprocessing.LabelEncoder.html) object to convert the `productID` column into integer codes that scikit-learn estimators can accept. For now I keep both the original and the encoded versions of `productID`; the original column is dropped later in preprocessing. Because the integer codes are arbitrary labels with no ordinal meaning, they tend to suit tree-based algorithms like Random Forests and Gradient Boosting better than models such as Logistic Regression and Support Vector Machines (SVMs), which treat the codes as numeric magnitudes.

In [7]:
# Initializing LabelEncoder object
encoder = LabelEncoder()
# Fitting to our productID column
reviews['productID_encoded'] = encoder.fit_transform(reviews['productID'])
# Checking result
reviews[['productID', 'productID_encoded']].head()

|   | productID | productID_encoded |
|---|---|---|
| 0 | B00I9GYG8O | 27766 |
| 1 | B01DB6BK5I | 42622 |
| 2 | B00011KM3I | 1415 |
| 3 | B000EDOSFQ | 3153 |
| 4 | B01CVOLKKQ | 42382 |
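
To make the encoding step more concrete, here is a minimal sketch of what `LabelEncoder` does on a tiny, hypothetical sample of product IDs (three of them borrowed from the output above, with one repeated). It assumes the standard scikit-learn behavior that codes are indices into the sorted list of unique values, and that the mapping can be reversed with `inverse_transform`.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical mini-sample of product IDs; the first ID appears twice
product_ids = ["B00I9GYG8O", "B01DB6BK5I", "B00011KM3I", "B00I9GYG8O"]

encoder = LabelEncoder()
codes = encoder.fit_transform(product_ids)

# classes_ stores the sorted unique IDs; each code is an index into it
classes = list(encoder.classes_)

# inverse_transform maps the integer codes back to the original IDs
decoded = list(encoder.inverse_transform(codes))

print(list(codes))   # repeated IDs receive the same code
print(classes)
print(decoded)
```

Note that the resulting integers only identify a product; their numeric order reflects the sort order of the ID strings and nothing about the products themselves.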


### Encoding Binary Target

Here, I use binary encoding to create a new `binary_sentiment` column that adjusts values in the `binary_target` column. Specifically, I will change the value "positive" to 1, and "negative" to 0.

In [9]:
# Creating binary_sentiment column
reviews['binary_sentiment'] = (reviews['binary_target'] == "positive").astype(int)
# Printing result
reviews[['binary_target', 'binary_sentiment']].head()

|   | binary_target | binary_sentiment |
|---|---|---|
| 0 | positive | 1 |
| 1 | negative | 0 |
| 2 | positive | 1 |
| 3 | positive | 1 |
| 4 | negative | 0 |


### Dropping Unnecessary Columns

With that out of the way, let's begin to drop unnecessary columns. For this section, we will drop:

- `binary_target`: String-encoded sentiment column
- `vote`: Non-encoded vote column
- `productID`: Non-encoded product ID
- `verified`: Non-encoded values for verified column
- `reviewTime`: Time of review, since we are not conducting any time series analysis here
- `reviewerID`: ID of reviewer
- `reviewerName`: Name of reviewer

I choose to leave in the uncleaned versions of the review text and summary in case I want to use some all-in-one embedding tools down the line like `BERT`.


In [10]:
# Setting columns to drop 
cols_to_drop = ['binary_target', 'verified', 'vote','productID', 'reviewTime', 'reviewerID', 'reviewerName']
# Dropping
reviews = reviews.drop(columns=cols_to_drop)
# Printing result
reviews.head(1)

|   | reviewRating | reviewText | summary | reviewTextClean | summaryClean | vote_binary | verified_binary | productID_encoded | binary_sentiment |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 5 | If you're looking for Cinema 4K capabilities o... | Filmmakers will love this camera. | youre looking cinema k capabilities budget cam... | filmmakers love camera | 1 | 0 | 27766 | 1 |


### Renaming and Reordering Columns for Cleanliness

In [13]:
# Renaming columns 
renamed_columns = {
    'reviewRating': 'rating',
    'reviewText': 'text',
    'summary': 'summary',
    'reviewTextClean': 'text_clean',
    'summaryClean': 'summary_clean',
    'binary_sentiment': 'sentiment',
    'vote_binary': 'vote',
    'verified_binary': 'verified',
    'productID_encoded': 'product_id'
}

reviews = reviews.rename(columns=renamed_columns)

# Reordering columns for clarity
column_order = [
    'product_id',
    'rating',
    'vote',
    'verified',
    'sentiment',
    'text_clean', 
    'summary_clean', 
    'text', 
    'summary', 
]

reviews = reviews[column_order]

# Printing result
reviews.head(1)

|   | product_id | rating | vote | verified | sentiment | text_clean | summary_clean | text | summary |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 27766 | 5 | 1 | 0 | 1 | youre looking cinema k capabilities budget cam... | filmmakers love camera | If you're looking for Cinema 4K capabilities o... | Filmmakers will love this camera. |


As a reminder for our boolean columns:

- `vote`: $(1 \ \text{for non-zero vote counts}, \ 0 \  \text{otherwise})$
- `verified`: $(1 \ \text{if account is verified}, \ 0 \  \text{otherwise})$
- `sentiment`: $(1 \ \text{if rating} \ \geq{4}, \ 0 \  \text{if rating} \ < 4)$

## Feature Extraction

### Polarity

Before moving on to more advanced feature extraction, I am going to quickly add back the `polarity` column that we worked with in the EDA section to serve as a target variable in our regression models. If you need a refresher on polarity, please head over to the [EDA](../eda/main.ipynb) section where I provide an overview.

In [15]:
# TextBlob allows us to feed raw text into it for polarity extraction, so I will do that here
reviews['polarity'] = reviews['text'].apply(lambda x: TextBlob(x).sentiment.polarity)
# printing result
reviews[['text', 'rating', 'polarity']].head(1)

|   | text | rating | polarity |
|---|---|---|---|
| 0 | If you're looking for Cinema 4K capabilities o... | 5 | 0.321675 |


### TF-IDF

Term frequency-inverse document frequency (TF-IDF) measures the importance of a word (or pair of words in our case) by comparing its rate of appearance in a document to its rate of appearance in a whole collection of documents. In this approach, the weight of a given word or word-pair depends on both its frequency and its rarity. The benefit of TF-IDF over a simple document-term matrix is that TF-IDF down-weights words/pairs that occur frequently across our corpus and favors words that are important in their respective documents but are found less frequently across the corpus[@TFIDF]. Recall the formulaic representation of TF-IDF that I include on the home page[@TFIDF]:
$$
\text{TF}(t,d) = \frac{\text{Number of times term t appears in document d}}{\text{Total number of terms in document d}}
$$
$$
\text{IDF}(t,D) = \log_{e}\frac{\text{Total number of documents in corpus D}}{\text{Number of documents containing term t}}
$$
$$
\text{TF-IDF}(t,d,D) = \text{TF}(t,d) \cdot \text{IDF}(t,D)
$$
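
As a quick worked example of these formulas, the sketch below computes TF, IDF, and TF-IDF by hand on a tiny made-up corpus (the three tokenized "reviews" are hypothetical, not taken from the dataset). It follows the equations above literally, so its values will not match `TfidfVectorizer`, which by default uses a smoothed IDF variant and L2-normalizes each document vector.

```python
import math

# Hypothetical tokenized corpus of three short reviews
corpus = [
    ["great", "camera", "great", "lens"],
    ["camera", "stopped", "working"],
    ["great", "value"],
]

def tf(term, doc):
    """Share of the document's terms that are `term`."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Natural log of (number of documents / documents containing `term`).
    Assumes the term appears in at least one document."""
    n_containing = sum(term in doc for doc in docs)
    return math.log(len(docs) / n_containing)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

first_doc = corpus[0]
score_great = tfidf("great", first_doc, corpus)  # frequent here, but common in the corpus
score_lens = tfidf("lens", first_doc, corpus)    # appears once, but only in this document

print(f"great: {score_great:.4f}")
print(f"lens:  {score_lens:.4f}")
```

Even though "great" appears twice in the first document and "lens" only once, "lens" ends up with the larger weight because it appears in no other document.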


For its implementation here, I define the function `tfidf_embedding()`, which takes in our pandas dataframe, applies TF-IDF to the cleaned review text column, and returns a dataframe that contains the original columns alongside all of the TF-IDF features. The function relies on the `TfidfVectorizer` from the sklearn library. In my implementation, I feed in the following parameters[@scikitTFIDF]:

- `max_features`: Maximum vocabulary size. Only the top `max_features` terms, ranked by their term frequency across the corpus, are kept.
- `ngram_range`: A tuple that controls the range of n-values for different n-grams to be extracted. In my case, I use (1,2) to include both unigrams and bigrams.

In [16]:
def tfidf_embedding(df, text_column, max_features=1000, ngram_range=(1,2)):
    """
    This function:
    1) Takes in a pandas dataframe and target text column
    2) Applies TF-IDF embedding using the set parameters
    3) Returns pandas dataframe with tf-idf features 
    """

    # Initialize TfidfVectorizer; stop_words='english' helps clean up
    # stopwords that the text processing section may have missed
    tfidf = TfidfVectorizer(max_features=max_features, ngram_range=ngram_range, stop_words='english')

    # Fit tfidf to our target column
    tfidf_features = tfidf.fit_transform(df[text_column])

    # Create pandas dataframe for tfidf features
    tfidf_df = pd.DataFrame(tfidf_features.toarray(), columns=[f"tfidf_{i}" for i in range(tfidf_features.shape[1])])

    # Concatenate TF-IDF features df with our original df to maintain other features
    df_tfidf = pd.concat([df.reset_index(drop=True), tfidf_df.reset_index(drop=True)], axis = 1)

    return df_tfidf

In [17]:
## Applying this function to our reviews data set
reviews_tfidf = tfidf_embedding(reviews, text_column='text_clean', max_features=500, ngram_range=(1,2))
# Printing shape
reviews_tfidf.shape

(100000, 510)

In [18]:
reviews_tfidf.head(1)

|   | product_id | rating | vote | verified | sentiment | text_clean | summary_clean | text | summary | polarity | ... | tfidf_490 | tfidf_491 | tfidf_492 | tfidf_493 | tfidf_494 | tfidf_495 | tfidf_496 | tfidf_497 | tfidf_498 | tfidf_499 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 27766 | 5 | 1 | 0 | 1 | youre looking cinema k capabilities budget cam... | filmmakers love camera | If you're looking for Cinema 4K capabilities o... | Filmmakers will love this camera. | 0.321675 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.219799 | 0.0 |


## Model Selection

### Pipeline 1: Binary Classification

#### Logistic Regression

- **Overview**

Logistic regression is a supervised learning technique that is often used to predict a binary outcome (A or B). The model takes in a set of independent variables for a given observation and calculates the probability that the observation belongs to a certain target class. Logistic regression uses a [sigmoid function](https://www.linkedin.com/pulse/understanding-sigmoid-function-logistic-regression-piduguralla/) that maps the linear combination of inputs to a final probability in the range [0, 1][@sigmoid]. If the value returned by the sigmoid function is $\geq 0.5$, then the model assigns the observation to the target class (in our case, it assigns the value $1$ to reviews that it believes are positive).

**Sigmoid Function:**
<br>
![](../../xtra/multiclass-portfolio-website/images/sigmoid.png){width="500px"}
<br>
Source: [Daily Dose of Data Science](https://www.dailydoseofds.com/why-do-we-use-sigmoid-in-logistic-regression/)
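
The mapping from a linear combination of inputs to a class label can be sketched in a few lines. The weights, bias, and feature values below are made up purely for illustration; in practice they would be learned by `LogisticRegression` during fitting.

```python
import math

def sigmoid(z):
    """Map any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(weights, bias, features, threshold=0.5):
    """Return (probability, label) for one observation."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    probability = sigmoid(z)
    return probability, int(probability >= threshold)

# Hypothetical coefficients and feature values
weights = [2.0, -1.0]
bias = 0.0
features = [1.0, 0.5]   # linear combination: 2.0*1.0 - 1.0*0.5 = 1.5

probability, label = predict_label(weights, bias, features)
print(f"probability={probability:.4f}, label={label}")
```

A linear score of exactly zero maps to a probability of 0.5, which is why the default decision boundary of the model sits where the linear combination changes sign.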

- **Model Rationale**

When selecting which binary classification models to use for predicting `sentiment`, logistic regression stood out as an obvious baseline. The model's simplicity and direct probabilistic output make it a reliable and lightweight option for running predictions using TF-IDF embeddings.

#### Random Forest 

- **Overview**

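Random forests are a [decision tree](https://en.wikipedia.org/wiki/Decision_tree)-based approach to machine learning. However, instead of constructing a single tree, a random forest trains many decision trees, each on a bootstrapped sample of the data with a random subset of features considered at each split, and combines their predictions by majority vote.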
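
Random forests are generally described as ensembles of decision trees trained on bootstrapped samples whose predictions are combined by majority vote. The toy sketch below illustrates just those two ideas (bootstrap sampling and voting) using one-feature decision stumps on made-up data. It is a simplification for intuition, not how `RandomForestClassifier` is implemented: real forests grow full trees and also subsample features at each split.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement."""
    return [rng.choice(rows) for _ in rows]

def fit_stump(rows):
    """Pick the threshold (from the observed values) that minimizes
    training errors for the rule: predict 1 if x > threshold."""
    best_threshold, best_errors = None, None
    for threshold, _ in rows:
        errors = sum(int(x > threshold) != label for x, label in rows)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def forest_predict(thresholds, x):
    """Majority vote across all stumps."""
    votes = Counter(int(x > t) for t in thresholds)
    return votes.most_common(1)[0][0]

# Hypothetical (feature, label) pairs
data = [(0.1, 0), (0.2, 0), (0.35, 0), (0.4, 0),
        (0.6, 1), (0.7, 1), (0.8, 1), (0.9, 1)]

rng = random.Random(0)
# An odd number of stumps avoids tied votes
thresholds = [fit_stump(bootstrap_sample(data, rng)) for _ in range(25)]

print(forest_predict(thresholds, 0.05), forest_predict(thresholds, 0.95))
```

Each stump sees a slightly different resample of the data and may therefore choose a different threshold; the vote averages out those individual quirks.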

- **Model Rationale**







In [21]:
def binary_classification(df, target='sentiment', selected_features=None, include_tfidf=False, test_size=0.20, random_state=5000):
    """
    Trains each candidate binary classifier on the chosen features and
    returns the fitted model and metrics with the highest F1 score.
    """

    # Step 1: select which features we want to use
    # 1(a) raise error if no features were requested
    if not selected_features and not include_tfidf:
        raise ValueError("Please specify either `selected_features` or set `include_tfidf` to True")

    # 1(b) combine features
    features = []
    if selected_features:
        features += selected_features

    if include_tfidf:
        features += [col for col in df.columns if col.startswith("tfidf_")]

    # 1(c) Extract Features and Target variables
    X = df[features]
    y = df[target]

    # Step 2: Train-Test Split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)

    # Step 3: Initialize Models
    models = {
        "Logistic Regression": LogisticRegression(max_iter = 1000, random_state=random_state),
        "Random Forest": RandomForestClassifier(n_estimators=100, random_state=random_state)
    }

    # Step 4: Train and evaluate the models
    results = {}
    for model_name, model in models.items():
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        y_pred_prob = model.predict_proba(X_test)[:, 1]

        # Calculate error metrics
        metrics = {
            "Accuracy": accuracy_score(y_test, y_pred),
            "Precision": precision_score(y_test, y_pred),
            "Recall": recall_score(y_test, y_pred),
            "F1 Score": f1_score(y_test, y_pred),
            "ROC-AUC": roc_auc_score(y_test, y_pred)
        }

        # Add metrics to results dict
        results[model_name] = {
            "Model": model,
            "Metrics": metrics
        }

    # Step 5: Return the best model based on F1 score
    best_model = max(results, key=lambda k: results[k]["Metrics"]["F1 Score"])


    print(f"BEST MODEL: {best_model}")
    print("===========================\n")
    print("Performance Metrics:")
    for metric, value in results[best_model]["Metrics"].items():
        print(f"{metric}: {value:.4f}")

    return results[best_model]
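
Since the pipeline ranks models by F1 score, it may help to recall that F1 is the harmonic mean of precision and recall. The sketch below computes all three from hypothetical confusion-matrix counts, and then checks that the precision and recall reported in the results of this section are consistent with the reported F1 score.

```python
def f1_from_counts(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 90 true positives, 10 false positives, 30 false negatives
precision, recall, f1 = f1_from_counts(tp=90, fp=10, fn=30)
print(f"precision={precision:.4f}, recall={recall:.4f}, f1={f1:.4f}")

# Consistency check with the rounded precision (0.8730) and recall (0.9621)
# reported for the best model in this section
reported_f1 = 2 * 0.8730 * 0.9621 / (0.8730 + 0.9621)
print(f"F1 implied by reported precision/recall: {reported_f1:.4f}")
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a model cannot reach a high F1 by maximizing recall alone while precision collapses, or vice versa.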

In [22]:
# Using only TF-IDF values for classification
best_model_result = binary_classification(
    df=reviews_tfidf, 
    target="sentiment", 
    include_tfidf=True  # Include all TF-IDF columns
)

BEST MODEL: Logistic Regression

Performance Metrics:
Accuracy: 0.8564
Precision: 0.8730
Recall: 0.9621
F1 Score: 0.9154
ROC-AUC: 0.6881
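
One detail worth flagging about these numbers: the pipeline passes the hard class labels (`y_pred`) to `roc_auc_score`, while the predicted probabilities (`y_pred_prob`) are computed but not used. ROC-AUC is normally computed from continuous scores, and with hard labels it reflects only a single operating point, which may partly explain why the value above is low relative to the F1 score. The sketch below uses made-up labels and probabilities and a small pairwise implementation of AUC (an illustration, not scikit-learn's code) to show how the two versions can diverge.

```python
def roc_auc(y_true, scores):
    """Probability that a random positive is scored above a random
    negative, counting ties as one half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Hypothetical ground truth and predicted probabilities for the positive class
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
probs = [0.20, 0.55, 0.60, 0.58, 0.70, 0.80, 0.90, 0.95]

# Hard labels at the default 0.5 threshold
labels = [int(p >= 0.5) for p in probs]

auc_from_probs = roc_auc(y_true, probs)
auc_from_labels = roc_auc(y_true, labels)

print(f"AUC from probabilities: {auc_from_probs:.4f}")
print(f"AUC from hard labels:   {auc_from_labels:.4f}")
```

In this toy case the classifier ranks almost every positive above every negative, yet thresholding turns two negatives into false positives and the label-based AUC drops sharply. Passing `y_pred_prob` to `roc_auc_score` would be the analogous change in the pipeline, though the resulting value for this dataset would have to be recomputed.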



### Pipeline 2: Multiclass Classification