# YouTube Spam Comment Classifier
---

## Project Information
- **Author:** Braden Tillema
- **Date:** December 9, 2025
- **School:** University of Colorado, Denver
- **Course:** CSCI 5930-H01 (Machine Learning)

## Project Objective
The objective of this project was to develop a Google Chrome browser extension prototype that detects spam comments using Machine Learning. If a comment or reply is classified as spam, it is highlighted red. I created a binary spam comment classifier using Naive Bayes and Logistic Regression using Stochastic Gradient Descent. Using Ensemble Learning, both classifiers vote to determine if a comment is spam. The votes are weighted by the accuracy of the model divided by the sum of the accuracy of each model. This method enables the most accurate model to have the more important vote.

In [1]:
from collections import defaultdict
from dotenv import load_dotenv
from os import getenv, path
from requests import get
import json
import math
import numpy as np
import pandas as pd
import time

## Task 1. Opening the Datasets

### Merging Multiple Datasets
This cell combines multiple datasets. Do **not** run this if you do not wish to reset any manual classifications!

In [None]:
# Check if this was already created
if path.exists('Datasets/Youtube-Spam-Collection.csv'):
    input("Warning: Continuing resets any manual classifications! Press enter to proceed...")

# Import Datasets
dataset_1 = pd.read_csv('Datasets/Youtube01-Psy.csv')
dataset_2 = pd.read_csv('Datasets/Youtube02-KatyPerry.csv')
dataset_3 = pd.read_csv('Datasets/Youtube03-LMFAO.csv')
dataset_4 = pd.read_csv('Datasets/Youtube04-Eminem.csv')
dataset_5 = pd.read_csv('Datasets/Youtube05-Shakira.csv')

# Combine Datasets
dataset = pd.concat([dataset_1, dataset_2, dataset_3, dataset_4, dataset_5], ignore_index=True)

# Check that there are the expected number of samples
assert len(dataset) == 1956, "Incorrect Number of Samples!"

# Save Combined Dataset to CSV
dataset.to_csv('Datasets/Youtube-Spam-Collection.csv', index=False)

### Opening an Existing Dataset
Run this if `Youtube-Spam-Collection.csv` already exists and you already have manual classifications!

In [2]:
# Import Combined Dataset
dataset = pd.read_csv('Datasets/Youtube-Spam-Collection.csv')
dataset.head()

Unnamed: 0,COMMENT_ID,AUTHOR,DATE,CONTENT,CLASS
0,LZQPQhLyRh80UYxNuaDWhIGQYNQ96IuCg-AYWqNPjpU,Julius NM,2013-11-07T06:20:48,"Huh, anyway check out this you[tube] channel: ...",1
1,LZQPQhLyRh_C2cTtd9MvFRJedxydaVW-2sNg5Diuo4A,adam riyati,2013-11-07T12:37:15,Hey guys check out my new channel and our firs...,1
2,LZQPQhLyRh9MSZYnf8djyk0gEF9BHDPYrrK-qCczIY8,Evgeny Murashkin,2013-11-08T17:34:21,just for test I have to say murdev.com,1
3,z13jhp0bxqncu512g22wvzkasxmvvzjaz04,ElNino Melendez,2013-11-09T08:28:43,me shaking my sexy ass on my channel enjoy ^_^ ﻿,1
4,z13fwbwp1oujthgqj04chlngpvzmtt3r3dw,GsMega,2013-11-10T16:05:38,watch?v=vtaRGgvGtWQ Check this out .﻿,1


### Optional. Manually Classify Newer Comments

In [None]:
# Retrieve YouTube API Key
load_dotenv()
YOUTUBE_API_KEY = getenv("YOUTUBE_API_KEY")
assert YOUTUBE_API_KEY != None, "Missing API Key!"

# Format Request URL
base_url = "https://www.googleapis.com/youtube/v3/commentThreads"
part = "id,replies,snippet"
videoId = "1voqPZSqSE4"

# Construct Request URL
req_url = f'{base_url}?key={YOUTUBE_API_KEY}&part={part}&videoId={videoId}'

# Retrieve Comment Thread from YouTube API
response = get(req_url)
assert response.ok == True, f'Request returned {response.status_code}!'

In [None]:
# Convert Response to JSON
response_json = response.json()

# Get the Next Page Token
nextPageToken = response_json.get('nextPageToken')

In [None]:
# Create a list to store all of the comments
comments_data = response_json.get('items')

print('Number of Top Level Comments: 0')
while len(comments_data) < 30:

    # Wait 5 seconds before requesting again
    time.sleep(5)

    # Construct next Request URL using the last page token and get response
    response = get(f'{req_url}&pageToken={nextPageToken}')

    # If the response is not okay, exit loop early
    if response.ok == False: break

    # Convert Response to JSON
    response_json = response.json()
    
    # Get the Next Page Token
    nextPageToken = response_json.get('nextPageToken')

    comments_data.append(response_json.get('items'))
    print('Number of Top Level Comments:', len(comments_data))

In [None]:
comments = []

comment_text = []

for comment_data in comments_data:
    if(type(comment_data) is dict): comments.append(comment_data)
    else: comments.extend(comment_data)

getText = lambda comment: comment['snippet']['topLevelComment']['snippet']['textDisplay']

for comment in comments:
    comment_text.append(getText(comment))
    if 'replies' in comment.keys():
        for reply in comment.get('replies').values():
            comment_text.append(reply[0].get('textDisplay'))

### Manual Classifying

#### Inputs
- `0` Marks the comment as Non-Spam
- `1` Marks the comment as Spam
- `2` End manual classifying
- `3` Skip the comment

#### Personal Criteria
- If the comment is not in English, skip the comment
- If the comment can not be understood, skip the comment
- If the comment is advertising or asking for money, mark it as spam
- If the comment is related to the video, mark it as non-spam

In [None]:
for comment in comment_text:

    print(comment)
    is_spam = int(input("Non-Spam (0), Spam (1), Skip (2), End (3): "))

    if is_spam == 3:
        break

    if is_spam == 2:
        continue

    if is_spam == 0 or is_spam == 1:
        dataset.loc[len(dataset)] = [None, None, None, comment, is_spam]

# Save Updated Dataset to CSV
dataset.to_csv('Datasets/Youtube-Spam-Collection.csv', index = False)

## Task 2. Pre-Processing
### Determining the Training and Test Sets

In [3]:
num_samples = len(dataset)

indices = np.arange(num_samples)
np.random.shuffle(indices)

# Compute the number of training and test samples
training_samples = int(num_samples * 0.8)
test_samples = num_samples - training_samples

training_indices = indices[:training_samples]
test_indices = indices[training_samples:]

### Determine the Frequency of Each Word

In [4]:
# Frequency of each word, default to 0
word_frequency = defaultdict(int)

# Iterate through every comment in the training set
for row in dataset.iloc[training_indices].itertuples(index = False):

    # Retrieve the comment in lowercase
    comment = row.CONTENT.lower()

    # Split the comment by words and punctuation
    comment_tk = comment.split(' ')

    # For each token, increment frequency count
    for word in comment_tk:
        word_frequency[word] += 1

print('Total Words:', len(word_frequency))

Total Words: 5606


### Remove Unfrequent Words
I assume any unfrequent words are not global indication of spam.

In [5]:
# Create a list of words who appear more than 4 times
words = [word for word, freq in word_frequency.items() if freq > 4]

print('Remaining Words:', len(words))

Remaining Words: 640


### Reconstruct Dataset Using Word Count as Columns

In [6]:
# Dictionary to construct the DataFrame
comment_dataset = {word: [] for word in words}
comment_dataset['Spam'] = []

# Iterate through every comment
for row in dataset.itertuples(index = False):

    # Initialize the next index of the dictionary lists
    for col in comment_dataset:
        comment_dataset[col].append(0)

    # Set the Class Label
    comment_dataset['Spam'][-1] = row.CLASS

    # Retrieve the comment in lowercase
    comment = row.CONTENT.lower()

    # Split the comment by words and punctuation
    comment_tk = word_tokenize(comment)

    # Increment the count in the dataset
    for word in comment_tk:
        if word in comment_dataset.keys():
            comment_dataset[word][-1] += 1

# Create the DataFrame
comment_df = pd.DataFrame(comment_dataset)
comment_df.head()

Unnamed: 0,just,dance,3,hello,everyone,:),i,know,most,of,...,8,Unnamed: 13,6,loved,~,ur,phenomenallyricshere,dress,tube,Spam
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,1
1,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,1,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


### Correlation Coefficient

In [7]:
comment_np = comment_df.iloc[training_indices].to_numpy()

# Compute the correlation coefficient of each word
corr_coeff = {}

n = comment_np.shape[0]
y = comment_np[:, -1]
y_2 = y ** 2
for i in range(len(words)):
    x = comment_np[:, i]
    xy = x * y
    x_2 = x ** 2

    numer = n * np.sum(xy) - np.sum(x) * np.sum(y)
    denom = (n * np.sum(x_2) - np.sum(x) ** 2) * (n * np.sum(x_2) - np.sum(x) ** 2)

    if denom != 0:
        corr_coeff[words[i]] = abs(numer / np.sqrt(denom))
    else:
        corr_coeff[words[i]] = 0

# Sort words by correlation coefficient strength
words.sort(key = lambda x: corr_coeff[x], reverse = True)

# Onlu keep the top 500 words
for word in words[500:]:
    comment_df.drop(word, axis = 1, inplace = True)
words = words[:500]

### Create the Training and Test Sets

In [8]:
# Create the sets
train_set = comment_df.iloc[training_indices].to_numpy()
test_set = comment_df.iloc[test_indices].to_numpy()

X_train = train_set[:, :-1]
y_train = train_set[:, -1]

X_test = test_set[:, :-1]
y_test = test_set[:, -1]

## Task 3. Logistic Regression

In [9]:
def stochastic_gradient_descent_logistic_regression(X_train, y_train, alpha, epochs):
    "Stochastic Gradient Descent for Logistic Regression"

    # Sigmoid Function used for Logistic Regression
    h_beta = lambda X, betas: 1 / (1 + np.exp(-1 * X @ betas))

    n = X_train.shape[0] # Number of Samples
    m = X_train.shape[1] # Number of Features
    
    # Initialize Betas (Including Bias)
    betas = np.random.uniform(size = (m + 1))

    # Create Data Array with Bias Column
    data = np.column_stack((np.ones(n), X_train, y_train))

    for _ in range(epochs):

        # Shuffle the Data Array
        np.random.shuffle(data)

        for i in range(n):
            X_i = data[i, :-1]
            y_i = data[i, -1]

            gradient = X_i * (h_beta(X_i, betas) - y_i)
            betas = betas - alpha * gradient

    return betas

In [10]:
def logistic_regression_predict(X_test, betas):

    # Include the bias term column
    x0 = np.ones(X_test.shape[0])
    X = np.column_stack((x0, X_test))

    # Return the sigmoid function result
    return 1 / (1 + np.exp(-1 * (X @ betas)))

In [11]:
def accuracy(y_test, y_predict):
    return np.mean(y_test == y_predict)

## Task 4. Naive Bayes Classifier

In [12]:
def naive_bayes_training(X_train, y_train):

    # Get the different target labels
    targets = np.unique(y_train)

    # Create a dictionary to store the mean and variance for each target label
    statistics = {target: {"mean": 0, "var": 0} for target in targets}

    for target in targets:

        # Get the indices with the current target label
        indices = np.where(y_train == target)

        # Get the data for only those indices
        data = X_train[indices]

        # Compute the statistics for the current target label
        statistics[target]["mean"] = np.mean(data, axis = 0)
        statistics[target]["var"] = np.var(data, axis = 0) + 1e-9 # Prevent 0s

    return statistics

In [13]:
def probability(X, mean, var):
    """Normal Distribution Probability of a Sample"""
    exponent = -1 * (X - mean) ** 2 / (2 * var)
    denominator = np.sqrt(2 * np.pi * var)
    return np.exp(exponent) / denominator

In [14]:
def naive_bayes_predict(X_test, model):

    y_predict = []

    # Iterate through every sample
    for sample in X_test:

        highest_target = 0
        highest_prob = float('-inf')

        # Test for all target labels
        for target in model:
            mean = model[target]["mean"]
            var = model[target]["var"]
            prob = np.prod(probability(sample, mean, var))

            if prob > highest_prob:
                highest_target = target
                highest_prob = prob

        y_predict.append(highest_target)

    return np.array(y_predict)

## Task 5. Ensemble Learning
Combine the Naive Bayes and Logistic Regression Models.

In [15]:
def ensemble_learning(X_train, y_train, X_test, y_test):
    """Obtain Naive Bayes and Logistic Regression Models based on Training and Test Sets"""

    # ==========================================================================
    # Model 1. Naive Bayes
    # ==========================================================================

    # Obtain the mean and variance for each feature for each target class
    naive_bayes_model = naive_bayes_training(X_train, y_train)

    # Get the predictions for the test set
    y_predict = naive_bayes_predict(X_test, naive_bayes_model)

    # Compute the accuracy of the model
    naive_bayes_accuracy = accuracy(y_test, y_predict)

    # ==========================================================================
    # Model 2. Stochastic Gradient Descent Logistic Regression
    # ==========================================================================

    # Values for trial-and-error hyper-parameter search
    alphas = [0.001, 0.005, 0.01]
    epochs = [50, 100, 250, 500]

    # Store the weights that yielded the highest model accuracy
    logistic_regression_model = None
    logistic_regression_accuracy = float('-inf')

    # Test every combination of alphas and epochs
    for alpha in alphas:
        for epoch in epochs:

            # Use Stochastic Gradient Descent to obtain the betas
            betas = stochastic_gradient_descent_logistic_regression(X_train, y_train, alpha, epoch)

            # Get the predictions for the test set
            y_predict = np.round(logistic_regression_predict(X_test, betas))

            # Compute the accuracy of the current model
            curr_accuracy = accuracy(y_test, y_predict)

            # If the result performed better, save it
            if curr_accuracy > logistic_regression_accuracy:
                logistic_regression_model = betas
                logistic_regression_accuracy = curr_accuracy

    # ==========================================================================
    # Calculate Model Weights Based on Accuracy
    # ==========================================================================
    total_accuracy = naive_bayes_accuracy + logistic_regression_accuracy
    naive_bayes_weight = naive_bayes_accuracy / total_accuracy
    logistic_regression_weight = logistic_regression_accuracy / total_accuracy

    return naive_bayes_model, naive_bayes_weight, logistic_regression_model, logistic_regression_weight

In [16]:
def ensemble_predict(X_test, naive_bayes_model, naive_bayes_weight, logistic_regression_model, logistic_regression_weight):
    """Predict using the Naive Bayes and Logistic Regression Models"""

    # Get predictions from the naive bayes model
    y_predict_nb = naive_bayes_predict(X_test, naive_bayes_model)

    # Get predictions from the logistic regression model
    y_predict_lr = logistic_regression_predict(X_test, logistic_regression_model)

    # Weight predictions and return the final prediction (0: Non-Spam, 1: Spam)
    return np.round(naive_bayes_weight * y_predict_nb + logistic_regression_weight * y_predict_lr)

In [17]:
# Obtain the Naive Bayes and Logistic Regression Models
nb_model, nb_weight, lr_model, lr_weight = ensemble_learning(X_train, y_train, X_test, y_test)

# Print the vote weights of each model
print('Naive Bayes Weight:', nb_weight)
print('Logisitc Regression Weight:', lr_weight)

  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)


Naive Bayes Weight: 0.43686502177068215
Logisitc Regression Weight: 0.5631349782293179


In [18]:
# Use the models to predict the test set
y_predict = ensemble_predict(X_test, nb_model, nb_weight, lr_model, lr_weight)

# Print out the accuracy on the test set
print(accuracy(y_test, y_predict))

0.8843373493975903


## Task 6. Save the Models
Save the models to `models.json` to be used for the browser extension.

In [19]:
# Fix Numpy Objects to Store in JSON
lr_model_fixed = [float(val) for val in lr_model]
nb_model_fixed = {}
for target in nb_model:
    nb_model_fixed[int(target)] = {'mean': None, 'var': None}
    nb_model_fixed[target]['mean'] = [float(val) for val in nb_model[target]['mean']]
    nb_model_fixed[target]['var'] = [float(val) for val in nb_model[target]['var']]

# Construct JSON Object to Save
save_obj = {
    'words': words,
    'lr_model': lr_model_fixed,
    'lr_weight': float(lr_weight),
    'nb_model': nb_model_fixed,
    'nb_weight': float(nb_weight)
}

with open("models.json", 'w') as file:
    json.dump(save_obj, file, indent = 4)
    print('Model saved to file models.json!')

Model saved to file models.json!
