
# Jeopardy Classifier (Naive Bayes, Logistic Regression, SVM)
**Name:** Sean Theriault
**Student ID:** stheria4
**Course:** SDS 510 – Python for Data Wrangling  
**Date:** 11/19/2025
**Project:** Module 5 - Essentials Badge

I added more comments throughout this assignment, as requested in the feedback on my previous module.

This cell imports the libraries used in the rest of the notebook.

In [18]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score
import pandas as pd

Load the dataset from a shareable Google Drive link.

In [19]:
# Load Dataset
url = "https://drive.google.com/uc?export=download&id=1DCMBbMHAtGNnAZi2H8iupc429HSlDhgZ"
df = pd.read_json(url)

# Data Setup

Here I converted the money values into regular numbers so I could label each question as high-value ($1,000 or more) or low-value (under $1,000). Then I used the raw question text and turned it into simple bag-of-words features. After that, I split the data into training and testing sets for the classifiers.
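
As a small illustration of the conversion and labeling described above, the sketch below parses a few made-up dollar strings and applies a $1,000 threshold. The helper names `parse_value` and `label_value` and the sample strings are hypothetical and exist only for this example.

```python
def parse_value(raw):
    """Turn a string such as "$1,200" into an int; return None if that fails."""
    if not isinstance(raw, str):
        return None
    cleaned = raw.replace("$", "").replace(",", "")
    try:
        return int(cleaned)
    except ValueError:
        return None


def label_value(amount, threshold=1000):
    """Return 1 for high-value amounts (>= threshold), else 0."""
    return 1 if amount >= threshold else 0


# Hypothetical sample inputs, including two that cannot be parsed
samples = ["$200", "$1,000", "$2,000", "None", None]
parsed = [parse_value(s) for s in samples]
labels = [label_value(p) for p in parsed if p is not None]

print(parsed)  # [200, 1000, 2000, None, None]
print(labels)  # [0, 1, 1]
```

Unparseable entries come back as `None`, which is why the notebook drops those rows before labeling.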

In [20]:
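# Same as Basics Module: convert strings like "$1,200" to integers.
def to_num(v):
    if isinstance(v, str):
        v = v.replace("$", "").replace(",", "")
        try:
            return int(v)
        except ValueError:
            # Not a parseable number (for example "None")
            return None
    return None

df["ValueNum"] = df["value"].apply(to_num)
# Drop rows without a usable value; copy() so later column assignments
# work on an independent DataFrame.
df = df.dropna(subset=["ValueNum"]).copy()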

# Make binary labels
df["Label"] = df["ValueNum"].apply(lambda x: 1 if x >= 1000 else 0)

# Use raw question text (cleaning kept causing issues earlier)
texts = df["question"].fillna("").astype(str)


# Vectorize using simple bag-of-words
vec = CountVectorizer()
X = vec.fit_transform(texts)
y = df["Label"]

# Train/test split (just using the typical 20% for testing)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# Naive Bayes Approach (Baseline)

This is the first basic classifier I tried. It uses the question text with a bag-of-words model. Naive Bayes is simple and usually works okay for text, so I used it as my starting point.

In [21]:
model1 = MultinomialNB()
model1.fit(X_train, y_train)

pred1 = model1.predict(X_test)
# Check Accuracy
acc1 = accuracy_score(y_test, pred1)

print("Approach 1 (Naive Bayes) Accuracy:", acc1)

Approach 1 (Naive Bayes) Accuracy: 0.690084388185654
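
An accuracy score is easier to interpret next to a majority-class baseline, meaning a model that always predicts the most common label. The notebook does not report the class balance, so the sketch below uses made-up labels only to show how such a baseline could be computed with scikit-learn's `DummyClassifier`. All `toy_` names are hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up labels; the features are irrelevant to this baseline
toy_y_train = np.array([0, 0, 0, 1, 0, 1])
toy_y_test = np.array([0, 1, 0, 0])
toy_X_train = np.zeros((len(toy_y_train), 1))
toy_X_test = np.zeros((len(toy_y_test), 1))

# Always predicts the most frequent training label (0 here)
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(toy_X_train, toy_y_train)

baseline_acc = accuracy_score(toy_y_test, baseline.predict(toy_X_test))
print(baseline_acc)  # 0.75
```

In the notebook, the same idea would use `X_train`, `y_train`, `X_test`, and `y_test` in place of the toy arrays.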


# Logistic Regression Approach

For the second method, I tried Logistic Regression. It uses the same vectorized text but a different type of classifier. I wanted to see if a discriminative linear model would do better or worse than Naive Bayes.

In [22]:
# Trying Logistic Regression as a second method.
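# max_iter is raised from the default of 100 to 200 to give the solver more iterations.
# The warning in the output below shows it still hit the limit, so a larger value
# (or scaled features) may be needed for full convergence.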
model2 = LogisticRegression(max_iter=200)
model2.fit(X_train, y_train)

# Seeing how well it predicts the labels.
pred2 = model2.predict(X_test)
# Check accuracy
acc2 = accuracy_score(y_test, pred2)

print("Approach 2 (Logistic Regression) Accuracy:", acc2)

Approach 2 (Logistic Regression) Accuracy: 0.6971167369901548


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
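
The warning above means the solver stopped at its iteration limit. One way to check for this in code, sketched here on synthetic data rather than the Jeopardy set, is to compare the fitted model's `n_iter_` attribute with `max_iter`. The helper `hit_iteration_limit` is a hypothetical name, and the check is a heuristic: a model that happens to converge exactly on its last allowed iteration would also be flagged.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression


def hit_iteration_limit(model):
    """True if the solver used every iteration it was allowed (heuristic)."""
    return bool(model.n_iter_.max() >= model.max_iter)


# Small synthetic problem, used only for illustration
toy_X, toy_y = make_classification(n_samples=200, n_features=5, random_state=0)

# Deliberately too few iterations; silence the expected warning
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=ConvergenceWarning)
    tight = LogisticRegression(max_iter=1).fit(toy_X, toy_y)

# Plenty of iterations for a problem this small
roomy = LogisticRegression(max_iter=1000).fit(toy_X, toy_y)

print(hit_iteration_limit(tight))
print(hit_iteration_limit(roomy))
```

Applied to `model2`, a result of `True` would suggest trying a larger `max_iter`, as the warning text recommends.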


# Linear SVM (LinearSVC) Approach

For this approach, I tested a Linear Support Vector Machine. This is another common text-classification method. I kept the same data and the same vectorization, and only swapped in the SVM model to compare results.

In [23]:
# The SVM kept freezing during training, so I looked this up on StackOverflow.
# The suggestion was dual=False, which solves the primal problem instead of the dual.
# scikit-learn's docs recommend dual=False when there are more samples than features.
model3 = LinearSVC(dual=False)
model3.fit(X_train, y_train)

pred3 = model3.predict(X_test)
# Check accuracy
acc3 = accuracy_score(y_test, pred3)

print("Approach 3 (BOW + SVM) Accuracy:", acc3)

Approach 3 (BOW + SVM) Accuracy: 0.673300515705579
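
The choice of `dual` can also be made from the shape of the feature matrix. The sketch below follows the rule of thumb from the `LinearSVC` documentation (prefer `dual=False` when samples outnumber features) on synthetic data. The helper `prefer_dual` is a hypothetical name, and the notebook does not report the shape of `X_train`, so which branch applies there is not known from the source.

```python
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC


def prefer_dual(n_samples, n_features):
    """Rule of thumb: use the dual form unless samples outnumber features."""
    return n_samples <= n_features


# Synthetic data with many more samples than features
toy_X, toy_y = make_classification(n_samples=300, n_features=10, random_state=0)
use_dual = prefer_dual(*toy_X.shape)

toy_svm = LinearSVC(dual=use_dual)
toy_svm.fit(toy_X, toy_y)

print(use_dual)  # False, because 300 samples > 10 features
```

In the notebook, `prefer_dual(*X_train.shape)` would give the corresponding answer for the Jeopardy features.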


# Write Output Results to a Text File

In [24]:
# SAVE OUTPUT RESULTS
with open("classification_results.txt", "w") as f:
    f.write("Approach 1 (Naive Bayes) Accuracy: " + str(acc1) + "\n")
    f.write("Approach 2 (Logistic Regression) Accuracy: " + str(acc2) + "\n")
    f.write("Approach 3 (SVM) Accuracy: " + str(acc3) + "\n")

print("\nSaved classification_results.txt")


Saved classification_results.txt
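
As an alternative way to save and compare the results, the sketch below stores the three accuracies reported above in a dictionary, writes them in a loop, and records which approach scored highest. The dictionary `results`, the temporary output path, and the added "Best" line are assumptions for this example, not part of the original output file.

```python
import os
import tempfile

# Accuracies copied from the cell outputs above
results = {
    "Approach 1 (Naive Bayes)": 0.690084388185654,
    "Approach 2 (Logistic Regression)": 0.6971167369901548,
    "Approach 3 (SVM)": 0.673300515705579,
}

# Name of the approach with the highest accuracy
best_name = max(results, key=results.get)

# Write to a temporary directory so the real results file is not overwritten
out_path = os.path.join(tempfile.mkdtemp(), "classification_results.txt")
with open(out_path, "w") as f:
    for name, acc in results.items():
        f.write(f"{name} Accuracy: {acc:.4f}\n")
    f.write(f"Best: {best_name}\n")

with open(out_path) as f:
    lines = f.read().splitlines()

print(lines[-1])  # Best: Approach 2 (Logistic Regression)
```

The three scores fall within about 2.4 percentage points of each other, so the ranking could change with a different `random_state` in the train/test split.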
