In [3]:
# Necessary imports
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder
from scipy.stats import norm
from sklearn.model_selection import GridSearchCV

# Read the datasets
train = pd.read_csv("train.csv")
dev = pd.read_csv("dev.csv")
test = pd.read_csv("test.csv")

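# Remove rows where Label == "Apathetic" from train, dev, and test.
# .copy() returns independent DataFrames, so adding columns later
# does not trigger pandas' SettingWithCopyWarning.
train = train[train['Label'] != 'Apathetic'].copy()
dev = dev[dev['Label'] != 'Apathetic'].copy()
test = test[test['Label'] != 'Apathetic'].copy()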

# Convert text labels into integers (so the model can work with them)
label_encoder = LabelEncoder()
train['Encoded Label'] = label_encoder.fit_transform(train['Label'])

# Build the model
# Step 1: Turn the text into numbers (TF-IDF) so the model can use each term's weighted importance
pipeline = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(solver='lbfgs', max_iter=400, C=1.2, penalty='l2')
)

# Step 2: Train the multinomial logistic regression model on those numbers
pipeline.fit(train['Original Text'], train['Encoded Label'])

# Define the parameter combinations of C and max_iter to try
param_grid = {
    'logisticregression__C': [0.1, 0.5, 1.0, 1.2, 2.0],
    'logisticregression__max_iter': [100, 200, 400, 500]
}

# Use GridSearchCV with 5-fold cross-validation to find the best parameters
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(dev['Original Text'], label_encoder.transform(dev['Label']))

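# Update the model with the best parameters and refit on the training set;
# set_params alone does not retrain an already-fitted pipeline
pipeline.set_params(**grid_search.best_params_)
pipeline.fit(train['Original Text'], train['Encoded Label'])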

# Evaluate the model on the dev set
dev_predictions = pipeline.predict(dev['Original Text'])
# Convert the predicted numbers back into original label names
dev['Predicted Label'] = label_encoder.inverse_transform(dev_predictions)

# Calculate accuracy on the dev set
accuracy = (dev['Label'] == dev['Predicted Label']).mean()
print(f"Accuracy on dev set: {accuracy:.2f}")

# Merge train and dev to retrain the final model so it can learn from all available data
full_train = pd.concat([train, dev])

# Retrain model on full data
pipeline.fit(full_train['Original Text'], label_encoder.transform(full_train['Label']))

# Predict on the test set with the final model
test_predictions = pipeline.predict(test['Original Text'])
test['Predicted Label'] = label_encoder.inverse_transform(test_predictions)

# Calculate accuracy on the test set
accuracy_test = (test['Label'] == test['Predicted Label']).mean()
print(f"Accuracy on test set: {accuracy_test:.2f}")

# Calculate the standard error
n_test = len(test)
std_error = (accuracy_test * (1 - accuracy_test) / n_test) ** 0.5

# Calculate the 95% confidence interval
z_score = norm.ppf(0.975)  # 95% confidence
ci_lower = accuracy_test - z_score * std_error
ci_upper = accuracy_test + z_score * std_error


print(f"95% Confidence Interval for test set accuracy: ({ci_lower:.2f}, {ci_upper:.2f})")
#print(dev['Label'].value_counts())


Accuracy on dev set: 0.67
Accuracy on test set: 0.65
95% Confidence Interval for test set accuracy: (0.55, 0.74)


Analysis:

Our team created a baseline majority classifier, which always predicts the most frequent label. It achieved 44% accuracy on the dev set and 38% on the test set. This gives us a reference point for subsequent models: any model we build should beat the accuracy obtained by always predicting the most frequent label without looking at the text.
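
As a minimal sketch of how such a baseline can be computed (the helper name `majority_baseline_accuracy` and the toy labels are illustrative assumptions, not the team's actual code or data):

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, eval_labels):
    """Accuracy of always predicting the most frequent training label."""
    # The "model" is just the single most common label seen in training.
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    # Score it like any other classifier: fraction of exact matches.
    hits = sum(1 for label in eval_labels if label == majority_label)
    return majority_label, hits / len(eval_labels)

# Toy labels for illustration only (not the project's data).
toy_train = ["Angry", "Happy", "Angry", "Sad", "Angry"]
toy_eval = ["Angry", "Sad", "Angry", "Happy"]
label, acc = majority_baseline_accuracy(toy_train, toy_eval)
print(f"Always predict {label!r}: accuracy = {acc:.2f}")
```

scikit-learn's `DummyClassifier(strategy="most_frequent")` applies the same rule and could be swapped in where a full estimator interface is more convenient.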

After training, the TF-IDF + Logistic Regression model achieved the following accuracies (with hyperparameters tuned via cross-validation on the development set):

67% accuracy on the dev set
65% accuracy on the test set
A 95% confidence interval of (55%, 74%) on the test set

Before training the final model, we identified that one class ("Apathetic") had only a single data point. To avoid unstable cross-validation and unreliable model behavior, we removed this class from the train, dev, and test sets. This allowed for more robust training and improved the overall model performance.
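
The filtering step can be generalized to any class that is too small for the chosen number of folds. The sketch below is one way to do that; the helper `drop_rare_classes`, the `min_count` threshold, and the toy labels "Positive" and "Negative" are assumptions for illustration and do not appear in the notebook:

```python
import pandas as pd

def drop_rare_classes(df, label_col="Label", min_count=5):
    """Return a copy of df without classes that have fewer than min_count rows."""
    counts = df[label_col].value_counts()
    # Classes with too few rows cannot be represented in every CV fold.
    rare = counts[counts < min_count].index
    kept = df[~df[label_col].isin(rare)].copy()
    return kept, list(rare)

# Toy data for illustration only: two usable classes and one singleton class.
toy = pd.DataFrame({
    "Original Text": [f"example {i}" for i in range(11)],
    "Label": ["Positive"] * 5 + ["Negative"] * 5 + ["Apathetic"],
})
filtered, removed = drop_rare_classes(toy, min_count=5)
print("Removed classes:", removed)
print(filtered["Label"].value_counts().to_dict())
```

Setting `min_count` to the number of folds is a natural choice here, since a class with fewer rows than folds cannot appear in every fold of a stratified split.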

To tune hyperparameters, we performed 5-fold cross-validation on the dev set over a grid of C values (0.1, 0.5, 1.0, 1.2, 2.0) and max_iter values (100, 200, 400, 500). Averaging accuracy across folds gave more reliable parameter selection and helped us avoid overfitting to any particular fold of the data.
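
To see how each combination performed rather than only the winner, the cross-validation results can be tabulated. The following is a sketch under stated assumptions: the toy corpus, the reduced grid, and the variable names `search` and `table` are illustrative and are not the notebook's data or exact settings.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# Toy corpus for illustration only (not the project's data).
texts = ["great film", "loved it", "wonderful acting", "really great", "loved the plot",
         "awful film", "hated it", "terrible acting", "really awful", "hated the plot"]
labels = [1] * 5 + [0] * 5

pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(solver="lbfgs"))
param_grid = {
    "logisticregression__C": [0.1, 1.2, 2.0],
    "logisticregression__max_iter": [100, 400],
}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(texts, labels)

# One row per C, one column per max_iter; cells hold mean cross-validated accuracy.
results = pd.DataFrame(search.cv_results_)
table = results.pivot(
    index="param_logisticregression__C",
    columns="param_logisticregression__max_iter",
    values="mean_test_score",
)
print(table)
print("Best parameters:", search.best_params_)
```

Reading across a row shows whether max_iter matters at a fixed C, and reading down a column shows how accuracy responds to C.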

These results reflect a substantial improvement of 23 percentage points on the dev set and 27 percentage points on the test set compared to the majority baseline. Notably, even the lower bound of the confidence interval (55%) exceeds the majority classifier’s test accuracy (38%), indicating that the improvement on unseen data is unlikely to be due to chance and that the model is learning patterns from its training data.
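
The comparison between the interval's lower bound and the baseline can be checked directly. This sketch uses the same normal-approximation formula as the notebook, but swaps `scipy.stats.norm` for the standard library's `statistics.NormalDist` to stay dependency-free. The function name and the sample size `n=100` are placeholders; the notebook uses `len(test)`, which is not stated in the analysis.

```python
from statistics import NormalDist

def accuracy_confidence_interval(accuracy, n, confidence=0.95):
    """Normal-approximation (Wald) interval for an accuracy measured on n examples."""
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    std_error = (accuracy * (1 - accuracy) / n) ** 0.5
    return accuracy - z * std_error, accuracy + z * std_error

# n=100 is a placeholder test-set size, assumed for illustration.
lower, upper = accuracy_confidence_interval(0.65, n=100)
baseline_test_accuracy = 0.38  # majority baseline on the test set, from the analysis
print(f"95% CI: ({lower:.3f}, {upper:.3f})")
print("Lower bound beats baseline:", lower > baseline_test_accuracy)
```

Because the interval width shrinks with the square root of `n`, a larger test set would be needed to narrow the range much further.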

We can conclude from this that hyperparameter tuning, in particular setting C = 1.2, improves model performance. Cross-validation was essential in selecting this value, as values of C larger than 1.2 decreased model accuracy. Changing the number of iterations had no meaningful effect, so we lowered max_iter to 400, since our train set has fewer than 10k data points.

The difference in performance between the baseline and the model we trained shows that the model is not just guessing the most common answer: it is learning from the text and making more accurate predictions based on what it reads.

We note that there is a small decrease in accuracy from the dev set (67%) to the test set (65%). Since model hyperparameters were tuned on the development set, it is expected that performance on the unseen test set may be slightly lower. This two-point drop lies well within the test set's 95% confidence interval, which suggests that the model has generalized well and is not overfitting to the development data.