## Logistic Regression Model

In this notebook, we create and train a Logistic Regression model as a baseline for our BERT-based predictor. We compare the two models on accuracy, F1-score, and per-class precision.

### Dataset Preparation

Here we import all of the necessary libraries.

In [9]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import LabelEncoder, StandardScaler
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import GridSearchCV
from collections import Counter

To start, we load the preprocessed dataset, encode the verdict labels as integers, and split the data into training and test sets.

In [None]:
# Load the dataset
df = pd.read_csv("../Data/aita_posts_preprocessed.csv")

# Initialize a label encoder and map each 'verdict' string to an integer class label
le = LabelEncoder()
df['verdict_encoded'] = le.fit_transform(df['verdict'])

# Split data
X = df['combined_text']
y = df['verdict_encoded']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
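
`LabelEncoder` maps each distinct verdict string to an integer code in sorted order; it does not produce one-hot vectors. The following minimal sketch uses a made-up list of verdicts (hypothetical, not drawn from the dataset) to show that mapping. It also shows how `stratify` could be passed to `train_test_split` to keep class proportions similar in both splits, which the split above does not do.

```python
from collections import Counter

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy labels, for illustration only
verdicts = ["not the asshole"] * 6 + ["asshole"] * 4

le_demo = LabelEncoder()
codes = le_demo.fit_transform(verdicts)

# Classes are stored in sorted order; each string maps to its index
mapping = {label: int(code) for code, label in enumerate(le_demo.classes_)}
print(mapping)

# Optional: a stratified split keeps the class ratio in train and test
train_codes, test_codes = train_test_split(
    codes, test_size=0.5, random_state=42, stratify=codes
)
print(Counter(train_codes.tolist()), Counter(test_codes.tolist()))
```

The integer codes can be turned back into verdict strings with `le_demo.inverse_transform(codes)`.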

Next, we vectorize each Reddit post using TF-IDF. We then apply SMOTE to counteract class imbalance in the training data. Lastly, we scale the features so that the model converges faster and features with larger magnitudes do not dominate training.

In [None]:
# Vectorize text with TF-IDF
vectorizer = TfidfVectorizer(max_features=10000, ngram_range=(1, 3), stop_words='english')
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Apply SMOTE to balance classes
smote = SMOTE(random_state=42)
X_train_tfidf_smote, y_train_smote = smote.fit_resample(X_train_tfidf, y_train)
print("Training class distribution after SMOTE:", Counter(y_train_smote))

# Scale features
scaler = StandardScaler(with_mean=False)  # For sparse matrices
X_train_tfidf_smote_scaled = scaler.fit_transform(X_train_tfidf_smote)
X_test_tfidf_scaled = scaler.transform(X_test_tfidf)

Training class distribution after SMOTE: Counter({3: 5424, 0: 5424, 1: 5424, 2: 5424})


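After data preparation, we tune the model with a grid search over the regularization strength `C` and the solver, using 5-fold cross-validation scored by macro F1, and report the best parameters.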

In [11]:
# Tune model with GridSearchCV
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000], 'solver': ['lbfgs', 'liblinear']}
grid = GridSearchCV(LogisticRegression(class_weight='balanced', max_iter=5000, tol=0.001),  # Increased max_iter, tol
                    param_grid, cv=5, scoring='f1_macro')
grid.fit(X_train_tfidf_smote_scaled, y_train_smote)
print("Best parameters:", grid.best_params_)
print("Best cross-validation F1-macro score:", grid.best_score_)

Best parameters: {'C': 100, 'solver': 'lbfgs'}
Best cross-validation F1-macro score: 0.9219754085549277


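Now we retrieve the best estimator from the grid search and refit it on the SMOTE-balanced training set. Note that this refit uses the unscaled TF-IDF features, whereas the grid search was run on the scaled ones.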

In [12]:
# Train model with best parameters
model = grid.best_estimator_
model.fit(X_train_tfidf_smote, y_train_smote)

After training the model, we run it on the held-out test set (using the unscaled TF-IDF features, matching the refit) and compute the test metrics.

In [None]:
# Evaluate model
y_pred = model.predict(X_test_tfidf)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Prediction distribution:", Counter(y_pred))
print(classification_report(y_test, y_pred, target_names=le.classes_))

Accuracy: 0.544
Prediction distribution: Counter({np.int64(3): 1406, np.int64(0): 465, np.int64(2): 72, np.int64(1): 57})
                  precision    recall  f1-score   support

         asshole       0.26      0.27      0.26       447
  everyone sucks       0.09      0.05      0.06       102
no assholes here       0.07      0.05      0.05       111
 not the asshole       0.68      0.72      0.70      1340

        accuracy                           0.54      2000
       macro avg       0.27      0.27      0.27      2000
    weighted avg       0.52      0.54      0.53      2000

Top features for asshole: [(np.float64(11.898845336043651), 'edit'), (np.float64(9.015123299032348), 'generally'), (np.float64(8.979314587484486), 'wibta'), (np.float64(8.14491262413867), 'wouldn'), (np.float64(7.8954154023753365), 'gay'), (np.float64(7.831789179331353), 'fit'), (np.float64(7.754107082748809), 'slowly'), (np.float64(7.6861639602708465), 'hoping'), (np.float64(7.489347113492271), 'attend'), (