# REFERENCE: Baseline Logistic Regression Model

> This notebook is a **reference** for building a simple, interpretable baseline model.
> In your own CS3 notebook, you should:
> - Load the data
> - Create a stratified train/test split
> - Turn the tweet text into features (TF-IDF is used here)
> - Fit a Logistic Regression model
> - Compute accuracy, macro-F1, and a confusion matrix

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, classification_report

In [None]:
# Load data
df = pd.read_csv('processed/labeled_data_clean.csv')

label_map = {
    0: 'Hate speech',
    1: 'Offensive language',
    2: 'Neutral'
}
df['label_name'] = df['label'].map(label_map)

X = df['tweet_clean'] if 'tweet_clean' in df.columns else df['tweet']
y = df['label']

In [None]:
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

In [None]:
# TF-IDF vectorizer
tfidf = TfidfVectorizer(
    ngram_range=(1, 2),
    min_df=5,
    max_df=0.95
)

X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

In [None]:
# Logistic Regression baseline
# max_iter is raised above the default of 100 so the solver has room to converge
log_clf = LogisticRegression(max_iter=1000, n_jobs=-1)
log_clf.fit(X_train_tfidf, y_train)

In [None]:
# Predictions + metrics
y_pred = log_clf.predict(X_test_tfidf)

acc = accuracy_score(y_test, y_pred)
macro_f1 = f1_score(y_test, y_pred, average='macro')
# Pin the label order so rows and columns always line up with label_map
cm = confusion_matrix(y_test, y_pred, labels=sorted(label_map))

print('Accuracy:', acc)
print('Macro F1:', macro_f1)
print('\nClassification report:\n')
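# Pass labels and names in the same sorted order so each name matches its class id
print(classification_report(
    y_test, y_pred,
    labels=sorted(label_map),
    target_names=[label_map[i] for i in sorted(label_map)]
))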

In [None]:
# Confusion matrix as a DataFrame (rows = true class, columns = predicted class)
cm_df = pd.DataFrame(
    cm,
    index=[f'True {label_map[i]}' for i in sorted(label_map.keys())],
    columns=[f'Pred {label_map[i]}' for i in sorted(label_map.keys())]
)
cm_df

### Interpretation Notes (Reference)

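- Accuracy is the overall fraction of correct predictions, so it can look high when the model mostly gets the largest class right.
- Macro-F1 averages the per-class F1 scores with equal weight, so weak performance on a small class pulls it down.
- The confusion matrix shows **where** the model gets confused: each row is a true class, each column is a predicted class, and off-diagonal cells are errors.
- In your CS3 notebook, you should describe:
  - Which classes are hardest to predict?
  - Does the model confuse hate speech with offensive language? With neutral tweets?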
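
One way to answer these questions is to read per-class scores and the largest off-diagonal cell directly from a confusion matrix. The sketch below uses a small **made-up** matrix (`cm_toy`, with counts invented for illustration, not results from this notebook) and names such as `per_class` and `hardest_class` that are assumptions of this example. In your own notebook you would swap in the real `cm`.

```python
import numpy as np
import pandas as pd

# Hypothetical confusion matrix: rows = true class, columns = predicted class.
# The counts are invented for illustration only.
labels = ['Hate speech', 'Offensive language', 'Neutral']
cm_toy = np.array([
    [20,  70, 10],
    [10, 360, 30],
    [ 5,  15, 80],
])

# Accuracy: correct predictions (the diagonal) over all predictions.
accuracy = np.trace(cm_toy) / cm_toy.sum()

# Recall: diagonal divided by row totals (how many true items were found).
recall = np.diag(cm_toy) / cm_toy.sum(axis=1)
# Precision: diagonal divided by column totals (how many predictions were right).
precision = np.diag(cm_toy) / cm_toy.sum(axis=0)
f1 = 2 * precision * recall / (precision + recall)

# Macro-F1: unweighted mean of the per-class F1 scores.
macro_f1 = f1.mean()

per_class = pd.DataFrame(
    {'precision': precision, 'recall': recall, 'f1': f1},
    index=labels,
)

# The class with the lowest recall is the one the model misses most often.
hardest_class = per_class['recall'].idxmin()

# The largest off-diagonal cell is the most common single confusion.
off_diag = cm_toy.copy()
np.fill_diagonal(off_diag, 0)
true_idx, pred_idx = np.unravel_index(off_diag.argmax(), off_diag.shape)

print(per_class.round(3))
print('Accuracy:', round(accuracy, 3), '| Macro F1:', round(macro_f1, 3))
print('Hardest class by recall:', hardest_class)
print('Most common confusion:', labels[true_idx], '->', labels[pred_idx])
```

In this toy matrix the smallest class has the lowest recall, which drags macro-F1 below accuracy. That is the kind of gap worth calling out when you write up your own results.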