 ### 🤖 Baseline Model Comparison
 
 This notebook compares three machine learning models on the preprocessed dataset to identify the best-performing baseline:
 
 ### Models Evaluated:
 - **Logistic Regression**: Fast and interpretable linear model.
 - **Random Forest**: Ensemble of decision trees, robust to noise and non-linear patterns.
 - **XGBoost**: Gradient-boosted decision trees, a common strong performer on tabular data.
 
 ### Evaluation Strategy:
 - **Metric**: Accuracy
 - **Validation**: 5-fold Stratified Cross-Validation
 - **Output**: Mean ± standard deviation of accuracy for each model
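
To illustrate what the stratified part of the validation scheme does, here is a minimal sketch on a synthetic, imbalanced label vector. The names `X_demo`, `y_demo`, and `cv_demo` are illustrative assumptions and are not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic, illustrative labels: 80 samples of class 0, 20 of class 1
# (an assumption for demonstration, not the notebook's dataset)
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.zeros((len(y_demo), 1))  # feature values do not affect how folds are split

cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Share of class 1 in each validation fold
fold_ratios = [
    float(y_demo[val_idx].mean())
    for _, val_idx in cv_demo.split(X_demo, y_demo)
]
print(fold_ratios)
```

Each validation fold keeps the class ratio of the full set, so per-fold accuracy is not distorted by a fold that happens to contain very few samples of one class.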
 
 This step helps establish a strong baseline and informs which model is most promising for further tuning.
 

In [1]:
# Imports
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

# Load processed training data
df = pd.read_csv('../data/train_processed.csv')
X = df.drop(columns=['id', 'Personality'])
# Encode the string target labels as integers (XGBoost expects numeric classes)
y = LabelEncoder().fit_transform(df['Personality'])

# Define models
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(random_state=42),
    'XGBoost': XGBClassifier(eval_metric='logloss', random_state=42)
}

# Cross-validation evaluation: the same shuffled, stratified folds for every model
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')
    print(f"{name}: {scores.mean():.4f} ± {scores.std():.4f}")

Logistic Regression: 0.9690 ± 0.0017
Random Forest: 0.9678 ± 0.0020
XGBoost: 0.9676 ± 0.0022
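
The three means above differ by less than one standard deviation, so the ranking alone may not separate the models. One way to look closer is a paired, fold-by-fold comparison. The sketch below is only an illustration under stated assumptions. It uses synthetic data from `make_classification` instead of `train_processed.csv`, includes only two of the three models, and the names `X_demo`, `y_demo`, `demo_models`, and `fold_scores` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (assumption: not the notebook's processed training set)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

demo_models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(n_estimators=50, random_state=42),
}

# A fixed random_state makes the splitter yield identical folds on every call,
# so fold i contains the same rows for each model
cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=cv_demo, scoring='accuracy')
    for name, model in demo_models.items()
}

# Per-fold accuracy difference between the two models
diff = fold_scores['Logistic Regression'] - fold_scores['Random Forest']
print(f"Mean difference: {diff.mean():+.4f} ± {diff.std():.4f}")
print(f"Folds where Logistic Regression wins: {(diff > 0).sum()} of {len(diff)}")
```

If the per-fold differences change sign from fold to fold, the gap between the models is probably within noise, and other criteria such as training cost or interpretability can reasonably decide which baseline to tune first.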
