## ✅ Model Choice: Logistic Regression

We experimented extensively with a range of classifiers, including:

* Random Forest
* XGBoost
* CatBoost
* LightGBM
* Stacking & Voting Ensembles

We found that **Logistic Regression consistently delivered the best accuracy (≈0.969 in 5-fold cross-validation)** while being:

* Simple and interpretable ✅
* Fast to train and cross-validate ✅
* Stable across folds and feature variations ✅

More complex models such as XGBoost and CatBoost provided no measurable gain on the cleaned and engineered dataset.

**Conclusion:**
👉 Logistic Regression was selected as the final model due to its strong and stable performance, clarity, and efficiency.
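
A comparison of this kind can be sketched as follows. This is a minimal illustration, not the original experiment code: it uses a synthetic dataset from `make_classification` as a stand-in for the prepared data, and the `candidates` dictionary is a hypothetical subset of the models listed above (boosted models could be added to it in the same way). The scores it prints do not reflect the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the prepared dataset (assumption: not the real data).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=42
)

# Hypothetical candidate set for illustration.
candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Reuse one splitter so every model is scored on identical folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

results = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_demo, y_demo, cv=cv, scoring="accuracy")
    results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.4f} ± {scores.std():.4f}")
```

Sharing a single `StratifiedKFold` instance with a fixed `random_state` keeps the fold assignment identical across models, so differences in mean accuracy are not caused by different splits.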



In [1]:
import pandas as pd
import joblib
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.preprocessing import LabelEncoder

In [2]:
# Load processed dataset with engineered binary features
df_train = pd.read_csv('../data/prepared/train_fully_prepared.csv')

# Prepare features and target
X = df_train.drop(columns=["id", "Personality"])
y = LabelEncoder().fit_transform(df_train["Personality"])

# Define model
final_model = LogisticRegression(max_iter=1000)

# Cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(final_model, X, y, cv=cv, scoring="accuracy")
print(f"Logistic Regression CV: {scores.mean():.4f} ± {scores.std():.4f}")

# Retrain on full data
final_model.fit(X, y)

# Save model
joblib.dump(final_model, "model.pkl")


Logistic Regression CV: 0.9689 ± 0.0028


['model.pkl']
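
Since interpretability is one of the stated reasons for choosing Logistic Regression, the fitted coefficients can be inspected directly. The sketch below is an illustration under assumptions: it trains on synthetic data with hypothetical feature names (`feature_0` … `feature_4`) rather than the real columns, and `demo_model` is a stand-in for `final_model`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in with hypothetical feature names (not the real columns).
X_arr, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X_arr.shape[1])]
X_demo = pd.DataFrame(X_arr, columns=feature_names)

demo_model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# For a binary problem coef_ has shape (1, n_features). Each value is the change
# in log-odds of the class encoded as 1 per unit increase in that feature.
coefficients = (
    pd.Series(demo_model.coef_[0], index=feature_names)
    .sort_values(key=abs, ascending=False)
)
print(coefficients)
```

Applied to the notebook's model, the same pattern would read `final_model.coef_[0]` with `X.columns` as the index. Coefficient magnitudes are only comparable across features that share a scale, which holds for binary features but not necessarily for others.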

In [3]:
# Load processed test data
df_test = pd.read_csv("../data/prepared/test_fully_prepared.csv")
X_test = df_test.drop(columns=["id"])

# Load trained model
model = joblib.load("model.pkl")

# Predict
preds = model.predict(X_test)

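# Convert back to labels. LabelEncoder sorts classes alphabetically, so fitting
# on the same two label strings reproduces the training mapping
# (Extrovert -> 0, Introvert -> 1), provided training saw exactly these labels.
label_encoder = LabelEncoder()
label_encoder.fit(["Extrovert", "Introvert"])
pred_labels = label_encoder.inverse_transform(preds)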

# Prepare submission
submission = pd.DataFrame({
    "id": df_test["id"],
    "Personality": pred_labels
})

# Save
submission.to_csv("submission.csv", index=False)
print("✅ submission.csv saved.")

✅ submission.csv saved.
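
Refitting a `LabelEncoder` at prediction time works only while the label strings are known and unchanged. One possible alternative, sketched here under assumptions, is to persist the fitted encoder together with the model. The toy data, the `model_bundle.pkl` file name, and the dictionary keys are all hypothetical and not part of the notebook above.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the prepared training data (hypothetical values).
X_toy = np.array([[0.0], [0.2], [0.9], [1.0]])
labels_toy = np.array(["Introvert", "Introvert", "Extrovert", "Extrovert"])

# Keep a reference to the fitted encoder instead of discarding it.
encoder = LabelEncoder()
y_toy = encoder.fit_transform(labels_toy)  # classes_ is sorted: Extrovert=0, Introvert=1

toy_model = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

# Save model and encoder in one file (hypothetical name, written to a temp dir).
bundle_path = os.path.join(tempfile.mkdtemp(), "model_bundle.pkl")
joblib.dump({"model": toy_model, "label_encoder": encoder}, bundle_path)

# At prediction time, load both and decode with the original mapping.
bundle = joblib.load(bundle_path)
decoded = bundle["label_encoder"].inverse_transform(bundle["model"].predict(X_toy))
print(decoded)
```

With this layout the prediction cell no longer needs to know the label strings, so a renamed or added class cannot silently shift the integer mapping.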
