### 🧠 Hybrid Model: Full + Binarized Features
 
This notebook trains models using the full set of original features
from `train_processed.csv` plus the engineered binary flags.

The goal is to test whether combining raw numeric features with strong
binary signals leads to higher accuracy.


In [1]:
# Imports
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.preprocessing import LabelEncoder

# Load processed dataset
df = pd.read_csv('../data/train_processed.csv')

In [2]:
# Add engineered binary flags
# Each entry: (source column, threshold condition, new flag column name)
binary_flags = [
    ('num__Time_spent_Alone', lambda x: x > 4, 'Time_alone_heavy'),
    ('num__Going_outside', lambda x: x < 3, 'Going_outside_rare'),
    ('num__Post_frequency', lambda x: x < 3, 'Post_low'),
]

for col, cond, name in binary_flags:
    df[name] = cond(df[col]).astype(int)

# Prepare feature matrix and integer-encoded target
X = df.drop(columns=['id', 'Personality'])
y = LabelEncoder().fit_transform(df['Personality'])

# Define models
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(random_state=42),
    'XGBoost': XGBClassifier(eval_metric='logloss', random_state=42)
}

# Cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')
    print(f"{name}: {scores.mean():.4f} ± {scores.std():.4f}")

Logistic Regression: 0.9688 ± 0.0020
Random Forest: 0.9677 ± 0.0026
XGBoost: 0.9676 ± 0.0022


### 🧪 Hybrid Model with Selected Binary Features

We added three highly asymmetric binary features to the full dataset:

- `Time_alone_heavy`: `num__Time_spent_Alone > 4`
- `Going_outside_rare`: `num__Going_outside < 3`
- `Post_low`: `num__Post_frequency < 3`

These features capture strong introvert signals.
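
One way to check how asymmetric a flag is would be to compare the class shares on each side of it. The sketch below does this for `Time_alone_heavy` on a small made-up table; the toy values and the `class_share` name are illustrative assumptions, not taken from `train_processed.csv`.

```python
import pandas as pd

# Toy data standing in for the processed training set (values are made up).
toy = pd.DataFrame({
    "num__Time_spent_Alone": [1, 2, 3, 5, 6, 7, 8, 2],
    "Personality": ["Extrovert", "Extrovert", "Extrovert", "Introvert",
                    "Introvert", "Introvert", "Extrovert", "Introvert"],
})

# Same rule as the notebook's Time_alone_heavy flag.
toy["Time_alone_heavy"] = (toy["num__Time_spent_Alone"] > 4).astype(int)

# Row-normalised crosstab: share of each class within flag == 0 and flag == 1.
class_share = pd.crosstab(
    toy["Time_alone_heavy"], toy["Personality"], normalize="index"
)
print(class_share)
```

A flag is a strong signal when the two rows of this table differ sharply, for example when almost every row with the flag set belongs to one class.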

### 🔍 Results (Full features + binary):

- Logistic Regression: 0.9688 ± 0.0020  
- Random Forest: 0.9677 ± 0.0026  
- XGBoost: 0.9676 ± 0.0022  
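
These scores only cover the hybrid feature set. To judge whether the flags help, the same models would need to be scored with and without them on identical folds. Below is a minimal sketch of that paired comparison, assuming synthetic data in place of the real file; `demo`, `target`, `X_base`, and `X_hybrid` are hypothetical names, and only logistic regression is shown to keep it short.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the processed data (hypothetical values).
rng = np.random.default_rng(42)
n = 400
demo = pd.DataFrame({
    "num__Time_spent_Alone": rng.uniform(0, 10, n),
    "num__Going_outside": rng.uniform(0, 7, n),
})
# Label depends on the raw features plus noise.
target = ((demo["num__Time_spent_Alone"] - demo["num__Going_outside"]
           + rng.normal(0, 1.5, n)) > 1).astype(int)

# Baseline: raw features only. Hybrid: raw features plus threshold flags.
X_base = demo.copy()
X_hybrid = demo.copy()
X_hybrid["Time_alone_heavy"] = (demo["num__Time_spent_Alone"] > 4).astype(int)
X_hybrid["Going_outside_rare"] = (demo["num__Going_outside"] < 3).astype(int)

# A fixed random_state gives both runs the same folds, so scores pair up.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = LogisticRegression(max_iter=1000)
base_scores = cross_val_score(model, X_base, target, cv=cv, scoring="accuracy")
hybrid_scores = cross_val_score(model, X_hybrid, target, cv=cv, scoring="accuracy")

# Per-fold differences are more informative than comparing the two means.
diff = hybrid_scores - base_scores
print(f"baseline: {base_scores.mean():.4f} ± {base_scores.std():.4f}")
print(f"hybrid:   {hybrid_scores.mean():.4f} ± {hybrid_scores.std():.4f}")
print(f"per-fold difference: {diff.mean():+.4f} ± {diff.std():.4f}")
```

If the mean per-fold difference is small relative to its spread, the flags are not adding measurable signal for that model.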

### 📌 Conclusion:

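Adding these binary flags did not significantly improve performance.  
The models most likely already capture the same patterns from the original numeric features, since each flag is just a threshold on one of those features. All three models land within about 0.001 of each other (0.9676 to 0.9688), which is smaller than the fold-to-fold standard deviation.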
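
One plausible reason, at least for the tree-based models, is that a tree can learn a threshold on a numeric column by itself. The sketch below illustrates this with a depth-1 decision tree on synthetic data where the label is exactly the `> 4` rule; the data and the `stump` and `learned_threshold` names are assumptions for illustration only.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic feature; the label is exactly the "> 4" rule used for the flag.
rng = np.random.default_rng(0)
time_alone = rng.uniform(0, 10, 500)
label = (time_alone > 4).astype(int)

# A depth-1 tree (a "stump") is allowed a single split on the raw feature.
stump = DecisionTreeClassifier(max_depth=1, random_state=0)
stump.fit(time_alone.reshape(-1, 1), label)

# The root node's split value should sit close to the hand-picked cutoff.
learned_threshold = stump.tree_.threshold[0]
train_accuracy = stump.score(time_alone.reshape(-1, 1), label)
print(f"learned split: {learned_threshold:.2f}, train accuracy: {train_accuracy:.3f}")
```

When the raw column is already in the feature matrix, a precomputed flag at the same cutoff gives such a model little it could not find itself.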
