### 🔍 Modeling with Binarized Features
This notebook trains models on binary features engineered by thresholding numeric columns, keeping the flags that separate Introverts from Extroverts most strongly.
The goal is to test whether these simple features can match or improve on the performance of the earlier full-feature models.

In [1]:
# Imports
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.preprocessing import LabelEncoder

# Load the preprocessed training data (the binary flags are derived below)
df = pd.read_csv('../data/train_processed.csv')


In [2]:
# Numerical columns to scan for thresholds
num_cols = ['num__Time_spent_Alone', 'num__Going_outside', 'num__Post_frequency',
            'num__Friends_circle_size', 'num__Social_event_attendance']

thresholds = range(1, 9)


def test_threshold_condition(df, col, threshold):
    """Evaluate `col < threshold` and `col > threshold` as binary flags.

    Returns the flags that fire for at least 55% of one class and at most
    5% of the other. The input DataFrame is not modified.
    """
    results = []
    for direction, symbol in [('lt', '<'), ('gt', '>')]:
        flag_name = f"{col}_{direction}_{threshold}"
        condition = df[col] < threshold if direction == 'lt' else df[col] > threshold

        # Share of each class for which the flag equals 1
        group_means = condition.astype(int).groupby(df['Personality']).mean()
        mean_intro = group_means['Introvert']
        mean_extro = group_means['Extrovert']

        if (mean_intro >= 0.55 and mean_extro <= 0.05) or (mean_extro >= 0.55 and mean_intro <= 0.05):
            results.append((flag_name, col, symbol, threshold, mean_intro, mean_extro))
    return results


def discover_strong_binary_flags(df, numeric_columns, thresholds):
    """
    Generate binary features based on thresholds and return those with strong asymmetry.
    """
    selected = []
    for col in numeric_columns:
        for threshold in thresholds:
            selected.extend(test_threshold_condition(df, col, threshold))
    return pd.DataFrame(selected, columns=[
        'feature', 'original_col', 'direction', 'threshold',
        'mean_introvert', 'mean_extrovert'
    ])

# Run discovery
filtered_df = discover_strong_binary_flags(df, num_cols, thresholds)

# Display results
filtered_df.head(20)


| | feature | original_col | direction | threshold | mean_introvert | mean_extrovert |
|---|---|---|---|---|---|---|
| 0 | num__Time_spent_Alone_gt_4 | num__Time_spent_Alone | > | 4 | 0.771606 | 0.021899 |
| 1 | num__Time_spent_Alone_gt_5 | num__Time_spent_Alone | > | 5 | 0.652435 | 0.017666 |
| 2 | num__Going_outside_lt_3 | num__Going_outside | < | 3 | 0.729948 | 0.015768 |
| 3 | num__Post_frequency_lt_3 | num__Post_frequency | < | 3 | 0.73285 | 0.017593 |
| 4 | num__Post_frequency_gt_5 | num__Post_frequency | > | 5 | 0.041451 | 0.551062 |
| 5 | num__Friends_circle_size_gt_8 | num__Friends_circle_size | > | 8 | 0.045596 | 0.573546 |
| 6 | num__Social_event_attendance_lt_3 | num__Social_event_attendance | < | 3 | 0.601451 | 0.014162 |
| 7 | num__Social_event_attendance_gt_5 | num__Social_event_attendance | > | 5 | 0.04228 | 0.604643 |


In [3]:
# Recreate three of the discovered flags by hand, using the same thresholds as the scan above
binary_flags = [
    ('num__Time_spent_Alone', lambda x: x > 4, 'Time_alone_heavy'),
    ('num__Going_outside', lambda x: x < 3, 'Going_outside_rare'),
    ('num__Post_frequency', lambda x: x < 3, 'num__Post_low'),
]

for col, condition, new_name in binary_flags:
    df[new_name] = condition(df[col]).astype(int)


# Prepare data for modeling
X = df[[f[2] for f in binary_flags]]
y = LabelEncoder().fit_transform(df['Personality'])

# Define models
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(random_state=42),
    'XGBoost': XGBClassifier(eval_metric='logloss', random_state=42)
}

# Cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')
    print(f"{name}: {scores.mean():.4f} ± {scores.std():.4f}")

Logistic Regression: 0.9473 ± 0.0033
Random Forest: 0.9613 ± 0.0027
XGBoost: 0.9613 ± 0.0027
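
When two feature sets are compared, it helps to score both on identical folds so that differences are not caused by the split. The following is a minimal sketch on synthetic data, assuming a single made-up numeric column and one threshold; the names `X_raw`, `X_bin`, `y_demo`, and `cv_demo` are hypothetical and the resulting numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic data for illustration only (not the notebook's dataset)
rng = np.random.default_rng(0)
n = 400
y_demo = rng.integers(0, 2, size=n)
# A 0-10 score that tends to be higher for class 0
raw = np.clip(rng.normal(loc=np.where(y_demo == 0, 7, 2), scale=2.0), 0, 10)

X_raw = pd.DataFrame({"time_alone": raw})
X_bin = pd.DataFrame({"time_alone_gt_4": (raw > 4).astype(int)})

# A fixed random_state makes the splitter produce the same folds for both feature sets
cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

demo_scores = {}
for label, features in {"raw": X_raw, "binarized": X_bin}.items():
    demo_scores[label] = cross_val_score(
        LogisticRegression(max_iter=1000), features, y_demo,
        cv=cv_demo, scoring="accuracy",
    )
    print(f"{label}: {demo_scores[label].mean():.4f} ± {demo_scores[label].std():.4f}")
```

Because the folds match, the per-fold scores can also be compared pairwise rather than only through their means.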


### 📊 Final Evaluation Summary

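We trained models using only three of the asymmetric binary features.
Compared to the original full-feature models, accuracy dropped for all three models, by about 2.2 percentage points for Logistic Regression and about 0.6 points for Random Forest and XGBoost: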
 
**Original (full feature set):**
- Logistic Regression: 0.9690 ± 0.0017
- Random Forest: 0.9678 ± 0.0020
- XGBoost: 0.9676 ± 0.0022
 
**With asymmetric binary features only:**
- Logistic Regression: 0.9473 ± 0.0033
- Random Forest: 0.9613 ± 0.0027
- XGBoost: 0.9613 ± 0.0027
 
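These hand-crafted binary features are highly specific, but they are too few and too coarse to capture the full diversity of patterns in the data. Three binary columns allow at most eight distinct input combinations, which is consistent with Random Forest and XGBoost reaching identical scores.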
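
One way to see how much information a handful of binary flags can carry is to count the distinct flag combinations and check how mixed the labels are within each one. The sketch below does this on an invented toy frame (`toy`, `flag_a`, `flag_b`, and `label` are hypothetical names). The ceiling it computes is in-sample, so it is an optimistic bound and not a cross-validated score.

```python
import numpy as np
import pandas as pd

# Toy data invented for illustration; not the notebook's dataset
toy = pd.DataFrame({
    "flag_a": [1, 1, 1, 0, 0, 0, 0, 1],
    "flag_b": [1, 1, 0, 0, 0, 0, 1, 1],
    "label":  [1, 1, 1, 0, 0, 1, 0, 0],
})
flag_cols = ["flag_a", "flag_b"]

# One row per distinct flag combination: sample count and number of positives
per_pattern = toy.groupby(flag_cols)["label"].agg(n="count", positives="sum")

# Rows with identical flags are indistinguishable to any model, so the best
# it can do for a pattern is to predict that pattern's majority class
per_pattern["best_correct"] = np.maximum(
    per_pattern["positives"], per_pattern["n"] - per_pattern["positives"]
)

n_patterns = len(per_pattern)
in_sample_ceiling = per_pattern["best_correct"].sum() / len(toy)
print(per_pattern)
print(f"{n_patterns} patterns, in-sample accuracy ceiling: {in_sample_ceiling:.2f}")
```

With k binary flags there are at most 2^k patterns, so a flexible model has little room to outperform a simple lookup of the majority class per pattern.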
