### 🧪 Feature Engineering & Data Cleaning

In this notebook, we focus on enhancing model performance by improving the quality and expressiveness of our data.
 
#### Goals:
- Create meaningful new features that better separate introverts and extroverts
- Identify and remove noisy or low-quality samples that may hurt model learning

---
 
#### Plan:

1. **Analyze and remove low-quality samples**  
    Drop rows with excessive missing values or inconsistent behavior.
 
2. **Generate interaction features and behavior ratios**  
    Create composite signals that combine social traits.

3. **Add new binary flags**  
   Incorporate asymmetric thresholds discovered earlier.

4. **Evaluate models after cleaning & engineering**
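
Step 2 of the plan mentions interaction features and behavior ratios. As a rough illustration only, the sketch below builds one ratio and one interaction on a toy frame. The feature names `alone_to_social_ratio` and `fear_and_drained` are hypothetical and are not created anywhere in this notebook, and the sketch assumes the numeric columns are on their original, unscaled scale.

```python
import pandas as pd

def add_ratio_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with two illustrative composite features (hypothetical names)."""
    out = df.copy()
    # Ratio: time alone relative to social events; the +1 avoids division by zero.
    out["alone_to_social_ratio"] = (
        out["num__Time_spent_Alone"] / (out["num__Social_event_attendance"] + 1)
    )
    # Interaction: 1 only when both binary traits are present.
    out["fear_and_drained"] = (
        out["bin__Stage_fear"] * out["bin__Drained_after_socializing"]
    )
    return out

# Toy rows, made up for the example
toy = pd.DataFrame({
    "num__Time_spent_Alone": [0, 6, 4],
    "num__Social_event_attendance": [6, 1, 2],
    "bin__Stage_fear": [0, 1, 1],
    "bin__Drained_after_socializing": [0, 0, 1],
})
features = add_ratio_features(toy)
print(features[["alone_to_social_ratio", "fear_and_drained"]])
```

Working on a copy leaves the input frame untouched, so the same function could be applied to both the train and test frames.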

In [1]:
import pandas as pd

# Load processed dataset
df = pd.read_csv("../data/train_processed.csv")

# Select columns that follow the pattern: num__/bin__missingindicator_
missing_cols = [
    col for col in df.columns
    if ('missingindicator_' in col and (col.startswith('num__') or col.startswith('bin__')))
]

# Count how many missing indicators are 1 per row
missing_count = df[missing_cols].sum(axis=1)

# Choose threshold manually
threshold = 2  # change to 1, 2, 3, etc.

# Identify and drop rows whose missing-indicator count is at or above the threshold
to_drop = missing_count >= threshold
print(f"Dropping {to_drop.sum()} rows out of {len(df)} ({to_drop.mean():.2%})")

# Cleaned DataFrame
df_cleaned = df[~to_drop].reset_index(drop=True)




Dropping 819 rows out of 18524 (4.42%)
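
Because the threshold is chosen by hand, it can help to see how much data each candidate value would remove before committing to one. The sketch below is a minimal illustration on a made-up frame. The helper name `drop_share_by_threshold` and the toy column names are assumptions, not part of the notebook.

```python
import pandas as pd

def drop_share_by_threshold(df, indicator_cols, thresholds=(1, 2, 3)):
    """Share of rows whose missing-indicator count is >= each threshold."""
    missing_count = df[indicator_cols].sum(axis=1)
    return pd.Series(
        {t: (missing_count >= t).mean() for t in thresholds},
        name="share_dropped",
    )

# Toy indicator columns (hypothetical names following the num__/bin__ pattern)
toy = pd.DataFrame({
    "num__missingindicator_a": [0, 1, 1, 0],
    "bin__missingindicator_b": [0, 0, 1, 1],
    "num__missingindicator_c": [0, 0, 1, 0],
})
shares = drop_share_by_threshold(toy, list(toy.columns))
print(shares)
```

On the real data, the same call with `missing_cols` would give the drop percentage for each threshold in one table.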


In [2]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

def show_feature_importance(X, y, top_n=10):
    model = RandomForestClassifier(random_state=42)
    model.fit(X, y)
    imp = pd.DataFrame({
        'feature': X.columns,
        'importance': model.feature_importances_
    }).sort_values('importance', ascending=False)
    return imp.head(top_n)


# Define features and labels
X = df_cleaned.drop(columns=['id', 'Personality'])
y = LabelEncoder().fit_transform(df_cleaned['Personality'])

show_feature_importance(X, y)


| | feature | importance |
|---|---|---|
| 11 | bin__Drained_after_socializing | 0.218304 |
| 0 | num__Time_spent_Alone | 0.211252 |
| 10 | bin__Stage_fear | 0.154385 |
| 1 | num__Social_event_attendance | 0.130804 |
| 4 | num__Post_frequency | 0.108403 |
| 2 | num__Going_outside | 0.089741 |
| 3 | num__Friends_circle_size | 0.073878 |
| 13 | bin__missingindicator_Drained_after_socializing | 0.004062 |
| 12 | bin__missingindicator_Stage_fear | 0.003576 |
| 6 | num__missingindicator_Social_event_attendance | 0.001367 |


In [3]:
# Drop all missingindicator features from the cleaned dataframe
df_cleaned = df_cleaned.drop(columns=[
    col for col in df_cleaned.columns if 'missingindicator_' in col
])

# Define features and labels
X = df_cleaned.drop(columns=['id', 'Personality'])
y = LabelEncoder().fit_transform(df_cleaned['Personality'])

show_feature_importance(X, y)

| | feature | importance |
|---|---|---|
| 6 | bin__Drained_after_socializing | 0.242412 |
| 5 | bin__Stage_fear | 0.209012 |
| 0 | num__Time_spent_Alone | 0.186046 |
| 2 | num__Going_outside | 0.118128 |
| 4 | num__Post_frequency | 0.115666 |
| 1 | num__Social_event_attendance | 0.083114 |
| 3 | num__Friends_circle_size | 0.045622 |


In [4]:
# Combined introvert score (0-5): one point per threshold rule met (thresholds from modeling_binary.ipynb)
df_cleaned['Introvert_score'] = (
    (df_cleaned['num__Time_spent_Alone'] > 4).astype(int) +
    (df_cleaned['num__Post_frequency'] < 3).astype(int) +
    (df_cleaned['num__Going_outside'] < 3).astype(int) +
    (df_cleaned['num__Social_event_attendance'] < 3).astype(int) +
    (df_cleaned['num__Friends_circle_size'] < 8).astype(int)
)
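
The same five threshold rules are applied again to the test set later in the notebook. One way to keep the two copies in sync is to put the rules in a single helper. The sketch below is an illustration only: the names `INTROVERT_RULES` and `introvert_score` are hypothetical, and the toy rows are taken from the `df_cleaned.head()` output shown in this notebook.

```python
import pandas as pd

# (column, comparison, cutoff) triples copied from the cell above
INTROVERT_RULES = [
    ("num__Time_spent_Alone", "gt", 4),
    ("num__Post_frequency", "lt", 3),
    ("num__Going_outside", "lt", 3),
    ("num__Social_event_attendance", "lt", 3),
    ("num__Friends_circle_size", "lt", 8),
]

def introvert_score(df: pd.DataFrame) -> pd.Series:
    """Count how many of the threshold rules each row satisfies (0 to 5)."""
    score = pd.Series(0, index=df.index)
    for col, op, cutoff in INTROVERT_RULES:
        hit = df[col] > cutoff if op == "gt" else df[col] < cutoff
        score = score + hit.astype(int)
    return score

# Rows with id 0, 2, 7 and 8 from the head() output
toy = pd.DataFrame({
    "num__Time_spent_Alone": [0, 6, 2, 4],
    "num__Social_event_attendance": [6, 1, 8, 2],
    "num__Going_outside": [4, 0, 3, 1],
    "num__Friends_circle_size": [15, 3, 4, 0],
    "num__Post_frequency": [5, 0, 5, 2],
})
print(introvert_score(toy).tolist())  # [0, 5, 1, 4]
```

With such a helper, both frames could be scored with `df['Introvert_score'] = introvert_score(df)`, so a changed cutoff only needs editing in one place.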



In [5]:
df_cleaned.head(n=20)

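| | id | num__Time_spent_Alone | num__Social_event_attendance | num__Going_outside | num__Friends_circle_size | num__Post_frequency | bin__Stage_fear | bin__Drained_after_socializing | Personality | Introvert_score |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 6 | 4 | 15 | 5 | 0 | 0 | Extrovert | 0 |
| 1 | 1 | 1 | 7 | 3 | 10 | 8 | 0 | 0 | Extrovert | 0 |
| 2 | 2 | 6 | 1 | 0 | 3 | 0 | 1 | 0 | Introvert | 5 |
| 3 | 3 | 3 | 7 | 3 | 11 | 5 | 0 | 0 | Extrovert | 0 |
| 4 | 4 | 1 | 4 | 4 | 13 | 5 | 0 | 0 | Extrovert | 0 |
| 5 | 5 | 2 | 8 | 5 | 8 | 3 | 0 | 0 | Extrovert | 0 |
| 6 | 7 | 2 | 8 | 3 | 4 | 5 | 0 | 0 | Extrovert | 1 |
| 7 | 8 | 4 | 2 | 1 | 0 | 2 | 1 | 0 | Introvert | 4 |
| 8 | 9 | 1 | 8 | 6 | 14 | 9 | 0 | 0 | Extrovert | 0 |
| 9 | 10 | 3 | 7 | 4 | 5 | 10 | 0 | 0 | Extrovert | 1 |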


In [6]:
X = df_cleaned.drop(columns=['id', 'Personality'])
y = LabelEncoder().fit_transform(df_cleaned['Personality'])

show_feature_importance(X, y, top_n=15)


| | feature | importance |
|---|---|---|
| 7 | Introvert_score | 0.220069 |
| 5 | bin__Stage_fear | 0.189653 |
| 0 | num__Time_spent_Alone | 0.164341 |
| 1 | num__Social_event_attendance | 0.13333 |
| 6 | bin__Drained_after_socializing | 0.107101 |
| 2 | num__Going_outside | 0.074332 |
| 4 | num__Post_frequency | 0.072495 |
| 3 | num__Friends_circle_size | 0.038679 |


In [7]:
# Keep only selected top features + target and id
selected_features = [
    'id',
    'Introvert_score',
    'bin__Stage_fear',
    'bin__Drained_after_socializing',
    'num__Time_spent_Alone',
    'num__Social_event_attendance',
    'Personality'
]

# Personality label-encoding
df_cleaned["Personality"] = df_cleaned["Personality"].map({"Extrovert": 0, "Introvert": 1})

df_reduced = df_cleaned[selected_features].copy()



In [8]:
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.preprocessing import LabelEncoder

# Features and target (Personality was already mapped to 0/1 in the previous cell)
X = df_reduced.drop(columns=['id', 'Personality'])
y = df_reduced['Personality']

# Models to test
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(random_state=42),
    'XGBoost': XGBClassifier(eval_metric='logloss', random_state=42)
}

# Cross-validation setup
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Evaluate all models
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')
    print(f"{name}: {scores.mean():.4f} ± {scores.std():.4f}")


Logistic Regression: 0.9689 ± 0.0028
Random Forest: 0.9679 ± 0.0031
XGBoost: 0.9680 ± 0.0034


In [9]:
# Build df_reduced_test with the same engineered feature and columns as df_reduced
# (no rows are dropped from the test set, and it has no Personality column)

selected_features = [
    'id',
    'Introvert_score',
    'bin__Stage_fear',
    'bin__Drained_after_socializing',
    'num__Time_spent_Alone',
    'num__Social_event_attendance',
]


df_test_cleaned = pd.read_csv("../data/test_processed.csv")
df_test_cleaned = df_test_cleaned.drop(columns=[
    col for col in df_test_cleaned.columns if 'missingindicator_' in col
])

df_test_cleaned['Introvert_score'] = (
    (df_test_cleaned['num__Time_spent_Alone'] > 4).astype(int) +
    (df_test_cleaned['num__Post_frequency'] < 3).astype(int) +
    (df_test_cleaned['num__Going_outside'] < 3).astype(int) +
    (df_test_cleaned['num__Social_event_attendance'] < 3).astype(int) +
    (df_test_cleaned['num__Friends_circle_size'] < 8).astype(int)
)

df_reduced_test = df_test_cleaned[selected_features].copy()

In [10]:
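# Save fully prepared files (create the output folder first if it does not exist)
from pathlib import Path

Path('../data/prepared').mkdir(parents=True, exist_ok=True)
df_reduced.to_csv('../data/prepared/train_fully_prepared.csv', index=False)
df_reduced_test.to_csv('../data/prepared/test_fully_prepared.csv', index=False)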

## ✅ Summary of Feature Reduction

After adding the engineered `Introvert_score`, we evaluated feature importances and selected a reduced set of the five most impactful features:

- `Introvert_score`
- `bin__Stage_fear`
- `bin__Drained_after_socializing`
- `num__Time_spent_Alone`
- `num__Social_event_attendance`

The `id` and `Personality` columns are kept alongside them as the row identifier and the target, not as model inputs.
 
These features retained the predictive strength of the original dataset, while simplifying the model and improving interpretability.

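**Accuracy remained essentially unchanged** (mean cross-validated accuracy, original → reduced feature set):
- Logistic Regression: **0.9690 → 0.9689**
- Random Forest: **0.9678 → 0.9679**
- XGBoost: **0.9676 → 0.9680**

Every change is far smaller than the fold-to-fold standard deviation of roughly 0.003 reported in the evaluation cell.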
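
A before/after comparison like this is easiest to trust when both feature sets are scored on identical folds. The sketch below shows one way to do that, using synthetic data from `make_classification` rather than the notebook's dataset. The column names `f0` to `f7` and the choice of `reduced_cols` are purely illustrative assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data; with shuffle=False the informative columns come first.
X_arr, y = make_classification(
    n_samples=400, n_features=8, n_informative=4, shuffle=False, random_state=42
)
X_full = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])
reduced_cols = ["f0", "f1", "f2", "f3"]  # hypothetical reduced feature set

# A fixed random_state makes both runs use exactly the same splits.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = LogisticRegression(max_iter=1000)

full = cross_val_score(model, X_full, y, cv=cv, scoring="accuracy")
reduced = cross_val_score(model, X_full[reduced_cols], y, cv=cv, scoring="accuracy")

# Per-fold differences show whether the change exceeds fold-to-fold noise.
diff = reduced - full
print(f"full:    {full.mean():.4f} ± {full.std():.4f}")
print(f"reduced: {reduced.mean():.4f} ± {reduced.std():.4f}")
print(f"per-fold difference: {diff.round(4)}")
```

If the per-fold differences scatter around zero, the reduced set can reasonably be treated as equivalent for that model.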

We now explore additional models to attempt to push accuracy even higher.