# 🚀 Performance Optimization: Modeling Experiments

This notebook explores different strategies to improve the performance of our models
on the introvert vs extrovert classification task.

While our baseline Logistic Regression already achieves high cross-validated accuracy (~96.9%),
we aim to squeeze out additional gains using the techniques outlined below.

---

## 🔧 Goals

- Reduce overfitting and noise
- Improve generalization on hidden test data
- Push cross-validated accuracy toward 0.9700 or beyond

---

## 🧪 Optimization Steps

1. **Feature Selection**  
   Use `SelectFromModel` and optionally `RFECV` to keep only meaningful features.

2. **Model Stacking**  
   Combine predictions from multiple base models using a meta-learner.

3. **Hyperparameter Tuning**  
   Use `GridSearchCV` to find optimal settings for Logistic Regression and XGBoost.

4. **Optional: Sample Filtering**  
   Remove rows with many missing values or conflicting patterns to reduce label noise.

---
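
Step 4 is described only in words. A minimal sketch of its missing-value part is shown here, assuming a hypothetical cutoff `max_missing_frac` and a toy frame standing in for the raw (pre-imputation) data. The feature names, the helper `drop_sparse_rows`, and the 0.5 threshold are illustrative and not taken from this project.

```python
import numpy as np
import pandas as pd


def drop_sparse_rows(df: pd.DataFrame, feature_cols: list, max_missing_frac: float = 0.5) -> pd.DataFrame:
    """Keep rows whose fraction of missing feature values is at most max_missing_frac."""
    missing_frac = df[feature_cols].isna().mean(axis=1)
    return df.loc[missing_frac <= max_missing_frac].copy()


# Toy data (hypothetical columns): row with id=1 has every feature missing
toy = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "feat_a": [1.0, np.nan, 3.0, np.nan],
    "feat_b": [0.5, np.nan, np.nan, 2.0],
    "feat_c": [2.0, np.nan, 1.0, 4.0],
    "Personality": ["Introvert", "Extrovert", "Extrovert", "Introvert"],
})

filtered = drop_sparse_rows(toy, ["feat_a", "feat_b", "feat_c"])
print(f"Kept {len(filtered)} of {len(toy)} rows")
```

If such a filter is used, it is usually safer to apply it to training rows only, so that validation scores still reflect the full data distribution.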

We will evaluate each step using 5-fold cross-validation (`StratifiedKFold`)
and compare the accuracy improvements step-by-step.


In [1]:
# 1️⃣ Feature Selection with SelectFromModel
# We'll use Logistic Regression as the base estimator to identify and keep only the most impactful features.

# Imports
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.ensemble import StackingClassifier


# Load data
df = pd.read_csv('../data/train_processed.csv')
X = df.drop(columns=['id', 'Personality'])
y = LabelEncoder().fit_transform(df['Personality'])

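# Feature selection
# Note: the selector is fit on all rows before cross-validation, so every
# validation fold has already influenced which features are kept. The CV score
# below may therefore be slightly optimistic.
selector = SelectFromModel(LogisticRegression(max_iter=1000, random_state=42), threshold='median')
selector.fit(X, y)
X_selected = selector.transform(X)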

print(f"Selected {X_selected.shape[1]} of {X.shape[1]} features")

# Evaluate model with selected features
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = LogisticRegression(max_iter=1000, random_state=42)
scores = cross_val_score(model, X_selected, y, cv=cv, scoring='accuracy')
print(f"Logistic Regression (selected features): {scores.mean():.4f} ± {scores.std():.4f}")

Selected 7 of 14 features
Logistic Regression (selected features): 0.9678 ± 0.0021
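
In the cell above, the selector is fit on all rows before cross-validation, so each validation fold has already influenced the choice of features. One way to avoid this is to place the selector inside a `Pipeline`, so it is refit on every training fold. The sketch below uses synthetic data from `make_classification` as a stand-in for `train_processed.csv` (an assumption), so its printed score is not comparable to the results in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the real training data (assumption: 14 numeric features)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=14, n_informative=5, random_state=42
)

# Selection and classification are refit together on each training fold
pipe = Pipeline([
    ("select", SelectFromModel(LogisticRegression(max_iter=1000), threshold="median")),
    ("clf", LogisticRegression(max_iter=1000)),
])

cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
pipe_scores = cross_val_score(pipe, X_demo, y_demo, cv=cv_demo, scoring="accuracy")
print(f"Pipeline (selection inside CV): {pipe_scores.mean():.4f} ± {pipe_scores.std():.4f}")
```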


In [None]:
# 2️⃣ Model Stacking
# We will now combine Logistic Regression, Random Forest, and XGBoost using a meta-classifier (Logistic Regression).

# Stacking model
base_models = [
    ('lr', LogisticRegression(max_iter=1000)),
    ('rf', RandomForestClassifier(random_state=42)),
    ('xgb', XGBClassifier(eval_metric='logloss', random_state=42))
]

stacked = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=cv
)

# Evaluate stacked model
stack_scores = cross_val_score(stacked, X, y, cv=cv, scoring='accuracy')
print(f"Stacked Model: {stack_scores.mean():.4f} ± {stack_scores.std():.4f}")


Stacked Model: 0.9689 ± 0.0019
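
To make the mechanics concrete: `StackingClassifier` trains its meta-learner on out-of-fold predictions from the base models. The sketch below reproduces that idea by hand with `cross_val_predict`, on synthetic data and with only two base models (XGBoost is left out to keep the example dependency-light). It illustrates the technique and is not a replica of the cell above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic stand-in for the real training data (assumption)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=14, n_informative=5, random_state=42
)
cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

base_demo = {
    "lr": LogisticRegression(max_iter=1000),
    "rf": RandomForestClassifier(n_estimators=50, random_state=42),
}

# One column per base model: out-of-fold probability of the positive class
meta_features = np.column_stack([
    cross_val_predict(m, X_demo, y_demo, cv=cv_demo, method="predict_proba")[:, 1]
    for m in base_demo.values()
])

# The meta-learner only ever sees predictions made on rows the base model did not train on
meta = LogisticRegression(max_iter=1000).fit(meta_features, y_demo)
print(dict(zip(base_demo, meta.coef_[0].round(3))))
```

The printed coefficients give a rough idea of how much weight the meta-learner places on each base model.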


In [4]:
# 3️⃣ Hyperparameter Tuning (Logistic Regression)
# We'll use GridSearchCV to optimize the regularization strength `C` for Logistic Regression.

# GridSearch for Logistic Regression
from sklearn.model_selection import GridSearchCV

param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],
    'penalty': ['l2'],
    'solver': ['lbfgs']
}

grid_lr = GridSearchCV(
    estimator=LogisticRegression(max_iter=2000, random_state=42),
    param_grid=param_grid,
    cv=cv,
    scoring='accuracy',
    n_jobs=-1
)

grid_lr.fit(X, y)
print(f"Best CV accuracy: {grid_lr.best_score_:.4f}")
print(f"Best parameters: {grid_lr.best_params_}")

Best CV accuracy: 0.9691
Best parameters: {'C': 0.1, 'penalty': 'l2', 'solver': 'lbfgs'}
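
`best_score_` is only the mean over folds. The fold-to-fold spread of the winning setting can be read from `cv_results_`, which is what a "± Std" column needs. A small sketch on synthetic data follows; the names with a `_demo` suffix are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the real training data (assumption)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=14, n_informative=5, random_state=42
)
cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

grid_demo = GridSearchCV(
    LogisticRegression(max_iter=2000),
    param_grid={"C": [0.01, 0.1, 1, 10, 100]},
    cv=cv_demo,
    scoring="accuracy",
)
grid_demo.fit(X_demo, y_demo)

# best_index_ points at the row of cv_results_ for the best candidate
best = grid_demo.best_index_
mean = grid_demo.cv_results_["mean_test_score"][best]
std = grid_demo.cv_results_["std_test_score"][best]
print(f"Best CV accuracy: {mean:.4f} ± {std:.4f}")
```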


In [6]:
# Hyperparameter Tuning (XGBoost)
# We'll now tune key parameters for XGBoost to explore if better settings improve performance.

# GridSearch for XGBoost
from sklearn.model_selection import GridSearchCV

xgb_params = {
    'n_estimators': [100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.3],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}

grid_xgb = GridSearchCV(
    estimator=XGBClassifier(eval_metric='logloss', random_state=42),
    param_grid=xgb_params,
    cv=cv,
    scoring='accuracy',
    n_jobs=-1
)

grid_xgb.fit(X, y)

print("🔍 XGBoost GridSearch Results:")
print(f"Best CV accuracy: {grid_xgb.best_score_:.4f}")
print(f"Best parameters: {grid_xgb.best_params_}")


🔍 XGBoost GridSearch Results:
Best CV accuracy: 0.9691
Best parameters: {'colsample_bytree': 1.0, 'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 100, 'subsample': 1.0}
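
The XGBoost grid is exhaustive, so its cost grows multiplicatively with every added value. `ParameterGrid` can count the candidates before the search is launched:

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above
xgb_params = {
    'n_estimators': [100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.3],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}

n_candidates = len(ParameterGrid(xgb_params))  # 2 * 3 * 3 * 2 * 2
n_splits = 5
print(f"{n_candidates} candidates x {n_splits} folds = {n_candidates * n_splits} fits")
```

If the grid grows much further, `RandomizedSearchCV` with a fixed `n_iter` is a common way to cap the number of fits.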


### 🧾 Final Result Summary

Despite multiple optimization attempts, no approach significantly outperformed the baseline model: every result lies within one baseline standard deviation (0.0017) of 0.9690.

| Approach                     | CV Accuracy (mean ± std) |
|------------------------------|--------------------------|
| Baseline Logistic Regression | 0.9690 ± 0.0017          |
| Feature Selection (SFM)      | 0.9678 ± 0.0021          |
| Stacking                     | 0.9689 ± 0.0019          |
| Tuned Logistic Regression    | 0.9691 ± 0.0017          |
| Tuned XGBoost                | 0.9691 ± 0.0017          |

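**Conclusion:** The baseline model already performs near-optimally on the current feature set.  
Further improvements are more likely to come from smarter feature engineering and better handling of label noise (for example, the optional sample filtering from step 4, which is not evaluated above) than from model or hyperparameter changes.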
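
As a quick check of that reading, the snippet below uses only the numbers from the summary table and expresses each approach's gap to the baseline in units of the baseline's fold standard deviation. It is a rough sanity check and not a formal significance test.

```python
# Values copied from the summary table
baseline_mean, baseline_std = 0.9690, 0.0017

results = {
    "Feature Selection (SFM)": 0.9678,
    "Stacking": 0.9689,
    "Tuned Logistic Regression": 0.9691,
    "Tuned XGBoost": 0.9691,
}

# Gap to baseline, measured in baseline standard deviations
gaps = {name: (acc - baseline_mean) / baseline_std for name, acc in results.items()}
for name, gap in gaps.items():
    print(f"{name:<28} {gap:+.2f} baseline std")
```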