# Ensemble Dev Plan

Here are several models you can train on your current data to produce class probabilities for an ensemble. Each one exposes predict_proba, so its output can be averaged or stacked with the others.
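The snippets in this plan assume that `X_train_scaled`, `X_test_scaled`, `y_train`, and `y_test` already exist. As a rough sketch of how they might be produced, the block below fits the scaler on the training split only, so test-set statistics do not leak into training. The synthetic dataset from `make_classification` and the split sizes are placeholders, not your data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Placeholder data: swap in your own feature matrix X and label vector y.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=42
)

# Stratify so every class appears in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Fit scaling statistics on the training split only, then reuse them on test.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

Scaling matters most for logistic regression, SVM, k-NN, and the MLP. The tree-based boosters are largely insensitive to it, so passing them the scaled arrays is harmless but not required.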

1. Logistic Regression

Why Use It:
	•	Simple, interpretable, and efficient for binary and multiclass classification.

Implementation:

from sklearn.linear_model import LogisticRegression

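# lbfgs already fits a multinomial model for multiclass targets;
# the multi_class argument is deprecated in recent scikit-learn releases.
log_reg = LogisticRegression(solver="lbfgs", max_iter=500, random_state=42)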
log_reg.fit(X_train_scaled, y_train)
probs_log_reg = log_reg.predict_proba(X_test_scaled)  # Probabilities

2. XGBoost (already used)

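Why Use It:
	•	Strong accuracy on tabular data, with built-in feature importance scores.

Implementation:
(You already have this model. Keep its predict_proba output on the same test set so it can be combined with the other models in the ensemble.)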

3. Gradient Boosting Classifier (LightGBM)

Why Use It:
	•	Gradient boosting like XGBoost, but often faster to train and lighter on memory.

Implementation:

import lightgbm as lgb

lgb_model = lgb.LGBMClassifier(n_estimators=100, random_state=42)
lgb_model.fit(X_train_scaled, y_train)
probs_lgb = lgb_model.predict_proba(X_test_scaled)  # Probabilities

4. Support Vector Machines (SVM) with Probabilistic Outputs

Why Use It:
	•	Effective for high-dimensional spaces and non-linear classification.

Implementation:

from sklearn.svm import SVC

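# probability=True adds a calibration step fitted with internal cross-validation,
# so training is noticeably slower than a plain SVC.
svc = SVC(probability=True, kernel="rbf", random_state=42)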
svc.fit(X_train_scaled, y_train)
probs_svc = svc.predict_proba(X_test_scaled)  # Probabilities

5. k-Nearest Neighbors (k-NN)

Why Use It:
	•	Simple, interpretable, and non-parametric.

Implementation:

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)
probs_knn = knn.predict_proba(X_test_scaled)  # Probabilities (multiples of 1/5 with k=5 and uniform weights)

6. Multi-Layer Perceptron (MLP) Classifier

Why Use It:
	•	Neural network-based approach; good for non-linear relationships.

Implementation:

from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=300, random_state=42)
mlp.fit(X_train_scaled, y_train)
probs_mlp = mlp.predict_proba(X_test_scaled)  # Probabilities

7. CatBoost

Why Use It:
	•	Handles categorical features natively (pass the raw columns via cat_features), so they need little preprocessing.

Implementation:

from catboost import CatBoostClassifier

cat_model = CatBoostClassifier(iterations=100, depth=6, learning_rate=0.1, random_state=42, verbose=0)
cat_model.fit(X_train_scaled, y_train)
probs_cat = cat_model.predict_proba(X_test_scaled)  # Probabilities

Ensemble Creation

Once you have probability outputs from all models:
	1.	Combine the probabilities using an averaging or weighted averaging strategy.
	2.	Optionally, use a meta-model (like Logistic Regression or XGBoost) to learn the optimal combination of the predictions.

Example: Averaging Probabilities

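# Combine probabilities (equal weights). Every matrix must have the same
# shape and the same class order in its columns.
# Add the probabilities from your existing XGBoost model to this list as well.
all_probs = [probs_log_reg, probs_lgb, probs_svc, probs_knn, probs_mlp, probs_cat]
combined_probs = sum(all_probs) / len(all_probs)

# Get final predictions: argmax returns a column index, so map it back to the class label
final_predictions = log_reg.classes_[combined_probs.argmax(axis=1)]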
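
The equal-weight average above treats every model as equally reliable. As one possible sketch of the weighted variant, the block below weights each model by the inverse of its log loss on a validation set. The helper names (`multiclass_log_loss`, `weighted_average`), the toy arrays, and the inverse-loss weighting scheme are all illustrative assumptions, and the labels are assumed to be encoded as 0..n_classes-1 so they can index the probability columns.

```python
import numpy as np


def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    probs = np.clip(probs, eps, 1.0)
    return float(-np.mean(np.log(probs[np.arange(len(y_true)), y_true])))


def weighted_average(prob_list, weights):
    """Blend (n_samples, n_classes) probability matrices with normalized weights."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    stacked = np.stack(prob_list)  # (n_models, n_samples, n_classes)
    return np.tensordot(weights, stacked, axes=1)


# Toy validation-set probabilities from two hypothetical models.
y_val = np.array([0, 1, 1, 0])
probs_a = np.array([[0.9, 0.1], [0.2, 0.8], [0.3, 0.7], [0.8, 0.2]])  # stronger model
probs_b = np.array([[0.6, 0.4], [0.5, 0.5], [0.4, 0.6], [0.5, 0.5]])  # weaker model

# Lower validation loss -> larger weight.
losses = np.array([multiclass_log_loss(y_val, p) for p in (probs_a, probs_b)])
weights = 1.0 / losses

blended = weighted_average([probs_a, probs_b], weights)
```

Derive the weights from a validation split rather than the test set; otherwise the reported test performance of the blend will be optimistic.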

Example: Stacking with Meta-Model

from sklearn.ensemble import StackingClassifier

stack = StackingClassifier(
    estimators=[
        ("log_reg", log_reg),
        ("lgb", lgb_model),
        ("svc", svc),
        ("knn", knn),
        ("mlp", mlp),
        ("cat", cat_model),
    ],
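    final_estimator=LogisticRegression(),
    stack_method="predict_proba",  # train the meta-model on class probabilities
    cv=5,  # meta-model sees out-of-fold predictions only
)

# Refits clones of every base model inside the cross-validation loop, so this can be slow
stack.fit(X_train_scaled, y_train)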
stack_probs = stack.predict_proba(X_test_scaled)
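
Since the goal is a probability ranking, the ensemble output (either `combined_probs` or `stack_probs`) can be sorted by the probability of the class of interest. The following is a minimal sketch; the helper name `rank_by_class_probability` and the toy matrix are hypothetical. Note that the column order of a `predict_proba` matrix follows the fitted model's `classes_` attribute, so look up the column index there rather than assuming it.

```python
import numpy as np


def rank_by_class_probability(probs, class_index, top_k=None):
    """Return row indices sorted from most to least likely for one class."""
    scores = probs[:, class_index]
    order = np.argsort(-scores, kind="stable")  # descending; ties keep row order
    return order if top_k is None else order[:top_k]


# Toy ensemble output: 4 rows, 3 classes (stand-in for combined_probs or stack_probs).
probs = np.array([
    [0.2, 0.5, 0.3],
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.4, 0.4, 0.2],
])

ranking = rank_by_class_probability(probs, class_index=0)
top_two = rank_by_class_probability(probs, class_index=0, top_k=2)
```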

Models Summary
	•	Logistic Regression: A simple baseline model.
	•	XGBoost: Powerful gradient boosting.
	•	LightGBM: Lightweight and faster alternative to XGBoost.
	•	SVM: Works well for complex decision boundaries.
	•	k-NN: Non-parametric and interpretable.
	•	MLP: Neural network-based model for non-linear relationships.
	•	CatBoost: Efficient handling of categorical data.

Combining these models helps most when they make different kinds of errors. Before adopting the ensemble, confirm on a validation set that it actually beats your best single model.