## Model Training

**1. Model Selection and Training**
   - 1.1 Classification Models
      - Logistic Regression
      - Decision Trees
      - Random Forest
      - Support Vector Machines (SVM)
      - Gradient Boosting (e.g., XGBoost, LightGBM)
      - K-Nearest Neighbors (KNN)
      - Naive Bayes
      - Neural Networks

   - 1.2 Regression Models
      - Linear Regression
      - Polynomial Regression
      - Ridge Regression
      - Lasso Regression
      - Elastic Net
      - Decision Tree Regressor
      - Random Forest Regressor
      - Gradient Boosting Regressor
      - Support Vector Regression (SVR)

   - 1.3 Clustering Models
      - K-Means
      - Hierarchical Clustering
      - DBSCAN
      - Gaussian Mixture Models
      - Mean Shift
      - Spectral Clustering

   - 1.4 Cross-Validation Techniques
      - K-Fold Cross-Validation
      - Stratified K-Fold Cross-Validation
      - Leave-One-Out Cross-Validation
      - Time Series Cross-Validation

   - 1.5 Handling Class Imbalance
      - Oversampling (e.g., SMOTE)
      - Undersampling
      - Combination (SMOTEENN, SMOTETomek)
      - Class Weights
      - Ensemble Methods (e.g., BalancedRandomForestClassifier)

**2. Hyperparameter Tuning**
   - 2.1 Grid Search
   - 2.2 Random Search
   - 2.3 Bayesian Optimization
   - 2.4 Genetic Algorithms
   - 2.5 Hyperband
   - 2.6 Optuna

**3. Model Evaluation**
   - 3.1 Classification Metrics
      - Accuracy
      - Precision
      - Recall
      - F1-Score
      - ROC-AUC
      - PR-AUC

   - 3.2 Regression Metrics
      - Mean Squared Error (MSE)
      - Root Mean Squared Error (RMSE)
      - Mean Absolute Error (MAE)
      - R-squared (R2)
      - Adjusted R-squared

   - 3.3 Clustering Metrics
      - Silhouette Score
      - Calinski-Harabasz Index
      - Davies-Bouldin Index

   - 3.4 Reports and Visualizations
      - Confusion Matrix
      - Classification Report
      - ROC Curve
      - Precision-Recall Curve
      - Learning Curves
      - Validation Curves

**4. Model Interpretation**
   - 4.1 Feature Importance Analysis
      - Random Forest Feature Importance
      - Permutation Importance
      - Recursive Feature Elimination (RFE)

   - 4.2 SHAP (SHapley Additive exPlanations)
      - SHAP Summary Plot
      - SHAP Dependence Plot
      - SHAP Force Plot
      - SHAP Interaction Values

   - 4.3 Partial Dependence Plots (PDP)
   - 4.4 Individual Conditional Expectation (ICE) Plots
   - 4.5 Global Surrogate Models
   - 4.6 Local Interpretable Model-agnostic Explanations (LIME)

**5. Ensemble Methods**
   - 5.1 Bagging
      - Random Forest
      - Bagging Classifier/Regressor

   - 5.2 Boosting
      - AdaBoost
      - Gradient Boosting
      - XGBoost
      - LightGBM
      - CatBoost

   - 5.3 Stacking
      - StackingClassifier
      - StackingRegressor

   - 5.4 Voting
      - VotingClassifier
      - VotingRegressor

_____________________________________________________________________________________

**1.1 Classification**

In [1]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier

classification_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
])

# Dictionary of classifiers
classifiers = {
    'Logistic Regression': LogisticRegression(),  # Linear model for binary or multiclass classification based on the log-odds.
    'Decision Tree': DecisionTreeClassifier(),  # Non-linear model that splits data into branches to fit a piecewise constant function.
    'Random Forest': RandomForestClassifier(),  # Ensemble of decision trees that reduces overfitting by averaging multiple trees.
    'SVM': SVC(),  # Support Vector Classification that finds the optimal hyperplane separating different classes.
    'Gradient Boosting': GradientBoostingClassifier(),  # Ensemble technique that builds trees sequentially to correct errors of previous trees.
    'KNN': KNeighborsClassifier(),  # Instance-based learning method that classifies samples based on the majority class of their nearest neighbors.
    'Naive Bayes': GaussianNB(),  # Probabilistic classifier based on Bayes' theorem with an assumption of feature independence.
    'Neural Network': MLPClassifier()  # Multi-layer Perceptron that uses multiple layers of neurons to model complex patterns in data.
}

# You can easily switch classifiers like this:
# classification_pipeline.set_params(classifier=classifiers['Random Forest'])
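
The switching pattern above can be sketched end to end as follows. This is a minimal illustration, not part of the original notebook: the Iris data, the split settings, the reduced `candidates` dictionary and the `max_iter`/`random_state` values are all assumptions chosen to keep the run short and repeatable.

```python
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumed example data: Iris with a stratified 80/20 split
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(max_iter=1000)),
])

# Hypothetical subset of the classifiers dictionary, kept small for speed
candidates = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(random_state=0),
    'KNN': KNeighborsClassifier(),
}

test_accuracy = {}
for name, estimator in candidates.items():
    # clone() yields an unfitted copy, so the dictionary entries stay untouched
    pipeline.set_params(classifier=clone(estimator))
    pipeline.fit(X_train, y_train)
    test_accuracy[name] = pipeline.score(X_test, y_test)

for name, acc in test_accuracy.items():
    print(f"{name}: {acc:.3f}")
```

Because the scaler is refit inside `pipeline.fit`, every candidate sees features standardized on the training split only.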

**1.2 Regression**

In [2]:
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.preprocessing import PolynomialFeatures
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR

regression_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', LinearRegression())
])

# Dictionary of regressors
regressors = {
    'Linear Regression': LinearRegression(),
    'Polynomial Regression': Pipeline([
        ('poly', PolynomialFeatures(degree=2)),  # Transforms features into polynomial features of a specified degree.
        ('linear', LinearRegression())  # Fits a linear model to the polynomial-transformed features.
    ]),
    'Ridge Regression': Ridge(),  # Linear model with L2 regularization to prevent overfitting by penalizing large coefficients.
    'Lasso Regression': Lasso(),  # Linear model with L1 regularization to enforce sparsity by penalizing the absolute values of coefficients.
    'Elastic Net': ElasticNet(),  # Combines L1 and L2 regularization to enforce sparsity and prevent overfitting.
    'Decision Tree Regressor': DecisionTreeRegressor(),  # Non-linear model that splits data into branches to fit a piecewise constant function.
    'Random Forest Regressor': RandomForestRegressor(),  # Ensemble of decision trees that reduces overfitting by averaging multiple trees.
    'Gradient Boosting Regressor': GradientBoostingRegressor(),  # Ensemble technique that builds trees sequentially to correct errors of previous trees.
    'SVR': SVR()  # Support Vector Regression fits the data within a margin while minimizing model complexity.
}

# You can easily switch regressors like this:
# regression_pipeline.set_params(regressor=regressors['Random Forest Regressor'])
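
A nested pipeline such as the polynomial entry can be dropped into the `regressor` step in the same way, and its inner parameters are then reachable through double-underscore names. The sketch below assumes synthetic quadratic data (not from the original notebook) so that the gain from the polynomial terms is visible.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Assumed toy data: y depends on x and x squared, plus a little noise
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + rng.normal(scale=0.1, size=200)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', LinearRegression()),
])
linear_r2 = pipeline.fit(X, y).score(X, y)

# Swap in a nested pipeline as the regressor step
poly_regressor = Pipeline([
    ('poly', PolynomialFeatures(degree=2)),
    ('linear', LinearRegression()),
])
pipeline.set_params(regressor=poly_regressor)
poly_r2 = pipeline.fit(X, y).score(X, y)

# Nested parameters use step names joined by double underscores
pipeline.set_params(regressor__poly__degree=3)
degree_after_update = pipeline.get_params()['regressor__poly__degree']

print(f"linear R^2: {linear_r2:.3f}, polynomial R^2: {poly_r2:.3f}")
```

The scores here are training-set R-squared values, which is enough to show the mechanics but not a substitute for held-out evaluation.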

**1.3 Clustering**

In [4]:
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN, MeanShift, SpectralClustering
from sklearn.mixture import GaussianMixture

clustering_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clusterer', KMeans())  # Default clustering algorithm (K-Means).
])

# Dictionary of clustering algorithms
clusterers = {
    'K-Means': KMeans(),  # Partitions data into k clusters by minimizing the variance within each cluster.
    'Hierarchical Clustering': AgglomerativeClustering(),  # Builds a hierarchy of clusters bottom-up by successively merging the closest pairs of clusters.
    'DBSCAN': DBSCAN(),  # Density-Based Spatial Clustering of Applications with Noise; finds core samples of high density and expands clusters from them.
    'Gaussian Mixture Models': GaussianMixture(),  # Probabilistic model that assumes data is generated from a mixture of several Gaussian distributions.
    'Mean Shift': MeanShift(),  # Centroid-based algorithm that shifts candidate centroids toward the mean of nearby points until they settle on density peaks.
    'Spectral Clustering': SpectralClustering()  # Uses the eigenvalues of a similarity matrix to perform dimensionality reduction before clustering.
}

# You can easily switch clusterers like this:
# clustering_pipeline.set_params(clusterer=clusterers['DBSCAN'])
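
Clusterers such as DBSCAN have no `predict` method, so a clustering pipeline is usually driven through `fit_predict`. The following sketch assumes synthetic blob data and hand-picked `n_clusters`, `eps` and `min_samples` values; none of these come from the original notebook.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumed toy data: three Gaussian blobs
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clusterer', KMeans(n_clusters=3, n_init=10, random_state=0)),
])
# fit_predict scales the data, fits the clusterer and returns one label per sample
kmeans_labels = pipeline.fit_predict(X)

# Swap the final step; DBSCAN marks noise points with the label -1
pipeline.set_params(clusterer=DBSCAN(eps=0.3, min_samples=5))
dbscan_labels = pipeline.fit_predict(X)

print("K-Means clusters:", len(set(kmeans_labels)))
print("DBSCAN labels found:", sorted(set(dbscan_labels)))
```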

**1.4 Cross-Validation**

In [6]:
from sklearn.model_selection import cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.metrics import classification_report, confusion_matrix, mean_squared_error, r2_score
from sklearn.model_selection import StratifiedKFold

# Use it to get an estimate of the model's accuracy and its variance.
def perform_cross_validation(pipeline, X, y, cv=5, scoring='accuracy'):
    scores = cross_val_score(pipeline, X, y, cv=cv, scoring=scoring)
    return scores.mean(), scores.std()

# Use it to optimize model parameters for better performance.
def perform_grid_search(pipeline, param_grid, X, y, cv=5, scoring='accuracy'):
    grid_search = GridSearchCV(pipeline, param_grid, cv=cv, scoring=scoring)
    grid_search.fit(X, y)
    return grid_search.best_params_, grid_search.best_score_

# Use it as a more efficient alternative to grid search when the parameter space is large.
def perform_randomized_search(pipeline, param_distributions, X, y, n_iter=10, cv=5, scoring='accuracy'):
    random_search = RandomizedSearchCV(pipeline, param_distributions, n_iter=n_iter, cv=cv, scoring=scoring)
    random_search.fit(X, y)
    return random_search.best_params_, random_search.best_score_

# Use it to ensure each fold has the same proportion of class labels, which is especially useful for imbalanced datasets.
def perform_stratified_k_fold_cross_validation(pipeline, X, y, n_splits=5, scoring='accuracy'):
    skf = StratifiedKFold(n_splits=n_splits)
    scores = cross_val_score(pipeline, X, y, cv=skf, scoring=scoring)
    return scores.mean(), scores.std()

# Use it to get detailed performance metrics such as precision, recall, and F1-score.
def evaluate_classifier(pipeline, X_test, y_test):
    y_pred = pipeline.predict(X_test)
    print(classification_report(y_test, y_pred))
    print(confusion_matrix(y_test, y_pred))

# Use it to get regression performance metrics such as Mean Squared Error (MSE) and R-squared.
def evaluate_regressor(pipeline, X_test, y_test):
    y_pred = pipeline.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    print(f"Mean Squared Error: {mse}")
    print(f"R-squared Score: {r2}")

# Example usage:
# cv_score, cv_std = perform_cross_validation(classification_pipeline, X, y)
# best_params, best_score = perform_grid_search(classification_pipeline, param_grid, X, y)
# best_params, best_score = perform_randomized_search(classification_pipeline, param_distributions, X, y)
# cv_score, cv_std = perform_stratified_k_fold_cross_validation(classification_pipeline, X, y)
# evaluate_classifier(classification_pipeline, X_test, y_test)
# evaluate_regressor(regression_pipeline, X_test, y_test)
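
The outline also lists Leave-One-Out and time series cross-validation. Both are available as scikit-learn splitters that can be passed through the `cv` argument, as in this hedged sketch. The Iris data and the 12-step toy series are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (KFold, LeaveOneOut, StratifiedKFold,
                                     TimeSeriesSplit, cross_val_score)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(max_iter=1000)),
])

# Any splitter object can be passed as cv
splitters = {
    'K-Fold': KFold(n_splits=5, shuffle=True, random_state=0),
    'Stratified K-Fold': StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    'Leave-One-Out': LeaveOneOut(),  # one fit per sample, so costly on large data
}
cv_results = {}
for name, cv in splitters.items():
    scores = cross_val_score(pipeline, X, y, cv=cv, scoring='accuracy')
    cv_results[name] = (scores.mean(), scores.std())

# TimeSeriesSplit keeps every training fold strictly before its test fold
tscv = TimeSeriesSplit(n_splits=3)
time_folds = [(train.max(), test.min()) for train, test in tscv.split(np.arange(12))]

for name, (mean, std) in cv_results.items():
    print(f"{name}: {mean:.3f} +/- {std:.3f}")
print("last train index / first test index per fold:", time_folds)
```

Iris is not ordered in time, so `TimeSeriesSplit` is only inspected on an index array here rather than used for scoring.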

**1.5 Class Imbalance**

In [7]:
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.combine import SMOTEENN, SMOTETomek
from imblearn.ensemble import BalancedRandomForestClassifier

# You can add these to your pipeline like this:
from imblearn.pipeline import Pipeline as ImbPipeline

imbalanced_pipeline = ImbPipeline([
    ('sampler', SMOTE()),  # Synthetic Minority Over-sampling Technique: Generates synthetic samples for the minority class.
    ('scaler', StandardScaler()),  # Standardizes features by removing the mean and scaling to unit variance.
    ('classifier', LogisticRegression())
])

# Other sampling techniques
samplers = {
    'SMOTE': SMOTE(),  # Generates synthetic samples for the minority class.
    'Random Undersampling': RandomUnderSampler(),  # Randomly removes samples from the majority class.
    'SMOTEENN': SMOTEENN(),  # Combines SMOTE and Edited Nearest Neighbors for resampling.
    'SMOTETomek': SMOTETomek()  # Combines SMOTE and Tomek links for resampling.
}

# Balanced Random Forest
balanced_rf = BalancedRandomForestClassifier()  # Random Forest with balanced class sampling.
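
Class weights, listed in the outline, need no resampling at all: many scikit-learn estimators accept `class_weight='balanced'`. The sketch below compares minority-class recall with and without weighting. The synthetic 95/5 dataset and its parameters are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Assumed toy data: roughly 5% of samples belong to the positive class
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# 'balanced' reweights each class inversely to its frequency in y_train
weighted = LogisticRegression(max_iter=1000, class_weight='balanced').fit(X_train, y_train)

plain_recall = recall_score(y_test, plain.predict(X_test))
weighted_recall = recall_score(y_test, weighted.predict(X_test))

print(f"minority recall without weights: {plain_recall:.3f}")
print(f"minority recall with balanced weights: {weighted_recall:.3f}")
```

Weighting typically raises recall on the minority class at the cost of some precision, so both metrics should be checked before choosing it over resampling.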

**2. Hyperparameter Tuning**

In [1]:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, cross_val_score
from sklearn.base import BaseEstimator
import numpy as np
from skopt import BayesSearchCV # type: ignore
from deap import base, creator, tools, algorithms
import optuna # type: ignore
from keras_tuner import Hyperband, HyperParameters # type: ignore  # the package was formerly imported as `kerastuner`

# 2.1 Grid Search
# param_grid: Dictionary with parameter names as keys and lists of parameter settings to try as values.
def grid_search(estimator, param_grid, X, y, cv=5):
    grid_search = GridSearchCV(estimator, param_grid, cv=cv, n_jobs=-1)
    grid_search.fit(X, y)
    return grid_search.best_params_, grid_search.best_score_

# 2.2 Random Search
# Randomly samples from a range of hyperparameter values to find the best combination, often faster than grid search.
def random_search(estimator, param_distributions, X, y, cv=5, n_iter=10):
    random_search = RandomizedSearchCV(estimator, param_distributions, n_iter=n_iter, cv=cv, n_jobs=-1)
    random_search.fit(X, y)
    return random_search.best_params_, random_search.best_score_

# 2.3 Bayesian Optimization
# Fits a probabilistic surrogate model to past evaluations and uses it to pick the most promising hyperparameters to try next.
def bayesian_optimization(estimator, search_spaces, X, y, cv=5, n_iter=50):
    bayes_search = BayesSearchCV(estimator, search_spaces, n_iter=n_iter, cv=cv, n_jobs=-1)
    bayes_search.fit(X, y)
    return bayes_search.best_params_, bayes_search.best_score_

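# 2.4 Genetic Algorithm
# Optimizes hyperparameters using techniques such as mutation and crossover to evolve better parameter sets over generations.
def genetic_algorithm(estimator, param_grid, X, y, cv=5, population_size=50, generations=10):
    # creator.create registers classes globally, so create them only once per session
    if not hasattr(creator, "FitnessMax"):
        creator.create("FitnessMax", base.Fitness, weights=(1.0,))
    if not hasattr(creator, "Individual"):
        creator.create("Individual", list, fitness=creator.FitnessMax)

    keys = list(param_grid.keys())
    toolbox = base.Toolbox()
    for key in keys:
        # Each gene is an index into param_grid[key], so values of any type are supported
        toolbox.register(f"attr_{key}", np.random.randint, len(param_grid[key]))

    toolbox.register("individual", tools.initCycle, creator.Individual,
                     [getattr(toolbox, f"attr_{key}") for key in keys], n=1)
    toolbox.register("population", tools.initRepeat, list, toolbox.individual)

    def decode(individual):
        # Maps gene indices back to the actual hyperparameter values
        return {key: param_grid[key][int(idx)] for key, idx in zip(keys, individual)}

    def evaluate(individual):
        estimator.set_params(**decode(individual))
        return np.mean(cross_val_score(estimator, X, y, cv=cv)),

    def mutate(individual, indpb=0.05):
        # Resamples each gene from its own value list with probability indpb
        # (tools.mutFlipBit only makes sense for boolean genes)
        for i, key in enumerate(keys):
            if np.random.random() < indpb:
                individual[i] = np.random.randint(len(param_grid[key]))
        return individual,

    # Note: two-point crossover needs at least two hyperparameters in param_grid
    toolbox.register("mate", tools.cxTwoPoint)
    toolbox.register("mutate", mutate, indpb=0.05)
    toolbox.register("select", tools.selTournament, tournsize=3)
    toolbox.register("evaluate", evaluate)

    population = toolbox.population(n=population_size)
    result, _ = algorithms.eaSimple(population, toolbox, cxpb=0.5, mutpb=0.2, ngen=generations, verbose=False)

    best_individual = tools.selBest(result, k=1)[0]
    best_params = decode(best_individual)
    best_score = best_individual.fitness.values[0]  # Already computed by eaSimple

    return best_params, best_score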

# 2.5 Hyperband
# Adaptive resource allocation to quickly find good hyperparameters by evaluating many configurations with limited resources.
def hyperband_tuning(build_model, hp, X, y, max_epochs=50, factor=3, hyperband_iterations=1):
    tuner = Hyperband(
        build_model,
        objective='val_accuracy',
        max_epochs=max_epochs,
        factor=factor,
        hyperband_iterations=hyperband_iterations,  # Previously accepted but never passed to the tuner
        hyperparameters=hp,
        directory='hyperband_dir',
        project_name='hyperband_tuning'
    )
    tuner.search(X, y, epochs=max_epochs, validation_split=0.2)
    best_hps = tuner.get_best_hyperparameters(num_trials=1)[0]
    return best_hps.values, tuner.get_best_models()[0]

# 2.6 Optuna
# Runs a series of trials and, by default, uses the Tree-structured Parzen Estimator (TPE) sampler to propose promising hyperparameters.
def optuna_tuning(objective, n_trials=100):
    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=n_trials)
    return study.best_params, study.best_value

# Example usage:
# best_params, best_score = grid_search(estimator, param_grid, X, y)
# best_params, best_score = random_search(estimator, param_distributions, X, y)
# best_params, best_score = bayesian_optimization(estimator, search_spaces, X, y)
# best_params, best_score = genetic_algorithm(estimator, param_grid, X, y)

# For Hyperband (requires TensorFlow and Keras):
# def build_model(hp):
#     model = keras.Sequential()
#     model.add(keras.layers.Dense(units=hp.Int('units', min_value=32, max_value=512, step=32),
#                                  activation='relu'))
#     model.add(keras.layers.Dense(10, activation='softmax'))
#     model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
#     return model
# hp = HyperParameters()
# best_hps, best_model = hyperband_tuning(build_model, hp, X, y)

# For Optuna:
# def objective(trial):
#     params = {
#         'n_estimators': trial.suggest_int('n_estimators', 100, 1000),
#         'max_depth': trial.suggest_int('max_depth', 1, 10),
#         'min_samples_split': trial.suggest_int('min_samples_split', 2, 10),
#         'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 10),
#     }
#     model = RandomForestClassifier(**params)
#     return np.mean(cross_val_score(model, X, y, cv=5))
# best_params, best_score = optuna_tuning(objective)
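
The usage lines above leave `param_grid` and `param_distributions` undefined. When the estimator is a pipeline, parameter names must be prefixed with the step name and a double underscore. A minimal sketch, assuming the Iris data and a search over the regularization strength `C` only:

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(max_iter=1000)),
])

# Grid search: every listed value is tried
param_grid = {'classifier__C': [0.01, 0.1, 1, 10]}
grid = GridSearchCV(pipeline, param_grid, cv=5).fit(X, y)

# Random search: n_iter values are drawn from a continuous distribution
param_distributions = {'classifier__C': loguniform(1e-3, 1e2)}
rand = RandomizedSearchCV(pipeline, param_distributions, n_iter=10, cv=5,
                          random_state=0).fit(X, y)

print("grid search:", grid.best_params_, round(grid.best_score_, 3))
print("random search:", rand.best_params_, round(rand.best_score_, 3))
```

A log-uniform distribution is a common choice for scale parameters such as `C`, since useful values often span several orders of magnitude.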


**3. Model Evaluation**

In [8]:
import numpy as np
import pandas as pd
from sklearn.metrics import (accuracy_score, precision_score, recall_score, 
                             f1_score, roc_auc_score, average_precision_score,
                             mean_squared_error, mean_absolute_error, r2_score,
                             silhouette_score, calinski_harabasz_score, davies_bouldin_score,
                             confusion_matrix, classification_report, roc_curve, precision_recall_curve)
from sklearn.model_selection import learning_curve, validation_curve
import matplotlib.pyplot as plt

# 3.1 Classification Metrics
def classification_metrics(y_true, y_pred, y_prob=None):
    # Computes classification metrics: Accuracy, Precision, Recall, and F1-Score
    metrics = {
        'Accuracy': accuracy_score(y_true, y_pred),  # Ratio of correctly predicted instances
        'Precision': precision_score(y_true, y_pred, average='weighted'),  # Ratio of true positives to total predicted positives
        'Recall': recall_score(y_true, y_pred, average='weighted'),  # Ratio of true positives to total actual positives
        'F1-Score': f1_score(y_true, y_pred, average='weighted')  # Harmonic mean of Precision and Recall
    }
    
    if y_prob is not None:
        metrics['ROC-AUC'] = roc_auc_score(y_true, y_prob, average='weighted', multi_class='ovr')  # Area under the ROC curve
        metrics['PR-AUC'] = average_precision_score(y_true, y_prob, average='weighted')  # Area under the Precision-Recall curve
    
    return metrics

# 3.2 Regression Metrics
def regression_metrics(y_true, y_pred, n_features=None):
    # Computes regression metrics: MSE, RMSE, MAE, R-squared and, optionally, Adjusted R-squared
    # n_features: number of predictors the model was fit on (needed for Adjusted R-squared)
    mse = mean_squared_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    metrics = {
        'MSE': mse,  # Mean Squared Error
        'RMSE': np.sqrt(mse),  # Root Mean Squared Error
        'MAE': mean_absolute_error(y_true, y_pred),  # Mean Absolute Error
        'R-squared': r2  # Proportion of variance explained by the model
    }
    if n_features is not None:
        n = len(y_true)
        # Penalizes R-squared for the number of predictors: 1 - (1 - R2) * (n - 1) / (n - p - 1)
        metrics['Adjusted R-squared'] = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    return metrics

# 3.3 Clustering Metrics
def clustering_metrics(X, labels):
    # Evaluates clustering results: Silhouette Score, Calinski-Harabasz Index, Davies-Bouldin Index
    return {
        'Silhouette Score': silhouette_score(X, labels),  # Measures how similar an instance is to its own cluster vs. other clusters
        'Calinski-Harabasz Index': calinski_harabasz_score(X, labels),  # Ratio of the sum of between-cluster dispersion to within-cluster dispersion
        'Davies-Bouldin Index': davies_bouldin_score(X, labels)  # Average similarity ratio of each cluster with its most similar cluster
    }

# 3.4 Reports and Visualizations
def plot_confusion_matrix(y_true, y_pred):
    # Plots a confusion matrix to visualize classification performance
    cm = confusion_matrix(y_true, y_pred)
    plt.figure(figsize=(10,7))
    plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
    plt.title('Confusion Matrix')
    plt.colorbar()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()
    plt.show()
    
def plot_classification_report(y_true, y_pred):
    # Generates and visualizes a classification report as a heatmap
    import seaborn as sns  # heatmap comes from seaborn; matplotlib.pyplot has no heatmap function
    report = classification_report(y_true, y_pred, output_dict=True)
    df_report = pd.DataFrame(report).transpose()
    plt.figure(figsize=(10,7))
    # Drops the last row (weighted avg) and the last column (support) before plotting
    sns.heatmap(df_report.iloc[:-1, :-1], annot=True, cmap="Blues", fmt='.2f')
    plt.title('Classification Report')
    plt.ylabel('Classes')
    plt.xlabel('Metrics')
    plt.tight_layout()
    plt.show()

def plot_roc_curve(y_true, y_prob):
    # Plots the ROC curve for binary classification
    fpr, tpr, _ = roc_curve(y_true, y_prob)
    plt.figure()
    plt.plot(fpr, tpr)
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC Curve')
    plt.show()

def plot_pr_curve(y_true, y_prob):
    # Plots the Precision-Recall curve
    precision, recall, _ = precision_recall_curve(y_true, y_prob)
    plt.figure()
    plt.plot(recall, precision)
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.title('Precision-Recall Curve')
    plt.show()

def plot_learning_curve(estimator, X, y, cv=5):
    # Displays the learning curve showing training and cross-validation scores
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=-1, train_sizes=np.linspace(.1, 1.0, 5))
    
    plt.figure()
    plt.plot(train_sizes, np.mean(train_scores, axis=1), label='Training score')
    plt.plot(train_sizes, np.mean(test_scores, axis=1), label='Cross-validation score')
    plt.xlabel('Training examples')
    plt.ylabel('Score')
    plt.title('Learning Curve')
    plt.legend(loc="best")
    plt.show()

def plot_validation_curve(estimator, X, y, param_name, param_range, cv=5):
    # Plots the validation curve to visualize the effect of different parameter values
    train_scores, test_scores = validation_curve(
        estimator, X, y, param_name=param_name, param_range=param_range, cv=cv, scoring="accuracy", n_jobs=-1)
    
    plt.figure()
    plt.plot(param_range, np.mean(train_scores, axis=1), label="Training score")
    plt.plot(param_range, np.mean(test_scores, axis=1), label="Cross-validation score")
    plt.xlabel(param_name)
    plt.ylabel('Score')
    plt.title('Validation Curve')
    plt.legend(loc="best")
    plt.show()

# Example usage:
# class_metrics = classification_metrics(y_true, y_pred, y_prob)
# reg_metrics = regression_metrics(y_true, y_pred)
# clust_metrics = clustering_metrics(X, labels)
# plot_confusion_matrix(y_true, y_pred)
# plot_classification_report(y_true, y_pred)
# plot_roc_curve(y_true, y_prob)
# plot_pr_curve(y_true, y_prob)
# plot_learning_curve(estimator, X, y)
# plot_validation_curve(estimator, X, y, 'max_depth', range(1,10))
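
Adjusted R-squared corrects R-squared for the number of predictors p using 1 - (1 - R2)(n - 1)/(n - p - 1), where n is the number of evaluated samples. The sketch below works through the regression metrics on assumed synthetic data; the dataset and its parameters are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Assumed toy data: 5 informative predictors with Gaussian noise
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
y_pred = LinearRegression().fit(X_train, y_train).predict(X_test)

n, p = X_test.shape  # n evaluated samples, p predictors
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(f"MSE={mse:.2f} RMSE={rmse:.2f} MAE={mae:.2f}")
print(f"R^2={r2:.4f} adjusted R^2={adjusted_r2:.4f}")
```

The number of predictors has to come from the feature matrix, because it cannot be recovered from the prediction vector alone.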

**4. Model Interpretation**

In [6]:
from sklearn.inspection import permutation_importance, PartialDependenceDisplay 
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
import shap
import lime
from lime.lime_tabular import LimeTabularExplainer
import matplotlib.pyplot as plt

# Feature Importance
def get_feature_importance(pipeline, X, y):
    # Extracts feature importances or coefficients from the model, or uses permutation importance
    model = pipeline.named_steps['classifier'] if 'classifier' in pipeline.named_steps else pipeline.named_steps['regressor']
    
    if hasattr(model, 'feature_importances_'):
        return model.feature_importances_  # Feature importances for tree-based models
    elif hasattr(model, 'coef_'):
        return model.coef_  # Coefficients for linear models
    else:
        return permutation_importance(pipeline, X, y).importances_mean  # Permutation importance for other models

# Recursive Feature Elimination (RFE)
def perform_rfe(pipeline, X, y, n_features_to_select=5):
    # Selects features by recursively removing the least important ones
    model = pipeline.named_steps['classifier'] if 'classifier' in pipeline.named_steps else pipeline.named_steps['regressor']
    rfe = RFE(estimator=model, n_features_to_select=n_features_to_select)
    rfe.fit(X, y)
    return rfe.support_, rfe.ranking_  # Support (selected features) and ranking of features

# SHAP Values
def plot_shap_values(pipeline, X):
    # Plots SHAP values for the pipeline's final estimator
    # Assumes X is a DataFrame already in the form the estimator was trained on, and a single-output model;
    # multiclass output adds a class dimension that must be sliced before plotting
    step = 'classifier' if 'classifier' in pipeline.named_steps else 'regressor'
    explainer = shap.Explainer(pipeline.named_steps[step], X)
    shap_values = explainer(X)  # shap.Explanation with .values and .base_values
    shap.summary_plot(shap_values.values, X)  # Summary plot of SHAP values
    shap.dependence_plot(0, shap_values.values, X)  # Dependence plot for the first feature
    shap.force_plot(shap_values.base_values[0], shap_values.values[0], X.iloc[0])  # Force plot for the first instance
    if hasattr(explainer, 'shap_interaction_values'):  # Interaction values are only offered by tree explainers
        shap_interaction_values = explainer.shap_interaction_values(X)
        shap.summary_plot(shap_interaction_values, X)  # Summary plot of SHAP interaction values

# Partial Dependence Plots (PDP)
def plot_pdp(pipeline, X, features, target=None):
    # Plots Partial Dependence Plots to show effect of features on predictions
    # target: class to explain; required by scikit-learn for multiclass classifiers
    PartialDependenceDisplay.from_estimator(pipeline, X, features, target=target)
    plt.show()

# Individual Conditional Expectation (ICE) Plots
def plot_ice(pipeline, X, feature, target=None):
    # Plots ICE curves for a specific feature; kind='both' overlays the average (PDP) on the individual curves
    PartialDependenceDisplay.from_estimator(pipeline, X, [feature], kind='both', target=target)
    plt.show()

# Global Surrogate Models
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

def fit_global_surrogate(pipeline, X, y):
    # Fits a global surrogate model (Decision Tree) to approximate the predictions of the pipeline
    # The surrogate is trained on the pipeline's own predictions, so y is not used
    surrogate = DecisionTreeClassifier(max_depth=3) if 'classifier' in pipeline.named_steps else DecisionTreeRegressor(max_depth=3)
    surrogate.fit(X, pipeline.predict(X))  # Predict through the full pipeline so preprocessing is applied
    return surrogate

# LIME
def explain_with_lime(pipeline, X, y, idx=0):
    # Uses LIME to explain a single instance by approximating the model locally with an interpretable model
    if 'classifier' in pipeline.named_steps:
        # LIME needs class probabilities in classification mode, so the classifier must support predict_proba
        class_names = [str(c) for c in pipeline.classes_]
        explainer = LimeTabularExplainer(X.values, feature_names=list(X.columns), class_names=class_names, mode='classification')
        predict_fn = pipeline.predict_proba
    else:
        explainer = LimeTabularExplainer(X.values, feature_names=list(X.columns), mode='regression')
        predict_fn = pipeline.predict
    exp = explainer.explain_instance(X.iloc[idx].values, predict_fn, num_features=5)
    exp.show_in_notebook()

# Example usage:
# importances = get_feature_importance(classification_pipeline, X_train_class, y_train_class)
# print("Feature Importances:", importances)

# rfe_support, rfe_ranking = perform_rfe(classification_pipeline, X_train_class, y_train_class)
# print("RFE Support:", rfe_support)
# print("RFE Ranking:", rfe_ranking)

# plot_shap_values(classification_pipeline, X_train_class)

# plot_pdp(classification_pipeline, X_train_class, [0, 1])  # Example for features at index 0 and 1
# plot_ice(classification_pipeline, X_train_class, 0)  # Example for feature at index 0

# surrogate_model = fit_global_surrogate(classification_pipeline, X_train_class, y_train_class)
# print("Global Surrogate Model:", surrogate_model)

# explain_with_lime(classification_pipeline, X_train_class, y_train_class, idx=0)  # Example for instance at index 0
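
Two of the techniques above need nothing beyond scikit-learn and can be sketched together: permutation importance on held-out data, and a global surrogate whose fidelity (agreement with the pipeline's predictions) is measured. The Iris data, the random forest and the depth-3 surrogate are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, stratify=data.target, random_state=0)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(random_state=0)),
])
pipeline.fit(X_train, y_train)

# Permutation importance: drop in score when one feature column is shuffled
result = permutation_importance(pipeline, X_test, y_test,
                                n_repeats=10, random_state=0)
ranking = sorted(zip(data.feature_names, result.importances_mean),
                 key=lambda item: item[1], reverse=True)

# Global surrogate: a shallow tree trained on the pipeline's predictions
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
surrogate.fit(X_train, pipeline.predict(X_train))
fidelity = accuracy_score(pipeline.predict(X_test), surrogate.predict(X_test))

for feature, importance in ranking:
    print(f"{feature}: {importance:.3f}")
print(f"surrogate fidelity on test data: {fidelity:.3f}")
```

Fidelity is measured against the pipeline's predictions rather than the true labels, because the surrogate is meant to explain the model, not the data.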


**5. Ensemble Methods**

In [4]:
from sklearn.ensemble import (BaggingClassifier, BaggingRegressor, 
                              AdaBoostClassifier, AdaBoostRegressor, 
                              StackingClassifier, StackingRegressor, 
                              VotingClassifier, VotingRegressor)
from xgboost import XGBClassifier, XGBRegressor
from lightgbm import LGBMClassifier, LGBMRegressor
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.svm import SVC, SVR
from sklearn.ensemble import (RandomForestClassifier, RandomForestRegressor, 
                              GradientBoostingClassifier, GradientBoostingRegressor)

# Bagging: 
bagging_classifier = BaggingClassifier(estimator=DecisionTreeClassifier()) # Combines multiple decision trees to improve classification performance by averaging their predictions.
bagging_regressor = BaggingRegressor(estimator=DecisionTreeRegressor()) # Combines multiple decision trees to improve regression performance by averaging their predictions.

# Boosting: 
adaboost_classifier = AdaBoostClassifier() # Boosts weak classifiers (like decision trees) by focusing on the mistakes of previous models.
adaboost_regressor = AdaBoostRegressor() # Boosts weak regressors to improve predictions by focusing on errors of previous models.

xgboost_classifier = XGBClassifier() # Implements gradient boosting with a more efficient and scalable approach, often providing high performance.
xgboost_regressor = XGBRegressor() # Implements gradient boosting for regression with high efficiency and performance.

lightgbm_classifier = LGBMClassifier() # Uses gradient boosting with a focus on speed and performance, especially for large datasets.
lightgbm_regressor = LGBMRegressor() # Uses gradient boosting for regression, optimized for efficiency and scalability.


# Stacking: 
stacking_classifier = StackingClassifier( # Combines predictions from multiple classifiers (RF, SVM, GB) using a logistic regression model as the final predictor.
    estimators=[('rf', RandomForestClassifier()),
                ('svm', SVC()),
                ('gb', GradientBoostingClassifier())],
    final_estimator=LogisticRegression()
)


stacking_regressor = StackingRegressor( # Combines predictions from multiple regressors (RF, SVR, GB) using linear regression as the final predictor.
    estimators=[('rf', RandomForestRegressor()),
                ('svr', SVR()),
                ('gb', GradientBoostingRegressor())],
    final_estimator=LinearRegression()
)

# Voting
voting_classifier = VotingClassifier( # Combines multiple classifiers (LR, RF, SVM) by taking a majority vote to make the final prediction.
    estimators=[('lr', LogisticRegression()),
                ('rf', RandomForestClassifier()),
                ('svm', SVC())],
    voting='hard'
)

voting_regressor = VotingRegressor( # Combines multiple regressors (LR, RF, SVR) by averaging their predictions to make the final prediction.
    estimators=[('lr', LinearRegression()),
                ('rf', RandomForestRegressor()),
                ('svr', SVR())]
)
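
Hard voting counts predicted labels, while soft voting averages predicted probabilities, which requires every member to expose `predict_proba` (for `SVC` this means `probability=True`). A hedged comparison on the Iris data, with scaling wrapped around the scale-sensitive members; all settings here are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

def build_voter(voting):
    """Builds a three-member voting ensemble; `voting` is 'hard' or 'soft'."""
    return VotingClassifier(
        estimators=[
            ('lr', make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
            ('rf', RandomForestClassifier(random_state=0)),
            # probability=True enables predict_proba, which soft voting needs
            ('svm', make_pipeline(StandardScaler(), SVC(probability=True, random_state=0))),
        ],
        voting=voting,
    )

scores = {voting: cross_val_score(build_voter(voting), X, y, cv=5).mean()
          for voting in ('hard', 'soft')}

for voting, score in scores.items():
    print(f"{voting} voting accuracy: {score:.3f}")
```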



__________________________________________________________________

In [9]:
from sklearn.datasets import load_iris, fetch_california_housing
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression, LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVC, SVR
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report, mean_squared_error

# Load datasets
classification_data = load_iris()
regression_data = fetch_california_housing()
X_class = classification_data.data
y_class = classification_data.target
X_reg = regression_data.data
y_reg = regression_data.target

# Split datasets
X_train_class, X_test_class, y_train_class, y_test_class = train_test_split(X_class, y_class, test_size=0.2, random_state=0)
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(X_reg, y_reg, test_size=0.2, random_state=0)

# Define pipelines for classification
classification_pipeline = Pipeline([
    ('preprocessor', ColumnTransformer([
        ('scaler', StandardScaler(), [0, 1, 2, 3])  # Assuming all features are numeric
    ])),
    ('classifier', LogisticRegression())  # Default model
])

# Define pipelines for regression
regression_pipeline = Pipeline([
    ('preprocessor', ColumnTransformer([
        ('scaler', StandardScaler(), list(range(X_train_reg.shape[1])))  # Scale all eight numeric features; columns not listed here would be dropped
    ])),
    ('regressor', LinearRegression())  # Default model
])

# Combine all models into a single pipeline with a model selector
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.multioutput import MultiOutputClassifier

class ModelSelector(BaseEstimator, TransformerMixin):
    def __init__(self, model):
        self.model = model
        
    def fit(self, X, y=None):
        self.model.fit(X, y)
        return self
    
    def predict(self, X):
        return self.model.predict(X)
    
    def score(self, X, y):
        return self.model.score(X, y)

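# Define a function to create a model pipeline
def create_model_pipeline(model, model_type):
    # Scale every column of the dataset that matches the model type
    # (Iris has 4 features, California housing has 8; unlisted columns would be dropped)
    X_ref = X_train_class if model_type == 'classifier' else X_train_reg
    return Pipeline([
        ('preprocessor', ColumnTransformer([
            ('scaler', StandardScaler(), list(range(X_ref.shape[1])))
        ])),
        (model_type, ModelSelector(model))
    ])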

# Create pipelines for all models
model_pipelines = {
    'Logistic Regression': create_model_pipeline(LogisticRegression(), 'classifier'),
    'Random Forest Classifier': create_model_pipeline(RandomForestClassifier(), 'classifier'),
    'Support Vector Machines': create_model_pipeline(SVC(), 'classifier'),
    'Gradient Boosting Classifier': create_model_pipeline(GradientBoostingClassifier(), 'classifier'),
    'K-Nearest Neighbors': create_model_pipeline(KNeighborsClassifier(), 'classifier'),
    'Naive Bayes': create_model_pipeline(GaussianNB(), 'classifier'),
    'Linear Regression': create_model_pipeline(LinearRegression(), 'regressor'),
    'Ridge Regression': create_model_pipeline(Ridge(), 'regressor'),
    'Lasso Regression': create_model_pipeline(Lasso(), 'regressor'),
    'Elastic Net': create_model_pipeline(ElasticNet(), 'regressor'),
    'Decision Tree Regressor': create_model_pipeline(DecisionTreeRegressor(), 'regressor'),
    'Random Forest Regressor': create_model_pipeline(RandomForestRegressor(), 'regressor'),
    'Gradient Boosting Regressor': create_model_pipeline(GradientBoostingRegressor(), 'regressor'),
    'Support Vector Regression': create_model_pipeline(SVR(), 'regressor')
}

# Train and evaluate classification models
# (selected by pipeline step name, since the dictionary keys do not all contain 'classifier')
for name, pipeline in model_pipelines.items():
    if 'classifier' in pipeline.named_steps:
        pipeline.fit(X_train_class, y_train_class)
        preds = pipeline.predict(X_test_class)
        print(f"{name} Classification Report:")
        print(classification_report(y_test_class, preds))

# Train and evaluate regression models
for name, pipeline in model_pipelines.items():
    if 'regressor' in pipeline.named_steps:
        pipeline.fit(X_train_reg, y_train_reg)
        preds = pipeline.predict(X_test_reg)
        print(f"{name} Mean Squared Error:")
        print(mean_squared_error(y_test_reg, preds))
