# Multi-Classifier Model Selection and Hyperparameter Tuning



> This project explores multiple classification models and optimizes their hyperparameters using GridSearchCV from scikit-learn. Various classifiers are compared based on different scoring functions to identify the best-performing model. The process includes dataset preprocessing, model selection, hyperparameter tuning, and evaluation using classification metrics and confusion matrices. The results highlight the most effective classifiers for the given dataset.



In [1]:
import warnings
warnings.filterwarnings('ignore')

**Importing libraries for ML**


> We start by preparing the environment for our machine learning workflow.
This involves importing essential libraries, loading the dataset *churn-analysis.csv*,
and defining parameters such as the test set fraction, the number of cross-validation folds, and the random state for reproducibility.





In [32]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
import seaborn as sns
from sklearn.linear_model import Perceptron
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report, ConfusionMatrixDisplay

In [3]:
ts = 0.3
random_state = 42
n_splits = 3

In [4]:
df = pd.read_csv('churn-analysis.csv', sep=',', header=0)

**Data exploration**


> We explore the dataset by inspecting its first rows, its summary statistics, the class counts of the target `Exited`, and the distribution of `Age`.



In [None]:
df.head()

In [None]:
df.describe()

In [None]:
df['Exited'].value_counts()

In [None]:
sns.boxplot(x=df['Age'])


**Define the models**

> This project defines multiple classification models, including Perceptron, KNN, Decision Tree, Random Forest, AdaBoost, and Naïve Bayes. Each model is paired with a set of hyperparameters for optimization using GridSearchCV. The models are evaluated using various scoring metrics such as accuracy, F1-score, recall, and precision.
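
> The `_macro` scorers average a metric over the classes with equal weight, so a rare class counts as much as a frequent one. The sketch below uses made-up toy labels (not the churn data) to contrast accuracy with macro recall.

```python
from sklearn.metrics import accuracy_score, recall_score

# Toy labels, invented for illustration: 8 negatives and 2 positives.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# A classifier that misses one of the two positives.
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

accuracy = accuracy_score(y_true, y_pred)

# Per-class recall: the fraction of each true class that was recovered.
recall_neg = 8 / 8  # all negatives found
recall_pos = 1 / 2  # one of two positives found
macro_recall_by_hand = (recall_neg + recall_pos) / 2

macro_recall = recall_score(y_true, y_pred, average="macro")

print(f"accuracy:     {accuracy:.2f}")      # 0.90
print(f"macro recall: {macro_recall:.2f}")  # 0.75
```

> Accuracy looks high because the majority class dominates, while macro recall exposes the missed positive. This is one reason to compare several scoring functions on a dataset whose classes may be imbalanced.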



In [5]:
model_lbl = ['ln', 'KNN', 'dt', 'rf', 'adb', 'nb']

models = {
    'ln': {
            'name': 'Linear Perceptron',
            'estimator': Perceptron(random_state = random_state),
            'param': {'class_weight': ['balanced', None], 'early_stopping': [True, False]}
          },
    'KNN': {
            'name': 'K Nearest Neighbour',
            'estimator': KNeighborsClassifier(),
            'param': {'n_neighbors': [*range(5,11)], 'weights': ['uniform', 'distance']}
           },
    'dt': {
            'name': 'Decision Tree',
            'estimator': DecisionTreeClassifier(random_state=random_state),
            'param': {'criterion': ['gini','entropy'], 'class_weight': ['balanced', None], 'max_depth': [*range(5,11)]}
          },
    'rf': {
            'name': 'Random Forest',
            'estimator': RandomForestClassifier(random_state=random_state),
            'param': [{'max_depth': [*range(4,10)],'n_estimators':[*range(10,60,10)]}]
          },
    'adb': {
            'name': 'AdaBoost',
            'estimator': AdaBoostClassifier(random_state=random_state),
            'param': {'n_estimators':[10,20,30,40,50], 'learning_rate':[0.2,0.5,0.75,1,1.25,1.5]}
          },
    'nb': {
            'name': 'Gaussian Naive Bayes',
            'estimator': GaussianNB(),
            'param': [{'var_smoothing': [10**exp for exp in range(-3,-13,-1)]}]
          }

}

scoring = ['accuracy', 'f1_macro', 'recall_macro', 'precision_macro']
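
> To get a feel for the cost of the search, the grids above can be counted. The sketch below copies only the parameter grids (estimators omitted) and uses a hypothetical helper, `n_candidates`, to count the combinations. It assumes the notebook's settings of 3 folds and 4 scoring metrics.

```python
from math import prod

# Parameter grids copied from the `models` dictionary (estimators omitted).
param_grids = {
    "ln": {"class_weight": ["balanced", None], "early_stopping": [True, False]},
    "KNN": {"n_neighbors": [*range(5, 11)], "weights": ["uniform", "distance"]},
    "dt": {"criterion": ["gini", "entropy"],
           "class_weight": ["balanced", None],
           "max_depth": [*range(5, 11)]},
    "rf": [{"max_depth": [*range(4, 10)], "n_estimators": [*range(10, 60, 10)]}],
    "adb": {"n_estimators": [10, 20, 30, 40, 50],
            "learning_rate": [0.2, 0.5, 0.75, 1, 1.25, 1.5]},
    "nb": [{"var_smoothing": [10**exp for exp in range(-3, -13, -1)]}],
}


def n_candidates(grid):
    """Count parameter combinations in a dict grid or a list of dict grids."""
    grids = grid if isinstance(grid, list) else [grid]
    return sum(prod(len(values) for values in g.values()) for g in grids)


sizes = {lbl: n_candidates(grid) for lbl, grid in param_grids.items()}
total_candidates = sum(sizes.values())

n_splits, n_scorings = 3, 4  # values used in this notebook
cv_fits = total_candidates * n_splits * n_scorings

print(sizes)
print(total_candidates, cv_fits)
```

> The grids contain 110 candidates in total, which gives 1,320 cross-validation fits over the four scoring metrics. With the default `refit=True`, each of the 24 searches also refits its best candidate once on the full training set.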

**Model Training and Evaluation Setup**


> The features are separated from the target `Exited`, and the data is split into training and test sets for evaluation. A Stratified K-Fold cross-validator is used so that every fold preserves the class proportions of the training data. An empty results DataFrame is initialized to store performance metrics for different models after hyperparameter tuning.



In [6]:
X = df.drop('Exited', axis=1)
y = df['Exited']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=random_state, test_size=ts)
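
> As a quick check of what `test_size=0.3` does, the sketch below splits synthetic labels (an assumption for illustration, not the churn data). It also shows the optional `stratify` argument, which the notebook does not use, as one way to keep the class ratio identical in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 80 negatives and 20 positives, one dummy feature.
y = np.array([0] * 80 + [1] * 20)
X = np.arange(len(y)).reshape(-1, 1)

# Plain split, as in the notebook: 30% of the rows go to the test set.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# Optional variation (not in the original): preserve the class ratio.
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(len(y_tr), len(y_te))        # 70 30
print(y_tr_s.sum(), y_te_s.sum())  # 14 6
```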

In [7]:
skf = StratifiedKFold(n_splits=n_splits, random_state=random_state, shuffle=True)
clfs = []
results = pd.DataFrame(columns=['scoring',
                                'model',
                                'best_params',
                                'accuracy',
                                'precision_macro',
                                'recall_macro',
                                'f1_macro'])
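
> The effect of stratification can be seen on synthetic labels (an assumption for illustration, not the churn data): each validation fold keeps the overall share of positives.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic labels: 72 negatives and 18 positives (20% positives overall).
y = np.array([0] * 72 + [1] * 18)
X = np.zeros((len(y), 1))  # the features do not influence the split

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

fold_rates = []
for train_idx, val_idx in skf.split(X, y):
    rate = y[val_idx].mean()
    fold_rates.append(rate)
    print(f"validation size={len(val_idx)}, positive rate={rate:.2f}")
```

> Each of the three validation folds holds 30 samples with a 0.20 positive rate, matching the full label vector.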


**Hyperparameter Tuning and Model Evaluation**



> The code iterates over multiple scoring metrics and classifiers, applying Grid Search Cross-Validation to find the best hyperparameters. Each trained model is evaluated on the test set, and key performance metrics (accuracy, precision, recall, and F1-score) are stored in a results DataFrame. This process helps identify the most effective classifier and parameter combination for the dataset.



In [10]:
for score in scoring:
  for lbl in model_lbl:
    # Tune one model for one scoring metric with stratified cross-validation.
    clf = GridSearchCV(models[lbl]['estimator'],
                       param_grid=models[lbl]['param'],
                       scoring=score,
                       cv=skf,
                       return_train_score=False)
    clf.fit(X_train, y_train)
    clfs.append(clf)  # appended in the same order as the rows of `results`
    # Evaluate the refitted best estimator on the held-out test set.
    y_predict = clf.predict(X_test)
    cr = classification_report(y_test, y_predict, output_dict=True)
    results.loc[len(results)] = [score, models[lbl]['name'],
                                clf.best_params_,
                                cr['accuracy'],
                                cr['macro avg']['precision'],
                                cr['macro avg']['recall'],
                                cr['macro avg']['f1-score']
                                ]
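
> A single iteration of this loop can be sketched on synthetic data. The dataset from `make_classification` and the small `max_depth` grid are assumptions for illustration. The point is the difference between `best_score_` (the mean cross-validated score on the training data) and the test-set metrics that the notebook stores in `results`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, mildly imbalanced data standing in for the churn dataset.
X, y = make_classification(n_samples=300, n_features=8,
                           weights=[0.8, 0.2], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={"max_depth": [2, 4, 6]},  # illustrative grid
    scoring="f1_macro",
    cv=skf,
)
search.fit(X_train, y_train)

print(search.best_params_)            # chosen by cross-validation
print(round(search.best_score_, 3))   # mean CV score of the best candidate
test_pred = search.predict(X_test)    # uses the refitted best estimator
```

> `search.cv_results_` holds the per-candidate scores if a closer look at the grid is needed.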

**Displaying Model Performance**



> The evaluation results are displayed for each scoring metric. The results DataFrame is filtered to the models tuned for the current metric and sorted in descending order of the corresponding test-set score, so the top-performing models for each criterion appear first. Only the top five rows are shown.



In [None]:
for score in scoring:
  print(f"Results for scoring **{score}**")
  display(results[results['scoring'] == score].sort_values(by=[score], ascending=False).head())
  print('\n')

**Confusion Matrix for Best Models**

> For each scoring metric, the best-performing model is identified by selecting the row with the highest score in the results DataFrame. The Confusion Matrix is then displayed for this model using ConfusionMatrixDisplay.from_estimator, providing a visual representation of classification performance. The plot title includes the scoring metric and the corresponding best model.
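
> The selection step relies on `idxmax` returning the row label of the best score, and on that label matching the position of the fitted search in `clfs`. A minimal sketch with placeholder scores (invented only to show the mechanics, not real results) and hypothetical names `toy_results` and `toy_clfs`:

```python
import pandas as pd

# Placeholder scores, invented purely for illustration.
toy_results = pd.DataFrame({
    "scoring": ["accuracy", "accuracy", "f1_macro", "f1_macro"],
    "model": ["Model A", "Model B", "Model A", "Model B"],
    "accuracy": [0.80, 0.85, 0.78, 0.84],
    "f1_macro": [0.70, 0.72, 0.75, 0.71],
})
# Stand-in for the fitted searches, appended in the same order as the rows.
toy_clfs = ["search_0", "search_1", "search_2", "search_3"]

best = {}
for score in ["accuracy", "f1_macro"]:
    mask = toy_results["scoring"] == score
    # idxmax returns the index label of the highest value in the filtered column.
    best_row = toy_results.loc[mask, score].idxmax()
    best[score] = best_row
    print(score, "->", toy_results.at[best_row, "model"], toy_clfs[best_row])
```

> The label doubles as a list index only because the DataFrame keeps its default integer index and both containers grow together. Re-running the tuning cell without resetting both would break that alignment.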

In [None]:
for score in scoring:
    # Row label of the best test-set score among the models tuned for this metric.
    # Rows of `results` and entries of `clfs` were appended together, so the label is also a list index.
    best_row = results.loc[results['scoring'] == score, score].idxmax()
    disp = ConfusionMatrixDisplay.from_estimator(estimator=clfs[best_row], X=X_test, y=y_test)
    disp.ax_.set_title("Best Model for {}: {}".format(score, results.at[best_row, 'model']))
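
> The numbers behind such a plot can be reproduced with `confusion_matrix`. The labels below are made up for illustration. Rows are true classes and columns are predicted classes.

```python
from sklearn.metrics import confusion_matrix

# Toy labels, invented for illustration.
y_true = [0, 0, 0, 1, 1, 1, 0, 1]
y_pred = [0, 0, 1, 1, 0, 1, 0, 1]

cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()  # binary case: row-major order of the 2x2 matrix

recall_pos = tp / (tp + fn)     # share of true positives that were found
precision_pos = tp / (tp + fp)  # share of predicted positives that are correct

print(cm)
print(recall_pos, precision_pos)  # 0.75 0.75
```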