<p style="text-align: justify; line-height: 2.0; font-size: 14px; padding-right: 100px;">
    <b>Author:</b> Yap Jheng Khin
</p>
<p style="text-align: justify; line-height: 2.0; font-size: 14px; padding-right: 100px;">
    This notebook continues from Part I, which is available <a href="https://github.com/polarBearYap/speeddating_AI">here</a>.
    I have also discovered several mistakes in Part I, so Part II serves as an <b>improvement</b> and postmortem.
</p>
<p style="text-align: justify; line-height: 2.0; font-size: 14px; padding-right: 100px;">
    The mistakes I made in Part I are:
</p>
<ol>
    <li style="line-height: 2.0; font-size: 14px;">Preprocessing was fitted on the whole dataset, which causes train-test contamination.</li>
    <li style="line-height: 2.0; font-size: 14px;">Plain cross validation was performed instead of nested cross validation.</li>
</ol>
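
To make the first mistake concrete, here is a minimal sketch on synthetic data (not the speed dating dataset; all variable names are illustrative) of how fitting a scaler on the whole dataset lets the test rows influence the statistics that are later applied to the training rows:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 1))
X_demo[80:] += 5.0  # the last 20 rows play the role of a shifted test set
X_tr, X_te = X_demo[:80], X_demo[80:]

# Contaminated: the scaler sees the test rows while learning its mean
leaky_scaler = StandardScaler().fit(X_demo)
# Clean: the scaler learns its mean from the training rows only
clean_scaler = StandardScaler().fit(X_tr)

# The gap shows how much the test rows changed the learned statistic
leak_gap = abs(leaky_scaler.mean_[0] - clean_scaler.mean_[0])
print(f'Difference in learned mean: {leak_gap:.3f}')
```

Placing the scaler and imputers inside a pipeline avoids this, because they are then refitted on the training portion of every split.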
<p style="text-align: justify; line-height: 2.0; font-size: 14px; padding-right: 100px;">
    My learning expectations for Part II are:
</p>
<ol>
    <li style="line-height: 2.0; font-size: 14px;">Discover various ways to detect correlated features.</li>
    <li style="line-height: 2.0; font-size: 14px;">Perform feature selection to reduce model complexity.</li>
    <li style="line-height: 2.0; font-size: 14px;">Apply nested cross validation in areas like hyperparameter tuning.</li>
    <li style="line-height: 2.0; font-size: 14px;">Discover XAI techniques that can be used in explaining black box models.</li>
</ol>

<p style="text-align: justify; line-height: 2.0; font-size: 14px; padding-right: 100px;">
    List of dependencies used:
</p>
<ol>
    <li style="line-height: 2.0; font-size: 14px;">tqdm</li>
    <li style="line-height: 2.0; font-size: 14px;">catboost</li>
    <li style="line-height: 2.0; font-size: 14px;">xgboost</li>
    <li style="line-height: 2.0; font-size: 14px;">seaborn 0.11.0</li>
    <li style="line-height: 2.0; font-size: 14px;">alibi</li>
</ol>

In [None]:
!pip install tqdm
!pip install 'seaborn == 0.11.0'
!pip install xgboost
!pip install catboost
!pip install alibi

In [None]:
import time
from itertools import product
from math import ceil

import ast
import numpy as np
import pandas as pd
import pickle
import re
import seaborn as sns
import warnings
from catboost import CatBoostClassifier
from matplotlib import pyplot as plt
from scipy.cluster import hierarchy
from scipy.stats import spearmanr
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.compose import make_column_transformer
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    ExtraTreesClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
    VotingClassifier,
)
from sklearn.exceptions import ConvergenceWarning
from sklearn.feature_selection import RFECV
from sklearn.impute import SimpleImputer
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    RocCurveDisplay,
    auc,
    average_precision_score,
    precision_recall_curve,
    roc_auc_score,
    roc_curve,
)
from sklearn.model_selection import (
    RandomizedSearchCV,
    StratifiedKFold,
    StratifiedShuffleSplit,
    cross_val_predict,
    cross_validate,
)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from tqdm import tqdm
from xgboost import XGBClassifier

In [None]:
RANDOM_SEED = 42

# Set the default font size of all matplotlib plots
plt.rcParams.update({'font.size': 12})

# Set the display option of pandas objects
pd.set_option('display.max_rows', 150)
pd.set_option('display.max_columns', 150)

The functions below are needed to pickle objects later on to cut down total execution time.

In [None]:
def dump_objects(file_name, *objects):
    with open(f'{file_name}.sav', 'wb') as file:
        for obj in objects:
            pickle.dump(obj, file)


def load_objects(file_name, num_objects=1):
    objects = []
    with open(f'../input/pickles/speeddating_pickles/{file_name}.sav', 'rb') as file:
        while num_objects > 0:
            objects.append(pickle.load(file))
            num_objects -= 1
    return objects
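
As a quick illustration of the pattern used by the two helpers above, the sketch below writes several objects into one file and reads them back in the same order. The names `dump_many` and `load_many` are hypothetical stand-ins that take a full path, so that the example can run in a temporary directory instead of the notebook's input folder.

```python
import os
import pickle
import tempfile


def dump_many(path, *objects):
    # Write each object as its own pickle record, one after another
    with open(path, 'wb') as file:
        for obj in objects:
            pickle.dump(obj, file)


def load_many(path, num_objects=1):
    # Read back exactly num_objects records, in the order they were written
    with open(path, 'rb') as file:
        return [pickle.load(file) for _ in range(num_objects)]


with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo.sav')
    dump_many(demo_path, {'auc': 0.8}, [1, 2, 3])
    restored = load_many(demo_path, num_objects=2)
```

Note that, like `load_objects`, the loader always returns a list, so a single object still has to be unpacked, for example with `[obj] = load_many(path)`.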

# Get Data

<p style="text-align: justify; line-height: 2.0; text-indent: 5%; font-size: 14px; padding-right: 100px;">
The dataset we have chosen is SpeedDating from <a href="https://www.openml.org/d/40536">OpenML</a>. It contains 8,378 records collected from experimental speed dating events held between 2002 and 2004. Each participant had a 4-minute "first date" with a participant of the opposite sex. After each date, the participants were asked to rate their likelihood of seeing their partner again (between 0 and 10) and to rate their partner on 6 subjective attributes: Attractiveness, Sincerity, Intelligence, Fun, Ambition, and Shared Interests. According to our preliminary analysis, these subjective ratings, the ratings of self-perception, the actual age and whether the pair matched will form the basis for our inquiry.
</p>

In [None]:
BANK_DATA_URL = 'https://raw.githubusercontent.com/polarBearYap/speeddating_AI/main/datasets/speed_dating.csv'
FILE_PATH = '../input/speed-dating/speeddating.csv'


def fetch_data_from_website(path):
    return pd.read_csv(path, low_memory=False)

In [None]:
dating = fetch_data_from_website(FILE_PATH)
dating.head()

# Data Exploration and Problem Understanding

<p style="text-align: justify; line-height: 2.0; text-indent: 5%; font-size: 14px; padding-right: 100px;">
The dataset consists of 8378 samples gathered from participants in experimental speed dating events from 2002 until 2004. The output indicates whether the user and the given partner matched. There are 122 input attributes and 1 output attribute, for a total of 123 attributes. Of these, 7 attributes are of integer type and the remaining 116 are of object type.
</p>

In [None]:
# List of all attributes
list(dating.columns.values)

In [None]:
dating.info()

<p style="text-align: justify; line-height: 2.0; text-indent: 5%; font-size: 14px; padding-right: 100px;">
In total, 56 features in the dataset have already undergone preprocessing. For example, 'd_importance_same_race' is a discretized version of 'importance_same_race'. Features with the prefix 'd_' hold such discretized values and should be filtered out of the raw dataset during data preprocessing. However, we do want to keep one of them, 'd_age', since the difference in age between the user and the partner is quite important in dating in our opinion.
</p>

In [None]:
# Collect every feature whose name starts with the 'd_' prefix, except 'd_age'
preprocessed_features = [feature for feature in dating.columns
                         if feature.lower().startswith('d_')]
preprocessed_features.remove('d_age')
print(
    f'Amount of remaining preprocessed features: {len(preprocessed_features)}')

<p style="text-align: justify; line-height: 2.0; text-indent: 5%; font-size: 14px; padding-right: 100px;">
The dataset also contains several irrelevant features, such as 'has_null', which indicates whether the sample contains null values. Features prefixed with 'expected_' capture the expectations of the users towards their partners, and the users' self-interest features such as sports and movies can be considered subjective; both groups could be dropped from the dataset. The users' fields of study would also not be considered. Moreover, the attributes containing '_o' represent the opinions of the given partners, which are also unrelated. For instance, 'pref_o_attractive' is the importance the partner places on attractiveness. Hence, feature selection should be performed during data preprocessing in order to filter out all irrelevant attributes.
</p>

In [None]:
irrelevant_features = ['has_null',
                       'wave',
                       'expected_happy_with_sd_people',
                       'expected_num_interested_in_me',
                       'expected_num_matches',
                       'field',
                       'decision']

self_interest_feature = ['sports',
                         'tvsports',
                         'exercise',
                         'dining',
                         'museums',
                         'art',
                         'hiking',
                         'gaming',
                         'clubbing',
                         'reading',
                         'tv',
                         'theater',
                         'movies',
                         'concerts',
                         'music',
                         'shopping',
                         'yoga']

partner_features = ['age_o',
                    'race_o',
                    'pref_o_attractive',
                    'pref_o_sincere',
                    'pref_o_intelligence',
                    'pref_o_funny',
                    'pref_o_ambitious',
                    'pref_o_shared_interests',
                    'attractive_o',
                    'sinsere_o',
                    'intelligence_o',
                    'funny_o',
                    'ambitous_o',
                    'shared_interests_o']

# Data Cleaning & Preprocessing

<p style="text-align: justify; line-height: 2.0; font-size: 14px; padding-right: 100px;">
    First, we are going to drop <b>decision</b> and <b>decision_o</b>, since these two features will cause
    <a href="https://www.kaggle.com/alexisbcook/data-leakage">data leakage</a> for our model.
</p>

In [None]:
dating = dating.drop(columns=['decision', 'decision_o'], axis=1)

<p style="text-align: justify; line-height: 2.0; font-size: 14px; padding-right: 100px;">
    Next, we are going to remove the preprocessed features since they are highly correlated with the raw features,
    which complicates the final model.
</p>

In [None]:
dating = dating.drop(columns=preprocessed_features, axis=1)

<p style="text-align: justify; line-height: 2.0; font-size: 14px; padding-right: 100px;">
    Third, we are going to drop the has_null feature since it is irrelevant to the prediction.
</p>

In [None]:
dating = dating.drop(columns='has_null', axis=1)

## Data cleaning pipeline

In [None]:
# Use a custom transformer for data preprocessing
class DataCleaner(BaseEstimator, TransformerMixin):

    def __init__(self, y_feature):
        self.y_feature = y_feature
        self.features_with_wrong_data_type = []
        self.numerical_features = []
        self.categorical_features = []
        self.features_with_invalid_value = []
        self.one_hot_features = []
        self.invalid_values = set()

    # Getter for numerical features
    def getNumericalFeatures(self):
        return self.numerical_features

    # Getter for categorical features
    def getCategoricalFeatures(self):
        return self.categorical_features

    # Getter for collected invalid values
    def getInvalidValues(self):
        return self.invalid_values

    # Detect integer value in data using regex/regular expression
    def detect_int_value(self, data):
        return np.any(data.astype(str).str.contains(r'^\d+$', regex=True))

    # Detect float value in data using regex/regular expression
    def detect_float_value(self, data):
        return np.any(data.astype(str).str.contains(r'^-?\d+\.\d+$|^\d+$', regex=True))

    # Detect invalid integer value in data using regex/regular expression
    def get_invalid_int_value(self, data):
        return ', '.join(data[~data.astype(str).str.contains(r'^\d+$', regex=True)]
                         .value_counts().index.to_list())

    # Detect invalid float value in data using regex/regular expression
    def get_invalid_float_value(self, data):
        return ', '.join(data[~data.astype(str).str.contains(r'^-?\d+\.\d+$|^\d+$', regex=True)]
                         .value_counts().index.to_list())

    def drop_rows_with_unknow_values(self, data, feature):
        return data[~data[feature].isna()]

    def find_invalid_values(self, data):
        # Iterates all columns in the dating dataset and detect data types automatically
        for feature in data.columns.values:

            # Check if the features casted as object should be casted with float
            if data[feature].dtype == 'object':
                # If the features should be casted with float, flag the feature as 'features_with_wrong_data_type'
                if self.detect_float_value(data[feature]):
                    data[feature] = data[feature].astype(
                        'float64', errors='ignore')
                    invalid_value = self.get_invalid_float_value(data[feature])
                    # If invalid values are found, flag the feature as 'features_with_invalid_value'
                    if invalid_value != '':
                        self.invalid_values.add(invalid_value)
                        self.features_with_invalid_value.append(feature)
                    self.features_with_wrong_data_type.append(feature)
                # If the feature is actually categorical, flag the feature as 'categorical_features'
                else:
                    self.categorical_features.append(feature)

            # Check for invalid integer value in numerical columns with 'int64' datatype
            if data[feature].dtype == 'int64':
                invalid_value = self.get_invalid_int_value(data[feature])
                if invalid_value != '':
                    self.invalid_values.add(invalid_value)
                    self.features_with_invalid_value.append(feature)
                data[feature] = data[feature].astype('float64', errors='raise')
                self.numerical_features.append(feature)

            # Check for invalid integer value in numerical columns with 'float64' datatype
            elif data[feature].dtype == 'float64':
                invalid_value = self.get_invalid_float_value(data[feature])
                if invalid_value != '':
                    self.invalid_values.add(invalid_value)
                    self.features_with_invalid_value.append(feature)
                self.numerical_features.append(feature)

    def fit(self, data, y=None):

        # Detect any numerical features casted with 'object' data type and with invalid values
        self.find_invalid_values(data)

        return self

    def transform(self, data, y=None):

        # Replace '?' value with NaN
        data = data.replace(r'^\?$', np.nan, regex=True)

        # Change numerical features with 'object' data type and change to 'float64'
        for feature in self.features_with_invalid_value:
            data[feature] = data[feature].astype('float64', errors='raise')

        # Add the fixed features back to numerical features (only once, so that
        # calling transform() repeatedly does not duplicate entries)
        self.numerical_features += [feature for feature in self.features_with_invalid_value
                                    if feature not in self.numerical_features]

        # Remove unwanted quotes: change values like ''Example'' to 'Example'
        for feature in self.categorical_features:
            for value in data[feature].value_counts().index:
                if re.search('^\'.+\'$', value.replace(' ', '')):
                    index = data[data[feature] == value].index
                    data.loc[index, feature] = value[1:-1]

        return data

<p style="text-align: justify; line-height: 2.0; font-size: 14px; padding-right: 100px;">
    We are now going to clean our data. The data cleaning pipeline automatically converts each
    column to a suitable data type based on the values it contains.
    Note that the dataset marks unknown values with '?';
    these are replaced with <i>NaN</i>.
</p>
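
The core of the cleaning step can be sketched on a toy frame as follows. The column values here are made up for illustration; only the '?' placeholder convention comes from the dataset.

```python
import numpy as np
import pandas as pd

# A toy frame where '?' marks an unknown value, as in the raw dataset
toy = pd.DataFrame({'age': ['21', '?', '30'],
                    'race': ['a', 'b', '?']})

# Replace cells that consist solely of '?' with NaN
cleaned = toy.replace(r'^\?$', np.nan, regex=True)

# A numeric column that was read as text can now be cast to float
cleaned['age'] = cleaned['age'].astype('float64')
print(cleaned.dtypes)
```

The anchored pattern `^\?$` only matches cells that are exactly '?', so a question mark inside a longer string is left untouched.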

In [None]:
cleaner = DataCleaner('match')
dating1 = cleaner.fit_transform(dating.copy())

In [None]:
print(f'Invalid values found: {cleaner.getInvalidValues()}')

<p style="text-align: justify; line-height: 2.0; font-size: 14px; padding-right: 100px;">
    Below are the lists of numerical and categorical features contained in the dataset.
    The <i>match</i> feature is omitted since it is the outcome we want to predict.
</p>

In [None]:
print('List of numerical features:')
num_attr = cleaner.getNumericalFeatures()
num_attr.remove('match')
num_attr

In [None]:
print('List of categorical features:')
cat_attr = cleaner.getCategoricalFeatures()
cat_attr

## Train-test split

<p style="text-align: justify; line-height: 2.0; text-indent: 5%; font-size: 14px; padding-right: 100px;">
    <b>StratifiedShuffleSplit</b> is used instead of a plain random split with sklearn.model_selection.train_test_split because the dataset is imbalanced: 83.53% of the samples are negative (no match) and 16.47% are positive (match). As a result, the proportion of 'match' is preserved in both the training and test sets after splitting.
</p>
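
The effect of stratification can be checked with a small sketch. The labels below are synthetic, chosen only to mimic a roughly 84/16 class imbalance.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic imbalanced labels: 84 negatives and 16 positives
y_demo = np.array([0] * 84 + [1] * 16)
X_demo = np.arange(100).reshape(-1, 1)

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X_demo, y_demo))

# The positive rate should be (almost) the same in both parts
train_rate = y_demo[train_idx].mean()
test_rate = y_demo[test_idx].mean()
print(f'positive rate: train={train_rate:.3f}, test={test_rate:.3f}')
```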

In [None]:
def split_data(X, y, n_splits=1, test_size=0.2, random_state=RANDOM_SEED):

    # split using stratified sampling
    split = StratifiedShuffleSplit(n_splits=n_splits, test_size=test_size,
                                   random_state=random_state)

    train_index, test_index = next(split.split(X, y))

    # Use positional indexing for both X and y, since the split returns positions
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    return X_train, X_test, y_train, y_test

In [None]:
Y_FEATURE = 'match'

X = dating1.copy().drop(Y_FEATURE, axis=1)
y = dating1[Y_FEATURE]

X_train, X_test, y_train, y_test = split_data(X, y, test_size=0.15)

print('Training and test dataset shapes:')
print(f'train X   : {X_train.shape}')
print(f'train y   : {y_train.shape}')
print(f'test X    : {X_test.shape}')
print(f'test y    : {y_test.shape}')

## Data preprocessor Pipeline

The data preprocessor pipeline is created as follows to allow easy integration with
 any models to form a complete training pipeline.

In [None]:
def generate_1_hot_attr(X=X, cat_attr=cat_attr):
    # Generate one-hot-encoded feature's names
    index = np.any(pd.isnull(X[cat_attr]), axis=1)
    X_cat = X.loc[~index, cat_attr].copy()
    one_hot_enc = OneHotEncoder(handle_unknown='ignore')
    one_hot_enc.fit(X_cat)
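    # get_feature_names was renamed to get_feature_names_out in scikit-learn 1.0,
    # so use whichever one this version provides
    get_names = getattr(one_hot_enc, 'get_feature_names_out', None) \
        or one_hot_enc.get_feature_names
    return one_hot_enc.categories_, get_names(cat_attr)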


def make_preprocess_pipeline(num_attr=num_attr, cat_attr=cat_attr):

    one_hot_attrs, _ = generate_1_hot_attr(cat_attr=cat_attr)

    # Impute missing categorical values with the most frequent value,
    # then one-hot encode them using OneHotEncoder
    categorical_pipleline = make_pipeline(
        SimpleImputer(strategy='most_frequent'),
        OneHotEncoder(categories=one_hot_attrs)
    )

    # Impute missing numerical values with the mean, then standardize them
    numerical_pipeline = make_pipeline(
        SimpleImputer(strategy='mean'),
        StandardScaler()
    )

    # Combine numerical_pipeline and categorical_pipleline
    preprocess_pipeline = make_column_transformer(
        (numerical_pipeline, num_attr),
        (categorical_pipleline, cat_attr),
        remainder='passthrough')

    return preprocess_pipeline


def make_training_pipeline(ml_model, num_attr=num_attr, cat_attr=cat_attr):

    training_pipeline = make_pipeline(
        make_preprocess_pipeline(num_attr, cat_attr),
        ml_model
    )

    return training_pipeline
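
The same structure can be tried on a toy frame to see what the preprocessor produces. This is a simplified sketch with made-up columns; it lets OneHotEncoder infer the categories itself instead of passing them in.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# One numerical and one categorical column, each with a missing value
toy = pd.DataFrame({'age': [21.0, np.nan, 30.0, 25.0],
                    'race': ['a', 'b', np.nan, 'a']})

preprocess = make_column_transformer(
    (make_pipeline(SimpleImputer(strategy='mean'), StandardScaler()), ['age']),
    (make_pipeline(SimpleImputer(strategy='most_frequent'),
                   OneHotEncoder(handle_unknown='ignore')), ['race']),
)

# One scaled numerical column plus one column per category
toy_out = preprocess.fit_transform(toy)
print(toy_out.shape)
```

Because the imputers and the scaler are steps of the pipeline, they are fitted only on the rows passed to `fit`, which is what prevents the train-test contamination described at the start.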

# Prepare Classifiers

List of classifiers chosen for model selection:
- Logistic Classifier
- Linear Support Vector Classifier
- K Neighbors Classifier
- Multi-layer Perceptron classifier
- Decision Tree
- Random forest Classifier
- Extra Tree Classifier
- AdaBoost Classifier
- Gradient Boosting Classifier
- Bagging Classifier
- CatBoost Classifier
- XGB Classifier

In [None]:
RANDOM_SEED = 42

short_names = ['log_reg', 'linear_svm', 'k_neighbors', 'neural_network',
               'decision_tree', 'rand_forest', 'extra_tree', 'ada_boost_cf',
               'gradient_b_cf', 'bagging_cf', 'catboost_cf', 'xg_boost']

names = ['Logistic Classifier', 'Linear Support Vector Classifier',
         'K Neighbors Classifier', 'Multi-layer Perceptron classifier',
         'Decision Tree', 'Random forest Classifier', 'Extra Tree Classifier',
         'AdaBoost Classifier', 'Gradient Boosting Classifier',
         'Bagging Classifier', 'CatBoost Classifier', 'XGBClassifier']

functions = [
    LogisticRegression(random_state=RANDOM_SEED, n_jobs=-1, max_iter=1000),
    LinearSVC(C=1, loss="hinge", random_state=RANDOM_SEED),
    KNeighborsClassifier(n_neighbors=20, n_jobs=-1),
    MLPClassifier(random_state=RANDOM_SEED, early_stopping=True),
    DecisionTreeClassifier(random_state=RANDOM_SEED),
    RandomForestClassifier(random_state=RANDOM_SEED, n_jobs=-1),
    ExtraTreesClassifier(random_state=RANDOM_SEED, n_jobs=-1),
    AdaBoostClassifier(random_state=RANDOM_SEED),
    GradientBoostingClassifier(random_state=RANDOM_SEED),
    BaggingClassifier(random_state=RANDOM_SEED, n_jobs=-1),
    CatBoostClassifier(random_seed=RANDOM_SEED, silent=True),
    XGBClassifier(random_state=RANDOM_SEED, n_jobs=-1)
]

classifiers_idx = {}
classifiers = {}

# Zip all classifiers together into a dictionary for convenient access
for idx, s_name, name, func in zip(range(len(names)), short_names, names, functions):
    classifiers_idx[idx] = {'name': name, 'func': func}
    classifiers[s_name] = {'name': name, 'func': func}

# Model Selection

## Phase 1: Performance Score

<p style="text-align: justify; line-height: 2.0; text-indent: 5%; font-size: 14px; padding-right: 100px;">
We are going to select a few top machine learning algorithms of different types by evaluating
the performance scores measured on nested cross-validated training sets. Nested cross validation is
used so that the scores are better estimates of the generalization error. The scoring metrics used
are F1 score, precision, recall and area under the ROC curve (AUC).
</p>
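
For reference, one common way to set up nested cross validation in scikit-learn is to wrap a hyperparameter search (the inner loop) inside an outer cross-validation loop. The sketch below uses synthetic data and an illustrative parameter grid; it is an assumption-laden example, not the exact procedure used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (GridSearchCV, StratifiedKFold,
                                     cross_val_score)

X_demo, y_demo = make_classification(n_samples=200, n_features=8,
                                     random_state=0)

# Inner loop: picks the best C on each outer training fold
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
# Outer loop: scores the tuned model on data the search never saw
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

search = GridSearchCV(LogisticRegression(max_iter=1000),
                      param_grid={'C': [0.1, 1.0, 10.0]},
                      scoring='roc_auc', cv=inner_cv)
nested_scores = cross_val_score(search, X_demo, y_demo,
                                scoring='roc_auc', cv=outer_cv)
print(f'{nested_scores.mean():.3f} +/- {nested_scores.std():.3f}')
```

The outer scores estimate how the whole procedure, tuning included, generalizes, rather than how one particular hyperparameter setting performs.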

In [None]:
def get_models_performance(models, X, y, n_splits,
                           scoring_metrics, num_attr=num_attr,
                           cat_attr=cat_attr, random_state=RANDOM_SEED):

    X = X.reset_index(drop=True)
    y = y.reset_index(drop=True)

    cv = StratifiedKFold(n_splits=n_splits,
                         shuffle=True, random_state=random_state)

    mean_cols = []
    std_cols = []
    for name in ['train', 'test']:
        mean_cols += [f'{name}_{metric}' for metric in scoring_metrics]
        std_cols += [f'{name}_{metric}_std' for metric in scoring_metrics]
    cols = mean_cols + std_cols

    results = {'model_name': [], 'duration': []}

    for col in cols:
        results[col] = []

    # Loop through all models
    for idx in range(len(models)):
        cf_name = models[idx]['name']

        print(f'{cf_name} has started...')
        # Count time to get the duration of the models
        start = time.time()

        ml_pipeline = make_training_pipeline(
            clone(models[idx]['func']), num_attr, cat_attr)
        # cross_validate returns both train_score and test_score by setting return_train_score to True
        cv_scores = cross_validate(ml_pipeline, X, y,
                                   scoring=scoring_metrics, cv=cv,
                                   return_train_score=True)

        end = time.time()
        duration = end - start
        print(f'{cf_name} ended in {duration} seconds.\n')

        updateRecord(results, cv_scores, mean_cols, std_cols,
                     cf_name, duration, scoring_metrics)

    # Return as DataFrame instead of dictionary
    return pd.DataFrame(results)


# Append values to the dictionary based on key_name passed into the function
def updateRecord(df, scores, mean_cols, std_cols, model_name, duration, scoring_metrics):
    df['model_name'].append(model_name)
    df['duration'].append(duration)
    for mean_col, std_col in zip(mean_cols, std_cols):
        df[mean_col].append(np.mean(scores[mean_col]))
        df[std_col].append(np.std(scores[mean_col]))


def sortValues(df, cols, sort_idx, ascending=False):
    df = df.copy()
    try:
        cols.remove('model_name')
        cols.remove('duration')
    except ValueError:
        pass
    regex = '(?:^.+)(_after|_before)$'
    for col in cols:
        match = re.search(regex, col)
        if not match:
            col_std = f'{col}_std'
        elif match.group(1) == '_before':
            col_std = f'{col[:-7]}_std_before'
        else:
            col_std = f'{col[:-6]}_std_after'
        df[col] = df[col].astype('float64')
        df[col_std] = df[col_std].astype('float64')
        # Use positional access, since integer keys on a labelled Series are deprecated
        def display(row): return f'{row.iloc[0]:.4f} +/-{row.iloc[1]:.4f}'
        df[col] = df[[col, col_std]].apply(display, axis=1)
    cols = np.array(cols)
    sort_cols = list(cols[sort_idx]) if isinstance(sort_idx, list) \
        else [cols[sort_idx]]
    df = df[['model_name'] +
            list(cols)].sort_values(sort_cols, ascending=ascending)
    return df

### First Round: Train Model with 100% Features

In [None]:
scoring_metrics = ['f1', 'roc_auc', 'precision', 'recall']
warnings.filterwarnings(action='ignore', category=ConvergenceWarning)
results = get_models_performance(
    classifiers_idx, X_train, y_train, 5, scoring_metrics)

<p style="text-align: justify; line-height: 2.0; text-indent: 5%; font-size: 14px; padding-right: 100px;">
Based on the table below, all the classifiers were able to achieve a recall of more than 0.70. The ensemble methods Random Forest, Extra Trees Classifier and AdaBoost Classifier clearly overfit the training set, while Complement Naive Bayes and Quadratic Discriminant Analysis perform very poorly on precision.
</p>

<p style="text-align: justify; line-height: 2.0; font-size: 14px; padding-right: 100px;">
The <i>results</i> table shows the mean test score of each model after 5-fold cross validation.
</p>

In [None]:
roc = ['train_roc_auc', 'test_roc_auc']
sortValues(results, roc, 1)

In [None]:
f1 = ['train_f1', 'test_f1']
sortValues(results, f1, 1)

In [None]:
precision_recall = ['train_precision',
                    'test_precision', 'train_recall', 'test_recall']
sortValues(results, precision_recall, [1, 3])

In [None]:
sortValues(results, precision_recall, [3, 1])

### Second Round: Train Model with Features Selection

Our model is now very complicated. Let us see if we can cut down the number of features without sacrificing too much
classification accuracy.

<p style="text-align: justify; line-height: 2.0; font-size: 14px; padding-right: 100px;">
In general, instead of using dimensionality reduction techniques like PCA, we
are going to manually discard unimportant features, which we identify by visualizing them with the techniques listed below:
</p>

1. Impurity based importances
2. Permutation importances
3. Spearman rank-order correlation coefficient
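
As a minimal sketch of the first technique, the snippet below reads the impurity based importances (MDI) from a random forest fitted on synthetic data. With `shuffle=False`, `make_classification` places the informative features in the first columns, so they are expected to receive most of the importance.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# 2 informative features (columns 0 and 1) followed by 4 pure-noise features
X_demo, y_demo = make_classification(n_samples=300, n_features=6,
                                     n_informative=2, n_redundant=0,
                                     shuffle=False, random_state=0)

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_demo, y_demo)

# MDI importances are normalized so that they sum to 1
mdi = forest.feature_importances_
ranking = np.argsort(mdi)[::-1]
print(ranking, mdi.round(3))
```

MDI is computed from the training data alone, which is one reason to cross-check it with permutation importances on held-out data.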

In [None]:
def plot_impurity_based_importances(training_pipeline, X, y, feature_names, n_splits,
                                    fig_w, fig_h, random_state=RANDOM_SEED):

    X = X.reset_index(drop=True)
    y = y.reset_index(drop=True)

    feature_names = pd.Index(feature_names)

    avg_feature_importances = np.zeros(len(feature_names))

    cv = StratifiedKFold(n_splits=n_splits,
                         shuffle=True, random_state=random_state)

    with tqdm(total=100) as pbar:
        progress_unit = 100/(n_splits)
        for train_ix, _ in cv.split(X, y):
            X_train = X.loc[train_ix]
            y_train = y[train_ix]
            cur_model = clone(training_pipeline)
            cur_model.fit(X_train, y_train)
            avg_feature_importances += cur_model[-1].feature_importances_
            pbar.update(progress_unit)

    tree_feature_importances = avg_feature_importances / n_splits
    sorted_idx = tree_feature_importances.argsort()

    y_ticks = np.arange(0, len(feature_names))
    fig, ax = plt.subplots()
    ax.barh(y_ticks, tree_feature_importances[sorted_idx])
    # Set the tick positions first, then attach the labels to them
    ax.set_yticks(y_ticks)
    ax.set_yticklabels(feature_names[sorted_idx])
    ax.set_title("Random Forest Feature Importances (MDI)")
    fig.set_size_inches(fig_w, fig_h)
    ax.title.set_fontsize(16)
    plt.show()


def plot_permutation_importances(training_pipeline, X, y, feature_names, n_splits,
                                 n_repeats, plot_title, fig_w, fig_h,
                                 n_jobs=-1, random_state=RANDOM_SEED):

    X = X.reset_index(drop=True)
    y = y.reset_index(drop=True)

    avg_importances_mean = np.zeros(len(feature_names))
    avg_importances = np.zeros((len(feature_names), n_repeats))

    cv = StratifiedKFold(n_splits=n_splits,
                         shuffle=True, random_state=random_state)

    with tqdm(total=100) as pbar:
        progress_unit = 100/(n_splits)
        for train_ix, test_ix in cv.split(X, y):
            X_train = X.loc[train_ix]
            y_train = y[train_ix]
            X_test = X.loc[test_ix]
            y_test = y[test_ix]
            cur_model = clone(training_pipeline)
            cur_model.fit(X_train, y_train)
            result = permutation_importance(cur_model, X_test, y_test,
                                            n_repeats=n_repeats, scoring='roc_auc',
                                            random_state=random_state, n_jobs=n_jobs)
            avg_importances_mean += result.importances_mean
            avg_importances += result.importances
            pbar.update(progress_unit)

    avg_importances_mean /= n_splits
    avg_importances /= n_splits
    sorted_idx = avg_importances_mean.argsort()

    fig, ax = plt.subplots()
    # Order the labels the same way as the sorted importances
    ax.boxplot(avg_importances[sorted_idx].T,
               vert=False, labels=pd.Index(feature_names)[sorted_idx])
    ax.set_title(plot_title)
    fig.set_size_inches(fig_w, fig_h)
    ax.title.set_fontsize(16)
    plt.show()

Since the list of X columns after data preprocessing is not directly available, we have to build it ourselves.

In [None]:
_, one_hot_attrs = generate_1_hot_attr()
X_preprocessed_attr = list(num_attr) + list(one_hot_attrs)

It seems that field features are not important in the prediction.
But still, we have to use other means to confirm this hypothesis.

In [None]:
rand_forest_cf_pipeline = make_training_pipeline(
    classifiers['rand_forest']['func'])

plot_impurity_based_importances(rand_forest_cf_pipeline, X_train, y_train,
                                X_preprocessed_attr, 5, 8, 80)

<p style="text-align: justify; line-height: 2.0; text-indent: 5%; font-size: 14px; padding-right: 100px;">
According to <a href="https://christophm.github.io/interpretable-ml-book/feature-importance.html">Christoph Molnar (2020)</a>,
    permutation importances will not yield accurate measurements for features
with high correlation. This is because even if one of the features is permuted, information from the other
correlated features can still cover the loss of information from that feature. As a result, we need to use
a correlation metric such as the Spearman rank-order correlation coefficient to decide whether a feature that scores low
on permutation importance is really unimportant or is merely highly correlated with other features.
</p>
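
This effect can be reproduced on synthetic data. In the sketch below (an illustrative setup, not taken from the notebook), the same informative column is scored once on its own and once alongside an exact duplicate of itself:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
signal = rng.normal(size=400)
noise = rng.normal(size=400)
y_demo = (signal > 0).astype(int)

X_single = np.column_stack([signal, noise])
X_dup = np.column_stack([signal, signal, noise])  # column 1 duplicates column 0


def signal_importance(X):
    # Fit on the first 300 rows, then measure the importance of column 0
    # on the remaining 100 held-out rows
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X[:300], y_demo[:300])
    result = permutation_importance(model, X[300:], y_demo[300:],
                                    n_repeats=10, random_state=0)
    return result.importances_mean[0]


imp_single = signal_importance(X_single)
imp_dup = signal_importance(X_dup)
print(f'alone: {imp_single:.3f}, with duplicate: {imp_dup:.3f}')
```

When the duplicate is present, the forest can fall back on it after column 0 is shuffled, so the measured importance of column 0 shrinks even though the underlying signal is just as useful.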
<p style="text-align: justify; line-height: 2.0; font-size: 14px; padding-right: 100px;">
Based on the plot below, the unimportant features, grouped by type, are:
</p>
<table style="width:35%; float: left; display: inline-block;">
  <tr>
    <th>gender</th>
    <th>Example</th>
  </tr>
  <tr>
    <td>age-related features</td>
    <td>age, age_o, d_age</td>
  </tr>
  <tr>
    <td>unknown feature</td>
    <td>wave</td>
  </tr>
   <tr>
    <td>field</td>
    <td>field_sociology, field_money</td>
  </tr>
  <tr>
    <td>interest-related features</td>
    <td>shopping, music</td>
  </tr>
  <tr>
    <td>partner-related features</td>
    <td>intelligence_partner, funny_partner</td>
  </tr>
  <tr>
    <td>race-related features</td>
    <td>race, importance_same_race</td>
  </tr>
  <tr>
    <td>features about partner's preference</td>
    <td>pref_o_intelligence, pref_o_ambitious</td>
  </tr>
  <tr>
    <td>features about partner's rating on self</td>
    <td>intelligence_o, funny_o</td>
  </tr>
  <tr>
    <td>features about self's preference</td>
    <td>ambition_important, funny_important</td>
  </tr>
  <tr>
    <td>features about self's rating on herself/himself</td>
    <td>funny, intelligence</td>
  </tr>
</table>

In [None]:
plot_permutation_importances(rand_forest_cf_pipeline, X_train, y_train,
                             X_train.columns, 5, 8,
                             'Permutation Importances (nested cross validated)',
                             8, 85, RANDOM_SEED)

<p style="text-align: justify; line-height: 2.0; font-size: 14px; padding-right: 100px;">
We are going to visualize the Spearman rank-order correlation coefficient using a dendrogram and a heatmap, respectively.
The code is adapted from this <a href="https://scikit-learn.org/stable/auto_examples/inspection/plot_permutation_importance_multicollinear.html#sphx-glr-auto-examples-inspection-plot-permutation-importance-multicollinear-py">sklearn guide</a>.
</p>
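
Besides plotting it, the same linkage matrix can be cut into flat clusters to list which features move together. The following is a minimal sketch on synthetic data; the feature names and the choice of two clusters are assumptions made for the illustration, not values used in this notebook.

```python
import numpy as np
from scipy.cluster import hierarchy
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)

# Two independent signals, each with a near-duplicate column.
X_demo = np.column_stack([
    a, a + 0.01 * rng.normal(size=200),
    b, b + 0.01 * rng.normal(size=200),
])
feature_names = ['a', 'a_copy', 'b', 'b_copy']

# Spearman correlation matrix, then Ward linkage on its rows
# (the same two calls used by plot_dendro_corr).
corr, _ = spearmanr(X_demo)
linkage = hierarchy.ward(corr)

# Cut the tree into at most two flat clusters.
cluster_ids = hierarchy.fcluster(linkage, t=2, criterion='maxclust')

clusters = {}
for name, cluster_id in zip(feature_names, cluster_ids):
    clusters.setdefault(int(cluster_id), []).append(name)
print(clusters)
```

Features that land in the same cluster are candidates for keeping only one representative.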

In [None]:
def plot_dendro_corr(X, feature_names, fig_w, fig_h,
                     orientation='top', font_size=15,
                     rotation=90):
    fig, ax = plt.subplots(figsize=(fig_w, fig_h))
    corr = spearmanr(X).correlation
    corr_linkage = hierarchy.ward(corr)
    dendro = hierarchy.dendrogram(
        corr_linkage, labels=feature_names, ax=ax, leaf_rotation=rotation,
        leaf_font_size=font_size, orientation=orientation
    )
    fig.tight_layout()
    plt.show()


def plot_heatmap_corr_full(X, X_features, fig_w, fig_h, annot=False, enable_mask=True):

    fig, ax = plt.subplots(figsize=(fig_w, fig_h))

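    corr = X.corr(method='spearman')

    # Optionally hide the lower triangle (diagonal included); the matrix is
    # symmetric, so the upper triangle carries the same information.
    if enable_mask:
        mask = np.tril(np.ones_like(corr, dtype=bool))
    else:
        mask = False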
    sns.heatmap(corr, linewidths=0.1, linecolor='white',
                square=True, annot=annot, mask=mask,
                vmin=-1, vmax=1, center=0, ax=ax,
                xticklabels=True,
                yticklabels=True)

    fig.tight_layout()
    plt.tick_params(axis='both', which='minor', labelsize=15)
    plt.show()


def plot_heatmap_corr(X, X_features, selected_features,
                      fig_w, fig_h, annot=False):

    fig, ax = plt.subplots(figsize=(fig_w, fig_h))

    corr = X.corr(method='spearman')[
        selected_features].drop(index=selected_features)
    non_selected_features = corr.index
    x_axis = selected_features
    y_axis = non_selected_features
    if len(y_axis) < len(x_axis):
        corr = corr.T
        xticklabels = non_selected_features
        yticklabels = selected_features
    else:
        xticklabels = selected_features
        yticklabels = non_selected_features

    sns.heatmap(corr, linewidths=0.1, linecolor='white',
                square=True, annot=annot,
                vmin=-1, vmax=1, center=0, ax=ax,
                xticklabels=True,
                yticklabels=True)

    ax.set_xticklabels(xticklabels, rotation='vertical')
    ax.set_yticklabels(yticklabels, rotation='horizontal')
    fig.tight_layout()
    plt.tick_params(axis='both', which='minor', labelsize=15)
    plt.show()

Since we have more than 100 features, plotting the full correlation matrix in one go, as in the call below, would be unreadable.
Instead, we are going to break it down and analyze it piece by piece.
```
plot_heatmap_corr_full(X_train_imputed, X_preprocessed_attr, 80, 80)
```

*make_impute_pipeline* builds the pipeline used to preprocess *X_train* before calculating the Spearman correlation.

In [None]:
def make_impute_pipeline(num_attr=num_attr, cat_attr=cat_attr):

    # Categorical attributes: impute missing values with the most frequent
    # category, then one-hot encode
    categorical_pipeline = make_pipeline(
        SimpleImputer(strategy='most_frequent'),
        OneHotEncoder()
    )

    # Numerical attributes: impute missing values with the mean, then standardize
    numerical_pipeline = make_pipeline(
        SimpleImputer(strategy='mean'),
        StandardScaler()
    )

    # Combine numerical_pipeline and categorical_pipeline
    impute_pipeline = make_column_transformer(
        (numerical_pipeline, num_attr),
        (categorical_pipeline, cat_attr),
        remainder='passthrough')

    return impute_pipeline

**field**

<p style="text-align: justify; line-height: 2.0; font-size: 14px; padding-right: 100px;">
We suspect that <i>field</i> might not be significant in predicting whether there is a match, so let's check
whether the one-hot encoded <i>field</i> values correlate with any other features.
</p>

In [None]:
_, one_hot_attrs = generate_1_hot_attr(X_train, cat_attr)
X_preprocessed_attr = list(num_attr) + list(one_hot_attrs)

X_train_imputed = make_impute_pipeline(
    num_attr, cat_attr).fit_transform(X_train.copy())
X_train_imputed = pd.DataFrame(
    X_train_imputed.toarray(), columns=X_preprocessed_attr)
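
The `.toarray()` call above is needed because a `ColumnTransformer` returns a SciPy sparse matrix when the stacked output is sparse enough (controlled by its `sparse_threshold` parameter, 0.3 by default), and a dense array otherwise. A small guard makes the conversion work in both cases. `to_dense_frame` below is a hypothetical helper written for this illustration, not a function from this notebook.

```python
import numpy as np
import pandas as pd
from scipy import sparse


def to_dense_frame(transformed, columns):
    """Hypothetical helper: wrap transformer output in a DataFrame,
    densifying it first if it is a SciPy sparse matrix."""
    if sparse.issparse(transformed):
        transformed = transformed.toarray()
    return pd.DataFrame(transformed, columns=columns)


dense_input = np.array([[1.0, 0.0], [0.0, 1.0]])
sparse_input = sparse.csr_matrix(dense_input)

frame_from_dense = to_dense_frame(dense_input, ['x', 'y'])
frame_from_sparse = to_dense_frame(sparse_input, ['x', 'y'])
print(frame_from_dense.equals(frame_from_sparse))
```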

In [None]:
plot_heatmap_corr(X_train_imputed, X_preprocessed_attr,
                  one_hot_attrs, 20, 100)

<p style="text-align: justify; line-height: 2.0; font-size: 14px; padding-right: 100px;">
Based on the plot above, we can drop <i>field</i> since it has a low permutation importance score and
low correlation with other features.
</p>

In [None]:
dating2 = dating1.drop(columns='field', axis=1)
Y_FEATURE = 'match'

X2 = dating2.copy().drop(Y_FEATURE, axis=1)
y2 = dating2[Y_FEATURE]

X2_train, X2_test, y2_train, y2_test = split_data(X2, y2, test_size=0.15)

print('Dataset shapes:')
print(f'train X   : {X2_train.shape}')
print(f'train y   : {y2_train.shape}')
print(f'test X    : {X2_test.shape}')
print(f'test y    : {y2_test.shape}')

**interest-related features**

<p style="text-align: justify; line-height: 2.0; font-size: 14px; padding-right: 100px;">
From the permutation importances plot, we do notice that scores for interest-related features are
quite low. Let's see if we can drop them.
</p>

In [None]:
cat_attr2 = cat_attr[:-1]
_, one_hot_attrs2 = generate_1_hot_attr(X2_train, cat_attr2)
X2_preprocessed_attr = list(num_attr) + list(one_hot_attrs2)

X2_train_imputed = make_impute_pipeline(
    num_attr, cat_attr2).fit_transform(X2_train.copy())
X2_train_imputed = pd.DataFrame(X2_train_imputed, columns=X2_preprocessed_attr)

In [None]:
interest_features = ['sports', 'tvsports', 'exercise', 'dining', 'museums',
                     'art', 'hiking', 'gaming',
                     'clubbing', 'reading', 'tv', 'theater', 'movies',
                     'concerts', 'music', 'shopping', 'yoga',
                     'interests_correlate'
                     ]

plot_heatmap_corr(X2_train_imputed, X2_preprocessed_attr,
                  interest_features, 30, 80, True)

<p style="text-align: justify; line-height: 2.0; font-size: 14px; padding-right: 100px;">
Based on the plot above, we can drop the <i>interest-related features</i> since most of them
have low permutation importance scores and low or no meaningful correlation with other features.
</p>

In [None]:
dating3 = dating2.drop(columns=interest_features, axis=1)
Y_FEATURE = 'match'

X3 = dating3.copy().drop(Y_FEATURE, axis=1)
y3 = dating3[Y_FEATURE]

X3_train, X3_test, y3_train, y3_test = split_data(X3, y3, test_size=0.15)

print('Dataset shapes:')
print(f'train X   : {X3_train.shape}')
print(f'train y   : {y3_train.shape}')
print(f'test X    : {X3_test.shape}')
print(f'test y    : {y3_test.shape}')

In [None]:
cat_attr3 = cat_attr[:-1]
num_attr3 = num_attr[:36] + num_attr[54:]
_, one_hot_attrs3 = generate_1_hot_attr(X3_train, cat_attr3)
X3_preprocessed_attr = list(num_attr3) + list(one_hot_attrs3)

X3_train_imputed = make_impute_pipeline(
    num_attr3, cat_attr3).fit_transform(X3_train.copy())
X3_train_imputed = pd.DataFrame(X3_train_imputed, columns=X3_preprocessed_attr)

In [None]:
plot_heatmap_corr_full(
    X3_train_imputed, X3_preprocessed_attr, 25, 25, enable_mask=False)

<p style="text-align: justify; line-height: 2.0; font-size: 14px; padding-right: 100px;">
After painstakingly inspecting every feature in the heatmap, below are the features that have both
low permutation importance scores and low correlation with other features.
Therefore, we can confidently remove these features.
</p>

- wave
- d_age
- age
- age_o
- pref_o_intelligence
- pref_o_funny
- intellicence_important
- funny_important

<p style="text-align: justify; line-height: 2.0; text-indent: 5%; font-size: 14px; padding-right: 100px;">
Besides, we can also remove the race-related features. Each one-hot encoded column for self's race (identified
    by the prefix <i>race_</i>) correlates, negatively, only with the other self's-race columns. This is expected: if a person is
    Asian-American, then that person is not African American. The same applies to partner's race.
    Furthermore, there is no visible correlation (1) between self's and partner's race, or
    (2) between either race and importance_same_race or importance_same_religion.
</p>
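
The negative correlation among the one-hot columns of a single categorical feature is a by-product of the encoding rather than a signal in the data, since a row that is 1 in one column must be 0 in all the others. The sketch below shows this on a made-up categorical column; the category labels are placeholders and not the real values of `race`.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with three balanced categories.
race_demo = pd.Series(['group_a', 'group_b', 'group_c'] * 20, name='race')

# One-hot encode as floats, then compute the Spearman correlation matrix.
one_hot = pd.get_dummies(race_demo, prefix='race', dtype=float)
corr = one_hot.corr(method='spearman')

# Every off-diagonal entry compares two mutually exclusive indicator columns.
off_diagonal = corr.to_numpy()[~np.eye(len(corr), dtype=bool)]
print(corr.round(2))
```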

In [None]:
age_related_features = ['d_age', 'age', 'age_o']

race_related_features = ['race', 'race_o', 'samerace',
                         'importance_same_race', 'importance_same_religion']

features_to_be_dropped = ['wave', 'pref_o_intelligence',
                          'pref_o_funny', 'intellicence_important',
                          'funny_important'] + age_related_features + race_related_features

dating4 = dating3.drop(columns=features_to_be_dropped, axis=1)
Y_FEATURE = 'match'

X4 = dating4.copy().drop(Y_FEATURE, axis=1)
y4 = dating4[Y_FEATURE]

X4_train, X4_test, y4_train, y4_test = split_data(X4, y4, test_size=0.15)

print('Dataset shapes:')
print(f'train X   : {X4_train.shape}')
print(f'train y   : {y4_train.shape}')
print(f'test X    : {X4_test.shape}')
print(f'test y    : {y4_test.shape}')

In [None]:
cols4 = list(dating4.columns)
cat_attr4 = [cols4[0]]
num_attr4 = cols4[1:-1]
_, one_hot_attrs4 = generate_1_hot_attr(X4_train, cat_attr4)
X4_preprocessed_attr = list(num_attr4) + list(one_hot_attrs4)

X4_train_imputed = make_impute_pipeline(
    num_attr4, cat_attr4).fit_transform(X4_train.copy())
X4_train_imputed = pd.DataFrame(X4_train_imputed, columns=X4_preprocessed_attr)

<p style="text-align: justify; line-height: 2.0; text-indent: 5%; font-size: 14px; padding-right: 100px;">
Now, let's visualize the remaining features with a full heatmap and a dendrogram. Looks good to me ;)
However, as a perfectionist, I think we can still simplify our models while sacrificing as little
performance as possible. We are going to try <b>recursive feature elimination</b>. I still hope the sklearn
developers will introduce a <b>genetic algorithm</b> for feature selection though :(
</p>

In [None]:
plot_heatmap_corr_full(
    X4_train_imputed, X4_preprocessed_attr, 12, 12, enable_mask=False)

In [None]:
plot_dendro_corr(X4_train_imputed, X4_preprocessed_attr,
                 18, 10, orientation='top', font_size=15, rotation=90)

<p style="text-align: justify; line-height: 2.0; line-indent: 5.0%; font-size: 14px; padding-right: 100px;">
But first, let's compare the performance of models trained on the current subset of features with that of the
models trained earlier on the full set of features.
</p>

In [None]:
results1 = get_models_performance(classifiers_idx, X4_train, y4_train,
                                  5, scoring_metrics, num_attr4, cat_attr4)

<p style="text-align: justify; line-height: 2.0; line-indent: 5.0%; font-size: 14px; padding-right: 100px;">
To compare the results effectively, we are going to merge the two sets of results into a single dataframe.
</p>

In [None]:
sum_results = results.merge(
    results1, on='model_name', suffixes=('_before', '_after'))
sum_results.info()

In [None]:
roc2 = list(map(lambda elem: elem+'_before', roc))
roc2 += list(map(lambda elem: elem+'_after', roc))
roc2

In [None]:
sortValues(sum_results, roc2, [1, 3])

<p style="text-align: justify; line-height: 2.0; line-indent: 5.0%; font-size: 14px; padding-right: 100px;">
After comparing the results, we do not notice any significant improvement or deterioration in the models'
performance. This is because the removed features are not important in predicting the match outcome.
</p>

In [None]:
f1_2 = list(map(lambda elem: elem+'_before', f1))
f1_2 += list(map(lambda elem: elem+'_after', f1))
f1_2

In [None]:
sortValues(sum_results, f1_2, [1, 3])

In [None]:
try:
    precision_recall.remove('train_precision')
    precision_recall.remove('train_recall')
except ValueError:
    pass
precision_recall_2 = list(map(lambda elem: elem+'_before', precision_recall))
precision_recall_2 += list(map(lambda elem: elem+'_after', precision_recall))
precision_recall_2

In [None]:
sortValues(sum_results, precision_recall_2, [1, 3])

In [None]:
sortValues(sum_results, precision_recall_2, [3, 1])

### Recursive feature elimination

In [None]:
def recursive_feature_elimination(model, X, y, n_splits, scoring_metrics,
                                  num_attr, cat_attr, preprocessed_attr,
                                  random_state=RANDOM_SEED):

    X = X.reset_index(drop=True)
    y = y.reset_index(drop=True)

    cv = StratifiedKFold(n_splits=n_splits,
                         shuffle=True, random_state=random_state)

    rfecv = RFECV(clone(model['func']), step=1, cv=cv,
                  scoring=scoring_metrics, n_jobs=-1)
    ml_pipeline = make_training_pipeline(rfecv, num_attr, cat_attr)
    ml_pipeline.fit(X, y)

    print('Optimal number of features : {}\nDropped features: {}'.format(
        rfecv.n_features_,
        ', '.join(np.array(preprocessed_attr)[~rfecv.support_]))
    )
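    # Plot number of features vs. mean cross-validation score.
    # `grid_scores_` was removed in scikit-learn 1.2; newer releases expose
    # the same per-step means as `cv_results_['mean_test_score']`.
    if hasattr(rfecv, 'cv_results_'):
        cv_scores = rfecv.cv_results_['mean_test_score']
    else:
        cv_scores = rfecv.grid_scores_
    plt.figure()
    plt.xlabel("Number of features selected")
    plt.ylabel(f"Cross validation score ({scoring_metrics})")
    plt.plot(range(1, len(cv_scores) + 1), cv_scores)
    plt.show()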

    return rfecv
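
To see what `RFECV` does in isolation, here is a minimal sketch on synthetic data. It assumes a plain logistic regression and a dataset from `make_classification`; neither is the configuration used in this notebook. At each step the least important feature is dropped and the remaining subset is scored with cross-validation.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic data: 10 features, only 3 of which carry signal.
X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
selector = RFECV(LogisticRegression(max_iter=1000), step=1, cv=cv,
                 scoring='roc_auc')
selector.fit(X_demo, y_demo)

# support_ is a boolean mask over the input columns.
kept = [i for i, keep in enumerate(selector.support_) if keep]
print('Optimal number of features:', selector.n_features_)
print('Kept column indices:', kept)
```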

<p style="text-align: justify; line-height: 2.0; text-indent: 5%; font-size: 14px; padding-right: 100px;">
Hmmm, the results are quite inconsistent across different models. We can drop quite a few features for the
XGB Classifier, but we can't drop any feature for the Random Forest Classifier. Well, we have done our best,
I think :)
</p>

In [None]:
rfecv1 = recursive_feature_elimination(classifiers['xg_boost'],
                                       X4_train, y4_train, 5, 'roc_auc',
                                       num_attr4, cat_attr4, X4_preprocessed_attr)

In [None]:
rfecv2 = recursive_feature_elimination(classifiers['log_reg'],
                                       X4_train, y4_train, 5, 'roc_auc',
                                       num_attr4, cat_attr4, X4_preprocessed_attr)

In [None]:
rfecv3 = recursive_feature_elimination(classifiers['rand_forest'],
                                       X4_train, y4_train, 5, 'roc_auc',
                                       num_attr4, cat_attr4, X4_preprocessed_attr)

## Phase 2: Precision-Recall Curve

In [None]:
def plot_precision_vs_recall(classifier, cf_name, X, y, ax, method, n_splits,
                             num_attr, cat_attr, label=False, random_state=RANDOM_SEED):

    X = X.reset_index(drop=True)
    y = y.reset_index(drop=True)
    # Get out-of-fold y_scores using cross_val_predict rather than scores from
    # a model that has already seen the samples. With 'predict_proba', the
    # probabilities of both classes are returned (one column per class).
    ml_pipeline = make_training_pipeline(classifier, num_attr, cat_attr)
    cv = StratifiedKFold(n_splits=n_splits,
                         shuffle=True, random_state=random_state)
    y_scores_cv = cross_val_predict(ml_pipeline, X, y,
                                    cv=cv, method=method, n_jobs=-1)

    # Get the last columns of the y_scores only if more than one columns are detected
    if y_scores_cv.ndim > 1:
        y_scores_cv = y_scores_cv[:, -1]

    precisions, recalls, thresholds = precision_recall_curve(y, y_scores_cv)
    auc_score = average_precision_score(y, y_scores_cv)

    # Adjust settings for the plot (eg. set title of the plot)
    if label:
        label_name = cf_name
    else:
        label_name = None
    ax.plot(recalls, precisions, label=label_name)
    ax.set(xlabel='recall', ylabel='precision',
           title=f'PR Curve for {cf_name}')
    ax.title.set_fontsize(16)
    ax.grid()

    return precisions, recalls, thresholds, auc_score


def create_subplots(num_subplots, num_cols_per_row, fig_w, fig_h):
    num_rows = ceil(num_subplots / num_cols_per_row)
    indexes = list(product(range(num_rows), range(num_cols_per_row)))
    fig, axs = plt.subplots(num_rows, num_cols_per_row)
    fig.set_size_inches(fig_w, num_rows * fig_h)
    return num_rows, indexes, axs, fig


def plot_pr_curves(classifiers, X, y, n_splits, num_attr, cat_attr,
                   num_cols_per_row=4, fig_w=8, fig_h=6,
                   sameplot=False, random_state=RANDOM_SEED):
    num_rows, indexes, axs, fig = create_subplots(1 if sameplot else len(classifiers),
                                                  num_cols_per_row, fig_w, fig_h)
    auc_pr_curves = []

    with tqdm(total=100) as pbar:
        progress_unit = 100/len(classifiers)

        models_with_no_p_proba = ['Linear Support Vector Classifier']
        for idx, classifier in enumerate(classifiers.values()):
            method = 'predict_proba' if classifier['name'] not in models_with_no_p_proba else 'predict'
            ax = axs if num_cols_per_row == 1 or sameplot else axs[
                idx] if num_rows == 1 else axs[indexes[idx][0]][indexes[idx][1]]
            _, _, _, auc_score = plot_precision_vs_recall(classifier['func'], classifier['name'],
                                                          X, y, ax, method, n_splits, num_attr,
                                                          cat_attr, label=True,
                                                          random_state=random_state)
            auc_pr_curves.append(
                {'name': classifier['name'], 'auc_score': auc_score})
            pbar.update(progress_unit)

        ax.set(xlabel='recall', ylabel='precision',
               title=f'PR Curve for All Models')
        ax.title.set_fontsize(20)
        fig.legend(loc='upper right')
        fig.tight_layout()

    return auc_pr_curves

<p style="text-align: justify; line-height: 2.0; text-indent: 5%; font-size: 14px; padding-right: 100px;">
Based on the cross-validated precision-recall curves below, Complement Naive Bayes and Quadratic Discriminant Analysis are discarded since both performed even worse than a purely random dummy classifier, that is, their area under the curve is less than 0.5.
</p>

<p style="text-align: justify; line-height: 2.0; text-indent: 5%; font-size: 14px; padding-right: 100px;">
Combining all cross-validated precision-recall curves in one figure, as shown below, we can clearly observe that Decision Tree Classifier and K Neighbours Classifier have the lowest area under the curve compared to the rest of the models. Therefore, these models are later discarded in Phase 4.
</p>

In [None]:
auc_pr_curves = plot_pr_curves(
    classifiers, X4_train, y4_train, 5, num_attr4, cat_attr4, 1, 12, 8, sameplot=True)

<p style="text-align: justify; line-height: 2.0; font-size: 14px; padding-right: 100px;">
<b>Linear Support Vector Classifier</b> and <b>Decision Tree</b> have a very low AUC for the precision-recall curve.
</p>

In [None]:
auc_pr_curves = pd.DataFrame(auc_pr_curves)
auc_pr_curves.sort_values('auc_score', ascending=False)

## Phase 3: ROC Curve

In [None]:
def plot_roc_curves(classifiers, X, y, n_splits, num_attr, cat_attr, num_cols_per_row=4,
                    fig_w=8, fig_h=6, sameplot=False, random_state=RANDOM_SEED):
    num_rows, indexes, axs, fig = create_subplots(1 if sameplot else len(classifiers),
                                                  num_cols_per_row, fig_w, fig_h)

    roc_scores = []
    X = X.reset_index(drop=True)
    y = y.reset_index(drop=True)

    with tqdm(total=100) as pbar:
        progress_unit = 100/len(classifiers)

        models_with_no_p_proba = ['Linear Support Vector Classifier']
        # Iterate all classifiers to plot on the same axis
        for idx, classifier in enumerate(classifiers.values()):

            new_cf = clone(classifier['func'])
            method = 'predict_proba' if classifier['name'] not in models_with_no_p_proba else 'predict'
            ax = axs if num_cols_per_row == 1 or sameplot else axs[
                idx] if num_rows == 1 else axs[indexes[idx][0]][indexes[idx][1]]

            # get cross_validated y_score of training set from cross_val_predict,
            # without having to fit the whole training set or use test set
            cv = StratifiedKFold(n_splits=n_splits,
                                 shuffle=True, random_state=random_state)
            ml_pipeline = make_training_pipeline(new_cf, num_attr, cat_attr)
            y_score_cv = cross_val_predict(
                ml_pipeline, X, y, cv=cv, method=method)

            # Get the last columns of the y_scores only if more than one columns are detected
            if y_score_cv.ndim > 1:
                y_score_cv = y_score_cv[:, -1]

            # Plot the ROC curve
            fpr, tpr, threshold = roc_curve(y, y_score_cv)
            roc_auc = auc(fpr, tpr)

            graph = RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=roc_auc,
                                    estimator_name=classifier['name'])
            graph.plot(ax=ax)

            roc_scores.append(
                {'name': classifier['name'], 'auc_score': roc_auc})
            pbar.update(progress_unit)

        ax.set(xlabel='False positive rate', ylabel='True positive rate',
               title=f'ROC Curve with cross validation')
        ax.title.set_fontsize(20)
        ax.legend(loc="lower right")
        fig.tight_layout()
        plt.show()

        return roc_scores

In [None]:
roc_auc_scores = plot_roc_curves(
    classifiers, X4_train, y4_train, 5, num_attr4, cat_attr4, 1, 12, 8, sameplot=True)

<p style="text-align: justify; line-height: 2.0; font-size: 14px; padding-right: 100px;">
<b>Linear Support Vector Classifier</b> and <b>Decision Tree</b> also have a very low AUC for the ROC curve.
We can discard them later.
</p>
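
One thing that may contribute to the low score of the Linear Support Vector Classifier is that the plotting functions fall back to `predict` for it, so its curve is built from hard class labels instead of continuous scores. The toy example below, with made-up labels and scores, shows how much ranking information is lost when scores are thresholded before computing ROC AUC. For a real `LinearSVC`, `decision_function` would be one way to obtain continuous scores.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up ground truth and continuous scores that rank the samples perfectly.
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.2, 0.3, 0.4])

# Thresholding at 0.5 turns every prediction into class 0.
hard_labels = (scores >= 0.5).astype(int)

auc_from_scores = roc_auc_score(y_true, scores)
auc_from_labels = roc_auc_score(y_true, hard_labels)
print(auc_from_scores, auc_from_labels)
```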

In [None]:
roc_auc_scores = pd.DataFrame(roc_auc_scores)
roc_auc_scores.sort_values('auc_score', ascending=False)

## Phase 4: Final Selection

<ol>
<li>Logistic Classifier</li>
<li>K Neighbors Classifier</li>
<li>Multi-layer Perceptron classifier</li>
<li>Ensemble Tree
<ul>
    <li>Random forest Classifier</li>
    <li>Extra Tree Classifier</li>
</ul>
</li>

<li>Bagging Classifier</li>
<li>Boosting algorithm
<ul>
  <li>AdaBoost Classifier</li>
  <li>Gradient Boosting Classifier</li>
  <li>CatBoost Classifier</li>
  <li>XGBClassifier</li>
</ul>
</li>

<li>Linear Support Vector Classifier</li>
<li>Decision Tree</li>
</ol>

<p style="text-align: justify; line-height: 2.0; text-indent: 5%; font-size: 14px; padding-right: 100px;">
We are going to select models that have decent performance and that each belong to a
different family of algorithms. I am going to choose the <b>XGB classifier</b> since it
has a relatively better F1 score. I am also going to choose <b>Random Forest Classifier, Logistic Classifier,
K Neighbors Classifier, and Multi-layer Perceptron classifier</b>. I choose this many models so that I can build
voting and stacking classifiers after model tuning to improve the robustness of the final model.
</p>
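
As a preview of why several different model families are kept, here is a rough sketch of a soft-voting ensemble on synthetic data. The estimators and their settings are placeholders chosen for the illustration, not the tuned models from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = make_classification(n_samples=200, n_features=8,
                                     random_state=0)

# Soft voting averages the predicted class probabilities of all members.
voter = VotingClassifier(
    estimators=[
        ('log_reg', LogisticRegression(max_iter=1000)),
        ('rand_forest', RandomForestClassifier(n_estimators=50,
                                               random_state=0)),
        ('k_neighbors', KNeighborsClassifier(n_neighbors=5)),
    ],
    voting='soft')
voter.fit(X_demo, y_demo)

proba = voter.predict_proba(X_demo[:5])
print(proba)
```

Soft voting requires every member to implement `predict_proba`, which is one practical reason to prefer models that expose probabilities.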

In [None]:
roc_scores = roc_auc_scores.merge(
    auc_pr_curves, on='name', suffixes=('_roc', '_pr'))
roc_scores.sort_values(['auc_score_roc', 'auc_score_pr'], ascending=False)

In [None]:
selected_cfs = {}

for key in ['log_reg', 'k_neighbors', 'neural_network', 'rand_forest', 'xg_boost']:
    selected_cfs[key] = classifiers[key]

In [None]:
print('List of Chosen Models')

for idx, classifier in enumerate(selected_cfs.values()):
    print(f'{idx+1} - {classifier["name"]}')

# Model Tuning



<p style="text-align: justify; line-height: 2.0; text-indent: 5%; font-size: 14px; padding-right: 100px;">
We are going to use randomized search instead of grid search, since going through every possible combination takes too much time.
</p>
<p style="text-align: justify; line-height: 2.0; text-indent: 5%; font-size: 14px; padding-right: 100px;">
We are going to choose the best combination of hyperparameters by sorting on <b>num_trials, mean_cv_score and mean_test_score.</b>
Assuming <i>X_train</i> and <i>y_train</i> are the initial training set we feed into the algorithm, the algorithm performs a nested cross-validated
randomized search by first splitting <i>X_train</i> and <i>y_train</i> into k1 outer folds, <i>X_outer_train</i> and <i>y_outer_train</i>. It then splits each
<i>X_outer_train</i> and <i>y_outer_train</i> into k2 inner folds, <i>X_inner_train</i> and <i>y_inner_train</i>.
<b>num_trials</b> is the number of outer folds in which the randomized search sampled a given hyperparameter combination.
<b>mean_test_score</b> is the average test score over the k2-inner-fold sets for each hyperparameter combination.
<b>mean_cv_score</b> is the average cross-validated score across all k1-outer-fold sets for each hyperparameter combination.
</p>
<p style="text-align: justify; line-height: 2.0; text-indent: 5%; font-size: 14px; padding-right: 100px;">
To put it simply, the k2-inner-fold sets are used to find the best estimator's hyperparameter combination,
while the k1-outer-fold sets are used to evaluate the estimator with that combination.
The purpose is to prevent the randomized search from reporting overly optimistic results, which would hide a model that overfits
the original training set and does not generalize well to real-world data with different variations and distributions.
</p>
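
For comparison, the basic nested scheme can be written compactly by passing a search object to `cross_val_score`. This is a simplified sketch on synthetic data and not the custom procedure implemented in this notebook: it reports only the outer-fold scores and does not track `num_trials` per combination.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (RandomizedSearchCV, StratifiedKFold,
                                     cross_val_score)

X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     random_state=0)

# Inner folds pick the hyperparameters, outer folds score that choice.
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=1)

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={'C': [0.01, 0.1, 1, 10, 100]},
    n_iter=3, scoring='roc_auc', cv=inner_cv, random_state=0)

# For every outer split the whole search is refit on the outer training part
# and its best estimator is scored on the held-out outer fold.
outer_scores = cross_val_score(search, X_demo, y_demo,
                               cv=outer_cv, scoring='roc_auc')
print(outer_scores.mean(), outer_scores.std())
```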

In [None]:
def nested_cv_param_search(model, param_grid, X, y, num_attr, cat_attr,
                           n_iter, scoring, n_outer_splits, n_inner_spits,
                           random_state=RANDOM_SEED):

    cv_outer = StratifiedKFold(n_splits=n_outer_splits, shuffle=False)
    cv_inner = StratifiedKFold(
        n_splits=n_inner_spits, shuffle=True, random_state=random_state)

    outer_roc_score = list()
    inner_roc_score = list()

    X = X.reset_index(drop=True)
    y = y.reset_index(drop=True)

    with tqdm(total=100) as pbar:
        progress_unit = 75/(n_outer_splits)

        for train_ix, test_ix in cv_outer.split(X, y):
            X_train, X_test = X.loc[train_ix, :], X.loc[test_ix, :]
            y_train, y_test = y[train_ix], y[test_ix]

            estimator = make_training_pipeline(
                clone(model), num_attr, cat_attr)
            search = RandomizedSearchCV(estimator, param_grid, n_iter=n_iter,
                                        scoring=scoring, cv=cv_inner, refit=True)
            result = search.fit(X_train, y_train)

            inner_roc_score.append(result.cv_results_)

            best_model = result.best_estimator_
            y_test_pred = best_model.predict(X_test)
            roc_score = roc_auc_score(y_test, y_test_pred)
            outer_roc_score.append(roc_score)

            pbar.update(progress_unit)

        features = ['params', 'mean_test_score', 'std_test_score']
        # DataFrame.append was removed in pandas 2.0, so stack the per-fold
        # results with pd.concat instead
        base = pd.concat(
            [pd.DataFrame(roc_score)[features] for roc_score in inner_roc_score],
            ignore_index=True)

        base['params'] = base['params'].astype('str')
        agg_mean = base.groupby('params')['mean_test_score']
        new_df = {'mean_test_score': agg_mean.mean(),
                  'std_test_score': agg_mean.std(), 'num_trials': agg_mean.count()}
        param_result = pd.DataFrame(new_df).reset_index()

        param_result['mean_cv_score'] = 0.0
        param_result['std_cv_score'] = 0.0

        params = list(param_result['params'].value_counts().index)
        progress_unit = 25/(n_outer_splits * len(params))

        for param in params:
            outer_roc_score = []
            for train_ix, test_ix in cv_outer.split(X, y):
                X_train, X_test = X.loc[train_ix, :], X.loc[test_ix, :]
                y_train, y_test = y[train_ix], y[test_ix]

                estimator = make_training_pipeline(
                    clone(model), num_attr, cat_attr)
                estimator.set_params(**ast.literal_eval(param))
                result = estimator.fit(X_train, y_train)

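                # predict() returns hard class labels, so this ROC AUC is
                # computed on thresholded predictions rather than probabilities
                y_test_pred = estimator.predict(X_test)
                roc_score = roc_auc_score(y_test, y_test_pred)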
                outer_roc_score.append(roc_score)

                pbar.update(progress_unit)

            mean = np.mean(outer_roc_score)
            std = np.std(outer_roc_score)
            index = list(param_result['params'] == param).index(True)
            param_result.loc[index, 'mean_cv_score'] = mean
            param_result.loc[index, 'std_cv_score'] = std

    outer_roc_score = pd.DataFrame(
        outer_roc_score, columns=[f'{scoring}_score'])
    return param_result
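
The function above turns each parameter dictionary into a string so it can serve as a `groupby` key, and later converts it back with `ast.literal_eval`. The round trip can be sketched on a tiny made-up results table; the scores below are invented for the illustration.

```python
import ast

import pandas as pd

# Made-up inner-search results: {'C': 1} was sampled in two outer folds.
trials = pd.DataFrame({
    'params': [{'C': 1}, {'C': 10}, {'C': 1}],
    'mean_test_score': [0.80, 0.70, 0.90],
})

# Dictionaries are not hashable, so group on their string form instead.
trials['params'] = trials['params'].astype('str')
grouped = trials.groupby('params')['mean_test_score']
summary = pd.DataFrame({'mean_test_score': grouped.mean(),
                        'num_trials': grouped.count()}).reset_index()

# literal_eval safely parses the string back into a dictionary.
restored = [ast.literal_eval(p) for p in summary['params']]
print(summary)
```

This only works when every parameter value has a literal representation (numbers, strings, tuples, and so on); objects such as estimators would not survive the round trip.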

## Model 1: Logistic Regression

We are going to tune C and l1_ratio for Logistic Regression.

In [None]:
log_reg = LogisticRegression(n_jobs=-1, max_iter=7000, random_state=RANDOM_SEED,
                             solver='saga', penalty='elasticnet')

log_reg_param_grid = {'logisticregression__C': [0.01, 0.1, 1, 10, 100, 1000],
                      'logisticregression__l1_ratio': [0, 0.25, 0.50, 0.75, 1]}
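
The keys in the grid follow scikit-learn's `<step name>__<parameter>` convention. Assuming `make_training_pipeline` builds its pipeline with `make_pipeline` (an assumption, since the helper is defined elsewhere), each step is named after its lowercased class name, which is where the `logisticregression__` prefix comes from. A minimal sketch with a stand-in pipeline:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in pipeline; the real one in this notebook has more preprocessing.
pipe = make_pipeline(StandardScaler(), LogisticRegression())

# make_pipeline names each step after its lowercased class name.
step_names = [name for name, _ in pipe.steps]
print(step_names)

# The same '<step>__<param>' key works with set_params and search grids.
pipe.set_params(logisticregression__C=0.1)
print(pipe.get_params()['logisticregression__C'])
```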

**Note**: The results of the randomized search are loaded from a pickle file because the search takes a long time to run. Run the code below only if you want to regenerate them.

```python
log_param_result = nested_cv_param_search(log_reg, log_reg_param_grid,
                                          X4_train, y4_train, num_attr4,
                                          cat_attr4, 10, 'roc_auc', 5, 4)

dump_objects('log_reg_cv_rand_search', log_param_result)
```

Here are the nested cross-validated scores from the randomized search for each explored hyperparameter combination for Logistic Regression.

In [None]:
[log_param_result] = load_objects(file_name='log_reg_cv_rand_search')
log_param_result

In [None]:
log_param_ranked = log_param_result.sort_values(
    ['num_trials', 'mean_cv_score', 'mean_test_score'], ascending=False).reset_index(drop=True)
log_param_ranked.loc[:10]

The best combination of hyperparameters for Logistic Regression.

In [None]:
log_best_param = log_param_ranked.loc[0, 'params']
log_best_param

## Model 2: K-Nearest Neighbors Classifier

We are going to tune n_neighbors for the K-Nearest Neighbors Classifier.

In [None]:
k_nearest_neigh = KNeighborsClassifier(weights='distance', n_jobs=-1)

k_nearest_neigh_param_grid = {'kneighborsclassifier__n_neighbors': [
    10, 20, 30, 40, 50, 60, 70, 80, 90, 100]}

**Note**: The results of the randomized search are loaded from a pickle file because the search takes a long time to run. Run the code below only if you want to regenerate them.

```python
knn_param_result = nested_cv_param_search(k_nearest_neigh, k_nearest_neigh_param_grid,
                                          X4_train, y4_train, num_attr4,
                                          cat_attr4, 10, 'roc_auc', 5, 4)

dump_objects('knn_cv_rand_search', knn_param_result)
```

Here are the nested cross-validated scores from the randomized search for each explored hyperparameter combination for the K-Nearest Neighbors Classifier.

In [None]:
[knn_param_result] = load_objects(file_name='knn_cv_rand_search')
knn_param_result

In [None]:
knn_param_ranked = knn_param_result.sort_values(
    ['num_trials', 'mean_cv_score', 'mean_test_score'], ascending=False).reset_index(drop=True)
knn_param_ranked.loc[:5]

The best combination of hyperparameters for K-Nearest Neighbors Classifier.

In [None]:
knn_best_param = knn_param_ranked.loc[0, 'params']
knn_best_param

## Model 3: Multi-layer Perceptron Classifier

We are going to tune alpha, beta_1, and beta_2 for Multi-layer Perceptron Classifier.

In [None]:
neural_network = MLPClassifier(early_stopping=True, random_state=RANDOM_SEED)

neural_network_param_grid = {
    'mlpclassifier__alpha': [0.0001, 0.001, 0.1, 1, 10],
    'mlpclassifier__beta_1': [0.1, 0.3, 0.6, 0.9],
    'mlpclassifier__beta_2': [0.1, 0.3, 0.6, 0.9]
}

**Note**: The results of the randomized search are loaded from the pickle file because the search takes a long time to run. Run the code below only if you want to reproduce them.

```python
neural_network_param_result = nested_cv_param_search(neural_network, neural_network_param_grid,
                                                     X4_train, y4_train, num_attr4,
                                                     cat_attr4, 10, 'roc_auc', 5, 4)

dump_objects('neural_network_cv_rand_search', neural_network_param_result)
```

Here are the nested cross-validated scores from the randomized search for each hyperparameter combination explored for the Multi-layer Perceptron Classifier.

In [None]:
[neural_network_param_result] = load_objects(
    file_name='neural_network_cv_rand_search')
neural_network_param_result

In [None]:
neural_network_param_ranked = neural_network_param_result.sort_values(
    ['num_trials', 'mean_cv_score', 'mean_test_score'], ascending=False).reset_index(drop=True)
neural_network_param_ranked.loc[:10]

The best combination of hyperparameters for Multi-layer Perceptron Classifier.

In [None]:
neural_network_best_param = neural_network_param_ranked.loc[0, 'params']
neural_network_best_param

## Model 4: Random Forest Classifier

We are going to tune min_samples_split, min_samples_leaf, and max_samples for Random Forest Classifier.

In [None]:
rand_forest = RandomForestClassifier(
    bootstrap=True, random_state=RANDOM_SEED, n_jobs=-1)

rand_forest_param_grid = {
    'randomforestclassifier__min_samples_split': [0.01, 0.05, 0.10, 0.15],
    'randomforestclassifier__min_samples_leaf': [0.01, 0.05, 0.10, 0.15],
    'randomforestclassifier__max_samples': [0.70, 0.80, 0.90],
}

**Note**: The results of the randomized search are loaded from the pickle file because the search takes a long time to run. Run the code below only if you want to reproduce them.

```python
rand_forest_param_result = nested_cv_param_search(rand_forest, rand_forest_param_grid,
                                                  X4_train, y4_train, num_attr4,
                                                  cat_attr4, 15, 'roc_auc', 5, 4)

dump_objects('rand_forest_cv_rand_search', rand_forest_param_result)
```

Here are the nested cross-validated scores from the randomized search for each hyperparameter combination explored for the Random Forest Classifier.

In [None]:
[rand_forest_param_result] = load_objects(
    file_name='rand_forest_cv_rand_search')
rand_forest_param_result

In [None]:
rand_forest_param_ranked = rand_forest_param_result.sort_values(
    ['num_trials', 'mean_cv_score', 'mean_test_score'], ascending=False).reset_index(drop=True)
rand_forest_param_ranked.loc[:10]

In [None]:
rand_forest_param_ranked = rand_forest_param_result.sort_values(
    ['mean_cv_score', 'num_trials', 'mean_test_score'], ascending=False).reset_index(drop=True)
rand_forest_param_ranked.loc[:10]

The best combination of hyperparameters for Random Forest Classifier.

In [None]:
rand_forest_best_param = rand_forest_param_ranked.loc[1, 'params']
rand_forest_best_param

## Model 5: XGB Classifier

We are going to tune learning_rate, min_child_weight, subsample and colsample_bytree for XGB Classifier.

In [None]:
xgb_cf = XGBClassifier(random_state=RANDOM_SEED, n_jobs=-1)

xgb_cf_param_grid = {
    'xgbclassifier__learning_rate': [0.01, 0.05, 0.12, 0.20],
    'xgbclassifier__min_child_weight': [3, 6, 9],
    'xgbclassifier__subsample': [0.5, 0.7, 0.9],
    'xgbclassifier__colsample_bytree': [0.5, 0.7, 0.9],
}

**Note**: The results of the randomized search are loaded from the pickle file because the search takes a long time to run. Run the code below only if you want to reproduce them.

```python
xgb_cf_param_result = nested_cv_param_search(xgb_cf, xgb_cf_param_grid,
                                             X4_train, y4_train, num_attr4,
                                             cat_attr4, 20, 'roc_auc', 5, 4)

dump_objects('xgb_cf_cv_rand_search', xgb_cf_param_result)
```

Here are the nested cross-validated scores from the randomized search for each hyperparameter combination explored for the XGB Classifier.

In [None]:
[xgb_cf_param_result] = load_objects(file_name='xgb_cf_cv_rand_search')
xgb_cf_param_result

In [None]:
xgb_cf_param_ranked = xgb_cf_param_result.sort_values(
    ['num_trials', 'mean_cv_score', 'mean_test_score'], ascending=False).reset_index(drop=True)
xgb_cf_param_ranked.loc[:10]

The best combination of hyperparameters for XGB Classifier.

In [None]:
xgb_cf_best_param = xgb_cf_param_ranked.loc[0, 'params']
xgb_cf_best_param

## Combining Models

<p style="text-align: justify; line-height: 2.0; font-size: 14px; padding-right: 100px;">
Let's see whether combining the models into voting and stacking classifiers improves the model performance.
</p>

In [None]:
optimized_models = []
models_name = ['log_reg', 'knn', 'neural_network', 'rand_forest', 'xgb_cf']
best_params = [log_best_param, knn_best_param,
               neural_network_best_param, rand_forest_best_param, xgb_cf_best_param]

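for cf, param in zip(selected_cfs.values(), best_params):
    new_cf = clone(cf['func'])
    # Parse the stored params string first, then strip the pipeline step
    # prefix (e.g. 'xgbclassifier__') from each key. A single greedy regex
    # over the whole string would drop every key except the last one.
    param = {key.split('__', 1)[-1]: value
             for key, value in ast.literal_eval(param).items()}
    new_cf.set_params(**param)
    optimized_models.append(new_cf)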

optimized_models = list(zip(models_name, optimized_models))
optimized_models
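
The sketch below is a small, hedged check of how a pipeline step prefix can be stripped from a stored parameter string. The helper name `strip_step_prefix` and the example string are hypothetical. It assumes the `params` column holds the string form of a dict, and compares key-by-key stripping with one greedy regex substitution over the whole string.

```python
import ast
import re


def strip_step_prefix(param_str):
    """Parse a dict-like string and drop the 'step__' prefix from each key."""
    parsed = ast.literal_eval(param_str)
    return {key.split('__', 1)[-1]: value for key, value in parsed.items()}


# Hypothetical example with two tuned parameters.
example = "{'mlpclassifier__alpha': 0.1, 'mlpclassifier__beta_1': 0.9}"

clean = strip_step_prefix(example)
print(clean)

# For comparison: a greedy '.+__' matches from the first quote up to the
# LAST step prefix, so every key before the last one is removed with it.
greedy = ast.literal_eval(re.sub(r"(?<=').+__(?=.+')", '', example))
print(greedy)
```

With a single tuned parameter both approaches agree. They diverge only when the dict has several keys, which is why it is worth printing the parameters of each rebuilt model before combining them.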

In [None]:
def get_stack_cf_cv_scores(model, X, y, num_attr, cat_attr,
                           n_outer_splits, n_inner_spits,
                           random_state=RANDOM_SEED):

    cv_outer = StratifiedKFold(n_splits=n_outer_splits, shuffle=False)
    cv_inner = StratifiedKFold(
        n_splits=n_inner_spits, shuffle=True, random_state=random_state)

    model.set_params(cv=cv_inner)

    outer_roc_score = list()

    X = X.reset_index(drop=True)
    y = y.reset_index(drop=True)

    with tqdm(total=100) as pbar:
        progress_unit = 100/(n_outer_splits)

        for train_ix, test_ix in cv_outer.split(X, y):
            X_train, X_test = X.loc[train_ix, :], X.loc[test_ix, :]
            y_train, y_test = y[train_ix], y[test_ix]

            estimator = make_training_pipeline(
                clone(model), num_attr, cat_attr)
            estimator.fit(X_train, y_train)

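            # ROC AUC should be computed from scores, not hard labels,
            # so use the predicted probability of the positive class.
            y_test_score = estimator.predict_proba(X_test)[:, 1]
            roc_score = roc_auc_score(y_test, y_test_score)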
            outer_roc_score.append(roc_score)

            pbar.update(progress_unit)

    outer_roc_score = pd.DataFrame(outer_roc_score, columns=['roc_score'])
    return outer_roc_score


def get_voting_cf_cv_scores(model, X, y, num_attr, cat_attr,
                            n_outer_splits, n_inner_spits, scoring='roc_auc',
                            random_state=RANDOM_SEED):

    cv_outer = StratifiedKFold(n_splits=n_outer_splits, shuffle=False)
    cv_inner = StratifiedKFold(
        n_splits=n_inner_spits, shuffle=True, random_state=random_state)

    roc_scores = pd.DataFrame()

    X = X.reset_index(drop=True)
    y = y.reset_index(drop=True)

    with tqdm(total=100) as pbar:
        progress_unit = 100/(n_outer_splits)

        for train_outer_ix, test_outer_ix in cv_outer.split(X, y):
            X_train, X_test = X.loc[train_outer_ix, :], X.loc[test_outer_ix, :]
            y_train, y_test = y[train_outer_ix], y[test_outer_ix]

            estimator = make_training_pipeline(
                clone(model), num_attr, cat_attr)
            cv_scores = cross_validate(estimator, X_train, y_train,
                                       scoring=scoring, cv=cv_inner, n_jobs=-1)

            inner_roc_scores = cv_scores['test_score']

            estimator.fit(X_train, y_train)
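            # Match the inner 'roc_auc' scoring: score with positive-class
            # probabilities instead of hard class labels.
            y_test_score = estimator.predict_proba(X_test)[:, 1]
            outer_roc_score = roc_auc_score(y_test, y_test_score)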

            new_row = {'outer_roc_score': outer_roc_score,
                       'inner_roc_score_mean': np.mean(inner_roc_scores),
                       'inner_roc_score_std': np.std(inner_roc_scores)}
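            # DataFrame.append was removed in pandas 2.0; concat works on old and new versions.
            roc_scores = pd.concat(
                [roc_scores, pd.DataFrame([new_row])], ignore_index=True)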

            pbar.update(progress_unit)

    return roc_scores

<p style="text-align: justify; line-height: 2.0; font-size: 14px; padding-right: 100px;">
<b>roc_score</b> in <i>outer_roc_score_stack</i> is the ROC AUC computed on each of the k1 outer folds of <i>X4_train, y4_train</i>.
Within each outer training split, the stacking classifier uses a k2-fold inner split to produce the out-of-fold predictions that train its final estimator.
We use k1 = 5 and k2 = 4 to measure the nested cross-validated scores.
</p>
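
The outer and inner splitting described above can be sketched as follows. This is a minimal illustration, not the notebook's pipeline: synthetic data stands in for <i>X4_train, y4_train</i>, the two base models are arbitrary choices, and names such as `demo_stack` are hypothetical.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data stands in for X4_train, y4_train (assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, random_state=0)

k1, k2 = 5, 4
cv_outer = StratifiedKFold(n_splits=k1, shuffle=False)
cv_inner = StratifiedKFold(n_splits=k2, shuffle=True, random_state=0)

# The inner folds produce the out-of-fold predictions that the final
# estimator (logistic regression by default) is trained on.
demo_stack = StackingClassifier(
    [('log_reg', LogisticRegression(max_iter=1000)),
     ('knn', KNeighborsClassifier(n_neighbors=10))],
    cv=cv_inner)

demo_outer_scores = []
for train_ix, test_ix in cv_outer.split(X_demo, y_demo):
    # Fit a fresh copy on the outer training split only.
    fold_model = clone(demo_stack).fit(X_demo[train_ix], y_demo[train_ix])
    # Score the held-out outer fold with positive-class probabilities.
    fold_score = fold_model.predict_proba(X_demo[test_ix])[:, 1]
    demo_outer_scores.append(roc_auc_score(y_demo[test_ix], fold_score))

print(demo_outer_scores)
```

Each outer fold yields one score, so the result has k1 entries. The held-out outer fold is never seen by the inner splits, which is what keeps the estimate free of leakage.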

In [None]:
stack_cf = StackingClassifier(
    optimized_models, stack_method='auto', n_jobs=-1, passthrough=False)
outer_roc_score_stack = get_stack_cf_cv_scores(
    stack_cf, X4_train, y4_train, num_attr4, cat_attr4, 5, 4)
outer_roc_score_stack

<p style="text-align: justify; line-height: 2.0; font-size: 14px; padding-right: 100px;">
<b>outer_roc_score</b> in <i>outer_roc_score_voting</i> is the ROC AUC computed on each of the k1 outer folds of <i>X4_train, y4_train</i>,
while <b>inner_roc_score_mean</b> is the average ROC AUC over the k2 inner folds of each outer training split.
We use k1 = 5 and k2 = 4 to measure the nested cross-validated scores.
</p>
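
Since both columns are ROC AUC values, it matters what is passed to `roc_auc_score`. scikit-learn's `'roc_auc'` scorer ranks samples by continuous scores, so an outer score computed from hard class labels is not directly comparable with it. The toy numbers below are invented purely to illustrate the gap.

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
# Made-up positive-class probabilities.
y_score = [0.1, 0.3, 0.2, 0.4]
# Hard labels obtained by thresholding the probabilities at 0.5.
y_label = [int(score >= 0.5) for score in y_score]

auc_from_scores = roc_auc_score(y_true, y_score)
auc_from_labels = roc_auc_score(y_true, y_label)
print(auc_from_scores, auc_from_labels)
```

Thresholding throws away the ranking information that AUC measures, so the label-based value can understate how well the model separates the classes.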

In [None]:
voting_cf = VotingClassifier(optimized_models, voting='soft', n_jobs=-1)
outer_roc_score_voting = get_voting_cf_cv_scores(
    voting_cf, X4_train, y4_train, num_attr4, cat_attr4, 5, 4)
outer_roc_score_voting

<p style="text-align: justify; line-height: 2.0; font-size: 14px; padding-right: 100px;">
Unfortunately, voting and stacking classifiers don't yield any significant improvement :(
</p>

# Conclusion

<p style="text-align: justify; line-height: 2.0; text-indent: 5%; font-size: 14px; padding-right: 100px;">
We are not going to evaluate our models on the test set, since the nested cross-validated scores already tell us a lot about the overall model performance.
Regardless, I will definitely improve the models further in the future. Still, we achieved
quite a lot in this part. We prevented train-test contamination by fitting the preprocessing on the train set only instead of
the whole dataset. We also performed feature selection without sacrificing too much model performance.
Furthermore, I applied nested cross-validation in many areas, especially hyperparameter tuning.
</p>
<p style="text-align: justify; line-height: 2.0; text-indent: 5%; font-size: 14px; padding-right: 100px;">
In part III, we are going to understand why the models suck at doing their jobs. I suspect that outliers might have
something to do with the poor performance, as shown in the plot below. However, we cannot rule out other possible factors like insufficient
relevant features or samples, which is something that I can't fix :)
</p>
<p style="text-align: justify; line-height: 2.0; font-size: 14px; padding-right: 100px;">
If you have any feedback or suggestions, please feel free to share.
    I started learning data science &#60; 6 months ago and still have a lot to learn :)
</p>

In [None]:
def create_subplots(cols, num_cols_per_row, fig_w, fig_h):
    num_rows = ceil(len(cols) / num_cols_per_row)
    indexes = list(product(range(num_rows), range(num_cols_per_row)))
    fig, axs = plt.subplots(num_rows, num_cols_per_row)
    fig.set_size_inches(fig_w, num_rows * fig_h)
    return num_rows, indexes, axs


def plot_boxplot(df, cols, num_cols_per_row=4, fig_w=16, fig_h=7):
    num_rows, indexes, axs = create_subplots(
        cols, num_cols_per_row, fig_w, fig_h)

    for idx, col in enumerate(cols):
        ax = axs[idx] if num_rows == 1 else axs[indexes[idx][0]][indexes[idx][1]]
        sns.boxplot(y=df[col], ax=ax)
        ax.set(title=col, ylabel=None)

In [None]:
plot_boxplot(dating4, num_attr4)