# Classifier Hyperparameter Search

#### Introduction

One baseline dataset (dimension = 128, random seed = None) was used to search for optimal classifier hyperparameters.

Two classifiers were examined: Bagging SVM and Random Forest.

The features are normalised and the dataset is split into training and testing sets. GridSearchCV runs the classifier pipeline and cross-validates each combination of hyperparameters. Three scorers (top20, top100 and AUPRC) evaluate the classifiers, and their results are used to select the best performers.

Users can amend the hyperparameter grids and the scorers in this notebook.

#### Evaluation of model performance:

Input: dataset, hyperparameter grid, performance scorers

Process:

-  Configure the scorers
-  Split the baseline dataset into training and testing sets
-  Configure the GridSearchCV pipeline
-  Fit the pipeline on the training data
-  Display the results

Quality control:

-  Verify the total number of CV splits and the total number of hyperparameter combinations

Output:

-  Evaluation results produced by GridSearchCV

Remarks: Users can amend the hyperparameter settings
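
The quality-control check listed above can be sketched with the standard library alone. This is a minimal illustration, not part of the original notebook: it copies the value lists of the first Random Forest grid used later in this notebook and assumes `cv=3`, as in the searches below. The names `n_candidates` and `n_fits` are introduced here for illustration only.

```python
from math import prod

# Value lists copied from the first Random Forest grid in this notebook.
param_grid = {
    "clf__bootstrap_features": [True, False],
    "clf__base_estimator__max_leaf_nodes": [100, 120, 140],
    "clf__base_estimator__max_depth": [12, 17, 22],
    "clf__base_estimator__min_samples_leaf": [1, 2, 4],
    "clf__base_estimator__min_samples_split": [2, 5, 10],
    "clf__max_features": [0.5, 0.75, 1.0],
    "clf__n_estimators": [100, 200, 300],
    "clf__max_samples": [500, 600, 700],
}
n_splits = 3  # matches cv=3 in the searches below

# A full grid evaluates every combination of the listed values.
n_candidates = prod(len(values) for values in param_grid.values())
# Each candidate is fitted once per CV split (the final refit adds one more fit).
n_fits = n_candidates * n_splits

print(n_candidates, n_fits)  # 4374 13122
```

After a search has finished, these expected counts can be compared with `len(pu_estimator.cv_results_["params"])` and `pu_estimator.n_splits_`.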


#### Import Libraries

In [None]:
import os
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from pulearn import BaggingPuClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import make_scorer, average_precision_score

#### Define helper functions

In [None]:
# define the performance metric scorers

def auprc_score(y_true, y_pred):
    '''
    Scoring function for AUPRC (area under the precision-recall curve),
    computed as the average precision.
    parameters:
        y_true: pandas Series, the true class labels
        y_pred: numpy array, the predicted probability of the positive class
    return:
        AUPRC value
    '''
    return average_precision_score(y_true, y_pred)

def topk(y_true, y_pred, top_k=100, get_mask=False):
    '''
    Scoring function for top-k hits.
    Sort the predicted probabilities in descending order and count the true positives among the k highest-ranked predictions.
    parameters:
        y_true: pandas Series, the true class labels
        y_pred: numpy array, the predicted probability of the positive class
        top_k: int, the number of highest-ranked predictions to inspect
        get_mask: bool, whether to return the boolean mask of the top k predictions instead of the hit count
    return:
        if get_mask is False: the number of true positives among the top k predictions
        if get_mask is True: the boolean mask marking the top k predictions
    '''
    sorted_indices = y_pred.argsort()[::-1]
    top_k_indices = sorted_indices[:top_k]
    y_pred_top_k_mask = np.full(y_true.shape, False, dtype=bool)
    y_pred_top_k_mask[top_k_indices] = True
    top_k_hits = y_true.values[y_pred_top_k_mask].sum()

    if get_mask:
        return y_pred_top_k_mask
    return top_k_hits
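
To make the top-k hit score concrete, here is a small self-contained sketch on made-up labels and scores. The helper `topk_hits` is a simplified stand-in written for this illustration (it is not the `topk` function above, although it follows the same idea), and the toy arrays are invented.

```python
import numpy as np

def topk_hits(y_true, y_score, top_k):
    """Count true positives among the top_k highest-scoring samples."""
    # argsort is ascending, so reverse it and keep the first top_k indices
    top_idx = np.argsort(y_score)[::-1][:top_k]
    return int(np.asarray(y_true)[top_idx].sum())

# Toy data: six samples, three of them positive.
y_true = np.array([1, 0, 1, 0, 0, 1])
y_score = np.array([0.90, 0.80, 0.70, 0.20, 0.10, 0.05])

# The three highest scores belong to samples 0, 1 and 2, two of which are positive.
hits_top3 = topk_hits(y_true, y_score, top_k=3)
print(hits_top3)  # 2
```

Because the score is a raw count rather than a rate, its value depends on how many samples and positives each validation fold contains, so it is best compared across candidates evaluated on the same folds.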

In [None]:
np.set_printoptions(precision=10)

#### Load baseline dataset (d=128, random seed = None)

In [None]:
# load the subject dataset

dataset_filename = 'dataset_p_4_q_1_dim_128_walkleng_100_numwalks_500.csv'

file_path = os.path.join('data', 'datasets', dataset_filename)
dataset = pd.read_csv(file_path)

#### Splitting the dataset

In [None]:
# X holds the feature columns feature1 to feature128 (everything except 'id' and 'y')
X = dataset.drop(['id', 'y'], axis=1)
# y is the target column of the dataset
y = dataset['y']

# split the dataset 80%/20% into training and testing sets, shuffled and stratified by y
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=37, stratify=y)

train_indices = train_X.index
test_indices = test_X.index

# print the shape of the data
print(f'Training dataset shape, X: {train_X.shape}')
print(f'Training dataset shape, y: {train_y.shape}')
print(f'Testing dataset shape, X: {test_X.shape}')
print(f'Testing dataset shape, y: {test_y.shape}')

Training dataset shape, X: (12968, 128)
Training dataset shape, y: (12968,)
Testing dataset shape, X: (3242, 128)
Testing dataset shape, y: (3242,)
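
As a sanity check on the stratified split, the sketch below shows on synthetic data that `stratify` keeps the positive-class rate the same in both partitions. The toy arrays and the 5% positive rate are assumptions made for this illustration; they are not taken from the baseline dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 4 features, 5% positives (assumed values).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 4))
y_toy = np.array([1] * 50 + [0] * 950)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=37, stratify=y_toy
)

# With stratification the positive rate is preserved in both partitions.
print(len(y_tr), len(y_te))
print(y_tr.mean(), y_te.mean())
```

The same check can be run on the real split by comparing `train_y.mean()` with `test_y.mean()`.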


#### Search for Bagging SVM parameters

In [None]:
# normalise the feature data and build the Bagging SVM classifier pipeline

scaler = StandardScaler()
base_clf = SVC()
clf = BaggingPuClassifier(base_estimator=base_clf, n_jobs = -1, random_state=44, verbose=0)

pipe = Pipeline([
    ("scale", scaler),
    ("clf", clf)
])
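
The grid keys used in the searches below follow scikit-learn's `<step name>__<parameter>` convention for pipelines. The sketch below illustrates the convention on a simplified pipeline that wraps a plain `SVC` (an assumption made to keep the example small; the notebook's pipeline wraps the SVC in `BaggingPuClassifier`, which adds one more level, as in `clf__base_estimator__C`).

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Simplified stand-in pipeline: same step names as the notebook, plain SVC as the classifier.
toy_pipe = Pipeline([("scale", StandardScaler()), ("clf", SVC())])

# Parameters of a step are addressed as "<step name>__<parameter name>".
print("clf__C" in toy_pipe.get_params())  # True

# set_params uses the same naming, which is how GridSearchCV applies each candidate.
toy_pipe.set_params(clf__C=3)
print(toy_pipe.named_steps["clf"].C)  # 3
```

Calling `sorted(pipe.get_params().keys())` on the notebook's own pipeline lists every name that is valid in `param_grid`. Keeping the scaler inside the pipeline also means it is refitted on the training part of each CV split.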

In [None]:
# set up the hyperparameter grid in phases, starting with a broader search of key parameters such as C

pu_estimator = GridSearchCV(estimator=pipe, 
                            param_grid={
                                        'clf__base_estimator__C':[1, 2, 3, 4],
                                        'clf__n_estimators':[200],
                                        'clf__max_samples': [500, 700]
                                        },
                            scoring={
                                     'top20k': make_scorer(topk, greater_is_better=True, needs_proba=True, top_k=20),
                                     'top100k': make_scorer(topk, greater_is_better=True, needs_proba=True, top_k=100),
                                     'auprc_scorer' : make_scorer(auprc_score, needs_proba=True)},                                     
                            refit='auprc_scorer',
                            return_train_score=True,
                            cv=3
                            )
pu_estimator.fit(train_X, train_y)
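
With several scorers, `cv_results_` gets one `mean_test_<name>` and one `rank_test_<name>` entry per scorer, and `refit` names the scorer that decides the best candidate. The sketch below shows this on synthetic data with a logistic regression and two built-in scorer strings; the estimator, data and scorer names `auprc` and `auc` are stand-ins chosen for this illustration, not the notebook's setup.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data (illustrative only).
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

search = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0]},
    # The dictionary keys become the suffixes of the cv_results_ entries.
    scoring={"auprc": "average_precision", "auc": "roc_auc"},
    # refit names the scorer used to pick best_params_ and refit the final model.
    refit="auprc",
    cv=3,
)
search.fit(X_toy, y_toy)

result_columns = sorted(
    key for key in search.cv_results_ if key.startswith(("mean_test_", "rank_test_"))
)
print(result_columns)
print(search.best_params_)
```

In this notebook the refit key is `auprc_scorer`, so `pu_estimator.best_params_` corresponds to the row with `rank_test_auprc_scorer` equal to 1.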


In [51]:

cv_results = pd.DataFrame(pu_estimator.cv_results_)

In [56]:
# set the display limit to be shown in this notebook
pd.set_option('display.max_columns', 80)

In [None]:
# show cv_results
cv_results

In [None]:
# narrow the search around the best values found so far and add further hyperparameters

pu_estimator = GridSearchCV(estimator=pipe, 
                            param_grid={
                                        'clf__max_features' : [0.5, 0.75, 1.0],
                                        'clf__base_estimator__C':[3, 4],
                                        'clf__n_estimators':[100, 200],
                                        'clf__max_samples': [500, 600], 
                                        },
                            scoring={
                                     'top20k': make_scorer(topk, greater_is_better=True, needs_proba=True, top_k=20),
                                     'top100k': make_scorer(topk, greater_is_better=True, needs_proba=True, top_k=100),
                                     'auprc_scorer' : make_scorer(auprc_score, needs_proba=True)},                                     
                            refit='auprc_scorer',
                            return_train_score=True,
                            cv=3
                            )
pu_estimator.fit(train_X, train_y)


In [None]:
cv_results = pd.DataFrame(pu_estimator.cv_results_)
pd.set_option('display.max_columns', 80)

#### Search for hyperparameters of Random Forest

In [None]:
# build the Random Forest classifier pipeline (decision trees bagged by BaggingPuClassifier)

scaler = StandardScaler()
base_clf = DecisionTreeClassifier()
clf = BaggingPuClassifier(base_estimator=base_clf, n_jobs = -1, random_state=44, verbose=0)

pipe = Pipeline([
    ("scale", scaler),
    ("clf", clf)
])

In [None]:
# perform a comprehensive search of the hyperparameters. This takes some time
# because the grid has 4,374 hyperparameter combinations (13,122 fits with cv=3).

pu_estimator = GridSearchCV(estimator=pipe, 
                            param_grid={
                                        'clf__bootstrap_features': [True, False],
                                        'clf__base_estimator__max_leaf_nodes': [100, 120, 140],
                                        'clf__base_estimator__max_depth': [12, 17, 22],
                                        'clf__base_estimator__min_samples_leaf': [1, 2, 4],
                                        'clf__base_estimator__min_samples_split': [2, 5, 10],
                                        'clf__max_features' : [0.5, 0.75, 1.0],
                                        'clf__n_estimators':[100, 200, 300],
                                        'clf__max_samples': [500, 600, 700], 
                                        },
                            scoring={
                                     'top20k': make_scorer(topk, greater_is_better=True, needs_proba=True, top_k=20),
                                     'top100k': make_scorer(topk, greater_is_better=True, needs_proba=True, top_k=100),
                                     'auprc_scorer' : make_scorer(auprc_score, needs_proba=True)},                                     
                            refit='auprc_scorer',
                            return_train_score=True,
                            cv=3
                            )
pu_estimator.fit(train_X, train_y)


In [None]:
cv_results = pd.DataFrame(pu_estimator.cv_results_)
pd.set_option('display.max_columns', 80)
cv_results

In [None]:
# because this search is very large, the results are summarised per hyperparameter value

cv_results_summary = cv_results.copy()

col_list = [                                       
    'param_clf__bootstrap_features',
    'param_clf__base_estimator__max_leaf_nodes',
    'param_clf__base_estimator__max_depth',
    'param_clf__base_estimator__min_samples_leaf',
    'param_clf__base_estimator__min_samples_split',
    'param_clf__max_features',
    'param_clf__n_estimators',
    'param_clf__max_samples'
]

cat_summaries = []

# metrics to average for each value of a hyperparameter
metric_cols = [
    'mean_test_top20k', 'rank_test_top20k',
    'mean_test_top100k', 'rank_test_top100k',
    'mean_test_auprc_scorer', 'rank_test_auprc_scorer',
]
agg_spec = {metric: 'mean' for metric in metric_cols}

for col_name in col_list:
    cat_summary = cv_results_summary.groupby(col_name).agg(agg_spec).reset_index()
    cat_summary.rename(columns={col_name: 'parameters'}, inplace=True)
    cat_summary.insert(0, 'param_name', col_name)
    cat_summaries.append(cat_summary)

stacked_df = pd.concat(cat_summaries, axis=0, ignore_index=True)
stacked_df
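
The summary above averages each metric over all candidates that share one value of a hyperparameter, that is, a marginal mean. The toy sketch below shows the idea on an invented results table; the column names and scores are made up for illustration and do not come from the actual search.

```python
import pandas as pd

# Invented results for a 2 x 2 grid (values are illustrative only).
toy_results = pd.DataFrame({
    "param_depth": [8, 8, 12, 12],
    "param_samples": [500, 600, 500, 600],
    "mean_test_auprc": [0.10, 0.20, 0.30, 0.40],
})

# Marginal mean per value: average over every candidate sharing that value.
by_depth = toy_results.groupby("param_depth")["mean_test_auprc"].mean()
by_samples = toy_results.groupby("param_samples")["mean_test_auprc"].mean()

print(by_depth)
print(by_samples)
```

Marginal means show the overall trend of each hyperparameter but can hide interactions between hyperparameters, so the top-ranked rows of `cv_results` are still worth inspecting directly.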

In [None]:
# set constants or narrow down the range of certain parameters, and perform a second round of search

pu_estimator = GridSearchCV(estimator=pipe, 
                            param_grid={
                                        'clf__base_estimator__max_leaf_nodes': [120],
                                        'clf__base_estimator__max_depth': [8, 10, 12],
                                        'clf__base_estimator__min_samples_leaf': [2, 3, 4],
                                        'clf__base_estimator__min_samples_split': [2],
                                        'clf__n_estimators':[200],
                                        'clf__max_samples': [500, 600], 
                                        },
                            scoring={
                                     'top20k': make_scorer(topk, greater_is_better=True, needs_proba=True, top_k=20),
                                     'top100k': make_scorer(topk, greater_is_better=True, needs_proba=True, top_k=100),
                                     'auprc_scorer' : make_scorer(auprc_score, needs_proba=True)},                                     
                            refit='auprc_scorer',
                            return_train_score=False,
                            cv=3
                            )
pu_estimator.fit(train_X, train_y)


In [None]:
cv_results = pd.DataFrame(pu_estimator.cv_results_)
pd.set_option('display.max_columns', 80)
cv_results

In [None]:
cv_results_summary = cv_results.copy()

col_list = [                                       
    'param_clf__base_estimator__max_depth',
    'param_clf__base_estimator__min_samples_leaf',
    'param_clf__base_estimator__min_samples_split',
    'param_clf__max_samples'
]

cat_summaries = []

# metrics to average for each value of a hyperparameter
metric_cols = [
    'mean_test_top20k', 'rank_test_top20k',
    'mean_test_top100k', 'rank_test_top100k',
    'mean_test_auprc_scorer', 'rank_test_auprc_scorer',
]
agg_spec = {metric: 'mean' for metric in metric_cols}

for col_name in col_list:
    cat_summary = cv_results_summary.groupby(col_name).agg(agg_spec).reset_index()
    cat_summary.rename(columns={col_name: 'parameters'}, inplace=True)
    cat_summary.insert(0, 'param_name', col_name)
    cat_summaries.append(cat_summary)

stacked_df = pd.concat(cat_summaries, axis=0, ignore_index=True)
stacked_df