## Tuning the hyperparameters of a model

#### Trying out different hyperparameter values to find the best set for a model

#### Tags:
    Data: labeled data, Kaggle competition
    Technologies: python, pandas, scikit-learn
    Techniques: hyperparameter tuning
    
#### Resources:
[Kaggle competition data - Predicting Pulsar stars](https://www.kaggle.com/pavanraj159/predicting-a-pulsar-star/data)



### Hyperparameters of a model

Different models and machine learning techniques calculate the target variable (in predictive analytics) in different ways. This means each technique exposes its own set of levers that let us fit the model better to the task and the data at hand. These levers are the hyperparameters.

Hyperparameters differ from parameters in that they are not estimated from the data during training (think of the coefficients of a parametric model like linear regression). Instead, we set them before training. Nonetheless, they change the output of the functions that the different models use.
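
To make the distinction concrete, here is a minimal sketch on synthetic data (not the pulsar data set; `make_classification` and the sample sizes are assumptions chosen only for illustration). `C` is a hyperparameter we pick before training, while `coef_` is a parameter that `fit` estimates from the data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data, used only for illustration
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# C is a hyperparameter: we choose its value before training
model = LogisticRegression(C=0.5, solver='liblinear')
print(model.get_params()['C'])

# coef_ and intercept_ are parameters: fit() estimates them from the data
model.fit(X, y)
print(model.coef_.shape)  # one weight per feature
```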

### Tuning the hyperparameters

The process of changing the hyperparameters is called hyperparameter tuning: we change the hyperparameters to maximize the metric by which we evaluate the model. In essence, we adjust the levers of a model to fit it better to the data at hand.

### Hyperparameter tuning for algorithms in classification

#### Predicting if a star is a pulsar

We will use the Kaggle data to tune the hyperparameters of the following models:
    
    1. LogisticRegression
    2. GradientBoostingClassifier
    3. KNeighborsClassifier
    
So let's take a look!

In [2]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_auc_score, confusion_matrix, classification_report
%matplotlib inline

In [3]:
# import the relevant dataset
df = pd.read_csv('../data/pulsar_stars.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17898 entries, 0 to 17897
Data columns (total 9 columns):
 Mean of the integrated profile                  17898 non-null float64
 Standard deviation of the integrated profile    17898 non-null float64
 Excess kurtosis of the integrated profile       17898 non-null float64
 Skewness of the integrated profile              17898 non-null float64
 Mean of the DM-SNR curve                        17898 non-null float64
 Standard deviation of the DM-SNR curve          17898 non-null float64
 Excess kurtosis of the DM-SNR curve             17898 non-null float64
 Skewness of the DM-SNR curve                    17898 non-null float64
target_class                                     17898 non-null int64
dtypes: float64(8), int64(1)
memory usage: 1.2 MB


##### There are 17898 observations and 9 columns in the dataset 

In [4]:
# Data preparation
X = df.drop(['target_class'],axis=1)
y = df['target_class']

# Scaling the data
X_scaled = pd.DataFrame(StandardScaler().fit_transform(X),columns=X.columns)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test  = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
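
The cell above fits the scaler on the full data set before splitting, so the scaling statistics include the test rows. One common alternative, sketched below on synthetic data, is to put the scaler and the model in a scikit-learn `Pipeline` so that scaling is refitted on the training portion of every cross-validation fold. The names `X_demo`, `pipe` and `search` are hypothetical, and the grid is a reduced version of the Logistic Regression grid.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 8 pulsar features (assumption, for illustration only)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# Split first, so the test rows never influence the scaler
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Scaler and model form one estimator; the scaler is refitted inside each CV fold
pipe = make_pipeline(StandardScaler(), LogisticRegression(solver='liblinear'))

# Pipeline hyperparameters are addressed as <step name>__<parameter>
param_grid = {'logisticregression__C': [0.1, 0.25, 0.5]}

search = GridSearchCV(pipe, param_grid=param_grid, scoring='roc_auc', cv=5)
search.fit(X_tr, y_tr)
print(search.best_params_)
```

With this layout the test set is only touched once, when the tuned pipeline is scored at the end.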

### Setting the parameters to cycle through Grid Search 

For each of the models we set the hyperparameters we want to tune and the values we wish to try. From these, a grid of all possible combinations is built, and a model is created for each combination. Each model is then evaluated through cross-validation, and the hyperparameters of the best model are stored and available at the end. In sklearn this is implemented by GridSearchCV.
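
How the grid of combinations is built can be sketched with the standard library alone. This is an illustration of the idea, not how GridSearchCV is implemented internally; `expand_grid` is a hypothetical helper, and the grid is the Logistic Regression grid defined in the next cell.

```python
from itertools import product

# The Logistic Regression grid from this notebook
param_grid = {
    'penalty': ['l1', 'l2'],
    'C': [0.1, 0.25, 0.5],
    'solver': ['liblinear'],
}

def expand_grid(grid):
    """Return every combination of hyperparameter values as a list of dicts."""
    names = list(grid)
    value_lists = [grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]

candidates = expand_grid(param_grid)
print(len(candidates))
for candidate in candidates:
    print(candidate)
```

Two penalties times three values of `C` times one solver gives six candidates. With 10-fold cross-validation each candidate is fitted ten times, which is where the "60 fits" in the training log comes from.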


In [5]:
# setting the needed classifiers
classifiers = {}
classifiers['Logistic Regression'] = LogisticRegression(random_state=42)
classifiers['Gradient Boosting'] = GradientBoostingClassifier(random_state=42)
classifiers['KNN'] = KNeighborsClassifier()

# setting the wanted hyperparameters
hyperparams = {}
hyperparams['Logistic Regression'] = {
                                        'penalty': ['l1', 'l2'],
                                        'C': [0.1, 0.25, 0.5],
                                        'solver': ['liblinear']
                                    }
hyperparams['Gradient Boosting'] = {
                                'loss': ['deviance', 'exponential'],  # 'deviance' was renamed to 'log_loss' in scikit-learn 1.1
                                'learning_rate': [0.05, 0.1, 0.3],
                                'n_estimators':[10, 20, 50],
                                'min_samples_split':[10, 50, 100]
                               }
hyperparams['KNN'] = {
                        'n_neighbors': [5, 10, 20],
                        'weights': ['uniform', 'distance']
                    }

In [6]:
# Train the different models

def train_predict_gscv(classifiers, hyperparams, X_train, y_train, X_test, y_test):
    '''
    Tune each classifier with GridSearchCV and score the best model on the test set.

    Returns two dicts keyed by model name: the best hyperparameters found
    and the ROC AUC of the tuned model on the test set.
    '''
    best_params = {}
    test_score = {}
    for model in classifiers:
        # use the GridSearchCV class to create an object with the given parameters
        gscv = GridSearchCV(estimator=classifiers[model], param_grid=hyperparams[model],
                            scoring='roc_auc', cv=10, verbose=1)

        # fit the GridSearchCV on the training data (10-fold cross-validation per candidate)
        gscv.fit(X_train, y_train)

        # store the best parameters
        best_params[model] = gscv.best_params_

        # predict class labels with the best model found (refitted on the full training set)
        y_hat = gscv.predict(X_test)

        # note: y_hat holds hard 0/1 labels, so this AUC equals the balanced accuracy
        test_score[model] = roc_auc_score(y_test, y_hat)

    return best_params, test_score
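
The function above passes hard class labels to `roc_auc_score`. AUC is usually computed from scores or probabilities, and the two can differ noticeably. The sketch below shows this on a hand-made toy example (the labels and probabilities are invented for illustration), using a hypothetical `pairwise_auc` helper that computes AUC as the fraction of positive/negative pairs ranked correctly.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy labels and predicted probabilities (invented for illustration)
y_true = [0, 0, 0, 1, 1, 1]
y_prob = [0.1, 0.2, 0.6, 0.4, 0.7, 0.9]

# Hard labels at the usual 0.5 threshold, as predict() would return
y_hard = [1 if p >= 0.5 else 0 for p in y_prob]

auc_prob = pairwise_auc(y_true, y_prob)  # uses the full ranking
auc_hard = pairwise_auc(y_true, y_hard)  # uses a single threshold
print(round(auc_prob, 3), round(auc_hard, 3))
```

In scikit-learn the corresponding change would be to pass `gscv.predict_proba(X_test)[:, 1]` instead of `gscv.predict(X_test)` to `roc_auc_score`. The scores reported in this notebook were produced with hard labels.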


In [28]:
# results
best_params, test_score = train_predict_gscv(classifiers, hyperparams, X_train, y_train, X_test, y_test)

Fitting 10 folds for each of 6 candidates, totalling 60 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  60 out of  60 | elapsed:    9.8s finished


Fitting 10 folds for each of 54 candidates, totalling 540 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 540 out of 540 | elapsed:  2.3min finished


Fitting 10 folds for each of 6 candidates, totalling 60 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  60 out of  60 | elapsed:   10.7s finished


In [36]:
best_params

{'Logistic Regression': {'C': 0.5, 'penalty': 'l1', 'solver': 'liblinear'},
 'Gradient Boosting': {'learning_rate': 0.3,
  'loss': 'exponential',
  'min_samples_split': 50,
  'n_estimators': 50},
 'KNN': {'n_neighbors': 20, 'weights': 'distance'}}

In [37]:
test_score

{'Logistic Regression': 0.9146838995582806,
 'Gradient Boosting': 0.9206075865635448,
 'KNN': 0.8955320468886067}

### Selecting the best model

Inspecting the test scores shows that all three models perform well, with Gradient Boosting achieving the highest AUC score of 0.92. Interestingly, Logistic Regression (0.91) and KNN (0.90) are not far behind. Note that these scores are computed from the hard class predictions returned by `gscv.predict`, so they reflect a single decision threshold rather than the full ROC curve. Even so, the features in this data set separate pulsars from non-pulsars well.

More details on the different hyperparameters for each of the models can be found below:

[Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)

[Gradient Boosting](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html)

[KNN](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)