# Hyperparameter Tuning and Feature Selection

## Table of Contents

1. [**Introduction**](#Intro)
2. [**Cross-Validation**](#CrossVal)
3. [**Hyperparameter Tuning**](#HyperTun)

    3.1. [**Grid Search**](#GridSrch)

    3.2. [**Random Search**](#RandSrch)

4. [**Feature Selection**](#FeatureSlc)

    4.1. [**Feature Importance**](#FImport)
  
    4.2. [**Recursive Feature Elimination**](#RFE)

  


# 1 Introduction <a name="Intro"></a>




In this notebook, we cover several aspects of fine-tuning a model. First, we introduce cross-validation and show how it can be used to evaluate the performance of a model. Then, we look at hyperparameter tuning using grid search and random search. Finally, we briefly discuss feature selection and how to identify the most important features for model development.

# 2 Cross-Validation <a name="CrossVal"></a>


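Cross-validation is a resampling procedure for estimating how well a model performs on data it was not trained on. The procedure has a single parameter, k, which is the number of groups (folds) that a given data sample is split into. Each fold is held out once for evaluation while the model is trained on the remaining k-1 folds. As such, the procedure is often called k-fold cross-validation. When a specific value for k is chosen, it may be used in place of k in the name of the procedure, so k=10 becomes 10-fold cross-validation.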
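
To make the splitting concrete, here is a minimal sketch on a toy array of 10 samples with k=5. The array `X_toy` is an assumption made up for illustration and is not part of the steel plates data. Shuffling is left off so the folds are easy to read.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: 10 samples, one feature (values are illustrative only)
X_toy = np.arange(10).reshape(-1, 1)

# k = 5 folds; without shuffling, each validation fold is a contiguous block
kf = KFold(n_splits=5)

folds = list(kf.split(X_toy))
for i, (train_idx, val_idx) in enumerate(folds, start=1):
    print(f"Fold {i}: train={train_idx.tolist()}, validation={val_idx.tolist()}")
```

Every sample appears in exactly one validation fold, so each sample is used for evaluation once and for training k-1 times.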


A value of __k=10__ is very common in applied machine learning and is recommended if you are struggling to choose a value for your dataset. The reason is that empirical studies have found k=10 to provide a good trade-off between low computational cost and low bias in the estimate of model performance.

Let's try it out using a logistic regression model. We are going to use the steel manufacturing data that we have seen before.

The steel plate faults dataset is provided by Semeion, Research of Sciences of Communication, Via Sersale 117, 00128, Rome, Italy. In this dataset, the faults of steel plates are classified into 7 types. Since it was donated on October 26, 2010, this dataset has been widely used in machine learning for automatic pattern recognition. The fault types and the corresponding numbers of samples are shown in the table below.

<img src="https://docs.google.com/uc?export=download&id=1pw1oJ7plDsTASg_ntI_QSVivQ-tMhlqq" width="500">


The number of samples varies considerably from one category to another. Fault 7 is a special class because it covers all faults other than the first six kinds. In other words, samples in class 7 may have no obvious common characteristics. For every sample, 27 features are recorded, providing evidence for its fault class. All attributes are expressed as integers or real numbers. Detailed information about these 27 independent variables is listed in the following table.

<img src="https://docs.google.com/uc?export=download&id=1lAV-mPa2seL9VWkezbaCicnZVwOup2c6" width="500">



In [None]:
import pandas as pd
import numpy as np

url = ('https://raw.githubusercontent.com/MasoudMiM/ME_364/main/Steel_Plates_Faults/Data.csv')
df = pd.read_csv(url,names=['X_Minimum', 'X_Maximum', 'Y_Minimum', 'Y_Maximum', 'Pixels_Areas', 'X_Perimeter',
                            'Y_Perimeter', 'Sum_of_Luminosity', 'Minimum_of_Luminosity', 'Maximum_of_Luminosity',
                            'Length_of_Conveyer', 'TypeOfSteel_A300', 'TypeOfSteel_A400', 'Steel_Plate_Thickness',
                            'Edges_Index', 'Empty_Index', 'Square_Index', 'Outside_X_Index', 'Edges_X_Index',
                            'Edges_Y_Index', 'Outside_Global_Index', 'LogOfAreas', 'Log_X_Index', 'Log_Y_Index',
                            'Orientation_Index', 'Luminosity_Index', 'SigmoidOfAreas', 'Pastry', 'Z_Scratch',
                            'K_Scratch', 'Stains', 'Dirtiness', 'Bumps', 'Other_Faults'])           
df.head()

In [None]:
Features = ['X_Minimum', 'X_Maximum', 'Y_Minimum', 'Y_Maximum', 'Pixels_Areas', 'X_Perimeter',
             'Y_Perimeter', 'Sum_of_Luminosity', 'Minimum_of_Luminosity', 'Maximum_of_Luminosity',
             'TypeOfSteel_A300', 'TypeOfSteel_A400', 'Steel_Plate_Thickness']
             
x_data=np.array(df[Features])
y_data=df['Stains']


from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x_data,y_data,test_size=0.3)

In [None]:
from sklearn.model_selection import KFold  
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# prepare the cross-validation procedure
# (fixing random_state makes the shuffled folds reproducible across runs)
cv = KFold(n_splits=10, random_state=1, shuffle=True)

# create model
model = LogisticRegression()

# evaluate model
scores = cross_val_score(model, x_train, y_train, scoring='accuracy', cv=cv, n_jobs=-1)

# report performance
print(f'Accuracy: {np.mean(scores)}')

In [None]:
scores

We can split a dataset randomly in a way that maintains the same class distribution in each subset. This is called __stratification__ or __stratified sampling__. Applied to cross-validation, it gives a version of k-fold cross-validation that preserves the imbalanced class distribution in each fold. It is called __stratified k-fold cross-validation__, and it enforces that the class distribution in each split of the data matches the distribution in the complete training dataset.
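
The difference is easiest to see on a small imbalanced example. The sketch below assumes a made-up label vector `y_toy` with 90 negatives and 10 positives, and counts how many positives land in the validation part of each fold for plain and stratified splitting.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy imbalanced labels: 90 negatives and 10 positives (illustrative only)
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.zeros((len(y_toy), 1))  # feature values do not affect the split

def positives_per_fold(splitter):
    """Count positive labels in the validation part of each fold."""
    return [int(y_toy[val_idx].sum()) for _, val_idx in splitter.split(X_toy, y_toy)]

plain = positives_per_fold(KFold(n_splits=5, shuffle=True, random_state=1))
stratified = positives_per_fold(StratifiedKFold(n_splits=5, shuffle=True, random_state=1))

print("Positives per fold, KFold:          ", plain)
print("Positives per fold, StratifiedKFold:", stratified)
```

With plain `KFold` the count of positives can differ from fold to fold, while `StratifiedKFold` spreads them evenly.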

In [None]:
import matplotlib.pyplot as plt

print('Target variable distribution in training data:')
print( y_train.value_counts() )
fig = plt.figure(figsize=(15,4))
fig.add_subplot(1,2,1)
y_train.value_counts().plot(kind='barh')
plt.title('Training Data')

print('Target variable distribution in test data:')
print( y_test.value_counts() )
fig.add_subplot(1,2,2)
y_test.value_counts().plot(kind='barh')
plt.title('Test Data');

In [None]:
from sklearn.model_selection import StratifiedKFold

# prepare the cross-validation procedure
# (fixing random_state makes the shuffled folds reproducible across runs)
cv = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)

# create model
model = LogisticRegression(solver= 'newton-cg', penalty='l2', C = 5e-06)

# evaluate model
scores = cross_val_score(model, x_train, y_train, scoring='accuracy', cv=cv, n_jobs=-1)

# report performance
print(f'Accuracy: {np.mean(scores)}')

# 3 Hyperparameter Tuning <a name="HyperTun"></a>


## 3.1 Grid Search <a name="GridSrch"></a>

Grid search evaluates every combination of candidate values in a predefined search space. A point in the search space is a vector with a specific value for each hyperparameter. The goal of the optimization procedure is to find the vector that results in the best performance of the model after learning, such as maximum accuracy or minimum error.
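
As a small illustration of what "a point in the search space" means, the sketch below uses scikit-learn's `ParameterGrid` to enumerate every combination of two hyperparameters. The candidate values are illustrative assumptions.

```python
from sklearn.model_selection import ParameterGrid

# Candidate values for two hyperparameters (illustrative values)
space = {"C": [0.1, 1], "gamma": [1, 0.1, 0.01]}

# Each element of the grid is one point in the search space
grid = list(ParameterGrid(space))
for point in grid:
    print(point)

print(f"Number of candidate points: {len(grid)}")
```

The grid size is the product of the number of values per hyperparameter (here 2 x 3 = 6). With 10-fold cross-validation, each point requires 10 model fits, so the cost grows quickly as hyperparameters or values are added.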

We can use scikit-learn's `GridSearchCV` to perform this task. For example, we can perform a grid search over the hyperparameters of an __SVM__.

Recall that the SVM has two important hyperparameters, `C` and `gamma`.

- <font color='red'> __C__ </font> (regularization parameter) tells the SVM optimization how much you want to avoid misclassifying each training example. If __C__ is high, the optimization chooses a smaller-margin hyperplane, so the misclassification rate on the training data will be low. If __C__ is low, the margin will be larger, even if more training examples are misclassified.

<img src="https://docs.google.com/uc?export=download&id=1B0ZEQqulcumqQBhSqtD18gnbWS87Z1SS" width="900">

(ref for figure: https://www.vebuso.com/2020/03/svm-hyperparameter-tuning-using-gridsearchcv)

- <font color='red'> __Gamma__ </font>: The gamma parameter defines how far the influence of a single training example reaches. A high gamma means that only points close to the plausible hyperplane are considered, while a low gamma means that points at greater distances are considered as well. This parameter only comes into play when the 'rbf', 'poly', or 'sigmoid' kernels are used.

<img src="https://docs.google.com/uc?export=download&id=18ioDU_qX_fCKw-fnAVBapBfUnqgd1nHF" width="900">

(ref for figure: https://www.vebuso.com/2020/03/svm-hyperparameter-tuning-using-gridsearchcv)
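
For the 'rbf' kernel, the reach of a training example can be checked numerically. The kernel value between two points is $\exp(-\gamma \, d^2)$, where $d$ is the distance between them. The sketch below evaluates it for a few distances and gamma values; the function name `rbf_kernel_value` and the chosen numbers are assumptions for illustration only.

```python
import numpy as np

def rbf_kernel_value(distance, gamma):
    """RBF kernel value for two points separated by `distance`."""
    return np.exp(-gamma * distance ** 2)

# Distances between a training example and a query point (illustrative)
distances = np.array([0.5, 1.0, 2.0, 4.0])

for gamma in [0.01, 0.1, 1]:
    values = rbf_kernel_value(distances, gamma)
    print(f"gamma={gamma}: {np.round(values, 3)}")
```

For a large gamma the kernel value drops toward zero after a short distance, so each example only influences its close neighborhood. For a small gamma the value stays near one over a much wider range.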

In [None]:
import warnings
warnings.filterwarnings("ignore")


from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# define model
model = SVC()

# define evaluation
cv = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)

# define search space
space = dict()
#space['kernel'] = ['rbf', 'poly']
space['gamma'] = [1, 0.1, 0.01]
space['C'] = [0.1, 1]

# define search
search = GridSearchCV(model, space, scoring='accuracy', n_jobs=-1, cv=cv)

# execute search
result = search.fit(x_train, y_train)

# summarize result
print(f'Best Score: {result.best_score_}')
print(f'Best Hyperparameters: {result.best_params_}' )

As another example, we can look at logistic regression. For this algorithm, we can search for the best values of the numerical solver, the penalty term (the `L1` norm, the `L2` norm, or their combination, `elasticnet`), and the regularization strength `C`.

Recall the regularization terms:

$L_1=\frac{1}{C} \sum_{i=1}^{n}|w_i|$

$L_2=\frac{1}{C} \sum_{i=1}^{n}w_i^2$

In the logistic regression implementation, the regularization strength is determined by the parameter `C`. Because the penalty terms above are scaled by $1/C$, smaller values of `C` specify stronger regularization.
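
The effect of `C` can be checked with a short experiment. The sketch below is only an illustration on synthetic data from `make_classification` (an assumption, not the steel plates data). It fits logistic regression with the default L2 penalty for three values of `C` and reports the norm of the coefficient vector.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, used only to illustrate the effect of C
X_syn, y_syn = make_classification(n_samples=300, n_features=5, n_informative=3,
                                   n_redundant=0, random_state=0)

coef_norms = {}
for C in [0.01, 1, 100]:
    # Default penalty is L2; max_iter is raised to help the solver converge
    clf = LogisticRegression(C=C, max_iter=1000).fit(X_syn, y_syn)
    coef_norms[C] = float(np.linalg.norm(clf.coef_))
    print(f"C={C}: ||w|| = {coef_norms[C]:.3f}")
```

Smaller values of `C` should give a smaller coefficient norm, which is the shrinkage expected from stronger regularization.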


Let's try it with logistic regression.

In [None]:
import warnings
warnings.filterwarnings("ignore")

from sklearn.model_selection import GridSearchCV

# define model
model = LogisticRegression()

# define evaluation
cv = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)

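# define search space
# Not every solver/penalty pair is valid (for example, 'elasticnet' needs the
# 'saga' solver, and 'newton-cg' and 'lbfgs' do not support 'l1'). With the
# default error_score=np.nan, failed fits are scored as NaN instead of raising.
space = dict()
space['solver'] = ['newton-cg', 'lbfgs', 'liblinear']
space['penalty'] = ['none', 'l1', 'l2', 'elasticnet']
space['C'] = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1, 10, 100]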

# define search
search = GridSearchCV(model, space, scoring='accuracy', n_jobs=-1, cv=cv)

# execute search
result = search.fit(x_train, y_train)

# summarize result
print(f'Best Score: {result.best_score_}')
print(f'Best Hyperparameters: {result.best_params_}' )

## 3.2 Random Search <a name="RandSrch"></a>

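Random search evaluates a fixed number of points sampled at random from the search space instead of every combination, which is often cheaper than a full grid. We can use scikit-learn's `RandomizedSearchCV` to perform this task. We must set the number of iterations, that is, the number of samples to draw from the search space, via the `n_iter` argument.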
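
To see what the sampling looks like, the sketch below uses scikit-learn's `ParameterSampler` to draw `n_iter=10` points from a 4 x 4 space and compares that with the size of the full grid. The candidate values are illustrative assumptions.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Search space with lists of candidate values (illustrative values)
space = {"C": [0.1, 1, 10, 100], "gamma": [1, 0.1, 0.01, 0.001]}

n_grid = len(ParameterGrid(space))  # number of points a grid search would try
sampled = list(ParameterSampler(space, n_iter=10, random_state=1))

print(f"Grid search would evaluate {n_grid} points")
print(f"Random search evaluates {len(sampled)} of them:")
for point in sampled:
    print(point)
```

When every hyperparameter is given as a list, the points are sampled without replacement, so no combination is evaluated twice.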

Using SVM as an example:

In [None]:
import warnings
warnings.filterwarnings("ignore")

from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# define model
model = SVC()

# define evaluation
cv = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)

# define search space
space = dict()
#space['kernel'] = ['rbf', 'poly']
space['gamma'] = [1, 0.1, 0.01, 0.001]
space['C'] = [0.1, 1, 10, 100]

# define search
search = RandomizedSearchCV(model, space, n_iter=10, scoring='accuracy', n_jobs=-1, cv=cv, random_state=1)

# execute search
result = search.fit(x_train, y_train)

# summarize result
print(f'Best Score: {result.best_score_}')
print(f'Best Hyperparameters: {result.best_params_}')


Using logistic regression as another example:

In [None]:
import warnings
warnings.filterwarnings("ignore")

from sklearn.model_selection import RandomizedSearchCV

# define model
model = LogisticRegression()

# define evaluation
cv = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)

space = dict()
space['solver'] = ['newton-cg', 'lbfgs', 'liblinear']
space['penalty'] = ['none', 'l1', 'l2', 'elasticnet']
space['C'] = [1e-6, 5e-6, 1e-5, 5e-5, 1e-4, 5e-4, 1e-3, 5e-3, 1e-2, 
              5e-2, 1e-1, 5e-1, 1, 5, 10, 50, 100, 500, 10000]

# define search
search = RandomizedSearchCV(model, space, n_iter=20, scoring='accuracy', n_jobs=-1, cv=cv, random_state=1)

# execute search
result = search.fit(x_train, y_train)

# summarize result
print(f'Best Score: {result.best_score_}')
print(f'Best Hyperparameters: {result.best_params_}')


# 4 Feature Selection <a name="FeatureSlc"></a>




"feature selection… is the process of selecting a subset of relevant features for use in model construction". Also, "Feature selection is itself useful, but it mostly acts as a filter, muting out features that aren’t useful in addition to your existing features."

## 4.1 Feature Importance <a name="FImport"></a>

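For a __logistic__ regression model, the magnitude of each fitted coefficient can serve as a rough measure of feature importance. Note that coefficient magnitudes are only comparable across features when the features are on similar scales.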

In [None]:
# define the model
model = LogisticRegression()
# fit the model
model.fit(x_train, y_train)
# get importance
importance = model.coef_[0]
print(importance)

In [None]:
# summarize feature importance, largest absolute coefficient first
for feature, coef in sorted(zip(Features, importance), key=lambda pair: abs(pair[1]), reverse=True):
  print(feature, coef)

# plot feature importance
plt.figure(figsize=(10,4))
plt.bar(range(len(importance)), importance)
plt.xticks(ticks=range(len(importance)), labels=Features, rotation=90)
plt.show()
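
Raw coefficients depend on the units of each feature, so a feature measured on a large scale tends to receive a small coefficient regardless of how informative it is. One common remedy is to standardize the features before fitting. The sketch below is a minimal illustration under stated assumptions: synthetic data from `make_classification`, hypothetical feature names `f0` to `f3`, and one feature deliberately rescaled by a factor of 1000.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with hypothetical feature names
X_syn, y_syn = make_classification(n_samples=300, n_features=4, n_informative=2,
                                   n_redundant=0, random_state=0)
X_syn[:, 0] *= 1000.0  # put one feature on a much larger scale
names = ["f0", "f1", "f2", "f3"]

# Standardize, then fit; coefficients now refer to features in standard units
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_syn, y_syn)
coefs = pipe.named_steps["logisticregression"].coef_[0]

# Rank features by absolute coefficient, largest first
ranked = sorted(zip(names, coefs), key=lambda pair: abs(pair[1]), reverse=True)
for name, coef in ranked:
    print(f"{name}: {coef:+.3f}")
```

After standardization, each coefficient describes the effect of a one-standard-deviation change in its feature, which makes the magnitudes easier to compare.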

## 4.2 Recursive Feature Elimination (RFE) <a name="RFE"></a>

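Recursive feature elimination (RFE) fits a model, removes the least important feature or features according to the model's coefficients or feature importances, and repeats the process on the remaining features until the desired number is left. RFE is popular because it is easy to configure and use, and because it is effective at selecting the features (columns) in a training dataset that are most relevant for predicting the target variable.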
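
The outputs of RFE are easiest to read on a small example. The sketch below assumes synthetic data with 8 features, of which 3 are kept. It prints the boolean mask of selected features (`support_`) and the elimination ranking (`ranking_`).

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data, used only to illustrate the RFE outputs
X_syn, y_syn = make_classification(n_samples=300, n_features=8, n_informative=3,
                                   n_redundant=0, random_state=0)

# Remove one feature per iteration (the default step) until 3 remain
selector = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=3)
selector.fit(X_syn, y_syn)

print("Selected mask:", selector.support_.tolist())
print("Ranking:      ", selector.ranking_.tolist())
```

Selected features all receive rank 1. The remaining features are ranked by when they were eliminated, with the largest rank assigned to the feature removed first.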


In [None]:
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import RFE

# define the method
rfe = RFE(estimator=LogisticRegression(), n_features_to_select=5)
# fit the model
rfe.fit(x_train, y_train)

# evaluate model
cv = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)

n_scores = cross_val_score(rfe, x_train, y_train, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print(f'Accuracy: {np.mean(n_scores)}')


We can also look at the ranking of the features:

In [None]:
Features

In [None]:
for rank, feature in sorted(zip(rfe.ranking_,Features)):
  print(rank,feature)