<a href="https://colab.research.google.com/github/silvererudite/ML_algos_onSomeDatasets/blob/master/EDA_wine.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this post we will learn the basics of hyperparameter tuning using `GridSearch`. Since I want to focus solely on this concept, I will skip other pre-processing and feature selection techniques, which are also important steps in a machine learning pipeline.

In [62]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.datasets import load_wine

data = load_wine()
wine_data = pd.DataFrame(data.data, columns=data.feature_names)
wine_target = pd.DataFrame(data.target, columns=['wine_class'])
wine_data.head()


| | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.23 | 1.71 | 2.43 | 15.6 | 127.0 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065.0 |
| 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100.0 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050.0 |
| 2 | 13.16 | 2.36 | 2.67 | 18.6 | 101.0 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185.0 |
| 3 | 14.37 | 1.95 | 2.5 | 16.8 | 113.0 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480.0 |
| 4 | 13.24 | 2.59 | 2.87 | 21.0 | 118.0 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735.0 |


We will split the data into train and test sets so that we can do all of our EDA on the training set and keep the test set untouched.

In [55]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(wine_data, wine_target, test_size=0.2, random_state=42)
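
As a quick sanity check on the split, the sketch below rebuilds the same split from scratch and prints the resulting shapes along with the class counts in the training set. With 178 samples and `test_size=0.2`, scikit-learn places 36 rows in the test set and the remaining 142 in the training set. The `value_counts()` call is an extra check that is not part of the original notebook.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

# Rebuild the same DataFrames used in the post
data = load_wine()
wine_data = pd.DataFrame(data.data, columns=data.feature_names)
wine_target = pd.DataFrame(data.target, columns=['wine_class'])

X_train, X_test, y_train, y_test = train_test_split(
    wine_data, wine_target, test_size=0.2, random_state=42
)

# Number of rows and columns in each part of the split
print(X_train.shape, X_test.shape)

# How many training samples fall into each wine class
print(y_train['wine_class'].value_counts())
```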




Let's quickly generate summary statistics to get an overview of our data and check whether there are any `null` values.

In [56]:
print(X_train.describe())
print(X_train.info())

          alcohol  malic_acid  ...  od280/od315_of_diluted_wines      proline
count  142.000000  142.000000  ...                    142.000000   142.000000
mean    12.979085    2.373521  ...                      2.592817   734.894366
std      0.820116    1.143934  ...                      0.722141   302.323595
min     11.030000    0.890000  ...                      1.270000   278.000000
25%     12.332500    1.615000  ...                      1.837500   502.500000
50%     13.010000    1.875000  ...                      2.775000   660.000000
75%     13.677500    3.135000  ...                      3.170000   932.750000
max     14.830000    5.800000  ...                      4.000000  1547.000000

[8 rows x 13 columns]
<class 'pandas.core.frame.DataFrame'>
Int64Index: 142 entries, 158 to 102
Data columns (total 13 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   alcohol                       142 non-null   

It looks like we are good to go, as there are no `null` values or incompatible data types.
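
The `info()` listing can be long, so a more direct way to confirm this is to count the missing values per column and look at the column dtypes. The snippet below is a minimal sketch that rebuilds the training set so it can run on its own. The variable `null_counts` is just an illustrative name.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

data = load_wine()
wine_data = pd.DataFrame(data.data, columns=data.feature_names)
wine_target = pd.DataFrame(data.target, columns=['wine_class'])
X_train, X_test, y_train, y_test = train_test_split(
    wine_data, wine_target, test_size=0.2, random_state=42
)

# Count missing values in each feature column
null_counts = X_train.isnull().sum()
print(null_counts)

# List the distinct data types present among the columns
print(X_train.dtypes.unique())
```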

#### Hyperparameter Search

Hyperparameters are settings of a model that we choose before training rather than learn from the data, and they are specific to the particular model that we use. For instance, for a KNeighbors classifier the number of neighbors is a hyperparameter, and for a decision tree classifier the maximum depth of the tree is one (the number of trees is a hyperparameter of ensembles such as random forests). Hyperparameters affect model performance, and finding the best values requires experimentation. The process of finding the best hyperparameter values is called `tuning`. `GridSearch` and `RandomSearch` are common techniques for hyperparameter tuning. For this article, we will look at a very basic type of GridSearch where we will tune only one parameter, `n_neighbors`.
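
As a small aside, one way to see which hyperparameters a scikit-learn estimator exposes is to call `get_params()` on it. The sketch below prints them for a KNeighbors classifier and a decision tree classifier. The variable names `knn_params` and `tree_params` are only illustrative.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# get_params() returns a dict of hyperparameter names and their current values
knn_params = KNeighborsClassifier().get_params()
tree_params = DecisionTreeClassifier().get_params()

print(sorted(knn_params))         # names of all KNeighbors hyperparameters
print(knn_params['n_neighbors'])  # scikit-learn's default number of neighbors
print(tree_params['max_depth'])   # default maximum depth of the tree
```

Every key in these dictionaries is a value we set ourselves, which is what makes it a candidate for tuning.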

In [57]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

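k_values = [1, 2, 3, 4, 5, 6, 7]
accuracy_values = []  # one accuracy score per value of k

for k in k_values:
  knn = KNeighborsClassifier(n_neighbors=k, algorithm='brute')
  # y_train is a one-column DataFrame; ravel() flattens it to the 1-D array fit() expects
  knn.fit(X_train, y_train.values.ravel())
  y_pred = knn.predict(X_test)
  accuracy_values.append(accuracy_score(y_test, y_pred))

print(accuracy_values)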



[0.7777777777777778, 0.7222222222222222, 0.8055555555555556, 0.75, 0.7222222222222222, 0.7222222222222222, 0.6944444444444444]
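
The loop above scores each value of `k` directly on the test set. The same search can also be written with scikit-learn's `GridSearchCV`, which scores each candidate with cross-validation on the training data and leaves the test set for a final check. The snippet below is a minimal sketch of that approach, not the notebook's original method. The choice of `cv=5` and the name `param_grid` are assumptions made for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Same data and split settings as in the post, loaded as plain arrays
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The grid holds every value of n_neighbors we want to try
param_grid = {'n_neighbors': [1, 2, 3, 4, 5, 6, 7]}

# 5-fold cross-validation on the training set (cv=5 is an assumed choice)
search = GridSearchCV(
    KNeighborsClassifier(algorithm='brute'),
    param_grid,
    cv=5,
    scoring='accuracy',
)
search.fit(X_train, y_train)

print('best k:', search.best_params_['n_neighbors'])
print('mean cross-validated accuracy:', search.best_score_)

# By default the best model is refit on the full training set,
# so it can be scored once on the untouched test set
print('test accuracy:', search.score(X_test, y_test))
```

Because the candidates are compared using only the training data, the final test accuracy is a fairer estimate of how the chosen `k` performs on unseen samples.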


