# Model Performance Measures, ML Pipeline and Hyperparameter Tuning

## Can you correctly identify glass type?

## Context:
    
This is the Glass Identification Data Set from UCI. It contains 10 attributes, including the ID. The response is the glass type, a discrete variable with 7 possible values (one of which does not occur in this database).

# Content

Attribute Information:

Id number: 1 to 214 

RI: refractive index

Na: Sodium (unit measurement: weight percent in corresponding oxide, as are attributes 4-10)

Mg: Magnesium

Al: Aluminum

Si: Silicon

K: Potassium

Ca: Calcium

Ba: Barium

Fe: Iron

Type of glass: (class attribute) 
    -- 1 building_windows_float_processed 
    -- 2 building_windows_non_float_processed 
    -- 3 vehicle_windows_float_processed 
    -- 4 vehicle_windows_non_float_processed (none in this database) 
    -- 5 containers 
    -- 6 tableware 
    -- 7 headlamps

## Source:
https://archive.ics.uci.edu/ml/datasets/Glass+Identification

# 1.  Import necessary libraries and load the data

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

data=pd.read_csv("glass.csv")

data.head(10)

Unnamed: 0,ID,refractive index,Sodium,Magnesium,Aluminum,Silicon,Potassium,Calcium,Barium,Iron,Type
0,1,1.52101,13.64,4.49,1.1,71.78,0.06,8.75,0.0,0.0,building_windows_float_processed
1,2,1.51761,13.89,3.6,1.36,72.73,0.48,7.83,0.0,0.0,building_windows_float_processed
2,3,1.51618,13.53,3.55,1.54,72.99,0.39,7.78,0.0,0.0,building_windows_float_processed
3,4,1.51766,13.21,3.69,1.29,72.61,0.57,8.22,0.0,0.0,building_windows_float_processed
4,5,1.51742,13.27,3.62,1.24,73.08,0.55,8.07,0.0,0.0,building_windows_float_processed
5,6,1.51596,12.79,3.61,1.62,72.97,0.64,8.07,0.0,0.26,building_windows_float_processed
6,7,1.51743,13.3,3.6,1.14,73.09,0.58,8.17,0.0,0.0,building_windows_float_processed
7,8,1.51756,13.15,3.61,1.05,73.24,0.57,8.24,0.0,0.0,building_windows_float_processed
8,9,1.51918,14.04,3.58,1.37,72.08,0.56,8.3,0.0,0.0,building_windows_float_processed
9,10,1.51755,13.0,3.6,1.36,72.99,0.57,8.4,0.0,0.11,building_windows_float_processed


In [8]:
data.shape   # not echoed: a cell displays only the value of its last expression
data.dtypes


ID                    int64
refractive index    float64
Sodium              float64
Magnesium           float64
Aluminum            float64
Silicon             float64
Potassium           float64
Calcium             float64
Barium              float64
Iron                float64
Type                 object
dtype: object

In [10]:
data.isna().sum()

ID                  0
refractive index    0
Sodium              0
Magnesium           0
Aluminum            0
Silicon             0
Potassium           0
Calcium             0
Barium              0
Iron                0
Type                0
dtype: int64

# 2. Split the data into dependent and independent variables. Also check the shape of each

Hint: you can use any method (iloc or drop)

In [16]:
X=data.iloc[:,1:10].values
y=data.iloc[:,10].values
X.shape   # not echoed: a cell displays only the value of its last expression
y.shape



(214,)
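
The hint mentions `drop` as an alternative to `iloc`. A minimal sketch of that variant follows, assuming a tiny hypothetical frame (`toy`) as a stand-in for `glass.csv`; the values are illustrative only. Selecting by column name avoids depending on column positions.

```python
import pandas as pd

# Hypothetical three-row stand-in for glass.csv (values are illustrative only).
toy = pd.DataFrame({
    "ID": [1, 2, 3],
    "refractive index": [1.52, 1.51, 1.53],
    "Sodium": [13.6, 13.9, 13.5],
    "Type": ["containers", "headlamps", "tableware"],
})

# Drop the identifier and the target by name; everything left is a feature.
X_toy = toy.drop(columns=["ID", "Type"]).values
y_toy = toy["Type"].values

print(X_toy.shape, y_toy.shape)
```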

# 3. Convert Target variable into numerical

In [17]:
from sklearn.preprocessing import LabelEncoder

le=LabelEncoder()
y= le.fit_transform(y)

le.transform(['building_windows_float_processed','building_windows_non_float_processed','containers','headlamps',
              'tableware','vehicle_windows_float_processed'])


array([0, 1, 2, 3, 4, 5])
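
`LabelEncoder` assigns codes in sorted order of the class names, which is why the alphabetical list above maps to 0 through 5. The sketch below shows how the mapping can be read from `classes_` and reversed with `inverse_transform`; the short label list is illustrative, not the notebook's `y`.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative labels only; the notebook's `le` is fitted on the full Type column.
labels = ["headlamps", "containers", "tableware", "containers"]

enc = LabelEncoder()
codes = enc.fit_transform(labels)

# classes_ is sorted, so the position of each class is its integer code.
mapping = {name: code for code, name in enumerate(enc.classes_)}
print(mapping)

# inverse_transform turns integer codes back into readable names.
decoded = enc.inverse_transform(codes)
print(list(decoded))
```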

# 4. Split the dataset into training, validation, and test sets
It is good practice to split the dataset into three sets: one to fit the model, one to tune it, and one for the final evaluation.

In [37]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.30 ,random_state=0)
X_train,X_val,y_train, y_val =train_test_split(X_train,y_train,test_size=0.25,random_state=1)

X_train.shape



(111, 9)
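
The two-stage split can be checked by hand. The arithmetic below is a sketch that assumes `train_test_split` rounds a fractional held-out size up, which is consistent with the `(111, 9)` shape printed above.

```python
import math

n_samples = 214  # rows in the glass data

# First split: 30% held out as the test set (size rounded up).
n_test = math.ceil(0.30 * n_samples)
n_remaining = n_samples - n_test

# Second split: 25% of the remainder becomes the validation set.
n_val = math.ceil(0.25 * n_remaining)
n_train = n_remaining - n_val

print(n_train, n_val, n_test)
```

So the effective proportions are roughly 52% training, 18% validation, and 30% test.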

# 5. Build the pipeline
Steps:
Instantiate the pipeline: first standardize the data with a standard scaler, then run PCA on the scaled data, and finally feed the principal components to logistic regression (or any other algorithm).

Hint:

Import standard scaler to standardize the data

You can take an algorithm of choice and build a pipeline

In [30]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipe_lr= Pipeline([('scl',StandardScaler()),('pca',PCA(n_components=3)),('clr',LogisticRegression(random_state=1))])
pipe_lr.fit(X_train,y_train)
pipe_lr.score(X_test,y_test)

0.5230769230769231
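
One way to judge whether three components are enough is to look at how much variance PCA retains. The sketch below rebuilds the same kind of pipeline on synthetic data from `make_classification` (an assumption, since the glass values are not reproduced here) and reads the fitted PCA step through `named_steps`. The step name `clf` and `max_iter=1000` are choices made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 9 features, like the glass data (not the real values).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=9, n_informative=5, n_classes=3, random_state=0
)

pipe = Pipeline([
    ("scl", StandardScaler()),
    ("pca", PCA(n_components=3)),
    ("clf", LogisticRegression(max_iter=1000, random_state=1)),
])
pipe.fit(X_demo, y_demo)

# Fitted steps are reachable by name; PCA reports the variance share per component.
ratios = pipe.named_steps["pca"].explained_variance_ratio_
print(ratios, ratios.sum())
```

If the retained share is low, a larger `n_components` may be worth trying, which is what the grid search in the next step explores.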

# 6. Follow the steps above and check whether you can tune the logistic regression parameters using grid search (any algorithm can be used)

In [40]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

pipe_svc=Pipeline([('sc1',StandardScaler()),('pca',PCA()),('svc',SVC())])

param_grid = {'pca__n_components':[4,5],'svc__C': [0.001, 0.01, 0.1, 1, 10, 100], 'svc__gamma': [0.001, 0.01, 0.1, 1, 10], 'svc__kernel':['rbf','poly']} 

grid=GridSearchCV(pipe_svc,param_grid=param_grid,cv=5)
grid.fit(X_train,y_train)
print("Best cross-validation accuracy: {:.2f}".format(grid.best_score_))
print("Best parameters:", grid.best_params_)
print("Test set accuracy: {:.2f}".format(grid.score(X_test, y_test)))

Best cross-validation accuracy: 0.74
Best parameters: {'pca__n_components': 5, 'svc__C': 10, 'svc__gamma': 0.1, 'svc__kernel': 'rbf'}
Test set accuracy: 0.71
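
It helps to know how much work a grid implies before running it. The sketch below counts the candidates in the grid used above with `ParameterGrid` and multiplies by the number of folds; the final refit on the full training set is not included in the count.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "pca__n_components": [4, 5],
    "svc__C": [0.001, 0.01, 0.1, 1, 10, 100],
    "svc__gamma": [0.001, 0.01, 0.1, 1, 10],
    "svc__kernel": ["rbf", "poly"],
}

n_candidates = len(ParameterGrid(param_grid))  # 2 * 6 * 5 * 2 combinations
n_fits = n_candidates * 5                      # cv=5 folds per candidate

print(n_candidates, n_fits)
```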


# 7. Optimize the model parameters (any algorithm can be used)

Use grid search to tune the hyperparameters.

Steps:
Split the dataset into a train set and a test set

Choose any algorithm and build a parameter grid from its list of hyperparameters

Once the hyperparameter grid is defined, import GridSearchCV and fit it on X_train and y_train

Find the best parameters and the mean test score



In [41]:
grid.predict(X_test)

array([3, 0, 1, 4, 2, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 3, 0, 1, 0, 1, 2, 0,
       3, 3, 1, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 3, 1, 3, 1, 0,
       1, 1, 0, 1, 0, 1, 0, 4, 3, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0])
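
To see which classes the tuned SVC actually predicts, the array above can be tallied. This is a small sketch with the predictions copied from the output; reading the codes as class names relies on the encoding shown in step 3.

```python
import numpy as np

# Predictions copied from the grid.predict(X_test) output above.
preds = np.array([
    3, 0, 1, 4, 2, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 3, 0, 1, 0, 1, 2, 0,
    3, 3, 1, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 3, 1, 3, 1, 0,
    1, 1, 0, 1, 0, 1, 0, 4, 3, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
])

# One count per encoded class 0..5; minlength keeps classes that never appear.
counts = np.bincount(preds, minlength=6)
print(counts)
```

In this output, code 5 (`vehicle_windows_float_processed` under the step 3 encoding) is never predicted, which the accuracy figure alone does not reveal.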

In [42]:
from sklearn.neighbors import KNeighborsClassifier
knn_clf=KNeighborsClassifier()

In [44]:
X_train,X_test, y_train,y_test = train_test_split(X,y,stratify=y , random_state=7)
knn_clf.fit(X_train,y_train)
knn_clf

KNeighborsClassifier()
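
The split above passes `stratify=y`, which keeps the class proportions the same in both parts. A minimal sketch of that effect on toy labels (hypothetical data with an 80/20 class ratio, not the glass classes):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 80 samples of class 0 and 20 of class 1.
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

X_a, X_b, y_a, y_b = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=7
)

# Both parts keep the 80/20 class ratio of the full label vector.
print(np.bincount(y_a), np.bincount(y_b))
```

This matters for small, imbalanced data sets, where an unstratified split can leave a rare class almost absent from one of the parts.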

In [45]:
from sklearn.metrics import accuracy_score
param_grid ={'n_neighbors' : list(range(1,9)) ,'algorithm' : ('auto','ball_tree','kd_tree','brute')}

gs=GridSearchCV(knn_clf,param_grid,cv=10)
gs.fit(X_train,y_train)





GridSearchCV(cv=10, estimator=KNeighborsClassifier(),
             param_grid={'algorithm': ('auto', 'ball_tree', 'kd_tree', 'brute'),
                         'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8]})

In [47]:
gs.best_params_

{'algorithm': 'auto', 'n_neighbors': 1}

In [48]:
gs.best_score_

0.675

In [49]:
gs.cv_results_['mean_test_score']

array([0.675  , 0.65625, 0.63125, 0.65625, 0.625  , 0.61875, 0.58125,
       0.6    , 0.675  , 0.65625, 0.63125, 0.65625, 0.625  , 0.61875,
       0.58125, 0.6    , 0.675  , 0.65625, 0.63125, 0.65625, 0.625  ,
       0.61875, 0.58125, 0.6    , 0.675  , 0.65625, 0.63125, 0.65625,
       0.625  , 0.61875, 0.58125, 0.6    ])
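
The 32 scores are easier to read as a table. The sketch below reshapes the values copied from the output above, assuming the candidates are ordered with `algorithm` varying slowest and `n_neighbors` fastest (parameter names sorted, last one changing first).

```python
import numpy as np

# mean_test_score values copied from the output above (4 algorithms x 8 values of k).
scores = np.array([
    0.675, 0.65625, 0.63125, 0.65625, 0.625, 0.61875, 0.58125, 0.6,
    0.675, 0.65625, 0.63125, 0.65625, 0.625, 0.61875, 0.58125, 0.6,
    0.675, 0.65625, 0.63125, 0.65625, 0.625, 0.61875, 0.58125, 0.6,
    0.675, 0.65625, 0.63125, 0.65625, 0.625, 0.61875, 0.58125, 0.6,
])

# Rows: auto, ball_tree, kd_tree, brute. Columns: n_neighbors = 1..8.
table = scores.reshape(4, 8)

same_for_all_algorithms = bool((table == table[0]).all())
best_k = int(table[0].argmax()) + 1  # columns start at n_neighbors = 1

print(same_for_all_algorithms, best_k)
```

All four rows are identical here, presumably because `algorithm` only changes how the neighbors are searched, not which neighbors are found. Only `n_neighbors` affects the score in this grid.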

In [53]:
gs.score(X_test,y_test)

0.8148148148148148
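
Accuracy is a single number and can hide which classes are confused with each other. A confusion matrix gives the per-class picture. The sketch below uses small hypothetical label lists, not the glass results; in the notebook the inputs would be `y_test` and `gs.predict(X_test)`.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy true and predicted labels (hypothetical, not the glass results).
y_true = [0, 0, 1, 1, 2, 2]
y_hat = [0, 1, 1, 1, 2, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_hat)
acc = accuracy_score(y_true, y_hat)

print(cm)
print(acc)
```

The diagonal holds the correct predictions, so accuracy equals the diagonal sum divided by the total count.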