## Classification with a Loss Function and Evaluation

In this notebook, we explore a KNN classifier together with a loss (scoring) function used to evaluate it.

In [7]:
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, RandomizedSearchCV
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data'

In [2]:
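# The raw file has no header row, so the column names are supplied explicitly
column_names = ['pregnancy_x', 'plasma_con', 'blood_pressure', 'skin_mm',
                'insulin', 'bmi', 'pedigree_func', 'age', 'target']
# All columns except the last one (the target) are features
feature_names = column_names[:-1]
all_data = pd.read_csv(url, names=column_names)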

In [9]:
X,y = all_data[feature_names], all_data['target']
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=123, stratify=y)

In [11]:
knn = KNeighborsClassifier()
params ={'n_neighbors':list(range(3,20,1))}
knn_rs = RandomizedSearchCV(knn,params,cv=10,n_iter=15)
knn_rs.fit(X_train,y_train)

RandomizedSearchCV(cv=10, error_score='raise',
          estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'),
          fit_params=None, iid=True, n_iter=15, n_jobs=1,
          param_distributions={'n_neighbors': [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,
          return_train_score=True, scoring=None, verbose=0)

In [12]:
knn_rs.best_score_

0.74429967426710097

In [14]:
# confusion matrix
from sklearn.metrics import confusion_matrix
y_pred = knn_rs.predict(X_test)
confusion_matrix(y_test,y_pred)

array([[89, 11],
       [25, 29]], dtype=int64)

In [17]:
# More specifically, unpack the four counts of the confusion matrix
tn, fp, fn, tp = confusion_matrix(y_test,y_pred).ravel()
(tn, fp, fn, tp)

(89, 11, 25, 29)

### Note: 
The confusion matrix shows fn = 25 (false negatives): 25 people were predicted as non-diabetic although they do have diabetes. This is a serious problem because these people may not receive any treatment.
The other issue is fp = 11 (false positives): 11 people were wrongly predicted as diabetic although they are healthy. These people would spend money on unnecessary treatment.

Here we want to maximize __sensitivity__ (recall), since wasting money on unnecessary treatment is preferable to missing a case of diabetes.
$\text{Sensitivity} = \frac{\text{people correctly labelled as having diabetes}}{\text{all people who have diabetes}} = \frac{tp}{tp + fn}$.   
Sensitivity reaches its maximum of 1 when the model correctly identifies every person who has diabetes. With scikit-learn, we proceed as follows:

In [19]:
from sklearn.metrics import recall_score
# check recall
recall_score(y_test,y_pred)

0.53703703703703709
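
The recall value can be checked by hand from the four confusion-matrix counts reported earlier. The sketch below only re-uses those counts (89, 11, 25, 29); the variable names `specificity` and `precision` are added here for illustration and are not part of the original analysis.

```python
# Counts copied from the confusion matrix shown earlier in this notebook
tn, fp, fn, tp = 89, 11, 25, 29

# Sensitivity (recall): share of actual diabetes cases that the model found
sensitivity = tp / (tp + fn)

# Two related ratios, shown only for comparison (illustrative additions)
specificity = tn / (tn + fp)   # share of healthy people correctly identified
precision = tp / (tp + fp)     # share of positive predictions that are correct

print(f"sensitivity = {sensitivity:.4f}")
print(f"specificity = {specificity:.4f}")
print(f"precision   = {precision:.4f}")
```

The sensitivity computed this way, 29 / 54, matches the value returned by `recall_score`.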

In [21]:
from sklearn.metrics import make_scorer
# Wrap recall_score so it can be passed as `scoring=` to a search object
recall_scorer = make_scorer(recall_score, greater_is_better=True)
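
A scorer built with `make_scorer` can be passed to `RandomizedSearchCV` through its `scoring` argument, so that the search selects the `n_neighbors` value with the best cross-validated recall. The following is a minimal sketch, not the run from this notebook: it uses synthetic data from `make_classification` as a stand-in for the diabetes data, and the names `X_demo`, `y_demo`, and `search` are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption): 8 features, roughly 35% positives
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     weights=[0.65, 0.35], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2,
                                          random_state=0, stratify=y_demo)

# Recall is a "higher is better" metric, which is make_scorer's default
recall_scorer = make_scorer(recall_score)

# Same search space as in the notebook, but scored by recall
search = RandomizedSearchCV(KNeighborsClassifier(),
                            {'n_neighbors': list(range(3, 20))},
                            n_iter=10, cv=5, scoring=recall_scorer,
                            random_state=0)
search.fit(X_tr, y_tr)

# Recall of the refitted best model on the held-out split
test_recall = recall_score(y_te, search.predict(X_te))
print(search.best_params_, search.best_score_, test_recall)
```

Fixing `random_state` in the search makes the sampled candidates reproducible; the notebook's own searches leave it unset, so their results can vary between runs.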

## Parameter search

In [28]:
from sklearn.metrics import roc_auc_score
params = {'n_neighbors':list(range(3,20,1))}
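# Search over k again, this time scoring by ROC AUC.
# Note: make_scorer(roc_auc_score) evaluates the hard labels from predict().
knn_rs = RandomizedSearchCV(knn, params, cv=10, n_iter=15,
                            scoring=make_scorer(roc_auc_score, greater_is_better=True))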
knn_rs.fit(X_train,y_train)

RandomizedSearchCV(cv=10, error_score='raise',
          estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'),
          fit_params=None, iid=True, n_iter=15, n_jobs=1,
          param_distributions={'n_neighbors': [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,
          return_train_score=True, scoring=make_scorer(roc_auc_score),
          verbose=0)

In [29]:
knn_rs.best_score_

0.69067492632232041
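
When reading this score, it may help to know that `make_scorer(roc_auc_score)` with default arguments feeds the output of `predict` (hard 0/1 labels) into the metric. For binary labels this equals balanced accuracy, whereas ROC AUC is more commonly computed from continuous scores such as `predict_proba`. The sketch below illustrates the difference on synthetic data (an assumption, not the diabetes data); `n_neighbors=7` and all variable names are chosen arbitrarily for the example.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(n_samples=400, n_features=8,
                                     random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25,
                                          random_state=1, stratify=y_demo)

clf = KNeighborsClassifier(n_neighbors=7).fit(X_tr, y_tr)
labels = clf.predict(X_te)               # hard 0/1 predictions
proba = clf.predict_proba(X_te)[:, 1]    # estimated P(class 1)

auc_from_labels = roc_auc_score(y_te, labels)
auc_from_proba = roc_auc_score(y_te, proba)
bal_acc = balanced_accuracy_score(y_te, labels)

print(f"AUC from labels        : {auc_from_labels:.4f}")
print(f"balanced accuracy      : {bal_acc:.4f}")
print(f"AUC from probabilities : {auc_from_proba:.4f}")
```

To score a search on probability-based AUC instead, scikit-learn accepts the built-in string `scoring='roc_auc'`.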