# Introduction
This kernel describes the parameters of the KNN algorithm and observes their effect on the results. A first prediction is made with the default parameters, and its score is used as the baseline for comparison. After that, the best value of each parameter is found separately and its effect on the result is discussed.

Finally, the GridSearch algorithm is used to find the best values of all parameters together, so that the results can be compared with each other in the conclusion. The purpose of this kernel is to understand the parameters of the KNN classifier and to gain experience with hyperparameter tuning.

<font color='blue'>
Content
    
1. [Load Data and PreCheck](#1)
1. [Preparing Data Set](#2)
1. [First Prediction With Default Parameters](#3)
1. [Parameters Description and Observation of Effects](#4)
    * [n_neighbors](#5)
    * [weights](#6)
    * [algorithm](#7)
    * [p](#8)
1. [GridSearch](#9)
1. [Conclusion](#10)

<a id='1'></a>
# Load Data and PreCheck

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import warnings
warnings.filterwarnings('ignore')

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


In [None]:
#read data from csv file
df = pd.read_csv("../input/breast-cancer-wisconsin-data/data.csv")

In [None]:
df.head()

* The "Unnamed: 32" column contains NaN values. It should be checked how many rows are NaN.
* The "id" column holds a unique id for each row, so it is a useless feature for classification and should be dropped.

In [None]:
df.info()

* The dataframe has 569 entries.
* The "Unnamed: 32" column has 569 null values, so it is useless and should be dropped.
* There are one categorical and 32 numeric columns, but one of the numeric columns is the id column, which should be dropped.
* The categorical column is diagnosis, the target column. Converting it to an int type avoids errors when the KNN algorithm is applied.

<a id='2'></a>
# Preparing Data Set

In [None]:
#Drop id and Unnamed: 32 columns
df.drop(['id','Unnamed: 32'],inplace = True, axis = 1)

#Convert Diagnosis from categorical to numeric
df.diagnosis = df.diagnosis.map({"M":1,"B":0})

#Show final info of dataframe
df.info()

The id and Unnamed: 32 columns were dropped and the data type of the diagnosis column is now int64, so the data set can be split into train and test sets.

In [None]:
#create X and Y objects
X = df.drop(["diagnosis"],axis = 1)
Y = df.diagnosis.values  # 1-D label array; scikit-learn expects shape (n_samples,)

#create test and train data sets
X_train, X_test, Y_train, Y_test = train_test_split(X,Y,random_state = 0,test_size = 0.2)

print("X_train Shape:",X_train.shape)
print("X_test Shape:",X_test.shape)
print("Y_train Shape:",Y_train.shape)
print("Y_test Shape:",Y_test.shape)

result_train = {}
result_test = {}

<a id='3'></a>
# First Prediction With Default Parameters
The default parameters of scikit-learn's KNeighborsClassifier are listed below.
* n_neighbors = 5
* weights = 'uniform'
* algorithm = 'auto'
* leaf_size = 30
* p = 2
* metric = 'minkowski'

In [None]:
#This function concatenates test and train scores into one dataframe
def prepare_dataframe(test_score_dict,train_score_dict,key,columns):
    df_test = pd.DataFrame(test_score_dict,index = ["Test Score"])
    df_train = pd.DataFrame(train_score_dict,index = ["Train Score"])
    np_result = np.concatenate([df_test,df_train],axis = 0)
    df_result = pd.DataFrame(np_result)

    df_result.index = ["Test Score","Train Score"]
    df_result.columns = [key + str(c) for c in columns]  
    return df_result

In [None]:
#Implement the KNN algorithm with the given parameters
def knn_model(n_neighbors = 5,weights = 'uniform',algorithm = 'auto',p = 2):
    knn = KNeighborsClassifier(n_neighbors = n_neighbors,
                               weights = weights,
                               algorithm = algorithm,
                               p = p
                              )
    
    #Train score: mean accuracy of 3-fold cross-validation on the train set
    accuracies_train = cross_val_score(estimator = knn, X = X_train, y = Y_train, cv = 3)
    train_score = np.mean(accuracies_train)
    
    #Test score: accuracy on the held-out test set after fitting on the whole train set
    knn.fit(X_train,Y_train)
    test_score = knn.score(X_test,Y_test)
   
    return train_score, test_score
       


In [None]:
#KNN algorithm implementation with default parameters
train_score,test_score = knn_model()
result_train["Default-Train"] = train_score
result_test["Default-Test"] = test_score
print("Mean accuracy of train set:",train_score)
print("Mean accuracy of test set:", test_score) 

<a id='4'></a>
# Parameters Description and Observation of Effects

In this part, each parameter is analyzed separately. Each parameter is tried with different values, and the effects of these changes are examined.

<a id='5'></a>
## n_neighbors
n_neighbors is one of the most important parameters of the KNN algorithm. KNN stands for K-Nearest Neighbors, and n_neighbors is this K value. For example, if n_neighbors is 3, the algorithm finds the 3 nearest points and classifies the new point according to the majority class among them.
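The majority vote described above can be sketched in a few lines of plain Python. This is only an illustration, not scikit-learn's implementation: the function name `knn_predict` and the toy points are made up for this example, and Euclidean distance is assumed.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Pair the Euclidean distance from each training point to the query with its label
    distances = [(math.dist(point, query), label)
                 for point, label in zip(train_points, train_labels)]
    # Keep the k closest points
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Return the most common label among them
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two small 2-D clusters
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.5), (6.5, 5.0)]
labels = ["B", "B", "B", "M", "M", "M"]

prediction = knn_predict(points, labels, query=(2.0, 2.0), k=3)
print(prediction)
```

Here the three points closest to the query all carry the label "B", so the vote is unanimous.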


In [None]:
#KNN algorithm implementation with different n_neighbors
k_list = list(np.arange(1,285))
test_score_dict = {}
train_score_dict = {}

for k in k_list:
    train_score,test_score = knn_model(n_neighbors = k)
    train_score_dict[k] = (train_score)
    test_score_dict[k] = (test_score)
    
df_result = prepare_dataframe(test_score_dict,train_score_dict,"K = ",k_list)
df_result


In [None]:
#Plot scores of different k values for test and train data
plt.figure(figsize = (20,5))
plt.subplot(1,2,1)
plt.plot(k_list,list(train_score_dict.values()))
plt.xlabel("K Value")
plt.ylabel("Train Score")

plt.subplot(1,2,2)
plt.plot(k_list,list(test_score_dict.values()))
plt.xlabel("K Value")
plt.ylabel("Test Score")
plt.show()

In [None]:
#Results stored in dictionaries
result_train["K-Best-Train"] = np.asarray(list(train_score_dict.values())).max()
result_test["K-Best-Test"] = np.asarray(list(test_score_dict.values())).max()

* If n_neighbors is too low, it can cause overfitting. Overfitting means that the KNN model memorizes the training data instead of learning general patterns from it.
* If n_neighbors is too high, it can cause underfitting. Underfitting means that the model is too coarse to capture the structure of the data, so the probability of a false prediction may be high.
* The result table and graphs show that the best value of n_neighbors is **9** for now!
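The two extremes can be illustrated on a tiny hand-made data set. The sketch below is plain Python with hypothetical data: it includes one deliberately "noisy" point labeled M inside the B cluster, assumes no duplicate points, and scores each model on the same points it was built from.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k):
    # Majority vote among the k training points closest to the query
    distances = sorted((math.dist(p, query), y) for p, y in zip(train_points, train_labels))
    return Counter(y for _, y in distances[:k]).most_common(1)[0][0]

def accuracy_on_training_data(points, labels, k):
    # Fraction of training points that the model labels correctly
    hits = sum(knn_predict(points, labels, p, k) == y for p, y in zip(points, labels))
    return hits / len(points)

# Hypothetical data: a B cluster with one noisy M point, plus an M cluster
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.3), (1.1, 1.1),
          (5.0, 5.0), (5.2, 4.9), (4.8, 5.3)]
labels = ["B", "B", "B", "M", "M", "M", "M"]

acc_k1 = accuracy_on_training_data(points, labels, k=1)            # each point is its own nearest neighbor
acc_kn = accuracy_on_training_data(points, labels, k=len(points))  # every query sees the same global majority
print(acc_k1, acc_kn)
```

With k = 1 every training point is reproduced exactly, including the noisy one, which is the memorization described above. With k equal to the number of points, every query receives the overall majority class, so the cluster structure is ignored.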

<a id='6'></a>
## weights

By default, the KNN algorithm uses **uniform** weights. With uniform weights, all neighbors have an equal vote on the query point. For example, if the five nearest points are selected to classify a query point and 3 of them belong to class1 and 2 to class2, the result is class1. The distances of the neighbors do not matter for the classification.

The other option is **distance**. If distance is selected, each neighbor is weighted by the inverse of its distance, so nearer neighbors contribute more to the vote. A simple majority is then no longer decisive: a few very close neighbors can outweigh a larger number of distant ones.
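A small sketch makes the difference concrete. The helper `vote` and the neighbor distances below are hypothetical, and all distances are assumed to be greater than zero so that the inverse is defined.

```python
from collections import defaultdict

def vote(neighbors, weights="uniform"):
    """Pick a class from a list of (distance, label) pairs for the k nearest points."""
    totals = defaultdict(float)
    for distance, label in neighbors:
        # uniform: every neighbor counts 1; distance: every neighbor counts 1/distance
        totals[label] += 1.0 if weights == "uniform" else 1.0 / distance
    return max(totals, key=totals.get)

# Five nearest neighbors: two very close class2 points, three distant class1 points
neighbors = [(0.5, "class2"), (0.6, "class2"),
             (2.0, "class1"), (2.5, "class1"), (3.0, "class1")]

uniform_result = vote(neighbors, "uniform")    # 3 votes against 2
distance_result = vote(neighbors, "distance")  # inverse distances are summed per class
print(uniform_result, distance_result)
```

With uniform weights class1 wins 3 to 2, while with distance weights the two close class2 neighbors carry more total weight than the three distant class1 neighbors.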

In [None]:
#KNN algorithm implementation with different weights
w_list = ['uniform','distance']

test_score_dict = {}
train_score_dict = {}

for w in w_list:
    train_score,test_score = knn_model(weights = w)
    train_score_dict[w] = (train_score)
    test_score_dict[w] = (test_score)

prepare_dataframe(test_score_dict,train_score_dict,"Weight = ",w_list)

In [None]:
#Results stored in dictionaries
result_train["Weight-Best-Train"] = np.asarray(list(train_score_dict.values())).max()
result_test["Weight-Best-Test"] = np.asarray(list(test_score_dict.values())).max()

The table above shows that the train score with distance weights is slightly higher than with uniform weights. Changing the weights parameter improves the train score but has no effect on the test score, which may be due to the particular samples that ended up in the test set. Still, according to these results, the weights parameter has a slight effect on the results.

<a id='7'></a>
## algorithm

The algorithm parameter sets the algorithm used to compute the nearest neighbors. There are 3 algorithms but 4 options for this parameter. The available algorithms are listed below:
* BallTree
* KDTree
* Brute Force

The fourth option is auto. With auto, scikit-learn attempts to choose the most appropriate algorithm based on the data passed to fit.
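All of these options perform an exact nearest-neighbor search, so they are expected to differ in speed and memory use rather than in predictions. The sketch below checks this on a synthetic data set; the data from `make_classification` and the names `X_demo`, `y_demo` are assumptions for illustration only, not the breast cancer data used in this kernel.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data, used only for this illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

predictions = {}
for algorithm in ["auto", "ball_tree", "kd_tree", "brute"]:
    model = KNeighborsClassifier(n_neighbors=5, algorithm=algorithm)
    model.fit(X_demo[:150], y_demo[:150])          # first 150 rows for fitting
    predictions[algorithm] = model.predict(X_demo[150:])  # last 50 rows for prediction

# Compare every option's predictions against those of 'auto'
all_equal = all(np.array_equal(predictions["auto"], preds)
                for preds in predictions.values())
print(all_equal)
```

Tiny differences could in principle appear when two neighbors are at almost exactly the same distance, but on continuous data like this such ties are unlikely.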


In [None]:
#KNN algorithm implementation with different algorithm
a_list = ['auto','ball_tree','kd_tree','brute']

test_score_dict = {}
train_score_dict = {}

for a in a_list:
    train_score,test_score = knn_model(algorithm = a)
    train_score_dict[a] = (train_score)
    test_score_dict[a] = (test_score)

prepare_dataframe(test_score_dict,train_score_dict,"Algorithm = ",a_list)

In [None]:
#Results stored in dictionaries
result_train["Algorithm-Best-Train"] = np.asarray(list(train_score_dict.values())).max()
result_test["Algorithm-Best-Test"] = np.asarray(list(test_score_dict.values())).max()

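According to the results, changing the algorithm parameter did not change the scores. This is expected: all three algorithms search for the exact nearest neighbors, so they differ in speed and memory use rather than in predictions. **auto** is a convenient choice because scikit-learn then picks the most appropriate algorithm based on the training data.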

<a id='8'></a>
## p

Different distance metrics can be used to measure how far a neighbor is from the query point. The default is the Minkowski metric, and p is its power parameter.
* If p = 1, this is equivalent to using **manhattan_distance (l_1)**.
* If p = 2, this is equivalent to using **euclidean_distance (l_2)**.
* For any other p, **minkowski_distance (l_p)** is used.

For those who wonder what these functions are:
* EuclideanDistance => sqrt(sum((x - y)^2))
* ManhattanDistance => sum(|x - y|)
* MinkowskiDistance => sum(|x - y|^p)^(1/p)
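The formulas above can be checked with a short plain-Python function. The name `minkowski_distance` is chosen here for illustration and is not taken from scikit-learn.

```python
def minkowski_distance(x, y, p=2):
    """Minkowski distance: sum(|x - y|^p)^(1/p)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

x, y = (0.0, 0.0), (3.0, 4.0)

d1 = minkowski_distance(x, y, p=1)  # Manhattan: |3| + |4|
d2 = minkowski_distance(x, y, p=2)  # Euclidean: sqrt(3^2 + 4^2)
d3 = minkowski_distance(x, y, p=3)  # general Minkowski with p = 3
print(d1, d2, d3)
```

For these two points the distance shrinks as p grows, which shows that changing p changes how far apart points appear and can therefore change which neighbors are the nearest.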


In [None]:
#KNN algorithm implementation with different p
p_list = list(np.arange(1,11))


test_score_dict = {}
train_score_dict = {}

for p in p_list:
    train_score,test_score = knn_model(p = p)
    train_score_dict[p] = (train_score)
    test_score_dict[p] = (test_score)

prepare_dataframe(test_score_dict,train_score_dict,"P = ",p_list)

In [None]:
plt.figure(figsize = (20,5))
plt.subplot(1,2,1)
plt.plot(p_list,list(train_score_dict.values()))
plt.xlabel("P Value")
plt.ylabel("Train Score")

plt.subplot(1,2,2)
plt.plot(p_list,list(test_score_dict.values()))
plt.xlabel("P Value")
plt.ylabel("Test Score")
plt.show()

In [None]:
#Results stored in dictionaries
result_train["P-Best-Train"] = np.asarray(list(train_score_dict.values())).max()
result_test["P-Best-Test"] = np.asarray(list(test_score_dict.values())).max()

According to the graphs and table above, the **p** parameter changes the prediction results, and the best p value seems to be 1 for this dataset. In other words, Manhattan distance is the best distance metric for this dataset.

<a id='9'></a>
# GridSearch


So far, the parameters have been studied separately and the following values have been obtained as the best options:
* n_neighbors = 9
* weights = "distance"
* algorithm = "auto"
* p = 1

Of course, there are methods to select the best parameters automatically. One of them is GridSearchCV. A dictionary is prepared with different values of the n_neighbors, p, algorithm and weights parameters, and GridSearchCV uses it to find the best combination.
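To get a feeling for how much work an exhaustive search involves, the sketch below enumerates every parameter combination with `itertools.product`. It only counts candidates and does not train any model. The value ranges mirror the grid used in this section, and the 3 folds correspond to `cv=3`.

```python
from itertools import product

# Same parameter ranges as the grid searched in this section
grid = {
    "n_neighbors": range(1, 235),
    "p": range(1, 3),
    "weights": ["uniform", "distance"],
    "algorithm": ["auto", "ball_tree", "kd_tree", "brute"],
}

# Every combination of one value per parameter is one candidate model
candidates = [dict(zip(grid, values)) for values in product(*grid.values())]

n_candidates = len(candidates)
n_fits = n_candidates * 3  # each candidate is fitted once per cross-validation fold

print(n_candidates, n_fits)
print(candidates[0])
```

The number of candidates is the product of the number of values per parameter, so every extra parameter or value multiplies the running time of the search.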

In [None]:
#Implementation of GridSearch
grid = {'n_neighbors':np.arange(1,235),
        'p':np.arange(1,3),
        'weights':['uniform','distance'],
        'algorithm':['auto','ball_tree','kd_tree','brute']
       }
knn = KNeighborsClassifier()
knn_cv = GridSearchCV(knn,grid,cv=3)
knn_cv.fit(X_train,Y_train)

print("Hyperparameters:",knn_cv.best_params_)
print("Train Score:",knn_cv.best_score_)
result_train["GridSearch-Best-Train"] = knn_cv.best_score_

In [None]:
#Results stored in dictionaries
result_test["GridSearch-Best-Test"] = knn_cv.score(X_test,Y_test)
print("Test Score:",knn_cv.score(X_test,Y_test))

<a id='10'></a>
# Conclusion

In [None]:
#Result dataframe
columns = ["Default","K-Best","Weight-Best","Algorithm-Best","P-Best","GridSearchCV"]
prepare_dataframe(result_test,result_train,"",columns)

In [None]:
#Bar plot showing train and test results for each parameter
plt.figure(figsize = (12,5))
x_pos = np.arange(len(result_train))
ax = plt.subplot(111)
ax.bar(x_pos, list(result_train.values()), width=0.2, color='b', align='center')
ax.bar(x_pos-0.2, list(result_test.values()), width=0.2, color='g', align='center')
ax.legend(('Train Results','Test Results'))
plt.xticks(x_pos, columns)
plt.title("Comparing Results of Parameters", fontsize=17)
plt.show()

* The results show that parameter selection affects the quality of a model built with the KNN algorithm.
* Some parameters, such as n_neighbors and p, have more effect on the result than others.
* Since the KNN algorithm classifies according to the closest neighbors, it is quite logical that p and n_neighbors are the most influential: n_neighbors sets the number of neighbors, and p determines how the distance between the query point and a neighbor is measured.
* The **auto** option can be preferred for the algorithm parameter, because the choice of algorithm did not change the scores here and auto lets scikit-learn pick a suitable one.
* Although the parameter effects are examined separately in this kernel, it is best to try all parameter combinations and find the best combination at once with the GridSearch algorithm.