# K-Nearest Neighbour


K-nearest neighbors (KNN) is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure (e.g., a distance function). A new record is classified by a majority vote of its neighbors: it is assigned to the class most common among its K nearest neighbors, as measured by the distance function. If K = 1, the case is simply assigned to the class of its nearest neighbor.

Suppose you want to find the class of the blue star (BS) in the image below. BS can belong to either of two classes, RC or GS. The “K” in the KNN algorithm is the number of nearest neighbors whose votes we take. Let’s say K = 3. We draw a circle centered on BS that is just large enough to enclose exactly three data points on the plane. The three closest points to BS are all RC, so we can say with good confidence that BS belongs to class RC. Here the choice is obvious because all three votes from the closest neighbors went to RC. The choice of the parameter K is crucial in this algorithm: a very small K makes the prediction sensitive to noisy points, while a very large K lets distant points outvote the local neighborhood.

<img src="img1.png">
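The voting procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration and not the scikit-learn implementation used later; the function name `knn_predict` and the coordinates are hypothetical and chosen only to mimic the blue-star picture.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Pair every training label with its Euclidean distance to the query
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbours
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # The most frequent label among them wins the vote
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical 2-D points: three RC points near the query, three GS points far away
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
labels = ["RC", "RC", "RC", "GS", "GS", "GS"]
blue_star = (2.0, 2.0)

predicted = knn_predict(points, labels, blue_star, k=3)
print(predicted)
```

With these made-up points, all three nearest neighbours of the query are RC, so the vote is unanimous.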

The distance functions used for numeric fields are given below:
<img src="img2.png">
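As a rough companion to the image, here is a small NumPy sketch of the two distances used later in this notebook (euclidean and manhattan), plus the Minkowski distance that generalizes both. Whether the image lists exactly these three is an assumption; the helper names and the sample vectors are hypothetical.

```python
import numpy as np


def euclidean(a, b):
    # Square root of the sum of squared coordinate differences
    return float(np.sqrt(np.sum((a - b) ** 2)))


def manhattan(a, b):
    # Sum of absolute coordinate differences
    return float(np.sum(np.abs(a - b)))


def minkowski(a, b, p):
    # Generalization: p = 1 gives manhattan, p = 2 gives euclidean
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))


a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

d_euc = euclidean(a, b)  # differences are (3, 4, 0), so this is 5.0
d_man = manhattan(a, b)  # 3 + 4 + 0 = 7.0
print(d_euc, d_man, minkowski(a, b, 2))
```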

### Libraries useful in K-NN are listed below

In [2]:
import pandas as pd
from sklearn import preprocessing
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.metrics import accuracy_score

### Get The Data. Load data "bank-data.csv"

In [6]:
# import dataset
dataset = pd.read_csv("bank-data.csv",index_col=0)
dataset

| id | age | sex | region | income | married | children | car | save_act | current_act | mortgage | pep |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ID12101 | 48 | FEMALE | INNER_CITY | 17546.00 | NO | 1 | NO | NO | NO | NO | YES |
| ID12102 | 40 | MALE | TOWN | 30085.10 | YES | 3 | YES | NO | YES | YES | NO |
| ID12103 | 51 | FEMALE | INNER_CITY | 16575.40 | YES | 0 | YES | YES | YES | NO | NO |
| ID12104 | 23 | FEMALE | TOWN | 20375.40 | YES | 3 | NO | NO | YES | NO | NO |
| ID12105 | 57 | FEMALE | RURAL | 50576.30 | YES | 0 | NO | YES | NO | NO | NO |
| ID12106 | 57 | FEMALE | TOWN | 37869.60 | YES | 2 | NO | YES | YES | NO | YES |
| ID12107 | 22 | MALE | RURAL | 8877.07 | NO | 0 | NO | NO | YES | NO | YES |
| ID12108 | 58 | MALE | TOWN | 24946.60 | YES | 0 | YES | YES | YES | NO | NO |
| ID12109 | 37 | FEMALE | SUBURBAN | 25304.30 | YES | 2 | YES | NO | NO | NO | NO |
| ID12110 | 54 | MALE | TOWN | 24212.10 | YES | 2 | YES | YES | YES | NO | NO |


In [7]:
# Create a LabelEncoder to convert categorical text values into integer codes
le = preprocessing.LabelEncoder()


In [8]:
# Transform data using the "fit_transform(attribute)" function
dataset.car = le.fit_transform(dataset.car)
dataset.sex = le.fit_transform(dataset.sex)
dataset.save_act = le.fit_transform(dataset.save_act)
dataset.married = le.fit_transform(dataset.married)
dataset.current_act = le.fit_transform(dataset.current_act)
dataset.mortgage = le.fit_transform(dataset.mortgage)
dataset.pep = le.fit_transform(dataset.pep)
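To make the effect of the encoding step visible, the sketch below applies `LabelEncoder` to a tiny made-up frame. `LabelEncoder` sorts the distinct values and numbers them from 0, so NO/YES become 0/1 and FEMALE/MALE become 0/1. The frame `toy` and the dictionary `encoders` are hypothetical; keeping one encoder per column is a design choice for this sketch so that each mapping can be inspected or inverted later, which a single re-fitted encoder does not allow.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical miniature of two categorical columns from the bank data
toy = pd.DataFrame({
    "car": ["NO", "YES", "YES", "NO"],
    "sex": ["FEMALE", "MALE", "FEMALE", "MALE"],
})

encoders = {}
for col in ["car", "sex"]:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])  # classes are sorted, then numbered from 0
    encoders[col] = enc                     # keep the fitted encoder for later lookup

# Mapping from original label to integer code for the "car" column
car_mapping = {str(label): code for code, label in enumerate(encoders["car"].classes_)}
print(car_mapping)
print(toy)
```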

In [10]:
# Convert "region" into presence/absence (one-hot) attributes, one column per region
dataset = pd.concat([dataset,pd.get_dummies(dataset['region'], prefix='REG')],axis=1)
dataset.drop(['region'],axis=1, inplace=True)
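The one-hot step can be tried on a small made-up frame, as sketched below. The frame `toy` is hypothetical. One assumption worth flagging: recent pandas versions return boolean dummy columns by default, so `dtype=int` is passed here to obtain 0/1 values like the ones shown in this notebook's output.

```python
import pandas as pd

# Hypothetical miniature of the "region" column
toy = pd.DataFrame({"region": ["TOWN", "RURAL", "TOWN", "INNER_CITY"]})

# One indicator column per distinct region, prefixed with "REG_"
dummies = pd.get_dummies(toy["region"], prefix="REG", dtype=int)

# Replace the original text column with the indicator columns
toy = pd.concat([toy.drop(columns="region"), dummies], axis=1)
print(toy)
```

A nominal attribute such as region has no natural order, so indicator columns avoid implying that one region is "closer" to another in the distance computation.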




In [11]:
# display the first 5 rows of the dataframe
dataset.head(5)

| id | age | sex | income | married | children | car | save_act | current_act | mortgage | pep | REG_INNER_CITY | REG_RURAL | REG_SUBURBAN | REG_TOWN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ID12101 | 48 | 0 | 17546.0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| ID12102 | 40 | 1 | 30085.1 | 1 | 3 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 |
| ID12103 | 51 | 0 | 16575.4 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| ID12104 | 23 | 0 | 20375.4 | 1 | 3 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| ID12105 | 57 | 0 | 50576.3 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |


### Train and Test Split

In [12]:
dataset.columns

Index(['age', 'sex', 'income', 'married', 'children', 'car', 'save_act',
       'current_act', 'mortgage', 'pep', 'REG_INNER_CITY', 'REG_RURAL',
       'REG_SUBURBAN', 'REG_TOWN'],
      dtype='object')

In [13]:
# Select the independent variables and the target attribute
X = dataset[['age', 'sex', 'income', 'married', 'children', 'car', 'save_act',
       'current_act', 'mortgage', 'REG_INNER_CITY', 'REG_RURAL',
       'REG_SUBURBAN', 'REG_TOWN']]
Y = dataset['pep']

In [29]:
X.head()

| id | age | sex | income | married | children | car | save_act | current_act | mortgage | REG_INNER_CITY | REG_RURAL | REG_SUBURBAN | REG_TOWN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ID12101 | 48 | 0 | 17546.0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| ID12102 | 40 | 1 | 30085.1 | 1 | 3 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 1 |
| ID12103 | 51 | 0 | 16575.4 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 |
| ID12104 | 23 | 0 | 20375.4 | 1 | 3 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| ID12105 | 57 | 0 | 50576.3 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |


#### Obtain X_train, X_test, Y_train, Y_test by splitting the dataset in a 70-30 ratio with a random state value of 30.

Note: random_state seeds the shuffling applied before the split, so the same value reproduces the same partition.

This method is called the hold-out method.

In [15]:
# Divide the dataset into training and testing partition
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state = 30)
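The role of `random_state` can be checked on a small synthetic array, as in the sketch below. The arrays `X_toy` and `y_toy` are hypothetical. The `stratify` argument shown at the end is an optional extra that this notebook does not use; it keeps the class proportions equal in both partitions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 rows, 2 features, perfectly balanced binary target
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0, 1] * 10)

# Two calls with the same random_state produce identical partitions
split_a = train_test_split(X_toy, y_toy, test_size=0.30, random_state=30)
split_b = train_test_split(X_toy, y_toy, test_size=0.30, random_state=30)
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print("identical partitions:", same_split)

# Optional: stratify on the target so both classes appear in equal proportion
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=30, stratify=y_toy
)
print(len(X_tr), len(X_te), int(y_te.sum()))
```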

### Import the k-NN Classifier library

In [16]:
# import KNeighborsClassifier library
from sklearn.neighbors import KNeighborsClassifier


### Train 3-NN by using euclidean distance as distance measure

In [28]:
# Apply the classifier
knn=KNeighborsClassifier(n_neighbors=3, metric='euclidean')
knn.fit(X_train, Y_train)

KNeighborsClassifier(metric='euclidean', n_neighbors=3)

### Prediction and Evaluation

In [18]:
# import required libraries
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.metrics import accuracy_score

In [19]:
# predictions for testing partition
predictions = knn.predict(X_test)

In [20]:
# Calculate and print confusion matrix and other performance measures (Refer previous labsheet)
print(classification_report(Y_test,predictions))
print("Confusion Matrix")
print(confusion_matrix(Y_test,predictions))
print("\n Accuracy")
print(accuracy_score(Y_test,predictions))

              precision    recall  f1-score   support

           0       0.55      0.60      0.57        80
           1       0.44      0.39      0.41        64

    accuracy                           0.51       144
   macro avg       0.50      0.50      0.49       144
weighted avg       0.50      0.51      0.50       144

Confusion Matrix
[[48 32]
 [39 25]]

 Accuracy
0.5069444444444444
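The figures in the report can be recomputed by hand from the confusion matrix printed above, which is a useful sanity check. The sketch below takes only that matrix as input; rows are true classes and columns are predicted classes, following scikit-learn's convention.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true class, columns: predicted class)
cm = np.array([[48, 32],
               [39, 25]])

# Accuracy: correctly classified cases (the diagonal) over all cases
accuracy = np.trace(cm) / cm.sum()

# Precision per class: diagonal over column totals (everything predicted as that class)
precision = np.diag(cm) / cm.sum(axis=0)

# Recall per class: diagonal over row totals (everything truly in that class)
recall = np.diag(cm) / cm.sum(axis=1)

print("accuracy :", accuracy)
print("precision:", precision.round(2))
print("recall   :", recall.round(2))
```

An accuracy of about 0.51 on a roughly balanced two-class problem is close to what random guessing would achieve, which motivates the normalization experiments later in the notebook.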


### Use the k-fold method (k = 5) for the train-test split and check the difference in performance of 3-NN

In [25]:
# Import required library for K-fold and test the performance of model using Euclidean distance
from sklearn.model_selection import cross_val_score
knn_cv = KNeighborsClassifier(n_neighbors=3, metric = 'euclidean')
scores = cross_val_score(knn_cv, X, Y, cv=5, scoring='accuracy')
print('scores: ', scores)
print('mean score: ', scores.mean())

scores:  [0.52083333 0.55208333 0.5625     0.58333333 0.55789474]
mean score:  0.5553289473684211


In [26]:
# Mean 5-fold accuracy (0.555) is better than the hold-out accuracy (0.507)
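What `cross_val_score` does with `cv=5` can be spelled out as an explicit loop. For a classifier and an integer `cv`, scikit-learn uses stratified folds without shuffling, so the manual loop below should reproduce the library scores. The data comes from `make_classification` and is purely synthetic; it stands in for the bank data, which is not reloaded here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption: any numeric feature matrix works for the illustration)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X_demo, y_demo):
    # Train on four folds, evaluate on the held-out fold
    model = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
    model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(model.score(X_demo[test_idx], y_demo[test_idx]))

library_scores = cross_val_score(
    KNeighborsClassifier(n_neighbors=3, metric="euclidean"),
    X_demo, y_demo, cv=5, scoring="accuracy",
)
print(np.round(manual_scores, 3))
print(np.round(library_scores, 3))
```

Because every row is used for testing exactly once, the mean over folds is usually a steadier estimate than a single hold-out split.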

#### Q1: Normalize (min-max normalization) age,income and children columns of the dataset and apply 3-NN using both euclidean and manhattan distance

In [33]:
#min-max normalization
for col in ['age','income','children']:
    dataset[col] = (dataset[col]-dataset[col].min())/(dataset[col].max()-dataset[col].min())
dataset.head()

| id | age | sex | income | married | children | car | save_act | current_act | mortgage | pep | REG_INNER_CITY | REG_RURAL | REG_SUBURBAN | REG_TOWN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ID12101 | 0.612245 | 0 | 0.215634 | 0 | 0.333333 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| ID12102 | 0.44898 | 1 | 0.431395 | 1 | 1.0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 |
| ID12103 | 0.673469 | 0 | 0.198933 | 1 | 0.0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| ID12104 | 0.102041 | 0 | 0.26432 | 1 | 1.0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| ID12105 | 0.795918 | 0 | 0.783987 | 1 | 0.0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
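The normalization above uses the minimum and maximum of the whole dataset, so each cross-validation test fold has already influenced the scaling. A common alternative, sketched below as an option rather than a correction of this lab's method, is to put `MinMaxScaler` inside a pipeline so that it is re-fitted on the training folds only. The data is synthetic, and the inflated first column is an assumption meant to mimic a large-valued attribute such as income.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data with one feature on a much larger scale than the rest
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
X_demo[:, 0] *= 10_000

# The scaler is fitted inside each training fold, then applied to the matching test fold
pipeline = make_pipeline(
    MinMaxScaler(),
    KNeighborsClassifier(n_neighbors=3, metric="manhattan"),
)
pipeline_scores = cross_val_score(pipeline, X_demo, y_demo, cv=5, scoring="accuracy")
print("scores: ", pipeline_scores)
print("mean score: ", pipeline_scores.mean())
```

Scaling matters for k-NN because an unscaled attribute with a large range dominates the distance and drowns out the 0/1 attributes.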


In [34]:
# Select the independent variables and the target attribute
X = dataset[['age', 'sex', 'income', 'married', 'children', 'car', 'save_act',
       'current_act', 'mortgage', 'REG_INNER_CITY', 'REG_RURAL',
       'REG_SUBURBAN', 'REG_TOWN']]
Y = dataset['pep']

In [36]:
# KNN using euclidean distance
knn_cv_euc = KNeighborsClassifier(n_neighbors=3, metric = 'euclidean')
scores = cross_val_score(knn_cv_euc, X, Y, cv=5, scoring='accuracy')
print('scores: ', scores)
print('mean score: ', scores.mean())

scores:  [0.60416667 0.625      0.59375    0.59375    0.56842105]
mean score:  0.5970175438596491


In [38]:
# KNN using manhattan distance
knn_cv_man = KNeighborsClassifier(n_neighbors=3, metric = 'manhattan')
scores = cross_val_score(knn_cv_man, X, Y, cv=5, scoring='accuracy')
print('scores: ', scores)
print('mean score: ', scores.mean())

scores:  [0.66666667 0.70833333 0.66666667 0.65625    0.61052632]
mean score:  0.661688596491228


#### Q2: Write your observation regarding the change in the performance of KNN

#### Q3: Find the accuracy of 1-NN model

In [39]:
# Use euclidean distance
knn_cv = KNeighborsClassifier(n_neighbors=1, metric = 'euclidean')
scores = cross_val_score(knn_cv, X, Y, cv=5, scoring='accuracy')
print('scores: ', scores)
print('mean score: ', scores.mean())

scores:  [0.625      0.70833333 0.59375    0.63541667 0.63157895]
mean score:  0.6388157894736842


#### Q4: Implement the weighted k-NN model. Use k-fold method for train-test split

In [41]:
# Refer labsheet 4
knn = KNeighborsClassifier(n_neighbors=3, metric = 'euclidean' , weights= 'distance') 
scores = cross_val_score(knn, X, Y, cv=5, scoring='accuracy')
print("Accuracy score", scores.mean())

Accuracy score 0.6262061403508772
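With `weights='distance'`, each neighbour's vote counts in proportion to the inverse of its distance, so a very close neighbour can outvote several distant ones. The sketch below illustrates that idea in plain Python; the function `weighted_vote` is hypothetical, and its handling of zero distances is a simplification rather than scikit-learn's exact behaviour.

```python
from collections import defaultdict


def weighted_vote(neighbour_distances, neighbour_labels):
    """Return the label with the largest sum of inverse-distance weights."""
    totals = defaultdict(float)
    for dist, label in zip(neighbour_distances, neighbour_labels):
        if dist == 0:
            # Simplification: an exact match decides the vote immediately
            return label
        totals[label] += 1.0 / dist
    return max(totals, key=totals.get)


# One very close neighbour of class 1 against two distant neighbours of class 0:
# a plain majority vote would pick class 0, the weighted vote picks class 1
winner = weighted_vote([0.1, 0.9, 1.0], [1, 0, 0])
print(winner)
```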


#### Q5: What is the best k value in this model

In [44]:
# Refer labsheet 4
lis = []
for k in range(1,21):
    knn_cv = KNeighborsClassifier(n_neighbors=k, metric = 'euclidean', weights= 'distance')
    scores = cross_val_score(knn_cv, X, Y, cv=5, scoring='accuracy')
    print("Accuracy score for k : " +str(k) + " ", scores.mean())
    lis.append(scores.mean())

Accuracy score for k : 1  0.6388157894736842
Accuracy score for k : 2  0.6388157894736842
Accuracy score for k : 3  0.6262061403508772
Accuracy score for k : 4  0.6429385964912282
Accuracy score for k : 5  0.626140350877193
Accuracy score for k : 6  0.6387280701754386
Accuracy score for k : 7  0.6511842105263158
Accuracy score for k : 8  0.6491666666666667
Accuracy score for k : 9  0.6429605263157895
Accuracy score for k : 10  0.6325000000000001
Accuracy score for k : 11  0.6428947368421053
Accuracy score for k : 12  0.6387061403508772
Accuracy score for k : 13  0.6303508771929824
Accuracy score for k : 14  0.6262061403508772
Accuracy score for k : 15  0.6346052631578948
Accuracy score for k : 16  0.6262719298245614
Accuracy score for k : 17  0.6325657894736842
Accuracy score for k : 18  0.6346710526315789
Accuracy score for k : 19  0.6430043859649123
Accuracy score for k : 20  0.6576535087719299
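The best k can be read off programmatically instead of by eye. The sketch below copies the mean accuracies printed above into a list (named `mean_scores` here, standing in for the list built in the loop) and picks the position of the maximum.

```python
import numpy as np

# Mean 5-fold accuracies for k = 1..20, copied from the output above
mean_scores = [
    0.6388157894736842, 0.6388157894736842, 0.6262061403508772, 0.6429385964912282,
    0.626140350877193, 0.6387280701754386, 0.6511842105263158, 0.6491666666666667,
    0.6429605263157895, 0.6325000000000001, 0.6428947368421053, 0.6387061403508772,
    0.6303508771929824, 0.6262061403508772, 0.6346052631578948, 0.6262719298245614,
    0.6325657894736842, 0.6346710526315789, 0.6430043859649123, 0.6576535087719299,
]

# The list index starts at 0 while k starts at 1, hence the +1
best_k = int(np.argmax(mean_scores)) + 1
best_score = max(mean_scores)
print("best k:", best_k, "mean accuracy:", round(best_score, 4))
```

Two caveats apply to this reading. The best value sits at the upper edge of the searched range, so a wider range might find a better k, and the differences between neighbouring k values are small enough that they may partly reflect fold-to-fold noise.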



#### Q6: Consider "current_act" as an irrelevant attribute. Remove it and find the accuracy of KNN classifier

In [45]:
# Select the independent variables and the target attribute after dropping specified column
# drop() without inplace returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
X = X.drop('current_act', axis=1)


In [47]:
# Use euclidean distance as distance metric and k(5) -fold method of test train
knn_cv_euc = KNeighborsClassifier(n_neighbors=3, metric = 'euclidean')
scores = cross_val_score(knn_cv_euc, X, Y, cv=5, scoring='accuracy')
print('scores: ', scores)
print('mean score: ', scores.mean())

scores:  [0.57291667 0.625      0.69791667 0.6875     0.58947368]
mean score:  0.6345614035087719


#### Q7: Write your observation

In [48]:
# Mean accuracy of 3-NN (euclidean, 5-fold) improved from 0.597 to 0.635 after removing the "current_act" attribute