# k-Nearest Neighbours

		
k-nearest neighbors (kNN) is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure (e.g., a distance function).

Algorithm:
A case is classified by a majority vote of its neighbors: it is assigned to the class most common among its K nearest neighbors, as measured by a distance function.
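The majority vote described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the scikit-learn implementation; the function name `knn_predict` and the toy data are made up for this example, and Euclidean distance is assumed.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every stored training case
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Most common class label among those neighbours
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Toy data (made up for illustration): two well-separated clusters
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([1.1, 1.0]), k=3))  # 0
print(knn_predict(X_toy, y_toy, np.array([4.9, 5.0]), k=3))  # 1
```

Note that there is no training step beyond storing the data: all the work happens at prediction time.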

The most popular distance functions are:

<img src="img/KNN_similarity.png">
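The contents of the figure are not reproduced in text here. As an assumption, the sketch below covers three commonly listed choices: Euclidean, Manhattan, and their generalisation, Minkowski. The helper name `minkowski_distance` is hypothetical; scikit-learn's default `metric='minkowski'` with `p=2` corresponds to the Euclidean case.

```python
import numpy as np


def minkowski_distance(a, b, p=2):
    """Minkowski distance: p=1 gives Manhattan, p=2 gives Euclidean."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))


point_a = [0, 0]
point_b = [3, 4]

euclidean = minkowski_distance(point_a, point_b, p=2)  # sqrt(3^2 + 4^2)
manhattan = minkowski_distance(point_a, point_b, p=1)  # |3| + |4|

print(euclidean)  # 5.0
print(manhattan)  # 7.0
```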

## Lab - Universal Bank Dataset

Predict whether a given customer accepts a personal loan offer, based on the Universal Bank dataset. The dataset contains 5,000 customers and 14 variables. A brief description of the 14 variables is given below:

ID: Customer ID

Age: Customer's age in completed years

Experience: Number of years of professional experience

Income: Annual income of the customer (in thousands of dollars)

ZIP Code: Home address ZIP code

Family: Family size of the customer

CCAvg: Average monthly credit card spending (in thousands of dollars)

Education: Education level (1: Undergraduate; 2: Graduate; 3: Advanced/Professional)

Mortgage: Value of house mortgage, if any (in thousands of dollars)

Securities Account: Does the customer have a securities account with the bank?

CD Account: Does the customer have a certificate of deposit (CD) account with the bank?

Online: Does the customer use internet banking facilities?

CreditCard: Does the customer use a credit card issued by the bank?

Personal Loan: Did this customer accept the personal loan offered in the last campaign? (1: yes; 0: no; target variable)

In [1]:
import os
import numpy as np
import pandas as pd

from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import KNeighborsRegressor

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV, cross_val_score

In [2]:
bank=pd.read_csv("UnivBank.csv",na_values=["?",","])
print(bank.shape)
print(type(bank))

(5000, 14)
<class 'pandas.core.frame.DataFrame'>


In [4]:
print(bank.columns)
print(bank.dtypes)

Index(['ID', 'Age', 'Experience', 'Income', 'ZIP Code', 'Family', 'CCAvg',
       'Education', 'Mortgage', 'Personal Loan', 'Securities Account',
       'CD Account', 'Online', 'CreditCard'],
      dtype='object')
ID                      int64
Age                     int64
Experience              int64
Income                  int64
ZIP Code                int64
Family                  int64
CCAvg                 float64
Education               int64
Mortgage              float64
Personal Loan           int64
Securities Account    float64
CD Account             object
Online                  int64
CreditCard              int64
dtype: object


In [5]:
bank.head(6)

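| | ID | Age | Experience | Income | ZIP Code | Family | CCAvg | Education | Mortgage | Personal Loan | Securities Account | CD Account | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1 | 49 | 91107 | 4 | 1.6 | 1 | 0.0 | 0 | 1.0 | 0 | 0 | 0 |
| 1 | 2 | 45 | 19 | 34 | 90089 | 3 | 1.5 | 1 | 0.0 | 0 | 1.0 | 0 | 0 | 0 |
| 2 | 3 | 39 | 15 | 11 | 94720 | 1 | 1.0 | 1 | 0.0 | 0 | 0.0 | 0 | 0 | 0 |
| 3 | 4 | 35 | 9 | 100 | 94112 | 1 | 2.7 | 2 | 0.0 | 0 | 0.0 | # | 0 | 0 |
| 4 | 5 | 35 | 8 | 45 | 91330 | 4 | 1.0 | 2 | 0.0 | 0 | 0.0 | 0 | 0 | 1 |
| 5 | 6 | 37 | 13 | 29 | 92121 | 4 | 0.4 | 2 | 155.0 | 0 | 0.0 | 0 | 1 | 0 |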


### Typecast required variables into categorical

In [6]:
bank['Education']=bank['Education'].astype('category')

In [7]:
bank=pd.get_dummies(bank)

In [8]:
bank.head(5)

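| | ID | Age | Experience | Income | ZIP Code | Family | CCAvg | Mortgage | Personal Loan | Securities Account | Online | CreditCard | Education_1 | Education_2 | Education_3 | CD Account_# | CD Account_0 | CD Account_1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1 | 49 | 91107 | 4 | 1.6 | 0.0 | 0 | 1.0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 1 | 2 | 45 | 19 | 34 | 90089 | 3 | 1.5 | 0.0 | 0 | 1.0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 2 | 3 | 39 | 15 | 11 | 94720 | 1 | 1.0 | 0.0 | 0 | 0.0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 3 | 4 | 35 | 9 | 100 | 94112 | 1 | 2.7 | 0.0 | 0 | 0.0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 4 | 5 | 35 | 8 | 45 | 91330 | 4 | 1.0 | 0.0 | 0 | 0.0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |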
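The `CD Account_#` column appears because the raw file contains `#` entries in `CD Account`, which are not in the `na_values` list, so pandas reads the column as text and `get_dummies` encodes `#` as its own category. One possible alternative, sketched here on a tiny made-up CSV rather than the real file, is to treat `#` as missing at read time so the column stays numeric. Whether `#` really means "missing" in this dataset is an assumption.

```python
import io

import pandas as pd

# Tiny made-up CSV mimicking the '#' entry seen in the CD Account column
csv_text = "ID,CD Account\n1,0\n2,1\n3,#\n4,0\n"

# As in the lab: '#' is not declared as NA, so the column is read as text
raw = pd.read_csv(io.StringIO(csv_text), na_values=["?", ","])
print(raw["CD Account"].dtype)

# Assumption: '#' marks a missing value, so add it to na_values
clean = pd.read_csv(io.StringIO(csv_text), na_values=["?", ",", "#"])
print(clean["CD Account"].dtype)         # float64
print(clean["CD Account"].isna().sum())  # 1
```

With this change the `#` rows would show up in the missing-value check and could be imputed like any other gap, instead of becoming a separate dummy column.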


#### Check for missing values

In [10]:
pd.isna(bank).sum()

ID                    0
Age                   0
Experience            0
Income                0
ZIP Code              0
Family                0
CCAvg                 0
Mortgage              2
Personal Loan         0
Securities Account    2
Online                0
CreditCard            0
Education_1           0
Education_2           0
Education_3           0
CD Account_#          0
CD Account_0          0
CD Account_1          0
dtype: int64

#### Fill NA values with mean

In [12]:
bank=bank.fillna(bank.mean())
pd.isna(bank).sum()

ID                    0
Age                   0
Experience            0
Income                0
ZIP Code              0
Family                0
CCAvg                 0
Mortgage              0
Personal Loan         0
Securities Account    0
Online                0
CreditCard            0
Education_1           0
Education_2           0
Education_3           0
CD Account_#          0
CD Account_0          0
CD Account_1          0
dtype: int64
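Two caveats apply to the mean fill above: `bank.mean()` is computed on all rows before the train-test split, so test rows influence the fill values, and a binary column such as `Securities Account` receives a fractional value. A hedged alternative, shown on made-up data, is to fit scikit-learn's `SimpleImputer` on the training rows only, using the mean for continuous columns and the most frequent value for binary ones. The frames `train_df` and `test_df` are hypothetical stand-ins.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up stand-ins for the training and test portions of the data
train_df = pd.DataFrame({"Mortgage": [0.0, 100.0, np.nan, 200.0],
                         "Securities Account": [0.0, 1.0, 0.0, np.nan]})
test_df = pd.DataFrame({"Mortgage": [np.nan, 50.0],
                        "Securities Account": [1.0, np.nan]})

mean_imputer = SimpleImputer(strategy="mean")           # continuous column
mode_imputer = SimpleImputer(strategy="most_frequent")  # binary column

# Fit on training rows only, then reuse the learned fill values on test rows
mean_imputer.fit(train_df[["Mortgage"]])
mode_imputer.fit(train_df[["Securities Account"]])

test_mortgage = mean_imputer.transform(test_df[["Mortgage"]])
test_securities = mode_imputer.transform(test_df[["Securities Account"]])

print(test_mortgage.ravel())    # [100.  50.]
print(test_securities.ravel())  # [1. 0.]
```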

#### Separate the target variable from the dataset and do a train-test split

In [9]:
# Separate the target from the features, then divide into train and test sets
y=bank["Personal Loan"]
X=bank.drop('Personal Loan', axis=1)
 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)  

In [10]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(4000, 17)
(1000, 17)
(4000,)
(1000,)
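The split above is random and unseeded, so the reported numbers will vary from run to run. As a sketch on made-up labels, `train_test_split` also accepts `random_state` for reproducibility and `stratify` to keep the class proportions equal in both parts. The 10% positive rate below is illustrative, not taken from the bank data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 100 positives, 900 negatives
y_demo = np.array([1] * 100 + [0] * 900)
X_demo = np.arange(1000).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.20,
    random_state=42,   # fixed seed: the same split on every run
    stratify=y_demo,   # preserve the class ratio in train and test
)

print(X_tr.shape, X_te.shape)    # (800, 1) (200, 1)
print(y_tr.mean(), y_te.mean())  # both close to 0.1
```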


In [11]:
scaler = StandardScaler()
scaler.fit(X_train)


StandardScaler(copy=True, with_mean=True, with_std=True)

In [12]:
X_train=scaler.transform(X_train)
X_test=scaler.transform(X_test)
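Scaling matters for kNN because distances are dominated by whichever feature has the largest numeric range. The sketch below uses made-up rows (income in thousands, family size) to show that the nearest neighbour of the same point can change once the features are standardised. The helper `nearest_other` is hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up rows: [income in thousands, family size]
X_raw = np.array([[40.0, 1.0],
                  [45.0, 4.0],
                  [50.0, 1.0],
                  [120.0, 4.0],
                  [110.0, 1.0]])


def nearest_other(X, i):
    """Index of the row closest to row i (excluding i itself)."""
    d = np.linalg.norm(X - X[i], axis=1)
    d[i] = np.inf
    return int(np.argmin(d))


# Unscaled: a 5-unit income gap outweighs a 3-person family gap
nearest_raw = nearest_other(X_raw, 0)

# Standardised: each feature contributes in units of its own spread
X_scaled = StandardScaler().fit_transform(X_raw)
nearest_scaled = nearest_other(X_scaled, 0)

print(nearest_raw)     # 1
print(nearest_scaled)  # 2
```

As in the cells above, the scaler should be fitted on the training data only and then applied to the test data.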



### Build KNN Classifier

In [13]:
model= KNeighborsClassifier(algorithm = 'brute',n_neighbors=5)
model.fit(X_train,y_train)

KNeighborsClassifier(algorithm='brute', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=5, p=2,
           weights='uniform')

In [14]:
y_pred = model.predict(X_test)

In [15]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test,y_pred))

0.957
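Accuracy alone can be misleading when one class is rare, which is plausible for loan-acceptance data. The sketch below uses made-up labels, not the lab's predictions, to show how a classifier can reach high accuracy while missing most positives. `confusion_matrix` and `recall_score` from `sklearn.metrics` expose this.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Made-up ground truth: 90 negatives followed by 10 positives
y_true = np.array([0] * 90 + [1] * 10)

# Made-up predictions: all negatives right, but 6 of the 10 positives missed
y_hat = y_true.copy()
y_hat[90:96] = 0

acc = accuracy_score(y_true, y_hat)
rec = recall_score(y_true, y_hat)
cm = confusion_matrix(y_true, y_hat)

print(acc)  # 0.94
print(rec)  # 0.4
print(cm)   # rows = true class, columns = predicted class
```

Applied to the lab, the same calls would take `y_test` and `y_pred` as arguments.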


### GridSearch Cross validation

Hyperparameters are best thought of as the settings of an algorithm, which can be adjusted to optimize performance. For kNN, the number of neighbors `n_neighbors` is such a setting.

Model parameters, such as the slope and intercept in a linear regression, are learned during training, whereas hyperparameters must be set by the data scientist before training.

<img src='img/HyperParameterVsParameter.png'>
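The distinction can be seen directly in scikit-learn. In this small sketch on made-up data, the slope and intercept of a `LinearRegression` are learned by `fit`, while `n_neighbors` of a `KNeighborsClassifier` is fixed when the object is constructed.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier

# Made-up data lying exactly on the line y = 2x + 1
x = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2.0 * x.ravel() + 1.0

# Parameters: learned from the data during fit
lin = LinearRegression().fit(x, y)
print(lin.coef_[0], lin.intercept_)  # approximately 2.0 and 1.0

# Hyperparameter: chosen before training, not learned
knn = KNeighborsClassifier(n_neighbors=3)
print(knn.get_params()["n_neighbors"])  # 3
```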

### K-fold Cross Validation

<img src='img/K-fold.png'>
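To make the fold mechanics concrete, the sketch below splits ten made-up samples into five folds with `KFold`. Each sample is used for validation exactly once and for training in the other four rounds. For classifiers, `GridSearchCV` with an integer `cv` uses the stratified variant, `StratifiedKFold`, but the idea is the same.

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(10).reshape(-1, 1)  # ten made-up samples

kf = KFold(n_splits=5, shuffle=False)
folds = list(kf.split(X_demo))

for i, (train_idx, val_idx) in enumerate(folds):
    # Each round trains on 8 samples and validates on the remaining 2
    print(i, train_idx, val_idx)
```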

In [16]:
parameters = {'n_neighbors':list(range(3,15))}
clf = GridSearchCV(KNeighborsClassifier(), parameters, n_jobs=4,verbose=2, cv=5)
clf.fit(X=X_train, y=y_train)

Fitting 5 folds for each of 12 candidates, totalling 60 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  33 tasks      | elapsed:    4.6s
[Parallel(n_jobs=4)]: Done  60 out of  60 | elapsed:    7.3s finished


GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=5, p=2,
           weights='uniform'),
       fit_params=None, iid='warn', n_jobs=4,
       param_grid={'n_neighbors': [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=2)

In [17]:
knn_model = clf.best_estimator_
print (clf.best_score_, clf.best_params_) 

0.95475 {'n_neighbors': 3}


In [18]:
y_pred_test=knn_model.predict(X_test)
print(accuracy_score(y_test,y_pred_test))

0.954
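In the search above, the scaler was fitted on the whole training set before cross-validation, so each validation fold had already influenced the scaling. A common way to avoid this, sketched here on synthetic data from `make_classification` rather than the bank data, is to put the scaler and the classifier in a `Pipeline` and search over that. The step names `scale` and `knn` are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data
X_syn, y_syn = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0, stratify=y_syn)

# The scaler is refitted inside every CV fold, on that fold's training part only
pipe = Pipeline([("scale", StandardScaler()),
                 ("knn", KNeighborsClassifier())])

# Pipeline hyperparameters are addressed as <step name>__<parameter>
grid = GridSearchCV(pipe, {"knn__n_neighbors": list(range(3, 15))}, cv=5)
grid.fit(X_tr, y_tr)

print(grid.best_params_, grid.best_score_)
test_accuracy = grid.score(X_te, y_te)  # accuracy of the refitted best pipeline
print(test_accuracy)
```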


## Regression

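Explore the `KNeighborsRegressor` class, which by default predicts a numeric target as the average of the targets of the K nearest neighbors:
http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html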
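As a quick illustration on made-up one-dimensional data, with the default uniform weights the prediction is simply the mean target of the nearest neighbours:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Made-up 1-D training data
X_1d = np.array([[1.0], [2.0], [3.0], [10.0]])
y_1d = np.array([10.0, 20.0, 30.0, 100.0])

reg = KNeighborsRegressor(n_neighbors=2).fit(X_1d, y_1d)

# The two nearest neighbours of 2.4 are x=2.0 and x=3.0 (targets 20 and 30)
pred = reg.predict(np.array([[2.4]]))[0]
print(pred)  # 25.0
```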

In [19]:
# Randomly generate some data: target T and features A, B, C
data  = pd.DataFrame(np.random.randint(2,100,size=(1000, 4)),
                     columns=list('TABC'))

In [20]:
data.head(5)

| | T | A | B | C |
|---|---|---|---|---|
| 0 | 79 | 47 | 69 | 14 |
| 1 | 99 | 31 | 51 | 83 |
| 2 | 40 | 10 | 5 | 45 |
| 3 | 61 | 65 | 27 | 88 |
| 4 | 13 | 31 | 59 | 6 |


In [21]:
train, test = train_test_split(data, test_size=0.2)
print(train.shape, test.shape)

(800, 4) (200, 4)


In [22]:
Y_train = train["T"]

In [23]:
y_test = test["T"]

In [24]:
scaler = MinMaxScaler(feature_range=(0, 1))

scaler.fit(train.iloc[:,1:])

stdtrain = pd.DataFrame(scaler.transform(train.iloc[:,1:]), columns=list("bcd"))
stdtest = pd.DataFrame(scaler.transform(test.iloc[:,1:]), columns=list("bcd"))


In [25]:
print(stdtrain.head(5))
print(stdtest.head(5))

          b         c         d
0  0.783505  0.701031  0.237113
1  0.412371  0.865979  0.237113
2  0.948454  0.927835  0.567010
3  0.783505  0.608247  0.391753
4  0.639175  0.608247  0.185567
          b         c         d
0  0.886598  0.505155  0.938144
1  0.536082  0.288660  0.886598
2  0.082474  0.154639  0.597938
3  0.814433  0.474227  0.216495
4  0.876289  0.649485  0.793814


In [26]:
print(stdtrain.shape)
print(Y_train.shape)

(800, 3)
(800,)


In [27]:
stdtrain.head(5)

| | b | c | d |
|---|---|---|---|
| 0 | 0.783505 | 0.701031 | 0.237113 |
| 1 | 0.412371 | 0.865979 | 0.237113 |
| 2 | 0.948454 | 0.927835 | 0.567010 |
| 3 | 0.783505 | 0.608247 | 0.391753 |
| 4 | 0.639175 | 0.608247 | 0.185567 |


In [28]:
print(stdtest.shape)
print(y_test.shape)

(200, 3)
(200,)


In [29]:
knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(stdtrain, Y_train)

KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=None, n_neighbors=5, p=2,
          weights='uniform')

In [30]:
predictions = knn.predict(stdtest)

In [31]:
def mse(predictions, y):
    """Mean squared error between predicted and true values."""
    # Average of the squared differences
    return ((predictions - y) ** 2).sum() / len(predictions)

In [32]:
mse(predictions,y_test)

916.837
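An MSE of this size is expected here: the target `T` was generated independently of the features, so there is no signal to learn, and the variance of integers drawn uniformly from 2 to 99 is roughly 800. A useful sanity check, sketched below on freshly generated random data (so the numbers will differ from those above), is to compare the model against a baseline that always predicts the training mean. The sketch also confirms that the hand-written formula agrees with `sklearn.metrics.mean_squared_error`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

# Random features and an unrelated random target, as in the cells above
X_rand = rng.integers(2, 100, size=(1000, 3)).astype(float)
t_rand = rng.integers(2, 100, size=1000).astype(float)
X_tr, X_te = X_rand[:800], X_rand[800:]
t_tr, t_te = t_rand[:800], t_rand[800:]

pred = KNeighborsRegressor(n_neighbors=5).fit(X_tr, t_tr).predict(X_te)

knn_mse = mean_squared_error(t_te, pred)
manual_mse = ((pred - t_te) ** 2).sum() / len(pred)  # same formula as mse()

# Baseline: always predict the mean of the training targets
baseline_mse = mean_squared_error(t_te, np.full_like(t_te, t_tr.mean()))

print(knn_mse, manual_mse, baseline_mse)
```

If a model does not beat this baseline, it has not learned anything useful from the features.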