# k-Nearest Neighbours

		
k-nearest neighbours (KNN) is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure (e.g., a distance function).

Algorithm: 
A case is classified by a majority vote of its neighbors, with the case being assigned to the class most common amongst its K nearest neighbors measured by a distance function.
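
The majority-vote rule described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the scikit-learn implementation used later in the lab: the function name `knn_predict`, the toy data and the choice of Euclidean distance are all assumptions made for the example.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    # Euclidean distance from x_new to every stored case
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Positions of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Most common class label among those neighbours
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy data (made up for illustration): two small, well-separated clusters
X_toy = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
y_toy = [0, 0, 0, 1, 1, 1]

print(knn_predict(X_toy, y_toy, [1.1, 1.0], k=3))
print(knn_predict(X_toy, y_toy, [5.1, 5.0], k=3))
```

Note that nothing is "trained" here: the stored cases are the model, and all the work happens at prediction time.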

The most popular distance functions are:

<img src="img/KNN_similarity.png">
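
For numeric features, commonly used distance functions include the Euclidean, Manhattan and Minkowski distances. Whether these are exactly the ones shown in the figure is an assumption; the sketch below simply illustrates how such distances are computed. The helper names are hypothetical.

```python
import numpy as np


def euclidean(a, b):
    # Square root of the sum of squared coordinate differences
    return float(np.sqrt(np.sum((a - b) ** 2)))


def manhattan(a, b):
    # Sum of absolute coordinate differences
    return float(np.sum(np.abs(a - b)))


def minkowski(a, b, p):
    # Generalisation: p=1 gives Manhattan, p=2 gives Euclidean
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))


a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

print(euclidean(a, b))
print(manhattan(a, b))
print(minkowski(a, b, 2))
```

In scikit-learn, `KNeighborsClassifier` uses the Minkowski metric with `p=2` by default, which is the Euclidean distance.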

## Lab - Universal Bank Dataset

Predict whether a given customer accepts a personal loan offer, based on the Universal Bank dataset. The dataset contains 5,000 customers and 14 variables. A brief description of the 14 variables is given below:

ID: Customer ID 

Age: Customer's age in completed years

Experience: Number of years of professional experience

Income: Annual income of the customer (in units of 1,000)

ZIPcode: Home address ZIP code 

Family: Family size of the customer 

CCAvg: Average monthly credit card spending (in units of 1,000)

Education: Education level. 1: Undergraduate; 2: Graduate; 3: Advanced/Professional

Mortgage: Value of house mortgage, if any (in units of 1,000)

Securities Acct: Does the customer have a securities account with the bank? 

CD Account: Does the customer have a certificate of deposit (CD) account with the bank?

Online: Does the customer use internet bank facilities? 

CreditCard: Does the customer use a credit card issued by the Bank?

Personal loan: Did this customer accept the personal loan offered in the last campaign? 1: yes; 0: no (target variable)

In [None]:
import os
import numpy as np
import pandas as pd

from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import KNeighborsRegressor

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV, cross_val_score

In [None]:
bank=pd.read_csv("UnivBank.csv",na_values=["?",","])
print(bank.shape)
print(type(bank))

In [None]:
#Get the top 6 rows of the data
bank.head(6)


In [None]:
#Get the column names and types of the columns
bank.dtypes



### Typecast required variables into categorical

In [None]:
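# One-hot encode the categorical columns. Note that get_dummies only expands
# object/category columns, so numeric-coded categoricals must be cast first.
bank = pd.get_dummies(bank)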
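
Because `pd.get_dummies` leaves numeric columns untouched, the typecast step matters for variables that are stored as numbers but are really categories. The sketch below shows the effect on a small made-up frame; the column name `Education` is taken from the variable list above, but which columns to cast in the real dataset is left to the reader.

```python
import pandas as pd

# Toy frame (made up); "Education" stands in for a numeric-coded categorical column
toy = pd.DataFrame({"Income": [49, 34, 112], "Education": [1, 2, 3]})

# Without a cast, get_dummies leaves the numeric columns as they are
unchanged = pd.get_dummies(toy)

# After casting to category, the column is expanded into indicator columns
toy["Education"] = toy["Education"].astype("category")
encoded = pd.get_dummies(toy)

print(list(unchanged.columns))
print(list(encoded.columns))
```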

In [None]:
bank.head(5)

#### Check for missing values

#### Fill NA values with mean
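
One possible way to carry out the two steps above is with `isnull().sum()` and `fillna()`. The frame below is made up for illustration; on the lab data the same calls would be applied to `bank`.

```python
import numpy as np
import pandas as pd

# Toy frame (made up) with one missing value per column
toy = pd.DataFrame({"Age": [25.0, np.nan, 45.0], "Income": [49.0, 34.0, np.nan]})

# Count the missing values in each column
missing_counts = toy.isnull().sum()
print(missing_counts)

# Replace each missing value with the mean of its own column
filled = toy.fillna(toy.mean())
print(filled)
```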

#### Separate out Target Variable from the dataset and do train-test split

In [None]:
# Separate the target from the features, then split into train (80%) and test (20%)
y = bank["Personal Loan"]
X = bank.drop("Personal Loan", axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

In [None]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

In [None]:
scaler = StandardScaler()
scaler.fit(X_train)

In [None]:
#Transform X_train and X_test with the fitted scaler and store the results in the same variables
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)



### Build KNN Classifier

In [None]:
model= KNeighborsClassifier(n_neighbors=5)
model.fit(X_train,y_train)

In [None]:
# Predict the class of each test case
y_pred = model.predict(X_test)

In [None]:
from sklearn.metrics import accuracy_score

#Get the accuracy of the model on test data
print(accuracy_score(y_test, model.predict(X_test)))

### GridSearch Cross validation

Hyperparameters are best thought of as the settings of an algorithm, which can be adjusted to optimize performance.

Model parameters, such as the slope and intercept in a linear regression, are learned during training, whereas hyperparameters must be set by the data scientist before training. In KNN, the number of neighbours K is a hyperparameter.

<img src='img/HyperParameterVsParameter.png'>
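
The distinction can be seen directly in scikit-learn. In the sketch below (toy data, made up for illustration), the slope and intercept of a linear regression only exist after `fit`, whereas `n_neighbors` is fixed when the KNN estimator is constructed.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier

# Toy data lying exactly on the line y = 2x + 1
X_line = np.array([[0.0], [1.0], [2.0], [3.0]])
y_line = 2.0 * X_line.ravel() + 1.0

# Parameters: learned from the data during fit
lin = LinearRegression().fit(X_line, y_line)
print(lin.coef_[0], lin.intercept_)

# Hyperparameter: chosen by us before any data is seen
knn_example = KNeighborsClassifier(n_neighbors=3)
print(knn_example.get_params()["n_neighbors"])
```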

### K-fold Cross Validation

<img src='img/K-fold.png'>
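
As a rough sketch of what K-fold cross validation does, the example below splits a synthetic dataset into 5 folds, so that each fold serves once as the validation set while the other four are used for training. The data and the random seeds are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data (made up): two Gaussian blobs of 50 points each
rng = np.random.RandomState(0)
X_cv = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
                  rng.normal(4.0, 1.0, size=(50, 2))])
y_cv = np.array([0] * 50 + [1] * 50)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)

# Each fold is used once for validation; the rest is used for training
for fold, (train_idx, val_idx) in enumerate(kfold.split(X_cv)):
    print(fold, len(train_idx), len(val_idx))

# One accuracy score per fold; the mean is the cross-validated estimate
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X_cv, y_cv, cv=kfold)
print(scores.mean())
```

`GridSearchCV` with `cv=5` repeats this kind of evaluation for every candidate hyperparameter value and keeps the one with the best mean score.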

### Model building employing GridSearchCV and Cross Validation

In [None]:
parameters = {'n_neighbors':list(range(3,15))}
clf = GridSearchCV(KNeighborsClassifier(), parameters, n_jobs=4,verbose=2, cv=5)
clf.fit(X=X_train, y=y_train)

In [None]:
knn_model = clf.best_estimator_
print(clf.best_score_, clf.best_params_)

In [None]:
y_pred_test=knn_model.predict(X_test)
print(accuracy_score(y_test,y_pred_test))

## Regression

Explore the KNeighborsRegressor class:
http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html

In [None]:
# Randomly generate some data (T is the target; A, B and C are the features)
data  = pd.DataFrame(np.random.randint(2,100,size=(1000, 4)),
                     columns=list('TABC'))

In [None]:
data.head(5)

In [None]:
train, test = train_test_split(data, test_size=0.2)
print(train.shape, test.shape)

In [None]:
Y_train = train["T"]

In [None]:
y_test = test["T"]

In [None]:
scaler = MinMaxScaler()

# Fit the scaler on the feature columns only: all rows, every column except the first one (the target T)
scaler.fit(train.iloc[:,1:])

In [None]:
stdtrain = pd.DataFrame(scaler.transform(train.iloc[:,1:]), columns=list("ABC"))
stdtest = pd.DataFrame(scaler.transform(test.iloc[:,1:]), columns=list("ABC"))

In [None]:
print(stdtrain.head(5))
print(stdtest.head(5))

#### Check the shapes of the train and test features and targets

In [None]:
print(stdtrain.shape)
print(Y_train.shape)

In [None]:
print(stdtest.shape)
print(y_test.shape)

In [None]:
knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(stdtrain, Y_train)

In [None]:
predictions = knn.predict(stdtest)

In [None]:
def mse(predictions, y):
    # Mean squared error: the average of the squared differences between predictions and true values
    return ((predictions - y) ** 2).mean()

In [None]:
mse(predictions,y_test)
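
As a sanity check, a hand-written mean squared error can be compared with `mean_squared_error` from `sklearn.metrics`. The arrays below are made up for illustration; on the lab data the same comparison would use `y_test` and `predictions`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Toy true values and predictions (made up)
y_true_toy = np.array([3.0, 5.0, 2.0, 7.0])
y_pred_toy = np.array([2.5, 5.0, 4.0, 8.0])

# Same formula as the hand-written function: sum of squared errors over the number of cases
manual = ((y_pred_toy - y_true_toy) ** 2).sum() / len(y_pred_toy)
library = mean_squared_error(y_true_toy, y_pred_toy)

print(manual, library)
```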