# K-Nearest Neighbours

		
K-nearest neighbors (KNN) is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure (e.g., a distance function).

Algorithm: 
A case is classified by a majority vote of its neighbors, with the case being assigned to the class most common amongst its K nearest neighbors measured by a distance function.
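
The majority-vote rule described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the scikit-learn implementation: the function name `knn_predict` and the toy points are made up for this example, and Euclidean distance is assumed as the distance function.

```python
from collections import Counter
import math


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every stored case
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest cases
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the labels of those neighbours
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data, invented for illustration only
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
labels = ["A", "A", "A", "B", "B", "B"]

prediction = knn_predict(points, labels, query=(2.0, 2.0), k=3)
print(prediction)  # A
```

Note that nothing is "trained" here: the stored cases are the model, and all the work happens at prediction time.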

The most popular distance functions are:

<img src="img/KNN_similarity.png">
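
As a rough companion to the figure, here is a sketch of three commonly used distance functions in NumPy. It is an assumption that these (Euclidean, Manhattan and Minkowski) are the ones shown in the image; the helper names are chosen for this example only.

```python
import numpy as np


def euclidean(a, b):
    # Square root of the sum of squared coordinate differences
    return float(np.sqrt(np.sum((a - b) ** 2)))


def manhattan(a, b):
    # Sum of absolute coordinate differences
    return float(np.sum(np.abs(a - b)))


def minkowski(a, b, p):
    # Generalisation: p=1 gives Manhattan, p=2 gives Euclidean
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))


a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

print(euclidean(a, b))     # 5.0
print(manhattan(a, b))     # 7.0
print(minkowski(a, b, 3))
```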



### Reference: 

Regressor:

https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html#sklearn.neighbors.KNeighborsRegressor

Classifier:

https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier

---

## Activity 1: Classification

## Dataset - Universal Bank Dataset

Background:

A relatively young bank is growing rapidly in terms of overall customer acquisition. The majority of these customers are liability customers with varying sizes of relationship with the bank. The customer base of asset customers is quite small, and the bank wants to grow this base rapidly to bring in more loan business. Specifically, it wants to explore ways of converting its liability customers to personal loan customers.

A campaign the bank ran for liability customers last year showed a healthy conversion rate of over 9%. This has encouraged the Retail Marketing department to devise smarter campaigns with better target marketing.

* Analytics Objective:							

Predict whether a given customer accepts the personal loan offer, based on the Universal Bank dataset. The dataset contains 5,000 customers and 14 variables. A brief description of the 14 variables is given below:

ID: Customer ID 

Age: Customer's age in completed years

Experience: # of years of professional experience 

Income: Annual income of the customer ($000) 

ZIPCode: Home address ZIP code

Family: Family size of the customer 

CCAvg: Average monthly credit card spending ($000)

Education: Education level: 1: Undergrad; 2: Graduate; 3: Advanced/Professional 

Mortgage: Value of house mortgage, if any ($000)

SecuritiesAccount: Does the customer have a securities account with the bank?

CDAccount: Does the customer have a certificate of deposit (CD) account with the bank?

Online: Does the customer use internet banking facilities? 

CreditCard: Does the customer use a credit card issued by the bank?

PersonalLoan: Did this customer accept the personal loan offered in the last campaign? 1 - yes; 0 - no (target variable)

#### Import all the required packages and classes

In [1]:
import os
import numpy as np
import pandas as pd
import math

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import mean_squared_error, mean_absolute_error

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import confusion_matrix
from sklearn.impute import SimpleImputer
#from sklearn.preprocessing import Imputer

import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV, cross_val_score

In [2]:
import warnings
warnings.filterwarnings('ignore')

#### Read the UnivBank.csv file into a pandas dataframe

In [3]:
bank=pd.read_csv("UnivBank.csv",na_values=["?","#"])

#### Display the first 5 records

In [4]:
bank.head()

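|   | ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | PersonalLoan | SecuritiesAccount | CDAccount | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1 | 49 | 91107 | 4 | 1.6 | 1 | 0.0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 2 | 45 | 19 | 34 | 90089 | 3 | 1.5 | 1 | 0.0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 3 | 39 | 15 | 11 | 94720 | 1 | 1.0 | 1 | 0.0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 35 | 9 | 100 | 94112 | 1 | 2.7 | 2 | 0.0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 35 | 8 | 45 | 91330 | 4 | 1.0 | 2 | 0.0 | 0 | 0 | 0 | 0 | 1 |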


#### Display the dimensions, column names and column datatypes

In [5]:
print(bank.shape)

(5000, 14)


In [6]:
print(bank.columns)

Index(['ID', 'Age', 'Experience', 'Income', 'ZIPCode', 'Family', 'CCAvg',
       'Education', 'Mortgage', 'PersonalLoan', 'SecuritiesAccount',
       'CDAccount', 'Online', 'CreditCard'],
      dtype='object')


In [7]:
print(bank.dtypes)

ID                     int64
Age                    int64
Experience             int64
Income                 int64
ZIPCode                int64
Family                 int64
CCAvg                float64
Education              int64
Mortgage             float64
PersonalLoan           int64
SecuritiesAccount      int64
CDAccount              int64
Online                 int64
CreditCard             int64
dtype: object


#### Check the summary (descriptive statistics) for all attributes

In [8]:
bank.describe(include='all')

|   | ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | PersonalLoan | SecuritiesAccount | CDAccount | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 4997.0 | 5000.0 | 4997.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 |
| mean | 2500.5 | 45.3384 | 20.1046 | 73.7742 | 93152.503 | 2.3964 | 1.93644 | 1.881 | 56.53272 | 0.096 | 0.1044 | 0.0604 | 0.5968 | 0.294 |
| std | 1443.520003 | 11.463166 | 11.467954 | 46.033729 | 2121.852197 | 1.147663 | 1.746609 | 0.839869 | 101.73491 | 0.294621 | 0.305809 | 0.23825 | 0.490589 | 0.455637 |
| min | 1.0 | 23.0 | -3.0 | 8.0 | 9307.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 1250.75 | 35.0 | 10.0 | 39.0 | 91911.0 | 1.0 | 0.7 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 2500.5 | 45.0 | 20.0 | 64.0 | 93437.0 | 2.0 | 1.5 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 75% | 3750.25 | 55.0 | 30.0 | 98.0 | 94608.0 | 3.0 | 2.5 | 3.0 | 101.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| max | 5000.0 | 67.0 | 43.0 | 224.0 | 96651.0 | 4.0 | 10.0 | 3.0 | 635.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


#### Check the unique levels in the target attribute PersonalLoan and also check for the percentage distribution

In [9]:
print(bank["PersonalLoan"].value_counts())

0    4520
1     480
Name: PersonalLoan, dtype: int64


In [10]:
bank['PersonalLoan'].value_counts(normalize=True) * 100

0    90.4
1     9.6
Name: PersonalLoan, dtype: float64

#### Check the number of unique ZIP Codes present in the dataset 

In [11]:
print("The number of Unique ZIP Codes in the bank data set is",bank['ZIPCode'].nunique())
print("\n")
print(bank['ZIPCode'].value_counts())

The number of Unique ZIP Codes in the bank data set is 467


94720    169
94305    127
95616    116
90095     71
93106     57
        ... 
96145      1
94970      1
94598      1
90068      1
94087      1
Name: ZIPCode, Length: 467, dtype: int64


#### Remove the unnecessary columns (ID and ZIPCode)

In [12]:
bank=bank.drop(["ID","ZIPCode"],axis=1)

In [13]:
bank.head()

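|   | Age | Experience | Income | Family | CCAvg | Education | Mortgage | PersonalLoan | SecuritiesAccount | CDAccount | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | 1 | 49 | 4 | 1.6 | 1 | 0.0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 45 | 19 | 34 | 3 | 1.5 | 1 | 0.0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 39 | 15 | 11 | 1 | 1.0 | 1 | 0.0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 35 | 9 | 100 | 1 | 2.7 | 2 | 0.0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 35 | 8 | 45 | 4 | 1.0 | 2 | 0.0 | 0 | 0 | 0 | 0 | 1 |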


#### Check the count of Education values in each level

In [14]:
print("The number of values in different Education levels:\n")
print(bank['Education'].value_counts())

The number of values in different Education levels:

1    2096
3    1501
2    1403
Name: Education, dtype: int64


#### Check the count of Family values in each level

In [15]:
print("The number of values in different Family levels:\n")
print(bank['Family'].value_counts())

The number of values in different Family levels:

1    1472
2    1296
4    1222
3    1010
Name: Family, dtype: int64


#### Convert the attributes to the right data type based on the dataset description

In [16]:
cat_attr=['Education', 'Family', 'CDAccount', 'Online','CreditCard','SecuritiesAccount']
for cols in cat_attr :
    bank[cols]=bank[cols].astype('category')

In [17]:
bank.dtypes

Age                     int64
Experience              int64
Income                  int64
Family               category
CCAvg                 float64
Education            category
Mortgage              float64
PersonalLoan            int64
SecuritiesAccount    category
CDAccount            category
Online               category
CreditCard           category
dtype: object

#### Creating dummy variables

If a categorical attribute has k levels, k-1 dummy variables are enough, because the remaining level is implied when all the others are 0.
So we set the parameter drop_first=True in pd.get_dummies, which drops the first level of each categorical attribute.
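
A small sketch of the k-1 encoding on a toy column may help. The dataframe `toy` is invented for this illustration; only `pd.get_dummies` and its `drop_first` parameter come from the notebook.

```python
import pandas as pd

# Toy categorical column with three levels (1, 2, 3), invented for illustration
toy = pd.DataFrame({"Education": pd.Categorical([1, 2, 3, 2, 1])})

# drop_first=True drops the first level (1), leaving k-1 = 2 dummy columns
dummies = pd.get_dummies(toy, columns=["Education"], drop_first=True)

print(dummies.columns.tolist())  # ['Education_2', 'Education_3']
print(dummies)
```

Rows where both `Education_2` and `Education_3` are 0 correspond to the dropped level 1, so no information is lost.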


In [18]:
bank = pd.get_dummies(columns=cat_attr,data=bank,drop_first=True)

In [19]:
bank.head()

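|   | Age | Experience | Income | CCAvg | Mortgage | PersonalLoan | Education_2 | Education_3 | Family_2 | Family_3 | Family_4 | CDAccount_1 | Online_1 | CreditCard_1 | SecuritiesAccount_1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | 1 | 49 | 1.6 | 0.0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 1 | 45 | 19 | 34 | 1.5 | 0.0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 2 | 39 | 15 | 11 | 1.0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 35 | 9 | 100 | 2.7 | 0.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 35 | 8 | 45 | 1.0 | 0.0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |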


#### Check for missing values

In [20]:
bank.isnull().sum()

Age                    0
Experience             0
Income                 0
CCAvg                  3
Mortgage               3
PersonalLoan           0
Education_2            0
Education_3            0
Family_2               0
Family_3               0
Family_4               0
CDAccount_1            0
Online_1               0
CreditCard_1           0
SecuritiesAccount_1    0
dtype: int64

#### Split the data into train and test

In [21]:
y=bank["PersonalLoan"]
X=bank.drop('PersonalLoan', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20,stratify=y,random_state=123)  

In [22]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(4000, 14)
(1000, 14)
(4000,)
(1000,)


In [23]:
print(y_train.value_counts())
print(y_test.value_counts())

0    3616
1     384
Name: PersonalLoan, dtype: int64
0    904
1     96
Name: PersonalLoan, dtype: int64


In [24]:
y_train.value_counts(normalize=True) * 100

0    90.4
1     9.6
Name: PersonalLoan, dtype: float64

In [25]:
y_test.value_counts(normalize=True) * 100

0    90.4
1     9.6
Name: PersonalLoan, dtype: float64

#### Split the attributes into numerical and categorical types

In [26]:
num_attr=X_train.select_dtypes(['int64','float64']).columns
num_attr

Index(['Age', 'Experience', 'Income', 'CCAvg', 'Mortgage'], dtype='object')

In [27]:
cat_attr = X_train.select_dtypes('category').columns
cat_attr

Index([], dtype='object')

#### Checking for missing values in train and test dataset

In [28]:
X_train.isnull().sum()

Age                    0
Experience             0
Income                 0
CCAvg                  3
Mortgage               2
Education_2            0
Education_3            0
Family_2               0
Family_3               0
Family_4               0
CDAccount_1            0
Online_1               0
CreditCard_1           0
SecuritiesAccount_1    0
dtype: int64

In [29]:
X_test.isnull().sum()

Age                    0
Experience             0
Income                 0
CCAvg                  0
Mortgage               1
Education_2            0
Education_3            0
Family_2               0
Family_3               0
Family_4               0
CDAccount_1            0
Online_1               0
CreditCard_1           0
SecuritiesAccount_1    0
dtype: int64

#### Imputing missing values with median

In [30]:
imputer = SimpleImputer(strategy='median')
imputer = imputer.fit(X_train[num_attr])

X_train[num_attr] = imputer.transform(X_train[num_attr])
X_test[num_attr] = imputer.transform(X_test[num_attr])

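#### Imputing missing values with KNN-Imputer

This is an alternative to median imputation. If the median imputer above has already been applied, no missing values remain and this step only rebuilds the dataframes.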

In [32]:
from sklearn.impute import KNNImputer
knn_impute_obj = KNNImputer(n_neighbors=2)

# Fitting on the Train Data
knn_impute_obj.fit(X_train)

############################################################################

# Imputing on the Train Data
X_train = knn_impute_obj.transform(X_train)

# Making the output as a dataframe
X_train = pd.DataFrame(X_train, columns=X.columns)

############################################################################

# Imputing on the Test Data
X_test = knn_impute_obj.transform(X_test)

# Making the output as a dataframe
X_test = pd.DataFrame(X_test, columns=X.columns)
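
To see what `KNNImputer` does, here is a small sketch on a toy array invented for this illustration. With `n_neighbors=2` and the default settings, a missing value is filled with the mean of that feature over the two nearest rows, where distance is computed on the features that are present.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy data: the second row is missing its second feature
toy = np.array([
    [1.0, 2.0],
    [2.0, np.nan],
    [3.0, 6.0],
    [10.0, 20.0],
])

imputer = KNNImputer(n_neighbors=2)
filled = imputer.fit_transform(toy)

# The two nearest rows (by the first feature) are [1, 2] and [3, 6],
# so the missing value becomes the mean of 2 and 6
print(filled[1])  # [2. 4.]
```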

In [33]:
X_train.isnull().sum()

Age                    0
Experience             0
Income                 0
CCAvg                  0
Mortgage               0
Education_2            0
Education_3            0
Family_2               0
Family_3               0
Family_4               0
CDAccount_1            0
Online_1               0
CreditCard_1           0
SecuritiesAccount_1    0
dtype: int64

In [34]:
X_test.isnull().sum()

Age                    0
Experience             0
Income                 0
CCAvg                  0
Mortgage               0
Education_2            0
Education_3            0
Family_2               0
Family_3               0
Family_4               0
CDAccount_1            0
Online_1               0
CreditCard_1           0
SecuritiesAccount_1    0
dtype: int64

#### Standardize the data (numerical attributes only) - Import StandardScaler


In [None]:
scaler = StandardScaler()
scaler.fit(X_train[num_attr])

In [None]:
scaler.mean_

In [None]:
scaler.var_

In [None]:
X_train[num_attr]=scaler.transform(X_train[num_attr])
X_test[num_attr]=scaler.transform(X_test[num_attr])
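
KNN relies on distances, so attributes measured on larger scales (such as income) can dominate those on smaller scales (such as age) unless the data is standardized. The sketch below uses synthetic values, invented for illustration, to show that the scaler's statistics are learned from the train data only and then reused on the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical features on very different scales: age (years) and income ($000)
train = np.column_stack([rng.uniform(23, 67, 200), rng.uniform(8, 224, 200)])
test = np.column_stack([rng.uniform(23, 67, 50), rng.uniform(8, 224, 50)])

scaler = StandardScaler().fit(train)    # mean and variance come from train only
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)    # test reuses the train statistics

# After scaling, each train column has mean ~0 and standard deviation ~1
print(train_scaled.mean(axis=0).round(6))
print(train_scaled.std(axis=0).round(6))
```

The test columns will generally not have mean exactly 0 or standard deviation exactly 1, which is expected: they are transformed with the train statistics to avoid leaking test information.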

#### Build KNN Classifier Model

In [None]:
model= KNeighborsClassifier(n_neighbors=5,metric="euclidean")
model.fit(X_train,y_train)

#### Predict on the Test data

In [None]:
y_pred = model.predict(X_test)
y_pred

#### Find the accuracy classification score

In [None]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test,y_pred))

### Finding out the IDEAL K-value for the given dataset

### Method-1

In [None]:
# Creating list of different K values for KNN
myList = list(range(2,12))

# Empty list that will hold cv scores
cv_scores = []

# Perform 5-fold cross validation
for k in myList:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train, y_train, cv=5, scoring='accuracy')
    # print("scores=",scores)
    cv_scores.append(scores.mean())
    # print("cv_scores=",cv_scores)

In [None]:
cv_scores

In [None]:
# Changing to misclassification error
MCE = [1 - x for x in cv_scores]

# Determining best k
optimal_k = myList[MCE.index(min(MCE))]
print("The optimal number of neighbors is %d" % optimal_k)

# plot misclassification error vs k
plt.figure(figsize=(15,5))
plt.plot(myList, MCE)

plt.xticks(np.arange(2, 12, 1))
plt.xlabel('Number of Neighbors K')
plt.ylabel('Misclassification Error')
plt.show()

### Method-2: GridSearch Cross validation

Hyperparameters are best thought of as the settings of an algorithm, which can be adjusted to optimize performance.

Model parameters, such as the slope and intercept in a linear regression, are learned during training, whereas hyperparameters must be set by the data scientist before training.
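
The distinction can be illustrated with a short sketch. The data here is synthetic (a noiseless line with slope 3 and intercept 2, chosen for this example): the linear regression learns its parameters from the data, while the KNN classifier's `n_neighbors` is simply whatever value was passed in.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data on the line y = 3x + 2
x = np.arange(10, dtype=float).reshape(-1, 1)
y = 3.0 * x.ravel() + 2.0

# Parameters: learned during fit()
lin = LinearRegression().fit(x, y)
print(lin.coef_[0], lin.intercept_)

# Hyperparameter: chosen before training, never learned from the data
knn = KNeighborsClassifier(n_neighbors=7)
print(knn.get_params()["n_neighbors"])  # 7
```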

### K-fold Cross Validation:

#### 1. Use the GridSearchCV 

In [None]:
parameters = {'n_neighbors':list(range(2,12))}

clf = GridSearchCV(KNeighborsClassifier(metric="euclidean", n_jobs=-1),
                   parameters,verbose=1, cv=5)

clf.fit(X=X_train, y=y_train)

In [None]:
knn_model = clf.best_estimator_
print (clf.best_score_, clf.best_params_) 

#### 2. Predict on the test data using the best model

In [None]:
y_pred_test=knn_model.predict(X_test)

#### 3. Compute confusion matrix to evaluate the accuracy of the classification 

In [None]:
print(confusion_matrix(y_test, y_pred_test))

#### 4. Accuracy classification score

In [None]:
print(accuracy_score(y_test,y_pred_test))
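
When reading the accuracy score, it helps to compare it against a trivial baseline. The sketch below is not part of the original workflow: it rebuilds only the class counts of the test split shown earlier (904 zeros, 96 ones) and evaluates a hypothetical model that always predicts 0.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Same class counts as the test split above: 904 negatives, 96 positives
y_true = np.array([0] * 904 + [1] * 96)

# Hypothetical baseline that predicts "no loan" for every customer
y_baseline = np.zeros_like(y_true)

cm = confusion_matrix(y_true, y_baseline)
tn, fp, fn, tp = cm.ravel()   # rows are true classes, columns are predicted classes

print(cm)
print(accuracy_score(y_true, y_baseline))  # 0.904
```

Because the classes are imbalanced, this baseline already reaches 90.4% accuracy without identifying a single loan customer, so the confusion matrix is worth inspecting alongside the accuracy.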

## Activity 2: Regression 

##### Error Metrics for Regression


* Mean Absolute Error (MAE):

$$MAE = \dfrac{1}{n}\times\sum_{i = 1}^{n}|y_{i} - \hat{y_{i}}|$$


* Mean Squared Error (MSE):

$$MSE = \dfrac{1}{n}\times\sum_{i = 1}^{n}(y_{i} - \hat{y_{i}})^2$$


* Root Mean Squared Error (RMSE):

$$RMSE = \sqrt{\dfrac{1}{n}\times\sum_{i = 1}^{n}(y_{i} - \hat{y_{i}})^2}$$
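
The three formulas above can be checked by hand on a tiny example. The values below are invented for illustration; the computation uses plain NumPy rather than the scikit-learn helpers.

```python
import numpy as np

# Four made-up observations and predictions
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 8.0])

errors = y_true - y_pred           # [ 1.  0. -2. -1.]

mae = np.mean(np.abs(errors))      # (1 + 0 + 2 + 1) / 4 = 1.0
mse = np.mean(errors ** 2)         # (1 + 0 + 4 + 1) / 4 = 1.5
rmse = np.sqrt(mse)                # square root of the MSE

print(mae, mse, rmse)
```

MSE penalises the single error of size 2 more heavily than MAE does, and RMSE brings the result back to the units of the target.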

#### 1. Import KNeighborsRegressor (from Sklearn)

In [None]:
from sklearn.neighbors import KNeighborsRegressor

#### 2. Randomly generate a dataframe of 1000 rows and 4 columns. Consider the 3 columns as the independent variables and the 4th column as Target

In [None]:
np.random.seed(123)  # fix the seed so the generated data is reproducible
data = pd.DataFrame(np.random.randint(1,50,size=(1000, 4)), columns=list('ABCT'))

#### 3. Displaying the first 5 records

In [None]:
data.head()

#### 4. Split the data into train and test using the train_test_split() function.

In [None]:
train, test = train_test_split(data, test_size=0.2,random_state=123)
print(train.shape, test.shape)

#### 5. Extract the target column from train and test datasets

In [None]:
y_train = train["T"]

In [None]:
y_test = test["T"]

#### 6. Normalize the independent variables using MinMaxScaler() in both train and test

In [None]:
scaler = MinMaxScaler(feature_range=(0, 1))
scaler.fit(train.iloc[:,:3])
X_train = pd.DataFrame(scaler.transform(train.iloc[:,:3]), columns=list("abc"))
X_test = pd.DataFrame(scaler.transform(test.iloc[:,:3]), columns=list("abc"))

#### 7. Displaying the first 5 records from the normalized data

In [None]:
print(X_train.head(5))
print(X_test.head(5))

#### 8. Build the KNN Regression Model

In [None]:
knn = KNeighborsRegressor(algorithm='brute', n_neighbors=5, metric = "euclidean")
knn.fit(X_train, y_train)

In [None]:
train_pred = knn.predict(X_train)
test_pred = knn.predict(X_test)

In [None]:
print("The Mean Absolute Error on train dataset: {} \n".format(mean_absolute_error(y_pred=train_pred,y_true=y_train)))
print("The Mean Absolute Error on test dataset: {} \n".format(mean_absolute_error(y_pred=test_pred,y_true=y_test)))

print("The Mean Squared Error on train dataset: {} \n".format(mean_squared_error(y_pred=train_pred,y_true=y_train)))
print("The Mean Squared Error on test dataset: {} \n".format(mean_squared_error(y_pred=test_pred,y_true=y_test)))

print("The Root Mean Squared Error on train dataset: {} \n".format(math.sqrt(mean_squared_error(y_pred=train_pred,y_true=y_train))))
print("The Root Mean Squared Error on test dataset: {} \n".format(math.sqrt(mean_squared_error(y_pred=test_pred,y_true=y_test))))

In [None]:
# Check for sklearn version (SimpleImputer)
import sklearn
print(sklearn.__version__)

In [None]:
# To upgrade scikit-learn to the latest version, run below command in Anaconda prompt:
# conda update scikit-learn
# pip install -U scikit-learn
# Refer "https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html" for help on SimpleImputer

---
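# Condensed KNN

Condensed nearest neighbour under-samples the training data, keeping a reduced subset of points that still classifies the original training set correctly with a nearest-neighbour rule.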

In [None]:
from imblearn.under_sampling import CondensedNearestNeighbour

In [None]:
c_knn_obj = CondensedNearestNeighbour(sampling_strategy='auto', random_state=123, 
                                      n_neighbors=5, n_seeds_S=1, n_jobs=1)

In [None]:
X_res, y_res = c_knn_obj.fit_resample(X_train, y_train)

In [None]:
from sklearn.neighbors import KNeighborsRegressor
k_neigh = KNeighborsRegressor(n_neighbors=3)
k_neigh.fit(X_res, y_res)
y_pred = (k_neigh.predict(X_test))

In [None]:
y_pred