# K-Nearest Neighbours

## K nearest neighbors

K nearest neighbors (KNN) is a simple algorithm that stores all available cases and classifies a new case by its similarity to them, measured with a distance function (e.g., Euclidean or Manhattan distance).

**Algorithm:** a new case is assigned to the class most common among its K nearest neighbors, i.e., by a majority vote of the K stored cases closest to it under the chosen distance function.
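
The majority-vote rule can be sketched in a few lines of NumPy. This is a minimal illustration, not the scikit-learn implementation used later in this notebook; the function name `knn_predict` and the toy data are assumptions made up for the example, and Euclidean distance is assumed.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    # Euclidean distance from x_new to every stored case
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote over the neighbours' labels
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy data (invented for illustration): two small, well-separated clusters
X_toy = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
y_toy = [0, 0, 0, 1, 1, 1]

pred_near_first = knn_predict(X_toy, y_toy, [1.1, 1.0], k=3)
pred_near_second = knn_predict(X_toy, y_toy, [5.1, 5.0], k=3)
print(pred_near_first, pred_near_second)
```

Because every prediction scans all stored cases, the cost grows with the size of the training set, which is one reason library implementations use tree or index structures.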

## Classification - Personal Loan Dataset

This case concerns a bank with a growing customer base. The majority of its customers are liability customers (depositors) with deposits of varying size. The number of customers who are also borrowers (asset customers) is quite small, and the bank wants to expand this base rapidly to bring in more loan business and, in the process, earn more through interest on loans.

In particular, management wants to explore ways of converting its liability customers into personal loan customers (while retaining them as depositors). A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise better-targeted campaigns that increase the success ratio with a minimal budget.

The department wants to build a model that will help it identify the potential customers who have a higher probability of purchasing the loan. This will increase the success ratio while reducing the cost of the campaign.

**Dataset Description**:

| Feature | Description |
| --- | --- |
| ID | Customer ID |
| Age | Customer's age in completed years |
| Experience | # years of professional experience |
| Income | Annual income of the customer (In 1,000 dollars) |
| ZIP Code | Home address ZIP code |
| Family | Family size of the customer |
| CCAvg | Average monthly spending on credit cards (in 1,000 dollars) |
| Education | Education level: 1: Undergraduate; 2: Graduate; 3: Advanced/Professional |
| Mortgage | Value of house mortgage, if any (in 1,000 dollars) |
| Securities Account | Does the customer have a securities account with the bank? |
| CD Account | Does the customer have a certificate of deposit (CD) account with the bank? |
| Online | Does the customer use internet banking facilities? |
| CreditCard | Does the customer use a credit card issued by UniversalBank? |
| **Personal Loan** | **Did this customer accept the personal loan offered in the last campaign? 1: yes; 0: no (target variable)** |

**The classification goal is to predict whether a customer will accept the personal loan offer (target variable `Personal Loan`: 1 = yes, 0 = no).**
___

The dataset is available at the path `datasets` from the current directory.

### Install Necessary Packages

In [None]:
!pip install imblearn



#### Import all the required packages and classes

In [None]:
import math
import numpy as np
import pandas as pd

from sklearn.neighbors import KNeighborsClassifier

import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler

from sklearn.impute import SimpleImputer

from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score,recall_score,precision_score,f1_score 

from imblearn.under_sampling import CondensedNearestNeighbour

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV, cross_val_score


### Mount the Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import warnings
warnings.filterwarnings('ignore')

#### Read the data

In [None]:
data = pd.read_csv("/content/drive/My Drive/mlknn/UnivBank.csv",na_values=['?','#'], header=0)

#### Display the first 5 records

In [None]:
data.head()

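| | ID | Age | Experience | Income | ZIP Code | Family | CCAvg | Education | Mortgage | Personal Loan | Securities Account | CD Account | Online | CreditCard |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 25 | 1 | 49 | 91107 | 4 | 1.6 | 1 | 0.0 | 0 | 1.0 | 0.0 | 0 | 0 |
| 1 | 2 | 45 | 19 | 34 | 90089 | 3 | 1.5 | 1 | 0.0 | 0 | 1.0 | 0.0 | 0 | 0 |
| 2 | 3 | 39 | 15 | 11 | 94720 | 1 | 1.0 | 1 | 0.0 | 0 | 0.0 | 0.0 | 0 | 0 |
| 3 | 4 | 35 | 9 | 100 | 94112 | 1 | 2.7 | 2 | 0.0 | 0 | 0.0 | NaN | 0 | 0 |
| 4 | 5 | 35 | 8 | 45 | 91330 | 4 | 1.0 | 2 | 0.0 | 0 | 0.0 | 0.0 | 0 | 1 |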


#### Display the column names and column datatypes

In [None]:
print(data.columns)
print(data.dtypes)

Index(['ID', 'Age', 'Experience', 'Income', 'ZIP Code', 'Family', 'CCAvg',
       'Education', 'Mortgage', 'Personal Loan', 'Securities Account',
       'CD Account', 'Online', 'CreditCard'],
      dtype='object')
ID                      int64
Age                     int64
Experience              int64
Income                  int64
ZIP Code                int64
Family                  int64
CCAvg                 float64
Education               int64
Mortgage              float64
Personal Loan           int64
Securities Account    float64
CD Account            float64
Online                  int64
CreditCard              int64
dtype: object


#### Check the summary (descriptive statistics) for all attributes

In [None]:
data.describe(include='all')

| | ID | Age | Experience | Income | ZIP Code | Family | CCAvg | Education | Mortgage | Personal Loan | Securities Account | CD Account | Online | CreditCard |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 4998.0 | 5000.0 | 4998.0 | 4999.0 | 5000.0 | 5000.0 |
| mean | 2500.5 | 45.3384 | 20.1046 | 73.7742 | 93152.503 | 2.3964 | 1.937938 | 1.881 | 56.521409 | 0.096 | 0.104442 | 0.060412 | 0.5968 | 0.294 |
| std | 1443.520003 | 11.463166 | 11.467954 | 46.033729 | 2121.852197 | 1.147663 | 1.747659 | 0.839869 | 101.727873 | 0.294621 | 0.305863 | 0.238273 | 0.490589 | 0.455637 |
| min | 1.0 | 23.0 | -3.0 | 8.0 | 9307.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 1250.75 | 35.0 | 10.0 | 39.0 | 91911.0 | 1.0 | 0.7 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 2500.5 | 45.0 | 20.0 | 64.0 | 93437.0 | 2.0 | 1.5 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 75% | 3750.25 | 55.0 | 30.0 | 98.0 | 94608.0 | 3.0 | 2.5 | 3.0 | 101.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| max | 5000.0 | 67.0 | 43.0 | 224.0 | 96651.0 | 4.0 | 10.0 | 3.0 | 635.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


#### Check the class distribution (in %) of the target attribute 'Personal Loan'

In [None]:
data['Personal Loan'].value_counts(normalize=True)*100 #nunique()

0    90.4
1     9.6
Name: Personal Loan, dtype: float64
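
With only 9.6% positive cases, accuracy alone can be misleading. As a small worked example (plain arithmetic on the class shares above, with the count of 480 positives derived as 9.6% of 5000), a hypothetical "model" that always predicts 0 would score as follows:

```python
# Class shares taken from the value_counts output above
n_total = 5000
n_positive = 480                    # 9.6% of 5000 accepted the loan
n_negative = n_total - n_positive   # 4520 did not

# A hypothetical baseline that always predicts 0 (no loan)
baseline_accuracy = n_negative / n_total
baseline_recall = 0 / n_positive    # it never finds a positive case

print(f"accuracy={baseline_accuracy:.3f}, recall={baseline_recall:.3f}")
```

Any classifier therefore needs to beat roughly 90% accuracy to be better than this trivial rule, and recall on class 1 is a more informative measure for the campaign.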

#### Check the number of unique ZIP Codes present in the dataset 

In [None]:
data['ZIP Code'].nunique()

467

#### Check whether the 'ID' values are unique

In [None]:
# number of distinct IDs; equal to the row count when every ID is unique
data['ID'].nunique()

5000

### Think: how should we deal with these attributes?

#### Remove the unnecessary columns (ID and ZIP Code)

In [None]:
data.drop('ID',axis=1,inplace=True)

In [None]:
data.drop('ZIP Code',axis=1,inplace=True)

#### Check the count of Education values in each level

In [None]:
data['Education'].value_counts()

1    2096
3    1501
2    1403
Name: Education, dtype: int64

In [None]:
data['Mortgage'].value_counts()

0.0      3460
98.0       17
119.0      16
89.0       16
91.0       16
         ... 
547.0       1
458.0       1
505.0       1
361.0       1
541.0       1
Name: Mortgage, Length: 347, dtype: int64

#### Check the count of Family values in each level

In [None]:
data['Family'].value_counts()

1    1472
2    1296
4    1222
3    1010
Name: Family, dtype: int64

### Think: what should their actual data types be?

In [None]:
data.dtypes

Age                     int64
Experience              int64
Income                  int64
Family                  int64
CCAvg                 float64
Education               int64
Mortgage              float64
Personal Loan           int64
Securities Account    float64
CD Account            float64
Online                  int64
CreditCard              int64
dtype: object

#### Convert the attributes to the right data type based on the dataset description

In [None]:
# attributes that are categorical according to the dataset description
column = ['Education', 'CreditCard', 'Family', 'Online', 'Securities Account', 'CD Account']
for col in column:
    data[col] = data[col].astype('category')

In [None]:
data.dtypes

Age                      int64
Experience               int64
Income                   int64
Family                category
CCAvg                  float64
Education             category
Mortgage               float64
Personal Loan            int64
Securities Account    category
CD Account            category
Online                category
CreditCard            category
dtype: object

In [None]:
data.shape

(5000, 12)

#### Creating dummy variables

In [None]:
# one-hot encode the categorical attributes listed in `column`
data = pd.get_dummies(data, columns=column)

In [None]:
data.shape

(5000, 21)
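
The six categorical columns expand into 15 dummy columns, which is how 12 columns become 21. One detail worth knowing is how `pd.get_dummies` treats missing category values: by default a row with `NaN` simply gets zeros in every dummy column, so the missing value no longer shows up in `isnull()`. The sketch below uses a made-up three-row frame (the name `toy` is an assumption for illustration) to show this and the `dummy_na=True` alternative.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame with one missing category value
toy = pd.DataFrame({"CD Account": pd.Categorical([0.0, 1.0, np.nan])})

plain = pd.get_dummies(toy, columns=["CD Account"])
with_na = pd.get_dummies(toy, columns=["CD Account"], dummy_na=True)

print(plain.columns.tolist())
print(with_na.columns.tolist())

# The row with the missing value is all zeros in `plain`
print(plain.iloc[2].astype(int).tolist())
# No NaN remains after encoding, so isnull() reports nothing
print(int(plain.isnull().sum().sum()))
```

This may be why the missing-value check that follows reports zero for the `Securities Account` and `CD Account` dummies even though `describe()` showed fewer than 5000 values for those columns.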

####  Check for missing values 

In [None]:
data.isnull().sum()

Age                       0
Experience                0
Income                    0
CCAvg                     0
Mortgage                  2
Personal Loan             0
Education_1               0
Education_2               0
Education_3               0
CreditCard_0              0
CreditCard_1              0
Family_1                  0
Family_2                  0
Family_3                  0
Family_4                  0
Online_0                  0
Online_1                  0
Securities Account_0.0    0
Securities Account_1.0    0
CD Account_0.0            0
CD Account_1.0            0
dtype: int64

#### Split the data into train and test

In [None]:
X = data.drop('Personal Loan',axis=1)
y = data['Personal Loan']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20,stratify=y,random_state=123)

In [None]:
# check the dimensions of the data
# independent variables (features), train data
print(X_train.shape)

# independent variables (features), test data
print(X_test.shape)

# dependent variable (target), train data
print(y_train.shape)

# dependent variable (target), test data
print(y_test.shape)

(4000, 20)
(1000, 20)
(4000,)
(1000,)


In [None]:
# check the frequency distribution of the target in the train data
print(y_train.value_counts())

# check the frequency distribution of the target in the test data
print(y_test.value_counts())

0    3616
1     384
Name: Personal Loan, dtype: int64
0    904
1     96
Name: Personal Loan, dtype: int64
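
Both splits keep the 9.6% positive rate because of `stratify=y`. A minimal sketch on synthetic labels (the arrays `X_demo` and `y_demo` are invented stand-ins with the same class counts as the dataset) shows the same behaviour:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the dataset (illustrative only)
y_demo = np.array([1] * 480 + [0] * 4520)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, stratify=y_demo, random_state=123
)

# Positive counts and rates in each split
print(int(y_tr.sum()), int(y_te.sum()))
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```

Without `stratify`, the share of positives in a small test set can drift by chance, which makes recall estimates noisier.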


In [None]:
X_train.dtypes

Age                         int64
Experience                  int64
Income                      int64
CCAvg                     float64
Mortgage                  float64
Education_1                 uint8
Education_2                 uint8
Education_3                 uint8
CreditCard_0                uint8
CreditCard_1                uint8
Family_1                    uint8
Family_2                    uint8
Family_3                    uint8
Family_4                    uint8
Online_0                    uint8
Online_1                    uint8
Securities Account_0.0      uint8
Securities Account_1.0      uint8
CD Account_0.0              uint8
CD Account_1.0              uint8
dtype: object

#### Split the attributes into numerical and categorical types

### Can we do it with simple code?

In [None]:
num_attr=X_train.select_dtypes(['int64','float64']).columns
num_attr

Index(['Age', 'Experience', 'Income', 'CCAvg', 'Mortgage'], dtype='object')

In [None]:
# empty at this point: get_dummies already replaced the category columns with uint8 dummies
cat_attr = X_train.select_dtypes('category').columns
cat_attr

Index([], dtype='object')

#### Checking for missing values in train and test dataset

#### Imputing missing values with median

In [None]:
X_train['Mortgage'].median()

0.0

In [None]:
# creating an object of imputer
imputer = SimpleImputer(strategy='median')
imputer = imputer.fit(X_train[num_attr])

# imputing on train data
X_train[num_attr] = imputer.transform(X_train[num_attr])

# impute on test data
X_test[num_attr] = imputer.transform(X_test[num_attr])

In [None]:
imputer.statistics_

array([45. , 20. , 63.5,  1.5,  0. ])

In [None]:
X_train.isnull().sum()

Age                       0
Experience                0
Income                    0
CCAvg                     0
Mortgage                  0
Education_1               0
Education_2               0
Education_3               0
CreditCard_0              0
CreditCard_1              0
Family_1                  0
Family_2                  0
Family_3                  0
Family_4                  0
Online_0                  0
Online_1                  0
Securities Account_0.0    0
Securities Account_1.0    0
CD Account_0.0            0
CD Account_1.0            0
dtype: int64

In [None]:
X_test.isnull().sum()

Age                       0
Experience                0
Income                    0
CCAvg                     0
Mortgage                  0
Education_1               0
Education_2               0
Education_3               0
CreditCard_0              0
CreditCard_1              0
Family_1                  0
Family_2                  0
Family_3                  0
Family_4                  0
Online_0                  0
Online_1                  0
Securities Account_0.0    0
Securities Account_1.0    0
CD Account_0.0            0
CD Account_1.0            0
dtype: int64

#### Imputation for missing values for categoric attributes

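In [None]:
# creating an object of imputer ('most_frequent' is SimpleImputer's mode strategy)
imputer1 = SimpleImputer(strategy='most_frequent')

# cat_attr is empty after dummy encoding, so impute only if there is something to impute
if len(cat_attr) > 0:
    imputer1 = imputer1.fit(X_train[cat_attr])

    # imputing on train data
    X_train[cat_attr] = imputer1.transform(X_train[cat_attr])

    # impute on test data
    X_test[cat_attr] = imputer1.transform(X_test[cat_attr])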

In [None]:
print(X_train.isnull().sum())
print(X_test.isnull().sum())

Age                       0
Experience                0
Income                    0
CCAvg                     0
Mortgage                  0
Education_1               0
Education_2               0
Education_3               0
CreditCard_0              0
CreditCard_1              0
Family_1                  0
Family_2                  0
Family_3                  0
Family_4                  0
Online_0                  0
Online_1                  0
Securities Account_0.0    0
Securities Account_1.0    0
CD Account_0.0            0
CD Account_1.0            0
dtype: int64
Age                       0
Experience                0
Income                    0
CCAvg                     0
Mortgage                  0
Education_1               0
Education_2               0
Education_3               0
CreditCard_0              0
CreditCard_1              0
Family_1                  0
Family_2                  0
Family_3                  0
Family_4                  0
Online_0                  0
Online_

### Activity on sampling and scaling

#### Standardize the data (numerical attributes only) using StandardScaler


In [None]:
# creating an object of scaler
scaler = StandardScaler()
# fit on train
std=scaler.fit(X_train[num_attr])

In [None]:
# transform on train
X_train[num_attr]=std.transform(X_train[num_attr])
# transform on test
X_test[num_attr]=std.transform(X_test[num_attr])
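
The pattern used so far (fit the imputer and scaler on the training data only, then transform both sets) can also be expressed as a single scikit-learn pipeline. The sketch below is one possible arrangement, not the approach this notebook takes; the synthetic frame, its column names, and the names `numeric_steps`, `preprocess` and `knn_pipeline` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Tiny synthetic frame (values invented for illustration)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "Income": rng.normal(70, 40, 200),
    "Mortgage": rng.normal(50, 100, 200),
    "Online_1": rng.integers(0, 2, 200),
})
X_demo.loc[::25, "Mortgage"] = np.nan          # inject a few missing values
y_demo = (X_demo["Income"] > 90).astype(int)

numeric_cols = ["Income", "Mortgage"]
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
preprocess = ColumnTransformer(
    [("num", numeric_steps, numeric_cols)],
    remainder="passthrough",                   # dummy columns pass through unscaled
)
knn_pipeline = Pipeline([
    ("prep", preprocess),
    ("knn", KNeighborsClassifier(n_neighbors=7)),
])

# fit() learns medians and scaling statistics from the training rows only
knn_pipeline.fit(X_demo.iloc[:150], y_demo.iloc[:150])
demo_preds = knn_pipeline.predict(X_demo.iloc[150:])
print(demo_preds.shape)
```

Wrapping the steps this way also keeps cross-validation honest, since the imputer and scaler are refit inside each fold instead of seeing the validation rows.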

#### Build KNN Classifier Model

In [None]:
model_knn= KNeighborsClassifier(n_neighbors=7)  #n_neighbors=5 (By default)
model_knn.fit(X_train,y_train)

KNeighborsClassifier(n_neighbors=7)

#### Predict on the train and test data

In [None]:
y_train_pred_knn=model_knn.predict(X_train)
y_test_pred_knn=model_knn.predict(X_test)

#### Find Accuracy for KNN

In [None]:
print('Accuracy on training set: {:.3f}'.format(model_knn.score(X_train,y_train)))
print('Accuracy on testing set: {:.3f}'.format(model_knn.score(X_test,y_test)))

Accuracy on training set: 0.978
Accuracy on testing set: 0.953


#### Find the recall using the classification report

In [None]:
print(classification_report(y_train,y_train_pred_knn))

              precision    recall  f1-score   support

           0       0.98      1.00      0.99      3616
           1       0.99      0.77      0.87       384

    accuracy                           0.98      4000
   macro avg       0.98      0.89      0.93      4000
weighted avg       0.98      0.98      0.98      4000



In [None]:
print(classification_report(y_test,y_test_pred_knn))

              precision    recall  f1-score   support

           0       0.96      0.99      0.97       904
           1       0.90      0.57      0.70        96

    accuracy                           0.95      1000
   macro avg       0.93      0.78      0.84      1000
weighted avg       0.95      0.95      0.95      1000



### **Finding out the IDEAL K-value for the given dataset**

### Grid Search K-fold Cross Validation:

#### 1. Use the GridSearchCV 

In [None]:
parameters = {'n_neighbors':list(range(2,6))}

clf = GridSearchCV(KNeighborsClassifier(metric="cityblock", n_jobs=-1),
                   parameters,verbose=1, cv=5,scoring='recall')

clf.fit(X=X_train, y=y_train)

Fitting 5 folds for each of 4 candidates, totalling 20 fits


GridSearchCV(cv=5,
             estimator=KNeighborsClassifier(metric='cityblock', n_jobs=-1),
             param_grid={'n_neighbors': [2, 3, 4, 5]}, scoring='recall',
             verbose=1)

In [None]:

knn_model = clf.best_estimator_
knn_model

KNeighborsClassifier(metric='cityblock', n_jobs=-1, n_neighbors=3)

In [None]:
print (clf.best_score_, clf.best_params_) 

0.5494531784005467 {'n_neighbors': 3}
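
Besides `best_score_` and `best_params_`, the fitted search object keeps the per-candidate scores in `cv_results_`, which helps judge how close the candidates are. The following sketch runs on synthetic data from `make_classification` as a stand-in for the bank dataset, and the grid of odd K values is an assumption chosen for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic imbalanced data standing in for the bank dataset
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.9, 0.1], random_state=0
)

search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5, 7, 9]},
    scoring="recall",
    cv=5,
)
search.fit(X_demo, y_demo)

# One row per candidate K, with mean and spread of recall across the folds
results = pd.DataFrame(search.cv_results_)[
    ["param_n_neighbors", "mean_test_score", "std_test_score", "rank_test_score"]
]
print(results.sort_values("rank_test_score"))
```

If the mean scores of neighbouring K values differ by less than their standard deviation, the choice between them is largely arbitrary.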


#### 2. Predict on the test data using the best model

In [None]:
y_pred_test=knn_model.predict(X_test)

In [None]:
y_train_pred = knn_model.predict(X_train)
#  FOR TEST
y_test_pred  = knn_model.predict(X_test)

In [None]:
y_pred_test[:5]

array([0, 0, 0, 0, 0])

In [None]:
print("Accuracy on training set: {:.3f}".format(knn_model.score(X_train, y_train)))
print("Accuracy on testing set: {:.3f}".format(knn_model.score(X_test, y_test)))

Accuracy on training set: 0.978
Accuracy on testing set: 0.954


In [None]:
print(classification_report(y_train, y_train_pred))

              precision    recall  f1-score   support

           0       0.98      1.00      0.99      3616
           1       0.99      0.78      0.87       384

    accuracy                           0.98      4000
   macro avg       0.99      0.89      0.93      4000
weighted avg       0.98      0.98      0.98      4000



In [None]:
#  FOR TEST
print(classification_report(y_test, y_test_pred))

              precision    recall  f1-score   support

           0       0.96      1.00      0.98       904
           1       0.93      0.56      0.70        96

    accuracy                           0.95      1000
   macro avg       0.94      0.78      0.84      1000
weighted avg       0.95      0.95      0.95      1000



#### 3. Compute confusion matrix to evaluate the accuracy of the classification 

In [None]:
print(confusion_matrix(y_test,y_pred_test))

[[900   4]
 [ 42  54]]
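
As a worked example, the headline metrics can be recomputed by hand from this matrix. In scikit-learn's convention the rows are the actual classes and the columns the predicted classes, so the four cells are TN, FP, FN and TP in reading order.

```python
# Confusion matrix from the output above: rows = actual, columns = predicted
tn, fp = 900, 4
fn, tp = 42, 54

recall = tp / (tp + fn)         # share of actual borrowers that were found
precision = tp / (tp + fp)      # share of predicted borrowers that were right
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(round(recall, 4), round(precision, 3), round(accuracy, 3))
```

The high accuracy comes almost entirely from the 900 correctly rejected customers; 42 of the 96 actual borrowers are still missed.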


#### 4. Compute the recall score

In [None]:
from sklearn.metrics import recall_score
print(recall_score(y_test,y_pred_test))

0.5625


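## Condensed Nearest Neighbour (CNN) with KNN

In [None]:
# CondensedNearestNeighbour under-samples the majority class
cnn = CondensedNearestNeighbour(n_neighbors=3)
X_cnn_train, y_cnn_train = cnn.fit_resample(X_train, y_train)
# note: the test set is resampled here as well, so the scores below describe
# a rebalanced subset of the test data, not the original 1000 test rows
X_cnn_test, y_cnn_test = cnn.fit_resample(X_test,y_test)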

In [None]:
# Check Shapes of train & test sets for all
X_train.shape

(4000, 20)

In [None]:
X_cnn_train.shape

(692, 20)
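
The drop from 4000 to 692 rows comes from discarding majority-class cases that the remaining cases already classify correctly. The sketch below is a simplified, one-pass version of that idea written from scratch; it is not imbalanced-learn's `CondensedNearestNeighbour` implementation, and the function name `condense_majority` and the synthetic data are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier


def condense_majority(X, y, minority_label=1, seed=0):
    """One-pass, simplified Hart-style condensation (illustrative only)."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    rng = np.random.default_rng(seed)
    minority_idx = np.flatnonzero(y == minority_label)
    majority_idx = rng.permutation(np.flatnonzero(y != minority_label))

    # Start the store with every minority case plus one majority case
    keep = list(minority_idx) + [majority_idx[0]]
    for i in majority_idx[1:]:
        nn = KNeighborsClassifier(n_neighbors=1).fit(X[keep], y[keep])
        # Keep a majority case only if the current store misclassifies it
        if nn.predict(X[i:i + 1])[0] != y[i]:
            keep.append(i)
    keep = np.array(keep)
    return X[keep], y[keep]


# Synthetic two-class data (invented): 300 majority and 30 minority points
rng = np.random.default_rng(1)
X_demo = np.vstack([rng.normal(0, 1, (300, 2)), rng.normal(2, 1, (30, 2))])
y_demo = np.array([0] * 300 + [1] * 30)

X_small, y_small = condense_majority(X_demo, y_demo)
print(X_demo.shape, X_small.shape, int((y_small == 1).sum()))
```

All minority cases survive, while majority cases far from the class boundary tend to be dropped, which is why the condensed set is much more balanced than the original.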

In [None]:
y_cnn_train.shape

(692,)

In [None]:
X_cnn_train.head()

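| | Age | Experience | Income | CCAvg | Mortgage | Education_1 | Education_2 | Education_3 | CreditCard_0 | CreditCard_1 | Family_1 | Family_2 | Family_3 | Family_4 | Online_0 | Online_1 | Securities Account_0.0 | Securities Account_1.0 | CD Account_0.0 | CD Account_1.0 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.763183 | 0.783159 | -0.344484 | -0.367592 | -0.559325 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 |
| 1 | -1.256633 | -1.323642 | 0.794945 | -0.990617 | -0.559325 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| 2 | 1.641364 | 1.748777 | 1.224918 | 1.048374 | -0.559325 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| 3 | 0.851001 | 0.783159 | -0.043503 | -0.933978 | 0.887982 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| 4 | 1.37791 | 1.397643 | -1.376419 | -0.87734 | -0.559325 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 |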


In [None]:
y_test.shape

(1000,)

In [None]:
y_cnn_test.shape

(206,)

In [None]:
y_cnn_test.head()

0    0
1    0
2    0
3    0
4    0
Name: Personal Loan, dtype: int64

In [None]:
# check the scores of the tuned KNN (fitted on the full training set) on the condensed sets


In [None]:
y_cnn_pred_test=knn_model.predict(X_cnn_test)

In [None]:
y_cnn_train_pred = knn_model.predict(X_cnn_train)
#  FOR TEST
y_cnn_test_pred  = knn_model.predict(X_cnn_test)

In [None]:
print("Accuracy on training set: {:.3f}".format(knn_model.score(X_cnn_train, y_cnn_train)))
print("Accuracy on testing set: {:.3f}".format(knn_model.score(X_cnn_test, y_cnn_test)))

Accuracy on training set: 0.873
Accuracy on testing set: 0.791


In [None]:
print(classification_report(y_cnn_train, y_cnn_train_pred))

              precision    recall  f1-score   support

           0       0.78      0.99      0.87       308
           1       0.99      0.78      0.87       384

    accuracy                           0.87       692
   macro avg       0.89      0.88      0.87       692
weighted avg       0.90      0.87      0.87       692



In [None]:
#  FOR TEST
print(classification_report(y_cnn_test, y_cnn_test_pred))

              precision    recall  f1-score   support

           0       0.72      0.99      0.84       110
           1       0.98      0.56      0.72        96

    accuracy                           0.79       206
   macro avg       0.85      0.78      0.78       206
weighted avg       0.84      0.79      0.78       206



In [None]:
print(confusion_matrix(y_cnn_test,y_cnn_pred_test))

[[109   1]
 [ 42  54]]
