# K-Nearest Neighbours

## Classification - Personal Loan Dataset

This case concerns a bank with a growing customer base. The majority of these customers are liability customers (depositors) with deposits of varying size. The number of customers who are also borrowers (asset customers) is quite small, and the bank wants to expand this base rapidly to bring in more loan business and, in the process, earn more through interest on loans.

In particular, management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors). A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise better-targeted campaigns that increase the success ratio on a minimal budget.

The department wants to build a model that identifies the customers with a higher probability of purchasing the loan. This should increase the success ratio while reducing the cost of the campaign.

**Dataset Description**:

| Feature | Description |
| --- | --- |
| ID | Customer ID |
| Age | Customer's age in completed years |
| Experience | # years of professional experience |
| Income | Annual income of the customer (In 1,000 dollars) |
| ZIP Code | Home address ZIP code |
| Family | Family size of the customer |
| CCAvg | Average monthly spending on credit cards (In 1,000 dollars) |
| Education | Education level: 1: Undergraduate; 2: Graduate; 3: Advanced/Professional |
| Mortgage | Value of house mortgage, if any (in 1,000 dollars) |
| Securities Account | Does the customer have a securities account with the bank? |
| CD Account | Does the customer have a certificate of deposit (CD) account with the bank? |
| Online | Does the customer use internet banking facilities? |
| CreditCard | Does the customer use a credit card issued by UniversalBank? |
| **Personal Loan** | **Did this customer accept the personal loan offered in the last campaign? 1: yes; 0: no (target variable)** |

**The classification goal is to predict whether a customer will accept the personal loan offer (yes/no), i.e. the target variable `Personal Loan`.**
___

The dataset is available at the path `datasets` from the current directory.

## K nearest neighbors
		
K nearest neighbors is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure (e.g., distance functions).

Algorithm: 
A case is classified by a majority vote of its neighbors, with the case being assigned to the class most common amongst its K nearest neighbors measured by a distance function.
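
The majority-vote rule described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used later in the notebook: the function name `knn_predict` and the toy points are made up for the example, and Euclidean distance is assumed as the distance function.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every stored training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbours and count how often each label occurs
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two well-separated clusters
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(points, labels, (1.1, 1.0), k=3))  # query near the first cluster
print(knn_predict(points, labels, (5.1, 5.0), k=3))  # query near the second cluster
```

Nothing is learned at fit time here: all the work happens at prediction time, which is why KNN is often called a lazy learner.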

#### Import all the required packages and classes

In [1]:
import numpy as np
import pandas as pd

In [2]:
import warnings
warnings.filterwarnings('ignore')

# Preprocessing
## Read the file

In [3]:
bank=pd.read_csv("UnivBank.csv",na_values=["?","#"])


In [4]:
bank

Unnamed: 0,ID,Age,Experience,Income,ZIP Code,Family,CCAvg,Education,Mortgage,Personal Loan,Securities Account,CD Account,Online,CreditCard
0,1,25,1,49,91107,4,1.6,1,0.0,0,1.0,0.0,0,0
1,2,45,19,34,90089,3,1.5,1,0.0,0,1.0,0.0,0,0
2,3,39,15,11,94720,1,1.0,1,0.0,0,0.0,0.0,0,0
3,4,35,9,100,94112,1,2.7,2,0.0,0,0.0,,0,0
4,5,35,8,45,91330,4,1.0,2,0.0,0,0.0,0.0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,4996,29,3,40,92697,1,1.9,3,0.0,0,0.0,0.0,1,0
4996,4997,30,4,15,92037,4,0.4,1,85.0,0,0.0,0.0,1,0
4997,4998,63,39,24,93023,2,0.3,3,0.0,0,0.0,0.0,0,0
4998,4999,65,40,49,90034,3,0.5,2,0.0,0,0.0,0.0,1,0


In [5]:
bank = bank.drop(['ID','ZIP Code'], axis=1)

In [6]:
bank.dtypes  # inspect the data type of each column

Age                     int64
Experience              int64
Income                  int64
Family                  int64
CCAvg                 float64
Education               int64
Mortgage              float64
Personal Loan           int64
Securities Account    float64
CD Account            float64
Online                  int64
CreditCard              int64
dtype: object

In [7]:
bank

Unnamed: 0,Age,Experience,Income,Family,CCAvg,Education,Mortgage,Personal Loan,Securities Account,CD Account,Online,CreditCard
0,25,1,49,4,1.6,1,0.0,0,1.0,0.0,0,0
1,45,19,34,3,1.5,1,0.0,0,1.0,0.0,0,0
2,39,15,11,1,1.0,1,0.0,0,0.0,0.0,0,0
3,35,9,100,1,2.7,2,0.0,0,0.0,,0,0
4,35,8,45,4,1.0,2,0.0,0,0.0,0.0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...
4995,29,3,40,1,1.9,3,0.0,0,0.0,0.0,1,0
4996,30,4,15,4,0.4,1,85.0,0,0.0,0.0,1,0
4997,63,39,24,2,0.3,3,0.0,0,0.0,0.0,0,0
4998,65,40,49,3,0.5,2,0.0,0,0.0,0.0,1,0


In [8]:
# Convert the categorical columns (stored as numbers) to the category dtype

In [9]:
bank.columns

Index(['Age', 'Experience', 'Income', 'Family', 'CCAvg', 'Education',
       'Mortgage', 'Personal Loan', 'Securities Account', 'CD Account',
       'Online', 'CreditCard'],
      dtype='object')

In [10]:
for col in ['Family', 'Education', 'Securities Account', 'CD Account',
       'Online', 'CreditCard']:
    bank[col] = bank[col].astype('category')

In [11]:
bank.dtypes

Age                      int64
Experience               int64
Income                   int64
Family                category
CCAvg                  float64
Education             category
Mortgage               float64
Personal Loan            int64
Securities Account    category
CD Account            category
Online                category
CreditCard            category
dtype: object

In [12]:
bank.isnull().sum()

Age                   0
Experience            0
Income                0
Family                0
CCAvg                 0
Education             0
Mortgage              2
Personal Loan         0
Securities Account    2
CD Account            1
Online                0
CreditCard            0
dtype: int64

In [13]:
bank.shape

(5000, 12)

In [14]:
bank = bank.dropna()

In [15]:
bank.isnull().sum()

Age                   0
Experience            0
Income                0
Family                0
CCAvg                 0
Education             0
Mortgage              0
Personal Loan         0
Securities Account    0
CD Account            0
Online                0
CreditCard            0
dtype: int64

In [16]:
num_cols = ['Age', 'Experience', 'CCAvg', 'Mortgage', 'Income']

In [17]:
cat_cols = ['Family', 'Education', 'Securities Account', 'CD Account',
       'Online', 'CreditCard']

In [18]:
bank.shape

(4995, 12)

In [19]:
X = bank.drop('Personal Loan', axis=1)
y = bank['Personal Loan']

In [20]:
from sklearn.model_selection import train_test_split

In [21]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state = 0)
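
The split above is not stratified, and the positive class is rare: judging by the confusion matrices later in this notebook, roughly 9.8% of the training rows (442 of 4495) but only 7.6% of the test rows (38 of 500) are loan acceptors. One possible variation, sketched here on synthetic data rather than on the bank dataset, is to pass `stratify` so both splits keep the same class ratio. The names `X_demo` and `y_demo` are placeholders for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the bank data: 1000 rows, exactly 10% positives
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 100 + [0] * 900)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.10, random_state=0, stratify=y_demo
)

# With stratify, the positive rate is preserved in both splits
print("train positive rate:", y_tr.mean())
print("test positive rate: ", y_te.mean())
```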

In [22]:
X_train

Unnamed: 0,Age,Experience,Income,Family,CCAvg,Education,Mortgage,Securities Account,CD Account,Online,CreditCard
352,52,28,91,4,1.0,2,0.0,0.0,0.0,0,1
4830,37,12,60,4,2.1,3,217.0,0.0,0.0,1,0
1087,38,13,54,3,0.7,2,196.0,0.0,0.0,0,0
4078,36,12,58,1,3.6,2,0.0,0.0,0.0,0,0
1765,26,0,149,2,7.2,1,154.0,0.0,0.0,0,0
...,...,...,...,...,...,...,...,...,...,...,...
4936,45,20,94,3,0.5,3,0.0,0.0,0.0,0,0
3269,58,34,68,2,2.8,1,113.0,0.0,0.0,0,0
1658,50,25,14,4,0.8,1,0.0,0.0,0.0,1,0
2612,50,26,40,4,1.1,2,131.0,0.0,0.0,0,0


In [23]:
y_train

352     0
4830    0
1087    0
4078    0
1765    0
       ..
4936    0
3269    0
1658    0
2612    0
2737    0
Name: Personal Loan, Length: 4495, dtype: int64

In [24]:
X_train[cat_cols]

Unnamed: 0,Family,Education,Securities Account,CD Account,Online,CreditCard
352,4,2,0.0,0.0,0,1
4830,4,3,0.0,0.0,1,0
1087,3,2,0.0,0.0,0,0
4078,1,2,0.0,0.0,0,0
1765,2,1,0.0,0.0,0,0
...,...,...,...,...,...,...
4936,3,3,0.0,0.0,0,0
3269,2,1,0.0,0.0,0,0
1658,4,1,0.0,0.0,1,0
2612,4,2,0.0,0.0,0,0


In [25]:
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(handle_unknown='ignore')

In [26]:
enc.fit(X_train[cat_cols])

In [27]:
cat_ohe=enc.get_feature_names_out()

In [28]:
cat_ohe

array(['Family_1', 'Family_2', 'Family_3', 'Family_4', 'Education_1',
       'Education_2', 'Education_3', 'Securities Account_0.0',
       'Securities Account_1.0', 'CD Account_0.0', 'CD Account_1.0',
       'Online_0', 'Online_1', 'CreditCard_0', 'CreditCard_1'],
      dtype=object)

In [29]:
X_train_ohe=enc.transform(X_train[cat_cols]).toarray()

In [30]:
X_test_ohe=enc.transform(X_test[cat_cols]).toarray()
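
The encoder is fitted on the training set only, so the test set could in principle contain a category the encoder has never seen. With `handle_unknown='ignore'`, such a value is encoded as an all-zero row instead of raising an error. A small sketch with made-up category values (`demo_enc`, `train_cats` and `test_cats` are illustrative names):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train_cats = np.array([[1], [2], [3]])  # categories seen during fit
test_cats = np.array([[2], [4]])        # 4 never appeared in training

demo_enc = OneHotEncoder(handle_unknown='ignore')
demo_enc.fit(train_cats)

# Row for category 2 has a single 1; row for the unseen category 4 is all zeros
encoded = demo_enc.transform(test_cats).toarray()
print(encoded)
```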

In [31]:
from sklearn.preprocessing import LabelEncoder

In [32]:
le=LabelEncoder()

In [33]:
le.fit(y_train)

In [34]:
y_train=le.transform(y_train)

In [35]:
y_test = le.transform(y_test)

In [36]:
from sklearn.preprocessing import StandardScaler

In [37]:
scaler = StandardScaler()

In [38]:
scaler.fit(X_train[num_cols])

In [39]:
num_scale=scaler.get_feature_names_out()

In [40]:
num_scale

array(['Age', 'Experience', 'CCAvg', 'Mortgage', 'Income'], dtype=object)

In [41]:
X_train_std = scaler.transform(X_train[num_cols])

In [42]:
X_test_std = scaler.transform(X_test[num_cols])
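
KNN relies on distances, so a feature measured in large units (such as `Income`) would otherwise dominate one measured in small units (such as `CCAvg`). Note also that the scaler is fitted on the training data and then reused for the test data, so the test rows are standardised with the training mean and standard deviation. The sketch below checks this on hypothetical numbers; `train_demo` and `test_demo` are illustrative names.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical [Income, CCAvg] rows
train_demo = np.array([[20.0, 1.0], [60.0, 2.0], [100.0, 3.0]])
test_demo = np.array([[60.0, 3.0]])

demo_scaler = StandardScaler().fit(train_demo)

# StandardScaler computes (x - mean) / std using the *training* statistics
# (population standard deviation, i.e. ddof=0)
manual = (test_demo - train_demo.mean(axis=0)) / train_demo.std(axis=0)
scaled = demo_scaler.transform(test_demo)

print(scaled)
print(np.allclose(scaled, manual))
```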

In [43]:
X_train_con=np.concatenate((X_train_ohe,X_train_std),axis=1)

In [44]:
X_test_con=np.concatenate((X_test_ohe,X_test_std),axis=1)

In [45]:
X_train_con.shape

(4495, 20)

In [46]:
X_test_con.shape

(500, 20)
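
The encode, scale and concatenate steps above can also be expressed with scikit-learn's `ColumnTransformer`, which applies each transformer to its own columns and stacks the results. This is an alternative sketch, not what the notebook actually ran; the tiny DataFrame below is hypothetical and uses only four of the columns.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical rows with two numeric and two categorical columns
demo = pd.DataFrame({
    'Income': [49, 34, 11, 100],
    'CCAvg': [1.6, 1.5, 1.0, 2.7],
    'Education': [1, 1, 1, 2],
    'Online': [0, 0, 0, 1],
})

pre = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['Education', 'Online']),
        ('num', StandardScaler(), ['Income', 'CCAvg']),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

# 2 + 2 one-hot columns followed by 2 scaled numeric columns
features = pre.fit_transform(demo)
print(features.shape)
```

As in the manual version, the transformer would be fitted on the training split and then applied to the test split with `pre.transform`.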

In [47]:
from sklearn.linear_model import LogisticRegression

In [48]:
logr=LogisticRegression(max_iter=200)

In [49]:
model_logr=logr.fit(X_train_con,y_train)

In [50]:
y_pred_train_logr=model_logr.predict(X_train_con)

In [51]:
y_pred_test_logr=model_logr.predict(X_test_con)

In [52]:
from sklearn.metrics import confusion_matrix, accuracy_score, recall_score, precision_score

In [53]:
def error_metrics(act, pred):
    """Print the confusion matrix, accuracy, recall and precision for a set of predictions."""
    print("Confusion Matrix \n", confusion_matrix(act, pred))
    print("Accuracy : ", accuracy_score(act, pred))
    print("Recall   : ", recall_score(act, pred))
    print("Precision: ", precision_score(act, pred))

In [54]:
train_logr=error_metrics(y_train,y_pred_train_logr)

Confusion Matrix 
 [[4016   37]
 [ 144  298]]
Accuracy :  0.9597330367074527
Recall   :  0.6742081447963801
Precision:  0.8895522388059701
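
To see where these three numbers come from, they can be recomputed by hand from the logistic regression training confusion matrix above. scikit-learn lays the matrix out as `[[TN, FP], [FN, TP]]` (rows are actual classes, columns are predicted classes). The variable names below are chosen for this illustration.

```python
# Counts read off the training confusion matrix [[4016, 37], [144, 298]]
tn, fp, fn, tp = 4016, 37, 144, 298

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
recall = tp / (tp + fn)                     # share of actual acceptors that were found
precision = tp / (tp + fp)                  # share of predicted acceptors that are real

print("Accuracy :", accuracy)
print("Recall   :", recall)
print("Precision:", precision)
```

For a marketing campaign, recall answers "how many of the real acceptors would we reach?", while precision answers "how many of the people we contact would accept?".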


In [55]:
test_logr = error_metrics(y_test,y_pred_test_logr)

Confusion Matrix 
 [[456   6]
 [ 13  25]]
Accuracy :  0.962
Recall   :  0.6578947368421053
Precision:  0.8064516129032258


In [56]:
from sklearn.tree import DecisionTreeClassifier
dt=DecisionTreeClassifier()
dt.fit(X_train_con,y_train)

In [57]:
y_pred_train_dt=dt.predict(X_train_con)

In [58]:
y_pred_test_dt=dt.predict(X_test_con)

In [59]:
train_dt=error_metrics(y_train,y_pred_train_dt)

Confusion Matrix 
 [[4053    0]
 [   0  442]]
Accuracy :  1.0
Recall   :  1.0
Precision:  1.0


In [60]:
test_dt=error_metrics(y_test,y_pred_test_dt)

Confusion Matrix 
 [[459   3]
 [  1  37]]
Accuracy :  0.992
Recall   :  0.9736842105263158
Precision:  0.925


In [61]:
from sklearn.ensemble import RandomForestClassifier
rf_clf = RandomForestClassifier(criterion='entropy')   
rf_clf.fit(X_train_con,y_train)

In [62]:
y_train_pred_rf = rf_clf.predict(X_train_con)

In [63]:
y_test_pred_rf = rf_clf.predict(X_test_con)

In [64]:
train_rf=error_metrics(y_train,y_train_pred_rf)

Confusion Matrix 
 [[4053    0]
 [   0  442]]
Accuracy :  1.0
Recall   :  1.0
Precision:  1.0


In [65]:
test_rf=error_metrics(y_test,y_test_pred_rf)

Confusion Matrix 
 [[462   0]
 [  1  37]]
Accuracy :  0.998
Recall   :  0.9736842105263158
Precision:  1.0


In [66]:
from sklearn.svm import SVC

In [67]:
svc = SVC()

In [68]:
svc.fit(X_train_con,y_train)

In [69]:
y_train_pred_svc = svc.predict(X_train_con)

In [70]:
y_test_pred_svc = svc.predict(X_test_con)

In [71]:
train_svc=error_metrics(y_train,y_train_pred_svc)

Confusion Matrix 
 [[4049    4]
 [  70  372]]
Accuracy :  0.9835372636262514
Recall   :  0.8416289592760181
Precision:  0.9893617021276596


In [72]:
# The output recorded below was produced with rf_clf.predict, so it repeats the
# random forest test results from In [65]; re-run to get the SVC's own test metrics.
test_svc=error_metrics(y_test,y_test_pred_svc)

Confusion Matrix 
 [[462   0]
 [  1  37]]
Accuracy :  0.998
Recall   :  0.9736842105263158
Precision:  1.0


In [73]:
from sklearn.neighbors import KNeighborsClassifier

In [74]:
classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
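
The `metric='minkowski'` and `p=2` arguments select the Minkowski distance of order 2, which is the ordinary Euclidean distance; `p=1` would give the Manhattan distance instead. A small stand-alone sketch of the formula (the helper `minkowski` is written for this illustration and is not the scikit-learn internals):

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

a, b = (0.0, 0.0), (3.0, 4.0)

print(minkowski(a, b, p=2))  # Euclidean: straight-line distance
print(minkowski(a, b, p=1))  # Manhattan: sum of absolute differences
```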

In [75]:
classifier.fit(X_train_con, y_train)

In [76]:
y_train_pred_knn = classifier.predict(X_train_con)

In [77]:
y_test_pred_knn = classifier.predict(X_test_con)

In [78]:
train_knn=error_metrics(y_train,y_train_pred_knn)

Confusion Matrix 
 [[4051    2]
 [ 137  305]]
Accuracy :  0.9690767519466074
Recall   :  0.6900452488687783
Precision:  0.993485342019544


In [79]:
test_knn=error_metrics(y_test,y_test_pred_knn)

Confusion Matrix 
 [[461   1]
 [ 16  22]]
Accuracy :  0.966
Recall   :  0.5789473684210527
Precision:  0.9565217391304348
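
The value `n_neighbors = 5` above is the library default, and the KNN test recall (about 0.58) is noticeably lower than its precision. One common way to choose `k` is cross-validated grid search, scored on the metric that matters for the campaign. The sketch below is only an illustration under stated assumptions: it uses synthetic imbalanced data from `make_classification` instead of the bank dataset, and the candidate grid of odd `k` values is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, imbalanced stand-in for the preprocessed bank features
X_syn, y_syn = make_classification(
    n_samples=500, n_features=8, weights=[0.9, 0.1], random_state=0
)

search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={'n_neighbors': [1, 3, 5, 7, 9, 11]},  # assumed candidate values
    scoring='recall',  # favour finding the rare positive class
    cv=5,
)
search.fit(X_syn, y_syn)

print("best k:", search.best_params_['n_neighbors'])
print("cross-validated recall:", search.best_score_)
```

On the real data, the search would be run on `X_train_con` and `y_train` only, keeping the test set for the final evaluation.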
