# Personal Loan Acceptance

University Bank is a relatively young bank growing rapidly in terms of overall customer acquisition. Most customers are liability customers (depositors) with varying sizes of relationships with the bank. The customer base of asset customers (borrowers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business. In particular, it wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).

A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise smarter campaigns with better target marketing. The goal is to use k-NN to predict whether a new customer will accept a loan offer. This will serve as the basis for the design of a new campaign.

The file contains data on 5000 customers. The data include customer demographic information (age, income, etc.), the customer’s relationship with the bank (mortgage, securities account, etc.), and the customer response to the last personal loan campaign (Personal Loan). Among these 5000 customers, only 480 (= 9.6%) accepted the personal loan that was offered to them in the earlier campaign. The “Personal Loan” column holds the class information (customer response).

Partition the data into training (60%) and validation (40%) sets.
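As a minimal sketch of such a split, the snippet below partitions a toy set of 10 row indices with base R's `sample()`. The names `n`, `train_ids`, and `valid_ids` are illustrative and do not come from the assignment.

```r
# Toy 60/40 partition of row indices (n is a hypothetical row count)
set.seed(100)                      # make the random split reproducible
n <- 10

# Draw 60% of the row indices for training, without replacement
train_ids <- sample(n, size = floor(0.6 * n))

# The remaining 40% form the validation set
valid_ids <- setdiff(seq_len(n), train_ids)

length(train_ids)
length(valid_ids)
```

Fixing the seed keeps the partition, and therefore any k chosen on the validation set, reproducible across runs.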

a. Consider the following customer:
Age = 40, Experience = 10, Income = 84, Family = 2, CCAvg = 2, Education_1 = 0, Education_2 = 1, Education_3 = 0, Mortgage = 0, Securities Account = 0, CD Account = 0, Online = 1, and Credit Card = 1.

Note that we do not know the actual class of this customer. Append this record to the validation set using the add_row() function from dplyr (you may check the documentation via ?add_row). It is best to keep the original validation set for step b, apply the transformation (appending) to a copy, and assign that copy to a new object for use in step d.

Perform a 1-NN classification with all predictors except ID and ZIP code. How would this customer be classified?

Remember to transform categorical predictors with more than two categories into dummy variables first. You may use the base function model.matrix or the fastDummies package (https://jacobkap.github.io/fastDummies/). If the factor has n categories, n - 1 dummies should be created (the dummy for an arbitrary reference level is redundant).
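A minimal sketch of the n - 1 dummy encoding with base R's `model.matrix`, using a toy data frame. The object names `toy` and `dummies` and the values are assumptions for illustration only.

```r
# Toy data: Education is a three-level categorical predictor (hypothetical values)
toy <- data.frame(Education = factor(c(1, 2, 3, 2)))

# model.matrix() creates an intercept column plus n - 1 dummies;
# the first level (Education = 1) acts as the reference, so drop the intercept
dummies <- model.matrix(~ Education, data = toy)[, -1, drop = FALSE]

colnames(dummies)   # "Education2" "Education3"
dummies
```

A row with both dummies equal to 0 belongs to the reference level, which is why a third dummy would carry no extra information.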

You should also normalize all features using the 0-1 range method. Do not forget to include the dummies (already normalized) in the final normalized dataset.
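The 0-1 range method rescales each feature as (x - min) / (max - min). The sketch below applies it to made-up values; `range01` is a hypothetical helper, not a function from any package mentioned here.

```r
# Hypothetical helper: min-max (0-1 range) scaling
range01 <- function(x) (x - min(x)) / (max(x) - min(x))

income <- c(20, 84, 52, 148)   # made-up incomes
range01(income)                # 0.00 0.50 0.25 1.00

# A 0/1 dummy is unchanged by the same transformation
online <- c(0, 1, 1, 0)
range01(online)                # 0 1 1 0
```

This is why dummy variables can be treated as already normalized: their minimum is 0 and their maximum is 1, provided both values occur in the data.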

Specify the success class as 1 (loan acceptance).

b.  Using the original validation set (without the new record from step a), what is the best choice of k according to the accuracy measure, (TP + TN) / (TP + TN + FP + FN)? Call this k*.

c.  Show the confusion matrix for the validation data you used in step b that results from using k*.

d.  Now classify the new record you added in step a using the k* you found in step b. Did the fitted class change between k = 1 and k = k*?

# Answer

In [None]:
library(data.table) # to handle the data in a more convenient manner
library(tidyverse) # for a better workflow and more tools to wrangle and visualize the data
library(BBmisc) # for easy normalization of data
library(class) # for the k-NN classification algorithm
library(fastDummies) # for dummies
options(warn=-1) # suppress warnings

In [None]:
options(repr.matrix.max.rows=20, repr.matrix.max.cols=15) # limit the number of rows and columns printed for tables

In [None]:
datapath <- "~/data_ad454"

In [None]:
mybank <- fread(sprintf("%s/csv/08_01_MyBank.csv", datapath))

In [None]:
mybank %>% str

In [None]:
# add the new record; columns not supplied here (ZIP_Code, Personal_Loan) are filled with NA
mybank2 <- mybank %>%
  add_row(ID = 5001,
          Age = 40,
          Experience = 10,
          Income = 84,
          Family = 2,
          CCAvg = 2,
          Education = 2,
          Mortgage = 0,
          Securities_Account = 0,
          CD_Account = 0,
          Online = 1,
          CreditCard = 1)

In [None]:
# delete unnecessary features
mybank2[, c("ID", "ZIP_Code") := NULL]

In [None]:
# make education a factor
mybank2[, Education := as.factor(Education)]

In [None]:
mybank3 <- copy(mybank2)

In [None]:
# normalize numeric features to the 0-1 range
# turn education into dummies, dropping the first level as the reference
mybank4 <- mybank3 %>%
  mutate_at(c("Age", "Experience", "Income",
              "Family", "CCAvg", "Mortgage"),
            BBmisc::normalize, method = "range") %>%
  fastDummies::dummy_cols("Education", remove_first_dummy = TRUE,
                          remove_selected_columns = TRUE)

In [None]:
mybank4 %>% str

In [None]:
# generate row ids for train, sampling only from the first .N-1 rows
# so that the new 5001st row is never in the train set
set.seed(100)
trainids <- mybank4[, sample((.N-1), floor((.N-1)*0.6))]

In [None]:
mybank_train <- mybank4[trainids] %>% select(-Personal_Loan)

a) Get the class for the new row when k = 1

In [None]:
# the first test set is just the new row
mybank_test1 <- mybank4[.N] %>% select(-Personal_Loan)

In [None]:
mybank_train_labels <- mybank4[trainids, Personal_Loan]

In [None]:
mybank_test_pred1 <- class::knn(train = mybank_train,
                            test = mybank_test1,
                            cl = mybank_train_labels,
                            k = 1)

In [None]:
mybank_test_pred1
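To make explicit what a 1-NN prediction does, the sketch below reproduces it by hand on toy data: compute the Euclidean distance from the new record to every training row and take the label of the closest one. The data and the names `train`, `labels`, `new_row`, `manual`, and `pkg` are assumptions for illustration, not objects from the notebook.

```r
library(class)

# Toy training set with two already-scaled features (hypothetical values)
train  <- data.frame(Income = c(0.10, 0.40, 0.90),
                     CCAvg  = c(0.20, 0.50, 0.80))
labels <- c(0, 0, 1)

# New record to classify; a plain vector is treated by knn() as a single row
new_row <- c(Income = 0.85, CCAvg = 0.70)

# Euclidean distance from the new record to each training row
d <- sqrt(rowSums(sweep(as.matrix(train), 2, new_row)^2))

# 1-NN by hand: the label of the closest training row
manual <- labels[which.min(d)]

# The same prediction via class::knn
pkg <- class::knn(train = train, test = new_row, cl = labels, k = 1)

manual
pkg
```

Because the prediction depends entirely on distances, features on larger scales would dominate the result, which is the reason for the 0-1 normalization.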

b) Get the best k

In [None]:
mybank_test2 <- mybank4[-.N][-trainids] %>% select(-Personal_Loan)

In [None]:
mybank_test_labels <- mybank4[-.N][-trainids, Personal_Loan]

In [None]:
# get classes into a list for all k's in 1:100
classes_l <- lapply(1:100, function(x) class::knn(train = mybank_train,
                            test = mybank_test2,
                            cl = mybank_train_labels,
                            k = x))

In [None]:
# get the accuracies for all k's
accuracies <- sapply(classes_l, function(x) sum(x == mybank_test_labels)/length(x))

In [None]:
plot(accuracies, type = "l")

In [None]:
# get the best k
k_star <- which.max(accuracies)
k_star
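c) Confusion matrix for the validation data with k_star

The notebook does not include a cell for step c. A minimal sketch of one way to do it follows, assuming the objects defined above (`mybank_train`, `mybank_test2`, `mybank_train_labels`, `mybank_test_labels`, `k_star`). The helper `knn_confusion` is hypothetical, and the toy data only demonstrate that it runs on its own.

```r
library(class)

# Hypothetical helper: fit k-NN and cross-tabulate actual vs. predicted classes
knn_confusion <- function(train, test, train_labels, test_labels, k) {
  pred <- class::knn(train = train, test = test, cl = train_labels, k = k)
  table(Actual    = factor(test_labels, levels = c("0", "1")),
        Predicted = factor(pred, levels = c("0", "1")))
}

# In the notebook, the call would presumably be:
# cm <- knn_confusion(mybank_train, mybank_test2,
#                     mybank_train_labels, mybank_test_labels, k_star)

# Toy demonstration with made-up, already-scaled data
toy_train <- data.frame(x = c(0.1, 0.2, 0.8, 0.9))
toy_test  <- data.frame(x = c(0.12, 0.88, 0.30))
cm <- knn_confusion(toy_train, toy_test,
                    train_labels = c(0, 0, 1, 1),
                    test_labels  = c(0, 1, 1),
                    k = 1)
cm

# Accuracy = (TP + TN) / (TP + TN + FP + FN), i.e. the diagonal over the total
accuracy <- sum(diag(cm)) / sum(cm)
accuracy
```

With 1 as the success class, the cell `cm["1", "1"]` holds the true positives and `cm["1", "0"]` the false negatives.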

d) Get the class of the new row with k_star

In [None]:
mybank_test_pred2 <- class::knn(train = mybank_train,
                            test = mybank_test1,
                            cl = mybank_train_labels,
                            k = k_star)

In [None]:
mybank_test_pred2

Since k_star turns out to be 1 with this seed, the same k is used as in step a, so the predicted label is the same. With other seeds, k_star might be 3.