# Troy Quicksall
# K-Nearest-Neighbor for Loan Approval Prediction

## 1. Importing the dataset and ensuring that it loaded properly

In [1]:
import pandas as pd

loan_df = pd.read_csv('Loan_Train.csv')
loan_df.head()

| | Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LP001002 | Male | No | 0 | Graduate | No | 5849 | 0.0 | NaN | 360.0 | 1.0 | Urban | Y |
| 1 | LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | Rural | N |
| 2 | LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | Urban | Y |
| 3 | LP001006 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | Urban | Y |
| 4 | LP001008 | Male | No | 0 | Graduate | No | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | Urban | Y |
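
Beyond looking at `head()`, one way to confirm the load is to check the frame's shape and count the missing values in each column. The sketch below is only an illustration: it reads a hypothetical three-row, five-column excerpt (values copied from the output above) from an in-memory string rather than from `Loan_Train.csv`, and the names `sample_csv` and `sample_df` are not part of the notebook.

```python
import io

import pandas as pd

# Hypothetical excerpt of the loan data, held in memory so the sketch is self-contained.
sample_csv = io.StringIO(
    "Loan_ID,Gender,ApplicantIncome,LoanAmount,Loan_Status\n"
    "LP001002,Male,5849,,Y\n"
    "LP001003,Male,4583,128.0,N\n"
    "LP001005,Male,3000,66.0,Y\n"
)
sample_df = pd.read_csv(sample_csv)

# Shape gives (rows, columns); isna().sum() counts missing values per column.
shape = sample_df.shape
missing_per_column = sample_df.isna().sum()

print(shape)
print(missing_per_column)
```

Running the same two checks on `loan_df` would show how many rows the later `dropna()` call is likely to remove.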


## 2. Preparing the data for modeling by performing the following steps:
- Drop the column “Loan_ID.”
- Drop any rows with missing data.
- Convert the categorical features into dummy variables.

In [2]:
loan_df.dtypes

Loan_ID               object
Gender                object
Married               object
Dependents            object
Education             object
Self_Employed         object
ApplicantIncome        int64
CoapplicantIncome    float64
LoanAmount           float64
Loan_Amount_Term     float64
Credit_History       float64
Property_Area         object
Loan_Status           object
dtype: object

In [3]:
# dropping 'Loan_ID'
loan_df = loan_df.drop('Loan_ID', axis=1)
# Dropping any rows with missing data using dropna
loan_df = loan_df.dropna()
# Converting categorical features to dummy

def convert_to_dummy(df):
    cat_columns = []
    for col in df.columns:
        # if column is object type, but not the target variable
        if df.dtypes[col] == 'O' and col != 'Loan_Status':
            cat_columns.append(col)
    df = pd.get_dummies(df, prefix=cat_columns, columns=cat_columns)
    return df

loan_df = convert_to_dummy(loan_df)

loan_df.head()

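| | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Loan_Status | Gender_Female | Gender_Male | Married_No | Married_Yes | ... | Dependents_1 | Dependents_2 | Dependents_3+ | Education_Graduate | Education_Not Graduate | Self_Employed_No | Self_Employed_Yes | Property_Area_Rural | Property_Area_Semiurban | Property_Area_Urban |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | N | 0 | 1 | 0 | 1 | ... | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| 2 | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | Y | 0 | 1 | 0 | 1 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
| 3 | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | Y | 0 | 1 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 |
| 4 | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | Y | 0 | 1 | 1 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
| 5 | 5417 | 4196.0 | 267.0 | 360.0 | 1.0 | Y | 0 | 1 | 0 | 1 | ... | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |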
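
To make the dummy-variable step concrete, here is a minimal sketch on a hypothetical three-row frame (`toy_df` is not part of the notebook). It passes the categorical columns explicitly instead of detecting them by dtype, and uses `dtype=int` so the result shows 0/1 values like the output above.

```python
import pandas as pd

# Hypothetical toy frame with two categorical features, one numeric feature, and the target.
toy_df = pd.DataFrame({
    "Married": ["No", "Yes", "Yes"],
    "Property_Area": ["Urban", "Rural", "Urban"],
    "ApplicantIncome": [5849, 4583, 3000],
    "Loan_Status": ["Y", "N", "Y"],
})

# Encode only the listed feature columns; the target column is left untouched.
toy_dummies = pd.get_dummies(
    toy_df, columns=["Married", "Property_Area"], dtype=int
)

print(toy_dummies)
```

Each category becomes its own column (`Married_No`, `Married_Yes`, and so on), and exactly one column per original feature is 1 in each row.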


## 3. Splitting the data into a training and test set, with the “Loan_Status” column as the target.

In [4]:
from sklearn.model_selection import train_test_split
x_data = loan_df.drop(['Loan_Status'], axis=1)
target_data = loan_df['Loan_Status']
# splitting the data using sklearn
# using 20% as test size
x_train, x_test, target_train, target_test = train_test_split(x_data,
                                                              target_data, test_size=0.2, random_state=42)
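
The call above splits the rows at random. If the two `Loan_Status` classes are imbalanced, passing `stratify` keeps their proportions roughly equal in the training and test sets. Whether that matters for this dataset is an assumption and was not checked here; the toy labels below are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 rows with a 14/6 class imbalance.
toy_x = pd.DataFrame({"ApplicantIncome": range(1000, 1020)})
toy_target = pd.Series(["Y"] * 14 + ["N"] * 6, name="Loan_Status")

# stratify=toy_target asks for similar class proportions in both splits.
toy_x_train, toy_x_test, toy_y_train, toy_y_test = train_test_split(
    toy_x, toy_target, test_size=0.2, random_state=42, stratify=toy_target
)

print(toy_y_train.value_counts(normalize=True))
print(toy_y_test.value_counts(normalize=True))
```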

## 4. Creating a pipeline with a min-max scaler and a KNN classifier 

In [5]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

mm_scaler = MinMaxScaler()
knn = KNeighborsClassifier()

pipeline = Pipeline([("scaler", mm_scaler), ("classifier", knn)])
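
KNN classifies by distance, so a feature with a large range (such as income) can dominate features with a small range (such as the 0/1 dummy columns) unless everything is put on a common scale. The sketch below shows what the min-max scaler computes, using three hypothetical rows of two features; it compares the scaler's output with the formula (x - min) / (max - min) applied column by column.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical rows: [ApplicantIncome, LoanAmount].
toy_features = np.array([
    [2583.0, 120.0],
    [4583.0, 128.0],
    [6000.0, 141.0],
])

# Scaling done by scikit-learn.
scaled = MinMaxScaler().fit_transform(toy_features)

# The same scaling done by hand, column by column.
col_min = toy_features.min(axis=0)
col_max = toy_features.max(axis=0)
manual = (toy_features - col_min) / (col_max - col_min)

print(scaled)
print(np.allclose(scaled, manual))
```

Placing the scaler inside the pipeline means it is fit only on the training data (or on the training folds during cross-validation) and then applied to the held-out data.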

## 5. Fitting a default KNN classifier to the data with this pipeline

In [6]:
from sklearn.metrics import accuracy_score

# fitting the pipeline model
pipeline.fit(x_train, target_train)


# using the accuracy_score to get the model's accuracy 
predictions = pipeline.predict(x_test)
accuracy = accuracy_score(target_test, predictions)
print('Model accuracy: ', accuracy)

Model accuracy:  0.78125


## 6. Creating a search space for the KNN classifier where the “n_neighbors” parameter varies from 1 to 10

In [7]:
# create a search space with candidate n_neighbors values from 1 to 10
search_space = [{'classifier__n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}]



## 7. Fitting a grid search with pipeline, search space, and 5-fold cross-validation to find the best value for the “n_neighbors” parameter

In [8]:
from sklearn.model_selection import GridSearchCV

# using grid search with our pipeline, search space with k=1-10 and cross validation = 5 fold
grid_search = GridSearchCV(pipeline, search_space, cv=5, verbose=0)
best_model = grid_search.fit(x_train, target_train)

In [9]:
# using the best estimator to find the best 'n_neighbors' value
best_model.best_estimator_.get_params()['classifier__n_neighbors']

9

9 is the best n_neighbors value
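
The single best value hides how close the other candidates were. A fitted `GridSearchCV` object keeps the cross-validated score of every candidate in `cv_results_`, which can be viewed as a table. The sketch below is a stand-in, not a rerun of the notebook: it uses synthetic data from `make_classification` and hypothetical names (`demo_pipeline`, `demo_space`, `demo_search`), so its scores say nothing about the loan data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (assumption: not the loan dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

demo_pipeline = Pipeline([
    ("scaler", MinMaxScaler()),
    ("classifier", KNeighborsClassifier()),
])
demo_space = [{"classifier__n_neighbors": list(range(1, 11))}]

demo_search = GridSearchCV(demo_pipeline, demo_space, cv=5).fit(X_demo, y_demo)

# One row per candidate, sorted so the best-ranked value comes first.
cv_table = pd.DataFrame(demo_search.cv_results_)[
    ["param_classifier__n_neighbors", "mean_test_score",
     "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")

print(demo_search.best_params_)
print(cv_table.head())
```

If several values of `n_neighbors` have mean scores within one standard deviation of each other, the choice among them is not strongly supported by the cross-validation.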

## 8. Finding the accuracy of the grid search best model on the test set

In [10]:

predictions_grid = best_model.predict(x_test)
accuracy = accuracy_score(target_test, predictions_grid)
print('Model accuracy: ', accuracy)

Model accuracy:  0.75


## 9. Repeating steps 6 and 7 with the same pipeline, but expanding the search space to include logistic regression and random forest models with their hyperparameter values

In [11]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import numpy as np

# create a search space that swaps the 'classifier' step for LogisticRegression
# and RandomForestClassifier, with hyperparameter values from 12.3 of the book
search_space_new = [
    {'classifier': [LogisticRegression(max_iter=500, solver='liblinear')],
     'classifier__penalty': ['l1', 'l2'],
     'classifier__C': np.logspace(0, 4, 10)},
    {'classifier': [RandomForestClassifier()],
     'classifier__n_estimators': [10, 100, 1000],
     'classifier__max_features': [1, 2, 3]},
]


grid_search = GridSearchCV(pipeline, search_space_new, cv=5, verbose=0)

In [12]:
# best model hyperparameters
best_model = grid_search.fit(x_train, target_train)
print(best_model.best_estimator_)

Pipeline(steps=[('scaler', MinMaxScaler()),
                ('classifier',
                 LogisticRegression(C=2.7825594022071245, max_iter=500,
                                    solver='liblinear'))])
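
The search space in step 9 lists only logistic regression and random forest, so KNN is not part of that comparison. Because each dictionary in the list replaces the pipeline's `classifier` step with the estimator it names, KNN could be kept in the same search by adding a third dictionary. The sketch below shows the idea on synthetic data with deliberately small grids; the data, the names (`demo_pipeline`, `demo_space`, `demo_search`), and the grid values are all hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (assumption: not the loan dataset).
X_demo, y_demo = make_classification(n_samples=150, n_features=6, random_state=0)

demo_pipeline = Pipeline([
    ("scaler", MinMaxScaler()),
    ("classifier", KNeighborsClassifier()),
])

# Each dictionary swaps in one estimator and lists only that estimator's parameters.
demo_space = [
    {"classifier": [KNeighborsClassifier()],
     "classifier__n_neighbors": [3, 5, 7]},
    {"classifier": [LogisticRegression(max_iter=500, solver="liblinear")],
     "classifier__C": np.logspace(0, 4, 3)},
    {"classifier": [RandomForestClassifier(n_estimators=10, random_state=0)],
     "classifier__max_features": [1, 2]},
]

demo_search = GridSearchCV(demo_pipeline, demo_space, cv=3).fit(X_demo, y_demo)

# 3 + 3 + 2 candidates are evaluated across the three dictionaries.
n_candidates = len(demo_search.cv_results_["params"])
winner = type(demo_search.best_estimator_.named_steps["classifier"]).__name__

print(n_candidates)
print(winner)
```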


In [13]:
# getting best model accuracy
predictions_best = best_model.predict(x_test)
accuracy = accuracy_score(target_test, predictions_best)
print('Model accuracy: ', accuracy)

Model accuracy:  0.8229166666666666


The best model is logistic regression with hyperparameters C=2.7825594022071245, max_iter=500, and solver='liblinear'.

## 11. Summary of Results

In steps 6 through 8 we tuned the n_neighbors parameter of the KNN classifier with a 5-fold grid search, which selected n_neighbors = 9. Tuning did not improve test performance: the tuned model scored 0.75 on the test set, slightly below the 0.78125 of the default KNN model. This suggests that KNN may not be the best algorithm for this data. We then expanded the search space to include logistic regression and random forest models, each with its own hyperparameter candidates. The grid search selected logistic regression with C=2.7825594022071245, max_iter=500, and solver='liblinear', which reached a test accuracy of about 0.823, the highest of the three results.