DSC550 Week 9 <br>
Exercise 9.2 <br>
Best Model Selection and Hyperparameter Tuning <br>
Author Michael Paris <br>
02/08/2022 <br>

1. Import the dataset and ensure that it loaded properly.
1. Prepare the data for modeling by performing the following steps:
    1. Drop the column “Loan_ID.”
    1. Drop any rows with missing data.
    1. Convert the categorical features into dummy variables.
1. Split the data into a training and test set, where the “Loan_Status” column is the target.
1. Create a pipeline with a min-max scaler and a KNN classifier (see section 15.3 in the Machine Learning with Python Cookbook).
1. Fit a default KNN classifier to the data with this pipeline. Report the model accuracy on the test set. Note: Fitting a pipeline model works just like fitting a regular model.
1. Create a search space for your KNN classifier where your “n_neighbors” parameter varies from 1 to 10. (see section 15.3 in the Machine Learning with Python Cookbook).
1. Fit a grid search with your pipeline, search space, and 5-fold cross-validation to find the best value for the “n_neighbors” parameter.
1. Find the accuracy of the grid search best model on the test set. Note: It is possible that this will not be an improvement over the default model, but likely it will be.
1. Now, repeat steps 6 and 7 with the same pipeline, but expand your search space to include logistic regression and random forest models with the hyperparameter values in section 12.3 of the Machine Learning with Python Cookbook.
1. What are the best model and hyperparameters found in the grid search? Find the accuracy of this model on the test set.
1. Summarize your results.

In [1]:
import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

1. Import the dataset and ensure that it loaded properly.

In [2]:
#import the dataset into a dataframe
loan_df = pd.read_csv('Loan_Train.csv')

In [3]:
loan_df.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


In [4]:
loan_df.shape

(614, 13)

2. Prepare the data for modeling by performing the following steps:
    1. Drop the column “Loan_ID.”
    1. Drop any rows with missing data.
    1. Convert the categorical features into dummy variables.

In [5]:
loan_df.drop('Loan_ID', axis=1, inplace=True)
loan_df.head()

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


In [6]:
loan_df.isna().any()

Gender                True
Married               True
Dependents            True
Education            False
Self_Employed         True
ApplicantIncome      False
CoapplicantIncome    False
LoanAmount            True
Loan_Amount_Term      True
Credit_History        True
Property_Area        False
Loan_Status          False
dtype: bool

In [7]:
# drop any rows with missing data

print(loan_df.shape)
loan_df = loan_df.dropna()
print(loan_df.shape)

(614, 12)
(480, 12)


In [8]:
# Convert the categorical features into dummy variables.
# First, look at the data type of each column.

loan_df.dtypes

Gender                object
Married               object
Dependents            object
Education             object
Self_Employed         object
ApplicantIncome        int64
CoapplicantIncome    float64
LoanAmount           float64
Loan_Amount_Term     float64
Credit_History       float64
Property_Area         object
Loan_Status           object
dtype: object

In [9]:
object_column_list = list(loan_df.select_dtypes(include=['object']).columns)
loan_df_dummies = pd.get_dummies(loan_df, columns=object_column_list)
loan_df_dummies.shape

(480, 22)

In [10]:
loan_df_dummies.head()

Unnamed: 0,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Gender_Female,Gender_Male,Married_No,Married_Yes,Dependents_0,...,Dependents_3+,Education_Graduate,Education_Not Graduate,Self_Employed_No,Self_Employed_Yes,Property_Area_Rural,Property_Area_Semiurban,Property_Area_Urban,Loan_Status_N,Loan_Status_Y
1,4583,1508.0,128.0,360.0,1.0,0,1,0,1,0,...,0,1,0,1,0,1,0,0,1,0
2,3000,0.0,66.0,360.0,1.0,0,1,0,1,1,...,0,1,0,0,1,0,0,1,0,1
3,2583,2358.0,120.0,360.0,1.0,0,1,0,1,1,...,0,0,1,1,0,0,0,1,0,1
4,6000,0.0,141.0,360.0,1.0,0,1,1,0,1,...,0,1,0,1,0,0,0,1,0,1
5,5417,4196.0,267.0,360.0,1.0,0,1,0,1,0,...,0,1,0,0,1,0,0,1,0,1


In [11]:
# Drop the Loan_Status_N column, since Loan_Status_Y carries the same information.
# Rename Loan_Status_Y to Loan_Status to avoid confusion later.

loan_df_dummies.drop('Loan_Status_N', axis=1, inplace=True)
loan_df_dummies.rename(columns={'Loan_Status_Y':'Loan_Status'}, inplace=True)
loan_df_dummies.head()

Unnamed: 0,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Gender_Female,Gender_Male,Married_No,Married_Yes,Dependents_0,...,Dependents_2,Dependents_3+,Education_Graduate,Education_Not Graduate,Self_Employed_No,Self_Employed_Yes,Property_Area_Rural,Property_Area_Semiurban,Property_Area_Urban,Loan_Status
1,4583,1508.0,128.0,360.0,1.0,0,1,0,1,0,...,0,0,1,0,1,0,1,0,0,0
2,3000,0.0,66.0,360.0,1.0,0,1,0,1,1,...,0,0,1,0,0,1,0,0,1,1
3,2583,2358.0,120.0,360.0,1.0,0,1,0,1,1,...,0,0,0,1,1,0,0,0,1,1
4,6000,0.0,141.0,360.0,1.0,0,1,1,0,1,...,0,0,1,0,1,0,0,0,1,1
5,5417,4196.0,267.0,360.0,1.0,0,1,0,1,0,...,1,0,1,0,0,1,0,0,1,1
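As an alternative to creating both target dummies and then dropping one, the target can be mapped to 0/1 first so that only the features pass through `pd.get_dummies`. The sketch below uses a small made-up frame; the `toy` data and the variable names are hypothetical and are not rows from `Loan_Train.csv`.

```python
import pandas as pd

# Hypothetical three-row frame that mimics the structure of the loan data.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "LoanAmount": [128.0, 66.0, 120.0],
    "Loan_Status": ["Y", "N", "Y"],
})

# Encode the target directly: Y -> 1, N -> 0.
target = toy["Loan_Status"].map({"Y": 1, "N": 0})

# One-hot encode only the feature columns.
features = pd.get_dummies(toy.drop(columns="Loan_Status"))

print(features.columns.tolist())
print(target.tolist())
```

This gives the same kind of target column as the drop-and-rename approach above, without ever creating `Loan_Status_N`.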


3. Split the data into a training and test set, where the “Loan_Status” column is the target.

In [12]:
# Split the data into a training and test set, where the “Loan_Status” column is the target.

#X
X = loan_df_dummies.drop(['Loan_Status'], axis=1)
#Y
y = loan_df_dummies['Loan_Status']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
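Because the split above does not set `random_state`, a different train/test partition is drawn on every run. A minimal sketch of a reproducible, stratified split is shown below, assuming synthetic stand-in data; `X_demo`, `y_demo`, the seed value 42, and the helper `split` are illustrative choices, not part of the assignment.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 rows, 70 positives (1) and 30 negatives (0).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = np.array([1] * 70 + [0] * 30)


def split(seed):
    """Split with a fixed seed and keep the class ratio in both parts."""
    return train_test_split(
        X_demo, y_demo, test_size=0.20, random_state=seed, stratify=y_demo
    )


_, X_test_a, _, y_test_a = split(42)
_, X_test_b, _, y_test_b = split(42)

# The same seed yields the same test rows on every call.
same_split = np.array_equal(X_test_a, X_test_b)
test_positive_rate = y_test_a.mean()
print(same_split, test_positive_rate)
```

Fixing the seed makes accuracy figures comparable across runs, and `stratify` keeps the share of approved loans similar in the training and test sets.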

4. Create a pipeline with a min-max scaler and a KNN classifier (see section 15.3 in the Machine Learning with Python Cookbook).

In [13]:
# Create a pipeline with a min-max scaler and a KNN classifier
# (see section 15.3 in the Machine Learning with Python Cookbook).
standardizer = preprocessing.MinMaxScaler(feature_range=(0, 1))
knn = KNeighborsClassifier(n_neighbors=5, n_jobs=-1)
pipe = Pipeline([("standardizer", standardizer), ("knn", knn)])

5. Fit a default KNN classifier to the data with this pipeline. Report the model accuracy on the test set. Note: Fitting a pipeline model works just like fitting a regular model.

In [14]:
# Fit a default KNN classifier to the data with this pipeline. Report the model accuracy on the test set. 
# Note: Fitting a pipeline model works just like fitting a regular model.

model = pipe.fit(X_train, y_train)
y_predict = model.predict(X_test)

score = accuracy_score(y_test,y_predict)
print("The model's accuracy is: {:.2f}".format(score))

The model's accuracy is: 0.68
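Since a fitted pipeline behaves like a regular estimator, its `score` method can report test accuracy directly instead of calling `predict` and `accuracy_score` separately. The sketch below checks that the two routes agree, assuming synthetic data from `make_classification` in place of the loan data; all `*_demo` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the loan data.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0
)

# Same structure as the notebook's pipeline: min-max scaling, then default KNN.
pipe_demo = Pipeline([("scaler", MinMaxScaler()), ("knn", KNeighborsClassifier())])
pipe_demo.fit(X_tr, y_tr)

# Route 1: predict, then compare with the true labels.
acc_from_predictions = accuracy_score(y_te, pipe_demo.predict(X_te))
# Route 2: let the pipeline compute mean accuracy itself.
acc_from_score = pipe_demo.score(X_te, y_te)

print(acc_from_predictions, acc_from_score)
```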


6. Create a search space for your KNN classifier where your “n_neighbors” parameter varies from 1 to 10. (see section 15.3 in the Machine Learning with Python Cookbook).

7. Fit a grid search with your pipeline, search space, and 5-fold cross-validation to find the best value for the “n_neighbors” parameter.

8. Find the accuracy of the grid search best model on the test set. Note: It is possible that this will not be an improvement over the default model, but likely it will be.

In [15]:
# Create a search space for KNN with n_neighbors from 1 to 10.
# Run a grid search with 5-fold cross-validation to find the best K,
# then read knn__n_neighbors from the best estimator to get that K.

search_space = [{'knn__n_neighbors': [1,2,3,4,5,6,7,8,9,10]}]

classifier = GridSearchCV(pipe, search_space, cv=5, verbose=0).fit(X_train, y_train)
K = classifier.best_estimator_.get_params()["knn__n_neighbors"]
print('The best K is : {}'.format(K))

y_predict = classifier.predict(X_test)

score = accuracy_score(y_test,y_predict)
print("The model's accuracy is: {:.2f}".format(score))

The best K is : 10
The model's accuracy is: 0.71
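To see how close the candidate values of K were, the fitted search object exposes `best_params_`, `best_score_`, and `cv_results_`. The sketch below prints the mean cross-validated accuracy for each K, assuming synthetic data in place of the loan data; the `*_demo` names are hypothetical. Note that `best_score_` is a cross-validation average on the training data, not the test-set accuracy reported above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the training data.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

pipe_demo = Pipeline([("scaler", MinMaxScaler()), ("knn", KNeighborsClassifier())])
search_demo = GridSearchCV(
    pipe_demo, {"knn__n_neighbors": list(range(1, 11))}, cv=5
)
search_demo.fit(X_demo, y_demo)

# Best K and the per-candidate mean accuracy across the 5 folds.
best_k = search_demo.best_params_["knn__n_neighbors"]
tried_k = search_demo.cv_results_["param_knn__n_neighbors"]
mean_cv_accuracy = search_demo.cv_results_["mean_test_score"]

for k, acc in zip(tried_k, mean_cv_accuracy):
    print("K={:>2}  mean CV accuracy={:.3f}".format(k, acc))
print("Best K:", best_k)
```

If several values of K score within a few thousandths of each other, the selected K is likely to change from one random split to the next.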


9. Now, repeat steps 6 and 7 with the same pipeline, but expand your search space to include logistic regression and random forest models with the hyperparameter values in section 12.3 of the Machine Learning with Python Cookbook.

In [16]:
# set up the pipeline

pipe = Pipeline([('standardizer', standardizer),('classifier', KNeighborsClassifier())])

In [17]:
# create the search space with KNN, Logistic regression and random forest

search_space = [
    {'classifier': [KNeighborsClassifier()],
     'classifier__n_neighbors': [1,2,3,4,5,6,7,8,9,10]},
    {'classifier': [LogisticRegression(solver='liblinear')],
     'classifier__penalty': ['l1','l2'],
     'classifier__C': np.logspace(0,4,10)},
    {'classifier': [RandomForestClassifier()],
     'classifier__n_estimators': [10,100,1000],
     'classifier__max_features': [1,2,3]}]

In [18]:
# set up the gridsearch and find the best model 

gridsearch = GridSearchCV(pipe, search_space, cv=5, verbose=0)
best_model = gridsearch.fit(X_train, y_train)


In [19]:
# run a prediction with the best model and write out the accuracy
y_predict = best_model.predict(X_test)

score = accuracy_score(y_test,y_predict)
print("The model's accuracy is: {:.2f}".format(score))

The model's accuracy is: 0.79


10. What are the best model and hyperparameters found in the grid search? Find the accuracy of this model on the test set.

In [20]:
# look at the best model and read out the parameters

best_model.best_estimator_.get_params()["classifier"]

LogisticRegression(penalty='l1', solver='liblinear')
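The printed estimator only lists parameters that differ from their defaults, so some of the selected hyperparameters may not appear in it. Reading `best_params_` lists every searched parameter explicitly. The sketch below assumes synthetic data and a deliberately reduced search space (fewer values per parameter, and no penalty search) so that it runs quickly; it is not the grid used above, and the `*_demo` and `small_space` names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the training data.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

pipe_demo = Pipeline(
    [("scaler", MinMaxScaler()), ("classifier", KNeighborsClassifier())]
)

# Reduced illustration grid: one dictionary per candidate model family.
small_space = [
    {"classifier": [KNeighborsClassifier()],
     "classifier__n_neighbors": [1, 5, 10]},
    {"classifier": [LogisticRegression(solver="liblinear")],
     "classifier__C": np.logspace(0, 4, 3)},
    {"classifier": [RandomForestClassifier(random_state=0)],
     "classifier__n_estimators": [10],
     "classifier__max_features": [1, 2]},
]

search_demo = GridSearchCV(pipe_demo, small_space, cv=5).fit(X_demo, y_demo)

# best_params_ holds the winning estimator and each of its searched settings.
best_params = search_demo.best_params_
best_name = type(best_params["classifier"]).__name__
print(best_name)
for name, value in best_params.items():
    if name != "classifier":
        print(name, "=", value)
```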

11. Summarize your results.

- Test accuracy rose from 68% for the default KNN pipeline to 71% for the tuned KNN (K = 10) and to 79% for the best model from the expanded grid search, which I thought was a pretty impressive improvement.
- The best model was a logistic regression model with an L1 penalty and the liblinear solver.
- I noticed that these numbers changed each time I ran the code from the beginning. I suspect this is because `train_test_split` is called without a fixed `random_state`, so the training and test sets are divided differently on every run.