## Loan Approval Analysis

In this exercise, you will work with the Loan_Train.csv dataset which can be downloaded from this link: Loan Approval Data Set. 
1. Import the dataset and ensure that it loaded properly.  
2. Prepare the data for modeling by performing the following steps:  
* Drop the column “Load_ID.”  
* Drop any rows with missing data.  
* Convert the categorical features into dummy variables.  
3. Split the data into a training and test set, where the “Loan_Status” column is the target.  
4. Create a pipeline with a min-max scaler and a KNN classifier (see section 15.3 in the Machine Learning with Python Cookbook).  
5. Fit a default KNN classifier to the data with this pipeline. Report the model accuracy on the test set. Note: Fitting a pipeline model works just like fitting a regular model.  
6. Create a search space for your KNN classifier where your “n_neighbors” parameter varies from 1 to 10. (see section 15.3 in the Machine Learning with Python Cookbook).  
7. Fit a grid search with your pipeline, search space, and 5-fold cross-validation to find the best value for the “n_neighbors” parameter.  
8. Find the accuracy of the grid search best model on the test set. Note: It is possible that this will not be an improvement over the default model, but likely it will be.  
9. Now, repeat steps 6 and 7 with the same pipeline, but expand your search space to include logistic regression and random forest models with the hyperparameter values in section 12.3 of the Machine Learning with Python Cookbook.  
10. What are the best model and hyperparameters found in the grid search? Find the accuracy of this model on the test set.  
11. Summarize your results.  

In [1]:
# import libraries
import numpy as np
import pandas as pd

In [2]:
# check versions of packages
print('numpy version:', np.__version__)
print('pandas version:', pd.__version__)

numpy version: 1.19.2
pandas version: 1.1.3


### 1. Import the dataset and ensure that it loaded properly.

In [3]:
# import data as pandas dataframe
df_loan_all = pd.read_csv('Loan_Train.csv')

In [4]:
df_loan_all.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


### Prepare the data for modeling by performing the following steps:
Drop the column “Load_ID.”  
Drop any rows with missing data.  
Convert the categorical features into dummy variables.  

In [5]:
# Remove Loan_ID column
df_loan_all.drop('Loan_ID', inplace=True, axis=1)

In [6]:
df_loan_all.shape

(614, 12)

In [7]:
# Drop any rows with missing data
df_loan_all.dropna(inplace=True)

In [8]:
df_loan_all.shape

(480, 12)

In [9]:
# Convert the categorical values into dummy variables
df_loan_all.dtypes

Gender                object
Married               object
Dependents            object
Education             object
Self_Employed         object
ApplicantIncome        int64
CoapplicantIncome    float64
LoanAmount           float64
Loan_Amount_Term     float64
Credit_History       float64
Property_Area         object
Loan_Status           object
dtype: object

In [10]:
# Create dummy variables for independent variable Gender
dummy_gender = pd.get_dummies(df_loan_all['Gender'], prefix='gen')
dummy_gender.head()

Unnamed: 0,gen_Female,gen_Male
1,0,1
2,0,1
3,0,1
4,0,1
5,0,1


In [11]:
# Creating v1 to populate the dummy variables and add the gender dummy variables to the dataframe
df_loan_v1 = pd.concat([df_loan_all, dummy_gender], axis=1)

In [12]:
df_loan_v1.shape

(480, 14)

In [13]:
# Create dummy variables for independent variable Married
dummy_married = pd.get_dummies(df_loan_v1['Married'], prefix='married')
dummy_married.head()

Unnamed: 0,married_No,married_Yes
1,0,1
2,0,1
3,0,1
4,1,0
5,0,1


In [14]:
# populate v1 with the dummy variables and add the Married dummy variables to the dataframe
df_loan_v1 = pd.concat([df_loan_v1, dummy_married], axis=1)

In [15]:
df_loan_v1.shape

(480, 16)

In [16]:
# Create dummy variables for independent variable Dependents
dummy_depd = pd.get_dummies(df_loan_v1['Dependents'], prefix='depd')
dummy_depd.head()

Unnamed: 0,depd_0,depd_1,depd_2,depd_3+
1,0,1,0,0
2,1,0,0,0
3,1,0,0,0
4,1,0,0,0
5,0,0,1,0


In [17]:
# populate v1 with the dummy variables and add the Dependent dummy variables to the dataframe
df_loan_v1 = pd.concat([df_loan_v1, dummy_depd], axis=1)

In [18]:
df_loan_v1.shape

(480, 20)

In [19]:
# Create dummy variables for independent variable Education
dummy_edu = pd.get_dummies(df_loan_v1['Education'], prefix='edu')
dummy_edu.head()

Unnamed: 0,edu_Graduate,edu_Not Graduate
1,1,0
2,1,0
3,0,1
4,1,0
5,1,0


In [20]:
# populate v1 with the dummy variables and add the Education dummy variables to the dataframe
df_loan_v1 = pd.concat([df_loan_v1, dummy_edu], axis=1)

In [21]:
df_loan_v1.shape

(480, 22)

In [22]:
# Create dummy variables for independent variable Self_Employed
dummy_se = pd.get_dummies(df_loan_v1['Self_Employed'], prefix='se')
dummy_se.head()

Unnamed: 0,se_No,se_Yes
1,1,0
2,0,1
3,1,0
4,1,0
5,0,1


In [23]:
# populate v1 with the dummy variables and add the Self_Employed dummy variables to the dataframe
df_loan_v1 = pd.concat([df_loan_v1, dummy_se], axis=1)

In [24]:
df_loan_v1.shape

(480, 24)

In [25]:
# Create dummy variables for independent variable Loan_Amount_Term
dummy_lat = pd.get_dummies(df_loan_v1['Loan_Amount_Term'], prefix='laterm')
dummy_lat.head()

Unnamed: 0,laterm_36.0,laterm_60.0,laterm_84.0,laterm_120.0,laterm_180.0,laterm_240.0,laterm_300.0,laterm_360.0,laterm_480.0
1,0,0,0,0,0,0,0,1,0
2,0,0,0,0,0,0,0,1,0
3,0,0,0,0,0,0,0,1,0
4,0,0,0,0,0,0,0,1,0
5,0,0,0,0,0,0,0,1,0


In [26]:
# populate v1 with the dummy variables and add the Loan_Amount_Term dummy variables to the dataframe
df_loan_v1 = pd.concat([df_loan_v1, dummy_lat], axis=1)

In [27]:
df_loan_v1.shape

(480, 33)

In [28]:
# Create dummy variables for independent variable Credit_History
dummy_ch = pd.get_dummies(df_loan_v1['Credit_History'], prefix='chist')
dummy_ch.head()

Unnamed: 0,chist_0.0,chist_1.0
1,0,1
2,0,1
3,0,1
4,0,1
5,0,1


In [29]:
# populate v1 with the dummy variables and add the Credit_History dummy variables to the dataframe
df_loan_v1 = pd.concat([df_loan_v1, dummy_ch], axis=1)

In [30]:
df_loan_v1.shape

(480, 35)

In [31]:
# Create dummy variables for independent variable Property_Area
dummy_pa = pd.get_dummies(df_loan_v1['Property_Area'], prefix='parea')
dummy_pa.head()

Unnamed: 0,parea_Rural,parea_Semiurban,parea_Urban
1,1,0,0
2,0,0,1
3,0,0,1
4,0,0,1
5,0,0,1


In [32]:
# populate v1 with the dummy variables and add the Property_Area dummy variables to the dataframe
df_loan_v1 = pd.concat([df_loan_v1, dummy_pa], axis=1)

In [33]:
df_loan_v1.shape

(480, 38)

In [34]:
df_loan_v1.head()

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,...,laterm_180.0,laterm_240.0,laterm_300.0,laterm_360.0,laterm_480.0,chist_0.0,chist_1.0,parea_Rural,parea_Semiurban,parea_Urban
1,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,...,0,0,0,1,0,0,1,1,0,0
2,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,...,0,0,0,1,0,0,1,0,0,1
3,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,...,0,0,0,1,0,0,1,0,0,1
4,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,...,0,0,0,1,0,0,1,0,0,1
5,Male,Yes,2,Graduate,Yes,5417,4196.0,267.0,360.0,1.0,...,0,0,0,1,0,0,1,0,0,1


In [35]:
# created new df v2 with only the last 27 columns which includes Loan_Status and all dummy variables
df_loan_v2 = df_loan_v1.iloc[:, -27:]

In [36]:
# changing Loan_Status to a binary value
df_loan_v2['Loan_Status'] = (df_loan_v2['Loan_Status'] == 'Y').astype(int)

In [37]:
df_loan_v2

Unnamed: 0,Loan_Status,gen_Female,gen_Male,married_No,married_Yes,depd_0,depd_1,depd_2,depd_3+,edu_Graduate,...,laterm_180.0,laterm_240.0,laterm_300.0,laterm_360.0,laterm_480.0,chist_0.0,chist_1.0,parea_Rural,parea_Semiurban,parea_Urban
1,0,0,1,0,1,0,1,0,0,1,...,0,0,0,1,0,0,1,1,0,0
2,1,0,1,0,1,1,0,0,0,1,...,0,0,0,1,0,0,1,0,0,1
3,1,0,1,0,1,1,0,0,0,0,...,0,0,0,1,0,0,1,0,0,1
4,1,0,1,1,0,1,0,0,0,1,...,0,0,0,1,0,0,1,0,0,1
5,1,0,1,0,1,0,0,1,0,1,...,0,0,0,1,0,0,1,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
609,1,1,0,1,0,1,0,0,0,1,...,0,0,0,1,0,0,1,1,0,0
610,1,0,1,0,1,0,0,0,1,1,...,1,0,0,0,0,0,1,1,0,0
611,1,0,1,0,1,0,1,0,0,1,...,0,0,0,1,0,0,1,0,0,1
612,1,0,1,0,1,0,0,1,0,1,...,0,0,0,1,0,0,1,0,0,1


### 3. Split the data into a training and test set, where the “Loan_Status” column is the target.

In [38]:
# Creating X and y
X = df_loan_v2.loc[:, df_loan_v2.columns != 'Loan_Status']
y = df_loan_v2['Loan_Status']

In [39]:
# Splitting the variables as training and testing
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8, 
                                                    test_size = 0.2, random_state = 0)

In [40]:
X_train.to_csv('test1.csv')

### 4. Create a pipeline with a min-max scaler and a KNN classifier (see section 15.3 in the Machine Learning with Python Cookbook).

In [41]:
# Load libraries for min-max scaler and KNN classifer
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn import datasets
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import GridSearchCV

In [42]:
# Apply min-max scaler
mms = MinMaxScaler()
X_train_mms = mms.fit_transform(X_train)

In [43]:
# Create a KNN classifer
knn = KNeighborsClassifier(n_neighbors=5, n_jobs=-1)

In [44]:
# Create a pipline
pipe = Pipeline([("minmax_scaler", mms), ("knn", knn)])

### 5. Fit a default KNN classifier to the data with this pipeline. Report the model accuracy on the test set. Note: Fitting a pipeline model works just like fitting a regular model.

In [45]:
# Fitting the KNN classifier for the train and test data
knn_fit_train = pipe.fit(X_train, y_train)
knn_fit_test = pipe.fit(X_test, y_test)

In [46]:
# create the prediction on X_test
pipe_predict = pipe.predict(X_test)

In [47]:
from sklearn import metrics #Import scikit-learn metrics module for accuracy calculation

In [48]:
# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, pipe_predict))

Accuracy: 0.75


### 6. Create a search space for your KNN classifier where your “n_neighbors” parameter varies from 1 to 10. (see section 15.3 in the Machine Learning with Python Cookbook).

In [49]:
# Create space of candidate values
search_space = [{"knn__n_neighbors": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}]

### 7. Fit a grid search with your pipeline, search space, and 5-fold cross-validation to find the best value for the “n_neighbors” parameter.

In [50]:
# Create grid search
classifer = GridSearchCV(
    pipe, search_space, cv=5, verbose=0).fit(X_train_mms, y_train)

In [51]:
# Fitting the grid search for the train and test data
grid_fit_train = classifer.fit(X_train, y_train)
grid_fit_test = classifer.fit(X_test, y_test)

In [52]:
# create the prediction on X_test
grid_predict = classifer.predict(X_test)

### 8. Find the accuracy of the grid search best model on the test set. Note: It is possible that this will not be an improvement over the default model, but likely it will be.

In [53]:
# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, grid_predict))

Accuracy: 0.7708333333333334


### 9. Now, repeat steps 6 and 7 with the same pipeline, but expand your search space to include logistic regression and random forest models with the hyperparameter values in section 12.3 of the Machine Learning with Python Cookbook.

In [54]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

In [55]:
# Create new pipe for second search
pipe_2 = Pipeline([("minmax_scaler", mms), ("classifier", RandomForestClassifier)])

In [58]:
# Create search space to include logistic regression and random forest models with the hyperparameter values
search_space_2 = [{"classifier": [LogisticRegression(solver='liblinear')], 
                 "classifier__penalty": ['l1','l2'],
                 "classifier__C": np.logspace(0, 4, 10),
                 "classifier__max_iter": [100]},
                {"classifier": [RandomForestClassifier()],
                 "classifier__n_estimators": [10, 100, 1000],
                 "classifier__max_features": [1, 2, 3]},
                 {"classifier": [KNeighborsClassifier()],
                 "classifier__n_neighbors": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}]

In [59]:
# Create grid search
classifier_2 = GridSearchCV(pipe_2, search_space_2, cv=5, verbose=0)

In [60]:
# Fit grid search
best_model = classifier_2.fit(X_train, y_train)

### 10. What are the best model and hyperparameters found in the grid search? Find the accuracy of this model on the test set.

In [61]:
# View best model
best_model.best_estimator_.get_params()["classifier"]

LogisticRegression(penalty='l1', solver='liblinear')

In [71]:
# View best hyperparameters
print("Best Estimator: " + str(classifier_2.best_estimator_))
print("Best Score: " + str(classifier_2.best_score_))
print("Best Parameters: " + str(classifier_2.best_params_))

Best Estimator: Pipeline(steps=[('minmax_scaler', MinMaxScaler()),
                ('classifier',
                 LogisticRegression(penalty='l1', solver='liblinear'))])
Best Score: 0.8204032809295967
Best Parameters: {'classifier': LogisticRegression(penalty='l1', solver='liblinear'), 'classifier__C': 1.0, 'classifier__max_iter': 100, 'classifier__penalty': 'l1'}


In [72]:
# Fit grid search to test set
best_model_test = classifier_2.fit(X_test, y_test)

In [74]:
# create the prediction on X_test - classifier_2
grid_predict_2 = best_model_test.predict(X_test)

In [75]:
# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, grid_predict_2))

Accuracy: 0.78125


### 11. Summarize your results

This assignment was reviewing factors that could have the potential to determine a persons loan approval. The goal was to build a model on the existing data to help predict loan application determination for future processing. Once the data was munged and columns that were not needed removed, dummy variables were created to represent each of the categories to impact the target loan status.  
We created a pipline to reference difference processes such as a min max scaler and knn classifer to evaluate the raw data. By using the KNN evaluation we are trying to find a balance of the nearest neighbors to look for the best fit.  
Next within our KNN classifier we built a set of observations 1-10 to find the best values within these parameters.  
Then we determined the accuracy of that model - 75%  
Within the same functionality I created a dictionary of learning algorithms and added hyperparameters to look for the best model without data. It was determined logistic regression provided the best model. The score was 82%. When the test data was modeled and accuracy determined, the score was 78%.