## **Data Partition**

The dataset is split into training and testing sets in a 4:1 ratio (80% / 20%). The training set is used to fit the models; the testing set is held out for prediction and for evaluating model performance.

In [20]:
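set.seed(5496)  # fix the random seed so the split is reproducible
# createDataPartition() (caret) samples within each level of Loan_Status,
# so both sets keep roughly the same Y/N proportions
ind <- createDataPartition(loan$Loan_Status, p = 0.8, list = FALSE)
training <- loan[ind, ]
testing <- loan[-ind, ]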
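To illustrate what a stratified 80/20 split does, here is a minimal base-R sketch on a made-up data frame. The names `toy`, `status`, `train` and `test` are hypothetical and not part of the notebook, and the sampling is a simplified stand-in for what `createDataPartition()` does, not its actual implementation.

```r
# Toy data: 30 rejected ("N") and 70 approved ("Y") applications (made-up numbers)
set.seed(5496)
toy <- data.frame(
  id     = 1:100,
  status = factor(rep(c("N", "Y"), times = c(30, 70)))
)

# Row indices grouped by class
idx_by_class <- split(seq_len(nrow(toy)), toy$status)

# Sample 80% of the rows within each class
# (assumes every class has more than one row, so sample() draws from the index vector)
train_idx <- unlist(
  lapply(idx_by_class, function(i) sample(i, size = round(0.8 * length(i)))),
  use.names = FALSE
)

train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Class proportions are preserved in both parts
prop.table(table(train$status))
prop.table(table(test$status))
```

Because the sampling is done per class, the share of "Y" and "N" in each part matches the full data, which matters when one class is much rarer than the other.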

The structure of the training and testing datasets is shown below:

In [21]:
str(training)

'data.frame':	492 obs. of  10 variables:
 $ Gender        : num  1 1 1 1 1 1 1 1 1 1 ...
 $ Married       : num  0 1 1 1 1 1 1 1 1 1 ...
 $ Dependents    : chr  "0" "1" "0" "0" ...
 $ Education     : num  1 1 0 0 1 1 1 1 1 1 ...
 $ Self_Employed : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Credit_History: num  1 1 1 1 0 1 1 1 1 1 ...
 $ Property_Area : Factor w/ 3 levels "Rural","Semiurban",..: 3 1 3 3 2 3 2 3 3 3 ...
 $ Loan_Status   : Factor w/ 2 levels "N","Y": 2 1 2 2 1 2 1 2 2 2 ...
 $ EMI           : num  -1.034 -1.034 -1.099 -1.332 -0.824 ...
 $ TotalIncome   : num  8.67 8.71 8.51 8.26 8.62 ...


In [22]:
str(testing)

'data.frame':	122 obs. of  10 variables:
 $ Gender        : num  1 1 1 1 1 1 1 1 0 1 ...
 $ Married       : num  1 0 1 1 1 1 1 1 0 0 ...
 $ Dependents    : chr  "0" "0" "2" "2" ...
 $ Education     : num  1 1 1 1 0 1 0 0 1 1 ...
 $ Self_Employed : num  1 0 1 0 0 0 0 0 0 0 ...
 $ Credit_History: num  1 1 1 1 1 1 0 1 1 1 ...
 $ Property_Area : Factor w/ 3 levels "Rural","Semiurban",..: 3 3 3 3 1 3 1 3 2 3 ...
 $ Loan_Status   : Factor w/ 2 levels "N","Y": 2 2 2 2 1 2 1 2 2 1 ...
 $ EMI           : num  -1.696 -0.937 -0.299 -1.954 -0.996 ...
 $ TotalIncome   : num  8.01 8.7 9.17 7.78 8.49 ...


## **Model Construction**

First, a **logistic regression model** is constructed by taking all the variables into consideration.

In [23]:
## Implementing Logistic Regression  Model
model1=glm(Loan_Status~.,data=training, family="binomial")
summary(model1)


Call:
glm(formula = Loan_Status ~ ., family = "binomial", data = training)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.1609  -0.4144   0.5537   0.7157   2.3174  

Coefficients:
                        Estimate Std. Error z value Pr(>|z|)    
(Intercept)            -4.840444   2.475478  -1.955   0.0505 .  
Gender                  0.054550   0.327580   0.167   0.8677    
Married                 0.605783   0.281091   2.155   0.0312 *  
Dependents1            -0.555804   0.316878  -1.754   0.0794 .  
Dependents2             0.253725   0.379690   0.668   0.5040    
Dependents3            -0.005697   0.475764  -0.012   0.9904    
Education               0.430510   0.287127   1.499   0.1338    
Self_Employed          -0.135047   0.340066  -0.397   0.6913    
Credit_History          3.789530   0.459685   8.244   <2e-16 ***
Property_AreaSemiurban  0.810823   0.291444   2.782   0.0054 ** 
Property_AreaUrban      0.171976   0.290282   0.592   0.5536    
EMI            

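The summary above reports the significance of each variable. Among the coefficients shown, Credit History (p < 2e-16), Property Area (Semiurban, p = 0.0054) and Marital status (p = 0.0312) are significant at the 5% level. Education (p = 0.1338) and Dependents (p = 0.0794 for one dependent) do not reach that level, but they are retained as candidates in the reduced models that follow.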

Different models are constructed on the training data by removing the variables that are not significant using the **Backward Elimination Technique**.
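As a point of comparison with the manual elimination used in this notebook, base R's `step()` can perform backward elimination automatically, using AIC rather than p-values as the criterion. The sketch below runs on simulated data; the data frame `sim` and the variables `x1`, `x2`, `x3`, `y` are assumptions made for illustration and are unrelated to the loan data.

```r
# Simulated data: only x1 truly influences the binary outcome y
set.seed(1)
n   <- 300
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.5 * sim$x1))

# Full logistic regression model with all candidate predictors
full <- glm(y ~ x1 + x2 + x3, data = sim, family = binomial)

# Backward elimination: drop one term at a time while AIC does not get worse
reduced <- step(full, direction = "backward", trace = 0)

formula(reduced)  # terms kept after elimination
AIC(full)
AIC(reduced)
```

`step()` never returns a model with a higher AIC than its starting model, so it is a convenient cross-check on a hand-built sequence of reduced models.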

In [24]:
model2=glm(Loan_Status~Credit_History+Property_Area+Married+Education+Dependents,data=training, family="binomial")
summary(model2)


Call:
glm(formula = Loan_Status ~ Credit_History + Property_Area + 
    Married + Education + Dependents, family = "binomial", data = training)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.2214  -0.4093   0.5634   0.6998   2.3272  

Coefficients:
                       Estimate Std. Error z value Pr(>|z|)    
(Intercept)            -3.45182    0.55505  -6.219 5.01e-10 ***
Credit_History          3.78110    0.45827   8.251  < 2e-16 ***
Property_AreaSemiurban  0.80384    0.28990   2.773  0.00556 ** 
Property_AreaUrban      0.18554    0.28836   0.643  0.51995    
Married                 0.60555    0.26249   2.307  0.02106 *  
Education               0.42039    0.27630   1.521  0.12814    
Dependents1            -0.59659    0.31368  -1.902  0.05718 .  
Dependents2             0.21954    0.37799   0.581  0.56136    
Dependents3            -0.07299    0.46888  -0.156  0.87630    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parame

In [25]:
model3=glm(Loan_Status~Credit_History+Property_Area+Married+Dependents,data=training, family="binomial")
summary(model3)


Call:
glm(formula = Loan_Status ~ Credit_History + Property_Area + 
    Married + Dependents, family = "binomial", data = training)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.1741  -0.4304   0.5897   0.7068   2.2767  

Coefficients:
                       Estimate Std. Error z value Pr(>|z|)    
(Intercept)             -3.1340     0.5101  -6.144 8.04e-10 ***
Credit_History           3.7737     0.4569   8.259  < 2e-16 ***
Property_AreaSemiurban   0.8231     0.2895   2.843  0.00447 ** 
Property_AreaUrban       0.2199     0.2863   0.768  0.44246    
Married                  0.6200     0.2610   2.376  0.01751 *  
Dependents1             -0.6102     0.3121  -1.955  0.05055 .  
Dependents2              0.1817     0.3765   0.482  0.62948    
Dependents3             -0.1241     0.4674  -0.266  0.79061    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 611.54  on 491  d

In [26]:
model4=glm(Loan_Status~Credit_History+Property_Area+Married+Education,data=training, family="binomial")
summary(model4)


Call:
glm(formula = Loan_Status ~ Credit_History + Property_Area + 
    Married + Education, family = "binomial", data = training)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.0865  -0.4096   0.5949   0.6925   2.2449  

Coefficients:
                       Estimate Std. Error z value Pr(>|z|)    
(Intercept)             -3.4500     0.5435  -6.348 2.18e-10 ***
Credit_History           3.7417     0.4519   8.280  < 2e-16 ***
Property_AreaSemiurban   0.7506     0.2864   2.621  0.00877 ** 
Property_AreaUrban       0.1217     0.2841   0.428  0.66841    
Married                  0.5999     0.2375   2.526  0.01154 *  
Education                0.4142     0.2730   1.517  0.12928    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 611.54  on 491  degrees of freedom
Residual deviance: 463.76  on 486  degrees of freedom
AIC: 475.76

Number of Fisher Scoring iterations: 5


In [27]:
model5=glm(Loan_Status~Credit_History+Property_Area+Married,data=training, family="binomial")
summary(model5)


Call:
glm(formula = Loan_Status ~ Credit_History + Property_Area + 
    Married, family = "binomial", data = training)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.0466  -0.4189   0.5127   0.7278   2.2902  

Coefficients:
                       Estimate Std. Error z value Pr(>|z|)    
(Intercept)             -3.1464     0.5020  -6.267 3.68e-10 ***
Credit_History           3.7404     0.4513   8.288  < 2e-16 ***
Property_AreaSemiurban   0.7694     0.2857   2.693  0.00708 ** 
Property_AreaUrban       0.1579     0.2821   0.560  0.57562    
Married                  0.5994     0.2371   2.528  0.01146 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 611.54  on 491  degrees of freedom
Residual deviance: 466.00  on 487  degrees of freedom
AIC: 476

Number of Fisher Scoring iterations: 5


In [28]:
model6=glm(Loan_Status~Credit_History+Property_Area,data=training, family="binomial")
summary(model6)


Call:
glm(formula = Loan_Status ~ Credit_History + Property_Area, family = "binomial", 
    data = training)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.9550  -0.3775   0.5658   0.7518   2.3749  

Coefficients:
                       Estimate Std. Error z value Pr(>|z|)    
(Intercept)             -2.7587     0.4727  -5.836 5.33e-09 ***
Credit_History           3.7244     0.4493   8.289  < 2e-16 ***
Property_AreaSemiurban   0.7853     0.2840   2.765   0.0057 ** 
Property_AreaUrban       0.1533     0.2800   0.548   0.5840    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 611.54  on 491  degrees of freedom
Residual deviance: 472.34  on 488  degrees of freedom
AIC: 480.34

Number of Fisher Scoring iterations: 5


The reduced models above were obtained from the full model by successively dropping variables, and their Akaike Information Criterion (AIC) values are compared next. Among the candidate models, a lower AIC indicates a better trade-off between goodness of fit and model complexity.

In [29]:
AIC(model1)
AIC(model2)
AIC(model3)
AIC(model4)
AIC(model5)
AIC(model6)

Comparing the AIC values, models 4 and 5 have lower AIC values than the other models (475.76 and 476.00 in the summaries above, compared with 480.34 for model 6, for example).
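For a logistic regression on individual binary outcomes, the residual deviance equals minus twice the log-likelihood, so AIC can be recovered as residual deviance plus twice the number of estimated coefficients. The sketch below checks this against the summaries of models 4 to 6 printed above; the data frame `fits` and its column names are hypothetical helpers introduced only for this illustration.

```r
# Residual deviance and number of coefficients, copied from the summaries above
# (number of coefficients = 492 observations - residual degrees of freedom)
fits <- data.frame(
  model             = c("model4", "model5", "model6"),
  residual_deviance = c(463.76, 466.00, 472.34),
  n_params          = c(6, 5, 4)
)

# AIC = -2 * logLik + 2 * k, with -2 * logLik equal to the residual deviance here
fits$aic <- fits$residual_deviance + 2 * fits$n_params
fits

# Model with the smallest AIC among these three
fits$model[which.min(fits$aic)]
```

The computed values agree with the `AIC:` lines in the three summaries, and they show how close models 4 and 5 are: the difference is about a quarter of a unit.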

### Likelihood Ratio tests on models

Likelihood ratio tests are conducted to check whether the variables dropped from a larger model contribute significantly to the fit, and so to determine the best-fitting model.

Null hypothesis: the smaller (nested) model fits the data as well as the larger model.

If the p-value is less than the chosen significance level, we reject the null hypothesis and keep the larger model; otherwise the smaller model is preferred.

In [30]:
lrtest(model3,model4)
lrtest(model4,model5)
lrtest(model5,model6)

model3 vs model4:

  #Df    LogLik Df    Chisq Pr(>Chisq)
1   8 -230.5993
2   6 -231.8783 -2 2.557935  0.2783246


model4 vs model5:

  #Df    LogLik Df    Chisq Pr(>Chisq)
1   6 -231.8783
2   5 -233.0013 -1 2.246163  0.1339462


model5 vs model6:

  #Df    LogLik Df    Chisq Pr(>Chisq)
1   5 -233.0013
2   4 -236.1687 -1  6.33478 0.01183928
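The test statistics above can be reproduced by hand from the reported log-likelihoods: the statistic is twice the difference in log-likelihood, compared against a chi-squared distribution whose degrees of freedom equal the number of dropped coefficients. The helper `lr_test()` below is a hypothetical function written for this sketch, not part of the `lmtest` package.

```r
# Likelihood ratio test from two log-likelihoods and their parameter counts
lr_test <- function(loglik_large, loglik_small, df_large, df_small) {
  chisq <- 2 * (loglik_large - loglik_small)        # test statistic
  df    <- df_large - df_small                      # number of dropped coefficients
  p     <- pchisq(chisq, df = df, lower.tail = FALSE)
  c(Chisq = chisq, Df = df, p_value = p)
}

# Values copied from the output above
m4_vs_m5 <- lr_test(-231.8783, -233.0013, df_large = 6, df_small = 5)
m5_vs_m6 <- lr_test(-233.0013, -236.1687, df_large = 5, df_small = 4)

round(m4_vs_m5, 4)  # dropping Education
round(m5_vs_m6, 4)  # dropping Married
```

Small differences from the printed values in the last decimals are expected, because the log-likelihoods used here are already rounded to four decimals.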


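Note that model3 and model4 are not nested (model3 contains Dependents but not Education, and model4 the reverse), so the first test is not a valid likelihood ratio comparison and is disregarded. For the nested pairs, dropping Education from model4 does not significantly worsen the fit (p = 0.134), whereas dropping Married from model5 does (p = 0.012). Together with the AIC values obtained before, this indicates that model5 is the best fit.

So, the final variables of significance are Credit History, Property Area and Marital status. The AIC of model5 (476.00) is practically equal to the lowest value (475.76 for model4), and model5 fits as well as the larger model4 with one fewer variable.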

## **Evaluating Model Performance**
Predictions of loan approval (Loan_Status) are now made on the testing data using the best-fitting logistic model (model5). The threshold is set at 0.5: an applicant is predicted as “Y” when the predicted probability exceeds 0.5 and as “N” otherwise, and the predictions are stored as a factor. A confusion matrix then compares the predicted values with the actual values in the testing dataset.

In [31]:
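res <- predict(model5, testing, type = "response")  # predicted probability of the level coded 1
contrasts(testing$Loan_Status)  # shows the coding: N = 0, Y = 1
predictedvalues <- ifelse(res > 0.5, "Y", "N")
# use the reference levels so both factors match even if one class is never predicted
predictedvalues <- factor(predictedvalues, levels = levels(testing$Loan_Status))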

  Y
N 0
Y 1
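To make the prediction step concrete, the sketch below applies the model5 coefficients reported earlier to two made-up applicants by hand: the linear predictor is passed through the logistic function and the 0.5 threshold is applied. The data frame `applicants` and the shortened coefficient names are assumptions for illustration only.

```r
# model5 coefficients, copied from its summary (names shortened for this sketch)
b <- c(intercept = -3.1464, credit = 3.7404, semiurban = 0.7694,
       urban = 0.1579, married = 0.5994)

# Two hypothetical married Semiurban applicants, with and without credit history
applicants <- data.frame(
  credit    = c(1, 0),
  semiurban = c(1, 1),
  urban     = c(0, 0),
  married   = c(1, 1)
)

# Linear predictor (log-odds of "Y"), then probability via the logistic function
eta  <- drop(as.matrix(applicants) %*% b[names(applicants)]) + b[["intercept"]]
prob <- plogis(eta)

# Classify with the 0.5 threshold
pred <- ifelse(prob > 0.5, "Y", "N")
data.frame(applicants, prob = round(prob, 3), pred)
```

Since "Y" is coded as 1, the probability returned by the model is the probability of approval; the two applicants differ only in credit history, yet they fall on opposite sides of the threshold.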


In [32]:
confusionMatrix(predictedvalues,testing$Loan_Status,positive = 'Y')

Confusion Matrix and Statistics

          Reference
Prediction  N  Y
         N 19  1
         Y 19 83
                                          
               Accuracy : 0.8361          
                 95% CI : (0.7582, 0.8969)
    No Information Rate : 0.6885          
    P-Value [Acc > NIR] : 0.0001559       
                                          
                  Kappa : 0.5608          
                                          
 Mcnemar's Test P-Value : 0.0001439       
                                          
            Sensitivity : 0.9881          
            Specificity : 0.5000          
         Pos Pred Value : 0.8137          
         Neg Pred Value : 0.9500          
             Prevalence : 0.6885          
         Detection Rate : 0.6803          
   Detection Prevalence : 0.8361          
      Balanced Accuracy : 0.7440          
                                          
       'Positive' Class : Y               
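The main statistics in this output follow directly from the four cell counts. The sketch below recomputes them in base R from the counts printed above, with "Y" as the positive class; the variable names are chosen for this illustration.

```r
# Confusion matrix counts copied from the output above (rows: prediction, columns: reference)
cm <- matrix(c(19, 19, 1, 83), nrow = 2,
             dimnames = list(Prediction = c("N", "Y"), Reference = c("N", "Y")))

TP <- cm["Y", "Y"]  # predicted Y, actually Y
TN <- cm["N", "N"]  # predicted N, actually N
FP <- cm["Y", "N"]  # predicted Y, actually N
FN <- cm["N", "Y"]  # predicted N, actually Y

accuracy    <- (TP + TN) / sum(cm)
sensitivity <- TP / (TP + FN)   # share of actual Y correctly predicted
specificity <- TN / (TN + FP)   # share of actual N correctly predicted
ppv         <- TP / (TP + FP)   # positive predictive value
npv         <- TN / (TN + FN)   # negative predictive value
balanced    <- (sensitivity + specificity) / 2

round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity,
        ppv = ppv, npv = npv, balanced_accuracy = balanced), 4)
```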
                                    

## **Conclusion**
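The accuracy of the model's predictions on the testing data is **83.6%**, compared with a no-information rate of 68.9%. Sensitivity is high (98.8%), but specificity is only 50%: half of the applicants whose loans were actually rejected are predicted as approved.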

* Credit History is by far the strongest predictor of loan approval: its coefficient is the largest and most significant in every model.
* Applicants whose Property Area is Semiurban are more likely to have their loan approved than applicants from Rural areas (the reference level).
* Applicants who are married are more likely to get their loan approved.
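The size of these effects can be expressed as odds ratios by exponentiating the model5 coefficients reported earlier, which are on the log-odds scale. The sketch below uses the printed estimates; it says nothing about their uncertainty, which would require confidence intervals from the fitted model object.

```r
# model5 coefficient estimates, copied from its summary
coefs <- c(Credit_History         = 3.7404,
           Property_AreaSemiurban = 0.7694,
           Property_AreaUrban     = 0.1579,
           Married                = 0.5994)

# Odds ratio = exp(coefficient): multiplicative change in the odds of approval
odds_ratios <- exp(coefs)
round(odds_ratios, 2)
```

Read this way, a positive credit history multiplies the odds of approval by roughly 42, a Semiurban property by about 2.2 relative to Rural, and being married by about 1.8, holding the other variables fixed.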