# Predicting Loan Repayment

<img src="images/loan-repayment.jpg"/>

In the lending industry, investors provide loans to borrowers in exchange for the promise of repayment with interest. If the borrower repays the loan, then the lender profits from the interest. However, if the borrower is unable to repay the loan, then the lender loses money. Therefore, lenders face the problem of predicting the risk of a borrower being unable to repay a loan.

To address this problem, we will use publicly available data from *LendingClub.com*, a website that connects borrowers and investors over the Internet. This dataset represents 9,578 3-year loans that were funded through the LendingClub.com platform between May 2007 and February 2010. The binary dependent variable *not.fully.paid* indicates that the loan was not paid back in full (the borrower either defaulted or the loan was "charged off," meaning the borrower was deemed unlikely to ever pay it back).

To predict this dependent variable, we will use the following independent variables available to the investor when deciding whether to fund a loan:

    credit.policy: 1 if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise.

    purpose: The purpose of the loan (takes values "credit_card", "debt_consolidation", "educational", "home_improvement", "major_purchase", "small_business", and "all_other").

    int.rate: The interest rate of the loan, as a proportion (a rate of 11% would be stored as 0.11). Borrowers judged by LendingClub.com to be more risky are assigned higher interest rates.

    installment: The monthly installments ($) owed by the borrower if the loan is funded.

    log.annual.inc: The natural log of the self-reported annual income of the borrower.

    dti: The debt-to-income ratio of the borrower (amount of debt divided by annual income).

    fico: The FICO credit score of the borrower.

    days.with.cr.line: The number of days the borrower has had a credit line.

    revol.bal: The borrower's revolving balance (amount unpaid at the end of the credit card billing cycle).

    revol.util: The borrower's revolving line utilization rate (the amount of the credit line used relative to total credit available).

    inq.last.6mths: The borrower's number of inquiries by creditors in the last 6 months.

    delinq.2yrs: The number of times the borrower had been 30+ days past due on a payment in the past 2 years.

    pub.rec: The borrower's number of derogatory public records (bankruptcy filings, tax liens, or judgments).

### Problem 1.1 - Preparing the Dataset

Load the dataset loans.csv into a data frame called loans, and explore it using the str() and summary() functions.

**What proportion of the loans in the dataset were not paid in full? Please input a number between 0 and 1.**

In [1]:
loans = read.csv("data/loans.csv")
head(loans)

| | credit.policy | purpose | int.rate | installment | log.annual.inc | dti | fico | days.with.cr.line | revol.bal | revol.util | inq.last.6mths | delinq.2yrs | pub.rec | not.fully.paid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<int>` | `<fct>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<int>` | `<dbl>` | `<int>` | `<dbl>` | `<int>` | `<int>` | `<int>` | `<int>` |
| 1 | 1 | debt_consolidation | 0.1189 | 829.1 | 11.35041 | 19.48 | 737 | 5639.958 | 28854 | 52.1 | 0 | 0 | 0 | 0 |
| 2 | 1 | credit_card | 0.1071 | 228.22 | 11.08214 | 14.29 | 707 | 2760.0 | 33623 | 76.7 | 0 | 0 | 0 | 0 |
| 3 | 1 | debt_consolidation | 0.1357 | 366.86 | 10.37349 | 11.63 | 682 | 4710.0 | 3511 | 25.6 | 1 | 0 | 0 | 0 |
| 4 | 1 | debt_consolidation | 0.1008 | 162.34 | 11.35041 | 8.1 | 712 | 2699.958 | 33667 | 73.2 | 1 | 0 | 0 | 0 |
| 5 | 1 | credit_card | 0.1426 | 102.92 | 11.29973 | 14.97 | 667 | 4066.0 | 4740 | 39.5 | 0 | 1 | 0 | 0 |
| 6 | 1 | credit_card | 0.0788 | 125.13 | 11.90497 | 16.98 | 727 | 6120.042 | 50807 | 51.0 | 0 | 0 | 0 | 0 |


In [2]:
str(loans)

'data.frame':	9578 obs. of  14 variables:
 $ credit.policy    : int  1 1 1 1 1 1 1 1 1 1 ...
 $ purpose          : Factor w/ 7 levels "all_other","credit_card",..: 3 2 3 3 2 2 3 1 5 3 ...
 $ int.rate         : num  0.119 0.107 0.136 0.101 0.143 ...
 $ installment      : num  829 228 367 162 103 ...
 $ log.annual.inc   : num  11.4 11.1 10.4 11.4 11.3 ...
 $ dti              : num  19.5 14.3 11.6 8.1 15 ...
 $ fico             : int  737 707 682 712 667 727 667 722 682 707 ...
 $ days.with.cr.line: num  5640 2760 4710 2700 4066 ...
 $ revol.bal        : int  28854 33623 3511 33667 4740 50807 3839 24220 69909 5630 ...
 $ revol.util       : num  52.1 76.7 25.6 73.2 39.5 51 76.8 68.6 51.1 23 ...
 $ inq.last.6mths   : int  0 0 1 1 0 0 0 0 1 1 ...
 $ delinq.2yrs      : int  0 0 0 0 1 0 0 0 0 0 ...
 $ pub.rec          : int  0 0 0 0 0 0 1 0 0 0 ...
 $ not.fully.paid   : int  0 0 0 0 0 0 1 1 0 0 ...


In [3]:
summary(loans)

 credit.policy                 purpose        int.rate       installment    
 Min.   :0.000   all_other         :2331   Min.   :0.0600   Min.   : 15.67  
 1st Qu.:1.000   credit_card       :1262   1st Qu.:0.1039   1st Qu.:163.77  
 Median :1.000   debt_consolidation:3957   Median :0.1221   Median :268.95  
 Mean   :0.805   educational       : 343   Mean   :0.1226   Mean   :319.09  
 3rd Qu.:1.000   home_improvement  : 629   3rd Qu.:0.1407   3rd Qu.:432.76  
 Max.   :1.000   major_purchase    : 437   Max.   :0.2164   Max.   :940.14  
                 small_business    : 619                                    
 log.annual.inc        dti              fico       days.with.cr.line
 Min.   : 7.548   Min.   : 0.000   Min.   :612.0   Min.   :  179    
 1st Qu.:10.558   1st Qu.: 7.213   1st Qu.:682.0   1st Qu.: 2820    
 Median :10.928   Median :12.665   Median :707.0   Median : 4140    
 Mean   :10.932   Mean   :12.607   Mean   :710.8   Mean   : 4562    
 3rd Qu.:11.290   3rd Qu.:17.950   3rd 

In [4]:
# Count the loans that were fully paid (0) and not fully paid (1).
z = table(loans$not.fully.paid)
z


   0    1 
8045 1533 

8,045 loans were paid in full;
1,533 were not.

In [5]:
# Proportion of the loans in the dataset that were not paid in full.
z[2]/sum(z)

1533/9578 ≈ 0.1601, so about 16% of the loans were not paid in full.
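As a quick cross-check, the same proportion can be recomputed from the two counts printed by `table()` above, without reloading the data. This is a minimal sketch; the vector name `counts` is introduced here for illustration only.

```r
# Counts copied from the table() output above (0 = paid in full, 1 = not).
counts <- c("0" = 8045, "1" = 1533)

# prop.table() divides each entry by the total, giving class proportions.
props <- prop.table(counts)
round(props, 4)
```

The entry named `"1"` is the proportion of loans that were not paid in full.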

### Problem 1.2 - Preparing the Dataset

**Which of the following variables has at least one missing observation?**

In [6]:
summary(loans)

 credit.policy                 purpose        int.rate       installment    
 Min.   :0.000   all_other         :2331   Min.   :0.0600   Min.   : 15.67  
 1st Qu.:1.000   credit_card       :1262   1st Qu.:0.1039   1st Qu.:163.77  
 Median :1.000   debt_consolidation:3957   Median :0.1221   Median :268.95  
 Mean   :0.805   educational       : 343   Mean   :0.1226   Mean   :319.09  
 3rd Qu.:1.000   home_improvement  : 629   3rd Qu.:0.1407   3rd Qu.:432.76  
 Max.   :1.000   major_purchase    : 437   Max.   :0.2164   Max.   :940.14  
                 small_business    : 619                                    
 log.annual.inc        dti              fico       days.with.cr.line
 Min.   : 7.548   Min.   : 0.000   Min.   :612.0   Min.   :  179    
 1st Qu.:10.558   1st Qu.: 7.213   1st Qu.:682.0   1st Qu.: 2820    
 Median :10.928   Median :12.665   Median :707.0   Median : 4140    
 Mean   :10.932   Mean   :12.607   Mean   :710.8   Mean   : 4562    
 3rd Qu.:11.290   3rd Qu.:17.950   3rd 

Variables with NA's: log.annual.inc, days.with.cr.line, revol.util, inq.last.6mths, delinq.2yrs, pub.rec.

### Problem 1.3 - Preparing the Dataset

**Which of the following is the best reason to fill in the missing values for these variables instead of removing observations with missing data?** 

Hint: you can use the subset() function to build a data frame with the observations missing at least one value. To test if a variable, for example pub.rec, is missing a value, use is.na(pub.rec).

We want to be able to predict risk for all borrowers, instead of just the ones with all data reported.
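The hint about subset() and is.na() can be tried on a small stand-in first. The sketch below uses a made-up data frame called `toy` (its values are invented for illustration and are not taken from loans.csv) and collects the rows that are missing at least one value.

```r
# Toy stand-in for `loans`; all values here are made up for illustration.
toy <- data.frame(
  fico       = c(700, 710, 680, 690),
  pub.rec    = c(0, NA, 1, 0),
  revol.util = c(52.1, 30.0, NA, 12.5)
)

# Rows where at least one of the listed variables is missing.
missing <- subset(toy, is.na(pub.rec) | is.na(revol.util))
nrow(missing)

# Same count across all columns, without naming each variable.
sum(!complete.cases(toy))
```

On the real data, the is.na() conditions would list each of the variables that contain NA's; comparing summary() of the resulting subset with summary() of the full data shows what would be lost by dropping those rows.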

### Problem 1.4 - Preparing the Dataset

For the rest of this problem, we'll be using a revised version of the dataset that has the missing values filled in with multiple imputation (which was discussed in the Recitation of this Unit). To ensure everybody has the same data frame going forward, you can either run the commands below in your R console (if you haven't already, run the command install.packages("mice") first), or you can download and load into R the dataset we created after running the imputation: loans_imputed.csv.

IMPORTANT NOTE: On certain operating systems, the imputation results are not the same even if you set the random seed. If you decide to do the imputation yourself, please still read the provided imputed dataset (loans_imputed.csv) into R and compare your results, using the summary function. If the results are different, please make sure to use the data in loans_imputed.csv for the rest of the problem.

    library(mice)

    set.seed(144)

    vars.for.imputation = setdiff(names(loans), "not.fully.paid")

    imputed = complete(mice(loans[vars.for.imputation]))

    loans[vars.for.imputation] = imputed

Note that to do this imputation, we set vars.for.imputation to all variables in the data frame except for not.fully.paid, to impute the values using all of the other independent variables.

**What best describes the process we just used to handle missing values?**

In [7]:
#install.packages("mice")
library(mice)


Attaching package: 'mice'


The following objects are masked from 'package:base':

    cbind, rbind




In [8]:
set.seed(144)

vars.for.imputation = setdiff(names(loans), "not.fully.paid")

imputed = complete(mice(loans[vars.for.imputation]))


 iter imp variable
  1   1  log.annual.inc  days.with.cr.line  revol.util  inq.last.6mths  delinq.2yrs  pub.rec
  1   2  log.annual.inc  days.with.cr.line  revol.util  inq.last.6mths  delinq.2yrs  pub.rec
  1   3  log.annual.inc  days.with.cr.line  revol.util  inq.last.6mths  delinq.2yrs  pub.rec
  1   4  log.annual.inc  days.with.cr.line  revol.util  inq.last.6mths  delinq.2yrs  pub.rec
  1   5  log.annual.inc  days.with.cr.line  revol.util  inq.last.6mths  delinq.2yrs  pub.rec
  2   1  log.annual.inc  days.with.cr.line  revol.util  inq.last.6mths  delinq.2yrs  pub.rec
  2   2  log.annual.inc  days.with.cr.line  revol.util  inq.last.6mths  delinq.2yrs  pub.rec
  2   3  log.annual.inc  days.with.cr.line  revol.util  inq.last.6mths  delinq.2yrs  pub.rec
  2   4  log.annual.inc  days.with.cr.line  revol.util  inq.last.6mths  delinq.2yrs  pub.rec
  2   5  log.annual.inc  days.with.cr.line  revol.util  inq.last.6mths  delinq.2yrs  pub.rec
  3   1  log.annual.inc  days.with.cr.line  revol.

In [9]:
loans[vars.for.imputation] = imputed

We predicted missing variable values using the available independent variables for each observation.

### Problem 2.1 - Prediction Models
Now that we have prepared the dataset, we need to split it into a training and testing set. To ensure everybody obtains the same split, set the random seed to 144 (even though you already did so earlier in the problem) and use the sample.split function to select 70% of the observations for the training set (the dependent variable for sample.split is not.fully.paid). Name the data frames train and test.

Now, use logistic regression trained on the training set to predict the dependent variable not.fully.paid using all the independent variables.

**Which independent variables are significant in our model?** (Significant variables have at least one star, or a Pr(>|z|) value less than 0.05.)

In [10]:
#install.packages("caTools")
library(caTools)

In [11]:
# Split the data
set.seed(144)

spl = sample.split(loans$not.fully.paid, 0.7)

train = subset(loans, spl == TRUE)
test = subset(loans, spl == FALSE)
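sample.split() splits the data while keeping the ratio of the outcome classes about the same in both parts. To make that idea concrete, here is a base-R sketch of a stratified split on a toy outcome vector. The function `stratified_split` is hypothetical and written only for illustration; it is not the caTools implementation and will not reproduce the same split.

```r
set.seed(144)

# Toy outcome: 80 fully paid loans (0) and 20 not fully paid (1).
y <- c(rep(0, 80), rep(1, 20))

# Hypothetical helper: sample the given ratio within each class separately.
stratified_split <- function(y, ratio) {
  in_train <- logical(length(y))
  for (cls in unique(y)) {
    idx <- which(y == cls)
    n_take <- round(ratio * length(idx))
    in_train[idx[sample.int(length(idx), n_take)]] <- TRUE
  }
  in_train
}

spl_toy <- stratified_split(y, 0.7)

# The share of not-fully-paid loans is the same in both parts.
mean(y[spl_toy])
mean(y[!spl_toy])
```

Sampling within each class is what keeps the baseline rate comparable between the training and testing sets.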

In [12]:
# Logistic Regression
mod = glm(not.fully.paid~., data=train, family="binomial")
summary(mod)


Call:
glm(formula = not.fully.paid ~ ., family = "binomial", data = train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.2008  -0.6213  -0.4953  -0.3609   2.6389  

Coefficients:
                            Estimate Std. Error z value Pr(>|z|)    
(Intercept)                9.250e+00  1.552e+00   5.959 2.54e-09 ***
credit.policy             -3.417e-01  1.009e-01  -3.388 0.000704 ***
purposecredit_card        -6.124e-01  1.344e-01  -4.557 5.18e-06 ***
purposedebt_consolidation -3.199e-01  9.179e-02  -3.485 0.000493 ***
purposeeducational         1.351e-01  1.753e-01   0.771 0.440814    
purposehome_improvement    1.728e-01  1.479e-01   1.168 0.242901    
purposemajor_purchase     -4.828e-01  2.008e-01  -2.404 0.016215 *  
purposesmall_business      4.123e-01  1.418e-01   2.907 0.003653 ** 
int.rate                   6.434e-01  2.085e+00   0.309 0.757592    
installment                1.274e-03  2.092e-04   6.091 1.12e-09 ***
log.annual.inc            -4.328e-01

Answer: credit.policy, purpose (the credit_card, debt_consolidation, major_purchase, and small_business levels), installment, log.annual.inc, fico, revol.bal, inq.last.6mths, pub.rec

### Problem 2.2 - Prediction Models
Consider two loan applications, which are identical other than the fact that the borrower in Application A has FICO credit score 700 while the borrower in Application B has FICO credit score 710.

Let Logit(A) be the log odds of loan A not being paid back in full, according to our logistic regression model, and define Logit(B) similarly for loan B. **What is the value of Logit(A) - Logit(B)?**

Application A is identical to Application B except that its FICO score is 10 points lower, so Logit(A) - Logit(B) equals the fico coefficient (-9.408e-03) multiplied by -10:

In [13]:
-9.408e-03 * -10

Because Application A is identical to Application B other than having a FICO score 10 lower, its predicted log odds differ by -9.408e-03 * -10 = 0.09408 from the predicted log odds of Application B.

**Now, let O(A) be the odds of loan A not being paid back in full, according to our logistic regression model, and define O(B) similarly for loan B. What is the value of O(A)/O(B)?** (HINT: Use the mathematical rule that exp(A + B + C) = exp(A)exp(B)exp(C). Also, remember that exp() is the exponential function in R.)

In [14]:
exp(0.09408)

The predicted odds of loan A not being paid back in full are 1.0986 times the predicted odds for loan B. Intuitively, it makes sense that loan A should have higher odds of non-payment than loan B, since the borrower has a worse credit score.
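The two answers in this problem can be written as one short derivation. The symbol $\beta_{\text{fico}}$ is notation introduced here for the fico coefficient of the logistic regression model; every other term cancels because the applications are otherwise identical.

$$
\begin{aligned}
\text{Logit}(A) - \text{Logit}(B) &= \beta_{\text{fico}}\,(700 - 710) = (-0.009408)(-10) = 0.09408,\\
\frac{O(A)}{O(B)} &= \exp\big(\text{Logit}(A) - \text{Logit}(B)\big) = \exp(0.09408) \approx 1.0986.
\end{aligned}
$$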

### Problem 2.3 - Prediction Models
Predict the probability of the test set loans not being paid back in full (remember type="response" for the predict function). Store these predicted probabilities in a variable named predicted.risk and add it to your test set (we will use this variable in later parts of the problem). Compute the confusion matrix using a threshold of 0.5.

**What is the accuracy of the logistic regression model?** Input the accuracy as a number between 0 and 1.

In [15]:
# Make predictions
test$predicted.risk = predict(mod, newdata=test, type="response")

# Confusion matrix: actual outcome (rows) vs. prediction at a 0.5 threshold (columns)
z = table(test$not.fully.paid, test$predicted.risk > 0.5)
z

   
    FALSE TRUE
  0  2400   13
  1   457    3

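The **rows** are labeled with the actual outcome, and the **columns** are labeled with the predicted outcome.

                Predict 0       Predict 1
    Actual 0    True Negative   False Positive
    Actual 1    False Negative  True Positive

R stores the table column by column, so single-index access follows the layout below (z[1] = TN, z[2] = FN, z[3] = FP, z[4] = TP):

    z = [1][3]
        [2][4]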

In [16]:
# Calculate accuracy                        # (TN+TP)/(TN+FN+TP+FP)
accur <- (z[1]+z[4])/(z[1]+z[2]+z[3]+z[4])  # sum(diag(z))/sum(z)
paste("Accuracy Logistic Regression: ", round(accur,digits=4))              
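The accuracy calculation can be reproduced from the four counts in the confusion matrix printed above. The sketch below rebuilds that table by hand (the object is rebuilt here only so the block runs on its own) and also derives sensitivity and specificity, which are not asked for in the problem but help interpret the result.

```r
# Counts copied from the confusion matrix above, filled column by column:
# TN = 2400, FN = 457, FP = 13, TP = 3.
z <- matrix(c(2400, 457, 13, 3), nrow = 2,
            dimnames = list(actual = c("0", "1"),
                            predicted = c("FALSE", "TRUE")))

accuracy    <- sum(diag(z)) / sum(z)              # (TN + TP) / total
sensitivity <- z["1", "TRUE"]  / sum(z["1", ])    # TP / (TP + FN)
specificity <- z["0", "FALSE"] / sum(z["0", ])    # TN / (TN + FP)

round(c(accuracy = accuracy,
        sensitivity = sensitivity,
        specificity = specificity), 4)
```

The accuracy rounds to 0.8364. Only 3 of the 460 loans that were not fully paid are flagged at this threshold, so the sensitivity is below 1%.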

**What is the accuracy of the baseline model?** Input the accuracy as a number between 0 and 1.

In [17]:
z2 = table(test$not.fully.paid)
z2


   0    1 
2413  460 

In [18]:
2413/(2413+460)

Or, in another way...

In [19]:
# Baseline accuracy: always predict the majority class (fully paid)
accur2 <- z2[1]/(z2[1]+z2[2])               # z2[1]/sum(z2)
paste("Accuracy Baseline: ", round(accur2,digits=4))

### Problem 2.4 - Prediction Models
Use the ROCR package to compute the test set AUC.

In [20]:
# Calculate AUC
#install.packages("ROCR")
library(ROCR)

pred = prediction(test$predicted.risk, test$not.fully.paid)

AUCt = as.numeric(performance(pred, "auc")@y.values)

paste("Test set AUC: ", round(AUCt,digits=4))

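At the 0.5 threshold the model is no more accurate than the baseline: it classifies 2,403 of the 2,873 test loans correctly, versus 2,413 for the baseline. But despite the poor accuracy, we will see later how an investor can still leverage this logistic regression model to make profitable investments.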
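For intuition about the AUC computed with ROCR above: it equals the probability that a randomly chosen not-fully-paid loan receives a higher predicted risk than a randomly chosen fully paid loan. The sketch below computes that quantity from ranks on four made-up scores. The function `auc_rank` is a hypothetical helper for illustration, not part of ROCR.

```r
# Hypothetical helper: AUC from the rank-sum (Mann-Whitney) identity.
auc_rank <- function(scores, labels) {
  r <- rank(scores)                 # average ranks are used for ties
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Made-up scores and outcomes for illustration.
scores <- c(0.10, 0.40, 0.35, 0.80)
labels <- c(0, 0, 1, 1)

auc_rank(scores, labels)   # 0.75: 3 of the 4 positive/negative pairs are ordered correctly
```

A model that orders loans at random would score about 0.5 on this measure, which is why 0.5 serves as the reference point.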

### Problem 3.1 - A "Smart Baseline"
In the previous problem, we built a logistic regression model that has an AUC significantly higher than the AUC of 0.5 that would be obtained by randomly ordering observations.

However, LendingClub.com assigns the interest rate to a loan based on their estimate of that loan's risk. This variable, *int.rate*, is an independent variable in our dataset. In this part, we will investigate using the loan's interest rate as a "smart baseline" to order the loans according to risk.

Using the training set, build a bivariate logistic regression model (i.e., a logistic regression model with a single independent variable) that predicts the dependent variable *not.fully.paid* using only the variable *int.rate*.

The variable int.rate is highly significant in the bivariate model, but it is not significant at the 0.05 level in the model trained with all the independent variables. **What is the most likely explanation for this difference?**

In [21]:
# Logistic Regression
bivariate = glm(not.fully.paid~int.rate, data=train, family="binomial")
summary(bivariate)


Call:
glm(formula = not.fully.paid ~ int.rate, family = "binomial", 
    data = train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.0547  -0.6271  -0.5442  -0.4361   2.2914  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -3.6726     0.1688  -21.76   <2e-16 ***
int.rate     15.9214     1.2702   12.54   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 5896.6  on 6704  degrees of freedom
Residual deviance: 5734.8  on 6703  degrees of freedom
AIC: 5738.8

Number of Fisher Scoring iterations: 4


Decreased significance between a bivariate and multivariate model is typically due to correlation. From cor(train$int.rate, train$fico), we can see that the interest rate is moderately well correlated with a borrower's credit score.

Training/testing set split rarely has a large effect on the significance of variables (this can be verified in this case by trying out a few other training/testing splits), and the models were trained on the same observations.
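The effect of correlation on significance can be reproduced on simulated data. In the sketch below, all variable names (`fico_like`, `rate_like`, `y_sim`) and all numbers are assumptions chosen for illustration: the outcome depends only on the score-like variable, while the rate-like variable is merely correlated with it.

```r
set.seed(1)
n <- 2000

# Simulated predictors: rate_like is built to correlate with fico_like.
fico_like <- rnorm(n)
rate_like <- -0.7 * fico_like + rnorm(n, sd = 0.7)

# The outcome depends on fico_like only.
y_sim <- rbinom(n, 1, plogis(-1.5 - 0.8 * fico_like))

single <- glm(y_sim ~ rate_like, family = "binomial")
both   <- glm(y_sim ~ rate_like + fico_like, family = "binomial")

cor(rate_like, fico_like)
coef(summary(single))["rate_like", ]   # rate_like alone
coef(summary(both))["rate_like", ]     # rate_like next to fico_like
```

On its own, the rate-like variable picks up the signal carried by the score-like variable. Once both are in the model, its estimated effect typically shrinks toward zero and its p-value rises, which mirrors what happens to int.rate here.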

### Problem 3.2 - A "Smart Baseline"
Make test set predictions for the bivariate model. **What is the highest predicted probability of a loan not being paid in full on the testing set?**

In [22]:
# Make predictions
pred.bivariate = predict(bivariate, newdata=test, type="response")
# Max Probability
summary(pred.bivariate)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.06196 0.11549 0.15077 0.15963 0.18928 0.42662 

Highest Predicted Probability = 0.4266

**With a logistic regression cutoff of 0.5, how many loans would be predicted as not being paid in full on the testing set?**

The maximum predicted probability of the loan not being paid back is 0.4266, which means no loans would be flagged at a logistic regression cutoff of 0.5.
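The same conclusion follows from the fitted coefficients of the bivariate model. Writing $r$ for int.rate (a symbol introduced here), the predicted probability reaches 0.5 exactly when the log odds reach 0:

$$
-3.6726 + 15.9214\,r = 0 \iff r = \frac{3.6726}{15.9214} \approx 0.2307
$$

The largest interest rate in the whole dataset is 0.2164 according to summary(loans), which is below this value, so no loan can reach the 0.5 cutoff under this model.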

### Problem 3.3 - A "Smart Baseline"
What is the test set AUC of the bivariate model?

In [23]:
# Calculate AUC
prediction.bivariate = prediction(pred.bivariate, test$not.fully.paid)

AUCb = as.numeric(performance(prediction.bivariate, "auc")@y.values)

paste("Test set AUC: ", round(AUCb,digits=4))

### Problem 4.1 - Computing the Profitability of an Investment
While thus far we have predicted if a loan will be paid back or not, an investor needs to identify loans that are expected to be profitable. If the loan is paid back in full, then the investor makes interest on the loan. However, if the loan is not paid back, the investor loses the money invested. Therefore, the investor should seek loans that best balance this risk and reward.

To compute interest revenue, consider a $c investment in a loan that has an annual interest rate r over a period of t years. Using continuous compounding of interest, this investment pays back c exp(rt) dollars by the end of the t years, where exp(rt) is e raised to the rt power.

How much does a \\$10 investment with an annual interest rate of 6% pay back after 3 years, using continuous compounding of interest? Hint: remember to convert the percentage to a proportion before doing the math. Enter the number of dollars, without the $ sign.

In [24]:
c = 10
r = 0.06
t = 3

round(c*exp(r*t),2)

### Problem 4.2 - Computing the Profitability of an Investment
While the investment has value c * exp(rt) dollars after collecting interest, the investor had to pay $c for the investment. What is the profit to the investor if the investment is paid back in full?

Answer: c\*exp(rt) - c

### Problem 4.3 - Computing the Profitability of an Investment
Now, consider the case where the investor made a $c investment, but it was not paid back in full. Assume, conservatively, that no money was received from the borrower (often a lender will receive some but not all of the value of the loan, making this a pessimistic assumption of how much is received). What is the profit to the investor in this scenario?

Answer: -c

### Problem 5.1 - A Simple Investment Strategy
In the previous subproblem, we concluded that an investor who invested c dollars in a loan with interest rate r for t years makes c * (exp(rt) - 1) dollars of profit if the loan is paid back in full and -c dollars of profit if the loan is not paid back in full (pessimistically).
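The two cases can be combined in one small function. This is a sketch: the name `loan_profit` and its argument names are chosen here for illustration, and the no-recovery assumption for unpaid loans is the pessimistic one stated above.

```r
# Profit under continuous compounding; unpaid loans lose the full amount.
loan_profit <- function(amount, rate, years, paid_in_full) {
  ifelse(paid_in_full, amount * (exp(rate * years) - 1), -amount)
}

loan_profit(10, 0.06, 3, TRUE)    # about 1.97
loan_profit(10, 0.06, 3, FALSE)   # -10
```

Because ifelse() is vectorized, the same function also works when `rate` and `paid_in_full` are whole columns of a data frame.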

In order to evaluate the quality of an investment strategy, we need to compute this profit for each loan in the test set. For this variable, we will assume a $1 investment (i.e., c=1). To create the variable, we first assign the profit for a fully paid loan, exp(rt)-1, to every observation, and we then replace this value with -1 in the cases where the loan was not paid in full. All the loans in our dataset are 3-year loans, meaning t=3 in our calculations. Enter the following commands in your R console to create this new variable:

    test$profit = exp(test$int.rate*3) - 1

    test$profit[test$not.fully.paid == 1] = -1

**What is the maximum profit of a \\$10 investment in any loan in the testing set (do not include the $ sign in your answer)?**

In [25]:
# Create a new variable
test$profit = exp(test$int.rate*3) - 1
test$profit[test$not.fully.paid == 1] = -1

# Maximum profit
summary(test$profit)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-1.0000  0.2858  0.4111  0.2094  0.4980  0.8895 

The maximum profit per dollar invested is 0.8895, so a 10-dollar investment earns at most 10 × 0.8895 = 8.895.

In [26]:
head(test)

| | credit.policy | purpose | int.rate | installment | log.annual.inc | dti | fico | days.with.cr.line | revol.bal | revol.util | inq.last.6mths | delinq.2yrs | pub.rec | not.fully.paid | predicted.risk | profit |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<int>` | `<fct>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<int>` | `<dbl>` | `<int>` | `<dbl>` | `<int>` | `<int>` | `<int>` | `<int>` | `<dbl>` | `<dbl>` |
| 2 | 1 | credit_card | 0.1071 | 228.22 | 11.082143 | 14.29 | 707 | 2760.0 | 33623 | 76.7 | 0 | 0 | 0 | 0 | 0.07681532 | 0.3789192 |
| 3 | 1 | debt_consolidation | 0.1357 | 366.86 | 10.373491 | 11.63 | 682 | 4710.0 | 3511 | 25.6 | 1 | 0 | 0 | 0 | 0.17397147 | 0.5024543 |
| 10 | 1 | debt_consolidation | 0.1221 | 84.12 | 10.203592 | 10.0 | 707 | 2730.042 | 5630 | 23.0 | 1 | 0 | 0 | 0 | 0.10911169 | 0.4423879 |
| 12 | 1 | debt_consolidation | 0.1324 | 253.58 | 11.835009 | 9.16 | 662 | 4298.0 | 5122 | 18.2 | 2 | 1 | 0 | 0 | 0.10249433 | 0.4876534 |
| 21 | 1 | all_other | 0.08 | 188.02 | 11.225243 | 16.08 | 772 | 4888.958 | 29797 | 23.2 | 1 | 0 | 0 | 0 | 0.06798512 | 0.2712492 |
| 28 | 1 | debt_consolidation | 0.1375 | 255.43 | 9.998798 | 14.29 | 662 | 1318.958 | 4175 | 51.5 | 0 | 1 | 0 | 0 | 0.18741271 | 0.5105895 |


### Problem 6.1 - An Investment Strategy Based on Risk
A simple investment strategy of equally investing in all the loans would yield profit \\$20.94 for a $100 investment. But this simple investment strategy does not leverage the prediction model we built earlier in this problem. As stated earlier, investors seek loans that balance reward with risk, in that they simultaneously have high interest rates and a low risk of not being paid back.

To meet this objective, we will analyze an investment strategy in which the investor only purchases loans with a high interest rate (a rate of at least 15%), but amongst these loans selects the ones with the lowest predicted risk of not being paid back in full. We will model an investor who invests $1 in each of the most promising 100 loans.

First, use the subset() function to build a data frame called highInterest consisting of the test set loans with an interest rate of at least 15%.

**What is the average profit of a \\$1 investment in one of these high-interest loans (do not include the $ sign in your answer)?**

In [27]:
# Subset the data
highInterest = subset(test, int.rate >= 0.15)

# Find the average
mean(highInterest$profit)

**What proportion of the high-interest loans were not paid back in full?**

In [28]:
# Tabulate high interest loans not fully paid
z3 = table(highInterest$not.fully.paid)
z3


  0   1 
327 110 

327 loans were paid in full;
110 were not.

In [29]:
# Compute proportion
paste(round(z3[2]/sum(z3),2)*100,"%")

### Problem 6.2 - An Investment Strategy Based on Risk
Next, we will determine the 100th smallest predicted probability of not paying in full by sorting the predicted risks in increasing order and selecting the 100th element of this sorted list. Find the highest predicted risk that we will include by typing the following command into your R console:

    cutoff = sort(highInterest$predicted.risk, decreasing=FALSE)[100]

Use the subset() function to build a data frame called selectedLoans consisting of the high-interest loans with predicted risk not exceeding the cutoff we just computed. Check to make sure you have selected 100 loans for investment.

**What is the profit of the investor, who invested \\$1 in each of these 100 loans (do not include the $ sign in your answer)?**

In [30]:
# Implement cutoff
cutoff = sort(highInterest$predicted.risk, decreasing=FALSE)[100]
cutoff

In [31]:
# Subset the data
selectedLoans = subset(highInterest, predicted.risk <= cutoff)
head(selectedLoans)
nrow(selectedLoans)

| | credit.policy | purpose | int.rate | installment | log.annual.inc | dti | fico | days.with.cr.line | revol.bal | revol.util | inq.last.6mths | delinq.2yrs | pub.rec | not.fully.paid | predicted.risk | profit |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<int>` | `<fct>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<int>` | `<dbl>` | `<int>` | `<dbl>` | `<int>` | `<int>` | `<int>` | `<int>` | `<dbl>` | `<dbl>` |
| 74 | 1 | small_business | 0.1501 | 225.37 | 12.26905 | 6.45 | 677 | 6240.0 | 56411 | 75.3 | 0 | 0 | 0 | 1 | 0.1640581 | -1.0 |
| 87 | 1 | credit_card | 0.1533 | 444.05 | 11.0021 | 19.52 | 667 | 2700.958 | 33074 | 68.8 | 2 | 0 | 0 | 0 | 0.1685521 | 0.5839156 |
| 624 | 1 | debt_consolidation | 0.1576 | 420.47 | 11.51293 | 18.55 | 667 | 4560.042 | 34841 | 89.6 | 0 | 0 | 0 | 0 | 0.1576252 | 0.6044805 |
| 1361 | 1 | all_other | 0.1588 | 245.69 | 11.46163 | 24.19 | 667 | 5375.958 | 590 | 84.3 | 0 | 0 | 0 | 0 | 0.1623675 | 0.610267 |
| 1524 | 1 | debt_consolidation | 0.1557 | 244.62 | 10.78932 | 2.72 | 672 | 3010.042 | 3273 | 69.6 | 1 | 0 | 0 | 1 | 0.1464001 | -1.0 |
| 1617 | 1 | home_improvement | 0.1525 | 347.88 | 11.0021 | 1.28 | 702 | 1290.042 | 4980 | 55.3 | 1 | 0 | 0 | 0 | 0.1757593 | 0.5801187 |


In [32]:
# Calculate the profit
sum(selectedLoans$profit)

**How many of 100 selected loans were not paid back in full?**

In [33]:
# Tabulate how many selected loans were not paid back in full
z4 = table(selectedLoans$not.fully.paid) 
z4


 0  1 
81 19 

81 loans were paid in full;
19 were not.
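The selection procedure used in this problem (filter by interest rate, find the k-th smallest predicted risk, keep the loans at or below it) can be traced end to end on a handful of rows. The data frame `toy` below is made up for illustration, with k = 3 standing in for the 100 loans selected above.

```r
# Made-up loans for illustration; column names follow the test set.
toy <- data.frame(
  int.rate       = c(0.16, 0.12, 0.18, 0.15, 0.17, 0.20),
  predicted.risk = c(0.30, 0.10, 0.22, 0.18, 0.25, 0.40),
  not.fully.paid = c(0, 0, 1, 0, 0, 1)
)

# Profit per dollar: exp(3 * rate) - 1 if repaid, -1 otherwise.
toy$profit <- exp(toy$int.rate * 3) - 1
toy$profit[toy$not.fully.paid == 1] <- -1

# Keep high-interest loans, then the k with the lowest predicted risk.
high <- subset(toy, int.rate >= 0.15)
k <- 3
cutoff_toy <- sort(high$predicted.risk, decreasing = FALSE)[k]
selected <- subset(high, predicted.risk <= cutoff_toy)

nrow(selected)          # 3 loans selected
sum(selected$profit)    # total profit on 1 dollar per selected loan
```

If several loans tie at the cutoff value, the `<=` comparison keeps all of them, so the selection can contain more than k rows. This is why the problem asks you to check the row count.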

We have now seen how analytics can be used to select a subset of the high-interest loans that were paid back at only a slightly lower rate than average, resulting in a significant increase in the profit from our investor's $100 investment. Although the logistic regression models developed in this problem did not have large AUC values, we see that they still provided the edge needed to improve the profitability of an investment portfolio.

We conclude with a note of warning. Throughout this analysis we assume that the loans we invest in will perform in the same way as the loans we used to train our model, even though our training set covers a relatively short period of time. If there is an economic shock like a large financial downturn, default rates might be significantly higher than those observed in the training set and we might end up losing money instead of profiting. Investors must pay careful attention to such risk when making investment decisions.