# Week 5-2: The social effects of using machine learning for lending decisions

This notebook uses about a 10% sample of the [Lending Club data set](https://www.kaggle.com/wendykan/lending-club-loan-data) to examine the results of improved default prediction on who gets a loan and who doesn't. 

This analysis is inspired by the paper [Predictably Unequal? The Effects of Machine Learning on Credit Markets](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3072038) by Fuster et al.

For more code that deals with machine learning on this data set, see [Predicting Loan Repayment](https://towardsdatascience.com/predicting-loan-repayment-5df4e0023e92).

This is a ~100k row subset of the ~900k records in the original data set. The subset over-samples defaulters enormously, so that repaid vs. defaulted is about 50/50. In the original data the share of defaulters is much smaller: repaid vs. defaulted is about 80:20. The preprocessing script does this intentionally, to make the numbers in this example easier to understand. As a side effect, this skews the income distributions, because defaulters (who were previously granted a loan, if they're in this data set!) will have different characteristics from non-defaulters, or from defaulters who were denied a loan. Another unrealistic aspect is that a real loan issuer has some key information that isn't publicly available for privacy reasons, like FICO credit score. 

So **don't take this as research on what the effect of machine learning on loan decisions will actually be.** This document is a learning tool, meant to demonstrate some ways that question could be explored. 

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from pandas.plotting import scatter_matrix
from sklearn import metrics
%matplotlib inline

In [None]:
# Our data subset is all the "charged off" or "defaulted" rows in the original, plus 1/20 the "fully paid"
# In any case, it's concluded loans only, nothing that's still being repaid.
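A minimal sketch of how such a subset could be built, assuming the full data sits in a DataFrame with a `loan_status` column. The toy frame `full` and the helper `make_subset` are hypothetical stand-ins, not the actual preprocessing script.

```python
import pandas as pd

def make_subset(full, paid_fraction=1/20, seed=0):
    """Keep every charged-off or defaulted loan plus a random fraction of fully paid ones."""
    # Concluded loans only: drop anything that is still being repaid
    concluded = full[full.loan_status.isin(['Fully Paid', 'Charged Off', 'Default'])]
    paid = concluded[concluded.loan_status == 'Fully Paid']
    bad = concluded[concluded.loan_status != 'Fully Paid']
    return pd.concat([bad, paid.sample(frac=paid_fraction, random_state=seed)])

# Toy stand-in for the full data: 200 repaid, 10 bad, 5 still being repaid
full = pd.DataFrame({
    'loan_status': ['Fully Paid'] * 200 + ['Charged Off'] * 8 + ['Default'] * 2 + ['Current'] * 5
})
subset = make_subset(full)
print(subset.loan_status.value_counts())
```

With these toy counts the subset ends up balanced between repaid and not repaid, which is the effect the over-sampling is after.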


In [None]:
# how many rows?
len(loans)

In [None]:
# What are the columns? For the column descriptions, see LCDataDictionary.xlsx
loans.columns

In [None]:
# Let's look at value counts for a few fields, to get a sense of what this data is


In [1]:
# What do people get loans for?
loans.purpose.value_counts()

In [None]:
# What does the distribution of income look like?


## 1. A simple classifier
Logistic regression on a handful of features to try to predict who will fall behind on payments.

We'll generate features from the following columns

- purpose: The purpose of the loan such as: credit_card, debt_consolidation, etc.
- installment: The monthly payment on the loan
- annual_inc: the annual income of the borrower.
- dti: The debt-to-income ratio of the borrower, excluding this loan
- pub_rec: The borrower’s number of derogatory public records.

And the target variable, that we're trying to predict, is:
- loan_status: Fully Paid, Charged Off, or Default

In [None]:
# Encode everything into a feature matrix
features = pd.concat(
    [
        loans.annual_inc/1000, # count in 1000s of dollars, to make the coefficient more easily interpretable
        loans.dti,
        loans.installment,
        loans.pub_rec,
        pd.get_dummies(loans.purpose, prefix='purpose'),
    ],
    axis=1)

# Code the target variable as True if we are predicting that this loan gets repaid in full
target = loans.loan_status == 'Fully Paid'

features.head()

In [2]:
# Your basic logistic regression
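A minimal sketch of the fitting step, run on synthetic data so it stands alone. The names `toy_features` and `toy_target` and the made-up relationship between income, `dti`, and repayment are assumptions for illustration; in the notebook the same calls would take `features` and `target`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500
# Hypothetical toy features: income in $1000s and debt-to-income ratio
toy_features = pd.DataFrame({
    'annual_inc': rng.normal(70, 20, n),
    'dti': rng.uniform(0, 35, n),
})
# Toy target: repayment is made more likely by higher income and lower dti
logit = 0.03 * (toy_features.annual_inc - 70) - 0.05 * (toy_features.dti - 17)
toy_target = rng.random(n) < 1 / (1 + np.exp(-logit))

# Fit the model and predict on the same rows it was trained on
model = LogisticRegression(max_iter=1000)
model.fit(toy_features, toy_target)
toy_pred = model.predict(toy_features)
train_accuracy = model.score(toy_features, toy_target)
print('training accuracy:', train_accuracy)
```

Note that this scores the model on its own training rows, which overstates how well it would do on new applicants.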


In [None]:
# Examine regression coefficients


In [None]:
# Let's see how well this classifier did
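One way to check a classifier is accuracy plus a confusion matrix from `sklearn.metrics`. The sketch below uses eight hypothetical loans rather than the real predictions, with `True` meaning "repaid in full" as in the target coding above.

```python
import numpy as np
from sklearn import metrics

# Hypothetical outcomes for eight loans: True means repaid in full
actual    = np.array([True, True, True,  False, False, False, True, False])
predicted = np.array([True, True, False, True,  False, False, True, True])

# Rows are actual classes, columns are predicted classes, both ordered [False, True]
cm = metrics.confusion_matrix(actual, predicted, labels=[False, True])
tn, fp, fn, tp = cm.ravel()
accuracy = metrics.accuracy_score(actual, predicted)

print(cm)
print('accuracy:', accuracy)
```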


Our interest here is the people who were predicted to repay the loan, but actually didn't. These are the false positives, and there are a lot of them in this data set (because this data set is artificially enriched so that about half are defaulters). On the other hand, the false negatives represent people who we guessed would default, but who repaid.

Who are these people? We can get an idea by looking at their relative income distribution. First, here's everybody:

And here's the income of just the false positives:
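Selecting the false positives comes down to combining two boolean masks. This is a sketch on a hypothetical six-row frame; the column names `repaid` and `predicted` are made up for the example.

```python
import pandas as pd

# Hypothetical toy data: income in dollars, actual outcome, and model prediction
toy = pd.DataFrame({
    'annual_inc': [40000, 55000, 70000, 85000, 90000, 120000],
    'repaid':     [False, True,  True,  False, False, True],
    'predicted':  [False, False, True,  True,  True,  True],
})

# False positive: predicted to repay, but did not
false_pos = toy.predicted & ~toy.repaid
# False negative: predicted to default, but repaid
false_neg = ~toy.predicted & toy.repaid

median_all = toy.annual_inc.median()
median_fp = toy.annual_inc[false_pos].median()
print('median income, everybody:      ', median_all)
print('median income, false positives:', median_fp)
```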

The false positives have a higher income than the average, \$75,000 vs. \$70,000. This makes sense, as these are people who we thought would repay, and the model increases its predicted odds of repayment by 0.5% for every thousand dollars of income. This adds up, as you can see in the high income of those predicted to repay:
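As a rough check on how this adds up: in a logistic regression each extra unit of a feature multiplies the predicted odds by a fixed factor, so the effect compounds over larger income gaps. The sketch below takes the 0.5% figure quoted above as given; the helper `odds_ratio` is a hypothetical name.

```python
# Assumption: predicted odds of repayment rise 0.5% per $1000 of income, as quoted above
ODDS_MULTIPLIER_PER_1000 = 1.005

def odds_ratio(income_gap_thousands):
    """Multiplicative change in predicted odds of repayment for a given income gap."""
    return ODDS_MULTIPLIER_PER_1000 ** income_gap_thousands

print(odds_ratio(5))    # gap between $70K and $75K
print(odds_ratio(50))   # gap between $70K and $120K
```

A \$5K gap changes the odds by about 2.5%, while a \$50K gap changes them by roughly 28%.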

## 2. A more accurate classifier

Let's add more features and use a better classifier to see if we can better predict who will pay off their loans. This is what many lending companies are doing right now.

The features we'll add:
- loan_amnt: how much was loaned
- int_rate: The interest rate of the loan 
- grade: Lending Club's internal loan quality grade
- home_ownership: RENT, OWN, MORTGAGE, OTHER
- earliest_cr_line: The month the borrower's earliest reported credit line was opened
- emp_length: employment length in years


In [None]:
# To clean up this column, we'll just extract the year using a regex
loans.earliest_cr_line.head()
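A short sketch of the year extraction, assuming the column holds strings like `Jan-1985`. The sample values are hypothetical, not taken from the data.

```python
import pandas as pd

# Hypothetical sample values; the real column is assumed to look like 'Jan-1985'
sample_dates = pd.Series(['Jan-1985', 'Aug-2001', 'Dec-1999'])

# Capture the first run of four digits and convert it to an integer year
years = sample_dates.str.extract(r'(\d{4})', expand=False).astype(int)
print(years.tolist())
```

The `astype(int)` step would fail on any row with no four-digit match, so missing values would need handling first.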

In [None]:
emp_dict = {
     'n/a' : 0,
     '< 1 year' : 0.5,
     ' < 1 year' : 0.5,  # look, dirty data!
     '1 year' : 1,
     '2 years' : 2,
     '3 years' : 3,
     '4 years' : 4,
     '5 years' : 5,
     '6 years' : 6,
     '7 years' : 7,
     '8 years' : 8,
     '9 years' : 9,
     '10+ years' : 10
    }


features2 = pd.concat(
    [
        features,
        loans.loan_amnt,
        loans.int_rate,
        pd.get_dummies(loans.grade, prefix='grade'),
        pd.get_dummies(loans.home_ownership, prefix='home'),
        loans.emp_length.replace(emp_dict),
        loans.earliest_cr_line.str.extract(r'(\d\d\d\d)', expand=False).astype(int)  # raw string avoids an invalid-escape warning
    ],
    axis=1)
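An alternative to listing the dirty key separately is to strip whitespace before the lookup. This sketch uses a hypothetical sample of raw values and a shortened version of the dictionary. Note that it uses `map`, not the `replace` call in the cell above.

```python
import pandas as pd

# Hypothetical sample of raw values, including one with a stray leading space
raw = pd.Series(['< 1 year', ' < 1 year', '10+ years', 'n/a', '3 years'])

# Shortened mapping for illustration only
emp_years = {'n/a': 0, '< 1 year': 0.5, '1 year': 1, '3 years': 3, '10+ years': 10}

# Strip whitespace first, then look up each value in the dictionary
cleaned = raw.str.strip().map(emp_years)
print(cleaned.tolist())
```

With `map`, a value missing from the dictionary becomes `NaN`, whereas `replace` leaves it as the original string. That makes unmapped categories easier to spot.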


In [None]:
# Fit another logistic regression to the new features
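A sketch of comparing a model before and after adding features, again on synthetic data. The toy columns and the relationship used to generate `repaid` are assumptions; in the notebook the two fits would use `features` and `features2` against `target`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 1000
# Hypothetical toy columns: the first model sees only income, the second also sees the interest rate
toy = pd.DataFrame({
    'annual_inc': rng.normal(70, 20, n),
    'int_rate': rng.uniform(5, 25, n),
})
logit = 0.02 * (toy.annual_inc - 70) - 0.2 * (toy.int_rate - 15)
repaid = rng.random(n) < 1 / (1 + np.exp(-logit))

small = toy[['annual_inc']]
large = toy[['annual_inc', 'int_rate']]

# Fit one model per feature set and score each on its own training rows
acc_small = LogisticRegression(max_iter=1000).fit(small, repaid).score(small, repaid)
acc_large = LogisticRegression(max_iter=1000).fit(large, repaid).score(large, repaid)
print('accuracy change:', acc_large - acc_small)
```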


Well, we did it. We got a 5.3% accuracy increase from our big data efforts. In real life the impact of using big data could be substantial, either because the improvement is large percentage-wise, or because the lender has a large number of customers, so small changes affect many people.

The confusion matrix shows where the improved accuracy comes from:

In [None]:
# Change in false positive rate


In [None]:
# Change in false negative rate
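Both error rates can be read off the confusion matrix. The sketch below compares two hypothetical sets of predictions for the same eight loans; `error_rates` is a made-up helper, and "positive" means "repaid", matching the target coding.

```python
import numpy as np
from sklearn import metrics

def error_rates(actual, predicted):
    """Return (false positive rate, false negative rate), where positive means 'repaid'."""
    tn, fp, fn, tp = metrics.confusion_matrix(actual, predicted, labels=[False, True]).ravel()
    return fp / (fp + tn), fn / (fn + tp)

# Hypothetical outcomes and two sets of predictions for eight loans
actual   = np.array([True, True, True,  True,  False, False, False, False])
old_pred = np.array([True, True, False, False, True,  True,  False, False])
new_pred = np.array([True, True, True,  False, True,  False, False, False])

old_fpr, old_fnr = error_rates(actual, old_pred)
new_fpr, new_fnr = error_rates(actual, new_pred)
print('change in false positive rate:', new_fpr - old_fpr)
print('change in false negative rate:', new_fnr - old_fnr)
```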


This classifier reduces the number of both false negatives and false positives. False negatives are people who repaid their loan when the model said they would not, so a reduction in false negatives means more people get access to credit that they will successfully repay. On the other hand, a reduction in false positives means we are now denying credit to people who would have gotten it before.

Who are the people who we didn't think would repay, but now believe they will?
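These people can be picked out by comparing the two models' predictions row by row. A minimal sketch on a hypothetical five-row frame, with made-up column names `old_pred` and `new_pred`:

```python
import pandas as pd

# Hypothetical toy data: income, actual outcome, and both models' predictions
toy = pd.DataFrame({
    'annual_inc': [30000, 45000, 60000, 80000, 95000],
    'repaid':     [True,  False, True,  True,  False],
    'old_pred':   [False, False, False, True,  True],
    'new_pred':   [True,  True,  False, True,  False],
})

# Newly admitted: the old model predicted default, the new model predicts repayment
newly_admitted = ~toy.old_pred & toy.new_pred
# Newly denied: the old model predicted repayment, the new model predicts default
newly_denied = toy.old_pred & ~toy.new_pred

median_newly_admitted = toy.annual_inc[newly_admitted].median()
print(toy[newly_admitted])
print('median income of newly admitted:', median_newly_admitted)
```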


In this case the new classifier admitted a bunch of people who are below our average income (of \$70K), which is probably a socially desirable outcome, at least for the ones who are able to repay. How many of the newly admitted will default?
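Counting defaults among the newly admitted is one more mask on top of the comparison between models. A sketch with hypothetical flags for six applicants:

```python
import numpy as np

# Hypothetical flags for six applicants
repaid   = np.array([True,  False, True,  True, False, True])
old_pred = np.array([False, False, False, True, True,  False])
new_pred = np.array([True,  True,  False, True, False, True])

# Newly admitted: rejected by the old model, accepted by the new one
newly_admitted = ~old_pred & new_pred
# Of those, the ones who did not repay
newly_admitted_defaults = newly_admitted & ~repaid

n_admitted = int(newly_admitted.sum())
n_defaults = int(newly_admitted_defaults.sum())
print(f'{n_defaults} of {n_admitted} newly admitted borrowers defaulted')
```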

Whether *that* is socially desirable or not depends on what happens when someone defaults, which is a complicated question of law, finance, and power. There's only so far that technical analysis of machine learning can go before it hits the interface with the real world, where everything happens to real people.