# Unit 2 Lesson 4: Logistic Regression

### Estimated time: 2 - 3 hours

As you learned in the previous lesson, **linear regression** models a continuous response as a function of its inputs, but many variables are categorical rather than continuous. **Logistic regression** takes the same basic idea, that one variable can influence another, and applies it to outcomes that take discrete values such as "yes" or "no", or "overweight" or "underweight".

In this lesson, you'll learn the fundamentals of logistic regression, how to build a logistic regression model in Python, and how to analyze the model output.

## Goals

* Learn the fundamentals of logistic regression models.
* Build and analyze a logistic regression model in Python.


# Unit 2 Lesson 4 Assignment 1: Logistic Regression Overview

#### Estimated time: 30 minutes

As with linear regression, logistic regression models the relationship between dependent and independent variables. This time we are concerned with categorical variables (e.g. yes or no, heads or tails), and logistic regression helps us predict the likelihood of events occurring. Essentially, we want to know what the odds are that we'll win instead of lose. In order to understand the odds that logistic regression deals with, you should know about odds ratios.

**Odds ratios** compare the probability of one thing happening to the probability of another thing happening. If someone tells you that the odds of winning are 1:4, that means there is 1 part chance of winning and 4 parts chance of losing. There are 5 parts in total, so the probability of winning is 1/5 = 0.2 = 20% and the probability of losing is 4/5 = 0.8 = 80%. (Remember that if the probability of winning is p, then the probability of losing is 1-p.) Depending on the ratio of the probability of winning to the probability of losing, we have more or less confidence in the outcome.
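As a quick check of the arithmetic above, here is a minimal sketch that converts odds of the form a:b into a probability. The function name `odds_to_probability` is made up for this illustration.

```python
def odds_to_probability(parts_for, parts_against):
    """Convert odds of parts_for:parts_against into the probability of the event."""
    total_parts = parts_for + parts_against
    return parts_for / total_parts

p_win = odds_to_probability(1, 4)  # odds of winning are 1:4
p_lose = 1 - p_win                 # losing is the complement of winning
print(p_win, p_lose)
```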

Odds ratios should be familiar if you know about sports betting odds or a little about casinos. For more information, watch the video on odds ratios and risk ratios [https://www.youtube.com/watch?v=hOtoV2Kjb0o] and read this tutorial about odds and exponents [http://www.restore.ac.uk/srme/www/fac/soc/wie/research-new/srme/modules/mod4/2/].

# Unit 2 Lesson 4 Assignment 2: Data Cleaning

We're going to be using the same data that we used in the previous two lessons here. You can use the cleaned-up data provided in the first lesson, or the data you cleaned up yourself. If you want to use the data you cleaned yourself, simply add the following line to your 'linear_regression.py' script:

`loansData.to_csv('loansData_clean.csv', header=True, index=False)`

We only need to do a few more things to the data to get it ready for logistic regression.

* Create a new file called 'logistic_regression.py'. For this lesson, we're going to need `pandas` and `statsmodels`.
* Load the data.


In [42]:
import pandas as pd
import statsmodels.api as sm  # the api module exposes sm.Logit
df = pd.read_csv("../u2_l3/loansData_clean.csv")

In [5]:
df.head()

Unnamed: 0,Amount.Requested,Amount.Funded.By.Investors,Interest.Rate,Loan.Length,Loan.Purpose,Debt.To.Income.Ratio,State,Home.Ownership,Monthly.Income,FICO.Range,Open.CREDIT.Lines,Revolving.CREDIT.Balance,Inquiries.in.the.Last.6.Months,Employment.Length,FICO.Score
0,20000,20000,8.90%,36 months,debt_consolidation,14.90%,SC,MORTGAGE,6541.67,735-739,14,14272,2,< 1 year,735
1,19200,19200,12.12%,36 months,debt_consolidation,28.36%,TX,MORTGAGE,4583.33,715-719,12,11140,1,2 years,715
2,35000,35000,21.98%,60 months,debt_consolidation,23.81%,CA,MORTGAGE,11500.0,690-694,14,21977,1,2 years,690
3,10000,9975,9.99%,36 months,debt_consolidation,14.30%,KS,MORTGAGE,3833.33,695-699,10,9346,0,5 years,695
4,12000,12000,11.71%,36 months,credit_card,18.78%,NJ,RENT,3195.0,695-699,11,14469,0,9 years,695


* Add a column to your dataframe indicating whether the interest rate is below 12%. This is a derived column, created from the interest rate column and named `IR_TF`. It contains binary values: `0` when the interest rate is < 12% and `1` when the interest rate is ≥ 12%.

In [48]:
df["IR_TF"] = pd.Series([1 if float(rate[:-1]) >= 12 else 0 for rate in df["Interest.Rate"]])

* Do some spot checks to make sure that it worked.

In [50]:
df[df['Interest.Rate'] == '12.12%'].head() 

Unnamed: 0,Amount.Requested,Amount.Funded.By.Investors,Interest.Rate,Loan.Length,Loan.Purpose,Debt.To.Income.Ratio,State,Home.Ownership,Monthly.Income,FICO.Range,Open.CREDIT.Lines,Revolving.CREDIT.Balance,Inquiries.in.the.Last.6.Months,Employment.Length,FICO.Score,IR_TF
1,19200,19200,12.12%,36 months,debt_consolidation,28.36%,TX,MORTGAGE,4583.33,715-719,12,11140,1,2 years,715,1
16,10000,10000,12.12%,36 months,debt_consolidation,17.72%,CA,RENT,9000.0,695-699,18,20317,0,7 years,695,1
31,14000,14000,12.12%,36 months,debt_consolidation,14.93%,CA,MORTGAGE,10583.33,685-689,9,35457,0,2 years,685,1
49,14000,14000,12.12%,36 months,debt_consolidation,11.38%,NY,MORTGAGE,4500.0,705-709,22,18583,0,5 years,705,1
83,12000,12000,12.12%,36 months,debt_consolidation,18.62%,VA,RENT,5833.33,690-694,16,18838,0,10+ years,690,1


In [51]:
df[df['Interest.Rate'] == '9.99%'].head() # IR_TF should be 0 in every row

Unnamed: 0,Amount.Requested,Amount.Funded.By.Investors,Interest.Rate,Loan.Length,Loan.Purpose,Debt.To.Income.Ratio,State,Home.Ownership,Monthly.Income,FICO.Range,Open.CREDIT.Lines,Revolving.CREDIT.Balance,Inquiries.in.the.Last.6.Months,Employment.Length,FICO.Score,IR_TF
3,10000,9975.0,9.99%,36 months,debt_consolidation,14.30%,KS,MORTGAGE,3833.33,695-699,10,9346,0,5 years,695,0
19,5200,5175.0,9.99%,60 months,debt_consolidation,10.29%,AL,MORTGAGE,3750.0,760-764,10,16094,0,< 1 year,760,0
57,6000,6000.0,9.99%,36 months,other,7.50%,FL,MORTGAGE,2625.0,715-719,4,5167,0,10+ years,715,0
445,5000,4947.35,9.99%,60 months,credit_card,10.80%,CA,RENT,6400.0,730-734,8,5783,0,< 1 year,730,0
506,15000,10825.0,9.99%,60 months,debt_consolidation,11.95%,AZ,MORTGAGE,1958.33,745-749,8,3584,1,2 years,745,0
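The same spot checks can be written as assertions on the rule itself. This is a minimal sketch using plain Python rather than the dataframe; the helper name `rate_to_flag` is hypothetical and simply mirrors the rule used to build `IR_TF`.

```python
def rate_to_flag(rate_text, cutoff=12.0):
    """Return 1 if a rate string such as '12.12%' is at or above cutoff, else 0."""
    rate = float(rate_text.rstrip('%'))  # drop the trailing percent sign
    return 1 if rate >= cutoff else 0

# The two rates inspected above, plus the boundary case.
assert rate_to_flag('12.12%') == 1
assert rate_to_flag('9.99%') == 0
assert rate_to_flag('12.00%') == 1
```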


* Statsmodels needs an intercept column in your dataframe, so add a column with a constant intercept of 1.0.

In [52]:
df['Intercept'] = 1.0  # constant column of 1.0 for the intercept term

* Create a list of the column names of our independent variables, including the intercept, and call it ind_vars.

In [82]:
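# Only the two predictors are listed here; 'Intercept' is left out, so the model below is fitted without an intercept term.
ind_vars = ['FICO.Score', 'Amount.Funded.By.Investors']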

# Unit 2 Lesson 4 Project 3: Logistic Regression Analysis

#### Estimated time: 1 - 2 hours

In this lesson, we're going to be looking at the same data as the last lesson, but we're going to ask a different question, one that has a binary outcome: What is the probability of getting a loan from the Lending Club for $10,000 at an interest rate ≤ 12% with a FICO score of 750?

To do this, we're going to use the logit function, or log odds function, which is derived from the concept of odds ratios and gives (as the name suggests) the log of the odds: logit(p) = log(p/(1-p)). We can then fit a linear model of how the log odds vary with some variable. Instead of our familiar function y = mx + b, we now have log(p/(1-p)) = mx + b.

Calculating odds ratios can be kind of awkward, but we can solve directly for p as a function of some variable x:

p(x) = 1/(1 + e^(-(mx + b)))

This is the logistic function. When plotted, the standard logistic function looks like this:

<img src="logistic_curve.svg" height="400" width="400">
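Solving log(p/(1-p)) = z for p gives p = 1/(1 + e^(-z)), where z stands for mx + b. The following minimal sketch checks numerically that the two functions undo each other; the function names are chosen for this illustration only.

```python
import math

def logit(p):
    """Log odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1 - p))

def logistic(z):
    """Inverse of logit: map a log-odds value z back to a probability."""
    return 1 / (1 + math.exp(-z))

for p in (0.2, 0.5, 0.7):
    z = logit(p)
    # The last column should reproduce the first one.
    print(p, round(z, 4), round(logistic(z), 4))
```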

A probability isn't a binary outcome, though, so we need to choose a probability threshold: at or above it we predict that we will get the loan, and below it we predict that we won't (i.e. a cutoff point on the logistic curve that separates the two predicted outcomes).

Let's say a probability of less than 70% means we won't get the loan. In other words, we're not confident that we'll get the loan until we have a 7/10 chance of getting it. To state this more explicitly: if p ≥0.70, then 1, else 0.

We start with a model of how the interest rate varies with FICO score and the loan amount desired:

`interest_rate = b + a1(FICOScore) + a2(LoanAmount)`

Plugging in the values of the problem:

`interest_rate = b + a1(750) + a2(10000)`

Finally, we need to determine the probability p that the interest rate will be less than or equal to 12%. Once we fit a linear model to interest rate, FICO Score, and Loan Amount, we can plug that linear equation into the logistic function above to determine p. If p is ≥ 0.70, then we predict that we will get the loan (1), and if p < 0.70, we predict that we won't get the loan (0).

### 1 Define the logistic regression model.
    
`logit = sm.Logit(df['IR_TF'], df[ind_vars])` 

In [41]:
df.columns

Index(['Amount.Requested', 'Amount.Funded.By.Investors', 'Interest.Rate',
       'Loan.Length', 'Loan.Purpose', 'Debt.To.Income.Ratio', 'State',
       'Home.Ownership', 'Monthly.Income', 'FICO.Range', 'Open.CREDIT.Lines',
       'Revolving.CREDIT.Balance', 'Inquiries.in.the.Last.6.Months',
       'Employment.Length', 'FICO.Score', 'Interest.Rate.Binary', 'Intercept'],
      dtype='object')

In [83]:
from statsmodels.discrete.discrete_model import Logit
logit = Logit(df['IR_TF'], df[ind_vars])

### 2 Fit the model.
    
`result = logit.fit()`

In [84]:
result = logit.fit()

Optimization terminated successfully.
         Current function value: 0.644095
         Iterations 4


### 3 Get the fitted coefficients from the results.

 `coeff = result.params`
 
 `print(coeff)`
 
 This gives the coefficient of each independent (i.e., predictor) variable.

In [85]:
coeff = result.params
print(coeff)

FICO.Score                   -0.000721
Amount.Funded.By.Investors    0.000075
dtype: float64


### 4 Using these coefficients, what is the linear part of our predictor?
    
`interest_rate = −60.125 + 0.087423(FicoScore) − 0.000174(LoanAmount)`

In [60]:
# linear predictor = -0.000721(FicoScore) + 0.000075(LoanAmount), using the coefficients printed above (no intercept was fitted)

### 5 What is our logistic function?

`p(x) = 1/(1 + e^(intercept + 0.087423(FicoScore) − 0.000174(LoanAmount)))`

### 6 Write a function called `logistic_function` that takes a FICO score and a loan amount, evaluates the linear predictor, and returns p. (Try not to hardcode any values if you can! Hint: pass the coefficients object to the function as an argument.)

In [133]:
from math import e
def logistic_function(score, amount, coeff):
    a1, a2 = coeff  # coefficients for FICO.Score and Amount.Funded.By.Investors, in that order
    # The 1.0 is a hard-coded constant, not a fitted intercept: ind_vars did not include the Intercept column.
    p = 1/(1 + e**(1.0 + a1*(score) - a2*(amount)))
    return p
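For comparison, here is a minimal sketch of the standard form, p = 1/(1 + e^(-z)), where z is the fitted linear predictor. It assumes the coefficients can be looked up by name (as with a pandas Series or a dict) and treats a missing `Intercept` entry as 0. The name `logistic_from_coeff` and the example coefficient values are illustrative only, not fitted results.

```python
from math import exp

def logistic_from_coeff(score, amount, coeff):
    """Return p = 1 / (1 + e^(-z)), where z is the linear predictor."""
    # Use the fitted intercept if one exists; otherwise assume 0.
    intercept = coeff['Intercept'] if 'Intercept' in coeff else 0.0
    z = (intercept
         + coeff['FICO.Score'] * score
         + coeff['Amount.Funded.By.Investors'] * amount)
    return 1 / (1 + exp(-z))

# Hypothetical coefficients, chosen only to exercise the function.
example_coeff = {'Intercept': 0.0, 'FICO.Score': 0.0, 'Amount.Funded.By.Investors': 0.0}
print(logistic_from_coeff(720, 10000, example_coeff))  # all-zero coefficients give p = 0.5
```

Because nothing is hardcoded, the same function works whether or not the model was fitted with an intercept column.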

### 7 Determine the probability that we can obtain a loan at ≤12% Interest for $10,000 with a FICO score of 720 using this function.

In [134]:
logistic_function(720, 10000, coeff)

0.56637705030307972

### 8 Is p above or below 0.70? Do you predict that we will or won't obtain the loan?

Below. 

Won't.

### 9 Now think critically: does your prediction make sense given the data? Try plotting the data to see whether the prediction is visible. If you cannot see the correlation in the plot, you might have to re-evaluate your logistic function. The example plot below, created by one of our data science mentors, compares two different equations for the logistic regression. Which one makes more sense?

<img src="logit.png" height="400px" width="400px">

### 10 If you're feeling really adventurous, you can create a new function `pred` to predict whether or not we'll get the loan automatically.
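One possible shape for `pred` is sketched below. It assumes the probability has already been computed and applies the 0.70 threshold from earlier in the lesson; the signature is an assumption, since the lesson does not prescribe one.

```python
def pred(p, threshold=0.70):
    """Return 1 (we predict the loan) if p meets the threshold, else 0."""
    return 1 if p >= threshold else 0

print(pred(0.566))  # below the 0.70 threshold
print(pred(0.70))   # a probability exactly at the threshold counts as 1
```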

## Submission

Push your version of the "logistic_regression.py" script to GitHub and enter the link below.