Logistic Regression - Overview
===========
***

### What are the odds that an event will happen? Answering yes/no questions.

Often we have to resolve questions with binary or yes/no outcomes.

For example:

* _Does a patient have cancer?_

* _Will a team win the next game?_

* _Will the customer buy my product?_

* _Will I get the loan?_


## A familiar example

We start by plotting something familiar from the real world, even if we have never actually plotted it before.
Say the x-axis is tumor size and the y-axis is the outcome of a cancer test, recorded as 0 (no cancer) or 1 (cancer).  

Then a plot for these scores might look like this:


<img src="Images/class_prob.jpg" width="70%">

So, how do we predict whether a patient has cancer, given the tumor size?  
Clearly linear regression is not a good model.  
Take a look at this plot of a "best fit" line over the points:



<img src="Images/class_prob2.jpg" width="70%">
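
One way to see the problem concretely is to fit a straight line to 0/1 labels and look at what it predicts. The sketch below uses made-up tumor sizes (not data from this notebook) and `np.polyfit` as a stand-in for the "best fit" line. For sizes near the edges, the line returns values below 0 and above 1, which cannot be read as probabilities.

```python
import numpy as np

# Made-up tumor sizes (arbitrary units) and 0/1 outcomes, for illustration only
size = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])
has_cancer = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])

# Least-squares straight line: has_cancer ~ slope * size + intercept
slope, intercept = np.polyfit(size, has_cancer, deg=1)

# Evaluate the line at a very small and a fairly large tumor size
pred_small = slope * 0.0 + intercept
pred_large = slope * 15.0 + intercept
print(pred_small, pred_large)
```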

### How do we model this sort of data best?

We need a better way to model our data.  
We are going to do this in two steps.

First, we will just pull a function out of the data science bag of tricks and show that it works reasonably well.

And, second, we are going to understand how we came up with that function and how it is related to binary outcomes and odds.
But before that let's understand this a bit better.

This function needs to take the value 0 for no cancer and 1 for cancer.
To make sense, it needs to be (close to) 0 for some score and all scores below it, and (close to) 1 for some other score and all scores above it, increasing smoothly from 0 to 1 in the intermediate range.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.arange(-7, 7, 0.1)
phi_z = sigmoid(z)

plt.plot(z, phi_z)
plt.axvline(0.0, color='k')
plt.ylim(-0.1, 1.1)
plt.xlabel('z')
plt.ylabel(r'$\phi (z)$')  # raw string so that \p is not treated as an escape sequence

# y axis ticks and gridline
plt.yticks([0.0, 0.5, 1.0])
ax = plt.gca()
ax.yaxis.grid(True)

plt.tight_layout()
# plt.savefig('./figures/sigmoid.png', dpi=300)
plt.show()
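
The link between this curve and odds can be stated briefly. This is standard background, not something derived from the notebook's data. If $p$ is the probability of the event, the odds are $p/(1-p)$. Logistic regression assumes that the logarithm of the odds equals a linear score $z$; solving for $p$ gives the function plotted above:

$$\log\frac{p}{1-p} = z \;\Longrightarrow\; \frac{p}{1-p} = e^{z} \;\Longrightarrow\; p = \frac{e^{z}}{1+e^{z}} = \frac{1}{1+e^{-z}} = \phi(z)$$

At $z = 0$ the odds are 1 (an even chance) and $\phi(0) = 0.5$, which is where the curve crosses the vertical line in the plot.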

## Linear Regression on Loan Data


In [3]:
import numpy as np
import pandas as pd

In [4]:
# import the cleaned up dataset
df = pd.read_csv('./Datasets/loanf.csv')
df.head()

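| Unnamed: 0 | Interest.Rate | FICO.Score | Loan.Length | Monthly.Income | Loan.Amount |
|---|---|---|---|---|---|
| 6 | 15.31 | 670 | 36 | 4891.67 | 6000 |
| 11 | 19.72 | 670 | 36 | 3575.0 | 2000 |
| 12 | 14.27 | 665 | 36 | 4250.0 | 10625 |
| 13 | 21.67 | 670 | 60 | 14166.67 | 28000 |
| 21 | 21.98 | 665 | 36 | 6666.67 | 22000 |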


Can we predict interest rates from the given loan details?

### Assumptions

We take FICO Score and Loan Amount as predictors of Interest Rate for the Lending Club sample of 2,500 loans.

We use multiple linear regression to model how Interest Rate varies with FICO Score and Loan Amount, using:

$$InterestRate = a_0 + a_1 * FICOScore + a_2 * LoanAmount$$

We're going to use modeling software to generate the model coefficients $a_0$, $a_1$ and $a_2$ and then some error estimates that we'll only touch upon lightly at this point. 


In [5]:
X = df[['FICO.Score', 'Loan.Amount']].values
y = df['Interest.Rate'].values

# make y a column vector; -1 lets NumPy infer the row count (2,500 here)
y = y.reshape(-1, 1)
X.shape
y.shape

(2500, 1)

In [6]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X,y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
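
Once a model is fitted, the coefficients $a_0$, $a_1$ and $a_2$ can be read from its `intercept_` and `coef_` attributes. The sketch below is self-contained and runs on synthetic data generated from invented coefficients (not the Lending Club file), so that the recovered values can be compared with known ones. The names `X_demo` and `demo_model` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data; the coefficients below are invented for illustration
rng = np.random.default_rng(0)
fico = rng.uniform(640, 830, size=200)
amount = rng.uniform(1000, 35000, size=200)
true_a0, true_a1, true_a2 = 70.0, -0.09, 0.0002
rate = true_a0 + true_a1 * fico + true_a2 * amount  # noise-free on purpose

X_demo = np.column_stack([fico, amount])
demo_model = LinearRegression().fit(X_demo, rate)

a0 = demo_model.intercept_   # intercept a_0
a1, a2 = demo_model.coef_    # slopes a_1 (FICO score) and a_2 (loan amount)
print(a0, a1, a2)
```

In the notebook itself `y` was reshaped to a column, so `model.coef_` has shape `(1, 2)` and `model.intercept_` has shape `(1,)` rather than the flat values shown here.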

In [8]:
import warnings
warnings.filterwarnings('ignore')

In [9]:
def InterestRatePredictor(FICO, Loan):
    # predict() expects a 2-D array: one row per sample, one column per feature
    rate = np.ravel(model.predict([[FICO, Loan]]))[0]
    print('Interest Rate for loan amount %d with FICO score %d is: %.2f Percent' % (Loan, FICO, rate))

In [10]:
# interact now lives in the separate ipywidgets package (formerly IPython.html.widgets)
from ipywidgets import interact
from IPython.display import display
i = interact(InterestRatePredictor, FICO=(665,800), Loan=(6000,22000))

Interest Rate for loan amount 20768 with FICO score 794 is: 7.04 Percent


# Moving to Logistic Regression

##### _Can we get a loan, from the Lending Club, of 10,000 dollars at 12 per cent or less, with a FICO Score of 720?_

### Methods

How do we use Logistic Regression here? Let's recast the problem as follows:

##### _What is the probability of getting a Loan, from the Lending Club, of 10,000 dollars at 12 per cent or less with a FICO Score of 720?_  

We then decide that a probability below 0.67 means we won't get the loan and a probability above 0.67 means we will. In other words, we are not confident until we have roughly a 2/3 chance of getting it.

In reality we can set the threshold higher, say 0.8, if we want to be "more certain" that it will happen, but for this exercise we'll just say 0.67.
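
The decision rule described above can be sketched as a small function. The name `decide` and its default are illustrative, and treating a probability exactly equal to the threshold as a rejection is an assumption, since the text does not say what happens at equality.

```python
def decide(probability, threshold=0.67):
    """Return True when the predicted probability clears the threshold."""
    # Assumption: a probability exactly equal to the threshold counts as "no"
    return probability > threshold


print(decide(0.70))        # checked against the 0.67 threshold
print(decide(0.70, 0.80))  # same probability, stricter threshold
```

Raising the threshold makes the rule more conservative: the same predicted probability can flip from "yes" to "no".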


From the initial discussion, we start with a model of the form

$$InterestRate = a_0 + a_1 * FICOScore + a_2 * LoanAmount$$

and then derive a second equation of the form:

Z = Prob (InterestRate at most 12 percent).

We apply this to the existing dataset and create a Logistic Regression Model using modeling software.
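
To make the two-step idea concrete, the sketch below checks that, for a two-class problem, the probability returned by scikit-learn's `LogisticRegression` is the sigmoid of a linear score built from the fitted intercept and coefficients. It runs on synthetic data (not the Lending Club file), and the names `X_demo`, `y_demo` and `clf` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: two features and a boolean label
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 2))
noise = rng.normal(scale=0.5, size=300)
y_demo = (X_demo[:, 0] - 0.5 * X_demo[:, 1] + noise) > 0

clf = LogisticRegression().fit(X_demo, y_demo)

# Linear score z = a_0 + a_1 * x_1 + a_2 * x_2 for every row
z = clf.intercept_[0] + X_demo @ clf.coef_[0]

# Passing z through the sigmoid gives the probability of the True class
p_manual = 1.0 / (1.0 + np.exp(-z))
p_sklearn = clf.predict_proba(X_demo)[:, 1]  # column order follows clf.classes_
print(np.allclose(p_manual, p_sklearn))
```

The columns of `predict_proba` follow `clf.classes_`, so with a boolean target column 0 is the probability of `False` and column 1 the probability of `True`.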

### Analysis

As with the Linear Regression Model, we use the cleaned up Lending Club data set as input.

In [11]:
import pandas as pd
dfr = pd.read_csv('./Datasets/loanf.csv')
dfr.head()

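| Unnamed: 0 | Interest.Rate | FICO.Score | Loan.Length | Monthly.Income | Loan.Amount |
|---|---|---|---|---|---|
| 6 | 15.31 | 670 | 36 | 4891.67 | 6000 |
| 11 | 19.72 | 670 | 36 | 3575.0 | 2000 |
| 12 | 14.27 | 665 | 36 | 4250.0 | 10625 |
| 13 | 21.67 | 670 | 60 | 14166.67 | 28000 |
| 21 | 21.98 | 665 | 36 | 6666.67 | 22000 |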


In [12]:
# we add a column which indicates (True/False) whether the interest rate is <= 12 
dfr['TF']=dfr['Interest.Rate']<=12
# inspect again
dfr.head()
# we see that the TF values are False as Interest.Rate is higher than 12 in all these cases

| Unnamed: 0 | Interest.Rate | FICO.Score | Loan.Length | Monthly.Income | Loan.Amount | TF |
|---|---|---|---|---|---|---|
| 6 | 15.31 | 670 | 36 | 4891.67 | 6000 | False |
| 11 | 19.72 | 670 | 36 | 3575.0 | 2000 | False |
| 12 | 14.27 | 665 | 36 | 4250.0 | 10625 | False |
| 13 | 21.67 | 670 | 60 | 14166.67 | 28000 | False |
| 21 | 21.98 | 665 | 36 | 6666.67 | 22000 | False |


In [13]:
# now we check the rows that have interest rate == 10 (just some number < 12)
# this is just to confirm that the TF value is True where we expect it to be
d = dfr[dfr['Interest.Rate']==10]
d.head()
# all is well

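| Unnamed: 0 | Interest.Rate | FICO.Score | Loan.Length | Monthly.Income | Loan.Amount | TF |
|---|---|---|---|---|---|---|
| 650 | 10.0 | 700 | 36 | 3250.0 | 2800 | True |
| 204 | 10.0 | 715 | 36 | 15416.67 | 6000 | True |
| 440 | 10.0 | 730 | 36 | 6250.0 | 21000 | True |
| 521 | 10.0 | 715 | 36 | 5000.0 | 12000 | True |
| 1017 | 10.0 | 735 | 60 | 4000.0 | 5000 | True |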


In [14]:
X = dfr[['FICO.Score', 'Loan.Amount']].values
y = dfr['TF'].values

In [15]:
from sklearn.linear_model import LogisticRegression


In [16]:
model = LogisticRegression()
model.fit(X,y)
print('Logistic Regression model')

Logistic Regression model


In [19]:
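def Loan_Approvar(FICO,Loan):
    # predict_proba() expects a 2-D array; column 1 is the probability that TF is True
    a=model.predict_proba([[FICO,Loan]])
    # use the 0.67 threshold chosen in the Methods section
    if (a[0,1]>0.67):
        print('Loan Approved')
    else:
        print('Loan Rejected')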

In [20]:
i = interact(Loan_Approvar, FICO=(665,800), Loan=(6000,22000))

Loan Rejected


In [21]:
Index=3

# X[[Index]] keeps the row 2-D (shape (1, 2)), as scikit-learn expects
a=model.predict_proba(X[[Index]])
print(a)

b=model.predict(X[[Index]])
print(b)

[[ 0.82738871  0.17261129]]
[False]
