# <span style="color:#F26835; font-size:34px;" >Logistic Regression</span>
###### <span style="color:#A1A1A1;"> Students: Sigurður Baldursson, Þórhildur Þorleiksdóttir </span>
###### <span style="color:#A1A1A1;">Instructor: Magnús Eðvald Björnsson </span>

In [32]:
# Bokeh for plotting
from bokeh.plotting import figure, show, ColumnDataSource, output_notebook
from bokeh.models.tools import HoverTool, BoxZoomTool, CrosshairTool, Tool
import numpy as np

output_notebook()

## Logit function
The logit is a link function: instead of modelling the binary response Y directly, we model a function of its mean, P = P(Y = 1).

Because Y is a binary categorical variable, we want to predict a probability between 0 and 1. The logit maps such a probability onto the whole real line, so it can be set equal to a linear combination of the predictors. Its inverse maps the linear combination back to a value between 0 and 1.

In the formula below, the predictor $\beta_{0} + \beta_{1}X_{1}$ of simple logistic regression generalizes to multiple logistic regression as the linear combination of independent variables $\beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2} + \dotso + \beta_{k}X_{k}$



$ \ln \bigg( \dfrac{P}{1 - P} \bigg) = \beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2} + \dotso + \beta_{k}X_{k} $
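
As a quick numerical illustration of the formula above, the following minimal sketch computes the log-odds for a few probabilities. The helper name `logit` and the sample probabilities are assumptions made for illustration; they are not part of the notebook's own cells.

```python
import numpy as np

def logit(p):
    """Log-odds ln(p / (1 - p)) for probabilities strictly between 0 and 1."""
    p = np.asarray(p, dtype=float)
    return np.log(p / (1.0 - p))

# Hypothetical sample probabilities, chosen only for illustration
probabilities = np.array([0.1, 0.5, 0.9])
log_odds = logit(probabilities)
print(log_odds)
```

A probability of 0.5 corresponds to log-odds of 0, and probabilities that are mirror images around 0.5 (here 0.1 and 0.9) give log-odds of equal size and opposite sign.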




## Let's see the logit in action

In [37]:
p_x = np.arange(0.01,0.9999,0.01).tolist()

r_y = [np.log(i/(1-i)) for i in p_x]

logit_source = ColumnDataSource(
        data=dict(
            x=p_x, 
            y=r_y,
        )
    )
logit_plot = figure(plot_width=400, plot_height=400, 
              x_axis_label = "Probability P", 
              y_axis_label = "Log-odds",
              title="Logit function: ln(P/(1-P))",
                x_range=(0,1),
                # symmetric range so both tails of the curve stay visible
                y_range=(-6,6)
            )
              
logit_plot.line('x','y', source = logit_source )

show(logit_plot)

The curve heads toward $-\infty$ as P approaches 0 and toward $+\infty$ as P approaches 1.

We want the probability P itself as the dependent variable, so we invert the logit.

Exponentiating both sides of the logit equation and solving for P gives the estimated regression equation.

$ \dfrac{{P}} {1 - P} = e^{\beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2} + \dotso + \beta_{k}X_{k}}  $ 

$ P = e^{\beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2} + \dotso + \beta_{k}X_{k}} \big( {1 - P} \big) $ 


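Collecting the terms in $P$ on the left-hand side gives

$ P \big( 1 + e^{\beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2} + \dotso + \beta_{k}X_{k}} \big) = e^{\beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2} + \dotso + \beta_{k}X_{k}} $

so we end up with the estimated regression equation

$ \widehat{P} = \dfrac{e^{\beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2} + \dotso + \beta_{k}X_{k}}} {1 + e^{\beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2} + \dotso + \beta_{k}X_{k}}} $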
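
To check the derivation numerically, here is a minimal sketch that evaluates the estimated regression equation for one predictor and confirms that applying the logit to the result recovers the linear predictor. The coefficient values `beta_0` and `beta_1`, the predictor values, and the helper names are hypothetical, chosen only for illustration.

```python
import numpy as np

def logit(p):
    """Log-odds of a probability p."""
    return np.log(p / (1.0 - p))

def inverse_logit(a):
    """e^a / (1 + e^a): maps any real number to a value between 0 and 1."""
    return np.exp(a) / (1.0 + np.exp(a))

# Hypothetical coefficients and predictor values (not estimated from data)
beta_0, beta_1 = -3.0, 0.01
x = np.array([100.0, 300.0, 500.0])

linear_predictor = beta_0 + beta_1 * x
p_hat = inverse_logit(linear_predictor)

# Applying the logit to the predicted probabilities recovers the linear predictor
round_trip = logit(p_hat)
print(p_hat)
print(np.allclose(round_trip, linear_predictor))
```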



## Plot for inverse logit


$ \operatorname{logit}^{-1}(a) = \dfrac{e^{a}} {1 + e^{a}} $





In [47]:
p_x = np.arange(-6,6,0.1).tolist()

r_y = [np.exp(i)/(1+np.exp(i)) for i in p_x]

logit_source = ColumnDataSource(
        data=dict(
            x=p_x, 
            y=r_y,
        )
    )
logit_plot = figure(plot_width=600, plot_height=400, 
              x_axis_label = "a (log-odds)", 
              y_axis_label = "Probability",
              title="Inverse logit function: e^a/(1+e^a)",
                x_range=(-6,6),
                y_range=(-1,2)
            )
              
logit_plot.line('x','y', source = logit_source )

show(logit_plot)


## <span style="color:#B19B7D;"> A simple example calculated </span>

Each regression coefficient represents the change in the logit (the log-odds) for a one-unit change in the corresponding predictor (i.e. independent variable), holding the other predictors fixed.
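
The interpretation above can be sketched numerically: increasing the predictor by one unit adds $\beta_{1}$ to the log-odds, which multiplies the odds by $e^{\beta_{1}}$. The coefficients below are hypothetical values chosen for illustration, not estimates from any data in this notebook.

```python
import numpy as np

# Hypothetical coefficients, for illustration only
beta_0, beta_1 = -3.0, 0.01

def odds(x):
    """Odds P / (1 - P) implied by the model at predictor value x."""
    p = 1.0 / (1.0 + np.exp(-(beta_0 + beta_1 * x)))
    return p / (1.0 - p)

# Compare the odds at two predictor values one unit apart
odds_ratio = odds(251.0) / odds(250.0)
log_odds_change = np.log(odds(251.0)) - np.log(odds(250.0))

print(odds_ratio, np.exp(beta_1))
print(log_odds_change, beta_1)
```

The same ratio is obtained at any starting value of the predictor, which is why $e^{\beta_{1}}$ is usually reported as the odds ratio for that predictor.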

The regression coefficients for logistic regression are estimated by maximum likelihood estimation (MLE). The likelihood has no closed-form maximizer, so it is maximized iteratively; the `Logit` class in statsmodels does this for us.
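
To give a feel for what such an iterative fit involves, here is a minimal sketch of Newton's method for the logistic log-likelihood, written with plain NumPy. It is not statsmodels' implementation: the function name `fit_logistic_newton`, the synthetic data, and the "true" coefficients are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption): an intercept column plus one predictor
n = 200
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
true_beta = np.array([-0.5, 1.5])          # hypothetical "true" coefficients
p_true = 1.0 / (1.0 + np.exp(-X @ true_beta))
y = rng.binomial(1, p_true)

def fit_logistic_newton(X, y, n_iter=25, tol=1e-10):
    """Maximize the logistic log-likelihood with Newton's method."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))          # current fitted probabilities
        gradient = X.T @ (y - p)                      # gradient of the log-likelihood
        weights = p * (1.0 - p)
        information = X.T @ (X * weights[:, None])    # negative Hessian
        step = np.linalg.solve(information, gradient)
        beta = beta + step
        if np.max(np.abs(step)) < tol:                # stop once updates are negligible
            break
    return beta

beta_hat = fit_logistic_newton(X, y)
print(beta_hat)
```

At the maximum the gradient of the log-likelihood is zero, which is a convenient way to check that the iteration has converged.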

In [34]:
## Variables ##
# Sample size ("urtak" is Icelandic for "sample")
urtak_size = 50

# Independent variable (credit score) and the binary outcome for each individual
# TODO: read real, non-random data from a file instead of generating it
x_past_credit_scores = np.random.randint(100, 400, size = urtak_size).tolist()
y_zeroes_and_ones = np.random.randint(2, size = urtak_size).tolist()
print(type(x_past_credit_scores))
####################

print(x_past_credit_scores)
print(y_zeroes_and_ones)
binary_source = ColumnDataSource(
        data=dict(
            x=x_past_credit_scores,
            y=y_zeroes_and_ones
        )
    )


plot = figure(plot_width=800, plot_height=400,
              y_axis_label = "Outcome (0 or 1)",
              x_axis_label = "Credit score", 
              title="Credit Scores of "+ str(urtak_size) + " individuals on a scatter plot",
              x_range=(0,440))

plot.circle('x', 'y', size=20, source = binary_source)

show(plot)


<class 'list'>
[346, 205, 176, 341, 339, 279, 288, 141, 141, 304, 394, 263, 197, 325, 281, 342, 122, 292, 167, 133, 397, 278, 205, 299, 178, 275, 251, 397, 168, 152, 209, 186, 301, 160, 385, 166, 298, 214, 151, 153, 184, 331, 117, 110, 224, 275, 368, 228, 168, 310]
[0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0]


In [35]:
import pandas as pd
import statsmodels.api as sm
import pylab as pl
import numpy as np


x_past_credit_scores = np.random.randint(100, 400, size = urtak_size)
y_zeroes_and_ones = np.random.randint(2, size = urtak_size)
labels = ["accepted",'creditscore']
# read the data in
#df = pd.read_csv("http://www.ats.ucla.edu/stat/data/binary.csv")
# named 'records' to avoid shadowing the built-in input()
records = list(zip(y_zeroes_and_ones,x_past_credit_scores))
df = pd.DataFrame.from_records(records,columns=labels)
print(df.head())

print()
print(df.accepted.value_counts())


   accepted  creditscore
0         0          394
1         0          343
2         0          371
3         0          179
4         0          362

1    25
0    25
Name: accepted, dtype: int64


In [36]:
# sm.Logit in statsmodels does not add an intercept automatically,
# so we add an explicit intercept column beforehand

df['intercept'] = 1.00

# every column except the response 'accepted' is used as a predictor
train_cols = df.columns[1:]
# Index(['creditscore', 'intercept'], dtype='object')

# Logit.fit() uses Newton's method by default for maximum likelihood estimation (MLE), (see print(result.mle_settings) )

logit = sm.Logit(df['accepted'], df[train_cols])


result = logit.fit()
print(result.summary())


Optimization terminated successfully.
         Current function value: 0.693127
         Iterations 3
                           Logit Regression Results                           
Dep. Variable:               accepted   No. Observations:                   50
Model:                          Logit   Df Residuals:                       48
Method:                           MLE   Df Model:                            1
Date:                Fri, 24 Feb 2017   Pseudo R-squ.:               2.877e-05
Time:                        14:57:53   Log-Likelihood:                -34.656
converged:                       True   LL-Null:                       -34.657
                                        LLR p-value:                    0.9644
                  coef    std err          z      P>|z|      [95.0% Conf. Int.]
-------------------------------------------------------------------------------
creditscore     0.0001      0.003      0.045      0.964        -0.006     0.006
intercept      -0.0351    