# Logistic Regression

Logistic regression is the classic method for binary classification. It assumes the log-odds of the positive class are linear in the parameters. The maximum-likelihood estimate has no closed-form solution, so it is computed with iterative methods. Sklearn's `LogisticRegression` applies a penalty by default, so replicating results from R's `glm` function is most direct with statsmodels.
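
To make the "no closed form, solved iteratively" point concrete, here is a minimal sketch of one such iterative approach, Newton's method (also known as iteratively reweighted least squares). It is an illustration on synthetic data, not the routine that statsmodels or sklearn runs internally; the function name `fit_logistic_irls` and the demo arrays are hypothetical.

```python
import numpy as np

def fit_logistic_irls(X, y, n_iter=25, tol=1e-8):
    """Unregularized logistic regression via Newton's method (IRLS sketch)."""
    X = np.column_stack([np.ones(len(X)), X])  # prepend an intercept column
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta                     # log-odds, linear in beta
        p = 1.0 / (1.0 + np.exp(-eta))     # predicted probabilities
        w = p * (1.0 - p)                  # per-observation weights
        grad = X.T @ (y - p)               # gradient of the log-likelihood
        hess = X.T @ (X * w[:, None])      # negative Hessian of the log-likelihood
        step = np.linalg.solve(hess, grad) # Newton step
        beta = beta + step
        if np.max(np.abs(step)) < tol:     # stop once the update is negligible
            break
    return beta

# Synthetic data (assumed for illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(2000, 2))
true_beta = np.array([0.5, 1.0, -2.0])
p_true = 1.0 / (1.0 + np.exp(-(true_beta[0] + X_demo @ true_beta[1:])))
y_demo = rng.binomial(1, p_true)

beta_hat = fit_logistic_irls(X_demo, y_demo)
print(beta_hat)
```

Each iteration solves a weighted least-squares problem, which is why no single closed-form expression exists: the weights depend on the current estimate.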

## Pulling Data and Imports 

In [5]:
import pandas as pd
import statsmodels.api as sm
from sklearn import linear_model
%pylab inline

Populating the interactive namespace from numpy and matplotlib


In [6]:
x_data = pd.read_csv('C:\\Users\\smcdo\\OneDrive\\Documents\\Model_Framework\\Benchmarks\\x_data.csv', index_col=0)
# read_csv's squeeze=True argument was removed in pandas 2.0; squeeze the single column into a Series instead
y_data = pd.read_csv('C:\\Users\\smcdo\\OneDrive\\Documents\\Model_Framework\\Benchmarks\\y_data.csv', index_col=0).squeeze('columns')

### Converting to binary decision

In [7]:
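# Note: y_class is an alias of y_data, not a copy, so y_data is binarized in place as well
y_class = y_data
y_class[y_class < 0] = 0  # negative values map to class 0 (exact zeros stay 0)
y_class[y_class > 0] = 1  # positive values map to class 1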

## Statsmodels Implementation (unregularized logistic regression)

In [21]:
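X = sm.add_constant(x_data)  # add an intercept column named 'const'
model = sm.GLM(y_class, X, family=sm.families.Binomial())  # binomial family uses the logit link by default
fit = model.fit()
pred = model.predict(params=fit.params, exog=X)  # fitted probabilities, not class labels
# fit.params are coefficients on the log-odds scale; exponentiating them gives odds ratios
print('Estimated Log-Odds: %s' % fit.params)
print('Estimated Odds-Ratio: %s' % [np.exp(x) for x in fit.params])
print('Fitted Values: %s' % pred)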

Estimated Log-Odds: const     0.235086
0         3.122619
1       -14.401699
2       -21.903897
3         1.755968
4       -24.613408
dtype: float64
Estimated Odds-Ratio: [1.2650171329308617, 22.705763084782294, 5.5644391269448954e-07, 3.070847668091682e-10, 5.7890476127334631, 2.0442439444063939e-11]
Fitted Values: [  4.84390764e-18   1.00000000e+00   1.00000000e+00 ...,   9.12941021e-12
   2.01635794e-03   1.00000000e+00]
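
As a small worked check of how the two printed vectors relate, the sketch below exponentiates the coefficients shown above (copied by hand, rounded to six decimals) to recover the odds ratios, and passes the intercept through the logistic function. The all-zeros observation is a hypothetical example, chosen only because the intercept is then the whole linear predictor.

```python
import numpy as np

# Coefficients copied from the printed output above (const first), rounded to six decimals
log_odds = np.array([0.235086, 3.122619, -14.401699, -21.903897, 1.755968, -24.613408])

# Odds ratio = exp(coefficient): multiplicative change in the odds per unit increase in a feature
odds_ratios = np.exp(log_odds)

def sigmoid(z):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical observation with every feature equal to zero:
# the linear predictor reduces to the intercept alone
baseline_prob = sigmoid(log_odds[0])

print(odds_ratios)
print(baseline_prob)
```

A negative coefficient gives an odds ratio below 1, so the very small odds ratios in the output correspond to the large negative coefficients.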


## Sklearn Implementation (regularized logistic regression)

In [20]:
# A large C means a weak L2 penalty, which approximates the unregularized fit
model = linear_model.LogisticRegression(penalty='l2', C=10000, intercept_scaling=1)
model.fit(x_data, y_class)
pred = model.predict(x_data)  # predicted class labels (0/1), not probabilities
print('Estimated Log-Odds: %s' % model.intercept_)
print('Estimated Log-Odds: %s' % model.coef_)
print('Fitted Values: %s' % pred)

Estimated Log-Odds: [ 0.23456251]
Estimated Log-Odds: [[  3.10921935 -14.34058538 -21.81093979   1.74851154 -24.50901498]]
Fitted Values: [ 0.  1.  1. ...,  0.  0.  1.]
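
The fitted values here are class labels, whereas the statsmodels fitted values are probabilities. The sketch below, on synthetic data assumed purely for illustration, shows how the two relate in sklearn: `predict_proba` returns probabilities, and `predict` corresponds to thresholding the positive-class probability at 0.5.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic, non-separable data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 2))
logits = 0.5 + X_demo @ np.array([1.0, -2.0])
y_demo = rng.binomial(1, 1.0 / (1.0 + np.exp(-logits)))

# Large C means a weak L2 penalty (the default penalty type)
clf = LogisticRegression(C=10000, max_iter=1000)
clf.fit(X_demo, y_demo)

labels = clf.predict(X_demo)               # hard 0/1 class labels
probs = clf.predict_proba(X_demo)[:, 1]    # probability of class 1
labels_from_probs = (probs > 0.5).astype(int)

print((labels == labels_from_probs).all())
```

To compare against the statsmodels fitted values directly, the probabilities from `predict_proba` are the like-for-like quantity.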


## Machine Learning Interface

### No Penalization (Statsmodels)

In [29]:
from Machine_Learning_Interface import logistic_regression as lr
model = lr.LogisticRegression(intercept=False, scale=False, cv_folds=None, penalized=False, prob=False)
model.fit(x_data,y_data)
pred = model.predict(x_data)
model.diagnostics()
print('Fitted Values: %s' % pred)

Fitted Values: 0       0
1       1
2       1
3       1
4       1
5       0
6       1
7       1
8       0
9       1
10      1
11      0
12      0
13      1
14      0
15      0
16      1
17      0
18      1
19      1
20      1
21      0
22      1
23      1
24      1
25      1
26      0
27      1
28      1
29      1
       ..
9970    0
9971    1
9972    0
9973    0
9974    1
9975    1
9976    0
9977    0
9978    1
9979    0
9980    1
9981    0
9982    0
9983    1
9984    1
9985    1
9986    1
9987    1
9988    1
9989    1
9990    1
9991    0
9992    1
9993    1
9994    1
9995    0
9996    0
9997    0
9998    0
9999    1
Name: predictions, dtype: int64


### Penalization and CV (Sklearn Implementation)

In [30]:
from Machine_Learning_Interface import logistic_regression as lr
model = lr.LogisticRegression(intercept=False, scale=False, cv_folds=3, penalized=True, prob=False)
model.fit(x_data,y_data)
pred = model.predict(x_data)
model.diagnostics()
print('Fitted Values: %s' % pred)

Fitted Values: 0       0.0
1       1.0
2       1.0
3       1.0
4       1.0
5       0.0
6       1.0
7       1.0
8       0.0
9       1.0
10      1.0
11      0.0
12      0.0
13      1.0
14      0.0
15      0.0
16      1.0
17      0.0
18      1.0
19      1.0
20      1.0
21      0.0
22      1.0
23      1.0
24      1.0
25      1.0
26      0.0
27      1.0
28      1.0
29      1.0
       ... 
9970    0.0
9971    1.0
9972    0.0
9973    0.0
9974    1.0
9975    1.0
9976    0.0
9977    0.0
9978    1.0
9979    0.0
9980    1.0
9981    0.0
9982    0.0
9983    1.0
9984    1.0
9985    1.0
9986    1.0
9987    1.0
9988    1.0
9989    1.0
9990    1.0
9991    0.0
9992    1.0
9993    1.0
9994    1.0
9995    0.0
9996    0.0
9997    0.0
9998    0.0
9999    1.0
Name: predictions, dtype: float64
