### Logistic Regression in Statsmodels

In [1]:
import pandas as pd
import statsmodels.formula.api as sm

In [2]:
# Load in the dataset
df = pd.read_csv("https://s3.amazonaws.com/demo-datasets/wine.csv")
df.head()

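| Unnamed: 0 | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol | quality | color | is_red | high_quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | red | 1.0 | 0.0 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 | red | 1.0 | 0.0 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 | red | 1.0 | 0.0 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 | red | 1.0 | 0.0 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | red | 1.0 | 0.0 |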


#### Below, we fit a logistic regression model using the statsmodels formula API (patsy formula syntax)

The formula says that `high_quality` (coded as 1 or 0) depends on (`~`) the following attributes:
`residual_sugar`, `pH`, and `alcohol`


In [3]:
model = sm.logit(
    "high_quality ~ residual_sugar + pH + alcohol",
    data = df
).fit() # fit() estimates the model coefficients (by maximum likelihood), much as we would fit a linear regression model

model.summary() # Summary displays the output of the model

Optimization terminated successfully.
         Current function value: 0.418431
         Iterations 6


| | | | |
|---|---|---|---|
| Dep. Variable: | high_quality | No. Observations: | 6497.0 |
| Model: | Logit | Df Residuals: | 6493.0 |
| Method: | MLE | Df Model: | 3.0 |
| Date: | Tue, 10 Jan 2017 | Pseudo R-squ.: | 0.1557 |
| Time: | 12:54:29 | Log-Likelihood: | -2718.5 |
| converged: | True | LL-Null: | -3219.8 |
| | | LLR p-value: | 5.039e-217 |

| | coef | std err | z | P>\|z\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| Intercept | -11.7871 | 0.803 | -14.674 | 0.000 | -13.361 -10.213 |
| residual_sugar | 0.0471 | 0.009 | 5.441 | 0.000 | 0.030 0.064 |
| pH | 0.1419 | 0.217 | 0.654 | 0.513 | -0.283 0.567 |
| alcohol | 0.8946 | 0.031 | 28.600 | 0.000 | 0.833 0.956 |


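In the table above:
- `coef` holds the coefficient learned for each feature.
  For example, `0.8946` for `alcohol` is the change in the log odds of `high_quality` for a one-unit increase in alcohol, holding the other features fixed.
  Because this coefficient is positive, the likelihood of `high_quality` increases as alcohol content increases.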
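
To make the log-odds interpretation concrete, here is a minimal sketch that converts the `alcohol` coefficient into an odds ratio and turns the fitted coefficients into a predicted probability for the first row shown by `df.head()`. The coefficient values are copied by hand from the summary table (rounded as displayed), and the helper name `predict_probability` is hypothetical, not part of statsmodels.

```python
import math

# Coefficients copied from the summary table above (rounded as displayed).
coefs = {
    "Intercept": -11.7871,
    "residual_sugar": 0.0471,
    "pH": 0.1419,
    "alcohol": 0.8946,
}

# exp(coef) is the odds ratio: the factor by which the odds of high_quality
# are multiplied for a one-unit increase in alcohol, other features held fixed.
alcohol_odds_ratio = math.exp(coefs["alcohol"])


def predict_probability(residual_sugar, pH, alcohol):
    """Hypothetical helper: apply the logistic function to the linear predictor."""
    log_odds = (
        coefs["Intercept"]
        + coefs["residual_sugar"] * residual_sugar
        + coefs["pH"] * pH
        + coefs["alcohol"] * alcohol
    )
    return 1.0 / (1.0 + math.exp(-log_odds))


# First row of df.head(): residual_sugar=1.9, pH=3.51, alcohol=9.4
p_first_row = predict_probability(1.9, 3.51, 9.4)

print(f"odds ratio for alcohol: {alcohol_odds_ratio:.2f}")
print(f"predicted P(high_quality) for row 0: {p_first_row:.3f}")
```

In other words, each additional unit of alcohol multiplies the odds of `high_quality` by about 2.4. With the fitted results object, `model.params` should hold the unrounded coefficients and `model.predict(df)` should return these probabilities for every row.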

#### We can add interaction effects as well
- The `:` operator in patsy formula syntax adds an interaction term: `a:b` captures the joint effect of two variables (for numeric variables, their elementwise product)
- The `*` operator is shorthand: `a * b` expands to `a + b + a:b`, i.e. both original terms plus their interaction
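
The expansion rules above can be checked directly with patsy. This is a small sketch on toy data; the column names `a` and `b` and their values are made up for illustration and have nothing to do with the wine dataset.

```python
import numpy as np
from patsy import dmatrix

# Toy numeric data; `a` and `b` are hypothetical column names for illustration.
data = {
    "a": np.array([1.0, 2.0, 3.0, 4.0]),
    "b": np.array([0.0, 1.0, 0.0, 1.0]),
}

# `a * b` should expand to the two main effects plus their interaction.
star = dmatrix("a * b", data)
star_columns = star.design_info.column_names

# `a:b` alone adds only the interaction column (plus the default intercept).
colon = dmatrix("a:b", data)
colon_columns = colon.design_info.column_names

print(star_columns)
print(colon_columns)

# For numeric variables the interaction column is the elementwise product.
interaction = np.asarray(colon)[:, colon_columns.index("a:b")]
print(interaction)
```

Note that using `a:b` without `a` or `b` leaves the main effects out of the model, which changes how the interaction coefficient should be read.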

In [5]:
# Is the effect of sugar or alcohol on being high quality different in red or white wines?

model = sm.logit(
    "high_quality ~ residual_sugar*is_red + pH + alcohol:is_red",
    data = df
).fit() 

model.summary() # Summary displays the output of the model

# NOTE:
# The formula has no standalone alcohol term, so alcohol:is_red is the effect of
# alcohol for red wines only (is_red = 1); for white wines it contributes nothing.
#    alcohol:is_red    1.0347

Optimization terminated successfully.
         Current function value: 0.466917
         Iterations 7


| | | | |
|---|---|---|---|
| Dep. Variable: | high_quality | No. Observations: | 6497.0 |
| Model: | Logit | Df Residuals: | 6491.0 |
| Method: | MLE | Df Model: | 5.0 |
| Date: | Sat, 20 Aug 2016 | Pseudo R-squ.: | 0.05785 |
| Time: | 12:26:07 | Log-Likelihood: | -3033.6 |
| converged: | True | LL-Null: | -3219.8 |
| | | LLR p-value: | 2.473e-78 |

| | coef | std err | z | P>\|z\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| Intercept | -2.3488 | 0.686 | -3.423 | 0.001 | -3.694 -1.004 |
| residual_sugar | -0.0595 | 0.008 | -7.699 | 0.000 | -0.075 -0.044 |
| is_red | -12.3824 | 0.854 | -14.503 | 0.000 | -14.056 -10.709 |
| residual_sugar:is_red | 0.1341 | 0.055 | 2.418 | 0.016 | 0.025 0.243 |
| pH | 0.4434 | 0.212 | 2.095 | 0.036 | 0.029 0.858 |
| alcohol:is_red | 1.0347 | 0.075 | 13.764 | 0.000 | 0.887 1.182 |
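
One way to read the interaction coefficients is to work out the log-odds slope of each feature separately for white (`is_red = 0`) and red (`is_red = 1`) wines. The sketch below does that arithmetic with coefficients copied by hand from the table above; the helper name `slopes` is hypothetical. It relies on the fact that the formula includes `alcohol:is_red` but no standalone `alcohol` term.

```python
# Coefficients copied from the interaction model's summary table above.
coefs = {
    "Intercept": -2.3488,
    "residual_sugar": -0.0595,
    "is_red": -12.3824,
    "residual_sugar:is_red": 0.1341,
    "pH": 0.4434,
    "alcohol:is_red": 1.0347,
}


def slopes(is_red):
    """Hypothetical helper: log-odds slope of each feature for a given is_red (0 or 1)."""
    return {
        # Main effect plus the interaction, which only applies when is_red = 1.
        "residual_sugar": coefs["residual_sugar"]
        + coefs["residual_sugar:is_red"] * is_red,
        # No standalone alcohol term was in the formula, so the slope is 0 for whites.
        "alcohol": coefs["alcohol:is_red"] * is_red,
    }


white = slopes(0)
red = slopes(1)
print("white:", white)
print("red:  ", red)
```

Under this model the residual sugar slope is negative for white wines (-0.0595) and slightly positive for red wines (about 0.0746), while alcohol only affects the prediction for red wines. Adding a main effect, for example `alcohol*is_red`, would let alcohol matter for white wines too.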
