### Logistic Regression in Statsmodels

In [None]:
import pandas as pd
import statsmodels.formula.api as sm

In [None]:
# Load in the dataset
df = pd.read_csv("https://s3.amazonaws.com/demo-datasets/wine.csv")
df.head()

#### Below, we fit a logistic regression model using the statsmodels formula interface (patsy syntax)

The formula says that `high_quality` (coded as 1 or 0) DEPENDS ON (`~`) the following attributes:
 `residual_sugar`, `pH`, and `alcohol`


In [None]:
model = sm.logit(
    "high_quality ~ residual_sugar + pH + alcohol",
    data = df
).fit()  # fit() estimates the model's coefficients, as with a linear regression model

model.summary() # Summary displays the output of the model

In the table above:
- `coef` is the coefficient learned for each feature.
  For example, `0.8946` for alcohol is the change in the log odds of `high_quality` for a one-unit increase in alcohol, holding the other features fixed.
  Because this coefficient is positive, the odds of `high_quality` increase as alcohol content increases.
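
Log odds are hard to read directly, so a common step is to exponentiate a coefficient to get an odds ratio. The minimal sketch below does this for the `0.8946` alcohol coefficient quoted above. With a fitted model, `np.exp(model.params)` would do the same for every coefficient at once (assuming `numpy` is imported as `np`).

```python
import math

# Alcohol coefficient quoted from the summary table above (a change in log odds).
alcohol_coef = 0.8946

# Exponentiating a log-odds coefficient gives an odds ratio: the factor by which
# the odds of high_quality are multiplied per one-unit increase in alcohol.
odds_ratio = math.exp(alcohol_coef)

print(f"odds ratio for alcohol: {odds_ratio:.2f}")
```

An odds ratio above 1 means the odds rise with the feature. Here they are multiplied by roughly 2.4 per unit of alcohol, holding the other features fixed.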

#### We can add interaction effects as well
- The `:` operator in patsy formula syntax adds an interaction term: `a:b` captures the effect of two variables occurring together
- The `*` operator is shorthand: `a * b` expands to `a + b + a:b`, i.e. both original terms plus their interaction
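
To see what the two operators produce, one option is to build the design matrix directly with patsy's `dmatrix` and inspect its column names. This is only a sketch: the `toy` data below is made up for illustration and is not taken from the wine dataset.

```python
import numpy as np
from patsy import dmatrix

# Tiny made-up dataset, for illustration only (not rows from the wine data).
toy = {
    "residual_sugar": np.array([1.5, 6.0, 2.0, 10.0]),
    "is_red": np.array([1, 0, 1, 0]),
}

# `a * b` yields both main effects plus the interaction column.
full = dmatrix("residual_sugar * is_red", toy)
print(full.design_info.column_names)

# `a:b` alone yields only the interaction column (plus the intercept).
interaction_only = dmatrix("residual_sugar:is_red", toy)
print(interaction_only.design_info.column_names)

# For two numeric columns, the interaction column is their elementwise product.
print(np.asarray(full)[:, -1])
```

Because `is_red` is coded 0/1 here, the interaction column equals `residual_sugar` for red wines and 0 for white wines.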

In [None]:
# Is the effect of sugar or alcohol on being high quality different in red or white wines?

model = sm.logit(
    "high_quality ~ residual_sugar*is_red + pH + alcohol:is_red",
    data = df
).fit() 

model.summary() # Summary displays the output of the model

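# NOTE:
# alcohol:is_red is positive, so higher alcohol raises the log odds of high_quality for red wines.
# This formula has no main effect for alcohol, so (assuming is_red is coded 0/1) the term
# describes alcohol's effect in red wines only, not an increase on top of a baseline alcohol effect.
#    alcohol:is_red    1.0347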