### Logistic Regression in Statsmodels

In [None]:
import pandas as pd
import statsmodels.formula.api as sm  # formula interface; often aliased as smf elsewhere

In [None]:
# Load in the dataset
df = pd.read_csv("https://s3.amazonaws.com/demo-datasets/wine.csv")
df.head()

#### Below, we fit a logistic regression model using the statsmodels formula interface (formulas are parsed by patsy)

The formula says that `high_quality` (coded as 1 or 0) DEPENDS on (~) the following attributes:
 `residual_sugar`, `pH`, and `alcohol`.
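
Written out, the formula corresponds to the model below. This is a sketch of the standard logistic regression form; the symbols $p$ and $\beta_0, \dots, \beta_3$ are notation introduced here for illustration and do not appear in the statsmodels output, which labels each coefficient by its variable name.

$$
\log\frac{p}{1-p} = \beta_0 + \beta_1\,\text{residual\_sugar} + \beta_2\,\text{pH} + \beta_3\,\text{alcohol},
\qquad p = P(\text{high\_quality} = 1)
$$

The left-hand side is the log-odds of a wine being high quality, so each coefficient is a change in log-odds per one-unit change in its feature, with the other features held fixed.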


In [None]:
model = sm.logit(
    "high_quality ~ residual_sugar + pH + alcohol",
    data = df
).fit()

In [None]:
model.summary()

In [None]:
import math

In [None]:
math.exp(1)

In [None]:
math.exp(0.8946) ## <-- this is e ^ 0.8946

## As alcohol increases by 1 unit (residual_sugar and pH held fixed),
## we expect the ODDS that a wine is classified as "high-quality"
## to be multiplied by about 2.446.
## This is the ODDS RATIO associating quality of wine to
## alcohol content.

## e ^ 0.8946 is the ODDS RATIO
## 0.8946 is the LOG-ODDS RATIO

In the table above:
- `coef` represents the coefficient learned for each feature. For example, `0.8946` for alcohol is the change in the log-odds of `high_quality` per one-unit increase in alcohol, holding the other features fixed. After calculating $\exp\{0.8946\} \approx 2.45$, we note that as alcohol content increases by one unit, the odds of `high_quality` are multiplied by about 2.45.
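
An odds ratio multiplies the odds, not the probability, so the change in probability depends on where a wine starts. The sketch below illustrates this with the alcohol coefficient quoted above; the baseline probabilities and the helper `probability_after_one_unit` are hypothetical and chosen only for illustration.

```python
import math

# Alcohol coefficient as quoted in the text (a log-odds ratio).
log_odds_ratio = 0.8946
odds_ratio = math.exp(log_odds_ratio)


def probability_after_one_unit(p):
    """Probability of high_quality after a one-unit increase in alcohol,
    starting from baseline probability p (other features held fixed)."""
    odds = p / (1 - p)              # convert probability to odds
    new_odds = odds * odds_ratio    # the odds ratio scales the odds
    return new_odds / (1 + new_odds)  # convert back to a probability


print("odds ratio:", round(odds_ratio, 3))

# Hypothetical baseline probabilities, not taken from the dataset.
for p in (0.05, 0.20, 0.50):
    print(p, "->", round(probability_after_one_unit(p), 3))
```

The same odds ratio gives a different probability change at each baseline, which is why the coefficient is best read as a statement about odds.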

#### We can add interaction effects as well
- The `:` operator in patsy formula syntax adds an interaction term: `a:b` lets the effect of `a` depend on the value of `b`
- The `*` operator expands as follows: `a * b` expands to `a + b + a:b`, that is, both original terms plus their interaction, so the next two models are the same model written two ways
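
To make the expansion concrete, here is a minimal sketch of the columns that `a * b` produces when both variables are numeric. It uses plain Python lists rather than patsy itself, and the values of `a` and `b` are made up for illustration. For categorical variables the interaction is built from indicator columns instead, which this sketch does not cover.

```python
# Hypothetical numeric values standing in for two predictors.
a = [1.0, 2.0, 3.0]
b = [10.0, 20.0, 30.0]


def interaction(x, y):
    """Column for `x:y` when both variables are numeric: the row-wise product."""
    return [xi * yi for xi, yi in zip(x, y)]


# `a * b` is shorthand for `a + b + a:b`; an intercept is included by default.
design = {
    "Intercept": [1.0] * len(a),
    "a": a,
    "b": b,
    "a:b": interaction(a, b),
}

for name, column in design.items():
    print(name, column)
```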

In [None]:
model = sm.logit(
    "high_quality ~ residual_sugar * alcohol",
    data = df
).fit()

In [None]:
model.summary()

In [None]:
model = sm.logit(
    "high_quality ~ residual_sugar + alcohol + residual_sugar:alcohol",
    data = df
).fit()

In [None]:
model.summary()

In [None]:
model = sm.logit(
    "high_quality ~ color",
    data = df
).fit()

In [None]:
model.summary()

In [None]:
model = sm.logit(
    "high_quality ~ color : quality",
    data = df
).fit()