# Logistic regression

Despite its name, logistic regression is a classification algorithm rather than a regression algorithm. It models the probability that an observation belongs to each class.

## Import libraries and load data

In [None]:
import pandas as pd
import numpy as np
from matplotlib.gridspec import GridSpec
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LogisticRegression

# from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
# from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
# from sklearn.metrics import confusion_matrix, classification_report, precision_score
# from sklearn import preprocessing
# from sklearn import neighbors

from statsmodels.formula.api import logit

%matplotlib inline

In [None]:
data_url = "https://github.com/pykale/transparentML/raw/main/data/Default.csv"
df = pd.read_csv(data_url)

# Note: factorize() returns two objects: a label array and an array with the unique values.
# We are only interested in the first object.
df["default2"] = df.default.factorize()[0]
df["student2"] = df.student.factorize()[0]
df.head(3)

## Logistic model

Logistic regression models the probability that the response `y` belongs to a particular category, rather than modelling `y` directly. Instead of fitting a straight line or hyperplane, it uses the logistic function to squeeze the output of a linear equation between 0 and 1:

$$
\textrm{logistic}(x) = \frac{1}{1 + e^{-x}}
$$
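As a minimal sketch, the logistic function can be written in a few lines of NumPy. The function name `logistic` and the sample inputs are illustrative choices, not part of the original notebook.

```python
import numpy as np


def logistic(x):
    """Map any real number (or array of numbers) into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


# Large negative inputs approach 0, zero maps to 0.5, large positive inputs approach 1.
values = logistic(np.array([-5.0, 0.0, 5.0]))
print(values)
```

The output is monotonically increasing in the input, which is what allows a linear score to be read as a probability.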

The step from linear regression to logistic regression is straightforward. In the simple linear regression model, we modelled the relationship between the outcome and a feature with a linear equation:

$$
y = \beta_0 + \beta_1 x_1.
$$

For classification, we prefer probabilities between 0 and 1, so we wrap the right-hand side of the equation in the logistic function. This forces the output to take values only between 0 and 1:

$$
\mathbb{P}(y = 1 | x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1)}}.
$$

Run the code cell below to see a logistic curve fitted to the `Default` data, with default status plotted against balance.

In [None]:
sns.regplot(data=df, x="balance", y="default2", logistic=True)
plt.ylabel("Probability of default")
plt.xlabel("Balance")
plt.show()

Let $p(x) = \mathbb{P}(y=1| x)$. Rearranging the logistic model gives:

\begin{align}
\begin{aligned}
\frac{1-p(x)}{p(x)} &= e^{-(\beta_0 + \beta_1 x_1)} \\
\ln \left( \frac{1-p(x)}{p(x)} \right) &= -(\beta_0 + \beta_1 x_1) \\
\ln \left( \frac{p(x)}{1-p(x)} \right) &= \beta_0 + \beta_1 x_1.
\end{aligned}
\end{align}

The left-hand side is called the _log odds_ or _logit_.
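The derivation above says that the log odds undo the logistic function and recover the linear predictor. A small numerical check, using hypothetical coefficients `beta_0` and `beta_1` chosen only for illustration, might look like this:

```python
import numpy as np


def logistic(z):
    """Logistic function: maps a linear score to a probability."""
    return 1.0 / (1.0 + np.exp(-z))


def log_odds(p):
    """Log odds (logit) of a probability p in (0, 1)."""
    return np.log(p / (1.0 - p))


# Hypothetical coefficients and inputs, for illustration only.
beta_0, beta_1 = -1.0, 0.5
x_demo = np.array([-2.0, 0.0, 3.0])

linear = beta_0 + beta_1 * x_demo   # right-hand side of the last equation
p_demo = logistic(linear)           # p(x) from the logistic model
recovered = log_odds(p_demo)        # left-hand side: ln(p / (1 - p))

print(np.allclose(recovered, linear))
```

The helper is named `log_odds` rather than `logit` so that it does not shadow the `logit` function imported from `statsmodels` earlier in this notebook.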

## Estimating the coefficients and making predictions

The coefficients of a logistic regression model can be estimated by maximum likelihood estimation. The likelihood function is

$$
\mathcal{L}(\beta_0, \beta_1) = \prod_{i:y_i=1} \mathbb{P}(y_i = 1) \prod_{i:y_i=0} (1 - \mathbb{P}(y_i = 1)).
$$

The mathematical details are beyond the scope of this course. Please read Section 4.3 of the [Pattern Recognition and Machine Learning](https://www.microsoft.com/en-us/research/publication/pattern-recognition-machine-learning/) book for more details of the optimization for logistic regression.
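To make the likelihood concrete, the sketch below evaluates its logarithm for a tiny made-up data set. The arrays `x_toy` and `y_toy` and the two candidate coefficient pairs are assumptions for illustration; in practice a library's optimiser searches for the maximum.

```python
import numpy as np


def log_likelihood(beta_0, beta_1, x_obs, y_obs):
    """Log of the likelihood above for a single-feature logistic model."""
    p = 1.0 / (1.0 + np.exp(-(beta_0 + beta_1 * x_obs)))
    # Observations with y = 1 contribute ln(p); observations with y = 0 contribute ln(1 - p).
    return np.sum(y_obs * np.log(p) + (1 - y_obs) * np.log(1 - p))


# Made-up toy data: larger x tends to go with y = 1.
x_toy = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0])
y_toy = np.array([0, 0, 1, 0, 1, 1])

# With both coefficients at zero, every predicted probability is 0.5.
ll_flat = log_likelihood(0.0, 0.0, x_toy, y_toy)
# A pair whose probabilities increase with x fits this data better.
ll_better = log_likelihood(-3.5, 2.0, x_toy, y_toy)

print(ll_flat, ll_better)
```

Maximum likelihood estimation picks the coefficients with the largest (log-)likelihood, so the second pair would be preferred over the first.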

To make predictions, take the `Default` data as an example: the probability of default given balance, as predicted by a logistic regression model, is

$$
\mathbb{P}(\text{default} = \text{Yes} \mid \text{balance}, \beta_0, \beta_1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 \times \text{balance})}}.
$$

In practice, we can use the `scikit-learn` or `statsmodels` package to fit a logistic regression model and make predictions. Run the following code to fit a logistic regression model using the `scikit-learn` and `statsmodels` packages, respectively.

**Example of `scikit-learn`**

In [None]:
clf = LogisticRegression(solver="newton-cg")
X_train = df.balance.values.reshape(-1, 1)
X_test = np.arange(df.balance.min(), df.balance.max()).reshape(-1, 1)
y = df.default2
clf.fit(X_train, y)
print(clf)
print("classes: ", clf.classes_)
print("coefficients: ", clf.coef_)
print("intercept :", clf.intercept_)

In [None]:
prob = clf.predict_proba(X_test)

# Plot the training data and the fitted probability curve
plt.scatter(X_train, y, color="orange")
plt.plot(X_test, prob[:, 1], color="lightblue")

plt.hlines(
    1,
    xmin=plt.gca().xaxis.get_data_interval()[0],
    xmax=plt.gca().xaxis.get_data_interval()[1],
    linestyles="dashed",
    lw=1,
)
plt.hlines(
    0,
    xmin=plt.gca().xaxis.get_data_interval()[0],
    xmax=plt.gca().xaxis.get_data_interval()[1],
    linestyles="dashed",
    lw=1,
)

plt.hlines(
    0.5,
    xmin=plt.gca().xaxis.get_data_interval()[0],
    xmax=plt.gca().xaxis.get_data_interval()[1],
    linestyles="dashed",
    lw=1,
)

plt.vlines(
    -clf.intercept_ / clf.coef_[0],
    ymin=plt.gca().yaxis.get_data_interval()[0],
    ymax=plt.gca().yaxis.get_data_interval()[1],
    linestyles="dashed",
    lw=1,
)

plt.hlines(
    (clf.predict_proba(np.asarray(1500).reshape(-1, 1)))[0][1],
    xmin=plt.gca().xaxis.get_data_interval()[0],
    xmax=1500,
    linestyles="dashed",
    lw=1,
)

plt.vlines(
    1500,
    ymin=plt.gca().yaxis.get_data_interval()[0],
    ymax=(clf.predict_proba(np.asarray(1500).reshape(-1, 1)))[0][1],
    linestyles="dashed",
    lw=1,
)

plt.ylabel("Probability of default")
plt.xlabel("Balance")
plt.yticks([0, 0.25, 0.5, 0.75, 1.0])
plt.xlim(xmin=-100)
plt.show()

In the above example, we learnt a logistic regression model $f(x)$ with two parameters, $\beta_0$ and $\beta_1$, from the data, where
- $\beta_0 = -10.65133019$
- $\beta_1 = 0.00549892$

Using these two estimated parameters, we can examine the system logic of the logistic regression model to reveal its system transparency.

```{admonition} System transparency
:class: important

- When `balance` = 1500, the predicted probability of `default` is

$$ f(1500) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 \times 1500)}} = \frac{1}{1 + e^{-(-10.65133019 + 0.00549892 \times 1500)}} = 0.08294763. $$

- When the probability of default is 0.5, the corresponding balance can be calculated using the _log odds_ equation above:

\begin{align}
\begin{aligned}
\ln \left(\frac{p(x)}{1-p(x)}\right) &= \beta_0 + \beta_1 \times x \\
\ln \left(\frac{0.5}{1-0.5}\right) &= \beta_0 + \beta_1 \times x \\
x & = \frac{- \beta_0}{\beta_1} \\
x & = \frac{- (-10.65133019)}{0.00549892} \\
x & = 1936.9858426745614.
\end{aligned}
\end{align}

Therefore, to produce result `Yes`, i.e. the probability of `default` $>$ 0.5, the `balance` should be greater than 1936.9858426745614. By contrast, to produce result `No`, i.e. the probability of `default` $<$ 0.5, the `balance` should be less than 1936.9858426745614.
```
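The two calculations in the admonition can be checked numerically. The sketch below hard-codes the coefficients reported above, so its results agree with the fitted model only up to rounding; the function name `predicted_default_probability` is an illustrative choice.

```python
import numpy as np

# Coefficients copied from the fitted scikit-learn model reported above.
beta_0 = -10.65133019
beta_1 = 0.00549892


def predicted_default_probability(balance):
    """Probability of default for a given balance under the fitted model."""
    return 1.0 / (1.0 + np.exp(-(beta_0 + beta_1 * balance)))


# Predicted probability at balance = 1500.
p_1500 = predicted_default_probability(1500)

# Balance at which the predicted probability equals 0.5 (the log odds are zero).
threshold_balance = -beta_0 / beta_1

print(p_1500)
print(threshold_balance)
```

Balances above `threshold_balance` give a probability greater than 0.5 because $\beta_1$ is positive.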

**Example of `statsmodels`**

In [None]:
est = logit("default2 ~ balance", df).fit()
est.summary2().tables[1]

In [None]:
est = logit("default2 ~ student", df).fit()
est.summary2().tables[1]

## Multiple logistic regression

The logistic regression model can be extended to multiple features. A generalised logistic regression model for an instance $\mathbf{x} = [x_1, x_2, \dots, x_D]^\top$ is

$$
\ln \left( \frac{p(\mathbf{x})}{1-p(\mathbf{x})} \right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_D x_D.
$$
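With several features, the linear predictor becomes a dot product between the feature vector and the coefficient vector. The following sketch uses hypothetical values (`intercept_demo`, `coefficients_demo`, and `instances` are invented for illustration and are not fitted to the `Default` data):

```python
import numpy as np


def predict_probability(features, intercept, coefficients):
    """Probability p(x) for each row of `features` under a multiple logistic model."""
    linear_score = intercept + features @ coefficients  # beta_0 + sum_d beta_d * x_d
    return 1.0 / (1.0 + np.exp(-linear_score))


# Hypothetical parameters for D = 3 features.
intercept_demo = -1.0
coefficients_demo = np.array([0.8, -0.5, 0.3])

# Three hypothetical instances, one per row.
instances = np.array(
    [
        [1.0, 2.0, 0.0],
        [0.0, 0.0, 0.0],
        [2.0, 0.0, 1.0],
    ]
)

probs = predict_probability(instances, intercept_demo, coefficients_demo)
print(probs)
```

An instance whose features are all zero has log odds equal to the intercept, which is one way to interpret $\beta_0$.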

Run the following code to fit a multiple logistic regression model using the `scikit-learn` and `statsmodels` packages, respectively.

**Example of `scikit-learn`**

In [None]:
import warnings

warnings.filterwarnings("ignore")

X_train = df.loc[:, ["balance", "income", "student2"]]
y = df.default2

# penalty=None disables regularisation (scikit-learn >= 1.2; older releases use penalty="none")
clf = LogisticRegression(solver="newton-cg", penalty=None, max_iter=1000)
clf.fit(X_train, y)
print(clf)
print("classes: ", clf.classes_)
print("coefficients: ", clf.coef_)
print("intercept :", clf.intercept_)

**Example of `statsmodels`**, where the learnt parameters should match those of the `scikit-learn` example up to numerical precision.

In [None]:
est = logit("default2 ~ balance + income + student", df).fit()
est.summary2().tables[1]