# Logistic regression

Despite its name, logistic regression is a classification algorithm rather than a regression algorithm. It models the probability that an observation belongs to each class.

## Import libraries and load data

In [None]:
import pandas as pd
import numpy as np
from matplotlib.gridspec import GridSpec
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LogisticRegression

# from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
# from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
# from sklearn.metrics import confusion_matrix, classification_report, precision_score
# from sklearn import preprocessing
# from sklearn import neighbors

from statsmodels.formula.api import logit

%matplotlib inline

In [None]:
data_url = "https://github.com/pykale/transparentML/raw/main/data/Default.csv"
df = pd.read_csv(data_url)

# Note: factorize() returns two objects: a label array and an array with the unique values.
# We are only interested in the first object.
df["default2"] = df.default.factorize()[0]
df["student2"] = df.student.factorize()[0]
df.head(3)

## Logistic model

Logistic regression is a method for classification. Rather than modelling the response `y` directly, it models the probability that `y` belongs to a particular category. Specifically, instead of fitting a straight line or hyperplane, the logistic regression model uses the logistic function to squeeze the output of a linear equation between 0 and 1. The logistic function is defined as

$$
\textrm{logistic}(x) = \frac{1}{1 + e^{-x}}
$$
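As a quick numerical check of the formula above, the following minimal sketch evaluates the logistic function on a handful of inputs. The helper name `logistic` and the sample inputs are chosen here for illustration and are not used elsewhere in this section.

```python
import numpy as np


def logistic(x):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


# Evaluate the function on a few inputs, from very negative to very positive.
inputs = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
outputs = logistic(inputs)
print(outputs)
```

The outputs increase with the input, equal 0.5 at `x = 0`, and approach 0 and 1 at the two extremes without reaching them.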

The step from linear regression to logistic regression is straightforward. In the simple linear regression model, we modelled the relationship between the outcome and a feature with a linear equation:

$$
y = \beta_0 + \beta_1 x_1.
$$

For classification, we prefer probabilities between 0 and 1, so we wrap the right-hand side of the equation in the logistic function. This forces the output to take values only between 0 and 1.

$$
\mathbb{P}(y = 1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1)}}
$$

Run the code example used in the previous section again to see the curve of the logistic function (right plot) next to a linear regression fit (left plot).

In [None]:
X_train = df.balance.values.reshape(-1, 1)
y = df.default2

# Create an array of test data spanning the observed balance range,
# then calculate the predicted class probabilities.
X_test = np.arange(df.balance.min(), df.balance.max()).reshape(-1, 1)

clf = LogisticRegression(solver="newton-cg")
clf.fit(X_train, y)
prob = clf.predict_proba(X_test)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
# Left plot
sns.regplot(
    x=df.balance,
    y=df.default2,
    order=1,
    ci=None,
    scatter_kws={"color": "orange"},
    line_kws={"color": "lightblue", "lw": 2},
    ax=ax1,
)
# Right plot
ax2.scatter(X_train, y, color="orange")
ax2.plot(X_test, prob[:, 1], color="lightblue")

for ax in fig.axes:
    ax.hlines(
        1,
        xmin=ax.xaxis.get_data_interval()[0],
        xmax=ax.xaxis.get_data_interval()[1],
        linestyles="dashed",
        lw=1,
    )
    ax.hlines(
        0,
        xmin=ax.xaxis.get_data_interval()[0],
        xmax=ax.xaxis.get_data_interval()[1],
        linestyles="dashed",
        lw=1,
    )
    ax.set_ylabel("Probability of default")
    ax.set_xlabel("Balance")
    ax.set_yticks([0, 0.25, 0.5, 0.75, 1.0])
    ax.set_xlim(xmin=-100)

## Estimating the coefficients and making predictions

The coefficients of a logistic regression model can be estimated by maximum likelihood estimation. The likelihood function is

$$
L(\beta_0, \beta_1) = \prod_{i:y_i=1} \mathbb{P}(y_i = 1) \prod_{i:y_i=0} (1 - \mathbb{P}(y_i = 1)).
$$

The mathematical details are beyond the scope of this course. See Section 4.3 of the [Pattern Recognition and Machine Learning](https://www.microsoft.com/en-us/research/publication/pattern-recognition-machine-learning/) book for more details on the optimization for logistic regression.
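As a small illustration of the likelihood above, the sketch below evaluates its logarithm for two candidate coefficient pairs. Working with the log-likelihood is an assumption made here for numerical convenience; it has the same maximiser as the likelihood itself. The function name `log_likelihood` and the toy data are made up for this example and are not taken from the `Default` data.

```python
import numpy as np


def log_likelihood(beta_0, beta_1, x, y):
    """Log of the likelihood L(beta_0, beta_1) for binary labels y in {0, 1}."""
    p = 1.0 / (1.0 + np.exp(-(beta_0 + beta_1 * x)))  # P(y_i = 1)
    # Observations with y_i = 1 contribute log(p_i);
    # observations with y_i = 0 contribute log(1 - p_i).
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))


# Toy data, made up for illustration: larger x tends to go with y = 1.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = np.array([0, 0, 1, 0, 1])

ll_flat = log_likelihood(0.0, 0.0, x, y)   # ignores x: every p_i equals 0.5
ll_slope = log_likelihood(0.0, 1.0, x, y)  # positive slope on x
print(ll_flat, ll_slope)
```

Maximum likelihood estimation searches for the coefficient pair with the largest value. On this toy data, the candidate with a positive slope scores higher than the flat one.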

To make predictions, take the classification problem of the `Default` data as an example: the probability of default given balance, as predicted by a logistic regression model, is

$$
\mathbb{P}(\text{default} = \text{Yes} \mid \text{balance}, \beta_0, \beta_1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 \text{balance})}}.
$$
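To make the prediction formula concrete, here is a minimal sketch that plugs a few balances into it. The coefficient values, the helper name `prob_default`, and the 0.5 decision threshold are assumptions chosen for illustration; they are not the result of fitting the `Default` data.

```python
import numpy as np

# Hypothetical coefficient values chosen for illustration only;
# they are NOT estimates obtained from the Default data.
beta_0 = -10.0
beta_1 = 0.005


def prob_default(balance, beta_0, beta_1):
    """P(default = Yes | balance) under the logistic model."""
    return 1.0 / (1.0 + np.exp(-(beta_0 + beta_1 * balance)))


balances = np.array([1000.0, 1500.0, 2500.0])
probs = prob_default(balances, beta_0, beta_1)

# Assumed decision rule: predict "default" when the probability exceeds 0.5.
predicted_default = probs > 0.5
print(probs)
print(predicted_default)
```

With a positive coefficient on balance, the predicted probability of default rises as the balance increases, and only the largest balance in this example crosses the assumed 0.5 threshold.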

In practice, we can use the `scikit-learn` or `statsmodels` package to fit a logistic regression model and make predictions. Run the following code to fit a logistic regression model using the `scikit-learn` and `statsmodels` packages, respectively.

Example of `scikit-learn`

In [None]:
clf = LogisticRegression(solver="newton-cg")
X_train = df.balance.values.reshape(-1, 1)
y = df.default2
clf.fit(X_train, y)
print(clf)
print("classes:", clf.classes_)
print("coefficients:", clf.coef_)
print("intercept:", clf.intercept_)

Example of `statsmodels`

In [None]:
est = logit("default2 ~ balance", df).fit()
est.summary2().tables[1]

In [None]:
est = logit("default2 ~ student", df).fit()
est.summary2().tables[1]