# MLE of logistic regression model

In [1]:
import sys
sys.path.insert(0,'C:\\code\\python_for_the_financial_economist\\')

# import relevant packages
import numpy as np

# packages for convex optimization
import cvxpy as cp

# for calculating evaluation metrics
from sklearn.metrics import accuracy_score, confusion_matrix

from codelib.visualization.layout import DefaultStyle
DefaultStyle();

The logistic regression model specifies the probability that a binary outcome variable $y$ equals 1 as

$$
P(y=1 \vert \mathbf{x}) = \frac{1}{1 + e^{-\mathbf{x}^\top \boldsymbol{\beta}}}
$$

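where $\boldsymbol{\beta}$ is the vector of coefficients to be estimated and $\mathbf{x}$ is the vector of explanatory variables, including a constant 1 when the model has an intercept.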
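As a small illustration of the formula above, the sketch below evaluates the logistic probability for a single observation. The function name `logistic_prob` and the example values of `x` and `beta` are hypothetical and chosen only for illustration; the first element of `x` is assumed to be a constant 1 for the intercept.

```python
import numpy as np


def logistic_prob(x, beta):
    """Return P(y = 1 | x) = 1 / (1 + exp(-x' beta)).

    Uses exp(-log(1 + exp(-z))), which avoids overflow for large |z|.
    """
    z = np.dot(x, beta)
    return np.exp(-np.logaddexp(0.0, -z))


# hypothetical example: constant term plus two explanatory variables
x = np.array([1.0, 0.5, -0.2])
beta = np.array([-1.0, 2.0, -3.0])

p = logistic_prob(x, beta)
print(f"P(y = 1 | x) = {p:.4f}")
```

The same function also works row-wise when `x` is a matrix with one observation per row.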

Below we simulate data from a logistic model with two explanatory variables. 

In [2]:
# set random seed
np.random.seed(42)

# number of simulations
num_sim = 100

# coefficients
beta0 = -1.0
beta1 = 2.0
beta2 =  -3.0

# generate explanatory variables
x1 = np.random.normal(loc=0, scale=1, size=num_sim)
x2 = np.random.normal(loc=0, scale=1, size=num_sim)

# generate probabilities
probs = 1 / (1 + np.exp(-(beta0 + beta1 * x1 + beta2 * x2)))

# simulate y values
y = np.random.binomial(1, probs, size=num_sim)

In [3]:
y

array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1,
       0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0,
       0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1,
       0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1])

## Problem 1

The log-likelihood function is given by

$$
\ell(\boldsymbol{\beta}) = \sum_{i=1}^N \left[ y_i \log \frac{1}{1 + e^{-\mathbf{x}_i^\top \boldsymbol{\beta}}} + (1 - y_i) \log \left(1 - \frac{1}{1 + e^{-\mathbf{x}_i^\top \boldsymbol{\beta}}}\right) \right]
$$

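which is a concave function in $\boldsymbol{\beta}$, so maximizing it (equivalently, minimizing the negative log-likelihood) is a convex optimization problem.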
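A short derivation, not part of the original exercise text, may help when entering the objective in a modelling language. Writing $z_i = \mathbf{x}_i^\top \boldsymbol{\beta}$ for observation $i$, we have $\log \frac{1}{1 + e^{-z_i}} = z_i - \log(1 + e^{z_i})$ and $\log \left(1 - \frac{1}{1 + e^{-z_i}}\right) = -\log(1 + e^{z_i})$, so that

$$
\ell(\boldsymbol{\beta}) = \sum_{i=1}^N \left[ y_i z_i - \log \left(1 + e^{z_i}\right) \right], \qquad z_i = \mathbf{x}_i^\top \boldsymbol{\beta}
$$

Each summand is an affine function of $\boldsymbol{\beta}$ minus the convex function $\log(1 + e^{z_i})$, which makes the sum concave in $\boldsymbol{\beta}$.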

Use the `CVXPY` package to estimate the coefficients by maximizing the log-likelihood function. 

## Problem 2

Calculate the predicted probabilities and classify $y$ using a 50% threshold. In addition, calculate the accuracy score and the confusion matrix (use `accuracy_score` and `confusion_matrix` from `sklearn.metrics`).
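The evaluation step could look roughly like the sketch below. To keep the example runnable on its own, `beta_hat` is set to the true simulation coefficients as a stand-in; this is an assumption made purely for illustration, and the estimates from Problem 1 should be used in its place.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# repeat the simulation so that the example is self-contained
np.random.seed(42)
num_sim = 100
beta0, beta1, beta2 = -1.0, 2.0, -3.0
x1 = np.random.normal(loc=0, scale=1, size=num_sim)
x2 = np.random.normal(loc=0, scale=1, size=num_sim)
probs = 1 / (1 + np.exp(-(beta0 + beta1 * x1 + beta2 * x2)))
y = np.random.binomial(1, probs, size=num_sim)

# design matrix with a constant column for the intercept
X = np.column_stack([np.ones(num_sim), x1, x2])

# ASSUMPTION: stand-in for the estimated coefficients from Problem 1
beta_hat = np.array([beta0, beta1, beta2])

# predicted probabilities P(y = 1 | x)
prob_hat = 1 / (1 + np.exp(-(X @ beta_hat)))

# classify as 1 when the predicted probability is at least 50%
y_pred = (prob_hat >= 0.5).astype(int)

# share of correctly classified observations
acc = accuracy_score(y, y_pred)

# rows are true classes (0, 1), columns are predicted classes (0, 1)
cm = confusion_matrix(y, y_pred)

print("accuracy:", acc)
print("confusion matrix:")
print(cm)
```

The accuracy equals the sum of the diagonal of the confusion matrix divided by the number of observations.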

## Problem 3

To perform variable selection, we can subtract an L1 penalty from the log-likelihood:

$$
\ell_{reg}(\boldsymbol{\beta}) = \ell(\boldsymbol{\beta}) - \lambda \Vert \boldsymbol{\beta} \Vert_1 , \; \; \lambda > 0
$$

Again, estimate the parameters using `CVXPY` and evaluate the model. 

### Links

Logistic regression model with maximum likelihood: [Statlect.com](https://www.statlect.com/fundamentals-of-statistics/logistic-model-maximum-likelihood)