# Model Formulation 

- Discriminative Probability Classifier
- Models the probability of a binary outcome as a smooth function of predictors, constrained between 0 and 1.

$$
\Large\begin{align*}Y &\in \{0,1\}\\P(Y=1) &=p\\P(Y=0) &= 1-p\\P(Y=y) &= p^y(1-p)^{1-y}\\\end{align*}
$$

Why is log-odds the natural parameter of the Bernoulli distribution?

- Start with the Bernoulli pmf.
- Write it in exponential form.
- The coefficient multiplying $y$ in the exponent is the natural parameter.
- So log-odds is not a design choice; it emerges from the distribution itself.

Consider $p^y(1-p)^{1-y}$ and write it in exponential form:

$$
\Large\begin{align*}p^y(1-p)^{1-y} &= \exp\left(y\log p + (1-y)\log(1-p)\right)\\&= \exp\left(y\log p + \log(1-p) - y\log(1-p)\right)\\&= \exp\left(y\log\frac{p}{1-p} + \log(1-p)\right)\end{align*}
$$

The canonical exponential form is:

$$
\Large\begin{align*}f(y \mid \theta) = \exp(y\theta - A(\theta))\end{align*}
$$

Matching terms with the exponential form of the Bernoulli pmf, we have:

$$
\Large\begin{align*}\theta &= \log\frac{p}{1-p}\\-A(\theta) &= \log(1-p)\quad\Longrightarrow\quad A(\theta) = \log(1+e^{\theta})\end{align*}
$$

So $\log\frac{p}{1-p}$, the log-odds, is the natural parameter of the Bernoulli distribution.
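The identity can be checked numerically. The sketch below uses two hypothetical helper functions (not part of the original notes) and assumes $A(\theta) = \log(1+e^{\theta})$, which equals $-\log(1-p)$ when $\theta$ is the log-odds.

```python
import math

def bernoulli_pmf(y, p):
    """Bernoulli pmf in its usual form p^y (1-p)^(1-y)."""
    return p**y * (1 - p)**(1 - y)

def bernoulli_exp_form(y, p):
    """Same pmf written as exp(y * theta - A(theta))."""
    theta = math.log(p / (1 - p))        # natural parameter (log-odds)
    A = math.log(1 + math.exp(theta))    # log-partition, equals -log(1 - p)
    return math.exp(y * theta - A)

# Both forms should agree for any p in (0, 1) and y in {0, 1}
for p in (0.1, 0.5, 0.9):
    for y in (0, 1):
        assert math.isclose(bernoulli_pmf(y, p), bernoulli_exp_form(y, p))
```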

Introducing Covariates: Conditional Bernoulli

Once predictors exist, we no longer model a single $p$, but the conditional probability:

$$
\Large\begin{align*}p(x) = P(Y=1|X=x)\end{align*}
$$

So the natural parameter becomes conditional on $x$:

$$
\Large\begin{align*}\theta(x) = \log\left(\frac{p(x)}{1-p(x)}\right)\end{align*}
$$

We make one structural assumption: the natural parameter depends linearly on the predictors.

$$
\Large\begin{align*}\theta(x) = \beta_0 + \beta^Tx\end{align*}
$$

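- $\theta(x) \in (-\infty,+\infty)$, so the linear predictor needs no constraint.
- The model is linear in $\beta$ on the log-odds scale.
- A linear natural parameter implies a concave log-likelihood (equivalently, a convex negative log-likelihood).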

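Solving $\theta(x) = \log\frac{p(x)}{1-p(x)}$ for $p(x)$ and plugging in the linear form gives the logistic (sigmoid) function:

$$
\Large\begin{align*}p(x) = \frac{1}{1 + \exp(-(\beta_0 + \beta^Tx))}\end{align*}
$$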
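A quick numerical check can make the logit/sigmoid relationship concrete. The sketch below uses two hypothetical scalar helpers, `logit` and `sigmoid`, and shows that they undo each other and that the sign of the log-odds tells which side of 0.5 the probability falls on.

```python
import math

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1 - p))

def sigmoid(theta):
    """Inverse of logit: maps any real log-odds back to (0, 1)."""
    return 1 / (1 + math.exp(-theta))

for p in (0.05, 0.5, 0.95):
    theta = logit(p)
    print(f"p={p:.2f}  log-odds={theta:+.3f}  sigmoid(log-odds)={sigmoid(theta):.2f}")
```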

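# Link Functions

Other links can also map the linear predictor to a probability:

- Probit: inverse of the standard normal CDF
- Complementary log-log
- Cauchit: inverse of the standard Cauchy CDF
- Only the logit is the canonical link for the Bernoulli distribution.
- The canonical link gives the most convenient statistical properties: the log-likelihood is concave in $\beta$ and $X^Ty$ is a sufficient statistic.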
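To compare the links listed above, the sketch below writes each inverse link (linear predictor to probability) using only the standard library. The function names are hypothetical, chosen for illustration.

```python
import math

def inv_logit(eta):
    return 1 / (1 + math.exp(-eta))

def inv_probit(eta):
    # Standard normal CDF expressed through the error function
    return 0.5 * (1 + math.erf(eta / math.sqrt(2)))

def inv_cloglog(eta):
    # Complementary log-log: asymmetric around eta = 0
    return 1 - math.exp(-math.exp(eta))

def inv_cauchit(eta):
    # Standard Cauchy CDF: much heavier tails than logit or probit
    return 0.5 + math.atan(eta) / math.pi

for eta in (-2.0, 0.0, 2.0):
    print(eta, inv_logit(eta), inv_probit(eta), inv_cloglog(eta), inv_cauchit(eta))
```

Logit, probit and cauchit all give 0.5 at $\eta = 0$; the complementary log-log does not, which is one way to see its asymmetry.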

# Parameter Estimation using MLE 

For a single observation $(x_i, y_i)$ with $y_i \in \{0,1\}$, the Bernoulli probability mass function is:

$$
\Large\begin{align*}P(Y_i = y_i \mid X_i = x_i)&= p_i^{y_i}(1-p_i)^{1-y_i}\end{align*}
$$

$$
\Large\begin{align*}p_i = P(Y_i=1 \mid X_i=x_i)\end{align*}
$$

Assuming **conditional independence** of observations given $X$, the likelihood for the full dataset $\{(x_i,y_i)\}_{i=1}^n$ is the product of individual likelihoods:

$$
\Large\begin{align*}L(\beta)&= \prod_{i=1}^{n} p_i^{y_i}(1-p_i)^{1-y_i}\end{align*}
$$

Here, the probabilities $p_i$ are modeled using the logistic function:

$$
\Large\begin{align*}p_i&= \frac{1}{1 + \exp(-\eta_i)}\\\eta_i&= \beta_0 + \beta^T x_i\end{align*}
$$

Since the log function is monotonic, maximising the likelihood is equivalent to maximising the log-likelihood.

$$
\Large\begin{align*}\ell(\beta)&= \log L(\beta)\\&= \sum_{i=1}^{n} \left[ y_i \log(p_i) + (1-y_i)\log(1-p_i) \right]\end{align*}
$$

This is the **log-likelihood** of logistic regression.
In optimisation, it is common to convert a maximisation problem into a minimisation problem. We therefore define the **negative log-likelihood (NLL)**:

$$
\Large\begin{align*}\mathcal{L}(\beta) &= -\ell(\beta)\\&= - \sum_{i=1}^{n} \left[ y_i \log(p_i) + (1-y_i)\log(1-p_i) \right]\end{align*}
$$
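Working with sums of logs rather than the raw product also matters numerically. As a small illustration with made-up likelihood terms, the product of many probabilities underflows to zero in floating point while the sum of their logs stays finite.

```python
import math

# Hypothetical per-observation likelihood terms, all equal to 0.5
probs = [0.5] * 2000

likelihood = math.prod(probs)                      # underflows to 0.0
log_likelihood = sum(math.log(q) for q in probs)   # finite, about -1386.29

print(likelihood, log_likelihood)
```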

The goal of logistic regression is to find:

$$
\Large\begin{align*}\hat{\beta}= \arg\min_{\beta} \; \mathcal{L}(\beta)\end{align*}
$$
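Substituting $p_i = 1/(1+\exp(-\eta_i))$ into the negative log-likelihood turns each term into $\log(1+e^{\eta_i}) - y_i\eta_i$. The sketch below, on a tiny made-up dataset, checks that this form matches the direct formula; the function name `nll` and the data are assumptions for illustration.

```python
import numpy as np

def nll(beta, X, y):
    """Negative log-likelihood as sum(log(1 + exp(eta)) - y * eta)."""
    eta = X @ beta
    # np.logaddexp(0, eta) computes log(1 + exp(eta)) without overflow
    return np.sum(np.logaddexp(0.0, eta) - y * eta)

# Tiny hypothetical dataset; the first column is the intercept
X = np.array([[1.0, -1.0], [1.0, 0.5], [1.0, 2.0]])
y = np.array([0.0, 1.0, 1.0])
beta = np.array([0.2, 0.8])

# Direct formula from the text, for comparison
p = 1 / (1 + np.exp(-(X @ beta)))
direct = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

print(nll(beta, X, y), direct)
```

Writing the loss in terms of $\eta_i$ avoids taking the log of a probability that has rounded to exactly 0 or 1.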

**Logistic regression chooses $\beta$ so that the predicted probabilities $\sigma(\beta_0 + \beta^Tx_i)$ match the observed labels as well as possible.**

The negative log-likelihood above is known in machine learning as the **binary cross-entropy loss** (or log loss).
For a single observation, the loss is:

$$
\Large\begin{align*}\text{BCE}(y_i, p_i)&= - \left[ y_i \log(p_i) + (1-y_i)\log(1-p_i) \right]\end{align*}
$$

For the full dataset:

$$
\Large\begin{align*}\text{BCE}(\beta)&= \sum_{i=1}^{n} \text{BCE}(y_i, p_i)\end{align*}
$$

- This loss arises **directly from maximum likelihood estimation**, not as a heuristic.
- Thus, minimising binary cross-entropy is equivalent to **maximum likelihood estimation for a Bernoulli model with a logistic link**.
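The formula above sums the per-observation losses, while the from-scratch class in the Implementation section averages them with `np.mean`. The two differ only by the constant factor $1/n$, so they share the same minimiser. A minimal sketch on made-up one-feature data (no intercept, to keep the search one-dimensional):

```python
import numpy as np

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])   # hypothetical single feature
y = np.array([0, 0, 1, 0, 1, 1])                 # hypothetical labels

def bce_terms(beta1):
    """Per-observation binary cross-entropy for slope beta1."""
    p = 1 / (1 + np.exp(-beta1 * x))
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# Evaluate both versions of the loss on a grid of candidate slopes
grid = np.linspace(-3, 3, 601)
sum_loss = np.array([bce_terms(b).sum() for b in grid])
mean_loss = np.array([bce_terms(b).mean() for b in grid])

best_sum = grid[sum_loss.argmin()]
best_mean = grid[mean_loss.argmin()]
print(best_sum, best_mean)
```

In practice the averaged version is convenient because a given learning rate behaves similarly across dataset sizes.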

In summary, logistic regression estimates the parameters $\beta$ by:

1. Modeling $P(Y=1 \mid X=x)$ via the logistic (sigmoid) function
2. Writing the Bernoulli likelihood for observed labels
3. Maximising the log-likelihood (or equivalently, minimising binary cross-entropy)

# Optimization Talks

Binary cross-entropy loss comes directly from MLE:

- Logistic regression = maximise the log-likelihood
- Equivalently = minimise the BCE loss

Logistic regression has no closed-form solution, so $\beta$ is found iteratively. The algorithm starts from an initial guess for $\beta$ (random or all zeros), computes the probabilities $p_i$, and updates $\beta$ to reduce the loss. Several iterative optimisers can be used, for example gradient descent, Newton-Raphson (IRLS), or L-BFGS.

To see why, take the gradient of the log-likelihood and set it to zero. The resulting equations, $\sum_{i}(y_i - p_i)x_i = 0$ (with $x_i$ including the intercept term), are nonlinear in $\beta$ because each $p_i$ is a sigmoid of the linear predictor and therefore contains an exponential. No algebraic manipulation can isolate $\beta$, so there is no analytic solution.

The sigmoid introduces exponentials of $\beta$, making the likelihood equations transcendental rather than linear or polynomial.
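Since the equations cannot be solved algebraically, they are solved iteratively. The sketch below is a minimal Newton-Raphson solver, assuming the standard gradient $X^T(p - y)$ and Hessian $X^TWX$ of the negative log-likelihood with $W = \mathrm{diag}(p_i(1-p_i))$. The function name, simulated data and seed are all illustrative assumptions, not the notebook's own method.

```python
import numpy as np

def newton_logistic(X, y, n_iters=25, tol=1e-10):
    """Newton-Raphson for logistic regression; X must already contain an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        p = 1 / (1 + np.exp(-(X @ beta)))
        gradient = X.T @ (p - y)               # gradient of the NLL
        W = p * (1 - p)                        # Bernoulli variances
        hessian = X.T @ (X * W[:, None])       # X^T W X
        step = np.linalg.solve(hessian, gradient)
        beta = beta - step
        if np.max(np.abs(step)) < tol:         # stop once the update is negligible
            break
    return beta

# Hypothetical data, simulated in the same spirit as the Implementation section
rng = np.random.default_rng(0)
X = np.c_[np.ones(500), rng.normal(size=(500, 2))]
true_beta = np.array([-0.5, 1.2, -1.0])
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ true_beta))))

beta_hat = newton_logistic(X, y)
print(beta_hat)
```

Newton's method uses curvature information, so it typically needs only a handful of iterations where plain gradient descent needs thousands.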

# Canonical Exponential Form

A standardised way to write a probability distribution as the exponential of a function that is linear in the natural parameter.

# Implementation

In [22]:
import numpy as np
import pandas as pd

np.random.seed(42)

n = 500
X = np.random.normal(0, 1, size=(n, 2))  # two features

X_df = pd.DataFrame(X, columns=["x1", "x2"])

In [23]:
# True coefficients used to simulate the data
beta_0 = -0.5
beta_1 = 1.2
beta_2 = -1.0

# Linear predictor (log-odds) for each observation
linear_pred = beta_0 + beta_1 * X[:, 0] + beta_2 * X[:, 1]


In [24]:
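def sigmoid(z):
    """Map log-odds z to a probability in (0, 1)."""
    return 1 / (1 + np.exp(-z))

p = sigmoid(linear_pred)      # true P(y = 1 | x) for each row
y = np.random.binomial(1, p)  # draw one Bernoulli label per row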

data = X_df.copy()
data["y"] = y

In [25]:
class LogisticRegressionScratch:
    def __init__(self, lr=0.01, n_iters=5000):
        self.lr = lr
        self.n_iters = n_iters
    
    def _add_intercept(self, X):
        return np.c_[np.ones(X.shape[0]), X]
    
    def fit(self, X, y):
        X = self._add_intercept(X)
        n_samples, n_features = X.shape
        
        self.beta = np.zeros(n_features)
        self.loss_history = []
        
        for _ in range(self.n_iters):
            linear_pred = X @ self.beta
            y_hat = sigmoid(linear_pred)
            
            # Mean binary cross-entropy; the 1e-9 offset guards against log(0)
            loss = -np.mean(y * np.log(y_hat + 1e-9) + (1 - y) * np.log(1 - y_hat + 1e-9))
            self.loss_history.append(loss)
            
            # Gradient of the mean negative log-likelihood, then one gradient-descent step
            gradient = (1 / n_samples) * X.T @ (y_hat - y)
            self.beta -= self.lr * gradient
    
    def predict_proba(self, X):
        X = self._add_intercept(X)
        return sigmoid(X @ self.beta)
    
    def predict(self, X, threshold=0.5):
        return (self.predict_proba(X) >= threshold).astype(int)
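
A common sanity check for a hand-written gradient is to compare it with a finite-difference approximation of the loss. The sketch below does this for the mean binary cross-entropy on small made-up data; the helper names and the data are assumptions for illustration and are independent of the class above.

```python
import numpy as np

def mean_bce(beta, X, y):
    p = 1 / (1 + np.exp(-(X @ beta)))
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def analytic_gradient(beta, X, y):
    # Same formula as in the fit loop: X^T (y_hat - y) / n
    p = 1 / (1 + np.exp(-(X @ beta)))
    return X.T @ (p - y) / len(y)

def numeric_gradient(beta, X, y, h=1e-6):
    # Central differences, one coordinate at a time
    grad = np.zeros_like(beta)
    for j in range(len(beta)):
        e = np.zeros_like(beta)
        e[j] = h
        grad[j] = (mean_bce(beta + e, X, y) - mean_bce(beta - e, X, y)) / (2 * h)
    return grad

rng = np.random.default_rng(1)
X = np.c_[np.ones(50), rng.normal(size=(50, 2))]   # hypothetical design matrix
y = rng.integers(0, 2, size=50)                    # hypothetical labels
beta = np.array([0.1, -0.3, 0.7])

print(analytic_gradient(beta, X, y))
print(numeric_gradient(beta, X, y))
```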


In [26]:
X_train = data[["x1", "x2"]].values
y_train = data["y"].values

model = LogisticRegressionScratch(lr=0.1, n_iters=3000)
model.fit(X_train, y_train)

model.beta


array([-0.48830487,  1.27686522, -1.03120918])

In [27]:
np.exp(model.beta)

array([0.61366575, 3.58538269, 0.35657554])

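**Interpretations**
 - $\beta_1$ is the change in log-odds for a one-unit increase in $x_1$, holding $x_2$ constant.
 - A one-unit increase in $x_1$ multiplies the odds of $y=1$ by $e^{1.277} \approx 3.59$, holding other variables constant.
 - The value 0.61 is $e^{\beta_0}$: the odds of $y=1$ when $x_1 = x_2 = 0$.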

In [28]:
y_hat = model.predict_proba(X_train)
residuals = y_train - y_hat
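
Raw residuals have non-constant variance in logistic regression, so Pearson and deviance residuals are the usual diagnostics. The sketch below shows the standard definitions for binary outcomes on made-up values; the function names and the demo arrays are assumptions for illustration.

```python
import numpy as np

def pearson_residuals(y, p):
    # Raw residual scaled by the Bernoulli standard deviation
    return (y - p) / np.sqrt(p * (1 - p))

def deviance_residuals(y, p):
    # For binary y: sign(y - p) * sqrt(-2 * log(probability of the observed label))
    p_observed = np.where(y == 1, p, 1 - p)
    return np.sign(y - p) * np.sqrt(-2 * np.log(p_observed))

y_demo = np.array([1, 0, 1, 0])            # hypothetical labels
p_demo = np.array([0.9, 0.2, 0.3, 0.6])    # hypothetical fitted probabilities

print(pearson_residuals(y_demo, p_demo))
print(deviance_residuals(y_demo, p_demo))
```

The squared deviance residuals sum to twice the negative log-likelihood, which ties this diagnostic back to the loss being minimised.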

# Key Logistic Assumptions
 - The logit is linear in the parameters
 $$
 \Large \log\left(\frac{p}{1-p}\right) = X\beta
 $$ 
 - Observations are independent
 - No severe multicollinearity
 - No perfect separation
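
The last assumption can be illustrated directly. With perfectly separated data, scaling the coefficient up always lowers the loss, so no finite maximum likelihood estimate exists. A minimal sketch on made-up data with a single slope and no intercept:

```python
import math

x = [-2.0, -1.0, 1.0, 2.0]   # hypothetical feature, perfectly separated at x = 0
y = [0, 0, 1, 1]

def nll(beta1):
    """Negative log-likelihood for slope beta1, using log(1 + exp(eta)) - y * eta."""
    total = 0.0
    for xi, yi in zip(x, y):
        eta = beta1 * xi
        total += math.log1p(math.exp(eta)) - yi * eta
    return total

# The loss keeps shrinking as the slope grows: the optimiser would run off to infinity
losses = [nll(b) for b in (1, 5, 10, 20)]
print(losses)
```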

In [29]:
import statsmodels.api as sm

X_sm = sm.add_constant(X_train)
logit_model = sm.Logit(y_train, X_sm)
result = logit_model.fit()

print(result.summary())

Optimization terminated successfully.
         Current function value: 0.497525
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:                      y   No. Observations:                  500
Model:                          Logit   Df Residuals:                      497
Method:                           MLE   Df Model:                            2
Date:                Mon, 26 Jan 2026   Pseudo R-squ.:                  0.2657
Time:                        15:33:40   Log-Likelihood:                -248.76
converged:                       True   LL-Null:                       -338.79
Covariance Type:            nonrobust   LLR p-value:                 7.979e-40
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.4883      0.112     -4.347      0.000      -0.708      -0.268
x1             1.2769      0.
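
The `Pseudo R-squ.` entry in the summary is McFadden's pseudo-$R^2$, which compares the fitted log-likelihood with that of the intercept-only model. It can be reproduced by hand from the two log-likelihoods printed above:

```python
llf = -248.76      # Log-Likelihood from the summary
llnull = -338.79   # LL-Null (intercept-only model) from the summary

# McFadden's pseudo R-squared: 1 - llf / llnull
mcfadden_r2 = 1 - llf / llnull
print(round(mcfadden_r2, 4))
```

The rounded value agrees with the 0.2657 reported in the summary.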

In [30]:
params = result.params
conf = result.conf_int()
odds_ratios = np.exp(params)
conf_odds = np.exp(conf)

print(params)
print(conf)
print(odds_ratios)
print(conf_odds)

[-0.48830487  1.27686522 -1.03120918]
[[-0.70848971 -0.26812003]
 [ 0.99265023  1.56108021]
 [-1.2874039  -0.77501446]]
[0.61366575 3.58538269 0.35657554]
[[0.49238728 0.76481597]
 [2.69837631 4.76396453]
 [0.27598634 0.46069712]]
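
The intervals above are Wald intervals of the form estimate $\pm z_{0.975} \times$ standard error, exponentiated endpoint by endpoint to give intervals on the odds scale. As a rough check using only the printed numbers for the intercept, the sketch below recovers the implied standard error and the exponentiated interval (the variable names are illustrative).

```python
import math

coef = -0.48830487                        # const estimate printed above
lower, upper = -0.70848971, -0.26812003   # its 95% interval printed above

z = 1.959963984540054                     # 97.5% standard normal quantile
se = (upper - lower) / (2 * z)            # standard error implied by the interval

odds = math.exp(coef)
odds_ci = (math.exp(lower), math.exp(upper))
print(round(se, 3), odds, odds_ci)
```

Because the exponential is monotone, exponentiating the endpoints gives a valid interval for the odds, although it is no longer symmetric around the point estimate.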


# Odds, log-odds, odds ratio - Interpretations

$$
\Large \text{odds} = \frac{p}{1-p}
$$
* The odds say how many times as likely success ($p$) is as failure ($1-p$).
* Odds range from 0 to infinity, so they cannot be modelled linearly with an unrestricted $X\beta$.
* Taking the log of the odds gives a value that ranges from $-\infty$ to $+\infty$.
* Positive log-odds mean a probability above 0.5; negative log-odds mean a probability below 0.5.
* In logistic regression we assume that the log-odds are linear in the predictors, so they can be modelled with an unrestricted $X\beta$.

$$ 
\Large \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \dots
$$

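Inverting this relation with the sigmoid always gives a probability between 0 and 1.

As an illustration, take a model with the following coefficients (these differ from the fit above):

$$
\Large \log\left(\frac{p}{1-p}\right) = -1.05 + 1.95 x_1 - 0.98 x_2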
$$

* Exponentiating the three coefficients gives 0.35, 7.0 and 0.38 respectively.
* A one-unit increase in $x_1$ multiplies the odds of $y=1$ by about 7.
* A one-unit increase in $x_2$ reduces the odds by about 62% ($1 - 0.38$).
* 0.35 is the odds of $y=1$ at $x_1 = x_2 = 0$ (a baseline odds, not an odds ratio).
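
One point worth checking numerically: the odds ratio for $x_1$ is the same wherever you start, but the change in probability is not. The sketch below uses the coefficients from the example equation; the helper names are illustrative.

```python
import math

b0, b1, b2 = -1.05, 1.95, -0.98   # coefficients from the example equation

def prob(x1, x2):
    """P(y = 1) under the example model."""
    return 1 / (1 + math.exp(-(b0 + b1 * x1 + b2 * x2)))

def odds(p):
    return p / (1 - p)

# Raise x1 by one unit from three different starting points, holding x2 = 0
for x1 in (-1, 0, 1):
    p_before, p_after = prob(x1, 0), prob(x1 + 1, 0)
    ratio = odds(p_after) / odds(p_before)
    print(f"x1: {x1} -> {x1 + 1}  p: {p_before:.3f} -> {p_after:.3f}  odds ratio: {ratio:.2f}")
```

This is why coefficients are reported on the odds scale: the multiplicative effect is constant, whereas the additive effect on probability depends on the baseline.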

In [31]:
from sklearn.linear_model import LogisticRegression

sk_model = LogisticRegression(fit_intercept=True, solver="lbfgs")
sk_model.fit(X_train, y_train)

sk_model.intercept_, sk_model.coef_


(array([-0.48285862]), array([[ 1.24575728, -1.0077469 ]]))
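
The scikit-learn coefficients are slightly smaller in magnitude than the statsmodels ones. A likely reason is that `LogisticRegression` applies an L2 penalty by default (`C=1.0`), whereas `sm.Logit` fits the unpenalised likelihood. The sketch below, on freshly simulated data (an assumption; it does not reuse the notebook's arrays), compares the default fit with a very weakly penalised one.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data simulated with the same true coefficients as the notebook
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 2))
p = 1 / (1 + np.exp(-(-0.5 + 1.2 * X[:, 0] - 1.0 * X[:, 1])))
y = rng.binomial(1, p)

# Default settings: L2 penalty with C=1.0
default_fit = LogisticRegression(solver="lbfgs", max_iter=1000).fit(X, y)
# Large C makes the penalty negligible, approximating plain maximum likelihood
weak_penalty = LogisticRegression(solver="lbfgs", C=1e6, max_iter=1000).fit(X, y)

print(default_fit.coef_, weak_penalty.coef_)
```

With a large `C` the estimates should move towards the unpenalised MLE, so the slope vector is no shorter than under the default penalty.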

# Overall Model Significance - Likelihood Ratio Test

In [32]:
import statsmodels.api as sm
from scipy import stats

X_sm = sm.add_constant(X_train)

# Null model (intercept only)
model_null = sm.Logit(y_train, np.ones((len(y_train), 1))).fit(disp=False)

# Full model
model_full = sm.Logit(y_train, X_sm).fit(disp=False)

LR_stat = 2 * (model_full.llf - model_null.llf)
df = X_sm.shape[1] - 1
p_value = stats.chi2.sf(LR_stat, df)

LR_stat, p_value


(np.float64(180.0531457176919), np.float64(7.979141792795332e-40))
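
With two degrees of freedom the chi-square survival function has the closed form $e^{-x/2}$, so the p-value above can be reproduced without SciPy. A small check using the printed statistic:

```python
import math

lr_stat = 180.0531457176919   # LR statistic printed above
df = 2                        # two slope coefficients are being tested

# For df = 2 the chi-square survival function is exactly exp(-x / 2)
p_value = math.exp(-lr_stat / 2)
print(p_value)
```

The shortcut only holds for `df = 2`; other degrees of freedom need the general chi-square survival function.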

# Wald Test - Single Coefficient Significance Test

In [33]:
import numpy as np
from scipy import stats


beta_hat = result.params
se = result.bse

# Wald test for coefficient j (example: x1 → index 1)
j = 1

z_stat = beta_hat[j] / se[j]
p_value = 2 * (1 - stats.norm.cdf(abs(z_stat)))

z_stat, p_value


(np.float64(8.805340761033186), np.float64(0.0))
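
The p-value prints as exactly `0.0` because `1 - cdf` loses all precision once the CDF rounds to 1. The tail probability itself is tiny but not zero. The sketch below, using only the standard library, computes the same two-sided tail with the complementary error function; with SciPy, `stats.norm.sf` serves the same purpose.

```python
import math

z_stat = 8.805340761033186    # Wald z for x1 printed above

# Two-sided normal tail: 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2))
p_value = math.erfc(abs(z_stat) / math.sqrt(2))

# The "1 - cdf" route for comparison: the CDF rounds to 1.0, so the result is 0.0
naive = 2 * (1 - 0.5 * (1 + math.erf(abs(z_stat) / math.sqrt(2))))

print(p_value, naive)
```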

# Comparison

    
| Aspect                             | Linear Regression                 | Logistic Regression                           |
| :-------------------------------- | :-------------------------------- | :-------------------------------------------- |
| **Linear assumption**              | $y$ linear in $X$                 | **Log-odds** linear in $X$                    |
| **Error distribution**             | Normal                            | Binomial, variance $p(1-p)$                   |
| **Variance of response**           | Constant ($\sigma^2$)             | Mean-dependent                                |
| **Estimation method**              | Least Squares                     | Maximum Likelihood                            |
| **Overall model test**             | ANOVA F-test                      | Likelihood Ratio Test (LRT)                   |
| **Individual parameter test**      | t-test                            | Wald test                                     |
| **Alternative tests**              | Partial F-test                    | Wald / LRT / Score                            |
| **Model comparison**               | F-test                            | Likelihood ratio                              |
| **Test statistic distribution**    | F, t                              | $\chi^2$, Normal (asymptotic)                 |
| **Confidence intervals**           | Exact (normality)                 | Asymptotic (Wald / profile)                   |
| **Interpretation of coefficients** | Change in mean $y$                | Change in **log-odds**                        |
| **Natural effect scale**           | Units of $y$                      | Odds ratios                                   |
| **Exponentiated coefficients**     | Meaningless                        | Odds ratios                                   |
| **Residuals**                      | Raw, standardized                 | Deviance, Pearson                             |
| **Residual patterns**              | Homoscedastic                     | Mean–variance linked                          |
| **Pseudo-R²**                      | Not needed                         | McFadden, Cox–Snell                           |
| **Influence measures**             | Cook’s distance                   | Leverage, ΔDeviance                           |
| **Perfect separation**             | Not an issue                      | Can break MLE                                 |
| **Multicollinearity**              | Inflates SEs                      | Inflates SEs                                  |
