In [None]:
'''
 * Copyright (c) 2004 Radhamadhab Dalai
 *
 * Permission is hereby granted, free of charge, to any person obtaining a copy
 * of this software and associated documentation files (the "Software"), to deal
 * in the Software without restriction, including without limitation the rights
 * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 * copies of the Software, and to permit persons to whom the Software is
 * furnished to do so, subject to the following conditions:
 *
 * The above copyright notice and this permission notice shall be included in
 * all copies or substantial portions of the Software.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
 * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
 * THE SOFTWARE.
'''

![image.png](attachment:image.png)

Fig. 5 Family of classification models. a) In the remaining part of the chapter, we will address several of the limitations of logistic regression for binary classification. b) The logistic regression model with maximum likelihood learning is overconfident, and hence we develop a Bayesian version (section 9.2). c) It is unrealistic to always assume a linear relationship between the data and the world, and to this end we introduce a nonlinear version (section 9.3). d) Combining the Bayesian and nonlinear versions of regression leads to Gaussian process classification. e) The logistic regression model also has many parameters and may require considerable resources to learn when the data dimension is high, and so we develop relevance vector classification, which encourages sparsity. f) We can also build a sparse model by incrementally adding parameters in a boosting model. g) Finally, we consider a very fast classification model based on a tree structure.
# Bayesian Logistic Regression

In Bayesian logistic regression, we learn a distribution $P(φ|X, w)$ over the possible parameter values $φ$ that are compatible with the training data. In inference, we observe a new data example $x^*$ and use this distribution to weight the predictions for the world state $w^*$ given by each possible estimate of $φ$.

## Learning

We start by defining a prior over the parameters $φ$. Unfortunately, there is no conjugate prior for the likelihood in the logistic regression model, which is why there are no closed-form expressions for the posterior and the predictive distribution.

As a reasonable choice for the prior over the continuous parameters $φ$, we use a multivariate normal distribution with zero mean and a large spherical covariance:

$$
P(φ) = \text{Norm}_φ [0, σ_p^2 I]
$$

To compute the posterior probability distribution $P(φ|X, w)$, we apply Bayes' rule:

$$
P(φ|X, w) = \frac{P(w|X, φ)P(φ)}{P(w|X)}
$$

where the likelihood and prior are given by equations 9.5 and 9.10, respectively.
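
As a minimal sketch of the numerator in Bayes' rule, the snippet below evaluates the unnormalized log posterior $\log P(w|X, φ) + \log P(φ)$, assuming a Bernoulli likelihood with a sigmoid link and the spherical normal prior above. The function names and the default `prior_variance=100.0` are illustrative assumptions, not part of the original text.

```python
import math

def softplus(a):
    # Numerically stable log(1 + exp(a))
    return max(a, 0.0) + math.log1p(math.exp(-abs(a)))

def log_likelihood(phi, X, w):
    # Bernoulli log likelihood with lambda_i = sig[phi^T x_i]:
    # log(lambda_i) = -softplus(-a_i), log(1 - lambda_i) = -softplus(a_i)
    total = 0.0
    for x_i, w_i in zip(X, w):
        a = sum(p * x for p, x in zip(phi, x_i))
        total += -w_i * softplus(-a) - (1 - w_i) * softplus(a)
    return total

def log_prior(phi, prior_variance=100.0):
    # Log density of a zero-mean normal with covariance prior_variance * I
    d = len(phi)
    quad = sum(p * p for p in phi) / prior_variance
    return -0.5 * d * math.log(2 * math.pi * prior_variance) - 0.5 * quad

def log_unnormalized_posterior(phi, X, w, prior_variance=100.0):
    # Numerator of Bayes' rule in log form; the evidence P(w|X) is omitted
    return log_likelihood(phi, X, w) + log_prior(phi, prior_variance)
```

Because the evidence $P(w|X)$ does not depend on $φ$, this quantity is enough to compare parameter settings or to search for the most probable one.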

## Inference

In inference, we observe a new data example $x^*$ and use the posterior distribution $P(φ|X, w)$ to weight the predictions for the world state $w^*$ given by each possible estimate of $φ$.
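
Written out with the symbols used above, this weighting is an integral over the parameters:

$$
P(w^*|x^*, X, w) = \int P(w^*|x^*, φ)\, P(φ|X, w)\, dφ
$$

Each candidate $φ$ contributes its prediction $P(w^*|x^*, φ)$ in proportion to its posterior probability.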

## Approximations

Due to the nonlinear function $\text{sig}[\cdot]$ in logistic regression, neither the learning step nor the inference step has a closed-form expression. To make the algorithm tractable, we approximate both steps so that the resulting expressions are again in closed form.

## Prior Choice

The choice of prior distribution for $φ$ is important. A reasonable choice is a multivariate normal distribution with zero mean and a large spherical covariance. Because the variance $σ_p^2$ is large, this prior is nearly flat over plausible parameter values and only weakly constrains the model.

## Posterior Distribution

The posterior distribution $P(φ|X, w)$ is obtained by applying Bayes' rule, which combines the likelihood $P(w|X, φ)$ and the prior $P(φ)$. The denominator $P(w|X)$ is a normalization constant that ensures the posterior distribution integrates to 1.
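
For a model with a single scalar parameter, the role of the denominator can be illustrated numerically. The sketch below evaluates likelihood times prior on a grid and divides by a Riemann-sum estimate of the evidence. The function name `grid_posterior`, the grid range, and the toy data in the usage lines are assumptions made for illustration only.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def grid_posterior(xs, ws, prior_variance=100.0, lo=-10.0, hi=10.0, steps=2001):
    # Evenly spaced grid over the scalar parameter phi
    step = (hi - lo) / (steps - 1)
    grid = [lo + k * step for k in range(steps)]
    unnorm = []
    for phi in grid:
        # Bernoulli likelihood of all labels under this phi
        lik = 1.0
        for x, w in zip(xs, ws):
            lam = sigmoid(phi * x)
            lik *= lam if w == 1 else 1.0 - lam
        # Zero-mean normal prior density
        prior = math.exp(-0.5 * phi * phi / prior_variance)
        prior /= math.sqrt(2 * math.pi * prior_variance)
        unnorm.append(lik * prior)
    # Riemann-sum estimate of the evidence P(w|X), restricted to the grid
    evidence = sum(unnorm) * step
    return grid, [u / evidence for u in unnorm]

# Toy one-dimensional data (hypothetical): negative x labelled 0, positive x labelled 1
grid, posterior = grid_posterior([-2.0, -1.0, 1.0, 2.0], [0, 0, 1, 1])
```

Dividing by the evidence rescales the curve so that it integrates to 1 over the grid; it does not change the location of the peak.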

## Inference with Posterior Distribution

In inference, we use the posterior distribution $P(φ|X, w)$ to make predictions for the world state $w^*$ given a new data example $x^*$. By weighting the predictions with the posterior distribution, we can account for uncertainty in the parameter estimates.
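
One way to see the effect of this weighting is a Monte Carlo average: draw parameter samples from the posterior and average the resulting sigmoid predictions. The sketch below assumes a Gaussian posterior with independent parameters, which is a simplification chosen for illustration and not the approximation developed in the text. The function name `predictive_probability`, the sample count, and the posterior means and variances in the usage lines are hypothetical.

```python
import math
import random

def sigmoid(a):
    # Numerically stable logistic function
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)

def predictive_probability(x_star, post_mean, post_var, num_samples=5000, seed=0):
    # Average P(w* = 1 | x*, phi) over samples phi drawn from an
    # assumed diagonal Gaussian posterior
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        phi = [rng.gauss(m, math.sqrt(v)) for m, v in zip(post_mean, post_var)]
        total += sigmoid(sum(p * x for p, x in zip(phi, x_star)))
    return total / num_samples

# Same posterior mean, small versus large parameter uncertainty
x_star = [1.0, 2.0]
confident = predictive_probability(x_star, [0.0, 1.5], [0.01, 0.01])
uncertain = predictive_probability(x_star, [0.0, 1.5], [4.0, 4.0])
plug_in = sigmoid(3.0)  # prediction from the mean parameters alone
```

With the larger variance, the averaged probability is pulled toward 0.5, which is the moderating effect that a single point estimate of $φ$ lacks.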

## Approximations for Tractability

Due to the nonlinear nature of logistic regression, closed-form expressions for the posterior and the predictive distribution are not available. To make the algorithm tractable, we introduce approximations in both the learning and inference steps. These approximations yield closed-form expressions and simplify the computations.
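
The passage above does not name the approximations. One standard choice, assumed here for illustration, is the Laplace approximation: the posterior is replaced by a normal distribution centered at the most probable parameters $\hat{φ}$, with covariance given by the curvature of the log posterior $L = \log P(w|X, φ) + \log P(φ)$ at that point:

$$
P(φ|X, w) \approx \text{Norm}_φ[μ, Σ], \qquad μ = \hat{φ} = \underset{φ}{\operatorname{argmax}}\; L, \qquad Σ = -\left( \left. \frac{\partial^2 L}{\partial φ^2} \right|_{φ = \hat{φ}} \right)^{-1}
$$

Under this normal posterior the activation $a = φ^T x^*$ is also normally distributed, with mean $μ_a = μ^T x^*$ and variance $σ_a^2 = x^{*T} Σ x^*$, and the predictive integral admits a well-known closed-form approximation:

$$
P(w^* = 1|x^*, X, w) \approx \int \text{sig}[a]\, \text{Norm}_a[μ_a, σ_a^2]\, da \approx \text{sig}\left[ \frac{μ_a}{\sqrt{1 + π σ_a^2 / 8}} \right]
$$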

## Conclusion

Bayesian logistic regression provides a probabilistic framework for learning and inference in binary classification tasks. By incorporating a prior distribution over the parameters and using the posterior distribution for predictions, we can handle uncertainty and make more robust decisions. The approximations introduced for tractability allow us to work with complex models while maintaining computational efficiency.


In [5]:
import math

def sigmoid(x):
    # Numerically stable logistic function
    if x >= 0:
        return 1 / (1 + math.exp(-x))
    e = math.exp(x)
    return e / (1 + e)

def bayesian_logistic_regression(X, y, prior_mean=0, prior_variance=100,
                                 num_iterations=1000, learning_rate=0.1):
    """MAP estimate with a diagonal Laplace approximation of the posterior.

    Returns the posterior mean (the MAP estimate) and one variance per
    parameter. Using only the diagonal of the Hessian ignores correlations
    between parameters; it is a simplification, not the full approximation.
    """
    n = len(X[0])
    theta_mean = [float(prior_mean)] * n

    def predict(theta):
        # Predicted probability sig[theta^T x] for every training example
        return [sigmoid(sum(x_j * t_j for x_j, t_j in zip(x, theta))) for x in X]

    # Gradient descent on the negative log posterior
    for _ in range(num_iterations):
        h = predict(theta_mean)
        gradient = [
            # Likelihood term plus the pull of the normal prior toward prior_mean
            sum((h_i - y_i) * x_i[j] for x_i, y_i, h_i in zip(X, y, h))
            + (theta_mean[j] - prior_mean) / prior_variance
            for j in range(n)
        ]
        theta_mean = [t_j - learning_rate * g_j for t_j, g_j in zip(theta_mean, gradient)]

    # Diagonal of the Hessian of the negative log posterior at the MAP estimate
    h = predict(theta_mean)
    hessian_diag = [
        sum(h_i * (1 - h_i) * x_i[j] ** 2 for x_i, h_i in zip(X, h)) + 1 / prior_variance
        for j in range(n)
    ]
    theta_variance = [1 / d for d in hessian_diag]
    return theta_mean, theta_variance

# Example usage (XOR-style labels, which no linear boundary can separate)
X = [[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]]
y = [0, 1, 1, 0]

theta_mean, theta_variance = bayesian_logistic_regression(X, y)
print("Estimated parameters mean:", theta_mean)
print("Estimated parameters variance:", theta_variance)
