# Logistic Regression

Logistic regression estimates the probability of an event occurring, based on a given data set of independent variables.

Because the model outputs a probability, its prediction is bounded between 0 and 1. This type of statistical model (also known as a logit model) is often used for classification and predictive analytics.

It is:
* a probabilistic classifier
* a **discriminative model**

Logistic regression has two phases:

**training:** We train the system (specifically the weights $w$ and the bias $b$) using stochastic gradient descent and the cross-entropy loss.

**test:** Given a test example $x$, we compute $p(y|x)$ and return the label with the higher probability, $y = 1$ or $y = 0$.

# The Logistic Function or Sigmoid Function

This function maps any real number to the open interval (0, 1), which we interpret as a probability.

$$\sigma(t) = \frac{1}{1 + e^{-t}}$$
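The formula translates almost directly into code. The following is a minimal sketch using NumPy; the function name `sigmoid` and the sample inputs are illustrative choices, not part of any particular library.

```python
import numpy as np

def sigmoid(t):
    """Logistic function 1 / (1 + exp(-t)), applied element-wise."""
    t = np.asarray(t, dtype=float)
    return 1.0 / (1.0 + np.exp(-t))

# Large negative inputs approach 0, large positive inputs approach 1,
# and an input of 0 maps to exactly 0.5.
values = sigmoid([-5.0, 0.0, 5.0])
print(values)
```

For very large negative inputs this naive form can trigger an overflow warning in `np.exp`; SciPy's `scipy.special.expit` computes the same function and is a reasonable alternative in practice.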

## Why we use the sigmoid function

To make a decision on a test instance, the classifier first multiplies each feature $x_i$ by its weight $w_i$, sums up the weighted features, and adds the bias term $b$.

The resulting single number z expresses the weighted sum of the evidence for the class.

$$z = w^T x + b$$

But nothing forces z to be a legal probability, that is, to lie between 0 and 1. 

In fact, since weights are real-valued, the output might even be negative; z ranges from −∞ to ∞.

To create a probability, we pass $z$ through the **sigmoid function**, $\sigma(z)$.

The sigmoid has a number of advantages:
- It takes a real-valued number and maps it into the range (0, 1), which is just what we want for a probability.
- Because it is nearly linear around 0 but flattens toward the ends, it tends to squash outlier values toward 0 or 1.
- It is differentiable, which will be handy for training the model.
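As a small worked example of the steps above, the sketch below computes the score $z$ for one instance and then squashes it with the sigmoid. The weights, bias, and feature values are hypothetical numbers chosen only to make the arithmetic easy to follow.

```python
import numpy as np

# Hypothetical values, chosen only for illustration.
w = np.array([2.0, -1.0, 0.5])   # one weight per feature
b = -0.5                         # bias term
x = np.array([1.0, 3.0, 2.0])    # feature vector of one instance

# Weighted sum of the evidence: 2*1 + (-1)*3 + 0.5*2 + (-0.5) = -0.5
z = w @ x + b

# z is negative, so it is not a legal probability; the sigmoid fixes that.
p = 1.0 / (1.0 + np.exp(-z))

print(z, p)
```

Here the score is negative, so the resulting probability falls below 0.5.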

## What is logit?

The input to the sigmoid function, the score $z = w^T x + b$, is often called the **logit**.

This is because the logit function is the inverse of the sigmoid. 

The logit function is the log of the odds $\frac{p}{1-p}$:

$$ \operatorname{logit}(p) = \sigma^{-1}(p) = \ln\left(\frac{p}{1-p}\right) $$

Using the term **logit** for $z$ is a way of reminding us that by using the sigmoid to turn $z$ (which ranges from $-\infty$ to $\infty$) into a probability, we are implicitly interpreting $z$ as not just any real-valued number, but specifically as a log odds.
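The inverse relationship is easy to check numerically. This is a minimal sketch; the helper names `sigmoid` and `logit` and the sample scores are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    """Map a log odds z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    """Log odds ln(p / (1 - p)), defined for 0 < p < 1."""
    return np.log(p / (1.0 - p))

z = np.array([-3.0, -0.5, 0.0, 2.0])

# Applying logit after sigmoid should give back the original scores.
recovered = logit(sigmoid(z))
print(recovered)
```

A probability of 0.5 corresponds to even odds, so its logit is 0; a probability of 0.8 corresponds to odds of 4 to 1, so its logit is $\ln 4$.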

# The Model

In logistic regression, we model the probability of an instance belonging to the positive class as:

$$P(y=1|x) = \sigma(w^T x + b) = \frac{1}{1 + e^{-(w^T x + b)}}$$

In vector form, the linear function $w^T x + b$ can be written as a single dot product:

$$\theta^T x' $$

where:

- $x'$ is an individual training instance in augmented form,
- $\theta$ is the weight vector $w$ with the bias term $b$ appended to its end:
$$\theta = [w, b]$$
- $x'$ is the input vector $x$ with a 1 appended to its end:
$$ x' = [x, 1] $$

The probability function now becomes:

$$P(y=1|x') = \sigma(\theta^T x')$$
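The following sketch checks that folding the bias into the weight vector leaves the score unchanged. The numeric values and the variable names `theta` and `x_aug` are hypothetical.

```python
import numpy as np

# Hypothetical values, chosen only for illustration.
w = np.array([2.0, -1.0, 0.5])
b = -0.5
x = np.array([1.0, 3.0, 2.0])

theta = np.append(w, b)      # theta = [w, b]
x_aug = np.append(x, 1.0)    # x' = [x, 1]

z_original = w @ x + b       # w^T x + b
z_augmented = theta @ x_aug  # theta^T x'

print(z_original, z_augmented)
```

The appended 1 multiplies the bias, so both expressions produce the same number.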



## Calculating probability of all training instances at once

We can compute the probabilities of all $m$ training instances in a single matrix operation. Let $X$ be the matrix whose columns are the augmented training instances:

$$X = [x'^{(1)}, x'^{(2)}, \ldots, x'^{(m)}]$$

The vector containing the predicted probability of every training instance is then:

$$\hat{y} = \sigma(X^T \theta)$$

- Each $x'^{(i)}$ is a column vector, so $X$ has one column per training instance.
- $\theta$ is also a column vector.
- Transposing $X$ turns each instance into a row, so the product $X^T \theta$ has one entry $\theta^T x'^{(i)}$ per instance. The sigmoid is then applied element-wise.
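The sketch below builds $X$ with one augmented instance per column, as in the convention above, and compares the matrix form with an explicit loop. The data, the parameter values, and the variable names are made up for illustration.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Four hypothetical instances with two features each, one instance per row.
features = np.array([[0.5, 1.0],
                     [2.0, -1.0],
                     [-1.5, 0.3],
                     [0.0, 0.0]])
m = features.shape[0]

# Append a 1 to every instance, then transpose so that each augmented
# instance x' becomes a column of X.
X = np.hstack([features, np.ones((m, 1))]).T   # shape (3, 4)
theta = np.array([1.0, -2.0, 0.5])             # [w_1, w_2, b]

# All probabilities at once.
y_hat = sigmoid(X.T @ theta)                   # shape (4,)

# The same result, computed one instance at a time.
y_hat_loop = np.array([sigmoid(theta @ X[:, i]) for i in range(m)])

print(y_hat)
```

Many libraries store instances as rows instead of columns; under that layout the same computation is written $\sigma(X\theta)$ with no transpose.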

# Making Prediction

To make a prediction, we first calculate the linear predictor, or logit:
$$ z = w^T x + b $$

Then we pass it through the sigmoid function to get the probability:
$$ P(y=1|x) = \sigma(w^T x + b) $$

Finally, we choose a decision threshold (0.5 here) and predict the class from the probability:
$$
decision(x) = \hat{y} = 
  \begin{cases}
    0 & \text{if } P(y=1|x) < 0.5 \\
    1 & \text{if } P(y=1|x) \ge 0.5 
  \end{cases}
$$

Because $\sigma(z) \ge 0.5$ exactly when $z \ge 0$, this is equivalent to:
$$
\hat{y} = 
  \begin{cases}
    0 & \text{if } (w^T x + b) < 0 \\
    1 & \text{if } (w^T x + b) \ge 0 
  \end{cases}
$$
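Both forms of the decision rule can be sketched in a few lines. The function names `predict_proba` and `predict`, the weights, and the data are hypothetical; instances are stored one per row here.

```python
import numpy as np

def predict_proba(X_rows, w, b):
    """P(y=1|x) for every row of X_rows."""
    z = X_rows @ w + b
    return 1.0 / (1.0 + np.exp(-z))

def predict(X_rows, w, b, threshold=0.5):
    """Label 1 when the probability reaches the threshold, else 0."""
    return (predict_proba(X_rows, w, b) >= threshold).astype(int)

# Hypothetical parameters and instances.
w = np.array([1.0, -1.0])
b = 0.0
X_rows = np.array([[2.0, 1.0],    # z = 1
                   [1.0, 3.0],    # z = -2
                   [1.0, 1.0]])   # z = 0, exactly on the boundary

labels = predict(X_rows, w, b)

# Equivalent rule that skips the sigmoid and looks only at the sign of z.
labels_from_sign = (X_rows @ w + b >= 0).astype(int)

print(labels, labels_from_sign)
```

The third instance lies exactly on the decision boundary, where the probability is 0.5, so the $\ge$ in the rule assigns it to class 1.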

# Model training

## Loss Function (cross-entropy loss)
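We use the **negative log-likelihood** as the loss function. Minimizing it is equivalent to maximizing the likelihood of the training labels, so this is maximum likelihood estimation restated as a minimization problem.

$$ J(\theta) = -\frac{1}{m} \sum_{i=1}^{m}\left[y^{(i)}\log(\hat{p}^{(i)}) + (1 - y^{(i)})\log(1 - \hat{p}^{(i)})\right]$$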

- $m$ is the number of training instances,
- $y^{(i)}$ is the target value of the $i$-th training instance,
- $\hat{p}^{(i)}$ is the predicted probability $\sigma(\theta^T x^{(i)})$ for the $i$-th training instance.
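To get a feel for how the loss behaves, the sketch below evaluates it on three sets of made-up predictions. Note the leading minus sign, which is what makes this a negative log-likelihood. The function name `cross_entropy_loss` and the small clipping constant `eps` (added to avoid taking the log of 0) are assumptions of this example.

```python
import numpy as np

def cross_entropy_loss(y, p_hat, eps=1e-12):
    """Mean negative log-likelihood of binary labels y under probabilities p_hat."""
    y = np.asarray(y, dtype=float)
    # Clip so that log(0) never occurs for predictions of exactly 0 or 1.
    p_hat = np.clip(np.asarray(p_hat, dtype=float), eps, 1.0 - eps)
    return -np.mean(y * np.log(p_hat) + (1.0 - y) * np.log(1.0 - p_hat))

y = np.array([1, 0, 1, 0])

confident_and_right = cross_entropy_loss(y, [0.9, 0.1, 0.8, 0.2])
uninformative = cross_entropy_loss(y, [0.5, 0.5, 0.5, 0.5])
confident_and_wrong = cross_entropy_loss(y, [0.1, 0.9, 0.2, 0.8])

print(confident_and_right, uninformative, confident_and_wrong)
```

The loss is small when the model puts high probability on the correct labels, equals $\ln 2$ when every prediction is 0.5, and grows quickly when the model is confidently wrong.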

## Gradient descent

The partial derivative of the loss function with respect to each parameter $\theta_j$ is:

$$ \frac{\partial}{\partial \theta_j} {J(\theta)} =  \frac{1}{m} \sum\limits_{i = 1}^{m} (\sigma(\theta^Tx^{(i)}) - y^{(i)}) x_j^{(i)}$$
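One way to see where this expression comes from is sketched here for a single training instance, with $z = \theta^T x$ and $\hat{p} = \sigma(z)$. It relies on the identity $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ and on the loss carrying a leading minus sign, as a negative log-likelihood does:

$$
\begin{aligned}
\frac{\partial}{\partial \theta_j}\left(-\left[y\log\hat{p} + (1-y)\log(1-\hat{p})\right]\right)
&= -\left(\frac{y}{\hat{p}} - \frac{1-y}{1-\hat{p}}\right)\hat{p}(1-\hat{p})\,x_j \\
&= -\left(y(1-\hat{p}) - (1-y)\hat{p}\right)x_j \\
&= (\hat{p} - y)\,x_j
\end{aligned}
$$

Averaging over the $m$ training instances gives the formula above.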

We can now apply any of the gradient descent variants to find the weights that minimize the loss function.
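As an illustration, here is a minimal sketch of plain full-batch gradient descent (not the stochastic variant) using the gradient formula above. The synthetic data, the learning rate, the number of steps, and all function names are assumptions made for the example; instances are stored one per row with a trailing 1 for the bias.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def loss(theta, X_rows, y):
    """Mean negative log-likelihood (cross-entropy)."""
    p = np.clip(sigmoid(X_rows @ theta), 1e-12, 1.0 - 1e-12)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def gradient(theta, X_rows, y):
    """Vector of partial derivatives: (1/m) * sum_i (p_i - y_i) * x_i."""
    return X_rows.T @ (sigmoid(X_rows @ theta) - y) / len(y)

def fit(X_rows, y, learning_rate=0.1, n_steps=2000):
    """Full-batch gradient descent starting from all-zero parameters."""
    theta = np.zeros(X_rows.shape[1])
    for _ in range(n_steps):
        theta -= learning_rate * gradient(theta, X_rows, y)
    return theta

# Synthetic, linearly separable data: label 1 when x_1 + x_2 > 0.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 2))
y = (x[:, 0] + x[:, 1] > 0).astype(float)
X_rows = np.hstack([x, np.ones((200, 1))])   # append 1 for the bias

theta = fit(X_rows, y)
accuracy = np.mean((sigmoid(X_rows @ theta) >= 0.5) == y)

print(theta, loss(theta, X_rows, y), accuracy)
```

A stochastic or mini-batch version would compute `gradient` on a random subset of rows at each step instead of the full data set; the update rule itself stays the same.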