# Logistic Regression

## High Level Summary
Logistic regression is an algorithm for classifying data samples into one of K discrete classes. When K = 2 it is called binary logistic regression, and when K > 2 it is called multinomial logistic regression. If the classes have a natural order, ordinal logistic regression is used.

## Binary logistic regression

Assume that the target value, y, can be either 1 or 0 (i.e. $y \in \{0,1\}$). We could just treat this like linear regression, where we build up a score ($\vec{\theta}^T \vec{x}$) and a higher score means y is more likely to be 1 while a lower score means y is more likely to be 0. There are several problems with this. For example, a least-squares fit is sensitive to samples that lie far from the decision boundary, so a few extreme but correctly labeled points can shift the fitted line and with it the threshold. Additionally, the score is unbounded, and it is not intuitive to predict values outside the interval [0, 1] for a target that is either 0 or 1.

Instead, we'll choose a model that predicts values within the interval [0, 1] by tying the linear score to a probability through the logit function. There are good reasons that this function is chosen, but we'll not go into them here (beyond the reasons stated already). The linear combination of the input vector is set equal to the logit (the log-odds) of the probability p that y = 1.

$$ \ln{\frac{p}{1-p}} = \vec{\theta}^T \cdot \vec{x} $$

Solving for p,

$$ p(y = 1|\vec{x};\vec{\theta}) = h_{\theta}(\vec{x}) = 
\frac{1}{1 + e^{-\vec{\theta}^T \cdot \vec{x}}} \\
p(y = 0|\vec{x};\vec{\theta}) = 1 - p(y = 1|\vec{x};\vec{\theta}) = 
\frac{e^{-\vec{\theta}^T \cdot \vec{x}}}{1 + e^{-\vec{\theta}^T \cdot \vec{x}}}
$$
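As a minimal sketch of the two formulas above, using only the Python standard library: the names `sigmoid`, `logit`, and `predict_proba` are illustrative, and the parameter and feature values are made up. The first feature is fixed at 1 so that its weight acts as an intercept, which is an assumption not stated in the text.

```python
import math
from typing import Sequence


def sigmoid(z: float) -> float:
    """Logistic function 1 / (1 + e^(-z)), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def logit(p: float) -> float:
    """Log-odds ln(p / (1 - p)); the inverse of sigmoid on the open interval (0, 1)."""
    return math.log(p / (1.0 - p))


def predict_proba(theta: Sequence[float], x: Sequence[float]) -> float:
    """p(y = 1 | x; theta) = sigmoid(theta^T x)."""
    score = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(score)


# Hypothetical values, chosen only for illustration.
theta = [0.5, -1.0, 2.0]
x = [1.0, 2.0, 1.0]          # x[0] = 1 plays the role of an intercept term

p1 = predict_proba(theta, x)  # p(y = 1 | x; theta)
p0 = 1.0 - p1                 # p(y = 0 | x; theta)

print(p1, p0)
print(logit(p1))              # recovers the linear score theta^T x
```

Applying `logit` to the predicted probability returns the linear score, which is the relationship the model starts from.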

Written more compactly, the probability of observing the label y given the feature vector $\vec{x}$ is:

$$
p(y|\vec{x};\vec{\theta}) = h_{\theta}(\vec{x})^y \cdot \left(1 - h_{\theta}(\vec{x})\right)^{1 - y}
$$

Remember that when y = 1, the second factor equals 1 and the probability is just $h_{\theta}(\vec{x})$, and when y = 0, the first factor equals 1 and the probability becomes $1 - h_{\theta}(\vec{x})$.
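The compact form can be checked directly. This is a small sketch in which `bernoulli_likelihood` is a hypothetical helper name and `h = 0.8` is an arbitrary stand-in for $h_{\theta}(\vec{x})$.

```python
def bernoulli_likelihood(h: float, y: int) -> float:
    """Compact form h^y * (1 - h)^(1 - y) for a label y in {0, 1}."""
    return (h ** y) * ((1.0 - h) ** (1 - y))


h = 0.8  # arbitrary example value for the predicted probability of y = 1

like_y1 = bernoulli_likelihood(h, 1)  # exponent of the second factor is 0, leaving h
like_y0 = bernoulli_likelihood(h, 0)  # exponent of the first factor is 0, leaving 1 - h

print(like_y1, like_y0)
```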

The equation above is for a single sample. What we really want is to choose $\vec{\theta}$ so that the likelihood of all m samples together is as large as possible, that is, the probability of the observed labels given their feature vectors. Assuming the samples are independent, this aggregate likelihood is the product of the per-sample likelihoods. Each factor lies in [0, 1], so the product also lies in [0, 1].

$$
L(\vec{\theta}) = \prod_{i=1}^m p(y^i|\vec{x}^i;\vec{\theta})
= \prod_{i=1}^m h_{\theta}(\vec{x}^i)^{y^i} \cdot \left(1 - h_{\theta}(\vec{x}^i)\right)^{1 - y^i}
$$

We can transform $L(\vec{\theta})$ into its log form, which is easier to maximize because the product becomes a sum. Since the logarithm is strictly increasing, maximizing the log-likelihood with respect to the parameter vector $\vec{\theta}$ is equivalent to maximizing the original likelihood function.

$$
\ell(\vec{\theta}) = \ln{L(\vec{\theta})} = 
\sum_{i=1}^m \left[ y^i\ln{h_{\theta}(\vec{x}^i)} + (1 - y^i) \ln{\left(1 - h_{\theta}(\vec{x}^i)\right)} \right]
$$

So, now we need to find a $\vec{\theta}$ vector that maximizes the equation above. How? One way is to use gradient ascent, which is a common optimization method.

In gradient ascent, we start with a guess for the parameter vector and then repeatedly update that guess until we get to the best answer. To do this, we find the gradient of the log-likelihood function with respect to $\vec{\theta}$ and step in that direction. So, for each $\theta_j$ in $\vec{\theta}$ we update our guess using a step size $\alpha > 0$ (the learning rate):

$$
\theta_j = \theta_j + \alpha\frac{\partial}{\partial \theta_j} \ell(\vec{\theta})
$$

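We first need the partial derivatives of the log-likelihood with respect to each $\theta_j$. For a single sample $(\vec{x}, y)$, and using the fact that the logistic function $g(z) = 1/(1 + e^{-z})$ has derivative $g(z)(1 - g(z))$: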

$$
\frac{\partial}{\partial \theta_j} \ell(\vec{\theta}) = 
\left(y \frac{1}{h_{\theta}(\vec{x})} - (1 - y) \frac{1}{1- h_{\theta}(\vec{x})} \right) \frac{\partial}{\partial \theta_j} h_{\theta}(\vec{x}) \\
= \left(y \frac{1}{h_{\theta}(\vec{x})} - (1 - y) \frac{1}{1- h_{\theta}(\vec{x})} \right) h_{\theta}(\vec{x})(1 - h_{\theta}(\vec{x})) \frac{\partial}{\partial \theta_j} \left( \vec{\theta}^T \vec{x} \right) \\
= \left(y \frac{1}{h_{\theta}(\vec{x})} - (1 - y) \frac{1}{1- h_{\theta}(\vec{x})} \right) h_{\theta}(\vec{x})(1 - h_{\theta}(\vec{x})) x_j \\
= \left(y - h_{\theta}(\vec{x}) \right) x_j
$$
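One way to gain confidence in the final expression $(y - h_{\theta}(\vec{x})) x_j$ is to compare it against a finite-difference approximation of the single-sample log-likelihood. This is only a sanity check under made-up values. The function names are hypothetical and the step `eps` is an arbitrary choice.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def h(theta: Sequence[float], x: Sequence[float]) -> float:
    """h_theta(x) = sigmoid(theta^T x)."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))


def sample_log_likelihood(theta: Sequence[float], x: Sequence[float], y: int) -> float:
    """Log-likelihood of one sample: y*ln(h) + (1 - y)*ln(1 - h)."""
    p = h(theta, x)
    return y * math.log(p) + (1 - y) * math.log(1.0 - p)


def analytic_gradient(theta: Sequence[float], x: Sequence[float], y: int) -> List[float]:
    """Closed form from the derivation: (y - h_theta(x)) * x_j for each j."""
    error = y - h(theta, x)
    return [error * x_j for x_j in x]


def numeric_gradient(theta: Sequence[float], x: Sequence[float], y: int,
                     eps: float = 1e-6) -> List[float]:
    """Central finite differences, one coordinate of theta at a time."""
    grad = []
    for j in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[j] += eps
        down[j] -= eps
        diff = sample_log_likelihood(up, x, y) - sample_log_likelihood(down, x, y)
        grad.append(diff / (2.0 * eps))
    return grad


# Made-up parameter vector and sample.
theta = [0.2, -0.4, 0.1]
x = [1.0, 2.0, -3.0]
y = 1

g_analytic = analytic_gradient(theta, x, y)
g_numeric = numeric_gradient(theta, x, y)
print(g_analytic)
print(g_numeric)
```

The two printed vectors should agree to several decimal places. A large mismatch would point to a mistake in either the derivation or the code.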

This leads to the batch gradient ascent update, which sums over all m samples on every step, and further to a stochastic update, which uses a single sample i at a time:

$$
\theta_j = \theta_j + \alpha\sum_{i=1}^m \left(y^i - h_{\theta}(\vec{x}^i) \right) x^i_j \\
\theta_j = \theta_j + \alpha \left(y^i - h_{\theta}(\vec{x}^i) \right) x^i_j
$$
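Both update rules can be sketched in a few lines of plain Python. Everything here is an assumption made for illustration: the tiny dataset, the zero initial guess, the step sizes, the iteration counts, and the function names. The stochastic variant visits samples in a shuffled order with a fixed seed so that runs are repeatable.

```python
import math
import random
from typing import List, Sequence


def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def h(theta: Sequence[float], x: Sequence[float]) -> float:
    """h_theta(x) = sigmoid(theta^T x)."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))


def log_likelihood(theta, X, y) -> float:
    """Sum over samples of y*ln(h) + (1 - y)*ln(1 - h)."""
    return sum(y_i * math.log(h(theta, x_i)) + (1 - y_i) * math.log(1.0 - h(theta, x_i))
               for x_i, y_i in zip(X, y))


def batch_gradient_ascent(X, y, alpha: float, iterations: int) -> List[float]:
    """theta_j += alpha * sum_i (y^i - h(x^i)) * x^i_j, for all j at once."""
    theta = [0.0] * len(X[0])
    for _ in range(iterations):
        # Compute every error with the current theta before changing any theta_j.
        errors = [y_i - h(theta, x_i) for x_i, y_i in zip(X, y)]
        theta = [theta[j] + alpha * sum(e * x_i[j] for e, x_i in zip(errors, X))
                 for j in range(len(theta))]
    return theta


def stochastic_gradient_ascent(X, y, alpha: float, epochs: int, seed: int = 0) -> List[float]:
    """theta_j += alpha * (y^i - h(x^i)) * x^i_j, one sample i per update."""
    rng = random.Random(seed)
    theta = [0.0] * len(X[0])
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            error = y[i] - h(theta, X[i])
            theta = [t + alpha * error * x_j for t, x_j in zip(theta, X[i])]
    return theta


# Made-up, non-separable data: constant 1 (intercept) plus one feature.
X = [[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 2.0], [1.0, 2.5], [1.0, 3.0]]
y = [0, 0, 1, 0, 1, 1]

theta_batch = batch_gradient_ascent(X, y, alpha=0.1, iterations=1000)
theta_sgd = stochastic_gradient_ascent(X, y, alpha=0.05, epochs=500)

print(log_likelihood([0.0, 0.0], X, y))   # starting point
print(log_likelihood(theta_batch, X, y))  # after batch updates
print(log_likelihood(theta_sgd, X, y))    # after stochastic updates
```

The batch version touches all m samples for each step, while the stochastic version makes a noisier but much cheaper step per sample, which tends to matter when m is large.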