<h2 style="background-color:rgba(100,100,100,0.5);"> Logistic Regression </h2>

Some regression algorithms can be used for classification (and vice versa). Logistic Regression (also called Logit Regression) is commonly used to estimate the probability that an instance belongs to a particular class. If the estimated probability is greater than 50%, then the model predicts that the instance belongs to that class (called the positive class, labeled “1”), and otherwise it predicts that it does not (i.e., it belongs to the negative class, labeled “0”).

This makes Logistic Regression a binary classifier.

<h3><b>Prediction</b></h3>

Just like a Linear Regression model, a Logistic Regression model computes a weighted sum of the input features (plus a bias term), but instead of outputting the result directly, it outputs the logistic of this result. The logistic is a sigmoid (S-shaped) function that outputs a number between 0 and 1. In other words, you first compute the linear function $\theta^Tx$, then pass it through the sigmoid function to get a probability, and finally predict the class from that probability.

$ \hat{p} =
  h_\theta(x) = \sigma(\theta^Tx)
$

$ \sigma(t) =
  \frac{1}{1 + \exp(-t)}
$

$ \hat{y} =
  \begin{cases}
    0 & \quad \text{if } \hat{p} < 0.5 \\
    1 & \quad \text{if } \hat{p} \geq 0.5 \\
  \end{cases}
$

First $\theta^Tx$ is computed, then $\hat{p}$, and finally $\hat{y}$. Since $\sigma(t) \geq 0.5$ exactly when $t \geq 0$, the model predicts 1 if $\theta^Tx$ is positive (or zero) and 0 if $\theta^Tx$ is negative.
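The three prediction steps above can be sketched in NumPy. This is a minimal illustration, not a reference implementation: the function names, the toy matrix `X` (whose first column is assumed to be the bias feature $x_0 = 1$), and the vector `theta` are all hypothetical.

```python
import numpy as np

def sigmoid(t):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-t))

def predict_proba(X, theta):
    """Estimated probability p_hat = sigmoid(theta^T x) for each row of X."""
    return sigmoid(X @ theta)

def predict(X, theta):
    """Predict 1 when p_hat >= 0.5 (equivalently theta^T x >= 0), else 0."""
    return (predict_proba(X, theta) >= 0.5).astype(int)

# Hypothetical toy data: column 0 is the bias feature (always 1).
X = np.array([[1.0, 2.0],
              [1.0, -3.0],
              [1.0, 0.0]])
theta = np.array([0.0, 1.0])   # assumed parameters, for illustration only

p_hat = predict_proba(X, theta)   # probabilities for scores 2, -3, 0
y_hat = predict(X, theta)         # classes derived from the 0.5 threshold
```

Thresholding $\hat{p}$ at 0.5 and thresholding $\theta^Tx$ at 0 give the same classes, so the sigmoid is only needed when the probability itself is of interest.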

<h3><b>Cost Function</b></h3>

$ J(X, \theta, y) =
  -\cfrac{1}{m}\sum_{i=1}^{m}\left[y^i\log(\hat{p}^i) + (1-y^i)\log(1 - \hat{p}^i)\right]
$

This cost function makes sense because $-\log(t)$ grows very large when $t$ approaches 0, so the cost will be large if the model estimates a probability close to 0 for a positive instance, and it will also be very large if the model estimates a probability close to 1 for a negative instance. On the other hand, $-\log(t)$ is close to 0 when $t$ is close to 1, so the cost will be close to 0 if the estimated probability is close to 0 for a negative instance or close to 1 for a positive instance, which is precisely what we want.

The cost function over the whole training set is the average cost over all training instances and is called the log loss function.
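As a rough numerical check of this behavior, the sketch below evaluates the log loss for confident correct and confident wrong predictions. The `eps` clipping is an added assumption (it only guards against `log(0)`), and the labels and probabilities are made-up values.

```python
import numpy as np

def log_loss(y, p_hat, eps=1e-15):
    """Average log loss over all instances.

    eps is an assumption added here to avoid log(0); it is not part of
    the cost function itself.
    """
    p = np.clip(p_hat, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1, 0, 1, 0])                          # hypothetical labels
confident_right = np.array([0.99, 0.01, 0.99, 0.01])
confident_wrong = np.array([0.01, 0.99, 0.01, 0.99])

loss_good = log_loss(y, confident_right)   # equals -log(0.99), close to 0
loss_bad = log_loss(y, confident_wrong)    # equals -log(0.01), large
```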

There is no known closed-form equation to compute the value of θ that minimizes this cost function (there is no equivalent of the Normal Equation). The good news is that this cost function is convex, so Gradient Descent (or any other optimization algorithm) can be used. The partial derivatives of the cost function with regard to the jth model parameter $\theta_j$ are given by

$ \frac{\partial}{{\partial}\theta_{j}}J(\theta) =
  \cfrac{1}{m}\sum_{i=1}^{m}(\sigma(\theta^Tx^i) - y^i)x_j^i
$

Then, any of the known Gradient Descent algorithms can be used to solve for the model parameters.
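A minimal Batch Gradient Descent sketch using the partial derivatives above, written in vectorized form. The learning rate `eta`, the iteration count, and the toy dataset are assumptions chosen for illustration, not tuned values.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def log_loss_gradient(X, y, theta):
    """Vector of partial derivatives: (1/m) * X^T (sigmoid(X theta) - y)."""
    m = X.shape[0]
    return X.T @ (sigmoid(X @ theta) - y) / m

def fit_logistic(X, y, eta=0.1, n_iterations=2000):
    """Plain Batch Gradient Descent starting from theta = 0."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iterations):
        theta -= eta * log_loss_gradient(X, y, theta)
    return theta

# Hypothetical 1-D toy data: negative x -> class 0, positive x -> class 1.
x1 = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
X = np.c_[np.ones_like(x1), x1]        # prepend the bias feature x_0 = 1
y = np.array([0, 0, 0, 1, 1, 1])

theta = fit_logistic(X, y)
y_hat = (sigmoid(X @ theta) >= 0.5).astype(int)
```

Because this toy set is linearly separable, the weights keep growing with more iterations; a real training loop would normally add regularization or a stopping criterion.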

<h2 style="background-color:rgba(100,100,100,0.5);"> Softmax Regression </h2>

The Logistic Regression model can be generalized to support multiple classes directly, without having to train and combine multiple binary classifiers. This is called Softmax Regression, or Multinomial Logistic Regression.

This makes Softmax Regression a multiclass classifier. Note that the Softmax Regression classifier predicts only one class at a time (i.e., it is multiclass, not multioutput), so it should be used only with mutually exclusive classes, such as different types of flowers. You cannot use it to recognize multiple people in one picture.

<h3><b>Prediction</b></h3>

When given an instance x, the Softmax Regression model first computes a score $s_k(x)$ for each class k, then estimates the probability of each class by applying the softmax function (also called the normalized exponential) to the scores. The equation to compute $s_k(x)$ should look familiar, as it is just like the equation for Linear Regression prediction.

Softmax score:

$ s_k(x) =
  (\theta^k)^Tx
$

Note that each class has its own dedicated parameter vector $\theta^k$. All these vectors are typically stored as rows in a parameter matrix $\Theta$.

Once you have computed the score of every class for the instance x, you can estimate the probability $\hat{p}_k$ that the instance belongs to class k by running the scores through the softmax function. The function computes the exponential of every score, then normalizes them (dividing by the sum of all the exponentials). The scores are generally called logits or log-odds (although they are actually unnormalized log-odds).

Softmax function:

$ \hat{p}_k =
  \sigma(s(x))_k =
  \cfrac{\exp(s_k(x))}{\sum_{j=1}^{K}\exp(s_j(x))}
$

$K$ is the number of classes, $s(x)$ is a vector containing the scores of each class for the instance $x$, and $\sigma(s(x))_k$ is the estimated probability that the instance $x$ belongs to class $k$, given the scores of each class for that instance.
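The score and softmax computations can be sketched as follows. The parameter matrix `Theta` (one row per class, as described above) and the instances in `X` are made-up values, and subtracting the row maximum before exponentiating is an added numerical-stability step that does not change the result.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax: exponentiate each score, then normalize per row."""
    # Assumption: shift by the row max for numerical stability; the
    # shift cancels out in the ratio, so the probabilities are unchanged.
    shifted = scores - scores.max(axis=1, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=1, keepdims=True)

# Hypothetical parameters: K = 3 classes, n = 2 features, one row per class.
Theta = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [-1.0, -1.0]])
X = np.array([[2.0, 1.0],
              [0.0, 0.0]])

scores = X @ Theta.T      # scores[i, k] = (theta^k)^T x^i
p_hat = softmax(scores)   # each row sums to 1
```

For the second instance all three scores are 0, so the three classes get equal probability.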

Just like the Logistic Regression classifier, the Softmax Regression classifier predicts the class with the highest estimated probability (which is simply the class with the highest score).

Softmax regression classifier prediction:

$ \hat{y} =
  \underset{k}{\operatorname{argmax}} \: \sigma(s(x))_k =
  \underset{k}{\operatorname{argmax}} \: s_k(x) =
  \underset{k}{\operatorname{argmax}} \: \left((\theta^k)^Tx\right)
$

The argmax operator returns the value of a variable that maximizes a function. In this equation, it returns the value of k that maximizes the estimated probability $\sigma(s(x))_k$.
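A short sketch of why the argmax can be taken over the raw scores: the exponential is increasing and the normalizing sum is the same for every class, so the ordering of classes is preserved. The random scores below are purely illustrative, and NumPy numbers classes from 0 rather than 1.

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(size=(5, 3))     # hypothetical scores: 5 instances, 3 classes

# Softmax probabilities for each instance.
exps = np.exp(scores)
p_hat = exps / exps.sum(axis=1, keepdims=True)

# Predicted class index computed two ways.
y_hat_from_proba = np.argmax(p_hat, axis=1)
y_hat_from_scores = np.argmax(scores, axis=1)
```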

<h3><b>Cost Function</b></h3>

The objective is to have a model that estimates a high probability for the target class (and consequently a low probability for the other classes). Minimizing the cost function shown below, called the cross entropy, should lead to this objective because it penalizes the model when it estimates a low probability for a target class. Cross entropy is frequently used to measure how well a set of estimated class probabilities matches the target classes.

Cross entropy cost function:

$ J(\Theta) =
  -\cfrac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}y^i_k\:\log(\hat{p}^i_k)
$

$y^i_k$ is the target probability that the $i$th instance belongs to class $k$. In general, it is either equal to 1 or 0, depending on whether the instance belongs to the class or not.

Notice that when there are just two classes (K=2), this cost function is equivalent to the Logistic Regression’s cost function (log loss).
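The two-class equivalence can be checked numerically. In this sketch the binary labels and probabilities are made-up values, the targets are one-hot encoded with column 0 for the negative class and column 1 for the positive class, and the `eps` clipping is an added safeguard against `log(0)`.

```python
import numpy as np

def cross_entropy(Y, P_hat, eps=1e-15):
    """Cross entropy: -(1/m) * sum_i sum_k Y[i, k] * log(P_hat[i, k])."""
    P = np.clip(P_hat, eps, 1.0)   # assumption: clip to avoid log(0)
    return -np.mean(np.sum(Y * np.log(P), axis=1))

def log_loss(y, p_hat, eps=1e-15):
    """Binary log loss from the Logistic Regression section."""
    p = np.clip(p_hat, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1, 0, 1])               # hypothetical binary labels
p_hat = np.array([0.9, 0.2, 0.6])     # hypothetical P(class = 1)

Y = np.c_[1 - y, y]                   # one-hot targets, K = 2
P_hat = np.c_[1 - p_hat, p_hat]       # probabilities for both classes

ce = cross_entropy(Y, P_hat)
ll = log_loss(y, p_hat)               # same value as ce
```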

Cross entropy gradient vector for class k:

$ g(J(\Theta)) = \nabla_{\theta^k}J(\Theta) =
  \cfrac{1}{m}\sum_{i=1}^{m}(\hat{p}^i_k - y^i_k)x^i
$

Now you can compute the gradient vector for every class, then use Gradient Descent (or any other optimization algorithm) to find the parameter matrix $\Theta$ that minimizes the cost function.
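Putting the pieces together, here is a minimal Batch Gradient Descent sketch for Softmax Regression. It computes the gradient for all classes at once, with row $k$ of the result holding the gradient with respect to $\theta^k$. The learning rate, iteration count, and the three small clusters are assumptions chosen so that the example converges quickly.

```python
import numpy as np

def softmax(scores):
    shifted = scores - scores.max(axis=1, keepdims=True)   # stability shift
    exps = np.exp(shifted)
    return exps / exps.sum(axis=1, keepdims=True)

def cross_entropy_gradient(X, Y, Theta):
    """Row k is (1/m) * sum_i (p_hat[i, k] - Y[i, k]) * x^i."""
    m = X.shape[0]
    P_hat = softmax(X @ Theta.T)
    return (P_hat - Y).T @ X / m

def fit_softmax(X, Y, eta=0.2, n_iterations=1000):
    """Plain Batch Gradient Descent starting from Theta = 0."""
    Theta = np.zeros((Y.shape[1], X.shape[1]))
    for _ in range(n_iterations):
        Theta -= eta * cross_entropy_gradient(X, Y, Theta)
    return Theta

# Hypothetical toy data: three well-separated clusters in 2-D.
features = np.array([[0.0, 2.0], [0.0, 3.0],       # class 0
                     [-2.0, -1.0], [-3.0, -1.0],   # class 1
                     [2.0, -1.0], [3.0, -1.0]])    # class 2
labels = np.array([0, 0, 1, 1, 2, 2])

X = np.c_[np.ones(len(features)), features]   # prepend the bias feature
Y = np.eye(3)[labels]                         # one-hot targets y^i_k

Theta = fit_softmax(X, Y)
y_hat = np.argmax(X @ Theta.T, axis=1)        # highest score wins
```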