### Preliminaries

#### Definition

Given a dataset $D = (X, C)$ where $C = \{C_1, C_2, \dots, C_k\}$ is the set of all classes, the classification task is to assign each input $x_i \in \mathbb{R}^K$ to its correct class $c \in C$.

#### Binary Classification Task

Let us consider a classification task where $k = 2$ (a binary classification task). Let $P(x, C_i)$ be the joint probability of the input $x \in X$ and the class $C_i \in C$. The binary classification task then becomes the task of approximating this joint probability.

Applying Bayes' rule, we have:

$$
    P(x, C_i) = P(x | C_i)P(C_i) = P(C_i | x)P(x)
$$

$$
    \Rightarrow P(C_i | x) = \frac{P(x | C_i)P(C_i)}{P(x)}
$$

$P(x | C_i)$ is called the likelihood distribution, $P(C_i)$ is the prior distribution, $P(x)$ is the evidence, and $P(C_i | x)$ is the posterior distribution.

Expanding the evidence $P(x)$ over the two classes, we have:

\begin{align}
    P(C_i | x)  & = \frac{P(x | C_i)P(C_i)}{P(x)} \\
                & = \frac{P(x | C_i) P(C_i)}{ P(x, C_1) + P(x, C_2) } \\
                & = \frac{P(x | C_i) P(C_i)}{ P(x | C_1)P(C_1) + P(x | C_2)P(C_2) }
\end{align}

Let $a_1 = \ln\frac{P(x | C_1)P(C_1)}{P(x | C_2)P(C_2)}$. Then we have:

\begin{align}
    P(C_1 | x)  & = \left( \frac{P(x | C_1)P(C_1) + P(x | C_2)P(C_2)}{P(x | C_1)P(C_1)} \right)^{-1} \\
                & = \left(1 + \frac{P(x | C_2)P(C_2)}{P(x | C_1)P(C_1)} \right)^{-1} \\
                & = \left(1 + \left(\frac{P(x | C_1)P(C_1)}{P(x | C_2)P(C_2)}\right)^{-1} \right)^{-1} \\
                & = \left(1 + \exp \left(-\ln\frac{P(x | C_1)P(C_1)}{P(x | C_2)P(C_2)}\right) \right)^{-1} \\
                & = (1 + \exp(-a_1))^{-1} \\
                & = \frac{1}{1 + \exp(-a_1)} \\
                & = \sigma(a_1)
\end{align}

The function $\sigma: \mathbb{R} \to (0, 1)$, $\sigma(a) = \frac{1}{1 + \exp(-a)}$, is the **Sigmoid function**; its graph is shown in the following figure.

![image.png](attachment:image.png)

In the same way, defining $a_2 = \ln\frac{P(x | C_2)P(C_2)}{P(x | C_1)P(C_1)} = -a_1$, we also have:

$$
    P(C_2 | x) = \sigma(a_2)
$$
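The identity $P(C_1 | x) = \sigma(a_1)$ can be checked numerically. The following is a minimal sketch using only the Python standard library; the likelihood and prior values are hypothetical numbers chosen for illustration, and the helper names `sigmoid` and `posterior_bayes` are not part of the original text.

```python
import math


def sigmoid(a):
    """Logistic sigmoid: 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + math.exp(-a))


def posterior_bayes(lik1, prior1, lik2, prior2):
    """P(C_1 | x) computed directly from Bayes' rule with two classes."""
    joint1 = lik1 * prior1  # P(x | C_1) P(C_1)
    joint2 = lik2 * prior2  # P(x | C_2) P(C_2)
    return joint1 / (joint1 + joint2)


# Hypothetical values of P(x | C_i) and P(C_i) at one fixed input x.
lik1, prior1 = 0.30, 0.40
lik2, prior2 = 0.10, 0.60

# a_1 is the log-odds of C_1 against C_2.
a1 = math.log((lik1 * prior1) / (lik2 * prior2))

p1_bayes = posterior_bayes(lik1, prior1, lik2, prior2)
p1_sigmoid = sigmoid(a1)
p2_sigmoid = sigmoid(-a1)  # a_2 = -a_1, so P(C_2 | x) = sigma(-a_1)

print(p1_bayes, p1_sigmoid, p2_sigmoid)
```

Both routes give the same value for $P(C_1 | x)$, and the two posteriors sum to one because $\sigma(-a) = 1 - \sigma(a)$.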

#### Classification Task in General

Let us consider the general case where $k \in \mathbb{N}$ and $k > 2$. In this case, we have:

$$
    P(C_i | x) = \frac{P(x | C_i)P(C_i)}{P(x)} = \frac{P(x | C_i)P(C_i)}{\sum_{j=1}^{k} P(C_j, x)} = \frac{P(x | C_i)P(C_i)}{\sum_{j=1}^{k} P(x | C_j)P(C_j)}
$$

Defining $a_i = \ln \left( P(x | C_i)P(C_i) \right)$, we have:

\begin{align}
    P(C_i | x)  & = \frac{\exp(\ln(P(x | C_i)P(C_i)))}{\sum_{j=1}^{k} \exp(\ln(P(x | C_j)P(C_j)))} \\
                & = \frac{\exp(a_i)}{\sum_{j=1}^{k} \exp(a_j)}
\end{align}

The right-hand side of the last equality is called the **Softmax function** applied to the vector $a = (a_1, a_2, \dots, a_k) \in \mathbb{R}^k$.

From the analysis above, we can conclude:
- When $k = 2$, the posterior distribution has the form of the sigmoid function.
- When $k > 2$, the posterior distribution has the form of the softmax function.
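The softmax form of the posterior can be checked in the same spirit. The sketch below assumes hypothetical likelihoods and priors for $k = 3$ classes at one fixed $x$. Subtracting the maximum score before exponentiating is a common numerical safeguard that is not part of the derivation above; it leaves the result unchanged because the common factor cancels in the ratio.

```python
import math


def softmax(scores):
    """Softmax of a list of real scores (shifted by the max for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical values of P(x | C_j) and P(C_j) for k = 3 classes.
likelihoods = [0.20, 0.05, 0.10]
priors = [0.50, 0.30, 0.20]

# Joint probabilities P(x | C_j) P(C_j) and the posterior from Bayes' rule.
joints = [lik * p for lik, p in zip(likelihoods, priors)]
posterior_bayes = [j / sum(joints) for j in joints]

# a_i = ln(P(x | C_i) P(C_i)), then the posterior as softmax(a).
a = [math.log(j) for j in joints]
posterior_softmax = softmax(a)

print(posterior_bayes)
print(posterior_softmax)
```

The two printed lists agree up to floating-point rounding.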

#### Logistic Regression

In the case of binary classification, the posterior of class $C_1$ has the form

$$
    P(C_1 | x) = \frac{P(x | C_1)P(C_1)}{P(x | C_1)P(C_1) + P(x | C_2)P(C_2)}
$$

Let $a_1 = \ln\frac{P(x | C_1)P(C_1)}{P(x | C_2)P(C_2)}$, then

\begin{align}
    a_1 & = \ln (P(x | C_1)P(C_1)) - \ln (P(x | C_2)P(C_2)) \\
        & = \ln P(x | C_1) - \ln P(x | C_2) + const
\end{align}
where $const = \ln P(C_1) - \ln P(C_2) \in \mathbb{R}$ collects the terms that do not depend on $x$.

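Assume that the class-conditional densities are Gaussian, $x | C_1 \sim \mathcal{N}(\mu_1, \Sigma_1)$ and $x | C_2 \sim \mathcal{N}(\mu_2, \Sigma_2)$, where $\mu_1, \mu_2 \in \mathbb{R}^k$ and $\Sigma_1, \Sigma_2 \in \mathbb{R}^{k \times k}$ are symmetric positive definite. Here $k$ denotes the dimension of the input $x$ (written $K$ in the Definition above).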

From this assumption, we have:

$$
    P(x | C_1) = \frac{1}{(2\pi)^{\frac{k}{2}}} \frac{1}{|\Sigma_1|^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(x - \mu_1)^T\Sigma_1^{-1}(x - \mu_1)\right)
$$

$$
    \Rightarrow \ln P(x | C_1) = -\frac{k}{2}\ln(2\pi) - \frac{1}{2}\ln|\Sigma_1| - \frac{1}{2}(x - \mu_1)^T\Sigma_1^{-1}(x - \mu_1)
$$

In the same way, we have

$$
    \ln P(x | C_2) = -\frac{k}{2}\ln(2\pi) - \frac{1}{2}\ln|\Sigma_2| - \frac{1}{2}(x - \mu_2)^T\Sigma_2^{-1}(x - \mu_2)
$$

Substituting these expressions into $a_1$, we have:

\begin{align}
    a_1 & = \ln P(x | C_1) - \ln P(x | C_2) + const \\
        & = \left[-\frac{k}{2}\ln(2\pi) - \frac{1}{2}\ln|\Sigma_1| - \frac{1}{2}(x - \mu_1)^T\Sigma_1^{-1}(x - \mu_1)\right] - \left[-\frac{k}{2}\ln(2\pi) - \frac{1}{2}\ln|\Sigma_2| - \frac{1}{2}(x - \mu_2)^T\Sigma_2^{-1}(x - \mu_2)\right] + const \\
        & = -\frac{1}{2}(\ln|\Sigma_1| - \ln|\Sigma_2|) - \frac{1}{2}(x^T\Sigma_1^{-1}x - x^T\Sigma_2^{-1}x) - \frac{1}{2}(\mu_1^T\Sigma_1^{-1}\mu_1 - \mu_2^T\Sigma_2^{-1}\mu_2) + (\mu_1^T\Sigma_1^{-1}x - \mu_2^T\Sigma_2^{-1}x) + const
\end{align}

Now assume that the two classes share the same covariance matrix, $\Sigma_1 = \Sigma_2 = \Sigma$. The log-determinant terms and the terms quadratic in $x$ then cancel, and we have:

\begin{align}
    a_1 & = \ln P(x | C_1) - \ln P(x | C_2) + const \\
        & = -\frac{1}{2}(\ln|\Sigma_1| - \ln|\Sigma_2|) - \frac{1}{2}(x^T\Sigma_1^{-1}x - x^T\Sigma_2^{-1}x) - \frac{1}{2}(\mu_1^T\Sigma_1^{-1}\mu_1 - \mu_2^T\Sigma_2^{-1}\mu_2) + (\mu_1^T\Sigma_1^{-1}x - \mu_2^T\Sigma_2^{-1}x) + const \\
        & = -\frac{1}{2}(\ln|\Sigma| - \ln|\Sigma|) - \frac{1}{2}(x^T\Sigma^{-1}x - x^T\Sigma^{-1}x) - \frac{1}{2}(\mu_1^T\Sigma^{-1}\mu_1 - \mu_2^T\Sigma^{-1}\mu_2) + (\mu_1^T\Sigma^{-1}x - \mu_2^T\Sigma^{-1}x) + const \\
        & = - \frac{1}{2}(\mu_1^T\Sigma^{-1}\mu_1 - \mu_2^T\Sigma^{-1}\mu_2) + (\mu_1^T - \mu_2^T)\Sigma^{-1}x + const
\end{align}

By defining

$$
    w_0 = - \frac{1}{2}(\mu_1^T\Sigma^{-1}\mu_1 - \mu_2^T\Sigma^{-1}\mu_2) + const \in \mathbb{R}
$$

$$
    w_1 = (\mu_1^T - \mu_2^T)\Sigma^{-1} \in \mathbb{R}^{1 \times k}
$$

then we have

$$
    a_1 = w_0 + w_1 x
$$

In addition, we have shown that $P(C_1 | x) = \sigma(a_1)$, so

$$
    P(C_1 | x) = \sigma(a_1) = \sigma(w_0 + w_1x)
$$

The above formula is called the **Logistic Regression** model.

In conclusion, if the class-conditional likelihoods $P(x | C_i)$ are Gaussian with a shared covariance matrix, the posterior distribution $P(C_1 | x)$ takes the form of the Logistic Regression model.
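The result can be sanity-checked numerically. The sketch below is an illustration under stated assumptions rather than a reference implementation: it uses a two-dimensional input, hypothetical means, a hypothetical shared covariance matrix, and hypothetical priors, with $const = \ln P(C_1) - \ln P(C_2)$. The small helpers (`inv2`, `matvec`, `dot`, `log_gaussian`) are introduced here only to keep the example free of third-party dependencies.

```python
import math


def inv2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def matvec(m, v):
    """Matrix-vector product."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


def dot(u, v):
    """Inner product of two vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def log_gaussian(x, mu, sigma):
    """Log density of a 2-D Gaussian N(mu, sigma) evaluated at x."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    diff = [xi - mi for xi, mi in zip(x, mu)]
    quad = dot(diff, matvec(inv2(sigma), diff))
    # For dimension 2, (k / 2) * ln(2 * pi) = ln(2 * pi).
    return -math.log(2 * math.pi) - 0.5 * math.log(det) - 0.5 * quad


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


# Hypothetical parameters: two classes with a shared covariance matrix.
mu1, mu2 = [1.0, 2.0], [-1.0, 0.5]
sigma = [[2.0, 0.3], [0.3, 1.0]]
prior1, prior2 = 0.3, 0.7

prec = inv2(sigma)  # Sigma^{-1}
# w_1 = (mu_1 - mu_2)^T Sigma^{-1}, stored as a plain list (Sigma^{-1} is symmetric).
w1 = matvec(prec, [m1 - m2 for m1, m2 in zip(mu1, mu2)])
# w_0 gathers every term that does not depend on x.
w0 = (-0.5 * (dot(mu1, matvec(prec, mu1)) - dot(mu2, matvec(prec, mu2)))
      + math.log(prior1 / prior2))


def posterior_logistic(x):
    """P(C_1 | x) from the logistic regression form sigma(w_0 + w_1 x)."""
    return sigmoid(w0 + dot(w1, x))


def posterior_bayes(x):
    """P(C_1 | x) computed directly from Bayes' rule."""
    j1 = math.exp(log_gaussian(x, mu1, sigma)) * prior1
    j2 = math.exp(log_gaussian(x, mu2, sigma)) * prior2
    return j1 / (j1 + j2)


x = [0.4, 1.1]
print(posterior_logistic(x), posterior_bayes(x))
```

The two printed values agree up to rounding. If the covariances differ between classes, the quadratic terms in $x$ no longer cancel and the log-odds stops being linear in $x$.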

#### Softmax Regression

Consider again the general case where $k > 2$. As shown above, we have:

$$
    P(C_i | x) = \frac{P(x | C_i)P(C_i)}{P(x)} = \frac{P(x | C_i)P(C_i)}{\sum_{j=1}^{k} P(C_j, x)} = \frac{P(x | C_i)P(C_i)}{\sum_{j=1}^{k} P(x | C_j)P(C_j)}
$$

Defining $a_i = \ln \left( P(x | C_i)P(C_i) \right)$, we have:

\begin{align}
    P(C_i | x)  & = \frac{\exp(\ln(P(x | C_i)P(C_i)))}{\sum_{j=1}^{k} \exp(\ln(P(x | C_j)P(C_j)))} \\
                & = \frac{\exp(a_i)}{\sum_{j=1}^{k} \exp(a_j)}
\end{align}

Following the same approach as before, assume that each likelihood is Gaussian, $x | C_i \sim \mathcal{N}(\mu_i, \Sigma_i)$. In the normalizing constant below, $k$ stands for the dimension of $x$; this factor cancels later. Then we have:

$$
    P(x | C_i) = \frac{1}{(2\pi)^{\frac{k}{2}}} \frac{1}{|\Sigma_i|^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(x - \mu_i)^T\Sigma_i^{-1}(x - \mu_i)\right)
$$

\begin{align}
    \Rightarrow \ln P(x | C_i)  & = -\frac{k}{2}\ln(2\pi) - \frac{1}{2}\ln|\Sigma_i| - \frac{1}{2}(x - \mu_i)^T\Sigma_i^{-1}(x - \mu_i) \\
                                & = -\frac{k}{2}\ln(2\pi) - \frac{1}{2}\ln|\Sigma_i| - \frac{1}{2}x^T\Sigma_i^{-1}x - \frac{1}{2}\mu_i^T\Sigma_i^{-1}\mu_i + \mu_i^T\Sigma_i^{-1}x
\end{align}

Hence, since $a_i = \ln P(x | C_i) + \ln P(C_i)$,

\begin{align}
    \exp(a_i)   & = \exp\left( -\frac{k}{2}\ln(2\pi) - \frac{1}{2}\ln|\Sigma_i| - \frac{1}{2}x^T\Sigma_i^{-1}x - \frac{1}{2}\mu_i^T\Sigma_i^{-1}\mu_i + \mu_i^T\Sigma_i^{-1}x + \ln P(C_i) \right) \\
                & = \exp\left( -\frac{k}{2}\ln(2\pi) - \frac{1}{2}\ln|\Sigma_i| - \frac{1}{2}x^T\Sigma_i^{-1}x\right)\exp\left( - \frac{1}{2}\mu_i^T\Sigma_i^{-1}\mu_i + \mu_i^T\Sigma_i^{-1}x + \ln P(C_i) \right)
\end{align}

Now assume that all classes share the same covariance matrix, $\Sigma_1 = \Sigma_2 = \dots = \Sigma_k = \Sigma$. The first exponential factor is then identical for every class and cancels between numerator and denominator:

\begin{align}
        P(C_i | x)  & = \frac{\exp(a_i)}{\sum_{j=1}^{k} \exp(a_j)} \\
                    & = \frac{\exp\left( -\frac{k}{2}\ln(2\pi) - \frac{1}{2}\ln|\Sigma| - \frac{1}{2}x^T\Sigma^{-1}x\right)\exp\left( - \frac{1}{2}\mu_i^T\Sigma^{-1}\mu_i + \mu_i^T\Sigma^{-1}x + \ln P(C_i) \right)}{\sum_{j=1}^k \exp\left( -\frac{k}{2}\ln(2\pi) - \frac{1}{2}\ln|\Sigma| - \frac{1}{2}x^T\Sigma^{-1}x\right)\exp\left( - \frac{1}{2}\mu_j^T\Sigma^{-1}\mu_j + \mu_j^T\Sigma^{-1}x + \ln P(C_j) \right)} \\
                    & = \frac{\exp\left( - \frac{1}{2}\mu_i^T\Sigma^{-1}\mu_i + \mu_i^T\Sigma^{-1}x + \ln P(C_i) \right)}{\sum_{j=1}^k \exp\left( - \frac{1}{2}\mu_j^T\Sigma^{-1}\mu_j + \mu_j^T\Sigma^{-1}x + \ln P(C_j) \right)} \\
                    & = \frac{\exp\left( - \frac{1}{2}\mu_i^T\Sigma^{-1}\mu_i + \ln P(C_i) + \mu_i^T\Sigma^{-1}x \right)}{\sum_{j=1}^k \exp\left( - \frac{1}{2}\mu_j^T\Sigma^{-1}\mu_j + \ln P(C_j) + \mu_j^T\Sigma^{-1}x \right)}
\end{align}

Let us define

$$
    w_{i0} = -\frac{1}{2}\mu_i^T\Sigma^{-1}\mu_i + \ln P(C_i)
$$

$$
    w_{i1} = \mu_i^T\Sigma^{-1}
$$

We have

$$
    P(C_i | x) = \frac{\exp(w_{i0} + w_{i1}x)}{\sum_{j=1}^{k} \exp(w_{j0} + w_{j1}x)}
$$

The above formula is called the **Softmax Regression** model.

In conclusion, if the class-conditional likelihoods $P(x | C_i)$ are Gaussian with a shared covariance matrix, the posterior distribution of each class takes the form of the Softmax Regression model.
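As a rough numerical check of this conclusion, the sketch below compares the softmax regression form with a direct application of Bayes' rule. Everything concrete in it is an assumption made for illustration: three classes, a two-dimensional input, hypothetical means and priors, and a hypothetical shared covariance matrix. The helper functions are defined locally so that the example needs only the standard library.

```python
import math


def inv2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def matvec(m, v):
    """Matrix-vector product."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


def dot(u, v):
    """Inner product of two vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def log_gaussian(x, mu, sigma):
    """Log density of a 2-D Gaussian N(mu, sigma) evaluated at x."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    diff = [xi - mi for xi, mi in zip(x, mu)]
    quad = dot(diff, matvec(inv2(sigma), diff))
    return -math.log(2 * math.pi) - 0.5 * math.log(det) - 0.5 * quad


def softmax(scores):
    """Softmax of a list of scores (shifted by the max for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical parameters: three classes sharing one covariance matrix.
mus = [[1.0, 2.0], [-1.0, 0.5], [0.0, -1.5]]
sigma = [[2.0, 0.3], [0.3, 1.0]]
priors = [0.2, 0.5, 0.3]

prec = inv2(sigma)  # Sigma^{-1}
# w_{i1} = mu_i^T Sigma^{-1}, stored as a plain list (Sigma^{-1} is symmetric).
w1 = [matvec(prec, mu) for mu in mus]
# w_{i0} = -1/2 mu_i^T Sigma^{-1} mu_i + ln P(C_i)
w0 = [-0.5 * dot(mu, matvec(prec, mu)) + math.log(p)
      for mu, p in zip(mus, priors)]


def posterior_softmax_regression(x):
    """Posterior over classes from softmax(w_{i0} + w_{i1} x)."""
    return softmax([b + dot(w, x) for b, w in zip(w0, w1)])


def posterior_bayes(x):
    """Posterior over classes computed directly from Bayes' rule."""
    joints = [math.exp(log_gaussian(x, mu, sigma)) * p
              for mu, p in zip(mus, priors)]
    total = sum(joints)
    return [j / total for j in joints]


x = [0.4, 1.1]
print(posterior_softmax_regression(x))
print(posterior_bayes(x))
```

The two printed posteriors agree up to rounding, which reflects the cancellation of the shared factor in the derivation above.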