## 3.4 Softmax Regression

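Softmax regression is one of the most widely used models for multiclass classification. Like linear regression, it is a linear model, but its outputs are passed through the softmax function, so the model produces a probability distribution over the classes (soft classification).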

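Softmax regression is expressed mathematically as:
$$
\hat{y} = \mathrm{softmax}(o), \quad \hat{y}_i = \frac{e^{o_i}}{\sum_{j} e^{o_j}}
$$
where $o$ denotes the vector of unnormalized outputs (logits) of the linear model.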

In [None]:
import numpy as np

def softmax(z, axis=-1):
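    """
    Numerically stable softmax along `axis`.

    The maximum is subtracted before exponentiating to avoid overflow;
    this shift does not change the result.
    """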
    max_val = np.max(z, axis=axis, keepdims=True)
    exp_z = np.exp(z - max_val)
    return exp_z / np.sum(exp_z, axis=axis, keepdims=True)
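
As a quick sanity check, the sketch below applies the same function to a small batch of logits. The logit values are made up for illustration. The second row is the first row shifted by a large constant, which would overflow a naive `np.exp` but gives the same probabilities here because of the max subtraction.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along `axis`."""
    shifted = z - np.max(z, axis=axis, keepdims=True)
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z, axis=axis, keepdims=True)

# Hypothetical logits for a batch of 2 samples and 3 classes.
# The second row equals the first row plus 999.
logits = np.array([[1.0, 2.0, 3.0],
                   [1000.0, 1001.0, 1002.0]])
probs = softmax(logits)

print(probs)              # each row is a probability distribution
print(probs.sum(axis=1))  # each row sums to 1 (up to rounding)
```

Softmax is invariant to adding a constant to all logits, so both rows produce the same distribution.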


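## Likelihood Function and Cross-Entropy Loss
The softmax function produces a vector $\hat{y}$, which can be interpreted as the estimated conditional probability of each class given any input $x$. For example, $\hat{y}_1 = P(y = \text{cat} \mid x)$.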

Assuming the training dataset has $n$ samples, the likelihood function can be expressed as:
$$
P(Y|X) = \prod_{i=1}^{n} P(y^{(i)}|x^{(i)}) = \prod_{i=1}^{n} \prod_{j=1}^{q} (\hat{y}_j^{(i)})^{y_j^{(i)}}
$$
where $X = \{x^{(1)}, x^{(2)}, \ldots, x^{(n)}\}$ denotes the input features of all samples, $Y = \{y^{(1)}, y^{(2)}, \ldots, y^{(n)}\}$ denotes the one-hot encoded labels of all samples, and $q$ is the number of classes.

Based on the principle of maximum likelihood estimation (MLE), we maximize the likelihood $P(Y|X)$, which is equivalent to minimizing the negative log-likelihood:
$$
-\log P(Y|X) = -\sum_{i=1}^{n} \log P(y^{(i)}|x^{(i)}) = \sum_{i=1}^{n} L(\hat{y}^{(i)}, y^{(i)})
$$
where $L(\hat{y}, y) = -\sum_{j=1}^{q} y_j \log \hat{y}_j$ is the loss function, known as the cross-entropy loss.
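
The equivalence between the negative log-likelihood and the summed cross-entropy loss can be checked numerically. The following is a minimal sketch with made-up predictions and labels; the function name `cross_entropy` and all array values are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def cross_entropy(y_hat, y):
    """Per-sample cross-entropy L(y_hat, y) = -sum_j y_j * log(y_hat_j)."""
    return -np.sum(y * np.log(y_hat), axis=-1)

# Hypothetical predictions for n = 3 samples and q = 3 classes.
y_hat = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4]])
# One-hot labels: the true classes are 0, 1 and 2.
y = np.eye(3)

losses = cross_entropy(y_hat, y)

# Likelihood P(Y|X): product of the probabilities assigned to the true classes.
likelihood = np.prod(np.sum(y_hat * y, axis=-1))
nll = -np.log(likelihood)

print(losses.sum(), nll)  # the two values agree up to rounding
```

Because the labels are one-hot, each per-sample loss reduces to the negative log of the probability assigned to the true class.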

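## Softmax and Its Derivative

Substituting the softmax definition into the cross-entropy loss expresses the loss directly in terms of the logits $o$: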

$$
\begin{aligned}
L(\hat{y}, y) &= -\sum_{j=1}^{q} y_j \log \hat{y}_j \\
&\quad \text{Step 1: Substitute Softmax definition } \hat{y}_j = \frac{e^{o_j}}{\sum_{k=1}^q e^{o_k}} \\
&= -\sum_{j=1}^{q} y_j \log \left( \frac{e^{o_j}}{\sum_{k=1}^{q} e^{o_k}} \right) \\
&\quad \text{Step 2: Log quotient rule } \log(\frac{a}{b}) = \log a - \log b \\
&= -\sum_{j=1}^{q} y_j \left( \log(e^{o_j}) - \log \left( \sum_{k=1}^{q} e^{o_k} \right) \right) \\
&\quad \text{Step 3: Split the sum and simplify } \log(e^x) = x \\
&= -\sum_{j=1}^{q} y_j o_j + \sum_{j=1}^{q} y_j \log \left( \sum_{k=1}^{q} e^{o_k} \right) \\
&\quad \text{Step 4: Factor out the log-sum term (constant w.r.t summation index j)} \\
&= -\sum_{j=1}^{q} y_j o_j + \left( \log \sum_{k=1}^{q} e^{o_k} \right) \underbrace{\sum_{j=1}^{q} y_j}_{=1 \text{ (One-hot property)}} \\
&= -\sum_{j=1}^{q} y_j o_j + \log \sum_{k=1}^{q} e^{o_k} \\
&= \log \sum_{k=1}^{q} e^{o_k} - \sum_{j=1}^{q} y_j o_j
\end{aligned}
$$
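
The simplified form can be compared against the definition numerically. This sketch assumes a single sample with made-up logits and a one-hot label; the helper names `loss_from_probs` and `loss_from_logits` are hypothetical.

```python
import numpy as np

def loss_from_probs(o, y):
    """-sum_j y_j * log(softmax(o)_j), computed directly from the definition."""
    y_hat = np.exp(o) / np.sum(np.exp(o))
    return -np.sum(y * np.log(y_hat))

def loss_from_logits(o, y):
    """log(sum_k exp(o_k)) - sum_j y_j * o_j, the simplified form."""
    m = np.max(o)  # subtract the max before exponentiating for stability
    log_sum_exp = m + np.log(np.sum(np.exp(o - m)))
    return log_sum_exp - np.sum(y * o)

o = np.array([2.0, -1.0, 0.5])  # hypothetical logits
y = np.array([0.0, 0.0, 1.0])   # one-hot label, true class index 2

a = loss_from_probs(o, y)
b = loss_from_logits(o, y)
print(a, b)  # both forms give the same loss up to rounding
```

Working from the logits avoids taking the log of a probability that may have underflowed to zero, which is one practical reason to prefer the simplified form.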

#### Differentiating the Loss Function
$$
\begin{aligned}
\frac{\partial L}{\partial o_i} &= \frac{\partial }{\partial o_i} \left( \log \sum_{k=1}^{q} e^{o_k} - \sum_{j=1}^{q} y_j o_j \right) \\
&\quad \text{Step 1: Differentiate each term separately} \\
&= \frac{\partial }{\partial o_i} \log \sum_{k=1}^{q} e^{o_k} - \frac{\partial }{\partial o_i} \sum_{j=1}^{q} y_j o_j \\
&\quad \text{Step 2: Apply the chain rule to the first term} \\
&= \frac{1}{\sum_{k=1}^{q} e^{o_k}} \cdot \frac{\partial }{\partial o_i} \sum_{k=1}^{q} e^{o_k} - y_i \\
&\quad \text{Step 3: Differentiate the sum inside the first term} \\
&= \frac{1}{\sum_{k=1}^{q} e^{o_k}} \cdot e^{o_i} - y_i \\
&\quad \text{Step 4: Simplify the expression} \\
&= \frac{e^{o_i}}{\sum_{k=1}^{q} e^{o_k}} - y_i \\
&= \hat{y}_i - y_i
\end{aligned}
$$
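
The result $\partial L / \partial o_i = \hat{y}_i - y_i$ can be verified with a finite-difference gradient check. The sketch below uses made-up logits and a one-hot label, and compares the analytic gradient with a central-difference approximation; the step size `eps` is an arbitrary choice.

```python
import numpy as np

def softmax(o):
    e = np.exp(o - np.max(o))
    return e / np.sum(e)

def loss(o, y):
    """Cross-entropy in logit form: log(sum_k exp(o_k)) - sum_j y_j * o_j."""
    m = np.max(o)
    return m + np.log(np.sum(np.exp(o - m))) - np.sum(y * o)

o = np.array([2.0, -1.0, 0.5])  # hypothetical logits
y = np.array([0.0, 0.0, 1.0])   # one-hot label

# Analytic gradient from the derivation: y_hat - y.
analytic = softmax(o) - y

# Central finite differences, one coordinate at a time.
eps = 1e-6
numeric = np.zeros_like(o)
for i in range(o.size):
    step = np.zeros_like(o)
    step[i] = eps
    numeric[i] = (loss(o + step, y) - loss(o - step, y)) / (2 * eps)

print(analytic)
print(numeric)
```

The gradient components sum to zero, because both $\hat{y}$ and $y$ sum to one.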

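## Information Theory - Entropy
The central idea of information theory is to quantify the information content of data. In information theory, this quantity is called the entropy of a distribution $P$ and is defined as: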
$$
H(P) = - \sum_{x} P(x) \log P(x)
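$$
where, for an observed event $x$, $-\log P(x)$ is the amount of information carried by that event. The lower the probability assigned to an event, the more information it carries. The entropy $H(P)$ is the expected information content over all possible events under the distribution $P$.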
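
To make the definition concrete, the sketch below computes the entropy (in nats, using the natural logarithm) of three made-up distributions over four outcomes. The function name `entropy` and the convention that zero-probability terms contribute nothing are assumptions of this example.

```python
import numpy as np

def entropy(p):
    """H(P) = -sum_x P(x) * log P(x), in nats; terms with P(x) = 0 are skipped."""
    p = np.asarray(p, dtype=float)
    nonzero = p > 0
    return -np.sum(p[nonzero] * np.log(p[nonzero]))

uniform = entropy([0.25, 0.25, 0.25, 0.25])  # maximally uncertain
skewed = entropy([0.97, 0.01, 0.01, 0.01])   # one outcome dominates
certain = entropy([1.0, 0.0, 0.0, 0.0])      # no uncertainty at all

print(uniform, skewed, certain)
```

The uniform distribution has the highest entropy of the three, and a distribution that puts all its mass on one outcome has zero entropy.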

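The difference between a true distribution $P$ and a predicted distribution $Q$ can be measured by the cross-entropy:
$$
H(P, Q) = - \sum_{x} P(x) \log Q(x)
$$
We can think of the cross-entropy as the expected surprisal of an observer with subjective probabilities $Q$ upon seeing data that was actually generated according to $P$. The closer $Q$ is to $P$, the smaller the observer's surprisal; the cross-entropy reaches its minimum, $H(P)$, when $Q = P$.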
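
As an illustration, the sketch below evaluates the cross-entropy using its standard definition, $H(P, Q) = -\sum_x P(x) \log Q(x)$, for a made-up true distribution `p` and two made-up predicted distributions. All values and the function name `cross_entropy` are assumptions chosen for this example.

```python
import numpy as np

def cross_entropy(p, q):
    """H(P, Q) = -sum_x P(x) * log Q(x), in nats."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return -np.sum(p * np.log(q))

p = np.array([0.5, 0.3, 0.2])          # hypothetical true distribution
q_close = np.array([0.45, 0.35, 0.2])  # prediction close to p
q_far = np.array([0.1, 0.1, 0.8])      # prediction far from p

h_p = cross_entropy(p, p)              # equals the entropy H(P)
h_close = cross_entropy(p, q_close)
h_far = cross_entropy(p, q_far)

print(h_p, h_close, h_far)
```

In this example the cross-entropy grows as the prediction moves away from `p`, and it is never smaller than the entropy of `p` itself.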

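Training a model to minimize the cross-entropy loss is, in essence, minimizing the model's surprisal upon seeing the true outcomes (labels), which is equivalent to maximizing the likelihood of the observed data.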