# 5. `softmax` Regression

## 1. Overview

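Softmax regression, also known as multinomial or multi-class logistic regression, is the generalization of logistic regression to multi-class classification problems.

For a multi-class problem, the class label $y \in \{1, 2, \dots, C\}$ can take $C$ values. Given a sample $\mathbf{x}$, the conditional probability that softmax regression assigns to class $c$ is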
$$
\begin{aligned}
p(y=c|\mathbf{x})&=\mathrm{softmax}(\mathbf{w^T_cx})\\
&=\frac{\exp(\mathbf{w^T_cx})}{\sum_{i=1}^C \exp(\mathbf{w^T_ix})}
\end{aligned}
$$

where $\mathbf{w_i}$ is the weight vector of class $i$.

In [1]:
%config InlineBackend.figure_formats = ['svg']
%matplotlib inline
from torch.utils.data import TensorDataset, DataLoader
import torch
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

In [2]:
def softmax(X, W):
    """
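    X: torch.FloatTensor of shape N*a, where N is the number of samples and a is the feature dimension
    W: torch.FloatTensor of shape a*C, where C is the number of classes
    """
    exp_scores = torch.exp(X@W)  # unnormalized class scores, N*C
    return exp_scores / torch.sum(exp_scores, axis=1).reshape(X.shape[0], -1)  # normalize each row into a probability distribution over the classes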

In [3]:
X = torch.randn(10, 5)
W = torch.randn(5, 3)
softmax(X, W)

tensor([[0.1584, 0.3256, 0.5161],
        [0.0928, 0.5208, 0.3864],
        [0.2508, 0.4578, 0.2914],
        [0.0253, 0.8853, 0.0895],
        [0.3327, 0.1074, 0.5599],
        [0.7337, 0.0700, 0.1963],
        [0.4860, 0.1457, 0.3683],
        [0.2206, 0.1274, 0.6520],
        [0.0067, 0.9760, 0.0172],
        [0.7444, 0.0281, 0.2275]])
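
The `softmax` function above exponentiates the raw scores directly, which can overflow when the scores are large. A common remedy is to subtract each row's maximum before exponentiating, which leaves the probabilities unchanged. The following is a minimal NumPy sketch of that idea; the name `stable_softmax` and the example scores are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def stable_softmax(scores):
    """Row-wise softmax of an (N, C) score matrix, shifted by the row maximum."""
    shifted = scores - scores.max(axis=1, keepdims=True)  # largest entry per row becomes 0
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

# Scores this large would overflow np.exp without the shift.
scores = np.array([[1000.0, 1001.0, 1002.0],
                   [-3.0, 0.0, 3.0]])
probs = stable_softmax(scores)
print(probs.sum(axis=1))  # each row should sum to 1
```

Because softmax is invariant to adding a constant to every score in a row, the first row gives the same probabilities as the scores `[0, 1, 2]` would.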

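## 2. Decision Function of Softmax Regression

The decision function of softmax regression can be written as

$$
\begin{aligned}
\hat{y}&=\arg\max_{c=1}^{C}p(y=c|\mathbf{x})\\
&=\arg\max_{c=1}^{C}\mathbf{w_c^Tx}
\end{aligned}
$$

The second equality holds because the normalizing denominator is the same for every class and $\exp$ is monotonically increasing.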

In [4]:
def hat_y(X, W):
    S = softmax(X, W)  # probability of each sample for each class
    max_indices = torch.max(S, dim=1)[1]
    pred_y = torch.zeros_like(S)
    pred_y[torch.arange(S.shape[0]), max_indices] = 1
    return max_indices, pred_y

In [5]:
pred_y = hat_y(X, W)

In [6]:
pred_y[0]

tensor([2, 1, 1, 1, 2, 0, 0, 2, 1, 0])

In [7]:
pred_y[1]

tensor([[0., 0., 1.],
        [0., 1., 0.],
        [0., 1., 0.],
        [0., 1., 0.],
        [0., 0., 1.],
        [1., 0., 0.],
        [1., 0., 0.],
        [0., 0., 1.],
        [0., 1., 0.],
        [1., 0., 0.]])
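
Since softmax preserves the ordering of the scores within each row, the predicted class can be read off the raw scores $\mathbf{w}_c^T\mathbf{x}$ without computing probabilities. A small NumPy sketch of this check, with illustrative random data (the shapes and seed are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 5))   # 10 samples, 5 features (illustrative)
W = rng.normal(size=(5, 3))    # 3 classes

scores = X @ W
probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

pred_from_scores = scores.argmax(axis=1)   # argmax over raw scores
pred_from_probs = probs.argmax(axis=1)     # argmax over softmax probabilities
print(np.array_equal(pred_from_scores, pred_from_probs))
```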

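- Relation to `Logistic` regression. When the number of classes is $C=2$, the decision function of softmax regression is
$$
\begin{aligned}
\hat{y}&=\arg\max_{c\in\{1,2\}}p(y=c|\mathbf{x})\\
&=\arg\max_{c\in\{1,2\}}\mathbf{w_c^Tx}\\
&=1+I(\mathbf{(w_2-w_1)^Tx}>0)
\end{aligned}
$$
where $I(\cdot)$ is the indicator function, so class 2 is predicted exactly when $\mathbf{(w_2-w_1)^Tx}>0$.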
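
The two-class case can be checked numerically: the softmax probability of class 2 equals the logistic (sigmoid) function applied to $(\mathbf{w}_2-\mathbf{w}_1)^T\mathbf{x}$. This is a sketch with illustrative random vectors; the variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=4)
w1, w2 = rng.normal(size=4), rng.normal(size=4)

scores = np.array([w1 @ x, w2 @ x])
p_softmax = np.exp(scores) / np.exp(scores).sum()   # [p(y=1|x), p(y=2|x)]
p_sigmoid = 1.0 / (1.0 + np.exp(-(w2 - w1) @ x))    # logistic form of p(y=2|x)

print(np.isclose(p_softmax[1], p_sigmoid))

# With classes labeled 1 and 2, the decision depends only on the sign of (w2 - w1)^T x.
predicted_class = 1 + int((w2 - w1) @ x > 0)
```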

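## 3. Criterion

Given $N$ training samples, softmax regression learns the optimal parameter matrix $W$ by minimizing the cross-entropy loss. For convenience, class labels are represented as $C$-dimensional `one-hot` vectors; for class $i$, the vector is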
$$
y = [I(i=1), I(i=2), ..., I(i=C)]
$$

With the cross-entropy loss, the risk function of the softmax regression model is
$$
\begin{aligned}
R(\mathbf{W})&=-\frac{1}{N}\sum_{n=1}^N\sum_{c=1}^{C}y_c^{(n)}\log \hat{y}_c^{(n)}\\
&=-\frac{1}{N}\sum_{n=1}^N(\mathbf{y}^{(n)})^T\log \hat{\mathbf{y}}^{(n)}
\end{aligned}
$$
where $\hat{\mathbf{y}}^{(n)}=\text{softmax}(\mathbf{W}^T\mathbf{x}^{(n)})$ is the vector of posterior probabilities of sample $\mathbf{x}^{(n)}$ over all classes.

In [15]:
def cross_entropy(X, y, W):
    """
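    X: N*(a+1), N samples with a features plus one bias dimension
    y: N*C, N one-hot vectors of dimension C
    W: (a+1)*C
    """
    p_y = softmax(X, W) # N*C, posterior probabilities of the N samples over the C classes
    # flatten both to 1-D and take the dot product; log2 reports the loss in bits (the natural log would give nats)
    crossEnt = -torch.dot(y.reshape(-1), torch.log2(p_y).reshape(-1)) / y.shape[0]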
    return crossEnt

In [16]:
X = torch.randn(10, 5)
y = torch.zeros(10, 3)
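# note: the `high` bound of torch.randint is exclusive, so this draws only classes 0 and 1;
# use high=y.shape[1] to sample from all classes
y[torch.arange(10), torch.randint(low=0, high=y.shape[1] - 1, size=(10,))] = 1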
W = torch.randn(5, 3)

In [17]:
a = torch.arange(10)
b = torch.arange(10)

In [18]:
torch.sum(a * b), torch.dot(a,b)

(tensor(285), tensor(285))

In [19]:
y

tensor([[0., 1., 0.],
        [0., 1., 0.],
        [0., 1., 0.],
        [0., 1., 0.],
        [1., 0., 0.],
        [0., 1., 0.],
        [0., 1., 0.],
        [0., 1., 0.],
        [1., 0., 0.],
        [0., 1., 0.]])

In [20]:
cross_entropy(X, y, W)

tensor(4.0633)

In [21]:
prob_y = softmax(X, W)

In [22]:
-torch.dot(y.reshape(-1), torch.log2(prob_y).reshape(-1)) / y.shape[0]

tensor(4.0633)
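
With one-hot labels, only the true class contributes to each sample's term in the sum, so the risk is the average negative log-probability of the true class. The sketch below illustrates this in NumPy and also shows that the base-2 value used above differs from the natural-log value only by a factor of $\ln 2$. The probabilities and labels are made-up illustrative numbers.

```python
import numpy as np

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4]])   # illustrative predicted probabilities
labels = np.array([0, 1, 2])          # true class indices
one_hot = np.eye(3)[labels]
rows = np.arange(len(labels))

loss_bits_full = -(one_hot * np.log2(probs)).sum() / len(labels)  # full double sum
loss_bits_pick = -np.log2(probs[rows, labels]).mean()             # true-class entries only
loss_nats = -np.log(probs[rows, labels]).mean()                   # same loss in nats

print(loss_bits_full, loss_bits_pick, loss_nats / np.log(2))
```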

- The gradient of the risk function $R(\mathbf{W})$ with respect to $\mathbf{W}$ is
$$
\frac{\partial R(W)}{\partial W}=-\frac{1}{N}\sum_{n=1}^N\mathbf{x^{(n)}(y^{(n)}-\hat{y}^{(n)})}^T
$$

In [30]:
def grad_crosEnt_W(X, y, W):
    '''
    X: N*(a+1), N samples with a features plus one bias dimension
    y: N*C, N one-hot vectors of dimension C
    W: (a+1)*C
    '''
    hat_y = softmax(X, W)
    a = (X.t() @ (y - hat_y)) / y.shape[0]  # (a+1)*N | N*C
    return -a

In [31]:
grad_crosEnt_W(X, y, W)

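RuntimeError: mat1 and mat2 shapes cannot be multiplied (1000x8 and 5x3)

Note: judging by the cell numbers and the shapes in the message, this error comes from running the cell after `X` had been redefined as a 1000×8 matrix in cell `In [26]`, while `W` was still the 5×3 matrix from cell `In [16]`. With the 10×5 `X` from cell `In [16]`, the shapes are compatible and the call returns a 5×3 gradient.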
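
The gradient formula can be sanity-checked against central finite differences. The sketch below does this in NumPy under the assumption that the risk uses the natural logarithm, which is what the formula corresponds to; with `log2`, as in `cross_entropy` above, the gradient differs by the constant factor $1/\ln 2$. The helper names `softmax_np`, `risk`, and `grad_risk` are hypothetical.

```python
import numpy as np

def softmax_np(X, W):
    scores = X @ W
    scores = scores - scores.max(axis=1, keepdims=True)  # shift for numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

def risk(X, Y, W):
    # average cross-entropy in nats
    return -(Y * np.log(softmax_np(X, W))).sum() / X.shape[0]

def grad_risk(X, Y, W):
    # analytic gradient: -(1/N) * X^T (Y - Y_hat)
    return -X.T @ (Y - softmax_np(X, W)) / X.shape[0]

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 5))
Y = np.eye(3)[rng.integers(0, 3, size=10)]   # random one-hot labels
W = rng.normal(size=(5, 3))

eps = 1e-6
numeric = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        step = np.zeros_like(W)
        step[i, j] = eps
        numeric[i, j] = (risk(X, Y, W + step) - risk(X, Y, W - step)) / (2 * eps)

max_abs_diff = np.abs(numeric - grad_risk(X, Y, W)).max()
print(max_abs_diff)  # should be close to zero if the formula is right
```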

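## 4. Learning Method

- Input: training set X and labels y in `one-hot` form
- Output: optimal parameters $W^*$
- Algorithm
    - Initialize $W_0:=0$ and set the maximum number of iterations $T$
    - Then update the parameters iteratively with learning rate $\eta$ using
    $$
    W_{t+1}:=W_t+\eta\left(\frac{1}{N}\sum_{n=1}^N\mathbf{x^{(n)}(y^{(n)}-\hat{y}^{(n)})}^T\right) 
    $$
    - Stop once the specified number of iterations is reached and set $W^*=W_T$.

- Prediction accuracy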

In [25]:
def precision_rate(X, y, W, X_with_bias=False):
    if X_with_bias:
        hat_X = X
    else:
        hat_X = torch.cat([X, torch.ones(X.shape[0], 1)], axis=1)  # augment X with a bias column
        
    pred_y = hat_y(hat_X, W)
    precision = torch.sum(pred_y[0] == torch.max(y, axis=1)[1]).numpy() / pred_y[0].numel()
    return precision

- Generate simulated data

In [26]:
X = torch.randn(1000, 8)
hat_X = torch.cat([X, torch.ones(X.shape[0], 1)], axis=1)  # augmented features (bias column appended)
true_W = torch.randn(hat_X.shape[1], 5)  # augmented true weights, 5 classes
indices_y, y = hat_y(hat_X, true_W)

In [27]:
train_X, train_y = X[:800], y[:800]
train_indices_y = indices_y[:800]
hat_train_X = hat_X[:800]
test_X, test_y = X[800:], y[800:]
test_indices_y = indices_y[800:]
hat_test_X = hat_X[800:]

### 4.1 Method 1: Gradient descent with manual differentiation

In [32]:
def softmax_sgd(X, y, num_steps=100, lr=0.1):
    '''
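    X: N*a, N samples with a features (the bias column is appended inside the function)
    y: N*C, N one-hot vectors of dimension C
    Returns W: (a+1)*C
    '''
    hat_X = torch.cat([X, torch.ones(X.shape[0], 1)], axis=1)  # augment X with a bias column
    W = torch.randn(hat_X.shape[1], y.shape[1])  # augmented parameter matrix
    for i in range(num_steps):
        W -= lr*grad_crosEnt_W(hat_X, y, W)  # full-batch gradient step
        loss = cross_entropy(hat_X, y, W)
        if (i+1) % 50 == 0:
            print(f'训练{i+1}轮, 交叉熵为{loss:.2f}')  # prints "after {i+1} steps, cross-entropy is {loss}"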
            
    return W

In [33]:
# simulated data
est_W = softmax_sgd(train_X, train_y, num_steps=1000)
precision_rate(train_X, train_y, est_W, X_with_bias=False), precision_rate(test_X, test_y, est_W, X_with_bias=False)

训练50轮, 交叉熵为2.24
训练100轮, 交叉熵为1.44
训练150轮, 交叉熵为1.08
训练200轮, 交叉熵为0.87
训练250轮, 交叉熵为0.74
训练300轮, 交叉熵为0.66
训练350轮, 交叉熵为0.60
训练400轮, 交叉熵为0.56
训练450轮, 交叉熵为0.54
训练500轮, 交叉熵为0.51
训练550轮, 交叉熵为0.49
训练600轮, 交叉熵为0.48
训练650轮, 交叉熵为0.46
训练700轮, 交叉熵为0.45
训练750轮, 交叉熵为0.44
训练800轮, 交叉熵为0.43
训练850轮, 交叉熵为0.42
训练900轮, 交叉熵为0.41
训练950轮, 交叉熵为0.40
训练1000轮, 交叉熵为0.39


(0.94625, 0.915)

### 4.2 Method 2: Stochastic gradient descent with automatic differentiation

In [34]:
def softmax_miniBatch_sgd(X, y, num_epoch=50, batch_size=40, lr=0.05):
    '''
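    X: N*a, N samples with a features (the bias column is appended inside the function)
    y: N*C, N one-hot vectors of dimension C
    Returns the estimated (a+1)*C weight matrix and the final training loss
    '''
    hat_X = torch.cat([X, torch.ones(X.shape[0], 1)], axis=1)  # augment X with a bias column
    W = torch.randn(hat_X.shape[1], y.shape[1])  # augmented parameter matrix
    W.requires_grad_()
    dataset = TensorDataset(hat_X, y)
    data_iter = DataLoader(dataset=dataset, batch_size=batch_size, shuffle=True)
    for epoch in range(num_epoch):
        for t_x, t_y in data_iter:
            l = cross_entropy(t_x, t_y, W)
            l.backward()  # compute the gradient of the loss with respect to W
            # note: cross_entropy already averages over the batch, so the extra
            # division by batch_size only shrinks the effective learning rate
            W.data.sub_(lr*W.grad/batch_size)
            W.grad.data.zero_()

        if (epoch + 1) % 50 == 0:
            with torch.no_grad():  # no gradient tracking needed when only evaluating the loss
                train_l = cross_entropy(hat_X, y, W)  # current loss on the full training set
                print(f'epoch {epoch + 1}, loss: {train_l:.4f}')

    with torch.no_grad():
        train_l = cross_entropy(hat_X, y, W)  # final loss, defined even when num_epoch < 50
    est_W = W.detach()  # tensor sharing W's data, detached from the autograd graph
    return est_W, train_l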

In [35]:
# simulated data
est_W, train_l = softmax_miniBatch_sgd(train_X, train_y, num_epoch=1000, batch_size=40, lr=0.1)
precision_rate(train_X, train_y, est_W), precision_rate(test_X, test_y, est_W)

epoch 50, loss: 2.0593
epoch 100, loss: 1.3058
epoch 150, loss: 0.9535
epoch 200, loss: 0.7821
epoch 250, loss: 0.6903
epoch 300, loss: 0.6345
epoch 350, loss: 0.5967
epoch 400, loss: 0.5684
epoch 450, loss: 0.5459
epoch 500, loss: 0.5271
epoch 550, loss: 0.5110
epoch 600, loss: 0.4968
epoch 650, loss: 0.4841
epoch 700, loss: 0.4727
epoch 750, loss: 0.4623
epoch 800, loss: 0.4527
epoch 850, loss: 0.4439
epoch 900, loss: 0.4357
epoch 950, loss: 0.4281
epoch 1000, loss: 0.4210


(0.94625, 0.93)

### 4.3 Method 3: torch.nn

- Define the class

In [36]:
class SofmaxRegresModel(torch.nn.Module): 
    def __init__(self, dim_in, dim_out):
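        # Locate the parent class of SofmaxRegresModel, torch.nn.Module, and run its
        # __init__() on this instance, i.e. perform the parent-class initialization
        super(SofmaxRegresModel, self).__init__() 
        self.layer1 = torch.nn.Linear(dim_in, dim_out, bias=True)
        
    def forward(self, x):
        y_pred = self.layer1(x)
        # note: torch.nn.CrossEntropyLoss expects raw scores and applies log-softmax itself,
        # so training on this output applies softmax twice
        return torch.nn.functional.softmax(y_pred, dim=1)  # softmax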

- Define the training procedure

In [38]:
X.shape, y.shape

(torch.Size([1000, 8]), torch.Size([1000, 5]))

In [42]:
dim_in = X.shape[1]
dim_out = y.shape[1]
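# instantiate a network
net = SofmaxRegresModel(dim_in, dim_out)
# initialize the weights and bias
net.layer1.weight.data = torch.randn(dim_out, dim_in)
net.layer1.bias.data = torch.randn(dim_out)  # torch.Tensor(dim_out) would leave the bias uninitialized
# loss function
loss = torch.nn.CrossEntropyLoss()
# stochastic gradient descent (the optimizer is created in the next cell)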

In [43]:
trainer = torch.optim.SGD(net.parameters(), lr=0.05)
# load the data
batch_size = 20
num_epochs = 200
dataset = TensorDataset(train_X, train_indices_y)
data_iter = DataLoader(dataset=dataset, batch_size=batch_size, shuffle=True)
# start training
for epoch in range(num_epochs):
    for t_x, t_y in data_iter:
        l = loss(net(t_x), t_y)  # cross-entropy loss on the current batch
        trainer.zero_grad()  # reset the parameter gradients
        l.backward()  # backpropagate to compute the gradients
        trainer.step()  # update the parameters
        
    if (epoch+1) % 20 == 0:
        with torch.no_grad():  # no gradient tracking needed when only evaluating the loss
            l_epoch = loss(net(train_X), train_indices_y) 
            print('epoch {}, loss {}'.format(epoch+1, l_epoch)) 

epoch 20, loss 1.4764677286148071
epoch 40, loss 1.3729668855667114
epoch 60, loss 1.3075109720230103
epoch 80, loss 1.2599815130233765
epoch 100, loss 1.2198939323425293
epoch 120, loss 1.1919025182724
epoch 140, loss 1.1810524463653564
epoch 160, loss 1.1743334531784058
epoch 180, loss 1.1692289113998413
epoch 200, loss 1.165086269378662


- Results

In [44]:
w, b = net.parameters()
W = torch.cat([w.data, b.data.reshape(-1, 1)], axis=1)

pred_train_y = torch.max(net(train_X), axis=1)[1]
pred_test_y = torch.max(net(test_X), axis=1)[1]

print('train', torch.sum(pred_train_y == train_indices_y).numpy() / train_indices_y.numel())
print('test', torch.sum(pred_test_y == test_indices_y).numpy() / test_indices_y.numel())

train 0.79
test 0.79
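
One possible reason this method lags behind the first two: `torch.nn.CrossEntropyLoss` applies log-softmax to its input, while `forward` already returns softmax probabilities. Probabilities lie in $[0, 1]$, so when they are treated as scores the loss cannot fall below $\log(1 + (C-1)/e)$, which is about 0.90 for $C=5$. The loss printed above plateaus near 1.17, above that floor. The NumPy sketch below illustrates the floor; `ce_from_logits` is a hypothetical helper that mimics the loss for a single sample.

```python
import numpy as np

def ce_from_logits(logits, label):
    """Cross-entropy (nats) of one sample: log-softmax of `logits`, then pick `label`."""
    shifted = logits - logits.max()
    return -(shifted[label] - np.log(np.exp(shifted).sum()))

C = 5
perfect_probs = np.eye(C)[0]                          # probability 1 on the true class 0
loss_if_probs_fed = ce_from_logits(perfect_probs, 0)  # probabilities treated as scores
floor = np.log(1 + (C - 1) / np.e)                    # lower bound in that setting

confident_logits = np.array([20.0, 0.0, 0.0, 0.0, 0.0])
loss_if_logits_fed = ce_from_logits(confident_logits, 0)  # raw scores can drive the loss toward 0

print(loss_if_probs_fed, floor, loss_if_logits_fed)
```

Under this reading, returning `self.layer1(x)` from `forward` during training, and applying softmax only when probabilities are needed, would remove the floor. This is a suggestion rather than something the original notebook tests.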


## 5. Case Study

The Iris dataset

In [45]:
from sklearn import datasets
d = datasets.load_iris()

In [46]:
x_labels, y_labels = d['feature_names'], d['target_names']

In [47]:
x_labels, y_labels

(['sepal length (cm)',
  'sepal width (cm)',
  'petal length (cm)',
  'petal width (cm)'],
 array(['setosa', 'versicolor', 'virginica'], dtype='<U10'))

In [48]:
rand_idx = np.arange(len(d['target']))
np.random.shuffle(rand_idx)
t_idx = rand_idx[:100]
v_idx = rand_idx[100:]
x_train, y_train = torch.from_numpy(d['data'][t_idx]).type(torch.FloatTensor), torch.from_numpy(d['target'][t_idx])
x_valid, y_valid = torch.from_numpy(d['data'][v_idx]).type(torch.FloatTensor), torch.from_numpy(d['target'][v_idx])
onehot_y_train = torch.zeros(x_train.shape[0], 3)
onehot_y_train[torch.arange(x_train.shape[0]), y_train] = 1
onehot_y_valid = torch.zeros(x_valid.shape[0], 3)
onehot_y_valid[torch.arange(x_valid.shape[0]), y_valid] = 1

In [49]:
y_train

tensor([1, 0, 2, 0, 2, 1, 1, 1, 0, 1, 0, 1, 1, 0, 2, 0, 2, 0, 0, 0, 2, 2, 0, 0,
        0, 2, 2, 2, 2, 0, 1, 2, 0, 2, 1, 2, 1, 1, 0, 2, 1, 0, 0, 2, 1, 2, 2, 2,
        2, 2, 0, 2, 1, 2, 0, 2, 0, 1, 0, 1, 2, 1, 0, 1, 2, 1, 0, 2, 0, 1, 1, 1,
        2, 2, 1, 1, 0, 2, 0, 2, 0, 0, 1, 0, 0, 0, 0, 1, 1, 2, 2, 0, 1, 2, 0, 2,
        1, 1, 0, 1])

### 5.1 Method 1

In [50]:
est_W = softmax_sgd(x_train, onehot_y_train, num_steps=1000)
print(f"Train accuracy rate: {precision_rate(x_train, onehot_y_train, est_W)}")
print(f"Valid accuracy rate: {precision_rate(x_valid, onehot_y_valid, est_W)}")

训练50轮, 交叉熵为0.72
训练100轮, 交叉熵为0.55
训练150轮, 交叉熵为0.42
训练200轮, 交叉熵为0.37
训练250轮, 交叉熵为0.34
训练300轮, 交叉熵为0.32
训练350轮, 交叉熵为0.30
训练400轮, 交叉熵为0.28
训练450轮, 交叉熵为0.27
训练500轮, 交叉熵为0.25
训练550轮, 交叉熵为0.24
训练600轮, 交叉熵为0.23
训练650轮, 交叉熵为0.23
训练700轮, 交叉熵为0.22
训练750轮, 交叉熵为0.21
训练800轮, 交叉熵为0.21
训练850轮, 交叉熵为0.20
训练900轮, 交叉熵为0.20
训练950轮, 交叉熵为0.19
训练1000轮, 交叉熵为0.19
Train accuracy rate: 0.98
Valid accuracy rate: 1.0


### 5.2 Method 2

In [51]:
# Iris
est_W, _ = softmax_miniBatch_sgd(x_train, onehot_y_train, num_epoch=1000, batch_size=40, lr=0.1)
print(f"Train accuracy rate: {precision_rate(x_train, onehot_y_train, est_W)}")
print(f"Valid accuracy rate: {precision_rate(x_valid, onehot_y_valid, est_W)}")

epoch 50, loss: 4.2735
epoch 100, loss: 1.6804
epoch 150, loss: 1.2477
epoch 200, loss: 1.0474
epoch 250, loss: 0.9341
epoch 300, loss: 0.8604
epoch 350, loss: 0.8079
epoch 400, loss: 0.7672
epoch 450, loss: 0.7343
epoch 500, loss: 0.7063
epoch 550, loss: 0.6821
epoch 600, loss: 0.6610
epoch 650, loss: 0.6417
epoch 700, loss: 0.6245
epoch 750, loss: 0.6090
epoch 800, loss: 0.5947
epoch 850, loss: 0.5806
epoch 900, loss: 0.5679
epoch 950, loss: 0.5561
epoch 1000, loss: 0.5455
Train accuracy rate: 0.93
Valid accuracy rate: 0.94


### 5.3 Method 3

In [52]:
# instantiate the network
dim_in = 4  # number of features
dim_out = 3  # number of classes
net = SofmaxRegresModel(dim_in, dim_out)
# initialize the weights and bias
net.layer1.weight.data = torch.randn(dim_out, dim_in)
net.layer1.bias.data = torch.randn(dim_out)
# loss function
loss = torch.nn.CrossEntropyLoss()
# stochastic gradient descent
trainer = torch.optim.SGD(net.parameters(), lr=0.04)
# load the data
batch_size = 20
num_epochs = 2000
dataset = TensorDataset(x_train, y_train)
data_iter = DataLoader(dataset=dataset, batch_size=batch_size, shuffle=True)

# training
for epoch in range(num_epochs):
    for t_x, t_y in data_iter:
        l = loss(net(t_x), t_y)  # cross-entropy loss on the current batch
        trainer.zero_grad()  # reset the parameter gradients
        l.backward()  # backpropagate to compute the gradients
        trainer.step()  # update the parameters
    if (epoch+1) % 50 == 0:
        with torch.no_grad():  # no gradient tracking needed when only evaluating the loss
            l_epoch = loss(net(x_train), y_train) 
            print('epoch {}, loss {}'.format(epoch+1, l_epoch)) 

epoch 50, loss 0.9097030162811279
epoch 100, loss 0.903800904750824
epoch 150, loss 0.9007967114448547
epoch 200, loss 0.8989772200584412
epoch 250, loss 0.8977570533752441
epoch 300, loss 0.8968818187713623
epoch 350, loss 0.896223247051239
epoch 400, loss 0.8957096338272095
epoch 450, loss 0.8952977657318115
epoch 500, loss 0.8949601054191589
epoch 550, loss 0.8946780562400818
epoch 600, loss 0.8944389224052429
epoch 650, loss 0.8942336440086365
epoch 700, loss 0.8940551280975342
epoch 750, loss 0.8938988447189331
epoch 800, loss 0.8937607407569885
epoch 850, loss 0.8936377167701721
epoch 900, loss 0.8935274481773376
epoch 950, loss 0.8934279084205627
epoch 1000, loss 0.8933377861976624
epoch 1050, loss 0.8932558298110962
epoch 1100, loss 0.893180787563324
epoch 1150, loss 0.8931118249893188
epoch 1200, loss 0.8930484652519226
epoch 1250, loss 0.8929897546768188
epoch 1300, loss 0.8929353952407837
epoch 1350, loss 0.8928847312927246
epoch 1400, loss 0.8928375840187073
epoch 1450, los

In [53]:
w, b = net.parameters()
W = torch.cat([w.data, b.data.reshape(-1, 1)], axis=1)

In [None]:
pred_y_train = torch.max(net(x_train), axis=1)[1]
pred_y_valid = torch.max(net(x_valid), axis=1)[1]

In [None]:
pred_y_train

In [None]:
torch.sum(pred_y_train == y_train).numpy() / y_train.numel()

In [None]:
torch.sum(pred_y_valid == y_valid).numpy() / y_valid.numel()

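## Appendix. Entropy-Related Concepts

The amount of information a message carries is closely tied to its uncertainty. If a statement needs a lot of outside information before it can be confirmed, we say it carries a large amount of information. For example, on hearing "it snowed in Xishuangbanna, Yunnan", you would have to check the weather forecast, ask local residents, and so on (because it has never snowed in Xishuangbanna). By contrast, "people eat three meals a day" carries very little information: it is highly certain, and little additional information is needed to verify it. The information content of an event $x_0$ can therefore be expressed as: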
$$
I(x_0)=-\log p(x_0)
$$
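
As a small illustration of $I(x_0)=-\log p(x_0)$, the sketch below evaluates the information content of a few events in bits. The helper name `self_information` and the example probabilities are assumptions made for the example.

```python
import math

def self_information(p, base=2):
    """Information content -log p of an event with probability p (bits when base=2)."""
    return -math.log(p, base)

print(self_information(0.5))     # a fair coin flip
print(self_information(1 / 6))   # one face of a fair die
print(self_information(0.999))   # a nearly certain event
```

The less probable the event, the larger its information content; a nearly certain event carries almost none.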

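### A1. Entropy
Information content is defined for a single event, but in practice a situation has many possible outcomes: a die roll has 6 possible results, and tomorrow's weather may be sunny, cloudy, rainy, and so on. We therefore need to take all possible outcomes into account.

Entropy is a measure of the uncertainty of a random variable: it is the expected amount of information needed to describe all of its possible outcomes.

Let $X$ be a random variable taking finitely many values, with probability distribution
$$
P(X=x_i)=p_i,i=1,2,...,n
$$
Entropy is defined as
$$
H(X)=\sum_{i=1}^n p(x_i) I(x_i)=-\sum_{i=1}^n p(x_i) \log p(x_i)
$$
In the expression above, if $p_i=0$ we define $0\log 0=0$. The logarithm is taken to base 2 or base e, in which case the unit of entropy is the bit or the nat, respectively. Entropy depends only on the distribution of $X$, not on the values it takes, so the entropy of $X$ can also be written as $H(p)$, that is,
$$
H(p)=-\sum_{i=1}^n p_i \log p_i
$$
The larger the entropy, the greater the uncertainty.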

In [8]:
def entropy(P):
    '''
    P is a probability distribution
    '''
    return -np.sum([p*np.log2(p) if p > 0 else 0 for p in P])

In [9]:
P1 = np.ones(10) / 10
P2 = np.zeros(10)
P2[3] = 1
entropy(P1), entropy(P2)

(3.321928094887362, -0.0)

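### A2. Conditional Entropy
Conditional entropy measures the uncertainty of a random variable $Y$ given that the random variable $X$ is known.

$$
H(Y|X)=\sum_{i=1}^n P(X=x_i)H(Y|X=x_i)
$$

where $H(Y|X=x_i)=-\sum_j P(Y=y_j|X=x_i)\log P(Y=y_j|X=x_i)$ is the uncertainty of $Y$ when $X=x_i$, and $P(Y=y_j|X=x_i) = \frac{P(X=x_i, Y=y_j)}{P(X=x_i)}$.
> If $X$ and $Y$ are independent, then $H(Y|X)=H(Y)$; if $Y$ is uniquely determined by $X$, then $H(Y|X)=0$.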

In [10]:
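def conditional_entropy(P_XY):
    '''
    P_XY is the joint probability distribution of X and Y, with shape (x_size, y_size)
    '''
    # weight the entropy of each conditional distribution P(Y|X=x_i) by P(X=x_i)
    return np.sum([np.sum(P_XY[i]) * entropy(P_XY[i, :]/np.sum(P_XY[i])) 
                   for i in range(P_XY.shape[0])])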
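
The two limiting cases quoted above can be checked on small joint tables. The sketch below is self-contained and uses its own helper names (`entropy_bits`, `conditional_entropy_bits`), which are assumptions for the example rather than the notebook's functions; the joint distributions are illustrative.

```python
import numpy as np

def entropy_bits(P):
    """Entropy in bits, with 0 log 0 treated as 0."""
    P = np.asarray(P, dtype=float)
    nz = P[P > 0]
    return float(-(nz * np.log2(nz)).sum())

def conditional_entropy_bits(P_XY):
    """H(Y|X) for a joint table with rows indexed by x and columns by y."""
    P_XY = np.asarray(P_XY, dtype=float)
    total = 0.0
    for row in P_XY:
        p_x = row.sum()
        if p_x > 0:                      # skip values of x that never occur
            total += p_x * entropy_bits(row / p_x)
    return total

independent = np.outer([0.5, 0.5], [0.25, 0.75])   # P(x, y) = P(x) P(y)
deterministic = np.array([[0.5, 0.0],
                          [0.0, 0.5]])             # y is fixed once x is known

print(conditional_entropy_bits(independent), entropy_bits([0.25, 0.75]))
print(conditional_entropy_bits(deterministic))
```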

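### A3. KL Divergence (Relative Entropy)

Relative entropy, or Kullback-Leibler (KL) divergence, measures how a probability distribution $p(x)$ differs from another probability distribution $q(x)$:

$$
\text{KL(p||q)}=-\sum_x p(x)\log\frac{q(x)}{p(x)}
$$

By Jensen's inequality, $\text{KL(p||q)}\geq 0$, with equality if and only if $p(x)=q(x)$ for all $x$.

Note also that, in general, $\text{KL(p||q)}\neq \text{KL(q||p)}$.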

In [11]:
def KL(p_x, q_x):
    # p(x) = 0 contributes 0; p(x) > 0 with q(x) = 0 makes the divergence +inf
    return np.sum([0.0 if p == 0 else (np.inf if q == 0 else p*np.log(p/q))
                   for p, q in zip(p_x, q_x)])

In [12]:
KL(P1, P2), KL(P1, P1)

(inf, 0.0)
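
The asymmetry of the KL divergence is easy to see on a pair of strictly positive distributions. This is a minimal sketch with illustrative numbers; `kl_divergence` is a hypothetical helper that assumes $q(x)>0$ wherever $p(x)>0$.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p||q) in nats, assuming q > 0 wherever p > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                          # terms with p(x) = 0 contribute nothing
    return float((p[mask] * np.log(p[mask] / q[mask])).sum())

p = np.array([0.8, 0.1, 0.1])
q = np.array([1/3, 1/3, 1/3])

print(kl_divergence(p, q), kl_divergence(q, p))  # two different non-negative values
```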

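### A4. Cross-Entropy

Cross-entropy is defined as follows:

$$
\text{crossEntropy(p(x), q(x))} = -\sum_x p(x)\log q(x)
$$

- Relation to KL divergence
$$
\begin{aligned}
\text{KL(p||q)} &= -\sum_x p(x)\log\frac{q(x)}{p(x)}\\
&= -\sum_x p(x)\log q(x) + \sum_x p(x)\log p(x) \\
&= \text{crossEntropy(p(x), q(x))} - H\left(p(x)\right)
\end{aligned}
$$

That is, $\text{crossEntropy(p(x), q(x))} = \text{KL(p||q)} + H\left(p(x)\right)$.

Since $H\left(p(x)\right)$ is fixed, minimizing the cross-entropy with respect to $q$ is equivalent to minimizing `KL(p||q)`, that is, minimizing the difference between the theoretical distribution and the sampling distribution.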

In [13]:
def cross_entropy(p_x, q_x):
    # p(x) = 0 contributes 0; p(x) > 0 with q(x) = 0 makes the cross-entropy +inf
    return -np.sum([0.0 if p == 0 else (-np.inf if q == 0 else p*np.log(q))
                    for p, q in zip(p_x, q_x)])

In [14]:
cross_entropy(P1, P2), cross_entropy(P1, P1), cross_entropy(P2, P2)

(inf, 2.3025850929940455, -0.0)
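
The identity $\text{crossEntropy}(p, q) = \text{KL}(p\|q) + H(p)$ can be verified numerically, provided all three quantities use the same logarithm base (natural log here). The distributions are illustrative.

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

entropy_p = -(p * np.log(p)).sum()         # H(p) in nats
cross_entropy_pq = -(p * np.log(q)).sum()  # crossEntropy(p, q) in nats
kl_pq = (p * np.log(p / q)).sum()          # KL(p||q) in nats

print(cross_entropy_pq, kl_pq + entropy_p)  # the two values should agree
```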

## References
1. Li Hang. Statistical Learning Methods (统计学习方法). 2017.
2. Qiu Xipeng. Neural Networks and Machine Learning (神经网络与机器学习). 2020.