# 5. `softmax` Regression

## 1. Overview

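Softmax regression, also called multinomial or multi-class logistic regression, generalizes logistic regression to multi-class classification.

In a multi-class problem, the class label $y \in \{1, 2, \dots, C\}$ takes one of $C$ values. Given a sample $\mathbf{x}$, softmax regression predicts the conditional probability of class $c$ as
$$
\begin{aligned}
p(y=c|\mathbf{x})&=\mathrm{softmax}(\mathbf{w}_c^T\mathbf{x})\\
&=\frac{\exp(\mathbf{w}_c^T\mathbf{x})}{\sum_{i=1}^C \exp(\mathbf{w}_i^T\mathbf{x})}
\end{aligned}
$$

where $\mathbf{w}_i$ is the weight vector of class $i$.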

In [1]:
%matplotlib inline
from IPython import display
from torch.utils.data import TensorDataset, DataLoader
import torch
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

In [2]:
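def softmax(X, W):
    """
    X: torch.FloatTensor, N*a, where N is the number of samples and a is the feature dimension
    W: torch.FloatTensor, a*C, where C is the number of classes
    """
    exp_scores = torch.exp(X @ W)  # unnormalized scores, N*C
    return exp_scores / torch.sum(exp_scores, dim=1, keepdim=True)  # normalize each row into a probability distribution over the classes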

In [3]:
X = torch.randn(10, 5)
W = torch.randn(5, 3)
softmax(X, W)

tensor([[1.8991e-01, 1.9777e-01, 6.1233e-01],
        [9.2654e-02, 8.4814e-01, 5.9211e-02],
        [8.9445e-02, 2.4870e-02, 8.8568e-01],
        [4.5190e-01, 4.3690e-01, 1.1120e-01],
        [1.8212e-02, 1.6665e-04, 9.8162e-01],
        [1.5361e-01, 8.3875e-01, 7.6353e-03],
        [3.4259e-01, 7.8311e-02, 5.7910e-01],
        [1.8198e-01, 8.0394e-01, 1.4078e-02],
        [1.5031e-01, 6.8826e-03, 8.4280e-01],
        [2.7072e-01, 1.2412e-01, 6.0516e-01]])
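Exponentiating the raw scores can overflow when the scores are large. A minimal sketch of a common remedy, assuming hypothetical names `stable_softmax`, `X_demo` and `W_demo` that are not part of the notebook: subtracting each row's maximum before `exp` leaves the result unchanged mathematically but keeps the exponentials bounded.

```python
import torch

def stable_softmax(X, W):
    """Row-wise softmax of X @ W, shifted by the row maximum for numerical stability."""
    logits = X @ W                                              # N*C scores
    shifted = logits - logits.max(dim=1, keepdim=True).values   # largest entry of each row becomes 0
    exp_shifted = torch.exp(shifted)
    return exp_shifted / exp_shifted.sum(dim=1, keepdim=True)

torch.manual_seed(0)
X_demo = torch.randn(10, 5)   # hypothetical samples
W_demo = torch.randn(5, 3)    # hypothetical weights
probs = stable_softmax(X_demo, W_demo)

# Every row sums to 1 and agrees with PyTorch's built-in softmax.
print(torch.allclose(probs.sum(dim=1), torch.ones(10)))
print(torch.allclose(probs, torch.softmax(X_demo @ W_demo, dim=1), atol=1e-6))

# Scores of 5000 overflow exp() in a naive implementation but stay finite here.
big = 1000 * torch.ones(1, 5)
probs_big = stable_softmax(big, torch.ones(5, 3))
print(torch.isfinite(probs_big).all())
```

The shift cancels in the ratio because numerator and denominator are both multiplied by the same factor $\exp(-\max_i z_i)$.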

## 2. Decision Function of Softmax Regression

The decision function of softmax regression can be written as

$$
\begin{aligned}
\hat{y}&=\arg\max_{c=1}^{C}p(y=c|\mathbf{x})\\
&=\arg\max_{c=1}^{C}\mathbf{w}_c^T\mathbf{x}
\end{aligned}
$$

In [4]:
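def hat_y(X, W):
    S = softmax(X, W)  # probability of each class for each sample
    max_indices = torch.max(S, dim=1)[1]  # index of the most probable class
    pred_y = torch.zeros_like(S)
    pred_y[torch.arange(S.shape[0]), max_indices] = 1  # one-hot encoding of the prediction
    return max_indices, pred_y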

In [5]:
pred_y = hat_y(X, W)

In [6]:
pred_y[0]

tensor([2, 1, 2, 0, 2, 1, 2, 1, 2, 2])
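The second equality in the decision function holds because softmax preserves the ordering of the scores, so the normalization can be skipped at prediction time. A small sketch of that check, assuming hypothetical random data `X_demo` and weights `W_demo`:

```python
import torch

torch.manual_seed(0)
X_demo = torch.randn(10, 5)   # hypothetical samples
W_demo = torch.randn(5, 3)    # hypothetical weights

logits = X_demo @ W_demo                 # w_c^T x for every sample and class
probs = torch.softmax(logits, dim=1)     # p(y=c|x)

labels_from_probs = probs.argmax(dim=1)
labels_from_logits = logits.argmax(dim=1)

# Both routes pick the same class for every sample.
print(torch.equal(labels_from_probs, labels_from_logits))

# One-hot form of the prediction, analogous to the second return value of hat_y.
one_hot_pred = torch.nn.functional.one_hot(labels_from_logits, num_classes=3)
print(one_hot_pred.shape)
```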

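- Relation to `Logistic` regression. When the number of classes is $C=2$, with the two classes labelled 0 and 1, the decision function of softmax regression becomes
$$
\begin{aligned}
\hat{y}&=\arg\max_{c\in\{0,1\}}p(y=c|\mathbf{x})\\
&=\arg\max_{c\in\{0,1\}}\mathbf{w}_c^T\mathbf{x}\\
&=I\left((\mathbf{w}_1-\mathbf{w}_0)^T\mathbf{x}>0\right)
\end{aligned}
$$
where $I(\cdot)$ is the indicator function.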
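The reduction to logistic regression can be checked numerically. The sketch below assumes the two classes are labelled 0 and 1 and uses hypothetical random data `X_demo` and weights `W_demo`; it compares the softmax probability of class 1 with a sigmoid applied to the difference vector $\mathbf{w}_1-\mathbf{w}_0$.

```python
import torch

torch.manual_seed(0)
X_demo = torch.randn(6, 4)    # hypothetical samples
W_demo = torch.randn(4, 2)    # columns play the roles of w_0 and w_1

# Softmax probability of class 1
p1_softmax = torch.softmax(X_demo @ W_demo, dim=1)[:, 1]

# Logistic regression with the single weight vector w_1 - w_0
w_diff = W_demo[:, 1] - W_demo[:, 0]
p1_sigmoid = torch.sigmoid(X_demo @ w_diff)
print(torch.allclose(p1_softmax, p1_sigmoid, atol=1e-6))

# The decision is the indicator I((w_1 - w_0)^T x > 0)
pred_softmax = (X_demo @ W_demo).argmax(dim=1)
pred_indicator = (X_demo @ w_diff > 0).long()
print(torch.equal(pred_softmax, pred_indicator))
```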

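## 3. Learning Criterion

Given $N$ training samples, softmax regression learns the optimal parameter matrix $\mathbf{W}$ by minimizing the cross-entropy loss. For convenience, class labels are represented as $C$-dimensional `one-hot` vectors; the vector for class $c$ is
$$
\mathbf{y} = [I(1=c), I(2=c), \dots, I(C=c)]^T
$$

With the cross-entropy loss, the risk function of the softmax regression model is
$$
\begin{aligned}
R(\mathbf{W})&=-\frac{1}{N}\sum_{n=1}^N\sum_{c=1}^{C}y_c^{(n)}\log \hat{y}_c^{(n)}\\
&=-\frac{1}{N}\sum_{n=1}^N(\mathbf{y}^{(n)})^T\log \hat{\mathbf{y}}^{(n)}
\end{aligned}
$$
where $\hat{\mathbf{y}}^{(n)}=\mathrm{softmax}(\mathbf{W}^T\mathbf{x}^{(n)})$ is the vector of posterior probabilities of sample $\mathbf{x}^{(n)}$ over all classes.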

In [7]:
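def cross_entropy(X, y, W):
    """
    X: N*(a+1), N samples with a features plus one bias dimension
    y: N*C, N one-hot vectors of dimension C
    W: (a+1)*C
    """
    p_y = softmax(X, W)  # N*C, posterior probabilities of the N samples over the C classes
    # Flatten both tensors and take their dot product. Note that log2 reports the loss
    # in bits, i.e. the natural-log risk R(W) divided by ln 2.
    crossEnt = -torch.dot(y.reshape(-1), torch.log2(p_y).reshape(-1)) / y.shape[0]
    return crossEnt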

In [8]:
X = torch.randn(10, 5)
y = torch.zeros(10, 3)
y[torch.arange(10), torch.randint(low=0, high=y.shape[1], size=(10,))] = 1  # `high` is exclusive, so this samples from all C classes
W = torch.randn(5, 3)

In [9]:
cross_entropy(X, y, W)

tensor(2.0838)

In [10]:
prob_y = softmax(X, W)

In [11]:
-torch.dot(y.reshape(-1), torch.log2(prob_y).reshape(-1)) / y.shape[0]

tensor(2.0838)
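The `cross_entropy` function above uses `torch.log2`, so its value is measured in bits, whereas PyTorch's built-in cross-entropy uses the natural logarithm. A short sketch of how the two relate, assuming hypothetical random data (`X_demo`, `W_demo`, `labels`) rather than the tensors from the cells above:

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, a, C = 10, 5, 3
X_demo = torch.randn(N, a)                          # hypothetical samples
W_demo = torch.randn(a, C)                          # hypothetical weights
labels = torch.randint(low=0, high=C, size=(N,))    # integer class labels
Y_demo = F.one_hot(labels, num_classes=C).float()   # N*C one-hot labels

logits = X_demo @ W_demo
log_probs = F.log_softmax(logits, dim=1)            # natural log of the softmax

# Risk in nats, written exactly like the formula: -(1/N) * sum_n y^T log(hat_y)
ce_nats = -(Y_demo * log_probs).sum(dim=1).mean()
ce_builtin = F.cross_entropy(logits, labels)        # expects logits and integer labels
print(torch.allclose(ce_nats, ce_builtin, atol=1e-6))

# The base-2 version differs only by the constant factor 1 / ln 2.
ce_log2 = -(Y_demo * torch.log2(torch.softmax(logits, dim=1))).sum() / N
ce_bits = ce_nats / math.log(2)
print(torch.allclose(ce_log2, ce_bits, atol=1e-5))
```

Because the two versions differ by a constant factor, they share the same minimizer; only the reported loss values and the effective step size change.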

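- The gradient of the risk function $R(\mathbf{W})$ with respect to $\mathbf{W}$ is
$$
\frac{\partial R(\mathbf{W})}{\partial \mathbf{W}}=-\frac{1}{N}\sum_{n=1}^N\mathbf{x}^{(n)}\left(\mathbf{y}^{(n)}-\hat{\mathbf{y}}^{(n)}\right)^T
$$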

In [13]:
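def grad_crosEnt_W(X, y, W):
    '''
    X: N*(a+1), N samples with a features plus one bias dimension
    y: N*C, N one-hot vectors of dimension C
    W: (a+1)*C
    Returns X^T (y - hat_y) / N, which is the negative gradient of R(W), i.e. the descent direction
    '''
    probs = softmax(X, W)  # renamed so it does not shadow the hat_y function
    a = (X.t() @ (y - probs)) / y.shape[0]  # (a+1)*N times N*C
    return a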

In [14]:
grad_crosEnt_W(X, y, W)

tensor([[-1.2208e-01,  1.9918e-01, -7.7093e-02],
        [-1.8273e-01,  1.5876e-02,  1.6686e-01],
        [-3.1105e-01,  1.7246e-01,  1.3859e-01],
        [ 1.4861e-02,  2.1199e-01, -2.2685e-01],
        [-5.2817e-05,  2.6907e-02, -2.6854e-02]])
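The closed-form gradient can be compared against automatic differentiation. The sketch below assumes the natural logarithm in the risk (as in the formula) and hypothetical random data; note that `grad_crosEnt_W` above returns `X.t() @ (y - hat_y) / N`, which is the negative of the gradient checked here.

```python
import torch

torch.manual_seed(0)
N, a, C = 10, 5, 3
X_demo = torch.randn(N, a)                                        # hypothetical samples
labels = torch.randint(low=0, high=C, size=(N,))
Y_demo = torch.nn.functional.one_hot(labels, num_classes=C).float()
W_demo = torch.randn(a, C, requires_grad=True)                    # hypothetical weights

# Risk R(W) with the natural logarithm, differentiated by autograd
log_probs = torch.log_softmax(X_demo @ W_demo, dim=1)
risk = -(Y_demo * log_probs).sum() / N
risk.backward()

# Closed-form gradient: -(1/N) * X^T (Y - hat_Y)
with torch.no_grad():
    probs = torch.softmax(X_demo @ W_demo, dim=1)
    analytic_grad = -(X_demo.t() @ (Y_demo - probs)) / N

print(torch.allclose(W_demo.grad, analytic_grad, atol=1e-6))
```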

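## 4. Learning Method

With gradient descent, softmax regression is trained as follows.

Initialize $\mathbf{W}_0:=0$ (the implementations below start from a random $\mathbf{W}$ instead), then update the parameters iteratively with
$$
\mathbf{W}_{t+1}:=\mathbf{W}_t+\eta\left(\frac{1}{N}\sum_{n=1}^N\mathbf{x}^{(n)}\left(\mathbf{y}^{(n)}-\hat{\mathbf{y}}^{(n)}\right)^T\right)
$$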
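For reference, the sum over samples in the update can be written as a single matrix product, which is the form a vectorized implementation computes. Assuming $\mathbf{X}\in\mathbb{R}^{N\times(a+1)}$ stacks the augmented samples as rows and $\mathbf{Y},\hat{\mathbf{Y}}\in\mathbb{R}^{N\times C}$ stack the one-hot labels and predicted probabilities as rows (this matrix notation is introduced here for illustration):
$$
\begin{aligned}
\sum_{n=1}^N\mathbf{x}^{(n)}\left(\mathbf{y}^{(n)}-\hat{\mathbf{y}}^{(n)}\right)^T&=\mathbf{X}^T(\mathbf{Y}-\hat{\mathbf{Y}})\\
\mathbf{W}_{t+1}&:=\mathbf{W}_t+\frac{\eta}{N}\mathbf{X}^T(\mathbf{Y}-\hat{\mathbf{Y}}_t)
\end{aligned}
$$
where $\hat{\mathbf{Y}}_t$ holds the probabilities computed with $\mathbf{W}_t$.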

- Prediction accuracy

In [465]:
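def precision_rate(X, y, W, X_with_bias=False):
    """Fraction of samples whose predicted class matches the one-hot label y, i.e. the accuracy."""
    W = torch.as_tensor(W)  # also accept a NumPy weight matrix
    if X_with_bias:
        hat_X = X
    else:
        hat_X = torch.cat([X, torch.ones(X.shape[0], 1)], axis=1)  # augment X with a bias column

    pred_y = hat_y(hat_X, W)
    precision = torch.sum(pred_y[0] == torch.max(y, axis=1)[1]).numpy() / pred_y[0].numel()
    return precision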

- Generate simulated data

In [158]:
X = torch.randn(1000, 8)
hat_X = torch.cat([X, torch.ones(X.shape[0], 1)], axis=1)  # augment X with a bias column
true_W = torch.randn(hat_X.shape[1], 5)  # augmented weight matrix for 5 classes
indices_y, y = hat_y(hat_X, true_W)

In [161]:
train_X, train_y = X[:800], y[:800]
train_indices_y = indices_y[:800]
hat_train_X = hat_X[:800]
test_X, test_y = X[800:], y[800:]
test_indices_y = indices_y[800:]
hat_test_X = hat_X[800:]

### 4.1 Method 1: Gradient Descent with Manual Differentiation

In [115]:
def softmax_sgd(X, y, num_steps=100, lr=0.1):
    '''
    X: N*a, N samples with a features (the bias column is appended inside)
    y: N*C, N one-hot vectors of dimension C
    Returns W: (a+1)*C
    '''
    hat_X = torch.cat([X, torch.ones(X.shape[0], 1)], axis=1)  # augment X with a bias column
    W = torch.randn(hat_X.shape[1], y.shape[1])  # augmented parameter matrix
    for i in range(num_steps):
        W += lr*grad_crosEnt_W(hat_X, y, W)  # grad_crosEnt_W returns the descent direction
        loss = cross_entropy(hat_X, y, W)
        if (i+1) % 50 == 0:
            print(f'step {i+1}, cross-entropy {loss:.2f}')

    return W

In [466]:
# Simulated data
est_W = softmax_sgd(train_X, train_y, num_steps=1000)
precision_rate(train_X, train_y, est_W, X_with_bias=False), precision_rate(test_X, test_y, est_W, X_with_bias=False)

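step 50, cross-entropy 2.00
step 100, cross-entropy 1.09
step 150, cross-entropy 0.77
step 200, cross-entropy 0.64
step 250, cross-entropy 0.56
step 300, cross-entropy 0.52
step 350, cross-entropy 0.48
step 400, cross-entropy 0.46
step 450, cross-entropy 0.44
step 500, cross-entropy 0.42
step 550, cross-entropy 0.41
step 600, cross-entropy 0.39
step 650, cross-entropy 0.38
step 700, cross-entropy 0.37
step 750, cross-entropy 0.36
step 800, cross-entropy 0.36
step 850, cross-entropy 0.35
step 900, cross-entropy 0.34
step 950, cross-entropy 0.33
step 1000, cross-entropy 0.33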


### 4.2 Method 2: Mini-Batch Stochastic Gradient Descent with Automatic Differentiation

In [137]:
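def softmax_miniBatch_sgd(X, y, num_epoch=50, batch_size=40, lr=0.05):
    '''
    X: N*a, N samples with a features (the bias column is appended inside)
    y: N*C, N one-hot vectors of dimension C
    Returns the estimated (a+1)*C weight matrix and the final training loss
    '''
    hat_X = torch.cat([X, torch.ones(X.shape[0], 1)], axis=1)  # augment X with a bias column
    W = torch.randn(hat_X.shape[1], y.shape[1])  # augmented parameter matrix
    W.requires_grad_()
    dataset = TensorDataset(hat_X, y)
    data_iter = DataLoader(dataset=dataset, batch_size=batch_size, shuffle=True)
    for epoch in range(num_epoch):
        for t_x, t_y in data_iter:
            l = cross_entropy(t_x, t_y, W)
            l.backward()  # gradient of the batch loss with respect to W
            # The loss is already a batch mean, so the extra division by batch_size
            # only shrinks the effective learning rate to lr / batch_size.
            W.data.sub_(lr*W.grad/batch_size)
            W.grad.data.zero_()

        if (epoch + 1) % 50 == 0:
            with torch.no_grad():  # no gradient tracking is needed for reporting
                train_l = cross_entropy(hat_X, y, W)  # loss on the full training set
                print(f'epoch {epoch + 1}, loss: {train_l:.4f}')

    # Compute the return values after the loop so they exist even when num_epoch < 50
    with torch.no_grad():
        train_l = cross_entropy(hat_X, y, W)
    est_W = W.detach()  # tensor with the same data as W, detached from the autograd graph
    return est_W, train_l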

In [488]:
# Simulated data
est_W, train_l = softmax_miniBatch_sgd(train_X, train_y, num_epoch=1000, batch_size=40, lr=0.1)
precision_rate(train_X, train_y, est_W), precision_rate(test_X, test_y, est_W)

epoch 50, loss: 2.3773
epoch 100, loss: 1.2608
epoch 150, loss: 0.8993
epoch 200, loss: 0.7505
epoch 250, loss: 0.6676
epoch 300, loss: 0.6123
epoch 350, loss: 0.5716
epoch 400, loss: 0.5399
epoch 450, loss: 0.5142
epoch 500, loss: 0.4928
epoch 550, loss: 0.4746
epoch 600, loss: 0.4589
epoch 650, loss: 0.4450
epoch 700, loss: 0.4327
epoch 750, loss: 0.4217
epoch 800, loss: 0.4117
epoch 850, loss: 0.4026
epoch 900, loss: 0.3942
epoch 950, loss: 0.3864
epoch 1000, loss: 0.3792


(0.945, 0.96)
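The parameter update can also be written with `torch.no_grad()` instead of going through `.data`. A minimal sketch of that idiom, assuming hypothetical synthetic data without a bias term, integer labels, and PyTorch's built-in cross-entropy (natural logarithm) in place of the notebook's `cross_entropy`:

```python
import torch
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader

torch.manual_seed(0)
N, a, C = 200, 4, 3
X_demo = torch.randn(N, a)                    # hypothetical samples
true_W_demo = torch.randn(a, C)               # hypothetical generating weights
labels = (X_demo @ true_W_demo).argmax(dim=1)

W_demo = torch.zeros(a, C, requires_grad=True)
loader = DataLoader(TensorDataset(X_demo, labels), batch_size=40, shuffle=True)

def mean_loss():
    """Cross-entropy over the whole demo set, without tracking gradients."""
    with torch.no_grad():
        return F.cross_entropy(X_demo @ W_demo, labels).item()

loss_before = mean_loss()
for epoch in range(20):
    for xb, yb in loader:
        batch_loss = F.cross_entropy(xb @ W_demo, yb)   # mean over the batch
        batch_loss.backward()
        with torch.no_grad():
            W_demo -= 0.1 * W_demo.grad   # no further division: the loss is already a mean
        W_demo.grad.zero_()
loss_after = mean_loss()
print(loss_before, loss_after)
```

Since the batch loss is a mean, the learning rate here does not need to be rescaled when the batch size changes.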

### 4.3 Method 3: torch.nn

- Define the model class

In [195]:
class SofmaxRegresModel(torch.nn.Module): 
    def __init__(self, dim_in, dim_out):
        # Run the initializer of the parent class torch.nn.Module first,
        # so that the layer defined below is registered as a submodule
        super(SofmaxRegresModel, self).__init__() 
        self.layer1 = torch.nn.Linear(dim_in, dim_out, bias=True)
        
    def forward(self, x):
        y_pred = self.layer1(x)
        # Returns class probabilities. Note that torch.nn.CrossEntropyLoss expects raw scores
        # and applies log-softmax itself, so pairing it with this output applies softmax twice.
        return torch.nn.functional.softmax(y_pred, dim=1)  # softmax

- Define the training setup

In [493]:
dim_in = X.shape[1]
dim_out = y.shape[1]
# Instantiate a network
net = SofmaxRegresModel(dim_in, dim_out)
# Initialize the weights and the bias
net.layer1.weight.data = torch.randn(dim_out, dim_in)
net.layer1.bias.data = torch.randn(dim_out)  # torch.Tensor(dim_out) would leave the bias uninitialized
# Loss function
loss = torch.nn.CrossEntropyLoss()
# Stochastic gradient descent optimizer
trainer = torch.optim.SGD(net.parameters(), lr=0.05)

In [494]:
# Load the data
batch_size = 20
num_epochs = 100
dataset = TensorDataset(train_X, train_indices_y)
data_iter = DataLoader(dataset=dataset, batch_size=batch_size, shuffle=True)
# Start training
for epoch in range(num_epochs):
    for t_x, t_y in data_iter:
        l = loss(net(t_x), t_y)  # cross-entropy loss of the current batch
        trainer.zero_grad()  # reset the parameter gradients
        l.backward()  # backpropagation: compute the gradients
        trainer.step()  # update the parameters
    if (epoch+1) % 20 == 0:
        with torch.no_grad():  # no gradient tracking is needed for reporting
            l_epoch = loss(net(train_X), train_indices_y) 
            print('epoch {}, loss {}'.format(epoch+1, l_epoch)) 

epoch 20, loss 1.368786334991455
epoch 40, loss 1.173109531402588
epoch 60, loss 1.1000187397003174
epoch 80, loss 1.0768675804138184
epoch 100, loss 1.0675488710403442


- Results

In [497]:
w, b = net.parameters()
W = torch.cat([w.data, b.data.reshape(-1, 1)], axis=1)

pred_train_y = torch.max(net(train_X), axis=1)[1]
pred_test_y = torch.max(net(test_X), axis=1)[1]

print('train', torch.sum(pred_train_y == train_indices_y).numpy() / train_indices_y.numel())
print('test', torch.sum(pred_test_y == test_indices_y).numpy() / test_indices_y.numel())

train 0.90125
test 0.9
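`torch.nn.CrossEntropyLoss` takes unnormalized scores (logits) and applies log-softmax internally, so a model whose `forward` already returns softmax probabilities ends up normalized twice, which flattens the loss surface. A minimal sketch of a variant that returns logits, assuming a hypothetical class name `SoftmaxLogitsModel` and synthetic data; it is an illustration under these assumptions, not the notebook's model:

```python
import torch

class SoftmaxLogitsModel(torch.nn.Module):   # hypothetical name
    def __init__(self, dim_in, dim_out):
        super().__init__()
        self.layer1 = torch.nn.Linear(dim_in, dim_out, bias=True)

    def forward(self, x):
        return self.layer1(x)   # raw scores; CrossEntropyLoss handles the softmax

torch.manual_seed(0)
X_demo = torch.randn(500, 8)                       # hypothetical samples
true_W_demo = torch.randn(8, 5)                    # hypothetical generating weights
labels = (X_demo @ true_W_demo).argmax(dim=1)      # integer class labels

net_demo = SoftmaxLogitsModel(8, 5)
loss_fn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(net_demo.parameters(), lr=0.1)

for step in range(300):                            # full-batch gradient descent
    optimizer.zero_grad()
    loss_value = loss_fn(net_demo(X_demo), labels)
    loss_value.backward()
    optimizer.step()

with torch.no_grad():
    logits = net_demo(X_demo)
    probs = torch.softmax(logits, dim=1)           # probabilities only when they are needed
    accuracy = (logits.argmax(dim=1) == labels).float().mean().item()
print(loss_value.item(), accuracy)
```

Predictions are unaffected by the change, since the arg max of the logits equals the arg max of the probabilities.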


## 5. Case Study

The Iris dataset

In [290]:
from sklearn import datasets
d = datasets.load_iris()

In [405]:
x_labels, y_labels = d['feature_names'], d['target_names']

In [489]:
x_labels, y_labels

(['sepal length (cm)',
  'sepal width (cm)',
  'petal length (cm)',
  'petal width (cm)'],
 array(['setosa', 'versicolor', 'virginica'], dtype='<U10'))

In [476]:
rand_idx = np.arange(len(d['target']))
np.random.shuffle(rand_idx)
t_idx = rand_idx[:100]
v_idx = rand_idx[100:]
x_train, y_train = torch.from_numpy(d['data'][t_idx]).type(torch.FloatTensor), torch.from_numpy(d['target'][t_idx])
x_valid, y_valid = torch.from_numpy(d['data'][v_idx]).type(torch.FloatTensor), torch.from_numpy(d['target'][v_idx])
onehot_y_train = torch.zeros(x_train.shape[0], 3)
onehot_y_train[torch.arange(x_train.shape[0]), y_train] = 1
onehot_y_valid = torch.zeros(x_valid.shape[0], 3)
onehot_y_valid[torch.arange(x_valid.shape[0]), y_valid] = 1
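The one-hot encoding built above by index assignment can also be produced with `torch.nn.functional.one_hot`. A small sketch with hypothetical labels `y_demo`, assuming integer (int64) class indices:

```python
import torch
import torch.nn.functional as F

y_demo = torch.tensor([0, 2, 1, 2])                     # hypothetical integer labels
onehot_demo = F.one_hot(y_demo, num_classes=3).float()  # rows: [1,0,0], [0,0,1], [0,1,0], [0,0,1]

# Same result as filling a zero matrix by index assignment
manual = torch.zeros(y_demo.shape[0], 3)
manual[torch.arange(y_demo.shape[0]), y_demo] = 1
print(torch.equal(onehot_demo, manual))

# The integer labels are recovered with argmax
print(torch.equal(onehot_demo.argmax(dim=1), y_demo))
```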

### 5.1 Method 1

In [490]:
est_W = softmax_sgd(x_train, onehot_y_train, num_steps=1000)
print(f"Train accuracy rate: {precision_rate(x_train, onehot_y_train, est_W)}")
print(f"Valid accuracy rate: {precision_rate(x_valid, onehot_y_valid, est_W)}")

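step 50, cross-entropy 0.93
step 100, cross-entropy 0.69
step 150, cross-entropy 0.49
step 200, cross-entropy 0.37
step 250, cross-entropy 0.34
step 300, cross-entropy 0.31
step 350, cross-entropy 0.29
step 400, cross-entropy 0.27
step 450, cross-entropy 0.26
step 500, cross-entropy 0.24
step 550, cross-entropy 0.23
step 600, cross-entropy 0.22
step 650, cross-entropy 0.22
step 700, cross-entropy 0.21
step 750, cross-entropy 0.20
step 800, cross-entropy 0.19
step 850, cross-entropy 0.19
step 900, cross-entropy 0.18
step 950, cross-entropy 0.18
step 1000, cross-entropy 0.18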
Train accuracy rate: 0.99
Valid accuracy rate: 0.94


### 5.2 Method 2

In [491]:
# Iris
est_W, _ = softmax_miniBatch_sgd(x_train, onehot_y_train, num_epoch=1000, batch_size=40, lr=0.1)
print(f"Train accuracy rate: {precision_rate(x_train, onehot_y_train, est_W)}")
print(f"Valid accuracy rate: {precision_rate(x_valid, onehot_y_valid, est_W)}")

epoch 50, loss: 1.7345
epoch 100, loss: 1.2518
epoch 150, loss: 1.0072
epoch 200, loss: 0.8750
epoch 250, loss: 0.7937
epoch 300, loss: 0.7384
epoch 350, loss: 0.6971
epoch 400, loss: 0.6650
epoch 450, loss: 0.6388
epoch 500, loss: 0.6158
epoch 550, loss: 0.5963
epoch 600, loss: 0.5790
epoch 650, loss: 0.5635
epoch 700, loss: 0.5495
epoch 750, loss: 0.5364
epoch 800, loss: 0.5245
epoch 850, loss: 0.5135
epoch 900, loss: 0.5030
epoch 950, loss: 0.4932
epoch 1000, loss: 0.4839
Train accuracy rate: 0.94
Valid accuracy rate: 0.9


### 5.3 Method 3

In [517]:
# Instantiate the network
dim_in = 4
dim_out = 3
net = SofmaxRegresModel(dim_in, dim_out)
# Initialize the weights and the bias
net.layer1.weight.data = torch.randn(dim_out, dim_in)
net.layer1.bias.data = torch.randn(dim_out)
# Loss function
loss = torch.nn.CrossEntropyLoss()
# Stochastic gradient descent optimizer
trainer = torch.optim.SGD(net.parameters(), lr=0.05)
# Load the data
batch_size = 20
num_epochs = 2000
dataset = TensorDataset(x_train, y_train)
data_iter = DataLoader(dataset=dataset, batch_size=batch_size, shuffle=True)

# Training
for epoch in range(num_epochs):
    for t_x, t_y in data_iter:
        l = loss(net(t_x), t_y)  # cross-entropy loss of the current batch
        trainer.zero_grad()  # reset the parameter gradients
        l.backward()  # backpropagation: compute the gradients
        trainer.step()  # update the parameters
    if (epoch+1) % 50 == 0:
        with torch.no_grad():  # no gradient tracking is needed for reporting
            l_epoch = loss(net(x_train), y_train) 
            print('epoch {}, loss {}'.format(epoch+1, l_epoch)) 

epoch 50, loss 0.8891471028327942
epoch 100, loss 0.8810678720474243
epoch 150, loss 0.8779969215393066
epoch 200, loss 0.8762195110321045
epoch 250, loss 0.8750027418136597
epoch 300, loss 0.8741034865379333
epoch 350, loss 0.8734056949615479
epoch 400, loss 0.8728427886962891
epoch 450, loss 0.8723844885826111
epoch 500, loss 0.8719981908798218
epoch 550, loss 0.8716733455657959
epoch 600, loss 0.8713958263397217
epoch 650, loss 0.871156632900238
epoch 700, loss 0.8709509372711182
epoch 750, loss 0.8707689046859741
epoch 800, loss 0.8706075549125671
epoch 850, loss 0.8704662919044495
epoch 900, loss 0.8703412413597107
epoch 950, loss 0.8702294230461121
epoch 1000, loss 0.870128870010376
epoch 1050, loss 0.8700387477874756
epoch 1100, loss 0.869956910610199
epoch 1150, loss 0.869883120059967
epoch 1200, loss 0.8698161244392395
epoch 1250, loss 0.8697571754455566
epoch 1300, loss 0.8696995377540588
epoch 1350, loss 0.8696485757827759
epoch 1400, loss 0.8696022033691406
epoch 1450, loss

In [518]:
w, b = net.parameters()
W = torch.cat([w.data, b.data.reshape(-1, 1)], axis=1)

In [519]:
pred_y_train = torch.max(net(x_train), axis=1)[1]
pred_y_valid = torch.max(net(x_valid), axis=1)[1]

In [520]:
pred_y_train

tensor([2, 0, 2, 0, 0, 0, 2, 0, 0, 2, 0, 2, 2, 0, 2, 0, 2, 2, 2, 2, 2, 0, 2, 0,
        2, 0, 2, 0, 2, 2, 0, 2, 0, 0, 0, 0, 2, 2, 0, 2, 2, 2, 2, 2, 0, 0, 0, 2,
        2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 0, 0, 0, 2, 2, 2, 0, 0, 0, 0, 2, 0, 2, 0,
        0, 2, 2, 2, 2, 0, 2, 0, 2, 2, 0, 2, 2, 2, 2, 0, 2, 2, 0, 2, 2, 0, 0, 0,
        0, 0, 0, 2])

In [521]:
torch.sum(pred_y_train == y_train).numpy() / y_train.numel()

0.66

In [522]:
torch.sum(pred_y_valid == y_valid).numpy() / y_valid.numel()

0.68

## References
1. Li Hang. Statistical Learning Methods (统计学习方法). 2017.
2. Qiu Xipeng. Neural Networks and Deep Learning (神经网络与深度学习). 2020.