# 4. Softmax Regression

```python
import time
import math
import random

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import torch
from torch import nn
from torch.utils import data
```

## Multiple Outputs Classification

For example, suppose the data has **four features** and **three classes**. We then need three **affine functions**, one per class:

$$
\begin{aligned}
o_1 &= x_1 w_{11} + x_2 w_{12} + x_3 w_{13} + x_4 w_{14} + b_1,\\
o_2 &= x_1 w_{21} + x_2 w_{22} + x_3 w_{23} + x_4 w_{24} + b_2,\\
o_3 &= x_1 w_{31} + x_2 w_{32} + x_3 w_{33} + x_4 w_{34} + b_3.
\end{aligned}
$$

The neural network representation of this is:

![](http://d2l.ai/_images/softmaxreg.svg)
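As a minimal sketch of the three affine functions above, the snippet below computes $o_1, o_2, o_3$ once term by term and once as a single matrix-vector product. The parameter values are random placeholders chosen for illustration, and the names `W`, `b`, and `x` are assumptions, not part of the original notes.

```python
import torch

torch.manual_seed(0)  # fixed seed so the placeholder values are reproducible

num_features, num_classes = 4, 3

# Illustrative random parameters: row j of W holds the weights w_{j1}, ..., w_{j4} of o_j.
W = torch.randn(num_classes, num_features)
b = torch.randn(num_classes)
x = torch.randn(num_features)  # one data point with four features

# Explicit form: one affine function per class, as in the equations above.
o_explicit = torch.stack([(x * W[j]).sum() + b[j] for j in range(num_classes)])

# Compact form: a single matrix-vector product plus the bias vector.
o_matrix = W @ x + b

print(o_matrix.shape)  # torch.Size([3])
print(torch.allclose(o_explicit, o_matrix, atol=1e-6))  # True
```

Both forms give the same three outputs; the compact form is the one a fully connected layer computes.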

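For a **fully connected layer** with $d$ inputs and $q$ outputs, the number of parameters required is $\mathcal{O}(dq)$. In practice, this cost can be reduced to $\mathcal{O}(\frac{dq}{n})$, where $n$ is a hyperparameter that trades parameter savings against model effectiveness (it is unrelated to the batch size $n$ used below).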
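To make the $\mathcal{O}(dq)$ count concrete, here is a small sketch that counts the parameters of a fully connected layer using `nn.Linear`. The sizes $d = 4$ and $q = 3$ are assumed to match the running example. It only illustrates the dense case, not the reduced $\mathcal{O}(\frac{dq}{n})$ variant.

```python
from torch import nn

d, q = 4, 3  # assumed sizes, matching the four-feature, three-class example

layer = nn.Linear(d, q)  # fully connected layer with d inputs and q outputs

num_weights = layer.weight.numel()  # d * q weights
num_biases = layer.bias.numel()     # q biases

print(num_weights, num_biases)  # 12 3
```

The weight matrix dominates the count, which is why the cost grows as $dq$.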

## Softmax Calibration

We want the output $\hat{y}_j$ to be the **probability** that a data point belongs to class $j$. Then, we can obtain the **final output** as $\operatorname*{argmax}_j \hat{y}_j$.

To achieve this, we apply the **`softmax`** function to the outputs so that they are all **positive** and **sum to 1**:

$$\hat{\mathbf{y}} = \mathrm{softmax}(\mathbf{o})\quad \text{where}\quad \hat{y}_j = \frac{\exp(o_j)}{\sum_k \exp(o_k)}$$
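The formula can be sketched directly in PyTorch. The function below is an illustrative implementation, not a library API. Subtracting the maximum before exponentiating is a common numerical-stability step added here as an assumption; it does not change the result. The input values are arbitrary.

```python
import torch

def softmax(o: torch.Tensor) -> torch.Tensor:
    """Softmax over the last dimension of `o` (illustrative helper)."""
    # Shifting by the max avoids overflow in exp(); the ratio below is unchanged.
    o_shifted = o - o.max(dim=-1, keepdim=True).values
    exp_o = torch.exp(o_shifted)
    return exp_o / exp_o.sum(dim=-1, keepdim=True)

o = torch.tensor([2.0, -1.0, 0.5])  # arbitrary linear outputs
y_hat = softmax(o)

print(y_hat)                    # three probabilities
print(bool((y_hat > 0).all()))  # True: every entry is positive
print(torch.isclose(y_hat.sum(), torch.tensor(1.0)))  # tensor(True): entries sum to 1
```

The result should agree with the built-in `torch.softmax(o, dim=-1)`.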

The softmax function does not change the **ordering** (rank) of the original linear outputs, so the most likely class is unchanged:

$$\operatorname*{argmax}_j \hat y_j = \operatorname*{argmax}_j o_j$$
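A quick check of this identity on a few hand-picked rows, using the built-in `torch.softmax`. The values in `O` are arbitrary illustrations.

```python
import torch

# Each row is one example's linear outputs o; values chosen by hand for illustration.
O = torch.tensor([[1.0, 3.0, 0.2],
                  [-2.0, -0.5, -4.0],
                  [0.1, 0.0, 5.0]])

Y_hat = torch.softmax(O, dim=1)  # softmax applied row by row

pred_from_o = O.argmax(dim=1)
pred_from_y_hat = Y_hat.argmax(dim=1)

print(pred_from_o)                                 # tensor([1, 1, 2])
print(torch.equal(pred_from_o, pred_from_y_hat))   # True
```

Because $\exp$ is strictly increasing and the normalizer is shared across classes, the largest $o_j$ always yields the largest $\hat{y}_j$.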

Thus, even though softmax is a **non-linear** function, softmax regression is still a **linear** model: its prediction is determined by an affine transformation of the input features.

## Mini-Batch Vectorization

When implementing softmax regression, we normally read a **mini-batch** $\mathbf{X}$ of the dataset with feature **dimension** $d$ and **batch size** $n$.

If the dataset has $q$ **distinct classes**, we have: 

$$\mathbf{X} \in \mathbb{R}^{n \times d}$$
$$\mathbf{W} \in \mathbb{R}^{d \times q}$$
$$\mathbf{b} \in \mathbb{R}^{1\times q}$$

Then, the softmax regression is given as:

$$ \begin{aligned} \mathbf{O} &= \mathbf{X} \mathbf{W} + \mathbf{b}, \\ \hat{\mathbf{Y}} & = \mathrm{softmax}(\mathbf{O}). \end{aligned}$$

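where $\mathbf{O}$ and $\hat{\mathbf{Y}}$ are both in $\mathbb{R}^{n \times q}$, the bias $\mathbf{b}$ is broadcast across the $n$ rows, and softmax is applied to each row.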
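The vectorized computation can be sketched as follows. The sizes $n = 8$, $d = 4$, $q = 3$ and the random initialization are assumptions made only to show the shapes.

```python
import torch

torch.manual_seed(0)  # reproducible placeholder data

n, d, q = 8, 4, 3  # assumed batch size, feature dimension, number of classes

X = torch.randn(n, d)                    # mini-batch of n examples
W = torch.normal(0, 0.01, size=(d, q))   # illustrative small random weights
b = torch.zeros(1, q)                    # bias row vector

O = X @ W + b                    # b is broadcast across the n rows
Y_hat = torch.softmax(O, dim=1)  # softmax over the class dimension of each row

print(O.shape, Y_hat.shape)  # torch.Size([8, 3]) torch.Size([8, 3])
print(torch.allclose(Y_hat.sum(dim=1), torch.ones(n)))  # True: each row sums to 1
```

Note that `dim=1` normalizes across classes; using `dim=0` would instead normalize across the examples in the batch.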

## Cost Function

To find the **optimal parameters**, we aim to maximize the **likelihood** of the observed labels:

$$P(\mathbf{Y} \mid \mathbf{X}) = \prod_{i=1}^n P(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)})$$

Since the logarithm is monotonic, this is equivalent to minimizing the negative log-likelihood:

$$-\log P(\mathbf{Y} \mid \mathbf{X}) = \sum_{i=1}^n -\log P(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)})
= \sum_{i=1}^n l(\mathbf{y}^{(i)}, \hat{\mathbf{y}}^{(i)})$$

where the **`loss function`** $l$ is the **cross-entropy** (with $\mathbf{y}$ a one-hot label vector, so that $\sum_j y_j = 1$ in the last step):

$$
\begin{aligned}
l(\mathbf{y}, \hat{\mathbf{y}}) &= - \sum_{j=1}^q y_j \log \hat{y}_j \\
&=  - \sum_{j=1}^q y_j \log \frac{\exp(o_j)}{\sum_{k=1}^q \exp(o_k)} \\
&= \sum_{j=1}^q y_j \log \sum_{k=1}^q \exp(o_k) - \sum_{j=1}^q y_j o_j\\
&= \log \sum_{k=1}^q \exp(o_k) - \sum_{j=1}^q y_j o_j.
\end{aligned}
$$
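As a sanity check on the derivation, the sketch below evaluates the first and last lines of it for one example and compares them with PyTorch's built-in `torch.nn.functional.cross_entropy`. The values of `o` and the choice of true class are arbitrary assumptions.

```python
import torch
import torch.nn.functional as F

o = torch.tensor([2.0, -1.0, 0.5])   # arbitrary linear outputs for one example
y = torch.tensor([0.0, 0.0, 1.0])    # one-hot label: the true class has index 2

# First line of the derivation: -sum_j y_j log(y_hat_j)
y_hat = torch.softmax(o, dim=0)
loss_definition = -(y * torch.log(y_hat)).sum()

# Last line of the derivation: log(sum_k exp(o_k)) - sum_j y_j o_j
loss_simplified = torch.logsumexp(o, dim=0) - (y * o).sum()

# Built-in version: expects a batch of logits and integer class indices.
loss_builtin = F.cross_entropy(o.unsqueeze(0), torch.tensor([2]))

print(torch.allclose(loss_definition, loss_simplified))  # True
print(torch.allclose(loss_simplified, loss_builtin))     # True
```

The simplified form works on $\mathbf{o}$ directly, which avoids taking the logarithm of a very small $\hat{y}_j$.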

Thus, the **derivative** of the loss function with respect to the **linear output** $o_j$ is:

$$
\partial_{o_j} l(\mathbf{y}, \hat{\mathbf{y}}) = \frac{\exp(o_j)}{\sum_{k=1}^q \exp(o_k)} - y_j = \mathrm{softmax}(\mathbf{o})_j - y_j
$$
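This gradient can be checked numerically with autograd. In the sketch below, the values of `o` and `y` are arbitrary assumptions; the loss is written in the simplified form $\log \sum_k \exp(o_k) - \sum_j y_j o_j$.

```python
import torch

o = torch.tensor([2.0, -1.0, 0.5], requires_grad=True)  # arbitrary linear outputs
y = torch.tensor([1.0, 0.0, 0.0])                        # one-hot label (class index 0)

# Cross-entropy loss in its simplified form.
loss = torch.logsumexp(o, dim=0) - (y * o).sum()
loss.backward()  # autograd fills o.grad with d(loss)/d(o_j)

# Closed-form gradient from the formula above: softmax(o) - y
analytic_grad = torch.softmax(o.detach(), dim=0) - y

print(torch.allclose(o.grad, analytic_grad, atol=1e-6))  # True
```

The gradient is the difference between the predicted probabilities and the observed label, so it vanishes only when the model assigns probability 1 to the true class.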