A linear layer computes $y_{m \times 1} = W_{m \times n} x_{n \times 1} + b_{m \times 1}$, where

$$
W = \begin{bmatrix}
W_{1,1} & W_{1, 2} & ... & W_{1, n} \\
W_{2,1} & W_{2, 2} & ... & W_{2, n} \\
\vdots \\
W_{m,1} & W_{m, 2} & ... & W_{m, n}
\end{bmatrix}
$$
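As a concrete reference for the shapes involved, here is a minimal NumPy sketch of the forward pass. The names `linear_forward` and `linear_forward_rowwise` and the sample values are illustrative assumptions, not part of the derivation.

```python
import numpy as np

def linear_forward(W, x, b):
    """Compute y = W x + b for W of shape (m, n), x of shape (n,), b of shape (m,)."""
    return W @ x + b

def linear_forward_rowwise(W, x, b):
    """Same computation written out per row: y_i = sum_j W[i, j] * x[j] + b[i]."""
    m, n = W.shape
    return np.array([sum(W[i, j] * x[j] for j in range(n)) + b[i] for i in range(m)])

m, n = 3, 4
W = np.arange(m * n, dtype=float).reshape(m, n)  # hypothetical sample weights
x = np.array([1.0, 0.0, -1.0, 2.0])              # hypothetical sample input
b = np.array([0.5, -1.0, 2.0])                   # hypothetical sample bias

y = linear_forward(W, x, b)
y_rowwise = linear_forward_rowwise(W, x, b)
print(y)  # [ 4.5 11.  22. ]
```

Both functions return the same vector of length $m$; the row-wise version only exists to mirror the scalar formula for $y_i$.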


## Gradients w.r.t. $W$
Because $y_i = W_{i,1} x_1 + W_{i,2} x_2 + ... + W_{i,n} x_n + b_i$, we have
$\partial y_i / \partial W_{i,j} = x_j$ and
$\partial y_k / \partial W_{i,j} = 0$ for $k \not= i$.

When we "flatten" $W$ row by row into a "long vector" $w$ of length $mn$, the Jacobian of $y$ w.r.t. $w$ becomes:
$$
J_w = \begin{bmatrix}
x_1 & x_2 & ... & x_n & 0   & 0   & ... & 0   & ... & 0   & 0   & ... & 0   \\
0   & 0   & ... & 0   & x_1 & x_2 & ... & x_n & ... & 0   & 0   & ... & 0   \\
\vdots \\
0   & 0   & ... & 0   & 0   & 0   & ... & 0   & ... & x_1 & x_2 & ... & x_n
\end{bmatrix}_{m \times (mn)}
$$
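The block structure of $J_w$ can be checked numerically. The sketch below assumes $W$ is flattened row by row (the ordering the matrix above implies), builds the Jacobian as a Kronecker product of the identity with $x^T$, and compares it against central finite differences. The random values and the helper `forward_flat` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 4
W = rng.normal(size=(m, n))
x = rng.normal(size=n)
b = rng.normal(size=m)

# Analytic Jacobian: row i contains x^T in columns i*n .. (i+1)*n - 1, zeros elsewhere.
J_w = np.kron(np.eye(m), x[None, :])  # shape (m, m*n)

def forward_flat(w):
    """Forward pass taking the row-major flattened weights w of length m*n."""
    return w.reshape(m, n) @ x + b

# Numerical Jacobian via central differences, one flattened weight at a time.
w = W.reshape(-1)
eps = 1e-6
J_num = np.zeros((m, m * n))
for k in range(m * n):
    dw = np.zeros(m * n)
    dw[k] = eps
    J_num[:, k] = (forward_flat(w + dw) - forward_flat(w - dw)) / (2 * eps)

print(np.allclose(J_w, J_num, atol=1e-6))
```

Because $y$ is linear in $w$, the finite-difference estimate should agree with the analytic Jacobian up to floating-point rounding.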


Applying the chain rule (assuming the final loss is a scalar $\ell$):
$$
\nabla^T_w \ell = \nabla^T_y \ell \cdot J_w =
\begin{bmatrix}
x_1 \frac{\partial \ell}{\partial y_1} & x_2 \frac{\partial \ell}{\partial y_1} & ... & x_n \frac{\partial \ell}{\partial y_1} &
x_1 \frac{\partial \ell}{\partial y_2} & x_2 \frac{\partial \ell}{\partial y_2} & ... & x_n \frac{\partial \ell}{\partial y_2} &
...
\end{bmatrix}_{1 \times (mn)}
$$


Because the pattern repeats every $n$ entries, we can reshape this gradient into a matrix with the same dimensions as $W$:
$$
\tag{1}
\nabla_W \ell =
\begin{bmatrix}
x_1 \frac{\partial \ell}{\partial y_1} & x_2 \frac{\partial \ell}{\partial y_1} & ... & x_n \frac{\partial \ell}{\partial y_1} \\
x_1 \frac{\partial \ell}{\partial y_2} & x_2 \frac{\partial \ell}{\partial y_2} & ... & x_n \frac{\partial \ell}{\partial y_2} \\
\vdots \\
x_1 \frac{\partial \ell}{\partial y_m} & x_2 \frac{\partial \ell}{\partial y_m} & ... & x_n \frac{\partial \ell}{\partial y_m}
\end{bmatrix}_{m \times n}
= (\nabla_y \ell)_{m \times 1} \cdot (x^T)_{1 \times n}
$$
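To see that the reshaped chain-rule product and the outer product agree, here is a small NumPy check. It assumes an illustrative loss $\ell = \frac{1}{2} \lVert y \rVert^2$, chosen only because it gives $\nabla_y \ell = y$; the loss, the random values, and the variable names are not from the derivation above.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 4
W = rng.normal(size=(m, n))
x = rng.normal(size=n)
b = rng.normal(size=m)

def loss(W_):
    """Assumed example loss: l = 0.5 * ||W_ x + b||^2."""
    y = W_ @ x + b
    return 0.5 * np.sum(y ** 2)

grad_y = W @ x + b  # for this particular loss, dl/dy = y

# Route 1: flattened chain rule, (1 x m) times (m x mn), giving a vector of length mn.
J_w = np.kron(np.eye(m), x[None, :])
grad_w_flat = grad_y @ J_w

# Route 2: outer product (grad_y)(x^T), already shaped like W.
grad_W = np.outer(grad_y, x)

# Independent check with central finite differences on each W[i, j].
eps = 1e-6
grad_W_num = np.zeros_like(W)
for i in range(m):
    for j in range(n):
        dW = np.zeros_like(W)
        dW[i, j] = eps
        grad_W_num[i, j] = (loss(W + dW) - loss(W - dW)) / (2 * eps)

print(np.allclose(grad_w_flat.reshape(m, n), grad_W))
print(np.allclose(grad_W, grad_W_num, atol=1e-6))
```

The outer-product form avoids materializing the $m \times (mn)$ Jacobian, which is mostly zeros.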

## Gradients w.r.t. $b$
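Because $y_i = W_{i,1} x_1 + W_{i,2} x_2 + ... + W_{i,n} x_n + b_i$, we have $\partial y_i / \partial b_i = 1$ and $\partial y_k / \partial b_i = 0$ for $k \not= i$, so the Jacobian w.r.t. $b$ is the $m \times m$ identity matrix $E$: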
$$
\tag{2}
\nabla^T_b \ell = \nabla^T_y \ell \cdot J_b =
\nabla^T_y \ell \cdot E = \nabla^T_y \ell
$$
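The bias result can be verified the same way. This sketch again assumes the illustrative loss $\ell = \frac{1}{2} \lVert y \rVert^2$ (so that $\nabla_y \ell = y$), which is not part of the derivation; the random values are likewise placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 3, 4
W = rng.normal(size=(m, n))
x = rng.normal(size=n)
b = rng.normal(size=m)

def loss(b_):
    """Assumed example loss as a function of the bias: l = 0.5 * ||W x + b_||^2."""
    y = W @ x + b_
    return 0.5 * np.sum(y ** 2)

grad_y = W @ x + b   # dl/dy for this particular loss
J_b = np.eye(m)      # Jacobian of y w.r.t. b is the identity
grad_b = grad_y @ J_b

# Central finite differences along each coordinate direction of b.
eps = 1e-6
grad_b_num = np.array([
    (loss(b + eps * e) - loss(b - eps * e)) / (2 * eps)
    for e in np.eye(m)
])

print(np.allclose(grad_b, grad_y))
print(np.allclose(grad_b, grad_b_num, atol=1e-6))
```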


## Gradients w.r.t. $x$
According to $y_i = W_{i,1} x_1 + W_{i,2} x_2 + ... + W_{i,n} x_n + b_i$, we have $\partial y_i / \partial x_j = W_{i,j}$, so the Jacobian w.r.t. $x$ is

$$
J_x = \begin{bmatrix}
\frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & ... & \frac{\partial y_1}{\partial x_n} \\
\frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & ... & \frac{\partial y_2}{\partial x_n} \\
\vdots & \ddots \\
\frac{\partial y_m}{\partial x_1} & \frac{\partial y_m}{\partial x_2} & ... & \frac{\partial y_m}{\partial x_n}
\end{bmatrix}
=
\begin{bmatrix}
W_{1,1} & W_{1, 2} & ... & W_{1, n} \\
W_{2,1} & W_{2, 2} & ... & W_{2, n} \\
\vdots \\
W_{m,1} & W_{m, 2} & ... & W_{m, n}
\end{bmatrix}
= W
$$
As a result,
$$
\tag{3}
\nabla^T_x \ell = \nabla^T_y \ell \cdot W
$$
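Equation (3) gives a row vector; transposing it yields the column form $\nabla_x \ell = W^T \nabla_y \ell$. A minimal NumPy check follows, once more assuming the illustrative loss $\ell = \frac{1}{2} \lVert y \rVert^2$ and placeholder random values, neither of which comes from the derivation.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 3, 4
W = rng.normal(size=(m, n))
x = rng.normal(size=n)
b = rng.normal(size=m)

def loss(x_):
    """Assumed example loss as a function of the input: l = 0.5 * ||W x_ + b||^2."""
    y = W @ x_ + b
    return 0.5 * np.sum(y ** 2)

grad_y = W @ x + b          # dl/dy for this particular loss

grad_x_row = grad_y @ W     # row form:    (1 x m) times (m x n)
grad_x_col = W.T @ grad_y   # column form: (n x m) times (m x 1)

# Central finite differences along each coordinate direction of x.
eps = 1e-6
grad_x_num = np.array([
    (loss(x + eps * e) - loss(x - eps * e)) / (2 * eps)
    for e in np.eye(n)
])

print(np.allclose(grad_x_row, grad_x_col))
print(np.allclose(grad_x_row, grad_x_num, atol=1e-6))
```

With 1-D NumPy arrays the row and column forms are the same array of length $n$; the distinction only matters when shapes are tracked explicitly as $1 \times n$ or $n \times 1$.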