# Basics

We use lower-case letters such as $a, b, c$ to denote vectors and scalars; whether a letter refers to a vector or a scalar should be clear from the context. For matrices, we use upper-case letters such as $A, B, C$. In addition, we sometimes make use of “tensors”, which are nothing but higher-dimensional arrays. More specifically, we call 1-d arrays vectors, 2-d arrays matrices, and 3-d or 4-d arrays tensors. We also use upper-case letters to represent tensors.

Given a vector $a$, we call it $N$-dimensional if it contains exactly $N$ elements, and we use $a_i$ to denote its $i$-th element, $1 \le i \le N$. Similarly, we call a matrix $A$ $M \times N$-dimensional if it contains $M$ rows and $N$ columns, and we use $A_{ij}$ or $A_{i,j}$ to denote its $(i,j)$-th element, $1 \le i \le M$, $1 \le j \le N$. We use similar notation for tensors.

We follow the conventions in the literature for sums and differences of vectors and matrices, and likewise for products between scalars, vectors, and matrices. We omit the details here and refer the reader to Schott (2016).
In addition, we need the following vector/matrix operations.

## Vectorization of matrices

Let $A$ be an $M \times N$ matrix. We use $vec(A)$ to denote the vector obtained by stacking the columns of $A$. More specifically, let $a_i$ denote the $i$-th column of $A$; then $vec(A)=[a_1^t,a_2^t,...,a_N^t]^t$, an $MN$-dimensional vector.
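As a small illustration of column stacking, the sketch below implements $vec$ with NumPy. The helper name `vec` and the example matrix are assumptions made for this note; the column-major (`order="F"`) reshape is one way to get the column-by-column ordering described above.

```python
import numpy as np

def vec(A):
    """Stack the columns of a 2-D array into a single 1-D vector."""
    # order="F" (column-major) reads the array column by column.
    return np.asarray(A).reshape(-1, order="F")

A = np.array([[1, 2, 3],
              [4, 5, 6]])  # M = 2 rows, N = 3 columns
v = vec(A)
print(v)  # [1 4 2 5 3 6]
```

Note that NumPy's default (row-major) flattening would give `[1 2 3 4 5 6]` instead, which stacks rows rather than columns.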

## Inner product

Let $a, b$ be $N$-dimensional vectors. Their inner product is $a \cdot b= a^tb=\sum_{i=1}^{N} a_ib_i$.  
Similarly, let $A, B$ be $M \times N$-dimensional matrices. Their inner product is $A\cdot B= \sum_{i=1}^{M}\sum_{j=1}^{N} A_{ij}B_{ij}$. **$A$ and $B$ must have the same dimensions, and the result is a scalar.**  
We have the following identity:  
$A\cdot(BC)=(B^tA)\cdot C=(AC^t)\cdot B$  
The identity holds whenever the dimensions are compatible, so that every product in it is defined.
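The identity can be checked numerically. The following is a minimal sketch, assuming random test matrices and a hypothetical helper `inner`; the shapes are chosen so that $BC$, $B^tA$, and $AC^t$ are all defined.

```python
import numpy as np

def inner(A, B):
    """Inner product of two equally shaped arrays: sum of element-wise products."""
    A, B = np.asarray(A), np.asarray(B)
    if A.shape != B.shape:
        raise ValueError("inner product needs operands of the same shape")
    return float(np.sum(A * B))

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 5))  # same shape as BC
B = rng.standard_normal((3, 4))
C = rng.standard_normal((4, 5))

lhs = inner(A, B @ C)      # A . (BC)
mid = inner(B.T @ A, C)    # (B^t A) . C
rhs = inner(A @ C.T, B)    # (A C^t) . B
print(np.isclose(lhs, mid), np.isclose(lhs, rhs))
```

All three values agree up to floating-point rounding, because each equals the trace of $A^tBC$.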

## Hadamard product

Let $A, B$ be $M \times N$-dimensional matrices. Their Hadamard product $A\bigodot B$ is defined as the $M \times N$-dimensional matrix that satisfies  
$(A\bigodot B)_{ij}=A_{ij}B_{ij}$  
We have the following identity:  
$A\cdot(B\bigodot C)=(A\bigodot B)\cdot C$

The Hadamard product requires the two matrices to have the same dimensions; the result is their element-wise product.
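In NumPy, the `*` operator on arrays of equal shape is the element-wise product, so the Hadamard identity can be sketched in a few lines. The random matrices here are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(1)
A, B, C = (rng.standard_normal((3, 4)) for _ in range(3))

hadamard = A * B            # (A ⊙ B)_ij = A_ij * B_ij

lhs = np.sum(A * (B * C))   # A . (B ⊙ C)
rhs = np.sum((A * B) * C)   # (A ⊙ B) . C
print(np.isclose(lhs, rhs))
```

Both sides reduce to the same triple sum $\sum_{ij} A_{ij}B_{ij}C_{ij}$, which is why the identity holds.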

# Vector/matrix calculus

In this section, we derive the identities of vector/matrix calculus.

## Derivatives

Let $Q$ be a function $R^M \to R$, that is, a function that maps an $M$-dimensional vector to a scalar. We use $Q'(x)$ or $\frac{\partial Q}{\partial x}$ to denote the following $M$-dimensional vector:  
$Q'(x)_i=\frac{\partial Q}{\partial x_i}$

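Similarly, let $Q$ be a function $R^{M \times N} \to R$, that is, a function that maps an $M \times N$ matrix to a scalar. We use $Q'(X)$ or $\frac{\partial Q}{\partial X}$ to denote the following $M \times N$-dimensional matrix:  
$Q'(X)_{ij}=\frac{\partial Q}{\partial X_{ij}}$

Note that $X$ is now an upper-case letter and denotes a matrix; the derivative collects the partial derivative of $Q$ with respect to each entry $X_{ij}$.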

Now let $Q$ be a function from $R^{M}$ to $R^N$ such that  
$Q(x)=\begin{bmatrix}
Q_1(x)\\
Q_2(x)\\
\vdots\\
Q_N(x)
\end{bmatrix}$

Then its Jacobian $\nabla Q(x)$ is defined as the $N \times M$-dimensional matrix that satisfies  
$\nabla Q(x)_{ij}=\frac{\partial Q_i(x)}{\partial x_j}$

1. $Q(x)$ has a restricted form: it is an $N$-dimensional vector, each of whose elements is a function of $x$.  
2. Row $i$ of $\nabla Q(x)$ contains the derivatives of $Q_i(x)$ with respect to each $x_j$, so the result is an $N \times M$ matrix.
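To make the row/column layout of the Jacobian concrete, here is a rough sketch that approximates it by central differences and compares it with a hand-derived Jacobian. The function `Q`, the helper `numerical_jacobian`, and the step size `eps` are all assumptions chosen for illustration, not part of the source.

```python
import numpy as np

def numerical_jacobian(Q, x, eps=1e-6):
    """Approximate the N x M Jacobian of Q: R^M -> R^N by central differences."""
    x = np.asarray(x, dtype=float)
    n_out = np.asarray(Q(x)).size
    J = np.zeros((n_out, x.size))
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = eps
        # Column j holds the derivatives of every Q_i with respect to x_j.
        J[:, j] = (Q(x + step) - Q(x - step)) / (2 * eps)
    return J

def Q(x):
    # Hypothetical example with M = 2 inputs and N = 3 outputs.
    return np.array([x[0] * x[1], np.sin(x[0]), x[1] ** 2])

x0 = np.array([0.5, -1.5])
J_numeric = numerical_jacobian(Q, x0)
J_exact = np.array([
    [x0[1], x0[0]],         # dQ_1/dx_1, dQ_1/dx_2
    [np.cos(x0[0]), 0.0],   # dQ_2/dx_1, dQ_2/dx_2
    [0.0, 2 * x0[1]],       # dQ_3/dx_1, dQ_3/dx_2
])
print(J_numeric.shape)  # (3, 2)
```

The shape `(3, 2)` matches the $N \times M$ convention: one row per output component, one column per input coordinate.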

Aside:  
The abbreviation s.t. usually means “such that”. In the context of optimization, it means “subject to”. Note also that “such that” does not mean the same as “so that”:  
“such that” describes how something should be done;  
“so that” describes why something should be done.  
For clarity, it is usually best to avoid s.t. and simply write “such that”.

For functions that map matrices/tensors to matrices/tensors, we define the derivatives through the corresponding vectorized versions. To illustrate, let $Q$ be a function that maps $R^{M_1 \times N_1}$ to $R^{M_2 \times N_2}$; then its derivative is defined as  
$\dfrac{\partial vec(Q)}{\partial vec^t(X)}$  
That is, first flatten $Q$, then flatten $X$, and then differentiate entry by entry. Personal understanding: since $Q$ takes values in $R^{M_2 \times N_2}$ and $X$ is in $R^{M_1 \times N_1}$, the result is an $(M_2 N_2) \times (M_1 N_1)$ matrix.  
We have the following formula (the chain rule):  
$\dfrac{\partial vec(Q(F(X)))}{\partial vec^t(X)}=\dfrac{\partial vec(Q(F(X)))}{\partial vec^t (F(X))}\dfrac{\partial vec(F(X))}{\partial vec^t (X)}$  

That is, following the chain rule, we flatten and differentiate layer by layer; keeping track of the dimensions takes some care.  
Personal understanding: for example, if $Q$ is $M_1\times N_1$, $F$ is $M_2 \times N_2$, and $X$ is $M_3 \times N_3$, then $\dfrac{\partial vec(Q(F(X)))}{\partial vec^t (F(X))}$ is $(M_1 N_1)\times (M_2 N_2)$, $\dfrac{\partial vec(F(X))}{\partial vec^t (X)}$ is $(M_2 N_2) \times (M_3 N_3)$, and the final result is $(M_1 N_1)\times (M_3 N_3)$.

In addition, for a constant matrix $M$,  
$\dfrac{\partial Mx}{\partial x^t}=M$
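The dimension bookkeeping above can be checked on a small example. The sketch below is only illustrative: the maps `F` and `Q`, the matrices `A`, `B`, `M`, and the finite-difference helper `vec_jacobian` are assumptions introduced here. It estimates each vectorized derivative numerically, confirms that the product of the two factors matches the derivative of the composition, and checks that the Jacobian of $x \mapsto Mx$ is $M$.

```python
import numpy as np

def vec(A):
    """Column-stacking vectorization."""
    return np.asarray(A, dtype=float).reshape(-1, order="F")

def vec_jacobian(func, X, eps=1e-6):
    """Finite-difference estimate of d vec(func(X)) / d vec^t(X)."""
    X = np.asarray(X, dtype=float)
    J = np.zeros((vec(func(X)).size, X.size))
    for k in range(X.size):
        step = np.zeros(X.size)
        step[k] = eps
        dX = step.reshape(X.shape, order="F")  # undo vec for the k-th entry
        J[:, k] = (vec(func(X + dX)) - vec(func(X - dX))) / (2 * eps)
    return J

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 3))
B = rng.standard_normal((2, 5))
X = rng.standard_normal((3, 2))      # X is 3 x 2

def F(X):
    return np.tanh(A @ X)            # maps 3 x 2 to 4 x 2

def Q(Y):
    return Y @ B                     # maps 4 x 2 to 4 x 5

J_F = vec_jacobian(F, X)                        # (4*2) x (3*2)
J_Q = vec_jacobian(Q, F(X))                     # (4*5) x (4*2)
J_total = vec_jacobian(lambda Z: Q(F(Z)), X)    # (4*5) x (3*2)
print(J_Q.shape, J_F.shape, J_total.shape)      # (20, 8) (8, 6) (20, 6)

# Linear map x -> Mx: its Jacobian should equal M itself.
M = rng.standard_normal((4, 3))
x = rng.standard_normal(3)
J_linear = vec_jacobian(lambda z: M @ z, x)
```

The inner dimension `8` (the size of $vec(F(X))$) cancels in the product `J_Q @ J_F`, exactly as in the dimension count above.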

## Differentials

Let $Q$ be a function that maps $R^M$ to $R^N$, and assume its derivative exists. It can be proved that there exists a matrix $A_x$ such that  
$Q(x+dx)=Q(x)+A_xdx+o(dx)$

In the display above, we use $o(dx)$ to denote an $N$-dimensional vector that satisfies  
$\lim_{dx \to 0}\dfrac{o(dx)}{ \lVert dx  \rVert}=0$

We define the differential of $Q$ at $x$ as the map $dQ(x;\cdot): R^M \to R^N$ such that $dQ(x;dx)=A_xdx$.  
We also write $dQ_x.dx$ in place of $dQ(x;dx)$.  
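The defining property of the differential is that the remainder $Q(x+dx)-Q(x)-A_xdx$ vanishes faster than $\lVert dx \rVert$. A minimal sketch of this, assuming a hypothetical map `Q` from $R^2$ to $R^2$ whose Jacobian `A_at` was worked out by hand:

```python
import numpy as np

def Q(x):
    # Hypothetical example map from R^2 to R^2.
    return np.array([np.exp(x[0]), x[0] * x[1]])

def A_at(x):
    # Jacobian of Q at x, worked out by hand for this example.
    return np.array([[np.exp(x[0]), 0.0],
                     [x[1], x[0]]])

x = np.array([0.3, -1.2])
direction = np.array([1.0, 2.0])

ratios = []
for t in (1e-1, 1e-2, 1e-3):
    dx = t * direction
    remainder = Q(x + dx) - Q(x) - A_at(x) @ dx   # plays the role of o(dx)
    ratios.append(np.linalg.norm(remainder) / np.linalg.norm(dx))

print(ratios)  # the ratio shrinks as dx shrinks
```

For this smooth example the ratio falls roughly in proportion to the step size, which is consistent with the limit in the definition of $o(dx)$.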

Similarly, let $Q$ be a function that maps $R^{M_1 \times N_1}$ to $R^{M_2 \times N_2}$, and assume it is element-wise differentiable. Then there exists a matrix $A_X$ such that  
$vec(Q(X+dX))=vec(Q(X))+A_Xvec(dX)+o(dX)$  
As before, the matrix case is handled by flattening everything with $vec$.

We define the differential of $Q$, $dQ(X;dX)$, to be the $M_2 \times N_2$ matrix such that  
$vec(dQ(X;dX))=A_Xvec(dX)$  
We also use $dQ_X.dX$ to denote $dQ(X;dX)$.

The derivatives and differentials are related through the following equivalence. For a scalar-valued $Q$,  
$dQ_X.dX=M \cdot dX$  
is equivalent to  
$\frac{\partial Q(X)}{\partial X}=M$
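As a hedged worked example of this equivalence, take $Q(X)=A\cdot(XB)$ with constant matrices $A$ and $B$ (both are assumptions introduced for illustration). Since $Q$ is linear in $X$, $dQ=A\cdot(dX\,B)$, and the inner-product identity $A\cdot(BC)=(AC^t)\cdot B$ rewrites this as $dQ=(AB^t)\cdot dX$, so the derivative should be $AB^t$. The sketch below compares that result with a finite-difference gradient.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((5, 4))
X = rng.standard_normal((3, 5))

def Q(X):
    # Scalar-valued example: Q(X) = A . (X B)
    return float(np.sum(A * (X @ B)))

# Read off M from dQ = M . dX, as derived in the text above.
grad_from_differential = A @ B.T

# Entry-by-entry central differences, following dQ/dX_ij.
eps = 1e-6
grad_numeric = np.zeros_like(X)
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        E = np.zeros_like(X)
        E[i, j] = eps
        grad_numeric[i, j] = (Q(X + E) - Q(X - E)) / (2 * eps)

print(np.allclose(grad_numeric, grad_from_differential, atol=1e-6))
```

The practical appeal of the differential form is that the gradient is read off without computing any individual partial derivative by hand.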

We have the following chain rule for differentials. If $Q(X)=G(F(X))$, then  
$dQ_X.dX=dG_F.dF_X.dX$  

In view of this, we can omit $dX$ and the subscripts and write  
$dQ=dG \cdot dF$  

Finally, we have the following identities, where $\alpha$ is a constant scalar:  
\begin{align}
d(\alpha X+Y)&=\alpha dX+dY\\
d(XY)&=(dX)Y+XdY\\
d(X\cdot Y)&=dX \cdot Y +X\cdot dY\\
d(X\bigodot Y)&=dX \bigodot Y +X \bigodot dY
\end{align}
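The product rules can be sanity-checked by comparing the exact change of a product with its first-order part. This is a sketch under assumed random inputs and small perturbations `dX`, `dY`, `dZ`; for these bilinear products the leftover term is exactly the second-order product of the perturbations.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((3, 4))
Y = rng.standard_normal((4, 2))
Z = rng.standard_normal((3, 4))          # same shape as X, for the Hadamard rule
dX = 1e-4 * rng.standard_normal((3, 4))
dY = 1e-4 * rng.standard_normal((4, 2))
dZ = 1e-4 * rng.standard_normal((3, 4))

# Matrix product: d(XY) = (dX) Y + X dY
exact_change = (X + dX) @ (Y + dY) - X @ Y
first_order = dX @ Y + X @ dY
remainder = exact_change - first_order   # equals dX dY, a second-order term

# Hadamard product: d(X ⊙ Z) = dX ⊙ Z + X ⊙ dZ
exact_change_h = (X + dX) * (Z + dZ) - X * Z
first_order_h = dX * Z + X * dZ
remainder_h = exact_change_h - first_order_h   # equals dX ⊙ dZ

print(np.linalg.norm(remainder), np.linalg.norm(remainder_h))
```

Both remainders are several orders of magnitude smaller than the perturbations themselves, which is what the first-order identities predict.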