## q1 - softmax

### (a) Prove that softmax is invariant to constant offsets in the input: $ softmax(x) = softmax(x + c) $

Note: In practice, we use this property and choose $c = -\max_i x_i$ when computing softmax probabilities, for numerical stability (i.e., we subtract the maximum element of $x$ from every element of $x$).

Proof:
$$
softmax(x + c)_i = \frac{e^{x_i + c}}{\sum_j e^{x_j + c}}
= \frac{e^c \cdot e^{x_i}}{e^c \cdot \sum_j e^{x_j}}
= \frac{e^{x_i}}{\sum_j e^{x_j}} = softmax(x)_i
$$
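
The invariance can also be checked numerically. The following is a minimal sketch, not the assignment's `q1_softmax.py`; the function name `softmax` and the test vectors are chosen here for illustration only.

```python
import numpy as np

def softmax(x):
    """Softmax over the last axis, shifted by the maximum for stability."""
    x = np.asarray(x, dtype=float)
    shifted = x - np.max(x, axis=-1, keepdims=True)  # c = -max_i x_i
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

x = np.array([1.0, 2.0, 3.0])

# Adding a constant offset leaves the probabilities unchanged.
print(np.allclose(softmax(x), softmax(x + 100.0)))

# Without the shift, np.exp(1000.0) would overflow to inf.
big = np.array([1000.0, 1001.0, 1002.0])
print(softmax(big))
```

Because of the shift, the largest exponent passed to `np.exp` is always 0, so the computation cannot overflow.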

### broadcasting in numpy

q1.b is a programming exercise; see `q1_softmax.py`. The examples below are a warm-up for numpy broadcasting.

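Arithmetic on numpy arrays is usually element-wise. Broadcasting lets arrays with different but compatible shapes be combined without materializing stretched copies, which avoids unnecessary data copying and makes the operation more efficient.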
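
One way to see that no copy is needed is `np.broadcast_to`, which exposes the stretched array as a read-only view. This is only an illustration of the idea; the array `x` and the target shape are arbitrary choices.

```python
import numpy as np

x = np.arange(4)

# The broadcast result is a view on x, not a new (3, 4) buffer.
view = np.broadcast_to(x, (3, 4))

print(view.shape)                  # (3, 4)
print(view.strides[0])             # 0: every row points at the same memory
print(np.shares_memory(view, x))   # True
```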

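The easiest way to understand the broadcasting rules is to write out the shapes of the two arrays right-aligned. Missing leading dimensions are treated as 1, and every dimension equal to 1 is stretched to match the other array. For example: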

```
A     :  4 x 1 x 5 x 1
B     :      3 x 1 x 2
Result:  4 x 3 x 5 x 2
```
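
The shape rule above can be confirmed directly. This is a small sketch; the arrays `A` and `B` are filled with zeros only because their contents do not matter here.

```python
import numpy as np

A = np.zeros((4, 1, 5, 1))
B = np.zeros((3, 1, 2))

# Shapes are aligned from the right; size-1 dimensions are stretched.
print((A + B).shape)           # (4, 3, 5, 2)

# np.broadcast computes the result shape without doing the addition.
print(np.broadcast(A, B).shape)
```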

In [1]:
import numpy as np

In [2]:
x = np.arange(4)
xx = x.reshape(4,1)
y = np.ones(5)
z = np.ones((3,4))

In [3]:
print(x.shape, y.shape)
x + y

((4,), (5,))


ValueError: operands could not be broadcast together with shapes (4,) (5,) 

In [4]:
print(xx.shape, y.shape)
xx + y

((4, 1), (5,))


array([[ 1.,  1.,  1.,  1.,  1.],
       [ 2.,  2.,  2.,  2.,  2.],
       [ 3.,  3.,  3.,  3.,  3.],
       [ 4.,  4.,  4.,  4.,  4.]])

In [5]:
print(x.shape, z.shape)
x + z

((4,), (3, 4))


array([[ 1.,  2.,  3.,  4.],
       [ 1.,  2.,  3.,  4.],
       [ 1.,  2.,  3.,  4.]])

#### Outer operation

Broadcasting provides a convenient way of taking outer operations (outer product, outer addition, etc.).

In [6]:
a = np.array([0., 10., 20., 30.])
b = np.array([1., 2., 3.])
a[:, np.newaxis] + b

array([[  1.,   2.,   3.],
       [ 11.,  12.,  13.],
       [ 21.,  22.,  23.],
       [ 31.,  32.,  33.]])
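
As a cross-check, the broadcast results can be compared with numpy's dedicated outer routines. This sketch reuses the values of `a` and `b` from the cell above; the names `outer_sum` and `outer_prod` are introduced here for illustration.

```python
import numpy as np

a = np.array([0., 10., 20., 30.])
b = np.array([1., 2., 3.])

# (4, 1) combined with (3,) broadcasts to (4, 3).
outer_sum = a[:, np.newaxis] + b
outer_prod = a[:, np.newaxis] * b

print(np.array_equal(outer_sum, np.add.outer(a, b)))
print(np.array_equal(outer_prod, np.outer(a, b)))
```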

## q2 - neural network basics

### (a) Derive the gradients of the sigmoid function and show that it can be rewritten as a function of the function value

$ \sigma(x) = \frac{1}{1 + e^{-x}} $

**Derivative:**

$ \sigma'(x) = - (1 + e^{-x})^{-2} (- e^{-x}) = \sigma(x) \frac{e^{-x}}{1 + e^{-x}} = \sigma(x) (1 - \sigma(x)) $
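
The identity can be sanity-checked against a central finite difference. This is a sketch under assumed names (`sigmoid`, `sigmoid_grad`) and an arbitrary step size `h`; it is not taken from the assignment code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(s):
    """Derivative of sigmoid, written in terms of its output s = sigmoid(x)."""
    return s * (1.0 - s)

x = np.linspace(-5.0, 5.0, 11)
h = 1e-5

# Central difference approximation of the derivative.
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
analytic = sigmoid_grad(sigmoid(x))

print(np.max(np.abs(numeric - analytic)))
```

Writing the gradient as a function of the output is convenient in backpropagation, where the forward-pass value is already available.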


### (b) Derive the gradient with regard to the inputs of a softmax function when cross entropy loss is used for evaluation

$$ \hat{y} = softmax(\theta) $$
$$ CE(y, \hat{y}) 
= -\sum_i y_i \log(\hat{y}_i)
= -\sum_i y_i \log(softmax(\theta)_i)
$$

* $y$ is a one-hot vector.
* Note that $\theta$ here is the input vector to softmax, not a parameter vector.

**Solution:**

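Let $ s(x) = softmax(x) $, and write $s(\theta_i)$ as shorthand for its $i$-th component $softmax(\theta)_i$.  
As with sigmoid, differentiating gives: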
$$ \frac{\partial{s(\theta_i)}}{\partial{\theta_i}} = s(\theta_i)\cdot(1 - s(\theta_i)) $$
$$ \frac{\partial{s(\theta_k)}}{\partial{\theta_i}} = -s(\theta_k)\cdot s(\theta_i),  k\neq i $$
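
Both cases together say that the Jacobian of softmax is $diag(s) - s s^T$. The sketch below compares that with finite differences; the helper names, the test vector `theta`, and the step `h` are illustrative assumptions.

```python
import numpy as np

def softmax(theta):
    e = np.exp(theta - np.max(theta))
    return e / e.sum()

def softmax_jacobian(theta):
    """J[k, i] = d s_k / d theta_i = s_k * (delta_ki - s_i)."""
    s = softmax(theta)
    return np.diag(s) - np.outer(s, s)

theta = np.array([0.5, -1.0, 2.0])
h = 1e-6

# Column i holds the central difference with respect to theta_i.
numeric = np.zeros((3, 3))
for i in range(3):
    step = np.zeros(3)
    step[i] = h
    numeric[:, i] = (softmax(theta + step) - softmax(theta - step)) / (2 * h)

print(np.allclose(numeric, softmax_jacobian(theta), atol=1e-8))
```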

Let the correct class be $k$ (i.e., $y_k = 1$). Then:
$$ CE(y, \hat{y}) 
= -\log(softmax(\theta)_k)
$$

When $k = i$:
$$ \frac{\partial{CE(y, \hat{y})}}{\partial{\theta_i}}
= -\frac{\partial{s(\theta_i)}/\partial{\theta_i}}{s(\theta_i)}
= s(\theta_i) - 1 = \hat{y}_i - 1
$$
When $k \neq i$:
$$ \frac{\partial{CE(y, \hat{y})}}{\partial{\theta_i}}
= -\frac{\partial{s(\theta_k)}/\partial{\theta_i}}{s(\theta_k)}
= -\frac{-s(\theta_k)\cdot s(\theta_i)}{s(\theta_k)}
= s(\theta_i) = \hat{y}_i
$$
In summary, since $y_i = 1$ for $i = k$ and $y_i = 0$ otherwise,
$$ \frac{\partial{CE(y, \hat{y})}}{\partial{\theta}}
= \hat{y} - y
$$
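
A numerical gradient check is a quick way to confirm this result. The following is a sketch with assumed helper names (`softmax`, `cross_entropy`) and arbitrary test values for `theta` and `y`.

```python
import numpy as np

def softmax(theta):
    e = np.exp(theta - np.max(theta))
    return e / e.sum()

def cross_entropy(y, theta):
    return -np.sum(y * np.log(softmax(theta)))

theta = np.array([0.5, -1.0, 2.0])
y = np.array([0.0, 1.0, 0.0])  # one-hot label, correct class k = 1

# Closed-form gradient derived above.
analytic = softmax(theta) - y

# Central finite differences, one coordinate at a time.
h = 1e-6
numeric = np.zeros_like(theta)
for i in range(theta.size):
    step = np.zeros_like(theta)
    step[i] = h
    numeric[i] = (cross_entropy(y, theta + step)
                  - cross_entropy(y, theta - step)) / (2 * h)

print(np.allclose(numeric, analytic, atol=1e-7))
```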



### (c) Gradient of the cross-entropy loss for a single-hidden-layer neural network





### (d) Number of parameters

$ (D_x + 1) \cdot H + (H + 1) \cdot D_y $
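
The count can be reproduced by summing the sizes of the weight and bias arrays. This sketch assumes the reading that the formula suggests: `Dx` is the input dimension, `H` the number of hidden units, `Dy` the output dimension, and each `+ 1` is a bias term. The names `W1`, `b1`, `W2`, `b2` and the sizes used are hypothetical.

```python
import numpy as np

def num_params(Dx, H, Dy):
    """Parameter count of a one-hidden-layer network with biases."""
    return (Dx + 1) * H + (H + 1) * Dy

Dx, H, Dy = 10, 5, 10

# Hypothetical parameter arrays with the assumed shapes.
W1, b1 = np.zeros((Dx, H)), np.zeros(H)
W2, b2 = np.zeros((H, Dy)), np.zeros(Dy)
total = W1.size + b1.size + W2.size + b2.size

print(num_params(Dx, H, Dy), total)  # 115 115
```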

## q3 - word2vec

## q4 - sentiment analysis