In [1]:
import numpy as np
print('numpy version:', np.__version__)

import matplotlib as mpl
print('matplotlib version:', mpl.__version__)
import matplotlib.pyplot as plt


numpy version: 1.23.5
matplotlib version: 3.6.2


### 1. Neural Networks

In this problem we will analyze a simple neural network to understand its classification properties. Consider the neural network given in the figure below, with **ReLU activation functions (denoted by $f$) on all neurons, and a softmax activation function in the output layer**:  
<img src="images/homework3_nn.png" alt="network architecture" width="400" height="250">  
Given an input $x=[x_1, x_2]^T$, the hidden units in the network are activated in stages as described by the following equations:
$$
z_i = x_1W_{1i} + x_2W_{2i} + W_{0i}, \  f(z_i) = \max\{z_i, 0\}, \  i=1..4  \\
u_j = \sum_i{f(z_i)V_{ij}}+V_{0j}, \  f(u_j) = \max\{u_j, 0\}, \  j=1..2
$$
The final output of the network is obtained by applying the **softmax** function to the last hidden layer,  
$$
o_j = \frac{e^{f(u_j)}}{\sum_k{e^{f(u_k)}}}, \  j=1..2
$$
In this problem, we will consider the following setting of parameters:
$$
\begin{bmatrix}
W_{11} & W_{21} & W_{01} \\ W_{12} & W_{22} & W_{02} \\
W_{13} & W_{23} & W_{03} \\ W_{14} & W_{24} & W_{04}
\end{bmatrix} = 
\begin{bmatrix}
1 & 0 & -1 \\ 0 & 1 & -1 \\
-1 & 0 & -1 \\ 0 & -1 & -1
\end{bmatrix}
$$
  
$$
\begin{bmatrix}
V_{11} & V_{21} & V_{31} & V_{41} & V_{01} \\
V_{12} & V_{22} & V_{32} & V_{42} & V_{02}
\end{bmatrix} = 
\begin{bmatrix}
1 & 1 & 1 & 1 & 0 \\
-1 & -1 & -1 & -1 & 2
\end{bmatrix}
$$
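
The equations and parameter values above can be collected into a single forward pass. The following is a minimal sketch, not part of the original assignment: the helper name `forward` is hypothetical, and it assumes the last column of each parameter matrix holds the bias ($W_{0i}$ and $V_{0j}$), as in the matrices above. The shift by the maximum inside the softmax is a numerical-stability choice and does not change the result.

```python
import numpy as np

# Parameters copied from the matrices above; the last column is the bias term.
W = np.array([[1, 0, -1], [0, 1, -1], [-1, 0, -1], [0, -1, -1]])
V = np.array([[1, 1, 1, 1, 0], [-1, -1, -1, -1, 2]])

def forward(x, W, V):
    """Hypothetical helper: map an input [x_1, x_2] to the outputs [o_1, o_2]."""
    x = np.asarray(x, dtype=float)
    z = W[:, :2] @ x + W[:, 2]      # z_i = x_1 W_1i + x_2 W_2i + W_0i
    fz = np.maximum(z, 0.0)         # ReLU on the first hidden layer
    u = V[:, :4] @ fz + V[:, 4]     # u_j = sum_i f(z_i) V_ij + V_0j
    fu = np.maximum(u, 0.0)         # ReLU on the second hidden layer
    e = np.exp(fu - fu.max())       # softmax, shifted by the max for stability
    return e / e.sum()

print(forward([3, 14], W, V))
```

Keeping the biases as a separate column slice avoids having to append a constant 1 to the input of each layer.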

In [2]:
W = np.array([1, 0, -1, 0, 1, -1, -1, 0, -1, 0, -1, -1]).reshape(4, -1)
W

array([[ 1,  0, -1],
       [ 0,  1, -1],
       [-1,  0, -1],
       [ 0, -1, -1]])

In [3]:
V = np.array([1, 1, 1, 1, 0, -1, -1, -1, -1, 2]).reshape(2, -1)
V

array([[ 1,  1,  1,  1,  0],
       [-1, -1, -1, -1,  2]])

In [4]:
def relu(x):
    # Element-wise max(x, 0); works for scalars and arrays alike.
    return np.maximum(x, 0)

def softmax(x):
    return np.exp(x) / np.sum(np.exp(x))
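
The `softmax` above exponentiates its input directly, which is fine for the small values used in this problem but overflows for large entries. A common remedy is to subtract the maximum before exponentiating, which leaves the result mathematically unchanged. The sketch below illustrates this under the hypothetical name `softmax_stable`; it is an optional variant, not the function used in the cells that follow.

```python
import numpy as np

def softmax_stable(x):
    """Softmax over all entries of x, shifted by max(x) to avoid overflow."""
    x = np.asarray(x, dtype=float)
    e = np.exp(x - np.max(x))   # largest exponent is exp(0) = 1
    return e / np.sum(e)

print(softmax_stable([-2, 0, 4]))
print(softmax_stable([1000, 0]))   # the unshifted version would overflow here
```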

In [5]:
print(relu(-3), relu(0), relu(2))

0 0 2


In [6]:
tmp = softmax(np.array([-2, 0, 4]))
print(tmp, tmp.sum())

[0.00242826 0.01794253 0.97962921] 1.0


In [7]:
x = np.array([3, 14])
np.concatenate([x, [1]]).reshape(-1, 1)

array([[ 3],
       [14],
       [ 1]])

In [8]:
W @ np.concatenate([x, [1]]).reshape(-1, 1)

array([[  2],
       [ 13],
       [ -4],
       [-15]])

In [9]:
hidden_output = relu(W @ np.concatenate([x, [1]]).reshape(-1, 1))
hidden_output

array([[ 2],
       [13],
       [ 0],
       [ 0]])

In [10]:
output_in = np.concatenate([hidden_output, np.ones((1, hidden_output.shape[1]))], axis=0)
output_in

array([[ 2.],
       [13.],
       [ 0.],
       [ 0.],
       [ 1.]])

In [11]:
V @ output_in

array([[ 15.],
       [-13.]])

In [12]:
relu(V @ output_in)

array([[15.],
       [ 0.]])

In [13]:
softmax(relu(V @ output_in))

array([[9.99999694e-01],
       [3.05902227e-07]])

In [14]:
softmax(np.array([0, 2]))

array([0.11920292, 0.88079708])

In [15]:
softmax(np.array([3, 0]))

array([0.95257413, 0.04742587])

In [16]:
np.log(999) / 3

2.3022515928828513

### 2. LSTM

The diagram below shows a single LSTM unit that consists of Input, Output, and Forget gates.  
<img src="images/homework3_lstm_scheme.png" alt="lstm unit scheme" width="300" height="250">  
The behavior of such a unit as a recurrent neural network is specified by a set of update equations. These equations define how the gates, the "memory cell" $c_t$, and the "visible state" $h_t$ are updated in response to the input $x_t$ and the previous states $c_{t-1}$ and $h_{t-1}$.  
For the LSTM unit,
$$
f_t = \text{sigmoid}(W^{f, h}h_{t-1} + W^{f, x}x_t + b_f) \\
i_t = \text{sigmoid}(W^{i, h}h_{t-1} + W^{i, x}x_t + b_i) \\
o_t = \text{sigmoid}(W^{o, h}h_{t-1} + W^{o, x}x_t + b_o) \\
c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W^{c,h}h_{t-1} + W^{c,x}x_t + b_c) \\
h_t = o_t \odot \tanh(c_t)
$$
where $\odot$ stands for element-wise multiplication. The adjustable parameters in this unit are the weight matrices $W$ and the offset (bias) vectors $b$. Changing these parameters changes how the unit evolves as a function of the inputs $x_t$.  
To keep things simple, in this problem we assume that $x_t$, $c_t$ and $h_t$ are all scalars. Concretely, suppose that the parameters are given by  
$$
\begin{matrix}
W^{f,h}=0 & W^{f, x}=0 & b_f=-100 & W^{c,h}=-100 \\
W^{i,h}=0 & W^{i,x}=100 & b_i=100 & W^{c,x}=50 \\
W^{o,h}=0 & W^{o,x}=100 & b_o=0 & b_c=0
\end{matrix}
$$
We run this unit with initial conditions $h_{-1}=0$ and $c_{-1}=0$ on the input sequence $[0, 0, 1, 1, 1, 0]$ (that is, $x_0=0$, $x_1=0$, $x_2=1$, and so on).
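
The update equations above can also be evaluated directly, without any saturation shortcuts. The following is a minimal sketch under stated assumptions: the dictionary `P` and the helpers `lstm_step` and `run` are hypothetical names, the parameter values are copied from the table above, and $h_t$ is rounded to the nearest integer after every step before it is fed back in.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Scalar parameters from the table above, stored under hypothetical keys.
P = dict(Wfh=0, Wfx=0, bf=-100,
         Wih=0, Wix=100, bi=100,
         Woh=0, Wox=100, bo=0,
         Wch=-100, Wcx=50, bc=0)

def lstm_step(h_prev, c_prev, x, p=P):
    """One exact LSTM update for scalar x, h, and c."""
    f = sigmoid(p["Wfh"] * h_prev + p["Wfx"] * x + p["bf"])   # forget gate
    i = sigmoid(p["Wih"] * h_prev + p["Wix"] * x + p["bi"])   # input gate
    o = sigmoid(p["Woh"] * h_prev + p["Wox"] * x + p["bo"])   # output gate
    c = f * c_prev + i * math.tanh(p["Wch"] * h_prev + p["Wcx"] * x + p["bc"])
    h = o * math.tanh(c)
    return h, c

def run(xs, h=0, c=0):
    """Feed the sequence xs through the unit, rounding h after each step."""
    hs = []
    for x in xs:
        h, c = lstm_step(h, c, x)
        h = round(h)            # round h_t to the closest integer
        hs.append(h)
    return hs

print(run([0, 0, 1, 1, 1, 0]))
```

Because every pre-activation here has magnitude at most 200, `math.exp` stays within floating-point range and no clipping is needed.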

In [17]:
# calculate the sequence h_0, h_1, ... h_5 (round h_i to closest integer on every step)
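def calc_f_t(h, x):
    # Pre-activation is 0*h + 0*x - 100 = -100 for any input,
    # and sigmoid(-100) is about 3.7e-44, so f_t is treated as 0.
    return 0

def calc_i_t(h, x):
    # Saturation shortcut: for x in {0, 1} lin is 100 or 200, so the sigmoid is effectively 1.
    lin = 0*h + 100*x + 100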
    if lin <= -1:
        return 0
    elif lin >= 1:
        return 1
    else:
        return 1 / (1 + np.exp(-lin))
    
def calc_o_t(h, x):
    lin = 0*h + 100*x
    if lin <= -1:
        return 0
    elif lin >= 1:
        return 1
    else:
        return 1 / (1 + np.exp(-lin))
    
def calc_c_t(c, f, i, h, x):
    lin = -100*h + 50*x
    if lin <= -1:
        lin = -1
    elif lin >= 1:
        lin = 1
    else:
        lin = np.tanh(lin)
    return f*c + i*lin

def calc_h_t(o, c):
    tmp = np.tanh(c)
    if tmp <= -1:
        tmp = -1
    elif tmp >= 1:
        tmp = 1
    return o * tmp

def take_step(h, c, x):
    f_t = calc_f_t(h, x)
    i_t = calc_i_t(h, x)
    o_t = calc_o_t(h, x)
    c_t = calc_c_t(c, f_t, i_t, h, x)
    h_t = calc_h_t(o_t, c_t)
    print('Parameter values:')
    print('f_t = ', f_t)
    print('i_t = ', i_t)
    print('o_t = ', o_t)
    print('c_t = ', c_t)
    print('h_t = ', h_t)
    return (h_t, c_t)

In [18]:
# calculating sequence for x = [0, 0, 1, 1, 1, 0]
take_step(0, 0, 0)

Parameter values:
f_t =  0
i_t =  1
o_t =  0.5
c_t =  0.0
h_t =  0.0


(0.0, 0.0)

In [19]:
take_step(0, 0, 0)

Parameter values:
f_t =  0
i_t =  1
o_t =  0.5
c_t =  0.0
h_t =  0.0


(0.0, 0.0)

In [20]:
take_step(0, 0, 1)

Parameter values:
f_t =  0
i_t =  1
o_t =  1
c_t =  1
h_t =  0.7615941559557649


(0.7615941559557649, 1)

In [21]:
take_step(1, 1, 1)

Parameter values:
f_t =  0
i_t =  1
o_t =  1
c_t =  -1
h_t =  -0.7615941559557649


(-0.7615941559557649, -1)

In [22]:
take_step(-1, -1, 1)

Parameter values:
f_t =  0
i_t =  1
o_t =  1
c_t =  1
h_t =  0.7615941559557649


(0.7615941559557649, 1)

In [23]:
take_step(1, 1, 0)

Parameter values:
f_t =  0
i_t =  1
o_t =  0.5
c_t =  -1
h_t =  -0.3807970779778824


(-0.3807970779778824, -1)

In [24]:
def take_several_steps(h_0, c_0, X):
    h = h_0
    c = c_0
    for i, x in enumerate(X):
        print('STEP #', i+1)
        h, c = take_step(h, c, x)
        if h >= -0.5 and h <= 0.5:
            h = 0
        else:
            h = np.round(h, 0)
        print('Corrected value of h = ', h)

In [25]:
X = [1, 1, 0, 1, 1]
take_several_steps(0, 0, X)

STEP # 1
Parameter values:
f_t =  0
i_t =  1
o_t =  1
c_t =  1
h_t =  0.7615941559557649
Corrected value of h =  1.0
STEP # 2
Parameter values:
f_t =  0
i_t =  1
o_t =  1
c_t =  -1
h_t =  -0.7615941559557649
Corrected value of h =  -1.0
STEP # 3
Parameter values:
f_t =  0
i_t =  1
o_t =  0.5
c_t =  1
h_t =  0.3807970779778824
Corrected value of h =  0
STEP # 4
Parameter values:
f_t =  0
i_t =  1
o_t =  1
c_t =  1
h_t =  0.7615941559557649
Corrected value of h =  1.0
STEP # 5
Parameter values:
f_t =  0
i_t =  1
o_t =  1
c_t =  -1
h_t =  -0.7615941559557649
Corrected value of h =  -1.0


### 3. Backpropagation

One of the key steps for training multi-layer neural networks is stochastic gradient descent. We will use the back-propagation algorithm to compute the gradient of the loss function with respect to the model parameters.  
Consider the L-layer neural network below:  
<img src="images/homework3_nn_backprop.png" alt="nn scheme" width="400" height="250">  
In the following problems, we will use the following notation: $b_j^l$ is the bias of the j-th neuron in the l-th layer, $a_j^l$ is the activation of the j-th neuron in the l-th layer, and $w_{jk}^l$ is the weight for the connection from the k-th neuron in the (l-1)-th layer to the j-th neuron in the l-th layer.  
If the activation function is $f$ and the loss function we are minimizing is C, then the equations describing the network are:  
$$
a_j^l = f(\sum_k{w_{jk}^l a_k^{l-1}} + b_j^l) \\
\text{Loss} = C(a^L)
$$
Note that symbols without subscripts denote the corresponding vector or matrix, so that $a^l$ is the activation vector of the l-th layer and $w^l$ is the weight matrix of the l-th layer, for $l=1,..,L$.
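
To make the notation above concrete, here is a minimal back-propagation sketch. It rests on assumptions the text leaves open: the activation $f$ is taken to be the sigmoid, the loss is taken to be $C = \frac{1}{2}\lVert a^L - y \rVert^2$ for a target $y$, and the function names `forward` and `backprop` are hypothetical. The list entry `weights[l]` plays the role of $w^{l+1}$ because Python lists are zero-indexed.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(weights, biases, x):
    """Return pre-activations z^l and activations a^l, with a^0 = x."""
    activations, zs = [np.asarray(x, dtype=float)], []
    for w, b in zip(weights, biases):
        z = w @ activations[-1] + b      # z_j = sum_k w_jk a_k + b_j
        zs.append(z)
        activations.append(sigmoid(z))
    return zs, activations

def backprop(weights, biases, x, y):
    """Gradients of C = 0.5 * ||a^L - y||^2 with respect to every w^l and b^l."""
    zs, activations = forward(weights, biases, x)
    grad_w = [None] * len(weights)
    grad_b = [None] * len(biases)
    # Output-layer error: delta^L = (a^L - y) * f'(z^L), with f' = a (1 - a) for sigmoid.
    a_out = activations[-1]
    delta = (a_out - y) * a_out * (1 - a_out)
    for l in range(len(weights) - 1, -1, -1):
        grad_b[l] = delta                            # dC/db_j = delta_j
        grad_w[l] = np.outer(delta, activations[l])  # dC/dw_jk = delta_j * a_k
        if l > 0:
            a_prev = activations[l]
            # Propagate the error one layer back through the weights.
            delta = (weights[l].T @ delta) * a_prev * (1 - a_prev)
    return grad_w, grad_b

# Usage on a small random 2-3-2 network (values are arbitrary).
rng_demo = np.random.default_rng(0)
demo_w = [rng_demo.normal(size=(3, 2)), rng_demo.normal(size=(2, 3))]
demo_b = [rng_demo.normal(size=3), rng_demo.normal(size=2)]
demo_gw, demo_gb = backprop(demo_w, demo_b, [0.5, -1.0], np.array([1.0, 0.0]))
print([g.shape for g in demo_gw], [g.shape for g in demo_gb])
```

A finite-difference comparison on one or two parameters is a quick way to check an implementation like this one.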

In [26]:
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

In [27]:
X = np.array([-2, 0, 1, 4])
print(sigmoid(X)*(1-sigmoid(X)))

[0.10499359 0.25       0.19661193 0.01766271]


In [28]:
sigmoid(-1.15)

0.24048908305088898

In [29]:
(sigmoid(-1.15) - 1)**2 / 2

0.28842841648243966

In [30]:
# dC / db
z_2 = -1.15
(sigmoid(z_2)-1)*sigmoid(z_2)*(1-sigmoid(z_2))

-0.13872777081136367

In [31]:
# dC / dw_2
(sigmoid(z_2)-1)*sigmoid(z_2)*(1-sigmoid(z_2))*0.03

-0.00416183312434091

In [32]:
# dC / dw_1
(sigmoid(z_2)-1)*sigmoid(z_2)*(1-sigmoid(z_2))*(-5)*3

2.0809165621704553

### 4. Word Embeddings