<a href="https://colab.research.google.com/github/singwang-cn/Neural-Network/blob/master/aml2022_report.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advanced Machine Learning (2022) Final Report Assignment

Answer Questions 1 to 4 (in either Japanese or English). Submit the report in either PDF (.pdf) or Jupyter Notebook (.ipynb) format.

## Question 1 (50 points)

Consider a convolutional neural network (CNN) that predicts a label $\hat{y} \in \{0, 1\}$ for a given sentence $\boldsymbol{X} \in \mathbb{R}^{d \times T}$. Here, a sentence is represented by a matrix $\boldsymbol{X} = (\boldsymbol{x}_1, \boldsymbol{x}_2, \dots, \boldsymbol{x}_T)$ consisting of a concatenation of $T$ word embeddings, $\boldsymbol{x}_1, \boldsymbol{x}_2, \dots, \boldsymbol{x}_T \in \mathbb{R}^d$, where $d$ is the size of word embeddings, and $T$ is the number of words in the sentence.

The following equations define the whole architecture of the CNN.

\begin{align}
\hat{y} &= \begin{cases}
1 & (0.5 < p) \\
0 & (p \leq 0.5)
\end{cases} \\
p &= \sigma(\boldsymbol{v}^\top \boldsymbol{s}) \\
\boldsymbol{s} &= \max(\boldsymbol{c}_1, \dots, \boldsymbol{c}_{T-\delta+1}) \\
\boldsymbol{c}_t &= {\rm ReLU}(\boldsymbol{W} \boldsymbol{x}_{t:t+\delta-1} + \boldsymbol{b}) & (\forall t \in \{1, \dots, T-\delta+1\}) \\
\boldsymbol{x}_{t:t+\delta-1} &= \boldsymbol{x}_{t} \oplus \boldsymbol{x}_{t+1} \oplus \dots \oplus \boldsymbol{x}_{t+\delta-1}
\end{align}

Here:

+ $\boldsymbol{W} \in \mathbb{R}^{m \times \delta d}$, $\boldsymbol{b} \in \mathbb{R}^m, \boldsymbol{v} \in \mathbb{R}^m$ are the model parameters;
+ $m$ denotes the number of output channels of the CNN;
+ $\delta$ denotes the width (kernel size) of the convolution;
+ $\sigma(\cdot)$ denotes the standard sigmoid function;
+ $\max(\cdot)$ represents the max pooling operation, taken element-wise over the vectors $\boldsymbol{c}_t$;
+ ${\rm ReLU}(\cdot)$ denotes the ReLU activation function;
+ $\oplus$ represents a concatenation of vectors.
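
The equations above can be sketched directly in NumPy. This is a minimal illustration rather than a reference implementation: the function name `cnn_forward` and the `*_demo` variables are hypothetical, the arrays follow the layout of the equations ($\boldsymbol{X}$ is $d \times T$, $\boldsymbol{W}$ is $m \times \delta d$), and the demo input is random rather than taken from the assignment.

```python
import numpy as np

def cnn_forward(X, W, b, v, delta):
    """Forward pass of the CNN defined by the equations above.

    X: (d, T) matrix whose columns are word embeddings.
    W: (m, delta*d), b: (m,), v: (m,).
    Returns (C, s, p): hidden vectors c_t as rows, pooled vector, probability.
    """
    d, T = X.shape
    # x_{t:t+delta-1}: concatenate delta consecutive columns of X.
    windows = [X[:, t:t + delta].T.reshape(-1) for t in range(T - delta + 1)]
    # c_t = ReLU(W x_{t:t+delta-1} + b), stacked as the rows of C.
    C = np.stack([np.maximum(W @ x + b, 0.0) for x in windows])
    # Max pooling over time, taken separately for each output channel.
    s = C.max(axis=0)
    # p = sigmoid(v^T s)
    p = 1.0 / (1.0 + np.exp(-(v @ s)))
    return C, s, p

# Demo on random data (d=3, T=4, m=2, delta=2); values are illustrative only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(3, 4))
W_demo = rng.normal(size=(2, 6))
b_demo = np.zeros(2)
v_demo = np.ones(2)

C_demo, s_demo, p_demo = cnn_forward(X_demo, W_demo, b_demo, v_demo, delta=2)
y_hat_demo = int(p_demo > 0.5)  # predicted label: 1 if 0.5 < p, else 0
print(C_demo.shape, s_demo.shape, y_hat_demo)
```

Keeping the arrays in the same orientation as the equations makes each line map onto one equation, at the cost of an explicit loop over time steps.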

Setting the hyperparameters $d=3, m=2, \delta=2$, we initialize the model parameters as follows.

\begin{align}
\boldsymbol{W} &= \begin{pmatrix}
-3 & -2 & -1 & -1 & -2 & -3 \\
3 & 2 & 3 & 2 & 3 & 2
\end{pmatrix} \\
\boldsymbol{b} &= \begin{pmatrix}
-0.2 \\ 0.1
\end{pmatrix} \\
\boldsymbol{v} &= \begin{pmatrix}
-1 \\ 2
\end{pmatrix}
\end{align}

Suppose that we give a negative ($y=0$) training instance with the sentence ($T = 5$),

\begin{align}
\boldsymbol{X} &= \begin{pmatrix}
-0.3 & 0 & 0.1 & 0 & 0 \\
-0.2 & -0.1 & 0 & 0.1 & 0 \\
-0.1 & -0.2 & 0.1 & 0 & 0.1
\end{pmatrix} ,
\end{align}
to the CNN model. Answer the following questions.

**(1)** Find the value of the vector $\boldsymbol{x}_{3:4}$.

In [238]:
import numpy as np
import torch
import torch.nn as nn
from torch.nn import functional as F

In [239]:
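d = 3      # size of word embeddings
m = 2      # number of output channels
delta = 2  # width (kernel size) of the convolution

# W and X are stored transposed relative to the problem statement:
# W has shape (delta*d, m) and each row of X is one word embedding,
# so W x + b is computed as x.dot(W) + b for a row-vector window x.
W = np.array([[-3, 3],
              [-2, 2],
              [-1, 3],
              [-1, 2],
              [-2, 3],
              [-3, 2]])
b = np.array([-0.2, 0.1])
v = np.array([-1., 2.])
X = np.array([[-0.3, -0.2, -0.1],
              [0, -0.1, -0.2],
              [0.1, 0, 0.1],
              [0, 0.1, 0],
              [0, 0, 0.1]])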

In [240]:
#Answer (1)
x_34 = np.concatenate(X[2:4], axis=0)
print('x_(3:4) = ', x_34)

x_(3:4) =  [0.1 0.  0.1 0.  0.1 0. ]


**(2)** Find the values of the hidden vectors $\boldsymbol{c}_1, \boldsymbol{c}_2, \boldsymbol{c}_3, \boldsymbol{c}_4$.

In [241]:
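# Each row of the concatenated array is one window x_{t:t+1}; this slicing trick is only valid for delta = 2.
C = F.relu(torch.Tensor(np.concatenate((X[:-(delta-1)], X[(delta-1):]), axis=1).dot(W)+b))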

In [242]:
#Answer (2)
for i in range(C.shape[0]):
  print('c[' + str(i+1) + '] =', C[i].numpy())

c[1] = [2. 0.]
c[2] = [0. 0.]
c[3] = [0. 1.]
c[4] = [0.  0.5]
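
As a hand check of the first hidden vector, the window $\boldsymbol{x}_{1:2}$ is the concatenation of the first two columns of $\boldsymbol{X}$, and the definition of $\boldsymbol{c}_t$ gives:

\begin{align}
\boldsymbol{x}_{1:2} &= (-0.3, -0.2, -0.1, 0, -0.1, -0.2)^\top \\
\boldsymbol{W} \boldsymbol{x}_{1:2} + \boldsymbol{b} &= \begin{pmatrix} 2.2 \\ -2.3 \end{pmatrix} + \begin{pmatrix} -0.2 \\ 0.1 \end{pmatrix} = \begin{pmatrix} 2.0 \\ -2.2 \end{pmatrix} \\
\boldsymbol{c}_1 &= {\rm ReLU}\begin{pmatrix} 2.0 \\ -2.2 \end{pmatrix} = \begin{pmatrix} 2 \\ 0 \end{pmatrix}
\end{align}

The remaining vectors $\boldsymbol{c}_2, \boldsymbol{c}_3, \boldsymbol{c}_4$ follow in the same way from $\boldsymbol{x}_{2:3}, \boldsymbol{x}_{3:4}, \boldsymbol{x}_{4:5}$.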


**(3)** Find the value of the vector $\boldsymbol{s}$.


In [243]:
#Answer (3)
# Max pooling over time: take the maximum of each output channel across c_1, ..., c_{T-delta+1}.
s = C.numpy().max(axis=0)
print('s =', s)

s = [2. 1.]


**(4)** Find the value of $p$.

In [244]:
def sigmoid_func(x):
  return 1 / (1 + np.exp(-x))

In [245]:
#Answer (4)
p = sigmoid_func(v.dot(s))
print('p =', p)

p = 0.5


**(5)** Write the formula of the binary cross-entropy loss between the correct label $y$ and the probability estimate $p$.

Answer (5)

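\begin{equation*}
Loss = -\frac{1}{N}\sum_{i=1}^N \left[ y_i\log p_i+(1-y_i)\log (1-p_i) \right]
\end{equation*}
where $N$ is the number of training instances, $y_i$ is the correct label of instance $i$, and $p_i$ is its predicted probability. For a single instance ($N = 1$) this reduces to $Loss = -y\log p - (1-y)\log (1-p)$.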

In [246]:
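def bceloss_func(y, p):
  # Accept scalars or arrays: promote both to 1-D float arrays before averaging.
  y = np.atleast_1d(np.asarray(y, dtype=float))
  p = np.atleast_1d(np.asarray(p, dtype=float))
  loss_array = y*np.log(p) + (1-y)*np.log(1-p)
  return -loss_array.sum(axis=0) / y.shape[0]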

**(6)** Compute the loss value by using the formula of (5) for the training instance.

In [253]:
#Answer (6)
loss = bceloss_func(np.array([0]), p)
print('loss =',loss)

loss = 0.6931471805599453


**(7)** Compute the gradient of the loss function with respect to $\boldsymbol{v}$ for the training instance.

In [248]:
class build_model(nn.Module):
  def __init__(self, W, b, v, kernel_size=2, stride=2) -> None:
    super().__init__()
    # kernel_size and stride are kept so existing calls still work, but they are unused:
    # the max pooling in the model definition runs over all time steps.
    # W is stored as (delta*d, m), so W.T matches nn.Linear's (out, in) weight layout.
    self.linear1 = nn.Linear(W.shape[0], W.shape[1], bias=True)
    self.linear1.weight.data = torch.Tensor(W.T)
    self.linear1.bias.data = torch.Tensor(b)

    # A 1-D weight makes linear2 return one scalar per instance, i.e. v^T s.
    self.linear2 = nn.Linear(W.shape[1], 1, bias=False)
    self.linear2.weight.data = torch.Tensor(v)
    self.relu = nn.ReLU()
    self.sigmoid = nn.Sigmoid()

  def forward(self, X):
    # X: (batch, T-delta+1, delta*d) -> C: (batch, T-delta+1, m)
    C = self.relu(self.linear1(X))
    # Max pooling over time, separately for each output channel -> (batch, m)
    S = C.max(dim=1).values
    return self.sigmoid(self.linear2(S))

In [249]:
# Add a leading batch dimension: (1, T-delta+1, delta*d)
X_T = torch.Tensor(np.concatenate((X[:-(delta-1)], X[(delta-1):]), axis=1)[np.newaxis])
model = build_model(W, b, v, 2, 2)
Y_T = model(X_T)

In [250]:
Y = torch.Tensor([0])
cost_func = nn.BCELoss()
loss = cost_func(Y_T, Y)
loss.backward()

In [251]:
#Answer (7)
v_grad = model.linear2.weight.grad.numpy()
print('the gradient of the loss function with respect to v:\n', v_grad)

the gradient of the loss function with respect to v:
 [1.  0.5]
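
The same value follows from the chain rule. Writing $L$ for the binary cross-entropy loss of this instance, $z = \boldsymbol{v}^\top \boldsymbol{s}$ and $p = \sigma(z)$, we have $\partial L / \partial z = p - y$, so:

\begin{align}
\frac{\partial L}{\partial \boldsymbol{v}} &= (p - y)\, \boldsymbol{s} = (0.5 - 0) \begin{pmatrix} 2 \\ 1 \end{pmatrix} = \begin{pmatrix} 1 \\ 0.5 \end{pmatrix}
\end{align}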


**(8)** Compute the gradients of the loss function with respect to $\boldsymbol{W}$ for the training instance.

In [252]:
#Answer (8)
w_grad = model.linear1.weight.grad.numpy()
print('the gradient of the loss function with respect to W:\n', w_grad)

the gradient of the loss function with respect to W:
 [[0.15 0.1  0.05 0.   0.05 0.1 ]
 [0.1  0.   0.1  0.   0.1  0.  ]]
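
These gradients can also be cross-checked without autograd. The sketch below applies the chain rule by hand in NumPy, assuming the array layout of the equations ($\boldsymbol{W}$ is $m \times \delta d$, columns of $\boldsymbol{X}$ are embeddings). The function name `manual_gradients` and the `*_eq` variable names are hypothetical, chosen to avoid clashing with the transposed arrays used in the cells above. Max pooling passes the gradient only to the time step that attains the maximum in each channel, and ReLU passes it only where the pre-activation is positive.

```python
import numpy as np

def manual_gradients(X, W, b, v, y, delta):
    """Hand-derived gradients of the BCE loss w.r.t. v and W for one instance."""
    d, T = X.shape
    # Rows are the windows x_{t:t+delta-1}; shape (T-delta+1, delta*d).
    windows = np.stack([X[:, t:t + delta].T.reshape(-1)
                        for t in range(T - delta + 1)])
    pre = windows @ W.T + b          # pre-activations, one row per time step
    C = np.maximum(pre, 0.0)         # c_t = ReLU(W x + b)
    t_star = C.argmax(axis=0)        # time step selected by max pooling, per channel
    s = C.max(axis=0)
    p = 1.0 / (1.0 + np.exp(-(v @ s)))

    dz = p - y                       # dL/dz for sigmoid followed by BCE
    grad_v = dz * s                  # dL/dv = (p - y) s
    ds = dz * v                      # dL/ds = (p - y) v
    # ReLU lets the gradient through only where the pre-activation is positive.
    active = pre[t_star, np.arange(len(v))] > 0
    # Row j of dL/dW is dL/ds_j times the window that produced s_j.
    grad_W = (ds * active)[:, None] * windows[t_star]
    return grad_v, grad_W

# Values from the problem statement, in the layout of the equations.
W_eq = np.array([[-3, -2, -1, -1, -2, -3],
                 [3, 2, 3, 2, 3, 2]], dtype=float)
b_eq = np.array([-0.2, 0.1])
v_eq = np.array([-1.0, 2.0])
X_eq = np.array([[-0.3, 0.0, 0.1, 0.0, 0.0],
                 [-0.2, -0.1, 0.0, 0.1, 0.0],
                 [-0.1, -0.2, 0.1, 0.0, 0.1]])

grad_v, grad_W = manual_gradients(X_eq, W_eq, b_eq, v_eq, y=0, delta=2)
print(grad_v)
print(grad_W)
```

For this instance the maxima come from $\boldsymbol{c}_1$ (first channel) and $\boldsymbol{c}_3$ (second channel), so the first row of the gradient is $-0.5\, \boldsymbol{x}_{1:2}^\top$ and the second row is $\boldsymbol{x}_{3:4}^\top$.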


## Question 2 (20 points)

Name two datasets that can be used to evaluate the quality of word embeddings, and describe each dataset from the following perspectives.

+ Brief explanation of the task for the evaluation.
+ Statistics of the dataset (e.g., the number of instances)
+ Measure(s) for evaluating the quality

## Question 3 (20 points)

Explain two reasons why Transformers are superior to Recurrent Neural Networks (RNNs) in sequence-to-sequence tasks such as Machine Translation.

## Question 4 (10 points)

Implement code that uses a pre-trained **language** model. Show the code and its output as well as the following information:

+ Details of the pre-trained language model, for example,
    + https://huggingface.co/EleutherAI/gpt-j-6B
    + https://huggingface.co/rinna/japanese-gpt-1b
    + https://huggingface.co/facebook/blenderbot-400M-distill
+ The task addressed by the model (e.g., "text generation", "summarization", "chatbot")
