<a href="https://colab.research.google.com/github/singwang-cn/Neural-Network/blob/master/aml2022_report.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advanced Machine Learning (2022) Final Report Assignment

Answer Questions 1 to 4 (in either Japanese or English). Submit a report in either PDF (.pdf) or Jupyter Notebook (.ipynb) format.

## Question 1 (50 points)

Consider a convolutional neural network (CNN) that predicts a label $\hat{y} \in \{0, 1\}$ for a given sentence $\boldsymbol{X} \in \mathbb{R}^{d \times T}$. Here, a sentence is represented by a matrix $\boldsymbol{X} = (\boldsymbol{x}_1, \boldsymbol{x}_2, \dots, \boldsymbol{x}_T)$ consisting of a concatenation of $T$ word embeddings, $\boldsymbol{x}_1, \boldsymbol{x}_2, \dots, \boldsymbol{x}_T \in \mathbb{R}^d$, where $d$ is the size of word embeddings, and $T$ is the number of words in the sentence.

These equations define the whole architecture of the CNN.

\begin{align}
\hat{y} &= \begin{cases}
1 & (0.5 < p) \\
0 & (p \leq 0.5)
\end{cases} \\
p &= \sigma(\boldsymbol{v}^\top \boldsymbol{s}) \\
\boldsymbol{s} &= \max(\boldsymbol{c}_1, \dots, \boldsymbol{c}_{T-\delta+1}) \\
\boldsymbol{c}_t &= {\rm ReLU}(\boldsymbol{W} \boldsymbol{x}_{t:t+\delta-1} + \boldsymbol{b}) & (\forall t \in \{1, \dots, T-\delta+1\}) \\
\boldsymbol{x}_{t:t+\delta-1} &= \boldsymbol{x}_{t} \oplus \boldsymbol{x}_{t+1} \oplus \dots \oplus \boldsymbol{x}_{t+\delta-1}
\end{align}

Here:

+ $\boldsymbol{W} \in \mathbb{R}^{m \times \delta d}$, $\boldsymbol{b} \in \mathbb{R}^m, \boldsymbol{v} \in \mathbb{R}^m$ are the model parameters;
+ $m$ denotes the number of output channels of the CNN;
+ $\delta$ denotes the width (kernel size) of the convolution;
+ $\sigma(\cdot)$ denotes the standard sigmoid function;
+ $\max(\cdot)$ denotes the max pooling operation;
+ ${\rm ReLU}(\cdot)$ denotes the ReLU activation function;
+ $\oplus$ denotes a concatenation of vectors.

Setting the hyperparameters $d=3, m=2, \delta=2$, we initialize the model parameters as follows.

\begin{align}
\boldsymbol{W} &= \begin{pmatrix}
-3 & -2 & -1 & -1 & -2 & -3 \\
3 & 2 & 3 & 2 & 3 & 2
\end{pmatrix} \\
\boldsymbol{b} &= \begin{pmatrix}
-0.2 \\ 0.1
\end{pmatrix} \\
\boldsymbol{v} &= \begin{pmatrix}
-1 \\ 2
\end{pmatrix}
\end{align}

Suppose that we give a negative ($y=0$) training instance with the sentence ($T = 5$),

\begin{align}
\boldsymbol{X} &= \begin{pmatrix}
-0.3 & 0 & 0.1 & 0 & 0 \\
-0.2 & -0.1 & 0 & 0.1 & 0 \\
-0.1 & -0.2 & 0.1 & 0 & 0.1
\end{pmatrix} ,
\end{align}
to the CNN model, and answer the following questions.

**(1)** Find the value of the vector $\boldsymbol{x}_{3:4}$.

In [1]:
import numpy as np
import torch
import torch.nn as nn
from torch.nn import functional as F

In [2]:
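d = 3
m = 2
delta = 2

# W is stored transposed (delta*d x m), so a window x (as a row) is mapped by x @ W.
W = np.array([[-3, 3],
              [-2, 2],
              [-1, 3],
              [-1, 2],
              [-2, 3],
              [-3, 2]])
b = np.array([-0.2, 0.1])
v = np.array([-1., 2.])
# X is stored transposed (T x d): row t-1 holds the word embedding x_t.
X = np.array([[-0.3, -0.2, -0.1],
              [0, -0.1, -0.2],
              [0.1, 0, 0.1],
              [0, 0.1, 0],
              [0, 0, 0.1]])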

In [3]:
#Answer (1)
x_34 = np.concatenate(X[2:4], axis=0)
print('x_(3:4) = ', x_34)

x_(3:4) =  [0.1 0.  0.1 0.  0.1 0. ]


**(2)** Find the values of the hidden vectors $\boldsymbol{c}_1, \boldsymbol{c}_2, \boldsymbol{c}_3, \boldsymbol{c}_4$.

In [4]:
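# Row t-1 is the window x_{t:t+1}: each word paired with its successor (valid for delta = 2).
X_win = np.concatenate((X[:-(delta-1)], X[(delta-1):]), axis=1)
# c_t = ReLU(W x_{t:t+1} + b), computed for all windows at once.
C = F.relu(torch.Tensor(X_win.dot(W) + b))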

In [5]:
#Answer (2)
for i in range(C.shape[0]):
  print('c[' + str(i+1) + '] =', C[i].numpy())

c[1] = [2. 0.]
c[2] = [0. 0.]
c[3] = [0. 1.]
c[4] = [0.  0.5]
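
As a hand check of one of these values, $\boldsymbol{c}_3$ can be worked out directly from $\boldsymbol{x}_{3:4}$ found in (1). Only the first, third, and fifth entries of $\boldsymbol{x}_{3:4}$ are nonzero, so only the matching columns of $\boldsymbol{W}$ contribute:

\begin{align*}
\boldsymbol{W} \boldsymbol{x}_{3:4} + \boldsymbol{b} &= \begin{pmatrix}
(-3)(0.1) + (-1)(0.1) + (-2)(0.1) \\
(3)(0.1) + (3)(0.1) + (3)(0.1)
\end{pmatrix} + \begin{pmatrix}
-0.2 \\ 0.1
\end{pmatrix} = \begin{pmatrix}
-0.8 \\ 1.0
\end{pmatrix} \\
\boldsymbol{c}_3 &= {\rm ReLU}\begin{pmatrix}
-0.8 \\ 1.0
\end{pmatrix} = \begin{pmatrix}
0 \\ 1
\end{pmatrix}
\end{align*}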


**(3)** Find the value of the vector $\boldsymbol{s}$.


In [6]:
#Answer (3)
# Max pooling over time: element-wise maximum of c_1, ..., c_4 (axis 0 indexes t).
s = C.numpy().max(axis=0)
print('s =', s)

s = [2. 1.]


**(4)** Find the value of $p$.

In [7]:
def sigmoid_func(x):
  return 1 / (1 + np.exp(-x))

In [8]:
#Answer (4)
p = sigmoid_func(v.dot(s))
print('p =', p)

p = 0.5
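
The forward pass of (1) to (4) can also be cross-checked with the parameters written exactly as in the problem statement, that is, $\boldsymbol{W}$ as an $m \times \delta d$ matrix and $\boldsymbol{X}$ as a $d \times T$ matrix. The following is a minimal NumPy sketch under that assumption; the names with the `_q` suffix are introduced here only for illustration and are not part of the cells above.

```python
import numpy as np

# Parameters and input in the orientation used by the problem statement:
# W_q is (m x delta*d), X_q is (d x T) with one word embedding per column.
W_q = np.array([[-3, -2, -1, -1, -2, -3],
                [ 3,  2,  3,  2,  3,  2]], dtype=float)
b_q = np.array([-0.2, 0.1])
v_q = np.array([-1.0, 2.0])
X_q = np.array([[-0.3,  0.0, 0.1, 0.0, 0.0],
                [-0.2, -0.1, 0.0, 0.1, 0.0],
                [-0.1, -0.2, 0.1, 0.0, 0.1]])
delta_q = 2
T_q = X_q.shape[1]

# x_{t:t+delta-1}: stack delta consecutive columns into one long vector.
windows_q = [X_q[:, t:t + delta_q].T.reshape(-1)
             for t in range(T_q - delta_q + 1)]

# c_t = ReLU(W x_{t:t+delta-1} + b), one row per time step t.
C_q = np.array([np.maximum(W_q @ x + b_q, 0.0) for x in windows_q])

# s: element-wise maximum over time (axis 0 indexes t).
s_q = C_q.max(axis=0)

# p = sigmoid(v^T s)
p_q = 1.0 / (1.0 + np.exp(-(v_q @ s_q)))

print(np.round(C_q, 6))
print(np.round(s_q, 6), round(float(p_q), 6))
```

Up to floating-point rounding, this reproduces the values of $\boldsymbol{c}_1, \dots, \boldsymbol{c}_4$, $\boldsymbol{s}$, and $p$ reported in (2) to (4).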


**(5)** Write the formula of the binary cross-entropy loss between the correct label $y$ and the probability estimate $p$.

Answer (5)

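\begin{equation*}
Loss = -\frac{1}{N}\sum_{i=1}^N \left[ y_i\log p_i+(1-y_i)\log (1-p_i) \right]
\end{equation*}
where $N$ is the number of training instances, $y_i$ is the correct label of instance $i$, and $p_i$ is its probability estimate. For a single instance ($N = 1$) this reduces to $-y\log p-(1-y)\log (1-p)$.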

In [9]:
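def bceloss_func(y, p):
  # Mean binary cross-entropy; y and p may be scalars or arrays of equal length.
  y = np.atleast_1d(np.asarray(y, dtype=float))
  p = np.atleast_1d(np.asarray(p, dtype=float))
  loss_array = y*np.log(p) + (1-y)*np.log(1-p)
  return -loss_array.mean()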

**(6)** Compute the loss value by using the formula of (5) for the training instance.

In [10]:
#Answer (6)
loss = bceloss_func(np.array([0]), p)
print('loss =',loss)

loss = 0.6931471805599453
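
As a quick hand check of this value, substituting $y = 0$ and $p = 0.5$ into the binary cross-entropy for a single instance gives:

\begin{align*}
Loss &= -\left[ 0 \cdot \log 0.5 + (1 - 0) \log (1 - 0.5) \right] = -\log 0.5 = \log 2 \approx 0.6931
\end{align*}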


**(7)** Compute the gradient of the loss function with respect to $\boldsymbol{v}$ for the training instance.

In [11]:
class build_model(nn.Module):
  # kernel_size and stride are accepted for call compatibility but unused:
  # max pooling over time needs no window size.
  def __init__(self, W, b, v, kernel_size=2, stride=2) -> None:
    super().__init__()

    # c_t = ReLU(W x + b); W is stored as (delta*d x m), nn.Linear expects (m x delta*d).
    self.linear1 = nn.Linear(6, 2, bias=True)
    self.linear1.weight.data = torch.Tensor(W.T)
    self.linear1.bias.data = torch.Tensor(b)

    # p = sigmoid(v^T s); the 1-D weight yields one scalar per sentence.
    self.linear2 = nn.Linear(2, 1, bias=False)
    self.linear2.weight.data = torch.Tensor(v)
    self.relu = nn.ReLU()
    self.sigmoid = nn.Sigmoid()

  def forward(self, X):
    # X: (batch, T-delta+1, delta*d), one window per row.
    Y = self.relu(self.linear1(X))
    # Max pooling over time: element-wise maximum over the window axis.
    Y = Y.max(dim=1).values
    Y = self.sigmoid(self.linear2(Y))
    return Y

In [18]:
# Windows x_{t:t+1} as rows, with a leading batch axis: shape (1, 4, 6).
X_T = torch.Tensor(np.concatenate((X[:-(delta-1)], X[(delta-1):]), axis=1)[np.newaxis])
model = build_model(W, b, v, 2, 2)
Y_T = model(X_T)

In [19]:
Y = torch.Tensor([0])
cost_func = nn.BCELoss()
loss = cost_func(Y_T, Y)
loss.backward()

In [20]:
#Answer (7)
v_grad = model.linear2.weight.grad.numpy()
print('the gradient of the loss function with respect to v:\n', v_grad)

the gradient of the loss function with respect to v:
 [1.  0.5]
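
This value can be cross-checked by hand with the chain rule. Writing $z = \boldsymbol{v}^\top \boldsymbol{s}$ (a symbol introduced here only for this derivation), so that $p = \sigma(z)$:

\begin{align*}
\frac{\partial Loss}{\partial z} &= \left( -\frac{y}{p} + \frac{1-y}{1-p} \right) p (1-p) = p - y = 0.5 \\
\frac{\partial Loss}{\partial \boldsymbol{v}} &= \frac{\partial Loss}{\partial z} \frac{\partial z}{\partial \boldsymbol{v}} = (p - y) \, \boldsymbol{s} = 0.5 \begin{pmatrix}
2 \\ 1
\end{pmatrix} = \begin{pmatrix}
1 \\ 0.5
\end{pmatrix}
\end{align*}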


**(8)** Compute the gradients of the loss function with respect to $\boldsymbol{W}$ for the training instance.

In [21]:
#Answer (8)
w_grad = model.linear1.weight.grad.numpy()
print('the gradient of the loss function with respect to w:\n', w_grad)

the gradient of the loss function with respect to w:
 [[0.15 0.1  0.05 0.   0.05 0.1 ]
 [0.1  0.   0.1  0.   0.1  0.  ]]
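
The autograd result can be cross-checked with a manual backward pass. The sketch below is a NumPy-only illustration, not the method used in the cells above; it assumes the parameters in the orientation of the problem statement, and the names with the `_g` suffix are hypothetical. Max pooling passes the gradient of each channel only to the time step that attained the maximum, and ReLU passes it on only where the pre-activation is positive.

```python
import numpy as np

# Parameters in the orientation of the problem statement (W_g is m x delta*d).
W_g = np.array([[-3, -2, -1, -1, -2, -3],
                [ 3,  2,  3,  2,  3,  2]], dtype=float)
b_g = np.array([-0.2, 0.1])
v_g = np.array([-1.0, 2.0])
X_g = np.array([[-0.3,  0.0, 0.1, 0.0, 0.0],
                [-0.2, -0.1, 0.0, 0.1, 0.0],
                [-0.1, -0.2, 0.1, 0.0, 0.1]])
y_g = 0.0
delta_g = 2
n_win = X_g.shape[1] - delta_g + 1

# Forward pass, keeping the intermediate values needed for backpropagation.
win_g = np.array([X_g[:, t:t + delta_g].T.reshape(-1) for t in range(n_win)])
A_g = win_g @ W_g.T + b_g            # pre-activations, shape (n_win, m)
C_g = np.maximum(A_g, 0.0)           # c_t
s_g = C_g.max(axis=0)                # max pooling over time
p_g = 1.0 / (1.0 + np.exp(-(v_g @ s_g)))

# Backward pass.
dz_g = p_g - y_g                     # dLoss/dz for sigmoid followed by BCE
grad_v = dz_g * s_g                  # dLoss/dv
ds_g = dz_g * v_g                    # dLoss/ds

t_star = C_g.argmax(axis=0)          # time step that produced each s_k
grad_W = np.zeros_like(W_g)
for k in range(W_g.shape[0]):
    t = t_star[k]
    if A_g[t, k] > 0:                # ReLU is active at the selected step
        grad_W[k] = ds_g[k] * win_g[t]

print(np.round(grad_v, 6))
print(np.round(grad_W, 6))
```

Here channel 1 takes its maximum at $t = 1$ and channel 2 at $t = 3$, so the first row of the gradient is $-0.5 \, \boldsymbol{x}_{1:2}$ and the second row is $1.0 \, \boldsymbol{x}_{3:4}$.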


## Question 2 (20 points)

Give names of two datasets that can be used to evaluate the quality of word embeddings, and explain the datasets with the following perspectives.

+ Brief explanation of the task for the evaluation.
+ Statistics of the dataset (e.g., the number of instances)
+ Measure(s) for evaluating the quality

## WordSim353
+ WordSim353 is a test collection for measuring word similarity or relatedness: a model scores how similar the two words of each pair are, and these scores are compared with human judgements. The collection has also been split into two subsets, one for evaluating similarity and the other for evaluating relatedness (https://aclanthology.org/N09-1003.pdf).
+ It contains 353 English word pairs in two sets, each with human-assigned similarity judgements. The first set (set1) contains 153 word pairs along with their similarity scores assigned by 13 subjects. The second set (set2) contains 200 word pairs with similarity assessed by 16 subjects.
+ Quality is measured by the Spearman rank correlation between the similarities computed from the embeddings (typically the cosine similarity of the two word vectors) and the human judgements. A higher correlation indicates better embeddings.
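
A common way to score embeddings on a word-similarity set such as WordSim353 is the Spearman rank correlation between the model's cosine similarities and the human scores. The sketch below illustrates this with toy data; the embeddings, word pairs, and human scores are all hypothetical and are not taken from WordSim353, and the ranking helper assumes there are no ties.

```python
import numpy as np

def cosine(u, w):
    """Cosine similarity of two vectors."""
    return float(u @ w / (np.linalg.norm(u) * np.linalg.norm(w)))

def ranks(values):
    """Ranks 1..n of the values (assumes no ties, for simplicity)."""
    order = np.argsort(values)
    r = np.empty(len(values))
    r[order] = np.arange(1, len(values) + 1)
    return r

def spearman(a, b):
    """Spearman correlation = Pearson correlation of the ranks."""
    return float(np.corrcoef(ranks(a), ranks(b))[0, 1])

# Toy embeddings and toy human judgements (hypothetical values).
emb = {
    "cat":  np.array([1.0, 0.1, 0.0]),
    "dog":  np.array([0.9, 0.2, 0.1]),
    "car":  np.array([0.0, 1.0, 0.2]),
    "bus":  np.array([0.3, 0.8, 0.4]),
    "tree": np.array([0.2, 0.1, 1.0]),
}
pairs = [("cat", "dog", 8.5), ("car", "bus", 7.0),
         ("cat", "tree", 3.0), ("dog", "car", 1.5)]

model_scores = [cosine(emb[a], emb[b]) for a, b, _ in pairs]
human_scores = [h for _, _, h in pairs]
rho = spearman(model_scores, human_scores)
print("Spearman correlation:", round(rho, 3))
```

In this toy example the model swaps the order of the two least similar pairs relative to the human scores, so the correlation is high but below 1.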

## Google dataset
https://arxiv.org/pdf/1301.3781.pdf
+ A comprehensive test set of word analogy questions that contains five types of semantic questions and nine types of syntactic questions. Each question asks for the word that relates to a third word in the same way as the second word relates to the first, and the answer is predicted by vector arithmetic on the word embeddings.
+ Overall, there are 8869 semantic and 10675 syntactic questions. The questions in each category were created in two steps: first, a list of similar word pairs was created manually; then, a large list of questions was formed by connecting two word pairs.
+ The measure is accuracy, reported over all question types and separately for the semantic and syntactic questions. A question is counted as correctly answered only if the word closest to the computed vector is exactly the same as the correct word in the question; synonyms are thus counted as mistakes.
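
The exact-match accuracy on analogy questions can be sketched as follows. The vocabulary, the 4-dimensional toy vectors, and the two questions are hypothetical and chosen so that the arithmetic works out; excluding the three question words from the candidates is an assumption that follows common practice rather than something stated above.

```python
import numpy as np

# Toy vocabulary and embeddings (hypothetical values).
vocab = ["king", "queen", "man", "woman", "paris", "france", "tokyo", "japan"]
E = np.array([
    [1,  1, 0,  0],   # king
    [1, -1, 0,  0],   # queen
    [0,  1, 0,  0],   # man
    [0, -1, 0,  0],   # woman
    [0,  0, 1,  1],   # paris
    [0,  0, 0,  1],   # france
    [0,  0, 1, -1],   # tokyo
    [0,  0, 0, -1],   # japan
], dtype=float)
index = {w: i for i, w in enumerate(vocab)}

def answer(a, b, c):
    """Predict d in 'a is to b as c is to d' by cosine similarity to b - a + c."""
    target = E[index[b]] - E[index[a]] + E[index[c]]
    sims = (E @ target) / (np.linalg.norm(E, axis=1) * np.linalg.norm(target))
    for w in (a, b, c):            # assumption: question words cannot be answers
        sims[index[w]] = -np.inf
    return vocab[int(np.argmax(sims))]

# Each question is (a, b, c, expected d).
questions = [("man", "woman", "king", "queen"),
             ("france", "paris", "japan", "tokyo")]

# Exact match only: a synonym of the expected word would count as a mistake.
correct = sum(answer(a, b, c) == d for a, b, c, d in questions)
accuracy = correct / len(questions)
print("accuracy:", accuracy)
```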

## Other reference
https://www.cambridge.org/core/services/aop-cambridge-core/content/view/EDF43F837150B94E71DBB36B28B85E79/S204877031900012Xa.pdf/evaluating_word_embedding_models_methods_and_experimental_results.pdf

## Question 3 (20 points)

Explain two reasons why Transformers are superior to Recurrent Neural Network
(RNN) in sequence-to-sequence tasks such as Machine Translation.

1. In sequence-to-sequence tasks, the input is a sequence of word vectors. In an RNN, the hidden vector at time $t$ depends on the hidden vector at time $t-1$, so the hidden vectors cannot be computed in parallel across time steps. In contrast, a Transformer computes attention for all positions at once using matrix multiplications, which parallelize well. Hence the Transformer is more computationally efficient to train.
2. Previous work has confirmed that simply making an architecture deeper does not necessarily make it more effective. An effective solution is to add residual connections to the architecture. It is hard to add residual connections to an RNN but easy to add them around attention blocks. Hence a Transformer can be scaled up to more layers and parameters, which gives it stronger performance.
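
The first point can be illustrated with a small NumPy sketch. It is a simplified illustration under stated assumptions (a plain tanh RNN, single-head self-attention without masking, random weights); all names, sizes, and weights are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
T_len, d_model = 5, 4
X_seq = rng.normal(size=(T_len, d_model))      # one word vector per row

# RNN: h_t depends on h_{t-1}, so the T_len steps must run one after another.
W_hh = rng.normal(size=(d_model, d_model)) * 0.1
W_xh = rng.normal(size=(d_model, d_model)) * 0.1
h = np.zeros(d_model)
rnn_states = []
for t in range(T_len):
    h = np.tanh(W_hh @ h + W_xh @ X_seq[t])    # needs the previous h
    rnn_states.append(h)
H_rnn = np.stack(rnn_states)

# Self-attention: all T_len outputs come from a few matrix products,
# with no loop over time steps.
W_query, W_key, W_value = (rng.normal(size=(d_model, d_model)) * 0.1
                           for _ in range(3))
Q, K, V = X_seq @ W_query, X_seq @ W_key, X_seq @ W_value
scores = Q @ K.T / np.sqrt(d_model)            # (T_len, T_len)
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax
H_att = weights @ V

print(H_rnn.shape, H_att.shape)
```

The loop in the RNN part cannot be removed because each iteration reads the state written by the previous one, whereas every row of `H_att` is independent of the others and can be computed simultaneously on parallel hardware.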

## Question 4 (10 points)

Implement the code for using a pre-trained **language** model. Show the code and its output as well as the following information:

+ The detail of the pre-trained language model, for example,
    + https://huggingface.co/EleutherAI/gpt-j-6B
    + https://huggingface.co/rinna/japanese-gpt-1b
    + https://huggingface.co/facebook/blenderbot-400M-distill
+ The task addressed by the model (e.g., "text generation", "summarization", "chatbot")


# BERT
## 1. Details
BERT is a transformers model pretrained on a large corpus of English data in a self-supervised fashion. This means it was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of publicly available data) with an automatic process to generate inputs and labels from those texts.

### Training data
The BERT model was pretrained on BookCorpus, a dataset consisting of 11,038 unpublished books and English Wikipedia (excluding lists, tables and headers).

### Masking procedure
+ 15% of the tokens are masked.
+ In 80% of the cases, the masked tokens are replaced by [MASK].
+ In 10% of the cases, the masked tokens are replaced by a random token (different) from the one they replace.
+ In the 10% remaining cases, the masked tokens are left as is.
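
The masking procedure above can be sketched in a few lines. This is a simplified illustration, not the actual BERT preprocessing code: it works on whitespace-separated words instead of WordPiece tokens and selects each token independently with probability 15%. The function name `mask_tokens` and the toy vocabulary are hypothetical.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (inputs, labels); labels[i] is None where nothing is predicted."""
    rng = random.Random(seed)
    inputs = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue                      # token not selected for prediction
        labels[i] = tok                   # the model must recover the original
        r = rng.random()
        if r < 0.8:                       # 80%: replace by [MASK]
            inputs[i] = "[MASK]"
        elif r < 0.9:                     # 10%: replace by a different random token
            inputs[i] = rng.choice([w for w in vocab if w != tok])
        # remaining 10%: leave the token as is
    return inputs, labels

tokens_demo = "the quick brown fox jumps over the lazy dog".split()
vocab_demo = sorted(set(tokens_demo))
inputs_demo, labels_demo = mask_tokens(tokens_demo, vocab_demo)
print(inputs_demo)
print(labels_demo)
```

Keeping some selected tokens unchanged or replacing them with random tokens means the model cannot rely on `[MASK]` always marking the positions it has to predict.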

### Evaluation results

|Task|MNLI-(m/mm)|QQP|QNLI|SST-2|CoLA|STS-B|MRPC|RTE|Average|
|----|----|----|----|----|----|----|----|----|----|
|Score|84.6/83.4|71.2|90.5|93.5|52.1|85.8|88.9|66.4|79.6|


## 2. Tasks addressed by the model
+ Masked language modeling (MLM): taking a sentence, the model randomly masks 15% of the words in the input, then runs the entire masked sentence through the model and has to predict the masked words. This is different from traditional recurrent neural networks (RNNs), which usually see the words one after the other, and from autoregressive models like GPT, which internally mask the future tokens. It allows the model to learn a bidirectional representation of the sentence.
+ Next sentence prediction (NSP): the model concatenates two masked sentences as inputs during pretraining. Sometimes they correspond to sentences that were next to each other in the original text, sometimes not. The model then has to predict whether the two sentences followed each other or not.
+ Because BERT is a general-purpose pre-trained model, it can also be used for other NLP tasks, such as natural language understanding, question answering, and sentence-pair completion, by fine-tuning the pre-trained BERT with an additional output layer.

In [25]:
!pip install transformers -q

In [26]:
#Implementation from here
from transformers import pipeline
unmasker = pipeline('fill-mask', model='bert-base-cased')

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [34]:
#a few tests
unmasker("Alan Mathison Turing was highly influential in the development of theoretical [MASK] science.")

[{'score': 0.9071993827819824,
  'sequence': 'Alan Mathison Turing was highly influential in the development of theoretical computer science.',
  'token': 2775,
  'token_str': 'computer'},
 {'score': 0.011156082153320312,
  'sequence': 'Alan Mathison Turing was highly influential in the development of theoretical cognitive science.',
  'token': 12176,
  'token_str': 'cognitive'},
 {'score': 0.008305317722260952,
  'sequence': 'Alan Mathison Turing was highly influential in the development of theoretical political science.',
  'token': 1741,
  'token_str': 'political'},
 {'score': 0.007529881317168474,
  'sequence': 'Alan Mathison Turing was highly influential in the development of theoretical information science.',
  'token': 1869,
  'token_str': 'information'},
 {'score': 0.0046945237554609776,
  'sequence': 'Alan Mathison Turing was highly influential in the development of theoretical social science.',
  'token': 1934,
  'token_str': 'social'}]

In [33]:
unmasker("Alan Mathison Turing graduated at King's College, [MASK], with a degree in mathematics.")

[{'score': 0.7612192034721375,
  'sequence': "Alan Mathison Turing graduated at King's College, Cambridge, with a degree in mathematics.",
  'token': 3900,
  'token_str': 'Cambridge'},
 {'score': 0.19253723323345184,
  'sequence': "Alan Mathison Turing graduated at King's College, London, with a degree in mathematics.",
  'token': 1498,
  'token_str': 'London'},
 {'score': 0.019461840391159058,
  'sequence': "Alan Mathison Turing graduated at King's College, Oxford, with a degree in mathematics.",
  'token': 3500,
  'token_str': 'Oxford'},
 {'score': 0.0027144227642565966,
  'sequence': "Alan Mathison Turing graduated at King's College, Auckland, with a degree in mathematics.",
  'token': 8279,
  'token_str': 'Auckland'},
 {'score': 0.0024129985831677914,
  'sequence': "Alan Mathison Turing graduated at King's College, Toronto, with a degree in mathematics.",
  'token': 3506,
  'token_str': 'Toronto'}]