# Week2_1 Assignment

## Basic
- Extract and build the embedding of a specific word from BERT hidden states in several ways.

## Challenge
- Implement a cosine similarity function.
- Compute word similarities with cosine similarity and compare them.

## Advanced
- Compute sentence embeddings and measure the similarity between sentences.

### Reference
- [Reference: The Illustrated BERT (Jay Alammar)](http://jalammar.github.io/illustrated-bert/)
- [Reference: BERT Word Embeddings Tutorial (Chris McCormick)](https://mccormickml.com/2019/05/14/BERT-word-embeddings-tutorial/#33-creating-word-and-sentence-vectors-from-hidden-states)

In [None]:
import os
import sys
import pandas as pd
import numpy as np 
import torch
import random

### Load the BERT model and tokenizer
- We want to build (word and sentence) embeddings from a conversation between two people. Feed the conversation below into the BERT model and take the "hidden states" from its output.
- `hidden_states` is a tuple of 3-D tensors, one per layer of the BERT model, each of shape (batch_size, sequence_length, hidden_size). BERT-base has 12 encoder layers, and the output of the embedding layer is included as one extra entry, so `len(hidden_states)` is 13.
    - batch_size: the number of sequences fed to the BERT model in one batch.
    - sequence_length: the number of tokens in each (padded) sequence.
    - hidden_size: the embedding size of each token (768 for BERT-base).
- Reference
    - [Parameters of the tokenizer `__call__()` method](https://huggingface.co/transformers/v3.0.2/main_classes/tokenizer.html#transformers.PreTrainedTokenizer.__call__)
    - [Parameters and return values of `BertModel.forward()`](https://huggingface.co/docs/transformers/model_doc/bert#transformers.BertModel.forward)
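
To make the shape description above concrete, here is a minimal sketch that builds a tiny, randomly initialised BERT (so nothing is downloaded) and inspects its `hidden_states`. The configuration values (`vocab_size=100`, `hidden_size=32`, and so on) and the `toy_*` names are arbitrary choices for illustration, not those of `bert-base-cased`.

```python
import torch
from transformers import BertConfig, BertModel

# Assumption: toy configuration for illustration only; the weights are random.
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=12,
    num_attention_heads=4,
    intermediate_size=64,
)
toy_model = BertModel(config)
toy_model.eval()  # disable dropout

# Two sequences of seven arbitrary token ids: (batch_size=2, sequence_length=7).
toy_input_ids = torch.randint(0, config.vocab_size, (2, 7))

with torch.no_grad():
    toy_outputs = toy_model(input_ids=toy_input_ids, output_hidden_states=True)

toy_hidden_states = toy_outputs.hidden_states
print(type(toy_hidden_states).__name__)   # container type
print(len(toy_hidden_states))             # 12 encoder layers + 1 embedding output
print(tuple(toy_hidden_states[0].shape))  # (batch_size, sequence_length, hidden_size)
```

The entry at index 0 is the embedding-layer output, so encoder layer `k` sits at index `k`.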

In [None]:
!pip install transformers

In [None]:
from transformers import BertTokenizer, BertModel

In [None]:
tokenizer_bert = BertTokenizer.from_pretrained("bert-base-cased")
model_bert = BertModel.from_pretrained("bert-base-cased")

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
normal_person = ["what do you do when you have free time?"]
nerd = ["I code. code frees my minds, body and soul."]
normal_person.append("(what a nerd...) coding?")
nerd.append("Yes. coding is the best thing to do in the free time.")

for i in range(len(normal_person)):
    print(f"Normal Person asked: {normal_person[i]}")
    print(f"Nerd answers: {nerd[i]}")

Normal Person asked: what do you do when you have free time?
Nerd answers: I code. code frees my minds, body and soul.
Normal Person asked: (what a nerd...) coding?
Nerd answers: Yes. coding is the best thing to do in the free time.


- For the arguments accepted when calling the BERT tokenizer, see [tokenizer `__call__()` parameters](https://huggingface.co/transformers/v3.0.2/main_classes/tokenizer.html#transformers.PreTrainedTokenizer.__call__)

In [None]:
# Tokenize the conversation as sentence pairs
# truncation: cut sequences that exceed the model's maximum length
# padding="longest": pad with [PAD] up to the longest sequence in the batch
# return_tensors="pt": return PyTorch tensors of shape (batch_size, sequence_length)

inputs = tokenizer_bert(
    text=normal_person,
    text_pair=nerd,
    truncation=True,
    padding="longest",
    return_tensors="pt",
)

print(inputs['input_ids'].shape)

torch.Size([2, 28])
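
As a rough illustration of what `padding="longest"` and `truncation=True` do, the sketch below pads a batch of token-id lists by hand. `pad_batch` is a hypothetical helper, not part of `transformers`, and the token ids are made up. The real tokenizer is more careful when truncating (it keeps the special tokens in place), so treat this only as a simplified picture.

```python
def pad_batch(batch, pad_id=0, max_len=512):
    """Truncate each sequence to max_len, then right-pad to the longest one."""
    truncated = [seq[:max_len] for seq in batch]
    longest = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in truncated]
    return input_ids, attention_mask


# Made-up token ids for two sequences of different lengths.
toy_batch = [[101, 7, 8, 102], [101, 9, 102]]
toy_ids, toy_mask = pad_batch(toy_batch)
print(toy_ids)   # [[101, 7, 8, 102], [101, 9, 102, 0]]
print(toy_mask)  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

This is why the batch above has a single shared length of 28: the shorter conversation is padded up to the longer one.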


In [None]:
# decoding
for i in range(len(inputs['input_ids'])):
    print(f"Conversation {i} -> '{tokenizer_bert.decode(inputs['input_ids'][i])}'")

Conversation 0 -> '[CLS] what do you do when you have free time? [SEP] I code. code frees my minds, body and soul. [SEP] [PAD] [PAD]'
Conversation 1 -> '[CLS] ( what a nerd... ) coding? [SEP] Yes. coding is the best thing to do in the free time. [SEP]'


In [None]:
tokenizer_bert.encode('code', add_special_tokens=False)

[3463]

In [None]:
if torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")
print(device)

cuda


In [None]:
# CPU -> GPU
inputs = inputs.to(device)
model_bert.to(device)

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(28996, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          

In [None]:
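# Forward pass; output_hidden_states=True also returns every layer's hidden states.
# Gradients are not needed for embedding extraction.
with torch.no_grad():
    outputs = model_bert(
        **inputs,
        output_hidden_states=True,
    )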

In [None]:
outputs.keys()

odict_keys(['last_hidden_state', 'pooler_output', 'hidden_states'])

In [None]:
hidden_states = outputs['hidden_states']
print(f"# layers : {len(hidden_states)}")
print(f"tensor shape in each layer : {hidden_states[-1].shape}")

# layers : 13
tensor shape in each layer : torch.Size([2, 28, 768])


### [BASIC] Q1. Return the indices of the word "code" in the first sequence (sentence)

In [None]:
# Find the positions of "code" in the first sequence
word = "code"
token_id = tokenizer_bert.encode(word, add_special_tokens=False)
print(token_id)
seq1 = inputs['input_ids'][0]
token_index = (seq1 == token_id[0]).nonzero()
print(token_index)

[3463]
tensor([[13],
        [15]], device='cuda:0')
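
The lookup above matches only the first sub-token id (`token_id[0]`), which is enough for "code" because it maps to a single id. For a word that the tokenizer splits into several WordPiece tokens, one option is to search for the whole list of ids. A minimal sketch, where `find_sublist` is a hypothetical helper and the ids are invented:

```python
def find_sublist(sequence, pattern):
    """Return every start index at which `pattern` occurs contiguously in `sequence`."""
    n, m = len(sequence), len(pattern)
    if m == 0:
        return []
    return [i for i in range(n - m + 1) if sequence[i:i + m] == pattern]


# Invented ids: pretend the word is tokenized into the two ids [55, 66].
toy_sequence = [101, 12, 55, 66, 13, 55, 14, 55, 66, 102]
print(find_sublist(toy_sequence, [55, 66]))  # [2, 7]
print(find_sublist(toy_sequence, [55]))      # [2, 5, 7]
```

With real inputs, the tensor of ids can be converted first, for example `inputs['input_ids'][0].tolist()`.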


In [None]:
# validation
for index in token_index:
    assert word == tokenizer_bert.decode(seq1[index])

### Q2. Compute the embedding of the first "code" token in the first sequence in several ways by indexing the BERT hidden states as follows
- first layer
- last layer
- sum of all 12 layers
- sum of the last 4 layers
- concatenation of the last 4 layers
- average of the last 4 layers

In [None]:
index = token_index[0].item()  # position of the first "code" token

# hidden_states[0] is the embedding-layer output, so encoder layer 1 is hidden_states[1].
first_layer_emb = hidden_states[1][0, index, :]
print(first_layer_emb.shape)

last_layer_emb = hidden_states[-1][0, index, :]
print(last_layer_emb.shape)

# Sum over the 12 encoder layers only (skip the embedding-layer output at index 0).
sum_all_layer_emb = sum(hs[0, index, :] for hs in hidden_states[1:])
print(sum_all_layer_emb.shape)

# The last four layers, ordered from the last layer backwards.
last4 = hidden_states[-1:-5:-1]

sum_last4_layer_emb = sum(hs[0, index, :] for hs in last4)
print(sum_last4_layer_emb.shape)

concat_last4_layer_emb = torch.cat([hs[0, index, :] for hs in last4], dim=0)
print(concat_last4_layer_emb.shape)

mean_last4_layer_emb = sum_last4_layer_emb / 4
print(mean_last4_layer_emb.shape)

torch.Size([768])
torch.Size([768])
torch.Size([768])
torch.Size([768])
torch.Size([3072])
torch.Size([768])
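
The same six strategies can also be written against a single stacked tensor. The sketch below uses random tensors as a stand-in for `hidden_states`, so the values are meaningless and only the shapes matter; `toy_hidden_states`, `batch_idx`, and `token_idx` are illustrative names.

```python
import torch

torch.manual_seed(0)
# Stand-in for BERT-base output: 13 tensors of shape (batch_size, sequence_length, hidden_size).
toy_hidden_states = tuple(torch.randn(2, 28, 768) for _ in range(13))

batch_idx, token_idx = 0, 13

# Stack to shape (13, 768): one row per layer for the chosen token.
token_layers = torch.stack([hs[batch_idx, token_idx] for hs in toy_hidden_states])

first_layer = token_layers[1]                 # index 0 is the embedding-layer output
last_layer = token_layers[-1]
sum_all_12 = token_layers[1:].sum(dim=0)
sum_last4 = token_layers[-4:].sum(dim=0)
concat_last4 = token_layers[-4:].reshape(-1)  # layers 9..12 laid end to end
mean_last4 = token_layers[-4:].mean(dim=0)

for name, emb in [("first", first_layer), ("last", last_layer),
                  ("sum_all_12", sum_all_12), ("sum_last4", sum_last4),
                  ("concat_last4", concat_last4), ("mean_last4", mean_last4)]:
    print(name, tuple(emb.shape))
```

Note that this sketch concatenates layers 9 to 12 in ascending order, whereas the cell above walks backwards from the last layer. The order does not affect cosine similarity as long as both vectors being compared use the same one.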


### Q3. Using the `sum_last4_layer_emb` method, compute the cosine similarity between the two "code" tokens in the first sequence

In [None]:
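def cosine_similarity_manual(x, y, small_number=1e-8):
    """Cosine similarity of two 1-D tensors; small_number guards against division by zero."""

    def l2_norm(a):
        return torch.pow(a, 2).sum().sqrt()

    return torch.inner(x, y) / torch.clamp(l2_norm(x) * l2_norm(y), min=small_number)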
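
For reference, the function computes cos(x, y) = x·y / max(‖x‖‖y‖, ε). A small pure-Python sketch of the same formula (the name `cosine_similarity_py` is illustrative) makes the edge cases easy to check by hand:

```python
import math


def cosine_similarity_py(x, y, eps=1e-8):
    """Cosine similarity of two equal-length lists of numbers."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / max(norm_x * norm_y, eps)


print(cosine_similarity_py([1.0, 0.0], [0.0, 1.0]))    # 0.0 (orthogonal)
print(cosine_similarity_py([3.0, 4.0], [6.0, 8.0]))    # 1.0 (same direction)
print(cosine_similarity_py([3.0, 4.0], [-3.0, -4.0]))  # -1.0 (opposite direction)
print(cosine_similarity_py([0.0, 0.0], [1.0, 2.0]))    # 0.0 (eps avoids division by zero)
```

The score depends only on direction, not on magnitude, which is why scaling a vector (as in the second example) leaves it unchanged.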

In [None]:
sum_last4_layer_emb_1 = sum_last4_layer_emb
sum_last4_layer_emb_2 = sum([hidden_states[i][0, token_index[-1].item(), :] for i in range(len(hidden_states)-1, len(hidden_states)-1-4, -1)])
score = cosine_similarity_manual(sum_last4_layer_emb_1, sum_last4_layer_emb_2)
print(f"Similarity between first 'code' and second 'code' is {score}")

Similarity between first 'code' and second 'code' is 0.842657208442688


### Q4. Return the positions of the token "coding" in the second sequence

In [None]:
# Find the positions of "coding" in the second sequence
word = "coding"
token_id = tokenizer_bert.encode(word, add_special_tokens=False)
print(token_id)
seq2 = inputs['input_ids'][1]
token_index = (seq2 == token_id[0]).nonzero()
print(token_index)

[19350]
tensor([[10],
        [15]], device='cuda:0')


### Q5. Using the `concat_last4_layer_emb` method, compute the cosine similarity between the two "coding" tokens in the second sequence

In [None]:
concat_last4_layer_emb_1 = torch.cat([hidden_states[i][1, token_index[0].item(), :] for i in range(len(hidden_states)-1, len(hidden_states)-1-4, -1)], dim=0)
concat_last4_layer_emb_2 = torch.cat([hidden_states[i][1, token_index[-1].item(), :] for i in range(len(hidden_states)-1, len(hidden_states)-1-4, -1)], dim=0)
score =  cosine_similarity_manual(concat_last4_layer_emb_1, concat_last4_layer_emb_2)
print(f"Similarity between first '{word}' and second '{word}' is {score}")

Similarity between first 'coding' and second 'coding' is 0.8681784868240356


### Q6. Pick one token at random from the second sequence, then compute the cosine similarity between that random token and the second "coding" token in the second sequence

In [None]:
token_len = hidden_states[0][0].shape[0]

In [None]:
random_idx = random.randint(0,token_len-1)
random_token_id = inputs['input_ids'][-1][random_idx].item()
random_word = tokenizer_bert.decode([random_token_id])
print(f"Random word : {random_word}")

concat_last4_layer_emb_other = torch.cat([hidden_states[i][1, random_idx, :] for i in range(len(hidden_states)-1, len(hidden_states)-1-4, -1)], dim=0)
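# Note: concat_last4_layer_emb_1 is the first "coding" token; use concat_last4_layer_emb_2 to compare against the second one.
score = cosine_similarity_manual(concat_last4_layer_emb_1, concat_last4_layer_emb_other)
print(f"Similarity between first '{word}' and random token '{random_word}' is {score}")

Random word : ne
Similarity between first 'coding' and random token 'ne' is 0.6469892263412476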
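
Sampling uniformly over all positions can land on `[CLS]`, `[SEP]`, or `[PAD]`, and the result changes on every run. A minimal sketch of a seeded draw that skips special tokens follows. `sample_token_position` is a hypothetical helper, the ids in `toy_ids` are invented, and the special-token ids (0, 101, 102) are assumed to be those of the BERT vocabulary; with the real tokenizer they can be read from `tokenizer_bert.all_special_ids`.

```python
import random


def sample_token_position(input_ids, special_ids=(0, 101, 102), seed=None):
    """Pick a random position whose token id is not a special token."""
    rng = random.Random(seed)
    candidates = [i for i, tok in enumerate(input_ids) if tok not in special_ids]
    if not candidates:
        raise ValueError("sequence contains only special tokens")
    return rng.choice(candidates)


# Invented ids laid out as: [CLS] a b [SEP] c [SEP] [PAD]
toy_ids = [101, 11, 12, 102, 13, 102, 0]
position = sample_token_position(toy_ids, seed=42)
print(position, toy_ids[position])
```

Passing a fixed `seed` makes the draw repeatable, which helps when comparing similarity scores across runs.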


### Q7. Compute the sentence similarity between the first and second sequences. Use the first token ('[CLS]') of the last layer as the sentence embedding.

In [None]:
sequence_1_embedding = hidden_states[-1][0, 0, :]
sequence_2_embedding = hidden_states[-1][1, 0, :]
print(f"Shape : {sequence_1_embedding.shape}")

Shape : torch.Size([768])


In [None]:
score = cosine_similarity_manual(sequence_1_embedding, sequence_2_embedding).item()
print(f"Similarity between the first sequence and the second sequence is {score}")

Similarity between the first sequence and the second sequence is 0.8130239248275757
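
With more than two sequences in a batch, all pairwise `[CLS]` similarities can be computed at once by L2-normalising the embeddings and taking a matrix product. A minimal sketch using random stand-in embeddings (`toy_cls` is illustrative, not real BERT output):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Stand-in for hidden_states[-1][:, 0, :]: one 768-d [CLS] vector per sequence.
toy_cls = torch.randn(3, 768)

# Unit-normalise each row; dot products of unit vectors are cosine similarities.
normalized = F.normalize(toy_cls, p=2, dim=1, eps=1e-8)
similarity_matrix = normalized @ normalized.T  # shape (3, 3)

print(tuple(similarity_matrix.shape))
# Entry [i, j] should agree with the pairwise functional call.
pairwise_01 = F.cosine_similarity(toy_cls[0], toy_cls[1], dim=0)
print(torch.allclose(similarity_matrix[0, 1], pairwise_01, atol=1e-6))
```

The diagonal of `similarity_matrix` is each sequence compared with itself, so it should be close to 1.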
