## BERT

- Bidirectional Encoder Representations from Transformers
- A model built by stacking Transformer encoder layers
- Because it is an encoder-only stack, it performs well on tasks such as classification

In [35]:
from transformers import BertModel, BertTokenizer
import torch

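- BERT-uncased
    - Lowercases the text
    - Strips accent marks (e.g., in French)

- BERT-cased
    - Applies no transformation to the text
    - Used where casing matters, such as NER
    - NER
        - Named Entity Recognition
        - NER recognizes, extracts, and classifies words (named entities) in a document that belong to predefined categories such as people, companies, places, and units.
        - Extracted entities are classified as person, location, organization, time, and so on.
        - NER began as an information extraction task and is used in NLP, information retrieval, and related areas.

- Source: https://k-min-algorithm.tistory.com/37 [K-MIN'S ALGORITHM: Tistory]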
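
To make the uncased preprocessing concrete, here is a minimal sketch of lowercasing plus accent stripping that uses only the standard library. The helper name `uncased_normalize` is hypothetical. It illustrates the general idea and is not the tokenizer's actual code.

```python
import unicodedata


def uncased_normalize(text):
    """Lowercase text and drop combining accent marks (hypothetical helper)."""
    lowered = text.lower()
    # NFD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", lowered)
    # Category "Mn" (nonspacing mark) covers the combining accents.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(uncased_normalize("Café Noël"))  # cafe noel
```

A cased model would skip both steps, which is why it can still tell "Apple" (a company) from "apple" (a fruit) in NER.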

In [36]:
model = BertModel.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [37]:
# sentence

sentence = "I am a student"

token = tokenizer.tokenize(sentence)

print("pre :", sentence)
print("post :", token)

pre : I am a student
post : ['i', 'am', 'a', 'student']


In [38]:
# BERT input format
## Each input must start with [CLS] and end with [SEP]
## Positions up to max_length are filled with [PAD] so every input has the same length

max_length = 10

new_token = ['[CLS]'] + token + ['[SEP]']

print("pre :", token)
print("post :", new_token)



pre : ['i', 'am', 'a', 'student']
post : ['[CLS]', 'i', 'am', 'a', 'student', '[SEP]']


In [39]:
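def padding(sentence, max_length):
    # check > 0: number of [PAD] tokens to add; check < 0: number of tokens over the limit
    check = max_length - len(sentence)

    if check >= 0:
        # Build a new list so the caller's list is not modified in place
        sentence = sentence + ['[PAD]'] * check
    else:
        # Keep the first max_length - 1 tokens and close with [SEP]
        sentence = sentence[:check - 1] + ['[SEP]']

    return sentence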

pad_token = padding(new_token, max_length)

print("padding :", pad_token)

padding : ['[CLS]', 'i', 'am', 'a', 'student', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']
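
The example above only exercises the padding branch. The sketch below, using a hypothetical helper `pad_or_truncate` and a made-up longer token list, shows the truncation branch as well. It assumes `max_length` is at least 2 so that `[CLS]` and `[SEP]` both fit.

```python
def pad_or_truncate(tokens, max_length):
    """Return a new list of exactly max_length tokens (hypothetical helper)."""
    if len(tokens) <= max_length:
        return tokens + ['[PAD]'] * (max_length - len(tokens))
    # Keep the first max_length - 1 tokens and close with [SEP].
    return tokens[:max_length - 1] + ['[SEP]']


# Made-up input that is longer than the limit of 6 tokens.
long_tokens = ['[CLS]', 'i', 'am', 'a', 'student', 'at', 'a', 'school', '[SEP]']
truncated = pad_or_truncate(long_tokens, 6)
print(truncated)  # ['[CLS]', 'i', 'am', 'a', 'student', '[SEP]']
```

Either branch returns exactly `max_length` tokens, and the truncated sequence still ends with `[SEP]`.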


In [40]:
# attention mask

attention_mask = [1 if i != '[PAD]' else 0 for i in pad_token]

print("sentence :", pad_token)
print("attention mask :", attention_mask)

sentence : ['[CLS]', 'i', 'am', 'a', 'student', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']
attention mask : [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
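
The mask tells the model which positions are real tokens. As a rough illustration of how a mask removes padded positions from attention, the sketch below applies a softmax in which masked scores are set to negative infinity. The function name and the scores are made up, and real implementations typically add a large negative number to the scores instead.

```python
import math


def masked_softmax(scores, mask):
    """Softmax over scores, giving zero weight where mask == 0 (illustrative)."""
    # Assumes at least one position has mask == 1.
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    peak = max(masked)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.5, 3.0]  # made-up attention scores
mask = [1, 1, 1, 0]            # the last position is [PAD]
weights = masked_softmax(scores, mask)
print(weights)
```

The padded position receives a weight of exactly zero even though its raw score was the largest, and the remaining weights still sum to 1.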


In [41]:
# Convert tokens to ids

token_ids = tokenizer.convert_tokens_to_ids(pad_token)

print("sentence :", pad_token)
print("token ids :", token_ids)

sentence : ['[CLS]', 'i', 'am', 'a', 'student', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']
token ids : [101, 1045, 2572, 1037, 3076, 102, 0, 0, 0, 0]
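
Conceptually, this conversion is a dictionary lookup against the model's vocabulary. The sketch below uses a toy vocabulary built from the ids printed above. The `[UNK]` id of 100 is an assumption taken from the bert-base-uncased vocabulary and does not appear in the output, and `tokens_to_ids` is a hypothetical helper.

```python
# Toy vocabulary using the ids printed above; the [UNK] entry is an assumption.
toy_vocab = {
    '[PAD]': 0, '[UNK]': 100, '[CLS]': 101, '[SEP]': 102,
    'a': 1037, 'i': 1045, 'am': 2572, 'student': 3076,
}


def tokens_to_ids(tokens, vocab, unk_token='[UNK]'):
    """Map each token to its id, falling back to the unknown-token id."""
    return [vocab.get(tok, vocab[unk_token]) for tok in tokens]


print(tokens_to_ids(['[CLS]', 'i', 'am', 'a', 'student', '[SEP]', '[PAD]'], toy_vocab))
# [101, 1045, 2572, 1037, 3076, 102, 0]
```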


In [42]:
# Convert to tensors before passing them to the model
# unsqueeze(dim) inserts a new dimension of size 1 at position dim (here, the batch dimension)

token_ids = torch.tensor(token_ids).unsqueeze(0)
attention_mask = torch.tensor(attention_mask).unsqueeze(0)

print(token_ids)
print(attention_mask)

tensor([[ 101, 1045, 2572, 1037, 3076,  102,    0,    0,    0,    0]])
tensor([[1, 1, 1, 1, 1, 1, 0, 0, 0, 0]])


In [52]:
# Embedding vector

output = model(token_ids, attention_mask = attention_mask)

print(output['last_hidden_state'].shape)
# Vectors for every token in the sentence
# (batch_size, sequence_length, hidden_size)

# Pooled vector: the [CLS] hidden state passed through a linear layer and tanh
print(output['pooler_output'].shape)

torch.Size([1, 10, 768])
torch.Size([1, 768])


## QnA Task

- A task that takes a question and a context passage and returns an answer
- Dataset: SQuAD (a natural-language dataset of question-answer pairs)

In [69]:
import torch
from transformers import BertForQuestionAnswering, BertTokenizer

model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

In [73]:
question = ['What is the capital city of France?', ' Who wrote the novel "Pride and Prejudice"?', 'How far is the Moon from Earth?', ' What is the largest ocean on Earth?', 'When was the United Nations founded?']
paragraph = ['The capital city of France is Paris.', 'Jane Austen wrote the novel "Pride and Prejudice."', 'The Moon is about 238,855 miles (384,400 kilometers) from Earth.', 'The Pacific Ocean is the largest ocean on Earth.', 'The United Nations was founded on October 24, 1945.']

In [98]:
def preprocessing(q, p, max_length):
    """Tokenize question/paragraph pairs and pad or truncate them to max_length."""
    token_ids = []
    input_ids = []
    segment_ids = []

    for question, paragraph in zip(q, p):

        # Mark the special tokens in the raw text; the tokenizer keeps them intact
        qq = '[CLS]' + question + '[SEP]'
        pp = paragraph + '[SEP]'

        qt = tokenizer.tokenize(qq)
        pt = tokenizer.tokenize(pp)

        tokens = qt + pt
        # Segment 0 covers the question, segment 1 covers the paragraph
        segment_id = [0] * len(qt) + [1] * len(pt)

        check = max_length - len(tokens)

        if check >= 0:
            tokens += ['[PAD]'] * check
            segment_id += [0] * check
        else:
            # Truncate both lists to max_length and close with [SEP]
            tokens = tokens[:check - 1] + ['[SEP]']
            segment_id = segment_id[:max_length]

        input_id = tokenizer.convert_tokens_to_ids(tokens)

        token_ids.append(tokens)
        input_ids.append(input_id)
        segment_ids.append(segment_id)

    return token_ids, torch.tensor(input_ids), torch.tensor(segment_ids)

max_length = 100

tokens, input_ids, segment_ids = preprocessing(question, paragraph, max_length)

print(f'number of sentences : {len(tokens)},\ninput_ids shape : {input_ids.shape},\ninput_segments_ids shape : {segment_ids.shape}')

number of sentences : 5,
input_ids shape : torch.Size([5, 100]),
input_segments_ids shape : torch.Size([5, 100])


In [99]:
print(question[0], paragraph[0])
print(tokenizer.decode(input_ids[0]))

What is the capital city of France? The capital city of France is Paris.
[CLS] what is the capital city of france? [SEP] the capital city of france is paris. [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]


In [100]:
output = model(input_ids, token_type_ids = segment_ids)
print(output.keys())
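# Scores (logits, not probabilities) for each position being the start / end token of the answer
# Note: no attention_mask is passed here, so the [PAD] positions are attended to as well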

odict_keys(['start_logits', 'end_logits'])


In [101]:
start_scores = output['start_logits']
end_scores = output['end_logits']

print(start_scores.shape, end_scores.shape)

torch.Size([5, 100]) torch.Size([5, 100])


In [102]:
start_index = torch.argmax(start_scores[0])
end_index = torch.argmax(end_scores[0])

print(start_index, end_index)

tensor(16) tensor(16)
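
Taking the argmax of the start and end scores independently works for the example above, but nothing guarantees that the end index comes after the start index. A common alternative is to search for the pair with the highest combined score subject to `start <= end`. The sketch below uses made-up logits, and both `best_span` and `max_answer_len` are hypothetical names, not part of the library.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return the (start, end) pair with the highest summed score, start <= end."""
    best = (0, 0)
    best_score = float("-inf")
    for s, s_score in enumerate(start_logits):
        # Only consider end positions at or after the start, within the length cap.
        last = min(len(end_logits), s + max_answer_len)
        for e in range(s, last):
            score = s_score + end_logits[e]
            if score > best_score:
                best_score = score
                best = (s, e)
    return best


# Made-up logits where the independent argmax gives start=2, end=0 (invalid).
start_logits = [0.1, 0.2, 5.0, 0.3]
end_logits = [4.0, 0.1, 0.2, 3.0]
print(best_span(start_logits, end_logits))  # (2, 3)
```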


In [103]:
for i in range(len(start_scores)):

    print('Q :', question[i])
    start_index = torch.argmax(start_scores[i])
    end_index = torch.argmax(end_scores[i])
    print('A :', ' '.join(tokens[i][start_index:end_index+1]))

Q : What is the capital city of France?
A : paris
Q :  Who wrote the novel "Pride and Prejudice"?
A : jane austen
Q : How far is the Moon from Earth?
A : 238 , 85 ##5 miles
Q :  What is the largest ocean on Earth?
A : the pacific ocean
Q : When was the United Nations founded?
A : october 24 , 1945
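
The third answer still contains a raw WordPiece fragment (`85 ##5`) because the tokens are joined with plain spaces. A minimal sketch of merging the `##` continuation pieces back into words is shown below. `merge_wordpieces` is a hypothetical helper, and the tokenizer's own `convert_tokens_to_string` method could be used instead.

```python
def merge_wordpieces(tokens):
    """Join tokens with spaces, attaching '##' pieces to the previous token."""
    words = []
    for tok in tokens:
        if tok.startswith('##') and words:
            # Continuation piece: glue it onto the previous word without a space.
            words[-1] += tok[2:]
        else:
            words.append(tok)
    return ' '.join(words)


print(merge_wordpieces(['238', ',', '85', '##5', 'miles']))  # 238 , 855 miles
```

Spacing around punctuation is left untouched here, so a further cleanup step would be needed to recover `238,855`.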
