
# A Simple BERT Implementation

In [1]:
!pip install folium==0.2.1
!pip install urllib3==1.25.11
!pip install pytorch_transformers==1.2.0

Collecting folium==0.2.1
  Downloading folium-0.2.1.tar.gz (69 kB)
Building wheels for collected packages: folium
  Building wheel for folium (setup.py) ... done
  Created wheel for folium: filename=folium-0.2.1-py3-none-any.whl size=79809 sha256=178dc168d81a6aafead55861a0e9de5303ae94f7aa2392de0c6953681653748e
  Stored in directory: /root/.cache/pip/wheels/9a/f0/3a/3f79a6914ff5affaf50cabad60c9f4d565283283c97f0bdccf
Successfully built folium
Installing collected packages: folium
  Attempting uni

## Predicting a masked word in a sentence
Mask one word in a sentence and predict it with a BERT model (MaskedLM).

In [6]:
import torch
from pytorch_transformers import BertForMaskedLM
from pytorch_transformers import BertTokenizer


text = "[CLS] I played baseball with my friends at school yesterday [SEP]"
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
words = tokenizer.tokenize(text)
print(words)

100%|██████████| 231508/231508 [00:00<00:00, 2614711.44B/s]


['[CLS]', 'i', 'played', 'baseball', 'with', 'my', 'friends', 'at', 'school', 'yesterday', '[SEP]']
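Every word in this sentence maps to a single token, but BERT's tokenizer can also split a word into several subword pieces. The following is a minimal sketch of greedy longest-match-first splitting, the general idea behind WordPiece. The function `wordpiece` and the vocabulary `toy_vocab` are hypothetical and written for illustration only; they are not the `pytorch_transformers` implementation or the real `bert-base-uncased` vocabulary.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one lowercase word into subword pieces, longest match first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces get a "##" prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # nothing matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary (assumption, for illustration only)
toy_vocab = {"baseball", "play", "##ing", "base", "##ball"}

print(wordpiece("baseball", toy_vocab))  # the whole word is in the vocabulary
print(wordpiece("playing", toy_vocab))   # split into a stem and a continuation piece
print(wordpiece("xyz", toy_vocab))       # no piece matches
```

A word that is split this way takes up more than one position in the token list, so positions such as `msk_idx` must be counted on the tokenized output rather than on the whitespace-separated words.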


Replace one token in the sentence with `[MASK]`

In [7]:
msk_idx = 3
words[msk_idx] = "[MASK]"  # replace the token with [MASK]
print(words)

['[CLS]', 'i', 'played', '[MASK]', 'with', 'my', 'friends', 'at', 'school', 'yesterday', '[SEP]']


Convert the tokens to their vocabulary indices

In [8]:
word_ids = tokenizer.convert_tokens_to_ids(words)  # convert the tokens to vocabulary indices
word_tensor = torch.tensor([word_ids])  # convert to a tensor with a batch dimension
print(word_tensor)

tensor([[ 101, 1045, 2209,  103, 2007, 2026, 2814, 2012, 2082, 7483,  102]])
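Conceptually, this conversion is a dictionary lookup in the tokenizer's vocabulary. The sketch below rebuilds the lookup in plain Python using only the token/ID pairs visible in the outputs above. The helper `tokens_to_ids` is hypothetical, and the fallback ID 100 for `[UNK]` is an assumption about the `bert-base-uncased` vocabulary.

```python
# Token/ID pairs copied from the outputs above (a tiny subset of the real vocabulary)
vocab = {
    "[CLS]": 101, "[SEP]": 102, "[MASK]": 103,
    "i": 1045, "played": 2209, "with": 2007, "my": 2026,
    "friends": 2814, "at": 2012, "school": 2082, "yesterday": 7483,
}
UNK_ID = 100  # assumed ID of [UNK]


def tokens_to_ids(tokens, vocab, unk_id=UNK_ID):
    """Look up each token; tokens missing from the vocabulary map to unk_id."""
    return [vocab.get(token, unk_id) for token in tokens]


masked_tokens = ["[CLS]", "i", "played", "[MASK]", "with", "my",
                 "friends", "at", "school", "yesterday", "[SEP]"]
print(tokens_to_ids(masked_tokens, vocab))
```

The printed list should match the row of the tensor above, with 103 at the masked position.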


Run the prediction with the BERT model

In [9]:
msk_model = BertForMaskedLM.from_pretrained("bert-base-uncased")
msk_model.cuda()
msk_model.eval()  # evaluation mode

x = word_tensor.cuda()
y = msk_model(x)
result = y[0]
print(result.size())  # for a tensor, size() returns its shape

_, max_ids = torch.topk(result[0][msk_idx], k=10)  # indices of the 10 largest values
result_words = tokenizer.convert_ids_to_tokens(max_ids.tolist())  # convert the indices back to tokens

print(result_words)

100%|██████████| 433/433 [00:00<00:00, 83305.06B/s]
100%|██████████| 440473133/440473133 [00:12<00:00, 34908635.96B/s]


torch.Size([1, 11, 30522])
['basketball', 'football', 'soccer', 'baseball', 'tennis', 'chess', 'golf', 'guitar', 'pool', 'softball']
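The output has shape `[1, 11, 30522]`: one score (logit) per vocabulary entry for each of the 11 token positions. The candidate words are the vocabulary entries with the largest scores at the masked position. Here is a rough pure-Python sketch of that selection step. The five-word vocabulary and the logits are made-up values for illustration, and `top_k_tokens` is a hypothetical helper, not part of `torch` or `pytorch_transformers`.

```python
import heapq


def top_k_tokens(logits, id_to_token, k):
    """Return the k tokens with the largest logits, best first."""
    best_ids = heapq.nlargest(k, range(len(logits)), key=lambda i: logits[i])
    return [id_to_token[i] for i in best_ids]


# Hypothetical toy vocabulary and logits for one masked position
toy_vocab = ["basketball", "football", "guitar", "flour", "soccer"]
toy_logits = [9.1, 8.7, 5.2, -1.3, 8.4]

print(top_k_tokens(toy_logits, toy_vocab, k=3))
```

Softmax preserves the order of the scores, so ranking the raw logits gives the same candidates as ranking the probabilities would.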


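## Judging whether two sentences are consecutive

Use a BERT model to judge whether two sentences are consecutive (Next Sentence Prediction).
`show_continuity` scores the continuity of two sentences and prints the result.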

In [59]:
from pytorch_transformers import BertForNextSentencePrediction

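def show_continuity(text, seg_ids):
    # Tokenize the text and convert the tokens to vocabulary indices
    ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))
    ids_tensor = torch.tensor([ids])
    seg_tensor = torch.tensor([seg_ids])  # 0: first sentence, 1: second sentence

    print(ids_tensor)
    print(seg_tensor)

    x = ids_tensor.cuda()
    s = seg_tensor.cuda()

    # Note: the pretrained model is loaded again on every call
    nsp_model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")
    nsp_model.cuda()
    nsp_model.eval()  # evaluation mode

    y = nsp_model(x, token_type_ids=s)  # pass the segment IDs by keyword
    print(y)
    result = torch.softmax(y[0], dim=1)  # turn the two logits into probabilities
    print(result)
    # print(nsp_model)
    print(f"Continuity probability (%): {result[0][0].item()*100}")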

In [71]:
text = "[CLS] What is soccer ? [SEP] It is a game of shoot the boal [SEP]"
seg_ids = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]  # 0: tokens of the first sentence, 1: tokens of the second sentence
show_continuity(text, seg_ids)

tensor([[ 101, 2054, 2003, 4715, 1029,  102, 2009, 2003, 1037, 2208, 1997, 5607,
         1996, 8945, 2389,  102]])
tensor([[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
(tensor([[ 5.5447, -4.9307]], device='cuda:0', grad_fn=<AddmmBackward0>),)
tensor([[9.9997e-01, 2.8221e-05]], device='cuda:0', grad_fn=<SoftmaxBackward0>)
Continuity probability (%): 99.99717473983765
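The step from the two logits to the probabilities can be checked by hand. The sketch below applies a plain-Python softmax to the rounded logits printed above, `[5.5447, -4.9307]`. Because the printed logits are rounded to four decimals, the result only approximately matches the tensor output.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    largest = max(logits)
    exps = [math.exp(value - largest) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Rounded logits copied from the output above
probs = softmax([5.5447, -4.9307])
print(probs)
print(probs[0] * 100)  # index 0 is what show_continuity reports as the continuity probability
```

A gap of about 10.5 between the two logits is enough to push the first probability above 99.99%, which is why the reported value is so close to 100.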


In [75]:
text = "[CLS] What is soccer ? [SEP] This is made with flour and milk [SEP]"
seg_ids = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]  # 0: tokens of the first sentence, 1: tokens of the second sentence
show_continuity(text, seg_ids)

tensor([[  101,  2054,  2003,  4715,  1029,   102,  2023,  2003,  2081,  2007,
         13724,  1998,  6501,   102]])
tensor([[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]])
(tensor([[-4.1400,  7.1774]], device='cuda:0', grad_fn=<AddmmBackward0>),)
tensor([[1.2160e-05, 9.9999e-01]], device='cuda:0', grad_fn=<SoftmaxBackward0>)
Continuity probability (%): 0.0012159755897300784
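In both examples the segment ID list is written by hand, which is easy to get wrong. In the first example, "boal" is split into two subword tokens, so 15 whitespace-separated items become 16 IDs. One possible alternative is to derive the segment IDs from the tokenized sequence itself. The helper `make_segment_ids` below is a hypothetical sketch, assuming the token list contains exactly two `[SEP]` markers as in the examples above.

```python
def make_segment_ids(tokens, sep="[SEP]"):
    """Return 0 for every token up to and including the first [SEP], 1 afterwards."""
    first_sep = tokens.index(sep)  # raises ValueError if no [SEP] is present
    return [0 if i <= first_sep else 1 for i in range(len(tokens))]


# Tokens of the second example; no word is split here, so a plain split() is enough
tokens = "[CLS] what is soccer ? [SEP] this is made with flour and milk [SEP]".split()
print(make_segment_ids(tokens))
```

In the notebook this would be applied to the result of `tokenizer.tokenize(text)`, so that the list length always matches the number of token IDs.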
