# Cloze Task (Fill-Mask)

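A language model can solve cloze tasks (fill-mask): given a sentence with one position masked out, it predicts the missing token.

BERT-family pretrained language models are currently among the most effective models for this task.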
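To make the task concrete, here is a minimal sketch of how a cloze prompt can be built by replacing one character with a mask token. The helper `make_cloze` is hypothetical (it is not part of any library), and the default `"[MASK]"` assumes a BERT-style tokenizer; in real code the token should be read from `tokenizer.mask_token`.

```python
def make_cloze(text: str, index: int, mask_token: str = "[MASK]") -> str:
    """Replace the character at `index` with the mask token.

    Hypothetical helper for illustration; "[MASK]" is the mask token
    of BERT-style tokenizers and is only an assumed default here.
    """
    if not 0 <= index < len(text):
        raise IndexError(f"index {index} is out of range for text of length {len(text)}")
    return text[:index] + mask_token + text[index + 1:]


# Hide the fourth character (index 3) of a short sentence.
prompt = make_cloze("明天天气很好?", 3)
print(prompt)  # prints 明天天[MASK]很好?
```

The resulting string has the same shape as the prompts passed to the pipeline in the first cell.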

In [1]:
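import os
from pprint import pprint

from transformers import pipeline

# Works around the duplicate OpenMP runtime error (OMP: Error #15) seen on some installs.
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"
model_name = "hfl/chinese-macbert-base"

nlp = pipeline("fill-mask",
               model=model_name,
               tokenizer=model_name,
               device=-1,  # -1 runs on CPU; pass a GPU device id (e.g. 0) to use a GPU
               )

mask = nlp.tokenizer.mask_token

# Each call returns the top candidates for the masked position.
pprint(nlp(f"明天天{mask}很好?"))  # natural fill: 气, giving 天气 ("weather")
print("*" * 42)
pprint(nlp(f"明天心{mask}很好?"))  # natural fill: 情, giving 心情 ("mood")
print("*" * 42)
pprint(nlp(f"张亮在哪里任{mask}?"))  # natural fill: 职, giving 任职 ("to hold a post")
print("*" * 42)
pprint(nlp(f"少先队员{mask}该为老人让座位。"))  # natural fill: 应, giving 应该 ("should")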

Some weights of the model checkpoint at hfl/chinese-macbert-base were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'score': 0.3006528317928314,
  'sequence': '明 天 天 气 很 好?',
  'token': 3698,
  'token_str': '气'},
 {'score': 0.1056363433599472,
  'sequence': '明 天 天 会 很 好?',
  'token': 833,
  'token_str': '会'},
 {'score': 0.09691553562879562,
  'sequence': '明 天 天 还 很 好?',
  'token': 6820,
  'token_str': '还'},
 {'score': 0.08303344994783401,
  'sequence': '明 天 天 就 很 好?',
  'token': 2218,
  'token_str': '就'},
 {'score': 0.08257968723773956,
  'sequence': '明 天 天 都 很 好?',
  'token': 6963,
  'token_str': '都'}]
******************************************
[{'score': 0.6035364270210266,
  'sequence': '明 天 心 情 很 好?',
  'token': 2658,
  'token_str': '情'},
 {'score': 0.2056306004524231,
  'sequence': '明 天 心 会 很 好?',
  'token': 833,
  'token_str': '会'},
 {'score': 0.0558624342083931,
  'sequence': '明 天 心 也 很 好?',
  'token': 738,
  'token_str': '也'},
 {'score': 0.026620158925652504,
  'sequence': '明 天 心 就 很 好?',
  'token': 2218,
  'token_str': '就'},
 {'score': 0.015123367309570312,
  'sequence': '明 天 心 态 很 好?',
 

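By default, the downloaded model is cached in `~/.cache/huggingface/transformers` (recent versions of `transformers` use `~/.cache/huggingface/hub` instead).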

Model card for `hfl/chinese-macbert-base`:
https://huggingface.co/hfl/chinese-macbert-base?text=%E5%B7%B4%E9%BB%8E%E6%98%AF%5BMASK%5D%E5%9B%BD%E7%9A%84%E9%A6%96%E9%83%BD%E3%80%82
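Each pipeline call returns a list of dictionaries with the keys `score`, `sequence`, `token` and `token_str`, as seen in the output above. The following sketch shows one way to post-process such a list. The helper names (`best_fill`, `filter_by_score`, `clean_sequence`) are hypothetical, and the sample data is the first three entries of the first result list above with scores rounded to four decimals.

```python
from typing import Dict, List

# First three candidates from the first result list above (scores rounded).
results: List[Dict] = [
    {"score": 0.3007, "token": 3698, "token_str": "气", "sequence": "明 天 天 气 很 好?"},
    {"score": 0.1056, "token": 833, "token_str": "会", "sequence": "明 天 天 会 很 好?"},
    {"score": 0.0969, "token": 6820, "token_str": "还", "sequence": "明 天 天 还 很 好?"},
]


def best_fill(candidates: List[Dict]) -> str:
    """Return the token string of the highest-scoring candidate."""
    return max(candidates, key=lambda c: c["score"])["token_str"]


def filter_by_score(candidates: List[Dict], min_score: float) -> List[str]:
    """Keep only the token strings whose score reaches `min_score`."""
    return [c["token_str"] for c in candidates if c["score"] >= min_score]


def clean_sequence(sequence: str) -> str:
    """Drop the spaces the tokenizer inserts between Chinese characters.

    Assumption: the text contains no meaningful spaces, which holds for
    the purely Chinese prompts used here but not for mixed-language text.
    """
    return sequence.replace(" ", "")


print(best_fill(results))                      # prints 气
print(filter_by_score(results, 0.1))           # prints ['气', '会']
print(clean_sequence(results[0]["sequence"]))  # prints 明天天气很好?
```

A score threshold like the one in `filter_by_score` is a design choice rather than a pipeline feature; a suitable value depends on the application.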


Instead of using the pipeline, you can write the prediction logic yourself:

In [2]:
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# AutoModelWithLMHead is deprecated; AutoModelForMaskedLM is its replacement for masked LMs.
# Alternative checkpoint:
# tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
# model = AutoModelForMaskedLM.from_pretrained("distilbert-base-cased")
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

sequence = f"明天天{tokenizer.mask_token}很好."
input_ids = tokenizer.encode(sequence, return_tensors="pt")
# Column index of the mask token within the (1, seq_len) input tensor.
mask_token_index = torch.where(input_ids == tokenizer.mask_token_id)[1]

with torch.no_grad():  # inference only, no gradients needed
    token_logits = model(input_ids).logits  # shape: (1, seq_len, vocab_size)

# Logits over the vocabulary at the masked position.
mask_token_logits = token_logits[0, mask_token_index, :]
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()
for token in top_5_tokens:
    print(sequence.replace(tokenizer.mask_token, tokenizer.decode([token])))

Some weights of the model checkpoint at hfl/chinese-macbert-base were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


明天天会很好.
明天天气很好.
明天天都很好.
明天天还很好.
明天天也很好.
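The manual loop above prints only the filled-in sentences, whereas the pipeline also reports a `score` for each candidate. To my understanding, that score is the softmax probability of the token at the masked position. The sketch below illustrates the idea in plain Python; the vocabulary and logits are toy values invented for illustration, not model output, and `softmax` and `top_k` are hypothetical helper names.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Convert raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(probs: List[float], vocab: List[str], k: int) -> List[Tuple[str, float]]:
    """Return the k most probable (token, probability) pairs, best first."""
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Toy vocabulary and logits, invented for illustration only.
vocab = ["气", "会", "还", "就", "都"]
logits = [2.0, 1.0, 0.5, 0.2, 0.1]

probs = softmax(logits)
for token, prob in top_k(probs, vocab, 3):
    print(f"{token}\t{prob:.3f}")
```

With real tensors, the same two steps would be `torch.softmax(mask_token_logits, dim=-1)` followed by `torch.topk`, which yields both the probabilities and the token ids.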


End of section.