| [09_dialogue/01_对话模型.ipynb](https://github.com/shibing624/nlp-tutorial/tree/main/09_dialogue/01_对话模型.ipynb)  | BERT question-answering model built with transformers  |[Open In Colab](https://colab.research.google.com/github/shibing624/nlp-tutorial/blob/main/09_dialogue/01_对话模型.ipynb) |


# Dialogue Model


Use a BERT model from transformers to perform an extractive reading comprehension (question answering) task.

In [1]:
!pip install transformers

Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple


In [2]:
import os

from transformers import pipeline
from transformers import AutoModelForQuestionAnswering, BertTokenizer

# Allow duplicate OpenMP runtimes to avoid the "OMP: Error #15" crash on some setups
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"
bert_model = 'luhua/chinese_pretrain_mrc_macbert_large'
print(bert_model)
# Alternative: load the model and tokenizer explicitly and pass them to pipeline()
# model = AutoModelForQuestionAnswering.from_pretrained(bert_model)
# tokenizer = BertTokenizer.from_pretrained(bert_model)
nlp = pipeline("question-answering",
               model=bert_model,
               tokenizer=bert_model,
               device=-1,  # -1 runs on CPU; pass a GPU device id such as 0 to use CUDA
               )
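# Passage (kept in Chinese because the model is a Chinese MRC model). Gloss:
# "Hello everyone, I am Zhang Liang. I currently work as an architect in the architecture
# department of Dangdang and am a member of the High Availability Architecture group.
# I have provided an imagenet dataset, hoping to contribute to image classification tasks."
context = r"""
大家好，我是张亮，目前任职当当网架构部架构师一职，也是高可用架构群的一员。我为大家提供了一份imagenet数据集，希望能够为图像分类任务做点贡献。
"""

# context = ' '.join(list(context))

# Question: "Where does Zhang Liang work?"
result = nlp(question="张亮在哪里任职?", context=context)
print(
    f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}")
# Question: "What dataset did Zhang Liang provide for image classification?"
result = nlp(question="张亮为图像分类提供了什么数据集?", context=context)
print(
    f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}")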


Answer: '当当网', score: 0.6132, start: 14, end: 17
Answer: 'imagenet数据集', score: 0.6238, start: 47, end: 58
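
The `start` and `end` values printed above can be read as character offsets into `context`, with `end` exclusive. The minimal sketch below checks that reading by slicing the context string directly; it needs no model download. The `spans` list is copied by hand from the output above, and the variable names are illustrative only.

```python
# Same passage as in the cell above, including the leading newline,
# which occupies character index 0.
context = r"""
大家好，我是张亮，目前任职当当网架构部架构师一职，也是高可用架构群的一员。我为大家提供了一份imagenet数据集，希望能够为图像分类任务做点贡献。
"""

# (start, end) pairs copied from the pipeline output above
spans = [(14, 17), (47, 58)]

# Slicing with an exclusive end offset recovers each answer string
answers = [context[start:end] for start, end in spans]
for (start, end), answer in zip(spans, answers):
    print(f"context[{start}:{end}] = {answer!r}")
```

Because the offsets index the original string, they can be used to highlight the answer in the source passage without re-tokenizing it.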



# Custom Prediction

In [3]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, BertTokenizer
import torch

bert_model = 'luhua/chinese_pretrain_mrc_macbert_large'
# English alternative, loaded with AutoTokenizer:
# tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
# model = AutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
model = AutoModelForQuestionAnswering.from_pretrained(bert_model)
tokenizer = BertTokenizer.from_pretrained(bert_model)

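# Passage gloss: "Hello everyone, I am Zhang Mingliang. I currently work as an architect in
# Tencent's architecture department and am a member of the High Availability Architecture group.
# I have provided an imagenet dataset, hoping to contribute to image classification tasks."
# Note: the passage names 张明亮 (Zhang Mingliang), while the questions ask about 张亮 (Zhang Liang).
text = r"""
大家好，我是张明亮，目前任职腾讯架构部架构师一职，也是高可用架构群的一员。我为大家提供了一份imagenet数据集，希望能够为图像分类任务做点贡献
"""
questions = [
    "张亮在哪里上班?",  # "Where does Zhang Liang work?"
    "张亮有啥数据?",  # "What data does Zhang Liang have?"
    "谁提供了imagenet数据集?",  # "Who provided the imagenet dataset?"
]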
for question in questions:
    # Encode question and passage as one sequence: [CLS] question [SEP] passage [SEP]
    inputs = tokenizer(question, text, add_special_tokens=True, return_tensors="pt")
    #print(f'inputs:{inputs}')
    input_ids = inputs["input_ids"].tolist()[0]
    with torch.no_grad():  # inference only, so gradients are not needed
        outputs = model(**inputs)
    #print(outputs)
    answer_start_scores = outputs.start_logits
    answer_end_scores = outputs.end_logits
    # Most likely start token: argmax of the start logits
    answer_start = torch.argmax(answer_start_scores).item()
    # Most likely end token: argmax of the end logits, plus 1 so the slice includes it
    answer_end = torch.argmax(answer_end_scores).item() + 1
    answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))
    print(f"Question: {question}")
    print(f"Answer: {answer}")

Question: 张亮在哪里上班?
Answer: 腾 讯
Question: 张亮有啥数据?
Answer: imagenet 数 据 集
Question: 谁提供了imagenet数据集?
Answer: 张 明 亮
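
Two details of the custom prediction above may be worth handling in practice. First, taking the argmax of the start and end logits independently can yield an end position before the start, which gives an empty answer. Second, the decoded answers contain spaces between Chinese characters (for example `腾 讯`). The sketch below is one possible post-processing step, not the notebook author's method: `best_span` and `strip_cjk_spaces` are hypothetical helper names, the logits are made-up toy values, and `max_answer_len=30` is an arbitrary assumption.

```python
import re
from typing import List, Tuple


def best_span(start_logits: List[float], end_logits: List[float],
              max_answer_len: int = 30) -> Tuple[int, int]:
    """Return (start, end) token indices, end inclusive, that maximize
    start_logit + end_logit subject to start <= end < start + max_answer_len."""
    best_score = float("-inf")
    best = (0, 0)
    for s, s_score in enumerate(start_logits):
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = s_score + end_logits[e]
            if score > best_score:
                best_score = score
                best = (s, e)
    return best


# Basic CJK Unified Ideographs range (an assumption that covers common Chinese text)
_CJK = r"[\u4e00-\u9fff]"


def strip_cjk_spaces(answer: str) -> str:
    """Remove whitespace that directly follows or precedes a CJK character."""
    return re.sub(rf"(?<={_CJK})\s+|\s+(?={_CJK})", "", answer)


# Toy logits: independent argmax would pick start=2 and end=1, an invalid span
start_logits = [0.1, 0.2, 5.0, 0.3, 0.1]
end_logits = [0.0, 6.0, 0.2, 4.0, 0.1]
span = best_span(start_logits, end_logits)
print(span)

# Strings taken from the output above
cleaned = [strip_cjk_spaces(a) for a in ["腾 讯", "imagenet 数 据 集", "张 明 亮"]]
print(cleaned)
```

With real model outputs, the lists could be obtained with `outputs.start_logits[0].tolist()` and `outputs.end_logits[0].tolist()`; since `best_span` returns an inclusive end index, the token slice would be `input_ids[start:end + 1]`.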


End of section.