<center><a href="https://5loi.com/about_loi"> <img src="images/DLI_Header.png" alt="Header" style="width: 400px;"/> </a></center>

# Overview

## Task Description

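- Given a context and a natural language query, we want to generate an answer for the query
- Depending on how the answer is generated, the task can be broadly divided into two types:
    1. Extractive Question Answering
    2. **Generative Question Answering**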


### Generative Question Answering with S2S and GPT-like models

Given a question and a context, both in natural language, generate an answer to the question. Unlike with BERT-like models, the answer is not constrained to be a span within the context.
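
The difference between the two task types can be illustrated with a toy check: an extractive answer must appear verbatim in the context, while a generative answer need not. This is only a sketch; the context and answers below are made up for illustration and do not come from SQuAD.

```python
def is_extractive(answer: str, context: str) -> bool:
    """Return True if the answer is a verbatim span of the context."""
    return answer in context


# Hypothetical example data, not taken from any dataset.
context = "The Eiffel Tower was completed in 1889 and stands in Paris."
extractive_answer = "1889"                      # copied directly from the context
generative_answer = "It was finished in 1889."  # free-form text, not a span

print(is_extractive(extractive_answer, context))
print(is_extractive(generative_answer, context))
```

A generative model may still produce an answer that happens to be a span; the point is only that nothing forces it to.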

In [None]:
BRANCH = 'main'

# Imports and constants

In [None]:
import os
import wget
import gc

import pytorch_lightning as pl
from omegaconf import OmegaConf

from nemo.collections.nlp.models.question_answering.qa_gpt_model import GPTQAModel
from nemo.collections.nlp.models.question_answering.qa_s2s_model import S2SQAModel

gc.disable()

In [None]:
# set the following paths
DATA_DIR = "data" # directory for storing datasets
WORK_DIR = "work_dir" # directory for storing trained models, logs, additionally downloaded scripts

os.makedirs(DATA_DIR, exist_ok=True)
os.makedirs(WORK_DIR, exist_ok=True)

# Configuration

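The model is defined in a config file that declares several important sections:
- **model**: all arguments related to the model: language model, span prediction, optimizer and schedulers, datasets, and any other related information
- **trainer**: any arguments to be passed to PyTorch Lightning
- **exp_manager**: all arguments used for setting up the experiment manager: target directory, name, logger information

We will download the default config file provided at `NeMo/examples/nlp/question_answering/conf/qa_conf.yaml` and edit the values needed to train the different models.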
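
To make the three-section layout concrete, here is a minimal sketch that uses a plain nested dict as a stand-in for the loaded config. The helper `set_by_path` and the skeleton `config_sketch` are hypothetical names introduced for illustration; the real `qa_conf.yaml` has many more keys, and the notebook edits it through OmegaConf attribute access rather than this helper.

```python
from typing import Any


def set_by_path(cfg: dict, dotted_key: str, value: Any) -> None:
    """Set a nested value such as 'model.optim.lr' in a plain dict."""
    *parents, leaf = dotted_key.split(".")
    node = cfg
    for name in parents:
        # Create intermediate sections on demand.
        node = node.setdefault(name, {})
    node[leaf] = value


# Illustrative skeleton only (assumption): three top-level sections.
config_sketch = {
    "model": {"dataset": {}, "optim": {}},
    "trainer": {},
    "exp_manager": {},
}

set_by_path(config_sketch, "model.optim.lr", 5e-5)
set_by_path(config_sketch, "trainer.max_epochs", 1)
set_by_path(config_sketch, "exp_manager.name", "QA-SQuAD2")

print(sorted(config_sketch))
print(config_sketch["model"]["optim"])
```

Each dotted path corresponds to one override of the kind applied to the downloaded config later in this notebook.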

In [None]:
# download the model's default configuration file 
config_dir = WORK_DIR + '/conf/'
os.makedirs(config_dir, exist_ok=True)
if not os.path.exists(config_dir + "qa_conf.yaml"):
    print('Downloading config file...')
    wget.download(f'https://raw.githubusercontent.com/NVIDIA/NeMo/{BRANCH}/examples/nlp/question_answering/conf/qa_conf.yaml', config_dir)
else:
    print('config file already exists')

In [None]:
# this will print the entire default config of the model
config_path = f'{WORK_DIR}/conf/qa_conf.yaml'
print(config_path)
config = OmegaConf.load(config_path)
print("Default Config - \n")
print(OmegaConf.to_yaml(config))

# Training and testing models on SQuAD v2.0

## Dataset

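In this example, we use the [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) dataset to show how to run training and inference. There are two versions, SQuAD1.1 and SQuAD2.0. SQuAD1.1, the earlier version, contains 100,000+ question-answer pairs on 500+ articles. SQuAD2.0 combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones.

The data directory "squad" has already been prepared and contains the following four files for training and evaluation: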


```
squad  
│
└───v1.1
│   │ -  train-v1.1.json
│   │ -  dev-v1.1.json
│
└───v2.0
    │ -  train-v2.0.json
    │ -  dev-v2.0.json
```
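
The JSON files follow the nested SQuAD layout (articles, then paragraphs, then question-answer entries). The sketch below walks a tiny hand-written sample in that layout and counts answerable and unanswerable questions; the sample content is invented for illustration, and `count_questions` is a hypothetical helper, not part of NeMo.

```python
import json
from typing import Tuple

# A tiny hand-written sample in the SQuAD v2.0 layout (not real dataset content).
sample = json.loads("""
{
  "version": "v2.0",
  "data": [
    {
      "title": "Example",
      "paragraphs": [
        {
          "context": "NeMo is a toolkit for conversational AI.",
          "qas": [
            {
              "id": "q1",
              "question": "What is NeMo?",
              "is_impossible": false,
              "answers": [{"text": "a toolkit for conversational AI", "answer_start": 8}]
            },
            {
              "id": "q2",
              "question": "Who founded NeMo in 1950?",
              "is_impossible": true,
              "answers": []
            }
          ]
        }
      ]
    }
  ]
}
""")


def count_questions(squad: dict) -> Tuple[int, int]:
    """Return (answerable, unanswerable) question counts."""
    answerable = unanswerable = 0
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                # v1.1 files have no "is_impossible" key, so default to answerable.
                if qa.get("is_impossible", False):
                    unanswerable += 1
                else:
                    answerable += 1
    return answerable, unanswerable


print(count_questions(sample))
```

The same function could be pointed at `json.load(open(...))` of one of the files above, assuming they follow this layout.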

In [None]:
!ls -LR {DATA_DIR}/squad

## Set dataset config values

In [None]:
# if True, model will load features from cache if file is present, or
# create features and dump to cache file if not already present
config.model.dataset.use_cache = False

# indicates whether the dataset has unanswerable questions
config.model.dataset.version_2_with_negative = True

# indicates whether the dataset is of extractive nature or not
# if True, context spans/chunks that do not contain answer are treated as unanswerable 
config.model.dataset.check_if_answer_in_context = True

# set file paths for train, validation, and test datasets
config.model.train_ds.file = f"{DATA_DIR}/squad/v2.0/train-v2.0.json"
config.model.validation_ds.file = f"{DATA_DIR}/squad/v2.0/dev-v2.0.json"
config.model.test_ds.file = f"{DATA_DIR}/squad/v2.0/dev-v2.0.json"

# set batch sizes for train, validation, and test datasets
config.model.train_ds.batch_size = 8
config.model.validation_ds.batch_size = 8
config.model.test_ds.batch_size = 8

# set number of samples to be used from dataset. setting to -1 uses entire dataset
config.model.train_ds.num_samples = 5000
config.model.validation_ds.num_samples = 1000
config.model.test_ds.num_samples = 100
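
The effect of `check_if_answer_in_context` can be sketched as follows: when a long context is split into chunks, any chunk that does not contain the answer gets an empty target and is treated as unanswerable. This is a simplified illustration under stated assumptions, not NeMo's implementation: `label_chunks`, `chunk_size`, and `stride` are hypothetical names, and chunking is done on whitespace-separated words rather than tokenizer output.

```python
from typing import List, Tuple


def label_chunks(context: str, answer: str, chunk_size: int, stride: int) -> List[Tuple[str, str]]:
    """Split the context into overlapping word chunks and pair each with a target.

    Chunks that do not contain the answer get an empty target, i.e. they are
    treated as unanswerable.
    """
    words = context.split()
    labeled = []
    for start in range(0, len(words), stride):
        chunk = " ".join(words[start:start + chunk_size])
        target = answer if answer in chunk else ""
        labeled.append((chunk, target))
        # Stop once a chunk has reached the end of the context.
        if start + chunk_size >= len(words):
            break
    return labeled


# Made-up example context and answer.
context = "The Amazon river flows through Brazil and empties into the Atlantic Ocean"
for chunk, target in label_chunks(context, "Atlantic Ocean", chunk_size=6, stride=4):
    print(repr(chunk), "->", repr(target))
```

Only the last chunk contains the answer, so the first two chunks would be trained with an empty target.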

## Set trainer config values

In [None]:
config.trainer.max_epochs = 1
config.trainer.max_steps = -1 # -1 disables the step limit, so max_epochs controls training length
config.trainer.precision = 16
config.trainer.devices = [0] # list of GPU indices to use; [0] selects the first GPU. This tutorial does not support multiple GPUs. If needed, use NeMo/examples/nlp/question_answering/question_answering.py
config.trainer.accelerator = "gpu"
config.trainer.strategy="auto"
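
As a rough sanity check on training length, the number of optimizer steps implied by these settings can be estimated. The sketch below assumes single-device training, no gradient accumulation, a final partial batch that is kept, and that training stops at whichever of the epoch or step limits is reached first; `planned_steps` is a hypothetical helper.

```python
import math


def planned_steps(num_samples: int, batch_size: int, max_epochs: int, max_steps: int = -1) -> int:
    """Estimate optimizer steps for single-device training without gradient accumulation."""
    steps_per_epoch = math.ceil(num_samples / batch_size)
    epoch_limit = steps_per_epoch * max_epochs
    # A negative max_steps means "no step limit", so only the epoch limit applies.
    return epoch_limit if max_steps < 0 else min(epoch_limit, max_steps)


# Values used in this notebook: 5000 training samples, batch size 8, 1 epoch.
print(planned_steps(num_samples=5000, batch_size=8, max_epochs=1))
# A positive max_steps would cut training short.
print(planned_steps(num_samples=5000, batch_size=8, max_epochs=1, max_steps=100))
```

With the values above, one epoch corresponds to 625 steps.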

## Set experiment manager config values

In [None]:
# config.exp_manager.exp_dir = WORK_DIR
# config.exp_manager.name = "QA-SQuAD2"
# config.exp_manager.create_wandb_logger=False

## S2S BART model for SQuAD v2.0

### Set model config values

In [None]:
# set language model and tokenizer to be used
# tokenizer is derived from model if a tokenizer name is not provided
config.model.language_model.pretrained_model_name = "facebook/bart-base"
config.model.tokenizer.tokenizer_name = "facebook/bart-base"

# path where model will be saved
config.model.nemo_path = f"{WORK_DIR}/checkpoints/bart_squad_v2_0.nemo"

config.exp_manager.create_checkpoint_callback = True

config.model.optim.lr = 5e-5

#remove vocab_file from gpt model
config.model.tokenizer.vocab_file = None

### Create trainer and initialize model

In [None]:
# uncomment below line and run if you get an error while initializing tokenizer on Colab (reference: https://github.com/huggingface/transformers/issues/8690)
# !rm -r /root/.cache/huggingface/

trainer = pl.Trainer(**config.trainer)
model = S2SQAModel(config.model, trainer=trainer)

### Train, test, and save the model

In [None]:
trainer.fit(model)
trainer.test(model)

model.save_to(config.model.nemo_path)
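
SQuAD-style question answering is conventionally scored with exact match and token-level F1 between a predicted and a reference answer. The sketch below follows the answer normalization commonly used for SQuAD evaluation (lowercasing, removing punctuation and English articles); it is an illustration only, not NeMo's implementation, and the metrics reported by `trainer.test` are not described here.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Token-level F1 between normalized prediction and reference."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match (e.g. a correctly predicted "no answer").
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Made-up prediction/reference pairs.
print(exact_match("The Atlantic Ocean.", "atlantic ocean"))
print(f1_score("in the Atlantic Ocean", "Atlantic Ocean"))
```

Because generative models can phrase an answer freely, F1 is often more forgiving than exact match for the models trained in this notebook.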

### Load the saved model and run inference

In [None]:
model = S2SQAModel.restore_from(config.model.nemo_path)

eval_device = [config.trainer.devices[0]] if isinstance(config.trainer.devices, list) else 1
model.trainer = pl.Trainer(
    devices=eval_device,
    accelerator=config.trainer.accelerator,
    precision=16,
    logger=False,
)

all_preds, all_nbest = model.inference(
    config.model.test_ds.file,
#     output_prediction_file=output_prediction_file,
#     output_nbest_file=output_nbest_file,
    num_samples=10, # setting to -1 will use all samples for inference
)

for question_id in all_preds:
    print(all_preds[question_id])

## GPT2 model for SQuAD v2.0

### Exercise 1 - Set model config values

* Replace each `<<<<FIXME>>>>` placeholder to use the `gpt2` pre-trained model and tokenizer.

In [None]:
# set language model and tokenizer to be used
# tokenizer is derived from model if a tokenizer name is not provided
config.model.language_model.pretrained_model_name = <<<<FIXME>>>>
config.model.tokenizer.tokenizer_name = <<<<FIXME>>>>

# path where model will be saved
config.model.nemo_path = f"{WORK_DIR}/checkpoints/gpt2_squad_v2_0.nemo"

config.exp_manager.create_checkpoint_callback = True

config.model.optim.lr = 1e-4

### Create trainer and initialize model

In [None]:
trainer = pl.Trainer(**config.trainer)
model = GPTQAModel(config.model, trainer=trainer)

### Exercise 2 - Train, test, and save the model

* Replace each `<<<<FIXME>>>>` placeholder to train, test, and save the model.

In [None]:
<<<<FIXME>>>>.fit(<<<<FIXME>>>>)
<<<<FIXME>>>>.test(<<<<FIXME>>>>)

<<<<FIXME>>>>.save_to(config.model.nemo_path)

### Exercise 3 - Load the saved model and run inference

* Replace the `<<<<FIXME>>>>` placeholder to run inference from the saved model.

In [None]:
model = GPTQAModel.restore_from(config.model.nemo_path)

eval_device = [config.trainer.devices[0]] if isinstance(config.trainer.devices, list) else 1
model.trainer = pl.Trainer(
    devices=eval_device,
    accelerator=config.trainer.accelerator,
    precision=16,
    logger=False,
)

all_preds, all_nbest = model.<<<<FIXME>>>>(
    config.model.test_ds.file,
    num_samples=10, # setting to -1 will use all samples for inference
)

for question_id in all_preds:
    print(all_preds[question_id])
