| Stage 4: Reinforcement Learning | Reinforcement learning from human feedback (RLHF): the reward model is used to train the SFT model, and the generative model updates its policy according to rewards or penalties so that it produces higher-quality text that better matches human preferences   | [scripts/rl_training.py](https://github.com/shibing624/MedicalGPT/blob/main/scripts/rl_training.py) | [scripts/run_rl.sh](https://github.com/shibing624/MedicalGPT/blob/main/scripts/run_rl.sh)    | [notebook/run_rl_training.ipynb](https://github.com/shibing624/MedicalGPT/blob/main/notebook/run_rl_training.ipynb)     | [Open In Colab](https://colab.research.google.com/github/shibing624/MedicalGPT/blob/main/notebook/run_rl_training.ipynb)           |

# Stage 4: Reinforcement Learning Training

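Stage 4: RL (Reinforcement Learning). Reinforcement learning from human feedback (RLHF) uses the reward model to train the SFT model: the generative model updates its policy according to the rewards or penalties it receives, so that it produces higher-quality text that better matches human preferences.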
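
As a rough illustration of "rewards or penalties": PPO-based RLHF commonly optimizes the reward-model score minus a KL penalty that keeps the policy close to the SFT model. The sketch below is a toy, pure-Python version with made-up log-probabilities. `shaped_rewards` and its arguments are hypothetical names for illustration and are not part of the training script.

```python
from typing import List


def shaped_rewards(
    score: float,
    policy_logprobs: List[float],
    ref_logprobs: List[float],
    kl_coef: float = 0.2,
) -> List[float]:
    """Per-token rewards: a KL penalty on every token, plus the
    reward-model score added on the final token of the response."""
    rewards = [-kl_coef * (p - r) for p, r in zip(policy_logprobs, ref_logprobs)]
    rewards[-1] += score
    return rewards


# Toy numbers (assumed): the policy has drifted slightly from the reference model.
policy_lp = [-1.0, -0.5, -2.0]
ref_lp = [-1.2, -0.5, -1.5]
rewards = shaped_rewards(score=1.5, policy_logprobs=policy_lp, ref_logprobs=ref_lp)
print([round(x, 3) for x in rewards])
```

A token that the policy rates as more likely than the reference model does gets a small penalty, and the scalar score from the reward model only arrives at the end of the response.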


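#### Note:
To quickly verify that the training code works, the notebook/Colab code below uses small generative and reward models and a small sample dataset. For real use, larger models and datasets are needed to get better results.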

1. Generative model: Bloom's `bigscience/bloomz-560m`
2. Reward model: `OpenAssistant/reward-model-deberta-v3-large-v2`
3. Dataset: the RL stage can reuse the SFT dataset. Here it is a 1,000-example sample of the Belle data, located in the `data/finetune` folder

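## Set up the runtime environment

When running locally, leave the environment setup commands below commented out. When running on Colab, uncomment them to set up the environment.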

In [None]:
# !git clone --depth 1 https://github.com/shibing624/MedicalGPT.git
# %cd MedicalGPT
# %ls

Install the libraries and dependencies:

```
loguru
transformers>=4.28.1
datasets
tensorboard
tqdm>=4.47.0
peft>=0.3.0
trl
```

In [1]:
# !pip install git+https://github.com/lvwerra/trl
# !pip install -r requirements.txt

In [None]:
# %cd notebook

## Let's get started

With the environment set up, start by importing the packages.

In [2]:
# -*- coding: utf-8 -*-
"""
@author:XuMing(xuming624@qq.com)
@description: Train a model from SFT using PPO
"""

import os
from dataclasses import dataclass, field
from glob import glob
from typing import Optional

import torch
from datasets import load_dataset
from loguru import logger
from peft import LoraConfig, TaskType
from tqdm import tqdm
from transformers import (
    AutoModelForSequenceClassification,
    BloomForCausalLM,
    AutoModel,
    LlamaTokenizer,
    LlamaForCausalLM,
    BloomTokenizerFast,
    AutoTokenizer,
    HfArgumentParser,
)
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer, set_seed



In [3]:
os.environ["TOKENIZERS_PARALLELISM"] = "FALSE"
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"

MODEL_CLASSES = {
    "bloom": (BloomForCausalLM, BloomTokenizerFast),
    "chatglm": (AutoModel, AutoTokenizer),
    "llama": (LlamaForCausalLM, LlamaTokenizer),
}

DEFAULT_PAD_TOKEN = "[PAD]"
PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response: "
)
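
To make the role of `PROMPT_TEMPLATE` concrete, here is a minimal sketch of how an `instruction` and an optional `input` field can be combined into a single prompt, following the same joining rule the preprocessing step in this notebook uses. `build_prompt` is a hypothetical helper name introduced only for this example.

```python
PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response: "
)


def build_prompt(instruction: str, extra_input: str = "") -> str:
    """Join the optional input onto the instruction, then fill the template."""
    if extra_input:
        instruction = instruction + "\n" + extra_input
    return PROMPT_TEMPLATE.format_map({"instruction": instruction})


prompt = build_prompt("Translate to English.", "你好")
print(prompt)
```

The prompt ends with `### Response: `, so whatever the model generates next is treated as the answer.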


In [4]:
@dataclass
class ScriptArguments:
    """
    The name of the causal LM model we wish to fine-tune with PPO
    """
    # Model arguments
    model_type: str = field(
        default="bloom",
        metadata={"help": "Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys())}
    )
    model_name_or_path: Optional[str] = field(
        default="bigscience/bloomz-560m", metadata={"help": "The model checkpoint for weights initialization."}
    )
    reward_model_name_or_path: Optional[str] = field(default="OpenAssistant/reward-model-deberta-v3-large-v2", metadata={"help": "The reward model name"})

    tokenizer_name_or_path: Optional[str] = field(
        default=None, metadata={"help": "The tokenizer for weights initialization."}
    )
    load_in_8bit: bool = field(default=False, metadata={"help": "Whether to load the model in 8bit mode or not."})
    cache_dir: Optional[str] = field(
        default=None,
        metadata={"help": "Where do you want to store the pretrained models downloaded from huggingface.co"},
    )
    use_fast_tokenizer: bool = field(
        default=False,
        metadata={"help": "Whether to use one of the fast tokenizer (backed by the tokenizers library) or not."},
    )
    torch_dtype: Optional[str] = field(
        default="float16",
        metadata={
            "help": (
                "Override the default `torch.dtype` and load the model under this dtype. If `auto` is passed, the "
                "dtype will be automatically derived from the model's weights."
            ),
            "choices": ["auto", "bfloat16", "float16", "float32"],
        },
    )
    device_map: Optional[str] = field(
        default="auto",
        metadata={"help": "Device to map model to. If `auto` is passed, the device will be selected automatically. "},
    )
    trust_remote_code: bool = field(
        default=True,
        metadata={"help": "Whether to trust remote code when loading a model from a remote checkpoint."},
    )
    # Dataset arguments
    dataset_name: Optional[str] = field(
        default=None, metadata={"help": "The name of the dataset to use (via the datasets library)."}
    )
    dataset_config_name: Optional[str] = field(
        default=None, metadata={"help": "The configuration name of the dataset to use (via the datasets library)."}
    )
    train_file_dir: Optional[str] = field(default="../data/finetune/", metadata={"help": "The input jsonl data file folder."})
    validation_file_dir: Optional[str] = field(default="../data/finetune/", metadata={"help": "The evaluation jsonl file folder."}, )
    batch_size: Optional[int] = field(default=8, metadata={"help": "Batch size"})
    max_source_length: Optional[int] = field(default=256, metadata={"help": "Max length of prompt input text"})
    max_target_length: Optional[int] = field(default=256, metadata={"help": "Max length of output text"})
    min_target_length: Optional[int] = field(default=4, metadata={"help": "Min length of output text"})
    max_train_samples: Optional[int] = field(
        default=100,
        metadata={
            "help": (
                "For debugging purposes or quicker training, truncate the number of training examples to this "
                "value if set."
            )
        },
    )
    max_eval_samples: Optional[int] = field(
        default=10,
        metadata={
            "help": (
                "For debugging purposes or quicker training, truncate the number of evaluation examples to this "
                "value if set."
            )
        },
    )
    overwrite_cache: bool = field(
        default=False, metadata={"help": "Overwrite the cached training and evaluation sets"}
    )
    validation_split_percentage: Optional[float] = field(
        default=0.01,
        metadata={
            "help": "The percentage of the train set used as validation set in case there's no validation split"
        },
    )
    preprocessing_num_workers: Optional[int] = field(
        default=None, metadata={"help": "The number of processes to use for the preprocessing."},
    )
    # Training arguments
    use_peft: bool = field(default=True, metadata={"help": "Whether to use peft"})
    target_modules: Optional[str] = field(default=None)
    lora_rank: Optional[int] = field(default=8)
    lora_dropout: Optional[float] = field(default=0.05)
    lora_alpha: Optional[float] = field(default=32.0)
    modules_to_save: Optional[str] = field(default=None)
    peft_path: Optional[str] = field(default=None)

    do_train: bool = field(default=True, metadata={"help": "Whether to run training."})
    do_eval: bool = field(default=False, metadata={"help": "Whether to run eval on the validation set."})
    mini_batch_size: Optional[int] = field(default=1, metadata={"help": "PPO minibatch size"})
    early_stopping: Optional[bool] = field(default=False, metadata={"help": "Whether to early stop"})
    target_kl: Optional[float] = field(default=0.1, metadata={"help": "The kl target for early stopping"})
    reward_baseline: Optional[float] = field(
        default=0.0, metadata={"help": "Baseline value that is subtracted from the reward"},
    )
    init_kl_coef: Optional[float] = field(
        default=0.2, metadata={"help": "Initial KL penalty coefficient (used for adaptive and linear control)"},
    )
    adap_kl_ctrl: Optional[bool] = field(default=True, metadata={"help": "Use adaptive KL control, otherwise linear"})
    learning_rate: Optional[float] = field(default=1.5e-5, metadata={"help": "Learning rate"})
    ppo_epochs: Optional[int] = field(default=4, metadata={"help": "the number of ppo epochs"})
    gradient_accumulation_steps: Optional[int] = field(
        default=1, metadata={"help": "the number of gradient accumulation steps"}
    )
    save_steps: Optional[int] = field(default=50, metadata={"help": "Save a checkpoint every X steps"})
    output_dir: Optional[str] = field(default="outputs-rl", metadata={"help": "Output directory where the model is saved"})
    seed: Optional[int] = field(default=0, metadata={"help": "the seed"})
    max_steps: Optional[int] = field(default=50, metadata={"help": "number of steps to train"})
    log_with: Optional[str] = field(default="tensorboard", metadata={"help": "log with wandb or tensorboard"})


In [5]:
args = ScriptArguments()
args

ScriptArguments(model_type='bloom', model_name_or_path='../../../models/bigscience/bloomz-560m', reward_model_name_or_path='../../../models/OpenAssistant/reward-model-deberta-v3-large-v2', tokenizer_name_or_path=None, load_in_8bit=False, cache_dir=None, use_fast_tokenizer=False, torch_dtype='float16', device_map='auto', trust_remote_code=True, dataset_name=None, dataset_config_name=None, train_file_dir='../data/finetune/', validation_file_dir='../data/finetune/', batch_size=8, max_source_length=256, max_target_length=256, min_target_length=4, max_train_samples=100, max_eval_samples=10, overwrite_cache=False, validation_split_percentage=0.01, preprocessing_num_workers=None, use_peft=True, target_modules=None, lora_rank=8, lora_dropout=0.05, lora_alpha=32.0, modules_to_save=None, peft_path=None, do_train=True, do_eval=False, mini_batch_size=1, early_stopping=False, target_kl=0.1, reward_baseline=0.0, init_kl_coef=0.2, adap_kl_ctrl=True, learning_rate=1.5e-05, ppo_epochs=4, gradient_accum

In [6]:
args.model_type

'bloom'

In [7]:
torch_dtype = (
    args.torch_dtype
    if args.torch_dtype in ["auto", None]
    else getattr(torch, args.torch_dtype)
)
torch_dtype

torch.float16

Define helper functions for counting the LoRA trainable parameters and for scoring with the reward model

In [8]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )


def get_reward_score(reward_model, reward_tokenizer, question, answer, device):
    """
    Get the reward score for a given question and answer pair.
    """
    inputs = reward_tokenizer(question, answer, return_tensors='pt').to(device)
    score = reward_model(**inputs).logits[0].cpu().detach()

    return score


In [9]:
device = "cuda" if torch.cuda.is_available() else "cpu"
device

'cuda'

## Load tokenizer

In [10]:
model_class, tokenizer_class = MODEL_CLASSES[args.model_type]
tokenizer_class


transformers.models.bloom.tokenization_bloom_fast.BloomTokenizerFast

In [11]:
# Load tokenizer
tokenizer_kwargs = {
    "cache_dir": args.cache_dir,
    "use_fast": args.use_fast_tokenizer,
    "trust_remote_code": args.trust_remote_code,
}
tokenizer_name_or_path = args.tokenizer_name_or_path
if not tokenizer_name_or_path:
    tokenizer_name_or_path = args.model_name_or_path
tokenizer = tokenizer_class.from_pretrained(tokenizer_name_or_path, **tokenizer_kwargs)
# Required for llama
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({"pad_token": DEFAULT_PAD_TOKEN})

tokenizer

BloomTokenizerFast(name_or_path='../../../models/bigscience/bloomz-560m', vocab_size=250680, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<pad>'}, clean_up_tokenization_spaces=False)

## Load model

In [12]:
logger.info("Load model")
torch_dtype = (
    args.torch_dtype
    if args.torch_dtype in ["auto", None]
    else getattr(torch, args.torch_dtype)
)
device = "cuda" if torch.cuda.is_available() else "cpu"

peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    target_modules=args.target_modules,
    inference_mode=False,
    r=args.lora_rank,
    lora_alpha=args.lora_alpha,
    lora_dropout=args.lora_dropout,
)

model = AutoModelForCausalLMWithValueHead.from_pretrained(
    args.model_name_or_path,
    load_in_8bit=args.load_in_8bit,
    cache_dir=args.cache_dir,
    torch_dtype=torch_dtype,
    device_map=args.device_map,
    trust_remote_code=args.trust_remote_code,
    peft_config=peft_config if args.use_peft else None,
)
print_trainable_parameters(model)


# Load reward model
reward_model = AutoModelForSequenceClassification.from_pretrained(
    args.reward_model_name_or_path,
    load_in_8bit=args.load_in_8bit,
    cache_dir=args.cache_dir,
    torch_dtype=torch_dtype,
)
reward_model.to(device)
reward_tokenizer = AutoTokenizer.from_pretrained(
    args.reward_model_name_or_path, **tokenizer_kwargs
)

2023-06-07 15:58:10.385 | INFO     | __main__:<cell line: 1>:1 - Load model


trainable params: 787457 || all params: 560002049 || trainable%: 0.14061680692171896
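
The trainable-parameter count in the log can be reproduced with simple arithmetic. This is a sketch under stated assumptions that the notebook itself does not spell out: `bigscience/bloomz-560m` has hidden size 1024 and 24 layers, LoRA (rank 8) is attached only to the fused `query_key_value` projection of each layer, and the value head is one linear layer with a bias.

```python
# Assumed architecture facts for bigscience/bloomz-560m (not stated in the notebook).
hidden_size = 1024
num_layers = 24
lora_rank = 8  # matches args.lora_rank

# Assumption: LoRA wraps only the fused `query_key_value` projection
# (hidden_size -> 3 * hidden_size) in each transformer layer.
lora_a = lora_rank * hidden_size        # down-projection matrix A
lora_b = 3 * hidden_size * lora_rank    # up-projection matrix B
lora_params = num_layers * (lora_a + lora_b)

# Assumption: the value head is a single linear layer hidden_size -> 1 with a bias.
value_head_params = hidden_size + 1

trainable = lora_params + value_head_params
all_params = 560002049  # total taken from the log line above
print(trainable, 100 * trainable / all_params)
```

Under these assumptions, the LoRA adapters account for 786,432 parameters and the value head for 1,025, which together give the 787,457 reported above.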


In [13]:
# Get datasets
if args.dataset_name is not None:
    # Downloading and loading a dataset from the hub.
    raw_datasets = load_dataset(
        args.dataset_name,
        args.dataset_config_name,
        cache_dir=args.cache_dir,
    )
    if "validation" not in raw_datasets.keys():
        raw_datasets["validation"] = load_dataset(
            args.dataset_name,
            args.dataset_config_name,
            split=f"train[:{args.validation_split_percentage}%]",
            cache_dir=args.cache_dir,
        )
        raw_datasets["train"] = load_dataset(
            args.dataset_name,
            args.dataset_config_name,
            split=f"train[{args.validation_split_percentage}%:]",
            cache_dir=args.cache_dir,
        )
else:
    data_files = {}
    if args.train_file_dir is not None and os.path.exists(args.train_file_dir):
        train_data_files = glob(f'{args.train_file_dir}/**/*.json', recursive=True) + glob(
            f'{args.train_file_dir}/**/*.jsonl', recursive=True)
        logger.info(f"train files: {', '.join(train_data_files)}")
        data_files["train"] = train_data_files
    if args.validation_file_dir is not None and os.path.exists(args.validation_file_dir):
        eval_data_files = glob(f'{args.validation_file_dir}/**/*.json', recursive=True) + glob(
            f'{args.validation_file_dir}/**/*.jsonl', recursive=True)
        logger.info(f"eval files: {', '.join(eval_data_files)}")
        data_files["validation"] = eval_data_files
    raw_datasets = load_dataset(
        'json',
        data_files=data_files,
        cache_dir=args.cache_dir,
    )
    # If no validation data is there, validation_split_percentage will be used to divide the dataset.
    if "validation" not in raw_datasets.keys():
        raw_datasets["validation"] = load_dataset(
            'json',
            data_files=data_files,
            split=f"train[:{args.validation_split_percentage}%]",
            cache_dir=args.cache_dir,
        )
        raw_datasets["train"] = load_dataset(
            'json',
            data_files=data_files,
            split=f"train[{args.validation_split_percentage}%:]",
            cache_dir=args.cache_dir,
        )
logger.info(f"Raw datasets: {raw_datasets}")

2023-06-07 15:58:38.665 | INFO     | __main__:<cell line: 2>:27 - train files: ../data/finetune/Belle_open_source_1k.json
2023-06-07 15:58:38.668 | INFO     | __main__:<cell line: 2>:32 - eval files: ../data/finetune/Belle_open_source_1k.json


Downloading and preparing dataset json/default to /home/flemingxu/.cache/huggingface/datasets/json/default-81868ba195386786/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e...


Downloading data files:   0%|          | 0/2 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/2 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Dataset json downloaded and prepared to /home/flemingxu/.cache/huggingface/datasets/json/default-81868ba195386786/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

2023-06-07 15:58:40.425 | INFO     | __main__:<cell line: 53>:53 - Raw datasets: DatasetDict({
    train: Dataset({
        features: ['instruction', 'input', 'output'],
        num_rows: 1000
    })
    validation: Dataset({
        features: ['instruction', 'input', 'output'],
        num_rows: 1000
    })
})
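
The local-file branch above collects every `*.json` and `*.jsonl` file under a directory, recursively. The sketch below shows that file discovery in isolation on a throwaway directory. `find_data_files` is a hypothetical helper name, and the file names are made up. Because `train_file_dir` and `validation_file_dir` default to the same folder, both splits pick up the same file, which is why both report 1000 rows in the output above.

```python
import os
import tempfile
from glob import glob
from typing import List


def find_data_files(data_dir: str) -> List[str]:
    """Recursively collect *.json and *.jsonl files under data_dir."""
    return sorted(
        glob(f"{data_dir}/**/*.json", recursive=True)
        + glob(f"{data_dir}/**/*.jsonl", recursive=True)
    )


# Build a temporary directory tree to show which files are picked up.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "sub"))
    for name in ("a.json", os.path.join("sub", "b.jsonl"), "notes.txt"):
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as f:
            f.write("{}\n")
    found = [os.path.relpath(p, tmp) for p in find_data_files(tmp)]

print(found)
```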


In [14]:
# Preprocessing the datasets
max_source_length = args.max_source_length
max_target_length = args.max_target_length

def preprocess_function(examples):
    new_examples = {
        "query": [],
        "input_ids": [],
    }
    for instruction, input in zip(examples['instruction'], examples['input']):
        if input:
            instruction = instruction + "\n" + input
        source = PROMPT_TEMPLATE.format_map({"instruction": instruction})
        tokenized_question = tokenizer(
            source, truncation=True, max_length=max_source_length, padding="max_length",
            return_tensors="pt"
        )
        new_examples["query"].append(source)
        new_examples["input_ids"].append(tokenized_question["input_ids"])

    return new_examples


In [15]:
# Preprocess the dataset
train_dataset = None
max_train_samples = 0
if args.do_train:
    if "train" not in raw_datasets:
        raise ValueError("--do_train requires a train dataset")
    train_dataset = raw_datasets['train']
    max_train_samples = len(train_dataset)
    if args.max_train_samples is not None and args.max_train_samples > 0:
        max_train_samples = min(len(train_dataset), args.max_train_samples)
        train_dataset = train_dataset.select(range(max_train_samples))
    logger.debug(f"Example train_dataset[0]: {train_dataset[0]}")
    tokenized_dataset = train_dataset.shuffle().map(
        preprocess_function,
        batched=True,
        num_proc=args.preprocessing_num_workers,
        remove_columns=train_dataset.column_names,
        load_from_cache_file=not args.overwrite_cache,
        desc="Running tokenizer on dataset",
    )
    train_dataset = tokenized_dataset.filter(
        lambda x: len(x['input_ids']) > 0
    )
    logger.debug(f"Num train_samples: {len(train_dataset)}")
    logger.debug("Tokenized training example:")
    logger.debug(train_dataset[0]['input_ids'])

2023-06-07 15:58:40.964 | DEBUG    | __main__:<cell line: 4>:12 - Example train_dataset[0]: {'instruction': '为给定的句子生成一个同义句。\nShe is studying for her final exams.', 'input': '', 'output': 'She is preparing for her last exams.'}


Running tokenizer on dataset:   0%|          | 0/100 [00:00<?, ? examples/s]

Filter:   0%|          | 0/100 [00:00<?, ? examples/s]

2023-06-07 15:58:41.660 | DEBUG    | __main__:<cell line: 4>:24 - Num train_samples: 100
2023-06-07 15:58:41.661 | DEBUG    | __main__:<cell line: 4>:25 - Tokenized training example:
2023-06-07 15:58:41.663 | DEBUG    | __main__:<cell line: 4>:26 - [[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 111757, 632, 660, 54103, 861, 63808, 267, 20165, 17, 66828, 267, 12427, 861, 156788, 115739, 368, 8821, 6149, 105311, 182924, 29, 189, 38363, 10967, 
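
The long run of `3`s at the start of the tokenized example is left padding: the tokenizer shown earlier reports `padding_side='left'`, `truncation_side='right'`, and a pad token id of 3. A minimal pure-Python sketch of that behavior, with `pad_left` as a hypothetical helper and a shortened `max_length` for readability:

```python
from typing import List

PAD_ID = 3  # pad_token_id reported for the Bloom tokenizer in this notebook


def pad_left(token_ids: List[int], max_length: int, pad_id: int = PAD_ID) -> List[int]:
    """Truncate on the right to max_length, then pad on the left."""
    token_ids = token_ids[:max_length]
    return [pad_id] * (max_length - len(token_ids)) + token_ids


padded = pad_left([111757, 632, 660], max_length=8)
print(padded)  # [3, 3, 3, 3, 3, 111757, 632, 660]
```

Left padding keeps the real prompt tokens adjacent to the position where generation starts.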

Define the PPO configuration

In [16]:
def collator(data):
    return dict((key, [d[key] for d in data]) for key in data[0])

output_dir = args.output_dir
config = PPOConfig(
    steps=args.max_steps,
    model_name=args.model_name_or_path,
    learning_rate=args.learning_rate,
    log_with=args.log_with,
    batch_size=args.batch_size,
    mini_batch_size=args.mini_batch_size,
    gradient_accumulation_steps=args.gradient_accumulation_steps,
    optimize_cuda_cache=True,
    early_stopping=args.early_stopping,
    target_kl=args.target_kl,
    seed=args.seed,
    init_kl_coef=args.init_kl_coef,
    adap_kl_ctrl=args.adap_kl_ctrl,
    accelerator_kwargs={"project_dir": output_dir},
)
# Set seed before initializing value head for deterministic eval
set_seed(config.seed)

config

PPOConfig(model_name='../../../models/bigscience/bloomz-560m', steps=50, learning_rate=1.5e-05, adap_kl_ctrl=True, init_kl_coef=0.2, target=6, horizon=10000, gamma=1, lam=0.95, cliprange=0.2, cliprange_value=0.2, vf_coef=0.1, batch_size=8, forward_batch_size=None, mini_batch_size=1, gradient_accumulation_steps=1, ppo_epochs=4, remove_unused_columns=True, log_with='tensorboard', tracker_kwargs={}, accelerator_kwargs={'project_dir': 'outputs-rl'}, tracker_project_name='trl', max_grad_norm=None, seed=0, optimize_cuda_cache=True, early_stopping=False, target_kl=0.1, push_to_hub_if_best_kwargs={}, compare_steps=1)
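
The `adap_kl_ctrl`, `init_kl_coef`, `target`, and `horizon` values in the config describe an adaptive KL penalty: the coefficient grows when the measured KL exceeds the target and shrinks when it falls below it. The sketch below follows the commonly described proportional controller from the RLHF literature. The class name and the clipping range of ±0.2 are assumptions here, and the exact behavior inside the installed `trl` version may differ.

```python
class AdaptiveKLController:
    """Toy proportional controller for the KL penalty coefficient."""

    def __init__(self, init_kl_coef: float, target: float, horizon: int):
        self.value = init_kl_coef
        self.target = target
        self.horizon = horizon

    def update(self, current_kl: float, n_steps: int) -> None:
        # Relative error between measured and target KL, clipped to +/-0.2 (assumed range).
        error = max(-0.2, min(0.2, current_kl / self.target - 1.0))
        # A larger horizon makes each adjustment smaller.
        self.value *= 1.0 + error * n_steps / self.horizon


ctrl = AdaptiveKLController(init_kl_coef=0.2, target=6.0, horizon=10000)
ctrl.update(current_kl=12.0, n_steps=8)  # KL above target: coefficient increases
print(ctrl.value)
ctrl.update(current_kl=3.0, n_steps=8)   # KL below target: coefficient decreases
print(ctrl.value)
```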

In [17]:
# We then build the PPOTrainer, passing the model, the reference model, the tokenizer
trainer = PPOTrainer(
    config,
    model,
    ref_model=None,
    tokenizer=tokenizer,
    dataset=train_dataset,
    data_collator=collator,
)
trainer

<trl.trainer.ppo_trainer.PPOTrainer at 0x7f49e9c77160>
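
The `collator` passed to the trainer turns a list of per-sample dicts into one dict of lists, without stacking anything into tensors. A small demonstration with made-up samples, where each `input_ids` keeps the extra leading dimension that tokenizing a single string with `return_tensors="pt"` produces:

```python
def collator(data):
    # Same definition as in the notebook: list of dicts -> dict of lists.
    return dict((key, [d[key] for d in data]) for key in data[0])


# Made-up samples; nested lists stand in for tensors of shape [1, seq_len].
samples = [
    {"query": "q1", "input_ids": [[3, 3, 10]]},
    {"query": "q2", "input_ids": [[3, 11, 12]]},
]
batch = collator(samples)
print(batch)
```

Because each entry keeps that leading dimension of size 1, the training loop later calls `.squeeze(0)` on every query tensor before passing it to the trainer.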

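Before training, first check that each model works properly:

- Generative model: check that the output of the `generate` method is as expected
- Reward model: check that the reward model's `reward_score` is as expected

1. Test the generative model: can the model after SFT generate fluent text?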

In [18]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BloomTokenizerFast
ref_model = AutoModelForCausalLM.from_pretrained(args.model_name_or_path, device_map="auto")
ref_tokenizer = BloomTokenizerFast.from_pretrained(args.model_name_or_path)
query=["who are you?", "what is the capital of USA?", "what is the capital of India?"]
prompt_query = ['<human>:' + q.strip() + '\n<bot>:' for q in query]
for q in prompt_query:
    inputs = ref_tokenizer(q, return_tensors="pt")
    inputs = inputs.to(device=device)

    generate_ids = ref_model.generate(
        **inputs,
        max_new_tokens=120, 
        do_sample=True, 
        top_p=0.85, 
        temperature=1.0, 
        repetition_penalty=1.0, 
        eos_token_id=ref_tokenizer.eos_token_id, 
        bos_token_id=ref_tokenizer.bos_token_id, 
        pad_token_id=ref_tokenizer.pad_token_id,
    )

    output = ref_tokenizer.batch_decode(generate_ids, skip_special_tokens=True)[0]
    print(output)
    print()

del ref_model
del ref_tokenizer

<human>:who are you?
<bot>:I am the one with the most ability

<human>:what is the capital of USA?
<bot>:Boston

<human>:what is the capital of India?
<bot>:Shree



In [19]:
def empty_cache():
    """Release unreferenced objects, then free PyTorch's cached GPU memory."""
    import gc
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

empty_cache()

2. Test the reward model: does it work properly?

In [20]:
q = 'hi'
a = 'hello'
b = 'torccccc'

good_score = get_reward_score(reward_model, reward_tokenizer, q, a, device)
logger.info(f"good_score: {good_score}")

bad_score = get_reward_score(reward_model, reward_tokenizer, q, b, device)
logger.info(f"bad_score: {bad_score}")
assert good_score > bad_score
    

2023-06-07 15:58:49.475 | INFO     | __main__:<cell line: 6>:6 - good_score: tensor([-1.1055], dtype=torch.float16)
2023-06-07 15:58:49.540 | INFO     | __main__:<cell line: 9>:9 - bad_score: tensor([-2.5137], dtype=torch.float16)


Training loop:

In [21]:
# These arguments are passed to the `generate` function of the PPOTrainer
generation_kwargs = {
    "max_new_tokens": max_target_length,
    "temperature": 1.0,
    "repetition_penalty": 1.0,
    "top_p": 1.0,
    "do_sample": True,
    "pad_token_id": tokenizer.pad_token_id,
    "eos_token_id": tokenizer.eos_token_id,
    "bos_token_id": tokenizer.bos_token_id,
}
print(generation_kwargs)

{'max_new_tokens': 256, 'temperature': 1.0, 'repetition_penalty': 1.0, 'top_p': 1.0, 'do_sample': True, 'pad_token_id': 3, 'eos_token_id': 2, 'bos_token_id': 1}


In [22]:
def save_model(save_dir):
    """Save model"""
    trainer.accelerator.unwrap_model(trainer.model).save_pretrained(save_dir)
    trainer.tokenizer.save_pretrained(save_dir)

In [23]:
# Training
if args.do_train:
    logger.info("*** Train ***")
    total_steps = config.total_ppo_epochs
    for step, batch in tqdm(enumerate(trainer.dataloader)):
        if step >= total_steps:
            break
        question_tensors = batch["input_ids"]
        question_tensors = [torch.LongTensor(i).to(device).squeeze(0) for i in question_tensors]
        responses = []
        response_tensors = []
        for q_tensor in question_tensors:
            response_tensor = trainer.generate(
                q_tensor,
                return_prompt=False,
                **generation_kwargs,
            )
            r = tokenizer.batch_decode(response_tensor, skip_special_tokens=True)[0]
            responses.append(r)
            response_tensors.append(response_tensor.squeeze(0))
        batch["response"] = responses

        # Compute reward score
        score_outputs = [
            get_reward_score(reward_model, reward_tokenizer, q, r, device) for q, r in
            zip(batch["query"], batch["response"])
        ]
        rewards = [torch.tensor(float(score) - args.reward_baseline) for score in score_outputs]

        # Run PPO step
        try:
            stats = trainer.step(question_tensors, response_tensors, rewards)
            trainer.log_stats(stats, batch, rewards)
            logger.debug(f"Step {step}/{total_steps}: reward score:{score_outputs}")
        except ValueError as e:
            logger.warning(f"Failed to log stats for step {step}, because of {e}")

        if step and step % args.save_steps == 0:
            save_dir = os.path.join(output_dir, f"checkpoint-{step}")
            save_model(save_dir)
    # Save final model
    save_model(output_dir)

2023-06-07 15:58:50.721 | INFO     | __main__:<cell line: 2>:3 - *** Train ***
0it [00:00, ?it/s]You're using a BloomTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
2023-06-07 15:59:36.909 | DEBUG    | __main__:<cell line: 2>:34 - Step 0/7: reward score:[tensor([-3.6855], dtype=torch.float16), tensor([-2.9043], dtype=torch.float16), tensor([-2.5605], dtype=torch.float16), tensor([-3.1738], dtype=torch.float16), tensor([-3.1387], dtype=torch.float16), tensor([-3.7012], dtype=torch.float16), tensor([-4.6055], dtype=torch.float16), tensor([-2.8008], dtype=torch.float16)]
1it [00:46, 46.19s/it]2023-06-07 16:00:19.815 | DEBUG    | __main__:<cell line: 2>:34 - Step 1/7: reward score:[tensor([-0.4875], dtype=torch.float16), tensor([-1.9980], dtype=torch.float16), tensor([-3.1348], dtype=torch.float16), tensor([-2.7383], dtype=torch.floa
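
The `Step 0/7` in the log follows from the defaults used in this notebook. The arithmetic below is a sketch under two assumptions about the installed `trl` version: `config.total_ppo_epochs` is `ceil(steps / batch_size)`, and the trainer's dataloader drops the last incomplete batch.

```python
import math

steps = 50         # args.max_steps, passed as PPOConfig(steps=...)
batch_size = 8     # args.batch_size
num_samples = 100  # args.max_train_samples
save_steps = 50    # args.save_steps

# Assumption: total_ppo_epochs = ceil(steps / batch_size), consistent with "Step 0/7".
total_steps = math.ceil(steps / batch_size)

# Assumption: the last incomplete batch is dropped.
batches_available = num_samples // batch_size

# The loop stops at whichever limit is reached first.
steps_run = min(total_steps, batches_available)

# Same checkpoint condition as the training loop: `step and step % save_steps == 0`.
checkpoint_steps = [s for s in range(steps_run) if s and s % save_steps == 0]
print(total_steps, batches_available, steps_run, checkpoint_steps)
```

With these defaults no intermediate checkpoint is written, and only the final `save_model(output_dir)` call saves the model.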

In [24]:
output_dir

'outputs-rl'

In [26]:
%ls outputs-rl

adapter_config.json  pytorch_model.bin        tokenizer.json         trl/
adapter_model.bin    special_tokens_map.json  tokenizer_config.json


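Training results:
- If the model is trained with LoRA, the saved LoRA weights are `adapter_model.bin` and the LoRA config file is `adapter_config.json`. See `scripts/merge_peft_adapter.py` for how to merge them into the base model.
- Logs are saved under `output_dir/trl` and can be viewed with TensorBoard, started as follows: `tensorboard --logdir output_dir/trl --host 0.0.0.0 --port 8009`

End of this section.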