This notebook trains an RLHF model with DeepSpeed Chat using PPO.
This is the third step in training instruct LLMs with DeepSpeed Chat.
The details of this step are explained in this article: [Train Instruct LLMs On Your GPU with DeepSpeed Chat — Step #3: Reinforcement Learning with Human Feedback](https://kaitchup.substack.com/p/train-instruct-llms-on-your-gpu-with-6a5)


You can find more details on supervised fine-tuning and training a reward model using DeepSpeed Chat in these articles:

[Train Instruct LLMs On Your GPU with DeepSpeed Chat — Step #1: Supervised Fine-tuning](https://kaitchup.substack.com/p/train-instruct-llms-on-your-gpu-with)

[Train Instruct LLMs On Your GPU with DeepSpeed Chat — Step #2: Training a Reward Model](https://kaitchup.substack.com/p/train-instruct-llms-on-your-gpu-with-1e1)


In [None]:
!pip install "deepspeed>=0.9.0"

!git clone https://github.com/microsoft/DeepSpeedExamples.git
%cd DeepSpeedExamples/applications/DeepSpeed-Chat/
!pip install -r requirements.txt

Cloning into 'DeepSpeedExamples'...
remote: Enumerating objects: 8686, done.
remote: Counting objects: 100% (2085/2085), done.
remote: Compressing objects: 100% (334/334), done.
remote: Total 8686 (delta 1860), reused 1803 (delta 1714), pack-reused 6601
Receiving objects: 100% (8686/8686), 22.27 MiB | 22.51 MiB/s, done.
Resolving deltas: 100% (4952/4952), done.
/content/DeepSpeedExamples/applications/DeepSpeed-Chat
Collecting datasets>=2.8.0 (from -r requirements.txt (line 1))
  Downloading datasets-2.14.5-py3-none-any.whl (519 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 519.6/519.6 kB 7.0 MB/s eta 0:00:00
Collecting sentencepiece>=0.1.97 (from -r requirements.txt (line 2))
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.3/1.3 MB 18.1 MB/s eta 0:00:00
Collecting accelerate>=0.15.0 (f

Train the RLHF model, starting from a copy of the model trained in step 1. Training requires at least 10 GB of CPU RAM and 16 GB of VRAM for a batch size of 8.
The trained model is saved in a directory named `rlhf`, as set by `--output_dir`.
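
Before launching, it can help to work out what the batch and data-split flags in the next cell add up to. The sketch below is a back-of-the-envelope calculation, not DeepSpeed Chat code. It assumes that the number of sequences per optimizer update is the per-device batch size times the number of GPUs times the gradient accumulation steps, and that `--data_split 2,4,4` gives relative weights for the data reserved for steps 1, 2, and 3. The variable names are illustrative.

```python
# Values copied from the deepspeed command in the next cell.
num_gpus = 1
per_device_training_batch_size = 24
gradient_accumulation_steps = 6
data_split = "2,4,4"

# Assumption: sequences per optimizer update =
# per-device batch size x number of GPUs x gradient accumulation steps.
effective_batch_size = (
    per_device_training_batch_size * num_gpus * gradient_accumulation_steps
)

# Assumption: the three comma-separated weights are normalized into the
# shares of the dataset used by steps 1, 2, and 3 respectively.
weights = [float(w) for w in data_split.split(",")]
shares = [w / sum(weights) for w in weights]

print(f"Effective training batch size: {effective_batch_size}")
for step, share in enumerate(shares, start=1):
    print(f"Step {step} data share: {share:.0%}")
```

If you run out of VRAM, lowering the per-device batch sizes while raising `--gradient_accumulation_steps` keeps this product constant under the same assumption.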

In [None]:
%cd training/step3_rlhf_finetuning/


!deepspeed --num_gpus 1 main.py \
   --data_path Dahoas/rm-static \
   --data_split 2,4,4 \
   --actor_model_name_or_path kaitchup/OPT-1.3B-SFT-DSChatLoRA \
   --critic_model_name_or_path kaitchup/OPT-350M-RM-DSChat \
   --num_padding_at_beginning 1 \
   --per_device_generation_batch_size 24 \
   --per_device_training_batch_size 24 \
   --generation_batches 1 \
   --ppo_epochs 1 \
   --max_answer_seq_len 256 \
   --max_prompt_seq_len 256 \
   --actor_learning_rate 5e-4 \
   --critic_learning_rate 5e-6 \
   --num_train_epochs 1 \
   --lr_scheduler_type cosine \
   --gradient_accumulation_steps 6 \
   --num_warmup_steps 100 \
   --deepspeed --seed 1234 \
   --actor_zero_stage 0 \
   --critic_zero_stage 0 \
   --offload_reference_model \
   --actor_lora_dim 128 \
   --actor_gradient_checkpointing \
   --critic_gradient_checkpointing \
   --disable_actor_dropout \
   --enable_hybrid_engine \
   --only_optimize_lora \
   --output_dir ./rlhf

Streaming output truncated to the last 5000 lines.
-------------------------------------------------------------------------------------
|E2E latency=4.40s |Gather latency=0.00s (0.00%) |Generate time=1.84s (41.81%) |Training time=2.09s (47.60%) |Others=0.47 (10.59%)|CurSamplesPerSec=5.46 |AvgSamplesPerSec=5.42
Epoch: 0 | Step: 651 | PPO Epoch: 1 | Actor Loss: 0.004032135009765625 | Critic Loss: 0.0271148681640625 | Unsupervised Loss: 0.0
End-to-End => Latency: 3.66s, TFLOPs: 46.88, Samples/sec: 6.57, Time/seq 0.15s, Batch Size: 24, Total Seq. Length: 512
Generation => Latency: 1.95s, Per-token Latency 7.63 ms, TFLOPs: 16.80, BW: 374.48 GB/sec, Answer Seq. Length: 256
Training   => Latency: 1.70s, TFLOPs: 81.42
Actor Model Parameters => 1.429 B, Critic Model Parameters => 0.331 B
Average reward score: 0.9375
-------------------------------------------------------------------------------------
|E2E latency=4.52s |Gather latency=0.00s (0.00%) |Gen
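
To follow training progress, the losses and the average reward can be pulled out of this output. The sketch below is a minimal parser, assuming the log lines keep exactly the format shown above (other DeepSpeed Chat versions may print them differently). The function name `parse_log` and the dictionary keys are illustrative, and the sample text is copied from the output in this notebook.

```python
import re

# Two lines copied from the training output above.
log_text = """\
Epoch: 0 | Step: 651 | PPO Epoch: 1 | Actor Loss: 0.004032135009765625 | Critic Loss: 0.0271148681640625 | Unsupervised Loss: 0.0
Average reward score: 0.9375
"""

# Patterns written against the log format shown in this notebook (an assumption).
STEP_RE = re.compile(
    r"Epoch: (\d+) \| Step: (\d+) \| PPO Epoch: (\d+) "
    r"\| Actor Loss: (\S+) \| Critic Loss: (\S+)"
)
REWARD_RE = re.compile(r"Average reward score: (\S+)")


def parse_log(text):
    """Return one dict per training step found in the log text."""
    records = []
    current = None
    for line in text.splitlines():
        match = STEP_RE.search(line)
        if match:
            current = {
                "epoch": int(match[1]),
                "step": int(match[2]),
                "actor_loss": float(match[4]),
                "critic_loss": float(match[5]),
            }
            records.append(current)
            continue
        match = REWARD_RE.search(line)
        if match and current is not None:
            # Attach the reward to the most recent step line.
            current["avg_reward"] = float(match[1])
    return records


records = parse_log(log_text)
print(records)
```

Applied to a saved copy of the full log, the resulting list can be plotted to check whether the average reward rises over the steps.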

You can compare the RLHF model with the model fine-tuned in step 1 using the following script. Replace the model paths with your own.

In [None]:
%cd /content/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/
!python prompt_eval.py --model_name_or_path_baseline  kaitchup/OPT-1.3B-SFT-DSChatLoRA \
                --model_name_or_path_finetune  kaitchup/OPT-1.3B-RLHF-DSChatLoRA

/content/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning
[2023-09-21 17:41:51,968] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Downloading (…)lve/main/config.json: 100% 770/770 [00:00<00:00, 4.63MB/s]
Downloading (…)olve/main/vocab.json: 100% 798k/798k [00:00<00:00, 2.42MB/s]
Downloading (…)olve/main/merges.txt: 100% 456k/456k [00:00<00:00, 1.87MB/s]
Downloading model.safetensors: 100% 2.63G/2.63G [02:07<00:00, 20.7MB/s]
You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 50272. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc
Downloading (…)lve/main/config.json: 100%