This notebook runs supervised fine-tuning (SFT) of OPT-1.3B with DeepSpeed Chat. To keep training cheap, only LoRA adapters are optimized.
It needs at least 10 GB of VRAM and 24 GB of CPU RAM.

You can find more details on supervised fine-tuning using DeepSpeed Chat in this article: [Train Instruct LLMs On Your GPU with DeepSpeed Chat — Step #1: Supervised Fine-tuning](https://kaitchup.substack.com/p/train-instruct-llms-on-your-gpu-with)
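To give a feel for why LoRA keeps training cheap, here is a rough parameter count. It is a sketch under stated assumptions, not DeepSpeed Chat's own accounting: the OPT-1.3B dimensions and base size are taken from the public model config rather than from this notebook, and the adapters are assumed to wrap all six linear layers in each decoder layer (which is what `--lora_module_name decoder.layers.` in the training cell below appears to select).

```python
# Rough count of the parameters LoRA adds to OPT-1.3B.
# Assumptions (not stated in this notebook): model dimensions and base size
# come from the public OPT-1.3B config; adapters sit on every linear layer
# of every decoder layer.
HIDDEN = 2048                 # hidden size (assumption)
FFN = 8192                    # feed-forward width (assumption)
LAYERS = 24                   # number of decoder layers (assumption)
BASE_PARAMS = 1_315_758_080   # base model parameters (assumption)
LORA_DIM = 128                # --lora_dim used in the training cell


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """A rank-r adapter adds two matrices: (in x r) and (r x out)."""
    return rank * (in_features + out_features)


def lora_params_per_layer(hidden: int, ffn: int, rank: int) -> int:
    # Four attention projections (q, k, v, out), each hidden x hidden.
    attention = 4 * lora_params(hidden, hidden, rank)
    # Two feed-forward projections: hidden -> ffn and ffn -> hidden.
    mlp = lora_params(hidden, ffn, rank) + lora_params(ffn, hidden, rank)
    return attention + mlp


added = LAYERS * lora_params_per_layer(HIDDEN, FFN, LORA_DIM)
total = BASE_PARAMS + added
print(f"LoRA parameters: {added:,}")
print(f"Total: {total / 1e9:.3f} B, trainable share: {added / total:.1%}")
```

Under these assumptions the adapters add roughly 113 million parameters, which is consistent with the `Model Parameters: 1.429 B` figure reported in the training log further down.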

In [None]:
!pip install "deepspeed>=0.9.0"

!git clone https://github.com/microsoft/DeepSpeedExamples.git
%cd DeepSpeedExamples/applications/DeepSpeed-Chat/
!pip install -r requirements.txt

Cloning into 'DeepSpeedExamples'...
remote: Enumerating objects: 8456, done.
remote: Counting objects: 100% (897/897), done.
remote: Compressing objects: 100% (348/348), done.
remote: Total 8456 (delta 533), reused 751 (delta 440), pack-reused 7559
Receiving objects: 100% (8456/8456), 22.28 MiB | 15.51 MiB/s, done.
Resolving deltas: 100% (4748/4748), done.
/content/DeepSpeedExamples/applications/DeepSpeed-Chat
Collecting datasets>=2.8.0 (from -r requirements.txt (line 1))
  Downloading datasets-2.14.4-py3-none-any.whl (519 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 519.3/519.3 kB 8.3 MB/s eta 0:00:00
Collecting sentencepiece>=0.1.97 (from -r requirements.txt (line 2))
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.3/1.3 MB 82.3 MB/s eta 0:00:00
Collecting accelerate>=0.15.0 (from -
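The training cell that follows passes `--data_split 2,4,4`. Assuming the flag means what its help text describes (comma-separated proportions of the data reserved for training steps 1, 2, and 3), the share available to this SFT step can be worked out as below. The helper `split_fractions` is hypothetical and written only for illustration; it is not part of DeepSpeed Chat.

```python
def split_fractions(data_split: str) -> list[float]:
    """Turn a string such as "2,4,4" into normalized fractions."""
    parts = [float(p) for p in data_split.split(",")]
    total = sum(parts)
    return [p / total for p in parts]


fractions = split_fractions("2,4,4")
for step, frac in enumerate(fractions, start=1):
    print(f"step {step}: {frac:.0%} of the data")
```

Under that reading, step 1 (this notebook) trains on 20% of the data and the rest is held back for the later steps.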

In [None]:
%cd training/step1_supervised_finetuning/
!deepspeed --num_gpus 1 main.py \
   --data_path Dahoas/rm-static Dahoas/full-hh-rlhf Dahoas/synthetic-instruct-gptj-pairwise yitingxie/rlhf-reward-datasets \
   --data_split 2,4,4 \
   --model_name_or_path facebook/opt-1.3b \
   --per_device_train_batch_size 8 \
   --per_device_eval_batch_size 8 \
   --max_seq_len 512 \
   --learning_rate 1e-3 \
   --weight_decay 0. \
   --num_train_epochs 1 \
   --lr_scheduler_type cosine \
   --num_warmup_steps 100 \
   --seed 1234 \
   --only_optimize_lora \
   --zero_stage 0 \
   --gradient_checkpointing \
   --lora_dim 128 \
   --lora_module_name decoder.layers. \
   --deepspeed \
   --output_dir results

Streaming output truncated to the last 5000 lines.
Model Parameters: 1.429 B, Latency: 2.68s, TFLOPs: 12.75, Samples/sec: 2.98, Time/seq 0.34s, Batch Size: 8, Sequence Length: 512
Model Parameters: 1.429 B, Latency: 2.64s, TFLOPs: 12.95, Samples/sec: 3.03, Time/seq 0.33s, Batch Size: 8, Sequence Length: 512
Model Parameters: 1.429 B, Latency: 2.63s, TFLOPs: 13.01, Samples/sec: 3.04, Time/seq 0.33s, Batch Size: 8, Sequence Length: 512
Model Parameters: 1.429 B, Latency: 2.64s, TFLOPs: 12.96, Samples/sec: 3.03, Time/seq 0.33s, Batch Size: 8, Sequence Length: 512
Model Parameters: 1.429 B, Latency: 2.65s, TFLOPs: 12.91, Samples/sec: 3.02, Time/seq 0.33s, Batch Size: 8, Sequence Length: 512
Model Parameters: 1.429 B, Latency: 2.66s, TFLOPs: 12.84, Samples/sec: 3.01, Time/seq 0.33s, Batch Size: 8, Sequence Length: 512
Model Parameters: 1.429 B, Latency: 2.64s, TFLOPs: 12.94, Samples/sec: 3.03, Time/seq 0.33s, Batch Size: 8, Sequence Length: 512
[2023
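The figures in each log line above are related by simple arithmetic, which the sketch below reproduces from the first line (latency 2.68 s, batch size 8, sequence length 512). The `throughput` helper is hypothetical, and `tokens_per_sec` is a derived figure that the log does not print; it assumes every sequence is padded or truncated to the full 512 tokens.

```python
def throughput(batch_size: int, latency_s: float, seq_len: int) -> dict:
    """Derive per-sample and per-token rates from one step's latency."""
    samples_per_sec = batch_size / latency_s
    return {
        "samples_per_sec": samples_per_sec,
        "time_per_seq_s": latency_s / batch_size,
        # Assumes every sequence occupies the full seq_len tokens.
        "tokens_per_sec": samples_per_sec * seq_len,
    }


stats = throughput(batch_size=8, latency_s=2.68, seq_len=512)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

The results land within rounding of the logged `Samples/sec: 2.98` and `Time/seq 0.34s`; small differences are expected because the log rounds the latency before printing it.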

In [None]:
!python prompt_eval.py --model_name_or_path_baseline  facebook/opt-1.3b \
                        --model_name_or_path_finetune  kaitchup/OPT-1.3B-SFT-DSChatLoRA

[2023-09-04 14:03:43,198] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embeding dimension will be 50272. This might induce some performance reduction as *Tensor Cores* will not be available. For more details  about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc
Downloading (…)lve/main/config.json: 100% 770/770 [00:00<00:00, 4.44MB/s]
Downloading pytorch_model.bin: 100% 2.63G/2.63G [11:11<00:00, 3.92MB/s]
You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embeding dimension will be 50272. This might induce some performance reduction as *Tensor Cores* will not be available. For more details  about this, or help on choosing th