In this notebook, I experiment with ReLoRa, a framework that enables pre-training with low-rank adapters, which I described in the related post.

*Last update: July 19th, 2023. If something no longer works, you will have to install the package versions that were available on that date. Please also comment about the issue on the related post so that I can fix the notebook.*

In [None]:
!git clone https://github.com/Guitaricet/peft_pretraining.git

Cloning into 'peft_pretraining'...
remote: Enumerating objects: 635, done.
remote: Counting objects: 100% (61/61), done.
remote: Compressing objects: 100% (31/31), done.
remote: Total 635 (delta 43), reused 40 (delta 29), pack-reused 574
Receiving objects: 100% (635/635), 1.20 MiB | 23.15 MiB/s, done.
Resolving deltas: 100% (398/398), done.


In [None]:
%cd peft_pretraining/

/content/peft_pretraining


Install all the requirements:

In [None]:
!pip install -r requirements.txt

Collecting transformers (from -r requirements.txt (line 2))
  Downloading transformers-4.31.0-py3-none-any.whl (7.4 MB)
Collecting tokenizers (from -r requirements.txt (line 3))
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting datasets (from -r requirements.txt (line 4))
  Downloading datasets-2.13.1-py3-none-any.whl (486 kB)
Collecting peft (from -r requirements.txt (line 5))
  Downloading peft-0.4.0-py3-none-any.whl (72 kB)
...

Since training takes a lot of time, I recommend first trying the framework with hyperparameters that stop the training early and shorten the validation (which can otherwise take many hours on consumer hardware).

The framework doesn't have an option to shorten the validation, so we will have to do it manually. Open the file torchrun_main.py and replace the line:

*if evaluated_on_tokens > target_eval_tokens:*

by:

*if evaluated_on_tokens > target_eval_tokens or total_batches > 10:*

Note: At the time of writing this article, this line is line 129.
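If you prefer not to edit the file by hand, the same change can be scripted. This is only a minimal sketch: it assumes the condition appears exactly once in torchrun_main.py and is spelled exactly as quoted in this notebook. The function name `shorten_validation` is my own and is not part of the framework.

```python
from pathlib import Path

# The original condition and its shortened-validation replacement,
# copied from the text of this notebook.
OLD = "if evaluated_on_tokens > target_eval_tokens:"
NEW = "if evaluated_on_tokens > target_eval_tokens or total_batches > 10:"


def shorten_validation(path="torchrun_main.py"):
    """Patch the evaluation loop in place. Returns True if the file was changed."""
    source_file = Path(path)
    text = source_file.read_text()
    if NEW in text:
        return False  # already patched, nothing to do
    if text.count(OLD) != 1:
        # The script may have changed since this notebook was written.
        raise ValueError(f"expected exactly one occurrence of {OLD!r} in {path}")
    source_file.write_text(text.replace(OLD, NEW))
    return True
```

Call `shorten_validation()` from the root of the cloned repository. To undo the change later, swap `OLD` and `NEW` or simply run `git checkout torchrun_main.py`.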

Step one: We start with regular pre-training, without peft and without LoRa. The checkpoint saved by this run is the starting point for step two.

In [None]:
!torchrun --nproc-per-node 1 torchrun_main.py \
    --model_config configs/llama_250m.json \
    --batch_size 4 \
    --total_batch_size 8 \
    --lr 5e-4 \
    --max_length 512 \
    --tags warm_start_250M \
    --save_every 10 \
    --num_training_steps 2 \
    --workers 1 \
    --eval_every 1

Starting script
local rank: 0, device: 0
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: 3
wandb: You chose "Don't visualize my results"
wandb: Tracking run with wandb version 0.15.5
wandb: W&B syncing is set to `offline` in this directory.
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
2023-07-19 09:39:14.105 | INFO     | __main__:main:191 - Using torch.distributed with rank 0 (only rank 0 will log)
2023-07-19 09:39:14.106 | INFO     | __main__:main:192 - ****************************************
2023-07-19 09:39:14.106 | INFO     | __main__:main:193 - Starting training with the arguments
...

Step two: Perform ReLoRa
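This step continues from the checkpoint written in step one, so you need its path for --continue_from. The helper below is a small sketch for finding it. It assumes the layout seen in this notebook, checkpoints/<run name>/model_<step>, and that run names end with a timestamp so that sorting them by name is chronological. The function name `latest_checkpoint` is hypothetical, not something the framework provides.

```python
import re
from pathlib import Path


def latest_checkpoint(root="checkpoints"):
    """Return the model_<step> directory with the highest step in the newest run."""
    candidates = []
    for path in Path(root).glob("*/model_*"):
        match = re.fullmatch(r"model_(\d+)", path.name)
        if path.is_dir() and match:
            # Sort key: run directory name first (timestamped), then numeric step.
            candidates.append((path.parent.name, int(match.group(1)), path))
    if not candidates:
        raise FileNotFoundError(f"no model_<step> directory found under {root!r}")
    return max(candidates, key=lambda c: (c[0], c[1]))[2]
```

Running `print(latest_checkpoint())` from the repository root gives a path you can paste after --continue_from in the command below.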

In [None]:
!torchrun --nproc-per-node 1 torchrun_main.py \
    --model_config configs/llama_250m.json \
    --batch_size 4 \
    --total_batch_size 8 \
    --lr 1e-3 \
    --max_length 512 \
    --use_peft \
    --relora 5 \
    --cycle_length 5 \
    --restart_warmup_steps 10 \
    --scheduler cosine_restarts \
    --warmup_steps 2 \
    --reset_optimizer_on_relora True \
    --num_training_steps 50 \
    --save_every 10 \
    --eval_every 10 \
    --tags relora_250M \
    --continue_from checkpoints/llama_250m-2023-07-19-09-39-08/model_3  # change this path to the name of your checkpoint

Starting script
local rank: 0, device: 0
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: 3
wandb: You chose "Don't visualize my results"
wandb: Tracking run with wandb version 0.15.5
wandb: W&B syncing is set to `offline` in this directory.
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
2023-07-19 09:58:43.765 | INFO     | __main__:main:190 - Using torch.distributed with rank 0 (only rank 0 will log)
2023-07-19 09:58:43.766 | INFO     | __main__:main:191 - ****************************************
2023-07-19 09:58:43.766 | INFO     | __main__:main:192 - Starting training with the arguments
...

Now that we are sure that everything works, we can run ReLoRa pre-training with reasonable hyperparameters. *Note: I recommend first removing the "or total_batches > 10" condition that we added to speed up validation.*
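Before launching the long run, it may help to see what the periodic restarts are for. As I understand the method, only the low-rank factors are trained during a cycle. At the end of the cycle, their product is merged into the frozen weight and a fresh adapter is started. The toy NumPy sketch below illustrates that idea. It is not the framework's code. The sizes are arbitrary, random noise stands in for real optimizer steps, and the LoRA scaling factor, the optimizer reset, and the learning-rate re-warmup are left out.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 2                      # toy layer width and adapter rank (arbitrary values)

W = rng.normal(size=(d, d))       # frozen weight, e.g. loaded from the warm-start checkpoint
total_update = np.zeros((d, d))   # everything merged into W so far
ranks = []

for cycle in range(3):
    # Fresh adapter at the start of each cycle: B = 0, so B @ A starts as a no-op.
    A = rng.normal(size=(r, d))
    B = np.zeros((d, r))

    # Stand-in for the optimizer steps of one cycle: only A and B change, never W.
    A = A + 0.1 * rng.normal(size=A.shape)
    B = B + 0.1 * rng.normal(size=B.shape)

    # Merge the low-rank update into the frozen weight, then start over.
    W = W + B @ A
    total_update += B @ A
    ranks.append(int(np.linalg.matrix_rank(total_update)))

print(ranks)  # rank of the accumulated update after each cycle
```

Each individual update has rank at most r, but the accumulated update can gain up to r in rank with every merge, so the printed ranks should grow from cycle to cycle. That is why a sequence of low-rank cycles can approximate a higher-rank training run.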

In [None]:
!torchrun --nproc-per-node 1 torchrun_main.py \
    --model_config configs/llama_250m.json \
    --batch_size 4 \
    --total_batch_size 8 \
    --lr 5e-4 \
    --max_length 512 \
    --tags warm_start_250M \
    --save_every 1000 \
    --num_training_steps 10000

!torchrun --nproc-per-node 1 torchrun_main.py \
    --model_config configs/llama_250m.json \
    --batch_size 4 \
    --total_batch_size 8 \
    --lr 1e-3 \
    --max_length 512 \
    --use_peft \
    --relora 5000 \
    --cycle_length 5000 \
    --restart_warmup_steps 100 \
    --scheduler cosine_restarts \
    --warmup_steps 500 \
    --reset_optimizer_on_relora True \
    --num_training_steps 10000 \
    --save_every 5000 \
    --eval_every 5000 \
    --tags relora_250M \
    --continue_from <your checkpoint from step one>  # replace the placeholder with the path to your checkpoint
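For a rough sense of scale, the arithmetic behind these settings can be sketched as follows. I am assuming here that --total_batch_size counts sequences per optimizer update across all GPUs, that the framework reaches it through gradient accumulation, and that every sequence is filled up to --max_length tokens, so the token counts are upper bounds. The function `run_budget` is only an illustration and does not exist in the framework.

```python
def run_budget(batch_size, total_batch_size, max_length,
               num_training_steps, n_gpus=1, relora=None):
    """Back-of-the-envelope numbers for one run (assumptions in the text above)."""
    tokens_per_update = total_batch_size * max_length
    budget = {
        # Micro-batches accumulated per GPU before each optimizer update.
        "gradient_accumulation": total_batch_size // (batch_size * n_gpus),
        "tokens_per_update": tokens_per_update,
        "total_tokens": tokens_per_update * num_training_steps,
    }
    if relora is not None:
        # How many times the --relora interval fits into the run.
        budget["relora_cycles"] = num_training_steps // relora
    return budget


# Settings of the ReLoRa command above.
budget = run_budget(batch_size=4, total_batch_size=8, max_length=512,
                    num_training_steps=10000, relora=5000)
print(budget)
```

Under these assumptions, each of the two runs sees at most about 41 million tokens, which is small for a 250M-parameter model, so I would treat these values as a starting point rather than a recipe.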