# Fine-tune Dolly-v2-7b with Ray AIR LightningTrainer and FSDP

## Set up the Ray cluster
In this example, we use a Ray cluster with one g4dn.8xlarge instance (head node) and 15 g4dn.4xlarge instances (worker nodes), for 16 GPUs in total. Each instance has one Tesla T4 GPU (16GiB memory).

We define a `runtime_env` to install the necessary Python libraries on each node. You can skip this step if you have already installed all the required packages in your workers' base image.

In [2]:
import ray

ray.init(
    runtime_env={
        "pip": [
            "datasets",
            "evaluate",
            "accelerate>=0.18.0",
            "transformers>=4.28.0",
            "torch>=2.0.0",
            "pytorch_lightning>=2.0",
        ]
    }
)

2023-05-03 01:22:08,570	INFO worker.py:1432 -- Connecting to existing Ray cluster at address: 10.0.108.10:6379...
2023-05-03 01:22:08,586	INFO worker.py:1607 -- Connected to Ray cluster. View the dashboard at https://console.anyscale-staging.com/api/v2/sessions/ses_m411tiqu8eluvt1k5ivfqj4q5r/services?redirect_to=dashboard 
2023-05-03 01:22:09,161	INFO packaging.py:520 -- Creating a file package for local directory '/tmp/ray_tmp_module/ray'.
2023-05-03 01:22:10,019	INFO packaging.py:347 -- Pushing file package 'gcs://_ray_pkg_f34236f1aec697e6.zip' (152.99MiB) to Ray cluster...
2023-05-03 01:22:10,568	INFO packaging.py:360 -- Successfully pushed file package 'gcs://_ray_pkg_f34236f1aec697e6.zip'.
2023-05-03 01:22:10,596	INFO packaging.py:347 -- Pushing file package 'gcs://_ray_pkg_15f0985dd965ef454042c1796aa119b5.zip' (0.16MiB) to Ray cluster...
2023-05-03 01:22:10,597	INFO packaging.py:360 -- Succe

- Python version: 3.8.13
- Ray version: 3.0.0.dev0
- Dashboard: http://console.anyscale-staging.com/api/v2/sessions/ses_m411tiqu8eluvt1k5ivfqj4q5r/services?redirect_to=dashboard


In [3]:
num_workers = 16
batch_size_per_worker = 10
MODEL_NAME = "databricks/dolly-v2-7b"

## Prepare your data 
We use the tiny_shakespeare dataset for fine-tuning. It contains 40,000 lines from a variety of Shakespeare's plays and was featured in Andrej Karpathy's blog post ['The Unreasonable Effectiveness of Recurrent Neural Networks'](http://karpathy.github.io/2015/05/21/rnn-effectiveness/).

Dataset samples:
```
BAPTISTA:
I know him well: you are welcome for his sake.

GREMIO:
Saving your tale, Petruchio, I pray,
Let us, that are poor petitioners, speak too:
Baccare! you are marvellous forward.

PETRUCHIO:
O, pardon me, Signior Gremio; I would fain be doing.
```

Here, we have adopted similar pre-processing logic from another demo: {ref}`GPT-J-6B Fine-Tuning with Ray AIR and DeepSpeed <gpt-j-6b-finetune-deepspeed>`.

In [4]:
import ray
import pandas as pd
from datasets import load_dataset
from ray.data.preprocessors import BatchMapper, Chain
from transformers import AutoTokenizer, AutoModelForCausalLM

def split_text(batch: pd.DataFrame) -> pd.DataFrame:
    text = list(batch["text"])
    flat_text = "".join(text)
    split_text = [
        x.strip()
        for x in flat_text.split("\n")
        if x.strip() and not x.strip()[-1] == ":"
    ]
    return pd.DataFrame(split_text, columns=["text"])


def tokenize(batch: pd.DataFrame) -> dict:
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, padding_side="left")
    tokenizer.pad_token = tokenizer.eos_token
    ret = tokenizer(
        list(batch["text"]),
        truncation=True,
        max_length=256,
        padding="max_length",
        return_tensors="np",
    )
    ret["labels"] = ret["input_ids"].copy()
    return dict(ret)

splitter = BatchMapper(split_text, batch_format="pandas")
tokenizer = BatchMapper(tokenize, batch_format="pandas")
preprocessor = Chain(splitter, tokenizer)

hf_dataset = load_dataset("tiny_shakespeare")
ray_datasets = ray.data.from_huggingface(hf_dataset)

Downloading builder script: 100%|██████████| 3.73k/3.73k [00:00<00:00, 4.33MB/s]
Downloading metadata: 100%|██████████| 1.90k/1.90k [00:00<00:00, 1.86MB/s]
Downloading readme: 100%|██████████| 6.10k/6.10k [00:00<00:00, 6.94MB/s]


Downloading and preparing dataset tiny_shakespeare/default to /home/ray/.cache/huggingface/datasets/tiny_shakespeare/default/1.0.0/b5b13969f09fe8707337f6cb296314fbe06960bd9a868dca39e713e163d27b5e...


Downloading data: 1.12MB [00:00, 15.3MB/s]                  
                                                                         

Dataset tiny_shakespeare downloaded and prepared to /home/ray/.cache/huggingface/datasets/tiny_shakespeare/default/1.0.0/b5b13969f09fe8707337f6cb296314fbe06960bd9a868dca39e713e163d27b5e. Subsequent calls will reuse this data.


100%|██████████| 3/3 [00:00<00:00, 1164.33it/s]
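To see what `split_text` keeps and drops, the same rule can be tried on a plain string. This is a minimal sketch that mirrors the filter above without pandas or Ray; `split_lines` is a hypothetical helper name, and the sample text is taken from the dataset excerpt shown earlier.

```python
def split_lines(flat_text: str) -> list:
    """Keep non-empty lines and drop lines ending in ':' (speaker labels)."""
    return [
        line.strip()
        for line in flat_text.split("\n")
        if line.strip() and not line.strip().endswith(":")
    ]


sample = (
    "BAPTISTA:\n"
    "I know him well: you are welcome for his sake.\n"
    "\n"
    "GREMIO:\n"
    "Saving your tale, Petruchio, I pray,\n"
)
lines = split_lines(sample)
print(lines)
```

The rule is a heuristic: a spoken line that happens to end with a colon, such as "Let us, that are poor petitioners, speak too:", is dropped along with the speaker labels.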


We first split the original paragraphs into individual lines, then tokenize them. Here are some samples of the split text:

In [5]:
ds = ray_datasets["train"]
splitter.fit_transform(ds).take(10)

2023-05-03 01:22:20,623	INFO datastream.py:2271 -- Tip: Use `take_batch()` instead of `take() / show()` to return records in pandas or numpy batch format.
2023-05-03 01:22:20,626	INFO streaming_executor.py:87 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[BatchMapper]
2023-05-03 01:22:20,629	INFO streaming_executor.py:88 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
2023-05-03 01:22:20,629	INFO streaming_executor.py:90 -- Tip: To enable per-operator progress reporting, set RAY_DATA_VERBOSE_PROGRESS=1.
                                                                                                                   

[{'text': 'Before we proceed any further, hear me speak.'},
 {'text': 'Speak, speak.'},
 {'text': 'You are all resolved rather to die than to famish?'},
 {'text': 'Resolved. resolved.'},
 {'text': 'First, you know Caius Marcius is chief enemy to the people.'},
 {'text': "We know't, we know't."},
 {'text': "Let us kill him, and we'll have corn at our own price."},
 {'text': "Is't a verdict?"},
 {'text': "No more talking on't; let it be done: away, away!"},
 {'text': 'One word, good citizens.'}]
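After splitting, each line is tokenized to a fixed length with left padding. The toy function below illustrates what `truncation=True`, `padding="max_length"`, and `padding_side="left"` do to a list of token IDs. It is only a sketch: `pad_left` is a hypothetical helper, the token IDs are made up, and `PAD_ID = 0` is an assumption for illustration (the tokenizer above reuses its EOS token for padding).

```python
PAD_ID = 0  # assumption: placeholder padding ID, for illustration only


def pad_left(ids, max_length, pad_id=PAD_ID):
    """Truncate on the right, then pad on the left up to max_length."""
    ids = list(ids)[:max_length]
    n_pad = max_length - len(ids)
    input_ids = [pad_id] * n_pad + ids
    attention_mask = [0] * n_pad + [1] * len(ids)
    # As in `tokenize` above, the labels are a copy of the input IDs.
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": list(input_ids),
    }


short_seq = pad_left([11, 12, 13], max_length=5)
long_seq = pad_left([1, 2, 3, 4, 5, 6, 7], max_length=5)
print(short_seq["input_ids"], short_seq["attention_mask"])
print(long_seq["input_ids"], long_seq["attention_mask"])
```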

## Define your lightning model

In this example, we use the [Dolly-v2-7b](https://huggingface.co/databricks/dolly-v2-7b) model for fine-tuning. It is an instruction-following large language model trained on the Databricks machine learning platform and licensed for commercial use. We load the model weights from the Hugging Face Model Hub and encapsulate the model in a `pl.LightningModule`.

:::{note}
Make sure you pass the FSDP-wrapped model parameters, `self.trainer.model.parameters()`, to the optimizer instead of `self.model.parameters()`.
:::


In [6]:
import torch
import pytorch_lightning as pl

class DollyV2Model(pl.LightningModule):
    def __init__(self, lr=2e-5, eps=1e-8):
        super().__init__()
        self.lr = lr
        self.eps = eps
        self.model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
        self.predictions = []
        self.references = []

    def forward(self, batch):
        outputs = self.model(
            batch["input_ids"], 
            attention_mask=batch["attention_mask"], 
            labels=batch["labels"]
        )
        return outputs.loss

    def training_step(self, batch, batch_idx):
        loss = self.forward(batch)
        self.log("train_loss", loss, prog_bar=True, on_step=True)
        return loss

    def configure_optimizers(self):
        if self.global_rank == 0:
            print(self.trainer.model)
        return torch.optim.AdamW(self.trainer.model.parameters(), lr=self.lr, eps=self.eps)

## Configure your FSDP strategy
Dolly-v2-7b is a relatively large model that does not fit into a single commodity GPU. In this example, we use the FSDP strategy to shard model parameters across multiple workers. This avoids GPU out-of-memory errors and supports a larger global batch size.
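A rough back-of-the-envelope estimate shows why sharding is needed. The sketch below counts only fp32 parameter storage and assumes about 7 billion parameters (inferred from the model name, not measured); gradients, optimizer states, and activations add more on top. `param_memory_gib` is a hypothetical helper name.

```python
def param_memory_gib(num_params: float, bytes_per_param: int = 4) -> float:
    """Memory needed to store the parameters alone, in GiB."""
    return num_params * bytes_per_param / 2**30


NUM_PARAMS = 7e9       # assumption: rough count taken from the "7b" in the name
GPU_MEMORY_GIB = 16    # one Tesla T4, as described in the cluster setup
NUM_GPUS = 16          # number of training workers in this example

full_copy = param_memory_gib(NUM_PARAMS)
per_gpu_when_sharded = full_copy / NUM_GPUS

print(f"full fp32 copy:   {full_copy:.1f} GiB (GPU has {GPU_MEMORY_GIB} GiB)")
print(f"sharded 16 ways:  {per_gpu_when_sharded:.2f} GiB per GPU")
```

Under these assumptions a full fp32 copy of the weights already exceeds one T4's memory, while an even 16-way shard of the parameters is well under 2 GiB per GPU.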

:::{note}
FSDP is a type of data parallelism that shards model parameters, optimizer states and gradients across DDP ranks. This was inspired by Xu et al. as well as the ZeRO Stage 3 from DeepSpeed. You may refer to these blogs for more information:

- [Getting Started with Fully Sharded Data Parallel(FSDP)](https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html#:~:text=FSDP%20is%20a%20type%20of,sizes%20for%20our%20training%20job.)
- [Fully Sharded Data Parallel: faster AI training with fewer GPUs](https://engineering.fb.com/2021/07/15/open-source/fsdp/)
- [PyTorch FSDP Tutorial](https://www.youtube.com/watch?v=8_k76AHu__s&list=PL_lsbAsL_o2BT6aerEKgIoufVD_fodnuT)
:::

To start training with Lightning's [FSDPStrategy](https://lightning.ai/docs/pytorch/stable/api/lightning.pytorch.strategies.FSDPStrategy.html#lightning.pytorch.strategies.FSDPStrategy), you only need to provide the initialization arguments in `LightningConfigBuilder.strategy()`. Behind the scenes, LightningTrainer handles the cluster environment settings and job launching.


:::{tip}
Some tips for FSDP configuration:
- `sharding_strategy`:
    - `ShardingStrategy.NO_SHARD`: Parameters, gradients, and optimizer states are not sharded. Similar to DDP.
    - `ShardingStrategy.SHARD_GRAD_OP`: Gradients and optimizer states are sharded during computation, while parameters are sharded outside computation. Similar to ZeRO stage 2.
    - `ShardingStrategy.FULL_SHARD`: Parameters, gradients, and optimizer states are sharded. It has the lowest GPU memory usage of the three options. Similar to ZeRO stage 3.
- `auto_wrap_policy`:
    - Model layers are often wrapped with FSDP in a nested fashion. This means that only the layers in a single FSDP instance need to gather their full parameters onto a device during the forward or backward pass.
    - Use `transformer_auto_wrap_policy` to automatically wrap each Transformer Block into a single FSDP instance. 
- `backward_prefetch` and `forward_prefetch`:
    - Overlap the upcoming all-gather while executing the current forward/backward pass. It can improve throughput but may slightly increase peak memory usage.
:::
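To get a feel for why per-block wrapping limits peak memory, the sketch below estimates the size of one transformer block using the common approximation of 12·h² weights per block (4·h² for attention, 8·h² for the MLP, ignoring biases and layer norms). The hidden size of 4096 and the 32 blocks are taken from the model printout in the training log; 2 bytes per parameter is an assumption (fp16), so pass 4 for fp32. The helper names are hypothetical.

```python
HIDDEN_SIZE = 4096  # from `Embedding(50280, 4096)` in the training log
NUM_LAYERS = 32     # from `(0-31): 32 x FullyShardedDataParallel` in the log


def block_params(hidden_size: int) -> int:
    """Approximate weight count of one transformer block (12 * h^2)."""
    return 12 * hidden_size**2


def size_mib(num_params: int, bytes_per_param: int = 2) -> float:
    """Size in MiB; bytes_per_param=2 assumes fp16 parameters."""
    return num_params * bytes_per_param / 2**20


one_block = block_params(HIDDEN_SIZE)
all_blocks = one_block * NUM_LAYERS

print(f"one block gathered:  {size_mib(one_block):.0f} MiB")
print(f"all blocks gathered: {size_mib(all_blocks):.0f} MiB")
```

With one FSDP instance per block, only about one block's worth of full parameters has to be materialized on a GPU at a time (plus whatever prefetching gathers ahead), rather than all 32 blocks at once.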

In [7]:
import functools
from ray.train.lightning import LightningTrainer, LightningConfigBuilder
from ray.air.config import RunConfig, ScalingConfig, CheckpointConfig
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from torch.distributed.fsdp import ShardingStrategy, BackwardPrefetch
from transformers.models.gpt_neox.modeling_gpt_neox import GPTNeoXLayer

# Define the model sharding policy:
# Wrap every GPTNeoXLayer as its own FSDP instance
auto_wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={GPTNeoXLayer}
)

# Aggregate all arguments for LightningTrainer
lightning_config = (
    LightningConfigBuilder()
    .module(cls=DollyV2Model, lr=2e-5, eps=1e-8)
    .trainer(
        max_epochs=1, 
        accelerator="gpu", 
        precision="16-mixed",
        max_steps=40,  # Cap the run at 40 steps to speed up release tests; remove to train a full epoch
    )
    .strategy(
        name="fsdp",
        sharding_strategy=ShardingStrategy.FULL_SHARD,
        backward_prefetch=BackwardPrefetch.BACKWARD_PRE,
        forward_prefetch=True,
        auto_wrap_policy=auto_wrap_policy,
        limit_all_gathers=True,
        activation_checkpointing=[GPTNeoXLayer],
    )
    .checkpointing(save_top_k=0, save_weights_only=True, save_last=True)
)

In [None]:
from pytorch_lightning.callbacks import TQDMProgressBar

# Create a customized progress bar for LightningTrainer
class DollyV2ProgressBar(TQDMProgressBar):
    def __init__(self, num_iters_per_epoch, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.num_iters_per_epoch = num_iters_per_epoch
    
    def on_train_epoch_start(self, trainer, *_):
        super().on_train_epoch_start(trainer, *_)
        self.train_progress_bar.reset(self.num_iters_per_epoch)

total_batches = splitter.fit_transform(ray_datasets["train"]).count()
num_iters_per_epoch = total_batches // (num_workers * batch_size_per_worker)
lightning_config.trainer(callbacks=[DollyV2ProgressBar(num_iters_per_epoch)])
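The progress-bar length is just the number of training samples divided (with integer division) by the global batch size. A small sketch of that arithmetic follows; `num_samples = 21_500` is a hypothetical placeholder, since the real value comes from `.count()` at run time.

```python
num_workers = 16
batch_size_per_worker = 10
num_samples = 21_500  # hypothetical placeholder for the `.count()` result

# Each optimizer step consumes one batch from every worker.
global_batch_size = num_workers * batch_size_per_worker

# Integer division drops the final partial batch from the estimate.
num_iters_per_epoch = num_samples // global_batch_size

print(global_batch_size, num_iters_per_epoch)
```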

## Fine-tune with LightningTrainer

```{note}
Here we upload the checkpoints to cloud storage by setting an S3 bucket URI in {class}`air.RunConfig(storage_path) <ray.air.RunConfig>`. You can also write to your local file system instead. See {ref}`train-run-config` for an example.
```

In [9]:
from ray.tune.syncer import SyncConfig
# Configure the run name, checkpointing, and artifact syncing
run_config = RunConfig(
    name="finetune_dolly-v2-7b",
    # storage_path="s3://anyscale-staging-data-cld-kvedzwag2qa8i5bjxuevf5i7/ray-lightning-results-7b/",
    checkpoint_config=CheckpointConfig(),
    sync_config=SyncConfig(sync_artifacts=False)
)

# Scale the FSDP training workload across 16 GPUs
# You can change this config based on your compute resources.
scaling_config = ScalingConfig(
    num_workers=num_workers, use_gpu=True, resources_per_worker={"CPU": 12, "GPU": 1}
)

trainer = LightningTrainer(
    lightning_config=lightning_config.build(),
    run_config=run_config,
    scaling_config=scaling_config,
    datasets={"train": ray_datasets["train"]},
    datasets_iter_config={"batch_size": batch_size_per_worker},
    preprocessor=preprocessor,
)
result = trainer.fit()

result


- Current time: 2023-05-03 02:22:06
- Running for: 00:59:42.80
- Memory: 7.4/124.4 GiB

| Trial name | status | loc | iter | total time (s) | train_loss | epoch | step |
|---|---|---|---|---|---|---|---|
| LightningTrainer_a1a3d_00000 | TERMINATED | 10.0.108.10:8284 | 1 | 3024.09 | 0.176025 | 0 | 135 |


(LightningTrainer pid=8284) 2023-05-03 01:22:31,584	INFO backend_executor.py:128 -- Starting distributed worker processes: ['8425 (10.0.108.10)', '3544 (10.0.102.225)', '3576 (10.0.80.223)', '3541 (10.0.102.21)', '3563 (10.0.108.25)', '3541 (10.0.114.187)', '3508 (10.0.67.62)', '3424 (10.0.86.122)', '3532 (10.0.113.13)', '3488 (10.0.96.142)', '3474 (10.0.122.128)', '3572 (10.0.112.171)', '3504 (10.0.78.238)', '3554 (10.0.79.247)', '3521 (10.0.107.4)', '3570 (10.0.104.19)']
(RayTrainWorker pid=8425) 2023-05-03 01:22:33,824	INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=16]
(LightningTrainer pid=8284) 2023-05-03 01:22:34,427	INFO streaming_executor.py:87 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[BatchMapper->BatchMapper] -> AllToAllOperator[RandomizeBlockOrder]
(LightningTrainer pid=8284) 2023-05-03 01:22:34,427	INFO streaming_executor.py:88 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, obj

(RayTrainWorker pid=3474, ip=10.0.122.128) FullyShardedDataParallel(
(RayTrainWorker pid=3474, ip=10.0.122.128)   (_fsdp_wrapped_module): _LightningModuleWrapperBase(
(RayTrainWorker pid=3474, ip=10.0.122.128)     (_forward_module): DollyV2Model(
(RayTrainWorker pid=3474, ip=10.0.122.128)       (model): GPTNeoXForCausalLM(
(RayTrainWorker pid=3474, ip=10.0.122.128)         (gpt_neox): GPTNeoXModel(
(RayTrainWorker pid=3474, ip=10.0.122.128)           (embed_in): Embedding(50280, 4096)
(RayTrainWorker pid=3474, ip=10.0.122.128)           (layers): ModuleList(
(RayTrainWorker pid=3474, ip=10.0.122.128)             (0-31): 32 x FullyShardedDataParallel(
(RayTrainWorker pid=3474, ip=10.0.122.128)               (_fsdp_wrapped_module): CheckpointWrapper(
(RayTrainWorker pid=3474, ip=10.0.122.128)                 (_checkpoint_wrapped_module): GPTNeoXLayer(
(RayTrainWorker pid=3474, ip=10.0.122.128)                   (input_layernorm): LayerNorm((4096,), eps=1e-05, elementwise_affine=True)
(Ra

(RayTrainWorker pid=8425) 
(RayTrainWorker pid=8425)   | Name  | Type               | Params
(RayTrainWorker pid=8425) ---------------------------------------------
(RayTrainWorker pid=8425) 0 | model | GPTNeoXForCausalLM | 402 M 
(RayTrainWorker pid=8425) ---------------------------------------------
(RayTrainWorker pid=8425) 402 M     Trainable params
(RayTrainWorker pid=8425) 0         Non-trainable params
(RayTrainWorker pid=8425) 402 M     Total params
(RayTrainWorker pid=8425) 1,611.039 Total estimated model params size (MB)
(RayTrainWorker pid=8425)   rank_zero_warn(
(RayTrainWorker pid=3532, ip=10.0.113.13) LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0] [repeated 15x across cluster]


Epoch 0:   0%|          | 0/134 [00:00<?, ?it/s]
Epoch 0:   1%|          | 1/134 [00:20<45:23, 20.48s/it, v_num=0, train_loss=12.90]
(RayTrainWorker pid=8425) FullyShardedDataParallel( [repeated 15x across cluster]
(RayTrainWorker pid=8425)   (_fsdp_wrapped_module): _LightningModuleWrapperBase( [repeated 15x across cluster]
(RayTrainWorker pid=8425)     (_forward_module): DollyV2Model( [repeated 15x across cluster]
(RayTrainWorker pid=8425)       (model): GPTNeoXForCausalLM( [repeated 15x across cluster]
(RayTrainWorker pid=8425)         (gpt_neox): GPTNeoXModel( [repeated 15x across cluster]
(RayTrainWorker pid=8425)           (embed_in): Embedding(50280, 4096) [repeated 15x across cluster]
(RayTrainWorker pid=8425)           (layers): ModuleList( [repeated 15x across cluster]
(RayTrainWorker pid=8425)             (0-31): 32 x FullyShardedDataParallel( [repeated 15x across cluster]
(RayTrainWorker pid=8425)               (_fsdp_wrapped_module): CheckpointWrapper( [repeated 15x across 

| Trial name | _report_on | date | done | epoch | experiment_tag | hostname | iterations_since_restore | node_ip | pid | should_checkpoint | step | time_since_restore | time_this_iter_s | time_total_s | timestamp | train_loss | training_iteration | trial_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LightningTrainer_a1a3d_00000 | train_epoch_end | 2023-05-03_02-12-53 | True | 0 | 0 | ip-10-0-108-10 | 1 | 10.0.108.10 | 8284 | True | 135 | 3024.09 | 3024.09 | 3024.09 | 1683105172 | 0.176025 | 1 | a1a3d_00000 |


(RayTrainWorker pid=8425) `Trainer.fit` stopped: `max_epochs=1` reached.
(RayTrainWorker pid=8425) RayFSDPStrategy: tearing down strategy...


Epoch 0: : 135it [46:32, 20.68s/it, v_num=0, train_loss=0.176]


2023-05-03 02:22:06,997	INFO tune.py:1010 -- Total run time: 3583.09 seconds (3224.62 seconds for the tuning loop).


Result(
  metrics={'_report_on': 'train_epoch_end', 'train_loss': 0.176025390625, 'epoch': 0, 'step': 135, 'should_checkpoint': True, 'done': True, 'trial_id': 'a1a3d_00000', 'experiment_tag': '0'},
  path='s3://anyscale-staging-data-cld-kvedzwag2qa8i5bjxuevf5i7/ray-lightning-results-7b/finetune_dolly-v2-7b/LightningTrainer_a1a3d_00000_0_2023-05-03_01-22-24',
  checkpoint=LightningCheckpoint(uri=s3://anyscale-staging-data-cld-kvedzwag2qa8i5bjxuevf5i7/ray-lightning-results-7b/finetune_dolly-v2-7b/LightningTrainer_a1a3d_00000_0_2023-05-03_01-22-24/checkpoint_000000)
)

We finished training in 3024s. The price for an on-demand g4dn.4xlarge instance is `$1.204/hour`, while a g4dn.8xlarge instance costs `$2.176/hour`. With 15 g4dn.4xlarge workers and one g4dn.8xlarge head node, the total cost is about `($1.204 * 15 + $2.176) * 3024 / 3600 = $17`.
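The cost estimate can be reproduced in a few lines. The hourly prices and the run time are the figures quoted in this section; `cluster_cost` is a hypothetical helper name.

```python
def cluster_cost(worker_price, num_workers, head_price, seconds):
    """On-demand cost in dollars for a cluster running for `seconds`."""
    hourly_price = worker_price * num_workers + head_price
    return hourly_price * seconds / 3600


cost = cluster_cost(
    worker_price=1.204,  # $/hour, g4dn.4xlarge
    num_workers=15,
    head_price=2.176,    # $/hour, head node
    seconds=3024,        # training time reported above
)
print(f"${cost:.2f}")
```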

## Text-generation with HuggingFace Pipeline

Next, we can use the [HuggingFace Pipeline](https://huggingface.co/docs/transformers/main_classes/pipelines) to generate predictions from our fine-tuned model. Let's input some prompts and see if our tuned Dolly can speak like Shakespeare:

In [None]:
from transformers import pipeline

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, padding_side="right")

# The 7B model does not fit on a single T4 GPU (15GiB), so load it on the CPU first.
dolly = result.checkpoint.get_model(model_class=DollyV2Model, map_location=torch.device("cpu"))

# With device_map="auto", 🤗 Accelerate automatically places layers on different devices based on the available resources.
nlp_pipeline = pipeline(task="text-generation", model=dolly.model, tokenizer=tokenizer, device_map="auto")

for prompt in ["This is", "I am", "Once more"]:
    print(nlp_pipeline(prompt, max_new_tokens=15, do_sample=True, pad_token_id=tokenizer.eos_token_id))