(dolly_lightning_fsdp_finetuning)=

# Fine-tune `dolly-v2-7b` with Ray Train, PyTorch Lightning and FSDP

In this example, we demonstrate how to use Ray Train to fine-tune a [`dolly-v2-7b`](https://huggingface.co/databricks/dolly-v2-7b) model. `dolly-v2-7b` is a 6.9 billion parameter causal language model created by Databricks, derived from EleutherAI’s [Pythia-6.9b](https://huggingface.co/EleutherAI/pythia-6.9b), and fine-tuned on a [~15K record instruction corpus](https://github.com/databrickslabs/dolly/tree/master/data).

We load the pre-trained model from the Hugging Face model hub into a LightningModule and launch an FSDP fine-tuning job across 16 T4 GPUs with the help of {class}`Ray TorchTrainer <ray.train.torch.TorchTrainer>`. The same approach carries over to other large language models of similar size.

Before starting this example, we highly recommend reading [Ray Train Key Concepts](train-key-concepts) and [Ray Data Key Concepts](data_key_concepts).

## Set up Ray cluster
In this example, we are using a Ray cluster with 16 g4dn.4xlarge instances. Each instance has one Tesla T4 GPU (16 GiB memory).

We define a `runtime_env` to install the necessary Python libraries on each node. You can skip this step if you have already installed all the required packages in your workers' base image. We tested this example with `pytorch_lightning==2.0.2` and `transformers==4.29.2`.

In [1]:
import ray

ray.init(
    runtime_env={
        "pip": [
            "datasets",
            "evaluate",
            "transformers>=4.26.0",
            "torch>=1.12.0",
            "pytorch_lightning>=2.0",
        ]
    }
)

2023-08-30 10:19:23,505	INFO util.py:159 -- Outdated packages:
  ipywidgets==7.8.0 found, needs ipywidgets>=8
Run `pip install -U ipywidgets`, then restart the notebook server for rich notebook output.
2023-08-30 10:19:23,569	INFO worker.py:1459 -- Connecting to existing Ray cluster at address: 10.0.23.226:6379...
2023-08-30 10:19:23,618	INFO worker.py:1640 -- Connected to Ray cluster. View the dashboard at https://session-29cev7pafynccfbmnzvd6giqpe.i.anyscaleuserdata-staging.com
2023-08-30 10:19:23,621	INFO packaging.py:346 -- Pushing file package 'gcs://_ray_pkg_eb5ee6ea6668e2003d3815d2e85033e7.zip' (0.51MiB) to Ray cluster...
2023-08-30 10:19:23,624	INFO packaging.py:359 -- Successfully pushed file package 'gcs://_ray_pkg_eb5ee6ea6668e2003d3815d2e85033e7.zip'.


- Python version: 3.8.13
- Ray version: 3.0.0.dev0
- Dashboard: http://session-29cev7pafynccfbmnzvd6giqpe.i.anyscaleuserdata-staging.com


In [2]:
MODEL_NAME = "databricks/dolly-v2-7b"

## Prepare your data 
We are using tiny_shakespeare for fine-tuning, which contains 40,000 lines of Shakespeare from a variety of Shakespeare's plays. It was featured in Andrej Karpathy's blog post ['The Unreasonable Effectiveness of Recurrent Neural Networks'](http://karpathy.github.io/2015/05/21/rnn-effectiveness/).

Dataset samples:
```
BAPTISTA:
I know him well: you are welcome for his sake.

GREMIO:
Saving your tale, Petruchio, I pray,
Let us, that are poor petitioners, speak too:
Baccare! you are marvellous forward.

PETRUCHIO:
O, pardon me, Signior Gremio; I would fain be doing.
```

Here, we have adopted similar pre-processing logic from another demo: {ref}`GPT-J-6B Fine-Tuning with Ray Train and DeepSpeed <gptj_deepspeed_finetune>`.

In [3]:
import ray
import pandas as pd
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM

def split_text(batch: pd.DataFrame) -> pd.DataFrame:
    text = list(batch["text"])
    flat_text = "".join(text)
    split_text = [
        x.strip()
        for x in flat_text.split("\n")
        if x.strip() and not x.strip()[-1] == ":"
    ]
    return pd.DataFrame(split_text, columns=["text"])


def tokenize(batch: pd.DataFrame) -> dict:
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, padding_side="left")
    tokenizer.pad_token = tokenizer.eos_token
    ret = tokenizer(
        list(batch["text"]),
        truncation=True,
        max_length=256,
        padding="max_length",
        return_tensors="np",
    )
    ret["labels"] = ret["input_ids"].copy()
    return dict(ret)

hf_dataset = load_dataset("tiny_shakespeare")
train_ds = ray.data.from_huggingface(hf_dataset["train"])



We first split the original paragraphs into multiple sentences, then tokenize them. Here are some samples:

In [4]:
# First split the dataset into multiple sentences.
train_ds = train_ds.map_batches(split_text, batch_format="pandas")
train_ds.take(10)

2023-08-30 10:19:30,862	INFO dataset.py:2380 -- Tip: Use `take_batch()` instead of `take() / show()` to return records in pandas or numpy batch format.
2023-08-30 10:19:30,865	INFO streaming_executor.py:93 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[MapBatches(split_text)] -> LimitOperator[limit=10]
2023-08-30 10:19:30,866	INFO streaming_executor.py:94 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
2023-08-30 10:19:30,867	INFO streaming_executor.py:96 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`



[{'text': 'Before we proceed any further, hear me speak.'},
 {'text': 'Speak, speak.'},
 {'text': 'You are all resolved rather to die than to famish?'},
 {'text': 'Resolved. resolved.'},
 {'text': 'First, you know Caius Marcius is chief enemy to the people.'},
 {'text': "We know't, we know't."},
 {'text': "Let us kill him, and we'll have corn at our own price."},
 {'text': "Is't a verdict?"},
 {'text': "No more talking on't; let it be done: away, away!"},
 {'text': 'One word, good citizens.'}]
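The filtering rule inside `split_text` can be illustrated without Ray or pandas. The following is a simplified, pure-Python restatement of that rule, not the pipeline itself; the helper name `split_lines` and the sample string are illustrative only.

```python
from typing import List


def split_lines(text: str) -> List[str]:
    """Keep non-empty lines that do not end with a colon (speaker labels)."""
    return [
        line.strip()
        for line in text.split("\n")
        if line.strip() and not line.strip().endswith(":")
    ]


sample = (
    "BAPTISTA:\n"
    "I know him well: you are welcome for his sake.\n"
    "\n"
    "GREMIO:\n"
    "Saving your tale, Petruchio, I pray,\n"
)

lines = split_lines(sample)
for line in lines:
    print(line)
```

Speaker labels such as `BAPTISTA:` are dropped because they end with a colon, while a line that merely contains a colon is kept.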

In [5]:
# Then tokenize the dataset.
train_ds = train_ds.map_batches(tokenize, batch_format="pandas")
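To make the effect of `padding="max_length"`, `truncation=True`, and `padding_side="left"` more concrete, here is a small pure-Python sketch of the resulting layout. It does not call the real tokenizer: the helper `pad_left`, the token ids, and the pad id `0` are hypothetical (in the cell above the pad token is set to the tokenizer's EOS token).

```python
from typing import Dict, List


def pad_left(ids: List[int], max_length: int, pad_id: int) -> Dict[str, List[int]]:
    """Truncate to max_length, then left-pad and build the attention mask."""
    ids = ids[:max_length]                       # truncation keeps the first tokens
    n_pad = max_length - len(ids)
    input_ids = [pad_id] * n_pad + ids           # padding goes on the left
    attention_mask = [0] * n_pad + [1] * len(ids)
    # As in the tokenize function above, labels are a copy of input_ids.
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": list(input_ids),
    }


short = pad_left([11, 12, 13], max_length=5, pad_id=0)
long = pad_left([1, 2, 3, 4, 5, 6, 7], max_length=5, pad_id=0)
print(short)
print(long["input_ids"])
```

Every row ends up with the same length, which is what lets the batches be stacked into fixed-shape arrays.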

## Define your Lightning model

In this example, we use the [dolly-v2-7b](https://huggingface.co/databricks/dolly-v2-7b) model for fine-tuning. It is an instruction-following large language model trained on the Databricks machine learning platform that is licensed for commercial use. We load the model weights from the Hugging Face Model Hub and encapsulate them in a `pl.LightningModule`.

:::{note}
Make sure you pass the FSDP-wrapped model parameters `self.trainer.model.parameters()` into the optimizer, instead of `self.model.parameters()`.
:::


In [6]:
import torch
import pytorch_lightning as pl

class DollyV2Model(pl.LightningModule):
    def __init__(self, lr=2e-5, eps=1e-8):
        super().__init__()
        self.save_hyperparameters()
        self.lr = lr
        self.eps = eps
        self.model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

    def forward(self, batch):
        outputs = self.model(
            batch["input_ids"], 
            attention_mask=batch["attention_mask"], 
            labels=batch["labels"]
        )
        return outputs.loss

    def training_step(self, batch, batch_idx):
        loss = self.forward(batch)
        self.log("train_loss", loss, prog_bar=True, on_step=True)
        return loss

    def configure_optimizers(self):
        if self.global_rank == 0:
            print(self.trainer.model)
        return torch.optim.AdamW(self.trainer.model.parameters(), lr=self.lr, eps=self.eps)


## Configure your FSDP strategy
As `dolly-v2-7b` is a relatively large model, it does not fit on a single commodity GPU such as the 16 GiB T4 used here. In this example, we use the FSDP strategy to shard model parameters across multiple workers. This allows us to avoid GPU out-of-memory issues and support a larger global batch size.

![](https://user-images.githubusercontent.com/26745457/236892936-d4b91751-4689-421e-ac5f-edfd2eeeb635.png)
Image source: [Fully Sharded Data Parallel: faster AI training with fewer GPUs](https://engineering.fb.com/2021/07/15/open-source/fsdp/)
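A rough back-of-the-envelope estimate shows why sharding is needed. The sketch below is an approximation under stated assumptions, not a measurement: it assumes roughly 6.9 billion parameters (based on the Hugging Face model card), fp32 weights and gradients, and two fp32 AdamW moment buffers per parameter. It ignores activations, temporary all-gather buffers, and any extra half-precision copies.

```python
GIB = 1024 ** 3

num_params = 6.9e9          # assumption: approximate parameter count
bytes_weights = 4           # fp32 master weights
bytes_grads = 4             # fp32 gradients
bytes_adamw_states = 8      # two fp32 moment estimates per parameter
bytes_per_param = bytes_weights + bytes_grads + bytes_adamw_states

num_gpus = 16               # matches the cluster used in this example
gpu_memory_gib = 16         # Tesla T4

total_gib = num_params * bytes_per_param / GIB
per_gpu_gib = total_gib / num_gpus  # idealized even split under full sharding

print(f"Unsharded training state: ~{total_gib:.0f} GiB")
print(f"Fully sharded across {num_gpus} GPUs: ~{per_gpu_gib:.1f} GiB per GPU")
print(f"Fits on one {gpu_memory_gib} GiB GPU without sharding: {total_gib <= gpu_memory_gib}")
```

Under these assumptions the unsharded state is far larger than a single T4, while an even 16-way split leaves headroom for activations on each GPU.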

:::{note}
FSDP is a type of data parallelism that shards model parameters, optimizer states and gradients across DDP ranks. This was inspired by Xu et al. as well as the ZeRO Stage 3 from DeepSpeed. You may refer to these blogs for more information:

- [Fully Sharded Data Parallel: faster AI training with fewer GPUs](https://engineering.fb.com/2021/07/15/open-source/fsdp/)
- [Getting Started with Fully Sharded Data Parallel(FSDP)](https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html#:~:text=FSDP%20is%20a%20type%20of,sizes%20for%20our%20training%20job.)
- [PyTorch FSDP Tutorial](https://www.youtube.com/watch?v=8_k76AHu__s&list=PL_lsbAsL_o2BT6aerEKgIoufVD_fodnuT)
:::

To start training with Lightning's [FSDPStrategy](https://lightning.ai/docs/pytorch/stable/api/lightning.pytorch.strategies.FSDPStrategy.html#lightning.pytorch.strategies.FSDPStrategy), you only need to create a {class}`~ray.train.lightning.RayFSDPStrategy` with the same initialization arguments. Behind the scenes, Ray TorchTrainer handles the cluster environment settings and job launching.


In [7]:
import functools
import pytorch_lightning as pl 

from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from torch.distributed.fsdp import ShardingStrategy, BackwardPrefetch
from transformers.models.gpt_neox.modeling_gpt_neox import GPTNeoXLayer

from ray.train.lightning import RayFSDPStrategy


# Define the model sharding policy:
# Wrap every GPTNeoXLayer as its own FSDP instance
auto_wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls = {GPTNeoXLayer}
)

fsdp_strategy = RayFSDPStrategy(
    sharding_strategy=ShardingStrategy.FULL_SHARD,
    backward_prefetch=BackwardPrefetch.BACKWARD_PRE,
    forward_prefetch=True,
    auto_wrap_policy=auto_wrap_policy,
    limit_all_gathers=True,
    activation_checkpointing=[GPTNeoXLayer],
)



:::{tip}

Some tips for FSDP configuration:
- `sharding_strategy`:
    - `ShardingStrategy.NO_SHARD`: Parameters, gradients, and optimizer states are not sharded. Similar to DDP.
    - `ShardingStrategy.SHARD_GRAD_OP`: Gradients and optimizer states are sharded during computation, and additionally, parameters are sharded outside computation. Similar to ZeRO stage-2.
    - `ShardingStrategy.FULL_SHARD`: Parameters, gradients, and optimizer states are sharded. It has the lowest GPU memory usage among the 3 options. Similar to ZeRO stage-3.
- `auto_wrap_policy`:
    - Model layers are often wrapped with FSDP in a nested fashion. This means that only the layers in a single FSDP instance need to gather their full parameters on a device during the forward or backward pass.
    - Use `transformer_auto_wrap_policy` to automatically wrap each Transformer Block into a single FSDP instance. 
- `backward_prefetch` and `forward_prefetch`:
    - Overlap the upcoming all-gather while executing the current forward/backward pass. It can improve throughput but may slightly increase peak memory usage.
:::

## Fine-tune with Ray TorchTrainer

Ray TorchTrainer allows you to easily schedule your PyTorch training workload on the Ray cluster. It integrates with popular libraries in the PyTorch ecosystem, such as Lightning, Transformers, and Accelerate. To launch your training on multiple nodes and GPUs, follow these steps:

- Define your training function for each worker: a normal Lightning training function that uses Ray Train utilities.
- Define the {class}`~ray.train.ScalingConfig` that specifies the compute resources.
- Define the {class}`~ray.train.RunConfig` that specifies the storage path and checkpointing logic.
- Define a {class}`~ray.train.torch.TorchTrainer`, and launch training with {meth}`~ray.train.torch.TorchTrainer.fit`.

In [8]:
num_workers = 16
batch_size_per_worker = 10

In [None]:
# To accelerate release tests
train_ds = train_ds.limit(num_workers * batch_size_per_worker * 10)  # each worker has 10 batches
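As a quick sanity check of the sizes involved, the arithmetic behind the limit above can be written out in plain Python. This sketch only reuses the two values from the previous cells; the names `rows_kept`, `global_batch_size`, and `iters_per_epoch` are illustrative.

```python
num_workers = 16
batch_size_per_worker = 10

# Rows kept by the limit: 10 batches for each of the 16 workers.
rows_kept = num_workers * batch_size_per_worker * 10

# One training step consumes one batch on every worker.
global_batch_size = num_workers * batch_size_per_worker

# Number of steps needed to see every kept row once.
iters_per_epoch = rows_kept // global_batch_size

print(rows_kept, global_batch_size, iters_per_epoch)
```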

Additionally, remember to define a Lightning callback that saves and reports checkpoints at the end of each training epoch. Ray Train offers a simple implementation, {class}`~ray.train.lightning.RayTrainReportCallback`, which stores your checkpoint and metrics in remote storage so that you can retrieve them after training has finished. Internally, the callback does the following:

- Get the latest metrics from `pl.Trainer.callback_metrics`.
- Save the checkpoint to the local disk with `pl.Trainer.save_checkpoint()`.
- Create a Ray Train checkpoint with {meth}`~ray.train.Checkpoint.from_directory`.
- Report the metrics and Ray Train checkpoint with {meth}`~ray.train.report`.

Note that you can also implement your own report callback with custom logic, such as saving customized checkpoint files or reporting at a different frequency.

In [11]:
from pytorch_lightning.callbacks import TQDMProgressBar

# Create a customized progress bar for the Lightning Trainer
class DollyV2ProgressBar(TQDMProgressBar):
    def __init__(self, num_iters_per_epoch, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.num_iters_per_epoch = num_iters_per_epoch
    
    def on_train_epoch_start(self, trainer, *_):
        super().on_train_epoch_start(trainer, *_)
        self.train_progress_bar.reset(self.num_iters_per_epoch)

# train_ds.count() returns the number of rows (samples), not batches
total_samples = train_ds.count()
num_iters_per_epoch = total_samples // (num_workers * batch_size_per_worker)
prog_bar = DollyV2ProgressBar(num_iters_per_epoch)

2023-08-30 10:19:36,958	INFO streaming_executor.py:93 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[MapBatches(split_text)->MapBatches(tokenize)]
2023-08-30 10:19:36,959	INFO streaming_executor.py:94 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
2023-08-30 10:19:36,959	INFO streaming_executor.py:96 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`




In [10]:

from ray.train import Checkpoint
from ray.train.lightning import RayLightningEnvironment, RayTrainReportCallback, prepare_trainer

# Training function for each worker
def train_func(config):
    lr = config["lr"]
    eps = config["eps"]
    strategy = config["strategy"]
    batch_size_per_worker = config["batch_size_per_worker"]

    # Model
    model = DollyV2Model(lr=lr, eps=eps)

    # Ray Data Ingestion
    train_ds = ray.train.get_dataset_shard("train")
    train_dataloader = train_ds.iter_torch_batches(batch_size=batch_size_per_worker)

    # Lightning Trainer
    trainer = pl.Trainer(
        max_epochs=1, 
        devices="auto",
        accelerator="auto", 
        precision="16-mixed",
        strategy=strategy,
        plugins=[RayLightningEnvironment()],
        callbacks=[RayTrainReportCallback()],
        enable_checkpointing=False,
    )

    trainer = prepare_trainer(trainer)

    trainer.fit(model, train_dataloaders=train_dataloader)

```{note}
Since this example runs with multiple nodes, we need to persist checkpoints
and other outputs to some external storage for access after training has completed.
**You should set up cloud storage or NFS, then replace `storage_path` with your own cloud bucket URI or NFS path.**

See the [storage guide](tune-storage-options) for more details.
```

In [12]:
storage_path="s3://your-bucket-here"  # TODO: Set up cloud storage
# storage_path="/mnt/path/to/nfs"     # TODO: Alternatively, set up NFS

In [13]:
storage_path = "/mnt/cluster_storage"

In [14]:
from ray.train.torch import TorchTrainer
from ray.train import RunConfig, ScalingConfig, CheckpointConfig

# Keep only the most recent Ray Train checkpoint
run_config = RunConfig(
    name="finetune_dolly-v2-7b",
    storage_path=storage_path,
    checkpoint_config=CheckpointConfig(num_to_keep=1),
)

# Scale the FSDP training workload across 16 GPUs
# You can change this config based on your compute resources.
scaling_config = ScalingConfig(
    num_workers=num_workers, use_gpu=True, resources_per_worker={"CPU": 12, "GPU": 1}
)

# Configuration to pass into train_func
train_config = {
    "lr": 2e-5,
    "eps": 1e-8,
    "strategy": fsdp_strategy,
    "batch_size_per_worker": 10
}

# Define a TorchTrainer and launch your training workload
ray_trainer = TorchTrainer(
    train_func,
    train_loop_config=train_config,
    run_config=run_config,
    scaling_config=scaling_config,
    datasets={"train": train_ds},
)
result = ray_trainer.fit()

result


- Current time: 2023-08-30 10:29:49
- Running for: 00:10:03.36
- Memory: 37.9/124.3 GiB

| Trial name | status | loc | iter | total time (s) | train_loss | epoch | step |
|---|---|---|---|---|---|---|---|
| TorchTrainer_6ae58_00000 | TERMINATED | 10.0.23.226:8870 | 1 | 593.88 | 12.5078 | 0 | 5 |


(TrainTrainable pid=8870) StorageContext on SESSION (rank=None):
(TrainTrainable pid=8870) StorageContext<
(TrainTrainable pid=8870)   storage_path=/mnt/cluster_storage
(TrainTrainable pid=8870)   storage_local_path=/home/ray/ray_results
(TrainTrainable pid=8870)   storage_filesystem=<pyarrow._fs.LocalFileSystem object at 0x7f3d46319130>
(TrainTrainable pid=8870)   storage_fs_path=/mnt/cluster_storage
(TrainTrainable pid=8870)   experiment_dir_name=finetune_dolly-v2-7b
(TrainTrainable pid=8870)   trial_dir_name=TorchTrainer_6ae58_00000_0_2023-08-30_10-19-46
(TrainTrainable pid=8870)   current_checkpoint_index=0
(TrainTrainable pid=8870) >
(TorchTrainer pid=8870) Starting distributed worker processes: ['8972 (10.0.23.226)', '4128 (10.0.2.17)', '4094 (10.0.53.250)', '4511 (10.0.40.16)', '4123 (10.0.41.152)', '4082 (10.0.44.99)', '4151 (10.0.14.94)

(RayTrainWorker pid=4511, ip=10.0.40.16) Using 16bit Automatic Mixed Precision (AMP) [repeated 5x across cluster]
(RayTrainWorker pid=4123, ip=10.0.41.152) Missing logger folder: /home/ray/ray_results/finetune_dolly-v2-7b/TorchTrainer_6ae58_00000_0_2023-08-30_10-19-46/lightning_logs [repeated 4x across cluster]
Downloading pytorch_model.bin: 100%|██████████| 13.8G/13.8G [02:07<00:00, 109MB/s] [repeated 89x across cluster]
Downloading pytorch_model.bin: 100%|██████████| 13.8G/13.8G [02:07<00:00, 109MB/s] [repeated 6x across cluster]
(RayTrainWorker pid=8972) GPU available: True (cuda), used: True
(RayTrainWorker pid=8972) TPU available: False, using: 0 TPU cores
(RayTrainWorker pid=8972) IPU available: False, using: 0 IPUs
(RayTrainWorker pid=8972) HPU available: False, using: 0 HPUs
(RayTrainWorker pid=4115, ip=10.0.46.116) LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES:

(RayTrainWorker pid=8972) FullyShardedDataParallel(
(RayTrainWorker pid=8972)   (_fsdp_wrapped_module): _LightningModuleWrapperBase(
(RayTrainWorker pid=8972)     (_forward_module): DollyV2Model(
(RayTrainWorker pid=8972)       (model): GPTNeoXForCausalLM(
(RayTrainWorker pid=8972)         (gpt_neox): GPTNeoXModel(
(RayTrainWorker pid=8972)           (embed_in): Embedding(50280, 4096)
(RayTrainWorker pid=8972)           (layers): ModuleList(
(RayTrainWorker pid=8972)             (0-31): 32 x FullyShardedDataParallel(
(RayTrainWorker pid=8972)               (_fsdp_wrapped_module): CheckpointWrapper(
(RayTrainWorker pid=8972)                 (_checkpoint_wrapped_module): GPTNeoXLayer(
(RayTrainWorker pid=8972)                   (input_layernorm): LayerNorm((4096,), eps=1e-05, elementwise_affine=True)
(RayTrainWorker pid=8972)

(RayTrainWorker pid=8972)
(RayTrainWorker pid=8972)   | Name  | Type               | Params
(RayTrainWorker pid=8972) ---------------------------------------------
(RayTrainWorker pid=8972) 0 | model | GPTNeoXForCausalLM | 402 M
(RayTrainWorker pid=8972) ---------------------------------------------
(RayTrainWorker pid=8972) 402 M     Trainable params
(RayTrainWorker pid=8972) 0         Non-trainable params
(RayTrainWorker pid=8972) 402 M     Total params
(RayTrainWorker pid=8972) 1,611.039 Total estimated model params size (MB)
(RayTrainWorker pid=4094, ip=10.0.53.250) Using 16bit Automatic Mixed Precision (AMP) [repeated 10x across cluster]
(RayTrainWorker pid=4079, ip=10.0.59.245) Missing logger folder: /home/ray/ray_results/finetune_dolly-v2-7b/TorchTrainer_6ae58_00000_0_2023-08-30_10-19-46/lightning_logs [repeated 11x across

(pid=9091) Running 0:   0%|          | 0/1 [00:00<?, ?it/s]

(SplitCoordinator pid=9091) Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[MapBatches(split_text)->MapBatches(tokenize)] -> OutputSplitter[split(16, equal=True)]
(SplitCoordinator pid=9091) Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=2000000000.0), locality_with_output=['5ed99d043a52f67deb150f34202c09b77bd37409502ebf6e581b0544', 'e3754d1e1017e68dd919b35d35ea62ed7b005ad96452f371721fc9fa', '73a8b9377fe9531a84eaa7b30c966fbb11bc36aff070d55c8f7acd1a', '8efe8198d7c04d45714ae757f298c316117405f3a8b25b87a71e0d9e', 'ef922c93f3b2fc93ebe5a521426d24fb8aae7e13c65f9fbd106aea2a', '042b668e5553a589a4f6693c45deee0abe57a1d754812172af425acb', '5249cff3eab41121f840c17a79e6a3cd0af0f059def707a39e055fcf', '8bd0f431ab3733c4b423c1d50db06460e3c210de47355b3b4d215c31', '9ed138bfe1f9c7dca484ee08d8311806389adb3af7a76566a6f4dfaa', '7e2fcb5dfe4ab1b572d87257f9e13bbc22b33ba968b1e67a79505589', '9484193409a5346c0838a4a

(pid=9091) Running: 0.0/272.0 CPU, 0.0/16.0 GPU, 118.71 MiB/1.86 GiB object_store_memory 0:   0%|          | 0…

Epoch 0: : 1it [00:27, 27.52s/it, v_num=0, train_loss=12.90]
Epoch 0: : 2it [00:46, 23.33s/it, v_num=0, train_loss=12.50]
Epoch 0: : 3it [01:04, 21.62s/it, v_num=0, train_loss=12.50]
Epoch 0: : 4it [01:22, 20.72s/it, v_num=0, train_loss=12.50]
Epoch 0: : 5it [01:41, 20.22s/it, v_num=0, train_loss=12.50]


(RayTrainWorker pid=4115, ip=10.0.46.116) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/cluster_storage/finetune_dolly-v2-7b/TorchTrainer_6ae58_00000_0_2023-08-30_10-19-46/checkpoint_000000)
(RayTrainWorker pid=8972) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/cluster_storage/finetune_dolly-v2-7b/TorchTrainer_6ae58_00000_0_2023-08-30_10-19-46/checkpoint_000000) [repeated 15x across cluster]


Epoch 0: : 5it [06:19, 75.83s/it, v_num=0, train_loss=12.50]


(RayTrainWorker pid=8972) `Trainer.fit` stopped: `max_steps=5` reached.
(RayTrainWorker pid=8972) RayFSDPStrategy: tearing down strategy...
2023-08-30 10:29:49,993	INFO tune.py:1142 -- Total run time: 603.42 seconds (603.33 seconds for the tuning loop).


Result(
  metrics={'train_loss': 12.5078125, 'epoch': 0, 'step': 5},
  path='/mnt/cluster_storage/finetune_dolly-v2-7b/TorchTrainer_6ae58_00000_0_2023-08-30_10-19-46',
  filesystem='local',
  checkpoint=Checkpoint(filesystem=local, path=/mnt/cluster_storage/finetune_dolly-v2-7b/TorchTrainer_6ae58_00000_0_2023-08-30_10-19-46/checkpoint_000000)
)

We finished training in 2361s. The price for an on-demand g4dn.4xlarge instance is `$1.204/hour`, while a g4dn.8xlarge instance costs `$2.176/hour`. The total cost would be `($1.204 * 15 + $2.176) * 2699 / 3600 = $15.17`.
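The arithmetic in the formula above can be reproduced with a few lines of Python. This is only a sketch: the hourly prices, the instance counts, and the 2699-second duration are copied from that formula, and the variable names are illustrative.

```python
# Figures copied from the cost formula in the text above.
price_small_per_hour = 1.204   # USD per hour, 15 instances at this price
num_small = 15
price_large_per_hour = 2.176   # USD per hour, 1 instance at this price
num_large = 1
elapsed_seconds = 2699         # duration used in the formula

cluster_price_per_hour = (
    price_small_per_hour * num_small + price_large_per_hour * num_large
)
total_cost = cluster_price_per_hour * elapsed_seconds / 3600

print(f"Cluster price: ${cluster_price_per_hour:.3f}/hour")
print(f"Total cost:    ${total_cost:.2f}")
```

Changing `elapsed_seconds` or the instance counts shows how the cost scales with run time and cluster size.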

## Text generation with Hugging Face Pipeline

We can use the [Hugging Face Pipeline](https://huggingface.co/docs/transformers/main_classes/pipelines) to generate predictions from our fine-tuned model. Let's input some prompts and see if our tuned Dolly can speak like Shakespeare:

In [17]:
import os
from transformers import pipeline

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, padding_side="right")

ckpt_path = os.path.join(result.checkpoint.path, "checkpoint.ckpt")

dolly = DollyV2Model.load_from_checkpoint(ckpt_path, map_location=torch.device("cpu"))

nlp_pipeline = pipeline(
    task="text-generation", 
    model=dolly.model, 
    tokenizer=tokenizer, 
    device_map="auto"
)


Xformers is not installed correctly. If you want to use memorry_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.


In [18]:
for prompt in ["This is", "I am", "Once more"]:
    print(nlp_pipeline(prompt, max_new_tokens=20, do_sample=True, pad_token_id=tokenizer.eos_token_id))

[{'generated_text': 'This is a very important point. Your brain naturally defaults to the negation of this clause. As a programmer'}]
[{'generated_text': 'I am the first person the the Doctor ever talked to about his experiences in the Daleks. I was the'}]
[{'generated_text': 'Once more, the city of Austin is being plagued with the Zika virus, which can cause severe birth defects'}]


References:
- [PyTorch FSDP Tutorial](https://www.youtube.com/watch?v=8_k76AHu__s&list=PL_lsbAsL_o2BT6aerEKgIoufVD_fodnuT)
- [Getting Started with Fully Sharded Data Parallel(FSDP)](https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html#:~:text=FSDP%20is%20a%20type%20of,sizes%20for%20our%20training%20job.)
- [Fully Sharded Data Parallel: faster AI training with fewer GPUs](https://engineering.fb.com/2021/07/15/open-source/fsdp/)
- [Hugging Face: dolly-v2-7b Model Card](https://huggingface.co/databricks/dolly-v2-7b)
- [Hugging Face: Handling big models for inference](https://huggingface.co/docs/accelerate/usage_guides/big_modeling)