
[AIR] LightningTrainer Dolly V2 FSDP Fine-tuning Example #34990

Merged

Changes from 33 commits

Commits (36)
5918983
init
woshiyyya Apr 26, 2023
f427704
can forward, not backward
woshiyyya Apr 27, 2023
29cde26
run success
woshiyyya Apr 28, 2023
5c53a1e
runnable e2e
woshiyyya Apr 29, 2023
22e6bb3
using pipeline for inference
woshiyyya Apr 29, 2023
22edc1a
get reasonable output
woshiyyya May 1, 2023
b904d0f
better sections
woshiyyya May 1, 2023
be52d45
wip
woshiyyya May 2, 2023
98d5048
wip
woshiyyya May 2, 2023
d88eb02
rm useless files
woshiyyya May 2, 2023
47b4ecf
polish wording
woshiyyya May 3, 2023
d4a38db
lint
woshiyyya May 3, 2023
7ecc382
finish 1st version
woshiyyya May 3, 2023
eae132a
acclerate training from 20min to 10min with activation checkpointing
woshiyyya May 3, 2023
1edb56b
finish dolly 7B training
woshiyyya May 3, 2023
097e177
remove 3b example
woshiyyya May 5, 2023
b06de89
add release tests
woshiyyya May 5, 2023
b8b1e62
add error message
woshiyyya May 5, 2023
c63babb
Merge remote-tracking branch 'upstream/master' into air/lightning_dol…
woshiyyya May 5, 2023
fe70347
add variation
woshiyyya May 5, 2023
1be677f
add availability zone
woshiyyya May 5, 2023
d0fd15e
modify available zone
woshiyyya May 5, 2023
c8bb719
wip..
woshiyyya May 5, 2023
76b12ab
add pip packages
woshiyyya May 6, 2023
4b738a4
fix runtime env
woshiyyya May 6, 2023
670421c
update ptl requirement
woshiyyya May 6, 2023
1b5dfb0
limit train step & local dir
woshiyyya May 6, 2023
56c473a
clear output and add gce compute comfig
woshiyyya May 6, 2023
65fd090
Merge remote-tracking branch 'upstream/master' into air/lightning_dol…
woshiyyya May 8, 2023
d7ca9af
add intro section
woshiyyya May 8, 2023
d917c83
fix lint
woshiyyya May 8, 2023
f572e40
rm env config
woshiyyya May 8, 2023
39abd72
add links in doc tree and other examples
woshiyyya May 8, 2023
ce0e5d1
Merge remote-tracking branch 'upstream/master' into air/lightning_dol…
woshiyyya May 9, 2023
e8d06a1
add symlink for release test files
woshiyyya May 9, 2023
49a8370
address comments
woshiyyya May 9, 2023
1 change: 1 addition & 0 deletions doc/source/_toc.yml
Original file line number Diff line number Diff line change
@@ -82,6 +82,7 @@ parts:
- file: ray-air/examples/gptj_batch_prediction
- file: ray-air/examples/gptj_serving
- file: ray-air/examples/dreambooth_finetuning
- file: ray-air/examples/dolly_lightning_fsdp_finetuning
- file: ray-air/api/api
- file: ray-air/benchmarks

1 change: 1 addition & 0 deletions doc/source/ray-air/examples/BUILD
@@ -52,6 +52,7 @@ py_test_run_all_notebooks(
"stablediffusion_batch_prediction.ipynb", # Requires GPUs
"gptj_deepspeed_fine_tuning.ipynb", # Requires release test
"opt_deepspeed_batch_inference.ipynb", # Requires release test
"dolly_lightning_fsdp_finetuning.ipynb", # Requires release test
],
data = ["//doc/source/ray-air/examples:air_examples"],
tags = ["exclusive", "team:ml", "ray_air"],
1,037 changes: 1,037 additions & 0 deletions doc/source/ray-air/examples/dolly_lightning_fsdp_finetuning.ipynb

Large diffs are not rendered by default.

doc/source/ray-air/examples/gptj_deepspeed_fine_tuning.ipynb
@@ -5,6 +5,8 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"(gptj_deepspeed_finetune)=\n",
"\n",
"# GPT-J-6B Fine-Tuning with Ray AIR and DeepSpeed\n",
"\n",
"In this example, we will showcase how to use the Ray AIR for **GPT-J fine-tuning**. GPT-J is a GPT-2-like causal language model trained on the Pile dataset. This particular model has 6 billion parameters. For more information on GPT-J, click [here](https://huggingface.co/docs/transformers/model_doc/gptj).\n",
1 change: 1 addition & 0 deletions doc/source/ray-air/examples/index.rst
@@ -30,6 +30,7 @@ Text/NLP
- :doc:`/ray-air/examples/gptj_serving`: How to use Ray AIR to do online serving with the Hugging Face Transformers GPT-J model.
- :doc:`/ray-air/examples/dreambooth_finetuning`: How to fine-tune a DreamBooth text-to-image model with your own images.
- :doc:`/ray-air/examples/opt_deepspeed_batch_inference`: How to run batch inference on a dataset of texts with a 30B OPT model.
- :doc:`/ray-air/examples/dolly_lightning_fsdp_finetuning`: How to fine-tune a dolly-v2-7b model with Ray AIR LightningTrainer and FSDP.

Image/CV
--------
8 changes: 8 additions & 0 deletions doc/source/train/examples.rst
@@ -83,6 +83,14 @@ Distributed Training Examples using Ray Train

Use LightningTrainer with Ray Data and Batch Predictor

.. grid-item-card::
:img-top: /images/pytorch_lightning_small.png
:class-img-top: pt-2 w-75 d-block mx-auto fixed-height-img

.. button-ref:: dolly_lightning_fsdp_finetuning

Fine-tune LLM with AIR LightningTrainer and FSDP


Ray Train Examples Using Loggers & Callbacks
--------------------------------------------
@@ -1483,6 +1483,17 @@
"print(results.head(10))\n",
"print(matthews_corr)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## What's next?\n",
"\n",
"- {ref}`Fine-tune a Large Language Model with LightningTrainer and FSDP <dolly_lightning_fsdp_finetuning>`\n",
"- {ref}`Hyperparameter searching with LightningTrainer + Ray Tune. <tune-pytorch-lightning-ref>`"
]
}
],
"metadata": {
@@ -741,6 +741,7 @@
"## What's next?\n",
"\n",
"- {ref}`Use LightningTrainer with Ray Data and Batch Predictor <lightning_advanced_example>`\n",
"- {ref}`Fine-tune a Large Language Model with LightningTrainer and FSDP <dolly_lightning_fsdp_finetuning>`\n",
"- {ref}`Hyperparameter searching with LightningTrainer + Ray Tune. <tune-pytorch-lightning-ref>`"
]
}
3 changes: 2 additions & 1 deletion doc/source/tune/examples/tune-pytorch-lightning.ipynb
@@ -582,6 +582,7 @@
"\n",
"- {ref}`Use LightningTrainer for Image Classification <lightning_mnist_example>`.\n",
"- {ref}`Use LightningTrainer with Ray Data and Batch Predictor <lightning_advanced_example>`\n",
"- {ref}`Fine-tune a Large Language Model with LightningTrainer and FSDP <dolly_lightning_fsdp_finetuning>`\n",
"- {doc}`/tune/examples/includes/mlflow_ptl_example`: Example for using [MLflow](https://github.com/mlflow/mlflow/)\n",
" and [Pytorch Lightning](https://github.com/PyTorchLightning/pytorch-lightning) with Ray Tune.\n",
"- {doc}`/tune/examples/includes/mnist_ptl_mini`:\n",
@@ -607,7 +608,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
-    "version": "3.7.15"
+    "version": "3.8.16"
}
},
"nbformat": 4,
New file (release test compute config):
@@ -0,0 +1,20 @@
cloud_id: {{env["ANYSCALE_CLOUD_ID"]}}
region: us-west-2

head_node_type:
name: head_node
instance_type: g4dn.8xlarge

worker_node_types:
- name: worker_node
instance_type: g4dn.4xlarge
min_workers: 15
max_workers: 15
use_spot: false

aws:
TagSpecifications:
- ResourceType: "instance"
Tags:
- Key: ttl-hours
Value: '24'
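The compute config above pins a fixed-size cluster (one head node plus 15 workers, no spot instances). As a quick sanity check on the hardware budget it implies — assuming, per AWS instance specs, one NVIDIA T4 GPU per g4dn.4xlarge and per g4dn.8xlarge node — the GPU count works out as:

```python
# Sketch: accelerator budget implied by the compute config above.
# Assumption (AWS g4dn specs): g4dn.4xlarge and g4dn.8xlarge each carry one NVIDIA T4 GPU.
head_gpus = 1 * 1        # 1 head node (g4dn.8xlarge)
worker_gpus = 15 * 1     # 15 fixed workers (min_workers == max_workers == 15)
total_gpus = head_gpus + worker_gpus

print(total_gpus)  # 16 T4 GPUs across which FSDP can shard the model
```

A fixed `min_workers == max_workers` keeps the world size constant, which matters here because FSDP shards parameters across a fixed number of ranks.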
New file (release test build/environment config):
@@ -0,0 +1,21 @@
base_image: {{ env["RAY_IMAGE_ML_NIGHTLY_GPU"] | default("anyscale/ray:nightly-py38-cu118") }}
env_vars: {}
debian_packages:
- curl

python:
pip_packages:
- "datasets"
- "evaluate"
- "scikit-learn"
- "boto3"
- myst-parser==0.15.2
- myst-nb==0.13.1
- jupytext==1.13.6
conda_packages: []

post_build_cmds:
- pip uninstall -y ray || true && pip3 install -U {{ env["RAY_WHEELS"] | default("ray") }}
- {{ env["RAY_WHEELS_SANITY_CHECK"] | default("echo No Ray wheels sanity check") }}
- pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
- pip3 install "pytorch_lightning>=2.0.0" "transformers>=4.28.0" "accelerate>=0.18.0"