
Distributed Fine-Tuning of Language Models with Valohai

This GitHub repository demonstrates how to fine-tune large language models (LLMs) in a distributed manner using Valohai. It provides detailed guides and code examples to help you get started with distributed training, showcasing how Valohai enables efficient model fine-tuning.

Learn how to distribute the fine-tuning process of large language models across multiple machines or GPUs for improved training efficiency.

Distributed Training

In this section, we explore the approaches to distributed training of large language models (LLMs) available in this repository. Distributed training is crucial for computationally intensive workloads, and the repository provides multiple methods to achieve it.

Note: To use the distributed training features outlined below, you will need a machine equipped with at least two GPUs. For setup assistance, please contact our support team.
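If you want to verify what a given environment exposes before launching a run, a quick generic PyTorch check (not part of this repository) prints the number of visible GPUs:

import torch

print(f"Visible GPUs: {torch.cuda.device_count()}")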

1. Torchrun (Elastic Launch)

Script: train-torchrun.py

In the first approach, we leverage the torchrun (Elastic Launch) functionality, which extends the capabilities of torch.distributed.launch. We use the Hugging Face Transformers Trainer to fine-tune the language model on our dataset. With torchrun, you can distribute the training process without making any modifications to your existing code, which makes this method a straightforward introduction to distributed training for LLMs.

Benefits:
  • Utilizes Transformers Trainer for model fine-tuning.
  • No code changes needed for distributed training.
  • Easy setup for those new to distributed training.
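To make this concrete, below is a minimal, self-contained sketch of a Trainer-based fine-tuning script in the spirit of train-torchrun.py. The model and dataset names (distilgpt2, wikitext) are placeholders chosen for illustration, not necessarily what the repository's script uses. Launched with a command such as torchrun --nproc_per_node=<num_gpus> train-torchrun.py, torchrun sets the RANK, LOCAL_RANK and WORLD_SIZE environment variables and the Trainer picks them up automatically, which is why no code changes are needed.

from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Placeholder model and dataset, small enough to test the setup quickly.
model_name = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1%]")
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=dataset.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="/valohai/outputs/model",
        per_device_train_batch_size=4,
        num_train_epochs=1,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

# Under torchrun, each process trains on its own GPU and gradients are
# synchronized by the Trainer's built-in distributed support.
trainer.train()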

2. Accelerate Library

Script: train-accelerator.py

In the second approach, we employ the Accelerate library to facilitate distributed training. The train-accelerator.py script is based on the Hugging Face summarization example that does not use the Trainer class. This gives you complete control over the training loop, allowing more flexibility in customizing the training process, while the Accelerate library takes care of the distribution aspects, making it an efficient choice for distributed training.

Benefits:
  • Fine-grained control over the training loop.
  • Accelerate library handles distribution seamlessly.
  • Ideal for custom training approaches.
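As a rough illustration of what that looks like, the sketch below shows the core Accelerate pattern: you keep a plain PyTorch loop and let accelerator.prepare() and accelerator.backward() handle device placement and gradient synchronization. It uses a small causal-LM setup as a placeholder rather than the summarization task of the actual train-accelerator.py, purely to keep the example short.

import torch
from accelerate import Accelerator
from datasets import load_dataset
from torch.utils.data import DataLoader
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
)

accelerator = Accelerator()

model_name = "distilgpt2"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1%]")
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=dataset.column_names,
)
loader = DataLoader(
    tokenized,
    batch_size=4,
    shuffle=True,
    collate_fn=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# prepare() wraps the model, optimizer and dataloader so the same loop runs
# on CPU, one GPU, or several processes without changes to the loop itself.
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

model.train()
for batch in loader:
    loss = model(**batch).loss
    accelerator.backward(loss)  # replaces loss.backward()
    optimizer.step()
    optimizer.zero_grad()

Such a script is typically started with accelerate launch train-accelerator.py; running it with plain python still works for a single process.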

3. Distributed Training Across Multiple Machines

Note: the train-task step must be run as a Valohai Task, NOT as a regular Execution.

The third approach, currently under development within this repository, tackles distributed training across multiple machines simultaneously. To achieve this, we employ Valohai's valohai.distributed, along with torch.distributed and torch.multiprocessing to establish communication between multiple machines during the training process. While still in development, this approach aims to provide a robust solution for training large language models across a distributed infrastructure.

Benefits:
  • Distributes training across multiple machines.
  • Uses Valohai's distributed capabilities.
  • Enables efficient scaling for demanding training workloads.
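As a rough sketch of how the pieces might fit together, the snippet below initializes torch.distributed using rank and master-address information exposed by Valohai's distributed helpers. The helper names used here (me(), master(), rank, primary_local_ip, required_count) are assumptions based on the Valohai documentation rather than this repository's final code; see train-task.py for the actual wiring.

import os

import torch
import torch.distributed as dist
import valohai


def init_distributed():
    # Assumed valohai.distributed API: this worker and the elected master.
    me = valohai.distributed.me()
    master = valohai.distributed.master()

    # Every process must agree on the same rendezvous address and port.
    os.environ["MASTER_ADDR"] = master.primary_local_ip
    os.environ["MASTER_PORT"] = "1234"

    dist.init_process_group(
        backend="nccl" if torch.cuda.is_available() else "gloo",
        rank=me.rank,
        world_size=valohai.distributed.required_count,
    )


if __name__ == "__main__":
    init_distributed()
    # ... build the model, wrap it in DistributedDataParallel, train ...
    dist.destroy_process_group()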

Configure the repository:

To get started, log in to the Valohai app and create a new project.

Using UI

Configure this repository as the project's repository by following these steps:

  1. Go to your project's page.
  2. Navigate to the Settings tab.
  3. Under the Repository section, locate the URL field.
  4. Enter the URL of this repository.
  5. Click on the Save button to save the changes.
Using terminal

To run the code on Valohai using the terminal, follow these steps:

  1. Install Valohai on your machine by running the following command:
pip install valohai-cli valohai-utils
  2. Log in to Valohai from the terminal using the command:
vh login
  3. Create a project for your Valohai workflow. Start by creating a directory for your project:
mkdir valohai-distributed-llms
cd valohai-distributed-llms

Then, create the Valohai project:

vh project create
  4. Clone the repository to your local machine:
git clone https://github.com/valohai/distributed-llms-example.git .

Running Executions:

Using UI
  1. Go to the Executions tab in your project.
  2. Create a new execution by selecting the predefined step.
  3. Customize the execution parameters if needed.
  4. Start the execution to run the selected step.
Using terminal

To run individual steps, execute the following command:
vh execution run <step-name> --adhoc

For example, to run the train-torchrun step, use the command:

vh execution run train-torchrun --adhoc

Running Tasks:

Using UI

  1. Go to the Tasks tab in your project.
  2. Create a new task by selecting the predefined step train-task.
  3. Choose one of the following options:
    • Navigate to Task type and select Distributed, then adjust the execution count.
    • Utilize the blueprint by clicking "Select Task blueprint" in the upper right corner.
  4. Customize the task parameters if needed.
  5. Start the task to run.

Contact

For bug reports and feature requests, please visit GitHub Issues.

If you need any help, feel free to contact our support team!
