# MultiGPU Distributed Launching with Runhouse and HuggingFace Accelerate

This tutorial demonstrates how to use Runhouse with HuggingFace Accelerate to launch distributed code on **your own remote hardware**. It also shows how to set up hardware and dependencies automatically and reproducibly, so that your code runs the same way every time.

You can run this on your own cluster, or through a standard cloud account (AWS, GCP, Azure, LambdaLabs). If you do not have any compute or cloud accounts set up, we recommend creating a [LambdaLabs](https://cloud.lambdalabs.com/) account for the easiest setup path.

## Install dependencies

In [None]:
!pip install accelerate
!pip install runhouse

In [None]:
import runhouse as rh

INFO | 2023-03-20 17:56:13,023 | No auth token provided, so not using RNS API to save and load configs
INFO | 2023-03-20 17:56:14,334 | NumExpr defaulting to 2 threads.


## Setting up the Cluster

### On-Demand Cluster (AWS, Azure, GCP, or LambdaLabs)

For instructions on setting up cloud access for on-demand clusters, please refer to
[Hardware Setup](https://runhouse-docs.readthedocs-hosted.com/en/main/rh_primitives/cluster.html#hardware-setup).

In [None]:
# single V100 GPU
# gpu = rh.cluster(name="rh-v100", instance_type="V100:1").up_if_not()

# multigpu: 4 V100s
gpu = rh.cluster(name="rh-4-v100", instance_type="V100:4").up_if_not()

# Set GPU to autostop after 60 min of inactivity (default is 30 min)
gpu.keep_warm(60)  # or -1 to keep up indefinitely


### On-Premise Cluster

For an on-prem cluster, you can instantiate it as follows, filling in the IP address, SSH user, and private key path.

In [None]:
# For an existing cluster
# gpu = rh.cluster(ips=['<ip of the cluster>'], 
#                  ssh_creds={'ssh_user': '...', 'ssh_private_key':'<path_to_key>'},
#                  name='rh-cluster')

## Setting up Functions on Remote Hardware

### Training Function
For simplicity, let's use the [training_function](https://github.com/huggingface/accelerate/blob/main/examples/nlp_example.py#L114) from [accelerate/examples/nlp_example.py](https://github.com/huggingface/accelerate/blob/v0.15.0/examples/nlp_example.py) to demonstrate how to run this function remotely.

In this case, because the function is available on GitHub, we can pass in a string pointing to the GitHub function.

For a local function, for instance if `nlp_example.py` were in our working directory, we could simply import it instead.

In [None]:
# if nlp_example.py is in local directory
# from nlp_example import training_function

# if function is available on GitHub, use its string representation
training_function = "https://github.com/huggingface/accelerate/blob/v0.15.0/examples/nlp_example.py:training_function"
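The string above follows a `<GitHub URL>:<function name>` pattern. As a rough illustration of how the two parts relate, the sketch below splits the pointer at its last colon. `split_fn_pointer` is a hypothetical helper written for this tutorial; it is not Runhouse's own parsing code, which may resolve these strings differently.

```python
from typing import Tuple


def split_fn_pointer(pointer: str) -> Tuple[str, str]:
    """Split '<url>:<function_name>' at the last colon.

    Hypothetical helper for illustration only; Runhouse resolves these
    strings internally and may do so differently.
    """
    url, sep, fn_name = pointer.rpartition(":")
    # A bare URL such as "https://host/file.py" would leave "//host/file.py"
    # after the last colon, which is not a valid Python identifier.
    if not sep or not fn_name.isidentifier():
        raise ValueError(f"expected '<url>:<function_name>', got {pointer!r}")
    return url, fn_name


pointer = (
    "https://github.com/huggingface/accelerate/blob/v0.15.0/"
    "examples/nlp_example.py:training_function"
)
url, fn_name = split_fn_pointer(pointer)
print(url)      # the file on GitHub, pinned to the v0.15.0 tag
print(fn_name)  # the function to load from that file
```

Pinning the URL to a tag (`v0.15.0` here) rather than `main` keeps the remote function reproducible even if the upstream example changes.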

Next, define the dependencies necessary to run the imported training function using accelerate.

In [None]:
reqs = ['pip:./accelerate', 'transformers', 'datasets', 'evaluate', 'tqdm', 'scipy', 'scikit-learn', 'tensorboard',
        'torch --upgrade --extra-index-url https://download.pytorch.org/whl/cu117']

Now, we can put together the above components (gpu cluster, training function, and dependencies) to create our train function on remote hardware.

In [None]:
train_function_gpu = rh.function(
    fn=training_function,
    system=gpu,
    reqs=reqs,
)

INFO | 2023-03-20 21:01:46,942 | Setting up Function on cluster.
INFO | 2023-03-20 21:01:46,951 | Installing packages on cluster rh-v100: ['GitPackage: https://github.com/huggingface/accelerate.git@v0.15.0', 'pip:./accelerate', 'transformers', 'datasets', 'evaluate', 'tqdm', 'scipy', 'scikit-learn', 'tensorboard', 'torch --upgrade --extra-index-url https://download.pytorch.org/whl/cu117']
INFO | 2023-03-20 21:02:02,988 | Function setup complete.


`train_function_gpu` is a callable that can be used just like the original `training_function` in the NLP example, except that it runs on the specified cluster/system instead.

## Launch Helper Function

Here we define a helper function for launching Accelerate training, and then send that helper to the GPU cluster as well.

In [None]:
def launch_training(training_function, *args):
    # Imports live inside the function body so they resolve on the cluster
    from accelerate.utils import PrepareForLaunch, patch_environment
    import torch

    # Launch one process per visible GPU
    num_processes = torch.cuda.device_count()
    print(f'Device count: {num_processes}')
    with patch_environment(world_size=num_processes, master_addr="127.0.0.1", master_port="29500",
                           mixed_precision=args[1].mixed_precision):
        launcher = PrepareForLaunch(training_function, distributed_type="MULTI_GPU")
        torch.multiprocessing.start_processes(launcher, args=args, nprocs=num_processes, start_method="spawn")
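To make the role of `patch_environment` more concrete: it is a context manager that exports its keyword arguments as upper-cased environment variables (so `world_size` becomes `WORLD_SIZE`) for the duration of the `with` block, which lets the spawned processes inherit the rendezvous settings. The stand-in below, `patch_env_sketch`, is a simplified illustration written for this tutorial. It is not accelerate's actual implementation, and details such as how pre-existing variables are restored may differ between accelerate versions.

```python
import os
from contextlib import contextmanager


@contextmanager
def patch_env_sketch(**kwargs):
    """Temporarily export kwargs as upper-cased environment variables.

    Simplified stand-in for illustration; not accelerate's implementation.
    """
    patched = {key.upper(): str(value) for key, value in kwargs.items()}
    previous = {key: os.environ.get(key) for key in patched}
    os.environ.update(patched)
    try:
        yield
    finally:
        # Restore whatever was set before the block was entered.
        for key, old_value in previous.items():
            if old_value is None:
                os.environ.pop(key, None)
            else:
                os.environ[key] = old_value


before = os.environ.get("WORLD_SIZE")

with patch_env_sketch(world_size=4, master_addr="127.0.0.1", master_port="29500"):
    # Inside the block the settings are visible as ordinary env variables.
    inside = {k: os.environ[k] for k in ("WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT")}

after = os.environ.get("WORLD_SIZE")

print(inside)
print(before == after)  # the environment is back to its prior state
```

Because the values only exist inside the `with` block, the process launch has to happen inside it, as it does in `launch_training`.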

In [None]:
launch_training_gpu = rh.function(fn=launch_training).to(gpu)

INFO | 2023-03-20 19:56:15,257 | Writing out function function to /content/launch_training_fn.py as functions serialized in notebooks are brittle. Please make sure the function does not rely on any local variables, including imports (which should be moved inside the function body).
INFO | 2023-03-20 19:56:15,262 | Setting up Function on cluster.
INFO | 2023-03-20 19:56:15,265 | Copying local package content to cluster <rh-v100>
INFO | 2023-03-20 19:56:20,623 | Installing packages on cluster rh-v100: ['./']
INFO | 2023-03-20 19:56:20,753 | Function setup complete.


## Launch Distributed Training
Now, we're ready to launch distributed training on our self-hosted hardware! 

In [None]:
import argparse

# define basic train args and hyperparams
train_args = argparse.Namespace(cpu=False, mixed_precision='fp16')
hps = {"lr": 2e-5, "num_epochs": 3, "seed": 42, "batch_size": 16}
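The order in which these two objects are passed matters. `launch_training(training_function, *args)` collects everything after the training function into the tuple `args`, so when they are passed as `hps, train_args`, the namespace sits at index 1, which is what `args[1].mixed_precision` relies on. A small local sketch of that unpacking, using a hypothetical `show_args` function in place of the real helper (no cluster or GPU involved):

```python
import argparse


def show_args(training_function, *args):
    """Hypothetical stand-in mirroring launch_training's signature."""
    # args is a tuple of everything passed after training_function.
    config, namespace = args
    return {
        "num_args": len(args),
        "mixed_precision": args[1].mixed_precision,
        "lr": config["lr"],
        "cpu": namespace.cpu,
    }


train_args = argparse.Namespace(cpu=False, mixed_precision="fp16")
hps = {"lr": 2e-5, "num_epochs": 3, "seed": 42, "batch_size": 16}

# print is only a placeholder callable here; it is never invoked.
summary = show_args(print, hps, train_args)
print(summary)
```

Swapping the two arguments would put the `hps` dict at index 1, and reading `.mixed_precision` from a dict raises an `AttributeError`.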

In [None]:
launch_training_gpu(train_function_gpu, hps, train_args, stream_logs=True)

INFO | 2023-03-20 20:11:45,415 | Running launch_training via gRPC
INFO | 2023-03-20 20:11:45,718 | Time to send message: 0.3 seconds
INFO | 2023-03-20 20:11:45,720 | Submitted remote call to cluster. Result or logs can be retrieved
 with run_key "launch_training_20230320_201145", e.g. 
`rh.cluster(name="~/rh-v100").get("launch_training_20230320_201145", stream_logs=True)` in python 
`runhouse logs "rh-v100" launch_training_20230320_201145` from the command line.
 or cancelled with 
`rh.cluster(name="~/rh-v100").cancel("launch_training_20230320_201145")` in python or 
`runhouse cancel "rh-v100" launch_training_20230320_201145` from the command line.
:task_name:launch_training
:task_name:launch_training
INFO | 2023-03-20 20:11:46,328 | Loading config from local file /home/ubuntu/runhouse/runhouse/builtins/config.json
INFO | 2023-03-20 20:11:46,328 | No auth token provided, so not using RNS API to save and load configs
Device count: 1
INFO | 2023-03-20 20:11:49,486 | Loading config from l

## Terminate Cluster

Once you are done using the cluster, you can terminate it as follows:

In [None]:
gpu.teardown()