# Training: Transformers

This tutorial demonstrates how to use Runhouse to train a model on **your own GPU**. With Runhouse, you can run local code or a training script on a remote cluster and set up the remote training environment reproducibly.

You can run this on your own cluster, or through a standard cloud account (AWS, GCP, Azure, LambdaLabs). If you do not have any compute or cloud accounts set up, we recommend creating a [LambdaLabs](https://cloud.lambdalabs.com/) account for the easiest setup path.

## Table of Contents

- Install Runhouse
- Hardware Setup
- Dataloading and Preprocessing
- Training
- Terminate Cluster

## Install Runhouse

In [None]:
!pip install "runhouse[sky]"

In [2]:
import runhouse as rh

INFO | 2023-06-08 18:12:34,980 | No auth token provided, so not using RNS API to save and load configs
INFO | 2023-06-08 18:12:36,499 | NumExpr defaulting to 2 threads.


## Hardware Setup

If you're not already familiar with setting up a Runhouse cluster, please first refer to [Cluster Setup](https://www.run.house/docs/tutorials/quick_start#cluster-setup) for a more introductory and in-depth walkthrough.

In [3]:
# Optional, to sync over any hardware credentials and configs from your Runhouse account
!runhouse login --yes

# alternatively, to set up credentials locally, run `!sky check` and follow the instructions for your cloud provider(s)
# !sky check

INFO | 2023-06-08 18:12:39,041 | No auth token provided, so not using RNS API to save and load configs
INFO | 2023-06-08 18:12:40,116 | NumExpr defaulting to 2 threads.


In [7]:
# sample on-demand cluster, launched through Runhouse/SkyPilot
gpu = rh.cluster(name='rh-a10x', instance_type='g5.2xlarge', provider='aws').up_if_not()

# or for your own dedicated cluster
# gpu = rh.cluster(
#            name="cpu-cluster",
#            ips=['<ip of the cluster>'],
#            ssh_creds={'ssh_user': '<user>', 'ssh_private_key':'<path_to_key>'},
#       )

INFO | 2023-06-08 18:30:56,926 | Attempting to load config for /carolineechen/rh-a10x from RNS.



## Dataloading and Preprocessing

Here, we briefly demonstrate how to load and preprocess a dataset remotely on the cluster.

Steps:
- Take the preprocessing code and wrap it in a function called `load_and_preprocess`.
- Create a Runhouse function and send it, along with its dependencies, to the cluster; environment setup is handled automatically.
- Call the function, which runs remotely on the cluster.

Note that all of the code inside the function runs on the GPU cluster, so there is no need to install any of its dependencies locally.

For a more in-depth walkthrough of Runhouse's function and env APIs, please refer to the [Compute API Tutorial](https://www.run.house/docs/tutorials/api_compute).

In [11]:
def load_and_preprocess():
    # Imports live inside the function so they resolve on the cluster.
    from datasets import load_dataset
    from transformers import AutoTokenizer

    dataset = load_dataset("yelp_review_full")
    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

    def tokenize_function(examples):
        # Pad or truncate every review to the model's maximum length.
        return tokenizer(examples["text"], padding="max_length", truncation=True)

    tokenized_datasets = dataset.map(tokenize_function, batched=True)

    # Keep small, reproducibly shuffled subsets of the train and test splits.
    small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
    small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
    return [small_train_dataset, small_eval_dataset]
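
The `padding="max_length", truncation=True` arguments in the tokenizer call make every example the same length. As a rough illustration of that idea only (not the tokenizer's actual implementation, which also preserves special tokens and returns an attention mask), the hypothetical helper `pad_or_truncate` below clips or right-pads a list of token IDs; the `pad_id=0` default and the sample IDs are assumptions made for the sketch.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return token_ids clipped or right-padded to exactly max_length entries."""
    clipped = token_ids[:max_length]
    return clipped + [pad_id] * (max_length - len(clipped))


# A short sequence is padded on the right up to max_length.
short_example = pad_or_truncate([101, 7592, 102], max_length=5)

# A long sequence is cut down to max_length.
long_example = pad_or_truncate([101, 1, 2, 3, 4, 5, 102], max_length=5)
```

Fixed-length rows are what allow the examples to be stacked into uniform batches during training.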

In [12]:
reqs = ["transformers", "datasets", "torch"]

load_and_preprocess = rh.function(fn=load_and_preprocess).to(gpu, env=reqs)

INFO | 2023-06-08 18:43:59,993 | Writing out function function to /content/load_and_preprocess_fn.py. Please make sure the function does not rely on any local variables, including imports (which should be moved inside the function body).
INFO | 2023-06-08 18:44:00,000 | Setting up Function on cluster.
INFO | 2023-06-08 18:44:00,478 | Connected (version 2.0, client OpenSSH_8.2p1)
INFO | 2023-06-08 18:44:00,684 | Authentication (publickey) successful!
INFO | 2023-06-08 18:44:07,003 | Installing packages on cluster rh-a10x: ['transformers', 'datasets', 'torch', 'Package: content']
INFO | 2023-06-08 18:46:10,042 | Function setup complete.
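
The log output above asks that imports be moved inside the function body. One way to build intuition for why, as a local analogy rather than a description of Runhouse internals: when a function's source is loaded into a fresh namespace, names that were imported elsewhere are no longer available. The helper `load_in_fresh_namespace` and both sample functions below are hypothetical and exist only for this sketch.

```python
import textwrap

# Source of a function that imports what it needs inside its own body.
SELF_CONTAINED_SRC = textwrap.dedent(
    """
    def mean_self_contained(values):
        import statistics  # resolved wherever the function runs
        return statistics.mean(values)
    """
)

# Source of a function that relies on an import made somewhere else.
RELIES_ON_OUTER_IMPORT_SRC = textwrap.dedent(
    """
    def mean_relies_on_outer_import(values):
        return statistics.mean(values)  # 'statistics' is never imported here
    """
)


def load_in_fresh_namespace(source, name):
    """Hypothetical helper: execute source in an empty namespace and return one function."""
    namespace = {}
    exec(source, namespace)
    return namespace[name]


good = load_in_fresh_namespace(SELF_CONTAINED_SRC, "mean_self_contained")
bad = load_in_fresh_namespace(RELIES_ON_OUTER_IMPORT_SRC, "mean_relies_on_outer_import")

good_result = good([1, 2, 3])  # works: the import travels with the function

try:
    bad([1, 2, 3])
    bad_error = None
except NameError as err:
    bad_error = err  # fails: 'statistics' does not exist in the new namespace
```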


Runhouse functions are called just like local functions (e.g. `data = load_and_preprocess()`): the code runs remotely and the result is returned to your local environment.

In this case, however, training runs on the same cluster, so there is no benefit to sending the dataset back to the local machine. Instead, call `.remote()` on the function to run it asynchronously; it returns an object reference to the dataset rather than the data itself. This reference can be passed into later functions as if it were the actual object.
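
To make the idea of an object reference concrete, here is a minimal, purely local sketch. It is a conceptual analogy, not how Runhouse is implemented, and it runs synchronously for simplicity. The names `Ref`, `call_remote`, `fetch`, and `_cluster_store` are all hypothetical.

```python
import itertools

# Hypothetical in-memory stand-in for objects that live on the cluster.
_cluster_store = {}
_counter = itertools.count()


class Ref:
    """Lightweight handle that names a stored object without holding its data."""

    def __init__(self, key):
        self.key = key


def call_remote(fn, *args):
    """Run fn, keep the result in the store, and return only a reference to it."""
    # Any reference passed as an argument is swapped for the stored object.
    resolved = [_cluster_store[a.key] if isinstance(a, Ref) else a for a in args]
    key = f"{fn.__name__}_{next(_counter)}"
    _cluster_store[key] = fn(*resolved)
    return Ref(key)


def fetch(ref):
    """Explicitly copy the referenced object back to the caller."""
    return _cluster_store[ref.key]


def make_dataset():
    return list(range(1000))


def count_rows(dataset):
    return len(dataset)


dataset_ref = call_remote(make_dataset)          # only a small handle comes back
rows_ref = call_remote(count_rows, dataset_ref)  # the handle is resolved on the "cluster" side
row_count = fetch(rows_ref)                      # pull back just the small result
```

The large dataset never leaves the store in this sketch; only the handle and the final row count are handed to the caller.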

If you'd like to save your data to file storage (e.g. `s3`, `gcs`), Runhouse also has API support for that. Please refer to the Data API Tutorial for more information.

In [19]:
datasets_ref = load_and_preprocess.remote()

INFO | 2023-06-08 18:52:55,092 | Running load_and_preprocess via HTTP
INFO | 2023-06-08 18:52:55,191 | Time to call remote function: 0.1 seconds
INFO | 2023-06-08 18:52:55,193 | Submitted remote call to cluster. Result or logs can be retrieved
 with run_key "load_and_preprocess_20230608_185255", e.g. 
`rh.cluster(name="/carolineechen/rh-a10x").get("load_and_preprocess_20230608_185255", stream_logs=True)` in python 
`runhouse logs "rh-a10x" load_and_preprocess_20230608_185255` from the command line.
 or cancelled with 
`rh.cluster(name="/carolineechen/rh-a10x").cancel("load_and_preprocess_20230608_185255")` in python or 
`runhouse cancel "rh-a10x" load_and_preprocess_20230608_185255` from the command line.


## Training

Now that we have the dataset ready, it's time to train!

The flow is similar to the one above:
- Take the training code and wrap it in a `train` function.
- Specify the function and the relevant dependencies to be synced and installed on the remote cluster.
- Call the function locally, passing in your dataset reference, and watch it train remotely.

Later on, we also demonstrate how you can run training from an existing script.

### Training from locally defined functions

In [20]:
def train(hf_datasets):
    # Imports live inside the function so they resolve on the cluster.
    import numpy as np
    import evaluate
    from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

    small_train_dataset, small_eval_dataset = hf_datasets

    # Yelp reviews carry 1-5 star ratings, hence five output labels.
    model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)

    metric = evaluate.load("accuracy")  # Requires scikit-learn

    def compute_metrics(eval_pred):
        # Convert logits to class predictions before scoring them.
        logits, labels = eval_pred
        predictions = np.argmax(logits, axis=-1)
        return metric.compute(predictions=predictions, references=labels)

    # Evaluate at the end of every epoch.
    training_args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch")

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=small_train_dataset,
        eval_dataset=small_eval_dataset,
        compute_metrics=compute_metrics,
    )

    trainer.train()
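
The `compute_metrics` function takes the argmax of each row of logits and scores the resulting predictions against the labels. A dependency-free sketch of that calculation is shown here, with made-up logits and labels; the helper names `argmax` and `accuracy` are hypothetical stand-ins for `np.argmax` and the `evaluate` accuracy metric.

```python
def argmax(scores):
    """Return the index of the largest score in a list."""
    return max(range(len(scores)), key=scores.__getitem__)


def accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class matches the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Three examples, two classes each (illustrative values only).
logits = [[0.1, 0.9], [2.0, -1.0], [0.3, 0.4]]
labels = [1, 0, 0]

# Predictions are [1, 0, 1], so two of the three labels match.
score = accuracy(logits, labels)
```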

In [21]:
extra_reqs = ['evaluate', 'scikit-learn', 'accelerate']

train = rh.function(fn=train).to(gpu, env=extra_reqs)

INFO | 2023-06-08 18:53:03,726 | Writing out function function to /content/train_fn.py. Please make sure the function does not rely on any local variables, including imports (which should be moved inside the function body).
INFO | 2023-06-08 18:53:03,730 | Setting up Function on cluster.
INFO | 2023-06-08 18:53:05,568 | Installing packages on cluster rh-a10x: ['evaluate', 'scikit-learn', 'accelerate', 'Package: content']
INFO | 2023-06-08 18:53:17,394 | Function setup complete.


To run the function, call it as you would any Python function. Pass in the dataset reference, and optionally add `stream_logs=True` to see the logs locally.

In [22]:
train(datasets_ref, stream_logs=True)

INFO | 2023-06-08 18:53:21,114 | Running train via HTTP
INFO | 2023-06-08 18:56:10,362 | Time to call remote function: 169.25 seconds
INFO | 2023-06-08 18:56:10,365 | Submitted remote call to cluster. Result or logs can be retrieved
 with run_key "train_20230608_185610", e.g. 
`rh.cluster(name="/carolineechen/rh-a10x").get("train_20230608_185610", stream_logs=True)` in python 
`runhouse logs "rh-a10x" train_20230608_185610` from the command line.
 or cancelled with 
`rh.cluster(name="/carolineechen/rh-a10x").cancel("train_20230608_185610")` in python or 
`runhouse cancel "rh-a10x" train_20230608_185610` from the command line.
:job_id:01000000
:task_name:get_fn_from_pointers
:job_id:01000000
INFO | 2023-06-08 18:56:11,007 | Loaded Runhouse config from /home/ubuntu/.rh/config.yaml
:task_name:get_fn_from_pointers
INFO | 2023-06-08 18:56:11,821 | Writing logs on cluster to /home/ubuntu/.rh/logs/train_20230608_185610
INFO | 2023-06-08 18:56:11,821 | Appending /home/ubuntu/content to sys.pat

### Training from existing script

Runhouse also makes it easy to run scripts and commands on your remote cluster, so an existing training script can be run directly on your remote compute:

- Sync over your working directory with the training script to the cluster
- Set up environment and package installations on the cluster
- Run the script with a simple command


To sync over the working directory, you can create a Runhouse folder resource and send it over to the cluster.



In [None]:
# rh.folder(path="local_folder_path", dest_path="remote_folder_path").to(gpu)

Alternatively, if the script lives inside a GitHub repo, you can clone and install the repo remotely with the GitPackage API.

In this case, let's say we're trying to access and run [examples/nlp_example.py](https://github.com/huggingface/accelerate/blob/v0.15.0/examples/nlp_example.py) from the [accelerate GitHub repo](https://github.com/huggingface/accelerate).

In [35]:
git_package = rh.git_package(git_url='https://github.com/huggingface/accelerate.git',
                            install_method='pip',
                            revision='v0.18.0')
gpu.install_packages([git_package])

INFO | 2023-06-08 19:57:11,991 | Installing packages on cluster rh-a10x: ['GitPackage: https://github.com/huggingface/accelerate.git@v0.18.0']


Next, install any other requirements the script needs.

In [None]:
reqs = ['evaluate', 'transformers', 'datasets==2.3.2', 'scipy', 'scikit-learn', 'tqdm', 'tensorboard', 'torch==1.12.0']

env = rh.env(reqs=reqs)
env.to(gpu)

# or
# gpu.install_packages(reqs)

Now that the script and its dependencies are on the cluster, we can run the script using `gpu.run([command])`.

In [None]:
gpu.run(['python accelerate/examples/nlp_example.py'])

## Terminate Cluster

To terminate the cluster when you're done with it, use either the `sky down cluster-name` CLI command or the `cluster_var.down()` Python API.

If you set up autostop for the cluster or in your configs (the default is 30 minutes), the cluster will automatically terminate after that period of inactivity.

In [37]:
# cli
!sky down rh-a10x

# python
# gpu.down()

Terminating 1 cluster: rh-a10x. Proceed? [Y/n]: y
ERROR | 2023-06-08 17:09:58,834 | Socket exception: Connection reset by peer (104)
Terminating cluster rh-a10x...done.