(train_transformers_accelerate_example)=

# Fine-tune a 🤗 Transformers model

This notebook is based on [an official 🤗 notebook - "How to fine-tune a model on text classification"](https://github.com/huggingface/notebooks/blob/6ca682955173cc9d36ffa431ddda505a048cbe80/examples/text_classification.ipynb). Its main aim is to show how to convert vanilla 🤗 Transformers code to Ray Train while changing the training logic only where necessary.

In this notebook, we will:
1. [Set up Ray](#setup)
2. [Load the dataset](#load)
3. [Preprocess the dataset with Ray Data](#preprocess)
4. [Run the training with Ray Train](#train)
5. [Optionally, share the model with the community](#share)

Uncomment and run the following line in order to install all the necessary dependencies (this notebook is being tested with `transformers==4.19.1`):

In [1]:
#! pip install "datasets" "transformers>=4.19.0" "torch>=1.10.0" "mlflow"

## Set up Ray <a name="setup"></a>

We will use `ray.init()` to initialize a local cluster. By default, this cluster consists of only the machine you are running this notebook on. You can also run this notebook on an Anyscale cluster.

In [None]:
from pprint import pprint
import ray

ray.init()

We can check which resources our cluster has. If you are running this notebook on your local machine or Google Colab, you should see the number of CPU cores and GPUs available on that machine.

In [3]:
pprint(ray.cluster_resources())

{'CPU': 64.0,
 'GPU': 4.0,
 'accelerator_type:None': 4.0,
 'memory': 274877906944.0,
 'node:10.0.19.101': 1.0,
 'node:10.0.24.75': 1.0,
 'node:10.0.42.45': 1.0,
 'node:10.0.9.214': 1.0,
 'node:__internal_head__': 1.0,
 'object_store_memory': 77016771377.0}


In this notebook, we will see how to fine-tune a [🤗 Transformers](https://github.com/huggingface/transformers) model for one of the text classification tasks of the [GLUE Benchmark](https://gluebenchmark.com/). We will run the training using Ray Train.

You can change the following two variables to control whether the training (which we will get to later) uses CPUs or GPUs, and how many workers are spawned. Each worker claims one CPU or GPU. Make sure not to request more resources than the cluster has!

By default, we will run the training with one GPU worker.

In [4]:
use_gpu = True  # set this to False to run on CPUs
num_workers = 1  # set this to number of GPUs/CPUs you want to use
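
As a quick sanity check before training, the requested workers can be compared with what the cluster reports. The following is a minimal sketch and not part of the original notebook: `check_worker_request` is a hypothetical helper, and the sample dictionary copies two entries from the `ray.cluster_resources()` output shown above so that the code runs without a cluster.

```python
def check_worker_request(resources, use_gpu, num_workers):
    """Return True if the cluster has one CPU or GPU per requested worker."""
    key = "GPU" if use_gpu else "CPU"
    available = resources.get(key, 0.0)
    return num_workers <= available


# Values copied from the cluster output shown earlier in this notebook.
sample_resources = {"CPU": 64.0, "GPU": 4.0}

print(check_worker_request(sample_resources, use_gpu=True, num_workers=1))
print(check_worker_request(sample_resources, use_gpu=True, num_workers=8))
```

On a live cluster, you could pass `ray.cluster_resources()` instead of the sample dictionary.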

## Fine-tuning a model on a text classification task

The GLUE Benchmark is a group of nine classification tasks on sentences or pairs of sentences. If you would like to learn more, refer to the [original notebook](https://github.com/huggingface/notebooks/blob/6ca682955173cc9d36ffa431ddda505a048cbe80/examples/text_classification.ipynb).

Each task is named by its acronym, with `mnli-mm` standing for the mismatched version of MNLI (so same training set as `mnli` but different validation and test sets):

In [5]:
GLUE_TASKS = ["cola", "mnli", "mnli-mm", "mrpc", "qnli", "qqp", "rte", "sst2", "stsb", "wnli"]

This notebook is built to run on any of the tasks in the list above, with any model checkpoint from the [Model Hub](https://huggingface.co/models) as long as that model has a version with a classification head. Depending on your model and the GPU you are using, you might need to adjust the batch size to avoid out-of-memory errors. Set those three parameters, then the rest of the notebook should run smoothly:

In [6]:
task = "cola"
model_checkpoint = "distilbert-base-uncased"
batch_size = 16

### Loading the dataset <a name="load"></a>

We will use the [🤗 Datasets](https://github.com/huggingface/datasets) library to download the data and get the metric we need to use for evaluation (to compare our model to the benchmark). This can be easily done with the functions `load_dataset` and `load_metric`.

Apart from `mnli-mm` being a special code, we can directly pass our task name to those functions.

As Ray doesn't provide an integration for loading 🤗 Datasets from the Hub yet, we will simply run the normal 🤗 Datasets code to load the dataset from the Hub.

In [7]:
from datasets import load_dataset

actual_task = "mnli" if task == "mnli-mm" else task
datasets = load_dataset("glue", actual_task)

Reusing dataset glue (/home/ray/.cache/huggingface/datasets/glue/cola/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)


  0%|          | 0/3 [00:00<?, ?it/s]

The `datasets` object itself is a [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict), which contains one key each for the training, validation, and test sets (with more keys for the mismatched validation and test sets in the special case of `mnli`).

We will also need the metric. To avoid serialization errors, we will load the metric inside the training workers later. For now, we only define the function that loads it.

In [8]:
from datasets import load_metric

def load_metric_fn():
    return load_metric('glue', actual_task)

The metric is an instance of [`datasets.Metric`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Metric).
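
For CoLA, the score reported later in this notebook is `matthews_correlation`. To make that number easier to interpret, here is a minimal pure-Python sketch of the Matthews correlation coefficient for binary labels. It is an illustration only, not the code that 🤗 Datasets runs, and the function name is chosen for this example.

```python
import math


def matthews_corrcoef(predictions, references):
    """Matthews correlation for binary labels (0 or 1)."""
    pairs = list(zip(predictions, references))
    tp = sum(1 for p, r in pairs if p == 1 and r == 1)
    tn = sum(1 for p, r in pairs if p == 0 and r == 0)
    fp = sum(1 for p, r in pairs if p == 1 and r == 0)
    fn = sum(1 for p, r in pairs if p == 0 and r == 1)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Convention used here: undefined cases are reported as 0.0.
        return 0.0
    return (tp * tn - fp * fn) / denom


print(matthews_corrcoef([1, 0, 1, 0], [1, 0, 1, 0]))  # perfect agreement
print(matthews_corrcoef([1, 1, 0, 0], [1, 0, 0, 1]))  # no better than chance
```

A value of 1 means perfect agreement, 0 means no better than chance, and -1 means total disagreement.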

### Preprocessing the data with Ray Data <a name="preprocess"></a>

Before we can feed those texts to our model, we need to preprocess them. This is done by a 🤗 Transformers `Tokenizer`, which will (as the name indicates) tokenize the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary) and put them in the format the model expects, as well as generate the other inputs that the model requires.

To do all of this, we instantiate our tokenizer with the `AutoTokenizer.from_pretrained` method, which will ensure that:

- we get a tokenizer that corresponds to the model architecture we want to use,
- we download the vocabulary used when pretraining this specific checkpoint.

In [9]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

We pass along `use_fast=True` to the call above to use one of the fast tokenizers (backed by Rust) from the 🤗 Tokenizers library. Those fast tokenizers are available for almost all models, but if you get an error with the previous call, remove that argument.

To preprocess our dataset, we need the names of the columns containing the sentence(s). The following dictionary maps each task to its column names:

In [10]:
task_to_keys = {
    "cola": ("sentence", None),
    "mnli": ("premise", "hypothesis"),
    "mnli-mm": ("premise", "hypothesis"),
    "mrpc": ("sentence1", "sentence2"),
    "qnli": ("question", "sentence"),
    "qqp": ("question1", "question2"),
    "rte": ("sentence1", "sentence2"),
    "sst2": ("sentence", None),
    "stsb": ("sentence1", "sentence2"),
    "wnli": ("sentence1", "sentence2"),
}

Instead of using 🤗 Dataset objects directly, we will convert them to [Ray Data](data). Both are backed by Arrow tables, so the conversion is straightforward. We will use the built-in {meth}`~ray.data.from_huggingface` function.

In [11]:
import ray.data

ray_datasets = {
    "train": ray.data.from_huggingface(datasets["train"]),
    "validation": ray.data.from_huggingface(datasets["validation"]),
    "test": ray.data.from_huggingface(datasets["test"]),
}
ray_datasets

{'train': MaterializedDataset(
    num_blocks=1,
    num_rows=8551,
    schema={sentence: string, label: int64, idx: int32}
 ),
 'validation': MaterializedDataset(
    num_blocks=1,
    num_rows=1043,
    schema={sentence: string, label: int64, idx: int32}
 ),
 'test': MaterializedDataset(
    num_blocks=1,
    num_rows=1063,
    schema={sentence: string, label: int64, idx: int32}
 )}

We can then write the function that will preprocess our samples. We just feed them to the `tokenizer` with the argument `truncation=True`, which ensures that any input longer than the selected model can handle is truncated to the maximum length the model accepts. We also pass `padding="max_length"` so that every sequence in a batch has the same length.

We use `map_batches`, which applies the function to batches of each dataset in a distributed fashion.

In [12]:
import numpy as np
from typing import Dict


def preprocess_function(examples: Dict[str, np.ndarray]):
    # If the batch has only one column, we are running inference,
    # so there is no need to tokenize.
    if len(examples) == 1:
        return examples
    
    return_dict = {}
    return_dict["labels"] = examples["label"]

    sentence1_key, sentence2_key = task_to_keys[task]
    if sentence2_key is None:
        ret = tokenizer(list(examples[sentence1_key]), truncation=True, padding="max_length", return_tensors="np")
    else:
        ret = tokenizer(list(examples[sentence1_key]), list(examples[sentence2_key]), truncation=True, padding="max_length", return_tensors="np")
    return_dict.update(ret)
    return return_dict

for split, dataset in ray_datasets.items():
    ray_datasets[split] = dataset.map_batches(preprocess_function, batch_format="numpy")

In [13]:
for batch in ray_datasets["train"].iter_batches(batch_size=3):
    print(batch)
    break

2023-09-05 16:09:50,547	INFO streaming_executor.py:93 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[MapBatches(preprocess_function)]
2023-09-05 16:09:50,548	INFO streaming_executor.py:94 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
2023-09-05 16:09:50,549	INFO streaming_executor.py:96 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`


Running 0:   0%|          | 0/1 [00:00<?, ?it/s]

{'labels': array([1, 1, 1]), 'input_ids': array([[ 101, 2256, 2814, ...,    0,    0,    0],
       [ 101, 2028, 2062, ...,    0,    0,    0],
       [ 101, 2028, 2062, ...,    0,    0,    0]]), 'attention_mask': array([[1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0]])}
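
The trailing zeros in `input_ids` and `attention_mask` above come from `padding="max_length"`, while `truncation=True` cuts off anything beyond the maximum length. The following toy sketch mimics both effects on a plain list of token IDs. It is a simplification: `pad_and_truncate` is a hypothetical helper, a padding ID of 0 is assumed, and a real tokenizer also handles special tokens when truncating.

```python
def pad_and_truncate(token_ids, max_length, pad_id=0):
    """Mimic truncation=True with padding="max_length" for one sequence."""
    truncated = token_ids[:max_length]
    num_padding = max_length - len(truncated)
    return {
        "input_ids": truncated + [pad_id] * num_padding,
        # 1 marks real tokens, 0 marks padding the model should ignore.
        "attention_mask": [1] * len(truncated) + [0] * num_padding,
    }


# Illustrative token IDs; max_length is kept small for readability.
short = pad_and_truncate([101, 2256, 2814, 102], max_length=8)
long = pad_and_truncate(list(range(1, 13)), max_length=8)

print(short)
print(long)
```

Padding every example to a fixed length keeps the arrays rectangular, which is what allows them to be returned as NumPy batches.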


### Fine-tuning the model with Ray Train <a name="train"></a>

Now that our data is ready, we can download the pretrained model and fine-tune it.

Since all our tasks are about sentence classification, we use the `AutoModelForSequenceClassification` class.

We will not go into details about each specific component of the training (see the [original notebook](https://github.com/huggingface/notebooks/blob/6ca682955173cc9d36ffa431ddda505a048cbe80/examples/text_classification.ipynb) for that). The tokenizer is the same one we used to encode the dataset before.

The main difference when using Ray Train is that we need to define our training logic as a function (`train_func`). That function is passed to the {class}`~ray.train.torch.TorchTrainer` and runs on every Ray worker. The training then proceeds using PyTorch DDP.

Make sure that you initialize the model, metric, and tokenizer inside that function. Otherwise, you may run into serialization errors.
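
To see why this matters, recall that the training function and everything it captures has to be serialized before it can be sent to the workers. The sketch below uses the standard `pickle` module as a stand-in for Ray's serialization, and a dictionary holding a thread lock as a hypothetical stand-in for an object that cannot be serialized. All names in it are illustrative.

```python
import pickle
import threading


def can_pickle(obj):
    """Return True if obj survives standard pickling."""
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, pickle.PicklingError):
        return False


# Hypothetical stand-in for a metric or model that holds an unpicklable handle.
eager_metric = {"name": "glue/cola", "lock": threading.Lock()}

# Plain settings that a worker can use to build the object itself.
settings = {"metric": "glue", "task": "cola"}


def build_on_worker(settings):
    """Create the unpicklable object where it is used, as train_func does."""
    name = f"{settings['metric']}/{settings['task']}"
    return {"name": name, "lock": threading.Lock()}


print(can_pickle(eager_metric))  # the driver-side object cannot be shipped
print(can_pickle(settings))      # the plain settings can
print(build_on_worker(settings)["name"])
```

This is the same pattern as `load_metric_fn`: only lightweight values cross the process boundary, and the heavy object is created on the worker.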

Furthermore, `push_to_hub=True` is not yet supported. Ray will, however, checkpoint the model at every epoch, allowing you to push it to the Hub manually. We will do that after the training.

In [14]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
import numpy as np
import torch
import ray.train

from ray.train.huggingface.transformers import prepare_trainer, RayTrainReportCallback

num_labels = 3 if task.startswith("mnli") else 1 if task=="stsb" else 2
metric_name = "pearson" if task == "stsb" else "matthews_correlation" if task == "cola" else "accuracy"
model_name = model_checkpoint.split("/")[-1]
validation_key = "validation_mismatched" if task == "mnli-mm" else "validation_matched" if task == "mnli" else "validation"
name = f"{model_name}-finetuned-{task}"

# Calculate the maximum steps per epoch based on the number of rows in the training dataset.
# Make sure to scale by the total number of training workers and the per device batch size.
max_steps_per_epoch = ray_datasets["train"].count() // batch_size

def train_func(config):
    print(f"Is CUDA available: {torch.cuda.is_available()}")
    metric = load_metric_fn()
    tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)
    model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)

    train_ds = ray.train.get_dataset_shard("train")
    eval_ds = ray.train.get_dataset_shard("eval")
    
    batch_size_per_worker = batch_size // num_workers

    train_ds_iterable = train_ds.iter_torch_batches(batch_size=batch_size_per_worker)
    eval_ds_iterable = eval_ds.iter_torch_batches(batch_size=batch_size_per_worker)

    print("max_steps_per_epoch: ", max_steps_per_epoch)
    
    args = TrainingArguments(
        name,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        logging_strategy="epoch",
        per_device_train_batch_size=batch_size_per_worker,
        per_device_eval_batch_size=batch_size_per_worker,
        learning_rate=config.get("learning_rate", 2e-5),
        num_train_epochs=config.get("epochs", 2),
        weight_decay=config.get("weight_decay", 0.01),
        push_to_hub=False,
        max_steps=max_steps_per_epoch * config.get("epochs", 2),
        # disable_tqdm=True,  # declutter the output a little
        no_cuda=not use_gpu,  # you need to explicitly set no_cuda if you want CPUs
        report_to="none",
    )

    def compute_metrics(eval_pred):
        predictions, labels = eval_pred
        if task != "stsb":
            predictions = np.argmax(predictions, axis=1)
        else:
            predictions = predictions[:, 0]
        return metric.compute(predictions=predictions, references=labels)

    trainer = Trainer(
        model,
        args,
        train_dataset=train_ds_iterable,
        eval_dataset=eval_ds_iterable,
        tokenizer=tokenizer,
        compute_metrics=compute_metrics
    )

    trainer.add_callback(RayTrainReportCallback())

    trainer = prepare_trainer(trainer)

    print("Starting training")
    trainer.train()
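
The step arithmetic in the cell above can be checked by hand. The sketch below repeats it with the values used in this notebook: 8551 training rows, a batch size of 16, one worker, and two epochs. `training_budget` is a hypothetical helper written for this illustration.

```python
def training_budget(num_rows, batch_size, num_workers, epochs):
    """Reproduce the batch and step arithmetic used for TrainingArguments."""
    # The global batch is split evenly across the workers.
    per_worker_batch = batch_size // num_workers
    # One optimizer step consumes one global batch.
    steps_per_epoch = num_rows // batch_size
    return {
        "per_worker_batch": per_worker_batch,
        "steps_per_epoch": steps_per_epoch,
        "max_steps": steps_per_epoch * epochs,
    }


budget = training_budget(num_rows=8551, batch_size=16, num_workers=1, epochs=2)
print(budget)
```

These values agree with the `max_steps_per_epoch:  534` and `Total optimization steps = 1068` lines in the training log of this run.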


With our `train_func` complete, we can now instantiate the {class}`~ray.train.torch.TorchTrainer`. Aside from the function, we set the `scaling_config`, which controls the number of workers and the resources used, and the `datasets` we will use for training and evaluation.

In [15]:
from ray.train.torch import TorchTrainer
from ray.train import RunConfig, ScalingConfig, CheckpointConfig

trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=use_gpu),
    datasets={
        "train": ray_datasets["train"],
        "eval": ray_datasets["validation"],
    },
    run_config=RunConfig(
        storage_path="/mnt/cluster_storage/ray_results",
        checkpoint_config=CheckpointConfig(
            num_to_keep=1,
            checkpoint_score_attribute="eval_loss",
            checkpoint_score_order="min",
        ),
    ),
)
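
The `CheckpointConfig` above ranks checkpoints by `eval_loss` and prefers lower values. The following sketch illustrates only that ranking idea. It is not Ray's implementation, whose retention rules may differ (for example, in how the most recent checkpoint is treated), and the checkpoint names and scores are made up.

```python
def best_checkpoints(scored, num_to_keep, order="min"):
    """Return the names of the num_to_keep best checkpoints by score."""
    ranked = sorted(scored, key=lambda item: item[1], reverse=(order == "max"))
    return [name for name, _ in ranked[:num_to_keep]]


# Hypothetical checkpoint names with made-up eval_loss values.
history = [("ckpt_a", 0.52), ("ckpt_b", 0.56), ("ckpt_c", 0.49)]

print(best_checkpoints(history, num_to_keep=1))
print(best_checkpoints(history, num_to_keep=2, order="max"))
```

With `order="min"`, the checkpoint with the lowest score ranks first, which is the appropriate choice for a loss.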


Finally, we call the `fit` method to start training with Ray Train. We will save the `Result` object to a variable so we can access metrics and checkpoints.

In [16]:
result = trainer.fit()

| Field | Value |
|---|---|
| Current time | 2023-09-05 16:24:00 |
| Running for | 00:13:59.23 |
| Memory | 7.8/62.0 GiB |

| Trial name | status | loc | iter | total time (s) | loss | learning_rate | epoch |
|---|---|---|---|---|---|---|---|
| TorchTrainer_56f64_00000 | TERMINATED | 10.0.24.75:15827 | 2 | 829.904 | 0.3826 | 0 | 1.5 |

(RayTrainWorker pid=15887, ip=10.0.24.75) Is CUDA available: True
(RayTrainWorker pid=15887, ip=10.0.24.75) max_steps_per_epoch:  534

(RayTrainWorker pid=15887, ip=10.0.24.75) Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.weight', 'vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_projector.bias']
(RayTrainWorker pid=15887, ip=10.0.24.75) - This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
(RayTrainWorker pid=15887, ip=10.0.24.75) - This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

(RayTrainWorker pid=15887, ip=10.0.24.75) Starting training

(RayTrainWorker pid=15887, ip=10.0.24.75) max_steps is given, it will override any value given in num_train_epochs
(RayTrainWorker pid=15887, ip=10.0.24.75) ***** Running training *****
(RayTrainWorker pid=15887, ip=10.0.24.75)   Num examples = 17088
(RayTrainWorker pid=15887, ip=10.0.24.75)   Num Epochs = 9223372036854775807
(RayTrainWorker pid=15887, ip=10.0.24.75)   Instantaneous batch size per device = 16
(RayTrainWorker pid=15887, ip=10.0.24.75)   Total train batch size (w. parallel, distributed & accumulation) = 16
(RayTrainWorker pid=15887, ip=10.0.24.75)   Gradient Accumulation steps = 1
(RayTrainWorker pid=15887, ip=10.0.24.75)   Total optimization steps = 1068


(SplitCoordinator pid=15937, ip=10.0.24.75) Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[MapBatches(preprocess_function)] -> OutputSplitter[split(1, equal=True)]
(SplitCoordinator pid=15937, ip=10.0.24.75) Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=['5aee0b1a098d74ee6df8a92da8e188e116fecac5160742dadbb7507b'], preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
(SplitCoordinator pid=15937, ip=10.0.24.75) Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`

(RayTrainWorker pid=15887, ip=10.0.24.75) {'loss': 0.5402, 'learning_rate': 9.9812734082397e-06, 'epoch': 0.5}


***** Running Evaluation *****
(RayTrainWorker pid=15887, ip=10.0.24.75)   Num examples: Unknown
(RayTrainWorker pid=15887, ip=10.0.24.75)   Batch size = 16
(SplitCoordinator pid=15938, ip=10.0.24.75) Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[MapBatches(preprocess_function)] -> OutputSplitter[split(1, equal=True)]
(SplitCoordinator pid=15938, ip=10.0.24.75) Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=['5aee0b1a098d74ee6df8a92da8e188e116fecac5160742dadbb7507b'], preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
(SplitCoordinator pid=15938, ip=10.0.24.75) Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`

(RayTrainWorker pid=15887, ip=10.0.24.75) {'eval_loss': 0.5167202353477478, 'eval_matthews_correlation': 0.40649809839684037, 'eval_runtime': 17.7492, 'eval_samples_per_second': 58.763, 'eval_steps_per_second': 3.718, 'epoch': 0.5}

Saving model checkpoint to distilbert-base-uncased-finetuned-cola/checkpoint-535
(RayTrainWorker pid=15887, ip=10.0.24.75) Configuration saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/config.json
(RayTrainWorker pid=15887, ip=10.0.24.75) Model weights saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/pytorch_model.bin
(RayTrainWorker pid=15887, ip=10.0.24.75) tokenizer config file saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/tokenizer_config.json
(RayTrainWorker pid=15887, ip=10.0.24.75) Special tokens file saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/special_tokens_map.json
(RayTrainWorker pid=15887, ip=10.0.24.75) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/cluster_storage/ray_results/TorchTrainer_2023-09-05_16-10-01/TorchTrainer_56f64_00000_0_2023-09-05_16-10-01/checkpoint_00

(RayTrainWorker pid=15887, ip=10.0.24.75) {'loss': 0.3826, 'learning_rate': 0.0, 'epoch': 1.5}


***** Running Evaluation *****
(RayTrainWorker pid=15887, ip=10.0.24.75)   Num examples: Unknown
(RayTrainWorker pid=15887, ip=10.0.24.75)   Batch size = 16
(SplitCoordinator pid=15938, ip=10.0.24.75) Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[MapBatches(preprocess_function)] -> OutputSplitter[split(1, equal=True)]
(SplitCoordinator pid=15938, ip=10.0.24.75) Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=['5aee0b1a098d74ee6df8a92da8e188e116fecac5160742dadbb7507b'], preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
(SplitCoordinator pid=15938, ip=10.0.24.75) Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`

(RayTrainWorker pid=15887, ip=10.0.24.75) {'eval_loss': 0.5601133108139038, 'eval_matthews_correlation': 0.4373748188096333, 'eval_runtime': 15.9315, 'eval_samples_per_second': 65.468, 'eval_steps_per_second': 4.143, 'epoch': 1.5}

Saving model checkpoint to distilbert-base-uncased-finetuned-cola/checkpoint-1068
(RayTrainWorker pid=15887, ip=10.0.24.75) Configuration saved in distilbert-base-uncased-finetuned-cola/checkpoint-1068/config.json
(RayTrainWorker pid=15887, ip=10.0.24.75) Model weights saved in distilbert-base-uncased-finetuned-cola/checkpoint-1068/pytorch_model.bin
(RayTrainWorker pid=15887, ip=10.0.24.75) tokenizer config file saved in distilbert-base-uncased-finetuned-cola/checkpoint-1068/tokenizer_config.json
(RayTrainWorker pid=15887, ip=10.0.24.75) Special tokens file saved in distilbert-base-uncased-finetuned-cola/checkpoint-1068/special_tokens_map.json

(RayTrainWorker pid=15887, ip=10.0.24.75) {'train_runtime': 818.7099, 'train_samples_per_second': 20.872, 'train_steps_per_second': 1.304, 'train_loss': 0.46155189485585646, 'epoch': 1.5}

(RayTrainWorker pid=15887, ip=10.0.24.75) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/cluster_storage/ray_results/TorchTrainer_2023-09-05_16-10-01/TorchTrainer_56f64_00000_0_2023-09-05_16-10-01/checkpoint_000001)
(RayTrainWorker pid=15887, ip=10.0.24.75) Training completed. Do not forget to share your model on huggingface.co/models =)
2023-09-05 16:24:00,650	INFO tune.py:1154 -- Total run time: 839.61 seconds (839.19 seconds for the tuning loop).


You can use the returned `Result` object to access metrics and the Ray Train `Checkpoint` associated with the last iteration.

In [17]:
result

Result(
  metrics={'loss': 0.3826, 'learning_rate': 0.0, 'epoch': 1.5, 'step': 1068, 'eval_loss': 0.5601133108139038, 'eval_matthews_correlation': 0.4373748188096333, 'eval_runtime': 15.9315, 'eval_samples_per_second': 65.468, 'eval_steps_per_second': 4.143},
  path='/mnt/cluster_storage/ray_results/TorchTrainer_2023-09-05_16-10-01/TorchTrainer_56f64_00000_0_2023-09-05_16-10-01',
  filesystem='local',
  checkpoint=Checkpoint(filesystem=local, path=/mnt/cluster_storage/ray_results/TorchTrainer_2023-09-05_16-10-01/TorchTrainer_56f64_00000_0_2023-09-05_16-10-01/checkpoint_000001)
)

### Tune hyperparameters with Ray Tune <a name="predict"></a>

If we would like to tune any hyperparameters of the model, we can do so by simply passing our `TorchTrainer` into a `Tuner` and defining the search space.
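
A grid search produces one trial per combination of the listed values. The sketch below expands the search space used in this section by hand to show which configurations result. It is an illustration only: `expand_grid` is a hypothetical helper, not part of Ray Tune.

```python
import itertools


def expand_grid(grid, fixed):
    """Yield one config per combination of the grid values."""
    keys = sorted(grid)
    for values in itertools.product(*(grid[key] for key in keys)):
        config = dict(fixed)
        config.update(zip(keys, values))
        yield config


configs = list(
    expand_grid(
        grid={"learning_rate": [2e-5, 2e-4, 2e-3, 2e-2]},
        fixed={"epochs": 4},
    )
)

for config in configs:
    print(config)
```

With one grid dimension of four values, this yields four configurations, one for each trial in the results table of this section.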

We can also take advantage of the advanced search algorithms and schedulers provided by Ray Tune. In this example, we will use an `ASHAScheduler` to aggressively terminate underperforming trials.

In [25]:
from ray import tune
from ray.tune import Tuner
from ray.tune.schedulers.async_hyperband import ASHAScheduler

# RunConfig and CheckpointConfig are assumed to be imported in an earlier cell.

tune_epochs = 4
tuner = Tuner(
    trainer,
    param_space={
        "train_loop_config": {
            # Grid search launches one trial per learning rate.
            "learning_rate": tune.grid_search([2e-5, 2e-4, 2e-3, 2e-2]),
            "epochs": tune_epochs,
        }
    },
    tune_config=tune.TuneConfig(
        metric="eval_loss",
        mode="min",
        num_samples=1,
        # ASHA stops trials whose eval_loss lags behind; no trial runs past max_t iterations.
        scheduler=ASHAScheduler(
            max_t=tune_epochs,
        ),
    ),
    run_config=RunConfig(
        name="tune_transformers",
        storage_path="/mnt/cluster_storage/ray_results",
        # Keep only the checkpoint with the lowest eval_loss for each trial.
        checkpoint_config=CheckpointConfig(
            num_to_keep=1,
            checkpoint_score_attribute="eval_loss",
            checkpoint_score_order="min",
        ),
    ),
)

2023-09-05 16:35:56,242	INFO tuner_internal.py:508 -- A `RunConfig` was passed to both the `Tuner` and the `TorchTrainer`. The run config passed to the `Tuner` is the one that will be used.
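The `param_space` above holds a single grid over four learning rates, and `num_samples=1` repeats that grid once, so the tuner launches four trials. The sketch below shows how such a grid expands into concrete configurations. `expand_grid` is a hypothetical helper written for illustration, not a Ray Tune function.

```python
from itertools import product


def expand_grid(grid, num_samples=1):
    """Enumerate the Cartesian product of all candidate values, num_samples times."""
    keys = list(grid)
    configs = []
    for _ in range(num_samples):
        for values in product(*(grid[key] for key in keys)):
            configs.append(dict(zip(keys, values)))
    return configs


# Each key maps to its list of candidate values; a fixed value is a one-element list.
configs = expand_grid({"learning_rate": [2e-5, 2e-4, 2e-3, 2e-2], "epochs": [4]})
print(len(configs))  # 4
for config in configs:
    print(config)
```

Adding a second grid dimension multiplies the trial count, so the number of trials grows quickly as more hyperparameters are searched this way.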


In [26]:
tune_results = tuner.fit()

Current time: 2023-09-05 17:02:17
Running for: 00:26:13.42
Memory: 8.7/62.0 GiB

| Trial name | status | loc | train_loop_config/learning_rate | iter | total time (s) | loss | learning_rate | epoch |
|---|---|---|---|---|---|---|---|---|
| TorchTrainer_fac06_00000 | TERMINATED | 10.0.19.101:55234 | 2e-05 | 4 | 1537.99 | 0.1941 | 0.0 | 3.25 |
| TorchTrainer_fac06_00001 | TERMINATED | 10.0.9.214:21972 | 0.0002 | 1 | 400.381 | 0.6217 | 0.000149906 | 0.25 |
| TorchTrainer_fac06_00002 | TERMINATED | 10.0.42.45:21888 | 0.002 | 1 | 391.176 | 0.6488 | 0.00149906 | 0.25 |
| TorchTrainer_fac06_00003 | TERMINATED | 10.0.24.75:22226 | 0.02 | 1 | 402.832 | 1.165 | 0.0149906 | 0.25 |


[2m[36m(TrainTrainable pid=21888, ip=10.0.42.45)[0m 2023-09-05 16:36:07.866977: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
[2m[36m(TrainTrainable pid=21888, ip=10.0.42.45)[0m To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[2m[36m(TrainTrainable pid=21888, ip=10.0.42.45)[0m 2023-09-05 16:36:08.011580: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
[2m[36m(TrainTrainable pid=21888, ip=10.0.42.45)[0m 2023-09-05 16:36:08.735652: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dyna

[2m[36m(RayTrainWorker pid=21951, ip=10.0.42.45)[0m Is CUDA available: True


[2m[36m(RayTrainWorker pid=21951, ip=10.0.42.45)[0m Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_projector.weight', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_projector.bias']
[2m[36m(RayTrainWorker pid=21951, ip=10.0.42.45)[0m - This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
[2m[36m(RayTrainWorker pid=21951, ip=10.0.42.45)[0m - This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
[2m[36m(RayTrainWorker pid=21951, ip=10.0.

[2m[36m(RayTrainWorker pid=21951, ip=10.0.42.45)[0m max_steps_per_epoch:  534


[2m[36m(RayTrainWorker pid=21951, ip=10.0.42.45)[0m max_steps is given, it will override any value given in num_train_epochs


[2m[36m(RayTrainWorker pid=21951, ip=10.0.42.45)[0m Starting training


[2m[36m(RayTrainWorker pid=21951, ip=10.0.42.45)[0m ***** Running training *****
[2m[36m(RayTrainWorker pid=21951, ip=10.0.42.45)[0m   Num examples = 34176
[2m[36m(RayTrainWorker pid=21951, ip=10.0.42.45)[0m   Num Epochs = 9223372036854775807
[2m[36m(RayTrainWorker pid=21951, ip=10.0.42.45)[0m   Instantaneous batch size per device = 16
[2m[36m(RayTrainWorker pid=21951, ip=10.0.42.45)[0m   Total train batch size (w. parallel, distributed & accumulation) = 16
[2m[36m(RayTrainWorker pid=21951, ip=10.0.42.45)[0m   Gradient Accumulation steps = 1
[2m[36m(RayTrainWorker pid=21951, ip=10.0.42.45)[0m   Total optimization steps = 2136
  0%|          | 0/2136 [00:00<?, ?it/s]
(SplitCoordinator pid=21995, ip=10.0.42.45) Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[MapBatches(preprocess_function)] -> OutputSplitter[split(1, equal=True)]
(SplitCoordinator pid=21995, ip=10.0.42.45) Execution config: ExecutionOptions(resourc

(pid=21995, ip=10.0.42.45) Running 0:   0%|          | 0/1 [00:00<?, ?it/s]

(RayTrainWorker pid=55340) Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_transform.bias', 'vocab_projector.weight', 'vocab_projector.bias']
(RayTrainWorker pid=55340) Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.weight', 'classifier.bias', 'pre_classifier.bias']

(pid=55407) Running 0:   0%|          | 0/1 [00:00<?, ?it/s]

(RayTrainWorker pid=22035, ip=10.0.9.214) Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_projector.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias']
(RayTrainWorker pid=22291, ip=10.0.24.75) Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_layer_norm.weight']
(RayTrainWorker pid=22291, ip=10.0.24.75) Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.bias', 'classifier.weight']


(pid=22082, ip=10.0.9.214) Running 0:   0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 1/2136 [00:01<1:03:45,  1.79s/it]

(pid=22338, ip=10.0.24.75) Running 0:   0%|          | 0/1 [00:00<?, ?it/s]

(RayTrainWorker pid=22291, ip=10.0.24.75) Is CUDA available: True [repeated 3x across cluster]

(RayTrainWorker pid=22291, ip=10.0.24.75) - This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model). [repeated 3x across cluster]
(RayTrainWorker pid=22291, ip=10.0.24.75) - This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model). [repeated 3x across cluster]
(RayTrainWorker pid=22291, ip=10.0.24.75) You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference. [repeated 3x across cluster]
(RayTrainWorker pid=22291, ip=10.0.24.75) max_steps is given, it will override any value giv

[2m[1m[36m(autoscaler +26m55s)[0m [workspace snapshot] New snapshot created successfully (size: 351.82 KB).


  1%|▏         | 32/2136 [00:21<22:31,  1.56it/s] [repeated 32x across cluster]
  2%|▏         | 40/2136 [00:26<22:21,  1.56it/s] [repeated 32x across cluster]
  2%|▏         | 48/2136 [00:31<22:16,  1.56it/s] [repeated 32x across cluster]
  3%|▎         | 56/2136 [00:37<22:17,  1.56it/s] [repeated 32x across cluster]
  3%|▎         | 62/2136 [00:41<22:25,  1.54it/s] [repeated 30x across cluster]
  3%|▎         | 74/2136 [00:47<21:50,  1.57it/s] [repeated 32x across cluster]
  4%|▍         | 82/2136 [00:52<21:48,  1.57it/s] [repeated 32x across cluster]
  4%|▍         | 90/2136 [00:57<21:48,  1.56it/s] [repeated 32x across cluster]
  5%|▍         | 98/2136 [01:03<21:42,  1.56it/s] [repeated 31x across cluster]
  5%|▍         | 106/2136 [01:08<21:37,  1.56it/s] [repeated 31x across cluster]
  5%|▌         | 111/2136 [01:12<21:53,  1.54it/s] [repeated 31x across cluster]
  5%|▌         | 11

[2m[1m[36m(autoscaler +31m55s)[0m [workspace snapshot] New snapshot created successfully (size: 301.79 KB).


 23%|██▎       | 485/2136 [05:20<17:49,  1.54it/s] [repeated 29x across cluster]
 23%|██▎       | 487/2136 [05:25<19:23,  1.42it/s] [repeated 30x across cluster]
 23%|██▎       | 501/2136 [05:30<17:38,  1.54it/s] [repeated 31x across cluster]
 24%|██▍       | 509/2136 [05:35<17:33,  1.54it/s] [repeated 30x across cluster]
 24%|██▍       | 509/2136 [05:41<19:01,  1.43it/s] [repeated 29x across cluster]
 24%|██▎       | 503/2136 [05:45<18:52,  1.44it/s] [repeated 31x across cluster]
 25%|██▍       | 533/2136 [05:51<17:20,  1.54it/s] [repeated 30x across cluster]

(RayTrainWorker pid=55340) {'loss': 0.5428, 'learning_rate': 1.4990636704119851e-05, 'epoch': 0.25}
(RayTrainWorker pid=22291, ip=10.0.24.75) max_steps_per_epoch:  534 [repeated 3x across cluster]
(RayTrainWorker pid=22291, ip=10.0.24.75) Starting training [repeated 3x across cluster]


 25%|██▌       | 535/2136 [05:52<14:34,  1.83it/s]
***** Running Evaluation *****
(RayTrainWorker pid=55340)   Num examples: Unknown
(RayTrainWorker pid=55340)   Batch size = 16

(pid=55408) Running 0:   0%|          | 0/1 [00:00<?, ?it/s]

(SplitCoordinator pid=55408) Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[MapBatches(preprocess_function)] -> OutputSplitter[split(1, equal=True)]
(SplitCoordinator pid=55408) Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=['d18a14d46721e86f86679bc8635753b27b5c4e17470461fc2a2fc741'], preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
(SplitCoordinator pid=55408) Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`
 24%|██▍       | 514/2136 [05:55<19:03,  1.42it/s] [repeated 25x across cluster]

(RayTrainWorker pid=21951, ip=10.0.42.45) {'loss': 0.6488, 'learning_rate': 0.0014990636704119852, 'epoch': 0.25}


 25%|██▌       | 535/2136 [05:59<15:45,  1.69it/s]
***** Running Evaluation *****
(RayTrainWorker pid=21951, ip=10.0.42.45)   Num examples: Unknown
(RayTrainWorker pid=21951, ip=10.0.42.45)   Batch size = 16
(SplitCoordinator pid=21996, ip=10.0.42.45) Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[MapBatches(preprocess_function)] -> OutputSplitter[split(1, equal=True)]
(SplitCoordinator pid=21996, ip=10.0.42.45) Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=['28a51d1ed2d5d043099f7db6967c0090ba951b26d14a1a8b74a07da0'], preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
(SplitCoordinator pid=21996, ip=10.0.42.45) Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`


(pid=21996, ip=10.0.42.45) Running 0:   0%|          | 0/1 [00:00<?, ?it/s]

 25%|██▍       | 526/2136 [06:01<18:38,  1.44it/s] [repeated 18x across cluster]
 25%|██▍       | 529/2136 [06:05<18:53,  1.42it/s] [repeated 15x across cluster]
 25%|██▌       | 535/2136 [06:07<14:34,  1.83it/s]
Saving model checkpoint to distilbert-base-uncased-finetuned-cola/checkpoint-535
(RayTrainWorker pid=55340) Configuration saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/config.json

(RayTrainWorker pid=55340) {'eval_loss': 0.5093920230865479, 'eval_matthews_correlation': 0.3945310342011491, 'eval_runtime': 15.1752, 'eval_samples_per_second': 68.731, 'eval_steps_per_second': 4.349, 'epoch': 0.25}
(RayTrainWorker pid=22035, ip=10.0.9.214) {'loss': 0.6217, 'learning_rate': 0.0001499063670411985, 'epoch': 0.25}


 25%|██▌       | 535/2136 [06:06<15:33,  1.71it/s]
***** Running Evaluation *****
(RayTrainWorker pid=22035, ip=10.0.9.214)   Num examples: Unknown
(RayTrainWorker pid=22035, ip=10.0.9.214)   Batch size = 16
(SplitCoordinator pid=22083, ip=10.0.9.214) Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[MapBatches(preprocess_function)] -> OutputSplitter[split(1, equal=True)]
(SplitCoordinator pid=22083, ip=10.0.9.214) Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=['045a49b457285380c7d8fae0cf544c1151019793d4d27776f0ebe029'], preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
(SplitCoordinator pid=22083, ip=10.0.9.214) Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`


(pid=22083, ip=10.0.9.214) Running 0:   0%|          | 0/1 [00:00<?, ?it/s]

(RayTrainWorker pid=55340) Model weights saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/pytorch_model.bin
(RayTrainWorker pid=55340) tokenizer config file saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/tokenizer_config.json
(RayTrainWorker pid=55340) Special tokens file saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/special_tokens_map.json

(pid=22339, ip=10.0.24.75) Running 0:   0%|          | 0/1 [00:00<?, ?it/s]

(RayTrainWorker pid=55340) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/cluster_storage/ray_results/tune_transformers/TorchTrainer_fac06_00000_0_learning_rate=0.0000_2023-09-05_16-36-04/checkpoint_000000)
 25%|██▌       | 534/2136 [06:09<18:49,  1.42it/s] [repeated 6x across cluster]
 25%|██▌       | 535/2136 [06:09<15:50,  1.68it/s]
***** Running Evaluation *****
(RayTrainWorker pid=22291, ip=10.0.24.75)   Num examples: Unknown
(RayTrainWorker pid=22291, ip=10.0.24.75)   Batch size = 16
(SplitCoordinator pid=22339, ip=10.0.24.75) Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[MapBatches(preprocess_function)] -> OutputSplitter[split(1, equal=True)]
(SplitCoordinator pid=22339, ip=10.0.24.75) Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=['5aee0b1a098d74ee6df8a92da8e188e116fecac5160

(pid=55407) Running 0:   0%|          | 0/1 [00:00<?, ?it/s]

(SplitCoordinator pid=55407) Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[MapBatches(preprocess_function)] -> OutputSplitter[split(1, equal=True)]
(SplitCoordinator pid=55407) Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=['d18a14d46721e86f86679bc8635753b27b5c4e17470461fc2a2fc741'], preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
(SplitCoordinator pid=55407) Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`

(RayTrainWorker pid=22291, ip=10.0.24.75) {'loss': 1.165, 'learning_rate': 0.01499063670411985, 'epoch': 0.25}


 25%|██▌       | 535/2136 [06:15<15:45,  1.69it/s]
Saving model checkpoint to distilbert-base-uncased-finetuned-cola/checkpoint-535
(RayTrainWorker pid=21951, ip=10.0.42.45) Configuration saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/config.json

(RayTrainWorker pid=21951, ip=10.0.42.45) {'eval_loss': 0.6183559894561768, 'eval_matthews_correlation': 0.0, 'eval_runtime': 16.0909, 'eval_samples_per_second': 64.819, 'eval_steps_per_second': 4.102, 'epoch': 0.25}

(RayTrainWorker pid=21951, ip=10.0.42.45) Model weights saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/pytorch_model.bin
(RayTrainWorker pid=21951, ip=10.0.42.45) tokenizer config file saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/tokenizer_config.json
(RayTrainWorker pid=21951, ip=10.0.42.45) Special tokens file saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/special_tokens_map.json
 25%|██▌       | 542/2136 [06:18<37:20,  1.41s/it] [repeated 7x across cluster]
(RayTrainWorker pid=21951, ip=10.0.42.45) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/cluster_storage/ray_results/tune_transformers/TorchTrainer_fac06_00002_2_learning_rate=0.0020_2023-09-05_16-36-04/checkpoint_000000)
 26%|██▌       | 550/2136 [06:23<18:16,  1.45it/s] [repeated 8x across cluster]

(RayTrainWorker pid=22035, ip=10.0.9.214) {'eval_loss': 0.6180469393730164, 'eval_matthews_correlation': 0.0, 'eval_runtime': 16.3166, 'eval_samples_per_second': 63.922, 'eval_steps_per_second': 4.045, 'epoch': 0.25}


 25%|██▌       | 535/2136 [06:23<15:33,  1.71it/s]
Saving model checkpoint to distilbert-base-uncased-finetuned-cola/checkpoint-535
(RayTrainWorker pid=22035, ip=10.0.9.214) Configuration saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/config.json
(RayTrainWorker pid=22035, ip=10.0.9.214) Model weights saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/pytorch_model.bin
(RayTrainWorker pid=22035, ip=10.0.9.214) tokenizer config file saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/tokenizer_config.json
(RayTrainWorker pid=22035, ip=10.0.9.214) Special tokens file saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/special_tokens_map.json

(RayTrainWorker pid=22291, ip=10.0.24.75) {'eval_loss': 0.6182646155357361, 'eval_matthews_correlation': 0.0, 'eval_runtime': 16.2647, 'eval_samples_per_second': 64.127, 'eval_steps_per_second': 4.058, 'epoch': 0.25}


 26%|██▌       | 558/2136 [06:28<17:06,  1.54it/s] [repeated 8x across cluster]
 25%|██▌       | 535/2136 [06:25<15:50,  1.68it/s]
Saving model checkpoint to distilbert-base-uncased-finetuned-cola/checkpoint-535
(RayTrainWorker pid=22291, ip=10.0.24.75) Configuration saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/config.json
(RayTrainWorker pid=22035, ip=10.0.9.214) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/cluster_storage/ray_results/tune_transformers/TorchTrainer_fac06_00001_1_learning_rate=0.0002_2023-09-05_16-36-04/checkpoint_000000)
(RayTrainWorker pid=22291, ip=10.0.24.75) Model weights saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/pytorch_model.bin
(RayTrainWorker pid=22291, ip=10.0.24.75) tokenizer config file saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/tokenizer_config.json
(RayTrainWorker pid=22291, ip=10.0.24.75) Specia

[2m[1m[36m(autoscaler +36m56s)[0m [workspace snapshot] New snapshot created successfully (size: 327.28 KB).


 43%|████▎     | 910/2136 [10:24<14:10,  1.44it/s] [repeated 8x across cluster]
 43%|████▎     | 918/2136 [10:29<14:06,  1.44it/s] [repeated 8x across cluster]
 43%|████▎     | 926/2136 [10:35<14:01,  1.44it/s] [repeated 8x across cluster]
 44%|████▎     | 934/2136 [10:41<13:57,  1.44it/s] [repeated 8x across cluster]
 44%|████▍     | 942/2136 [10:46<13:53,  1.43it/s] [repeated 8x across cluster]
 44%|████▍     | 950/2136 [10:52<13:49,  1.43it/s] [repeated 8x across cluster]
 45%|████▍     | 958/2136 [10:57<13:43,  1.43it/s] [repeated 8x across cluster]
 45%|████▌     | 966/2136 [11:03<13:37,  1.43it/s] [repeated 8x across cluster]
 46%|████▌     | 974/2136 [11:09<13:31,  1.43it/s] [repeated 8x across cluster]
 46%|████▌     | 982/2136 [11:14<13:29,  1.43it/s] [repeated 8x across cluster]
 46%|████▋     | 990/2136 [11:20<13:20,  1.43it/s] [repeated 8x across cluster]
 47%|████▋     | 998/

(RayTrainWorker pid=55340) {'loss': 0.3729, 'learning_rate': 9.9812734082397e-06, 'epoch': 1.25}


 50%|█████     | 1070/2136 [12:15<10:20,  1.72it/s]
***** Running Evaluation *****
(RayTrainWorker pid=55340)   Num examples: Unknown
(RayTrainWorker pid=55340)   Batch size = 16
 50%|█████     | 1069/2136 [12:15<12:17,  1.45it/s] [repeated 7x across cluster]

(pid=55408) Running 0:   0%|          | 0/1 [00:00<?, ?it/s]

(SplitCoordinator pid=55408) Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[MapBatches(preprocess_function)] -> OutputSplitter[split(1, equal=True)]
(SplitCoordinator pid=55408) Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=['d18a14d46721e86f86679bc8635753b27b5c4e17470461fc2a2fc741'], preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
(SplitCoordinator pid=55408) Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`
 50%|█████     | 1070/2136 [12:31<10:20,  1.72it/s]
Saving model checkpoint to distilbert-base-uncased-finetuned-cola/checkpoint-1070
(RayTrainWorker pid=55340) Configuration saved in distilbert-base-uncased-finetuned-cola/checkpoint-1070/config.json

(RayTrainWorker pid=55340) {'eval_loss': 0.5405573844909668, 'eval_matthews_correlation': 0.4801193217859798, 'eval_runtime': 16.2513, 'eval_samples_per_second': 64.18, 'eval_steps_per_second': 4.061, 'epoch': 1.25}

(RayTrainWorker pid=55340) Model weights saved in distilbert-base-uncased-finetuned-cola/checkpoint-1070/pytorch_model.bin
(RayTrainWorker pid=55340) tokenizer config file saved in distilbert-base-uncased-finetuned-cola/checkpoint-1070/tokenizer_config.json
(RayTrainWorker pid=55340) Special tokens file saved in distilbert-base-uncased-finetuned-cola/checkpoint-1070/special_tokens_map.json
(RayTrainWorker pid=55340) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/cluster_storage/ray_results/tune_transformers/TorchTrainer_fac06_00000_0_learning_rate=0.0000_2023-09-05_16-36-04/checkpoint_000001)
(SplitCoordinator pid=55407) Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[MapBatches(preprocess_function)] -> OutputSplitter[split(1, equal=True)]
(SplitCoordinator pid=55407) Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_

(pid=55407) Running 0:   0%|          | 0/1 [00:00<?, ?it/s]

 50%|█████     | 1071/2136 [12:38<2:11:12,  7.39s/it]
 50%|█████     | 1072/2136 [12:39<1:35:23,  5.38s/it]
 50%|█████     | 1073/2136 [12:40<1:10:21,  3.97s/it]
 50%|█████     | 1074/2136 [12:40<52:50,  2.98s/it]  
 50%|█████     | 1075/2136 [12:41<40:35,  2.30s/it]
 50%|█████     | 1076/2136 [12:42<32:01,  1.81s/it]
 50%|█████     | 1077/2136 [12:42<26:02,  1.48s/it]
 50%|█████     | 1078/2136 [12:43<21:51,  1.24s/it]
 51%|█████     | 1079/2136 [12:44<18:54,  1.07s/it]
 51%|█████     | 1080/2136 [12:44<16:51,  1.04it/s]
 51%|█████     | 1081/2136 [12:45<15:25,  1.14it/s]
 51%|█████     | 1082/2136 [12:46<14:23,  1.22it/s]
 51%|█████     | 1083/2136 [12:46<13:41,  1.28it/s]
 51%|█████     | 1084/2136 [12:47<13:10,  1.33it/s]
 51%|█████     | 1085/2136 [12:48<12:49,  1.37it/s]
 51%|█████     | 1086/2136 [12:48<12:34,  1.39it/s]
 51%|█████     | 1087/2136 [12:49<12:24,  1.41it/s]
 51%|█████     | 1088/2136 [12:50<12:16,  1.42it/s]
 51%|█████     | 1089/2136 [12:51<12:12,  1.43it/s]
 51%

[2m[1m[36m(autoscaler +41m57s)[0m [workspace snapshot] New snapshot created successfully (size: 339.99 KB).


 61%|██████    | 1305/2136 [15:21<09:41,  1.43it/s]
 61%|██████    | 1306/2136 [15:22<09:40,  1.43it/s]
 61%|██████    | 1307/2136 [15:23<09:40,  1.43it/s]
 61%|██████    | 1308/2136 [15:23<09:40,  1.43it/s]
 61%|██████▏   | 1309/2136 [15:24<09:39,  1.43it/s]
 61%|██████▏   | 1310/2136 [15:25<09:38,  1.43it/s]
 61%|██████▏   | 1311/2136 [15:25<09:36,  1.43it/s]
 61%|██████▏   | 1312/2136 [15:26<09:37,  1.43it/s]
 61%|██████▏   | 1313/2136 [15:27<09:35,  1.43it/s]
 62%|██████▏   | 1314/2136 [15:27<09:35,  1.43it/s]
 62%|██████▏   | 1315/2136 [15:28<09:33,  1.43it/s]
 62%|██████▏   | 1316/2136 [15:29<09:33,  1.43it/s]
 62%|██████▏   | 1317/2136 [15:30<09:33,  1.43it/s]
 62%|██████▏   | 1318/2136 [15:30<09:32,  1.43it/s]
 62%|██████▏   | 1319/2136 [15:31<09:32,  1.43it/s]
 62%|██████▏   | 1320/2136 [15:32<09:30,  1.43it/s]
 62%|██████▏   | 1321/2136 [15:32<09:30,  1.43it/s]
 62%|██████▏   | 1322/2136 [15:33<09:29,  1.43it/s]
 62%|██████▏   | 1323/2136 [15:34<09:28,  1.43it/s]
 62%|██████▏

[2m[36m(RayTrainWorker pid=55340)[0m {'loss': 0.2565, 'learning_rate': 4.971910112359551e-06, 'epoch': 2.25}


 75%|███████▌  | 1605/2136 [18:49<04:55,  1.80it/s]
***** Running Evaluation *****
(RayTrainWorker pid=55340)   Num examples: Unknown
(RayTrainWorker pid=55340)   Batch size = 16

(pid=55408) Running 0:   0%|          | 0/1 [00:00<?, ?it/s]

(SplitCoordinator pid=55408) Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[MapBatches(preprocess_function)] -> OutputSplitter[split(1, equal=True)]
(SplitCoordinator pid=55408) Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=['d18a14d46721e86f86679bc8635753b27b5c4e17470461fc2a2fc741'], preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
(SplitCoordinator pid=55408) Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`
 75%|███████▌  | 1605/2136 [19:05<04:55,  1.80it/s]
Saving model checkpoint to distilbert-base-uncased-finetuned-cola/checkpoint-1605
(RayTrainWorker pid=55340) Configuration saved in distilbert-base-uncased-finetuned-cola/checkpoint-1605/config.json

(RayTrainWorker pid=55340) {'eval_loss': 0.6462501287460327, 'eval_matthews_correlation': 0.5410039366652665, 'eval_runtime': 15.4093, 'eval_samples_per_second': 67.686, 'eval_steps_per_second': 4.283, 'epoch': 2.25}

(RayTrainWorker pid=55340) Model weights saved in distilbert-base-uncased-finetuned-cola/checkpoint-1605/pytorch_model.bin
(RayTrainWorker pid=55340) tokenizer config file saved in distilbert-base-uncased-finetuned-cola/checkpoint-1605/tokenizer_config.json
(RayTrainWorker pid=55340) Special tokens file saved in distilbert-base-uncased-finetuned-cola/checkpoint-1605/special_tokens_map.json
(RayTrainWorker pid=55340) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/cluster_storage/ray_results/tune_transformers/TorchTrainer_fac06_00000_0_learning_rate=0.0000_2023-09-05_16-36-04/checkpoint_000002)


(pid=55407) Running 0:   0%|          | 0/1 [00:00<?, ?it/s]

(SplitCoordinator pid=55407) Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[MapBatches(preprocess_function)] -> OutputSplitter[split(1, equal=True)]
(SplitCoordinator pid=55407) Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=['d18a14d46721e86f86679bc8635753b27b5c4e17470461fc2a2fc741'], preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
(SplitCoordinator pid=55407) Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`
 75%|███████▌  | 1606/2136 [19:12<1:02:59,  7.13s/it]
 75%|███████▌  | 1607/2136 [19:12<45:43,  5.19s/it]  
 75%|███████▌  | 1608/2136 [19:13<33:39,  3.82s/it]
 75%|███████▌  | 1609/2136 [19:14<25:13,  2.87s/it]
 75%|███████▌  | 1610/2136 [19:14<19:19,  2.20s/it]
 75%|███████▌  | 1611/2136 [19:15<15:12,  1.74s/it]
 75%|███████▌  | 1612/2136 [1

[2m[1m[36m(autoscaler +46m57s)[0m [workspace snapshot] New snapshot created successfully (size: 381.44 KB).


 80%|████████  | 1713/2136 [20:21<04:37,  1.52it/s]
 80%|████████  | 1714/2136 [20:22<04:35,  1.53it/s]
 80%|████████  | 1715/2136 [20:23<04:34,  1.53it/s]
 80%|████████  | 1716/2136 [20:23<04:33,  1.54it/s]
 80%|████████  | 1717/2136 [20:24<04:32,  1.54it/s]
 80%|████████  | 1718/2136 [20:25<04:31,  1.54it/s]
 80%|████████  | 1719/2136 [20:25<04:30,  1.54it/s]
 81%|████████  | 1720/2136 [20:26<04:30,  1.54it/s]
 81%|████████  | 1721/2136 [20:27<04:29,  1.54it/s]
 81%|████████  | 1722/2136 [20:27<04:28,  1.54it/s]
 81%|████████  | 1723/2136 [20:28<04:28,  1.54it/s]
 81%|████████  | 1724/2136 [20:29<04:27,  1.54it/s]
 81%|████████  | 1725/2136 [20:29<04:26,  1.54it/s]
 81%|████████  | 1726/2136 [20:30<04:26,  1.54it/s]
 81%|████████  | 1727/2136 [20:30<04:25,  1.54it/s]
 81%|████████  | 1728/2136 [20:31<04:24,  1.54it/s]
 81%|████████  | 1729/2136 [20:32<04:24,  1.54it/s]
 81%|████████  | 1730/2136 [20:32<04:23,  1.54it/s]
 81%|████████  | 1731/2136 [20:33<04:22,  1.54it/s]
 81%|███████

[2m[36m(RayTrainWorker pid=55340)[0m {'loss': 0.1941, 'learning_rate': 0.0, 'epoch': 3.25}


100%|██████████| 2136/2136 [25:05<00:00,  1.44it/s]
***** Running Evaluation *****
(RayTrainWorker pid=55340)   Num examples: Unknown
(RayTrainWorker pid=55340)   Batch size = 16

(pid=55408) Running 0:   0%|          | 0/1 [00:00<?, ?it/s]

(SplitCoordinator pid=55408) Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[MapBatches(preprocess_function)] -> OutputSplitter[split(1, equal=True)]
(SplitCoordinator pid=55408) Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=['d18a14d46721e86f86679bc8635753b27b5c4e17470461fc2a2fc741'], preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
(SplitCoordinator pid=55408) Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`
100%|██████████| 2136/2136 [25:22<00:00,  1.44it/s]
Saving model checkpoint to distilbert-base-uncased-finetuned-cola/checkpoint-2136
(RayTrainWorker pid=55340) Configuration saved in distilbert-base-uncased-finetuned-cola/checkpoint-2136/config.json

(RayTrainWorker pid=55340) {'eval_loss': 0.7526667714118958, 'eval_matthews_correlation': 0.5396512550123742, 'eval_runtime': 16.5097, 'eval_samples_per_second': 63.175, 'eval_steps_per_second': 3.998, 'epoch': 3.25}

(RayTrainWorker pid=55340) Model weights saved in distilbert-base-uncased-finetuned-cola/checkpoint-2136/pytorch_model.bin
(RayTrainWorker pid=55340) tokenizer config file saved in distilbert-base-uncased-finetuned-cola/checkpoint-2136/tokenizer_config.json
(RayTrainWorker pid=55340) Special tokens file saved in distilbert-base-uncased-finetuned-cola/checkpoint-2136/special_tokens_map.json


[2m[1m[36m(autoscaler +52m2s)[0m [workspace snapshot] New snapshot created successfully (size: 407.01 KB).


[2m[36m(RayTrainWorker pid=55340)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/cluster_storage/ray_results/tune_transformers/TorchTrainer_fac06_00000_0_learning_rate=0.0000_2023-09-05_16-36-04/checkpoint_000003)
[2m[36m(RayTrainWorker pid=55340)[0m 
[2m[36m(RayTrainWorker pid=55340)[0m 
[2m[36m(RayTrainWorker pid=55340)[0m Training completed. Do not forget to share your model on huggingface.co/models =)
[2m[36m(RayTrainWorker pid=55340)[0m 
[2m[36m(RayTrainWorker pid=55340)[0m 
100%|██████████| 2136/2136 [25:27<00:00,  1.40it/s]


[2m[36m(RayTrainWorker pid=55340)[0m {'train_runtime': 1527.6333, 'train_samples_per_second': 22.372, 'train_steps_per_second': 1.398, 'train_loss': 0.34186658252044566, 'epoch': 3.25}


Syncing will be retried.
2023-09-05 17:02:17,774	INFO tune.py:1154 -- Total run time: 1573.45 seconds (1545.66 seconds for the tuning loop).
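
As a quick sanity check, the throughput figures in the final log lines are consistent with each other: multiplying a runtime by its per-second rate recovers the number of samples and steps processed. The sketch below uses only values printed in the log above; the variable names are illustrative and not part of any Ray or 🤗 API.

```python
import math

# Values copied from the final evaluation and training log lines.
eval_runtime = 16.5097
eval_samples_per_second = 63.175
eval_steps_per_second = 3.998
train_runtime = 1527.6333
train_samples_per_second = 22.372
train_steps_per_second = 1.398
batch_size = 16  # "Batch size = 16" in the evaluation log

# runtime * rate recovers the (rounded) counts behind each rate.
eval_samples = round(eval_runtime * eval_samples_per_second)
eval_steps = round(eval_runtime * eval_steps_per_second)
train_samples = round(train_runtime * train_samples_per_second)
train_steps = round(train_runtime * train_steps_per_second)

print("eval samples / steps:", eval_samples, eval_steps)
print("train samples / steps:", train_samples, train_steps)

# The evaluation step count matches ceil(samples / batch size).
assert eval_steps == math.ceil(eval_samples / batch_size)
# The training step count matches the 2136 shown in the progress bar.
assert train_steps == 2136
```

The training figures work out to 16 samples per step, which suggests that the training batch size matches the evaluation batch size for the single worker shown in these logs.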


We can view the results of the tuning run as a dataframe, and obtain the best result.

In [27]:
tune_results.get_dataframe().sort_values("eval_loss")

|  | loss | learning_rate | epoch | step | eval_loss | eval_matthews_correlation | eval_runtime | eval_samples_per_second | eval_steps_per_second | timestamp | ... | time_total_s | pid | hostname | node_ip | time_since_restore | iterations_since_restore | checkpoint_dir_name | config/train_loop_config/learning_rate | config/train_loop_config/epochs | logdir |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.6217 | 0.00015 | 0.25 | 535 | 0.618047 | 0.0 | 16.3166 | 63.922 | 4.045 | 1693957372 | ... | 400.380528 | 21972 | ip-10-0-9-214 | 10.0.9.214 | 400.380528 | 1 | checkpoint_000000 | 0.0002 | 4 | fac06_00001 |
| 3 | 1.165 | 0.014991 | 0.25 | 535 | 0.618265 | 0.0 | 16.2647 | 64.127 | 4.058 | 1693957375 | ... | 402.832207 | 22226 | ip-10-0-24-75 | 10.0.24.75 | 402.832207 | 1 | checkpoint_000000 | 0.02 | 4 | fac06_00003 |
| 2 | 0.6488 | 0.001499 | 0.25 | 535 | 0.618356 | 0.0 | 16.0909 | 64.819 | 4.102 | 1693957362 | ... | 391.176487 | 21888 | ip-10-0-42-45 | 10.0.42.45 | 391.176487 | 1 | checkpoint_000000 | 0.002 | 4 | fac06_00002 |
| 0 | 0.1941 | 0.0 | 3.25 | 2136 | 0.752667 | 0.539651 | 16.5097 | 63.175 | 3.998 | 1693958509 | ... | 1537.989638 | 55234 | ip-10-0-19-101 | 10.0.19.101 | 1537.989638 | 4 | checkpoint_000003 | 2e-05 | 4 | fac06_00000 |
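
The `eval_matthews_correlation` column is 0.0 for three of the four trials. One way to get exactly 0.0 is a model that predicts the same class for every example, although the table alone does not confirm that this is what happened here. The sketch below computes the Matthews correlation coefficient from a binary confusion matrix. The function name and the counts are hypothetical, and returning 0.0 when the denominator is zero is a convention (the one scikit-learn follows), not something taken from this notebook.

```python
import math


def matthews_corrcoef_from_counts(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary confusion matrix."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: report 0.0 when a row or column of the matrix is empty.
    if denominator == 0:
        return 0.0
    return numerator / denominator


# Hypothetical counts: 70 positive and 30 negative examples.
# A model that always predicts "positive" never produces tn or fn.
always_positive = matthews_corrcoef_from_counts(tp=70, tn=0, fp=30, fn=0)

# A model that gets most examples of both classes right.
mostly_right = matthews_corrcoef_from_counts(tp=60, tn=20, fp=10, fn=10)

# A perfect classifier.
perfect = matthews_corrcoef_from_counts(tp=70, tn=30, fp=0, fn=0)

print(always_positive, round(mostly_right, 4), perfect)
```

Unlike accuracy, which would be 70% for the always-positive model in this made-up example, the coefficient stays at zero until the model separates the two classes better than chance.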


In [28]:
best_result = tune_results.get_best_result()
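
Sorting by `eval_loss` and picking the best classifier are not the same thing in this run: the trial with the lowest loss has a Matthews correlation of 0.0, and it only reported metrics at epoch 0.25. The sketch below replays the four rows of the dataframe in plain Python to make the difference concrete. The `trials` list is copied by hand from the table and is not a Ray API. With Ray Tune itself, `get_best_result` also accepts explicit `metric` and `mode` arguments, for example `tune_results.get_best_result(metric="eval_matthews_correlation", mode="max")`.

```python
# Hand-copied subset of the results dataframe (illustrative structure).
trials = [
    {"logdir": "fac06_00001", "learning_rate": 0.0002, "epoch": 0.25,
     "eval_loss": 0.618047, "eval_matthews_correlation": 0.0},
    {"logdir": "fac06_00003", "learning_rate": 0.02, "epoch": 0.25,
     "eval_loss": 0.618265, "eval_matthews_correlation": 0.0},
    {"logdir": "fac06_00002", "learning_rate": 0.002, "epoch": 0.25,
     "eval_loss": 0.618356, "eval_matthews_correlation": 0.0},
    {"logdir": "fac06_00000", "learning_rate": 2e-05, "epoch": 3.25,
     "eval_loss": 0.752667, "eval_matthews_correlation": 0.539651},
]

# Lowest evaluation loss: what sorting the dataframe by "eval_loss" surfaces.
best_by_loss = min(trials, key=lambda t: t["eval_loss"])

# Highest Matthews correlation: the metric CoLA is evaluated on.
best_by_mcc = max(trials, key=lambda t: t["eval_matthews_correlation"])

print("lowest eval_loss:", best_by_loss["logdir"])
print("highest eval_matthews_correlation:", best_by_mcc["logdir"])
```

Which criterion is appropriate depends on the goal; for sharing a usable classifier, the correlation metric is likely the more meaningful one here.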

### Share the model <a name="share"></a>

To be able to share your model with the community, there are a few more steps to follow.

We conducted the training on the Ray cluster, but we share the model from the local environment, which makes it easier to authenticate.

First, store your authentication token from the Hugging Face website (sign up [here](https://huggingface.co/join) if you haven't already!), then execute the following cell and enter your credentials when prompted:

In [None]:
from huggingface_hub import notebook_login

notebook_login()

Then you need to install Git-LFS. Uncomment and run the following line:

In [23]:
# !apt install git-lfs

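Now, load the fine-tuned model locally from the training checkpoint:

In [None]:
from ray.train.huggingface import LegacyTransformersCheckpoint
from transformers import AutoModelForSequenceClassification

# Wrap the Ray Train checkpoint so the 🤗 model can be restored from it.
# `best_result.checkpoint` can be used instead to share the best tuning trial.
checkpoint = LegacyTransformersCheckpoint.from_checkpoint(result.checkpoint)
# `get_model` returns the restored model, not a `Trainer`.
model = checkpoint.get_model(model=AutoModelForSequenceClassification)

You can now upload the result of the training to the Hub by executing this instruction:

In [None]:
# Newer transformers releases may require a repository name argument here.
model.push_to_hub()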

You can now share this model with all your friends, family, and favorite pets: they can all load it with the identifier `"your-username/the-name-you-picked"`, for instance:

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("sgugger/my-awesome-model")
```

## Next steps

- {ref}`End-to-end: Offline Batch Inference <batch_inference_home>`