Source: https://huggingface.co/learn/nlp-course/chapter3/4?fw=pt

https://youtu.be/Dh9CL8fyG80

Now we'll see how to achieve the same results as we did in the last section without using the Trainer class. Again, we assume you have done the data processing in section 2. Here is a short summary covering everything you will need:

## Prepare for training

Before actually writing our training loop, we will need to define a few objects. The first ones are the dataloaders we will use to iterate over batches. But before we can define those dataloaders, we need to apply a bit of postprocessing to our tokenized_datasets, to take care of some things that the Trainer did for us automatically. Specifically, we need to:

- Remove the columns corresponding to values the model does not expect (like the sentence1 and sentence2 columns).
- Rename the column label to labels (because the model expects the argument to be named labels).
- Set the format of the datasets so they return PyTorch tensors instead of lists.

Our tokenized_datasets has one method for each of those steps:

We can then check that the result only has columns that our model will accept:

Now that this is done, we can easily define our dataloaders:

To quickly check there is no mistake in the data processing, we can inspect a batch like this:
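The dataloader definition and the batch check can be sketched as follows. To keep the example runnable without downloading anything, it uses made-up examples and a hand-written pad_collate function as a hypothetical stand-in for the data collator from section 2; the batch size of 8 and shuffle=True match the values mentioned in this section.

```python
import torch
from torch.utils.data import DataLoader

# Ten toy tokenized examples with lengths 3 to 12 (token ids are made up).
examples = [
    {
        "input_ids": [101] + [5] * n + [102],
        "attention_mask": [1] * (n + 2),
        "labels": n % 2,
    }
    for n in range(1, 11)
]


def pad_collate(features):
    """Hypothetical stand-in for data_collator: pad to the longest item in the batch."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {
        key: torch.tensor([f[key] + [0] * (max_len - len(f[key])) for f in features])
        for key in ("input_ids", "attention_mask")
    }
    batch["labels"] = torch.tensor([f["labels"] for f in features])
    return batch


train_dataloader = DataLoader(examples, shuffle=True, batch_size=8, collate_fn=pad_collate)
eval_dataloader = DataLoader(examples, batch_size=8, collate_fn=pad_collate)

# Grab the first training batch and look at the tensor shapes.
for batch in train_dataloader:
    break
print({k: v.shape for k, v in batch.items()})
```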

Note that the actual shapes will probably be slightly different for you since we set shuffle=True for the training dataloader and we are padding to the maximum length inside the batch.

Now that we're completely finished with data preprocessing (a satisfying yet elusive goal for any ML practitioner), let's turn to the model. We instantiate it exactly as we did in the previous section:

To make sure that everything will go smoothly during training, we pass our batch to this model:

All 🤗 Transformers models will return the loss when labels are provided, and we also get the logits (two for each input in our batch, so a tensor of size 8 x 2).
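To make this check concrete, here is a minimal sketch of passing a batch with labels to a sequence classification model. It assumes a tiny, randomly initialized BERT built from a config (so nothing is downloaded) instead of the pretrained checkpoint from the previous section, and a random batch of 8 inputs.

```python
import torch
from transformers import BertConfig, BertForSequenceClassification

# Assumption: a tiny random model, used only so the sketch runs offline.
config = BertConfig(
    vocab_size=200,
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
    num_labels=2,
)
model = BertForSequenceClassification(config)

# Random stand-in for one batch from the training dataloader.
batch = {
    "input_ids": torch.randint(0, 200, (8, 16)),
    "attention_mask": torch.ones(8, 16, dtype=torch.long),
    "labels": torch.randint(0, 2, (8,)),
}

# Because labels are provided, the output carries a loss as well as the logits.
outputs = model(**batch)
print(outputs.loss, outputs.logits.shape)
```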

We're almost ready to write our training loop! We're just missing two things: an optimizer and a learning rate scheduler. Since we are trying to replicate what the Trainer was doing by hand, we will use the same defaults. The optimizer used by the Trainer is AdamW, which is the same as Adam, but with a twist for weight decay regularization (see Decoupled Weight Decay Regularization by Ilya Loshchilov and Frank Hutter: https://arxiv.org/abs/1711.05101).
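To illustrate the "twist", the sketch below takes a single optimizer step on one weight whose loss gradient is zero, once with torch.optim.AdamW and once with torch.optim.Adam. The learning rate and weight decay of 0.1 are exaggerated values chosen only to make the difference visible, and the helper one_step is hypothetical.

```python
import torch


def one_step(optimizer_cls):
    """Hypothetical helper: one optimizer step on a single weight with zero gradient."""
    w = torch.nn.Parameter(torch.tensor([1.0]))
    opt = optimizer_cls([w], lr=0.1, weight_decay=0.1)
    w.grad = torch.zeros_like(w)  # pretend the loss gradient is exactly zero
    opt.step()
    return w.item()


# AdamW applies the decay directly to the weight, outside the adaptive update.
w_adamw = one_step(torch.optim.AdamW)
# Adam adds the decay term to the gradient, so it gets rescaled like any gradient.
w_adam = one_step(torch.optim.Adam)
print(w_adamw, w_adam)
```

With these settings AdamW shrinks the weight by lr × weight_decay (to about 0.99), while Adam folds the decay into the gradient and takes a full normalized step (to about 0.9). For the training setup in this section, one option is torch.optim.AdamW(model.parameters(), lr=5e-5).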

Finally, the learning rate scheduler used by default is just a linear decay from the maximum value (5e-5) to 0. To properly define it, we need to know the number of training steps we will take, which is the number of epochs we want to run multiplied by the number of training batches (which is the length of our training dataloader). The Trainer uses three epochs by default, so we will follow that:
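The step count and the linear decay can be sketched as below, using the get_scheduler helper from 🤗 Transformers. The toy linear model and the value num_batches = 10 are stand-ins (assumptions) for the real model and for len(train_dataloader).

```python
import torch
from transformers import get_scheduler

model = torch.nn.Linear(4, 2)  # toy stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

num_batches = 10  # stand-in for len(train_dataloader)
num_epochs = 3
num_training_steps = num_epochs * num_batches

# Linear decay from the initial learning rate down to 0, with no warmup.
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

# Record the learning rate used at each step to see the decay.
lrs = []
for _ in range(num_training_steps):
    lrs.append(lr_scheduler.get_last_lr()[0])
    optimizer.step()
    lr_scheduler.step()

print(num_training_steps, lrs[0], lr_scheduler.get_last_lr()[0])
```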

## The training loop

One last thing: we will want to use the GPU if we have access to one (on a CPU, training might take several hours instead of a couple of minutes). To do this, we define a device we will put our model and our batches on:
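A minimal sketch of the device selection, with a toy linear model and a toy batch standing in for the real ones:

```python
import torch

# Use the GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

model = torch.nn.Linear(4, 2)  # toy stand-in for the real model
model.to(device)

# Every tensor in a batch has to live on the same device as the model.
batch = {"x": torch.randn(8, 4)}
batch = {k: v.to(device) for k, v in batch.items()}

print(device, model(batch["x"]).shape)
```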

We are now ready to train! To get some sense of when training will be finished, we add a progress bar over our number of training steps, using the tqdm library:
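Putting the pieces together, a training loop along these lines might look like the sketch below. To stay runnable without downloads it assumes a tiny randomly initialized BERT and a list of random batches in place of the pretrained model and the real training dataloader; the structure of the loop is the part that matters.

```python
import torch
from tqdm.auto import tqdm
from transformers import BertConfig, BertForSequenceClassification, get_scheduler

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

# Assumption: tiny random model and random batches, for illustration only.
config = BertConfig(vocab_size=200, hidden_size=32, num_hidden_layers=1,
                    num_attention_heads=2, intermediate_size=64, num_labels=2)
model = BertForSequenceClassification(config).to(device)


def make_batch():
    return {
        "input_ids": torch.randint(0, 200, (8, 16)),
        "attention_mask": torch.ones(8, 16, dtype=torch.long),
        "labels": torch.randint(0, 2, (8,)),
    }


train_dataloader = [make_batch() for _ in range(4)]  # stand-in for a DataLoader

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler("linear", optimizer=optimizer,
                             num_warmup_steps=0, num_training_steps=num_training_steps)

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)
```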

You can see that the core of the training loop looks a lot like the one in the introduction. We didn't ask for any reporting, so this training loop will not tell us anything about how the model fares. We need to add an evaluation loop for that.

## The evaluation loop

As we did earlier, we will use a metric provided by the 🤗 Evaluate library. We've already seen the metric.compute() method, but metrics can actually accumulate batches for us as we go over the prediction loop with the method add_batch(). Once we have accumulated all the batches, we can get the final result with metric.compute(). Here's how to implement all of this in an evaluation loop:
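The sketch below shows the shape of such an evaluation loop. So that it runs offline, it uses a hypothetical ToyAccuracy class that only mimics the add_batch()/compute() interface; in the real setup the metric object would be loaded from the 🤗 Evaluate library with evaluate.load. The tiny random model and random batches are likewise assumptions.

```python
import torch
from transformers import BertConfig, BertForSequenceClassification


class ToyAccuracy:
    """Hypothetical stand-in mimicking a metric's add_batch()/compute() interface."""

    def __init__(self):
        self.correct = 0
        self.total = 0

    def add_batch(self, predictions, references):
        # Accumulate counts batch by batch instead of storing all predictions.
        self.correct += int((predictions == references).sum())
        self.total += len(references)

    def compute(self):
        return {"accuracy": self.correct / self.total}


device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
config = BertConfig(vocab_size=200, hidden_size=32, num_hidden_layers=1,
                    num_attention_heads=2, intermediate_size=64, num_labels=2)
model = BertForSequenceClassification(config).to(device)

# Stand-in for the validation dataloader: three random batches of 8 examples.
eval_dataloader = [
    {
        "input_ids": torch.randint(0, 200, (8, 16)),
        "attention_mask": torch.ones(8, 16, dtype=torch.long),
        "labels": torch.randint(0, 2, (8,)),
    }
    for _ in range(3)
]

metric = ToyAccuracy()
model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():  # no gradients are needed for evaluation
        outputs = model(**batch)

    predictions = torch.argmax(outputs.logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

result = metric.compute()
print(result)
```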

Again, your results will be slightly different because of the randomness in the model head initialization and the data shuffling, but they should be in the same ballpark.

## Supercharge your training loop with 🤗 Accelerate

https://youtu.be/s7dy8QRgjJ0

The training loop we defined earlier works fine on a single CPU or GPU. But using the 🤗 Accelerate (https://github.com/huggingface/accelerate) library, with just a few adjustments we can enable distributed training on multiple GPUs or TPUs. Starting from the creation of the training and validation dataloaders, here is what our manual training loop looks like:

And here are the changes:

The first line to add is the import line. The second line instantiates an Accelerator object that will look at the environment and initialize the proper distributed setup. 🤗 Accelerate handles the device placement for you, so you can remove the lines that put the model on the device (or, if you prefer, change them to use accelerator.device instead of device).

Then the main bulk of the work is done in the line that sends the dataloaders, the model, and the optimizer to accelerator.prepare(). This will wrap those objects in the proper container to make sure your distributed training works as intended. The remaining changes to make are removing the line that puts the batch on the device (again, if you want to keep this you can just change it to use accelerator.device) and replacing loss.backward() with accelerator.backward(loss).

If you'd like to copy and paste it to play around, here's what the complete training loop looks like with 🤗 Accelerate:

Putting this in a train.py script will make that script runnable on any kind of distributed setup. To try it out in your distributed setup, run the command accelerate config, which will prompt you to answer a few questions and dump your answers in a configuration file. That file is then used by the command accelerate launch train.py, which will launch the distributed training.

If you want to try this in a Notebook (for instance, to test it with TPUs on Colab), just paste the code in a training_function() and run a last cell with:

You can find more examples in the 🤗 Accelerate repo: https://github.com/huggingface/accelerate/tree/main/examples