# A complete Hugging Face tutorial: how to build and train a vision transformer

This article serves as an all-in-one tutorial of the Hugging Face ecosystem. We will explore the different libraries developed by the Hugging Face team, such as transformers and datasets, and see how they can be used to develop and train transformers with minimal boilerplate code. To illustrate the basic concepts, we will showcase the entire pipeline of building and training a Vision Transformer (ViT).

I assume that you are already familiar with the architecture, so we won’t analyze it in depth. A few things to remember are:

1. In ViT, we represent an image as a sequence of patches.

1. The architecture resembles the original Transformer from the famous “Attention is all you need” paper.

1. The model is trained using a labeled dataset following a fully-supervised paradigm.

1. It is usually fine-tuned on the downstream dataset for image classification.
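The first point in the list above can be made concrete with a small NumPy sketch. It splits an image into non-overlapping patches and flattens each one into a vector. The function name `image_to_patches`, the 224x224 input size, and the 16x16 patch size are assumptions chosen for illustration; the actual model additionally applies a learned linear projection to every patch.

```python
import numpy as np


def image_to_patches(image: np.ndarray, patch_size: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into a sequence of flattened patches."""
    height, width, channels = image.shape
    assert height % patch_size == 0 and width % patch_size == 0
    rows, cols = height // patch_size, width // patch_size
    # (rows, patch, cols, patch, C) -> (rows, cols, patch, patch, C)
    patches = image.reshape(rows, patch_size, cols, patch_size, channels)
    patches = patches.transpose(0, 2, 1, 3, 4)
    # One row per patch, each flattened to patch_size * patch_size * C values.
    return patches.reshape(rows * cols, patch_size * patch_size * channels)


image = np.random.rand(224, 224, 3)
patches = image_to_patches(image)
print(patches.shape)  # (196, 768)
```

A 224x224 RGB image therefore becomes a sequence of 196 "words", each with 768 values, which is the sequence the Transformer encoder operates on.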

If you are interested in a holistic view of the ViT architecture, visit one of our previous articles on the topic: [How the Vision Transformer (ViT) works in 10 minutes: an image is worth 16x16 words.](https://theaisummer.com/vision-transformer/)

![source.gif](source.gif)

Back to Hugging Face, which is the main subject of this article. We will strive to present the fundamental principles of the libraries, covering the entire ML pipeline: from data loading to training and evaluation.

Shall we begin?


## Datasets
The datasets library by Hugging Face is a collection of ready-to-use datasets and evaluation metrics for NLP. At the time of writing, the datasets hub counts over 900 different datasets. Let’s see how we can use it in our example.

To load a dataset, we need to import the `load_dataset` function and load the desired dataset like below:

In [2]:


from datasets import load_dataset

train_ds, test_ds = load_dataset('cifar10', split=['train[:5000]', 'test[:2000]'])

Reusing dataset cifar10 (/root/.cache/huggingface/datasets/cifar10/plain_text/1.0.0/447d6ec4733dddd1ce3bb577c7166b986eaa4c538dcd9e805ba61f35674a9de4)


  0%|          | 0/2 [00:00<?, ?it/s]

Notice that here we load only a portion of the CIFAR10 dataset. Using `load_dataset`, we can download datasets from the Hugging Face Hub, read from a local file, or load from in-memory data. We can also configure it to use a custom script containing the loading functionality.

Typically, the dataset will be returned as a `datasets.Dataset` object, which is nothing more than a table with rows and columns. Querying a row returns a Python dictionary whose keys are the column names and whose values are the contents of the corresponding cells in that row. In other words, each row corresponds to a data point and each column to a feature. We can inspect the entire structure of the dataset using `dataset.features`.

A `Dataset` object behaves like a Python list, so we can query it as we’d normally do with NumPy or pandas:

1. A single row is `dataset[3]`

1. A batch is `dataset[3:6]`

1. A column is `dataset['feature_1']`

Everything is a Python object but that doesn’t mean that it can’t be converted into NumPy, pandas, PyTorch or TensorFlow. This can be very easily accomplished using `datasets.Dataset.set_format()`, where the format is one of `'numpy'`, `'pandas'`, `'torch'`, `'tensorflow'`.

Needless to say, there is also support for all types of operations. To name a few: `sort`, `shuffle`, `filter`, `train_test_split`, `shard`, `cast`, `flatten` and `map`. `map` is, of course, the main function for performing transformations and, as you’d expect, it is parallelizable.

In our example, we first need to split the training data into a training and a validation dataset:



In [3]:
splits = train_ds.train_test_split(test_size=0.1)

train_ds = splits['train']

val_ds = splits['test']

### Metrics

The datasets library also provides a wide range of metrics that can be used when training models. The main object here is a `datasets.Metric`, which can be used in two ways:

1. We can either load an existing metric from the Hub using `datasets.load_metric('metric_name')`

1. Or we can define a custom metric in a separate script and load it using: `load_metric('PATH/TO/MY/METRIC/SCRIPT')`

In [4]:
from datasets import load_metric

metric = load_metric("accuracy")
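A loaded metric is typically evaluated with `metric.compute(predictions=..., references=...)`. To show what the accuracy metric boils down to, here is a minimal NumPy sketch. The helper `compute_accuracy` is a hypothetical stand-in, not the library's implementation. Also note that recent releases of datasets have moved metrics into the separate `evaluate` library, so `load_metric` may not be available in your installation.

```python
import numpy as np


def compute_accuracy(predictions, references):
    """Return the fraction of predictions that match the reference labels."""
    predictions = np.asarray(predictions)
    references = np.asarray(references)
    if predictions.shape != references.shape:
        raise ValueError("predictions and references must have the same shape")
    return {"accuracy": float((predictions == references).mean())}


result = compute_accuracy(predictions=[0, 1, 1, 0], references=[0, 1, 0, 0])
print(result)  # {'accuracy': 0.75}
```

Returning a dictionary keeps the sketch close to how metric results are usually consumed: several named scores can be reported side by side.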

## Transformers
Transformers is the main library by Hugging Face. It provides intuitive and highly abstracted functionality to build, train and fine-tune transformers. It comes with almost 10,000 pretrained models that can be found on the Hub. These models can be built in TensorFlow, PyTorch or JAX (a very recent addition), and anyone can upload their own model.

Alongside our example code, we will dive a little deeper into the main classes and features of the transformers library.

### Pipelines
The `pipeline` abstraction is an intuitive and easy way to use a model for inference. Pipelines abstract away most of the library’s code and provide a dedicated API for a variety of tasks. Examples include `AutomaticSpeechRecognitionPipeline`, `QuestionAnsweringPipeline`, `TranslationPipeline` and more.

The `pipeline` object also lets us define the pretrained model as well as the tokenizer, the feature extractor, the underlying framework and more. Tokenizers and feature extractors? What are those? Hold that thought for the next section.

In our case, the relevant pipeline is `transformers.ImageClassificationPipeline`. Below, we load the pretrained ViT classifier directly and switch it to evaluation mode:

In [5]:
from transformers import ViTForImageClassification

model = ViTForImageClassification.from_pretrained('google/vit-base-patch16-224')

model.eval()

ViTForImageClassification(
  (vit): ViTModel(
    (embeddings): ViTEmbeddings(
      (patch_embeddings): ViTPatchEmbeddings(
        (projection): Conv2d(3, 768, kernel_size=(16, 16), stride=(16, 16))
      )
      (dropout): Dropout(p=0.0, inplace=False)
    )
    (encoder): ViTEncoder(
      (layer): ModuleList(
        (0): ViTLayer(
          (attention): ViTAttention(
            (attention): ViTSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.0, inplace=False)
            )
            (output): ViTSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.0, inplace=False)
            )
          )
          (intermediate): ViTIntermediate(
            (dense): Linear(in_features=768, out_

The model can now be used for inference. All we have to do is feed an image and we are good to go.
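To give an idea of what happens after the forward pass, here is a minimal NumPy sketch that turns a vector of class scores (logits) into a predicted label. It assumes the model outputs one logit per class. The helper `predict_label`, the toy `id2label` mapping and the logit values are made up for illustration.

```python
import numpy as np


def predict_label(logits: np.ndarray, id2label: dict) -> tuple:
    """Turn one vector of logits into a (label, probability) pair."""
    # Softmax with the maximum subtracted for numerical stability.
    shifted = logits - logits.max()
    probabilities = np.exp(shifted) / np.exp(shifted).sum()
    best = int(probabilities.argmax())
    return id2label[best], float(probabilities[best])


# Hypothetical class mapping and logits for a single image.
id2label = {0: "cat", 1: "dog", 2: "ship"}
logits = np.array([0.2, 2.5, -1.0])

label, probability = predict_label(logits, id2label)
print(label, round(probability, 3))
```

With a real checkpoint, the mapping from class index to name would come from the model configuration rather than a hand-written dictionary.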

However, in many cases, we also need to train or fine-tune a model. Perhaps we also want better control over the entire pipeline. Therefore, we might need to develop the code ourselves. For educational purposes, this is what we’ll do here.

## Preparing the dataset
The first step in any ML lifecycle is to transform the dataset. In our case, we need to preprocess the CIFAR10 images so that we can feed them to our model. Hugging Face has two basic classes for data processing: tokenizers and feature extractors.

### Tokenizers
In most NLP tasks, a `tokenizer` is our go-to solution. A tokenizer maps the text into tokens and then into numerical inputs that can be fed into the model. Each model comes with its own tokenizer, which is based on the `PreTrainedTokenizer` class.
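The text-to-tokens-to-ids mapping can be sketched with a toy whitespace tokenizer. This is only an illustration of the idea, not how `PreTrainedTokenizer` is implemented. The helpers `build_vocab` and `encode` and the special tokens `[PAD]` and `[UNK]` are assumptions made for this sketch.

```python
def build_vocab(corpus):
    """Assign an integer id to every distinct lowercase word in the corpus."""
    vocab = {"[PAD]": 0, "[UNK]": 1}
    for sentence in corpus:
        for token in sentence.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(text, vocab, max_length=6):
    """Map text to fixed-length token ids plus an attention mask."""
    tokens = text.lower().split()
    # Unknown words fall back to the [UNK] id; long inputs are truncated.
    ids = [vocab.get(token, vocab["[UNK]"]) for token in tokens][:max_length]
    padding = max_length - len(ids)
    return {
        "input_ids": ids + [vocab["[PAD]"]] * padding,
        "attention_mask": [1] * len(ids) + [0] * padding,
    }


vocab = build_vocab(["attention is all you need"])
encoded = encode("Attention is all we need", vocab)
print(encoded)
# {'input_ids': [2, 3, 4, 1, 6, 0], 'attention_mask': [1, 1, 1, 1, 1, 0]}
```

The word "we" is not in the vocabulary, so it maps to the unknown-token id, and the sequence is padded to a fixed length. Real tokenizers typically split text into subword units instead of whole words.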

Since we are dealing with images, we will not use a `Tokenizer` here. We will cover them more extensively in a future tutorial.

### Feature Extractors
Instead, we will make use of another class called a feature extractor. A feature extractor usually prepares the input features for models that fall outside the standard NLP setting. Feature extractors are in charge of things such as processing audio files and manipulating images. Most vision models come with a complementary feature extractor.

In [7]:
from transformers import ViTFeatureExtractor

feature_extractor = ViTFeatureExtractor.from_pretrained('google/vit-base-patch16-224-in21k')

This feature extractor will resize every image to the resolution that the model expects and normalize the channels.
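
To make these two steps concrete, here is a minimal NumPy sketch of resizing and normalizing a single image. It is an illustration under stated assumptions, not the library's implementation: the helper `resize_and_normalize` is hypothetical, the nearest-neighbour resize stands in for the smoother interpolation a real feature extractor would use, and the mean and standard deviation of 0.5 per channel are assumed values.

```python
import numpy as np


def resize_and_normalize(image, size=224, mean=0.5, std=0.5):
    """Resize an (H, W, C) uint8 image; return a (C, size, size) float32 array."""
    height, width, _ = image.shape
    # Nearest-neighbour resize: pick a source row/column for each target pixel.
    rows = np.arange(size) * height // size
    cols = np.arange(size) * width // size
    resized = image[rows][:, cols]
    # Rescale to [0, 1], then normalize every channel.
    scaled = resized.astype(np.float32) / 255.0
    normalized = (scaled - mean) / std
    # Move the channel axis first to get a (C, H, W) layout.
    return np.moveaxis(normalized, -1, 0)


# A random stand-in for a 32x32 CIFAR10 image.
image = np.random.randint(0, 256, size=(32, 32, 3), dtype=np.uint8)
pixel_values = resize_and_normalize(image)
print(pixel_values.shape, pixel_values.dtype)  # (3, 224, 224) float32
```

With these assumed statistics, the pixel values end up in the range [-1, 1].
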
Now we can define the entire processing functionality as depicted below:

In [8]:
single_instance = train_ds['img'][100]
type(single_instance)

PIL.PngImagePlugin.PngImageFile

In [9]:
import numpy as np
from datasets import Features, ClassLabel, Array3D

def preprocess_images(examples):
    """Add model-ready `pixel_values` to a batch of CIFAR10 examples."""
    # Convert each PIL image to a channels-first uint8 array of shape (3, 32, 32).
    images = [np.array(image, dtype=np.uint8) for image in examples['img']]
    images = [np.moveaxis(image, source=-1, destination=0) for image in images]
    # Resize and normalize the images for the model.
    inputs = feature_extractor(images=images)
    examples['pixel_values'] = inputs['pixel_values']
    # Store the arrays back in `img` so that the column matches the Array3D
    # feature declared below; leaving the PIL images in place raises an
    # ArrowTypeError when the batch is written.
    examples['img'] = images

    return examples

# Schema of the preprocessed dataset.
features = Features({
    'label': ClassLabel(names=['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']),
    'img': Array3D(dtype="int64", shape=(3, 32, 32)),
    'pixel_values': Array3D(dtype="float32", shape=(3, 224, 224)),
})

preprocessed_train_ds = train_ds.map(preprocess_images, batched=True, features=features)
preprocessed_val_ds = val_ds.map(preprocess_images, batched=True, features=features)
preprocessed_test_ds = test_ds.map(preprocess_images, batched=True, features=features)