# Chapter 5: Building Your First Hugging Face Dataset

## 5.1 Installation Notes
To run this notebook on Google Colab, you will need to install the following libraries: transformers and datasets.

In Google Colab, you can run the following command to install these libraries:

In [1]:
!pip install transformers datasets

Collecting datasets
  Downloading datasets-3.3.2-py3-none-any.whl.metadata (19 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Downloading datasets-3.3.2-py3-none-any.whl (485 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading multiprocess-0.70.16-py311-none-any.whl (143 kB)
Downloading xx

## 5.2 Learning Objectives

By the end of this chapter, you should be able to:
- build and use Hugging Face Datasets
- understand the role of batch normalization in deep learning models
- assess different alternatives for training models using higher-level libraries

## 5.3 A New Dataset

In this chapter, and in the second lab, we'll use a different dataset: [100,000 UK Used Car Dataset](https://www.kaggle.com/datasets/adityadesai13/used-car-dataset-ford-and-mercedes) from Kaggle. It contains scraped data of used car listings split into CSV files according to the manufacturer: Audi, BMW, Ford, Hyundai, Mercedes, Skoda, Toyota, Vauxhall, and VW. It also contains a few extra files of particular models (`cclass.csv`, `focus.csv`, `unclean_cclass.csv`, and `unclean_focus.csv`) that we won't be using.

Each file has nine columns with the car's attributes: model, year, price, transmission, mileage, fuel type, road tax, fuel consumption (mpg), and engine size. Model, transmission, fuel type, and year are discrete/categorical attributes; the others are continuous. In the code that follows, however, year is handled together with the continuous attributes. Our goal here is to predict the car's price based on its other attributes.

To download the dataset, you'll need to create a Kaggle account. In the following sections, we're assuming the dataset was downloaded and unzipped to a local folder named car_prices. Alternatively, you can download it from the following link:

https://raw.githubusercontent.com/lftraining/LFD273-code/main/data/100KUsedCar/car_prices.zip

In Colab, you can run the following commands to download and unzip the dataset:

In [2]:
!wget https://github.com/dvgodoy/assets/raw/main/PyTorchInPractice/data/100KUsedCar/car_prices.zip
!unzip car_prices.zip -d car_prices

--2025-03-12 10:00:18--  https://github.com/dvgodoy/assets/raw/main/PyTorchInPractice/data/100KUsedCar/car_prices.zip
Resolving github.com (github.com)... 140.82.116.3
Connecting to github.com (github.com)|140.82.116.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/data/100KUsedCar/car_prices.zip [following]
--2025-03-12 10:00:19--  https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/data/100KUsedCar/car_prices.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1152744 (1.1M) [application/zip]
Saving to: ‘car_prices.zip’


2025-03-12 10:00:19 (28.1 MB/s) - ‘car_prices.zip’ saved [1152744/1152744]

Archive:  car_prices.zip
  inflating: 

First, let's assemble a list of file paths containing only those files we're interested in loading:

In [3]:
import os

def filter_for_data(filename):
    return ("unclean" not in filename) and ("focus" not in filename) and ("cclass" not in filename) and filename.endswith(".csv")

folder = './car_prices'
data_files = sorted([os.path.join(folder, fname)
                     for fname in os.listdir(folder)
                     if filter_for_data(fname)])
data_files

['./car_prices/audi.csv',
 './car_prices/bmw.csv',
 './car_prices/ford.csv',
 './car_prices/hyundi.csv',
 './car_prices/merc.csv',
 './car_prices/skoda.csv',
 './car_prices/toyota.csv',
 './car_prices/vauxhall.csv',
 './car_prices/vw.csv']

Now we only have nine filenames, one for each manufacturer.
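
To see what the filter keeps and what it drops, here is a small self-contained check. It is only a sketch: the function is an equivalent rewrite of the one above, and the folder listing is hypothetical (`readme.txt` is made up for illustration).

```python
def filter_for_data(filename):
    # keep CSV files only, skipping the model-specific and "unclean" files
    excluded = ("unclean", "focus", "cclass")
    return filename.endswith(".csv") and not any(word in filename for word in excluded)

# hypothetical folder listing; readme.txt is not part of the dataset
listing = ["audi.csv", "cclass.csv", "focus.csv", "unclean_cclass.csv",
           "unclean_focus.csv", "vw.csv", "readme.txt"]

kept = sorted(name for name in listing if filter_for_data(name))
print(kept)  # ['audi.csv', 'vw.csv']
```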

### 5.3.1 Hugging Face Datasets

The Datasets library implements Hugging Face Datasets, a powerful and easy-to-use drop-in replacement for PyTorch's own Dataset. While we'll be using HF's Dataset to build our own dataset, we're only using a fraction of its capabilities. For a comprehensive overview of its functionalities, please check its documentation:

- [Quickstart](https://huggingface.co/docs/datasets/main/en/quickstart)
- [Know Your Dataset](https://huggingface.co/docs/datasets/main/en/access)
- [Loading a Dataset](https://huggingface.co/docs/datasets/main/loading)

Also, for a complete list of every dataset available, check the [Hugging Face Hub](https://huggingface.co/datasets).



Our goal is to build a dataset that returns a dictionary with three keys in it: `label` (containing the prices we want to predict), `cont_X` (an array of the continuous attributes), and `cat_X` (an array of sequentially-encoded categorical attributes).

#### 5.3.1.1 Loading CSV Files

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/data_step1.png)

Next, we need to actually open and parse these CSV files. We're skipping the first line (it contains the headers) and we'll provide the column names ourselves.

We can use the load_dataset() function to easily load several CSV files at once. Its first argument (named path) can be misleading, as it actually determines the path to the corresponding processing script, not the actual CSV files. Since CSV files are fairly standard, there's a default script to handle them, so we only need to set this argument to "[csv](https://huggingface.co/docs/datasets/en/tabular_load#csv-files)" and it will automatically call the corresponding script under the hood. The actual CSV files should be provided in the data_files argument. Moreover, the split argument can be used to specify which set (Split.TRAIN, Split.VALIDATION, Split.TEST, or Split.ALL) the data refers to.

It is also possible to control the parsing and reading of the CSV files through a series of typical arguments such as quotechar, sep, column_names, skiprows, quoting, and more. For more details, please check the documentation on the supported [configuration arguments](https://huggingface.co/docs/datasets/package_reference/loading_methods#datasets.packaged_modules.csv.CsvConfig).

The load_dataset() function can also be used to load data from other formats, such as [JSON files](https://huggingface.co/docs/datasets/en/loading#json) and [text files](https://huggingface.co/docs/datasets/en/nlp_load#load-text-data). In-memory objects such as [Python dictionaries](https://huggingface.co/docs/datasets/main/loading#python-dictionary) and [Pandas dataframes](https://huggingface.co/docs/datasets/main/loading#pandas-dataframe) can be turned into datasets too, using Dataset.from_dict() and Dataset.from_pandas().
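
To make the effect of skipping the header and supplying our own column names more concrete, here is a minimal sketch that uses only Python's standard `csv` module instead of the Datasets library. The two-line sample and its three columns are made up for illustration, and, unlike the CSV loader, this sketch does not infer data types, so every value comes back as a string.

```python
import csv
import io

# made-up sample: a header line followed by one listing
raw = "model,year,price\n A1,2017,12500\n"

# our own column names replace whatever the header says
colnames = ["model", "year", "price"]
reader = csv.DictReader(io.StringIO(raw), fieldnames=colnames)

next(reader)         # skip the header line, the role played by skiprows=1
rows = list(reader)  # every remaining line becomes a dictionary
print(rows)          # [{'model': ' A1', 'year': '2017', 'price': '12500'}]
```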

In [4]:
from datasets import load_dataset, Split

colnames = ['model', 'year', 'price', 'transmission', 'mileage', 'fuel_type', 'road_tax', 'mpg', 'engine_size', 'manufacturer']

dataset = load_dataset(path="csv",
                       data_files=data_files,
                       sep=',',
                       skiprows=1,
                       column_names=colnames,
                       split=Split.ALL)

Generating train split: 0 examples [00:00, ? examples/s]

The HF Dataset has many attributes, such as features, num_columns and shape:

In [5]:
dataset.features, dataset.num_columns, dataset.shape

({'model': Value(dtype='string', id=None),
  'year': Value(dtype='int64', id=None),
  'price': Value(dtype='int64', id=None),
  'transmission': Value(dtype='string', id=None),
  'mileage': Value(dtype='int64', id=None),
  'fuel_type': Value(dtype='string', id=None),
  'road_tax': Value(dtype='int64', id=None),
  'mpg': Value(dtype='float64', id=None),
  'engine_size': Value(dtype='float64', id=None),
  'manufacturer': Value(dtype='float64', id=None)},
 10,
 (99187, 10))

The dataset can be indexed just like a Python list:

In [6]:
dataset[:3]

{'model': [' A1', ' A6', ' A1'],
 'year': [2017, 2016, 2016],
 'price': [12500, 16500, 11000],
 'transmission': ['Manual', 'Automatic', 'Manual'],
 'mileage': [15735, 36203, 29946],
 'fuel_type': ['Petrol', 'Diesel', 'Petrol'],
 'road_tax': [150, 20, 30],
 'mpg': [55.4, 64.2, 55.4],
 'engine_size': [1.4, 2.0, 1.4],
 'manufacturer': [None, None, None]}

And its columns can be accessed as a dictionary too:

In [7]:
dataset['transmission']

['Manual',
 'Automatic',
 'Manual',
 'Automatic',
 'Manual',
 'Automatic',
 'Automatic',
 'Manual',
 'Manual',
 'Manual',
 'Manual',
 'Automatic',
 'Manual',
 'Manual',
 'Manual',
 'Automatic',
 'Automatic',
 'Automatic',
 'Automatic',
 'Manual',
 'Automatic',
 'Automatic',
 'Automatic',
 'Automatic',
 'Automatic',
 'Automatic',
 'Automatic',
 'Manual',
 'Automatic',
 'Manual',
 'Automatic',
 'Manual',
 'Automatic',
 'Automatic',
 'Automatic',
 'Automatic',
 'Manual',
 'Automatic',
 'Manual',
 'Manual',
 'Manual',
 'Automatic',
 'Automatic',
 'Automatic',
 'Automatic',
 'Manual',
 'Automatic',
 'Automatic',
 'Automatic',
 'Automatic',
 'Manual',
 'Automatic',
 'Manual',
 'Manual',
 'Automatic',
 'Automatic',
 'Automatic',
 'Manual',
 'Manual',
 'Manual',
 'Manual',
 'Manual',
 'Manual',
 'Manual',
 'Manual',
 'Manual',
 'Manual',
 'Manual',
 'Automatic',
 'Manual',
 'Manual',
 'Manual',
 'Manual',
 'Manual',
 'Manual',
 'Manual',
 'Automatic',
 'Automatic',
 'Manual',
 'Manual',
 'Manu

The HF Dataset also has many methods, like unique(), map(), filter(), shuffle(), and train_test_split() (for a comprehensive list of operations, check HF's [documentation](https://huggingface.co/docs/datasets/process)).

We can use train_test_split() to split our dataset in two. By default, these splits are named train and test. The splits are actually a dataset dictionary:

In [8]:
train_test = dataset.train_test_split(train_size=0.8)
train_test

DatasetDict({
    train: Dataset({
        features: ['model', 'year', 'price', 'transmission', 'mileage', 'fuel_type', 'road_tax', 'mpg', 'engine_size', 'manufacturer'],
        num_rows: 79349
    })
    test: Dataset({
        features: ['model', 'year', 'price', 'transmission', 'mileage', 'fuel_type', 'road_tax', 'mpg', 'engine_size', 'manufacturer'],
        num_rows: 19838
    })
})

At this point, we may want to split it further in order to get a validation set as well. So, we can simply split the test set in half. The method, however, returns two sets with their default names (train and test), so afterwards we can ignore these names and build our own dataset dictionary using DatasetDict directly:

In [9]:
val_test = train_test['test'].train_test_split(train_size=0.5)
val_test

DatasetDict({
    train: Dataset({
        features: ['model', 'year', 'price', 'transmission', 'mileage', 'fuel_type', 'road_tax', 'mpg', 'engine_size', 'manufacturer'],
        num_rows: 9919
    })
    test: Dataset({
        features: ['model', 'year', 'price', 'transmission', 'mileage', 'fuel_type', 'road_tax', 'mpg', 'engine_size', 'manufacturer'],
        num_rows: 9919
    })
})

In [10]:
from datasets import DatasetDict
datasets = DatasetDict({'train': train_test['train'],  # training set from first split
                       'val': val_test['train'],      # test set from first split, split further and renamed
                       'test': val_test['test']})     # test set from first split, split further
datasets

DatasetDict({
    train: Dataset({
        features: ['model', 'year', 'price', 'transmission', 'mileage', 'fuel_type', 'road_tax', 'mpg', 'engine_size', 'manufacturer'],
        num_rows: 79349
    })
    val: Dataset({
        features: ['model', 'year', 'price', 'transmission', 'mileage', 'fuel_type', 'road_tax', 'mpg', 'engine_size', 'manufacturer'],
        num_rows: 9919
    })
    test: Dataset({
        features: ['model', 'year', 'price', 'transmission', 'mileage', 'fuel_type', 'road_tax', 'mpg', 'engine_size', 'manufacturer'],
        num_rows: 9919
    })
})

We have our three splits conveniently organized in a dataset dictionary.
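
The split sizes can be double-checked with a plain-Python sketch of the same two-step procedure. The function `split_indices` and its `seed` argument are hypothetical, and rounding the training size down is an assumption that happens to reproduce the sizes shown above; it is not meant as a description of the library's internals.

```python
import math
import random

def split_indices(n_rows, train_size, seed=13):
    # shuffle the row indices, then cut at floor(train_size * n_rows)
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_train = math.floor(train_size * n_rows)
    return indices[:n_train], indices[n_train:]

# first split: 80% for training, the rest is held out
train_idx, holdout_idx = split_indices(99187, train_size=0.8)

# second split: the held-out rows are divided in half
n_val = math.floor(0.5 * len(holdout_idx))
val_idx, test_idx = holdout_idx[:n_val], holdout_idx[n_val:]

print(len(train_idx), len(val_idx), len(test_idx))  # 79349 9919 9919
```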

#### 5.3.1.2 Encoding Categorical Attributes

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/data_step3.png)

Now, it is time to encode categorical attributes, just like we did before using Scikit-Learn's OrdinalEncoder. This time, however, we will take a different approach to it.

So, let's take a step back and imagine that we had already discussed the problem at length, and we're aware of all valid unique values for each categorical attribute. It makes sense: if we're building an app that estimates car prices, we'll eventually ask the end user to provide the characteristics of their car, most likely using dropdowns so they can choose from a list of predefined values. For example, you shouldn't let the user enter some made-up fuel type; they must choose among "petrol", "diesel", "hybrid", "other", or "electric". Each item in the dropdown corresponds to an integer value given by its position in the list so, in this example, "petrol" is zero, "diesel" is one, and so on. That's the same as sequentially encoding the fuel type.

Of course, we won't be actually designing the frontend of an app in this lab, so let's just pretend we did it, and cheat a little bit by looking at the whole data first and building dictionaries that perform the encoding described above.

The Hugging Face Dataset has many useful methods, such as unique(), which returns a list of unique values in a given column of the dataset:

In [11]:
datasets['train'].unique('fuel_type')

Flattening the indices:   0%|          | 0/79349 [00:00<?, ? examples/s]

['Petrol', 'Diesel', 'Hybrid', 'Other', 'Electric']

Next, we build dictionaries that work as "dropdowns" for our categorical attributes:

In [12]:
cont_attr = ['year', 'mileage', 'road_tax', 'mpg', 'engine_size']
cat_attr = ['model', 'transmission', 'fuel_type']

def gen_encoder_dict(dataset, col):
    values = sorted(dataset.unique(col))
    values += ['UNKNOWN']
    return dict(zip(values, range(len(values))))

dropdown_encoders = {col: gen_encoder_dict(datasets['train'], col) for col in cat_attr}

Notice that we're appending an extra UNKNOWN value as a catch-all for those cases where the input was not seen during training. This will be especially relevant for the model attribute. This solution is sub-optimal, since the corresponding encodings won't be properly trained, but it's enough for our purpose.

#### The zip() Function
The zip() function in Python is an easy and convenient way to pair up two lists together, making each element of the resulting (paired) iterator a tuple containing the two corresponding elements from the original lists. An example can probably illustrate the concept better:

odd = [1, 3, 5]
even = [2, 4, 6]
list(zip(odd, even))

If we materialize the iterator by turning it into a list, we'll get three pairs of elements, as expected:

[(1, 2), (3, 4), (5, 6)]

Let's check out one of the dropdowns:

In [13]:
dropdown_encoders['fuel_type']

{'Diesel': 0,
 'Electric': 1,
 'Hybrid': 2,
 'Other': 3,
 'Petrol': 4,
 'UNKNOWN': 5}
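
The catch-all only helps if the lookup falls back to it. The sketch below shows one way of doing that with the dictionary's `get()` method. It is a variant of the function above that takes a plain list of values instead of a dataset, `encode()` is a hypothetical helper, and "Hydrogen" is a made-up fuel type standing in for any value that is not in the dropdown.

```python
def gen_encoder_dict(values):
    # sorted unique values get consecutive integers; UNKNOWN goes last
    values = sorted(set(values)) + ["UNKNOWN"]
    return {value: idx for idx, value in enumerate(values)}

def encode(encoder, value):
    # hypothetical helper: unseen values fall back to the UNKNOWN code
    return encoder.get(value, encoder["UNKNOWN"])

fuel_encoder = gen_encoder_dict(["Petrol", "Diesel", "Hybrid", "Other", "Electric"])

print(encode(fuel_encoder, "Petrol"))    # 4
print(encode(fuel_encoder, "Hydrogen"))  # 5, the code for UNKNOWN
```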

#### 5.3.1.3 Row Output

Now that we have dropdown encoders, we need to apply them to the existing columns, that is, we need to preprocess the dataset and generate a proper output.

Remember, Hugging Face Datasets actually return a dictionary whenever they're sliced or indexed:

In [14]:
datasets['train'][0]

{'model': ' Astra',
 'year': 2017,
 'price': 8250,
 'transmission': 'Manual',
 'mileage': 48478,
 'fuel_type': 'Petrol',
 'road_tax': 145,
 'mpg': 62.8,
 'engine_size': 1.0,
 'manufacturer': None}

Moreover, these datasets can be easily modified (e.g. transforming the existing columns or adding new ones, but not removing them) using the map() method. This method applies a function - that must return a dictionary as output - to every row (a dictionary) in the dataset and merges the function's output to the original row. It is, in fact, simply merging two dictionaries.

#### The map() Method
The map() method of a Hugging Face Dataset can be used to apply a function (the argument of the method) to every row of the original dataset. The resulting object, just like the original row, is a dictionary as well. We can create a dummy Hugging Face Dataset out of a typical Python range like this:

In [25]:
from datasets import Dataset
dummy = Dataset.from_dict({'seq': list(range(10))})
dummy[:]

{'seq': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]}

If we want to multiply every number in the seq column by two and assign the result to a new column, we can simply call its map() method with the corresponding function as its argument:

In [26]:
double_dummy = dummy.map(lambda row: {'double': row['seq'] * 2})
double_dummy[:]

Map:   0%|          | 0/10 [00:00<?, ? examples/s]

{'seq': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
 'double': [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]}

If the returned dictionary uses the name of an existing column, that column is overwritten instead of a new one being added. The new values may even be lists:

In [29]:
list_dummy = dummy.map(lambda row: {'seq': [row['seq'], row['seq'] * 2]})
list_dummy[:]

Map:   0%|          | 0/10 [00:00<?, ? examples/s]

{'seq': [[0, 0],
  [1, 2],
  [2, 4],
  [3, 6],
  [4, 8],
  [5, 10],
  [6, 12],
  [7, 14],
  [8, 16],
  [9, 18]]}
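
Conceptually, the row-by-row merge can be pictured with plain Python dictionaries. The function `map_rows` below is a hypothetical stand-in that works on a list of rows; it only illustrates the merging of dictionaries and leaves out everything else the real method takes care of.

```python
def map_rows(rows, function):
    # merge each row with the dictionary returned by the function;
    # keys returned by the function take precedence over existing ones
    return [{**row, **function(row)} for row in rows]

rows = [{"seq": i} for i in range(3)]

# a new key adds a column
doubled = map_rows(rows, lambda row: {"double": row["seq"] * 2})
print(doubled)  # [{'seq': 0, 'double': 0}, {'seq': 1, 'double': 2}, {'seq': 2, 'double': 4}]

# an existing key overwrites the column
overwritten = map_rows(rows, lambda row: {"seq": [row["seq"], row["seq"] * 2]})
print(overwritten)  # [{'seq': [0, 0]}, {'seq': [1, 2]}, {'seq': [2, 4]}]
```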

So, let's use our knowledge about the data to build a preprocessing function that takes a row as input - containing nine columns of data - and produces a dictionary as output. The keys in the output dictionary will be new columns in the dataset:

In [15]:
import numpy as np

def preproc(row):
    colnames = ['model', 'year', 'price', 'transmission', 'mileage', 'fuel_type', 'road_tax', 'mpg', 'engine_size']#, 'manufacturer']

    cat_attr = ['model', 'transmission', 'fuel_type']#, 'manufacturer']
    cont_attr = ['year', 'mileage', 'road_tax', 'mpg', 'engine_size']
    target = 'price'

    cont_X = [float(row[name]) for name in cont_attr]
    cat_X = [dropdown_encoders[name].get(row[name], dropdown_encoders[name]['UNKNOWN']) for name in cat_attr]

    return {'label': np.array([float(row[target])], dtype=np.float32),
            'cont_X': np.array(cont_X, dtype=np.float32),
            'cat_X': np.array(cat_X, dtype=int)}

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/data_step4.png)

Now, let's apply the function above to every element in our dataset. Notice that calling map() on the dataset dictionary applies the same function to every split (train, validation, and test) in it:

In [16]:
datasets = datasets.map(preproc)
datasets

Map:   0%|          | 0/79349 [00:00<?, ? examples/s]

Map:   0%|          | 0/9919 [00:00<?, ? examples/s]

Map:   0%|          | 0/9919 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['model', 'year', 'price', 'transmission', 'mileage', 'fuel_type', 'road_tax', 'mpg', 'engine_size', 'manufacturer', 'label', 'cont_X', 'cat_X'],
        num_rows: 79349
    })
    val: Dataset({
        features: ['model', 'year', 'price', 'transmission', 'mileage', 'fuel_type', 'road_tax', 'mpg', 'engine_size', 'manufacturer', 'label', 'cont_X', 'cat_X'],
        num_rows: 9919
    })
    test: Dataset({
        features: ['model', 'year', 'price', 'transmission', 'mileage', 'fuel_type', 'road_tax', 'mpg', 'engine_size', 'manufacturer', 'label', 'cont_X', 'cat_X'],
        num_rows: 9919
    })
})

At this point, we don't need the original columns anymore, so we can call select_columns() to keep only the newly created columns:

In [17]:
datasets = datasets.select_columns(['label', 'cont_X', 'cat_X'])
datasets

DatasetDict({
    train: Dataset({
        features: ['label', 'cont_X', 'cat_X'],
        num_rows: 79349
    })
    val: Dataset({
        features: ['label', 'cont_X', 'cat_X'],
        num_rows: 9919
    })
    test: Dataset({
        features: ['label', 'cont_X', 'cat_X'],
        num_rows: 9919
    })
})

Let's slice our training set and check the output:

In [18]:
datasets['train'][:2]

{'label': [[8250.0], [32990.0]],
 'cont_X': [[2017.0, 48478.0, 145.0, 62.79999923706055, 1.0],
  [2020.0, 5000.0, 145.0, 50.400001525878906, 2.0]],
 'cat_X': [[25, 1, 4], [24, 3, 0]]}

That's all well and good, but we're still one step short of having our dataset ready to be used for training a model in PyTorch. We need to make the dataset produce PyTorch tensors instead of regular Python lists. There's no need to manually convert the columns inside the preprocessing function, though. We can simply call the dataset's with_format() method specifying the desired output format, torch, in our case:

In [19]:
datasets = datasets.with_format('torch')
datasets['train'][:2]

{'label': tensor([[ 8250.],
         [32990.]]),
 'cont_X': tensor([[2.0170e+03, 4.8478e+04, 1.4500e+02, 6.2800e+01, 1.0000e+00],
         [2.0200e+03, 5.0000e+03, 1.4500e+02, 5.0400e+01, 2.0000e+00]]),
 'cat_X': tensor([[25,  1,  4],
         [24,  3,  0]])}

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/data_step5.png)

Our dataset is ready. We can create data loaders, one for each split, as we have done with PyTorch's own datasets:

In [20]:
from torch.utils.data import DataLoader

dataloaders = {}
dataloaders['train'] = DataLoader(dataset=datasets['train'], batch_size=128, drop_last=True, shuffle=True)
dataloaders['val'] = DataLoader(dataset=datasets['val'], batch_size=128)
dataloaders['test'] = DataLoader(dataset=datasets['test'], batch_size=128)
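
As a quick sanity check on these settings, the arithmetic below works out how many mini-batches each loader should yield, using the split sizes shown earlier. With `drop_last=True` the incomplete last mini-batch is discarded; otherwise it is kept as a smaller one.

```python
import math

batch_size = 128
n_train, n_val = 79349, 9919

# drop_last=True: only full mini-batches are kept
train_batches = n_train // batch_size
dropped_rows = n_train - train_batches * batch_size

# drop_last=False (the default): the last mini-batch may be smaller
val_batches = math.ceil(n_val / batch_size)
last_val_batch = n_val - (val_batches - 1) * batch_size

print(train_batches, dropped_rows)  # 619 117
print(val_batches, last_val_batch)  # 78 63
```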

Let's take a mini-batch from our training set:

In [21]:
next(iter(dataloaders['train']))

{'label': tensor([[57991.],
         [11895.],
         [23880.],
         [19682.],
         [11099.],
         [11000.],
         [ 8495.],
         [ 9490.],
         [ 7091.],
         [13998.],
         [29498.],
         [ 9798.],
         [13599.],
         [16899.],
         [15491.],
         [ 9499.],
         [10290.],
         [12918.],
         [12996.],
         [19499.],
         [20006.],
         [33840.],
         [12900.],
         [10499.],
         [ 9150.],
         [ 7500.],
         [12000.],
         [17250.],
         [14950.],
         [18500.],
         [21489.],
         [16390.],
         [ 8500.],
         [14498.],
         [15500.],
         [17490.],
         [17399.],
         [31995.],
         [17300.],
         [ 8498.],
         [25780.],
         [27495.],
         [16250.],
         [11299.],
         [12980.],
         [22750.],
         [28995.],
         [18750.],
         [14100.],
         [ 5490.],
         [33899.],
         [14499.],
   

Nice! We got the desired dictionary back: each key has a tensor with 128 rows (our mini-batch size), and the categorical attributes are encoded as integers.

At this point, you're probably wondering why we didn't bother at all to standardize/scale the continuous attributes.

### 5.3.2 BatchNorm for Continuous Attributes

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/model_step1.png)

Now, let's discuss what to do with the continuous attributes.

Before, we fitted an instance of the StandardScaler on the whole training data. However, we don't necessarily need to standardize the data using statistics (mean and standard deviation) computed on the whole training set. We can standardize them using running mini-batch statistics instead! That's what [batch normalization](https://machinelearningmastery.com/batch-normalization-for-training-of-deep-neural-networks/) does.

Let's see how it works by, first, retrieving a mini-batch of data and computing the statistics of its continuous attributes:

In [22]:
import torch.nn as nn

batch = next(iter(dataloaders['train']))
batch['cont_X'].mean(axis=0), batch['cont_X'].std(axis=0, unbiased=False)

(tensor([2.0173e+03, 2.3868e+04, 1.1273e+02, 5.6383e+01, 1.7055e+00]),
 tensor([1.5049e+00, 1.9836e+04, 5.6709e+01, 1.2982e+01, 4.8819e-01]))

Now, let's create an instance of a batch norm layer, use our mini-batch as input, and compute statistics on the output:

In [23]:
bn_layer = nn.BatchNorm1d(num_features=len(cont_attr))

normalized_cont = bn_layer(batch['cont_X'])
normalized_cont.mean(axis=0), normalized_cont.std(axis=0, unbiased=False)

(tensor([6.0376e-05, 4.0978e-08, 7.4506e-09, 1.7881e-07, 2.3842e-07],
        grad_fn=<MeanBackward1>),
 tensor([1.0000, 1.0000, 1.0000, 1.0000, 1.0000], grad_fn=<StdBackward0>))

There we go! The continuous attributes of our mini-batch were standardized (or normalized, following the technique's name) so they are zero-centered and have unit standard deviation. As it turns out, the batch normalization layer keeps track of running statistics, so after seeing this one mini-batch of data, it will have some statistics of its own already:

In [24]:
bn_layer.state_dict()

OrderedDict([('weight', tensor([1., 1., 1., 1., 1.])),
             ('bias', tensor([0., 0., 0., 0., 0.])),
             ('running_mean',
              tensor([2.0173e+02, 2.3868e+03, 1.1273e+01, 5.6383e+00, 1.7055e-01])),
             ('running_var',
              tensor([1.1282e+00, 3.9658e+07, 3.2503e+02, 1.7886e+01, 9.2402e-01])),
             ('num_batches_tracked', tensor(1))])

When in training mode, it keeps updating running statistics so, after one epoch, it will have collected statistics over the whole training set. At this point, it will have statistics very close to those we would get if we had computed them over the whole training set in the first place.
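
The update of the running statistics can be sketched in plain Python for a single attribute. This assumes the layer's default settings (a momentum of 0.1, a running mean starting at zero, and a running variance starting at one), which is consistent with the output above, where the running mean after one mini-batch is one tenth of the mini-batch mean. The function `batch_norm_step` and the mini-batch of years are made up for illustration.

```python
import statistics

def batch_norm_step(values, running_mean=0.0, running_var=1.0,
                    momentum=0.1, eps=1e-5):
    mean = statistics.fmean(values)
    biased_var = statistics.pvariance(values, mu=mean)     # used to normalize
    unbiased_var = statistics.variance(values, xbar=mean)  # used for running stats
    # training mode: normalize with the statistics of the current mini-batch
    normalized = [(v - mean) / (biased_var + eps) ** 0.5 for v in values]
    # running statistics move a fraction (momentum) towards the batch statistics
    new_mean = (1 - momentum) * running_mean + momentum * mean
    new_var = (1 - momentum) * running_var + momentum * unbiased_var
    return normalized, new_mean, new_var

years = [2015.0, 2016.0, 2017.0, 2018.0, 2019.0]  # made-up mini-batch
normalized, run_mean, run_var = batch_norm_step(years)

print(run_mean, run_var)  # approximately 201.7 and 1.15
```

Calling the function repeatedly, feeding the returned running statistics back in, moves them closer and closer to the statistics of the data, which is why they end up near the values computed over the whole training set.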

Once the model is switched to evaluation mode, the batch norm layer doesn't update its internal statistics anymore, but it still normalizes new data points using those it learned during training. Batch norm layers, together with dropout layers, are classic examples of layers that behave differently depending on which mode the model is set to.

All we have to do now is to add one of these layers to normalize the inputs of our model and, optionally, after every hidden layer as well. That may raise a question: should we place the batch normalization before or after the activation function? On a theoretical level, it makes more sense to place it after the activation function, so the outputs are zero-centered. However, successful models such as Inception V3 place it before the activation function. Unfortunately, there's no straight answer to this question: the choice is yours to make.
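
The two placements can be sketched as follows. This is only an illustration: the hidden size of 16 and the single hidden layer are arbitrary choices, not the architecture used in the lab, and the input is random data with five columns standing in for the continuous attributes.

```python
import torch
import torch.nn as nn

n_cont = 5  # number of continuous attributes

# Option A: batch norm placed before the activation function
bn_before = nn.Sequential(
    nn.BatchNorm1d(n_cont),  # normalizes the raw inputs
    nn.Linear(n_cont, 16),
    nn.BatchNorm1d(16),
    nn.ReLU(),
    nn.Linear(16, 1),
)

# Option B: batch norm placed after the activation function
bn_after = nn.Sequential(
    nn.BatchNorm1d(n_cont),  # normalizes the raw inputs
    nn.Linear(n_cont, 16),
    nn.ReLU(),
    nn.BatchNorm1d(16),
    nn.Linear(16, 1),
)

x = torch.randn(128, n_cont)  # random stand-in for a mini-batch of cont_X
out_before = bn_before(x)
out_after = bn_after(x)
print(out_before.shape, out_after.shape)
```

Both variants take the same inputs and produce one prediction per row, so switching between them only requires reordering two layers.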

## 5.4 Lab 2: Price Prediction

## 5.5 Tour of High-Level Libraries

So far, we've been implementing everything ourselves, including a lot of boilerplate code such as the training loop and the early stopping.

However, there are several high-level libraries built on top of PyTorch whose goal is, in general, to remove boilerplate and/or allow users to more easily leverage advanced capabilities such as mixed precision, distributed training, and more.

Let's take a quick look at the most popular available libraries: HuggingFace Accelerate, Ignite, Catalyst, PyTorch Lightning, fast.ai, and Skorch.

### 5.5.1 HuggingFace Accelerate

[HuggingFace Accelerate](https://huggingface.co/docs/accelerate/index) is a library that allows you to leverage parallelization and distributed training with only a few lines of extra code added to your existing PyTorch workflow.

Here is a short example from its documentation (the plus signs indicate the lines added to the original code):

```diff
+ from accelerate import Accelerator
+ accelerator = Accelerator()

+ model, optimizer, training_dataloader, scheduler = accelerator.prepare(
+     model, optimizer, training_dataloader, scheduler
+ )

  for batch in training_dataloader:
      optimizer.zero_grad()
      inputs, targets = batch
      inputs = inputs.to(device)
      targets = targets.to(device)
      outputs = model(inputs)
      loss = loss_function(outputs, targets)
+     accelerator.backward(loss)
      optimizer.step()
      scheduler.step()
```

For more details, check Accelerate's [migration](https://huggingface.co/docs/accelerate/basic_tutorials/migration) documentation.

### 5.5.2 Ignite

[PyTorch Ignite](https://pytorch-ignite.ai/) is a library focused on three high-level features: an engine and event system, out-of-the-box metrics for evaluation, and built-in handlers for composing pipelines, saving artifacts, and logging. Since it focuses on the training and validation pipelines, your models, datasets, and optimizers remain in pure PyTorch.

Here's a short example from its documentation:

```python
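from ignite.engine import Engine, Events, create_supervised_evaluator
from ignite.metrics import Accuracy

# Setup training engine:
def train_step(engine, batch):
    # Users can do whatever they need on a single iteration
    # Eg. forward/backward pass for any number of models, optimizers, etc
    # ...
    pass  # placeholder so the function body is valid Python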

trainer = Engine(train_step)

# Setup single model evaluation engine
evaluator = create_supervised_evaluator(model, metrics={"accuracy": Accuracy()})

def validation():
    state = evaluator.run(validation_data_loader)
    # print computed metrics
    print(trainer.state.epoch, state.metrics)

# Run model's validation at the end of each epoch
trainer.add_event_handler(Events.EPOCH_COMPLETED, validation)

# Start the training
trainer.run(training_data_loader, max_epochs=100)
```

For more details, check Ignite's [migration](https://pytorch-ignite.ai/how-to-guides/02-convert-pytorch-to-ignite/) documentation and [code generator](https://code-generator.pytorch-ignite.ai/).

### 5.5.3 Catalyst

[Catalyst](https://catalyst-team.com/) focuses on reproducibility and rapid experimentation. It removes boilerplate code, improves readability, and offers scalability to any hardware without code changes. It is a deep learning framework and its basic building block is the Runner class, which takes care of the training loop.

Here's a short example from its documentation:
```python
from catalyst import dl

runner = dl.SupervisedRunner(
    input_key="features", output_key="logits", target_key="targets", loss_key="loss"
)

# model training
runner.train(
    model=model,
    criterion=criterion,
    optimizer=optimizer,
    loaders=loaders,
    num_epochs=1,
    callbacks=[
        dl.AccuracyCallback(input_key="logits", target_key="targets", topk=(1, 3, 5)),
        dl.PrecisionRecallF1SupportCallback(input_key="logits", target_key="targets"),
    ],
    logdir="./logs",
    valid_loader="valid",
    valid_metric="loss",
    minimize_valid_metric=True,
    verbose=True,
)
```

For more details, check Catalyst's [quick start](https://catalyst-team.github.io/catalyst/getting_started/quickstart.html) documentation.

### 5.5.4 PyTorch Lightning

[PyTorch Lightning](https://www.pytorchlightning.ai/index.html) takes care of the engineering aspects of building and training a model in PyTorch. It is a framework itself, and its basic building block is the `LightningModule` class, which acts as a model "recipe" that specifies all training details, and inherits from the typical PyTorch Module class. This means that, if you already have an implemented PyTorch workflow, your code will need to be refactored.

Here is a short example from its documentation:

```python
import torch
import torch.nn.functional as F
import pytorch_lightning as pl
from torch import nn


class LitAutoEncoder(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(28 * 28, 64),
            nn.ReLU(),
            nn.Linear(64, 3))
        self.decoder = nn.Sequential(
            nn.Linear(3, 64),
            nn.ReLU(),
            nn.Linear(64, 28 * 28))

    def forward(self, x):
        embedding = self.encoder(x)
        return embedding

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        return optimizer

    def training_step(self, train_batch, batch_idx):
        x, y = train_batch
        x = x.view(x.size(0), -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = F.mse_loss(x_hat, x)
        self.log('train_loss', loss)
        return loss

    def validation_step(self, val_batch, batch_idx):
        x, y = val_batch
        x = x.view(x.size(0), -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = F.mse_loss(x_hat, x)
        self.log('val_loss', loss)

For more details, check [this](https://github.com/Lightning-AI/lightning#pytorch-lightning-train-and-deploy-pytorch-at-scale) example of refactoring native PyTorch code into PyTorch Lightning.

### 5.5.5 fast.ai

[Fast.ai](https://docs.fast.ai/) is a library that provides both high- and low-level components for practitioners to be rapidly productive and for researchers to hack it and configure it. Its high-level components include data loaders and learners, and fast.ai applications follow the same basic steps: creating data loaders, creating a learner, calling its `fit()` method, and making predictions.

Here is a short example from its documentation:

```python
from fastai.vision.all import *

path = untar_data(URLs.PETS)/'images'

def is_cat(x): return x[0].isupper()
dls = ImageDataLoaders.from_name_func(
    path, get_image_files(path), valid_pct=0.2, seed=42,
    label_func=is_cat, item_tfms=Resize(224))

learn = vision_learner(dls, resnet34, metrics=error_rate)
learn.fine_tune(1)

img = PILImage.create('images/cat.jpg')
is_cat,_,probs = learn.predict(img)
print(f"Is this a cat?: {is_cat}.")
print(f"Probability it's a cat: {probs[1].item():.6f}")
```

For more details, check fast.ai's [migration](https://docs.fast.ai/#migrating-from-other-libraries) documentation.

### 5.5.6 Skorch

[Skorch](https://github.com/skorch-dev/skorch) is a Scikit-Learn-compatible wrapper for PyTorch models. Its goal is to make it possible to use PyTorch with Scikit-Learn. It offers classes such as `NeuralNetClassifier` and `NeuralNetRegressor` to wrap your models, which can then be used and trained like any other Scikit-Learn model.

Here's a short example from its documentation:
```python
from skorch import NeuralNetClassifier

net = NeuralNetClassifier(
    MyModule,
    max_epochs=10,
    lr=0.1,
    # Shuffle training data on each epoch
    iterator_train__shuffle=True,
)

net.fit(X, y)
y_proba = net.predict_proba(X)
```

For more details, check Skorch's [documentation](https://skorch.readthedocs.io/en/latest/).