# Using The CachedDataset In Resource-Constrained Environments

ML model training often runs repeatedly on exactly the same dataset, with exactly the same transformations applied to the raw data.

When those transformations require a considerable amount of CPU and/or RAM, and the environment is scarce on those resources, it is possible to trade CPU/RAM for storage/network by using a *CachedDataset*.

A *CachedDataset* wraps any existing *PyTorch* *Dataset* and transparently caches the training samples, so that once the dataset is fully cached, no more CPU/RAM resources are spent processing it.

A *CachedDataset* can also prove useful when enough CPU/RAM is available: if the raw data processing performed by the input pipeline is heavy, there is still a benefit in loading the already processed data from storage.
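The caching idea described above can be illustrated with a minimal, standard-library-only sketch. This is not the `torch_xla` implementation: `SimpleCachedDataset` and `SlowSquares` are hypothetical names invented for this illustration, and `pickle` stands in for PyTorch serialization. It only shows the general pattern of processing a sample once and reading it from storage afterwards.

```python
import os
import pickle
import tempfile


class SimpleCachedDataset:
    """Illustrative stand-in for the caching pattern (not the torch_xla class)."""

    def __init__(self, source, cache_dir):
        self._source = source
        self._cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def _path(self, index):
        return os.path.join(self._cache_dir, f"{index}.pkl")

    def __len__(self):
        return len(self._source)

    def __getitem__(self, index):
        path = self._path(index)
        if os.path.exists(path):
            # Cache hit: read the processed sample back from storage.
            with open(path, "rb") as f:
                return pickle.load(f)
        # Cache miss: pay the processing cost once, then store the result.
        sample = self._source[index]
        with open(path, "wb") as f:
            pickle.dump(sample, f)
        return sample


class SlowSquares:
    """Hypothetical source dataset that counts how often it has to compute."""

    def __init__(self, n):
        self.n = n
        self.calls = 0

    def __len__(self):
        return self.n

    def __getitem__(self, index):
        self.calls += 1
        return index * index


cache_dir = tempfile.mkdtemp()
source = SlowSquares(4)
cached = SimpleCachedDataset(source, cache_dir)
first_pass = [cached[i] for i in range(len(cached))]   # fills the cache
second_pass = [cached[i] for i in range(len(cached))]  # served from disk
print(source.calls)  # the source is only consulted during the first pass
```

In this sketch the second pass never touches the source dataset, which is the CPU/RAM for storage trade described above.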


In [None]:
VERSION = "1.11"  #@param ["1.11", "nightly", "20220315"]  # or YYYYMMDD format
!curl https://raw.githubusercontent.com/pytorch/xla/master/contrib/scripts/env-setup.py -o pytorch-xla-env-setup.py
!python pytorch-xla-env-setup.py --version $VERSION
import os 
os.environ['LD_LIBRARY_PATH']='/usr/local/lib'
!echo $LD_LIBRARY_PATH

!sudo ln -s /usr/local/lib/libmkl_intel_lp64.so /usr/local/lib/libmkl_intel_lp64.so.1
!sudo ln -s /usr/local/lib/libmkl_intel_thread.so /usr/local/lib/libmkl_intel_thread.so.1
!sudo ln -s /usr/local/lib/libmkl_core.so /usr/local/lib/libmkl_core.so.1

!ldconfig
!ldd /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch.so

A *CachedDataset* can be used transparently, by wrapping an existing *PyTorch* *Dataset*:

In [None]:
import torch_xla.core.xla_model as xm
import torch_xla.utils.cached_dataset as xcd
import torch_xla.distributed.xla_multiprocessing as xmp
from torchvision import datasets, transforms

def _mp_fn(index):
  train_dataset = datasets.MNIST(
      '/tmp/mnist-data',
      train=True,
      download=True,
      transform=transforms.Compose(
              [transforms.ToTensor(),
               transforms.Normalize((0.1307,), (0.3081,))]))
  train_dataset = xcd.CachedDataset(train_dataset, '/tmp/cached-mnist-data')
  # The normal model code follows here ...


xmp.spawn(_mp_fn, args=(), start_method='fork', nprocs=1)

The following example populates a *CachedDataset* whose cache folder can then be exported to other locations:

In [None]:
import torch_xla.core.xla_model as xm
import torch_xla.utils.cached_dataset as xcd
import torch_xla.distributed.xla_multiprocessing as xmp
from torchvision import datasets, transforms

def _mp_fn(index):
  train_dataset = datasets.MNIST(
      '/tmp/mnist-data',
      train=True,
      download=True,
      transform=transforms.Compose(
              [transforms.ToTensor(),
               transforms.Normalize((0.1307,), (0.3081,))]))
  cached_dataset = xcd.CachedDataset(train_dataset, '/tmp/cached-mnist-data')
  print('Warming up ...')  
  cached_dataset.warmup()
  print('Done!')


xmp.spawn(_mp_fn, args=(), start_method='fork', nprocs=1)

The *CachedDataset* generated in **/tmp/cached-mnist-data** can then be packed and used in other setups.

A *CachedDataset* uses PyTorch serialization to save samples, so it is portable to any machine where PyTorch is installed.

Simply use *tar* to pack it:

In [None]:
!tar czf cached-mnist.tar.gz /tmp/cached-mnist-data/
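Where a shell is not available, the same packing step, and the matching unpacking on the target machine, could be sketched with Python's standard `tarfile` module. The helper names `pack_cache` and `unpack_cache` and the demo paths are assumptions made for this illustration. The demo uses a stand-in folder with a single dummy file rather than a real cache.

```python
import os
import tarfile
import tempfile


def pack_cache(cache_dir, archive_path):
    """Write cache_dir into a gzip-compressed tar, stored under its base name."""
    base_name = os.path.basename(os.path.normpath(cache_dir))
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(cache_dir, arcname=base_name)


def unpack_cache(archive_path, dest_dir):
    """Extract the archive below dest_dir and return the restored folder path.

    Only extract archives that come from a trusted source.
    """
    with tarfile.open(archive_path, "r:gz") as tar:
        top_level = tar.getnames()[0].split("/")[0]
        tar.extractall(dest_dir)
    return os.path.join(dest_dir, top_level)


# Demo with a stand-in cache folder holding one dummy file.
work = tempfile.mkdtemp()
cache_dir = os.path.join(work, "cached-demo")
os.makedirs(cache_dir)
with open(os.path.join(cache_dir, "0.bin"), "wb") as f:
    f.write(b"sample")

archive = os.path.join(work, "cached-demo.tar.gz")
pack_cache(cache_dir, archive)
restored = unpack_cache(archive, tempfile.mkdtemp())
restored_files = sorted(os.listdir(restored))
print(restored, restored_files)
```

The path returned by `unpack_cache` is the folder that would be passed as the cache path of a *CachedDataset* on the target machine.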

The fully cached *CachedDataset* can then be used on other machines, without even needing to instantiate the original *Dataset* (simply pass *None* as the source *Dataset* object):

In [None]:
import torch_xla.core.xla_model as xm
import torch_xla.utils.cached_dataset as xcd
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
  train_dataset = xcd.CachedDataset(None, '/tmp/cached-mnist-data')
  # The normal model code follows here ...


xmp.spawn(_mp_fn, args=(), start_method='fork', nprocs=1)

The XLA CachedDataset implementation natively supports GCS (Google Cloud Storage) as a storage destination/source.

Simply prefix the paths with gs:// and make sure the environment is properly set up to access GCS:

In [None]:
import os  # `!export` only affects a throwaway subshell, so set the variable in-process
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/PATH/TO/CREDENTIALS_JSON'
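Before pointing the cache at a gs:// path, it can help to confirm that the credentials variable is visible to the Python process and points at a readable JSON file. The sketch below is one possible sanity check. The helper name `gcs_credentials_problem` is an assumption made for this illustration, and it covers only the `GOOGLE_APPLICATION_CREDENTIALS` route; other authentication setups may not need this variable at all.

```python
import json
import os


def gcs_credentials_problem(env=os.environ):
    """Return a description of what looks wrong, or None if nothing does."""
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        return "GOOGLE_APPLICATION_CREDENTIALS is not set"
    if not os.path.isfile(path):
        return f"credentials file not found: {path}"
    try:
        with open(path) as f:
            json.load(f)  # only checks that the file parses as JSON
    except ValueError:
        return f"credentials file is not valid JSON: {path}"
    return None


problem = gcs_credentials_problem()
if problem:
    print("GCS access may fail:", problem)
```

The check does not contact GCS, so a passing result does not guarantee that the account has access to the bucket.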

In [None]:
import torch_xla.core.xla_model as xm
import torch_xla.utils.cached_dataset as xcd
import torch_xla.distributed.xla_multiprocessing as xmp
from torchvision import datasets, transforms

def _mp_fn(index):
  train_dataset = datasets.MNIST(
      '/tmp/mnist-data',
      train=True,
      download=True,
      transform=transforms.Compose(
              [transforms.ToTensor(),
               transforms.Normalize((0.1307,), (0.3081,))]))
  train_dataset = xcd.CachedDataset(train_dataset, 'gs://my_bucket/cached-mnist-data')
  # The normal model code follows here ...


xmp.spawn(_mp_fn, args=(), start_method='fork', nprocs=1)