# Installation
Requirements:
*   python >= 3.7

We highly recommend CUDA when using TorchRec. If using CUDA:
*   cuda >= 11.0

In [None]:
# Install conda to make installing PyTorch with cudatoolkit 11.3 easier.
!wget https://repo.anaconda.com/miniconda/Miniconda3-py37_4.9.2-Linux-x86_64.sh
!chmod +x Miniconda3-py37_4.9.2-Linux-x86_64.sh
!bash ./Miniconda3-py37_4.9.2-Linux-x86_64.sh -b -f -p /usr/local

# install pytorch with cudatoolkit 11.3
!conda install pytorch cudatoolkit=11.3 -c pytorch-nightly -y

# install torchrec
!pip3 install torchrec-nightly

The following step is needed for the Colab runtime to detect the added shared libraries. The runtime searches for shared libraries in /usr/lib, so we copy over the libraries that were installed in /usr/local/lib/. **This step is required, but only in the Colab runtime.**

In [None]:
!cp /usr/local/lib/lib* /usr/lib/

*Restart your runtime at this point so that the newly installed packages are picked up.* Run the step below immediately after restarting so that Python knows where to look for packages. Always run this step after restarting the runtime.

In [None]:
import sys
sys.path = ['', '/env/python', '/usr/local/lib/python37.zip', '/usr/local/lib/python3.7', '/usr/local/lib/python3.7/lib-dynload', '/usr/local/lib/python3.7/site-packages']
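After restarting, it can be worth confirming that the interpreter actually resolves the newly installed packages before moving on. The following is a minimal sketch using only the standard library; `is_importable` is a hypothetical helper name introduced here for illustration, not part of TorchRec.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if a top-level module can be located on the current sys.path."""
    return importlib.util.find_spec(module_name) is not None


# Report which of the packages this tutorial needs are visible to Python.
for name in ("torch", "torchrec"):
    status = "found" if is_importable(name) else "NOT found"
    print(f"{name}: {status}")
```

If either package is reported as not found, re-running the `sys.path` cell above is the first thing to try.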

# Distributed Setup

We set up our environment with torch.distributed. For more information on distributed training, see this [tutorial](https://pytorch.org/tutorials/beginner/dist_overview.html).

Here, we use one rank (the Colab process), corresponding to our single Colab GPU.

In [None]:
import os
import torch
import torchrec
import torch.distributed as dist

os.environ["RANK"] = "0"
os.environ["WORLD_SIZE"] = "1"
os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "29500"

# Note: you will need a V100 or A100 to run this tutorial!
# If using an older GPU (such as the Colab free-tier K80),
# you will need to compile fbgemm with the appropriate CUDA architecture
# or run with the "gloo" backend on CPUs.
dist.init_process_group(backend="nccl")
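The comment in the cell above mentions falling back to "gloo" on CPUs. A minimal sketch of that choice follows; `pick_backend` is a hypothetical helper, and the UVM sections that follow still assume a CUDA device, so this only helps with getting the process group initialized.

```python
def pick_backend(cuda_available: bool) -> str:
    """Choose a torch.distributed backend name: NCCL for GPUs, Gloo for CPUs."""
    return "nccl" if cuda_available else "gloo"


print(pick_backend(True))   # nccl
print(pick_backend(False))  # gloo
```

In the notebook this could be used as `dist.init_process_group(backend=pick_backend(torch.cuda.is_available()))`.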

# Unified Virtual Memory (UVM)
[UVM](https://developer.nvidia.com/blog/unified-memory-cuda-beginners/) supports many interesting features. In this tutorial, we are interested in its over-subscription capability, which lets an allocation exceed GPU memory. We first construct an embedding table that is 2x larger than what the GPU can hold (e.g., for a 40GB A100, we allocate an 80GB embedding table).

In [None]:
gpu_device = torch.device("cuda")
hbm_cap_2x = 2 * torch.cuda.get_device_properties(gpu_device).total_memory

embedding_dim = 8
# By default, each element is FP32, hence, we divide by sizeof(FP32) == 4.
num_embeddings = hbm_cap_2x // 4 // embedding_dim
ebc = torchrec.EmbeddingBagCollection(
    device="meta",
    tables=[
        torchrec.EmbeddingBagConfig(
            name="large_table",
            embedding_dim=embedding_dim,
            num_embeddings=num_embeddings,
            feature_names=["my_feature"],
            pooling=torchrec.PoolingType.SUM,
        ),
    ],
)
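The sizing arithmetic in the cell above can be checked without a GPU. The sketch below assumes a device reporting exactly 40 GiB of memory (a real A100 reports a slightly different `total_memory`), and `rows_for_budget` is a hypothetical helper that mirrors the integer divisions used for `num_embeddings`.

```python
GIB = 1024 ** 3


def rows_for_budget(total_bytes: int, embedding_dim: int, bytes_per_element: int = 4) -> int:
    """Number of table rows that fit in total_bytes (FP32 elements by default)."""
    return total_bytes // bytes_per_element // embedding_dim


hbm_bytes = 40 * GIB           # assumed GPU memory size, for illustration only
hbm_cap_2x = 2 * hbm_bytes     # target table size: twice the GPU memory
num_rows = rows_for_budget(hbm_cap_2x, embedding_dim=8)
table_bytes = num_rows * 8 * 4  # rows * embedding_dim * sizeof(FP32)

print(num_rows)           # 2684354560
print(table_bytes / GIB)  # 80.0
```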

To avoid out-of-memory errors, we can force the use of UVM by adding a constraint that restricts the planner to the BATCHED_FUSED_UVM compute kernel. DistributedModelParallel then shards the model according to this constraint. The BATCHED_FUSED_UVM kernel places the embedding table in UVM, which allocates the table in host memory rather than GPU memory. When the GPU accesses the embedding table, it fetches the required data on demand at page granularity. Expect this to be slower than keeping the table entirely in GPU memory.

In [None]:
from typing import cast

from torchrec.distributed.planner import EmbeddingShardingPlanner, Topology
from torchrec.distributed.planner.types import ParameterConstraints
from torchrec.distributed.embedding_types import EmbeddingComputeKernel
from torchrec.distributed.embeddingbag import EmbeddingBagCollectionSharder
from torchrec.distributed.types import ModuleSharder


topology = Topology(world_size=1, compute_device="cuda")
constraints = {
    "large_table": ParameterConstraints(
        sharding_types=["table_wise"],
        compute_kernels=[EmbeddingComputeKernel.BATCHED_FUSED_UVM.value],
    )
}
plan = EmbeddingShardingPlanner(
    topology=topology, constraints=constraints
).plan(
    ebc, [cast(ModuleSharder[torch.nn.Module], EmbeddingBagCollectionSharder())]
)

uvm_model = torchrec.distributed.DistributedModelParallel(
    ebc,
    device=torch.device("cuda"),
    plan=plan
)

# Notice "batched_fused_uvm" in compute_kernel.
print(uvm_model.plan)

# Notice ShardedEmbeddingBagCollection.
print(uvm_model)
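To get a feel for why page-granularity fetching is slower, the sketch below counts how many pages a handful of row lookups would touch. It is a back-of-the-envelope model, not how the driver works internally: the 4 KiB page size is an assumption (the actual migration granularity depends on the driver and hardware), rows are assumed to be stored contiguously from a page-aligned base, and `pages_touched` is a hypothetical helper.

```python
PAGE_BYTES = 4096   # assumed page size, for illustration only
ROW_BYTES = 8 * 4   # embedding_dim * sizeof(FP32), as in this tutorial


def pages_touched(row_ids, row_bytes=ROW_BYTES, page_bytes=PAGE_BYTES):
    """Return the set of page indices that hold the given table rows."""
    rows_per_page = page_bytes // row_bytes
    return {row // rows_per_page for row in row_ids}


rows = [101, 202, 303, 404, 505, 606]
pages = pages_touched(rows)
fetched_bytes = len(pages) * PAGE_BYTES  # data moved if every page is a miss
useful_bytes = len(rows) * ROW_BYTES     # data the lookup actually needs

print(sorted(pages))                # [0, 1, 2, 3, 4]
print(fetched_bytes, useful_bytes)  # 20480 192
```

Under these assumptions, six 32-byte rows pull in five whole pages, which is the kind of overhead that sparse, scattered lookups can incur.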

Let's test whether we can run the model even though the table is larger than GPU memory.

In [None]:
mb = torchrec.KeyedJaggedTensor(
    keys = ["my_feature"],
    values = torch.tensor([101, 202, 303, 404, 505, 606]).cuda(),
    lengths = torch.tensor([2, 0, 1, 1, 1, 1], dtype=torch.int64).cuda(),
)
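The `lengths` tensor tells how many consecutive entries of `values` belong to each sample. The pure-Python sketch below mirrors that layout for the single key used here; `split_by_lengths` is a hypothetical helper for illustration, not a TorchRec function.

```python
def split_by_lengths(values, lengths):
    """Partition a flat list of ids into per-sample bags using their lengths."""
    if sum(lengths) != len(values):
        raise ValueError("lengths must sum to the number of values")
    bags, offset = [], 0
    for n in lengths:
        bags.append(values[offset:offset + n])
        offset += n
    return bags


values = [101, 202, 303, 404, 505, 606]
lengths = [2, 0, 1, 1, 1, 1]
bags = split_by_lengths(values, lengths)
print(bags)  # [[101, 202], [], [303], [404], [505], [606]]
```

With one key and six lengths, the batch holds six samples; the second sample has no ids, so its bag is empty.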

Because ShardedEmbeddingBagCollection returns an EmbeddingCollectionAwaitable, call wait() to obtain the tensor.

In [None]:
uvm_result = uvm_model(mb).wait()
print(uvm_result)

# UVM Caching
When distributing in a model-parallel fashion, the default behavior is to use UVM caching whenever the embedding table size exceeds GPU memory. UVM caching adds a software-managed cache on the GPU that stores data at table-row granularity. If the same table row is accessed frequently, back to back, the accesses hit in the cache and achieve GPU-memory performance. Otherwise, a cache miss occurs and the row must be fetched from host memory, giving UVM performance.

In [None]:
default_model = torchrec.distributed.DistributedModelParallel(
    ebc,
    device=torch.device("cuda"),
)

# Notice "batched_fused_uvm_caching" in compute_kernel.
print(default_model.plan)

In [None]:
# Same as the UVM case: an EmbeddingCollectionAwaitable is returned, so call wait() to obtain the tensor.
default_result = default_model(mb).wait()
print(default_result)
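The hit and miss behavior described at the start of this section can be illustrated with a toy row cache. This is a sketch under stated assumptions, not the kernel's actual implementation: it uses a plain LRU eviction policy chosen for simplicity, and `RowCache` is a hypothetical class.

```python
from collections import OrderedDict


class RowCache:
    """Toy row-granularity cache with LRU eviction (assumed policy)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.rows = OrderedDict()  # row_id -> placeholder for cached row data
        self.hits = 0
        self.misses = 0

    def access(self, row_id: int) -> bool:
        """Record one lookup; return True on a cache hit."""
        if row_id in self.rows:
            self.rows.move_to_end(row_id)  # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1                   # row would be fetched from host memory
        self.rows[row_id] = True
        if len(self.rows) > self.capacity:
            self.rows.popitem(last=False)  # evict the least recently used row
        return False


cache = RowCache(capacity=2)
for row in [101, 101, 101, 202, 101, 303, 404, 101]:
    cache.access(row)

print(cache.hits, cache.misses)  # 3 5
```

The repeated lookups of row 101 at the start hit in the cache, while the later scattered lookups evict it and turn the final access into a miss.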

The caching ratio of UVM caching can also be controlled. By default it is 0.2, which means the software-managed cache is 20% of the embedding table size. To reduce it further, pass the desired ratio as a constraint.

In [None]:
uvm_caching_constraints = {
    "large_table": ParameterConstraints(
        sharding_types=["table_wise"],
        compute_kernels=[EmbeddingComputeKernel.BATCHED_FUSED_UVM_CACHING.value],
        caching_ratio=0.05,
    )
}
uvm_caching_plan = EmbeddingShardingPlanner(
    topology=topology, constraints=uvm_caching_constraints
).plan(
    ebc, [cast(ModuleSharder[torch.nn.Module], EmbeddingBagCollectionSharder())]
)

uvm_caching_model = torchrec.distributed.DistributedModelParallel(
    ebc,
    device=torch.device("cuda"),
    plan=uvm_caching_plan
)

print(uvm_caching_model.plan)

In [None]:
uvm_caching_result = uvm_caching_model(mb).wait()
print(uvm_caching_result)
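To see what the caching ratio means in absolute terms, the sketch below converts a ratio into a cache size. It assumes the 80 GiB example table from earlier (embedding_dim of 8, FP32), and `cache_rows` is a hypothetical helper; the exact rounding used by the underlying kernel may differ.

```python
GIB = 1024 ** 3
ROW_BYTES = 8 * 4  # embedding_dim * sizeof(FP32)


def cache_rows(num_embeddings: int, caching_ratio: float) -> int:
    """Approximate number of table rows that fit in the software-managed cache."""
    return round(num_embeddings * caching_ratio)


num_embeddings = 2_684_354_560  # rows in an 80 GiB table, assumed for illustration

for ratio in (0.2, 0.05):
    rows = cache_rows(num_embeddings, ratio)
    print(ratio, rows, rows * ROW_BYTES / GIB)
# 0.2 536870912 16.0
# 0.05 134217728 4.0
```

Under these assumptions, lowering the ratio from 0.2 to 0.05 shrinks the cache from 16 GiB to 4 GiB of GPU memory, at the cost of more lookups falling through to host memory.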