### Welcome to the functorch tutorial on ensembling models, in colab.

## Configuring your colab to run functorch 


**Getting set up** - running functorch currently requires a PyTorch nightly build.  
Thus we'll go through a PyTorch nightly install and then build functorch.

After that and a restart, you'll be ready to run the tutorial here on Colab.

Let's set up a restart function:

In [1]:
def colab_restart():
  print("--> Restarting colab instance") 
  get_ipython().kernel.do_shutdown(True)

Next, let's confirm that we have a GPU.  
(If not, select Runtime -> Change runtime type above,
 and select GPU under Hardware accelerator.)

In [2]:
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0
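
A GPU check can also be made from Python. The snippet below is a minimal sketch, assuming `torch` is importable in the current runtime; the `device` name is only an illustration and falls back to the CPU when no GPU is visible.

```python
import torch

# Pick the GPU when PyTorch can see one, otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"torch {torch.__version__}, using device: {device}")

if device == "cuda":
    # Name of the first visible GPU.
    print(torch.cuda.get_device_name(0))
```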


Let's remove the default PyTorch install:

In [3]:
!pip uninstall -y torch

Found existing installation: torch 1.10.0+cu111
Uninstalling torch-1.10.0+cu111:
  Successfully uninstalled torch-1.10.0+cu111


And install the relevant nightly version. (This defaults to CUDA 11.1, which works on most Colab instances.)

In [4]:
cuda_version = "cu111"  # optionally use "cu113" if the nvcc output above lists CUDA 11.3

In [5]:
!pip install --pre torch -f https://download.pytorch.org/whl/nightly/{cuda_version}/torch_nightly.html --upgrade

Looking in links: https://download.pytorch.org/whl/nightly/cu111/torch_nightly.html
Collecting torch
  Downloading https://download.pytorch.org/whl/nightly/cu111/torch-1.12.0.dev20220217%2Bcu111-cp37-cp37m-linux_x86_64.whl (1923.7 MB)

Let's install Ninja to accelerate the functorch building process:

In [6]:
!pip install ninja

Collecting ninja
  Downloading ninja-1.10.2.3-py2.py3-none-manylinux_2_5_x86_64.manylinux1_x86_64.whl (108 kB)
[?25l[K     |███                             | 10 kB 15.0 MB/s eta 0:00:01[K     |██████                          | 20 kB 9.2 MB/s eta 0:00:01[K     |█████████                       | 30 kB 6.1 MB/s eta 0:00:01[K     |████████████▏                   | 40 kB 3.6 MB/s eta 0:00:01[K     |███████████████▏                | 51 kB 3.5 MB/s eta 0:00:01[K     |██████████████████▏             | 61 kB 4.2 MB/s eta 0:00:01[K     |█████████████████████▏          | 71 kB 4.4 MB/s eta 0:00:01[K     |████████████████████████▎       | 81 kB 5.0 MB/s eta 0:00:01[K     |███████████████████████████▎    | 92 kB 5.2 MB/s eta 0:00:01[K     |██████████████████████████████▎ | 102 kB 4.2 MB/s eta 0:00:01[K     |████████████████████████████████| 108 kB 4.2 MB/s 
[?25hInstalling collected packages: ninja
Successfully installed ninja-1.10.2.3


Next, we'll install and build functorch (this takes about 6 minutes):

In [7]:
!pip install --user "git+https://github.com/pytorch/functorch.git"

Collecting git+https://github.com/pytorch/functorch.git
  Cloning https://github.com/pytorch/functorch.git to /tmp/pip-req-build-fu9dydpo
  Running command git clone -q https://github.com/pytorch/functorch.git /tmp/pip-req-build-fu9dydpo
Building wheels for collected packages: functorch
  Building wheel for functorch (setup.py) ... done
  Created wheel for functorch: filename=functorch-0.2.0a0+8915608-cp37-cp37m-linux_x86_64.whl size=21388647 sha256=7df53c10ec474b040d0ab8d774b29c400af196b80303a8f0341d21732c46e28d
  Stored in directory: /tmp/pip-ephem-wheel-cache-r01u5lf5/wheels/b0/a9/4a/ffec50dda854c8d9f2ba21e4ffc0f2489ea97946cb1102c5ab
Successfully built functorch
Installing collected packages: functorch
Successfully installed functorch-0.2.0a0+8915608


Finally, restart Colab. After the restart, skip directly down to the '-- Tutorial Start --' section to get underway.

In [8]:
colab_restart() 

--> Restarting colab instance


## -- Tutorial Start -- 



In [1]:
# Confirm we are ready to start.
# If this import fails, complete the 'Configuring your colab' steps above first, then return here.

import functorch    

# Model Ensembling

This example illustrates how to vectorize model ensembling, using vmap.





**What is model ensembling?**

Model ensembling combines the predictions from multiple models together. 

Traditionally this is done by running each model on some inputs separately and then combining the predictions. 

However, if you’re running models with the same architecture, then it may be possible to combine them together using vmap. vmap is a function transform that maps functions across dimensions of the input tensors. 

One of its use cases is eliminating for-loops and speeding them up through vectorization.
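
To make the idea concrete before moving to models, here is a minimal sketch of vmap replacing a for-loop. The `dot` helper and the tensor shapes are illustrative assumptions, not part of the tutorial's model code. The function is written for a single pair of vectors, and vmap applies it across the leading dimension of both inputs.

```python
import torch
from functorch import vmap

torch.manual_seed(0)

def dot(x, y):
    # Written for a single pair of 1-D vectors; it knows nothing about batches.
    return torch.dot(x, y)

xs = torch.randn(5, 3)
ys = torch.randn(5, 3)

# For-loop version: one call per row.
loop_result = torch.stack([dot(x, y) for x, y in zip(xs, ys)])

# vmap version: a single call mapped over dimension 0 of both inputs.
vmap_result = vmap(dot)(xs, ys)

assert torch.allclose(loop_result, vmap_result)
```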



Let’s demonstrate how to do this using an ensemble of simple CNNs.

In [3]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from functools import partial
torch.manual_seed(0);

In [4]:
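# Here's a simple CNN for 1-channel 28x28 images (MNIST-sized inputs).

class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)   # 1 -> 32 channels, 3x3 kernel, stride 1
        self.conv2 = nn.Conv2d(32, 64, 3, 1)  # 32 -> 64 channels, 3x3 kernel, stride 1
        self.fc1 = nn.Linear(9216, 128)       # 9216 = 64 channels * 12 * 12 after pooling
        self.fc2 = nn.Linear(128, 10)         # 10 output classes

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = F.max_pool2d(x, 2)
        x = torch.flatten(x, 1)  # keep the batch dimension, flatten the rest
        x = self.fc1(x)
        x = F.relu(x)
        x = self.fc2(x)
        # Return raw logits; apply F.log_softmax(x, dim=1) downstream if
        # log-probabilities are needed.
        return x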

Let’s generate a batch of dummy data and pretend that we’re working with an MNIST dataset.  

Thus, the dummy images are 28 by 28, and we have a minibatch of size 64.
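
With 28 by 28 inputs, the `fc1` input size of 9216 in `SimpleCNN` follows from the layer shapes. The arithmetic can be sketched in plain Python; the `conv_out` helper is a hypothetical name introduced here and uses the standard output-size formula for an unpadded, undilated convolution.

```python
def conv_out(size, kernel, stride=1, padding=0):
    # Output side length of a convolution along one spatial dimension.
    return (size + 2 * padding - kernel) // stride + 1

side = 28                      # MNIST-sized input
side = conv_out(side, 3)       # after conv1 (3x3, stride 1): 26
side = conv_out(side, 3)       # after conv2 (3x3, stride 1): 24
side = side // 2               # after max_pool2d with a 2x2 window: 12
flat_features = 64 * side * side  # 64 channels, flattened

print(flat_features)
```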

Furthermore, let's say we want to combine the predictions from 10 different models.


In [5]:
device = 'cuda'

num_models = 10

data = torch.randn(100, 64, 1, 28, 28, device=device)
targets = torch.randint(10, (6400,), device=device)

models = [SimpleCNN().to(device) for _ in range(num_models)]


We have a couple of options for generating predictions. 

Maybe we want to give each model a different randomized minibatch of data. 

Alternatively, maybe we want to run the same minibatch of data through each model (e.g. if we were testing the effect of different model initializations).





Option 1: different minibatch for each model

In [10]:
minibatches = data[:num_models]
predictions_diff_minibatch_loop = [model(minibatch) for model, minibatch in zip(models, minibatches)]

Option 2: Same minibatch

In [7]:
minibatch = data[0]
predictions2 = [model(minibatch) for model in models]
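
The tutorial does not prescribe how to combine the per-model predictions. One common choice, shown here only as a sketch, is to average the class probabilities across models. The `logits` tensor below is a random stand-in with the same shape as `torch.stack(predictions2)`, assuming each model returns raw logits of shape `[64, 10]`.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
num_models, batch_size, num_classes = 10, 64, 10

# Stand-in for torch.stack(predictions2): one set of logits per model.
logits = torch.randn(num_models, batch_size, num_classes)

probs = F.softmax(logits, dim=-1)         # per-model class probabilities
ensemble_probs = probs.mean(dim=0)        # average over the model dimension
ensemble_labels = ensemble_probs.argmax(dim=-1)  # one predicted class per example
```

Other combination rules (majority vote, weighted averages) slot into the same place, since they all reduce over the leading model dimension.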

# Using vmap to vectorize the ensemble






Let’s use vmap to speed up the for-loop. We must first prepare the models for use with vmap.

First, let’s combine the states of the models together by stacking each parameter. For example, models[i].fc1.weight has shape [128, 9216] (nn.Linear stores its weight as [out_features, in_features]); we are going to stack the .fc1.weight of each of the 10 models to produce a big weight of shape [10, 128, 9216].

functorch offers the 'combine_state_for_ensemble' convenience function to do that. It returns a stateless version of the model (fmodel) and stacked parameters and buffers.
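
To see what stacking means for a single layer, the sketch below stacks the parameters of ten standalone `nn.Linear(9216, 128)` layers by hand. This only illustrates the resulting shapes; it is not a claim about how `combine_state_for_ensemble` is implemented, and the `stacked` dictionary is a name introduced for this example.

```python
import torch
import torch.nn as nn

num_models = 10
layers = [nn.Linear(9216, 128) for _ in range(num_models)]

# Stack each named parameter along a new leading "model" dimension.
names = [name for name, _ in layers[0].named_parameters()]
stacked = {
    name: torch.stack([dict(layer.named_parameters())[name].detach() for layer in layers])
    for name in names
}

for name, tensor in stacked.items():
    print(name, tuple(tensor.shape))
```

Note that `nn.Linear` stores its weight as `[out_features, in_features]`, so the stacked weight has the model dimension first, followed by 128 and then 9216.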



In [8]:
from functorch import combine_state_for_ensemble

fmodel, params, buffers = combine_state_for_ensemble(models)
for p in params:
    p.requires_grad_()  # mark the stacked parameters as requiring gradients


Option 1: get predictions using a different minibatch for each model. 

By default, vmap maps a function across the first dimension of all inputs to the passed-in function. 

After calling combine_state_for_ensemble, each of the params and buffers has an additional dimension of size 'num_models' at the front, and minibatches also has a leading dimension of size 'num_models'.






In [12]:
print([p.size(0) for p in params]) # show the leading 'num_models' dimension

assert minibatches.shape == (num_models, 64, 1, 28, 28) # verify minibatch has leading dimension of size 'num_models'

[10, 10, 10, 10, 10, 10, 10, 10]


In [11]:
from functorch import vmap

predictions1_vmap = vmap(fmodel)(params, buffers, minibatches)

# verify the vmap predictions match the for-loop predictions
assert torch.allclose(predictions1_vmap, torch.stack(predictions_diff_minibatch_loop), atol=1e-3, rtol=1e-5)

Option 2: get predictions using the same minibatch of data.

vmap has an in_dims arg that specifies which dimensions to map over. 

By using None, we tell vmap we want the same minibatch to apply for all of the 10 models.
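
The effect of `in_dims` is easiest to see on a tiny function. In this sketch (the `scale` function and its inputs are illustrative assumptions), the first argument is mapped over dimension 0 while the second is passed unchanged to every call, mirroring how a single minibatch is shared across all models.

```python
import torch
from functorch import vmap

def scale(w, x):
    # w is a single scalar weight inside the mapped call; x is the shared input.
    return w * x

ws = torch.tensor([1.0, 2.0, 3.0])  # one weight per "model"
x = torch.tensor([1.0, 2.0])        # shared by every mapped call

# Map over dim 0 of ws; None means x is not mapped and is reused as-is.
out = vmap(scale, in_dims=(0, None))(ws, x)
print(out.shape)  # torch.Size([3, 2])
```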




In [13]:
predictions2_vmap = vmap(fmodel, in_dims=(0, 0, None))(params, buffers, minibatch)

assert torch.allclose(predictions2_vmap, torch.stack(predictions2), atol=1e-3, rtol=1e-5)

A quick note: there are limitations around what types of functions can be transformed by vmap. 

The best functions to transform are pure functions: functions whose outputs are determined only by their inputs and that have no side effects (e.g. mutation).

vmap is unable to handle mutation of arbitrary Python data structures, but it is able to handle many in-place PyTorch operations.
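
The sketch below shows why Python-level side effects are a poor fit. The `double_and_log` function and the `calls` list are hypothetical names for illustration. vmap invokes the Python function once on a batched tensor rather than once per element, so the list is appended to a single time even though four elements are processed.

```python
import torch
from functorch import vmap

calls = []

def double_and_log(x):
    calls.append("called")  # Python side effect, invisible to vmap
    return x * 2

out = vmap(double_and_log)(torch.ones(4))

print(len(calls))  # 1, not 4
```

A for-loop over the same four elements would append four times, so code that relies on such side effects behaves differently once it is vectorized.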



In general, vectorization with vmap should be faster than running a function in a for-loop and competitive with manual batching. 

There are some exceptions though, such as when we haven’t implemented the vmap rule for a particular operation or when the underlying kernels aren’t optimized for older hardware (GPUs).

If you see any of these cases, please let us know by opening an issue at our [GitHub](https://github.com/pytorch/functorch)!
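
To check the speed claim on your own hardware, a rough timing comparison can be sketched as follows. This is an illustrative CPU-only setup with small `nn.Linear` models, not the tutorial's CNN ensemble; the `time_it` helper is a hypothetical name, and the numbers depend heavily on model size and device. On a GPU, `torch.cuda.synchronize()` would also be needed before reading the clock.

```python
import time
import torch
import torch.nn as nn
from functorch import combine_state_for_ensemble, vmap

torch.manual_seed(0)
models = [nn.Linear(16, 4) for _ in range(8)]
x = torch.randn(32, 16)  # one minibatch shared by all models

fmodel, params, buffers = combine_state_for_ensemble(models)

def run_loop():
    with torch.no_grad():
        return torch.stack([model(x) for model in models])

def run_vmap():
    with torch.no_grad():
        return vmap(fmodel, in_dims=(0, 0, None))(params, buffers, x)

def time_it(fn, repeats=20):
    fn()  # warm-up call, excluded from the measurement
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats

loop_time = time_it(run_loop)
vmap_time = time_it(run_vmap)
print(f"for-loop: {loop_time * 1e6:.1f} us per call, vmap: {vmap_time * 1e6:.1f} us per call")
```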

