# Goal
Let's look at an example of how to train a model using `torch.compile` and measure the performance improvements we observe.

## Model & Dataset
We will train a ResNet model to classify cats and dogs.

We use `datasets` from [HuggingFace](https://huggingface.co/docs/datasets/index) to load the cats and dogs dataset. Install the dependencies first:
```
pip install -r requirements.txt
```

# Hardware

This notebook was run on an NVIDIA A10 GPU.
Adjust the batch size (`--batch-size`) to suit the GPU you are using.

# Measuring Performance

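To measure performance gains with `torch.compile`, we write a helper function that times a callable with CUDA events and returns its result together with the elapsed time in seconds: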

```python
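import torch

def timed(fn):
    # CUDA events measure elapsed GPU time between the two record() calls
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    result = fn()
    end.record()
    # Wait for all queued GPU work, including the end event, to finish
    torch.cuda.synchronize()
    # elapsed_time() returns milliseconds; convert to seconds
    return result, start.elapsed_time(end) / 1000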
```

We wrap the `train` and `test` calls in `timed` to report the training and evaluation time of every epoch.

As a first step, we train the model without `torch.compile` and measure performance.

# Train the model without torch.compile

In [3]:
!python main.py

Found cached dataset cats_vs_dogs (/home/ubuntu/.cache/huggingface/datasets/cats_vs_dogs/default/1.0.0/d4fe9cf31b294ed8639aa58f7d8ee13fe189011837038ed9a774fde19a911fcb)
Using ResNet18
torch.compile() disabled
Training Time: 45.50344921875

Test set: Average loss: 0.0000, Accuracy: 4622/4682 (99%)

Evaluation Time: 17.61987109375
#########################################################
Training Time: 41.850609375

Test set: Average loss: 0.0000, Accuracy: 4631/4682 (99%)

Evaluation Time: 14.20183203125
#########################################################
Training Time: 41.91703515625

Test set: Average loss: 0.0000, Accuracy: 4628/4682 (99%)

Evaluation Time: 14.158279296875
#########################################################
Training Time: 41.964234375

Test set: Average loss: 0.0000, Accuracy: 4631/4682 (99%)

Evaluation Time: 14.139111328125
#########################################################
#########################################################
Total training 

We observe that the training time is around 41.9 seconds per epoch once the first epoch is excluded.

# Train the model with torch.compile

In [10]:
!python main.py --torch-compile

Found cached dataset cats_vs_dogs (/home/ubuntu/.cache/huggingface/datasets/cats_vs_dogs/default/1.0.0/d4fe9cf31b294ed8639aa58f7d8ee13fe189011837038ed9a774fde19a911fcb)
Using ResNet18
torch.compile() enabled
Training Time: 63.32085546875

Test set: Average loss: 0.0000, Accuracy: 4610/4682 (98%)

Evaluation Time: 22.271330078125
#########################################################
Training Time: 42.25797265625

Test set: Average loss: 0.0000, Accuracy: 4619/4682 (99%)

Evaluation Time: 13.9923232421875
#########################################################
Training Time: 42.20574609375

Test set: Average loss: 0.0000, Accuracy: 4624/4682 (99%)

Evaluation Time: 14.1473486328125
#########################################################
Training Time: 41.861015625

Test set: Average loss: 0.0000, Accuracy: 4632/4682 (99%)

Evaluation Time: 13.977349609375
#########################################################
#########################################################
Total trai

### We observe that the training time is around 42 seconds whether we use `torch.compile` or not

We also observe that the first epoch is slower in both runs; with `torch.compile` it also includes the one-time compilation cost. That is expected and can be ignored for comparison purposes.
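To make the comparison concrete, the following is a minimal sketch of how the per-epoch numbers could be summarised while skipping the warm-up epoch. The helper name `steady_state_epoch_time` and its `skip_epochs` parameter are illustrative assumptions, not part of `main.py`; the log lines are the `Training Time` values printed in the two runs above.

```python
import re
from statistics import median


def steady_state_epoch_time(log_text, skip_epochs=1):
    """Median 'Training Time' in seconds, ignoring the first `skip_epochs` epochs."""
    times = [float(t) for t in re.findall(r"Training Time:\s*([0-9.]+)", log_text)]
    if len(times) <= skip_epochs:
        raise ValueError("not enough epochs left after skipping the warm-up")
    return median(times[skip_epochs:])


# Training times copied from the run without torch.compile
eager_log = """
Training Time: 45.50344921875
Training Time: 41.850609375
Training Time: 41.91703515625
Training Time: 41.964234375
"""

# Training times copied from the run with torch.compile
compiled_log = """
Training Time: 63.32085546875
Training Time: 42.25797265625
Training Time: 42.20574609375
Training Time: 41.861015625
"""

eager = steady_state_epoch_time(eager_log)
compiled = steady_state_epoch_time(compiled_log)
print(f"eager: {eager:.2f} s, compiled: {compiled:.2f} s")
# prints: eager: 41.92 s, compiled: 42.21 s
```

Using the median rather than the mean keeps a single noisy epoch from skewing the comparison.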

## What's happening? Why is torch.compile not speeding up the model training?

Let's profile the forward pass, backward pass and data loading times to see what's happening.

In [11]:
!python main.py --torch-compile --profile

Found cached dataset cats_vs_dogs (/home/ubuntu/.cache/huggingface/datasets/cats_vs_dogs/default/1.0.0/d4fe9cf31b294ed8639aa58f7d8ee13fe189011837038ed9a774fde19a911fcb)
Using ResNet18
torch.compile() enabled
Median forward time (ms) 8591.79 | backward time (ms) 3444.59 | dataloader time (ms) 291.37
Median forward time (ms) 40.39 | backward time (ms) 71.37 | dataloader time (ms) 275.78
Total forward time (s) 20.09 | backward time (s) 16.65 | dataloader time (s) 40.32
Training Time: 77.14421875

Test set: Average loss: 0.0000, Accuracy: 4610/4682 (98%)

Evaluation Time: 22.210673828125
#########################################################
Median forward time (ms) 41.04 | backward time (ms) 71.45 | dataloader time (ms) 266.17
Median forward time (ms) 40.44 | backward time (ms) 71.41 | dataloader time (ms) 272.34
Total forward time (s) 5.92 | backward time (s) 10.45 | dataloader time (s) 39.79
Training Time: 56.24069140625

Test set: Average loss: 0.0000, Accuracy: 4616/4682 (99%)

Eva

We observe that data loading is the bottleneck: the DataLoader has a median time of about 270 ms per batch, versus about 40 ms for the forward pass and about 71 ms for the backward pass.

Current DataLoader config:
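The code behind `--profile` is not shown here, so the following is only a sketch of how such per-batch medians could be collected. The names `profile_epoch`, `forward`, `backward` and `sync` are hypothetical; for GPU work, `sync` would need to be `torch.cuda.synchronize` so that the timings include the kernels that were launched.

```python
import time
from statistics import median


def profile_epoch(loader, forward, backward, sync=lambda: None):
    """Return median per-batch times (ms) for data loading, forward and backward.

    `sync` must block until queued device work has finished; the default
    no-op is only appropriate for CPU-only work.
    """
    load_ms, fwd_ms, bwd_ms = [], [], []
    batches = iter(loader)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(batches)  # time spent waiting for the next batch
        except StopIteration:
            break
        t1 = time.perf_counter()
        loss = forward(batch)
        sync()
        t2 = time.perf_counter()
        backward(loss)
        sync()
        t3 = time.perf_counter()
        load_ms.append((t1 - t0) * 1000)
        fwd_ms.append((t2 - t1) * 1000)
        bwd_ms.append((t3 - t2) * 1000)
    return {
        "dataloader": median(load_ms),
        "forward": median(fwd_ms),
        "backward": median(bwd_ms),
    }


def slow_loader(n_batches, delay_s):
    """Toy loader that simulates slow data loading with a sleep."""
    for i in range(n_batches):
        time.sleep(delay_s)
        yield i


stats = profile_epoch(
    slow_loader(5, 0.02),
    forward=lambda batch: batch * 2,  # stand-in for model(batch) and the loss
    backward=lambda loss: None,       # stand-in for loss.backward() and optimizer.step()
)
print(stats)
```

With the toy loader, the `dataloader` entry dominates the other two, which is the same pattern the profile output above shows.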

```python
train_kwargs = {'batch_size': args.batch_size, 'shuffle': True}
train_loader = torch.utils.data.DataLoader(train_ds,**train_kwargs)
```

Let's add the following arguments to the DataLoader, so that batches are loaded by 4 worker processes and placed in pinned host memory:
```python
# Load batches in 4 worker processes and use pinned host memory for faster GPU copies
opt_kwargs = {'num_workers': 4,
              'pin_memory': True}
train_kwargs.update(opt_kwargs)
```

You can read more about memory pinning [here](https://pytorch.org/docs/stable/data.html#memory-pinning)

Now we train the model with the DataLoader optimizations.

# Train the model without torch.compile, with DataLoader optimizations

In [3]:
!python main.py --dl-opt

Found cached dataset cats_vs_dogs (/home/ubuntu/.cache/huggingface/datasets/cats_vs_dogs/default/1.0.0/d4fe9cf31b294ed8639aa58f7d8ee13fe189011837038ed9a774fde19a911fcb)
Using ResNet18
torch.compile() disabled
Training Time: 21.717208984375

Test set: Average loss: 0.0000, Accuracy: 4622/4682 (99%)

Evaluation Time: 10.4892626953125
#########################################################
Training Time: 18.212275390625

Test set: Average loss: 0.0000, Accuracy: 4625/4682 (99%)

Evaluation Time: 6.2697470703125
#########################################################
Training Time: 18.23256640625

Test set: Average loss: 0.0000, Accuracy: 4628/4682 (99%)

Evaluation Time: 6.308853515625
#########################################################
Training Time: 18.175515625

Test set: Average loss: 0.0000, Accuracy: 4627/4682 (99%)

Evaluation Time: 6.28328466796875
#########################################################
#########################################################
Total tr

# Train the model with torch.compile and DataLoader optimizations

In [4]:
!python main.py --torch-compile --dl-opt

Found cached dataset cats_vs_dogs (/home/ubuntu/.cache/huggingface/datasets/cats_vs_dogs/default/1.0.0/d4fe9cf31b294ed8639aa58f7d8ee13fe189011837038ed9a774fde19a911fcb)
Using ResNet18
torch.compile() enabled
Training Time: 86.2654453125

Test set: Average loss: 0.0000, Accuracy: 4614/4682 (99%)

Evaluation Time: 16.06041015625
#########################################################
Training Time: 16.964841796875

Test set: Average loss: 0.0000, Accuracy: 4625/4682 (99%)

Evaluation Time: 6.28181396484375
#########################################################
Training Time: 16.94748046875

Test set: Average loss: 0.0000, Accuracy: 4622/4682 (99%)

Evaluation Time: 6.27323193359375
#########################################################
Training Time: 16.995251953125

Test set: Average loss: 0.0000, Accuracy: 4623/4682 (99%)

Evaluation Time: 6.270625
#########################################################
#########################################################
Total training 

### With the DataLoader optimizations in place, `torch.compile` reduces the training time per epoch from 18.2 seconds to 16.9 seconds (7% speedup)

# torch.compile(mode="reduce-overhead")

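The `reduce-overhead` mode helps when the batch size is small or, for language models, when sequence lengths are short. In these cases each GPU kernel is short, so the CPU-side cost of launching operations dominates. This mode uses CUDA graphs to reduce that overhead, which can speed up training when the workload is CPU bound.

# Train the model without torch.compile, with batch-size 8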
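As a rough sketch of how the `--torch-compile` and `--reduce-overhead` flags might map onto the `torch.compile` call: the helper `maybe_compile` and its parameter names are assumptions for illustration, since the relevant part of `main.py` is not shown. The `mode` argument and the values `"default"` and `"reduce-overhead"` are part of the `torch.compile` API.

```python
def maybe_compile(model, use_compile=False, reduce_overhead=False):
    """Return `model` unchanged, or wrapped by torch.compile with the chosen mode."""
    if not use_compile:
        return model
    # Imported lazily in this sketch so the eager path has no extra requirement
    import torch

    mode = "reduce-overhead" if reduce_overhead else "default"
    return torch.compile(model, mode=mode)


# Hypothetical usage, assuming argparse fields named after the CLI flags:
# model = maybe_compile(model,
#                       use_compile=args.torch_compile,
#                       reduce_overhead=args.reduce_overhead)
```

Compilation itself is deferred until the first call of the returned module, which is why the first epoch of a compiled run is much slower than the later ones.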

In [19]:
!python main.py --dl-opt --batch-size 8

Found cached dataset cats_vs_dogs (/home/ubuntu/.cache/huggingface/datasets/cats_vs_dogs/default/1.0.0/d4fe9cf31b294ed8639aa58f7d8ee13fe189011837038ed9a774fde19a911fcb)
Using ResNet18
torch.compile() disabled
Training Time: 28.857875

Test set: Average loss: 0.0001, Accuracy: 4600/4682 (98%)

Evaluation Time: 10.028962890625
#########################################################
Training Time: 27.211642578125

Test set: Average loss: 0.0000, Accuracy: 4623/4682 (99%)

Evaluation Time: 6.24068017578125
#########################################################
Training Time: 27.135138671875

Test set: Average loss: 0.0000, Accuracy: 4639/4682 (99%)

Evaluation Time: 6.2850078125
#########################################################
Training Time: 27.312443359375

Test set: Average loss: 0.0000, Accuracy: 4637/4682 (99%)

Evaluation Time: 6.26892578125
#########################################################
#########################################################
Total training 

# Train the model with torch.compile and batch-size 8

In [17]:
!python main.py --torch-compile --dl-opt --batch-size 8

Found cached dataset cats_vs_dogs (/home/ubuntu/.cache/huggingface/datasets/cats_vs_dogs/default/1.0.0/d4fe9cf31b294ed8639aa58f7d8ee13fe189011837038ed9a774fde19a911fcb)
Using ResNet18
torch.compile() enabled
Training Time: 86.616671875

Test set: Average loss: 0.0001, Accuracy: 4585/4682 (98%)

Evaluation Time: 14.732513671875
#########################################################
Training Time: 30.162486328125

Test set: Average loss: 0.0000, Accuracy: 4632/4682 (99%)

Evaluation Time: 6.24964501953125
#########################################################
Training Time: 30.56647265625

Test set: Average loss: 0.0000, Accuracy: 4614/4682 (99%)

Evaluation Time: 6.253646484375
#########################################################
Training Time: 30.461517578125

Test set: Average loss: 0.0000, Accuracy: 4635/4682 (99%)

Evaluation Time: 6.21996826171875
#########################################################
#########################################################
Total tra

We notice that with a small batch size, `torch.compile` in its default mode increases the training time per epoch from about 27 seconds to about 30 seconds.

Let's train with `reduce-overhead` mode.

In [18]:
!python main.py --torch-compile --reduce-overhead --dl-opt --batch-size 8

Found cached dataset cats_vs_dogs (/home/ubuntu/.cache/huggingface/datasets/cats_vs_dogs/default/1.0.0/d4fe9cf31b294ed8639aa58f7d8ee13fe189011837038ed9a774fde19a911fcb)
Using ResNet18
torch.compile() enabled
Mode 'reduce-overhead' enabled
Training Time: 36.42646484375

Test set: Average loss: 0.0000, Accuracy: 4609/4682 (98%)

Evaluation Time: 15.48576953125
#########################################################
Training Time: 25.4383828125

Test set: Average loss: 0.0000, Accuracy: 4618/4682 (99%)

Evaluation Time: 6.231134765625
#########################################################
Training Time: 25.42283984375

Test set: Average loss: 0.0000, Accuracy: 4634/4682 (99%)

Evaluation Time: 6.22245068359375
#########################################################
Training Time: 25.480828125

Test set: Average loss: 0.0000, Accuracy: 4640/4682 (99%)

Evaluation Time: 6.29118603515625
#########################################################
########################################

We notice that the training time per epoch drops from about 27 seconds without `torch.compile` to about 25.4 seconds with `reduce-overhead` mode.

# Train ResNet152 model

Let's look at how much performance improvement we get when we train a bigger model.

# Train the model without torch.compile, with batch-size 64

In [2]:
!python main.py  --dl-opt --resnet152 --batch-size 64

Found cached dataset cats_vs_dogs (/home/ubuntu/.cache/huggingface/datasets/cats_vs_dogs/default/1.0.0/d4fe9cf31b294ed8639aa58f7d8ee13fe189011837038ed9a774fde19a911fcb)
Using ResNet152
torch.compile() disabled
Training Time: 134.3393125

Test set: Average loss: 0.0000, Accuracy: 4636/4682 (99%)

Evaluation Time: 24.82583984375
#########################################################
Training Time: 129.023953125

Test set: Average loss: 0.0000, Accuracy: 4642/4682 (99%)

Evaluation Time: 14.32539453125
#########################################################
Training Time: 129.015703125

Test set: Average loss: 0.0000, Accuracy: 4647/4682 (99%)

Evaluation Time: 14.0163388671875
#########################################################
Training Time: 129.0133359375

Test set: Average loss: 0.0000, Accuracy: 4648/4682 (99%)

Evaluation Time: 13.930224609375
#########################################################
#########################################################
Total training

# Train the model with torch.compile and batch-size 64

In [1]:
!python main.py --torch-compile  --dl-opt --resnet152 --batch-size 64

Found cached dataset cats_vs_dogs (/home/ubuntu/.cache/huggingface/datasets/cats_vs_dogs/default/1.0.0/d4fe9cf31b294ed8639aa58f7d8ee13fe189011837038ed9a774fde19a911fcb)
Using ResNet152
torch.compile() enabled
Training Time: 362.74515625

Test set: Average loss: 0.0000, Accuracy: 4648/4682 (99%)

Evaluation Time: 57.85243359375
#########################################################
Training Time: 116.170125

Test set: Average loss: 0.0000, Accuracy: 4658/4682 (99%)

Evaluation Time: 11.5616328125
#########################################################
Training Time: 116.1730703125

Test set: Average loss: 0.0000, Accuracy: 4658/4682 (99%)

Evaluation Time: 11.6061318359375
#########################################################
Training Time: 116.1636328125

Test set: Average loss: 0.0000, Accuracy: 4658/4682 (99%)

Evaluation Time: 11.26404296875
#########################################################
#########################################################
Total training tim

### Now we see that the training time per epoch has reduced from 129 seconds to 116 seconds (11% speedup)

### We see a larger speedup when the model is bigger
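The speedups quoted in this document can be reproduced with a short calculation. This sketch assumes "speedup" means the ratio of eager to compiled epoch time minus one; the times are approximate steady-state values read from the logs above, and the helper name `speedup_percent` is illustrative.

```python
def speedup_percent(baseline_s, optimized_s):
    """Percentage speedup of `optimized_s` relative to `baseline_s`."""
    return (baseline_s / optimized_s - 1.0) * 100.0


# (eager seconds per epoch, compiled seconds per epoch), rounded from the logs
runs = {
    "ResNet18, DataLoader opts": (18.2, 16.95),
    "ResNet18, batch size 8, default mode": (27.2, 30.4),
    "ResNet18, batch size 8, reduce-overhead": (27.2, 25.4),
    "ResNet152, batch size 64": (129.0, 116.2),
}

for name, (eager_s, compiled_s) in runs.items():
    print(f"{name}: {speedup_percent(eager_s, compiled_s):+.1f}%")
```

A negative value means the compiled run was slower, as in the batch size 8 run with the default mode.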