# **Part 1: Run MobileNet on GPU**

In this tutorial, we will explore how to train a neural network with PyTorch.

### Setup (5%)

We first install the packages used in this tutorial and register the path of the CUDA libraries:

In [1]:
!pip install torchprofile 1>/dev/null
!ldconfig /usr/lib64-nvidia 2>/dev/null
!pip install onnx 1>/dev/null
!pip install onnxruntime 1>/dev/null
!pip install tqdm 1>/dev/null

We will then import a few libraries:

In [2]:
import random

import numpy as np
import torch
import torchvision
from torch import nn
from torch.optim import *
from torch.optim.lr_scheduler import *
from torch.utils.data import DataLoader
from torchprofile import profile_macs
from torchvision.datasets import *
from torchvision.transforms import *
from tqdm.auto import tqdm



In [3]:
print(torch.__version__)
print(torchvision.__version__)

2.6.0+cu124
0.21.0+cu124


To ensure reproducibility, we fix the seeds of the random number generators:

In [4]:
random.seed(0)
np.random.seed(0)
torch.manual_seed(0)

<torch._C.Generator at 0x75377bff85d0>

We must choose the hyperparameters before training the model:

In [5]:
NUM_CLASSES = 10

# TODO:
# Decide your own hyper-parameters
BATCH_SIZE = 128
LEARNING_RATE = 0.0015
NUM_EPOCH = 40

### Data  (5%)

In this lab, we will use CIFAR-10 as our target dataset. This dataset contains images from 10 classes, where each image is of
size 3x32x32, i.e. 3-channel color images of 32x32 pixels in size.

Before feeding the data to the model, we pre-process it with a transform pipeline:

In [6]:
# TODO:
# Resize images to 224x224, i.e., the input image size of MobileNet,
# Convert images to PyTorch tensors, and
# Normalize the images with mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])


dataset = {}
for split in ["train", "test"]:
  dataset[split] = CIFAR10(
    root="data/cifar10",
    train=(split == "train"),
    download=True,
    transform=transform,
  )
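As a rough illustration of what the `Normalize` step does to each pixel, the sketch below applies the per-channel formula `(x - mean) / std` in plain Python. The helper name `normalize_pixel` is hypothetical and only mirrors the arithmetic; it is not part of torchvision.

```python
# Per-channel statistics used in the transform above (the usual ImageNet values).
MEAN = [0.485, 0.456, 0.406]
STD = [0.229, 0.224, 0.225]


def normalize_pixel(rgb):
    """Apply (x - mean) / std to one RGB pixel whose values lie in [0, 1]."""
    return [(x - m) / s for x, m, s in zip(rgb, MEAN, STD)]


# A pixel equal to the channel means maps to zero in every channel.
print(normalize_pixel([0.485, 0.456, 0.406]))
# A pure white pixel (1.0 after ToTensor) maps to a positive value per channel.
print(normalize_pixel([1.0, 1.0, 1.0]))
```

These are the statistics the ImageNet-pretrained weights were trained with, which is why the same values are reused here even though the images come from CIFAR-10.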

To train a neural network, we will need to feed data in batches.

We create data loaders with the batch size chosen in the Setup section:

In [7]:
dataflow = {}
for split in ['train', 'test']:
  dataflow[split] = DataLoader(
    dataset[split],
    batch_size=BATCH_SIZE,
    shuffle=(split == 'train'),
    num_workers=0,
    pin_memory=True,
    drop_last=True
  )

We can print the data type and shape from the training data loader:

In [8]:
for inputs, targets in dataflow["train"]:
  print(f"[inputs] dtype: {inputs.dtype}, shape: {inputs.shape}")
  print(f"[targets] dtype: {targets.dtype}, shape: {targets.shape}")
  break

[inputs] dtype: torch.float32, shape: torch.Size([128, 3, 224, 224])
[targets] dtype: torch.int64, shape: torch.Size([128])
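To see how many batches each loader yields per pass, the following sketch reproduces the counting rule for `drop_last`. It assumes the standard CIFAR-10 split sizes (50,000 training and 10,000 test images); `num_batches` is a hypothetical helper, not a PyTorch function.

```python
import math


def num_batches(num_samples, batch_size, drop_last):
    """Number of batches a loader yields in one pass over the data."""
    if drop_last:
        # The final partial batch is discarded.
        return num_samples // batch_size
    # The final partial batch is kept.
    return math.ceil(num_samples / batch_size)


BATCH_SIZE = 128
TRAIN_SIZE, TEST_SIZE = 50_000, 10_000  # assumed standard CIFAR-10 split sizes

train_batches = num_batches(TRAIN_SIZE, BATCH_SIZE, drop_last=True)
test_batches = num_batches(TEST_SIZE, BATCH_SIZE, drop_last=True)
print("train batches:", train_batches)
print("test batches:", test_batches)
print("test images evaluated:", test_batches * BATCH_SIZE)
```

Under these assumptions, `drop_last=True` on the test split means accuracy is computed over 9,984 of the 10,000 test images; using `drop_last=False` for the test loader would cover all of them.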


### Model (10%)

In this tutorial, we will import MobileNet provided by torchvision, and use the pre-trained weight:

In [9]:
# TODO:
# Load pre-trained MobileNetV2
from torchvision.models import mobilenet_v2, MobileNet_V2_Weights
model = mobilenet_v2(weights=MobileNet_V2_Weights.IMAGENET1K_V1)
print(model)



MobileNetV2(
  (features): Sequential(
    (0): Conv2dNormActivation(
      (0): Conv2d(3, 32, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): ReLU6(inplace=True)
    )
    (1): InvertedResidual(
      (conv): Sequential(
        (0): Conv2dNormActivation(
          (0): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=32, bias=False)
          (1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (2): ReLU6(inplace=True)
        )
        (1): Conv2d(32, 16, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (2): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (2): InvertedResidual(
      (conv): Sequential(
        (0): Conv2dNormActivation(
          (0): Conv2d(16, 96, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (1): BatchNorm2d(96, eps=

You should observe that the output dimension of the classifier does not match the number of classes in CIFAR-10.

Now change the output dimension of the classifier to the number of classes:

In [10]:
# TODO:
# Change the output dimension of the classifier to the number of classes
model.classifier[1] = nn.Linear(1280, NUM_CLASSES)
print(model)

# Send the model from cpu to gpu
# model = model.cuda()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)  # Moves model to CPU or GPU dynamically

MobileNetV2(
  (features): Sequential(
    (0): Conv2dNormActivation(
      (0): Conv2d(3, 32, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): ReLU6(inplace=True)
    )
    (1): InvertedResidual(
      (conv): Sequential(
        (0): Conv2dNormActivation(
          (0): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=32, bias=False)
          (1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (2): ReLU6(inplace=True)
        )
        (1): Conv2d(32, 16, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (2): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (2): InvertedResidual(
      (conv): Sequential(
        (0): Conv2dNormActivation(
          (0): Conv2d(16, 96, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (1): BatchNorm2d(96, eps=

Now the output dimension of the classifier matches.

As this course focuses on efficiency, we will then inspect its model size and (theoretical) computation cost.


* The model size can be estimated by the number of trainable parameters:

In [11]:
num_params = 0
for param in model.parameters():
  if param.requires_grad:
    num_params += param.numel()
print("#Params:", num_params)

#Params: 2236682
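The parameter count above can be cross-checked by hand. The sketch below assumes the figure of 3,504,872 parameters that the torchvision documentation lists for the ImageNet `mobilenet_v2`, and swaps the 1000-class head for the 10-class one; `linear_params` is a hypothetical helper.

```python
def linear_params(in_features, out_features, bias=True):
    """Weights plus optional bias of a fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)


# Assumption: parameter count listed for torchvision's ImageNet mobilenet_v2.
IMAGENET_TOTAL = 3_504_872

old_head = linear_params(1280, 1000)  # original 1000-class classifier
new_head = linear_params(1280, 10)    # replacement 10-class classifier

print("old head:", old_head)
print("new head:", new_head)
print("expected total:", IMAGENET_TOTAL - old_head + new_head)
```

The expected total agrees with the `#Params` value printed by the cell above, which suggests that nearly all of the savings relative to the ImageNet model come from the smaller classifier.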


* The computation cost can be estimated by the number of [multiply–accumulate operations (MACs)](https://en.wikipedia.org/wiki/Multiply–accumulate_operation) using [TorchProfile](https://github.com/zhijian-liu/torchprofile). We will use this profiling tool again in future labs.

In [12]:
num_macs = profile_macs(model, torch.zeros(1, 3, 224, 224, device=device))
print("#MACs:", num_macs)

#MACs: 306186464


This model has 2.2M parameters and requires 306M MACs for inference. We will work together in the next few labs to improve its efficiency.
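To give a feel for where the MACs come from, here is a minimal sketch that counts them for the first two convolutions shown in the printed model, using the usual formula `out_h * out_w * out_channels * (in_channels / groups) * k * k`. The helper names are hypothetical, and a profiler may count some layers differently, so treat the numbers as per-layer estimates only.

```python
def conv_out_size(size, kernel, stride, padding):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_macs(in_ch, out_ch, kernel, out_h, out_w, groups=1):
    """Multiply-accumulate operations of a 2D convolution (bias ignored)."""
    return out_h * out_w * out_ch * (in_ch // groups) * kernel * kernel


# features[0][0]: Conv2d(3, 32, kernel_size=3, stride=2, padding=1)
out = conv_out_size(224, kernel=3, stride=2, padding=1)
stem = conv_macs(3, 32, 3, out, out)

# features[1].conv[0][0]: depthwise Conv2d(32, 32, kernel_size=3, groups=32)
depthwise = conv_macs(32, 32, 3, out, out, groups=32)

print("output size:", out)
print("stem conv MACs:", stem)
print("depthwise conv MACs:", depthwise)
```

The depthwise layer has more output channels than it would need MACs for in a dense convolution: with `groups=32`, each output channel reads only one input channel, which is the main reason MobileNet is cheap.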

### Optimization (10%)

As we are working on a classification problem, we will apply [cross entropy](https://en.wikipedia.org/wiki/Cross_entropy) as our loss function to optimize the model:

In [13]:
# TODO:
# Apply cross entropy as our loss function
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
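For intuition about `label_smoothing=0.1`, the sketch below computes the loss for a single sample in plain Python. It assumes the common definition in which the target puts `1 - eps + eps / K` on the true class and `eps / K` on every other class; the function name `cross_entropy` is only illustrative.

```python
import math


def cross_entropy(logits, target, label_smoothing=0.0):
    """Cross entropy for one sample, with the target smoothed toward uniform."""
    num_classes = len(logits)
    # Numerically stable log-softmax.
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    log_probs = [z - log_norm for z in logits]
    # Smoothed target: eps / K everywhere, plus (1 - eps) on the true class.
    loss = 0.0
    for k, log_p in enumerate(log_probs):
        q = label_smoothing / num_classes + (1.0 - label_smoothing) * (k == target)
        loss -= q * log_p
    return loss


confident = [10.0] + [0.0] * 9  # strongly favors class 0
print(cross_entropy(confident, target=0))
print(cross_entropy(confident, target=0, label_smoothing=0.1))
```

With smoothing, a very confident correct prediction still incurs a noticeable loss, which discourages the model from pushing logits to extremes.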

We then choose an optimizer for the model:

In [14]:
# TODO:
# Choose an optimizer.
optimizer = Adam(model.parameters(), lr=LEARNING_RATE)

(Optional) We can apply a learning rate scheduler during the training:

In [15]:
# TODO(optional):
scheduler = CosineAnnealingLR(optimizer, T_max=NUM_EPOCH)
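The shape of a cosine annealing schedule can be sketched with its closed form, `eta_min + 0.5 * (base_lr - eta_min) * (1 + cos(pi * step / T_max))`, valid for steps up to `T_max`. The function `cosine_lr` below is a hypothetical stand-alone helper, and `step` counts calls to `scheduler.step()`.

```python
import math


def cosine_lr(step, base_lr, t_max, eta_min=0.0):
    """Closed-form cosine annealing learning rate for 0 <= step <= t_max."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * step / t_max))


LEARNING_RATE = 0.0015
NUM_EPOCH = 40

# Learning rate at the start, the midpoint, and the end of one annealing cycle.
for step in (0, NUM_EPOCH // 2, NUM_EPOCH):
    print(step, cosine_lr(step, LEARNING_RATE, NUM_EPOCH))
```

Because `T_max` is measured in scheduler steps, it matters how often `scheduler.step()` is called: if it is called once per batch rather than once per epoch, `T_max` would need to be expressed in batches for the schedule to span the whole run.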

### Training (25%)

We first define the function that optimizes the model for one batch:

In [16]:
def train_one_batch(
  model: nn.Module,
  criterion: nn.Module,
  optimizer: Optimizer,
  inputs: torch.Tensor,
  targets: torch.Tensor,
  scheduler
) -> None:

    # TODO:
    # Step 1: Reset the gradients (from the last iteration)
    optimizer.zero_grad()
    # Step 2: Forward inference
    outputs = model(inputs)
    # Step 3: Calculate the loss
    loss = criterion(outputs, targets)
    # Step 4: Backward propagation
    loss.backward()
    # Step 5: Update optimizer
    optimizer.step()
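    # (Optional) Step 6: Update the learning rate.
    # Note: this runs once per batch, so CosineAnnealingLR's T_max is
    # counted in batches here, not in epochs.
    scheduler.step()
    # For a metric-driven scheduler such as ReduceLROnPlateau, use instead:
    #   scheduler.step(loss)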


We then define the training function:

In [17]:
def train(
    model: nn.Module,
    dataflow: DataLoader,
    criterion: nn.Module,
    optimizer: Optimizer,
    scheduler: LRScheduler
):

  model.train()

  for inputs, targets in tqdm(dataflow, desc='train', leave=False):
    # Move the data to the same device as the model
    inputs = inputs.to(device)
    targets = targets.to(device)

    # Call train_one_batch function
    train_one_batch(model, criterion, optimizer, inputs, targets, scheduler)

Last, we define the evaluation function:

In [18]:
def evaluate(
  model: nn.Module,
  dataflow: DataLoader
) -> float:

    model.eval()
    num_samples = 0
    num_correct = 0

    with torch.no_grad():
        for inputs, targets in tqdm(dataflow, desc="eval", leave=False):
            # TODO:
            # Step 1: Move the data from CPU to GPU
            inputs = inputs.to(device)
            targets = targets.to(device)
            # Step 2: Forward inference
            outputs = model(inputs)
            # Step 3: Convert logits to class indices (predicted class)
            predicts = torch.argmax(outputs, dim=1)
            # Update metrics
            num_samples += targets.size(0)
            num_correct += (predicts == targets).sum()

    return (num_correct / num_samples * 100).item()
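The metric computed by `evaluate` is plain top-1 accuracy. As a minimal sketch without tensors, the hypothetical helper below takes the argmax of each row of logits and reports the share of matches as a percentage.

```python
def accuracy_percent(logits_batch, targets):
    """Top-1 accuracy in percent for lists of per-class scores and labels."""
    num_correct = 0
    for logits, target in zip(logits_batch, targets):
        # Index of the largest score is the predicted class.
        predicted = max(range(len(logits)), key=logits.__getitem__)
        num_correct += int(predicted == target)
    return num_correct / len(targets) * 100


logits_batch = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]
targets = [1, 0, 0, 0]
print(accuracy_percent(logits_batch, targets))
```

Note that the result is a percentage (0 to 100), matching the values printed during training, not a fraction between 0 and 1.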

With training and evaluation functions, we can finally start training the model!

If training is done properly, the accuracy should easily exceed 92.5%:

***Please screenshot the output model accuracy, hand in as YourID_acc_1.png***

In [19]:
for epoch_num in tqdm(range(1, NUM_EPOCH + 1)):
  train(model, dataflow["train"], criterion, optimizer, scheduler)
  acc = evaluate(model, dataflow["test"])
  print(f"epoch {epoch_num}:", acc)

print(f"final accuracy: {acc}")

100%|██████████| 40/40 [36:53<00:00, 55.35s/it]

epoch 1: 90.49479675292969
epoch 2: 91.2459945678711
epoch 3: 92.22756958007812
epoch 4: 92.31771087646484
epoch 5: 90.9755630493164
epoch 6: 88.93229675292969
epoch 7: 86.65865325927734
epoch 8: 89.74359130859375
epoch 9: 91.32612609863281
epoch 10: 92.7784423828125
epoch 11: 92.28765869140625
epoch 12: 92.60816955566406
epoch 13: 91.58654022216797
epoch 14: 91.29607391357422
epoch 15: 91.54647827148438
epoch 16: 92.00721740722656
epoch 17: 92.65824890136719
epoch 18: 92.62820434570312
epoch 19: 92.69831848144531
epoch 20: 92.33773803710938
epoch 21: 91.9971923828125
epoch 22: 91.2159423828125
epoch 23: 91.45632934570312
epoch 24: 91.83694458007812
epoch 25: 92.75841522216797
epoch 26: 93.25921630859375
epoch 27: 93.01882934570312
epoch 28: 92.48798370361328
epoch 29: 91.9571304321289
epoch 30: 91.70673370361328
epoch 31: 91.51642608642578
epoch 32: 91.76683044433594
epoch 33: 92.70833587646484
epoch 34: 92.88862609863281
epoch 35: 92.45793151855469
epoch 36: 92.74839782714844
epoch 37: 92.50801086425781
epoch 38: 92.3477554321289
epoch 39: 91.69671630859375
epoch 40: 92.54808044433594
final accuracy: 92.54808044433594

Save the model weights as "model.pt":

In [20]:
# TODO:
# Save the model weight
torch.save(model.state_dict(), "model.pt")


You will find "model.pt" in the current folder.

### Export Model (5%)

We can also export the model in [ONNX format](https://pytorch.org/docs/stable/onnx_torchscript.html):

In [21]:
import torch.onnx

# TODO:
# Specify the input shape
dummy_input = torch.randn(1, 3, 224, 224, device=device)  # Specify input shape
onnx_path = 'model.onnx'

# TODO:
# Export the model to ONNX format
torch.onnx.export(
    model,
    dummy_input,
    onnx_path,
    export_params=True,
    opset_version=11,
    do_constant_folding=True,
    input_names=['input'],
    output_names=['output'],
)
print(f"Model exported to {onnx_path}")

Model exported to model.onnx


In ONNX format, we can inspect the model structure using [Netron](https://netron.app/).

***Please download the model structure, hand in as YourID_onnx.png.***

### Inference (10%)

Load the saved model weight:



In [22]:
# TODO:
# Step 1: Get the model structure (mobilenet_v2 and the classifier).
# The pre-trained ImageNet weights are not needed here because
# every parameter is overwritten by the state dict loaded in Step 2.
loaded_model = mobilenet_v2(weights=None)
loaded_model.classifier[1] = nn.Linear(1280, NUM_CLASSES)

# Step 2: Load the model weights from "model.pt".
loaded_model.load_state_dict(torch.load("model.pt", map_location=device))
# Step 3: Send the model to the target device (GPU if available)
loaded_model = loaded_model.to(device)

Run inference with the loaded model weights and check the accuracy:

***Please screenshot the output model accuracy, hand in as YourID_acc_2.png***

In [23]:
acc = evaluate(loaded_model, dataflow["test"])
print(f"accuracy: {acc}")

                                                     


If the accuracy matches the accuracy measured before saving, you have completed Part 1.

Congratulations!

# **Part 2: LLM with torch.compile**

In Part 2, we will compare the inference speed of an LLM with and without torch.compile.

```torch.compile``` was introduced in PyTorch 2.0.

The following tutorial introduces its usage:

[Introduction to torch.compile](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html)

We will choose ```Llama-3.2-1B-Instruct``` as our LLM model.

Make sure you have access to llama before starting Part 2.

https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct

### Loading LLM (20%)

We first install the Hugging Face Hub CLI and log in with your token:

In [None]:
!pip install -U "huggingface_hub[cli]"
!huggingface-cli login

Collecting InquirerPy==0.3.4 (from huggingface_hub[cli])
  Downloading InquirerPy-0.3.4-py3-none-any.whl.metadata (8.1 kB)
Collecting pfzy<0.4.0,>=0.3.1 (from InquirerPy==0.3.4->huggingface_hub[cli])
  Downloading pfzy-0.3.4-py3-none-any.whl.metadata (4.9 kB)
Downloading InquirerPy-0.3.4-py3-none-any.whl (67 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 67.7/67.7 kB 2.3 MB/s eta 0:00:00
Downloading pfzy-0.3.4-py3-none-any.whl (8.5 kB)
Installing collected packages: pfzy, InquirerPy
Successfully installed InquirerPy-0.3.4 pfzy-0.3.4

    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    

We choose Llama 3.2 1B Instruct as our LLM and load the pretrained model.

Model ID: **"meta-llama/Llama-3.2-1B-Instruct"**


In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# TODO:
# Load the LLaMA 3.2 1B Instruct model
model_id = "meta-llama/Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).cuda()

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



First, we need to choose the prompt to feed into the LLM and the maximum token length.

You can also change the number of iterations used in the timing tests below.

In [None]:
# TODO:
# Input prompt
# You can change the prompt to whatever you want, e.g. "How to learn a new language?", "What is Edge AI?"

prompt = "What is Edge AI?"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
max_token_length = 256
iter_times = 10

### Inference with torch.compile (10%)


Let's define a timer function to measure the speedup from ```torch.compile```:

In [None]:
def timed(fn):
  start = torch.cuda.Event(enable_timing=True)
  end = torch.cuda.Event(enable_timing=True)
  start.record()
  result = fn()
  end.record()
  torch.cuda.synchronize()
  return result, start.elapsed_time(end) / 1000
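The `timed` function above relies on CUDA events because GPU kernels are launched asynchronously, and it divides by 1000 because `elapsed_time` reports milliseconds. For comparison, a CPU-only variant could look like the sketch below; `timed_cpu` is a hypothetical name and it measures wall-clock time only, so it would not be accurate for GPU work without an explicit synchronization.

```python
import time


def timed_cpu(fn):
    """Run fn() and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    return result, elapsed


# Time a small CPU-bound workload.
result, seconds = timed_cpu(lambda: sum(range(1_000_000)))
print(result, f"{seconds:.4f} s")
```

Both variants return the result together with the elapsed time in seconds, so they can be swapped in the timing loops depending on where the work runs.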

After everything is set up, let's start!

We first simply run the inference without ```torch.compile```


In [None]:
original_times = []

# Timing without torch.compile
for i in range(iter_times):
  with torch.no_grad():
    original_output, original_time = timed(lambda: model.generate(**inputs, max_length=max_token_length, pad_token_id=tokenizer.eos_token_id))
  original_times.append(original_time)
  print(f"Time taken without torch.compile: {original_time} seconds")

# Decode the output
output_text = tokenizer.decode(original_output[0], skip_special_tokens=True)
print(f"Output without torch.compile: {output_text}")

Time taken without torch.compile: 6.387107421875 seconds
Time taken without torch.compile: 6.0771552734375 seconds
Time taken without torch.compile: 5.47630615234375 seconds
Time taken without torch.compile: 6.02340576171875 seconds
Time taken without torch.compile: 5.4421572265625 seconds
Time taken without torch.compile: 5.9348212890625 seconds
Time taken without torch.compile: 5.45027001953125 seconds
Time taken without torch.compile: 5.95339453125 seconds
Time taken without torch.compile: 5.40166748046875 seconds
Time taken without torch.compile: 5.9234091796875 seconds
Output without torch.compile: What is Edge AI? Edge AI refers to Artificial Intelligence (AI) that is processed and executed on the edge of the network, rather than in the cloud or on a server. This means that the AI models are deployed directly on the devices or sensors, reducing latency and improving real-time processing.

Edge AI is used in a variety of applications, including:

1. **Industrial Automation**: Edge

Before using ```torch.compile```, we need to access the model's ```generation_config``` attribute and set the ```cache_implementation``` to "static".

To use ```torch.compile```, we need to call ```torch.compile``` on the model to compile the forward pass with the static kv-cache.

Reference: https://huggingface.co/docs/transformers/llm_optims?static-kv=basic+usage%3A+generation_config

In [None]:
compile_times = []

# Reminder: call torch._dynamo.reset() before compiling again to clear all compilation caches and restore the system to its initial state.
import torch._dynamo
torch._dynamo.reset()

# TODO:
# Compile the model
model.generation_config.cache_implementation = "static"
compiled_model = torch.compile(model)

# Timing with torch.compile
for i in range(iter_times):
  with torch.no_grad():
    compile_output, compile_time = timed(lambda: compiled_model.generate(**inputs, max_length=max_token_length, pad_token_id=tokenizer.eos_token_id))
  compile_times.append(compile_time)
  print(f"Time taken with torch.compile: {compile_time} seconds")

# Decode output
output_text = tokenizer.decode(compile_output[0], skip_special_tokens=True)
print(f"\nOutput with torch.compile: {output_text}")

The 'batch_size' attribute of StaticCache is deprecated and will be removed in v4.49. Use the more precisely named 'self.max_batch_size' attribute instead.
Using `torch.compile`.


Time taken with torch.compile: 31.77128125 seconds
Time taken with torch.compile: 3.374816162109375 seconds
Time taken with torch.compile: 3.382734375 seconds
Time taken with torch.compile: 3.397470947265625 seconds
Time taken with torch.compile: 3.4481865234375 seconds
Time taken with torch.compile: 3.3929892578125 seconds
Time taken with torch.compile: 3.406665771484375 seconds
Time taken with torch.compile: 3.481469482421875 seconds
Time taken with torch.compile: 3.423632568359375 seconds
Time taken with torch.compile: 3.44004541015625 seconds

Output with torch.compile: What is Edge AI? Edge AI is a type of artificial intelligence that is designed to run on the edge of the network, as opposed to traditional cloud-based AI that is processed in the cloud. This is done to reduce latency and improve real-time processing capabilities. Edge AI is typically used in applications such as autonomous vehicles, smart cities, and industrial automation.

Edge AI is based on the concept of "edge 

The first call includes compilation, so it is much slower (about 31.8 s). Every later call takes about 3.4 s, well below the 5.4 to 6.4 s measured without ```torch.compile```.

The code below reports how much faster inference is with ```torch.compile```.

***Please screenshot the inference time and speedup below, hand in as YourID_speedup.png***

In [None]:
import numpy as np
original_med = np.median(original_times)
compile_med = np.median(compile_times)
speedup = original_med / compile_med
print(f"Original median: {original_med},\nCompile median: {compile_med},\nSpeedup: {speedup}x")

Original median: 5.929115234375,
Compile median: 3.4151491699218752,
Speedup: 1.736121890836948x
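The median is used rather than the mean because the first compiled run includes the one-off compilation cost. The sketch below replays the timings printed above with the standard library to show how much that single outlier would distort a mean-based comparison.

```python
import statistics

# Timings (seconds) copied from the outputs above.
original_times = [
    6.387107421875, 6.0771552734375, 5.47630615234375, 6.02340576171875,
    5.4421572265625, 5.9348212890625, 5.45027001953125, 5.95339453125,
    5.40166748046875, 5.9234091796875,
]
compile_times = [
    31.77128125, 3.374816162109375, 3.382734375, 3.397470947265625,
    3.4481865234375, 3.3929892578125, 3.406665771484375, 3.481469482421875,
    3.423632568359375, 3.44004541015625,
]

original_med = statistics.median(original_times)
compile_med = statistics.median(compile_times)
compile_mean = statistics.mean(compile_times)

print("speedup from medians:", original_med / compile_med)
print("compiled mean (inflated by the first run):", compile_mean)
```

With these numbers, the compiled mean is larger than the original median, so a mean-based comparison would wrongly suggest a slowdown. An alternative would be to discard the first compiled run as warm-up before averaging.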


You've finished part 2.

Congratulations!