# Distilling knowledge in models pretrained on CIFAR-10/100 datasets, using ***torchdistill***

## 1. Make sure you have access to GPU/TPU
Google Colab: *Runtime* -> *Change runtime type* -> *Hardware accelerator*: "GPU" or "TPU"

In [1]:
!nvidia-smi

Sun Jan 10 20:25:31 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.27.04    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   48C    P8    10W /  70W |      0MiB / 15079MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

## 2. Install dependencies and ***torchdistill***
As of January 6, 2021, it seems that Google Colab requires us to use **CUDA ver. 10.1** for PyTorch.
Therefore, install the `+cu101` builds of `torch` and `torchvision`.

In [2]:
!pip install pyyaml --upgrade
!pip install torch==1.7.1+cu101 torchvision==0.8.2+cu101 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html
!pip install torchdistill

Collecting pyyaml
  Downloading https://files.pythonhosted.org/packages/64/c2/b80047c7ac2478f9501676c988a5411ed5572f35d1beff9cae07d321512c/PyYAML-5.3.1.tar.gz (269kB)
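After the installation, it can be worth confirming that the installed build really carries the expected CUDA tag. The following is a minimal sketch, not part of ***torchdistill***: the helper names `split_build_tag` and `matches_cuda_build` are hypothetical, and the check only inspects the `+cu101` suffix of the version string.

```python
import re


def split_build_tag(version):
    """Split a version such as '1.7.1+cu101' into ('1.7.1', 'cu101').

    The second element is None when the version has no '+' suffix.
    """
    match = re.fullmatch(r"([^+]+)(?:\+(.+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return match.group(1), match.group(2)


def matches_cuda_build(version, expected_tag="cu101"):
    """Return True if the version string ends with the expected build tag."""
    return split_build_tag(version)[1] == expected_tag


# Hypothetical usage: only runs the check if torch is importable.
try:
    import torch
    print(torch.__version__, matches_cuda_build(str(torch.__version__)))
except ImportError:
    print("torch is not installed in this environment")
```

If the check prints `False` on Colab, the `pip install` cell above probably picked up a different build than the one requested.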

## 3. Clone ***torchdistill*** repository to use its example code and configuration files

In [3]:
!git clone https://github.com/yoshitomo-matsubara/torchdistill.git

Cloning into 'torchdistill'...
remote: Enumerating objects: 447, done.
remote: Counting objects: 100% (447/447), done.
remote: Compressing objects: 100% (224/224), done.
remote: Total 4091 (delta 294), reused 341 (delta 212), pack-reused 3644
Receiving objects: 100% (4091/4091), 822.43 KiB | 19.58 MiB/s, done.
Resolving deltas: 100% (2518/2518), done.


## 4. Distill knowledge in models pretrained on CIFAR-10

Note that the hyperparameters of ResNet, WRN (Wide ResNet), and DenseNet-BC were chosen based on either train/val (splitting 50k samples into train:val = 45k:5k) or cross-validation, according to the original papers.  
For the final run (once the hyperparameters are finalized), the authors used all the training images (50k samples).  
- ResNet: https://github.com/facebookarchive/fb.resnet.torch
- WRN (Wide ResNet): https://github.com/szagoruyko/wide-residual-networks
- DenseNet-BC: https://github.com/liuzhuang13/DenseNet
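
The 45k:5k split mentioned above can be pictured with a few lines of plain Python. This is only an illustrative sketch: the function name `split_indices` and the seed are hypothetical, and ***torchdistill*** performs its own split based on the config file rather than using this code.

```python
import random


def split_indices(num_samples=50_000, num_val=5_000, seed=42):
    """Shuffle sample indices reproducibly and hold out `num_val` for validation."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)  # local RNG, global state untouched
    return indices[num_val:], indices[:num_val]


train_idx, val_idx = split_indices()
print(len(train_idx), len(val_idx))  # 45000 5000
```

With PyTorch, index lists like these could be passed to `torch.utils.data.Subset` to build the two datasets. For the final run, no samples are held out and all 50k training images are used.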

The following examples demonstrate how to 1) tune hyperparameters and 2) do the final run with ResNet-20 on the CIFAR-10 dataset.

### 4.1 Hyperparameter tuning based on train:val = 45k:5k
Let's start with a small **student model**, ResNet-20, with a pretrained DenseNet-BC (k=12, depth=100) as a **teacher model** for this tutorial.  

Open `torchdistill/configs/sample/cifar10/kd/resnet20_from_densenet_bc_k12_depth100-hyperparameter_tuning.yaml` and update the hyperparameters as you wish, e.g., the number of epochs (*num_epochs*), batch size (*batch_size* in the *train_data_loader* entry), learning rate (*lr* within the *optimizer* entry), and so on.  
By default, the hyperparameters in the example config are identical to those in the final run config.
  
You will find many module names from the [PyTorch documentation](https://pytorch.org/docs/stable/index.html) and [torchvision](https://pytorch.org/docs/stable/torchvision/), such as [`SGD`](https://pytorch.org/docs/stable/optim.html#torch.optim.SGD), [`MultiStepLR`](https://pytorch.org/docs/stable/optim.html#torch.optim.lr_scheduler.MultiStepLR), [`CrossEntropyLoss`](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html#torch.nn.CrossEntropyLoss), [`CIFAR10`](https://pytorch.org/docs/stable/torchvision/datasets.html#torchvision.datasets.CIFAR10), [`RandomCrop`](https://pytorch.org/docs/stable/torchvision/transforms.html#torchvision.transforms.RandomCrop), and more. You can update their parameters or replace these modules with others from the same packages. For instance, `SGD` could be replaced with [`Adam`](https://pytorch.org/docs/stable/optim.html#torch.optim.Adam), in which case you would also need to change the parameters under `params` (at least delete the `momentum` entry, since `Adam` does not accept it).
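
As a rough illustration of the `SGD` to `Adam` swap described above, the sketch below edits an optimizer entry represented as a Python dict. The `type` key, the exact nesting, and all numeric values are assumptions made for illustration; check the actual YAML file for the real structure and values before editing it.

```python
import copy

# Hypothetical optimizer entry; key names other than `params`, `lr`,
# and `momentum` are assumptions, and the values are placeholders.
sgd_entry = {
    "type": "SGD",
    "params": {"lr": 0.1, "momentum": 0.9, "weight_decay": 0.0005},
}

# Keyword arguments accepted by torch.optim.SGD but not by torch.optim.Adam.
SGD_ONLY_PARAMS = ("momentum", "dampening", "nesterov")


def to_adam(optimizer_entry, lr=0.001):
    """Return a copy of the entry that names Adam and drops SGD-only params."""
    entry = copy.deepcopy(optimizer_entry)  # leave the original untouched
    entry["type"] = "Adam"
    for name in SGD_ONLY_PARAMS:
        entry["params"].pop(name, None)
    entry["params"]["lr"] = lr  # SGD learning rates are rarely suitable for Adam
    return entry


adam_entry = to_adam(sgd_entry)
print(adam_entry)
```

The same edit can of course be made by hand in the YAML file; the point is only that every key under `params` must be a valid argument of the optimizer you name.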

In [4]:
!python torchdistill/examples/image_classification.py --config torchdistill/configs/sample/cifar10/kd/resnet20_from_densenet_bc_k12_depth100-hyperparameter_tuning.yaml --log log/cifar10/kd/resnet20_from_densenet_bc_k12_depth100-hyperparameter_tuning.log

2021/01/10 20:29:25	INFO	torchdistill.common.main_util	Not using distributed mode
2021/01/10 20:29:25	INFO	__main__	Namespace(adjust_lr=False, config='torchdistill/configs/sample/cifar10/kd/resnet20_from_densenet_bc_k12_depth100-hyperparameter_tuning.yaml', device='cuda', dist_url='env://', log='log/cifar10/kd/resnet20_from_densenet_bc_k12_depth100-hyperparameter_tuning.log', start_epoch=0, student_only=False, sync_bn=False, test_only=False, world_size=1)
2021/01/10 20:29:25	INFO	torchdistill.datasets.util	Loading dummy data
Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./resource/dataset/cifar10/cifar-10-python.tar.gz
170500096it [00:01, 99619332.72it/s]                   
Extracting ./resource/dataset/cifar10/cifar-10-python.tar.gz to ./resource/dataset/cifar10
2021/01/10 20:29:30	INFO	torchdistill.datasets.util	Splitting `dummy` dataset (50000 samples in total)
2021/01/10 20:29:30	INFO	torchdistill.datasets.util	new dataset_id: `cifar10/train` (45000 samples

### 4.2 Final run with hyperparameters determined by the above hyperparameter tuning
Once you have tuned the hyperparameters, you can update the values in **a config file whose name ends with "-final_run.yaml"**. Note that the only difference between the default example configs for hyperparameter tuning and the final run is the datasets entry.
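
If you want to check which top-level entries differ between two configs after editing them, a comparison like the following may help. It is a minimal sketch: the function name `differing_keys` is hypothetical, and the two dicts are toy stand-ins rather than the real file contents. In practice you would load each YAML file with `yaml.safe_load` and pass the resulting dicts instead.

```python
_MISSING = object()  # sentinel so a missing key differs from an explicit None


def differing_keys(config_a, config_b):
    """Return the sorted top-level keys whose values differ between two configs."""
    keys = set(config_a) | set(config_b)
    return sorted(
        key for key in keys
        if config_a.get(key, _MISSING) != config_b.get(key, _MISSING)
    )


# Toy stand-ins for the two loaded configs (not the real contents).
tuning_config = {"datasets": {"note": "45k train / 5k val"}, "train": {"num_epochs": 1}}
final_config = {"datasets": {"note": "50k train"}, "train": {"num_epochs": 1}}

print(differing_keys(tuning_config, final_config))  # ['datasets']
```

Any key other than the datasets entry showing up in the result means the two configs no longer share the same hyperparameters.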

In [5]:
!python torchdistill/examples/image_classification.py --config torchdistill/configs/sample/cifar10/kd/resnet20_from_densenet_bc_k12_depth100-final_run.yaml --log log/cifar10/kd/resnet20_from_densenet_bc_k12_depth100-final_run.log

2021/01/10 22:47:35	INFO	torchdistill.common.main_util	Not using distributed mode
2021/01/10 22:47:35	INFO	__main__	Namespace(adjust_lr=False, config='torchdistill/configs/sample/cifar10/kd/resnet20_from_densenet_bc_k12_depth100-final_run.yaml', device='cuda', dist_url='env://', log='log/cifar10/kd/resnet20_from_densenet_bc_k12_depth100-final_run.log', start_epoch=0, student_only=False, sync_bn=False, test_only=False, world_size=1)
2021/01/10 22:47:35	INFO	torchdistill.datasets.util	Loading train data
Files already downloaded and verified
2021/01/10 22:47:36	INFO	torchdistill.datasets.util	0.9210293292999268 sec
2021/01/10 22:47:36	INFO	torchdistill.datasets.util	Loading val data
Files already downloaded and verified
2021/01/10 22:47:36	INFO	torchdistill.datasets.util	0.7336370944976807 sec
2021/01/10 22:47:36	INFO	torchdistill.datasets.util	Loading test data
Files already downloaded and verified
2021/01/10 22:47:37	INFO	torchdistill.datasets.util	0.7437591552734375 sec
2021/01/10 22:4

## 5. More sample configurations, models, datasets...
For CIFAR-10/100 datasets, you can find more [sample configurations](https://github.com/yoshitomo-matsubara/torchdistill/tree/master/configs/sample/) and [models](https://github.com/yoshitomo-matsubara/torchdistill/tree/master/torchdistill/models/classification) in the [***torchdistill***](https://github.com/yoshitomo-matsubara/torchdistill) repository.  
If you would like to use larger datasets (e.g., **ImageNet** and **COCO**) and models in `torchvision` (or your own modules), refer to the [official configurations](https://github.com/yoshitomo-matsubara/torchdistill/tree/master/configs/official) used in some published papers.  
Experiments with such large datasets and models will require you to use your own machine due to limited disk space and session time (12 hours for free version and 24 hours for Colab Pro) on Google Colab.


# Colab examples for training student models without teacher models
You can find Colab examples for training models without teachers in the [***torchdistill***](https://github.com/yoshitomo-matsubara/torchdistill) repository.