# Transfer Learning

## Plan

- Other resources for models:
    - Research papers and specific repos
    - Facebook AI Research / DeepMind and similar repositories
- Model conversion
- Ways to finetune model

> Transfer learning is the idea of reusing knowledge learned by a model on one task to help solve another

This approach allows us to spend less time on:
- coding and coming up with neural network architectures 
- collecting data (as large amounts of data were already used to train these networks)

Furthermore, it helps with:
- generalization (knowledge from a similar domain transfers easily)
- training time (weights are initialized better)

And, perhaps even more importantly, it __allows us to use knowledge from datasets which are too large to train on with one's own machine__

# torchvision

> `torchvision` ([documentation](https://pytorch.org/docs/stable/torchvision/models.html)) provides state-of-the-art (SOTA), or close to it, neural network models for computer vision tasks

Those models were (usually) trained on the well-known `ImageNet` dataset

## ImageNet

[ImageNet](http://image-net.org/) is not only a dataset, but also the basis of __a yearly classification competition__ (ILSVRC, held from 2010 to 2017).

Overview of the dataset:
- Over 1 million images
- Images come in different sizes (but they are usually cropped or resized to between `224x224` and `384x384`)
- `1000` classes (a lot, this task is hard!)

> __Keep the current best ImageNet models in mind, as they are often used on their own or as parts of other models!__

- At the time of writing, EfficientNet-based architectures are SOTA (original research paper [here](https://arxiv.org/abs/1905.11946))
- __Around 90% Top-1 accuracy has been achieved__ (and 98% Top-5; Top-k counts a prediction as correct when the true class is among the model's `k` highest-scoring classes), which means __we are getting closer to solving this dataset, just as we have "solved" MNIST or CIFAR__
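
To make the Top-1 and Top-5 metrics concrete, here is a minimal sketch of how Top-k accuracy can be computed from raw model outputs. The helper name `topk_accuracy` and the toy scores are made up for illustration (with `6` classes and `k=3` instead of `1000` classes and `k=5`):

```python
import torch


def topk_accuracy(logits: torch.Tensor, labels: torch.Tensor, k: int = 1) -> float:
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    # Indices of the k largest scores per sample, shape (batch, k)
    topk_indices = logits.topk(k, dim=-1).indices
    # A sample counts as correct if any of its k predictions matches the label
    correct = (topk_indices == labels.unsqueeze(-1)).any(dim=-1)
    return correct.float().mean().item()


# Toy batch: 3 samples, 6 classes (hypothetical scores)
logits = torch.tensor(
    [
        [0.1, 0.9, 0.0, 0.0, 0.0, 0.0],  # best guess: class 1
        [0.5, 0.4, 0.3, 0.2, 0.1, 0.0],  # best guess: class 0
        [0.0, 0.1, 0.2, 0.3, 0.4, 0.5],  # best guess: class 5
    ]
)
labels = torch.tensor([1, 2, 0])

top1 = topk_accuracy(logits, labels, k=1)  # only the first sample is correct
top3 = topk_accuracy(logits, labels, k=3)  # the first two samples are correct
```

Top-k accuracy can never be lower than Top-1, which is why the Top-5 numbers reported for ImageNet are always the higher ones.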

## Using models

> Loading `torchvision` models is simple. Check the [documentation and source code of a model](https://pytorch.org/vision/0.8/models.html#torchvision.models.resnet18) to see all available arguments!

In [None]:
import torchvision

# Note: newer torchvision releases (0.13+) prefer the `weights=` argument over `pretrained=`
model = torchvision.models.vgg11(pretrained=True)

# Vision models classes

Vision models (from `torchvision` and elsewhere) can be divided into a few categories (entries marked `torchvision` are provided by the package):

## Classification

> The basic task. Most of the models were trained on ImageNet (sometimes after pretraining on even larger datasets).

A rough accuracy ranking is given below (non-comprehensive and grouped by theme, full list [here](https://paperswithcode.com/sota/image-classification-on-imagenet)), sorted from best to worst:

- EfficientNet family - [research](https://arxiv.org/abs/1905.11946) | `EfficientNet-BN`, `EfficientNet-LN` (where `N` is the variant number, e.g. `B0` to `B7`) and their variations
- ResNet family - we saw the basic idea behind it during the convolution classes | `torchvision` | `ResNeXt`, `ResNet`, `Wide ResNe(X)t`
- Inception family | `torchvision` | `InceptionV3`, `Xception`
- MobileNets | `torchvision` | `MobileNetV{1, 2, 3}`, whose blocks are used as building blocks of EfficientNet
- Older models of historical importance:
    - VGG family | `torchvision` | `VGG11`, `VGG19`, large and inefficient in comparison
    - AlexNet | `torchvision` | The first neural network to win the ImageNet competition
    
> __There are a lot of other interesting ideas presented in ImageNet related papers, read them if you are curious!__

### Which model should I choose?

As always, that depends on your use case, but rough guidelines could be:

- __ResNets__:
    - battle tested
    - work really well in many tasks
    - fast and well optimized in many frameworks (perfect for GPUs)
    - __may not be the most efficient parameter-wise__
    - __go-to choice for initial runs__
- __EfficientNets__:
    - current SOTA
    - may not be as general as ResNets (though research is ever growing)
    - may not be as well optimized in frameworks (this keeps changing; __potentially faster than ResNets, sometimes much faster__)
    - __more efficient parameter-wise__ (models roughly `10` times smaller than comparable ResNets)
    - __test when you want to push your accuracy__
    - __test when you want to deploy to mobile and other constrained devices__ (and you need better results)
- __MobileNets__:
    - really fast (especially on CPU)
    - __battle tested for edge deployment & constrained environments__ (AWS Lambda, Mobile)
    - can be really, really small (below `1M` parameters) yet good enough
    - __use for mobile; may handle a lot of tasks well enough!__

In [None]:
# The number in a ResNet's name tells us how many layers it has
resnet = torchvision.models.resnet34(pretrained=True)  # Loading weights trained on ImageNet

# MnasNet: an interesting trade-off between MobileNet-like speed and good accuracy
mnasnet = torchvision.models.mnasnet1_0(num_classes=100)  # Randomly initialized, 100 output classes

## Other tasks

- [Semantic Segmentation](https://pytorch.org/vision/stable/models.html#semantic-segmentation)
- [Object Detection & Instance Segmentation](https://pytorch.org/vision/stable/models.html#object-detection-instance-segmentation-and-person-keypoint-detection)

![](images/segmentation_vs_detection.png)

[Image Source](https://towardsdatascience.com/a-hitchhikers-guide-to-object-detection-and-instance-segmentation-ac0146fe8e11)

__We will not go into details about those models during this lesson__, but there are important things to keep in mind:
- Those models use the classification models seen above as a __backbone__ (feature extractor for the specific task), a __recurring theme in vision!__
- They are usually trained on the large [`COCO` dataset](https://cocodataset.org/)
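
The backbone idea can be sketched with a toy network. The layers and sizes below are arbitrary stand-ins, not the architecture of any real detection or segmentation model; the point is only that one feature extractor can feed different task-specific heads:

```python
import torch
from torch import nn

# Toy "classification backbone": turns an image into a feature map
backbone = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
)

# Classification head: one score per class for the whole image
classifier_head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 10)
)

# Segmentation head: one score per class for every pixel
segmentation_head = nn.Sequential(
    nn.Conv2d(32, 5, kernel_size=1),
    nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
)

images = torch.randn(2, 3, 64, 64)
features = backbone(images)  # shared features, shape (2, 32, 16, 16)

class_scores = classifier_head(features)  # shape (2, 10)
pixel_scores = segmentation_head(features)  # shape (2, 5, 64, 64)
```

In practice the backbone would carry weights pretrained on ImageNet, and only the task-specific head would start from scratch.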

## Exercise

Try to beat a baseline model with pretrained weights that was fine-tuned (we will learn more about that later) on CIFAR10 for `1` epoch!

- __Run the first and second cells ONLY ONCE__
- __In the third cell:__
    - You can use any `torchvision` model
    - WITHOUT `pretrained` weights
    - Train for at most `5` epochs (you can check validation accuracy)

In [1]:
import tempfile

import torchvision
from pl_bolts.datamodules import CIFAR10DataModule

with tempfile.TemporaryDirectory() as data_dir:
    dm = CIFAR10DataModule(
        data_dir=data_dir, shuffle=True, num_workers=1, normalize=True, batch_size=64
    )
    # Download the data and create the train/validation/test splits
    dm.prepare_data()
    dm.setup()
    train_dataloader = dm.train_dataloader()
    test_dataloader = dm.test_dataloader()
    # The data module names this method `val_dataloader`
    validation_dataloader = dm.val_dataloader()

In [None]:
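import torch


def accuracy(logits, labels):
    # Number of correct predictions in the batch;
    # sum over batches and divide by the dataset size to get accuracy
    return torch.sum(torch.argmax(logits, dim=-1) == labels)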

# Your baseline to beat



In [None]:
# Your code 

# PyTorch Hub

> PyTorch provides a hub from which one can simply download models ([page](https://pytorch.org/hub/) | [module](https://pytorch.org/docs/stable/hub.html))

It works in a similar fashion to `torchvision` and is currently being developed as the __official source of PyTorch models__.

- Anyone can make their models work with PyTorch Hub
- `torchvision` models are available through it
- Other, non-vision models are also provided (including NLP, Audio, Generative)

One can easily see the models available in a repository using `torch.hub.list`:

In [3]:
import torch

torch.hub.list(github="intel-isl/MiDaS")

Downloading: "https://github.com/intel-isl/MiDaS/archive/master.zip" to /home/vyz/.cache/torch/hub/master.zip


['MiDaS', 'MiDaS_small', 'MidasNet', 'MidasNet_small', 'transforms']

## How to find repositories?

- Official repositories are linked on the [PyTorch Hub](https://pytorch.org/hub/) webpage
- Unofficial, user-hosted models can be found in some repositories (still not a common practice); __look for `hubconf.py` at the root of the GitHub project__ (and see next sections)
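
For orientation, a `hubconf.py` is a plain Python file listing the required packages and exposing functions (entrypoints) that build models. The following is a minimal sketch; the entrypoint `tiny_classifier` and its architecture are hypothetical and not taken from any real repository:

```python
# hubconf.py (placed at the root of a GitHub repository)
import torch

# Packages that must be installed to load the models from this repository
dependencies = ["torch"]


def tiny_classifier(num_classes: int = 10, **kwargs) -> torch.nn.Module:
    """Hypothetical entrypoint: a small fully connected classifier.

    Real entrypoints usually also offer a `pretrained` flag that downloads
    weights; that part is omitted here.
    """
    return torch.nn.Sequential(
        torch.nn.Flatten(),
        torch.nn.Linear(28 * 28, 64),
        torch.nn.ReLU(),
        torch.nn.Linear(64, num_classes),
    )
```

With such a file in a (hypothetical) repository `user/repo`, `torch.hub.load("user/repo", "tiny_classifier", num_classes=5)` would call this function and return the model.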

## More PyTorch Hub commands

> Watch out, some models are really large!

There are more commands useful for exploration, let's see the cell below:

In [15]:
import tempfile

# This directory will be removed after we leave context manager
with tempfile.TemporaryDirectory() as directory:
    # Where model will be downloaded
    torch.hub.set_dir(directory)

    print(torch.hub.list("pytorch/vision"))

    print(torch.hub.help("pytorch/vision", model="mobilenet_v3_large"))

    model = torch.hub.load(
        "pytorch/vision", model="mobilenet_v3_large", pretrained=True, progress=True
    )

Downloading: "https://github.com/pytorch/vision/archive/master.zip" to /tmp/tmpaqud08rt/master.zip


['alexnet', 'deeplabv3_mobilenet_v3_large', 'deeplabv3_resnet101', 'deeplabv3_resnet50', 'densenet121', 'densenet161', 'densenet169', 'densenet201', 'fcn_resnet101', 'fcn_resnet50', 'googlenet', 'inception_v3', 'lraspp_mobilenet_v3_large', 'mnasnet0_5', 'mnasnet0_75', 'mnasnet1_0', 'mnasnet1_3', 'mobilenet_v2', 'mobilenet_v3_large', 'mobilenet_v3_small', 'resnet101', 'resnet152', 'resnet18', 'resnet34', 'resnet50', 'resnext101_32x8d', 'resnext50_32x4d', 'shufflenet_v2_x0_5', 'shufflenet_v2_x1_0', 'squeezenet1_0', 'squeezenet1_1', 'vgg11', 'vgg11_bn', 'vgg13', 'vgg13_bn', 'vgg16', 'vgg16_bn', 'vgg19', 'vgg19_bn', 'wide_resnet101_2', 'wide_resnet50_2']

    Constructs a large MobileNetV3 architecture from
    `"Searching for MobileNetV3" <https://arxiv.org/abs/1905.02244>`_.

    Args:
        pretrained (bool): If True, returns a model pre-trained on ImageNet
        progress (bool): If True, displays a progress bar of the download to stderr
    


Using cache found in /tmp/tmpaqud08rt/pytorch_vision_master
Using cache found in /tmp/tmpaqud08rt/pytorch_vision_master
Downloading: "https://download.pytorch.org/models/mobilenet_v3_large-8738ca79.pth" to /tmp/tmpaqud08rt/checkpoints/mobilenet_v3_large-8738ca79.pth


  0%|          | 0.00/21.1M [00:00<?, ?B/s]

In [12]:
# Finding out more about the downloaded model

methods = dir(model)  # available attributes and methods

# Info about a specific method of the model
help(model.xpu)

Help on method xpu in module torch.nn.modules.module:

xpu(device: Union[int, torch.device, NoneType] = None) -> ~T method of torchvision.models.mobilenetv3.MobileNetV3 instance
    Moves all model parameters and buffers to the XPU.
    
    This also makes associated parameters and buffers different objects. So
    it should be called before constructing optimizer if the module will
    live on XPU while being optimized.
    
    Arguments:
        device (int, optional): if specified, all parameters will be
            copied to that device
    
    Returns:
        Module: self



## PyTorch-Lightning

> Lightning Bolts provides a few well-recognized models, see [documentation](https://pytorch-lightning-bolts.readthedocs.io/en/latest/#vision)

- Provides architectures, but __rarely pretrained weights__ (you have to get them on your own)
- Selection of models is not very large currently
- __Useful to understand specific models and see them implemented__

In [None]:
from pl_bolts.models.gans import GAN

# Basic GAN network
# IT DOES NOT HAVE PRETRAINED WEIGHTS!
model = GAN(input_channels=3, input_height=28, input_width=28, latent_dim=32)

# Other sources

What if we can't find a desirable model? There are a few available alternatives:
- [paperswithcode](https://paperswithcode.com/) - outlines current SOTA results, including only papers with available source code (__quality of implementation is not measured!__)
- [arxiv](https://arxiv.org/) - apart from the research papers themselves, links to GitHub repositories are __sometimes__ provided, __usually in the abstract__
- GitHub accounts of respected research labs (also includes interesting technical solutions):
    - [Facebook Research](https://github.com/facebookresearch) - General | Vision
    - [DeepMind](https://github.com/deepmind) - General | Reinforcement Learning
    - [Google Research](https://github.com/google-research) - General | Health, Business use
    - [OpenAI](https://github.com/openai/) - General | NLP, large networks
    - [Microsoft Research](https://github.com/MicrosoftResearch) - General | More technical and less DL based
    - [NVIDIA Research](https://github.com/NVlabs) - General | GANs, large scale networks
- [Distill.pub](https://distill.pub/) - research reviews & other publications, sometimes with code

# Model Conversion

> Some models are implemented in other frameworks (usually TensorFlow). We can use `ONNX` to convert them

# ONNX

> [ONNX](https://github.com/onnx/onnx) provides an open source format for AI models, both deep learning and traditional ML

- Transforms models into the open exchange format `.onnx`
- Supported by major frameworks/tools

## Downsides

- Not all operations are interchangeable between frameworks
- For SOTA models conversion might be hard
- Puts constraints on some of the frameworks (e.g. PyTorch)

> We will see another way to export models for use in non-Python environments later, during the `torchscript` lesson

> __`ONNX` should be used with care and only for inter-framework conversions.__

> We don't expect you to know `ONNX` inside out, just keep this tool in mind for when the right time comes!

## Why would I leave my framework?

PyTorch is great, but there are a few cases you might encounter where you need to switch, including:
- Part of your team (or another team) uses a different technology
- PyTorch does not support some form of deployment (which TensorFlow might)
- Hardware-specific optimization is required and not possible in PyTorch
- Other parts of the pipeline are implemented in a different framework

> The above (and many more) reasons also apply to other deep learning/machine learning frameworks

## PyTorch front end

Let's see how we can export our PyTorch models to the `ONNX` format using the `torch.onnx` module:

In [None]:
import torch
import torchvision

dummy_input = torch.randn(10, 3, 224, 224, device="cuda")
model = torchvision.models.alexnet(pretrained=True).cuda()

# Providing input and output names sets the display names for values
# within the model's graph. Setting these does not change the semantics
# of the graph; it is only for readability.
#
# The inputs to the network consist of the flat list of inputs (i.e.
# the values you would pass to the forward() method) followed by the
# flat list of parameters. You can partially specify names, i.e. provide
# a list here shorter than the number of inputs to the model, and we will
# only set that subset of names, starting from the beginning.
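input_names = ["actual_input_1"] + ["learned_%d" % i for i in range(16)]
output_names = ["output1"]

# Run the model once on the dummy input and save the recorded graph
# to an ONNX file (the file name is arbitrary)
torch.onnx.export(
    model,
    dummy_input,
    "alexnet.onnx",
    input_names=input_names,
    output_names=output_names,
)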

# Transfer learning

> Transfer learning is the process of reusing model(s) trained on another task and adjusting them to our needs

## Per-domain models

There are some rough guidelines for different tasks:
- Vision:
    - ImageNet models (classification)
    - COCO pretrained models (with pretrained backbones from ImageNet classification)
- NLP:
    - Pretrained word embeddings
    - Large Transformer-based architectures (usually BERT and its variations)
    - __Still emerging approach__
    
For other tasks (e.g. reinforcement learning, one-shot learning, GANs) transfer learning is not yet so widespread.

> More pretrained models for different domains will probably emerge, as we have seen first with vision and then with NLP tasks

Models for those domains often use pretrained networks (most often from vision) as part of their architecture, though.
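
As an illustration of the "pretrained word embeddings" item from the NLP list above, the sketch below plugs fixed vectors into a small classifier. The random `pretrained_vectors` tensor is a stand-in (in practice the vectors would be loaded from a file), and the vocabulary size and dimensions are arbitrary:

```python
import torch
from torch import nn

# Stand-in for pretrained vectors: 5 words in the vocabulary, 4 dimensions each
pretrained_vectors = torch.randn(5, 4)

# freeze=True keeps the pretrained vectors fixed during training
embedding = nn.Embedding.from_pretrained(pretrained_vectors, freeze=True)

# Task-specific part, trained from scratch on top of the embeddings
classifier = nn.Linear(4, 2)

token_ids = torch.tensor([[0, 2, 4], [1, 1, 3]])  # 2 sentences, 3 tokens each
sentence_vectors = embedding(token_ids).mean(dim=1)  # average over tokens
logits = classifier(sentence_vectors)  # shape (2, 2)
```

This is the same pattern as in vision: a reused, pretrained part produces features and a small new part is trained for the task at hand.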

## How to finetune?

We will focus on vision and classification tasks, though a similar approach is used for NLP.

## Weight freezing

> Weight freezing means freezing the __backbone__ (the layers creating features) so __it will not learn anything__, and __only allowing the last layer to learn from the provided data__

### Pros

- __The more you freeze, the faster your neural network will train and the less memory it will take__ (no gradients are computed or stored for frozen weights)!
- Easier to finetune and "get right"
- We surely will not "destroy" weights learned on the other task (which may sometimes occur at the beginning of training due to the random initialization of the new layer)

### Cons

- Representational power is limited (as we cannot change frozen weights)
- We usually will not get best possible result (though we will get it faster)

### Tips

- There is no strict rule; you may leave more parts of the network unfrozen (though this is less common)
- You may start with weight freezing, unfreeze afterwards and finish with a small learning rate (or discriminative learning rates), though __this will make the optimization procedure significantly harder__ to implement and reason about
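
A minimal sketch of weight freezing is shown below. A toy fully connected network stands in for a pretrained model (all layers and sizes are assumptions); with a real `torchvision` model the steps would be the same, only the name of the head attribute differs between architectures:

```python
import torch
from torch import nn

# Toy stand-in for a pretrained network: the "backbone" extracts features,
# the "head" maps them to the 1000 classes of the original task
model = nn.Sequential(
    nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU()),
    nn.Linear(64, 1000),
)
backbone = model[0]

# Freeze the backbone: no gradients are computed or stored for these weights
for parameter in backbone.parameters():
    parameter.requires_grad_(False)

# Replace the head with a fresh layer for the new task (here: 10 classes).
# Newly created layers have requires_grad=True by default.
model[1] = nn.Linear(64, 10)

# Hand only the trainable parameters to the optimizer
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.01)

# One training step on random data
inputs, targets = torch.randn(8, 32), torch.randint(0, 10, (8,))
loss = nn.functional.cross_entropy(model(inputs), targets)
loss.backward()
optimizer.step()
```

After `backward()`, only the new head has gradients; the frozen backbone is left exactly as it was.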

## Discriminative learning rates
    
> Discriminative learning rates mean setting different learning rates for different parts of the neural network

### Pros

- Larger representation space
- Probably better accuracy score
- We won't destroy pretrained weights (as their learning rate is smaller)

### Cons

- __Much longer__ training time, as the whole network is trained
- __Harder to finetune__ and "get right"

### Tips

- Divide your neural network into a few regions:
    - the head should have the standard learning rate
    - the middle of the network should have the head's learning rate divided by `10`
    - the first layers (finding general features) should have the middle's learning rate divided by `10` again
- A factor of `10` is not a strict rule but seems to work well in practice
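
In PyTorch, per-region learning rates can be expressed with optimizer parameter groups. The sketch below uses a toy three-layer network and assumes that each region's rate is `10` times smaller than that of the region after it; the names `first_layers`, `middle`, `head` and the value of `base_lr` are chosen only for illustration:

```python
import torch
from torch import nn

# Toy network split into three regions (stand-ins for parts of a real model)
first_layers = nn.Linear(32, 64)
middle = nn.Linear(64, 64)
head = nn.Linear(64, 10)
model = nn.Sequential(first_layers, nn.ReLU(), middle, nn.ReLU(), head)

base_lr = 1e-2  # hypothetical "standard" learning rate used for the head

# Each dictionary is a parameter group with its own learning rate
optimizer = torch.optim.SGD(
    [
        {"params": head.parameters(), "lr": base_lr},
        {"params": middle.parameters(), "lr": base_lr / 10},
        {"params": first_layers.parameters(), "lr": base_lr / 100},
    ],
    lr=base_lr,  # default for groups that do not set their own "lr"
    momentum=0.9,
)

learning_rates = [group["lr"] for group in optimizer.param_groups]
```

Options not set inside a group (here `momentum`) are shared by all groups, so only the learning rate differs between regions.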

# Exercise

> __THIS EXERCISE HAS TIME LIMIT AND BEST ACCURACY WINS (time limit will be set by instructor)__

> __DO NOT WRITE ANYTHING AT THE START. YOU HAVE 5 MINUTES TO READ INSTRUCTIONS AND COME UP WITH STRATEGY (MAKE IT A TEAM WORK!)__

- Implement a `freeze` function taking in a neural network and calling `requires_grad_(False)` on each parameter
- Do the same for `unfreeze`, but call `requires_grad_(True)` instead
- Load any model you want from `torchvision`:
    - The larger the better, but may not fit on the GPU
    - Use knowledge from the beginning when choosing it
- Print the model to get a little info about its structure (backbone, bottleneck etc.)
- Write training loop for CIFAR100 (or use PyTorch Lightning/any other framework you feel most comfortable with)

Now you can follow one of two ways (or mix them); consider both and choose wisely!

## Weight freezing

- Use `freeze` to freeze part of the module, up to you how many layers will be frozen (start easy, like freezing whole backbone and training only head)
- Train your neural network (remember about the time!)

## Discriminative learning rates

- Set different learning rates for different parts of the network (any optimizer/scheduler you want and think can fit in time)
- Train your neural network (remember about the time!)



In [None]:
def freeze(module: torch.nn.Module):
    ...
    
def unfreeze(module: torch.nn.Module):
    ...

## Challenges

### Assessment

- What is "knowledge distillation"? Where is it used and what are the reasons?
- What is "quantization"? Why is it useful? When should we use it? Read about it in [PyTorch documentation](https://pytorch.org/docs/stable/quantization.html)

### Non-assessment

- Read about necessary steps to publish your models to PyTorch Hub [here](https://pytorch.org/hub/)
- What are [Adapters](https://arxiv.org/pdf/1902.00751.pdf)? 