# Knowledge Distillation in AutoMM
:label:`sec_automm_distillation_multilingual`

Pretrained foundation models keep growing in size, which makes them difficult to deploy 
when compute and memory are limited. Knowledge distillation addresses this constraint by 
transferring the knowledge of a large teacher model to a small student model.
The student is small enough to deploy in real-world scenarios,
and, thanks to the teacher's guidance, it can perform better than the same model trained without a teacher.
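
To make the idea concrete, the following is a minimal sketch of a typical distillation objective, not AutoMM's actual implementation. It assumes the common formulation in which the student minimizes a weighted sum of a hard-label cross-entropy (against the ground truth) and a soft-label cross-entropy (against the teacher's temperature-softened predictions). The function names, the logits, and the `temperature` and `soft_weight` values are all hypothetical and chosen for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, softened by `temperature`."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, predicted_probs):
    """Cross-entropy H(target, predicted) between two distributions."""
    return -sum(t * math.log(p) for t, p in zip(target_probs, predicted_probs) if t > 0)


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, soft_weight=0.5):
    """Weighted sum of a hard-label term and a soft-label term (illustrative)."""
    # Hard-label term: the student against the ground-truth class.
    one_hot = [1.0 if i == label else 0.0 for i in range(len(student_logits))]
    hard = cross_entropy(one_hot, softmax(student_logits))
    # Soft-label term: the student against the teacher's softened distribution.
    soft = cross_entropy(softmax(teacher_logits, temperature),
                         softmax(student_logits, temperature))
    return (1.0 - soft_weight) * hard + soft_weight * soft


student_logits = [0.2, 0.4]   # hypothetical student outputs for one example
teacher_logits = [-1.5, 2.0]  # hypothetical teacher outputs for the same example
loss = distillation_loss(student_logits, teacher_logits, label=1)
print(f"combined loss: {loss:.4f}")
```

A higher temperature flattens the teacher's distribution, so the student also sees how the teacher ranks the classes it considers less likely.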

In this tutorial, we show how to use `MultiModalPredictor` for knowledge distillation. For demonstration, we use the [Question-answering NLI](https://paperswithcode.com/dataset/qnli) dataset, 
which comprises 104,743 question-sentence pairs sampled from question answering datasets. We will demonstrate how to use a large model to guide the learning and improve the performance of a small model in AutoGluon.

## Load Dataset

The [Question-answering NLI](https://paperswithcode.com/dataset/qnli) dataset contains 
question-sentence pairs in English. In the GLUE version loaded below, the label `0` (`entailment`) means that the sentence answers the question and `1` (`not_entailment`) means that it does not.

In [1]:
from datasets import load_dataset

dataset = load_dataset("glue", "qnli")

Reusing dataset glue (/home/ubuntu/.cache/huggingface/datasets/glue/qnli/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)



In [2]:
dataset['train']

Dataset({
    features: ['question', 'sentence', 'label', 'idx'],
    num_rows: 104743
})

In [12]:
from sklearn.model_selection import train_test_split

train_valid_df = dataset["train"].to_pandas()[["question", "sentence", "label"]].sample(1000, random_state=123)
train_df, valid_df = train_test_split(train_valid_df, test_size=0.2)
test_df = dataset["validation"].to_pandas()[["question", "sentence", "label"]].sample(1000, random_state=123)

## Load the Teacher Model

In our example, we will directly load a teacher model with the [google/bert_uncased_L-12_H-768_A-12](https://huggingface.co/google/bert_uncased_L-12_H-768_A-12) backbone that has been trained on QNLI and distill it into a student model with the [google/bert_uncased_L-6_H-768_A-12](https://huggingface.co/google/bert_uncased_L-6_H-768_A-12) backbone.
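
The two checkpoint names follow the naming convention of Google's released BERT models, in which `L` is the number of Transformer layers, `H` is the hidden size, and `A` is the number of attention heads. The sketch below reads these values from the names to show how the student differs from the teacher. `parse_bert_checkpoint` is a hypothetical helper written for this illustration and is not part of AutoGluon.

```python
import re


def parse_bert_checkpoint(name):
    """Extract layer count, hidden size, and attention heads from a checkpoint name."""
    match = re.search(r"L-(\d+)_H-(\d+)_A-(\d+)", name)
    if match is None:
        raise ValueError(f"unrecognized checkpoint name: {name}")
    layers, hidden, heads = (int(group) for group in match.groups())
    return {"layers": layers, "hidden_size": hidden, "attention_heads": heads}


teacher = parse_bert_checkpoint("google/bert_uncased_L-12_H-768_A-12")
student = parse_bert_checkpoint("google/bert_uncased_L-6_H-768_A-12")
print("teacher:", teacher)
print("student:", student)
```

The student keeps the teacher's hidden size and number of attention heads but has half as many layers.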

In [4]:
!wget --quiet https://automl-mm-bench.s3.amazonaws.com/unit-tests/distillation_sample_teacher.zip -O distillation_sample_teacher.zip
!unzip -q -o distillation_sample_teacher.zip -d .

In [5]:
from autogluon.multimodal import MultiModalPredictor

teacher_predictor = MultiModalPredictor.load("ag_distillation_sample_teacher/")



## Distill to Student

Training the student model is straightforward: just pass the `teacher_predictor` argument when calling `.fit()`.

In [6]:
student_predictor = MultiModalPredictor(label="label")
student_predictor.fit(
    train_df,
    tuning_data=valid_df,
    teacher_predictor=teacher_predictor,
    hyperparameters={
        "model.hf_text.checkpoint_name": "google/bert_uncased_L-6_H-768_A-12",
        "optimization.max_epochs": 2,
    }
)

Global seed set to 123
Auto select gpus: [0]
Using 16bit native Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]

  | Name                         | Type                         | Params
------------------------------------------------------------------------------
0 | student_model                | HFAutoModelForTextPrediction | 67.0 M
1 | teacher_model                | HFAutoModelForTextPrediction | 109 M 
2 | validation_metric            | AUROC                        | 0     
3 | hard_label_loss_func         | CrossEntropyLoss             | 0     
4 | soft_label_loss_func         | CrossEntropyLoss             | 0     
5 | softmax_regression_loss_func | MSELoss                      | 0     
6 | output_feature_loss_func     | MSELoss                      | 0     
7 | output_feature_ada


Epoch 0, global step 3: 'val_roc_auc' reached 0.66559 (best 0.66559), saving model to '/home/ubuntu/xjshi/autogluon/docs/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20221115_092236/epoch=0-step=3.ckpt' as top 3


Epoch 0, global step 7: 'val_roc_auc' reached 0.77510 (best 0.77510), saving model to '/home/ubuntu/xjshi/autogluon/docs/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20221115_092236/epoch=0-step=7.ckpt' as top 3


Epoch 1, global step 10: 'val_roc_auc' reached 0.78236 (best 0.78236), saving model to '/home/ubuntu/xjshi/autogluon/docs/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20221115_092236/epoch=1-step=10.ckpt' as top 3


Epoch 1, global step 14: 'val_roc_auc' reached 0.78215 (best 0.78236), saving model to '/home/ubuntu/xjshi/autogluon/docs/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20221115_092236/epoch=1-step=14.ckpt' as top 3
`Trainer.fit` stopped: `max_epochs=2` reached.



<autogluon.multimodal.predictor.MultiModalPredictor at 0x7f4fbd36ea00>

In [7]:
print(student_predictor.evaluate(data=test_df))

{'roc_auc': 0.7796779677967796}
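
The score reported by `evaluate` here is ROC AUC. As a reminder of what that number means, the sketch below computes ROC AUC in its pairwise form: the fraction of (positive, negative) example pairs in which the positive example receives the higher score, with ties counted as one half. This is an illustrative pure-Python version with made-up labels and scores, not the routine that AutoMM calls internally.

```python
def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC AUC needs at least one positive and one negative")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


labels = [0, 0, 1, 1]              # made-up ground truth
scores = [0.1, 0.4, 0.35, 0.8]     # made-up predicted scores for class 1
auc = roc_auc(labels, scores)
print(auc)  # 0.75: three of the four positive/negative pairs are ordered correctly
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.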


## Comparing with Direct Finetuning

Next, we finetune the same small backbone, 
[google/bert_uncased_L-6_H-768_A-12](https://huggingface.co/google/bert_uncased_L-6_H-768_A-12), 
without distillation. 
We use the same training data and hyperparameters as above, so this model is also finetuned for only 2 epochs.

In [8]:
nodistill_predictor = MultiModalPredictor(label="label")
nodistill_predictor.fit(
    train_df,
    tuning_data=valid_df,
    hyperparameters={
        "model.hf_text.checkpoint_name": "google/bert_uncased_L-6_H-768_A-12",
        "optimization.max_epochs": 2,
    }
)

Global seed set to 123
Auto select gpus: [0]
Using 16bit native Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]

  | Name              | Type                         | Params
-------------------------------------------------------------------
0 | model             | HFAutoModelForTextPrediction | 67.0 M
1 | validation_metric | AUROC                        | 0     
2 | loss_func         | CrossEntropyLoss             | 0     
-------------------------------------------------------------------
67.0 M    Trainable params
0         Non-trainable params
67.0 M    Total params
133.913   Total estimated model params size (MB)



Epoch 0, global step 3: 'val_roc_auc' reached 0.64551 (best 0.64551), saving model to '/home/ubuntu/xjshi/autogluon/docs/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20221115_092404/epoch=0-step=3.ckpt' as top 3


Epoch 0, global step 7: 'val_roc_auc' reached 0.75955 (best 0.75955), saving model to '/home/ubuntu/xjshi/autogluon/docs/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20221115_092404/epoch=0-step=7.ckpt' as top 3


Epoch 1, global step 10: 'val_roc_auc' reached 0.77047 (best 0.77047), saving model to '/home/ubuntu/xjshi/autogluon/docs/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20221115_092404/epoch=1-step=10.ckpt' as top 3


Epoch 1, global step 14: 'val_roc_auc' reached 0.76861 (best 0.77047), saving model to '/home/ubuntu/xjshi/autogluon/docs/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20221115_092404/epoch=1-step=14.ckpt' as top 3
`Trainer.fit` stopped: `max_epochs=2` reached.



<autogluon.multimodal.predictor.MultiModalPredictor at 0x7f4f3c303cd0>

In [9]:
print(nodistill_predictor.evaluate(data=test_df))

{'roc_auc': 0.7889288928892888}


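In this run, the distilled `student_predictor` reaches a higher best validation ROC AUC than `nodistill_predictor` (0.78236 vs. 0.77047), but on the 1,000-example test sample `nodistill_predictor` scores slightly higher (0.7889 vs. 0.7797). With only a 1,000-example training sample and 2 epochs, the gap is small in either direction, so this short demonstration does not by itself show a consistent gain from distillation.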

## More about Knowledge Distillation

To learn how to customize distillation, see the examples 
and README in [AutoMM Distillation Examples](https://github.com/awslabs/autogluon/tree/master/examples/automm/distillation), 
especially the [multilingual distillation example](https://github.com/awslabs/autogluon/tree/master/examples/automm/distillation/automm_distillation_pawsx.py), which provides more details and customization options.

## Other Examples

You may go to [AutoMM Examples](https://github.com/awslabs/autogluon/tree/master/examples/automm) to explore other AutoMM examples.

## Customization
To learn how to customize AutoMM, please refer to :ref:`sec_automm_customization`.