In [1]:
%run supportvectors-common.ipynb



<center><img src="https://d4x5p7s4.rocketcdn.me/wp-content/uploads/2016/03/logo-poster-smaller.png"/> </center>
<div style="color:#aaa;font-size:8pt">
<hr/>
&copy; SupportVectors. All rights reserved. <blockquote>This notebook is the intellectual property of SupportVectors, and part of its training material. 
Only participants in SupportVectors workshops are currently allowed to study the notebooks, for educational purposes; copying them or using them for any other purpose without written permission is prohibited.

<b> These notebooks are chapters and sections from Asif Qamar's textbook that he is writing on Data Science. So we request you to not circulate the material to others.</b>
 </blockquote>
 <hr/>
</div>



## Working with pretrained models



Recall from the previous notebook that transfer learning is a two-step process:

* start with a **pre-trained model checkpoint**, trained on an adjacent task

* **fine-tune** the model with a few epochs of further training, using our task-specific training dataset


In this notebook, we will use the Hugging Face `transformers` API to load and use pre-trained models.

The easiest way to do this is the `Auto<Class>.from_pretrained` method, which picks the right concrete class for a given checkpoint name. For example, to load a tokenizer, use `AutoTokenizer`.
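To give a feel for what an `Auto` class does, here is a minimal sketch of the dispatch idea: read a `model_type` field from the checkpoint's configuration and look up the matching concrete class in a registry. All names below (`ToyBertTokenizer`, `ToyGPT2Tokenizer`, `TOKENIZER_REGISTRY`, `toy_auto_tokenizer`) are hypothetical and only illustrate the pattern; they are not the library's internals.

```python
# Hypothetical sketch of the "Auto" dispatch pattern; not the transformers source.

class ToyBertTokenizer:
    """Stand-in for a BERT-style tokenizer class."""


class ToyGPT2Tokenizer:
    """Stand-in for a GPT-2-style tokenizer class."""


# Registry: model_type string -> concrete class.
TOKENIZER_REGISTRY = {
    "bert": ToyBertTokenizer,
    "gpt2": ToyGPT2Tokenizer,
}


def toy_auto_tokenizer(config: dict):
    """Instantiate the tokenizer class registered for config['model_type']."""
    model_type = config["model_type"]
    try:
        tokenizer_cls = TOKENIZER_REGISTRY[model_type]
    except KeyError:
        raise ValueError(f"No tokenizer registered for model_type={model_type!r}")
    return tokenizer_cls()


# In practice, the checkpoint's config file would supply this dictionary.
tok = toy_auto_tokenizer({"model_type": "bert"})
print(type(tok).__name__)
```

The benefit of this design is that calling code only needs the checkpoint name; swapping in a different checkpoint does not require changing which class is imported.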



In [2]:
example = "The tokenizer does tokenization. It does this to have fun with tokens."

In [3]:
from transformers import AutoTokenizer
import pandas as pd
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")


In [4]:
tokens = tokenizer(example)
print(tokens)

{'input_ids': [101, 1996, 19204, 17629, 2515, 19204, 3989, 1012, 2009, 2515, 2023, 2000, 2031, 4569, 2007, 19204, 2015, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
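In the output above, the ids 101 and 102 at the two ends are the `[CLS]` and `[SEP]` special tokens of `bert-base-uncased`, and the id 19204 appears three times, which is consistent with "tokenizer", "tokenization", and "tokens" sharing a common leading piece. The sketch below imitates this behaviour with a greedy longest-match-first subword split in the WordPiece style. The vocabulary `TOY_VOCAB`, its ids, and the helpers `wordpiece` and `encode` are made up for illustration; the real tokenizer also handles casing, punctuation, and unknown words far more carefully.

```python
# Toy WordPiece-style tokenizer: a sketch with a made-up vocabulary and ids.
TOY_VOCAB = {
    "[CLS]": 0, "[SEP]": 1, "[UNK]": 2,
    "the": 3, "token": 4, "##izer": 5, "##ization": 6, "##s": 7,
    "does": 8, ".": 9,
}


def wordpiece(word, vocab):
    """Split one word into pieces by greedy longest-match-first lookup."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matches: treat the whole word as unknown
        pieces.append(piece)
        start = end
    return pieces


def encode(text, vocab):
    """Lowercase, split on whitespace and '.', then map pieces to ids."""
    words = text.lower().replace(".", " . ").split()
    pieces = ["[CLS]"]
    for word in words:
        pieces.extend(wordpiece(word, vocab))
    pieces.append("[SEP]")
    input_ids = [vocab[p] for p in pieces]
    return {
        "tokens": pieces,
        "input_ids": input_ids,
        "token_type_ids": [0] * len(input_ids),  # a single text segment
        "attention_mask": [1] * len(input_ids),  # no padding, so attend everywhere
    }


encoded = encode("The tokenizer does tokenization.", TOY_VOCAB)
print(encoded["tokens"])
print(encoded["input_ids"])
```

As in the real output, `token_type_ids` is all zeros because there is only one segment, and `attention_mask` is all ones because nothing has been padded.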


#### Image processing

For image inputs, use `AutoImageProcessor`.

In [5]:
from transformers import AutoImageProcessor

image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")

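An image processor typically resizes an image and normalizes its pixel values so that the input matches what the model saw during training. The sketch below shows only the arithmetic. The image size (224) and patch size (16) are read off the checkpoint name; the mean and standard deviation of 0.5 are assumptions for illustration and should be checked against the checkpoint's preprocessor configuration. The helper names are hypothetical.

```python
# Sketch of ViT-style preprocessing arithmetic; the mean/std values are assumptions.
IMAGE_SIZE = 224    # from the checkpoint name "...-224"
PATCH_SIZE = 16     # from the checkpoint name "patch16"
ASSUMED_MEAN = 0.5  # assumption, not read from the checkpoint
ASSUMED_STD = 0.5   # assumption, not read from the checkpoint


def normalize_pixel(value):
    """Rescale an 8-bit channel value to [0, 1], then standardize it."""
    rescaled = value / 255.0
    return (rescaled - ASSUMED_MEAN) / ASSUMED_STD


def num_patches(image_size, patch_size):
    """Number of non-overlapping square patches in a square image."""
    per_side = image_size // patch_size
    return per_side * per_side


print(normalize_pixel(0), normalize_pixel(255))
print(num_patches(IMAGE_SIZE, PATCH_SIZE))
```

With these sizes, a 224 × 224 image is cut into 14 × 14 = 196 patches of 16 × 16 pixels, and each patch plays the role that a token plays in text.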

#### Audio processing

For audio inputs, use `AutoFeatureExtractor`.

In [6]:
from transformers import AutoFeatureExtractor

feature_extractor = AutoFeatureExtractor.from_pretrained(
    "ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition"
)

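A feature extractor for raw-waveform models usually expects audio at a fixed sampling rate and may normalize each waveform to zero mean and unit variance. The sketch below illustrates that normalization in plain Python. The 16 kHz sampling rate and the small `eps` constant are assumptions for illustration (check the checkpoint's preprocessor configuration), and `normalize_waveform` is a hypothetical helper, not a library function.

```python
import math

ASSUMED_SAMPLING_RATE = 16_000  # Hz; an assumption, check the checkpoint's config


def normalize_waveform(samples, eps=1e-7):
    """Shift a waveform to zero mean and scale it to unit variance."""
    n = len(samples)
    mean = sum(samples) / n
    variance = sum((s - mean) ** 2 for s in samples) / n
    scale = math.sqrt(variance + eps)  # eps guards against division by zero
    return [(s - mean) / scale for s in samples]


# One hundredth of a second of a 440 Hz sine wave with a constant offset.
raw = [
    0.3 + 0.2 * math.sin(2 * math.pi * 440 * t / ASSUMED_SAMPLING_RATE)
    for t in range(160)
]
normalized = normalize_waveform(raw)
print(len(normalized))
```

Audio recorded at a different sampling rate would need to be resampled before being passed to the feature extractor.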

#### Multimodal processing

Multimodal tasks combine inputs in more than one format, so they need a separate processor for each format internally. For example, a task may require an image processor for the images and a tokenizer for the text. `AutoProcessor` loads a single object that bundles them.

In [7]:
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("microsoft/layoutlmv2-base-uncased")

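The idea of one object wrapping two processors can be sketched as follows. `ToyProcessor`, `toy_tokenizer`, and `toy_image_processor` are hypothetical stand-ins written for this illustration; a real multimodal processor does considerably more work for each modality.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ToyProcessor:
    """Hypothetical wrapper bundling a tokenizer and an image processor."""
    tokenizer: Callable[[str], dict]
    image_processor: Callable[[list], dict]

    def __call__(self, text: str, image: list) -> dict:
        # Run each modality through its own processor, then merge the results.
        features = {}
        features.update(self.tokenizer(text))
        features.update(self.image_processor(image))
        return features


def toy_tokenizer(text: str) -> dict:
    """Stand-in tokenizer: one made-up id (the word length) per word."""
    words = text.lower().split()
    return {"input_ids": [len(w) for w in words], "attention_mask": [1] * len(words)}


def toy_image_processor(image: list) -> dict:
    """Stand-in image processor: rescale 8-bit pixel values to [0, 1]."""
    return {"pixel_values": [[p / 255.0 for p in row] for row in image]}


processor_sketch = ToyProcessor(toy_tokenizer, toy_image_processor)
batch = processor_sketch("Invoice total", [[0, 255], [255, 0]])
print(sorted(batch))
```

The merged dictionary carries the text features and the image features side by side, so it can be passed to a multimodal model in a single call.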

#### Pretrained models

We can load a pre-trained model with a head for a specific task; often the same checkpoint can serve as the starting point for more than one task. The cell below loads the same checkpoint twice, once for sequence classification and once for token classification.

In [8]:
from transformers import AutoModelForSequenceClassification, AutoModelForTokenClassification

checkpoint = "distilbert-base-uncased"

# For sequence classification
sequence_classifier = AutoModelForSequenceClassification.from_pretrained(checkpoint)

# For token classification
token_classifier = AutoModelForTokenClassification.from_pretrained(checkpoint)


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
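The warnings above say that the checkpoint supplies the shared body of the model, while the task-specific head (`classifier`, and for sequence classification also `pre_classifier`) has no stored weights and is therefore initialized from scratch. That is why the model must be fine-tuned before its predictions mean anything. The sketch below reproduces this bookkeeping with plain dictionaries. The parameter names under `encoder.`, the weight values, and `load_with_head` are made up for illustration and do not mirror the library's loading code.

```python
import random

# Parameters stored in a hypothetical base checkpoint (body only, no task head).
checkpoint_weights = {
    "encoder.layer0.weight": [0.12, -0.40],
    "encoder.layer0.bias": [0.05],
}


def load_with_head(checkpoint, head_param_names, seed=0):
    """Copy matching weights from the checkpoint; randomly initialize the rest."""
    rng = random.Random(seed)
    expected = list(checkpoint) + list(head_param_names)
    state, newly_initialized = {}, []
    for name in expected:
        if name in checkpoint:
            state[name] = list(checkpoint[name])  # reuse the pre-trained values
        else:
            state[name] = [rng.gauss(0.0, 0.02)]  # fresh, untrained values
            newly_initialized.append(name)
    return state, newly_initialized


seq_state, seq_new = load_with_head(
    checkpoint_weights,
    ["pre_classifier.weight", "pre_classifier.bias", "classifier.weight", "classifier.bias"],
)
tok_state, tok_new = load_with_head(
    checkpoint_weights, ["classifier.weight", "classifier.bias"]
)

print(seq_new)
print(tok_new)
# Both models start from identical body weights; only the heads differ.
print(seq_state["encoder.layer0.weight"] == tok_state["encoder.layer0.weight"])
```

Fine-tuning then trains the new head, and usually the body as well, on the task-specific dataset.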
