# A Gentle Intro to `huggingface` and `transformers`

By: Dr. Jie Tao

ver: 0.1

Transformers, one of the most influential recent developments in Deep Learning, have been widely applied in CV, NLP, and other (multi-)modality domains. The idea is that a large model (up to hundreds of billions of parameters; GPT-3, for example, has 175 billion) is pre-trained on a large corpus and can then be used in downstream tasks (classification, generation, etc.). Remember how we used `VGG` or `ResNet` in CV? This is a similar idea.

Huggingface is an API/wrapper that makes using transformers much easier. As an analogy: if the original transformer implementations are like `tensorflow`, then `huggingface` is like `keras`, there to make your life easier.

Some notable characteristics regarding `huggingface` include:

- **NLP tasks**: Transformers can be used for a wide range of NLP tasks, including text classification, sentiment analysis, language translation, and question answering.
- **Pre-trained models**: Transformers provides access to a wide range of pre-trained language models, including `BERT`, `GPT-2`, and `RoBERTa`, which can be fine-tuned for specific NLP tasks.
- **Easy-to-use API**: Hugging Face provides an easy-to-use API that allows developers to quickly integrate Transformers into their NLP projects.
- **Community-driven development**: Hugging Face and Transformers are community-driven projects, which means that anyone can contribute to the development and improvement of the libraries.
- **Model sharing**: Hugging Face also provides a model hosting platform, called "Hugging Face Hub," which allows users to upload their models to the cloud and share them with others.

__NOTE__: `huggingface` supports both `tensorflow` and `torch`, but has native support for `torch`. Since we already know `torch`, this tutorial is built on it.

## Setup `transformers`

Neither Colab nor Anaconda ships with `transformers`, so we need to install it before using it.

- on Colab you need to install every time you connect to a runtime

In [None]:
!pip install -U transformers accelerate

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.30.2-py3-none-any.whl (7.2 MB)
Collecting accelerate
  Downloading accelerate-0.20.3-py3-none-any.whl (227 kB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.15.1-py3-none-any.whl (236 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)

You should also consider using the GPU, since most of the models are large.

In [None]:
import torch

# If there's a GPU available...
if torch.cuda.is_available():

    # Tell PyTorch to use the GPU.
    device = torch.device("cuda:0") ## you can specify which GPU to use if you have more than one, for instance `cuda:0` is the first GPU

    print('There are %d GPU(s) available.' % torch.cuda.device_count())

    print('We will use the GPU:', torch.cuda.get_device_name(0))

# If not...
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 1 GPU(s) available.
We will use the GPU: Tesla T4


## First Glance of `transformers`

As discussed above, `huggingface` is here to make your life easier. How easy? You can use it with a couple lines of code.

In [None]:
from transformers import pipeline

sentiment_classifier = pipeline("sentiment-analysis", device=device)
sentiment_classifier("I've been waiting for a HuggingFace course my whole life.")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.


[{'label': 'POSITIVE', 'score': 0.9598046541213989}]

Pretty good results and makes sense, right?

We can even pass several sentences!

In [None]:
sents = ["I've been waiting for a HuggingFace course my whole life.",
         "Other sentiment analysis tools are not so great.",
         "I hate it when the Warriors lost to the Lakers!"]
sentiment_classifier(sents)

[{'label': 'POSITIVE', 'score': 0.9598046541213989},
 {'label': 'NEGATIVE', 'score': 0.9995458722114563},
 {'label': 'NEGATIVE', 'score': 0.9988825917243958}]

Results are still good, right? Our model also seems to be very confident.

Note the warning above? That's because we didn't specify a model and its version, so the __proper__ way of initializing our model should be:
```python
# the same checkpoint the pipeline defaulted to above;
# you can also pin a version with the `revision` argument (a branch, tag, or commit hash)
pipeline('sentiment-analysis', model='distilbert-base-uncased-finetuned-sst-2-english', device=device)
```

You can find all huggingface models [here](https://huggingface.co/models).

Huggingface provides a variety of pipelines for different tasks, including but not limited to:
- feature-extraction (get the vector representation of a text)
- fill-mask
- ner (named entity recognition)
- question-answering
- sentiment-analysis
- summarization
- text-generation
- translation
- zero-shot-classification

Let's see a few other examples.

In [None]:
gen_pipe = pipeline('text2text-generation', model='google/flan-t5-base', tokenizer='google/flan-t5-base', device=device)
### note that pipelines return a list of dicts
gen_pipe("question: What is 42 ? context: 42 is the answer to life, the universe and everything")[0]['generated_text']

'the answer to life'

In [None]:
generator = pipeline("text-generation", model='gpt2', device=device)
generator("In this course, we will teach you how to")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to create a simple command line tool, and we will show you how to create a simple GUI or application which will look like the following:\n\nThe commands will be easy to use with a simple GUI'}]

See? Now we have our own little ChatGPT!

Another use case is _zero-shot classification_, meaning we use a model (without any fine-tuning) on a task it was not trained on. It also means you do not need labeled data, and the need for labels is the __biggest__ drawback of supervised learning.

__NOTE__: fine-tuning means further training a pre-trained model on task-specific data so that the model performs better on that task.

In [None]:
zs_clf = pipeline("zero-shot-classification", model="valhalla/distilbart-mnli-12-1", device=device)
zs_clf(
    "deep learning is widely used in different aspects of our world",
    candidate_labels=["technology", "education", "business"], # you provide the labels
)

{'sequence': 'deep learning is widely used in different aspects of our world',
 'labels': ['technology', 'education', 'business'],
 'scores': [0.9781833291053772, 0.016946464776992798, 0.004870184697210789]}
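
Under the hood, the zero-shot pipeline reframes classification as natural language inference: each candidate label is turned into a hypothesis sentence, and the model scores how strongly the text entails it. The sketch below only illustrates that bookkeeping in plain Python. The template string is assumed to match the pipeline default, and the entailment logits are invented numbers standing in for real model outputs.

```python
import math

def build_hypotheses(labels, template="This example is {}."):
    """Turn each candidate label into an NLI hypothesis sentence."""
    return [template.format(label) for label in labels]

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["technology", "education", "business"]
hypotheses = build_hypotheses(labels)

# Hypothetical entailment logits, one per (text, hypothesis) pair.
# A real NLI model would produce these; the numbers here are invented.
entailment_logits = [3.1, -0.9, -2.2]

# Normalizing across labels gives one score per label, summing to 1.
scores = softmax(entailment_logits)
ranked = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)
for label, score in ranked:
    print(f"{label}: {score:.3f}")
```

This is why you can supply any labels you like: the labels are just text fed to the model, not classes it was trained on.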

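Are you curious how `transformers` models were pre-trained? Many of them (the BERT family, for example) were trained on a task called __masked language modeling__ (more info [here](https://www.scaler.com/topics/nlp/masked-language-model-explained/)), while GPT-style models instead learn to predict the next token.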

We can see an example of it below (not for training models, just to use it for fun!)

In [None]:
unmasker = pipeline("fill-mask", model="distilroberta-base", device = device)
unmasker("This course will teach you all about <mask> models.", top_k=2)

[{'score': 0.19619779288768768,
  'token': 30412,
  'token_str': ' mathematical',
  'sequence': 'This course will teach you all about mathematical models.'},
 {'score': 0.040527261793613434,
  'token': 38163,
  'token_str': ' computational',
  'sequence': 'This course will teach you all about computational models.'}]
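
To make the pre-training task concrete, here is a minimal sketch of how a masked language modeling example could be built from a sentence: some tokens are hidden, and the originals become the targets the model must predict. The function name, the whitespace tokenization, and the always-replace rule are simplifying assumptions; BERT, for instance, masks about 15% of tokens and uses a more elaborate replacement scheme.

```python
import random

def mask_tokens(tokens, mask_token="<mask>", mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels).

    labels holds the original token at masked positions and None elsewhere,
    so a loss would only be computed where a token was hidden.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(token)
        else:
            masked.append(token)
            labels.append(None)
    return masked, labels

tokens = "this course will teach you all about transformer models".split()
# A higher rate than usual so the short sentence is likely to get a mask.
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked)
print(labels)
```

The `fill-mask` pipeline above performs the prediction half of this task: given a sentence with a `<mask>`, it returns the most likely original tokens.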

### DO IT YOURSELF

1. Add `num_return_sequences` and `max_length` arguments to the `"text-generation"` pipeline so it generates `3` sequences of `50` tokens each.

In [None]:
my_generator = pipeline("text-generation", model='gpt2', device=device, num_return_sequences = 3, max_length=50)
my_generator("Are you cheating on me? ")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "Are you cheating on me? \xa0I'm sure many of you have. \xa0Oh yeah, I am. Well... it all started this week. \xa0I just got the news, from you. \xa0A week ago, I"},
 {'generated_text': "Are you cheating on me? ____ _. ____\n\nThe more you cheat on the world, the more you gain, and the more you gain at your own expense.\n\nAnd it's a great one. I'll make sure"},
 {'generated_text': 'Are you cheating on me? "\n\n[Worf: "Mm, thanks for the support."\n\nMorden: "Good morning, Valkia. Couldn\'t figure out the name of that beautiful female slave girl'}]

2. Build a `"zero-shot-classification"` pipeline to classify the following reviews into these categories:
`["candy", "pet food", "snack"]`

In [None]:
revs = ["I bought these for my husband who is currently overseas. He loves these, and apparently his staff likes them also.", ## candy
        "I love this candy.  After weight watchers I had to cut back but still have a craving for it.", ## candy
        "This is a very healthy dog food. Good for their digestion. Also good for small puppies. ",  ## pet food
        "This is great dog food, my dog has severs allergies and this brand is the only one that we can feed him.", ## pet food
        "I started buying this after I noticed my 1 year old cat was already starting to lose his 'spunk' so I decided it was time to start buying him 'real' cat food...", ## pet food
        "I bought this for our office to give people something sweet to snack on. ", ## snack
        "This is one of the best salsas that I have found in a long time but stay away from the variety pack. "] ## snack

In [None]:
rev_clf = pipeline("zero-shot-classification", model="valhalla/distilbart-mnli-12-1", device=device)
rev_clf(
    revs,
    candidate_labels=["candy", "pet food", "snack"], # you provide the labels
)

[{'sequence': 'I bought these for my husband who is currently overseas. He loves these, and apparently his staff likes them also.',
  'labels': ['snack', 'pet food', 'candy'],
  'scores': [0.7626631259918213, 0.15090307593345642, 0.0864337682723999]},
 {'sequence': 'I love this candy.  After weight watchers I had to cut back but still have a craving for it.',
  'labels': ['candy', 'snack', 'pet food'],
  'scores': [0.7265016436576843, 0.27287203073501587, 0.0006263519753701985]},
 {'sequence': 'This is a very healthy dog food. Good for their digestion. Also good for small puppies. ',
  'labels': ['pet food', 'snack', 'candy'],
  'scores': [0.9470902681350708, 0.05198989808559418, 0.0009198420448228717]},
 {'sequence': 'This is great dog food, my dog has severs allergies and this brand is the only one that we can feed him.',
  'labels': ['pet food', 'snack', 'candy'],
  'scores': [0.9785345196723938, 0.020760230720043182, 0.0007052454748190939]},
 {'sequence': "I started buying this aft

3. Try a different `zero-shot-classification` model from [here](https://huggingface.co/models?pipeline_tag=zero-shot-classification&sort=downloads) on the above data and observe the change.

For more examples of `pipeline`, refer to [this notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/master/course/en/chapter1/section3.ipynb).



---





## Fundamentals of Transformers

Transformers are built on the idea of __transfer learning__, which typically contains two phases, __pre-training__ and __fine-tuning__.

 - Pretraining is the act of training a model from scratch: the weights are randomly initialized, and the training starts without any prior knowledge.


<img src = "https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter1/pretraining.svg"/>

This pretraining is usually done on a very large corpus of data, and training can take up to several weeks.

- Fine-tuning, on the other hand, is the training done after a model has been pretrained. To perform fine-tuning, you first acquire a pretrained language model, then perform additional training with a dataset specific to your task. Wait — why not simply train directly for the final task? There are a couple of reasons:

  + The pretrained model was already trained on a dataset that has some similarities with the fine-tuning dataset. The fine-tuning process is thus able to take advantage of knowledge acquired by the initial model during pretraining (for instance, with NLP problems, the pretrained model will have some kind of statistical understanding of the language you are using for your task).
  + Since the pretrained model was already trained on lots of data, the fine-tuning requires way less data to get decent results.
  + For the same reason, the amount of time and resources needed to get good results are much lower.

The fine-tuning will only require a limited amount of data: the knowledge the pretrained model has acquired is “transferred,” hence the term _transfer learning_.

<img src = "https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter1/finetuning.svg"/>

Fine-tuning a model therefore has lower time, data, financial, and environmental costs. It is also quicker and easier to iterate over different fine-tuning schemes, as the training is less constraining than a full pretraining.

This process will also achieve better results than training from scratch (unless you have lots of data), which is why you should always try to leverage a pretrained model — one as close as possible to the task you have at hand — and fine-tune it.



### General architecture
In this section, we’ll go over the general architecture of the Transformer model. Don’t worry if you don’t understand some of the concepts; there are detailed sections later covering each of the components.


The model is primarily composed of two blocks:

- Encoder (left): The encoder receives an input and builds a representation of it (its features). This means that the model is optimized to acquire understanding from the input.
- Decoder (right): The decoder uses the encoder’s representation (features) along with other inputs to generate a target sequence. This means that the model is optimized for generating outputs.

<img src = "https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter1/transformers_blocks.svg" />

Each of these parts can be used independently, depending on the task:

- Encoder-only models: Good for tasks that require understanding of the input, such as **sentence classification** and **named entity recognition**.
- Decoder-only models: Good for generative tasks such as **text generation**.
- Encoder-decoder models or sequence-to-sequence models: Good for generative tasks that require an input, such as **translation** or summarization.


### Attention layers

A key feature of Transformer models is that they are built with special layers called _attention layers_.

To put this into context, consider the task of translating text from English to French. Given the input “You like this course”, a translation model will need to also attend to the adjacent word “You” to get the proper translation for the word “like”, because in French the verb “like” is conjugated differently depending on the subject. The rest of the sentence, however, is not useful for the translation of that word. In the same vein, when translating “this” the model will also need to pay attention to the word “course”, because “this” translates differently depending on whether the associated noun is masculine or feminine. Again, the other words in the sentence will not matter for the translation of “this”. With more complex sentences (and more complex grammar rules), the model would need to pay special attention to words that might appear farther away in the sentence to properly translate each word.

The same concept applies to any task associated with natural language: a word by itself has a meaning, but that meaning is deeply affected by the context, which can be any other word (or words) before or after the word being studied.
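
The idea of "paying attention" can be sketched with scaled dot-product attention, the operation introduced in the original Transformer paper. The example below is a plain-Python illustration, not how any library implements it: the 2-dimensional token vectors are invented, and real models first pass the inputs through learned query, key, and value projections and use several attention heads, which this sketch omits.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention for lists of equal-length vectors."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query with every key, scaled by sqrt(dim).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # The output is a weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Invented toy vectors for the tokens of "You like this course".
tokens = ["You", "like", "this", "course"]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]

outputs, weights = attention(vectors, vectors, vectors)
for token, row in zip(tokens, weights):
    print(token, [round(w, 2) for w in row])
```

Each printed row shows how much one token attends to every token in the sentence. With these toy vectors, "like" puts more weight on "You" than on "this", mirroring the translation example above.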



---




## What happened behind the scenes of `pipeline`?

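Each pipeline contains three parts: preprocessing the text with a tokenizer, passing the inputs through the model, and post-processing the model outputs: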

<img src = "https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter2/full_nlp_pipeline.svg" />



### Text Preprocessing: Tokenizer

Like other neural networks, Transformer models can’t process raw text directly, so the first step of our pipeline is to convert the text inputs into numbers that the model can make sense of. To do this we use a tokenizer, which will be responsible for:

- Splitting the input into words, subwords, or symbols (like punctuation) that are called tokens
- Mapping each token to an integer ID (its index in the vocabulary, similar to label encoding)
- Adding additional inputs that may be useful to the model (e.g., BOS/EOS)

Tokenizers are associated with their corresponding models (same checkpoint name). We can use the `AutoTokenizer` class to load any tokenizer (note you can also use model-specific tokenizer classes, e.g., `DistilBertTokenizer`), but who doesn't like the one-size-fits-all one?


In [None]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [None]:
raw_inputs = ["I've been waiting for a HuggingFace course my whole life.",
         "Other sentiment analysis tools are not so great.",
         "I hate it when the Warriors lost to the Lakers!"]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
inputs.to(device)

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  2060, 15792,  4106,  5906,  2024,  2025,  2061,  2307,  1012,
           102,     0,     0,     0,     0,     0],
        [  101,  1045,  5223,  2009,  2043,  1996,  6424,  2439,  2000,  1996,
         18264,   999,   102,     0,     0,     0]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]], device='cuda:0')}

In the above code:
- `padding` and `truncation` make sure the different sequences end up with the same length;
- `return_tensors="pt"` makes sure the results are `torch.Tensor` objects instead of plain Python lists.

The output itself is a dictionary containing two keys, `input_ids` and `attention_mask`. `input_ids` contains three rows of integers (one for each sentence) that are the unique identifiers of the tokens in each sentence.

`Attention masks` are tensors with the exact same shape as the `input IDs` tensor, filled with `0`s and `1`s: `1`s indicate the corresponding tokens **should be attended** to, and `0`s indicate the corresponding tokens **should not be attended** to (i.e., they should be ignored by the attention layers of the model).

For instance, all the `padding_tokens` have `attention masks` of `0`, meaning the attention layer(s) ignore(s) them, which makes sense since they have no meaning - they are just there to make sure the model works.
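
What `padding=True` does to the batch can be reproduced by hand. The sketch below pads every sequence to the longest one and builds the matching attention mask; the helper name `pad_batch` is hypothetical, and the padding ID of `0` is taken from the tokenizer output shown above.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest one and build the matching attention mask."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding that attention should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

# Token IDs copied from the tokenizer output above, with the padding removed.
batch = [
    [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102],
    [101, 2060, 15792, 4106, 5906, 2024, 2025, 2061, 2307, 1012, 102],
    [101, 1045, 5223, 2009, 2043, 1996, 6424, 2439, 2000, 1996, 18264, 999, 102],
]
input_ids, attention_mask = pad_batch(batch)
print(input_ids[1])
print(attention_mask[1])
```

The second sentence has 11 tokens, so it receives 5 padding IDs and 5 zeros in its mask, matching the second row of the tensors above.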

#### More details about Tokenization

Tokenization actually happens in three steps:
- splitting text into tokens and sub-tokens
- map tokens to input IDs
- adding other required tokens.

Let's take these steps one by one to observe.

In [None]:
## split
tokens = tokenizer.tokenize(raw_inputs[0]) ## note all tokens are converted to lower case since we are using an "uncased" model
tokens

['i',
 "'",
 've',
 'been',
 'waiting',
 'for',
 'a',
 'hugging',
 '##face',
 'course',
 'my',
 'whole',
 'life',
 '.']

Notice how `tokenizer` split "huggingface" into two sub-tokens? That is because the word is not in the tokenizer's vocabulary (it was too rare in the data used to build it), so it is broken into pieces that are.
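
The splitting rule itself is simple. BERT-style tokenizers use WordPiece, which repeatedly takes the longest piece of the word that is in the vocabulary and marks continuation pieces with `##`. Below is a minimal sketch of that greedy matching; the function and the four-entry vocabulary are hypothetical, whereas the real vocabulary has roughly 30,000 entries.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into sub-tokens."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary for illustration only.
toy_vocab = {"hugging", "##face", "course", "life"}
print(wordpiece("huggingface", toy_vocab))
print(wordpiece("course", toy_vocab))
```

With this toy vocabulary, "huggingface" comes out as `hugging` and `##face`, the same split the real tokenizer produced, while "course" stays whole because it is in the vocabulary.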

__PRO-TIP__: if you think "huggingface" is an important token, you can add it to the tokenizer.

In [None]:
tokenizer.add_tokens(["huggingface", "Lakers"])
tokens = tokenizer.tokenize(raw_inputs[0])
tokens

['i',
 "'",
 've',
 'been',
 'waiting',
 'for',
 'a',
 'huggingface',
 'course',
 'my',
 'whole',
 'life',
 '.']

Now we can convert tokens to input IDs.

In [None]:
ids = tokenizer.convert_tokens_to_ids(tokens)
ids

[1045, 1005, 2310, 2042, 3403, 2005, 1037, 30522, 2607, 2026, 2878, 2166, 1012]

In [None]:
## total size of the vocabulary
## you can see since "huggingface" is a word we created
## the ID is very large
len(tokenizer.vocab)

30524

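Keep in mind that any token you add to the `tokenizer` has no trained embedding. The model's embedding matrix must first be resized with `model.resize_token_embeddings(len(tokenizer))`, the new rows start out random, and fine-tuning is required to make use of them.

Of course, you can convert input IDs back to text.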

In [None]:
tokenizer.decode(ids)

"i've been waiting for a huggingface course my whole life."

There are other things about tokenizer that you might want to know:
1. **Padding strategies**: you can do `"longest"` (padding to the longest sequence in the batch) or `"max_length"` (padding to the length given by the `max_length` argument, e.g., `16`, which cannot exceed the max length supported by the model; if `max_length` is not given, padding goes to the model's max length, for instance `512` in the BERT family).
2. __Truncation strategies__: with `truncation=True`, sequences are cut to `max_length`; if you do not specify `max_length`, they are truncated to the max length supported by the model.
3. __Return types__: can be `"pt"` (`torch.Tensor`), `"tf"` (`tf.Tensor`) or `"np"` (`numpy.ndarray`).


#### Special Tokens

Special tokens are very important in transformer models. Specifically, we care about three of them:
- `[CLS]`: beginning of sequence
- `[SEP]`: separator of sub-sequences, e.g., multiple sentences in the same input; BERT-style tokenizers also place it at the end of the input
- `[EOS]`: end of sequence, used by some other model families but not by the BERT-style tokenizer in this example.

If you use `tokenizer.tokenize()`, no special token is added, but if you use the `tokenizer` object, you will see the difference.

In [None]:
model_inputs = tokenizer(raw_inputs[0])
print(model_inputs["input_ids"])

[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 30522, 2607, 2026, 2878, 2166, 1012, 102]


In [None]:
ids

[1045, 1005, 2310, 2042, 3403, 2005, 1037, 30522, 2607, 2026, 2878, 2166, 1012]

Notice the `101` and `102` tokens added? Let's see what they are.

In [None]:
print(tokenizer.decode(model_inputs["input_ids"]))

[CLS] i've been waiting for a huggingface course my whole life. [SEP]


We need special tokens because a lot of the models are pre-trained with them.

__PRO-TIP__: the `[CLS]` token is extremely important since BERT-style models typically use its embedding as the embedding of the whole sequence. This differs from approaches that use the average or concatenation of the token embeddings, and it works well in many cases.
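
The two ways of getting a single vector per sequence can be compared side by side. This is a plain-Python sketch with invented 3-dimensional hidden states; the function names are hypothetical, and real hidden states would come from the model as tensors.

```python
def cls_pooling(hidden_states):
    """Use the vector at position 0 (the [CLS] token) as the sequence embedding."""
    return hidden_states[0]

def mean_pooling(hidden_states, attention_mask):
    """Average the token vectors, skipping positions whose mask is 0."""
    kept = [vec for vec, keep in zip(hidden_states, attention_mask) if keep]
    dim = len(kept[0])
    return [sum(vec[j] for vec in kept) / len(kept) for j in range(dim)]

# Invented hidden states for: [CLS], two word tokens, and one padding token.
hidden_states = [
    [0.2, 0.4, 0.6],  # [CLS]
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [9.0, 9.0, 9.0],  # padding, must not influence the mean
]
attention_mask = [1, 1, 1, 0]

print(cls_pooling(hidden_states))
print(mean_pooling(hidden_states, attention_mask))
```

Note that mean pooling needs the attention mask so that padding does not distort the average, while `[CLS]` pooling simply reads one position.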

### Going through the model

We can also use the `AutoModel` class to load the corresponding model.

In [None]:
from transformers import AutoModel
model = AutoModel.from_pretrained(checkpoint).to(device) # move to GPU

Some weights of the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing DistilBertModel: ['classifier.weight', 'pre_classifier.weight', 'classifier.bias', 'pre_classifier.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


The output from the model is usually a high-dimensional tensor. It generally has three dimensions:

- Batch size: The number of sequences processed at a time (3 in our example).
- Sequence length: The length of the numerical representation of the sequence (16 in our example).
- Hidden size: The dimension of the vector that represents each token (768 in our example).


In [None]:
outputs = model(**inputs) ## since inputs is a dict, we can use keyword arguments (**kwargs) for this
print(outputs.last_hidden_state.shape)

torch.Size([3, 16, 768])


You should read the above results as: we have `3` sequences (batch size), each sequence contains `16` tokens (sequence length after `padding` and `truncating`), and each token is represented by a `768`-dimensional real-valued vector (hidden state).

__PRO-TIP__: we can customize `max_length`, but usually we cannot customize the embedding/hidden size (`768`).

### So what have we done?

The above `outputs` are the embeddings from our model. You can think of them as the output from the `Embedding` layer in your `keras` model, except that each token's vector also reflects its context.

__PRO-TIP__: These vectors can be used to calculate semantic similarity, to visualize sentences in a language space, or for topic modeling.
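
As an illustration of the semantic similarity use case, cosine similarity is a common way to compare two embeddings. The sketch below uses invented 4-dimensional vectors and hypothetical variable names; real sentence embeddings from this model would have 768 dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented embeddings: two sentences about courses, one about basketball.
emb_course = [0.9, 0.1, 0.3, 0.0]
emb_class = [0.8, 0.2, 0.4, 0.1]
emb_lakers = [0.0, 0.9, 0.1, 0.8]

print(round(cosine_similarity(emb_course, emb_class), 3))
print(round(cosine_similarity(emb_course, emb_lakers), 3))
```

Sentences with related meaning should score closer to 1 than unrelated ones, which is the case for the toy vectors here.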

But usually we need a downstream task (e.g., sentence classification), so we typically add a _head_ to the generic model.

<img src = "https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter2/transformer_and_head.svg" />

Here is a non-exhaustive list of different heads:

- *Model (retrieve the hidden states)
- *ForCausalLM
- *ForMaskedLM
- *ForMultipleChoice
- *ForQuestionAnswering
- *ForSequenceClassification
- *ForTokenClassification
- and others 🤗

For our example, we will need a model with a sequence classification head (to be able to classify the sentences as positive or negative). So, we won’t actually use the AutoModel class, but AutoModelForSequenceClassification:

In [None]:
from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint).to(device)
outputs = model(**inputs)
print(outputs.logits.shape)

torch.Size([3, 2])


In [None]:
print(outputs.logits)

tensor([[-1.5607,  1.6123],
        [ 4.2703, -3.4263],
        [ 3.7515, -3.0441]], device='cuda:0', grad_fn=<AddmmBackward0>)


### Post-processing: make sense of the logits

We can apply `SoftMax` on the `logits` to get the predictions.

In [None]:
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

tensor([[4.0195e-02, 9.5980e-01],
        [9.9955e-01, 4.5414e-04],
        [9.9888e-01, 1.1174e-03]], device='cuda:0', grad_fn=<SoftmaxBackward0>)
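
To see what softmax does to the numbers, the same computation can be written out by hand. This sketch uses only the standard library and the logits of the first sentence as printed above; the helper is a generic softmax, not the PyTorch implementation.

```python
import math

def softmax(logits):
    """Exponentiate each logit and normalize so the results sum to 1."""
    peak = max(logits)  # subtracting the max avoids overflow
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Logits of the first sentence, copied from the output above.
logits = [-1.5607, 1.6123]
probs = softmax(logits)
print([round(p, 4) for p in probs])
```

The second probability comes out at roughly 0.96, the same value as the first row of `predictions` and as the score the pipeline reported for this sentence at the start of the tutorial.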


Then we can look into the labels in the `model`.

In [None]:
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}

Now we can map those labels to our `predictions`.

In [None]:
[model.config.id2label.get(pl) for pl in torch.argmax(predictions, dim=1).cpu().numpy()]

['POSITIVE', 'NEGATIVE', 'NEGATIVE']

What if we only have one input, for instance the `ids` thing we created? Can we pass that to our model?

In [None]:
model(ids)

`transformers` is complaining about the data type. If you remember, we want the input to be a `torch.Tensor`. So let's try again.

In [None]:
input_ids = torch.tensor(ids).to(device)
model(input_ids)

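Now the problem is that we are missing a dimension. Even if you only have one input, it is still a batch (of size 1). Also, because we added new tokens to the tokenizer earlier, the model's embedding matrix has to be resized first; otherwise the new ID `30522` is out of range. So the code below would work:
```python
model.resize_token_embeddings(len(tokenizer))  # make room for the tokens added to the tokenizer
input_ids1 = torch.tensor([ids]).to(device)  # the extra brackets add the batch dimension
model(input_ids1)
```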



---




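In [None]:
model.resize_token_embeddings(len(tokenizer))  # make room for the tokens added to the tokenizer
input_ids1 = torch.tensor([ids]).to(device)  # the extra brackets add the batch dimension
model(input_ids1)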