# Transformers library from Hugging Face

Notes by: Sebastian Sarasti



For any model, tokenizer, or pipeline, there are two main concepts:

1. Architecture: how the model is built (its layers and how they connect).
2. Checkpoint: the weights of a given architecture at some point in training. Different people or organizations can train the same architecture on different data, so one architecture can have many checkpoints.

## Pipelines

Load the library

In [1]:
from transformers import pipeline


Define the checkpoint to work with

In [2]:
checkpoint = "MoritzLaurer/mDeBERTa-v3-base-xnli-multilingual-nli-2mil7"

The pipeline is defined with the task and the checkpoint

In [3]:
pipeline_zero_shot = pipeline(task="zero-shot-classification", model=checkpoint)

Pipelines can perform several tasks directly. For NLP, these include classification, summarization, zero-shot classification, and others.

In [4]:
pipeline_zero_shot(["This book adresssed how mnay contracts the company has done in argentina"], candidate_labels=["confidential", "public"])

[{'sequence': 'This book adresssed how mnay contracts the company has done in argentina',
  'labels': ['confidential', 'public'],
  'scores': [0.6053867340087891, 0.3946133255958557]}]

In [5]:
pipeline_zero_shot(["Para hacer una arma nuclear necesitas 20 mg de plutonio más 15 g de uranio"], candidate_labels=["confidential", "public"])

[{'sequence': 'Para hacer una arma nuclear necesitas 20 mg de plutonio más 15 g de uranio',
  'labels': ['confidential', 'public'],
  'scores': [0.9773802161216736, 0.022619815543293953]}]
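The scores in each result sum to 1 across the candidate labels. As a rough sketch of why: the zero-shot pipeline turns each candidate label into a hypothesis (by default with the template "This example is {}."), asks the NLI model how strongly the text entails it, and, in its default single-label mode, applies a softmax over the entailment logits. The logit values and the helper names `softmax` and `rank_labels` below are made up for illustration; they are not outputs or functions from the library.

```python
import math

def softmax(values):
    """Turn raw scores (logits) into probabilities that sum to 1."""
    top = max(values)
    exps = [math.exp(v - top) for v in values]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def rank_labels(candidate_labels, entailment_logits):
    """Pair each label with its softmax score, highest score first."""
    scores = softmax(entailment_logits)
    ranked = sorted(zip(candidate_labels, scores), key=lambda pair: pair[1], reverse=True)
    return {
        "labels": [label for label, _ in ranked],
        "scores": [score for _, score in ranked],
    }

# Hypothetical entailment logits, one per candidate label (assumed values).
result = rank_labels(["public", "confidential"], [0.1, 0.53])
print(result)
```

The result has the same shape as the pipeline output: labels sorted by score, with scores that add up to 1.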

## Tokenizers

But what happens behind the pipeline?

A pipeline combines a tokenizer, a model, a head, and an optional second decoding step:

1. Tokenizer 1 transforms the text into numbers that the model can use.
2. The model processes the input.
3. The head uses the model's output to compute the result for the final task.
4. Tokenizer 2 (optional) transforms the final output from numbers back to text, when that is needed.
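The stages above can be sketched with toy stand-ins. This is only an illustration of how the pieces chain together: the vocabulary, the counting "model", and all function names are hypothetical and far simpler than a real transformer. For classification, the last step is a label lookup rather than a second tokenizer.

```python
# Toy stand-ins for each stage; every name and number here is hypothetical.
VOCAB = {"[UNK]": 0, "secret": 1, "contract": 2, "hello": 3, "friend": 4}
ID2LABEL = {0: "confidential", 1: "public"}

def tokenize(text):
    """Tokenizer 1: map each word to its id in the vocabulary."""
    return [VOCAB.get(word, VOCAB["[UNK]"]) for word in text.lower().split()]

def model(input_ids):
    """Model: turn ids into features (here, two simple word counts)."""
    sensitive = sum(1 for i in input_ids if i in (1, 2))
    casual = sum(1 for i in input_ids if i in (3, 4))
    return [sensitive, casual]

def head(features):
    """Head: treat the features as per-class scores and return the best class id."""
    return max(range(len(features)), key=lambda idx: features[idx])

def decode(class_id):
    """Final step: turn the numeric prediction back into text."""
    return ID2LABEL[class_id]

def toy_pipeline(text):
    """Chain the four stages, as a pipeline does internally."""
    return decode(head(model(tokenize(text))))

print(toy_pipeline("secret contract"))
print(toy_pipeline("hello friend"))
```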

In [10]:
from transformers import AutoTokenizer

Create a list with different sentences

In [32]:
sequences = ["Para hacer una arma nuclear necesitas 20 mg de plutonio más 15 g de uranio",
             "Hola amigo"]

Load the tokenizer for the checkpoint

In [33]:
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

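Define how the tokens are created, using padding, truncation, and max_length. With `padding=True`, sequences are padded to the longest one in the batch; `max_length` only caps the length when truncating.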

In [36]:
tokens = tokenizer.batch_encode_plus(sequences, padding=True, truncation=True, return_tensors="pt", max_length=30)

In [37]:
tokens

{'input_ids': tensor([[    1,  1928,  5600,   574, 40890,   260, 26844, 23509,   264,   629,
          4540,   270, 41212,  3877,   269,  1281,   671,   260,   319,   270,
           260,  2378, 16094,     2],
        [    1, 41598, 42102,     2,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])}

The output shows two important things:
1. The input IDs: the number assigned to each token based on the vocabulary.
2. The attention mask: this shows which tokens the model should pay attention to. Padding tokens get zeros because they carry no relevant information.
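To make the relationship between padding and the attention mask concrete, here is a minimal pure-Python sketch. It assumes a padding id of 0, as in the output above; `pad_batch` is a hypothetical helper, not part of the library, and the first id list is a shortened, made-up sequence.

```python
def pad_batch(batch_ids, pad_id=0):
    """Pad every sequence to the longest one and build the matching mask."""
    longest = max(len(ids) for ids in batch_ids)
    input_ids = []
    attention_mask = []
    for ids in batch_ids:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)            # real tokens, then padding
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 1 = attend, 0 = ignore
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# The second list reuses the ids of "Hola amigo" from the output above.
batch = pad_batch([[1, 1928, 5600, 574, 2], [1, 41598, 42102, 2]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Every position that holds a padding id gets a 0 in the mask, so both lists always have the same shape.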

## Models

The model takes the tokenized input and produces the output for a task, such as classifying a sequence

In [38]:
from transformers import AutoModelForSequenceClassification

The label mappings are defined first:

In [48]:
label2id = {"private": 0, "public": 1}
id2label = {0: "private", 1: "public"}

Loading the checkpoint with these two labels fails: the checkpoint's classifier was trained with three classes (weight shape [3, 768]), while the two-label mapping requests a classifier with two (weight shape [2, 768]).

In [50]:
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, label2id=label2id, id2label=id2label)

RuntimeError: Error(s) in loading state_dict for DebertaV2ForSequenceClassification:
	size mismatch for classifier.weight: copying a param with shape torch.Size([3, 768]) from checkpoint, the shape in current model is torch.Size([2, 768]).
	size mismatch for classifier.bias: copying a param with shape torch.Size([3]) from checkpoint, the shape in current model is torch.Size([2]).
	You may consider adding `ignore_mismatched_sizes=True` in the model `from_pretrained` method.

In [49]:
tokens2 = tokenizer.batch_encode_plus(sequences, padding=True, truncation=True, return_tensors="pt", max_length=30)
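The size mismatch error suggests passing `ignore_mismatched_sizes=True`. A minimal sketch of that option follows, assuming the goal is to fine-tune the checkpoint on the two custom labels. The wrapper `load_two_label_model` is a hypothetical helper; with this flag the three-class classifier weights are skipped and the new two-class head starts from random values, so its predictions are not meaningful until the model is fine-tuned.

```python
label2id = {"private": 0, "public": 1}
# Build the reverse mapping from label2id so the two cannot drift apart.
id2label = {idx: label for label, idx in label2id.items()}

def load_two_label_model(checkpoint):
    """Load the checkpoint body and attach a fresh two-class head."""
    # Imported lazily so the label mappings above can be checked without
    # transformers installed or the weights downloaded.
    from transformers import AutoModelForSequenceClassification

    return AutoModelForSequenceClassification.from_pretrained(
        checkpoint,
        num_labels=len(label2id),
        label2id=label2id,
        id2label=id2label,
        ignore_mismatched_sizes=True,  # skip the 3-class classifier weights
    )
```

Calling `load_two_label_model(checkpoint)` downloads the weights if they are not cached. The zero-shot pipeline used earlier relies on the original three-class head, so this two-label model is meant for fine-tuning rather than for that pipeline.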