Transformer models :


1.   GPT-like (also called auto-regressive Transformer models)
2.   BERT-like (also called auto-encoding Transformer models)
3.   BART/T5-like (also called sequence-to-sequence Transformer models)

Transformers are language models

They have been trained on large amounts of raw text in a self-supervised fashion: the training objective is computed automatically from the input text, so no human-annotated labels are needed. The resulting model is very general, but it is not specialized for any practical task. That is why the general pretrained model then goes through a process called **transfer learning**. During this process, the model is fine-tuned in a supervised way (that is, using human-annotated labels) on a given task.
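To make "self-supervised" concrete, the sketch below shows how training targets can be derived from the text itself. It is a toy illustration, not how any particular library does it: it assumes whitespace tokenization and a hand-picked mask position, whereas real models use subword tokenizers and random masking. The function names are made up for this example.

```python
# Toy illustration: building training targets from raw text alone.
# Assumption: whitespace tokenization; real models use subword tokenizers.

def causal_lm_pairs(tokens):
    """Next-token prediction (GPT-like): each prefix predicts the following token."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def masked_lm_example(tokens, position, mask_token="[MASK]"):
    """Masked language modeling (BERT-like): hide one token and predict it back."""
    masked = list(tokens)
    target = masked[position]
    masked[position] = mask_token
    return masked, target

tokens = "the cat sat on the mat".split()

# Each (context, target) pair is a training example with no human label involved.
for context, target in causal_lm_pairs(tokens):
    print(context, "->", target)

masked, target = masked_lm_example(tokens, position=2)
print(masked, "->", target)
```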

**Fine-tuning :** To perform fine-tuning, you first acquire a pretrained language model, then perform additional training with a dataset specific to your task.
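The idea can be sketched without any deep learning library. In the toy example below, a small frozen embedding table stands in for the pretrained model (an assumption made purely for illustration), and only a new classification head is trained on a handful of labeled sentences. Real fine-tuning usually updates the pretrained weights as well, with a far larger model and dataset.

```python
import math

# Assumption: this tiny, frozen embedding table stands in for a pretrained model.
PRETRAINED_EMBEDDINGS = {
    "good": [1.0, 0.0], "great": [0.9, 0.1],
    "bad": [0.0, 1.0], "awful": [0.1, 0.9],
    "movie": [0.5, 0.5],
}

def features(text):
    """Frozen 'pretrained' part: mean of the word vectors."""
    vectors = [PRETRAINED_EMBEDDINGS[w] for w in text.split()]
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Task-specific, human-labeled data (1 = positive, 0 = negative).
labeled = [("good movie", 1), ("great movie", 1), ("bad movie", 0), ("awful movie", 0)]

# New classification head, trained from scratch with plain gradient steps.
weights, bias, lr = [0.0, 0.0], 0.0, 1.0
for _ in range(500):
    for text, label in labeled:
        x = features(text)
        error = label - sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
        weights = [w + lr * error * xi for w, xi in zip(weights, x)]
        bias += lr * error

def predict(text):
    x = features(text)
    return int(sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) > 0.5)

print([predict(text) for text, _ in labeled])
```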

***Transformer architecture :***

*   Encoders : The encoder receives a text input and converts it into a numerical representation (also called embeddings or features). It is bi-directional and uses the self-attention mechanism.

*   Decoders : The decoder uses the encoder’s representation (features) along with other inputs to generate a target sequence. When used on its own, it can also take text as input directly, not only the encoder's features. It is uni-directional.

**Attention layers :**

A key feature of Transformer models is that they are built with special layers called attention layers.
These layers tell the model to pay specific attention to certain words in the sentence you passed it (and more or less ignore the others) when computing the representation of each word, because the meaning of a word often depends on its context.



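*   For the encoder : The attention layers in the encoder can use all the words in a sentence.
*   For the decoder : Each decoder block has two attention layers. The first (masked self-attention) attends only to the words already generated, that is, the positions before the word being predicted. The second (cross-attention) uses the output of the encoder, so it can access the whole input sentence. In both cases, the decoder can't "give attention" to the next words of the sequence it is generating.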
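The difference between the two settings comes down to a mask on the attention scores. The sketch below computes scaled dot-product self-attention weights in plain Python. As a simplifying assumption, the input vectors are used directly as queries and keys (real layers apply learned projections first), and the three word vectors are made up.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(vectors, causal=False):
    """Scaled dot-product self-attention weights.

    Assumption: queries and keys are the input vectors themselves
    (no learned projections), to keep the example short.
    """
    d = len(vectors[0])
    weights = []
    for i, query in enumerate(vectors):
        scores = []
        for j, key in enumerate(vectors):
            if causal and j > i:
                scores.append(float("-inf"))  # mask out future positions
            else:
                scores.append(sum(q * k for q, k in zip(query, key)) / math.sqrt(d))
        weights.append(softmax(scores))
    return weights

# Three hypothetical 2-dimensional word vectors.
words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

encoder_style = attention_weights(words)               # every word sees every word
decoder_style = attention_weights(words, causal=True)  # each word sees itself and the past

for row in decoder_style:
    print([round(w, 2) for w in row])
```

With `causal=True`, every weight above the diagonal is exactly zero, which is what "uni-directional" means in practice.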



**Architecture vs model vs checkpoint :**

BERT is an architecture while bert-base-cased, a set of weights trained by the Google team for the first release of BERT, is a checkpoint. However, one can say “the BERT model” and “the bert-base-cased model.”

#Encoders

**Encoder models :**
Encoders transform a sequence of words into one feature vector per word; stacked together, these vectors form a feature tensor.
The size of each vector depends on the model. Each vector represents the word in its context, so it holds the meaning of the word within the text.

These bi-directional models are good for extracting meaningful information: sequence classification, question answering, masked language modeling, and natural language understanding in general.

Ex families of models : BERT, RoBERTa, ALBERT

In [None]:
# Classify text with DistilBERT

import torch
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

inputs = tokenizer("Hello, Nadal is my fav tennis player.", return_tensors="pt")

# The prediction runs without computing gradients (torch.no_grad()) because we are only evaluating, not training.
with torch.no_grad():
    logits = model(**inputs).logits

predicted_class_id = logits.argmax().item()
model.config.id2label[predicted_class_id]  # look up the label of the predicted class


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



'POSITIVE'
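The cell above takes the `argmax` of the logits, which gives the predicted class but not a confidence. A softmax converts the logits into probabilities. The sketch below uses made-up logit values (an assumption, since the real ones come from `model(**inputs).logits`) together with the label mapping of the SST-2 checkpoint.

```python
import math

# Hypothetical logits for one sentence; real values come from model(**inputs).logits.
logits = [-4.2, 4.5]

# Label mapping of the SST-2 checkpoint used above.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# Softmax turns raw scores into probabilities that sum to 1.
shift = max(logits)  # subtracting the max keeps exp() numerically stable
exps = [math.exp(x - shift) for x in logits]
total = sum(exps)
probs = [e / total for e in exps]

predicted_class_id = max(range(len(probs)), key=probs.__getitem__)
print(id2label[predicted_class_id], round(probs[predicted_class_id], 4))
```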

In [2]:
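# Named entity recognition (NER)

from transformers import pipeline

# aggregation_strategy="simple" merges sub-word tokens into whole entities
# (it replaces the deprecated grouped_entities=True).
ner = pipeline("ner", aggregation_strategy="simple")
ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")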

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.



Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).





[{'entity_group': 'PER',
  'score': 0.9981694,
  'word': 'Sylvain',
  'start': 11,
  'end': 18},
 {'entity_group': 'ORG',
  'score': 0.9796019,
  'word': 'Hugging Face',
  'start': 33,
  'end': 45},
 {'entity_group': 'LOC',
  'score': 0.9932106,
  'word': 'Brooklyn',
  'start': 49,
  'end': 57}]
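The `start` and `end` fields in this output are character offsets into the input string, with `end` exclusive. A quick check, using the values copied from the output above (scores omitted):

```python
text = "My name is Sylvain and I work at Hugging Face in Brooklyn."

# Entities copied from the pipeline output above (scores omitted).
entities = [
    {"entity_group": "PER", "word": "Sylvain", "start": 11, "end": 18},
    {"entity_group": "ORG", "word": "Hugging Face", "start": 33, "end": 45},
    {"entity_group": "LOC", "word": "Brooklyn", "start": 49, "end": 57},
]

# Slicing the original text with the offsets recovers each entity.
for entity in entities:
    span = text[entity["start"]:entity["end"]]
    print(entity["entity_group"], "->", span)
```

Working with offsets rather than the `word` field is useful when the original casing or spacing of the text must be preserved.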

#Decoders

**Decoder models :**
Decoders also transform a sequence of words into feature vectors. Decoders use masked self-attention mechanisms to hide the right context of a word and use only the left context (the previous words) to transform the word into a vector.

--> Uni-directional architecture.

This architecture is good for building models for generative tasks.

**Auto-regressive models** use their outputs as inputs for the next prediction.
Ex families of models : GPT
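The auto-regressive loop can be sketched with a toy "model". Here a hypothetical lookup table plays the role of the network and predicts the next word from the last one only; a real decoder conditions on the whole left context and outputs a probability distribution over its vocabulary.

```python
# Hypothetical "model": a lookup table giving the most likely next word.
NEXT_WORD = {
    "<bos>": "the", "the": "cat", "cat": "sat", "sat": "down", "down": "<eos>",
}

def generate(prompt, max_new_tokens=10):
    """Greedy auto-regressive generation: append one predicted token at a time."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        next_token = NEXT_WORD.get(tokens[-1], "<eos>")  # predict from the left context
        if next_token == "<eos>":
            break
        tokens.append(next_token)  # the output becomes part of the next input
    return tokens

print(generate(["<bos>", "the"]))
```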

In [None]:
# The huggingface-cli command ships with the huggingface_hub package;
# there is no PyPI package named "huggingface-cli".
!pip install -U huggingface_hub
!huggingface-cli login

In [None]:
#!pip uninstall -y transformers torch
#!pip install torch==2.0.1 transformers==4.31.0

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Gated checkpoint: requires accepting the license on the Hub and being logged in.
model_id = "meta-llama/Llama-2-7b-chat-hf"

# Initialize the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

# Create a pipeline for text generation.
# No 'device' argument is needed: device_map="auto" has already placed the model.
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

prompt = "You are a pirate chatbot who always responds in pirate speak! Tell me about your treasure."

# Generate text
outputs = generator(prompt, max_length=256, do_sample=True, top_k=50, temperature=0.7)

# Print the output
print(outputs[0]['generated_text'])



Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
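The generation call above passes `do_sample=True`, `top_k=50` and `temperature=0.7`. The sketch below illustrates what these settings mean for a single decoding step, in plain Python and on made-up logits. It is a simplified illustration of top-k sampling with temperature, not the library's own implementation, and `sample_top_k` is a name invented for this example.

```python
import math
import random

def sample_top_k(logits, top_k, temperature, rng):
    """Sample a token id from the top_k highest logits after temperature scaling."""
    # Keep only the top_k candidate token ids.
    candidates = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Temperature < 1 sharpens the distribution, > 1 flattens it.
    scaled = [logits[i] / temperature for i in candidates]
    shift = max(scaled)
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return rng.choices(candidates, weights=probs, k=1)[0]

# Hypothetical next-token logits over a 5-token vocabulary.
logits = [2.0, 1.0, 0.5, -1.0, -3.0]
rng = random.Random(0)  # seeded for reproducibility

samples = [sample_top_k(logits, top_k=2, temperature=0.7, rng=rng) for _ in range(20)]
print(samples)
```

With `top_k=2`, only the two highest-scoring tokens can ever be sampled; with `top_k=1`, sampling reduces to greedy decoding.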


In [None]:
!pip uninstall -y torch torchvision torchaudio

Found existing installation: torch 2.4.1+cu118
Uninstalling torch-2.4.1+cu118:
  Successfully uninstalled torch-2.4.1+cu118
Found existing installation: torchvision 0.19.1+cu121
Uninstalling torchvision-0.19.1+cu121:
  Successfully uninstalled torchvision-0.19.1+cu121
Found existing installation: torchaudio 2.4.1+cu121
Uninstalling torchaudio-2.4.1+cu121:
  Successfully uninstalled torchaudio-2.4.1+cu121


In [None]:
# GPT-2
#!pip install transformers
#!pip install torchvision
#!pip install torchaudio

from transformers import pipeline, set_seed
generator = pipeline('text-generation', model='gpt2')
set_seed(42)
generator("Hello, I'm a language model,", max_length=30, num_return_sequences=5)


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


[{'generated_text': "Hello, I'm a language model, but what I'm really doing is making a human-readable document. There are other languages, but those are"},
 {'generated_text': "Hello, I'm a language model, not a syntax model. That's why I like it. I've done a lot of programming projects.\n"},
 {'generated_text': "Hello, I'm a language model, and I'll do it in no time!\n\nOne of the things we learned from talking to my friend"},
 {'generated_text': "Hello, I'm a language model, not a command line tool.\n\nIf my code is simple enough:\n\nif (use (string"},
 {'generated_text': "Hello, I'm a language model, I've been using Language in all my work. Just a small example, let's see a simplified example."}]

#Sequence to sequence

In a **sequence-to-sequence** model for tasks like question answering (Q&A), the encoder and the decoder have distinct roles:

1. Encoder:

The encoder processes both the context (the passage or document from which the answer is to be retrieved) and the question.
It converts this input into a sequence of hidden states, one contextual vector per input token, that represents both the question and the passage. Because it is bi-directional, the encoder can relate every word of the question to every word of the context.

2. Decoder:

The decoder receives the output of the encoder (the hidden states) and generates the answer, token by token.
At each generation step, its cross-attention layers weigh the encoder's hidden states, so the decoder can focus on the parts of the input (context or question) that are most relevant to the next token.
It also attends to the tokens it has already generated, which makes it auto-regressive.

Sequence-to-sequence models are best suited for tasks revolving around generating new sentences depending on a given input, such as summarization, translation, or generative question answering.
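The way a decoder consults the encoder output at one generation step can be sketched as a single cross-attention computation. All vectors below are made up for illustration, and the learned projections of a real model are left out as a simplifying assumption.

```python
import math

def cross_attention(query, encoder_states):
    """Weights of one decoder query over all encoder states (scaled dot product)."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, state)) / math.sqrt(d)
              for state in encoder_states]
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: weighted average of the encoder states.
    context = [sum(w * state[i] for w, state in zip(weights, encoder_states))
               for i in range(d)]
    return weights, context

# Hypothetical encoder outputs, one vector per input token.
source_tokens = ["where", "is", "Paris"]
encoder_states = [[1.0, 0.0], [0.0, 0.2], [0.0, 3.0]]

# Hypothetical decoder state at the current generation step.
decoder_query = [0.0, 2.0]

weights, context = cross_attention(decoder_query, encoder_states)
most_attended = source_tokens[weights.index(max(weights))]
print(most_attended, [round(w, 2) for w in weights])
```

The decoder repeats this at every step with a new query, so different output tokens can draw on different parts of the input.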

**Ex :** BART, T5, Marian, mBART

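#Biases of Transformer models

These models are trained on huge amounts of data from the internet, which contain good information but also biased information. As a result, the models can reproduce these biases in their predictions, as the fill-mask example below shows.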

In [1]:
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
result = unmasker("This man works as a [MASK].")
print([r["token_str"] for r in result])

result = unmasker("This woman works as a [MASK].")
print([r["token_str"] for r in result])


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).





['carpenter', 'lawyer', 'farmer', 'businessman', 'doctor']
['nurse', 'maid', 'teacher', 'waitress', 'prostitute']
