<a href="https://colab.research.google.com/github/werowe/HypatiaAcademy/blob/master/ml/transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tokens and Masks

In natural language processing (NLP), particularly with transformer-based models like BERT or GPT, the attention mask is crucial for handling variable-length input sequences. This section explains what an attention mask is and how it works.

### What is an Attention Mask?

An attention mask is a binary tensor that indicates which elements in the input sequence should be attended to (considered) and which should not (ignored). This is particularly useful when you have input sequences of different lengths and you need to pad them to the same length for batch processing.

### Why is it Needed?

1. **Handling Padding**: When processing sequences of different lengths in batches, shorter sequences are often padded with a special token (e.g., `[PAD]`). These padding tokens should not contribute to the model's understanding of the sequence. The attention mask helps the model distinguish between real tokens and padding tokens.
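2. **Performance**: Because padded positions receive no attention weight, the representations of the real tokens are not diluted by meaningless padding. Note that in standard implementations the padded positions are still computed; the mask changes what they contribute, not how much work is done.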

### How Does it Work?

- **Binary Masking**: The attention mask is typically a binary array (or tensor) where `1` indicates that the corresponding token should be attended to and `0` indicates that it should not.
  - Example: For an input sequence `[The, quick, brown, fox, [PAD], [PAD]]`, the attention mask might be `[1, 1, 1, 1, 0, 0]`.
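The padding step described above can be sketched in plain Python. This is a minimal illustration, not the tokenizer's actual implementation: the helper name `pad_batch`, the token IDs, and the choice of `0` as the padding ID are assumptions made for the example.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest length and build matching masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        # Real tokens keep their IDs; padding positions get pad_id.
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# Hypothetical token IDs for two sequences of different lengths.
batch = [[11, 12, 13, 14], [21, 22]]
input_ids, attention_mask = pad_batch(batch)
print(input_ids)
print(attention_mask)
```

In practice the tokenizer does this for you when called with `padding=True`.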

### Implementation in Transformer Models

When a transformer model processes an input sequence, it uses the attention mask in its attention mechanism. The attention mechanism computes attention scores, which determine how much focus each token should give to every other token in the sequence. The attention mask modifies these scores to ensure that the padded tokens are not considered.
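One common way to apply the mask is to push the scores of masked positions to a very large negative value (here `-inf`) before the softmax, so that their attention weight becomes zero. The sketch below shows this for a single row of scores in plain Python. The function name `masked_softmax` and the score values are assumptions for illustration, and it assumes at least one position is unmasked.

```python
import math


def masked_softmax(scores, mask):
    """Softmax over scores, giving zero weight wherever mask is 0."""
    # Masked positions become -inf, so exp() of them is exactly 0.
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical attention scores for [The, quick, brown, fox, [PAD], [PAD]].
scores = [2.0, 1.0, 0.5, 1.5, 3.0, 3.0]
mask = [1, 1, 1, 1, 0, 0]
weights = masked_softmax(scores, mask)
print([round(w, 3) for w in weights])
```

Even though the padding positions have the highest raw scores in this example, they end up with zero weight, and the remaining weights still sum to 1.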

### Self Attention

In self-attention, each token in a sequence attends to all other tokens, including itself, to build a contextual representation. In some settings, especially training for autoregressive language modeling, tokens must be prevented from attending to future tokens (which have not been predicted yet). This is done with a second kind of mask, usually called a causal mask.
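Blocking attention to future tokens is usually expressed as a lower-triangular matrix: row `i` says which positions token `i` may attend to. A minimal sketch in plain Python follows; the helper name `causal_mask` is an assumption for illustration.

```python
def causal_mask(seq_len):
    """Row i has 1 for positions 0..i (past and self) and 0 for the future."""
    return [[1 if j <= i else 0 for j in range(seq_len)]
            for i in range(seq_len)]


for row in causal_mask(4):
    print(row)
```

The first token can only see itself, while the last token can see the whole sequence. A padding mask and a causal mask can be combined by multiplying them element-wise.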

### Key Points

- **Input IDs**: Token IDs of the input sequence, padded where necessary.
- **Attention Mask**: Binary mask indicating which tokens should be attended to.
- **Model Processing**: The model uses the attention mask to ensure that padding tokens do not influence the processing of the input sequence.

In summary, the attention mask is a fundamental tool in NLP for managing variable-length sequences and ensuring that padding tokens do not interfere with the learning process of the model.

In [38]:
from transformers import BertTokenizer, BertModel

# Sample input sequences
sentences = ["The quick brown fox", "jumps over the lazy dog"]

# Initialize the tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenize the input sequences and pad them to the same length
inputs = tokenizer(sentences, padding=True, return_tensors="pt")

# The inputs dictionary contains input_ids and attention_mask
print(inputs['input_ids'])
print(inputs['attention_mask'])

# Input to the model
model = BertModel.from_pretrained('bert-base-uncased')
outputs = model(**inputs)

# Extract the hidden states
last_hidden_states = outputs.last_hidden_state

tensor([[  101,  1996,  4248,  2829,  4419,   102,     0],
        [  101, 14523,  2058,  1996, 13971,  3899,   102]])
tensor([[1, 1, 1, 1, 1, 1, 0],
        [1, 1, 1, 1, 1, 1, 1]])

While the primary use of attention masks is to handle padding tokens, they can also be used to ignore certain words in a sentence. The rest of this section explains this use case in more detail.

### Selective Attention Mask

In addition to padding, attention masks can be used to ignore specific tokens in the input sequence for various reasons, such as:

- **Focus on Key Tokens**: To make the model focus on specific words or phrases.
- **Exclusion of Stop Words**: To ignore common stop words that may not contribute significantly to the meaning of the sentence.

### How it Works

In this case, the attention mask will still be a binary tensor, but the `0`s will correspond to the tokens that should be ignored, even if they are not padding tokens.

### Example

Consider the sentence "The quick brown fox jumps over the lazy dog." We might want to focus only on the key content words: "quick", "brown", "fox", "jumps", "lazy", "dog".

### Illustration

(Figure: the tokens of the sentence aligned with their selective attention mask values.) The figure has two rows:

- **Tokens**: The sequence of words in the sentence.
- **Attention Mask**: A binary array indicating which words should be attended to (1) and which should be ignored (0).

In this example, words like "The", "over", and "the" are marked with 0 in the attention mask, meaning they should be ignored. The remaining words are marked with 1, indicating they should be attended to by the model.

This demonstrates how an attention mask can be used not only for handling padding but also for focusing on specific words in a sentence.
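The selective mask for the example sentence can be built with a few lines of plain Python. This is only a sketch: the whitespace split and the `ignored` word set are assumptions for illustration. A real tokenizer adds special tokens and may split words into subwords, so an actual mask has to line up with the tokenizer's own tokens.

```python
sentence = "The quick brown fox jumps over the lazy dog"
tokens = sentence.split()

# Hypothetical set of words to ignore (compared in lowercase).
ignored = {"the", "over"}

# 1 = attend to the token, 0 = ignore it.
attention_mask = [0 if tok.lower() in ignored else 1 for tok in tokens]

for tok, flag in zip(tokens, attention_mask):
    print(f"{tok:>6}  {flag}")
```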


From video https://www.youtube.com/watch?v=QEaBAZQCtwE


* how to use the pipeline
* how to use a model and tokenizer
* how to combine it with PyTorch or TensorFlow
* how to save and load models
* how to use models from the official model hub
* how to fine-tune your own models


The library works with TensorFlow, PyTorch, or Flax.


In [None]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")

res = classifier("It's so hot today in Cyprus")

print(res)


No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9993522763252258}]


A pipeline performs three steps:

1.  Apply the tokenizer to preprocess the text.
2.  Feed the preprocessed text to the model.
3.  Post-process the model output into labels and scores.
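The post-processing step can be sketched in plain Python. For a classification model it typically means turning the raw output scores (logits) into probabilities with a softmax and reporting the most likely label. The function name `postprocess`, the logit values, and the label mapping below are assumptions for illustration, not values taken from a real model run.

```python
import math


def postprocess(logits, id2label):
    """Convert raw logits to probabilities and return the top label."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Index of the highest probability.
    best = max(range(len(probs)), key=probs.__getitem__)
    return [{"label": id2label[best], "score": probs[best]}]


# Hypothetical logits and label mapping for a two-class sentiment model.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
result = postprocess([-3.5, 3.8], id2label)
print(result)
```

The result has the same shape as the list of dictionaries printed by the `sentiment-analysis` pipeline above.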




# Text Generation Pipeline

In [None]:
from transformers import pipeline

generator = pipeline("text-generation", model="distilgpt2")

res = generator(
    "Today we will learn about transformers",
    max_length=50,
    num_return_sequences=2

)

for dis in res:
  print(dis.values())

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


dict_values(['Today we will learn about transformers using techniques to achieve good results to optimize optimization. The following resources will be provided to you:'])
dict_values(['Today we will learn about transformers and their technology.\n\n\n\nOur future is more than the invention of technology. As the field changes we need a revolution to see if changes can be turned into practical solutions to a problem, we want to'])


In [None]:
from transformers import pipeline

# This pipeline classifies text, so name it accordingly
classifier = pipeline("zero-shot-classification")

res = classifier(
    "Cyprus is a boring place",
    candidate_labels=["criticism", "education", "business"]
)

print("\n")
print(res['scores'])

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.




[0.9652251601219177, 0.028426004573702812, 0.006348819006234407]


Zero-shot classification means we can give the pipeline a text without knowing the corresponding label in advance. It looks at the candidate labels and scores how well each one matches the text.

# Tokenizer

In [None]:
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForSequenceClassification



model_name="distilbert/distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)


text="It's so hot today in Cyprus."

res = classifier(text)

print(res)



res=tokenizer(text)
print("\n res=", res)

tokens=tokenizer.tokenize(text)
print("\n tokens=", tokens)

ids=tokenizer.convert_tokens_to_ids(tokens)
print("\n ids=", ids)

decoded_string=tokenizer.decode(ids)
print("\n decoded_string=",decoded_string)


# 101 is the [CLS] token, which marks the start of the sequence
# 102 is the [SEP] token, which marks the end of the sequence




[{'label': 'POSITIVE', 'score': 0.9993836879730225}]

 res= {'input_ids': [101, 2009, 1005, 1055, 2061, 2980, 2651, 1999, 9719, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

 tokens= ['it', "'", 's', 'so', 'hot', 'today', 'in', 'cyprus', '.']

 ids= [2009, 1005, 1055, 2061, 2980, 2651, 1999, 9719, 1012]

 decoded_string= it's so hot today in cyprus.
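The output above shows that calling the tokenizer directly gives two more IDs than `convert_tokens_to_ids`: the special tokens 101 and 102 wrap the sequence. The relationship can be reproduced in plain Python using the IDs printed above. The variable names `CLS_ID` and `SEP_ID` are chosen here for illustration.

```python
# IDs printed by tokenizer.convert_tokens_to_ids(tokens) above.
ids = [2009, 1005, 1055, 2061, 2980, 2651, 1999, 9719, 1012]

CLS_ID = 101  # start-of-sequence token in this vocabulary
SEP_ID = 102  # end-of-sequence token in this vocabulary

# tokenizer(text) wraps the plain IDs with the special tokens.
input_ids = [CLS_ID] + ids + [SEP_ID]

# With a single unpadded sentence, every position is a real token.
attention_mask = [1] * len(input_ids)

print(input_ids)
print(attention_mask)
```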


In [None]:
# Load model directly
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForMaskedLM

# https://huggingface.co/ukr-models/xlm-roberta-base-uk


tokenizer = AutoTokenizer.from_pretrained("ukr-models/xlm-roberta-base-uk")
model = AutoModelForMaskedLM.from_pretrained("ukr-models/xlm-roberta-base-uk")


# Reuse the model and tokenizer loaded above instead of loading them again by name
unmasker = pipeline('fill-mask', model=model, tokenizer=tokenizer)




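# English translation of the Ukrainian input below:
# "We are once again inviting teenagers <mask> from Ukraine aged 13 to 19 to a FREE
# Python programming course taught by a successful American teacher and programmer.
# Python is the TOP1 language in the programming world."
text = """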
Ми знову запрошуємо підлітків  <mask> з України віком від 13 до 19 років на БЕЗКОШТОВНИЙ курс  у американського успішного викладача- програміста з програмування на Python. Python - це ТОП1 мова у світі програмування.
"""




res=tokenizer(text)
print("\n res=", res)

tokens=tokenizer.tokenize(text)
print("\n tokens=", tokens)

ids=tokenizer.convert_tokens_to_ids(tokens)
print("\n ids=", ids)

decoded_string=tokenizer.decode(ids)
print("\n decoded_string=",decoded_string, "\n\n")

unmasker(text)



 res= {'input_ids': [0, 2688, 17222, 30223, 1228, 1618, 7749, 4234, 6, 31273, 210, 1702, 14213, 419, 1096, 702, 255, 953, 4664, 29, 24807, 7799, 2693, 28254, 14377, 4943, 84, 25902, 1041, 28385, 695, 18650, 15912, 9, 5725, 20730, 210, 5725, 2741, 29, 24420, 5, 24420, 20, 1544, 25947, 418, 8355, 84, 14042, 5725, 2741, 5, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

 tokens= ['▁Ми', '▁знову', '▁запрошує', 'мо', '▁під', 'літ', 'ків', '▁', '<mask>', '▁з', '▁України', '▁вік', 'ом', '▁від', '▁13', '▁до', '▁19', '▁років', '▁на', '▁БЕЗ', 'КО', 'Ш', 'ТОВ', 'НИЙ', '▁курс', '▁у', '▁американськ', 'ого', '▁успішно', 'го', '▁виклад', 'ача', '-', '▁програм', 'іста', '▁з', '▁програм', 'ування', '▁на', '▁Python', '.', '▁Python', '▁-', '▁це', '▁ТОП', '1', '▁мова', '▁у', '▁світі', '▁програм', 'ування', '.']

 ids= [2688, 17222, 30223, 1228, 1618, 7749, 4234, 6, 31

[{'score': 0.5255579352378845,
  'token': 23347,
  'token_str': 'ІТ',
  'sequence': 'Ми знову запрошуємо підлітків ІТ з України віком від 13 до 19 років на БЕЗКОШТОВНИЙ курс у американського успішного викладача- програміста з програмування на Python. Python - це ТОП1 мова у світі програмування.'},
 {'score': 0.13842427730560303,
  'token': 29189,
  'token_str': 'службовців',
  'sequence': 'Ми знову запрошуємо підлітків службовців з України віком від 13 до 19 років на БЕЗКОШТОВНИЙ курс у американського успішного викладача- програміста з програмування на Python. Python - це ТОП1 мова у світі програмування.'},
 {'score': 0.059892717748880386,
  'token': 4,
  'token_str': ',',
  'sequence': 'Ми знову запрошуємо підлітків, з України віком від 13 до 19 років на БЕЗКОШТОВНИЙ курс у американського успішного викладача- програміста з програмування на Python. Python - це ТОП1 мова у світі програмування.'},
 {'score': 0.04573393613100052,
  'token': 23348,
  'token_str': 'спеціаліст',
  'sequence'

# Image Recognition


In [None]:
from transformers import pipeline

captioner = pipeline(model="ydshieh/vit-gpt2-coco-en")
captioner("https://th.bing.com/th/id/OIP.SBYtWe52Cb3ecG65Z0ae8wAAAA?rs=1&pid=ImgDetMain")

[{'generated_text': 'a large boat with a large cargo ship on it '}]