# Hugging Face and Model Loading

**Introduction to BERT and Hugging Face Models**  

BERT (Bidirectional Encoder Representations from Transformers) is a groundbreaking model introduced by Google that has significantly advanced the field of Natural Language Processing (NLP). Its core innovation lies in its bidirectional training of transformers, allowing it to understand context from both directions of a sentence. This approach makes BERT exceptionally effective in tasks like sentiment analysis, text classification, question answering, and more.  

One of the most accessible ways to work with BERT is through Hugging Face, a platform that hosts a vast repository of pre-trained models. Hugging Face provides tools and resources that simplify the implementation of NLP tasks using BERT and its derivatives. With just a few lines of code, users can access, fine-tune, and deploy state-of-the-art models for various applications.  

By leveraging the Hugging Face Transformers library, researchers and developers can explore a wide range of pre-trained models, from base BERT to specialized variations fine-tuned for specific tasks. The platform also encourages collaboration and innovation by enabling users to share their models and use community-contributed resources.  

In this tutorial, we will explore how to use BERT via the Hugging Face ecosystem, covering everything from loading pre-trained models to customizing them for your unique use case. Let’s dive in!

---

**Exercise 1: Exploring Hugging Face Models**  

1. Access the website [https://huggingface.co/](https://huggingface.co/).  
2. Explore the available models and identify one or more models you are interested in using.  
3. Investigate how to import the selected models using the Hugging Face library or APIs.  
4. Discuss with your instructor the steps and best practices for importing and utilizing these models in your project.  

In [1]:
# Load a pretrained model from Hugging Face to extract word embeddings
from transformers import BertModel, BertTokenizer, AutoTokenizer, AutoModelForTokenClassification
import torch


# https://huggingface.co/google-bert/bert-base-uncased
# Keep the checkpoint name in its own variable so it is not overwritten by the model object
model_name = 'google-bert/bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)


#cosine similarity library
from sklearn.metrics.pairwise import cosine_similarity



# Embeddings using BERT

**Introduction to Word Embeddings**  

Word embeddings are a fundamental concept in Natural Language Processing (NLP) that transform words into numerical representations. These representations capture semantic meanings and relationships between words, making them essential for many NLP tasks such as language modeling, sentiment analysis, and machine translation.  

Traditional methods like one-hot encoding represented words as sparse vectors, which lacked meaningful relationships and scalability. In contrast, word embeddings represent words as dense, low-dimensional vectors. This approach preserves linguistic context and enables models to understand similarities between words based on their usage in language. For example, embeddings of words like *king* and *queen* or *dog* and *cat* will be close to each other in the embedding space.  
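
To make the contrast concrete, here is a minimal sketch using NumPy. The one-hot construction is exact, but the dense vectors are made-up toy values chosen for illustration, not taken from any trained model, and the `cosine` helper is a hypothetical name introduced here.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

vocab = ["king", "queen", "dog", "cat"]

# One-hot: each word gets its own axis, so any two different words are orthogonal.
one_hot = np.eye(len(vocab))
print(cosine(one_hot[0], one_hot[1]))  # king vs queen

# Dense toy embeddings (hypothetical values, for illustration only).
dense = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.8, 0.9, 0.1]),
    "dog":   np.array([0.1, 0.2, 0.9]),
    "cat":   np.array([0.2, 0.1, 0.9]),
}
print(cosine(dense["king"], dense["queen"]))  # related words: high similarity
print(cosine(dense["king"], dense["dog"]))    # unrelated words: lower similarity
```

With one-hot vectors every pair of distinct words has similarity zero, so no notion of relatedness can be expressed; dense vectors leave room for graded similarity.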

Popular techniques for generating word embeddings include Word2Vec, GloVe, and FastText, which learn these representations from large text corpora. More advanced models like BERT and GPT use contextualized embeddings, which dynamically adjust the vector representation of a word based on its context in a sentence.  

In this part of the tutorial, we will delve into the basics of word embeddings, explore their mathematical properties, and see how they form the building blocks for more complex NLP models. Let’s get started!  

## Word Embeddings

**Exercise 2: Extracting Word Embeddings**  

1. Obtain the embeddings for the word *"bank"* from the sentence *"My money is in the bank."*.  
2. Specify which model you used to extract the embeddings (e.g., BERT, DistilBERT, or another Hugging Face model).  
3. Identify and explain the required inputs and outputs for the model to generate the embeddings.  
   - What preprocessing steps were necessary for the input sentence?  
   - What format did the model return for the embeddings (e.g., vector size, structure)?  

Discuss your findings and approach with your instructor.  

In [2]:
#use model to get embedding of word bank
inputs = tokenizer("My money is in the bank.", return_tensors="pt")
# 9 token IDs: 7 word and punctuation tokens plus [CLS] and [SEP]

In [3]:
print(inputs)

{'input_ids': tensor([[ 101, 2026, 2769, 2003, 1999, 1996, 2924, 1012,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]])}


In [4]:
# Convert input_ids back to tokens
print(tokenizer.convert_ids_to_tokens(inputs['input_ids'][0]))
# Length is 9 because [CLS] and [SEP] are added to the 7 tokens of the sentence

['[CLS]', 'my', 'money', 'is', 'in', 'the', 'bank', '.', '[SEP]']


In [5]:
outputs = model(**inputs)
outputs.keys()

odict_keys(['last_hidden_state', 'pooler_output'])

In [6]:
# Token embeddings: one 768-dimensional vector per token
outputs['last_hidden_state'].shape


torch.Size([1, 9, 768])

In [7]:
# Sentence-level embedding (pooled output derived from the [CLS] token)
outputs['pooler_output'].shape

torch.Size([1, 768])

In [8]:
print(tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])[6])
financial_bank = outputs['last_hidden_state'][0][6]

bank
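
The index `6` above is hard-coded. A small sketch of how the position could be looked up instead is shown here; `find_token_index` is a hypothetical helper, and it assumes an uncased tokenizer and a word that maps to a single WordPiece token (a word split into `##` pieces would need extra handling).

```python
def find_token_index(tokens, word):
    """Return the position of the first token equal to `word` (lowercased)."""
    target = word.lower()
    for i, tok in enumerate(tokens):
        if tok == target:
            return i
    raise ValueError(f"{word!r} not found as a single token")

# Token list copied from the output of In [4] above.
tokens = ['[CLS]', 'my', 'money', 'is', 'in', 'the', 'bank', '.', '[SEP]']
idx = find_token_index(tokens, "bank")
print(idx)
```

The returned index can then be used in place of the literal, for example `outputs['last_hidden_state'][0][idx]`.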


## Polysemous Words

**Exercise 3: Exploring Polysemy with Word Embeddings**  

1. Use the same model from Exercise 2 to generate embeddings for the word *"bank"* in the sentence *"The boat is stuck on the bank."*.  
2. Compare the embeddings of the word *"bank"* from Exercise 2 (*"My money is in the bank."*) with those obtained in this exercise.  
3. Measure the similarity between the two embeddings using cosine similarity.  
   - Are the embeddings identical or different?  
   - Discuss how the model handles polysemy (words with multiple meanings) based on the context provided by the sentences.  

Share your findings and insights with your instructor.

In [9]:
#use model to get embedding of word bank
inputs = tokenizer("The boat is stuck on the bank.", return_tensors="pt")
outputs = model(**inputs)

In [10]:
#convert input_ids for token format
print(tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])[7])
river_bank = outputs['last_hidden_state'][0][7]


bank


In [11]:
financial_bank.shape

torch.Size([768])

In [12]:
river_bank.shape

torch.Size([768])

In [13]:
# Convert the torch tensors to NumPy arrays of shape (768,)
financial_bank = financial_bank.detach().numpy()
river_bank = river_bank.detach().numpy()

# Cosine similarity between the two vectors
cosine_similarity([financial_bank], [river_bank])

array([[0.47348505]], dtype=float32)
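
For reference, the measure computed by `cosine_similarity` can be written out explicitly. For two embedding vectors $u$ and $v$ (here of dimension 768, as shown by the shapes printed above), the standard definition is:

```latex
\cos(u, v) \;=\; \frac{u \cdot v}{\lVert u \rVert \,\lVert v \rVert}
\;=\; \frac{\sum_{i=1}^{768} u_i v_i}{\sqrt{\sum_{i=1}^{768} u_i^2}\;\sqrt{\sum_{i=1}^{768} v_i^2}}
```

The value is 1 for vectors pointing in the same direction, 0 for orthogonal vectors, and -1 for opposite vectors. A score of about 0.47 therefore indicates that the two occurrences of *bank* share some direction but are clearly not the same vector.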

**Exercise 4: Comparing Contextually Similar Words**  

1. Use the same model as in previous exercises to generate the embedding for the word *"safe"* in the sentence *"My money is in the safe."*.  
2. Compare the embedding of *"safe"* with the embedding of *"bank"* obtained from the sentence *"My money is in the bank."* (Exercise 2).  
3. Calculate the cosine similarity between the embeddings of *"bank"* and *"safe"*.  
   - Are the vectors similar?  
   - Discuss how the model interprets contextually similar but distinct words.  

Present your observations and discuss the implications with your instructor.

In [14]:
# Use the model to get the embedding of the word "safe"
inputs = tokenizer("My money is in the safe.", return_tensors="pt")
outputs = model(**inputs)

In [15]:
#convert input_ids for token format
print(tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])[6])
safe_bank = outputs['last_hidden_state'][0][6]

safe


In [16]:
# Convert the torch tensor to a NumPy array of shape (768,)
safe_bank = safe_bank.detach().numpy()

# Cosine similarity between "safe" and the financial "bank"
cosine_similarity([safe_bank], [financial_bank])

array([[0.6241646]], dtype=float32)

## Sentence embeddings

**Introduction to Sentence Embeddings**  

Sentence embeddings are numerical representations of entire sentences, capturing their semantic meaning and context. Unlike word embeddings, which focus on individual words, sentence embeddings aim to encode the overall message and relationships between words in a sentence. This makes them essential for tasks like text similarity, sentiment analysis, and machine translation.  

Models like Sentence-BERT, Universal Sentence Encoder (USE), and others generate sentence embeddings by processing text at the sentence level, often considering the relationships between words and phrases. These embeddings allow for efficient comparisons between sentences, enabling tasks such as clustering, semantic search, and paraphrase detection.  

In this tutorial, we will explore how to generate and use sentence embeddings, analyzing their properties and applications in real-world NLP tasks.

---

**Exercise 5: Generating Sentence Embeddings**  

1. Use the model's `pooler_output` to extract the sentence embedding for the sentence *"I love Portugal"*.  
2. Identify and explain the steps required to process the input and retrieve the embedding.  
3. Discuss the dimensions and structure of the resulting sentence embedding.  

In [17]:
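# Tokenize the sentence and run the model
inputs = tokenizer("I love Portugal", return_tensors="pt")
outputs = model(**inputs)

# Sentence embedding taken from the [CLS] token of the last hidden layer.
# Note: the exercise asks for outputs['pooler_output'][0]; the results shown
# in this section were computed with the raw [CLS] vector instead.
sentence1 = outputs['last_hidden_state'][0][0]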


**Exercise 6: Comparing Sentence Embeddings**  

1. Generate the sentence embedding for *"I am in love with Portuguese lands"* using the same method as in Exercise 5, extracting the embedding from `pooler_output`.  
2. Compare this embedding with the one generated for *"I love Portugal"* in Exercise 5.  
3. Use cosine similarity to measure how similar the embeddings are.  
   - Are the embeddings close in the semantic space?  
   - Discuss how the model captures the similarity in meaning between the two sentences despite differences in wording.  

In [18]:
# Sentence embedding from the [CLS] token of the last hidden layer
inputs = tokenizer("I am in love with Portuguese lands", return_tensors="pt")
outputs = model(**inputs)

sentence2 = outputs['last_hidden_state'][0][0]


In [19]:
# Convert the torch tensors to NumPy arrays of shape (768,)
sentence1 = sentence1.detach().numpy()
sentence2 = sentence2.detach().numpy()

# Cosine similarity between the two vectors
cosine_similarity([sentence1], [sentence2])

array([[0.95352423]], dtype=float32)

In [20]:
# Get the sentence embedding using the [CLS] token
inputs = tokenizer("I captivated by Portuguese lands", return_tensors="pt")
outputs = model(**inputs)

#get [CLS] token embedding
sentence3 = outputs['last_hidden_state'][0][0]


In [21]:
# Convert the torch tensor to a NumPy array of shape (768,)
sentence3 = sentence3.detach().numpy()

# Cosine similarity between the two vectors
cosine_similarity([sentence1], [sentence3])

array([[0.9331925]], dtype=float32)

**Exercise 7: Comparing Sentence Embeddings with a New Sentence**  

1. Generate the sentence embedding for *"Pillow fights should be banned."* using the same method as in Exercises 5 and 6, extracting the embedding from `pooler_output`.  
2. Compare this new embedding with the ones generated for *"I love Portugal"* and *"I am in love with Portuguese lands"*.  
3. Use cosine similarity to measure how similar the embeddings are to each other.  
   - How do the embeddings for *"Pillow fights should be banned"* compare to those of the previous sentences?  
   - Discuss how the model captures semantic differences between the sentences.  

Discuss your findings and insights with your instructor.

In [22]:
# Get the sentence embedding using the [CLS] token
inputs = tokenizer("pillow fights should be banned.", return_tensors="pt")
outputs = model(**inputs)

#get [CLS] token embedding
sentence4 = outputs['last_hidden_state'][0][0]


In [23]:
# Convert the torch tensor to a NumPy array of shape (768,)
sentence4 = sentence4.detach().numpy()

# Cosine similarity between the two vectors
cosine_similarity([sentence1], [sentence4])

array([[0.8520674]], dtype=float32)

In [24]:
#create a matrix to compare all sentences
import numpy as np
embeddings = np.array([sentence1, sentence2, sentence3, sentence4])
similarity_matrix = cosine_similarity(embeddings)
#add labels
import pandas as pd
labels = ["I love Portugal", "I am in love with Portuguese lands", "I captivated by Portuguese lands", "pillow fights should be banned."]
df = pd.DataFrame(similarity_matrix, index=labels, columns=labels)
df

| | I love Portugal | I am in love with Portuguese lands | I captivated by Portuguese lands | pillow fights should be banned. |
|---|---|---|---|---|
| I love Portugal | 1.0 | 0.953524 | 0.933193 | 0.852067 |
| I am in love with Portuguese lands | 0.953524 | 1.0 | 0.959494 | 0.850306 |
| I captivated by Portuguese lands | 0.933193 | 0.959494 | 1.0 | 0.813432 |
| pillow fights should be banned. | 0.852067 | 0.850306 | 0.813432 | 1.0 |


**Exercise 8: Using SBERT for Sentence Similarity Comparison**  

1. The results from the previous exercise do not separate related from unrelated sentences well: even the unrelated sentence scores above 0.8 against the others. The `[CLS]` token vector (and the `pooler_output` derived from it) is not trained for sentence similarity. However, there are models specifically trained for Semantic Textual Similarity (STS).  
2. Access [https://sbert.net/](https://sbert.net/) and explore the available models.  
3. Use the model *'all-MiniLM-L6-v2'* from SBERT to generate sentence embeddings for the sentences:  
   - *"I love Portugal"*  
   - *"I am in love with Portuguese lands"*  
   - *"Pillow fights should be banned."*  
4. Compare the sentence embeddings again using cosine similarity.  
   - How do the results differ when using the SBERT model?  
   - Discuss the improvement in similarity measurement compared to using the [CLS] token vector from the previous model.  

In [25]:
# Repeat the same comparison using SBERT
# https://sbert.net/
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('all-MiniLM-L6-v2')

# Convert the 4 sentences to embeddings using SBERT
sentence1_embedding = model.encode('I love Portugal')
sentence2_embedding = model.encode('I am in love with Portuguese lands')
sentence3_embedding = model.encode('I captivated by Portuguese lands')
sentence4_embedding = model.encode('pillow fights should be banned.')

# Compare all embeddings pairwise using cosine similarity
# print("1 and 2", cosine_similarity([sentence1_embedding], [sentence2_embedding]))
# print("1 and 3", cosine_similarity([sentence1_embedding], [sentence3_embedding]))
# print("1 and 4", cosine_similarity([sentence1_embedding], [sentence4_embedding]))

#create a matrix to compare all sentences
embeddings = np.array([sentence1_embedding, sentence2_embedding, sentence3_embedding, sentence4_embedding])
similarity_matrix = cosine_similarity(embeddings)
#add labels
labels = ["I love Portugal", "I am in love with Portuguese lands", "I captivated by Portuguese lands", "pillow fights should be banned."]
df = pd.DataFrame(similarity_matrix, index=labels, columns=labels)
df

| | I love Portugal | I am in love with Portuguese lands | I captivated by Portuguese lands | pillow fights should be banned. |
|---|---|---|---|---|
| I love Portugal | 1.0 | 0.679476 | 0.548541 | 0.073248 |
| I am in love with Portuguese lands | 0.679476 | 1.0 | 0.778218 | -0.021231 |
| I captivated by Portuguese lands | 0.548541 | 0.778218 | 1.0 | -0.016899 |
| pillow fights should be banned. | 0.073248 | -0.021231 | -0.016899 | 1.0 |
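
One way to read a similarity matrix like this is to ask, for each sentence, which other sentence is closest. The sketch below does that with NumPy; the numbers are copied from the SBERT results above, and the variable names (`sim`, `masked`, `nearest`) are introduced here for illustration.

```python
import numpy as np

labels = [
    "I love Portugal",
    "I am in love with Portuguese lands",
    "I captivated by Portuguese lands",
    "pillow fights should be banned.",
]
# Values copied from the SBERT similarity matrix above.
sim = np.array([
    [1.0,       0.679476,  0.548541,  0.073248],
    [0.679476,  1.0,       0.778218, -0.021231],
    [0.548541,  0.778218,  1.0,      -0.016899],
    [0.073248, -0.021231, -0.016899,  1.0],
])

masked = sim.copy()
np.fill_diagonal(masked, -np.inf)   # ignore each sentence's similarity to itself
nearest = masked.argmax(axis=1)     # index of the closest other sentence per row

for i, j in enumerate(nearest):
    print(f"{labels[i]!r} -> {labels[j]!r} ({sim[i, j]:.3f})")
```

The three sentences about Portugal point at each other with scores between roughly 0.55 and 0.78, while the best match for the pillow-fight sentence scores below 0.1, which is the separation the `[CLS]`-based matrix did not show.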


### Token `[CLS]` in the Original BERT

**In the Original BERT Model**:
- The `[CLS]` token is used as a special marker added at the beginning of a sequence.
- After passing through the network, the embedding associated with `[CLS]` is designed to capture the global information of the sentence. It is frequently used for tasks like sentence classification (e.g., sentiment analysis).
- However, the `[CLS]` vector in BERT is not optimized to directly compare sentences or measure semantic similarity. It was learned during general pre-training tasks, namely masked language modeling and next sentence prediction, not with a sentence-similarity objective.

**SBERT Goes Beyond `[CLS]`**:
- SBERT performs fine-tuning specifically to capture the semantics of entire sentences.
- It uses pooling to combine information from all the words in the sentence, rather than relying solely on the `[CLS]` embedding.
- SBERT creates embeddings better suited for calculations like cosine similarity between sentences, whereas `[CLS]` vectors from BERT are less consistent for this type of comparison.
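
The pooling idea can be sketched in a few lines of NumPy. This shows mean pooling, one common pooling strategy, under the assumption that padded positions are marked with 0 in an attention mask; `mean_pool` is a hypothetical helper and the input values are toy numbers, not real model outputs.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sentence, ignoring padded positions.

    token_embeddings: array of shape (batch, seq_len, hidden)
    attention_mask:   array of shape (batch, seq_len), 1 = real token, 0 = padding
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts

# One sentence, three positions, hidden size 2; the last position is padding.
emb = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])
pooled = mean_pool(emb, mask)
print(pooled)
```

Because every real token contributes to the result, the sentence vector does not depend on a single position the way the `[CLS]` approach does.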



What about document embeddings?

# Translation, BERT, and Text-to-Text

BERT (Bidirectional Encoder Representations from Transformers) has significantly advanced the field of Natural Language Processing (NLP), but it is an encoder-only model: it produces contextual representations of its input and does not generate text on its own. Machine translation therefore relies on encoder-decoder (sequence-to-sequence) models, which combine a bidirectional encoder that captures the semantic and syntactic context of the source sentence with a decoder that writes the output sentence.

Models such as MarianMT, mBART, and T5 follow this design. T5 additionally casts every task as text-to-text: the task is described in the input string, and the answer is generated as an output string.

In this part of the tutorial, we will use a T5-family model to translate sentences and to explore the text-to-text paradigm.

---

**Exercise 9: Using T5 for Machine Translation**  

1. Load the model from Hugging Face: [https://huggingface.co/google-t5/t5-base](https://huggingface.co/google-t5/t5-base).  
2. Explore and understand how the T5 model works, particularly in the context of translation tasks.  
3. Translate the following text from English to German using the T5 model:  
   *"The house is wonderful."*  
4. Discuss the model's input and output format, and explain how T5 handles translation tasks.

In [None]:
# The exercise links https://huggingface.co/google-t5/t5-base;
# this cell loads the instruction-tuned variant google/flan-t5-base instead.
# Text-to-text paradigm
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")

# Translate an example sentence
inputs = tokenizer("translate English to German: The house is wonderful.", return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

**Exercise 10: Testing Text-to-Text Paradigm for Translation**  

1. Using the T5 model from Exercise 9, apply the Text-to-Text paradigm to translate the same sentence *"The house is wonderful."* into other languages of your choice (e.g., Spanish, French, Italian).  
2. For each translation, ensure the correct format is used, where the model receives the task as a text input (e.g., "translate English to Spanish: The quick brown fox jumps over the lazy dog.") and generates the output in the target language.  
3. Discuss how the Text-to-Text paradigm allows the model to handle multiple translation tasks seamlessly by simply changing the input task description.  

Share your results and observations with your instructor.

In [None]:
# Translate the same sentence from English to French
inputs = tokenizer("translate English to French: The house is wonderful.", return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

["Le roi est d'une grande ville."]


In [None]:
# Translate a sentence from Portuguese to English
inputs = tokenizer("translate Portuguese to English: A casa é linda.", return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

['The house is beautiful.']


Text-to-text paradigm:

- **Translation**: Input: `translate English to French: How are you?` -> Output: `Comment ça va?`
- **Summarization**: Input: `summarize: The article discusses...` -> Output: `Key points are...`
- **Question Answering**: Input: `question: Who wrote Hamlet? context: Hamlet was written by Shakespeare.` -> Output: `Shakespeare`
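
Since the task is encoded entirely in the input string, switching tasks only means building a different prompt. The helpers below are a hypothetical sketch (the function names are not part of any library) that reproduce the three input formats listed above; whether a given checkpoint responds well to a prefix depends on how it was trained.

```python
def translation_prompt(source_lang, target_lang, text):
    """Build a translation input such as 'translate English to French: ...'."""
    return f"translate {source_lang} to {target_lang}: {text}"

def summarization_prompt(text):
    """Build a summarization input."""
    return f"summarize: {text}"

def qa_prompt(question, context):
    """Build a question answering input with its supporting context."""
    return f"question: {question} context: {context}"

prompts = [
    translation_prompt("English", "French", "How are you?"),
    summarization_prompt("The article discusses..."),
    qa_prompt("Who wrote Hamlet?", "Hamlet was written by Shakespeare."),
]
for p in prompts:
    print(p)
```

Each string can be passed to the tokenizer and `model.generate` in the same way as in the translation cells.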


In [None]:
# Use the same model for question answering by changing only the input text
inputs = tokenizer("question: Who wrote Hamlet? context: Hamlet was written by Shakespeare.", return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

['Shakespeare']


# Sentiment Analysis

Sentiment analysis involves determining the emotional tone or opinion expressed in a piece of text, such as whether a product review is positive, negative, or neutral. BERT’s bidirectional nature enables it to understand context in a sentence more effectively than traditional models, making it especially well-suited for tasks like sentiment analysis.
BERT captures nuanced meaning by considering both the words before and after a given token, allowing it to understand subtle sentiment shifts and contextual cues that simpler models might miss. For example, it can distinguish between a positive sentence like "I love this movie!" and a negative one like "I hate this movie!" by grasping the underlying sentiment in the surrounding words.
Sentiment classifiers of this kind are obtained by fine-tuning BERT on a sentiment-labeled dataset, which trains it to assign texts to sentiment categories. In this part of the tutorial, we will use models that have already been fine-tuned for sentiment analysis and apply them to short texts such as customer reviews and social media posts.

---

**Exercise 11: Sentiment Analysis with BERT**  

1. Import the model *"nlptown/bert-base-multilingual-uncased-sentiment"* from Hugging Face.  
2. Use the model to perform sentiment analysis on the following sentences:  
   - *"I love this product!"*  
   - *"I hate this product!"*  
3. Analyze the model's output for each sentence.  
   - How does the model interpret the sentiment of these sentences?  
   - What is the format of the output, and how can you interpret the sentiment scores or labels?

In [None]:
# Sentiment analysis using BERT
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

#load sentiment analysis model
#https://huggingface.co/nlptown/bert-base-multilingual-uncased-sentiment?text=I+like+you.
#5 labels
tokenizer = AutoTokenizer.from_pretrained("nlptown/bert-base-multilingual-uncased-sentiment")
model = AutoModelForSequenceClassification.from_pretrained("nlptown/bert-base-multilingual-uncased-sentiment")


In [None]:
# Test on a sentence
tokens = tokenizer.encode("I love this product!", return_tensors="pt")
result = model(tokens)

# Make the result readable: argmax index (0-4) + 1 gives the star rating (1-5)
print(int(torch.argmax(result.logits))+1)

5


In [None]:
result.logits

tensor([[-2.4349, -2.7697, -1.1186,  1.4708,  3.9320]],
       grad_fn=<AddmmBackward0>)
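
The raw logits are unnormalized scores. A short sketch of how they could be turned into probabilities with a softmax is shown here, using NumPy and the logits copied from the output above; the variable names are introduced for illustration.

```python
import numpy as np

# Logits copied from the output above ("I love this product!").
logits = np.array([-2.4349, -2.7697, -1.1186, 1.4708, 3.9320])

# Numerically stable softmax: subtract the maximum before exponentiating.
shifted = logits - logits.max()
probs = np.exp(shifted) / np.exp(shifted).sum()

stars = int(probs.argmax()) + 1   # class index 0..4 maps to 1..5 stars
print(stars)
print(probs.round(3))
```

The softmax does not change which class wins, but it makes the margin interpretable: nearly all of the probability mass falls on the highest rating, with most of the remainder on the neighboring one.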

In [None]:
# Test on a sentence
tokens = tokenizer.encode("I hate this product!", return_tensors="pt")
result = model(tokens)

# Make the result readable: argmax index (0-4) + 1 gives the star rating (1-5)
print(int(torch.argmax(result.logits))+1)

1


**Exercise 12: Sentiment Analysis with a Different Model**  

1. Import the model *"cardiffnlp/twitter-roberta-base-sentiment"* from Hugging Face.  
2. Perform sentiment analysis on the same sentences:  
   - *"I love this product!"*  
   - *"I hate this product!"*  
3. Compare the results from this model with those obtained in Exercise 11.  
   - How does the sentiment analysis output differ between the two models?  
   - What is the format of the output from the *"cardiffnlp/twitter-roberta-base-sentiment"* model, and how can you interpret it?

In [None]:
#test other sentiment analysis model
#https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment
#3 labels
tokenizer = AutoTokenizer.from_pretrained("cardiffnlp/twitter-roberta-base-sentiment")
model = AutoModelForSequenceClassification.from_pretrained("cardiffnlp/twitter-roberta-base-sentiment")


In [None]:
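# Test on a sentence
tokens = tokenizer.encode("I hate this product!", return_tensors="pt")
result = model(tokens)

# Print the predicted class index + 1 (same convention as the previous cells).
# This model has 3 classes, so the value ranges from 1 to 3 rather than 1 to 5 stars.
print(int(torch.argmax(result.logits))+1)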

1
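
The printed `1` is a class index shifted by one, not a star rating. A small sketch of how it could be mapped to a label name follows. The mapping (0 = negative, 1 = neutral, 2 = positive) is an assumption based on the model card of `cardiffnlp/twitter-roberta-base-sentiment` and should be checked there; `ID2LABEL` and `label_from_printed_value` are hypothetical names.

```python
# Assumed label order for cardiffnlp/twitter-roberta-base-sentiment;
# verify against the model card before relying on it.
ID2LABEL = {0: "negative", 1: "neutral", 2: "positive"}

def label_from_printed_value(printed):
    """Undo the +1 offset used in the cell above and return the label name."""
    return ID2LABEL[printed - 1]

print(label_from_printed_value(1))
```

Under that assumption, the output for *"I hate this product!"* corresponds to the negative class, which agrees with the 1-star prediction from Exercise 11.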


# Question Answering

QA systems involve providing precise answers to questions based on a given context, often in the form of a passage of text.
BERT's bidirectional approach allows it to understand the full context of a sentence or passage, making it highly effective for identifying and extracting relevant information. Unlike traditional models, which process text in a left-to-right or right-to-left fashion, BERT processes the entire context simultaneously, understanding the relationships between all words. This ability to capture subtle contextual nuances is crucial in QA tasks, where the answer is often hidden within complex or ambiguous phrasing.
In QA applications, BERT is typically fine-tuned on datasets such as the SQuAD (Stanford Question Answering Dataset), where it learns to pinpoint the correct span of text that answers a given question. BERT's performance on QA tasks has set new benchmarks, achieving state-of-the-art results by accurately identifying the answer's location in the context.
In this part of the tutorial, we will explore how BERT can be used for Question Answering tasks, from fine-tuning models to extracting answers from a given text, demonstrating its power in real-world applications.
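
To illustrate what "pinpointing the span" means, the sketch below picks an answer span from per-token start and end scores. It is a simplified illustration, not the procedure of any particular library: `best_span` and `max_len` are hypothetical names, and the scores are made-up numbers. It searches for the pair with the highest combined score subject to the start not coming after the end, because taking the two argmax positions independently can produce an invalid span.

```python
import numpy as np

def best_span(start_logits, end_logits, max_len=30):
    """Return (start, end) maximizing start_logits[s] + end_logits[e] with s <= e."""
    best, best_score = (0, 0), -np.inf
    for s in range(len(start_logits)):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = start_logits[s] + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

tokens = ["[CLS]", "its", "capital", "is", "paris", ".", "[SEP]"]
start_logits = np.array([0.1, 0.2, 0.3, 0.1, 5.0, 0.0, 0.0])  # made-up scores
end_logits = np.array([0.1, 4.0, 0.2, 0.1, 3.5, 0.3, 0.0])    # made-up scores

s, e = best_span(start_logits, end_logits)
print(tokens[s:e + 1])
```

In this toy example the highest end score sits before the highest start score, so independent argmax would give an empty slice, while the constrained search returns the single token `paris`.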

---

**Exercise 13: Building a Question Answering Function with BERT**  

1. Create a function that accepts a question and a context as inputs.  
2. Use the model *"bert-large-uncased-whole-word-masking-finetuned-squad"* from Hugging Face to process these inputs.  
3. The function should output the answer to the question based on the context provided.  
4. Ensure the function uses the proper tokenization and model inference steps to generate the answer.  

For example:  
- Input:  
  - **Question**: *"What is the capital of France?"*  
  - **Context**: *"France is a country in Europe. Its capital is Paris, known for its culture and history."*  
- Output: *"Paris"*

In [None]:
# Example context
context = (
        """Portugal, officially the Portuguese Republic, is a country in the Iberian Peninsula in Southwestern Europe.
    Featuring the westernmost point in continental Europe, to its north and east is Spain,
    with which it shares the longest uninterrupted border in the European Union;
    to the south and the west is the North Atlantic Ocean; and to the west and southwest lie the Macaronesian
    archipelagos of the Azores and Madeira, which are two autonomous regions of Portugal.
    Lisbon is the capital and largest city, followed by Porto, which is the only other metropolitan area.

    The western part of the Iberian Peninsula has been continuously inhabited since prehistoric times,
    with the earliest signs of settlement dating to 5500 BCE. Celtic and Iberian peoples arrived in the first millennium BCE,
    with Phoenician and later Punic influence reaching the south during the same period.
    The region came under Roman control in the second century BCE, followed by a succession of Germanic peoples
    and the Alans from the fifth to eighth centuries CE. Muslims conquered most of the Iberian Peninsula in the eighth
    century CE, but were gradually expelled by the Christian Reconquista over the next several centuries.
    Modern Portugal began taking shape during this period, initially as a county of the Christian Kingdom of León in 868,
    and ultimately as an independent Kingdom with the Treaty of Zamora in 1143."""
)

In [None]:
from transformers import BertForQuestionAnswering, BertTokenizer
import torch

def answer_question(question, context):
    """
    Answer a question based on a context using BERT.

    Args:
    question (str): The question to be answered.
    context (str): The text that contains the answer.

    Returns:
    str: The answer extracted from the context.
    """
    # Load the pretrained model and tokenizer
    model_name = "bert-large-uncased-whole-word-masking-finetuned-squad"
    tokenizer = BertTokenizer.from_pretrained(model_name)
    model = BertForQuestionAnswering.from_pretrained(model_name)

    # Tokenize the input (question and context are encoded as one sequence pair)
    inputs = tokenizer.encode_plus(question, context, return_tensors="pt")
    input_ids = inputs["input_ids"].tolist()[0]
    # Convert input_ids back to tokens (debug output)
    print(tokenizer.convert_ids_to_tokens(inputs['input_ids'][0]))

    # Get the start and end scores of the answer span
    outputs = model(**inputs)
    answer_start_scores = outputs.start_logits
    answer_end_scores = outputs.end_logits

    # Find the indices with the highest scores
    answer_start = torch.argmax(answer_start_scores)
    print(">>",answer_start)
    answer_end = torch.argmax(answer_end_scores) + 1
    print(">>",answer_end)

    # Convert the tokens back to text
    answer_tokens = tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end])
    answer = tokenizer.convert_tokens_to_string(answer_tokens)

    return answer

**Exercise 14: Testing the Question Answering Function**  

1. Test the function you created in Exercise 13 by providing different questions and contexts.  
2. Try various types of questions, such as factual queries, location-based questions, or questions involving more complex contexts.

In [None]:
question = "What is the capital of Portugal?"
answer = answer_question(question, context)
print()
print(f"Question: {question}")
print(f"Answer: {answer}")

Some weights of the model checkpoint at bert-large-uncased-whole-word-masking-finetuned-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


['[CLS]', 'what', 'is', 'the', 'capital', 'of', 'portugal', '?', '[SEP]', 'portugal', ',', 'officially', 'the', 'portuguese', 'republic', ',', 'is', 'a', 'country', 'in', 'the', 'iberian', 'peninsula', 'in', 'southwestern', 'europe', '.', 'featuring', 'the', 'western', '##most', 'point', 'in', 'continental', 'europe', ',', 'to', 'its', 'north', 'and', 'east', 'is', 'spain', ',', 'with', 'which', 'it', 'shares', 'the', 'longest', 'un', '##int', '##er', '##rup', '##ted', 'border', 'in', 'the', 'european', 'union', ';', 'to', 'the', 'south', 'and', 'the', 'west', 'is', 'the', 'north', 'atlantic', 'ocean', ';', 'and', 'to', 'the', 'west', 'and', 'southwest', 'lie', 'the', 'mac', '##aro', '##nesian', 'archipelago', '##s', 'of', 'the', 'azores', 'and', 'madeira', ',', 'which', 'are', 'two', 'autonomous', 'regions', 'of', 'portugal', '.', 'lisbon', 'is', 'the', 'capital', 'and', 'largest', 'city', ',', 'followed', 'by', 'porto', ',', 'which', 'is', 'the', 'only', 'other', 'metropolitan', 'are

In [None]:
question = "Where in Europe is Portugal?"
answer = answer_question(question, context)
print()
print(f"Question: {question}")
print(f"Answer: {answer}")


['[CLS]', 'where', 'in', 'europe', 'is', 'portugal', '?', '[SEP]', 'portugal', ',', 'officially', 'the', 'portuguese', 'republic', ',', 'is', 'a', 'country', 'in', 'the', 'iberian', 'peninsula', 'in', 'southwestern', 'europe', '.', 'featuring', 'the', 'western', '##most', 'point', 'in', 'continental', 'europe', ',', 'to', 'its', 'north', 'and', 'east', 'is', 'spain', ',', 'with', 'which', 'it', 'shares', 'the', 'longest', 'un', '##int', '##er', '##rup', '##ted', 'border', 'in', 'the', 'european', 'union', ';', 'to', 'the', 'south', 'and', 'the', 'west', 'is', 'the', 'north', 'atlantic', 'ocean', ';', 'and', 'to', 'the', 'west', 'and', 'southwest', 'lie', 'the', 'mac', '##aro', '##nesian', 'archipelago', '##s', 'of', 'the', 'azores', 'and', 'madeira', ',', 'which', 'are', 'two', 'autonomous', 'regions', 'of', 'portugal', '.', 'lisbon', 'is', 'the', 'capital', 'and', 'largest', 'city', ',', 'followed', 'by', 'porto', ',', 'which', 'is', 'the', 'only', 'other', 'metropolitan', 'area', '.'

In [None]:
question = "Where is Portugal?"
answer = answer_question(question, context)
print()
print(f"Question: {question}")
print(f"Answer: {answer}")


['[CLS]', 'where', 'is', 'portugal', '?', '[SEP]', 'portugal', ',', 'officially', 'the', 'portuguese', 'republic', ',', 'is', 'a', 'country', 'in', 'the', 'iberian', 'peninsula', 'in', 'southwestern', 'europe', '.', 'featuring', 'the', 'western', '##most', 'point', 'in', 'continental', 'europe', ',', 'to', 'its', 'north', 'and', 'east', 'is', 'spain', ',', 'with', 'which', 'it', 'shares', 'the', 'longest', 'un', '##int', '##er', '##rup', '##ted', 'border', 'in', 'the', 'european', 'union', ';', 'to', 'the', 'south', 'and', 'the', 'west', 'is', 'the', 'north', 'atlantic', 'ocean', ';', 'and', 'to', 'the', 'west', 'and', 'southwest', 'lie', 'the', 'mac', '##aro', '##nesian', 'archipelago', '##s', 'of', 'the', 'azores', 'and', 'madeira', ',', 'which', 'are', 'two', 'autonomous', 'regions', 'of', 'portugal', '.', 'lisbon', 'is', 'the', 'capital', 'and', 'largest', 'city', ',', 'followed', 'by', 'porto', ',', 'which', 'is', 'the', 'only', 'other', 'metropolitan', 'area', '.', 'the', 'wester

In [None]:
question = "What are the names of the archipelagos in Portugal?"
answer = answer_question(question, context)
print()
print(f"Question: {question}")
print(f"Answer: {answer}")


['[CLS]', 'what', 'are', 'the', 'names', 'of', 'the', 'archipelago', '##s', 'in', 'portugal', '?', '[SEP]', 'portugal', ',', 'officially', 'the', 'portuguese', 'republic', ',', 'is', 'a', 'country', 'in', 'the', 'iberian', 'peninsula', 'in', 'southwestern', 'europe', '.', 'featuring', 'the', 'western', '##most', 'point', 'in', 'continental', 'europe', ',', 'to', 'its', 'north', 'and', 'east', 'is', 'spain', ',', 'with', 'which', 'it', 'shares', 'the', 'longest', 'un', '##int', '##er', '##rup', '##ted', 'border', 'in', 'the', 'european', 'union', ';', 'to', 'the', 'south', 'and', 'the', 'west', 'is', 'the', 'north', 'atlantic', 'ocean', ';', 'and', 'to', 'the', 'west', 'and', 'southwest', 'lie', 'the', 'mac', '##aro', '##nesian', 'archipelago', '##s', 'of', 'the', 'azores', 'and', 'madeira', ',', 'which', 'are', 'two', 'autonomous', 'regions', 'of', 'portugal', '.', 'lisbon', 'is', 'the', 'capital', 'and', 'largest', 'city', ',', 'followed', 'by', 'porto', ',', 'which', 'is', 'the', 'on

In [None]:
question = "What is the capital of Spain?"
answer = answer_question(question, context)
print()
print(f"Question: {question}")
print(f"Answer: {answer}")


['[CLS]', 'what', 'is', 'the', 'capital', 'of', 'spain', '?', '[SEP]', 'portugal', ',', 'officially', 'the', 'portuguese', 'republic', ',', 'is', 'a', 'country', 'in', 'the', 'iberian', 'peninsula', 'in', 'southwestern', 'europe', '.', 'featuring', 'the', 'western', '##most', 'point', 'in', 'continental', 'europe', ',', 'to', 'its', 'north', 'and', 'east', 'is', 'spain', ',', 'with', 'which', 'it', 'shares', 'the', 'longest', 'un', '##int', '##er', '##rup', '##ted', 'border', 'in', 'the', 'european', 'union', ';', 'to', 'the', 'south', 'and', 'the', 'west', 'is', 'the', 'north', 'atlantic', 'ocean', ';', 'and', 'to', 'the', 'west', 'and', 'southwest', 'lie', 'the', 'mac', '##aro', '##nesian', 'archipelago', '##s', 'of', 'the', 'azores', 'and', 'madeira', ',', 'which', 'are', 'two', 'autonomous', 'regions', 'of', 'portugal', '.', 'lisbon', 'is', 'the', 'capital', 'and', 'largest', 'city', ',', 'followed', 'by', 'porto', ',', 'which', 'is', 'the', 'only', 'other', 'metropolitan', 'area',
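
The last question asks about Spain, whose capital the context never mentions. An extractive question-answering model scores every token as a possible answer start and answer end, so some span of the context will usually score highest even when no correct answer exists. The sketch below shows one common way to pick a span from those scores. It is an assumption for illustration, not necessarily how `answer_question` works: the function name `best_span`, the `max_answer_len` parameter, and the toy scores are all hypothetical.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return ((start, end), score) for the highest-scoring valid span.

    Hypothetical helper: a span is valid when start <= end and it covers
    at most max_answer_len tokens. Its score is the sum of the start
    score and the end score.
    """
    best = (0, 0)
    best_score = float("-inf")
    for i, start in enumerate(start_scores):
        last = min(i + max_answer_len, len(end_scores))
        for j in range(i, last):
            score = start + end_scores[j]
            if score > best_score:
                best_score = score
                best = (i, j)
    return best, best_score


# Toy scores (made up): taking each argmax independently gives start=2, end=1,
# which is not a valid span because the end comes before the start.
start_scores = [0.1, 0.2, 5.0, 0.3]
end_scores = [0.0, 4.0, 0.5, 3.0]
span, score = best_span(start_scores, end_scores)
print(span, score)  # (2, 3) 8.0
```

Comparing the returned score against a threshold is one possible way to flag low-confidence answers such as the Spain question, although a suitable threshold would have to be chosen empirically.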