In [1]:
!pip install -q transformers[torch] datasets
!pip install -q -U accelerate



In [2]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline

# CONTEXT
# Triple quotes are needed because the text contains a line break.
data = """Neymar made his professional debut with Santos in 2009, and in 2011, he helped them win their first Copa Libertadores in nearly 50 years.[5] In 2013, he joined Barcelona and became part of an attacking trio with Lionel Messi and Luis Suárez, dubbed MSN. Winning the continental treble of La Liga, the Copa del Rey, and the UEFA Champions League in the trio's first season, Neymar was the joint-top scorer of the Champions League campaign and top scorer in the Copa del Rey. Neymar joined Paris Saint-Germain (PSG) in 2017 in a transfer costing €222 million, making him the most expensive player ever.[note 1][8] There, he won Ligue 1 Player of the Year, won five Ligue 1 titles, and was integral to PSG being runners-up in the 2019-20 Champions League. He also ranks as PSG's fourth-highest all-time top goalscorer, despite reoccurring injuries consistently disrupting his playing time. In 2023, he became the most expensive signing in Saudi Pro League history, costing €90 million, as he signed for Al Hilal. Debuting for Brazil aged 18, Neymar is the nation's all-time top goalscorer, with 79 goals in 128 matches. He won the 2013 FIFA Confederations Cup, winning the Golden Ball. In the 2014 FIFA World Cup, he was named in the Dream Team. He captained Brazil to their first Olympic gold medal in men's football at the 2016 Summer Olympics, having already achieving a silver medal at the 2012 edition. Helping Brazil to a runner-up finish at the 2021 Copa América, he was jointly awarded Best Player. In the 2022 World Cup, he became the third Brazilian player to score in three World Cups, after Pelé and Ronaldo. Neymar has won a record six Samba Gold awards. Neymar has been named in the FIFA FIFPro World11 and the UEFA Team of the Year twice and the UEFA Champions League Squad of the Season three times. He finished third for the FIFA Ballon d'Or in 2015 and 2017 and won the FIFA Puskás Award in 2011. 
SportsPro named Neymar the world's most marketable athlete in 2012 and 2013, and ESPN cited him as the world's fourth-most-famous athlete in 2016. In 2017, Time included him in its annual list of the 100 most influential people in the world.[9] France Football ranked Neymar the world's third-highest-paid footballer of 2018. Forbes ranked him the world's third-highest-paid athlete of 2019,[10] dropping to fourth in 2020.[11]"""

# Define your questions
questions = [
    "Did Neymar play for Barcelona?",
    "Did Neymar play for Real Madrid?",
    "How many Champions Leagues has Neymar won?",
    "Where is Neymar from?",
    "Which players did Neymar play with?",
    "Has Neymar won a world cup?",
    "Is Neymar the best player in the world?"
]

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForQuestionAnswering.from_pretrained('bert-base-uncased')

# Initialize the question answering pipeline
nlp = pipeline('question-answering', model=model, tokenizer=tokenizer)

# Iterate over each question and provide answers from the file context
for question in questions:
    answer = nlp({
        'question': question,
        'context': data
    })
    print(f"Question: {question}")
    print(f"Answer: {answer['answer']}\n")
    print(f"Score: {answer['score']:.2f}")  # Confidence score of the answer
    print(f"Context: {data[max(0, answer['start'] - 30):answer['end'] + 30]}")  # Display context around the answer
    print("\n")


Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Question: Did Neymar play for Barcelona?
Answer: three

Score: 0.00
Context: ns League Squad of the Season three times. He finished third for 


Question: Did Neymar play for Real Madrid?
Answer: three

Score: 0.00
Context: ns League Squad of the Season three times. He finished third for 


Question: How many Champions Leagues has Neymar won?
Answer: three

Score: 0.00
Context: ns League Squad of the Season three times. He finished third for 


Question: Where is Neymar from?
Answer: , and ESPN

Score: 0.00
Context: table athlete in 2012 and 2013, and ESPN cited him as the world's four


Question: Which players did Neymar play with?
Answer: three

Score: 0.00
Context: ns League Squad of the Season three times. He finished third for 


Question: Has Neymar won a world cup?
Answer: three

Score: 0.00
Context: ns League Squad of the Season three times. He finished third for 


Question: Is Neymar the best player in the world?
Answer: , and ESPN

Score: 0.00
Context: table athlete in 2012 

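Without fine-tuning, this model is not ready for question answering. As the warning above states, the qa_outputs weights are newly initialized, so the predicted spans are essentially garbage: every score rounds to 0.00 and the answers have nothing to do with the questions.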
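A rough way to see why every score prints as 0.00: as far as I understand it, the question-answering pipeline scores a span by multiplying the softmax probability of its start token by that of its end token. With an untrained head the logits are nearly flat, so each probability is roughly 1/n for n tokens and the product is tiny. The sketch below is a toy illustration with made-up logits, not the pipeline's actual code; best_span_score, the logit values, and the 384-token length are assumptions for the example.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def best_span_score(start_logits, end_logits, max_answer_len=15):
    """Return (score, start, end) for the best span with end >= start.

    The score is p_start * p_end, and spans are capped at max_answer_len tokens.
    """
    p_start = softmax(start_logits)
    p_end = softmax(end_logits)
    best = (0.0, 0, 0)
    for s, ps in enumerate(p_start):
        for e in range(s, min(s + max_answer_len, len(p_end))):
            score = ps * p_end[e]
            if score > best[0]:
                best = (score, s, e)
    return best

n_tokens = 384

# Hypothetical untrained head: nearly flat logits over all tokens.
flat = [0.01 * ((i * 7) % 5) for i in range(n_tokens)]
untrained_score, _, _ = best_span_score(flat, flat)

# Hypothetical trained head: one clear peak for the start and one for the end.
peaked_start = [0.0] * n_tokens
peaked_end = [0.0] * n_tokens
peaked_start[40] = 12.0
peaked_end[42] = 12.0
trained_score, span_start, span_end = best_span_score(peaked_start, peaked_end)

print(f"untrained: {untrained_score:.2f}")
print(f"trained:   {trained_score:.2f} (tokens {span_start}-{span_end})")
```

The point is only the order of magnitude: a flat distribution over a few hundred tokens cannot produce a score that survives rounding to two decimals.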

In [3]:
# Import libraries
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, TrainingArguments, Trainer, BertConfig
from datasets import load_dataset, DatasetDict
from transformers import DefaultDataCollator

# Load the SQuAD dataset
dataset = load_dataset("squad")

# Load tokenizer and model with customized dropout configuration
config = BertConfig.from_pretrained("bert-base-uncased", hidden_dropout_prob=0.3, attention_probs_dropout_prob=0.3)
model = AutoModelForQuestionAnswering.from_pretrained("bert-base-uncased", config=config)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Function to prepare train features
def prepare_train_features(examples):
    # Tokenize our examples with truncation and padding, but keep the overflows using a stride.
    tokenized_examples = tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",  # truncate context, not the question
        max_length=384,
        stride=128,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Remove unnecessary columns and prepare for start/end position labels
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    offset_mapping = tokenized_examples.pop("offset_mapping")
    tokenized_examples["start_positions"] = []
    tokenized_examples["end_positions"] = []

    for i, offsets in enumerate(offset_mapping):
        input_ids = tokenized_examples["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)
        sequence_ids = tokenized_examples.sequence_ids(i)
        sample_index = sample_mapping[i]
        answers = examples["answers"][sample_index]

        # If no answers are given, set the cls_index as answer.
        if len(answers["answer_start"]) == 0:
            tokenized_examples["start_positions"].append(cls_index)
            tokenized_examples["end_positions"].append(cls_index)
        else:
            start_char = answers["answer_start"][0]
            end_char = start_char + len(answers["text"][0])

            token_start_index = 0
            while sequence_ids[token_start_index] != 1:
                token_start_index += 1
            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != 1:
                token_end_index -= 1

            if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
                tokenized_examples["start_positions"].append(cls_index)
                tokenized_examples["end_positions"].append(cls_index)
            else:
                while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
                    token_start_index += 1
                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                tokenized_examples["start_positions"].append(token_start_index - 1)
                tokenized_examples["end_positions"].append(token_end_index + 1)

    return tokenized_examples

# Apply the function to the data
tokenized_datasets = dataset.map(prepare_train_features, batched=True, remove_columns=dataset["train"].column_names)

# Set training arguments
args = TrainingArguments(
    output_dir="./finetune-BERT-squad",
    evaluation_strategy="epoch",
    learning_rate=1e-5,
    per_device_train_batch_size=15,
    per_device_eval_batch_size=15,
    num_train_epochs=15,
    weight_decay=0.05,
    logging_dir='./logs',
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="loss",
    greater_is_better=False
)

# Using default data collator from transformers
data_collator = DefaultDataCollator()

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_datasets["train"].select(range(7000)),
    eval_dataset=tokenized_datasets["validation"].select(range(700)),
    data_collator=data_collator,
    tokenizer=tokenizer,
)

# Start training
trainer.train()

# Save the fine-tuned model and tokenizer
model.save_pretrained("./fine_tuned_bert_qa")
tokenizer.save_pretrained("./fine_tuned_bert_qa")


Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss
1,No log,2.686252
2,3.933500,2.062819
3,2.648500,1.73578
4,2.082400,1.63749
5,1.753400,1.606177
6,1.595300,1.536891
7,1.449700,1.53742
8,1.391700,1.513141
9,1.301100,1.482793
10,1.223700,1.496154
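Since the arguments above use load_best_model_at_end=True with metric_for_best_model="loss", the checkpoint that ends up loaded should be the one with the lowest validation loss. The sketch below simply re-reads the ten epochs shown in the table to find that epoch and to list where validation loss went up while training loss kept falling. The helper names are made up for this example, and only the logged rows are used.

```python
import csv
import io

# Training log copied from the table above (only epochs 1-10 are shown there).
LOG = """Epoch,Training Loss,Validation Loss
1,No log,2.686252
2,3.933500,2.062819
3,2.648500,1.73578
4,2.082400,1.63749
5,1.753400,1.606177
6,1.595300,1.536891
7,1.449700,1.53742
8,1.391700,1.513141
9,1.301100,1.482793
10,1.223700,1.496154
"""

def read_validation_losses(log_text):
    """Parse the log into a list of (epoch, validation_loss) pairs."""
    rows = csv.DictReader(io.StringIO(log_text))
    return [(int(r["Epoch"]), float(r["Validation Loss"])) for r in rows]

def epochs_where_loss_rose(pairs):
    """Return the epochs whose validation loss is higher than the previous epoch's."""
    return [cur[0] for prev, cur in zip(pairs, pairs[1:]) if cur[1] > prev[1]]

pairs = read_validation_losses(LOG)
best_epoch, best_loss = min(pairs, key=lambda p: p[1])
rising = epochs_where_loss_rose(pairs)

print(f"lowest validation loss {best_loss:.6f} at epoch {best_epoch}")
print(f"validation loss rose at epochs: {rising}")
```

A validation loss that flattens or rises while the training loss keeps dropping is the usual sign of overfitting.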


Checkpoint destination directory ./finetune-BERT-squad/checkpoint-467 already exists and is non-empty. Saving will proceed but saved results may be invalid.


('./fine_tuned_bert_qa/tokenizer_config.json',
 './fine_tuned_bert_qa/special_tokens_map.json',
 './fine_tuned_bert_qa/vocab.txt',
 './fine_tuned_bert_qa/added_tokens.json',
 './fine_tuned_bert_qa/tokenizer.json')
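The label-alignment step inside prepare_train_features is the hardest part of the cell above to follow, so here is a small stand-alone sketch of the same idea. It assumes a toy whitespace tokenizer instead of the BERT tokenizer, and the helper names (whitespace_offsets, char_span_to_token_span) are hypothetical. It maps a character-level answer span to token indices and falls back to the [CLS] index when the answer is not fully inside the window.

```python
import re

def whitespace_offsets(text):
    """Toy tokenizer: one token per word, returned as (char_start, char_end)."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]

def char_span_to_token_span(offsets, ctx_first, ctx_last, start_char, end_char, cls_index=0):
    """Map a character span to (start_token, end_token).

    offsets holds one (char_start, char_end) pair per token; ctx_first and
    ctx_last are the first and last token indices that belong to the context.
    """
    # Answer not fully contained in this window: label it with [CLS].
    if not (offsets[ctx_first][0] <= start_char and offsets[ctx_last][1] >= end_char):
        return cls_index, cls_index

    # Walk right until a token starts after the answer start, then step back one.
    tok_start = ctx_first
    while tok_start <= ctx_last and offsets[tok_start][0] <= start_char:
        tok_start += 1

    # Walk left until a token ends before the answer end, then step forward one.
    tok_end = ctx_last
    while tok_end >= ctx_first and offsets[tok_end][1] >= end_char:
        tok_end -= 1

    return tok_start - 1, tok_end + 1

context = "Neymar joined Barcelona in 2013"
ctx_offsets = whitespace_offsets(context)

# Layout: [CLS] q q [SEP] <context tokens> [SEP]; non-context tokens get (0, 0).
prefix = [(0, 0)] * 4
full = prefix + ctx_offsets + [(0, 0)]

answer = "Barcelona"
start_char = context.index(answer)
end_char = start_char + len(answer)

full_span = char_span_to_token_span(full, 4, 4 + len(ctx_offsets) - 1, start_char, end_char)

# A truncated window that only holds the first two context words.
window = prefix + ctx_offsets[:2] + [(0, 0)]
truncated_span = char_span_to_token_span(window, 4, 5, start_char, end_char)

print("full window:", full_span)
print("truncated window:", truncated_span)
```

With stride and return_overflowing_tokens, one long context yields several windows, so the same answer can be labeled with a real span in one window and with [CLS] in another.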

In [4]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer
from transformers import pipeline

# Same context and questions as in cell [2], now run against the fine-tuned model
data = """Neymar made his professional debut with Santos in 2009, and in 2011, he helped them win their first Copa Libertadores in nearly 50 years.[5] In 2013, he joined Barcelona and became part of an attacking trio with Lionel Messi and Luis Suárez, dubbed MSN. Winning the continental treble of La Liga, the Copa del Rey, and the UEFA Champions League in the trio's first season, Neymar was the joint-top scorer of the Champions League campaign and top scorer in the Copa del Rey. Neymar joined Paris Saint-Germain (PSG) in 2017 in a transfer costing €222 million, making him the most expensive player ever.[note 1][8] There, he won Ligue 1 Player of the Year, won five Ligue 1 titles, and was integral to PSG being runners-up in the 2019-20 Champions League. He also ranks as PSG's fourth-highest all-time top goalscorer, despite reoccurring injuries consistently disrupting his playing time. In 2023, he became the most expensive signing in Saudi Pro League history, costing €90 million, as he signed for Al Hilal. Debuting for Brazil aged 18, Neymar is the nation's all-time top goalscorer, with 79 goals in 128 matches. He won the 2013 FIFA Confederations Cup, winning the Golden Ball. In the 2014 FIFA World Cup, he was named in the Dream Team. He captained Brazil to their first Olympic gold medal in men's football at the 2016 Summer Olympics, having already achieving a silver medal at the 2012 edition. Helping Brazil to a runner-up finish at the 2021 Copa América, he was jointly awarded Best Player. In the 2022 World Cup, he became the third Brazilian player to score in three World Cups, after Pelé and Ronaldo. Neymar has won a record six Samba Gold awards. Neymar has been named in the FIFA FIFPro World11 and the UEFA Team of the Year twice and the UEFA Champions League Squad of the Season three times. He finished third for the FIFA Ballon d'Or in 2015 and 2017 and won the FIFA Puskás Award in 2011. 
SportsPro named Neymar the world's most marketable athlete in 2012 and 2013, and ESPN cited him as the world's fourth-most-famous athlete in 2016. In 2017, Time included him in its annual list of the 100 most influential people in the world.[9] France Football ranked Neymar the world's third-highest-paid footballer of 2018. Forbes ranked him the world's third-highest-paid athlete of 2019,[10] dropping to fourth in 2020.[11]"""

questions = [
    "Did Neymar play for Barcelona?",
    "Did Neymar play for Real Madrid?",
    "How many Champions Leagues has Neymar won?",
    "Where is Neymar from?",
    "Which players did Neymar play with?",
    "Has Neymar won a world cup?",
    "Is Neymar the best player in the world?"
]

model_path = '/content/fine_tuned_bert_qa'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForQuestionAnswering.from_pretrained(model_path)
nlp = pipeline("question-answering", model=model, tokenizer=tokenizer)

# Iterate over each question and provide answers from the file context
for question in questions:
    answer = nlp({
        'question': question,
        'context': data
    })
    print(f"Question: {question}")
    print(f"Answer: {answer['answer']}\n")
    print(f"Score: {answer['score']:.2f}")  # Confidence score of the answer
    print(f"Context: {data[max(0, answer['start'] - 30):answer['end'] + 30]}")  # Display context around the answer
    print("\n")

Question: Did Neymar play for Barcelona?
Answer: 2017

Score: 0.02
Context:  Paris Saint-Germain (PSG) in 2017 in a transfer costing €222 mi


Question: Did Neymar play for Real Madrid?
Answer: Paris Saint-Germain (PSG) in 2017

Score: 0.05
Context: e Copa del Rey. Neymar joined Paris Saint-Germain (PSG) in 2017 in a transfer costing €222 mi


Question: How many Champions Leagues has Neymar won?
Answer: five

Score: 0.60
Context: gue 1 Player of the Year, won five Ligue 1 titles, and was integ


Question: Where is Neymar from?
Answer: Paris Saint-Germain (PSG) in 2017

Score: 0.03
Context: e Copa del Rey. Neymar joined Paris Saint-Germain (PSG) in 2017 in a transfer costing €222 mi


Question: Which players did Neymar play with?
Answer: Pelé and Ronaldo

Score: 0.73
Context: re in three World Cups, after Pelé and Ronaldo. Neymar has won a record six 


Question: Has Neymar won a world cup?
Answer: six Samba Gold awards

Score: 0.05
Context: aldo. Neymar has won a record six Samba Gold 
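The scores above fall into two groups: four answers at 0.05 or below and two at 0.60 or above. One possible way to use that is to treat low-score spans as "no confident answer" instead of printing them. The sketch below applies a cutoff to the results copied from the output above; the 0.5 threshold and the filter_confident helper are assumptions for illustration, not something the pipeline does by default.

```python
# (question, answer, score) triples copied from the output above.
results = [
    ("Did Neymar play for Barcelona?", "2017", 0.02),
    ("Did Neymar play for Real Madrid?", "Paris Saint-Germain (PSG) in 2017", 0.05),
    ("How many Champions Leagues has Neymar won?", "five", 0.60),
    ("Where is Neymar from?", "Paris Saint-Germain (PSG) in 2017", 0.03),
    ("Which players did Neymar play with?", "Pelé and Ronaldo", 0.73),
    ("Has Neymar won a world cup?", "six Samba Gold awards", 0.05),
]

THRESHOLD = 0.5  # hypothetical cutoff, chosen only to separate the two groups

def filter_confident(rows, threshold):
    """Keep only the rows whose score is at or above the threshold."""
    return [row for row in rows if row[2] >= threshold]

confident = filter_confident(results, THRESHOLD)
for question, answer, score in confident:
    print(f"{score:.2f}  {question} -> {answer}")
```

A cutoff only hides the weakest guesses. The two answers that pass it match the type of thing asked for (a number, a list of players), but the context says "five Ligue 1 titles" and names Pelé and Ronaldo as earlier World Cup scorers, so neither is actually supported by the text.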

REPORT:

Before getting into the model I used in this notebook, I would like to touch on another attempt I made, for the sake of documenting my learning process for this assignment. I used the 'deepset/bert-base-cased-squad2' tokenizer and model with the exact same context and questions as in this notebook, and was able to generate responses with a success rate of about 17%: one out of 6 questions was answered perfectly. I attempted to fine-tune this model and was unable to improve it; every attempt only made it worse. I believe this was largely because I was using a custom dataset of my own, which was severely limited: I had created only 40 question-and-answer examples like this:

{
    "question": "Did Neymar play for Barcelona?",
    "context": "Neymar made his professional debut with Santos in 2009, and in 2011, he helped them win their first Copa Libertadores title since 1963. Neymar transferred to Barcelona in 2013, and won two La Liga titles, three Copa del Rey, and the UEFA Champions League in 2015.",
    "answers": {
        "text": ["Neymar transferred to Barcelona in 2013"],
        "answer_start": [98]
    }
}

I asked ChatGPT-4 to help me create more training items, but the results were still poor. After this experience I decided to explore more models and came across the 'Squad' model used here (more information: https://huggingface.co/csarron/bert-base-uncased-squad-v), which at first garnered a poor response, complete garbage. This is because the model is not prepared for question answering without fine-tuning. After fine-tuning, the generated responses are closer to what the question is asking for in terms of format, but they are still not perfect: if the question asks for a number, a number is provided, and if it asks for a location, a location is provided. The previous project is included here, below, to demonstrate what was created before with a different model and the given data.

I do believe the model could be improved further. I spent an entire day of trial and error adjusting its parameters to combat overfitting. I was encouraged to purchase more GPU power to speed up this process, which allowed me to increase the number of epochs. Unfortunately, I was running out of disk space at this level and decided to stop here. Overfitting remained a problem, but increasing the number of epochs, the weight decay, and the dropout rates, among other small tweaks, yielded much better results.