## Installation Reqs (Linux Ubuntu)
0. <code> conda create -n env_pytorch python=3.12 conda activate env_pytorch <code>
1. <code> python3 -m pip install playwright
playwright install </code> 
2. <code> pip install datasets <code>
3. <code> pip install transformers <code>
4. <code> wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
sudo mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda-repo-ubuntu2204-12-4-local_12.4.0-550.54.14-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2204-12-4-local_12.4.0-550.54.14-1_amd64.deb
sudo cp /var/cuda-repo-ubuntu2204-12-4-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-4 <code>
5. <code> pip install pytorch <code>
5. <code> pip install evaluate <code>


I found that initially creating a virtual environment in conda with python, then installing the dependencies produced less confusion with the path to pytorch.

## Webscraping 

This webscraper employs the use of PlayWright, my implementation heavily relies on the CSS selector pattern I see on Google Developer Inspect mode for the website. That, however, did produce some errors when there were unique edge cases in which the Locator object didn't seem to have methods to deal with css patterns of that form. For instance, the table element, and each sub element for nutrient facts proved very odd to process with the Locator object. Otherwise, for simple patterns found for each food page this web scraper did it's job, there was more information to be extracted. Again, the nutrient facts. For the sake of time, I skipped that information. But with more time the code can be extend to include all information on the page.

In [1]:
## WEB SCRAPING CELL: 
from playwright.async_api import async_playwright
import asyncio

async def process_locator(locator):
    count = await locator.count()
    if count > 1: #We have many elements and must resolve each inner text
        texts = ""
        for i in range(count):
            element = locator.nth(i)
            if await element.is_visible():
                inner_text = await element.inner_text()
                texts = texts + "," + inner_text
                return texts
    else:
        if await locator.is_visible():
            return await locator.inner_text()
        else:
            return "NA"
    

async def main():
   async with async_playwright() as pw:
       browser = await pw.chromium.launch(
           ##We'll employ the use of chromium for this webscraper
           ##Using a proxy creates HTTP errors.
          headless=False
      )

       #Beginning page: 
       page = await browser.new_page()
       await page.goto('https://world.openfoodfacts.org/')
       await page.wait_for_timeout(5000)
       result = []
       food_urls = []
       food_list = await page.query_selector_all('.list_product_a')
       for food in food_list:
           food_urls.append(await food.get_attribute('href'))
           
       for food_url in food_urls:
            food_info = {}
            await page.goto(food_url)
            #Title: 
            title = page.locator(".title-1")
            food_info['title'] = await process_locator(title)
            #Common Name:
            common_name = page.locator("#field_generic_name_value")
            food_info['common_name'] = await process_locator(common_name)
            #Quantity:
            quantity = page.locator("#field_quantity_value")
            food_info['quantity'] = await process_locator(quantity)
            #Packaging: 
            packaging = page.locator("#field_packaging_value")
            food_info['packaging'] = await process_locator(packaging)
            #Brands:
            brand = page.locator("#field_brands_value")
            food_info['brand'] = await process_locator(brand)
            #Categories:
            categories = page.locator("#field_categories_value")
            food_info['categories'] = await process_locator(categories)
            #Certifications:
            certifications = page.locator("#field_labels_value")
            food_info['certifications'] = await process_locator(certifications)
            #Origin:
            origin = page.locator("#field_origin_value")
            food_info['origin'] = await process_locator(origin)
            #origin of ingredients:
            origin_of_ingredients = page.locator("#field_origins_value")
            food_info['origin_of_ingredients'] = await process_locator(origin_of_ingredients)
            #Places of manufacturing:
            places_manufactured = page.locator("#field_manufacturing_places_value")
            food_info['places_manufactured'] = await process_locator(places_manufactured)
            #Stores:
            stores = page.locator("#field_stores_value")
            food_info['stores'] = await process_locator(stores)
            #Countries where Sold:
            countries_sold = page.locator("#field_countries_value")
            food_info['countries_sold'] = await process_locator(countries_sold)
           
            #HEALTH SECTION
            #Notice, because of the increasing complexity of the DOM elements in this area the CSS selectors don't follow a similarly nice pattern
            #Ingredients: 
            ingredients = page.locator("#panel_ingredients_content .panel_text")
            food_info['ingredients'] = await process_locator(ingredients)
            #NOVA score:
            nova_score = page.locator("ul#panel_nova li.accordion-navigation h4")
            food_info['nova_score'] = await process_locator(nova_score)
            # #Palm Status:
            # palm_status = page.locator(".accordion-navigation active .content panel_content active .panel_text")
            # food_info['palm_status'] = await process_locator(palm_status)
            # #Vegan Status:
            # vegan_status = page.locator("#panel_ingredients_analysis_en-vegan_content .panel_text")
            # food_info['vegan_status'] = await process_locator(vegan_status)
            # #Vegetarian Status:
            # vegetarian_status = page.locator("#panel_ingredients_analysis_en-vegetarian_content .panel_text")
            # food_info['vegetarian_status'] = await process_locator(vegetarian_status)
            #Nutrition grade:
            nutrition_grade = page.locator(".accordion-navigation .grade_a_title")
            food_info['nutrition_grade'] = await process_locator(nutrition_grade)

            # #NUTITRION FACTS
            # #
            # table_rows = await page.query_selector_all("#panel_nutrition_facts_table_content")
            # nutrition_facts = {}
            # for row in table_rows:
            #     columns = await row.query_selector_all('td')
            #     name = await process_locator(columns[0])
            #     value_per_100g = await process_locator(columns[1])
            #     nutrition_facts[name] = {
            #         "100g/100ml": value_per_100g
            #     }
                    
                    
            # food_info['nutrition_table'] = nutrition_facts
            result.append(food_info)
            


       
       

       
           
           
           
       await browser.close()
       return result
if __name__ == '__main__':
   result = await main()

#Problems & Changes:
#

#CITATIONs: 
#Code cited from OxyLabs: https://github.com/oxylabs/playwright-web-scraping?tab=readme-ov-file
#,https://playwright.dev/python/docs/locators

In [2]:
result[10]

{'title': "Huile d'olive vierge extra - Domaine de Bournissac - 500\xa0ml",
 'common_name': "Huile d'olive",
 'quantity': '500 ml',
 'packaging': 'NA',
 'brand': 'Domaine de Bournissac',
 'categories': 'Plant-based foods and beverages, Plant-based foods, Fats, Vegetable fats, Olive tree products, Vegetable oils, Olive oils, Extra-virgin olive oils, Virgin olive oils',
 'certifications': 'Organic, EU Organic, FR-BIO-10\n',
 'origin': 'NA',
 'origin_of_ingredients': 'NA',
 'places_manufactured': 'NA',
 'stores': 'NA',
 'countries_sold': 'France',
 'ingredients': 'NA',
 'nova_score': 'Processed culinary ingredients',
 'nutrition_grade': 'NA'}

## Preprocessing data

Here we process every instance of food_data such that we create a context, a body of text that contains at some point the answer to our question. The format for the context is quite simple, we take some information from the food_data element and put it into a correct sentence. We then define a function to create a pair of questions and answers, for the sake of simplicity I define 4 questions and answers. This could be extended even further, but requires more preprocessing. We finally convert the list of dictionaries to a Dataset object from HuggingFace's datasets library. Finally the dataset is split into training and validation sets, at a 80/20 split, deterministically (for the purpose of debugging).

In [3]:
#PREPROCESSING CELL
from datasets import DatasetDict, Dataset, Features
##DATA FORMAT:
# Context: 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.'
# Question: 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?'
# Answer: {'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}

#Defining a function to create the context, in the context part of a QA data format
def create_context(food_data):
    context = (
        f"{food_data['title']} is commonly known as {food_data['common_name']}. "
        f"The ingredients are {food_data['ingredients']}. "
        f"The packaging includes {food_data['packaging']}. "
        f"The brand is {food_data['brand']}. "
        f"It falls under the categories {food_data['categories']}. "
        f"It has certifications like {food_data['certifications']}. "
        f"It originates from {food_data['origin']} and the origin of ingredients is {food_data['origin_of_ingredients']}. "
        f"It is manufactured in {food_data['places_manufactured']}. "
        f"It is sold in countries like {food_data['countries_sold']}. "
        f"The nutrition grade is {food_data['nutrition_grade']}. "
        f"The NOVA score is {food_data['nova_score']}. "
        f"It can be found in stores such as {food_data['stores']}. "
    )
    return context



def create_question_answer(food_data, context):
    qa_pairs = []

    #Generate question about common name 
    question = "What is the common name of " + food_data['title'] + "?"
    answer = {'text': [food_data['common_name']], 'answer_start':[context.index(food_data['common_name'])]}
    qa_pairs.append({'question':question, 'answer':answer})
    #Generate question about ingredients
    question = "What are some ingredients in " + food_data['title'] + "?"
    answer = {'text':[food_data['ingredients']], 'answer_start':[context.index(food_data['ingredients'])]}
    qa_pairs.append({'question':question, 'answer':answer})
    #Generate question 
    question = "What is the packaging of " + food_data['title'] + "?"
    answer = {'text':[food_data['packaging']], 'answer_start':[context.index(food_data['packaging'])]}
    qa_pairs.append({'question':question, 'answer':answer})
    #Generate Question
    question = "What is the brand of " + food_data['title'] + "?"
    answer = {'text':[food_data['brand']], 'answer_start':[context.index(food_data['brand'])]}
    qa_pairs.append({'question':question, 'answer':answer})
    #Generate Question
    
    return qa_pairs

def create_qac_dataset(food_data_list):
    qac_dataset = []
    for food_data in food_data_list:
        context = create_context(food_data)
        qa_pairs = create_question_answer(food_data, context)
        for qa in qa_pairs:
            current_dict = {"context": context, "question": qa['question'], "answer": qa['answer']}
            qac_dataset.append(current_dict)
    return qac_dataset
        
   #As of the current function, it is a deterministic split, this is done for debugging purposes
def split_dataset(dataset, split_ratio=0.8):
    split_index = int(len(dataset) * split_ratio)
    
    training_data = dataset[:split_index]
    validation_data = dataset[split_index:]
    
    return {"train": training_data, "validation": validation_data}



#This is the format found on the Hugging Face tutorial, following this contruction for simplicity
def convert_to_dataset_dict(training_set, validation_set):
    features = Features({
        "id": "string",
        "title": "string",
        "context": "string",
        "question": "string",
        "answers": "string",
    })

    # Create Dataset objects
    train_dataset = Dataset.from_pandas(training_set)
    validation_dataset = Dataset.from_pandas(validation_set)

    # Create DatasetDict
    dataset_dict = DatasetDict({
        "train": train_dataset,
        "validation": validation_dataset
    })

    return dataset_dict



In [4]:
#Create our question, answer, context dictionary
qac_result = create_qac_dataset(result)

In [5]:
#The documentation on Hugging face suggests that this is the best format
idx = 0
for qa in qac_result:
    qa["id"] = str(idx + 1)  # Adding 1 to start id from 1
    qa["title"] = f"Title {idx + 1}"  # Assuming title follows a pattern, adjust as needed

    # Move id and title to the beginning of the dictionary
    qa.update({"id": qa["id"], "title": qa["title"]})
    idx += 1

In [6]:
qac_result[0]

{'context': "Eau de Source - Cristaline - 1,5\xa0L is commonly known as Spring water. The ingredients are water. The packaging includes Aluminium-can, HdpeFilm-packet, PpFilm-wrapper, Ldpe-film. The brand is Cristaline. It falls under the categories Beverages, Waters, Spring waters. It has certifications like Triman\n. It originates from Embouteillée à 24610 Saint-Martin de Gurson France and the origin of ingredients is France, fr:Saint-Martin de Gurson. It is manufactured in Saint-Martin de Gurson, France, 24610. It is sold in countries like Belgium, Côte d'Ivoire, France, Germany, Guadeloupe, Italy, Luxembourg, Mali, Martinique, New Caledonia, Switzerland, United Kingdom. The nutrition grade is Very good nutritional quality. The NOVA score is Unprocessed or minimally processed foods. It can be found in stores such as Carrefour, Leclerc, Auchan, Intermarché, Super U, E.Leclerc. ",
 'question': 'What is the common name of Eau de Source - Cristaline - 1,5\xa0L?',
 'answer': {'text': ['S

In [7]:
import pandas as pd
#Let us split the dataset into training, and validation
qac_dataset = split_dataset(qac_result, split_ratio=0.8)
training_list = qac_dataset["train"]
validation_list = qac_dataset["validation"]
#Convert to pandas Dataframe object so that we can use a default Dataset.from_pandas()
df_training = pd.DataFrame(training_list)
df_validation = pd.DataFrame(validation_list)
df_training

Unnamed: 0,context,question,answer,id,title
0,"Eau de Source - Cristaline - 1,5 L is commonly...",What is the common name of Eau de Source - Cri...,"{'text': ['Spring water'], 'answer_start': [56]}",1,Title 1
1,"Eau de Source - Cristaline - 1,5 L is commonly...",What are some ingredients in Eau de Source - C...,"{'text': ['water'], 'answer_start': [63]}",2,Title 2
2,"Eau de Source - Cristaline - 1,5 L is commonly...",What is the packaging of Eau de Source - Crist...,"{'text': ['Aluminium-can, HdpeFilm-packet, PpF...",3,Title 3
3,"Eau de Source - Cristaline - 1,5 L is commonly...",What is the brand of Eau de Source - Cristalin...,"{'text': ['Cristaline'], 'answer_start': [16]}",4,Title 4
4,Whole Wheat Chocolate Biscuits - Lu - 300 g is...,What is the common name of Whole Wheat Chocola...,"{'text': ['NA'], 'answer_start': [65]}",5,Title 5
...,...,...,...,...,...
315,Levure de bière - Gerblé - 150 g is commonly k...,What is the brand of Levure de bière - Gerblé ...,"{'text': ['Gerblé'], 'answer_start': [18]}",316,Title 316
316,Granola - LU - 200 g e is commonly known as Bi...,What is the common name of Granola - LU - 200 ...,{'text': ['Biscuits sablés nappés de chocolat ...,317,Title 317
317,Granola - LU - 200 g e is commonly known as Bi...,What are some ingredients in Granola - LU - 20...,"{'text': [',wheat flour 48%, milk chocolate 27...",318,Title 318
318,Granola - LU - 200 g e is commonly known as Bi...,What is the packaging of Granola - LU - 200 g e?,"{'text': ['fr:sachet plastique, fr:étui carton...",319,Title 319


In [8]:
dataset = convert_to_dataset_dict(df_training, df_validation)

In [9]:
#Based on the tutorial: https://huggingface.co/learn/nlp-course/en/chapter7/7 we'll use the bert-base-cased model
from transformers import AutoTokenizer

model_checkpoint = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

#Based on the tutorial we want to insert tokens create a sentence of this form: 
#[CLS] question [SEP] context [SEP]

## Begin Testing preproccess

Notice: Much of this code is to ensure that my scraped data follows the same format as given on the Hugging Face documentation

In [10]:
#Test tokenizer format
#See the format of splitting the context
context = dataset["train"][0]['context']
question = dataset["train"][0]['question']

inputs = tokenizer(question, 
                   context, 
                   max_length = 100, 
                   truncation="only_second", 
                   stride = 50, 
                   return_overflowing_tokens=True,)
for ids in inputs["input_ids"]:
    print(tokenizer.decode(ids))

[CLS] What is the common name of Eau de Source - Cristaline - 1, 5 L? [SEP] Eau de Source - Cristaline - 1, 5 L is commonly known as Spring water. The ingredients are water. The packaging includes Aluminium - can, HdpeFilm - packet, PpFilm - wrapper, Ldpe - film. The brand is Cristaline. It falls under the categories Beverages, [SEP]
[CLS] What is the common name of Eau de Source - Cristaline - 1, 5 L? [SEP] The packaging includes Aluminium - can, HdpeFilm - packet, PpFilm - wrapper, Ldpe - film. The brand is Cristaline. It falls under the categories Beverages, Waters, Spring waters. It has certifications like Triman. It originates from Embouteillée à 24610 Saint - [SEP]
[CLS] What is the common name of Eau de Source - Cristaline - 1, 5 L? [SEP], Ldpe - film. The brand is Cristaline. It falls under the categories Beverages, Waters, Spring waters. It has certifications like Triman. It originates from Embouteillée à 24610 Saint - Martin de Gurson France and the origin of ingredients is F

In [11]:
inputs = tokenizer(
    question,
    context,
    max_length=100,
    truncation="only_second",
    stride=50,
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
)
inputs.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'offset_mapping', 'overflow_to_sample_mapping'])

In [12]:
inputs = tokenizer(
    dataset["train"][2:6]["question"],
    dataset["train"][2:6]["context"],
    max_length=100,
    truncation="only_second",
    stride=50,
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
)

print(f"The 4 examples gave {len(inputs['input_ids'])} features.")
print(f"Here is where each comes from: {inputs['overflow_to_sample_mapping']}.")

The 4 examples gave 45 features.
Here is where each comes from: [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3].


In [13]:
answers = dataset["train"][2:6]["answer"]
start_positions = []
end_positions = []

for i, offset in enumerate(inputs["offset_mapping"]):
    sample_idx = inputs["overflow_to_sample_mapping"][i]
    answer = answers[sample_idx]
    start_char = answer["answer_start"][0]
    end_char = answer["answer_start"][0] + len(answer["text"][0])
    sequence_ids = inputs.sequence_ids(i)

    # Find the start and end of the context
    idx = 0
    while sequence_ids[idx] != 1:
        idx += 1
    context_start = idx
    while sequence_ids[idx] == 1:
        idx += 1
    context_end = idx - 1

    # If the answer is not fully inside the context, label is (0, 0)
    if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
        start_positions.append(0)
        end_positions.append(0)
    else:
        # Otherwise it's the start and end token positions
        idx = context_start
        while idx <= context_end and offset[idx][0] <= start_char:
            idx += 1
        start_positions.append(idx - 1)

        idx = context_end
        while idx >= context_start and offset[idx][1] >= end_char:
            idx -= 1
        end_positions.append(idx + 1)

start_positions, end_positions

([51,
  24,
  0,
  0,
  0,
  0,
  0,
  0,
  27,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  39,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0],
 [79,
  52,
  0,
  0,
  0,
  0,
  0,
  0,
  30,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  40,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0])

In [14]:
idx = 0
sample_idx = inputs["overflow_to_sample_mapping"][idx]
answer = answers[sample_idx]["text"][0]

start = start_positions[idx]
end = end_positions[idx]
labeled_answer = tokenizer.decode(inputs["input_ids"][idx][start : end + 1])

print(f"Theoretical answer: {answer}, labels give: {labeled_answer}")

Theoretical answer: Aluminium-can, HdpeFilm-packet, PpFilm-wrapper, Ldpe-film, labels give: Aluminium - can, HdpeFilm - packet, PpFilm - wrapper, Ldpe - film


In [15]:
idx = 4
sample_idx = inputs["overflow_to_sample_mapping"][idx]
answer = answers[sample_idx]["text"][0]

decoded_example = tokenizer.decode(inputs["input_ids"][idx])
print(f"Theoretical answer: {answer}, decoded example: {decoded_example}")

Theoretical answer: Aluminium-can, HdpeFilm-packet, PpFilm-wrapper, Ldpe-film, decoded example: [CLS] What is the packaging of Eau de Source - Cristaline - 1, 5 L? [SEP] and the origin of ingredients is France, fr : Saint - Martin de Gurson. It is manufactured in Saint - Martin de Gurson, France, 24610. It is sold in countries like Belgium, Côte d'Ivoire, France, Germany, Guadeloupe, Italy, Luxembourg, Mali, Martinique, New Caledonia, Switzerland, United Kingdom. [SEP]


## End Testing Preprocessing

## Further Preprocessing (To format the same as Hugging Face documentation)

In [16]:
#Function cited from: https://huggingface.co/learn/nlp-course/en/chapter7/7
#Processing the training data
max_length = 384
stride = 128


def preprocess_training_examples(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=max_length,
        truncation="only_second",
        stride=stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    sample_map = inputs.pop("overflow_to_sample_mapping")
    answers = examples["answer"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        sample_idx = sample_map[i]
        answer = answers[sample_idx]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label is (0, 0)
        if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs

In [17]:
train_dataset = dataset["train"].map(
    preprocess_training_examples,
    batched=True,
    remove_columns=dataset["train"].column_names,
)
len(dataset["train"]), len(train_dataset)

Map:   0%|          | 0/320 [00:00<?, ? examples/s]

(320, 392)

In [18]:
#Function cited directly from:https://huggingface.co/learn/nlp-course/en/chapter7/7
#Processing the validation Data
def preprocess_validation_examples(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=max_length,
        truncation="only_second",
        stride=stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    sample_map = inputs.pop("overflow_to_sample_mapping")
    example_ids = []

    for i in range(len(inputs["input_ids"])):
        sample_idx = sample_map[i]
        example_ids.append(examples["id"][sample_idx])

        sequence_ids = inputs.sequence_ids(i)
        offset = inputs["offset_mapping"][i]
        inputs["offset_mapping"][i] = [
            o if sequence_ids[k] == 1 else None for k, o in enumerate(offset)
        ]

    inputs["example_id"] = example_ids
    return inputs

In [19]:
validation_dataset = dataset["validation"].map(
    preprocess_validation_examples,
    batched=True,
    remove_columns=dataset["validation"].column_names,
)
len(dataset["validation"]), len(validation_dataset)

Map:   0%|          | 0/80 [00:00<?, ? examples/s]

(80, 116)

## Training & Fine-Tuning the Model

We use the bert-base-cased from HuggingFace, a pretrained model on English language. It is intended to be fined tuned on tasks like QA. 

Notice: The majority of this code is referenced from: https://huggingface.co/learn/nlp-course/en/chapter7/7 tutorial. On exactly how to evaluate a QA system 

In [20]:
small_eval_set = dataset["validation"].select(range(40))
trained_checkpoint = "distilbert-base-cased-distilled-squad"

tokenizer = AutoTokenizer.from_pretrained(trained_checkpoint)
eval_set = small_eval_set.map(
    preprocess_validation_examples,
    batched=True,
    remove_columns=dataset["validation"].column_names,
)

Map:   0%|          | 0/40 [00:00<?, ? examples/s]

In [21]:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [22]:
# import sys
# sys.path.append('/home/sebastiancsabry/miniconda3/lib/python3.12/site-packages')

In [23]:
import torch
from transformers import AutoModelForQuestionAnswering

eval_set_for_model = eval_set.remove_columns(["example_id", "offset_mapping"])
eval_set_for_model.set_format("torch")

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
batch = {k: eval_set_for_model[k].to(device) for k in eval_set_for_model.column_names}
trained_model = AutoModelForQuestionAnswering.from_pretrained(trained_checkpoint).to(
    device
)

with torch.no_grad():
    outputs = trained_model(**batch)

In [24]:
start_logits = outputs.start_logits.cpu().numpy()
end_logits = outputs.end_logits.cpu().numpy()

In [25]:
import collections

example_to_features = collections.defaultdict(list)
for idx, feature in enumerate(eval_set):
    example_to_features[feature["example_id"]].append(idx)

In [26]:
import numpy as np

n_best = 20
max_answer_length = 30
predicted_answers = []

for example in small_eval_set:
    example_id = example["id"]
    context = example["context"]
    answers = []

    for feature_index in example_to_features[example_id]:
        start_logit = start_logits[feature_index]
        end_logit = end_logits[feature_index]
        offsets = eval_set["offset_mapping"][feature_index]

        start_indexes = np.argsort(start_logit)[-1 : -n_best - 1 : -1].tolist()
        end_indexes = np.argsort(end_logit)[-1 : -n_best - 1 : -1].tolist()
        for start_index in start_indexes:
            for end_index in end_indexes:
                # Skip answers that are not fully in the context
                if offsets[start_index] is None or offsets[end_index] is None:
                    continue
                # Skip answers with a length that is either < 0 or > max_answer_length.
                if (
                    end_index < start_index
                    or end_index - start_index + 1 > max_answer_length
                ):
                    continue

                answers.append(
                    {
                        "text": context[offsets[start_index][0] : offsets[end_index][1]],
                        "logit_score": start_logit[start_index] + end_logit[end_index],
                    }
                )

    best_answer = max(answers, key=lambda x: x["logit_score"])
    predicted_answers.append({"id": example_id, "prediction_text": best_answer["text"]})

In [27]:
import evaluate

metric = evaluate.load("squad")

In [28]:
theoretical_answers = [
    {"id": ex["id"], "answers": ex["answer"]} for ex in small_eval_set
]

In [29]:
print(predicted_answers[7])
print(theoretical_answers[7])

{'id': '328', 'prediction_text': 'Sucre'}
{'id': '328', 'answers': {'answer_start': [0], 'text': ['Sunny Via']}}


Notice here, with just a test on the predicted and theoretical answers we have a relatively low score

In [30]:
metric.compute(predictions=predicted_answers, references=theoretical_answers)

{'exact_match': 40.0, 'f1': 51.537402104080776}

We will assess the performance of the model with the use of this function compute_metrics; using beginning and end logits of each answer, then use the metric.compute() function to asses the score = (start_logit + end_logit). Metric.compute() uses the "Squad" metric, which is used for evaluating Question Answering models.

In [31]:
from tqdm.auto import tqdm


def compute_metrics(start_logits, end_logits, features, examples):
    example_to_features = collections.defaultdict(list)
    for idx, feature in enumerate(features):
        example_to_features[feature["example_id"]].append(idx)

    predicted_answers = []
    for example in tqdm(examples):
        example_id = example["id"]
        context = example["context"]
        answers = []

        # Loop through all features associated with that example
        for feature_index in example_to_features[example_id]:
            start_logit = start_logits[feature_index]
            end_logit = end_logits[feature_index]
            offsets = features[feature_index]["offset_mapping"]

            start_indexes = np.argsort(start_logit)[-1 : -n_best - 1 : -1].tolist()
            end_indexes = np.argsort(end_logit)[-1 : -n_best - 1 : -1].tolist()
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # Skip answers that are not fully in the context
                    if offsets[start_index] is None or offsets[end_index] is None:
                        continue
                    # Skip answers with a length that is either < 0 or > max_answer_length
                    if (
                        end_index < start_index
                        or end_index - start_index + 1 > max_answer_length
                    ):
                        continue

                    answer = {
                        "text": context[offsets[start_index][0] : offsets[end_index][1]],
                        "logit_score": start_logit[start_index] + end_logit[end_index],
                    }
                    answers.append(answer)

        # Select the answer with the best score
        if len(answers) > 0:
            best_answer = max(answers, key=lambda x: x["logit_score"])
            predicted_answers.append(
                {"id": example_id, "prediction_text": best_answer["text"]}
            )
        else:
            predicted_answers.append({"id": example_id, "prediction_text": ""})

    theoretical_answers = [{"id": ex["id"], "answers": ex["answer"]} for ex in examples]
    return metric.compute(predictions=predicted_answers, references=theoretical_answers) #Squad evaluation here

In [32]:
compute_metrics(start_logits, end_logits, eval_set, small_eval_set)

  0%|          | 0/40 [00:00<?, ?it/s]

{'exact_match': 40.0, 'f1': 51.537402104080776}

Again, above we see that without fine-tuning we have a relatively low score of 40.0! We'll see if some fine-tuning mediates this score.

## Fine-Tuning 

In [33]:
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [34]:
#Logging in here as part of the tutorial
from huggingface_hub import notebook_login

notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

Here I use the parameters from the tutorial.

In [35]:
from transformers import TrainingArguments

args = TrainingArguments(
    "bert-finetuned-squad",
    evaluation_strategy="no",
    save_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01,
    fp16=True,
    push_to_hub=True,
)

In [36]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=validation_dataset,
    tokenizer=tokenizer,
)
trainer.train()

dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)


Step,Training Loss


KeyboardInterrupt: 

In [37]:
predictions, _, _ = trainer.predict(validation_dataset)
start_logits, end_logits = predictions
compute_metrics(start_logits, end_logits, validation_dataset, dataset["validation"])

  0%|          | 0/80 [00:00<?, ?it/s]

{'exact_match': 65.0, 'f1': 66.8044820331491}

Above, we see that our exact match score is much higher than 40.0, with the trainer. Next we'll use an optimizer to push this score further.

We do some hyperparameter tuning, beyond the default arguments

This code very closely follows: https://huggingface.co/learn/nlp-course/en/chapter7/7 Section on Custom Training loops

In [38]:
from torch.utils.data import DataLoader
from transformers import default_data_collator

train_dataset.set_format("torch")
validation_set = validation_dataset.remove_columns(["example_id", "offset_mapping"])
validation_set.set_format("torch")

train_dataloader = DataLoader(
    train_dataset,
    shuffle=True,
    collate_fn=default_data_collator,
    batch_size=8,
)
eval_dataloader = DataLoader(
    validation_set, collate_fn=default_data_collator, batch_size=8
)

In [39]:
#We reset the model to the pre-training version
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


We employ the use of AdamW to optimize our models parameters. AdamW is an optimizer with weight decay, using gradient accumulation. Here adamW gradually optimizes the model by minimizing the loss-function. 

In [40]:
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=2e-5)

In [41]:
from accelerate import Accelerator

accelerator = Accelerator('fp16')
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

In [42]:
from transformers import get_scheduler

num_train_epochs = 5
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

In [43]:
#This code block is used to simply save the fine-tuned model for later use
from huggingface_hub import Repository, get_full_repo_name

model_name = "bert-finetuned-squad"
repo_name = get_full_repo_name(model_name)
repo_name

'TheUnknot/bert-finetuned-squad'

In [44]:
#This code block is used to simply save the fine-tuned model for later use
output_dir = "bert-finetuned-squad-accelerate"
repo = Repository(output_dir, clone_from=repo_name)

For more details, please read https://huggingface.co/docs/huggingface_hub/concepts/git_vs_http.
/home/sebastiancsabry/Documents/Projects/RAGvsFine/bert-finetuned-squad-accelerate is already a clone of https://huggingface.co/TheUnknot/bert-finetuned-squad. Make sure you pull the latest changes with `repo.git_pull()`.


In [45]:
from tqdm.auto import tqdm
import torch

progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_train_epochs):
    # Training
    model.train()
    for step, batch in enumerate(train_dataloader):
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

    # Evaluation
    model.eval()
    start_logits = []
    end_logits = []
    accelerator.print("Evaluation!")
    for batch in tqdm(eval_dataloader):
        with torch.no_grad():
            outputs = model(**batch)

        start_logits.append(accelerator.gather(outputs.start_logits).cpu().numpy())
        end_logits.append(accelerator.gather(outputs.end_logits).cpu().numpy())

    start_logits = np.concatenate(start_logits)
    end_logits = np.concatenate(end_logits)
    start_logits = start_logits[: len(validation_dataset)]
    end_logits = end_logits[: len(validation_dataset)]

    metrics = compute_metrics(
        start_logits, end_logits, validation_dataset, dataset["validation"]
    )
    print(f"epoch {epoch}:", metrics)

    # Save and upload
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)
    if accelerator.is_main_process:
        tokenizer.save_pretrained(output_dir)
        repo.push_to_hub(
            commit_message=f"Training in progress epoch {epoch}", blocking=False
        )

  0%|          | 0/245 [00:00<?, ?it/s]

Evaluation!


  0%|          | 0/15 [00:00<?, ?it/s]

  0%|          | 0/80 [00:00<?, ?it/s]

epoch 0: {'exact_match': 33.75, 'f1': 37.28439968076291}


Several commits (8) will be pushed upstream.


Evaluation!


  0%|          | 0/15 [00:00<?, ?it/s]

  0%|          | 0/80 [00:00<?, ?it/s]

epoch 1: {'exact_match': 71.25, 'f1': 74.64352804884267}


Several commits (9) will be pushed upstream.


Evaluation!


  0%|          | 0/15 [00:00<?, ?it/s]

  0%|          | 0/80 [00:00<?, ?it/s]

epoch 2: {'exact_match': 73.75, 'f1': 74.04166666666666}


Several commits (10) will be pushed upstream.


Evaluation!


  0%|          | 0/15 [00:00<?, ?it/s]

  0%|          | 0/80 [00:00<?, ?it/s]

epoch 3: {'exact_match': 72.5, 'f1': 72.91666666666667}


Several commits (11) will be pushed upstream.


Evaluation!


  0%|          | 0/15 [00:00<?, ?it/s]

  0%|          | 0/80 [00:00<?, ?it/s]

epoch 4: {'exact_match': 71.25, 'f1': 72.01522435897436}


Several commits (12) will be pushed upstream.
'(MaxRetryError("HTTPSConnectionPool(host='hf-hub-lfs-us-east-1.s3-accelerate.amazonaws.com', port=443): Max retries exceeded with url: /repos/9e/ba/9eba72b93301803727f5e155402ce06b51910d672d7f30d56eb4b9c74431e8a7/df538ddb3f76d66a4378f4bf4af1eda7d91f9e58aa3dffa04e31c45b5f656f1e?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=AKIA2JU7TKAQFN2FTF47%2F20240328%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20240328T021953Z&X-Amz-Expires=86400&X-Amz-Signature=3b0d13fa1820f534b9742467d785966610bd400e5bdbc66be32aea39d051e5ac&X-Amz-SignedHeaders=host&partNumber=26&uploadId=Crm6B9kfJYqjrRTLD5XyO57MXO7vezAkq87JlUoTNsiYr2KwVx_djsulUeboAfQ6NRwdXRkU7vnYHHVXW_RcYZv3KtUCe1mdB3x4AEXF35tBTMp2XiJasGQx1zPfAydJ&x-id=UploadPart (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:2406)')))"), '(Request ID: 3228edee-a000-44cb-8b2a-bcd7d5c8017a)')' thrown while requesting PUT https://hf-hub-lfs-us-e

Look at that, our average squad score is 64.5 is significantly better than the initial test (without fine-tuning) of 40. (I'm unsure as to why our initial epoch provides a 33.75 score, but as we increase the number of epochs the score tends to converge to around a score of 70). Note: There is a possibility of improvement through changing the parameters of our optimizer AdamW. 