In [None]:
! pip install datasets transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.7.1-py3-none-any.whl (451 kB)
Collecting transformers
  Downloading transformers-4.24.0-py3-none-any.whl (5.5 MB)
Collecting xxhash
  Downloading xxhash-3.1.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Collecting huggingface-hub<1.0.0,>=0.2.0
  Downloading huggingface_hub-0.11.0-py3-none-any.whl (182 kB)
Collecting multiprocess
  Downloading multiprocess-0.70.14-py37-none-any.whl (115 kB)
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting urllib3!=1.25.0,!=1.25.1

In [None]:
from google.colab import output
output.enable_custom_widget_manager()

In [None]:
from google.colab import output
output.disable_custom_widget_manager()

If you're opening this notebook locally, make sure your environment has the latest versions of those libraries installed.

To be able to share your model with the community and generate results via the inference API, there are a few more steps to follow.

First, store your authentication token from the Hugging Face website (sign up [here](https://huggingface.co/join) if you haven't already!), then execute the following cell and enter your credentials when prompted:
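
A minimal sketch of such a login cell, assuming the `notebook_login` helper from the `huggingface_hub` package (pulled in above as a dependency). The wrapper name `login_to_hub` is hypothetical, and the import is deferred so that defining the function does not require the package to be present.

```python
def login_to_hub():
    """Open the Hugging Face login prompt inside the notebook.

    Assumption: `huggingface_hub` is installed and provides `notebook_login`.
    """
    # Deferred import: only needed when the login is actually triggered.
    from huggingface_hub import notebook_login

    notebook_login()
```

Calling `login_to_hub()` in a cell should display the prompt; nothing happens until it is called.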

Then you need to install Git-LFS by running the following cell:

In [None]:
!apt install git-lfs

Reading package lists... Done
Building dependency tree       
Reading state information... Done
git-lfs is already the newest version (2.3.4-1).
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'apt autoremove' to remove it.
0 upgraded, 0 newly installed, 0 to remove and 5 not upgraded.


Make sure your version of Transformers is at least 4.11.0 since the functionality was introduced in that version:

In [None]:
import transformers

print(transformers.__version__)

4.24.0
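
The same check can be done programmatically rather than by eye. The sketch below is only an illustration: `version_at_least` is a hypothetical helper, not part of Transformers. It compares the numeric components of two version strings, because a plain string comparison would wrongly rank `"4.9.2"` above `"4.11.0"`.

```python
import re


def version_at_least(installed: str, required: str) -> bool:
    """Return True if `installed` is numerically >= `required`.

    Only the first three numeric components are compared, so suffixes
    such as '.dev0' are ignored (a simplifying assumption).
    """
    def parse(version: str) -> tuple:
        return tuple(int(part) for part in re.findall(r"\d+", version)[:3])

    return parse(installed) >= parse(required)


print(version_at_least("4.24.0", "4.11.0"))  # True
print(version_at_least("4.9.2", "4.11.0"))   # False
```

In the notebook, this could be called as `version_at_least(transformers.__version__, "4.11.0")`.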


You can find a script version of this notebook to fine-tune your model in a distributed fashion using multiple GPUs or TPUs [here](https://github.com/huggingface/transformers/tree/master/examples/question-answering).

## Loading the dataset

We will use the [🤗 Datasets](https://github.com/huggingface/datasets) library to download the data and get the metric we need to use for evaluation (to compare our model to the benchmark). This can be easily done with the functions `load_dataset` and `load_metric`.  

In [None]:
from datasets import load_dataset, load_metric

For our example here, we'll use the [SQuAD dataset](https://rajpurkar.github.io/SQuAD-explorer/). The notebook should work with any question answering dataset provided by the 🤗 Datasets library. If you're using your own dataset defined from a JSON or CSV file (see the [Datasets documentation](https://huggingface.co/docs/datasets/loading_datasets.html#from-local-files) on how to load them), it might need some adjustments in the names of the columns used.

In [None]:
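# Set to True to load SQuAD v2 (which also contains unanswerable questions) instead of SQuAD v1.1
squad_v2 = False
datasets = load_dataset("squad_v2" if squad_v2 else "squad")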

Downloading and preparing dataset squad/plain_text to /root/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453...


Dataset squad downloaded and prepared to /root/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453. Subsequent calls will reuse this data.



The `datasets` object itself is a [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict), which contains one key per split (here, the training and validation sets).

In [None]:
datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

We can see the training and validation sets both have a column for the context, the question and the answers to those questions.

To access an actual element, you need to select a split first, then give an index:

In [None]:
datasets["train"][0]

{'id': '5733be284776f41900661182',
 'title': 'University_of_Notre_Dame',
 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
 'answers': {'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}}

We can see the answers are indicated by their start position in the text (here at character 515) and their full text, which is a substring of the context as we mentioned above.
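
As a quick sanity check of that relationship, the sketch below verifies that slicing the context at `answer_start` reproduces each answer text. The helper name `answer_matches_context` and the toy record are illustrative assumptions; the record only mimics the `context`/`answers` layout shown above.

```python
def answer_matches_context(example: dict) -> bool:
    """Check that every answer text is found in the context at its stated offset."""
    context = example["context"]
    answers = example["answers"]
    return all(
        context[start:start + len(text)] == text
        for text, start in zip(answers["text"], answers["answer_start"])
    )


# Toy record with the same layout as a SQuAD example (not taken from the dataset).
toy_context = "The Grotto is a replica of the grotto at Lourdes, France."
toy_example = {
    "context": toy_context,
    "answers": {
        "text": ["Lourdes"],
        "answer_start": [toy_context.index("Lourdes")],
    },
}

print(answer_matches_context(toy_example))  # True
```

The same function can be applied to a real record, for instance `answer_matches_context(datasets["train"][0])`.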

To get a sense of what the data looks like, the following function will show some examples picked randomly in the dataset (automatically decoding the labels in passing).

In [None]:
from datasets import ClassLabel, Sequence
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    # Draw distinct indices without replacement
    picks = random.sample(range(len(dataset)), num_examples)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
        elif isinstance(typ, Sequence) and isinstance(typ.feature, ClassLabel):
            df[column] = df[column].transform(lambda x: [typ.feature.names[i] for i in x])
    display(HTML(df.to_html()))

In [None]:
show_random_elements(datasets["train"])

Unnamed: 0,id,title,context,question,answers
0,5726d791f1498d1400e8ecc4,Copyright_infringement,"Early court cases focused on the liability of Internet service providers (ISPs) for hosting, transmitting or publishing user-supplied content that could be actioned under civil or criminal law, such as libel, defamation, or pornography. As different content was considered in different legal systems, and in the absence of common definitions for ""ISPs,"" ""bulletin boards"" or ""online publishers,"" early law on online intermediaries' liability varied widely from country to country. The first laws on online intermediaries' liability were passed from the mid-1990s onwards.[citation needed]",What was the result of early law on online liability?,"{'text': ['varied widely from country to country'], 'answer_start': [442]}"
1,56df7e2356340a1900b29c2c,Oklahoma_City,"Although technically not a university, the FAA's Mike Monroney Aeronautical Center has many aspects of an institution of higher learning. Its FAA Academy is accredited by the North Central Association of Colleges and Schools. Its Civil Aerospace Medical Institute (CAMI) has a medical education division responsible for aeromedical education in general as well as the education of aviation medical examiners in the U.S. and 93 other countries. In addition, The National Academy of Science offers Research Associateship Programs for fellowship and other grants for CAMI research.",What institution is accredited by the North Central Association of Colleges and Schools?,"{'text': ['Mike Monroney Aeronautical Center'], 'answer_start': [49]}"
2,57324681e99e3014001e6626,Jehovah%27s_Witnesses,"Baptism is a requirement for being considered a member of Jehovah's Witnesses. Jehovah's Witnesses do not practice infant baptism, and previous baptisms performed by other denominations are not considered valid. Individuals undergoing baptism must affirm publicly that dedication and baptism identify them ""as one of Jehovah's Witnesses in association with God's spirit-directed organization,"" though Witness publications say baptism symbolizes personal dedication to God and not ""to a man, work or organization."" Their literature emphasizes the need for members to be obedient and loyal to Jehovah and to ""his organization,""[note 2] stating that individuals must remain part of it to receive God's favor and to survive Armageddon.",What do Witness publications say baptism symbolizes a person's personal dedication to?,"{'text': ['God'], 'answer_start': [468]}"
3,56d1e8c3e7d4791d00902464,Buddhism,Sentient beings always suffer throughout saṃsāra until they free themselves from this suffering (dukkha) by attaining Nirvana. Then the absence of the first Nidāna—ignorance—leads to the absence of the others.,What is suffering also called?,"{'text': ['dukkha'], 'answer_start': [97]}"
4,57311df705b4da19006bcdb0,Kievan_Rus%27,"In 941, Igor led another major Rus' attack on Constantinople, probably over trading rights again. A navy of 10,000 vessels, including Pecheneg allies, landed on the Bithynian coast and devastated the Asiatic shore of the Bosphorus. The attack was well-timed, perhaps due to intelligence, as the Byzantine fleet was occupied with the Arabs in the Mediterranean, and the bulk of its army was stationed in the east. The Rus’ burned towns, churches, and monasteries, butchering the people and amassing booty. The emperor arranged for a small group of retired ships to be outfitted with Greek fire throwers and sent them out to meet the Rus’, luring them into surrounding the contingent before unleashing the Greek fire. Liutprand of Cremona wrote that ""the Rus', seeing the flames, jumped overboard, preferring water to fire. Some sank, weighed down by the weight of their breastplates and helmets; others caught fire."" Those captured were beheaded. The ploy dispelled the Rus’ fleet, but their attacks continued into the hinterland as far as Nicomedia, with many atrocities reported as victims were crucified and set up for use as targets. At last a Byzantine army arrived from the Balkans to drive the Rus' back, and a naval contingent reportedly destroyed much of the Rus' fleet on its return voyage (possibly an exaggeration since the Rus' soon mounted another attack). The outcome indicates increased military might by Byzantium since 911, suggesting a shift in the balance of power.",What year did Igot led a Rus attack on Constantinople?,"{'text': ['941'], 'answer_start': [3]}"
5,570652ab75f01819005e7b47,Black_people,"In early 1991, non-Arabs of the Zaghawa tribe of Sudan attested that they were victims of an intensifying Arab apartheid campaign, segregating Arabs and non-Arabs (specifically people of sub-Saharan African descent). Sudanese Arabs, who controlled the government, were widely referred to as practicing apartheid against Sudan's non-Arab citizens. The government was accused of ""deftly manipulat(ing) Arab solidarity"" to carry out policies of apartheid and ethnic cleansing.",Who felt persecuted due to the apartheid?,"{'text': ['non-Arabs of the Zaghawa tribe'], 'answer_start': [15]}"
6,5725f56a89a1e219009ac104,Arsenal_F.C.,"Arsenal's home colours have been the inspiration for at least three other clubs. In 1909, Sparta Prague adopted a dark red kit like the one Arsenal wore at the time; in 1938, Hibernian adopted the design of the Arsenal shirt sleeves in their own green and white strip. In 1920, Sporting Clube de Braga's manager returned from a game at Highbury and changed his team's green kit to a duplicate of Arsenal's red with white sleeves and shorts, giving rise to the team's nickname of Os Arsenalistas. These teams still wear those designs to this day.",What early team copied the Arsenal's red current color in 1909?,"{'text': ['Sparta Prague'], 'answer_start': [90]}"
7,571a6c094faf5e1900b8a999,Ashkenazi_Jews,"Genetic studies on Ashkenazim have been conducted to determine how much of their ancestry comes from the Levant, and how much derives from European populations. These studies—researching both their paternal and maternal lineages—point to a significant prevalence of ancient Levantine origins. But they have arrived at diverging conclusions regarding both the degree and the sources of their European ancestry. These diverging conclusions focus particularly on the extent of the European genetic origin observed in Ashkenazi maternal lineages.",Have studies on the genetics of the Ashkenazim come to similar or divergent conclusions regarding the degree and sources of their European ancestry?,"{'text': ['they have arrived at diverging conclusions'], 'answer_start': [297]}"
8,56e1a815e3433e1400423088,Hydrogen,"Hydrogen poses a number of hazards to human safety, from potential detonations and fires when mixed with air to being an asphyxiant in its pure, oxygen-free form. In addition, liquid hydrogen is a cryogen and presents dangers (such as frostbite) associated with very cold liquids. Hydrogen dissolves in many metals, and, in addition to leaking out, may have adverse effects on them, such as hydrogen embrittlement, leading to cracks and explosions. Hydrogen gas leaking into external air may spontaneously ignite. Moreover, hydrogen fire, while being extremely hot, is almost invisible, and thus can lead to accidental burns.",What can hydrogen embrittlement lead to?,"{'text': ['cracks and explosions'], 'answer_start': [426]}"
9,570c774db3d812140066d1ff,FC_Barcelona,"Barcelona won the treble in the 2014–2015 season, winning La Liga, Copa del Rey and UEFA Champions League titles, and became the first European team to have won the treble twice. On 17 May, the club clinched their 23rd La Liga title after defeating Atlético Madrid. This was Barcelona's seventh La Liga title in the last ten years. On 30 May, the club defeated Athletic Bilbao in the Copa del Rey final at Camp Nou. On 6 June, Barcelona won the UEFA Champions League final with a 3–1 win against Juventus, which completed the treble, the club's second in 6 years. Barcelona's attacking trio of Messi, Suárez and Neymar, dubbed MSN, scored 122 goals in all competitions, the most in a season for an attacking trio in Spanish football history.",What team has won the treble competitions twice?,"{'text': ['Barcelona'], 'answer_start': [0]}"


In [41]:
# Convert the first 500 examples of each split into a list of dictionaries
# in prompt/completion form, as required by OpenAI, and save each list as JSON.

import json

def to_prompt_completion(example):
    # The prompt groups the title, context and question; the completion holds the answer texts.
    return {
        'prompt': {'title': example['title'], 'context': example['context'], 'question': example['question']},
        'completion': example['answers']['text'],
    }

example_train = [to_prompt_completion(datasets["train"][i]) for i in range(500)]
with open("example_train.json", 'w') as f:
    json.dump(example_train, f)

example_test = [to_prompt_completion(datasets["validation"][i]) for i in range(500)]
with open("example_validation.json", 'w') as f:
    json.dump(example_test, f)
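
The files above store each prompt as a nested dictionary and each completion as a list. If the fine-tuning tool expects JSON Lines with a plain-string `prompt` and `completion` per line (an assumption about the target format, to be checked against the OpenAI documentation), a conversion could look like the sketch below. The helper names, the prompt template and the toy records are all hypothetical.

```python
import json


def to_text_pair(record: dict) -> dict:
    """Flatten one SQuAD-style record into string prompt/completion fields."""
    prompt = (
        f"Title: {record['title']}\n"
        f"Context: {record['context']}\n"
        f"Question: {record['question']}\n"
        "Answer:"
    )
    answers = record["answers"]["text"]
    # Keep the first reference answer; use an empty string if there is none.
    completion = answers[0] if answers else ""
    return {"prompt": prompt, "completion": completion}


def records_to_jsonl(records) -> str:
    """Serialize records as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(to_text_pair(r)) + "\n" for r in records)


# Toy records with the same layout as SQuAD examples (not taken from the dataset).
toy_records = [
    {
        "title": "Toy",
        "context": "The Grotto is a replica of the grotto at Lourdes, France.",
        "question": "Where is the original grotto?",
        "answers": {"text": ["Lourdes"], "answer_start": [41]},
    },
    {
        "title": "Toy",
        "context": "No answer is given here.",
        "question": "Who built it?",
        "answers": {"text": [], "answer_start": []},
    },
]

jsonl_text = records_to_jsonl(toy_records)
print(jsonl_text)
```

The resulting string can be written to a `.jsonl` file with a regular `open(..., "w")` call. Only the first answer is kept because a single string completion cannot hold several references.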

