In [9]:
! pip install datasets transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [10]:
from google.colab import output
output.enable_custom_widget_manager()

In [11]:
from google.colab import output
output.disable_custom_widget_manager()

If you're opening this notebook locally, make sure your environment has the latest version of those libraries installed.

To be able to share your model with the community and generate results like the one shown in the picture below via the inference API, there are a few more steps to follow.

First you have to store your authentication token from the Hugging Face website (sign up [here](https://huggingface.co/join) if you haven't already!) then execute the following cell and input your username and password:

Then you need to install Git-LFS by running the following cell:

In [12]:
 !apt install git-lfs 

Reading package lists... Done
Building dependency tree       
Reading state information... Done
git-lfs is already the newest version (2.3.4-1).
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'apt autoremove' to remove it.
0 upgraded, 0 newly installed, 0 to remove and 5 not upgraded.


Make sure your version of Transformers is at least 4.11.0 since the functionality was introduced in that version:

In [13]:
import transformers

print(transformers.__version__)

4.24.0
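
Reading the printed version by eye works, but comparing version strings directly can mislead (as plain strings, `"4.9.2"` sorts after `"4.11.0"`). The following is a minimal sketch of a numeric comparison using only the standard library; `version_tuple` and `meets_minimum` are hypothetical helper names, not part of Transformers, and suffixes such as `.dev0` are simply ignored.

```python
import re

MIN_VERSION = "4.11.0"  # minimum version stated in the text above


def version_tuple(version_string):
    """Turn '4.24.0' (or '4.25.0.dev0') into a comparable tuple such as (4, 24, 0)."""
    parts = []
    for piece in version_string.split(".")[:3]:
        match = re.match(r"\d+", piece)  # keep only the leading digits of each piece
        parts.append(int(match.group()) if match else 0)
    return tuple(parts)


def meets_minimum(installed, minimum=MIN_VERSION):
    """Return True when the installed version is at least the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


print(meets_minimum("4.24.0"))  # True: the version printed above
print(meets_minimum("4.9.2"))   # False, although "4.9.2" > "4.11.0" as strings
```

In the notebook itself, the check would be `meets_minimum(transformers.__version__)`.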


You can find a script version of this notebook to fine-tune your model in a distributed fashion using multiple GPUs or TPUs [here](https://github.com/huggingface/transformers/tree/master/examples/question-answering).

## Loading the dataset

We will use the [🤗 Datasets](https://github.com/huggingface/datasets) library to download the data and get the metric we need to use for evaluation (to compare our model to the benchmark). This can be easily done with the functions `load_dataset` and `load_metric`.  

In [14]:
from datasets import load_dataset, load_metric

For our example here, we'll use the [SQuAD dataset](https://rajpurkar.github.io/SQuAD-explorer/). The notebook should work with any question answering dataset provided by the 🤗 Datasets library. If you're using your own dataset defined from a JSON or CSV file (see the [Datasets documentation](https://huggingface.co/docs/datasets/loading_datasets.html#from-local-files) on how to load them), you might need to adjust the names of the columns used.

In [17]:
squad_v2 = True
datasets = load_dataset("squad_v2" if squad_v2 else "squad")




The `datasets` object itself is a [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict), which contains one key per split (here, the training and validation sets).

In [18]:
datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 130319
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 11873
    })
})

We can see the training and validation sets both have a column for the context, the question and the answers to those questions.

To access an actual element, you need to select a split first, then give an index:

In [19]:
datasets["train"][0]

{'id': '56be85543aeaaa14008c9063',
 'title': 'Beyoncé',
 'context': 'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child. Managed by her father, Mathew Knowles, the group became one of the world\'s best-selling girl groups of all time. Their hiatus saw the release of Beyoncé\'s debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".',
 'question': 'When did Beyonce start becoming popular?',
 'answers': {'text': ['in the late 1990s'], 'answer_start': [269]}}

We can see the answers are indicated by their start position in the text (here at character 269) and their full text, which is a substring of the context as we mentioned above.
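
The relationship between `answer_start` and `text` can be checked by slicing the context at the stated offset. The snippet below is a small sketch on a made-up example (the toy context and the helper name `answers_align` are illustrative assumptions, not part of 🤗 Datasets); it uses the same field names as the SQuAD records shown above.

```python
def answers_align(example):
    """Return True if every answer text sits at its stated character offset."""
    context = example["context"]
    texts = example["answers"]["text"]
    starts = example["answers"]["answer_start"]
    # An unanswerable question has two empty lists, so the check passes trivially.
    return all(context[start:start + len(text)] == text
               for text, start in zip(texts, starts))


# Toy record with the same layout as a SQuAD example (not taken from the dataset).
toy_context = "She rose to fame in the late 1990s as a lead singer."
toy_answer = "in the late 1990s"
toy_example = {
    "context": toy_context,
    "answers": {"text": [toy_answer],
                "answer_start": [toy_context.index(toy_answer)]},
}

print(answers_align(toy_example))  # True
```

The same function can be called on `datasets["train"][0]` to confirm that the answer really is a substring of the context at the given position.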

To get a sense of what the data looks like, the following function will show some examples picked randomly in the dataset (automatically decoding the labels in passing).

In [20]:
from datasets import ClassLabel, Sequence
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
        elif isinstance(typ, Sequence) and isinstance(typ.feature, ClassLabel):
            df[column] = df[column].transform(lambda x: [typ.feature.names[i] for i in x])
    display(HTML(df.to_html()))
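
The loop at the top of `show_random_elements` keeps drawing indices until they are all distinct. As an alternative sketch, the standard library's `random.sample` draws distinct values in a single call; `pick_unique_indices` is a hypothetical helper name used only for this illustration.

```python
import random


def pick_unique_indices(dataset_length, num_examples=10):
    """Draw num_examples distinct indices from range(dataset_length)."""
    if num_examples > dataset_length:
        raise ValueError("Can't pick more elements than there are in the dataset.")
    # random.sample selects without replacement, so no duplicate check is needed.
    return random.sample(range(dataset_length), num_examples)


picks = pick_unique_indices(130319, 10)  # 130319 is the size of the train split above
print(len(picks), len(set(picks)))
```

The resulting list can be used exactly like `picks` in the function above, for example `dataset[picks]`.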

In [21]:
show_random_elements(datasets["train"])

Unnamed: 0,id,title,context,question,answers
0,5ad28d6fd7d075001a429a2a,House_music,"Atkins, a former member of Cybotron, released Model 500 ""No UFOs"" in 1985, which became a regional hit, followed by dozens of tracks on Transmat, Metroplex and Fragile. One of the most unusual was ""Strings of Life"" by Derrick May, a darker, more intellectual strain of house. ""Techno-Scratch"" was released by the Knights Of The Turntable in 1984 which had a similar techno sound to Cybotron. The manager of the Factory nightclub and co-owner of the Haçienda, Tony Wilson, also promoted acid house culture on his weekly TV show. The Midlands also embraced the late 1980s house scene with illegal parties and more legal dance clubs such as The Hummingbird.",May was a former member of what music group?,"{'text': [], 'answer_start': []}"
1,5a7e1ad270df9f001a8754b7,Translation,"Arabic translation efforts and techniques are important to Western translation traditions due to centuries of close contacts and exchanges. Especially after the Renaissance, Europeans began more intensive study of Arabic and Persian translations of classical works as well as scientific and philosophical works of Arab and oriental origins. Arabic and, to a lesser degree, Persian became important sources of material and perhaps of techniques for revitalized Western traditions, which in time would overtake the Islamic and oriental traditions.",Why are Arabic translation efforts pointless to Western translation traditions?,"{'text': [], 'answer_start': []}"
2,5710a1c4b654c5140001f9e9,Age_of_Enlightenment,"The word ""public"" implies the highest level of inclusivity – the public sphere by definition should be open to all. However, this sphere was only public to relative degrees. Enlightenment thinkers frequently contrasted their conception of the ""public"" with that of the people: Condorcet contrasted ""opinion"" with populace, Marmontel ""the opinion of men of letters"" with ""the opinion of the multitude,"" and d'Alembert the ""truly enlightened public"" with ""the blind and noisy multitude"". Additionally, most institutions of the public sphere excluded both women and the lower classes. Cross-class influences occurred through noble and lower class participation in areas such as the coffeehouses and the Masonic lodges.",Whixh population groups were excluded from most institutions of the public sphere?,"{'text': ['women and the lower classes'], 'answer_start': [553]}"
3,5ad0161a77cf76001a686950,Kievan_Rus%27,"In the north, the Republic of Novgorod prospered because it controlled trade routes from the River Volga to the Baltic Sea. As Kievan Rus' declined, Novgorod became more independent. A local oligarchy ruled Novgorod; major government decisions were made by a town assembly, which also elected a prince as the city's military leader. In the 12th century, Novgorod acquired its own archbishop Ilya in 1169, a sign of increased importance and political independence, while about 30 years prior to that in 1136 in Novgorod was established a republican form of government - elective monarchy. Since then Novgorod enjoyed a wide degree of autonomy although being closely associated with the Kievan Rus.",What year did Novgorod assign its own archbishop?,"{'text': [], 'answer_start': []}"
4,572689745951b619008f7631,Presbyterianism,"The biggest Presbyterian church is the National Presbyterian Church in Mexico (Iglesia Nacional Presbiteriana de México), which has around 2,500,000 members and associates and 3000 congregations, but there are other small denominations like the Associate Reformed Presbyterian Church in Mexico which was founded in 1875 by the Associate Reformed Church in North America. The Independent Presbyterian Church and the Presbyterian Reformed Church in Mexico, the National Conservative Presbyterian Church in Mexico are existing churches in the Reformed tradition.",When was the Associate Reformed Presbyterian Church in Mexico formed?,"{'text': ['1875'], 'answer_start': [315]}"
5,570d6008fed7b91900d45f6b,Valencia,"The inevitable march to civil war and the combat in Madrid resulted in the removal of the capital of the Republic to Valencia. On 6 November 1936 the city became the capital of Republican Spain under the control of the prime minister Manuel Azana; the government moved to the Palau de Benicarló, its ministries occupying various other buildings. The city was heavily bombarded by air and sea, necessitating the construction of over two hundred bomb shelters to protect the population. On 13 January 1937 the city was first shelled by a vessel of the Fascist Italian Navy, which was blockading the port by the order of Benito Mussolini. The bombardment intensified and inflicted massive destruction on several occasions; by the end of the war the city had survived 442 bombardments, leaving 2,831 dead and 847 wounded, although it is estimated that the death toll was higher, as the data given are those recognised by Francisco Franco's government. The Republican government passed to Juan Negrín on 17 May 1937 and on 31 October of that year moved to Barcelona. On 30 March 1939 Valencia surrendered and the Nationalist troops entered the city. The postwar years were a time of hardship for Valencians. During Franco's regime speaking or teaching Valencian was prohibited; in a significant reversal it is now compulsory for every schoolchild in Valencia.",Where did the government relocate to in 1937?,"{'text': ['Barcelona'], 'answer_start': [1051]}"
6,56f83269aef2371900625eed,Szlachta,"One of the most important victories of the magnates was the late 16th century right to create ordynacja's (similar to majorats), which ensured that a family which gained wealth and power could more easily preserve this. Ordynacje's of families of Radziwiłł, Zamoyski, Potocki or Lubomirski often rivalled the estates of the king and were important power bases for the magnates.",The right to create ordynacja's was important to what group?,"{'text': ['magnates'], 'answer_start': [43]}"
7,5acfd2ec77cf76001a6861fc,Presbyterianism,"A number of new Presbyterian Churches were founded by Scottish immigrants to England in the 19th century and later. Following the 'Disruption' in 1843 many of those linked to the Church of Scotland eventually joined what became the Presbyterian Church of England in 1876. Some, that is Crown Court (Covent Garden, London), St Andrew's (Stepney, London) and Swallow Street (London), did not join the English denomination, which is why there are Church of Scotland congregations in England such as those at Crown Court, and St Columba's, Pont Street (Knightsbridge) in London. There is also a congregation in the heart of London's financial district called London City Presbyterian Church that is also affiliated with Free Church of Scotland.",In what year did the Presbyterian Church of England become the Church of Scotland?,"{'text': [], 'answer_start': []}"
8,56db635fe7c41114004b505b,American_Idol,"American Idol was nominated for the Emmy's Outstanding Reality Competition Program for nine years but never won. Director Bruce Gower won a Primetime Emmy Award for Outstanding Directing For A Variety, Music Or Comedy Series in 2009, and the show won a Creative Arts Emmys each in 2007 and 2008, three in 2009, and two in 2011, as well as a Governor's Award in 2007 for its Idol Gives Back edition. It won the People's Choice Award, which honors the popular culture of the previous year as voted by the public, for favorite competition/reality show in 2005, 2006, 2007, 2010, 2011 and 2012. It won the first Critics' Choice Television Award in 2011 for Best Reality Competition.",What award did American Idol win for its Idol Gives Back charity work?,"{'text': ['Governor's Award in 2007'], 'answer_start': [341]}"
9,5a7a7c4c21c2de001afe9c77,European_Central_Bank,"On the other hand, certain financial techniques can reduce the impact of such purchases on the currency. One is sterilisation, in which highly valued assets are sold at the same time that the weaker assets are purchased, which keeps the money supply neutral. Another technique is simply to accept the bad assets as long-term collateral (as opposed to short-term repo swaps) to be held until their market value stabilises. This would imply, as a quid pro quo, adjustments in taxation and expenditure in the economies of the weaker states to improve the perceived value of the assets.","What can a state do with good assets, rather than cashing them in directly?","{'text': [], 'answer_start': []}"


In [22]:
show_random_elements(datasets["validation"])

Unnamed: 0,id,title,context,question,answers
0,5727db85ff5b5019007d96fd,Harvard_University,"Harvard's athletic rivalry with Yale is intense in every sport in which they meet, coming to a climax each fall in the annual football meeting, which dates back to 1875 and is usually called simply ""The Game"". While Harvard's football team is no longer one of the country's best as it often was a century ago during football's early days (it won the Rose Bowl in 1920), both it and Yale have influenced the way the game is played. In 1903, Harvard Stadium introduced a new era into football with the first-ever permanent reinforced concrete stadium of its kind in the country. The stadium's structure actually played a role in the evolution of the college game. Seeking to reduce the alarming number of deaths and serious injuries in the sport, Walter Camp (former captain of the Yale football team), suggested widening the field to open up the game. But the stadium was too narrow to accommodate a wider playing surface. So, other steps had to be taken. Camp would instead support revolutionary new rules for the 1906 season. These included legalizing the forward pass, perhaps the most significant rule change in the sport's history.",In what year did Harvard Stadium become the first ever concrete reinforced stadium in the country?,"{'text': ['1903', '1903', '1903'], 'answer_start': [434, 434, 434]}"
1,5ad0322877cf76001a686dff,Imperialism,"It became a moral justification to lift the world up to French standards by bringing Christianity and French culture. In 1884 the leading exponent of colonialism, Jules Ferry declared France had a civilising mission: ""The higher races have a right over the lower races, they have a duty to civilize the inferior"". Full citizenship rights – ‘’assimilation’’ – were offered, although in reality assimilation was always on the distant horizon. Contrasting from Britain, France sent small numbers of settlers to its colonies, with the only notable exception of Algeria, where French settlers nevertheless always remained a small minority.",The English thought bringing what would uplift other regions?,"{'text': [], 'answer_start': []}"
2,5ad14f5a645df0001a2d1710,European_Union_law,"EU Competition law has its origins in the European Coal and Steel Community (ECSC) agreement between France, Italy, Belgium, the Netherlands, Luxembourg and Germany in 1951 following the second World War. The agreement aimed to prevent Germany from re-establishing dominance in the production of coal and steel as members felt that its dominance had contributed to the outbreak of the war. Article 65 of the agreement banned cartels and article 66 made provisions for concentrations, or mergers, and the abuse of a dominant position by companies. This was the first time that competition law principles were included in a plurilateral regional agreement and established the trans-European model of competition law. In 1957 competition rules were included in the Treaty of Rome, also known as the EC Treaty, which established the European Economic Community (EEC). The Treaty of Rome established the enactment of competition law as one of the main aims of the EEC through the ""institution of a system ensuring that competition in the common market is not distorted"". The two central provisions on EU competition law on companies were established in article 85, which prohibited anti-competitive agreements, subject to some exemptions, and article 86 prohibiting the abuse of dominant position. The treaty also established principles on competition law for member states, with article 90 covering public undertakings, and article 92 making provisions on state aid. Regulations on mergers were not included as member states could not establish consensus on the issue at the time.",Which countries did not agree upon the European Coal and Steel Community agreement?,"{'text': [], 'answer_start': []}"
3,572ffce5a23a5019007fcc15,Rhine,"Around 2.5 million years ago (ending 11,600 years ago) was the geological period of the Ice Ages. Since approximately 600,000 years ago, six major Ice Ages have occurred, in which sea level dropped 120 m (390 ft) and much of the continental margins became exposed. In the Early Pleistocene, the Rhine followed a course to the northwest, through the present North Sea. During the so-called Anglian glaciation (~450,000 yr BP, marine oxygen isotope stage 12), the northern part of the present North Sea was blocked by the ice and a large lake developed, that overflowed through the English Channel. This caused the Rhine's course to be diverted through the English Channel. Since then, during glacial times, the river mouth was located offshore of Brest, France and rivers, like the Thames and the Seine, became tributaries to the Rhine. During interglacials, when sea level rose to approximately the present level, the Rhine built deltas, in what is now the Netherlands.",What period was 2.5 million years ago?,"{'text': ['Ice Ages', 'geological period', 'geological period of the Ice Ages'], 'answer_start': [88, 63, 63]}"
4,57287fec4b864d1900164a3e,Yuan_dynasty,"There were many religions practiced during the Yuan dynasty, such as Buddhism, Islam, and Christianity. The establishment of the Yuan dynasty had dramatically increased the number of Muslims in China. However, unlike the western khanates, the Yuan dynasty never converted to Islam. Instead, Kublai Khan, the founder of the Yuan dynasty, favored Buddhism, especially the Tibetan variants. As a result, Tibetan Buddhism was established as the de facto state religion. The top-level department and government agency known as the Bureau of Buddhist and Tibetan Affairs (Xuanzheng Yuan) was set up in Khanbaliq (modern Beijing) to supervise Buddhist monks throughout the empire. Since Kublai Khan only esteemed the Sakya sect of Tibetan Buddhism, other religions became less important. He and his successors kept a Sakya Imperial Preceptor (Dishi) at court. Before the end of the Yuan dynasty, 14 leaders of the Sakya sect had held the post of Imperial Preceptor, thereby enjoying special power. Furthermore, Mongol patronage of Buddhism resulted in a number of monuments of Buddhist art. Mongolian Buddhist translations, almost all from Tibetan originals, began on a large scale after 1300. Many Mongols of the upper class such as the Jalayir and the Oronar nobles as well as the emperors also patronized Confucian scholars and institutions. A considerable number of Confucian and Chinese historical works were translated into the Mongolian language.",What was the Yuan's unofficial state religion?,"{'text': ['Tibetan Buddhism', 'Tibetan Buddhism', 'Tibetan Buddhism'], 'answer_start': [401, 401, 401]}"
5,5733ea04d058e614000b6595,French_and_Indian_War,"In the spring of 1753, Paul Marin de la Malgue was given command of a 2,000-man force of Troupes de la Marine and Indians. His orders were to protect the King's land in the Ohio Valley from the British. Marin followed the route that Céloron had mapped out four years earlier, but where Céloron had limited the record of French claims to the burial of lead plates, Marin constructed and garrisoned forts. He first constructed Fort Presque Isle (near present-day Erie, Pennsylvania) on Lake Erie's south shore. He had a road built to the headwaters of LeBoeuf Creek. Marin constructed a second fort at Fort Le Boeuf (present-day Waterford, Pennsylvania), designed to guard the headwaters of LeBoeuf Creek. As he moved south, he drove off or captured British traders, alarming both the British and the Iroquois. Tanaghrisson, a chief of the Mingo, who were remnants of Iroquois and other tribes who had been driven west by colonial expansion. He intensely disliked the French (whom he accused of killing and eating his father). Traveling to Fort Le Boeuf, he threatened the French with military action, which Marin contemptuously dismissed.",Where did Marin build first fort?,"{'text': ['Fort Presque Isle (near present-day Erie, Pennsylvania', 'Fort Presque Isle', 'near present-day Erie, Pennsylvania', 'Fort Presque Isle', 'near present-day Erie, Pennsylvania'], 'answer_start': [425, 425, 444, 425, 444]}"
6,573362b94776f41900660976,Warsaw,"Building activity occurred in numerous noble palaces and churches during the later decades of the 17th century. One of the best examples of this architecture are Krasiński Palace (1677–1683), Wilanów Palace (1677–1696) and St. Kazimierz Church (1688–1692). The most impressive examples of rococo architecture are Czapski Palace (1712–1721), Palace of the Four Winds (1730s) and Visitationist Church (façade 1728–1761). The neoclassical architecture in Warsaw can be described by the simplicity of the geometrical forms teamed with a great inspiration from the Roman period. Some of the best examples of the neoclassical style are the Palace on the Water (rebuilt 1775–1795), Królikarnia (1782–1786), Carmelite Church (façade 1761–1783) and Evangelical Holy Trinity Church (1777–1782). The economic growth during the first years of Congress Poland caused a rapid rise architecture. The Neoclassical revival affected all aspects of architecture, the most notable are the Great Theater (1825–1833) and buildings located at Bank Square (1825–1828).",What type of architecture is the Palace of Four Windows an impressive example of?,"{'text': ['rococo', 'rococo', 'rococo'], 'answer_start': [289, 289, 289]}"
7,5a55068b134fea001a0e180d,Packet_switching,"Packet switching contrasts with another principal networking paradigm, circuit switching, a method which pre-allocates dedicated network bandwidth specifically for each communication session, each having a constant bit rate and latency between nodes. In cases of billable services, such as cellular communication services, circuit switching is characterized by a fee per unit of connection time, even when no data is transferred, while packet switching may be characterized by a fee per unit of information transmitted, such as characters, packets, or messages.",How much bandwidth is dedicated for each communication session?,"{'text': [], 'answer_start': []}"
8,5a8203d431013a001a3350bf,Harvard_University,"In 1846, the natural history lectures of Louis Agassiz were acclaimed both in New York and on the campus at Harvard College. Agassiz's approach was distinctly idealist and posited Americans' ""participation in the Divine Nature"" and the possibility of understanding ""intellectual existences"". Agassiz's perspective on science combined observation with intuition and the assumption that a person can grasp the ""divine plan"" in all phenomena. When it came to explaining life-forms, Agassiz resorted to matters of shape based on a presumed archetype for his evidence. This dual view of knowledge was in concert with the teachings of Common Sense Realism derived from Scottish philosophers Thomas Reid and Dugald Stewart, whose works were part of the Harvard curriculum at the time. The popularity of Agassiz's efforts to ""soar with Plato"" probably also derived from other writings to which Harvard students were exposed, including Platonic treatises by Ralph Cudworth, John Norrisand, in a Romantic vein, Samuel Coleridge. The library records at Harvard reveal that the writings of Plato and his early modern and Romantic followers were almost as regularly read during the 19th century as those of the ""official philosophy"" of the more empirical and more deistic Scottish school.",Where were the writings of Plato acclaimed in 1846?,"{'text': [], 'answer_start': []}"
9,571cbe35dd7acb1400e4c13f,Oxygen,"Oxygen presents two spectrophotometric absorption bands peaking at the wavelengths 687 and 760 nm. Some remote sensing scientists have proposed using the measurement of the radiance coming from vegetation canopies in those bands to characterize plant health status from a satellite platform. This approach exploits the fact that in those bands it is possible to discriminate the vegetation's reflectance from its fluorescence, which is much weaker. The measurement is technically difficult owing to the low signal-to-noise ratio and the physical structure of vegetation; but it has been proposed as a possible method of monitoring the carbon cycle from satellites on a global scale.",On what scale would scientists show measurements of vegetation?,"{'text': ['global', 'a global scale', 'global', 'global', 'a global scale'], 'answer_start': [669, 667, 669, 669, 667]}"


In [23]:
# Convert the first 500 examples of each split into a list of dictionaries
# in the prompt/completion form required by OpenAI.

import json

NUM_EXAMPLES = 500


def to_prompt_completion(example):
    # The prompt keeps the title, context and question; the completion is the
    # list of answer texts (empty for unanswerable SQuAD v2 questions).
    return {
        "prompt": {
            "title": example["title"],
            "context": example["context"],
            "question": example["question"],
        },
        "completion": example["answers"]["text"],
    }


example_train = [to_prompt_completion(datasets["train"][i]) for i in range(NUM_EXAMPLES)]
with open("example_train.json", "w") as f:
    json.dump(example_train, f)

example_test = [to_prompt_completion(datasets["validation"][i]) for i in range(NUM_EXAMPLES)]
with open("example_validation.json", "w") as f:
    json.dump(example_test, f)
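
The files written above hold a single JSON array in which each prompt is a nested dictionary and each completion is a list of strings. If the downstream tool expects JSON Lines with plain-string `prompt` and `completion` fields (an assumption about the target format, not something this notebook verifies), a flattening step could look like the sketch below. The helper names, the prompt template and the `NO_ANSWER` placeholder are all illustrative choices.

```python
import json

NO_ANSWER = "unanswerable"  # hypothetical placeholder for SQuAD v2 questions with no answer


def flatten_record(record):
    """Turn one record shaped like the entries of example_train into two plain strings."""
    prompt = record["prompt"]
    prompt_text = (
        f"Title: {prompt['title']}\n"
        f"Context: {prompt['context']}\n"
        f"Question: {prompt['question']}\n"
        "Answer:"
    )
    answers = record["completion"]
    # Keep the first reference answer; fall back to the placeholder when the list is empty.
    completion_text = " " + (answers[0] if answers else NO_ANSWER)
    return {"prompt": prompt_text, "completion": completion_text}


def to_jsonl_lines(records):
    """Serialize each flattened record as one JSON document per line."""
    return [json.dumps(flatten_record(r), ensure_ascii=False) for r in records]


# Toy record in the same layout as the entries built in the cell above.
toy_records = [{
    "prompt": {"title": "Demo", "context": "The sky is blue.", "question": "What color is the sky?"},
    "completion": ["blue"],
}]
print(to_jsonl_lines(toy_records)[0])
```

Writing the result is then a matter of joining the lines with newlines and saving them to a `.jsonl` file, for example from `to_jsonl_lines(example_train)`.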

