# Dataset

For this task we will use the **[Stanford Question Answering Dataset (SQuAD V2.0)](https://rajpurkar.github.io/SQuAD-explorer/)**.  
SQuAD is a reading-comprehension dataset that is widely used as a benchmark throughout the industry.

SQuAD provides a [getting started](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json) training file that we can use to explore the data.

In [3]:
# nuclio: ignore
import nuclio

### Set up the dataset

In the following steps we will use **Pandas** to load and explore our dataset.  
Along the way we will describe how the dataset is structured and how we can use it.

In [4]:
import pandas as pd

In [5]:
squad_train_url = 'https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json'
squad_dev_url = 'https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json'

In [6]:
df = pd.read_json(squad_train_url)
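
The SQuAD file is a single JSON object with a `version` string and a `data` list of articles, so the resulting DataFrame has one row per article (not per question), with `version` repeated on each row. A minimal sketch of that top-level layout, using a hand-written stand-in string rather than the real download (the titles are real, the empty `paragraphs` lists are placeholders):

```python
import json

# Hand-written stand-in for the downloaded file. The titles appear in the
# real dataset; the paragraphs are left empty here for brevity.
raw_text = '''
{"version": "v2.0",
 "data": [{"title": "Beyoncé", "paragraphs": []},
          {"title": "Frédéric_Chopin", "paragraphs": []}]}
'''
raw = json.loads(raw_text)

top_level_keys = sorted(raw.keys())   # these become the DataFrame columns
n_articles = len(raw["data"])         # one DataFrame row per article
titles = [article["title"] for article in raw["data"]]

print(top_level_keys)
print(n_articles, titles)
```

On the real DataFrame the corresponding counts are `df.shape[1]` for columns and `df.shape[0]` for articles.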

In [7]:
# df.shape is (rows, columns), so df.shape[1] is the column count
text = f'''
The SQuAD train DataFrame has {df.shape[1]} columns.
Each row holds a version id under {df.columns[0]} and one article under {df.columns[1]}

A line example:
{df.head(1)}

The second row has:
Version: {df.iloc[1, 0]}
Data: {df.iloc[1, 1]}
'''
print(text)


The SQuAD train DataFrame has 2 columns.
Each row holds a version id under version and one article under data

A line example:
  version                                               data
0    v2.0  {'title': 'Beyoncé', 'paragraphs': [{'qas': [{...

The second row has:
Version: v2.0
Data: {'title': 'Frédéric_Chopin', 'paragraphs': [{'qas': [{'question': "What was Frédéric's nationalities?", 'id': '56cbd2356d243a140015ed66', 'answers': [{'text': 'Polish and French', 'answer_start': 182}], 'is_impossible': False}, {'question': 'In what era was Frédéric active in?', 'id': '56cbd2356d243a140015ed67', 'answers': [{'text': 'Romantic era', 'answer_start': 276}], 'is_impossible': False}, {'question': 'For what instrument did Frédéric write primarily for?', 'id': '56cbd2356d243a140015ed68', 'answers': [{'text': 'solo piano', 'answer_start': 318}], 'is_impossible': False}, {'question': 'In what area was Frédéric born in?', 'id': '56cbd2356d243a140015ed69', 'answers': [{'text': '

In [8]:
sample = df.iloc[0, 1]

In [9]:
sample

{'title': 'Beyoncé',
 'paragraphs': [{'qas': [{'question': 'When did Beyonce start becoming popular?',
     'id': '56be85543aeaaa14008c9063',
     'answers': [{'text': 'in the late 1990s', 'answer_start': 269}],
     'is_impossible': False},
    {'question': 'What areas did Beyonce compete in when she was growing up?',
     'id': '56be85543aeaaa14008c9065',
     'answers': [{'text': 'singing and dancing', 'answer_start': 207}],
     'is_impossible': False},
    {'question': "When did Beyonce leave Destiny's Child and become a solo singer?",
     'id': '56be85543aeaaa14008c9066',
     'answers': [{'text': '2003', 'answer_start': 526}],
     'is_impossible': False},
    {'question': 'In what city and state did Beyonce  grow up? ',
     'id': '56bf6b0f3aeaaa14008c9601',
     'answers': [{'text': 'Houston, Texas', 'answer_start': 166}],
     'is_impossible': False},
    {'question': 'In which decade did Beyonce become famous?',
     'id': '56bf6b0f3aeaaa14008c9602',
     'answers': [{'text
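
Each article nests questions inside paragraphs, so a flat list with one record per question is often easier to work with. The following is a minimal sketch, not part of the original notebook: `flatten_article` is a helper name made up for this example, and `toy_article` is a hand-made stand-in for `df.iloc[0, 1]` whose second question is hypothetical, added only to show the unanswerable case that SQuAD V2.0 marks with `is_impossible`.

```python
def flatten_article(article):
    """Turn one SQuAD article dict into a list of flat question records."""
    records = []
    for paragraph in article["paragraphs"]:
        for qa in paragraph["qas"]:
            answers = qa.get("answers", [])
            first = answers[0] if answers else None
            records.append({
                "title": article["title"],
                "id": qa["id"],
                "question": qa["question"],
                "is_impossible": qa["is_impossible"],
                # Unanswerable questions have an empty answers list.
                "answer_text": first["text"] if first else None,
                "answer_start": first["answer_start"] if first else None,
            })
    return records


# Stand-in for df.iloc[0, 1]. Real paragraphs also carry a 'context' string
# (per the SQuAD format; it is cut off in the truncated output above).
toy_article = {
    "title": "Beyoncé",
    "paragraphs": [{
        "qas": [
            {"question": "When did Beyonce start becoming popular?",
             "id": "56be85543aeaaa14008c9063",
             "answers": [{"text": "in the late 1990s", "answer_start": 269}],
             "is_impossible": False},
            # Hypothetical entry, not taken from the dataset.
            {"question": "Hypothetical unanswerable question?",
             "id": "example-id",
             "answers": [],
             "is_impossible": True},
        ],
    }],
}

records = flatten_article(toy_article)
for record in records:
    print(record["question"], "->", record["answer_text"])
```

Passing the resulting list to `pd.DataFrame(records)` would give a table with one row per question.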

In [10]:
df.apply(lambda row: row['data']['title'], axis=1)

0                                                Beyoncé
1                                        Frédéric_Chopin
2         Sino-Tibetan_relations_during_the_Ming_dynasty
3                                                   IPod
4                 The_Legend_of_Zelda:_Twilight_Princess
5                                    Spectre_(2015_film)
6                                2008_Sichuan_earthquake
7                                          New_York_City
8                                  To_Kill_a_Mockingbird
9                                           Solar_energy
10                                            Kanye_West
11                                              Buddhism
12                                         American_Idol
13                                                   Dog
14                      2008_Summer_Olympics_torch_relay
15                                                Genome
16                                  Comprehensive_school
17                             

In [11]:
import tensorflow


In [2]:
!pip install tensorflow-hub

Collecting tensorflow-hub
  Downloading https://files.pythonhosted.org/packages/00/0e/a91780d07592b1abf9c91344ce459472cc19db3b67fdf3a61dca6ebb2f5c/tensorflow_hub-0.7.0-py2.py3-none-any.whl (89kB)
Installing collected packages: tensorflow-hub
Successfully installed tensorflow-hub-0.7.0


In [3]:
!pip install tensorflow==2.0.0

Collecting tensorflow==2.0.0
  Could not find a version that satisfies the requirement tensorflow==2.0.0 (from versions: 0.12.1, 1.0.0, 1.0.1, 1.1.0rc0, 1.1.0rc1, 1.1.0rc2, 1.1.0, 1.2.0rc0, 1.2.0rc1, 1.2.0rc2, 1.2.0, 1.2.1, 1.3.0rc0, 1.3.0rc1, 1.3.0rc2, 1.3.0, 1.4.0rc0, 1.4.0rc1, 1.4.0, 1.4.1, 1.5.0rc0, 1.5.0rc1, 1.5.0, 1.5.1, 1.6.0rc0, 1.6.0rc1, 1.6.0, 1.7.0rc0, 1.7.0rc1, 1.7.0, 1.7.1, 1.8.0rc0, 1.8.0rc1, 1.8.0, 1.9.0rc0, 1.9.0rc1, 1.9.0rc2, 1.9.0, 1.10.0rc0, 1.10.0rc1, 1.10.0, 1.10.1, 1.11.0rc0, 1.11.0rc1, 1.11.0rc2, 1.11.0, 1.12.0rc0, 1.12.0rc1, 1.12.0rc2, 1.12.0, 1.12.2, 1.12.3, 1.13.0rc0, 1.13.0rc1, 1.13.0rc2, 1.13.1, 1.13.2, 1.14.0rc0, 1.14.0rc1, 1.14.0, 2.0.0a0, 2.0.0b0, 2.0.0b1)
No matching distribution found for tensorflow==2.0.0
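
The version list in the failed install stops at the 2.0 pre-releases, which suggests pip found no 2.0.0 wheel compatible with this environment. A common cause, assumed here rather than confirmed by the log, is a pip or Python version that the newer wheels do not support. A minimal sketch for inspecting the environment before retrying; `installed_version` is a helper name made up for this example, and `importlib.metadata` requires Python 3.8 or newer:

```python
import sys
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report the interpreter version and the packages relevant to this notebook.
print("python:", ".".join(map(str, sys.version_info[:3])))
for name in ("pip", "tensorflow", "tensorflow-hub"):
    print(name, "->", installed_version(name))
```

If pip turns out to be old, upgrading it before retrying the install is a reasonable first step.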


In [1]:
import tensorflow_hub as hub

# hub.Module is the TF1-style API of tensorflow_hub
albert_module = hub.Module(
    "https://tfhub.dev/google/albert_xxlarge/2",
    trainable=True)