<a href="https://colab.research.google.com/github/vifirsanova/empi/blob/main/dataset/ASD_QA_dataset_HF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook was created by [Victoria Firsanova](https://vifirsanova.github.io).

It describes how to prepare [the ASD QA dataset](https://figshare.com/articles/dataset/Autism_Spectrum_Disorder_and_Asperger_Syndrome_Question_Answering_Dataset_1_0/13295831) for Hugging Face.

# Import

In [None]:
!pip install datasets

In [12]:
import json
from pathlib import Path
from sklearn.model_selection import train_test_split
from datasets import Dataset

# Load data

First, I uploaded the dataset file (`original.json`) to the Colab session.

In [None]:
path = Path('original.json')
data = json.loads(path.read_text(encoding='utf-8'))

data = data['data']
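
The rest of this notebook reads the loaded file as a SQuAD-style nested structure. The sketch below walks a tiny hand-made sample with that shape. The field names are the ones accessed in this notebook; the sample content and the helper `iter_qas` are illustrative assumptions, not part of the dataset.

```python
# A made-up sample mimicking the nesting used in this notebook:
# articles -> paragraphs -> qas -> answers.
sample_data = [
    {
        "paragraphs": [
            {
                "context": "Autism is a developmental condition.",
                "qas": [
                    {
                        "question": "What is autism?",
                        "is_impossible": False,
                        "answers": [
                            {
                                "text": "a developmental condition",
                                "answer_start": 10,
                                "answer_end": 35,
                            }
                        ],
                    }
                ],
            }
        ]
    }
]


def iter_qas(articles):
    """Yield (context, qa) pairs from the nested structure (hypothetical helper)."""
    for article in articles:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                yield paragraph["context"], qa


pairs = list(iter_qas(sample_data))
```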

# Dataset stats

Next, let's compute some basic statistics for the dataset.

In [7]:
# Lengths in characters ("symbols", suffix _t) and in words (suffix _w)
c_len_t, c_len_w = [], []      # reading paragraphs (contexts)
q_len_t, q_len_w = [], []      # questions
ans_len_t, ans_len_w = [], []  # non-empty answers

qa_stats = 0  # total number of QA pairs
fakes = 0     # questions marked with is_impossible

for article in data:
  for paragraph in article['paragraphs']:
    context = paragraph['context']
    c_len_t.append(len(context))
    c_len_w.append(len(context.split()))
    qa_stats += len(paragraph['qas'])

    for elem in paragraph['qas']:
      q_len_t.append(len(elem['question']))
      q_len_w.append(len(elem['question'].split()))
      if elem['is_impossible']:
        fakes += 1
      # Only non-empty answers contribute to the answer-length statistics
      answer_text = elem['answers'][0]['text']
      if answer_text:
        ans_len_t.append(len(answer_text))
        ans_len_w.append(len(answer_text.split()))

print('The number of QA pairs', qa_stats)
print('The number of irrelevant questions', fakes)
print('The average question length', round(sum(q_len_t) / len(q_len_t)), 'symbols', round(sum(q_len_w) / len(q_len_w)), 'words')
print('The average answer length', round(sum(ans_len_t) / len(ans_len_t)), 'symbols', round(sum(ans_len_w) / len(ans_len_w)), 'words')
print('The average reading paragraph length', round(sum(c_len_t) / len(c_len_t)), 'symbols', round(sum(c_len_w) / len(c_len_w)), 'words')
print('Max question length', max(q_len_t), 'symbols', max(q_len_w), 'words')
print('Max answer length', max(ans_len_t), 'symbols', max(ans_len_w), 'words')
print('Max reading paragraph length', max(c_len_t), 'symbols', max(c_len_w), 'words')
print('Min question length', min(q_len_t), 'symbols', min(q_len_w), 'words')
print('Min answer length', min(ans_len_t), 'symbols', min(ans_len_w), 'words')
print('Min reading paragraph length', min(c_len_t), 'symbols', min(c_len_w), 'words')

The number of QA pairs 4138
The number of irrelevant questions 352
The average question length 53 symbols 8 words
The average answer length 141 symbols 20 words
The average reading paragraph length 453 symbols 63 words
Max question length 226 symbols 32 words
Max answer length 555 symbols 85 words
Max reading paragraph length 551 symbols 94 words
Min question length 9 symbols 2 words
Min answer length 5 symbols 1 words
Min reading paragraph length 144 symbols 17 words
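
The same mean, minimum, and maximum figures can be produced by one small helper instead of repeated `print` expressions. This is only a sketch: `summarize` is a hypothetical name, and the lengths in the usage line are invented, not taken from the dataset.

```python
from statistics import mean


def summarize(lengths):
    """Return the rounded mean, the min, and the max of a list of lengths."""
    return {"mean": round(mean(lengths)), "min": min(lengths), "max": max(lengths)}


# Invented example values, not dataset figures
example = summarize([9, 53, 226])
```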


# Split the data

In [8]:
train, temp = train_test_split(data, test_size=0.3, shuffle=True)
val, test = train_test_split(temp, test_size=0.5, shuffle=True)
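
The two calls above give roughly a 70/15/15 split over the top-level entries of `data`. Without a fixed seed the split changes on every run; `train_test_split` accepts a `random_state` argument for that purpose. As a dependency-free illustration of the same idea, the sketch below does a seeded three-way split with the standard library. It is not scikit-learn's implementation, and `three_way_split` and the seed value are assumptions made for the example.

```python
import random


def three_way_split(items, val_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle a copy of items with a fixed seed and cut it into train/val/test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_val = round(len(shuffled) * val_frac)
    n_test = round(len(shuffled) * test_frac)
    n_train = len(shuffled) - n_val - n_test
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Toy input: 100 integers standing in for the top-level dataset entries
train_demo, val_demo, test_demo = three_way_split(range(100))
```

Calling the function again with the same seed returns the same three lists, which keeps later results comparable across runs.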

# Retrieve questions, answers, and reading passages from the dataset and build the dictionaries

In [None]:
def dict_from_data(sample):
  # Create fresh lists on every call; module-level lists would be shared
  # by all three splits and keep accumulating items across calls
  qs = []
  ans = []
  pars = []

  for elem in sample:
    for paragraph in elem['paragraphs']:
      for element in paragraph['qas']:
        qs.append(element['question'])
        ans.append(element['answers'][0])  # keep only the first answer
        pars.append(paragraph['context'])

  return {'question': qs, 'answers': ans, 'paragraph': pars}

train_dict, val_dict, test_dict = dict_from_data(train), dict_from_data(val), dict_from_data(test)

# Create HF Dataset

In [14]:
train_dataset, val_dataset, test_dataset = Dataset.from_dict(train_dict), Dataset.from_dict(val_dict), Dataset.from_dict(test_dict)

In [16]:
test_dataset[0]

{'question': 'Как-то дети должны относится к особенным ровесникам?',
 'answers': {'answer_end': 383,
  'answer_start': 0,
  'text': 'Братьев и сестёр надо учить правильно реагировать'},
 'paragraph': 'Братьев и сестёр нужно учить правильно реагировать. Обычно это обращение к родителям за помощью в разрешении ситуации. Родителям нужно приложить все усилия, чтобы дать детям безопасное место для важных вещей и защитить от агрессивного поведения. Томас Пауэлл и Пегги Галлахер предлагают идеи по обучению базовым навыкам поведения братьев и сестёр. Сиблингам важно чувствовать, что с их братом или сестрой обращаются как можно более "нормально".\xa0'}
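
When working with extractive QA data, it can be useful to check whether each answer text actually occurs in its paragraph at the recorded offset. The sketch below assumes that `answer_start` is a character offset into the paragraph, which this notebook does not state; `span_matches` is a hypothetical helper and the example strings are invented.

```python
def span_matches(paragraph, answer):
    """Check that answer['text'] appears in paragraph at answer['answer_start'].

    Assumes answer_start is a character offset (an assumption, not confirmed
    by the dataset documentation).
    """
    start = answer["answer_start"]
    text = answer["text"]
    return paragraph[start:start + len(text)] == text


# Invented strings for illustration only
paragraph = "Siblings should be taught how to react."
good = {"text": "Siblings should be taught", "answer_start": 0}
bad = {"text": "Siblings must be taught", "answer_start": 0}

print(span_matches(paragraph, good))  # True
print(span_matches(paragraph, bad))   # False
```

Entries that fail such a check may still be intentional (for example, paraphrased answers), so the result is better treated as a flag for manual review than as an error.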

# Push the data to HF

In [None]:
train_dataset.push_to_hub('missvector/asd-qa-train')
val_dataset.push_to_hub('missvector/asd-qa-val')
test_dataset.push_to_hub('missvector/asd-qa-test')

The ASD QA dataset and its user-friendly preview are now available on HF:

1. [Train set](https://huggingface.co/datasets/missvector/asd-qa-train)
2. [Validation set](https://huggingface.co/datasets/missvector/asd-qa-val)
3. [Test set](https://huggingface.co/datasets/missvector/asd-qa-test)