# Question Answering (QA) Multilingual Statement Dataset Creation

We use two multilingual datasets:
- google/xquad (Extractive QA)
- mhardalov/exams (Multiple Choice QA)

Each row of the resulting dataset has two fields:
- 'statement': a statement built by filling a template with text from the source dataset
- 'is_true': a label (1 or 0) indicating whether the statement is true or false
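
As a minimal sketch of the row format described above (the question, the answers, and the helper name `make_row` are illustrative assumptions, not taken from either dataset):

```python
# Hypothetical template with numbered placeholders, in the style used later in this notebook
template = 'Q: "${1}". A: "${2}"'

def make_row(question, answer, is_true):
    """Build one data point: a filled-in statement plus its truth label."""
    statement = template.replace("${1}", question).replace("${2}", answer)
    return {"statement": statement, "is_true": int(is_true)}

# One true and one false statement built from the same question
row_true = make_row("What is the capital of France?", "Paris", True)
row_false = make_row("What is the capital of France?", "Berlin", False)
print(row_true)
print(row_false)
```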

In [2]:
!pip install datasets

Collecting datasets
  Downloading datasets-2.19.2-py3-none-any.whl (542 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Collecting requests>=2.32.1 (from datasets)
  Downloading requests-2.32.3-py3-none-any.whl (64 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl (134 kB)

In [1]:
# a lot of these came from the evaluation script, so some of them are unnecessary here
from datasets import load_dataset, get_dataset_config_names, Dataset
import random
import numpy as np
import torch
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from torch.utils.data import DataLoader
import pandas as pd
from sklearn.utils import resample
from copy import copy
import argparse
from tqdm import tqdm

In [2]:
SEED = 42
NUM_PROC=5
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)
CACHE=None

In [7]:
dataset = load_dataset('google/xquad', 'xquad.en', split='validation')

In [8]:
print(dataset[0])

{'id': '56beb4343aeaaa14008c925b', 'context': "The Panthers defense gave up just 308 points, ranking sixth in the league, while also leading the NFL in interceptions with 24 and boasting four Pro Bowl selections. Pro Bowl defensive tackle Kawann Short led the team in sacks with 11, while also forcing three fumbles and recovering two. Fellow lineman Mario Addison added 6½ sacks. The Panthers line also featured veteran defensive end Jared Allen, a 5-time pro bowler who was the NFL's active career sack leader with 136, along with defensive end Kony Ealy, who had 5 sacks in just 9 starts. Behind them, two of the Panthers three starting linebackers were also selected to play in the Pro Bowl: Thomas Davis and Luke Kuechly. Davis compiled 5½ sacks, four forced fumbles, and four interceptions, while Kuechly led the team in tackles (118) forced two fumbles, and intercepted four passes of his own. Carolina's secondary featured Pro Bowl safety Kurt Coleman, who led the team with a career high sev
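
XQuAD records follow the SQuAD layout: `answers` holds parallel lists `text` and `answer_start`, where each start is a character offset into `context`. A minimal sketch of checking that a gold span lines up with the context, using a shortened hypothetical record rather than the real row printed above:

```python
# Hypothetical, shortened record in the same layout as an XQuAD row
record = {
    "context": "Kawann Short led the team in sacks with 11.",
    "question": "Who led the team in sacks?",
    "answers": {"text": ["Kawann Short"], "answer_start": [0]},
}

def gold_span_matches(rec):
    """Return True if the first answer text is found at its recorded character offset."""
    text = rec["answers"]["text"][0]
    start = rec["answers"]["answer_start"][0]  # a list, so index it before use
    return rec["context"][start:start + len(text)] == text

print(gold_span_matches(record))
```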

In [14]:

def fill_template(templates, values):
    """Pick one template at random and substitute ${1}, ${2}, ... with the given values."""
    temp = random.choice(templates)
    for i in range(len(values)):
        temp = temp.replace("${"+str(i+1)+"}", values[i])
    return temp

def generate_eval_exams(dataset_name, templates, split, label_column, choices, question=None, template_key=None, template_labels=None, generate_other=False, input_sentences=None):
  """
    - dataset_name: path of the Hugging Face dataset repo
    - templates: list of template strings (a list of lists only when template_key/template_labels are used)
    - split: list of dataset splits to load, e.g. ['train']
    - label_column: column holding the answer key ('A', 'B', 'C', 'D')
    - choices: not used for EXAMS (kept for compatibility with the other generation scripts)
    - question: column holding the question dict (with 'stem' and 'choices')
    - template_key: 'question' for XCOPA, 'label' for XNLI (not used for EXAMS)
    - template_labels: ['cause', 'effect'] for XCOPA, [0, 1, 2] for XNLI; selects which sub-list of templates to use
    - generate_other: needed for XNLI (not used for EXAMS)
    - input_sentences: needed for XStoryCloze (not used for EXAMS)
  """
  langs = get_dataset_config_names(dataset_name)
  langs = [lang for lang in langs if 'crosslingual' in lang and len(lang.split('_')[1]) == 2]
  # sanity check: remove any langs that have more than 2 letters (should only be 2 letter code)
  # langs = [lang for lang in langs if len(lang) == 2]

  data = {}
  for lang in langs:
    print(f"loading dataset lang: {lang}")
    data[lang] = load_dataset(dataset_name, lang, split=split, cache_dir=CACHE)

  langs = [lang for lang in langs if 'all' not in lang]
  col_names = copy(data[langs[0]][0].column_names)
  #col_names.remove(label_column)

  def create_statements_labels_exams(example):
    template=""
    # XCOPA, XNLI
    if template_key:
      if template_labels:
        for idx, val in enumerate(template_labels):
          if example[template_key] == val:
            template = templates[idx]

    # XWinograd
    if not template:
      template=templates

    # Choose from templates given
    temp = random.choice(template)

    right_answer = ord(example[label_column]) - ord('A') # should give index (0, 1, 2, 3)
    #print(right_answer)
    #assert(right_answer >= 0 and right_answer <= 3)

    #idx = random.choice(range(0, len(choices)))
    # choose between 1 and 0
    truth_val = random.choice(range(0, 2))

    if truth_val:
      idx = right_answer
    else:
      idx = random.choice([i for i in range(len(example['question']['choices']['text'])) if i != right_answer])

    example['is_true'] = 1 if idx == right_answer else 0
    values = []
    if question:
      values.append(example[question]['stem'])
    values.append(example['question']['choices']['text'][idx])
    example['statement'] = fill_template([temp], values)

    return example

  resulting_statements = {}
  for lang in langs:
    print(f"Processing {lang}...")
    resulting_statements[lang] = [split.map(create_statements_labels_exams, remove_columns=col_names, num_proc=NUM_PROC) for split in data[lang]][0]

  return resulting_statements, langs
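
The core of `create_statements_labels_exams` can be illustrated on a single in-memory question. This is a simplified sketch under stated assumptions: the helper name `mcq_to_statement`, the sample question, and the explicit `rng` argument are hypothetical, and the real function reads the nested `example['question']` dict instead.

```python
import random

def mcq_to_statement(stem, choices, answer_key, template, rng):
    """Turn one multiple-choice question into a true or false statement."""
    right = ord(answer_key) - ord("A")  # 'A' -> 0, 'B' -> 1, ...
    if rng.random() < 0.5:
        idx = right  # keep the correct choice: true statement
    else:
        # pick any other choice: false statement
        idx = rng.choice([i for i in range(len(choices)) if i != right])
    statement = template.replace("${1}", stem).replace("${2}", choices[idx])
    return {"statement": statement, "is_true": int(idx == right)}

rng = random.Random(42)
example = mcq_to_statement(
    "Which planet is closest to the Sun?",
    ["Venus", "Mercury", "Mars", "Earth"],
    "B",
    '"${1}". Answer: "${2}"',
    rng,
)
print(example)
```

Passing a dedicated `random.Random` instance keeps the sketch reproducible without touching the global seed set earlier in the notebook.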

In [15]:
dataset = "mhardalov/exams"
templates= ["\"${1}\". Answer: \"${2}\"",
            "Q: \"${1}\". A: \"${2}\"",
            "Question: '\"${1}\".' Answer: '\"${2}\"'"]
split=['train']
label_column='answerKey'
question='question'
choices=[0, 1, 2, 3] # choosing 1 of ['A', 'B', 'C', 'D']

exams_statements, exams_langs = generate_eval_exams(dataset, templates, split, label_column, choices, question)

loading dataset lang: crosslingual_bg
loading dataset lang: crosslingual_hr
loading dataset lang: crosslingual_hu
loading dataset lang: crosslingual_it
loading dataset lang: crosslingual_mk
loading dataset lang: crosslingual_pl
loading dataset lang: crosslingual_pt
loading dataset lang: crosslingual_sq
loading dataset lang: crosslingual_sr
loading dataset lang: crosslingual_tr
loading dataset lang: crosslingual_vi
Processing crosslingual_bg...


Map (num_proc=5):   0%|          | 0/2344 [00:00<?, ? examples/s]

Processing crosslingual_hr...


Map (num_proc=5):   0%|          | 0/2341 [00:00<?, ? examples/s]

Processing crosslingual_hu...


Map (num_proc=5):   0%|          | 0/1731 [00:00<?, ? examples/s]

Processing crosslingual_it...


Map (num_proc=5):   0%|          | 0/1010 [00:00<?, ? examples/s]

Processing crosslingual_mk...


Map (num_proc=5):   0%|          | 0/1665 [00:00<?, ? examples/s]

Processing crosslingual_pl...


Map (num_proc=5):   0%|          | 0/1577 [00:00<?, ? examples/s]

Processing crosslingual_pt...


Map (num_proc=5):   0%|          | 0/740 [00:00<?, ? examples/s]

Processing crosslingual_sq...


Map (num_proc=5):   0%|          | 0/1194 [00:00<?, ? examples/s]

Processing crosslingual_sr...


Map (num_proc=5):   0%|          | 0/1323 [00:00<?, ? examples/s]

Processing crosslingual_tr...


Map (num_proc=5):   0%|          | 0/1571 [00:00<?, ? examples/s]

Processing crosslingual_vi...


Map (num_proc=5):   0%|          | 0/1955 [00:00<?, ? examples/s]

In [20]:
values = exams_statements['crosslingual_it']['is_true']
unique, counts = np.unique(values, return_counts=True)
print(unique, counts)

[0 1] [529 481]
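
The counts above (529 false, 481 true) can be turned into class proportions to confirm the split is close to balanced. A small sketch using only the standard library; `label_balance` is a hypothetical helper name and the label list is rebuilt from the printed counts rather than loaded from the dataset:

```python
from collections import Counter

def label_balance(labels):
    """Return the fraction of examples per label value."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Rebuild a label list with the same counts as the Italian split above
labels = [0] * 529 + [1] * 481
balance = label_balance(labels)
print(balance)
```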


In [16]:
print(exams_statements['crosslingual_it'][0])

{'is_true': 1, 'statement': '"Quale tra i seguenti ormoni stimola la tiroide alla produzione degli ormoni?". Answer: "l’ormone tireotropo"'}


In [24]:
def push_dataset(statements, langs, dataset_name):
  for lang_code in langs:
    statements[lang_code].push_to_hub(f"mbzuai-ugrip-statement-tuning/{dataset_name}", lang_code.split('_')[1], split='train')

In [22]:
!huggingface-cli login


    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) n
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [25]:
push_dataset(exams_statements, exams_langs, "exams")

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/3 [00:00<?, ?ba/s]

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/3 [00:00<?, ?ba/s]

README.md:   0%|          | 0.00/322 [00:00<?, ?B/s]

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/2 [00:00<?, ?ba/s]

README.md:   0%|          | 0.00/613 [00:00<?, ?B/s]

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/2 [00:00<?, ?ba/s]

README.md:   0%|          | 0.00/904 [00:00<?, ?B/s]

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/2 [00:00<?, ?ba/s]

README.md:   0%|          | 0.00/1.20k [00:00<?, ?B/s]

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/2 [00:00<?, ?ba/s]

README.md:   0%|          | 0.00/1.49k [00:00<?, ?B/s]

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

README.md:   0%|          | 0.00/1.78k [00:00<?, ?B/s]

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/2 [00:00<?, ?ba/s]

README.md:   0%|          | 0.00/2.07k [00:00<?, ?B/s]

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/2 [00:00<?, ?ba/s]

README.md:   0%|          | 0.00/2.36k [00:00<?, ?B/s]

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/2 [00:00<?, ?ba/s]

README.md:   0%|          | 0.00/2.65k [00:00<?, ?B/s]

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/2 [00:00<?, ?ba/s]

README.md:   0%|          | 0.00/2.94k [00:00<?, ?B/s]

# XQUAD

In [48]:
def generate_eval_xquad(dataset_name, templates, split, label_column, choices, question=None, template_key=None, template_labels=None, generate_other=False, input_sentences=None):
  """
    - dataset_name: path of the Hugging Face dataset repo
    - templates: list of template strings (a list of lists only when template_key/template_labels are used)
    - split: list of dataset splits to load, e.g. ['validation']
    - label_column: column holding the gold answers (not read directly; 'answers' is used below)
    - choices: not used for XQuAD (kept for compatibility with the other generation scripts)
    - question: if set, the context passage is included as the first template value
    - template_key: 'question' for XCOPA, 'label' for XNLI (not used for XQuAD)
    - template_labels: ['cause', 'effect'] for XCOPA, [0, 1, 2] for XNLI; selects which sub-list of templates to use
    - generate_other: needed for XNLI (not used for XQuAD)
    - input_sentences: needed for XStoryCloze (not used for XQuAD)
  """
  langs = get_dataset_config_names(dataset_name)
  langs = [lang for lang in langs]
  # sanity check: remove any langs that have more than 2 letters (should only be 2 letter code)
  # langs = [lang for lang in langs if len(lang) == 2]

  data = {}
  for lang in langs:
    print(f"loading dataset lang: {lang}")
    data[lang] = load_dataset(dataset_name, lang, split=split, cache_dir=CACHE)

  langs = [lang for lang in langs]
  col_names = copy(data[langs[0]][0].column_names)

  def create_statements_labels_xquad(example):
    template=""
    # XCOPA, XNLI
    if template_key:
      if template_labels:
        for idx, val in enumerate(template_labels):
          if example[template_key] == val:
            template = templates[idx]

    # XWinograd
    if not template:
      template=templates

    # Choose from templates given
    temp = random.choice(template)

    # choose between 1 and 0
    truth_val = random.choice(range(0, 2))

    right_answer = example['answers']['text'][0]
    # 'answer_start' is a list with one offset per answer, so take the first
    right_start = example['answers']['answer_start'][0]

    # if right answer is an integer, sample a different integer
    if right_answer.isdigit():
      wrong_answer = right_answer
      while wrong_answer == right_answer:
        wrong_answer = str(random.randint(1, 300))
    # if it is a decimal such as "3.5" (str.isdigit() is False for these)
    elif right_answer.replace('.', '', 1).isdigit():
      wrong_answer = str(random.uniform(1.0, 300.0))
    # otherwise use a random span of the context that does not start where the right answer starts
    else:
      while True:
        random_start = random.randint(0, len(example['context'])//4)
        random_span = random.randint(1, len(example['context']) - random_start-1)  # Ensure the span doesn't exceed the string length
        if random_start != right_start:
          break
      wrong_answer = example['context'][random_start:random_start+random_span]

    if truth_val:
      ans = right_answer
    else:
      ans = wrong_answer

    example['is_true'] = 1 if truth_val else 0
    values = []
    if question:
      values.append(example['context'])
    values.append(example['question'])
    values.append(ans)
    example['statement'] = fill_template([temp], values)

    return example

  resulting_statements = {}
  for lang in langs:
    print(f"Processing {lang}...")
    resulting_statements[lang] = [split.map(create_statements_labels_xquad, remove_columns=col_names, num_proc=NUM_PROC) for split in data[lang]][0]

  return resulting_statements, langs
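
The random span drawn above can be long and may still contain the gold answer. One possible alternative, shown only as a sketch and not as the method used in this notebook, is to reject any span that overlaps the gold answer. The helper name `sample_distractor_span`, the length cap of twice the answer length, and the `max_tries` limit are all assumptions.

```python
import random

def sample_distractor_span(context, answer_start, answer_len, rng, max_tries=100):
    """Return a substring of context that does not overlap the gold span, or None."""
    gold_end = answer_start + answer_len
    for _ in range(max_tries):
        start = rng.randint(0, len(context) - 1)
        # assumption: cap the distractor at twice the gold answer length
        length = rng.randint(1, max(1, answer_len * 2))
        end = min(len(context), start + length)
        # accept only spans entirely before or entirely after the gold span
        if end <= answer_start or start >= gold_end:
            span = context[start:end].strip()
            if span:
                return span
    return None

context = "Kawann Short led the team in sacks with 11."
span = sample_distractor_span(context, 0, len("Kawann Short"), random.Random(0))
print(span)
```

Returning `None` after `max_tries` lets the caller decide how to handle contexts where no non-overlapping span exists, for example by skipping the row.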

In [49]:
print("Processing XQUAD...")
dataset = "google/xquad"
templates = ["Context: \"${1}\"\n Question: \"${2}\"\n Answer: \"${3}\"",
             "\"${1}\"\n According to the passage above, the answer of \"${2}\" is \"${3}\"",
             "Passage: \"${1}\"\n Question: \"${2}\"\n Answer: \"${3}\"",
             "\"${1}\"\n Q: \"${2}\"\n A: \"${3}\""]
split = ['validation']
label_column = 'answers'
question='context'
choices=['answers']

xquad_statements, xquad_langs = generate_eval_xquad(dataset, templates, split, label_column, choices, question)

Processing XQUAD...
loading dataset lang: xquad.ar
loading dataset lang: xquad.de
loading dataset lang: xquad.el
loading dataset lang: xquad.en
loading dataset lang: xquad.es
loading dataset lang: xquad.hi
loading dataset lang: xquad.ro
loading dataset lang: xquad.ru
loading dataset lang: xquad.th
loading dataset lang: xquad.tr
loading dataset lang: xquad.vi
loading dataset lang: xquad.zh
Processing xquad.ar...


Map (num_proc=5):   0%|          | 0/1190 [00:00<?, ? examples/s]

Processing xquad.de...


Map (num_proc=5):   0%|          | 0/1190 [00:00<?, ? examples/s]

Processing xquad.el...


Map (num_proc=5):   0%|          | 0/1190 [00:00<?, ? examples/s]

Processing xquad.en...


Map (num_proc=5):   0%|          | 0/1190 [00:00<?, ? examples/s]

Processing xquad.es...


Map (num_proc=5):   0%|          | 0/1190 [00:00<?, ? examples/s]

Processing xquad.hi...


Map (num_proc=5):   0%|          | 0/1190 [00:00<?, ? examples/s]

Processing xquad.ro...


Map (num_proc=5):   0%|          | 0/1190 [00:00<?, ? examples/s]

Processing xquad.ru...


Map (num_proc=5):   0%|          | 0/1190 [00:00<?, ? examples/s]

Processing xquad.th...


Map (num_proc=5):   0%|          | 0/1190 [00:00<?, ? examples/s]

Processing xquad.tr...


Map (num_proc=5):   0%|          | 0/1190 [00:00<?, ? examples/s]

Processing xquad.vi...


Map (num_proc=5):   0%|          | 0/1190 [00:00<?, ? examples/s]

Processing xquad.zh...


Map (num_proc=5):   0%|          | 0/1190 [00:00<?, ? examples/s]

In [50]:
xquad_statements['xquad.en'][0]

{'is_true': 1,
 'statement': '"The Panthers defense gave up just 308 points, ranking sixth in the league, while also leading the NFL in interceptions with 24 and boasting four Pro Bowl selections. Pro Bowl defensive tackle Kawann Short led the team in sacks with 11, while also forcing three fumbles and recovering two. Fellow lineman Mario Addison added 6½ sacks. The Panthers line also featured veteran defensive end Jared Allen, a 5-time pro bowler who was the NFL\'s active career sack leader with 136, along with defensive end Kony Ealy, who had 5 sacks in just 9 starts. Behind them, two of the Panthers three starting linebackers were also selected to play in the Pro Bowl: Thomas Davis and Luke Kuechly. Davis compiled 5½ sacks, four forced fumbles, and four interceptions, while Kuechly led the team in tackles (118) forced two fumbles, and intercepted four passes of his own. Carolina\'s secondary featured Pro Bowl safety Kurt Coleman, who led the team with a career high seven interceptio

In [52]:
unique, counts = np.unique(xquad_statements['xquad.en']['is_true'], return_counts=True)
print(unique, counts)

[0 1] [615 575]


In [54]:
def push_dataset(statements, langs, dataset_name):
  for lang_code in langs:
    statements[lang_code].push_to_hub(f"mbzuai-ugrip-statement-tuning/{dataset_name}", lang_code.split('.')[1], split='train')
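
`push_dataset` is defined twice in this notebook, and the two versions differ only in the separator used to extract the language code (`'_'` for EXAMS configs such as `crosslingual_bg`, `'.'` for XQuAD configs such as `xquad.en`). A small sketch of a shared helper that handles both naming schemes; `config_name` is a hypothetical name and nothing here touches the Hub:

```python
def config_name(lang_code):
    """Extract the language code from 'crosslingual_bg' or 'xquad.en' style names."""
    for sep in ("_", "."):
        if sep in lang_code:
            # take the part after the last separator
            return lang_code.rsplit(sep, 1)[-1]
    return lang_code  # already a bare language code

print(config_name("crosslingual_bg"))
print(config_name("xquad.en"))
```

With such a helper, a single `push_dataset` could pass `config_name(lang_code)` as the config argument of `push_to_hub` for both datasets.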

In [55]:
push_dataset(xquad_statements, xquad_langs, 'xquad')

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/2 [00:00<?, ?ba/s]

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/2 [00:00<?, ?ba/s]

README.md:   0%|          | 0.00/324 [00:00<?, ?B/s]

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/2 [00:00<?, ?ba/s]

README.md:   0%|          | 0.00/617 [00:00<?, ?B/s]

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/2 [00:00<?, ?ba/s]

README.md:   0%|          | 0.00/910 [00:00<?, ?B/s]

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/2 [00:00<?, ?ba/s]

README.md:   0%|          | 0.00/1.20k [00:00<?, ?B/s]

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/2 [00:00<?, ?ba/s]

README.md:   0%|          | 0.00/1.50k [00:00<?, ?B/s]

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/2 [00:00<?, ?ba/s]

README.md:   0%|          | 0.00/1.79k [00:00<?, ?B/s]

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/2 [00:00<?, ?ba/s]

README.md:   0%|          | 0.00/2.08k [00:00<?, ?B/s]

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/2 [00:00<?, ?ba/s]

README.md:   0%|          | 0.00/2.38k [00:00<?, ?B/s]

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/2 [00:00<?, ?ba/s]

README.md:   0%|          | 0.00/2.67k [00:00<?, ?B/s]

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/2 [00:00<?, ?ba/s]

README.md:   0%|          | 0.00/2.96k [00:00<?, ?B/s]

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/2 [00:00<?, ?ba/s]

README.md:   0%|          | 0.00/3.25k [00:00<?, ?B/s]