<a href="https://colab.research.google.com/github/zetavg/LLM-Research/blob/main/MLQA_Dataset_Converter_(en_zh_tw).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# MLQA Dataset Converter (en-zh_tw)

Converts the MLQA (MultiLingual Question Answering) dataset into a merged English and Traditional Chinese (`zh_tw`) version intended for bilingual language models.

Generated dataset on Hugging Face: https://huggingface.co/datasets/zetavg/mlqa_en_zh_tw.

Sample:

```json
[
  {
    "title": {
      "zh_tw": "2014 年冬季奧林匹克運動會冰壺比賽",
      "en": "Curling at the 2014 Winter Olympics"
    },
    "paragraphs": [
      {
        "context": {
          "zh_tw": "本屆冬奧會冰壺比賽參加資格有兩種辦法可以取得。各國家或地區可以透過 2012 年和 2013 年的世界冰壺錦標賽，也可以透過 2013 年 12 月舉辦的一次冬奧會資格賽來取得資格。七個國家透過兩屆世錦賽積分之和來獲得資格，兩個國家則透過冬奧會資格賽。作為主辦國，俄羅斯自動獲得參賽資格，這樣就確定了冬奧會冰壺比賽的男女各十支參賽隊伍。",
          "en": "Qualification to the curling tournaments at the Winter Olympics was determined through two methods. Nations could qualify teams by earning qualification points from performances at the 2012 and 2013 World Curling Championships. Teams could also qualify through an Olympic qualification event which was held in the autumn of 2013. Seven nations qualified teams via World Championship qualification points, while two nations qualified through the qualification event. As host nation, Russia qualified teams automatically, thus making a total of ten teams per gender in the curling tournaments."
        },
        "qas": [
          {
            "id": "b08184972e38a79c47d01614aa08505bb3c9b680",
            "question": {
              "zh_tw": "俄羅斯有多少隊獲得參賽資格？",
              "en": "How many teams did Russia qualify for?"
            },
            "answers": {
              "zh_tw": [
                {
                  "text": "十支",
                  "answer_start": 161
                }
              ],
              "en": [
                {
                  "text": "ten teams",
                  "answer_start": 543
                }
              ]
            }
          }
        ]
      }
    ]
  }
]
```

Note that some data points may be missing one of the languages in the title, context, questions, or answers.
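
Because any of the language keys can be absent, code that consumes the merged records should probably avoid indexing `['en']` or `['zh_tw']` directly. The following is a minimal sketch of a defensive reader. `iter_qa_pairs` is a hypothetical helper that is not part of this notebook, and the toy record only assumes the layout shown in the sample above:

```python
def iter_qa_pairs(items, lang):
    """Yield (question, context, answers) for QAs fully available in `lang`.

    Hypothetical helper: assumes the merged layout from the sample, where
    title/context/question/answers are dicts keyed by language code.
    """
    for item in items:
        for paragraph in item.get('paragraphs', []):
            context = paragraph.get('context', {}).get(lang)
            for qa in paragraph.get('qas', []):
                question = qa.get('question', {}).get(lang)
                answers = qa.get('answers', {}).get(lang)
                # Skip QAs where any part is missing in this language
                if context is None or question is None or not answers:
                    continue
                yield question, context, answers


# Toy record (made up for illustration): English is complete,
# zh_tw only has a question.
sample = [{
    'title': {'en': 'Example'},
    'paragraphs': [{
        'context': {'en': 'Russia qualified ten teams.'},
        'qas': [{
            'id': 'q1',
            'question': {'en': 'How many teams?', 'zh_tw': '有多少隊？'},
            'answers': {'en': [{'text': 'ten teams', 'answer_start': 17}]},
        }],
    }],
}]

en_pairs = list(iter_qa_pairs(sample, 'en'))
zh_pairs = list(iter_qa_pairs(sample, 'zh_tw'))
print(len(en_pairs), len(zh_pairs))
```

Skipping incomplete QAs is only one possible policy; falling back to the other language would be an equally reasonable choice depending on the use case.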


Upload the 6 files obtained from [facebookresearch/MLQA](https://github.com/facebookresearch/mlqa) under `/content` (the default working directory):

* `dev-context-zh-question-zh.json`
* `dev-context-zh-question-en.json`
* `dev-context-en-question-zh.json`
* `test-context-zh-question-zh.json`
* `test-context-zh-question-en.json`
* `test-context-en-question-zh.json`

Converted and merged datasets will be saved under `/content/outputs`.
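
Before running the rest of the notebook, it may help to confirm that all six files were actually uploaded. This is a small sketch, not part of the original notebook; `EXPECTED_FILES` and `find_missing_files` are names introduced here for illustration, and the check assumes the files sit in the current working directory:

```python
import os

# The six MLQA files listed above
EXPECTED_FILES = [
    'dev-context-zh-question-zh.json',
    'dev-context-zh-question-en.json',
    'dev-context-en-question-zh.json',
    'test-context-zh-question-zh.json',
    'test-context-zh-question-en.json',
    'test-context-en-question-zh.json',
]


def find_missing_files(directory='.', expected=EXPECTED_FILES):
    """Return the expected file names that do not exist in `directory`."""
    return [
        name for name in expected
        if not os.path.isfile(os.path.join(directory, name))
    ]


missing = find_missing_files()
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All MLQA files found.")
```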

In [None]:
!pip install datasets opencc pangu

In [None]:
import json

def load_json_file(file_name):
    # The MLQA files contain Chinese text, so read them explicitly as UTF-8
    with open(file_name, 'r', encoding='utf-8') as file:
        data = json.load(file)
    return data

In [None]:
import json

def print_json(*args):
    print(*[
        v if isinstance(v, str)
        else json.dumps(v, indent=2, ensure_ascii=False)
        for v in args
    ])

## Prepare Sample Data

In [None]:
# @title Sample data in context-zh, question-zh version
import json

sample_zh_zh_data = load_json_file("dev-context-zh-question-zh.json")['data']

sample_zh_zh_data_item = sample_zh_zh_data[1]
print_json("Sample item:", sample_zh_zh_data_item)

In [None]:
# @title Find the matching sample data in the other language versions

def find_item_by_qa_id(data, qa_id):
    for item in data:
        for paragraph in item['paragraphs']:
            for qa in paragraph['qas']:
                if qa['id'] == qa_id:
                    return item
    return None

sample_zh_en_data = load_json_file("dev-context-zh-question-en.json")['data']
sample_en_zh_data = load_json_file("dev-context-en-question-zh.json")['data']

sample_zh_en_data_item = find_item_by_qa_id(
    sample_zh_en_data,
    sample_zh_zh_data_item['paragraphs'][0]['qas'][0]['id']
)
print_json("Sample zh-en item:", sample_zh_en_data_item)

sample_en_zh_data_item = find_item_by_qa_id(
    sample_en_zh_data,
    sample_zh_zh_data_item['paragraphs'][0]['qas'][0]['id']
)
print_json("Sample en-zh item:", sample_en_zh_data_item)

## Write a Data Converter

In [None]:
# @markdown Function that converts Simplified Chinese to Traditional Chinese (Taiwan standard, including phrase conversion) and adds spacing between CJK and half-width characters.

import opencc
import pangu
s2twp_converter = opencc.OpenCC('s2twp.json')

def convert(text):
    text = s2twp_converter.convert(text)
    text = pangu.spacing_text(text)
    return text

# Example usage
print(convert('互联网（英语称为Internet）是指20世纪末期兴起电脑网络与电脑网络之间所串连成的庞大网络系统。'))

In [None]:
# @markdown Print a sample item to use as a reference for its structure.
print_json(sample_zh_zh_data_item)

Function to convert an item (in-place!):

In [None]:
def convert_item(item,
                 convert_contexts=True,
                 convert_questions=True,
                 convert_answers=True):
    if convert_contexts:
        item['title'] = convert(item['title'])

    for paragraph in item['paragraphs']:
        if convert_contexts:
            original_context = paragraph['context']
            paragraph['context'] = convert(paragraph['context'])

        for qa in paragraph['qas']:
            if convert_questions:
                qa['question'] = convert(qa['question'])

            if convert_contexts or convert_answers:
                for answer in qa['answers']:
                    if convert_contexts:
                        # Need to update answer_start
                        # if the context has been converted
                        original_context_before_answer = \
                            original_context[:answer['answer_start']]
                        converted_context_before_answer = \
                            convert(original_context_before_answer)
                        answer['answer_start'] = \
                            len(converted_context_before_answer)

                    if convert_answers:
                        answer['text'] = convert(answer['text'])


# Example usage

convert_item(sample_zh_zh_data_item)
print_json("Converted zh-zh item", sample_zh_zh_data_item)

convert_item(sample_zh_en_data_item,
             convert_questions=False)
print_json("Converted zh-en item", sample_zh_en_data_item)

convert_item(sample_en_zh_data_item,
             convert_contexts=False, convert_answers=False)
print_json("Converted en-zh item", sample_en_zh_data_item)
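
Recomputing `answer_start` from a converted prefix is an approximation: converting a prefix on its own is not guaranteed to produce exactly the same text as the corresponding part of the fully converted context (for example, when spacing is inserted right at the answer boundary, or when a phrase-level conversion spans it). A minimal sketch of a sanity check follows. `find_misaligned_answers` is a hypothetical helper that is not in the original notebook; it expects an item in the MLQA layout used above, before it is made mergeable, and the demo item is made up:

```python
def find_misaligned_answers(item):
    """Return (qa_id, expected_text, actual_slice) for every answer whose
    text is not found at answer_start in its paragraph context."""
    misaligned = []
    for paragraph in item['paragraphs']:
        context = paragraph['context']
        for qa in paragraph['qas']:
            for answer in qa['answers']:
                start = answer['answer_start']
                text = answer['text']
                actual = context[start:start + len(text)]
                if actual != text:
                    misaligned.append((qa['id'], text, actual))
    return misaligned


# Toy item: one correctly aligned answer and one that is off by one
demo_item = {
    'title': 'Demo',
    'paragraphs': [{
        'context': 'Russia qualified ten teams.',
        'qas': [
            {'id': 'ok', 'question': '?',
             'answers': [{'text': 'ten teams', 'answer_start': 17}]},
            {'id': 'off', 'question': '?',
             'answers': [{'text': 'ten teams', 'answer_start': 16}]},
        ],
    }],
}

problems = find_misaligned_answers(demo_item)
print(problems)
```

Running the same helper over a converted item (for instance `sample_zh_zh_data_item`) would show how often the prefix-based offset disagrees with the converted answer text.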


In [None]:
def convert_items(items_list, **kwargs):
    for item in items_list:
        convert_item(item, **kwargs)


# Example usage

convert_items(sample_zh_zh_data)
# print_json("Converted zh-zh items", sample_zh_zh_data[:2])

convert_items(sample_zh_en_data,
              convert_questions=False)
# print_json("Converted zh-en items", sample_zh_en_data[:2])

convert_items(sample_en_zh_data,
              convert_contexts=False, convert_answers=False)
# print_json("Converted en-zh items", sample_en_zh_data[:2])

## Write a Function to Merge Items



In [None]:
def make_mergeable_item(item, context_version, question_version):
    return {
        'title': {context_version: item['title']},
        'paragraphs': [{
            'context': {context_version: paragraph['context']},
            'qas': [{
                'id': qa['id'],
                'question': {question_version: qa['question']},
                'answers': {context_version: qa['answers']},
            } for qa in paragraph['qas']]
        } for paragraph in item['paragraphs']]
    }


# Example usage
print_json(
    "Original:",
    sample_zh_en_data_item)
print_json(
    "Mergeable:",
    make_mergeable_item(sample_zh_en_data_item, 'zh_tw', 'en'))

In [None]:
def merge_item(mergeable_item_1, mergeable_item_2):
    paragraphs_map = {}
    for paragraph in mergeable_item_1['paragraphs'] + mergeable_item_2['paragraphs']:
        paragraph_id = paragraph['qas'][0]['id']
        if paragraph_id not in paragraphs_map:
            paragraphs_map[paragraph_id] = {
                'context': {},
                'qas_map': {},
            }
        paragraphs_map[paragraph_id]['context'] = {
            **paragraphs_map[paragraph_id]['context'],
            **paragraph['context']
        }

        qas_map = paragraphs_map[paragraph_id]['qas_map']
        for qa in paragraph['qas']:
            qa_id = qa['id']
            if qa_id not in qas_map:
                qas_map[qa_id] = {
                    'id': qa_id,
                    'question': {},
                    'answers': {}
                }
            qas_map[qa_id] = {
                **qas_map[qa_id],
                'question': {
                    **qas_map[qa_id]['question'],
                    **qa['question']
                },
                'answers': {
                    **qas_map[qa_id]['answers'],
                    **qa['answers']
                }
            }

    return {
        'title': {
            **mergeable_item_1['title'],
            **mergeable_item_2['title'],
        },
        'paragraphs': [{
            'context': paragraph['context'],
            'qas': [qa for qa in paragraph['qas_map'].values()]
        } for paragraph in paragraphs_map.values()]
    }


# Example usage
print_json(
    "Merged:",
    merge_item(
        make_mergeable_item(sample_en_zh_data_item, 'en', 'zh_tw'),
        make_mergeable_item(sample_zh_en_data_item, 'zh_tw', 'en')
    )
)
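
The merge relies on dict unpacking: `{**a, **b}` keeps every key from both dicts and, when a key is present in both, takes the value from `b`. The sketch below isolates that behaviour with made-up values (the strings are illustrative and not taken from the dataset):

```python
# Each source file contributes the question under a different language key
question_from_zh_en = {'en': 'How many teams?'}
question_from_en_zh = {'zh_tw': '有多少隊？'}

# Disjoint keys: the result contains both languages
merged_question = {**question_from_zh_en, **question_from_en_zh}
print(merged_question)

# Overlapping key: the value from the right-hand dict wins
title_a = {'zh_tw': 'first'}
title_b = {'zh_tw': 'second', 'en': 'Title'}
merged_title = {**title_a, **title_b}
print(merged_title)
```

Since every source file goes through the same conversion before merging, overlapping values should normally be identical, so which side wins is unlikely to matter in practice.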

In [None]:
def merge_items(list_of_items_and_info):
    merged_items = {}
    for items, context_version, question_version in list_of_items_and_info:
        for item in items:
            # Take the ID of the first question as the ID of the item
            item_id = item['paragraphs'][0]['qas'][0]['id']
            mergeable_item = make_mergeable_item(
                item, context_version, question_version)
            if item_id not in merged_items:
                merged_items[item_id] = mergeable_item
            else:
                merged_items[item_id] = merge_item(
                    merged_items[item_id],
                    mergeable_item
                )
    return list(merged_items.values())


# Sample usage
sample_merged_items = merge_items([
    (sample_zh_zh_data, 'zh_tw', 'zh_tw'),
    (sample_en_zh_data, 'en', 'zh_tw'),
    (sample_zh_en_data, 'zh_tw', 'en'),
])
print_json("Sample merged items:", sample_merged_items[:3])
print("Sample zh-zh items count:", len(sample_zh_zh_data))
print("Sample en-zh items count:", len(sample_en_zh_data))
print("Sample zh-en items count:", len(sample_zh_en_data))
print("Sample merged items count:", len(sample_merged_items))

import pandas as pd
from google.colab import data_table
data_table.enable_dataframe_formatter()
display(pd.DataFrame.from_dict([
    {
        'en title': item['title'].get('en'),
        'zh_tw title': item['title'].get('zh_tw'),
    } for item in sample_merged_items
], orient='columns'))

## Convert, Merge and Save!

In [None]:
import os
output_dir = '/content/outputs'
os.makedirs(output_dir, exist_ok=True)

print("Processing dev data...")
dev_zh_zh_data = load_json_file("dev-context-zh-question-zh.json")['data']
dev_zh_en_data = load_json_file("dev-context-zh-question-en.json")['data']
dev_en_zh_data = load_json_file("dev-context-en-question-zh.json")['data']
convert_items(dev_zh_zh_data)
convert_items(dev_zh_en_data,
              convert_questions=False)
convert_items(dev_en_zh_data,
              convert_contexts=False, convert_answers=False)
merged_dev_data = merge_items([
    (dev_zh_zh_data, 'zh_tw', 'zh_tw'),
    (dev_zh_en_data, 'zh_tw', 'en'),
    (dev_en_zh_data, 'en', 'zh_tw'),
])
print(f"Converted and merged into {len(merged_dev_data)} rows.")

output_filename = output_dir + "/" + "dev-en-zh_tw.json"
with open(output_filename, "w", encoding="utf-8") as outfile:
    json.dump(merged_dev_data, outfile, indent=2, ensure_ascii=False)
print(f"Written to {output_filename}.")

In [None]:
print("Processing test data...")
test_zh_zh_data = load_json_file("test-context-zh-question-zh.json")['data']
test_zh_en_data = load_json_file("test-context-zh-question-en.json")['data']
test_en_zh_data = load_json_file("test-context-en-question-zh.json")['data']
convert_items(test_zh_zh_data)
convert_items(test_zh_en_data,
              convert_questions=False)
convert_items(test_en_zh_data,
              convert_contexts=False, convert_answers=False)
merged_test_data = merge_items([
    (test_zh_zh_data, 'zh_tw', 'zh_tw'),
    (test_zh_en_data, 'zh_tw', 'en'),
    (test_en_zh_data, 'en', 'zh_tw'),
])
print(f"Converted and merged into {len(merged_test_data)} rows.")

output_filename = output_dir + "/" + "test-en-zh_tw.json"
with open(output_filename, "w", encoding="utf-8") as outfile:
    json.dump(merged_test_data, outfile, indent=2, ensure_ascii=False)
print(f"Written to {output_filename}.")
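
Since some merged records lack one of the languages, a quick coverage count can show how complete the output is. This is a rough sketch rather than part of the original pipeline: `language_coverage` is a hypothetical helper, and the toy list stands in for `merged_dev_data` or `merged_test_data`:

```python
from collections import Counter


def language_coverage(items):
    """Count QAs by the set of languages their question is available in."""
    counts = Counter()
    for item in items:
        for paragraph in item['paragraphs']:
            for qa in paragraph['qas']:
                langs = sorted(qa['question'].keys())
                counts['+'.join(langs)] += 1
    return counts


# Toy stand-in for the merged data (made up for illustration)
toy_merged = [{
    'title': {'en': 'Example', 'zh_tw': '範例'},
    'paragraphs': [{
        'context': {'en': 'Some context.'},
        'qas': [
            {'id': 'a', 'question': {'en': 'Q1?', 'zh_tw': '問題一？'},
             'answers': {}},
            {'id': 'b', 'question': {'zh_tw': '問題二？'}, 'answers': {}},
        ],
    }],
}]

coverage = language_coverage(toy_merged)
print(dict(coverage))
```

In the notebook itself, the call would presumably be `language_coverage(merged_test_data)`.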