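## Goals
### 1. Convert the DRCD data to the cmrc2018 format
### 2. Build a Hugging Face DatasetDict
### 3. Upload to the Hugging Face Hub as a dataset

- The Traditional Chinese DRCD dataset has a deeply nested structure. Converting it to the flatter cmrc2018 format reduces that complexity and makes the converted dataset easier to understand and work with.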
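To make the difference in shape concrete, here is a minimal sketch on a made-up, in-memory document. The sample content and the helper name `flatten` are illustrative assumptions, not part of the DRCD files. Only the nesting (`data` → `paragraphs` → `qas` → `answers`) mirrors what the conversion cell in this notebook reads.

```python
# Toy DRCD-style document: made-up content, same nesting as the real files.
drcd_like = {
    "data": [
        {
            "paragraphs": [
                {
                    "context": "Taipei is the capital of Taiwan.",
                    "qas": [
                        {
                            "id": "0-0-0",
                            "question": "What is the capital of Taiwan?",
                            "answers": [{"text": "Taipei", "answer_start": 0}],
                        }
                    ],
                }
            ]
        }
    ]
}


def flatten(doc):
    """Yield one flat, cmrc2018-style record per question (first answer only)."""
    for article in doc["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                first = qa["answers"][0]
                yield {
                    "id": qa["id"],
                    "context": paragraph["context"],
                    "question": qa["question"],
                    "answers": {
                        "text": [first["text"]],
                        "answer_start": [first["answer_start"]],
                    },
                }


records = list(flatten(drcd_like))
print(records[0]["answers"])
```

Each question becomes a single self-contained record, so downstream code no longer needs to walk three levels of nesting to pair a question with its context.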

In [1]:
import json
from datasets import Dataset, DatasetDict
from pprint import pprint

def convert_to_cmrc2018(input_file, output_file):
    """Converts a DRCD JSON file to cmrc2018 format, removing the 'data' key."""

    data = []
    with open(input_file, 'r', encoding='utf-8') as f:
        drcd_data = json.load(f)['data']  # Load the data
        for article in drcd_data:
            for paragraph in article['paragraphs']:
                for qa in paragraph['qas']:
                    cmrc_example = {
                        'id': qa['id'],
                        'context': paragraph['context'],
                        'question': qa['question'],
                        'answers': {
                            'text': [qa['answers'][0]['text']],  # Only the first answer
                            'answer_start': [qa['answers'][0]['answer_start']]
                        }
                    }
                    data.append(cmrc_example)

    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=4) # Directly dump the list


# Convert your files
convert_to_cmrc2018('DRCD_training.json', 'cmrc2018_train.json')
convert_to_cmrc2018('DRCD_dev.json', 'cmrc2018_dev.json')
convert_to_cmrc2018('DRCD_test.json', 'cmrc2018_test.json')


# Load the converted files and build a DatasetDict
train_dataset = Dataset.from_json('cmrc2018_train.json')
validation_dataset = Dataset.from_json('cmrc2018_dev.json')
test_dataset = Dataset.from_json('cmrc2018_test.json')

datasets = DatasetDict({
    'train': train_dataset,
    'validation': validation_dataset,
    'test': test_dataset
})

# Prepare for Hugging Face upload (requires authentication)
# You'll need a Hugging Face account and an access token with write permission.

# from huggingface_hub import login
# login()  # Log in to Hugging Face and follow the prompts

# DatasetDict provides push_to_hub for uploading:
# datasets.push_to_hub("your-huggingface-username/your-dataset-name")
print(datasets)
print(datasets["train"][0])



DatasetDict({
    train: Dataset({
        features: ['id', 'context', 'question', 'answers'],
        num_rows: 26936
    })
    validation: Dataset({
        features: ['id', 'context', 'question', 'answers'],
        num_rows: 3524
    })
    test: Dataset({
        features: ['id', 'context', 'question', 'answers'],
        num_rows: 3493
    })
})
{'id': '1001-10-1', 'context': '2010年引進的廣州快速公交運輸系統，屬世界第二大快速公交系統，日常載客量可達100萬人次，高峰時期每小時單向客流高達26900人次，僅次於波哥大的快速交通系統，平均每10秒鐘就有一輛巴士，每輛巴士單向行駛350小時。包括橋樑在內的站台是世界最長的州快速公交運輸系統站台，長達260米。目前廣州市區的計程車和公共汽車主要使用液化石油氣作燃料，部分公共汽車更使用油電、氣電混合動力技術。2012年底開始投放液化天然氣燃料的公共汽車，2014年6月開始投放液化天然氣插電式混合動力公共汽車，以取代液化石油氣公共汽車。2007年1月16日，廣州市政府全面禁止在市區內駕駛摩托車。違反禁令的機動車將會予以沒收。廣州市交通局聲稱禁令的施行，使得交通擁擠問題和車禍大幅減少。廣州白雲國際機場位於白雲區與花都區交界，2004年8月5日正式投入運營，屬中國交通情況第二繁忙的機場。該機場取代了原先位於市中心的無法滿足日益增長航空需求的舊機場。目前機場有三條飛機跑道，成為國內第三個擁有三跑道的民航機場。比鄰近的香港國際機場第三跑道預計的2023年落成早8年。', 'question': '廣州的快速公交運輸系統每多久就會有一輛巴士？', 'answers': {'answer_start': [84], 'text': ['10秒鐘']}}
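A quick sanity check after a conversion like this is to confirm that each `answer_start` really points at the answer text inside its context. The sketch below is one way to do that. The helper name `find_misaligned` and the two sample records are assumptions made up for illustration.

```python
def find_misaligned(records):
    """Return ids of records whose answer text is not found at answer_start."""
    bad = []
    for rec in records:
        context = rec["context"]
        answers = rec["answers"]
        for text, start in zip(answers["text"], answers["answer_start"]):
            # The slice beginning at `start` must reproduce the answer exactly.
            if context[start:start + len(text)] != text:
                bad.append(rec["id"])
                break
    return bad


# Made-up records: the first is aligned, the second is off by one character.
sample = [
    {
        "id": "ok",
        "context": "每10秒鐘就有一輛巴士",
        "question": "q",
        "answers": {"text": ["10秒鐘"], "answer_start": [1]},
    },
    {
        "id": "shifted",
        "context": "每10秒鐘就有一輛巴士",
        "question": "q",
        "answers": {"text": ["10秒鐘"], "answer_start": [2]},
    },
]

print(find_misaligned(sample))  # ['shifted']
```

Because iterating over a `Dataset` yields each row as a dictionary, calling `find_misaligned(datasets["train"])` should work on the converted splits as well. An empty list would mean every span lines up.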


### Upload the data to Hugging Face


In [3]:
from huggingface_hub import notebook_login

notebook_login()  # Authenticate with a Hugging Face access token (write permission is required to upload)
datasets.push_to_hub('roberthsu2003/for_MRC_QA', private=False)  # Upload all three splits as a public dataset


CommitInfo(commit_url='https://huggingface.co/datasets/roberthsu2003/for_MRC_QA/commit/cc8019be74843a25670b39badb64f6135ec9e97c', commit_message='Upload dataset', commit_description='', oid='cc8019be74843a25670b39badb64f6135ec9e97c', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/roberthsu2003/for_MRC_QA', endpoint='https://huggingface.co', repo_type='dataset', repo_id='roberthsu2003/for_MRC_QA'), pr_revision=None, pr_num=None)
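Once the commit has gone through, the upload can be checked by loading the dataset back from the Hub. The following is a minimal sketch. The helper name `load_uploaded` is an assumption, and the call itself needs network access and the `datasets` package, so it is left commented out.

```python
def load_uploaded(repo_id="roberthsu2003/for_MRC_QA"):
    """Download the dataset from the Hub and return it as a DatasetDict."""
    # Imported here so that defining the helper does not require the library.
    from datasets import load_dataset

    return load_dataset(repo_id)


# Example (needs network access):
# reloaded = load_uploaded()
# print(reloaded)
```

If the upload succeeded, the printed `DatasetDict` should list the same `train`, `validation`, and `test` splits with the same row counts as the local one.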