Inconsistency between load_dataset and load_from_disk functionality #7503

Open
@zzzzzec

Issue Description

I ran into confusing behavior when using load_dataset and load_from_disk in the datasets library. Working offline with the gsm8k dataset, I can load it from a local path with load_dataset:

import datasets
ds = datasets.load_dataset('/root/xxx/datasets/gsm8k', 'main')

output:

DatasetDict({
    train: Dataset({
        features: ['question', 'answer'],
        num_rows: 7473
    })
    test: Dataset({
        features: ['question', 'answer'],
        num_rows: 1319
    })
})

This works as expected. However, I then processed the dataset, converting the final-answer format from #### to \boxed{}:

import re

import datasets

ds = datasets.load_dataset('/root/xxx/datasets/gsm8k', 'main')
ds_train = ds['train']
ds_test = ds['test']

def convert(sample):
    # Rewrite the final answer marker '#### <answer>' as '\boxed{<answer>}'
    solution = sample['answer']
    solution = re.sub(r'####\s*(\S+)', r'\\boxed{\1}', solution)
    sample = {
        'problem': sample['question'],
        'solution': solution
    }
    return sample

# Drop the original columns so only 'problem' and 'solution' remain
ds_train = ds_train.map(convert, remove_columns=['question', 'answer'])
ds_test = ds_test.map(convert, remove_columns=['question', 'answer'])
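
To make the conversion concrete, here is a minimal check of the same substitution on a single made-up sample. The question and answer text are hypothetical and only mimic the gsm8k layout; they are not an actual row from the dataset.

```python
import re

def convert(sample):
    # Same substitution as above: '#### <answer>' becomes '\boxed{<answer>}'
    solution = re.sub(r'####\s*(\S+)', r'\\boxed{\1}', sample['answer'])
    return {'problem': sample['question'], 'solution': solution}

# Hypothetical sample in the gsm8k layout (not a real row)
sample = {
    'question': 'What is 40 + 32?',
    'answer': '40 + 32 = 72\n#### 72',
}

converted = convert(sample)
print(converted['solution'])
```

The printed solution keeps the reasoning line unchanged and ends with \boxed{72} in place of #### 72.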

I saved it using save_to_disk:

from datasets import DatasetDict
data_dict = DatasetDict({
    'train': ds_train,
    'test': ds_test
})
data_dict.save_to_disk('/root/xxx/datasets/gsm8k-new')

But now I can only load it using load_from_disk:

from datasets import load_from_disk
new_ds = load_from_disk('/root/xxx/datasets/gsm8k-new')

output:

DatasetDict({
    train: Dataset({
        features: ['problem', 'solution'],
        num_rows: 7473
    })
    test: Dataset({
        features: ['problem', 'solution'],
        num_rows: 1319
    })
})

Attempting to use load_dataset produces unexpected results:

from datasets import load_dataset
new_ds = load_dataset('/root/xxx/datasets/gsm8k-new')

output:

DatasetDict({
    train: Dataset({
        features: ['_data_files', '_fingerprint', '_format_columns', '_format_kwargs', '_format_type', '_output_all_columns', '_split'],
        num_rows: 1
    })
    test: Dataset({
        features: ['_data_files', '_fingerprint', '_format_columns', '_format_kwargs', '_format_type', '_output_all_columns', '_split'],
        num_rows: 1
    })
})

Questions

  1. Why is it designed so that a dataset written with save_to_disk cannot be loaded with load_dataset? In a small project it is relatively easy to change every load_dataset call to load_from_disk. In complex frameworks such as TRL or lighteval, however, digging into the framework code to swap load_dataset for load_from_disk is tedious and error-prone.
    In addition, load_from_disk cannot load datasets downloaded directly from the hub, so anyone who needs to modify a dataset has to commit to either load_from_disk or load_dataset. This creates an unnecessary dichotomy in the API and complicates workflows that involve modified datasets.
  2. What is the recommended approach for this use case? Should I manually process my gsm8k-new dataset to make it compatible with load_dataset? Is there a standard way to convert between these formats?

thanks~
