Issue Description
I've encountered confusion when using `load_dataset` and `load_from_disk` in the `datasets` library. Specifically, when working offline with the gsm8k dataset, I can load it from a local path:
```python
import datasets

ds = datasets.load_dataset('/root/xxx/datasets/gsm8k', 'main')
```
output:

```
DatasetDict({
    train: Dataset({
        features: ['question', 'answer'],
        num_rows: 7473
    })
    test: Dataset({
        features: ['question', 'answer'],
        num_rows: 1319
    })
})
```
This works as expected. However, I then processed the dataset (converting the answer format from `#### ...` to `\boxed{...}`):
```python
import re

import datasets

ds = datasets.load_dataset('/root/xxx/datasets/gsm8k', 'main')
ds_train = ds['train']
ds_test = ds['test']

def convert(sample):
    solution = sample['answer']
    solution = re.sub(r'####\s*(\S+)', r'\\boxed{\1}', solution)
    sample = {
        'problem': sample['question'],
        'solution': solution
    }
    return sample

ds_train = ds_train.map(convert, remove_columns=['question', 'answer'])
ds_test = ds_test.map(convert, remove_columns=['question', 'answer'])
```
I saved the processed splits with `save_to_disk`:
```python
from datasets.dataset_dict import DatasetDict

data_dict = DatasetDict({
    'train': ds_train,
    'test': ds_test
})
data_dict.save_to_disk('/root/xxx/datasets/gsm8k-new')
```
But now I can only load it with `load_from_disk`:
```python
from datasets import load_from_disk

new_ds = load_from_disk('/root/xxx/datasets/gsm8k-new')
```
output:

```
DatasetDict({
    train: Dataset({
        features: ['problem', 'solution'],
        num_rows: 7473
    })
    test: Dataset({
        features: ['problem', 'solution'],
        num_rows: 1319
    })
})
```
Attempting to use `load_dataset` on the same directory produces unexpected results; it appears to read the JSON metadata files written by `save_to_disk` as if they were the dataset itself:
```python
from datasets import load_dataset

new_ds = load_dataset('/root/xxx/datasets/gsm8k-new')
```
output:

```
DatasetDict({
    train: Dataset({
        features: ['_data_files', '_fingerprint', '_format_columns', '_format_kwargs', '_format_type', '_output_all_columns', '_split'],
        num_rows: 1
    })
    test: Dataset({
        features: ['_data_files', '_fingerprint', '_format_columns', '_format_kwargs', '_format_type', '_output_all_columns', '_split'],
        num_rows: 1
    })
})
```
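To make the above more concrete, here is a quick way to inspect what `save_to_disk` actually wrote (the path is from my setup, and the exact file names may vary with the `datasets` version):

```python
import os

# List everything save_to_disk wrote under the output directory.
# In my case this shows per-split folders with Arrow data files plus
# JSON metadata (e.g. state.json / dataset_info.json), which seems to be
# what load_dataset picks up when pointed at this directory.
for root, _, files in os.walk('/root/xxx/datasets/gsm8k-new'):
    for name in files:
        print(os.path.join(root, name))
```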
Questions
- Why is it designed such that after using `save_to_disk`, the dataset cannot be loaded with `load_dataset`? For small projects with limited code, it might be relatively easy to change all instances of `load_dataset` to `load_from_disk`. However, for complex frameworks like TRL or lighteval, diving into the framework code to change `load_dataset` to `load_from_disk` is extremely tedious and error-prone. Additionally, `load_from_disk` cannot load datasets downloaded directly from the Hub, which means that if you need to modify a dataset, you have to choose between `load_from_disk` and `load_dataset`. This creates an unnecessary dichotomy in the API and complicates workflows that involve modified datasets.
- What's the recommended approach for this use case? Should I manually process my gsm8k-new dataset to make it compatible with `load_dataset`? Is there a standard way to convert between these formats? (A rough workaround I've been experimenting with is sketched below, but I'm not sure it's the intended path.)
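For reference, the workaround I've been trying is to re-export the processed splits to Parquet and then point `load_dataset` at those files via `data_files`. This seems to work for me, but I don't know whether it is the recommended way to make a `save_to_disk`-style dataset consumable by `load_dataset` again; the output directory and file names below are just my own choices:

```python
import os

from datasets import load_dataset, load_from_disk

out_dir = '/root/xxx/datasets/gsm8k-parquet'  # hypothetical output location
os.makedirs(out_dir, exist_ok=True)

# Re-export the splits that were saved with save_to_disk as plain Parquet files.
ds = load_from_disk('/root/xxx/datasets/gsm8k-new')
ds['train'].to_parquet(os.path.join(out_dir, 'train.parquet'))
ds['test'].to_parquet(os.path.join(out_dir, 'test.parquet'))

# Load them back with load_dataset, which is what frameworks like TRL call internally.
new_ds = load_dataset(
    'parquet',
    data_files={
        'train': os.path.join(out_dir, 'train.parquet'),
        'test': os.path.join(out_dir, 'test.parquet'),
    },
)
print(new_ds)
```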
thanks~