Inconsistency between load_dataset and load_from_disk functionality

## Issue Description

I've run into confusing behavior when using `load_dataset` and `load_from_disk` in the `datasets` library. When working offline with the gsm8k dataset, I can load it from a local path:

```python
import datasets
ds = datasets.load_dataset('/root/xxx/datasets/gsm8k', 'main')
```
output:
```text
DatasetDict({
    train: Dataset({
        features: ['question', 'answer'],
        num_rows: 7473
    })
    test: Dataset({
        features: ['question', 'answer'],
        num_rows: 1319
    })
})
```

This works as expected. However, after processing the dataset (converting the answer format from `####` to `\boxed{}`):
```python
import re

import datasets

ds = datasets.load_dataset('/root/xxx/datasets/gsm8k', 'main')
ds_train = ds['train']
ds_test = ds['test']


def convert(sample):
    # Rewrite the final answer marker "#### <answer>" as "\boxed{<answer>}".
    solution = re.sub(r'####\s*(\S+)', r'\\boxed{\1}', sample['answer'])
    return {
        'problem': sample['question'],
        'solution': solution,
    }


# Replace the original columns with the new 'problem' and 'solution' columns.
ds_train = ds_train.map(convert, remove_columns=['question', 'answer'])
ds_test = ds_test.map(convert, remove_columns=['question', 'answer'])
```
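To make the conversion concrete, here is a minimal sketch of what the substitution does to a single answer string. The sample text is made up for illustration and is not an actual row from the dataset.

```python
import re

# Hypothetical gsm8k-style answer string, invented for this illustration.
sample_answer = "She sold 48 + 24 = 72 clips.\n#### 72"

# Same pattern and replacement as in convert(): the text after "####"
# is captured and wrapped in \boxed{...}.
converted = re.sub(r'####\s*(\S+)', r'\\boxed{\1}', sample_answer)
print(converted)
```

The reasoning text before the marker is left untouched; only the `#### <answer>` part is rewritten.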

I saved it using `save_to_disk`:
```python
from datasets import DatasetDict

data_dict = DatasetDict({
    'train': ds_train,
    'test': ds_test
})
data_dict.save_to_disk('/root/xxx/datasets/gsm8k-new')
```

But now I can only load it using `load_from_disk`:

```python
from datasets import load_from_disk

new_ds = load_from_disk('/root/xxx/datasets/gsm8k-new')
```
output:
```text
DatasetDict({
    train: Dataset({
        features: ['problem', 'solution'],
        num_rows: 7473
    })
    test: Dataset({
        features: ['problem', 'solution'],
        num_rows: 1319
    })
})
```

Attempting to use `load_dataset` produces unexpected results:
```python
from datasets import load_dataset

new_ds = load_dataset('/root/xxx/datasets/gsm8k-new')
```
output:
```text
DatasetDict({
    train: Dataset({
        features: ['_data_files', '_fingerprint', '_format_columns', '_format_kwargs', '_format_type', '_output_all_columns', '_split'],
        num_rows: 1
    })
    test: Dataset({
        features: ['_data_files', '_fingerprint', '_format_columns', '_format_kwargs', '_format_type', '_output_all_columns', '_split'],
        num_rows: 1
    })
})
```

## Questions

1. Why is it designed so that a dataset written with `save_to_disk` cannot be loaded with `load_dataset`? For small projects with limited code, it is relatively easy to change every `load_dataset` call to `load_from_disk`. For complex frameworks like TRL or lighteval, however, digging into the framework code to make that change is tedious and error-prone.
   Additionally, `load_from_disk` cannot load datasets downloaded directly from the Hub, so anyone who needs to modify a dataset has to choose between `load_from_disk` and `load_dataset`. This creates an unnecessary dichotomy in the API and complicates workflows that involve modified datasets.
2. What is the recommended approach for this use case? Should I manually process my gsm8k-new dataset to make it compatible with `load_dataset`? Is there a standard way to convert between these formats?

thanks~

Inconsistency between load_dataset and load_from_disk functionality #7503
