## Authorize and Set Up Environment

In [None]:
!pip install -Uq datasets

In [None]:
# enter your Hugging Face token when prompted to authorize
!huggingface-cli login
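
The interactive prompt above is awkward in scripts or scheduled jobs. A minimal sketch of a non-interactive alternative, assuming the token has been stored in an environment variable (called `HF_TOKEN` here) and that `huggingface_hub` is available, which is normally the case once `datasets` is installed. The helper names `token_from_env` and `login_from_env` are hypothetical, not part of any library.

```python
import os
from typing import Optional


def token_from_env(var_name: str = "HF_TOKEN") -> Optional[str]:
    """Return the token stored in an environment variable, or None if unset or empty."""
    token = os.environ.get(var_name, "").strip()
    return token or None


def login_from_env() -> bool:
    """Log in with the token from the environment; return False if no token is set."""
    token = token_from_env()
    if token is None:
        return False
    # Imported lazily so the helper above works even without the package installed.
    from huggingface_hub import login
    login(token=token)
    return True
```

Calling `login_from_env()` once at the top of a script would replace the CLI prompt; when it returns `False`, falling back to `!huggingface-cli login` is still an option.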

In [None]:
from datasets import load_dataset

## Load dataset in chunks via Parquet

- The train split is divided into 105 Parquet chunks, each containing a few thousand samples (the first train chunk loaded below has 6,040 rows, the first test chunk 3,750).
- Link to other parquet files: [https://huggingface.co/datasets/linhtran92/viet_bud500/tree/main/data](https://huggingface.co/datasets/linhtran92/viet_bud500/tree/main/data)

In [None]:
train_url = "https://huggingface.co/datasets/linhtran92/viet_bud500/resolve/main/data/train-00000-of-00105-be5f872f8be772f5.parquet"
test_url = "https://huggingface.co/datasets/linhtran92/viet_bud500/resolve/main/data/test-00000-of-00002-531c1d81edb57297.parquet"

data_files = {"train": train_url, "test": test_url}
dataset = load_dataset("parquet", data_files=data_files, num_proc=2)
dataset

DatasetDict({
    train: Dataset({
        features: ['audio', 'transcription'],
        num_rows: 6040
    })
    test: Dataset({
        features: ['audio', 'transcription'],
        num_rows: 3750
    })
})
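
The chunk file names end in a hash, so URLs for other chunks cannot be built from the index alone. One possible approach, sketched below, is to list the files in the dataset repository and keep those belonging to a split. `parquet_urls` and `list_repo_filenames` are hypothetical helpers; the `data/<split>-...parquet` layout is inferred from the two URLs above, and `list_repo_files` comes from `huggingface_hub` and needs network access.

```python
from typing import Iterable, List, Optional

REPO_ID = "linhtran92/viet_bud500"
BASE_URL = f"https://huggingface.co/datasets/{REPO_ID}/resolve/main/"


def parquet_urls(filenames: Iterable[str], split: str,
                 limit: Optional[int] = None) -> List[str]:
    """Return download URLs for the parquet chunks of one split, sorted by name."""
    prefix = f"data/{split}-"  # layout assumed from the URLs shown above
    chunks = sorted(
        name for name in filenames
        if name.startswith(prefix) and name.endswith(".parquet")
    )
    if limit is not None:
        chunks = chunks[:limit]
    return [BASE_URL + name for name in chunks]


def list_repo_filenames() -> List[str]:
    """Ask the Hub for every file path in the dataset repo (needs network access)."""
    from huggingface_hub import list_repo_files
    return list_repo_files(REPO_ID, repo_type="dataset")


# Example: load only the first three train chunks.
# files = list_repo_filenames()
# data_files = {"train": parquet_urls(files, "train", limit=3)}
# dataset = load_dataset("parquet", data_files=data_files)
```

Sorting by name keeps the chunks in index order because the index is zero-padded in the file names.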

## Load dataset via Streaming

- Dataset streaming lets you work with a dataset without downloading it.
- Suitable if you want to quickly explore just a few samples of the dataset.

In [None]:
dataset = load_dataset("linhtran92/viet_bud500", split='test', streaming=True)

In [None]:
dataset.take(2)

IterableDataset({
    features: ['audio', 'transcription'],
    n_shards: 2
})
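
As the output shows, `take(2)` returns another lazy `IterableDataset`; no samples are fetched until it is iterated. A small sketch of how the first few transcriptions could be pulled out follows. `first_transcriptions` is a hypothetical helper that works on any iterable of sample dictionaries, and it reads only the `transcription` field because the form of the decoded `audio` field can differ between `datasets` versions.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List


def first_transcriptions(samples: Iterable[Dict[str, Any]], n: int = 2) -> List[str]:
    """Pull at most n samples from an iterable and return their transcriptions."""
    return [sample["transcription"] for sample in islice(samples, n)]


# Example with the streaming dataset created above (needs network access):
# for text in first_transcriptions(dataset, n=2):
#     print(text)
```

Because `islice` stops after `n` items, only the data needed for those samples is streamed.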

## Load the complete dataset

- Requires approximately 100 GB of storage.
- Takes about 2 hours to finish loading.

In [None]:
dataset = load_dataset("linhtran92/viet_bud500", split="test", num_proc=2)
dataset
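
Given the storage requirement, it may be worth checking free disk space before starting the download. The sketch below uses the standard library's `shutil.disk_usage`; `has_free_space` is a hypothetical helper, the 100 GB threshold is the approximate figure quoted above, and `"."` stands in for wherever the cache actually lives.

```python
import shutil


def has_free_space(path: str, required_gb: float) -> bool:
    """Return True if the filesystem holding `path` has at least required_gb free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb


# "." is a placeholder; point this at the directory used for the dataset cache.
if not has_free_space(".", 100):
    print("Less than ~100 GB free here; consider another location or streaming.")
```

If the default cache location is too small, `load_dataset` accepts a `cache_dir` argument that can point to a larger disk.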

In [None]:
from datasets import load_dataset

# load from parquet files (a few thousand samples per parquet file)
# link to other parquet files: https://huggingface.co/datasets/linhtran92/viet_bud500/tree/main/data
train_url = "https://huggingface.co/datasets/linhtran92/viet_bud500/resolve/main/data/train-00000-of-00105-be5f872f8be772f5.parquet"
test_url = "https://huggingface.co/datasets/linhtran92/viet_bud500/resolve/main/data/test-00000-of-00002-531c1d81edb57297.parquet"
data_files = {"train": train_url, "test": test_url}
dataset = load_dataset("parquet", data_files=data_files)

# load dataset via streaming
dataset = load_dataset("linhtran92/viet_bud500", split='test', streaming=True)
dataset.take(2)

# load all (649158 samples, ~100 GB, ~2 hrs to complete)
dataset = load_dataset("linhtran92/viet_bud500", split="test")