## How to use
The datasets library lets you load and pre-process your dataset in pure Python, at scale. A single call to the load_dataset function downloads the dataset and prepares it on your local drive.

For example, to download the Persian config, specify the corresponding language config name (i.e., "fa_ir" for Persian):

In [1]:
from datasets import load_dataset
fleurs = load_dataset("google/fleurs", "fa_ir", split="train")

Downloading builder script: 100%|██████████| 12.6k/12.6k [00:00<00:00, 12.6MB/s]
Downloading readme: 100%|██████████| 13.3k/13.3k [00:00<00:00, 13.4MB/s]
Downloading data: 100%|██████████| 2.14G/2.14G [23:07<00:00, 1.54MB/s]
Downloading data: 100%|██████████| 272M/272M [02:51<00:00, 1.58MB/s]
Downloading data: 100%|██████████| 657M/657M [07:57<00:00, 1.38MB/s]
Downloading data files: 100%|██████████| 3/3 [34:06<00:00, 682.12s/it]
Extracting data files: 100%|██████████| 3/3 [00:31<00:00, 10.38s/it]
Downloading data: 100%|██████████| 2.48M/2.48M [00:02<00:00, 1.22MB/s]
Downloading data: 100%|██████████| 290k/290k [00:00<00:00, 804kB/s]
Downloading data: 100%|██████████| 708k/708k [00:00<00:00, 891kB/s]
Downloading data files: 100%|██████████| 3/3 [00:06<00:00,  2.06s/it]
Generating train split: 0 examples [00:00, ? examples/s]


DatasetGenerationError: An error occurred while generating the dataset

Using the datasets library, you can also stream the dataset on-the-fly by adding a streaming=True argument to the load_dataset function call. Loading a dataset in streaming mode loads individual samples of the dataset at a time, rather than downloading the entire dataset to disk.

In [None]:
from datasets import load_dataset
fleurs = load_dataset("google/fleurs", "fa_ir", split="train", streaming=True)
print(next(iter(fleurs)))
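
Because a streamed split is consumed lazily, a convenient way to peek at a few samples is to slice the iterator rather than index it. The following is a minimal sketch: `take` and `fake_stream` are hypothetical names introduced here for illustration, and the generator only stands in for the streamed `fleurs` object so the snippet runs without downloading anything.

```python
from itertools import islice


def take(stream, n):
    """Return the first n items of an iterable, pulling only those n items."""
    return list(islice(stream, n))


# Stand-in for a streamed split (an assumption for this sketch): a generator
# that yields examples lazily. With the real dataset, pass `fleurs` instead.
def fake_stream():
    for i in range(1000):
        yield {"id": i, "transcription": f"sample {i}"}


first_three = take(fake_stream(), 3)
print([example["id"] for example in first_three])
```

With the real dataset, `take(fleurs, 3)` should fetch only the data needed for the first three examples.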

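You can also create a PyTorch DataLoader directly from the dataset, whether it is stored locally or streamed.

### Local: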

In [None]:
from datasets import load_dataset
from torch.utils.data import DataLoader
from torch.utils.data.sampler import BatchSampler, RandomSampler
fleurs = load_dataset("google/fleurs", "fa_ir", split="train")
batch_sampler = BatchSampler(RandomSampler(fleurs), batch_size=32, drop_last=False)
dataloader = DataLoader(fleurs, batch_sampler=batch_sampler)

### Streaming:

In [None]:
from datasets import load_dataset
from torch.utils.data import DataLoader
fleurs = load_dataset("google/fleurs", "fa_ir", split="train", streaming=True)
dataloader = DataLoader(fleurs, batch_size=32)
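
Audio clips differ in length, so batching them usually needs a custom collate function that pads each waveform to the longest one in the batch. The sketch below shows one way to do this in plain Python. `pad_collate` is a hypothetical helper, not part of datasets or PyTorch, and it assumes each example exposes its waveform under `example["audio"]["array"]`.

```python
def pad_collate(batch, pad_value=0.0):
    """Pad every waveform in the batch to the length of the longest one."""
    waveforms = [list(example["audio"]["array"]) for example in batch]
    lengths = [len(w) for w in waveforms]
    max_len = max(lengths)
    padded = [w + [pad_value] * (max_len - len(w)) for w in waveforms]
    return {
        "input_values": padded,  # equal-length lists of samples
        "lengths": lengths,      # original lengths, useful for masking
        "transcription": [example["transcription"] for example in batch],
    }


# Stand-in batch (an assumption for this sketch) with the same nesting
# as a decoded example.
batch = [
    {"audio": {"array": [0.1, 0.2, 0.3], "sampling_rate": 16000}, "transcription": "a"},
    {"audio": {"array": [0.5], "sampling_rate": 16000}, "transcription": "b"},
]
collated = pad_collate(batch)
print(collated["input_values"])
```

Such a function can be passed to the loader as `DataLoader(fleurs, batch_size=32, collate_fn=pad_collate)`. In practice you would likely convert the padded lists to tensors inside the function.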

## 1. Speech Recognition (ASR)

In [None]:
from datasets import load_dataset

fleurs_asr = load_dataset("google/fleurs", "fa_ir")  # for Persian
# to download all data for multi-lingual fine-tuning, uncomment the following line
# fleurs_asr = load_dataset("google/fleurs", "all")

# see structure
print(fleurs_asr)

# load audio sample on the fly
audio_input = fleurs_asr["train"][0]["audio"]  # first decoded audio sample
transcription = fleurs_asr["train"][0]["transcription"]  # first transcription
# use `audio_input` and `transcription` to fine-tune your model for ASR

# for analyses see language groups
all_language_groups = fleurs_asr["train"].features["lang_group_id"].names
lang_group_id = fleurs_asr["train"][0]["lang_group_id"]

all_language_groups[lang_group_id]
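
Before fine-tuning, it can help to check how long each clip is. The sketch below assumes the decoded `audio_input` behaves like a dictionary with an `array` of samples and a `sampling_rate` in Hz; the exact return type may differ between datasets versions. `duration_seconds` is a hypothetical helper, and the dictionary is a stand-in so the snippet runs on its own.

```python
def duration_seconds(audio):
    """Clip length in seconds: number of samples divided by the sampling rate."""
    return len(audio["array"]) / audio["sampling_rate"]


# Stand-in for `audio_input` (an assumption for this sketch):
# 32,000 samples at 16 kHz.
audio_input = {"path": "example.wav", "array": [0.0] * 32000, "sampling_rate": 16000}
print(duration_seconds(audio_input))  # 2.0
```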

## 2. Language Identification
LangID can often amount to domain classification, but in FLEURS-LangID the recordings are made in a similar setting across languages and the utterances correspond to n-way parallel sentences in exactly the same domain, which makes this task particularly relevant for evaluating LangID. The setup is simple: FLEURS-LangID is split into train/valid/test for each language, and we create a single train/valid/test for LangID by merging them all.

In [None]:
from datasets import load_dataset

fleurs_langID = load_dataset("google/fleurs", "all") # to download all data

# see structure
print(fleurs_langID)

# load audio sample on the fly
audio_input = fleurs_langID["train"][0]["audio"]  # first decoded audio sample
language_class = fleurs_langID["train"][0]["lang_id"]  # first id class
language = fleurs_langID["train"].features["lang_id"].names[language_class]

# use audio_input and language_class to fine-tune your model for audio classification
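
Audio classification models typically need a mapping between integer class ids and language names in both directions. A minimal sketch of building those lookups from an ordered list of names follows. `build_label_maps` is a hypothetical helper, and the three names below are an illustrative subset; the real list comes from `fleurs_langID["train"].features["lang_id"].names`.

```python
def build_label_maps(names):
    """Build id -> name and name -> id lookups from an ordered list of class names."""
    id2label = dict(enumerate(names))
    label2id = {name: idx for idx, name in id2label.items()}
    return id2label, label2id


# Hypothetical subset of class names, in a made-up order, for illustration only.
names = ["af_za", "fa_ir", "hi_in"]
id2label, label2id = build_label_maps(names)
print(id2label[1], label2id["hi_in"])
```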

## 3. Retrieval
Retrieval provides n-way parallel speech and text data. Similar to how XTREME for text leverages Tatoeba to evaluate bitext mining (a.k.a. sentence translation retrieval), we use Retrieval to evaluate the quality of fixed-size representations of speech utterances. Our goal is to incentivize the creation of fixed-size speech encoders for speech retrieval. The system has to retrieve the English "key" utterance corresponding to the speech translation of "queries" in 15 languages. Results have to be reported on the test sets of Retrieval, whose utterances are used as queries (and as keys for English). We augment the English keys with a large number of utterances to make the task more difficult.

In [None]:
from datasets import load_dataset

fleurs_retrieval = load_dataset("google/fleurs", "fa_ir")  # for Persian
# to download all data for multi-lingual fine-tuning, uncomment the following line
# fleurs_retrieval = load_dataset("google/fleurs", "all")

# see structure
print(fleurs_retrieval)

# load audio sample on the fly
audio_input = fleurs_retrieval["train"][0]["audio"]  # decoded audio sample
text_sample_pos = fleurs_retrieval["train"][0]["transcription"]  # positive text sample
text_sample_neg = fleurs_retrieval["train"][1:20]["transcription"] # negative text samples

# use `audio_input`, `text_sample_pos`, and `text_sample_neg` to fine-tune your model for retrieval
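
Once utterances are encoded as fixed-size vectors, retrieval reduces to finding the key whose embedding is closest to the query. The sketch below uses cosine similarity, which is one common choice and not necessarily the metric used for the benchmark. The embeddings are made-up toy vectors, and `cosine` and `retrieve` are hypothetical helpers.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query, keys):
    """Return the index of the key embedding most similar to the query."""
    scores = [cosine(query, key) for key in keys]
    return max(range(len(keys)), key=scores.__getitem__)


# Toy embeddings (assumptions for this sketch): one query and three candidate keys.
query = [1.0, 0.0, 1.0]
keys = [[0.0, 1.0, 0.0], [0.9, 0.1, 1.1], [-1.0, 0.0, -1.0]]
best = retrieve(query, keys)
print(best)
```

In a real setup, `query` would be the encoder output for a non-English utterance and `keys` the encoder outputs for the English candidates.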