<a href="https://colab.research.google.com/github/yujin-kimmm/mirdata_colab_example/blob/main/mirdata_colab_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Welcome to the mirdata Colab Example!**

This notebook provides a hands-on introduction to using the `mirdata` library in Google Colab. `mirdata` is a Python library designed to make it easy to load and work with common music information retrieval (MIR) datasets.

In this notebook, you will learn how to:

* Install `mirdata`
* Load a dataset
* Download and store the dataset
* Validate the dataset

Let's get started!

## Getting Ready

First, install the `mirdata` package.

In [1]:
!pip install mirdata

Collecting git+https://github.com/mir-dataset-loaders/mirdata.git
  Cloning https://github.com/mir-dataset-loaders/mirdata.git to /tmp/pip-req-build-714ct_qh
  Running command git clone --filter=blob:none --quiet https://github.com/mir-dataset-loaders/mirdata.git /tmp/pip-req-build-714ct_qh
  Resolved https://github.com/mir-dataset-loaders/mirdata.git to commit d3a92876cfa85c4ecd92348f44b393c345be9b2d
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting Deprecated>=1.2.14 (from mirdata==1.0.0rc1)
  Downloading Deprecated-1.2.18-py2.py3-none-any.whl.metadata (5.7 kB)
Collecting pretty_midi>=0.2.10 (from mirdata==1.0.0rc1)
  Downloading pretty_midi-0.2.10.tar.gz (5.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 5.6/5.6 MB 45.0 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting mido

Next, import the `mirdata` package.

In [2]:
import mirdata

To see all the datasets available in `mirdata`, print the list of dataset names:

In [3]:
print(mirdata.list_datasets())

['acousticbrainz_genre', 'baf', 'ballroom', 'beatles', 'beatport_key', 'billboard', 'brid', 'candombe', 'cante100', 'cipi', 'compmusic_carnatic_rhythm', 'compmusic_carnatic_varnam', 'compmusic_hindustani_rhythm', 'compmusic_indian_tonic', 'compmusic_jingju_acappella', 'compmusic_otmm_makam', 'compmusic_raga', 'cuidado', 'da_tacos', 'dagstuhl_choirset', 'dali', 'egfxset', 'filosax', 'four_way_tabla', 'freesound_one_shot_percussive_sounds', 'giantsteps_key', 'giantsteps_tempo', 'good_sounds', 'groove_midi', 'gtzan_genre', 'guitarset', 'hainsworth', 'haydn_op20', 'idmt_smt_audio_effects', 'ikala', 'irmas', 'jtd', 'maestro', 'mdb_stem_synth', 'medley_solos_db', 'medleydb_melody', 'medleydb_pitch', 'mridangam_stroke', 'mtg_jamendo_autotagging_moodtheme', 'openmic2018', 'orchset', 'phenicx_anechoic', 'queen', 'rwc_classical', 'rwc_jazz', 'rwc_popular', 'salami', 'saraga_carnatic', 'saraga_hindustani', 'scms', 'simac', 'slakh', 'tinysol', 'tonality_classicaldb', 'tonas', 'vocadito']


## Initialize a dataset

To use a loader in `mirdata`, you first need to initialize it. In this notebook, we will use `orchset` as an example.

In [4]:
dataset = mirdata.initialize('orchset')

### Dataset versions

`mirdata` supports multiple versions of a dataset. To see all available versions of a specific dataset, run `mirdata.list_dataset_versions('orchset')`. Pass the `version` parameter to `mirdata.initialize` if you want a version other than the default one.

In [10]:
# Skip this cell if you want to use the default version of the dataset.

# To see all available versions of a specific dataset:
print(mirdata.list_dataset_versions('orchset'))

# Use the 'version' parameter to select a version other than the default one.
# Replace data_home with your own directory.
dataset = mirdata.initialize('orchset', data_home='/choose/where/data/live', version="1.0")

## Download the dataset

In Colab, `mirdata` downloads datasets to the default directory ``/root/mir_datasets/<dataset_name>``. This directory is temporary and is reset every time you restart your Google Colab session.

In [5]:
dataset.download()

311MB [05:38, 963kB/s]                           
32.0kB [00:00, 50.6kB/s]                            


By default, data is downloaded to:
```
/root/mir_datasets/<dataset_name>
```
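The `/root` prefix comes from the home directory of the Colab user. As a rough sketch, assuming the default location follows the `~/mir_datasets/<dataset_name>` pattern shown above, the path can be reproduced with the standard library alone. `default_data_home` is a hypothetical helper written for this notebook, not a `mirdata` function.

```python
import os


def default_data_home(dataset_name):
    """Build the assumed default path ~/mir_datasets/<dataset_name>.

    Hypothetical helper, not part of mirdata.
    """
    home = os.path.expanduser("~")  # resolves to /root in Colab
    return os.path.join(home, "mir_datasets", dataset_name)


print(default_data_home("orchset"))
```

Outside Colab the printed path starts with your own home directory instead of `/root`.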

### Data Storage

There are several ways to keep the dataset so that you do not have to download it again every time you restart the session.

1. **Copying the downloaded dataset to Google Drive:** After downloading the dataset, you can copy it to your Google Drive for persistent storage and easy access across different Colab sessions.

In [9]:
from google.colab import drive
drive.mount('/content/drive')

# Replace <dataset_name> with the name of the dataset you downloaded.
!cp -r /root/mir_datasets/<dataset_name> /content/drive/MyDrive/<dataset_name>

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


2. **Setting a custom download path when initializing the dataset:** You can specify a different download directory when initializing the dataset loader using the `data_home` parameter. This allows you to download the dataset directly to a desired location, such as a mounted Google Drive folder.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

# Replace <dataset_name> with the name of the dataset and <Folder_Name> with your Drive folder.
import mirdata
dataset = mirdata.initialize('<dataset_name>', data_home='/content/drive/MyDrive/<Folder_Name>')
dataset.download() # Dataset will be downloaded to `data_home` directory.

3. **Accessing a dataset downloaded outside of Google Colab:** If you have already downloaded a dataset locally, you can upload it to your Google Drive or directly to the Colab environment and then initialize the `mirdata` loader with the path to the dataset directory using the `data_home` parameter.
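
The options above can be combined: a notebook can check whether a copy of the dataset already exists in persistent storage and fall back to the temporary local directory otherwise. The following is a minimal sketch using only the standard library. `choose_data_home` is a hypothetical helper (not part of `mirdata`), and both paths are placeholders to adapt to your own Drive layout.

```python
from pathlib import Path


def choose_data_home(persistent_dir, fallback_dir):
    """Return persistent_dir if it exists and is non-empty, else fallback_dir.

    Hypothetical helper, not part of mirdata.
    """
    persistent = Path(persistent_dir)
    if persistent.is_dir() and any(persistent.iterdir()):
        return str(persistent)
    return str(fallback_dir)


# Placeholder paths; adjust them to your own setup.
data_home = choose_data_home(
    "/content/drive/MyDrive/orchset",
    "/root/mir_datasets/orchset",
)
print(data_home)

# In Colab, the chosen path can then be passed to the loader:
# dataset = mirdata.initialize('orchset', data_home=data_home)
```

If the fallback directory is chosen, `dataset.download()` is still needed before the data can be used.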

## Validate the dataset

With the `validate()` method, we can check that the local files match the canonical version of the dataset and that they were downloaded correctly (none of them are missing or corrupted).

In [6]:
dataset.validate()

100%|██████████| 1/1 [00:00<00:00, 2732.45it/s]
100%|██████████| 64/64 [00:01<00:00, 47.20it/s]


({'metadata': {}, 'tracks': {}}, {'metadata': {}, 'tracks': {}})
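
`validate()` returns two dictionaries: the files listed in the dataset index that are missing locally, and the files whose checksums do not match. Empty dictionaries, as in the output above, mean nothing is wrong. As a minimal sketch, the hypothetical helper `summarize_validation` below (not part of `mirdata`) counts the entries in each. It assumes both dictionaries map a section name such as `'tracks'` to a collection of problem entries, as the output above suggests.

```python
def summarize_validation(missing_files, invalid_checksums):
    """Count problem entries per section in the two dicts returned by validate().

    Hypothetical helper, not part of mirdata. Assumes each dict maps a section
    name (e.g. 'metadata', 'tracks') to a collection of problem entries.
    """
    missing = {section: len(entries) for section, entries in missing_files.items()}
    invalid = {section: len(entries) for section, entries in invalid_checksums.items()}
    is_clean = sum(missing.values()) == 0 and sum(invalid.values()) == 0
    return {"missing": missing, "invalid": invalid, "is_clean": is_clean}


# The result shown above: nothing missing, no checksum mismatches.
result = ({'metadata': {}, 'tracks': {}}, {'metadata': {}, 'tracks': {}})
summary = summarize_validation(*result)
print(summary["is_clean"])
```

In the notebook, the same helper could be called as `summarize_validation(*dataset.validate())`.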

Now you are ready to use `mirdata` in Google Colab! Explore the usage examples in the `mirdata` documentation to learn about other datasets and tasks, then apply what you learn to your own music information retrieval projects.