# 00-download_dataset_annotations

This notebook downloads `annotations.json` from the `dlfm2016-fix1` release of [otmm_makam_recognition_dataset](https://github.com/sertansenturk/otmm_makam_recognition_dataset/blob/dlfm2016-fix1/annotations.json). It then creates an mlflow run under an experiment named `data_processing`, parses the annotations, and saves them as an artifact, tagged with the notebook source, git commit, and dataset configuration.

In [1]:
import configparser
import importlib
import json
import logging
import os
import tempfile

import mlflow
import numpy as np
import pandas as pd


## Stop if annotations were fetched in the past


In [2]:
experiment_name = "data_processing"
run_name = "download_annotations"

experiment = mlflow.get_experiment_by_name(experiment_name)
if experiment is not None:
    annotation_runs = mlflow.search_runs(
        experiment_ids=[experiment.experiment_id],
        filter_string=f"tags.mlflow.runName = '{run_name}'")

    assert len(annotation_runs) == 0, (
        f"There is already a run for {run_name}: {', '.join(annotation_runs.run_id)}. "
        "Overwriting is not permitted. Please inspect the run in the mlflow UI "
        "and manually make the necessary corrections.")

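The guard above relies on `assert`, which Python skips when it runs with the `-O` flag. A minimal sketch of the same check as an explicit exception could look like the following. `ensure_no_previous_run` is a hypothetical helper, not part of the notebook or of mlflow, and it takes the run IDs as a plain list (standing in for `annotation_runs.run_id`) so that it can be tried without a tracking server.

```python
from typing import Sequence


def ensure_no_previous_run(run_ids: Sequence[str], run_name: str) -> None:
    """Raise if any run with the given name already exists.

    Hypothetical helper: `run_ids` stands in for `annotation_runs.run_id`.
    """
    if len(run_ids) > 0:
        raise RuntimeError(
            f"There is already a run for {run_name}: {', '.join(run_ids)}. "
            "Overwriting is not permitted."
        )


# No earlier runs: the check passes silently.
ensure_no_previous_run([], "download_annotations")

# An earlier run exists (made-up ID): the check raises.
try:
    ensure_no_previous_run(["abc123"], "download_annotations")
    message = None
except RuntimeError as err:
    message = str(err)
```

Raising an exception keeps the guard active regardless of interpreter flags, at the cost of a few more lines than the `assert`.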

## Init logger

In [3]:
importlib.reload(logging)  # fix jupyter logging: https://stackoverflow.com/a/21475297
logging.basicConfig(level=logging.INFO)

# create logger
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)

logger.info("Logger initiated...")

INFO:__main__:Logger initiated...


## Read relevant parts of the configuration

In [4]:
config_dir = "../config"

config = configparser.ConfigParser()
config_file = config.read(os.path.join(config_dir, 'config.ini'))
logger.info(f"Reading configuration from {config_file}")

annotation_github_file = config["dataset"]["annotation_file"]
num_recordings = config.getint("dataset", "num_recordings")
num_recordings_per_makam = config.getint("dataset", "num_recordings_per_makam")
num_makams = num_recordings // num_recordings_per_makam  # 20
logger.info(f"annotations.json URL: {annotation_github_file}")

INFO:__main__:Reading configuration from ['../config/config.ini']
INFO:__main__:annotations.json URL: https://raw.githubusercontent.com/sertansenturk/otmm_makam_recognition_dataset/dlfm2016-fix1/annotations.json
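
For reference, the cell above expects a `[dataset]` section with three keys. The sketch below parses an equivalent configuration from a string. The key names and the URL come from the notebook, but the values `1000` and `50` are assumptions inferred from the `# 20` comment and the validation step later on, and `example_config` is an illustrative name.

```python
import configparser

# Assumed contents of config.ini; only the key names and the URL come from the notebook.
CONFIG_TEXT = """
[dataset]
annotation_file = https://raw.githubusercontent.com/sertansenturk/otmm_makam_recognition_dataset/dlfm2016-fix1/annotations.json
num_recordings = 1000
num_recordings_per_makam = 50
"""

example_config = configparser.ConfigParser()
example_config.read_string(CONFIG_TEXT)

example_num_recordings = example_config.getint("dataset", "num_recordings")
example_per_makam = example_config.getint("dataset", "num_recordings_per_makam")

# Integer division keeps the makam count an int rather than a float.
example_num_makams = example_num_recordings // example_per_makam
```

Under these assumed values, `example_num_makams` is the integer `20`.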


## Download annotations from `otmm_makam_recognition_dataset`

We use the `otmm_makam_recognition_dataset`, which was compiled for the following paper:

    Karakurt, A., Şentürk, S., & Serra, X. (2016). MORTY: A Toolbox for Mode Recognition and Tonic Identification. 3rd International Digital Libraries for Musicology Workshop. New York, USA.
    
We read the annotations from `dlfm2016-fix1`, the latest release of the dataset as of April 2020. 

### Read from github

In [5]:
logger.info(f"Reading annotations from {annotation_github_file}")
annotations = pd.read_json(annotation_github_file)
annotations["mb_url"] = annotations["mbid"]
annotations["mbid"] = annotations["mbid"].str.split("/").str[-1]


INFO:__main__:Reading annotations from https://raw.githubusercontent.com/sertansenturk/otmm_makam_recognition_dataset/dlfm2016-fix1/annotations.json
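
In the source file, the `mbid` field holds a full MusicBrainz URL. The cell above keeps that URL as `mb_url` and reduces `mbid` to the trailing identifier. A small sketch of that step on a toy frame follows; the identifiers are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy data with made-up identifiers, shaped like the raw `mbid` column.
toy = pd.DataFrame({
    "mbid": [
        "http://musicbrainz.org/recording/aaaa-1111",
        "http://musicbrainz.org/recording/bbbb-2222",
    ]
})

# Keep the full URL, then reduce `mbid` to the last path segment.
toy["mb_url"] = toy["mbid"]
toy["mbid"] = toy["mbid"].str.rsplit("/", n=1).str[-1]

extracted = toy["mbid"].tolist()
```

Splitting once from the right is enough here, because only the last path segment is needed.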


### Populate `dunya_uid`s

The MBIDs in the `CompMusic makam music corpus` may be outdated, i.e. they may not point to the master MBID. `otmm_makam_recognition_dataset` patches such recordings with an extra `dunya_uid` key.

Below, we fill each missing `dunya_uid` with the recording's `mbid`, so that every row carries the correct identifier for requests to the Dunya API.

In [6]:
annotations.loc[annotations["dunya_uid"].isna(), "dunya_uid"] = \
    annotations.loc[annotations["dunya_uid"].isna(), "mbid"]

display(annotations.head())

| | mbid | verified | tonic | makam | observations | dunya_uid | mb_url |
|---|---|---|---|---|---|---|---|
| 0 | 00f1c6d9-c8ee-45e3-a06f-0882ebcb4e2f | False | 256.0 | Acemasiran | | 00f1c6d9-c8ee-45e3-a06f-0882ebcb4e2f | http://musicbrainz.org/recording/00f1c6d9-c8ee... |
| 1 | 168f7c75-84fb-4316-99d7-acabadd3b2e6 | False | 115.2 | Acemasiran | | 168f7c75-84fb-4316-99d7-acabadd3b2e6 | http://musicbrainz.org/recording/168f7c75-84fb... |
| 2 | 24f549dd-3fa4-4e9b-a356-778fbbfd5cad | False | 232.5 | Acemasiran | | 24f549dd-3fa4-4e9b-a356-778fbbfd5cad | http://musicbrainz.org/recording/24f549dd-3fa4... |
| 3 | 407bb0b4-f19b-42ab-8c0a-9f1263126951 | False | 233.5 | Acemasiran | | 407bb0b4-f19b-42ab-8c0a-9f1263126951 | http://musicbrainz.org/recording/407bb0b4-f19b... |
| 4 | 443819eb-6092-420c-bd86-d946a0ad6555 | False | 219.6 | Acemasiran | | 443819eb-6092-420c-bd86-d946a0ad6555 | http://musicbrainz.org/recording/443819eb-6092... |
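
The backfill in the cell above can also be expressed with `Series.fillna`, which aligns the two columns by index. The sketch below shows this on a toy frame with made-up identifiers; it is an alternative spelling, not the notebook's own code.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "mbid": ["aaaa-1111", "bbbb-2222", "cccc-3333"],
    # Only the second recording carries a patched identifier.
    "dunya_uid": [np.nan, "dddd-4444", np.nan],
})

# Use the patched `dunya_uid` where present, otherwise fall back to `mbid`.
toy["dunya_uid"] = toy["dunya_uid"].fillna(toy["mbid"])

resolved = toy["dunya_uid"].tolist()
```

Rows that already have a `dunya_uid` keep it, and the remaining rows inherit their `mbid`.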


### Validate annotations

In [7]:
assert len(annotations.mbid) == num_recordings, (
    f"Expected {num_recordings} recordings, found {len(annotations.mbid)}")
assert len(annotations.mbid.unique()) == num_recordings, "MusicBrainz IDs (MBIDs) are not unique"

makam_counts = annotations.makam.value_counts()
assert len(makam_counts) == num_makams, f"Expected {num_makams} makams, found {len(makam_counts)}"
# every makam should have the same, configured number of recordings
np.testing.assert_array_equal(makam_counts.unique(), [num_recordings_per_makam])
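
The checks above can be tried on a small frame before running them against the full dataset. The sketch below wraps them in a hypothetical helper, `validate_annotations`, that raises `ValueError` instead of asserting. The toy data uses made-up IDs and a tiny two-makam layout purely for illustration.

```python
import pandas as pd


def validate_annotations(df, num_recordings, num_recordings_per_makam):
    """Hypothetical helper mirroring the checks above; raises ValueError on failure."""
    if len(df) != num_recordings:
        raise ValueError(f"Expected {num_recordings} recordings, found {len(df)}")
    if not df["mbid"].is_unique:
        raise ValueError("MusicBrainz IDs (MBIDs) are not unique")

    makam_counts = df["makam"].value_counts()
    expected_makams = num_recordings // num_recordings_per_makam
    if len(makam_counts) != expected_makams:
        raise ValueError(f"Expected {expected_makams} makams, found {len(makam_counts)}")
    if not (makam_counts == num_recordings_per_makam).all():
        raise ValueError("Makams do not all have the same number of recordings")


# Toy data: 2 makams with 2 recordings each (made-up IDs).
toy = pd.DataFrame({
    "mbid": ["a", "b", "c", "d"],
    "makam": ["Hicaz", "Hicaz", "Rast", "Rast"],
})
validate_annotations(toy, num_recordings=4, num_recordings_per_makam=2)

# An unbalanced makam distribution is rejected.
unbalanced = toy.assign(makam=["Hicaz", "Hicaz", "Hicaz", "Rast"])
try:
    validate_annotations(unbalanced, num_recordings=4, num_recordings_per_makam=2)
    error = None
except ValueError as err:
    error = str(err)
```

The balanced frame passes silently, while the unbalanced one fails on the last check because the makam count is still two but the per-makam totals differ.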


## Log annotations

In [8]:
# start run
mlflow.set_experiment(experiment_name)
with mlflow.start_run(run_name=run_name) as mlflow_run:
    run_id = mlflow_run.info.run_id
    
    git_repo_url = !git config --get remote.origin.url
    git_commit = !git rev-parse HEAD
    notebook_name = "./notebooks/00-download_dataset_annotations.ipynb"
    dataset_tags = {"dataset_" + key: val for key, val in dict(config["dataset"]).items()}
    mlflow.set_tags({
        "mlflow.source.type": "NOTEBOOK",
        "mlflow.source.name": notebook_name,
        "mlflow.source.git.commit": git_commit[0],
        "mlflow.source.git.repoURL": git_repo_url[0],
        **dataset_tags
    })
    
    with tempfile.TemporaryDirectory() as tmp_dir:
        annotations_tmp_file = os.path.join(tmp_dir, "annotations.json")
        annotations.to_json(annotations_tmp_file, orient="records")
        
        mlflow.log_artifacts(tmp_dir)
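
The artifact is written with `orient="records"`, i.e. as a JSON list with one object per recording and no stored index. A minimal sketch of that file format round trip follows, using a toy frame with made-up IDs and a temporary directory instead of an mlflow run, so it makes no assumptions about how the artifact is later retrieved.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({
    "mbid": ["aaaa-1111", "bbbb-2222"],
    "tonic": [256.0, 115.2],
    "makam": ["Acemasiran", "Acemasiran"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "annotations.json")
    # orient="records" writes a JSON list of row objects; the index is not stored.
    toy.to_json(path, orient="records")
    restored = pd.read_json(path, orient="records")

columns_match = list(restored.columns) == list(toy.columns)
values_match = restored["tonic"].tolist() == toy["tonic"].tolist()
```

Because the index is dropped on write, the restored frame gets a fresh default index, which is harmless here since the notebook's frame also uses the default one.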
