## Import dependencies

In [1]:
import ddsp
import tensorflow as tf
import numpy
import matplotlib
import os
import glob
from data.abc import ABCPreProcessor

## Overview of Preprocessing steps

In this notebook, we preprocess two types of data: **ABC Notation** data and **Audio** data.

### ABC Notation Data

- Separate the **tune body**, **key**, and **meter** from each ABC track and store all other fields as metadata
- Use key and meter as conditioning symbols when generating a tune
- Tokenize the tune body according to a vocabulary of musical transcription tokens
- Create a TFRecord dataset consisting of sequence examples of the form **[ One-hot encoded tune body, meter, key ]**
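The tokenization and one-hot encoding steps can be illustrated with a minimal sketch. It is not the `ABCPreProcessor` implementation: the toy vocabulary, the character-level splitting, and the helper names `tokenize` and `one_hot` are all assumptions made for illustration.

```python
# Toy vocabulary (hypothetical; the real vocabulary is computed from the dataset).
VOCAB = ["<s>", "A", "B", "c", "d", "2", "|", " "]
TOKEN_TO_ID = {token: index for index, token in enumerate(VOCAB)}


def tokenize(tune_body):
    """Split a tune body into single-character tokens, prefixed with a start token.

    Character-level splitting and the "<s>" prefix are assumptions of this sketch.
    """
    return ["<s>"] + list(tune_body)


def one_hot(tokens):
    """Encode each token as a list holding a single 1 at the token's vocabulary index."""
    vectors = []
    for token in tokens:
        vector = [0] * len(VOCAB)
        vector[TOKEN_TO_ID[token]] = 1
        vectors.append(vector)
    return vectors


tokens = tokenize("A2 Bc|d")
encoded = one_hot(tokens)
print(len(tokens), len(encoded[0]))
```

Each row of `encoded` has the length of the vocabulary, so a tune becomes a sequence of fixed-width vectors that can be stored alongside its key and meter.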

### Audio Data
- Split the full audio into short examples (4 seconds by default, adjustable with flags)
- Infer the fundamental frequency (or "pitch") with CREPE
- Compute the loudness features
- Create a TFRecord dataset consisting of sequence examples of the form **[ Audio, f0_feature, loudness_feature ]**
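The first audio step, cutting a recording into fixed-length examples, can be sketched as follows. This is only an illustration under stated assumptions: the function name `split_into_examples`, the 16 kHz sample rate, and the choice of non-overlapping windows with the remainder dropped are all hypothetical, and the actual pipeline may window the audio differently.

```python
def split_into_examples(samples, sample_rate, example_secs=4.0):
    """Cut a sequence of audio samples into non-overlapping windows.

    Each window covers example_secs seconds. A trailing remainder shorter
    than one window is dropped (an assumption of this sketch).
    """
    window = int(sample_rate * example_secs)
    n_full = len(samples) // window
    return [samples[i * window:(i + 1) * window] for i in range(n_full)]


# 10 seconds of silence at a hypothetical 16 kHz sample rate.
sample_rate = 16000
audio = [0.0] * (10 * sample_rate)
examples = split_into_examples(audio, sample_rate)
print(len(examples), len(examples[0]))
```

Pitch and loudness features would then be computed per example, so every example in the dataset has the same duration.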

#### Each tune should be indexed so that its ID can be used to look up both its ABC notation and its related audio files
- A tune can be associated with more than one audio file (the audio files can differ in length)

At the end of the notebook, we merge both datasets into a single TFRecord file containing the preprocessed ABC data and the preprocessed audio files, indexed by tune.
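The indexing idea can be sketched with plain Python data structures before any TFRecord serialization is involved. The tune IDs, file names, and the helper `merge_by_tune_id` below are hypothetical placeholders, not part of the notebook's code.

```python
from collections import defaultdict

# Hypothetical (tune_id, payload) pairs standing in for the preprocessed data.
abc_records = [("tune_1", "A2 Bc|d"), ("tune_2", "G2 AB|c")]
audio_records = [
    ("tune_1", "tune_1_take_a.wav"),
    ("tune_1", "tune_1_take_b.wav"),
    ("tune_2", "tune_2_take_a.wav"),
]


def merge_by_tune_id(abc_records, audio_records):
    """Group audio files by tune ID and attach them to the matching ABC record."""
    audio_by_id = defaultdict(list)
    for tune_id, audio_file in audio_records:
        audio_by_id[tune_id].append(audio_file)
    return {
        tune_id: {"abc": abc, "audio": audio_by_id.get(tune_id, [])}
        for tune_id, abc in abc_records
    }


merged = merge_by_tune_id(abc_records, audio_records)
print(merged["tune_1"])
```

Keeping the audio as a list per tune reflects the one-to-many relationship: a single ABC transcription can map to several recordings.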

## Initialize common variables

In [2]:
# Root path of the datastore (machine-specific; adjust to your setup)
BASE_DIR = "/home/richhiey/Desktop/workspace/projects/AI_Music_Challenge_2020/"
ABC_DATA_DIR = os.path.join(BASE_DIR, "datasets", "abc_data")
AUDIO_DATA_DIR = os.path.join(BASE_DIR, "datasets", "audio")
ABC_TFRECORD_DIR = os.path.join(BASE_DIR, "tfrecords", "abc")
AUDIO_TFRECORD_DIR = os.path.join(BASE_DIR, "tfrecords", "audio")
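It may help to make sure the output directories exist before writing any TFRecord files. The sketch below assumes the preprocessors do not create them on their own, which the notebook does not confirm; a temporary directory stands in for `BASE_DIR` so the example runs on any machine.

```python
import os
import tempfile

# A temporary directory stands in for BASE_DIR (assumption, for portability).
base_dir = tempfile.mkdtemp()
abc_tfrecord_dir = os.path.join(base_dir, "tfrecords", "abc")
audio_tfrecord_dir = os.path.join(base_dir, "tfrecords", "audio")

# exist_ok=True makes the call safe to repeat when re-running the cell.
for directory in (abc_tfrecord_dir, audio_tfrecord_dir):
    os.makedirs(directory, exist_ok=True)

print(os.path.isdir(abc_tfrecord_dir), os.path.isdir(audio_tfrecord_dir))
```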

## Preprocessing - ABC Notation Dataset

In [3]:
preprocessor = ABCPreProcessor(ABC_DATA_DIR, ABC_TFRECORD_DIR, 'processed-abc-files')
# Store the extracted information in a structured JSON file (disabled in this run)
# preprocessor.process()

# Compute features and create a dataset
preprocessor.calculate_statistics()

preprocessor.save_as_tfrecord_dataset()

Length of data: 3127
Size of Vocabulary: 68
{'[', '^', 'M', 'J', '(', '_', '|', 'B', "'", 'l', 't', 'm', '<', 'P', '=', 'o', 'r', 'z', '~', 'f', '2', '>', 'v', '.', '3', 'k', '{', '<s>', ',', '#', '1', '8', '}', 'e', '7', 'T', 'g', 'D', 'E', 'b', 'S', 'c', '-', 'p', 'F', 'A', '6', 'n', 'Q', 'd', 'G', '4', 'H', '\\', ' ', ']', '"', 'h', 'O', 's', 'C', ':', 'i', ')', 'a', 'L', '/', '9'}
Number of modal keys: 26
{'Dm', 'Emix', 'Ddor', 'Edor', 'Am', 'Bmix', 'Gdor', 'Dmix', 'Em', 'Amix', 'Bdor', 'D', 'Ador', 'E', 'Glyd', 'F', 'A', 'Cmix', 'F#m', 'G', 'Gm', 'F#dor', 'C', 'E ', 'Bm', 'Gmix'}
Number of musical meters: 8
{'3/2', '2/4', '4/4', '6/8', '3/4', '9/8', '6/4', '2/2'}
Number of rhythms: 23
{'single jig', 'mazurka', 'set dance', 'slide', 'waltz', 'hop jig', 'fling', 'polka', 'slip jig', 'Double jig, march', 'jig', 'Double jig', 'hornpipe', 'reel', 'song', 'march', 'slow air', 'air', 'carolan', 'country dance', 'barndance', 'strathspey', 'highland'}
{'Dm': 0, 'Emix': 1, 'Ddor': 2, 'Edor'