## Import dependencies

In [1]:
import os
import tensorflow as tf

from data.abc import ABCPreProcessor

## Overview of Preprocessing steps

In this notebook, we preprocess two types of data: **ABC Notation** data and **Audio** data.

### ABC Notation Data

- Separate the **tune body**, **key**, **meter**, and **rhythm** from each ABC track, and store all other fields as metadata
- Use key and meter as conditioning symbols when generating a tune
- Tokenize the tune body according to a vocabulary of musical transcription tokens
- Create a TFRecord dataset of sequence examples of the form **[ tune, meter, key, rhythm ]**
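
The ABC steps listed above can be sketched in plain Python. This is only an illustration and not the `ABCPreProcessor` implementation: the function names `split_abc_track` and `tokenize_body`, the output dictionary layout, and the simplified token pattern are all assumptions made for this example. It relies only on the fact that ABC header lines have the form `<letter>:<value>` (for example `K:D` for the key).

```python
import re

# ABC header fields kept as conditioning symbols (key, meter, rhythm).
CONDITIONING_FIELDS = {"K": "key", "M": "meter", "R": "rhythm"}

# Simplified, hypothetical token pattern: barline variants first, then
# notes/rests with an optional accidental, octave marks, and duration.
TOKEN_PATTERN = re.compile(
    r"\|\]|\|\||\|:|:\||\||[_=^]?[A-Ga-gz][,']*\d*(?:/\d*)?"
)


def split_abc_track(text):
    """Split one ABC track into tune body, conditioning fields, and metadata."""
    conditioning, metadata, body_lines = {}, {}, []
    for line in text.strip().splitlines():
        line = line.strip()
        match = re.match(r"^([A-Za-z]):(.*)$", line)
        if match:
            field, value = match.group(1), match.group(2).strip()
            if field in CONDITIONING_FIELDS:
                conditioning[CONDITIONING_FIELDS[field]] = value
            else:
                # Every other header field is kept as metadata.
                metadata[field] = value
        elif line:
            body_lines.append(line)
    return {"tune": " ".join(body_lines), **conditioning, "metadata": metadata}


def tokenize_body(body):
    """Return the list of tokens matched by the simplified pattern."""
    return TOKEN_PATTERN.findall(body)


track = "X:1\nT:Example\nR:reel\nM:4/4\nK:D\n|:D2 FA|d2 z2:|"
parsed = split_abc_track(track)
tokens = tokenize_body(parsed["tune"])
```

A real vocabulary would also need to cover chords, ornaments, tuplets, and inline fields, which this pattern ignores.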

### Audio Data
- Split the full audio into short examples (4 seconds by default, adjustable with flags)
- Infer the fundamental frequency (or "pitch") with CREPE
- Compute the loudness features
- Create a TFRecord dataset of sequence examples of the form **[ Audio, f0_feature, loudness_feature ]**
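
As a rough sketch of the first audio step above, the function below cuts a sequence of samples into fixed-length examples. The name `split_into_examples`, the 16 kHz sample rate, and the choice to drop a trailing partial window are assumptions for illustration; the pitch (CREPE) and loudness steps are not shown.

```python
def split_into_examples(samples, sample_rate=16000, example_secs=4.0, hop_secs=None):
    """Cut a sequence of audio samples into fixed-length examples.

    sample_rate is an assumed value; hop_secs defaults to example_secs,
    which gives non-overlapping windows. A trailing partial window is dropped.
    """
    window = int(sample_rate * example_secs)
    hop = window if hop_secs is None else int(sample_rate * hop_secs)
    if window <= 0 or hop <= 0:
        raise ValueError("window and hop must be positive")
    examples = []
    for start in range(0, len(samples) - window + 1, hop):
        examples.append(samples[start:start + window])
    return examples


# Toy usage with a 1 Hz "sample rate" so the windows are easy to read.
toy_examples = split_into_examples(list(range(10)), sample_rate=1, example_secs=4)
```

Passing a smaller `hop_secs` produces overlapping examples, which increases the number of training examples taken from each recording.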

#### Each tune should be indexed so that its ID can be used to find both its ABC notation and its related audio files
- A tune can be associated with more than one audio file, and these files can differ in length

At the end of the notebook, we merge the two datasets to obtain a single TFRecord file that contains the preprocessed ABC data and the preprocessed audio files, indexed by tune.
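
One possible way to picture this merge is a dictionary keyed by tune ID, where each entry holds one ABC record and a list of audio records. The sketch below uses plain Python dictionaries rather than TFRecords; the function name `merge_by_tune_id` and the `tune_id` field are hypothetical and do not come from the project code.

```python
from collections import defaultdict


def merge_by_tune_id(abc_records, audio_records):
    """Group audio records under the ABC record that shares their tune ID."""
    audio_by_tune = defaultdict(list)
    for record in audio_records:
        audio_by_tune[record["tune_id"]].append(record)

    merged = {}
    for record in abc_records:
        tune_id = record["tune_id"]
        merged[tune_id] = {
            "abc": record,
            # A tune may have zero, one, or several audio files.
            "audio": audio_by_tune.get(tune_id, []),
        }
    return merged


abc_records = [{"tune_id": 1, "key": "D"}, {"tune_id": 2, "key": "G"}]
audio_records = [
    {"tune_id": 1, "path": "a.wav"},
    {"tune_id": 1, "path": "b.wav"},
]
merged = merge_by_tune_id(abc_records, audio_records)
```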

## Initialize common variables

In [2]:
# Set the path to the data store
BASE_DIR = "/home/richhiey/Desktop/workspace/projects/AI_Music_Challenge_2020/"
ABC_DATA_DIR = os.path.join(BASE_DIR, "datasets", "abc_data")
AUDIO_DATA_DIR = os.path.join(BASE_DIR, "datasets", "audio")
ABC_TFRECORD_DIR = os.path.join(BASE_DIR, "tfrecords", "abc")
AUDIO_TFRECORD_DIR = os.path.join(BASE_DIR, "tfrecords", "audio")
PROCESSED_ABC_FILENAME = "processed-abc-files"
PROCESSED_AUDIO_FILENAME = "processed-audio-files"

## Preprocessing - ABC Notation Dataset

#### To understand the underlying distribution of tunes in Irish music, it helps to visualize the following quantities:
- Maximum length of tunes in each category
- Number of tunes in each category
- Number of tunes in each key
- Number of tunes in each meter
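
The quantities listed above could be computed along these lines before plotting. This is a sketch under assumptions: `summarize_tunes` is a hypothetical helper, each tune is assumed to be a dictionary with `tune` (a token list), `key`, `meter`, and `rhythm` entries, and "category" is taken to mean the rhythm type.

```python
from collections import Counter, defaultdict


def summarize_tunes(tunes):
    """Count tunes per category, key, and meter, and find the longest tune per category."""
    max_length = defaultdict(int)
    for tune in tunes:
        category = tune["rhythm"]
        max_length[category] = max(max_length[category], len(tune["tune"]))
    return {
        "tunes_per_category": Counter(t["rhythm"] for t in tunes),
        "tunes_per_key": Counter(t["key"] for t in tunes),
        "tunes_per_meter": Counter(t["meter"] for t in tunes),
        "max_length_per_category": dict(max_length),
    }


toy_tunes = [
    {"tune": [2, 4, 2], "key": "D", "meter": "4/4", "rhythm": "reel"},
    {"tune": [2, 11, 14, 7], "key": "G", "meter": "4/4", "rhythm": "reel"},
    {"tune": [12, 27], "key": "D", "meter": "6/8", "rhythm": "single jig"},
]
stats = summarize_tunes(toy_tunes)
```

Each `Counter` can then be passed to a bar chart to show how unevenly the keys, meters, and categories are represented.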

In [3]:
preprocessor = ABCPreProcessor(ABC_TFRECORD_DIR, PROCESSED_ABC_FILENAME)
json_path = preprocessor.process(ABC_DATA_DIR)
tfrecord_path = preprocessor.save_as_tfrecord_dataset()
preprocessed_dataset = preprocessor.load_tfrecord_dataset()
#abc_dataset = preprocessor.prepare_dataset(preprocessed_dataset)

The raw data has already been processed. Pre-processed information found at - /home/richhiey/Desktop/workspace/projects/AI_Music_Challenge_2020/tfrecords/abc/processed-abc-files.json
{'tune': array([[ 2,  4,  2, ...,  0,  0,  0],
       [ 2, 11, 14, ...,  0,  0,  0],
       [12, 27, 12, ...,  0,  0,  0],
       ...,
       [11, 12,  1, ...,  0,  0,  0],
       [ 7, 24, 11, ...,  0,  0,  0],
       [ 9,  6,  3, ...,  0,  0,  0]], dtype=int32), 'K': {'E': 0, 'Dmix': 1, 'Gdor': 2, 'E ': 3, 'A': 4, 'Edor': 5, 'Em': 6, 'Gm': 7, 'Bdor': 8, 'Ador': 9, 'F#m': 10, 'Bm': 11, 'Gmix': 12, 'D': 13, 'Bmix': 14, 'Glyd': 15, 'Cmix': 16, 'F#dor': 17, 'Ddor': 18, 'Dm': 19, 'Emix': 20, 'Amix': 21, 'C': 22, 'F': 23, 'G': 24, 'Am': 25}, 'M': {'C': 0, '3/2': 1, '3/4': 2, '6/4': 3, 'C|': 4, '2/2': 5, '4/4': 6, '2/4': 7, '9/8': 8, '6/8': 9}, 'R': {'slip jig': 0, 'reel': 1, 'song': 2, 'Double jig, march': 3, 'carolan': 4, 'march': 5, 'waltz': 6, 'barndance': 7, 'single jig': 8, 'highland': 9, 'hornpipe': 10, '


In [None]:
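# Inspect the dataset object and its first example
print(preprocessed_dataset)
for x in preprocessed_dataset:
    # The tune is stored as a serialized tensor, so parse it back into int32 token IDs
    print(tf.io.parse_tensor(x['tune'], tf.int32))
    print(x['key'])
    print(x['rhythm'])
    print(x['meter'])
    break  # Only look at the first example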