## Import dependencies

In [None]:
import os
# Hide all GPUs so that TensorFlow runs on the CPU (set before importing TensorFlow)
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
import tensorflow as tf
%matplotlib inline

from data.abc import ABCPreProcessor

## Overview of Preprocessing steps

In this notebook, we preprocess two types of data: **ABC Notation** data and **Audio** data.

### ABC Notation Data

- Extract the **tune body**, **key**, **meter** and **rhythm** from each ABC track, and store all other fields as metadata
- Use key and meter as conditioning symbols when generating a tune
- Tokenize the tune according to a vocabulary of musical transcription tokens
- Create a TFRecord dataset of sequence examples of the form **[ tune, meter, key, rhythm ]**
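
The ABC steps above can be sketched in plain Python. This is a minimal illustration and not the `ABCPreProcessor` implementation: the sample tune, the `split_abc` helper and the token pattern are all assumptions made for the example, and the pattern covers only bar lines, repeat marks and simple notes.

```python
import re

# Hypothetical sample tune, written for illustration only.
SAMPLE_ABC = """X:1
T:Example Jig
R:jig
M:6/8
K:Dmaj
|:DFA dAF|GBd gdB:|
"""

# Header fields kept as conditioning symbols; everything else is metadata.
CONDITIONING_FIELDS = {"K": "key", "M": "meter", "R": "rhythm"}

# Assumed token pattern: repeat marks, bar lines, or a note with an optional
# accidental, octave marks and length.
TOKEN_PATTERN = re.compile(r"\|:|:\||\||[_=^]?[A-Ga-gz][,']*\d*(?:/\d*)?")


def split_abc(abc_text):
    """Split an ABC tune into conditioning fields (plus body) and metadata."""
    fields, metadata, body_lines = {}, {}, []
    for line in abc_text.strip().splitlines():
        header = re.match(r"^([A-Za-z]):(.*)$", line)
        if header and not body_lines:
            code, value = header.group(1), header.group(2).strip()
            if code in CONDITIONING_FIELDS:
                fields[CONDITIONING_FIELDS[code]] = value
            else:
                metadata[code] = value
        else:
            # The first non-header line starts the tune body.
            body_lines.append(line.strip())
    fields["tune"] = "".join(body_lines)
    return fields, metadata


fields, metadata = split_abc(SAMPLE_ABC)

# Tokenize the tune body and map each token to an integer id.
tokens = TOKEN_PATTERN.findall(fields["tune"])
vocab = {token: index for index, token in enumerate(sorted(set(tokens)))}
token_ids = [vocab[token] for token in tokens]

# One sequence example in the [ tune, meter, key, rhythm ] layout.
example = [token_ids, fields["meter"], fields["key"], fields["rhythm"]]
```

In the notebook itself, the tokenizer returned by `preprocessor.create_tokenizer()` plays the role of `vocab`, and the examples are written to TFRecord files instead of being kept as Python lists.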

### Audio Data
- Split the full audio into short examples (4 seconds by default, adjustable with flags)
- Infer the fundamental frequency (or "pitch") with CREPE
- Compute the loudness features
- Create a TFRecord dataset of sequence examples of the form **[ Audio, f0_feature, loudness_feature ]**
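
As a rough sketch of the windowing and loudness steps above, the following uses only the standard library. The 16 kHz sample rate, the helper names and the synthetic sine wave are assumptions for illustration. A plain RMS level in decibels stands in for the loudness feature, which may differ from what the actual pipeline computes. Pitch estimation with CREPE is not shown.

```python
import math

SAMPLE_RATE = 16000   # assumed sample rate in Hz
EXAMPLE_SECONDS = 4   # default example length mentioned above


def split_into_examples(samples, sample_rate=SAMPLE_RATE, seconds=EXAMPLE_SECONDS):
    """Cut audio into non-overlapping fixed-length windows, dropping the remainder."""
    window = sample_rate * seconds
    return [samples[start:start + window]
            for start in range(0, len(samples) - window + 1, window)]


def rms_db(window, floor=1e-7):
    """RMS level in decibels, used here as a simple stand-in for loudness."""
    mean_square = sum(x * x for x in window) / len(window)
    return 20.0 * math.log10(max(math.sqrt(mean_square), floor))


# Ten seconds of a synthetic 440 Hz sine wave at amplitude 0.5.
audio = [0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)
         for n in range(10 * SAMPLE_RATE)]

examples = split_into_examples(audio)
loudness = [rms_db(window) for window in examples]
```

Ten seconds of audio yield two full 4-second examples here; the trailing 2 seconds are discarded. Padding the last window instead would be an equally reasonable choice.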

#### Each tune should be indexed so that its ID can be used to find both its ABC notation and its related audio files
- A tune can be associated with more than one audio file, and these files can differ in length

At the end of the notebook, both datasets should be merged into a single TFRecord file containing the preprocessed ABC data and the preprocessed audio files, indexed by tune.
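
One way to picture the indexing and merge is a dictionary keyed by tune ID. The records, field names and file names below are hypothetical, and the merged result is a list of Python dictionaries, not a TFRecord file.

```python
from collections import defaultdict

# Hypothetical preprocessed ABC records, one per tune.
abc_records = [
    {"tune_id": 1, "tune": [3, 7, 7, 2], "meter": "6/8", "key": "Dmaj", "rhythm": "jig"},
    {"tune_id": 2, "tune": [3, 5, 2], "meter": "4/4", "key": "Gmaj", "rhythm": "reel"},
]

# Hypothetical audio records; a tune can have several recordings of different lengths.
audio_records = [
    {"tune_id": 1, "audio_file": "tune1_take1.wav", "seconds": 32.0},
    {"tune_id": 1, "audio_file": "tune1_take2.wav", "seconds": 47.5},
    {"tune_id": 2, "audio_file": "tune2_take1.wav", "seconds": 28.0},
]

# Group the audio records by tune ID.
audio_by_tune = defaultdict(list)
for record in audio_records:
    audio_by_tune[record["tune_id"]].append(record)

# Attach every matching audio record to its ABC record.
merged = [
    {**abc, "audio": audio_by_tune.get(abc["tune_id"], [])}
    for abc in abc_records
]
```

Keeping the audio as a list per tune preserves the one-to-many relationship, so a tune with no recordings simply gets an empty list.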

## Initialize common variables

In [None]:
# Set the path to the datastore
#BASE_DIR = "/home/richhiey/Desktop/workspace/projects/AI_Music_Challenge_2020"

BASE_DIR = "/home/rithomas/"
ABC_DATA_DIR = os.path.join(BASE_DIR, "data", "ABC")
ABC_TFRECORD_DIR = os.path.join(BASE_DIR, "cache", "ABC", "6-8-Meter")
PROCESSED_ABC_FILENAME = 'processed-abc-files'

## Preprocessing - ABC Notation Dataset

#### To understand the underlying distribution of tunes in Irish music, it can be helpful to visualize the following quantities:
- Maximum length of tunes in each category
- Number of tunes in each category
- Number of tunes in each key
- Number of tunes in each meter
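
The quantities listed above can be computed with `collections.Counter` before plotting. This sketch assumes that "category" means the rhythm field and uses a few made-up tune summaries; the real values would come from the processed ABC files.

```python
from collections import Counter, defaultdict

# Hypothetical per-tune summaries; "length" is the number of tokens in the tune.
tunes = [
    {"rhythm": "jig", "key": "Dmaj", "meter": "6/8", "length": 120},
    {"rhythm": "jig", "key": "Gmaj", "meter": "6/8", "length": 96},
    {"rhythm": "reel", "key": "Dmaj", "meter": "4/4", "length": 210},
]

# Number of tunes in each category (assumed to be the rhythm), key and meter.
tunes_per_category = Counter(tune["rhythm"] for tune in tunes)
tunes_per_key = Counter(tune["key"] for tune in tunes)
tunes_per_meter = Counter(tune["meter"] for tune in tunes)

# Maximum tune length in each category.
max_length = defaultdict(int)
for tune in tunes:
    max_length[tune["rhythm"]] = max(max_length[tune["rhythm"]], tune["length"])
```

Each of these mappings can be passed to a bar chart, for example by plotting its keys against its values with matplotlib.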

In [None]:
preprocessor = ABCPreProcessor(ABC_TFRECORD_DIR, PROCESSED_ABC_FILENAME)
# Processing of the raw ABC files is disabled in this run
#csv_path = preprocessor.process(ABC_DATA_DIR)

# Create the tokenizer and print its vocabulary
tokenizer = preprocessor.create_tokenizer()
print('---------------------------------------------------------')
print('ABC Extended Vocabulary:')
print(tokenizer.return_vocabulary())
print('---------------------------------------------------------')

# Save the tokenized tunes as a TFRecord dataset
tfrecord_path = preprocessor.save_as_tfrecord_dataset(tokenizer)