## Import dependencies

In [1]:
import os
import tensorflow as tf
%matplotlib inline

from data.abc import ABCPreProcessor

## Overview of Preprocessing steps

In this notebook, we preprocess two types of data: **ABC Notation** data and **Audio** data.

### ABC Notation Data

- Extract the **tune body**, **key**, **meter**, and **rhythm** from each ABC track, and store all other fields as metadata
- Use key and meter as conditioning symbols when generating a tune
- Tokenize each tune according to a vocabulary of musical transcription tokens

- Create a TFRecord dataset of sequence examples of the form **[ tune, meter, key, rhythm ]**
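The steps above can be sketched in plain Python. This is only an illustration, not the logic inside `ABCPreProcessor`: the helper names (`split_abc_tune`, `tokenize`), the example tune, and the simplified token pattern are all assumptions, and real ABC files contain many constructs (ornaments, chords, tuplets) that this pattern ignores.

```python
import re

# Header fields pulled out alongside the tune body; every other
# header field is kept as metadata.
CONDITIONING_FIELDS = {"K": "key", "M": "meter", "R": "rhythm"}
HEADER_LINE = re.compile(r"^([A-Za-z]):(.*)$")

# Hypothetical, simplified token pattern: bar lines, pitches with an
# optional accidental and octave marks, then durations.
TOKEN_PATTERN = re.compile(r"\|\]|\|\||\|:|:\||\||[\^_=]?[A-Ga-gz][,']*|/?\d+|/")


def split_abc_tune(abc_text):
    """Split one ABC tune into (conditioning fields, metadata, body)."""
    conditioning, metadata, body_lines = {}, {}, []
    for raw_line in abc_text.strip().splitlines():
        line = raw_line.strip()
        if not line:
            continue
        match = HEADER_LINE.match(line)
        # Header lines are only recognised before the body starts.
        if match and not body_lines:
            field, value = match.group(1), match.group(2).strip()
            if field in CONDITIONING_FIELDS:
                conditioning[CONDITIONING_FIELDS[field]] = value
            else:
                metadata[field] = value
        else:
            body_lines.append(line)
    return conditioning, metadata, " ".join(body_lines)


def tokenize(body):
    """Return the list of tokens found in a tune body."""
    return TOKEN_PATTERN.findall(body)


# Made-up tune, used only to exercise the helpers.
example_tune = """
X:1
T:Example Jig
R:jig
M:6/8
K:Dmaj
|:DFA dAF|G2B A3:|
"""

conditioning, metadata, body = split_abc_tune(example_tune)
tokens = tokenize(body)
# One sequence example in the [ tune, meter, key, rhythm ] layout.
example = [tokens, conditioning["meter"], conditioning["key"], conditioning["rhythm"]]
```

Recognising header lines only before the first body line avoids misreading a body line that happens to start with a letter followed by a colon.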

### Audio Data
- Split the full audio into short examples (4 seconds by default, adjustable with flags)
- Infer the fundamental frequency (or "pitch") with CREPE
- Compute the loudness features

- Create a TFRecord dataset of sequence examples of the form **[ Audio, f0_feature, loudness_feature ]**
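To make the windowing and loudness steps concrete, here is a minimal NumPy sketch. It is an assumption-laden stand-in, not the pipeline used in this project: the function names, the 16 kHz sample rate, and the frame size are illustrative, and the loudness here is plain frame-wise RMS in decibels rather than a perceptually weighted measure. Pitch inference with CREPE is not shown.

```python
import numpy as np


def split_into_examples(audio, sample_rate, example_secs=4.0):
    """Slice a 1-D waveform into non-overlapping fixed-length windows.

    The trailing remainder shorter than one window is dropped.
    """
    window = int(round(example_secs * sample_rate))
    n_examples = len(audio) // window
    return audio[: n_examples * window].reshape(n_examples, window)


def rms_loudness_db(example, frame_size=1024, eps=1e-10):
    """Frame-wise RMS level in dB (a simple stand-in for loudness)."""
    n_frames = len(example) // frame_size
    frames = example[: n_frames * frame_size].reshape(n_frames, frame_size)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    return 20.0 * np.log10(rms + eps)


# Synthetic 10-second, 440 Hz sine wave at an assumed 16 kHz sample rate.
sample_rate = 16000
t = np.arange(10 * sample_rate) / sample_rate
audio = (0.5 * np.sin(2 * np.pi * 440.0 * t)).astype(np.float32)

examples = split_into_examples(audio, sample_rate)  # two 4-second windows
loudness = rms_loudness_db(examples[0])             # one value per frame
```

With 10 seconds of input and 4-second windows, the last 2 seconds are discarded; padding the final window instead is an equally reasonable choice.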

#### Each tune should be indexed so that its ID can be used to look up both its ABC notation and its related audio files
- A tune can be associated with more than one audio file (for example, recordings of different lengths)

At the end of the notebook, we should merge both datasets into a single TFRecord file containing the preprocessed ABC data and the preprocessed audio files, indexed by tune.
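One way to picture this index is a dictionary keyed by tune ID, where each entry holds the tune's ABC data and a list of audio files. The sketch below only illustrates that one-to-many layout: the function name, the IDs, and the file names are hypothetical, and the actual merge into a TFRecord file is not shown.

```python
def build_tune_index(abc_records, audio_records):
    """Group ABC data and audio files under a shared tune ID.

    abc_records:   iterable of (tune_id, abc_tokens)
    audio_records: iterable of (tune_id, audio_path)
    Returns the index and a list of audio files whose ID has no ABC tune.
    """
    index = {tune_id: {"abc": abc, "audio": []} for tune_id, abc in abc_records}
    unmatched = []
    for tune_id, audio_path in audio_records:
        if tune_id in index:
            index[tune_id]["audio"].append(audio_path)
        else:
            unmatched.append(audio_path)
    return index, unmatched


# Hypothetical IDs and file names, for illustration only.
abc_records = [(1, ["D", "F", "A"]), (2, ["G", "2", "B"])]
audio_records = [
    (1, "tune_1_take_a.wav"),
    (1, "tune_1_take_b.wav"),
    (3, "orphan.wav"),
]

index, unmatched = build_tune_index(abc_records, audio_records)
```

Returning the unmatched audio files separately makes it easy to spot recordings whose tune is missing from the ABC data before the two datasets are merged.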

## Initialize common variables

In [2]:
# Paths to the project, the raw ABC data, and the TFRecord cache
BASE_DIR = "/home/rithomas/project/AI-Music-Generation-Challenge-2020/"
ABC_DATA_DIR = os.path.join("/home/rithomas/data", "ABC")
ABC_TFRECORD_DIR = os.path.join("/home/rithomas/cache", "ABC", "Double-Jigs/")
PROCESSED_ABC_FILENAME = 'processed-abc-files'

## Preprocessing - ABC Notation Dataset

#### To understand the underlying distribution of tunes in Irish music, it can be helpful to visualize the following quantities:
- Maximum length of tunes in each category
- Number of tunes in each category
- Number of tunes in each key
- Number of tunes in each meter
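The four quantities listed above are simple group-wise counts and maxima. A minimal sketch with `collections.Counter` follows; the record layout (`rhythm`, `key`, `meter`, `num_tokens`) and the sample values are assumptions made for the example and may not match the columns of the CSV that the preprocessor writes.

```python
from collections import Counter


def summarize_tunes(tunes):
    """Compute per-category, per-key and per-meter statistics."""
    max_length = {}
    for tune in tunes:
        category = tune["rhythm"]
        max_length[category] = max(max_length.get(category, 0), tune["num_tokens"])
    return {
        "max_length_per_category": max_length,
        "tunes_per_category": Counter(t["rhythm"] for t in tunes),
        "tunes_per_key": Counter(t["key"] for t in tunes),
        "tunes_per_meter": Counter(t["meter"] for t in tunes),
    }


# Made-up records, used only to show the shape of the summary.
tunes = [
    {"rhythm": "double jig", "key": "Dmaj", "meter": "6/8", "num_tokens": 96},
    {"rhythm": "double jig", "key": "Gmaj", "meter": "6/8", "num_tokens": 128},
    {"rhythm": "reel", "key": "Dmaj", "meter": "4/4", "num_tokens": 110},
]

summary = summarize_tunes(tunes)
```

Each `Counter` maps a label to a count, so its keys and values can be handed directly to a bar chart.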

In [3]:
preprocessor = ABCPreProcessor(ABC_TFRECORD_DIR, PROCESSED_ABC_FILENAME)
json_path = preprocessor.process(ABC_DATA_DIR)
tfrecord_path = preprocessor.save_as_tfrecord_dataset(
    os.path.join(ABC_TFRECORD_DIR, 'tunes_vocab.json')
)

100%|██████████| 366/366 [00:00<00:00, 6590.91it/s]
  0%|          | 0/366 [00:00<?, ?it/s]

Cool. Lets process these ABC files now!
Processing files and writing extracted information to CSV for easy processing from here onwards ...
CSV PATH --> /home/rithomas/cache/ABC/Double-Jigs/processed-abc-files.csv
Found 1 files in the ABC Data Directory. Looking into these files now ..
---------------------------------------------------------
0. /home/rithomas/data/ABC/double_jigs_cleaned.abc
Extracting information and storing to CSV now ..
---------------------------------------------------------
Stored tunes from file /home/rithomas/data/ABC/double_jigs_cleaned.abc to CSV!
---------------------------------------------------------
Number of tunes - 365
---------------------------------------------------------
ABC Extended Vocabulary:
{'word_to_idx': {'C,': '1', '^C,': '2', 'D,': '3', '^D,': '4', 'E,': '5', 'F,': '6', '^F,': '7', 'G,': '8', '^G,': '9', 'A,': '10', '^A,': '11', 'B,': '12', 'C': '13', '^C': '14', 'D': '15', '^D': '16', 'E': '17', 'F': '18', '^F': '19', 'G': '20', '^G': '

100%|█████████▉| 365/366 [01:16<00:00,  4.78it/s]

Done saving to TFRecord Dataset!
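The vocabulary printed above maps each token to an index stored as a string under `word_to_idx`. A small sketch of how such a mapping could turn a token sequence into integer IDs is shown below. The three entries are copied from the printed output; the `encode_tokens` helper is hypothetical, and whether `tunes_vocab.json` has exactly this layout is an assumption.

```python
def encode_tokens(tokens, vocab):
    """Map tokens to integer IDs using a {'word_to_idx': {...}} vocabulary.

    Indices are stored as strings in the printed vocabulary, so they are
    converted to int here. Unknown tokens raise a KeyError.
    """
    word_to_idx = vocab["word_to_idx"]
    return [int(word_to_idx[token]) for token in tokens]


# First few entries of the printed vocabulary.
vocab = {"word_to_idx": {"C,": "1", "^C,": "2", "D,": "3"}}

ids = encode_tokens(["C,", "D,", "^C,"], vocab)
```

Failing loudly on unknown tokens is a deliberate choice for a sketch; a real pipeline might map them to a reserved index instead.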



