## Import dependencies

In [1]:
import os
import tensorflow as tf
%matplotlib inline

from data.abc import ABCPreProcessor

## Overview of Preprocessing steps

In this notebook, we preprocess two types of data: **ABC Notation** data and **Audio** data.

### ABC Notation Data

- Separate the **tune body**, **key**, **meter** and **rhythm** from each ABC track, and store all other fields as metadata
- Use key and meter as conditioning symbols when generating a tune
- Tokenize the tune according to a vocabulary of musical transcription tokens
- Create a TFRecord dataset of sequence examples of the form **[ tune, meter, key, rhythm ]**
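
The first three steps above can be sketched in plain Python. This is only an illustration and not the actual `ABCPreProcessor` implementation: the helper names `split_abc_tune` and `tokenize_body` and the token pattern are assumptions, and the header is assumed to end at the first `K:` field.

```python
import re

# ABC header fields that we pull out as conditioning symbols.
CONDITIONING_FIELDS = {"K": "key", "M": "meter", "R": "rhythm"}

# A header line looks like "<letter>:<value>", e.g. "M:4/4".
HEADER_LINE = re.compile(r"^([A-Za-z]):(.*)$")

# Hypothetical token pattern: notes with optional accidental and octave marks,
# repeat signs, bar lines, duration numbers, fractions and rests.
TOKEN_PATTERN = re.compile(r"[\^_=]?[A-Ga-g][,']*|\|:|:\||\||\d+|/\d*|z")


def split_abc_tune(abc_text):
    """Return (conditioning fields, remaining metadata, tune body) for one tune."""
    conditioning, metadata, body_lines = {}, {}, []
    in_header = True
    for line in abc_text.strip().splitlines():
        line = line.strip()
        match = HEADER_LINE.match(line) if in_header else None
        if match:
            field, value = match.group(1), match.group(2).strip()
            if field in CONDITIONING_FIELDS:
                conditioning[CONDITIONING_FIELDS[field]] = value
            else:
                metadata[field] = value
            if field == "K":
                # Assumption: the key field closes the header.
                in_header = False
        elif line:
            in_header = False
            body_lines.append(line)
    return conditioning, metadata, "\n".join(body_lines)


def tokenize_body(body):
    """Split a tune body into transcription tokens; unmatched characters are skipped."""
    return TOKEN_PATTERN.findall(body)
```

Calling `split_abc_tune` on a tune and then `tokenize_body` on the returned body gives the token sequence together with the key, meter and rhythm that go into one sequence example.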

### Audio Data
- Split the full audio into short examples (4 seconds by default, adjustable with flags)
- Infer the fundamental frequency (or "pitch") with CREPE
- Compute the loudness features
- Create a TFRecord dataset of sequence examples of the form **[ Audio, f0_feature, loudness_feature ]**
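
To make the windowing and loudness steps concrete, here is a minimal sketch using only the standard library. The function names are hypothetical, and the loudness here is a plain RMS level in decibels used as a stand-in; it is not necessarily the loudness feature the real pipeline computes. Pitch inference with CREPE is left out.

```python
import math


def split_into_windows(samples, sample_rate, window_seconds=4.0):
    """Cut a list of samples into non-overlapping windows of equal length.

    A trailing remainder shorter than one window is dropped.
    """
    window_size = int(sample_rate * window_seconds)
    if window_size < 1:
        raise ValueError("window must contain at least one sample")
    last_start = len(samples) - window_size
    return [
        samples[start:start + window_size]
        for start in range(0, last_start + 1, window_size)
    ]


def rms_loudness_db(window, floor_db=-120.0):
    """RMS level of one window in dB relative to full scale, clipped at floor_db."""
    if not window:
        return floor_db
    rms = math.sqrt(sum(s * s for s in window) / len(window))
    if rms <= 0.0:
        return floor_db
    return max(floor_db, 20.0 * math.log10(rms))
```

Each window returned by `split_into_windows` would become one audio example, with one loudness value (or a sequence of per-frame values, if computed on shorter frames) stored next to it.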

#### Each tune should be indexed so that its ID can be used to look up both its ABC notation and its related audio files
- A tune can be associated with more than one audio file, and these files can differ in length
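
One simple way to picture this index is a dictionary keyed by tune ID. The sketch below is an assumption about the layout, not the notebook's actual data structure; `build_tune_index` and the record shapes are hypothetical.

```python
def build_tune_index(abc_records, audio_records):
    """Map each tune ID to its ABC notation and a list of audio file paths.

    abc_records:   iterable of (tune_id, abc_text) pairs
    audio_records: iterable of (tune_id, audio_path) pairs; one tune ID may
                   appear several times, once per audio file
    """
    index = {
        tune_id: {"abc": abc_text, "audio": []}
        for tune_id, abc_text in abc_records
    }
    for tune_id, audio_path in audio_records:
        # Audio files without a matching ABC tune are ignored in this sketch.
        if tune_id in index:
            index[tune_id]["audio"].append(audio_path)
    return index
```

With this layout, `index[tune_id]["abc"]` gives the notation and `index[tune_id]["audio"]` gives every recording of that tune, whatever its length.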

At the end of the notebook, both datasets should be merged into a single TFRecord file containing the preprocessed ABC data and the preprocessed audio files, indexed by tune.

## Initialize common variables

In [2]:
# Set the path to the datastore
BASE_DIR = "/home/richhiey/Desktop/workspace/projects/AI_Music_Challenge_2020"

# BASE_DIR = "/home/rithomas/project/AI-Music-Generation-Challenge-2020/"
ABC_DATA_DIR = os.path.join(BASE_DIR, "datasets", "abc_data")
ABC_TFRECORD_DIR = os.path.join(BASE_DIR, "tfrecords", "abc")
PROCESSED_ABC_FILENAME = 'processed-abc-files'

## Preprocessing - ABC Notation Dataset

#### To understand the underlying distribution of tunes in Irish music, it can be helpful to visualize the following quantities:
- Maximum length of tunes in each category
- Number of tunes in each category
- Number of tunes in each key
- Number of tunes in each meter
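
Before plotting, these quantities can be computed with `collections.Counter`. This is a sketch under assumptions: each tune is taken to be a dictionary with `rhythm`, `key`, `meter` and `tokens` entries, and "category" is assumed to mean the rhythm type (reel, jig and so on). The function name `summarize_tunes` is hypothetical.

```python
from collections import Counter


def summarize_tunes(tunes):
    """Compute per-category, per-key and per-meter statistics for a list of tunes."""
    max_length_per_category = {}
    for tune in tunes:
        category = tune["rhythm"]
        length = len(tune["tokens"])
        # Keep the longest token sequence seen so far in each category.
        if length > max_length_per_category.get(category, 0):
            max_length_per_category[category] = length
    return {
        "max_length_per_category": max_length_per_category,
        "tunes_per_category": dict(Counter(t["rhythm"] for t in tunes)),
        "tunes_per_key": dict(Counter(t["key"] for t in tunes)),
        "tunes_per_meter": dict(Counter(t["meter"] for t in tunes)),
    }
```

Each of the returned dictionaries maps a label to a number, so it can be passed directly to a bar chart.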

In [3]:
preprocessor = ABCPreProcessor(ABC_TFRECORD_DIR, PROCESSED_ABC_FILENAME)
#json_path = preprocessor.process(ABC_DATA_DIR)
tokenizer = preprocessor.create_tokenizer()
print('---------------------------------------------------------')
print('ABC Extended Vocabulary:')
print(tokenizer.return_vocabulary())
print('---------------------------------------------------------')
tfrecord_path = preprocessor.save_as_tfrecord_dataset(tokenizer)

  0%|          | 1/31721 [00:00<1:23:50,  6.31it/s]

---------------------------------------------------------
ABC Extended Vocabulary:
{'word_to_idx': {'C,': '1', '^C,': '2', 'D,': '3', '^D,': '4', 'E,': '5', 'F,': '6', '^F,': '7', 'G,': '8', '^G,': '9', 'A,': '10', '^A,': '11', 'B,': '12', 'C': '13', '^C': '14', 'D': '15', '^D': '16', 'E': '17', 'F': '18', '^F': '19', 'G': '20', '^G': '21', 'A': '22', '^A': '23', 'B': '24', 'c': '25', '^c': '26', 'd': '27', '^d': '28', 'e': '29', 'f': '30', '^f': '31', 'g': '32', '^g': '33', 'a': '34', '^a': '35', 'b': '36', "c'": '37', "^c'": '38', "d'": '39', "^d'": '40', "e'": '41', "f'": '42', "g'": '43', "^g'": '44', "a'": '45', "^a'": '46', "B'": '47', '_C,': '48', '_D,': '49', '_E,': '50', '_G,': '51', '_A,': '52', '_B,': '53', '_C': '54', '_D': '55', '_E': '56', '_G': '57', '_A': '58', '_B': '59', '_c': '60', '_d': '61', '_e': '62', '_g': '63', '_a': '64', '_b': '65', "_c'": '66', "_d'": '67', "_e'": '68', "_g'": '69', "_a'": '70', "_b'": '71', '=C,': '72', '=E,': '73', '=F,': '74', '=G,': '75'

  0%|          | 75/31721 [00:06<49:03, 10.75it/s] 


KeyboardInterrupt: 