## Import dependencies

In [1]:
import os
import tensorflow as tf
%matplotlib inline

from data.abc import ABCPreProcessor

## Overview of Preprocessing steps

This notebook preprocesses two types of data: **ABC Notation** data and **Audio** data.

### ABC Notation Data

- Extract the **tune body**, **key**, **meter**, and **rhythm**, and store all other fields of an ABC track as metadata
- Use key and meter as conditioning symbols when generating a tune
- Tokenize according to a vocabulary of musical transcription tokens
- Create a TFRecord dataset consisting of sequence examples of the form **[ tune, meter, key, rhythm ]**
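
The extraction and tokenization steps above can be sketched roughly as follows. This is a minimal illustration, not the `ABCPreProcessor` implementation: the function name `split_abc_tune` and the token pattern are hypothetical, and the pattern covers only bar lines, notes, rests, and duration numbers. It assumes the standard ABC header letters `K:`, `M:`, and `R:` for key, meter, and rhythm.

```python
import re

# ABC header letters for the conditioning fields named above.
CONDITIONING_FIELDS = {"K": "key", "M": "meter", "R": "rhythm"}

# Simplified, hypothetical token pattern: bar lines, notes/rests with optional
# accidental and octave marks, and duration numbers. A real vocabulary is larger.
TOKEN_PATTERN = re.compile(
    r"\|\]|\|\||\[\||\|:|:\||\|"
    r"|[\^_=]?[A-Ga-gz][,']*"
    r"|\d+|/\d*"
)

HEADER_LINE = re.compile(r"^([A-Za-z]):(.*)$")


def split_abc_tune(abc_text):
    """Split one ABC tune into conditioning fields, metadata, and body tokens."""
    tune = {"key": None, "meter": None, "rhythm": None, "metadata": {}}
    body_lines = []
    for raw_line in abc_text.strip().splitlines():
        line = raw_line.strip()
        header = HEADER_LINE.match(line)
        # Header fields come before the tune body; later lines are music.
        if header and not body_lines:
            letter, value = header.group(1), header.group(2).strip()
            if letter in CONDITIONING_FIELDS:
                tune[CONDITIONING_FIELDS[letter]] = value
            else:
                tune["metadata"][letter] = value
        elif line:
            body_lines.append(line)
    tune["tokens"] = TOKEN_PATTERN.findall(" ".join(body_lines))
    return tune
```

Calling `split_abc_tune` on a single tune returns a dictionary whose `key`, `meter`, `rhythm`, and `tokens` entries correspond to the fields of one sequence example, with everything else kept under `metadata`.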

### Audio Data
- Split the full audio into short examples (4 seconds by default, adjustable with flags)
- Infer the fundamental frequency (or "pitch") with CREPE
- Compute the loudness features
- Create a TFRecord dataset consisting of sequence examples of the form **[ Audio, f0_feature, loudness_feature ]**
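
The first step, cutting a long recording into fixed-length examples, can be sketched as below. Only the 4-second window comes from the text above; the 16 kHz sample rate, the 1-second hop, and the function names are assumptions made for illustration. Pitch inference with CREPE and the loudness computation are not shown.

```python
def window_bounds(num_samples, sample_rate=16000, window_secs=4.0, hop_secs=1.0):
    """Return (start, end) sample indices for every full-length window."""
    window = int(round(window_secs * sample_rate))
    hop = int(round(hop_secs * sample_rate))
    if window <= 0 or hop <= 0:
        raise ValueError("window_secs and hop_secs must be positive")
    # A recording shorter than one window yields no examples.
    return [(start, start + window)
            for start in range(0, num_samples - window + 1, hop)]


def split_audio(samples, sample_rate=16000, window_secs=4.0, hop_secs=1.0):
    """Slice a sequence of audio samples into overlapping fixed-length examples."""
    bounds = window_bounds(len(samples), sample_rate, window_secs, hop_secs)
    return [samples[start:end] for start, end in bounds]
```

With these assumed defaults, a 10-second recording yields seven overlapping 4-second examples; setting `hop_secs` equal to `window_secs` gives non-overlapping examples instead.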

#### Each tune should be indexed so that its ID can be used to find both its ABC notation and its related audio files
- A tune can be associated with more than one audio file (with different audio lengths)

At the end of the notebook, both datasets should be merged into a single TFRecord file containing the preprocessed ABC data and the preprocessed audio files, indexed by tune.
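
One way to picture the indexing is a dictionary keyed by tune ID, where each entry holds one ABC record and a list of audio files. The sketch below works on plain Python dictionaries rather than TFRecords; the record fields `tune_id`, `abc`, and `path` and the function name are hypothetical.

```python
from collections import defaultdict


def merge_by_tune_id(abc_records, audio_records):
    """Index ABC records by tune ID and attach every matching audio file."""
    audio_by_tune = defaultdict(list)
    for record in audio_records:
        # A tune may have several recordings, so collect them in a list.
        audio_by_tune[record["tune_id"]].append(record["path"])
    return {
        record["tune_id"]: {
            "abc": record["abc"],
            "audio": audio_by_tune.get(record["tune_id"], []),
        }
        for record in abc_records
    }
```

Tunes without any recording keep an empty `audio` list, so the ABC data is never dropped by the merge.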

## Initialize common variables

In [2]:
# Set the path to the datastore
BASE_DIR = "/home/richhiey/Desktop/workspace/projects/AI_Music_Challenge_2020/"
ABC_DATA_DIR = os.path.join(BASE_DIR, "datasets", "abc_data")
AUDIO_DATA_DIR = os.path.join(BASE_DIR, "datasets", "audio")
ABC_TFRECORD_DIR = os.path.join(BASE_DIR, "tfrecords", 'abc')
AUDIO_TFRECORD_DIR = os.path.join(BASE_DIR, "tfrecords", 'audio')
PROCESSED_ABC_FILENAME = 'processed-abc-files'
PROCESSED_AUDIO_FILENAME = 'processed-audio-files'

## Preprocessing - ABC Notation Dataset

#### To understand the underlying distribution of tunes in Irish music, it can be helpful to visualize the following quantities:
- Maximum length of tunes in each category
- Number of tunes in each category
- Number of tunes in each key
- Number of tunes in each meter
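
These quantities can be computed from the extracted rows before plotting. The sketch below assumes each row is a dictionary (for example from `csv.DictReader`) with hypothetical column names `tune`, `rhythm`, `key`, and `meter`, treats the rhythm as the tune category, and measures tune length in whitespace-separated tokens. None of this is confirmed by the `ABCPreProcessor` output format.

```python
from collections import Counter, defaultdict


def summarize_tunes(rows):
    """Count tunes per rhythm, key, and meter, and track the longest tune per rhythm."""
    counts = {"rhythm": Counter(), "key": Counter(), "meter": Counter()}
    max_length = defaultdict(int)
    for row in rows:
        for field in counts:
            counts[field][row[field]] += 1
        # Tune length is measured in whitespace-separated tokens (an assumption).
        length = len(row["tune"].split())
        max_length[row["rhythm"]] = max(max_length[row["rhythm"]], length)
    return counts, dict(max_length)
```

The returned counters can be passed directly to a bar plot, for instance with `matplotlib.pyplot.bar(list(counter.keys()), list(counter.values()))`.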

In [3]:
preprocessor = ABCPreProcessor(ABC_TFRECORD_DIR, PROCESSED_ABC_FILENAME)
json_path = preprocessor.process(ABC_DATA_DIR)
tfrecord_path = preprocessor.save_as_tfrecord_dataset(
    os.path.join(ABC_TFRECORD_DIR, 'tunes_vocab.json')
)

Cool. Lets process these ABC files now!
Processing files and writing extracted information to CSV for easy processing from here onwards ...
CSV PATH --> /home/richhiey/Desktop/workspace/projects/AI_Music_Challenge_2020/tfrecords/abc/processed-abc-files.csv
Found 2 files in the ABC Data Directory. Looking into these files now ..
---------------------------------------------------------
0. /home/richhiey/Desktop/workspace/projects/AI_Music_Challenge_2020/datasets/abc_data/thesessions_data_cleaned.abc


  1%|          | 847/151041 [00:00<00:17, 8463.41it/s]

Extracting information and storing to CSV now ..
---------------------------------------------------------


100%|██████████| 151041/151041 [00:14<00:00, 10486.10it/s]
100%|██████████| 366/366 [00:00<00:00, 7217.82it/s]


Stored tunes from file /home/richhiey/Desktop/workspace/projects/AI_Music_Challenge_2020/datasets/abc_data/thesessions_data_cleaned.abc to CSV!
---------------------------------------------------------
---------------------------------------------------------
1. /home/richhiey/Desktop/workspace/projects/AI_Music_Challenge_2020/datasets/abc_data/double_jigs_cleaned.abc
Extracting information and storing to CSV now ..
---------------------------------------------------------
Stored tunes from file /home/richhiey/Desktop/workspace/projects/AI_Music_Challenge_2020/datasets/abc_data/double_jigs_cleaned.abc to CSV!
---------------------------------------------------------
Number of tunes - 139193


  if (await self.run_code(code, result,  async_=asy)):
  0%|          | 0/139194 [00:00<?, ?it/s]

---------------------------------------------------------
ABC Extended Vocabulary:
{'word_to_idx': {'C,': '1', '^C,': '2', 'D,': '3', '^D,': '4', 'E,': '5', 'F,': '6', '^F,': '7', 'G,': '8', '^G,': '9', 'A,': '10', '^A,': '11', 'B,': '12', 'C': '204', '^C': '14', 'D': '191', '^D': '16', 'E': '17', 'F': '187', '^F': '19', 'G': '202', '^G': '21', 'A': '214', '^A': '23', 'B': '225', 'c': '25', '^c': '26', 'd': '27', '^d': '28', 'e': '29', 'f': '30', '^f': '31', 'g': '32', '^g': '33', 'a': '34', '^a': '35', 'b': '36', "c'": '37', "^c'": '38', "d'": '39', "^d'": '40', "e'": '41', "f'": '42', "g'": '43', "^g'": '44', "a'": '45', "^a'": '46', "B'": '47', '_C,': '48', '_D,': '49', '_E,': '50', '_G,': '51', '_A,': '52', '_B,': '53', '_C': '54', '_D': '55', '_E': '56', '_G': '57', '_A': '58', '_B': '59', '_c': '60', '_d': '61', '_e': '62', '_g': '63', '_a': '64', '_b': '65', "_c'": '66', "_d'": '67', "_e'": '68', "_g'": '69', "_a'": '70', "_b'": '71', '=C,': '72', '=E,': '73', '=F,': '74', '=G,'

  0%|          | 86/139194 [00:16<7:13:17,  5.35it/s]


KeyboardInterrupt: 

In [4]:
preprocessor.visualize_stats()

#dataset = preprocessor.load_tfrecord_dataset()
#print(dataset)

AttributeError: 'ABCPreProcessor' object has no attribute 'visualize_stats'