## Import dependencies

In [4]:
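import os

# Hide all GPUs from TensorFlow so that preprocessing runs on the CPU.
# Set this before importing TensorFlow so that it takes effect.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
import tensorflow as tf
%matplotlib inline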

from data.abc import ABCPreProcessor

## Overview of Preprocessing steps

In this notebook, we preprocess two types of data: **ABC Notation** data and **Audio** data.

### ABC Notation Data

- Separate out the **tune body**, **key**, **meter** and **rhythm**, and store all other fields of an ABC track as metadata
- Use key and meter as conditioning symbols when generating a tune
- Tokenize according to a vocabulary of musical transcription tokens
- Create a TFRecord dataset consisting of sequence examples of the form **[ tune, meter, key, rhythm ]**
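
The first and third steps above can be sketched in plain Python. This is a minimal illustration, not the logic inside `ABCPreProcessor`: the function names `split_abc_tune` and `tokenize_body`, the token regular expression, and the sample tune are all assumptions made for this example. It relies only on the standard ABC convention that header lines look like `K:Dmaj` and that the `K:` field ends the header.

```python
import re

# Standard ABC header letters that are kept as conditioning fields.
CONDITIONING_FIELDS = {"K": "key", "M": "meter", "R": "rhythm"}

# An ABC header line: one letter, a colon, then the value.
HEADER_LINE = re.compile(r"^([A-Za-z]):\s*(.*)$")

# Assumed token pattern: bar lines, notes (with optional accidental and
# octave marks such as "^C,"), and simple duration markers.
TOKEN = re.compile(
    r"\|\]|\|\||\|:|:\||\|"
    r"|(?:\^\^|\^|__|_|=)?[A-Ga-gz][,']*"
    r"|\d+(?:/\d+)?|/\d*"
)


def split_abc_tune(abc_text):
    """Split one ABC tune into tune body, key, meter, rhythm and metadata."""
    record = {"tune": "", "key": "", "meter": "", "rhythm": "", "metadata": {}}
    body_lines = []
    in_body = False
    for line in abc_text.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        match = None if in_body else HEADER_LINE.match(line)
        if match is None:
            body_lines.append(line)
            continue
        field, value = match.group(1), match.group(2).strip()
        if field in CONDITIONING_FIELDS:
            record[CONDITIONING_FIELDS[field]] = value
        else:
            record["metadata"][field] = value
        # The K: field closes the header; everything after it is tune body.
        if field == "K":
            in_body = True
    record["tune"] = " ".join(body_lines)
    return record


def tokenize_body(body):
    """Split a tune body into tokens using the assumed TOKEN pattern."""
    return TOKEN.findall(body)


# Made-up tune used only to exercise the two functions.
SAMPLE_TUNE = """X:1
T:Example Jig
R:jig
M:6/8
K:Dmaj
|:D2E F2G|A3 d3:|
"""

record = split_abc_tune(SAMPLE_TUNE)
print(record["key"], record["meter"], record["rhythm"], record["metadata"])
print(tokenize_body(record["tune"]))
```

A real vocabulary would also have to cover ornaments, chords, ties and repeat endings, which this pattern silently skips.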

At the end, both datasets should be merged to obtain a single TFRecord file containing the preprocessed ABC data, indexed by tune.

## Initialize common variables

In [5]:
# Set the paths to the data store
BASE_DIR = "/home/richhiey/Desktop/workspace/projects/AI_Music_Challenge_2020/AI-Music-Generation-Challenge-2020"
PROJECT_DIR = "/home/richhiey/Desktop/workspace/projects/AI_Music_Challenge_2020"
ABC_TFRECORD_DIR = os.path.join(PROJECT_DIR, "tfrecords", "abc")
ABC_DATA_DIR = os.path.join(PROJECT_DIR, 'datasets', 'abc_data')
PROCESSED_ABC_FILENAME = 'processed-abc-files'

#BASE_DIR = "/home/rithomas/"
#ABC_DATA_DIR = os.path.join(BASE_DIR, "data", "ABC")
#ABC_TFRECORD_DIR = os.path.join(BASE_DIR, "cache", "ABC", "Double-Jigs")
#PROCESSED_ABC_FILENAME = 'processed-abc-files'

## Preprocessing - ABC Notation Dataset

#### To understand the underlying distribution of tunes in Irish music, it can be helpful to visualize the following quantities:
- Maximum length of tunes in each category
- Number of tunes in each category
- Number of tunes in each key
- Number of tunes in each meter
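
The counts behind such plots can be computed with the standard library alone. The sketch below is an assumption-laden illustration: the record layout (`category`, `key`, `meter`, `tokens`), the function name `summarize_tunes`, and the sample records are hypothetical and do not reflect the actual columns of the CSV written by the preprocessor.

```python
from collections import Counter

# Hypothetical records; real ones would be loaded from the processed CSV.
SAMPLE_TUNES = [
    {"category": "double jig", "key": "Dmaj", "meter": "6/8",
     "tokens": ["D", "2", "E"]},
    {"category": "double jig", "key": "Gmaj", "meter": "6/8",
     "tokens": ["G", "A", "B", "c"]},
    {"category": "reel", "key": "Dmaj", "meter": "4/4",
     "tokens": ["d", "2"]},
]


def summarize_tunes(tunes):
    """Compute the four quantities listed above for a list of tune records."""
    max_length = {}
    for tune in tunes:
        category = tune["category"]
        length = len(tune["tokens"])
        max_length[category] = max(max_length.get(category, 0), length)
    return {
        "max_length_per_category": max_length,
        "tunes_per_category": Counter(t["category"] for t in tunes),
        "tunes_per_key": Counter(t["key"] for t in tunes),
        "tunes_per_meter": Counter(t["meter"] for t in tunes),
    }


summary = summarize_tunes(SAMPLE_TUNES)
for name, counts in summary.items():
    print(name, dict(counts))
```

Each of the resulting dictionaries maps directly onto a bar chart, with the keys as bar labels and the values as bar heights.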

In [6]:
preprocessor = ABCPreProcessor(ABC_TFRECORD_DIR, PROCESSED_ABC_FILENAME)
csv_path = preprocessor.process(ABC_DATA_DIR)
tokenizer = preprocessor.create_tokenizer()
print('---------------------------------------------------------')
print('ABC Extended Vocabulary:')
print(tokenizer.return_vocabulary())
print('---------------------------------------------------------')
tfrecord_path = preprocessor.save_as_tfrecord_dataset(tokenizer)

100%|██████████| 366/366 [00:00<00:00, 5652.37it/s]

Cool. Lets process these ABC files now!
Processing files and writing extracted information to CSV for easy processing from here onwards ...
CSV PATH --> /home/richhiey/Desktop/workspace/projects/AI_Music_Challenge_2020/tfrecords/abc/processed-abc-files.csv
Found 1 files in the ABC Data Directory. Looking into these files now ..
---------------------------------------------------------
0. /home/richhiey/Desktop/workspace/projects/AI_Music_Challenge_2020/datasets/abc_data/double_jigs.abc
Extracting information and storing to CSV now ..
---------------------------------------------------------
Stored tunes from file /home/richhiey/Desktop/workspace/projects/AI_Music_Challenge_2020/datasets/abc_data/double_jigs.abc to CSV!
---------------------------------------------------------
Number of tunes - 365
---------------------------------------------------------
ABC Extended Vocabulary:
{'word_to_idx': {'C,': '1', '^C,': '2', 'D,': '3', '^D,': '4', 'E,': '5', 'F,': '6', '^F,': '7', 'G,': '8', 

100%|█████████▉| 365/366 [00:29<00:00, 12.19it/s]

Done saving to TFRecord Dataset!
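
To show what a sequence example of the form [ tune, meter, key, rhythm ] might look like on disk, here is a small round trip through a TFRecord file. The schema is an assumption, not the one `save_as_tfrecord_dataset` actually writes: the feature names, the choice to store the tune as a feature list and the conditioning symbols as context features, and the integer ids are all hypothetical.

```python
import os
import tempfile

import tensorflow as tf


def int64_feature(value):
    """Wrap a single integer in a tf.train.Feature."""
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))


def make_sequence_example(tokens, key, meter, rhythm):
    """Pack one tokenized tune and its conditioning ids (assumed schema)."""
    context = tf.train.Features(feature={
        "key": int64_feature(key),
        "meter": int64_feature(meter),
        "rhythm": int64_feature(rhythm),
    })
    tune = tf.train.FeatureList(feature=[int64_feature(t) for t in tokens])
    return tf.train.SequenceExample(
        context=context,
        feature_lists=tf.train.FeatureLists(feature_list={"tune": tune}),
    )


def parse_sequence_example(serialized):
    """Decode one serialized record back into (tune, meter, key, rhythm)."""
    context, sequence = tf.io.parse_single_sequence_example(
        serialized,
        context_features={
            "key": tf.io.FixedLenFeature([], tf.int64),
            "meter": tf.io.FixedLenFeature([], tf.int64),
            "rhythm": tf.io.FixedLenFeature([], tf.int64),
        },
        sequence_features={
            "tune": tf.io.FixedLenSequenceFeature([], tf.int64),
        },
    )
    return sequence["tune"], context["meter"], context["key"], context["rhythm"]


# Write one made-up record to a temporary file, then read it back.
path = os.path.join(tempfile.mkdtemp(), "example.tfrecord")
with tf.io.TFRecordWriter(path) as writer:
    example = make_sequence_example([12, 40, 7], key=3, meter=1, rhythm=2)
    writer.write(example.SerializeToString())

dataset = tf.data.TFRecordDataset(path).map(parse_sequence_example)
for tune, meter, key, rhythm in dataset:
    print(tune.numpy(), meter.numpy(), key.numpy(), rhythm.numpy())
```

Keeping the conditioning symbols in the context and the tokens in a feature list lets tunes of different lengths share one schema; padding can then be applied at batching time, for example with `padded_batch`.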



