This notebook will help you manage your dataset folder and convert the source `.xml` files to NoteSequence protos for further processing.

**INSTRUCTIONS**

* Put all song `.xml` files in a single folder.
    * Each song must have at least two performance levels of difficulty.
    * Files must follow the naming convention:
    
    `[Song Name]_[Performance Level]-[Song Segment]-[Hand].xml`
        * Song Name uses only `A-Z`, `a-z` and `0-9`, with no spaces or other characters. It is the song's unique identifier, so spell it exactly the same in every file.
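        * Performance Level is one of `['beg', 'int', 'adv']`. The underscore before it is the separator already shown in the pattern, e.g. `circleoflife_adv-chorus1-bh.xml`.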
        * Song Segment is a unique string for a given Song Name and Performance Level. Song Segments must match exactly across the Performance Levels of a song.
        * Hand is one of `['lh', 'rh', 'bh']`.
* `SourceFolderManager` can do the following for you: 
    * Traverse your chosen directory for `.xml` files and build an index that classifies the type of musical composition each `.xml` file holds.
    * Collate files into `source -> target` pairs according to criteria of your choice. Collation uses the previously built index, so rebuild the index if any files change in the meantime.
    * Convert the collated pairs (which are stored as `.xml` file paths) to NoteSequence protos, serialize them, and save them as `.tfrecord` files in a directory of your choice.
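
The naming convention above can be checked before building the index. The snippet below is a minimal sketch, not part of `data_load`: `FILENAME_RE` and `parse_filename` are hypothetical names, and it assumes that Song Segment is alphanumeric, which the instructions do not state.

```python
import re

# Assumed pattern for `[Song Name]_[Performance Level]-[Song Segment]-[Hand].xml`.
# The alphanumeric Song Segment is an assumption made for this sketch.
FILENAME_RE = re.compile(
    r"(?P<song>[A-Za-z0-9]+)_(?P<level>beg|int|adv)"
    r"-(?P<segment>[A-Za-z0-9]+)-(?P<hand>lh|rh|bh)\.xml"
)


def parse_filename(filename):
    """Return the name parts as a dict, or None if the name breaks the convention."""
    match = FILENAME_RE.fullmatch(filename)
    return match.groupdict() if match else None


print(parse_filename("circleoflife_adv-chorus1-bh.xml"))    # follows the convention
print(parse_filename("circle of life_adv-chorus1-bh.xml"))  # spaces are not allowed
```

Running it over `os.listdir` of the source folder and listing the names that return `None` is one way to catch misnamed files early.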

The four cells below are all you need to execute to get started. See the comments in `preprocess.py` for insight into what happens behind the scenes.

**DEPENDENCIES**

In [27]:
import data_load
import os

import importlib
importlib.reload(data_load)

<module 'data_load' from '/Users/vesko/GitHub/UoE-dissertation/model/build_dataset/data_load.py'>

**PARAMETERS**

In [28]:
load_config = dict()

load_config['source_xml_dir'] = "../assets/data/raw/"
load_config['out_collated_dir'] = "../assets/data/collated/M/"

load_config['ext_meta'] = "../assets/data/raw/_songs_metadata.csv"

In [29]:
load_config['genres'] = ['Pop', 'Rock', 'Film', 'Religious', 
                         'Traditional', 'Musical', 'Country', 'Contemporary Piano']

load_config['test_set'] = ['myfavoritethings', 'Something', 'WhereEverybodyKnowsYourName',
                           'girlcrush', 'ImagineMe', 'withorwithoutyou', 'cantstopthefeeling', 
                           'SmellsLikeTeenSpirit', 'Itookapillinibiza', 'wethreekings', 
                           'whereareyouchristmas', 'AllThingsAreWorking', 'LikeImGonnaLoseYou', 
                           'RememberMe','letmeloveyou', 'WalkingInMemphis', 'WishYouWereHere', 
                           'neversaynever', 'WerewolvesOfLondon', 'RightHereWaiting']

load_config['eval_set'] = ['yellow', 'whowantstoliveforever', 'Angie', 'aintNoSunshine',
                           'everytimeyougoaway', 'MaybeImAmazed', 'Levon', 'AnotherDayInParadise', 
                           'AllOutOfLove', 'sweetemotion', 'circleoflife', 'CheapThrills', 
                           'californication', 'ochristmastree', 'aslongasyouremine', 
                           'ValseAmelie', 'sevenyears', 'BennieandtheJets', 'thecircleoflife',
                           'partofyourworld']
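
Because the test and evaluation sets are typed by hand, a quick sanity check can confirm that no song appears in both lists or twice in one list. This is a minimal sketch with hypothetical helper names (`find_overlap`, `find_duplicates`) and toy lists. Names are compared case-sensitively, in line with the rule that Song Name must be spelled exactly the same everywhere.

```python
from collections import Counter


def find_overlap(test_set, eval_set):
    """Return the song names that appear in both splits."""
    return sorted(set(test_set) & set(eval_set))


def find_duplicates(names):
    """Return the song names listed more than once within one split."""
    counts = Counter(names)
    return sorted(name for name, count in counts.items() if count > 1)


# Toy lists for illustration only; they are not the notebook's real splits.
demo_test = ['Something', 'girlcrush', 'yellow']
demo_eval = ['yellow', 'Angie', 'Angie']

print(find_overlap(demo_test, demo_eval))
print(find_duplicates(demo_eval))
```

With the configuration above, the calls would be `find_overlap(load_config['test_set'], load_config['eval_set'])` and `find_duplicates(...)` on each list. Both should return an empty list.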

# Collate Files

In [30]:
manager = data_load.SourceFolderManager()
manager.build_index(src_folder=load_config['source_xml_dir'],
                    ext_meta=load_config['ext_meta'])

In [35]:
manager.files_index.to_csv(os.path.join(load_config['out_collated_dir'], 'files_index.csv'))

In [31]:
manager.collate(hand=('rh', 'bh'),
                level=[('beg', 'int'),
                       ('int', 'adv'),
                       ('beg', 'adv'),
                       ('int', 'int'),
                       ('adv', 'adv'),],
                DoubleNoteVal=False,
                WholeSong=False,
                eval_set=load_config['eval_set'],
                test_set=load_config['test_set'])


INFO: Successfully collated Train: 4379 pairs from 154 unique songs.
INFO: Successfully collated Eval: 612 pairs from 20 unique songs.
INFO: Successfully collated Test: 593 pairs from 20 unique songs.

Count of pairs by level:
{'train': Counter({('int', 'int'): 1101, ('beg', 'int'): 1087, ('adv', 'adv'): 735, ('int', 'adv'): 735, ('beg', 'adv'): 721}), 'eval': Counter({('beg', 'int'): 156, ('int', 'int'): 156, ('adv', 'adv'): 100, ('int', 'adv'): 100, ('beg', 'adv'): 100}), 'test': Counter({('beg', 'int'): 157, ('int', 'int'): 157, ('adv', 'adv'): 93, ('int', 'adv'): 93, ('beg', 'adv'): 93})}

Count of pairs by segment type:
{'train': Counter({'chorus1': 566, 'chorus2': 545, 'verse2': 462, 'verse1': 460, 'intro': 428, 'chorus3': 374, 'bridge': 249, 'verse3': 199, 'outro': 178, 'prechorus1': 163, 'prechorus2': 149, 'instrumental': 73, 'chorus4': 70, 'postchorus1': 70, 'verse4': 68, 'bridge1': 53, 'bridge2': 48, 'postchorus2': 36, 'prechorus3': 25, 'verse5': 17, 'chorus5': 17, 'section1
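
The level pairs passed to `collate` can be pictured with a small stand-alone sketch. It is not the `SourceFolderManager.collate` implementation: `pair_by_level` is a hypothetical helper that assumes well-formed file names, groups files by song, segment and hand, and ignores the `hand`, `DoubleNoteVal` and `WholeSong` options, whose semantics are not shown here.

```python
from collections import defaultdict


def pair_by_level(filenames, level_pairs):
    """Pair files of the same song, segment and hand across the requested levels."""
    groups = defaultdict(dict)  # (song, segment, hand) -> {level: filename}
    for name in filenames:
        # Song names contain no underscore, so the first one is the separator.
        song, rest = name[:-len(".xml")].split("_", 1)
        level, segment, hand = rest.split("-")
        groups[(song, segment, hand)][level] = name

    pairs = []
    for key in sorted(groups):
        by_level = groups[key]
        for source_level, target_level in level_pairs:
            # Emit a pair only when both levels exist for this segment and hand.
            if source_level in by_level and target_level in by_level:
                pairs.append((by_level[source_level], by_level[target_level]))
    return pairs


# Toy file names for illustration only.
demo_files = [
    "yellow_beg-chorus1-rh.xml",
    "yellow_int-chorus1-rh.xml",
    "yellow_adv-chorus1-lh.xml",
]
print(pair_by_level(demo_files, [("beg", "int"), ("int", "adv"), ("int", "int")]))
```

In this sketch a pair such as `('int', 'int')` maps a file onto itself, and a level pair is skipped whenever one of its two levels is missing for a segment.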

In [32]:
manager.serialize_collated(load_config['out_collated_dir']) # .xml files --> Protocol buffers in TensorFlow Record containers

INFO: Saved ../assets/data/collated/M/train_inputs.tfrecord.
INFO: Saved ../assets/data/collated/M/train_targets.tfrecord.
INFO: Saved ../assets/data/collated/M/eval_inputs.tfrecord.
INFO: Saved ../assets/data/collated/M/eval_targets.tfrecord.
INFO: Saved ../assets/data/collated/M/test_inputs.tfrecord.
INFO: Saved ../assets/data/collated/M/test_targets.tfrecord.


## Quick & Dirty Fix

Update every `NoteSequence` in the `.tfrecord` files to use the key from the metadata `.csv` file. First create a directory called `fixed` inside the output directory, since the updated files are written there.
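
The `fixed` directory can also be created from Python rather than by hand. This is a minimal sketch: `ensure_fixed_dir` is a hypothetical helper, demonstrated on a temporary directory that stands in for the collated output directory.

```python
import os
import tempfile


def ensure_fixed_dir(collated_dir):
    """Create `<collated_dir>/fixed` if it is missing and return its path."""
    fixed_dir = os.path.join(collated_dir, "fixed")
    os.makedirs(fixed_dir, exist_ok=True)  # no error if it already exists
    return fixed_dir


# A temporary directory stands in for load_config['out_collated_dir'].
with tempfile.TemporaryDirectory() as demo_dir:
    print(os.path.isdir(ensure_fixed_dir(demo_dir)))
```

In the notebook the call would be `ensure_fixed_dir(load_config['out_collated_dir'])`, run before the cell below.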

In [34]:
from magenta.music import note_sequence_io

# Iterate over .tfrecord files in a dir
for src_file in os.listdir(load_config['out_collated_dir']):
    if src_file.endswith('.tfrecord'):
        src_file_path = os.path.join(load_config['out_collated_dir'], src_file)
        target_path = os.path.join(load_config['out_collated_dir'], 'fixed', src_file)
        
        # One writer per output file; the same loop handles inputs and targets
        with note_sequence_io.NoteSequenceRecordWriter(target_path) as writer:

            # Iterate over the records in the .tfrecord file
            for record in note_sequence_io.note_sequence_record_iterator(src_file_path):

                # Look up the song's key in the index (record.id is the source file name)
                key = int(manager.files_index.loc[record.id]['key_mag'])
                # Assumes each record already has at least one key signature
                record.key_signatures[0].key = key

                writer.write(record)

        print('INFO: Successfully updated {} to match the tonal keys from songs_metadata.csv.'.format(src_file))

INFO: Successfully updated train_inputs.tfrecord to match the tonal keys from songs_metadata.csv.
INFO: Successfully updated train_targets.tfrecord to match the tonal keys from songs_metadata.csv.
INFO: Successfully updated eval_targets.tfrecord to match the tonal keys from songs_metadata.csv.
INFO: Successfully updated eval_inputs.tfrecord to match the tonal keys from songs_metadata.csv.
INFO: Successfully updated test_targets.tfrecord to match the tonal keys from songs_metadata.csv.
INFO: Successfully updated test_inputs.tfrecord to match the tonal keys from songs_metadata.csv.


#### Debugging

In [61]:
from magenta.music import note_sequence_io
TFRECORD_FILE = '../assets/data/collated/B/eval_targets.tfrecord'

note_seqs = []
selected = None
# Collect every record and keep a handle on one specific segment for inspection
for record in note_sequence_io.note_sequence_record_iterator(TFRECORD_FILE):
    note_seqs.append(record)

    if record.id == "circleoflife_adv-chorus1-bh.xml":
        selected = record

# print(note_seqs[0].key_signatures[0].key)
# note_seqs[0].total_time
# note_seqs[0]


#### Test if working as expected

### Debugging