This notebook will help you manage your dataset folder and convert the source `.xml` files to NoteSequence protos for further processing.

**INSTRUCTIONS**

* Put all song `.xml` files in a single folder.
    * Each song must have at least two performance levels of difficulty.
    * Files must follow the naming convention:
    
    `[Song Name]_[Performance Level]-[Song Segment]-[Hand].xml`
        * Song Name uses only `A-Z`, `a-z` and `0-9`, no spaces or any other characters. This is a unique identifier, so make sure it is unique and it is spelled exactly the same in each file.
        * Performance Level is one of `['_beg', '_int', '_adv']`.
        * Song Segment is a unique string for a given Song Name and Performance Level. Song Segments must match exactly across the Performance Levels of a song.
        * Hand is one of `['lh', 'rh', 'bh']`.
* `SourceFolderManager` can do the following for you: 
    * Traverse your chosen directory for `.xml` files and build an index that classifies the musical content of each `.xml` file.
    * Collate files into `source -> target` pairs according to criteria of your choice. This uses the previously built index, so rebuild the index if any files change in the meantime.
    * Convert the collated pairs (which are stored as `.xml` file paths) to NoteSequence protos, serialize and save them as `.tfrecord` files in a directory of your choice.
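
The naming convention and the pairing step above can be sketched in a few lines. This is a minimal illustration, not the code in `SourceFolderManager`: `NAME_RE`, `parse_name` and `collate_pairs` are hypothetical names, and the pattern assumes filenames such as `yellow_beg-chorus1-rh.xml` (a single underscore before the level, and no hyphens inside the Song Segment).

```python
import re

# Assumed layout: <SongName>_<level>-<segment>-<hand>.xml
# (hypothetical pattern; adjust it if your files differ).
NAME_RE = re.compile(
    r"^(?P<song>[A-Za-z0-9]+)_(?P<level>beg|int|adv)"
    r"-(?P<segment>[^-]+)-(?P<hand>lh|rh|bh)\.xml$"
)


def parse_name(filename):
    """Return the filename's fields as a dict, or None if it breaks the convention."""
    match = NAME_RE.match(filename)
    return match.groupdict() if match else None


def collate_pairs(filenames, levels=("beg", "int"), hands=("rh", "rh")):
    """Pair each source file with the target file of the same song and segment."""
    index = {}
    for name in filenames:
        fields = parse_name(name)
        if fields is not None:
            key = (fields["song"], fields["segment"], fields["level"], fields["hand"])
            index[key] = name

    pairs = []
    for (song, segment, level, hand), source in sorted(index.items()):
        # Keep only files that match the source level and hand ...
        if level == levels[0] and hand == hands[0]:
            # ... and that have a counterpart at the target level and hand.
            target = index.get((song, segment, levels[1], hands[1]))
            if target is not None:
                pairs.append((source, target))
    return pairs


files = [
    "yellow_beg-chorus1-rh.xml",
    "yellow_int-chorus1-rh.xml",
    "yellow_beg-verse1-rh.xml",  # no 'int' counterpart, so it is not paired
]
print(collate_pairs(files))
```

Running a check like `parse_name` over the folder before building the index is a cheap way to spot misnamed files, which would otherwise be missing from the collated pairs.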

The four cells below are all you need to execute to get started. See the comments in `preprocess.py` for insight into what happens behind the scenes.

**DEPENDENCIES**

In [4]:
import data_load
import os

import importlib
importlib.reload(data_load)

<module 'data_load' from '/Users/vesko/GitHub/UoE-dissertation/arranger_model/build_dataset/data_load.py'>

**PARAMETERS**

In [9]:
load_config = dict()

load_config['source_xml_dir'] = "../assets/data/raw/"
load_config['out_collated_dir'] = "../assets/data/collated/_stats/"

load_config['ext_meta'] = "../assets/data/raw/_songs_metadata.csv"

In [10]:
load_config['genres'] = ['Pop', 'Rock', 'Film', 'Religious', 
                         'Traditional', 'Musical', 'Country', 'Contemporary Piano']

load_config['test_set'] = ['myfavoritethings', 'Something', 'WhereEverybodyKnowsYourName',
                           'girlcrush', 'ImagineMe', 'withorwithoutyou', 'cantstopthefeeling', 
                           'SmellsLikeTeenSpirit', 'Itookapillinibiza', 'wethreekings', 
                           'whereareyouchristmas', 'AllThingsAreWorking', 'LikeImGonnaLoseYou', 
                           'RememberMe','letmeloveyou', 'WalkingInMemphis', 'WishYouWereHere', 
                           'neversaynever', 'WerewolvesOfLondon', 'RightHereWaiting']

load_config['eval_set'] = ['yellow', 'whowantstoliveforever', 'Angie', 'aintNoSunshine',
                           'everytimeyougoaway', 'MaybeImAmazed', 'Levon', 'AnotherDayInParadise', 
                           'AllOutOfLove', 'sweetemotion', 'circleoflife', 'CheapThrills', 
                           'californication', 'ochristmastree', 'aslongasyouremine', 
                           'ValseAmelie', 'sevenyears', 'BennieandtheJets', 'thecircleoflife',
                           'partofyourworld']
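
Because the evaluation and test songs are listed by hand, it may be worth checking that no song appears in both splits or twice in one split. The helper below is a hypothetical sketch (`check_splits` is not part of `data_load`). It compares names case-sensitively, since Song Names must be spelled exactly the same in every file.

```python
from collections import Counter


def check_splits(eval_set, test_set):
    """Return (overlap, duplicates) for two lists of song names."""
    # Songs listed in both splits would leak between evaluation and test.
    overlap = sorted(set(eval_set) & set(test_set))
    # Songs listed more than once within a single split.
    duplicates = sorted(
        name
        for split in (eval_set, test_set)
        for name, count in Counter(split).items()
        if count > 1
    )
    return overlap, duplicates


overlap, duplicates = check_splits(["yellow", "Angie"], ["Something", "Angie"])
print(overlap, duplicates)
```

In the notebook, the call would be `check_splits(load_config['eval_set'], load_config['test_set'])`. Both returned lists should be empty.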

# Collate Files

In [11]:
manager = data_load.SourceFolderManager()
manager.build_index(src_folder = load_config['source_xml_dir'],
                    ext_meta = load_config['ext_meta'])

In [12]:
manager.files_index.to_csv(os.path.join(load_config['out_collated_dir'], 'files_index.csv'))

In [30]:
manager.collate(hand=('rh', 'rh'),
                level=[('beg', 'int')],
                DoubleNoteVal=False,
                WholeSong=False,
                eval_set=load_config['eval_set'],
                test_set=load_config['test_set'])


INFO: Successfully collated Train: 1087 pairs from 152 unique songs.
INFO: Successfully collated Eval: 156 pairs from 20 unique songs.
INFO: Successfully collated Test: 157 pairs from 20 unique songs.

Count of pairs by level:
{'train': Counter({('beg', 'int'): 1087}), 'eval': Counter({('beg', 'int'): 156}), 'test': Counter({('beg', 'int'): 157})}

Count of pairs by segment type:
{'train': Counter({'chorus1': 142, 'chorus2': 136, 'verse2': 117, 'verse1': 116, 'intro': 100, 'chorus3': 97, 'bridge': 60, 'verse3': 50, 'prechorus1': 44, 'outro': 41, 'prechorus2': 40, 'chorus4': 17, 'postchorus1': 17, 'instrumental': 17, 'verse4': 16, 'bridge1': 13, 'bridge2': 12, 'postchorus2': 9, 'prechorus3': 8, 'verse5': 4, 'chorus5': 4, 'section1': 3, 'section2': 3, 'section3': 3, 'section4': 3, 'postchorus3': 2, 'verse6': 2, 'chorus6': 2, 'bridge3': 2, 'section5': 2, 'intro1': 1, 'intro2': 1, 'instrumental2': 1, 'rap': 1, 'section6': 1}), 'eval': Counter({'chorus1': 19, 'chorus2': 18, 'verse1': 16, '

In [14]:
manager.serialize_collated(load_config['out_collated_dir']) # .xml files --> Protocol buffers in TensorFlow Record containers

INFO: Saved ../assets/data/collated/_stats/train_inputs.tfrecord.
INFO: Saved ../assets/data/collated/_stats/train_targets.tfrecord.
INFO: Saved ../assets/data/collated/_stats/eval_inputs.tfrecord.
INFO: Saved ../assets/data/collated/_stats/eval_targets.tfrecord.
INFO: Saved ../assets/data/collated/_stats/test_inputs.tfrecord.
INFO: Saved ../assets/data/collated/_stats/test_targets.tfrecord.


## Quick & Dirty Fix

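Update every `NoteSequence` in the `.tfrecord` files so that its key signature matches the key listed in the metadata `.csv` file. The updated files are written to a subdirectory called `fixed` inside the output directory, so make sure to create that directory first.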
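
If you would rather create the `fixed` directory from the notebook than by hand, a minimal sketch is below. `ensure_fixed_dir` is a hypothetical helper, not part of `data_load`. It relies only on the standard-library `os.makedirs`, and is meant to be called with `load_config['out_collated_dir']`.

```python
import os


def ensure_fixed_dir(collated_dir):
    """Create `<collated_dir>/fixed` if it is missing and return its path."""
    fixed_dir = os.path.join(collated_dir, "fixed")
    # exist_ok=True makes the call safe to repeat when re-running the notebook.
    os.makedirs(fixed_dir, exist_ok=True)
    return fixed_dir
```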

In [15]:
from magenta.music import note_sequence_io

# Iterate over the .tfrecord files in the output directory
for src_file in os.listdir(load_config['out_collated_dir']):
    if src_file.endswith('.tfrecord'):
        src_file_path = os.path.join(load_config['out_collated_dir'], src_file)
        target_path = os.path.join(load_config['out_collated_dir'], 'fixed', src_file)

        with note_sequence_io.NoteSequenceRecordWriter(target_path) as writer:

            # Iterate over the records in the .tfrecord file
            for record in note_sequence_io.note_sequence_record_iterator(src_file_path):

                # Look up the key by record id (assumes files_index is indexed by that id)
                # and overwrite the first key signature, which must already exist
                key = int(manager.files_index.loc[record.id]['key_mag'])
                record.key_signatures[0].key = key

                writer.write(record)

        print('INFO: Successfully updated {} to match the tonal keys from songs_metadata.csv.'.format(src_file))

INFO: Successfully updated train_inputs.tfrecord to match the tonal keys from songs_metadata.csv.
INFO: Successfully updated train_targets.tfrecord to match the tonal keys from songs_metadata.csv.
INFO: Successfully updated eval_targets.tfrecord to match the tonal keys from songs_metadata.csv.
INFO: Successfully updated eval_inputs.tfrecord to match the tonal keys from songs_metadata.csv.
INFO: Successfully updated test_targets.tfrecord to match the tonal keys from songs_metadata.csv.
INFO: Successfully updated test_inputs.tfrecord to match the tonal keys from songs_metadata.csv.


In [None]:
TFRECORD_FILE = './data/collated/B/eval_targets.tfrecord'

note_seqs = []
for record in note_sequence_io.note_sequence_record_iterator(TFRECORD_FILE):
    note_seqs.append(record)

print(note_seqs[0].key_signatures[0].key)
note_seqs[0].total_time
note_seqs[0]

#### Test if working as expected

### Debugging