This notebook will help you manage your dataset folder and convert the source `.xml` files to NoteSequence protos for further processing.

**INSTRUCTIONS**

* Put all song `.xml` files in a single folder.
    * Each song must have at least two performance levels of difficulty.
    * Files must follow the naming convention:
    
    `[Song Name]_[Performance Level]-[Song Segment]-[Hand].xml`
        * Song Name uses only `A-Z`, `a-z` and `0-9`, with no spaces or other characters. It is the song's unique identifier, so spell it exactly the same in every file that belongs to the song.
        * Performance Level is one of `['_beg', '_int', '_adv']`.
        * Song Segment is a unique string for a given Song Name and Performance Level. Song Segments must match exactly across the Performance Levels of a song.
        * Hand is one of `['lh', 'rh', 'bh']`.
* `SourceFolderManager` can do the following for you: 
    * Traverse your chosen directory for `.xml` files and build an index that classifies the type of musical composition each `.xml` file holds.
    * Collate files into `source -> target` pairs according to criteria of your choice. Collation uses the previously built index, so rebuild the index if any files change in the meantime.
    * Convert the collated pairs (which are stored as `.xml` file paths) to NoteSequence protos, serialize and save them as `.tfrecord` files in a directory of your choice.
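
To make the naming convention concrete, here is a minimal sketch of how a file name could be checked against it. This is not the code in `data_load`; `FILENAME_RE`, `SongFile` and `parse_filename` are hypothetical names introduced for illustration. The sketch assumes the underscore in the pattern is the separator between Song Name and Performance Level, so levels are matched as `beg`, `int` and `adv` (the spelling used in the `collate` call), and it assumes a Song Segment may contain any characters.

```python
import re
from typing import NamedTuple, Optional

# Assumed reading of: [Song Name]_[Performance Level]-[Song Segment]-[Hand].xml
# The underscore is treated as the separator, so levels carry no leading "_".
FILENAME_RE = re.compile(
    r"(?P<song>[A-Za-z0-9]+)"      # Song Name: letters and digits only
    r"_(?P<level>beg|int|adv)"     # Performance Level
    r"-(?P<segment>.+)"            # Song Segment (greedy; may contain hyphens)
    r"-(?P<hand>lh|rh|bh)\.xml"    # Hand, then the extension
)


class SongFile(NamedTuple):
    song: str
    level: str
    segment: str
    hand: str


def parse_filename(name: str) -> Optional[SongFile]:
    """Return the parsed fields, or None if the name breaks the convention."""
    match = FILENAME_RE.fullmatch(name)
    if match is None:
        return None
    return SongFile(**match.groupdict())
```

Running `parse_filename` over a folder listing before building the index is one way to spot misnamed files early: any name that yields `None` does not follow the convention as interpreted here.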

The four cells below are all you need to execute to get started. See the comments in `preprocess.py` for insight into what happens behind the scenes.

**DEPENDENCIES**

In [1]:
import data_load

**PARAMETERS**

In [2]:
load_config = dict()

load_config['genres'] = ['Pop', 'Rock', 'Film', 'Religious', 
                         'Traditional', 'Musical', 'Country', 'Contemporary Piano']

load_config['source_xml_dir'] = "/Users/vesko/Google Drive/Docs/Education/Edinburgh/Classes/DISS/Data/MSc 2018 Research/Preprocessed Dataset"
load_config['out_protos_dir'] = "./data/collated/B/"

# Files to exclude from the dataset.
# Note: this exclusion is not implemented yet.
load_config['test_set'] = ['']
load_config['eval_set'] = ['']

# Collate Files

In [3]:
manager = data_load.SourceFolderManager()
manager.build_index(load_config['source_xml_dir'])

In [5]:
manager.collate(hand=('rh', 'bh'),
                includeWholeSong=False,
                level=[('int', 'adv'), ('beg', 'adv'), ('beg', 'int')],
                limit=None
               )

# Hint: You can access the collated list with `manager.collated_index`

INFO: Skipping "Thunder" because of mismatching segments or hand parts.
INFO: Skipping "chariotsoffire" because of mismatching segments or hand parts.
INFO: Skipping "chariotsoffire" because of mismatching segments or hand parts.

INFO: Successfully collated 3239 pairs from 194 unique songs.

Count of pairs by level:
Counter({('beg', 'int'): 1400, ('int', 'adv'): 928, ('beg', 'adv'): 914})

Count of pairs by segment type:
Counter({'chorus1': 411, 'chorus2': 397, 'verse1': 338, 'verse2': 336, 'intro': 315, 'chorus3': 279, 'bridge': 171, 'verse3': 138, 'outro': 132, 'prechorus1': 123, 'prechorus2': 115, 'instrumental': 66, 'chorus4': 51, 'verse4': 49, 'postchorus1': 48, 'bridge1': 36, 'bridge2': 33, 'postchorus2': 27, 'prechorus3': 21, 'verse5': 13, 'section1': 12, 'section2': 12, 'section3': 12, 'section4': 12, 'chorus5': 10, 'section5': 9, 'postchorus3': 7, 'bridge3': 7, 'chorus6': 6, 'instrumental1': 6, 'section6': 6, 'verse6': 4, 'instrumental2': 4, 'intro1': 3, 'intro2': 3, 'outro1'
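
The pairing rule described in the instructions (segments must match exactly across the performance levels of a song) can be illustrated with a small, self-contained sketch. It is not the implementation of `SourceFolderManager.collate`; `collate_pairs` and its tuple-based input are hypothetical. It assumes the first element of each level tuple is the source and the second is the target, and that a song is skipped for a given level pair when the two levels do not contain the same set of (segment, hand) parts.

```python
from collections import defaultdict


def collate_pairs(files, level_pairs, hands):
    """Pair files across performance levels.

    files: iterable of (song, level, segment, hand, path) tuples.
    level_pairs: list of (source_level, target_level) tuples (assumed order).
    hands: hand parts to keep, e.g. ('rh', 'bh').
    Returns (pairs, skipped), where pairs holds (source_path, target_path)
    and skipped holds (song, level_pair) for mismatching songs.
    """
    # index[song][level][(segment, hand)] = path
    index = defaultdict(lambda: defaultdict(dict))
    for song, level, segment, hand, path in files:
        if hand in hands:
            index[song][level][(segment, hand)] = path

    pairs, skipped = [], []
    for song, by_level in index.items():
        for level_pair in level_pairs:
            source = by_level.get(level_pair[0])
            target = by_level.get(level_pair[1])
            if source is None or target is None:
                continue  # the song does not have both levels
            if source.keys() != target.keys():
                skipped.append((song, level_pair))  # mismatching parts
                continue
            for key in sorted(source):
                pairs.append((source[key], target[key]))
    return pairs, skipped
```

Under this rule a song is checked once per level pair, so the same song can be skipped more than once, which would be consistent with a song name appearing repeatedly in the skip messages.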

In [6]:
manager.serialize_collated(target_dir = load_config['out_protos_dir'])

INFO: Created ./data/collated/B/ directory.


TypeError: 'NoneType' object is not iterable