# Load Data

This notebook will help you manage your dataset folder and convert the source `.xml` files to NoteSequence protos for further processing.

In [2]:
import constants
import load_data

**INSTRUCTIONS**


* Put all song `.xml` files in a single folder.
    * Each song must be available in at least two performance levels (levels of difficulty).
    * Files must follow the naming convention:
    
    `[Song Name]_[Performance Level]-[Song Segment]-[Hand].xml`
        * Song Name uses only `A-Z`, `a-z` and `0-9`, with no spaces or other characters. It serves as the song's unique identifier, so spell it exactly the same in every file that belongs to the song.
        * Performance Level is one of `['_beg', '_int', '_adv']`.
        * Song Segment is a string that is unique for a given Song Name and Performance Level. Song Segments must match exactly across the Performance Levels of a song.
        * Hand is one of `['lh', 'rh', 'bh']`.
* `SourceFolderManager` can do the following for you: 
    * Traverse your chosen directory for `.xml` files and build an index that classifies the musical content of each `.xml` file (for example, its segment and hand).
    * Collate files into `source -> target` pairs according to criteria of your choice. This step uses the previously built index, so rebuild the index if any files change in the meantime.
    * Convert the collated pairs (which are stored as `.xml` file paths) to NoteSequence protos, serialize and save them as `.tfrecord` files in a directory of your choice.
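
To make the naming convention above concrete, here is a minimal sketch of how a file name could be split into its four parts. The regular expression and the `parse_filename` helper are hypothetical and are not taken from `load_data`. The sketch assumes a single underscore between Song Name and Performance Level, and it accepts any non-empty string as the Song Segment.

```python
import re

# Hypothetical pattern for `[Song Name]_[Performance Level]-[Song Segment]-[Hand].xml`.
# Assumption: exactly one underscore separates the song name from the level.
FILENAME_RE = re.compile(
    r"^(?P<song>[A-Za-z0-9]+)"   # Song Name: letters and digits only
    r"_(?P<level>beg|int|adv)"   # Performance Level
    r"-(?P<segment>.+)"          # Song Segment: any non-empty string
    r"-(?P<hand>lh|rh|bh)"       # Hand
    r"\.xml$"
)


def parse_filename(filename):
    """Return the parts of a file name as a dict, or None if it does not match."""
    match = FILENAME_RE.match(filename)
    return match.groupdict() if match else None


print(parse_filename("Thunder_beg-wholeSong-bh.xml"))
print(parse_filename("Thunder Song_beg-wholeSong-bh.xml"))  # space in the name
```

A file name that fails to parse (for example, one with a space in the Song Name) returns `None`, which is a cheap way to spot misnamed files before building the index.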

The four cells below are all you need to execute to get started. See the comments in `preprocess.py` for insight into what happens behind the scenes.

In [3]:
manager = load_data.SourceFolderManager(src_folder=constants.SOURCE_XML_DIR)

In [4]:
manager.build_index()

# Hint: To take a peek at the index, run `manager.files_index`
#
# Hint 2: You can slice the dataframe like this:
# manager.files_index.loc[(manager.files_index['segment'] == 'wholeSong') 
#                       & (manager.files_index['hand'] == 'bh')]

In [5]:
manager.collate(hand='bh', 
                includeWholeSong=False,
                level=[('int', 'adv'), ('beg', 'adv'), ('beg', 'int')],
                limit=3
               )

# Hint: You can access the collated list with `manager.collated_index`

INFO: Skipping "Thunder" because of mismatching segments.
INFO: Skipping "Thunder" because of mismatching segments.
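
The "mismatching segments" messages can be understood through a rough sketch of the pairing logic. This is not the actual `SourceFolderManager.collate` implementation. The function `collate_pairs`, the row layout of `index` and the handling of `wholeSong` are all assumptions made for illustration, and the `limit` argument is not modelled.

```python
from collections import defaultdict


def collate_pairs(index, level_pairs, hand="bh", include_whole_song=False):
    """Pair source and target files segment by segment (illustrative only).

    `index` is assumed to be a list of dicts with the keys
    'song', 'level', 'segment', 'hand' and 'path'.
    """
    # files[song][level][segment] -> path, restricted to the requested hand
    files = defaultdict(lambda: defaultdict(dict))
    for row in index:
        if row["hand"] != hand:
            continue
        if row["segment"] == "wholeSong" and not include_whole_song:
            continue
        files[row["song"]][row["level"]][row["segment"]] = row["path"]

    pairs = []
    for song, by_level in files.items():
        for src_level, tgt_level in level_pairs:
            src = by_level.get(src_level, {})
            tgt = by_level.get(tgt_level, {})
            if not src or not tgt:
                continue  # the song lacks one of the two levels
            if set(src) != set(tgt):
                # Segments must match exactly across performance levels.
                print(f'INFO: Skipping "{song}" because of mismatching segments.')
                continue
            for segment in sorted(src):
                pairs.append((src[segment], tgt[segment]))
    return pairs
```

In this sketch a song is skipped once per level pair whose segment sets differ, which would be one way to end up with the same message printed more than once for a single song.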


In [7]:
manager.serialize_collated(target_dir=constants.COLLATED_NOTE_SEQ_DIR)

# Notes

### Good tutorial on building TFRecord files


https://medium.com/@WuStangDan/step-by-step-tensorflow-object-detection-api-tutorial-part-2-converting-dataset-to-tfrecord-47f24be9248d

### Useful scripts for reading TFRecord

We can read the contents of the TFRecord file holding NoteSequence records like this:
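
As a dependency-free sketch, the reader below walks the TFRecord framing directly: each record is an 8-byte little-endian length, a 4-byte checksum of the length, the payload, and a 4-byte checksum of the payload. The function name `read_tfrecord` is hypothetical, and the checksums are skipped rather than verified, so treat it as an inspection aid and not as a replacement for TensorFlow's own readers.

```python
import struct


def read_tfrecord(path):
    """Yield the raw payload bytes of each record in a TFRecord file.

    Checksums are read but not verified in this sketch.
    """
    with open(path, "rb") as f:
        while True:
            header = f.read(12)  # uint64 length + uint32 checksum of the length
            if not header:
                return  # clean end of file
            if len(header) < 12:
                raise ValueError("truncated record header")
            (length,) = struct.unpack("<Q", header[:8])
            payload = f.read(length)
            footer = f.read(4)  # uint32 checksum of the payload (ignored here)
            if len(payload) < length or len(footer) < 4:
                raise ValueError("truncated record body")
            yield payload
```

Each payload is a serialized protocol buffer, so it can be handed to `music_pb2.NoteSequence.FromString(payload)`, using whichever package provides `music_pb2` in your environment.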

The following is a more general record iterator, which accepts the protocol buffer class to use for deserialization as its second argument. It returns a generator.
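
A minimal sketch of such an iterator is shown here. The name `parse_records` is hypothetical, and as a simplifying assumption the first argument is any iterable of serialized records (raw `bytes`) instead of a file path. It relies only on the `FromString` class method that generated Python protobuf classes provide.

```python
def parse_records(raw_records, proto_class):
    """Lazily deserialize serialized records into `proto_class` messages.

    raw_records: iterable of bytes, one serialized message per item (assumed).
    proto_class: a protobuf message class exposing `FromString`.
    """
    for raw in raw_records:
        yield proto_class.FromString(raw)


# Usage, assuming `raw_records` holds serialized NoteSequence messages:
#     for sequence in parse_records(raw_records, music_pb2.NoteSequence):
#         print(sequence.total_time)
```

Because the function is a generator, nothing is parsed until you iterate over it, which keeps memory use flat for large files.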

Frequently used proto classes include:

* `tf.train.SequenceExample`
* `music_pb2.NoteSequence`