This notebook helps you define and run pipelines that process your data, including data augmentation, slicing, stretching, and encoding. Before using it, you should have already collated your original `.xml` files with `1.1. Collate Files.ipynb`.

A pipeline is a data-processing module that transforms an input data type into an output data type. The idea, as well as bits and pieces of the code, is borrowed from [the Magenta project](https://github.com/tensorflow/magenta/tree/master/magenta/pipelines).
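
To make the input-type/output-type idea concrete, here is a minimal pure-Python sketch. It does not use Magenta: `ToyPipeline`, `ToyReverser`, `ToyTextEncoder`, and `run_chain` are hypothetical names invented for illustration. The one convention it assumes is that a stage's `transform` takes a single input and returns a list of outputs, which is what lets an augmentation stage emit more items than it receives.

```python
class ToyPipeline:
    """Hypothetical stand-in for a pipeline stage (not Magenta's class)."""
    input_type = object
    output_type = object

    def transform(self, item):
        """Takes one input and returns a list of outputs."""
        raise NotImplementedError


class ToyReverser(ToyPipeline):
    """Augmentation sketch: optionally also emits the reversed sequence."""
    input_type = list
    output_type = list

    def __init__(self, augment):
        self.augment = augment

    def transform(self, pitches):
        if self.augment:
            return [pitches, pitches[::-1]]
        return [pitches]


class ToyTextEncoder(ToyPipeline):
    """Encoding sketch: turns a list of pitches into one line of text."""
    input_type = list
    output_type = str

    def transform(self, pitches):
        return [" ".join("ON%d" % p for p in pitches)]


def run_chain(stages, item):
    """Feeds every output of one stage into the next stage."""
    items = [item]
    for stage in stages:
        next_items = []
        for current in items:
            # Each stage only accepts its declared input type.
            assert isinstance(current, stage.input_type)
            next_items.extend(stage.transform(current))
        items = next_items
    return items


texts = run_chain([ToyReverser(augment=True), ToyTextEncoder()], [60, 64, 67])
print(texts)
```

With `augment=True`, one input sequence yields two text lines; with `augment=False` it yields one.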


**INSTRUCTIONS**
 
First, adjust the pipeline definitions inside `pipeline_graph_def`. Then run `build_dataset`. This creates four files: an inputs file and a targets file for training, and an inputs file and a targets file for evaluation.

**DEPENDENCIES**

In [6]:
# General
import os
import re 
import pandas as pd

# The processing manager which glues everything
from utils import data_processing

# For Augmentation
from utils.data_processing import TransposerToC, TransposerToRange, Reverser

# For Processing
from magenta.pipelines.note_sequence_pipelines import Quantizer, Splitter
from utils.data_processing import PerformanceExtractor, MetadataExtractor, ParserToText

# Other
from magenta.protobuf import music_pb2
from magenta.pipelines import pipelines_common, dag_pipeline

**PARAMETERS**

In [7]:
pipeline_config = dict()

pipeline_config['data_source_dir'] = "./data/collated/B/"
pipeline_config['data_target_dir'] = "./data/processed/plover/"

# How many steps per quarter note
pipeline_config['steps_per_quarter'] = 4

pipeline_config['min_events'] = 1
pipeline_config['max_events'] = 10000

pipeline_config['MIN_MIDI_PITCH'] = 0 # Inclusive.
pipeline_config['MAX_MIDI_PITCH'] = 127 # Inclusive.
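
As a rough illustration of what `steps_per_quarter = 4` implies, the sketch below converts a time in seconds to a grid step. It is not Magenta's `Quantizer`: `seconds_to_steps` is a hypothetical helper, and both the tempo of 120 quarter notes per minute and the round-to-nearest rule are assumptions made for this example.

```python
def seconds_to_steps(seconds, qpm, steps_per_quarter=4):
    """Snaps a time in seconds to the nearest step on the quantization grid.

    qpm is the tempo in quarter notes per minute (assumed, not read from data).
    """
    steps_per_second = steps_per_quarter * qpm / 60.0
    # Round to the nearest step (an assumption of this sketch).
    return int(seconds * steps_per_second + 0.5)


# At 120 qpm with 4 steps per quarter there are 8 steps per second,
# so one step lasts 0.125 s.
onsets = [0.26, 0.32, 1.0]
steps = [seconds_to_steps(t, qpm=120) for t in onsets]
print(steps)
```

A larger `steps_per_quarter` gives a finer grid, so fewer distinct onsets collapse onto the same step, at the cost of longer time-shift sequences.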

**DEFINITIONS**

In [16]:
def pipeline_graph_def(collection_name,
                       config):
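    """Returns the pipeline that creates the RNN dataset for one collection.

    Args:
        collection_name: Name of the collection being built, e.g.
            'train_inputs'. Names starting with 'train_' enable augmentation.
        config: dict() with configuration settings.

    Returns:
        A dag_pipeline.DAGPipeline instance.
    """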
    
    
    # User Variables
    metadata_df = pd.read_csv(os.path.join(pipeline_config['data_source_dir'], 'filex_index.csv'), index_col=0)
    metadata_attr = ['segment', 'mode', 'genre', 'key', 'time_sig']
    split_hop_size_seconds = 9999
    
    # Do not modify these
    train_mode = re.match(r'train(?=_)', collection_name)
    key = collection_name
    dag = {}
    
    # Input must NOT be quantized
    splitter = Splitter(
        hop_size_seconds=split_hop_size_seconds,
        name='Splitter_' + key)
    
    # `Quantizer` takes note data in seconds and snaps, or quantizes,
    # everything to a discrete grid of timesteps. It maps `NoteSequence`
    # protocol buffers to `NoteSequence` protos with quantized times.
    quantizer = Quantizer(
        steps_per_quarter=pipeline_config['steps_per_quarter'], 
        name='Quantizer_' + key)
        # input_type=music_pb2.NoteSequence
        # output_type=music_pb2.NoteSequence
        
    reverser = Reverser(
        bool(train_mode),
        name='Reverser' + key)
        # input_type=music_pb2.NoteSequence
        # output_type=music_pb2.NoteSequence

#     transposer = TransposerToRange(
#         range(-12, 12) if train_mode else [0],
#         min_pitch = pipeline_config['MIN_MIDI_PITCH'],
#         max_pitch = pipeline_config['MAX_MIDI_PITCH'],
#         name='Transposer_' + key)
#         # input_type=music_pb2.NoteSequence
#         # output_type=music_pb2.NoteSequence
        
    transposerToC = TransposerToC(
        name='TransposerToC' + key)

    perf_extractor = PerformanceExtractor(
        min_events=pipeline_config['min_events'],
        max_events=pipeline_config['max_events'],
        num_velocity_bins=0,
        name='PerformanceExtractor_' + key)
        # input_type = music_pb2.NoteSequence
        # output_type = magenta.music.MetricPerformance

    meta_extractor = MetadataExtractor(
        metadata_df=metadata_df,
        attributes=metadata_attr,
        name='MetadataExtractor' + key)
    
    parser = ParserToText(
        name='ParserToText' + key)
        # input_type = magenta.music.MetricPerformance
        # output_type = str

    # Reverse
    # Split
    
    ### Pipelines Map ###
    #
    # DagInput > Splitter > Quantizer > Reverser > TransposerToC > PerformanceExtractor > 'MetricPerformance'
    # DagInput > MetadataExtractor > 'metadata'
    # 
    # {'MetricPerformance', 'metadata'} > ParserToText > DagOutput
    #
    # Also available: TransposerToRange (for augmentation)
    
    dag[splitter] = dag_pipeline.DagInput(music_pb2.NoteSequence)
    dag[quantizer] = splitter
    dag[reverser] = quantizer
    dag[transposerToC] = reverser
    dag[perf_extractor] = transposerToC
    
    dag[meta_extractor] = dag_pipeline.DagInput(music_pb2.NoteSequence)

    # ParserToText consumes both branches as named inputs.
    dag[parser] = {'MetricPerformance': perf_extractor,
                   'metadata': meta_extractor}
    
    dag[dag_pipeline.DagOutput(key)] = parser
        
    return dag_pipeline.DAGPipeline(dag)
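
The wiring convention used in `pipeline_graph_def` is that each key of `dag` is a stage and its value is whatever feeds that stage, with a dict value supplying several named inputs. The toy resolver below sketches that convention in plain Python. It is not how `DAGPipeline` is implemented: `run_toy_dag`, the string stage names, and the lambda stages are all hypothetical, and each toy stage returns a single value rather than a list.

```python
def run_toy_dag(dag, funcs, dag_input, output_key):
    """Evaluates a consumer -> producer mapping by walking upstream."""
    cache = {}

    def evaluate(node):
        if node == "INPUT":
            return dag_input
        if node not in cache:
            upstream = dag[node]
            if isinstance(upstream, dict):
                # Several named inputs, like the ParserToText stage.
                kwargs = {name: evaluate(dep) for name, dep in upstream.items()}
                cache[node] = funcs[node](**kwargs)
            else:
                cache[node] = funcs[node](evaluate(upstream))
        return cache[node]

    return evaluate(dag[output_key])


# Toy graph with the same shape as the map in the comments above.
toy_dag = {
    "quantizer": "INPUT",
    "perf_extractor": "quantizer",
    "meta_extractor": "INPUT",
    "parser": {"MetricPerformance": "perf_extractor",
               "metadata": "meta_extractor"},
    "OUTPUT": "parser",
}

# Toy stage bodies; the token formats are made up for this example.
toy_funcs = {
    "quantizer": lambda onsets: [round(t * 8) for t in onsets],
    "perf_extractor": lambda steps: ["SHIFT%d" % s for s in steps],
    "meta_extractor": lambda onsets: ["F", "6over4"],
    "parser": lambda MetricPerformance, metadata: " ".join(
        metadata + MetricPerformance),
}

line = run_toy_dag(toy_dag, toy_funcs, [0.125, 0.5], "OUTPUT")
print(line)
```

Both branches start from the same input, and the parser stage only runs once both of its named inputs are available.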

# Build Dataset
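
Whether augmentation is applied depends on the collection name: `pipeline_graph_def` tests it with the regular expression `train(?=_)`, which matches only names that begin with `train` immediately followed by an underscore. The sketch below checks a few names against that pattern. `is_train_collection` is a hypothetical helper, and every name other than `train_inputs` and `train_targets` is made up for illustration.

```python
import re


def is_train_collection(collection_name):
    """True when the name starts with 'train' directly followed by '_'."""
    return re.match(r'train(?=_)', collection_name) is not None


names = ['train_inputs', 'train_targets', 'eval_inputs',
         'training', 'pretrain_inputs']
flags = {name: is_train_collection(name) for name in names}
for name, flag in flags.items():
    print(name, flag)
```

Because `re.match` anchors at the start of the string and the lookahead requires an underscore, names such as `training` or `pretrain_inputs` do not count as training collections.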

In [17]:
data_processing.build_dataset(pipeline_config, pipeline_graph_def)

INFO: Collated data sourced from ./data/collated/B/.

INFO: Building train_inputs dataset...
INFO: Augmenting by reversing.
INFO: Transposing all to C.
INFO:tensorflow:

Completed.

INFO:tensorflow:Processed 2540 inputs total. Produced 41584 outputs.
INFO:tensorflow:DAGPipeline_PerformanceExtractor_train_inputs_performance_lengths_in_bars:
  [-inf,1): 3336
  [1,10): 38248
INFO:tensorflow:DAGPipeline_PerformanceExtractor_train_inputs_performances_discarded_more_than_1_program: 0
INFO:tensorflow:DAGPipeline_PerformanceExtractor_train_inputs_performances_discarded_too_short: 0
INFO:tensorflow:DAGPipeline_PerformanceExtractor_train_inputs_performances_truncated: 0
INFO:tensorflow:DAGPipeline_PerformanceExtractor_train_inputs_performances_truncated_timewise: 0
INFO:tensorflow:DAGPipeline_TransposerToCtrain_inputs_transpositions_generated: 41584

INFO: Building train_targets dataset...
INFO: Augmenting by reversing.
INFO: Transposing all to C.
INFO:tensorflow:

Completed.

INFO:tensorflow:Pr

# Build Vocabulary
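
Conceptually, building the vocabulary amounts to collecting the set of distinct tokens that occur in the processed text files. The sketch below shows that idea under the assumption that tokens are separated by whitespace. `collect_tokens` is a hypothetical helper, not the implementation of `data_processing.build_vocab`, and the demo files and their contents are invented.

```python
import os
import tempfile


def collect_tokens(paths):
    """Returns the set of whitespace-separated tokens found in the files."""
    tokens = set()
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                tokens.update(line.split())
    return tokens


# Demo on two small temporary files with made-up contents.
with tempfile.TemporaryDirectory() as tmp_dir:
    inputs_path = os.path.join(tmp_dir, "toy_inputs.txt")
    targets_path = os.path.join(tmp_dir, "toy_targets.txt")
    with open(inputs_path, "w", encoding="utf-8") as handle:
        handle.write("F 6over4 ON60 SHIFT4 OFF60\n")
    with open(targets_path, "w", encoding="utf-8") as handle:
        handle.write("ON60 SHIFT1 OFF60\n")
    vocab = collect_tokens([inputs_path, targets_path])

print(sorted(vocab))
```

Using a set means a token shared by the inputs and targets files is counted once, which is why both files can be passed together.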

In [18]:
data_processing.build_vocab(pipeline_config,
                            source_vocab_from=['train_inputs.txt', 'train_targets.txt'])

INFO: Collecting tokens from ./data/processed/plover/train_inputs.txt
INFO: Collecting tokens from ./data/processed/plover/train_targets.txt
INFO: Vocabulary built.
INFO: Tokens collected {'ON7', 'ON119', 'OFF14', 'OFF70', 'OFF30', 'OFF24', 'OFF38', 'OFF117', 'OFF123', 'ON44', '6over4', 'OFF40', 'OFF122', 'OFF5', 'ON51', 'OFF6', 'OFF71', 'OFF72', 'ON84', 'OFF74', 'SHIFT10', 'OFF61', 'ON24', 'ON97', 'ON110', 'OFF19', 'OFF4', 'OFF63', 'OFF125', 'ON87', 'OFF82', 'SHIFT15', 'outro', 'Contemporary', 'OFF97', 'ON5', 'ON39', 'F', 'ON81', 'OFF32', 'Db', 'ON67', 'OFF7', 'ON18', 'OFF3', 'ON103', 'OFF41', 'Bb', 'ON54', 'instrumental', 'OFF108', 'ON73', 'SHIFT6', 'ON109', 'OFF84', 'OFF2', 'ON11', 'ON78', 'OFF39', 'OFF56', 'OFF20', 'ON40', 'rap', 'OFF33', 'ON125', 'SHIFT1', 'ON43', 'ON36', 'OFF78', 'OFF115', 'ON74', 'ON22', 'ON120', 'OFF76', 'OFF54', 'OFF60', 'Religious', 'OFF64', 'SHIFT4', 'SHIFT7', 'OFF102', 'SHIFT9', 'OFF89', 'ON102', 'ON32', 'OFF34', 'prechorus', 'ON35', 'ON86', 'Gb', 'ON47', '