This notebook helps you define and run pipelines that process your data, including augmentation, slicing, stretching, and encoding. Before using it, you should already have collated your original `.xml` files with `1.1. Collate Files.ipynb`.

A pipeline is a data-processing module that transforms an input data type into an output data type. The idea, along with some of the code, is borrowed from [the Magenta project](https://github.com/tensorflow/magenta/tree/master/magenta/pipelines).
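
To make the "typed input to typed output" idea concrete, here is a minimal sketch in plain Python. `ToyPipeline` and `run_chain` are hypothetical names invented for this illustration; they are not the Magenta classes, and they only mimic the convention that a stage may emit several outputs for a single input.

```python
class ToyPipeline:
    """Hypothetical stand-in for a pipeline stage (not the Magenta class)."""

    def __init__(self, input_type, output_type, fn, name):
        self.input_type = input_type
        self.output_type = output_type
        self.fn = fn
        self.name = name

    def transform(self, item):
        # Check the declared input type, run the stage, check each output.
        if not isinstance(item, self.input_type):
            raise TypeError(f"{self.name} expects {self.input_type.__name__}")
        outputs = self.fn(item)
        for out in outputs:
            if not isinstance(out, self.output_type):
                raise TypeError(f"{self.name} must emit {self.output_type.__name__}")
        return outputs


def run_chain(stages, item):
    """Feed every output of one stage into the next and collect the results."""
    items = [item]
    for stage in stages:
        items = [out for it in items for out in stage.transform(it)]
    return items


# str -> list of pitches
parse = ToyPipeline(str, list, lambda s: [[int(t) for t in s.split()]], "parse")
# Augmentation: one sequence in, two out (original and reversed).
augment = ToyPipeline(list, list, lambda p: [p, p[::-1]], "augment")
# list of pitches -> text tokens
to_text = ToyPipeline(list, str, lambda p: [" ".join(f"ON{x}" for x in p)], "to_text")

result = run_chain([parse, augment, to_text], "60 62 64")
print(result)  # ['ON60 ON62 ON64', 'ON64 ON62 ON60']
```

Because each stage returns a list, an augmentation stage can multiply the number of examples without the downstream stages needing to know about it.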


**INSTRUCTIONS**
 
First, adjust the pipeline definitions inside `pipeline_graph_def`. Then run `build_dataset`. This creates four files: a training set and an evaluation set, each consisting of an inputs file and a targets file.
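
After a run, a quick check like the following can confirm that all four files exist. This is only a sketch: the names `train_inputs.txt` and `train_targets.txt` appear later in this notebook, while the `eval` prefix and the helper functions are assumptions made for illustration, so adjust them to whatever `build_dataset` actually writes.

```python
import os


def expected_dataset_files(target_dir,
                           splits=("train", "eval"),      # "eval" is an assumed prefix
                           parts=("inputs", "targets"),
                           ext=".txt"):
    """List the file paths a complete dataset is assumed to contain."""
    return [os.path.join(target_dir, f"{split}_{part}{ext}")
            for split in splits for part in parts]


def missing_files(paths):
    """Return the subset of paths that do not exist on disk."""
    return [p for p in paths if not os.path.isfile(p)]


paths = expected_dataset_files("../assets/data/processed/warbler/")
for path in missing_files(paths):
    print("missing:", path)
```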

**DEPENDENCIES**

In [2]:
# General
import os
import re 
import pandas as pd

# The processing manager which glues everything
import data_processing

# Augmentation Pipelines
from data_processing import TransposerToC, TransposerToRange, Reverser

# Processing Pipelines
from magenta.pipelines.note_sequence_pipelines import Quantizer, Splitter
from data_processing import PerformanceExtractor, MetadataExtractor, ParserToText

# Other
from magenta.protobuf import music_pb2
from magenta.pipelines import pipelines_common, dag_pipeline

**PARAMETERS**

In [89]:
pipeline_config = dict()

pipeline_config['data_source_dir'] = "../assets/data/collated/L/"
pipeline_config['data_target_dir'] = "../assets/data/processed/warbler/"

In [90]:
# How many steps per quarter note
pipeline_config['steps_per_quarter'] = 4

pipeline_config['min_events'] = 1
pipeline_config['max_events'] = 9999999

pipeline_config['MIN_MIDI_PITCH'] = 0 # Inclusive.
pipeline_config['MAX_MIDI_PITCH'] = 127 # Inclusive.
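
To illustrate what `steps_per_quarter = 4` means, the sketch below maps onset times in seconds onto a grid of four steps per quarter note (a sixteenth-note grid). The tempo of 120 qpm and the round-to-nearest rule are assumptions for this example; the `Quantizer` reads the tempo from each sequence and may use a different rounding cutoff.

```python
def steps_per_second(steps_per_quarter, qpm):
    """Grid resolution in steps per second at a given tempo."""
    return steps_per_quarter * qpm / 60.0


def quantize_seconds(seconds, steps_per_quarter, qpm):
    """Snap a time in seconds to the nearest grid step (assumed rounding rule)."""
    return int(seconds * steps_per_second(steps_per_quarter, qpm) + 0.5)


sps = steps_per_second(4, 120.0)        # 8.0 steps per second
onsets = [0.0, 0.26, 0.49, 1.01]        # slightly off-grid onsets, in seconds
steps = [quantize_seconds(t, 4, 120.0) for t in onsets]
print(sps, steps)  # 8.0 [0, 2, 4, 8]
```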

**DEFINITIONS**

In [91]:
def pipeline_graph_def(collection_name,
                       config):
    """Returns the Pipeline instance which creates the RNN dataset.

    Args:
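        collection_name: name of the dataset being built, e.g. 'train_inputs'.
            Names starting with 'train_' enable augmentation.
        config: dict() with configuration settings

    Returns:
        A dag_pipeline.DAGPipeline instance.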
    """
    
    
    # User Variables
    # Read from the `config` argument rather than the global `pipeline_config`.
    metadata_df = pd.read_csv(os.path.join(config['data_source_dir'], 'filex_index.csv'), index_col=0)
    metadata_attr = []
    split_hop_size_seconds = 99999
    
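    # Do not modify these. `train_mode` is truthy only when the collection
    # name starts with 'train_' (e.g. 'train_inputs'), which enables augmentation.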
    train_mode = re.match(r'train(?=_)', collection_name)
    key = collection_name
    dag = {}
    
    # Input must NOT be quantized
    splitter = Splitter(
        hop_size_seconds=split_hop_size_seconds,
        name='Splitter_' + key)
    
    # `Quantizer` takes note data in seconds and snaps (quantizes)
    # everything to a discrete grid of timesteps. It maps `NoteSequence`
    # protocol buffers to `NoteSequence` protos with quantized times.
    quantizer = Quantizer(
        steps_per_quarter=config['steps_per_quarter'],
        name='Quantizer_' + key)
        # input_type=music_pb2.NoteSequence
        # output_type=music_pb2.NoteSequence
        
    reverser = Reverser(
        bool(train_mode),  # augment by reversing only in train mode
        name='Reverser' + key)
        # input_type=music_pb2.NoteSequence
        # output_type=music_pb2.NoteSequence
        
    transposerToC = TransposerToC(
        name='TransposerToC' + key)

#     transposer = TransposerToRange(
#         range(-12, 12) if train_mode else [0],
#         min_pitch = pipeline_config['MIN_MIDI_PITCH'],
#         max_pitch = pipeline_config['MAX_MIDI_PITCH'],
#         name='TransposerToRange_' + key)
#         # input_type=music_pb2.NoteSequence
#         # output_type=music_pb2.NoteSequence

    perf_extractor = PerformanceExtractor(
        min_events=config['min_events'],
        max_events=config['max_events'],
        num_velocity_bins=0,
        name='PerformanceExtractor_' + key)
        # input_type = music_pb2.NoteSequence
        # output_type = magenta.music.MetricPerformance

    meta_extractor = MetadataExtractor(
        metadata_df = metadata_df,
        attributes=metadata_attr,
        name = 'MetadataExtractor' + key)
    
    parser = ParserToText(
        name='ParserToText' + key)
        # input_type = magenta.music.MetricPerformance
        # output_type = str

    
    ### Pipelines Full Map ###
    #
    # DagInput > Splitter > Quantizer > Reverser > TransposerToC > PerformanceExtractor > 'MetricPerformance'
    #   (TransposerToRange is currently disabled; see the commented-out block above.)
    # DagInput > MetadataExtractor > 'metadata'
    # 
    # {'MetricPerformance', 'metadata'} > ParserToText > DagOutput
    #
    
    dag[splitter] = dag_pipeline.DagInput(music_pb2.NoteSequence)
    dag[quantizer] = splitter
    dag[reverser] = quantizer
    dag[transposerToC] = reverser
    dag[perf_extractor] = transposerToC
    
    dag[meta_extractor] = dag_pipeline.DagInput(music_pb2.NoteSequence)
    
    # The parser joins both branches; the dict keys name its two inputs.
    dag[parser] = { 'MetricPerformance' : perf_extractor, 
                    'metadata' : meta_extractor }
    
    dag[dag_pipeline.DagOutput(key)] = parser
        
    return dag_pipeline.DAGPipeline(dag)
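
The `dag` dictionary above maps each stage to the stage (or dict of named stages) that feeds it. The following toy evaluator sketches that wiring in plain Python. `run_dag`, the `"INPUT"`/`"OUTPUT"` markers, and the lambda stages are all hypothetical and much simpler than Magenta's `DAGPipeline`: each toy stage returns a single value rather than a list of outputs.

```python
def run_dag(dag, source):
    """Evaluate a toy DAG given as {stage_name: (function, upstream)}.

    `upstream` is "INPUT", another stage name, or a dict that maps
    keyword-argument names to stage names (a join of several branches).
    """
    cache = {}

    def value_of(node):
        if node == "INPUT":
            return source
        if node not in cache:
            func, upstream = dag[node]
            if isinstance(upstream, dict):
                kwargs = {key: value_of(name) for key, name in upstream.items()}
                cache[node] = func(**kwargs)
            else:
                cache[node] = func(value_of(upstream))
        return cache[node]

    return value_of("OUTPUT")


toy_dag = {
    # Branch 1: turn pitches into event tokens.
    "events": (lambda seq: [f"ON{p}" for p in seq["pitches"]], "INPUT"),
    # Branch 2: pull metadata tokens from the same input.
    "metadata": (lambda seq: [seq["composer"]], "INPUT"),
    # Join: prepend metadata tokens to the event tokens.
    "parser": (lambda events, metadata: " ".join(metadata + events),
               {"events": "events", "metadata": "metadata"}),
    "OUTPUT": (lambda text: text, "parser"),
}

line = run_dag(toy_dag, {"composer": "bach", "pitches": [60, 64]})
print(line)  # bach ON60 ON64
```

Both branches start from the same input, which mirrors how `splitter` and `meta_extractor` each receive a `DagInput` above.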

# Build Dataset

In [92]:
data_processing.build_dataset(pipeline_config, pipeline_graph_def)

INFO: Collated data sourced from ./data/collated/L/.

INFO: Building train_inputs dataset...
INFO: Augmenting by reversing.
INFO: Transposing all to C.
INFO: Prepending metadata tokens for attributes: []
INFO:tensorflow:

Completed.

INFO:tensorflow:Processed 721 inputs total. Produced 1442 outputs.
INFO:tensorflow:DAGPipeline_PerformanceExtractor_train_inputs_performance_lengths_in_bars:
  [-inf,1): 8
  [1,10): 186
  [10,20): 604
  [20,30): 248
  [30,40): 316
  [40,50): 58
  [50,100): 22
INFO:tensorflow:DAGPipeline_PerformanceExtractor_train_inputs_performances_discarded_more_than_1_program: 0
INFO:tensorflow:DAGPipeline_PerformanceExtractor_train_inputs_performances_discarded_too_short: 0
INFO:tensorflow:DAGPipeline_PerformanceExtractor_train_inputs_performances_truncated: 0
INFO:tensorflow:DAGPipeline_PerformanceExtractor_train_inputs_performances_truncated_timewise: 0
INFO:tensorflow:DAGPipeline_TransposerToCtrain_inputs_transpositions_generated: 1442

INFO: Building train_targets 
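
The counts in the log above can be sanity-checked with a few lines. The numbers are copied from the `train_inputs` output; the interpretation (each input yields the original plus a reversed copy, with no effective splitting) is an assumption based on the "Augmenting by reversing" message and the very large `split_hop_size_seconds`.

```python
inputs_processed = 721
outputs_produced = 1442

# Histogram of performance lengths in bars, as logged.
lengths_in_bars = {
    "[-inf,1)": 8,
    "[1,10)": 186,
    "[10,20)": 604,
    "[20,30)": 248,
    "[30,40)": 316,
    "[40,50)": 58,
    "[50,100)": 22,
}

# Reversing doubles the number of examples.
assert outputs_produced == 2 * inputs_processed
# Every produced output is counted in exactly one histogram bucket.
assert sum(lengths_in_bars.values()) == outputs_produced
print("log counts are consistent")
```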

# Build Vocabulary

In [93]:
# Uncomment if you use metadata
# data_processing.build_vocab(pipeline_config,
#                             source_vocab_from=['train_inputs.txt', 'train_targets.txt'])
data_processing.build_vocab(pipeline_config)

INFO: Vocabulary built.
INFO: Tokens collected {'ON114', 'ON126', 'OFF80', 'OFF14', 'OFF84', 'OFF88', 'OFF100', 'ON122', 'OFF2', 'ON56', 'ON90', 'ON103', 'ON2', 'ON121', 'ON3', 'ON6', 'OFF6', 'OFF38', 'ON83', 'ON106', 'ON115', 'OFF124', 'OFF120', 'ON95', 'ON33', 'OFF110', 'ON61', 'ON42', 'ON108', 'OFF73', 'ON20', 'ON65', 'OFF116', 'OFF50', 'ON58', 'ON29', 'OFF11', 'ON55', 'ON19', 'OFF8', 'OFF49', 'OFF92', 'ON12', 'OFF94', 'ON32', 'ON43', 'OFF99', 'OFF1', 'OFF112', 'ON104', 'OFF27', 'ON125', 'ON10', 'ON54', 'OFF3', 'ON13', 'OFF20', 'SHIFT10', 'ON105', 'OFF10', 'ON63', 'ON30', 'OFF33', 'OFF19', 'ON27', 'ON62', 'ON41', 'ON64', 'ON22', 'ON24', 'ON60', 'ON66', 'ON71', 'ON73', 'ON46', 'OFF35', 'OFF53', 'ON99', 'SHIFT13', 'OFF104', 'OFF17', 'OFF95', 'ON98', 'ON75', 'SHIFT5', 'OFF46', 'ON52', 'ON123', 'OFF81', 'OFF34', 'OFF93', 'OFF60', 'SHIFT8', 'ON111', 'OFF63', 'OFF114', 'OFF45', 'ON15', 'SHIFT3', 'OFF82', 'OFF98', 'ON45', 'ON109', 'OFF13', 'ON16', 'ON0', 'ON120', 'OFF70', 'ON1', 'ON5', 'ON