This notebook helps you define and run pipelines that process your data, including data augmentation, slicing, stretching, and encoding, among others. Before using it, collate your original `.xml` files with the help of `1.1. Collate Files.ipynb`.

A pipeline is a data processing module that transforms input data of one type into output data of another type. The idea, as well as bits and pieces of the code, is borrowed from [the Magenta project](https://github.com/tensorflow/magenta/tree/master/magenta/pipelines).
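
To make the "typed transformation" idea concrete, here is a minimal sketch in plain Python. It does not use Magenta; `ToyPipeline` and `run_chain` are hypothetical names invented for this illustration, and the only behaviour assumed is that each stage declares an input type and an output type and maps one input to a list of outputs.

```python
class ToyPipeline:
    """Hypothetical stand-in for a pipeline stage (not the Magenta class)."""

    def __init__(self, input_type, output_type, fn, name):
        self.input_type = input_type
        self.output_type = output_type
        self._fn = fn
        self.name = name

    def transform(self, item):
        # Reject inputs of the wrong type before doing any work.
        if not isinstance(item, self.input_type):
            raise TypeError(f"{self.name} expects {self.input_type.__name__}")
        outputs = list(self._fn(item))
        # Every output must match the declared output type.
        for out in outputs:
            if not isinstance(out, self.output_type):
                raise TypeError(f"{self.name} must produce {self.output_type.__name__}")
        return outputs


def run_chain(pipelines, items):
    """Feed every output of one stage into the next stage."""
    for stage in pipelines:
        items = [out for item in items for out in stage.transform(item)]
    return items


# One input may yield several outputs (splitter) or exactly one (counter).
splitter = ToyPipeline(str, str, lambda s: s.split(), name='Splitter')
counter = ToyPipeline(str, int, lambda s: [len(s)], name='Counter')

result = run_chain([splitter, counter], ["quantize then extract"])
print(result)  # [8, 4, 7]
```

Returning a list from every stage is what lets a single stage drop an item (empty list) or multiply it (several outputs), which is how augmentation and filtering can share one interface.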


**INSTRUCTIONS**
 
First, adjust the definition of the pipelines inside `pipeline_graph_def`. Then run `build_dataset`. This creates four files in two sets, each containing a train file and an evaluate file. The first set holds the inputs and the second set holds the targets.
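
As a rough illustration of how the four datasets relate to the pipeline definition, the sketch below applies the same regular expression that `pipeline_graph_def` uses to decide whether a collection is a training collection. The names `train_inputs` and `train_targets` appear in the build log further down; the two `eval_*` names and the helper `is_train_collection` are assumptions made for this example only.

```python
import re

# 'train_inputs' and 'train_targets' appear in the build log;
# the two 'eval_*' names are assumed for illustration.
collection_names = ['train_inputs', 'train_targets',
                    'eval_inputs', 'eval_targets']


def is_train_collection(collection_name):
    """True when the name starts with 'train' followed by an underscore."""
    return re.match(r'train(?=_)', collection_name) is not None


modes = {name: is_train_collection(name) for name in collection_names}
for name, is_train in modes.items():
    print(name, 'train' if is_train else 'evaluate')
```

The lookahead `(?=_)` means a name such as `training` would not count as a training collection, because `train` is not followed directly by an underscore.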

**DEPENDENCIES**

In [1]:
from utils import data_processing
from utils.data_processing import TransposerToC, TransposerToRange, PerformanceExtractor, MetadataExtractor, ParserToText

from magenta.protobuf import music_pb2
from magenta.pipelines import pipelines_common, dag_pipeline, note_sequence_pipelines

import os
import re 
import pandas as pd

**PARAMETERS**

In [2]:
pipeline_config = dict()

pipeline_config['data_source_dir'] = "./data/collated/B/"
pipeline_config['data_target_dir'] = "./data/processed/test/"

# How many steps per quarter note
pipeline_config['steps_per_quarter'] = 4

pipeline_config['min_events'] = 1
pipeline_config['max_events'] = 10000

pipeline_config['MIN_MIDI_PITCH'] = 0 # Inclusive.
pipeline_config['MAX_MIDI_PITCH'] = 127 # Inclusive.

In [3]:
# metadata_df = pd.read_csv(os.path.join(pipeline_config['data_source_dir'], 'filex_index.csv'), index_col=0)
# metadata_df.loc['aintNoSunshine_int-bridge-bh.xml']['segment']

**DEFINITIONS**

In [4]:
def pipeline_graph_def(collection_name,
                       config):
    """Returns the Pipeline instance which creates the RNN dataset.

    Args:
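        collection_name: Name of the dataset to build, e.g. 'train_inputs'.
            Names that start with 'train_' set `train_mode`.
        config: dict() with configuration settings

    Returns:
        A dag_pipeline.DAGPipeline instance.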
    """
    
    # Use the `config` argument rather than the notebook-level `pipeline_config`.
    metadata_df = pd.read_csv(os.path.join(config['data_source_dir'], 'filex_index.csv'), index_col=0)
    train_mode = re.match(r'train(?=_)', collection_name)
    key = collection_name
    dag = {}
    
    quantizer = note_sequence_pipelines.Quantizer(
        steps_per_quarter=config['steps_per_quarter'],
        name='Quantizer_' + key)
        # `Quantizer` takes note data in seconds and snaps, or quantizes,
        # everything to a discrete grid of timesteps. It maps `NoteSequence`
        # protocol buffers to `NoteSequence` protos with quantized times.
        #
        # input_type=music_pb2.NoteSequence
        # output_type=music_pb2.NoteSequence

#     transposer = TransposerToRange(
#         range(-12, 12) if train_mode else [0],
#         min_pitch = pipeline_config['MIN_MIDI_PITCH'],
#         max_pitch = pipeline_config['MAX_MIDI_PITCH'],
#         name='Transposer_' + key)
#         # input_type=music_pb2.NoteSequence
#         # output_type=music_pb2.NoteSequence
        
    transposerToC = TransposerToC(
        name='TransposerToC' + key)

    perf_extractor = PerformanceExtractor(
        min_events=config['min_events'],
        max_events=config['max_events'],
        num_velocity_bins=0,
        name='PerformanceExtractor_' + key)
        # input_type = music_pb2.NoteSequence
        # output_type = magenta.music.MetricPerformance

    meta_extractor = MetadataExtractor(
        metadata_df=metadata_df,
        attributes=['segment', 'mode', 'genre'],
        name='MetadataExtractor' + key)
    
    parser = ParserToText(
        name='ParserToText' + key)
        # input_type = magenta.music.MetricPerformance
        # output_type = str

    # Reverse
    # Split
    
    # DagInput > Quantizer > TransposerToC > PerformanceExtractor > 'MetricPerformance'
    # DagInput > MetadataExtractor > >>>>>>>>>>>>>>>>>>>>>>>>>>>> > 'meta'
    # 
    # {'MetricPerformance', 'meta'} > ParserToText > DagOutput
    #
    dag[quantizer] = dag_pipeline.DagInput(music_pb2.NoteSequence)
    dag[transposerToC] = quantizer
    dag[perf_extractor] = transposerToC
    
    # MetadataExtractor also reads the raw NoteSequence input.
    dag[meta_extractor] = dag_pipeline.DagInput(music_pb2.NoteSequence)
    
    dag[parser] = { 'MetricPerformance' : perf_extractor, 
                    'meta' : meta_extractor }
    
    dag[dag_pipeline.DagOutput(key)] = parser
        
    return dag_pipeline.DAGPipeline(dag)
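
The wiring above is a dictionary that maps each stage to whatever feeds it: the input marker, another stage, or a dict of named upstream stages. The following toy evaluator sketches that idea in plain Python. It is not how `DAGPipeline` is implemented; `run_toy_dag`, the marker objects, and the four stage functions are hypothetical, and each stage here returns a single value rather than a list.

```python
def run_toy_dag(dag, source_value, input_marker, output_marker):
    """Evaluate a {stage: upstream} mapping for one input value."""
    cache = {}

    def evaluate(node):
        if node is input_marker:
            return source_value
        if node not in cache:
            upstream = dag[node]
            if isinstance(upstream, dict):
                # Named inputs: resolve each upstream stage by key.
                arg = {key: evaluate(stage) for key, stage in upstream.items()}
            else:
                arg = evaluate(upstream)
            cache[node] = node(arg)
        return cache[node]

    return evaluate(dag[output_marker])


INPUT, OUTPUT = object(), object()


def quantize(times):
    # Snap times to a grid of 4 steps per unit.
    return [round(t * 4) / 4 for t in times]


def extract(times):
    return len(times)


def metadata(times):
    return {'genre': 'toy'}


def parse(inputs):
    return f"{inputs['meta']['genre']}:{inputs['performance']}"


toy_dag = {
    quantize: INPUT,
    extract: quantize,
    metadata: INPUT,
    parse: {'performance': extract, 'meta': metadata},
    OUTPUT: parse,
}

toy_result = run_toy_dag(toy_dag, [0.1, 0.52, 1.0], INPUT, OUTPUT)
print(toy_result)  # toy:3
```

Because the graph is an ordinary dict, assigning the same key twice keeps only the last value, so each stage should be given its inputs exactly once.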

# Build Dataset

In [5]:
data_processing.build_dataset(pipeline_config, pipeline_graph_def)

INFO: Collated data sourced from ./data/collated/B/.

INFO: Building train_inputs dataset...
INFO: Transposing all to C.
INFO:tensorflow:

Completed.

INFO:tensorflow:Processed 2540 inputs total. Produced 2540 outputs.
INFO:tensorflow:DAGPipeline_PerformanceExtractor_train_inputs_performance_lengths_in_bars:
  [-inf,1): 11
  [1,10): 542
  [10,20): 1133
  [20,30): 365
  [30,40): 395
  [40,50): 66
  [50,100): 28
INFO:tensorflow:DAGPipeline_PerformanceExtractor_train_inputs_performances_discarded_more_than_1_program: 0
INFO:tensorflow:DAGPipeline_PerformanceExtractor_train_inputs_performances_discarded_too_short: 0
INFO:tensorflow:DAGPipeline_PerformanceExtractor_train_inputs_performances_truncated: 0
INFO:tensorflow:DAGPipeline_PerformanceExtractor_train_inputs_performances_truncated_timewise: 0
INFO:tensorflow:DAGPipeline_TransposerToCtrain_inputs_transpositions_generated: 2540

INFO: Building train_targets dataset...
INFO: Transposing all to C.
INFO:tensorflow:

Completed.

INFO:tensor

# Build Vocabulary

In [6]:
data_processing.build_vocab(pipeline_config)

INFO: Vocabulary built.
