This notebook helps you define and run pipelines that process your data, including data augmentation, slicing, stretching and encoding. Before using it, you should already have collated your original `.xml` files with the help of `1.1. Collate Files.ipynb`.

A pipeline is a data processing module that transforms an input data type into an output data type. The idea, as well as bits and pieces of the code, are borrowed from [the Magenta project](https://github.com/tensorflow/magenta/tree/master/magenta/pipelines).
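
To make the idea of a typed transform concrete, here is a minimal sketch in plain Python. The names `ToyPipeline`, `Reverse`, `ToText` and `run_chain` are hypothetical and exist only for illustration; they are not part of Magenta or `data_processing`, whose real classes carry more machinery (statistics, naming, type checks).

```python
class ToyPipeline:
    """Hypothetical stand-in for a pipeline: maps one input to a list of outputs."""

    def __init__(self, input_type, output_type, name):
        self.input_type = input_type
        self.output_type = output_type
        self.name = name

    def transform(self, item):
        raise NotImplementedError


class Reverse(ToyPipeline):
    """Augmentation example: emits the original list and a reversed copy."""

    def __init__(self, name="Reverse"):
        super().__init__(list, list, name)

    def transform(self, item):
        return [item, item[::-1]]


class ToText(ToyPipeline):
    """Encoding example: joins a list of events into one line of text."""

    def __init__(self, name="ToText"):
        super().__init__(list, str, name)

    def transform(self, item):
        return [" ".join(str(x) for x in item)]


def run_chain(pipelines, items):
    """Feeds every output of one pipeline into the next one."""
    for pipeline in pipelines:
        next_items = []
        for item in items:
            # Each stage only accepts its declared input type.
            assert isinstance(item, pipeline.input_type)
            next_items.extend(pipeline.transform(item))
        items = next_items
    return items


result = run_chain([Reverse(), ToText()], [[60, 62, 64]])
print(result)
```

Because every stage returns a list, an augmentation stage can emit more outputs than it receives, which is how one source file can turn into many training examples.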


**INSTRUCTIONS**
 
First, adjust the definition of the pipelines inside `pipeline_graph_def`. Then run `build_dataset`. This creates four files: an inputs file and a targets file for each of the train and evaluate sets.

**DEPENDENCIES**

In [290]:
import importlib
importlib.reload(data_processing)

<module 'data_processing' from '/Users/vesko/GitHub/UoE-dissertation/arranger_model/build_dataset/data_processing.py'>

In [291]:
# General
import os
import re 
import pandas as pd

# The processing manager which glues everything
import data_processing

# Augmentation Pipelines
from data_processing import TransposerToC, TransposerToRange, Reverser

# Processing Pipelines
from magenta.pipelines.note_sequence_pipelines import Quantizer, Splitter
from data_processing import PerformanceExtractor, MetadataExtractor, ParserToText, QuantizedSplitter

# Other
from magenta.protobuf import music_pb2
from magenta.pipelines import pipelines_common, dag_pipeline

**PARAMETERS**

In [292]:
pipeline_config = dict()

pipeline_config['data_source_dir'] = "../assets/data/collated/C/"
pipeline_config['data_target_dir'] = "../assets/data/processed/hummingbird/"

In [293]:
# How many steps per quarter note
pipeline_config['steps_per_quarter'] = 4

pipeline_config['min_events'] = 1
pipeline_config['max_events'] = 9999999

pipeline_config['MIN_MIDI_PITCH'] = 0 # Inclusive.
pipeline_config['MAX_MIDI_PITCH'] = 127 # Inclusive.
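
With `steps_per_quarter = 4`, each quarter note is divided into four steps, so one step corresponds to a sixteenth note. The sketch below works through that arithmetic under the assumption of a constant tempo. The helpers `steps_per_second` and `seconds_to_step` are hypothetical names, and the round-half-up rule is an assumption for illustration, not a statement of how the `Quantizer` is implemented.

```python
def steps_per_second(steps_per_quarter, qpm):
    """Steps per second at a constant tempo of `qpm` quarter notes per minute."""
    return steps_per_quarter * qpm / 60.0


def seconds_to_step(time_seconds, steps_per_quarter, qpm):
    """Snaps a time in seconds to the nearest step index (assumed round-half-up)."""
    return int(time_seconds * steps_per_second(steps_per_quarter, qpm) + 0.5)


# At 120 qpm with 4 steps per quarter there are 8 steps per second.
print(steps_per_second(4, 120))
# A note starting at 0.26 s lands on step 2 (0.25 s on the grid).
print(seconds_to_step(0.26, 4, 120))
```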

**DEFINITIONS**

In [294]:
def pipeline_graph_def(collection_name,
                       config):
    """Returns the Pipeline instance which creates the RNN dataset.

    Args:
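        collection_name: name of the dataset to build, e.g. 'train_inputs'.
            Names starting with 'train_' switch on training-only augmentation.
        config: dict() with configuration settings

    Returns:
        A dag_pipeline.DAGPipeline instance.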
    """
    
    
    # User Variables
    metadata_df = pd.read_csv(os.path.join(pipeline_config['data_source_dir'], 'filex_index.csv'), index_col=0)
    metadata_attr = []
    split_hop_size_seconds = 99999
    hop_bars = list(range(0,300,5))
    
    # Do not modify these
    train_mode = re.match(r'train(?=_)', collection_name)
    key = collection_name
    dag = {}
    
    # Input must NOT be quantized
    splitter = Splitter(
        hop_size_seconds=split_hop_size_seconds,
        name='Splitter_' + key)
    
    # `Quantizer` takes note data in seconds and snaps, or quantizes, 
    # everything to a discrete grid of timesteps. It maps `NoteSequence` 
    # protocol buffers to `NoteSequence` protos with quantized times. 
    quantizer = Quantizer(
        steps_per_quarter=pipeline_config['steps_per_quarter'], 
        name='Quantizer_' + key)
        # input_type=music_pb2.NoteSequence
        # output_type=music_pb2.NoteSequence
        
    # Input MUST BE quantized
    quant_splitter = QuantizedSplitter(
        hop_bars=hop_bars,
        name='QuantizedSplitter_' + key)
        
    reverser = Reverser(
        bool(train_mode), 
        name='Reverser' + key)
        # input_type=music_pb2.NoteSequence
        # output_type=music_pb2.NoteSequence
        
    transposerToC = TransposerToC(
        name='TransposerToC' + key)

#     transposer = TransposerToRange(
#         range(-12, 12) if train_mode else [0],
#         min_pitch = pipeline_config['MIN_MIDI_PITCH'],
#         max_pitch = pipeline_config['MAX_MIDI_PITCH'],
#         name='TransposerToRange_' + key)
#         # input_type=music_pb2.NoteSequence
#         # output_type=music_pb2.NoteSequence

    perf_extractor = PerformanceExtractor(
        min_events=pipeline_config['min_events'],
        max_events=pipeline_config['max_events'],
        num_velocity_bins=0,
        name='PerformanceExtractor_' + key)
        # input_type = music_pb2.NoteSequence
        # output_type = magenta.music.MetricPerformance

    meta_extractor = MetadataExtractor(
        metadata_df = metadata_df,
        attributes=metadata_attr,
        name = 'MetadataExtractor' + key)
    
    parser = ParserToText(
        name='ParserToText' + key)
        # input_type = magenta.music.MetricPerformance
        # output_type = str

    
    ### Pipelines Full Map ###
    #
    # DagInput > Splitter > Quantizer > QuantizedSplitter > Reverser > TransposerToC > PerformanceExtractor > 'MetricPerformance'
    # (TransposerToRange is currently commented out and not part of the graph)
    # DagInput > MetadataExtractor > 'metadata'
    # 
    # {'MetricPerformance', 'metadata'} > ParserToText > DagOutput
    #
    
    dag[splitter] = dag_pipeline.DagInput(music_pb2.NoteSequence)
    dag[quantizer] = splitter
    dag[quant_splitter] = quantizer
    dag[reverser] = quant_splitter
    dag[transposerToC] = reverser
    dag[perf_extractor] = transposerToC
    
    dag[meta_extractor] = dag_pipeline.DagInput(music_pb2.NoteSequence)
    
    # The parser consumes both branches, keyed by name
    dag[parser] = { 'MetricPerformance' : perf_extractor, 
                    'metadata' : meta_extractor }
    
    dag[dag_pipeline.DagOutput(key)] = parser
        
    return dag_pipeline.DAGPipeline(dag)
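
The `dag` dictionary above maps each processing unit to the unit (or named units) it reads from. As a rough illustration of how such a dependency map can be evaluated, here is a toy runner in plain Python. The `run_dag` function, the `"INPUT"` marker and the node names are all hypothetical; the real `DAGPipeline` also performs type checking and handles units that emit several outputs, which this sketch ignores.

```python
def run_dag(dag, output_key, source):
    """Evaluates `output_key` by recursively evaluating what it depends on.

    `dag` maps a node name to (function, dependency), where dependency is
    "INPUT", another node name, or a dict of argument name -> node name.
    """
    cache = {}

    def evaluate(node):
        if node == "INPUT":
            return source
        if node not in cache:
            func, dep = dag[node]
            if isinstance(dep, dict):
                # Several named inputs, like the parser taking two branches.
                cache[node] = func(**{k: evaluate(v) for k, v in dep.items()})
            else:
                cache[node] = func(evaluate(dep))
        return cache[node]

    return evaluate(output_key)


toy_dag = {
    # Branch 1: snap values to a grid.
    "quantize": (lambda notes: [round(n) for n in notes], "INPUT"),
    # Branch 2: derive some metadata from the same input.
    "count": (lambda notes: len(notes), "INPUT"),
    # Merge: combine both branches into one line of text.
    "to_text": (
        lambda events, metadata: "{} | {}".format(
            metadata, " ".join(str(e) for e in events)),
        {"events": "quantize", "metadata": "count"},
    ),
}

text = run_dag(toy_dag, "to_text", [0.9, 2.2, 3.6])
print(text)
```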

# Build Dataset

In [295]:
data_processing.build_dataset(pipeline_config, pipeline_graph_def)

INFO: Target ../assets/data/processed/hummingbird/.
INFO: Collated data sourced from ../assets/data/collated/C/.

INFO: Building train_inputs dataset...
INFO: Augmenting by reversing.
INFO: Transposing all to C.
INFO: Prepending metadata tokens for attributes: []
INFO:tensorflow:

Completed.

INFO:tensorflow:Processed 295 inputs total. Produced 59000 outputs.
INFO:tensorflow:DAGPipeline_PerformanceExtractor_train_inputs_performance_lengths_in_bars:
  [-inf,1): 42380
  [1,10): 16620
INFO:tensorflow:DAGPipeline_PerformanceExtractor_train_inputs_performances_discarded_more_than_1_program: 0
INFO:tensorflow:DAGPipeline_PerformanceExtractor_train_inputs_performances_discarded_too_short: 0
INFO:tensorflow:DAGPipeline_PerformanceExtractor_train_inputs_performances_truncated: 0
INFO:tensorflow:DAGPipeline_PerformanceExtractor_train_inputs_performances_truncated_timewise: 0
INFO:tensorflow:DAGPipeline_TransposerToCtrain_inputs_transpositions_generated: 59000

INFO: Building train_targets datase

# Build Vocabulary

In [90]:
# Uncomment if you use metadata
# data_processing.build_vocab(pipeline_config,
#                             source_vocab_from=['train_inputs.txt', 'train_targets.txt'])
data_processing.build_vocab(pipeline_config)

INFO: Vocabulary built.
INFO: Tokens collected {'OFF75', 'OFF25', 'ON8', 'ON78', 'OFF96', 'ON113', 'OFF81', 'OFF90', 'ON50', 'SHIFT15', 'ON115', 'OFF69', 'OFF34', 'ON30', 'ON109', 'ON46', 'ON122', 'OFF8', 'OFF122', 'OFF78', 'ON42', 'OFF35', 'ON2', 'OFF105', 'ON75', 'OFF64', 'ON59', 'SHIFT0', 'OFF15', 'OFF100', 'OFF111', 'OFF63', 'OFF57', 'ON117', 'OFF9', 'ON62', 'ON118', 'OFF59', 'OFF124', 'OFF77', 'ON32', 'ON25', 'ON27', 'OFF11', 'OFF27', 'OFF102', 'ON125', 'ON83', 'OFF14', 'ON85', 'ON19', 'ON99', 'ON39', 'OFF18', 'OFF113', 'OFF1', 'ON64', 'ON93', 'ON80', 'OFF40', 'ON108', 'OFF85', 'OFF0', 'ON34', 'OFF41', 'SHIFT14', 'ON45', 'ON88', 'ON91', 'ON100', 'ON84', 'ON7', 'OFF66', 'OFF68', 'OFF87', 'OFF94', 'ON101', 'OFF55', 'OFF97', 'ON9', 'OFF103', 'OFF80', 'OFF98', 'ON82', 'OFF118', 'OFF61', 'ON55', 'OFF44', 'OFF2', 'ON24', 'OFF3', 'ON56', 'OFF60', 'ON4', 'OFF114', 'ON72', 'SHIFT2', 'SHIFT8', 'OFF49', 'SHIFT1', 'OFF70', 'ON35', 'ON89', 'OFF6', 'OFF33', 'OFF42', 'OFF108', 'ON21', 'OFF67', '
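
As a rough picture of what collecting a vocabulary involves, the sketch below gathers the unique tokens from a few lines of text. It assumes each line is a whitespace-separated sequence of tokens such as `ON60 SHIFT4 OFF60`; `collect_tokens` is a hypothetical helper and is not necessarily how `data_processing.build_vocab` works.

```python
def collect_tokens(lines):
    """Returns the set of unique whitespace-separated tokens in `lines`."""
    tokens = set()
    for line in lines:
        # str.split() with no arguments also discards the trailing newline.
        tokens.update(line.split())
    return tokens


sample_lines = ["ON60 SHIFT4 OFF60\n", "ON62 SHIFT2 OFF62\n", "\n"]
vocab = sorted(collect_tokens(sample_lines))
print(vocab)
```

Sorting the set before writing it out gives a stable token order between runs, which a plain `set` does not guarantee.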

# Remove Blank Lines Synchronously in Both Files

In [93]:
for dataset_type in ['eval', 'train', 'test']:
    inputs_file_name = dataset_type + '_inputs.txt'
    targets_file_name = dataset_type + '_targets.txt'

    inputs_path = os.path.join(pipeline_config['data_target_dir'], inputs_file_name)
    targets_path = os.path.join(pipeline_config['data_target_dir'], targets_file_name)

    with open(inputs_path, 'r') as inputs_file, open(targets_path, 'r') as targets_file:
        inputs = inputs_file.readlines()
        targets = targets_file.readlines()

    # The two files must stay aligned line by line
    assert len(inputs) == len(targets)

    # Indices where BOTH files have an empty line; a set keeps lookups fast
    to_remove = {idx for idx, (inp, tgt) in enumerate(zip(inputs, targets))
                 if inp == '\n' and tgt == '\n'}

    print('INFO: {} corresponding matching empty lines found in {}.'.format(len(to_remove), dataset_type))

    # Write to disk, creating the output directory if it does not exist yet
    fixed_dir = os.path.join(pipeline_config['data_target_dir'], 'fixed')
    os.makedirs(fixed_dir, exist_ok=True)

    for file_name, lines in [(inputs_file_name, inputs), (targets_file_name, targets)]:
        with open(os.path.join(fixed_dir, file_name), 'w') as f:
            f.writelines(line for idx, line in enumerate(lines) if idx not in to_remove)

INFO: 5166 corresponding matching empty lines found in eval.
INFO: 84504 corresponding matching empty lines found in train.
INFO: 4802 corresponding matching empty lines found in test.
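
The filtering rule can be checked in isolation on in-memory lists before touching any files. The sketch below is a self-contained version of that rule; `drop_paired_blank_lines` is a hypothetical helper name introduced here for illustration.

```python
def drop_paired_blank_lines(inputs, targets):
    """Drops line pairs where BOTH the input and the target line are blank."""
    if len(inputs) != len(targets):
        raise ValueError("inputs and targets must have the same number of lines")
    kept = [(inp, tgt) for inp, tgt in zip(inputs, targets)
            if not (inp == "\n" and tgt == "\n")]
    kept_inputs = [inp for inp, _ in kept]
    kept_targets = [tgt for _, tgt in kept]
    return kept_inputs, kept_targets


inputs = ["ON60\n", "\n", "\n", "ON62\n"]
targets = ["ON48\n", "\n", "ON50\n", "\n"]
kept_inputs, kept_targets = drop_paired_blank_lines(inputs, targets)
print(kept_inputs)
print(kept_targets)
```

Note that a pair is kept when only one side is blank, so the two files stay aligned but may still contain empty lines on one side.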
