# Overview

After discussion with my advisor, I want to redesign the system's infrastructure at a high level. Components already exist to handle each part of the system, but they were built in a fragmented way without much thought put into the engineering. I'd like to approach the redesign as follows.

1. Describe a general framework for data science and machine learning projects and the components that need to exist to run experiments successfully.
2. Outline how the current project handles those components.
3. Propose, at a high level, how a newly engineered architecture can handle those components in a modular, reliable, and sustainable way.

## 1. Components of a Machine Learning project
Although every ML project differs drastically in its problem and in the implementation of its solution, several components stay the same across most ML projects. These components can broadly be broken into three categories.

1. Data
2. Models
3. Evaluation

The data component includes all of the code and infrastructure involved in gathering, storing, and processing the data that is fed to the model for both training and evaluation. The models train on that data to learn the target function they are trying to represent, and in some cases are used in real-life situations to output predictions for a commercial or business use case. Often, several models are developed and compared to find the best way to represent the target function given the data that exists. Model selection and development depend on evaluating the models once they have been trained. At its simplest, evaluation means testing the model on data it has never seen before to get a sense of how well it has learned. In our case, the evaluation can be either quantitative or qualitative: the former relies on statistical metrics that are specific to the data representation of the problem, while the latter relies on more subjective measures that involve real human feedback, such as a listening test.

Because each of these components remains mostly consistent across different projects, we will define a process that outlines the full end-to-end flow of developing and using ML models. The flow looks something like this:

![Machine Learning Components](ML-Components.png)

Each of the boxes represents a different category of work, and the different colored lines represent the different flows through those components. Each box corresponds to a code module that performs that category of task, whereas each flow corresponds to an "experiment": a script that uses the code modules to carry out that flow. The code modules are roughly defined below, with further expansion later on.

### 1.1 Code Modules

* Data 
    * Input Raw Data Reader 
    * Input Raw Data Pre-Processing to Model Input Features (Intermediate Data Form)
    * Input Feature Cache Data Writer
    * Input Feature Cache Data Reader 
    * Output Feature Post-Processing to Raw Data
    * Output Raw Data Writer 
    * Output Raw Data Reader
* Models 
    * Model Definition 
        * NN Architecture
        * Forward Pass
        * Loss function 
    * Model Training Loop 
    * Trained model writer 
    * Trained model reader 
    * Model Inference
* Evaluation 
    * Quantitative Evaluator 
        * Uses Model Output and Training Output
    * Qualitative Evaluator 
        * Uses Raw Data Output and presents it to Humans for feedback
    * Metric Writer 
    * Metric Reader 
    * Interpretation 
        * Graphs
        * Tables 
        * Visualizations

Note: There is no definition in the Evaluation module for a production system. This is because the "production" system won't necessarily exist in code, although it could. For now, "production" simply means taking a MIDI file and sending it to a piano synthesizer. This will be used for the qualitative listening tests, which also will not exist in code.
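
As a rough illustration of the Evaluation entries listed above, the sketch below pairs a quantitative evaluator with a metric writer and reader. Everything in it is an assumption made for illustration: the function names, the choice of mean squared error as the metric, and JSON as the storage format are hypothetical and are not taken from the existing system.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, Sequence


def mean_squared_error(output: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between one model output and its target (assumed metric)."""
    if len(output) != len(target) or len(output) == 0:
        raise ValueError("output and target must be non-empty and the same length")
    return sum((o - t) ** 2 for o, t in zip(output, target)) / len(output)


def evaluate_quantitative(
    model_output: Sequence[Sequence[float]],
    test_output: Sequence[Sequence[float]],
) -> Dict[str, float]:
    """Quantitative Evaluator: average the per-example error over the test set."""
    if len(model_output) != len(test_output) or len(model_output) == 0:
        raise ValueError("need one target per model output")
    errors = [mean_squared_error(o, t) for o, t in zip(model_output, test_output)]
    return {"mse": sum(errors) / len(errors), "num_examples": len(errors)}


def write_metrics(path: Path, metrics: Dict[str, float]) -> None:
    """Metric Writer: store the metrics as JSON (hypothetical storage format)."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(metrics, indent=2))


def read_metrics(path: Path) -> Dict[str, float]:
    """Metric Reader: load metrics previously stored by write_metrics."""
    return json.loads(path.read_text())


# Usage: evaluate two toy examples and round-trip the metrics through a file.
metrics = evaluate_quantitative([[1.0, 2.0], [0.0]], [[1.0, 4.0], [1.0]])
with tempfile.TemporaryDirectory() as tmp:
    metrics_path = Path(tmp) / "metrics" / "test_run.json"
    write_metrics(metrics_path, metrics)
    print(read_metrics(metrics_path))
```

Keeping the writer and reader separate from the evaluator means the Interpretation step (graphs, tables, visualizations) only needs to read stored metrics and never has to rerun a model.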

### 1.2 Code Experiments 
* Initial Data Processing 
    * Reads in raw data (both input and output), converts it to featurized form, and stores in cache 
        * Module: Data
* Training Loop 
    * Reads in featurized training input and output 
        * Module: Data
    * Instantiates new Model 
        * Module: Model
    * Loop for x number of Epochs 
        * Create batches 
            * Module: Data
        * Make a forward pass 
            * Module: Model
        * Calculate the Loss
            * Module: Model 
        * Backpropagate Loss
            * Module: Model
        * Calculate validation loss 
            * Module: Model
        * Print training and validation loss 
    * Save trained model 
        * Module: Data
* Test Inference and Quantitative Evaluation 
    * Read in featurized test input, output, and saved model 
        * Module: Data, Model
    * Run test input through model 
        * Module: Model
    * Calculate and write quantitative evaluation metrics with model output and test output 
        * Module: Evaluation
    * Post-Process model output to raw output data and write to cache 
        * Module: Data
* Production Inference
    * Read production input and trained model
        * Module: Data, Model 
    * Get model output from data 
        * Module: Model
    * Post-Process output to raw output data 
        * Module: Data 
    * Write raw data
        * Module: Data
* Qualitative Evaluation
    * Read in raw model output 
    * Send raw data to production system, gather human feedback
    * Calculate Metrics 
    * Write Metrics 
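
To make the Training Loop experiment above more concrete, here is a minimal, dependency-free sketch of the same steps: create batches, forward pass, loss, backpropagation, validation loss, and printing. It is not the project's training code. `ToyModel`, `create_batches`, and `run_training` are hypothetical names, and a one-weight linear model with a hand-written gradient step stands in for the real neural network so that the example runs without a deep learning library. Saving the trained model is left out.

```python
from typing import List, Tuple

Example = Tuple[float, float]  # (input feature, target); stand-in for featurized data


def create_batches(data: List[Example], batch_size: int) -> List[List[Example]]:
    """Data module stand-in: split the examples into consecutive batches."""
    return [data[i:i + batch_size] for i in range(0, len(data), batch_size)]


class ToyModel:
    """Model module stand-in: a one-weight linear model y = w * x."""

    def __init__(self) -> None:
        self.w = 0.0

    def forward(self, xs: List[float]) -> List[float]:
        return [self.w * x for x in xs]

    def loss(self, preds: List[float], targets: List[float]) -> float:
        # Mean squared error over the batch.
        return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)

    def backward(self, xs: List[float], preds: List[float],
                 targets: List[float], lr: float) -> None:
        # Gradient of the MSE with respect to w, followed by one gradient step.
        grad = sum(2 * (p - t) * x for x, p, t in zip(xs, preds, targets)) / len(targets)
        self.w -= lr * grad


def run_training(train: List[Example], valid: List[Example], epochs: int = 20,
                 batch_size: int = 4, lr: float = 0.01):
    """Run the loop and return the model plus (train_loss, valid_loss) per epoch."""
    model = ToyModel()
    history = []
    valid_xs = [x for x, _ in valid]
    valid_ys = [y for _, y in valid]
    for epoch in range(epochs):
        batch_losses = []
        for batch in create_batches(train, batch_size):
            xs = [x for x, _ in batch]
            ys = [y for _, y in batch]
            preds = model.forward(xs)                  # forward pass
            batch_losses.append(model.loss(preds, ys))  # calculate the loss
            model.backward(xs, preds, ys, lr)          # backpropagate the loss
        train_loss = sum(batch_losses) / len(batch_losses)
        valid_loss = model.loss(model.forward(valid_xs), valid_ys)
        history.append((train_loss, valid_loss))
        print(f"epoch {epoch}: train={train_loss:.6f} valid={valid_loss:.6f}")
    return model, history


# Usage: learn y = 3x from eight training examples and two validation examples.
train_data = [(float(x), 3.0 * x) for x in range(1, 9)]
valid_data = [(9.0, 27.0), (10.0, 30.0)]
model, history = run_training(train_data, valid_data)
```

The experiment script itself stays thin: it only wires Data and Model calls together, which is the division of labor the list above describes.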

## 2. Current implementations in the existing system 
Because I am working from an existing system, all of the components already exist. However, they don't exist in a maintainable way. To put it bluntly, it's all spaghetti code. I have read through much of the existing code and have a reasonable understanding of where each component lives at a high level. I won't go into detail about where every code module lives in the existing system, mostly because I don't know the specifics for every part.

Most, if not all, of the Data code lives inside the pyScoreParser package. This package is responsible for reading in the training input (MusicXML) and training output (MIDI), featurizing them for training, and doing this process in reverse: taking featurized model output and converting that data to MIDI. It also contains code that aligns the training MusicXML with the MIDI to ensure that all of the notes are matched with each other for the training process.

All of the model definitions exist in a file named nnModel.py. I believe that some of the training loop code also lives in nnModel.py.

The model_run.py file contains both the Training Loop and all of the Evaluation code. You can configure the script to run a training job, generate MIDI output (which would then be used for qualitative evaluation), and perform a quantitative evaluation.

## 3. API Proposal 
Given all of the previous information, we'll present an API that follows the architecture diagram and makes use of the existing system code. The API will be presented as Python function and class definitions. We'll also present the folder structure for the project to help organize the data, models, Python modules, and scripts, and talk a little bit about the development process and how it relates to the repository.

### 3.1 Jupyter Notebooks 
I typically do all of my development in Jupyter Notebooks because of the interactive benefits they provide. One example is the development of model training: it is nice to load the training data into memory only once and then iterate on the training code, instead of reloading the data and rerunning the entire script every time a change is made. However, Jupyter Notebooks can lead to very messy and unmaintainable code. The general strategy I'll employ is to move the code I write in notebooks into Python modules and packages often, and to always keep the notebooks lightweight and simple. The notebooks will ultimately be the entry point for running the data, training, and evaluation scripts. This can happen either through the Jupyter Notebook UI or by converting the notebook to a script using jupyter nbconvert. The latter strategy is useful for running longer training jobs.

### 3.2 Folder Structure 
The folder structure will roughly look like this:

```plain
.
├── virtuosoData
│   ├── input
│   │   ├── development
│   │   └── production
│   └── output
└── virtuosoNet
    ├── data
    │   ├── test
    │   └── train
    ├── models
    ├── notebooks
    │   ├── data_processing
    │   ├── inference
    │   │   ├── production
    │   │   └── test
    │   └── training
    ├── pyScoreParser
    └── src
        ├── data
        ├── evaluation
        ├── experiments
        │   ├── data_processing
        │   ├── inference
        │   │   ├── production
        │   │   └── test
        │   └── training
        └── models
```

Both virtuosoData and virtuosoNet will be separate repositories, as this follows the structure of the original project. pyScoreParser is left in its own folder, as it is also a separate repository and will be a submodule underneath virtuosoNet.

virtuosoData will contain the raw MusicXML and MIDI data and is separated into input and output data. The input data holds both MusicXML and MIDI data, and is used for the input to the model. The output folder will contain the raw MIDI files produced from model output. These will be used for the qualitative evaluation. The input data is split into development and production data. The development data can be split into whatever train/validation/test split is necessary for training and quantitative evaluation. The production folder will contain data that is to be used for inference, with the end goal of passing raw output data to a production system for qualitative evaluation. It's important to note that only the raw data exists in this repository, as the intermediate cached form will live inside virtuosoNet.

virtuosoNet will follow a simple folder structure that models the architecture diagram. The data folder will hold the cached version of both the development and production data sets. The models folder will hold any pre-trained models, and should ideally be organized by model and by version. The notebooks folder will hold all of the development notebooks, which should also be named with versions to keep the organization clear. The subfolders inside notebooks correspond with the different flows through the system. As mentioned before, notebooks should be lightweight and simple, with most of the heavy lifting done in Python packages. The src folder holds all such packages. The data, evaluation, and models folders correspond to the ML components defined in the architecture diagram. The experiments subfolders mirror the notebooks subfolders, indicating that each type of notebook should use the module code from the matching folder under src/experiments. As a concrete example, any notebook under notebooks/data_processing will make heavy use of the Python packages defined inside src/experiments/data_processing.
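
The layout above can also be written down as a small script, which helps keep notebooks and src/experiments in step. The sketch below is illustrative only: `VIRTUOSO_NET_DIRS`, `create_skeleton`, and `experiment_package_for` are hypothetical names, the directory list is copied from the tree, and pyScoreParser is left out because it is a submodule rather than a plain folder.

```python
import tempfile
from pathlib import Path
from typing import List

# Sub-directories of virtuosoNet, copied from the folder tree above.
VIRTUOSO_NET_DIRS = [
    "data/test",
    "data/train",
    "models",
    "notebooks/data_processing",
    "notebooks/inference/production",
    "notebooks/inference/test",
    "notebooks/training",
    "src/data",
    "src/evaluation",
    "src/experiments/data_processing",
    "src/experiments/inference/production",
    "src/experiments/inference/test",
    "src/experiments/training",
    "src/models",
]


def create_skeleton(root: Path) -> List[Path]:
    """Create every directory of the layout under root and return the paths."""
    created = []
    for relative in VIRTUOSO_NET_DIRS:
        directory = root / relative
        directory.mkdir(parents=True, exist_ok=True)
        created.append(directory)
    return created


def experiment_package_for(notebook_dir: str) -> str:
    """Map a notebooks/ sub-folder to the matching src/experiments/ sub-folder."""
    # relative_to raises ValueError if notebook_dir is not under notebooks/.
    relative = Path(notebook_dir).relative_to("notebooks")
    return (Path("src/experiments") / relative).as_posix()


# Usage: build the skeleton in a throwaway directory and look up one mapping.
with tempfile.TemporaryDirectory() as tmp:
    created = create_skeleton(Path(tmp) / "virtuosoNet")
    print(len(created), "directories created")
print(experiment_package_for("notebooks/data_processing"))
```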

## 3.2 API

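```python
class Score:
    """Placeholder for a score parsed from a MusicXML file."""

    def __init__(self):
        pass


class Performance:
    """Placeholder for a MIDI performance aligned to a Score."""

    def __init__(self):
        pass
```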

### 3.2.1 Data

src/data/data_reader.py

```python
# src/data/data_reader.py

# Input Raw Data Reader 
#     * Input Raw Data Pre-Processing to Model Input Features (Intermediate Data Form)
#     * Input Feature Cache Data Writer
#     * Input Feature Cache Data Reader 
#     * Output Feature Post-Processing to Raw Data
#     * Output Raw Data Writer 
#     * Output Raw Data Reader
from typing import Dict, List, Optional, Tuple, TypeVar

# Score and Performance are the placeholder classes defined at the start of the API section.
T = TypeVar('T')


def read_raw_development_data(path: str, split: Optional[dict] = None) -> Dict[str, List[Tuple[Score, Performance]]]:
    '''Takes in the path to the raw training data and an optional split dictionary which defines 
    the specific subfolders which are to be used for the different splits 
    
    This method will read all of the MusicXML and MIDI files defined in the split, and create a Score 
    and related Performance object that contains the note-aligned object features for every performance.

    It will return a dictionary keyed by split type that holds a List of Tuples containing each Score and 
    Performance object. 
    '''
    pass

def read_raw_input_production_data(path: str) -> List[Score]:
    '''Takes in the path to the raw production data

    Reads all of the MusicXML files and parses each file into a Score object. 

    Returns a list of all Score objects
    '''

def read_cached_development_data(file_path: str) -> List[Tuple[List[T], List[T]]]:
    '''Reads in the cached object that was stored after pre-processing
    
    If the file doesn't exist, raises an error. 

    If the read object doesn't match the format for the cached data, raises an error
    '''
```

src/data/data_writer.py

```python
# src/data/data_writer.py

# Input Raw Data Reader 
#     * Input Raw Data Pre-Processing to Model Input Features (Intermediate Data Form)
#     * Input Feature Cache Data Writer
#     * Input Feature Cache Data Reader 
#     * Output Feature Post-Processing to Raw Data
#     * Output Raw Data Writer 
#     * Output Raw Data Reader
from typing import List, Tuple, TypeVar

T = TypeVar('T')

def write_cached_development_data(file_path: str, data: List[Tuple[List[T], List[T]]], split_type: str) -> None:
    '''Writes the pre-processed data object to the specified path
    
    The data object should be a list of two-tuples of lists, containing the 
    score and performance features in vector form. Each data object should only contain 
    the performances for a single split, and the file_path should be organized 
    such that the file and folder names make sense for each split.
    '''

def write_raw_production_midi(file_path: str, midi) -> None:
    '''Writes the specified midi object to a midi file at the specified path'''
```
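
The cache reader and writer signatures above leave the on-disk format open. As one possible way to fill them in, the sketch below uses Python's pickle module. This is an assumption, not something taken from the existing system: the pickle format, the `_check_format` helper, and the choice to store `split_type` next to the data in a small dictionary are all hypothetical.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, List, Tuple

FeaturePair = Tuple[List[Any], List[Any]]  # (score features, performance features)


def _check_format(data: Any) -> None:
    """Raise ValueError unless data is a list of (list, list) two-tuples."""
    ok = isinstance(data, list) and all(
        isinstance(pair, tuple) and len(pair) == 2
        and isinstance(pair[0], list) and isinstance(pair[1], list)
        for pair in data
    )
    if not ok:
        raise ValueError("cached data must be a list of (list, list) tuples")


def write_cached_development_data(file_path: str, data: List[FeaturePair],
                                  split_type: str) -> None:
    """Pickle the featurized data for one split, creating parent folders as needed."""
    _check_format(data)
    path = Path(file_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as handle:
        # Assumption: the split name is stored alongside the data for later inspection.
        pickle.dump({"split_type": split_type, "data": data}, handle)


def read_cached_development_data(file_path: str) -> List[FeaturePair]:
    """Load featurized data; raise if the file is missing or has the wrong shape."""
    path = Path(file_path)
    if not path.is_file():
        raise FileNotFoundError(f"no cached data at {file_path}")
    with path.open("rb") as handle:
        payload = pickle.load(handle)
    if not isinstance(payload, dict) or "data" not in payload:
        raise ValueError("unexpected cache file layout")
    _check_format(payload["data"])
    return payload["data"]


# Usage: round-trip a tiny training split through a temporary cache file.
example = [([0.1, 0.2], [1.0]), ([0.3], [2.0, 3.0])]
with tempfile.TemporaryDirectory() as tmp:
    cache_file = str(Path(tmp) / "train" / "features.pkl")
    write_cached_development_data(cache_file, example, "train")
    restored = read_cached_development_data(cache_file)
print(restored == example)
```

Pickle files should only be loaded from trusted locations, which fits a cache that the project writes and reads itself.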

src/data/pre_processor.py

```python
# src/data/pre_processor.py

# Input Raw Data Reader 
#     * Input Raw Data Pre-Processing to Model Input Features (Intermediate Data Form)
#     * Input Feature Cache Data Writer
#     * Input Feature Cache Data Reader 
#     * Output Feature Post-Processing to Raw Data
#     * Output Raw Data Writer 
#     * Output Raw Data Reader
from typing import Tuple, List, TypeVar 

T = TypeVar('T')

def featurize_score_and_performances(data: List[Tuple[Score, Performance]]) -> List[Tuple[List[T], List[T]]]:
    '''Takes in a list of (Score, Performance) pairs and transforms the Score and Performance objects to featurized vectors'''
```



src/data/post_processor.py

```python
# src/data/post_processor.py

# Input Raw Data Reader 
#     * Input Raw Data Pre-Processing to Model Input Features (Intermediate Data Form)
#     * Input Feature Cache Data Writer
#     * Input Feature Cache Data Reader 
#     * Output Feature Post-Processing to Raw Data
#     * Output Raw Data Writer 
#     * Output Raw Data Reader
from typing import List, Sequence, TypeVar 
from pretty_midi import PrettyMIDI

T = TypeVar('T')

def defeaturize_output_to_midi(data: List[Sequence[T]]) -> List[PrettyMIDI]:
    '''Takes a list of outputs from the model and converts model outputs to MIDI objects
    
    Returns the list of performances as PrettyMIDI objects
    '''
```

### 3.2.2 Models
Several different models will be developed and maintained. For the purposes of this project, I am going to define a base model experiment class that wraps a NN model and also exposes functions for training, model reading and writing, and model inference. Every new model should make use of this class. Each file inside the models folder should contain a PyTorch model class definition and also a ModelExperiment class definition. The ModelExperiment class will be used in the training script.

src/models/model_experiment.py

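```python
# * Model Definition 
#     * NN Architecture
#     * Forward Pass
#     * Loss function 
# * Model Training Loop 
# * Trained model writer 
# * Trained model reader 
# * Model Inference


class ModelExperiment:
    def __init__(self, nn_model):
        # The NN model (architecture, forward pass, loss) that this experiment wraps
        self.nn_model = nn_model

    def train_model(self):
        '''Model Training Loop'''
        pass

    def write_model(self):
        '''Trained model writer'''
        pass

    def read_model(self):
        '''Trained model reader'''
        pass

    def model_inference(self):
        '''Model Inference'''
        pass
```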