# Deep Learning for the Auto-Generated Music Composition

Name: Jianqiao Li, Zhiying Cui

NetID: jl7136, zc2191

## Introduction
- This project is based on the DeepJ model from the GitHub repository [DeepJ: A model for style-specific music generation](https://github.com/calclavia/DeepJ), with some modifications.
- The reference paper is [Mao HH, Shin T, Cottrell G. DeepJ: Style-specific music generation. In 2018 IEEE 12th International Conference on Semantic Computing (ICSC) 2018 Jan 31 (pp. 377-382). IEEE.](https://arxiv.org/abs/1801.00887).

## Goal of This Work
Our model aims to auto-generate a 10 to 30 second polyphonic piece given a specific music style and random formatted inputs. The objectives are as follows:
- Rebuild the DeepJ model in our local environment and compare the results with the authors’ model.
- Replace the one-hot encoding strategy for style generation with a pre-trained model.
- Try to develop a sparse representation of music as the authors recommended.

## Development Environment

- Python version: Python 3.6.
    - Because the `grpcio` version required by TensorFlow 2 is incompatible with Python 3.5, we use Python 3.6, which differs from the version used in the DeepJ repo.
- Framework: TensorFlow.
- Environment: Google Cloud.

### Some useful tips for setting up

- Set up a jupyter server on Google Cloud: 
    - [Running Jupyter Notebook on Google Cloud Platform in 15 min](https://towardsdatascience.com/running-jupyter-notebook-in-google-cloud-platform-in-15-min-61e16da34d52).
- Add a different version of kernel:
    - [How to add python 3.6 kernel alongside 3.5 on jupyter](https://stackoverflow.com/questions/43759610/how-to-add-python-3-6-kernel-alongside-3-5-on-jupyter).
    - [Jupyter Notebook Kernels: How to Add, Change, Remove](https://queirozf.com/entries/jupyter-kernels-how-to-add-change-remove).
    - [Run Jupyter Notebook script from terminal](https://deeplearning.lipingyang.org/2018/03/28/run-jupyter-notebook-script-from-terminal/).
    
### Requirements

- Install dependencies for the DeepJ.
```
pip install --ignore-installed -r requirements.txt
```
- Install `python-midi` module. The original [python-midi](https://github.com/vishnubob/python-midi) is no longer maintained. We have to find an alternative python-midi from the following repos:
    - ✅ Candidate 1: https://github.com/louisabraham/python3-midi
    - ❓ Candidate 2: https://github.com/sniperwrb/python-midi
    - ❓ Candidate 3: https://github.com/jameswenzel/mydy
- Download the dataset to `data/` folder. The Midi files come from [Piano-midi](http://www.piano-midi.de/).
- Details about the project directory:

In [None]:
import os

print('Project Directory')
os.chdir('/home/choi/DLFinalProject/')
!tree -L 1

Project Directory
.
├── LICENSE
├── README.md
├── __pycache__
├── archives
├── constants.py
├── data
├── dataset.py
├── distribution.py
├── download.py
├── generate.py
├── main.ipynb
├── midi_util.py
├── model.py
├── out
├── requirements.txt
├── scripts
├── test.py
├── train.py
├── util.py
└── visualize.py

5 directories, 15 files


Before getting started, import all dependencies and modules required for this work.

In [None]:
import tensorflow as tf
import numpy as np
from keras.layers import Input, LSTM, Dense, Dropout, Lambda, Reshape, Permute
from keras.layers import TimeDistributed, RepeatVector, Conv1D, Activation
from keras.layers import Embedding, Flatten
from keras.layers.merge import Concatenate, Add
from keras.models import Model
import keras.backend as K
from keras import losses
from keras.callbacks import ModelCheckpoint, LambdaCallback
from keras.callbacks import EarlyStopping, TensorBoard

import argparse
import midi
import os

from constants import *             # store constant parameters for the model
from dataset import *               # load dataset and parse to formatted inputs
from generate import *              # generate music
from model import *                 # model architectures
from midi_util import midi_encode   # util funcs for midi
from util import *                  # util funcs

## Objective 1: Rebuild the DeepJ Model

### Dataset

We adopted the same dataset that the authors used to train DeepJ. [Piano-midi](http://www.piano-midi.de/midicoll.htm) contains classical piano solo pieces, each recorded with a MIDI sequencer. As of February 2020, the dataset holds 571 pieces by 26 composers, with a total duration of 36.7 hours of MIDI files.

Before focusing on the model, we first need to download the dataset and parse it into formatted inputs. All utility functions for processing the MIDI files are in `dataset.py`. The following functions deserve attention:

- `load_all(styles, batch_size, time_steps)`: Load all MIDI files and parse them into four inputs (`note_data`, `note_target`, `beat_data`, `style_data`) and one label (`note_target`, which is also fed to the model as the conditioning input).
- `clamp_midi(sequence)`: Clamp the MIDI sequence to the note range between `MIN_NOTE` and `MAX_NOTE`. In the paper, the authors truncate pitches to the range 36 to 84 to reduce the input dimension.
- `stagger(data, time_steps)`: Chop the sequence data into windows of `time_steps`. This function returns two variables: `dataX`, the sequence of data at the current time step, and `dataY`, the sequence of data at the next time step, which is the prediction target.
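
The windowing and clamping described above can be sketched as follows. This is a simplified illustration, not the code from `dataset.py`: the names `clamp_midi_sketch` and `stagger_sketch` are hypothetical, the piano roll layout (time, pitch, features) is an assumption, and the repository's version may differ in details such as stride or padding.

```python
import numpy as np

# Values taken from the constants table in this notebook.
MIN_NOTE, MAX_NOTE = 36, 84

def clamp_midi_sketch(sequence, min_note=MIN_NOTE, max_note=MAX_NOTE):
    """Keep only the pitch columns in [min_note, max_note) (assumed layout)."""
    return sequence[:, min_note:max_note]

def stagger_sketch(data, time_steps):
    """Return (dataX, dataY): windows of length time_steps and the
    same windows shifted forward by one step (the prediction target)."""
    data_x, data_y = [], []
    for start in range(len(data) - time_steps):
        data_x.append(data[start:start + time_steps])
        data_y.append(data[start + 1:start + time_steps + 1])
    return data_x, data_y

# Toy piano roll: 6 time steps x 128 MIDI pitches x 3 note features.
roll = np.zeros((6, 128, 3))
clamped = clamp_midi_sketch(roll)
data_x, data_y = stagger_sketch(clamped, time_steps=4)
print(clamped.shape)  # (6, 48, 3)
print(len(data_x))    # 2
```

Each target window is the input window moved one step ahead, so the model learns to predict the next time step from the current one.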

Now, load all MIDI files in the `data/` folder and see what each input looks like.

In [None]:
train_data, train_labels = load_all(styles, BATCH_SIZE, SEQ_LEN) # actually we can safely remove BATCH_SIZE

print('note_data')
print(train_data[0])

print('note_target')
print(train_data[1]) # aka train_labels

print('beat_data')
print(train_data[2])

print('style_data')
print(train_data[3])

Note that all constant parameters for the model are stored in `constants.py`. The following table gives the meaning of each variable (types are written in Java syntax).

| Variable          | Value/Type        | Representation            |
| :---------------- | :---------------: | :------------------------ |
| genre             | List<String>      | Genre of music            |
| styles            | List<List<String>>| Directory of dataset      |
| NUM_STYLES        | styles.size()     | Numbers of styles         |
| DEFAULT_RES       | 96                | Resolution                |
| MIDI_MAX_NOTES    | 128               | Notes range [1, 128]      |
| MAX_VELOCITY      | 127               | Velocity range [0, 127]   |
| NUM_OCTAVES       | 4                 | Number of octaves         |
| OCTAVE            | 12                | Notes in every octave     |
| MIN_NOTE          | 36                | Minimum note              |
| MAX_NOTE          | MIN_NOTE + NUM_OCTAVES * OCTAVE       | Maximum note                  |
| NUM_NOTES         | MAX_NOTE - MIN_NOTE   | Number of notes between MIN_NOTE and MAX_NOTE |
| BEATS_PER_BAR     | 4                 | Number of beats in a bar  |
| NOTES_PER_BEAT    | 4                 | Notes per quarter note    |
| NOTES_PER_BAR     | NOTES_PER_BEAT * BEATS_PER_BAR    | Time steps per bar; the finest resolution is a sixteenth note |
| BATCH_SIZE        | 16                | Training batch size       |
| SEQ_LEN           | 8 * NOTES_PER_BAR | Data sequence length      |
| OCTAVE_UNITS      | 64                | Hyperparameter in octave convolution layer    |
| STYLE_UNITS       | 64                | Hyperparameter in style embedding |
| NOTE_UNITS        | 3                 | Note dimension            |
| TIME_AXIS_UNITS   | 256               | Hyperparameter used for LSTMs in Time-Axis    |
| NOTE_AXIS_UNITS   | 128               | Hyperparameter used for LSTMs in Note-Axis    |
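
As a quick check, the derived entries in the table can be evaluated directly. The snippet below only restates values from the table; it is not a copy of `constants.py`.

```python
# Base values copied from the constants table.
NUM_OCTAVES = 4
OCTAVE = 12
MIN_NOTE = 36
BEATS_PER_BAR = 4
NOTES_PER_BEAT = 4
NOTE_UNITS = 3

# Derived values, following the formulas in the table.
MAX_NOTE = MIN_NOTE + NUM_OCTAVES * OCTAVE
NUM_NOTES = MAX_NOTE - MIN_NOTE
NOTES_PER_BAR = NOTES_PER_BEAT * BEATS_PER_BAR
SEQ_LEN = 8 * NOTES_PER_BAR

print(MAX_NOTE, NUM_NOTES, NOTES_PER_BAR, SEQ_LEN)  # 84 48 16 128
print((SEQ_LEN, NUM_NOTES, NOTE_UNITS))             # (128, 48, 3)
```

The last tuple matches the `(None, 128, 48, 3)` note input shape reported by `model.summary()` in this notebook.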


### Model architecture
Refer to the DeepJ paper for the full architecture. Calling `model.summary()` prints the layer structure of the rebuilt model.


Considering the computational cost, we removed the "baroque" genre.
- A summarized dataset for [Piano-Midi](http://www.piano-midi.de/): https://www.kaggle.com/soumikrakshit/classical-music-midi
- Other dataset sources:
    - https://www.mfiles.co.uk/
    - https://www.kaggle.com/programgeek01/anime-music-midi


In [None]:
def build_models(time_steps=SEQ_LEN, input_dropout=0.2, dropout=0.5):
    """
    Build the LSTM model
    """
    notes_in = Input((time_steps, NUM_NOTES, NOTE_UNITS))
    beat_in = Input((time_steps, NOTES_PER_BAR))
    style_in = Input((time_steps, NUM_STYLES))
    # Target input for conditioning
    chosen_in = Input((time_steps, NUM_NOTES, NOTE_UNITS))

    # Dropout inputs
    notes = Dropout(input_dropout)(notes_in)
    beat = Dropout(input_dropout)(beat_in)
    chosen = Dropout(input_dropout)(chosen_in)

    # Distributed representations
    style_l = Dense(STYLE_UNITS, name='style')
    style = style_l(style_in)

    """ Time axis """
    time_out = time_axis(dropout)(notes, beat, style)

    """ Note Axis & Prediction Layer """
    naxis = note_axis(dropout)
    notes_out = naxis(time_out, chosen, style)

    model = Model([notes_in, chosen_in, beat_in, style_in], [notes_out])
    model.compile(optimizer='nadam', loss=[primary_loss])

    """ Generation Models """
    time_model = Model([notes_in, beat_in, style_in], [time_out])

    note_features = Input((1, NUM_NOTES, TIME_AXIS_UNITS), name='note_features')
    chosen_gen_in = Input((1, NUM_NOTES, NOTE_UNITS), name='chosen_gen_in')
    style_gen_in = Input((1, NUM_STYLES), name='style_in')

    # Dropout inputs
    chosen_gen = Dropout(input_dropout)(chosen_gen_in)
    style_gen = style_l(style_gen_in)

    note_gen_out = naxis(note_features, chosen_gen, style_gen)

    note_model = Model([note_features, chosen_gen_in, style_gen_in], note_gen_out)

    return model, time_model, note_model

### Training
Training was performed using stochastic gradient descent with the Nesterov Adam optimizer. The loss function is as follows:

$$
\begin{equation}
    \begin{split}
        & L_{play} = \sum {t_{play}\log y_{play} + (1 - t_{play}) \log(1-y_{play})}\\
        & L_{rplay} = \sum {t_{play}(t_{rplay}\log y_{rplay} + (1 - t_{rplay}) \log(1 - y_{rplay}))}\\
        & L_{dynamics} = \sum {t_{play}(t_{dynamics} - y_{dynamics})^2}
    \end{split}
\end{equation}
$$

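Play and replay are treated as logistic regression problems trained with binary cross entropy, as in the Biaxial LSTM. Dynamics (velocity) is trained with mean squared error. The factor $t_{play}$ in $L_{rplay}$ and $L_{dynamics}$ means that replay and dynamics only contribute where the target note is actually played.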
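
A minimal NumPy sketch of these three terms is given below. It is not the `primary_loss` from `model.py`. Several choices are assumptions made for illustration: the last axis is ordered as (play, replay, dynamics), the three terms are simply summed, and the two cross-entropy terms are negated so that a lower value is better (the formulas above are written as log-likelihoods).

```python
import numpy as np

def deepj_loss_sketch(t, y, eps=1e-7):
    """t, y: arrays of shape (..., 3), assumed order (play, replay, dynamics)."""
    t_play, t_replay, t_dyn = t[..., 0], t[..., 1], t[..., 2]
    # Clip predicted probabilities to avoid log(0).
    y_play = np.clip(y[..., 0], eps, 1.0 - eps)
    y_replay = np.clip(y[..., 1], eps, 1.0 - eps)
    y_dyn = y[..., 2]

    # Binary cross entropy on play.
    l_play = -np.sum(t_play * np.log(y_play)
                     + (1 - t_play) * np.log(1 - y_play))
    # Replay and dynamics are masked by t_play.
    l_replay = -np.sum(t_play * (t_replay * np.log(y_replay)
                                 + (1 - t_replay) * np.log(1 - y_replay)))
    l_dyn = np.sum(t_play * (t_dyn - y_dyn) ** 2)
    return l_play + l_replay + l_dyn

# Two notes: the first is played, the second is not.
target = np.array([[1.0, 0.0, 0.5], [0.0, 1.0, 0.2]])
good = np.array([[0.99, 0.01, 0.5], [0.01, 0.9, 0.9]])
bad = np.array([[0.5, 0.5, 0.1], [0.5, 0.1, 0.0]])
print(deepj_loss_sketch(target, good) < deepj_loss_sketch(target, bad))  # True
```

Because of the mask, the replay and dynamics predictions for the second note do not affect the result.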

In [None]:
def train(models):
    """
    Train the model
    """
    print('Loading data')
    train_data, train_labels = load_all(styles, BATCH_SIZE, SEQ_LEN)

    cbs = [
        ModelCheckpoint(MODEL_FILE, monitor='loss', save_best_only=True, save_weights_only=True),
        EarlyStopping(monitor='loss', patience=5),
        TensorBoard(log_dir='out/logs', histogram_freq=1)
    ]

    print('Training')
    models[0].fit(train_data, train_labels, epochs=2, callbacks=cbs, batch_size=BATCH_SIZE)

In [None]:
models = build_models()
models[0].summary()
models[1].summary() # time_model
models[2].summary() # note_model

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            [(None, 128, 48, 3)] 0                                            
__________________________________________________________________________________________________
input_3 (InputLayer)            [(None, 128, 6)]     0                                            
__________________________________________________________________________________________________
dropout (Dropout)               (None, 128, 48, 3)   0           input_1[0][0]                    
__________________________________________________________________________________________________
style (Dense)                   multiple             448         input_3[0][0]                    
______________________________________________________________________________________________

In [None]:
"""
Loads all MIDI files as a piano roll. Prepare dataset
(For Keras)
"""

time_steps = SEQ_LEN

note_data = []
beat_data = []
style_data = []

note_target = []

# TODO: Can speed this up with better parallel loading. Order guarantee.
stylesEnum = [y for x in styles for y in x]

for style_id, style in enumerate(stylesEnum):
    style_hot = one_hot(style_id, NUM_STYLES)
    # Parallel process all files into a list of music sequences
    seqs = Parallel(n_jobs=multiprocessing.cpu_count(), backend='threading')(delayed(load_midi)(f) for f in get_all_files([style]))

    for seq in seqs:
        if len(seq) >= time_steps:
            # Clamp MIDI to note range
            seq = clamp_midi(seq)
            # Create training data and labels
            train_data, label_data = stagger(seq, time_steps)
            note_data += train_data
            note_target += label_data

            beats = [compute_beat(i, NOTES_PER_BAR) for i in range(len(seq))]
            beat_data += stagger(beats, time_steps)[0]

            style_data += stagger([style_hot for i in range(len(seq))], time_steps)[0]

note_data = np.array(note_data)
beat_data = np.array(beat_data)
style_data = np.array(style_data)
note_target = np.array(note_target)

train_data = [note_data, note_target, beat_data, style_data]
train_labels = [note_target]

In [None]:
print("note_data:", train_data[0].shape)
print("note_target:", train_data[1].shape)
print("beat_data:", train_data[2].shape)
print("style_data:", train_data[3].shape)

note_data: (2306, 128, 48, 3)
note_target: (2306, 128, 48, 3)
beat_data: (2306, 128, 16)
style_data: (2306, 128, 6)
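
One of the arrays printed above has a trailing axis of 16, which equals `NOTES_PER_BAR`. That is consistent with the beat input being a one-hot position within the bar. The sketch below illustrates this encoding; `compute_beat_sketch` is a hypothetical stand-in, and the real `compute_beat` in the repository may be implemented differently.

```python
import numpy as np

NOTES_PER_BAR = 16  # NOTES_PER_BEAT * BEATS_PER_BAR from the constants table

def compute_beat_sketch(step, notes_in_bar=NOTES_PER_BAR):
    """One-hot vector marking the position of `step` inside its bar."""
    beat = np.zeros(notes_in_bar)
    beat[step % notes_in_bar] = 1.0
    return beat

# Beat features for one training window of SEQ_LEN = 128 steps.
beats = np.array([compute_beat_sketch(i) for i in range(128)])
print(beats.shape)          # (128, 16)
print(beats[17].argmax())   # 1
```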


In [None]:
cbs = [
    ModelCheckpoint(MODEL_FILE, monitor='loss', save_best_only=True, save_weights_only=True),
    EarlyStopping(monitor='loss', patience=3),
    TensorBoard(log_dir='out/logs', histogram_freq=1)
]

models[0].fit(train_data, train_labels, epochs=1, callbacks=cbs, batch_size=BATCH_SIZE)





<keras.callbacks.History at 0x7f4480457048>

### Generation
Given a style and an input, the models generate the output sequences.

In [None]:
models[0].load_weights(MODEL_FILE)

In [None]:
# parser = argparse.ArgumentParser(description='Generates music.')
# parser.add_argument('--bars', default=32, type=int, help='Number of bars to generate')
# parser.add_argument('--styles', default=None, type=int, nargs='+', help='Styles to mix together')
# args = parser.parse_args()

models = build_or_load()

stylesGene = [compute_genre(i) for i in range(len(genre))]

# if args.styles:
#     # Custom style
#     styles = [np.mean([one_hot(i, NUM_STYLES) for i in args.styles], axis=0)]

write_file('output', generate(models, 32, stylesGene))

Model: "model_3"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_5 (InputLayer)            [(None, 128, 48, 3)] 0                                            
__________________________________________________________________________________________________
input_7 (InputLayer)            [(None, 128, 6)]     0                                            
__________________________________________________________________________________________________
dropout_17 (Dropout)            (None, 128, 48, 3)   0           input_5[0][0]                    
__________________________________________________________________________________________________
style (Dense)                   multiple             448         input_7[0][0]                    
____________________________________________________________________________________________

100%|██████████| 512/512 [23:13<00:00,  2.72s/it]

Writing file out/samples/output_0.mid
Writing file out/samples/output_1.mid
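
The commented-out `--styles` option in the generation cell mixes several styles by averaging their one-hot vectors. A minimal sketch of that idea follows. The helper `one_hot_sketch` is a hypothetical stand-in for the repository's `one_hot`, and `NUM_STYLES = 6` is an assumption read off the `(None, 128, 6)` style input in the model summary.

```python
import numpy as np

NUM_STYLES = 6  # assumed from the style input shape in the model summary

def one_hot_sketch(index, size):
    """Vector of zeros with a single 1.0 at `index`."""
    vec = np.zeros(size)
    vec[index] = 1.0
    return vec

def mix_styles(style_ids, num_styles=NUM_STYLES):
    """Average the one-hot vectors of the chosen styles."""
    return np.mean([one_hot_sketch(i, num_styles) for i in style_ids], axis=0)

mixed = mix_styles([0, 2])
print(mixed.tolist())  # [0.5, 0.0, 0.5, 0.0, 0.0, 0.0]
```

The mixed vector still sums to 1, so it can be passed wherever a single one-hot style vector is expected.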





## Objective 2: Music Style Classification
literature references

## Next Steps
no need to finish objective 3

### Some thoughts
- 2 genres: classic, jazz, EDM?
- create a table to present constant value
- make a repository on github
- music classification 
    - CNN wave images
    - https://www.analyticsvidhya.com/blog/2021/06/music-genres-classification-using-deep-learning-techniques/
    - http://cs229.stanford.edu/proj2018/report/21.pdf
    - Midi
    - https://github.com/sandershihacker/midi-classification-tutorial/blob/master/midi_classifier.ipynb
    - ByteDance https://arxiv.org/abs/2010.14805
    - ByteDance dataset https://arxiv.org/abs/2010.07061 Github https://github.com/bytedance/GiantMIDI-Piano
    
### What are the next steps
- Change the DeepJ model to adapt to more genres rather than one specific composer?
- Or pre-train a music classification model to initialize the input, replacing the one-hot representation.
- Forget about the sparse input; a sufficient explanation is enough.
