# Deep Learning for the Auto-Generated Music Composition

Name: Jianqiao Li, Zhiying Cui

NetID: jl7136, zc2191

## Introduction
- This project is based on the DeepJ model from the github repository [DeepJ: A model for style-specific music generation](https://github.com/calclavia/DeepJ) with some modifications.
- The reference paper is [Mao HH, Shin T, Cottrell G. DeepJ: Style-specific music generation. In 2018 IEEE 12th International Conference on Semantic Computing (ICSC), 2018 Jan 31 (pp. 377-382). IEEE.](https://arxiv.org/abs/1801.00887).

## Goal of This Work
Our model aims to automatically generate a 10- to 30-second polyphonic piece, given a specific music style and random formatted inputs. The objectives are as follows:
- Rebuild the DeepJ model in our local environment and compare the results with the authors' model.
- Replace the one-hot encoding strategy for style with a pre-trained model.
- Try to develop a sparse representation of music, as the authors recommended.

## Development Environment

- Python version: Python 3.6.
    - Because the `grpcio` version required by TensorFlow 2 is incompatible with Python 3.5, we use Python 3.6, which differs from the DeepJ repo.
- Framework: TensorFlow.
- Environment: Google Cloud.

### Some useful tips for setting up

- Set up a jupyter server on Google Cloud: 
    - [Running Jupyter Notebook on Google Cloud Platform in 15 min](https://towardsdatascience.com/running-jupyter-notebook-in-google-cloud-platform-in-15-min-61e16da34d52).
- Add a different version of kernel:
    - [How to add python 3.6 kernel alongside 3.5 on jupyter](https://stackoverflow.com/questions/43759610/how-to-add-python-3-6-kernel-alongside-3-5-on-jupyter).
    - [Jupyter Notebook Kernels: How to Add, Change, Remove](https://queirozf.com/entries/jupyter-kernels-how-to-add-change-remove).
    - [Run Jupyter Notebook script from terminal](https://deeplearning.lipingyang.org/2018/03/28/run-jupyter-notebook-script-from-terminal/).
    
### Requirements

- Install the dependencies for DeepJ.
```
pip install --ignore-installed -r requirements.txt
```
- Install the `python-midi` module. The original [python-midi](https://github.com/vishnubob/python-midi) is no longer maintained, so we need an alternative from one of the following repos:
    - ✅ Candidate 1: https://github.com/louisabraham/python3-midi
    - ❓ Candidate 2: https://github.com/sniperwrb/python-midi
    - ❓ Candidate 3: https://github.com/jameswenzel/mydy
- Download the dataset to `data` folder. The Midi files come from [Piano-midi](http://www.piano-midi.de/).
- Details about the project directory:

In [1]:
import os

print('Project Directory')
os.chdir('/home/choi/DLFinalProject/')
!tree -L 1

Project Directory
.
├── LICENSE
├── README.md
├── __pycache__
├── archives
├── constants.py
├── data
├── dataset.py
├── distribution.py
├── download.py
├── generate.py
├── images
├── main.ipynb
├── midi_util.py
├── model.py
├── nohup.out
├── out
├── requirements.txt
├── scripts
├── test.py
├── train.py
├── util.py
└── visualize.py

6 directories, 16 files


Before getting started, import all the dependencies and modules required for this work.

In [2]:
import tensorflow as tf
import numpy as np
from keras.callbacks import ModelCheckpoint
from keras.callbacks import EarlyStopping, TensorBoard

from constants import *             # store constant parameters for the model
from dataset import *               # load dataset and parse to formatted inputs
from generate import *              # generate music
from model import *                 # model architectures
from midi_util import midi_encode   # util funcs for midi
from util import *                  # util funcs

print("TensorFlow version:", tf.__version__)

TensorFlow version: 2.6.2


## Objective 1: Rebuild the DeepJ Model

### Dataset

We adopted the same dataset that the authors used to train DeepJ. [Piano-midi](http://www.piano-midi.de/midicoll.htm) contains classical piano solo pieces, and each composer's pieces were recorded with a Midi sequencer. As of Feb. 2020, the dataset holds 571 pieces by 26 composers, with a total duration of 36.7 hours of MIDI files. Alternative datasets such as [mfiles](https://www.mfiles.co.uk/) are also available.

Given the limited computing power of our Google Cloud server, we eliminated one genre, "baroque", and reduced the number of composers to 6. Details can be found in `constants.py`, and all Midi data is saved in the `data` folder. Before focusing on the model, we first need to download the dataset and parse it into formatted inputs. All utility functions for processing the Midi files are in `dataset.py`. Several functions deserve attention:

- `load_all(styles, batch_size, time_steps)`: Load all Midi files and parse them into four inputs, `note_data`, `note_target`, `beat_data`, and `style_data`, plus one label, `note_target`.
- `clamp_midi(sequence)`: Clamp the Midi data to the range between the `MIN` and `MAX` notes. In the paper, the authors truncate the standard pitch range to 36 through 84 to reduce the input dimension.
- `stagger(data, time_steps)`: Chop the sequence data into windows of `time_steps`. This function returns two variables: `dataX`, the sequence of data at the current time step, and `dataY`, the sequence of data at the next time step, which is the prediction target.
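
The clamping and staggering steps can be illustrated with a small NumPy sketch. It is a simplified illustration, not the code in `dataset.py`: the names `clamp_sketch` and `stagger_sketch`, the dummy piano roll, and the non-overlapping window stride are assumptions made for this example.

```python
import numpy as np

MIN_NOTE, MAX_NOTE = 36, 84  # pitch range quoted from the paper

def clamp_sketch(sequence):
    """Keep only the pitches in [MIN_NOTE, MAX_NOTE) of a (time, pitch, feature) roll."""
    return sequence[:, MIN_NOTE:MAX_NOTE, :]

def stagger_sketch(data, time_steps):
    """Cut data into windows; each target window is its input shifted by one step."""
    data_x, data_y = [], []
    # Assumed stride: non-overlapping windows of length time_steps
    for i in range(0, len(data) - time_steps, time_steps):
        data_x.append(data[i:i + time_steps])
        data_y.append(data[i + 1:i + 1 + time_steps])
    return np.array(data_x), np.array(data_y)

# Dummy roll: 300 time steps, 128 Midi pitches, 3 features per note
roll = np.random.rand(300, 128, 3)
clamped = clamp_sketch(roll)
x, y = stagger_sketch(clamped, time_steps=128)
print(clamped.shape, x.shape, y.shape)
```

Each target window overlaps its input window except for a single step, which is what lets the model learn next-step prediction.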

Now, load all Midi files in the `data/` folder and see what each input looks like.

In [3]:
# load data
train_data, train_labels = load_all(styles, BATCH_SIZE, SEQ_LEN) # actually we can safely remove BATCH_SIZE

In [4]:
print("note_data:", train_data[0].shape)
print("note_target:", train_data[1].shape) # aka train_labels
print("beat_data:", train_data[2].shape)
print("style_data:", train_data[3].shape)

note_data: (2306, 128, 48, 3)
note_target: (2306, 128, 48, 3)
beat_data: (2306, 128, 16)
style_data: (2306, 128, 6)


Note that all constant parameters for the model are stored in the `constants.py` file. The following table gives the meaning of each variable. (Here we use Java syntax to denote the constant types.)

| Variable          | Value/Type        | Representation            |
| :---------------- | :---------------: | :------------------------ |
| genre             | List<String>      | Genre of music            |
| styles            | List<List<String>>| Directory of dataset      |
| NUM_STYLES        | styles.size()     | Number of styles          |
| DEFAULT_RES       | 96                | Resolution                |
| MIDI_MAX_NOTES    | 128               | Notes range [0, 127]      |
| MAX_VELOCITY      | 127               | Velocity range [0, 127]   |
| NUM_OCTAVES       | 4                 | Number of octaves         |
| OCTAVE            | 12                | Notes in every octave     |
| MIN_NOTE          | 36                | Minimum note              |
| MAX_NOTE          | MIN_NOTE + NUM_OCTAVES * OCTAVE | Maximum note |
| NUM_NOTES         | MAX_NOTE - MIN_NOTE | Number of notes between MIN_NOTE and MAX_NOTE |
| BEATS_PER_BAR     | 4                 | Number of beats in a bar  |
| NOTES_PER_BEAT    | 4                 | Notes per quarter note    |
| NOTES_PER_BAR     | NOTES_PER_BEAT * BEATS_PER_BAR | Time steps per bar; the shortest note is a sixteenth note |
| BATCH_SIZE        | 16                | Training batch size       |
| SEQ_LEN           | 8 * NOTES_PER_BAR | Data sequence length      |
| OCTAVE_UNITS      | 64                | Dim of hyperparameter in octave convolution layer |
| STYLE_UNITS       | 64                | Dim of hyperparameter in style embedding          |
| NOTE_UNITS        | 3                 | Three outputs: play prob, replay prob and dynamics|
| TIME_AXIS_UNITS   | 256               | Dim of hyperparameter used for LSTMs in Time-Axis |
| NOTE_AXIS_UNITS   | 128               | Dim of hyperparameter used for LSTMs in Note-Axis |
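
As a quick check, the derived constants can be evaluated from the base values listed in the table. The snippet below only repeats the table's formulas. It does not import `constants.py`.

```python
# Base constants, copied from the table
NUM_OCTAVES = 4
OCTAVE = 12
MIN_NOTE = 36
BEATS_PER_BAR = 4
NOTES_PER_BEAT = 4

# Derived constants, using the formulas in the table
MAX_NOTE = MIN_NOTE + NUM_OCTAVES * OCTAVE
NUM_NOTES = MAX_NOTE - MIN_NOTE
NOTES_PER_BAR = NOTES_PER_BEAT * BEATS_PER_BAR
SEQ_LEN = 8 * NOTES_PER_BAR

print(MAX_NOTE, NUM_NOTES, NOTES_PER_BAR, SEQ_LEN)  # 84 48 16 128
```

These values agree with the input shapes printed earlier: 128 time steps, 48 notes, and a beat vector of length 16.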


### Model architecture

The DeepJ architecture is the following.

![DeepJ](./images/DeepJ-architecture.png)

The corresponding code is in `model.py`.

In [5]:
def build_models(time_steps=SEQ_LEN, input_dropout=0.2, dropout=0.5):
    """
    Build the LSTM model
    """
    notes_in = Input((time_steps, NUM_NOTES, NOTE_UNITS))   # Note input
    beat_in = Input((time_steps, NOTES_PER_BAR))            # Context
    style_in = Input((time_steps, NUM_STYLES))              # Style
    # Target input for conditioning, feed-forward
    chosen_in = Input((time_steps, NUM_NOTES, NOTE_UNITS))  # Chosen notes

    # Dropout inputs
    notes = Dropout(input_dropout)(notes_in)
    beat = Dropout(input_dropout)(beat_in)
    chosen = Dropout(input_dropout)(chosen_in)

    # Distributed representations
    style_l = Dense(STYLE_UNITS, name='style')
    style = style_l(style_in)

    """ Time axis """
    time_out = time_axis(dropout)(notes, beat, style)

    """ Note Axis """
    naxis = note_axis(dropout)              # 1D Convolution

    """ Prediction Layer """
    notes_out = naxis(time_out, chosen, style)

    """ Build Model """
    model = Model([notes_in, chosen_in, beat_in, style_in], [notes_out])
    model.compile(optimizer='nadam',        # Nesterov Adam optimizer
                  loss=[primary_loss])      # Loss function

    """ Generation Models """
    time_model = Model([notes_in, beat_in, style_in], [time_out])

    note_features = Input((1, NUM_NOTES, TIME_AXIS_UNITS), name='note_features')
    chosen_gen_in = Input((1, NUM_NOTES, NOTE_UNITS), name='chosen_gen_in')
    style_gen_in = Input((1, NUM_STYLES), name='style_in')

    # Dropout inputs
    chosen_gen = Dropout(input_dropout)(chosen_gen_in)
    style_gen = style_l(style_gen_in)

    note_gen_out = naxis(note_features, chosen_gen, style_gen)
    note_model = Model([note_features, chosen_gen_in, style_gen_in], note_gen_out)

    return model, time_model, note_model

In [6]:
models = build_models()
models[0].summary() # LSTM model: params 1,268,388

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            [(None, 128, 48, 3)] 0                                            
__________________________________________________________________________________________________
input_3 (InputLayer)            [(None, 128, 6)]     0                                            
__________________________________________________________________________________________________
dropout (Dropout)               (None, 128, 48, 3)   0           input_1[0][0]                    
__________________________________________________________________________________________________
style (Dense)                   multiple             448         input_3[0][0]                    
______________________________________________________________________________________________

In [7]:
models[1].summary() # Time axis: params 912,606
models[2].summary() # Note axis: params 356,230

Model: "model_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            [(None, 128, 48, 3)] 0                                            
__________________________________________________________________________________________________
input_3 (InputLayer)            [(None, 128, 6)]     0                                            
__________________________________________________________________________________________________
dropout (Dropout)               (None, 128, 48, 3)   0           input_1[0][0]                    
__________________________________________________________________________________________________
style (Dense)                   multiple             448         input_3[0][0]                    
____________________________________________________________________________________________

### Training

Training was performed using stochastic gradient descent with the Nesterov Adam optimizer. The loss function is as follows:

$$
\begin{aligned}
    L_{play} &= \sum \left[ t_{play} \log y_{play} + (1 - t_{play}) \log(1 - y_{play}) \right] \\
    L_{rplay} &= \sum t_{play} \left[ t_{rplay} \log y_{rplay} + (1 - t_{rplay}) \log(1 - y_{rplay}) \right] \\
    L_{dynamics} &= \sum t_{play} (t_{dynamics} - y_{dynamics})^2 \\
    L_{primary} &= L_{play} + L_{rplay} + L_{dynamics}
\end{aligned}
$$

Play and replay are treated as logistic regression problems trained with binary cross-entropy, as defined in the Biaxial LSTM. Dynamics (velocity) is trained with mean squared error. The training code is in `train.py`.
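
To make the three terms concrete, here is a minimal NumPy sketch of the loss. It is not the `primary_loss` function used by the model. The name `primary_loss_sketch` and the clipping constant `eps` are assumptions, and the two log-likelihood terms are negated here so that a lower value means a better fit.

```python
import numpy as np

def primary_loss_sketch(t, y, eps=1e-7):
    """t, y: arrays of shape (..., 3) holding play, replay and dynamics."""
    t_play, t_replay, t_dyn = t[..., 0], t[..., 1], t[..., 2]
    # Clip probabilities to avoid log(0)
    y_play = np.clip(y[..., 0], eps, 1 - eps)
    y_replay = np.clip(y[..., 1], eps, 1 - eps)
    y_dyn = y[..., 2]

    # Binary cross-entropy on the play probability
    l_play = -np.sum(t_play * np.log(y_play) + (1 - t_play) * np.log(1 - y_play))
    # Replay cross-entropy, counted only where the target note is played
    l_replay = -np.sum(t_play * (t_replay * np.log(y_replay)
                                 + (1 - t_replay) * np.log(1 - y_replay)))
    # Squared error on dynamics, counted only where the target note is played
    l_dyn = np.sum(t_play * (t_dyn - y_dyn) ** 2)
    return l_play + l_replay + l_dyn

target = np.array([[1.0, 0.0, 0.5]])   # played, not replayed, velocity 0.5
output = np.array([[0.5, 0.5, 0.5]])
print(primary_loss_sketch(target, output))
```

Because of the `t_play` factor, the replay and dynamics terms contribute nothing for notes that are silent in the target.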

Note that we decreased the `patience` parameter in `EarlyStopping` to 3 and reduced the number of `epochs` to 100. In our preliminary experiments, one epoch typically took about 800 s on our server, and roughly 120 epochs were needed to reach a reasonably good model.

In [None]:
def train(models):
    cbs = [
        ModelCheckpoint(MODEL_FILE, monitor='loss', save_best_only=True, save_weights_only=True),
        EarlyStopping(monitor='loss', patience=3),
        TensorBoard(log_dir='out/logs', histogram_freq=1)
    ]

    print('Training')
    models[0].fit(train_data, train_labels, epochs=200, callbacks=cbs, batch_size=BATCH_SIZE)

train(models)
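
The effect of the `patience` setting can be illustrated with a plain-Python sketch. It is a simplified model of early stopping, not Keras' implementation. The function name `stopping_epoch` and the loss values are made up for this example.

```python
def stopping_epoch(loss_history, patience=3):
    """Return the epoch index at which training would stop, or None."""
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(loss_history):
        if loss < best:
            # Improvement: remember the best loss and reset the counter
            best = loss
            wait = 0
        else:
            # No improvement: stop once `patience` such epochs have passed
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Hypothetical loss curve that stalls after the third epoch
loss_history = [1.0, 0.8, 0.7, 0.71, 0.72, 0.70, 0.69]
print(stopping_epoch(loss_history, patience=3))
```

A small `patience` saves compute when each epoch is expensive, at the risk of stopping before a late improvement.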

### Generation

Once the model is trained, we can generate music. The authors perform generation by sampling from the model's probability distribution, using a coin flip to determine whether to play each note. After deciding to play a note, they sample from the replay probability to determine whether the note should be re-attacked. The dynamics level is taken directly from the model output when the note is played.
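
The sampling rule described above can be sketched for a single note as follows. This is an illustration under our own naming (`sample_note` is hypothetical), not the code in `generate.py`.

```python
import numpy as np

def sample_note(play_prob, replay_prob, dynamics, rng):
    """Return (play, replay, volume) for one pitch at one time step."""
    # Coin flip: play the note with probability play_prob
    if not rng.random() < play_prob:
        return 0, 0, 0.0
    # Second coin flip: re-attack the note with probability replay_prob
    replay = int(rng.random() < replay_prob)
    # Dynamics are taken directly from the model output
    return 1, replay, float(dynamics)

rng = np.random.default_rng(0)
print(sample_note(0.9, 0.3, 0.8, rng))
```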

In our work, we use a different method to generate music: we feed the model a segment of a Midi file cut from the training dataset and observe how the model behaves. In addition, the DeepJ authors use an adaptive temperature adjustment to avoid long periods of silence, a clever trick that we also adopt. All utility functions related to music generation are in `generate.py`.
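
One way to picture the temperature adjustment is as a rescaling of the logit of each probability, so that a higher temperature pushes probabilities toward 0.5 and makes notes more likely to sound after a silence. The sketch below shows that idea only. The function names, the silence limit of 16 steps, and the increment of 0.1 are assumptions for illustration, not values confirmed from `generate.py`.

```python
import numpy as np

def apply_temperature(prob, temperature):
    """Rescale a probability in logit space by 1 / temperature."""
    logit = np.log(prob) - np.log1p(-prob)      # inverse sigmoid
    return 1.0 / (1.0 + np.exp(-logit / temperature))

def next_temperature(temperature, silent_steps, default=1.0, limit=16, step=0.1):
    """Raise the temperature during long silences, otherwise reset it (assumed rule)."""
    if silent_steps == 0:
        return default
    if silent_steps >= limit:
        return temperature + step
    return temperature

print(apply_temperature(0.1, 1.0), apply_temperature(0.1, 2.0))
```

With a temperature of 2, a play probability of 0.1 rises to 0.25, so long silent stretches become progressively less likely to continue.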

In [None]:
def generateModified(models, num_bars, styles, start_notes):
    print('Generating with styles:', styles)

    _, time_model, note_model = models
    generations = [MusicGeneration(style) for style in styles]

    for t in tqdm(range(NOTES_PER_BAR * num_bars)):
        # Produce note-invariant features
        ins = process_inputs([g.build_time_inputs() for g in generations])
        
        # Use start notes
        ins[0][0] = start_notes[t]
        ins[0][1] = start_notes[t]

        # Pick only the last time step
        note_features = time_model.predict(ins)
        note_features = np.array(note_features)[:, -1:, :]

        # Generate each note conditioned on previous
        for n in range(NUM_NOTES):
            ins = process_inputs([g.build_note_inputs(note_features[i, :, :, :]) for i, g in enumerate(generations)])
            predictions = np.array(note_model.predict(ins))

            for i, g in enumerate(generations):
                # Remove the temporal dimension
                g.choose(predictions[i][-1], n)

        # Move one time step
        yield [g.end_time(t) for g in generations]

# generate music
print('Load model from file.')
models[0].load_weights(MODEL_FILE)

stylesGene = [compute_genre(i) for i in range(len(genre))]

write_file('output', generateModified(models, 32, stylesGene, train_data[0]))

## Objective 2: Music Style Classification

As we can see above, the DeepJ model actually generates music by genre rather than by a specific composer. It mixes the composition styles of all composers in the same genre into one one-hot encoding. From the dataset perspective, there seems to be little benefit in training the model on data classified by composer.
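
The following sketch shows one way such a genre vector could be built. It assumes that the weight is spread equally over the composer slots of the chosen genre. The style layout `styles_example` and the function `genre_vector` are hypothetical and are not taken from the project code.

```python
import numpy as np

# Hypothetical style layout: two genres with three composers each
styles_example = [
    ['genre_a/composer_1', 'genre_a/composer_2', 'genre_a/composer_3'],
    ['genre_b/composer_4', 'genre_b/composer_5', 'genre_b/composer_6'],
]
num_styles = sum(len(s) for s in styles_example)

def genre_vector(genre_id):
    """Spread equal weight over every composer slot of one genre (assumed rule)."""
    vec = np.zeros(num_styles)
    start = sum(len(s) for s in styles_example[:genre_id])
    count = len(styles_example[genre_id])
    vec[start:start + count] = 1.0 / count
    return vec

print(genre_vector(0))
print(genre_vector(1))
```

Under this assumption, every composer within a genre receives the same weight, so the individual composer labels no longer influence generation.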

Besides, music genres such as Baroque, Classicism and Romanticism correspond to different historical periods. It is hard for a person with only basic music knowledge to tell which period a given piece belongs to, but a deep learning model can probably distinguish the patterns hidden behind the notes and beats.

### Next steps

Therefore, there are two major tasks for our next step:

- Train our model on a different dataset, one classified by highly varied genres
- Pre-train a music style classification model and replace the one-hot encoding `style_in`

For our last objective, the authors recommend using another representation to speed up the training process. At the current stage of our work, we should focus primarily on the first two objectives.

### Some possible solutions to build music style classification
- Identify wave images using CNN
    - https://www.analyticsvidhya.com/blog/2021/06/music-genres-classification-using-deep-learning-techniques/
    - http://cs229.stanford.edu/proj2018/report/21.pdf
- Midi
    - https://github.com/sandershihacker/midi-classification-tutorial/blob/master/midi_classifier.ipynb
    - ByteDance: https://arxiv.org/abs/2010.14805 [Dataset](https://arxiv.org/abs/2010.07061) [Repo](https://github.com/bytedance/GiantMIDI-Piano)
