$\newcommand{\xv}{\mathbf{x}}
\newcommand{\Xv}{\mathbf{X}}
\newcommand{\yv}{\mathbf{y}}
\newcommand{\zv}{\mathbf{z}}
\newcommand{\av}{\mathbf{a}}
\newcommand{\Wv}{\mathbf{W}}
\newcommand{\wv}{\mathbf{w}}
\newcommand{\tv}{\mathbf{t}}
\newcommand{\Tv}{\mathbf{T}}
\newcommand{\muv}{\boldsymbol{\mu}}
\newcommand{\sigmav}{\boldsymbol{\sigma}}
\newcommand{\phiv}{\boldsymbol{\phi}}
\newcommand{\Phiv}{\boldsymbol{\Phi}}
\newcommand{\Sigmav}{\boldsymbol{\Sigma}}
\newcommand{\Lambdav}{\boldsymbol{\Lambda}}
\newcommand{\half}{\frac{1}{2}}
\newcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}}
\newcommand{\argmin}[1]{\underset{#1}{\operatorname{argmin}}}$

# MIDI Music Generated with Recurrent Generative Adversarial Networks

Sean Russell

Email: slrussel@rams.colostate.edu

## Overview

Creativity has long been considered a uniquely human trait. However, as any good computer scientist would, I wanted to see how computers fare at a task often considered to be in the realm of human creativity: music. In this project I examined [this paper](https://arxiv.org/pdf/1611.09904.pdf) by Olof Mogren, a PhD student at Chalmers University of Technology, which discusses a method for using recurrent neural networks to generate music. My goal was to get the model to generate music, and to see what sort of modifications I might be able to make to change the quality of that music.

I have found that the model generates "music" in a very loose sense of the term. I had to make some sacrifices due to time constraints and hardware limitations, but the model does successfully create a MIDI file that sounds approximately "musical". More importantly, I find it very encouraging that even random musical-ish sounds can be created with such relative ease. Read on for more.

## Background: MIDI format, recurrent neural nets, generative adversarial training

There are several important concepts that this project involves that would benefit from a small introduction.

Firstly, the MIDI file format. MIDI was originally designed to give computers an easy interface with electronic instruments, but it has expanded into a useful format for creating songs on its own. At its simplest, a MIDI file consists of a sequence of NOTE ON and NOTE OFF events for each note (e.g. B flat, C sharp). Played in order, these events let a MIDI file describe complex music just by saying when to turn notes on and off.

One of the advantages of MIDI over other audio formats (like MP3) is the ease with which the music can be edited. MP3 files store compressed waveform data, so editing them can involve tricky signal-processing techniques. For example, if you wanted to change the key of a song in a MIDI file, you would iterate through every note on and note off event and simply change the value of the note being manipulated. Doing the same to an MP3 file, on the other hand, would involve shifting frequencies by the correct amount to create the right pitch, and then some normalization step so that the audio doesn't get distorted.
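To make the key-change example concrete, here is a minimal sketch in plain Python. The `NoteEvent` structure and the `transpose` function are hypothetical names invented for illustration; a real MIDI file would need a parsing library, which is not shown here.

```python
from typing import List, NamedTuple


class NoteEvent(NamedTuple):
    """A simplified MIDI event (hypothetical structure, not a file parser)."""
    time: int   # when the event happens, in ticks
    kind: str   # "note_on" or "note_off"
    pitch: int  # MIDI note number, 0-127 (60 is middle C)


def transpose(events: List[NoteEvent], semitones: int) -> List[NoteEvent]:
    """Return a copy of the events with every pitch shifted by `semitones`."""
    shifted = []
    for event in events:
        new_pitch = event.pitch + semitones
        if not 0 <= new_pitch <= 127:
            raise ValueError(f"pitch {new_pitch} is outside the MIDI range 0-127")
        shifted.append(event._replace(pitch=new_pitch))
    return shifted


# A C major triad (C, E, G) held for 480 ticks.
c_major = [
    NoteEvent(0, "note_on", 60),
    NoteEvent(0, "note_on", 64),
    NoteEvent(0, "note_on", 67),
    NoteEvent(480, "note_off", 60),
    NoteEvent(480, "note_off", 64),
    NoteEvent(480, "note_off", 67),
]

# Moving up two semitones turns it into a D major triad (D, F sharp, A).
d_major = transpose(c_major, 2)
print([event.pitch for event in d_major])
```

The timing and the on/off structure are untouched; only the note numbers change, which is why this kind of edit is so much simpler than the waveform equivalent.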

It's because of all this that MIDI is an excellent format to control programmatically, and for this reason it is the format Mogren selected for generating music with neural nets. (Quick side note on the limitations of MIDI: it sounds less authentic than recorded audio, and the sound actually produced from a MIDI file can vary across devices.)

The next concept to understand is the recurrent neural network. Recurrent nets are very similar to traditional neural nets, but they are able to keep track of state. They do this by accepting input not only from the regular input vectors, but also from their own hidden state at the previous timestep. This makes them much better at dealing with time series than traditional neural nets. In addition, they are more flexible because they can handle variable-length inputs. Some applications where recurrent neural networks are very powerful include natural language processing and stock market analysis.
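The idea of feeding the previous state back in can be sketched in a few lines of plain Python. This is a bare-bones tanh recurrent cell with made-up weights, meant only to illustrate the recurrence; it is not the architecture used in the paper, and the names `rnn_step` and `run_rnn` are hypothetical.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def rnn_step(x: Vector, h_prev: Vector,
             w_in: Matrix, w_rec: Matrix, bias: Vector) -> Vector:
    """One timestep: combine the current input with the previous hidden state."""
    from_input = matvec(w_in, x)
    from_state = matvec(w_rec, h_prev)
    return [math.tanh(a + b + c)
            for a, b, c in zip(from_input, from_state, bias)]


def run_rnn(sequence: List[Vector],
            w_in: Matrix, w_rec: Matrix, bias: Vector) -> Vector:
    """Process a sequence of any length and return the final hidden state."""
    h = [0.0] * len(bias)  # the state starts out empty
    for x in sequence:
        h = rnn_step(x, h, w_in, w_rec, bias)
    return h


# Illustrative weights: 1 input feature, 2 hidden units.
w_in = [[0.5], [-0.3]]
w_rec = [[0.1, 0.2], [0.0, 0.4]]
bias = [0.0, 0.1]

# Both sequences end with the same input, but their histories differ,
# so the final states differ too.
print(run_rnn([[1.0], [0.5]], w_in, w_rec, bias))
print(run_rnn([[0.0], [0.5]], w_in, w_rec, bias))
```

Because the same `rnn_step` is applied at every timestep, the loop works for a sequence of two inputs or two thousand without any change to the weights.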

The final important concept is the generative adversarial training method. As the name implies, this is an architectural setup that allows machine learning models to generate realistic data. In the realm of song, realistic data means music that sounds like it was created by a human.

The adversarial part of generative adversarial training describes how it makes authentic-looking data. In essence, there are two machine learning models that play a game with each other. One of the models takes in random inputs and attempts to turn them into authentic-looking data. This one is called the generator. Its opponent takes in data and attempts to determine whether that data is real or generated. This one is called the discriminator.

The two models go back and forth until the discriminator cannot tell the difference between authentic data and generated data. The idea is that the generator will become good at fooling not only the discriminator into thinking its output is real, but also anyone else who might come across it. For instance, people. It is in this fashion that generative adversarial training is able to create data that appears authentic to a human observer.
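The game described above is usually expressed as a pair of loss functions. The sketch below uses the standard GAN losses, computed from the discriminator's outputs (probabilities that a sample is real). It is an illustration under that assumption rather than the exact training code from the paper's repository, and the function names are hypothetical.

```python
import math
from typing import Sequence

EPS = 1e-12  # guards against log(0)


def discriminator_loss(d_real: Sequence[float], d_fake: Sequence[float]) -> float:
    """Mean of -log D(x) over real data plus mean of -log(1 - D(G(z))).

    d_real: discriminator outputs for real samples
    d_fake: discriminator outputs for generated samples
    """
    real_term = sum(-math.log(max(p, EPS)) for p in d_real) / len(d_real)
    fake_term = sum(-math.log(max(1.0 - p, EPS)) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def generator_loss(d_fake: Sequence[float]) -> float:
    """Mean of log(1 - D(G(z))); it falls as the discriminator is fooled."""
    return sum(math.log(max(1.0 - p, EPS)) for p in d_fake) / len(d_fake)


# A discriminator that is confident and correct has a low loss...
print(discriminator_loss(d_real=[0.9, 0.9], d_fake=[0.1, 0.1]))
# ...while one that cannot tell the difference outputs 0.5 everywhere.
print(discriminator_loss(d_real=[0.5, 0.5], d_fake=[0.5, 0.5]))
# The generator's loss is lower when its samples are rated as real.
print(generator_loss([0.9]), generator_loss([0.1]))
```

Training alternates between lowering the discriminator's loss and lowering the generator's loss, so each model's improvement makes the other's job harder.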

## Discussion of the paper and the code

Originally, I wanted to implement recurrent nets with generative adversarial training on my own. However, I quickly came upon a paper titled ['C-RNN-GAN: Continuous recurrent neural networks with adversarial training'](https://arxiv.org/pdf/1611.09904.pdf) by PhD student Olof Mogren. Instead of duplicating their work, I wanted to see how well their approach worked.

In short, they applied a generative adversarial approach to training recurrent neural networks to imitate classical music. They then evaluated the resulting generative model using several musical metrics. The objective of the paper was mostly exploratory: to produce a proof of concept demonstrating that recurrent neural networks function well when combined with the adversarial approach to generation. To do this they chose to apply their network to the problem of music generation. I believe this was in part because of the relative ease with which music can be judged by humans.

The code used to build the model is publicly available on GitHub. Its basic structure follows the adversarial training and recurrent neural network ideas discussed above. It downloads a collection of MIDI files representing a wide array of classical music from a number of composers, then uses this data to train the model.

## What I did, How Well It Worked, and Challenges

My main goal was to generate some sort of audio to hear with my own ears how well this setup works to generate music, and so that I could dabble with settings and such myself. So, here are a few samples:
- From early iterations in training
- From later iterations in training
- Audio created by the authors of the paper for comparison

Convincing the code to work was a rather significant undertaking. The implementation is written in Python 2.7, with an old version of TensorFlow (I could not determine exactly which version). The main challenges were time and computer resources.

Neural networks inherently take a lot of time to train. That's just how it is right now. This limited the amount of testing I could perform quickly.

However, one obstacle I had not anticipated was memory. The model required a very large amount of memory to train, and even after I took fairly drastic measures to limit memory usage, it would run out of memory after a certain number of iterations and be killed. This is an ongoing issue for which I have not been able to find a satisfactory fix.

However, I was able to get the model to work for smaller numbers of iterations. While I wouldn't exactly call the MIDI files it generated music, they do sound vaguely musical. I would almost describe the result as experimental, as though someone who didn't really know much about piano was playing around on one. In addition, it seems apparent that the model was learning something. If you compare MIDIs generated from the first few iterations of training with those from further in, there is a world of difference. I think that means there is more potential in this model than I was able to unlock. If I were to manage to circumvent the memory barrier, I think the model would be able to generate much more convincing music.

I noticed one odd thing for which I do not have an explanation. The audio generated by my model tends to involve more chords and more notes played at the same time than the audio supplied by the author of the paper. I am not entirely sure what could have caused this, or whether it is significant at all rather than just an artifact of probability.
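One way to put a number on this observation would be a rough polyphony score: the fraction of note onsets at which two or more notes begin together. The sketch below is a hypothetical simplification for illustration. It assumes the notes have already been extracted from the MIDI files as `(start_time, pitch)` pairs, and it is not necessarily the same definition as any metric used in the paper.

```python
from collections import Counter
from typing import List, Tuple

Note = Tuple[int, int]  # (start time in ticks, MIDI pitch); assumed format


def polyphony(notes: List[Note]) -> float:
    """Fraction of distinct start times at which two or more notes begin."""
    if not notes:
        return 0.0
    starts = Counter(start for start, _pitch in notes)
    chord_onsets = sum(1 for count in starts.values() if count >= 2)
    return chord_onsets / len(starts)


# A single-line melody has no simultaneous onsets.
melody = [(0, 60), (480, 62), (960, 64)]
# Here a three-note chord is followed by one single note.
with_chord = [(0, 60), (0, 64), (0, 67), (480, 62)]

print(polyphony(melody))
print(polyphony(with_chord))
```

Computing this score for a batch of generated files and for the author's samples would show whether the difference holds up across many files or only in a few.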

## What I learned:

It seems that recurrent generative adversarial neural networks at least have potential. While I can't say for sure that they are better than any other approach for generating audio, they certainly are worth more research. I believe that with more time and computational resources more could be accomplished.

On a personal level, I think I've learned about one of the important limitations of machine learning. Computational resources make all the difference, and a more powerful computer makes it possible to train more accurate models. When I think of the resource limitations of machine learning, I normally think in terms of time. However, machine learning models can be very memory intensive too, and that is something I will have to keep in mind going forward.

## Notes for running the program:

These notes are for myself, and for you if you are inclined to attempt to run this program. I have supplied a modified version of the original source code that accompanies the paper (https://arxiv.org/pdf/1611.09904.pdf).

To run, execute run/job1.sh. There are other scripts in the run directory, but I have not modified them to work yet. job1.sh will train the network and generate example MIDI files all in one go. If necessary, it will also download the training data to the computer. It does this in the home directory, so make sure to delete this data if free space is needed. The MIDI files will be saved to a timestamped directory in train/*timestamp*/generated_data. The program works in training epochs, and each epoch generates one sample MIDI file.
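Since each run writes to its own timestamped directory, a small helper can save some hunting around for the newest output. This is a convenience sketch and not part of the supplied code: the function name is hypothetical, and it assumes only the train/*timestamp*/generated_data layout described above, picking the most recently modified run directory.

```python
import os
from typing import List


def latest_generated_files(train_dir: str = "train") -> List[str]:
    """List the files in generated_data for the most recently modified run."""
    runs = [
        os.path.join(train_dir, name)
        for name in os.listdir(train_dir)
        if os.path.isdir(os.path.join(train_dir, name, "generated_data"))
    ]
    if not runs:
        return []
    # Use modification time so no particular timestamp format is assumed.
    newest = max(runs, key=os.path.getmtime)
    out_dir = os.path.join(newest, "generated_data")
    return sorted(os.path.join(out_dir, name) for name in os.listdir(out_dir))


if __name__ == "__main__" and os.path.isdir("train"):
    for path in latest_generated_files("train"):
        print(path)
```

Run it from the directory that contains `train`, or pass the path to that directory explicitly.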

It took approximately 3 hours to run 500 epochs on my computer using the GPU-optimized version of TensorFlow. My computer has a GTX 970 graphics card and 8 GB of RAM. If speed is an issue, reducing the songlength, max_epoch, and works_per_composer flags can shorten the runtime (potentially at the expense of the model's accuracy).

Several issues could arise while running the program:
- Memory issues: Linux will kill your process if it uses up too much memory. This program can be extremely memory intensive. If there are issues involving memory, the flags works_per_composer and songlength can be reduced to use less memory. These flags are found near the top of rnn_gan.py.
- Download issues: If the program fails while trying to download the MIDI files used for training, this is often caused by bad download links. Near the top of music_data_utils.py are a bunch of lines of the form `sources['classical']['`*composer*`'] = ['`*web address*`']`. When it fails, the last line printed before the error messages contains the dead address. Find the corresponding line in music_data_utils.py and comment it out.
- Random crashing: Sometimes the model crashes while training due to some specific set of circumstances that results in a bad multiplication or something similar. I haven't determined the source of this bug, and it is fairly infrequent. Rerunning the model, even with the same parameters, is pretty likely to just work.