### Objective
This project attempts to generate new jazz music with deep learning, specifically a Recurrent Neural Network (RNN), which learns from a sequence of past patterns in time to inform its next prediction.

### Data
We will use 5 jazz song covers by Doug McKenzie, downloaded from [his website](https://bushgrafts.com/midi/). These are predominantly piano covers, some of which have light accompaniment from other instruments such as bass guitar.

They are stored and ingested in MIDI (Musical Instrument Digital Interface) format, which stores instructions for playing music rather than recorded audio: the note (a single sound), its pitch (the frequency of the sound), and more.
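
To make the idea of stored "instructions" concrete: MIDI identifies each pitch by an integer from 0 to 127. The sketch below converts such a number to a note name, assuming the common convention in which note 60 is middle C, written C4. The helper `midi_to_name` is illustrative and not part of any library.

```python
# Pitch-class names using sharps only; B-flat therefore appears as A#.
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def midi_to_name(midi_number: int) -> str:
    """Convert a MIDI note number (0-127) to a name such as 'C4'.

    Assumes the convention in which MIDI note 60 is middle C, written 'C4'.
    """
    if not 0 <= midi_number <= 127:
        raise ValueError("MIDI note numbers range from 0 to 127")
    octave = midi_number // 12 - 1
    return f"{PITCH_CLASSES[midi_number % 12]}{octave}"


print(midi_to_name(60))  # C4
print(midi_to_name(70))  # A#4, the same pitch as B-flat in octave 4
```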

Where NLP (Natural Language Processing) models ingest words as training inputs, a music model ingests notes.

Hence our first step is to extract the notes and chords (several notes played at the same time) from the MIDI files. We will use the library [Music21 from MIT](http://web.mit.edu/music21/) to do this.

Before we continue, I would like to credit the works by [Shubham Gupta](https://www.hackerearth.com/blog/developers/jazz-music-using-deep-learning/) and [Sigurður Skúli](https://towardsdatascience.com/how-to-generate-music-using-a-lstm-neural-network-in-keras-68786834d4c5), which I've used as references.

### Setup

In [1]:
# Importing all the libraries used
import glob
import pickle
import numpy as np
import pandas as pd
from music21 import converter, instrument, note, stream, chord
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.layers import Activation
from keras.layers import BatchNormalization as BatchNorm
from keras.utils import np_utils
from keras.callbacks import ModelCheckpoint

In [2]:
# Linking notebook to the google drive folder where the files are stored - This step is needed because the project is created in Google Colab
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Extract Data

First, we have to extract the notes and chords from our MIDI files.



In [3]:
# Parse all notes from the midi files in the songs folder
def extract_notes():
    notes = []

    for file in glob.glob("/content/drive/MyDrive/Colab Notebooks/songs/*.mid"):
        midi = converter.parse(file)
        print("Parsing %s" % file)
        notes_to_parse = None

        try:  # partition the score by instrument and take the first part
            instr = instrument.partitionByInstrument(midi)
            notes_to_parse = instr.parts[0].recurse()
        except Exception:  # no instrument parts; the file has notes in a flat structure
            notes_to_parse = midi.flat.notes

        for element in notes_to_parse:
            if isinstance(element, note.Note):
                # if the element is a note, extract the pitch
                notes.append(str(element.pitch))
            elif isinstance(element, chord.Chord):
                # if the element is a chord, extract its normal order (a list of integers)
                notes.append('.'.join(str(n) for n in element.normalOrder))

    # store the parsed output into an external file
    with open('/content/drive/MyDrive/Colab Notebooks/model/data/notes', 'wb') as filepath:
        pickle.dump(notes, filepath)

    return notes

In [4]:
# Extract all notes from the songs, store them into an output file called 'notes',
# and keep the returned list in a variable for the later cells
notes = extract_notes()
notes

Parsing /content/drive/MyDrive/Colab Notebooks/songs/afine-2.mid
Parsing /content/drive/MyDrive/Colab Notebooks/songs/A Sleepin' Bee.mid
Parsing /content/drive/MyDrive/Colab Notebooks/songs/AfterYou.mid
Parsing /content/drive/MyDrive/Colab Notebooks/songs/accustomed.mid


['B-4',
 'F4',
 '0.2',
 '0.2',
 'F4',
 'F3',
 'G#4',
 '11.2',
 'G4',
 'G5',
 'F3',
 'F5',
 'G#4',
 'B4',
 'D5',
 'G4',
 'F3',
 'G5',
 'E-4',
 'A4',
 'A3',
 '9.2',
 'F3',
 'A4',
 'F#5',
 'A6',
 'D6',
 '6.9',
 'E-4',
 'A3',
 'F3',
 'B-3',
 'C#4',
 'B-4',
 '4',
 'C#5',
 'A5',
 'F3',
 '10.1',
 '1.4.7',
 '1.7',
 '1.4.7.9',
 '10.1',
 'E4',
 'F3',
 '5.9',
 'D5',
 'D4',
 'F3',
 'F4',
 'A4',
 'F3',
 'F2',
 'F1',
 'F4',
 '9.0.2.5',
 '10.0',
 'C5',
 '5.10',
 'F2',
 'F1',
 '3.5',
 '7.10',
 'E-4',
 'F3',
 '7.10',
 'E-4',
 'F3',
 'E-4',
 'D5',
 '6.9',
 'G4',
 'D6',
 'F3',
 '0.3.7',
 'C5',
 'F3',
 'F4',
 'B-4',
 '0.2',
 'F4',
 'A4',
 'F3',
 'B-4',
 'A4',
 'F4',
 'B-4',
 '0.2',
 'C5',
 'F3',
 '10.0.3.5',
 '10.0.5',
 '9.0.2',
 'A3',
 'F4',
 '2.8',
 'G#2',
 '5.7.11',
 '2.4',
 'G2',
 'F4',
 '0.3.6.9',
 'F3',
 'B2',
 '3.5',
 'B-3',
 'G3',
 'C1',
 '3.4.8',
 'C5',
 'B-3',
 'A3',
 '2.3.5',
 'A4',
 'F3',
 '11.1.3.5.8',
 '5.9',
 '0.2.6',
 'G2',
 'B-3',
 'F3',
 'G2',
 'G3',
 'D4',
 'G4',
 '6.10',
 'F#4',
 'G4',

This list consists of note pitches (e.g. 'B-4' for B-flat in octave 4, 'F4' for F in octave 4) and chords (represented as several integers separated by dots, where each integer is a pitch class and 0 stands for C).
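
As a rough illustration of how to read these tokens, the sketch below turns each one into a plain description. The helper `describe_token` is hypothetical. It assumes that chord integers are pitch classes with 0 = C, and that '-' in a note name marks a flat.

```python
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def describe_token(token: str) -> str:
    """Describe one entry of the extracted list in plain words."""
    # Chord tokens are pitch-class integers joined by dots; a chord whose notes
    # all share one pitch class collapses to a single bare integer such as '4'.
    if "." in token or token.isdigit():
        names = [PITCH_CLASSES[int(part)] for part in token.split(".")]
        return "chord: " + " ".join(names)
    # Note tokens are a pitch name followed by an octave; '-' marks a flat.
    return "note: " + token.replace("-", "b")


for token in ["B-4", "0.2", "4", "10.0.3.5"]:
    print(token, "->", describe_token(token))
```

The chord check is the same one `create_midi` applies later in the notebook when it converts predictions back into notes and chords.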

Now that we have a list of notes and chords, we will use these to create both the input features and the target variable for training purposes.

Each input feature is arbitrarily set at 100 consecutive notes, and the target variable is the note that comes directly after (i.e. the 101st note). The target variable is what the model will try to predict and be scored against during training.
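
The windowing can be sketched on a toy scale. The example below uses a window of 3 instead of 100 so the result fits on screen; `make_windows` is an illustrative helper, not a function from the notebook.

```python
def make_windows(tokens, sequence_length):
    """Return (inputs, targets): each input is `sequence_length` consecutive
    tokens and its target is the token that immediately follows."""
    inputs, targets = [], []
    for start in range(len(tokens) - sequence_length):
        inputs.append(tokens[start:start + sequence_length])
        targets.append(tokens[start + sequence_length])
    return inputs, targets


# First six entries of the extracted list shown earlier.
toy_tokens = ["B-4", "F4", "0.2", "0.2", "F4", "F3"]
inputs, targets = make_windows(toy_tokens, sequence_length=3)
for window, target in zip(inputs, targets):
    print(window, "->", target)
```

A list of N tokens yields N minus the window length training records, so consecutive records overlap in all but one position.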

In [None]:
def create_input_output(notes, unique_note_count):
    # Extract and sort all unique pitches from the extracted note list above
    pitchnames = sorted(set(item for item in notes))
    # Create a dictionary to map pitches to integers
    note_to_int = dict((note, number) for number, note in enumerate(pitchnames))

    network_input = []
    network_output = []

    # Predict the 101st note using 100 notes at a time
    sequence_length = 100

    # create input note sequences and the corresponding next note for training purposes
    for i in range(0, len(notes) - sequence_length, 1):
        sequence_in = notes[i:i + sequence_length]
        sequence_out = notes[i + sequence_length]
        network_input.append([note_to_int[char] for char in sequence_in])
        network_output.append(note_to_int[sequence_out])

    input_count = len(network_input)
    # reshape the input into a format compatible with LSTM layers
    network_input = np.reshape(network_input, (input_count, sequence_length, 1))
    # normalize input
    network_input = network_input / float(unique_note_count)
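    # one hot encode the output vector; num_classes keeps its width equal to the
    # size of the output layer even if the highest index never appears as a target
    network_output = np_utils.to_categorical(network_output, num_classes=unique_note_count)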

    return (network_input, network_output)
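
To see what the encoding steps in this function produce, here is a small pure-Python sketch on six tokens. It mirrors the sorted-vocabulary mapping, the division by the vocabulary size, and the one-hot target. The `one_hot` helper is a stand-in for `to_categorical`, written out only for illustration.

```python
toy_tokens = ["B-4", "F4", "0.2", "0.2", "F4", "F3"]

# Sorted vocabulary -> integer index, mirroring note_to_int in the function.
pitchnames = sorted(set(toy_tokens))
note_to_int = {name: index for index, name in enumerate(pitchnames)}
vocab_size = len(pitchnames)

encoded = [note_to_int[token] for token in toy_tokens]
# Inputs are scaled into [0, 1) by dividing by the vocabulary size.
normalized = [index / vocab_size for index in encoded]


def one_hot(index, size):
    """Return a list of `size` zeros with a single 1.0 at `index`."""
    vector = [0.0] * size
    vector[index] = 1.0
    return vector


print(note_to_int)
print(normalized)
print(one_hot(note_to_int["F4"], vocab_size))
```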

In [19]:
# count the unique notes and chords extracted
unique_note_count = len(set(notes))

In [None]:
# Create input and output for RNN training
# Input - for each record, a list of 100 notes normalized
# Output - for each record, a one-hot encoded version of the output note
network_input, network_output = create_input_output(notes, unique_note_count)
network_input, network_output

(array([[[0.8       ],
         [0.95072464],
         [0.02028986],
         ...,
         [0.51304348],
         [0.28115942],
         [0.9826087 ]],
 
        [[0.95072464],
         [0.02028986],
         [0.02028986],
         ...,
         [0.28115942],
         [0.9826087 ],
         [0.95072464]],
 
        [[0.02028986],
         [0.02028986],
         [0.95072464],
         ...,
         [0.9826087 ],
         [0.95072464],
         [0.06376812]],
 
        ...,
 
        [[0.89275362],
         [0.59130435],
         [0.85217391],
         ...,
         [0.88115942],
         [0.95942029],
         [0.80869565]],
 
        [[0.59130435],
         [0.85217391],
         [0.21449275],
         ...,
         [0.95942029],
         [0.80869565],
         [0.86376812]],
 
        [[0.85217391],
         [0.21449275],
         [0.9884058 ],
         ...,
         [0.80869565],
         [0.86376812],
         [0.86376812]]]), array([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 

### Create Neural Network Architecture

The RNN variant chosen is [LSTM (Long Short-Term Memory)](https://colah.github.io/posts/2015-08-Understanding-LSTMs/), which is capable of learning long-term dependencies.

Our model consists of:
- Two stacked LSTM layers, each with 128 hidden nodes (the first uses a recurrent dropout of 0.2)
- A dense layer of 256 nodes using 'relu' (rectified linear unit) activation
- Batch normalization after the second LSTM layer and after the dense layer
- An output dense layer with 'softmax' activation and one node per unique note or chord we have parsed

Categorical cross-entropy is chosen as the loss function because predicting the next note is essentially a multi-class classification problem, with one class per unique note or chord.
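
For intuition about this loss, the sketch below computes it by hand for a single prediction over 4 made-up classes. It is a pure-Python illustration of the formula, not the Keras implementation, and the scores are invented.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]  # shift for stability
    total = sum(exps)
    return [value / total for value in exps]


def categorical_cross_entropy(one_hot_target, probabilities):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probabilities) if t > 0)


target = [0.0, 0.0, 1.0, 0.0]            # the true next note is class 2
confident = softmax([0.1, 0.2, 4.0, 0.3])
unsure = softmax([1.0, 1.0, 1.0, 1.0])   # uniform guess over 4 classes

print(categorical_cross_entropy(target, confident))
print(categorical_cross_entropy(target, unsure))  # equals log(4)
```

A uniform guess over K classes always scores log(K), which gives a baseline for judging the loss values seen in training.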

In [23]:
def create_network(network_input, unique_note_count):
    # create the structure of the neural network
    model = Sequential()
    model.add(LSTM(
        128,
        input_shape=(network_input.shape[1], network_input.shape[2]), #100 by 1
        recurrent_dropout=0.2,
        return_sequences=True #return_sequences = True to feed the output of LSTM array to another LSTM layer 
    ))
    model.add(LSTM(128, return_sequences=False))
    model.add(BatchNorm())
    model.add(Dense(256, activation='relu'))
    model.add(BatchNorm())
    model.add(Dense(unique_note_count, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam')

    return model

In [24]:
model = create_network(network_input, unique_note_count)
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 lstm (LSTM)                 (None, 100, 128)          66560     
                                                                 
 lstm_1 (LSTM)               (None, 128)               131584    
                                                                 
 batch_normalization (BatchN  (None, 128)              512       
 ormalization)                                                   
                                                                 
 dense (Dense)               (None, 256)               33024     
                                                                 
 batch_normalization_1 (Batc  (None, 256)              1024      
 hNormalization)                                                 
                                                                 
 dense_1 (Dense)             (None, 345)               8
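
The parameter counts in the summary can be checked by hand. The sketch below applies the standard formulas for LSTM, dense and batch normalization layers to the rows that are fully visible above. The helper names are illustrative.

```python
def lstm_params(input_dim, units):
    """4 gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """One weight per input-output pair plus one bias per unit."""
    return input_dim * units + units


def batch_norm_params(features):
    """gamma, beta, moving mean and moving variance per feature."""
    return 4 * features


print(lstm_params(1, 128))      # 66560   (lstm)
print(lstm_params(128, 128))    # 131584  (lstm_1)
print(batch_norm_params(128))   # 512     (batch_normalization)
print(dense_params(128, 256))   # 33024   (dense)
print(batch_norm_params(256))   # 1024    (batch_normalization_1)
```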

### Train Model

Now we can begin feeding our input and output into the model for training. We train for 400 epochs.

At the end of every training epoch, we compare the training loss against the lowest loss seen so far and save the weights only if it has improved. This way we only save progressively better weights and can use them as checkpoints to assess how the model is performing.
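
With `save_best_only=True`, Keras's `ModelCheckpoint` compares each epoch against the best monitored value seen so far, not just the previous epoch. The sketch below reproduces only that decision rule in plain Python. The loss values are hypothetical and the function name is illustrative.

```python
def epochs_that_checkpoint(losses):
    """Return the 1-based epochs at which a save-best-only rule would write a file."""
    best = float("inf")
    saved = []
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:  # strictly better than the best seen so far
            best = loss
            saved.append(epoch)
    return saved


# Hypothetical loss values, for illustration only.
print(epochs_that_checkpoint([5.48, 5.10, 5.25, 5.12, 4.90]))  # [1, 2, 5]
```

Epoch 4 improves on epoch 3 but is still worse than epoch 2, so nothing is saved for it.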

In [None]:
def train(model, network_input, network_output):
    # Create filepath to store the weights of the various epochs
    filepath = "/content/drive/MyDrive/Colab Notebooks/model/weights/weights-{epoch:02d}-{loss:.4f}.hdf5"
    checkpoint = ModelCheckpoint(
        filepath,
        monitor='loss',
        verbose=0,
        save_best_only=True, #save weights only if the loss is lower than the best so far
        mode='min'
    )
    callbacks_list = [checkpoint]

    # Run the model with 400 epochs
    model.fit(network_input, network_output, epochs=400, batch_size=32, callbacks=callbacks_list)

In [None]:
train(model, network_input, network_output)

Epoch 1/400
Epoch 2/400
Epoch 3/400
Epoch 4/400
Epoch 5/400
Epoch 6/400
Epoch 7/400
Epoch 8/400
Epoch 9/400
Epoch 10/400
Epoch 11/400
Epoch 12/400
Epoch 13/400
Epoch 14/400
Epoch 15/400
Epoch 16/400
Epoch 17/400
Epoch 18/400
Epoch 19/400
Epoch 20/400
Epoch 21/400
Epoch 22/400
Epoch 23/400
Epoch 24/400
Epoch 25/400
Epoch 26/400
Epoch 27/400
Epoch 28/400
Epoch 29/400
Epoch 30/400
Epoch 31/400
Epoch 32/400
Epoch 33/400
Epoch 34/400
Epoch 35/400
Epoch 36/400
Epoch 37/400
Epoch 38/400
Epoch 39/400
Epoch 40/400
Epoch 41/400
Epoch 42/400
Epoch 43/400
Epoch 44/400
Epoch 45/400
Epoch 46/400
Epoch 47/400
Epoch 48/400
Epoch 49/400
Epoch 50/400
Epoch 51/400
Epoch 52/400
Epoch 53/400
Epoch 54/400
Epoch 55/400
Epoch 56/400
Epoch 57/400
Epoch 58/400
Epoch 59/400
Epoch 60/400
Epoch 61/400
Epoch 62/400
Epoch 63/400
Epoch 64/400
Epoch 65/400
Epoch 66/400
Epoch 67/400
Epoch 68/400
Epoch 69/400
Epoch 70/400
Epoch 71/400
Epoch 72/400
Epoch 73/400
Epoch 74/400
Epoch 75/400
Epoch 76/400
Epoch 77/400
Epoch 78

### Prepare Prediction Input

Building on the training input function from earlier in the notebook, we repurpose it to create 2 different inputs:
- network_input: the integer-encoded 100-note sequences, one of which is picked at random to seed the prediction
- normalized_input: the same sequences reshaped and normalized, used to recreate the training model architecture with the right input shape

In [31]:
def prepare_prediction_input(notes, pitchnames, unique_note_count):
    note_to_int = dict((note, number) for number, note in enumerate(pitchnames))

    sequence_length = 100
    network_input = []
    for i in range(0, len(notes) - sequence_length, 1):
        sequence_in = notes[i:i + sequence_length]
        network_input.append([note_to_int[char] for char in sequence_in])

    input_count = len(network_input)

    # reshape the input into a format compatible with LSTM layers
    normalized_input = np.reshape(network_input, (input_count, sequence_length, 1))
    # normalize input
    normalized_input = normalized_input / float(unique_note_count)

    return (network_input, normalized_input)

### Generate Notes

We will generate 500 notes, which will give us a song of about 2 minutes to listen to. To start us off, we randomly select one of our 100-note inputs from the training data.

In [32]:
def generate_notes(model, network_input, pitchnames, unique_note_count):
    int_to_note = dict((number, note) for number, note in enumerate(pitchnames))

    # pick a random 100-note training input to start off our prediction
    start = np.random.randint(0, len(network_input))  # upper bound is exclusive
    pattern = list(network_input[start])  # copy so the stored sequence is not modified
    prediction_output = []

    # generate 500 notes
    for note_index in range(500):
        # normalize 1 record of prediction input and predict the output
        prediction_input = np.reshape(pattern, (1, len(pattern), 1))
        prediction_input = prediction_input / float(unique_note_count)
        prediction = model.predict(prediction_input, verbose=0)

        # Return the index of the output vector with the highest value
        index = np.argmax(prediction)
        # Map the predicted integer back to the corresponding note 
        result = int_to_note[index]
        # Store the predicted note into an output list and append the predicted note to the initial training input
        prediction_output.append(result)
        pattern.append(index)
        # Drop the first note and keep the latest 100 notes for the next prediction cycle
        pattern = pattern[1:]

    return prediction_output
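
The rolling-window idea in `generate_notes` can be tried without a trained network. In the sketch below, `toy_model` is a hypothetical stand-in for `model.predict` that always favours the index after the last one. The window length and vocabulary size are shrunk for readability.

```python
def generate(predict_next, seed, steps):
    """Repeatedly predict one index, append it, and drop the oldest entry
    so the window always keeps the same length."""
    window = list(seed)  # copy so the caller's seed is not modified
    generated = []
    for _ in range(steps):
        scores = predict_next(window)
        index = max(range(len(scores)), key=scores.__getitem__)  # argmax
        generated.append(index)
        window = window[1:] + [index]
    return generated


def toy_model(window, vocab_size=4):
    """Hypothetical stand-in for model.predict: favours (last index + 1) mod vocab_size."""
    scores = [0.0] * vocab_size
    scores[(window[-1] + 1) % vocab_size] = 1.0
    return scores


seed = [0, 1, 2]
print(generate(toy_model, seed, steps=5))  # [3, 0, 1, 2, 3]
print(seed)                                # [0, 1, 2], unchanged
```

Because every step takes the argmax, the same seed always produces the same output; the only randomness comes from which seed is picked.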

### Create Output Midi

Now we need to string the predicted notes back together into a MIDI song structure.

To assess how training progresses across epochs, we will load the weights from epochs 1, 100, 204, 303 and 382 separately into the model. These will give us 5 song outputs to listen to.

In [33]:
def create_midi(prediction_output):
    offset = 0 #offset is the time position in a song
    output_notes = []

    # recreate note and chord
    for pattern in prediction_output:
        # if the prediction is a chord
        if ('.' in pattern) or pattern.isdigit():
            # for each note in the chord, break it apart and convert the normal order number back to actual note
            notes_in_chord = pattern.split('.')
            notes = []
            for current_note in notes_in_chord:
                new_note = note.Note(int(current_note))
                new_note.storedInstrument = instrument.Piano()
                notes.append(new_note)
            new_chord = chord.Chord(notes)
            new_chord.offset = offset
            output_notes.append(new_chord)
        # if the prediction is a note
        else:
            new_note = note.Note(pattern)
            new_note.offset = offset
            new_note.storedInstrument = instrument.Piano()
            output_notes.append(new_note)

        # add time offset to indicate the position of the next note in the song.
        # If offset = 0, the next note will be played together with the first note instead of afterwards
        offset += 0.5

    midi_stream = stream.Stream(output_notes)

    # Indicate the name of the midi files to be created
    # midi_stream.write('midi', fp='/content/drive/MyDrive/Colab Notebooks/model/output_epoch1.mid')
    # midi_stream.write('midi', fp='/content/drive/MyDrive/Colab Notebooks/model/output_epoch100.mid')
    # midi_stream.write('midi', fp='/content/drive/MyDrive/Colab Notebooks/model/output_epoch204.mid')
    # midi_stream.write('midi', fp='/content/drive/MyDrive/Colab Notebooks/model/output_epoch303.mid')
    midi_stream.write('midi', fp='/content/drive/MyDrive/Colab Notebooks/model/output_epoch382.mid')
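
As a quick check on the song length, the arithmetic below converts the fixed 0.5 offset step into playing time. The tempo of 120 beats per minute is an assumption, since the notebook never sets one, and `song_seconds` is an illustrative helper.

```python
def song_seconds(note_count, step_quarters=0.5, tempo_bpm=120):
    """Approximate playing time when every event starts `step_quarters`
    quarter notes after the previous one.

    tempo_bpm=120 is an assumption; the notebook does not set a tempo.
    """
    total_quarters = note_count * step_quarters
    return total_quarters * 60.0 / tempo_bpm


print(song_seconds(500))  # 125.0 seconds, roughly 2 minutes
```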

### String Generation End to End

In [34]:
def generate_end_to_end():
    # Load the notes used to train the model
    with open('/content/drive/MyDrive/Colab Notebooks/model/data/notes', 'rb') as filepath:
        notes = pickle.load(filepath)

    # Get all pitch names
    pitchnames = sorted(set(item for item in notes))
    unique_note_count = len(set(notes))

    # Prepare prediction input
    network_input, normalized_input = prepare_prediction_input(notes, pitchnames, unique_note_count)
    
    # recreate the RNN architecture that we used for training
    model = create_network(normalized_input, unique_note_count)

    # Load the saved weights into the model (uncomment the checkpoint to listen to)
    # model.load_weights('/content/drive/MyDrive/Colab Notebooks/model/weights/weights-01-5.4835.hdf5')
    # model.load_weights('/content/drive/MyDrive/Colab Notebooks/model/weights/weights-100-2.0635.hdf5')
    # model.load_weights('/content/drive/MyDrive/Colab Notebooks/model/weights/weights-204-0.8069.hdf5')
    # model.load_weights('/content/drive/MyDrive/Colab Notebooks/model/weights/weights-303-0.2552.hdf5')
    model.load_weights('/content/drive/MyDrive/Colab Notebooks/model/weights/weights-382-0.1449.hdf5')
    
    prediction_output = generate_notes(model, network_input, pitchnames, unique_note_count)
    create_midi(prediction_output)

In [35]:
generate_end_to_end()

Once the MIDI files are generated, we can play them in software such as GarageBand, or convert them to MP3 via a website such as [zamzar.com](https://www.zamzar.com/), to give them a listen.

### Result

Listening to the MIDI files, we can hear clear improvements across epochs:
- At epoch 1, a single note is played over and over for the entirety of the song
- At epoch 100, the model has learned to play more than one note, although there are still stretches of repeated notes and no chords
- At epoch 382, the model has learned to play chords and has little to no repetition. However, all notes have the same duration and there are no pauses between notes

### Next Steps
1. Ingest rests and note offsets as inputs so that the model can learn pauses in songs
2. Try to reproduce the structure of actual songs, e.g. verse-chorus-verse-chorus-bridge
3. Add an additional instrument, e.g. bass