### A fully connected autoencoder network

The network summary looks like this:

```
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         (None, 32)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 10)                660       
_________________________________________________________________
dense_2 (Dense)              (None, 32)                672       
=================================================================
Total params: 1,332
Trainable params: 1,332
Non-trainable params: 0
```

The input layer takes a sequence of 32 notes and/or chords. The first Dense layer has 10 nodes and thereby compresses the input to a "latent space". The second Dense layer reconstructs the input from the latent space.

Since the network is very small (a single hidden layer), the bottleneck cannot be made much smaller than the input.
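
As a quick sanity check on the summary, the parameter count of a fully connected layer is `inputs * units + units` (one weight per connection plus one bias per unit). The sketch below applies that formula to the 32 → 10 → 32 shapes described above; `dense_params` is a helper name made up for illustration. It gives 330 and 352 parameters, which differs from the 660 and 672 printed in the summary, so that printout may stem from a different configuration of the notebook.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units

encoder_params = dense_params(32, 10)   # 32*10 weights + 10 biases
decoder_params = dense_params(10, 32)   # 10*32 weights + 32 biases
total_params = encoder_params + decoder_params

print(encoder_params, decoder_params, total_params)
```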

In [19]:
# Imports

# Keras
from keras.layers import Input, Dense
from keras.models import Model
import keras.utils as utils
# in case of need for activity regularizers
from keras import regularizers
# earlystopping prevents overfitting
from keras.callbacks import EarlyStopping

# For midi
from music21 import converter, instrument, note, chord
from music21.instrument import Guitar
from music21 import midi, stream

# To calculate training time
import time

# To create scaler for normalizing embedded chords/notes
# and rescaling back to embedding after training
from sklearn.preprocessing import MinMaxScaler

import numpy as np
np.set_printoptions(threshold=10e6)

# bottleneck
encoding_dim = 10

# for training
epochs = 400
batch_size = 450

### The dataset

The notes for the dataset have been parsed in the notebook "Midi Parsing".
The text file contains one long string of notes and chords.

Here, I split the string and convert it to a list of strings.

The last few elements look like this:

```
['A2', 'E2', '9.1.4', 'A2', 'E2', 'A2', 'E2', '9.1', '9.1']
``` 

'A2', 'E2', etc. are single notes, written as pitch name and octave.

A string of numbers such as '9.1.4' means three separate notes played simultaneously, i.e. a chord.

Chords are written in their *normal order*, a concept I don't fully understand. The numbers are pitch classes counted in semitones from C, so '9.1.4' stands for A, C#, and E.

The **music21** library understands these representations as different chords.

```len(notes) = 598820```

So all the songs are concatenated into one long sequence of length 598820.
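
To make the token format concrete, here is a small sketch that splits a comma-separated string and labels each token as a note or a chord. The sample string and the names `PITCH_CLASS_NAMES`, `is_chord`, and `chord_pitch_names` are assumptions for illustration. The chord test mirrors the check used further down in `createMusic21Object`, and the chord numbers are assumed to be pitch classes counted in semitones from C (0 = C, 9 = A), spelled with sharps here.

```python
# Pitch-class numbers 0-11 counted in semitones from C (sharp spelling chosen here)
PITCH_CLASS_NAMES = ['C', 'C#', 'D', 'D#', 'E', 'F',
                     'F#', 'G', 'G#', 'A', 'A#', 'B']

def is_chord(token):
    """A token is treated as a chord if it consists of dot-separated integers."""
    return '.' in token or token.isdigit()

def chord_pitch_names(token):
    """Translate e.g. '9.1.4' into a list of pitch names."""
    return [PITCH_CLASS_NAMES[int(part) % 12] for part in token.split('.')]

sample = 'A2,E2,9.1.4,A2,9.1'   # made-up excerpt in the same format
tokens = sample.split(',')

for token in tokens:
    if is_chord(token):
        print(token, '-> chord', chord_pitch_names(token))
    else:
        print(token, '-> note')
```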

In [20]:
# The dataset: notes.txt holds one long comma-separated string
with open('notes.txt', 'r') as text_file:
    raw_text = text_file.read()

# split into a list of note/chord strings (strip removes a trailing newline)
notes = raw_text.strip().split(',')

### Preparing the dataset

Here, I map every unique note/chord to an integer, which I refer to as its embedding. The notes/chords and their integers are stored in a dictionary called note_to_int.

A snippet from note_to_int:

```
 '9.11.2.3': 441,
 '9.11.2.4': 442,
 '9.11.2.5': 443,
 '9.11.3': 444,
 '9.11.4': 445,
 '9.2': 446,
 'A2': 447,
 'A3': 448,
 'A4': 449,
 'A5': 450,
 'A6': 451,
 'B-2': 452,
 'B-3': 453,
 'B-4': 454,
```

In [21]:
# Preparing dataset
sequence_length = 32

# sort all unique elements of notes-list
pitchnames = sorted(set(item for item in notes))

# create a dictionary to map pitches to integers
note_to_int = dict((note, number) for number, note in enumerate(pitchnames))

network_input = []

# create non-overlapping input sequences of sequence_length notes/chords;
# any notes left over at the end are dropped
for i in range(0, len(notes) - sequence_length, sequence_length):
    sequence_in = notes[i:i + sequence_length]
    network_input.append([note_to_int[item] for item in sequence_in])
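
The same indexing and windowing can be followed on a toy list. This is a sketch with made-up data: ten tokens and a window length of 4 instead of 32, with all `toy_` names invented for illustration. It yields two windows and drops the last two tokens. The final lines apply the same stepping to the full dataset length of 598820.

```python
toy_notes = ['A2', 'E2', '9.1.4', 'A2', 'E2', 'A2', 'E2', '9.1', '9.1', 'A2']
toy_length = 4   # stands in for sequence_length = 32

# Sorted unique tokens get consecutive integer indices
toy_pitchnames = sorted(set(toy_notes))
toy_note_to_int = {name: index for index, name in enumerate(toy_pitchnames)}

# Step through the list in non-overlapping windows
toy_input = []
for i in range(0, len(toy_notes) - toy_length, toy_length):
    window = toy_notes[i:i + toy_length]
    toy_input.append([toy_note_to_int[name] for name in window])

print(toy_note_to_int)
print(toy_input)

# The same stepping applied to the full dataset size
n_windows = len(range(0, 598820 - 32, 32))
print(n_windows)
```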

### Normalizing input, and creating upscaler for later

First I get the maximum value from the network input.

Then I use it as the upper bound of the feature range (0, max) for a *sklearn.preprocessing.MinMaxScaler*, which is used later to rescale predictions.

Finally, I divide the network_input by the maximum value to normalize it to the range 0 to 1.

In [22]:
# Get the global max value from network input
# (max(max(...)) would compare the sequences lexicographically)
maxr = np.max(network_input)

# create feature range for upscaler
feature_range = (0,maxr)
# prepare scaler for later
predictScaler = MinMaxScaler(feature_range=feature_range)

# saving feature range, useful elsewhere
np.save("feature_range.npy", feature_range)
           

# normalize input
network_input = np.asarray(network_input)
normalized_input = network_input / maxr
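
A detail worth knowing when taking the maximum of a nested list: Python compares lists lexicographically, so `max(max(rows))` returns the largest element of the "largest" row, which is not always the global maximum. The sketch below uses made-up numbers to show the difference, and then normalizes by the global maximum and inverts the normalization by multiplying back.

```python
rows = [[3, 1, 2],
        [2, 9, 0]]

# Lists compare lexicographically, so max(rows) is [3, 1, 2], not the row holding 9
nested_max = max(max(rows))
global_max = max(max(row) for row in rows)
print(nested_max, global_max)

# Normalize to [0, 1] with the global maximum, then invert by multiplying back
normalized = [[value / global_max for value in row] for row in rows]
restored = [[round(value * global_max) for value in row] for row in normalized]
print(normalized)
print(restored)
```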

### Creating train and test set

I use the first 2/3 of the normalized input as my training set and keep the last 1/3 as my test set.
Then I save both for later use.

In [23]:
# Split
split_point = int(normalized_input.shape[0] * 2 / 3)

# note: the upper bound -1 leaves the very last sequence out of the test set
x_train, x_test = normalized_input[0:split_point,:], normalized_input[split_point:-1,:]

np.savez("music.npz", x_train=x_train,x_test=x_test)

x_train.shape, x_test.shape

((12475, 32), (6237, 32))
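
The printed shapes can be reproduced with plain list slicing. The sketch assumes 18713 sequences in total, which is the count implied by the shapes above plus the one row that the `-1` bound excludes; `indices` is just a stand-in for the rows of `normalized_input`.

```python
n_sequences = 18713                      # assumed total: 12475 + 6237 + 1
indices = list(range(n_sequences))       # stand-in for the rows of normalized_input

split_point = int(n_sequences * 2 / 3)
train_part = indices[0:split_point]
test_part = indices[split_point:-1]      # same slice bounds as in the cell above
test_part_full = indices[split_point:]   # would keep the final row as well

print(split_point, len(train_part), len(test_part), len(test_part_full))
```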

### Creating the network

The weights are initialized from a random normal distribution, which starts them at small values around zero. The relu activation function gives the best result. All the input values are non-negative, so that's not a surprise.
Relu also helps against vanishing gradients; I experienced slow convergence with sigmoid.
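
One common explanation for the slower convergence with sigmoid is the size of its derivative: it never exceeds 0.25 and shrinks quickly for larger inputs, while the relu derivative is 1 for any positive input. The sketch below only illustrates that general property; it is not a measurement from this notebook.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid: s * (1 - s)."""
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_grad(z):
    """Derivative of relu: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0

for z in (0.0, 2.0, 5.0):
    print(z, round(sigmoid_grad(z), 4), relu_grad(z))
```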

In [24]:
# size of the input (the sequence length)
input_dim = network_input.shape[1]

# input placeholder
input_song = Input(shape=(input_dim,))

# encoder
encoded = Dense(encoding_dim, kernel_initializer='random_normal',
               bias_initializer='zeros', activation='relu')(input_song)

#decoder
decoded = Dense(input_dim, kernel_initializer='random_normal',
                bias_initializer='zeros', activation='relu')(encoded)

### The autoencoder

In [25]:
# The autoencoder maps the input to its reconstruction
# input=input song, output = decoded song

autoencoder = Model(input_song, decoded)

### The encoder and decoder

These aren't really necessary for making predictions in this example. They just illustrate that the encoder and the decoder can be split into separate models that share the trained layers. I don't use them for making predictions later.

In [26]:
# Separate encoder model
encoder = Model(input_song, encoded)

### The decoder

In [27]:
# Separate decoder model

# create placeholder for encoded (10-dim) input
encoded_input = Input(shape=(encoding_dim,))

# retrieve last layer of autoencoder model
decoder_layer = autoencoder.layers[-1]

# make decoder model
decoder = Model(encoded_input, decoder_layer(encoded_input))
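
The relationship between the three models can be sketched without Keras. Below, the encoder and decoder are plain functions over hand-picked toy weights (4 inputs and a bottleneck of 2 instead of 32 and 10; all names and numbers are made up), and the autoencoder is simply their composition, reusing the same weights.

```python
def relu(values):
    return [max(0.0, v) for v in values]

def dense(inputs, weights, biases):
    """One fully connected layer with relu; weights[i][j] links input i to unit j."""
    outputs = []
    for j in range(len(biases)):
        total = biases[j]
        for i, x in enumerate(inputs):
            total += x * weights[i][j]
        outputs.append(total)
    return relu(outputs)

# Toy sizes: 4 inputs, bottleneck of 2
enc_w = [[0.5, 0.0], [0.0, 0.5], [0.5, 0.0], [0.0, 0.5]]
enc_b = [0.0, 0.0]
dec_w = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
dec_b = [0.0, 0.0, 0.0, 0.0]

def encode(x):
    return dense(x, enc_w, enc_b)

def decode(code):
    return dense(code, dec_w, dec_b)

def autoencode(x):
    # the autoencoder is the decoder applied to the encoder's output
    return decode(encode(x))

x = [0.2, 0.8, 0.2, 0.8]
code = encode(x)
print(code)
print(autoencode(x))
```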

### Training

In practice, the model tries to minimize the distance between the reconstructed input and the true input. The values are not likelihood estimates, just numeric representations of notes and chords. Therefore I chose *mean squared error* as the loss function.
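
For reference, mean squared error is the average of the squared element-wise differences. A minimal sketch with made-up values:

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared element-wise differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

original      = [0.50, 0.25, 1.00, 0.75]   # made-up normalized sequence
reconstructed = [0.50, 0.50, 0.75, 0.75]   # made-up network output

print(mean_squared_error(original, reconstructed))
```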

I chose rmsprop because I knew it was a good optimizer, and it gives better results here than adam and adadelta. I don't have a good explanation for that at the moment.

I use early stopping with a patience of 20 epochs and a minimum change of 10e-5 (that is, 1e-4).
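
The stopping rule can be sketched in a few lines. This is a simplified illustration of the idea behind patience and minimum change, not Keras's actual implementation; the function name and the loss values are made up.

```python
def early_stopping_epoch(val_losses, patience, min_delta):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:   # counts as an improvement
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Made-up validation losses: improve for 5 epochs, then plateau
losses = [0.10, 0.08, 0.06, 0.05, 0.04] + [0.04] * 30
stop_epoch = early_stopping_epoch(losses, patience=20, min_delta=10e-5)
print(stop_epoch)

# 10e-5 is the same number as 1e-4
print(10e-5 == 1e-4)
```

With these losses the last improvement happens in epoch 5, so training stops 20 epochs later.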

In [28]:
# use mean squared error loss and the rmsprop optimizer
autoencoder.compile(optimizer='rmsprop', loss='mean_squared_error')

earlystop = EarlyStopping(monitor='val_loss', min_delta=10e-5, patience=20,
                          verbose=1, mode='auto')

callbacks_list = [earlystop]

# train the model
start = time.time()

model_info = autoencoder.fit(x_train, x_train, 
                epochs=epochs,
                batch_size=batch_size,
                shuffle=True,
                callbacks=callbacks_list,
                validation_split=0.3)


end = time.time()

print("time to train", end-start)

autoencoder.save("autoencoder.h5")

Train on 8732 samples, validate on 3743 samples
Epoch 1/400
...
Epoch 144/400
Epoch 00144: early stopping
time to train 8.981409788131714
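
The sample counts in the first line of the log follow from `validation_split=0.3`: Keras holds out the last fraction of the training samples for validation. The arithmetic, as a small sketch:

```python
n_train = 12475            # rows in x_train
validation_split = 0.3

# the last 30% of the samples are set aside for validation
n_fit = int(n_train * (1 - validation_split))
n_val = n_train - n_fit
print(n_fit, n_val)
```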


### Prepare to play music

Parts of this section are borrowed from [this article](https://towardsdatascience.com/how-to-generate-music-using-a-lstm-neural-network-in-keras-68786834d4c5).

In [29]:
# map integers back to notes/chords
int_to_note = dict((number, note) for number, note in enumerate(pitchnames))

# Save the dictionary for use elsewhere.
# np.savez stores arrays, so I save keys and values as two lists
# and zip them back into a dictionary when needed
keys = list(int_to_note.keys())
values = list(int_to_note.values())
np.savez("int_to_note.npz", keys=keys, values=values)
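
Zipping the two lists back into a dictionary looks roughly like this. The mapping is made up, and the sketch uses plain lists; after loading the `.npz` file the keys and values arrive as arrays, which can be converted with `.tolist()` first.

```python
toy_int_to_note = {0: '9.1', 1: '9.1.4', 2: 'A2', 3: 'E2'}   # made-up mapping

# Split into two parallel lists for saving
toy_keys = list(toy_int_to_note.keys())
toy_values = list(toy_int_to_note.values())

# Later: zip the lists back into a dictionary
rebuilt = dict(zip(toy_keys, toy_values))
print(rebuilt == toy_int_to_note)
```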

In [30]:
def createPattern(input_sequence):
    """
    Function that maps integers back to the string representation
    of notes and chords, using the int_to_note dictionary.
    
    Input: sequence of 32 integers representing a short song
    
    Output: Note and chord representations as strings
    """
    
    
    prediction_output = []

    # generate notes
    for note_index in input_sequence:
                
        result = int_to_note[note_index]
        prediction_output.append(result)
    return prediction_output

In [31]:
def createMusic21Object(prediction_output):
    """
    Creates music21 note and chord objects for a stream.
    Does not add offset as it should,
    and needs upgrading to read pauses and tempo.
    
    Input: list of string representation of chords and notes.
    
    Output: list of music21.note.Note and/or music21.chord.Chord objects
    """
    
    
    
    offset = 0
    output_notes = []

    # create note and chord objects based on the values generated by the model
    for pattern in prediction_output:
        
        # pattern is a chord
        if ('.' in pattern) or pattern.isdigit():
            notes_in_chord = pattern.split('.')
            notes = []

            for current_note in notes_in_chord:
                new_note = note.Note(int(current_note))
                new_note.storedInstrument = Guitar()
                notes.append(new_note)

            new_chord = chord.Chord(notes)
            new_chord.offset = offset
            output_notes.append(new_chord)

        # pattern is a note
        else:
            new_note = note.Note(pattern)
            new_note.offset = offset
            new_note.storedInstrument = Guitar()
            output_notes.append(new_note)

        # increase offset each iteration so that notes do not stack
        offset += 0.5
        
    return output_notes

#### Showing how createMusic21Object works


```
lisa = ['C2', 'D2', 'E2', 'F2', 'G2', 'G2', 'A2', 'A2', 'A2', 'A2', 'G2']

lisa_test = createMusic21Object(lisa)
lisa_test

[<music21.note.Note C>,
 <music21.note.Note D>,
 <music21.note.Note E>,
 <music21.note.Note F>,
 <music21.note.Note G>,
 <music21.note.Note G>,
 <music21.note.Note A>,
 <music21.note.Note A>,
 <music21.note.Note A>,
 <music21.note.Note A>,
 <music21.note.Note G>] 
```

### Inference


In [32]:
# making a prediction
decoded_song = autoencoder.predict(x_test)

# rescaling the result to fit embedding
song = (predictScaler.fit_transform(decoded_song)).astype('int')

# choose the first decoded sequence (index 0)
newsong = createPattern(song[0])
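
The cell above rescales by fitting `MinMaxScaler` on the predictions themselves, which stretches each of the 32 columns to the full range independently. A different option, sketched here and not what the notebook does, is to invert the normalization directly: multiply by the same maximum that was used to divide, round, and clip to the valid index range so every dictionary lookup succeeds (relu outputs can exceed 1). The function name and all values are made up.

```python
def to_note_indices(decoded_row, max_index):
    """Scale values from [0, 1] back to integer indices in [0, max_index]."""
    indices = []
    for value in decoded_row:
        index = int(round(value * max_index))
        index = min(max(index, 0), max_index)   # clip so every lookup is valid
        indices.append(index)
    return indices

toy_int_to_note = {0: '9.1', 1: '9.1.4', 2: 'A2', 3: 'E2'}   # made-up mapping
toy_decoded = [0.68, 0.97, 0.30, 1.20, 0.0]                  # made-up network output

toy_indices = to_note_indices(toy_decoded, max_index=3)
print(toy_indices)
print([toy_int_to_note[i] for i in toy_indices])
```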

In [33]:
play = createMusic21Object(newsong)

In [34]:
midi_stream = stream.Stream(play)

sp = midi.realtime.StreamPlayer(midi_stream)

sp.play()

### Compare test set and decoded test set

In [35]:
decoded_song.mean(), x_test.mean(), decoded_song.std(), x_test.std()

(0.7784982, 0.8663733687342035, 0.2851482, 0.2015395137870709)
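
Means and standard deviations compare the two sets as a whole, but similar summary statistics do not by themselves imply a good reconstruction. An element-wise measure says more about individual sequences. A small sketch with made-up numbers:

```python
import statistics

toy_test    = [0.9, 0.8, 1.0, 0.7]   # made-up normalized inputs
toy_decoded = [0.8, 0.9, 0.6, 0.9]   # made-up reconstructions

print(statistics.mean(toy_decoded), statistics.mean(toy_test))
# NumPy's .std() defaults to the population standard deviation (pstdev)
print(statistics.pstdev(toy_decoded), statistics.pstdev(toy_test))

# Mean absolute difference per position, a more direct error measure
mean_abs_error = sum(abs(t - d) for t, d in zip(toy_test, toy_decoded)) / len(toy_test)
print(mean_abs_error)
```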

In [36]:
# autoencoder.summary()