### A fully connected autoencoder network

The network looks like this:

```
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         (None, 32)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 10)                660       
_________________________________________________________________
dense_2 (Dense)              (None, 32)                672       
=================================================================
Total params: 1,332
Trainable params: 1,332
Non-trainable params: 0
```

The input layer takes a sequence of 32 notes and/or chords. The first Dense layer has 10 nodes and thereby compresses the input to a "latent space". The second Dense layer reconstructs the input from the latent space.

Since it is a very small network, the bottleneck cannot be much smaller than the input at this level.
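
The parameter counts in a summary like the one above can be checked by hand: a Dense layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases. The sketch below uses a hypothetical helper, `dense_params`, which is not part of Keras. The counts listed in the summary (660 and 672) match a bottleneck of 20 units rather than 10, so the summary was presumably printed from a run with a different `encoding_dim`; that is only a guess.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

# bottleneck of 10 units, as in encoding_dim = 10
small = (dense_params(32, 10), dense_params(10, 32))
print(small, sum(small))

# bottleneck of 20 units reproduces the counts shown in the summary
listed = (dense_params(32, 20), dense_params(20, 32))
print(listed, sum(listed))
```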

In [1]:
# Imports

# Keras
from keras.layers import Input, Dense
from keras.models import Model
import keras.utils as utils
# in case of need for activity regularizers
from keras import regularizers
# earlystopping prevents overfitting
from keras.callbacks import EarlyStopping

# For midi
from music21 import converter, instrument, note, chord
from music21.instrument import Guitar
from music21 import midi, stream

# To calculate training time
import time

# To create scaler for normalizing embedded chords/notes
# and rescaling back to embedding after training
from sklearn.preprocessing import MinMaxScaler

import numpy as np
np.set_printoptions(threshold=10e6)

# bottleneck
encoding_dim = 10

# for training
epochs = 400
batch_size = 450

Using TensorFlow backend.


### The dataset

The notes for the dataset have been parsed in the notebook "Midi Parsing".
The text file contains one long string of notes and chords.

Here, I split the string and convert it to a list of strings.

The last nine elements look like this:

```
['E2', '9.1.4', 'A2', 'E2', 'A2', 'E2', '9.1', '', '9.1']
``` 

'A2', 'E2' etc. are single notes, named by pitch and octave.

Dot-separated numbers, e.g. '9.1.4', mean three separate notes played simultaneously, aka a chord.

The empty text string '' means a rest.

The chords are represented in their *normal order*, which is a concept I don't fully understand. It has something to do with semitone intervals.

The **music21** library understands these representations as different chords.

```len(notes) = 598820```

So all the songs are joined into one long sequence of length 598820.
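
The three kinds of elements can be told apart with a few string checks. This is a minimal sketch on the short example list shown above; `classify` is a hypothetical helper, and the rule it applies (empty string is a rest, dots or plain digits mean a chord, anything else is a single note) is the same one used later when the music21 stream is built.

```python
def classify(token):
    """Return 'rest', 'chord' or 'note' for one element of the notes list."""
    if token == '':
        return 'rest'
    if '.' in token or token.isdigit():
        return 'chord'
    return 'note'

# the example elements from above, joined the way the text file stores them
raw = 'E2,9.1.4,A2,E2,A2,E2,9.1,,9.1'
tokens = raw.split(',')
kinds = [classify(token) for token in tokens]

for token, kind in zip(tokens, kinds):
    print(repr(token), kind)
```

Two consecutive commas in the file produce the empty string, which is how a rest ends up in the list.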

In [19]:
# The dataset: one long comma-separated string of notes, chords and rests
with open('notesNew.txt', 'r') as newTextfile:
    notes = newTextfile.read().split(',')

### Preparing the dataset

Here, I map every distinct note/chord to an integer, which I call its embedding. The notes/chords and their embeddings are stored in a dictionary called note_to_int.

A snippet from note_to_int:

```
'9.11.2.3': 441,
 '9.11.2.4': 442,
 '9.11.2.5': 443,
 '9.11.3': 444,
 '9.11.4': 445,
 '9.2': 446,
 'A2': 447,
 'A3': 448,
 'A4': 449,
 'A5': 450,
 'A6': 451,
 'B-2': 452,
 'B-3': 453,
 'B-4': 454,
```

In [3]:
# Preparing dataset
sequence_length = 32

# sort all unique elements of the notes list
pitchnames = sorted(set(notes))

# create a dictionary to map notes/chords to integers
# (the loop variable is called name so it cannot be confused with music21's note module)
note_to_int = {name: number for number, name in enumerate(pitchnames)}

network_input = []

# create non-overlapping input sequences of length sequence_length
for i in range(0, len(notes) - sequence_length, sequence_length):
    sequence_in = notes[i:i + sequence_length]
    network_input.append([note_to_int[item] for item in sequence_in])
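
To make the windowing concrete, here is a small sketch on toy data (not the real dataset) with a window length of 4 instead of 32. `make_windows` is a hypothetical helper that uses the same `range` expression as the loop above. It also shows a side effect of that expression: leftover elements are dropped, and when the list length is an exact multiple of the window length, the last full window is skipped as well.

```python
def make_windows(items, length):
    """Cut items into non-overlapping windows with the range used above."""
    return [items[i:i + length] for i in range(0, len(items) - length, length)]

toy_notes = ['A2', 'E2', '9.1', '', 'A2', 'E2', '9.1.4', 'A2', 'E2', '']

# same mapping idea as note_to_int: sorted unique strings -> integers
toy_map = {name: number for number, name in enumerate(sorted(set(toy_notes)))}
print(toy_map)

# 10 elements, window length 4: two windows, the last two elements are dropped
encoded = [[toy_map[item] for item in window] for window in make_windows(toy_notes, 4)]
print(encoded)

# 8 elements, window length 4: only one window, although two would fit
windows_of_eight = make_windows(toy_notes[:8], 4)
print(len(windows_of_eight))
```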

### Normalizing input, and creating upscaler for later

I get the max value from the network input.

Then I create a scaler from *sklearn.preprocessing.MinMaxScaler* with the feature range (0, max value).

Finally, I divide the network_input by the max value to normalize it to the range [0, 1].

In [4]:
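# Get max value over all sequences in the network input
# (max(max(network_input)) would first pick the lexicographically largest sequence)
maxr = int(np.max(network_input))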

# create feature range for upscaler
feature_range = (0,maxr)
# prepare scaler for later
predictScaler = MinMaxScaler(feature_range=feature_range)

# saving feature range, useful elsewhere
np.save("feature_range.npy", feature_range)
           

# normalize input
network_input = np.asarray(network_input)
normalized_input = network_input / maxr
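
Dividing by the maximum only maps every value into [0, 1] if that maximum is taken over the whole input. The sketch below uses toy data (not the real dataset) to show why nesting the built-in `max` is not the same thing: on a list of lists, the outer `max` compares the lists lexicographically and returns one row, and the inner `max` then only looks at that row.

```python
toy_input = [[5, 1, 0], [3, 9, 2]]

# nested built-in max: picks the row [5, 1, 0] first, then its largest value
nested_max = max(max(toy_input))

# global maximum over all rows
global_max = max(max(row) for row in toy_input)

normalized = [[value / global_max for value in row] for row in toy_input]
largest_normalized = max(max(row) for row in normalized)
print(nested_max, global_max, largest_normalized)
```

With the nested version, the value 9 would be divided by 5 and end up above 1.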

### Creating train and test set

I use the first 2/3 of the normalized input as my training set and keep the last 1/3 as my test set.
Then I save both for later use.

In [5]:
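# Split: first 2/3 for training, the rest for testing
split_point = int(normalized_input.shape[0] * 2 / 3)

# note: the -1 end index leaves out the very last sequence
x_train, x_test = normalized_input[0:split_point,:], normalized_input[split_point:-1,:]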

np.savez("music.npz", x_train=x_train,x_test=x_test)

x_train.shape, x_test.shape

((11292, 32), (5645, 32))
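
The two shapes can be reproduced with plain slicing arithmetic. In this sketch, `split_sizes` is a hypothetical helper, and the total of 16938 sequences is inferred from the shapes above rather than printed anywhere in the notebook. The test slice ends at `-1`, so the final sequence belongs to neither set.

```python
def split_sizes(n_sequences):
    """Sizes of the train and test slices for a given number of sequences."""
    split_point = int(n_sequences * 2 / 3)
    n_train = len(range(n_sequences)[0:split_point])
    n_test = len(range(n_sequences)[split_point:-1])
    return n_train, n_test

sizes = split_sizes(16938)
print(sizes, 16938 - sum(sizes))
```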

### Creating the network

The weights are initialized from a random normal distribution, which keeps them small and on a scale similar to the normalized data. The ReLU activation function gave the best results. All the input values are non-negative (they are normalized to [0, 1]), so that's not a surprise.
ReLU is also less prone to vanishing gradients than sigmoid, with which I experienced slow convergence.
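
The difference between the two activations shows up in their derivatives. This is a small, generic sketch (not taken from the training run): the sigmoid derivative never exceeds 0.25 and shrinks quickly for larger inputs, while the ReLU derivative is exactly 1 for every positive input.

```python
import math

def sigmoid_grad(x):
    """Derivative of the sigmoid function at x."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def relu_grad(x):
    """Derivative of ReLU at x (taken as 0 at x = 0)."""
    return 1.0 if x > 0 else 0.0

for x in (0.0, 2.0, 5.0):
    print(x, round(sigmoid_grad(x), 4), relu_grad(x))
```

The flip side is that the ReLU derivative is 0 for negative inputs, so a unit that only sees negative pre-activations stops learning.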

In [6]:
# size of the input (number of notes/chords per sequence)
input_dim = network_input.shape[1]

# input placeholder
input_song = Input(shape=(input_dim,))

# encoder
encoded = Dense(encoding_dim, kernel_initializer='random_normal',
               bias_initializer='zeros', activation='relu')(input_song)

#decoder
decoded = Dense(input_dim, kernel_initializer='random_normal',
                bias_initializer='zeros', activation='relu')(encoded)

### The autoencoder

In [7]:
# The autoencoder maps the input to its reconstruction
# input=input song, output = decoded song

autoencoder = Model(input_song, decoded)

### The encoder and decoder

These aren't really necessary for making predictions in this example. They just illustrate that the encoding and decoding steps can be split into separate models. I don't use them for making predictions later.

In [8]:
# Separate encoder model
encoder = Model(input_song, encoded)

### The decoder

In [9]:
# Separate decoder model

# create placeholder for encoded input (encoding_dim = 10 values)
encoded_input = Input(shape=(encoding_dim,))

# retrieve last layer of autoencoder model
decoder_layer = autoencoder.layers[-1]

# make decoder model
decoder = Model(encoded_input, decoder_layer(encoded_input))
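
What the three models compute can be sketched in plain NumPy. This is not the Keras code and does not use the trained weights: the weight matrices below are random stand-ins, and `encode`, `decode` and `autoencode` are hypothetical names. The point is only that the autoencoder is the decoder applied to the output of the encoder, with both parts sharing the same two layers.

```python
import numpy as np

rng = np.random.RandomState(0)
input_dim, encoding_dim = 32, 10

# random stand-ins for the two Dense layers (not the trained weights)
w_enc, b_enc = rng.normal(0, 0.05, (input_dim, encoding_dim)), np.zeros(encoding_dim)
w_dec, b_dec = rng.normal(0, 0.05, (encoding_dim, input_dim)), np.zeros(input_dim)

def relu(x):
    return np.maximum(x, 0)

def encode(x):
    """32 values -> 10 values (the latent space)."""
    return relu(x @ w_enc + b_enc)

def decode(z):
    """10 values -> 32 values (the reconstruction)."""
    return relu(z @ w_dec + b_dec)

def autoencode(x):
    return decode(encode(x))

batch = rng.uniform(0, 1, (4, input_dim))
codes = encode(batch)
print(codes.shape, decode(codes).shape)
```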

### Training

The model tries to minimize the distance between the reconstructed input and the true input. The values are plain number representations, not likelihood estimates, so I chose *mean squared error* as the loss function.

I chose rmsprop because I knew it was a good optimizer, and it gave better results than adam and adadelta. I don't have a good explanation for that at the moment.

I use early stopping with a patience of 20 epochs and a minimum change of 10e-5 (i.e. 1e-4).
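
The idea behind patience and minimum change can be sketched in a few lines. This is a simplified illustration, not Keras's actual `EarlyStopping` implementation; `early_stopping_epoch` is a hypothetical helper and the loss values are made up. An epoch only counts as an improvement if the validation loss drops by more than `min_delta` below the best value so far, and training stops once `patience` epochs in a row fail to do that.

```python
def early_stopping_epoch(val_losses, patience, min_delta):
    """Return the 0-based epoch at which training would stop, or None."""
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

made_up_losses = [1.0, 0.5, 0.5, 0.5, 0.5]
stopped = early_stopping_epoch(made_up_losses, patience=3, min_delta=1e-4)
print(stopped)

# 10e-5 is the same number as 1e-4
print(10e-5 == 1e-4)
```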

In [20]:
# use mean squared error loss and the rmsprop optimizer
autoencoder.compile(optimizer='rmsprop', loss='mean_squared_error')

earlystop = EarlyStopping(monitor='val_loss', min_delta=10e-5, patience=20,
                          verbose=1, mode='auto')

callbacks_list = [earlystop]

# train the model
start = time.time()

model_info = autoencoder.fit(x_train, x_train, 
                epochs=epochs,
                batch_size=batch_size,
                shuffle=True,
                callbacks=callbacks_list,
                validation_split=0.3)


end = time.time()

print("time to train", end-start)

Train on 7904 samples, validate on 3388 samples
Epoch 1/400
...
Epoch 32/400
Epoch 00032: early stopping
time to train 3.037297248840332


### Prepare to play music

I read and borrowed parts of the code from [this article](https://towardsdatascience.com/how-to-generate-music-using-a-lstm-neural-network-in-keras-68786834d4c5), but I changed it to include rests.

In [11]:
# get all pitch names
int_to_note = dict((number, note) for number, note in enumerate(pitchnames))

# Save the dictionary for use elsewhere.
# np.savez stores arrays, so I save the keys and values as two lists
# and zip them back into a dictionary when needed
keys = list(int_to_note.keys())
values = list(int_to_note.values())
np.savez("int_to_note.npz", keys=keys, values=values)
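
Zipping the two lists back into a dictionary could look like the sketch below. It uses a tiny stand-in dictionary and an in-memory buffer instead of the real `int_to_note.npz` file, so those names are assumptions for illustration only. The loaded keys and values come back as NumPy scalars, which is why they are converted to plain `int` and `str`.

```python
import io
import numpy as np

int_to_note_demo = {0: '', 1: '9.1', 2: 'A2'}

buffer = io.BytesIO()   # in-memory stand-in for "int_to_note.npz"
np.savez(buffer,
         keys=list(int_to_note_demo.keys()),
         values=list(int_to_note_demo.values()))
buffer.seek(0)

data = np.load(buffer)
restored = {int(key): str(value) for key, value in zip(data['keys'], data['values'])}
print(restored)
```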

In [12]:
def createPattern(input_sequence):
    """
    Map integers back to the string representation
    of notes and chords, using the int_to_note dictionary.

    Input: sequence of 32 integers representing a short song

    Output: list of note and chord representations as strings
    """
    prediction_output = []

    # look up the note/chord string for every integer
    for note_index in input_sequence:
        result = int_to_note[note_index]
        prediction_output.append(result)
    return prediction_output

### Inference


In [13]:
# making a prediction
decoded_song = autoencoder.predict(x_test)

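# rescaling the result to fit the embedding: fit_transform stretches each of
# the 32 positions to the range (0, maxr), and astype('int') drops the decimals
song = (predictScaler.fit_transform(decoded_song)).astype('int')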

# choose a sequence
newsong = createPattern(song[40])
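
`MinMaxScaler.fit_transform` rescales every column of the decoded output separately, so that the column's smallest value becomes 0 and its largest becomes the maximum index. That keeps every result a valid dictionary key, but it is not the exact inverse of dividing by the maximum. A possible alternative, which is not what this notebook does, is to multiply by the maximum, round, and clip. The sketch below shows that idea on made-up values; `rescale_to_indices` is a hypothetical helper.

```python
def rescale_to_indices(row, max_index):
    """Undo the division by max_index, round, and clip to valid indices."""
    indices = []
    for value in row:
        index = int(round(value * max_index))
        indices.append(min(max(index, 0), max_index))
    return indices

# made-up decoder outputs; ReLU outputs can exceed 1.0
decoded_row = [0.0, 0.26, 0.5, 1.0, 1.2]
rescaled = rescale_to_indices(decoded_row, 10)
print(rescaled)

# truncating instead of rounding maps 2.6 to 2 rather than 3
truncated = int(0.26 * 10)
print(truncated)
```

The clipping step matters because an index above the maximum would raise a `KeyError` in the dictionary lookup.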

### Create music21 stream

I'm creating a miditrack where each note has an offset of 0.5. 

In [15]:
mt = midi.MidiTrack(0)
dt = midi.DeltaTime(mt)
dt.time = 0.5
# note: mt and dt are never attached to s1 below
s1 = stream.Stream()

for item in newsong:

    if item == '':
        # rest
        s1.append(note.Rest())

    elif ('.' in item) or item.isdigit():
        # chord: dot-separated numbers, one per note
        notes_in_chord = item.split('.')
        chord_notes = []  # separate name, so the dataset list `notes` is not overwritten

        for current_note in notes_in_chord:
            new_note = note.Note(int(current_note))
            new_note.storedInstrument = instrument.Guitar()
            chord_notes.append(new_note)

        new_chord = chord.Chord(chord_notes)
        s1.append(new_chord)

    else:
        # single note
        new_note = note.Note(item)
        s1.append(new_note)

In [16]:
sp = midi.realtime.StreamPlayer(s1)

sp.play()

### Compare test set and decoded test set

In [17]:
decoded_song.mean(), x_test.mean(), decoded_song.std(), x_test.std()

(0.75711864, 0.8293800943207219, 0.29313558, 0.26314332054883544)
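
Mean and standard deviation only describe the overall distribution of values. Two sequences can agree on both and still differ at every position, so a position-by-position comparison can be a useful complement. The sketch below uses made-up integer sequences, not the actual test data, to show the effect.

```python
import statistics

original = [1, 2, 3, 4]
decoded = [4, 3, 2, 1]

same_mean = statistics.mean(original) == statistics.mean(decoded)
same_std = statistics.pstdev(original) == statistics.pstdev(decoded)

# position-by-position comparison
exact_matches = sum(o == d for o, d in zip(original, decoded))
mean_abs_error = sum(abs(o - d) for o, d in zip(original, decoded)) / len(original)
print(same_mean, same_std, exact_matches, mean_abs_error)
```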

In [None]:
# autoencoder.summary()