In [1]:
from lib.csv_to_array import convert as pre_process

# George Frederick Handel (robot)

```
[0, 0, Header, format, nTracks, division]
[Track, 0, Start_track]
[Track, Time, Note_off_c, Channel, Note, Velocity]
```
See http://www.fourmilab.ch/webtools/midicsv/ for more information on this row format.

In [2]:
gigue = './csv/handel_hwv-433_5_gigue_(c)yamada.mid.csv'
# gigue = './csv/total.csv'

ga = pre_process(gigue)

print "i.e: \n" + str(ga[32])

i.e: 
['1', ' 190', ' Note_on_c', ' 0', ' 68', ' 0']
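
The row above still carries the leading spaces left over from splitting on commas, and every field is a string. A minimal sketch of tidying one row, assuming `pre_process` does little more than split each CSV line (`clean_row` is a hypothetical helper, not part of `lib.csv_to_array`):

```python
def clean_row(row):
    """Strip whitespace and cast integer-like fields to int.

    Hypothetical helper for illustration; not part of lib.csv_to_array.
    """
    cleaned = []
    for field in row:
        field = field.strip()
        try:
            cleaned.append(int(field))
        except ValueError:
            # non-numeric fields such as 'Note_on_c' stay as strings
            cleaned.append(field)
    return cleaned

row = ['1', ' 190', ' Note_on_c', ' 0', ' 68', ' 0']
print(clean_row(row))
```

The rest of this notebook works on the raw strings, which is fine for a word-set: the vocabulary only needs the values to be hashable and comparable for equality.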


we'll be defining a recurrent neural network that takes, as input, a midi 'sentence' - such as the one above - and guesses the next one at the 'word level' - i.e. it will output a similar row of length 6.

we will be training on every element of this array, but the elements behave differently.
(one entry per index; item 1 below is index 0):
   1. Track - because these are keyboard suites, this will be constant in this notebook (only one instrument!) but we will treat this as a 'word-set' of size ```len(set(midi[:,0]))```
   1. Time - is not a word-set; it is a non-decreasing value (simultaneous events share a timestamp) and will require somewhat special attention, because each new value must be >= the last one by some arbitrary (learned) step
      1. alternatively - make a word-set of the steps midi[1:,1] - midi[:-1,1] - that should give us a pretty rich set, and should be sufficient. Or at least make a set from range(max(midi[:,1]))
   1. Action - yay for baroque music, the only real action here is 'Note_on_c' (cuz we are tapping that clavichord) but there are headers too. I'm interested to see how the recurrence will handle this
   1. Channel - another word-set, should be simple
   1. Pitch - another word-set. I'm curious about how key changes between pieces (i.e. word-sets with some limited degree of overlap) will be handled as we expand this to train on more music
   1. Velocity - let's treat this as a word-set...see what happens
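
The word-sets listed above can be sketched as follows. This is only an illustration: the rows are made-up values in the `[Track, Time, Action, Channel, Pitch, Velocity]` layout, and the names `rows`, `names`, `vocab` and `delta_vocab` are assumptions, not variables used elsewhere in this notebook.

```python
import numpy as np

# Toy rows in the [Track, Time, Action, Channel, Pitch, Velocity] layout;
# the values are invented for illustration, not taken from the gigue.
rows = [
    [1, 0,   'Note_on_c', 0, 68, 64],
    [1, 190, 'Note_on_c', 0, 68, 0],
    [1, 190, 'Note_on_c', 0, 71, 64],
    [1, 380, 'Note_on_c', 0, 71, 0],
]
names = ['track', 'time', 'action', 'channel', 'pitch', 'velocity']

# One sorted vocabulary ("word-set") per column.
vocab = {name: sorted(set(r[i] for r in rows)) for i, name in enumerate(names)}

# Alternative for Time: a word-set over the steps between consecutive events.
times = np.array([r[1] for r in rows])
deltas = np.diff(times)                    # times[1:] - times[:-1]
delta_vocab = sorted(set(deltas.tolist()))

print(vocab['pitch'])
print(delta_vocab)
```

Sorting each set makes the index assigned to every value reproducible between runs, which a plain `list(set(...))` does not guarantee.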

Let's think about each statement we input.
Simply put, we'll build a set of the keys present in each 'statement' - for most, but not all, of the values presented. This is for two reasons:
   - many of the values are limited by their medium (*Note* for example)
      - interesting caveat: do we train this to 'learn a piece' or 'learn an instrument'
        if the former, the vocabulary can be inferred from the set of data available. Cool!
        this also means that we would need to learn more pieces to come up with cool new
        ideas. After learning only one piece, we could only compose using the keys available
        to that piece. On the one hand, this is disappointing, because you'd think we'd want
        to experience the 'key' of the piece as an unspoken affector - such that we attempt to
        play all notes in the 'key' (or even all notes whatsoever) and learn which to play - 
        i.e. learning the instrument.
        On the other, we would focus on the essence of the piece, and would build a repertoire
        as we understand how to place pieces. Additionally, a feedback run of the system - or 
        perhaps two identical systems passing errors to one another - could be used to 
        'compose' - still we run the risk of overfitting.
        Finally, when learning new 'pieces' we do not initialize the vocabulary as we did with
        the first piece; instead we expand the set, preserve existing weights and initialize
        new ones. This will be some very interesting code - but we MUST beware of
        overfitting! My intuition is that any overfitting we experience will dissolve as more
        pieces are learned, but it would be a shame if we couldn't get the machine to
        understand the distinction between styles, and it ended up jumping from one to another
        on a common note or remaining trapped in a bad key. However, the length of the sequence
        compounded on the hidden layers will hopefully prevent this - or make it delightfully improbable.
   - some values are sequences (i.e. the time step, especially if we consider timestep @
     n = timestep @ n - 1 plus some value @ n, bounded by the length of the piece). Perhaps we
     could consider these 'vocabularies' of real numbers - but this stinks too much of
     overfitting and would result in such mechanical performances. Let's instead allow
     our system to choose a real number, and we'll test its error not against a probability, but
     against the actual difference between the prescribed step and our guess. This
     could be pretty neat, and maintains our analogy of 'nascent musician'-ship
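
The idea in the first bullet of expanding the set while preserving existing weights could look roughly like this. It is a sketch under assumptions: a one-hot input and softmax output, with input weights of shape (hidden, vocab), output weights of shape (vocab, hidden) and an output bias of shape (vocab, 1). `expand_vocab` is a hypothetical name.

```python
import numpy as np

def expand_vocab(Wxh, Why, by, n_new, scale=0.01):
    """Grow the vocabulary by n_new entries, keeping the learned weights.

    Assumed shapes: Wxh (hidden, vocab), Why (vocab, hidden), by (vocab, 1).
    Hypothetical helper, sketched for illustration only.
    """
    hidden_size = Wxh.shape[0]
    # new input columns and output rows start as small random values
    Wxh_new = np.hstack([Wxh, np.random.randn(hidden_size, n_new) * scale])
    Why_new = np.vstack([Why, np.random.randn(n_new, hidden_size) * scale])
    # new output biases start at zero
    by_new = np.vstack([by, np.zeros((n_new, 1))])
    return Wxh_new, Why_new, by_new

Wxh = np.random.randn(100, 48) * 0.01
Why = np.random.randn(48, 100) * 0.01
by = np.zeros((48, 1))
Wxh2, Why2, by2 = expand_vocab(Wxh, Why, by, n_new=5)
print(Wxh2.shape)
```

The hidden-to-hidden weights and hidden bias do not depend on the vocabulary size, so they carry over unchanged. Any per-parameter optimizer state (such as Adagrad's accumulated squared gradients) would need the same padding.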

Let's think about the construction of each statement we will output. In fact, it's a series of separate RNNs - one for each 'term' of the statement, each with its own vocabulary and one (or more!) hidden layers of arbitrary size. Each has full (or parametrically managed) connectivity between its input layer, subsequent hidden layers (if any), some weighted sequence of previous hidden layers (namely, input values and output weights) and its output layer. Does the sequence of previous layers need more than one layer? Can we learn which previous sequences to prefer? It would be nice if this could provide some sense of flourish, of intention.
Finally, the output will be a vector as long as the vocabulary, holding the likelihood of each index, checked for error against the next value in the sequence - the subsequent input.
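
For a single time step, that likelihood vector and its error against the next value can be sketched like this. The four-entry vocabulary and the target index are toy values chosen for illustration, and the max-subtraction in `softmax` is a standard numerical-stability trick rather than something the notebook's own code does.

```python
import numpy as np

def softmax(y):
    """Numerically stable softmax over a column vector of scores."""
    e = np.exp(y - np.max(y))
    return e / np.sum(e)

vocab_size = 4                    # toy vocabulary, not the notes of the gigue
y = np.zeros((vocab_size, 1))     # untrained scores: every entry equally likely
p = softmax(y)                    # likelihood of each index
target = 2                        # index of the value that actually came next
loss = -np.log(p[target, 0])      # cross-entropy against the next value
print(loss)
```

With uniform scores the loss equals the natural log of the vocabulary size, which gives a useful baseline for judging whether training is doing better than guessing.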

let's dick around with just one predictor for now, including lstm or gated memories for that 
layer. To that end, it would benefit us to make this extensible enough to make it easy to
connect the hidden layers or other inputs (wait, can't the hidden layers be the same size?!)
and the pre-sequential hidden layers of those inputs. This will give us a lovely level of
connectivity as well as the 'sanctity' of data that had concerned us with the 'mixed bag' of 
data approach. Maybe it would work anyway? Might be worth investigating.
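
For reference, one forward step of a gated memory layer could look like the following. This is a sketch of a standard LSTM cell with all four gates packed into one weight matrix; the packing order, the names `lstm_step`, `W` and `b`, and the sizes are assumptions made for illustration, not code used later in this notebook.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One forward step of a standard LSTM cell.

    x: (input_size, 1); h_prev, c_prev: (hidden_size, 1)
    W: (4 * hidden_size, input_size + hidden_size); b: (4 * hidden_size, 1)
    """
    H = h_prev.shape[0]
    z = np.dot(W, np.vstack([x, h_prev])) + b
    i = sigmoid(z[0:H])            # input gate
    f = sigmoid(z[H:2 * H])        # forget gate
    o = sigmoid(z[2 * H:3 * H])    # output gate
    g = np.tanh(z[3 * H:4 * H])    # candidate cell values
    c = f * c_prev + i * g         # new cell state
    h = o * np.tanh(c)             # new hidden state
    return h, c

input_size, hidden_size = 48, 100
W = np.random.randn(4 * hidden_size, input_size + hidden_size) * 0.01
b = np.zeros((4 * hidden_size, 1))
x = np.zeros((input_size, 1))
x[3] = 1                           # one-hot input, index chosen arbitrarily
h, c = lstm_step(x, np.zeros((hidden_size, 1)), np.zeros((hidden_size, 1)), W, b)
print(h.shape)
```

Because the hidden state has the same shape whatever the vocabulary size, several per-term predictors with equal `hidden_size` could share or exchange their hidden states without any reshaping.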

how are we going to handle the 'headers'? Shall we be setting them ourselves? Let's read them
anyway, I believe in learning...never mind, we can't read them, they have alternate rates.

In [3]:
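# pad every row to the length of the longest row, filling with None
s_len = max([len(g) for g in ga])
p_ga = [g + [None] * (s_len - len(g)) for g in ga]

# take column 4 (the note) from every row
# NOTE: `g is not None` is always true because each g is a list, so any padded
# None or header field sitting in column 4 stays in `notes` and in the vocabulary
notes = [g[4] for g in p_ga if g is not None]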

data = notes
chars = list(set(notes))
data_size, vocab_size = len(data), len(chars)
print 'data has %d characters, %d unique.' % (data_size, vocab_size)
char_to_ix = { ch:i for i,ch in enumerate(chars) }
ix_to_char = { i:ch for i,ch in enumerate(chars) }

data has 2522 characters, 48 unique.


In [5]:
import numpy as np

# hyperparameters
hidden_size = 100 # size of hidden layer of neurons
seq_length = 25 # number of steps to unroll the RNN for
learning_rate = 1e-1

# model parameters
Wxh = np.random.randn(hidden_size, vocab_size)*0.01 # input to hidden
Whh = np.random.randn(hidden_size, hidden_size)*0.01 # hidden to hidden
Why = np.random.randn(vocab_size, hidden_size)*0.01 # hidden to output
bh = np.zeros((hidden_size, 1)) # hidden bias
by = np.zeros((vocab_size, 1)) # output bias

def lossFun(inputs, targets, hprev):
    """
    inputs,targets are both list of integers.
    hprev is Hx1 array of initial hidden state
    returns the loss, gradients on model parameters, and last hidden state
    """
    xs, hs, ys, ps = {}, {}, {}, {}
    hs[-1] = np.copy(hprev)
    loss = 0
    # forward pass
    for t in xrange(len(inputs)):
        xs[t] = np.zeros((vocab_size,1)) # encode in 1-of-k representation
        xs[t][inputs[t]] = 1
        hs[t] = np.tanh(np.dot(Wxh, xs[t]) + np.dot(Whh, hs[t-1]) + bh) # hidden state
        ys[t] = np.dot(Why, hs[t]) + by # unnormalized log probabilities for next chars
        ps[t] = np.exp(ys[t]) / np.sum(np.exp(ys[t])) # probabilities for next chars
        loss += -np.log(ps[t][targets[t],0]) # softmax (cross-entropy loss)
    # backward pass: compute gradients going backwards
    dWxh, dWhh, dWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
    dbh, dby = np.zeros_like(bh), np.zeros_like(by)
    dhnext = np.zeros_like(hs[0])
    for t in reversed(xrange(len(inputs))):
        dy = np.copy(ps[t])
        dy[targets[t]] -= 1 # backprop into y
        dWhy += np.dot(dy, hs[t].T)
        dby += dy
        dh = np.dot(Why.T, dy) + dhnext # backprop into h
        dhraw = (1 - hs[t] * hs[t]) * dh # backprop through tanh nonlinearity
        dbh += dhraw
        dWxh += np.dot(dhraw, xs[t].T)
        dWhh += np.dot(dhraw, hs[t-1].T)
        dhnext = np.dot(Whh.T, dhraw)
    for dparam in [dWxh, dWhh, dWhy, dbh, dby]:
        np.clip(dparam, -5, 5, out=dparam) # clip to mitigate exploding gradients
    return loss, dWxh, dWhh, dWhy, dbh, dby, hs[len(inputs)-1]

def sample(h, seed_ix, n):
    """ 
    sample a sequence of integers from the model 
    h is memory state, seed_ix is seed letter for first time step
    """
    x = np.zeros((vocab_size, 1))
    x[seed_ix] = 1
    ixes = []
    for t in xrange(n):
        h = np.tanh(np.dot(Wxh, x) + np.dot(Whh, h) + bh)
        y = np.dot(Why, h) + by
        p = np.exp(y) / np.sum(np.exp(y))
        ix = np.random.choice(range(vocab_size), p=p.ravel())
        x = np.zeros((vocab_size, 1))
        x[ix] = 1
        ixes.append(ix)
    return ixes

n, p = 0, 0
mWxh, mWhh, mWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
mbh, mby = np.zeros_like(bh), np.zeros_like(by) # memory variables for Adagrad
smooth_loss = -np.log(1.0/vocab_size)*seq_length # loss at iteration 0
while True:
    # prepare inputs (we're sweeping from left to right in steps seq_length long)
    if p+seq_length+1 >= len(data) or n == 0: 
        hprev = np.zeros((hidden_size,1)) # reset RNN memory
        p = 0 # go from start of data
    inputs = [char_to_ix[ch] for ch in data[p:p+seq_length]]
    targets = [char_to_ix[ch] for ch in data[p+1:p+seq_length+1]]

    # sample from the model now and then
#     if n % 100 == 0:
#         sample_ix = sample(hprev, inputs[0], 200)
#         #txt = ''.join(ix_to_char[ix] for ix in sample_ix)
#         #print '----\n %s \n----' % (txt, )
#         print [ix_to_char[ix] for ix in sample_ix]
        
    # forward seq_length characters through the net and fetch gradient
    loss, dWxh, dWhh, dWhy, dbh, dby, hprev = lossFun(inputs, targets, hprev)
    smooth_loss = smooth_loss * 0.999 + loss * 0.001
    if n % 100 == 0: print 'iter %d, loss: %f' % (n, smooth_loss) # print progress

    # perform parameter update with Adagrad
    for param, dparam, mem in zip([Wxh, Whh, Why, bh, by], 
                                  [dWxh, dWhh, dWhy, dbh, dby], 
                                  [mWxh, mWhh, mWhy, mbh, mby]):
        mem += dparam * dparam
        param += -learning_rate * dparam / np.sqrt(mem + 1e-8) # adagrad update

    p += seq_length # move data pointer
    n += 1 # iteration counter 

iter 0, loss: 96.780021
iter 100, loss: 99.476972
iter 200, loss: 99.155917
iter 300, loss: 98.563555
iter 400, loss: 97.838707
iter 500, loss: 96.990253
iter 600, loss: 95.994777
iter 700, loss: 94.816457
iter 800, loss: 93.510016
iter 900, loss: 92.124258
iter 1000, loss: 90.676350
iter 1100, loss: 89.185807
iter 1200, loss: 87.680817
iter 1300, loss: 86.180209
iter 1400, loss: 84.708533
iter 1500, loss: 83.275258
iter 1600, loss: 81.861108
iter 1700, loss: 80.507925
iter 1800, loss: 79.160043
iter 1900, loss: 77.810920
iter 2000, loss: 76.489888
iter 2100, loss: 75.206052
iter 2200, loss: 73.941959
iter 2300, loss: 72.714622
iter 2400, loss: 71.520853
iter 2500, loss: 70.324548
iter 2600, loss: 69.136681
iter 2700, loss: 67.981954
iter 2800, loss: 66.861618
iter 2900, loss: 65.793330
iter 3000, loss: 64.706179
iter 3100, loss: 63.688216
iter 3200, loss: 62.678156
iter 3300, loss: 61.723505
iter 3400, loss: 60.791027
iter 3500, loss: 59.878329
iter 3600, loss: 59.016415
iter 3700, lo

KeyboardInterrupt: 
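
As a sanity check on the log above, the starting value of `smooth_loss` is just the loss of a uniform guess over the 48-entry vocabulary, summed over one 25-step sequence, which comes to about 96.78. The same arithmetic turns the last complete log line into a rough per-note reading. This is only a sketch: the smoothed loss lags the true loss, so the second number is an estimate.

```python
import numpy as np

vocab_size, seq_length = 48, 25

# Loss of a uniform guess over the vocabulary, summed over one sequence.
initial = -np.log(1.0 / vocab_size) * seq_length
print('%.2f' % initial)

# Last complete line of the log above: smoothed loss per 25-step sequence.
last_logged = 59.016415
per_step = last_logged / seq_length    # average loss per note, in nats
perplexity = np.exp(per_step)          # effective number of equally likely notes
print('%.2f' % perplexity)
```

By that reading the model has narrowed its guess from 48 candidates to a little over 10 at the point of interruption.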

In [8]:
sample_ix = sample(hprev, inputs[0], 2523)
print sample_ix

[47, 43, 7, 40, 17, 40, 17, 42, 15, 42, 41, 41, 6, 43, 12, 42, 15, 42, 41, 12, 41, 8, 10, 46, 32, 45, 32, 45, 6, 43, 27, 23, 12, 45, 26, 45, 6, 16, 1, 45, 26, 45, 32, 45, 6, 43, 7, 16, 43, 27, 43, 27, 15, 18, 26, 34, 34, 21, 18, 19, 19, 27, 43, 12, 45, 26, 14, 23, 8, 15, 15, 2, 26, 15, 33, 15, 33, 19, 27, 19, 27, 18, 43, 38, 20, 38, 20, 10, 43, 45, 10, 45, 27, 23, 27, 24, 15, 39, 15, 40, 14, 40, 14, 20, 45, 47, 14, 24, 40, 23, 27, 24, 18, 39, 15, 40, 7, 24, 18, 39, 15, 40, 19, 24, 43, 24, 27, 40, 27, 24, 18, 39, 15, 40, 14, 40, 14, 20, 32, 19, 47, 23, 23, 7, 18, 12, 12, 44, 7, 40, 12, 44, 10, 43, 45, 23, 14, 23, 20, 21, 20, 32, 42, 15, 11, 23, 12, 23, 12, 23, 12, 16, 23, 12, 12, 18, 44, 43, 7, 17, 12, 12, 10, 42, 15, 46, 6, 46, 6, 43, 27, 43, 43, 39, 20, 39, 20, 24, 41, 27, 43, 11, 43, 12, 45, 26, 45, 23, 18, 23, 26, 45, 6, 6, 21, 21, 4, 18, 39, 18, 22, 9, 41, 6, 43, 27, 43, 28, 28, 46, 6, 6, 47, 3, 47, 40, 17, 40, 42, 15, 42, 15, 42, 12, 24, 40, 12, 40, 18, 25, 10, 25, 38, 38, 15, 24,
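
The sampled values are indices into the vocabulary, not MIDI pitches. A sketch of mapping them back and writing midicsv-style rows follows. The dictionary and index list are toy stand-ins for `ix_to_char` and `sample_ix`, every vocabulary entry is assumed to be a pitch string, and the fixed track, channel, velocity and time step are hypothetical values, since this single predictor models only the note column.

```python
# Toy stand-ins: in the notebook, ix_to_char comes from the vocabulary cell
# and sample_ix from sample(); the values here are invented.
ix_to_char = {0: ' 68', 1: ' 71', 2: ' 73'}
sample_ix = [0, 2, 1, 1]

# int() tolerates the leading space left over from the CSV split
pitches = [int(ix_to_char[ix]) for ix in sample_ix]

# Hypothetical fixed values for the fields this predictor does not model.
track, channel, velocity, step = 1, 0, 64, 190
rows = ['%d, %d, Note_on_c, %d, %d, %d' % (track, i * step, channel, p, velocity)
        for i, p in enumerate(pitches)]

for r in rows:
    print(r)
```

Rows in this shape would still need the header and track records from the format summary at the top before a midicsv-style tool could turn them back into a MIDI file.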