In [1]:
from lib.csv_to_array import convert as pre_process

# George Frederick Handel (robot)

```
[0, 0, Header, format, nTracks, division]
[Track, 0, Start_track]
[Track, Time, Note_off_c, Channel, Note, Velocity]
```
Check out http://www.fourmilab.ch/webtools/midicsv/ for more info.
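
A minimal sketch of how one such row breaks into fields, assuming plain comma-separated text like the midicsv rows above. `parse_event` is a hypothetical helper for illustration, not the `convert` function from `lib.csv_to_array`, and the naive split would not survive quoted text fields that contain commas.

```python
def parse_event(line):
    """Split one midicsv-style row into typed fields (hypothetical helper)."""
    fields = [f.strip() for f in line.split(',')]
    track = int(fields[0])   # track number
    time = int(fields[1])    # absolute time in MIDI ticks
    kind = fields[2]         # record type, e.g. Note_on_c
    return track, time, kind, fields[3:]

event = parse_event("1, 190, Note_on_c, 0, 68, 0")
print(event)
```

The remaining fields stay as strings here because their meaning depends on the record type.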

In [2]:
gigue = './csv/handel_hwv-433_5_gigue_(c)yamada.mid.csv'
# gigue = './csv/total.csv'

ga = pre_process(gigue)

print("i.e: \n" + str(ga[32]))

i.e: 
['1', ' 190', ' Note_on_c', ' 0', ' 68', ' 0']


We'll be defining a recurrent neural network that takes, as input, a midi 'sentence' - such as the one above - and will guess at the output at the 'word level' - i.e. it will output a similar row of length 6.

We will be training on every element of this array, but they have different behaviors
(for each index, counting from 1 rather than 0):
   1. Track - because these are keyboard suites, this will be constant in this notebook (only one instrument!) but we will treat this as a 'word-set' of size ```len(set(midi[:,0]))```
   1. Time - is not a word-set; it is a monotonically non-decreasing value and will require somewhat special attention, because we always need a numeric value >= the last numeric value by some arbitrary (learned) step?
      1. alternatively - make a word-set of the steps, midi[1:,1] - midi[:-1,1] - that should give us a pretty rich set, and should be sufficient. Or at least make a set of range(max(midi[:,1]))
   1. Action - yay for baroque music, the only real action here is 'Note_on_c' (cuz we are tapping that clavichord) but there are headers too. I'm interested to see how the recurrence will handle this
   1. Channel - another word-set, should be simple
   1. Pitch - another word-set, I'm curious about how key changes between pieces (i.e. word-sets with some limited degree of overlap) will be handled as we expand this to train on more music
   1. Velocity - let's treat this as a word-set...see what happens
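
The 'word-set of time steps' idea from the Time item can be sketched as below. The tick values are made up for illustration, and `delta_vocab` and `delta_to_ix` are assumed names, not anything defined elsewhere in this notebook.

```python
# Toy absolute tick times standing in for the Time column (made-up values).
times = [0, 0, 190, 190, 240, 480, 480, 720]

# Step between consecutive events: times[1:] - times[:-1], element-wise.
deltas = [later - earlier for earlier, later in zip(times[:-1], times[1:])]

# The 'word-set' is the sorted set of distinct steps, each with an index.
delta_vocab = sorted(set(deltas))
delta_to_ix = {d: i for i, d in enumerate(delta_vocab)}
encoded = [delta_to_ix[d] for d in deltas]

print(deltas)
print(delta_vocab)
print(encoded)
```

Simultaneous events (chords) show up as a step of 0, so 0 is likely to be a very common 'word' in this vocabulary.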

Let's think about each statement we input.
Simply put, we'll build a set of the keys present in each 'statement' - for most, but not all, of the values presented. This is for two reasons:
   - many of the values are limited by their medium (*Note* for example)
      - interesting caveat: do we train this to 'learn a piece' or 'learn an instrument'
        if the former, the vocabulary can be inferred from the set of data available. Cool!
        this also means that we would need to learn more pieces to come up with cool new
        ideas. After learning only one piece, we could only compose using the keys available
        to that piece. On the one hand, this is disappointing, because you'd think we'd want
        to experience the 'key' of the piece as an unspoken affector - such that we attempt to
        play all notes in the 'key' (or even all notes whatsoever) and learn which to play - 
        i.e. learning the instrument.
        On the other, we would focus on the essence of the piece, and would build a repertoire
        as we understand how to place pieces. Additionally, a feedback run of the system - or 
        perhaps two identical systems passing errors to one another - could be used to 
        'compose' - still we run the risk of overfitting.
        Finally, when learning new 'pieces' we do not initialize the vocabulary as we did with
        the first piece, instead we expand the set, preserve existing weights and initialize
        new ones. This will be some very interesting code - but we MUST beware of
        overfitting! My intuition is that any overfitting we experience will dissolve as more 
        pieces are learned, but it would be a shame if we couldn't get the machine to
        understand the distinction between styles, jumping from one to another on a common
        note or remaining trapped in a bad key. However, the length of the sequence compounded 
        on the hidden layers will hopefully prevent this - or make it delightfully improbable
   - some values are sequences (i.e. the time step, especially if we consider timestep @
     n = timestep @ n - 1 plus some value @ n, with the total <= length of the piece). Perhaps
     we could consider these 'vocabularies' of real numbers - but this stinks too much
     of overfitting and would result in such mechanical performances. Let's instead allow
     our system to choose a real number, and we'll test its error not against a probability, but
     against the real difference between the prescribed step and our guess. This
     could be pretty neat, and maintains our analogy of 'nascent musician'-ship
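
One way to score a real-valued time-step guess, as opposed to a probability over a vocabulary, is a squared error. This is only a sketch of that idea under the assumption that half squared error is the distance we want; `squared_error` is a hypothetical name and nothing in the cells below uses it.

```python
def squared_error(predicted_step, actual_step):
    """Half squared error and its gradient with respect to the prediction."""
    diff = predicted_step - actual_step
    return 0.5 * diff * diff, diff

# Guessing a step of 180 ticks when the piece prescribes 190.
loss, grad = squared_error(180.0, 190.0)
print(loss, grad)
```

The gradient is just the signed miss, so a guess that is too short gets pushed longer and vice versa.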

Let's think about the construct of each statement we will output. In fact, it's a series of separate rnns - one for each 'term' of the statement, each with its own unique vocabulary, and each with one (or more!) hidden layers of arbitrary size. Each rnn has full (or parametrically managed) connectivity between its input layer, subsequent hidden layers (if any), some weighted sequence of previous hidden layers (namely, input values and output weights) and its output layer. Does the sequence of previous layers need more than one layer? Can we learn which previous sequences to prefer? It would be nice if this could provide some sense of flourish, of intention.
Finally, the output for each term will be a vector, as long as that term's vocabulary, holding each index's likelihood; it is checked for error against the next value in the sequence - the subsequent input.
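
A small sketch of that output step, assuming the usual softmax-plus-cross-entropy pairing: raw scores become a likelihood per vocabulary index, and the error is the negative log of the likelihood given to the true next value. The scores are made up, and this pure-Python version is only for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target_ix):
    """Negative log-likelihood of the index that actually came next."""
    return -math.log(probs[target_ix])

scores = [2.0, 1.0, 0.1]        # one raw score per vocabulary entry (made up)
probs = softmax(scores)
loss = cross_entropy(probs, 0)  # suppose index 0 was the true next value
print(probs, loss)
```

A confident correct guess drives the loss toward 0, while a confident wrong guess makes it large.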

Let's mess around with just one predictor for now, including lstm or gated memories for that
layer. To that end, it would benefit us to make this extensible enough that it is easy to
connect the hidden layers or other inputs (wait, can't the hidden layers be the same size?!)
and the pre-sequential hidden layers of those inputs. This will give us a lovely level of
connectivity as well as the 'sanctity' of data that had concerned us with the 'mixed bag' of
data approach. Maybe it would work anyway? Might be worth investigating.

How are we going to handle the 'headers'? Shall we be setting them ourselves? Let's read them
anyway, I believe in learning... never mind, we can't read them, they have alternate rates.

In [3]:
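# pad every row out to the length of the longest row
s_len = max([len(g) for g in ga])

inval_action = ['']

p_ga = [g + [None] * (s_len - len(g)) for g in ga]

# column 4 holds the note for note events (other row types carry other values there)
notes = [g[4] for g in p_ga]

# vocabulary: each distinct value gets an integer index
n = set(notes)
nm = [(ind, na) for ind, na in enumerate(n)]

vocab_size, data_length = len(n), len(ga)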

print("there are fewer than " + str(vocab_size) 
      + " unique notes in piece with nearly "
      + str(data_length) + " events!")

# both lookups scan `nm` and return the whole (index, note) pair
note_for_ind = lambda y: [x for x in nm if x[0] == y][0]
ind_for_note = lambda y: [x for x in nm if x[1] == y][0]

there are fewer than 48 unique notes in piece with nearly 2522 events!
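
The two lookup lambdas above scan the whole `nm` list on every call. A possible alternative, sketched here with made-up data and assumed names (`toy_notes`, `note_to_ix`, `ix_to_note`), is a pair of dictionaries. Sorting the set first also keeps the index assignment stable from run to run, which plain set iteration order may not.

```python
# Made-up stand-in for `notes`: padded rows contribute None.
toy_notes = [' 68', ' 70', ' 68', None, ' 72']

# Sort by string form so None can sit alongside the note strings.
vocab = sorted(set(toy_notes), key=str)

note_to_ix = {note: i for i, note in enumerate(vocab)}
ix_to_note = dict(enumerate(vocab))

encoded = [note_to_ix[note] for note in toy_notes]
decoded = [ix_to_note[i] for i in encoded]
print(vocab)
print(encoded)
```

These return the index or the note directly rather than an (index, note) pair, so the `[0]` and `[1]` picks used later in this notebook would have to go if they were swapped in.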


In [4]:
#the fun begins
import numpy as np

# hyperparameters
hidden_size = 100 # size of hidden layer of neurons
seq_length = 25 # number of steps to unroll the RNN for
learning_rate = 1e-1

# model parameters
Wxh = np.random.randn(hidden_size, vocab_size)*0.01 # input to hidden
Whh = np.random.randn(hidden_size, hidden_size)*0.01 # hidden to hidden
Why = np.random.randn(vocab_size, hidden_size)*0.01 # hidden to output
bh = np.zeros((hidden_size, 1)) # hidden bias
by = np.zeros((vocab_size, 1)) # output bias

In [5]:
## borrowed courtesy http://karpathy.github.io/2015/05/21/rnn-effectiveness/
## by way of https://gist.github.com/karpathy/d4dee566867f8291f086
def lossFun(inputs, targets, hprev):
    """
    inputs, targets are both lists of integers.
    hprev is Hx1 array of initial hidden state
    returns the loss, gradients on model parameters, and last hidden state
    """
    xs = {} # one-hots of inputs
    hs = {} # hidden states
    ys = {} # output states
    ps = {} # probabilities
    
    hs[-1] = np.copy(hprev) #hidden state carried in from the previous chunk
    loss = 0 #hello beautiful - ah, the perfection of the unstarted
    
    #forward pass!
    for t in range(len(inputs)): #1 for each val in seq - remind u of rk4?
        xs[t] = np.zeros((vocab_size,1)) # 1-of-k shaped
        xs[t][inputs[t]] = 1 #one is hot! note the cool use of indexing
        #ooh, sexy - we're getting the hidden state by weighing the dot 
        #prods of the input -> hidden and last_hidden -> hidden
        # THAT'S WHY WE SET THE LAST HIDDEN TO IND -1!!!! MEMORY!
        hs[t] = np.tanh(np.dot(Wxh, xs[t]) + np.dot(Whh, hs[t-1]) + bh)
        ys[t] = np.dot(Why, hs[t]) + by #unnormed log prob for next word
        ps[t] = np.exp(ys[t]) / np.sum(np.exp(ys[t])) #softmax those probs
        #apply loss - cross-entropy (negative log-likelihood of the target)
        loss += -np.log(ps[t][targets[t],0])
        
    #backward pass! - might want to split this when we've got parallelized
    #interconnected networks
    
    dWxh = np.zeros_like(Wxh) #scope alert!
    dWhh = np.zeros_like(Whh) #scope alert! - need to package these for
    dWhy = np.zeros_like(Why) #scope alert! - parallelization
    
    dbh = np.zeros_like(bh) #scope alert!
    dby = np.zeros_like(by) #scope alert!
    
    dhnext = np.zeros_like(hs[0])
    
    #that's right, we backprop through every step we visited - t has power
    for t in reversed(range(len(inputs))):
        dy = np.copy(ps[t]) #start from the softmax probs - it's where we leave nn
        dy[targets[t]] -= 1 #softmax + cross-entropy gradient: probs minus one-hot target
        dWhy += np.dot(dy, hs[t].T) #bp into y weights
        dby += dy
        dh = np.dot(Why.T, dy) + dhnext #bp into h
        dhraw = (1-hs[t]*hs[t]) * dh #take bp and filter via tanh'(u) = 1 - tanh(u)^2
        dbh += dhraw #makes sense, the bias accums via the bp of output
        dWxh += np.dot(dhraw, xs[t].T) #bp hidden into inputs weights
        dWhh += np.dot(dhraw, hs[t-1].T) #bp into t-1 hidden state weights
        dhnext = np.dot(Whh.T, dhraw)
    for dparam in [dWxh, dWhh, dWhy, dbh, dby]:
        np.clip(dparam, -5,5, out=dparam) #clip to mitigate exploding grad
    return loss, dWxh, dWhh, dWhy, dbh, dby, hs[len(inputs)-1]

In [6]:
#generator - basically a forward only run
def sample(h, seed_ix, n):
    """
    sample a sequence of n integers from the model
    h is memory state, seed_ix is seed letter for first time step
    """
    x = np.zeros((vocab_size,1))
    x[seed_ix] = 1 #builds initial input - catalyst
    ixes = [] #sequence out!
    #forward pass, tidy loop
    for t in range(n):
        h = np.tanh(np.dot(Wxh, x) + np.dot(Whh, h) + bh)
        y = np.dot(Why, h) + by
        p = np.exp(y) / np.sum(np.exp(y))
        #p= option sets the probabilities of 'random' choice
        ix = np.random.choice(range(vocab_size), p=p.ravel())
        x = np.zeros((vocab_size, 1))
        x[ix] = 1
        ixes.append(ix)
    return ixes

In [7]:
#lots of definitions here
n = 0 #iteration counter (rebinds `n`, which was the note set in In [3])
p = 0 #position in the `notes` list (one entry per row of `p_ga`)

mWxh = np.zeros_like(Wxh) #needed by adagrad, could be generalized?
mWhh = np.zeros_like(Whh) #ditto
mWhy = np.zeros_like(Why) #ditto
mbh = np.zeros_like(bh) #ditto
mby = np.zeros_like(by) #ditto

smooth_loss = -1 * np.log(1.0/vocab_size) * seq_length
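
A quick arithmetic check on that starting value, using the 48-entry vocabulary and `seq_length` of 25 from this notebook: it is the loss of a model that guesses uniformly over the vocabulary at every step of one chunk. The names `per_step` and `initial` are only for this sketch.

```python
import math

vocab_size = 48   # as reported by the padding cell
seq_length = 25   # as set in the hyperparameters

# A uniform guess gives each entry probability 1/48, so each step costs ln(48).
per_step = -math.log(1.0 / vocab_size)
initial = per_step * seq_length
print(per_step, initial)
```

This lands at roughly 96.78, which matches the loss printed at iteration 0 in the training log.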

In [8]:
#loop this shit - forevah?
while True:
    #sweeping events in steps seq_length long
    if p + seq_length + 1 >= data_length or n == 0: #beginning & end of training data
        hprev = np.zeros((hidden_size,1)) #empty memory - save instead?
        p = 0 #start at the very beginning
        
    #transform first chunk of data into indices
    inputs = [ind_for_note(no)[0] for no in notes[p:p+seq_length]]
    #transform shifted+1 chunk of data for target vals
    targets = [ind_for_note(no)[0] for no in notes[p+1:p+seq_length+1]]
    
    #grab a sample every now and again
    if n % 100 == 0:
        #using the 'current' hidden layer, grab the current letter and 
        #give me 200 more
        sample_ix = sample(hprev, inputs[0], 200)
        sample_notes = [note_for_ind(i)[1] for i in sample_ix]
        #print sample_notes #what good does this do me?
    
    #run the current seq through the net, and fetch the gradients
    loss, dWxh, dWhh, dWhy, dbh, dby, hprev = lossFun(inputs, targets, hprev)
    smooth_loss = smooth_loss * 0.999 + loss * 0.001 #wow, such voodoo
    
    if n % 100 == 0:
        print('iter %d, loss %f,' % (n, smooth_loss)) #see that loss shrinkin?
    
    #adagrad - let's make this thing its own bit?!
    #seems like afterthought, Karpathy!
    #zip gives us one tuple per parameter:
    #(cur-weights, del-weights, mem-weights), where mem-weights hold the
    #running sum of squared gradients for that parameter
    for param, dparam, mem in zip([Wxh, Whh, Why, bh, by],
                                  [dWxh, dWhh, dWhy, dbh, dby],
                                  [mWxh, mWhh, mWhy, mbh, mby]):
        mem += dparam * dparam #accumulate squared gradients in place
        param += -learning_rate * dparam / np.sqrt(mem + 1e-8) #step shrinks as mem grows
        
    p += seq_length
    n += 1 
    #let's go again!

iter 0, loss 96.780038,
iter 100, loss 100.105122,
iter 200, loss 99.519638,
iter 300, loss 98.606640,
iter 400, loss 97.555256,
iter 500, loss 96.454854,
iter 600, loss 95.302661,
iter 700, loss 94.060039,
iter 800, loss 92.711923,
iter 900, loss 91.307091,
iter 1000, loss 89.762375,
iter 1100, loss 88.105645,
iter 1200, loss 86.335657,
iter 1300, loss 84.516296,
iter 1400, loss 82.657501,
iter 1500, loss 80.770168,
iter 1600, loss 78.866536,
iter 1700, loss 76.949395,
iter 1800, loss 75.170179,
iter 1900, loss 73.374703,
iter 2000, loss 71.671364,
iter 2100, loss 69.815565,
iter 2200, loss 67.955738,
iter 2300, loss 66.141716,
iter 2400, loss 64.390396,
iter 2500, loss 62.723512,
iter 2600, loss 61.040793,
iter 2700, loss 59.311900,
iter 2800, loss 57.570052,
iter 2900, loss 55.923399,
iter 3000, loss 54.354960,
iter 3100, loss 52.765734,
iter 3200, loss 51.231508,
iter 3300, loss 49.794897,
iter 3400, loss 48.331195,
iter 3500, loss 46.939484,
iter 3600, loss 45.595768,
iter 3700, l

KeyboardInterrupt: 
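
To see what the Adagrad memory in the training loop is doing, here is a toy sketch on a single number instead of weight matrices. It minimizes the made-up function f(w) = (w - 3)^2. `adagrad_step` is a hypothetical helper, and the learning rate of 1.0 is chosen only so the toy converges quickly; it is not the notebook's 1e-1.

```python
import math

def adagrad_step(param, grad, mem, learning_rate=1.0, eps=1e-8):
    """One Adagrad update on a scalar: mem is the running sum of squared grads."""
    mem += grad * grad
    param -= learning_rate * grad / math.sqrt(mem + eps)
    return param, mem

w, mem = 0.0, 0.0
for _ in range(200):
    grad = 2.0 * (w - 3.0)   # derivative of (w - 3)^2
    w, mem = adagrad_step(w, grad, mem)

print(w, mem)
```

Because the memory only grows, a parameter that has seen large gradients takes smaller steps later on, and squaring keeps the accumulated value positive regardless of gradient sign.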

In [None]:
sample_ix = sample(hprev, inputs[0], 3500)
sample_notes = [note_for_ind(i)[1] for i in sample_ix]
print(sample_notes)