# Improvise a Jazz Solo with an LSTM Network


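Train an LSTM and use it to generate music:

1. Prepare the training data.
2. Build the LSTM RNN model and train it on the data.
3. Use the trained model to generate a new sequence of values.
4. Convert the generated sequence into music.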

In [1]:
from __future__ import print_function
import IPython
import sys
import numpy as np
from keras import backend as K

Using TensorFlow backend.


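## 1. Prepare the data

1. Load the music and convert it into a sequence of values (the corpus), which contains N tones in total.
2. Randomly cut m samples of length Tx out of the corpus and convert them into the training data X and Y.

Each sample is cut as follows, so that Y is X shifted one step to the left: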
```
    D1, D2, D3, ..., D(T-1), DT
X:   0, D2, D3, ..., D(T-1), DT
Y:  D2, D3, D4, ..., DT,     0
```
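
To see the shift concretely, here is a minimal sketch on a made-up four-value window. The names `shift_window`, `toy_window` and `toy_index` are illustrative and not part of the notebook; the loop mirrors the cropping shown above.

```python
import numpy as np

def shift_window(window, value_to_index):
    """One-hot encode one window so that Y[t] holds the value X sees at step t + 1."""
    Tx, n_values = len(window), len(value_to_index)
    X = np.zeros((Tx, n_values), dtype=bool)
    Y = np.zeros((Tx, n_values), dtype=bool)
    for j in range(1, Tx):                 # position 0 of X stays all-zero
        idx = value_to_index[window[j]]
        X[j, idx] = True
        Y[j - 1, idx] = True               # target for step j - 1 is the value at step j
    return X, Y

toy_window = ["C", "E", "G", "E"]          # hypothetical values, not real corpus tokens
toy_index = {"C": 0, "E": 1, "G": 2}
X_toy, Y_toy = shift_window(toy_window, toy_index)

# Y is X moved one step to the left; the first X row and last Y row are empty
print((Y_toy[:-1] == X_toy[1:]).all(), X_toy[0].any(), Y_toy[-1].any())
```

The all-zero first row of X plays the role of a "start" input, and the all-zero last row of Y means the final step has no target value.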

In [37]:
from preprocess import get_musical_data, get_corpus_data


chords, abstract_grammars = get_musical_data('data/original_metheny.mid')
corpus, tones, tones_indices, indices_tones = get_corpus_data(abstract_grammars)
N_tones = len(set(corpus))

print(corpus[0], len(corpus))

C,0.500 193
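
`get_corpus_data` comes from the accompanying `preprocess` module, whose source is not shown here. The sketch below assumes it returns a flat list of tokens plus two lookup tables between tokens and integer indices; `build_lookup_tables` and the toy corpus are hypothetical stand-ins, not the module's actual code.

```python
def build_lookup_tables(corpus):
    """Map every distinct token to an integer index and back (assumed behaviour)."""
    tones = sorted(set(corpus))            # sorted only to keep the toy example stable
    tones_indices = {tone: i for i, tone in enumerate(tones)}
    indices_tones = {i: tone for tone, i in tones_indices.items()}
    return tones, tones_indices, indices_tones

# made-up tokens in the same "letter,duration" style as corpus[0]
toy_corpus = ["C,0.500", "S,0.250", "C,0.500", "A,0.250"]
toy_tones, toy_tones_indices, toy_indices_tones = build_lookup_tables(toy_corpus)

print(len(toy_corpus), len(toy_tones))     # total tokens vs. distinct tokens
```

The number of distinct tokens is what determines the width of the one-hot vectors used in the rest of the notebook.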


In [38]:

def data_processing(corpus, values_indices, n_values, m=60, Tx=30):
    # cut the corpus into m random, possibly overlapping windows of Tx values
    np.random.seed(0)
    # Ty = Tx
    X = np.zeros((m, Tx, n_values), dtype=bool)   # np.bool was removed in NumPy 1.24
    Y = np.zeros((m, Tx, n_values), dtype=bool)
    for i in range(m):
        random_idx = np.random.choice(len(corpus) - Tx)
        corp_data = corpus[random_idx:(random_idx + Tx)]
        for j in range(Tx):
            idx = values_indices[corp_data[j]]
            if j != 0:
                # X[i, 0] stays all-zero; Y is X shifted one step to the left
                X[i, j, idx] = 1
                Y[i, j-1, idx] = 1

    # reorder Y to (Tx, m, n_values): one target array per time step
    Y = np.swapaxes(Y, 0, 1)
    return X, Y


# tones_indices and N_tones come from the previous cell
X, Y = data_processing(corpus, tones_indices, N_tones, m=60, Tx=30)
print('shape of X:', X.shape)
print('number of training examples:', X.shape[0])
print('Tx (length of sequence):', X.shape[1])
print('total # of unique values:', N_tones)
print('Shape of Y:', Y.shape)

shape of X: (60, 30, 78)
number of training examples: 60
Tx (length of sequence): 30
total # of unique values: 78
Shape of Y: (30, 60, 78)
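
Y is printed time-major, `(Tx, m, n_values)`, while X stays batch-major. The reason is that the training model defined in the next section has one output per time step, so its targets are passed as a list of `Tx` arrays of shape `(m, n_values)`. A small sketch with toy sizes (all names here are illustrative):

```python
import numpy as np

m_toy, Tx_toy, n_values_toy = 4, 3, 5                  # toy sizes, not the notebook's 60/30/78
Y_batch_major = np.zeros((m_toy, Tx_toy, n_values_toy))
Y_batch_major[2, 1, 4] = 1.0                           # example 2, time step 1, class 4

Y_time_major = np.swapaxes(Y_batch_major, 0, 1)        # shape (Tx, m, n_values)
targets = list(Y_time_major)                           # Tx arrays of shape (m, n_values)

print(Y_time_major.shape, len(targets), targets[1].shape)
```

After the swap, `targets[t]` holds the expected value of every training example at time step `t`, which lines up with the `t`-th output of the model.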


## 2. 训练模型

<img src="images/music_generation.png" style="width:600;height:400px;">

<!--
<img src="images/djmodel.png" style="width:600;height:400px;">
<br>
<caption><center> **Figure 1**: LSTM model. $X = (x^{\langle 1 \rangle}, x^{\langle 2 \rangle}, ..., x^{\langle T_x \rangle})$ is a window of size $T_x$ scanned over the musical corpus. Each $x^{\langle t \rangle}$ is an index corresponding to a value (ex: "A,0.250,< m2,P-4 >") while $\hat{y}$ is the prediction for the next value  </center></caption>
!--> 



In [18]:
from keras.models import load_model, Model
from keras.layers import Input, Reshape, LSTM, Dense, Lambda
from keras.optimizers import Adam


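# sizes must be defined before the shared layers are created
n_a = 64                  # size of the LSTM hidden state
n_values = X.shape[2]     # number of unique values (78 here)

# reusable units: the same layer objects (and weights) are applied at every time step
reshapor = Reshape((1, n_values))
LSTM_cell = LSTM(n_a, return_state=True)
densor = Dense(n_values, activation='softmax')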


def djmodel(Tx, n_values, n_a=64):
    # input layer
    X = Input(shape=(Tx, n_values))
    a0 = Input(shape=(n_a,), name='a0')
    c0 = Input(shape=(n_a,), name='c0')
    
    a = a0
    c = c0
    outputs = list()
    
    for t in range(Tx):
        # 2.A: select the "t"th time step vector from X
        # (t is bound as a default argument so each Lambda keeps its own time step)
        x = Lambda(lambda z, t=t: z[:, t, :])(X)
        # 2.B: reshape x to (1, n_values) so the LSTM sees a single time step
        x = reshapor(x)
        # 2.C: Perform one step of the LSTM_cell
        a, _, c = LSTM_cell(x, initial_state=[a, c])
        # 2.D: Apply densor to the hidden state output of LSTM_Cell
        out = densor(a)
        # 2.E: add the output to "outputs"
        outputs.append(out)
    
    model = Model(inputs=[X, a0, c0], outputs=outputs)
    return model

n_a = 64
m, Tx, n_values = X.shape
model = djmodel(Tx, n_values, n_a)
opt = Adam(lr=0.01, beta_1=0.9, beta_2=0.999, decay=0.01)
model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])
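
The loop in `djmodel` applies the same `LSTM_cell` and `densor` objects at every time step, so the weights are shared across all `Tx` steps. The NumPy sketch below illustrates that idea with a hand-written LSTM step. The stacked weight layout, the gate ordering and the names `lstm_step`, `W` and `b` are assumptions made for illustration; they do not describe how Keras stores its parameters.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, a_prev, c_prev, W, b):
    """One LSTM step with a single stacked weight matrix (illustrative layout)."""
    n_a = a_prev.shape[1]
    z = np.concatenate([a_prev, x], axis=1) @ W + b    # shape (batch, 4 * n_a)
    i = sigmoid(z[:, 0 * n_a:1 * n_a])                 # input gate
    f = sigmoid(z[:, 1 * n_a:2 * n_a])                 # forget gate
    o = sigmoid(z[:, 2 * n_a:3 * n_a])                 # output gate
    g = np.tanh(z[:, 3 * n_a:4 * n_a])                 # candidate cell values
    c = f * c_prev + i * g                             # new cell state
    a = o * np.tanh(c)                                 # new hidden state
    return a, c

rng = np.random.default_rng(0)
batch, steps, n_x, n_hidden = 2, 5, 3, 4               # toy sizes
W = rng.normal(scale=0.1, size=(n_hidden + n_x, 4 * n_hidden))   # created once
b = np.zeros(4 * n_hidden)
inputs = rng.normal(size=(batch, steps, n_x))

a = np.zeros((batch, n_hidden))                        # plays the role of a0
c = np.zeros((batch, n_hidden))                        # plays the role of c0
hidden_states = []
for t in range(steps):
    a, c = lstm_step(inputs[:, t, :], a, c, W, b)      # same W and b at every step
    hidden_states.append(a)

print(len(hidden_states), hidden_states[0].shape)
```

Only the hidden and cell states change from step to step; the parameters are created once outside the loop, which is what reusing a single layer object achieves in the Keras model.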

In [19]:
model.fit([X, np.zeros((m, n_a)), np.zeros((m, n_a))], list(Y), epochs=100, verbose=0)

<keras.callbacks.History at 0x13f7b96a0>
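
Because the model has one softmax output per time step, the quantity minimized during `fit` combines the categorical cross-entropy of all `Tx` outputs (Keras adds the per-output losses by default). A minimal sketch of that sum for a single training example, using made-up probabilities and a hypothetical helper `cross_entropy`:

```python
import math

def cross_entropy(target_one_hot, predicted_probs, eps=1e-7):
    """Categorical cross-entropy for one time step of one example."""
    return -sum(t * math.log(max(p, eps))
                for t, p in zip(target_one_hot, predicted_probs))

# two time steps, three possible values (toy numbers)
step_targets = [[0, 1, 0], [0, 0, 1]]
step_predictions = [[0.2, 0.7, 0.1], [0.3, 0.3, 0.4]]

per_step = [cross_entropy(t, p) for t, p in zip(step_targets, step_predictions)]
total_loss = sum(per_step)
print(per_step, total_loss)
```

Each term only depends on the probability assigned to the correct value at that step, so a confident wrong prediction at any single step raises the total.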

## 3. Generate a sequence


<img src="images/music_gen.png" style="width:600;height:400px;">



In [26]:
import tensorflow as tf
from keras import backend as K
from keras.layers import RepeatVector
from keras.utils import to_categorical


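def one_hot(x):
    # turn a softmax output of shape (batch, 78) into the next input of shape (batch, 1, 78)
    x = K.argmax(x)            # index of the most probable value
    x = tf.one_hot(x, 78)      # back to a one-hot vector
    x = RepeatVector(1)(x)     # add the time axis expected by the LSTM
    return x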

def music_inference_model(LSTM_cell, densor, n_values = 78, n_a = 64, Ty = 100):
    
    # input layer
    x0 = Input(shape=(1, n_values))
    a0 = Input(shape=(n_a,), name='a0')
    c0 = Input(shape=(n_a,), name='c0')
    
    a = a0
    c = c0
    x = x0

    outputs = list()
    
    for t in range(Ty):
        a, _, c = LSTM_cell(x, initial_state=[a, c])
        out = densor(a)
        outputs.append(out)
        # select the max probability index value as the next x
        x = Lambda(one_hot)(out)
        
    inference_model = Model(inputs=[x0, a0, c0], outputs=outputs)
    
    return inference_model

inference_model = music_inference_model(LSTM_cell, densor, n_values = 78, n_a = 64, Ty = 50)
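
The inference model feeds each prediction back in as the next input. The plain-Python sketch below shows that feedback loop with a stand-in for the network: `toy_step` is a hypothetical function that returns a probability list and a new state, replacing the `LSTM_cell` plus `densor` pair for illustration only.

```python
def one_hot_list(index, n_values):
    return [1.0 if i == index else 0.0 for i in range(n_values)]

def toy_step(x, state):
    """Stand-in for LSTM_cell + densor: favours a value that depends on x and state."""
    n_values = len(x)
    current = x.index(1.0) if 1.0 in x else 0          # all-zero input counts as 0
    favoured = (current + 1 + state) % n_values
    probs = [0.1 / (n_values - 1)] * n_values
    probs[favoured] = 0.9
    return probs, (state + 1) % 2

def greedy_generate(step_fn, n_values, Ty):
    x, state, indices = [0.0] * n_values, 0, []        # all-zero first input, like x0
    for _ in range(Ty):
        probs, state = step_fn(x, state)
        best = max(range(n_values), key=probs.__getitem__)   # argmax over the values
        indices.append(best)
        x = one_hot_list(best, n_values)               # the prediction becomes the next input
    return indices

generated = greedy_generate(toy_step, n_values=4, Ty=6)
print(generated)   # [1, 3, 0, 2, 3, 1]
```

Since every step takes the argmax, the generated sequence is fully determined by the initial input and state.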

In [47]:
# predict and sample

x0 = np.zeros((1, 1, 78))
a0 = np.zeros((1, n_a))
c0 = np.zeros((1, n_a))

def predict_and_sample(inference_model, x0=x0, a0=a0, c0=c0):
    pred = inference_model.predict([x0, a0, c0])
    indices = np.argmax(pred, axis = -1)
    results = to_categorical(indices, num_classes=78)
    return results, indices

In [49]:
results, indices = predict_and_sample(inference_model, x0, a0, c0)
print("np.argmax(results[12]) =", np.argmax(results[12]))
print("np.argmax(results[17]) =", np.argmax(results[17]))
print("list(indices[12:18]) =", list(indices[12:18]))

np.argmax(results[12]) = 62
np.argmax(results[17]) = 15
list(indices[12:18]) = [array([62]), array([15]), array([39]), array([70]), array([62]), array([15])]
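
`predict_and_sample` always takes the most probable value, so the output is deterministic for a given `x0`, `a0` and `c0`. A common alternative, not used in this notebook, is to sample from the predicted distribution after rescaling it with a temperature. A minimal sketch, with `sample_with_temperature` as a hypothetical helper and made-up probabilities:

```python
import math
import random

def sample_with_temperature(probs, temperature=0.5, rng=random):
    """Sample an index from probs after sharpening (T < 1) or flattening (T > 1)."""
    logits = [math.log(max(p, 1e-12)) / temperature for p in probs]
    peak = max(logits)                                  # subtract the max for numerical stability
    weights = [math.exp(value - peak) for value in logits]
    total = sum(weights)
    rescaled = [w / total for w in weights]
    index = rng.choices(range(len(probs)), weights=rescaled, k=1)[0]
    return index, rescaled

toy_rng = random.Random(0)                              # seeded for repeatable draws
toy_probs = [0.1, 0.6, 0.3]
index, rescaled = sample_with_temperature(toy_probs, temperature=0.5, rng=toy_rng)
print(index, [round(p, 3) for p in rescaled])
```

With a temperature of 0.5 each probability is effectively squared and renormalized, so the most likely value becomes even more likely while the others remain possible.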


## 4. Convert the sequence into music

In [57]:
from music21 import stream, note, tempo, midi
from qa import prune_grammar, prune_notes, clean_up_notes
from grammar import unparse_grammar

def generate_music(inference_model, corpus, abstract_grammars, tones, tones_indices, indices_tones,
                   T_y = 10, max_tries = 1000, diversity = 0.5): 
    # set up audio stream
    out_stream = stream.Stream()
    
    # Initialize chord variables
    curr_offset = 0.0                                     # variable used to write sounds to the Stream.
    num_chords = int(len(chords) / 3)                     # number of different set of chords
    
    print("Predicting new values for different set of chords.")
    # Loop over the sets of chords, starting at index 1. At each iteration, generate
    # a sequence of tones and use the current chords to convert it into actual sounds
    for i in range(1, num_chords):
        
        # Retrieve current chord from stream
        curr_chords = stream.Voice()
        
        # Loop over the chords of the current set of chords
        for j in chords[i]:
            # Add each chord at its offset modulo 4, i.e. relative to the start of the current set
            curr_chords.insert((j.offset % 4), j)
        
        # Generate a sequence of tones using the model
        _, indices = predict_and_sample(inference_model)
        indices = list(indices.squeeze())
        pred = [indices_tones[p] for p in indices]
        
        predicted_tones = 'C,0.25 '
        for k in range(len(pred) - 1):
            predicted_tones += pred[k] + ' ' 
        
        predicted_tones +=  pred[-1]
                
        #### POST PROCESSING OF THE PREDICTED TONES ####
        # We will consider "A" and "X" as "C" tones. It is a common choice.
        predicted_tones = predicted_tones.replace(' A',' C').replace(' X',' C')

        # Pruning #1: smoothing measure
        predicted_tones = prune_grammar(predicted_tones)
        
        # Use predicted tones and current chords to generate sounds
        sounds = unparse_grammar(predicted_tones, curr_chords)

        # Pruning #2: removing repeated and too close together sounds
        sounds = prune_notes(sounds)

        # Quality assurance: clean up sounds
        sounds = clean_up_notes(sounds)

        # Print number of tones/notes in sounds
        print('Generated %s sounds using the predicted values for the set of chords ("%s") and after pruning' % (len([k for k in sounds if isinstance(k, note.Note)]), i))
        
        # Insert sounds into the output stream
        for m in sounds:
            out_stream.insert(curr_offset + m.offset, m)
        for mc in curr_chords:
            out_stream.insert(curr_offset + mc.offset, mc)

        curr_offset += 4.0
        
    # Set the tempo of the output stream to 130 beats per minute
    out_stream.insert(0.0, tempo.MetronomeMark(number=130))

    # Save the audio stream to a MIDI file
    mf = midi.translate.streamToMidiFile(out_stream)
    mf.open("output/my_music.midi", 'wb')
    mf.write()
    print("Your generated music is saved in output/my_music.midi")
    mf.close()
    
    # Play the final stream through the audio output (uncomment the two lines below)
    # play = lambda x: midi.realtime.StreamPlayer(x).play()
    # play(out_stream)
    
    return out_stream

out_stream = generate_music(inference_model, corpus, abstract_grammars, tones, tones_indices, indices_tones)

Predicting new values for different set of chords.
Generated 50 sounds using the predicted values for the set of chords ("1") and after pruning
Generated 50 sounds using the predicted values for the set of chords ("2") and after pruning
Generated 51 sounds using the predicted values for the set of chords ("3") and after pruning
Generated 51 sounds using the predicted values for the set of chords ("4") and after pruning
Generated 51 sounds using the predicted values for the set of chords ("5") and after pruning
Your generated music is saved in output/my_music.midi
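
Before the grammar is unparsed, `generate_music` joins the predicted tokens into one space-separated string and rewrites tokens that start with `A` or `X` so they start with `C`. The sketch below replays that string handling on made-up tokens; `join_and_normalize` is a hypothetical helper and the token values are illustrative only.

```python
def join_and_normalize(pred, seed_token="C,0.25"):
    """Join predicted tokens after a seed token, then map a leading A or X to C."""
    predicted_tones = " ".join([seed_token] + list(pred))
    # the leading space in each pattern limits the match to the start of a token,
    # as long as the tokens themselves contain no spaces
    return predicted_tones.replace(" A", " C").replace(" X", " C")

toy_pred = ["A,0.250,<m2,P-4>", "X,0.500", "C,0.250"]   # made-up tokens
result = join_and_normalize(toy_pred)
print(result)   # C,0.25 C,0.250,<m2,P-4> C,0.500 C,0.250
```

Only the first character of each token changes; durations and interval annotations are passed through untouched.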




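## Summary

The training and prediction steps in the middle are fairly simple, but I did not understand the data conversion at the start and the end at all.

The MIDI file could not be played locally, so convert it to MP3 first and then play it (https://www.zamzar.com/convert/midi-to-mp3/).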


In [58]:
import IPython
IPython.display.Audio('./output/my_music.mp3')

**References**

The ideas presented in this notebook came primarily from the three computational music papers cited below. The implementation here also took significant inspiration and used many components from Ji-Sung Kim's GitHub repository.

- Ji-Sung Kim, 2016, [deepjazz](https://github.com/jisungk/deepjazz)
- Jon Gillick, Kevin Tang and Robert Keller, 2009. [Learning Jazz Grammars](http://ai.stanford.edu/~kdtang/papers/smc09-jazzgrammar.pdf)
- Robert Keller and David Morrison, 2007, [A Grammatical Approach to Automatic Improvisation](http://smc07.uoa.gr/SMC07%20Proceedings/SMC07%20Paper%2055.pdf)
- François Pachet, 1999, [Surprising Harmonies](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.5.7473&rep=rep1&type=pdf)

We're also grateful to François Germain for valuable feedback.