## E30 - Music Transformer
### Music Generation 모델의 발전
분 단위의 음악을 만드는데 처음으로 성공한 모델은 google의 magenta(2018)프로젝트의 music transformer이다.  
https://arxiv.org/pdf/1809.04281.pdf  
https://magenta.tensorflow.org/music-transformer  
음악에는 반복적인 구성이 있는다. A-B-A'  
RNN은 반복적인 구성을 만들지 못하고 vanilla transformer는 처음엔 보여주지만 끝까지 유지하지 못한다.  

MuseNet  
https://openai.com/blog/musenet/  
GPT-2를 기반으로 한 음악 생성 모델  

Jukebox  
https://openai.com/blog/jukebox/  
musenet이 기악 분야의 음악 합성이라면 jukebox는 특정 가수의 음색, 음정과 가사전달까지 가능해짐  
### Music Transformer 시스템 개요
#### 음악 생성 모델 전체 구조
* Transcription : wave -> midi  
* Symbolic modeling : 심볼릭 음악 생성 모델, music transformer가 해당되는 부분  
* Synthesis : 생성된 midi -> wave, conditional wavenet 모델이 사용됨  
음악합성모델은 wave2midi2wave 구조로 이루어짐.  

Memory-Effective Relative-Global Attention Model  

### MAESTRO 데이터셋
google의 megenta 프로젝트에서 공개한 데이터셋  
https://magenta.tensorflow.org/datasets/maestro  

mkdir -p ~/aiffel/music_transformer/data  
mkdir -p ~/aiffel/music_transformer/models  
wget https://storage.googleapis.com/magentadata/datasets/maestro/v2.0.0/maestro-v2.0.0-midi.zip  
mv maestro-v2.0.0-midi.zip ~/aiffel/music_transformer/data  
cd ~/aiffel/music_transformer/data && unzip maestro-v2.0.0-midi.zip  

midi 파일 들어보기  
Audacious 프로그램 사용  
설치 방법 : https://vitux.com/how-to-install-audacious-audio-player-on-ubuntu/  
soundfont 프로그램 필요  
설치 방법 : https://askubuntu.com/questions/801069/audacious-how-to-play-midi-files  

midi 파일구조 분석  
pip install mido  

In [1]:
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.preprocessing.sequence import pad_sequences

import pandas as pd
import numpy as np

import time
import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"
import concurrent.futures

import mido

In [None]:
tf.config.list_physical_devices('GPU')

In [2]:
# 샘플로 1개의 MIDI 파일을 골라봅니다.
midi_file = os.getenv('HOME')+'/aiffel/music_transformer/data/maestro-v2.0.0/2018/MIDI-Unprocessed_Chamber1_MID--AUDIO_07_R3_2018_wav--2.midi'

midi = mido.MidiFile(midi_file)

In [3]:
ON = 1
OFF = 0
CC = 2

current_time = 0
eventlist = []
cc = False
for idx, msg in enumerate(midi):
    print('MSG [{}]----------------'.format(idx))
    current_time += msg.time
    print(current_time)
    print(msg.type)
    if msg.type == 'note_on' and msg.velocity > 0:
        event = [current_time, ON, msg.note, msg.velocity]
        print(event)
    elif msg.type == 'note_off' or (msg.type == 'note_on' and msg.velocity == 0):
        event = [current_time, OFF, msg.note, msg.velocity]
        print(event)
        
    if msg.type == 'control_change':
        if msg.control != 64:
            continue
        if cc == False and msg.value > 0:
            cc = True
            event = [current_time, CC, 0, 1]
            print(event)
        elif cc == True and msg.value == 0:
            cc = False
            event = [current_time, CC, 0, 0]
            print(event)

    if idx > 30:
        break

MSG [0]----------------
0
set_tempo
MSG [1]----------------
0
time_signature
MSG [2]----------------
0
program_change
MSG [3]----------------
0
control_change
MSG [4]----------------
0
control_change
MSG [5]----------------
0.5143229166666666
control_change
MSG [6]----------------
0.6328125
control_change
MSG [7]----------------
0.7903645833333333
control_change
MSG [8]----------------
0.9999999999999999
control_change
MSG [9]----------------
1.0325520833333333
note_on
[1.0325520833333333, 1, 74, 86]
MSG [10]----------------
1.0442708333333333
note_on
[1.0442708333333333, 1, 38, 77]
MSG [11]----------------
1.0794270833333333
control_change
MSG [12]----------------
1.1184895833333333
control_change
MSG [13]----------------
1.1588541666666665
control_change
MSG [14]----------------
1.2174479166666665
control_change
MSG [15]----------------
1.2265624999999998
note_on
[1.2265624999999998, 0, 74, 0]
MSG [16]----------------
1.2369791666666665
control_change
MSG [17]----------------
1.23958

이벤트 메시지 타입에 따라 바뀐다.  
control_change는 셋팅, note_on은 실제 악보 부분  
이벤트 구성은 [음 지속시간, ON/OFF, 음고(pitch), 속도(velocity)]  
학습을 위한 midi 파일을 전처리하는 함수 제공 get_data()  
time, note, interval 등에 대한 augmentation도 진행

In [4]:
IntervalDim = 100

VelocityDim = 32
VelocityOffset = IntervalDim

NoteOnDim = NoteOffDim = 128
NoteOnOffset = IntervalDim + VelocityDim
NoteOffOffset = IntervalDim + VelocityDim + NoteOnDim

CCDim = 2
CCOffset = IntervalDim + VelocityDim + NoteOnDim + NoteOffDim

EventDim = IntervalDim + VelocityDim + NoteOnDim + NoteOffDim + CCDim # 390

def get_data(data, length):    
    # time augmentation
    data[:, 0] *= np.random.uniform(0.80, 1.20)
    
    # absolute time to relative interval
    data[1:, 0] = data[1:, 0] - data[:-1, 0]
    data[0, 0] = 0
    
    # discretize interval into IntervalDim
    data[:, 0] = np.clip(np.round(data[:, 0] * IntervalDim), 0, IntervalDim - 1)
    
    # Note augmentation
    data[:, 2] += np.random.randint(-6, 6)
    data[:, 2] = np.clip(data[:, 2], 0, NoteOnDim - 1)
    
    eventlist = []
    for d in data:
        # append interval
        interval = d[0]
        eventlist.append(interval)
    
        # note on case
        if d[1] == 1:
            velocity = (d[3] / 128) * VelocityDim + VelocityOffset
            note = d[2] + NoteOnOffset
            eventlist.append(velocity)
            eventlist.append(note)
            
        # note off case
        elif d[1] == 0:
            note = d[2] + NoteOffOffset
            eventlist.append(note)
        # CC
        elif d[1] == 2:
            event = CCOffset + d[3]
            eventlist.append(event)
            
    eventlist = np.array(eventlist).astype(np.int)
    
    if len(eventlist) > (length+1):
        start_index = np.random.randint(0, len(eventlist) - (length+1))
        eventlist = eventlist[start_index:start_index+(length+1)]
        
    # pad zeros
    if len(eventlist) < (length+1):
        pad = (length+1) - len(eventlist)
        eventlist = np.pad(eventlist, (pad, 0), 'constant')
        
    x = eventlist[:length]
    y = eventlist[1:length+1]
    
    return x, y

wget https://aiffelstaticprd.blob.core.windows.net/media/documents/midi_test.zip  
mv midi_test.zip ~/aiffel/music_transformer/data  
unzip midi_test.zip  

midi_test파일은 midi파일에 대해 전처리한 파일

In [None]:
# # 전체 midi 파일을 전처리하는 코드

# with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
#     future = [executor.submit(get_eventlist, d_path) for d_path in midi_path]
#     return_value = [f.result() for f in future]

# return_value = np.array(return_value)
# np.save('midi_test.npy', return_value) 

### Music Transformer 모델 구현
데이터셋 구성

In [5]:
data_path = os.getenv('HOME')+'/aiffel/music_transformer/data/midi_test.npy'

get_midi = np.load(data_path, allow_pickle=True)
get_midi.shape

(1282,)

In [6]:
length = 256
train = []
labels = []

for midi_list in get_midi:
    cut_list = [midi_list[i:i+length] for i in range(0, len(midi_list), length)]
    for sublist in cut_list:
        x, y = get_data(np.array(sublist), length)
        train.append(x)
        labels.append(y)

In [7]:
train = np.array(train)
labels = np.array(labels)

print(train.shape, labels.shape)   # 학습을 위해 MIDI list를 256 길이로 나누었다.
# label은 train 데이터를 1만큼 shift한 것임.
# 자연어 처리와 비슷한 데이터셋 구성을 가짐.

(59268, 256) (59268, 256)


In [8]:
train_data_pad = pad_sequences(train,
                               maxlen=length,
                               padding='post',
                               value=0)
train_label_pad = pad_sequences(labels,
                                maxlen=length,
                                padding='post',
                                value=0)

In [9]:
def tensor_casting(train, label):
    train = tf.cast(train, tf.int64)
    label = tf.cast(label, tf.int64)

    return train, label

In [10]:
train_dataset = tf.data.Dataset.from_tensor_slices((train_data_pad, train_label_pad))
train_dataset = train_dataset.map(tensor_casting)
train_dataset = train_dataset.shuffle(10000).batch(batch_size=16)

In [11]:
for t,l in train_dataset.take(1):
    print(t)
    print(l)

tf.Tensor(
[[309   2 108 ... 110 173  10]
 [114 189  20 ... 156   7 319]
 [  4 333   2 ...   1 111 201]
 ...
 [  5 319   2 ... 116 172   2]
 [  4 117 185 ... 185   1 311]
 [  9 119 219 ... 113 193   0]], shape=(16, 256), dtype=int64)
tf.Tensor(
[[  2 108 173 ... 173  10 301]
 [189  20 305 ...   7 319   8]
 [333   2 117 ... 111 201   1]
 ...
 [319   2 120 ... 172   2 117]
 [117 185   1 ...   1 311   0]
 [119 219   1 ... 193   0 114]], shape=(16, 256), dtype=int64)


Music Transformer 모델 구현

In [12]:
def create_padding_mask(seq):
    seq = tf.cast(tf.math.equal(seq, 1), tf.float32)

    # add extra dimensions to add the padding
    # to the attention logits.
    return seq[:, tf.newaxis, tf.newaxis, :]  # (batch_size, 1, 1, seq_len)


def create_look_ahead_mask(size):
    mask = tf.linalg.band_part(tf.ones((size, size)), -1, 0)
    return mask  # (seq_len, seq_len)


def point_wise_feed_forward_network(d_model, dff):
    return tf.keras.Sequential([
        tf.keras.layers.Dense(dff, activation='relu'),  # (batch_size, seq_len, dff)
        tf.keras.layers.Dense(d_model)  # (batch_size, seq_len, d_model)
    ])

In [13]:
# self-attention을 대신하는 Relative-global-attention
class RelativeGlobalAttention(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads):
        super(RelativeGlobalAttention, self).__init__()
        self.num_heads = num_heads
        self.d_model = d_model
        self.headDim = d_model // num_heads
        self.contextDim = int(self.headDim * self.num_heads)
        self.eventDim = 390
        self.E = self.add_weight('E', shape=[self.num_heads, length, self.headDim])

        assert d_model % self.num_heads == 0

        self.wq = tf.keras.layers.Dense(self.headDim)
        self.wk = tf.keras.layers.Dense(self.headDim)
        self.wv = tf.keras.layers.Dense(self.headDim)
    
    def call(self, v, k, q, mask):
        # [Heads, Batch, Time, HeadDim]
        q = tf.stack([self.wq(q) for _ in range(self.num_heads)])
        k = tf.stack([self.wk(k) for _ in range(self.num_heads)])
        v = tf.stack([self.wv(v) for _ in range(self.num_heads)])

        self.batch_size = q.shape[1]
        self.max_len = q.shape[2]
        
        #skewing
        # E = Heads, Time, HeadDim
        # [Heads, Batch * Time, HeadDim]
        Q_ = tf.reshape(q, [self.num_heads, self.batch_size * self.max_len, self.headDim])
        # [Heads, Batch * Time, Time]
        S = tf.matmul(Q_, self.E, transpose_b=True)
        # [Heads, Batch, Time, Time]
        S = tf.reshape(S, [self.num_heads, self.batch_size, self.max_len, self.max_len])
        # [Heads, Batch, Time, Time+1]
        S = tf.pad(S, ((0, 0), (0, 0), (0, 0), (1, 0)))
        # [Heads, Batch, Time+1, Time]
        S = tf.reshape(S, [self.num_heads, self.batch_size, self.max_len + 1, self.max_len])   
        # [Heads, Batch, Time, Time]
        S = S[:, :, 1:]
        # [Heads, Batch, Time, Time]
        attention = (tf.matmul(q, k, transpose_b=True) + S) / np.sqrt(self.headDim)
        # mask tf 2.0 == tf.linalg.band_part
        get_mask = tf.linalg.band_part(tf.ones([self.max_len, self.max_len]), -1, 0)
        attention = attention * get_mask - tf.cast(1e10, attention.dtype) * (1-get_mask)
        score = tf.nn.softmax(attention, axis=3)

        # [Heads, Batch, Time, HeadDim]
        context = tf.matmul(score, v)
        # [Batch, Time, Heads, HeadDim]
        context = tf.transpose(context, [1, 2, 0, 3])
        # [Batch, Time, ContextDim]
        context = tf.reshape(context, [self.batch_size, self.max_len, self.d_model])
        # [Batch, Time, ContextDim]
        logits = tf.keras.layers.Dense(self.d_model)(context)

        return logits, score

In [14]:
class EncoderLayer(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads, dff, rate=0.1):
        super(EncoderLayer, self).__init__()

        self.rga = RelativeGlobalAttention(d_model, num_heads)
        self.ffn = point_wise_feed_forward_network(d_model, dff)

        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

        self.dropout1 = tf.keras.layers.Dropout(rate)
        self.dropout2 = tf.keras.layers.Dropout(rate)

    def call(self, x, training, mask):
        attn_output, _ = self.rga(x, x, x, mask)  # (batch_size, input_seq_len, d_model)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(x + attn_output)  # (batch_size, input_seq_len, d_model)

        ffn_output = self.ffn(out1)  # (batch_size, input_seq_len, d_model)
        ffn_output = self.dropout2(ffn_output, training=training)
        out2 = self.layernorm2(out1 + ffn_output)  # (batch_size, input_seq_len, d_model)

        return out2

In [15]:
class DecoderLayer(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads, dff, rate=0.1):
        super(DecoderLayer, self).__init__()

        self.rga1 = RelativeGlobalAttention(d_model, num_heads)
        self.rga2 = RelativeGlobalAttention(d_model, num_heads)

        self.ffn = point_wise_feed_forward_network(d_model, dff)

        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm3 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

        self.dropout1 = tf.keras.layers.Dropout(rate)
        self.dropout2 = tf.keras.layers.Dropout(rate)
        self.dropout3 = tf.keras.layers.Dropout(rate)

    def call(self, x, enc_output, training, look_ahead_mask, padding_mask):
        # enc_output.shape == (batch_size, input_seq_len, d_model)

        attn1, attn_weights_block1 = self.rga1(x, x, x, look_ahead_mask)  # (batch_size, target_seq_len, d_model)
        attn1 = self.dropout1(attn1, training=training)
        out1 = self.layernorm1(attn1 + x)

        attn2, attn_weights_block2 = self.rga2(
            enc_output, enc_output, out1, padding_mask)  # (batch_size, target_seq_len, d_model)
        attn2 = self.dropout2(attn2, training=training)
        out2 = self.layernorm2(attn2 + out1)  # (batch_size, target_seq_len, d_model)

        ffn_output = self.ffn(out2)  # (batch_size, target_seq_len, d_model)
        ffn_output = self.dropout3(ffn_output, training=training)
        out3 = self.layernorm3(ffn_output + out2)  # (batch_size, target_seq_len, d_model)

        return out3, attn_weights_block1, attn_weights_block2

In [16]:
class Encoder(tf.keras.layers.Layer):
    def __init__(self, num_layers, d_model, num_heads, dff, rate=0.1):
        super(Encoder, self).__init__()

        self.num_layers = num_layers
        self.enc_layers = [EncoderLayer(d_model, num_heads, dff, rate) 
                           for _ in range(num_layers)]

        self.dropout = tf.keras.layers.Dropout(rate)

    def call(self, x, training, mask):
        seq_len = tf.shape(x)[1]
        x = self.dropout(x, training=training)

        for i in range(self.num_layers):
            x = self.enc_layers[i](x, training, mask)

        return x  # (batch_size, input_seq_len, d_model)

In [17]:
class Decoder(tf.keras.layers.Layer):
    def __init__(self, num_layers, d_model, num_heads, dff, rate=0.1):
        super(Decoder, self).__init__()
        self.num_layers = num_layers
        self.dec_layers = [DecoderLayer(d_model, num_heads, dff, rate) 
                           for _ in range(num_layers)]
        self.dropout = tf.keras.layers.Dropout(rate)

    def call(self, x, enc_output, training, look_ahead_mask, padding_mask):
        attention_weights = {}
        x = self.dropout(x, training=training)

        for i in range(self.num_layers):
            x, block1, block2 = self.dec_layers[i](x, enc_output, training,
                                                   look_ahead_mask, padding_mask)

            attention_weights['decoder_layer{}_block1'.format(i+1)] = block1
            attention_weights['decoder_layer{}_block2'.format(i+1)] = block2

        # x.shape == (batch_size, target_seq_len, d_model)
        return x, attention_weights

In [18]:
class MusicTransformer(tf.keras.Model):
    def __init__(self, num_layers, d_model, num_heads, dff, input_vocab_size, rate=0.1):
        super(MusicTransformer, self).__init__()
        self.d_model = d_model
        self.embedding = tf.keras.layers.Embedding(input_vocab_size, d_model)

        self.encoder = Encoder(num_layers, d_model, num_heads, dff, rate)
        self.decoder = Decoder(num_layers, d_model, num_heads, dff, rate)

        self.final_layer = tf.keras.layers.Dense(input_vocab_size)

    def call(self, inp, training, enc_padding_mask, 
             look_ahead_mask, dec_padding_mask):
        embed = self.embedding(inp)
        embed *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))

        enc_output = self.encoder(embed, training, enc_padding_mask)  # (batch_size, inp_seq_len, d_model)

        # dec_output.shape == (batch_size, tar_seq_len, d_model)
        dec_output, attention_weights = self.decoder(
            embed, enc_output, training, look_ahead_mask, dec_padding_mask)

        final_output = self.final_layer(dec_output)  # (batch_size, tar_seq_len, target_vocab_size)

        return final_output, attention_weights

### Music Transformer 모델 학습

In [19]:
num_layers = 4
d_model = 128
dff = 512
num_heads = 8

input_vocab_size = 390   # MIDI가 낼 수 있는 소리의 종류
dropout_rate = 0.1

In [20]:
# 모델 선언
music_transformer = MusicTransformer(num_layers, d_model, num_heads, dff,
                                     input_vocab_size, rate=dropout_rate)

In [21]:
class CustomSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, d_model, warmup_steps=4000):
        super(CustomSchedule, self).__init__()

        self.d_model = d_model
        self.d_model = tf.cast(self.d_model, tf.float32)

        self.warmup_steps = warmup_steps

    def __call__(self, step):
        arg1 = tf.math.rsqrt(step)
        arg2 = step * (self.warmup_steps ** -1.5)

        return tf.math.rsqrt(self.d_model) * tf.math.minimum(arg1, arg2)

In [22]:
learning_rate = CustomSchedule(d_model)

optimizer = tf.keras.optimizers.Adam(learning_rate, beta_1=0.9, beta_2=0.98, 
                                     epsilon=1e-9)

자연어 모델이 다음에 올 단어를 맞추는 classification task처럼  
midi 생성 모델도 390가지 음향 종류 중에서 어느 것이 올 지 맞추는 문제로 구성된다.

In [23]:
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')

In [24]:
def loss_function(real, pred):
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = loss_object(real, pred)

    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask

    return tf.reduce_sum(loss_)/tf.reduce_sum(mask)

In [25]:
train_loss = tf.keras.metrics.Mean(name='train_loss')

In [26]:
checkpoint_path = os.getenv('HOME')+'/aiffel/music_transformer/models/'

ckpt = tf.train.Checkpoint(music_transformer=music_transformer,
                           optimizer=optimizer)

ckpt_manager = tf.train.CheckpointManager(ckpt, checkpoint_path, max_to_keep=5)

# if a checkpoint exists, restore the latest checkpoint.
if ckpt_manager.latest_checkpoint:
    ckpt.restore(ckpt_manager.latest_checkpoint)
    print ('Latest checkpoint restored!!')

In [27]:
#EPOCHS = 20  
EPOCHS = 10 # 1epoch가 매우 오래 걸립니다. 

for epoch in range(EPOCHS):
    start = time.time()

    train_loss.reset_states()

    for (batch, (inp, tar)) in enumerate(train_dataset):
        with tf.GradientTape() as tape:
            predictions, _ = music_transformer(inp, True, None, None, None)
            loss = loss_function(tar, predictions)

        gradients = tape.gradient(loss, music_transformer.trainable_variables)    
        optimizer.apply_gradients(zip(gradients, music_transformer.trainable_variables))

        train_loss(loss)

        if batch % 50 == 0:
            print ('Epoch {} Batch {} Loss {:.4f}'.format(
                epoch + 1, batch, train_loss.result()))

    if (epoch + 1) % 2 == 0:
        ckpt_save_path = ckpt_manager.save()
        print ('Saving checkpoint for epoch {} at {}'.format(epoch+1,
                                                             ckpt_save_path))

    print ('Epoch {} Loss {:.4f}'.format(epoch + 1, train_loss.result()))

    print ('Time taken for 1 epoch: {} secs\n'.format(time.time() - start))

Epoch 1 Batch 0 Loss 6.3414
Epoch 1 Batch 50 Loss 6.2065
Epoch 1 Batch 100 Loss 6.1536
Epoch 1 Batch 150 Loss 6.0593
Epoch 1 Batch 200 Loss 5.9519
Epoch 1 Batch 250 Loss 5.8335
Epoch 1 Batch 300 Loss 5.7133
Epoch 1 Batch 350 Loss 5.6035
Epoch 1 Batch 400 Loss 5.5052
Epoch 1 Batch 450 Loss 5.4219
Epoch 1 Batch 500 Loss 5.3526
Epoch 1 Batch 550 Loss 5.2872
Epoch 1 Batch 600 Loss 5.2049
Epoch 1 Batch 650 Loss 5.1086
Epoch 1 Batch 700 Loss 5.0128
Epoch 1 Batch 750 Loss 4.9260
Epoch 1 Batch 800 Loss 4.8480
Epoch 1 Batch 850 Loss 4.7790
Epoch 1 Batch 900 Loss 4.7163
Epoch 1 Batch 950 Loss 4.6606
Epoch 1 Batch 1000 Loss 4.6100
Epoch 1 Batch 1050 Loss 4.5641
Epoch 1 Batch 1100 Loss 4.5222
Epoch 1 Batch 1150 Loss 4.4829
Epoch 1 Batch 1200 Loss 4.4470
Epoch 1 Batch 1250 Loss 4.4135
Epoch 1 Batch 1300 Loss 4.3831
Epoch 1 Batch 1350 Loss 4.3550
Epoch 1 Batch 1400 Loss 4.3286
Epoch 1 Batch 1450 Loss 4.3040
Epoch 1 Batch 1500 Loss 4.2808
Epoch 1 Batch 1550 Loss 4.2586
Epoch 1 Batch 1600 Loss 4.2376


Epoch 4 Batch 1650 Loss 3.4184
Epoch 4 Batch 1700 Loss 3.4183
Epoch 4 Batch 1750 Loss 3.4182
Epoch 4 Batch 1800 Loss 3.4183
Epoch 4 Batch 1850 Loss 3.4188
Epoch 4 Batch 1900 Loss 3.4188
Epoch 4 Batch 1950 Loss 3.4187
Epoch 4 Batch 2000 Loss 3.4185
Epoch 4 Batch 2050 Loss 3.4184
Epoch 4 Batch 2100 Loss 3.4183
Epoch 4 Batch 2150 Loss 3.4183
Epoch 4 Batch 2200 Loss 3.4181
Epoch 4 Batch 2250 Loss 3.4180
Epoch 4 Batch 2300 Loss 3.4180
Epoch 4 Batch 2350 Loss 3.4180
Epoch 4 Batch 2400 Loss 3.4180
Epoch 4 Batch 2450 Loss 3.4179
Epoch 4 Batch 2500 Loss 3.4180
Epoch 4 Batch 2550 Loss 3.4183
Epoch 4 Batch 2600 Loss 3.4183
Epoch 4 Batch 2650 Loss 3.4183
Epoch 4 Batch 2700 Loss 3.4185
Epoch 4 Batch 2750 Loss 3.4182
Epoch 4 Batch 2800 Loss 3.4182
Epoch 4 Batch 2850 Loss 3.4183
Epoch 4 Batch 2900 Loss 3.4182
Epoch 4 Batch 2950 Loss 3.4184
Epoch 4 Batch 3000 Loss 3.4184
Epoch 4 Batch 3050 Loss 3.4183
Epoch 4 Batch 3100 Loss 3.4180
Epoch 4 Batch 3150 Loss 3.4178
Epoch 4 Batch 3200 Loss 3.4177
Epoch 4 

KeyboardInterrupt: 

### Music Generation 테스트
20Epoch 진행한 모델 체크포인트 파일을 이용하여 음악 생성 테스트 진행하기
wget https://aiffelstaticprd.blob.core.windows.net/media/documents/models.zip  
mv models.zip ~/aiffel/music_transformer/models  
cd ~/aiffel/music_transformer/models && unzip models.zip  

In [28]:
tf.train.latest_checkpoint(checkpoint_path)

'/home/ubuntu/aiffel/music_transformer/models/ckpt-11'

In [29]:
# 별도의 테스트 데이터셋은 없으므로 학습에 사용했던 데이터의 첫 번째 스텝의 데이터를 이용
test_dataset = tf.data.Dataset.from_tensor_slices((train_data_pad, train_label_pad))
test_dataset = test_dataset.map(tensor_casting)
test_dataset = test_dataset.shuffle(10000).batch(batch_size=1)

In [30]:
# 자연어 모델처럼 inference 단계는 step-by-step으로 진행. 예측된 단어를 다음의 입력으로 사용
N = 1000
_inputs = np.zeros([1, N], dtype=np.int32)

for x, y in test_dataset.take(1):
    _inputs[:, :length] = x[None, :]
    
for i in range(N - length):
    predictions, _ = music_transformer(_inputs[:, i:i+length], False, None, None, None)
    predictions = tf.squeeze(predictions, 0)    
    
    # select the last word from the seq_len dimension
    predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()
    print(predicted_id)
    
    # 예측된 단어를 다음 입력으로 모델에 전달
    # 이전 은닉 상태와 함께
    _inputs[:, i+length] = predicted_id

_inputs.shape

320
3
116
206
4
326
1
120
195
6
113
174
1
308
3
323
4
326
6
118
190
2
115
208
1
116
191
1
111
203
3
319
8
111
195
1
315
2
315
4
115
205
1
333
9
310
7
309
9
336
1
118
199
6
333
3
331
1
112
189
4
115
220
11
118
221
11
120
225
1
334
2
122
213
9
317
1
120
177
3
359
15
121
200
23
313
4
388
5
349
10
123
223
1
120
222
1
123
177
2
349
7
331
7
127
180
7
117
203
2
112
204
3
116
190
2
311
7
389
1
330
19
124
223
4
117
222
5
327
3
388
19
343
10
115
177
23
120
231
1
118
223
1
120
221
6
121
206
15
112
212
8
342
7
110
225
7
346
6
389
21
119
225
6
122
202
4
332
3
350
17
119
222
2
348
17
116
229
1
116
214
3
113
209
10
116
216
4
294
15
340
15
388
2
115
225
3
357
6
123
218
2
116
219
2
344
1
304
1
338
2
310
10
126
231
20
122
232
6
121
227
9
121
224
3
115
204
7
345
5
338
3
315
1
348
4
115
204
8
330
24
124
230
4
119
232
2
291
2
347
43
123
192
2
359
16
118
202
5
338
30
120
212
5
341
32
116
196
12
117
220
18
125
208
5
330
2
349
9
116
205
14
302
6
339
4
340
26
388
21
313
50
332
3
118
213
21
319
23
121
218
18
12

(1, 1000)

In [31]:
# MIDI 파일로 복원하는 클래스
class Event():
    def __init__(self, time, note, cc, on, velocity):
        self.time = time
        self.note = note
        self.on = on
        self.cc = cc
        self.velocity = velocity

    def get_event_sequence(self):
        return [self.time, self.note, int(self.on)]

class Note():
    def __init__(self):
        self.pitch = 0
        self.start_time = 0
        self.end_time = 0

In [32]:
event_list = []
time = 0
event = None

EventDim = IntervalDim + VelocityDim + NoteOnDim + NoteOffDim # 388

for _input in _inputs[0]:
    # interval
    if _input < IntervalDim: 
        time += _input
        event = Event(time, 0, False, 0, 0)

    # velocity
    elif _input < NoteOnOffset:
        if event is None:
            continue
        event.velocity = (_input - VelocityOffset) / VelocityDim * 128

    # note on
    elif _input < NoteOffOffset:
        if event is None:
            continue

        event.note = _input - NoteOnOffset
        event.on = True
        event_list.append(event)

        event = None

    # note off
    elif _input < CCOffset:
        if event is None:
            continue
        event.note = _input - NoteOffOffset
        event.on = False
        event_list.append(event)
        event = None

    ## CC
    else:
        if event is None:
            continue
        event.cc = True
        on = _input - CCOffset == 1
        event.on = on
        event_list.append(event)
        event = None

In [33]:
# 이벤트를 바탕으로 midi파일 만들기
from mido import Message, MidiFile, MidiTrack, MetaMessage, bpm2tempo

midi = MidiFile()
output_midi_path = os.getenv('HOME')+'/aiffel/music_transformer/data/output_file.mid'

# Instantiate a MIDI Track (contains a list of MIDI events)
track = MidiTrack()
track.append(MetaMessage("set_tempo", tempo=bpm2tempo(120)))
# Append the track to the pattern
midi.tracks.append(track)

prev_time = 0
pitches = [None for _ in range(128)]
for event in event_list:
    tick = (event.time - prev_time) // 3
    midi.ticks_per_beat = 8
    prev_time = event.time

    # case NOTE:
    if not event.cc:
        if event.on:
            if pitches[event.note] is not None:
                # Instantiate a MIDI note off event, append it to the track
                off = Message('note_off', note=event.note, velocity=0, time=0)
                track.append(off)
                pitches[event.note] = None

            # Instantiate a MIDI note on event, append it to the track
            on = Message('note_on', note=event.note, velocity=int(event.velocity), time=tick)
            track.append(on)
            pitches[event.note] = prev_time
        else:
            # Instantiate a MIDI note off event, append it to the track
            off = Message('note_off', note=event.note, velocity=0, time=tick)
            track.append(off)
            pitches[event.note] = None

#     case CC:
    elif event.cc:
        if event.on:
            cc = Message('control_change', control=64, time=tick, value=127)
        else:
            cc = Message('control_change', control=64, time=tick, value=0)

        track.append(cc)

    for pitch in range(128):
        if pitches[pitch] is not None and pitches[pitch] + 100 < prev_time:
            off = Message('note_off', note=pitch, velocity=0, time=0)
            track.append(off)
            pitches[pitch] = None


# Add the end of track event, append it to the track
track.append(MetaMessage("end_of_track"))

# Save the pattern to disk
midi.save(output_midi_path)

for i, track in enumerate(midi.tracks):
    print('Track {}: {}'.format(i, track.name))
    for msg in track:
        print(msg)

print('done')

Track 0: 
<meta message set_tempo tempo=500000 time=0>
note_on channel=0 note=77 velocity=92 time=11
note_on channel=0 note=80 velocity=96 time=0
note_on channel=0 note=89 velocity=88 time=0
note_on channel=0 note=65 velocity=100 time=0
note_on channel=0 note=84 velocity=92 time=0
note_on channel=0 note=53 velocity=92 time=0
note_on channel=0 note=60 velocity=92 time=0
note_on channel=0 note=56 velocity=76 time=0
note_on channel=0 note=87 velocity=36 time=1
note_off channel=0 note=87 velocity=0 time=1
control_change channel=0 control=64 value=127 time=6
note_off channel=0 note=65 velocity=0 time=20
note_off channel=0 note=89 velocity=0 time=1
note_off channel=0 note=84 velocity=0 time=2
note_off channel=0 note=53 velocity=0 time=0
note_off channel=0 note=60 velocity=0 time=0
note_off channel=0 note=77 velocity=0 time=0
note_off channel=0 note=80 velocity=0 time=0
note_off channel=0 note=77 velocity=0 time=0
note_off channel=0 note=56 velocity=0 time=0
note_off channel=0 note=53 velocit