# Semantic Music Tagging

The dependencies needed to run this notebook are listed in `environment.yml`. The environment can also be created automatically with Anaconda: `conda env create -f environment.yml`

In [1]:
from pathlib import Path
from collections import OrderedDict
import time
import pandas as pd
import numpy as np
import copy
from sklearn.metrics import roc_auc_score

import torch
from torch.utils.data import Dataset, DataLoader
import torch.nn as nn
import torch.optim as optim
import torchaudio
from torchaudio.transforms import MelSpectrogram, AmplitudeToDB

## Dataset

**Source:** http://mirg.city.ac.uk/codeapps/the-magnatagatune-dataset

### Structure

- Clip info: `data/clip_info_final.csv`
- Annotation info: `data/annotations_final.csv`
- MP3 files: `data/audio`

In [2]:
clips = pd.read_csv('./data/clip_info_final.csv', delimiter='\t')
clips.head()

Unnamed: 0,clip_id,track_number,title,artist,album,url,segmentStart,segmentEnd,original_url,mp3_path
0,2,1,BWV54 - I Aria,American Bach Soloists,J.S. Bach Solo Cantatas,http://www.magnatune.com/artists/albums/abs-so...,30,59,http://he3.magnatune.com/all/01--BWV54%20-%20I...,f/american_bach_soloists-j_s__bach_solo_cantat...
1,6,1,BWV54 - I Aria,American Bach Soloists,J.S. Bach Solo Cantatas,http://www.magnatune.com/artists/albums/abs-so...,146,175,http://he3.magnatune.com/all/01--BWV54%20-%20I...,f/american_bach_soloists-j_s__bach_solo_cantat...
2,10,1,BWV54 - I Aria,American Bach Soloists,J.S. Bach Solo Cantatas,http://www.magnatune.com/artists/albums/abs-so...,262,291,http://he3.magnatune.com/all/01--BWV54%20-%20I...,f/american_bach_soloists-j_s__bach_solo_cantat...
3,11,1,BWV54 - I Aria,American Bach Soloists,J.S. Bach Solo Cantatas,http://www.magnatune.com/artists/albums/abs-so...,291,320,http://he3.magnatune.com/all/01--BWV54%20-%20I...,f/american_bach_soloists-j_s__bach_solo_cantat...
4,12,1,BWV54 - I Aria,American Bach Soloists,J.S. Bach Solo Cantatas,http://www.magnatune.com/artists/albums/abs-so...,320,349,http://he3.magnatune.com/all/01--BWV54%20-%20I...,f/american_bach_soloists-j_s__bach_solo_cantat...


In [3]:
annotations = pd.read_csv('./data/annotations_final.csv', delimiter='\t')
annotations.head()

Unnamed: 0,clip_id,no voice,singer,duet,plucking,hard rock,world,bongos,harpsichord,female singing,...,rap,metal,hip hop,quick,water,baroque,women,fiddle,english,mp3_path
0,2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,f/american_bach_soloists-j_s__bach_solo_cantat...
1,6,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,f/american_bach_soloists-j_s__bach_solo_cantat...
2,10,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,f/american_bach_soloists-j_s__bach_solo_cantat...
3,11,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,f/american_bach_soloists-j_s__bach_solo_cantat...
4,12,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,f/american_bach_soloists-j_s__bach_solo_cantat...


### Filter dataset
Remove damaged/empty files

In [4]:
# found with "find <dir> -type f -empty"
empty_files = np.array([
    '8/jacob_heringman-josquin_des_prez_lute_settings-19-gintzler__pater_noster-204-233.mp3',
    '9/american_baroque-dances_and_suites_of_rameau_and_couperin-25-le_petit_rien_xiveme_ordre_couperin-88-117.mp3',
    '6/norine_braun-now_and_zen-08-gently-117-146.mp3'
])

annotations = annotations[~annotations['mp3_path'].isin(empty_files)]
clips = clips[~clips['mp3_path'].isin(empty_files)]
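
The list of empty files above was gathered with `find`. As a rough Python equivalent, the sketch below walks a directory and collects zero-byte files as paths relative to the root, which is the form `mp3_path` uses. The helper name `find_empty_files` and the temporary demo directory are assumptions made for illustration, not part of the notebook.

```python
import tempfile
from pathlib import Path

def find_empty_files(root, pattern='*.mp3'):
    """Return root-relative POSIX paths of all zero-byte files matching pattern."""
    root = Path(root)
    return sorted(
        p.relative_to(root).as_posix()
        for p in root.rglob(pattern)
        if p.is_file() and p.stat().st_size == 0
    )

# demo on a throwaway directory standing in for data/audio
with tempfile.TemporaryDirectory() as tmp:
    audio_dir = Path(tmp)
    (audio_dir / '8').mkdir()
    (audio_dir / '8' / 'empty.mp3').touch()                 # zero bytes
    (audio_dir / '8' / 'ok.mp3').write_bytes(b'\x00' * 16)  # non-empty
    found = find_empty_files(audio_dir)

print(found)
```

The result could be passed straight to `isin` in place of the hand-written array.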

Keep only the 50 most frequent tags and drop clips that carry none of them

In [5]:
top50_tags = annotations.iloc[:,1:-1].sum(axis=0).sort_values(ascending=False)[:50].index.to_numpy()
annotations = annotations[annotations[top50_tags].sum(axis=1) > 0]
tags_to_remove = annotations.columns[1:-1].difference(top50_tags)
annotations = annotations.drop(tags_to_remove, axis=1)
clips = clips[clips['clip_id'].isin(annotations['clip_id'])]
top50_tags

array(['guitar', 'classical', 'slow', 'techno', 'strings', 'drums',
       'electronic', 'rock', 'fast', 'piano', 'ambient', 'beat', 'violin',
       'vocal', 'synth', 'female', 'indian', 'opera', 'male', 'singing',
       'vocals', 'no vocals', 'harpsichord', 'loud', 'quiet', 'flute',
       'woman', 'male vocal', 'no vocal', 'pop', 'soft', 'sitar', 'solo',
       'man', 'classic', 'choir', 'voice', 'new age', 'dance',
       'male voice', 'female vocal', 'beats', 'harp', 'cello', 'no voice',
       'weird', 'country', 'metal', 'female voice', 'choral'],
      dtype=object)
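
The tag filter can be checked on a toy table. This is a minimal sketch with made-up clips and tags (`guitar`, `piano`, `kazoo` and the file names are illustrative), keeping the top 2 tags instead of the top 50; it follows the same three steps of ranking tags, dropping clips without any kept tag, and dropping the remaining tag columns.

```python
import pandas as pd

toy = pd.DataFrame({
    'clip_id':  [1, 2, 3, 4],
    'guitar':   [1, 1, 1, 0],
    'piano':    [0, 1, 1, 0],
    'kazoo':    [0, 0, 0, 1],
    'mp3_path': ['a.mp3', 'b.mp3', 'c.mp3', 'd.mp3'],
})

# tag columns sit between clip_id and mp3_path
tag_columns = toy.columns[1:-1]

# rank tags by how many clips carry them and keep the 2 most frequent
top_tags = toy[tag_columns].sum(axis=0).sort_values(ascending=False).head(2).index.tolist()

# drop clips that carry none of the kept tags, then drop the other tag columns
kept = toy[toy[top_tags].sum(axis=1) > 0]
kept = kept.drop(columns=tag_columns.difference(top_tags))

print(top_tags)
print(kept)
```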

### Split dataset
Split dataset randomly while making sure that clips from the same track don't end up in different splits

In [6]:
# split tracks
track_nums = clips['track_number'].unique()
np.random.shuffle(track_nums)
train_split, val_split, test_split = np.split(track_nums, [int(len(track_nums) * 0.8), int(len(track_nums) * 0.9)])

# assign all clips from tracks to their corresponding split
def clip_files_from_tracks(track_nums):
    relevant_clips = clips[clips['track_number'].isin(track_nums)]
    df = relevant_clips.merge(annotations.drop('mp3_path', axis=1))
    labels = df[top50_tags].to_numpy()
    files = 'data/audio/' + df['mp3_path'].to_numpy()
    return files, labels

train_clips, train_labels = clip_files_from_tracks(train_split)
val_clips, val_labels = clip_files_from_tracks(val_split)
test_clips, test_labels = clip_files_from_tracks(test_split)

print('Train clips:', len(train_clips))
print('Val clips:', len(val_clips))
print('Test clips:', len(test_clips))

Train clips: 15902
Val clips: 3393
Test clips: 1813
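
The idea behind the split is to shuffle group identifiers rather than clips, so every clip follows its group into exactly one split. The sketch below shows this in plain Python. The notebook groups by `track_number`; the composite `(album, track)` key used here is a hypothetical alternative, and `split_groups` is an illustrative helper, not code from the notebook.

```python
import random

def split_groups(group_ids, fractions=(0.8, 0.1, 0.1), seed=0):
    """Shuffle the unique group ids and cut them into train/val/test sets."""
    unique = sorted(set(group_ids))
    random.Random(seed).shuffle(unique)  # seeded for a reproducible split
    cut1 = int(len(unique) * fractions[0])
    cut2 = int(len(unique) * (fractions[0] + fractions[1]))
    return set(unique[:cut1]), set(unique[cut1:cut2]), set(unique[cut2:])

# toy data: 5 albums x 2 tracks x 3 clips per track
clip_groups = [(album, track)
               for album in ['a', 'b', 'c', 'd', 'e']
               for track in (1, 2)
               for _ in range(3)]

train, val, test = split_groups(clip_groups)

# each clip inherits the split of its group
clip_split = ['train' if g in train else 'val' if g in val else 'test'
              for g in clip_groups]
print(clip_split.count('train'), clip_split.count('val'), clip_split.count('test'))
```

Because the cut is made on groups, the clip counts per split only match the target fractions when groups are of similar size.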


### Load dataset

Try to adhere to the parameters used in the reference implementation

In [7]:
N_MELS = 96
N_FFT = 512
BATCH_SIZE = 24

class MagnaTagATuneDataset(Dataset):
    def __init__(self, files, labels, sample_rate=12000):
        self.files = files
        self.labels = torch.from_numpy(labels.astype(np.float32)) # pytorch expects 32-bit floats
        self.sample_rate = sample_rate
        self.transform = nn.Sequential(
            MelSpectrogram(
                sample_rate=sample_rate,
                n_mels=N_MELS,
                n_fft=N_FFT,
                hop_length=256
            ),
            AmplitudeToDB()
        )
    
    def __getitem__(self, index):
        file = self.files[index]
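        # the file's native sample rate is discarded and no resampling is done,
        # so `sample_rate` only parameterizes the mel filterbank
        waveform, _ = torchaudio.load(file)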
        mel_spec = self.transform(waveform)
        return mel_spec, self.labels[index]
    
    def __len__(self):
        return len(self.files)

In [8]:
loader = {
    mode: DataLoader(MagnaTagATuneDataset(clips, labels), shuffle=True, batch_size=BATCH_SIZE, num_workers=4, pin_memory=True)
    for mode, clips, labels in [('train', train_clips, train_labels), ('val', val_clips, val_labels), ('test', test_clips, test_labels)]
}

print('Batch shape =', next(iter(loader['test']))[0].shape)
print('Label shape = ', next(iter(loader['test']))[1].shape)

Batch shape = torch.Size([24, 1, 96, 1821])
Label shape =  torch.Size([24, 50])
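
The time dimension of 1821 frames can be related to the clip length with a little arithmetic. The sketch below assumes torchaudio's default centered framing, where the frame count is `1 + n_samples // hop_length`, and inverts it for two candidate sample rates. The helper names are illustrative. Given that the clip table lists 29-second segments, the result would be consistent with the audio being decoded at roughly 16 kHz rather than the 12 kHz passed to `MelSpectrogram`.

```python
HOP_LENGTH = 256

def n_frames(n_samples, hop_length=HOP_LENGTH):
    # centered STFT framing: one frame per hop, plus one
    return 1 + n_samples // hop_length

def implied_seconds(frames, sample_rate, hop_length=HOP_LENGTH):
    # shortest clip duration that yields this many frames
    return (frames - 1) * hop_length / sample_rate

print(n_frames(29 * 16000))            # frames for exactly 29 s at 16 kHz
print(n_frames(29 * 12000))            # frames for exactly 29 s at 12 kHz
print(implied_seconds(1821, 16000))    # about 29.1 s
print(implied_seconds(1821, 12000))    # about 38.8 s
```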


## Model

For the input normalization it would be nicer to batch normalize along the frequency dimension only, rather than over both frequency and time, but this doesn't seem to be implemented in PyTorch: https://github.com/pytorch/pytorch/issues/21856

In [9]:
model = nn.Sequential(OrderedDict([
    ('input',
        nn.BatchNorm2d(1)
        
    ),
    ('block1',
        nn.Sequential(
            nn.Conv2d(1, 64, 3),
            nn.BatchNorm2d(64),
            nn.ELU(),
            nn.MaxPool2d((2,4))
        )
    ),
    ('block2',
        nn.Sequential(
            nn.Conv2d(64, 128, 3),
            nn.BatchNorm2d(128),
            nn.ELU(),
            nn.MaxPool2d((2,4))
        )
    ),
    ('block3',
        nn.Sequential(
            nn.Conv2d(128, 128, 3),
            nn.BatchNorm2d(128),
            nn.ELU(),
            nn.MaxPool2d((2,4))
        )
    ),
    ('block4',
        nn.Sequential(
            nn.Conv2d(128, 64, 3),
            nn.BatchNorm2d(64),
            nn.ELU(),
            nn.MaxPool2d((3,5))
        )
    ),
    ('output',
        nn.Sequential(
            nn.Flatten(),
            nn.Linear(640, 50),
            nn.Sigmoid()
        )
    )
]))
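
The `640` input size of the final linear layer follows from the input shape and the four blocks. As a sanity check, this sketch traces a 96 x 1821 spectrogram through the model, assuming the layer defaults used above: unpadded 3x3 convolutions with stride 1, and max pooling whose stride equals its kernel size. The helper names are illustrative.

```python
def conv_out(size, kernel=3):
    # unpadded stride-1 convolution shrinks each side by kernel - 1
    return size - (kernel - 1)

def pool_out(size, kernel):
    # max pooling with stride == kernel and no padding floors the division
    return size // kernel

def flattened_features(height, width, pools, out_channels):
    for pool_h, pool_w in pools:
        height = pool_out(conv_out(height), pool_h)
        width = pool_out(conv_out(width), pool_w)
    return out_channels * height * width

pools = [(2, 4), (2, 4), (2, 4), (3, 5)]  # pooling sizes of block1..block4
print(flattened_features(96, 1821, pools, out_channels=64))  # 64 * 2 * 5 = 640
```

A different clip length or hop length changes the width after block4, so the linear layer's input size would need to be recomputed in that case.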

## Training

Move model to GPU

In [10]:
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
device

device(type='cuda', index=0)

Train the model for a maximum of 10 epochs, stopping early once the validation AUC has failed to improve on its best value for 3 epochs in a row. Afterwards, restore the model weights that scored the highest AUC on the validation dataset.

In [11]:
def train_model(model, criterion, optimizer, num_epochs=10):
    since = time.time()

    best_model_wts = copy.deepcopy(model.state_dict())
    best_auc = 0.0
    # number of batches (not samples) per phase, used to average the per-batch metrics
    dataset_sizes = {'train': len(loader['train']), 'val': len(loader['val']) }
    no_improv = 0

    for epoch in range(num_epochs):
        print(f'\nEpoch {epoch + 1}/{num_epochs}')
        print('-' * 10)

        # each epoch has a training and validation phase
        for phase in ['train', 'val']:
            if phase == 'train':
                model.train()
            else:
                model.eval()

            running_loss = 0.0
            running_auc = 0

            for inputs, labels in loader[phase]:
                inputs = inputs.to(device)
                labels = labels.to(device)

                optimizer.zero_grad()

                with torch.set_grad_enabled(phase == 'train'):
                    output = model(inputs)
                    preds = (output > 0.5).float()
                    loss = criterion(output, labels)

                    if phase == 'train':
                        loss.backward()
                        optimizer.step()

                running_loss += loss.item()
                # AUC is computed per batch on the thresholded predictions, not on the raw sigmoid outputs
                running_auc += roc_auc_score(labels.data.cpu(), preds.cpu(), average='samples')

            epoch_loss = running_loss / dataset_sizes[phase]
            epoch_auc = running_auc / dataset_sizes[phase]

            print(f'{phase} Loss: {epoch_loss:.4f} AUC: {epoch_auc:.4f}')

            if phase == 'val':
                if epoch_auc > best_auc:
                    best_auc = epoch_auc
                    best_model_wts = copy.deepcopy(model.state_dict())
                    no_improv = 0
                else:
                    no_improv += 1

        # early stopping: 3 epochs in a row without a new best validation AUC
        if no_improv > 2:
            break

    time_elapsed = time.time() - since
    print(f'\nTraining complete in {time_elapsed // 60:.0f}m {time_elapsed % 60 :.0f}s')
    print(f'Best val AUC: {best_auc:.4f}')

    # load best model weights
    model.load_state_dict(best_model_wts)
    return model

In [12]:
trained_model = train_model(
    model,
    nn.BCELoss(),
    optim.Adam(model.parameters())
)


Epoch 1/10
----------
train Loss: 0.1850 AUC: 0.6006
val Loss: 0.2075 AUC: 0.6049

Epoch 2/10
----------
train Loss: 0.1601 AUC: 0.6675
val Loss: 0.1745 AUC: 0.6555

Epoch 3/10
----------
train Loss: 0.1525 AUC: 0.6871
val Loss: 0.1749 AUC: 0.7015

Epoch 4/10
----------
train Loss: 0.1480 AUC: 0.6974
val Loss: 0.1682 AUC: 0.6943

Epoch 5/10
----------
train Loss: 0.1442 AUC: 0.7056
val Loss: 0.1647 AUC: 0.6784

Epoch 6/10
----------
train Loss: 0.1405 AUC: 0.7121
val Loss: 0.1750 AUC: 0.7142

Epoch 7/10
----------
train Loss: 0.1368 AUC: 0.7189
val Loss: 0.1664 AUC: 0.6835

Epoch 8/10
----------
train Loss: 0.1331 AUC: 0.7262
val Loss: 0.1684 AUC: 0.6984

Epoch 9/10
----------
train Loss: 0.1291 AUC: 0.7335
val Loss: 0.1593 AUC: 0.7013

Training complete in 29m 47s
Best val AUC: 0.7142
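
The stopping behaviour can be replayed from the log. The sketch below isolates the counter logic of `train_model` (reset on a new best, stop once the counter exceeds 2) and feeds it the validation AUC values printed above. The function name `replay_early_stopping` is illustrative.

```python
def replay_early_stopping(val_aucs, patience=2):
    """Return (last_epoch, best_epoch, best_auc, stopped_early) for a list of val AUCs."""
    best_auc, best_epoch, no_improv = 0.0, None, 0
    for epoch, auc in enumerate(val_aucs, start=1):
        if auc > best_auc:
            best_auc, best_epoch, no_improv = auc, epoch, 0
        else:
            no_improv += 1
        if no_improv > patience:  # same condition as `no_improv > 2`
            return epoch, best_epoch, best_auc, True
    return len(val_aucs), best_epoch, best_auc, False

# validation AUC per epoch, copied from the training log
logged_val_aucs = [0.6049, 0.6555, 0.7015, 0.6943, 0.6784, 0.7142, 0.6835, 0.6984, 0.7013]

last_epoch, best_epoch, best_auc, stopped_early = replay_early_stopping(logged_val_aucs)
print(last_epoch, best_epoch, best_auc, stopped_early)  # 9 6 0.7142 True
```

Epochs 7, 8 and 9 all stay below the epoch 6 score, which is why training ends after epoch 9 and the epoch 6 weights are restored.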


## Evaluation

In [33]:
def eval_model(model):
    n_test_batches = len(loader['test'])
    total_auc = 0
    with torch.no_grad():
        for inputs, labels in loader['test']:
            inputs = inputs.to(device)
            labels = labels.to(device)
            output = model(inputs)
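            # note: the threshold is 0.1 here but 0.5 in train_model,
            # so the test AUC is not directly comparable to the validation AUC
            preds = (output > 0.1).float()
            total_auc += roc_auc_score(labels.data.cpu(), preds.cpu(), average='samples')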

    print(f'AUC: {total_auc / n_test_batches:.3f}')

In [34]:
eval_model(trained_model)

AUC: 0.819
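
Both `train_model` and `eval_model` compute AUC on thresholded predictions, with different cutoffs (0.5 and 0.1). To illustrate how much the cutoff can matter, the sketch below computes AUC for a single label row in plain Python, as the fraction of positive/negative pairs ranked correctly with ties counted as half. The labels and scores are made up for illustration, and `pairwise_auc` is a hypothetical helper, not the scikit-learn implementation.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

def threshold(scores, cutoff):
    return [1.0 if s > cutoff else 0.0 for s in scores]

labels = [1, 1, 0, 0, 0]                  # one clip, five tags (toy values)
scores = [0.45, 0.08, 0.40, 0.10, 0.05]   # made-up sigmoid outputs

auc_raw = pairwise_auc(labels, scores)                     # ranking of raw scores
auc_at_05 = pairwise_auc(labels, threshold(scores, 0.5))   # cutoff used in training
auc_at_01 = pairwise_auc(labels, threshold(scores, 0.1))   # cutoff used in evaluation

print(f'{auc_raw:.3f} {auc_at_05:.3f} {auc_at_01:.3f}')  # 0.667 0.500 0.583
```

Thresholding collapses the scores into two values and creates ties, so the result depends on the cutoff. `roc_auc_score` also accepts the raw model outputs as `y_score`, which avoids choosing a cutoff at all.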


## Conclusion

I was not quite able to reproduce the high performance of the FCN4 model, even though the preprocessing and network architecture are very similar. The only architectural difference is that the [reference implementation](https://github.com/keunwoochoi/music-auto_tagging-keras) (FCN5) uses 128 output feature maps in the 4th convolutional block. I opted for 64 instead, to match the last layer, which is also smaller at 64 dimensions. Another difference is that PyTorch does not support batch normalization along a chosen axis, which Keras does.

Maybe these few architectural differences account for the gap of 7.5 percentage points in AUC (81.9% vs 89.4%). Or maybe a different dataset split methodology was used that is not explained in the paper. I opted for the popular 80%/10%/10% split.