# Create a Dataset of Math Rock Songs

Sup. I discovered this underrated music genre a few weeks ago. It gives off a different vibe, with distinctive features such as catchy riffs and unusual time signatures. Fusing it with Python, I got the idea to create a dataset of math rock songs. Check out the goal of this dataset at the end of this notebook.

This dataset is gathered from 40 YouTube videos of math rock songs. These include tricot, CHON, some songs from <a href='https://www.youtube.com/watch?v=q-YDhSU4OwI'>this YouTube playlist</a>, and Murphy Radio, an underrated band from my country. Most of them use time signatures other than 4/4. You can see the song list below.

> **TL;DR** To get the dataset, just delete the last cell of this notebook and then run it. Either way, keep scrolling to read how this notebook works.

First we look at how to read an audio file with Python. Then we mass-download the songs with the `youtube-dl` tool (<a href="https://github.com/ytdl-org/youtube-dl">reference</a>) and save them as `.wav` files. The downloaded files are then split into much smaller 30-second chunks. This follows the format of the GTZAN dataset <a href='https://www.kaggle.com/andradaolteanu/gtzan-dataset-music-genre-classification/'>here</a>, which contains 1000 audio clips from 10 genres, 100 clips per genre. You can check <a href="http://marsyas.info/downloads/datasets.html">this link</a> for more information. Finally, we zip the result so it is ready to be downloaded.


**Outline**:

1. Basic Audio Reading
2. Download Math Rock Songs
3. Prepare Data
4. Congrats!

# Basic Audio Reading

First, add the data to the Kaggle kernel by choosing the <a href='https://www.kaggle.com/andradaolteanu/gtzan-dataset-music-genre-classification/'>GTZAN Dataset</a> from Kaggle. Then, let's see what an audio file is read as.

In [None]:
!ls ../input/

In [None]:
import os
import librosa
import numpy as np
import IPython.display as ipd
import scipy.io.wavfile
from tqdm import tqdm

data_path = '../input/gtzan-dataset-music-genre-classification/Data/genres_original'

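def generate_sample_paths():
    '''Map each genre name to a sorted array of its audio file paths.'''
    samples = dict()
    idx = len(data_path) + 1  # strip '<data_path>/' to leave the genre name

    for dirname, _, filenames in os.walk(data_path):
        genre = dirname[idx:]
        sample_path = [os.path.join(dirname, filename) for filename in filenames]
        samples[genre] = np.array(sorted(sample_path))

    # The root directory itself yields an empty genre name; drop it
    samples.pop('', None)
    return samples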

In [None]:
sample_paths = generate_sample_paths()
sample, sr = librosa.load(sample_paths['jazz'][42])
print(sample.shape)
print(sample)

Turns out it's our king, the NumPy array. The length of the array is the number of audio samples, and each element is the amplitude of one sample.

However, `librosa.load` uses 22050 Hz as its default sample rate. Does a different sample rate for the same audio result in a different array shape? Let's see.

In [None]:
sr = 220
sample, sr = librosa.load(sample_paths['jazz'][42], sr=sr)
print('Sample rate:', sr, '| Array shape:', sample.shape[0])

sr = 2205
sample, sr = librosa.load(sample_paths['jazz'][42], sr=sr)
print('Sample rate:', sr, '| Array shape:', sample.shape[0])

Looks like the array length grows linearly with the sample rate. Now, do different songs with the same duration and the same sample rate give the same array shape?
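
To make the linear relation concrete: a clip of `d` seconds read at `sr` samples per second should hold about `d * sr` samples. Here is a minimal sketch of that arithmetic, assuming a nominal 30-second clip. `expected_num_samples` is a made-up helper for illustration, not part of librosa.

```python
import math

def expected_num_samples(duration_s, sr):
    """Approximate sample count for `duration_s` seconds of audio at `sr` Hz."""
    return math.ceil(duration_s * sr)

# A nominal 30-second clip at the two rates tried above, plus librosa's default
for sr in (220, 2205, 22050):
    print('Sample rate:', sr, '| Expected samples:', expected_num_samples(30.0, sr))
```

The shapes printed by `librosa.load` may differ slightly from these numbers, since a real clip is rarely exactly 30 seconds long.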

In [None]:
sample, sr = librosa.load(sample_paths['jazz'][42])
print('Shape:', sample.shape[0])
sample, sr = librosa.load(sample_paths['country'][1])
print('Shape:', sample.shape[0])
sample, sr = librosa.load(sample_paths['metal'][0])
print('Shape:', sample.shape[0])

The answer: different songs do not necessarily have the same array shape, even when they are read with the same sample rate and have nominally the same duration.
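
Going the other way, dividing the array length by the sample rate gives the actual duration of a clip, which is a quick way to see how far a file is from 30 seconds. A small sketch: `duration_seconds` is a hypothetical helper, and the sample counts are made-up values for illustration, not measurements from GTZAN.

```python
def duration_seconds(num_samples, sr=22050):
    """Duration in seconds of an array with `num_samples` samples at `sr` Hz."""
    return num_samples / sr

# Hypothetical array lengths, for illustration only
for n in (661500, 661794):
    print(n, 'samples ->', round(duration_seconds(n), 3), 's')
```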

Now let's try to play one random song.

In [None]:
def play(path=None, sample=None, sr=22050, **kwargs):
    if path is not None:
        sample, sr = librosa.load(path, **kwargs)
    return ipd.Audio(sample, rate=sr)

play(sample_paths['disco'][7])

Cool! Now we move on to downloading **math rock** songs with `youtube-dl`.

# Download Math Rock Songs

First we install `youtube-dl`, then let's try to download one of my favorite math rock songs: Ochansensu-su~

In [None]:
!wget https://yt-dl.org/downloads/latest/youtube-dl -O /usr/local/bin/youtube-dl
!chmod a+rx /usr/local/bin/youtube-dl

!youtube-dl -qx --audio-format wav 'https://www.youtube.com/watch?v=d6rxGmvQPLU' -o 'ochansensusu.%(ext)s'

In [None]:
play('ochansensusu.wav', offset=12, duration=30)

That's super cool. Now what about mass download?

In [None]:
links = [
         'https://www.youtube.com/watch?v=P_B_GalsJrE',
         'https://www.youtube.com/watch?v=lln2NPx3aKw',
         'https://www.youtube.com/watch?v=fxsCwzsMOfU',
         'https://www.youtube.com/watch?v=RJ1YBbUKzvw',
         'https://www.youtube.com/watch?v=Tqpj9gmk8UI',
         'https://www.youtube.com/watch?v=PVn6gY1Jc7I',
         'https://www.youtube.com/watch?v=d6rxGmvQPLU',
         'https://www.youtube.com/watch?v=hvudfoL1EWU',
         'https://www.youtube.com/watch?v=-rZWdolJfgk',
         'https://www.youtube.com/watch?v=yU38oLPNpYk',
         'https://www.youtube.com/watch?v=C7NXYSklMbg',
         'https://www.youtube.com/watch?v=axV7NhKArV0',
         'https://www.youtube.com/watch?v=TZjTXh_zaXc',
         'https://www.youtube.com/watch?v=9oboWLb4I1Y',
         'https://www.youtube.com/watch?v=Bwq2M5T4dQo',
         'https://www.youtube.com/watch?v=Q77gbjfsVO8',
         'https://www.youtube.com/watch?v=3UjW3-0MsSI',
         'https://www.youtube.com/watch?v=iYrUwWq6KO8',
         'https://www.youtube.com/watch?v=8LIqn2FGYsg',
         'https://www.youtube.com/watch?v=F840uydN-Ps',
         'https://www.youtube.com/watch?v=18HPVYj_HnY',
         'https://www.youtube.com/watch?v=hSaiW1lJRhU',
         'https://www.youtube.com/watch?v=HGzrJjHwmBQ',
         'https://www.youtube.com/watch?v=0XWzY5SLTss',
         'https://www.youtube.com/watch?v=XWqua6rsEmw',
         'https://www.youtube.com/watch?v=JU-8Ikw5HL0',
         'https://www.youtube.com/watch?v=FLL8WPho6BI',
         'https://www.youtube.com/watch?v=iP-6wexQ0V0',
         'https://www.youtube.com/watch?v=uuMrQ6NMP0A',
         'https://www.youtube.com/watch?v=16Ep_2bLq0g',
         'https://www.youtube.com/watch?v=6qsXxKawUos',
         'https://www.youtube.com/watch?v=6Ey3jFf6vhA',
         'https://www.youtube.com/watch?v=RXGwVJCdV6A',
         'https://www.youtube.com/watch?v=rvNOFp6xFMc',
         'https://www.youtube.com/watch?v=FTxSXUzc96A',
         'https://www.youtube.com/watch?v=t0lbJRmlXW8',
         'https://www.youtube.com/watch?v=bKwPynQ7MLU',
         'https://www.youtube.com/watch?v=k-F1k6Nwlhk',
         'https://www.youtube.com/watch?v=Ohq_fzWyXfw',
         'https://www.youtube.com/watch?v=WISgltgMMR8'
]

print('Total math rock songs:', len(links))

First let's generate a `.sh` file containing the list of download commands.

In [None]:
def generate_sh(links, sh_name='download_list.sh', target_dir='./'):
    # Create the download directory (and any parents) if it is missing
    os.makedirs(target_dir, exist_ok=True)

    with open(sh_name, 'w') as f:
        for i, link in enumerate(links):
            # Prefix each file with a zero-padded index; youtube-dl itself
            # fills in the %(title)s and %(ext)s placeholders
            output = "'%s%03d.%%(title)s.%%(ext)s'" % (target_dir, i)
            f.write("youtube-dl -qx --audio-format wav '%s' -o %s\n" % (link, output))

generate_sh(links, target_dir='./downloads/')
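
As a side note, the same commands could also be built as argument lists instead of shell text, which sidesteps quoting problems with URLs and titles. This is only a sketch: `build_commands` is a hypothetical helper, and it reuses the `youtube-dl` flags and output template from the cell above without running anything.

```python
import shlex

def build_commands(links, target_dir='./downloads/'):
    """Return one argument list per link, with the same flags as generate_sh."""
    commands = []
    for i, link in enumerate(links):
        # Zero-padded index first, then youtube-dl's own title/ext placeholders
        template = '%s%03d.%%(title)s.%%(ext)s' % (target_dir, i)
        commands.append(['youtube-dl', '-qx', '--audio-format', 'wav',
                         link, '-o', template])
    return commands

example = build_commands(['https://www.youtube.com/watch?v=d6rxGmvQPLU'])
# Print a shell-safe rendering of the first command for inspection
print(' '.join(shlex.quote(arg) for arg in example[0]))
```

Each list could then be handed to `subprocess.run(cmd, check=True)`, so that a failed download raises an error instead of passing silently.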

Then we check the first 3 commands to ensure that the commands are correct, and finally execute it.

In [None]:
!head -n 3 download_list.sh

In [None]:
!chmod +x download_list.sh
!./download_list.sh

Let's take a look at the song list!

In [None]:
!ls downloads

In [None]:
play('downloads/020.CHON - Knot - Audiotree Live.wav', duration=10)

# Prepare Data

Let's divide each song into the GTZAN format: 30-second audio files.

In [None]:
def get_optimum_gap(chunk_length, audio_length):
    '''
    `chunk_length` and `audio_length` are in ms.
    Return the gap length (in ms), i.e. the overlap between consecutive
    chunks, that covers the largest part of the audio.
    '''
    x = np.arange(chunk_length // 2)                       # candidate gaps
    c = (audio_length - x) // (chunk_length - x)           # chunks that fit per gap
    covered = chunk_length + (c - 1) * (chunk_length - x)  # audio covered per gap
    return np.argmax(covered)

def partition(audio_path, chunk_length, sr=22050, min_amp=0.03, stats=False):
    '''
    Arguments:
        - audio_path : str, path to the audio file
        - chunk_length : int, length of each chunk in ms
        - sr : int, sample rate
        - min_amp : float, minimum amplitude that marks the start and end of the audio
        - stats : bool, whether to return the stats along with the chunks
    Returns:
        - If stats=False, a list of NumPy arrays (like the result of `librosa.load`)
        - Else, the list plus the duration of unused audio in ms and the
          fraction of the audio duration that is used
    '''
    audio, _ = librosa.load(audio_path, sr=sr)
    s_time = np.argmax(audio > min_amp)
    e_time = len(audio) - np.argmax(audio[::-1] > min_amp)
    audio = audio[s_time:e_time]
    
    audio_length = 1000 * len(audio) // sr
    gap_length = get_optimum_gap(chunk_length, audio_length)
    num_chunks = (audio_length - gap_length) // (chunk_length - gap_length)
    
    counter, sample_length = 0, chunk_length * sr // 1000
    start, step = 0, (chunk_length - gap_length) * sr // 1000
    parts = []
    while counter < num_chunks:
        parts.append(audio[start : start + sample_length])
        start += step
        counter += 1

    total_length = chunk_length + (num_chunks - 1) * (chunk_length - gap_length)
    total_pct = total_length / audio_length
    loss = audio_length - total_length

    if stats:
        return parts, loss, total_pct
    return parts

def mass_partition(audio_path, save_path, chunk_length=30000, sr=22050,
                   start_index=0, suffix='', **kwargs):
    os.makedirs(save_path, exist_ok=True)
    filenames = os.listdir(audio_path)
    audio_path = sorted([os.path.join(audio_path, x) for x in filenames])

    # file_id numbers the songs; chunk_id numbers the chunks within a song
    for file_id, path in enumerate(tqdm(audio_path), start=start_index):
        # Load at the same sample rate that is used when writing the chunks
        audios = partition(path, chunk_length, sr=sr, **kwargs)
        for chunk_id, audio in enumerate(audios):
            # `suffix` ends up at the front of the name, e.g. math.00105.wav
            file_path = '%s.%03d%02d.wav' % (suffix, file_id, chunk_id)
            file_path = os.path.join(save_path, file_path)
            scipy.io.wavfile.write(file_path, sr, audio)
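
To see what the gap search in `get_optimum_gap` is doing, here is a pure-Python sketch of the same arithmetic on a made-up 100-second song. The "gap" is the overlap between consecutive chunks: a larger overlap can squeeze in one more chunk and leave less audio unused. `covered_length` and `best_gap` are hypothetical names for illustration.

```python
def covered_length(chunk_length, audio_length, gap):
    """Milliseconds covered when consecutive chunks overlap by `gap` ms."""
    step = chunk_length - gap
    num_chunks = (audio_length - gap) // step
    return chunk_length + (num_chunks - 1) * step

def best_gap(chunk_length, audio_length):
    """Smallest gap (searched up to half a chunk) that covers the most audio."""
    candidates = range(chunk_length // 2)
    return max(candidates,
               key=lambda gap: covered_length(chunk_length, audio_length, gap))

chunk_length, audio_length = 30000, 100000   # hypothetical 100-second song
gap = best_gap(chunk_length, audio_length)
num_chunks = (audio_length - gap) // (chunk_length - gap)
used = covered_length(chunk_length, audio_length, gap)
print(gap, num_chunks, audio_length - used)  # → 12500 5 0
```

Without any overlap (`gap = 0`) the same song would give only 3 chunks and leave 10 seconds unused. With a 12.5-second overlap it gives 5 chunks and uses everything.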

Not every part of a song is used when splitting. Let's look at what splitting a song gives us:

- How many ms of the song are unused,
- The fraction of the song that is used, and
- How many chunks the optimum split produces

In [None]:
audios, loss, pct = partition('downloads/001.tricot「 爆裂パニエさん」（大反射祭Tour／2019.04.28 at '
                              'TSUTAYA O-EAST）YouTube Ver..wav', 30000, stats=True)
    
print(loss, pct, len(audios))

For this song, `tricot「 爆裂パニエさん」（大反射祭Tour／2019.04.28 at TSUTAYA O-EAST）YouTube Ver..wav`, we got an optimum split of 11 chunks, which uses 99.98% of the song and loses 4 ms of it.
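
Note that these stats are measured after `partition` has trimmed the quiet samples at both ends of the song. A pure-Python sketch of that trimming step, mirroring the `np.argmax(audio > min_amp)` logic on a made-up signal (`trim_quiet_edges` is a hypothetical helper):

```python
def trim_quiet_edges(samples, min_amp=0.03):
    """Drop samples before the first and after the last value above `min_amp`."""
    loud = [i for i, value in enumerate(samples) if value > min_amp]
    if not loud:
        # Nothing exceeds the threshold: keep everything, as the NumPy version does
        return samples
    return samples[loud[0]:loud[-1] + 1]

signal = [0.0, 0.01, 0.2, -0.5, 0.4, 0.02, 0.0]   # made-up amplitudes
print(trim_quiet_edges(signal))  # → [0.2, -0.5, 0.4]
```

Because the comparison is on the signed value, a negative peak such as `-0.5` does not count as loud on its own. Taking the absolute value first would be one possible variation.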

In [None]:
mass_partition(audio_path='downloads',
               save_path='./math/',
               chunk_length=30000,
               sr=22050,
               start_index=0,
               suffix='math')

Cool. Now we are ready. The dataset is located at `./math/`.
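
The intro mentions zipping the result so it can be downloaded. One way to do that is `shutil.make_archive` from the standard library. This is a minimal sketch, not necessarily how the original dataset was packaged: `zip_dataset` is a hypothetical helper and the archive name `math_rock_dataset` is an assumption.

```python
import os
import shutil

def zip_dataset(src_dir, archive_base):
    """Create `<archive_base>.zip` from the contents of `src_dir`; return its path."""
    return shutil.make_archive(archive_base, 'zip', root_dir=src_dir)

# Only zip when the dataset directory actually exists
if os.path.isdir('./math'):
    print(zip_dataset('./math', 'math_rock_dataset'))
```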

Let's do a last sanity check to ensure we got the right split.

In [None]:
result = sorted(os.listdir('./math/'))
result = '\n'.join(result[:3] + result[-3:])
print(result)
play('./math/math.00105.wav')
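
The file names encode where each chunk came from: with the pattern `'%s.%03d%02d.wav'`, the name `math.00105.wav` is chunk 5 of song 1. A small sketch for decoding names back into their parts, assuming names that follow this exact pattern (`parse_chunk_name` is a hypothetical helper):

```python
import re

# prefix, 3-digit song index, 2-digit chunk index, as written by mass_partition
CHUNK_NAME = re.compile(r'^(?P<prefix>.+)\.(?P<song>\d{3})(?P<chunk>\d{2})\.wav$')

def parse_chunk_name(filename):
    """Split a chunk file name into (prefix, song index, chunk index)."""
    match = CHUNK_NAME.match(filename)
    if match is None:
        raise ValueError('Unexpected file name: %s' % filename)
    return match['prefix'], int(match['song']), int(match['chunk'])

print(parse_chunk_name('math.00105.wav'))  # → ('math', 1, 5)
```

This could come in handy later, for example to keep all chunks of one song on the same side of a train/test split.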

**Note**: The Linux commands in the cell below are there to reduce the size of this notebook's output.

**Delete these Linux commands to keep the dataset.**

In [None]:
!rm -rf math
!rm -rf downloads
!rm -rf ochansensusu.wav

# Congrats!

Well, yeah, that's all. We downloaded 40 math rock songs and divided them into 418 audio files of 30 seconds each. That is many more than the number of audio files per genre in the GTZAN dataset, which is only 100. The dataset is located in the `math` directory.

The core steps of this notebook are: generate a shell file containing the `youtube-dl` commands, execute it, and then split each downloaded song. This math rock dataset could be used for further purposes, like genre classification or even math rock song generation.