# Preprocessing for...

* Ballroom, extended ballroom, gtzan (genre)
* emomusic45s (emotion)
* jamendo (vocal or not)
* gtzan music speech

### like how?
For each dataset,
* Audio: decoded, resampled to 12 kHz, and stored as `filename.npy`
* Annotations: a csv file, `[dataset_name].csv`, with these columns
  * [file_id]: file id -- for management
  * [index]: file index (integer) -- for management
  * [filepath]: filepath -- for reading file
  * [label]: y as an integer -- for stratified splitting
* Annotations: `[dataset_name].npy`: `(n_track, 1)`
* Annotations: `[dataset_name]_1hot.npy`: `(n_track, n_label)` if classification task
* Setup: `[dataset_name]_setup.json` for setup such as
  * task name
  * task type (classification or regression)
  * suggested `k` for cross validation
  * dataset root path
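
As a rough illustration of the annotation outputs listed above, the sketch below turns a list of integer labels into the `(n_track, 1)` and `(n_track, n_label)` arrays and writes them next to a setup json. The function names, the json keys, and the toy labels are assumptions made for this example; they are not taken from the notebook.

```python
import json
import os
import tempfile

import numpy as np


def labels_to_arrays(labels, n_label):
    """Return (y, y_1hot) with shapes (n_track, 1) and (n_track, n_label)."""
    y = np.asarray(labels, dtype=np.int64).reshape(-1, 1)
    y_1hot = np.zeros((y.shape[0], n_label), dtype=np.float32)
    # Put a single 1 in each row, at the column given by the integer label.
    y_1hot[np.arange(y.shape[0]), y[:, 0]] = 1.0
    return y, y_1hot


def save_annotations(out_dir, dataset_name, labels, n_label, setup):
    """Write [dataset_name].npy, [dataset_name]_1hot.npy and [dataset_name]_setup.json."""
    y, y_1hot = labels_to_arrays(labels, n_label)
    np.save(os.path.join(out_dir, dataset_name + '.npy'), y)
    np.save(os.path.join(out_dir, dataset_name + '_1hot.npy'), y_1hot)
    with open(os.path.join(out_dir, dataset_name + '_setup.json'), 'w') as f_json:
        json.dump(setup, f_json, indent=2)
    return y, y_1hot


# Toy example: four tracks, three classes (made-up labels and setup keys).
out_dir = tempfile.mkdtemp()
setup = {'task_name': 'toy_genre', 'task_type': 'classification',
         'k': 2, 'dataset_root': out_dir}
y, y_1hot = save_annotations(out_dir, 'toy', [0, 2, 1, 2], 3, setup)
print(y.shape, y_1hot.shape)
```

For a regression task only the `(n_track, 1)` array would apply, so the one-hot step could simply be skipped.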

In [2]:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
%matplotlib notebook
import matplotlib.pyplot as plt
import numpy as np
import h5py
import librosa
import os, sys
import time
import pandas as pd 
from collections import namedtuple
from sklearn.model_selection import StratifiedShuffleSplit

## PATHS
Change these to wherever you saved/untarred the datasets.

In [11]:
# For almost all datasets,
PATH_DATASETS = '/misc/kcgscratch1/ChoGroup/keunwoo/datasets/'
# To store processed Jamendo dataset
PATH_PROCESSED = '/misc/kcgscratch1/ChoGroup/keunwoo/datasets_processed/'
# For some random reason I put UrbanSound8K in this folder. 
PATH_URBAN = '/misc/kcgscratch1/ChoGroup/keunwoo/UrbanSound8K' 

In [4]:
FOLDER_CSV = 'data_csv/'
SR = 12000

In [5]:
allowed_exts = set(['mp3', 'wav', 'au'])

# Create CSV files

Helper functions for other datasets.

In [7]:
column_names = ['id', 'filepath', 'label'] # todo: label_for_stratify?

In [8]:
def get_rows_from_folders(folder_dataset, folders, dataroot=None):
    '''For gtzan and ballroom extended, where each class is stored in its own folder.'''
    rows = []
    if dataroot is None:
        dataroot = PATH_DATASETS
    for label_idx, folder in enumerate(folders): # assumes one label per folder.
        files = os.listdir(os.path.join(dataroot, folder_dataset, folder))
        files = [f for f in files if f.split('.')[-1].lower() in allowed_exts]
        for fname in files:
            file_path = os.path.join(folder_dataset, folder, fname)
            # NOTE: keeps only the text before the first dot, so 'blues.00000.au' -> 'blues'.
            file_id = fname.split('.')[0]
            file_label = label_idx
            rows.append([file_id, file_path, file_label])
    print('Done - length:{}'.format(len(rows)))
    print(rows[0])
    print(rows[-1])
    return rows
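
The helper above takes the text before the first dot as the file id, which collapses GTZAN names such as `blues.00000.au` to `blues`. If unique ids are wanted, one possible alternative is to drop only the final extension with `os.path.splitext`. The function name `file_id_from_name` is hypothetical and this is only a sketch, not what the notebook ran.

```python
import os


def file_id_from_name(fname):
    """Drop only the last extension, keeping any earlier dots in the name."""
    return os.path.splitext(fname)[0]


for fname in ['blues.00000.au', '100701.mp3', 'bagpipe.wav']:
    # Compare with the first-dot split used in the helper above.
    print(fname, '->', file_id_from_name(fname), 'vs', fname.split('.')[0])
```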

In [9]:
def write_to_csv(rows, column_names, csv_fname):
    '''rows: list of rows (= which are lists.)
    column_names: names for columns
    csv_fname: string, csv file name'''
    df = pd.DataFrame(rows, columns=column_names)
    df.to_csv(os.path.join(FOLDER_CSV, csv_fname))

### Jamendo voice activity
Pre-preprocessing: trim the recordings and save the segments into `PATH_JAMENDO_TRIM`.

The `train/test/valid` structure is preserved; segments are stored in the `sing/` and `nosing/` subdirectories.
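
A minimal sketch of that folder layout, assuming a temporary directory stands in for `PATH_JAMENDO_TRIM`. The helper name `make_segment_dirs` is made up for this example.

```python
import os
import tempfile


def make_segment_dirs(root, splits=('train', 'test', 'valid'), classes=('sing', 'nosing')):
    """Create root/<split>/<class> folders, skipping the ones that already exist."""
    created = []
    for split in splits:
        for cls in classes:
            path = os.path.join(root, split, cls)
            if not os.path.isdir(path):
                os.makedirs(path)  # also creates missing parent folders
            created.append(path)
    return created


root = tempfile.mkdtemp()  # stand-in for PATH_JAMENDO_TRIM
paths = make_segment_dirs(root)
make_segment_dirs(root)  # calling it again does not raise
print(len(paths))
```

Checking `os.path.isdir` before `os.makedirs` keeps the sketch usable on both Python 2 and Python 3.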

In [6]:
PATH_JAMENDO = os.path.join(PATH_DATASETS, 'jamendo_voice_activity')
PATH_JAMENDO_TRIM = os.path.join(PATH_PROCESSED, 'jamendo_trimmed/')

In [7]:
folders = ['train', 'test', 'valid']

### Jamendo - x

This code trims the files into 'sing' and 'nosing' segments and saves them as wav files.
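
The core of the trimming step is converting each `start end label` line of a `.lab` file into a sample range at 12 kHz. The sketch below shows only that conversion; the two annotation lines are invented for illustration and the function name `parse_lab_lines` is hypothetical.

```python
SR = 12000  # same sampling rate as in the notebook


def parse_lab_lines(lines, sr=SR):
    """Turn 'start end label' lines (times in seconds) into (start_sample, end_sample, label)."""
    segments = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate empty lines
        start, end, label = line.split(' ')
        segments.append((int(float(start) * sr), int(float(end) * sr), label))
    return segments


# Invented annotation lines, in the same format as the Jamendo .lab files.
toy_lines = ['0.000 1.500 nosing\n', '1.500 4.250 sing\n']
for start_sp, end_sp, label in parse_lab_lines(toy_lines):
    print(label, start_sp, end_sp)
```

Each `(start_sample, end_sample)` pair can then be used to slice the decoded signal, as the cell below does with `src[start_sp:end_sp]`.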

In [20]:
# Create 'x's 
print('Start..')
for folder in folders:
    print('{}..'.format(folder))
    path_folder = os.path.join(PATH_JAMENDO, folder)
    files = os.listdir(path_folder)
    music_files = [f for f in files if f.split('.')[-1].lower() in ('ogg', 'mp3') and not f.startswith('._')]
    
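    lab_files = [os.path.splitext(f)[0] + '.lab' for f in music_files]
    # Create the output folders; os.makedirs also creates missing parents.
    for subfolder in ('sing', 'nosing'):
        path_out = os.path.join(PATH_JAMENDO_TRIM, folder, subfolder)
        if not os.path.isdir(path_out):
            os.makedirs(path_out)
    for file_idx, (m_file, l_file) in enumerate(zip(music_files, lab_files)): # music file, lab file (text)
        # Resume hack: skip the first 44 train files (file_idx 0-43).
        if folder == 'train' and file_idx <= 43:
            continue
        print('  {} {}..'.format(file_idx, m_file))
        # NOTE: rstrip strips the characters '.', 'l', 'a', 'b', not the suffix, so
        # 'Colombia.lab' -> 'Colombi'. The label cell uses the same call, so names match.
        filename = l_file.rstrip('.lab') # == song title.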
        path_audio = os.path.join(path_folder, m_file)
        src, sr = librosa.load(path_audio, sr=SR)
        starts, ends, labels = [], [], []
        with open(os.path.join(PATH_JAMENDO, 'jamendo_lab/', l_file)) as f_lab:
            for line in f_lab:
                start, end, label = line.rstrip('\n').split(' ') # start [s], end [s], 'sing' or 'nosing'
                starts.append(start)
                ends.append(end)
                labels.append(label)
        for seg_idx, (start, end, label) in enumerate(zip(starts, ends, labels)):
            out_name = '{}_{}.wav'.format(filename, seg_idx)
            start_sp = int(float(start) * SR)
            end_sp = int(float(end) * SR)
            # NOTE: librosa.output.write_wav is only available in librosa < 0.8.
            librosa.output.write_wav(os.path.join(PATH_JAMENDO_TRIM, folder, label, out_name),
                                     src[start_sp:end_sp],
                                     SR)


Start..
train..
  44 02 - Colombia.ogg..
  45 01 - The Final Rewind.ogg..
  46 01 - Sunlight.ogg..
  47 01 - Seven Months.ogg..
  48 01 - Perdre le Nord.ogg..
  49 01 - Ok.ogg..
  50 01 - Its Easy.ogg..
  51 01 - Angels Of Crime.ogg..
  52 01 - Visa pour hier.mp3..
  53 01 - Sunken Sailor.mp3..
  54 02 - The Louise XIV Cathorse.mp3..
  55 01 - alice.mp3..
  56 01 - A smile on your face.mp3..
  57 01 - A new singing song.mp3..
  58 01 - 10min.mp3..
  59 01 - A city.mp3..
  60 02 - emporte-moi.mp3..
test..
  0 03 - castaway.ogg..
  1 03 - Une charogne.ogg..
  2 05 - 05 LIrlandaise.ogg..
  3 05 - 16 ans.ogg..
  4 05 - 2003-Circonstances attenuantes part II.ogg..
  5 05 - A Poings Fermes.ogg..
  6 05 - Crepuscule.ogg..
  7 05 - Dance.ogg..
  8 04 - Healing Luna.mp3..
  9 03 - Say me Good Bye.mp3..
  10 04 - Inside.mp3..
  11 05 - Elles disent.mp3..
  12 03 - Si Dieu.mp3..
  13 03 - School.mp3..
  14 04 - Believe.mp3..
  15 04 - You are.mp3..
valid..
  0 05 - Change.ogg..
  1 05 - Cecilia.ogg..
  2 05 - Callypige palindrome.o

### jamendo - y

In [68]:
# y: csv file, columns=['id', 'filepath', 'label', 'category']
#                       'blah_1', 'train/sing/blah_1.wav', 1, 'train'
filename_segs = []
filepaths = []
ys = []
categories = []

for folder in folders:
#     print('{}..'.format(folder))
    path_folder = os.path.join(PATH_JAMENDO, folder)
    files = os.listdir(path_folder)
    music_files = [f for f in files if f.split('.')[-1].lower() in ('ogg', 'mp3') and not f.startswith('._')]
    
    lab_files = [os.path.splitext(f)[0] + '.lab' for f in music_files]
    for file_idx, (m_file, l_file) in enumerate(zip(music_files, lab_files)): # music file, lab file (text)
#         print('  {} {}..'.format(file_idx, m_file))
        # Same rstrip call as in the trimming cell, so segment names match the wav files.
        filename = l_file.rstrip('.lab') # == song title.
        is_sings = []
        labels = []
        with open(os.path.join(PATH_JAMENDO, 'jamendo_lab/', l_file)) as f_lab:
            for line in f_lab:
                start, end, label = line.rstrip('\n').split(' ') # start [s], end [s], 'sing' or 'nosing'
                is_sings.append(int(label == 'sing'))
                labels.append(label)
        for seg_idx, (is_sing, label) in enumerate(zip(is_sings, labels)):
            out_name = '{}_{}.wav'.format(filename, seg_idx)
            filename_segs.append(os.path.splitext(out_name)[0]) # drop the '.wav' suffix only
            filepaths.append(os.path.join(folder, label, out_name))
            ys.append(is_sing)
            categories.append(folder)

print(len(filepaths))
print(len(ys))

4086
4086


In [69]:
write_to_csv(list(zip(filename_segs, filepaths, ys, categories)),
             ['id', 'filepath', 'label', 'category'],
             'jamendo_vd.csv')
print('jamendo_vd.csv is saved! ')

jamendo_vd.csv is saved! 


### ballroom extended

In [29]:
folder_dataset_be = 'ballroom_extended_2016/'
labels_be = ['Chacha', 'Foxtrot', 'Jive', 'Pasodoble', 'Rumba', 'Salsa', 'Samba', 'Slowwaltz', 'Tango',
             'Viennesewaltz', 'Waltz', 'Wcswing', 'Quickstep']
n_label_be = len(labels_be)
folders_be = [s + '/' for s in labels_be]

In [31]:
rows_ballroom = get_rows_from_folders(folder_dataset_be, folders_be)
write_to_csv(rows_ballroom, column_names, 'ballroom_extended.csv')

Done - length:4180
['100701', 'ballroom_extended_2016/Chacha/100701.mp3', 0]
['118720', 'ballroom_extended_2016/Quickstep/118720.mp3', 12]


### gtzan genre

In [32]:
folder_dataset_gtg = 'gtzan_genre/genres/'
labels_gtg = ['blues', 'classical', 'country', 'disco', 'hiphop', 'jazz', 'metal', 'pop', 'reggae', 'rock']
n_label_gtg = len(labels_gtg)
folders_gtg = [s + '/' for s in labels_gtg]

In [33]:
rows_gtg = get_rows_from_folders(folder_dataset_gtg, folders_gtg)
write_to_csv(rows_gtg, column_names, 'gtzan_genre.csv')

Done - length:1000
['blues', 'gtzan_genre/genres/blues/blues.00000.au', 0]
['rock', 'gtzan_genre/genres/rock/rock.00099.au', 9]


### gtzan music speech

In [34]:
folder_dataset_gtms = 'gtzan_music_speech/music_speech/'
labels_gtms = ['music', 'speech']
n_label_gtms = len(labels_gtms)
folders_gtms = ['music_wav', 'speech_wav']

In [35]:
rows_gtms = get_rows_from_folders(folder_dataset_gtms, folders_gtms)
write_to_csv(rows_gtms, column_names, 'gtzan_speechmusic.csv')

Done - length:128
['bagpipe', 'gtzan_music_speech/music_speech/music_wav/bagpipe.wav', 0]
['voices', 'gtzan_music_speech/music_speech/speech_wav/voices.wav', 1]


### emoMusic45s
* at `/misc/kcgscratch1/ChoGroup/keunwoo/datasets/emoMusic45s`
* files: `0.mp3` - `1000.mp3`
* labels: `static_annotations.csv`
```csv
song_id,mean_arousal,std_arousal,mean_valence,std_valence
2,3.1,0.99443,3,0.66667
```

Some files are missing arousal/valence (AV) labels; 744 songs are annotated in total.

I ignore the train/test pre-split because it seems unnecessary.

http://cvml.unige.ch/databases/emoMusic/
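
A small sketch of how one annotation row becomes a pair of regression targets. The divisor 9 is copied from the cell that writes `emoMusic.csv`; reading it as the top of the rating scale is an assumption here, and the name `normalise_row` is hypothetical.

```python
import csv
import io

SCALE_MAX = 9.0  # assumption: the divisor used when emoMusic.csv is written

# Header and row copied from the static_annotations.csv excerpt above.
sample = u"""song_id,mean_arousal,std_arousal,mean_valence,std_valence
2,3.1,0.99443,3,0.66667
"""


def normalise_row(row, scale_max=SCALE_MAX):
    """Divide the mean arousal and valence ratings by scale_max."""
    return {'id': int(row['song_id']),
            'label_arousal': float(row['mean_arousal']) / scale_max,
            'label_valence': float(row['mean_valence']) / scale_max}


targets = [normalise_row(r) for r in csv.DictReader(io.StringIO(sample))]
print(targets)
```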


### emoMusic 45s

In [70]:
emoMusic_folder = 'emoMusic45s'
anno_file = 'static_annotations.csv'
info_file = 'song_info.csv'
df = pd.read_csv(os.path.join(PATH_DATASETS, emoMusic_folder, anno_file))
# df_info = pd.read_csv(os.path.join(PATH_DATASETS, emoMusic_folder, info_file))

In [71]:
filenames = ['%s.mp3' % idx for idx in df['song_id']]
filepaths = [os.path.join(emoMusic_folder, 'clips_45seconds', f) for f in filenames]
# categories = [df['']]

In [72]:
# Both mean ratings are divided by 9. before being stored as regression labels.
write_to_csv(list(zip(df['song_id'], filepaths, df['mean_arousal']/9., df['mean_valence']/9.)),
             ['id', 'filepath', 'label_arousal', 'label_valence'],
             'emoMusic.csv')

In [73]:
len(df['song_id'])

744

### urbansound8k

csv: `/misc/kcgscratch1/ChoGroup/keunwoo/UrbanSound8K/metadata/UrbanSound8K.csv`
```
slice_file_name,fsID,start,end,salience,fold,classID,class
100032-3-0-0.wav,100032,0.0,0.317551,1,5,3,dog_bark
100263-2-0-117.wav,100263,58.5,62.5,1,5,2,children_playing
```
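
To make the mapping from this metadata to the output csv concrete, the sketch below builds `[id, filepath, label, fold]` rows from the two sample lines above using only the standard library. The function name `urbansound_row` is made up for this example; the notebook itself uses pandas.

```python
import csv
import io
import os

# The two metadata rows shown above.
sample = u"""slice_file_name,fsID,start,end,salience,fold,classID,class
100032-3-0-0.wav,100032,0.0,0.317551,1,5,3,dog_bark
100263-2-0-117.wav,100263,58.5,62.5,1,5,2,children_playing
"""


def urbansound_row(record):
    """Build [id, filepath, label, fold] from one metadata record."""
    fname = record['slice_file_name']
    file_id = os.path.splitext(fname)[0]  # file name without '.wav'
    filepath = os.path.join('fold' + record['fold'], fname)  # relative to the dataset root
    return [file_id, filepath, int(record['classID']), int(record['fold'])]


rows = [urbansound_row(r) for r in csv.DictReader(io.StringIO(sample))]
for row in rows:
    print(row)
```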


In [None]:
df = pd.read_csv(os.path.join(PATH_URBAN, 'metadata/UrbanSound8K.csv'))

ids = [os.path.splitext(s)[0] for s in df['slice_file_name']] # drop the '.wav' suffix only
filepaths = [os.path.join('fold%s' % fd, fn) for fd, fn in zip(df['fold'], df['slice_file_name'])]

In [79]:
write_to_csv(list(zip(ids, filepaths, df['classID'], df['fold'])), ['id', 'filepath', 'label', 'fold'],
             'urbansound.csv')

In [80]:
print(df.columns)

Index([u'slice_file_name', u'fsID', u'start', u'end', u'salience', u'fold',
       u'classID', u'class'],
      dtype='object')
