# Split MedleyDB 

This notebook splits the MedleyDB dataset into train, validation, and test sets, using only the multitracks that contain vocal melodies.

I used the MedleyDB API to manipulate the files and build the subsets.
The main dependency is:

 - MedleyDB API: [https://github.com/marl/medleydb](https://github.com/marl/medleydb)

In [1]:
import medleydb as mdb

# Load all multitracks
mtrack_generator = mdb.load_all_multitracks()

all_tracks_id = [mtrack.track_id for mtrack in mtrack_generator]



In [2]:
# get all valid instrument labels
instruments = mdb.get_valid_instrument_labels()
print (instruments)

{'cello', 'bongo', 'male singer', 'fx/processed sound', 'synthesizer', 'bass clarinet', 'string section', 'tack piano', 'alto saxophone', 'male rapper', 'clarinet', 'choir', 'accordion', 'bassoon', 'guiro', 'tuba', 'piccolo', 'yangqin', 'cymbal', 'cabasa', 'euphonium', 'ukulele', 'timpani', 'trumpet section', 'male screamer', 'maracas', 'violin', 'piano', 'shaker', 'triangle', 'bagpipe', 'female speaker', 'tenor saxophone', 'electronic organ', 'dulcimer', 'bass drum', 'clarinet section', 'sitar', 'flute', 'tabla', 'theremin', 'beatboxing', 'acoustic guitar', 'sampler', 'oboe', 'bandoneon', 'oud', 'cowbell', 'dizi', 'chimes', 'conga', 'whistle', 'double bass', 'gu', 'harp', 'drum set', 'lap steel guitar', 'claps', 'high hat', 'vocalists', 'female screamer', 'zhongruan', 'panpipes', 'recorder', 'mandolin', 'snare drum', 'harpsichord', 'cello section', 'drum machine', 'trombone section', 'toms', 'banjo', 'flute section', 'male speaker', 'harmonium', 'clean electric guitar', 'horn section'

In [3]:
mtrack1 = mdb.MultiTrack('LizNelson_Rainfall')
print (mtrack1.melody_stems()[0].instrument)

['female singer']


In [4]:
# Collect every multitrack that has at least one melody stem with a singing voice.
print ('== List of musics with singing voice ==')
search_for = {'female singer', 'male singer', 'vocalists', 'choir'}
vocal_tracks_id = []
for music in all_tracks_id:
    mtrack = mdb.MultiTrack(music)
    # Each stem's `instrument` is a list of labels (see the cell above).
    if any(label in search_for
           for stem in mtrack.melody_stems()
           for label in stem.instrument):
        vocal_tracks_id.append(music)
        print (music)

== List of musics with singing voice ==
AClassicEducation_NightOwl
AimeeNorwich_Child
AlexanderRoss_GoodbyeBolero
AlexanderRoss_VelvetCurtain
Auctioneer_OurFutureFaces
AvaLuna_Waterduct
BigTroubles_Phantom
BrandonWebster_DontHearAThing
BrandonWebster_YesSirICanFly
CelestialShore_DieForUs
ClaraBerryAndWooldog_AirTraffic
ClaraBerryAndWooldog_Boys
ClaraBerryAndWooldog_Stella
ClaraBerryAndWooldog_TheBadGuys
ClaraBerryAndWooldog_WaltzForMyVictims
Creepoid_OldTree
Debussy_LenfantProdigue
DreamersOfTheGhetto_HeavyLove
FacesOnFilm_WaitingForGa
FamilyBand_Again
Handel_TornamiAVagheggiar
HeladoNegro_MitadDelMundo
HezekiahJones_BorrowedHeart
HopAlong_SisterCities
InvisibleFamiliars_DisturbingWildlife
LizNelson_Coldwar
LizNelson_ImComingHome
LizNelson_Rainfall
MatthewEntwistle_DontYouEver
MatthewEntwistle_Lontano
Meaxic_TakeAStep
Meaxic_YouListen
Mozart_BesterJungling
Mozart_DiesBildnis
MusicDelta_80sRock
MusicDelta_Beatles
MusicDelta_Britpop
MusicDelta_Country1
MusicDelta_Country2
MusicDelta_Disc

In [5]:
print("MedleyDB has", len(all_tracks_id), "multitrack files;", len(vocal_tracks_id), "have singing voice.")

MedleyDB has 122 multitrack files; 61 have singing voice.
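
The filter in cell 4 can be checked without the dataset on disk. The sketch below applies the same rule to plain lists of labels. `has_singing_voice` and the sample stems are hypothetical, chosen only to mirror the list-per-stem shape that `stem.instrument` shows in cell 3.

```python
# Labels treated as singing voice, copied from the filter cell above.
VOCAL_LABELS = {'female singer', 'male singer', 'vocalists', 'choir'}


def has_singing_voice(stem_instruments):
    """Return True if any stem carries at least one vocal label.

    `stem_instruments` holds one list of labels per melody stem, the same
    shape as [stem.instrument for stem in mtrack.melody_stems()].
    """
    return any(label in VOCAL_LABELS
               for labels in stem_instruments
               for label in labels)


# Hypothetical examples, not taken from MedleyDB.
with_voice = [['female singer'], ['piano']]
without_voice = [['violin'], ['flute', 'clarinet']]
print(has_singing_voice(with_voice), has_singing_voice(without_voice))
```

Labels such as `'male rapper'` or `'female speaker'` are not in the search list, so tracks whose only vocal melody stems carry those labels are left out.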


### Split into train and test sets

In [6]:
vocal_split = mdb.utils.artist_conditional_split(trackid_list=vocal_tracks_id, test_size=0.20,
                                                 num_splits=1, random_state=8526325)

In [7]:
print(vocal_split[0]['train'], "\nThere are", len(vocal_split[0]['train']), "songs in the train set")

['AClassicEducation_NightOwl', 'AimeeNorwich_Child', 'AlexanderRoss_GoodbyeBolero', 'AlexanderRoss_VelvetCurtain', 'Auctioneer_OurFutureFaces', 'AvaLuna_Waterduct', 'BigTroubles_Phantom', 'BrandonWebster_DontHearAThing', 'BrandonWebster_YesSirICanFly', 'ClaraBerryAndWooldog_AirTraffic', 'ClaraBerryAndWooldog_Boys', 'ClaraBerryAndWooldog_Stella', 'ClaraBerryAndWooldog_TheBadGuys', 'ClaraBerryAndWooldog_WaltzForMyVictims', 'Creepoid_OldTree', 'Debussy_LenfantProdigue', 'DreamersOfTheGhetto_HeavyLove', 'FacesOnFilm_WaitingForGa', 'FamilyBand_Again', 'Handel_TornamiAVagheggiar', 'HeladoNegro_MitadDelMundo', 'HezekiahJones_BorrowedHeart', 'HopAlong_SisterCities', 'LizNelson_Coldwar', 'LizNelson_ImComingHome', 'LizNelson_Rainfall', 'MatthewEntwistle_DontYouEver', 'MatthewEntwistle_Lontano', 'Meaxic_TakeAStep', 'Meaxic_YouListen', 'Mozart_BesterJungling', 'Mozart_DiesBildnis', 'MusicDelta_80sRock', 'MusicDelta_Beatles', 'MusicDelta_Britpop', 'MusicDelta_Disco', 'MusicDelta_Grunge', 'MusicDelt

In [8]:
print(vocal_split[0]['test'], "\nThere are", len(vocal_split[0]['test']), "songs in the test set")

['CelestialShore_DieForUs', 'InvisibleFamiliars_DisturbingWildlife', 'MusicDelta_Country1', 'MusicDelta_Country2', 'MusicDelta_Gospel', 'MusicDelta_Rock', 'PortStWillow_StayEven', 'Snowmine_Curfews', 'StrandOfOaks_Spacestation', 'SweetLights_YouLetMeDown'] 
There are 10 songs in the test set
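
`artist_conditional_split` is meant to keep each artist on one side of the split. One way to inspect the result is to compare artist names across the two lists. The sketch below assumes the artist is the part of the track ID before the first underscore. That is only a proxy: `MusicDelta_...` IDs appear in both lists above, so the library presumably does not treat that prefix as a single artist. `artist_of` and `shared_artists` are hypothetical helpers.

```python
def artist_of(track_id):
    """Artist name, assumed to be the text before the first underscore."""
    return track_id.split('_', 1)[0]


def shared_artists(train_ids, test_ids):
    """Return the set of artist prefixes that appear in both lists."""
    return {artist_of(t) for t in train_ids} & {artist_of(t) for t in test_ids}


# A few IDs copied from the outputs above.
train_sample = ['LizNelson_Coldwar', 'LizNelson_Rainfall', 'MusicDelta_Beatles']
test_sample = ['Snowmine_Curfews', 'MusicDelta_Country1']
print(shared_artists(train_sample, test_sample))  # {'MusicDelta'}
```

On the real split, calling `shared_artists(vocal_split[0]['train'], vocal_split[0]['test'])` lists the prefixes worth a manual look.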


### Split the train set into train and validation sets

In [9]:
vocal_train_split = mdb.utils.artist_conditional_split(trackid_list=vocal_split[0]['train'], test_size=0.20,
                                                       num_splits=1, random_state=8526325)

In [10]:
print("There are", len(vocal_train_split[0]['train']), "songs in the train set and",
      len(vocal_train_split[0]['test']), "songs in the validation set")

There are 38 songs in the train set and 13 songs in the validation set
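
The realized set sizes only approximate `test_size=0.20`, presumably because the split moves whole artists rather than individual tracks. The arithmetic below uses only the counts printed above (61 vocal tracks: 38 train, 13 validation, 10 test).

```python
# Counts taken from the cell outputs above.
sizes = {'train': 38, 'validation': 13, 'test': 10}
total = sum(sizes.values())  # 61 vocal multitracks

# Share of each subset relative to all vocal multitracks.
for name, count in sizes.items():
    print(f"{name:<10} {count:>3}  {count / total:.1%}")

# Validation share of the 51 tracks passed to the second split.
second_split_total = sizes['train'] + sizes['validation']
print(f"validation / (train + validation): {sizes['validation'] / second_split_total:.1%}")
```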


In [11]:
# The 'test' subset of the second split serves as the validation set.
vocal_train_split[0]['validation'] = vocal_train_split[0].pop('test')

In [12]:
# Attach the held-out test set from the first split.
vocal_train_split[0]['test'] = vocal_split[0]['test']

In [13]:
import json
with open('split_voiced_medleydb.json', 'w') as outfile:
    json.dump(vocal_train_split[0], outfile)
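
To reuse the saved file later, it can be read back with `json.load`. The sketch below assumes the JSON holds the three keys written above (`train`, `validation`, `test`). `load_split` is a hypothetical helper, and the demo writes a tiny stand-in file to a temporary directory instead of touching `split_voiced_medleydb.json`. The track IDs come from the outputs above, but their assignment to subsets here is illustrative only.

```python
import json
import os
import tempfile


def load_split(path):
    """Load a split file and check that its three subsets do not overlap."""
    with open(path) as infile:
        split = json.load(infile)
    train, val, test = (set(split[key]) for key in ('train', 'validation', 'test'))
    if train & val or train & test or val & test:
        raise ValueError('a track ID appears in more than one subset')
    return split


# Stand-in data; the subset assignment is illustrative.
demo = {'train': ['LizNelson_Rainfall'],
        'validation': ['Meaxic_YouListen'],
        'test': ['Snowmine_Curfews']}
with tempfile.TemporaryDirectory() as tmpdir:
    demo_path = os.path.join(tmpdir, 'demo_split.json')
    with open(demo_path, 'w') as outfile:
        json.dump(demo, outfile)
    loaded = load_split(demo_path)

print(loaded == demo)
```

With the real file, `load_split('split_voiced_medleydb.json')` would return the same dictionary that was dumped in the last cell.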