## About this notebook
This notebook contains our workflow for converting the Million Song Dataset from HDF5 (`.h5`) files into pandas pickle files.
We convert the dataset to pickle to speed up loading and to avoid re-parsing the raw HDF5 files on every run.
We split the output following the raw dataset's directory structure, for example:
> All raw h5 files under './datasets/raw_files/A/' are stored as a single file "df_pickle_A" in './datasets/pickle_files/'
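
The mapping from a raw directory to its pickle file can be sketched as below. This is only an illustration: `pickle_path_for` is a hypothetical helper that is not part of the notebook, and its default directories are assumed to match the ones set in the Vars cell.

```python
import os

# Hypothetical helper (not in the original notebook): given any directory
# below the raw dataset root, return the pickle file its songs end up in.
def pickle_path_for(raw_subdir,
                    raw_root="./datasets/raw_files",
                    pickle_root="./datasets/pickle_files"):
    # path of the sub-directory relative to the raw root, e.g. "A/B/C"
    rel = os.path.relpath(raw_subdir, raw_root)
    # the first path component is the top-level letter directory
    letter = rel.split(os.sep)[0]
    return os.path.join(pickle_root, "df_pickle_" + letter)

print(pickle_path_for("./datasets/raw_files/A/B/C"))
```

Every directory below the same top-level letter maps to the same pickle file, which is why the conversion produces one file per letter.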

## Libraries

In [1]:
import pandas as pd
import os
import h5py  # not called directly below; pd.HDFStore itself relies on PyTables (the "tables" package)
import string

## Vars

In [2]:
raw_dataset_dir    = "./datasets/raw_files"
pickle_dataset_dir = "./datasets/pickle_files"

# the raw dataset is split into top-level directories named A to Z
letters            = string.ascii_uppercase

## Functions

In [3]:
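def create_pickle_datasets(letters):
    # make sure the output directory exists before writing any pickle
    os.makedirs(pickle_dataset_dir, exist_ok=True)
    for letter in letters:
        frames = []
        for subdir, dirs, files in os.walk(os.path.join(raw_dataset_dir, letter)):
            for file in files:
                # avoid reading hidden files
                if file.startswith("."):
                    continue
                # join the three song tables of one h5 file side by side
                with pd.HDFStore(os.path.join(subdir, file), mode="r") as store:
                    frames.append(pd.concat([store['/analysis/songs'], store['/metadata/songs'], store['/musicbrainz/songs']], axis=1))
        # DataFrame.append was removed in pandas 2.0, so concatenate once per letter
        df = pd.concat(frames) if frames else pd.DataFrame()
        df.to_pickle(os.path.join(pickle_dataset_dir, "df_pickle_" + letter))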

## Main

In [4]:
create_pickle_datasets(letters)
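
Once the pickle files exist, they can be read back without touching the raw h5 files. The following is a minimal sketch, not part of the original workflow: `load_pickle_datasets` is a hypothetical helper, and it assumes the `df_pickle_<letter>` naming used above. Letters with no pickle file are skipped.

```python
import os
import string

import pandas as pd

# Hypothetical helper (not in the original notebook): load the per-letter
# pickle files and combine them into a single DataFrame.
def load_pickle_datasets(pickle_dir, letters=string.ascii_uppercase):
    frames = []
    for letter in letters:
        path = os.path.join(pickle_dir, "df_pickle_" + letter)
        # skip letters whose pickle file has not been created
        if os.path.exists(path):
            frames.append(pd.read_pickle(path))
    # ignore_index gives the combined frame a fresh 0..n-1 index
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
```

For example, `load_pickle_datasets("./datasets/pickle_files", letters="AB")` would load only the pickles for A and B. Note that `pd.read_pickle` should only be used on files from a trusted source, since unpickling can execute arbitrary code.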