## About this notebook
This notebook contains our workflow for converting the Million Song Dataset files into pandas pickle files, in particular the song files stored in HDF5 (`.h5`) format.
We convert the datasets (h5, txt and db) to pickle to speed up loading and to avoid re-reading the nested HDF5 files every time.
The output files mirror the directory structure of the raw datasets, for example:
> All raw h5 files in `/datasets/raw_files/A/` will be stored as a single file "df_pickle_A" in `/datasets/pickle_files/`

The additional datasets are handled the same way, e.g.:
> `/datasets/raw_files/AdditionalFiles/artist_location.txt` will be stored as a single file "df_pickle_artist_location" in `/datasets/pickle_files/`
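
The naming convention described above can be sketched as a small path-mapping function. This is only an illustration: `pickle_path_for` is a hypothetical helper that is not part of the notebook, and the default output directory is assumed to match `pickle_dataset_dir` from the Vars cell.

```python
import posixpath

def pickle_path_for(raw_path, pickle_dir="./datasets/pickle_files"):
    """Map a raw directory or file to its pickle path (hypothetical helper)."""
    # Strip a trailing slash, then keep only the last path component.
    name = posixpath.basename(raw_path.rstrip("/"))
    # Drop a file extension such as ".txt" if there is one.
    stem = posixpath.splitext(name)[0]
    return posixpath.join(pickle_dir, "df_pickle_" + stem)

print(pickle_path_for("./datasets/raw_files/A/"))
print(pickle_path_for("./datasets/raw_files/AdditionalFiles/artist_location.txt"))
```
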

## Libraries

In [1]:
import pandas as pd
import os
import h5py
import string

## Vars

In [2]:
raw_dataset_dir    = "./datasets/raw_files"
pickle_dataset_dir = "./datasets/pickle_files"

# the raw datasets are split into one subdirectory per uppercase letter (A-Z)
letters            = string.ascii_uppercase

## Functions

Function to convert h5 files

In [3]:
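def create_pickle_datasets(letters):
    """Write one pickle per letter directory, combining all of its h5 files."""
    for letter in letters:
        frames = []
        for subdir, dirs, files in os.walk(os.path.join(raw_dataset_dir, letter)):
            for file in files:
                # avoid reading hidden files
                if file.startswith("."):
                    continue
                # the context manager closes the store even if a read fails
                with pd.HDFStore(os.path.join(subdir, file), mode="r") as store:
                    # join the three per-song tables side by side
                    tables = [store['/analysis/songs'], store['/metadata/songs'], store['/musicbrainz/songs']]
                    frames.append(pd.concat(tables, axis=1))
        # concatenate once at the end; DataFrame.append was removed in pandas 2.0
        df = pd.concat(frames) if frames else pd.DataFrame()
        df.to_pickle(os.path.join(pickle_dataset_dir, "df_pickle_" + letter))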
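
The combining step in the function above (join the tables of one file column-wise, then stack all files row-wise) can be tried without any h5 files. The sketch below uses made-up one-row tables as stand-ins for the real song tables; `combine_song_tables` and the column names are illustrative assumptions, not part of the notebook. Unlike the function above, it resets the row index with `ignore_index=True`.

```python
import pandas as pd

def combine_song_tables(tables_per_file):
    """Join each file's tables side by side, then stack all files as rows."""
    # One wide row block per file (column-wise join).
    rows = [pd.concat(tables, axis=1) for tables in tables_per_file]
    # A single concat at the end avoids repeatedly copying a growing frame.
    return pd.concat(rows, ignore_index=True)

# Two toy "files", each holding two one-row tables (not real dataset values).
file_1 = [pd.DataFrame({"tempo": [120.0]}), pd.DataFrame({"title": ["song a"]})]
file_2 = [pd.DataFrame({"tempo": [98.5]}), pd.DataFrame({"title": ["song b"]})]

combined = combine_song_tables([file_1, file_2])
print(combined)
```

Collecting the per-file frames in a list and concatenating once is usually preferable to appending inside the loop, since each append copies all rows gathered so far.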

Function to convert txt datasets

In [3]:
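def create_pickle_additional_datasets(dataset_file, separator, columns):
    # a multi-character separator is treated as a regular expression, which requires the python engine
    path       = os.path.join(raw_dataset_dir, "AdditionalFiles", dataset_file)
    df         = pd.read_csv(path, sep=separator, engine="python", header=None)
    df.columns = columns
    # drop the extension: "artist_location.txt" -> "artist_location"
    filename   = os.path.splitext(dataset_file)[0]
    df.to_pickle(os.path.join(pickle_dataset_dir, "df_pickle_" + filename))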
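
To see how the `<SEP>` separator is parsed, the same `read_csv` call can be run on an in-memory string. The two sample lines below are invented for illustration and are not taken from `artist_location.txt`; only the separator and the column names come from this notebook.

```python
import io
import pandas as pd

# Invented sample rows in the "<SEP>"-separated layout (not real dataset content).
sample = (
    "ID_ONE<SEP>51.5<SEP>-0.1<SEP>Example Artist<SEP>London, England\n"
    "ID_TWO<SEP>40.7<SEP>-74.0<SEP>Another Artist<SEP>New York, NY\n"
)

columns = ["artist_id", "lat", "long", "Name", "Location"]
# The python engine is needed because the separator is longer than one character.
df = pd.read_csv(io.StringIO(sample), sep="<SEP>", engine="python", header=None)
df.columns = columns
print(df)
```

Commas inside a field, as in the location column here, are left untouched because only `<SEP>` acts as a delimiter.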

## Main

Converting h5 files

In [4]:
create_pickle_datasets(letters)

Converting txt datasets in AdditionalFiles dir

In [4]:
columns = ["artist_id","lat","long","Name","Location"]
create_pickle_additional_datasets("artist_location.txt","<SEP>",columns)
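
Once the pickle files exist, they can be loaded back with `pd.read_pickle`. The round trip below is a minimal sketch that writes to a temporary directory with a made-up frame, so it runs without the real datasets; the file name `df_pickle_example` is an assumption for illustration.

```python
import os
import tempfile
import pandas as pd

# Small made-up frame standing in for a converted dataset.
df = pd.DataFrame({"artist_id": ["ID_ONE"], "lat": [51.5], "long": [-0.1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "df_pickle_example")
    df.to_pickle(path)
    # Loading a pickle restores the frame with its dtypes and index.
    restored = pd.read_pickle(path)

print(restored.equals(df))  # True
```

Pickle files can execute arbitrary code when loaded, so only files from a trusted source should be read this way.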