## About this notebook
This notebook gives a brief, general description of each dataset. We use it to find out which features are available for answering our research questions.
<br>**Note that at this point we only peek at the raw datasets in pickle format; no other processing has been applied.**

## Libraries

In [1]:
import pandas as pd
from IPython.display import display

## Vars

In [2]:
pickle_dataset_dir = "./datasets/pickle_files"

## Functions

Function to inspect the general structure of a dataset

In [10]:
def debug_dataframe(df):
    """Print the shape and column names of `df`, then show its first three rows."""
    print("Dataset size: ", df.shape)
    print("Dataset columns: ", df.columns)
    print("\nPreview:")
    display(df.head(3))

Function to read a single dataset pickle file

In [6]:
def read_dataset(dataset_file):
    """Load the pickle file 'df_pickle_<dataset_file>' from pickle_dataset_dir."""
    df = pd.read_pickle(pickle_dataset_dir + "/df_pickle_" + dataset_file)
    return df
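
As a minimal sketch of how the loader can be exercised without the real dataset files, the variant below takes the directory as a parameter and round-trips a small made-up frame through a temporary directory. The name `read_dataset_from` and the toy data are assumptions for illustration only; the `df_pickle_` file prefix is the one used by `read_dataset` above.

```python
import tempfile
from pathlib import Path

import pandas as pd


def read_dataset_from(dataset_dir, dataset_name):
    """Load '<dataset_dir>/df_pickle_<dataset_name>' as a DataFrame (hypothetical helper)."""
    return pd.read_pickle(Path(dataset_dir) / f"df_pickle_{dataset_name}")


# Round trip in a throwaway directory, so no real dataset file is needed.
with tempfile.TemporaryDirectory() as tmp_dir:
    toy = pd.DataFrame({"track_id": ["T1", "T2"], "year": [2003, 1995]})
    toy.to_pickle(Path(tmp_dir) / "df_pickle_toy")
    loaded = read_dataset_from(tmp_dir, "toy")

# True when the loaded frame matches what was written.
print(loaded.equals(toy))
```

Passing the directory explicitly instead of reading a global makes the function easier to test, at the cost of one more argument at every call site.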

## Main

# 1. Million Song Dataset (MSD)
The Million Song Dataset consists of the following tables:
1. msd_songs : one million songs with the main feature columns
2. msd_summary : one million songs with additional feature columns
3. msd_artist_location : location per artist_id, with latitude and longitude entries
4. msd_artist_mbtag : MusicBrainz tags (mbtag) per artist_id; note that artist_id is not unique
5. msd_artist_similarity : artists similar to a given artist_id; note that artist_id is not unique
6. msd_artist_term : terms per artist_id; note that artist_id is not unique
7. msd_tracks_per_year : summary of msd_songs with only the 'year', 'track_id', 'artist_name', and 'song_title' columns
8. msd_unique_artists : list of all unique artists in the MSD
9. msd_unique_tracks : list of all unique tracks in the MSD
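
Several of the tables listed above are noted as having a non-unique artist_id. One way to check such a claim is sketched below on a small made-up frame rather than on the real pickle files; the rows and the helper name `key_summary` are illustrative assumptions.

```python
import pandas as pd

# Toy stand-in for a table such as msd_artist_mbtag (values are made up).
toy_mbtag = pd.DataFrame(
    {
        "artist_id": ["A1", "A1", "A1", "A2"],
        "mbtag": ["uk", "rock", "garage rock", "jazz"],
    }
)


def key_summary(df, key):
    """Report whether `key` is unique in `df`, with row and distinct-key counts."""
    return {
        "is_unique": bool(df[key].is_unique),
        "rows": len(df),
        "distinct_keys": int(df[key].nunique()),
    }


summary = key_summary(toy_mbtag, "artist_id")
print(summary)
```

The same call could be applied to any loaded table, for example `key_summary(df_msd_artist_term, "artist_id")`, before deciding how to join it with the song-level tables.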

### 1.1 msd_songs

In [13]:
df_msd_songs = read_dataset("msd_songs")
debug_dataframe(df_msd_songs)

Dataset size:  (1000000, 11)
Dataset columns:  Index(['track_id', 'title', 'song_id', 'release', 'artist_id', 'artist_mbid',
       'artist_name', 'duration', 'artist_familiarity', 'artist_hotttnesss',
       'year'],
      dtype='object')

Preview:


Unnamed: 0,track_id,title,song_id,release,artist_id,artist_mbid,artist_name,duration,artist_familiarity,artist_hotttnesss,year
0,TRMMMYQ128F932D901,Silent Night,SOQMMHC12AB0180CB8,Monster Ballads X-Mas,ARYZTJS1187B98C555,357ff05d-848a-44cf-b608-cb34b5701ae5,Faster Pussy cat,252.05506,0.649822,0.394032,2003
1,TRMMMKD128F425225D,Tanssi vaan,SOVFVAK12A8C1350D9,Karkuteillä,ARMVN3U1187FB3A1EB,8d7ef530-a6fd-4f8f-b2e2-74aec765e0f9,Karkkiautomaatti,156.55138,0.439604,0.356992,1995
2,TRMMMRX128F93187D9,No One Could Ever,SOGTUKN12AB017F4F1,Butter,ARGEKB01187FB50750,3d403d44-36ce-465c-ad43-ae877e65adc4,Hudson Mohawke,138.97098,0.643681,0.437504,2006


### 1.2 msd_summary

In [12]:
df_msd_summary = read_dataset("msd_summary")
debug_dataframe(df_msd_summary)

Dataset size:  (1000000, 53)
Dataset columns:  Index(['analysis_sample_rate', 'audio_md5', 'danceability', 'duration',
       'end_of_fade_in', 'energy', 'idx_bars_confidence', 'idx_bars_start',
       'idx_beats_confidence', 'idx_beats_start', 'idx_sections_confidence',
       'idx_sections_start', 'idx_segments_confidence',
       'idx_segments_loudness_max', 'idx_segments_loudness_max_time',
       'idx_segments_loudness_start', 'idx_segments_pitches',
       'idx_segments_start', 'idx_segments_timbre', 'idx_tatums_confidence',
       'idx_tatums_start', 'key', 'key_confidence', 'loudness', 'mode',
       'mode_confidence', 'start_of_fade_out', 'tempo', 'time_signature',
       'time_signature_confidence', 'track_id', 'analyzer_version',
       'artist_7digitalid', 'artist_familiarity', 'artist_hotttnesss',
       'artist_id', 'artist_latitude', 'artist_location', 'artist_longitude',
       'artist_mbid', 'artist_name', 'artist_playmeid', 'genre',
       'idx_artist_terms', 'idx_sim

Unnamed: 0,analysis_sample_rate,audio_md5,danceability,duration,end_of_fade_in,energy,idx_bars_confidence,idx_bars_start,idx_beats_confidence,idx_beats_start,...,idx_artist_terms,idx_similar_artists,release,release_7digitalid,song_hotttnesss,song_id,title,track_7digitalid,idx_artist_mbtags,year
0,22050,aee9820911781c734e7694c5432990ca,0.0,252.05506,2.049,0.0,0,0,0,0,...,0,0,Monster Ballads X-Mas,633681,0.542899,SOQMMHC12AB0180CB8,Silent Night,7032331,0,2003
1,22050,ed222d07c83bac7689d52753610a513a,0.0,156.55138,0.258,0.0,0,0,0,0,...,0,0,Karkuteillä,145266,0.299877,SOVFVAK12A8C1350D9,Tanssi vaan,1514808,0,1995
2,22050,96c7104889a128fef84fa469d60e380c,0.0,138.97098,0.0,0.0,0,0,0,0,...,0,0,Butter,625706,0.617871,SOGTUKN12AB017F4F1,No One Could Ever,6945353,0,2006


### 1.3 msd_artist_location

In [14]:
df_msd_artist_location = read_dataset("msd_artist_location")
debug_dataframe(df_msd_artist_location)

Dataset size:  (13850, 5)
Dataset columns:  Index(['artist_id', 'lat', 'long', 'Name', 'Location'], dtype='object')

Preview:


Unnamed: 0,artist_id,lat,long,Name,Location
0,ARZGXZG1187B9B56B6,-16.96595,-61.14804,Endless Blue,Santa Cruz
1,AR8K6F31187B99C2BC,46.44231,-93.36586,Go Fish,"Twin Cities, MN"
2,ARHJJ771187FB5B581,51.59678,-0.33556,Screaming Lord Sutch,"Harrow, Middlesex, England"


### 1.4 msd_artist_mbtag

In [15]:
df_msd_artist_mbtag = read_dataset("msd_artist_mbtag")
debug_dataframe(df_msd_artist_mbtag)

Dataset size:  (24777, 2)
Dataset columns:  Index(['artist_id', 'mbtag'], dtype='object')

Preview:


Unnamed: 0,artist_id,mbtag
0,AR002UA1187B9A637D,uk
1,AR002UA1187B9A637D,rock
2,AR002UA1187B9A637D,garage rock


### 1.5 msd_artist_similarity

In [17]:
df_msd_artist_similarity = read_dataset("msd_artist_similarity")
debug_dataframe(df_msd_artist_similarity)

Dataset size:  (2201916, 2)
Dataset columns:  Index(['artist_id', 'similliar_artist_id'], dtype='object')

Preview:


Unnamed: 0,artist_id,similliar_artist_id
0,AR002UA1187B9A637D,ARQDOR81187FB3B06C
1,AR002UA1187B9A637D,AROHMXJ1187B989023
2,AR002UA1187B9A637D,ARAGWVR1187B9B749B


### 1.6 msd_artist_term

In [18]:
df_msd_artist_term = read_dataset("msd_artist_term")
debug_dataframe(df_msd_artist_term)

Dataset size:  (1109381, 2)
Dataset columns:  Index(['artist_id', 'term'], dtype='object')

Preview:


Unnamed: 0,artist_id,term
0,AR002UA1187B9A637D,garage rock
1,AR002UA1187B9A637D,country rock
2,AR002UA1187B9A637D,free jazz


### 1.7 msd_tracks_per_year

In [19]:
df_msd_tracks_per_year = read_dataset("msd_tracks_per_year")
debug_dataframe(df_msd_tracks_per_year)

Dataset size:  (515576, 4)
Dataset columns:  Index(['year', 'track_id', 'artist_name', 'song_title'], dtype='object')

Preview:


Unnamed: 0,year,track_id,artist_name,song_title
0,1922,TRSGHLU128F421DF83,Alberta Hunter,Don't Pan Me
1,1922,TRMYDFV128F42511FC,Barrington Levy,Warm And Sunny Day
2,1922,TRRAHXQ128F42511FF,Barrington Levy,Looking My Love


### 1.8 msd_unique_artists

In [21]:
df_msd_unique_artists = read_dataset("msd_unique_artists")
debug_dataframe(df_msd_unique_artists)

Dataset size:  (44745, 4)
Dataset columns:  Index(['artist_id', 'artist_mbid', 'track_id', 'artist_name'], dtype='object')

Preview:


Unnamed: 0,artist_id,artist_mbid,track_id,artist_name
0,AR002UA1187B9A637D,7752a11c-9d8b-4220-ac44-e4a04cc8471d,TRMUOZE12903CDF721,The Bristols
1,AR003FB1187B994355,1dbd2d7b-64c8-46aa-9f47-ff589096d672,TRWDPFR128F93594A6,The Feds
2,AR006821187FB5192B,94fc1228-7032-4fe6-a485-e122e5fbee65,TRMZLJF128F4269EAC,Stephen Varcoe/Choir of King's College_ Cambri...


### 1.9 msd_unique_tracks

In [22]:
df_msd_unique_tracks = read_dataset("msd_unique_tracks")
debug_dataframe(df_msd_unique_tracks)

Dataset size:  (1000000, 4)
Dataset columns:  Index(['track_id', 'song_id', 'artist_name', 'song_title'], dtype='object')

Preview:


Unnamed: 0,track_id,song_id,artist_name,song_title
0,TRMMMYQ128F932D901,SOQMMHC12AB0180CB8,Faster Pussy cat,Silent Night
1,TRMMMKD128F425225D,SOVFVAK12A8C1350D9,Karkkiautomaatti,Tanssi vaan
2,TRMMMRX128F93187D9,SOGTUKN12AB017F4F1,Hudson Mohawke,No One Could Ever


# 2. SecondHandSongs Dataset (SHS)
TBD

### 2.1 SHS

In [23]:
# TBD

# 3. musiXmatch Dataset (MXM)
The musiXmatch dataset consists of the following:
1. mxm_lyrics : contains the lyrics of each song as per-word counts (one row per word per song); note that track_id is not unique

### 3.1 mxm_lyrics

In [24]:
df_mxm_lyrics = read_dataset("mxm_lyrics")
debug_dataframe(df_mxm_lyrics)

Dataset size:  (19045332, 5)
Dataset columns:  Index(['track_id', 'mxm_tid', 'word', 'count', 'is_test'], dtype='object')

Preview:


Unnamed: 0,track_id,mxm_tid,word,count,is_test
0,TRAAAAV128F421A322,4623710,i,6,0
1,TRAAAAV128F421A322,4623710,the,4,0
2,TRAAAAV128F421A322,4623710,you,2,0
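
Because mxm_lyrics stores one row per word per track, most analyses first need it reshaped to one row per track. The sketch below shows two such reshapes on made-up rows that reuse the `track_id`, `word`, and `count` columns from the preview; the toy values and variable names are assumptions, not part of the dataset.

```python
import pandas as pd

# Toy rows in the same long format as mxm_lyrics (values are made up).
toy_lyrics = pd.DataFrame(
    {
        "track_id": ["T1", "T1", "T1", "T2", "T2"],
        "word": ["i", "the", "you", "i", "love"],
        "count": [6, 4, 2, 1, 3],
    }
)

# One row per track, one column per word; words absent from a track get 0.
bag_of_words = toy_lyrics.pivot_table(
    index="track_id", columns="word", values="count", aggfunc="sum", fill_value=0
)

# Total number of counted words per track.
words_per_track = toy_lyrics.groupby("track_id")["count"].sum()

print(bag_of_words)
print(words_per_track)
```

On the full table of roughly 19 million rows, a dense pivot like this would likely be too large to hold comfortably in memory, so filtering to a subset of tracks or words first is probably the safer route.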


# 4. Last.fm Dataset (LASTFM)
The Last.fm dataset consists of the following:
1. lastfm_similars : contains the tracks similar to each "track_id", with a similarity score per track

### 4.1 lastfm_similars

In [25]:
df_lastfm_similars = read_dataset("lastfm_similars")
debug_dataframe(df_lastfm_similars)

Dataset size:  (584897, 2)
Dataset columns:  Index(['track_id', 'similars_entry'], dtype='object')

Preview:


Unnamed: 0,track_id,similars_entry
0,TRCCCYE12903CFF0E9,"TRHZRQH128F92F9AC2,0.498053,TRZQUEN12903CBFFBB..."
1,TRCCCPM12903CBEEE5,"TRNRRXT12903CFAD5A,1,TRWVNWV12903CBEEE7,0.5238..."
2,TRCCCFH12903CEBC70,"TRRGGCN128F92E3579,0.646036,TRTVJGV128F424A147..."
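
Judging from the preview, `similars_entry` packs alternating track ids and scores into one comma-separated string. Under that assumption, the sketch below unpacks it into one row per (track, similar track) pair. The helper name `parse_similars`, the output column names, and the toy rows are illustrative and should be checked against the real data.

```python
import pandas as pd


def parse_similars(entry):
    """Split 'id,score,id,score,...' into a list of (track_id, score) pairs."""
    parts = entry.split(",") if entry else []
    # Even positions hold track ids, odd positions hold their scores.
    return [(tid, float(score)) for tid, score in zip(parts[0::2], parts[1::2])]


# Toy rows in the same shape as lastfm_similars (values are made up).
toy_similars = pd.DataFrame(
    {
        "track_id": ["T1", "T2"],
        "similars_entry": ["T2,0.5,T3,0.25", "T1,1"],
    }
)

# Flatten to one row per (track, similar track) pair.
records = [
    (row.track_id, similar_id, score)
    for row in toy_similars.itertuples(index=False)
    for similar_id, score in parse_similars(row.similars_entry)
]
long_similars = pd.DataFrame(
    records, columns=["track_id", "similar_track_id", "score"]
)
print(long_similars)
```

The long format makes it straightforward to filter by score or to join the similar tracks back to msd_songs on the track id.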


# 5. Taste Profile
The Taste Profile dataset consists of the following:
1. train_triplets : contains the play count per song per user (one row per user and song). Note that user_id is not unique

### 5.1 train_triplets

In [26]:
df_train_triplets = read_dataset("train_triplets")
debug_dataframe(df_train_triplets)

Dataset size:  (48373586, 3)
Dataset columns:  Index(['user_id', 'song_id', 'play_count'], dtype='object')

Preview:


Unnamed: 0,user_id,song_id,play_count
0,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOAKIMP12A8C130995,1
1,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOAPDEY12A81C210A9,1
2,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOBBMDR12A8C13253B,2
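
Since each row of train_triplets is a (user, song, play count) triplet, song-level popularity can be derived with a group-by. The sketch below does this on made-up rows that reuse the `user_id`, `song_id`, and `play_count` columns from the preview; the toy values and the output column names `listeners` and `total_plays` are assumptions.

```python
import pandas as pd

# Toy rows in the same shape as train_triplets (values are made up).
toy_triplets = pd.DataFrame(
    {
        "user_id": ["u1", "u1", "u2", "u2", "u3"],
        "song_id": ["S1", "S2", "S1", "S3", "S1"],
        "play_count": [1, 2, 5, 1, 3],
    }
)

# Per song: number of distinct listeners and total number of plays.
song_stats = (
    toy_triplets.groupby("song_id")
    .agg(listeners=("user_id", "nunique"), total_plays=("play_count", "sum"))
    .sort_values("total_plays", ascending=False)
)
print(song_stats)
```

The resulting frame is indexed by `song_id`, the key that msd_songs and msd_unique_tracks also carry, so it could be merged with either to attach titles and artists.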
