# Million Song Dataset Analysis

*Moses Surumen, Ellen Peng, Kuhuk Goyal*  
*CS 194-31 Final Project*  
*Project Name: Music Networks*

---

## Introduction

The **Million Song Dataset (MSD)** is a freely available collection of audio features and metadata for a million contemporary popular music tracks. It contains track, song, artist, and album metadata, as well as artist similarity and artist tags. The data is stored in HDF5 format, with one file per song.

The dataset was created using the [**Echo Nest**](http://the.echonest.com/) API. More information is available on the [Million Song Dataset website](http://labrosa.ee.columbia.edu/millionsong/).


---

In [2]:
import os
import numpy as np
import pandas as pd
import pprint, pickle

---
## Dataset Format

| Column        | Description                            | Format |
| :------------ |--------------------------------------: | :-: |
| a_similar     | Similar artists                        | array('artistId', 'artistId', ... , 'artistId') |
| artist_7did   | Seven-digit artist ID                  | |
| artist_id     | Echo Nest artist ID                    | |
| artist_mbid   | MusicBrainz artist ID                  | |
| artist_name   | Name of the artist                     | string |
| dance         | Danceability of the song               | float |
| dur           | Duration of the song                   | float |
| energy        | Energy of the song                     | float |
| song_id       | Echo Nest song ID                      | |
| title         | Title of the song                      | |
| track_id      | Echo Nest track ID                     | |
| year          | Year the song was released             | |






---

## Process Dataset

In [4]:
path = 'data/cleaned_data/'

### 1. Artists

In [6]:
artists_file = path + 'artists.csv'

In [11]:
df_artists = pd.read_csv(artists_file)
df_artists

Unnamed: 0,artist_id,artist_mbid,artist_7did,artist_name
0,'AREJXK41187B9A4ACC','c43bb0d6-94d7-410f-80fb-e5a243b18d23',16971,'Rapha\xc3\xabl'
1,'ARODOO01187FB44F4A','60bd8a1c-c093-4849-8f28-08101ca059b1',1701,'The Baltimore Consort'
2,'ARJGW911187FB586CA','44b5b950-2ae2-403a-8c67-82d8fc72033d',92184,'I Hate Sally'
3,'ARV8T9T1187B99F3F4','efaefde1-e09b-4d49-9d8e-b1304d2ece8d',21896,'Amorphis'
4,'AR050VJ1187B9B13A7','37c78aeb-d196-42b5-b991-6afb4fc9bc2e',94403,'Dead Kennedys'
5,'AR8KUS11187B98C991','050ce7ea-0935-430f-bcec-b83e702298e',263016,'Brigada Victor Jara'
6,'ARB2M051187FB56FA9','336f2e14-b064-45c8-8b2a-01663a0bee64',50458,'Spoonie Gee'
7,'AREJ5K11187B993F5F','3c050ada-9db8-4e87-86fb-11c5294f9711',25079,'R. Carlos Nakai'
8,'ARQUMH41187B9AF699','f59c5520-5f46-4d2c-b2c4-822eabf53419',12873,'Linkin Park'
9,'ARWNTXK11EBCD7BBFB','',505616,'Alain-Fran\xc3\xa7ois'
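
Several names in the output above appear as escaped UTF-8 byte sequences wrapped in quotes (for example `'Rapha\xc3\xabl'`), which suggests the values were written as the text form of Python byte strings. The following is a minimal sketch of one way to recover readable names, assuming that is how the column was serialized. `decode_bytes_repr` is a hypothetical helper and is not part of the original notebook.

```python
import ast


def decode_bytes_repr(value: str) -> str:
    """Decode text that looks like the repr of a UTF-8 byte string.

    Assumption: the value was stored as the repr of a bytes object, with or
    without the leading b. Anything that does not parse is returned unchanged.
    """
    text = value.strip()
    if not text.startswith(("b'", 'b"')):
        text = "b" + text  # restore the prefix so it parses as a bytes literal
    try:
        raw = ast.literal_eval(text)
    except (SyntaxError, ValueError):
        return value
    if not isinstance(raw, bytes):
        return value
    return raw.decode("utf-8", errors="replace")


# Names copied from the artists output above.
names = ["'Rapha\\xc3\\xabl'", "'Alain-Fran\\xc3\\xa7ois'", "'Linkin Park'"]
decoded = [decode_bytes_repr(n) for n in names]
print(decoded)
```

On the full table, the same helper could be applied with `df_artists['artist_name'].map(decode_bytes_repr)`.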


### 2. Songs

In [7]:
songs_file = path + 'songs.csv'

In [12]:
df_songs = pd.read_csv(songs_file)
df_songs

Unnamed: 0,song_id,track_id,song_name,danceability,duration,energy,loudness
0,'SOGSOUE12A58A76443','TRARRPG12903CD1DE9','Zip-A-Dee-Doo-Dah (Song of the South)',0.0,199.99302,0.0,-16.477
1,'SOVVDCO12AB0187AF7','TRARRER128F9328521','Liquid Time (composition by John Goodsall)',0.0,279.35302,0.0,-12.474
2,'SOZQSGL12AF72A9145','TRARREF128F422FD96','Halloween',0.0,216.84200,0.0,-4.264
3,'SOBTEHX12A6D4FBF18','TRARRQO128F427B5F5','You Eclipsed By Me (Album Version)',0.0,218.90567,0.0,-4.707
4,'SOXGDVW12AB01864E7','TRARRMK12903CDF793','Shovel',0.0,580.70159,0.0,-4.523
5,'SOZKFHV12A6D4F996F','TRARIRG128F147FC96',"b""I'm Not Moving""",0.0,154.93179,0.0,-15.433
6,'SOPVMQT12A679C8040','TRARIWH128E079349F','Columbia',0.0,324.80608,0.0,-7.588
7,'SOIAZJL12A6D4F854E','TRARIIY128F147187D','Radar For Love',0.0,241.08363,0.0,-8.021
8,'SOCPFDY12AB018300D','TRARIZK128F92FF27A','Antarctica Starts Here',0.0,199.96689,0.0,-20.172
9,'SOVOTBC12A6D4F696A','TRARINE128F4280B9C','Out In The Street (Live) (2008 Digital Remast...,0.0,314.17424,0.0,-6.763
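
The `Unnamed: 0` column in these outputs is the row index that was saved along with each CSV, and in the ten rows shown `danceability` and `energy` are always 0.0. The sketch below uses a three-row inline sample copied from the output above (quotes around the IDs omitted, index renumbered) rather than the real `songs.csv`. It shows how `index_col=0` reads the saved index back as the index, and how constant columns could be detected. Whether these columns are constant across the full file is not something this excerpt establishes.

```python
import io

import pandas as pd

# Three rows copied from the output above; the real notebook reads songs.csv.
sample_csv = io.StringIO(
    ",song_id,track_id,song_name,danceability,duration,energy,loudness\n"
    "0,SOZQSGL12AF72A9145,TRARREF128F422FD96,Halloween,0.0,216.84200,0.0,-4.264\n"
    "1,SOXGDVW12AB01864E7,TRARRMK12903CDF793,Shovel,0.0,580.70159,0.0,-4.523\n"
    "2,SOPVMQT12A679C8040,TRARIWH128E079349F,Columbia,0.0,324.80608,0.0,-7.588\n"
)

# index_col=0 treats the first column as the index instead of "Unnamed: 0".
df_sample = pd.read_csv(sample_csv, index_col=0)

# Columns with a single distinct value carry no information for analysis.
constant_cols = [c for c in df_sample.columns if df_sample[c].nunique() == 1]
print(constant_cols)
```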


### 3. Similar Artists

In [8]:
similar_artists_file = path + 'similar_artists.csv'

In [13]:
df_similar_artists = pd.read_csv(similar_artists_file)
df_similar_artists

Unnamed: 0,from_artist,to_artist
0,AREJXK41187B9A4ACC,ARVEJ9M1187FB4DC44
1,AREJXK41187B9A4ACC,ARYDHN21187FB466A8
2,AREJXK41187B9A4ACC,AR270BS1187FB42CE6
3,AREJXK41187B9A4ACC,AR7VA9B1187B99CFFF
4,AREJXK41187B9A4ACC,AR2X7JU1187FB3DCBC
5,AREJXK41187B9A4ACC,ARHOUWF1187B9AF7CA
6,AREJXK41187B9A4ACC,ARV8IRZ1187B9A14BA
7,AREJXK41187B9A4ACC,AR25ITW1187FB4E5C8
8,AREJXK41187B9A4ACC,AREO17I1187B9B159B
9,AREJXK41187B9A4ACC,ARN3YVZ1187FB4E7CD
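
Each row of `similar_artists.csv` reads naturally as a directed edge from `from_artist` to `to_artist`. As a rough sketch of how such an edge list could be turned into an adjacency map for network analysis, the example below uses three edges copied from the output above. `build_adjacency` is a hypothetical helper, not code from the project.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Map each source artist ID to the list of artist IDs it points to."""
    adjacency = defaultdict(list)
    for source, target in edges:
        adjacency[source].append(target)
    return dict(adjacency)


# Edges copied from the first rows of the output above.
edges = [
    ("AREJXK41187B9A4ACC", "ARVEJ9M1187FB4DC44"),
    ("AREJXK41187B9A4ACC", "ARYDHN21187FB466A8"),
    ("AREJXK41187B9A4ACC", "AR270BS1187FB42CE6"),
]

adjacency = build_adjacency(edges)

# Out-degree: how many similar artists are listed for each source artist.
out_degree = {artist: len(targets) for artist, targets in adjacency.items()}
print(out_degree)
```

With the full DataFrame, the edges could be produced by `df_similar_artists[['from_artist', 'to_artist']].itertuples(index=False, name=None)`.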


### 4. Similar Songs

In [9]:
similar_songs_file = path + 'similar_songs.csv'

In [14]:
df_similar_songs = pd.read_csv(similar_songs_file)
df_similar_songs

Unnamed: 0,from_track,to_track,sim_measure
0,TRARRUZ128F9307C57,TRGGBUV128F930B5F6,1.000000
1,TRARRUZ128F9307C57,TRKJAYK128F930B613,0.985081
2,TRARRUZ128F9307C57,TRHSFCB128F428D093,0.190086
3,TRARRUZ128F9307C57,TRGJYWK128F428D09A,0.006668
4,TRARRUZ128F9307C57,TRRCYEK128F92F930C,0.006554
5,TRARRUZ128F9307C57,TRBDVWE128F92F930A,0.006554
6,TRARRUZ128F9307C57,TRMWTLR128F4215D72,0.006034
7,TRARRUZ128F9307C57,TRTDETO128F4215D85,0.005971
8,TRARRUZ128F9307C57,TRQXVWZ128F933E701,0.005142
9,TRARRUZ128F9307C57,TRJQRMU128F930CB16,0.005134
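
The `sim_measure` values shown fall off quickly: two neighbours score above 0.9, one scores about 0.19, and the rest are below 0.01. One possible way to prune weak links before building a song network is a simple threshold. The cutoff of 0.1 below is an arbitrary illustration, not a value taken from the project, and the rows are copied from the output above.

```python
import pandas as pd

# Rows copied from the output above (index column omitted).
df_sample = pd.DataFrame(
    {
        "from_track": ["TRARRUZ128F9307C57"] * 5,
        "to_track": [
            "TRGGBUV128F930B5F6",
            "TRKJAYK128F930B613",
            "TRHSFCB128F428D093",
            "TRGJYWK128F428D09A",
            "TRRCYEK128F92F930C",
        ],
        "sim_measure": [1.000000, 0.985081, 0.190086, 0.006668, 0.006554],
    }
)

MIN_SIMILARITY = 0.1  # hypothetical cutoff chosen only for illustration

# Keep only the edges whose similarity meets the cutoff.
strong_edges = df_sample[df_sample["sim_measure"] >= MIN_SIMILARITY]
print(strong_edges["to_track"].tolist())
```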


### 5. Artist Performs Songs

In [10]:
artist_songs_file = path + 'performs.csv'

In [15]:
df_artist_performs = pd.read_csv(artist_songs_file)
df_artist_performs

Unnamed: 0,artist_id,song_id
0,AREJXK41187B9A4ACC,TRARRZU128F4253CA2
1,ARODOO01187FB44F4A,TRARRUZ128F9307C57
2,ARDPTGD1187B9AD361,TRARRER128F9328521
3,ARV8T9T1187B99F3F4,TRARRYC128F428CCDA
4,ARJ5BEW1187FB52361,TRARROY128F42281F7
5,AR8KUS11187B98C991,TRARRVB128F92F47CA
6,ARK8OHG1187B99016A,TRARUOP12903CF2384
7,ARZFRQM1187B9A9772,TRARURM128F931A91B
8,ARQUMH41187B9AF699,TRARUTP128E0797FC7
9,ARWNTXK11EBCD7BBFB,TRARITF12903CE7DA9
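
In the output above, the `song_id` column of `performs.csv` holds values with the `TR` prefix, which matches `track_id` rather than `song_id` in the songs table. The sketch below joins the two tables on that assumption. It also assumes the literal quotes around the values in the songs output are part of the stored data and strips them first. The inline rows are copied from the outputs above, and the `_sample` names are hypothetical.

```python
import pandas as pd

# Two rows from the performs output above.
df_performs_sample = pd.DataFrame(
    {
        "artist_id": ["ARODOO01187FB44F4A", "ARDPTGD1187B9AD361"],
        "song_id": ["TRARRUZ128F9307C57", "TRARRER128F9328521"],
    }
)

# Two rows from the songs output above, quotes kept as displayed.
df_songs_sample = pd.DataFrame(
    {
        "song_id": ["'SOVVDCO12AB0187AF7'", "'SOZQSGL12AF72A9145'"],
        "track_id": ["'TRARRER128F9328521'", "'TRARREF128F422FD96'"],
        "song_name": [
            "'Liquid Time (composition by John Goodsall)'",
            "'Halloween'",
        ],
    }
)

# Strip the surrounding quotes so the keys are comparable.
songs_clean = df_songs_sample.apply(lambda col: col.str.strip("'"))

# Rename the performs key to reflect what it appears to contain.
performs_clean = df_performs_sample.rename(columns={"song_id": "track_id"})

# Inner join: keep only performances whose track appears in the songs sample.
joined = performs_clean.merge(songs_clean, on="track_id", how="inner")
print(joined[["artist_id", "track_id", "song_name"]])
```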
