# Creating User Profiles

*Moses Surumen, Ellen Peng, Kuhuk Goyal*  
*CS 194-31  Final Project*  
*Project Name: Music Networks*

---

## Introduction

The **Million Song Dataset (MSD)** is a freely available collection of audio features and metadata for one million contemporary popular music tracks. It contains track, song, artist, and album metadata, as well as artist similarity data and artist tags. The data is stored in HDF5 format, with one file per song.

The dataset was created using the [**Echo Nest**](http://the.echonest.com/) API. More information is available on the [Million Song Dataset website](http://labrosa.ee.columbia.edu/millionsong/).


---

In [1]:
# Pandas
import pandas as pd

# Graph
import community
import networkx as nx

# Plot
import matplotlib.pyplot as plt
import seaborn as sns

# Combinations
import itertools

___

## Get Echo Nest Taste Profile

In [4]:
df = pd.read_csv('data/train_triplets.txt', delimiter="\t", header=None, names=["User", "Song", "Playcount"])

In [5]:
df.head()

| | User | Song | Playcount |
|---|---|---|---|
| 0 | b80344d063b5ccb3212f76538f3d9e43d87dca9e | SOAKIMP12A8C130995 | 1 |
| 1 | b80344d063b5ccb3212f76538f3d9e43d87dca9e | SOAPDEY12A81C210A9 | 1 |
| 2 | b80344d063b5ccb3212f76538f3d9e43d87dca9e | SOBBMDR12A8C13253B | 2 |
| 3 | b80344d063b5ccb3212f76538f3d9e43d87dca9e | SOBFNSP12AF72A0E22 | 1 |
| 4 | b80344d063b5ccb3212f76538f3d9e43d87dca9e | SOBFOVM12A58A7D494 | 1 |


In [6]:
df.shape[0]

48373586

In [7]:
df.shape[1]

3

___

## Pre-Processing Data

We drop every (user, song) row whose play count is less than 25, then write the filtered dataset to a separate file.

In [14]:
# Drop, in place, every (User, Song) row with a play count below 25
df.drop(df.loc[df['Playcount'] < 25].index, inplace=True)

In [15]:
df.shape[0]

476267
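The full triplets file has about 48 million rows, so loading it whole before filtering can be memory-hungry. As a hedged alternative, the sketch below applies the same threshold while reading the file in chunks. The helper name `load_filtered`, the toy data, and the tiny chunk size are assumptions made for illustration; they are not part of the original notebook.

```python
import io
import pandas as pd

# Toy stand-in for data/train_triplets.txt (tab-separated, no header row).
sample = io.StringIO(
    "u1\tSOAAA\t1\n"
    "u1\tSOBBB\t30\n"
    "u2\tSOAAA\t25\n"
    "u2\tSOCCC\t24\n"
)

MIN_PLAYS = 25  # same threshold as the filtering cell above


def load_filtered(source, min_plays=MIN_PLAYS, chunksize=2):
    """Hypothetical helper: read triplets in chunks, keeping rows with enough plays."""
    chunks = pd.read_csv(
        source,
        delimiter="\t",
        header=None,
        names=["User", "Song", "Playcount"],
        chunksize=chunksize,  # tiny here; a real run would use a much larger value
    )
    # Filter each chunk before concatenating, so only surviving rows stay in memory.
    kept = [chunk[chunk["Playcount"] >= min_plays] for chunk in chunks]
    return pd.concat(kept)


filtered = load_filtered(sample)
```

Each surviving row is one (user, song) pair, so a song can still appear for one user and be dropped for another.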

In [16]:
df.head()

| | User | Song | Playcount |
|---|---|---|---|
| 503 | d6589314c0a9bcbca4fee0c93b14bc402363afea | SOBONKR12A58A7A7E0 | 26 |
| 787 | 5a905f000fc1ff3df7ca807d57edb608863db05d | SOMVTRL12A67AE0921 | 28 |
| 1326 | a820d2d4f16bbd53be9e41e0417dfb234bfdfba8 | SOGKEGN12AB0185355 | 26 |
| 2148 | 3f152d355d53865a2ca27ac5ceeffb7ebaea0a26 | SOQGETC12AB017F1E5 | 26 |
| 2209 | 3f152d355d53865a2ca27ac5ceeffb7ebaea0a26 | SOYZLWW12AB0186148 | 55 |


In [17]:
df.to_csv("taste_profiles.csv", encoding='utf-8', index=False)

## Sort Dataframe based on User IDs

In [18]:
df.sort_values('User')

| | User | Song | Playcount |
|---|---|---|---|
| 15493119 | 00020e8ba3f9041deed64ec9c60b26ff6bf41c66 | SOOBUXN12AB01887FA | 42 |
| 6783308 | 00023f6ad10cd247d187b461e6b00b7bf3ebc568 | SOFFKCX12A6D4FD4EB | 136 |
| 6783309 | 00023f6ad10cd247d187b461e6b00b7bf3ebc568 | SOFLJQZ12A6D4FADA6 | 93 |
| 6783331 | 00023f6ad10cd247d187b461e6b00b7bf3ebc568 | SOLIIPO12AB01861F7 | 77 |
| 6783335 | 00023f6ad10cd247d187b461e6b00b7bf3ebc568 | SOMYTVF12AB018DD45 | 69 |
| 6783340 | 00023f6ad10cd247d187b461e6b00b7bf3ebc568 | SOOBYPW12AB018DD4A | 71 |
| 12156673 | 00028f3cff4872bff3e9985cfa32e01a8d54e374 | SOFKZNG12AC9072F32 | 81 |
| 12156713 | 00028f3cff4872bff3e9985cfa32e01a8d54e374 | SOLPVAQ12AB017EB35 | 52 |
| 27246496 | 0002b896949cb2899feaed47104406e99eafa983 | SOIUJLY12A6701DF4D | 30 |
| 27246493 | 0002b896949cb2899feaed47104406e99eafa983 | SOBAKOT12A67021B3D | 26 |
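Note that `sort_values` returns a sorted copy and leaves the original frame untouched unless the result is assigned back (or `inplace=True` is passed). The toy example below, with made-up user and song IDs, is a minimal sketch of that behaviour.

```python
import pandas as pd

# Made-up rows, deliberately out of order by user.
toy = pd.DataFrame({
    "User": ["u3", "u1", "u2"],
    "Song": ["SOCCC", "SOAAA", "SOBBB"],
    "Playcount": [40, 30, 25],
})

by_user = toy.sort_values("User")       # sorted copy; original index labels are kept
original_order = toy["User"].tolist()   # the source frame is still in its old order
sorted_order = by_user["User"].tolist()
```

To keep the sorted order for later cells, the result would need to be assigned, for example `toy = toy.sort_values("User")`.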


## Create Hashmap with Values as Songs and Keys as Users

In [19]:
from collections import defaultdict

In [20]:
# Map each user ID to the list of songs remaining for that user after filtering
data_dict = defaultdict(list)
for user, song in zip(df.User.values, df.Song.values):
    data_dict[user].append(song)

In [21]:
len(data_dict)

187548
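The same user-to-songs mapping can also be built with a pandas `groupby`. The sketch below uses a small made-up frame to show that both routes produce the same dictionary; it is an illustrative alternative, not the method used in this notebook.

```python
from collections import defaultdict

import pandas as pd

# Made-up triplets with the same column names as the notebook's DataFrame.
toy = pd.DataFrame({
    "User": ["u2", "u1", "u2"],
    "Song": ["SOAAA", "SOBBB", "SOCCC"],
    "Playcount": [25, 30, 40],
})

# Route 1: explicit loop with a defaultdict, as in the cell above.
loop_dict = defaultdict(list)
for user, song in zip(toy["User"].values, toy["Song"].values):
    loop_dict[user].append(song)

# Route 2: group rows by user and collect each group's songs into a list.
grouped_dict = toy.groupby("User")["Song"].apply(list).to_dict()
```

Within each user, both routes keep the songs in their original row order.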

## Write Dictionary to File

In [26]:
import json

In [27]:
# Use a context manager so the file is flushed and closed after writing
with open("data_dict.txt", 'w') as f:
    json.dump(data_dict, f)
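A quick round trip is a cheap way to confirm that the saved file can be read back. The sketch below writes a small made-up dictionary to a temporary file (the path and contents are assumptions for illustration) and reloads it. A `defaultdict` comes back as a plain `dict`.

```python
import json
import os
import tempfile
from collections import defaultdict

# Made-up profiles standing in for the real user-to-songs mapping.
profiles = defaultdict(list)
profiles["u1"].append("SOAAA")
profiles["u1"].append("SOBBB")
profiles["u2"].append("SOCCC")

# Write to a throwaway location rather than the notebook's working directory.
path = os.path.join(tempfile.mkdtemp(), "data_dict_example.txt")

with open(path, "w") as f:   # the context manager closes the file after writing
    json.dump(profiles, f)

with open(path) as f:
    restored = json.load(f)
```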

## Remove Key-Value Pairs with Fewer than 25 Songs

In [36]:
# Shallow copy, so deleting users below does not also mutate data_dict
new_data_dict = dict(data_dict)

In [37]:
len(new_data_dict)

24720

In [38]:
# Iterate over a snapshot of the keys so entries can be deleted during the loop
for k in list(new_data_dict):
    if len(new_data_dict[k]) < 25:
        del new_data_dict[k]

In [39]:
len(new_data_dict)

274
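Plain assignment of a dictionary only creates a second name for the same object, so in-place deletions are visible through both names. One way to avoid that is to build a new dictionary instead of deleting keys. The sketch below shows this with a hypothetical helper, `filter_profiles`, and made-up data.

```python
def filter_profiles(profiles, min_songs=25):
    """Hypothetical helper: return a new dict of users with at least `min_songs` songs."""
    return {user: songs for user, songs in profiles.items() if len(songs) >= min_songs}


# Made-up profiles: one user above the threshold, one below it.
profiles = {
    "heavy_listener": ["SONG%03d" % i for i in range(30)],
    "casual_listener": ["SONG000", "SONG001"],
}

alias = profiles                    # a second name for the same dict, not a copy
kept = filter_profiles(profiles)    # new dict; `profiles` itself is unchanged
```

The comprehension leaves the source mapping intact, which makes it safe to re-run the filter with a different threshold.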

In [41]:
# Use a context manager so the file is flushed and closed after writing
with open("new_data_dict.txt", 'w') as f:
    json.dump(new_data_dict, f)

In [43]:
%cat new_data_dict.txt

{"c0ff0f1c93f67c1fb372b36b1b08bb4c76bead7d": ["SOAKPQJ12A8C13D812", "SOBGXEU12A8AE45903", "SOBQWQX12A58A80CF8", "SOBWWUF12A8C13AC82", "SOCSBXQ12AB01806AC", "SOEYEQN12A58A75F3F", "SOFNSLY12A8C13B1C2", "SOFTXKI12A6D4F71DA", "SOFUHZF12A6D4F5A3F", "SOHZPVD12AB01839E8", "SOIOZPA12A8C137498", "SOJDIWD12AB0186CE9", "SOLAPGI12AF72A3955", "SONMSZZ12A6701F352", "SOOOLOP12AB0189B72", "SOOPHIF12A6D4F71DC", "SOPAJOR12A58A81CC0", "SOTSJHY12AF729C1A9", "SOUNQLL12A6D4F5A3B", "SOUYDLS12A6D4F6C0B", "SOVGANS12A81C2268D", "SOVNFJP12AF72A6545", "SOVNHMU12A6D4F5A38", "SOWKLHD12A67020290", "SOXALRW12A8159E8D5", "SOXASRE12A6D4F6C0C", "SOXKBTV12AF72A3A89", "SOXXZRM12A6D4F7F22"], "283882c3d18ff2ad0e17124002ec02b847d06e9a": ["SOAKMDU12A8C1346A9", "SOAXGDH12A8C13F8A1", "SOAYCLH12A81C22D59", "SOCCYYG12AB0184DE8", "SOCQOZB12AB0185685", "SOFEJPJ12A8C145455", "SOGBFOO12A6D4FC933", "SOHFJAQ12AB017E4AF", "SOHFNKO12AB017C772", "SOHGWFC12AB017F2E7", "SOKUECJ12A6D4F6129", "SOMCAFM12A58A7B024", "SOMKDZU12AB0185690", "SOOAL
