# List of files

In `data/`:
* `kaggle_songs.txt`
* `kaggle_users.txt`
* `kaggle_visible_evaluation_triplets.txt`
* `taste_profile_song_to_tracks.txt`
* `train_triplets.txt`

All of the above data was collected directly from the [Kaggle competition](https://www.kaggle.com/c/msdchallenge), except for `train_triplets.txt`, which was collected from the [Million Song Dataset website](http://millionsongdataset.com/) (MSD website).

--------------

In `data/msd_misc/`:
* `artist_location.txt`
* `unique_artists.txt`
* `unique_mbtags.txt`
* `unique_terms.txt`
* `unique_tracks.txt`

All of the above data was collected from the MSD website, specifically from [this page](http://millionsongdataset.com/sites/default/files/AdditionalFiles/).

# `data/`

## `kaggle_songs.txt`

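* Contains a list of all song IDs, one per line. Each ID is followed by a space-separated integer index, so the cell below keeps only the first field.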

In [4]:
path = '../data/kaggle_songs.txt'
with open(path, 'r') as f:
    lines = [f.readline().strip().split(' ')[0] for _ in range(10)]
    
print(lines)

['SOAAADD12AB018A9DD', 'SOAAADE12A6D4F80CC', 'SOAAADF12A8C13DF62', 'SOAAADZ12A8C1334FB', 'SOAAAFI12A6D4F9C66', 'SOAAAGK12AB0189572', 'SOAAAGN12AB017D672', 'SOAAAGO12A67AE0A0E', 'SOAAAGP12A6D4F7D1C', 'SOAAAGQ12A8C1420C8']


## `kaggle_users.txt`

* Contains a list of all user IDs.

In [2]:
path = '../data/kaggle_users.txt'
with open(path, 'r') as f:
    lines = [f.readline().strip() for _ in range(10)]
    
print(lines)

['fd50c4007b68a3737fe052d5a4f78ce8aa117f3d', 'd7083f5e1d50c264277d624340edaaf3dc16095b', 'd68dc6fc25248234590d7668a11e3335534ae4b4', '9be82340a8b5ef32357fe5af957ccd54736ece95', '841b2394ae3a9febbd6b06497b4a8ee8eb24b7f8', '91b8fac7dc5e03f6cfaf6e2aa7171f14a8354d62', '458833ce4418010e61304b34b2c992e1cce63435', 'c34670d9c1718361feb93068a853cead3c95b76a', '0f40e074aab2c5f47b7ddc2277fb0295b5b3a058', 'ef0d21935a2f8ae90571dbfab800f87fa5b38769']


## `taste_profile_song_to_tracks.txt`

* Contains the mapping from *song* IDs to corresponding *track* IDs.
* Most of the MSD data references songs by their *track* ID, but the Kaggle data references songs by their *song* ID.
* This file allows us to translate back and forth between these two ID systems, but the mapping isn't perfect: some song IDs map to multiple track IDs, and some song IDs do not map to any track ID.

In [5]:
path = '../data/taste_profile_song_to_tracks.txt'
with open(path, 'r') as f:
    lines = [f.readline().strip() for _ in range(10)]
    out = dict()
    for line in lines:
        line = line.split('\t')
        song, tracks = line[0], line[1:]
        out[song] = tracks
    
print(out)

{'SOAAADD12AB018A9DD': ['TRNCENP12903C9EF3A'], 'SOAAADE12A6D4F80CC': ['TRSKKFK128F148B615'], 'SOAAADF12A8C13DF62': ['TRCQMSP128F428A6F7'], 'SOAAADZ12A8C1334FB': ['TRMDNZY128F425A532'], 'SOAAAFI12A6D4F9C66': ['TRZEXLQ128F1491D17'], 'SOAAAGK12AB0189572': ['TRDZUFJ12903CE29FC'], 'SOAAAGN12AB017D672': ['TRHPQGK128F930A22F'], 'SOAAAGO12A67AE0A0E': ['TRLNGIA128EF33FD34'], 'SOAAAGP12A6D4F7D1C': ['TRDWEPA128F145FE05'], 'SOAAAGQ12A8C1420C8': ['TRJWZII128F92CA924']}
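
The caveat about the mapping (some song IDs have several track IDs, some have none) can be measured rather than taken on faith. The following is a minimal sketch, assuming the file stays tab-separated with the song ID first. The helper names `parse_song_to_tracks` and `summarize_mapping` are hypothetical, and the last two sample rows use placeholder IDs that do not come from the real file.

```python
from collections import Counter


def parse_song_to_tracks(lines):
    """Map each song ID to its (possibly empty) list of track IDs."""
    mapping = {}
    for line in lines:
        line = line.rstrip('\n')
        if not line:
            continue
        song, *tracks = line.split('\t')
        # Drop empty fields left behind by a trailing tab.
        mapping[song] = [t for t in tracks if t]
    return mapping


def summarize_mapping(mapping):
    """Count songs that map to no track, exactly one track, or several."""
    counts = Counter()
    for tracks in mapping.values():
        if not tracks:
            counts['none'] += 1
        elif len(tracks) == 1:
            counts['one'] += 1
        else:
            counts['many'] += 1
    return counts


# The first two rows are copied from the output above.
# The last two rows are PLACEHOLDERS that only illustrate the edge cases.
sample = [
    'SOAAADD12AB018A9DD\tTRNCENP12903C9EF3A',
    'SOAAADE12A6D4F80CC\tTRSKKFK128F148B615',
    'SONG_WITH_TWO\tTRACK_A\tTRACK_B',
    'SONG_WITH_NONE',
]
summary = summarize_mapping(parse_song_to_tracks(sample))
print(dict(summary))
```

To run this on the real data, pass the open file handle for `../data/taste_profile_song_to_tracks.txt` to `parse_song_to_tracks` in place of `sample`.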


## `train_triplets.txt`

* The core dataset for song recommendation. Each line is a triplet `(user_id, song_id, num_plays)` indicating that this user played this song this many times.
* There are 48,373,586 triplets in all.

In [9]:
path = '../data/train_triplets.txt'
with open(path, 'r') as f:
    lines = [f.readline().strip() for _ in range(10)]
    for line in lines:
        line = line.split('\t')
        user_id = line[0]
        song_id = line[1]
        num_plays = int(line[2])
        print('%s --> %s -- > %d times' % (user_id, song_id, num_plays))
    


b80344d063b5ccb3212f76538f3d9e43d87dca9e --> SOAKIMP12A8C130995 -- > 1 times
b80344d063b5ccb3212f76538f3d9e43d87dca9e --> SOAPDEY12A81C210A9 -- > 1 times
b80344d063b5ccb3212f76538f3d9e43d87dca9e --> SOBBMDR12A8C13253B -- > 2 times
b80344d063b5ccb3212f76538f3d9e43d87dca9e --> SOBFNSP12AF72A0E22 -- > 1 times
b80344d063b5ccb3212f76538f3d9e43d87dca9e --> SOBFOVM12A58A7D494 -- > 1 times
b80344d063b5ccb3212f76538f3d9e43d87dca9e --> SOBNZDC12A6D4FC103 -- > 1 times
b80344d063b5ccb3212f76538f3d9e43d87dca9e --> SOBSUJE12A6D4F8CF5 -- > 2 times
b80344d063b5ccb3212f76538f3d9e43d87dca9e --> SOBVFZR12A6D4F8AE3 -- > 1 times
b80344d063b5ccb3212f76538f3d9e43d87dca9e --> SOBXALG12A8C13C108 -- > 1 times
b80344d063b5ccb3212f76538f3d9e43d87dca9e --> SOBXHDL12A81C204C0 -- > 1 times
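
With tens of millions of triplets, it may be preferable to stream the file line by line instead of loading it into a list. The sketch below assumes every non-empty line has exactly three tab-separated fields, as in the sample above. The helper names `iter_triplets` and `plays_per_song` are hypothetical, and the sample rows are copied from the output above.

```python
from collections import Counter


def iter_triplets(lines):
    """Yield (user_id, song_id, num_plays) tuples from tab-separated lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        user_id, song_id, num_plays = line.split('\t')
        yield user_id, song_id, int(num_plays)


def plays_per_song(lines):
    """Sum play counts per song without keeping the triplets in memory."""
    totals = Counter()
    for _, song_id, num_plays in iter_triplets(lines):
        totals[song_id] += num_plays
    return totals


sample = [
    'b80344d063b5ccb3212f76538f3d9e43d87dca9e\tSOAKIMP12A8C130995\t1',
    'b80344d063b5ccb3212f76538f3d9e43d87dca9e\tSOAPDEY12A81C210A9\t1',
    'b80344d063b5ccb3212f76538f3d9e43d87dca9e\tSOBBMDR12A8C13253B\t2',
]
totals = plays_per_song(sample)
print(totals.most_common(1))
```

Because `iter_triplets` accepts any iterable of lines, an open file handle for `../data/train_triplets.txt` can be passed directly, and only the per-song totals stay in memory.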


## `kaggle_visible_evaluation_triplets.txt`

* This file has the same format as `train_triplets.txt`.
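* It is the test data: the visible part of the listening histories of the evaluation users, whose IDs are listed in `kaggle_users.txt`.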

In [10]:
path = '../data/kaggle_visible_evaluation_triplets.txt'
with open(path, 'r') as f:
    lines = [f.readline().strip() for _ in range(10)]
    for line in lines:
        line = line.split('\t')
        user_id = line[0]
        song_id = line[1]
        num_plays = int(line[2])
        print('%s --> %s -- > %d times' % (user_id, song_id, num_plays))
    


fd50c4007b68a3737fe052d5a4f78ce8aa117f3d --> SOBONKR12A58A7A7E0 -- > 1 times
fd50c4007b68a3737fe052d5a4f78ce8aa117f3d --> SOEGIYH12A6D4FC0E3 -- > 1 times
fd50c4007b68a3737fe052d5a4f78ce8aa117f3d --> SOFLJQZ12A6D4FADA6 -- > 1 times
fd50c4007b68a3737fe052d5a4f78ce8aa117f3d --> SOHTKMO12AB01843B0 -- > 1 times
fd50c4007b68a3737fe052d5a4f78ce8aa117f3d --> SODQZCY12A6D4F9D11 -- > 1 times
fd50c4007b68a3737fe052d5a4f78ce8aa117f3d --> SOXLOQG12AF72A2D55 -- > 1 times
d7083f5e1d50c264277d624340edaaf3dc16095b --> SOUVUHC12A67020E3B -- > 1 times
d7083f5e1d50c264277d624340edaaf3dc16095b --> SOUQERE12A58A75633 -- > 1 times
d7083f5e1d50c264277d624340edaaf3dc16095b --> SOIPJAX12A8C141A2D -- > 1 times
d7083f5e1d50c264277d624340edaaf3dc16095b --> SOEFCDJ12AB0185FA0 -- > 2 times


# `data/msd_misc/`

## `unique_tracks.txt`

* Each line of this file contains a data point of the form:
``` 
(track_id, song_id, artist_name, song_title)
```
* `track_id` and `song_id` contain the track and song IDs of the song in question, `artist_name` is the name of the artist, and `song_title` the name of the song.
* This file seems to contain `song_id <--> track_id` mappings that are absent from `taste_profile_song_to_tracks.txt`: searching `taste_profile_song_to_tracks.txt` for some of the song IDs listed in `unique_tracks.txt` does not always return a result.

In [12]:
path = '../data/msd_misc/unique_tracks.txt'

with open(path, 'r') as f:
    lines = [f.readline().strip() for _ in range(10)]
    for line in lines:
        line = line.split('<SEP>')
        print(line)

['TRMMMYQ128F932D901', 'SOQMMHC12AB0180CB8', 'Faster Pussy cat', 'Silent Night']
['TRMMMKD128F425225D', 'SOVFVAK12A8C1350D9', 'Karkkiautomaatti', 'Tanssi vaan']
['TRMMMRX128F93187D9', 'SOGTUKN12AB017F4F1', 'Hudson Mohawke', 'No One Could Ever']
['TRMMMCH128F425532C', 'SOBNYVR12A8C13558C', 'Yerba Brava', 'Si Vos Querés']
['TRMMMWA128F426B589', 'SOHSBXH12A8C13B0DF', 'Der Mystic', 'Tangle Of Aspens']
['TRMMMXN128F42936A5', 'SOZVAPQ12A8C13B63C', 'David Montgomery', 'Symphony No. 1 G minor "Sinfonie Serieuse"/Allegro con energia']
['TRMMMLR128F1494097', 'SOQVRHI12A6D4FB2D7', 'Sasha / Turbulence', 'We Have Got Love']
['TRMMMBB12903CB7D21', 'SOEYRFT12AB018936C', 'Kris Kross', "2 Da Beat Ch'yall"]
['TRMMMHY12903CB53F1', 'SOPMIYT12A6D4F851E', 'Joseph Locke', 'Goodbye']
['TRMMMML128F4280EE9', 'SOJCFMH12A8C13B0C2', "The Sun Harbor's Chorus-Documentary Recordings", "Mama_ mama can't you see ?"]


In [16]:
to_find = 'SOBNYVR12A8C13558C'
tpst_path = '../data/taste_profile_song_to_tracks.txt'

with open(tpst_path, 'r') as f:
    for line in f:
        if to_find in line:
            print(line)
            break

SOBNYVR12A8C13558C	TRMMMCH128F425532C
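
The single lookup above can be generalized to check every song ID at once. The following is a rough sketch, assuming `unique_tracks.txt` rows are `<SEP>`-separated with the song ID in the second field and that the mapping file is tab-separated with the song ID first. All three function names are hypothetical. The sample mapping deliberately holds a single row, so the "missing" ID it reports says nothing about the real files.

```python
def song_ids_from_unique_tracks(lines):
    """Collect song IDs (second <SEP>-separated field) from unique_tracks rows."""
    ids = set()
    for line in lines:
        fields = line.rstrip('\n').split('<SEP>')
        if len(fields) >= 2:
            ids.add(fields[1])
    return ids


def song_ids_from_mapping(lines):
    """Collect song IDs (first tab-separated field) from the song-to-track mapping."""
    return {line.split('\t', 1)[0].strip() for line in lines if line.strip()}


def missing_from_mapping(unique_track_lines, mapping_lines):
    """Return song IDs that appear in unique_tracks but not in the mapping."""
    return (song_ids_from_unique_tracks(unique_track_lines)
            - song_ids_from_mapping(mapping_lines))


# Rows copied from the outputs above; the mapping sample is intentionally
# incomplete so that the set difference is non-empty.
unique_sample = [
    'TRMMMYQ128F932D901<SEP>SOQMMHC12AB0180CB8<SEP>Faster Pussy cat<SEP>Silent Night',
    'TRMMMCH128F425532C<SEP>SOBNYVR12A8C13558C<SEP>Yerba Brava<SEP>Si Vos Querés',
]
mapping_sample = ['SOBNYVR12A8C13558C\tTRMMMCH128F425532C']
print(missing_from_mapping(unique_sample, mapping_sample))
```

Building both ID sets once and taking the set difference avoids rescanning the mapping file for each song ID, which is what a repeated line-by-line search would do.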



## `unique_artists.txt`

* Each line of this file contains a data point of the form:
```
(artist_id, artist_mbid, track_id, artist_name)
```
* `artist_id` is the unique ID of the artist.
* `artist_mbid` is the MusicBrainz ID for the artist. I'm not sure I will be using any MusicBrainz data, so I will probably ignore this field.

In [19]:
path = '../data/msd_misc/unique_artists.txt'

with open(path, 'r') as f:
    lines = [f.readline().strip() for _ in range(10)]
    for line in lines:
        line = line.split('<SEP>')
        print(line)

['AR002UA1187B9A637D', '7752a11c-9d8b-4220-ac44-e4a04cc8471d', 'TRMUOZE12903CDF721', 'The Bristols']
['AR003FB1187B994355', '1dbd2d7b-64c8-46aa-9f47-ff589096d672', 'TRWDPFR128F93594A6', 'The Feds']
['AR006821187FB5192B', '94fc1228-7032-4fe6-a485-e122e5fbee65', 'TRMZLJF128F4269EAC', "Stephen Varcoe/Choir of King's College_ Cambridge/Sir David Willcocks"]
['AR009211187B989185', '9dfe78a6-6d91-454e-9b95-9d7722cbc476', 'TRMGURO12903CAE2F0', 'Carroll Thompson']
['AR009SZ1187B9A73F4', '8cd574c0-b9f7-4998-94f4-654dffaecdf2', 'TRGWWFP12903CE7E79', 'Gorodisch']
['AR00A1N1187FB484EB', '7373764f-c642-4393-9492-97b5622c4bce', 'TRMMUWQ128F92E88FC', '1.000 Mexicans']
['AR00A6H1187FB5402A', '312c14d9-7897-4608-944a-c5b1c76ae682', 'TRWNEQX12903CB84FB', 'The Meatmen']
['AR00AP71187B99635F', '607d7275-2b92-49e5-a7ee-32f9cc29b076', 'TRWXQXG12903CE91BD', 'Miles Davis']
['AR00B1I1187FB433EB', '4a5777b3-f55b-437c-8b23-d9ee7791c7fc', 'TRMZTST128E0792E44', 'Eagle-Eye Cherry']
['AR00DDV1187B98B2BF', '702d022f-

## `artist_location.txt`

* Each line of this file contains a data point of the form:
```
(artist_id, latitude, longitude, artist_name, location_name)
```
* Location data is not available for *all* artists, however.

In [21]:
path = '../data/msd_misc/artist_location.txt'

with open(path, 'r') as f:
    lines = [f.readline().strip() for _ in range(10)]
    for line in lines:
        line = line.split('<SEP>')
        print(line)

['ARZGXZG1187B9B56B6', '-16.96595', '-61.14804', 'Endless Blue', 'Santa Cruz']
['AR8K6F31187B99C2BC', '46.44231', '-93.36586', 'Go Fish', 'Twin Cities, MN']
['ARHJJ771187FB5B581', '51.59678', '-0.33556', 'Screaming Lord Sutch', 'Harrow, Middlesex, England']
['ARJ8YLL1187FB3CA93', '40.69626', '-73.83301', 'Morton Gould', 'Richmond Hill, NY']
['ARYBAGV11ECC836DAC', '43.58828', '-79.64372', 'Crash Parallel', 'Mississauga']
['AR9SWJO1187FB47837', '59.91228', '10.74998', 'Prins Thomas', 'Oslo']
['ARO2Y5H1187B99D227', '56.95468', '-98.30897', 'Blinker The Star', 'Canada']
['ARM2AV21187FB3DB65', '51.16418', '10.45415', '17 Hippies', 'Fort Worth Texas USA']
['ARCSB641187FB47E64', '37.30198', '-78.3926', 'Lady Of Rage', 'Farmville, VA']
['ARWMIKB1187B9AD5DC', '52.22357', '6.89537', 'Hans Theessink / Gerry Lockran', 'Enschede, Netherlands']
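
Since latitude and longitude arrive as strings, it can help to convert each row into a typed record before doing anything numeric with it. This is a minimal sketch, assuming every row has exactly five `<SEP>`-separated fields (a row with a different field count would raise `ValueError`). The `ArtistLocation` class and `parse_artist_location` helper are hypothetical names, and the sample row is copied from the output above.

```python
from typing import NamedTuple


class ArtistLocation(NamedTuple):
    artist_id: str
    latitude: float
    longitude: float
    artist_name: str
    location_name: str


def parse_artist_location(line):
    """Split one <SEP>-separated row and convert the coordinates to floats."""
    artist_id, lat, lon, name, place = line.rstrip('\n').split('<SEP>')
    return ArtistLocation(artist_id, float(lat), float(lon), name, place)


sample = 'AR9SWJO1187FB47837<SEP>59.91228<SEP>10.74998<SEP>Prins Thomas<SEP>Oslo'
record = parse_artist_location(sample)
print(record.artist_name, record.latitude, record.longitude)
```

A `NamedTuple` keeps the field order of the file while letting later code refer to `record.latitude` instead of `line[1]`.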


## `unique_terms.txt`

TODO

* This file lists every unique Echo Nest tag (_term_) used in the dataset, one per line.

In [22]:
path = '../data/msd_misc/unique_terms.txt'

with open(path, 'r') as f:
    lines = [f.readline().strip() for _ in range(10)]
    print(lines)

['00s', '00s alternative', '00s country', '00s indie', '00s pop', '12k', '15th century', '16th century', '1700s', '1800s']
