In [1]:
from surprise import SVD
from surprise import Dataset
from surprise import Reader
from surprise.model_selection import cross_validate

import pandas as pd
import csv
import numpy as np

### Load User Artists Data file

Let's load the user-artist data file. This is our main data, and it's already fairly clean.

Notice that what we have is a playcount, which is not the same as a rating. We'll need to decide how to deal with this.

In [2]:
dataset = pd.read_csv("../data/audioscrobble/user_artist_data.csv.gz")
dataset

Unnamed: 0,User,Artist,PlayCount
0,2398725,598,1
1,2430132,1001582,1
2,2063323,1342138,1
3,2429835,1055519,7
4,2073565,1030685,1
5,2165208,6865656,1
6,2125388,1007006,2
7,2178179,1015794,4
8,2337258,1002328,1
9,2355419,1003741,1


### Let's load the artist alias file

This is a file that has artist aliases.  Misspellings, alternate versions, etc.  This will help us clean up our data.

We'll load this into a dictionary of str:str.

In [3]:
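# Read the tab-separated alias file into a dict of alias ID -> canonical ID (both str).
# The with-block closes the file once the dictionary is built.
with open('../data/audioscrobble/artist_alias.txt', 'r') as f:
    alias = dict(csv.reader(f, delimiter='\t'))
alias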

{'': '1329310',
 '10229482': '1013391',
 '1208828': '3940',
 '10232660': '873',
 '1115114': '1041572',
 '7002396': '1008485',
 '1125317': '742',
 '1270337': '1350694',
 '6749716': '1010227',
 '6725879': '1128716',
 '9973296': '6992495',
 '2082714': '2003588',
 '9964764': '6739971',
 '1155014': '1002457',
 '2107934': '1003249',
 '9985966': '1086065',
 '1216282': '1000024',
 '10382004': '10694359',
 '1007244': '1000896',
 '1046463': '1246719',
 '1094068': '1237611',
 '2140802': '1250104',
 '6685178': '5919',
 '2121648': '1002398',
 '2140003': '1027907',
 '6654740': '15',
 '10387491': '1239101',
 '1341378': '976',
 '10056199': '1044562',
 '6792066': '581',
 '1312270': '1010642',
 '6700723': '7022762',
 '6963681': '1024208',
 '1074814': '1000251',
 '9979719': '1330953',
 '2013902': '1003300',
 '6990355': '1238128',
 '2066901': '606',
 '6992629': '2966',
 '10086799': '1041936',
 '6674711': '6884097',
 '10309755': '1006384',
 '9994576': '4605',
 '6683690': '908',
 '6867525': '1001412',
 '666

In [4]:
# Replace aliased artist IDs with their canonical IDs using the dictionary.
dataset['Artist'] = dataset['Artist'].apply(lambda x: int(alias[str(x)]) if str(x) in alias else x)
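
The `apply` call above runs a Python lambda once per row, which can be slow on a million rows. As a rough sketch of a vectorized alternative, the snippet below converts the alias dictionary to integer keys once and then uses `Series.map`. The names `toy`, `toy_alias`, and `alias_int` are illustrative and not part of the notebook. Skipping empty keys or values (such as the `''` entry visible in the output above) is an assumption about how those rows should be treated.

```python
import pandas as pd

# Toy stand-ins for the notebook's `dataset` and `alias` (illustrative values).
toy = pd.DataFrame({"Artist": [598, 1208828, 6654740, 1001582]})
toy_alias = {"": "1329310", "1208828": "3940", "6654740": "15"}

# Convert the str:str dictionary to int:int once, skipping empty entries
# (assumption: an empty key or value carries no usable alias).
alias_int = {int(k): int(v) for k, v in toy_alias.items() if k and v}

# map() yields NaN for IDs that have no alias; fillna() restores the original ID.
toy["Artist"] = toy["Artist"].map(alias_int).fillna(toy["Artist"]).astype(int)
print(toy["Artist"].tolist())
```

IDs without an alias pass through unchanged, which matches the behavior of the lambda version.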

### Let's look at a summary of the data

* **TODO: Do a describe() on the data**

In [5]:
# TODO: Do a describe() on the dataset
dataset.describe()

Unnamed: 0,User,Artist,PlayCount
count,1000000.0,1000000.0,1000000.0
mean,1947964.0,1697855.0,15.160827
std,495702.4,2513439.0,75.233635
min,90.0,1.0,1.0
25%,2012096.0,1000268.0,1.0
50%,2122280.0,1012972.0,3.0
75%,2280524.0,1235666.0,9.0
max,2443492.0,10794310.0,32768.0


### Decide what to do with the playcount

How are we going to deal with the playcount?  Presumably, if the person likes the music, they will play it more. If they only play it once or twice, maybe they didn't like it?  Hard to say, but we have to do something with it.

Here's a proposal.  We'll treat the playcount as a "rating" from 1 to 5.  Playcounts higher than 5 will just be 5. Presumably, someone who plays an artist more than 5 times likes that artist.

We then have to set the "scale" of the rating.  Here from 1 to 5.

* **TODO: What is your idea? Do you have a better way of treating the playcount?**

In [6]:
dataset['Rating'] = dataset['PlayCount'].apply(lambda x: 5 if x > 5 else x)  # Playcounts > 5 are treated as 5.

# A reader is still needed but only the rating_scale param is required.
reader = Reader(rating_scale=(1, 5))

# TODO: Think of a better way to handle the PlayCount.
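
One possible answer to the TODO, offered only as a sketch: the `describe()` output shows a 75th percentile of 9 and a maximum of 32768, so capping at 5 assigns the top rating to at least a quarter of the rows. A logarithmic scale spreads the counts more evenly across the 1 to 5 range. The helper `playcount_to_rating` below is hypothetical and not part of the notebook, and the log2 bucketing is an assumption, not a tuned choice.

```python
import numpy as np

def playcount_to_rating(playcounts):
    """Map play counts (>= 1) to a 1-5 rating on a log2 scale (hypothetical helper)."""
    counts = np.asarray(playcounts, dtype=float)
    # Buckets: 1 play -> 1, 2-3 -> 2, 4-7 -> 3, 8-15 -> 4, 16 or more -> 5
    ratings = np.floor(np.log2(counts)) + 1
    return np.clip(ratings, 1, 5).astype(int)

print(playcount_to_rating([1, 2, 3, 7, 9, 32768]))
```

In the notebook this could be applied as `dataset['Rating'] = playcount_to_rating(dataset['PlayCount'])`, keeping `rating_scale=(1, 5)` unchanged. Normalizing per user is another option worth comparing.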


### Convert pandas dataframe to Dataset format

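Surprise (scikit-surprise) uses its own `Dataset` format.  Luckily, it's easy to convert from a pandas DataFrame: `Dataset.load_from_df` expects exactly three columns in the order user, item, rating, plus a `Reader` that defines the rating scale.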

In [7]:
# Load the audioscrobbler dataset
data = Dataset.load_from_df(dataset[['User', 'Artist', 'Rating']], reader)


### Use the SVD Algorithm to train the model.

Note: This may take a LOOONNG time to run.
    
We will run a cross validation while we train and get the results.

In [8]:
# Run 5-fold cross-validation and print results.
cross_validate(SVD(), data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
MAE (testset)     1.4866  1.4880  1.4849  1.4893  1.4869  1.4871  0.0015  
RMSE (testset)    1.6611  1.6624  1.6594  1.6648  1.6618  1.6619  0.0018  
Fit time          313.68  347.09  365.27  368.17  143.12  307.47  84.43   
Test time         17.37   17.73   21.42   6.73    4.20    13.49   6.75    


{'fit_time': (313.6777060031891,
  347.09243512153625,
  365.26723289489746,
  368.1741259098053,
  143.11809706687927),
 'test_mae': array([ 1.48655585,  1.48802427,  1.48487447,  1.48933126,  1.48685934]),
 'test_rmse': array([ 1.66109917,  1.66239648,  1.65940684,  1.6647684 ,  1.66181293]),
 'test_time': (17.374589920043945,
  17.731313943862915,
  21.423437118530273,
  6.7327799797058105,
  4.199251174926758)}
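
With fit times of several minutes per fold, it can help to experiment on a random subset of the rows before running the full cross-validation. The sketch below shows a reproducible 10% sample with pandas. The frame `toy` stands in for the notebook's `dataset`, and both the 10% fraction and the seed are arbitrary assumptions.

```python
import pandas as pd

# Toy frame standing in for the notebook's `dataset` (illustrative values).
toy = pd.DataFrame({
    "User": range(1000),
    "Artist": range(1000),
    "Rating": [1 + i % 5 for i in range(1000)],
})

# Keep a reproducible 10% random sample of the rows.
subset = toy.sample(frac=0.1, random_state=42)
print(len(subset))
```

The sampled frame can be passed to `Dataset.load_from_df` in the same way as the full one. Results on a sample are likely to be somewhat worse than on the full data, because each user and artist has fewer ratings to learn from.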

### Evaluate the Results

What is the RMSE?  What does that tell us about the results?
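
As a starting point for this question, note that RMSE and MAE are expressed in rating units, so an RMSE of about 1.66 on a 1-to-5 scale is a large error. A useful sanity check is to compare against a trivial baseline that always predicts the global mean rating. The snippet below is a minimal sketch of both metrics using NumPy. The ratings are made up for illustration and are not taken from the dataset, and the helper names `rmse` and `mae` are not part of Surprise.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error: sqrt(mean((actual - predicted)^2))."""
    diff = np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def mae(actual, predicted):
    """Mean absolute error: mean(|actual - predicted|)."""
    diff = np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(diff)))

# Illustrative ratings only, not taken from the dataset.
actual = np.array([1, 1, 2, 5, 5, 3])
baseline = np.full(actual.shape, actual.mean())  # always predict the global mean

print("baseline RMSE:", rmse(actual, baseline))
print("baseline MAE: ", mae(actual, baseline))
```

If the model's RMSE is not clearly below the baseline computed on the real ratings, that suggests the capped playcount is a weak stand-in for a rating, or that the model needs tuning.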
