





# Music Recommendation System

## Description

This project is a study of the music app Spotify. Spotify's leading position is often attributed to its use of big data analytics, in particular its enormous collection of playlists and its Discover Weekly suggestions. This study builds a recommender system that recommends new musical artists to a user based on their listening history. To build the recommendation system, we use Spark and the collaborative filtering technique.

### Why Spark?

As industries develop and products and services become more creative and customer-focused, machine learning algorithms that support personalization, recommendations, and predictive insights become increasingly essential. Apache Spark supports graph computation, streaming, and real-time interactive query processing, which helps with highly complex machine learning problems.

Furthermore, Apache Spark is a fast, general-purpose cluster computing system. It is also simple, highly scalable, and integrates well with other tools and languages such as R, SQL, Python, Scala, and Java.


## Datasets

The dataset for this project is publicly available song data from Audioscrobbler. However, we modified the original data files so that the code runs in a reasonable time on a single machine. The reduced data files can be found [here](http://www-etud.iro.umontreal.ca/~bergstrj/audioscrobbler_data.html) and contain only the information relevant to the 50 most prolific users (those with the highest artist play counts).

The original data file `user_artist_data.txt` contained about 141,000 unique users and 1.6 million unique artists. It holds about 24.2 million user-artist play records, each with a play count.

Also note that the data set includes `artist_alias.txt`, which maps artist IDs that are known misspellings or variants to the canonical ID of that artist.

The `artist_data.txt` file then provides a map from the canonical artist ID to the name of the artist.
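
As a small illustration of how the alias file is meant to be used, the sketch below resolves artist IDs through an in-memory dictionary. The IDs and play counts are invented for illustration and do not come from the dataset.

```python
# Hypothetical alias table: variant artist ID -> canonical artist ID.
alias_to_canonical = {1001: 2001, 1002: 2001}

# Hypothetical (user ID, artist ID, play count) records.
plays = [(1, 1001, 5), (1, 2001, 3), (2, 1002, 7), (2, 3001, 1)]

# Replace each artist ID with its canonical ID; IDs without an alias are kept as-is.
canonical_plays = [
    (user, alias_to_canonical.get(artist, artist), count)
    for user, artist, count in plays
]

print(canonical_plays)
```

Note that after resolution the same user can have several records for one canonical artist (user 1 in this sketch), so a later step may want to sum those counts.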

## Import Libraries and Datasets

Load the three datasets into RDDs and name them `artistData`, `artistAlias`, and `userArtistData`.

In [1]:
from pyspark.mllib.recommendation import *
import random
from operator import *

In [2]:
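def parser(s, delimiter=" ", to_int=None):
    # Split a line on the delimiter and cast the fields whose index is in to_int to int
    fields = s.split(delimiter)
    if to_int:
        return tuple(int(f) if i in to_int else f for i, f in enumerate(fields))
    return tuple(fields)

# `sc` is the SparkContext provided by the PySpark shell or notebook
# artistData: (artist ID, artist name)
artistData = sc.textFile("artist_data_small.txt").map(lambda x: parser(x, '\t', [0]))
# artistAlias: (variant artist ID, canonical artist ID)
artistAlias = sc.textFile("artist_alias_small.txt").map(lambda x: parser(x, '\t', [0, 1]))
artistAliasMap = artistAlias.collectAsMap()
# userArtistData: (user ID, artist ID, play count), with each artist ID replaced by its canonical ID
userArtistData = sc.textFile("user_artist_data_small.txt").map(lambda x: parser(x, ' ', [0, 1, 2]))
userArtistData = userArtistData.map(lambda x: (x[0], artistAliasMap.get(x[1], x[1]), x[2]))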
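
The parser above assumes every line is well formed. If a file contained a line with a missing field or a non-numeric ID, `int()` would raise and the job would fail. The sketch below shows one possible way to skip such lines; `parse_artist_line` is a hypothetical helper that is not part of the original notebook, and the sample lines are invented.

```python
def parse_artist_line(line):
    """Return [(artist_id, name)] for a valid line, or [] for a malformed one."""
    artist_id, sep, name = line.partition("\t")
    if not sep:
        return []  # no tab separator found
    try:
        return [(int(artist_id), name.strip())]
    except ValueError:
        return []  # artist ID is not an integer


# Invented sample lines: one valid, one without a tab, one with a bad ID.
sample_lines = ["42\tExample Artist", "no tab here", "abc\tBad ID"]
parsed = [pair for line in sample_lines for pair in parse_artist_line(line)]
print(parsed)
```

Because the helper returns a list of zero or one pairs, it could be passed to an RDD's `flatMap` in place of `map`, so that malformed lines are dropped rather than raising.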


## Data Exploration

The code below reports the total and mean play counts for the three users with the highest total play counts. Their user IDs are hard-coded in the calls to `summary`.


In [38]:
def summary(user_id):
    play_list = userArtistData.map(lambda x: (x[0], (x[1], x[2]))).lookup(user_id)
    total = sum(x[1] for x in play_list)
    print ("User %s has a total play count of %s and a mean play count of %s." % (user_id, total, total/len(play_list),))
summary(1059637)
summary(2064012)
summary(2069337)

User 1059637 has a total play count of 674412 and a mean play count of 1878.5849582172702.
User 2064012 has a total play count of 548427 and a mean play count of 9455.637931034482.
User 2069337 has a total play count of 393515 and a mean play count of 1519.3629343629343.
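
The user IDs passed to `summary` are hard-coded. One way to derive the most active users, sketched here in plain Python on invented records, is to sum play counts per user and keep the three largest totals. On the RDD, a similar result could likely be obtained with `reduceByKey` followed by `takeOrdered`; that variant is an assumption and is not part of the original notebook.

```python
from collections import defaultdict
import heapq

# Hypothetical (user ID, artist ID, play count) records.
records = [
    (1, 10, 5), (1, 11, 20),
    (2, 10, 1),
    (3, 12, 30), (3, 13, 2),
    (4, 10, 8),
]

# Sum play counts per user.
totals = defaultdict(int)
for user, _artist, count in records:
    totals[user] += count

# Keep the three users with the largest totals, highest first.
top_three = heapq.nlargest(3, totals.items(), key=lambda kv: kv[1])
print(top_three)
```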


### Splitting Data for Testing

In [27]:
trainingData, validationData, testData = userArtistData.randomSplit([40,40,20], 13)
trainingData.cache()
validationData.cache()
testData.cache()
print (trainingData.take(3))
print (validationData.take(3))
print (testData.take(3))
print (trainingData.count())
print (validationData.count())
print (testData.count())

[(1059637, 1000049, 1), (1059637, 1000056, 1), (1059637, 1000114, 2)]
[(1059637, 1000010, 238), (1059637, 1000062, 11), (1059637, 1000123, 2)]
[(1059637, 1000094, 1), (1059637, 1000112, 423), (1059637, 1000113, 5)]
19769
19690
10022
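
`randomSplit` normalizes its weights, so `[40, 40, 20]` requests roughly 40%, 40%, and 20% of the records. The split is random, so the realized sizes only approximate those shares. The short check below compares the counts printed above with the requested proportions.

```python
weights = [40, 40, 20]
counts = [19769, 19690, 10022]  # training, validation, and test counts printed above

# Normalize the weights and the counts to fractions of the whole.
requested = [w / sum(weights) for w in weights]
realized = [c / sum(counts) for c in counts]

for name, want, got in zip(["training", "validation", "test"], requested, realized):
    print("%-10s requested %.3f, realized %.3f" % (name, want, got))
```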


## The Recommender Model

In [28]:
from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating

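def cal_score(predict, actual):
    # Keep only as many top predictions as there are actual artists
    if len(actual) < len(predict):
        predict = predict[0:len(actual)]
    # Fraction of the actual artists that appear among those predictions
    return len(set(predict) & set(actual)) * 1.0 / len(actual)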

def modelEval(model, dataset):
    # Find the list of all artists in the whole data set
    all_artists = userArtistData.map(lambda x: x[1]).distinct().collect()
    # Find the users in the input dataset (a set keeps the membership test below fast)
    test_user = set(dataset.map(lambda p: p[0]).distinct().collect())
    # Find the artists each of those users listened to in the training set, then
    # generate the test data: every (user, artist) pair not seen in training
    testdata = trainingData.filter(lambda x: x[0] in test_user).map(lambda x: (x[0], x[1])).groupByKey()
    testdata = testdata.map(lambda x: (x[0], set(x[1])))
    testdata = testdata.flatMap(lambda x: [(x[0], a) for a in all_artists if a not in x[1]])
    # Find the artists each user listened to in the input dataset
    testdata_actual = dataset.map(lambda x: (x[0], x[1])).groupByKey().map(lambda x: (x[0], list(x[1]))).collectAsMap()
    # Score every candidate pair, then rank each user's candidates by predicted score, highest first
    predictions = model.predictAll(testdata).map(lambda x: (x[0], (x[1], x[2])))
    predictions = predictions.groupByKey().map(lambda x: (x[0], sorted(list(x[1]), key=lambda y: y[1], reverse=True)))
    # Compare each user's ranked artists with the artists that user actually played
    predictions = predictions.map(lambda x: (x[0], cal_score([y[0] for y in x[1]], testdata_actual[x[0]])))
    # Average the per-user scores over all users in the input dataset
    return predictions.map(lambda x: x[1]).reduce(lambda x, y: x + y) * 1.0 / len(test_user)
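
To make the scoring rule concrete, here is a small worked example. It restates the logic of `cal_score` in a standalone function (`overlap_score`, a name introduced only for this sketch) and applies it to invented artist IDs.

```python
def overlap_score(predicted, actual):
    """Fraction of `actual` found among the top len(actual) predictions."""
    top = predicted[:len(actual)]
    return len(set(top) & set(actual)) / len(actual)


predicted = [10, 20, 30, 40, 50]  # hypothetical artist IDs, ranked best first
actual = [20, 50, 60]             # hypothetical artists the user really played

score = overlap_score(predicted, actual)
print(score)
```

Only the top three predictions are considered because the user played three artists. Artist 20 is a hit, while artist 50 is ranked fifth and does not count, so the score is one out of three.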
    

### Model Construction

We can now select the best model by training with several ranks and scoring each one on the validation set with the `modelEval` function.

In [33]:
training = trainingData.map(lambda x: Rating(int(x[0]), int(x[1]), float(x[2])))
for r in [2, 10, 20]:
    model = ALS.trainImplicit(training, rank = r, seed=345)
    print ("The model score for rank %s is %s" % (r, modelEval(model, validationData)))

The model score for rank 2 is 0.08993238201906095
The model score for rank 10 is 0.0905379417355099
The model score for rank 20 is 0.08352965938288216
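
The best rank can also be picked programmatically rather than by eye. The sketch below uses the scores reported above; in the notebook, the dictionary would instead be filled inside the training loop.

```python
# Validation scores reported above, keyed by rank.
scores = {
    2: 0.08993238201906095,
    10: 0.0905379417355099,
    20: 0.08352965938288216,
}

# Choose the rank with the highest validation score.
best_rank = max(scores, key=scores.get)
print(best_rank)
```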


Rank 10 gives the highest validation score, so we train `bestModel` with rank 10 and check its results on the test data.

In [35]:
bestModel = ALS.trainImplicit(training, rank=10, seed=345)
print (modelEval(bestModel, testData))

0.06028154355485623


## Artist Recommendations
We predict the top 10 artists for user `1059637` using the model above and its `recommendProducts` function. We then map the results (integer IDs) to the real artist names using `artistData`.

In [39]:
recommended = map(lambda x: x.product, bestModel.recommendProducts(1059637, 10))
for i, artist in enumerate(recommended):
    print ("Artist %s: %s" % (i, artistData.lookup(artist)[0],))

Artist 0: Something Corporate
Artist 1: My Chemical Romance
Artist 2: Counting Crows
Artist 3: U2
Artist 4: Green Day
Artist 5: Further Seems Forever
Artist 6: Alkaline Trio
Artist 7: Switchfoot
Artist 8: Underoath
Artist 9: Smash Mouth
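
Each call to `artistData.lookup` may launch a separate Spark job. For a table small enough to fit on the driver, one alternative is to collect the ID-to-name pairs into a dictionary once and look names up locally. The sketch below shows only the local lookup, with invented IDs and names.

```python
# Hypothetical ID -> name table, such as collecting the pairs to the driver might yield.
artist_names = {101: "Artist A", 102: "Artist B", 103: "Artist C"}

# Hypothetical recommended artist IDs, best first; 999 has no known name.
recommended_ids = [102, 999, 101]

# Format one line per recommendation, falling back to a placeholder for unknown IDs.
lines = [
    "Artist %s: %s" % (rank, artist_names.get(artist_id, "<unknown %s>" % artist_id))
    for rank, artist_id in enumerate(recommended_ids)
]
print("\n".join(lines))
```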
