# Music Recommender System with PySpark

## Process flow

- Import the CSV file
- Prepare the dataset by performing an aggregation
- Convert string columns into columns with unique numerical values
- Create the ALS model
- Suggest the top 10 tracks for each user

---

*Learn PySpark and how to work with a large dataset (1 GB) in this tool. I also used PySpark's ALS tools to recommend music to each user based on that user's implicit listening counts.*

---

Let's install pyspark

In [None]:
!pip install pyspark

Importing the modules

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import count, desc, col, max
from pyspark.ml.feature import StringIndexer
from pyspark.ml import Pipeline
from pyspark.ml.recommendation import ALS
from pyspark.ml.tuning import TrainValidationSplit, ParamGridBuilder

Creating the spark session


In [None]:
spark = SparkSession.builder.appName('lastfm').getOrCreate()

## Loading the dataset

In [None]:
!gdown --id 1q8VWIZFjlOP_91z0GjbCe4RpmtGVDkvz
!gdown --id 14dMLzOTIf1GK-P6bA9rVEI_1WSedOdZU

Downloading...
From: https://drive.google.com/uc?id=1q8VWIZFjlOP_91z0GjbCe4RpmtGVDkvz
To: /content/genre.csv
3.38MB [00:00, 108MB/s]
Downloading...
From: https://drive.google.com/uc?id=14dMLzOTIf1GK-P6bA9rVEI_1WSedOdZU
To: /content/listenings.csv
1.09GB [00:07, 141MB/s]


In [None]:
file_path = '/content/listenings.csv'
df_listenings = spark.read.format('csv').option('header',True).option('inferSchema',True).load(file_path)
df_listenings.show()

+-----------+-------------+--------------------+---------------+--------------------+
|    user_id|         date|               track|         artist|               album|
+-----------+-------------+--------------------+---------------+--------------------+
|000Silenced|1299680100000|           Price Tag|       Jessie J|         Who You Are|
|000Silenced|1299679920000|Price Tag (Acoust...|       Jessie J|           Price Tag|
|000Silenced|1299679440000|Be Mine! (Ballad ...|          Robyn|            Be Mine!|
|000Silenced|1299679200000|            Acapella|          Kelis|            Acapella|
|000Silenced|1299675660000|   I'm Not Invisible|      The Tease|   I'm Not Invisible|
|000Silenced|1297511400000|Bounce (Feat NORE...|       MSTRKRFT|         Fist of God|
|000Silenced|1294498440000|Don't Stop The Mu...|        Rihanna|Addicted 2 Bassli...|
|000Silenced|1292438340000|               ObZen|      Meshuggah|               ObZen|
|000Silenced|1292437740000|   Yama's Messengers|      

## Cleaning tables 

In [None]:
df_listenings = df_listenings.drop('date')
df_listenings.show()

+-----------+--------------------+---------------+--------------------+
|    user_id|               track|         artist|               album|
+-----------+--------------------+---------------+--------------------+
|000Silenced|           Price Tag|       Jessie J|         Who You Are|
|000Silenced|Price Tag (Acoust...|       Jessie J|           Price Tag|
|000Silenced|Be Mine! (Ballad ...|          Robyn|            Be Mine!|
|000Silenced|            Acapella|          Kelis|            Acapella|
|000Silenced|   I'm Not Invisible|      The Tease|   I'm Not Invisible|
|000Silenced|Bounce (Feat NORE...|       MSTRKRFT|         Fist of God|
|000Silenced|Don't Stop The Mu...|        Rihanna|Addicted 2 Bassli...|
|000Silenced|               ObZen|      Meshuggah|               ObZen|
|000Silenced|   Yama's Messengers|         Gojira|The Way of All Flesh|
|000Silenced|On the Brink of E...|   Napalm Death|Time Waits For No...|
|000Silenced|On the Brink of E...|   Napalm Death|Time Waits For

In [None]:
df_listenings = df_listenings.na.drop()
df_listenings.show()

+-----------+--------------------+---------------+--------------------+
|    user_id|               track|         artist|               album|
+-----------+--------------------+---------------+--------------------+
|000Silenced|           Price Tag|       Jessie J|         Who You Are|
|000Silenced|Price Tag (Acoust...|       Jessie J|           Price Tag|
|000Silenced|Be Mine! (Ballad ...|          Robyn|            Be Mine!|
|000Silenced|            Acapella|          Kelis|            Acapella|
|000Silenced|   I'm Not Invisible|      The Tease|   I'm Not Invisible|
|000Silenced|Bounce (Feat NORE...|       MSTRKRFT|         Fist of God|
|000Silenced|Don't Stop The Mu...|        Rihanna|Addicted 2 Bassli...|
|000Silenced|               ObZen|      Meshuggah|               ObZen|
|000Silenced|   Yama's Messengers|         Gojira|The Way of All Flesh|
|000Silenced|On the Brink of E...|   Napalm Death|Time Waits For No...|
|000Silenced|On the Brink of E...|   Napalm Death|Time Waits For

In [None]:
row_numbers = df_listenings.count()
column_numbers = len(df_listenings.columns)
print(row_numbers, column_numbers)

13758905 4


## Let's perform some aggregation
to see how many times each user has listened to each track


In [None]:
df_listenings_agg = df_listenings.select('user_id','track').groupby('user_id','track').agg(count('*').alias('count')).orderBy('user_id')
df_listenings_agg.show()

+-------+--------------------+-----+
|user_id|               track|count|
+-------+--------------------+-----+
| --Seph|Chelsea Hotel - L...|    1|
| --Seph|        Window Blues|    1|
| --Seph|          Paris 2004|    7|
| --Seph|Hungarian Rhapsod...|    1|
| --Seph|Vestido Estampado...|    1|
| --Seph|         The Embrace|    1|
| --Seph|       Phantom Pt II|    1|
| --Seph|       Life On Mars?|    1|
| --Seph|      Hour for magic|    2|
| --Seph|     The Way We Were|    1|
| --Seph| Air on the G String|    1|
| --Seph|Belina (Original ...|    1|
| --Seph|               Leloo|    1|
| --Seph|Hungarian Dance No 5|    1|
| --Seph|              Monday|    1|
| --Seph|  California Waiting|    1|
| --Seph|Airplanes [feat H...|    1|
| --Seph|   Summa for Strings|    1|
| --Seph|Virus (Luke Fair ...|    1|
| --Seph| White Winter Hymnal|    3|
+-------+--------------------+-----+
only showing top 20 rows
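
For intuition, the `groupby(...).agg(count('*'))` step above is a distributed version of counting how often each `(user_id, track)` pair occurs. Here is a minimal plain-Python sketch of the same idea on a handful of made-up rows. The `count_plays` helper and the sample rows are illustrative only and are not part of the notebook.

```python
from collections import Counter


def count_plays(rows):
    """Count how many times each (user_id, track) pair occurs."""
    return Counter((user_id, track) for user_id, track in rows)


# Hypothetical listening events: one tuple per play.
sample_rows = [
    ("--Seph", "Paris 2004"),
    ("--Seph", "Paris 2004"),
    ("--Seph", "Leloo"),
    ("000Silenced", "Price Tag"),
]

play_counts = count_plays(sample_rows)
for (user_id, track), n in sorted(play_counts.items()):
    print(user_id, track, n)
```

Each distinct pair becomes one row with its count, which is why the aggregated table has fewer rows than the raw listenings.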



In [None]:
row_numbers = df_listenings_agg.count()
column_numbers = len(df_listenings_agg.columns)
print(row_numbers, column_numbers)

9930128 3


In [None]:
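# Work on a 20,000-row sample; the rows are sorted by user_id, so this keeps only the first users in sort order
df_listenings_agg = df_listenings_agg.limit(20000)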

## Let's convert the user id and track columns into unique integers




In [None]:
# One unfitted StringIndexer per string column; the Pipeline fits them in the next step
indexers = [StringIndexer(inputCol=c, outputCol=c + '_index') for c in ['user_id', 'track']]

pipeline = Pipeline(stages=indexers)

data = pipeline.fit(df_listenings_agg).transform(df_listenings_agg)
data.show()

+-------+--------------------+-----+-------------+-----------+
|user_id|               track|count|user_id_index|track_index|
+-------+--------------------+-----+-------------+-----------+
| --Seph| White Winter Hymnal|    3|         69.0|       59.0|
| --Seph|Virus (Luke Fair ...|    1|         69.0|    15896.0|
| --Seph|Airplanes [feat H...|    1|         69.0|      519.0|
| --Seph|Belina (Original ...|    1|         69.0|     3278.0|
| --Seph|              Monday|    1|         69.0|      334.0|
| --Seph|Hungarian Dance No 5|    1|         69.0|     7558.0|
| --Seph|       Life On Mars?|    1|         69.0|     1161.0|
| --Seph|  California Waiting|    1|         69.0|      197.0|
| --Seph|       Phantom Pt II|    1|         69.0|     1377.0|
| --Seph|   Summa for Strings|    1|         69.0|    13739.0|
| --Seph|      Hour for magic|    2|         69.0|     7495.0|
| --Seph|Hungarian Rhapsod...|    1|         69.0|     7559.0|
| --Seph|     The Way We Were|    1|         69.0|    1
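
To see where values such as `69.0` come from: by default `StringIndexer` gives index `0.0` to the most frequent label, `1.0` to the next most frequent, and so on. Below is a rough plain-Python sketch of that mapping. The `fit_string_index` helper and the sample names are made up for illustration, and the alphabetical tie-break is an assumption about how equally frequent labels are ordered.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct string to a float index, most frequent first.

    Assumption: labels with the same frequency are ordered alphabetically.
    """
    freq = Counter(values)
    ordered = sorted(freq, key=lambda label: (-freq[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Hypothetical user_id column: "bob" appears 3 times, "alice" twice, "carol" once.
users = ["bob", "alice", "bob", "carol", "bob", "alice"]
user_index = fit_string_index(users)
print(user_index)  # {'bob': 0.0, 'alice': 1.0, 'carol': 2.0}
```

So a user with index `0.0` is simply the user with the most rows in the sampled table, not the first user alphabetically.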

In [None]:
data = data.select('user_id_index','track_index','count').orderBy('user_id_index')
data.show()

+-------------+-----------+-----+
|user_id_index|track_index|count|
+-------------+-----------+-----+
|          0.0|    10943.0|    1|
|          0.0|    11628.0|    2|
|          0.0|     1349.0|    1|
|          0.0|      381.0|    1|
|          0.0|     8692.0|    1|
|          0.0|     6899.0|    1|
|          0.0|    14044.0|    1|
|          0.0|    15513.0|    1|
|          0.0|    11978.0|    2|
|          0.0|    15176.0|    1|
|          0.0|     8305.0|    1|
|          0.0|    13722.0|    1|
|          0.0|    10620.0|    1|
|          0.0|     4424.0|    1|
|          0.0|    16732.0|    1|
|          0.0|    10630.0|    1|
|          0.0|    12169.0|    1|
|          0.0|     4117.0|    1|
|          0.0|    10336.0|    1|
|          0.0|    16829.0|    1|
+-------------+-----------+-----+
only showing top 20 rows



## Train and Test data

In [None]:
(train, test) = data.randomSplit([0.5, 0.5])
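
Note that `randomSplit` also accepts an optional `seed` argument. Without it, every run produces a different split. The plain-Python sketch below shows the general idea of a seeded random split, assuming each row is assigned by an independent random draw. The `random_split` helper is hypothetical and only approximates what Spark does internally.

```python
import random


def random_split(rows, train_fraction=0.5, seed=None):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)
    train_rows, test_rows = [], []
    for row in rows:
        if rng.random() < train_fraction:
            train_rows.append(row)
        else:
            test_rows.append(row)
    return train_rows, test_rows


rows = list(range(1000))
train_a, test_a = random_split(rows, seed=42)
train_b, test_b = random_split(rows, seed=42)

# The two halves are close to 500 rows each, but not exactly equal in general.
print(len(train_a), len(test_a))
# The same seed reproduces the same split.
print(train_a == train_b)  # True
```

With a fixed seed, results such as the recommendations further down can be compared across runs.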

## Let's Create our Model

In [None]:
USERID = 'user_id_index'
ITEMID = 'track_index'
COUNT = 'count'

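# ALS defaults to implicitPrefs=False, so the play counts are fitted as explicit ratings
als = ALS(maxIter=5, regParam=0.01, userCol=USERID, itemCol=ITEMID, ratingCol=COUNT)
model = als.fit(train)

# With the default coldStartStrategy='nan', users or tracks unseen in train get NaN predictions
predictions = model.transform(test)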
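
Since listening counts are implicit feedback, it is worth knowing that Spark's ALS also has an implicit mode (`implicitPrefs=True`, tuned with `alpha`). In that mode a play count is not treated as a rating. Instead it becomes a confidence weight on a binary "listened / did not listen" preference. Here is a minimal plain-Python sketch of that transformation, assuming the commonly described form `confidence = 1 + alpha * count`. The `implicit_signal` helper is illustrative only and is not used anywhere in this notebook.

```python
def implicit_signal(play_count, alpha=1.0):
    """Return (preference, confidence) for one user-track play count."""
    # Preference is binary: did the user listen at all?
    preference = 1.0 if play_count > 0 else 0.0
    # Confidence grows linearly with the number of plays.
    confidence = 1.0 + alpha * abs(play_count)
    return preference, confidence


for plays in (0, 1, 7):
    print(plays, implicit_signal(plays))
```

In implicit mode the model's outputs are preference scores rather than estimated play counts, so they would not be on the same scale as the `rating` values printed in this notebook.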

## Generate top 10 Track recommendations for each user

In [None]:
recs = model.recommendForAllUsers(10)

In [None]:
recs.show()

+-------------+--------------------+
|user_id_index|     recommendations|
+-------------+--------------------+
|          148|[[14301, 10.61897...|
|           31|[[15430, 13.91627...|
|           85|[[1325, 5.786715]...|
|          137|[[4660, 7.3788295...|
|           65|[[13563, 7.874768...|
|           53|[[348, 9.359735],...|
|          133|[[14826, 17.99768...|
|           78|[[9500, 12.5825],...|
|          108|[[1325, 11.927787...|
|           34|[[2484, 9.65376],...|
|          101|[[121, 19.537437]...|
|          115|[[3525, 6.98412],...|
|          126|[[8391, 10.986389...|
|           81|[[309, 10.332466]...|
|           28|[[7849, 9.123482]...|
|           76|[[15430, 11.95738...|
|           26|[[9500, 9.509514]...|
|           27|[[1325, 8.3653755...|
|           44|[[9500, 9.395316]...|
|          103|[[121, 14.861119]...|
+-------------+--------------------+
only showing top 20 rows



In [None]:
recs.take(1)

[Row(user_id_index=148, recommendations=[Row(track_index=14301, rating=10.618975639343262), Row(track_index=182, rating=10.616599082946777), Row(track_index=9500, rating=9.797794342041016), Row(track_index=14379, rating=6.147827625274658), Row(track_index=235, rating=5.912038803100586), Row(track_index=12061, rating=5.912038803100586), Row(track_index=1325, rating=5.741284370422363), Row(track_index=12845, rating=5.2165045738220215), Row(track_index=4220, rating=5.2165045738220215), Row(track_index=8705, rating=5.2165045738220215)])]
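
The recommendations only contain `track_index` values, so a last step would be to translate them back into track titles. Below is a plain-Python sketch of that lookup, using the first three scores from the output above. The `decode_recommendations` helper and the track titles are placeholders. In the notebook, the real titles in index order should be available from the fitted track indexer's `labels` attribute.

```python
def decode_recommendations(recommendations, index_to_track):
    """Replace each track_index with its track title, keeping the score."""
    return [(index_to_track[idx], score) for idx, score in recommendations]


# Hypothetical lookup table; the titles are placeholders, not real data.
index_to_track = {14301: "Track A", 182: "Track B", 9500: "Track C"}

# (track_index, rating) pairs copied from the first row of the output above.
top_recs = [
    (14301, 10.618975639343262),
    (182, 10.616599082946777),
    (9500, 9.797794342041016),
]

decoded = decode_recommendations(top_recs, index_to_track)
for title, score in decoded:
    print(f"{title}: {score:.2f}")
```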