<a href="https://colab.research.google.com/github/redjules/-PyTorch-Project-to-Build-a-LSTM-Text-Classification-Model/blob/main/Music_recommeder.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Mount Google Drive and install PySpark

In [56]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [57]:
!pip install pyspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


## Importing the modules

In [58]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import count  # the only SQL function used in this notebook
from pyspark.ml.feature import StringIndexer
from pyspark.ml import Pipeline
from pyspark.ml.recommendation import ALS
from pyspark.ml.tuning import TrainValidationSplit, ParamGridBuilder


## Creating the Spark session

In [59]:
spark = SparkSession.builder.appName("lastfm").getOrCreate()

## Loading the dataset

In [60]:
file_path = '/content/drive/MyDrive/data analysis/listenings.csv'
df_listenings = spark.read.format('csv').option('header',True).option('inferSchema',True).load(file_path)
df_listenings.show()


+-----------+-------------+--------------------+---------------+--------------------+
|    user_id|         date|               track|         artist|               album|
+-----------+-------------+--------------------+---------------+--------------------+
|000Silenced|1299680100000|           Price Tag|       Jessie J|         Who You Are|
|000Silenced|1299679920000|Price Tag (Acoust...|       Jessie J|           Price Tag|
|000Silenced|1299679440000|Be Mine! (Ballad ...|          Robyn|            Be Mine!|
|000Silenced|1299679200000|            Acapella|          Kelis|            Acapella|
|000Silenced|1299675660000|   I'm Not Invisible|      The Tease|   I'm Not Invisible|
|000Silenced|1297511400000|Bounce (Feat NORE...|       MSTRKRFT|         Fist of God|
|000Silenced|1294498440000|Don't Stop The Mu...|        Rihanna|Addicted 2 Bassli...|
|000Silenced|1292438340000|               ObZen|      Meshuggah|               ObZen|
|000Silenced|1292437740000|   Yama's Messengers|      

## Cleaning tables

In [61]:
df_listenings = df_listenings.drop('date')
df_listenings.show()

+-----------+--------------------+---------------+--------------------+
|    user_id|               track|         artist|               album|
+-----------+--------------------+---------------+--------------------+
|000Silenced|           Price Tag|       Jessie J|         Who You Are|
|000Silenced|Price Tag (Acoust...|       Jessie J|           Price Tag|
|000Silenced|Be Mine! (Ballad ...|          Robyn|            Be Mine!|
|000Silenced|            Acapella|          Kelis|            Acapella|
|000Silenced|   I'm Not Invisible|      The Tease|   I'm Not Invisible|
|000Silenced|Bounce (Feat NORE...|       MSTRKRFT|         Fist of God|
|000Silenced|Don't Stop The Mu...|        Rihanna|Addicted 2 Bassli...|
|000Silenced|               ObZen|      Meshuggah|               ObZen|
|000Silenced|   Yama's Messengers|         Gojira|The Way of All Flesh|
|000Silenced|On the Brink of E...|   Napalm Death|Time Waits For No...|
|000Silenced|On the Brink of E...|   Napalm Death|Time Waits For

In [62]:
df_listenings = df_listenings.na.drop()
df_listenings.show()

+-----------+--------------------+---------------+--------------------+
|    user_id|               track|         artist|               album|
+-----------+--------------------+---------------+--------------------+
|000Silenced|           Price Tag|       Jessie J|         Who You Are|
|000Silenced|Price Tag (Acoust...|       Jessie J|           Price Tag|
|000Silenced|Be Mine! (Ballad ...|          Robyn|            Be Mine!|
|000Silenced|            Acapella|          Kelis|            Acapella|
|000Silenced|   I'm Not Invisible|      The Tease|   I'm Not Invisible|
|000Silenced|Bounce (Feat NORE...|       MSTRKRFT|         Fist of God|
|000Silenced|Don't Stop The Mu...|        Rihanna|Addicted 2 Bassli...|
|000Silenced|               ObZen|      Meshuggah|               ObZen|
|000Silenced|   Yama's Messengers|         Gojira|The Way of All Flesh|
|000Silenced|On the Brink of E...|   Napalm Death|Time Waits For No...|
|000Silenced|On the Brink of E...|   Napalm Death|Time Waits For

In [63]:
row_numbers = df_listenings.count()
column_numbers = len(df_listenings.columns)
print(row_numbers, column_numbers)

13758905 4


## Let's perform some aggregation

We count how many times each user has listened to each track.



In [64]:
df_listenings_agg = df_listenings.select('user_id','track').groupby('user_id','track').agg(count('*')).orderBy('user_id')

df_listenings_agg.show()

+-------+--------------------+--------+
|user_id|               track|count(1)|
+-------+--------------------+--------+
| --Seph| White Winter Hymnal|       3|
| --Seph|         The Funeral|       1|
| --Seph|Hope There's Someone|       1|
| --Seph|         The Painter|       1|
| --Seph|          Je te veux|       1|
| --Seph|            War Pigs|       1|
| --Seph|                 F12|       1|
| --Seph|                Team|       1|
| --Seph|          Nightmares|       1|
| --Seph|               Radio|       1|
| --Seph|   All I Want Is You|       1|
| --Seph|    Little by Little|       2|
| --Seph|        After Nature|       1|
| --Seph|In the Hall of th...|       1|
| --Seph|   Hey There Delilah|       1|
| --Seph|   Let's Call It Off|       1|
| --Seph|               Leloo|       1|
| --Seph|             Pack Up|       1|
| --Seph|           Introitus|       1|
| --Seph|        The Leanover|       1|
+-------+--------------------+--------+
only showing top 20 rows
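
To make the aggregation concrete, here is a minimal plain-Python sketch of what grouping by `user_id` and `track` and counting rows computes. It does not use Spark, and the `listenings` sample is made up for illustration.

```python
from collections import Counter

# Hypothetical sample of (user_id, track) listening events, one entry per play.
listenings = [
    ("u1", "Song A"),
    ("u1", "Song A"),
    ("u1", "Song B"),
    ("u2", "Song A"),
]

# Counting identical (user_id, track) pairs gives one row per pair with its
# play count, which is what groupby('user_id', 'track').agg(count('*')) returns.
play_counts = Counter(listenings)

for (user_id, track), plays in sorted(play_counts.items()):
    print(user_id, track, plays)
```

Four events collapse into three (user, track) pairs, in the same way that the 13,758,905 listening rows collapse into 9,930,128 aggregated rows in the next cell.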



In [65]:
row_numbers = df_listenings_agg.count()
column_numbers = len(df_listenings_agg.columns)
print(row_numbers,column_numbers)

9930128 3


In [66]:
df_listenings_agg = df_listenings_agg.limit(20000)

## Let's convert the user_id and track columns into unique integers

In [67]:
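# Fit one StringIndexer per column. The play-count column is named 'count(1)',
# so subtracting 'count' removes nothing and 'count(1)' is indexed as well;
# that extra 'count(1)_index' column is dropped by the select in the next cell.
indexer = [StringIndexer(inputCol=c, outputCol=c+'_index').fit(df_listenings_agg) for c in list(set(df_listenings_agg.columns) - set(['count']))]
pipeline = Pipeline(stages=indexer)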

data = pipeline.fit(df_listenings_agg).transform(df_listenings_agg)
data.show()


+-------+--------------------+--------+-------------+-----------+--------------+
|user_id|               track|count(1)|user_id_index|track_index|count(1)_index|
+-------+--------------------+--------+-------------+-----------+--------------+
| --Seph|          Nightmares|       1|         69.0|    10600.0|           0.0|
| --Seph|Virus (Luke Fair ...|       1|         69.0|    15893.0|           0.0|
| --Seph|Airplanes [feat H...|       1|         69.0|      521.0|           0.0|
| --Seph|Belina (Original ...|       1|         69.0|     3280.0|           0.0|
| --Seph|              Monday|       1|         69.0|      334.0|           0.0|
| --Seph|Hungarian Dance No 5|       1|         69.0|     7555.0|           0.0|
| --Seph|       Life On Mars?|       1|         69.0|     1164.0|           0.0|
| --Seph|  California Waiting|       1|         69.0|      195.0|           0.0|
| --Seph|       Phantom Pt II|       1|         69.0|     1378.0|           0.0|
| --Seph|   Summa for String

In [68]:
data = data.select('user_id_index','track_index','count(1)').orderBy('user_id_index')
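
As a rough illustration of how the string columns become numeric indices, the sketch below mimics the default `StringIndexer` ordering in plain Python: the most frequent label gets index 0, and ties are broken alphabetically (the behaviour documented for Spark 3.x). The helper `fit_string_index` and the sample `user_ids` are hypothetical, not part of the notebook.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct string to an index, most frequent label first.

    Ties in frequency are broken alphabetically. Indices are floats
    because Spark's StringIndexer also emits doubles (e.g. 69.0).
    """
    freq = Counter(values)
    ordered = sorted(freq, key=lambda label: (-freq[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Hypothetical user_id column: "bob" appears 3 times, "alice" 2, "carol" 1.
user_ids = ["bob", "alice", "bob", "carol", "bob", "alice"]
user_index = fit_string_index(user_ids)
print(user_index)  # {'bob': 0.0, 'alice': 1.0, 'carol': 2.0}
```

This is why the indices in the table above carry no meaning beyond identity: they only reflect how often each label occurs in the data the indexer was fitted on.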

## Train and Test Data

In [69]:
(training, test) = data.randomSplit([0.5,0.5])
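
With a 50/50 random split, some users or tracks may end up only in the test set. ALS has no learned factors for them, and with its default `coldStartStrategy` of `'nan'` it predicts NaN for those rows. The plain-Python sketch below shows how such "cold" users arise; the `rows` data and the seed are made up, and the per-row coin flip only approximates what `randomSplit` does.

```python
import random

# Hypothetical (user_index, track_index, plays) rows: user u has u + 1 tracks.
rows = [(u, t, 1) for u in range(6) for t in range(u + 1)]

rng = random.Random(42)  # fixed seed so the split is repeatable
training_rows, test_rows = [], []
for row in rows:
    # Each row goes to either side with probability 0.5.
    if rng.random() < 0.5:
        training_rows.append(row)
    else:
        test_rows.append(row)

train_users = {u for u, _, _ in training_rows}
test_users = {u for u, _, _ in test_rows}

# Users that the model would never see during training.
cold_users = test_users - train_users
print("users only in test:", sorted(cold_users))
```

In PySpark, passing `seed=` to `randomSplit` makes the split repeatable, and setting `coldStartStrategy='drop'` on `ALS` removes the NaN rows from the predictions.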

## Let's Create our Model

In [70]:
USERID = 'user_id_index'
TRACK = 'track_index'
COUNT = 'count(1)'

als = ALS(maxIter=5, regParam=0.01, userCol=USERID, itemCol=TRACK, ratingCol=COUNT)
model = als.fit(training)

predictions = model.transform(test)

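Py4JJavaError: ignored

This cell failed with a `Py4JJavaError`, and the traceback was not captured, so the cells below were never run (`In [None]`). The cause is not shown. One possibility worth checking is that `limit(20000)` on a lazily evaluated DataFrame can return different rows each time it is recomputed, so the already fitted `StringIndexer` stages may meet labels they did not see during fitting.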
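
The `predictions` DataFrame is created but never scored. One common way to score predicted play counts is the root-mean-square error (RMSE) between actual and predicted values. The sketch below is plain Python rather than Spark, uses made-up pairs, and skips NaN predictions, which ALS emits by default for users or tracks absent from training. `rmse_ignoring_nan` is a hypothetical helper, not a Spark function.

```python
import math


def rmse_ignoring_nan(pairs):
    """Return the RMSE over (actual, predicted) pairs, skipping NaN predictions."""
    usable = [(a, p) for a, p in pairs if not math.isnan(p)]
    if not usable:
        return float("nan")
    squared_error = sum((a - p) ** 2 for a, p in usable)
    return math.sqrt(squared_error / len(usable))


# Hypothetical (actual play count, predicted play count) pairs.
pairs = [(3, 2.0), (1, 1.0), (2, float("nan")), (1, 3.0)]
result = rmse_ignoring_nan(pairs)
print(result)
```

In PySpark the equivalent step would typically use `RegressionEvaluator` from `pyspark.ml.evaluation` with `metricName='rmse'`, `labelCol='count(1)'`, and `predictionCol='prediction'`.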

## Generate top 10 track recommendations for each user

In [None]:
recs = model.recommendForAllUsers(10)

In [None]:
recs.show()

In [None]:
recs.take(1)
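
`recommendForAllUsers` returns index values, not names: each row holds a user index and a list of (track index, score) entries. To read the result, the indices have to be mapped back to the original strings. The plain-Python sketch below shows the idea with a made-up mapping and a made-up recommendation row; none of the values come from the model.

```python
# Hypothetical label-to-index mapping, as a fitted indexer would produce.
track_index = {"Song A": 0, "Song B": 1, "Song C": 2}

# Invert it so that an index can be looked up to recover the track name.
index_to_track = {idx: name for name, idx in track_index.items()}

# Hypothetical recommendation row: (user index, [(track index, score), ...]).
rec_row = (0, [(2, 0.93), (0, 0.41)])

user, recommendations = rec_row
readable = [(index_to_track[idx], score) for idx, score in recommendations]
print(user, readable)
```

In PySpark, the labels learned by each fitted `StringIndexerModel` are available through its `labels` attribute, and `IndexToString` from `pyspark.ml.feature` can perform the reverse mapping on a column.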