# Graph based Music Recommender

In this assignment you will build a music recommender engine based on users’ playlist history. In Tasks 1-4 you will build DataFrames that represent the weights of the graph edges, computed as the collaborative similarity between the vertices. Tasks 5-6 take the next steps toward fully implementing the recommender system described in the lectures.

## Data description
There are two data sources for this assignment. They are DataFrames in parquet format.

**The first dataset captures the user’s playing history.**

*Location - /data/sample264*

Fields: *trackId, userId, timestamp, artistId*

- trackId - id of the track
- userId - id of the user
- artistId - id of the artist
- timestamp - timestamp of the moment the user starts listening to a track

**The second dataset holds the metadata for tracks and artists.**

*Location - /data/meta*

Fields: *type, Name, Artist, Id*

- Type is either “track” or “artist”
- Name is the title of the track if type == “track”, and the name of the musician or group if type == “artist”.
- Artist is the creator of the track if type == “track”, and the name of the musician or group if type == “artist”.
- Id - id of the item

**NB.** Each task continues the previous one, so you may use the same IPython notebook for all the programming assignments in this week.

### Graph based Music Recommender. Task 1
Build the edges of the type “track-track”. To do so, compute the collaborative similarity between all the tracks: if a user started listening to track B within 7 minutes after starting track A, add 1 to the weight of the edge from vertex A to vertex B (the initial weight is 0).

Example:

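| userId | artistId | trackId | timestamp  |
|--------|----------|---------|------------|
| 7      | 12       | 1       | 1534574189 |
| 7      | 13       | 4       | 1534574289 |
| 5      | 12       | 1       | 1534574389 |
| 5      | 13       | 4       | 1534594189 |
| 6      | 12       | 1       | 1534574489 |
| 6      | 13       | 4       | 1534574689 |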

Track 1 is similar to track 4 with weight 2 (before normalization): user 7 and user 6 each started these two tracks within the 7-minute window:

- userId 7: 1534574289 - 1534574189 = 100 seconds = 1 min 40 seconds < 7 minutes
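- userId 6: 1534574689 - 1534574489 = 200 seconds = 3 min 20 seconds < 7 minutes
- userId 5: 1534594189 - 1534574389 = 19800 seconds = 5 hours 30 minutes > 7 minutes, so this pair is not counted

Note that track 4 is similar to track 1 with the same weight 2.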
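
The counting rule can be checked on the example rows with a minimal plain-Python sketch. It is not Spark code and only illustrates the logic. The function name `track_track_weights` and the symmetric use of `abs()` are assumptions, chosen to match the note that both directions get the same weight.

```python
from collections import defaultdict

# Example rows from the text: (userId, artistId, trackId, timestamp in seconds).
rows = [
    (7, 12, 1, 1534574189),
    (7, 13, 4, 1534574289),
    (5, 12, 1, 1534574389),
    (5, 13, 4, 1534594189),
    (6, 12, 1, 1534574489),
    (6, 13, 4, 1534574689),
]

WINDOW_SECONDS = 7 * 60  # the 7-minute window from the task statement


def track_track_weights(rows, window=WINDOW_SECONDS):
    """Count, per ordered track pair, how often one user started both within `window` seconds."""
    plays_by_user = defaultdict(list)
    for user_id, _artist_id, track_id, ts in rows:
        plays_by_user[user_id].append((track_id, ts))

    weights = defaultdict(int)
    for plays in plays_by_user.values():
        for track_a, ts_a in plays:
            for track_b, ts_b in plays:
                # Skip pairs of the same track; abs() counts the edge in both directions.
                if track_a != track_b and abs(ts_a - ts_b) <= window:
                    weights[(track_a, track_b)] += 1
    return dict(weights)


weights = track_track_weights(rows)
print(weights[(1, 4)], weights[(4, 1)])  # prints "2 2"
```

User 5 contributes nothing because the two plays are 19800 seconds apart.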

**Tip:** consider joining the graph to itself on userId and removing pairs with the same tracks. For each track, choose the top 50 most similar tracks ordered by weight and normalize the weights of its edges (divide the weight of each edge by the sum of the weights of all its edges). Use rank() to choose top 40 tracks as is done in the demo.
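
The "top N, then normalize" step can be sketched in plain Python for small inputs. This is an illustration, not the Spark implementation. The helper name `normalize_top_n`, the sample edges, and the tie-break by `id2` are assumptions (a `row_number()` window in Spark breaks ties arbitrarily).

```python
from collections import defaultdict


def normalize_top_n(edges, n):
    """edges: iterable of (id1, id2, weight); returns a list of (id1, id2, norm_weight)."""
    by_source = defaultdict(list)
    for id1, id2, weight in edges:
        by_source[id1].append((id2, weight))

    result = []
    for id1, neighbours in by_source.items():
        # Keep the n heaviest outgoing edges (ties broken by id2, an assumption).
        top = sorted(neighbours, key=lambda e: (-e[1], e[0]))[:n]
        total = sum(weight for _, weight in top)
        # Divide each kept weight by the sum of the kept weights.
        result.extend((id1, id2, weight / total) for id2, weight in top)
    return result


# Made-up edges: track 1 has three neighbours, track 4 has one.
edges = [(1, 4, 2), (1, 5, 1), (1, 6, 1), (4, 1, 2)]
normalized = normalize_top_n(edges, n=2)
for row in normalized:
    print(row)
```

With `n=2`, the edge (1, 6) is dropped before the sum is taken, so the two kept edges of track 1 get weights 2/3 and 1/3.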

Sort the resulting DataFrame in descending order by the column norm_weight, and then in ascending order first by “id1”, then by “id2”. Take the top 40 rows, select only the columns “id1” and “id2”, and print them.
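
The required ordering is a single compound sort key: norm_weight descending, then id1 and id2 ascending. A small plain-Python sketch of that ordering follows. The function name `top_edges` and the sample values are made up for illustration.

```python
def top_edges(normalized_edges, limit=40):
    """Sort (id1, id2, norm_weight) rows and return the first `limit` (id1, id2) pairs."""
    ordered = sorted(normalized_edges, key=lambda e: (-e[2], e[0], e[1]))
    return [(id1, id2) for id1, id2, _ in ordered[:limit]]


# Made-up rows: two edges tie at weight 1.0 and are then ordered by id1, id2.
sample = [(3, 9, 0.5), (1, 8, 1.0), (1, 7, 1.0), (2, 5, 0.25)]
pairs = top_edges(sample, limit=3)
for id1, id2 in pairs:
    print(id1, id2)
```

In the DataFrame API the same key corresponds to one `orderBy` call with the first column descending and the other two ascending.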

**Output example:**

`54719		767867
54719		767866
50787		327676`

In [145]:
#import os
#execfile(os.path.join(os.environ["SPARK_HOME"], 'python/pyspark/shell.py'))

In [137]:
from pyspark.sql import SparkSession
sparkSession = SparkSession.builder.enableHiveSupport().master("local").getOrCreate()

In [138]:
data = sparkSession.read.parquet("/data/sample264")
meta = sparkSession.read.parquet("/data/meta")

In [139]:
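from pyspark.sql import Window
from pyspark.sql.functions import col, row_number, sum

def norm(df, key1, key2, field, n):
    # For each key1, keep the n rows with the largest `field` value.
    # key2 is not used in the body; it is kept so existing calls stay valid.
    window = Window.partitionBy(key1).orderBy(col(field).desc())

    topsDF = df.withColumn("row_number", row_number().over(window)) \
        .filter(col("row_number") <= n) \
        .drop("row_number")

    # Sum of the kept weights per key1; groupBy already retains the key1 column.
    tmpDF = topsDF.groupBy(col(key1)).agg(sum(col(field)).alias("sum_" + field))

    # Divide each weight by its per-key1 sum to get the normalized weight.
    normalizedDF = topsDF.join(tmpDF, key1, "inner") \
        .withColumn("norm_" + field, col(field) / col("sum_" + field)) \
        .cache()

    return normalizedDF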

In [246]:
from pyspark.sql.functions import col, abs, rank

data1 = data.select(col("userId"), col("trackId").alias("trackId1"), col("timestamp").alias("timestamp1"))

data2 = data.select(col("userId"), col("trackId").alias("trackId2"), col("timestamp").alias("timestamp2"))

#join the play history to itself on userId,
#keep pairs of plays started within 7 minutes of each other (timestamps are in seconds),
#remove pairs with the same track, and count the remaining pairs as the edge weight
similarityDF = data1.join(data2, "userId", "inner")\
    .filter(abs(col("timestamp1") - col("timestamp2")) / 60 <= 7)\
    .filter(col("trackId1") != col("trackId2"))\
    .groupBy(col("trackId1"), col("trackId2"))\
    .count()\
    .cache()

In [217]:
#for each track choose the top 50 similar tracks ordered by weight and normalize the weights of its edges
normalizedDF = norm(similarityDF, "trackId1", "trackId2", "count", 50)

In [218]:
#Sort the resulting Data Frame in the descending order by the column norm_weight (here norm_count),
#and then in the ascending order first by “id1” (trackId1), then by “id2” (trackId2).
similarTrackList = normalizedDF\
    .orderBy(col("norm_count").desc(), col("trackId1"), col("trackId2"))

#Take top 40 rows and select only the columns “id1”, “id2”.
result = similarTrackList\
    .select(col("trackId1").alias("id1"), col("trackId2").alias("id2"))\
    .take(40)

In [220]:
#for val in result:
#    print('%s %s' % val);

### Graph based Music Recommender. Task 2

Build the edges of the type “user-track”. Take the number of times the user listened to the track as the weight of the edge from the user’s vertex to the track’s vertex.

**Tip:** group the dataframe by columns userId and trackId and use function “count” of DF API.

For each user, take the top 1000 tracks by weight and normalize the weights of these edges.
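
As a quick check of the edge definition, here is a minimal plain-Python sketch (not Spark) that counts plays per (userId, trackId) pair, keeps each user's top `n` tracks, and normalizes. The function name `user_track_edges` and the sample plays are hypothetical.

```python
from collections import Counter, defaultdict


def user_track_edges(plays, n=1000):
    """plays: iterable of (userId, trackId); returns {(userId, trackId): norm_weight}."""
    counts = Counter(plays)  # number of times each user played each track

    tracks_by_user = defaultdict(list)
    for (user_id, track_id), count in counts.items():
        tracks_by_user[user_id].append((track_id, count))

    edges = {}
    for user_id, tracks in tracks_by_user.items():
        # Keep the user's n most played tracks, then divide by the sum of kept counts.
        top = sorted(tracks, key=lambda t: (-t[1], t[0]))[:n]
        total = sum(count for _, count in top)
        for track_id, count in top:
            edges[(user_id, track_id)] = count / total
    return edges


# Made-up plays: user 7 played track 1 twice and track 4 once; user 5 played track 1 once.
plays = [(7, 1), (7, 1), (7, 4), (5, 1)]
user_edges = user_track_edges(plays)
for key in sorted(user_edges):
    print(key, user_edges[key])
```

A user with a single track always gets a normalized weight of 1.0, so rows like that sort to the top of the result.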

Sort the resulting DataFrame in descending order by the column norm_weight, and then in ascending order first by “id1”, then by “id2”. Take the top 40 rows, select only the columns “id1” and “id2”, and print them.

**The part of the result on the sample dataset:**

`...
195 946408
215 860111
235 897176
300 857973
321 915545
...`

In [241]:
#group the dataframe by columns userId and trackId and use function “count” of DF API.
userTrack = data.groupBy(col("userId"), col("trackId")).count()

In [242]:
#For each user take top-1000 and normalize them.
#Sort the resulting Data Frame in descending order by the column norm_weight (here norm_count),
#and then in ascending order first by “id1” (userId), then by “id2” (trackId), and take top 40 rows.
userTrackNorm = (norm(userTrack, "userId", "trackId", "count", 1000) \
    .orderBy(col("norm_count").desc(), col("userId"), col("trackId"))\
    .limit(40)).cache()

In [243]:
#The rows are already sorted and limited to 40, so no extra window or rank step is needed;
#select only the columns “id1”, “id2”.
userTrackList = userTrackNorm\
    .select(col("userId").alias("id1"), col("trackId").alias("id2"))\
    .take(40)

In [244]:
for val in userTrackList:
    print("%s %s" % val)

66 965774
116 867268
128 852564
131 880170
195 946408
215 860111
235 897176
300 857973
321 915545
328 943482
333 818202
346 864911
356 961308
428 943572
431 902497
445 831381
488 841340
542 815388
617 946395
649 901672
658 937522
662 881433
698 935934
708 952432
746 879259
747 879259
776 946408
784 806468
806 866581
811 948017
837 799685
901 871513
923 879322
934 940714
957 945183
989 878364
999 967768
1006 962774
1049 849484
1057 920458


### Graph based Music Recommender. Task 3
Build the edges of the type “user-artist”. Take the number of times the user has listened to the artist’s tracks as the weight of the edge from the user’s vertex to the artist’s vertex.

**Tip:** group the dataframe by the columns userId and artistId and use the function “count” of DF API. For each user take top-100 artists and normalize weights.

Sort the resulting DataFrame in descending order by the column norm_weight, and then in ascending order first by “id1”, then by “id2”. Take the top 40 rows, select only the columns “id1” and “id2”, and print them.

**The part of the result on the sample dataset:**

`...
131 983068
195 997265
215 991696
235 990642
288 1000564
...`

In [247]:
#group the dataframe by columns userId and artistId and count the plays
userArtist = data.groupBy(col("userId"), col("artistId")).count()

In [254]:
#For each user take top-100 artists and normalize, then sort in descending order by norm_count,
#in ascending order by userId and artistId, and keep the top 40 rows.
userArtistNorm = (norm(userArtist, "userId", "artistId", "count", 100)\
                  .orderBy(col("norm_count").desc(), col("userId"), col("artistId"))\
                  .limit(40)).cache()

In [255]:
#The rows are already sorted and limited to 40; select only the columns “id1”, “id2”.
userArtistList = userArtistNorm\
    .select(col("userId").alias("id1"), col("artistId").alias("id2"))\
    .take(40)

In [256]:
for val in userArtistList:
    print ("%s %s" % val)

66 993426
116 974937
128 1003021
131 983068
195 997265
215 991696
235 990642
288 1000564
300 1003362
321 986172
328 967986
333 1000416
346 982037
356 974846
374 1003167
428 993161
431 969340
445 970387
488 970525
542 969751
612 987351
617 970240
649 973851
658 973232
662 975279
698 995788
708 968848
746 972032
747 972032
776 997265
784 969853
806 995126
811 996436
837 989262
901 988199
923 977066
934 990860
957 991171
989 975339
999 968823
