# Jaccard similarity 

We discussed Jaccard similarity in the lecture, along with an approach to computing it with MapReduce.

1. How many pairs have nonzero Jaccard similarity? Count ordered pairs, so (a, b), (b, a), and (a, a) are all included.
2. Find top 5 artists most similar to "Maroon 5" by Jaccard similarity.
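
For reference, the Jaccard similarity of two sets A and B is |A ∩ B| / |A ∪ B|. The snippet below is a minimal plain-Python sketch of that definition, not the notebook's Spark code; the function name `jaccard`, the made-up listener ids, and the choice to return 0.0 for two empty sets are assumptions made for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention assumed here for two empty sets
    return len(a & b) / len(union)


# Hypothetical listeners of two artists (made-up user ids).
listeners_x = {1, 2, 3, 4}
listeners_y = {3, 4, 5}
print(jaccard(listeners_x, listeners_y))  # 2 shared / 5 total = 0.4
```

In the tasks here, each artist's set would be the users who listened to that artist.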

In [1]:
import findspark
findspark.init()

import sys

from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
sc = SparkContext(appName="Jaccard")
se = SparkSession(sc)

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2023-04-08 13:06:00,688 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.


In [6]:
! aws s3 cp s3://ydatazian/yandex_music yandex_music --recursive
! hadoop fs -copyFromLocal yandex_music /
! hadoop fs -ls -h /yandex_music


Provided region_name 'ru-central1-' doesn't match a supported format.
Found 4 items
drwxr-xr-x   - jovyan supergroup          0 2023-04-08 12:14 /yandex_music/.ipynb_checkpoints
-rw-r--r--   1 jovyan supergroup        254 2023-04-08 12:14 /yandex_music/README.txt
-rw-r--r--   1 jovyan supergroup      3.7 M 2023-04-08 12:14 /yandex_music/artists.jsonl
-rw-r--r--   1 jovyan supergroup     47.6 M 2023-04-08 12:14 /yandex_music/events.csv


In [7]:
artists = se.read.json("hdfs:///yandex_music/artists.jsonl")
artists.createOrReplaceTempView("artists")  # registerTempTable is deprecated since Spark 2.0
artists.limit(5).toPandas()

   artistId     artistName
0         0    Mack Gordon
1         1   Kenny Dorham
2         2      Max Roach
3         3  Francis Rossi
4         4     Status Quo


In [8]:
events = se.read.csv("hdfs:///yandex_music/events.csv", header=True, 
                     schema='userId bigint, artistId bigint, plays INT, skips INT')
events.createOrReplaceTempView("events")  # registerTempTable is deprecated since Spark 2.0
events.limit(5).toPandas()

   userId  artistId  plays  skips
0       0       335      1      0
1       0       708      1      0
2       0       710      2      1
3       0       815      1      1
4       0       880      1      1


In [2]:
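import pandas as pd

# Load a pre-filtered copy of the events from a local CSV; this replaces the `events` DataFrame read from HDFS above.
events = pd.read_csv("filtered_events.csv")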

In [6]:
events = se.createDataFrame(events) 

In [7]:
events.show(20)

+------+--------+-----+-----+
|userId|artistId|plays|skips|
+------+--------+-----+-----+
|     0|    2130|    4|   10|
|     0|    2267|    5|    3|
|     0|    2810|    5|    3|
|     0|    3568|    5|    9|
|     0|    3629|    9|    8|
|     0|   14803|   17|   15|
|     0|   17830|    4|    3|
|     0|   19720|    7|    1|
|     0|   20003|    4|    0|
|     0|   21142|    3|    0|
|     0|   21465|    5|    2|
|     0|   14988|    5|    3|
|     0|   27803|    7|    5|
|     0|   28052|    4|    0|
|     0|   36687|   12|    8|
|     0|   59783|    4|    8|
|     0|   64609|    4|    3|
|     0|   13591|    8|    4|
|     0|   14201|    3|    1|
|     0|   17074|    4|    2|
+------+--------+-----+-----+
only showing top 20 rows



In [8]:
%%time
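# Per-artist count of rows with more than 2 plays, treated as the artist's listener count.
count_by_artist = (
    events.rdd
    .filter(lambda x: x.plays > 2)
    .map(lambda x: (x.artistId, 1))
    .reduceByKey(lambda a, b: a + b)
    .collect()
)

# Turn the collected (artistId, count) pairs into a lookup dict.
count_by_artist = dict(count_by_artist)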



CPU times: user 30.3 ms, sys: 15.3 ms, total: 45.6 ms
Wall time: 8.74 s


In [9]:
print(len(count_by_artist))
print(sum(1 if v > 50 else 0 for v in count_by_artist.values()))

2889
2889


In [10]:
%%time
! hadoop fs -rm -r hdfs:///jaccard.pickle

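def generate_pairs(x):
    # All ordered pairs of one user's artists, including (a, a) and both (a, b) and (b, a).
    return [((a1, a2), 1) for a1 in x for a2 in x]

jaccard_by_pair = (
    events.rdd
    # Keep rows with more than 2 plays, for artists with more than 50 such listeners.
    .filter(lambda x: x.plays > 2 and count_by_artist[x.artistId] > 50)
    .map(lambda x: (x.userId, x.artistId))
    # Collect each user's artists, then emit one record per ordered artist pair.
    .groupByKey(numPartitions=100)
    .flatMap(lambda x: generate_pairs(list(x[1])))
    # Sum per pair: the number of users who listened to both artists, |A ∩ B|.
    .reduceByKey(lambda a, b: a + b)
    # Jaccard = |A ∩ B| / (|A| + |B| - |A ∩ B|)
    .map(lambda x: (x[0], float(x[1]) / (count_by_artist[x[0][0]] + count_by_artist[x[0][1]] - x[1])))
)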

jaccard_by_pair.saveAsPickleFile("hdfs:///jaccard.pickle")

Deleted hdfs:///jaccard.pickle




CPU times: user 248 ms, sys: 59 ms, total: 307 ms
Wall time: 4min 43s
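
The pair-counting approach in the cell above can be traced on a tiny example without Spark. This is a rough sketch on made-up `(userId, artistId)` rows; the names `rows`, `listeners_per_artist`, `common`, and `toy_jaccard` are hypothetical, and it assumes each (user, artist) combination appears at most once, as the Spark job implicitly does.

```python
from collections import Counter, defaultdict

# Made-up (userId, artistId) rows standing in for the filtered events.
rows = [(0, "a"), (0, "b"), (1, "a"), (1, "b"), (1, "c"), (2, "a")]

# Per-artist listener counts, the role count_by_artist plays in the Spark job.
listeners_per_artist = Counter(artist for _, artist in rows)

# Equivalent of groupByKey: collect each user's artists.
artists_by_user = defaultdict(list)
for user, artist in rows:
    artists_by_user[user].append(artist)

# Equivalent of flatMap + reduceByKey: count co-listeners per ordered pair.
common = Counter()
for user_artists in artists_by_user.values():
    for a1 in user_artists:
        for a2 in user_artists:
            common[(a1, a2)] += 1

# Inclusion-exclusion for the union size: |A ∪ B| = |A| + |B| - |A ∩ B|.
toy_jaccard = {
    (a1, a2): n / (listeners_per_artist[a1] + listeners_per_artist[a2] - n)
    for (a1, a2), n in common.items()
}

print(len(toy_jaccard))          # 9 ordered pairs with nonzero similarity
print(toy_jaccard[("b", "c")])   # 1 shared listener / (2 + 1 - 1) = 0.5
```

Only pairs that share at least one listener are ever emitted, which is why counting the resulting records answers the first question directly.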


In [None]:
jaccard_by_pair = sc.pickleFile("hdfs:///jaccard.pickle")
print(jaccard_by_pair.count())
jaccard_by_pair.take(2)

6838579


[((709, 70609), 0.13356973995271867), ((876, 61198), 0.044416243654822336)]

In [None]:
artists.filter(artists.artistName == "Maroon 5").show()

+--------+----------+
|artistId|artistName|
+--------+----------+
|   14803|  Maroon 5|
+--------+----------+



In [None]:
artist_to_name = {}
for row in artists.collect():
    artist_to_name[row.artistId] = row.artistName

In [None]:
# most similar by Jaccard for Maroon 5
similar = (
    jaccard_by_pair
    .filter(lambda x: x[0][0] == 14803)
    .collect()
)

In [None]:
# Print the 10 highest-scoring rows; the first is Maroon 5 itself (self-similarity 1.0).
for (artist, other), j in sorted(similar, key=lambda x: -x[1])[:10]:
    print(artist_to_name[other], j)

Maroon 5 1.0
OneRepublic 0.3319755600814664
Sia 0.31266017426960535
David Guetta 0.29184782608695653
Bruno Mars 0.2867448151487827
Calvin Harris 0.2858903265557609
Imagine Dragons 0.28221597751906863
Ed Sheeran 0.2798199549887472
Coldplay 0.2794561933534743
Sam Smith 0.27321981424148606
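
Since the task asks for the top 5 and the listing above starts with the artist's own self-pair, one possible way to select the answer is sketched below. The helper `top_similar` and the `sample` records are hypothetical; the records only mimic the `((artist, other), score)` shape of `similar`.

```python
import heapq


def top_similar(pairs, artist_id, k=5):
    """Return the k highest-scoring (other_id, score) entries for artist_id, skipping the self-pair."""
    candidates = (
        (other, score)
        for (artist, other), score in pairs
        if artist == artist_id and other != artist_id
    )
    return heapq.nlargest(k, candidates, key=lambda item: item[1])


# Made-up records in the same ((artist, other), score) shape as `similar`.
sample = [((1, 1), 1.0), ((1, 2), 0.3), ((1, 3), 0.5), ((1, 4), 0.1), ((2, 1), 0.3)]
print(top_similar(sample, artist_id=1, k=2))  # [(3, 0.5), (2, 0.3)]
```

Applied to the collected list, a call such as `top_similar(similar, 14803)` would return artist ids that can then be mapped through `artist_to_name`.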


In [3]:
sc.stop()