# Homework

In the lectures, we discussed the Jaccard measure and how to calculate it efficiently on MapReduce.
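
For reference, a sketch of the definition, assuming $U_a$ denotes the set of users who listened to performer $a$ (this notation is introduced here for illustration and is not taken from the lectures):

$$
J(a, b) = \frac{|U_a \cap U_b|}{|U_a \cup U_b|} = \frac{|U_a \cap U_b|}{|U_a| + |U_b| - |U_a \cap U_b|}
$$

The second form needs only the per-performer listener counts and the number of shared listeners for each pair, which is what makes the measure convenient to compute with grouping and counting.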

You are invited to calculate the Jaccard measure on Spark to find similar performers in the entire dataset and answer the following questions:
1. **How many performers remain in consideration after applying all filters from the task description?**
2. **For how many pairs of performers did you obtain a non-zero Jaccard similarity? Count all ordered pairs, that is, both (a, b) and (b, a), as well as (a, a), which serves as a correctness check.**
3. **Find the 5 artists most similar to "Maroon 5" according to the computed Jaccard measure. As a result, write down the names of 5 artists other than "Maroon 5".**

A few words of advice:
- Use the data loaded in the <a href="#Loading-data">Loading-data</a> section.
- A user who listened to $N$ artists contributes to the similarity of $N^2$ pairs of artists, so a few very active users will greatly slow down the algorithm. In practice, one would cap such users at a subset of their plays, for example 1000. We will take a simpler route and consider only rows where $plays > 2$, keeping only the most confident user preferences.
- To make the similarities more reliable, we compute them only for performers listened to by strictly more than 50 users (counted after applying the previous filter on plays).
- To debug the algorithm on a smaller amount of data, you can use the transformation <a href="https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.sample">events.sample(False, 0.01)</a> so that debug runs finish quickly.
- We can assume that data about performers (for example, their popularity) fits in the memory of each machine: there simply are not enough performers in the world for it not to fit.
- If a step takes a very long time, you can increase the degree of parallelism, for example with
<a href="https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.groupByKey">groupByKey(numPartitions=100)</a>, to see execution progress at a finer granularity.
- Sometimes it makes sense to save a computed result to HDFS so that it is not recomputed every time it is needed.
- Working with big data requires patience: the author's solution runs for about 10 minutes.
- This problem can also be solved in Spark SQL, if you prefer it.
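
The two filters from the notes above can be sketched in plain Python on toy data. The function name `filter_events`, its parameters, and the sample rows are illustrative assumptions, not part of the assignment; a real solution would apply the same logic with Spark transformations or Spark SQL.

```python
from collections import defaultdict


def filter_events(events, min_plays=2, min_listeners=50):
    """Apply both filters to an iterable of (user_id, artist_id, plays) rows.

    1. Keep (user, artist) pairs with plays strictly greater than min_plays.
    2. Keep only artists with strictly more than min_listeners distinct
       listeners, counted after the first filter.
    """
    confident = {(u, a) for u, a, plays in events if plays > min_plays}

    listeners = defaultdict(set)
    for u, a in confident:
        listeners[a].add(u)

    popular = {a for a, users in listeners.items() if len(users) > min_listeners}
    return {(u, a) for u, a in confident if a in popular}


# Made-up rows; the listener threshold is lowered to 1 so the effect is visible.
toy = [(1, 10, 5), (2, 10, 3), (3, 10, 1), (1, 20, 4), (2, 30, 2)]
kept = filter_events(toy, min_plays=2, min_listeners=1)
print(sorted(kept))
```

Note that the popularity count uses only the rows that survive the plays filter, which is the order the notes describe.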

Save the solution to the `result.json` file. 
File content example:
```json
{
    "q1": 123,
    "q2": 456,
    "q3": [
        "artistName1",
        "artistName2",
        "artistName3",
        "artistName4",
        "artistName5"
    ]
}
```
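
A minimal sketch of writing such a file with Python's standard `json` module. The values are the placeholders from the example above, not real answers, and the variable names `result` and `fh` are arbitrary.

```python
import json

# Placeholder values copied from the example above, not real answers.
result = {
    "q1": 123,
    "q2": 456,
    "q3": [
        "artistName1",
        "artistName2",
        "artistName3",
        "artistName4",
        "artistName5",
    ],
}

with open("result.json", "w", encoding="utf-8") as fh:
    # ensure_ascii=False keeps non-ASCII artist names readable in the file.
    json.dump(result, fh, ensure_ascii=False, indent=4)
```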

In [1]:
! hadoop fs -copyFromLocal yandex_music /

copyFromLocal: `/yandex_music/events.csv': File exists
copyFromLocal: `/yandex_music/artists.jsonl': File exists
copyFromLocal: `/yandex_music/README.txt': File exists
copyFromLocal: `/yandex_music/untitled.txt': File exists
copyFromLocal: `/yandex_music/.ipynb_checkpoints/artists-checkpoint.jsonl': File exists
copyFromLocal: `/yandex_music/.ipynb_checkpoints/README-checkpoint.txt': File exists
copyFromLocal: `/yandex_music/.ipynb_checkpoints/untitled-checkpoint.txt': File exists


In [2]:
! hadoop fs -ls -h /yandex_music

Found 5 items
drwxr-xr-x   - jovyan supergroup          0 2023-04-09 16:02 /yandex_music/.ipynb_checkpoints
-rw-r--r--   1 jovyan supergroup        254 2023-04-08 12:14 /yandex_music/README.txt
-rw-r--r--   1 jovyan supergroup      3.7 M 2023-04-08 12:14 /yandex_music/artists.jsonl
-rw-r--r--   1 jovyan supergroup     47.6 M 2023-04-08 12:14 /yandex_music/events.csv
-rw-r--r--   1 jovyan supergroup          0 2023-04-09 16:02 /yandex_music/untitled.txt


In [3]:
import findspark
findspark.init()

In [4]:
import pyspark
sc = pyspark.SparkContext(appName='week-5')

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2023-04-21 13:40:07,950 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.


In [5]:
from pyspark.sql import SparkSession, Row
se = SparkSession(sc)

In [6]:
artists = se.read.json("hdfs:///yandex_music/artists.jsonl")
artists.registerTempTable("artists")

                                                                                

In [7]:
events = se.read.csv("hdfs:///yandex_music/events.csv", header=True, 
                     schema='userId bigint, artistId bigint, plays INT, skips INT')
events.registerTempTable("events")

In [32]:
se.sql("""select distinct userId, artistId from events where plays > 2""").registerTempTable("cond1_table")

In [34]:
se.sql("""select artistId, count(*) as popularity from cond1_table group by artistId""")



+------+--------+
|userId|artistId|
+------+--------+
|     2|   66150|
|     3|   40096|
|     3|   60021|
|     3|    8452|
|     4|   68294|
+------+--------+
only showing top 5 rows



                                                                                

In [35]:
se.sql("""select * from (select artistId, count(*) as popularity from cond1_table group by artistId) where popularity>50""").registerTempTable("cond2_table")

In [37]:
se.sql("""select userId, events.artistId, plays, popularity from cond2_table as s join events on s.artistId = events.artistId where plays > 2 order by userId, artistId""").registerTempTable("cond3_table")

In [38]:
# Keep a DataFrame handle: later cells call table.count() and table.groupBy(...).
table = se.sql("""select userId, artistId from cond3_table group by userId, artistId order by userId, artistId""")
table.registerTempTable("table")

AnalysisException: Table or view not found: temp1; line 9 pos 21;
'Sort ['userId ASC NULLS FIRST, 'artistId ASC NULLS FIRST], true
+- 'Aggregate ['userId, 'artistId], ['userId, 'artistId]
   +- 'SubqueryAlias __auto_generated_subquery_name
      +- 'Sort ['userId ASC NULLS FIRST, 'artistId ASC NULLS FIRST], true
         +- 'Project ['userId, 'events.artistId, 'plays, 'popularity]
            +- 'Filter ('plays > 2)
               +- 'Join Inner, ('s.artistId = 'events.artistId)
                  :- 'SubqueryAlias s
                  :  +- 'Project [*]
                  :     +- 'Filter ('popularity > 50)
                  :        +- 'SubqueryAlias __auto_generated_subquery_name
                  :           +- 'Aggregate ['artistId], ['artistId, count(1) AS popularity#148L]
                  :              +- 'UnresolvedRelation [temp1], [], false
                  +- SubqueryAlias events
                     +- View (`events`, [userId#11L,artistId#12L,plays#13,skips#14])
                        +- Relation [userId#11L,artistId#12L,plays#13,skips#14] csv


In [41]:
table.count()

498589

In [40]:
from pyspark.sql import functions as F
df = table.groupBy('userId').agg(F.collect_list("artistId").alias('pairs'))

In [12]:
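def f(list_):
    """Return every ordered pair 'i_j' of items in list_, including (i, i)."""
    lis = []
    for i in list_:
        for j in list_:
            # Keep both (i, j) and (j, i): question 2 counts ordered pairs.
            lis.append('{}_{}'.format(i, j))
    return lis
f([1, 2, 3])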

['1_1', '1_2', '1_3', '2_1', '2_2', '2_3', '3_1', '3_2', '3_3']

In [43]:
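from pyspark.sql import functions as F
from pyspark.sql.types import *
from pyspark.sql.functions import *

# Expand each user's artist list into ordered pairs, one row per pair.
udf_f = udf(f, ArrayType(StringType()))
pairs = df.withColumn("pairs", udf_f(col("pairs"))).select('pairs')
big_col = pairs.select(explode('pairs'))
# Count how many users listened to both artists of each pair.
cnt_col = big_col.groupBy('col').count().sort(desc('count'))
# Call F.split explicitly: assigning to `split` below would otherwise shadow
# the star-imported function when the cell is rerun.
split = cnt_col.withColumn('A', F.split(cnt_col['col'], '_').getItem(0)) \
       .withColumn('B', F.split(cnt_col['col'], '_').getItem(1))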

In [44]:
listenings = table.groupby('artistId').count().sort(desc('count')).toPandas().set_index('artistId')
def artist_to_listenings(id_):
    return int(listenings.loc[int(id_)].values[0])

AtoL = udf(artist_to_listenings, IntegerType())

                                                                                

In [45]:

result = split.withColumn("jaccard", (col("count")/(AtoL(col("A"))+AtoL(col("B"))-col("count")))).drop('col')
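
The expression above divides the number of shared listeners by the sum of both listener counts minus the shared count. As a sanity check, the same arithmetic can be sketched in plain Python on a toy dataset; the function name `jaccard_from_counts` and the sample data are hypothetical and serve only to illustrate the formula.

```python
from collections import Counter
from itertools import product


def jaccard_from_counts(user_artists):
    """Compute Jaccard similarity for every ordered pair of co-listened artists.

    user_artists maps a user id to the set of artist ids that user listened to.
    """
    listeners = Counter()     # number of listeners per artist
    co_listeners = Counter()  # number of shared listeners per ordered pair
    for artists in user_artists.values():
        listeners.update(artists)
        co_listeners.update(product(artists, repeat=2))

    return {
        (a, b): shared / (listeners[a] + listeners[b] - shared)
        for (a, b), shared in co_listeners.items()
    }


# Made-up data: three users and three artists.
toy = {1: {"x", "y"}, 2: {"x"}, 3: {"x", "y", "z"}}
sim = jaccard_from_counts(toy)
print(sim[("x", "y")], sim[("x", "x")])
```

Pairs of the form (a, a) always come out as 1, which is why the task suggests counting them as a correctness check.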


In [46]:
result.select('jaccard').filter('jaccard is not null').count()

2023-04-22 08:20:50,713 WARN storage.BlockManagerMasterEndpoint: No more replicas available for rdd_36_129 !
2023-04-22 08:20:50,713 WARN storage.BlockManagerMasterEndpoint: No more replicas available for rdd_36_131 !
2023-04-22 08:20:50,713 WARN storage.BlockManagerMasterEndpoint: No more replicas available for rdd_36_2 !
2023-04-22 08:20:50,713 WARN storage.BlockManagerMasterEndpoint: No more replicas available for rdd_36_19 !
2023-04-22 08:20:50,713 WARN storage.BlockManagerMasterEndpoint: No more replicas available for rdd_36_111 !
2023-04-22 08:20:50,713 WARN storage.BlockManagerMasterEndpoint: No more replicas available for rdd_36_93 !
2023-04-22 08:20:50,713 WARN storage.BlockManagerMasterEndpoint: No more replicas available for rdd_36_38 !
2023-04-22 08:20:50,713 WARN storage.BlockManagerMasterEndpoint: No more replicas available for rdd_36_80 !
2023-04-22 08:20:50,713 WARN storage.BlockManagerMasterEndpoint: No more replicas available for rdd_36_75 !
2023-04-22 08:20:50,713 WA

Py4JError: An error occurred while calling o213.count

In [54]:
import json

res = {"q1": 2889, "q2": 6838579}
# Use `fh` for the file handle so the pair-generating function `f` is not overwritten.
with open("week5.json", "w") as fh:
    json.dump(res, fh)
    

In [58]:
! curl -F file=@week5.json "51.250.54.133:80/MDS-LSML1/kewlsid96/w6/1"

<!doctype html>
<html lang=en>
<title>500 Internal Server Error</title>
<h1>Internal Server Error</h1>
<p>The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.</p>
