<a target="_blank" href="../cluster" style="font-size:20px">All Applications (YARN)</a>

# Homework

_This task is very similar to the one you have already done on Hadoop. That is intentional: notice how much easier it is to solve with Spark._

We will use listening logs for music artists from the Yandex.Music service.

The `events.csv` file contains entries like `User,Artist,Number of plays,Number of skips`:
```csv
userId,artistId,plays,skips
0,335,1,0
0,708,1,0
0,710,2,1
0,815,1,1
```

You need to do the following:
1. **Keep only the users whose total number of plays is strictly greater than 2000. How many such users are there?**
2. **In the data filtered in step 1, find the 5 most popular artists by number of users and report their identifiers.**

Details:
1. Let's assume that a single user's playlist always fits in memory.
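
For small inputs, the two steps can be cross-checked in plain Python without Spark. This is only a sketch: the `solve` helper, the toy rows, and the lowered threshold in the usage example are illustrative assumptions and are not taken from `events.csv`.

```python
import csv
import io
from collections import Counter, defaultdict


def solve(csv_text, min_plays=2000, top_n=5):
    """Return (number of selected users, top artists by distinct selected users)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))

    # Step 1: total plays per user, keep users strictly above the threshold.
    plays_per_user = defaultdict(int)
    for row in rows:
        plays_per_user[row["userId"]] += int(row["plays"])
    selected = {u for u, total in plays_per_user.items() if total > min_plays}

    # Step 2: for each artist, collect the distinct selected users.
    users_per_artist = defaultdict(set)
    for row in rows:
        if row["userId"] in selected:
            users_per_artist[row["artistId"]].add(row["userId"])

    counts = Counter({a: len(us) for a, us in users_per_artist.items()})
    top = [int(artist) for artist, _ in counts.most_common(top_n)]
    return len(selected), top


# Hypothetical toy data (not taken from events.csv).
SAMPLE = """userId,artistId,plays,skips
0,335,3,0
0,708,1,0
1,335,5,1
1,708,2,0
2,335,4,0
2,710,1,0
3,710,1,0
"""

q1, q2 = solve(SAMPLE, min_plays=3, top_n=2)
print(q1, q2)  # prints: 3 [335, 708]
```

User 3 has only one play in total, so it is dropped in step 1 and does not contribute to artist 710 in step 2.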

Save the solution to the `result.json` file. 
File content example:

```json
{
    "q1": 123,
    "q2": [
        4,
        5,
        6,
        7,
        8
    ]
}
```

In [1]:
# file content example
! head -n 5 yandex_music/events.csv

userId,artistId,plays,skips
0,335,1,0
0,708,1,0
0,710,2,1
0,815,1,1


In [2]:
# copy files to HDFS
! hadoop fs -copyFromLocal yandex_music /
! hadoop fs -ls -h /yandex_music

copyFromLocal: `/yandex_music/events.csv': File exists
copyFromLocal: `/yandex_music/artists.jsonl': File exists
copyFromLocal: `/yandex_music/README.txt': File exists
Found 3 items
-rw-r--r--   1 jovyan supergroup        254 2023-04-19 14:44 /yandex_music/README.txt
-rw-r--r--   1 jovyan supergroup      3.7 M 2023-04-19 14:44 /yandex_music/artists.jsonl
-rw-r--r--   1 jovyan supergroup     47.6 M 2023-04-19 14:44 /yandex_music/events.csv


In [3]:
# https://spark.apache.org/docs/latest/rdd-programming-guide.html
# http://spark.apache.org/docs/latest/sql-getting-started.html

import findspark
findspark.init()

import pyspark
sc = pyspark.SparkContext(appName='jupyter')

from pyspark.sql import SparkSession, Row
se = SparkSession(sc)

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2023-04-19 17:22:10,041 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.


In [4]:
# load csv as Spark DataFrame
events = se.read.csv("hdfs:///yandex_music/events.csv", header=True)  # the first row is the header
events.createOrReplaceTempView("events")  # registerTempTable is deprecated since Spark 2.0
events.limit(5).toPandas()

   userId artistId plays skips
0       0      335     1     0
1       0      708     1     0
2       0      710     2     1
3       0      815     1     1
4       0      880     1     1


In [5]:
# we can convert this DataFrame to RDD
events.rdd.take(5)

[Row(userId='0', artistId='335', plays='1', skips='0'),
 Row(userId='0', artistId='708', plays='1', skips='0'),
 Row(userId='0', artistId='710', plays='2', skips='1'),
 Row(userId='0', artistId='815', plays='1', skips='1'),
 Row(userId='0', artistId='880', plays='1', skips='1')]

In [6]:
# fetch the data type of each column
events.dtypes

[('userId', 'string'),
 ('artistId', 'string'),
 ('plays', 'string'),
 ('skips', 'string')]
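
Every column is reported as `string` here. The snippet below is a plain-Python illustration, with made-up values, of why counters are usually converted to a numeric type before filtering. It says nothing about Spark's own coercion rules.

```python
# Plain-Python illustration only; the values are made up.
plays_as_text = ["999", "2001", "10000"]

# Strings compare character by character, so "999" ranks above "2000"
# while "10000" ranks below it.
text_hits = [p for p in plays_as_text if p > "2000"]

# After conversion the comparison is numeric, as intended.
int_hits = [int(p) for p in plays_as_text if int(p) > 2000]

print(text_hits)  # ['999', '2001']
print(int_hits)   # [2001, 10000]
```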

In [7]:
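from pyspark.sql.functions import col, sum  # note: shadows Python's built-in sum
from pyspark.sql.types import IntegerType

# the CSV was read without a schema, so cast the counters from string to int
events = events.withColumn("plays", col("plays").cast(IntegerType()))
events = events.withColumn("skips", col("skips").cast(IntegerType()))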

In [8]:
events.dtypes

[('userId', 'string'),
 ('artistId', 'string'),
 ('plays', 'int'),
 ('skips', 'int')]

In [9]:
# total number of plays per user
new_events = events.groupBy("userID").agg(sum("plays").alias("total"))

# keep the users whose total is strictly greater than 2000
user_id = new_events.filter(col("total") > 2000)
user_id.show()



+------+-----+
|userID|total|
+------+-----+
|   718| 2063|
|   711| 2439|
|   700| 2496|
|   886| 3294|
|  1236| 2622|
|  1041| 3470|
|  1093| 3525|
|  1192| 3288|
|   685| 2779|
|   836| 2266|
|  1099| 2073|
|   863| 2687|
|   991| 2064|
|  1100| 2744|
|   710| 3381|
|  1012| 3085|
|  1314| 3455|
|   652| 3087|
|   924| 3374|
|   878| 3570|
+------+-----+
only showing top 20 rows

In [10]:
# inner join: keep only the events of the users selected above
new_joined_df = events.join(user_id, "userID")
new_joined_df.show()



+------+--------+-----+-----+-----+
|userId|artistId|plays|skips|total|
+------+--------+-----+-----+-----+
|   125|      50|    2|    0| 3176|
|   125|     335|    8|    0| 3176|
|   125|     425|    2|    0| 3176|
|   125|     854|    7|    1| 3176|
|   125|     868|    2|    0| 3176|
|   125|    1091|    1|    0| 3176|
|   125|    1123|    1|    0| 3176|
|   125|    1179|    1|    0| 3176|
|   125|    1202|    1|    0| 3176|
|   125|    1555|    2|    0| 3176|
|   125|    1897|    2|    3| 3176|
|   125|    1993|    8|    0| 3176|
|   125|    2130|    1|    0| 3176|
|   125|    2147|    1|    0| 3176|
|   125|    2162|    1|    0| 3176|
|   125|    2267|    1|    0| 3176|
|   125|    2388|    1|    0| 3176|
|   125|    2922|    3|    0| 3176|
|   125|    3091|    5|    0| 3176|
|   125|    3104|    1|    0| 3176|
+------+--------+-----+-----+-----+
only showing top 20 rows

In [11]:
# number of distinct users left after filtering
unique_values = new_joined_df.select("userId").distinct()
q1 = unique_values.count()
q1

1705

In [12]:
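from pyspark.sql.functions import desc

# rows per artist; this equals the number of users as long as each
# (userId, artistId) pair occurs only once in the data
artist_count = new_joined_df.groupBy("artistId").count()
sorted_artist_count = artist_count.orderBy(desc("count"))
sorted_artist_count.show()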



+--------+-----+
|artistId|count|
+--------+-----+
|   11368| 1421|
|    3629| 1274|
|     259| 1221|
|   44148| 1191|
|   23524| 1167|
|   59783| 1138|
|   21042| 1067|
|   21643| 1037|
|   20272| 1035|
|     645| 1032|
|    3568| 1026|
|   23595| 1023|
|   28412| 1011|
|   63958| 1007|
|   48965|  985|
|   64965|  983|
|   36545|  970|
|   48246|  961|
|   34243|  960|
|   45433|  958|
+--------+-----+
only showing top 20 rows
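
The `groupBy("artistId").count()` call counts rows per artist. That matches the number of users only when no (userId, artistId) pair is repeated. The plain-Python sketch below uses made-up pairs to show how the two figures diverge once a pair is duplicated.

```python
from collections import Counter, defaultdict

# Hypothetical (userId, artistId) pairs; the pair ("1", "259") is duplicated.
pairs = [("1", "259"), ("1", "259"), ("2", "259"), ("2", "3629")]

# Row count per artist: duplicates inflate the figure.
row_counts = Counter(artist for _, artist in pairs)

# Distinct users per artist: each user is counted once.
users = defaultdict(set)
for user, artist in pairs:
    users[artist].add(user)
user_counts = {artist: len(ids) for artist, ids in users.items()}

print(row_counts["259"], user_counts["259"])  # prints: 3 2
```

In Spark, aggregating with `countDistinct("userId")` from `pyspark.sql.functions` would give the second figure directly.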



                                                                                

In [13]:
sorted_artist_count = sorted_artist_count.limit(5)
sorted_artist_count.show()



+--------+-----+
|artistId|count|
+--------+-----+
|   11368| 1421|
|    3629| 1274|
|     259| 1221|
|   44148| 1191|
|   23524| 1167|
+--------+-----+

In [14]:
# collect the five artist ids to the driver (still strings at this point)
artist_ids = [row["artistId"] for row in sorted_artist_count.collect()]
print(artist_ids)



['11368', '3629', '259', '44148', '23524']

In [15]:
! echo '{ "q1": 1705, "q2": [ 11368, 3629, 259, 44148, 23524 ] }' > result.json
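
Typing the answers into an `echo` command by hand works, but the same file can also be produced from the computed values. This is a minimal sketch, assuming a hypothetical helper `write_result` that is not part of the original notebook. It converts the string ids to integers so that the JSON matches the example layout.

```python
import json


def write_result(q1, artist_ids, path="result.json"):
    """Write the answers in the expected layout; ids are converted to int."""
    result = {"q1": int(q1), "q2": [int(a) for a in artist_ids]}
    with open(path, "w") as f:
        json.dump(result, f, indent=4)
    return result


# Values copied from the cell outputs above.
written = write_result(1705, ["11368", "3629", "259", "44148", "23524"])
print(written)
```

In the notebook itself the call would be `write_result(q1, artist_ids)`, reusing the variables computed in the earlier cells.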

In [18]:
! curl -F file=@result.json "51.250.54.133:80/MDS-LSML1/mishafoniakov/w2/1"

1.0
Correct q1 answer! Correct q2 answer!


In [17]:
# stop Spark (and YARN application)
sc.stop()