# Evaluation Assignment

Data: Outbrain Click Prediction (Kaggle competition outbrain-click-prediction)

Tasks:
Using Spark RDD, DataFrame API and Python, calculate:

**1**. Top 10 most visited document_ids in the page_views_sample log

**2**. How many users have at least 2 different traffic_sources in the page_views_sample log (note that traffic_source is an encoded enum, not a count)

**3***. Top 10 most visited topic_ids in page_views_sample log (use documents_topics table)

The submission is a result.json file with the keys top_10_documents, users, and top_10_topics.
For the top-10 results, the answer must be a list of ids ordered from top 1 to top 10.

result.json example:

    {
        "top_10_documents": [
            111,
            222,
            333,
            ...,
            1010
        ],
        "users": 10000,
        "top_10_topics": [
            11,
            22,
            33,
            ...,
            101
        ]
    }
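
Before submitting, it can help to check that the file has the shape described above. The sketch below is one possible check; `validate_result` is a hypothetical helper, not part of the assignment, and the rules it enforces (exactly ten integer ids per list, an integer user count) are assumptions read off the example. The sample values are toy data.

```python
import json


def validate_result(result):
    """Return a list of problems found in a result dict (empty if none)."""
    problems = []
    for key in ("top_10_documents", "top_10_topics"):
        ids = result.get(key)
        if not isinstance(ids, list):
            problems.append(f"{key} must be a list")
        elif len(ids) != 10:
            problems.append(f"{key} must contain 10 ids, got {len(ids)}")
        elif not all(isinstance(i, int) and not isinstance(i, bool) for i in ids):
            problems.append(f"{key} must contain only integers")
    users = result.get("users")
    if not isinstance(users, int) or isinstance(users, bool):
        problems.append("users must be an integer")
    return problems


# Toy values for illustration only; real ids come from the Spark queries.
example = {
    "top_10_documents": list(range(1, 11)),
    "users": 10000,
    "top_10_topics": list(range(11, 21)),
}

# Round-trip through JSON to check the dict as it would be read back from disk.
example_problems = validate_result(json.loads(json.dumps(example)))
```

An empty list from `validate_result` means the structure matches these assumed rules; it says nothing about whether the ids themselves are correct.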
    
# Preparation

In [1]:
import tqdm.notebook as tqdm
import json


!/home/jovyan/start-hadoop.sh;

jovyan
 * Starting OpenBSD Secure Shell server sshd
start-stop-daemon: unable to set gid to 0 (Operation not permitted)
   ...fail!
 * sshd is running
Starting namenodes on [localhost]
localhost: namenode is running as process 155.  Stop it first and ensure /tmp/hadoop-jovyan-namenode.pid file is empty before retry.
Starting datanodes
localhost: datanode is running as process 276.  Stop it first and ensure /tmp/hadoop-jovyan-datanode.pid file is empty before retry.
Starting secondary namenodes [0cddd7a95d84]
0cddd7a95d84: secondarynamenode is running as process 514.  Stop it first and ensure /tmp/hadoop-jovyan-secondarynamenode.pid file is empty before retry.
Starting resourcemanager
resourcemanager is running as process 749.  Stop it first and ensure /tmp/hadoop-jovyan-resourcemanager.pid file is empty before retry.
Starting nodemanagers
localhost: nodemanager is running as process 871.  Stop it first and ensure /tmp/hadoop-jovyan-nodemanager.pid file is empty before retry.
221938 org

In [73]:
import findspark
findspark.init()

import pyspark
sc = pyspark.SparkContext(appName='jupyter')

from pyspark.sql import SparkSession
from pyspark.sql.functions import countDistinct
se = SparkSession(sc)

In [6]:
!hdfs dfs -df -h

Filesystem                Size   Used  Available  Use%
hdfs://localhost:9000  196.8 G  3.8 G    145.6 G    2%


In [10]:
!hdfs dfs -ls /.

Found 3 items
drwxrwx---   - root   supergroup          0 2023-04-19 14:53 /tmp
drwxr-xr-x   - jovyan supergroup          0 2023-04-19 14:53 /user
drwxr-xr-x   - jovyan supergroup          0 2023-04-19 22:00 /wiki


In [11]:
!mkdir -p ~/.kaggle

In [54]:
!cat kaggle.json > ~/.kaggle/kaggle.json

In [57]:
!chmod 600 ~/.kaggle/kaggle.json
!pip install -U urllib3 kaggle==1.5.3
!kaggle competitions download -c outbrain-click-prediction -f page_views_sample.csv.zip
!kaggle competitions download -c outbrain-click-prediction -f documents_topics.csv.zip

Collecting urllib3
  Using cached urllib3-1.26.15-py2.py3-none-any.whl (140 kB)
Downloading page_views_sample.csv.zip to /home/jovyan/work
100%|███████████████████████████████████████▊| 148M/149M [00:06<00:00, 27.9MB/s]
100%|████████████████████████████████████████| 149M/149M [00:06<00:00, 24.0MB/s]
Downloading documents_topics.csv.zip to /home/jovyan/work
 99%|███████████████████████████████████████▋| 120M/121M [00:05<00:00, 28.4MB/s]
100%|████████████████████████████████████████| 121M/121M [00:05<00:00, 22.7MB/s]


In [67]:
# !unzip '*.zip'
# !rm -rf *.zip

In [69]:
!hdfs dfs -put page_views_sample.csv
!hdfs dfs -put documents_topics.csv

put: `page_views_sample.csv': File exists
put: `documents_topics.csv': File exists


In [75]:
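# header=True takes column names from the first row; without inferSchema,
# every column (including document_id and topic_id) is read as a string.
page_views_sample = se.read.csv("page_views_sample.csv", header=True)
documents_topics = se.read.csv("documents_topics.csv", header=True)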

# Solution
## 1. Top 10 most visited document_ids in the page_views_sample log

In [76]:
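# Count views per document, keep the 10 most viewed,
# and cast the ids to int (the CSV columns were read as strings).
first = page_views_sample \
            .groupBy("document_id") \
            .count() \
            .orderBy("count", ascending=False) \
            .limit(10) \
            .rdd.map(lambda row: int(row.document_id)) \
            .collect()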
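
The query sorts by count only, so if two documents had the same number of views their relative order would not be guaranteed. The pure-Python sketch below shows the same top-N logic on toy data with an explicit tie-break on the id. The tie-break rule is an assumption (the assignment does not specify one), and `top_n` and the sample ids are made up for illustration.

```python
from collections import Counter


def top_n(ids, n=10):
    """Return the n most frequent ids, breaking ties by ascending id."""
    counts = Counter(ids)
    # Sort by descending count first, then by ascending id for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item_id for item_id, _ in ranked[:n]]


# Toy page-view log: each entry is the document_id of one view.
views = [7, 3, 7, 5, 3, 7, 9]
top_docs = top_n(views, n=3)
```

In Spark, the same effect could be had by passing a secondary sort column to `orderBy`; note that with string-typed ids that secondary order would be lexicographic.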


## 2. How many users have at least 2 different traffic_sources in the page_views_sample log (note that traffic_source is an encoded enum, not a count)

In [77]:
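# Count distinct traffic_source values per user,
# then count the users that have at least two of them.
second = page_views_sample \
            .groupBy("uuid") \
            .agg(countDistinct("traffic_source").alias("distinct_traffic_sources")) \
            .where("distinct_traffic_sources >= 2") \
            .count()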
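
Because traffic_source is an encoded category, the question is about how many distinct values each user has, not about the number of views or the sum of the values. A minimal pure-Python sketch of that logic on toy rows follows; `users_with_multiple_sources` and the sample rows are hypothetical and only mirror the grouping done in the query.

```python
from collections import defaultdict


def users_with_multiple_sources(rows, min_sources=2):
    """Count users whose views span at least min_sources distinct sources."""
    sources_by_user = defaultdict(set)
    for uuid, traffic_source in rows:
        # A set keeps each source once, however many views used it.
        sources_by_user[uuid].add(traffic_source)
    return sum(1 for s in sources_by_user.values() if len(s) >= min_sources)


# Toy (uuid, traffic_source) pairs; sources are strings, as in the raw CSV.
rows = [
    ("a", "1"), ("a", "1"), ("a", "1"),  # three views, one source: not counted
    ("b", "1"), ("b", "3"),              # two distinct sources: counted
    ("c", "2"),                          # single view: not counted
]
multi_source_users = users_with_multiple_sources(rows)
```

User "a" has the most views in this toy data but is still excluded, which is the distinction the note in the task is pointing at.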


## 3*. Top 10 most visited topic_ids in page_views_sample log (use documents_topics table)

In [78]:
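# The inner join attaches topics to each view: a document with several topics
# yields one row per topic, and views of documents without topics are dropped.
third = page_views_sample \
            .join(documents_topics, on="document_id", how="inner") \
            .groupBy("topic_id") \
            .count() \
            .orderBy("count", ascending=False) \
            .limit(10) \
            .rdd.map(lambda row: int(row.topic_id)) \
            .collect()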
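
With an inner join, each page view is counted once for every topic attached to its document, and views of documents with no topic rows disappear. The pure-Python sketch below reproduces that behaviour on toy data so the fan-out is easy to see; `topic_view_counts` and the ids are hypothetical, not taken from the dataset.

```python
from collections import Counter


def topic_view_counts(page_views, doc_topics):
    """Count views per topic, mimicking an inner join on document_id."""
    topics_by_doc = {}
    for doc_id, topic_id in doc_topics:
        topics_by_doc.setdefault(doc_id, []).append(topic_id)

    counts = Counter()
    for doc_id in page_views:
        # Documents without topics contribute nothing (inner-join semantics).
        for topic_id in topics_by_doc.get(doc_id, []):
            counts[topic_id] += 1
    return counts


# Toy data: document 100 has two topics, 200 has one, 300 has none.
page_views = [100, 100, 200, 300]
doc_topics = [(100, 1), (100, 2), (200, 2)]
topic_counts = topic_view_counts(page_views, doc_topics)
```

Here the per-topic counts sum to more than the number of views of documents that have topics, because the two views of document 100 are counted under both of its topics.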


# Finish

In [79]:
with open("result.json", "w") as f:
    json.dump(
        {
            "top_10_documents": first,
            "users": second,
            "top_10_topics": third,
        }, f
    )

!curl -F file=@result.json "51.250.54.133:80/MDS-LSML1/m_shishonkov/w4/1"

1.0
Well done!
