# Data Wrangling with Spark SQL Quiz

This quiz uses the same dataset and most of the same questions as the earlier "Quiz - Data Wrangling with Data Frames Jupyter Notebook." For this quiz, however, use Spark SQL instead of Spark DataFrames.

In [1]:
from pyspark.sql import SparkSession

spark = (
    SparkSession
    .builder
    .appName("Wrangling Data")
    .getOrCreate()
)

In [3]:
sdf = spark.read.json("data/sparkify_log_small.json")
sdf.createOrReplaceTempView("logs")

In [6]:
# Helper: run a Spark SQL statement and return the result as a pandas DataFrame.
query = lambda x: spark.sql(x).toPandas()

# Question 1

Which page did user id "" (empty string) NOT visit?

In [8]:
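# Pages the empty-string user DID visit; the answer is any page in the logs that is missing from this list.
query("SELECT DISTINCT page FROM logs WHERE userId = ''")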

| | page |
|---|---|
| 0 | Home |
| 1 | About |
| 2 | Login |
| 3 | Help |
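
The output above lists the pages the empty-string user did visit, so the question still needs the complement. A minimal sketch of one way to get it directly in SQL is a `NOT IN` subquery. To keep the sketch runnable without a Spark session, it uses SQLite from the Python standard library as a stand-in and a handful of hypothetical rows; the SQL text is assumed (not verified here) to behave the same way when passed to `query(...)` against the `logs` view.

```python
import sqlite3

# Hypothetical toy rows (userId, page); the real data lives in the Spark `logs` view.
rows = [
    ("", "Home"), ("", "Login"), ("", "About"),
    ("10", "Home"), ("10", "NextSong"),
    ("11", "Settings"), ("11", "Home"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (userId TEXT, page TEXT)")
conn.executemany("INSERT INTO logs VALUES (?, ?)", rows)

# All distinct pages minus the pages seen with an empty userId.
NOT_VISITED_SQL = """
    SELECT DISTINCT page
    FROM logs
    WHERE page NOT IN (
        SELECT DISTINCT page FROM logs WHERE userId = ''
    )
    ORDER BY page
"""

not_visited = [page for (page,) in conn.execute(NOT_VISITED_SQL)]
print(not_visited)  # ['NextSong', 'Settings'] for the toy rows above
```

In the notebook, the equivalent call would be `query(NOT_VISITED_SQL)`.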


# Question 2 - Reflect

Why might you prefer to use SQL over data frames? Why might you prefer data frames over SQL?

# Question 3

How many female users do we have in the data set?

In [16]:
# Row counts per gender. Note that count(*) counts log events, not distinct users.
query("""
    SELECT gender, count(*)
    FROM logs
    GROUP BY gender
""")

| | gender | count(1) |
|---|---|---|
| 0 | F | 3820 |
| 1 | | 336 |
| 2 | M | 5844 |
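
The counts above are numbers of log rows per gender, while the question asks for users, and one user usually produces many rows. A sketch of the distinction, assuming `COUNT(DISTINCT userId)` is the intended measure: SQLite from the Python standard library stands in for Spark so the example runs on its own, and the rows are hypothetical.

```python
import sqlite3

# Hypothetical toy rows (userId, gender): user 1 appears three times.
rows = [
    ("1", "F"), ("1", "F"), ("1", "F"),
    ("2", "M"), ("2", "M"),
    ("3", "F"),
    ("", None),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (userId TEXT, gender TEXT)")
conn.executemany("INSERT INTO logs VALUES (?, ?)", rows)

# Counts each female user once, no matter how many log rows she produced.
FEMALE_USERS_SQL = """
    SELECT COUNT(DISTINCT userId) AS female_users
    FROM logs
    WHERE gender = 'F'
"""

female_rows = conn.execute(
    "SELECT COUNT(*) FROM logs WHERE gender = 'F'"
).fetchone()[0]
female_users = conn.execute(FEMALE_USERS_SQL).fetchone()[0]

print(female_rows, female_users)  # 4 log rows, but only 2 distinct users
```

In the notebook, `query(FEMALE_USERS_SQL)` would run the same statement against the `logs` view.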


# Question 4

How many songs were played from the most played artist?

In [17]:
# Count plays per artist (rows with a non-null artist) and keep the top one.
query("""
    SELECT artist, count(*) AS cnt
    FROM logs
    WHERE artist IS NOT NULL
    GROUP BY artist
    ORDER BY cnt DESC
    LIMIT 1
""")

| | artist | cnt |
|---|---|---|
| 0 | Coldplay | 83 |


# Question 5 (challenge)

How many songs do users listen to on average between visiting our home page? Please round your answer to the closest integer.

In [20]:
query("SELECT * FROM logs LIMIT 1")

| | artist | auth | firstName | gender | itemInSession | lastName | length | level | location | method | page | registration | sessionId | song | status | ts | userAgent | userId |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Showaddywaddy | Logged In | Kenneth | M | 112 | Matthews | 232.93342 | paid | Charlotte-Concord-Gastonia, NC-SC | PUT | NextSong | 1509380319284 | 5132 | Christmas Tears Will Fall | 200 | 1513720872284 | "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.... | 1046 |


In [47]:
query("""
    SELECT AVG(nsongs)
    FROM (
        SELECT period, count(*) AS nsongs
        FROM (
            SELECT 
                page,
                userId,
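                -- period = number of Home visits at or after this row (per user);
                -- songs between two consecutive Home visits share the same value.
                SUM(
                    IF(page = 'Home', 1, 0)
                ) OVER (
                    PARTITION BY userId
                    ORDER BY ts DESC
                    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
                ) AS period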
            FROM logs
            WHERE page IN ('Home', 'NextSong')
        )
        WHERE page = 'NextSong'
        GROUP BY period, userId
    )
""")

| | avg(nsongs) |
|---|---|
| 0 | 6.898347 |
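
Rounded to the closest integer, 6.898347 gives 7. The window-function step is the least obvious part of the query, so here is a small worked sketch of it on hypothetical events for two users. SQLite (3.25 or newer, for window functions) from the Python standard library stands in for Spark, and `CASE WHEN` replaces Spark's `IF(...)`; the table aliases `tagged` and `per_period` are names introduced only for this illustration.

```python
import sqlite3

# Hypothetical toy events: (userId, ts, page).
events = [
    ("A", 1, "Home"), ("A", 2, "NextSong"), ("A", 3, "NextSong"),
    ("A", 4, "Home"), ("A", 5, "NextSong"),
    ("B", 1, "NextSong"), ("B", 2, "Home"), ("B", 3, "NextSong"),
    ("B", 4, "NextSong"), ("B", 5, "NextSong"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (userId TEXT, ts INTEGER, page TEXT)")
conn.executemany("INSERT INTO logs VALUES (?, ?, ?)", events)

# The inner query tags every row with a running count of Home visits,
# walking backwards in time within each user. The middle query counts
# songs per (user, period). The outer query averages those counts.
AVG_SONGS_SQL = """
    SELECT AVG(nsongs) AS avg_songs, ROUND(AVG(nsongs)) AS rounded
    FROM (
        SELECT userId, period, COUNT(*) AS nsongs
        FROM (
            SELECT
                userId,
                page,
                SUM(CASE WHEN page = 'Home' THEN 1 ELSE 0 END) OVER (
                    PARTITION BY userId
                    ORDER BY ts DESC
                    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
                ) AS period
            FROM logs
            WHERE page IN ('Home', 'NextSong')
        ) AS tagged
        WHERE page = 'NextSong'
        GROUP BY userId, period
    ) AS per_period
"""

# Song counts per period: user A has 1 and 2, user B has 3 and 1, so the mean is 7 / 4.
avg_songs, rounded = conn.execute(AVG_SONGS_SQL).fetchone()
print(avg_songs, rounded)  # 1.75 2.0
```

Wrapping the average in `ROUND(...)` is one way to have the query itself return the integer answer the question asks for.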
