# Data Wrangling with Spark SQL Quiz

This quiz uses the same dataset and most of the same questions as the earlier "Quiz - Data Wrangling with Data Frames Jupyter Notebook." For this quiz, however, use Spark SQL instead of Spark Data Frames.

In [74]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import desc
from pyspark.sql.functions import asc

# TODOS: 
# 1) import any other libraries you might need
# 2) instantiate a Spark session 
# 3) read in the data set located at the path "data/sparkify_log_small.json"
# 4) create a view to use with your SQL queries
# 5) write code to answer the quiz questions 

In [75]:
spark = SparkSession\
        .builder\
        .appName("Spark Quiz 8")\
        .getOrCreate()

In [76]:
path = "data/sparkify_log_small.json"
user_log = spark.read.json(path)

In [8]:
user_log.printSchema()

root
 |-- artist: string (nullable = true)
 |-- auth: string (nullable = true)
 |-- firstName: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- itemInSession: long (nullable = true)
 |-- lastName: string (nullable = true)
 |-- length: double (nullable = true)
 |-- level: string (nullable = true)
 |-- location: string (nullable = true)
 |-- method: string (nullable = true)
 |-- page: string (nullable = true)
 |-- registration: long (nullable = true)
 |-- sessionId: long (nullable = true)
 |-- song: string (nullable = true)
 |-- status: long (nullable = true)
 |-- ts: long (nullable = true)
 |-- userAgent: string (nullable = true)
 |-- userId: string (nullable = true)



In [78]:
user_log.createOrReplaceTempView("user_log_table")

# Question 1

Which page did the user with the empty-string user id ("") NOT visit?

In [14]:
# TODO: write your code to answer question 1
spark.sql('''
         SELECT DISTINCT page
         FROM user_log_table
         WHERE userId == ""
          ''').show()

+-----+
| page|
+-----+
| Home|
|About|
|Login|
| Help|
+-----+



In [15]:
spark.sql('''
         SELECT DISTINCT page
         FROM user_log_table
          ''').show()

+----------------+
|            page|
+----------------+
|Submit Downgrade|
|            Home|
|       Downgrade|
|          Logout|
|   Save Settings|
|           About|
|        Settings|
|           Login|
|        NextSong|
|            Help|
|         Upgrade|
|           Error|
|  Submit Upgrade|
+----------------+
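
The two outputs above give the pages seen for the empty user id and the pages seen for anyone; the answer to Question 1 is the difference between them. A minimal sketch of that last step in plain Python, with the page names copied by hand from the two outputs (the variable names are illustrative, not part of the quiz):

```python
# Page values copied from the two DISTINCT page outputs above.
visited_by_empty_user = {"Home", "About", "Login", "Help"}
all_pages = {
    "Submit Downgrade", "Home", "Downgrade", "Logout", "Save Settings",
    "About", "Settings", "Login", "NextSong", "Help", "Upgrade",
    "Error", "Submit Upgrade",
}

# Pages that appear somewhere in the log but never for userId "".
not_visited = sorted(all_pages - visited_by_empty_user)
print(not_visited)
```

In Spark SQL the same difference can likely be expressed in a single statement by placing `EXCEPT` between the all-pages query and the empty-user query, which avoids copying values by hand.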



# Question 2 - Reflect

Why might you prefer to use SQL over data frames? Why might you prefer data frames over SQL?

# Question 3

How many female users do we have in the data set?

In [19]:
# TODO: write your code to answer question 3
spark.sql('''
            SELECT count(distinct userId) as count
            from user_log_table
            where gender == 'F'
            ''').show()

+-----+
|count|
+-----+
|  462|
+-----+



# Question 4

How many songs were played from the most played artist?

In [37]:
# TODO: write your code to answer question 4
spark.sql('''
          select artist, count(artist) as count from user_log_table
          group by artist
          order by count desc limit 10
          ''').show()

+--------------------+-----+
|              artist|count|
+--------------------+-----+
|            Coldplay|   83|
|       Kings Of Leon|   69|
|Florence + The Ma...|   52|
|               Björk|   46|
|       Dwight Yoakam|   45|
|       Justin Bieber|   43|
|      The Black Keys|   40|
|         OneRepublic|   37|
|        Jack Johnson|   36|
|                Muse|   36|
+--------------------+-----+



# Question 5 (challenge)

How many songs do users listen to on average between visiting our home page? Please round your answer to the closest integer.

In [82]:
# TODO: write your code to answer question 5

ishomecase = spark.sql('''
            select userId, page, ts,
            case
                when page == 'Home' then 1
                else 0
            end as homevisit
            from user_log_table
            where page in ('Home', 'NextSong')
            ''')

ishomecase.createOrReplaceTempView("ishomeview")

In [86]:
cum_sum = spark.sql('''
                         select *, sum(homevisit)
                         over(partition by userId order by ts desc) as period
                         from ishomeview 
                         ''')

# Register the view on the DataFrame itself; show() returns None,
# so it must not be chained onto the assignment above.
cum_sum.createOrReplaceTempView("period_table")

cum_sum.show()

+------+--------+-------------+---------+------+
|userId|    page|           ts|homevisit|period|
+------+--------+-------------+---------+------+
|  1436|NextSong|1513783259284|        0|     0|
|  1436|NextSong|1513782858284|        0|     0|
|  2088|    Home|1513805972284|        1|     1|
|  2088|NextSong|1513805859284|        0|     1|
|  2088|NextSong|1513805494284|        0|     1|
|  2088|NextSong|1513805065284|        0|     1|
|  2088|NextSong|1513804786284|        0|     1|
|  2088|NextSong|1513804555284|        0|     1|
|  2088|NextSong|1513804196284|        0|     1|
|  2088|NextSong|1513803967284|        0|     1|
|  2088|NextSong|1513803820284|        0|     1|
|  2088|NextSong|1513803651284|        0|     1|
|  2088|NextSong|1513803413284|        0|     1|
|  2088|NextSong|1513803254284|        0|     1|
|  2088|NextSong|1513803057284|        0|     1|
|  2088|NextSong|1513802824284|        0|     1|
|  2162|NextSong|1513781246284|        0|     0|
|  2162|NextSong|151

In [100]:
spark.sql('''
    select avg(count_results) from (
          SELECT COUNT(*) AS count_results FROM period_table
          WHERE page = 'NextSong'
          GROUP BY userId, period
          )
          ''').show()

+------------------+
|avg(count_results)|
+------------------+
| 6.898347107438017|
+------------------+
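
The Question 5 queries follow three steps: flag each Home visit, take a running sum of the flags per user from newest to oldest event, then count the NextSong rows in each (userId, period) group and average the counts. The sketch below traces the same steps in plain Python on a handful of hypothetical events (the user ids and timestamps are made up for illustration and assume no two events of a user share a timestamp); it is meant to show the logic, not to reproduce the Spark result.

```python
from collections import defaultdict

# Hypothetical (userId, page, ts) events; NOT rows from the data set.
events = [
    ("A", "NextSong", 1), ("A", "NextSong", 2), ("A", "Home", 3),
    ("A", "NextSong", 4), ("A", "Home", 5),
    ("B", "NextSong", 1), ("B", "NextSong", 2), ("B", "NextSong", 3),
]

songs_per_period = defaultdict(int)
for user in sorted({user_id for user_id, _, _ in events}):
    user_events = [e for e in events if e[0] == user]
    period = 0
    # Newest first, mirroring ORDER BY ts DESC inside the window.
    for _, page, _ in sorted(user_events, key=lambda e: e[2], reverse=True):
        if page == "Home":
            period += 1  # running sum of the homevisit flag
        else:
            songs_per_period[(user, period)] += 1  # COUNT(*) per userId, period

counts = list(songs_per_period.values())
average = sum(counts) / len(counts)
print(dict(songs_per_period))
print(average, round(average))
```

Applied to the quiz itself, the reported average of 6.898347107438017 rounds to 7, which is the integer the question asks for.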

