# First steps with DataFrames

## Learning objectives

- Learn basic transformations and actions on PySpark DataFrames
- Learn to define a temporary view and execute SQL statements using the SparkSession

In [0]:
### BEGIN STRIP ###
import pyspark

spark = (pyspark.sql.SparkSession.builder
         .master('local')
         .appName('Introduction to PySpark')
         .config("spark.some.config.option", "some-value")
         .getOrCreate())

sc = spark.sparkContext
### END STRIP ###

In [0]:
S3_RESOURCE = 's3'
SCHEME = 's3a'
# TODO: Ask your teacher for BUCKET_NAME and PREFIX
BUCKET_NAME = ''
PREFIX = ''
### BEGIN STRIP ###
BUCKET_NAME = 'nibble-datasets'
PREFIX = ''
### END STRIP ###
INPUT_FILENAME = 'youtube-playlog.csv'

In [0]:
# This is just a utility function
def get_s3_path(key, bucket_name=BUCKET_NAME, scheme=SCHEME):
    return f"{scheme}://{bucket_name}/{key}"

In [0]:
# That's the path for our file, hosted on S3
# we will learn about S3 tomorrow
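# Strip the leading slash so an empty PREFIX does not produce a double slash in the path
filepath = get_s3_path(f'{PREFIX}/{INPUT_FILENAME}'.lstrip('/'))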
filepath

`filepath` is the location of our file, which is currently hosted on S3.  
You don't have to worry about this for now; just use it like a regular file path.

We will start by loading our file into a PySpark `DataFrame`.  
Check the documentation if needed, in particular the many options available when reading a file into a `DataFrame`.

In [0]:
# TODO: Load the file hosted at `filepath` onto a PySpark DataFrame: user_logs
### BEGIN STRIP ###
user_logs = spark.read.format("csv").option("header", "true").load(filepath)
### END STRIP ###

It's easier to see PySpark DataFrames as an abstraction over SQL than to think of them as equivalent to `pandas`. If you're familiar with data manipulation in `pandas`, it will be tempting to fall back on `pandas` thinking; this is the worst thing you can do.
The goal of this notebook is to help you counter that intuition.

This is why, for every task in this notebook, we will first implement it using declarative SQL (using `spark.sql(...)`); you will then try to get the same result using the imperative programming style of PySpark DataFrames.
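As a warm-up, here is a minimal sketch of one small query written in both styles. The helper names `first_songs_sql` and `first_songs_df` are hypothetical, and the SQL version assumes a temporary view has already been registered for the DataFrame (here called `youtube`, the name queried later in this notebook). It only relies on `spark.sql(...)`, `DataFrame.select(...)` and `DataFrame.limit(...)`.

```python
# Hypothetical helpers for illustration; `song` is a column of the playlog
# and `youtube` is assumed to be the name of a registered temporary view.

def first_songs_sql(spark, view_name="youtube", n=5):
    """Declarative style: state the result in SQL and let Spark plan it."""
    return spark.sql(f"SELECT song FROM {view_name} LIMIT {int(n)}")


def first_songs_df(df, n=5):
    """DataFrame style: chain transformations that build the same query."""
    return df.select("song").limit(int(n))

# Both functions return a lazy DataFrame; nothing runs until an action
# such as .show() is called on the result.
```

In the notebook, `first_songs_sql(spark).show()` and `first_songs_df(user_logs).show()` should print the same kind of result, which is the pattern every task below follows.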

---

Before we get started, we will run a few methods that have no equivalent in SQL: `.show()`, `.printSchema()` and `.describe()`.  
Remember that actions **actually perform computations**: `.show()` is one, since Spark has to run the query to fetch the rows, whereas `.printSchema()` only prints the column names and types.  
Unlike most actions, `.show()` and `.printSchema()` don't return a result; they just print to the screen.
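The difference between building a query and running it can be sketched as follows. `preview_column` is a hypothetical helper, not part of PySpark; it assumes `df` is any PySpark DataFrame and `column` is one of its column names.

```python
def preview_column(df, column, n=10):
    """Select one column, print its first n rows, and return the selection."""
    # Transformation: returns a new DataFrame right away and only extends
    # the query plan; no data is processed at this point.
    selected = df.select(column)
    # Action: Spark now executes the plan and prints up to n rows.
    selected.show(n)
    # show() returned None, so hand back the DataFrame itself.
    return selected
```

A common beginner mistake is writing `df = df.show()`, which replaces the DataFrame with `None`; returning the DataFrame separately, as above, avoids that.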

In [0]:
# TODO: show the first 10 rows of `user_logs`
### BEGIN STRIP ###
user_logs.show(10)
### END STRIP ###

In [0]:
# TODO: print out the schema of `user_logs`
### BEGIN STRIP ###
user_logs.printSchema()
### END STRIP ###

Another method, `.describe()`, does return a value: descriptive statistics about the DataFrame, themselves in the form of a Spark DataFrame.

In [0]:
# TODO: use `.describe()` on `user_logs`
#       and make sure you can actually see the results
### BEGIN STRIP ###
# .describe() returns a DataFrame, so an action is needed to display it
user_logs.describe().show()
### END STRIP ###

Before we can query the DataFrame with SQL, we need to register it as a temporary view (`TempView`).

In [0]:
# TODO: Create a TempView of `user_logs` named `youtube`
### BEGIN STRIP ###
user_logs.createOrReplaceTempView('youtube')
### END STRIP ###

## Task 1: count the number of records

`.count()` is an action, not a transformation: it triggers the computation and returns a Python integer. Using `COUNT` in a SQL statement, on the other hand, still returns a DataFrame, so you have to force the computation with an action.
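One way to force the computation of a single-value SQL result and get a plain Python value back is sketched below. `sql_scalar` is a hypothetical helper name, not a PySpark function; it relies only on `spark.sql(...)` and on `DataFrame.first()`, an action that returns the first `Row` or `None` when the result is empty.

```python
def sql_scalar(spark, query):
    """Run a SQL query and return the first column of its first row."""
    # spark.sql() only builds a lazy DataFrame; first() is the action
    # that actually executes the query.
    row = spark.sql(query).first()
    if row is None:
        raise ValueError("query returned no rows")
    # A Row can be indexed by position, like a tuple.
    return row[0]
```

Calling `.show()` on the SQL result is enough when you only want to look at the number; a helper like this is useful when you need the value in Python code.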

In [0]:
# TODO: count the number of records using SQL
### BEGIN STRIP ###
spark.sql("SELECT COUNT(*) AS total_records FROM youtube").show()
### END STRIP ###

In [0]:
# TODO: count the number of records using PySpark DataFrames transformations and actions
### BEGIN STRIP ###
user_logs.count()
### END STRIP ###

## Task 2: select the column `user`

In [0]:
# TODO: Select the column 'user' using SQL
### BEGIN STRIP ###
spark.sql("SELECT user FROM youtube").show(20)
### END STRIP ###

In [0]:
# TODO: Select the column 'user' using PySpark DataFrame API
### BEGIN STRIP ###
user_logs.select("user").show(20)
### END STRIP ###

## Task 3: select all distinct users

In [0]:
# TODO: select distinct user using SQL
### BEGIN STRIP ###
spark.sql("SELECT DISTINCT user FROM youtube").show(20)
### END STRIP ###

In [0]:
# TODO: select distinct user using PySpark DataFrame API
### BEGIN STRIP ###
user_logs.select("user").distinct().show(20)
### END STRIP ###

## Task 4: Select all distinct users and alias the column name to `distinct_user`

In [0]:
# TODO: select distinct user using SQL
#       and alias the name of the new column to `distinct_user`
### BEGIN STRIP ###
spark.sql("SELECT DISTINCT user AS distinct_user FROM youtube").show(20)
### END STRIP ###

In [0]:
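# TODO: select distinct user using PySpark DataFrame API
#       and alias the name of the new column to `distinct_user`
### BEGIN STRIP ###
# Alias the column itself: DataFrame.alias() would not rename the column
user_logs.select(user_logs["user"].alias("distinct_user")).distinct().show(20)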
### END STRIP ###

## Task 5: count the number of distinct users

In [0]:
# TODO: Count the number of distinct user using SQL
#       Alias the resulting column to `total_distinct_user`
### BEGIN STRIP ###
spark.sql("SELECT COUNT(DISTINCT user) AS total_distinct_user FROM youtube").show()
### END STRIP ###

In [0]:
# TODO: Count the number of distinct user using PySpark DataFrame API
### BEGIN STRIP ###
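# .count() returns a Python int, so there is no column left to alias.
# Unlike COUNT(DISTINCT user), this counts a null user as one distinct value.
user_logs.select("user").distinct().count()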
### END STRIP ###

## Task 6: count the number of distinct songs

In [0]:
# TODO: Count the number of distinct songs using SQL
#       Alias the resulting column to `total_distinct_song`
### BEGIN STRIP ###
spark.sql("SELECT COUNT(DISTINCT song) AS total_distinct_song FROM youtube").show()
### END STRIP ###

In [0]:
# TODO: Count the number of distinct songs using PySpark DataFrame API
### BEGIN STRIP ###
# .count() returns a Python int, so there is no column left to alias
user_logs.select("song").distinct().count()
### END STRIP ###