# First steps with DataFrames

## Learning objectives

- Learn basic transformations and actions on PySpark DataFrames
- Learn to define a temporary view and execute SQL statements using the SparkSession

In [None]:
spark

In [None]:
# Load the CSV file hosted at `filepath` into a PySpark DataFrame: user_logs
filepath = "s3://full-stack-bigdata-datasets/Big_Data/youtube_playlog.csv"

user_logs = (
    spark.read.format("csv")
    .option("header", "true")       # the first line holds the column names
    .option("inferSchema", "true")  # let Spark infer the column types
    .load(filepath)
)

It is easier to think of PySpark DataFrames as SQL tables than as equivalents of `pandas` DataFrames. If you are familiar with data manipulation in `pandas`, it will be tempting to fall back on `pandas` habits; this is the worst thing you can do.

The goal of this notebook is to help you counter your intuition on this.

This is why, for every task in this notebook, we will first implement it using declarative SQL (with `spark.sql(...)`); you will then try to get the same result using the imperative programming style of PySpark DataFrames.

Before we get started, we will run a few actions that have no equivalent in SQL: `.show()`, `.printSchema()` and `.describe()`.

Remember, these are actions, which means they will **actually perform computations**.

Unlike most actions, `.show()` and `.printSchema()` do not return a result; they just print to the screen.

1. Show the first 10 rows of `user_logs`:

2. Print out the schema of `user_logs`

Another action, `.describe()`, does return a value: descriptive statistics about the DataFrame, themselves in the form of a Spark DataFrame.

3. Use `.describe()` on `user_logs` and store the result in `user_describe`:

4. Show the results with `.toPandas()`:

|   | summary | timestamp | user | song |
|---|---------|-----------|------|------|
| 0 | count | 25739537.0 | 25739537.0 | 25739537 |
| 1 | mean | 1442700656.1045842 | 12697.352275450798 | 2.532571778181818E8 |
| 2 | stddev | 34432848.72371195 | 13094.065905828476 | 8.334645614940468E8 |
| 3 | min | -139955897.0 | 0.0 | ---AtpxbkaE |
| 4 | max | 1554321113.0 | 45903.0 | zzzcFgRMY6c |


5. Show the results with `display()`:

| summary | timestamp | user | song |
|---------|-----------|------|------|
| count | 25739537.0 | 25739537.0 | 25739537 |
| mean | 1442700656.1045842 | 12697.352275450798 | 2.532571778181818E8 |
| stddev | 34432848.72371195 | 13094.065905828476 | 8.334645614940468E8 |
| min | -139955897.0 | 0.0 | ---AtpxbkaE |
| max | 1554321113.0 | 45903.0 | zzzcFgRMY6c |


6. Show the results using `.show()`:

7. Before we can query using SQL, we need a temporary view (`TempView`). Create a temporary view of `user_logs` named `user_logs_table`.

## Task 1: count the number of records

`.count()` is an action, not a transformation: it triggers the computation and returns a number directly. Using `COUNT` in a SQL statement, by contrast, still returns a DataFrame, and you have to force the computation with an action.

1. Count the number of records using SQL

| count(1) |
|----------|
| 25739537 |


2. Count the number of records using PySpark DataFrames transformations and actions

## Task 2: select the column `user`

1. Select the column 'user' using SQL

2. Select the column 'user' using the PySpark DataFrame API

## Task 3: select all distinct users

1. Select distinct users using SQL

2. Select distinct users using the PySpark DataFrame API

## Task 4: Select all distinct users and alias the column name to `distinct_user`

1. Select distinct users using SQL and alias the name of the new column to `distinct_user`

| distinct_user |
|---------------|
| 148 |
| 463 |
| 471 |
| 496 |
| 833 |
| 243 |
| 392 |
| 540 |
| 623 |
| 737 |


2. Select distinct users using the PySpark DataFrame API and alias the name of the new column to `distinct_user`

## Task 5: count the number of distinct users

1. Count the number of distinct users using SQL. Alias the resulting column to `total_distinct_user`

2. Count the number of distinct users using the PySpark DataFrame API

## Task 6: count the number of distinct songs

1. Count the number of distinct songs using SQL. Alias the resulting column to `total_distinct_song`

2. Count the number of distinct songs using the PySpark DataFrame API