# Analyzing tabular data with pyspark.sql

> PySpark operates either on the whole __data frame__ objects (via methods such as `select()` and `groupby()`) or on __Column__ objects (for instance when using a function like `split()`). 
>
> - The data frame is __column-major__, so its API focuses on manipulating the columns to transform the data. 
> - Hence with data transformations, think about what operations to do and which columns will be impacted.

- RDDs on the other hand are _row-major_.  Hence you're thinking about items with attributes in which you apply functions.

In [1]:
# setup
import os
import numpy as np

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

## Data Source Info

> For this exercise, we’ll use some open data from the Government of Canada, more specifically the CRTC (Canadian Radio-television and Telecommunications Commission). Every broadcaster is mandated to provide a complete log of the programs, commercials and all, showcased to the Canadian public. 
>
> This gives us a lot of potential questions to answer, but we’ll select one specific one: __what are the channels with the most and least proportion of commercials?__

## Creating a data frame

`spark.createDataFrame`
- 1st param: data (list of lists, pandas dataframe, RDD)
- 2nd param: schema (ie. think column headers in SQL)
- Master node knows the structure of the dataframe, but actual data is on worker nodes (ie. cluster memory)

In [4]:
# Example creating a data frame with toy data
my_grocery_list = [
    ["Banana", 2, 1.74],
    ["Apple", 4, 2.04],
    ["Carrot", 1, 1.09],
    ["Cake", 1, 10.99],
]

df_grocery_list = spark.createDataFrame(my_grocery_list, ["Item", "Quantity", "Price"])

df_grocery_list.printSchema()

root
 |-- Item: string (nullable = true)
 |-- Quantity: long (nullable = true)
 |-- Price: double (nullable = true)



## Reading a data frame

### Data frame structure

Composed of _row delimiter_ (e.g. newline `\n`) and _column delimiter_ (e.g. tabs `\t` for TSVs)

In [48]:
DIRECTORY = "./data/Ch04"
logs = spark.read.csv(
    os.path.join(DIRECTORY, "BroadcastLogs_2018_Q3_M8_sample.CSV"),
    sep="|",  # default is ","
    quote="\"", # default is double quote.
    header=True, # set first row as column names
    inferSchema=True, # infer schema from column names default False
)

In [49]:
logs.printSchema()

root
 |-- BroadcastLogID: integer (nullable = true)
 |-- LogServiceID: integer (nullable = true)
 |-- LogDate: string (nullable = true)
 |-- SequenceNO: integer (nullable = true)
 |-- AudienceTargetAgeID: integer (nullable = true)
 |-- AudienceTargetEthnicID: integer (nullable = true)
 |-- CategoryID: integer (nullable = true)
 |-- ClosedCaptionID: integer (nullable = true)
 |-- CountryOfOriginID: integer (nullable = true)
 |-- DubDramaCreditID: integer (nullable = true)
 |-- EthnicProgramID: integer (nullable = true)
 |-- ProductionSourceID: integer (nullable = true)
 |-- ProgramClassID: integer (nullable = true)
 |-- FilmClassificationID: integer (nullable = true)
 |-- ExhibitionID: integer (nullable = true)
 |-- Duration: string (nullable = true)
 |-- EndTime: string (nullable = true)
 |-- LogEntryDate: string (nullable = true)
 |-- ProductionNO: string (nullable = true)
 |-- ProgramTitle: string (nullable = true)
 |-- StartTime: string (nullable = true)
 |-- Subtitle: string (nulla

--- 

##  Exercises

### 4.1
Take the following file, called sample.csv, and read it into a dataframe.

```
Item,Quantity,Price
$Banana, organic$,1,0.99
Pear,7,1.24
$Cake, chocolate$,1,14.50
```


In [17]:
sample =  spark.read.csv(
    os.path.join(DIRECTORY, "ch4_exercise.csv"),
    sep=",",
    header=True,
    quote="$",
    inferSchema=True
)

sample.show()

+---------------+--------+-----+
|           Item|Quantity|Price|
+---------------+--------+-----+
|Banana, organic|       1| 0.99|
|           Pear|       7| 1.24|
|Cake, chocolate|       1| 14.5|
+---------------+--------+-----+



### 4.2

Re-read the data in a `logs_raw` data frame, taking inspiration from the code in listing 4.3, this time without passing any optional parameters. Print the first 5 rows of data, as well as the schema. What are the differences in terms of data and schema between logs and logs_raw?

In [46]:
DIRECTORY = "./data/Ch04"
raw_logs = spark.read.csv(
    os.path.join(DIRECTORY, "BroadcastLogs_2018_Q3_M8_sample.CSV"),
)
raw_logs.show(5, False) # False = show entire contents
raw_logs.printSchema()

# Result shows entire row concatenated into one column (_c0). Not what we want.

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|_c0                                                                                                                                                                                                                                                                                                                                                                                                                                        |
+-------------------------------------------------------------------------------------------------------------------------------------------

---

## Exploring the shape of our data universe

### About Star Schema


Wiki:
> In computing, the __star schema__ is the simplest style of data mart schema and is the approach most widely used to develop data warehouses and dimensional data marts. The star schema consists of one or more fact tables referencing any number of dimension tables.

Star schemas are common in the relational database world because of __normalization__, a process used to avoid duplicating pieces of data and improve data integrity.

Spark uses __denormalized__ tables (ie __fat__ tables). Why? Mainly because it is easier to run analyses on a single table.  
  - If you do need to analyze complex star schema, best bet is to work with a database manger to get a denormalized table.
  
### `select`-ing what we want to see

Four ways to `select` colums in PySpark, all equivalent in term of results

In [50]:
# Using the string to column conversion
logs.select("BroadCastLogID", "LogServiceID", "LogDate")
logs.select(*["BroadCastLogID", "LogServiceID", "LogDate"]) # Unpack list with star prefix

# Passing the column object explicitly
logs.select(F.col("BroadCastLogID"), F.col("LogServiceID"), F.col("LogDate"))
logs.select(*[F.col("BroadCastLogID"), F.col("LogServiceID"), F.col("LogDate")]) # Unpack list with star prefix

DataFrame[BroadCastLogID: int, LogServiceID: int, LogDate: string]

Because of the width of our data frame, we could split our columns into manageable sets of three to keep the output tidy on the screen. This gives a high-level view of what the data frame contains. 

In [51]:
# Splitting columns in groups of three using numpy
display("Columns in groups of three")
column_split = np.array_split(np.array(logs.columns), len(logs.columns) // 3)
display(column_split)

# Show columns in groups of three
display("Table display in column groups of three")
for x in column_split:
    logs.select(*x).show(5, False)

'Columns in groups of three'

[array(['BroadcastLogID', 'LogServiceID', 'LogDate'], dtype='<U22'),
 array(['SequenceNO', 'AudienceTargetAgeID', 'AudienceTargetEthnicID'],
       dtype='<U22'),
 array(['CategoryID', 'ClosedCaptionID', 'CountryOfOriginID'], dtype='<U22'),
 array(['DubDramaCreditID', 'EthnicProgramID', 'ProductionSourceID'],
       dtype='<U22'),
 array(['ProgramClassID', 'FilmClassificationID', 'ExhibitionID'],
       dtype='<U22'),
 array(['Duration', 'EndTime', 'LogEntryDate'], dtype='<U22'),
 array(['ProductionNO', 'ProgramTitle', 'StartTime'], dtype='<U22'),
 array(['Subtitle', 'NetworkAffiliationID', 'SpecialAttentionID'],
       dtype='<U22'),
 array(['BroadcastOriginPointID', 'CompositionID', 'Producer1'],
       dtype='<U22'),
 array(['Producer2', 'Language1', 'Language2'], dtype='<U22')]

'Table display in column groups of three'

+--------------+------------+----------+
|BroadcastLogID|LogServiceID|LogDate   |
+--------------+------------+----------+
|1196192316    |3157        |2018-08-01|
|1196192317    |3157        |2018-08-01|
|1196192318    |3157        |2018-08-01|
|1196192319    |3157        |2018-08-01|
|1196192320    |3157        |2018-08-01|
+--------------+------------+----------+
only showing top 5 rows

+----------+-------------------+----------------------+
|SequenceNO|AudienceTargetAgeID|AudienceTargetEthnicID|
+----------+-------------------+----------------------+
|1         |4                  |null                  |
|2         |null               |null                  |
|3         |null               |null                  |
|4         |null               |null                  |
|5         |null               |null                  |
+----------+-------------------+----------------------+
only showing top 5 rows

+----------+---------------+-----------------+
|CategoryID|ClosedCaptionID|Co

### `drop`-ing columns we don't need

Remove `BroadCastLogID` (primary key not needed in single table) and `SequenceNo`.  `drop()` returns a new data frame.

#### Warning with `drop`
Unlike `select()`, where selecting a column that doesn’t exist will return a runtime error, dropping a non-existent column is a no-op. PySpark will __just ignore the columns it doesn’t find__. Careful with the spelling of your column names!

In [53]:
logs = logs.drop("BroadCastLogID", "SequenceNo")

assert all(col not in logs.columns for col in ["BroadCastLogID", "SequenceNo"])

Alternate method of above just using `select` using list comprehension.

In [52]:
logs = logs.select(
    *[col for col in logs.columns if col not in ["BroadCastLogID", "SequenceNo"]]
)

assert all(col not in logs.columns for col in ["BroadCastLogID", "SequenceNo"])

---

## Exercises

### 4.3

Create a new data frame logs_clean that contains only the columns that do not end with ID

In [74]:
print([col for col in logs.columns if col[-2:] != "ID"])

['LogDate', 'SequenceNO', 'Duration', 'EndTime', 'LogEntryDate', 'ProductionNO', 'ProgramTitle', 'StartTime', 'Subtitle', 'Producer1', 'Producer2', 'Language1', 'Language2']


In [98]:
# Load original CSV again
DIRECTORY = "./data/Ch04"
logs = spark.read.csv(
    os.path.join(DIRECTORY, "BroadcastLogs_2018_Q3_M8_sample.CSV"),
    sep="|",  # default is ","
    quote="\"", # default is double quote.
    header=True, # set first row as column names
    inferSchema=True, # infer schema from column names default False
)

# Filter to columns that don't end with "ID"
logs_no_id = logs.select(
    *[col for col in logs.columns if col[-2:].lower() != "id"]
)
print("Filtered results (not end with 'ID')")
logs_no_id.printSchema()

assert all("id" not in col[-2:] for col in logs_no_id.columns)

Filtered results (not end with 'ID')
root
 |-- LogDate: string (nullable = true)
 |-- SequenceNO: integer (nullable = true)
 |-- Duration: string (nullable = true)
 |-- EndTime: string (nullable = true)
 |-- LogEntryDate: string (nullable = true)
 |-- ProductionNO: string (nullable = true)
 |-- ProgramTitle: string (nullable = true)
 |-- StartTime: string (nullable = true)
 |-- Subtitle: string (nullable = true)
 |-- Producer1: string (nullable = true)
 |-- Producer2: string (nullable = true)
 |-- Language1: integer (nullable = true)
 |-- Language2: integer (nullable = true)



## Creating new columns with `withColumn`

### 1. Check the data type of 'Duration' column

In [103]:
logs.select(F.col("Duration")).show(5)

print("dtype of 'Duration' column is 'string'. Best to convert to timestamp:\n", logs.select(F.col("Duration")).dtypes)

+----------------+
|        Duration|
+----------------+
|02:00:00.0000000|
|00:00:30.0000000|
|00:00:15.0000000|
|00:00:15.0000000|
|00:00:15.0000000|
+----------------+
only showing top 5 rows

dtype of 'Duration' column is 'string'. Best to convert to timestamp:
 [('Duration', 'string')]


### 2. Extract time features from Duration column only show distinct

In [116]:
logs.select(
    F.col("Duration"),
    F.col("Duration").substr(1,2).cast('int').alias('hours'),
    F.col("Duration").substr(4,2).cast('int').alias('minutes'),
    F.col("Duration").substr(7,2).cast('int').alias('seconds'),
    # Add final column converting duration into total seconds
    (
        F.col("Duration").substr(1,2).cast('int') * 60 * 60
        + F.col("Duration").substr(4,2).cast('int') * 60
        + F.col("Duration").substr(7,2).cast('int')
    ).alias('duration_seconds')
).distinct().show(5) # only show distinct entries

+----------------+-----+-------+-------+----------------+
|        Duration|hours|minutes|seconds|duration_seconds|
+----------------+-----+-------+-------+----------------+
|00:00:19.0000000|    0|      0|     19|              19|
|00:07:09.0000000|    0|      7|      9|             429|
|00:53:26.0000000|    0|     53|     26|            3206|
|00:30:43.0000000|    0|     30|     43|            1843|
|00:02:41.0000000|    0|      2|     41|             161|
+----------------+-----+-------+-------+----------------+
only showing top 5 rows



### 3. Use `withColumn()` to add 'duration_seconds' to original data frame

In [122]:
logs = logs.withColumn(
    "duration_seconds",
    F.col("Duration").substr(1,2).cast('int') * 60 * 60
    + F.col("Duration").substr(4,2).cast('int') * 60
    + F.col("Duration").substr(7,2).cast('int')
)

assert "duration_seconds" in logs.columns