### Transforming DataFrames in Spark

In this notebook we demonstrate some basic transformations on DataFrames with Apache Spark. This notebook is based on material supplied by Cloudera under their Cloudera Academic Partner program and the *Spark: The Definitive Guide* book by Bill Chambers and Matei Zaharia. 

Topics
- Working with columns
  - Selecting columns
  - Adding columns
  - Dropping columns
  - Changing the column name
  - Changing the column type
- Working with rows
  - Ordering rows
  - Keeping a fixed number of rows
  - Keeping distinct rows
  - Filtering rows
  - Sampling rows
- Working with missing values
- Working with missing values

**Note:** There is often more than one way to do these transformations. In particular, there is almost always a way of expressing the transformations as SQL statements. One example is given below.

In [0]:
# Load the rider data from S3:
riders = spark.read.csv("/mnt/cis442f-data/duocar/raw/riders", header=True, inferSchema=True)
# riders.show(5) 

### **Working with Columns**

#### Selecting Columns
- We use the `select` method to select specific columns
- There are several ways of specifying columns

In [0]:
# Use the `select` method to select specific columns
# Column names can be passed as quoted strings, as in this example
# Single or double quotes both work. I recommend double quotes, which make
# it easier to move code back and forth from Scala (which only recognizes double quotes for strings)
riders.select("birth_date", "student", "sex").show(5)

In [0]:
# Use `*` to select all columns - note that the asterisk is in quotes "*"
riders.select("*").show(2)

In [0]:
# We can specify a column using dot notation
riders.select(riders.first_name).show(3)

In [0]:
# We can also specify a column using syntax similar to Pandas
# This form also works when the column name contains a space
riders.select(riders["first_name"]).show(3) 

In [0]:
# riders.first_name here is a column object
type(riders.first_name)

In [0]:
# We can also specify the column using the `column` or `col` functions, which
# return a column object based on the given column name.
# This is useful when we need an explicit column object in a place where the
# column name would otherwise be treated as a simple string
from pyspark.sql.functions import column, col
riders.select(col("first_name")).show(1)

# riders.select(column("first_name")).show(1) # `column` does not work in version 3.1.1

#### Adding Columns

Use the `withColumn` method to add a new column. In this example we have
- Chained the `select` and `withColumn` methods
- Used dot notation to access a column: `riders.student`
- Introduced a Boolean expression `riders.student == 1`

Don't forget that DataFrames are immutable in Spark, so `withColumn` returns a new DataFrame with the added column and leaves the original unchanged.

In [0]:
# The first parameter is the name of the new column
# The second parameter is an expression for the new column
# Note: placing each method on its own line makes the code easier to read
riders \
  .select("student") \
  .withColumn("student_boolean", riders.student == 1) \
  .show(15)

#### Expressing a transformation using SQL

You can use SQL to express transformations instead of the DataFrame methods we have been using so far:
- First, create a temporary view using `createOrReplaceTempView`
- Then use `spark.sql` to define your query in SQL

The result of the query is a Spark DataFrame, so it is possible to switch between the SQL and DataFrame ways of expressing the transformations you want.

Some transformations can also be done using the `selectExpr()` method. It is a variant of the `select()` method that returns a DataFrame based upon SQL expressions.

In [0]:
# One way of creating a DataFrame using an SQL approach
# relies on creating a temporary view as follows
riders.createOrReplaceTempView("riders_tmp")
spark.sql("select student, student = 1 as student_boolean from riders_tmp").show(15)

In [0]:
# This can also be done using the `selectExpr()` method
# It is a variant of the `select()` method that returns
# a DataFrame based upon SQL expressions
riders.selectExpr("student", "student = 1 as student_boolean").show(15)

#### Dropping columns 

We have already seen that `select` drops columns implicitly, because the result includes only the columns you list. You can also use the `drop` method to remove specific columns by name.

In [0]:
# Use the `drop` method to drop specific columns
riders.drop("first_name", "last_name", "ethnicity").show(2)

#### Changing the column name

In [0]:
# Use the `withColumnRenamed` method to rename a column
riders.withColumnRenamed("start_date", "join_date").printSchema()

In [0]:
# Chain multiple methods to rename more than one column
riders \
  .withColumnRenamed("start_date", "join_date") \
  .withColumnRenamed("sex", "gender") \
  .printSchema()

#### Changing the column type

The `Column` class has a `cast` method that can be used to change a column's data type.

**Note:** If we need to change the name and/or type of many columns, then we
may want to consider specifying the schema on read.

In [0]:
# Recall that `home_block` was read in as a (long) integer:
riders.printSchema()

In [0]:
# Use the `withColumn` (DataFrame) method in conjunction with the `cast`
# method associated with the Column class to change its type
riders.withColumn("home_block", riders.home_block.cast("string")).printSchema() 

### Working with rows
#### Ordering Rows
Use the `sort` or `orderBy` method (they are aliases of one another) to sort a DataFrame by particular columns. The default sort order is **ascending**.

In [0]:
# Use the `sort` or `orderBy` method to sort a DataFrame by particular columns
# This shows one way of specifying the sorting order
riders \
  .select("birth_date", "student") \
  .sort("birth_date", ascending=True) \
  .show(10) 

In [0]:
# This time sort in descending order
riders \
  .select("birth_date", "student") \
  .orderBy("birth_date", ascending=False) \
  .show(10)

In [0]:
# Use the `asc` or `desc` column method instead of the `ascending` argument
riders \
  .select("birth_date", "student") \
  .orderBy(riders.birth_date.desc()) \
  .show(10)

In [0]:
# You can also use the `asc` and `desc` functions
from pyspark.sql.functions import asc, desc

riders \
  .orderBy(desc("birth_date"))\
  .select("birth_date", "student")\
  .show()

#### Selecting a fixed number of rows

In [0]:
# Use the `limit` method to select a fixed number of rows
riders.select("student", "sex").limit(5).show()

#### Selecting distinct rows 
Use the `distinct` or `dropDuplicates` method to select distinct rows (the two methods are equivalent when `dropDuplicates` is called without arguments).

In [0]:
# Use the `distinct` method to select distinct rows
riders.select("student", "sex").distinct().show()

In [0]:
# You can also use the `dropDuplicates` method
riders.select("student", "sex").dropDuplicates().show()

#### Filtering rows

Use the `filter` or `where` method along with a Boolean expression to select particular rows. The two methods are equivalent: one is from the SQL world, the other is from the DataFrame world.

In [0]:
# Use the `filter` or `where` method along with a Boolean expression to select
# particular rows:
riders.filter(riders.student == 1).count()

In [0]:
riders.where(riders.sex == "female").count()

In [0]:
# Chaining filters is logically equivalent to an AND condition within a single `filter`/`where` call
riders.filter(riders.student == 1).where(riders.sex == "female").count()

#### Sampling rows

Use the `sample` method to select a random sample of rows with or without replacement.

Use the `sampleBy` method to select a stratified random sample. In statistical surveys, when subpopulations within an overall population vary, it is advantageous to sample each subpopulation (stratum) independently.

Stratification is the process of dividing members of the population into homogeneous subgroups before sampling. The strata should be mutually exclusive: every element in the population must be assigned to only one stratum. The strata should also be collectively exhaustive: no population element can be excluded. Then simple random sampling or systematic sampling is applied within each stratum.

The objective is to improve the precision of the sample by reducing sampling error. It can produce a weighted mean that has less variability than the arithmetic mean of a simple random sample of the population. See the [Wikipedia article](https://en.wikipedia.org/wiki/Stratified_sampling) for more details.

In [0]:
# Use the `sample` method to select a random sample of rows with or without replacement
# Print both counts, since a notebook cell displays only its last expression
print(riders.count())
print(riders.sample(withReplacement=False, fraction=0.1, seed=12345).count())

In [0]:
# Use the `sampleBy` method to select a stratified random sample
# Here we sample 20% of male riders and 80% of female riders
riders \
  .groupBy("sex") \
  .count() \
  .show() 
riders \
  .sampleBy("sex", fractions={"male": 0.2, "female": 0.8}, seed=54321) \
  .groupBy("sex") \
  .count() \
  .show()
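
The idea behind sampling with per-stratum fractions can be sketched in plain Python. This is a conceptual illustration and not Spark's implementation: it assumes each row is kept independently with the probability given for its stratum, so sample sizes are approximate, and it treats strata missing from `fractions` as having fraction 0. The helper `stratified_sample` and the toy population are hypothetical.

```python
import random
from collections import Counter


def stratified_sample(rows, key, fractions, seed):
    """Keep each row with the probability assigned to its stratum."""
    rng = random.Random(seed)
    # Strata that are not listed in `fractions` get probability 0.0
    return [row for row in rows if rng.random() < fractions.get(row[key], 0.0)]


# Hypothetical population: 500 male and 500 female riders
population = [{"id": i, "sex": "male" if i % 2 else "female"} for i in range(1000)]

# Sample roughly 20% of male riders and 80% of female riders
sample = stratified_sample(population, "sex", {"male": 0.2, "female": 0.8}, seed=54321)
counts = Counter(row["sex"] for row in sample)
print(counts)

# A fraction of 1.0 keeps a whole stratum; an unlisted stratum is dropped
only_female = stratified_sample(population, "sex", {"female": 1.0}, seed=1)
```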

### Working with missing values

The main way of interacting with `null` values at scale is the `pyspark.sql.DataFrameNaFunctions` class, which provides methods for working with missing data in DataFrames. Because these methods are powerful, you should strive to always use `null` to represent missing or empty data in your DataFrames. See the [DataFrameNaFunctions](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameNaFunctions) class documentation for more details.

In [0]:
# Note the missing (null) values in the following DataFrame:
riders_selected = riders.select("id", "sex", "ethnicity")
riders_selected.show(25)

In [0]:
# One way of dealing with missing values is to drop records with missing values
# Drop rows with any missing values in certain columns
riders_selected.dropna(how="any", subset=["sex", "ethnicity"]).show(25)

In [0]:
# Drop rows with all missing values:
riders_selected.na.drop(how="all", subset=["sex", "ethnicity"]).show(25)
# **Note**: `dropna` and `na.drop` are equivalent.

In [0]:
# Replace missing values with a common value
riders_selected.fillna("OTHER/UNKNOWN", ["sex", "ethnicity"]).show(25)

In [0]:
# Replace missing values with different values
riders_missing = riders_selected.na.fill({"sex": "OTHER/UNKNOWN", "ethnicity": "MISSING"})
riders_missing.show(25)

In [0]:
# Replace arbitrary values with a common value
riders_missing\
    .replace(["OTHER/UNKNOWN", "MISSING"], "NA", ["sex", "ethnicity"])\
    .show(25)

In [0]:
# Replace arbitrary values with different values
# When the values to replace are given as a dictionary, the `value` argument is
# ignored, so pass the column list by keyword as `subset`. Passed positionally,
# the list would be taken as `value`, which Spark ignores with a warning, and the
# replacement would not be restricted to the listed columns.
riders_missing\
    .replace({"OTHER/UNKNOWN": "NA", "MISSING": "NO RESPONSE"}, subset=["sex", "ethnicity"])\
    .show(25)

# Note: `replace` and `na.replace` are equivalent. `replace` can be used to
# replace sentinel values (e.g. -9999) that represent missing values in numerical columns.

### Hands On

![Hands-on](https://cis442f-open-data.s3.amazonaws.com/pictures/hands.png "Hands-on")


#### Exercises

(1) Read the raw driver data from `/mnt/cis442f-data/duocar/raw/drivers` into a Spark DataFrame.

(2) How young is the youngest driver?  How old is the oldest driver? You can use the function shown in the next notebook to calculate age. 

(3) How many female drivers does DuoCar have?  How many non-white, female drivers?

(4) Create a new DataFrame without any personally identifiable information (PII) 
- Create a new column "birth_year"
- PII fields to remove: "first_name", "last_name", "home_block" , "home_lat", "home_lon", "birth_date" 
    
(5) Read the raw ride data from `/mnt/cis442f-data/duocar/raw/rides` into a Spark DataFrame.  Inspect the  `service` column.  Replace the missing values with "Car" for standard DuoCar service.



**References**

[DataFrame class](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.html#pyspark.sql.DataFrame)

[Column class](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.Column.html#pyspark.sql.Column)

[pyspark.sql.functions module](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#functions)

[DataFrameNaFunctions class](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrameNaFunctions.html#pyspark.sql.DataFrameNaFunctions)