### Transforming DataFrame Columns in Spark

In this notebook we demonstrate how to transform DataFrame columns. It is based on material supplied by Cloudera under the Cloudera Academic Partner program and on the book *Spark: The Definitive Guide* by Bill Chambers and Matei Zaharia.

Topics
- Working with numerical columns
- Working with string columns
- Working with datetime columns
- Working with Boolean columns

In [0]:
# Load the raw rides data:
rides = spark.read.csv("/mnt/cis442f-data/duocar/raw/rides/", header=True, inferSchema=True)

# Load the raw driver data:
drivers = spark.read.csv("/mnt/cis442f-data/duocar/raw/drivers", header=True, inferSchema=True)

# Load the raw rider data:
riders = spark.read.csv("/mnt/cis442f-data/duocar/raw/riders/", header=True, inferSchema=True)

### Working with numerical columns

#### Example 1: Converting ride distance from meters to miles

**Notes:**
- We use the fact that 1 mile = 1609.344 meters
- We use the `round` function to round the result to two decimal places
- We use the `alias` method to rename the column

In [0]:
# Note: this import shadows Python's built-in `round` for the rest of the notebook
from pyspark.sql.functions import round
rides \
  .select("distance", round(rides.distance / 1609.344, 2).alias("distance_in_miles")) \
  .show(5)
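
As a quick sanity check on the arithmetic, the conversion can be reproduced in plain Python. This is only a sketch: `meters_to_miles` is a hypothetical helper that is not part of the notebook, and it uses `ROUND_HALF_UP` because Spark's `round` is documented to round half up, whereas Python's built-in `round` rounds half to even.

```python
from decimal import Decimal, ROUND_HALF_UP

METERS_PER_MILE = Decimal("1609.344")  # 1 mile = 1609.344 meters

def meters_to_miles(meters: int) -> Decimal:
    """Hypothetical helper: convert whole meters to miles, rounded to 2 places."""
    miles = Decimal(meters) / METERS_PER_MILE
    return miles.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

print(meters_to_miles(5000))   # a 5 km ride
print(meters_to_miles(16093))  # roughly 10 miles
```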

In [0]:
# To add a new column use the `withColumn` method with a new column name
rides \
  .select(["id", "driver_id", "rider_id", "date_time", "distance"]) \
  .withColumn("distance_in_miles", round(rides.distance / 1609.344, 2)) \
  .show(1) 

In [0]:
# To replace the existing column use the `withColumn` method with the existing column name
rides \
  .select("id", "driver_id", "rider_id", "date_time", "distance") \
  .withColumn("distance", round(rides.distance / 1609.344, 2)) \
  .show(1) 

#### Example 2: Converting the ride id from an integer to a string

**Note:** We use the [printf format string](https://en.wikipedia.org/wiki/Printf_format_string) `"%010d"` to achieve the desired format (a 10-digit decimal, left-padded with zeros).

![printf](https://cis442f-open-data.s3.amazonaws.com/pictures/printf.png "printf")

**Printf format string** refers to a control parameter used by a class of functions in the input/output libraries of C and many other programming languages. The string is written in a simple template language: characters are usually copied literally into the function's output, but format specifiers, which start with a % character, indicate the location and method to translate a piece of data (such as a number) to characters. Many languages other than C copy the printf format string syntax closely or exactly in their own I/O functions.
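
The effect of `%010d` can be checked in plain Python, whose `%` operator follows the same printf conventions. A minimal sketch (the sample ids are made up for illustration):

```python
# "%010d": d = decimal integer, 10 = minimum width, leading 0 = pad with zeros
sample_ids = [7, 42, 1234567890]  # made-up ids, not taken from the rides data

padded = ["%010d" % ride_id for ride_id in sample_ids]

for original, fixed in zip(sample_ids, padded):
    print(original, "->", fixed)
```

Values that already have ten digits are left unchanged; shorter values gain leading zeros.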

In production contexts, we should read the data into the DataFrame in the form we want whenever possible.

In [0]:
# Convert the `id` key to a left-zero-padded string:
from pyspark.sql.functions import format_string
rides.select("id", format_string("%010d", "id").alias("id_fixed")).show(5)

### Working with string columns

#### Example 3: Normalizing the sex column in the riders table

Trim whitespace and convert rider sex to uppercase (a common preprocessing step).

In [0]:
# Trim whitespace and convert rider sex to uppercase
from pyspark.sql.functions import trim, upper
riders \
  .select("sex", upper(trim(riders.sex)).alias("gender")) \
  .show(5)

#### Example 4: Extracting Census Block Group from the rider's Census Block

The [Census Block Group](https://en.wikipedia.org/wiki/Census_block_group) is the first 12 digits of the [Census Block](https://en.wikipedia.org/wiki/Census_block).

In [0]:
# Extracting Census Block Group from the rider's Census Block
from pyspark.sql.functions import substring
riders \
  .select("home_block", substring("home_block", 1, 12).alias("home_block_group")) \
  .show(5)
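
Note that `substring` counts positions from 1, whereas Python slices count from 0. On a plain Python string, the rough equivalent of `substring("home_block", 1, 12)` is `s[0:12]`. A small sketch, using a made-up 15-digit block id and a hypothetical helper `spark_like_substring`:

```python
def spark_like_substring(s: str, pos: int, length: int) -> str:
    """Hypothetical helper mimicking substring(col, pos, len) for pos >= 1."""
    start = pos - 1  # convert the 1-based position to a 0-based index
    return s[start:start + length]

census_block = "380170405001234"  # made-up 15-digit id, for illustration only
print(spark_like_substring(census_block, 1, 12))
```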

#### Example 5: Regular Expressions

Use a regular expression to extract the Census Block Group. 

**Note:**
- We `cast` the `home_block` column to `string` since regex functions expect a string
- The regular expression defines two capture groups: `(\d{12})` and `(.*)`
- The third argument to the `regexp_extract` function (1) indicates that the first group should be extracted

In [0]:
# Use a regular expression to extract the Census Block Group
from pyspark.sql.functions import regexp_extract
riders \
  .select("home_block", regexp_extract(riders.home_block.cast("string"), r"^(\d{12})(.*)", 1).alias("home_block_group")) \
  .show(5)

In [0]:
# This version extracts the second group from the regular expression just to 
# illustrate the operation of regexp_extract
from pyspark.sql.functions import regexp_extract
riders \
  .select("home_block", regexp_extract(riders.home_block.cast("string"), r"^(\d{12})(.*)", 2).alias("remaining_part")) \
  .show(5)
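
To see what each capture group holds, the same pattern can be tried with Python's `re` module. This is a sketch rather than Spark itself: `extract_group` is a hypothetical helper, the block id is made up, and the empty-string fallback is meant to mirror the documented behavior of `regexp_extract` when the pattern does not match.

```python
import re

# Group 1: the first 12 digits; group 2: whatever follows
PATTERN = re.compile(r"^(\d{12})(.*)")

def extract_group(value: str, idx: int) -> str:
    """Hypothetical helper: return capture group `idx`, or "" if there is no match."""
    match = PATTERN.match(value)
    return match.group(idx) if match else ""

census_block = "380170405001234"  # made-up 15-digit id, for illustration only
print(extract_group(census_block, 1))  # the block group part
print(extract_group(census_block, 2))  # the remaining part
```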

### Working with datetime columns

#### Example 6: Fix birth date

**Note:** 
- We could use the `withColumn` method as above to add a new column or replace an existing one
- Explicit specification of the date format uses [Java SimpleDateFormat](https://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html) patterns (Spark 3.0 and later define their own, largely similar, datetime pattern syntax)
- These cells introduce some of the functions available for working with date and timestamp columns

In [0]:
# Fix birth date
from pyspark.sql.functions import to_date
riders \
  .select("birth_date", to_date("birth_date").alias("birth_date_fixed")) \
  .show(5) 

In [0]:
# Fix birth date using explicit specification of the date format
from pyspark.sql.functions import to_date
riders \
  .select("birth_date", to_date("birth_date", "yyyy-MM-dd").alias("birth_date_fixed")) \
  .show(5) 
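
In the pattern `yyyy-MM-dd`, `yyyy` is the four-digit year, `MM` the two-digit month, and `dd` the two-digit day of the month (lowercase `mm` would mean minutes). For comparison, Python's `datetime` module expresses the same layout as `%Y-%m-%d`. A minimal sketch with a made-up date string:

```python
from datetime import datetime

raw_birth_date = "1995-03-14"  # made-up value, for illustration only

# %Y = four-digit year, %m = two-digit month, %d = two-digit day
birth_date = datetime.strptime(raw_birth_date, "%Y-%m-%d").date()

print(birth_date.year, birth_date.month, birth_date.day)
```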

#### Example 7: Compute rider age

**Note:** Spark implicitly casts `birth_date` or `today` as necessary.  It is
probably safer to explicitly cast one of these columns before computing the
number of months between.

In [0]:
# Compute rider age
from pyspark.sql.functions import to_date, current_date, months_between, floor, to_timestamp
riders \
  .select("birth_date", current_date().alias("today")) \
  .withColumn("age", floor(months_between("today", "birth_date") / 12)) \
  .show(5)

In [0]:
# Compute rider age making sure that birth_date has been converted to DateType()
from pyspark.sql.functions import to_date, current_date, months_between, floor
riders \
  .select(to_date("birth_date", "yyyy-MM-dd").alias("birth_date_fixed"), current_date().alias("today")) \
  .withColumn("age", floor(months_between("today", "birth_date_fixed") / 12)) \
  .show(5)
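
The age logic (whole months elapsed, divided by 12 and rounded down) can be sketched in plain Python. `age_in_years` is a hypothetical helper and the dates are made up. Spark's `months_between` returns fractional months and treats month ends specially, so its results may differ from this sketch in edge cases.

```python
from datetime import date

def age_in_years(birth: date, today: date) -> int:
    """Hypothetical helper: completed years between two dates."""
    months = (today.year - birth.year) * 12 + (today.month - birth.month)
    if today.day < birth.day:
        months -= 1  # the current month is not complete yet
    return months // 12

# The day before and the day of a 24th birthday
print(age_in_years(date(2000, 6, 15), date(2024, 6, 14)))
print(age_in_years(date(2000, 6, 15), date(2024, 6, 15)))
```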

#### Example 8: Fix ride date and time

Note that explicit specification of the date and time format uses [Java SimpleDateFormat](https://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html).

In [0]:
# Fix ride date and time
rides \
  .select("date_time", rides.date_time.cast("timestamp").alias("date_time_fixed")) \
  .show(5, truncate=False)

In [0]:
# Fix ride date and time (with explicit date time format)
from pyspark.sql.functions import to_timestamp
rides \
  .select("date_time", to_timestamp("date_time", "yyyy-MM-dd HH:mm").alias("date_time_fixed")) \
  .show(5, truncate=False)

### Working with Boolean columns

#### Example 9: Multiple Boolean Column expressions

**Note:** 
- The OR operator is `|`
- The AND operator is `&`
- Note the difference in how nulls are treated in the computation:
  - true & null = null
  - false & null = false
  - true | null = true
  - false | null = null
- Parentheses matter: Python's `&` and `|` operators bind more tightly than `==`, so each _Boolean Column Expression_ must be wrapped in parentheses when it is written directly in the `select` call
- The automatically generated column names reflect the logic that generated them
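
The null rules listed above are those of SQL's three-valued logic. They can be modeled in plain Python by letting `None` stand in for null. `and3` and `or3` below are hypothetical helpers written for this illustration; they are not Spark functions.

```python
from typing import Optional

def and3(a: Optional[bool], b: Optional[bool]) -> Optional[bool]:
    """Three-valued AND, with None playing the role of null."""
    if a is False or b is False:
        return False  # one definite False decides the result
    if a is None or b is None:
        return None   # otherwise an unknown operand makes the result unknown
    return True

def or3(a: Optional[bool], b: Optional[bool]) -> Optional[bool]:
    """Three-valued OR, with None playing the role of null."""
    if a is True or b is True:
        return True   # one definite True decides the result
    if a is None or b is None:
        return None
    return False

for a in (True, False):
    print(a, "& null =", and3(a, None), "|", a, "| null =", or3(a, None))
```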

In [0]:
# Predefine the Boolean Column expressions (these are both column objects)
studentFilter = riders.student == 1
maleFilter = riders["sex"] == "male"

print("The type of maleFilter is: " + str(type(maleFilter)))

# Combine using the AND operator
riders.select("student", "sex", studentFilter & maleFilter).show(15) 

In [0]:
# Combine using the OR operator
riders.select("student", "sex", studentFilter | maleFilter).show(15)

In [0]:
# Check the type of the Boolean Column Expression
type(riders.student == 1)

In [0]:
# Combine using the AND operator with Boolean Column Expressions directly in the select method
riders.select("student", "sex", (riders.student == 1) & (riders["sex"] == "male")).show(15) 
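
The parentheses in the cell above are required by Python's operator precedence rather than by Spark itself: `&` binds more tightly than `==`. The effect can be seen with plain integers. This is a sketch with made-up values; with Column objects, the unparenthesized form typically raises an error instead of silently returning a wrong answer.

```python
student, year = 1, 2  # made-up values, for illustration only

# Without parentheses this parses as the chained comparison:
#   student == (1 & year) == 2
without_parens = student == 1 & year == 2

# With parentheses each comparison is evaluated first, then combined
with_parens = (student == 1) & (year == 2)

print(without_parens, with_parens)  # prints: False True
```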

#### Example 10: Using multiple boolean expressions in a filter

Note: If you want to specify multiple AND filters, you can just chain them sequentially.

In [0]:
# Using multiple boolean expressions in a filter
riders.filter(maleFilter & studentFilter).select("student", "sex").show(5) 

In [0]:
# Using multiple boolean expressions in a filter is equivalent to this sequential chaining of filters
riders.filter(maleFilter).filter(studentFilter).select("student", "sex").show(5)

### Hands On

![Hands-on](https://cis442f-open-data.s3.amazonaws.com/pictures/hands.png "Hands-on")


#### Exercises

(1) Convert the `rides.driver_id` column to a string column.

(2) Extract the year from the `rides.date_time` column (hint: you can use the `year` function)

(3) Convert `rides.duration` from seconds to minutes.

(4) Convert the `rides.cancelled` column to a boolean column.

(5) Convert the `rides.star_rating` column to a double column.

In [0]:
# Submit your solution in the following format

rides = spark.read.csv("/mnt/cis442f-data/duocar/raw/rides/", header=True, inferSchema=True)
from pyspark.sql.functions import format_string, year, col, round
rides = rides\
  .withColumn(XXXXXXXXXX)\
  .withColumn(XXXXXXXXXXXXXX)\
  .withColumn(XXXXXXXXXXXXXX)\
  .withColumn(XXXXXXXXXXXXXXXX)\
  .withColumn(XXXXXXXXXXXXXXXX)

rides.show(5)
rides.printSchema()

**References**

[DataFrame class](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.html#pyspark.sql.DataFrame)

[Column class](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.Column.html#pyspark.sql.Column)

[pyspark.sql.functions module](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#functions)

All of the above are part of the [Python Spark SQL API Reference](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html), which is well worth getting to know.