### Combining and Splitting Data in Spark

This notebook demonstrates how to combine and split Spark DataFrames. It is based on material supplied by Cloudera under their Cloudera Academic Partner program and on the book *Spark: The Definitive Guide* by Bill Chambers and Matei Zaharia.

Topics
- Joining DataFrames
- Applying set operations to DataFrames
- Splitting a DataFrame 

#### Joining DataFrames

We will use the following datasets to demonstrate joins:

In [0]:
scientists = spark.read.csv("/mnt/cis442f-data/duocar/raw/data_scientists/", header=True, inferSchema=True)
scientists.show()

offices = spark.read.csv("/mnt/cis442f-data/duocar/raw/offices/", header=True, inferSchema=True)
offices.show()

Use the `join` DataFrame method with different values of the `how` argument to perform various types of joins.

#### Inner join

In [0]:
# Use a join expression and the value `inner` to return only those rows 
# for which the join expression is true.  This gives us a list of data 
# scientists associated with an office and the corresponding office 
# information.

scientists.join(offices, scientists.office_id == offices.office_id, "inner") \
  .withColumnRenamed("employee_id", "emp_id") \
  .show()

In [0]:
# Since the join key has the same name on both DataFrames, we can simplify
# the join to the following

scientists.join(offices, "office_id", "inner") \
  .withColumnRenamed("employee_id", "emp_id") \
  .show()

In [0]:
# Since an inner join is the default, we can further simplify the join

scientists.join(offices, "office_id") \
  .withColumnRenamed("employee_id", "emp_id") \
  .show()

#### Left outer join

Use the value `left` or `left_outer` to return every row in the left DataFrame, with or without a matching row in the right DataFrame.

In [0]:
# This gives us a list of data scientists with or without an office.

scientists \
  .join(offices, scientists.office_id == offices.office_id, "left_outer") \
  .withColumnRenamed("employee_id", "emp_id") \
  .show()

#### Right outer join

Use the value `right` or `right_outer` to return every row in the right DataFrame, with or without a matching row in the left DataFrame.

In [0]:
# This gives us a list of offices with or without a data scientist
# Note: The Paris office has two data scientists
scientists \
  .join(offices, scientists.office_id == offices.office_id, "right_outer") \
  .withColumnRenamed("employee_id", "emp_id") \
  .withColumnRenamed("postal_code", "p_code") \
  .show()

#### Full outer join

Use the value `full`, `outer`, or `full_outer` to return the union of the left outer and right outer joins (with duplicates removed).

In [0]:
# This gives us a list of all data scientists whether or not they 
# have an office and all offices whether or not they have any data scientists.

scientists \
  .join(offices, scientists.office_id == offices.office_id, "full_outer") \
  .withColumnRenamed("employee_id", "emp_id") \
  .withColumnRenamed("postal_code", "p_code") \
  .show()
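
To make the four `how` values above concrete, here is a plain-Python sketch (not Spark code) that models the joins on a few toy rows. The `join_rows` helper and the sample rows are hypothetical and are not the DuoCar data. Missing sides are filled with `None`, much as Spark fills them with nulls, and the keys are assumed to be non-null.

```python
# Hypothetical toy rows: (name, office_id) and (office_id, city).
scientists = [("Ann", 1), ("Bob", 2), ("Cy", 9)]    # office 9 does not exist
offices = [(1, "Paris"), (2, "Oslo"), (3, "Lima")]  # office 3 has no scientist


def join_rows(left, right, how="inner"):
    """Model an equi-join of left.office_id with right.office_id."""
    out = []
    matched_right = set()
    for name, office_id in left:
        matches = [r for r in right if r[0] == office_id]
        for r in matches:
            # Matching pairs appear in every join type.
            out.append((name, office_id, r[0], r[1]))
            matched_right.add(r)
        # Unmatched left rows are kept only by left and full outer joins.
        if not matches and how in ("left_outer", "full_outer"):
            out.append((name, office_id, None, None))
    # Unmatched right rows are kept only by right and full outer joins.
    if how in ("right_outer", "full_outer"):
        for r in right:
            if r not in matched_right:
                out.append((None, None, r[0], r[1]))
    return out


for how in ("inner", "left_outer", "right_outer", "full_outer"):
    print(how, len(join_rows(scientists, offices, how)))
# inner 2
# left_outer 3
# right_outer 3
# full_outer 4
```

The full outer result has one row more than either one-sided outer join because it keeps both the scientist without an office and the office without a scientist.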


#### Left semi join

Use the value `left_semi` to return the rows in the left DataFrame that match rows in the right DataFrame.
The result does not include any columns from the right DataFrame, so we can think of a left semi join as a filter on the left DataFrame.

In [0]:
# This gives us a list of data scientists associated with an office
scientists \
  .join(offices, scientists.office_id == offices.office_id, "left_semi") \
  .show()

#### Left anti join

Use the value `left_anti` to return the rows in the left DataFrame that do not match rows in the right DataFrame.

Thus you can also think of the left anti join as a special type of filter.

In [0]:
# This gives us a list of data scientists not associated with an office
scientists \
  .join(offices, scientists.office_id == offices.office_id, "left_anti") \
  .show() 
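
The "filter" reading of the left semi and left anti joins can be sketched in plain Python (this is not Spark code, and the toy rows below are hypothetical). Each left row is tested once against the set of right-side keys, so a left row is never duplicated and no right-side columns appear in the result.

```python
# Hypothetical toy rows: (name, office_id) and (office_id, city).
scientists = [("Ann", 1), ("Bob", 2), ("Cy", 9), ("Di", 1)]
offices = [(1, "Paris"), (2, "Oslo"), (3, "Lima")]

# The only thing the filter needs from the right side is its set of keys.
office_ids = {office_id for office_id, _ in offices}

# Left semi join: keep left rows whose key appears on the right.
semi = [row for row in scientists if row[1] in office_ids]

# Left anti join: keep left rows whose key does not appear on the right.
anti = [row for row in scientists if row[1] not in office_ids]

print(semi)  # [('Ann', 1), ('Bob', 2), ('Di', 1)]
print(anti)  # [('Cy', 9)]
```

In this model every left row lands in exactly one of the two results, so the two joins partition the left side.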

#### Cross join

Use the `crossJoin` DataFrame method to join every row in the left (scientists) DataFrame with every row in the right (offices) DataFrame. 

**Warning:** This can result in very big DataFrames!

**Note:**
- Columns with the same name are not renamed.
- This is called the *Cartesian product* of the two DataFrames.

In [0]:
# Use the `crossJoin` DataFrame method to join every row in the left 
# (scientists) DataFrame with every row in the right (offices) DataFrame
scientists.crossJoin(offices)\
  .withColumnRenamed("employee_id", "emp_id") \
  .withColumnRenamed("postal_code", "p_code") \
  .show()
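
The size warning above is simple arithmetic: a Cartesian product has as many rows as the product of the two row counts. The plain-Python sketch below (not Spark code, with hypothetical toy values) uses `itertools.product` to show this.

```python
from itertools import product

# Hypothetical toy inputs standing in for the two DataFrames.
scientists = ["Ann", "Bob", "Cy"]             # 3 rows
offices = ["Paris", "Oslo", "Lima", "Rome"]   # 4 rows

# Every left row is paired with every right row.
pairs = list(product(scientists, offices))
print(len(pairs))  # 12, which is 3 * 4

# The same arithmetic for two inputs of one million rows each.
print(1_000_000 * 1_000_000)  # 1000000000000
```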

#### Example: Joining the DuoCar data

Let us join the driver, rider, and review data with the ride data.

In [0]:
# Read the clean data
rides = spark.read.parquet("/mnt/cis442f-data/duocar/clean/rides/")
drivers = spark.read.parquet("/mnt/cis442f-data/duocar/clean/drivers/")
riders = spark.read.parquet("/mnt/cis442f-data/duocar/clean/riders/")
reviews = spark.read.parquet("/mnt/cis442f-data/duocar/clean/ride_reviews/")

In [0]:
# Since we want all the ride data, we will use a sequence of left outer joins
# Note that the id fields from three of the tables remain in the schema
joined = rides \
  .join(drivers, rides.driver_id == drivers.id, "left_outer") \
  .join(riders, rides.rider_id == riders.id, "left_outer") \
  .join(reviews, rides.id == reviews.ride_id, "left_outer")
joined.printSchema()

In [0]:
# We can disambiguate the fields with the same names as follows
joined.select(rides.id.alias("rides_id2"), joined.driver_id, drivers.id.alias("driver_id2"),
              joined.rider_id, riders.id.alias("rider_id2")).show(5)

In [0]:
# We could rename columns before the join, but we can also drop the duplicate columns
# afterwards by referring to the parent DataFrames
joined = rides \
  .join(drivers, rides.driver_id == drivers.id, "left_outer") \
  .join(riders, rides.rider_id == riders.id, "left_outer") \
  .join(reviews, rides.id == reviews.ride_id, "left_outer") \
  .drop(riders.id)\
  .drop(drivers.id)\
  .drop(reviews.ride_id)
  
# In principle we can also create a new DataFrame with unambiguous column names
joined.select(drivers["home_block"].alias("driver_home_block")).printSchema()

# `withColumnRenamed` does not work here because it only takes a string for the column name
# and cannot resolve a column name that is qualified by its parent DataFrame

#### Applying set operations to DataFrames

Spark SQL provides the following DataFrame methods that implement set operations:
- `union`
- `intersect`
- `subtract`

In [0]:
# Use the `union` method to get the union of rows in two DataFrames. `union` matches
# columns by position, not by name, so both DataFrames need the same column layout
driver_names = drivers.select("first_name")
print(driver_names.count())

rider_names = riders.select("first_name")
print(rider_names.count())

names_union = driver_names.union(rider_names).orderBy("first_name")
print(names_union.count())
names_union.show()

In [0]:
# Note that `union` does not remove duplicates.  Use the `distinct` method to remove duplicates
names_distinct = names_union.distinct()
print(names_distinct.count())
names_distinct.show()

In [0]:
# Use the `intersect` method to return rows that exist in both DataFrames
name_intersect = driver_names.intersect(rider_names).orderBy("first_name")
print(name_intersect.count())
name_intersect.show()

In [0]:
# Use the `subtract` method to return rows in the left DataFrame that do not exist in the right DataFrame
names_subtract = driver_names.subtract(rider_names).orderBy("first_name")
print(names_subtract.count())
names_subtract.show()
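
The way the three methods treat duplicates can be modeled in plain Python (this is not Spark code, and the names below are hypothetical). A list concatenation stands in for `union`, which keeps duplicates, while Python sets stand in for `intersect` and `subtract`, which return distinct rows.

```python
# Hypothetical toy inputs; "Bob" and "Di" are repeated on purpose.
driver_names = ["Ann", "Bob", "Bob", "Cy"]
rider_names = ["Bob", "Di", "Di"]

# Like `union`: rows are appended and duplicates are kept.
union_all = driver_names + rider_names

# Like `union` followed by `distinct`.
union_distinct = sorted(set(union_all))

# Like `intersect`: distinct rows present on both sides.
intersect = sorted(set(driver_names) & set(rider_names))

# Like `subtract`: distinct left rows that are absent from the right.
subtract = sorted(set(driver_names) - set(rider_names))

print(len(union_all))   # 7
print(union_distinct)   # ['Ann', 'Bob', 'Cy', 'Di']
print(intersect)        # ['Bob']
print(subtract)         # ['Ann', 'Cy']
```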


#### Splitting a DataFrame

Use the `randomSplit` DataFrame method to split a DataFrame into random subsets.

**Note:** 
- The proportions must be doubles.
- We will use this functionality to create train, validation, and test datasets for machine learning pipelines.

In [0]:
# Use the `randomSplit` DataFrame method to split a DataFrame into random
# subsets. Use the `seed` argument to make the split reproducible
print(riders.count())
(train, validate, test) = riders.randomSplit([0.6, 0.2, 0.2], seed=42)
print(train.count())
print(validate.count())
print(test.count())

In [0]:
# If the proportions do not add up to one, then Spark will normalize the values
(train, validate, test) = riders.randomSplit([60.0, 20.0, 20.0], seed=42)
print(train.count())
print(validate.count())
print(test.count())
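
The normalization mentioned above amounts to dividing each weight by the sum of the weights. The plain-Python sketch below (not Spark code) shows that arithmetic and then models a random split by giving each row one uniform draw. This simulation is an illustrative assumption, not Spark's actual implementation, but it shows why the subset sizes are only approximately proportional to the weights while still adding up to the total row count.

```python
import random
from itertools import accumulate

weights = [60.0, 20.0, 20.0]
total = sum(weights)

# Normalize so that the weights sum to one.
normalized = [w / total for w in weights]
print(normalized)  # [0.6, 0.2, 0.2]

# Cumulative upper bounds: a row lands in the first subset whose bound
# exceeds its uniform draw (the last subset catches any rounding slack).
bounds = list(accumulate(normalized))

rng = random.Random(42)  # fixed seed, so reruns give the same split
counts = [0] * len(weights)
for _ in range(1000):    # 1000 hypothetical rows
    u = rng.random()
    for i, bound in enumerate(bounds):
        if u < bound or i == len(bounds) - 1:
            counts[i] += 1
            break

# Each count is close to, but usually not exactly, 600 / 200 / 200.
print(counts, sum(counts))
```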

### Hands On

![Hands-on](https://cis442f-open-data.s3.amazonaws.com/pictures/hands.png "Hands-on")

#### Exercises

(1) Create a DataFrame with all combinations of vehicle make and vehicle year (regardless of whether the combination is observed in the data). Use the drivers dataset `drivers = spark.read.parquet("/mnt/cis442f-data/duocar/clean/drivers/")`

(2) Join the demographic data (on block_group) and weather data (on date) with the joined rides data. Demographic and weather data are in the 'raw' directory. Use the joined dataset `joined = spark.read.parquet("/mnt/cis442f-data/duocar/joined/")`

(3) Are there any drivers who have not provided a ride? Use the drivers and rides datasets `rides = spark.read.parquet("/mnt/cis442f-data/duocar/clean/rides/")`


#### References

`join`, `crossJoin`, `union`, `intersect`, `subtract`, and `randomSplit` are all methods of the [Spark DataFrame class](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.html#pyspark.sql.DataFrame).