# Joins and Lookup Tables

Apache Spark&trade; and Azure Databricks&reg; allow you to join new records to existing tables in an ETL job.

### Shuffle and Broadcast Joins

A common use case in ETL jobs involves joining new data to either lookup tables or historical data. You need different considerations to guide this process when working with distributed technologies such as Spark, rather than traditional databases that sit on a single machine.

Traditional databases join tables by pairing values on a given column. When all the data sits in a single database, it often goes unnoticed how computationally expensive row-wise comparisons are.  When data is distributed across a cluster, the expense of joins becomes even more apparent.

**A standard (or shuffle) join** moves the data for both tables across the cluster so that rows sharing the same join key end up on the same node. This is expensive not only because of the computation needed to perform row-wise comparisons, but also because data transfer across a network is often the biggest performance bottleneck of distributed systems.

By contrast, **a broadcast join** remedies this situation when one DataFrame is sufficiently small. A broadcast join duplicates the smaller of the two DataFrames on each node of the cluster, avoiding the cost of shuffling the bigger DataFrame.

<div><img src="https://files.training.databricks.com/images/eLearning/ETL-Part-2/shuffle-and-broadcast-joins.png" style="height: 400px; margin: 20px"/></div>
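
The difference in data movement can be sketched in plain Python, without Spark. This is only a toy model under stated assumptions: each "node" is a Python list of `(key, value)` rows, the function names `shuffle_join` and `broadcast_join` are hypothetical, and the row counts are made up for illustration. It is not how Spark implements either join internally.

```python
from collections import defaultdict


def shuffle_join(big_parts, small_parts):
    """Repartition BOTH tables by key hash, then join within each node."""
    n = len(big_parts)
    big_by_node, small_by_node = defaultdict(list), defaultdict(list)
    moved = 0
    for parts, dest in ((big_parts, big_by_node), (small_parts, small_by_node)):
        for node, rows in enumerate(parts):
            for key, value in rows:
                target = hash(key) % n        # node that owns this key
                dest[target].append((key, value))
                moved += int(target != node)  # row had to cross the network
    joined = []
    for node in range(n):
        lookup = defaultdict(list)
        for key, label in small_by_node[node]:
            lookup[key].append(label)
        for key, value in big_by_node[node]:
            joined.extend((key, value, label) for label in lookup[key])
    return sorted(joined), moved


def broadcast_join(big_parts, small_parts):
    """Copy the whole small table to every node; the big table never moves."""
    n = len(big_parts)
    small_rows = [row for rows in small_parts for row in rows]
    moved = len(small_rows) * n  # counted as one full copy per node (an overestimate)
    lookup = defaultdict(list)
    for key, label in small_rows:
        lookup[key].append(label)
    joined = []
    for rows in big_parts:
        for key, value in rows:
            joined.extend((key, value, label) for label in lookup[key])
    return sorted(joined), moved


# Hypothetical data: 300 "pageview" rows spread over 3 nodes, 7 lookup rows on node 0.
big_parts = [[(i % 7 + 1, i) for i in range(start, start + 100)] for start in (0, 100, 200)]
labels = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
small_parts = [list(enumerate(labels, start=1)), [], []]

shuffled, shuffle_moved = shuffle_join(big_parts, small_parts)
broadcasted, broadcast_moved = broadcast_join(big_parts, small_parts)
print("same result:", shuffled == broadcasted)
print("rows moved (shuffle):", shuffle_moved)
print("rows moved (broadcast):", broadcast_moved)
```

Both functions return the same joined rows; only the number of rows that change nodes differs, which is the cost the broadcast strategy is meant to reduce when one side is small.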

### Lookup Tables

Lookup tables are normally small, historical tables used to enrich new data passing through an ETL pipeline.

Run the cell below to mount the data.

In [5]:
%run "./Includes/Classroom-Setup"

Import a small table that will enrich new data coming into a pipeline.

In [7]:
labelsDF = spark.read.parquet("/mnt/training/day-of-week")

display(labelsDF)

Import a larger DataFrame with a column that can be joined to the lookup table. In this case, use Wikipedia site requests data.

In [9]:
from pyspark.sql.functions import col, date_format

pageviewsDF = (spark.read
  .parquet("/mnt/training/wikipedia/pageviews/pageviews_by_second.parquet/")
  # "u" formats the timestamp as the day number of the week (1 = Monday, ..., 7 = Sunday)
  .withColumn("dow", date_format(col("timestamp"), "u"))
)

display(pageviewsDF)

Join the two DataFrames together.

In [11]:
pageviewsEnhancedDF = pageviewsDF.join(labelsDF, "dow")

display(pageviewsEnhancedDF)

Now aggregate the results to see trends by day of the week.

**Note:** `pageviewsEnhancedDF` is a large DataFrame, so it can take a while to process depending on the size of your cluster.

In [13]:
from pyspark.sql.functions import col

aggregatedDowDF = (pageviewsEnhancedDF
  .groupBy(col("dow"), col("longName"), col("abbreviated"), col("shortName"))  
  .sum("requests")                                             
  .withColumnRenamed("sum(requests)", "Requests")
  .orderBy(col("dow"))
)

display(aggregatedDowDF)

### Exploring Broadcast Joins

The join above did not specify a join strategy. To see which one Spark chose, look at the physical plan for the query using the `.explain()` DataFrame method. Look for **BroadcastHashJoin** and/or **BroadcastExchange**.

<div><img src="https://files.training.databricks.com/images/eLearning/ETL-Part-2/broadcasthashjoin.png" style="height: 400px; margin: 20px"/></div>

In [15]:
aggregatedDowDF.explain()

By default, Spark did a broadcast join rather than a shuffle join. In other words, it broadcast `labelsDF` to the larger `pageviewsDF`, replicating the smaller DataFrame on each node of our cluster. This avoided having to move the larger DataFrame across the cluster.

Take a look at the broadcast threshold by accessing the configuration settings.

In [17]:
threshold = spark.conf.get("spark.sql.autoBroadcastJoinThreshold")
print("Threshold: {0:,}".format( int(threshold) ))

This is the maximum size in bytes for a table that is broadcast to worker nodes. Setting it to `-1` disables broadcasting.

In [19]:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
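
The way the threshold behaves, including the `-1` setting, can be approximated with a small plain-Python model. This is a simplified sketch, not Spark's actual planner logic: the function name `would_auto_broadcast` and the per-row sizes are hypothetical, and the 10 MB constant is the documented default for `spark.sql.autoBroadcastJoinThreshold`, which your cluster may override.

```python
DEFAULT_THRESHOLD_BYTES = 10 * 1024 * 1024  # 10 MB, the documented Spark default


def would_auto_broadcast(estimated_bytes, threshold_bytes=DEFAULT_THRESHOLD_BYTES):
    """Simplified model: broadcast only if the size estimate fits the threshold."""
    # With a threshold of -1 no non-negative size can satisfy this, so nothing is broadcast.
    return 0 <= estimated_bytes <= threshold_bytes


rows = 1_000_000
narrow_table = rows * 8    # hypothetical: about 8 bytes per row
wide_table = rows * 200    # hypothetical: about 200 bytes per row

print(would_auto_broadcast(narrow_table))                      # True
print(would_auto_broadcast(wide_table))                        # False
print(would_auto_broadcast(narrow_table, threshold_bytes=-1))  # False
```

Note that the two tables have the same number of rows; the decision in this model depends only on the estimated size in bytes.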

Now notice the lack of broadcast in the query physical plan.

In [21]:
pageviewsDF.join(labelsDF, "dow").explain()

Next reset the original threshold.

In [23]:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", threshold)
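
The change-then-reset pattern used above can be wrapped in a context manager so the original value is restored even if a cell fails partway through. This is a sketch, not part of the lesson's setup: `temporary_conf` is a hypothetical helper, and `FakeConf` is a dict-backed stand-in that only mimics the `get`/`set` calls of `spark.conf` so the example runs without a cluster.

```python
from contextlib import contextmanager


@contextmanager
def temporary_conf(conf, key, value):
    """Set a config value for the duration of a with-block, then restore it."""
    original = conf.get(key)
    conf.set(key, value)
    try:
        yield
    finally:
        conf.set(key, original)  # runs even if the block raises


class FakeConf:
    """Dict-backed stand-in exposing get/set like spark.conf (for illustration only)."""

    def __init__(self, settings):
        self._settings = dict(settings)

    def get(self, key):
        return self._settings[key]

    def set(self, key, value):
        self._settings[key] = value


KEY = "spark.sql.autoBroadcastJoinThreshold"
conf = FakeConf({KEY: "10485760"})

with temporary_conf(conf, KEY, "-1"):
    inside = conf.get(KEY)   # broadcasting disabled inside the block
after = conf.get(KEY)        # original value restored afterwards

print(inside, after)
```

In the notebook, passing `spark.conf` in place of `conf` should work the same way, for example wrapping the `.explain()` call in `with temporary_conf(spark.conf, KEY, -1):`.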

### Explicitly Broadcasting Tables

There are two ways of telling Spark to explicitly broadcast tables. The first is to change the Spark configuration, which affects all operations. The second is to declare it for a specific join using the `broadcast()` function in the `pyspark.sql.functions` module.

In [25]:
from pyspark.sql.functions import broadcast

pageviewsDF.join(broadcast(labelsDF), "dow").explain()

## Exercise 1: Join a Lookup Table

Join a table that includes country codes to a lookup table containing the full country name.

### Step 1: Import the Data

Create the following DataFrames:<br><br>

- `countryLookupDF`: A lookup table with ISO country codes located at `/mnt/training/countries/ISOCountryCodes/ISOCountryLookup.parquet`
- `logWithIPDF`: A server log including the results from an IPLookup table located at `/mnt/training/EDGAR-Log-20170329/enhanced/logDFwithIP.parquet`

In [28]:
# TODO
countryLookupDF = # FILL_IN
logWithIPDF = # FILL_IN

In [29]:
# TEST - Run this cell to test your solution
dbTest("ET2-P-05-01-01", 249, countryLookupDF.count())
dbTest("ET2-P-05-01-02", 5000, logWithIPDF.count())

print("Tests passed!")

### Step 2: Broadcast the Lookup Table

Complete the following:<br><br>

- Create a new DataFrame `logWithIPEnhancedDF`
- Get the full country name by performing a broadcast join that broadcasts the lookup table to the server log
- Drop all columns that come from the lookup table other than `EnglishShortName`

In [31]:
# TODO
logWithIPEnhancedDF = # FILL_IN

In [32]:
# TEST - Run this cell to test your solution
cols = set(logWithIPEnhancedDF.columns)

dbTest("ET2-P-05-02-01", True, "EnglishShortName" in cols and "ip" in cols)
dbTest("ET2-P-05-02-02", True, "alpha2Code" not in cols and "ISO31662SubdivisionCode" not in cols)
dbTest("ET2-P-05-02-03", 5000, logWithIPEnhancedDF.count())

print("Tests passed!")

## Review
**Question:** Why are joins expensive operations?  
**Answer:** Joins perform a large number of row-wise comparisons, making the cost associated with joining tables grow with the size of the data in the tables.

**Question:** What is the difference between a shuffle and broadcast join? How does Spark manage these differences?  
**Answer:** A shuffle join shuffles data between nodes in a cluster. By contrast, a broadcast join moves the smaller of two DataFrames to where the larger DataFrame sits, minimizing the overall data transfer. By default, Spark performs a broadcast join if the estimated size in bytes of one of the tables is below a certain threshold. You can change the threshold, or you can explicitly request a broadcast join. Because the automatic decision depends on Spark's size estimate, which may be inaccurate or unavailable for some data, it can be worth requesting the broadcast manually when you know a table is small.

**Question:** What is a lookup table?  
**Answer:** A lookup table is a small table often used for referencing commonly used data, such as mapping cities to countries.

## Next Steps

Start the next lesson, [Database Writes]($./06-Database-Writes ).

## Additional Topics & Resources

**Q:** Where can I get more information on optimizing table joins where data skew is an issue?  
**A:** Check out the Databricks documentation on <a href="https://docs.azuredatabricks.net/spark/latest/spark-sql/skew-join.html" target="_blank">Skew Join Optimization</a>.