
## Joining Data

In [33]:
# Set up
import os
import numpy as np
from pyspark.sql import SparkSession
from pyspark.sql.utils import AnalysisException
import pyspark.sql.functions as F


spark = SparkSession.builder.getOrCreate()

# Read the data
DIRECTORY = "./data/Ch04"
logs = spark.read.csv(
    os.path.join(DIRECTORY, "BroadcastLogs_2018_Q3_M8_sample.CSV"),
    sep="|",  # default is ","
    quote='"',  # default is double quote.
    header=True,  # set first row as column names
    inferSchema=True,  # infer column types from the data; default is False
)


# Read the link table and keep only primary channels (i.e. PrimaryFG == 1)
log_identifier = spark.read.csv(
    os.path.join(DIRECTORY, "ReferenceTables", "LogIdentifier.csv"),
    sep="|",
    header=True,
    inferSchema=True,
)
log_identifier = log_identifier.where(F.col("PrimaryFG") == 1)


# Show results
log_identifier.printSchema()
log_identifier.show(5)
print("Unique primary channels: ", log_identifier.count())

root
 |-- LogIdentifierID: string (nullable = true)
 |-- LogServiceID: integer (nullable = true)
 |-- PrimaryFG: integer (nullable = true)

+---------------+------------+---------+
|LogIdentifierID|LogServiceID|PrimaryFG|
+---------------+------------+---------+
|           13ST|        3157|        1|
|         2000SM|        3466|        1|
|           70SM|        3883|        1|
|           80SM|        3590|        1|
|           90SM|        3470|        1|
+---------------+------------+---------+
only showing top 5 rows

Unique primary channels:  758


### Understanding the `join` recipe

```py
[LEFT].join(
    [RIGHT],
    on=[PREDICATES],
    how=[METHOD]
)
```

#### Important points

1. If one record in the left table resolves the predicate with more than one record in the right table (or vice versa), __this record will be duplicated in the joined table__.
2. If one record in the left or in the right table does not resolve the predicate with any record in the other table, __it will not be present in the resulting table, unless the join method specifies a protocol for failed predicates__.

#### Pyspark helpers in join logic

- You can put multiple `and` predicates into a list, like:
    ```py
    [
        left["col1"] == right["colA"], 
        left["col2"] > right["colB"],  # value on left table is greater than the right
        left["col3"] != right["colC"]
    ]
    ```
- You can test equality on identically named columns just by passing the column name, or a list of column names, as a string (for example `on="LogServiceID"`).

#### Setting up join logic with `how`

1. `cross` - returns a record for every pair of left and right records. Not common.
2. `inner` - returns a record if the predicate is true, otherwise drops it. Most common, and the PySpark `join` default.
3. `left` & `right` - similar to `inner`, except for what happens to records that match nothing:
    - `left` join adds unmatched records from the left table to the joined table, and fills in the columns from the right table with `None`.
    - `right` join adds unmatched records from the right table, and fills in the columns from the left table with `None`.
4. `outer` - adds unmatched records from both the left and right tables, padding with `None`.
5. `left_semi` - same as an inner join, but only keeps the columns from the left table and does not duplicate a left record that matches more than one right record.
6. `left_anti` - returns only the left records that don't match the predicate with any record in the right table. The opposite of a `left_semi` join.

In [27]:
# Join `logs` with `log_identifier` using the 'LogServiceID' column
joined = logs.join(log_identifier, on="LogServiceID", how="inner")

In [30]:
# Additionally join CategoryID and ProgramClassID table
# Use left joins since keys may not be available in the link table.

# CategoryID
cd_category = spark.read.csv(
    os.path.join(DIRECTORY, "ReferenceTables", "CD_Category.csv"),
    sep="|",
    header=True,
    inferSchema=True,
).select(
    "CategoryID",
    "CategoryCD",
    F.col("EnglishDescription").alias("Category_Description"),
)

# ProgramClass
cd_program_class = spark.read.csv(
    os.path.join(DIRECTORY, "ReferenceTables", "CD_ProgramClass.csv"),
    sep="|",
    header=True,
    inferSchema=True,
).select(
    "ProgramClassID",
    "ProgramClassCD",
    F.col("EnglishDescription").alias("ProgramClass_Description"),
)


# Join all to joined table
full_log = joined.join(cd_category, "CategoryID", how="left",).join(
    cd_program_class, "ProgramClassID", how="left",
)


# Check if additional columns were joined to original log data frame
full_log.printSchema()

root
 |-- ProgramClassID: integer (nullable = true)
 |-- CategoryID: integer (nullable = true)
 |-- LogServiceID: integer (nullable = true)
 |-- BroadcastLogID: integer (nullable = true)
 |-- LogDate: string (nullable = true)
 |-- SequenceNO: integer (nullable = true)
 |-- AudienceTargetAgeID: integer (nullable = true)
 |-- AudienceTargetEthnicID: integer (nullable = true)
 |-- ClosedCaptionID: integer (nullable = true)
 |-- CountryOfOriginID: integer (nullable = true)
 |-- DubDramaCreditID: integer (nullable = true)
 |-- EthnicProgramID: integer (nullable = true)
 |-- ProductionSourceID: integer (nullable = true)
 |-- FilmClassificationID: integer (nullable = true)
 |-- ExhibitionID: integer (nullable = true)
 |-- Duration: string (nullable = true)
 |-- EndTime: string (nullable = true)
 |-- LogEntryDate: string (nullable = true)
 |-- ProductionNO: string (nullable = true)
 |-- ProgramTitle: string (nullable = true)
 |-- StartTime: string (nullable = true)
 |-- Subtitle: string (nullable = true)

### Warning: What happens when joining columns in a distributed environment

> To compare two records, PySpark needs both of them on the same machine. If they are not, it moves the data in an operation called a _shuffle_, which is slow and expensive. More on join strategies in later chapters.

### Warning: Joining tables with identically named columns leads to errors downstream

PySpark happily joins the two data frames together but fails when we try to work with the ambiguous column.

In [47]:
# Joining two tables with the same LogServiceID column
logs_and_channels_verbose = logs.join(
    log_identifier, logs["LogServiceID"] == log_identifier["LogServiceID"]
)


print(
    'Joined table now has two "LogServiceID" columns: ',
    [col for col in logs_and_channels_verbose.columns if col == "LogServiceID"],
    "\n",
)
print('Selecting "LogServiceID" will now throw an error')


# Selecting "LogServiceID" will throw an error
try:
    logs_and_channels_verbose.select("LogServiceID")
except AnalysisException as err:
    print("AnalysisException: ", err)

Joined table now has two "LogServiceID" columns:  ['LogServiceID', 'LogServiceID'] 

Selecting "LogServiceID" will now throw an error
+------------+
|LogServiceID|
+------------+
|        3157|
|        3157|
|        3157|
|        3157|
|        3157|
+------------+
only showing top 5 rows

AnalysisException:  Reference 'LogServiceID' is ambiguous, could be: LogServiceID, LogServiceID.;


### Solutions for preventing ambiguous column references

1. Use the simplified syntax (i.e. pass the name of the join column as a string). This automatically removes the second instance of the predicate column. It only works for equi-joins.
    ```py
    logs_and_channels = logs.join(log_identifier, "LogServiceID")
    ```
2. Refer to the column through its origin table.
    ```py
    logs_and_channels_verbose.select(log_identifier["LogServiceID"])
    ```
3. Alias the tables and use the `Column` object directly.
    ```py
    logs_and_channels_verbose = logs.alias("left").join(
        log_identifier.alias("right"),
        logs["LogServiceID"] == log_identifier["LogServiceID"],
    )

    logs_and_channels_verbose.drop(F.col("right.LogServiceID")).select(
        "LogServiceID"
    )
    ```