# Complex join

Our first `JOIN` example covers joins with a non-trivial predicate: one that goes beyond `AND` clauses and equality comparisons. Such joins come up from time to time in data pipelines.

## Our example
Our job is to join _orders_ with _stores_ to enrich the orders with store metadata. Some invoices are not generated in any store but online: there is no "store" for them, yet we do not want to lose them. A simple right outer join is not an option either: to ensure data quality, we want to drop every invoice that has a non-null store ID but does not match any store. The current approach is very SQL-like: we specify a join with the condition described above.
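To make the condition concrete, here is a minimal plain-Python sketch of the predicate used by the join in this notebook (`storeID == storeID_2` or `storeID == -storeID_2`). The toy rows, the `city` column, and the `matches` helper are made up for illustration. The sketch only shows which order/store pairs match, not how Spark executes the join.

```python
# Hypothetical toy rows; the real notebook reads both tables from Parquet.
orders = [
    {"orderID": 1, "storeID": 3},
    {"orderID": 2, "storeID": -3},
    {"orderID": 3, "storeID": 7},     # no store 7 or -7, so an inner join drops it
    {"orderID": 4, "storeID": None},  # a NULL key never satisfies the predicate
]
stores = [
    {"storeID": 3, "city": "Warsaw"},
    {"storeID": 5, "city": "Krakow"},
]


def matches(order_store_id, store_id):
    """Mirror the notebook's predicate: storeID == storeID_2 OR storeID == -storeID_2."""
    if order_store_id is None or store_id is None:
        return False  # in SQL, a comparison with NULL is never true
    return order_store_id == store_id or order_store_id == -store_id


# Inner join: keep every (order, store) pair that satisfies the predicate.
joined = [
    {**order, "city": store["city"]}
    for order in orders
    for store in stores
    if matches(order["storeID"], store["storeID"])
]

matched_ids = [row["orderID"] for row in joined]
print(matched_ids)  # [1, 2]
```

Orders 1 and 2 both pick up the metadata of store 3, while orders 3 and 4 do not appear in the result.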

## The goal
As you might have seen (if you tried running this example), this job is a bit slow. We will try to identify _why_ and then optimize it. If you feel lost, run the current implementation and check the execution plan in the Spark UI.
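Besides the Spark UI, the plan can also be inspected from the notebook itself. The helper below is a small sketch, not part of the original exercise: `join_plan_lines` is a hypothetical name, and it assumes Spark 3.0 or newer (for `mode="formatted"`) and that `explain()` prints to Python's standard output, as it does in PySpark.

```python
import contextlib
import io


def join_plan_lines(df):
    """Return the lines of df's formatted plan that mention a join operator."""
    buffer = io.StringIO()
    # explain() prints the plan instead of returning it, so capture stdout.
    # mode="formatted" assumes Spark 3.0 or newer.
    with contextlib.redirect_stdout(buffer):
        df.explain(mode="formatted")
    # Keep only the lines that name a join operator.
    return [line.strip() for line in buffer.getvalue().splitlines() if "Join" in line]
```

Once the join cell has run, `join_plan_lines(orders_with_stores)` lists the join operators in the plan. The operator name tells you which join strategy Spark picked for this predicate.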

In [0]:
import pyspark.sql.functions as F

# Input data path - this is the data that you need to load.
INPUT_ORDERS_PATH = "/Volumes/tantusdata_playground/default/bde-2023/input/orders.parquet/"
INPUT_STORES_PATH = "/Volumes/tantusdata_playground/default/bde-2023/input/stores.parquet/"

# Volume path - make sure you have created a volume for yourself.
VOLUME_PATH = "/Volumes/tantusdata_playground/default/test-user-001"

WRITE_PATH = f"{VOLUME_PATH}/02-complex-join/products-with-stores.parquet"

In [0]:
orders = spark.read.parquet(INPUT_ORDERS_PATH)
stores = spark.read.parquet(INPUT_STORES_PATH)

# FixMe: try to optimize this join. Make sure to check the data being processed by this job.

orders_with_stores = (orders
    .join(
        stores.withColumnRenamed("storeID", "storeID_2"),
        (F.col("storeID") == F.col("storeID_2")) | (F.col("storeID") == -F.col("storeID_2")),
    )
    .drop("storeID_2")
)

orders_with_stores.write.mode("overwrite").parquet(WRITE_PATH)

## Assertions

In [0]:
# Verify the row counts of the written output for negative and positive store IDs.
result = spark.read.parquet(WRITE_PATH)
assert result.filter("storeID < 0").count() == 374_236
assert result.filter("storeID > 0").count() == 2_125_523