# Cross Join
A cross join is always an expensive operation, mostly due to the amount of data it produces. However, even here there are ways to do it efficiently and ways that leave you waiting long hours for a job to complete. In this notebook we will look at a job that computes a cross join of orders, grouped by order ID. We want to produce all product pairs for every order available.

## The goal
The current (naive) implementation takes over 1 hour to compute all the pairs - you can verify this by running the current implementation. Browse through the data and see what is causing the problem, then try to correct this. A really optimized solution may take as little as 10 minutes. 
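
To make "browse through the data" more concrete, here is a minimal plain-Python sketch of the arithmetic behind the join. It assumes an order with `n` distinct products yields `n * (n - 1) / 2` pairs under the `productID1 < productID2` condition. The toy rows and the helper `pairs_per_order` are hypothetical and are not part of the exercise data.

```python
def pairs_per_order(rows):
    """Return {order_id: number of product pairs (p1 < p2)} for (order_id, product_id) rows."""
    products = {}
    for order_id, product_id in rows:
        # Collect the distinct products seen for each order.
        products.setdefault(order_id, set()).add(product_id)
    # n distinct products give n * (n - 1) / 2 ordered-by-ID pairs.
    return {oid: len(p) * (len(p) - 1) // 2 for oid, p in products.items()}


# Hypothetical toy data: two ordinary orders and one unusually large order.
toy_rows = (
    [("small-1", p) for p in range(3)]
    + [("small-2", p) for p in range(4)]
    + [("huge", p) for p in range(1000)]
)

counts = pairs_per_order(toy_rows)
total = sum(counts.values())

print(counts)                   # {'small-1': 3, 'small-2': 6, 'huge': 499500}
print(counts["huge"] / total)   # share of all output rows produced by one order
```

In this toy case a single order accounts for almost all of the output, so the task that handles it would dominate the runtime. A per-order row count on the real input, for example with `groupBy(...).count()`, is one way to check whether the actual data has a similar shape.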

In [0]:
import pyspark.sql.functions as F

INPUT_PATH = "/FileStore/input/orders.parquet/"

# Volume path - make sure you have created a volume for yourself.
VOLUME_PATH = "/Volumes/tantusdata_playground/default/test-user-001"

WRITE_PATH = f"{VOLUME_PATH}/03-cross-join/orders-pairs.parquet"

## Your own code:
1. Do not change the input or output paths.
2. The aim is to optimize the join.
3. Non-destructive changes (adding columns, partitions, hints) are all welcome. You may need to change both sides of the join.
4. The default implementation is very slow - takes over 1h to complete. Try to identify the issue - browsing through the data may help here.
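
One technique that matches "adding columns" and "change both sides of the join" is key salting. Whether it is the intended solution here is an assumption, so treat the following as a plain-Python sketch of the idea rather than the reference answer. Each left-hand row gets one salt value, each right-hand row is replicated once per salt value, and the join key becomes `(orderID, salt)`. The names `NUM_SALTS`, `plain_join`, `salted_join` and the toy data are hypothetical. The salt is derived from an integer product ID to keep the example deterministic.

```python
NUM_SALTS = 4  # hypothetical number of salt buckets


def plain_join(rows):
    """Reference result: all (order, p1, p2) with the same order and p1 < p2."""
    return {
        (o1, p1, p2)
        for (o1, p1) in rows
        for (o2, p2) in rows
        if o1 == o2 and p1 < p2
    }


def salted_join(rows, num_salts=NUM_SALTS):
    """Same join, but keyed on (order, salt) so one large order spans several keys."""
    # Left side: one salt per row (assumes integer product IDs).
    lhs = [(o, p % num_salts, p) for o, p in rows]
    # Right side: every row is replicated once for each salt value.
    buckets = {}
    for o, p in rows:
        for s in range(num_salts):
            buckets.setdefault((o, s), []).append(p)

    pairs, load = set(), {}
    for o, s, p1 in lhs:
        # Count left-hand rows per join key to show how the work is spread.
        load[(o, s)] = load.get((o, s), 0) + 1
        for p2 in buckets[(o, s)]:
            if p1 < p2:
                pairs.add((o, p1, p2))
    return pairs, load


# Hypothetical toy data: one small order and one large order.
orders_toy = [("small", p) for p in range(3)] + [("huge", p) for p in range(40)]

salted_pairs, load = salted_join(orders_toy)
print(salted_pairs == plain_join(orders_toy))  # the salted join returns the same pairs
print(max(load.values()))                      # largest number of left-hand rows on one key
```

The result is unchanged, but the 40 left-hand rows of the large order are split across four join keys instead of sitting on one. In PySpark the replication step is commonly written with `F.explode` over an array of salt values, and the price is a right-hand side that is `NUM_SALTS` times larger.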

In [0]:
orders = spark.read.parquet(INPUT_PATH)

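# Both sides come from the same table, so the columns are renamed to tell
# them apart after the join. Assumes the input columns are named
# `orderID` and `productID`.
orders_lhs = orders.select(
    F.col("orderID").alias("orderID1"),
    F.col("productID").alias("productID1"),
)

orders_rhs = orders.select(
    F.col("orderID").alias("orderID2"),
    F.col("productID").alias("productID2"),
)

# FixMe: try to optimize this join. Make sure to check the data being processed by this job.

# Pair products within the same order; `<` keeps each unordered pair once.
orders_pairs = (orders_lhs
    .join(
        orders_rhs,
        (F.col("orderID1") == F.col("orderID2")) & (F.col("productID1") < F.col("productID2")),
    )
)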

orders_pairs.write.parquet(WRITE_PATH)