# UDF
Avoiding UDFs is one of the "classic" optimization guidelines. Indeed, a UDF (or worse, a non-Arrow Python UDF) adds overhead that grows _linearly_ with the number of rows, and it may interfere with the Catalyst optimizer, which cannot see inside the function. In this example we will see it in action.

## The goal
The current (naive) implementation uses three UDFs. They are not overly complicated and should be easy to translate into native Spark function calls. Try doing so and run _both_ versions. What differences can you see, e.g. in the physical execution plan?

In [0]:
import pyspark.sql.functions as F
import random

INPUT_PATH = "/Volumes/tantusdata_playground/default/bde-2023/input/orders.parquet/"

# Volume path - make sure you have created a volume for yourself.
VOLUME_PATH = "/Volumes/tantusdata_playground/default/admin-user"

WRITE_PATH = f"{VOLUME_PATH}/03-cross-join/orders-pairs.parquet"

In [0]:
orders = spark.read.parquet(INPUT_PATH)

# Python UDFs: each one is evaluated row by row in a Python worker.
udf_1 = F.udf(lambda orderID: random.randint(1, 100) + orderID, "BIGINT")
udf_2 = F.udf(lambda orderID: orderID == 0, "BOOLEAN")
udf_3 = F.udf(lambda storeID: "invalid" if int(storeID) < 0 else str(storeID), "STRING")

orders_with_extras = (orders
    .withColumn("extra_1", udf_1("orderID"))
    .withColumn("extra_2", udf_2("orderID"))
    .withColumn("extra_3", udf_3("storeID"))
    .filter(F.col("extra_2") == True)
)

orders_with_extras.write.parquet(WRITE_PATH, mode="overwrite")

In [0]:
orders = spark.read.parquet(INPUT_PATH)

# FixMe: try to replace the UDFs with native functions, such as F.rand or F.when.

orders_with_extras = (orders
    .withColumn("extra_1", )
    .withColumn("extra_2", )
    .withColumn("extra_3", )
    .filter(F.col("extra_2") == True)
)

orders_with_extras.write.parquet(WRITE_PATH, mode="overwrite")