# Refactoring

Spark jobs often get refactored over time. Instead of a simple read-process-write script in Scala or Python, they become an integration of multiple shared modules and libraries. In these cases it is important to keep the actual logic visible instead of burying it under layers of function calls.

## Our example
We have an example of an over-engineered function that loads events data and registers it as a temporary view. Sadly, it also does some unnecessary computation along the way: it adds a window-based ranking column and caches the result. Compare that with the actual job (the SQL query run against this data), which only fetches the single most recent event.

## The goal
There is no specific goal here. This is just an illustrative example.

In [0]:
import pyspark.sql.functions as F
from pyspark.sql import Window  # used by load_events() below

INPUT_PATH = "/Volumes/tantusdata_playground/default/bde-2023/input/events/events.parquet"

In [0]:
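def load_events():
    # Rank each user's events by timestamp; the downstream query does not depend on this ranking.
    window = Window.partitionBy("userID").orderBy("eventTimestamp")

    # Read, rank, cache, and register the view: more work than the query needs.
    (spark.read.parquet(INPUT_PATH)
        .withColumn("date_ranking", F.rank().over(window))
        .cache()
        .createOrReplaceTempView("refactoring_example"))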

In [0]:
load_events()
display(spark.sql("SELECT * FROM refactoring_example ORDER BY eventTimestamp DESC LIMIT 1"))
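
One way to see the extra work hidden behind the view is to print the query plan instead of running the query. The helper below is a minimal sketch: `print_plan` is a hypothetical name, not part of the original notebook, and the `mode="formatted"` argument assumes Spark 3.0 or later. For a view built from a ranked and cached DataFrame, the plan should mention a window operator and an in-memory relation, neither of which a plain "latest event" lookup requires.

```python
def print_plan(spark, query):
    """Print the formatted plan Spark would use for a SQL query (hypothetical helper)."""
    # explain() only prints the plan; it does not execute the query.
    spark.sql(query).explain(mode="formatted")
```

In the notebook this could be called as `print_plan(spark, "SELECT * FROM refactoring_example ORDER BY eventTimestamp DESC LIMIT 1")` after `load_events()` has registered the view.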

In [0]:
# The same lookup done directly: no temporary view, no window, no cache.
display(spark.read.parquet(INPUT_PATH)
    .orderBy(F.col("eventTimestamp").desc())
    .limit(1)
)
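
If the temporary view is still needed by other code, the loader can be split so that the expensive step is opt-in and visible at the call site. This is a sketch under assumptions, not the notebook's actual refactoring: the names `read_events`, `with_date_ranking`, `register_events`, and `EVENTS_VIEW` are hypothetical, the Spark session is passed in explicitly, the ranking is written as a SQL expression via `selectExpr`, and the cache is dropped on the assumption that the view is queried only once.

```python
EVENTS_VIEW = "refactoring_example"  # view name taken from the original function


def read_events(spark, path):
    """Read the raw events; nothing is computed until an action runs."""
    return spark.read.parquet(path)


def with_date_ranking(events):
    """Add the per-user ranking, for callers that actually need it."""
    return events.selectExpr(
        "*",
        "rank() OVER (PARTITION BY userID ORDER BY eventTimestamp) AS date_ranking",
    )


def register_events(spark, path, view_name=EVENTS_VIEW, ranked=False):
    """Register the events as a temporary view and return the DataFrame."""
    events = read_events(spark, path)
    if ranked:
        # Opt-in: the window computation runs only when requested.
        events = with_date_ranking(events)
    events.createOrReplaceTempView(view_name)
    return events
```

With this split, `register_events(spark, INPUT_PATH)` registers the plain events, and a caller that needs the ranking asks for it with `ranked=True`, so the cost is stated where it is incurred rather than buried inside the loader.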