
Problem Statement:

The task is to find the most recent price change for each product and derive its final price from that change's ChangeDate: if the latest change falls on or before 2024-08-16, keep its NewPrice; if it falls after 2024-08-16, set the price to 10. Here's the breakdown:

Input Table:

The table ProductPriceChanges contains the following columns:

ProductID: The unique identifier for each product.
ChangeDate: The date when the price change occurred.
NewPrice: The price after the change.
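
Before running anything on Spark, it can help to know what the answer should be. The following is a minimal plain-Python sketch of the rule described above, using the same sample rows as the notebook. The names `expected_prices`, `CUTOFF`, and `OVERRIDE_PRICE` are hypothetical helpers introduced only for this check; they are not part of the notebook.

```python
from datetime import date

# Same sample rows as the notebook: (ProductID, NewPrice, ChangeDate).
rows = [
    (1, 10, "2024-08-11"),
    (1, 20, "2024-08-12"),
    (2, 20, "2024-08-14"),
    (1, 40, "2024-08-16"),
    (2, 50, "2024-08-15"),
    (3, 90, "2024-08-18"),
]

CUTOFF = date(2024, 8, 16)  # cutoff date used in the notebook's conditions
OVERRIDE_PRICE = 10         # price applied when the latest change is after the cutoff


def expected_prices(rows):
    """Return {ProductID: price} based on the latest change per product."""
    latest = {}  # ProductID -> (ChangeDate, NewPrice) of the most recent change
    for product_id, new_price, change_date in rows:
        parsed = date.fromisoformat(change_date)
        if product_id not in latest or parsed > latest[product_id][0]:
            latest[product_id] = (parsed, new_price)
    # Keep the latest price up to the cutoff; override it afterwards.
    return {
        pid: (price if changed <= CUTOFF else OVERRIDE_PRICE)
        for pid, (changed, price) in latest.items()
    }


print(expected_prices(rows))  # {1: 40, 2: 50, 3: 10}
```

This sketch returns integers, whereas the PySpark version below produces strings; only the values are meant to be compared.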

In [0]:
from pyspark.sql.functions import row_number, col, when
from pyspark.sql.window import Window

# Create DataFrame
data = [
    (1, 10, "2024-08-11"),
    (1, 20, "2024-08-12"),
    (2, 20, "2024-08-14"),
    (1, 40, "2024-08-16"),
    (2, 50, "2024-08-15"),
    (3, 90, "2024-08-18"),
]
# Column names (column types are inferred from the data)
columns = ["ProductID", "NewPrice", "ChangeDate"]

df = spark.createDataFrame(data, columns)
# Display the DataFrame; note the parentheses, since a bare df.display only returns the bound method
# Inferred schema: ProductID bigint, NewPrice bigint, ChangeDate string
df.display()


In [0]:
# Convert ChangeDate column to DateType
df = df.withColumn("ChangeDate", col("ChangeDate").cast("date"))

# Define the window specification
window_spec = Window.partitionBy("ProductID").orderBy(col("ChangeDate").desc())

# Add row numbers to identify the latest entry for each ProductID
df_with_rownum = df.withColumn("RN", row_number().over(window_spec))

# Filter the rows where RN = 1 (latest entry per ProductID)
latest_prices_df = df_with_rownum.filter(col("RN") == 1)

# Keep the latest price when the change is on or before 2024-08-16;
# otherwise set the price to "10". Every branch yields a string so the
# resulting column has a single data type.
result_df = latest_prices_df.withColumn(
    "NewPrice",
    when(col("ChangeDate") <= "2024-08-16", col("NewPrice").cast("string"))
    .when(col("ChangeDate") > "2024-08-16", "10"),
)

# Select required columns
result_df = result_df.select("ProductID", "NewPrice")

# display the final result
result_df.display()

ProductID,NewPrice
1,40
2,50
3,10
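
To see what the window step contributes before the `RN = 1` filter, here is a rough plain-Python trace of the numbering performed by `row_number()` over `ProductID` partitions ordered by `ChangeDate` descending. The function `add_row_numbers` is a hypothetical stand-in written for illustration, not a Spark API.

```python
from itertools import groupby

# Same sample rows as the notebook: (ProductID, NewPrice, ChangeDate).
rows = [
    (1, 10, "2024-08-11"),
    (1, 20, "2024-08-12"),
    (2, 20, "2024-08-14"),
    (1, 40, "2024-08-16"),
    (2, 50, "2024-08-15"),
    (3, 90, "2024-08-18"),
]


def add_row_numbers(rows):
    """Mimic ROW_NUMBER() OVER (PARTITION BY ProductID ORDER BY ChangeDate DESC)."""
    # ISO-formatted dates sort correctly as strings: newest first.
    ordered = sorted(rows, key=lambda r: r[2], reverse=True)
    # A stable sort by ProductID groups the partitions and keeps the date order.
    ordered.sort(key=lambda r: r[0])
    numbered = []
    for _, group in groupby(ordered, key=lambda r: r[0]):
        # Numbering restarts at 1 for every partition.
        for rn, (pid, price, changed) in enumerate(group, start=1):
            numbered.append((pid, price, changed, rn))
    return numbered


for row in add_row_numbers(rows):
    print(row)
# (1, 40, '2024-08-16', 1)
# (1, 20, '2024-08-12', 2)
# (1, 10, '2024-08-11', 3)
# (2, 50, '2024-08-15', 1)
# (2, 20, '2024-08-14', 2)
# (3, 90, '2024-08-18', 1)
```

Each product has distinct change dates in this sample. If two rows of one product shared the same ChangeDate, Spark's `row_number()` would still assign them different numbers, but which of the tied rows gets RN = 1 is not guaranteed unless the ordering includes a tiebreaker column.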


In [0]:
df.createOrReplaceTempView("ProductPriceChanges")

In [0]:
%sql
WITH cte AS (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY ProductID
      ORDER BY
        ChangeDate DESC
    ) AS RN
  FROM
    ProductPriceChanges
)
SELECT
  ProductID,
  CASE
    WHEN ChangeDate <= '2024-08-16' THEN CAST(NewPrice AS STRING)
    WHEN ChangeDate > '2024-08-16' THEN CAST(10 AS STRING)
  END AS newPrice
FROM
  cte
WHERE
  RN = 1;

ProductID,newPrice
1,40
2,50
3,10


Explanation:

ROW_NUMBER Function: ROW_NUMBER() numbers the rows within each ProductID partition in descending ChangeDate order, so the row with RN = 1 is the most recent price change for that product.

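CASE Expression: The CASE expression is the Spark SQL counterpart of the when() chain in the PySpark version. Its branches are tested in order, and a row that matches none of them (for example, one with a NULL ChangeDate) gets NULL because there is no ELSE branch.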

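CAST Function: All branches of a CASE expression should resolve to a single data type. NewPrice is a BIGINT in this table, so if the replacement value is cast to STRING, NewPrice should be cast to STRING as well (as the PySpark version does); otherwise Spark has to coerce the mismatched branches implicitly.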
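
The branch-by-branch behavior described above can be sketched in plain Python. The function `case_new_price` below is a hypothetical illustration of how the CASE expression treats a single row: it uses `None` to stand for SQL NULL and returns strings from every branch, mirroring the string casts in the PySpark version.

```python
from datetime import date
from typing import Optional

CUTOFF = date(2024, 8, 16)  # cutoff date used in the notebook's conditions


def case_new_price(change_date: Optional[date], new_price: int) -> Optional[str]:
    """Evaluate the CASE branches in order; None stands for SQL NULL."""
    if change_date is None:
        # A comparison with NULL is never true, so every WHEN is skipped
        # and, with no ELSE branch, the result is NULL.
        return None
    if change_date == CUTOFF:
        return str(new_price)
    if change_date < CUTOFF:
        return str(new_price)
    if change_date > CUTOFF:
        return str(10)
    return None


print(case_new_price(date(2024, 8, 16), 40))  # 40
print(case_new_price(date(2024, 8, 18), 90))  # 10
print(case_new_price(None, 90))               # None
```

If a NULL result is not wanted for unmatched rows, adding an ELSE branch in SQL (or `.otherwise(...)` in PySpark) makes the fallback explicit.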