Problem Statement:

You are working with an orders dataset (registered below as the temp view orderss) containing customer order information, including customer_id, order_date, and order_amount. The goal is to:

Identify the latest two orders for each customer based on the order_date.
Extract the order_amount of the latest order and the second latest order for each customer.
Filter out customers who do not have at least two orders (i.e., ensure that both latest and second latest order amounts are available).

Output the following columns:

customer_id
latest_order_amount (amount of the most recent order)
second_latest_order_amount (amount of the second most recent order)

In [0]:
from pyspark.sql.types import (
    StructType,
    StructField,
    IntegerType,
    DateType,
    DecimalType,
)
from datetime import date
from decimal import Decimal

# Define schema for the DataFrame
schema = StructType(
    [
        StructField("order_id", IntegerType(), True),
        StructField("customer_id", IntegerType(), True),
        StructField("order_date", DateType(), True),
        StructField("order_amount", DecimalType(10, 2), True),
    ]
)

# Create data for the DataFrame with correct types
data = [
    (1, 101, date(2024, 1, 10), Decimal("150.00")),
    (2, 101, date(2024, 2, 15), Decimal("200.00")),
    (3, 101, date(2024, 3, 20), Decimal("180.00")),
    (4, 102, date(2024, 1, 12), Decimal("200.00")),
    (5, 102, date(2024, 2, 25), Decimal("250.00")),
    (6, 102, date(2024, 3, 10), Decimal("320.00")),
    (7, 103, date(2024, 1, 25), Decimal("400.00")),
    (8, 103, date(2024, 2, 15), Decimal("420.00")),
]

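# Create DataFrame (`spark` is the SparkSession the notebook provides)
orders_df = spark.createDataFrame(data, schema=schema)

# Show the DataFrame (display() is a Databricks notebook helper; use show() elsewhere)
orders_df.display()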

order_id,customer_id,order_date,order_amount
1,101,2024-01-10,150.0
2,101,2024-02-15,200.0
3,101,2024-03-20,180.0
4,102,2024-01-12,200.0
5,102,2024-02-25,250.0
6,102,2024-03-10,320.0
7,103,2024-01-25,400.0
8,103,2024-02-15,420.0
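
As a cross-check that does not depend on Spark, the expected result can be worked out in plain Python from the same eight rows. This is only a sketch of the requirement, not the Spark solution: the helper name latest_two_amounts is hypothetical, and it assumes each customer's order dates are unique.

```python
from collections import defaultdict
from datetime import date
from decimal import Decimal

# Same rows as the sample data: (order_id, customer_id, order_date, order_amount)
rows = [
    (1, 101, date(2024, 1, 10), Decimal("150.00")),
    (2, 101, date(2024, 2, 15), Decimal("200.00")),
    (3, 101, date(2024, 3, 20), Decimal("180.00")),
    (4, 102, date(2024, 1, 12), Decimal("200.00")),
    (5, 102, date(2024, 2, 25), Decimal("250.00")),
    (6, 102, date(2024, 3, 10), Decimal("320.00")),
    (7, 103, date(2024, 1, 25), Decimal("400.00")),
    (8, 103, date(2024, 2, 15), Decimal("420.00")),
]


def latest_two_amounts(rows):
    """Return (customer_id, latest_amount, second_latest_amount) per customer.

    Hypothetical helper for illustration; customers with fewer than two
    orders are skipped.
    """
    by_customer = defaultdict(list)
    for _order_id, customer_id, order_date, amount in rows:
        by_customer[customer_id].append((order_date, amount))

    result = []
    for customer_id in sorted(by_customer):
        # Sort this customer's orders newest first
        orders = sorted(by_customer[customer_id], key=lambda o: o[0], reverse=True)
        if len(orders) >= 2:
            result.append((customer_id, orders[0][1], orders[1][1]))
    return result


expected = latest_two_amounts(rows)
for row in expected:
    print(row)
```

All three sample customers have at least two orders, so each one yields a row; a customer with a single order would be left out.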


In [0]:
orders_df.createOrReplaceTempView("orderss")

In [0]:
%sql
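WITH cte AS (
  -- Rank each customer's orders from newest to oldest
  SELECT
    orderss.*,
    DENSE_RANK() OVER (
      PARTITION BY customer_id
      ORDER BY
        order_date DESC
    ) AS r1
  FROM
    orderss
),
cte1 AS (
  -- Keep only the two most recent order dates per customer
  SELECT
    customer_id,
    order_amount AS latest_order_amount,
    order_date
  FROM
    cte
  WHERE
    r1 <= 2
),
cte2 AS (
  -- For each kept row, fetch the amount of the next older order
  SELECT
    cte1.*,
    LEAD(latest_order_amount) OVER (
      PARTITION BY customer_id
      ORDER BY
        order_date DESC
    ) AS second_latest_order_amount
  FROM
    cte1
)
-- The older row of each pair, and any customer with a single order,
-- has a NULL LEAD value and is dropped here
SELECT
  customer_id,
  latest_order_amount,
  second_latest_order_amount
FROM
  cte2
WHERE
  second_latest_order_amount IS NOT NULL;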

customer_id,latest_order_amount,second_latest_order_amount
101,180.0,200.0
102,320.0,250.0
103,420.0,400.0


In [0]:
from pyspark.sql import Window
from pyspark.sql.functions import col, dense_rank, lead

# Define a window for DENSE_RANK based on customer_id and order_date
rank_window = Window.partitionBy("customer_id").orderBy(col("order_date").desc())

# Step 1: Add a column for DENSE_RANK
orders_df_ranked = orders_df.withColumn("r1", dense_rank().over(rank_window))

# Step 2: Filter rows where r1 <= 2
latest_orders = orders_df_ranked.filter(col("r1") <= 2)

# Step 3: Add a column for LEAD to get the second_latest_order_amount
lead_window = Window.partitionBy("customer_id").orderBy(col("order_date").desc())
latest_orders_with_lead = latest_orders.withColumn(
    "second_latest_order_amount", lead("order_amount").over(lead_window)
)

# Step 4: Keep rows with a second-latest amount and name the output columns
result_df = latest_orders_with_lead.filter(
    col("second_latest_order_amount").isNotNull()
).select(
    "customer_id",
    col("order_amount").alias("latest_order_amount"),
    "second_latest_order_amount",
)

# Show the result
result_df.display()

customer_id,latest_order_amount,second_latest_order_amount
101,180.0,200.0
102,320.0,250.0
103,420.0,400.0


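Explanation of Steps:

dense_rank():

Adds a rank (r1) to each customer's orders, ordered by order_date descending, so the most recent order date gets rank 1.

Filter r1 <= 2:

Keeps the rows with the two most recent order dates for each customer. When a customer's order dates are unique, these are exactly the latest and second-latest orders; orders that share a date also share a rank, so ties can keep more than two rows.

lead():

Adds a column (second_latest_order_amount) holding the order_amount of the next row in descending date order for the same customer_id, that is, the next older order.

Filter second_latest_order_amount IS NOT NULL:

Keeps only the row for the latest order, which now carries both amounts. The older row has no following row, so its lead() value is null; the same applies to the only row of a customer with a single order, which removes such customers from the result.

select:

Retrieves only the relevant columns: customer_id, latest_order_amount, and second_latest_order_amount.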
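
To make the window semantics concrete, the following plain-Python sketch mimics dense_rank() and lead() for a single customer partition. The function names dense_rank_desc and lead are hypothetical stand-ins written for illustration, not PySpark APIs. It also shows one caveat: dense_rank() gives tied dates the same rank, so if a customer placed two orders on the same day, the r1 <= 2 filter can keep more than two rows. The tied dates at the end are invented for the illustration.

```python
from datetime import date
from decimal import Decimal


def dense_rank_desc(dates):
    """Rank dates newest-first; equal dates share a rank and ranks have no gaps."""
    distinct = sorted(set(dates), reverse=True)
    rank_of = {d: i + 1 for i, d in enumerate(distinct)}
    return [rank_of[d] for d in dates]


def lead(values):
    """Return the following value for each position, or None for the last one."""
    return [
        values[i + 1] if i + 1 < len(values) else None
        for i in range(len(values))
    ]


# Customer 101 from the sample data, already sorted by order_date descending
orders_101 = [
    (date(2024, 3, 20), Decimal("180.00")),
    (date(2024, 2, 15), Decimal("200.00")),
    (date(2024, 1, 10), Decimal("150.00")),
]

# Steps 1 and 2: rank the rows, then keep ranks 1 and 2
ranks = dense_rank_desc([d for d, _ in orders_101])
top_two = [row for row, r in zip(orders_101, ranks) if r <= 2]

# Step 3: pair each kept amount with the amount of the next older order
amounts = [amount for _, amount in top_two]
with_lead = list(zip(amounts, lead(amounts)))

# Step 4: drop the row whose lead value is missing
final_rows = [pair for pair in with_lead if pair[1] is not None]
print(final_rows)  # one row: latest 180.00, second latest 200.00

# Hypothetical tie: two orders on the latest date share rank 1
tied_dates = [date(2024, 3, 20), date(2024, 3, 20), date(2024, 2, 15)]
print(dense_rank_desc(tied_dates))  # all three rows have rank <= 2
```

If same-day orders are possible in the real data, one option is to rank with row_number() and add a tie-breaker such as order_id to the ordering, which guarantees at most two rows per customer after the filter.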