Problem Statement:

You are tasked with analyzing delivery data to calculate the percentage of customers' first orders that are immediate. An immediate order is one whose order_date is the same as its customer_pref_delivery_date. Only the first order of each customer, meaning the one with the earliest order_date, counts toward the percentage.

Requirements:

Input Data:
A dataset containing delivery records with the following fields:

delivery_id (Integer): A unique identifier for each delivery.

customer_id (Integer): The ID of the customer who placed the order.

order_date (Date): The date the order was placed.

customer_pref_delivery_date (Date): The date the customer prefers to have the delivery.
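
To make the record layout and the definition of an immediate order concrete, here is a minimal plain-Python sketch. The Delivery class and its is_immediate property are hypothetical names used only for illustration; they are not part of the notebook code that follows.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Delivery:
    """One delivery record; field names mirror the dataset description."""
    delivery_id: int
    customer_id: int
    order_date: date
    customer_pref_delivery_date: date

    @property
    def is_immediate(self) -> bool:
        # Immediate means the order date equals the preferred delivery date.
        return self.order_date == self.customer_pref_delivery_date


# Two rows taken from the sample data (delivery_id 1 and 2).
scheduled = Delivery(1, 1, date(2019, 8, 1), date(2019, 8, 2))
immediate = Delivery(2, 2, date(2019, 8, 2), date(2019, 8, 2))
print(scheduled.is_immediate, immediate.is_immediate)  # False True
```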

In [0]:
import datetime

from pyspark.sql.types import StructType, StructField, IntegerType, DateType

delivery_data = [
    (1, 1, datetime.date(2019, 8, 1), datetime.date(2019, 8, 2)),
    (2, 2, datetime.date(2019, 8, 2), datetime.date(2019, 8, 2)),
    (3, 1, datetime.date(2019, 8, 11), datetime.date(2019, 8, 12)),
    (4, 3, datetime.date(2019, 8, 24), datetime.date(2019, 8, 24)),
    (5, 3, datetime.date(2019, 8, 21), datetime.date(2019, 8, 22)),
    (6, 2, datetime.date(2019, 8, 11), datetime.date(2019, 8, 13)),
    (7, 4, datetime.date(2019, 8, 9), datetime.date(2019, 8, 9))
]

# Define schema for the Delivery table
delivery_schema = StructType([
    StructField("delivery_id", IntegerType(), True),
    StructField("customer_id", IntegerType(), True),
    StructField("order_date", DateType(), True),
    StructField("customer_pref_delivery_date", DateType(), True)
])

# Create a DataFrame for the Delivery table.
# Assumes the notebook environment provides a SparkSession named `spark`.
delivery_df = spark.createDataFrame(delivery_data, schema=delivery_schema)

# Display the contents of the DataFrame
delivery_df.display()


delivery_id,customer_id,order_date,customer_pref_delivery_date
1,1,2019-08-01,2019-08-02
2,2,2019-08-02,2019-08-02
3,1,2019-08-11,2019-08-12
4,3,2019-08-24,2019-08-24
5,3,2019-08-21,2019-08-22
6,2,2019-08-11,2019-08-13
7,4,2019-08-09,2019-08-09


In [0]:
# Register the DataFrame as a temporary SQL table
delivery_df.createOrReplaceTempView("DELIVERY")

In [0]:
from pyspark.sql.functions import col, when, count, sum as _sum, round
# Spark SQL query to calculate immediate orders percentage
query = """
WITH ORDER_RANK AS (
    -- Rank each customer's orders by date; carry the preferred date along
    -- so that no join back to DELIVERY is needed.
    SELECT
        CUSTOMER_ID,
        ORDER_DATE,
        CUSTOMER_PREF_DELIVERY_DATE,
        RANK() OVER (PARTITION BY CUSTOMER_ID ORDER BY ORDER_DATE) AS RNK
    FROM DELIVERY
)
SELECT
    ROUND(
        (SUM(
            CASE
                WHEN ORDER_DATE = CUSTOMER_PREF_DELIVERY_DATE THEN 1
                ELSE 0
            END
        ) * 100.0) / COUNT(*), 2
    ) AS IMMEDIATE_ORDERS_PERCENTAGE
FROM ORDER_RANK
WHERE RNK = 1
"""

# Execute the query
result_df = spark.sql(query)

# Show the result
result_df.display()

IMMEDIATE_ORDERS_PERCENTAGE
50.0


In [0]:
# Note: this import shadows Python's built-in round() with the Spark column function.
from pyspark.sql.functions import col, count, rank, round, sum as _sum, when
from pyspark.sql.window import Window

# Add rank column for each customer's orders
window_spec = Window.partitionBy("customer_id").orderBy("order_date")
delivery_ranked_df = delivery_df.withColumn("rank", rank().over(window_spec))

# Filter for the first order for each customer (rank = 1)
first_orders_df = delivery_ranked_df.filter(col("rank") == 1)

# Calculate immediate orders (where order_date = customer_pref_delivery_date)
immediate_orders_df = first_orders_df.withColumn(
    "is_immediate",
    when(col("order_date") == col("customer_pref_delivery_date"), 1).otherwise(0)
)

# Calculate the percentage of immediate orders
result_df = immediate_orders_df.agg(
    round(
        (_sum(col("is_immediate")) * 100.0) / count("*"), 2
    ).alias("immediate_orders_percentage")
)

# Show the result
result_df.display()

immediate_orders_percentage
50.0


Explanation of the Code:

Data Initialization:

delivery_data is a list of tuples, one per delivery, with Python datetime.date objects in the two date fields.

Rank Calculation:

A Window specification (Window.partitionBy("customer_id").orderBy("order_date")) groups the rows by customer and sorts each group by order date.
The rank() function then numbers each customer's orders from the earliest date onward; orders that share a date receive the same rank.
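
Because rank() gives tied rows the same rank, a customer with two orders on the same earliest date would keep both rows at rank = 1. The sample data has no such ties, so the result above is unaffected. The plain-Python sketch below only imitates the tie behaviour of SQL RANK() and ROW_NUMBER(); the helper names and the customer's dates are hypothetical.

```python
from datetime import date


def sql_rank(order_dates):
    """Mimic RANK() over one customer's partition: ties share a rank."""
    ordered = sorted(order_dates)
    # The rank of a value is 1 + the number of strictly smaller values.
    return [(d, 1 + sum(1 for other in ordered if other < d)) for d in ordered]


def sql_row_number(order_dates):
    """Mimic ROW_NUMBER(): every row gets a distinct position."""
    return [(d, i) for i, d in enumerate(sorted(order_dates), start=1)]


# Hypothetical customer with two orders on the same earliest date.
dates = [date(2019, 8, 5), date(2019, 8, 1), date(2019, 8, 1)]

ranked = sql_rank(dates)
numbered = sql_row_number(dates)

first_by_rank = [d for d, r in ranked if r == 1]
first_by_row_number = [d for d, n in numbered if n == 1]
print(len(first_by_rank), len(first_by_row_number))  # 2 1
```

If exactly one first order per customer is required, one option is row_number() from pyspark.sql.functions with a tie-breaker such as delivery_id added to orderBy.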

Filter for First Orders:

Only rows where rank = 1 are kept; these are each customer's first order.

Immediate Order Calculation:

A new column is_immediate is added; it holds 1 when order_date equals customer_pref_delivery_date and 0 otherwise.

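Percentage Calculation:

The formula (sum(is_immediate) * 100.0) / count(*) calculates the percentage of immediate orders, where count(*) is the number of first orders. In the sample data, 2 of the 4 first orders are immediate, which gives 50.0.
The result is rounded to 2 decimal places using round().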
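
As a cross-check on the 50.0 reported by both versions, the whole calculation can be retraced in plain Python without Spark. This is a minimal sketch that assumes, as in the sample data, that no customer has two orders on the same earliest date; the variable names are illustrative only.

```python
from datetime import date

# Same rows as delivery_data:
# (delivery_id, customer_id, order_date, customer_pref_delivery_date)
rows = [
    (1, 1, date(2019, 8, 1), date(2019, 8, 2)),
    (2, 2, date(2019, 8, 2), date(2019, 8, 2)),
    (3, 1, date(2019, 8, 11), date(2019, 8, 12)),
    (4, 3, date(2019, 8, 24), date(2019, 8, 24)),
    (5, 3, date(2019, 8, 21), date(2019, 8, 22)),
    (6, 2, date(2019, 8, 11), date(2019, 8, 13)),
    (7, 4, date(2019, 8, 9), date(2019, 8, 9)),
]

# Keep each customer's earliest order as (order_date, preferred_date).
first_orders = {}
for _delivery_id, customer_id, order_date, preferred_date in rows:
    current = first_orders.get(customer_id)
    if current is None or order_date < current[0]:
        first_orders[customer_id] = (order_date, preferred_date)

# Count first orders whose order date equals the preferred delivery date.
immediate_count = sum(
    1 for ordered, preferred in first_orders.values() if ordered == preferred
)
percentage = round(immediate_count * 100.0 / len(first_orders), 2)
print(immediate_count, len(first_orders), percentage)  # 2 4 50.0
```

Customers 2 and 4 placed immediate first orders, while customers 1 and 3 did not, so 2 of 4 first orders qualify.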