Problem Statement:
We have two datasets:

Tickets Table (ticket_df):

Contains records of support tickets with their issue and resolution dates.

Holiday Table (holiday_df):

Contains a list of holidays, some of which fall between a ticket's issue and resolution dates.

The goal is to calculate the final working days required to resolve each ticket. The calculation considers the following:

Exclude weekends (Saturday and Sunday) from the total days between issue_date and resolve_date.

Subtract any holidays listed in the holiday_df that fall between the issue_date and resolve_date.
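Because the two rules are applied separately, a holiday that lands on a Saturday or Sunday would be removed twice, once as a weekend day and once as a holiday. The quick plain-Python check below is a sketch rather than part of the original notebook, and its variable names are illustrative. It suggests that this does not affect the sample holidays used here.

```python
from datetime import date

# Holidays copied from the notebook's sample holiday data.
holidays = {
    "christmas": date(2024, 12, 25),
    "new_year": date(2025, 1, 1),
}

# date.weekday() returns 0 for Monday through 6 for Sunday.
weekend_holidays = [name for name, day in holidays.items() if day.weekday() >= 5]

for name, day in holidays.items():
    print(name, "weekday index:", day.weekday())
print("holidays on a weekend:", weekend_holidays)
```

With a different holiday calendar, any name that shows up in `weekend_holidays` would need to be excluded before subtracting.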

In [0]:
from pyspark.sql.types import (
    DateType,
    IntegerType,
    StringType,
    StructField,
    StructType,
)
from datetime import date

# Define schema for ticket table
ticket_schema = StructType(
    [
        StructField("ticket_id", IntegerType(), True),
        StructField("issue_date", DateType(), True),
        StructField("resolve_date", DateType(), True),
    ]
)

# Define schema for holiday_cal table
holiday_schema = StructType(
    [
        StructField("holiday_date", DateType(), True),
        StructField("occasion", StringType(), True),
    ]
)

# Data for ticket table
ticket_data = [
    (1, date(2024, 12, 18), date(2025, 1, 7)),
    (2, date(2024, 12, 20), date(2025, 1, 10)),
    (3, date(2024, 12, 22), date(2025, 1, 11)),
    (4, date(2025, 1, 2), date(2025, 1, 13)),
]

# Data for holiday_cal table
holiday_data = [(date(2024, 12, 25), "christmas"), (date(2025, 1, 1), "new_year")]

# Create DataFrames
ticket_df = spark.createDataFrame(ticket_data, schema=ticket_schema)
holiday_df = spark.createDataFrame(holiday_data, schema=holiday_schema)

# Show the DataFrames
ticket_df.display()
holiday_df.display()

ticket_id,issue_date,resolve_date
1,2024-12-18,2025-01-07
2,2024-12-20,2025-01-10
3,2024-12-22,2025-01-11
4,2025-01-02,2025-01-13


holiday_date,occasion
2024-12-25,christmas
2025-01-01,new_year


In [0]:
# Register DataFrames as temporary views
ticket_df.createOrReplaceTempView("ticket")
holiday_df.createOrReplaceTempView("holiday_cal")

# Query the DataFrames using Spark SQL (if needed)
queried_ticket_df = spark.sql("SELECT * FROM ticket")
queried_holiday_df = spark.sql("SELECT * FROM holiday_cal")

# Show the DataFrames
queried_ticket_df.display()
queried_holiday_df.display()

ticket_id,issue_date,resolve_date
1,2024-12-18,2025-01-07
2,2024-12-20,2025-01-10
3,2024-12-22,2025-01-11
4,2025-01-02,2025-01-13


holiday_date,occasion
2024-12-25,christmas
2025-01-01,new_year


In [0]:
from pyspark.sql import functions as F

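# Step 1: Create ticket_cte with `actual_days`
# (datediff minus two weekend days per full 7-day block; weekend days in a
# leftover partial week are not subtracted, so this is an approximation)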
ticket_cte = ticket_df.withColumn(
    "actual_days",
    F.datediff(F.col("resolve_date"), F.col("issue_date")) -
    (F.floor(F.datediff(F.col("resolve_date"), F.col("issue_date")) / 7) * 2)
)

# Step 2: Perform the left join with holiday_cal
joined_df = ticket_cte.join(
    holiday_df,
    (holiday_df["holiday_date"] >= ticket_cte["issue_date"]) &
    (holiday_df["holiday_date"] <= ticket_cte["resolve_date"]),
    "left"
)

# Step 3: Group by ticket_id and calculate `final_working_days`
result_df = joined_df.groupBy(
    "ticket_id", "issue_date", "resolve_date", "actual_days"
).agg(
    (F.col("actual_days") - F.count("occasion")).alias("final_working_days")
)

# Show the final result
result_df.select("ticket_id", "issue_date", "resolve_date", "final_working_days").display()

ticket_id,issue_date,resolve_date,final_working_days
1,2024-12-18,2025-01-07,14
2,2024-12-20,2025-01-10,13
3,2024-12-22,2025-01-11,14
4,2025-01-02,2025-01-13,9


In [0]:
%sql
WITH ticket_cte AS (
  SELECT
    *,
    DATEDIFF(resolve_date, issue_date) - FLOOR(DATEDIFF(resolve_date, issue_date) / 7) * 2 AS actual_days
  FROM ticket
)
SELECT
  tc.ticket_id,
  tc.issue_date,
  tc.resolve_date,
  tc.actual_days - COUNT(hc.occasion) AS final_working_days
FROM ticket_cte AS tc
LEFT JOIN holiday_cal AS hc
  ON hc.holiday_date BETWEEN tc.issue_date AND tc.resolve_date
GROUP BY
  tc.ticket_id,
  tc.issue_date,
  tc.resolve_date,
  tc.actual_days

ticket_id,issue_date,resolve_date,final_working_days
1,2024-12-18,2025-01-07,14
2,2024-12-20,2025-01-10,13
3,2024-12-22,2025-01-11,14
4,2025-01-02,2025-01-13,9


Explanation:
    
Step 1: Compute actual_days:

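Take the day difference from F.datediff and subtract two weekend days for every full 7-day block, computed as F.floor(datediff / 7) * 2. This is an approximation: weekend days in the leftover partial week are not subtracted, so it can overcount working days when the span is not a whole number of weeks.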
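To see what the Step 1 formula actually subtracts, the sketch below redoes the arithmetic in plain Python for the four sample tickets and compares it with a day-by-day count. The helper names `approx_weekdays` and `exact_weekdays` are hypothetical. Counting the days after issue_date up to and including resolve_date is an assumption, chosen so that the span length equals the datediff value.

```python
from datetime import date, timedelta

def approx_weekdays(issue_date, resolve_date):
    """Notebook formula: day difference minus 2 days per full 7-day block."""
    diff = (resolve_date - issue_date).days  # same value as datediff
    return diff - (diff // 7) * 2

def exact_weekdays(issue_date, resolve_date):
    """Count Mon-Fri dates after issue_date up to and including resolve_date.

    Assumption: the issue date is excluded and the resolve date is included.
    """
    diff = (resolve_date - issue_date).days
    days = (issue_date + timedelta(days=n) for n in range(1, diff + 1))
    return sum(1 for d in days if d.weekday() < 5)  # 5 and 6 are Sat/Sun

# Sample tickets from the notebook: (ticket_id, issue_date, resolve_date).
tickets = [
    (1, date(2024, 12, 18), date(2025, 1, 7)),
    (2, date(2024, 12, 20), date(2025, 1, 10)),
    (3, date(2024, 12, 22), date(2025, 1, 11)),
    (4, date(2025, 1, 2), date(2025, 1, 13)),
]

comparison = {
    tid: (approx_weekdays(issue, resolve), exact_weekdays(issue, resolve))
    for tid, issue, resolve in tickets
}
for tid, (approx, exact) in comparison.items():
    print(f"ticket {tid}: formula={approx}, day-by-day={exact}")
```

In this sample only ticket 2, whose span is exactly 21 days, gets the same value from both functions. The other spans end in a partial week containing weekend days that the formula leaves in. Neither function subtracts holidays; that happens in the later steps.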

Step 2: Left join with holiday_cal:

Match each ticket with every holiday whose holiday_date falls between issue_date and resolve_date, inclusive. The left join keeps tickets that have no matching holiday.

Step 3: Group by the ticket details:

Aggregate to calculate final_working_days by subtracting the count of holidays from actual_days. Counting the occasion column ignores nulls, so a ticket without a matching holiday subtracts zero.
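The choice of what to count matters after a left join. The sketch below reproduces the join for tickets 1 and 4 using Python's built-in sqlite3 module instead of Spark SQL, a substitution made only so the example runs without a cluster. Dates are stored as ISO-format text, which is an assumption that lets BETWEEN compare them correctly as strings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ticket (ticket_id INTEGER, issue_date TEXT, resolve_date TEXT);
    CREATE TABLE holiday_cal (holiday_date TEXT, occasion TEXT);
    INSERT INTO ticket VALUES
        (1, '2024-12-18', '2025-01-07'),
        (4, '2025-01-02', '2025-01-13');
    INSERT INTO holiday_cal VALUES
        ('2024-12-25', 'christmas'),
        ('2025-01-01', 'new_year');
""")

rows = conn.execute("""
    SELECT t.ticket_id,
           COUNT(h.occasion) AS holiday_count,  -- skips NULLs from unmatched rows
           COUNT(*)          AS joined_rows     -- counts the NULL-padded row too
    FROM ticket AS t
    LEFT JOIN holiday_cal AS h
      ON h.holiday_date BETWEEN t.issue_date AND t.resolve_date
    GROUP BY t.ticket_id
    ORDER BY t.ticket_id
""").fetchall()

for ticket_id, holiday_count, joined_rows in rows:
    print(ticket_id, holiday_count, joined_rows)
conn.close()
```

Ticket 4 has no holiday in its range, so the left join emits a single row with a null occasion. Counting the column yields 0 for that ticket, whereas counting rows would yield 1 and subtract a holiday that does not exist.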

Step 4: Select only the required columns:

Keep ticket_id, issue_date, resolve_date, and final_working_days; the intermediate actual_days column is dropped.

Step 5: Display the result.