 
 ✅ Problem Statement
The goal is to calculate the minimum number of platforms a train station needs, given each train's arrival_time and departure_time.

✅ Problem Breakdown:
>Merge the arrival_time and departure_time records into a single dataset of events.
>Use a window function to track how many platforms are in use at each point in time.
>Count each train arrival as one more platform in use (+1) and each train departure as one fewer (-1).
>Take the maximum of that running count over the day; this peak is the minimum number of platforms required.
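
The breakdown above can be sketched in plain Python before moving to Spark. This is a minimal illustration, not the notebook's implementation: the function name `min_platforms` and the toy times are assumptions made up for the example.

```python
def min_platforms(arrivals, departures):
    """Sweep over +1/-1 events and return the peak running total."""
    # Tag each arrival as +1 and each departure as -1.
    events = [(t, +1) for t in arrivals] + [(t, -1) for t in departures]
    # Tuples sort by time first, then by delta, so a departure (-1) is
    # processed before an arrival (+1) that shares its timestamp.
    events.sort()
    in_use = peak = 0
    for _, delta in events:
        in_use += delta
        peak = max(peak, in_use)
    return peak


# Hypothetical toy data: two overlapping trains, then one later train.
toy_arrivals = ["08:00", "08:05", "08:30"]
toy_departures = ["08:15", "08:20", "08:40"]
print(min_platforms(toy_arrivals, toy_departures))  # 2
```

The first two trains overlap between 08:05 and 08:15, so two platforms are needed; the third train reuses a free one.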
✅ Input Data (Train Arrival and Departure times below)


In [0]:
arrivals_data = [
 (1, '2024-11-17 08:00'),
 (2, '2024-11-17 08:05'),
 (3, '2024-11-17 08:05'),
 (4, '2024-11-17 08:10'),
 (5, '2024-11-17 08:10'),
 (6, '2024-11-17 12:15'),
 (7, '2024-11-17 12:20'),
 (8, '2024-11-17 12:25'),
 (9, '2024-11-17 15:00'),
 (10, '2024-11-17 15:00'),
 (11, '2024-11-17 15:00'),
 (12, '2024-11-17 15:06'),
 (13, '2024-11-17 20:00'),
 (14, '2024-11-17 20:10')
]

departures_data = [
 (1, '2024-11-17 08:15'),
 (2, '2024-11-17 08:10'),
 (3, '2024-11-17 08:20'),
 (4, '2024-11-17 08:25'),
 (5, '2024-11-17 08:20'),
 (6, '2024-11-17 13:00'),
 (7, '2024-11-17 12:25'),
 (8, '2024-11-17 12:30'),
 (9, '2024-11-17 15:05'),
 (10, '2024-11-17 15:10'),
 (11, '2024-11-17 15:15'),
 (12, '2024-11-17 15:15'),
 (13, '2024-11-17 20:15'),
 (14, '2024-11-17 20:15')
]

# Define schema for the data
arrival_schema = ['train_id', 'arrival_time']
departure_schema = ['train_id', 'departure_time']

# Create DataFrames
arrivals_df = spark.createDataFrame(arrivals_data, arrival_schema)
departures_df = spark.createDataFrame(departures_data, departure_schema)
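
Note that the time columns above are created as plain strings. Ordering them still works because the format is fixed-width and zero-padded, so text order matches chronological order; casting to a timestamp type would be the more robust choice if the format could vary. A small plain-Python check of that assumption, using sample values in the same format:

```python
from datetime import datetime

# A few timestamps in the same 'YYYY-MM-DD HH:MM' format as the input data.
samples = ["2024-11-17 12:15", "2024-11-17 08:05", "2024-11-17 20:10", "2024-11-17 08:00"]

# Sort once as text and once as parsed datetimes, then compare.
by_text = sorted(samples)
by_time = sorted(samples, key=lambda s: datetime.strptime(s, "%Y-%m-%d %H:%M"))
print(by_text == by_time)  # True
```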

#### Step 1: Merge Arrival and Departure Data
#### Combine arrivals_df and departures_df into a unified dataset with an event_type indicator: +1 for an arrival, -1 for a departure.

In [0]:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Add 'event_type' column to indicate arrival (+1) or departure (-1)
arrivals_df = arrivals_df.withColumn("event_type", F.lit(1))
departures_df = departures_df.withColumn("event_type", F.lit(-1))

# Rename columns for consistency
arrivals_df = arrivals_df.withColumnRenamed("arrival_time", "event_time")
departures_df = departures_df.withColumnRenamed("departure_time", "event_time")

# Union the two datasets
events_df = arrivals_df.union(departures_df)

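# Sort by event_time, then event_type. Ascending order puts a departure (-1)
# before an arrival (+1) when both share the same timestamp.
# This sort is for inspection only; the window in Step 2 defines its own ordering.
events_df = events_df.orderBy("event_time", "event_type")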
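
How ties are ordered matters whenever one train departs at the same minute another arrives. The following plain-Python sketch uses a hypothetical two-train case (not taken from the input data) to show both orderings and the running totals each produces:

```python
from itertools import accumulate

# Hypothetical case: train A leaves at 08:10, exactly when train B arrives.
events = [("08:00", +1), ("08:10", -1), ("08:10", +1), ("08:20", -1)]

# Ascending (time, delta): the departure at 08:10 is handled first.
departure_first = sorted(events)
# Descending delta within a timestamp: the arrival at 08:10 is handled first.
arrival_first = sorted(events, key=lambda e: (e[0], -e[1]))

totals_departure_first = list(accumulate(d for _, d in departure_first))
totals_arrival_first = list(accumulate(d for _, d in arrival_first))
print(totals_departure_first)  # [1, 0, 1, 0]
print(totals_arrival_first)    # [1, 2, 1, 0]
```

Handling the departure first lets train B reuse train A's platform (peak 1); handling the arrival first requires a second platform (peak 2). Which rule is right depends on how the station operates.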


In [0]:
#arrivals_df.display()
#departures_df.display()
#events_df.display()

#### Step 2: Calculate the Running Sum of Platforms
#### Use a window function to compute the running total of platforms in use at each event.

In [0]:
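# Define a window specification for the cumulative sum.
# With orderBy and no explicit frame, Spark defaults to RANGE BETWEEN UNBOUNDED
# PRECEDING AND CURRENT ROW, so all events sharing an event_time get the same
# running total (the net effect of every event at that time).
# There is no partitionBy, so all rows go to a single partition; fine for small data.
window_spec = Window.orderBy("event_time")

# Calculate running sum of event_type to track platform usage
events_df = events_df.withColumn("platforms_in_use", F.sum("event_type").over(window_spec))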

# Show the intermediate results
events_df.display()


train_id,event_time,event_type,platforms_in_use
1,2024-11-17 08:00,1,1
2,2024-11-17 08:05,1,3
3,2024-11-17 08:05,1,3
2,2024-11-17 08:10,-1,4
4,2024-11-17 08:10,1,4
5,2024-11-17 08:10,1,4
1,2024-11-17 08:15,-1,3
3,2024-11-17 08:20,-1,1
5,2024-11-17 08:20,-1,1
4,2024-11-17 08:25,-1,0
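
Rows that share an event_time show the same platforms_in_use value (for example, all three 08:10 rows show 4). That is consistent with a running total computed per distinct timestamp rather than per row. The sketch below reproduces the column for the ten rows shown above in plain Python; it illustrates the arithmetic and is not how Spark evaluates the window internally.

```python
from collections import Counter
from itertools import accumulate

# (event_time, event_type) pairs copied from the ten rows shown above.
rows = [
    ("08:00", 1), ("08:05", 1), ("08:05", 1), ("08:10", -1), ("08:10", 1),
    ("08:10", 1), ("08:15", -1), ("08:20", -1), ("08:20", -1), ("08:25", -1),
]

# Net change per distinct timestamp, then a cumulative sum over sorted timestamps.
net = Counter()
for time, delta in rows:
    net[time] += delta
times = sorted(net)
total_at = dict(zip(times, accumulate(net[t] for t in times)))

# Every row reports the total for its timestamp, so tied rows share a value.
platforms_in_use = [total_at[time] for time, _ in rows]
print(platforms_in_use)  # [1, 3, 3, 4, 4, 4, 3, 1, 1, 0]
```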


#### Step 3: Find the Maximum Platforms Required
#### The maximum value of the platforms_in_use column gives the minimum number of platforms required.

In [0]:
# Find the maximum number of platforms in use
#max_platforms = events_df.agg(F.max("platforms_in_use").alias("max_platforms")).collect()[0]["max_platforms"]

max_platforms = events_df.agg(F.max("platforms_in_use")).collect()[0][0]

print(f"Minimum number of platforms required: {max_platforms}")


Minimum number of platforms required: 4
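
As a cross-check, the result can be recomputed by brute force in plain Python: count the trains present at every arrival time and take the largest count. The helper name `peak_occupancy` and its `reuse_at_same_minute` flag are assumptions introduced for this sketch. Only the time of day is used, since every event falls on the same date.

```python
# (arrival, departure) per train, times of day copied from the input data.
trains = [
    ("08:00", "08:15"), ("08:05", "08:10"), ("08:05", "08:20"),
    ("08:10", "08:25"), ("08:10", "08:20"), ("12:15", "13:00"),
    ("12:20", "12:25"), ("12:25", "12:30"), ("15:00", "15:05"),
    ("15:00", "15:10"), ("15:00", "15:15"), ("15:06", "15:15"),
    ("20:00", "20:15"), ("20:10", "20:15"),
]


def peak_occupancy(trains, reuse_at_same_minute=True):
    """Count trains present at each arrival time and return the largest count."""
    peak = 0
    for t, _ in trains:
        if reuse_at_same_minute:
            # A train departing at t has already freed its platform.
            present = sum(1 for a, d in trains if a <= t < d)
        else:
            # A train departing at t still occupies its platform at t.
            present = sum(1 for a, d in trains if a <= t <= d)
        peak = max(peak, present)
    return peak


print(peak_occupancy(trains))                              # 4
print(peak_occupancy(trains, reuse_at_same_minute=False))  # 5
```

The answer of 4 therefore assumes that a platform vacated at 08:10 can be taken by a train arriving at 08:10. If a departing train still blocks its platform during that minute, five platforms would be needed.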
