Problem Statement:

Given two datasets, one representing lifts and another representing passengers using those lifts, the task is to determine which passengers can be accommodated in each lift based on their cumulative weight.

Datasets:

Lifts Dataset (lifts):

Schema:
id (int): Identifier for the lift.
capacity_kg (int): Maximum weight capacity of the lift in kilograms.

Passengers Dataset (passengers):

Schema:
passenger_name (string): Name of the passenger.
weight_kg (int): Weight of the passenger in kilograms.
lift_id (int): Identifier of the lift the passenger is using.

Task:

For each lift, take its passengers in ascending order of weight, keep those whose running total of weight stays within the lift’s capacity, and collect their names into a list.

In [0]:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Define the lift data
lift_data = [
    (1, 300),
    (2, 350)
]

lift_schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("capacity_kg", IntegerType(), True)
])

# Create lift DataFrame
lift_df = spark.createDataFrame(data=lift_data, schema=lift_schema)

# Display lift DataFrame
lift_df.display()

# Define lift passengers data
lift_passengers_data = [
    ('Rahul', 85, 1),
    ('Adarsh', 73, 1),
    ('Riti', 95, 1),
    ('Viraj', 80, 1),
    ('Vimal', 83, 2),
    ('Neha', 77, 2),
    ('Priti', 73, 2),
    ('Himanshi', 85, 2)
]

lift_passengers_schema = StructType([
    StructField("passenger_name", StringType(), True),
    StructField("weight_kg", IntegerType(), True),
    StructField("lift_id", IntegerType(), True)
])

# Create lift passengers DataFrame
lift_passengers_df = spark.createDataFrame(data=lift_passengers_data, schema=lift_passengers_schema)

# Display lift passengers DataFrame
lift_passengers_df.display()


id,capacity_kg
1,300
2,350


passenger_name,weight_kg,lift_id
Rahul,85,1
Adarsh,73,1
Riti,95,1
Viraj,80,1
Vimal,83,2
Neha,77,2
Priti,73,2
Himanshi,85,2


In [0]:
# Join lift_passengers_df with lift_df on lift_id
df_result = lift_passengers_df.join(lift_df, lift_passengers_df.lift_id == lift_df.id, how="inner") \
    .select("passenger_name", "weight_kg", "lift_id", "capacity_kg")

# Define the window specification to calculate running weight for each lift_id
window_spec = Window.partitionBy("lift_id").orderBy("weight_kg")

# Add the running weight column
df_result = df_result.withColumn("running_wt", F.sum("weight_kg").over(window_spec))

# Filter rows where running weight is less than or equal to the lift capacity
df_filtered = df_result.filter(df_result["running_wt"] <= df_result["capacity_kg"])

# Group by lift_id and collect the passenger names
df_grouped = df_filtered.groupBy(F.col("lift_id")) \
    .agg(F.collect_list(F.col("passenger_name")).alias("passenger_names"))

# Display the result
df_grouped.display()


lift_id,passenger_names
1,"List(Adarsh, Viraj, Rahul)"
2,"List(Priti, Neha, Vimal, Himanshi)"


Explanation:

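Join: Each passenger row is joined to its lift on lift_id = id, so capacity_kg is available on every passenger row.
Window: The running weight is a cumulative sum of weight_kg within each lift_id, ordered by weight_kg, so the lightest passengers are counted first. Because no frame is specified, Spark uses its default range frame, in which passengers with equal weights in the same lift share one running total.
Filter: Only rows whose running weight is less than or equal to the lift capacity are kept. For lift 1 the running totals are 73, 153, 238 and 333, so Riti (333 > 300) is excluded.
groupBy: After filtering, we group by lift_id and use F.collect_list to collect the names of the passengers who fit within the lift's weight capacity. collect_list does not guarantee the order of the collected names after a shuffle.
Display: display() is the Databricks notebook helper used here to render the DataFrame; outside Databricks, show() prints it instead.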
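
The running-total logic can be cross-checked without a Spark session. The following is a minimal plain-Python sketch of the same lightest-first rule, using the sample data from the notebook. The helper name passengers_that_fit is hypothetical, and the sketch assumes no two passengers in the same lift have equal weights (Spark's default window frame would give such ties a shared running total).

```python
from itertools import accumulate

# Sample data copied from the notebook cells above.
lifts = {1: 300, 2: 350}  # lift id -> capacity_kg
passengers = [
    ("Rahul", 85, 1), ("Adarsh", 73, 1), ("Riti", 95, 1), ("Viraj", 80, 1),
    ("Vimal", 83, 2), ("Neha", 77, 2), ("Priti", 73, 2), ("Himanshi", 85, 2),
]


def passengers_that_fit(lifts, passengers):
    """Hypothetical helper: lightest-first running total per lift."""
    result = {}
    for lift_id, capacity in lifts.items():
        # Sort this lift's passengers by weight, mirroring orderBy("weight_kg").
        riders = sorted((w, n) for n, w, l in passengers if l == lift_id)
        # Cumulative sum of weights, mirroring F.sum(...).over(window_spec).
        running = accumulate(w for w, _ in riders)
        # Keep passengers whose running total is within capacity.
        result[lift_id] = [
            name for (_, name), total in zip(riders, running) if total <= capacity
        ]
    return result


fits = passengers_that_fit(lifts, passengers)
print(fits)
```

For the sample data this reproduces the names shown in the notebook output: three passengers for lift 1 and all four for lift 2.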

In [0]:
# Register the DataFrames as SQL temporary views
lift_df.createOrReplaceTempView("lifts")
lift_passengers_df.createOrReplaceTempView("passengers")

In [0]:
# Write the SQL query
query = """
SELECT lift_id, 
       collect_list(passenger_name) AS passenger_names
FROM (
    SELECT p.lift_id, 
           p.passenger_name, 
           p.weight_kg, 
           l.capacity_kg,
           sum(p.weight_kg) OVER (PARTITION BY p.lift_id ORDER BY p.weight_kg) AS running_wt
    FROM passengers p
    JOIN lifts l ON p.lift_id = l.id
) 
WHERE running_wt <= capacity_kg
GROUP BY lift_id
"""

# Execute the query
df_result = spark.sql(query)

# Display the result
df_result.display()


lift_id,passenger_names
1,"List(Adarsh, Viraj, Rahul)"
2,"List(Priti, Neha, Vimal, Himanshi)"


Explanation:

Create Temporary Views:
Register the DataFrames lift_df and lift_passengers_df as temporary views named lifts and passengers, respectively.

SQL Query:
The inner query performs a JOIN between passengers and lifts and calculates the running weight per lift using the SUM window function, ordered by weight_kg.
The outer query filters rows where the running weight is less than or equal to the lift's capacity, groups by lift_id, and collects the passenger names into a list.

Execute the Query:
Use spark.sql(query) to run the SQL query.

Display Results:
Display the resulting DataFrame.
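
The window query itself can also be tried locally. The sketch below swaps Spark SQL for Python's built-in sqlite3 module, which is an assumption of this example and not part of the notebook: it requires SQLite 3.25 or later for window functions. Since SQLite has no collect_list, the outer grouping is done in Python instead. The variable name sqlite_result is hypothetical.

```python
import sqlite3

# In-memory SQLite database standing in for the Spark temporary views.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lifts (id INTEGER, capacity_kg INTEGER)")
conn.execute(
    "CREATE TABLE passengers (passenger_name TEXT, weight_kg INTEGER, lift_id INTEGER)"
)
conn.executemany("INSERT INTO lifts VALUES (?, ?)", [(1, 300), (2, 350)])
conn.executemany(
    "INSERT INTO passengers VALUES (?, ?, ?)",
    [
        ("Rahul", 85, 1), ("Adarsh", 73, 1), ("Riti", 95, 1), ("Viraj", 80, 1),
        ("Vimal", 83, 2), ("Neha", 77, 2), ("Priti", 73, 2), ("Himanshi", 85, 2),
    ],
)

# Same inner query as the notebook; the outer query only filters and orders,
# because the grouping into lists happens in Python below.
rows = conn.execute(
    """
    SELECT lift_id, passenger_name, running_wt
    FROM (
        SELECT p.lift_id,
               p.passenger_name,
               l.capacity_kg,
               SUM(p.weight_kg) OVER (
                   PARTITION BY p.lift_id ORDER BY p.weight_kg
               ) AS running_wt
        FROM passengers p
        JOIN lifts l ON p.lift_id = l.id
    ) AS t
    WHERE running_wt <= capacity_kg
    ORDER BY lift_id, running_wt
    """
).fetchall()

# Group names per lift, in ascending order of running weight.
sqlite_result = {}
for lift_id, name, _running_wt in rows:
    sqlite_result.setdefault(lift_id, []).append(name)

print(sqlite_result)
conn.close()
```

Ordering the rows by running_wt before grouping makes the list order explicit, which the Spark collect_list version does not guarantee.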