Problem Statement:

visit_date is the column with unique values for this table.
Each row of this table contains the visit date and visit id to the stadium with the number of people during the visit.
As the id increases, the date increases as well.
 
Write a solution to display the records that belong to a run of three or more rows with consecutive ids, where the number of people is greater than or equal to 100 for each row in the run.

Return the result table ordered by visit_date in ascending order.

The result format is in the following example.

In [0]:
from pyspark.sql.functions import col, rank, count
from pyspark.sql.window import Window
# Sample data
data = [
    (1, "2017-01-01", 10),
    (2, "2017-01-02", 109),
    (3, "2017-01-03", 150),
    (4, "2017-01-04", 99),
    (5, "2017-01-05", 145),
    (6, "2017-01-06", 1455),
    (7, "2017-01-07", 199),
    (8, "2017-01-09", 188),
]

# Define the schema
columns = ["id", "visit_date", "people"]

# Create DataFrame
stadium_df = spark.createDataFrame(data, schema=columns)

# Display the input data
stadium_df.display()


id,visit_date,people
1,2017-01-01,10
2,2017-01-02,109
3,2017-01-03,150
4,2017-01-04,99
5,2017-01-05,145
6,2017-01-06,1455
7,2017-01-07,199
8,2017-01-09,188


In [0]:
stadium_df.createOrReplaceTempView('Stadium')

In [0]:
# Step 1: Filter rows where `people` >= 100
filtered_df = stadium_df.filter(col("people") >= 100)

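# Step 2: Add rank column using Window function
# Note: a window with no partitionBy moves all rows to a single partition,
# which is fine for this small sample but does not scale to large tables.
window_spec = Window.orderBy("id")
ranked_df = filtered_df.withColumn("rnk", rank().over(window_spec))

# Step 3: Add the "island" column (id - rnk); rows with consecutive ids share the same value
ranked_df = ranked_df.withColumn("island", col("id") - col("rnk"))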

# Step 4: Identify islands with at least 3 consecutive rows
island_counts = (
    ranked_df.groupBy("island")
    .agg(count("*").alias("island_count"))
    .filter(col("island_count") >= 3)
)

# Step 5: Join back to filter rows matching the islands with >= 3 rows
result_df = (
    ranked_df.join(island_counts, on="island", how="inner")
    .select("id", "visit_date", "people")
)

# Step 6: Order by visit_date (ascending), as the problem statement requires
final_result = result_df.orderBy(col("visit_date").asc())

# display the final result
final_result.display()

id,visit_date,people
5,2017-01-05,145
6,2017-01-06,1455
7,2017-01-07,199
8,2017-01-09,188


In [0]:
%sql
WITH stadium_with_rnk AS
(
    SELECT id, visit_date, people, rnk, (id - rnk) AS island
    FROM (
        SELECT id, visit_date, people, RANK() OVER(ORDER BY id) AS rnk
        FROM Stadium
        WHERE people >= 100) AS t0
)
SELECT id, visit_date, people 
FROM stadium_with_rnk
WHERE island IN (SELECT island 
                 FROM stadium_with_rnk 
                 GROUP BY island 
                 HAVING COUNT(*) >= 3)
ORDER BY visit_date

id,visit_date,people
5,2017-01-05,145
6,2017-01-06,1455
7,2017-01-07,199
8,2017-01-09,188


Explanation of Code:

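Filter Rows: Keep only rows where people >= 100.
Rank Rows: Apply rank() over a window ordered by id. Because the ids are distinct, each remaining row gets its position 1, 2, 3, and so on.
Calculate Island: Subtract the rank (rnk) from the id. Within a run of consecutive ids, id and rnk both increase by 1 per row, so the difference stays constant; a gap in the ids increases the difference and starts a new island.
Group by Island: Count the rows in each island and keep only islands with at least 3 rows.
Filter Islands: Inner-join back to the ranked rows to keep only rows belonging to these valid islands.
Order Result: Sort the rows with orderBy() (ORDER BY in SQL); the problem asks for visit_date in ascending order.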
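
To see why id - rnk groups consecutive ids, the island computation can be traced in plain Python on the sample rows. This is a minimal sketch that assumes no Spark session; the names `rows`, `filtered`, `islands`, and `kept_ids` are illustrative and not part of the notebook.

```python
from collections import defaultdict

# Sample rows from the notebook: (id, visit_date, people)
rows = [
    (1, "2017-01-01", 10),
    (2, "2017-01-02", 109),
    (3, "2017-01-03", 150),
    (4, "2017-01-04", 99),
    (5, "2017-01-05", 145),
    (6, "2017-01-06", 1455),
    (7, "2017-01-07", 199),
    (8, "2017-01-09", 188),
]

# Filter, then order the survivors by id (ids are distinct, so rank == position)
filtered = sorted((r for r in rows if r[2] >= 100), key=lambda r: r[0])

# island = id - rank; rows with consecutive ids share the same island value
islands = defaultdict(list)
for rnk, (row_id, _visit_date, _people) in enumerate(filtered, start=1):
    islands[row_id - rnk].append(row_id)

# Keep only the ids that sit in islands of at least three rows
kept_ids = sorted(i for ids in islands.values() if len(ids) >= 3 for i in ids)

print(dict(islands))  # {1: [2, 3], 2: [5, 6, 7, 8]}
print(kept_ids)       # [5, 6, 7, 8]
```

Ids 2 and 3 land in island 1 and ids 5 through 8 in island 2; only the second island reaches three rows.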

Output:

For the sample input, both solutions return the rows with ids 5 through 8. Ids 2 and 3 also have people >= 100, but that run contains only two rows, so it is dropped. Because id and visit_date increase together, ordering by either column gives the same row order.
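
As an independent cross-check of the expected output, the same rule can be applied with a simple linear scan instead of the rank-difference trick. This is a hedged sketch in plain Python, not the notebook's method; `long_runs` and its parameters `min_people` and `min_len` are hypothetical names introduced here.

```python
def long_runs(rows, min_people=100, min_len=3):
    """Return ids that sit in runs of >= min_len consecutive ids,
    where every row in the run has people >= min_people."""
    result, run = [], []
    for row_id, _visit_date, people in sorted(rows):
        if people >= min_people and (not run or row_id == run[-1] + 1):
            # Row qualifies and extends (or starts) the current run
            run.append(row_id)
        else:
            # Run is broken: flush it if long enough, then restart
            if len(run) >= min_len:
                result.extend(run)
            run = [row_id] if people >= min_people else []
    if len(run) >= min_len:
        result.extend(run)
    return result


# Sample rows from the notebook: (id, visit_date, people)
sample = [
    (1, "2017-01-01", 10),
    (2, "2017-01-02", 109),
    (3, "2017-01-03", 150),
    (4, "2017-01-04", 99),
    (5, "2017-01-05", 145),
    (6, "2017-01-06", 1455),
    (7, "2017-01-07", 199),
    (8, "2017-01-09", 188),
]

print(long_runs(sample))  # [5, 6, 7, 8]
```

Comparing this list with the ids in the DataFrame or SQL result is a quick way to confirm the island logic on other inputs.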