Table: Trips


+-------------+----------+
| Column Name | Type     |
+-------------+----------+
| id          | int      |
| client_id   | int      |
| driver_id   | int      |
| city_id     | int      |
| status      | enum     |
| request_at  | varchar  |
+-------------+----------+
id is the primary key (column with unique values) for this table.
The table holds all taxi trips. Each trip has a unique id, while client_id and driver_id are foreign keys to users_id in the Users table.
Status is an ENUM (category) type of ('completed', 'cancelled_by_driver', 'cancelled_by_client').
 

Table: Users

+-------------+----------+
| Column Name | Type     |
+-------------+----------+
| users_id    | int      |
| banned      | enum     |
| role        | enum     |
+-------------+----------+
users_id is the primary key (column with unique values) for this table.
The table holds all users. Each user has a unique users_id, and role is an ENUM type of ('client', 'driver', 'partner').
banned is an ENUM (category) type of ('Yes', 'No').
 

The cancellation rate is computed by dividing the number of canceled (by client or driver) requests with unbanned users by the total number of requests with unbanned users on that day.

Write a solution to find the cancellation rate of requests with unbanned users (both client and driver must not be banned) for each day between "2013-10-01" and "2013-10-03". Round Cancellation Rate to two decimal places.

Return the result table in any order.

The result format is in the following example.

 

Example 1:

Input: 
Trips table:

+----+-----------+-----------+---------+---------------------+------------+
| id | client_id | driver_id | city_id | status              | request_at |
+----+-----------+-----------+---------+---------------------+------------+
| 1  | 1         | 10        | 1       | completed           | 2013-10-01 |
| 2  | 2         | 11        | 1       | cancelled_by_driver | 2013-10-01 |
| 3  | 3         | 12        | 6       | completed           | 2013-10-01 |
| 4  | 4         | 13        | 6       | cancelled_by_client | 2013-10-01 |
| 5  | 1         | 10        | 1       | completed           | 2013-10-02 |
| 6  | 2         | 11        | 6       | completed           | 2013-10-02 |
| 7  | 3         | 12        | 6       | completed           | 2013-10-02 |
| 8  | 2         | 12        | 12      | completed           | 2013-10-03 |
| 9  | 3         | 10        | 12      | completed           | 2013-10-03 |
| 10 | 4         | 13        | 12      | cancelled_by_driver | 2013-10-03 |
+----+-----------+-----------+---------+---------------------+------------+
Users table:

+----------+--------+--------+
| users_id | banned | role   |
+----------+--------+--------+
| 1        | No     | client |
| 2        | Yes    | client |
| 3        | No     | client |
| 4        | No     | client |
| 10       | No     | driver |
| 11       | No     | driver |
| 12       | No     | driver |
| 13       | No     | driver |
+----------+--------+--------+

Output: 

+------------+-------------------+
| Day        | Cancellation Rate |
+------------+-------------------+
| 2013-10-01 | 0.33              |
| 2013-10-02 | 0.00              |
| 2013-10-03 | 0.50              |
+------------+-------------------+
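
The expected output can be checked by hand before involving Spark. The sketch below is plain Python over the example rows; the helper name `cancellation_rates` and the `banned` lookup dict are illustrative assumptions, not part of the problem statement. A trip is skipped if either its client or its driver is banned, and the remaining trips are counted per day.

```python
from collections import defaultdict

# Sample rows copied from the example Trips table:
# (id, client_id, driver_id, city_id, status, request_at)
trips = [
    (1, 1, 10, 1, "completed", "2013-10-01"),
    (2, 2, 11, 1, "cancelled_by_driver", "2013-10-01"),
    (3, 3, 12, 6, "completed", "2013-10-01"),
    (4, 4, 13, 6, "cancelled_by_client", "2013-10-01"),
    (5, 1, 10, 1, "completed", "2013-10-02"),
    (6, 2, 11, 6, "completed", "2013-10-02"),
    (7, 3, 12, 6, "completed", "2013-10-02"),
    (8, 2, 12, 12, "completed", "2013-10-03"),
    (9, 3, 10, 12, "completed", "2013-10-03"),
    (10, 4, 13, 12, "cancelled_by_driver", "2013-10-03"),
]

# users_id -> banned flag, copied from the example Users table
banned = {1: "No", 2: "Yes", 3: "No", 4: "No",
          10: "No", 11: "No", 12: "No", 13: "No"}

CANCELLED = {"cancelled_by_driver", "cancelled_by_client"}


def cancellation_rates(trips, banned, start="2013-10-01", end="2013-10-03"):
    """Hypothetical helper: per-day cancellation rate, rounded to 2 places."""
    totals = defaultdict(int)
    cancelled = defaultdict(int)
    for _id, client_id, driver_id, _city_id, status, day in trips:
        # ISO-formatted dates compare correctly as strings
        if not (start <= day <= end):
            continue
        # Both the client and the driver must be unbanned
        if banned[client_id] == "Yes" or banned[driver_id] == "Yes":
            continue
        totals[day] += 1
        if status in CANCELLED:
            cancelled[day] += 1
    return {day: round(cancelled[day] / totals[day], 2) for day in totals}


rates = cancellation_rates(trips, banned)
for day in sorted(rates):
    print(f"{day} {rates[day]:.2f}")
```

Client 2 is banned, so trips 2, 6, and 8 drop out. That leaves 1 cancellation out of 3 trips on 2013-10-01, 0 out of 2 on 2013-10-02, and 1 out of 2 on 2013-10-03, which matches the table above.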

In [0]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when, sum as spark_sum, count, round as spark_round, floor

# Initialize SparkSession
spark = SparkSession.builder.appName("CancellationRate").getOrCreate()

# Sample data for Trips table
trips_data = [
    (1, 1, 10, 1, "completed", "2013-10-01"),
    (2, 2, 11, 1, "cancelled_by_driver", "2013-10-01"),
    (3, 3, 12, 6, "completed", "2013-10-01"),
    (4, 4, 13, 6, "cancelled_by_client", "2013-10-01"),
    (5, 1, 10, 1, "completed", "2013-10-02"),
    (6, 2, 11, 6, "completed", "2013-10-02"),
    (7, 3, 12, 6, "completed", "2013-10-02"),
    (8, 2, 12, 12, "completed", "2013-10-03"),
    (9, 3, 10, 12, "completed", "2013-10-03"),
    (10, 4, 13, 12, "cancelled_by_driver", "2013-10-03")
]

# Sample data for Users table
users_data = [
    (1, "No", "client"),
    (2, "Yes", "client"),
    (3, "No", "client"),
    (4, "No", "client"),
    (10, "No", "driver"),
    (11, "No", "driver"),
    (12, "No", "driver"),
    (13, "No", "driver")
]

# Create DataFrames
trips_df = spark.createDataFrame(trips_data, ["id", "client_id", "driver_id", "city_id", "status", "request_at"])
users_df = spark.createDataFrame(users_data, ["users_id", "banned", "role"])

In [0]:

# Join the Trips table with the Users table twice: once for clients and once for drivers
client_df = users_df.filter(col("role") == "client").alias("c")
driver_df = users_df.filter(col("role") == "driver").alias("d")

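# Keep only trips in the requested date range, join, then drop trips with a banned client or driver
joined_df = trips_df.filter(col("request_at").between("2013-10-01", "2013-10-03")) \
                    .join(client_df, trips_df.client_id == col("c.users_id")) \
                    .join(driver_df, trips_df.driver_id == col("d.users_id")) \
                    .filter((col("c.banned") == "No") & (col("d.banned") == "No"))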

# Calculate the cancellation rate for each day
result_df = joined_df.groupBy("request_at").agg(
    spark_round(
        spark_sum(when(col("status").isin("cancelled_by_driver", "cancelled_by_client"), 1).otherwise(0)) / 
        count("*"), 2
    ).alias("Cancellation Rate")
)

# Rename the grouping column to match the expected output.
# The rate stays a double, so whole-number rates display as 0.0 rather than 0.
formatted_result_df = result_df.select(
    col("request_at").alias("Day"),
    col("Cancellation Rate")
)

# Show the result
formatted_result_df.display()

Day,Cancellation Rate
2013-10-03,0.5
2013-10-01,0.33
2013-10-02,0.0


In [0]:
trips_df.createOrReplaceTempView('Trips')
users_df.createOrReplaceTempView('Users')

In [0]:
# Write the SQL query
# Backticks quote the alias because it contains a space
query = """
SELECT 
    t.request_at AS Day,
    ROUND(
        SUM(CASE 
                WHEN t.status IN ('cancelled_by_driver', 'cancelled_by_client') THEN 1 
                ELSE 0 
            END) / COUNT(*), 2
    ) AS `Cancellation Rate`
FROM 
    Trips t
JOIN 
    Users c ON t.client_id = c.users_id AND c.banned = 'No' AND c.role = 'client'
JOIN 
    Users d ON t.driver_id = d.users_id AND d.banned = 'No' AND d.role = 'driver'
WHERE 
    t.request_at BETWEEN '2013-10-01' AND '2013-10-03'
GROUP BY 
    t.request_at
"""

# Execute the query using Spark SQL
result_df = spark.sql(query)

# Show the result
result_df.display()