Problem Statement:

Write a SQL query to find the confirmation rate of each user.
The confirmation rate of a user is the number of 'confirmed' messages divided by the total number of requested confirmation messages.
The confirmation rate of a user who did not request any confirmation messages is 0. Round the confirmation rate to two decimal places.
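
As a quick illustration of the definition, the rate can be sketched in plain Python, outside Spark. The function name `confirmation_rate` and the list-of-strings input are assumptions made for this example only; they are not part of the notebook.

```python
def confirmation_rate(actions):
    """Return confirmed / requested, rounded to two decimals.

    `actions` is a hypothetical list of action strings for one user.
    A user with no requested messages gets a rate of 0.0.
    """
    if not actions:
        return 0.0
    confirmed = sum(1 for action in actions if action == "confirmed")
    return round(confirmed / len(actions), 2)


# Hypothetical users with different request histories
print(confirmation_rate(["confirmed", "timeout"]))               # 0.5
print(confirmation_rate(["confirmed", "confirmed", "timeout"]))  # 0.67
print(confirmation_rate([]))                                     # 0.0
```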


In [0]:
from pyspark.sql.types import (
    IntegerType,
    StringType,
    StructField,
    StructType,
    TimestampType,
)
from datetime import datetime

In [0]:
# Schema for Signups
signups_schema = StructType(
    [
        StructField("user_id", IntegerType(), True),
        StructField("time_stamp", TimestampType(), True),
    ]
)

# Schema for Confirmations
confirmations_schema = StructType(
    [
        StructField("user_id", IntegerType(), True),
        StructField("time_stamp", TimestampType(), True),
        StructField("action_", StringType(), True),
    ]
)

In [0]:
# Data for Signups with datetime objects
signups_data = [
    (3, datetime.strptime("2020-03-21 10:16:13", "%Y-%m-%d %H:%M:%S")),
    (7, datetime.strptime("2020-01-04 13:57:59", "%Y-%m-%d %H:%M:%S")),
    (2, datetime.strptime("2020-07-29 23:09:44", "%Y-%m-%d %H:%M:%S")),
    (6, datetime.strptime("2020-12-09 10:39:37", "%Y-%m-%d %H:%M:%S")),
]

# Data for Confirmations with datetime objects
confirmations_data = [
    (3, datetime.strptime("2021-01-06 03:30:46", "%Y-%m-%d %H:%M:%S"), "timeout"),
    (3, datetime.strptime("2021-07-14 14:00:00", "%Y-%m-%d %H:%M:%S"), "timeout"),
    (7, datetime.strptime("2021-06-12 11:57:29", "%Y-%m-%d %H:%M:%S"), "confirmed"),
    (7, datetime.strptime("2021-06-13 12:58:28", "%Y-%m-%d %H:%M:%S"), "confirmed"),
    (7, datetime.strptime("2021-06-14 13:59:27", "%Y-%m-%d %H:%M:%S"), "confirmed"),
    (2, datetime.strptime("2021-01-22 00:00:00", "%Y-%m-%d %H:%M:%S"), "confirmed"),
    (2, datetime.strptime("2021-02-28 23:59:59", "%Y-%m-%d %H:%M:%S"), "timeout"),
]

In [0]:
# Create Signups DataFrame
signups_df = spark.createDataFrame(signups_data, schema=signups_schema)

# Create Confirmations DataFrame
confirmations_df = spark.createDataFrame(
    confirmations_data, schema=confirmations_schema
)

In [0]:
signups_df.display()
confirmations_df.display()

user_id,time_stamp
3,2020-03-21T10:16:13.000+0000
7,2020-01-04T13:57:59.000+0000
2,2020-07-29T23:09:44.000+0000
6,2020-12-09T10:39:37.000+0000


user_id,time_stamp,action_
3,2021-01-06T03:30:46.000+0000,timeout
3,2021-07-14T14:00:00.000+0000,timeout
7,2021-06-12T11:57:29.000+0000,confirmed
7,2021-06-13T12:58:28.000+0000,confirmed
7,2021-06-14T13:59:27.000+0000,confirmed
2,2021-01-22T00:00:00.000+0000,confirmed
2,2021-02-28T23:59:59.000+0000,timeout


In [0]:
signups_df.createOrReplaceTempView("signups")
confirmations_df.createOrReplaceTempView("confirmations")

In [0]:
from pyspark.sql.functions import col, when, avg, format_number

# Left join Signups with Confirmations
joined_df = signups_df.join(confirmations_df, on="user_id", how="left")
# Add confirmation status column
joined_df = joined_df.withColumn(
    "confirmation_status", when(col("action_") == "confirmed", 1.0).otherwise(0.0)
)
# Calculate confirmation rate
confirmation_rate_df = joined_df.groupBy("user_id").agg(
    avg("confirmation_status").alias("confirmation_rate")
)

# Format the confirmation rate to two decimal places (format_number returns a string column)
confirmation_rate_df = confirmation_rate_df.withColumn(
    "confirmation_rate", format_number("confirmation_rate", 2)
)
confirmation_rate_df.display()

user_id,confirmation_rate
3,0.0
7,1.0
2,0.5
6,0.0


In [0]:
%sql
WITH cte AS (
  SELECT
    s.user_id,
    c.action_
  FROM
    signups s
    LEFT JOIN confirmations c ON s.user_id = c.user_id
)
SELECT
  user_id,
  CAST(
    AVG(
      CASE
        WHEN action_ = 'confirmed' THEN 1.00
        ELSE 0.00
      END
    ) AS DECIMAL(6, 2)
  ) AS confirmation_rate
FROM
  cte
GROUP BY
  user_id;

user_id,confirmation_rate
3,0.0
7,1.0
2,0.5
6,0.0


Explanation:

CTE Definition:

The cte combines data from the signups and confirmations tables.
It ensures that all user_ids from signups are included (via the LEFT JOIN), along with the corresponding action_ value from confirmations. A user with no rows in confirmations still appears once, with a NULL action_.
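
The effect of the LEFT JOIN can be sketched in plain Python. This is only an illustration of the join semantics, not how Spark executes it; the helper `left_join` is a hypothetical name, and the rows are the notebook's sample data with the timestamps left out.

```python
# Sample rows copied from the notebook's data (timestamps omitted for brevity)
signups = [3, 7, 2, 6]
confirmations = [
    (3, "timeout"), (3, "timeout"),
    (7, "confirmed"), (7, "confirmed"), (7, "confirmed"),
    (2, "confirmed"), (2, "timeout"),
]


def left_join(signup_ids, confirmation_rows):
    """Pair each signup with its confirmation actions; unmatched users get None."""
    joined = []
    for user_id in signup_ids:
        actions = [action for uid, action in confirmation_rows if uid == user_id]
        if actions:
            joined.extend((user_id, action) for action in actions)
        else:
            # Mirrors the NULL action_ that a LEFT JOIN produces for user 6
            joined.append((user_id, None))
    return joined


joined_rows = left_join(signups, confirmations)
print(len(joined_rows))  # 8
print(joined_rows[-1])   # (6, None)
```

User 6 has no confirmation rows, yet it survives the join, which is what lets the main query report a rate of 0 for it.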
Main Query:

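The AVG function averages 1.00 (for rows where action_ is 'confirmed') and 0.00 (for every other row, including the NULL action_ of a user with no confirmation requests). The mean of these flags is the number of confirmed messages divided by the total number of requests, which is the confirmation_rate.
The CAST to DECIMAL(6, 2) converts the result to a decimal with a scale of 2, that is, two digits after the decimal point.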
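
The averaging and the cast described above can be mimicked with Python's `decimal` module. This is a rough sketch: the helper names are hypothetical, and half-up rounding is an assumption about how the cast behaves rather than something stated in the notebook.

```python
from decimal import Decimal, ROUND_HALF_UP


def average_confirmed(actions):
    """Average of 1/0 flags; None (a NULL action_) falls into the 0 branch."""
    flags = [1 if action == "confirmed" else 0 for action in actions]
    return Decimal(sum(flags)) / Decimal(len(flags))


def to_two_places(value):
    """Keep two digits after the decimal point (assumes half-up rounding)."""
    return value.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


print(to_two_places(average_confirmed(["confirmed", "timeout"])))  # 0.50
print(to_two_places(average_confirmed([None])))                    # 0.00
```

Because every row contributes either 1 or 0, the mean equals confirmed / total, so no separate COUNT is needed.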
GROUP BY Clause:

This groups the results by user_id, ensuring a confirmation_rate is calculated for each user.