Assume you have an events table on Facebook app analytics. Write a query to calculate the click-through rate (CTR) for the app in 2022 and round the results to 2 decimal places.

Definition and note:

Percentage of click-through rate (CTR) = 100.0 * Number of clicks / Number of impressions
To avoid integer division, multiply by 100.0 rather than 100, so that the division is carried out in floating point.
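As a quick check of the arithmetic, the sketch below uses plain Python, where `//` stands in for the integer division that some SQL dialects apply to integer operands. The counts (2 clicks, 3 impressions) are illustrative values assumed for this sketch, not part of the problem statement.

```python
# Illustrative counts (assumed for this sketch).
clicks = 2
impressions = 3

# Integer division truncates: dividing first loses the ratio entirely.
truncated_ratio = clicks // impressions * 100
# Multiplying by an integer 100 first still drops the fractional part.
truncated_pct = 100 * clicks // impressions
# Multiplying by 100.0 makes the whole expression floating point.
ctr = round(100.0 * clicks / impressions, 2)

print(truncated_ratio, truncated_pct, ctr)  # 0 66 66.67
```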

In [0]:
from pyspark.sql.types import (
    StructType,
    StructField,
    IntegerType,
    StringType,
    TimestampType,
)
from pyspark.sql.functions import col
import datetime

# Define schema for the data
schema = StructType(
    [
        StructField("app_id", IntegerType(), True),
        StructField("event_type", StringType(), True),
        StructField("timestamp", TimestampType(), True),
    ]
)

# Create data as a list of tuples
data = [
    (123, "impression", datetime.datetime(2022, 7, 18, 11, 36, 12)),
    (123, "impression", datetime.datetime(2022, 7, 18, 11, 37, 12)),
    (123, "click", datetime.datetime(2022, 7, 18, 11, 37, 42)),
    (234, "impression", datetime.datetime(2022, 8, 18, 14, 15, 12)),
    (234, "click", datetime.datetime(2022, 8, 18, 14, 16, 12)),
    (123, "impression", datetime.datetime(2021, 10, 23, 12, 11, 56)),
    (123, "click", datetime.datetime(2021, 10, 23, 12, 22, 12)),
    (123, "impression", datetime.datetime(2022, 4, 6, 13, 13, 13)),
    (123, "click", datetime.datetime(2022, 4, 7, 12, 20, 30)),
    (234, "impression", datetime.datetime(2022, 2, 9, 10, 5, 2)),
    (234, "impression", datetime.datetime(2022, 5, 20, 12, 0, 0)),
]

# Create DataFrame
df = spark.createDataFrame(data, schema=schema)

# display the DataFrame
df.display()

app_id,event_type,timestamp
123,impression,2022-07-18T11:36:12.000+0000
123,impression,2022-07-18T11:37:12.000+0000
123,click,2022-07-18T11:37:42.000+0000
234,impression,2022-08-18T14:15:12.000+0000
234,click,2022-08-18T14:16:12.000+0000
123,impression,2021-10-23T12:11:56.000+0000
123,click,2021-10-23T12:22:12.000+0000
123,impression,2022-04-06T13:13:13.000+0000
123,click,2022-04-07T12:20:30.000+0000
234,impression,2022-02-09T10:05:02.000+0000
234,impression,2022-05-20T12:00:00.000+0000


In [0]:
from pyspark.sql import functions as F

# Filter the data for the desired date range
filtered_df = df.filter((df.timestamp >= "2022-01-01") & (df.timestamp < "2023-01-01"))

# Calculate CTR for each app_id
ctr_df = filtered_df.groupBy("app_id").agg(
    F.round(
        100.0
        * F.sum(F.when(F.col("event_type") == "click", 1).otherwise(0))
        / F.sum(F.when(F.col("event_type") == "impression", 1).otherwise(0)),
        2,
    ).alias("ctr_app")
)

# Show the results
ctr_df.display()

app_id,ctr_app
123,66.67
234,33.33


In [0]:
df.createOrReplaceTempView("events")

In [0]:
%sql
WITH ctr_rate AS (
  SELECT
    app_id,
    ROUND(
      100.0 * COUNT(
        CASE
          WHEN event_type = 'click' THEN 1
        END
      ) / NULLIF(
        COUNT(
          CASE
            WHEN event_type = 'impression' THEN 1
          END
        ),
        0
      ),
      2
    ) AS ctr_rate
  FROM
    events
  WHERE
    timestamp >= '2022-01-01'
    AND timestamp < '2023-01-01'
  GROUP BY
    app_id
)
SELECT
  *
FROM
  ctr_rate;

app_id,ctr_rate
123,66.67
234,33.33


Explanation:

Filter:
Keeps only events whose timestamp falls in 2022, using a half-open range: on or after 2022-01-01 and strictly before 2023-01-01.
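The half-open range matters for events late on the last day of the year. A minimal sketch in plain Python, where the helper name `in_2022` is hypothetical and exists only for illustration:

```python
from datetime import datetime


def in_2022(ts: datetime) -> bool:
    """Half-open check: start is inclusive, end is exclusive."""
    return datetime(2022, 1, 1) <= ts < datetime(2023, 1, 1)


last_moment = datetime(2022, 12, 31, 23, 59, 59, 999999)

print(in_2022(last_moment))                         # True
print(in_2022(datetime(2023, 1, 1)))                # False
print(in_2022(datetime(2021, 10, 23, 12, 11, 56)))  # False

# A closed upper bound at midnight on Dec 31 would miss most of that day.
print(last_moment <= datetime(2022, 12, 31))        # False
```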

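Aggregation:
F.when(...).otherwise(0) turns each row into a 1 or a 0, once for clicks and once for impressions.
F.sum(...) adds those flags up, giving the click count and the impression count for each app_id.

CTR is then 100.0 * clicks / impressions; multiplying by 100.0 up front keeps the arithmetic in floating point.
F.round(..., 2) rounds the CTR to two decimal places.
The SQL version does the same with COUNT(CASE ...) and wraps the impression count in NULLIF(..., 0), so an app with no impressions yields NULL rather than a division by zero.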

Alias:
Names the result column ctr_app; the SQL query calls the equivalent column ctr_rate.

The result is the click-through rate (CTR) per app_id for 2022.
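To cross-check the numbers without a Spark session, the same filter and aggregation can be sketched in plain Python. The function name `ctr_by_app` is hypothetical; the rows are the ones used to build the DataFrame, and returning None for zero impressions is this sketch's stand-in for the NULLIF guard in the SQL query.

```python
from collections import Counter
from datetime import datetime

# Same rows as the DataFrame: (app_id, event_type, timestamp).
events = [
    (123, "impression", datetime(2022, 7, 18, 11, 36, 12)),
    (123, "impression", datetime(2022, 7, 18, 11, 37, 12)),
    (123, "click", datetime(2022, 7, 18, 11, 37, 42)),
    (234, "impression", datetime(2022, 8, 18, 14, 15, 12)),
    (234, "click", datetime(2022, 8, 18, 14, 16, 12)),
    (123, "impression", datetime(2021, 10, 23, 12, 11, 56)),
    (123, "click", datetime(2021, 10, 23, 12, 22, 12)),
    (123, "impression", datetime(2022, 4, 6, 13, 13, 13)),
    (123, "click", datetime(2022, 4, 7, 12, 20, 30)),
    (234, "impression", datetime(2022, 2, 9, 10, 5, 2)),
    (234, "impression", datetime(2022, 5, 20, 12, 0, 0)),
]


def ctr_by_app(rows, year=2022):
    """Return {app_id: CTR in percent, rounded to 2 places} for one year."""
    start, end = datetime(year, 1, 1), datetime(year + 1, 1, 1)

    # Count events per (app_id, event_type) inside the half-open range.
    counts = Counter()
    for app_id, event_type, ts in rows:
        if start <= ts < end:
            counts[(app_id, event_type)] += 1

    result = {}
    for app_id in sorted({app for app, _ in counts}):
        clicks = counts[(app_id, "click")]
        impressions = counts[(app_id, "impression")]
        # No impressions means no defined CTR (stand-in for NULLIF(..., 0)).
        result[app_id] = (
            round(100.0 * clicks / impressions, 2) if impressions else None
        )
    return result


print(ctr_by_app(events))  # {123: 66.67, 234: 33.33}
```

Both apps have three impressions in 2022, with two clicks for app 123 and one for app 234, which matches the 66.67 and 33.33 shown in the outputs above.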