Problem Statement:

Google's marketing team is making a Super Bowl commercial and needs a simple statistic to put on their TV ad: the median number of searches a person made last year.

However, at Google scale, querying the 2 trillion searches is too costly. Luckily, you have access to a summary table that lists each number of searches made last year and how many Google users fall into that bucket.

Write a query to report the median number of searches made by a user. Round the median to one decimal place.
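
To make the target concrete, here is a minimal plain-Python sketch (no Spark) that expands a small summary table into one value per user and takes the median of that list. The rows are the same sample rows used in the cells below; the helper name `expand_buckets` is hypothetical and exists only for this illustration. Expanding works for a toy table, but it is exactly the step that is too costly at full scale.

```python
from statistics import median

# (searches, num_users) buckets, same sample rows as the summary table below
buckets = [(1, 2), (2, 2), (3, 3), (4, 1)]

def expand_buckets(rows):
    """Return one searches value per user, in ascending order (hypothetical helper)."""
    expanded = []
    for searches, num_users in sorted(rows):
        expanded.extend([searches] * num_users)
    return expanded

per_user = expand_buckets(buckets)
print(per_user)                    # [1, 1, 2, 2, 3, 3, 3, 4]
print(round(median(per_user), 1))  # 2.5
```

With 8 users, the two middle users are the 4th and 5th, who made 2 and 3 searches, so the median is 2.5.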

In [0]:
from pyspark.sql.types import StructType, StructField, IntegerType

# Define the schema for the DataFrame
schema = StructType([
    StructField("searches", IntegerType(), True),
    StructField("num_users", IntegerType(), True)
])

# Create the data
data = [
    (1, 2),
    (2, 2),
    (3, 3),
    (4, 1)
]

# Create DataFrame
df = spark.createDataFrame(data, schema)

# Display the DataFrame
df.display()


searches,num_users
1,2
2,2
3,3
4,1


In [0]:
from pyspark.sql import Window
from pyspark.sql import functions as F

# Each row is a bucket of num_users users, so the median must be taken over
# users, not over table rows. Total number of users across all buckets:
total_users = df.agg(F.sum("num_users")).first()[0]
half = total_users / 2

# Running total of users ordered by searches: each bucket covers the user
# positions (cum_start, cum_end]
w = Window.orderBy("searches")
df_cum = (
    df.withColumn("cum_end", F.sum("num_users").over(w))
      .withColumn("cum_start", F.col("cum_end") - F.col("num_users"))
)

# Keep the bucket(s) holding the middle user(s): one bucket for an odd total,
# one or two for an even total. Averaging their searches gives the median.
median_value = (
    df_cum.filter((F.col("cum_start") <= half) & (F.col("cum_end") >= half))
          .agg(F.round(F.avg("searches"), 1).alias("median"))
          .first()["median"]
)

# Show the result
print(f"Median: {median_value}")


Median: 2.5


In [0]:
# Create a temporary view for SQL operations
df.createOrReplaceTempView("search_frequency")

In [0]:
%sql
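WITH RunningTotals AS (
  SELECT
    searches,
    -- Users up to and including this bucket
    SUM(num_users) OVER (ORDER BY searches) AS cum_end,
    -- Users strictly before this bucket
    SUM(num_users) OVER (ORDER BY searches) - num_users AS cum_start,
    SUM(num_users) OVER () AS total_users
  FROM
    search_frequency
)

SELECT
  ROUND(AVG(searches), 1) AS median
FROM
  RunningTotals
WHERE
  -- Keep the bucket(s) that contain the middle user(s)
  cum_start <= total_users / 2.0
  AND cum_end >= total_users / 2.0;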


median
2.5


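Explanation:

Each row of the summary table is a bucket: searches is the number of searches, and num_users is how many users made that many. The median is taken over users, not over table rows, so every searches value has to be weighted by its num_users.

Ordering the buckets by searches and keeping a running total of num_users gives the range of user positions that each bucket covers.

For an odd number of users, exactly one bucket contains the middle user, and its searches value is the median.

For an even number of users, the two middle users fall either in the same bucket or in two adjacent buckets; the median is the average of the searches values of those buckets.

The sample data has 8 users. The 4th user falls in the searches = 2 bucket and the 5th in the searches = 3 bucket, so the median is (2 + 3) / 2 = 2.5. This is calculated without using approxQuantile().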
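
As a Spark-free cross-check, the same median can be computed in plain Python by walking the buckets in ascending order while tracking a running user count. This is a sketch under stated assumptions rather than a definitive implementation: the function name `weighted_median` is hypothetical, and rows are assumed to be `(searches, num_users)` tuples with distinct `searches` values.

```python
def weighted_median(rows):
    """Median of searches over users, given (searches, num_users) buckets.

    Hypothetical helper for illustration; assumes distinct searches values.
    """
    rows = sorted(r for r in rows if r[1] > 0)  # ignore empty buckets
    total_users = sum(num_users for _, num_users in rows)
    if total_users == 0:
        raise ValueError("no users in the summary table")
    half = total_users / 2

    picked = []
    cum_end = 0
    for searches, num_users in rows:
        cum_start = cum_end   # users strictly before this bucket
        cum_end += num_users  # users up to and including this bucket
        # The bucket holds a middle user when its span covers total_users / 2
        if cum_start <= half <= cum_end:
            picked.append(searches)

    # One bucket picked for an odd total, one or two for an even total
    return round(sum(picked) / len(picked), 1)


print(weighted_median([(1, 2), (2, 2), (3, 3), (4, 1)]))  # 2.5
print(weighted_median([(1, 5), (2, 1), (3, 1)]))          # 1.0
```

The second call shows why the weighting matters: the table has three rows whose middle searches value is 2, but five of the seven users made a single search, so the median per user is 1.0.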