Problem Statement:

Write an SQL query to find the best-selling product in each product category. If two or more products have the same sales quantity, choose the one with the higher review rating.

Return the category name and product name in alphabetical order of the category.

In [0]:
# Create Products DataFrame
products_data = [
    (3690, "Game of Thrones", "Books"),
    (5520, "Refrigerator", "Home Appliances"),
    (5952, "Dishwasher", "Home Appliances"),
    (3561, "IKGAI", "Books"),
    (8741, "Convertible Laptop", "Tech Gadgets"),
    (8154, "Gaming Keyboard", "Tech Gadgets"),
    (8963, "Ultra Slim Smartphone", "Tech Gadgets"),
    (5666, "Washing Machine", "Home Appliances"),
    (3300, "Ace the Data Science Interview", "Books"),
    (6078, "Kindle Oasis", "Amazon Kindle"),
    (6077, "Kindle Paperwhite", "Amazon Kindle"),
]

products_columns = ["product_id", "product_name", "category_name"]
products_df = spark.createDataFrame(products_data, products_columns)

# Create Product_Sales DataFrame
sales_data = [
    (3690, 300, 4.9),
    (5520, 70, 3.8),
    (5952, 70, 4.0),
    (3561, 290, 4.5),
    (3300, 450, 5.0),
    (6077, 126, 4.1),
    (6078, 230, 4.3),
    (8741, 40, 3.5),
    (8963, 190, 4.5),
    (5666, 30, 3.4),
    (8154, 190, 4.6),
]

sales_columns = ["product_id", "sales_quantity", "rating"]
sales_df = spark.createDataFrame(sales_data, sales_columns)

# display the data
products_df.display()
sales_df.display()

product_id,product_name,category_name
3690,Game of Thrones,Books
5520,Refrigerator,Home Appliances
5952,Dishwasher,Home Appliances
3561,IKGAI,Books
8741,Convertible Laptop,Tech Gadgets
8154,Gaming Keyboard,Tech Gadgets
8963,Ultra Slim Smartphone,Tech Gadgets
5666,Washing Machine,Home Appliances
3300,Ace the Data Science Interview,Books
6078,Kindle Oasis,Amazon Kindle
6077,Kindle Paperwhite,Amazon Kindle


product_id,sales_quantity,rating
3690,300,4.9
5520,70,3.8
5952,70,4.0
3561,290,4.5
3300,450,5.0
6077,126,4.1
6078,230,4.3
8741,40,3.5
8963,190,4.5
5666,30,3.4
8154,190,4.6


In [0]:
products_df.createOrReplaceTempView("products")
sales_df.createOrReplaceTempView("product_sales")

In [0]:
%sql
WITH RankedProducts AS (
  SELECT
    RANK() OVER (
      PARTITION BY p.category_name
      ORDER BY
        ps.sales_quantity DESC,
        ps.rating DESC
    ) AS product_rank,
    p.category_name,
    p.product_name,
    ps.sales_quantity,
    ps.rating
  FROM
    products p
    JOIN product_sales ps ON p.product_id = ps.product_id
)
SELECT
  category_name,
  product_name
FROM
  RankedProducts
WHERE
  product_rank = 1
ORDER BY
  category_name

category_name,product_name
Amazon Kindle,Kindle Oasis
Books,Ace the Data Science Interview
Home Appliances,Dishwasher
Tech Gadgets,Gaming Keyboard
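
For readers without a Spark cluster, the same query can be tried against an in-memory SQLite database. This is only a sketch under assumptions: SQLite (version 3.25 or later, needed for window functions) stands in for Spark SQL, the table and column names are copied from the notebook, and an ORDER BY on category_name is included to meet the ordering requirement in the problem statement.

```python
import sqlite3

# In-memory SQLite database standing in for the Spark temp views (assumption)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (product_id INTEGER, product_name TEXT, category_name TEXT)"
)
conn.execute(
    "CREATE TABLE product_sales (product_id INTEGER, sales_quantity INTEGER, rating REAL)"
)

# Sample rows copied from the notebook cells above
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [
        (3690, "Game of Thrones", "Books"),
        (5520, "Refrigerator", "Home Appliances"),
        (5952, "Dishwasher", "Home Appliances"),
        (3561, "IKGAI", "Books"),
        (8741, "Convertible Laptop", "Tech Gadgets"),
        (8154, "Gaming Keyboard", "Tech Gadgets"),
        (8963, "Ultra Slim Smartphone", "Tech Gadgets"),
        (5666, "Washing Machine", "Home Appliances"),
        (3300, "Ace the Data Science Interview", "Books"),
        (6078, "Kindle Oasis", "Amazon Kindle"),
        (6077, "Kindle Paperwhite", "Amazon Kindle"),
    ],
)
conn.executemany(
    "INSERT INTO product_sales VALUES (?, ?, ?)",
    [
        (3690, 300, 4.9), (5520, 70, 3.8), (5952, 70, 4.0), (3561, 290, 4.5),
        (3300, 450, 5.0), (6077, 126, 4.1), (6078, 230, 4.3), (8741, 40, 3.5),
        (8963, 190, 4.5), (5666, 30, 3.4), (8154, 190, 4.6),
    ],
)

# Same ranking query as the %sql cell, with the result sorted by category
query = """
WITH RankedProducts AS (
  SELECT
    RANK() OVER (
      PARTITION BY p.category_name
      ORDER BY ps.sales_quantity DESC, ps.rating DESC
    ) AS product_rank,
    p.category_name,
    p.product_name
  FROM products p
  JOIN product_sales ps ON p.product_id = ps.product_id
)
SELECT category_name, product_name
FROM RankedProducts
WHERE product_rank = 1
ORDER BY category_name
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Dialect differences between SQLite and Spark SQL do not affect this particular query, but that should not be assumed for other queries.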


In [0]:
from pyspark.sql import Window
from pyspark.sql.functions import col, rank

# Define the window specification
windowSpec = Window.partitionBy("category_name").orderBy(
    col("sales_quantity").desc(), col("rating").desc()
)

# Join the products and sales DataFrames
joined_df = products_df.join(sales_df, on="product_id")

# Add the rank column
ranked_df = joined_df.withColumn("product_rank", rank().over(windowSpec))

# Filter to the top-ranked product in each category (product_rank = 1)
# and sort by category name, as the problem statement requires
top_products_df = (
    ranked_df.filter(col("product_rank") == 1)
    .select("category_name", "product_name")
    .orderBy("category_name")
)

# Show the result
top_products_df.display()

category_name,product_name
Amazon Kindle,Kindle Oasis
Books,Ace the Data Science Interview
Home Appliances,Dishwasher
Tech Gadgets,Gaming Keyboard


Explanation:

A CTE (RankedProducts) is used to:

Join the products and product_sales tables on product_id.

Use RANK() to rank products within each category by sales quantity (highest first), breaking ties by rating (highest first).

The final query selects the top-ranked product (product_rank = 1) for each category. Because RANK() assigns the same rank to rows that tie on both sales quantity and rating, such products would all be returned.
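
The ranking logic described above can also be traced by hand in plain Python. The following is a minimal sketch rather than the notebook's method: it rebuilds the sample rows as tuples, and the helper name `best_sellers` is hypothetical. It keeps every product that matches the best (sales_quantity, rating) pair in its category, which mirrors how RANK() treats rows tied on both columns.

```python
from collections import defaultdict

# Sample rows copied from the notebook: (product_id, product_name, category_name)
products = [
    (3690, "Game of Thrones", "Books"),
    (5520, "Refrigerator", "Home Appliances"),
    (5952, "Dishwasher", "Home Appliances"),
    (3561, "IKGAI", "Books"),
    (8741, "Convertible Laptop", "Tech Gadgets"),
    (8154, "Gaming Keyboard", "Tech Gadgets"),
    (8963, "Ultra Slim Smartphone", "Tech Gadgets"),
    (5666, "Washing Machine", "Home Appliances"),
    (3300, "Ace the Data Science Interview", "Books"),
    (6078, "Kindle Oasis", "Amazon Kindle"),
    (6077, "Kindle Paperwhite", "Amazon Kindle"),
]

# Sample rows copied from the notebook: (product_id, sales_quantity, rating)
sales = [
    (3690, 300, 4.9), (5520, 70, 3.8), (5952, 70, 4.0), (3561, 290, 4.5),
    (3300, 450, 5.0), (6077, 126, 4.1), (6078, 230, 4.3), (8741, 40, 3.5),
    (8963, 190, 4.5), (5666, 30, 3.4), (8154, 190, 4.6),
]


def best_sellers(products, sales):
    """Return (category_name, product_name) pairs ranked first in each category.

    Hypothetical helper: ranks by sales_quantity, then rating, both descending.
    """
    sales_by_id = {pid: (qty, rating) for pid, qty, rating in sales}

    # Group (sort_key, product_name) pairs by category
    by_category = defaultdict(list)
    for pid, name, category in products:
        if pid in sales_by_id:  # inner-join semantics: skip products with no sales row
            by_category[category].append((sales_by_id[pid], name))

    result = []
    for category in sorted(by_category):  # alphabetical order of category
        best_key = max(key for key, _ in by_category[category])
        # Keep every product tied on both columns, as RANK() would
        result.extend(
            (category, name) for key, name in by_category[category] if key == best_key
        )
    return result


for row in best_sellers(products, sales):
    print(row)
```

In the sample data, Home Appliances and Tech Gadgets each contain two products with equal sales quantity (70 and 190 respectively), so those categories exercise the rating tie-breaker.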

Example Output:

category_name	product_name

Books	Game of Thrones

Home Appliances	Dishwasher

The dataset you are querying against may have different input & output - this is just an example!