Problem statement:

Given the reviews table, write a query to retrieve the average star rating for each product, grouped by month. The output should display the month as a numerical value, product ID, and average star rating rounded to two decimal places. Sort the output first by month and then by product ID.

In [0]:
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

# Define schema
schema = StructType(
    [
        StructField("review_id", IntegerType(), True),
        StructField("user_id", IntegerType(), True),
        StructField("submit_date", StringType(), True),
        StructField("product_id", IntegerType(), True),
        StructField("stars", IntegerType(), True),
    ]
)

# Define data
data = [
    (6171, 123, "08/16/2022 12:00:00", 50001, 4),
    (7802, 265, "10/28/2022 12:00:00", 69852, 4),
    (5293, 362, "10/04/2021 12:00:00", 50001, 3),
    (6352, 192, "10/06/2024 12:00:00", 69852, 3),
    (4517, 981, "09/16/2024 12:00:00", 69852, 2),
]

# Create DataFrame
df = spark.createDataFrame(data, schema=schema)

# display DataFrame
df.display()

review_id,user_id,submit_date,product_id,stars
6171,123,08/16/2022 12:00:00,50001,4
7802,265,10/28/2022 12:00:00,69852,4
5293,362,10/04/2021 12:00:00,50001,3
6352,192,10/06/2024 12:00:00,69852,3
4517,981,09/16/2024 12:00:00,69852,2


In [0]:
from pyspark.sql import functions as F

# Convert 'submit_date' to timestamp, extract month, and calculate average stars
df = df.withColumn("submit_date", F.to_timestamp("submit_date", "MM/dd/yyyy HH:mm:ss"))
result_df = (
    df.groupBy(F.month("submit_date").alias("mth"), "product_id")
    .agg(F.round(F.avg("stars"), 2).alias("avg_stars"))
    .orderBy("mth", "product_id")
)

# display the result
result_df.display()

mth,product_id,avg_stars
8,50001,4.0
9,69852,2.0
10,50001,3.0
10,69852,3.5


In [0]:
df.createOrReplaceTempView("reviews")

In [0]:
%sql
SELECT
  MONTH(submit_date) AS mth,
  product_id,
  ROUND(AVG(stars), 2) AS avg_stars
FROM
  reviews
GROUP BY
  MONTH(submit_date),
  product_id
ORDER BY
  mth,
  product_id;

mth,product_id,avg_stars
8,50001,4.0
9,69852,2.0
10,50001,3.0
10,69852,3.5
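
The two result tables above can be cross-checked without a Spark session. The following is a minimal plain-Python sketch, not part of the original notebook: it reuses the five sample rows, and the helper name monthly_avg_stars is hypothetical. Python's round() and Spark's ROUND can treat exact ties at the rounded digit differently, but no such ties occur in this data.

```python
from collections import defaultdict
from datetime import datetime

# Same five rows as the notebook's sample data:
# (review_id, user_id, submit_date, product_id, stars)
rows = [
    (6171, 123, "08/16/2022 12:00:00", 50001, 4),
    (7802, 265, "10/28/2022 12:00:00", 69852, 4),
    (5293, 362, "10/04/2021 12:00:00", 50001, 3),
    (6352, 192, "10/06/2024 12:00:00", 69852, 3),
    (4517, 981, "09/16/2024 12:00:00", 69852, 2),
]


def monthly_avg_stars(rows):
    """Hypothetical helper: average stars per (month, product_id), sorted."""
    groups = defaultdict(list)
    for _review_id, _user_id, submit_date, product_id, stars in rows:
        # Parse the MM/dd/yyyy HH:mm:ss string and keep only the month number
        month = datetime.strptime(submit_date, "%m/%d/%Y %H:%M:%S").month
        groups[(month, product_id)].append(stars)
    # Sorting the (month, product_id) keys mirrors ORDER BY mth, product_id
    return [
        (month, product_id, round(sum(values) / len(values), 2))
        for (month, product_id), values in sorted(groups.items())
    ]


result = monthly_avg_stars(rows)
for row in result:
    print(row)
```

The printed tuples should match the mth, product_id, avg_stars rows shown above.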


Explanation:

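MONTH(submit_date) AS mth: Extracts the month number (1 to 12) using MONTH(), which Spark SQL supports natively. This works because submit_date was converted from a string to a timestamp before the temp view was created.
ROUND(AVG(stars), 2) AS avg_stars: Computes the average of stars for each group and rounds it to two decimal places.
The query groups and sorts by mth and product_id, matching the PySpark version above, and returns the average star rating per product per month. Because only the month number is used, reviews from the same month in different years fall into the same group: the 3.5 for product 69852 in month 10 averages reviews from October 2022 and October 2024.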
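
Grouping by month number alone combines reviews from different years, which is what the problem statement asks for. If per-year figures were wanted instead (an assumption, not a requirement stated above), the year could be added to the grouping key. The plain-Python sketch below illustrates the difference on the same sample rows; the helper name yearly_monthly_avg_stars is hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# (submit_date, product_id, stars) taken from the notebook's sample data
rows = [
    ("08/16/2022 12:00:00", 50001, 4),
    ("10/28/2022 12:00:00", 69852, 4),
    ("10/04/2021 12:00:00", 50001, 3),
    ("10/06/2024 12:00:00", 69852, 3),
    ("09/16/2024 12:00:00", 69852, 2),
]


def yearly_monthly_avg_stars(rows):
    """Hypothetical helper: average stars per (year, month, product_id)."""
    groups = defaultdict(list)
    for submit_date, product_id, stars in rows:
        ts = datetime.strptime(submit_date, "%m/%d/%Y %H:%M:%S")
        # Including the year keeps October 2022 and October 2024 apart
        groups[(ts.year, ts.month, product_id)].append(stars)
    return [
        key + (round(sum(values) / len(values), 2),)
        for key, values in sorted(groups.items())
    ]


yearly_result = yearly_monthly_avg_stars(rows)
for row in yearly_result:
    print(row)
```

With the year in the key, every sample review lands in its own group, so product 69852 shows 4.0 for October 2022 and 3.0 for October 2024 instead of a combined 3.5.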
