Problem Statement:

You are working with a sales dataset that records a sale ID, a product name, and a sale date for each sale. The product names are not standardized: they contain leading/trailing spaces and inconsistent letter case (e.g., "LCPhone", " LCPHONE", "LcPhOnE"). The objective is to clean the product names and analyze the sales data by:

1. Formatting the sale_date to show only the year and month (YYYY-MM).
2. Standardizing the product_name by trimming any extra spaces and converting all names to lowercase.
3. Counting the number of products sold for each combination of product name and sale month.

The results should be grouped by the formatted sale date and the standardized product name.

In [0]:
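# Import only the types the schema needs instead of a wildcard import
from pyspark.sql.types import (
    DateType,
    IntegerType,
    StringType,
    StructField,
    StructType,
)
import datetime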

# Define the schema
schema = StructType(
    [
        StructField("sale_id", IntegerType(), True),
        StructField("product_name", StringType(), True),
        StructField("sale_date", DateType(), True),
    ]
)

# Sample data
data = [
    (1, "  LCPHONE", datetime.date(2000, 1, 16)),
    (2, "LCPhone", datetime.date(2000, 1, 17)),
    (3, "LcPhOnE", datetime.date(2000, 2, 18)),
    (4, "LCKeyCHAiN  ", datetime.date(2000, 2, 19)),
    (5, "LCKeyChain", datetime.date(2000, 2, 28)),
    (6, " Matryoshka", datetime.date(2000, 3, 31)),
]

# Create DataFrame
df = spark.createDataFrame(data, schema=schema)

# Display the DataFrame (display() is a Databricks notebook feature; df.show() works elsewhere)
df.display()

sale_id,product_name,sale_date
1,LCPHONE,2000-01-16
2,LCPhone,2000-01-17
3,LcPhOnE,2000-02-18
4,LCKeyCHAiN,2000-02-19
5,LCKeyChain,2000-02-28
6,Matryoshka,2000-03-31


In [0]:
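# Import only what is used; a wildcard import would shadow Python builtins such as sum, min, and max
from pyspark.sql.functions import count, date_format, lower, trim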

# Apply transformations and group by date and product name
df_grouped = (
    df.withColumn("year_months", date_format("sale_date", "yyyy-MM"))
    .withColumn("product_name", lower(trim("product_name")))
    .groupBy("year_months", "product_name")
    .agg(count("sale_id").alias("product_sold"))
)

# Display the result
df_grouped.display()

year_months,product_name,product_sold
2000-01,lcphone,2
2000-02,lcphone,1
2000-02,lckeychain,2
2000-03,matryoshka,1
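
As a quick cross-check of the counts above, the same three steps (format the date, clean the name, count per group) can be sketched in plain Python without Spark. This is only an illustration under stated assumptions: `rows` repeats the sample data, `normalize` is a hypothetical helper that does not exist in the notebook, and `strip(" ")` is used to mirror Spark's `trim`, which by default is understood to strip space characters only (worth verifying against your Spark version).

```python
import datetime
from collections import Counter

# Sample rows copied from the DataFrame above: (sale_id, product_name, sale_date)
rows = [
    (1, "  LCPHONE", datetime.date(2000, 1, 16)),
    (2, "LCPhone", datetime.date(2000, 1, 17)),
    (3, "LcPhOnE", datetime.date(2000, 2, 18)),
    (4, "LCKeyCHAiN  ", datetime.date(2000, 2, 19)),
    (5, "LCKeyChain", datetime.date(2000, 2, 28)),
    (6, " Matryoshka", datetime.date(2000, 3, 31)),
]


def normalize(name):
    """Hypothetical helper: strip spaces and lowercase, like lower(trim(...))."""
    return name.strip(" ").lower()


# Count sales per (year-month, cleaned product name) pair
counts = Counter(
    (sale_date.strftime("%Y-%m"), normalize(name))
    for _sale_id, name, sale_date in rows
)

for (year_month, product), sold in sorted(counts.items()):
    print(year_month, product, sold)
```

The four groups and their counts should match the Spark output, although the row order may differ because Spark does not guarantee an order after groupBy.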


In [0]:
# Register the DataFrame as a temporary view
df.createOrReplaceTempView("sales")

In [0]:
# Write and execute the Spark SQL query
result = spark.sql(
    """
    SELECT DATE_FORMAT(sale_date, 'yyyy-MM') as year_months,
           LOWER(TRIM(product_name)) as product_names,
           COUNT(sale_id) as product_sold
    FROM sales
    GROUP BY year_months, product_names
"""
)

# Show the result
result.display()

year_months,product_names,product_sold
2000-01,lcphone,2
2000-02,lcphone,1
2000-02,lckeychain,2
2000-03,matryoshka,1


Explanation:

createOrReplaceTempView("sales"): Registers the DataFrame as a temporary view named sales, so it can be queried with SQL.

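Spark SQL query: The syntax is close to that of other SQL dialects. The main difference here is the DATE_FORMAT pattern: Spark uses Java-style datetime pattern letters ('yyyy-MM'), whereas MySQL, for example, uses strftime-style specifiers ('%Y-%m'). In Spark patterns the case matters: 'MM' is the month, while 'mm' is the minute.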
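
To make the dialect difference concrete, here is a minimal sketch of the same query in SQLite, run through Python's built-in sqlite3 module. It is not part of the Spark solution, only a comparison: SQLite has no DATE_FORMAT, so the sketch assumes its strftime function with the '%Y-%m' pattern, and it stores sale_date as ISO text because SQLite has no dedicated date type. The name `sqlite_rows` is introduced here for illustration.

```python
import sqlite3

# In-memory database holding the same sample data as the Spark DataFrame
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (sale_id INTEGER, product_name TEXT, sale_date TEXT)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        (1, "  LCPHONE", "2000-01-16"),
        (2, "LCPhone", "2000-01-17"),
        (3, "LcPhOnE", "2000-02-18"),
        (4, "LCKeyCHAiN  ", "2000-02-19"),
        (5, "LCKeyChain", "2000-02-28"),
        (6, " Matryoshka", "2000-03-31"),
    ],
)

# Same shape as the Spark SQL query; only the date-formatting call differs.
# ORDER BY is added so the printed rows come back in a fixed order.
query = """
    SELECT strftime('%Y-%m', sale_date) AS year_months,
           LOWER(TRIM(product_name)) AS product_names,
           COUNT(sale_id) AS product_sold
    FROM sales
    GROUP BY year_months, product_names
    ORDER BY year_months, product_names
"""
sqlite_rows = conn.execute(query).fetchall()
conn.close()

for row in sqlite_rows:
    print(row)
```

Apart from the date-formatting function, the SELECT, LOWER(TRIM(...)), COUNT, and GROUP BY clauses carry over unchanged.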

spark.sql(): Executes the SQL query on the registered temporary view.

This will give you the grouped result based on the formatted date and cleaned product name, along with the count of products sold.