# **Fix the Issue**

## **Problem Statement**

You are given a `Sales` table with the following schema:

| Column Name   | Data Type |
|---------------|-----------|
| sale_id       | INT       |
| product_name  | VARCHAR   |
| sale_date     | DATE      |

- `sale_id` is the primary key.
- `product_name` may contain **leading/trailing spaces** and **inconsistent letter casing**; names that differ only in case or surrounding spaces refer to the same product.
- You need to:
  1. Normalize the `product_name` (trim + lowercase).
  2. Format `sale_date` as `'YYYY-MM'`.
  3. Count how many times each normalized product was sold per month.
  4. Return results sorted by `product_name` and `sale_date`, both ascending.

---

### **Expected Output**

| product_name | sale_date | total |
|--------------|-----------|-------|
| lckeychain   | 2000-02   | 2     |
| lcphone      | 2000-01   | 2     |
| lcphone      | 2000-02   | 1     |
| matryoshka   | 2000-03   | 1     |
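
As a quick cross-check of the expected output, the four steps from the problem statement can be sketched in plain Python without Spark. This is only an illustrative reference, not part of either approach below: the helper name `monthly_sales_counts` is hypothetical, the sample rows mirror the ones used later in this notebook, and `str.strip()` stands in for the trimming step.

```python
from collections import Counter
from datetime import date


def monthly_sales_counts(rows):
    """Count sales per normalized product name and 'YYYY-MM' month.

    rows: iterable of (sale_id, product_name, sale_date) tuples,
    where sale_date is a datetime.date (hypothetical input shape).
    """
    counts = Counter(
        # Steps 1 and 2: trim + lowercase the name, keep only year-month
        (name.strip().lower(), sale_date.strftime("%Y-%m"))
        for _sale_id, name, sale_date in rows
    )
    # Steps 3 and 4: attach the counts, sort by product_name then sale_date
    return sorted((name, month, total) for (name, month), total in counts.items())


sample = [
    (1, " LCPHONE", date(2000, 1, 16)),
    (2, "LCPhone ", date(2000, 1, 17)),
    (3, " LcPhOnE ", date(2000, 2, 18)),
    (4, "LCKeyCHAiN", date(2000, 2, 19)),
    (5, "LCKeyChain", date(2000, 2, 28)),
    (6, "Matryoshka", date(2000, 3, 31)),
]

for row in monthly_sales_counts(sample):
    print(row)
```

Each printed tuple corresponds to one row of the expected output table, in the same order.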

---


## **Approach 1: PySpark DataFrame API**

### **Steps:**
1. Clean `product_name` using `trim()` and `lower()`.
2. Format `sale_date` using `date_format()`.
3. Group by normalized name and formatted date.
4. Count rows and sort as required.

### **Code**

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import trim, lower, date_format, col, count, to_date

# Step 1: Create Spark session
spark = SparkSession.builder.appName("FixTheIssue").getOrCreate()

# Step 2: Sample data (dates are supplied as 'yyyy-MM-dd' strings)
data = [
    (1, " LCPHONE", "2000-01-16"),
    (2, "LCPhone ", "2000-01-17"),
    (3, " LcPhOnE ", "2000-02-18"),
    (4, "LCKeyCHAiN", "2000-02-19"),
    (5, "LCKeyChain", "2000-02-28"),
    (6, "Matryoshka", "2000-03-31")
]
columns = ["sale_id", "product_name", "sale_date"]

# Cast sale_date to DATE so the DataFrame matches the Sales schema
df = spark.createDataFrame(data, columns) \
          .withColumn("sale_date", to_date(col("sale_date")))

# Step 3: Normalize product_name and reduce sale_date to year-month
cleaned_df = df.withColumn("product_name", lower(trim(col("product_name")))) \
               .withColumn("sale_date", date_format(col("sale_date"), "yyyy-MM"))

# Count sales per product per month, then sort as required
result_df = cleaned_df.groupBy("product_name", "sale_date") \
                      .agg(count("*").alias("total")) \
                      .orderBy("product_name", "sale_date")

# Step 4: Show result
result_df.show()
```


```text
+------------+---------+-----+
|product_name|sale_date|total|
+------------+---------+-----+
|  lckeychain|  2000-02|    2|
|     lcphone|  2000-01|    2|
|     lcphone|  2000-02|    1|
|  matryoshka|  2000-03|    1|
+------------+---------+-----+
```



---

## **Approach 2: SQL Query in PySpark**

### **Steps:**
1. Register the DataFrame as a temporary view named `Sales`.
2. Write a query to clean `product_name`, format `sale_date`, group and count.
3. Use `ORDER BY` on `product_name` and `sale_date`.

### **Code**

```python
# Register the DataFrame as a temporary view so it can be queried with SQL
df.createOrReplaceTempView("Sales")

# Normalize product_name, reduce sale_date to year-month, then count per group
query = """
SELECT
    LOWER(TRIM(product_name)) AS product_name,
    DATE_FORMAT(sale_date, 'yyyy-MM') AS sale_date,
    COUNT(*) AS total
FROM Sales
GROUP BY LOWER(TRIM(product_name)), DATE_FORMAT(sale_date, 'yyyy-MM')
ORDER BY product_name, sale_date
"""

result_sql = spark.sql(query)
result_sql.show()
```


```text
+------------+---------+-----+
|product_name|sale_date|total|
+------------+---------+-----+
|  lckeychain|  2000-02|    2|
|     lcphone|  2000-01|    2|
|     lcphone|  2000-02|    1|
|  matryoshka|  2000-03|    1|
+------------+---------+-----+
```



---

## **Summary**
| Approach       | Method                | Key Operations |
|----------------|-----------------------|----------------|
| **Approach 1** | PySpark DataFrame API | Uses `.withColumn()` with `lower(trim())` and `date_format()`, then `groupBy().agg()` and `.orderBy()` |
| **Approach 2** | SQL Query in PySpark  | Uses `LOWER(TRIM())`, `DATE_FORMAT()`, `COUNT(*)` with `GROUP BY` and `ORDER BY` |