
# **Orders With Maximum Quantity Above Average**

## **Problem Statement**
You are given a table **OrdersDetails** with the following structure:

### **Table: OrdersDetails**
| Column Name | Type |
|------------|------|
| `order_id` | int  |
| `product_id` | int  |
| `quantity` | int  |

### **Objective**
Find all `order_id` values where the **maximum quantity** of any single product in the order is **strictly greater** than the **average quantity of all orders**, computed here as the mean of the per-order average quantities.

### **Definitions**
- **Order's Average Quantity** = (Total quantity in the order) / (Number of different products in the order)
- **Order's Maximum Quantity** = The highest `quantity` in the order
- **Imbalanced Order Condition**:
  \[
  \max(\text{quantity in the order}) > \text{average quantity of all orders}
  \]
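
As a small illustration of the two per-order definitions above, the following sketch computes them in plain Python for a single order. The helper name `order_stats` is hypothetical (it is not part of the notebook), and the quantities are those of order 1 in the sample data used later on.

```python
from statistics import mean


def order_stats(quantities):
    """Return (average quantity, maximum quantity) for one order's line items."""
    return mean(quantities), max(quantities)


# Quantities of the three products in order 1 of the sample data.
avg_qty, max_qty = order_stats([12, 10, 15])
print(round(avg_qty, 2), max_qty)  # 12.33 15
```

An order is reported only if its maximum exceeds the average taken across all orders, not merely its own average.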

---

## **Approach 1: PySpark DataFrame API**
### **Code**

```python
from pyspark.sql import SparkSession
# Import the functions module under an alias so that Spark's max() and sum()
# do not shadow Python's built-in max() and sum().
from pyspark.sql import functions as F

# Step 1: Initialize Spark Session
spark = SparkSession.builder.appName("ImbalancedOrders").getOrCreate()

# Step 2: Create DataFrame for OrdersDetails Table
orders_data = [
    (1, 1, 12), (1, 2, 10), (1, 3, 15),
    (2, 1, 8), (2, 4, 4), (2, 5, 6), (2, 9, 4),
    (3, 3, 5), (3, 4, 18), (3, 9, 20),
    (4, 5, 2), (4, 6, 8),
    (5, 7, 9), (5, 8, 9)
]
orders_columns = ["order_id", "product_id", "quantity"]

orders_df = spark.createDataFrame(orders_data, orders_columns)

# Step 3: Compute Average Quantity for Each Order
# (total quantity divided by the number of product rows in the order)
order_avg_df = orders_df.groupBy("order_id").agg(
    (F.sum("quantity") / F.count("product_id")).alias("avg_quantity")
)

# Step 4: Compute Maximum Quantity for Each Order
order_max_df = orders_df.groupBy("order_id").agg(
    F.max("quantity").alias("max_quantity")
)

# Step 5: Compute Global Average Quantity of All Orders
# (mean of the per-order averages, collected to the driver as a Python float)
global_avg = order_avg_df.select(F.avg("avg_quantity")).collect()[0][0]

# Step 6: Filter Orders with Max Quantity Greater than Global Average
imbalanced_orders_df = order_max_df.filter(F.col("max_quantity") > global_avg)

# Step 7: Display the Result (sorted so the output order is deterministic)
imbalanced_orders_df.select("order_id").orderBy("order_id").show()
```


Output:

```text
+--------+
|order_id|
+--------+
|       1|
|       3|
+--------+
```
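
The result can be cross-checked without Spark. The sketch below repeats the same steps (per-order average, per-order maximum, mean of the averages, filter) in plain Python on the same sample rows. Variable names such as `quantities_by_order` are illustrative only and do not appear in the notebook.

```python
from collections import defaultdict

# Same sample rows as orders_data: (order_id, product_id, quantity)
rows = [
    (1, 1, 12), (1, 2, 10), (1, 3, 15),
    (2, 1, 8), (2, 4, 4), (2, 5, 6), (2, 9, 4),
    (3, 3, 5), (3, 4, 18), (3, 9, 20),
    (4, 5, 2), (4, 6, 8),
    (5, 7, 9), (5, 8, 9),
]

# Group the quantities by order_id.
quantities_by_order = defaultdict(list)
for order_id, _product_id, quantity in rows:
    quantities_by_order[order_id].append(quantity)

# Per-order average and maximum quantity.
order_avg = {oid: sum(q) / len(q) for oid, q in quantities_by_order.items()}
order_max = {oid: max(q) for oid, q in quantities_by_order.items()}

# Global average = mean of the per-order averages.
global_avg = sum(order_avg.values()) / len(order_avg)

# Keep orders whose maximum is strictly greater than the global average.
imbalanced = sorted(oid for oid, m in order_max.items() if m > global_avg)

print(round(global_avg, 4))  # 9.2333
print(imbalanced)            # [1, 3]
```

Order 5 has a maximum of 9, which is below the global average of about 9.23, so it is excluded; orders 1 and 3 have maxima of 15 and 20.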




---

## **Approach 2: SQL Query in PySpark**
### **Steps**
1. **Register `orders_df` (created in Approach 1) as a SQL View**
2. **Write and Execute SQL Query**
3. **Display the Output**
 
### **Code**


```python
# Step 1: Register DataFrame as a SQL View
# (reuses the `spark` session and `orders_df` created in the previous cell)
orders_df.createOrReplaceTempView("OrdersDetails")

# Step 2: Run SQL Query
# OrderStats holds the per-order maximum and average quantity;
# GlobalAvg is a single row holding the mean of the per-order averages.
sql_query = """
WITH OrderStats AS (
    SELECT order_id,
           MAX(quantity) AS max_quantity,
           SUM(quantity) * 1.0 / COUNT(product_id) AS avg_quantity
    FROM OrdersDetails
    GROUP BY order_id
),
GlobalAvg AS (
    SELECT AVG(avg_quantity) AS global_avg FROM OrderStats
)
SELECT o.order_id
FROM OrderStats o
JOIN GlobalAvg g
ON o.max_quantity > g.global_avg
ORDER BY o.order_id
"""

result_sql = spark.sql(sql_query)

# Step 3: Display Output
result_sql.show()
```


Output:

```text
+--------+
|order_id|
+--------+
|       1|
|       3|
+--------+
```
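
To sanity-check the SQL logic without a Spark cluster, essentially the same query can be run against an in-memory SQLite database using Python's standard `sqlite3` module. This is only a sketch: it assumes SQLite's dialect evaluates the query the same way Spark SQL does for this data, and it adds an `ORDER BY` so the row order is deterministic.

```python
import sqlite3

# Same sample rows as orders_data: (order_id, product_id, quantity)
rows = [
    (1, 1, 12), (1, 2, 10), (1, 3, 15),
    (2, 1, 8), (2, 4, 4), (2, 5, 6), (2, 9, 4),
    (3, 3, 5), (3, 4, 18), (3, 9, 20),
    (4, 5, 2), (4, 6, 8),
    (5, 7, 9), (5, 8, 9),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE OrdersDetails (order_id INT, product_id INT, quantity INT)"
)
conn.executemany("INSERT INTO OrdersDetails VALUES (?, ?, ?)", rows)

query = """
WITH OrderStats AS (
    SELECT order_id,
           MAX(quantity) AS max_quantity,
           SUM(quantity) * 1.0 / COUNT(product_id) AS avg_quantity
    FROM OrdersDetails
    GROUP BY order_id
),
GlobalAvg AS (
    SELECT AVG(avg_quantity) AS global_avg FROM OrderStats
)
SELECT o.order_id
FROM OrderStats o
JOIN GlobalAvg g
ON o.max_quantity > g.global_avg
ORDER BY o.order_id
"""

# Each result row is a 1-tuple; keep only the order_id.
result = [row[0] for row in conn.execute(query)]
conn.close()

print(result)  # [1, 3]
```

Because `GlobalAvg` contains exactly one row, the join condition acts as a filter on `OrderStats` rather than multiplying rows.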



---

## **Summary**
| Approach | Method | Key Constructs |
|----------|--------|----------------|
| **Approach 1** | PySpark DataFrame API | Uses `groupBy()`, `agg()`, and `filter()` |
| **Approach 2** | SQL Query in PySpark | Uses `WITH` (CTEs), `AVG()`, `MAX()`, and `JOIN` |