# **Customer Placing the Largest Number of Orders**

## **Problem Statement**
You are given a table **Orders** with the following structure:

### **Table: Orders**
| Column Name       | Type |
|-------------------|------|
| `order_number`    | int  |
| `customer_number` | int  |

- `order_number` is the primary key.
- Each row records one order (`order_number`) and the customer who placed it (`customer_number`).

### **Objective**
Write a query to find the `customer_number` of the customer who has placed the **largest number of orders**.

- It is **guaranteed** that exactly **one customer** has placed more orders than any other customer.
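
As a quick reference for the expected result, the objective can be sketched in plain Python before bringing in Spark. This is a minimal illustration, assuming the same four sample rows used in the approaches below; the helper name `top_customer` is hypothetical and relies on the stated guarantee that there is no tie for first place.

```python
from collections import Counter

# Sample rows as (order_number, customer_number), mirroring the Orders table.
orders = [(1, 1), (2, 2), (3, 3), (4, 3)]


def top_customer(rows):
    """Return the customer_number with the most orders (assumes no tie for first place)."""
    # Count how many orders each customer has placed.
    counts = Counter(customer for _, customer in rows)
    # most_common(1) yields a one-element list of (customer_number, order_count).
    customer, _ = counts.most_common(1)[0]
    return customer


print(top_customer(orders))  # prints 3
```

Customer `3` appears in two rows while customers `1` and `2` appear once each, so both Spark approaches should also return `3`.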

---



## **Approach 1: PySpark DataFrame API**
### **Steps**
1. **Initialize Spark Session**
2. **Create a DataFrame for `Orders` Table**
3. **Group by `customer_number` and count orders**
4. **Find the customer with the maximum order count**
5. **Select and display the result**

### **Code**

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import desc

# Step 1: Initialize Spark Session
spark = SparkSession.builder.appName("LargestNumberOrders").getOrCreate()

# Step 2: Create DataFrame for Orders Table
orders_data = [
    (1, 1),
    (2, 2),
    (3, 3),
    (4, 3),
]
orders_columns = ["order_number", "customer_number"]

orders_df = spark.createDataFrame(orders_data, orders_columns)

# Step 3: Group by customer_number and Count Orders
customer_orders_df = orders_df.groupBy("customer_number").count()

# Step 4: Find the Customer with Maximum Order Count
max_orders = customer_orders_df.orderBy(desc("count")).limit(1)

# Step 5: Select customer_number Column and Display Result
result_df = max_orders.select("customer_number")
result_df.show()



+---------------+
|customer_number|
+---------------+
|              3|
+---------------+



## **Approach 2: SQL Query in PySpark**
### **Steps**
1. **Create Spark Session**
2. **Create DataFrame for `Orders` Table**
3. **Register it as a SQL View**
4. **Write and Execute SQL Query**
5. **Display the Output**


### **Code**

In [2]:
# Steps 1-2: Reuse the `spark` session and `orders_df` created in Approach 1

# Step 3: Register the DataFrame as a temporary SQL view
orders_df.createOrReplaceTempView("Orders")

# Step 4: Write and execute the SQL query
sql_query = """
SELECT customer_number
FROM Orders
GROUP BY customer_number
ORDER BY COUNT(order_number) DESC
LIMIT 1
"""

result_sql = spark.sql(sql_query)

# Step 5: Display the output
result_sql.show()


+---------------+
|customer_number|
+---------------+
|              3|
+---------------+
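
Because the query relies only on `GROUP BY`, `ORDER BY`, and `LIMIT`, it can be sanity-checked without a Spark session. The following is a minimal sketch, assuming Python's built-in `sqlite3` module as a stand-in engine and the same four sample rows loaded into an in-memory table; it is meant to illustrate the query logic and does not exercise Spark.

```python
import sqlite3

# In-memory SQLite database standing in for the Spark temp view (an assumption
# made purely for illustration).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Orders (order_number INTEGER PRIMARY KEY, customer_number INTEGER)"
)
conn.executemany(
    "INSERT INTO Orders VALUES (?, ?)",
    [(1, 1), (2, 2), (3, 3), (4, 3)],
)

# Same shape as the Spark SQL query: count orders per customer, sort by that
# count in descending order, and keep only the first row.
query = """
SELECT customer_number
FROM Orders
GROUP BY customer_number
ORDER BY COUNT(order_number) DESC
LIMIT 1
"""
top = conn.execute(query).fetchone()[0]
print(top)  # prints 3
conn.close()
```

The result matches the Spark output above. Note that `LIMIT 1` returns a single row even if two customers shared the highest count, so this form depends on the guarantee in the problem statement.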



---

## **Summary**
| Approach       | Method                | Key Operations |
|----------------|-----------------------|----------------|
| **Approach 1** | PySpark DataFrame API | Uses `groupBy().count()` and `orderBy(desc("count")).limit(1)` |
| **Approach 2** | SQL Query in PySpark  | Uses `GROUP BY`, `ORDER BY COUNT() DESC`, and `LIMIT 1` |
