
# **Product's Worth Over Invoices**

## **Problem Statement**
We are given two tables:

### **Table: Product**
| Column Name | Type    |
|------------|---------|
| `product_id`  | int     |
| `name`        | varchar |

`product_id` is the **primary key**.

### **Table: Invoice**
| Column Name | Type |
|------------|------|
| `invoice_id`  | int  |
| `product_id`  | int  |
| `rest`        | int  |
| `paid`        | int  |
| `canceled`    | int  |
| `refunded`    | int  |

`invoice_id` is the **primary key**.

### **Objective**
For **each product**, calculate the **total amount**, summed across all of its invoices, of:
- `rest` (amount left to pay)
- `paid` (amount paid)
- `canceled` (amount canceled)
- `refunded` (amount refunded)

The result should be ordered by `name`.
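
As a quick sanity check of what "total per product" means, the aggregation can be sketched in plain Python, with no Spark involved. This is only an illustrative sketch: it reuses the sample rows that appear in the PySpark code later in this document, and the names `invoices`, `products`, `totals`, and `expected` are hypothetical helpers, not part of the original solution.

```python
from collections import defaultdict

# Sample Invoice rows: (invoice_id, product_id, rest, paid, canceled, refunded)
invoices = [
    (23, 0, 2, 0, 5, 0),
    (12, 0, 0, 4, 0, 3),
    (1, 1, 1, 1, 0, 1),
    (2, 1, 1, 0, 1, 1),
    (3, 1, 0, 1, 1, 1),
    (4, 1, 1, 1, 1, 0),
]
products = {0: "ham", 1: "bacon"}  # product_id -> name

# Accumulate [rest, paid, canceled, refunded] per product_id
totals = defaultdict(lambda: [0, 0, 0, 0])
for _invoice_id, product_id, *amounts in invoices:
    for i, amount in enumerate(amounts):
        totals[product_id][i] += amount

# One row per product, ordered by name
expected = sorted((products[pid], *sums) for pid, sums in totals.items())
for row in expected:
    print(row)
```

The two printed rows should agree with the Spark output shown in both approaches.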

---


## **Approach 1: PySpark DataFrame API**
### **Steps**
1. **Initialize Spark Session**
2. **Create DataFrames for `Product` and `Invoice`**
3. **Join both DataFrames on `product_id`**
4. **Group by product name and aggregate sum of `rest`, `paid`, `canceled`, and `refunded`**
5. **Order by `name`**
6. **Display the result**

### **Code**

from pyspark.sql import SparkSession
from pyspark.sql import functions as F  # namespaced so F.sum does not shadow Python's built-in sum

# Step 1: Initialize Spark Session
spark = SparkSession.builder.appName("ProductWorthOverInvoices").getOrCreate()

# Step 2: Create DataFrames for Product and Invoice Tables
product_data = [(0, "ham"), (1, "bacon")]
invoice_data = [
    (23, 0, 2, 0, 5, 0),
    (12, 0, 0, 4, 0, 3),
    (1, 1, 1, 1, 0, 1),
    (2, 1, 1, 0, 1, 1),
    (3, 1, 0, 1, 1, 1),
    (4, 1, 1, 1, 1, 0)
]

product_columns = ["product_id", "name"]
invoice_columns = ["invoice_id", "product_id", "rest", "paid", "canceled", "refunded"]

product_df = spark.createDataFrame(product_data, product_columns)
invoice_df = spark.createDataFrame(invoice_data, invoice_columns)

# Step 3: Join the DataFrames on 'product_id' (inner join by default)
merged_df = product_df.join(invoice_df, "product_id")

# Step 4: Aggregate Sum of 'rest', 'paid', 'canceled', 'refunded'
result_df = merged_df.groupBy("name").agg(
    F.sum("rest").alias("rest"),
    F.sum("paid").alias("paid"),
    F.sum("canceled").alias("canceled"),
    F.sum("refunded").alias("refunded")
)

# Step 5: Order by 'name'
result_df = result_df.orderBy("name")

# Step 6: Display Result
result_df.show()

+-----+----+----+--------+--------+
| name|rest|paid|canceled|refunded|
+-----+----+----+--------+--------+
|bacon|   3|   3|       3|       3|
|  ham|   2|   4|       5|       3|
+-----+----+----+--------+--------+




---

## **Approach 2: SQL Query in PySpark**
### **Steps**
1. **Reuse the `Product` and `Invoice` DataFrames created in Approach 1**
2. **Register them as temporary SQL views**
3. **Write and execute the SQL query**
4. **Display the output**

### **Code**

# Step 2: Register the DataFrames from Approach 1 (product_df, invoice_df) as temporary SQL views
product_df.createOrReplaceTempView("Product")
invoice_df.createOrReplaceTempView("Invoice")

# Step 3: Write and run the SQL query
sql_query = """
SELECT 
    p.name,
    SUM(i.rest) AS rest,
    SUM(i.paid) AS paid,
    SUM(i.canceled) AS canceled,
    SUM(i.refunded) AS refunded
FROM Product p
JOIN Invoice i ON p.product_id = i.product_id
GROUP BY p.name
ORDER BY p.name
"""

result_sql = spark.sql(sql_query)

# Step 4: Display Output
result_sql.show()

+-----+----+----+--------+--------+
| name|rest|paid|canceled|refunded|
+-----+----+----+--------+--------+
|bacon|   3|   3|       3|       3|
|  ham|   2|   4|       5|       3|
+-----+----+----+--------+--------+
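
Both approaches use an inner join, so a product with no invoices would not appear in the result at all. Whether such products should be listed with zero totals is an assumption about the intended reading of "for each product"; the source does not say. The sketch below uses Python's built-in `sqlite3` purely as a lightweight stand-in to compare the two query shapes without a Spark session (`LEFT JOIN` and `COALESCE` are also available in Spark SQL). The product `eggs` is hypothetical and was added only to show the difference.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Invoice (
    invoice_id INTEGER PRIMARY KEY, product_id INTEGER,
    rest INTEGER, paid INTEGER, canceled INTEGER, refunded INTEGER
);
-- 'eggs' is a hypothetical product with no invoices
INSERT INTO Product VALUES (0, 'ham'), (1, 'bacon'), (2, 'eggs');
INSERT INTO Invoice VALUES
    (23, 0, 2, 0, 5, 0), (12, 0, 0, 4, 0, 3),
    (1, 1, 1, 1, 0, 1), (2, 1, 1, 0, 1, 1),
    (3, 1, 0, 1, 1, 1), (4, 1, 1, 1, 1, 0);
""")

# Same shape as the query above: inner join drops products without invoices
inner_rows = conn.execute("""
SELECT p.name, SUM(i.rest), SUM(i.paid), SUM(i.canceled), SUM(i.refunded)
FROM Product p
JOIN Invoice i ON p.product_id = i.product_id
GROUP BY p.name
ORDER BY p.name
""").fetchall()

# Variant: keep every product and turn missing sums (NULL) into 0
left_rows = conn.execute("""
SELECT p.name,
       COALESCE(SUM(i.rest), 0),
       COALESCE(SUM(i.paid), 0),
       COALESCE(SUM(i.canceled), 0),
       COALESCE(SUM(i.refunded), 0)
FROM Product p
LEFT JOIN Invoice i ON p.product_id = i.product_id
GROUP BY p.product_id, p.name
ORDER BY p.name
""").fetchall()

print(inner_rows)
print(left_rows)
conn.close()
```

The variant also groups by `product_id` as well as `name`, so that two distinct products that happen to share a name would not be merged into one row.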



---

## **Summary**
| Approach  | Method                      | Key Operation |
|-----------|-----------------------------|---------------|
| **Approach 1** | PySpark DataFrame API    | Uses `groupBy().agg(sum())` |
| **Approach 2** | SQL Query in PySpark     | Uses `SUM()` with `GROUP BY` |