# **NPV Query**

## **Problem Statement**
You are given two tables **NPV** and **Queries** with the following structure:

### **Table: NPV**
| Column Name | Type |
|------------|------|
| `id`       | int  |
| `year`     | int  |
| `npv`      | int  |

- **Primary Key:** (`id`, `year`)
- Each row stores the net present value (`npv`) of one inventory item, identified by `id`, for a given `year`.

### **Table: Queries**
| Column Name | Type |
|------------|------|
| `id`       | int  |
| `year`     | int  |

- **Primary Key:** (`id`, `year`)
- This table contains the queries for which we need to retrieve the NPV.

### **Objective**
For each (`id`, `year`) in the **Queries** table:
- Retrieve the corresponding **npv** value from the **NPV** table.
- If no matching (`id`, `year`) exists in the **NPV** table, return `0` as the npv value.
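
The objective amounts to a keyed lookup with a default of `0`. The following plain-Python sketch (no Spark required) expresses that rule using the sample rows from this notebook; the function name `answer_queries` is hypothetical and chosen only for illustration.

```python
# Sample rows copied from the NPV and Queries tables used in this notebook.
npv_rows = [
    (1, 2018, 100), (7, 2020, 30), (13, 2019, 40),
    (1, 2019, 113), (2, 2008, 121), (3, 2009, 12),
    (11, 2020, 99), (7, 2019, 0),
]
query_rows = [
    (1, 2019), (2, 2008), (3, 2009), (7, 2018),
    (7, 2019), (7, 2020), (13, 2019),
]


def answer_queries(npv_rows, query_rows):
    """Return (id, year, npv) per query, using 0 when no NPV row matches."""
    # (id, year) is the primary key of NPV, so each key maps to exactly one value.
    npv_by_key = {(id_, year): npv for id_, year, npv in npv_rows}
    return [
        (id_, year, npv_by_key.get((id_, year), 0))
        for id_, year in query_rows
    ]


expected = answer_queries(npv_rows, query_rows)
for row in expected:
    print(row)
```

A list like `expected` can serve as a reference result when checking the Spark approaches on small inputs.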

---


## **Approach 1: PySpark DataFrame API**
### **Steps**
1. **Initialize Spark Session**
2. **Create DataFrames for `NPV` and `Queries` Tables**
3. **Perform Left Join on `id` and `year` Columns**
4. **Replace NULL npv Values with 0**
5. **Display the Result**
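
To see why steps 3 and 4 are separate, the plain-Python sketch below mimics them on the rows for `id = 7` only. The helper names `left_join` and `coalesce_npv` are hypothetical stand-ins for illustration, not Spark functions. The left join leaves `None` for the unmatched query, and only the second stage turns it into `0`.

```python
# Rows for id = 7 taken from the sample data in this notebook.
npv_rows = [(7, 2020, 30), (7, 2019, 0)]
query_rows = [(7, 2018), (7, 2019), (7, 2020)]


def left_join(query_rows, npv_rows):
    """Keep every query row; npv is None when no NPV row matches."""
    npv_by_key = {(id_, year): npv for id_, year, npv in npv_rows}
    return [(id_, year, npv_by_key.get((id_, year))) for id_, year in query_rows]


def coalesce_npv(rows, default=0):
    """Replace a missing npv (None) with the default, leaving real values alone."""
    # Compare against None explicitly: a stored npv of 0 is a real value.
    return [
        (id_, year, default if npv is None else npv)
        for id_, year, npv in rows
    ]


joined = left_join(query_rows, npv_rows)   # (7, 2018) has npv None here
result = coalesce_npv(joined)              # (7, 2018) now has npv 0
print(joined)
print(result)
```

After the second stage, `(7, 2018)` (no matching row) and `(7, 2019)` (a stored value of `0`) both report `0`, which is what the objective asks for.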

### **Code**

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, coalesce, lit

# Step 1: Initialize Spark Session
spark = SparkSession.builder.appName("NPVQueryMatching").getOrCreate()

# Step 2: Create DataFrames for NPV and Queries Tables
npv_data = [
    (1, 2018, 100), (7, 2020, 30), (13, 2019, 40),
    (1, 2019, 113), (2, 2008, 121), (3, 2009, 12),
    (11, 2020, 99), (7, 2019, 0)
]
npv_columns = ["id", "year", "npv"]
npv_df = spark.createDataFrame(npv_data, npv_columns)

queries_data = [
    (1, 2019), (2, 2008), (3, 2009), (7, 2018),
    (7, 2019), (7, 2020), (13, 2019)
]
queries_columns = ["id", "year"]
queries_df = spark.createDataFrame(queries_data, queries_columns)

# Step 3: Perform Left Join on 'id' and 'year'
result_df = queries_df.join(npv_df, on=["id", "year"], how="left")

# Step 4: Replace NULL npv Values with 0
result_df = result_df.withColumn("npv", coalesce(col("npv"), lit(0)))

# Step 5: Display the Result
result_df.show()
```


```text
+---+----+---+
| id|year|npv|
+---+----+---+
|  1|2019|113|
|  2|2008|121|
|  3|2009| 12|
|  7|2018|  0|
|  7|2019|  0|
|  7|2020| 30|
| 13|2019| 40|
+---+----+---+
```



---

## **Approach 2: SQL Query in PySpark**
### **Steps**
1. **Reuse the `NPV` and `Queries` DataFrames Created in Approach 1**
2. **Register Both DataFrames as SQL Views**
3. **Write SQL Query Using `LEFT JOIN` and `COALESCE`**
4. **Execute SQL Query**
5. **Display Output**

### **Code**

```python
# Step 1: npv_df and queries_df are the DataFrames created in Approach 1

# Step 2: Register Both DataFrames as SQL Views
npv_df.createOrReplaceTempView("NPV")
queries_df.createOrReplaceTempView("Queries")

# Step 3: Write SQL Query Using LEFT JOIN and COALESCE
sql_query = """
SELECT q.id, q.year, COALESCE(n.npv, 0) AS npv
FROM Queries q
LEFT JOIN NPV n
ON q.id = n.id AND q.year = n.year;
"""

# Step 4: Execute SQL Query
result_sql = spark.sql(sql_query)

# Step 5: Display Output
result_sql.show()
```


```text
+---+----+---+
| id|year|npv|
+---+----+---+
|  1|2019|113|
|  2|2008|121|
|  3|2009| 12|
|  7|2018|  0|
|  7|2019|  0|
|  7|2020| 30|
| 13|2019| 40|
+---+----+---+
```
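
The `LEFT JOIN` plus `COALESCE` query is ordinary SQL, so it can be tried without a Spark cluster. The sketch below runs the same query with Python's built-in `sqlite3` module. Using SQLite as a stand-in for Spark SQL is an assumption made only for illustration, and the `ORDER BY` clause is an addition so that the row order is deterministic.

```python
import sqlite3

# In-memory SQLite database standing in for the Spark temp views (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE NPV (id INTEGER, year INTEGER, npv INTEGER, PRIMARY KEY (id, year));
CREATE TABLE Queries (id INTEGER, year INTEGER, PRIMARY KEY (id, year));
""")

# Same sample rows as the DataFrames in this notebook.
conn.executemany(
    "INSERT INTO NPV VALUES (?, ?, ?)",
    [
        (1, 2018, 100), (7, 2020, 30), (13, 2019, 40),
        (1, 2019, 113), (2, 2008, 121), (3, 2009, 12),
        (11, 2020, 99), (7, 2019, 0),
    ],
)
conn.executemany(
    "INSERT INTO Queries VALUES (?, ?)",
    [
        (1, 2019), (2, 2008), (3, 2009), (7, 2018),
        (7, 2019), (7, 2020), (13, 2019),
    ],
)

# Same join and COALESCE as the Spark SQL query; ORDER BY added for stable output.
rows = conn.execute("""
SELECT q.id, q.year, COALESCE(n.npv, 0) AS npv
FROM Queries q
LEFT JOIN NPV n
  ON q.id = n.id AND q.year = n.year
ORDER BY q.id, q.year
""").fetchall()
conn.close()

for row in rows:
    print(row)
```

Because `Queries` is the left side of the join, every query appears in `rows` exactly once, including `(7, 2018)`, which has no match in `NPV`.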



---

## **Summary**
| Approach | Method | Key Operations |
|----------|--------|----------------|
| **Approach 1** | PySpark DataFrame API | Left `join()` on (`id`, `year`), then `coalesce()` to replace NULL with `0` |
| **Approach 2** | SQL Query in PySpark | `LEFT JOIN` on (`id`, `year`) with `COALESCE()` to replace NULL with `0` |
