# **Recyclable and Low Fat Products**

## **Problem Statement**
We have a table named **Products**.

### **Table: Products**
This table contains product details.

| Column Name | Type  |
|-------------|------|
| product_id  | int  |
| low_fats    | enum ('Y', 'N') |
| recyclable  | enum ('Y', 'N') |

- `product_id` is the **primary key**.
- `low_fats` indicates if the product is **low fat** (`Y` for Yes, `N` for No).
- `recyclable` indicates if the product is **recyclable** (`Y` for Yes, `N` for No).

---

## **Objective**
Find the **product_id** of products that are **both low fat (`Y`) and recyclable (`Y`)**.

### **Example**
#### **Input: Products Table**
| product_id | low_fats | recyclable |
|------------|----------|------------|
| 0          | Y        | N          |
| 1          | Y        | Y          |
| 2          | N        | Y          |
| 3          | Y        | Y          |
| 4          | N        | N          |

#### **Expected Output**
| product_id |
|------------|
| 1          |
| 3          |

- **Product 1** has `low_fats = Y` and `recyclable = Y`, so it's included.
- **Product 3** has `low_fats = Y` and `recyclable = Y`, so it's included.
- **Products 0, 2, and 4** each fail at least one of the two conditions, so they are excluded.
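
The selection rule can be sanity-checked without Spark. The following is a minimal plain-Python sketch over the sample rows above; the names `rows` and `selected_ids` are illustrative and not part of the original solution.

```python
# Sample rows from the example: (product_id, low_fats, recyclable)
rows = [
    (0, "Y", "N"),
    (1, "Y", "Y"),
    (2, "N", "Y"),
    (3, "Y", "Y"),
    (4, "N", "N"),
]

# Keep a product only when BOTH flags are "Y"
selected_ids = [
    product_id
    for product_id, low_fats, recyclable in rows
    if low_fats == "Y" and recyclable == "Y"
]

print(selected_ids)  # [1, 3]
```

This matches the expected output table and gives a reference result to compare the Spark approaches against.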

---


## **Approach 1: PySpark DataFrame API**
### **Steps**
1. **Initialize Spark Session**
2. **Create DataFrame for Products**
3. **Filter Products Where `low_fats = 'Y'` and `recyclable = 'Y'`**
4. **Select Only `product_id`**
5. **Display the Result**

### **Code**

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Step 1: Initialize Spark Session
spark = SparkSession.builder.appName("RecyclableAndLowFat").getOrCreate()

# Step 2: Create DataFrame
products_data = [
    (0, "Y", "N"),
    (1, "Y", "Y"),
    (2, "N", "Y"),
    (3, "Y", "Y"),
    (4, "N", "N"),
]
products_columns = ["product_id", "low_fats", "recyclable"]

products_df = spark.createDataFrame(products_data, products_columns)

# Step 3: Filter the DataFrame (each comparison is parenthesized before combining with &)
filtered_df = products_df.filter((col("low_fats") == "Y") & (col("recyclable") == "Y"))

# Step 4: Select Only product_id
result_df = filtered_df.select("product_id")

# Step 5: Display Output
result_df.show()
```



```
+----------+
|product_id|
+----------+
|         1|
|         3|
+----------+
```
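
One detail of the filter in Approach 1 is easy to miss: each comparison is wrapped in parentheses before being combined with `&`. In Python, `&` binds more tightly than `==`, so dropping the parentheses changes how the expression is parsed. The sketch below illustrates the precedence rule with plain integers rather than Spark columns (an assumption made so it runs without Spark); with column expressions, the unparenthesized form would likewise not build the intended condition.

```python
# Plain-Python illustration of operator precedence (no Spark required).
# `&` binds more tightly than `==`, so the unparenthesized form is parsed
# as the chained comparison 1 == (1 & 2) == 2.
with_parens = (1 == 1) & (2 == 2)
without_parens = 1 == 1 & 2 == 2

print(with_parens)     # True
print(without_parens)  # False
```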



---

## **Approach 2: SQL Query in PySpark**
### **Steps**
1. **Create DataFrame for Products**
2. **Register as SQL View**
3. **Run SQL Query Using `WHERE` Clause**
4. **Return the Result**

### **Code**

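```python
# Step 1: Reuse products_df, the DataFrame created in Approach 1

# Step 2: Register the DataFrame as a temporary SQL view
products_df.createOrReplaceTempView("Products")

# Step 3: Run the SQL query
# (no trailing semicolon: some older Spark versions reject it in spark.sql)
sql_query = """
SELECT product_id
FROM Products
WHERE low_fats = 'Y' AND recyclable = 'Y'
"""

result_sql = spark.sql(sql_query)

# Step 4: Display the result
result_sql.show()
```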



```
+----------+
|product_id|
+----------+
|         1|
|         3|
+----------+
```
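
The `WHERE` clause itself is standard SQL, so the query can also be checked outside Spark. The sketch below runs it against an in-memory SQLite database using Python's built-in `sqlite3` module. This is an illustrative stand-in, not part of the original solution: SQLite has no enum type, so `TEXT` columns with a `CHECK` constraint are assumed in its place, and `matching_ids` is a hypothetical name.

```python
import sqlite3

# In-memory SQLite database; TEXT + CHECK stands in for enum('Y', 'N')
# (an assumption of this sketch, since SQLite has no enum type).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE Products (
        product_id INTEGER PRIMARY KEY,
        low_fats   TEXT CHECK (low_fats IN ('Y', 'N')),
        recyclable TEXT CHECK (recyclable IN ('Y', 'N'))
    )
    """
)
conn.executemany(
    "INSERT INTO Products VALUES (?, ?, ?)",
    [(0, "Y", "N"), (1, "Y", "Y"), (2, "N", "Y"), (3, "Y", "Y"), (4, "N", "N")],
)

# Same WHERE clause as the Spark SQL query; ORDER BY makes the row order explicit
matching_ids = [
    row[0]
    for row in conn.execute(
        "SELECT product_id FROM Products "
        "WHERE low_fats = 'Y' AND recyclable = 'Y' "
        "ORDER BY product_id"
    )
]
conn.close()

print(matching_ids)  # [1, 3]
```

The `ORDER BY` is added only to make the comparison deterministic; the problem statement does not require a particular row order.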



---

## **Summary**
| Approach       | Method                | Key Operation |
|----------------|-----------------------|---------------|
| **Approach 1** | PySpark DataFrame API | Uses **filter()** to keep products that are both low fat and recyclable, then **select()** to return `product_id` |
| **Approach 2** | SQL Query in PySpark  | Uses a **SQL `WHERE` clause** on a temporary view to apply the same two conditions |
