# **Employee Bonus**

## **Problem Statement**

You are given two tables: `Employee` and `Bonus`. Write a SQL query that returns the **name and bonus of every employee** whose **bonus is less than 1000** or whose bonus **does not exist** (i.e., is NULL because the employee has no row in `Bonus`).

---

### **Table: Employee**

| empId | name   | supervisor | salary |
|-------|--------|------------|--------|
| 1     | John   | 3          | 1000   |
| 2     | Dan    | 3          | 2000   |
| 3     | Brad   | NULL       | 4000   |
| 4     | Thomas | 3          | 4000   |

---

### **Table: Bonus**

| empId | bonus |
|-------|-------|
| 2     | 500   |
| 4     | 2000  |

---

### **Expected Output**

| name  | bonus |
|-------|-------|
| Dan   | 500   |
| John  | NULL  |
| Brad  | NULL  |

---

## **Approach 1: PySpark DataFrame API**

### **Steps**
1. Load both Employee and Bonus data into DataFrames.
2. Perform a **left join** on `empId` so that every employee is kept; employees without a `Bonus` row get a NULL `bonus`.
3. Filter to keep rows where:
   - `bonus < 1000`
   - OR `bonus IS NULL`
4. Select `name` and `bonus`.
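
The explicit NULL check in step 3 matters because, under SQL-style semantics, comparing NULL with a number yields NULL rather than true or false, and a filter keeps only rows whose predicate is true. The following is a minimal plain-Python sketch of that behaviour, not Spark code: `None` stands in for NULL, and the helper names `less_than` and `sql_or` are hypothetical, introduced only for illustration.

```python
from typing import Optional

# Sample rows from the tables above; only the columns the query needs.
employees = [(1, "John"), (2, "Dan"), (3, "Brad"), (4, "Thomas")]
bonuses = {2: 500, 4: 2000}  # empId -> bonus


def less_than(value: Optional[int], limit: int) -> Optional[bool]:
    """SQL-style comparison: NULL (None) compared with anything is NULL."""
    if value is None:
        return None
    return value < limit


def sql_or(a: Optional[bool], b: Optional[bool]) -> Optional[bool]:
    """SQL three-valued OR: TRUE wins, then NULL, then FALSE."""
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False


# Left join: every employee is kept; a missing bonus becomes None (NULL).
joined = [(name, bonuses.get(emp_id)) for emp_id, name in employees]

# A filter keeps a row only when the predicate is exactly TRUE.
only_comparison = [row for row in joined if less_than(row[1], 1000) is True]
with_null_check = [
    row for row in joined
    if sql_or(less_than(row[1], 1000), row[1] is None) is True
]

print(only_comparison)  # [('Dan', 500)]
print(with_null_check)  # [('John', None), ('Dan', 500), ('Brad', None)]
```

Without the NULL check, John and Brad are dropped along with Thomas, because their comparison result is NULL rather than true.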

### **Code**


```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Setup: create (or reuse) the Spark session
spark = SparkSession.builder.appName("EmployeeBonus").getOrCreate()

# Step 1: Load the sample Employee and Bonus data into DataFrames
employee_data = [(1, 'John', 3, 1000),
                 (2, 'Dan', 3, 2000),
                 (3, 'Brad', None, 4000),
                 (4, 'Thomas', 3, 4000)]
employee_columns = ["empId", "name", "supervisor", "salary"]
employee_df = spark.createDataFrame(employee_data, employee_columns)

bonus_data = [(2, 500), (4, 2000)]
bonus_columns = ["empId", "bonus"]
bonus_df = spark.createDataFrame(bonus_data, bonus_columns)

# Step 2: Left join on empId so that every employee is kept
joined_df = employee_df.join(bonus_df, on="empId", how="left")

# Step 3: Keep rows where bonus < 1000 or bonus is null
filtered_df = joined_df.filter((col("bonus") < 1000) | col("bonus").isNull())

# Step 4: Select name and bonus
result_df = filtered_df.select("name", "bonus")

result_df.show()
```

```text
+----+-----+
|name|bonus|
+----+-----+
|John| null|
| Dan|  500|
|Brad| null|
+----+-----+
```
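
The rows above appear in a different order from the Expected Output table. Neither the join nor the filter imposes an ordering, so, assuming row order is not part of the requirement (the problem statement does not specify one), results are better compared as unordered collections. A minimal plain-Python sketch, with the rows copied by hand from the two tables:

```python
from collections import Counter

# Rows copied by hand from the Expected Output table and from the output above.
expected = [("Dan", 500), ("John", None), ("Brad", None)]
actual = [("John", None), ("Dan", 500), ("Brad", None)]

# A list comparison is order-sensitive and reports a mismatch.
same_order = expected == actual

# Counter compares the rows as a multiset: order is ignored, duplicates still count.
same_rows = Counter(expected) == Counter(actual)

print(same_order)  # False
print(same_rows)   # True
```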



---

## **Approach 2: SQL Query in PySpark**

### **Steps**
1. Perform a **LEFT JOIN** between `Employee` and `Bonus` on `empId`.
2. Use a **WHERE** clause to keep rows where:
   - `bonus < 1000`
   - OR `bonus IS NULL`

### **Code**

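```python
# Reuses employee_df, bonus_df, and spark from Approach 1.
# Register the DataFrames as temporary views so they can be queried with SQL.
employee_df.createOrReplaceTempView("Employee")
bonus_df.createOrReplaceTempView("Bonus")

spark.sql("""
    SELECT e.name, b.bonus
    FROM Employee e
    LEFT JOIN Bonus b ON e.empId = b.empId
    WHERE b.bonus < 1000 OR b.bonus IS NULL
""").show()
```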

```text
+----+-----+
|name|bonus|
+----+-----+
|John| null|
| Dan|  500|
|Brad| null|
+----+-----+
```
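
The query is plain SQL, so it can be cross-checked outside Spark. The sketch below assumes SQLite (through Python's built-in `sqlite3` module) as a stand-in engine; the column types and the `ORDER BY e.empId` clause are assumptions added to make the output deterministic. It also illustrates a common pitfall: moving the `bonus < 1000` condition from `WHERE` into the `ON` clause keeps Thomas, whose bonus of 2000 fails to match and is therefore reported as NULL.

```python
import sqlite3

# In-memory database holding the two sample tables (column types are assumed).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Employee (empId INTEGER, name TEXT, supervisor INTEGER, salary INTEGER);
    CREATE TABLE Bonus (empId INTEGER, bonus INTEGER);
    INSERT INTO Employee VALUES
        (1, 'John', 3, 1000), (2, 'Dan', 3, 2000),
        (3, 'Brad', NULL, 4000), (4, 'Thomas', 3, 4000);
    INSERT INTO Bonus VALUES (2, 500), (4, 2000);
""")

# Filter in WHERE: applied after the join, so Thomas (bonus 2000) is removed.
where_rows = conn.execute("""
    SELECT e.name, b.bonus
    FROM Employee e
    LEFT JOIN Bonus b ON e.empId = b.empId
    WHERE b.bonus < 1000 OR b.bonus IS NULL
    ORDER BY e.empId
""").fetchall()

# Filter in ON: only decides which Bonus rows match; unmatched employees
# are still kept by the left join, so Thomas appears with a NULL bonus.
on_rows = conn.execute("""
    SELECT e.name, b.bonus
    FROM Employee e
    LEFT JOIN Bonus b ON e.empId = b.empId AND b.bonus < 1000
    ORDER BY e.empId
""").fetchall()

print(where_rows)  # [('John', None), ('Dan', 500), ('Brad', None)]
print(on_rows)     # [('John', None), ('Dan', 500), ('Brad', None), ('Thomas', None)]
conn.close()
```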



---

## **Summary**

| Approach       | Method                | Key Techniques                                         |
|----------------|-----------------------|--------------------------------------------------------|
| **Approach 1** | PySpark DataFrame API | Left `join` on `empId`, then `filter` with `isNull()`  |
| **Approach 2** | SQL in PySpark        | `LEFT JOIN` with `WHERE bonus < 1000 OR bonus IS NULL` |