# **Employees Earning More Than Their Managers**

## **Problem Statement**
Given a table `Employee` containing employee information including their `Id`, `Name`, `Salary`, and `ManagerId`, write a query to find employees who earn **more than their managers**.

---

## **Table: Employee**

| Column Name | Type    |
|-------------|---------|
| Id          | int     |
| Name        | varchar |
| Salary      | int     |
| ManagerId   | int     |

- `Id` is the primary key.
- `ManagerId` is a foreign key that refers to `Id` in the same table; it is `NULL` for employees who have no manager.
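
As a rough illustration of these two constraints, the sketch below declares the table in SQLite through Python's standard-library `sqlite3` module. The SQLite column types, the `PRAGMA foreign_keys = ON` setting, the helper name `create_employee_table`, and the extra row for "Ann" are assumptions made for this example; the problem itself does not prescribe a database engine.

```python
import sqlite3


def create_employee_table(conn: sqlite3.Connection) -> None:
    """Create an Employee table with a self-referencing foreign key (illustrative)."""
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute(
        """
        CREATE TABLE Employee (
            Id        INTEGER PRIMARY KEY,
            Name      TEXT,
            Salary    INTEGER,
            ManagerId INTEGER REFERENCES Employee(Id)  -- NULL means no manager
        )
        """
    )


conn = sqlite3.connect(":memory:")
create_employee_table(conn)

# Managers are inserted first so the rows that reference them are valid.
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?, ?, ?)",
    [
        (3, "Sam", 60000, None),
        (4, "Max", 90000, None),
        (1, "Joe", 70000, 3),
        (2, "Henry", 80000, 4),
    ],
)

# A ManagerId that matches no existing Id violates the foreign key.
try:
    conn.execute("INSERT INTO Employee VALUES (5, 'Ann', 50000, 99)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM Employee").fetchone()[0]
```

Here `rejected` ends up `True` and `row_count` stays at 4, since the hypothetical fifth row points at a manager that does not exist.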

---

### **Example**

#### **Input Table**
| Id | Name  | Salary | ManagerId |
|----|-------|--------|-----------|
| 1  | Joe   | 70000  | 3         |
| 2  | Henry | 80000  | 4         |
| 3  | Sam   | 60000  | NULL      |
| 4  | Max   | 90000  | NULL      |

#### **Expected Output**
| Employee |
|----------|
| Joe      |

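Joe earns 70,000 while his manager, Sam, earns 60,000, so Joe qualifies. Henry earns 80,000, which is less than the 90,000 earned by his manager, Max. Sam and Max have no manager (`ManagerId` is `NULL`), so there is nothing to compare them against.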
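
The reasoning behind the expected output can be traced in plain Python, without Spark. This is only a sketch: the `rows` list mirrors the input table, and `salary_by_id` and `earns_more_than_manager` are names introduced here for illustration.

```python
# Rows copied from the input table: (Id, Name, Salary, ManagerId)
rows = [
    (1, "Joe", 70000, 3),
    (2, "Henry", 80000, 4),
    (3, "Sam", 60000, None),
    (4, "Max", 90000, None),
]

# Look up any employee's salary by Id (Id is the primary key, so it is unique).
salary_by_id = {emp_id: salary for emp_id, _, salary, _ in rows}


def earns_more_than_manager(salary, manager_id):
    """Return True only when a manager exists and earns strictly less."""
    if manager_id is None:  # Sam and Max have no manager to compare against
        return False
    return salary > salary_by_id[manager_id]


employees = [
    name
    for _, name, salary, manager_id in rows
    if earns_more_than_manager(salary, manager_id)
]
print(employees)  # ['Joe']
```

The dictionary lookup plays the role that the self-join plays in the approaches below: it pairs each employee with the salary of the row their `ManagerId` points to.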

---

## **Approach 1: PySpark DataFrame API**

### **Steps**
1. Create the Employee DataFrame.
2. Perform a self-join to compare each employee with their manager.
3. Filter employees whose salary is greater than their manager's.
4. Select the employee's name.

### **Code**


```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Initialize the Spark session
spark = SparkSession.builder.appName("Employee").getOrCreate()

# Step 1: Create the Employee DataFrame
data = [
    (1, "Joe", 70000, 3),
    (2, "Henry", 80000, 4),
    (3, "Sam", 60000, None),
    (4, "Max", 90000, None),
]
columns = ["Id", "Name", "Salary", "ManagerId"]
df = spark.createDataFrame(data, columns)

# Step 2: Self-join so each employee row (e) is paired with its manager row (m).
# The default inner join drops employees whose ManagerId is NULL.
joined = df.alias("e").join(
    df.alias("m"),
    col("e.ManagerId") == col("m.Id"),
)

# Step 3: Keep rows where the employee's salary exceeds the manager's
result = joined.filter(col("e.Salary") > col("m.Salary"))

# Step 4: Select the employee's name
result.select(col("e.Name").alias("Employee")).show()
```

```text
+--------+
|Employee|
+--------+
|     Joe|
+--------+
```



---

## **Approach 2: SQL Query in PySpark**

### **Steps**
1. Register the DataFrame as a temp SQL view.
2. Use a self-join to compare employee and manager salaries.
3. Filter and return employees who earn more than their manager.

### **Code**

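```python
# Step 1: Register the DataFrame as a temporary SQL view
df.createOrReplaceTempView("Employee")

# Steps 2-3: Self-join employees (e) to managers (m) and keep the higher earners
sql_result = spark.sql("""
    SELECT e.Name AS Employee
    FROM Employee e
    JOIN Employee m ON e.ManagerId = m.Id
    WHERE e.Salary > m.Salary
""")

sql_result.show()
```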
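
The query above uses only basic SQL (a self-join and a `WHERE` filter), so it can be cross-checked without a Spark cluster. The sketch below runs the same statement against an in-memory SQLite database through Python's standard-library `sqlite3` module. SQLite is an assumption made for this check, not part of the original solution, and the second query that lists the joined pairs is added here only to show what the inner join produces before filtering.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employee ("
    "Id INTEGER PRIMARY KEY, Name TEXT, Salary INTEGER, ManagerId INTEGER)"
)
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?, ?, ?)",
    [
        (1, "Joe", 70000, 3),
        (2, "Henry", 80000, 4),
        (3, "Sam", 60000, None),
        (4, "Max", 90000, None),
    ],
)

# Same statement as the Spark SQL version.
QUERY = """
    SELECT e.Name AS Employee
    FROM Employee e
    JOIN Employee m ON e.ManagerId = m.Id
    WHERE e.Salary > m.Salary
"""
names = [row[0] for row in conn.execute(QUERY)]

# The inner join alone: one (employee, manager) pair per managed employee.
# Sam and Max do not appear as employees because NULL never equals any Id.
pairs = conn.execute(
    """
    SELECT e.Name, m.Name
    FROM Employee e
    JOIN Employee m ON e.ManagerId = m.Id
    ORDER BY e.Id
    """
).fetchall()

print(names)  # ['Joe']
print(pairs)  # [('Joe', 'Sam'), ('Henry', 'Max')]
```

Only two pairs survive the join, and the salary filter then keeps Joe alone, which matches the expected output.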

---

## **Summary**

| Approach       | Method                | Key Technique                   |
|----------------|-----------------------|---------------------------------|
| **Approach 1** | PySpark DataFrame API | Self-join with `join`, `filter` |
| **Approach 2** | SQL in PySpark        | Self-join with `JOIN`, `WHERE`  |