Problem Statement:

Companies often perform salary analyses to ensure fair compensation practices. One useful analysis is to check if there are any employees earning more than their direct managers.

As an HR Analyst, you're asked to identify all employees who earn more than their direct managers. The result should include the employee's ID and name.

In [0]:
# Sample data
data = [
    (1, "Emma Thompson", 3800, 1, 6),
    (2, "Daniel Rodriguez", 2230, 1, 7),
    (3, "Olivia Smith", 7000, 1, 8),
    (4, "Noah Johnson", 6800, 2, 9),
    (5, "Sophia Martinez", 1750, 1, 11),
    (6, "Liam Brown", 13000, 3, None),
    (7, "Ava Garcia", 12500, 3, None),
    (8, "William Davis", 6800, 2, None),
    (9, "Isabella Wilson", 11000, 3, None),
    (10, "James Anderson", 4000, 1, 11),
    (11, "Mia Taylor", 10800, 3, None),
    (12, "Benjamin Hernandez", 9500, 3, 8),
    (13, "Charlotte Miller", 7000, 2, 6),
    (14, "Logan Moore", 8000, 2, 6),
    (15, "Amelia Lee", 4000, 1, 7),
]

# Column names
columns = ["employee_id", "name", "salary", "department_id", "manager_id"]

# Create DataFrame
df = spark.createDataFrame(data, columns)

# display DataFrame
df.display()

employee_id,name,salary,department_id,manager_id
1,Emma Thompson,3800,1,6.0
2,Daniel Rodriguez,2230,1,7.0
3,Olivia Smith,7000,1,8.0
4,Noah Johnson,6800,2,9.0
5,Sophia Martinez,1750,1,11.0
6,Liam Brown,13000,3,
7,Ava Garcia,12500,3,
8,William Davis,6800,2,
9,Isabella Wilson,11000,3,
10,James Anderson,4000,1,11.0


In [0]:
df.createOrReplaceTempView("employee")

In [0]:
%sql
WITH ManagerEmployee AS (
  SELECT
    emp.employee_id AS employee_id,
    emp.name AS employee_name,
    emp.salary AS employee_salary,
    mgr.salary AS manager_salary
  FROM
    employee AS mgr
    INNER JOIN employee AS emp ON mgr.employee_id = emp.manager_id
)
SELECT
  employee_id,
  employee_name
FROM
  ManagerEmployee
WHERE
  employee_salary > manager_salary;

employee_id,employee_name
3,Olivia Smith
12,Benjamin Hernandez


In [0]:
# Import only what is used; a wildcard import would shadow Python built-ins such as sum and max
from pyspark.sql.functions import col

# Join DataFrame with itself to compare employee and manager salaries
result_df = (
    df.alias("emp")
    .join(df.alias("mgr"), col("emp.manager_id") == col("mgr.employee_id"))
    .select(
        col("emp.employee_id").alias("employee_id"),
        col("emp.name").alias("employee_name"),
    )
    .filter(col("emp.salary") > col("mgr.salary"))
)

# Show the result
result_df.display()

employee_id,employee_name
3,Olivia Smith
12,Benjamin Hernandez


Explanation:

Data: A dataset is defined and converted into a DataFrame.

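Self-Join: The DataFrame is joined with itself, aliased as emp (employees) and mgr (managers), on emp.manager_id = mgr.employee_id. Because this is an inner join, employees without a manager (a null manager_id) are excluded from the result.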

Select Columns: The select method retrieves the desired columns (employee ID and name).

Filter Condition: The filter method applies the condition to get employees with salaries greater than their managers'.

Display Result: Finally, the result is displayed using display().

You can run this code in your PySpark environment, and it will return the employees who earn more than their direct managers. Note that display() is specific to Databricks notebooks; in other PySpark environments, use show() instead.
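To check the join-and-filter logic without a Spark session, the same steps can be sketched in plain Python. This is only an illustration under stated assumptions, not part of the original solution: the names `employees` and `earns_more_than_manager` are hypothetical, and a dictionary lookup stands in for the mgr side of the self-join.

```python
# Plain-Python sketch of the self-join logic (no Spark required).
# Rows mirror the sample data: (employee_id, name, salary, department_id, manager_id)
employees = [
    (1, "Emma Thompson", 3800, 1, 6),
    (2, "Daniel Rodriguez", 2230, 1, 7),
    (3, "Olivia Smith", 7000, 1, 8),
    (4, "Noah Johnson", 6800, 2, 9),
    (5, "Sophia Martinez", 1750, 1, 11),
    (6, "Liam Brown", 13000, 3, None),
    (7, "Ava Garcia", 12500, 3, None),
    (8, "William Davis", 6800, 2, None),
    (9, "Isabella Wilson", 11000, 3, None),
    (10, "James Anderson", 4000, 1, 11),
    (11, "Mia Taylor", 10800, 3, None),
    (12, "Benjamin Hernandez", 9500, 3, 8),
    (13, "Charlotte Miller", 7000, 2, 6),
    (14, "Logan Moore", 8000, 2, 6),
    (15, "Amelia Lee", 4000, 1, 7),
]


def earns_more_than_manager(rows):
    """Return (employee_id, name) for employees paid more than their direct manager."""
    # Hypothetical helper: this lookup plays the role of the "mgr" side of the join.
    salary_by_id = {emp_id: salary for emp_id, _, salary, _, _ in rows}

    result = []
    for emp_id, name, salary, _, manager_id in rows:
        # Inner-join semantics: rows with no matching manager are skipped.
        manager_salary = salary_by_id.get(manager_id)
        if manager_salary is not None and salary > manager_salary:
            result.append((emp_id, name))
    return result


print(earns_more_than_manager(employees))
# → [(3, 'Olivia Smith'), (12, 'Benjamin Hernandez')]
```

The result matches the SQL and PySpark outputs above, which makes this a convenient sanity check when changing the sample data.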