Problem Statement:

Objective: Identify and retrieve all employees who have the same salary within the same department.

Input: A dataset containing employee information with the following fields:
emp_id (Employee ID)
name (Employee Name)
salary (Employee Salary)
dept_id (Department ID)

Output: A list of employees who share the same salary with at least one other employee in the same department. The results should be ordered by dept_id and salary.

In [0]:
# Sample data
data = [
    (102, "sohan", 3000, 11),
    (102, "rohan", 4000, 12),
    (103, "mohan", 5000, 13),
    (104, "cat", 3000, 11),
    (105, "suresh", 4000, 12),
    (109, "mahesh", 7000, 12),
    (108, "kamal", 8000, 11),
]

# Define the schema
columns = ["emp_id", "name", "salary", "dept_id"]

# Create DataFrame
df = spark.createDataFrame(data, schema=columns)

# display the DataFrame
df.display()

emp_id,name,salary,dept_id
102,sohan,3000,11
102,rohan,4000,12
103,mohan,5000,13
104,cat,3000,11
105,suresh,4000,12
109,mahesh,7000,12
108,kamal,8000,11


In [0]:
df.createOrReplaceTempView("emp_salary")

In [0]:
# Execute the SQL query
result_sql = spark.sql(
    """
    SELECT emp_id, name, salary, dept_id
    FROM (
        SELECT
            emp_id, name, salary, dept_id,
            COUNT(*) OVER (PARTITION BY dept_id, salary) AS employee_count
        FROM emp_salary
    ) subquery
    WHERE employee_count > 1
    ORDER BY dept_id, salary
"""
)

# display the result
result_sql.display()

emp_id,name,salary,dept_id
102,sohan,3000,11
104,cat,3000,11
102,rohan,4000,12
105,suresh,4000,12


In [0]:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Define a window partitioned by dept_id and salary
window_spec = Window.partitionBy("dept_id", "salary")

# Count all rows in each (dept_id, salary) partition, matching COUNT(*) in the SQL version
df_with_counts = df.withColumn("employee_count", F.count("*").over(window_spec))

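# Keep only groups with more than one employee, drop the helper column,
# and order the result as the problem statement requires
result_df = (
    df_with_counts.filter(F.col("employee_count") > 1)
    .drop("employee_count")
    .orderBy("dept_id", "salary")
)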

# display the result
result_df.display()

emp_id,name,salary,dept_id
102,sohan,3000,11
104,cat,3000,11
102,rohan,4000,12
105,suresh,4000,12


Explanation:

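Window Function:
COUNT(*) OVER (PARTITION BY dept_id, salary) attaches to every row the number of employees that share its dept_id and salary. Unlike GROUP BY, it does not collapse rows, so emp_id and name stay available.

Subquery:
A window function's result cannot be referenced in the WHERE clause of the same query level. The count is therefore computed in a subquery, and the outer query keeps only rows where employee_count is greater than 1.

Final Selection:
The outer query returns only emp_id, name, salary, and dept_id. The helper column employee_count is left out.

Order By:
Sorts the results by dept_id and salary, as the problem statement requires.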
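
Plain-Python sketch:
The same count, filter, and order steps can be illustrated without Spark. This is only a rough sketch for checking the logic on the sample data, not a substitute for the Spark versions above. The function name same_salary_in_dept is hypothetical and not part of the notebook.

```python
from collections import Counter

# Same sample rows as in the notebook: (emp_id, name, salary, dept_id)
rows = [
    (102, "sohan", 3000, 11),
    (102, "rohan", 4000, 12),
    (103, "mohan", 5000, 13),
    (104, "cat", 3000, 11),
    (105, "suresh", 4000, 12),
    (109, "mahesh", 7000, 12),
    (108, "kamal", 8000, 11),
]


def same_salary_in_dept(rows):
    """Return rows whose (dept_id, salary) pair occurs more than once."""
    # Plays the role of COUNT(*) OVER (PARTITION BY dept_id, salary)
    counts = Counter((dept_id, salary) for _, _, salary, dept_id in rows)
    # Plays the role of WHERE employee_count > 1
    matches = [r for r in rows if counts[(r[3], r[2])] > 1]
    # Plays the role of ORDER BY dept_id, salary
    return sorted(matches, key=lambda r: (r[3], r[2]))


result = same_salary_in_dept(rows)
for row in result:
    print(row)
```

Because Python's sorted is stable, rows that tie on dept_id and salary keep their input order. Spark gives no such guarantee for ties unless more sort columns are added.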