Problem Statement:

We have a DataFrame of employee records with four columns: employee ID (empid), department (dept), salary (salary), and date (date). The goal is to retrieve the full details of every employee who appears exactly once in the DataFrame, i.e., every row whose empid occurs only once in the dataset.

In [0]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Initialize Spark session
spark = SparkSession.builder.appName("Create DataFrame Example").getOrCreate()

# Define schema for the DataFrame
schema = StructType(
    [
        StructField("empid", IntegerType(), True),
        StructField("dept", StringType(), True),
        StructField("salary", IntegerType(), True),
        StructField("date", StringType(), True),  # Assuming date as StringType for now
    ]
)

# Input data
data = [
    (100, "IT", 100, "2024-05-12"),
    (200, "IT", 100, "2024-06-12"),
    (100, "FIN", 400, "2024-07-12"),
    (300, "FIN", 500, "2024-07-12"),
    (300, "FIN", 1543, "2024-07-12"),
    (300, "FIN", 1500, "2024-07-12"),
]

# Create DataFrame
df = spark.createDataFrame(data, schema=schema)

# Show DataFrame
df.display()

empid,dept,salary,date
100,IT,100,2024-05-12
200,IT,100,2024-06-12
100,FIN,400,2024-07-12
300,FIN,500,2024-07-12
300,FIN,1543,2024-07-12
300,FIN,1500,2024-07-12


In [0]:
df.createOrReplaceTempView("test")

In [0]:
%sql
-- Keep only the empid values that occur exactly once
with unique_emp as (
  select
    empid
  from
    test
  group by
    empid
  having
    count(*) = 1
)
-- Return the full rows for those empid values
select
  *
from
  test
where
  empid in (
    select
      empid
    from
      unique_emp
  )

empid,dept,salary,date
200,IT,100,2024-06-12


In [0]:
from pyspark.sql.functions import col, count

# Step 1: Group by 'empid' and count occurrences
emp_count_df = df.groupBy("empid").agg(count("empid").alias("count"))

# Step 2: Filter for empid with count of 1
single_occurrence_df = emp_count_df.filter(col("count") == 1).select("empid")

# Step 3: Join with original DataFrame to get full details
non_repeated_emp_details_df = df.join(single_occurrence_df, on="empid", how="inner")

# Show the result
non_repeated_emp_details_df.display()

empid,dept,salary,date
200,IT,100,2024-06-12


In [0]:
Explanation:

To solve this problem, follow these steps:

Step 1: Group by empid

Objective: Determine how many times each empid appears in the dataset.
Approach: Call groupBy on the empid column and aggregate with count. The result is a new DataFrame with one row per empid and the number of times that empid occurs.

Step 2: Filter for empid with a count of 1

Objective: Identify the empid values that appear exactly once.
Approach: Use filter to keep only the rows of the grouped DataFrame where the count is 1, then select the empid column. The result contains only the empid values that are unique in the dataset.

Step 3: Join with the original DataFrame

Objective: Retrieve the full details of the employees identified in the previous step.
Approach: Perform an inner join on empid between the original DataFrame and the filtered DataFrame of unique empid values. The join returns only the rows of the original DataFrame whose empid is among those unique values.
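
The three steps can also be traced on the sample data in plain Python, without a Spark session, as a quick check of the expected result. This is a minimal sketch and not part of the notebook: the names rows, counts, unique_ids, and unique_rows are assumptions for illustration.

```python
from collections import Counter

# Same sample rows as the notebook's input data: (empid, dept, salary, date)
rows = [
    (100, "IT", 100, "2024-05-12"),
    (200, "IT", 100, "2024-06-12"),
    (100, "FIN", 400, "2024-07-12"),
    (300, "FIN", 500, "2024-07-12"),
    (300, "FIN", 1543, "2024-07-12"),
    (300, "FIN", 1500, "2024-07-12"),
]

# Group: count how many times each empid occurs
counts = Counter(empid for empid, *_ in rows)

# Filter: keep the empid values that occur exactly once
unique_ids = {empid for empid, n in counts.items() if n == 1}

# Join: pull the full rows for those empid values
unique_rows = [row for row in rows if row[0] in unique_ids]

print(unique_rows)  # [(200, 'IT', 100, '2024-06-12')]
```

Here empid 100 occurs twice and empid 300 three times, so only the row for empid 200 remains, matching the output of the SQL and DataFrame cells.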