Problem Statement:

Objective:

You are tasked with identifying employees who reach retirement age by the end of the current month, based on their BirthDate. The solution calculates a RetirementDate by adding 60 years to each employee's BirthDate and keeps those whose RetirementDate falls on or before the last day of the current month. Because the filter has no lower bound, it also returns employees whose 60th birthday fell in an earlier month.

In [0]:
from pyspark.sql.functions import col

# Sample data
data = [
    (101, "John Smith", "Manager", "Sales", "1990-01-31"),
    (102, "Jane Doe", "Engineer", "IT", "1985-02-25"),
    (103, "Alice Johnson", "Analyst", "Finance", "1992-01-15"),
    (104, "Robert Brown", "Technician", "Maintenance", "1988-03-31"),
    (105, "Emily Davis", "Consultant", "Marketing", "1975-01-30"),
    (106, "Michael Clark", "Supervisor", "Operations", "1980-01-31"),
    (107, "Sarah Miller", "Designer", "Design", "1991-04-29"),
    (108, "David Wilson", "HR Specialist", "HR", "1961-01-31"),
    (109, "Linda Thompson", "Developer", "IT", "1962-05-20"),
    (110, "Kevin Harris", "Architect", "Engineering", "1960-01-31"),
]

# Column names
columns = ["EmployeeID", "Name", "JobTitle", "Department", "Birthdate"]

# Create DataFrame
df = spark.createDataFrame(data, columns)

# Convert Birthdate to date type
df = df.withColumn("Birthdate", col("Birthdate").cast("date"))

df.display()

EmployeeID,Name,JobTitle,Department,Birthdate
101,John Smith,Manager,Sales,1990-01-31
102,Jane Doe,Engineer,IT,1985-02-25
103,Alice Johnson,Analyst,Finance,1992-01-15
104,Robert Brown,Technician,Maintenance,1988-03-31
105,Emily Davis,Consultant,Marketing,1975-01-30
106,Michael Clark,Supervisor,Operations,1980-01-31
107,Sarah Miller,Designer,Design,1991-04-29
108,David Wilson,HR Specialist,HR,1961-01-31
109,Linda Thompson,Developer,IT,1962-05-20
110,Kevin Harris,Architect,Engineering,1960-01-31


In [0]:
from pyspark.sql.functions import col, add_months, current_date, last_day

# Assuming df represents dbo.DimEmployee
result_df = (
    df.withColumn("RetirementDate", add_months(col("Birthdate"), 60 * 12))
    .filter(col("RetirementDate") <= last_day(current_date()))
    .select("EmployeeID", "Name", "JobTitle", "Department", "RetirementDate")
)

# Show results
result_df.display()

EmployeeID,Name,JobTitle,Department,RetirementDate
108,David Wilson,HR Specialist,HR,2021-01-31
109,Linda Thompson,Developer,IT,2022-05-20
110,Kevin Harris,Architect,Engineering,2020-01-31


In [0]:
df.createOrReplaceTempView("DimEmployee")

In [0]:
# Spark SQL query
result_df = spark.sql(
    """
    SELECT 
        EmployeeID, 
        Name, 
        JobTitle, 
        Department, 
        ADD_MONTHS(BirthDate, 60 * 12) AS RetirementDate
    FROM 
        DimEmployee
    WHERE 
        ADD_MONTHS(BirthDate, 60 * 12) <= LAST_DAY(current_date())
"""
)

# Display results
result_df.display()

EmployeeID,Name,JobTitle,Department,RetirementDate
108,David Wilson,HR Specialist,HR,2021-01-31
109,Linda Thompson,Developer,IT,2022-05-20
110,Kevin Harris,Architect,Engineering,2020-01-31


In [0]:
%sql
SELECT
  EmployeeID,
  Name,
  Department,
  ADD_MONTHS(Birthdate, 60 * 12) AS RetirementDate,
  LAST_DAY(current_date())
FROM
  DimEmployee
WHERE
  ADD_MONTHS(Birthdate, 60 * 12) <= LAST_DAY(current_date())

Explanation:

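add_months(col("BirthDate"), 60 * 12):
Adds 60 years (60 × 12 = 720 months) to the BirthDate. If the resulting day does not exist in the target month, the last day of that month is used.

last_day(current_date()):
Returns the last day of the current month.

Filter Clause:
Keeps employees whose 60th birthday falls on or before the last day of the current month, including those who turned 60 in an earlier month.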
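
The three steps above can also be traced without a Spark session. The following is a minimal plain-Python sketch of the same date arithmetic, not the Spark implementation itself: the helper names `add_months`, `last_day`, and `retiring_by_month_end` are hypothetical stand-ins written for illustration, and `today` is pinned to an arbitrary date so the result is reproducible.

```python
import calendar
from datetime import date


def add_months(d: date, months: int) -> date:
    """Shift d by a number of months, clamping the day to the target month's length."""
    total = d.year * 12 + (d.month - 1) + months
    year, month_index = divmod(total, 12)
    month = month_index + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def last_day(d: date) -> date:
    """Return the last calendar day of the month containing d."""
    return d.replace(day=calendar.monthrange(d.year, d.month)[1])


def retiring_by_month_end(employees, today: date, retirement_age: int = 60):
    """Yield (employee_id, retirement_date) for dates on or before the month end of today."""
    cutoff = last_day(today)
    for employee_id, birth_date in employees:
        retirement_date = add_months(birth_date, retirement_age * 12)
        if retirement_date <= cutoff:
            yield employee_id, retirement_date


# A subset of the sample rows, reduced to (EmployeeID, Birthdate).
employees = [
    (101, date(1990, 1, 31)),
    (108, date(1961, 1, 31)),
    (109, date(1962, 5, 20)),
    (110, date(1960, 1, 31)),
]

# Assumption: `today` is an arbitrary illustrative date, not the date the notebook was run.
result = list(retiring_by_month_end(employees, today=date(2022, 5, 1)))
for employee_id, retirement_date in result:
    print(employee_id, retirement_date)
```

Passing `today` as a parameter rather than reading the system clock makes the filter easy to check against known dates. The same idea could be applied to the Spark version by replacing `current_date()` with a literal date while testing.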

In [0]:
%sql
SELECT
  LAST_DAY(current_date())