# Employee Data Processing with PySpark
This notebook demonstrates an end-to-end PySpark data processing pipeline using an employee dataset. It covers:

- Data ingestion (using sample data)
- Data cleaning & transformation
- Feature engineering
- Aggregation & analysis
- Applying a wide variety of PySpark SQL functions (statistical, date functions, window functions, string operations, etc.)


Dataset Source: You can use the Human Resources Data Set from Kaggle or any other similar open-source employee dataset. In this example, we simulate sample data for demonstration.

# 1. Setup Spark Session

In [127]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import when, col, collect_list, collect_set, lit, to_date, to_timestamp, split,\
    trim, min, expr, add_months, format_number, window, cume_dist, current_date, current_timestamp, dense_rank, \
    rank, row_number, month, dayofmonth, dayofweek, dayofyear, month, dayofmonth, dayofweek, dayofyear, \
    add_months,month,months_between, decode
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, DateType, StructField, StructType, BinaryType
from pyspark.sql.window import Window


from awsglue.context import GlueContext
from pyspark.context import SparkContext
from awsglue.dynamicframe import DynamicFrame

In [128]:
spark = SparkSession.builder.appName("HRSparkFunctionsSession").config("spark.executor.memory", "2g").config("spark.driver.memory", "2g").config("spark.sql.shuffle.partitions", "4").config("spark.ui.port", "4040").getOrCreate()


glueContext = GlueContext(spark.sparkContext)




# 2. Create Sample DataFrame

For demonstration, we create a sample DataFrame with columns similar to those in an HR dataset.

In [129]:
# Employee_Name,EmpID,MarriedID,MaritalStatusID,GenderID,EmpStatusID,DeptID,PerfScoreID,FromDiversityJobFairID,Salary,Termd,PositionID,Position,State,Zip,DOB,Sex,MaritalDesc,CitizenDesc,HispanicLatino,RaceDesc,DateofHire,DateofTermination,TermReason,EmploymentStatus,Department,ManagerName,ManagerID,RecruitmentSource,PerformanceScore,EngagementSurvey,EmpSatisfaction,SpecialProjectsCount,LastPerformanceReview_Date,DaysLateLast30,Absences
data_skills = "python,java, c, devops"
schema = StructType([
    StructField("Employee_Name", StringType(), True),
    StructField("EmpID", IntegerType(), True),
    StructField("MarriedID", IntegerType(), True),
    StructField("MaritalStatusID", IntegerType(), True),
    StructField("GenderID", IntegerType(), True),
    StructField("EmpStatusID", IntegerType(), True),
    StructField("DeptID", IntegerType(), True),
    StructField("PerfScoreID", IntegerType(), True),
    StructField("FromDiversityJobFairID", IntegerType(), True),
    StructField("Salary", DoubleType(), True),
    StructField("Termd", StringType(), True),  # Could also be BooleanType() if represented as True/False
    StructField("PositionID", IntegerType(), True),
    StructField("Position", StringType(), True),
    StructField("State", StringType(), True),
    StructField("Zip", StringType(), True),  # Using StringType for ZIP codes to preserve any leading zeros
    StructField("DOB", StringType(), True),
    StructField("Sex", StringType(), True),
    StructField("MaritalDesc", StringType(), True),
    StructField("CitizenDesc", StringType(), True),
    StructField("HispanicLatino", StringType(), True),  # Change to BooleanType() if appropriate
    StructField("RaceDesc", StringType(), True),
    StructField("DateofHire", StringType(), True),
    StructField("DateofTermination", StringType(), True),
    StructField("TermReason", StringType(), True),
    StructField("EmploymentStatus", StringType(), True),
    StructField("Department", StringType(), True),
    StructField("ManagerName", StringType(), True),
    StructField("ManagerID", IntegerType(), True),
    StructField("RecruitmentSource", StringType(), True),
    StructField("PerformanceScore", StringType(), True),  # Or IntegerType() if numerical
    StructField("EngagementSurvey", DoubleType(), True),
    StructField("EmpSatisfaction", DoubleType(), True),
    StructField("SpecialProjectsCount", IntegerType(), True),
    StructField("LastPerformanceReview_Date", StringType(), True),
    StructField("DaysLateLast30", IntegerType(), True),
    StructField("Absences", IntegerType(), True)
])


local_df = spark.read.format('csv').option("header", "true").option("mode","PERMISSIVE").option("columnNameOfCorruptRecord", "_corrupt_record").schema(schema).load('/mount_folder/alpha/NYC_Taxi_Data_Pipeline_git/Practice/pyspark_functions/Employeee/HRDataset_v14.csv')
dynamic_frame = DynamicFrame.fromDF(local_df, glueContext, "dynamic_frame")
# dynamic_frame.show()
df = dynamic_frame.toDF()





In [130]:
df.show(1)

+-------------------+-----+---------+---------------+--------+-----------+------+-----------+----------------------+-------+-----+----------+--------------------+-----+-----+--------+---+-----------+-----------+--------------+--------+----------+-----------------+-----------------+----------------+-----------------+--------------+---------+-----------------+----------------+----------------+---------------+--------------------+--------------------------+--------------+--------+
|      Employee_Name|EmpID|MarriedID|MaritalStatusID|GenderID|EmpStatusID|DeptID|PerfScoreID|FromDiversityJobFairID| Salary|Termd|PositionID|            Position|State|  Zip|     DOB|Sex|MaritalDesc|CitizenDesc|HispanicLatino|RaceDesc|DateofHire|DateofTermination|       TermReason|EmploymentStatus|       Department|   ManagerName|ManagerID|RecruitmentSource|PerformanceScore|EngagementSurvey|EmpSatisfaction|SpecialProjectsCount|LastPerformanceReview_Date|DaysLateLast30|Absences|
+-------------------+-----+-------

# 3. Data Cleaning & Transformation


- Convert date and time columns to proper data types (using to_date(), to_timestamp(), etc.).
- Handle null values, trim spaces, and drop duplicates as necessary.
- Standardize string fields (e.g., using lower(), upper(), lpad()).

In [131]:
data_skills = data_skills.replace(" ",'')

In [132]:
# from pyspark.sql.functions import udf
# from pyspark.sql.types import ArrayType, StringType

# # Define a UDF to split and trim the Skills string
# def process_skills(skills_str):
#     return [skill.strip() for skill in skills_str.split(",")]

# process_skills_udf = udf(process_skills, (StringType()))

# # Add the Skills column using the UDF
# df = df.withColumn("Skills", process_skills_udf(lit(data_skills)))

# # Show the DataFrame
# df.select("Skills").show(truncate=False)


In [133]:
min_dateofhire = df.select(min(df["DateofHire"])).collect()[0][0]

df_temp =df.withColumn("to_date_DateofHire", to_date(trim('DateofHire') , 'M/d/yyyy')).withColumn("to_date_LastPerformanceReviewDate", to_date(trim('LastPerformanceReview_Date') , 'M/d/yyyy')).withColumn("to_date_DateofTermination", to_date(trim('DateofTermination') , 'M/d/yyyy')).withColumn("Skills", lit(data_skills))

df  = df_temp.fillna({"to_date_DateofHire":min_dateofhire})      
                

df = df.drop("DateofHire","LastPerformanceReview_Date", "DateofTermination")

df = df.withColumnRenamed("to_date_DateofHire", "DateofHire").withColumnRenamed("to_date_LastPerformanceReviewDate","LastPerformanceReview_Date").withColumnRenamed("to_date_DateofTermination","DateofTermination")

# 4. Feature Engineering

Create new features using various PySpark SQL functions.

- Compute new features like bonus amount, total compensation, and probation end dates (using add_months(), arithmetic operations, etc.).
- Split string columns into arrays (using split()), and process array data (using explode()).

In [134]:
df = df.withColumn("Department1", trim(df['Department']))
df = df.withColumnRenamed("Department1","Department")
df = df.drop("Department1")

In [135]:
df_cols = df.columns
# get index of the duplicate columns
duplicate_col_index = list(set([df_cols.index(c) for c in df_cols if df_cols.count(c) == 2]))
print(duplicate_col_index)
# rename by adding suffix '_duplicated'
for i in duplicate_col_index:
    df_cols[i] = df_cols[i] + '_duplicated'

# rename the column in DF
df = df.toDF(*df_cols)

[23]


In [136]:
df = df.drop("Department_duplicated")

In [137]:
df.select(df['department']).distinct().collect()

[Row(department='Production'),
 Row(department='Sales'),
 Row(department='IT/IS'),
 Row(department='Software Engineering'),
 Row(department='Admin Offices'),
 Row(department='Executive Office')]

In [138]:
#adding bonus department wise
df = df.withColumn("Bonus",when(df["department"] == "Production", 4.5).when(df["department"] == "Sales", 4).when(df["department"] == "IT/IS", 4.5).when(df["department"] == "Software Engineering", 3).when(df["department"] == "Admin Offices", 3).when(df["department"] == "Executive Office", 3)).withColumn("Probation_end_date", add_months(df["DateofHire"], 3)).withColumn("formatedSalary", format_number(df['salary'], 2)).withColumn("Bonus_Amount", col("Salary") * col("Bonus") / 100).withColumn("Total_Compensation", col("Salary") + col("Bonus_Amount"))

# Data Aggregation & Analysis:

- Group data by relevant keys (e.g., Team) and compute aggregations (using avg(), sum(), collect_list(), collect_set()).
- Apply window functions for ranking and lag/lead operations (using dense_rank(), lag(), lead(), cume_dist()).

# 5. Applying Additional PySpark SQL Functions

Below we demonstrate each additional function from your list.

- Demonstrate usage of statistical, date, and string functions such as exp(), floor(), format_number(), from_unixtime(), hour(), second(), pmod(), rand(), round(), substring(), etc.
- Ensure each of the functions listed in the provided table is used at least once in the pipeline.

# 6. Aggregation & Analysis Example


We combine aggregated data back into the original DataFrame (join on "Team" & optionally "Gender"). Here’s an example join (using only "Team" for simplicity):


- Optimize Spark configurations (e.g., spark.executor.memory, spark.sql.shuffle.partitions).
- Cache intermediate DataFrames when necessary.
- Use broadcast joins, partitioning, and column pruning where applicable.

In [139]:
windowSpec = Window.partitionBy("Department").orderBy("Salary")
df = df.withColumn("CumeDist", cume_dist().over(windowSpec))

In [140]:
df = df.withColumn("Current_date", current_date()).withColumn("Current_timestamp", current_timestamp())

In [147]:
# 3. dense_rank() - Rank employees by Salary (no gaps)
# department_wise
window_dept_salary = Window.partitionBy("Department").orderBy("salary")
df = df.withColumn("sal_rank_by_department", dense_rank().over(window_dept_salary))

sal_rank_emp = Window.orderBy('Salary')
df = df.withColumn("sal_rank_empDense", dense_rank().over(sal_rank_emp))
df = df.withColumn("sal_rank_empRow", row_number().over(sal_rank_emp))
df = df.withColumn("sal_rank_empRank", rank().over(sal_rank_emp))



# # 5. dayofmonth(), dayofweek(), dayofyear() from Start Date
df = df.withColumn("month", dayofmonth(df['DateofHire'])).withColumn("day",dayofweek(df['DateofHire'])).withColumn("day",dayofyear(df['DateofHire']))




# 6. decode() - Decode BinaryData column (assume UTF-8)
# adding columns
binary_data = bytearray(b'hello')
df = df.withColumn("BinaryData", lit(binary_data).cast(BinaryType()))

df = df.withColumn('decode_of_BinaryData', decode(df['BinaryData'], 'utf-8'))


In [149]:
from pyspark.sql.functions import exp

In [152]:
exp(lit(2))

Column<'EXP(2)'>

In [None]:
df.columns

['Employee_Name',
 'EmpID',
 'MarriedID',
 'MaritalStatusID',
 'GenderID',
 'EmpStatusID',
 'DeptID',
 'PerfScoreID',
 'FromDiversityJobFairID',
 'Salary',
 'Termd',
 'PositionID',
 'Position',
 'State',
 'Zip',
 'DOB',
 'Sex',
 'MaritalDesc',
 'CitizenDesc',
 'HispanicLatino',
 'RaceDesc',
 'TermReason',
 'EmploymentStatus',
 'ManagerName',
 'ManagerID',
 'RecruitmentSource',
 'PerformanceScore',
 'EngagementSurvey',
 'EmpSatisfaction',
 'SpecialProjectsCount',
 'DaysLateLast30',
 'Absences',
 'DateofHire',
 'LastPerformanceReview_Date',
 'DateofTermination',
 'Skills',
 'Department',
 'Bonus',
 'Probation_end_date',
 'formatedSalary',
 'Bonus_Amount',
 'Total_Compensation',
 'CumeDist',
 'Current_date',
 'Current_timestamp',
 'sal_rank_by_department',
 'sal_rank_empDense',
 'sal_rank_empRow',
 'sal_rank_empRank',
 'month',
 'day']

In [None]:


# # 7. exp() - Exponential of Salary (for demonstration)
# df = df.withColumn("ExpSalary", exp(col("Salary")))

# # 8. explode() - Explode the Skills array into individual rows
# df_exploded = df.select("First Name", explode(col("Skills")).alias("Skill"))

# # 9. extract year (using year() function) from Start Date (already used above as HireYear)
# df = df.withColumn("HireYear", year(col("Start Date")))

# # 10. floor() - Floor of Salary
# df = df.withColumn("FloorSalary", floor(col("Salary")))

# # 11. format_number() - Already used above in FormattedSalary

# # 12. from_unixtime() - Convert a Unix timestamp (simulate one)
# df = df.withColumn("SimulatedUnix", unix_timestamp(col("Start Date"))) \
#        .withColumn("FormattedDate", from_unixtime(col("SimulatedUnix"), "yyyy-MM-dd"))

# # 13. hour() - Extract hour from Last Login Time
# df = df.withColumn("LoginHour", hour(col("Last Login Time")))

# # 14. isnull() - Check if Phone column is null (simulate Phone column)
# df = df.withColumn("Phone", lit(None).cast(StringType()))
# df = df.withColumn("PhoneIsNull", isnull(col("Phone")))

# # 15. lag() and lead() - Previous and next Salary in ordering by Salary
# window_spec_salary = Window.orderBy("Salary")
# df = df.withColumn("PrevSalary", lag(col("Salary"), 1).over(window_spec_salary)) \
#        .withColumn("NextSalary", lead(col("Salary"), 1).over(window_spec_salary))

# # 16. length() - Length of the First Name
# df = df.withColumn("NameLength", length(col("First Name")))

# # 17. lower() - Already used above as LowerName

# # 18. lpad() - Pad First Name to length 10 with "*"
# df = df.withColumn("PaddedName", lpad(col("First Name"), 10, "*"))

# # 19. max() and min() - Maximum and Minimum Salary (as aggregation example)
# max_salary = df.agg(max("Salary").alias("MaxSalary")).collect()[0]["MaxSalary"]
# min_salary = df.agg(min("Salary").alias("MinSalary")).collect()[0]["MinSalary"]

# # 20. month() - Extract month from Start Date
# df = df.withColumn("StartMonth", month(col("Start Date")))

# # 21. nvl() - Replace null Phone with "N/A"
# df = df.withColumn("PhoneNumber", nvl(col("Phone"), "N/A"))

# # 22. pmod() - Salary modulo 10
# df = df.withColumn("SalaryMod10", pmod(col("Salary"), 10))

# # 23. rand() - Random number for each row (seed 100)
# df = df.withColumn("RandomValue", rand(100))

# # 24. round() - Round Salary to nearest integer
# df = df.withColumn("RoundedSalary", round(col("Salary"), 0))

# # 25. second() - Extract seconds from Last Login Time
# df = df.withColumn("LoginSecond", second(col("Last Login Time")))

# # 26. split() - Split a simulated Address column into array (simulate Address)
# df = df.withColumn("Address", lit("123 Main St, Apt 4B, New York, NY"))
# df = df.withColumn("AddressParts", split(col("Address"), ", "))

# # 27. substring() - Extract first 3 characters from First Name
# df = df.withColumn("NamePrefix", substring(col("First Name"), 1, 3))

# # 28. sum() - Total sum of Salary (as aggregation example)
# total_salary = df.agg(sum("Salary").alias("TotalSalary")).collect()[0]["TotalSalary"]

# # 29. unix_timestamp() - Already used above as SimulatedUnix

# # 30. upper() - Already used above as UpperName

# # Show the final DataFrame with new columns (selecting a subset for clarity)
# selected_columns = [
#     "First Name", "Gender", "Start Date", "Last Login Time", "Salary", "Bonus %",
#     "Senior Management", "Team", "Bonus Amount", "Total Compensation", "Probation End Date",
#     "UpperName", "LowerName", "FormattedSalary", "CumeDist", "CurrentDate", "CurrentTimestamp",
#     "DenseRank_Salary", "DaysSinceStart", "StartDayOfMonth", "StartDayOfWeek", "StartDayOfYear",
#     "DecodedData", "ExpSalary", "HireYear", "FloorSalary", "FormattedDate", "LoginHour",
#     "PhoneIsNull", "PrevSalary", "NextSalary", "NameLength", "PaddedName", "StartMonth",
#     "PhoneNumber", "SalaryMod10", "RandomValue", "RoundedSalary", "LoginSecond", "NamePrefix"
# ]
# df.select(selected_columns).show(truncate=False)

# # Show aggregation examples for max, min, and sum
# print("Max Salary:", max_salary)
# print("Min Salary:", min_salary)
# print("Total Salary:", total_salary)

# # Show exploded skills example (from explode)
# df_exploded = df.select("First Name", explode(col("Skills")).alias("Skill"))
# df_exploded.show()




# # # Aggregated DataFrame by Team
# # df_aggregated = df.groupBy("Team").agg(
# #     avg("Salary").alias("Avg_Salary"),
# #     collect_list("First Name").alias("Team_Members"),
# #     collect_set("First Name").alias("Unique_Team_Members")
# # )

# # # Join aggregated results back to the original DataFrame
# # df_final = df.join(df_aggregated, on="Team", how="left")
# # df_final.select("Team", "Avg_Salary", "Team_Members", "Unique_Team_Members").show(truncate=False)


# 7. Performance Optimization & Caching

- Implement unit tests to verify data quality (check for null values, duplicate rows, and correct transformations).
- Use assertions to ensure the integrity of key columns (e.g., “First Name” should have no nulls).

# 8. Unit Testing (Data Quality Checks)

You can run these tests in separate cells or in a dedicated testing notebook.

- Save the final processed DataFrame to disk in Parquet (or CSV) format.


In [None]:


print("All data quality tests passed!")

All data quality tests passed!


# 9. Save Processed Data

In [None]:
# df_final.repartition(2).write.format("parquet").mode("overwrite").save("/mount_folder/alpha/NYC_Taxi_Data_Pipeline_git/Practice/pyspark_functions/Employeee/employees_hr_processed.parquet")

# Conclusion
This notebook demonstrates a complete PySpark project that:

- Ingests data,
- Performs data cleaning and transformation,
- Uses a comprehensive set of PySpark SQL functions (including statistical, date, window, and string functions),
- Aggregates and analyzes the data,
- Implements performance optimizations and unit tests,
- And finally, saves the processed output.

Feel free to adjust the sample data and configurations to match your real dataset from Kaggle or another open source.