## Use Case 4: Employee Performance and Sales Analysis

**Objective:** Evaluate employee sales performance and identify top contributors to optimize workforce management and incentive programs.

**Dataset:** `WA_Fn-UseC_-HR-Employee-Attrition.csv`

**Key Features:**
- Employee demographics (Age, Gender, Education)
- Job details (Department, Job Role, Years at Company)
- Performance metrics (Performance Rating, Monthly Income)
- Engagement indicators (Job Involvement, Work-Life Balance)
- Target variable: Attrition (Yes/No)


In [0]:
# Use Case 4: Employee Performance - Data Loading
from pyspark.sql.functions import col, count, when

# Load employee attrition dataset
df_employee = spark.read.csv('/Volumes/workspace/default/capstone-project/WA_Fn-UseC_-HR-Employee-Attrition.csv', 
                              header=True, inferSchema=True)

print("Dataset loaded successfully!")
print(f"Total Records: {df_employee.count()}")
print(f"Total Columns: {len(df_employee.columns)}")

# Display schema
print("\n=== Dataset Schema ===")
df_employee.printSchema()

# Display sample data
display(df_employee.limit(10))


Dataset loaded successfully!
Total Records: 1470
Total Columns: 35

=== Dataset Schema ===
root
 |-- Age: integer (nullable = true)
 |-- Attrition: string (nullable = true)
 |-- BusinessTravel: string (nullable = true)
 |-- DailyRate: integer (nullable = true)
 |-- Department: string (nullable = true)
 |-- DistanceFromHome: integer (nullable = true)
 |-- Education: integer (nullable = true)
 |-- EducationField: string (nullable = true)
 |-- EmployeeCount: integer (nullable = true)
 |-- EmployeeNumber: integer (nullable = true)
 |-- EnvironmentSatisfaction: integer (nullable = true)
 |-- Gender: string (nullable = true)
 |-- HourlyRate: integer (nullable = true)
 |-- JobInvolvement: integer (nullable = true)
 |-- JobLevel: integer (nullable = true)
 |-- JobRole: string (nullable = true)
 |-- JobSatisfaction: integer (nullable = true)
 |-- MaritalStatus: string (nullable = true)
 |-- MonthlyIncome: integer (nullable = true)
 |-- MonthlyRate: integer (nullable = true)
 |-- NumCompaniesWor

Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,EnvironmentSatisfaction,Gender,HourlyRate,JobInvolvement,JobLevel,JobRole,JobSatisfaction,MaritalStatus,MonthlyIncome,MonthlyRate,NumCompaniesWorked,Over18,OverTime,PercentSalaryHike,PerformanceRating,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,2,Female,94,3,2,Sales Executive,4,Single,5993,19479,8,Y,Yes,11,3,1,80,0,8,0,1,6,4,0,5
49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,3,Male,61,2,2,Research Scientist,2,Married,5130,24907,1,Y,No,23,4,4,80,1,10,3,3,10,7,1,7
37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,4,Male,92,2,1,Laboratory Technician,3,Single,2090,2396,6,Y,Yes,15,3,2,80,0,7,3,3,0,0,0,0
33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,4,Female,56,3,1,Research Scientist,3,Married,2909,23159,1,Y,Yes,11,3,3,80,0,8,3,3,8,7,3,0
27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,1,Male,40,3,1,Laboratory Technician,2,Married,3468,16632,9,Y,No,12,3,4,80,1,6,3,3,2,2,2,2
32,No,Travel_Frequently,1005,Research & Development,2,2,Life Sciences,1,8,4,Male,79,3,1,Laboratory Technician,4,Single,3068,11864,0,Y,No,13,3,3,80,0,8,2,2,7,7,3,6
59,No,Travel_Rarely,1324,Research & Development,3,3,Medical,1,10,3,Female,81,4,1,Laboratory Technician,1,Married,2670,9964,4,Y,Yes,20,4,1,80,3,12,3,2,1,0,0,0
30,No,Travel_Rarely,1358,Research & Development,24,1,Life Sciences,1,11,4,Male,67,3,1,Laboratory Technician,3,Divorced,2693,13335,1,Y,No,22,4,2,80,1,1,2,3,1,0,0,0
38,No,Travel_Frequently,216,Research & Development,23,3,Life Sciences,1,12,4,Male,44,2,3,Manufacturing Director,3,Single,9526,8787,0,Y,No,21,4,2,80,0,10,2,3,9,7,1,8
36,No,Travel_Rarely,1299,Research & Development,27,3,Medical,1,13,3,Male,94,3,2,Healthcare Representative,3,Married,5237,16577,6,Y,No,13,3,2,80,2,17,3,2,7,7,7,7


In [0]:
# Use Case 4: Data Cleaning & Exploratory Analysis
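from pyspark.sql.functions import col, count, when, isnan, avg
from pyspark.sql.types import NumericType

# Check for missing values (only numeric columns)
print("=== Missing Values Analysis ===")
# Use isinstance rather than string matching: str(field.dataType) is
# 'IntegerType()' on recent Spark versions, so comparing against
# 'IntegerType' can silently produce an empty list.
numeric_cols = [field.name for field in df_employee.schema.fields
                if isinstance(field.dataType, NumericType)]

if numeric_cols:
    # Count NULL or NaN entries per column (first 10 numeric columns only)
    df_employee.select([count(when(col(c).isNull() | isnan(col(c)), c)).alias(c)
                        for c in numeric_cols[:10]]).show()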

# Remove duplicates
df_employee = df_employee.dropDuplicates()
print(f"\nRecords after removing duplicates: {df_employee.count()}")

# Analyze attrition distribution
print("\n=== Employee Attrition Distribution ===")
df_employee.groupBy("Attrition").count().show()

# Department-wise analysis
print("\n=== Department Distribution ===")
df_employee.groupBy("Department").count().orderBy("count", ascending=False).show()

# Performance Rating Analysis
print("\n=== Performance Rating Distribution ===")
df_employee.groupBy("PerformanceRating").count().show()

# Display basic statistics
print("\n=== Dataset Statistics (Key Metrics) ===")
df_employee.select("Age", "MonthlyIncome", "YearsAtCompany", "JobSatisfaction").describe().show()

print("\n✅ Data cleaning completed!")


=== Missing Values Analysis ===

Records after removing duplicates: 1470

=== Employee Attrition Distribution ===
+---------+-----+
|Attrition|count|
+---------+-----+
|       No| 1233|
|      Yes|  237|
+---------+-----+


=== Department Distribution ===
+--------------------+-----+
|          Department|count|
+--------------------+-----+
|Research & Develo...|  961|
|               Sales|  446|
|     Human Resources|   63|
+--------------------+-----+


=== Performance Rating Distribution ===
+-----------------+-----+
|PerformanceRating|count|
+-----------------+-----+
|                3| 1244|
|                4|  226|
+-----------------+-----+


=== Dataset Statistics (Key Metrics) ===
+-------+------------------+-----------------+------------------+------------------+
|summary|               Age|    MonthlyIncome|    YearsAtCompany|   JobSatisfaction|
+-------+------------------+-----------------+------------------+------------------+
|  count|              1470|             1470

In [0]:
# Use Case 4: Performance Analysis
from pyspark.sql.functions import avg, col, count

# Top Performers by Income
print("=== Top 10 Highest Paid Employees ===")
top_earners = df_employee.orderBy(col("MonthlyIncome").desc()).limit(10)
display(top_earners.select("EmployeeNumber", "Department", "JobRole", "MonthlyIncome", 
                            "YearsAtCompany", "PerformanceRating"))

# Department-wise Average Income
print("\n=== Average Income by Department ===")
dept_income = df_employee.groupBy("Department").agg(
    avg("MonthlyIncome").alias("Avg_Income"),
    count("*").alias("Employee_Count")
).orderBy("Avg_Income", ascending=False)
display(dept_income)

# Job Role Performance Analysis
print("\n=== Average Income by Job Role (Top 10) ===")
role_income = df_employee.groupBy("JobRole").agg(
    avg("MonthlyIncome").alias("Avg_Income"),
    avg("JobSatisfaction").alias("Avg_Satisfaction"),
    count("*").alias("Employee_Count")
).orderBy("Avg_Income", ascending=False).limit(10)
display(role_income)

# Attrition by Department
print("\n=== Attrition Rate by Department ===")
attrition_dept = df_employee.groupBy("Department", "Attrition").count()
display(attrition_dept)

# Performance Rating vs Income
print("\n=== Income by Performance Rating ===")
perf_income = df_employee.groupBy("PerformanceRating").agg(
    avg("MonthlyIncome").alias("Avg_Income"),
    count("*").alias("Employee_Count")
).orderBy("PerformanceRating")
display(perf_income)

print("\n✅ Performance analysis completed!")


=== Top 10 Highest Paid Employees ===


EmployeeNumber,Department,JobRole,MonthlyIncome,YearsAtCompany,PerformanceRating
259,Research & Development,Manager,19999,33,3
1035,Research & Development,Research Director,19973,21,4
1191,Research & Development,Manager,19943,5,3
226,Research & Development,Manager,19926,5,3
787,Research & Development,Manager,19859,5,3
1282,Sales,Manager,19847,29,4
1038,Sales,Manager,19845,32,3
1740,Sales,Manager,19833,21,3
1255,Research & Development,Research Director,19740,8,3
1338,Human Resources,Manager,19717,7,3



=== Average Income by Department ===


Department,Avg_Income,Employee_Count
Sales,6959.17264573991,446
Human Resources,6654.507936507936,63
Research & Development,6281.252861602497,961



=== Average Income by Job Role (Top 10) ===


JobRole,Avg_Income,Avg_Satisfaction,Employee_Count
Manager,17181.676470588234,2.7058823529411766,102
Research Director,16033.55,2.7,80
Healthcare Representative,7528.763358778626,2.786259541984733,131
Manufacturing Director,7295.137931034483,2.682758620689655,145
Sales Executive,6924.279141104295,2.754601226993865,326
Human Resources,4235.75,2.5576923076923075,52
Research Scientist,3239.972602739726,2.7739726027397262,292
Laboratory Technician,3237.169884169884,2.6911196911196917,259
Sales Representative,2626.0,2.7349397590361444,83



=== Attrition Rate by Department ===


Department,Attrition,count
Research & Development,No,828
Research & Development,Yes,133
Sales,No,354
Sales,Yes,92
Human Resources,No,51
Human Resources,Yes,12



=== Income by Performance Rating ===


PerformanceRating,Avg_Income,Employee_Count
3,6537.274115755628,1244
4,6313.893805309735,226



✅ Performance analysis completed!
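
The "Attrition Rate by Department" output above lists raw counts rather than rates. A minimal sketch in plain Python that derives the rate per department from those counts; the helper name `attrition_rates` is hypothetical and not part of the original notebook, and the numbers are copied from the output table:

```python
# Counts copied from the "Attrition Rate by Department" output above
counts = {
    ("Research & Development", "No"): 828,
    ("Research & Development", "Yes"): 133,
    ("Sales", "No"): 354,
    ("Sales", "Yes"): 92,
    ("Human Resources", "No"): 51,
    ("Human Resources", "Yes"): 12,
}

def attrition_rates(counts):
    """Return {department: share of employees with Attrition == 'Yes'}."""
    totals, leavers = {}, {}
    for (dept, attrition), n in counts.items():
        totals[dept] = totals.get(dept, 0) + n
        if attrition == "Yes":
            leavers[dept] = leavers.get(dept, 0) + n
    return {dept: leavers.get(dept, 0) / totals[dept] for dept in totals}

rates = attrition_rates(counts)
# Print departments from highest to lowest attrition rate
for dept, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{dept}: {rate:.1%}")
```

On these counts, Sales has the highest rate (about 20.6%), followed by Human Resources (about 19.0%) and Research & Development (about 13.8%). Human Resources has only 63 employees, so its rate is the least stable of the three.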


In [0]:
# Use Case 4: Employee Attrition Prediction Model
from pyspark.ml.feature import VectorAssembler, StringIndexer
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator
from pyspark.ml import Pipeline

# Select features
numeric_features = [
    "Age", "DailyRate", "DistanceFromHome", "MonthlyIncome", 
    "MonthlyRate", "NumCompaniesWorked", "PercentSalaryHike",
    "TotalWorkingYears", "YearsAtCompany", "YearsInCurrentRole",
    "YearsSinceLastPromotion", "YearsWithCurrManager"
]

categorical_features = ["Department", "JobRole", "MaritalStatus", "Gender"]

# Index categorical variables
indexers = [StringIndexer(inputCol=cat, outputCol=f"{cat}_indexed", handleInvalid="keep") 
            for cat in categorical_features]

# Prepare features
indexed_features = [f"{cat}_indexed" for cat in categorical_features]
all_features = numeric_features + indexed_features

# Index target
attrition_indexer = StringIndexer(inputCol="Attrition", outputCol="label")

# Assemble features
assembler = VectorAssembler(
    inputCols=all_features,
    outputCol="features",
    handleInvalid="skip"
)

# Split data
train_data, test_data = df_employee.randomSplit([0.8, 0.2], seed=42)
print(f"Training set: {train_data.count()} records")
print(f"Test set: {test_data.count()} records")

# Build model
rf = RandomForestClassifier(
    featuresCol="features",
    labelCol="label",
    predictionCol="prediction",
    numTrees=100,
    maxDepth=10,
    seed=42
)

# Pipeline
pipeline = Pipeline(stages=indexers + [attrition_indexer, assembler, rf])

# Train
print("\n🔧 Training Random Forest model...")
model = pipeline.fit(train_data)

# Predict
predictions = model.transform(test_data)

# Evaluate
auc_evaluator = BinaryClassificationEvaluator(labelCol="label", metricName="areaUnderROC")
accuracy_evaluator = MulticlassClassificationEvaluator(labelCol="label", metricName="accuracy")

auc = auc_evaluator.evaluate(predictions)
accuracy = accuracy_evaluator.evaluate(predictions)

print(f"\n✅ Model Performance:")
print(f"   • AUC-ROC Score: {auc:.4f}")
print(f"   • Accuracy: {accuracy:.4f}")

# Display predictions
print("\n=== Sample Predictions ===")
display(predictions.select("EmployeeNumber", "Attrition", "label", "prediction", 
                           "probability", "Department", "JobRole", "MonthlyIncome").limit(20))


Training set: 1183 records
Test set: 287 records

🔧 Training Random Forest model...

✅ Model Performance:
   • AUC-ROC Score: 0.7418
   • Accuracy: 0.8432

=== Sample Predictions ===


EmployeeNumber,Attrition,label,prediction,probability,Department,JobRole,MonthlyIncome
1269,No,0.0,0.0,"{""type"":""1"",""size"":null,""indices"":null,""values"":[""0.5066157220812394"",""0.49338427791876066""]}",Research & Development,Research Scientist,2994
1248,Yes,1.0,0.0,"{""type"":""1"",""size"":null,""indices"":null,""values"":[""0.734028749028749"",""0.26597125097125096""]}",Research & Development,Research Scientist,1859
243,Yes,1.0,0.0,"{""type"":""1"",""size"":null,""indices"":null,""values"":[""0.5984031385281385"",""0.4015968614718614""]}",Research & Development,Laboratory Technician,1102
1657,No,0.0,0.0,"{""type"":""1"",""size"":null,""indices"":null,""values"":[""0.6726626477585687"",""0.32733735224143135""]}",Sales,Sales Representative,2783
137,Yes,1.0,0.0,"{""type"":""1"",""size"":null,""indices"":null,""values"":[""0.6505743179515998"",""0.3494256820484002""]}",Research & Development,Laboratory Technician,2926
960,Yes,1.0,0.0,"{""type"":""1"",""size"":null,""indices"":null,""values"":[""0.604714314521937"",""0.3952856854780629""]}",Research & Development,Laboratory Technician,2973
922,Yes,1.0,1.0,"{""type"":""1"",""size"":null,""indices"":null,""values"":[""0.4577325715604492"",""0.5422674284395509""]}",Sales,Sales Representative,2044
701,Yes,1.0,0.0,"{""type"":""1"",""size"":null,""indices"":null,""values"":[""0.5816339044082851"",""0.41836609559171484""]}",Research & Development,Research Scientist,1009
2021,No,0.0,0.0,"{""type"":""1"",""size"":null,""indices"":null,""values"":[""0.5937373215105578"",""0.40626267848944214""]}",Sales,Sales Representative,2380
669,No,0.0,0.0,"{""type"":""1"",""size"":null,""indices"":null,""values"":[""0.5835583087088495"",""0.41644169129115055""]}",Sales,Sales Representative,3447
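
Because only 237 of the 1470 employees left, accuracy is easier to interpret next to a majority-class baseline. A rough sketch in plain Python using the figures printed above; it assumes the 287-row test split has roughly the same class mix as the full dataset (the split's own class counts are not shown), and the ten `(label, prediction)` pairs are only the rows displayed in the sample, not the whole test set:

```python
# Class counts from the "Employee Attrition Distribution" output (full dataset)
n_no, n_yes = 1233, 237
reported_accuracy = 0.8432  # from the model output above

# Accuracy of a trivial model that always predicts "No"
baseline_accuracy = n_no / (n_no + n_yes)

print(f"Majority-class baseline: {baseline_accuracy:.4f}")
print(f"Reported accuracy:       {reported_accuracy:.4f}")
print(f"Lift over baseline:      {reported_accuracy - baseline_accuracy:+.4f}")

# (label, prediction) pairs from the ten sample rows displayed above
sample = [(0, 0), (1, 0), (1, 0), (0, 0), (1, 0),
          (1, 0), (1, 1), (1, 0), (0, 0), (0, 0)]

# Recall on the "Yes" class: true positives / actual positives
actual_yes = sum(1 for label, _ in sample if label == 1)
true_pos = sum(1 for label, pred in sample if label == 1 and pred == 1)
sample_recall = true_pos / actual_yes

print(f"Recall on 'Yes' in the sample rows: {true_pos}/{actual_yes} = {sample_recall:.2f}")
```

Under that assumption the baseline is about 0.839, so the reported accuracy is only marginally higher, and the AUC-ROC score and the recall on the "Yes" class are likely the more informative metrics here. In the ten displayed rows, one of the six actual leavers is predicted correctly.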


In [0]:
# Use Case 4: Employee Insights & Retention Strategies
from pyspark.sql.functions import col, when, udf
from pyspark.sql.types import DoubleType

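# Extract attrition probability: index 1 of the probability vector is the
# "Yes" class (StringIndexer assigned label 1.0 to the less frequent value)
def get_attrition_prob(probability):
    return float(probability[1]) if probability is not None else 0.0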

get_prob_udf = udf(get_attrition_prob, DoubleType())
predictions_with_prob = predictions.withColumn("attrition_risk", get_prob_udf(col("probability")))

# Categorize risk
df_risk = predictions_with_prob.withColumn(
    "retention_priority",
    when(col("attrition_risk") >= 0.7, "High Priority")
    .when(col("attrition_risk") >= 0.4, "Medium Priority")
    .otherwise("Low Priority")
)

print("=== Employee Retention Priority ===")
df_risk.groupBy("retention_priority").count().orderBy("retention_priority").show()

# High-risk employees
print("\n=== High Priority Employees (At Risk) ===")
high_risk_employees = df_risk.filter(col("retention_priority") == "High Priority")
print(f"🔴 {high_risk_employees.count()} employees at high risk of attrition")

display(high_risk_employees.select("EmployeeNumber", "Department", "JobRole", 
                                    "MonthlyIncome", "YearsAtCompany", 
                                    "attrition_risk", "retention_priority").limit(20))

# Department analysis
print("\n=== Retention Priority by Department ===")
dept_priority = df_risk.groupBy("Department", "retention_priority").count()
display(dept_priority)

# Recommendations
print("\n=== HR Recommendations ===")
print("🎯 Retention Strategies:")
print(f"   • Focus on {high_risk_employees.count()} high-priority employees")
print("   • Implement targeted retention programs")
print("   • Review compensation packages")
print("   • Enhance career development")
print("   • Improve work-life balance")

print("\n✅ Use Case 4: Employee Performance Analysis Complete!")


=== Employee Retention Priority ===
+------------------+-----+
|retention_priority|count|
+------------------+-----+
|      Low Priority|  270|
|   Medium Priority|   17|
+------------------+-----+


=== High Priority Employees (At Risk) ===
🔴 0 employees at high risk of attrition


EmployeeNumber,Department,JobRole,MonthlyIncome,YearsAtCompany,attrition_risk,retention_priority



=== Retention Priority by Department ===


Department,retention_priority,count
Sales,Low Priority,78
Sales,Medium Priority,8
Human Resources,Low Priority,8
Research & Development,Low Priority,184
Research & Development,Medium Priority,9



=== HR Recommendations ===
🎯 Retention Strategies:
   • Focus on 0 high-priority employees
   • Implement targeted retention programs
   • Review compensation packages
   • Enhance career development
   • Improve work-life balance

✅ Use Case 4: Employee Performance Analysis Complete!
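
The empty "High Priority" bucket follows from the 0.7 cutoff relative to the scores the model produces. A small sketch in plain Python that mirrors the `when`/`otherwise` chain and applies it to the ten sample probabilities shown in the prediction output; the function `retention_priority` and its keyword arguments are hypothetical names for illustration, and the probabilities are rounded to four decimals:

```python
from collections import Counter

# P(Attrition = "Yes") for the ten sample predictions shown earlier (rounded)
sample_risk = [0.4934, 0.2660, 0.4016, 0.3273, 0.3494,
               0.3953, 0.5423, 0.4184, 0.4063, 0.4164]

def retention_priority(risk, high=0.7, medium=0.4):
    """Mirror the notebook's when/otherwise chain in plain Python."""
    if risk >= high:
        return "High Priority"
    if risk >= medium:
        return "Medium Priority"
    return "Low Priority"

# Count how many sample rows fall into each bucket
buckets = Counter(retention_priority(r) for r in sample_risk)
print(dict(buckets))
print(f"Highest sample risk: {max(sample_risk):.4f}")
```

None of the sample rows reaches 0.7 (the highest is about 0.54), which is consistent with the report of 0 high-priority employees. If the buckets are meant to drive action, the cutoffs may need to be chosen from the model's actual score distribution rather than fixed in advance.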
