# IBM HR Analytics Employee Attrition & Performance 

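- Goal: build a model that predicts whether an employee will leave the company (attrition)
- Tools: PySpark for data processing, and a machine learning model for prediction

- Data source: https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset/data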


In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Employee_Attrition").getOrCreate()
spark


In [2]:
df = (
    spark.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("data/HR-Employee-Attrition.csv")
)

In [3]:
# Import only the names this notebook uses instead of wildcard imports.
# Note: `sum` from pyspark.sql.functions shadows Python's built-in sum().
from pyspark.sql.functions import col, isnan, sum, when
from pyspark.ml.feature import StringIndexer, VectorAssembler

# Inspecting the data structure

In [4]:
df.count()

1470

In [5]:
df.printSchema()

root
 |-- Age: integer (nullable = true)
 |-- Attrition: string (nullable = true)
 |-- BusinessTravel: string (nullable = true)
 |-- DailyRate: integer (nullable = true)
 |-- Department: string (nullable = true)
 |-- DistanceFromHome: integer (nullable = true)
 |-- Education: integer (nullable = true)
 |-- EducationField: string (nullable = true)
 |-- EmployeeCount: integer (nullable = true)
 |-- EmployeeNumber: integer (nullable = true)
 |-- EnvironmentSatisfaction: integer (nullable = true)
 |-- Gender: string (nullable = true)
 |-- HourlyRate: integer (nullable = true)
 |-- JobInvolvement: integer (nullable = true)
 |-- JobLevel: integer (nullable = true)
 |-- JobRole: string (nullable = true)
 |-- JobSatisfaction: integer (nullable = true)
 |-- MaritalStatus: string (nullable = true)
 |-- MonthlyIncome: integer (nullable = true)
 |-- MonthlyRate: integer (nullable = true)
 |-- NumCompaniesWorked: integer (nullable = true)
 |-- Over18: string (nullable = true)
 |-- OverTime: string 

## Column descriptions

- Age: Employee's age
- Attrition: Whether the employee left the company (Yes or No)
- BusinessTravel: How often the employee travels for work (Non-Travel, Travel_Rarely, Travel_Frequently)
- DailyRate: Employee's daily rate of pay
- Department: Department the employee belongs to (Research & Development, Sales, Human Resources)
- DistanceFromHome: Distance from the employee's home to the workplace
- Education: Education level (1: Below College, 2: College, 3: Bachelor, 4: Master, 5: Doctor)
- EducationField: Field of study
- EmployeeCount: Employee count (every value is 1)
- EmployeeNumber: Unique employee ID
- EnvironmentSatisfaction: Satisfaction with the work environment (1: Low, 2: Medium, 3: High, 4: Very High)
- Gender: Employee's gender (Male, Female)
- HourlyRate: Employee's hourly rate of pay
- JobInvolvement: Job involvement (1: Low, 2: Medium, 3: High, 4: Very High)
- JobLevel: Job level
- JobRole: Job role
- JobSatisfaction: Job satisfaction (1: Low, 2: Medium, 3: High, 4: Very High)
- MaritalStatus: Marital status (Single, Married, Divorced)
- MonthlyIncome: Employee's monthly income
- MonthlyRate: Employee's monthly rate (total monthly pay)
- NumCompaniesWorked: Number of companies the employee has worked for
- Over18: Whether the employee is 18 or older (every value is Y)
- OverTime: Whether the employee works overtime (Yes, No)
- PercentSalaryHike: Percentage salary increase
- PerformanceRating: Performance rating (1: Low, 2: Good, 3: Excellent, 4: Outstanding)
- RelationshipSatisfaction: Satisfaction with relationships with colleagues (1: Low, 2: Medium, 3: High, 4: Very High)
- StandardHours: Standard working hours (every value is 80)
- StockOptionLevel: Stock option level (0, 1, 2, 3)
- TotalWorkingYears: Total years of work experience
- TrainingTimesLastYear: Number of training sessions the employee attended last year
- WorkLifeBalance: Work-life balance rating (1: Bad, 2: Good, 3: Better, 4: Best)
- YearsAtCompany: Years the employee has worked at the current company
- YearsInCurrentRole: Years the employee has spent in the current role
- YearsSinceLastPromotion: Years since the employee's last promotion
- YearsWithCurrManager: Years the employee has worked with the current manager

In [6]:
df.show(5)


+---+---------+-----------------+---------+--------------------+----------------+---------+--------------+-------------+--------------+-----------------------+------+----------+--------------+--------+--------------------+---------------+-------------+-------------+-----------+------------------+------+--------+-----------------+-----------------+------------------------+-------------+----------------+-----------------+---------------------+---------------+--------------+------------------+-----------------------+--------------------+
|Age|Attrition|   BusinessTravel|DailyRate|          Department|DistanceFromHome|Education|EducationField|EmployeeCount|EmployeeNumber|EnvironmentSatisfaction|Gender|HourlyRate|JobInvolvement|JobLevel|             JobRole|JobSatisfaction|MaritalStatus|MonthlyIncome|MonthlyRate|NumCompaniesWorked|Over18|OverTime|PercentSalaryHike|PerformanceRating|RelationshipSatisfaction|StandardHours|StockOptionLevel|TotalWorkingYears|TrainingTimesLastYear|WorkLifeBalanc

In [7]:
# Summary statistics
df.describe().show()

+-------+------------------+---------+--------------+------------------+---------------+----------------+------------------+----------------+-------------+-----------------+-----------------------+------+------------------+------------------+------------------+--------------------+------------------+-------------+-----------------+------------------+------------------+------+--------+------------------+-------------------+------------------------+-------------+------------------+------------------+---------------------+------------------+------------------+------------------+-----------------------+--------------------+
|summary|               Age|Attrition|BusinessTravel|         DailyRate|     Department|DistanceFromHome|         Education|  EducationField|EmployeeCount|   EmployeeNumber|EnvironmentSatisfaction|Gender|        HourlyRate|    JobInvolvement|          JobLevel|             JobRole|   JobSatisfaction|MaritalStatus|    MonthlyIncome|       MonthlyRate|NumCompaniesWorked|O

                                                                                

# Data preprocessing

In [8]:
# Count null/NaN values in each column

null_counts = df.select(
    [
        sum(when(col(c).isNull() | isnan(c), 1).otherwise(0)).alias(c)
        for c in df.columns
    ]
)

null_counts.show()

+---+---------+--------------+---------+----------+----------------+---------+--------------+-------------+--------------+-----------------------+------+----------+--------------+--------+-------+---------------+-------------+-------------+-----------+------------------+------+--------+-----------------+-----------------+------------------------+-------------+----------------+-----------------+---------------------+---------------+--------------+------------------+-----------------------+--------------------+
|Age|Attrition|BusinessTravel|DailyRate|Department|DistanceFromHome|Education|EducationField|EmployeeCount|EmployeeNumber|EnvironmentSatisfaction|Gender|HourlyRate|JobInvolvement|JobLevel|JobRole|JobSatisfaction|MaritalStatus|MonthlyIncome|MonthlyRate|NumCompaniesWorked|Over18|OverTime|PercentSalaryHike|PerformanceRating|RelationshipSatisfaction|StandardHours|StockOptionLevel|TotalWorkingYears|TrainingTimesLastYear|WorkLifeBalance|YearsAtCompany|YearsInCurrentRole|YearsSinceLastPr

In [9]:
# Check for duplicate rows (list any row that appears more than once)
df.groupBy(df.columns).count().filter(col("count") > 1).show()

+---+---------+--------------+---------+----------+----------------+---------+--------------+-------------+--------------+-----------------------+------+----------+--------------+--------+-------+---------------+-------------+-------------+-----------+------------------+------+--------+-----------------+-----------------+------------------------+-------------+----------------+-----------------+---------------------+---------------+--------------+------------------+-----------------------+--------------------+-----+
|Age|Attrition|BusinessTravel|DailyRate|Department|DistanceFromHome|Education|EducationField|EmployeeCount|EmployeeNumber|EnvironmentSatisfaction|Gender|HourlyRate|JobInvolvement|JobLevel|JobRole|JobSatisfaction|MaritalStatus|MonthlyIncome|MonthlyRate|NumCompaniesWorked|Over18|OverTime|PercentSalaryHike|PerformanceRating|RelationshipSatisfaction|StandardHours|StockOptionLevel|TotalWorkingYears|TrainingTimesLastYear|WorkLifeBalance|YearsAtCompany|YearsInCurrentRole|YearsSince



In [10]:
# Drop columns that carry no predictive signal
# EmployeeCount, Over18, StandardHours hold a single value; EmployeeNumber is only a unique ID
df = df.drop("EmployeeNumber","EmployeeCount", "Over18", "StandardHours")
df.show(5)

+---+---------+-----------------+---------+--------------------+----------------+---------+--------------+-----------------------+------+----------+--------------+--------+--------------------+---------------+-------------+-------------+-----------+------------------+--------+-----------------+-----------------+------------------------+----------------+-----------------+---------------------+---------------+--------------+------------------+-----------------------+--------------------+
|Age|Attrition|   BusinessTravel|DailyRate|          Department|DistanceFromHome|Education|EducationField|EnvironmentSatisfaction|Gender|HourlyRate|JobInvolvement|JobLevel|             JobRole|JobSatisfaction|MaritalStatus|MonthlyIncome|MonthlyRate|NumCompaniesWorked|OverTime|PercentSalaryHike|PerformanceRating|RelationshipSatisfaction|StockOptionLevel|TotalWorkingYears|TrainingTimesLastYear|WorkLifeBalance|YearsAtCompany|YearsInCurrentRole|YearsSinceLastPromotion|YearsWithCurrManager|
+---+---------+---

In [11]:
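# Check data types


# Encode the categorical label: Attrition
# By default StringIndexer orders labels by descending frequency, so the most frequent label gets index 0.0
indexer = StringIndexer(inputCol="Attrition", outputCol="Attrition_Index")
df = indexer.fit(df).transform(df)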

In [12]:
# Check for outliers

# Data analysis

In [13]:
age_attrition = df.groupby("Age", "Attrition_Index").count()
age_attrition.orderBy("Age").show()



+---+---------------+-----+
|Age|Attrition_Index|count|
+---+---------------+-----+
| 18|            0.0|    4|
| 18|            1.0|    4|
| 19|            1.0|    6|
| 19|            0.0|    3|
| 20|            1.0|    6|
| 20|            0.0|    5|
| 21|            1.0|    6|
| 21|            0.0|    7|
| 22|            1.0|    5|
| 22|            0.0|   11|
| 23|            1.0|    4|
| 23|            0.0|   10|
| 24|            1.0|    7|
| 24|            0.0|   19|
| 25|            0.0|   20|
| 25|            1.0|    6|
| 26|            0.0|   27|
| 26|            1.0|   12|
| 27|            0.0|   45|
| 27|            1.0|    3|
+---+---------------+-----+
only showing top 20 rows



                                                                                

In [14]:
# Correlation analysis (numeric columns)
# DataFrame.corr() takes exactly two column names, so compute each pair separately
from itertools import combinations
numeric_cols = ["Age", "DailyRate", "DistanceFromHome", "HourlyRate", "MonthlyIncome"]
for c1, c2 in combinations(numeric_cols, 2):
    print(c1, c2, df.corr(c1, c2))

# Visualization

# Building the machine learning model and predicting

In [None]:
# assembler = VectorAssembler()