# Employee Attrition – Feature Engineering

This notebook prepares the dataset for machine learning by:

- Dropping uninformative columns  
- Encoding the target and the categorical features  
- Creating new features  
- Scaling continuous numerical features  
- Splitting into train/test datasets and saving them to disk  


In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split

In [2]:
#Load data
df=pd.read_csv(r"C:\Users\Jones Mbela\Desktop\RENNY\AI AND ML\Employee attrition\data\WA_Fn-UseC_-HR-Employee-Attrition.csv")
df.head()


Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,1,80,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,4,80,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,...,2,80,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,...,3,80,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,...,4,80,1,6,3,3,2,2,2,2


In [4]:
# Drop columns with no predictive value (constant columns and the employee ID)
drop_cols = [
    "EmployeeCount",
    "Over18",
    "StandardHours",
    "EmployeeNumber"
]
df.drop(columns=drop_cols, inplace=True)
print("Shape after dropping columns:", df.shape)

Shape after dropping columns: (1470, 31)
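
One way to sanity-check a drop list like the one above is to look for columns that hold a single value in every row. The sketch below uses a small made-up frame (only the column names mirror the notebook; the values are illustrative) and flags such constant columns. Identifier columns such as `EmployeeNumber` are not caught by this check and still need to be listed by hand.

```python
import pandas as pd

# Toy frame with hypothetical values; only the column names mirror the notebook.
toy = pd.DataFrame({
    "Age": [41, 49, 37],
    "EmployeeCount": [1, 1, 1],
    "StandardHours": [80, 80, 80],
    "EmployeeNumber": [1, 2, 4],
})

# A column with one distinct value cannot help a model separate the classes.
constant_cols = [col for col in toy.columns if toy[col].nunique() == 1]
print(constant_cols)
```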


In [5]:
# Encode the target as binary: Yes -> 1, No -> 0
df['Attrition'] = df['Attrition'].map({'Yes': 1, 'No': 0})
df['Attrition'].value_counts()

Attrition
0    1233
1     237
Name: count, dtype: int64
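
The counts above show that the two classes are imbalanced. As a quick illustration, the sketch below rebuilds a label column from the printed counts (not from the CSV itself) and computes the share of positive cases, which is the reason a stratified split is a sensible choice for this target.

```python
import pandas as pd

# Rebuild the label column from the counts printed above (1233 zeros, 237 ones).
attrition = pd.Series([0] * 1233 + [1] * 237, name="Attrition")

# Share of each class; the positive class (attrition) is the minority.
class_share = attrition.value_counts(normalize=True)
positive_rate = class_share.loc[1]
print(f"Attrition rate: {positive_rate:.1%}")
```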

In [6]:
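# Feature creation

# Monthly income per year of tenure (+1 avoids division by zero for new hires)
df['IncomePerYear'] = df['MonthlyIncome'] / (df['YearsAtCompany'] + 1)

# Interaction between total working experience and job level
df['ExperienceLevel'] = df['TotalWorkingYears'] * df['JobLevel']

# Satisfaction index: mean of the three satisfaction scores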
df['SatisfactionIndex'] = (df['JobSatisfaction'] + df['EnvironmentSatisfaction'] + df['RelationshipSatisfaction']) / 3
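
To make the three derived features concrete, here is a small worked example on two hypothetical employees (all values are made up for illustration). The second row has `YearsAtCompany` equal to 0, which shows why the denominator adds 1.

```python
import pandas as pd

# Two hypothetical employees; the values are illustrative, not from the dataset.
toy = pd.DataFrame({
    "MonthlyIncome": [6000, 3000],
    "YearsAtCompany": [5, 0],
    "TotalWorkingYears": [8, 1],
    "JobLevel": [2, 1],
    "JobSatisfaction": [4, 1],
    "EnvironmentSatisfaction": [2, 3],
    "RelationshipSatisfaction": [3, 2],
})

# Same formulas as the cell above.
toy["IncomePerYear"] = toy["MonthlyIncome"] / (toy["YearsAtCompany"] + 1)
toy["ExperienceLevel"] = toy["TotalWorkingYears"] * toy["JobLevel"]
toy["SatisfactionIndex"] = (
    toy["JobSatisfaction"]
    + toy["EnvironmentSatisfaction"]
    + toy["RelationshipSatisfaction"]
) / 3

print(toy[["IncomePerYear", "ExperienceLevel", "SatisfactionIndex"]])
```

For the first row this gives 6000 / 6 = 1000, 8 * 2 = 16, and (4 + 2 + 3) / 3 = 3. For the new hire in the second row the income feature is simply the monthly income, since the denominator is 1.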


In [7]:
#Category Encoding
categorical_cols = df.select_dtypes(include='object').columns
categorical_cols

Index(['BusinessTravel', 'Department', 'EducationField', 'Gender', 'JobRole',
       'MaritalStatus', 'OverTime'],
      dtype='object')

In [8]:
#one-hot encoding
df = pd.get_dummies(df, columns=categorical_cols, drop_first=True)
print("Data shape after encoding:", df.shape)

Data shape after encoding: (1470, 48)
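
With `drop_first=True`, a column with k categories produces k - 1 indicator columns, and the first category in sorted order becomes the implicit baseline. The sketch below demonstrates this on a toy frame with hypothetical rows; the same arithmetic explains how the column count changes in the cell above.

```python
import pandas as pd

# Toy frame with hypothetical rows; category labels follow the notebook's preview.
toy = pd.DataFrame({
    "Age": [41, 49, 37],
    "Department": ["Sales", "Research & Development", "Human Resources"],
    "OverTime": ["Yes", "No", "No"],
})
categorical = ["Department", "OverTime"]

# Each column with k categories yields k - 1 dummies when drop_first=True.
expected_dummies = sum(toy[col].nunique() - 1 for col in categorical)

encoded = pd.get_dummies(toy, columns=categorical, drop_first=True)
print(list(encoded.columns))
```

Dropping one level per variable avoids perfectly collinear indicator columns, which matters mostly for linear models; tree-based models are generally indifferent to it.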


In [9]:
# Standardize continuous numerical features (zero mean, unit variance)
numeric_cols = [
    'Age', 'DailyRate', 'DistanceFromHome', 'HourlyRate', 'MonthlyIncome', 'MonthlyRate',
    'PercentSalaryHike', 'TotalWorkingYears',
    'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion',
    'YearsWithCurrManager', 'IncomePerYear', 'ExperienceLevel', 'SatisfactionIndex'
]

scaler = StandardScaler()
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])
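
For reference, `StandardScaler` replaces each value with its z-score: the column mean is subtracted and the result is divided by the population standard deviation. The sketch below checks this by hand on a hypothetical four-value column.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical column; the values are illustrative.
toy = pd.DataFrame({"Age": [20.0, 30.0, 40.0, 50.0]})

scaled = StandardScaler().fit_transform(toy[["Age"]])

# Same result by hand: subtract the mean, divide by the population std (ddof=0).
manual = (toy["Age"] - toy["Age"].mean()) / toy["Age"].std(ddof=0)

print(np.allclose(scaled[:, 0], manual))
```

Note that pandas defaults to the sample standard deviation (`ddof=1`), so `ddof=0` has to be passed explicitly to match the scaler.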

In [10]:
# Separate the features (X) from the target (y)
X = df.drop('Attrition', axis=1)
y = df['Attrition']

In [11]:
#Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
print("Training set shape:", X_train.shape)
print("Test set shape:", X_test.shape)

Training set shape: (1176, 47)
Test set shape: (294, 47)
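
In this notebook the scaler is fitted before the split, so its means and standard deviations are computed from all 1,470 rows, including the ones that end up in the test set. A common alternative ordering is to split first and fit the scaler on the training rows only. The sketch below illustrates that ordering on synthetic data (a single random `Age` column and labels rebuilt from the class counts above); it is not the notebook's own pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in with the notebook's row count and class counts.
X = pd.DataFrame({"Age": rng.integers(18, 61, size=1470).astype(float)})
y = pd.Series([0] * 1233 + [1] * 237, name="Attrition")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Stratification keeps the positive rate nearly identical in both parts.
print(round(y_train.mean(), 3), round(y_test.mean(), 3))

# Fit the scaler on training rows only, then reuse its statistics on the test rows.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

With this ordering the test set plays no part in estimating the scaling parameters, so evaluation on it is closer to how the model would behave on unseen employees.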


In [12]:
#Save processed data
X_train.to_csv(r"C:\Users\Jones Mbela\Desktop\RENNY\AI AND ML\Employee attrition\data\X_train.csv", index=False)
y_train.to_csv(r"C:\Users\Jones Mbela\Desktop\RENNY\AI AND ML\Employee attrition\data\y_train.csv", index=False)
X_test.to_csv(r"C:\Users\Jones Mbela\Desktop\RENNY\AI AND ML\Employee attrition\data\X_test.csv", index=False)
y_test.to_csv(r"C:\Users\Jones Mbela\Desktop\RENNY\AI AND ML\Employee attrition\data\y_test.csv", index=False)

print("Data preprocessing complete. Processed files saved to disk.")

Data preprocessing complete. Processed files saved to disk.
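
When these files are loaded again in a later notebook, keep in mind that a Series saved with `to_csv` comes back as a one-column DataFrame. The sketch below shows the round trip with tiny hypothetical frames and a temporary folder standing in for the data directory; building paths with `pathlib` is one way to avoid repeating a long absolute path.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Tiny hypothetical stand-ins for the processed splits.
X_train = pd.DataFrame({"Age": [0.5, -1.2], "OverTime_Yes": [True, False]})
y_train = pd.Series([1, 0], name="Attrition")

with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)  # hypothetical output folder for this sketch
    X_train.to_csv(data_dir / "X_train.csv", index=False)
    y_train.to_csv(data_dir / "y_train.csv", index=False)

    X_loaded = pd.read_csv(data_dir / "X_train.csv")
    # The saved Series is read back as a frame; select the column to recover it.
    y_loaded = pd.read_csv(data_dir / "y_train.csv")["Attrition"]

print(X_loaded.shape, y_loaded.tolist())
```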


## Feature Engineering Summary

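In this notebook we:

- Removed four uninformative columns (`EmployeeCount`, `Over18`, `StandardHours`, `EmployeeNumber`)  
- Mapped `Attrition` to 0/1 and one-hot encoded the seven categorical variables  
- Created three derived features: `IncomePerYear`, `ExperienceLevel`, and `SatisfactionIndex`  
- Standardized 15 numerical columns with `StandardScaler`  
- Split the data 80/20 into stratified training (1,176 rows) and test (294 rows) sets  

The training and test sets, each with 47 feature columns, are saved as CSV files and ready for modeling.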
