# 🧠 Employee Attrition Prediction – Modeling
**Author:** Saurabh Shrivastav  
**Objective:** Build a classification model to predict employee attrition based on HR data insights.

📌 *This notebook is a continuation of the EDA phase (`01_eda_attrition.ipynb`).*

## 📦 Section 1: Load Data & Prepare for Modeling
Reload the dataset and re-apply the feature engineering steps from the EDA notebook.

In [8]:
# 📦 Import libraries
import pandas as pd
import numpy as np

# 📊 Load dataset
df = pd.read_csv('../data/WA_Fn-UseC_-HR-Employee-Attrition.csv')

# 🔍 Preview
df.head()

| | Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EmployeeCount | EmployeeNumber | ... | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41 | Yes | Travel_Rarely | 1102 | Sales | 1 | 2 | Life Sciences | 1 | 1 | ... | 1 | 80 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 |
| 1 | 49 | No | Travel_Frequently | 279 | Research & Development | 8 | 1 | Life Sciences | 1 | 2 | ... | 4 | 80 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 |
| 2 | 37 | Yes | Travel_Rarely | 1373 | Research & Development | 2 | 2 | Other | 1 | 4 | ... | 2 | 80 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |
| 3 | 33 | No | Travel_Frequently | 1392 | Research & Development | 3 | 4 | Life Sciences | 1 | 5 | ... | 3 | 80 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 |
| 4 | 27 | No | Travel_Rarely | 591 | Research & Development | 2 | 1 | Medical | 1 | 7 | ... | 4 | 80 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 |


## 1️⃣ Data Cleaning & Feature Selection (Feature Engineering)

- Drop ID-like or constant columns
- Encode categorical features
- Prepare final dataset for modeling

In [9]:
# 🔻 Drop columns that don't help prediction
df = df.drop(columns=['EmployeeNumber', 'Over18', 'StandardHours', 'EmployeeCount'])

# 🧠 Convert Attrition to binary
df['Attrition'] = df['Attrition'].map({'Yes': 1, 'No': 0})

# 👁️ Separate target
y = df['Attrition']
X = df.drop('Attrition', axis=1)

# 🏷️ Categorical & Numerical columns
cat_cols = X.select_dtypes(include='object').columns
num_cols = X.select_dtypes(exclude='object').columns
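
The "ID-like or constant" criterion used to pick the dropped columns can be checked programmatically rather than by eye. The following is a minimal sketch on a toy frame; the values are illustrative and not taken from the HR dataset, and the names `toy`, `constant_cols`, and `id_like_cols` are hypothetical. It flags a column as constant when it has one distinct value and as ID-like when every row has a distinct value.

```python
import pandas as pd

# Toy frame standing in for the HR data (illustrative values only)
toy = pd.DataFrame({
    "EmployeeNumber": [1, 2, 4, 5],      # one distinct value per row -> ID-like
    "StandardHours": [80, 80, 80, 80],   # single value -> constant
    "Over18": ["Y", "Y", "Y", "Y"],      # single value -> constant
    "Age": [41, 49, 37, 41],             # repeated values, but not constant
})

# Number of distinct values per column
n_unique = toy.nunique()

# Constant columns carry no information for a classifier
constant_cols = n_unique[n_unique == 1].index.tolist()

# Columns with as many distinct values as rows behave like identifiers
id_like_cols = n_unique[n_unique == len(toy)].index.tolist()

print("Constant:", constant_cols)
print("ID-like:", id_like_cols)
```

The ID-like check is only a heuristic: a continuous numeric feature with no repeated values would also be flagged, so the result should be reviewed before dropping anything.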

## Encode Categorical Variables

- Use `pd.get_dummies()` to one-hot encode categorical features
- Avoid dummy variable trap with `drop_first=True`
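
To make the effect of `drop_first=True` concrete, here is a small sketch on a single toy column (the frame `toy` is hypothetical and not part of the notebook's data). With three categories, full one-hot encoding yields three columns that always sum to 1; dropping the first removes that redundancy and leaves the dropped category as the implicit baseline.

```python
import pandas as pd

# Toy column with three categories (illustrative only)
toy = pd.DataFrame({"MaritalStatus": ["Single", "Married", "Divorced", "Single"]})

# Full one-hot encoding: one column per category
full = pd.get_dummies(toy)

# drop_first=True removes the first category (alphabetically, "Divorced"),
# which becomes the implicit baseline
reduced = pd.get_dummies(toy, drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```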

In [10]:
# 🧠 One-hot encode all categorical features
X_encoded = pd.get_dummies(X, drop_first=True)

# 🧾 Check the shape and preview
print("Encoded shape:", X_encoded.shape)
X_encoded.head()

Encoded shape: (1470, 44)


| | Age | DailyRate | DistanceFromHome | Education | EnvironmentSatisfaction | HourlyRate | JobInvolvement | JobLevel | JobSatisfaction | MonthlyIncome | ... | JobRole_Laboratory Technician | JobRole_Manager | JobRole_Manufacturing Director | JobRole_Research Director | JobRole_Research Scientist | JobRole_Sales Executive | JobRole_Sales Representative | MaritalStatus_Married | MaritalStatus_Single | OverTime_Yes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41 | 1102 | 1 | 2 | 2 | 94 | 3 | 2 | 4 | 5993 | ... | False | False | False | False | False | True | False | False | True | True |
| 1 | 49 | 279 | 8 | 1 | 3 | 61 | 2 | 2 | 2 | 5130 | ... | False | False | False | False | True | False | False | True | False | False |
| 2 | 37 | 1373 | 2 | 2 | 4 | 92 | 2 | 1 | 3 | 2090 | ... | True | False | False | False | False | False | False | False | True | True |
| 3 | 33 | 1392 | 3 | 4 | 4 | 56 | 3 | 1 | 3 | 2909 | ... | False | False | False | False | True | False | False | True | False | True |
| 4 | 27 | 591 | 2 | 1 | 1 | 40 | 3 | 1 | 2 | 3468 | ... | True | False | False | False | False | False | False | True | False | False |


## 2️⃣ Train-Test Split

- Use `train_test_split` from scikit-learn
- Stratify on the target variable (`Attrition`) to preserve Yes/No ratio

In [15]:
from sklearn.model_selection import train_test_split

# Split the encoded dataset
X_train, X_test, y_train, y_test = train_test_split(
    X_encoded, y, test_size=0.2, random_state=42, stratify=y
)

# ✅ Check split shapes
print("X_train:", X_train.shape)
print("X_test:", X_test.shape)
print("y_train:", y_train.shape)
print("y_test:", y_test.shape)

X_train: (1176, 44)
X_test: (294, 44)
y_train: (1176,)
y_test: (294,)
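
The shapes confirm the 80/20 split but not that stratification preserved the class ratio. A quick way to check is to compare the positive rate in each split. The sketch below uses a synthetic imbalanced target rather than the HR data, and the names `X_toy`, `y_toy`, `y_tr`, and `y_te` are hypothetical; in the notebook the same comparison could be made with `y_train.mean()` and `y_test.mean()`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target: 160 positives out of 1000 (illustrative only)
y_toy = np.array([1] * 160 + [0] * 840)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# Same split settings as the notebook cell above
_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratify, the share of positives should be nearly identical in both splits
print("Train positive rate:", y_tr.mean())
print("Test positive rate:", y_te.mean())
```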


### 🤖 Model 1: Logistic Regression

- Simple and interpretable baseline model
- Useful for binary classification (Attrition: Yes/No)
- Evaluate using Accuracy, Precision, Recall, F1
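
As an illustration of the baseline and metrics listed above, the following is a minimal sketch and not the notebook's own model cell. It runs on synthetic data from `make_classification` so that it is self-contained; the `StandardScaler` step, `max_iter=1000`, and all variable names are assumptions made for the example. With the real data, `X_train`, `X_test`, `y_train`, and `y_test` from the split above would take the place of the synthetic arrays.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced binary data standing in for the encoded HR features
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.84, 0.16], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Scaling helps the solver converge when features are on different scales
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
baseline.fit(X_tr, y_tr)
y_pred = baseline.predict(X_te)

# The four metrics named above; zero_division=0 avoids warnings
# if the model predicts no positives
metrics = {
    "accuracy": accuracy_score(y_te, y_pred),
    "precision": precision_score(y_te, y_pred, zero_division=0),
    "recall": recall_score(y_te, y_pred, zero_division=0),
    "f1": f1_score(y_te, y_pred, zero_division=0),
}
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because the target is imbalanced, accuracy alone can look good for a model that rarely predicts the positive class, which is why precision, recall, and F1 are reported alongside it.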
