Dataset: https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset

# Assignment Tasks

## 1. Data Understanding

### Import dataset using Pandas.

In [96]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [97]:
df = pd.read_csv('data/WA_Fn-UseC_HR_Employee_Attrition.csv')

### Explore the shape, columns, and data types.

In [98]:
df.head()

| | Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EmployeeCount | EmployeeNumber | ... | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41 | Yes | Travel_Rarely | 1102 | Sales | 1 | 2 | Life Sciences | 1 | 1 | ... | 1 | 80 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 |
| 1 | 49 | No | Travel_Frequently | 279 | Research & Development | 8 | 1 | Life Sciences | 1 | 2 | ... | 4 | 80 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 |
| 2 | 37 | Yes | Travel_Rarely | 1373 | Research & Development | 2 | 2 | Other | 1 | 4 | ... | 2 | 80 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |
| 3 | 33 | No | Travel_Frequently | 1392 | Research & Development | 3 | 4 | Life Sciences | 1 | 5 | ... | 3 | 80 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 |
| 4 | 27 | No | Travel_Rarely | 591 | Research & Development | 2 | 1 | Medical | 1 | 7 | ... | 4 | 80 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 |


In [99]:
df.shape

(1470, 35)

In [100]:
for i in df.columns:
    print(f"{i}: {df[i].dtype}")

Age: int64
Attrition: object
BusinessTravel: object
DailyRate: int64
Department: object
DistanceFromHome: int64
Education: int64
EducationField: object
EmployeeCount: int64
EmployeeNumber: int64
EnvironmentSatisfaction: int64
Gender: object
HourlyRate: int64
JobInvolvement: int64
JobLevel: int64
JobRole: object
JobSatisfaction: int64
MaritalStatus: object
MonthlyIncome: int64
MonthlyRate: int64
NumCompaniesWorked: int64
Over18: object
OverTime: object
PercentSalaryHike: int64
PerformanceRating: int64
RelationshipSatisfaction: int64
StandardHours: int64
StockOptionLevel: int64
TotalWorkingYears: int64
TrainingTimesLastYear: int64
WorkLifeBalance: int64
YearsAtCompany: int64
YearsInCurrentRole: int64
YearsSinceLastPromotion: int64
YearsWithCurrManager: int64


### Check for missing values and duplicates.

In [101]:
if df.isna().sum().any():
    print("There are missing values. Here they are:")
    print(df.isna().sum())
else:
    print("No missing values found.")

No missing values found.


In [102]:
if df.duplicated().any():
    print("There are duplicates. Here they are:")
    print(df[df.duplicated()])
else:
    print("No duplicates found.")

No duplicates found.


### Generate descriptive statistics.

In [103]:
df.describe()

| | Age | DailyRate | DistanceFromHome | Education | EmployeeCount | EmployeeNumber | EnvironmentSatisfaction | HourlyRate | JobInvolvement | JobLevel | ... | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | ... | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 |
| mean | 36.92381 | 802.485714 | 9.192517 | 2.912925 | 1.0 | 1024.865306 | 2.721769 | 65.891156 | 2.729932 | 2.063946 | ... | 2.712245 | 80.0 | 0.793878 | 11.279592 | 2.79932 | 2.761224 | 7.008163 | 4.229252 | 2.187755 | 4.123129 |
| std | 9.135373 | 403.5091 | 8.106864 | 1.024165 | 0.0 | 602.024335 | 1.093082 | 20.329428 | 0.711561 | 1.10694 | ... | 1.081209 | 0.0 | 0.852077 | 7.780782 | 1.289271 | 0.706476 | 6.126525 | 3.623137 | 3.22243 | 3.568136 |
| min | 18.0 | 102.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 30.0 | 1.0 | 1.0 | ... | 1.0 | 80.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 30.0 | 465.0 | 2.0 | 2.0 | 1.0 | 491.25 | 2.0 | 48.0 | 2.0 | 1.0 | ... | 2.0 | 80.0 | 0.0 | 6.0 | 2.0 | 2.0 | 3.0 | 2.0 | 0.0 | 2.0 |
| 50% | 36.0 | 802.0 | 7.0 | 3.0 | 1.0 | 1020.5 | 3.0 | 66.0 | 3.0 | 2.0 | ... | 3.0 | 80.0 | 1.0 | 10.0 | 3.0 | 3.0 | 5.0 | 3.0 | 1.0 | 3.0 |
| 75% | 43.0 | 1157.0 | 14.0 | 4.0 | 1.0 | 1555.75 | 4.0 | 83.75 | 3.0 | 3.0 | ... | 4.0 | 80.0 | 1.0 | 15.0 | 3.0 | 3.0 | 9.0 | 7.0 | 3.0 | 7.0 |
| max | 60.0 | 1499.0 | 29.0 | 5.0 | 1.0 | 2068.0 | 4.0 | 100.0 | 4.0 | 5.0 | ... | 4.0 | 80.0 | 3.0 | 40.0 | 6.0 | 4.0 | 40.0 | 18.0 | 15.0 | 17.0 |


## 2. Exploratory Data Analysis (EDA)

### Univariate analysis

In [104]:
def classify_columns_table(df):
    """Summarize each column: assigned type, unique-value count, and sample values."""
    records = []

    for col in df.columns:
        n_unique = df[col].nunique()

        # Checked first, so ID-like columns (one distinct value per row)
        # are labeled "Unique" even when they are numeric.
        if n_unique == len(df):
            col_type = "Unique"
        elif pd.api.types.is_numeric_dtype(df[col]):
            col_type = "Numeric"
        elif isinstance(df[col].dtype, pd.CategoricalDtype) or df[col].dtype == "object":
            col_type = "Categorical"
        else:
            col_type = "Leftover"  # anything else, e.g. datetime columns

        records.append({
            "Column": col,
            "Assigned_Type": col_type,
            "Unique_Values": n_unique,
            "Example_Values": df[col].dropna().unique()[:5],  # show up to 5 examples
        })
    return pd.DataFrame(records)

classify_columns_table(df)

| | Column | Assigned_Type | Unique_Values | Example_Values |
|---|---|---|---|---|
| 0 | Age | Numeric | 43 | [41, 49, 37, 33, 27] |
| 1 | Attrition | Categorical | 2 | [Yes, No] |
| 2 | BusinessTravel | Categorical | 3 | [Travel_Rarely, Travel_Frequently, Non-Travel] |
| 3 | DailyRate | Numeric | 886 | [1102, 279, 1373, 1392, 591] |
| 4 | Department | Categorical | 3 | [Sales, Research & Development, Human Resources] |
| 5 | DistanceFromHome | Numeric | 29 | [1, 8, 2, 3, 24] |
| 6 | Education | Numeric | 5 | [2, 1, 4, 3, 5] |
| 7 | EducationField | Categorical | 6 | [Life Sciences, Other, Medical, Marketing, Tec... |
| 8 | EmployeeCount | Numeric | 1 | [1] |
| 9 | EmployeeNumber | Unique | 1470 | [1, 2, 4, 5, 7] |


#### Age
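
The notebook does not yet contain the Age analysis, so the following is only a possible starting point. It is a minimal sketch on a small made-up sample: `toy` is a hypothetical stand-in for `df`, and the bin count is an arbitrary choice.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for df["Age"]; the values are for illustration only.
toy = pd.DataFrame({"Age": [41, 49, 37, 33, 27, 32, 59, 30, 38, 36]})

# Numeric summary: count, mean, std, min, quartiles, max.
age_summary = toy["Age"].describe()
print(age_summary)

# Histogram of the distribution.
fig, ax = plt.subplots(figsize=(6, 4))
ax.hist(toy["Age"], bins=5, edgecolor="black")
ax.set_xlabel("Age")
ax.set_ylabel("Count")
ax.set_title("Age distribution")
fig.tight_layout()
```

On the real data, replacing `toy` with `df` should be enough; a larger bin count is probably more informative with 1470 rows.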





Categorical analysis: Countplots for Attrition, Department, JobRole, MaritalStatus.
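
One way the countplots could be drawn is a 2×2 grid with one panel per column. This is a sketch, not the notebook's own solution: `toy` is a hypothetical stand-in for `df` with invented rows, and sorting the bars by frequency is a presentation choice.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical stand-in for df; the rows are invented for illustration.
toy = pd.DataFrame({
    "Attrition": ["Yes", "No", "No", "No", "Yes", "No"],
    "Department": ["Sales", "Research & Development", "Research & Development",
                   "Sales", "Human Resources", "Research & Development"],
    "JobRole": ["Sales Executive", "Research Scientist", "Laboratory Technician",
                "Sales Executive", "Human Resources", "Research Scientist"],
    "MaritalStatus": ["Single", "Married", "Married", "Divorced", "Single", "Married"],
})

cat_cols = ["Attrition", "Department", "JobRole", "MaritalStatus"]

fig, axes = plt.subplots(2, 2, figsize=(12, 8))
for ax, col in zip(axes.ravel(), cat_cols):
    # Order the bars from the most to the least frequent category.
    order = toy[col].value_counts().index
    sns.countplot(data=toy, x=col, order=order, ax=ax)
    ax.set_title(col)
    ax.tick_params(axis="x", rotation=45)  # long labels such as job roles overlap otherwise
fig.tight_layout()
```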

Bivariate analysis: Compare Attrition with Age, JobSatisfaction, MonthlyIncome, OverTime.
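
Before plotting, the comparison can be summarized numerically. The sketch below assumes group means are a reasonable summary for the numeric columns and uses a row-normalized crosstab for `OverTime`; `toy` is a hypothetical stand-in for `df` with invented values.

```python
import pandas as pd

# Hypothetical stand-in for df; the values are invented for illustration.
toy = pd.DataFrame({
    "Attrition": ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No"],
    "Age": [28, 45, 39, 31, 50, 36, 26, 42],
    "JobSatisfaction": [1, 4, 3, 2, 4, 3, 1, 3],
    "MonthlyIncome": [2500, 9000, 6500, 3000, 12000, 5200, 2100, 7000],
    "OverTime": ["Yes", "No", "No", "Yes", "No", "Yes", "Yes", "No"],
})

# Numeric features: mean per attrition group.
group_means = toy.groupby("Attrition")[["Age", "JobSatisfaction", "MonthlyIncome"]].mean()
print(group_means)

# Categorical feature: share of each Attrition value within each OverTime group
# (normalize="index" makes every row sum to 1).
overtime_rates = pd.crosstab(toy["OverTime"], toy["Attrition"], normalize="index")
print(overtime_rates)
```

The same groupings carry over to plots, for example a boxplot of `MonthlyIncome` by `Attrition` or a countplot of `OverTime` with `hue="Attrition"`.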

Correlation heatmap for numerical variables.
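
A possible sketch of the heatmap, restricted to numeric columns. `toy` is a hypothetical stand-in for `df` with invented values, and the colormap and annotation format are arbitrary choices.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical stand-in for df; the values are invented for illustration.
toy = pd.DataFrame({
    "Age": [41, 49, 37, 33, 27, 52],
    "MonthlyIncome": [6000, 5100, 2100, 2900, 3500, 9800],
    "YearsAtCompany": [6, 10, 0, 8, 2, 20],
    "Department": ["Sales", "R&D", "R&D", "R&D", "R&D", "Sales"],  # non-numeric, excluded below
})

# Pearson correlation between numeric columns only.
corr = toy.select_dtypes(include="number").corr()

fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation heatmap")
fig.tight_layout()
```

In the real data, `EmployeeCount` and `StandardHours` have a standard deviation of 0 in the `describe()` output, so their correlations come out as NaN; dropping them before calling `corr()` keeps the heatmap readable.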

## 3. Data Preprocessing

Encode categorical variables (Label Encoding / One-Hot Encoding).
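
A minimal sketch of one encoding scheme: the binary target is mapped to 0/1 (equivalent to label encoding for two classes) and the remaining object columns are one-hot encoded. `toy` is a hypothetical stand-in for `df`, and `drop_first=True` is an assumption that suits linear models, not a requirement.

```python
import pandas as pd

# Hypothetical stand-in for df; the rows are invented for illustration.
toy = pd.DataFrame({
    "Attrition": ["Yes", "No", "No", "Yes"],
    "OverTime": ["Yes", "No", "Yes", "No"],
    "BusinessTravel": ["Travel_Rarely", "Travel_Frequently", "Non-Travel", "Travel_Rarely"],
    "Age": [41, 49, 37, 33],
})

# Target: map the two classes to integers.
y = toy["Attrition"].map({"No": 0, "Yes": 1})

# Features: one-hot encode every object column; drop_first removes one
# redundant indicator per feature.
X = pd.get_dummies(toy.drop(columns="Attrition"), drop_first=True, dtype=int)
print(X.columns.tolist())
```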

Normalize/standardize continuous variables.
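
Standardization could look like the sketch below, which uses scikit-learn's `StandardScaler` on the `Age` and `DailyRate` values from the `df.head()` output. `toy` and `scaled` are names introduced here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# First five Age and DailyRate values, as shown by df.head().
toy = pd.DataFrame({
    "Age": [41, 49, 37, 33, 27],
    "DailyRate": [1102, 279, 1373, 1392, 591],
})

# Rescale each column to zero mean and unit variance.
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(toy), columns=toy.columns, index=toy.index)
print(scaled.round(3))
```

In the full pipeline the scaler is usually fitted on the training split only and then applied to the test split with `scaler.transform`, so that test-set statistics do not leak into training.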

Handle class imbalance in Attrition (SMOTE / class weights).
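
Of the two options, class weights need no extra dependency, so that is what this sketch shows. The labels are invented (8 stayers, 2 leavers) and `class_weight` is a name introduced here for illustration.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Invented labels for illustration: 0 = stayed, 1 = left.
y = np.array([0] * 8 + [1] * 2)

# "balanced" weights are n_samples / (n_classes * class_count),
# so the minority class gets the larger weight.
classes = np.unique(y)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
class_weight = dict(zip(classes.tolist(), weights.tolist()))
print(class_weight)
```

Most scikit-learn classifiers accept `class_weight="balanced"` directly, which computes the same weights internally. SMOTE is provided by the separate imbalanced-learn package and, if used, would normally be applied to the training split only.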

Train-test split (e.g., 80-20).
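
A sketch of an 80-20 split on synthetic data. Stratifying on the target is an assumption added here; it keeps the class ratio the same in both splits, which matters when one class is rare.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded features and target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 80 + [1] * 20)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,     # 80-20 split
    stratify=y,        # preserve the class ratio in both splits
    random_state=42,   # reproducible shuffling
)
print(X_train.shape, X_test.shape)
```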

## 4. Model Building

Train at least 3 models:

- Logistic Regression (baseline)
- Random Forest Classifier
- Gradient Boosting (XGBoost/LightGBM)
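
The three models could be trained in one loop, as in the sketch below. It runs on synthetic data from `make_classification`, and it uses scikit-learn's `GradientBoostingClassifier` as a stand-in for XGBoost/LightGBM so that no extra package is needed. The hyperparameters are untuned placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the preprocessed attrition data.
X, y = make_classification(n_samples=500, n_features=10, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, class_weight="balanced"),
    "Random Forest": RandomForestClassifier(
        n_estimators=200, class_weight="balanced", random_state=42
    ),
    # Stand-in for XGBoost/LightGBM.
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: test accuracy = {model.score(X_test, y_test):.3f}")
```

`xgboost.XGBClassifier` and `lightgbm.LGBMClassifier` follow the same `fit`/`predict` interface, so either should slot into the `models` dictionary if those packages are installed.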

## 5. Model Evaluation

Use metrics: Accuracy, Precision, Recall, F1-score, AUROC.
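
The listed metrics map directly onto functions in `sklearn.metrics`. The sketch below uses invented labels and scores, and it assumes a 0.5 decision threshold. AUROC is computed from the scores, while the other four metrics use the thresholded predictions.

```python
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Invented labels and predicted probabilities for the positive class.
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.9, 0.45, 0.7, 0.2]

# Hard predictions at a 0.5 threshold (an assumption; other thresholds are possible).
y_pred = [int(s >= 0.5) for s in y_score]

metrics = {
    "Accuracy": accuracy_score(y_true, y_pred),
    "Precision": precision_score(y_true, y_pred),
    "Recall": recall_score(y_true, y_pred),
    "F1-score": f1_score(y_true, y_pred),
    "AUROC": roc_auc_score(y_true, y_score),  # uses scores, not hard predictions
}
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With an imbalanced target such as Attrition, accuracy alone can look good for a model that rarely predicts "Yes", which is why recall, F1-score and AUROC are worth reporting alongside it.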

Plot ROC curve and confusion matrix.
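
Both plots can be produced side by side, as in this sketch on invented labels and scores. The 0.5 threshold and the `["No", "Yes"]` display labels are assumptions made for the example.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, auc, confusion_matrix, roc_curve

# Invented labels and predicted probabilities for the positive class.
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.9, 0.45, 0.7, 0.2]
y_pred = [int(s >= 0.5) for s in y_score]

# ROC curve: true positive rate against false positive rate over all thresholds.
fpr, tpr, _ = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)

# Confusion matrix: rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)

fig, (ax_roc, ax_cm) = plt.subplots(1, 2, figsize=(10, 4))
ax_roc.plot(fpr, tpr, label=f"AUC = {roc_auc:.2f}")
ax_roc.plot([0, 1], [0, 1], linestyle="--", color="gray")  # chance level
ax_roc.set_xlabel("False positive rate")
ax_roc.set_ylabel("True positive rate")
ax_roc.legend()
ConfusionMatrixDisplay(cm, display_labels=["No", "Yes"]).plot(ax=ax_cm)
fig.tight_layout()
```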

Compare model performance and choose the best one.

## 6. Feature Importance

Extract feature importance from tree-based models.
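
For tree-based models in scikit-learn, the importances are exposed as the fitted model's `feature_importances_` attribute. The sketch below runs on synthetic data, and the feature names are hypothetical placeholders for the real column names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the preprocessed features and target.
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=2, n_redundant=0, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # hypothetical names

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based importances; they are non-negative and sum to 1.
importances = pd.Series(model.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

Impurity-based importances tend to favor features with many distinct values, so `sklearn.inspection.permutation_importance` on the test split is a useful cross-check before drawing HR conclusions.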

Discuss which features are most influential in predicting attrition.

## 7. Reporting

Write a concise report (2–3 pages) covering:

- Problem statement
- Data insights (EDA findings)
- Model comparison table
- Key takeaways (business-level insights for HR)