#**🏢 Employee Attrition Prediction and Analysis**

#🏆 **Executive Summary**


Employee attrition poses a significant challenge to organizations, impacting operational efficiency and increasing recruitment costs.
This project leverages advanced machine learning techniques to **predict employee attrition and uncover key drivers** influencing employee decisions to leave.

Using the IBM HR Analytics dataset, we conducted extensive **exploratory data analysis (EDA)**, applied **data preprocessing techniques**, and implemented a powerful **XGBoost classifier** to build an effective prediction model.
Key findings highlight the impact of factors such as **overtime work, monthly income, job role, and tenure** on attrition likelihood.

The insights derived from this analysis enable businesses to **design targeted retention strategies**, improving **employee satisfaction and organizational stability**.



✅ **Main Highlights**:

- Comprehensive data exploration and visualization.

- Handling of class imbalance using SMOTE to improve detection of the minority (attrition) class.

- Feature encoding and cleaning for optimal machine learning performance.

- Implementation of XGBoost, reaching about 84% accuracy on the test set, with an F1 score of 0.44 on the attrition class.

- Business recommendations based on data-driven insights.

📌 **Project Overview**

In this project, we aim to predict **employee attrition** (that is, whether an employee is likely to leave the company) using machine learning techniques.
By analyzing features such as **job roles, monthly income, tenure**, and others, organizations can identify **at-risk employees and design effective retention strategies**.

**📚 Dataset Description**

- **Source**: IBM HR Analytics Employee Attrition & Performance Dataset

- **Records**: 1,470 entries

- **Features**: 35 columns including demographics, job-related factors, and performance indicators.

🛠️ **Libraries and Tools Used**

- **Pandas and NumPy**: For efficient data handling and numerical computations.

- **Matplotlib, Seaborn, Plotly**: For advanced and interactive data visualization.

- **Scikit-learn**: For preprocessing, modeling, and evaluation.

- **SMOTE**: For handling class imbalance by generating synthetic examples of the minority class.

- **XGBoost**: For building an accurate, robust machine learning model.
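
To make the SMOTE idea concrete, here is a minimal sketch of its interpolation step. It assumes a tiny made-up minority class and two hypothetical helpers, `nearest_neighbor` and `synthesize`. It is a simplification: the `imblearn` implementation picks among several nearest neighbours and processes the whole minority class at once.

```python
import random

def nearest_neighbor(point, candidates):
    """Closest other point by squared Euclidean distance (k = 1 for simplicity)."""
    others = [c for c in candidates if c is not point]
    return min(others, key=lambda c: sum((a - b) ** 2 for a, b in zip(point, c)))

def synthesize(point, neighbor, rng):
    """Return a synthetic sample on the segment between point and neighbor."""
    u = rng.random()  # interpolation factor in [0, 1)
    return [p + u * (n - p) for p, n in zip(point, neighbor)]

# Hypothetical minority-class samples with two features each.
minority = [[1.0, 2.0], [1.5, 2.5], [5.0, 6.0]]
rng = random.Random(42)

base = minority[0]
neighbor = nearest_neighbor(base, minority)
new_sample = synthesize(base, neighbor, rng)
print("neighbor:", neighbor)
print("synthetic sample:", new_sample)
```

The synthetic point always lies between an existing minority sample and one of its neighbours, which is why SMOTE adds variety rather than duplicating rows.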

In [14]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from xgboost import XGBClassifier
import plotly.graph_objects as go
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix,f1_score
from warnings import filterwarnings
filterwarnings("ignore")


#**Data Loading and Initial Exploration**

In [15]:
data = pd.read_csv('/content/WA_Fn-UseC_-HR-Employee-Attrition.csv')
data.head()

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,1,80,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,4,80,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,...,2,80,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,...,3,80,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,...,4,80,1,6,3,3,2,2,2,2


In [16]:
print("Dataset Info:")
print(data.info())

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Age                       1470 non-null   int64 
 1   Attrition                 1470 non-null   object
 2   BusinessTravel            1470 non-null   object
 3   DailyRate                 1470 non-null   int64 
 4   Department                1470 non-null   object
 5   DistanceFromHome          1470 non-null   int64 
 6   Education                 1470 non-null   int64 
 7   EducationField            1470 non-null   object
 8   EmployeeCount             1470 non-null   int64 
 9   EmployeeNumber            1470 non-null   int64 
 10  EnvironmentSatisfaction   1470 non-null   int64 
 11  Gender                    1470 non-null   object
 12  HourlyRate                1470 non-null   int64 
 13  JobInvolvement            1470 non-null   int64 
 14  JobLevel  

In [17]:
data.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Age,1470.0,36.92381,9.135373,18.0,30.0,36.0,43.0,60.0
DailyRate,1470.0,802.485714,403.5091,102.0,465.0,802.0,1157.0,1499.0
DistanceFromHome,1470.0,9.192517,8.106864,1.0,2.0,7.0,14.0,29.0
Education,1470.0,2.912925,1.024165,1.0,2.0,3.0,4.0,5.0
EmployeeCount,1470.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0
EmployeeNumber,1470.0,1024.865306,602.024335,1.0,491.25,1020.5,1555.75,2068.0
EnvironmentSatisfaction,1470.0,2.721769,1.093082,1.0,2.0,3.0,4.0,4.0
HourlyRate,1470.0,65.891156,20.329428,30.0,48.0,66.0,83.75,100.0
JobInvolvement,1470.0,2.729932,0.711561,1.0,2.0,3.0,3.0,4.0
JobLevel,1470.0,2.063946,1.10694,1.0,1.0,2.0,3.0,5.0


In [18]:
duplicate_rows = data.duplicated().any()
duplicate_rows

np.False_

**The dataset has no missing values and no duplicate rows.**
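
Both checks can be made explicit in a few lines. The sketch below uses a small hypothetical frame (`toy`) in place of the HR data, with one missing value and one duplicated row planted on purpose so the counts are non-zero.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the HR data.
toy = pd.DataFrame({
    "Age": [41, 49, 37, 37],
    "MonthlyIncome": [5993, np.nan, 2090, 2090],
    "OverTime": ["Yes", "No", "Yes", "Yes"],
})

missing_per_column = toy.isna().sum()        # number of NaN values per column
n_duplicates = int(toy.duplicated().sum())   # rows identical to an earlier row

print(missing_per_column)
print("Duplicate rows:", n_duplicates)
```

On the real dataset, the same two expressions applied to `data` should both come out as zero if the statement above holds.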

🧹 Some columns can be **removed** because they carry no information for the analysis:

- Over18: all values are Y
- EmployeeCount: all values are 1.0
- StandardHours: all values are 80.0
- EmployeeNumber: an employee identifier, which has no bearing on attrition.
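
Constant columns like these can also be found programmatically. The following is a minimal sketch on a hypothetical three-row frame (`toy`). Identifier columns such as EmployeeNumber are not caught by this check and still need to be spotted by inspection.

```python
import pandas as pd

# Hypothetical toy frame with the same kinds of columns as the HR data.
toy = pd.DataFrame({
    "Age": [41, 49, 37],
    "EmployeeCount": [1, 1, 1],
    "StandardHours": [80, 80, 80],
    "Over18": ["Y", "Y", "Y"],
    "EmployeeNumber": [1, 2, 4],
})

# A column with a single distinct value cannot help separate the classes.
n_unique = toy.nunique()
constant_cols = n_unique[n_unique == 1].index.tolist()
print(constant_cols)
```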

In [19]:
data = data.drop(['EmployeeCount','StandardHours','Over18','EmployeeNumber'], axis =1)


#**Data Visualization** 📊

🔹 Insight: Quick overview of the distribution of employees who stayed vs. left.



In [20]:
fig = px.histogram(data, x="Attrition", color="Attrition",
                   title="Employee Attrition Count",
                   color_discrete_sequence=['salmon', 'skyblue'])
fig.show()

**Numerical Features vs Attrition**

🔹 Insight: Detect trends and outliers between numerical features and attrition rates.


In [21]:
num_cols = ['Age', 'DistanceFromHome', 'MonthlyIncome', 'NumCompaniesWorked',
            'TotalWorkingYears', 'YearsAtCompany', 'YearsInCurrentRole',
            'YearsSinceLastPromotion', 'YearsWithCurrManager']

for col in num_cols:
    fig = px.box(data, x='Attrition', y=col, color='Attrition',
                 title=f"{col} vs Attrition",
                 color_discrete_sequence=['salmon', 'skyblue'])
    fig.show()

📊 **Insight: Employee Loyalty and Managerial Tenure Are Linked**

Employees who have not left the company ("No" Attrition) tend to have longer relationships with their current manager compared to those who have left ("Yes" Attrition).

- The **median years** with the current manager for employees who stayed is around **3 years**, while for those who left it's about 2 years.

- The spread of years is also greater for non-attrition employees, indicating they often build **longer-term relationships** with managers.

- There are more outliers (longer manager tenures) among those who stayed, suggesting loyalty may grow with time and trust.
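
Medians read off a box plot are approximate, so it can help to compute them directly. This is a sketch on made-up values (`toy`), not the notebook's data. On the real frame the same `groupby` would be applied to `data`.

```python
import pandas as pd

# Hypothetical values for illustration only.
toy = pd.DataFrame({
    "Attrition": ["Yes", "No", "No", "Yes", "No", "Yes"],
    "YearsWithCurrManager": [0, 5, 7, 2, 3, 1],
})

# Median, mean and group size of manager tenure for leavers and stayers.
summary = toy.groupby("Attrition")["YearsWithCurrManager"].agg(["median", "mean", "count"])
print(summary)
```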

💼 **Recommendation:**

- **Promote managerial continuity** where possible.

- Invest in **manager-employee relationship building**—e.g., through mentoring, regular feedback, and coaching.

- Track **managerial changes** and provide **extra support** for employees undergoing frequent transitions.

**Categorical Features by Attrition**

🔹 Insight: Explore relationships between attrition and categorical features.

In [22]:
cat_cols = ['BusinessTravel', 'Department', 'EducationField', 'Gender',
            'JobRole', 'MaritalStatus', 'OverTime']

for col in cat_cols:
    fig = px.histogram(data, x=col, color='Attrition', barmode='group',
                       title=f"{col} by Attrition",
                       color_discrete_sequence=['salmon', 'skyblue'])
    fig.update_layout(xaxis={'categoryorder':'total descending'})
    fig.show()


💡 **Analysis of graphs**

- Attrition is highest for both men and women between 18 and 35 years of age and gradually decreases with age.
- As income increases, attrition decreases.
- Attrition is much lower among divorced women.
- Attrition is higher for employees who travel frequently than for others, and this rate is higher for women than for men.
- Attrition is highest for employees at job level 1.
- Women in the manager, research director, and laboratory technician roles have almost no attrition.
- Men in sales expert positions have high attrition.
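
The grouped histograms show raw counts, which makes categories of different sizes hard to compare. One way to put them on the same footing is to compute the attrition rate within each category. The sketch below does this with `pd.crosstab` on a hypothetical frame (`toy`). The column names match the dataset, but the values are invented.

```python
import pandas as pd

# Hypothetical values for illustration only.
toy = pd.DataFrame({
    "OverTime":  ["Yes", "Yes", "No", "No", "No", "Yes", "No", "No"],
    "Attrition": ["Yes", "No", "No", "No", "Yes", "Yes", "No", "No"],
})

# normalize="index" turns each row of the table into proportions that sum to 1.
rates = pd.crosstab(toy["OverTime"], toy["Attrition"], normalize="index")
attrition_rate = rates["Yes"].sort_values(ascending=False)
print(attrition_rate)
```

The same pattern extends to two grouping columns, for example by passing a list such as `[toy["Gender"], toy["JobRole"]]` as the first argument when those columns are present.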

In [23]:
df_corr = data.copy()
df_corr['Attrition'] = df_corr['Attrition'].map({'Yes': 1, 'No': 0})
corr = df_corr.corr(numeric_only=True).round(2)
fig = go.Figure(data=go.Heatmap(
        z=corr.values,
        x=corr.columns,
        y=corr.columns,
        colorscale='RdBu',
        zmin=-1, zmax=1,
        colorbar=dict(title="Correlation")
    ))

fig.update_layout(title='Correlation Heatmap')
fig.show()


**Some features are highly correlated with each other:**

- `MonthlyIncome` and `JobLevel`
- `YearsAtCompany`, `YearsInCurrentRole`, and `YearsWithCurrManager`
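
Highly correlated pairs can be listed from the correlation matrix instead of being read off the heatmap. The sketch below uses a hypothetical frame (`toy`) and an arbitrary threshold of 0.7. Both are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical values for illustration only.
toy = pd.DataFrame({
    "JobLevel": [1, 2, 3, 4, 5],
    "MonthlyIncome": [2500, 5200, 9800, 15400, 19000],
    "DistanceFromHome": [9, 2, 24, 1, 10],
})

corr = toy.corr(numeric_only=True)

# Keep only the cells above the diagonal so each pair is listed once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

threshold = 0.7  # assumed cut-off, adjust as needed
high = pairs[pairs.abs() > threshold].sort_values(ascending=False)
print(high)
```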

🔹 Insight: Identify the strongest correlations influencing attrition.

**Monthly Income vs Total Working Years by Attrition**

🔹 Insight: Higher tenure and income generally correlate with lower attrition.

In [24]:
fig = px.scatter(data, x='TotalWorkingYears', y='MonthlyIncome',
                 color='Attrition',
                 title='Monthly Income vs Total Working Years by Attrition',
                 size='Age',
                 color_discrete_sequence=['red', 'skyblue'],
                 hover_data=['JobRole', 'JobLevel'])
fig.show()


📊 **Insight: Attrition is Not Limited to Low-Income or Inexperienced Employees**

From the scatter plot:

- While many employees with lower income and fewer total working years show attrition (red dots), we also see attrition among highly experienced and well-paid employees.

- Red dots (attrition) appear throughout the income and experience range — including employees earning above 15,000 and with 30–40 years of experience.

💼 **Recommendation:**

- Don't assume high salary or long tenure ensures retention.

- Conduct stay interviews with senior, high-income employees to understand satisfaction drivers.

- Offer meaningful roles, succession planning, and flexible work arrangements for experienced staff.

- Monitor signs of burnout or disengagement across all levels—not just entry or mid-career roles.

In [37]:
fig = px.histogram(data, x="Age", color="Attrition", nbins=30,
                   marginal="box",
                   title="Age Distribution by Attrition",
                   color_discrete_sequence=['skyblue', 'salmon'])
fig.show()


📊 **Insight: Younger Employees Are More Likely to Leave the Company**

- The histogram and marginal box plot show that attrition (light blue) is higher among employees aged between 25 and 35.

- Older employees (above 40) have lower attrition rates, as shown by the dominance of the salmon distribution (No Attrition) in those age brackets.
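
The age pattern can be quantified by binning ages and computing the attrition rate per band. The sketch below uses `pd.cut` on a hypothetical frame (`toy`). The band edges are an assumption chosen to cover the 18 to 60 range reported by `describe()`.

```python
import pandas as pd

# Hypothetical values for illustration only.
toy = pd.DataFrame({
    "Age": [22, 28, 31, 34, 38, 42, 47, 55],
    "Attrition": ["Yes", "Yes", "No", "Yes", "No", "No", "No", "No"],
})

# Right-inclusive bins: (17, 25], (25, 35], (35, 45], (45, 60].
bins = [17, 25, 35, 45, 60]
labels = ["18-25", "26-35", "36-45", "46-60"]
toy["AgeBand"] = pd.cut(toy["Age"], bins=bins, labels=labels)

# The mean of a boolean series is the share of True values, i.e. the attrition rate.
left = toy["Attrition"].eq("Yes")
rate_by_band = left.groupby(toy["AgeBand"], observed=True).mean()
print(rate_by_band)
```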

💼 **Recommendation:**

- Develop **career progression programs** tailored for early-career employees.

- Offer **mentorship, training,** and **skill-building opportunities**.

- Consider implementing **engagement surveys** targeting younger age groups to understand motivators and retention drivers.

- Focus on improving **onboarding experiences** and first 2–3 year engagement plans.

In [26]:
fig = px.histogram(data, x="JobRole", color="Gender", barmode='group',
                   facet_col="Attrition",
                   title="Job Role Distribution by Gender and Attrition",
                   color_discrete_sequence=px.colors.qualitative.Set2)
fig.update_layout(xaxis_tickangle=-45)
fig.show()


In [27]:
df = data.copy()
# Map each attrition label to its colour; '(?)' is the label Plotly assigns
# to parent sectors whose children have mixed attrition values.
color_map = {'Yes': 'lightcoral', 'No': 'skyblue', '(?)': 'lightgray'}
fig = px.sunburst(df, path=['Department', 'JobRole', 'Attrition'],
                  color='Attrition',
                  color_discrete_map=color_map,
                  title='Attrition Breakdown by Department and Job Role')
fig.show()

#**🔧 Data Preprocessing**

In [28]:
#Categorical Columns
cat = data.select_dtypes(['object']).columns
#Numerical Columns
num = data.select_dtypes(['number']).columns


In [29]:
data["Attrition"] = data["Attrition"].map({"Yes": 1, "No":0})


In [30]:
cat= data.drop('Attrition',axis=1).select_dtypes(['object']).columns
cat

Index(['BusinessTravel', 'Department', 'EducationField', 'Gender', 'JobRole',
       'MaritalStatus', 'OverTime'],
      dtype='object')

In [31]:
data = pd.get_dummies(data, drop_first= True)


🔹 Reasoning: Simplify dataset, prepare it for modeling by encoding categorical features and cleaning irrelevant columns.
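
To see what `drop_first=True` does, here is a small sketch on a hypothetical frame (`toy`). The `dtype=int` argument is an addition for readability and is not used in the notebook, which keeps the default dtype.

```python
import pandas as pd

# Hypothetical values for illustration only.
toy = pd.DataFrame({
    "Attrition": [1, 0, 0],
    "OverTime": ["Yes", "No", "No"],
    "BusinessTravel": ["Travel_Rarely", "Travel_Frequently", "Non-Travel"],
})

# Each categorical column becomes one indicator per category, minus the first
# category in sorted order, which serves as the implicit baseline.
encoded = pd.get_dummies(toy, drop_first=True, dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

Dropping the first level avoids perfectly redundant columns: a row with zeros in every `BusinessTravel_*` indicator is understood to be `Non-Travel`.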

#**⚙️ Modeling**

**✅ Model Chosen: XGBoost Classifier**

Why XGBoost?

- Exceptional performance on structured/tabular data

- Built-in support for class imbalance through the `scale_pos_weight` parameter

- High computational efficiency
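
For imbalanced data, a common starting point for XGBoost's `scale_pos_weight` is the ratio of negative to positive examples. The sketch below computes that ratio with a hypothetical helper, `suggested_scale_pos_weight`, on made-up label lists.

```python
from collections import Counter

def suggested_scale_pos_weight(labels):
    """Ratio of negative (0) to positive (1) examples."""
    counts = Counter(labels)
    return counts[0] / counts[1]

# Hypothetical labels: five stayers (0) for every leaver (1).
imbalanced = [0] * 50 + [1] * 10
# What the labels look like once oversampling has equalised the classes.
balanced = [0] * 50 + [1] * 50

print(suggested_scale_pos_weight(imbalanced))  # 5.0
print(suggested_scale_pos_weight(balanced))    # 1.0
```

When the training data has already been balanced by oversampling, the ratio is 1, so any multiplier applied on top of it becomes the entire positive-class weight.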

In [32]:
X=data.drop(columns='Attrition')
y=data['Attrition']

In [33]:
X_train, X_test, y_train, y_test= train_test_split(X,y,test_size=.2,random_state=42, stratify=y)

In [34]:
# Stratified K-Fold setup
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

accuracy_scores = []
f1_scores = []

for train_idx, val_idx in skf.split(X_train, y_train):
    # Use .iloc for DataFrames/Series
    X_tr, X_val = X_train.iloc[train_idx], X_train.iloc[val_idx]
    y_tr, y_val = y_train.iloc[train_idx], y_train.iloc[val_idx]

    # Apply SMOTE
    smote = SMOTE(random_state=42)
    X_tr_resampled, y_tr_resampled = smote.fit_resample(X_tr, y_tr)

    # Weight for the positive class. After SMOTE the two classes are the same
    # size, so the ratio is 1 and the 1.5 factor is the effective weight.
    scale_pos_weight = len(y_tr_resampled[y_tr_resampled == 0]) / len(y_tr_resampled[y_tr_resampled == 1]) * 1.5

    # Define and train the model
    xgb_model = XGBClassifier(
        n_estimators=100,
        max_depth=5,
        learning_rate=0.1,
        scale_pos_weight=scale_pos_weight,
        eval_metric='logloss',
        random_state=42
    )
    xgb_model.fit(X_tr_resampled, y_tr_resampled)

    # Predict and evaluate
    y_pred = xgb_model.predict(X_val)
    accuracy_scores.append(accuracy_score(y_val, y_pred))
    f1_scores.append(f1_score(y_val, y_pred))

# Results
print("Average Accuracy:", np.mean(accuracy_scores))
print("Average F1 Score:", np.mean(f1_scores))


Average Accuracy: 0.8537288135593221
Average F1 Score: 0.48398137809902514


📈 **Evaluation Metrics**

Once the model is trained, it is evaluated with:

- Accuracy Score: To evaluate overall correctness.

- Confusion Matrix: To understand true positives/negatives.

- Classification Report: Precision, Recall, F1-Score, especially useful due to class imbalance.

In [35]:
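# Predictions on the held-out test set.
# Note: xgb_model is the model fitted in the last cross-validation fold above,
# so it was trained on four of the five training folds (after SMOTE).
y_pred = xgb_model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))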

Accuracy: 0.8435374149659864
Confusion Matrix:
 [[230  17]
 [ 29  18]]
Classification Report:
               precision    recall  f1-score   support

           0       0.89      0.93      0.91       247
           1       0.51      0.38      0.44        47

    accuracy                           0.84       294
   macro avg       0.70      0.66      0.67       294
weighted avg       0.83      0.84      0.83       294
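
The class 1 row of the report can be reproduced by hand from the confusion matrix, which helps in reading it. The short sketch below uses the four counts printed above. Rows are actual classes and columns are predicted classes, following scikit-learn's convention.

```python
# Counts taken from the confusion matrix printed above.
tn, fp = 230, 17   # actual "No":  correctly kept / wrongly flagged as leaving
fn, tp = 29, 18    # actual "Yes": missed leavers / correctly flagged leavers

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the employees flagged, how many actually left
recall = tp / (tp + fn)      # of the employees who left, how many were flagged
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# accuracy=0.84 precision=0.51 recall=0.38 f1=0.44
```

A recall of 0.38 means the model flags 18 of the 47 employees who actually left, which is the figure to watch when the goal is to catch at-risk employees.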



📋 **Final Conclusion**

- Features such as OverTime, MonthlyIncome, JobRole, and YearsAtCompany are associated with attrition.

- Employees working overtime or with lower incomes tend to have higher attrition risks.

- Departmental patterns and job roles also show meaningful insights, providing opportunities for targeted retention efforts.

**Business Recommendation**:

- Implement proactive strategies focusing on at-risk groups identified through the model to **reduce attrition and boost retention**.
- **Align Compensation with Experience:**

  - Experienced employees with low income show higher attrition risk.

  - Adjust salaries fairly based on tenure and performance.

- **Strengthen Manager-Employee Relationships:**

  - Encourage stable manager assignments and leadership training.




