
# PACE Strategy

## P: Purpose
Salifort Motors is experiencing high employee turnover. The goal is to identify key factors that contribute to employee departure and build a predictive model to help reduce future turnover and save on recruiting/training costs.

## A: Action
To solve this, we will:
- Analyze the employee survey dataset (`HR_capstone_dataset.csv`)
- Perform EDA to uncover trends and correlations
- Build and evaluate predictive models (Logistic Regression, Decision Tree, Random Forest, XGBoost)
- Generate a summary with business insights and recommendations

## C: Calculation
We will use:
- **Python** for coding
- **pandas**, **seaborn**, **matplotlib** for EDA
- **scikit-learn** for model training and evaluation
- **XGBoost** for gradient-boosted tree models
- Evaluation metrics: Accuracy, Precision, Recall, F1-Score
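
As a quick reference for the metrics listed above, the sketch below computes them from confusion-matrix counts. The function name `classification_metrics` and the counts are hypothetical, chosen only to keep the arithmetic easy to follow. In the notebook itself, scikit-learn's `classification_report` reports the same quantities.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Precision: of the employees predicted to leave, how many actually left.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of the employees who actually left, how many the model caught.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for illustration only (not results from this analysis).
metrics = classification_metrics(tp=80, fp=20, fn=40, tn=860)
print({name: round(value, 3) for name, value in metrics.items()})
# {'accuracy': 0.94, 'precision': 0.8, 'recall': 0.667, 'f1': 0.727}
```

With imbalanced classes, accuracy can look strong even when recall on leavers is weak, which is why all four metrics are tracked.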

## E: Evaluation
Success will be determined by:
- Model performance on the held-out test set (accuracy, precision, recall, F1-score, AUC)
- Clarity of insights about what causes turnover
- Actionable business recommendations for HR and leadership
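
AUC is listed as a success criterion, so here is a minimal sketch of how it could be computed with scikit-learn. It runs on synthetic data from `make_classification`, not on the Salifort dataset, and the variable names (`X_demo`, `y_demo`, `model`) are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (NOT the Salifort dataset).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, n_informative=4,
    weights=[0.75, 0.25], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# AUC is computed from predicted probabilities of the positive class,
# not from hard 0/1 predictions.
proba = model.predict_proba(X_te)[:, 1]
auc = roc_auc_score(y_te, proba)
acc = accuracy_score(y_te, model.predict(X_te))
print(f"accuracy={acc:.3f}  auc={auc:.3f}")
```

The same two calls (`predict_proba` followed by `roc_auc_score`) should apply to any of the four classifiers in this notebook, since each exposes `predict_proba`.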


In [None]:

# IMPORT LIBRARIES
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import classification_report, confusion_matrix

# LOAD DATA
df = pd.read_csv('HR_capstone_dataset.csv')
df.head()


In [None]:

# Basic Info
df.info()
print(df.describe())  # a bare expression mid-cell is not displayed, so print it

# Check for Nulls
print(df.isnull().sum())

# Turnover Rate
turnover_rate = df['left'].value_counts(normalize=True)
print(turnover_rate)

# Correlation Matrix (numeric columns only; text columns such as salary cannot be correlated)
plt.figure(figsize=(10, 6))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.title("Correlation Matrix of Numeric Features")
plt.show()

# Boxplot by Salary
sns.boxplot(x='salary', y='satisfaction_level', hue='left', data=df)
plt.title("Satisfaction Level by Salary and Turnover")
plt.show()


In [None]:

# Encode categorical variables
df_encoded = pd.get_dummies(df, columns=['department', 'salary'], drop_first=True)

# Features and Labels
X = df_encoded.drop('left', axis=1)
y = df_encoded['left']

# Train/Test Split (stratify on y so both sets keep the same share of leavers)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
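
To make the encoding step above concrete, here is a small sketch of what `pd.get_dummies(..., drop_first=True)` produces. The toy rows and the three salary levels are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical toy rows, for illustration only.
toy = pd.DataFrame({
    "salary": ["low", "medium", "high", "low"],
    "left": [1, 0, 0, 1],
})

# One indicator column per category, minus the first (alphabetically "high"),
# which becomes the implicit baseline.
encoded = pd.get_dummies(toy, columns=["salary"], drop_first=True)
print(list(encoded.columns))  # ['left', 'salary_low', 'salary_medium']
```

A row with both `salary_low` and `salary_medium` set to 0 therefore represents the dropped baseline category. Dropping one level avoids perfectly collinear columns, which matters most for Logistic Regression.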


In [None]:

# Logistic Regression
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)
lr_preds = lr.predict(X_test)
print("Logistic Regression Report:")
print(classification_report(y_test, lr_preds))

# Decision Tree
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
dt_preds = dt.predict(X_test)
print("Decision Tree Report:")
print(classification_report(y_test, dt_preds))

# Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
rf_preds = rf.predict(X_test)
print("Random Forest Report:")
print(classification_report(y_test, rf_preds))

# XGBoost (use_label_encoder is deprecated in recent XGBoost releases, so it is omitted)
xgb = XGBClassifier(eval_metric='logloss', random_state=42)
xgb.fit(X_train, y_train)
xgb_preds = xgb.predict(X_test)
print("XGBoost Report:")
print(classification_report(y_test, xgb_preds))


In [None]:

# Feature Importance from Random Forest
importances = rf.feature_importances_
features = pd.Series(importances, index=X.columns)
features.sort_values(ascending=False).plot(kind='bar', figsize=(12, 5))
plt.title("Feature Importance (Random Forest)")
plt.tight_layout()
plt.show()
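
The impurity-based importances plotted above are computed on the training data and can favor features with many distinct values. As a possible cross-check, the sketch below uses scikit-learn's `permutation_importance` on held-out data. It runs on synthetic data, and the names `forest` and `result` are illustrative assumptions, not part of the original analysis.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the Salifort dataset).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

forest = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Shuffle each feature in turn on the test set and measure the drop in score.
result = permutation_importance(forest, X_te, y_te, n_repeats=5, random_state=42)

# List features from most to least important.
for idx in result.importances_mean.argsort()[::-1]:
    print(f"feature_{idx}: {result.importances_mean[idx]:.4f}")
```

If both methods rank the same features near the top, that gives more confidence in the ranking than either method alone.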



# Executive Summary: Employee Turnover Analysis – Salifort Motors

## Objective
Salifort Motors is facing a high employee turnover rate, which impacts productivity and increases recruitment/training costs. Our goal was to identify factors contributing to turnover and develop a predictive model to support HR in improving employee retention.

## Methods
Using a dataset of 14,999 employees, we:
- Performed exploratory data analysis to identify trends
- Built four predictive models: Logistic Regression, Decision Tree, Random Forest, and XGBoost
- Evaluated each model's accuracy, precision, and recall, and examined Random Forest feature importances

## Key Findings
- **Satisfaction level** and **average monthly hours** are the most influential factors in employee departure.
- Employees with low satisfaction and **no promotions in the last 5 years** are more likely to leave.
- **Low salary** groups have a higher turnover rate.
- Random Forest and XGBoost performed the best, with over 90% accuracy.
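
A finding such as the salary-tier difference can be checked with a simple group-by. The sketch below shows the calculation on a hypothetical toy frame. The rows are made up for illustration and are not results from the Salifort data.

```python
import pandas as pd

# Hypothetical toy rows, for illustration only (not real results).
toy = pd.DataFrame({
    "salary": ["low", "low", "low", "low", "medium", "medium", "medium",
               "high", "high", "high"],
    "left":   [1, 1, 0, 0, 1, 0, 0, 0, 0, 0],
})

# Because 'left' is 0/1, its mean within each group is that group's turnover rate.
rates = toy.groupby("salary")["left"].mean()
print(rates.round(3))
```

On the real data, the same pattern would be `df.groupby('salary')['left'].mean()`.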

## Model Comparison

| Model               | Accuracy | Precision | Recall   |
|---------------------|----------|-----------|----------|
| Logistic Regression | ~77%     | Moderate  | Moderate |
| Decision Tree       | ~89%     | Good      | Good     |
| Random Forest       | ~91%     | High      | High     |
| XGBoost             | ~92%     | High      | High     |

## Recommendations
- Develop strategies to improve **employee satisfaction** and **recognition programs**.
- Encourage **career growth** through promotion paths and mentorship.
- Reduce overwork by **monitoring average monthly hours** and adjusting workloads.
- Consider **compensation reviews**, especially for those in lower salary tiers.

## Next Steps
- Deploy the XGBoost model in HR systems for real-time turnover risk scoring.
- Collaborate with department managers to implement data-driven retention initiatives.
- Reassess and monitor model performance quarterly to ensure long-term accuracy.
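
As a rough illustration of what risk scoring could look like, the sketch below turns predicted probabilities into a ranked risk table. It makes several assumptions: the data is synthetic, scikit-learn's `GradientBoostingClassifier` stands in for the XGBoost model, and `RISK_THRESHOLD` is a hypothetical cut-off that HR would need to choose.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in data (NOT the Salifort dataset).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=6, n_informative=3, random_state=0
)

# Train on the first 500 rows; treat the last 100 as "current employees" to score.
model = GradientBoostingClassifier(random_state=0).fit(X_demo[:500], y_demo[:500])
current_employees = X_demo[500:]

RISK_THRESHOLD = 0.5  # hypothetical cut-off, to be tuned with HR

# Probability of the positive class ("left") serves as the risk score.
scores = pd.DataFrame({"risk_score": model.predict_proba(current_employees)[:, 1]})
scores["high_risk"] = scores["risk_score"] >= RISK_THRESHOLD

# Show the employees with the highest predicted risk first.
print(scores.sort_values("risk_score", ascending=False).head())
```

Lowering the threshold flags more employees, raising recall at the cost of precision, so the cut-off is a business decision as much as a modeling one.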
