## 🔍 Causal Inference: Understanding What *Causes* Employee Churn

While predictive modeling can tell us which factors are correlated with employee churn, it doesn't answer the more fundamental question: *what causes employees to leave?*

This notebook explores causal inference techniques to identify potential causal relationships between employee characteristics and churn. By moving beyond correlation, we aim to support more effective and targeted interventions to reduce turnover.

### Objectives
- Apply causal inference methods and tools (e.g., propensity score matching, causal graphs, and the DoWhy library) to assess causal effects.
- Investigate whether the most predictive features are truly causal.
- Provide interpretable and actionable insights for decision-making.

> This analysis builds on the predictive model developed in the previous notebook. For context, refer to [employee_attrition.ipynb](./employee_attrition.ipynb).


Building on the previous data analysis and predictive modeling, we found that variables such as **Job Role** and **Over Time** are strongly associated with employee churn. These two were the most important predictors among all features used in the model. Other variables, like **Age** and **Business Travel**, also showed strong predictive power.

However, correlation does not imply causation.

To deepen our understanding, we now turn to causal inference to investigate whether these variables actually *cause* employees to leave the company. In particular, we will explore the following questions:

- Does having a specific job role (e.g., Director vs. Research Scientist) make someone more likely to leave the company?
- Is younger age a causal factor for churn?
- Does working overtime directly influence the decision to resign?
- Is frequent business travel a reason why employees look for opportunities elsewhere?

Answering these questions will allow us to go beyond prediction and reason about the mechanisms behind churn. With causal insights, we can design targeted interventions that could help reduce churn risk and, consequently, lower associated costs such as employee replacement, training, and lost productivity.

This notebook is part of an ongoing learning journey in causal inference. The goal is to apply theoretical knowledge in a practical scenario, iterating and refining the approach as new insights emerge.

To start our causal analysis, we first need to examine the available variables and understand how they relate to each other. This step is essential: in causal inference, writing down the assumed relationships between variables lets us build a meaningful **causal graph** (a DAG, or Directed Acyclic Graph), which will serve as the foundation for our analysis.

A well-defined causal graph helps us reason about potential confounders, mediators, and colliders, and guides our choice of estimation methods. Let's begin by exploring the variables and thinking about their possible causal connections.
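To make these roles concrete, here is a minimal sketch of how a causal graph can be written down as a list of edges and queried. The edges below are hypothetical assumptions chosen for illustration, not findings from the data, and the helper `ancestors` is an illustrative name rather than part of any library. The check for confounders is a simple heuristic (a common ancestor of treatment and outcome that reaches the outcome without passing through the treatment), and colliders are not handled here.

```python
from collections import defaultdict

# Hypothetical (cause, effect) edges, for illustration only.
edges = [
    ("JobRole", "OverTime"),
    ("JobRole", "Attrition"),
    ("OverTime", "WorkLifeBalance"),
    ("WorkLifeBalance", "Attrition"),
    ("OverTime", "Attrition"),
]

# Map each node to the set of its direct causes.
parents = defaultdict(set)
for cause, effect in edges:
    parents[effect].add(cause)


def ancestors(node, blocked=frozenset()):
    """Return all nodes with a directed path into `node`,
    never walking through any node listed in `blocked`."""
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent in blocked or parent in seen:
                continue
            seen.add(parent)
            stack.append(parent)
    return seen


treatment, outcome = "OverTime", "Attrition"

# Confounder candidates: causes of the treatment that also reach the
# outcome along a path that does not go through the treatment.
confounders = ancestors(treatment) & ancestors(outcome, blocked={treatment})

# Mediator candidates: causes of the outcome that are themselves
# caused by the treatment.
mediators = {n for n in ancestors(outcome) if treatment in ancestors(n)}

print("Confounder candidates:", sorted(confounders))
print("Mediator candidates:", sorted(mediators))
```

The same edge list could be drawn with the `graphviz` package by creating a `gr.Digraph()` and calling its `edge` method once per pair.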

In [2]:
# Import the libraries used in this notebook
import warnings

import graphviz as gr
import numpy as np
import pandas as pd

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

In [3]:
# Read the data file
data = pd.read_csv('data/WA_Fn-UseC_-HR-Employee-Attrition.csv')

# List the column names as a recap
data.columns

Index(['Age', 'Attrition', 'BusinessTravel', 'DailyRate', 'Department',
       'DistanceFromHome', 'Education', 'EducationField', 'EmployeeCount',
       'EmployeeNumber', 'EnvironmentSatisfaction', 'Gender', 'HourlyRate',
       'JobInvolvement', 'JobLevel', 'JobRole', 'JobSatisfaction',
       'MaritalStatus', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked',
       'Over18', 'OverTime', 'PercentSalaryHike', 'PerformanceRating',
       'RelationshipSatisfaction', 'StandardHours', 'StockOptionLevel',
       'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance',
       'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion',
       'YearsWithCurrManager'],
      dtype='object')
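
Before placing these columns in a causal graph, it can help to drop any that never vary and to encode the Yes/No columns as 0/1. Names such as `EmployeeCount`, `Over18`, and `StandardHours` suggest they may be constant, but that is an assumption to verify against the data. The sketch below uses a small hypothetical frame (`toy`) in place of `data` so it runs on its own. The same steps should apply to `data` directly.

```python
import pandas as pd

# Toy stand-in for `data`; the values are hypothetical, not from the real file.
toy = pd.DataFrame({
    "Age": [41, 49, 37],
    "Attrition": ["Yes", "No", "Yes"],
    "OverTime": ["Yes", "No", "Yes"],
    "StandardHours": [80, 80, 80],
})

# A column with a single distinct value cannot vary with anything else,
# so it cannot act as a cause, effect, or confounder in the graph.
constant_cols = [col for col in toy.columns if toy[col].nunique(dropna=False) <= 1]

# Drop the constant columns and encode Yes/No as 1/0 so these variables
# can later serve as a treatment or an outcome.
encoded = toy.drop(columns=constant_cols)
for col in ["Attrition", "OverTime"]:
    encoded[col] = encoded[col].map({"Yes": 1, "No": 0})

print("Constant columns:", constant_cols)
print(encoded)
```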