
## Step 1: Import Required Libraries

We need several libraries to handle data and visualize trends. The machine learning libraries are imported later, in the steps that use them.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

# Suppress all warnings for cleaner output
warnings.filterwarnings('ignore')

# Display all columns
pd.set_option('display.max_columns', None)
```

### **Breaking Down Each Import:**
1. **NumPy (`numpy`)**:  
   - Used for numerical computations.
   - Helps in handling arrays and performing mathematical operations.

2. **Pandas (`pandas`)**:  
   - A powerful library for handling structured data.
   - Used for reading, transforming, and analyzing the dataset.

3. **Matplotlib (`matplotlib.pyplot`)**:  
   - Provides basic plotting functions.
   - Used to **visualize trends in employee attrition**.

4. **Seaborn (`seaborn`)**:  
   - Built on Matplotlib but provides **better statistical visualizations**.
   - Helps in **heatmaps, boxplots, and trend analysis**.

5. **Warnings (`warnings.filterwarnings('ignore')`)**:  
   - Suppresses **all** warnings for cleaner outputs, including deprecation and convergence warnings that can signal real problems.
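
If silencing every warning feels too broad, a narrower filter is possible. The sketch below uses only the standard library; `legacy_helper` is a hypothetical function invented for illustration. It ignores `FutureWarning` inside one block while other warnings stay visible.

```python
import warnings

def legacy_helper():
    # Hypothetical function that emits a deprecation-style warning
    warnings.warn("legacy_helper is deprecated", FutureWarning)
    return 42

# record=True collects the warnings that are NOT ignored so we can inspect them
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")                           # show everything...
    warnings.simplefilter("ignore", category=FutureWarning)   # ...except FutureWarning
    result = legacy_helper()
    warnings.warn("still visible", UserWarning)

print(result)
print([w.category.__name__ for w in caught])
```

The filters set inside the `with` block are discarded when the block exits, so the rest of the notebook keeps its normal warning behaviour.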


In [None]:
# Suppress all warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')

# Importing the required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Display all columns
pd.set_option('display.max_columns', None)


## Step 2: Load the Employee Attrition Dataset

We load the dataset using **Pandas** and inspect its structure.

```python
data = pd.read_csv('/content/drive/MyDrive/Datasets/Input/employee_data.csv')

# Display the first few rows
data.head()
```

### **Breaking Down the Code:**
1. **`pd.read_csv()`**:  
   - Reads the dataset from a CSV file.
   - Stores it in a Pandas **DataFrame**.

2. **`data.head()`**:  
   - Displays the **first five rows** of the dataset.
   - Helps us **quickly inspect** its structure.

### **Why is this Important?**
- We need to understand the **features** available before performing attrition analysis.
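
To try the loading step without access to the Drive path, the same calls can be run on an in-memory CSV. The column names and values below are made up for illustration and are not taken from `employee_data.csv`.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for the real file
csv_text = """Age,Attrition,Department,MonthlyIncome
41,Yes,Sales,5993
49,No,Research & Development,5130
37,Yes,Research & Development,2090
"""

sample = pd.read_csv(io.StringIO(csv_text))

print(sample.head(2))   # first two rows instead of the default five
print(sample.shape)     # (rows, columns)
print(sample.dtypes)    # type inferred for each column
```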


In [None]:
# Data Import
data = pd.read_csv('/content/drive/MyDrive/Datasets/Input/employee_data.csv')
# Data sample
data.head()


## Step 3: Exploratory Data Analysis (EDA)

Before building a model, we **explore** the dataset:

```python
print(f'The dataset has {data.shape[0]} rows and {data.shape[1]} columns.')
data.info()
data.describe()
```

### **Breaking Down the Code:**

1. **`data.shape`**:  
   - Returns the **number of rows and columns**.
   - Helps understand the dataset **size**.

2. **`data.info()`**:  
   - Displays column names, **data types**, and **non-null counts** (a count below the row total indicates missing values).

3. **`data.describe()`**:  
   - Shows **summary statistics** (mean, std, min, max, etc.).

### **Why is this Important?**
- Helps detect **data quality issues** before analysis.
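
By default `describe()` summarizes numeric columns only. As a small sketch on made-up data, the text columns can be summarized separately by selecting the non-numeric columns first.

```python
import pandas as pd

# Toy data invented for illustration
toy = pd.DataFrame({
    "Age": [41, 49, 37, 33],
    "Department": ["Sales", "R&D", "R&D", "Sales"],
    "Attrition": ["Yes", "No", "Yes", "No"],
})

# Numeric columns: count, mean, std, min, quartiles, max
numeric_summary = toy.describe()

# Non-numeric columns: count, unique, top (most frequent), freq
categorical_summary = toy.select_dtypes(exclude="number").describe()

print(numeric_summary)
print(categorical_summary)
```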


In [None]:
# Number of rows and columns in the data
rows, cols = data.shape
print(f'The data has {rows} rows and {cols} columns.')


## Step 4: Checking for Missing Values

Missing values can affect model accuracy. We count missing values in each column.

```python
missing_values = data.isnull().sum()
missing_values[missing_values > 0]
```

### **Breaking Down the Code:**

1. **`data.isnull().sum()`**:  
   - Counts **missing values** in each column.

2. **`missing_values[missing_values > 0]`**:  
   - Keeps only the columns that **have at least one missing value**.

### **Why is this Important?**
- Missing data can **bias** the analysis.
- We decide whether to **drop or impute** missing values.
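
The two options can be compared on a toy frame. This is only a sketch: the data is invented, and median and mode imputation are common defaults rather than the choice made in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value per column
toy = pd.DataFrame({
    "MonthlyIncome": [5000.0, np.nan, 3000.0, 4000.0],
    "Department": ["Sales", "R&D", None, "R&D"],
})

missing_counts = toy.isnull().sum()

# Option 1: drop every row that contains a missing value
dropped = toy.dropna()

# Option 2: impute (median for numeric, most frequent value for categorical)
imputed = toy.copy()
imputed["MonthlyIncome"] = imputed["MonthlyIncome"].fillna(imputed["MonthlyIncome"].median())
imputed["Department"] = imputed["Department"].fillna(imputed["Department"].mode()[0])

print(missing_counts)
print(len(dropped), "rows left after dropping")
print(imputed)
```

Dropping is simple but discards half of this toy frame, while imputing keeps every row at the cost of inventing plausible values.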


In [None]:
# Number of numerical and categorical features
num, obj = 0, 0
for feature in data.columns:
    if data[feature].dtype == 'O':
        obj += 1
    else:
        num += 1
print('NUMBER OF CATEGORICAL AND NUMERICAL FEATURES:')
print(f'The data has {obj} categorical and {num} numerical features.')

# Percentage of missing values
print('\nPERCENTAGE OF MISSING VALUES:')
missing = round(data.isnull().mean()*100, 2)

# Count the nulls directly; report percentages only for columns that contain any
total_missing = data.isnull().sum().sum()
print('There are no missing values in the dataset.' if total_missing == 0 else missing[missing > 0])


## Step 5: Attrition Rate Analysis

We analyze how many employees **left** vs. **stayed**.

```python
sns.countplot(x='Attrition', data=data, palette='coolwarm')
plt.title('Employee Attrition Distribution')
plt.show()
```

### **Breaking Down the Code:**

1. **`sns.countplot(x='Attrition', data=data, palette='coolwarm')`**:  
   - Creates a **count plot** showing attrition distribution.
   - **X-axis** = Attrition categories (Yes/No).

2. **`plt.title()` & `plt.show()`**:  
   - Adds a title & displays the plot.

### **Why is this Important?**
- Employee attrition datasets are often **imbalanced** (fewer employees leaving than staying).
- Class imbalance can **affect model predictions**.
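
The plot shows the imbalance visually; `value_counts` puts numbers on it. The 8-to-2 split below is invented for illustration and is not the ratio in the real dataset.

```python
import pandas as pd

# Hypothetical target column: 8 employees stayed, 2 left
attrition = pd.Series(["No"] * 8 + ["Yes"] * 2, name="Attrition")

counts = attrition.value_counts()                 # absolute counts per class
shares = attrition.value_counts(normalize=True)   # fraction of rows per class
imbalance_ratio = counts.max() / counts.min()     # majority-to-minority ratio

print(counts)
print(shares)
print(f"Majority class is {imbalance_ratio:.1f}x the minority class")
```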


In [None]:
# Checking data duplicates
rows, cols = data[data.duplicated()].shape
print('There are no duplicates.' if rows == 0 else f'There are {rows} duplicates in the data.')

# If duplicates are found, drop them:
# data.drop_duplicates(inplace=True)


## Step 6: Correlation Analysis

We check how numerical features are related.

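```python
plt.figure(figsize=(12,8))
# numeric_only=True restricts the calculation to numeric columns
sns.heatmap(data.corr(numeric_only=True), cmap='coolwarm', annot=True, fmt='.2f')
plt.title('Correlation Heatmap')
plt.show()
```

### **Breaking Down the Code:**

1. **`data.corr(numeric_only=True)`**:  
   - Computes Pearson correlation between numerical columns.
   - `numeric_only=True` skips text columns; without it, pandas 2.0 and later raise an error when the DataFrame still contains strings.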

2. **`sns.heatmap()`**:  
   - Creates a **heatmap** for correlation visualization.

3. **`cmap='coolwarm', annot=True`**:  
   - Uses **coolwarm color scale** & shows values inside cells.

### **Why is this Important?**
- Helps identify **highly correlated features** (which may be redundant).
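
Reading pairs off a large heatmap is error-prone, so the matrix can also be scanned programmatically. The sketch below uses synthetic data, and the 0.8 cutoff is an arbitrary assumption rather than a recommended threshold.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
years = rng.integers(0, 20, size=200)

# Synthetic columns: the first two are built to be strongly related
toy = pd.DataFrame({
    "YearsAtCompany": years,
    "YearsInCurrentRole": years * 0.8 + rng.normal(0, 0.5, size=200),
    "DistanceFromHome": rng.integers(1, 30, size=200),
})

corr = toy.corr(numeric_only=True).abs()
cols = corr.columns

# Visit each unordered pair once (upper triangle of the matrix)
high_pairs = [
    (cols[i], cols[j], corr.iloc[i, j])
    for i in range(len(cols))
    for j in range(i + 1, len(cols))
    if corr.iloc[i, j] > 0.8
]

for a, b, r in high_pairs:
    print(f"{a} vs {b}: |r| = {r:.2f}")
```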


In [None]:
# Feature "Over18" contains a single value, so it won't contribute to the analysis
del data['Over18']


## Step 7: Feature Engineering - Encoding Categorical Variables

Most machine learning models, including logistic regression, cannot process **categorical variables** directly.  
We convert them into **numerical format** using:

1. **Label Encoding** (For binary categorical features).  
2. **One-Hot Encoding** (For multi-category features).  

```python
from sklearn.preprocessing import LabelEncoder

# Encode binary categorical variables (Yes/No → 0/1)
binary_cols = ['Attrition', 'OverTime']
for col in binary_cols:
    data[col] = LabelEncoder().fit_transform(data[col])

# One-Hot Encoding for categorical features with multiple categories
data = pd.get_dummies(data, drop_first=True)

print("Encoded dataset preview:")
data.head()
```

### **Breaking Down the Code:**

1. **`LabelEncoder().fit_transform(data[col])`**:  
   - Converts **Yes/No columns** into **0/1** format.

2. **`pd.get_dummies(data, drop_first=True)`**:  
   - Performs **One-Hot Encoding**, creating new **binary columns** for each category.

### **Why is This Important?**
- Most models **cannot handle text**; categorical data must be converted to numbers first.
- One-hot encoding avoids implying an **artificial order** between categories, which plain integer codes would introduce.
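
A three-row toy frame makes the effect of both encodings visible. The values are invented. The explicit `map` is shown as an alternative to `LabelEncoder` that makes the 0/1 assignment visible, and `dtype=int` is an optional choice to get 0/1 instead of booleans.

```python
import pandas as pd

toy = pd.DataFrame({
    "OverTime": ["Yes", "No", "No"],
    "Department": ["Sales", "HR", "R&D"],
})

# Binary column: explicit mapping so the meaning of 0 and 1 is obvious
toy["OverTime"] = toy["OverTime"].map({"No": 0, "Yes": 1})

# Multi-category column: one indicator column per category, minus the first
encoded = pd.get_dummies(toy, columns=["Department"], drop_first=True, dtype=int)

print(encoded)
```

With `drop_first=True` the alphabetically first category (`HR` here) has no column of its own and is represented by zeros in the remaining indicator columns.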


In [None]:
# Understanding the numerical features
for feature in data.columns:
    if data[feature].dtype != 'O':
        if len(data[feature].unique()) == 1:
            print(f'** {feature} has {len(data[feature].unique())} unique values **')
        else:
            print(f'{feature} has {len(data[feature].unique())} unique values')


## Step 8: Handling Class Imbalance

If the dataset has **more employees staying** than leaving,  
the model may **struggle to predict attrition correctly**.

We handle imbalance using **Synthetic Minority Over-sampling Technique (SMOTE)**.

```python
from imblearn.over_sampling import SMOTE

# Separate features and target variable
X = data.drop(columns=['Attrition'])
y = data['Attrition']

# Apply SMOTE to balance the dataset
smote = SMOTE(sampling_strategy='auto', random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)

print("Class distribution after SMOTE:", y_resampled.value_counts())
```

### **Breaking Down the Code:**

1. **`SMOTE(sampling_strategy='auto')`**:  
   - Generates **synthetic samples** for the minority class.  

2. **`fit_resample(X, y)`**:  
   - Balances the dataset by **over-sampling the minority class**.

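### **Why is This Important?**
- Prevents the model from being **biased towards the majority class**.
- Gives the model enough minority examples to **learn what attrition looks like**.
- Resampling is normally applied to the **training split only**; oversampling before the train/test split, as done here, lets synthetic points derived from test rows leak into training and can inflate the evaluation scores.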
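
To see what "synthetic samples" means, here is a NumPy-only sketch of the interpolation idea behind SMOTE. It is a simplified illustration, not the `imblearn` implementation; `smote_like_samples` and its parameters are hypothetical names.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)

    # Pairwise Euclidean distances between minority samples
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    np.fill_diagonal(dists, np.inf)              # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_minority))     # pick a random minority sample
        neigh = rng.choice(neighbours[base])     # pick one of its k nearest neighbours
        gap = rng.random()                       # position along the connecting segment
        synthetic[i] = X_minority[base] + gap * (X_minority[neigh] - X_minority[base])
    return synthetic

# Four made-up minority-class points in two dimensions
X_min = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [2.5, 2.5]])
new_rows = smote_like_samples(X_min, n_new=5)
print(new_rows)
```

Each new row lies on a line segment between two existing minority points, so the synthetic data stays inside the region the minority class already occupies.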


In [None]:
# Features "StandardHours" and "EmployeeCount" have only one unique value
# Feature "EmployeeNumber" looks like an ID column

for cols in ['StandardHours', 'EmployeeNumber', 'EmployeeCount']:
    del data[cols]


## Step 9: Model Selection and Training

We train a **Logistic Regression model** to predict attrition.

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=42)

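# Train the model
# max_iter is raised above the default of 100 so the solver is more likely to converge on unscaled features
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)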

print("Model trained successfully!")
```

### **Breaking Down the Code:**

1. **`train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=42)`**:  
   - Splits the dataset into **80% training** and **20% testing**.

2. **`LogisticRegression()`**:  
   - A statistical model used for **binary classification**.

3. **`model.fit(X_train, y_train)`**:  
   - Trains the model using the **training data**.

### **Why is This Important?**
- Helps predict whether an employee **will leave or stay**.
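
A variation worth knowing, sketched on synthetic data: `stratify` keeps the class ratio equal in both splits, and a scaling step inside a pipeline helps logistic regression converge when features live on very different scales. The names `X_demo`, `y_demo` and `clf` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic features on very different scales
X_demo = rng.normal(size=(500, 3)) * [1, 100, 10_000]
logits = 1.5 * X_demo[:, 0] + 0.01 * X_demo[:, 1] - 1.0
y_demo = (rng.random(500) < 1 / (1 + np.exp(-logits))).astype(int)

# stratify=y_demo preserves the class proportions in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The scaler is fitted on the training data only, then applied to the test data
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_tr, y_tr)

test_accuracy = clf.score(X_te, y_te)
print(f"Test accuracy: {test_accuracy:.2f}")
```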


In [None]:
# OBSERVATIONS ON QUALITATIVE AND QUANTITATIVE DATA DISTRIBUTION BY ATTRITION:


#   1. For the quantitative variables, the distribution for attrition 'Yes' follows the same pattern as the
#      distribution for attrition 'No', with lower density.

#   2. For the qualitative variables, the counts for attrition 'Yes' look like scaled-down counts for attrition
#      'No', except for JobRole, MaritalStatus and OverTime.

#   3. Since the quantitative distributions follow a similar pattern with lower density, further analysis
#      can be carried out on the data where attrition is 'Yes'.

#   4. Similarly, since the qualitative counts are scaled down, further analysis can be carried out on the
#      data where attrition is 'Yes', except for JobRole, MaritalStatus and OverTime.



## Step 10: Model Evaluation - Precision-Recall Curve

Since attrition is **imbalanced**, we use the **Precision-Recall Curve** to evaluate performance.

```python
from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt

# Get model probabilities
y_scores = model.predict_proba(X_test)[:, 1]

# Compute Precision-Recall curve
precision, recall, _ = precision_recall_curve(y_test, y_scores)

# Plot the curve
plt.figure(figsize=(8,6))
plt.plot(recall, precision, marker='.', label="Logistic Regression")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall Curve")
plt.legend()
plt.show()
```

### **Breaking Down the Code:**

1. **`model.predict_proba(X_test)[:, 1]`**:  
   - Gets the **probability scores** for class **1 (Attrition)**.

2. **`precision_recall_curve(y_test, y_scores)`**:  
   - Computes the **Precision-Recall curve**.

3. **`plt.plot(recall, precision, marker='.')`**:  
   - Plots the **Precision vs Recall**.

### **Why is This Important?**
- Traditional accuracy is **misleading for imbalanced datasets**.
- Precision-Recall Curve helps us **better assess performance**.
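
A hand-computed example shows why accuracy alone can mislead and how the threshold trades precision against recall. The labels and scores below are invented, and `precision_recall_at` is a hypothetical helper, not a scikit-learn function.

```python
import numpy as np

# Hypothetical test set: 2 leavers (1) among 10 employees
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_prob = np.array([0.1, 0.2, 0.1, 0.3, 0.6, 0.2, 0.1, 0.4, 0.7, 0.35])

def precision_recall_at(threshold):
    """Precision and recall for the positive class at a given probability cutoff."""
    y_pred = (y_prob >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if (tp + fp) else 1.0   # convention when nothing is predicted positive
    recall = tp / (tp + fn)
    return precision, recall

# Always predicting "stays" is 80% accurate yet finds no leavers
baseline_accuracy = np.mean(y_true == 0)

p_half, r_half = precision_recall_at(0.5)
p_low, r_low = precision_recall_at(0.3)

print(f"Always-stay accuracy: {baseline_accuracy:.2f}")
print(f"Threshold 0.5: precision={p_half:.2f}, recall={r_half:.2f}")
print(f"Threshold 0.3: precision={p_low:.2f}, recall={r_low:.2f}")
```

Lowering the threshold from 0.5 to 0.3 catches both leavers but flags more employees who actually stay. Each point on the Precision-Recall curve is one such trade-off.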


In [None]:
# OBSERVATION:

#    1. Male employees are most likely to attrite.
#    2. Employees from the R&D and Sales departments are most likely to attrite.
#    3. Employees with a life sciences or medical background are most likely to attrite.
#    4. Employees working as laboratory technicians, sales executives, research scientists or
#       sales representatives are most likely to attrite.
#    5. Single employees are most likely to attrite.

#    ** Bivariate analysis with hue has to be done for further insights.


## Step 11: Model Explainability using SHAP

We use **SHAP (SHapley Additive Explanations)** to understand **which features influence attrition predictions**.

```python
import shap

# Initialize SHAP Explainer
explainer = shap.Explainer(model, X_test)
shap_values = explainer(X_test)

# Summary Plot
shap.summary_plot(shap_values, X_test, plot_type="bar")
```

### **Breaking Down the Code:**

1. **`shap.Explainer(model, X_test)`**:  
   - Creates a SHAP explainer for the trained model.

2. **`shap_values = explainer(X_test)`**:  
   - Computes SHAP values for **each feature**.

3. **`shap.summary_plot(shap_values, X_test, plot_type="bar")`**:  
   - Displays a **bar chart** of feature importance.

### **Why is This Important?**
- Helps **interpret model decisions** (e.g., which features drive attrition predictions).
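
For a linear model with features treated as independent, SHAP values have a closed form: each feature contributes its coefficient times its deviation from the background mean, measured on the log-odds scale. The sketch below checks this additive property with NumPy only; the coefficients and data are invented, and it does not call the `shap` library.

```python
import numpy as np

# Hypothetical logistic-regression parameters (log-odds scale)
weights = np.array([0.8, -0.5, 0.1])
intercept = -0.2

# Made-up background data and one employee to explain
X_background = np.array([[1.0, 2.0, 3.0],
                         [3.0, 0.0, 1.0],
                         [2.0, 4.0, 5.0]])
x = np.array([4.0, 1.0, 3.0])

def log_odds(rows):
    return rows @ weights + intercept

# Expected model output over the background data
baseline = log_odds(X_background).mean()

# Per-feature contribution: coefficient times deviation from the background mean
contributions = weights * (x - X_background.mean(axis=0))

print("Baseline log-odds:", baseline)
print("Contributions:", contributions)
print("Baseline + contributions:", baseline + contributions.sum())
print("Model output for x:", log_odds(x))
```

The contributions sum to the gap between this employee's prediction and the average prediction. That is the property the bar chart summarizes by averaging absolute contributions per feature.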


In [None]:
# OBSERVATIONS FOR QUANTITATIVE FEATURES HAVING FEWER THAN 30 UNIQUE VALUES:

# 01. Employees living nearby are more likely to attrite. This is counterintuitive and needs further analysis.
# 02. Employees with education rank 3 & 4 are most likely to attrite.
# 03. Employees with JobInvolvement rank 3 are most likely to attrite.
# 04. Employees with JobLevel 1 are most likely to attrite.
# 05. Employees who have worked in only one company are most likely to attrite.
# 06. Employees with less than a 14% salary hike are most likely to attrite.
# 07. Employees with a performance rating of 3 are most likely to attrite.
# 08. Employees with StockOptionLevel 0 are most likely to attrite.
# 09. Employees with 2 or 3 TrainingTimesLastYear are most likely to attrite.
# 10. Employees with WorkLifeBalance rank 3 are most likely to attrite. This is counterintuitive and needs further analysis.
# 11. Employees with 1 year of experience in the company are more likely to attrite.
# 12. Employees serving in the same role for 2 years are most likely to attrite.
# 13. Employees with 1 year or less since their last promotion are most likely to attrite.
# 14. Employees with less than 1 year with their current manager are most likely to attrite.
# 15. Environment, job and relationship satisfaction are nearly equally spread across all ranks.


## Step 12: Hyperparameter Tuning - GridSearchCV

We optimize the **hyperparameters** using **GridSearchCV**.

```python
from sklearn.model_selection import GridSearchCV

# Define hyperparameters to tune
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100]
}

# Grid Search
grid_search = GridSearchCV(LogisticRegression(), param_grid, cv=5, scoring='f1')
grid_search.fit(X_train, y_train)

print("Best Parameters:", grid_search.best_params_)
```

### **Breaking Down the Code:**

1. **`param_grid = {'C': [0.01, 0.1, 1, 10, 100]}`**:  
   - Defines a **range of values** for **`C`, the inverse of the regularization strength** (smaller values mean stronger regularization).

2. **`GridSearchCV(LogisticRegression(), param_grid, cv=5, scoring='f1')`**:  
   - Performs **grid search** over hyperparameters.
   - Uses **5-fold cross-validation**.

3. **`grid_search.fit(X_train, y_train)`**:  
   - Trains multiple models with different parameters.

### **Why is This Important?**
- Improves model **performance** by selecting the **best hyperparameters**.
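
The search can also tune a whole pipeline, so that scaling is refitted inside every fold. This sketch runs on synthetic data. The step names `scale` and `clf` are arbitrary labels, and the `clf__C` key follows scikit-learn's `step__parameter` naming convention.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic binary problem driven by the first two features
X_demo = rng.normal(size=(300, 4))
noise = rng.normal(scale=0.5, size=300)
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] + noise > 0).astype(int)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

search = GridSearchCV(pipe, {"clf__C": [0.01, 0.1, 1, 10, 100]}, cv=5, scoring="f1")
search.fit(X_demo, y_demo)

best_C = search.best_params_["clf__C"]
print("Best C:", best_C)
print("Best cross-validated F1:", round(search.best_score_, 3))

# Mean F1 for every candidate, useful for checking how sensitive the model is to C
for c, score in zip(search.cv_results_["param_clf__C"], search.cv_results_["mean_test_score"]):
    print(f"C={c}: mean F1={score:.3f}")
```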


In [None]:
# Creating plot structure
fig = plt.figure(figsize=(18,12))
spec = fig.add_gridspec(2,2)
spec.update(wspace=0.15,hspace=0.2)
sec_1 = fig.add_subplot(spec[0,0])
sec_2 = fig.add_subplot(spec[0,1])
sec_3 = fig.add_subplot(spec[1,0])
sec_4 = fig.add_subplot(spec[1,1])

# Adding color preferences
bg_color = '#ffd9d9'
for selection in [fig, sec_1, sec_2, sec_3, sec_4]:
    selection.set_facecolor(bg_color)

# Plotting the graph
# NOTE: SalRep_data is not defined in this section; it is assumed to be a DataFrame
# of Sales Representative rows prepared in an earlier cell.
sec = [sec_1, sec_2, sec_3, sec_4]
hues = ['JobLevel', 'PerformanceRating', 'YearsSinceLastPromotion']
for ax, hue in zip(sec, hues):  # the fourth panel is left for the text box
    sns.countplot(data=SalRep_data, x='JobRole', hue=hue, ax=ax, palette='RdYlBu')
    ax.grid(color='#000000', linestyle=':', axis='y', zorder=0, dashes=(1,5))
    ax.set_title('Sales Representative attrition Vs '+hue+' as hue', size=14)
    ax.set_xlabel('')
    for location in ['top', 'right', 'left']:
        ax.spines[location].set_visible(False)

# Narrating the observation
sec_4.text(0.5, 0.6,
           'OBSERVATION\n__________________\n\n'
           'Sales Representatives with low JobLevel,\n'
           'low PerformanceRating and with less\n'
           'than 1 YearsSinceLastPromotion are\n'
           'most likely to attrite.',
           ha='center', va='center', size=18, weight=550, family='serif')

# Removing axis and spines
sec_4.xaxis.set_visible(False)
sec_4.yaxis.set_visible(False)
for location in ['top', 'right', 'left', 'bottom']:
    sec_4.spines[location].set_visible(False)

