This notebook demonstrates the development of a neural network model using the HR Employee Attrition dataset. The dataset contains anonymized employee records, including variables such as satisfaction level, evaluation scores, workload, and salary levels, with the target variable indicating whether the employee left the company.

The goal of this project is to:
* Load and explore the dataset.
* Remove outliers using Z-score.
* Preprocess the data (scale numerical columns and one-hot encode categorical variables).
* Train a neural network model (MLPClassifier) to predict employee attrition.
* Evaluate model performance using accuracy, confusion matrix, and classification report.
* Save and reuse the trained model for predictions on unseen data.

This notebook also demonstrates how to use the pickle library to serialize and deserialize both datasets and trained models for future use.
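As a minimal sketch of the serialize/deserialize round trip, the snippet below pickles a small dictionary into an in-memory buffer instead of a file on disk. The `record` values are hypothetical and only illustrate the mechanics; the notebook itself pickles the processed arrays and the trained model.

```python
import io
import pickle

# Hypothetical record, used only to illustrate the round trip
record = {"satisfaction_level": 0.38, "salary": "low", "left": 1}

# Serialize into an in-memory buffer (a file opened with mode 'wb' works the same way)
buffer = io.BytesIO()
pickle.dump(record, buffer)

# Rewind and deserialize into a new, equal object
buffer.seek(0)
restored = pickle.load(buffer)

print(restored == record)
```

Pickle files can execute arbitrary code when loaded, so only unpickle data that comes from a trusted source.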

The final model is able to predict the likelihood that a given employee will leave the company, based on their behavior, history, and role characteristics.

The machine learning technique used is a Multilayer Perceptron (MLP), implemented via the `MLPClassifier` from Scikit-learn. MLP is a supervised learning algorithm that learns a mapping from input features to output labels by optimizing weights in a neural network. It consists of one or more hidden layers of neurons and uses backpropagation with gradient descent (in this case, the 'adam' optimizer) to minimize prediction error. The ReLU activation function is used to introduce non-linearity and help the model capture complex patterns in the data.

Evaluate data (Exploratory Data Analysis):
* Understand the structure of the dataset
    * Continuous numerical columns: satisfaction_level, last_evaluation, average_monthly_hours.
    * Discrete/categorical columns encoded as numbers: number_project, time_spend_company, work_accident, promotion_last_5years, left.
    * True categorical columns: departments, salary.
    * Prediction target: left (0 = stayed, 1 = left).
* Detect outliers and missing values
* Visualize distributions, correlations, and trends
* Assess data quality
* Generate initial hypotheses

## Libraries Used

This section provides an overview of each Python library used in the notebook and its role in the project.

### 🔹 `pickle`
- **Purpose**: Serialization and deserialization of Python objects.
- **Used for**: Saving and loading the trained model (`MLPClassifier`) and processed datasets to/from disk for reuse without retraining.

### 🔹 `pandas`
- **Purpose**: Data manipulation and analysis.
- **Used for**: Loading the dataset from CSV, inspecting null values, exploring statistics (`describe()`), and preparing the DataFrame for modeling.

### 🔹 `numpy`
- **Purpose**: Efficient numerical computations and array operations.
- **Used for**: Applying the Z-score formula for outlier detection and working with numerical data structures.

### 🔹 `scipy.stats`
- **Purpose**: Statistical functions.
- **Used for**: Calculating Z-scores using `stats.zscore` to identify and remove outliers from the dataset.

### 🔹 `sklearn.model_selection`
- **Purpose**: Tools for splitting datasets and model evaluation.
- **Used for**: `train_test_split()` divides the dataset into training and testing sets.

### 🔹 `sklearn.preprocessing`
- **Purpose**: Data preprocessing and transformation.
- **Used for**:
  - `StandardScaler`: Standardizes numeric features to zero mean and unit variance (mean = 0, std = 1). It rescales the values but does not change the shape of their distribution.
  - `OneHotEncoder`: Encodes categorical variables into a binary format for model training.

### 🔹 `sklearn.compose`
- **Purpose**: Column-wise transformations using `ColumnTransformer`.
- **Used for**: Creating a pipeline that applies scaling to numeric columns and encoding to categorical columns in a unified way.

### 🔹 `sklearn.pipeline`
- **Purpose**: Building sequential data processing and modeling pipelines.
- **Used for**: Organizing preprocessing steps into a pipeline to ensure clean and reproducible transformations.

### 🔹 `sklearn.neural_network`
- **Purpose**: Neural network models.
- **Used for**: Training an `MLPClassifier` (multi-layer perceptron) to predict employee attrition.

### 🔹 `sklearn.metrics`
- **Purpose**: Evaluation metrics for model performance.
- **Used for**: Calculating accuracy, confusion matrix, and detailed classification reports for the trained model.

---

These libraries, when used together, provide a powerful stack for building, training, evaluating, and deploying machine learning models in Python.

In [1]:
import pickle
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from scipy import stats


# Loading the data
pd.set_option('display.expand_frame_repr', False)
df = pd.read_csv('ds/HR_Dataset.csv')
df['id'] = range(1, len(df) + 1)
print('Data loaded successfully!\n')

print(df.describe())
print('\nNull values per column:\n', df.isnull().sum())

Data loaded successfully!

       satisfaction_level  last_evaluation  number_project  average_monthly_hours  time_spend_company  work_accident          left  promotion_last_5years            id
count        14999.000000     14999.000000    14999.000000           14999.000000        14999.000000   14999.000000  14999.000000           14999.000000  14999.000000
mean             0.612834         0.716102        3.803054             201.050337            3.498233       0.144610      0.238083               0.021268   7500.000000
std              0.248631         0.171169        1.232592              49.943099            1.460136       0.351719      0.425924               0.144281   4329.982679
min              0.090000         0.360000        2.000000              96.000000            2.000000       0.000000      0.000000               0.000000      1.000000
25%              0.440000         0.560000        3.000000             156.000000            3.000000       0.000000      0.000000   

## Why Removing Outliers is Useful

Removing **outliers** (extreme values) is important for several reasons — especially when working with **statistical or machine learning models**.

---

### 🎯 1. Models Get Confused by Extreme Values

- Algorithms that rely on means, variances, or distances (like linear regression, neural networks, or SVMs) are **sensitive to the scale and spread of the inputs**, even though most of them do not strictly assume normally distributed data.
- Outliers pull the mean and coefficients toward extremes, **distorting the model’s learning**.
- **Example**: An employee recorded as working 1000 hours in a month is almost certainly a data entry error, yet that single record can skew how the model relates hours worked to the chance of leaving.

---

### 📉 2. Improves Model Performance

- Clean data helps models **generalize better** to new data, because the model is less likely to fit a handful of extreme records (reduces overfitting).
- Outliers can prevent the model from detecting real patterns and instead make it "learn" from exceptions.

---

### 📊 3. Stabilizes Statistics Like Mean and Standard Deviation

- Many descriptive statistics are **very sensitive to outliers** (mean, variance, etc.).
- Removing them allows your metrics to better summarize the typical behavior of the data.
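A small sketch of that sensitivity, using made-up tenure values (not taken from the HR dataset): a single extreme record shifts the mean considerably while the median stays put.

```python
import statistics

# Hypothetical years at the company; the last list adds one extreme record
typical = [3, 3, 4, 4, 5]
with_outlier = typical + [50]

mean_before = statistics.mean(typical)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(typical)
median_after = statistics.median(with_outlier)

print(mean_before, mean_after)      # the mean jumps
print(median_before, median_after)  # the median barely moves
```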

---

### ✅ 4. Helps Identify Errors or Inconsistencies

- Outliers are not always “bad” — but when they appear, they **deserve investigation**:
  - A `satisfaction_level = 0.01` may be legitimate (someone truly dissatisfied), or a data entry error.
  - A `time_spend_company = 50` might indicate **a mistake or system error**.

---

### 📦 5. Reduces Noise and Improves Visualization

- Charts like histograms or boxplots become clearer without extreme values stretching the axes.

---

### 🚫 Exception

- In **some fields (like fraud detection, predictive maintenance, etc.)**, the outliers are **exactly what you want to find**. In these cases, you **should not remove them**, but instead analyze them carefully.

## 📊 Z-Score Formula

The **Z-score** is a statistical measurement that describes a value's position relative to the mean of a group of values. It is measured in terms of standard deviations from the mean.

### Formula
$Z = \frac{x - \mu}{\sigma}$

Where:

- `x` = the value being evaluated
- `μ` = the mean of the dataset
- `σ` = the standard deviation of the dataset

---

## 💡 Example

Let’s say we are analyzing salaries:

- Mean salary (μ) = 50,000
- Standard deviation (σ) = 5,000
- One employee earns (x) = 60,000

### Applying the formula:
```
Z = (60,000 - 50,000) / 5,000
Z = 10,000 / 5,000
Z = 2.0
```

### Interpretation:

A Z-score of **2.0** means that this salary is **2 standard deviations above the average**. This is not considered an outlier (which typically would require a Z-score > 3 or < -3), but it does indicate that the value is relatively high.
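The same calculation can be sketched in plain Python. The first part reproduces the salary example above; the second part applies the |Z| > 3 rule to a made-up list of monthly hours (hypothetical values). It uses the population standard deviation, which is also the default of `scipy.stats.zscore`.

```python
import statistics

def z_score(x, mean, std):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / std

# Salary example from the text
z = z_score(60_000, 50_000, 5_000)
is_outlier = abs(z) > 3

# Hypothetical monthly hours with one implausible record
hours = [190, 210] * 9 + [200, 1000]
mean_hours = statistics.mean(hours)
std_hours = statistics.pstdev(hours)  # population standard deviation
flagged = [h for h in hours if abs(z_score(h, mean_hours, std_hours)) > 3]

print(z, is_outlier, flagged)
```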

---

## 🧠 Why is Z-score useful?

- It helps detect **outliers** in a dataset.
- It **normalizes values**, allowing comparison across different scales.
- It's commonly used in **standardization** before machine learning models.

In [31]:
# Removing outliers using z-score on the numeric columns (binary flags are excluded;
# note that time_spend_company and number_project are discrete counts)
continuous_cols = ['satisfaction_level', 'last_evaluation', 'average_monthly_hours', 'time_spend_company', 'number_project']
z = np.abs(stats.zscore(df[continuous_cols]))
outliers = (z > 3).any(axis=1)

print('\nRecords identified as outliers: \n')
print(df[outliers])

# Remove outliers from dataframe
df = df[~outliers]


Records identified as outliers: 

       satisfaction_level  last_evaluation  number_project  average_monthly_hours  time_spend_company  work_accident  left  promotion_last_5years departments  salary     id
11007                0.49             0.67               2                    190                   8              0     0                      0   marketing  medium  11008
11008                0.92             0.99               3                    176                   8              0     0                      0       sales  medium  11009
11009                0.81             0.55               4                    217                   8              0     0                      0  accounting  medium  11010
11010                0.62             0.91               3                    269                   8              0     0                      0     support  medium  11011
11011                0.21             0.70               3                    238            

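* StandardScaler: replaces each value with its z-score, computed with the mean and standard deviation of its own column. The fitted scaler stores those per-column statistics so the same transformation can be applied to new data.
* OneHotEncoder: creates one 0/1 column per category. For instance, the category `medium` in `salary` becomes a column `salary_medium` that is 1 for rows with that salary level and 0 otherwise.
    * The first category's column is usually dropped because it can be inferred from the others (all remaining columns being 0 means the row belongs to the dropped category).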
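To make the drop-first idea concrete, here is a hand-rolled sketch in plain Python. The helper `one_hot_drop_first` is hypothetical and only mimics the behavior described above; it is not the scikit-learn implementation.

```python
def one_hot_drop_first(values, prefix):
    """Encode categories as 0/1 columns, dropping the first category."""
    categories = sorted(set(values))   # alphabetical order: ['high', 'low', 'medium']
    kept = categories[1:]              # the first category ('high') is dropped
    columns = [f"{prefix}_{c}" for c in kept]
    rows = [[1 if v == c else 0 for c in kept] for v in values]
    return columns, rows

columns, rows = one_hot_drop_first(["low", "medium", "high", "medium"], "salary")

print(columns)
for row in rows:
    print(row)  # the 'high' row is all zeros: it is implied by the other columns
```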

### 🔍 Avoiding Multicollinearity

Multicollinearity occurs when **two or more independent variables** in a dataset are **highly correlated** with each other. This means that one variable can be **linearly predicted** from the others with a high degree of accuracy, which can distort the model’s learning process.

#### ⚠️ Why It Matters
- It reduces the interpretability of coefficients in linear models.
- It increases the variance of model parameters, making predictions unstable.
- It may cause overfitting in models sensitive to input redundancy.

#### ✅ Key Notes
- **Non-linear models** like **XGBoost**, **Random Forest**, and **Neural Networks** are generally **less affected** by multicollinearity.
- Always **check for null values**, as missing data can interfere with correlation inference.

#### 🧪 How to Detect Multicollinearity
- **Pearson Correlation Matrix**: Identify variables with correlation > 0.8 or 0.9.
- **Heatmaps**: Visualize correlations for easier interpretation.
- **Variance Inflation Factor (VIF)**: Quantifies how much the variance of a coefficient is inflated due to multicollinearity. A common rule of thumb is to review or remove features with VIF > 10.
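A minimal sketch of the first and last checks, in plain Python. The two lists are hypothetical values chosen to be strongly correlated, not data from the HR dataset. With only two predictors, the R² used in the VIF formula equals the squared Pearson correlation, so VIF = 1 / (1 - r²).

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values for illustration only
number_project = [2, 3, 4, 5, 6, 7]
average_monthly_hours = [150, 160, 200, 210, 260, 270]

r = pearson(number_project, average_monthly_hours)
highly_correlated = abs(r) > 0.8
vif = 1 / (1 - r ** 2)  # valid for the two-predictor case only

print(round(r, 3), highly_correlated, round(vif, 1))
```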

#### 🧾 Common Examples of Correlated Variables
- `height` and `arm span`: Usually strongly correlated in biological data.
- `engine size` and `horsepower`: In vehicle specs, both increase together.
- `monthly income` and `years of experience`: Typically rise in tandem in job-related datasets.
- `number_project` and `average_monthly_hours`: In HR datasets, more projects often mean more work hours.
- `temperature` and `ice cream sales`: Seasonal correlation in marketing data.

## 🧪 Hold-Out Validation

The **hold-out method** is a common technique for validating machine learning models by splitting the dataset into two parts:

- **Training set**: Used to train the model (e.g., 80% of the data).
- **Test set**: Used to evaluate how well the model performs on unseen data (e.g., 20%).

Alternatives:
- K-Fold Cross-Validation, Stratified K-Fold, Repeated K-Fold
- Leave-One-Out (LOOCV)
- Time Series Split (Forward Chaining)
- Monte Carlo (ShuffleSplit)
- Bootstrap
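For comparison with hold-out, the sketch below shows how K-Fold cross-validation partitions the row indices: every sample lands in the test fold exactly once. The generator `k_fold_indices` is a hypothetical helper written for illustration; it does not shuffle, whereas in practice the data is usually shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    indices = list(range(n_samples))
    # The first n_samples % k folds get one extra sample
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(10, 5))
for train, test in folds:
    print("train:", train, "test:", test)
```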

In [3]:
# Split features and target variable
X = df.drop(columns=['id', 'left'])
y = df['left']

# Column groups: categorical columns are one-hot encoded, numeric columns are scaled
categorical_cols = ['salary', 'departments']
numeric_cols = X.select_dtypes(include=[np.number]).columns.tolist()

# Create a pipeline for column preprocessing
preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), numeric_cols),
    ('cat', OneHotEncoder(drop='first'), categorical_cols)
])

# Note: fitting on the full dataset before splitting lets test-set statistics
# influence the scaler; fitting on the training split only would avoid this leakage.
pipeline = Pipeline(steps=[('preprocessor', preprocessor)])
X_processed = pipeline.fit_transform(X)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_processed, y, test_size=0.2, random_state=42)

# Export the processed data
with open('ds/hr.pkl', mode='wb') as f:
    pickle.dump([X_train, X_test, y_train, y_test], f)

* Common heuristic to estimate the number of neurons in each hidden layer: hidden layer size ≈ (number of input features + number of output classes) / 2
    * Another option: descending powers of 2 across the layers (64, 32, 16, 8, ...)
* ReLU:
    * Stands for Rectified Linear Unit, defined as f(x) = max(0, x).
    * It introduces non-linearity and helps the model learn complex patterns.
    * Commonly used in deep learning due to its simplicity and effectiveness.
* ADAM:
    * Stands for Adaptive Moment Estimation, an efficient and widely used optimizer for deep learning.
* TOL: Tolerance for the optimization. Training stops when the loss fails to improve by at least this amount for several consecutive iterations (`n_iter_no_change`, 10 by default in `MLPClassifier`).
* Number of hidden layers: one is often enough for simple, close to linearly separable problems; more complex problems may need more than one (requires testing)
    * Train the model and evaluate performance:
        * If the model doesn’t learn or performs poorly → underfitting.
        * If training performance is high but test performance is poor → overfitting.
    * Validate using techniques like:
        * Cross-validation (`cross_val_score`, `TimeSeriesSplit`, etc.)
        * Early stopping (`early_stopping=True` in `MLPClassifier`): holds out part of the training data and stops when the validation score stops improving
    * Use regularization: `MLPClassifier` exposes an L2 penalty (`alpha`); Dropout and Batch Normalization are available in deep learning frameworks for larger networks.
        * For CNN: Data Augmentation
    * Hyperparameter Tuning: Tools like `GridSearchCV`, `RandomizedSearchCV`, AutoML or Bayesian Optimization/Optuna can automate the search for the best architecture.
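The sizing heuristics and the ReLU definition above can be sketched in a few lines of plain Python. The helper names are hypothetical, and the heuristic shown is the "average of inputs and outputs" reading used in the training cell (9 inputs + 1 output gives 5 neurons); rounding up is an assumption made here for odd totals.

```python
import math

def relu(x):
    """Rectified Linear Unit: f(x) = max(0, x)."""
    return max(0.0, x)

def hidden_layer_size(n_inputs, n_outputs):
    """Rule of thumb: average of input and output sizes, rounded up."""
    return math.ceil((n_inputs + n_outputs) / 2)

def descending_powers_of_two(first, n_layers):
    """Alternative rule of thumb: halve the layer width at each layer."""
    return tuple(first // (2 ** i) for i in range(n_layers))

print(relu(-2.0), relu(3.5))
print(hidden_layer_size(9, 1))          # original feature count
print(hidden_layer_size(18, 1))         # feature count after one-hot encoding
print(descending_powers_of_two(64, 4))
```

These are starting points only; the resulting sizes still need to be validated against held-out data.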

In [4]:
# Load preprocessed data
with open('ds/hr.pkl', 'rb') as f:
    X_predict_train, X_predict_test, y_target_train, y_target_test = pickle.load(f)

# For training: 80%
print(X_predict_train.shape)  # Rows are employees, columns are features (variables)
print(y_target_train.shape)  # One label per employee: the target `left` (1 = left, 0 = stayed)

# For test: 20%
print(X_predict_test.shape)
print(y_target_test.shape)

from sklearn.neural_network import MLPClassifier

# Training the neural network (MLPClassifier - MultiLayer Perceptron Classifier)
#  - Neurons per hidden layer: (9 original features + 1 output) / 2 --> 5 neurons
#    (after one-hot encoding the input matrix has 18 columns)
#  - Adam optimizer: generally works well on relatively large datasets
#  - Activation function: ReLU (commonly used in deep learning)
rna_hr = MLPClassifier(
    max_iter=500,
    verbose=True,
    tol=0.00001,
    solver='adam',  # sgd (stochastic gradient descent), lbfgs (Limited-memory Broyden–Fletcher–Goldfarb–Shanno)
    activation='relu',  # identity, logistic, tanh
    hidden_layer_sizes=(5, 5)
)

rna_hr.fit(X_predict_train, y_target_train)

(11999, 18)
(11999,)
(3000, 18)
(3000,)
Iteration 1, loss = 0.50249323
Iteration 2, loss = 0.46472363
Iteration 3, loss = 0.43028917
Iteration 4, loss = 0.40144543
Iteration 5, loss = 0.37779462
Iteration 6, loss = 0.35853141
Iteration 7, loss = 0.34218275
Iteration 8, loss = 0.32817252
Iteration 9, loss = 0.31531146
Iteration 10, loss = 0.29366550
Iteration 11, loss = 0.25501743
Iteration 12, loss = 0.23266773
Iteration 13, loss = 0.22185264
Iteration 14, loss = 0.21589280
Iteration 15, loss = 0.21143074
Iteration 16, loss = 0.20738234
Iteration 17, loss = 0.20416972
Iteration 18, loss = 0.20156296
Iteration 19, loss = 0.19919070
Iteration 20, loss = 0.19716913
Iteration 21, loss = 0.19495213
Iteration 22, loss = 0.19324786
Iteration 23, loss = 0.19120417
Iteration 24, loss = 0.18938342
Iteration 25, loss = 0.18792838
Iteration 26, loss = 0.18600205
Iteration 27, loss = 0.18468698
Iteration 28, loss = 0.18327232
Iteration 29, loss = 0.18179039
Iteration 30, loss = 0.18019771
Iteration



## 📊 Confusion Matrix and Classification Metrics

The **confusion matrix** summarizes the performance of a classification model by comparing predicted labels to actual labels.

### Confusion Matrix Definitions

- **TN (True Negative)**: Model predicted `0` (employee stays), and the employee actually stayed.
- **FP (False Positive)**: Model predicted `1` (employee leaves), but the employee stayed.
- **FN (False Negative)**: Model predicted `0` (employee stays), but the employee left.
- **TP (True Positive)**: Model predicted `1` (employee leaves), and the employee actually left.

```
               Predicted
           |   0    |   1    |
           -------------------
Actual  0  |   TN   |   FP   |
        1  |   FN   |   TP   |
```

#### Example:

```
               Predicted
           |   0    |   1    |
           -------------------
Actual  0  |  2251  |   43   | → Stayed (2294 total)
        1  |   80   |  626   | → Left   (706 total)
```

- ✅ **TP**: 626 employees correctly predicted to leave.
- ✅ **TN**: 2251 correctly predicted to stay.
- ❌ **FP**: 43 predicted to leave but actually stayed.
- ❌ **FN**: 80 predicted to stay but actually left.

---

### 📈 Key Classification Metrics

#### 🎯 Precision

Indicates how many of the positive predictions were actually correct.

**Formula:**

```
Precision = TP / (TP + FP)
```

> High precision means few false positives. The model is reliable when predicting a departure.

Ex: “Out of all the times the system said ‘thief’, how many times was it actually correct?”

---

#### 📉 Recall

Indicates how many of the actual positives were correctly predicted.

**Formula:**

```
Recall = TP / (TP + FN)
```

> High recall means the model catches most of the employees who will actually leave.

Ex: “Out of all the actual thieves, how many did the system catch?”

---

#### ⚖️ F1-Score

The harmonic mean of precision and recall. It balances the two in one metric.

**Formula:**

```
F1 = 2 * (Precision * Recall) / (Precision + Recall)
```

> Useful when you need a balance between false positives and false negatives.
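As a worked check, the formulas above can be applied to the counts from the example confusion matrix (TN = 2251, FP = 43, FN = 80, TP = 626). This is a plain-Python sketch of the arithmetic, treating class `1` (left) as the positive class.

```python
# Counts from the example confusion matrix
tn, fp, fn, tp = 2251, 43, 80, 626

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# accuracy=0.959 precision=0.94 recall=0.89 f1=0.91
```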

---

### 📊 Averaging Methods

#### 📊 Macro Average

Averages the metric (Precision, Recall, F1) for each class **equally**, regardless of how many instances each class has.

**Formula:**

```
Macro Avg = (Metric_class_0 + Metric_class_1 + ... + Metric_class_n) / n
```

> Does **not** take class imbalance into account.

---

#### ⚖️ Weighted Average

Averages the metric for each class **weighted by the number of instances** (support) in each class.

**Formula:**

```
Weighted Avg = Σ(Metric_i × Support_i) / Total_Support
```

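> Reflects the class distribution, so larger classes have more influence. On **imbalanced datasets**, compare it with the macro average to see how the minority class fares.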
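The two averaging formulas can be sketched with the per-class values derived from the example confusion matrix. The dictionary layout and helper names below are hypothetical; for each class, that class is treated as the positive one when computing its precision and recall.

```python
# Per-class metrics derived from the example confusion matrix
per_class = {
    0: {"precision": 2251 / (2251 + 80), "recall": 2251 / (2251 + 43), "support": 2294},
    1: {"precision": 626 / (626 + 43), "recall": 626 / (626 + 80), "support": 706},
}

def macro_avg(metric):
    """Unweighted mean of the metric across classes."""
    return sum(c[metric] for c in per_class.values()) / len(per_class)

def weighted_avg(metric):
    """Mean of the metric across classes, weighted by support."""
    total = sum(c["support"] for c in per_class.values())
    return sum(c[metric] * c["support"] for c in per_class.values()) / total

macro_recall = macro_avg("recall")
weighted_recall = weighted_avg("recall")

print(round(macro_recall, 2), round(weighted_recall, 2))
```

The weighted recall equals the overall accuracy here, because each class's recall multiplied by its support is simply the number of correctly classified samples in that class.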

---

These metrics help assess how well your model performs, not just in overall accuracy, but in identifying true cases, minimizing false alarms, and working fairly across different classes.


In [5]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Predictions for model evaluation
predictions = rna_hr.predict(X_predict_test)

# Accuracy
print('Average Accuracy:', accuracy_score(y_target_test, predictions))
print('Confusion Matrix:')
print(confusion_matrix(y_target_test, predictions))
print('Classification Report:')
print(classification_report(y_target_test, predictions))

# Saving the trained model (can be loaded later as a Python object)
with open('ds/rna_hr.sav', 'wb') as f:
    pickle.dump(rna_hr, f)

Average Accuracy: 0.959
Confusion Matrix:
[[2251   43]
 [  80  626]]
Classification Report:
              precision    recall  f1-score   support

           0       0.97      0.98      0.97      2294
           1       0.94      0.89      0.91       706

    accuracy                           0.96      3000
   macro avg       0.95      0.93      0.94      3000
weighted avg       0.96      0.96      0.96      3000



In [36]:
import pickle

# Loading the data: the pickle stores [X_train, X_test, y_train, y_test]
with open('ds/hr.pkl', 'rb') as f:
    X_hr, X_hr_test, _, _ = pickle.load(f)

print(X_hr)
print(X_hr_test)

# Loading the trained neural network
with open('ds/rna_hr.sav', 'rb') as f:
    rna_hr = pickle.load(f)

# Transforming vector into matrix
inputs = X_hr[0].reshape(1, -1)  # 1 row, -1 (infer the number of columns)
print(inputs)
print(rna_hr.predict(inputs))

inputs = X_hr[1].reshape(1, -1)
print(rna_hr.predict(inputs))

inputs = X_hr[8000].reshape(1, -1)
print(rna_hr.predict(inputs))

[[-1.08951624 -0.15119673  1.77212132 ...  0.          0.
   0.        ]
 [ 0.83575687 -0.26785107 -0.65002444 ...  1.          0.
   0.        ]
 [-0.56808811 -0.44283259 -0.65002444 ...  0.          1.
   0.        ]
 ...
 [ 0.19399917  0.08211196  0.9647394  ...  0.          1.
   0.        ]
 [ 0.7154273   1.6569456   0.15735748 ...  1.          0.
   0.        ]
 [ 1.47751458  0.84036519 -1.45740636 ...  1.          0.
   0.        ]]
[[ 0.27421888 -1.20108582 -1.45740636 ...  1.          0.
   0.        ]
 [-1.65105424  1.54029125 -0.65002444 ...  1.          0.
   0.        ]
 [-0.2070994  -0.03454239  0.9647394  ...  0.          0.
   1.        ]
 ...
 [-1.49061481  0.43207499 -1.45740636 ...  0.          0.
   0.        ]
 [ 1.39729487 -0.15119673  0.15735748 ...  0.          1.
   0.        ]
 [ 1.43740472  1.42363691 -0.65002444 ...  1.          0.
   0.        ]]
[[-1.08951624 -0.15119673  1.77212132  1.63742518 -1.17671258 -0.40665095
  -0.14299166  1.          0.         