<h1 align="center">Logistic Regression Model for Absenteeism Prediction</h1>

---
This notebook develops a **logistic regression model** to predict whether an employee’s absenteeism level is **above the median**—classified as **excessive absenteeism**—based on various personal and workplace-related factors.

### Notebook Objectives

1. **Load** a dataset that has been cleaned and preprocessed in a prior step.

2. **Prepare the data** for modeling:

   * Define a binary target variable for excessive absenteeism.
   * Select relevant input features and scale them using a custom scaler.

3. **Build and fit** a logistic regression model using all available features.

4. **Interpret model coefficients** and compute **odds ratios** to identify key predictors.

5. **Evaluate model performance** through:

   * Summary statistics: AIC, BIC, pseudo R-squared, p-values
   * Accuracy scores (train/test split)
   * Confusion matrix analysis

6. **Test the model** on unseen data and serialize the trained model and scaler for future integration.

> A baseline model is first trained using all features. We then apply **backward elimination** to simplify the model by removing statistically insignificant variables, enhancing interpretability while preserving predictive performance.

---

## **1. Initial Setup and Dataset Load**
This section covers the necessary imports and loads the preprocessed dataset, which was prepared in a previous notebook (`1_preprocessing.ipynb`). The data has been cleaned and encoded appropriately for modeling.

### 1.1 Import Packages

In [4]:
# Core libraries
import numpy as np
import pandas as pd

# Scikit-learn and statsmodels utilities for model training and evaluation
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler

# For saving/loading the model
import pickle
import os

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

# Markdown and display
from IPython.display import display, Markdown

### 1.2 Load Preprocessed Data

In [6]:
data_preprocessed = pd.read_csv('../data/absenteeism_preprocessed.csv')

# Preview the data
data_preprocessed.head()

Unnamed: 0,reason_group_1,reason_group_2,reason_group_3,reason_group_4,day_of_week,month,transportation_expense_dollars,distance_to_work_miles,age,daily_work_load_average,body_mass_index,education,children,pets,absenteeism_time_hours
0,0,0,0,1,1,7,289,36,33,239.554,30,0,2,1,4
1,0,0,0,0,1,7,118,13,50,239.554,31,0,1,0,0
2,0,0,0,1,2,7,179,51,38,239.554,31,0,0,0,2
3,1,0,0,0,3,7,279,5,39,239.554,24,0,2,0,4
4,0,0,0,1,3,7,289,36,33,239.554,30,0,2,1,2


---

## **2. Modeling Preparation**
In this section, we prepare the data for modeling by defining the target, selecting features, standardizing inputs, and splitting the data into training and testing sets.

### 2.1 Define Target Variable

Before training our logistic regression model, we need to define a target variable. In our case, we aim to predict whether an employee’s absenteeism is **excessive** based on several features. 

To convert this into a **binary classification problem**, we define a new target variable called `excessive_absenteeism`, using the following logic:

* Take the median of the `absenteeism_time_hours` column.
* If an employee’s absenteeism exceeds the median, classify them as `1` (excessively absent).
* Otherwise, classify them as `0` (moderately or rarely absent).

Splitting at the median yields a reasonably balanced dataset. A roughly **50/50 class distribution** makes it less likely that the model simply learns to always predict the majority class.

In [10]:
def create_target_variable(df, target_col='absenteeism_time_hours'):
    
    """Creates a binary target variable based on the median of absenteeism time hours.
    Adds a new column 'excessive_absenteeism' to the DataFrame and drops the original column."""
    
    # Calculate median
    cutoff = df[target_col].median()
    
    # Create binary target variable
    targets = np.where(df[target_col] > cutoff, 1, 0)
    df['excessive_absenteeism'] = targets
    
    # Check class distribution
    proportion_class_1 = targets.sum() / targets.shape[0]
    proportion_class_0 = 1 - proportion_class_1
    display(Markdown(f"<div>Class Distribution using cutoff = {cutoff} hours: <strong>{proportion_class_0:.0%}-{proportion_class_1:.0%}</strong></div>"))
    
    # Drop the original column
    df = df.drop([target_col], axis=1)
    
    return df, targets

In [11]:
data_with_targets, targets = create_target_variable(data_preprocessed)
data_with_targets.head()

<div>Class Distribution using cutoff = 3.0 hours: <strong>54%-46%</strong></div>

Unnamed: 0,reason_group_1,reason_group_2,reason_group_3,reason_group_4,day_of_week,month,transportation_expense_dollars,distance_to_work_miles,age,daily_work_load_average,body_mass_index,education,children,pets,excessive_absenteeism
0,0,0,0,1,1,7,289,36,33,239.554,30,0,2,1,1
1,0,0,0,0,1,7,118,13,50,239.554,31,0,1,0,0
2,0,0,0,1,2,7,179,51,38,239.554,31,0,0,0,0
3,1,0,0,0,3,7,279,5,39,239.554,24,0,2,0,1
4,0,0,0,1,3,7,289,36,33,239.554,30,0,2,1,0
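
The split comes out at 54%-46% rather than exactly 50/50 because the rule uses a strict `>`: every row whose value equals the median falls into class `0`. The sketch below illustrates that effect with made-up hours (an assumption for illustration, not the notebook's data) and compares it with an inclusive `>=` rule.

```python
from statistics import median

# Hypothetical absenteeism hours (illustrative only, not the real dataset)
hours = [0, 1, 2, 3, 3, 3, 4, 8, 8, 16]
cutoff = median(hours)  # the two middle sorted values are both 3

# Strict ">" sends ties at the median to class 0
strict = [1 if h > cutoff else 0 for h in hours]
# ">=" would send the same ties to class 1 instead
inclusive = [1 if h >= cutoff else 0 for h in hours]

share_strict = sum(strict) / len(hours)
share_inclusive = sum(inclusive) / len(hours)
print(f"cutoff={cutoff}, class 1 share: strict={share_strict:.0%}, inclusive={share_inclusive:.0%}")
```

With many ties at the median, the choice between `>` and `>=` can shift the class balance noticeably, so it is worth checking the printed distribution whenever the cutoff changes.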


### 2.2 Select Input Features

Before training our logistic regression model, we need to identify the features (**independent variables**) that will be used to predict absenteeism. 

Since our target variable is `excessive_absenteeism`, we will select all other columns in the DataFrame as input features for the model. This approach includes all predictors initially, while allowing us to remove less informative ones later based on model performance and coefficients. 

> We hypothesize that:
>
> * **Reasons for absence** will likely be the strongest predictor of excessive absenteeism.
> * **Workload**, **children**, and **pets** may also play a role, as personal responsibilities and work pressure can influence absentee behavior.
> * Not all features will have significant predictive power. Logistic regression has the ability to reveal which variables contribute meaningfully to the prediction.

In [13]:
def select_input_features(df, columns_to_drop=['excessive_absenteeism']):
    
    """Selects input features for modeling by excluding the target column. 
    Optionally: drop unnecessary columns through backward elimination."""
    return df.drop(columns_to_drop, axis=1)

In [14]:
unscaled_inputs = select_input_features(data_with_targets)
unscaled_inputs.head()

Unnamed: 0,reason_group_1,reason_group_2,reason_group_3,reason_group_4,day_of_week,month,transportation_expense_dollars,distance_to_work_miles,age,daily_work_load_average,body_mass_index,education,children,pets
0,0,0,0,1,1,7,289,36,33,239.554,30,0,2,1
1,0,0,0,0,1,7,118,13,50,239.554,31,0,1,0
2,0,0,0,1,2,7,179,51,38,239.554,31,0,0,0
3,1,0,0,0,3,7,279,5,39,239.554,24,0,2,0
4,0,0,0,1,3,7,289,36,33,239.554,30,0,2,1


### 2.3 Standardize the Inputs
Standardizing transforms features to have a `mean = 0` and a `standard deviation = 1`, ensuring that all variables contribute equally to the model's learning process and making them more directly comparable.

>Important Note:
> * We won't standardize the **dummy variables** since they already contain meaningful binary information (`0` = absence of a category, `1` = presence).
> * Standardizing them would distort this interpretability, making the model harder to understand.
> * To avoid this, we'll use a **custom scaler** that selectively standardizes numeric features, leaving binary dummies untouched.
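
As a small illustration of what standardization does, the sketch below computes z-scores for the `age` values of the five preview rows and leaves the matching `reason_group_4` flags alone. It uses only the standard library, and the resulting numbers differ from the notebook's scaled output because the mean and standard deviation here come from five rows rather than the full dataset.

```python
from statistics import fmean, pstdev

# Ages from the five preview rows shown earlier in this notebook
ages = [33, 50, 38, 39, 33]
# Dummy column from the same rows; deliberately left unscaled
reason_group_4 = [1, 0, 1, 0, 1]

mean = fmean(ages)
std = pstdev(ages)  # population std (ddof=0), the convention StandardScaler uses

# z-score: distance from the mean, measured in standard deviations
z_scores = [(a - mean) / std for a in ages]

print([round(z, 3) for z in z_scores])
print(reason_group_4)  # still plain 0/1 flags
```

After this transformation the scaled column has mean 0 and standard deviation 1, while the dummy column keeps its direct "category present / absent" reading.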

In [16]:
class CustomScaler(BaseEstimator, TransformerMixin):
    def __init__(self, columns, copy=True, with_mean=True, with_std=True):
        # Initialize an internal StandardScaler instance
        self.scaler = StandardScaler(copy=copy, with_mean=with_mean, with_std=with_std)
        
        self.columns = columns  # columns to apply scaling to
        self.mean_ = None       # placeholder for column means
        self.var_ = None        # placeholder for column variances
        self.copy = copy
        self.with_mean = with_mean
        self.with_std = with_std

    def fit(self, X, y=None):
        # Fit the scaler only on selected columns
        self.scaler.fit(X[self.columns], y)
        
        # Store per-column mean and (population) variance for reference or reuse
        self.mean_ = X[self.columns].mean()
        self.var_ = X[self.columns].var(ddof=0)
        return self

    def transform(self, X, y=None, copy=None):
        init_col_order = X.columns

        # Scale selected columns; keep X's index so the concat below aligns rows
        X_scaled = pd.DataFrame(self.scaler.transform(X[self.columns]), columns=self.columns, index=X.index)

        # Leave unselected columns unchanged
        X_not_scaled = X.loc[:, ~X.columns.isin(self.columns)]

        # Combine scaled and unscaled columns, preserving original column order
        return pd.concat([X_not_scaled, X_scaled], axis=1)[init_col_order]

In [17]:
# Preview available columns
print(unscaled_inputs.columns.values)

['reason_group_1' 'reason_group_2' 'reason_group_3' 'reason_group_4'
 'day_of_week' 'month' 'transportation_expense_dollars'
 'distance_to_work_miles' 'age' 'daily_work_load_average'
 'body_mass_index' 'education' 'children' 'pets']


In [18]:
# Avoid scaling dummies and non-numeric categorical features
columns_to_omit = ['reason_group_1', 'reason_group_2', 'reason_group_3', 'reason_group_4', 'education']
columns_to_scale = [col for col in unscaled_inputs.columns.values if col not in columns_to_omit]

# Fit the custom scaler and transform inputs
absenteeism_scaler = CustomScaler(columns_to_scale)
absenteeism_scaler.fit(unscaled_inputs)
scaled_inputs = absenteeism_scaler.transform(unscaled_inputs)

# Preview scaled inputs
scaled_inputs.head()

Unnamed: 0,reason_group_1,reason_group_2,reason_group_3,reason_group_4,day_of_week,month,transportation_expense_dollars,distance_to_work_miles,age,daily_work_load_average,body_mass_index,education,children,pets
0,0,0,0,1,-0.683704,0.182726,1.005844,0.412816,-0.536062,-0.806331,0.767431,0,0.880469,0.268487
1,0,0,0,0,-0.683704,0.182726,-1.574681,-1.141882,2.130803,-0.806331,1.002633,0,-0.01928,-0.58969
2,0,0,0,1,-0.007725,0.182726,-0.654143,1.426749,0.24831,-0.806331,1.002633,0,-0.91903,-0.58969
3,1,0,0,0,0.668253,0.182726,0.854936,-1.682647,0.405184,-0.806331,-0.643782,0,0.880469,-0.58969
4,0,0,0,1,0.668253,0.182726,1.005844,0.412816,-0.536062,-0.806331,0.767431,0,0.880469,0.268487


### 2.4 Split Data into Train and Test Sets

* **Training data (80%)**: Used by the model to learn patterns.
* **Test data (20%)**: Used to evaluate the model’s performance on unseen data.

> Note on Shuffling:
>
> * We shuffle the data to remove any order or grouping bias, ensuring randomness in how the training and testing sets are created.
> * We also set a `random_state` (seed) of 20 to ensure the split is reproducible.

In [20]:
# Split the data into training and testing sets (80/20 split)
x_train, x_test, y_train, y_test = train_test_split(
    scaled_inputs,
    targets,
    train_size=0.8,
    random_state=20
)

# Show the shape of the resulting datasets
display(Markdown(
    f"""
    Training Set: ({x_train.shape[0]} samples, {x_train.shape[1]} features) Targets: {y_train.shape[0]}
    Testing Set: ({x_test.shape[0]} samples, {x_test.shape[1]} features) Targets: {y_test.shape[0]}
    """
))


    Training Set: (560 samples, 14 features) Targets: 560
    Testing Set: (140 samples, 14 features) Targets: 140
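
In this notebook the scaler is fitted on the full dataset before the split. A common variation, not used here, is to estimate the scaling parameters from the training rows only and then apply them to both splits, so that no information from the test rows enters preprocessing. The sketch below shows the idea on a single made-up feature using only the standard library; the values and the shuffling method are assumptions and will not reproduce the `train_test_split` result.

```python
import random
from statistics import fmean, pstdev

# Hypothetical single feature (illustrative values, not the real data)
values = [5, 13, 36, 51, 36, 11, 27, 42, 16, 29]

# Shuffle row indices reproducibly, then take 80% for training
indices = list(range(len(values)))
random.Random(20).shuffle(indices)
n_train = int(0.8 * len(values))
train_idx, test_idx = indices[:n_train], indices[n_train:]

# Estimate scaling parameters from the training rows only
train_values = [values[i] for i in train_idx]
mean, std = fmean(train_values), pstdev(train_values)

# Apply the same parameters to both splits
train_scaled = [(values[i] - mean) / std for i in train_idx]
test_scaled = [(values[i] - mean) / std for i in test_idx]

print(len(train_scaled), len(test_scaled))
```

With this ordering the training split has mean 0 and standard deviation 1 exactly, while the test split is only approximately standardized, which mirrors how new data would be handled at prediction time.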
    

---

## **3. Logistic Regression Modeling**
In this section, we train a logistic regression model using the training data, evaluate its initial performance, and examine statistical summaries to identify potential simplifications.

### 3.1 Train the Model Using Scikit-Learn
We use `LogisticRegression` from scikit-learn to train the model and assess its performance on the training set.

In [24]:
def fit_model(x, y):
    """Fit logistic regression and calculate the accuracy rate.""" 
    reg = LogisticRegression()
    reg.fit(x, y)
    accuracy = reg.score(x, y)
    return reg, accuracy

reg, training_accuracy_1 = fit_model(x_train, y_train)

In [25]:
display(Markdown(f"**Training Accuracy:** {training_accuracy_1:.2%}"))

**Training Accuracy:** 77.50%

> This means the model **correctly classified 77.5%** of the training observations. This is an acceptable baseline for model performance.

### 3.2 Examine Model Summary with Statsmodels
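To understand the **statistical significance** of each feature, we fit a comparable model using `statsmodels`’ `Logit`, which provides detailed coefficient output and p-values. Two differences from the scikit-learn fit are worth noting: `Logit` applies no regularization (scikit-learn's `LogisticRegression` uses an L2 penalty by default), and `reason_group_4` is dropped here, so the coefficients will not match exactly.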

In [28]:
def stats_summary_table(x_train, y_train, columns_to_drop=['reason_group_4']):
    """Generate statsmodels logistic regression summary for feature evaluation."""
    x_train_sm = sm.add_constant(x_train.copy())
    x_train_sm = x_train_sm.drop(columns=columns_to_drop)  # Drop one dummy to prevent multicollinearity
    logit_model = sm.Logit(y_train, x_train_sm)
    result = logit_model.fit()
    return result

model_1 = stats_summary_table(x_train, y_train)
model_1.summary()

Optimization terminated successfully.
         Current function value: 0.518646
         Iterations 6


0,1,2,3
Dep. Variable:,y,No. Observations:,560.0
Model:,Logit,Df Residuals:,546.0
Method:,MLE,Df Model:,13.0
Date:,"Mon, 26 May 2025",Pseudo R-squ.:,0.2467
Time:,15:41:29,Log-Likelihood:,-290.44
converged:,True,LL-Null:,-385.55
Covariance Type:,nonrobust,LLR p-value:,1.379e-33

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
const,-0.9671,0.136,-7.112,0.000,-1.234,-0.701
reason_group_1,2.2890,0.258,8.880,0.000,1.784,2.794
reason_group_2,1.1549,0.929,1.243,0.214,-0.666,2.976
reason_group_3,2.9919,0.461,6.489,0.000,2.088,3.896
day_of_week,-0.0552,0.108,-0.510,0.610,-0.267,0.157
month,0.1501,0.107,1.401,0.161,-0.060,0.360
transportation_expense_dollars,0.6283,0.133,4.741,0.000,0.369,0.888
distance_to_work_miles,0.0093,0.116,0.080,0.936,-0.219,0.237
age,-0.2197,0.134,-1.636,0.102,-0.483,0.044


#### Interpretation of the Model Summary:

* The **Pseudo R²** (McFadden's) is approximately 0.2467, meaning the model improves the log-likelihood by **~24.7%** relative to the intercept-only (null) model. Unlike R² in linear regression, it is not a share of variance explained.
* As a rule of thumb, McFadden Pseudo R² values between **0.2 and 0.4** are taken to indicate a **good model fit**.
* Several features are not statistically significant at the **0.05 level** (p-values > 0.05).
* Removing such features may improve model simplicity and generalizability without major performance loss.
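
The headline statistics can be reproduced by hand from the numbers in the summary. The sketch below takes the reported log-likelihoods and observation count and recomputes McFadden's pseudo R², AIC, and BIC; the parameter count of 14 is read off as `Df Model` (13) plus the constant.

```python
import math

# Values reported in the statsmodels summary above
log_likelihood = -290.44   # Log-Likelihood of the fitted model
ll_null = -385.55          # LL-Null (intercept-only model)
n_obs = 560                # No. Observations
n_params = 13 + 1          # Df Model plus the constant

# McFadden's pseudo R-squared: relative improvement over the null model
pseudo_r2 = 1 - log_likelihood / ll_null

# Information criteria: lower is better; BIC penalizes parameters more heavily
aic = 2 * n_params - 2 * log_likelihood
bic = n_params * math.log(n_obs) - 2 * log_likelihood

print(f"pseudo R2={pseudo_r2:.4f}, AIC={aic:.2f}, BIC={bic:.2f}")
```

Because BIC multiplies the parameter count by `log(n)` rather than 2, dropping a weak feature lowers BIC more than AIC, which is why BIC tends to favor leaner models.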

---

### 3.3 Interpret Coefficients and Odds Ratios
To better understand the logistic regression model, we examine the coefficients and what they tell us about each feature’s impact on the odds of excessive absenteeism.

In logistic regression, the coefficients represent **log-odds**, which are not easy to interpret directly. By exponentiating these coefficients, we convert them to **odds ratios**, making interpretation more intuitive:

> ##### Key Interpretation Guidelines:
>
> * **Odds Ratio ≈ 1 or Coefficient ≈ 0** → Little to no effect on outcome
> * **Odds Ratio > 1** → Increases the odds of excessive absenteeism
> * **Odds Ratio < 1** → Decreases the odds of excessive absenteeism
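
To make the conversion concrete, the sketch below exponentiates the `transportation_expense_dollars` coefficient reported in this section's output and then applies the resulting odds ratio to a baseline probability. The baseline of 30% is a hypothetical value chosen only to show that an odds ratio scales odds, not probabilities.

```python
import math

# Coefficient for transportation_expense_dollars from this section's output
coefficient = 0.613216

odds_ratio = math.exp(coefficient)
pct_change_in_odds = (odds_ratio - 1) * 100

# Hypothetical baseline probability, chosen only for illustration
baseline_prob = 0.30
baseline_odds = baseline_prob / (1 - baseline_prob)

# A one standard deviation increase multiplies the odds by the odds ratio
new_odds = baseline_odds * odds_ratio
new_prob = new_odds / (1 + new_odds)

print(f"odds ratio={odds_ratio:.3f} ({pct_change_in_odds:+.1f}% odds)")
print(f"probability: {baseline_prob:.2f} -> {new_prob:.3f}")
```

In this example the odds rise by about 85%, but the probability rises from 0.30 to roughly 0.44, which is less than 1.85 times the baseline. This is why "higher odds" is the safer wording than "more likely".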

In [31]:
def summarize_logistic_regression(model):
    """
    Create a summary table showing the intercept and coefficients
    of a trained logistic regression model.
    """
    # Extract feature names
    feature_names = unscaled_inputs.columns.values
    summary_table = pd.DataFrame({'Feature Name': feature_names})
    
    # Add coefficients (coef_ has shape (1, n_features) for a binary model)
    summary_table['Coefficient'] = model.coef_[0]
    # Shift feature rows down by one so the intercept can take index 0
    summary_table.index += 1
    summary_table.loc[0] = ['Intercept', model.intercept_[0]]

    # Calculate odds ratios
    summary_table['Odds Ratio'] = np.exp(summary_table['Coefficient'])

    # Sort by odds ratio
    display(summary_table.sort_values('Odds Ratio', ascending=False))

summarize_logistic_regression(reg)

Unnamed: 0,Feature Name,Coefficient,Odds Ratio
3,reason_group_3,3.096739,22.125672
1,reason_group_1,2.801363,16.467081
2,reason_group_2,0.933541,2.543499
4,reason_group_4,0.857183,2.356513
7,transportation_expense_dollars,0.613216,1.846359
13,children,0.361898,1.436052
11,body_mass_index,0.271155,1.311478
6,month,0.166403,1.181049
10,daily_work_load_average,-7.7e-05,0.999923
8,distance_to_work_miles,-0.007779,0.992251


---
<h3 align='center'>Most Influential Features</h3>

| Feature                              | Odds Ratio | Interpretation                                                                                          |
| :----------------------------------- | ---------: | :------------------------------------------------------------------------------------------------------ |
| **reason\_group\_3**                 |      22.13 | **Strongest predictor**. Citing this reason multiplies the odds of excessive absence by about **22**.  |
| **reason\_group\_1**                 |      16.47 | Major predictor. About **16x higher odds** of excessive absence when this reason applies.              |
| **reason\_group\_2**                 |       2.54 | Moderate effect. About **2.5x higher odds** of excessive absence.                                       |
| **reason\_group\_4**                 |       2.36 | More than **2x** the baseline odds.                                                                     |
| **transportation\_expense\_dollars** |       1.85 | Each standard deviation increase in expense raises the odds by **\~85%**.                               |
| **children**                         |       1.44 | Each standard deviation increase in number of children raises the odds by **\~44%**.                    |
| **body\_mass\_index**                |       1.31 | Each standard deviation increase in BMI raises the odds by **\~31%**.                                   |
| **pets**                             |       0.75 | **Significant**. Each standard deviation increase in number of pets **reduces the odds by \~25%**.     |
| **age**                              |       0.85 | Each standard deviation increase in age lowers the odds by **\~15%**.                                   |

#### Summary:
* Strongest predictors of absenteeism include specific absence reasons (`reason_group_3`, `reason_group_1`), transportation expense, and children.
* Pets and age appear to decrease the odds of absenteeism.
---

### 3.4 Simplify Model with Backward Elimination

Although some of these features may hold predictive value in different contexts or datasets, we aim for a leaner model in this iteration. To improve interpretability and generalizability, we simplify the model by removing features with:

* **High p-values** (i.e., low statistical significance)
* **Odds Ratios close to 1** (i.e., minimal impact on the outcome)

<h4 align="center">Feature Consideration for Elimination</h4>

| Feature                   | P-Value | Odds Ratio | Rationale                                               |
| ------------------------- | ------- | ---------- | ------------------------------------------------------- |
| `distance_to_work_miles`  | 0.9362  | 0.992251   | **Highest p-value and weak influence**                  |
| `daily_work_load_average` | 0.6953  | 0.999923   | Statistically insignificant and weak influence.         |
| `day_of_week`             | 0.6102  | 0.919141   | Statistically insignificant and weak influence.         |
| `education`               | 0.2611  | 0.813811   | Slight effect, but statistically insignificant.         |
| `month`                   | 0.1613  | 1.181049   | May retain for potential seasonal effects.              |
| `reason_group_2`          | 0.2139  | 2.543499   | Moderate influence. Retained for categorical balance.   |
| `age`                     | 0.1018  | 0.847431   | Moderate influence. Retained temporarily.               |
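
The elimination procedure can be sketched as a loop that repeatedly removes the least significant feature. In the sketch below, `backward_eliminate`, `fit_pvalues`, `keep`, and `stub_fit` are hypothetical names introduced for illustration. In practice `fit_pvalues` would refit the model on the selected features and return fresh p-values; the stub here simply looks up the p-values from the table above and does not refit.

```python
def backward_eliminate(features, fit_pvalues, alpha=0.05, keep=()):
    """Drop the feature with the highest p-value above alpha, one at a time.

    fit_pvalues: callable taking a list of features and returning
    {feature: p_value}. Features listed in `keep` are never dropped.
    """
    selected = list(features)
    dropped = []
    while True:
        pvalues = fit_pvalues(selected)
        candidates = {f: p for f, p in pvalues.items()
                      if p > alpha and f not in keep}
        if not candidates:
            break
        worst = max(candidates, key=candidates.get)
        selected.remove(worst)
        dropped.append(worst)
    return selected, dropped


# P-values copied from the table above
TABLE_P_VALUES = {
    'distance_to_work_miles': 0.9362,
    'daily_work_load_average': 0.6953,
    'day_of_week': 0.6102,
    'education': 0.2611,
    'reason_group_2': 0.2139,
    'month': 0.1613,
    'age': 0.1018,
}


def stub_fit(selected):
    # Stand-in for a real refit: returns the static table values
    return {f: TABLE_P_VALUES[f] for f in selected}


kept, dropped = backward_eliminate(
    list(TABLE_P_VALUES), stub_fit,
    keep={'month', 'reason_group_2', 'age'},  # retained on judgment
)
print("dropped:", dropped)
print("kept:", kept)
```

The `keep` argument reflects the judgment calls in the table: `month`, `reason_group_2`, and `age` stay in the model even though their p-values exceed 0.05.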

#### **Model Refinement (Feature Dropping):**

In [34]:
# Input selection
columns_to_drop = ['excessive_absenteeism', 'day_of_week', 'daily_work_load_average', 'distance_to_work_miles', 'education']
unscaled_inputs = select_input_features(data_with_targets, columns_to_drop)

# Scale the input
columns_to_omit = ['reason_group_1', 'reason_group_2', 'reason_group_3', 'reason_group_4', 'education']
columns_to_scale = [col for col in unscaled_inputs.columns.values if col not in columns_to_omit]
absenteeism_scaler = CustomScaler(columns_to_scale)
absenteeism_scaler.fit(unscaled_inputs)
scaled_inputs = absenteeism_scaler.transform(unscaled_inputs)

# Split into train and test dataset
x_train, x_test, y_train, y_test = train_test_split(scaled_inputs, targets, train_size=0.8, random_state=20)

# Fit the model and generate metrics
reg, training_accuracy_2 = fit_model(x_train, y_train)
model_2 = stats_summary_table(x_train, y_train)

Optimization terminated successfully.
         Current function value: 0.520303
         Iterations 6



<h4 align="center">Summary of Backward Elimination</h4>

| Iteration | Accuracy Δ | Pseudo R² Δ | AIC Δ ↓ |  BIC Δ ↓ | Notes                                                  |
| --------: | ---------: | ----------: | ------: | -------: | ------------------------------------------------------ |
|         1 | -0.0018 |  -0.0006 | -5.56 | -18.54 | Minor simplification; negligible impact                          |
|         2 |  +0.0036 |  -0.0024 | -6.14 | -23.45 | **Best trade-off** — simpler, slightly better accuracy          |
|         3 |   0.0000 |  -0.0054 | -5.81 | -27.45 | No gain in accuracy; more drop in R²                            |
|         4 | -0.0036 |  -0.0040 | -6.92 | -28.56 | Performance degrades                                             |

---

#### **Best Iteration — Iteration 2:**
In this version, we dropped: `distance_to_work_miles`, `daily_work_load_average`, `day_of_week`, and `education`.

In [37]:
print(f"{'Metric':<15}{'Model 1':<15}{'Model 2'}")
print("-------------------------------------")
print(f"{'Accuracy':<15}{training_accuracy_1:<15.4f}{training_accuracy_2:.4f}")
print(f"{'Pseudo R²':<15}{model_1.prsquared:<15.4f}{model_2.prsquared:.4f}")
print(f"{'AIC':<15}{model_1.aic:<15.2f}{model_2.aic:.2f}")
print(f"{'BIC':<15}{model_1.bic:<15.2f}{model_2.bic:.2f}")

Metric         Model 1        Model 2
-------------------------------------
Accuracy       0.7750         0.7786
Pseudo R²      0.2467         0.2443
AIC            608.88         602.74
BIC            669.47         646.02


#### Results:

* **Accuracy** improves to 77.86%, the best among all iterations.
* **AIC** and **BIC** both decrease (by 6.14 and 23.45 respectively), indicating a **simpler** and likely more **generalizable** model.
* **Pseudo R²** drops only slightly (**0.2443 vs. 0.2467**), suggesting minimal loss in explanatory power.

> *This trade-off favors simplicity without sacrificing performance.*

---

## **4. Testing the Model**
In this final stage, we evaluate our model's performance on **unseen data** that the model has never encountered during training. This allows us to measure the model’s ability to generalize to real-world scenarios rather than just memorizing the training dataset.

### 4.1 Model Accuracy (Test Dataset)

In [41]:
test_accuracy = reg.score(x_test, y_test)
display(Markdown(f"**Test Accuracy:** {test_accuracy:.2%}"))

**Test Accuracy:** 75.00%

> Based on the test dataset, the model correctly predicts **~75%** of the time.

- The **training accuracy** was approximately **77.86%**, so the gap is under 3 percentage points. A small **drop is expected** on unseen data and is consistent with good generalization.
- As a rough rule of thumb, a drop of less than 10 to 20 percentage points suggests that the model is **not overfitting** or specializing to the training data.

### 4.2 Confusion Matrix
Let’s now evaluate the model's classification performance using a confusion matrix, which shows how many predictions were correct vs incorrect.

In [44]:
# Actual class labels
actual_values = y_test

# Predicted class labels
pred_values = reg.predict(x_test)

# Binning for confusion matrix (0: class 0, 1: class 1)
bins = np.array([0, 0.5, 1])

# Create confusion matrix using 2D histogram
cm = np.histogram2d(actual_values, pred_values, bins=bins)[0]

# Convert to DataFrame for better presentation
confusion_matrix = pd.DataFrame(cm, index=['Actual 0', 'Actual 1'], columns=['Predicted 0', 'Predicted 1'])
confusion_matrix

Unnamed: 0,Predicted 0,Predicted 1
Actual 0,59.0,15.0
Actual 1,20.0,46.0


In [45]:
# Misclassification rate
misclassification_rate = (cm[0][1] + cm[1][0]) / cm.sum()
display(Markdown(f"**Misclassification Rate:** {misclassification_rate:.2%}"))

**Misclassification Rate:** 25.00%

This means that about 1 in 4 predictions are incorrect. It is still reasonable given the context and the limited number of highly influential predictors. Further tuning, feature engineering, or additional data could reduce this in the future.
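
The same four counts also yield per-class metrics, which show where the errors fall. The sketch below recomputes accuracy and derives precision, recall, specificity, and F1 from the confusion matrix above, treating class `1` (excessive absenteeism) as the positive class.

```python
# Counts from the confusion matrix above
tn, fp = 59, 15   # actual 0: predicted 0, predicted 1
fn, tp = 20, 46   # actual 1: predicted 0, predicted 1

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)      # of predicted "excessive", share that were right
recall = tp / (tp + fn)         # of actual "excessive", share that were found
specificity = tn / (tn + fp)    # of actual "not excessive", share that were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2%}, precision={precision:.2%}, "
      f"recall={recall:.2%}, specificity={specificity:.2%}, f1={f1:.2%}")
```

Recall (about 70%) is lower than specificity (about 80%), so the model misses excessive absenteeism more often than it raises a false alarm.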

---

## **5. Saving the Model and Scaler**
Once a model has been trained, saving it is a critical step in a machine learning pipeline. Instead of re-training the model every time we need to make predictions, we can **serialize** the trained model into a compact file that can be loaded and used instantly.

### 5.1 What is Being Saved

The object `reg` (our trained logistic regression model) holds all necessary model attributes:

* Coefficients and intercept
* Random state
* Everything needed to make future predictions

Likewise, the object `absenteeism_scaler` stores:

* The features that were scaled
* The mean and standard deviation of each feature
* The standardization logic required to process new input data identically to the training data

To ensure predictions on future data are consistent, both the model and the scaler must be saved.

### 5.2 Pickling

We use Python’s built-in `pickle` module to serialize (`pickle`) and later restore (`unpickle`) these objects.

In [79]:
# Create folder if it doesn't exist
os.makedirs('../integration/model_artifacts', exist_ok=True)

# Save the trained model
with open('../integration/model_artifacts/model', 'wb') as file:
    pickle.dump(reg, file)

# Save the scaler
with open('../integration/model_artifacts/scaler', 'wb') as file:
    pickle.dump(absenteeism_scaler, file)

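These files can now be reloaded at any time with `pickle.load()` and used for predictions without re-training the model. Note that unpickling the scaler requires the `CustomScaler` class definition to be importable in the loading environment, because `pickle` stores a reference to the class rather than its code.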
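
A minimal round-trip sketch of the save and reload pattern follows. It uses plain dictionaries with made-up values as stand-ins for `reg` and `absenteeism_scaler`, and a temporary directory instead of `../integration/model_artifacts`, so it runs without the notebook's state.

```python
import os
import pickle
import tempfile

# Stand-in objects with hypothetical values; the notebook pickles
# `reg` and `absenteeism_scaler` instead
model_stub = {'coef': [0.61, 0.36], 'intercept': -1.5}
scaler_stub = {'columns': ['age', 'month'], 'mean': [36.4, 6.3]}

with tempfile.TemporaryDirectory() as artifact_dir:
    # Save each object to its own file, as in the cell above
    for name, obj in [('model', model_stub), ('scaler', scaler_stub)]:
        with open(os.path.join(artifact_dir, name), 'wb') as file:
            pickle.dump(obj, file)

    # Later (for example in another script): restore both objects
    with open(os.path.join(artifact_dir, 'model'), 'rb') as file:
        restored_model = pickle.load(file)
    with open(os.path.join(artifact_dir, 'scaler'), 'rb') as file:
        restored_scaler = pickle.load(file)

print(restored_model == model_stub, restored_scaler == scaler_stub)
```

Only load pickle files from sources you trust, since unpickling can execute arbitrary code.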

---

## **6. Summary**

We successfully built and validated a logistic regression model to predict excessive absenteeism. Below is a structured overview of the modeling pipeline:

---

### **6.1 Steps Completed**

#### **6.1.1 Model Preparation**

* Loaded the preprocessed absenteeism dataset.
* Created a binary target variable, `excessive_absenteeism`, with a roughly balanced class distribution (**54%-46%**).
* Selected relevant input features for modeling.
* Applied a **custom scaler** (built on `StandardScaler`) to scale only numerical variables.
* Split the data into **training and test sets (80/20 split)**.

#### **6.1.2 Model Training**

* Trained an initial logistic regression model using the training data.
* Interpreted model outputs using both `statsmodels` and `scikit-learn`.
* Identified key predictors based on **odds ratios**, including:

  * `reason_group_3` (\~22x higher odds)
  * `reason_group_1` (\~16x higher odds)
  * Other influential features: `transportation_expense_dollars` (\~85% higher odds), `children` (\~44% higher odds), etc.

#### **6.1.3 Feature Selection: Backward Elimination**

* Evaluated feature significance using **p-values** and **odds ratios**.
* Removed statistically insignificant features iteratively.
* Simplified the model for better interpretability without sacrificing accuracy.

#### **6.1.4 Model Evaluation**

* **Training Accuracy:** \~77.86%
* **Test Accuracy:** \~75.00%
* The small performance gap (under 3 percentage points) suggests **little overfitting** and reasonable generalization.
* Evaluated performance using a **confusion matrix**.

#### **6.1.5 Model & Scaler Serialization**

* Serialized the trained model (`reg`) using `pickle`.
* Saved the **custom scaler** (`absenteeism_scaler`) to ensure consistent preprocessing for future predictions.

---

### **6.2 Next Step: Integration**

In the next phase, we will apply this model to new data by:

* **Loading the saved model and scaler**.
* **Generating predictions** on incoming employee absenteeism records.
* **Preparing output** for downstream analysis in BI tools like **Tableau**.

---

