# Baseline model
## Introduction
This notebook contains the baseline model that we need to outperform. The baseline is a seasonal weighted-average time series forecast for the next 4 quarters.

In [None]:
from pathlib import Path

import polars as pl
import polars.selectors as cs
from sqlalchemy import create_engine  # required by create_engine() below

from config import DIR_DB_SILVER

# --- Database Connection ---
# Check if database exists; handle both root and 'code' directory execution
if not DIR_DB_SILVER.exists():
    if (Path("..") / DIR_DB_SILVER).exists():
        DIR_DB_SILVER = Path("..") / DIR_DB_SILVER
    else:
        raise FileNotFoundError(f"❌ Database not found at {DIR_DB_SILVER}. Ensure you are in the project root.")

# Create a simple SQLAlchemy engine for the Silver database
engine = create_engine(f"sqlite:///{DIR_DB_SILVER}")

# SQL Query to extract and format absenteeism data
query = """
SELECT 
    Perioden as Timeperiod_text,
    printf('%s-%s-01', 
        substr(Perioden, 1, 4), 
        CASE substr(Perioden, 7, 2)
            WHEN '01' THEN '01'
            WHEN '02' THEN '04'
            WHEN '03' THEN '07'
            WHEN '04' THEN '10'
        END
    ) AS Period_startdate, 
    DATE(
        printf('%s-%s-01', 
            substr(Perioden, 1, 4), 
            CASE substr(Perioden, 7, 2)
                WHEN '01' THEN '01'
                WHEN '02' THEN '04'
                WHEN '03' THEN '07'
                WHEN '04' THEN '10'
            END
        ), 
        '+3 months', 
        '-1 day'
    ) AS Period_enddate,
    CAST(Ziekteverzuimpercentage_1 AS REAL) as Absenteeism_perc,
    BedrijfskenmerkenSBI2008_CategoryGroupID as SBI_code
FROM "80072ned_silver"
WHERE Perioden NOT LIKE '%JJ%' 
AND Period_startdate >= '2016-01-01'
"""

# Load into Polars DataFrame
with engine.connect() as conn:
    df_org = pl.read_database(query=query, connection=conn)

print(f"✅ Success! Loaded {len(df_org)} rows.")
df_org.head()
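
The date logic inside the SQL query can be easier to check outside the database. The following is a minimal plain-Python sketch of the same mapping, assuming period codes follow the `YYYYKWQQ` pattern seen in the sample output (for example `2016KW01`). The names `quarter_bounds` and `QUARTER_START_MONTH` are hypothetical and not part of the notebook.

```python
from datetime import date, timedelta

# Hypothetical lookup mirroring the SQL CASE expression: quarter code -> start month.
QUARTER_START_MONTH = {"01": 1, "02": 4, "03": 7, "04": 10}


def quarter_bounds(perioden: str) -> tuple[date, date]:
    """Return (start, end) dates for a quarterly code such as '2016KW01'."""
    year = int(perioden[:4])                           # substr(Perioden, 1, 4)
    start_month = QUARTER_START_MONTH[perioden[6:8]]   # substr(Perioden, 7, 2)
    start = date(year, start_month, 1)
    # Equivalent of DATE(start, '+3 months', '-1 day'):
    # first day of the next quarter, minus one day.
    if start_month == 10:
        next_start = date(year + 1, 1, 1)
    else:
        next_start = date(year, start_month + 3, 1)
    return start, next_start - timedelta(days=1)


print(quarter_bounds("2016KW01"))
print(quarter_bounds("2016KW04"))
```

For `2016KW01` this yields 2016-01-01 and 2016-03-31, which matches the `Period_startdate` and `Period_enddate` values in the notebook output.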

In [4]:
df_modified = df_org.with_columns(
    # Convert columns ending with 'date' to Date type
    cs.ends_with("date").str.to_date("%Y-%m-%d")
)
df_modified.head()

#### Year-on-Year Moving Average Prediction Model
**Goal:** <br> For each quarter (Q1/Q2/Q3/Q4) and SBI_code, calculate a rolling 3-year moving average of Absenteeism_perc as a simple prediction.
Example: The prediction for Q1-2019 = average of Q1-2016, Q1-2017, Q1-2018.

**Step 1: Extract Year and Quarter from Period_startdate** <br>
We derive the quarter (1–4) from the start month of each period. Month 1 = Q1, Month 4 = Q2, Month 7 = Q3, Month 10 = Q4. This allows us to group rows by quarter across different years in the next steps.
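
As a sanity check on the month-to-quarter mapping, here is a small plain-Python sketch of the arithmetic behind a calendar quarter. It illustrates what the quarter column should contain; it is not the Polars implementation, and `quarter_of` is a hypothetical helper name.

```python
from datetime import date


def quarter_of(d: date) -> int:
    """Calendar quarter (1-4): months 1-3 -> 1, 4-6 -> 2, 7-9 -> 3, 10-12 -> 4."""
    return (d.month - 1) // 3 + 1


# The four possible period start dates within one year.
starts = [date(2016, 1, 1), date(2016, 4, 1), date(2016, 7, 1), date(2016, 10, 1)]
quarters = [quarter_of(d) for d in starts]
print(quarters)  # [1, 2, 3, 4]
```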

In [35]:
df_with_quarter = df_modified.with_columns(
    pl.col("Period_startdate").dt.year().alias("Year"),
    pl.col("Period_startdate").dt.quarter().alias("Quarter")
)
print(df_with_quarter.head())

shape: (5, 7)
┌─────────────────┬─────────────────┬────────────────┬─────────────────┬──────────┬──────┬─────────┐
│ Timeperiod_text ┆ Period_startdat ┆ Period_enddate ┆ Absenteeism_per ┆ SBI_code ┆ Year ┆ Quarter │
│ ---             ┆ e               ┆ ---            ┆ c               ┆ ---      ┆ ---  ┆ ---     │
│ str             ┆ ---             ┆ date           ┆ ---             ┆ str      ┆ i32  ┆ i8      │
│                 ┆ date            ┆                ┆ f64             ┆          ┆      ┆         │
╞═════════════════╪═════════════════╪════════════════╪═════════════════╪══════════╪══════╪═════════╡
│ 2016KW01        ┆ 2016-01-01      ┆ 2016-03-31     ┆ 4.3             ┆ 1        ┆ 2016 ┆ 1       │
│ 2016KW02        ┆ 2016-04-01      ┆ 2016-06-30     ┆ 3.8             ┆ 1        ┆ 2016 ┆ 2       │
│ 2016KW03        ┆ 2016-07-01      ┆ 2016-09-30     ┆ 3.5             ┆ 1        ┆ 2016 ┆ 3       │
│ 2016KW04        ┆ 2016-10-01      ┆ 2016-12-31     ┆ 4.1             ┆ 1   

**Step 2: Sort the Data** <br>
Polars window functions respect row order, so we must sort by SBI_code, Quarter, and Year to ensure the rolling average looks back over the correct preceding years (e.g. 2016 → 2017 → 2018 for a Q1 prediction of 2019).
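
To see what this sort order does, the sketch below applies the same three-level key to a handful of rows in plain Python. The rows are illustrative only (the second SBI code is made up), and the tuple layout `(SBI_code, Year, Quarter)` is an assumption chosen for the example.

```python
# Illustrative rows in chronological order: (SBI_code, Year, Quarter).
rows = [
    ("1", 2016, 1), ("1", 2016, 2), ("1", 2017, 1), ("1", 2017, 2),
    ("2", 2016, 1), ("2", 2017, 1),
]

# Same key order as the notebook's sort: SBI_code, then Quarter, then Year.
ordered = sorted(rows, key=lambda r: (r[0], r[2], r[1]))

for row in ordered:
    print(row)
```

After sorting, all Q1 rows of one SBI code sit next to each other in ascending year order, followed by the Q2 rows, so a look-back window over consecutive rows sees the same quarter in earlier years.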

In [None]:
df_sorted = df_with_quarter.sort(["SBI_code", "Quarter", "Year"])
print(df_sorted.head())


**Step 3: Calculate the Rolling 3-Year Moving Average per Quarter and SBI_code** <br>
We use .over(["SBI_code", "Quarter"]) to partition the data into groups, so the rolling average is calculated independently for each unique combination of SBI_code and Quarter (e.g. all Q1 rows for SBI_code "A").

- window_size=3: averages the current row and the 2 preceding rows within that partition (i.e. the 3 most recent years for that quarter)
- min_samples=1: allows the average to be computed even when fewer than 3 years of history are available (e.g. the first year in the data)
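
The effect of the two parameters can be reproduced by hand for a single partition. The sketch below is a plain-Python approximation of a trailing rolling mean, not the Polars implementation (it ignores null handling, for instance). The input values are the Q1 figures for SBI_code "1" shown in the notebook output, and `rolling_mean` here is a hypothetical stand-alone function.

```python
def rolling_mean(values, window_size=3, min_samples=1):
    """Trailing mean over the current value and up to window_size - 1 predecessors."""
    out = []
    for i in range(len(values)):
        window = values[max(0, i - window_size + 1): i + 1]
        # Emit None when the window holds fewer than min_samples values.
        out.append(sum(window) / len(window) if len(window) >= min_samples else None)
    return out


q1_sbi1 = [4.3, 4.3, 4.9, 4.7]  # Q1 of 2016, 2017, 2018, 2019
ma3 = rolling_mean(q1_sbi1)
print([round(v, 2) for v in ma3])  # [4.3, 4.3, 4.5, 4.63]

# Requiring a full window leaves the first two years without a value.
print(rolling_mean(q1_sbi1, min_samples=3)[:2])  # [None, None]
```

The first three results (4.3, 4.3, 4.5) agree with the MA3_Absenteeism_perc column in the output of the next cell.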

In [43]:
df_with_ma = df_sorted.with_columns(
    pl.col("Absenteeism_perc")
    .rolling_mean(window_size=3, min_samples=1)
    .over(["SBI_code", "Quarter"])
    .alias("MA3_Absenteeism_perc")
)
print(df_with_ma.head())

shape: (5, 8)
┌──────────────┬─────────────┬─────────────┬─────────────┬──────────┬──────┬─────────┬─────────────┐
│ Timeperiod_t ┆ Period_star ┆ Period_endd ┆ Absenteeism ┆ SBI_code ┆ Year ┆ Quarter ┆ MA3_Absente │
│ ext          ┆ tdate       ┆ ate         ┆ _perc       ┆ ---      ┆ ---  ┆ ---     ┆ eism_perc   │
│ ---          ┆ ---         ┆ ---         ┆ ---         ┆ str      ┆ i32  ┆ i8      ┆ ---         │
│ str          ┆ date        ┆ date        ┆ f64         ┆          ┆      ┆         ┆ f64         │
╞══════════════╪═════════════╪═════════════╪═════════════╪══════════╪══════╪═════════╪═════════════╡
│ 2016KW01     ┆ 2016-01-01  ┆ 2016-03-31  ┆ 4.3         ┆ 1        ┆ 2016 ┆ 1       ┆ 4.3         │
│ 2017KW01     ┆ 2017-01-01  ┆ 2017-03-31  ┆ 4.3         ┆ 1        ┆ 2017 ┆ 1       ┆ 4.3         │
│ 2018KW01     ┆ 2018-01-01  ┆ 2018-03-31  ┆ 4.9         ┆ 1        ┆ 2018 ┆ 1       ┆ 4.5         │
│ 2019KW01     ┆ 2019-01-01  ┆ 2019-03-31  ┆ 4.7         ┆ 1        ┆ 2019 ┆ 

**Step 4: Shift the Moving Average by 1 to Use it as a Forward Prediction** <br>
The moving average calculated in Step 3 includes the current year's value. To use it as a prediction, we shift it forward by 1 row within each partition, so the MA of [2016, 2017, 2018] becomes the prediction for 2019 rather than a description of 2018. Without this shift, the prediction for 2019 would include 2019's own actual value, which would be data leakage. The first year in each partition has no preceding row, so its prediction is null.
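
A minimal plain-Python sketch of the shift for one partition is shown below. The helper `shift_one` is hypothetical. The first three moving-average values come from the notebook output for Q1 of SBI_code "1"; the 2019 value is computed by hand from 4.3, 4.9 and 4.7 and rounded.

```python
def shift_one(values):
    """Move every value one position later; the first slot has no history."""
    return [None] + values[:-1]


years = [2016, 2017, 2018, 2019]
ma3 = [4.3, 4.3, 4.5, 4.63]  # rolling means that still include each year's own value

predicted = shift_one(ma3)
for year, pred in zip(years, predicted):
    print(year, pred)
```

The prediction for 2019 is 4.5, the mean of 2016 to 2018, and no longer depends on the 2019 actual of 4.7.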

In [44]:
df_predicted = df_with_ma.with_columns(
    pl.col("MA3_Absenteeism_perc")
    .shift(1)
    .over(["SBI_code", "Quarter"])
    .alias("Predicted_Absenteeism_perc")
)
print(df_predicted.head())

shape: (5, 9)
┌────────────┬────────────┬────────────┬────────────┬───┬──────┬─────────┬────────────┬────────────┐
│ Timeperiod ┆ Period_sta ┆ Period_end ┆ Absenteeis ┆ … ┆ Year ┆ Quarter ┆ MA3_Absent ┆ Predicted_ │
│ _text      ┆ rtdate     ┆ date       ┆ m_perc     ┆   ┆ ---  ┆ ---     ┆ eeism_perc ┆ Absenteeis │
│ ---        ┆ ---        ┆ ---        ┆ ---        ┆   ┆ i32  ┆ i8      ┆ ---        ┆ m_perc     │
│ str        ┆ date       ┆ date       ┆ f64        ┆   ┆      ┆         ┆ f64        ┆ ---        │
│            ┆            ┆            ┆            ┆   ┆      ┆         ┆            ┆ f64        │
╞════════════╪════════════╪════════════╪════════════╪═══╪══════╪═════════╪════════════╪════════════╡
│ 2016KW01   ┆ 2016-01-01 ┆ 2016-03-31 ┆ 4.3        ┆ … ┆ 2016 ┆ 1       ┆ 4.3        ┆ null       │
│ 2017KW01   ┆ 2017-01-01 ┆ 2017-03-31 ┆ 4.3        ┆ … ┆ 2017 ┆ 1       ┆ 4.3        ┆ 4.3        │
│ 2018KW01   ┆ 2018-01-01 ┆ 2018-03-31 ┆ 4.9        ┆ … ┆ 2018 ┆ 1       ┆ 4.