# HR Data Cleaning Pipeline

This notebook documents the end-to-end HR dataset preparation workflow using the reusable `hr_data_insights` Python package.


## Setup
Add the repository's `src/` directory to the Python path so the `hr_data_insights` package can be imported. The cell below assumes the notebook is launched from the repository root.


In [None]:
import sys
from pathlib import Path

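# Assumes the notebook is started from the repository root.
PROJECT_ROOT = Path.cwd()
SRC_PATH = PROJECT_ROOT / 'src'
# Prepend src/ so the local package takes precedence over any installed copy.
if str(SRC_PATH) not in sys.path:
    sys.path.insert(0, str(SRC_PATH))

print(f'Project root: {PROJECT_ROOT}')
print(f'Source path on sys.path: {SRC_PATH}')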
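
`Path.cwd()` only points at the repository root when the notebook is launched from there. If the notebook may live in a subfolder, one option is to walk up the directory tree until a folder containing `src/` is found. The sketch below is a hypothetical helper, not part of `hr_data_insights`; the name `find_project_root` and the `marker` parameter are assumptions made for illustration.

```python
from pathlib import Path
from typing import Optional


def find_project_root(start: Optional[Path] = None, marker: str = 'src') -> Path:
    """Return the nearest directory at or above `start` that contains `marker`/.

    Hypothetical helper: `marker` defaults to 'src' because this notebook
    expects the package to live in a `src/` folder at the repository root.
    """
    current = (start or Path.cwd()).resolve()
    # Check the starting directory first, then each parent in turn.
    for candidate in (current, *current.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(
        f"No directory containing '{marker}/' found at or above {current}"
    )
```

With this helper, the setup cell could use `PROJECT_ROOT = find_project_root()` in place of `Path.cwd()`. It raises `FileNotFoundError` rather than silently picking the wrong folder.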


## Run the Cleaning Pipeline
The pipeline loads the raw dataset, standardises key fields, validates the result, and persists a cleaned CSV along with analytical tables.


In [None]:
from hr_data_insights.config import DatasetConfig
from hr_data_insights.pipeline import run_pipeline

config = DatasetConfig()
raw_path = PROJECT_ROOT / config.input_path
if not raw_path.exists():
    print(f'⚠️ Raw dataset not found at {raw_path}. Place the file and re-run this cell.')
else:
    pipeline_result = run_pipeline(config=config)
    cleaned_df = pipeline_result['cleaned']
    print(f'Cleaned dataframe has {len(cleaned_df):,} rows and {cleaned_df.shape[1]} columns.')
    if pipeline_result['validation_messages']:
        print('Validation warnings:')
        for msg in pipeline_result['validation_messages']:
            print(f' - {msg}')


## Preview the Cleaned Dataset
A quick glance at the first rows, types, and summary statistics helps confirm the cleaning output.


In [None]:
if 'cleaned_df' in globals():
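    display(cleaned_df.head())
    # Column types, as mentioned in the description above.
    display(cleaned_df.dtypes.to_frame('dtype'))
    display(cleaned_df.describe(include='all').transpose())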
else:
    print('Pipeline has not been executed yet.')


## Analytics Tables
When the pipeline runs with metrics enabled (the default), it returns tidy dataframes that answer the primary HR business questions, ready for downstream BI tools.


In [None]:
if 'pipeline_result' in globals() and pipeline_result['analytics']:
    for name, table in pipeline_result['analytics'].items():
        print(f'\n=== {name} ===')
        if isinstance(table, dict):
            for sub_name, sub_table in table.items():
                print(f'-- {sub_name} --')
                display(sub_table.head())
        else:
            display(table.head() if hasattr(table, 'head') else table)
else:
    print('Analytics not computed yet.')
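
The display loop above handles two shapes: a value that is a table, or a value that is a dict of sub-tables. When handing results to a BI tool, it can be convenient to flatten that nesting into a single sequence of named tables first. The following is a minimal sketch under that assumption; `iter_tables` and the `sep` parameter are hypothetical names, not part of `hr_data_insights`.

```python
from typing import Any, Dict, Iterator, Tuple


def iter_tables(analytics: Dict[str, Any], sep: str = '__') -> Iterator[Tuple[str, Any]]:
    """Yield (flat_name, table) pairs from a one-level-nested analytics mapping.

    Values that are dicts are expanded into their sub-tables, with the parent
    and child names joined by `sep`; all other values are yielded unchanged.
    """
    for name, table in analytics.items():
        if isinstance(table, dict):
            for sub_name, sub_table in table.items():
                yield f'{name}{sep}{sub_name}', sub_table
        else:
            yield name, table
```

Assuming the tables are pandas dataframes, each pair could then be written out with something like `table.to_csv(out_dir / f'{flat_name}.csv', index=False)`, where `out_dir` is a `Path` of your choosing.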
