# ETL Class Example Notebook

This notebook demonstrates the three main steps of the ETL pipeline: **Extract**, **Transform**, and **Load**. It mirrors the structure of `etl_notebook.ipynb` but provides a concise class‑based example for quick reference.


## Extract

Locate the project root (the directory containing `pixi.toml`) and add it to `sys.path` so that package imports work from any working directory. Then import the extraction utilities required for this example.


In [None]:
import os
import sys
# Find the project root (directory containing 'pixi.toml')
path = os.getcwd()
project_root = None
while path != os.path.dirname(path):
    if 'pixi.toml' in os.listdir(path):
        project_root = path
        break
    path = os.path.dirname(path)

if project_root is None:
    raise FileNotFoundError('Could not locate project root')
# Ensure the root is on the Python path
if project_root not in sys.path:
    sys.path.insert(0, project_root)

# Example extraction – import a sample extractor (adjust as needed)
from src.ca_biositing.pipeline.ca_biositing.pipeline.etl.extract import proximate
raw_df = proximate.extract(project_root=project_root)
raw_df.head()
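
The root lookup above can also be packaged as a small reusable helper. The following is a minimal sketch using `pathlib`; the function name `find_project_root` and its `marker` parameter are hypothetical and not part of the project. Unlike the loop above, this variant also checks the filesystem root itself.

```python
from pathlib import Path
from typing import Optional


def find_project_root(start: Optional[Path] = None, marker: str = "pixi.toml") -> Path:
    """Return the nearest directory at or above `start` that contains `marker`.

    Hypothetical helper for illustration; not part of the project code.
    """
    start = (start or Path.cwd()).resolve()
    # Check the starting directory first, then each ancestor up to the filesystem root.
    for candidate in (start, *start.parents):
        if (candidate / marker).is_file():
            return candidate
    raise FileNotFoundError(f"Could not locate {marker!r} at or above {start}")
```

Calling `find_project_root()` with no arguments starts from the current working directory, which matches the behaviour of the cell above.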

## Transform

Apply the cleaning and coercion utilities from the `cleaning_functions` package, then normalize name columns to IDs with `replace_name_with_id_df` from `name_id_swap`. Each utility is demonstrated in its own cell for clarity.


In [None]:
# Cleaning utilities
from src.ca_biositing.pipeline.ca_biositing.pipeline.utils.cleaning_functions import cleaning as cleaning_mod
# Apply the standard cleaning pipeline to the extracted dataframe
cleaned_df = cleaning_mod.standard_clean(raw_df)
cleaned_df.head()

In [None]:
# Coercion utilities
from src.ca_biositing.pipeline.ca_biositing.pipeline.utils.cleaning_functions import coercion as coercion_mod
# Example: coerce integer, float, and datetime columns
coerced_df = coercion_mod.coerce_columns(cleaned_df,
                                         int_cols=['repl_no'],
                                         float_cols=['value'],
                                         datetime_cols=['created_at', 'updated_at'])
coerced_df.head()
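
For readers without access to the package, the kind of work a coercion step typically does can be sketched with plain pandas. The function `coerce_columns_sketch` below is a hypothetical stand-in and makes no claim about how `coercion_mod.coerce_columns` is actually implemented; it assumes that unparseable values should become missing values and that absent columns are skipped.

```python
import pandas as pd


def coerce_columns_sketch(df, int_cols=(), float_cols=(), datetime_cols=()):
    """Hypothetical stand-in for a coercion step; not the project's implementation."""
    out = df.copy()
    for col in int_cols:
        if col in out.columns:
            # Nullable Int64 keeps missing values as <NA> instead of forcing floats.
            out[col] = pd.to_numeric(out[col], errors="coerce").astype("Int64")
    for col in float_cols:
        if col in out.columns:
            out[col] = pd.to_numeric(out[col], errors="coerce").astype("float64")
    for col in datetime_cols:
        if col in out.columns:
            # Values that cannot be parsed become NaT.
            out[col] = pd.to_datetime(out[col], errors="coerce")
    return out


# Made-up sample data, for illustration only.
sample = pd.DataFrame({
    "repl_no": ["1", "2", None],
    "value": ["3.5", "n/a", "7"],
    "created_at": ["2024-01-05", "not a date", "2024-02-01"],
})
coerced = coerce_columns_sketch(
    sample,
    int_cols=["repl_no"],
    float_cols=["value"],
    datetime_cols=["created_at", "updated_at"],  # 'updated_at' is absent and skipped
)
print(coerced.dtypes)
```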

In [None]:
# Normalization (name → id) utilities
from src.ca_biositing.pipeline.ca_biositing.pipeline.utils.name_id_swap import replace_name_with_id_df

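# Import only the reference models used below (avoids a wildcard import)
from src.ca_biositing.datamodels.ca_biositing.datamodels.schemas.generated.ca_biositing import (
    AnalysisType,
    Contact,
    Parameter,
    PreparationMethod,
    PreparedSample,
    PrimaryAgProduct,
    Provider,
    Resource,
    Unit,
)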
# Define columns to normalize – map column name to (Model, attribute)
NORMALIZE_MAP = {
    'resource': (Resource, 'name'),
    'prepared_sample': (PreparedSample, 'name'),
    'preparation_method': (PreparationMethod, 'name'),
    'parameter': (Parameter, 'name'),
    'unit': (Unit, 'name'),
    'sample_unit': (Unit, 'name'),
    'analyst_email': (Contact, 'email'),
    'analysis_type': (AnalysisType, 'name'),
    'primary_ag_product': (PrimaryAgProduct, 'name'),
    'provider_code': (Provider, 'codename'),
    # Add additional mappings as required
}
norm_df = coerced_df.copy()
for col, (model, attr) in NORMALIZE_MAP.items():
    if col in norm_df.columns:
        norm_df, _ = replace_name_with_id_df(
            db=None,
            df=norm_df,
            ref_model=model,
            df_name_column=col,
            model_name_attr=attr,
            id_column_name='id',
            final_column_name=f'{col}_id',
        )
norm_df.head()
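
The name → id swap can be illustrated without a database. The sketch below uses an in-memory dictionary in place of a reference table. `replace_name_with_id_sketch` is a hypothetical function, and both its return value (the new dataframe plus a list of unmatched names) and the choice to drop the original name column are assumptions made for illustration, not a description of `replace_name_with_id_df`.

```python
import pandas as pd


def replace_name_with_id_sketch(df, lookup, name_column, final_column_name):
    """Swap a name column for an id column using an in-memory lookup.

    Hypothetical stand-in: the real helper takes a session and a model.
    """
    out = df.copy()
    # Names missing from the lookup become <NA> so they are easy to spot.
    out[final_column_name] = out[name_column].map(lookup).astype("Int64")
    missing = out[final_column_name].isna()
    unmatched = sorted(out.loc[missing, name_column].dropna().unique())
    return out.drop(columns=[name_column]), unmatched


# Made-up reference data and measurements, for illustration only.
unit_ids = {"g": 1, "kg": 2}
df = pd.DataFrame({"unit": ["g", "kg", "furlong"], "value": [1.0, 2.0, 3.0]})

norm, unmatched = replace_name_with_id_sketch(df, unit_ids, "unit", "unit_id")
print(norm)
print("Unmatched names:", unmatched)
```

Reporting unmatched names rather than silently dropping rows makes gaps in the reference data visible before the load step.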


## Load

Load the transformed dataframes into the database. The example below reuses the load logic from `etl/load/products/primary_ag_product.py` and extends it to a list of `(dataframe, load_function)` pairs, which are all written within a single session and committed once at the end.


In [None]:
from src.ca_biositing.pipeline.ca_biositing.pipeline.etl.load.products.primary_ag_product import load_primary_ag_product
from sqlmodel import Session
from src.ca_biositing.datamodels.ca_biositing.datamodels.database import get_engine

engine = get_engine()
# Example: list of (dataframe, load_function) pairs
load_tasks = [
    (norm_df, load_primary_ag_product),
    # Add additional (df, load_fn) tuples for other tables
]
with Session(engine) as session:
    for df, load_fn in load_tasks:
        if df is not None and not df.empty:
            load_fn(df=df, db=session)
    session.commit()
print('Load step completed')
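
The task-list pattern above can be tried out without the project database. The following is a minimal sketch that uses the standard-library `sqlite3` module as a stand-in for the SQLModel session; the tables, the `load_units` and `load_resources` functions, and `run_load_tasks` are all hypothetical. It assumes that load functions do not commit on their own, so that a failure in any task rolls back the whole batch.

```python
import sqlite3


def load_units(rows, conn):
    """Hypothetical load function: insert rows without committing."""
    conn.executemany("INSERT INTO unit (name) VALUES (?)", [(r["name"],) for r in rows])


def load_resources(rows, conn):
    """Hypothetical load function: insert rows without committing."""
    conn.executemany("INSERT INTO resource (name) VALUES (?)", [(r["name"],) for r in rows])


def run_load_tasks(conn, load_tasks):
    """Run every (rows, load_fn) pair in one transaction; roll back all on failure."""
    try:
        for rows, load_fn in load_tasks:
            if rows:  # skip None and empty inputs
                load_fn(rows, conn)
        conn.commit()
    except Exception:
        conn.rollback()
        raise


# In-memory database with made-up tables, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE unit (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE)")
conn.execute("CREATE TABLE resource (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE)")
conn.commit()

run_load_tasks(conn, [
    ([{"name": "g"}, {"name": "kg"}], load_units),
    ([], load_resources),  # empty input is skipped
    ([{"name": "almond hulls"}], load_resources),
])
```

Committing once after the loop keeps the tables consistent with each other: either every task in the list is persisted or none is.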