# ETL Class Example Notebook

This notebook demonstrates the three main steps of the ETL pipeline: **Extract**, **Transform**, and **Load**. It mirrors the structure of `etl_notebook.ipynb` but provides a concise class‑based example for quick reference.


## Extract

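Locate the project root (the nearest parent directory containing `pixi.toml`) and add it to `sys.path` so that package imports work from any working directory. Then import the extraction utility used in this example and pull the raw data into a dataframe.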


In [1]:
import os
import sys
# Find the project root (directory containing 'pixi.toml')
path = os.getcwd()
project_root = None
while path != os.path.dirname(path):
    if 'pixi.toml' in os.listdir(path):
        project_root = path
        break
    path = os.path.dirname(path)

if project_root is None:
    raise FileNotFoundError('Could not locate project root')
# Ensure the root is on the Python path
if project_root not in sys.path:
    sys.path.insert(0, project_root)

# Example extraction – import a sample extractor (adjust as needed)
from src.ca_biositing.pipeline.ca_biositing.pipeline.etl.extract import proximate
raw_df = proximate.extract(project_root=project_root)
raw_df.head()

Unnamed: 0,Prox_UUID_031,Record_ID,Source_codename,Prepared_sample,Resource,Preparation_method,Storage_cond,Exper_abbrev,Repl_no,Repl_ID,Parameter,Value,Unit,Created_at,Updated_at,QC_result,Upload_status,Note,Analysis_type,Analyst_email
0,D7965110-407F-E356-D41D-B3B9A2B7B7,(73)B7B7,Oakleaf,Oak-TmPm01A(73),Tomato pomace,As Is,4C,Prox01xk,1,Prox01xk(73)1,Moisture,61.85,% total weight,2024-10-02 10:31:01,,Pass,not ready,,Proximate analysis,xkang2@lbl.gov
1,C8FEA984-2E9A-8DEF-55FB-1A9D7D9BA8,(73)9BA8,Oakleaf,Oak-TmPm01A(73),Tomato pomace,As Is,4C,Prox01xk,2,Prox01xk(73)2,Moisture,63.21,% total weight,2024-10-02 10:31:31,,Pass,ready,,Proximate analysis,xkang2@lbl.gov
2,DF304D5D-3A85-4881-7142-6D4E5F957D,(73)957D,Oakleaf,Oak-TmPm01A(73),Tomato pomace,As Is,4C,Prox01xk,3,Prox01xk(73)3,Moisture,63.27,% total weight,2024-10-02 10:32:01,,Pass,imported,,Proximate analysis,xkang2@lbl.gov
3,01C6C5BE-CEA6-54AF-3924-B0BAD69335,(73)9335,Oakleaf,Oak-TmPm01A(73),Tomato pomace,As Is,4C,Prox01xk,1,Prox01xk(73)1,Ash,0.69,% total weight,2024-10-03 10:31:01,,Pass,import failed,,Proximate analysis,xkang2@lbl.gov
4,126745C7-DD41-2F6D-0DC5-28DBCA415F,(73)415F,Oakleaf,Oak-TmPm01A(73),Tomato pomace,As Is,4C,Prox01xk,2,Prox01xk(73)2,Ash,0.89,% total weight,2024-10-03 10:31:31,,Pass,,,Proximate analysis,xkang2@lbl.gov
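
The root-finding loop in the extract cell can also be written with `pathlib`. The sketch below is an alternative, not the project's own helper: `find_project_root` is a hypothetical name, and unlike the loop above it also checks the filesystem root itself. The demonstration runs in a temporary directory so it does not depend on the real repository layout.

```python
from pathlib import Path
import tempfile


def find_project_root(start: Path, marker: str = "pixi.toml") -> Path:
    """Return the nearest directory at or above `start` that contains `marker`."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_file():
            return candidate
    raise FileNotFoundError(f"No {marker!r} found at or above {start}")


# Demonstration in a throwaway directory tree (hypothetical layout)
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    (root / "pixi.toml").touch()
    nested = root / "notebooks" / "examples"
    nested.mkdir(parents=True)
    found_matches_root = find_project_root(nested) == root
    print(found_matches_root)
```

In a notebook, the result would typically be passed to `sys.path.insert(0, str(root))`, as in the cell above.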


## Transform

Apply the cleaning and coercion utilities from the `cleaning_functions` package, then the name-to-id normalization from `name_id_swap`. Each utility is demonstrated in its own cell for clarity.


In [None]:
# Cleaning utilities
from src.ca_biositing.pipeline.ca_biositing.pipeline.utils.cleaning_functions import cleaning as cleaning_mod
# Apply the standard cleaning pipeline to the extracted dataframe
cleaned_df = cleaning_mod.standard_clean(raw_df)
cleaned_df.head()
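
The internals of `standard_clean` are not shown in this notebook. Because the later cells refer to lower-case column names such as `repl_no` and `value`, a cleaning step of this kind presumably normalizes headers. The sketch below illustrates that idea under stated assumptions: `sketch_standard_clean` is a hypothetical stand-in, not the project's function, and the exact rules (trimming, lower-casing, empty strings to missing) are guesses.

```python
import re

import pandas as pd


def _tidy_cell(value):
    """Trim strings and turn empty strings into missing values."""
    if isinstance(value, str):
        value = value.strip()
        return value if value else None
    return value


def sketch_standard_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical cleaning pass: snake_case headers, tidy string cells."""
    out = df.copy()
    # Trim headers, lower-case them, and collapse whitespace to underscores
    out.columns = [re.sub(r"\s+", "_", str(c).strip().lower()) for c in out.columns]
    for col in out.columns:
        out[col] = out[col].map(_tidy_cell)
    return out


raw = pd.DataFrame(
    {
        " Repl_no ": ["1", "2"],
        "Parameter": [" Moisture", "Ash "],
        "Note": ["", "  "],
    }
)
clean = sketch_standard_clean(raw)
print(list(clean.columns))
```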

In [None]:
# Coercion utilities
from src.ca_biositing.pipeline.ca_biositing.pipeline.utils.cleaning_functions import coercion as coercion_mod
# Example: coerce integer, float, and datetime columns
coerced_df = coercion_mod.coerce_columns(
    cleaned_df,
    int_cols=['repl_no'],
    float_cols=['value'],
    datetime_cols=['created_at', 'updated_at'],
)
coerced_df.head()
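
For readers without access to the package, the coercion step can be approximated with plain pandas. This is a sketch, not the implementation of `coerce_columns`: the function name `sketch_coerce_columns` is hypothetical, and turning unparseable values into missing values (`errors="coerce"`) is an assumption about the desired behavior.

```python
import pandas as pd


def sketch_coerce_columns(df, int_cols=(), float_cols=(), datetime_cols=()):
    """Hypothetical coercion pass built on pandas converters."""
    out = df.copy()
    for col in int_cols:
        # Nullable Int64 keeps missing values; non-integral numbers would raise here
        out[col] = pd.to_numeric(out[col], errors="coerce").astype("Int64")
    for col in float_cols:
        out[col] = pd.to_numeric(out[col], errors="coerce").astype("float64")
    for col in datetime_cols:
        # Unparseable or missing timestamps become NaT
        out[col] = pd.to_datetime(out[col], errors="coerce")
    return out


sample = pd.DataFrame(
    {
        "repl_no": ["1", "2", None],
        "value": ["61.85", "63.21", "n/a"],
        "created_at": ["2024-10-02 10:31:01", "2024-10-02 10:31:31", None],
    }
)
typed = sketch_coerce_columns(
    sample,
    int_cols=["repl_no"],
    float_cols=["value"],
    datetime_cols=["created_at"],
)
print(typed.dtypes)
```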

In [None]:
# Normalization (name → id) utilities
from src.ca_biositing.pipeline.ca_biositing.pipeline.utils.name_id_swap import normalize_dataframes

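# Import only the reference models used in the mapping below
from src.ca_biositing.datamodels.ca_biositing.datamodels.schemas.generated.ca_biositing import (
    AnalysisType, Contact, Parameter, PreparationMethod, PreparedSample,
    PrimaryAgProduct, Provider, Resource, Unit,
)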
# Define columns to normalize – map column name to (Model, attribute)
normalize_columns = {
    'resource': (Resource, 'name'),
    'prepared_sample': (PreparedSample, 'name'),
    'preparation_method': (PreparationMethod, 'name'),
    'parameter': (Parameter, 'name'),
    'unit': (Unit, 'name'),
    'sample_unit': (Unit, 'name'),
    'analyst_email': (Contact, 'email'),
    'analysis_type': (AnalysisType, 'name'),
    'primary_ag_product': (PrimaryAgProduct, 'name'),
    'provider_code': (Provider, 'codename'),
    # Add additional mappings as required
}

# Normalize the dataframe
normalized_df = normalize_dataframes(coerced_df, normalize_columns)
normalized_df.head()
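
The name-to-id swap can be pictured without a database. The sketch below replaces each name column with an integer id column using in-memory dictionaries in place of the reference models. Everything here is an assumption for illustration: `sketch_name_to_id`, the `<column>_id` naming, the id values, and the choice to skip mappings for columns that the dataframe lacks are not taken from `normalize_dataframes`.

```python
import pandas as pd


def sketch_name_to_id(df, lookups):
    """Replace each mapped name column with a `<column>_id` column."""
    out = df.copy()
    for col, name_to_id in lookups.items():
        if col not in out.columns:
            continue  # skip mappings for columns this dataframe lacks
        # Names without a match become missing ids rather than raising
        out[f"{col}_id"] = out[col].map(name_to_id).astype("Int64")
        out = out.drop(columns=col)
    return out


# Hypothetical lookup tables standing in for database reference rows
lookups = {
    "resource": {"Tomato pomace": 1},
    "parameter": {"Moisture": 10, "Ash": 11},
    "provider_code": {"Oakleaf": 5},
}
sample = pd.DataFrame(
    {
        "resource": ["Tomato pomace", "Tomato pomace", "Unknown"],
        "parameter": ["Moisture", "Ash", "Ash"],
        "value": [61.85, 0.69, 0.89],
    }
)
swapped = sketch_name_to_id(sample, lookups)
print(swapped)
```

Leaving unmatched names as missing ids makes gaps in the reference data visible downstream, at the cost of needing a check before loading.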


## Load

Load the transformed dataframes into the database. The example below reuses the loader from `etl/load/products/primary_ag_product.py` and generalizes it to a list of (dataframe, load function) pairs that run in a single session and are committed together.


In [None]:
from src.ca_biositing.pipeline.ca_biositing.pipeline.etl.load.products.primary_ag_product import load_primary_ag_product
from sqlmodel import Session
from src.ca_biositing.datamodels.ca_biositing.datamodels.database import get_engine

engine = get_engine()
# Example: list of (dataframe, load_function) pairs
load_tasks = [
    (normalized_df, load_primary_ag_product),
    # Add additional (df, load_fn) tuples for other tables
]
with Session(engine) as session:
    for df, load_fn in load_tasks:
        if df is not None and not df.empty:
            load_fn(df=df, db=session)
    session.commit()
print('Load step completed')
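
The task-list pattern above can be tried without the project database. The sketch below swaps in the standard-library `sqlite3` module and plain lists of row dictionaries for the SQLModel session and dataframes, so it is a stand-in rather than the project's load code. The table names, loader functions, and `run_load_tasks` are hypothetical. It also makes the failure path explicit: if any loader raises, the whole batch is rolled back.

```python
import sqlite3


def load_parameters(rows, conn):
    conn.executemany("INSERT INTO parameter (name) VALUES (:name)", rows)


def load_units(rows, conn):
    conn.executemany("INSERT INTO unit (name) VALUES (:name)", rows)


def run_load_tasks(conn, load_tasks):
    """Run every (rows, load_fn) pair in one transaction; return tasks run."""
    executed = 0
    try:
        for rows, load_fn in load_tasks:
            if rows:  # skip None and empty inputs
                load_fn(rows, conn)
                executed += 1
        conn.commit()
    except Exception:
        conn.rollback()  # undo every insert from this batch
        raise
    return executed


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parameter (name TEXT NOT NULL UNIQUE)")
conn.execute("CREATE TABLE unit (name TEXT NOT NULL UNIQUE)")

tasks = [
    ([{"name": "Moisture"}, {"name": "Ash"}], load_parameters),
    ([], load_units),  # empty input is skipped
    ([{"name": "% total weight"}], load_units),
]
tasks_run = run_load_tasks(conn, tasks)
parameter_count = conn.execute("SELECT COUNT(*) FROM parameter").fetchone()[0]
unit_count = conn.execute("SELECT COUNT(*) FROM unit").fetchone()[0]
print(tasks_run, parameter_count, unit_count)
```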