## ETL Notebook for CA Biositing Project

This notebook provides a documented walkthrough of the ETL (Extract, Transform, Load) process for the CA Biositing project. It is designed for interactive development and exploration before migrating logic into the production pipeline.

It covers:

1.  **Setup**: Importing necessary libraries and establishing a connection to the database.
2.  **Extraction**: Pulling raw data from Google Sheets.
3.  **Cleaning**: Standardizing data types, handling missing values, and cleaning column names.
4.  **Normalization**: Replacing human-readable names (e.g., "Corn") with database foreign key IDs (e.g., `resource_id: 1`).
5.  **Utilities**: Common functions for data manipulation and analysis.
6.  **Deployment Plan**: A step-by-step guide for moving the code from this notebook into the production ETL modules.

In [1]:
import os
import sys
import pandas as pd
import numpy as np
import janitor as jn
import logging
from IPython.display import display
from sqlalchemy.orm import Session
from sqlalchemy import select

# --- Basic Logging Configuration for Notebook ---
# When running in a notebook, we use Python's standard logging.
# In the production pipeline, this will be replaced by Prefect's `get_run_logger()`
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger()

# --- Robustly find the project root ---
# This ensures that the notebook can be run from any directory within the project.
path = os.getcwd()
project_root = None
while path != os.path.dirname(path):
    if 'pixi.toml' in os.listdir(path):
        project_root = path
        break
    path = os.path.dirname(path)

if not project_root:
    raise FileNotFoundError("Could not find project root containing 'pixi.toml'.")

# Add the project root to the Python path to allow for module imports
if project_root not in sys.path:
    sys.path.insert(0, project_root)
    logger.info(f"Added project root '{project_root}' to sys.path")
else:
    logger.info(f"Project root '{project_root}' is already in sys.path")

# --- Import project modules ---
try:
    from src.ca_biositing.pipeline.ca_biositing.pipeline.utils.engine import engine
    from src.ca_biositing.datamodels.ca_biositing.datamodels.schemas.generated.ca_biositing import *
    from src.ca_biositing.pipeline.ca_biositing.pipeline.utils.name_id_swap import replace_name_with_id_df
    from src.ca_biositing.pipeline.ca_biositing.pipeline.etl.extract import proximate, ultimate, cmpana, samplemetadata
    logger.info('Successfully imported all project modules.')
except ImportError as e:
    logger.error(f'Failed to import project modules: {e}', exc_info=True)

2026-01-14 09:03:04,574 - INFO - Added project root '/Users/pjsmitty301/ca-biositing' to sys.path
2026-01-14 09:03:06,007 - INFO - Successfully imported all project modules.
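
The project-root search in the setup cell can also be written with `pathlib`. The following is a minimal sketch, not part of the project code: `find_project_root` is a hypothetical helper name, and the only thing taken from the notebook is the `pixi.toml` marker file. Unlike a `while path != os.path.dirname(path)` loop, it also checks the filesystem root itself.

```python
from pathlib import Path
from typing import Optional


def find_project_root(start: Path, marker: str = "pixi.toml") -> Optional[Path]:
    """Return the nearest directory at or above `start` that contains `marker`.

    Hypothetical helper for illustration; returns None when no marker is found.
    """
    start = start.resolve()
    # Check `start` first, then each ancestor up to and including the root.
    for candidate in (start, *start.parents):
        if (candidate / marker).is_file():
            return candidate
    return None
```

A caller would typically run `find_project_root(Path.cwd())` and raise `FileNotFoundError` when the result is `None`, mirroring the behaviour of the cell above.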


### TRANSFORMATION FUNCTIONS

### Data Cleaning Function

In [3]:
# Use the refactored cleaning/coercion helpers from the new package
from src.ca_biositing.pipeline.ca_biositing.pipeline.utils.cleaning_functions import standard_clean, coerce_columns, coerce_columns_list

def clean_the_gsheets(df, lowercase=True, replace_empty=True):
    """Wrapper that applies the standardized cleaning pipeline implemented in `cleaning_functions`."""
    logger.info('Starting DataFrame cleaning via standard_clean.')
    if not isinstance(df, pd.DataFrame):
        logger.error('Input is not a pandas DataFrame.')
        return None
    try:
        # Run the composed standard clean (names, empty->NA, lowercase, convert_dtypes)
        df_cleaned = standard_clean(df, lowercase=lowercase, replace_empty=replace_empty)
        # Preserve behaviour: drop rows missing key columns if present
        subset = [c for c in ['resource', 'value'] if c in df_cleaned.columns]
        if subset:
            df_cleaned = df_cleaned.dropna(subset=subset)
        logger.info(f'Cleaning complete; rows remaining: {len(df_cleaned)}')
        return df_cleaned
    except Exception as e:
        logger.error(f'An error occurred during DataFrame cleaning: {e}', exc_info=True)
        return None


# --- Coercion Configuration Templates ---
# You can define column coercions in two ways: explicit keyword arguments or a dtype_map dictionary.
# For geometry: use geopandas to load shapefiles (geometry column is already properly typed).
# Only use geometry_cols if you have WKT strings to parse.

# APPROACH 1: Explicit keyword arguments (clear and direct)
COERCION_CONFIG_EXPLICIT = {
    'int_cols': ['repl_no', 'sample_no'],
    'float_cols': ['value', 'measurement'],
    'datetime_cols': ['created_at', 'updated_at'],
    'bool_cols': ['is_valid'],
    'category_cols': ['status'],
    'geometry_cols': []  # Use only if you have WKT strings; prefer geopandas for shapefiles
}

# APPROACH 2: dtype_map dictionary (compact, useful for dynamic configs)
COERCION_CONFIG_DTYPE_MAP = {
    'int': ['repl_no', 'sample_no'],
    'float': ['value', 'measurement'],
    'datetime': ['created_at', 'updated_at'],
    'bool': ['is_valid'],
    'category': ['status'],
    'geometry': []  # Use only if you have WKT strings; prefer geopandas for shapefiles
}

# APPROACH 3: GeoPandas GeoDataFrame (for shapefiles and spatial data)
# When loading shapefiles with geopandas, geometry is already a GeoSeries.
# Use geometry_format='geodataframe' to skip coercion:
GEOPANDAS_CONFIG = {
    'int_cols': ['id', 'repl_no'],
    'float_cols': ['area', 'value'],
    'geometry_cols': ['geometry'],
    'geometry_format': 'geodataframe'  # Don't convert; already properly typed
}

# Usage: coerce_the_gsheets(df, **COERCION_CONFIG_EXPLICIT)
#   or: coerce_the_gsheets(df, dtype_map=COERCION_CONFIG_DTYPE_MAP)
#   or: coerce_the_gsheets(gdf, **GEOPANDAS_CONFIG)  # for GeoDataFrames


def coerce_the_gsheets(df, dtype_map=None, int_cols=None, float_cols=None, datetime_cols=None, bool_cols=None, category_cols=None, geometry_cols=None, geometry_format='wkt'):
    """Coerce specified columns on a cleaned DataFrame using coercion helpers.
    `dtype_map` is an alternative mapping where keys are 'int','float','datetime','bool','category','geometry'.
    `geometry_format` controls geometry coercion: 'wkt' (parse WKT strings) or 'geodataframe' (skip, already typed)."""
    if not isinstance(df, pd.DataFrame):
        logger.error('coerce_the_gsheets: input is not a DataFrame')
        return df
    return coerce_columns(df, int_cols=int_cols, float_cols=float_cols, datetime_cols=datetime_cols, bool_cols=bool_cols, category_cols=category_cols, geometry_cols=geometry_cols, dtype_map=dtype_map, geometry_format=geometry_format)
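
To make the clean-then-coerce sequence concrete, here is a pandas-only sketch of what the two stages might do. It is an assumption-laden illustration, not the implementation in `cleaning_functions`: `sketch_standard_clean` and `sketch_coerce` are hypothetical names, the column-name rule is a simple regex rather than whatever the project uses, and only int, float, and datetime coercion are shown.

```python
import re

import pandas as pd


def sketch_standard_clean(df, lowercase=True, replace_empty=True):
    """Hypothetical stand-in for `standard_clean`: names, empty->NA, lowercase, dtypes."""
    out = df.copy()
    # Normalize column names to snake_case (assumed rule, for illustration only).
    out.columns = [
        re.sub(r"[^0-9a-zA-Z]+", "_", str(col)).strip("_").lower()
        for col in out.columns
    ]

    def _clean_cell(value):
        # Non-string values (numbers, NaN, timestamps) pass through unchanged.
        if not isinstance(value, str):
            return value
        value = value.strip()
        if replace_empty and value == "":
            return pd.NA
        return value.lower() if lowercase else value

    for col in out.columns:
        out[col] = out[col].map(_clean_cell)
    return out.convert_dtypes()


def sketch_coerce(df, int_cols=(), float_cols=(), datetime_cols=()):
    """Hypothetical stand-in for `coerce_columns`; columns not present are skipped."""
    out = df.copy()
    for col in int_cols:
        if col in out.columns:
            # Unparseable values become <NA>; non-integer floats would raise here.
            out[col] = pd.to_numeric(out[col], errors="coerce").astype("Int64")
    for col in float_cols:
        if col in out.columns:
            out[col] = pd.to_numeric(out[col], errors="coerce").astype("Float64")
    for col in datetime_cols:
        if col in out.columns:
            out[col] = pd.to_datetime(out[col], errors="coerce")
    return out
```

Using nullable dtypes (`Int64`, `Float64`) in the sketch keeps missing values as `<NA>` instead of silently turning an integer column into floats.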


### GeoPandas: Working with Shapefiles

For spatial data (shapefiles, GeoJSON, etc.), use **geopandas** instead of shapely string parsing. GeoPandas provides:
- Direct shapefile loading into GeoDataFrames
- Geometry column automatically typed as `geometry`
- Spatial operations (spatial joins, intersects, buffer, etc.)
- Better integration with tabular data

**Loading shapefiles:**
```python
import geopandas as gpd

# Load a shapefile directly (geometry column is already correct type)
gdf = gpd.read_file('path/to/shapefile.shp')

# Clean tabular columns (geometry is preserved)
gdf_clean = clean_the_gsheets(gdf)

# Coerce columns, skipping geometry since it's already a GeoSeries
gdf_coerced = coerce_the_gsheets(
    gdf_clean,
    int_cols=['id', 'repl_no'],
    float_cols=['area', 'value'],
    datetime_cols=['created_at'],
    geometry_cols=['geometry'],
    geometry_format='geodataframe'  # geometry is already properly typed
)
```


### Data Normalization Function

In [4]:
def normalize_dataframes(dataframes, normalize_columns):
    """Normalizes a list of DataFrames by replacing name columns with foreign key IDs.

    This function iterates through a list of dataframes and, for each one, iterates
    through a dictionary of columns that need to be normalized. It uses the 
    `replace_name_with_id_df` utility to look up or create the corresponding ID
    in the database.

    Args:
        dataframes (list[pd.DataFrame]): A list of DataFrames to normalize.
        normalize_columns (dict): A dictionary mapping column names to SQLModel classes and attributes.

    Returns:
        list[pd.DataFrame]: The list of normalized DataFrames.
    """
    logger.info(f'Starting normalization process for {len(dataframes)} dataframes.')
    normalized_dfs = []
    try:
        with Session(engine) as db:
            for i, df in enumerate(dataframes):
                if not isinstance(df, pd.DataFrame):
                    logger.warning(f'Item {i+1} is not a DataFrame, skipping.')
                    continue
                
                logger.info(f'Processing DataFrame #{i+1} with {len(df)} rows.')
                df_normalized = df.copy()

                for df_col, (model, model_name_attr) in normalize_columns.items():
                    if df_col not in df_normalized.columns:
                        logger.warning(f"Column '{df_col}' not in DataFrame #{i+1}. Skipping normalization for this column.")
                        continue
                    
                    try:
                        # Skip normalization if the column is all NaN/None
                        if df_normalized[df_col].isnull().all():
                            logger.info(f"Skipping normalization for column '{df_col}' as it contains only null values.")
                            continue
                            
                        logger.info(f"Normalizing column '{df_col}' using model '{model.__name__}'.")
                        df_normalized, num_created = replace_name_with_id_df(
                            db=db,
                            df=df_normalized,
                            ref_model=model,
                            df_name_column=df_col,
                            model_name_attr=model_name_attr,
                            id_column_name='id',
                            final_column_name=f'{df_col}_id'
                        )
                        if num_created > 0:
                            logger.info(f"Created {num_created} new records in '{model.__name__}' table.")
                        new_col_name = f'{df_col}_id'
                        num_nulls = df_normalized[new_col_name].isnull().sum()
                        logger.info(f"Successfully normalized '{df_col}'. New column '{new_col_name}' contains {num_nulls} null values.")
                    except Exception as e:
                        logger.error(f"Error normalizing column '{df_col}' in DataFrame #{i+1}: {e}", exc_info=True)
                        continue # Continue to the next column
                
                normalized_dfs.append(df_normalized)
                logger.info(f'Finished processing DataFrame #{i+1}.')
            
            logger.info('Committing database session.')
            db.commit()
            logger.info('Database commit successful.')
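    except Exception as e:
        logger.error(f'A critical error occurred during the database session: {e}', exc_info=True)
        # Leaving the `with Session(engine)` block closes the session, which discards
        # any uncommitted changes, so no explicit rollback is needed here. Calling
        # `db.rollback()` would also raise NameError if `Session(engine)` itself failed.
        logger.info('Database session closed without committing.')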
        
    return normalized_dfs
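
The name-to-ID swap at the heart of normalization can be illustrated without a database. The sketch below is an in-memory analogue, assuming a plain `dict` in place of the reference table: `swap_name_for_id` is a hypothetical function, not `replace_name_with_id_df`, and dropping the original name column is an assumption about the desired result.

```python
import pandas as pd


def swap_name_for_id(df, name_col, lookup, id_col_name=None):
    """Replace `name_col` with an ID column using `lookup` (name -> id).

    Names missing from `lookup` get a new ID, and `lookup` is updated in place.
    Returns (new_df, number_of_ids_created). Illustration only.
    """
    id_col_name = id_col_name or f"{name_col}_id"
    next_id = max(lookup.values(), default=0) + 1
    created = 0
    # "Get or create": register every non-null name that has no ID yet.
    for name in df[name_col].dropna().unique():
        if name not in lookup:
            lookup[name] = next_id
            next_id += 1
            created += 1
    out = df.copy()
    # Null names map to <NA> rather than to a placeholder ID.
    out[id_col_name] = out[name_col].map(lookup).astype("Int64")
    return out.drop(columns=[name_col]), created
```

For example, with `lookup = {"corn": 1}` and a `resource` column containing `"corn"` and `"wheat"`, the call creates one new entry for `"wheat"` and returns a frame with `resource_id` in place of `resource`.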


### ETL Execution Example

In [8]:
# --- 1. Extraction ---
# In a real Prefect flow, each extraction would be a separate task.
logger.info('Starting data extraction...')
prox_df = proximate.extract(project_root=project_root)
ult_df = ultimate.extract(project_root=project_root)
cmp_df = cmpana.extract(project_root=project_root)
samplemetadata_df = samplemetadata.extract(project_root=project_root)
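# Only the sample metadata is processed in this run; add prox_df, ult_df, and cmp_df to the list to include them.
dataframes = [samplemetadata_df]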
logger.info('Data extraction complete.')

# --- 2. Cleaning ---
# This list comprehension applies the cleaning function to each extracted dataframe.
logger.info('Starting data cleaning...')
clean_dataframes = [clean_the_gsheets(df) for df in dataframes if df is not None]
logger.info('Data cleaning complete.')

# --- 3. Coercion ---
# Coerce specific columns to their target types; all others keep the dtypes inferred during cleaning.
logger.info('Starting data coercion...')
coerced_dataframes = [coerce_the_gsheets(
    df,
    int_cols=['repl_no', 'qty'],
    float_cols=['value'],
    datetime_cols=['created_at', 'updated_at', 'sample_ts', 'prod_date']
) for df in clean_dataframes if df is not None]
logger.info('Data coercion complete.')

# --- 4. Normalization ---
# This dictionary defines the columns to be normalized. 
# The key is the column name in the DataFrame.
# The value is a tuple containing the corresponding SQLAlchemy model and the name of the attribute on the model to match against.
NORMALIZE_COLUMNS = {
    'resource': (Resource, 'name'),
    'prepared_sample': (PreparedSample, 'name'),
    'preparation_method': (PreparationMethod, 'name'),
    'parameter': (Parameter, 'name'),
    'unit': (Unit, 'name'),
    'sample_unit': (Unit, 'name'),
    'analyst_email': (Contact, 'email'),
    'analysis_type': (AnalysisType, 'name'),
    'primary_ag_product': (PrimaryAgProduct, 'name'),
    'provider_code': (Provider, 'codename')
}

logger.info('Starting data normalization...')
normalized_dataframes = normalize_dataframes(coerced_dataframes, NORMALIZE_COLUMNS)
logger.info('Data normalization complete.')

# --- 5. Display Results ---
logger.info('Displaying results of normalization...')
for i, df in enumerate(normalized_dataframes):
    print(f'--- Normalized DataFrame {i+1} ---')
    display(df.head())

2026-01-08 16:39:40,604 - INFO - Starting data extraction...
2026-01-08 16:39:40,645 - INFO - HTTP Request: GET http://127.0.0.1:4200/api/admin/version "HTTP/1.1 200 OK"
2026-01-08 16:39:40,666 - INFO - Extracting raw data from '03.1-Proximate' in 'Aim 1-Feedstock Collection and Processing Data-BioCirV'...
2026-01-08 16:39:42,460 - INFO - Successfully extracted raw data.
2026-01-08 16:39:42,463 - INFO - Finished in state Completed()
2026-01-08 16:39:42,474 - INFO - HTTP Request: GET http://127.0.0.1:4200/api/admin/version "HTTP/1.1 200 OK"
2026-01-08 16:39:42,485 - INFO - Extracting raw data from '03.7-Ultimate' in 'Aim 1-Feedstock Collection and Processing Data-BioCirV'...
2026-01-08 16:39:42,704 - INFO - HTTP Request: GET http://127.0.0.1:4200/api/csrf-token?client=80b874cb-867b-41e0-92e3-f77a8bd8b74a "HTTP/1.1 422 Unprocessable Entity"
2026-01-08 16:39:42,718 - INFO - HTTP Request: POST http://127.0.0.1:4200/api/logs/ "HTTP/1.1 201 Created"
2026-01-08 16:39:43,653 - INFO - Successfu

--- Normalized DataFrame 1 ---


Unnamed: 0,index,field_sample_name,fv_date_time,sampling_location,sampling_street,sampling_city,sampling_zip,sampling_latlong,sample_ts,sample_source,...,treatment_notes,soil_type,crop_variety,crop_cultivar,production_notes,field_storage_location,field_storage_conditions,resource_id,sample_unit_id,provider_code_id
0,ebd7b1f2,pos-alf033,6/30/2025 10:30,"borba rd, equipment yard on west side",6871 borba rd,stockton,95206,"37.897784, -121.360592",2025-06-30 10:45:00,small bales from 3-year-old stand,...,,,,,prod_date is approximate. crop was baled in j...,,,46,6,16
1,309299a1,pos-alf033,6/30/2025 10:30,"borba rd, equipment yard on west side",6871 borba rd,stockton,95206,"37.897784, -121.360592",2025-06-30 10:45:00,small bales from 3-year-old stand,...,,,,,prod_date is approximate. crop was baled in j...,,,46,10,16
2,64aa3698,,6/30/2025 10:30,"borba rd, equipment yard on west side",6871 borba rd,stockton,95206,"37.897784, -121.360592",2025-06-30 10:45:00,small bales,...,,,,,prod_date is approximate. crop was baled in j...,,,44,6,16
3,b05f116c,,6/30/2025 10:30,"borba rd, equipment yard on west side",6871 borba rd,stockton,95206,"37.897784, -121.360592",2025-06-30 10:45:00,small bales,...,,,,,prod_date is approximate. crop was baled in j...,,,44,10,16
4,21c2b270,pos-wst034,6/30/2025 10:30,"field just west of mussi shop, muller rd",4400 w. muller rd,stockton,95206,"37.904889, -121.367878",2025-06-30 11:15:00,small bales,...,,,,,prod_date is approximate. crop was baled in j...,,,34,6,16


2026-01-08 16:39:58,800 - INFO - HTTP Request: POST http://127.0.0.1:4200/api/logs/ "HTTP/1.1 201 Created"


### Deployment Plan

The code in this notebook will be transitioned to the main ETL pipeline by following these steps:

1.  **Function Migration**: The `clean_the_gsheets` and `normalize_dataframes` functions will be moved to a new utility module, for example, `src/ca_biositing/pipeline/ca_biositing/pipeline/utils/etl_utils.py`. Each function will be decorated with `@task` from Prefect to turn it into a reusable pipeline component.
2.  **Flow Creation**: A new Prefect flow will be created in the `src/ca_biositing/pipeline/ca_biositing/pipeline/flows/` directory (e.g., `master_extraction_flow.py`). This flow will orchestrate the entire ETL process for a given data source.
3.  **Task Integration**: The new flow will be composed of individual tasks. It will call the existing extraction tasks (`proximate.extract`, etc.), and then pass the results to the new cleaning and normalization tasks from `etl_utils.py`.
4.  **Logging**: The `logging` module will be replaced with `get_run_logger()` from Prefect within the tasks to ensure logs are captured by the Prefect UI.
5.  **Configuration**: The `NORMALIZE_COLUMNS` dictionary will be moved to a configuration file or defined within the relevant flow to make it easier to manage and modify without changing the code.
6.  **Testing**: Unit tests will be written for the new utility functions in `etl_utils.py`. An integration test will be created for the new Prefect flow to ensure all the tasks work together correctly.
7.  **Deployment**: Once the flow is complete and tested, it will be deployed to the Prefect server using the `pixi run deploy` command, making it available to be run on a schedule or manually via the UI.
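
For step 4, one way to let the same functions run both in this notebook and inside a Prefect task is a small logger helper that falls back to standard `logging`. This is a sketch under assumptions: `get_etl_logger` and the logger name are hypothetical, and the broad `except` is deliberate so the fallback applies whether Prefect is not installed or no flow/task run is active.

```python
import logging


def get_etl_logger(name="ca_biositing.etl"):
    """Return Prefect's run logger when available, else a standard logger.

    Hypothetical helper for illustration only.
    """
    try:
        # Imported lazily so the notebook still works without Prefect installed.
        from prefect import get_run_logger

        return get_run_logger()
    except Exception:
        # Prefect missing, or called outside a flow/task run context.
        return logging.getLogger(name)


logger = get_etl_logger()
logger.info("Logger ready.")
```

With a helper like this, functions such as `clean_the_gsheets` could call `get_etl_logger()` at the start of each run instead of relying on a module-level `logger`.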

In [None]:
samplemetadata_df = samplemetadata.extract(project_root=project_root)

samplemetadata_df

### TRANSFORM

In [None]:
from src.ca_biositing.pipeline.ca_biositing.pipeline.etl.extract import basic_sample_info
from src.ca_biositing.pipeline.ca_biositing.pipeline.etl.transform.products import primary_ag_product

# Extract the required data
basic_sample_df = basic_sample_info.extract_basic_sample_info(project_root=project_root)

# Prepare the data_sources dictionary (key must match EXTRACT_SOURCES in the transform module)
data_sources = {"basic_sample_info": basic_sample_df}

# Call the transform function 
transformed_df = primary_ag_product.transform_products_primary_ag_product(data_sources)

# Display or further process the result
if transformed_df is not None:
    print("Transformed data:")
    display(transformed_df.head())
else:
    print("Transformation failed.")



In [9]:
normalized_dataframes[0]

Unnamed: 0,index,field_sample_name,fv_date_time,sampling_location,sampling_street,sampling_city,sampling_zip,sampling_latlong,sample_ts,sample_source,...,treatment_notes,soil_type,crop_variety,crop_cultivar,production_notes,field_storage_location,field_storage_conditions,resource_id,sample_unit_id,provider_code_id
0,ebd7b1f2,pos-alf033,6/30/2025 10:30,"borba rd, equipment yard on west side",6871 borba rd,stockton,95206,"37.897784, -121.360592",2025-06-30 10:45:00,small bales from 3-year-old stand,...,,,,,prod_date is approximate. crop was baled in j...,,,46,6,16
1,309299a1,pos-alf033,6/30/2025 10:30,"borba rd, equipment yard on west side",6871 borba rd,stockton,95206,"37.897784, -121.360592",2025-06-30 10:45:00,small bales from 3-year-old stand,...,,,,,prod_date is approximate. crop was baled in j...,,,46,10,16
2,64aa3698,,6/30/2025 10:30,"borba rd, equipment yard on west side",6871 borba rd,stockton,95206,"37.897784, -121.360592",2025-06-30 10:45:00,small bales,...,,,,,prod_date is approximate. crop was baled in j...,,,44,6,16
3,b05f116c,,6/30/2025 10:30,"borba rd, equipment yard on west side",6871 borba rd,stockton,95206,"37.897784, -121.360592",2025-06-30 10:45:00,small bales,...,,,,,prod_date is approximate. crop was baled in j...,,,44,10,16
4,21c2b270,pos-wst034,6/30/2025 10:30,"field just west of mussi shop, muller rd",4400 w. muller rd,stockton,95206,"37.904889, -121.367878",2025-06-30 11:15:00,small bales,...,,,,,prod_date is approximate. crop was baled in j...,,,34,6,16
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
90,b48a5263,cut-washf089,10/28/2025 10:30,processing facility,1050 diamond st,stockton,95205,"37.9491361,-121.2510627",2025-10-28 10:53:00,stockpiles,...,,,,,,cooperative extension merced county,room temp,43,6,9
91,0be4bedc,cut-washd090,10/28/2025 10:30,processing facility,1050 diamond st,stockton,95205,"37.9491361,-121.2510627",2025-10-28 10:46:00,stockpiles,...,,,,,,cooperative extension merced county,room temp,37,6,9
92,623d31cb,exc-olpm091,11/4/2025 16:20,processing facility,10201 live oak road,stockton,95212,"38.0866249,-121.1872729",2025-11-04 17:22:00,discharge from decanter,...,,,arbosana,,,,4c,27,6,14
93,a2917a6b,exc-olst092,11/4/2025 16:20,processing facility,10201 live oak road,stockton,95212,"38.0866249,-121.1872729",2025-11-04 17:35:00,outdoor container (dumpster),...,,,,,,,4c,3,6,14


## Geospatial Cleaning: Latitude/Longitude Standardization

The `geospatial` module in `cleaning_functions` provides tools for parsing and standardizing latitude/longitude columns. It is useful for datasets that mix naming conventions and formats, such as separate `sampling_lat`/`sampling_lon` columns alongside a combined `prod_location` column.

See detailed examples in: `src/ca_biositing/pipeline/ca_biositing/pipeline/utils/geospatial_cleaning.ipynb`
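
As a rough illustration of what parsing a combined column involves, the sketch below splits a `"lat, lon"` string and validates the coordinate ranges. It is not the logic of `standardize_latlon`; `parse_latlon` is a hypothetical helper, and the accepted format (two comma-separated decimal numbers) is an assumption.

```python
from typing import Optional, Tuple


def parse_latlon(text) -> Optional[Tuple[float, float]]:
    """Parse a 'lat, lon' string into floats, or return None if it is not a valid pair."""
    if not isinstance(text, str):
        return None
    parts = [part.strip() for part in text.split(",")]
    if len(parts) != 2:
        return None
    try:
        lat, lon = float(parts[0]), float(parts[1])
    except ValueError:
        # Free text such as a street address ends up here.
        return None
    # Reject out-of-range values (this also rejects NaN and infinity).
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        return None
    return lat, lon
```

A column whose name merely looks geographic but holds free text (for example, a location description) yields `None` for every row under this rule, which is worth checking before a detected column is dropped.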

In [23]:
# Geospatial cleaning example: standardize all lat/lon columns at once
from src.ca_biositing.pipeline.ca_biositing.pipeline.utils.cleaning_functions import (
    standardize_latlon,
    detect_latlon_columns,
)

# Example DataFrame with mixed lat/lon formats
sample_geo_df = pd.DataFrame({
    'site_id': [1, 2, 3, 4, 5],
    'sampling_lat': [40.7128, 34.0522, 37.7749, '41.8781', None],
    'sampling_lon': [-74.0060, -118.2437, -122.4194, '-87.6298', -120.0],
    'prod_location': ['40.5,-74.0', '34.2,-118.5', None, '', '37.5,-122.5'],
    'name': ['NYC', 'LA', 'SF', 'Chicago', 'Sacramento'],
})

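# NOTE: the calls below run on `df` (the normalized sample metadata, i.e. `normalized_dataframes[0]`),
# not on `sample_geo_df` defined above.
print("Before standardization:")
df  # a bare expression in the middle of a cell is not rendered; wrap it in display() to show it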


# Auto-detect all lat/lon columns and standardize
sample_geo_standardized = standardize_latlon(
    df,
    auto_detect=True,
    output_lat='lat',
    output_lon='lon',
    coerce_to_float=True
)

print("\nAfter standardization:")
print(f"\nData types: {sample_geo_standardized[['lat', 'lon']].dtypes.to_dict()}")
sample_geo_standardized


2026-01-08 16:45:38,804 - INFO - Detected combined lat/lon column: 'sampling_location'
2026-01-08 16:45:38,804 - INFO - Detected combined lat/lon column: 'sampling_latlong'
2026-01-08 16:45:38,805 - INFO - Detected combined lat/lon column: 'prod_location'
2026-01-08 16:45:38,806 - INFO - Detected combined lat/lon column: 'prod_latlong'
2026-01-08 16:45:38,806 - INFO - Detected combined lat/lon column: 'field_storage_location'
2026-01-08 16:45:38,807 - INFO - Splitting combined lat/lon column 'sampling_location' into 'lat' and 'lon'
2026-01-08 16:45:38,836 - INFO - Dropped original column 'sampling_location'
2026-01-08 16:45:38,836 - INFO - Successfully parsed 0/95 lat/lon pairs
2026-01-08 16:45:38,836 - INFO - Splitting combined lat/lon column 'sampling_latlong' into 'lat' and 'lon'
2026-01-08 16:45:38,838 - INFO - Dropped original column 'sampling_latlong'
2026-01-08 16:45:38,838 - INFO - Successfully parsed 94/95 lat/lon pairs
2026-01-08 16:45:38,839 - INFO - Splitting combined lat/l

Before standardization:

After standardization:

Data types: {'lat': dtype('float64'), 'lon': dtype('float64')}


Unnamed: 0,index,field_sample_name,fv_date_time,sampling_street,sampling_city,sampling_zip,sample_ts,sample_source,processing_method,storage_mode,...,soil_type,crop_variety,crop_cultivar,production_notes,field_storage_conditions,resource_id,sample_unit_id,provider_code_id,lat,lon
0,ebd7b1f2,pos-alf033,6/30/2025 10:30,6871 borba rd,stockton,95206,2025-06-30 10:45:00,small bales from 3-year-old stand,,,...,,,,prod_date is approximate. crop was baled in j...,,46,6,16,,
1,309299a1,pos-alf033,6/30/2025 10:30,6871 borba rd,stockton,95206,2025-06-30 10:45:00,small bales from 3-year-old stand,,,...,,,,prod_date is approximate. crop was baled in j...,,46,10,16,,
2,64aa3698,,6/30/2025 10:30,6871 borba rd,stockton,95206,2025-06-30 10:45:00,small bales,,,...,,,,prod_date is approximate. crop was baled in j...,,44,6,16,,
3,b05f116c,,6/30/2025 10:30,6871 borba rd,stockton,95206,2025-06-30 10:45:00,small bales,,,...,,,,prod_date is approximate. crop was baled in j...,,44,10,16,,
4,21c2b270,pos-wst034,6/30/2025 10:30,4400 w. muller rd,stockton,95206,2025-06-30 11:15:00,small bales,,,...,,,,prod_date is approximate. crop was baled in j...,,34,6,16,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
90,b48a5263,cut-washf089,10/28/2025 10:30,1050 diamond st,stockton,95205,2025-10-28 10:53:00,stockpiles,unknown,piled,...,,,,,room temp,43,6,9,,
91,0be4bedc,cut-washd090,10/28/2025 10:30,1050 diamond st,stockton,95205,2025-10-28 10:46:00,stockpiles,unknown,piled and turned,...,,,,,room temp,37,6,9,,
92,623d31cb,exc-olpm091,11/4/2025 16:20,10201 live oak road,stockton,95212,2025-11-04 17:22:00,discharge from decanter,"decanter, following crushing in hammer mill an...",none; sample drawn directly from line,...,,arbosana,,,4c,27,6,14,,
93,a2917a6b,exc-olst092,11/4/2025 16:20,10201 live oak road,stockton,95212,2025-11-04 17:35:00,outdoor container (dumpster),blower and optical sorter,dumpster,...,,,,,4c,3,6,14,,


In [15]:
df = normalized_dataframes[0]