# Database Joins and Denormalized Views

This notebook demonstrates how to use SQLAlchemy to perform complex joins on the project's data models and create denormalized views suitable for API endpoints.

It replaces the older `database_interaction.ipynb` and uses the new generated data models.

**Note:** Ensure you are running this notebook in the project's Pixi environment where `ca-biositing-datamodels` is installed.

In [6]:
# Verify Kernel
import sys
print(f"Python Executable: {sys.executable}")

# Check if we are in the pixi environment (path should contain .pixi)
if ".pixi" in sys.executable:
    print("✅ You are running in a Pixi environment.")
else:
    print("⚠️ You might be running in the wrong kernel. Please select the 'default' kernel created by Pixi.")

Python Executable: /Users/pjsmitty301/ca-biositing/.pixi/envs/default/bin/python
✅ You are running in a Pixi environment.


In [16]:
import pandas as pd
from sqlalchemy import create_engine, select
from sqlalchemy.orm import Session

# Import models from ca_biositing.py as the source of truth
from ca_biositing.datamodels.schemas.generated.ca_biositing import *

print("Libraries and models imported successfully.")

Libraries and models imported successfully.


In [2]:
# Database Connection
DATABASE_URL = "postgresql+psycopg2://biocirv_user:biocirv_dev_password@localhost:5432/biocirv_db"
# create_engine is lazy: no connection is opened until the first query runs.
engine = create_engine(DATABASE_URL)
print("Connected to database.")

Connected to database.
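
Because `create_engine` only builds the engine and does not open a connection, the message above does not by itself prove the database is reachable. A minimal sketch of an explicit check follows. It uses an in-memory SQLite URL as a stand-in so that it runs anywhere, and the helper name `check_connection` is hypothetical, not part of the project.

```python
from sqlalchemy import create_engine, text


def check_connection(engine) -> bool:
    """Return True if a trivial query succeeds on a fresh connection."""
    try:
        with engine.connect() as conn:
            conn.execute(text("SELECT 1"))
        return True
    except Exception as exc:  # broad on purpose: any failure means "not reachable"
        print(f"Connection failed: {exc}")
        return False


# Stand-in URL for illustration only; in this notebook you would pass the
# existing `engine` built from DATABASE_URL instead.
demo_engine = create_engine("sqlite:///:memory:")
print(check_connection(demo_engine))
```

Running the check right after creating the engine surfaces bad credentials or a stopped database before any join is attempted.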


## 1. Join LandIQ Record and Observation

We join the `LandiqRecord` and `Observation` tables on `observation.record_id = landiq_record.id`.

In [None]:
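# Exploratory: join LandiqRecord to PrimaryCrop on main_crop, limited to 10 rows.
# This cell raised the TypeError shown in its output.
stmt2 = (
    select(LandiqRecord, PrimaryCrop)
    .join(PrimaryCrop, LandiqRecord.main_crop == PrimaryCrop.id)
    .limit(10)
)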

print(stmt2)

df2 = pd.read_sql(stmt2, engine)

df2.columns


TypeError: <class 'sqlalchemy.orm.decl_api.DeclarativeMeta'> is not a generic class

In [3]:
# Create the join query
stmt = select(LandiqRecord, Observation).join(
    Observation,
    Observation.record_id == LandiqRecord.id
)

print("SQL Query:")
print(stmt)

SQL Query:
SELECT landiq_record.dataset_id, landiq_record.polygon_id, landiq_record.main_crop, landiq_record.secondary_crop, landiq_record.tertiary_crop, landiq_record.quaternary_crop, landiq_record.confidence, landiq_record.irrigated, landiq_record.acres, landiq_record.version, landiq_record.note, landiq_record.test, landiq_record.id, landiq_record.created_at, landiq_record.updated_at, landiq_record.etl_run_id, landiq_record.lineage_group_id, observation.dataset_id AS dataset_id_1, observation.record_type, observation.record_id, observation.parameter_id, observation.value, observation.unit_id, observation.dimension_type_id, observation.dimension_value, observation.dimension_unit_id, observation.note AS note_1, observation.id AS id_1, observation.created_at AS created_at_1, observation.updated_at AS updated_at_1, observation.etl_run_id AS etl_run_id_1, observation.lineage_group_id AS lineage_group_id_1 
FROM landiq_record JOIN observation ON observation.record_id = landiq_record.id


In [None]:
# Execute and load into DataFrame
try:
    with Session(engine) as session:
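        # Use pandas read_sql to execute the statement and fetch the results.
        # Note: SQLAlchemy disambiguates columns that share a name across the two
        # tables (such as 'id') by labeling the later one with a '_1' suffix, as
        # shown in the SQL above; pandas uses those labels as column names.
        df = pd.read_sql(stmt, session.bind)
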
        print(f"Loaded {len(df)} rows.")
        display(df.head())
except Exception as e:
    print(f"Error executing query: {e}")

Loaded 0 rows.


| dataset_id | polygon_id | main_crop | secondary_crop | tertiary_crop | quaternary_crop | confidence | irrigated | acres | version | ... | unit_id | dimension_type_id | dimension_value | dimension_unit_id | note_1 | id_1 | created_at_1 | updated_at_1 | etl_run_id_1 | lineage_group_id_1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
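
An inner join returns zero rows when either table is empty or when no `observation.record_id` matches a `landiq_record.id`. One way to tell these cases apart is to count each table and then count the parent rows that have no match, using a left outer join. The sketch below assumes cut-down, hypothetical stand-ins for the two tables (join columns only) on in-memory SQLite; the helper `diagnose_join` is illustrative and not part of the project.

```python
from sqlalchemy import Column, Integer, MetaData, Table, create_engine, func, select

metadata = MetaData()
# Hypothetical stand-ins: only the columns needed for the join.
record_t = Table("landiq_record", metadata, Column("id", Integer, primary_key=True))
obs_t = Table(
    "observation",
    metadata,
    Column("id", Integer, primary_key=True),
    Column("record_id", Integer),
)


def diagnose_join(conn):
    """Count rows per table and parent rows with no matching observation."""
    n_records = conn.execute(select(func.count()).select_from(record_t)).scalar_one()
    n_obs = conn.execute(select(func.count()).select_from(obs_t)).scalar_one()
    unmatched = conn.execute(
        select(func.count())
        .select_from(record_t.outerjoin(obs_t, obs_t.c.record_id == record_t.c.id))
        .where(obs_t.c.id.is_(None))  # NULL on the right side means no match
    ).scalar_one()
    return {
        "landiq_record": n_records,
        "observation": n_obs,
        "unmatched_records": unmatched,
    }


demo_engine = create_engine("sqlite:///:memory:")
metadata.create_all(demo_engine)
with demo_engine.begin() as conn:
    conn.execute(record_t.insert(), [{"id": 1}, {"id": 2}])
    # This observation points at a record id that does not exist.
    conn.execute(obs_t.insert(), [{"id": 10, "record_id": 99}])
    report = diagnose_join(conn)
print(report)
```

If both tables hold rows but every record is unmatched, the join key itself is the likely culprit rather than missing data.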


In [12]:
df.columns

Index(['dataset_id', 'polygon_id', 'main_crop', 'secondary_crop',
       'tertiary_crop', 'quaternary_crop', 'confidence', 'irrigated', 'acres',
       'version', 'note', 'test', 'id', 'created_at', 'updated_at',
       'etl_run_id', 'lineage_group_id', 'dataset_id_1', 'record_type',
       'record_id', 'parameter_id', 'value', 'unit_id', 'dimension_type_id',
       'dimension_value', 'dimension_unit_id', 'note_1', 'id_1',
       'created_at_1', 'updated_at_1', 'etl_run_id_1', 'lineage_group_id_1'],
      dtype='object')
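
The `_1` suffixes above come from SQLAlchemy disambiguating columns that share a name across the two tables. For an API-facing denormalized view, it may be clearer to select only the columns needed and name the ambiguous ones explicitly with `.label()`. A minimal sketch under stated assumptions: the tables are hypothetical cut-down stand-ins on in-memory SQLite, and the label names (`landiq_record_id`, `observation_id`) are illustrative choices rather than project conventions.

```python
from sqlalchemy import Column, Float, Integer, MetaData, Table, create_engine, select

metadata = MetaData()
# Hypothetical stand-ins with a small subset of the real columns.
record_t = Table(
    "landiq_record",
    metadata,
    Column("id", Integer, primary_key=True),
    Column("main_crop", Integer),
)
obs_t = Table(
    "observation",
    metadata,
    Column("id", Integer, primary_key=True),
    Column("record_id", Integer),
    Column("value", Float),
)

# Select a narrow set of columns and give the two `id` columns explicit names.
view_stmt = select(
    record_t.c.id.label("landiq_record_id"),
    record_t.c.main_crop,
    obs_t.c.id.label("observation_id"),
    obs_t.c.value,
).join_from(record_t, obs_t, obs_t.c.record_id == record_t.c.id)

demo_engine = create_engine("sqlite:///:memory:")
metadata.create_all(demo_engine)
with demo_engine.begin() as conn:
    conn.execute(record_t.insert(), [{"id": 1, "main_crop": 7}])
    conn.execute(obs_t.insert(), [{"id": 10, "record_id": 1, "value": 2.5}])
    rows = [dict(row._mapping) for row in conn.execute(view_stmt)]
print(rows)
```

The same labeled statement can be passed to `pd.read_sql`, in which case the labels become the DataFrame column names.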

In [15]:
df[["id", "dataset_id", "record_id", "main_crop", "parameter_id", "value", "unit_id"]]

| id | dataset_id | record_id | main_crop | parameter_id | value | unit_id |
|---|---|---|---|---|---|---|
