# Database Joins and Denormalized Views

This notebook demonstrates how to use SQLAlchemy to perform complex joins on the project's data models and create denormalized views suitable for API endpoints.

It replaces the older `database_interaction.ipynb` and uses the new generated data models.

**Note:** Ensure you are running this notebook in the project's Pixi environment where `ca-biositing-datamodels` is installed.

In [None]:
# Verify Kernel
import sys

print(f"Python Executable: {sys.executable}")

# Check if we are in the pixi environment (path should contain .pixi)
if ".pixi" in sys.executable:
    print("✅ You are running in a Pixi environment.")
else:
    print("⚠️ You might be running in the wrong kernel. Please select the 'default' kernel created by Pixi.")

In [None]:
import pandas as pd
from sqlalchemy import create_engine, select
from sqlalchemy.orm import Session

# Import models from ca_biositing.py as the source of truth
from ca_biositing.datamodels.schemas.generated.ca_biositing import *

print("Libraries and models imported successfully.")

In [None]:
# Database Connection
DATABASE_URL = "postgresql+psycopg2://biocirv_user:biocirv_dev_password@localhost:5432/biocirv_db"
# create_engine() is lazy: no connection is opened until the first query runs.
engine = create_engine(DATABASE_URL)
print("Engine created.")
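
Because `create_engine` does not open a connection by itself, a small round-trip query is one way to confirm that the database is reachable before running the joins. The sketch below is illustrative only: the helper name `check_connection` is hypothetical, and an in-memory SQLite URL stands in for the project's PostgreSQL instance so the cell runs anywhere.

```python
from sqlalchemy import create_engine, text


def check_connection(engine) -> bool:
    """Return True if a trivial query round-trips through the engine."""
    with engine.connect() as conn:
        return conn.execute(text("SELECT 1")).scalar() == 1


# Stand-in URL (assumption): pass the real `engine` to test the project database.
demo_engine = create_engine("sqlite:///:memory:")
connection_ok = check_connection(demo_engine)
print(f"Connection OK: {connection_ok}")
```

Calling `check_connection(engine)` with the engine created above should raise an `OperationalError` if the server is down or the credentials are wrong.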

## 1. Join LandIQ Record and Observation

We join the `LandiqRecord` and `Observation` tables on `observation.record_id = landiq_record.id`.

In [None]:
# Create the join query
stmt = select(LandiqRecord, Observation).join(
    Observation,
    Observation.record_id == LandiqRecord.id
)

print("SQL Query:")
print(stmt)
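
`select(...).join(...)` produces an inner join, so records without a matching observation are dropped from the result. The following sketch shows that behavior on a throwaway in-memory SQLite database. `DemoRecord` and `DemoObservation` are hypothetical stand-ins with only a few columns, not the project's generated models.

```python
from sqlalchemy import Column, ForeignKey, Integer, String, create_engine, select
from sqlalchemy.orm import Session, declarative_base

DemoBase = declarative_base()


class DemoRecord(DemoBase):
    """Hypothetical stand-in for LandiqRecord."""
    __tablename__ = "demo_record"
    id = Column(Integer, primary_key=True)
    main_crop = Column(String)


class DemoObservation(DemoBase):
    """Hypothetical stand-in for Observation."""
    __tablename__ = "demo_observation"
    id = Column(Integer, primary_key=True)
    record_id = Column(Integer, ForeignKey("demo_record.id"))
    value = Column(Integer)


demo_engine = create_engine("sqlite:///:memory:")
DemoBase.metadata.create_all(demo_engine)

with Session(demo_engine) as session:
    session.add_all([
        DemoRecord(id=1, main_crop="almonds"),
        DemoRecord(id=2, main_crop="rice"),  # no observations: excluded by the join
        DemoObservation(id=10, record_id=1, value=5),
        DemoObservation(id=11, record_id=1, value=7),
    ])
    session.commit()

    demo_stmt = select(DemoRecord, DemoObservation).join(
        DemoObservation, DemoObservation.record_id == DemoRecord.id
    )
    # Each result row is a (DemoRecord, DemoObservation) pair.
    pairs = [(rec.id, obs.id) for rec, obs in session.execute(demo_stmt)]

print(sorted(pairs))
```

If records without observations should be kept, passing `isouter=True` to `.join()` turns this into a left outer join.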

In [None]:
# Join each LandIQ record to its primary crop (first 10 rows only)
stmt2 = (
    select(LandiqRecord, PrimaryCrop)
    .join(PrimaryCrop, LandiqRecord.main_crop == PrimaryCrop.id)
    .limit(10)
)

print(stmt2)

df2 = pd.read_sql(stmt2, engine)
df2.columns

In [None]:
# Execute and load into DataFrame
try:
    with engine.connect() as conn:
        # Note: both tables have an 'id' column. pandas does not rename duplicates itself;
        # SQLAlchemy 1.4+ disambiguates the labels in the compiled SQL (typically 'id' and
        # 'id_1'), so check df.columns before selecting by name.
        df = pd.read_sql(stmt, conn)

    print(f"Loaded {len(df)} rows.")
    display(df.head())
except Exception as e:
    print(f"Error executing query: {e}")
    raise  # Stop here; later cells depend on df

In [None]:
df.columns

In [None]:
df[["id", "dataset_id", "record_id", "main_crop", "parameter_id", "value", "unit_id"]]
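
For an API endpoint, it is often cleaner to select and label the needed columns explicitly than to select whole entities and pick columns afterwards, since every output name is then unique and intentional. The sketch below is an assumption-laden illustration: `demo_record` and `demo_observation` are hypothetical tables in an in-memory SQLite database, and the labels (`record_id`, `observation_id`) are made up for the example.

```python
from sqlalchemy import (
    Column, ForeignKey, Integer, MetaData, String, Table,
    create_engine, insert, select,
)

demo_metadata = MetaData()

# Hypothetical stand-ins for the landiq_record and observation tables.
demo_record = Table(
    "demo_record", demo_metadata,
    Column("id", Integer, primary_key=True),
    Column("main_crop", String),
)
demo_observation = Table(
    "demo_observation", demo_metadata,
    Column("id", Integer, primary_key=True),
    Column("record_id", Integer, ForeignKey("demo_record.id")),
    Column("value", Integer),
)

demo_engine = create_engine("sqlite:///:memory:")
demo_metadata.create_all(demo_engine)

# Label the two 'id' columns so the flat row has no ambiguous names.
flat_stmt = (
    select(
        demo_record.c.id.label("record_id"),
        demo_record.c.main_crop,
        demo_observation.c.id.label("observation_id"),
        demo_observation.c.value,
    )
    .select_from(demo_record)
    .join(demo_observation, demo_observation.c.record_id == demo_record.c.id)
    .order_by(demo_observation.c.id)
)

with demo_engine.begin() as conn:
    conn.execute(insert(demo_record), [{"id": 1, "main_crop": "almonds"}])
    conn.execute(
        insert(demo_observation),
        [{"id": 10, "record_id": 1, "value": 5}, {"id": 11, "record_id": 1, "value": 7}],
    )
    # Row._mapping gives a dict-like view keyed by the column labels.
    flat_rows = [dict(row._mapping) for row in conn.execute(flat_stmt)]

print(flat_rows)
```

The same pattern should carry over to the ORM models, for example `LandiqRecord.id.label(...)` and `Observation.id.label(...)`, and the labeled statement can be passed to `pd.read_sql` just like `stmt`.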

In [None]:
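# Three-way join: LandIQ record -> observation -> primary crop (first 10 rows only)
# select_from() makes the left side of the join explicit.
query = (
    select(Observation, LandiqRecord, PrimaryCrop)
    .select_from(LandiqRecord)
    .join(Observation, Observation.record_id == LandiqRecord.id)
    .join(PrimaryCrop, LandiqRecord.main_crop == PrimaryCrop.id)
    .limit(10)
)

df3 = pd.read_sql(query, engine)
df3.columns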

In [None]:
# Load the full landiq_record table for inspection (no LIMIT, so this may be large)
pd.read_sql(select(LandiqRecord), engine)