# Interactive Database Querying

This notebook demonstrates how to connect to the project's PostgreSQL database from a local Jupyter environment to perform interactive queries and data analysis. It is intended as an educational tool and a template for data exploration.

### Step 1: Import Necessary Libraries

We begin by importing the required libraries:
- `pandas` for data manipulation.
- `create_engine`, `Session`, and `select` from `sqlmodel` for database interaction.
- The specific SQLModel data models (`FieldSample`, `PrimaryProduct`, `GeographicLocation`) we want to query.

In [1]:
import pandas as pd
from sqlmodel import create_engine, Session, select

try:
    from ca_biositing.datamodels.biomass import FieldSample, PrimaryProduct
    from ca_biositing.datamodels.geographic_locations import GeographicLocation
    print("Libraries and models imported successfully.")
except ImportError as e:
    print(f"Error importing models: {e}")
    print("Please ensure the project is installed correctly by running 'pixi install' in your terminal.")

Libraries and models imported successfully.
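
As an optional pre-flight check, the sketch below reports which top-level packages are not importable before the model imports are attempted. The helper name `missing_packages` is hypothetical, and the package list is an assumption based on the imports in this cell.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level package names that cannot be found."""
    # find_spec returns None for a missing top-level module instead of raising.
    return [name for name in names if find_spec(name) is None]


# Top-level packages this notebook relies on (assumed from the imports above).
required = ["pandas", "sqlmodel", "ca_biositing"]
missing = missing_packages(required)
if missing:
    print(f"Missing packages: {', '.join(missing)}. Try running 'pixi install'.")
else:
    print("All required packages are importable.")
```

This only checks top-level names; a missing submodule such as a specific data model would still surface as an `ImportError` in the cell above.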


### Step 2: Create the Database Engine

Next, we create a SQLAlchemy engine to manage connections to the database.

**Note:** The database URL is hardcoded here so the notebook runs without extra configuration. It points to `localhost:5432`, the default port mapped to the project's Dockerized database. The password is masked when the URL is printed.

In [2]:
DATABASE_URL = "postgresql+psycopg2://biocirv_user:biocirv_dev_password@localhost:5432/biocirv_db"

try:
    engine = create_engine(DATABASE_URL, echo=False)
    print(f"Database engine created. Ready to connect to: {DATABASE_URL.replace('biocirv_dev_password', '<password>')}")
except Exception as e:
    print(f"Error creating database engine: {e}")

Database engine created. Ready to connect to: postgresql+psycopg2://biocirv_user:<password>@localhost:5432/biocirv_db
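
If hardcoding is not desirable, one alternative is to assemble the URL from environment variables and fall back to the values used above. This is a minimal sketch using only the standard library; the environment variable names (`BIOCIRV_DB_USER` and so on) and both helper functions are hypothetical and not part of the project.

```python
import os
from urllib.parse import quote_plus, urlsplit, urlunsplit


def build_database_url(env=os.environ):
    """Assemble the URL from (hypothetical) environment variables, with defaults."""
    user = env.get("BIOCIRV_DB_USER", "biocirv_user")
    password = env.get("BIOCIRV_DB_PASSWORD", "biocirv_dev_password")
    host = env.get("BIOCIRV_DB_HOST", "localhost")
    port = env.get("BIOCIRV_DB_PORT", "5432")
    name = env.get("BIOCIRV_DB_NAME", "biocirv_db")
    # quote_plus escapes characters such as '@' or ':' inside the password.
    return f"postgresql+psycopg2://{user}:{quote_plus(password)}@{host}:{port}/{name}"


def mask_password(url):
    """Return the URL with the password replaced by a placeholder for logging."""
    parts = urlsplit(url)
    if parts.password is None:
        return url
    host = parts.hostname or ""
    if parts.port is not None:
        host += f":{parts.port}"
    netloc = f"{parts.username}:<password>@{host}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


DATABASE_URL = build_database_url()
print(f"Ready to connect to: {mask_password(DATABASE_URL)}")
```

Parsing the URL to mask the password avoids depending on the literal password string, which the `str.replace` approach above requires.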


### Step 3: Query the Tables

Now we can use the engine to open a session and query our tables. We query the `field_samples`, `geographic_locations`, and `primary_product` tables and load each into its own pandas DataFrame.

In [3]:
field_samples_df = None
locations_df = None
primary_product_df = None

try:
    with Session(engine) as session:
        # Query the field_samples table
        statement_samples = select(FieldSample)
        results_samples = session.exec(statement_samples).all()
        field_samples_df = pd.DataFrame([sample.model_dump() for sample in results_samples])
        print(f"- Loaded {len(field_samples_df)} records from 'field_samples'.")
        
        # Query the geographic_locations table
        statement_locations = select(GeographicLocation)
        results_locations = session.exec(statement_locations).all()
        locations_df = pd.DataFrame([location.model_dump() for location in results_locations])
        print(f"- Loaded {len(locations_df)} records from 'geographic_locations'.")

        # Query the primary_product table
        statement_products = select(PrimaryProduct)
        results_products = session.exec(statement_products).all()
        primary_product_df = pd.DataFrame([product.model_dump() for product in results_products])
        print(f"- Loaded {len(primary_product_df)} records from 'primary_product'.")

except Exception as e:
    print(f"Error during database query: {e}")

- Loaded 0 records from 'field_samples'.
- Loaded 0 records from 'geographic_locations'.
- Loaded 18 records from 'primary_product'.
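
Note that building a DataFrame from an empty list of records, as happens for the two empty tables here, yields a frame with no columns at all. The sketch below shows one way to keep the column names by passing them explicitly. The helper `records_to_frame` and the column names are hypothetical stand-ins, not the project's actual schema.

```python
import pandas as pd


def records_to_frame(records, columns):
    """Build a DataFrame that keeps `columns` even when `records` is empty."""
    return pd.DataFrame(records, columns=columns)


# Hypothetical column names standing in for the real model fields.
columns = ["field_sample_id", "location_id"]

no_schema = pd.DataFrame([])                 # what an empty query result produces
with_schema = records_to_frame([], columns)  # empty, but the columns survive

print(list(no_schema.columns))    # []
print(list(with_schema.columns))  # ['field_sample_id', 'location_id']
```

Assuming the models are Pydantic v2 based (which the use of `model_dump()` suggests), the column list could probably be taken from `list(FieldSample.model_fields)`, though that is untested against this project.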


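pandas can also run a raw SQL query directly against the engine, without going through the SQLModel classes. Here we load the `analysis_types` table and index it by `analysis_type_id`.

In [8]:
analysis_types_df = pd.read_sql("SELECT * FROM analysis_types", engine)
analysis_types_df = analysis_types_df.set_index("analysis_type_id")

analysis_types_df.head()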

| analysis_type_id | analysis_name |
| --- | --- |
| 1 | Proximate analysis |
| 2 | ICP-OES |
| 3 | Chemical composition |
| 4 | XRF analysis |
| 5 | Ultimate analysis |
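
Loading a table and setting its index can also be done in a single call through the `index_col` parameter of `pd.read_sql`. The sketch below illustrates this against an in-memory SQLite database, used purely as a stand-in so the example runs without the project's PostgreSQL instance; the two rows are copied from the output above.

```python
import sqlite3

import pandas as pd

# In-memory SQLite stand-in so the sketch runs without the project database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE analysis_types (analysis_type_id INTEGER PRIMARY KEY, analysis_name TEXT)"
)
conn.executemany(
    "INSERT INTO analysis_types VALUES (?, ?)",
    [(1, "Proximate analysis"), (2, "ICP-OES")],
)

# index_col sets the index while reading, replacing a separate set_index call.
analysis_types = pd.read_sql(
    "SELECT * FROM analysis_types ORDER BY analysis_type_id",
    conn,
    index_col="analysis_type_id",
)
conn.close()

print(analysis_types)
```

Against the real database, the same `index_col` argument should work with the `engine` created in Step 2 in place of `conn`.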


### Step 4: Join the DataFrames

With the data loaded into DataFrames, we can combine `field_samples_df` and `locations_df` with a pandas `left` join on the `location_id` key.

The code guards against an empty database. A DataFrame built from an empty result list has no columns, so the join key does not exist and the merge would raise a `KeyError`. We therefore skip the join when `field_samples_df` is empty, and catch `KeyError` as a fallback for cases such as an empty `locations_df`.

In [7]:
merged_df = None

try:
    # Ensure dataframes are not None and not empty before merging
    if field_samples_df is not None and not field_samples_df.empty:
        merged_df = pd.merge(
            left=field_samples_df, 
            right=locations_df, 
            how='left', 
            on='location_id'
        )
        print("Join complete.")
    else:
        print("Skipping join because 'field_samples' table is empty.")
except KeyError as e:
    print(f"Error during join: Could not find join key {e}. This can happen if the tables are empty.")
except Exception as e:
    print(f"An unexpected error occurred during the join: {e}")

Skipping join because 'field_samples' table is empty.
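
Because both tables are empty in this run, the join itself never executes. The sketch below shows what a left join on `location_id` looks like on small, made-up frames, and uses two optional `pd.merge` parameters: `validate` to assert that each sample maps to at most one location, and `indicator` to flag samples with no matching location. All rows and the `site_name` column are hypothetical.

```python
import pandas as pd

# Hypothetical rows; the real tables have different columns and content.
samples = pd.DataFrame({"field_sample_id": [1, 2, 3], "location_id": [10, 10, 99]})
locations = pd.DataFrame({"location_id": [10, 20], "site_name": ["Site A", "Site B"]})

merged = pd.merge(
    left=samples,
    right=locations,
    how="left",
    on="location_id",
    validate="many_to_one",  # raises MergeError if location_id repeats on the right
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# Samples whose location_id has no match in the locations table.
unmatched = merged[merged["_merge"] == "left_only"]
print(merged)
print(f"{len(unmatched)} sample(s) without a matching location.")
```

A left join keeps every sample row, so unmatched samples appear with missing values in the location columns rather than being dropped.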


### Step 5: Display Results

Finally, we can display the head of the merged DataFrame to inspect the result.

In [6]:
if merged_df is not None:
    display(merged_df.head())
else:
    print("No data to display.")

No data to display.
