# Update Plant Species Metadata in BigQuery

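This notebook is the first step toward updating plant species metadata in BigQuery from a CSV file stored in Google Cloud Storage (GCS). At this stage it loads the configuration, reads the CSV, and inspects the data; the table update itself is not implemented yet (see Next Steps).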

## Requirements
- Google Cloud credentials configured
- Environment variables `PLANT_SPECIES_CSV_URL` and `PLANT_SPECIES_TABLE_ID` set (see `config.example.py`)
- Required packages: google-cloud-bigquery, google-cloud-storage, pandas, python-dotenv, gcsfs (pandas needs it to read `gs://` URLs)


In [None]:
# Import required libraries
import os
import pandas as pd
from google.cloud import bigquery
from google.cloud import storage
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

print("Libraries imported successfully")


In [None]:
# Load configuration from environment variables
GCS_CSV_URL = os.getenv('PLANT_SPECIES_CSV_URL')
BQ_TABLE_ID = os.getenv('PLANT_SPECIES_TABLE_ID')

# Verify environment variables are set
if not GCS_CSV_URL:
    raise ValueError("PLANT_SPECIES_CSV_URL environment variable not set")
if not BQ_TABLE_ID:
    raise ValueError("PLANT_SPECIES_TABLE_ID environment variable not set")

print("Configuration loaded:")
print(f"  CSV URL configured: {bool(GCS_CSV_URL)}")
print(f"  Table ID configured: {bool(BQ_TABLE_ID)}")
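
The check above only confirms that both variables are non-empty. A stricter check could also confirm their shape before any network call is made. The sketch below is one possible way to do that, not part of the notebook itself: the helper names `split_gcs_url` and `split_table_id` are hypothetical, and it assumes the table ID is fully qualified as `project.dataset.table`, which the notebook does not state.

```python
from urllib.parse import urlparse


def split_gcs_url(url: str) -> tuple[str, str]:
    """Split a gs://bucket/path URL into (bucket, object_path)."""
    parsed = urlparse(url)
    bucket = parsed.netloc
    object_path = parsed.path.lstrip("/")
    if parsed.scheme != "gs" or not bucket or not object_path:
        raise ValueError(f"Expected gs://bucket-name/path/to/file.csv, got: {url!r}")
    return bucket, object_path


def split_table_id(table_id: str) -> tuple[str, str, str]:
    """Split a table ID into (project, dataset, table).

    Assumption: the ID is fully qualified. Shorter forms are rejected here.
    """
    parts = table_id.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"Expected project.dataset.table, got: {table_id!r}")
    return parts[0], parts[1], parts[2]


# Example values are placeholders, not real resources.
bucket, object_path = split_gcs_url("gs://my-bucket/species/plants.csv")
project, dataset, table = split_table_id("my-project.botany.plant_species")
```

Failing early on a malformed value gives a clearer error than the one `pd.read_csv` or the BigQuery client would raise later.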


In [None]:
# Initialize BigQuery client
bq_client = bigquery.Client()
print(f"BigQuery client initialized for project: {bq_client.project}")


In [None]:
# Read CSV from GCS
# GCS_CSV_URL should be in format: gs://bucket-name/path/to/file.csv
print("Reading CSV from GCS...")
df = pd.read_csv(GCS_CSV_URL)

print("CSV loaded successfully:")
print(f"  Rows: {len(df)}")
print(f"  Columns: {list(df.columns)}")
print("\nFirst few rows:")
df.head()


In [None]:
# Display basic statistics about the data
print("Data Info:")
df.info()
print("\nData Description:")
df.describe()


## Next Steps

In the next iteration, we'll add:
1. Data validation and cleaning
2. BigQuery table update logic
3. Error handling and logging
4. Dry-run mode to preview changes
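
As a rough illustration of items 1 and 4, the sketch below shows one way validation and a dry-run preview might look using pandas alone. It is not the planned implementation: the function names, the `species_id` key, and the `common_name` column are all hypothetical, since the notebook does not describe the CSV schema. In practice `current` would come from the BigQuery table and `incoming` from the CSV.

```python
import pandas as pd


def validate_species(df: pd.DataFrame, key: str) -> list[str]:
    """Return human-readable problems; an empty list means the frame passed."""
    if key not in df.columns:
        return [f"missing key column: {key}"]
    problems = []
    missing = int(df[key].isna().sum())
    if missing:
        problems.append(f"{missing} row(s) have no {key}")
    if df[key].duplicated().any():
        problems.append(f"duplicate {key} values present")
    return problems


def preview_changes(current: pd.DataFrame, incoming: pd.DataFrame, key: str) -> dict[str, int]:
    """Dry run: count what an update would do, without writing anything.

    Assumes both frames already passed validate_species (unique, non-null keys).
    """
    old = current.set_index(key)
    new = incoming.set_index(key)
    shared = old.index.intersection(new.index)
    cols = old.columns.intersection(new.columns)
    # Compare shared rows on shared columns. NaN != NaN counts as a change here.
    changed = (old.loc[shared, cols] != new.loc[shared, cols]).any(axis=1)
    updates = int(changed.sum())
    return {
        "insert": len(new.index.difference(old.index)),
        "update": updates,
        "unchanged": len(shared) - updates,
        "missing_from_csv": len(old.index.difference(new.index)),
    }


# Made-up sample data for illustration only.
current = pd.DataFrame({"species_id": [1, 2, 3], "common_name": ["Oak", "Maple", "Pine"]})
incoming = pd.DataFrame({"species_id": [2, 3, 4], "common_name": ["Maple", "Scots Pine", "Birch"]})

problems = validate_species(incoming, key="species_id")
plan = preview_changes(current, incoming, key="species_id")
print(problems)
print(plan)
```

Keeping the preview as a pure function of two DataFrames means it can be tested without BigQuery access, and the write step can later be gated on the reported counts.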
