# Update Plant Species Metadata in BigQuery

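This notebook prepares an update of plant species metadata in BigQuery from a CSV file stored in Google Cloud Storage (GCS). In its current state it loads the configuration, initializes the BigQuery client, and reads and inspects the CSV; the table update itself is listed under Next Steps.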

## Requirements
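- Google Cloud credentials configured (for example via `gcloud auth application-default login`)
- Configuration file: copy `config.example.yml` to `config.yml` and fill in your values
- Required packages: `google-cloud-bigquery`, `google-cloud-storage`, `pandas`, `pyyaml`, and `gcsfs` (pandas needs it to read `gs://` URLs)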


In [None]:
# Import required libraries
import yaml
import pandas as pd
from pathlib import Path
from google.cloud import bigquery
from google.cloud import storage

print("Libraries imported successfully")


In [None]:
# Load configuration from YAML file
config_path = Path("../config.yml")

if not config_path.exists():
    raise FileNotFoundError(
        f"Configuration file not found: {config_path}\n"
        "Please copy config.example.yml to config.yml and fill in your values."
    )

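with open(config_path, 'r', encoding='utf-8') as f:
    # safe_load returns None for an empty file, so fall back to an empty dict
    config = yaml.safe_load(f) or {}

# Extract configuration values; missing keys become None and are caught by the checks below
GCS_CSV_URL = (config.get('gcs') or {}).get('csv_url')
BQ_TABLE_ID = (config.get('bigquery') or {}).get('table_id')
BQ_PROJECT = (config.get('bigquery') or {}).get('project')  # Optional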

# Verify required config values
if not GCS_CSV_URL or GCS_CSV_URL.startswith('gs://your-'):
    raise ValueError("Please configure gcs.csv_url in config.yml")
if not BQ_TABLE_ID or 'your-project' in BQ_TABLE_ID:
    raise ValueError("Please configure bigquery.table_id in config.yml")

print("✓ Configuration loaded successfully")
# Truncate long URLs so the output stays readable
csv_url_display = f"{GCS_CSV_URL[:50]}..." if len(GCS_CSV_URL) > 50 else GCS_CSV_URL
print(f"  CSV URL: {csv_url_display}")
print(f"  Table ID: {BQ_TABLE_ID}")
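
The placeholder checks in the cell above could also be collected into a small helper so they can be exercised without a real `config.yml`. This is a minimal sketch, assuming the same `gcs.csv_url` and `bigquery.table_id` keys; the function name `find_config_problems` is hypothetical and not part of this notebook.

```python
def find_config_problems(config):
    """Return a list of problems; an empty list means the config looks usable."""
    problems = []
    config = config or {}  # an empty YAML file loads as None

    # Missing sections or keys are treated the same as empty values
    csv_url = (config.get("gcs") or {}).get("csv_url")
    table_id = (config.get("bigquery") or {}).get("table_id")

    # Same placeholder checks as the configuration cell
    if not csv_url or csv_url.startswith("gs://your-"):
        problems.append("Please configure gcs.csv_url in config.yml")
    if not table_id or "your-project" in table_id:
        problems.append("Please configure bigquery.table_id in config.yml")
    return problems
```

Returning a list instead of raising on the first problem lets the notebook report every misconfigured value in one pass.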


In [None]:
# Initialize BigQuery client
# When no project is configured, the client infers it from the environment
bq_client = bigquery.Client(project=BQ_PROJECT or None)
print("✓ BigQuery client initialized")
print(f"  Project: {bq_client.project}")


In [None]:
# Read CSV from GCS
# GCS_CSV_URL should be in the format: gs://bucket-name/path/to/file.csv
# pandas reads gs:// URLs through the gcsfs package, which must be installed
print("Reading CSV from GCS...")
df = pd.read_csv(GCS_CSV_URL)

print("CSV loaded successfully:")
print(f"  Rows: {len(df)}")
print(f"  Columns: {list(df.columns)}")
print("\nFirst few rows:")
df.head()
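
The `gs://bucket-name/path/to/file.csv` format expected by the cell above can be checked before the read is attempted, which gives a clearer error than a failed download. The following is a sketch using only the standard library; `split_gcs_url` is a hypothetical helper name, not something this notebook defines.

```python
from urllib.parse import urlparse


def split_gcs_url(url):
    """Split 'gs://bucket/path/to/file.csv' into (bucket, object_path)."""
    parsed = urlparse(url)
    bucket = parsed.netloc
    object_path = parsed.path.lstrip("/")
    # Reject other schemes and URLs that lack a bucket or an object path
    if parsed.scheme != "gs" or not bucket or not object_path:
        raise ValueError(f"Expected gs://bucket-name/path/to/file.csv, got: {url!r}")
    return bucket, object_path
```

The returned bucket and object path are also the two pieces needed to address the same file through the `google.cloud.storage` client, should the notebook later read it that way.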


In [None]:
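# Display basic statistics about the data
print("Data Info:")
df.info()
print("\nData Description:")
# include="all" also summarizes text columns, which describe() skips by default when numeric columns are present
df.describe(include="all")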


## Next Steps

In the next iteration, we'll add:
1. Data validation and cleaning
2. BigQuery table update logic
3. Error handling and logging
4. Dry-run mode to preview changes
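
As an illustration of what the dry-run mode in item 4 might look like, the sketch below compares existing rows with incoming rows and reports what an update would do, without writing anything. It is not the planned implementation: the function name `preview_changes`, the `species_id` key, and the sample rows are all hypothetical, since the real column names are not known from this notebook.

```python
def preview_changes(current_rows, incoming_rows, key):
    """Classify incoming row dicts as insert, update, or unchanged by comparing on `key`."""
    current_by_key = {row[key]: row for row in current_rows}
    to_insert, to_update, unchanged = [], [], []
    for row in incoming_rows:
        existing = current_by_key.get(row[key])
        if existing is None:
            to_insert.append(row[key])   # key not present in the table yet
        elif existing != row:
            to_update.append(row[key])   # key present but some value differs
        else:
            unchanged.append(row[key])   # identical row, nothing to do
    return {"insert": to_insert, "update": to_update, "unchanged": unchanged}


# Hypothetical sample rows standing in for the table contents and the CSV
current = [
    {"species_id": 1, "common_name": "Oak"},
    {"species_id": 2, "common_name": "Maple"},
]
incoming = [
    {"species_id": 2, "common_name": "Sugar maple"},
    {"species_id": 3, "common_name": "Birch"},
    {"species_id": 1, "common_name": "Oak"},
]
summary = preview_changes(current, incoming, key="species_id")
print(summary)
```

With pandas, `df.to_dict("records")` produces row dicts in the shape this helper expects.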
