# Update Plant Species Metadata in BigQuery

This notebook updates plant species metadata in BigQuery from a CSV file stored in GCS.

## Requirements
- Google Cloud credentials configured
- Configuration file: copy `config.example.yml` to `config.yml` and fill in your values
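- Required packages: google-cloud-bigquery, google-cloud-storage, pandas, pyyaml, plus gcsfs (so pandas can read `gs://` URLs) and pyarrow and db-dtypes (used by the BigQuery DataFrame helpers)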
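
The configuration cell below reads `gcs.csv_url`, `gcs.backup_bucket`, `gcs.backup_prefix`, `bigquery.table_id`, and `bigquery.project` from `config.yml`. As a minimal sketch, the same checks can be collected into one helper that runs on the parsed YAML. `validate_config` and the sample values are hypothetical and not part of the notebook.

```python
def validate_config(config):
    """Return a list of problems found in a parsed config.yml (empty if none)."""
    problems = []
    gcs = config.get("gcs") or {}
    bq = config.get("bigquery") or {}

    # Same placeholder checks as the notebook's configuration cell
    csv_url = gcs.get("csv_url") or ""
    if not csv_url or csv_url.startswith("gs://your-"):
        problems.append("gcs.csv_url is missing or still a placeholder")

    table_id = bq.get("table_id") or ""
    if not table_id or "your-project" in table_id:
        problems.append("bigquery.table_id is missing or still a placeholder")

    # Not an error in the notebook, but worth surfacing before a full replace
    if not gcs.get("backup_bucket"):
        problems.append("gcs.backup_bucket is unset, so no backup will be taken")
    return problems


# Hypothetical values, for illustration only
sample = {
    "gcs": {"csv_url": "gs://example-bucket/species.csv", "backup_prefix": "backups"},
    "bigquery": {"table_id": "example-project.example_dataset.species"},
}
print(validate_config(sample))
```

Returning a list instead of raising on the first problem lets every issue be reported in one pass.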


In [24]:
# Import required libraries
import yaml
import pandas as pd
from pathlib import Path
from google.cloud import bigquery
from google.cloud import storage

print("Libraries imported successfully")


Libraries imported successfully


In [25]:
# Load configuration from YAML file
config_path = Path("../config.yml")

if not config_path.exists():
    raise FileNotFoundError(
        f"Configuration file not found: {config_path}\n"
        "Please copy config.example.yml to config.yml and fill in your values."
    )

with open(config_path, 'r') as f:
    config = yaml.safe_load(f)

# Extract configuration values
GCS_CSV_URL = config['gcs']['csv_url']
BACKUP_BUCKET = config['gcs'].get('backup_bucket')
BACKUP_PREFIX = config['gcs'].get('backup_prefix', 'backups')
BQ_TABLE_ID = config['bigquery']['table_id']
BQ_PROJECT = config['bigquery'].get('project')

# Verify required config values
if not GCS_CSV_URL or GCS_CSV_URL.startswith('gs://your-'):
    raise ValueError("Please configure gcs.csv_url in config.yml")
if not BQ_TABLE_ID or 'your-project' in BQ_TABLE_ID:
    raise ValueError("Please configure bigquery.table_id in config.yml")

print("✓ Configuration loaded successfully")
print(f"  CSV URL: {GCS_CSV_URL[:50]}..." if len(GCS_CSV_URL) > 50 else f"  CSV URL: {GCS_CSV_URL}")
print(f"  Table ID: {BQ_TABLE_ID}")
print(f"  Backup: gs://{BACKUP_BUCKET}/{BACKUP_PREFIX}" if BACKUP_BUCKET else "  Backup: Not configured")


✓ Configuration loaded successfully
  CSV URL: gs://mpg-data-warehouse/vegetation_species_metadat...
  Table ID: mpg-data-warehouse.vegetation_species_metadata.vegetation_species_metadata_source
  Backup: gs://mpg-data-warehouse/vegetation_species_metadata/bak


In [26]:
# Initialize clients
bq_client = bigquery.Client(project=BQ_PROJECT) if BQ_PROJECT else bigquery.Client()
storage_client = storage.Client(project=BQ_PROJECT) if BQ_PROJECT else storage.Client()

print(f"✓ Clients initialized")
print(f"  Project: {bq_client.project}")


✓ Clients initialized
  Project: mpg-data-warehouse


In [27]:
# Read CSV from GCS (new data)
print("Reading CSV from GCS...")
# Use latin-1 encoding to handle special characters that aren't valid UTF-8
df_new = pd.read_csv(GCS_CSV_URL, encoding='latin-1')

print(f"✓ CSV loaded successfully:")
print(f"  Rows: {len(df_new)}")
print(f"  Columns: {list(df_new.columns)}")
print(f"\nFirst few rows:")
df_new.head()


Reading CSV from GCS...
✓ CSV loaded successfully:
  Rows: 769
  Columns: ['__kp_PlantMetadata', '__kp_PlantCode', 'NameScientific', 'NameSynonym', 'NameCommon', 'NameFamily', 'NativeStatus', 'LifeCycle', 'LifeForm', 'zModificationTimestamp']

First few rows:


,__kp_PlantMetadata,__kp_PlantCode,NameScientific,NameSynonym,NameCommon,NameFamily,NativeStatus,LifeCycle,LifeForm,zModificationTimestamp
0,51,ERECON,Eremogone congesta,Arenaria congesta,ballhead sandwort,Caryophyllaceae,native,Perennial,forb,09/16/2025 10:20:26
1,355,MYOSTR,Myosotis stricta,"Myosotis micrantha, M. sylvatica",stiff forget-me-not,Boraginaceae,nonnative,Annual,forb,09/16/2025 10:19:26
2,384,PERGAI,Perideridia gairdneri,Perideridia montana,Gardner's yampah,Apiaceae,native,Perennial,forb,09/16/2025 10:18:33
3,230,EUPVIR,Euphorbia virgata,Euphorbia esula,leafy spurge,Euphorbiaceae,nonnative,Perennial,forb,09/16/2025 10:11:23
4,802,PHLO_SP,Phlox spp.,,phlox,Polemoniaceae,native,unknown,forb,09/04/2025 08:56:03
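
The read cell above decodes the CSV as `latin-1`. Latin-1 maps every byte to a character, so decoding with it never fails, but a file that is actually UTF-8 would load without error and with garbled non-ASCII characters. A minimal sketch of a stricter alternative, assuming the raw bytes are available: try UTF-8 first and fall back to Latin-1 only when that fails. `decode_csv_bytes` is a hypothetical helper and the sample strings are made up.

```python
def decode_csv_bytes(raw):
    """Decode CSV bytes, preferring UTF-8 and falling back to Latin-1.

    Returns a (text, encoding_used) tuple.
    """
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Latin-1 accepts any byte sequence, so this branch cannot fail
        return raw.decode("latin-1"), "latin-1"


print(decode_csv_bytes("café".encode("utf-8")))    # ('café', 'utf-8')
print(decode_csv_bytes("café".encode("latin-1")))  # ('café', 'latin-1')

# What happens when UTF-8 bytes are forced through Latin-1
print("café".encode("utf-8").decode("latin-1"))    # cafÃ©
```

The detected encoding name could then be passed to `pd.read_csv(..., encoding=...)`.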


## Transform CSV Data

Apply column transformations to match BigQuery schema:
- Rename columns to follow warehouse naming conventions
- Drop `zModificationTimestamp` column (not stored in warehouse)


In [28]:
# Define column mapping from CSV to BigQuery
column_mapping = {
    '__kp_PlantMetadata': 'key_plant_species',
    '__kp_PlantCode': 'key_plant_code',
    'NameScientific': 'plant_name_sci',
    'NameSynonym': 'plant_name_syn',
    'NameCommon': 'plant_name_common',
    'NameFamily': 'plant_name_family',
    'NativeStatus': 'plant_native_status',
    'LifeCycle': 'plant_life_cycle',
    'LifeForm': 'plant_life_form'
    # zModificationTimestamp is dropped (not included in mapping)
}

print("Column mapping:")
for csv_col, bq_col in column_mapping.items():
    print(f"  {csv_col:25s} → {bq_col}")


Column mapping:
  __kp_PlantMetadata        → key_plant_species
  __kp_PlantCode            → key_plant_code
  NameScientific            → plant_name_sci
  NameSynonym               → plant_name_syn
  NameCommon                → plant_name_common
  NameFamily                → plant_name_family
  NativeStatus              → plant_native_status
  LifeCycle                 → plant_life_cycle
  LifeForm                  → plant_life_form


In [29]:
# Verify CSV columns match expected schema
expected_csv_columns = set(column_mapping.keys()) | {'zModificationTimestamp'}
actual_csv_columns = set(df_new.columns)

if actual_csv_columns == expected_csv_columns:
    print("✓ CSV columns match expected schema")
else:
    print("⚠ CSV column differences detected:")
    if actual_csv_columns - expected_csv_columns:
        print(f"  Unexpected columns: {actual_csv_columns - expected_csv_columns}")
    if expected_csv_columns - actual_csv_columns:
        print(f"  Missing columns: {expected_csv_columns - actual_csv_columns}")
    
print(f"\nCSV columns: {list(df_new.columns)}")


✓ CSV columns match expected schema

CSV columns: ['__kp_PlantMetadata', '__kp_PlantCode', 'NameScientific', 'NameSynonym', 'NameCommon', 'NameFamily', 'NativeStatus', 'LifeCycle', 'LifeForm', 'zModificationTimestamp']


In [30]:
# Apply transformation: rename columns and drop zModificationTimestamp
df_transformed = df_new.copy()

# Select and rename columns in one step
df_transformed = df_transformed[list(column_mapping.keys())].rename(columns=column_mapping)

print("✓ Transformation applied")
print(f"  Original columns: {len(df_new.columns)}")
print(f"  Transformed columns: {len(df_transformed.columns)}")
print(f"  Dropped: zModificationTimestamp")
print(f"\nTransformed columns: {list(df_transformed.columns)}")
print(f"\nTransformed data preview:")
df_transformed.head()


✓ Transformation applied
  Original columns: 10
  Transformed columns: 9
  Dropped: zModificationTimestamp

Transformed columns: ['key_plant_species', 'key_plant_code', 'plant_name_sci', 'plant_name_syn', 'plant_name_common', 'plant_name_family', 'plant_native_status', 'plant_life_cycle', 'plant_life_form']

Transformed data preview:


,key_plant_species,key_plant_code,plant_name_sci,plant_name_syn,plant_name_common,plant_name_family,plant_native_status,plant_life_cycle,plant_life_form
0,51,ERECON,Eremogone congesta,Arenaria congesta,ballhead sandwort,Caryophyllaceae,native,Perennial,forb
1,355,MYOSTR,Myosotis stricta,"Myosotis micrantha, M. sylvatica",stiff forget-me-not,Boraginaceae,nonnative,Annual,forb
2,384,PERGAI,Perideridia gairdneri,Perideridia montana,Gardner's yampah,Apiaceae,native,Perennial,forb
3,230,EUPVIR,Euphorbia virgata,Euphorbia esula,leafy spurge,Euphorbiaceae,nonnative,Perennial,forb
4,802,PHLO_SP,Phlox spp.,,phlox,Polemoniaceae,native,unknown,forb


In [31]:
# Display basic statistics about the new data
print("New Data Info:")
df_new.info()
print("\nNew Data Description:")
df_new.describe()


New Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 769 entries, 0 to 768
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   __kp_PlantMetadata      769 non-null    int64 
 1   __kp_PlantCode          769 non-null    object
 2   NameScientific          765 non-null    object
 3   NameSynonym             95 non-null     object
 4   NameCommon              769 non-null    object
 5   NameFamily              767 non-null    object
 6   NativeStatus            769 non-null    object
 7   LifeCycle               769 non-null    object
 8   LifeForm                769 non-null    object
 9   zModificationTimestamp  769 non-null    object
dtypes: int64(1), object(9)
memory usage: 60.2+ KB

New Data Description:


,__kp_PlantMetadata
count,769.0
mean,394.241873
std,232.208775
min,1.0
25%,195.0
50%,388.0
75%,604.0
max,802.0


## Read Existing BigQuery Table

Load the current data from BigQuery to compare with the new data.


In [32]:
# Read existing data from BigQuery
print(f"Reading existing data from {BQ_TABLE_ID}...")
query = f"SELECT * FROM `{BQ_TABLE_ID}`"

from google.api_core.exceptions import NotFound

try:
    df_existing = bq_client.query(query).to_dataframe()
    print(f"✓ Existing table loaded:")
    print(f"  Rows: {len(df_existing)}")
    print(f"  Columns: {list(df_existing.columns)}")
except NotFound:
    # Only a missing table is expected; let auth, quota, or network errors surface
    print("⚠ Table not found; it will be created by the load step.")
    df_existing = None


Reading existing data from mpg-data-warehouse.vegetation_species_metadata.vegetation_species_metadata_source...
✓ Existing table loaded:
  Rows: 765
  Columns: ['key_plant_species', 'key_plant_code', 'plant_name_sci', 'plant_name_syn', 'plant_name_common', 'plant_name_family', 'plant_native_status', 'plant_life_cycle', 'plant_life_form']


In [33]:
# Display existing data (if available)
if df_existing is not None:
    print("Existing data sample:")
    display(df_existing.head())
    print("\nExisting Data Info:")
    df_existing.info()


Existing data sample:


,key_plant_species,key_plant_code,plant_name_sci,plant_name_syn,plant_name_common,plant_name_family,plant_native_status,plant_life_cycle,plant_life_form
0,360,NV,no vegetation,,no vegetation,,none,unknown,none
1,6,STINEL,Stipa nelsonii,Achnatherum nelsonii,Columbia needlegrass,Poaceae,native,perennial,graminoid
2,409,POAPAL,Poa palustris,,fowl bluegrass,Poaceae,native,perennial,graminoid
3,407,POACUS,Poa cusickii,,Cusick's bluegrass,Poaceae,native,perennial,graminoid
4,275,HORJUB,Hordeum jubatum,,foxtail barley,Poaceae,native,perennial,graminoid



Existing Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 765 entries, 0 to 764
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   key_plant_species    765 non-null    Int64 
 1   key_plant_code       765 non-null    object
 2   plant_name_sci       761 non-null    object
 3   plant_name_syn       85 non-null     object
 4   plant_name_common    765 non-null    object
 5   plant_name_family    765 non-null    object
 6   plant_native_status  765 non-null    object
 7   plant_life_cycle     765 non-null    object
 8   plant_life_form      765 non-null    object
dtypes: Int64(1), object(8)
memory usage: 54.7+ KB


## Compare Differences

Compare the new CSV data with the existing BigQuery table to identify changes.


In [34]:
# Compare datasets (using transformed data)
if df_existing is not None:
    print("=== Comparison Summary ===\n")
    
    # Row count comparison
    print(f"Row count:")
    print(f"  Existing: {len(df_existing)}")
    print(f"  New:      {len(df_transformed)}")
    print(f"  Diff:     {len(df_transformed) - len(df_existing):+d}\n")
    
    # Column comparison
    existing_cols = set(df_existing.columns)
    new_cols = set(df_transformed.columns)
    
    if existing_cols == new_cols:
        print(f"✓ Columns match ({len(new_cols)} columns)")
    else:
        print("⚠ Column differences detected:")
        if new_cols - existing_cols:
            print(f"  New columns: {new_cols - existing_cols}")
        if existing_cols - new_cols:
            print(f"  Removed columns: {existing_cols - new_cols}")
    
    print(f"\nColumns: {list(df_transformed.columns)}")
    
    # Data type comparison
    if existing_cols == new_cols:
        print(f"\nData types comparison:")
        for col in df_transformed.columns:
            existing_type = str(df_existing[col].dtype)
            new_type = str(df_transformed[col].dtype)
            match_symbol = "✓" if existing_type == new_type else "⚠"
            print(f"  {match_symbol} {col:25s} existing: {existing_type:10s} → new: {new_type:10s}")
else:
    print("No existing data to compare - this will be a new table creation.")


=== Comparison Summary ===

Row count:
  Existing: 765
  New:      769
  Diff:     +4

✓ Columns match (9 columns)

Columns: ['key_plant_species', 'key_plant_code', 'plant_name_sci', 'plant_name_syn', 'plant_name_common', 'plant_name_family', 'plant_native_status', 'plant_life_cycle', 'plant_life_form']

Data types comparison:
  ⚠ key_plant_species         existing: Int64      → new: int64     
  ✓ key_plant_code            existing: object     → new: object    
  ✓ plant_name_sci            existing: object     → new: object    
  ✓ plant_name_syn            existing: object     → new: object    
  ✓ plant_name_common         existing: object     → new: object    
  ✓ plant_name_family         existing: object     → new: object    
  ✓ plant_native_status       existing: object     → new: object    
  ✓ plant_life_cycle          existing: object     → new: object    
  ✓ plant_life_form           existing: object     → new: object    
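
The only mismatch flagged above is `Int64` versus `int64`. The existing table comes back with pandas' nullable integer dtype, while the CSV column is a plain NumPy integer; both hold the same whole numbers. If a clean dtype comparison is wanted, one option (an assumption, not something the notebook does) is to cast the key column before comparing. The frame below is toy data standing in for `df_transformed`.

```python
import pandas as pd

# Toy frame standing in for df_transformed
df_demo = pd.DataFrame({"key_plant_species": [51, 355, 384]})
print(df_demo["key_plant_species"].dtype)

# Cast to the nullable integer dtype so it matches the BigQuery result
df_demo["key_plant_species"] = df_demo["key_plant_species"].astype("Int64")
print(df_demo["key_plant_species"].dtype)
```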


In [35]:
# Identify new and removed records
# (run whenever an existing table is present: equal row counts can still hide swapped records)
if df_existing is not None:
    # Find records in new data that aren't in existing (based on key_plant_code)
    existing_keys = set(df_existing['key_plant_code'])
    new_keys = set(df_transformed['key_plant_code'])
    
    added_keys = new_keys - existing_keys
    removed_keys = existing_keys - new_keys
    
    if added_keys:
        print(f"✓ New records to add ({len(added_keys)}):")
        new_records = df_transformed[df_transformed['key_plant_code'].isin(added_keys)]
        display(new_records[['key_plant_code', 'plant_name_sci', 'plant_name_common']])
    
    if removed_keys:
        print(f"\n⚠ Records to remove ({len(removed_keys)}):")
        removed_records = df_existing[df_existing['key_plant_code'].isin(removed_keys)]
        display(removed_records[['key_plant_code', 'plant_name_sci', 'plant_name_common']])
    
    if not added_keys and not removed_keys:
        print("No records added or removed - only updates to existing records")


✓ New records to add (8):


,key_plant_code,plant_name_sci,plant_name_common
0,ERECON,Eremogone congesta,ballhead sandwort
1,MYOSTR,Myosotis stricta,stiff forget-me-not
2,PERGAI,Perideridia gairdneri,Gardner's yampah
3,EUPVIR,Euphorbia virgata,leafy spurge
4,PHLO_SP,Phlox spp.,phlox
5,CHOJUN,Chondrilla juncea,rush skeletonweed
6,MEDI_SP,Medicago spp.,unknown Medicago
17,MENALB,Mentzelia albicaulis,white-stem stickleaf



⚠ Records to remove (4):


,key_plant_code,plant_name_sci,plant_name_common
63,PERMON,Perideridia montana,Gardner's yampah
540,ARECON,Arenaria congesta,ballhead sandwort
696,MYOMIC,Myosotis micrantha,stiff forget-me-not
739,EUPESU,Euphorbia esula,leafy spurge
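
The cell above compares only the sets of `key_plant_code` values, so edits to rows that exist in both datasets are not reported. A minimal sketch of a cell-level comparison for the shared keys follows, assuming `key_plant_code` is unique in both frames. `normalize` and `field_changes` are hypothetical helpers, and the frames are toy data, not warehouse records.

```python
import pandas as pd


def normalize(df, key):
    """Index by key and turn every cell into a string, with missing values as ''."""
    indexed = df.set_index(key)
    return indexed.astype(object).where(indexed.notna(), "").astype(str)


def field_changes(existing, new, key="key_plant_code"):
    """List cell-level differences for keys present in both frames."""
    shared = sorted(set(existing[key]) & set(new[key]))
    old = normalize(existing, key).loc[shared]
    cur = normalize(new, key).loc[shared, old.columns]
    rows = []
    for col in old.columns:
        differs = old[col] != cur[col]
        for code in differs[differs].index:
            rows.append({key: code, "column": col,
                         "existing": old.at[code, col], "new": cur.at[code, col]})
    return pd.DataFrame(rows, columns=[key, "column", "existing", "new"])


# Toy data, for illustration only
existing_demo = pd.DataFrame({
    "key_plant_code": ["AAA", "BBB", "CCC"],
    "plant_name_common": ["alpha", "beta", "gamma"],
    "plant_name_syn": [None, "old beta", None],
})
new_demo = pd.DataFrame({
    "key_plant_code": ["BBB", "AAA", "DDD"],
    "plant_name_common": ["beta", "alpha grass", "delta"],
    "plant_name_syn": [float("nan"), None, None],
})
print(field_changes(existing_demo, new_demo))
```

Converting every cell to a string with missing values as `''` keeps `None`, `NaN`, and dtype differences such as `Int64` versus `int64` from showing up as false changes. In the notebook the call would be `field_changes(df_existing, df_transformed)`.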


## Backup Existing Table

Before making any changes, export the existing table to GCS as a backup.


In [None]:
# Backup existing table to GCS
from datetime import datetime

if df_existing is not None and BACKUP_BUCKET:
    # Generate backup path with timestamp
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    backup_path = f"gs://{BACKUP_BUCKET}/{BACKUP_PREFIX}/{timestamp}/*.csv"
    
    print(f"Creating backup of existing table...")
    print(f"  Destination: {backup_path}")
    
    # Export table to GCS
    extract_job = bq_client.extract_table(
        BQ_TABLE_ID,
        backup_path,
        location="US"
    )
    
    extract_job.result()  # Wait for job to complete
    
    print(f"✓ Backup completed successfully")
    print(f"  Files: {backup_path}")
elif df_existing is None:
    print("⚠ No existing table to back up (table doesn't exist yet)")
elif not BACKUP_BUCKET:
    print("⚠ Backup bucket not configured in config.yml")
    print("  Set 'gcs.backup_bucket' to enable automatic backups")
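
`extract_job.result()` raises if the export fails, so reaching the success message already means the job finished. As an extra check before the table is overwritten, the exported objects can be listed. A minimal sketch, assuming a client that exposes `list_blobs(bucket, prefix=...)` as `google.cloud.storage.Client` does; `list_backup_files` is a hypothetical helper, and the stand-in client exists only so the example runs offline.

```python
from types import SimpleNamespace


def list_backup_files(client, bucket_name, prefix):
    """Return (name, size) for each non-empty CSV object under the prefix."""
    blobs = client.list_blobs(bucket_name, prefix=prefix)
    return [(b.name, b.size) for b in blobs if b.name.endswith(".csv") and b.size]


class FakeStorageClient:
    """Offline stand-in for google.cloud.storage.Client (illustration only)."""

    def __init__(self, blobs):
        self._blobs = blobs

    def list_blobs(self, bucket_name, prefix=None):
        return [b for b in self._blobs if b.name.startswith(prefix or "")]


fake = FakeStorageClient([
    SimpleNamespace(name="backups/20250101_120000/000000000000.csv", size=61234),
    SimpleNamespace(name="backups/20250101_120000/empty.csv", size=0),
    SimpleNamespace(name="other/file.csv", size=10),
])
files = list_backup_files(fake, "example-bucket", "backups/20250101_120000/")
print(files)
```

In the notebook the call would be `list_backup_files(storage_client, BACKUP_BUCKET, f"{BACKUP_PREFIX}/{timestamp}/")`, and an empty result would be a reason to stop before the update.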


## Update BigQuery Table

⚠️ **IMPORTANT**: This will replace the entire table with the new data.

Review the comparison above before proceeding. A backup was created in the previous step only if `gcs.backup_bucket` is configured and the table already existed.
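
Because the load replaces the whole table, it can help to fail fast on obviously bad input first. A minimal sketch, assuming `key_plant_code` should be unique and non-null (the notebook does not state this constraint); `preflight_problems` is a hypothetical helper shown on toy data.

```python
import pandas as pd


def preflight_problems(df, key="key_plant_code"):
    """Return human-readable problems that should block a full-table replace."""
    problems = []
    if df.empty:
        problems.append("no rows to write")

    missing = int(df[key].isna().sum())
    if missing:
        problems.append(f"{missing} row(s) with a missing {key}")

    # Keys that occur more than once (missing keys are reported separately)
    repeated = df[key].duplicated(keep=False) & df[key].notna()
    dupes = sorted(df.loc[repeated, key].unique())
    if dupes:
        problems.append(f"duplicate {key} values: {dupes}")
    return problems


toy = pd.DataFrame({"key_plant_code": ["AAA", "BBB", "AAA", None]})
print(preflight_problems(toy))
```

Running `preflight_problems(df_transformed)` and stopping on a non-empty result would keep a malformed CSV from replacing good data.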


In [None]:
# Write transformed data to BigQuery
from datetime import datetime

print("=" * 60)
print("UPDATING BIGQUERY TABLE")
print("=" * 60)
print(f"\nTable: {BQ_TABLE_ID}")
print(f"Rows to write: {len(df_transformed)}")
print(f"Mode: WRITE_TRUNCATE (replace entire table)")
print(f"\nStarting update at {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}...")

# Configure job to replace entire table
# (schema_update_options are omitted: BigQuery supports them only with WRITE_APPEND
# or when WRITE_TRUNCATE targets a single partition)
job_config = bigquery.LoadJobConfig(
    write_disposition="WRITE_TRUNCATE",  # Replace entire table
)

# Load dataframe to BigQuery
load_job = bq_client.load_table_from_dataframe(
    df_transformed,
    BQ_TABLE_ID,
    job_config=job_config
)

# Wait for job to complete
load_job.result()

print(f"\n✓ Update completed at {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print(f"  Rows written: {load_job.output_rows}")
print(f"  Job ID: {load_job.job_id}")


## Verify Update

Read back the table to verify the update was successful.


In [None]:
# Read updated table
print("Verifying update...")
query = f"SELECT * FROM `{BQ_TABLE_ID}`"
df_updated = bq_client.query(query).to_dataframe()

print(f"\n✓ Verification complete")
print(f"  Rows in table: {len(df_updated)}")
print(f"  Columns: {list(df_updated.columns)}")
print(f"\nUpdated table preview:")
df_updated.head()


In [None]:
# Verify row counts match
print("Data integrity check:")
print(f"  Rows written:  {len(df_transformed)}")
print(f"  Rows in table: {len(df_updated)}")

if len(df_transformed) == len(df_updated):
    print(f"\n✓ Row count verified - all {len(df_updated)} rows successfully written")
else:
    print(f"\n⚠ Row count mismatch!")
    print(f"  Expected: {len(df_transformed)}")
    print(f"  Actual:   {len(df_updated)}")
    print(f"  Difference: {len(df_updated) - len(df_transformed)}")
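
Matching row counts do not show that the contents match. A minimal sketch of an order-independent content check using only the standard library follows: each row is hashed, the row hashes are sorted, and the sorted list is hashed again. `table_fingerprint` is a hypothetical helper. It treats `None` and float `NaN` as the same empty value, and it assumes both sides list their columns in the same order.

```python
import hashlib


def _cell(value):
    """Render one cell as text, treating None and float NaN as empty."""
    if value is None or (isinstance(value, float) and value != value):
        return ""
    return str(value)


def table_fingerprint(rows):
    """Order-independent SHA-256 fingerprint of an iterable of row tuples."""
    row_digests = sorted(
        hashlib.sha256("\x1f".join(_cell(v) for v in row).encode("utf-8")).hexdigest()
        for row in rows
    )
    return hashlib.sha256("".join(row_digests).encode("ascii")).hexdigest()


# Same two rows in a different order, with the missing value spelled differently
written = [(51, "ERECON", "Arenaria congesta"), (802, "PHLO_SP", None)]
read_back = [(802, "PHLO_SP", float("nan")), (51, "ERECON", "Arenaria congesta")]
print(table_fingerprint(written) == table_fingerprint(read_back))  # True
```

With pandas, rows can be supplied as `df.itertuples(index=False, name=None)`, so comparing the fingerprints of `df_transformed` and `df_updated` would extend the row-count check. Missing values in nullable integer columns (`pd.NA`) are not handled by this sketch.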


## Summary Report

Complete summary of the update operation.


In [None]:
# Generate summary report
from datetime import datetime

print("=" * 60)
print("PLANT SPECIES METADATA UPDATE SUMMARY")
print("=" * 60)

print(f"\n📅 Timestamp: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

print(f"\n📂 Source:")
print(f"  CSV: {GCS_CSV_URL.split('/')[-1]}")
print(f"  Location: {'/'.join(GCS_CSV_URL.split('/')[:-1])}")

print(f"\n🎯 Target:")
print(f"  Table: {BQ_TABLE_ID}")
print(f"  Project: {bq_client.project}")

print(f"\n📊 Data Changes:")
if df_existing is not None:
    print(f"  Previous rows: {len(df_existing)}")
    print(f"  New rows:      {len(df_updated)}")
    print(f"  Net change:    {len(df_updated) - len(df_existing):+d}")
else:
    print(f"  New table created with {len(df_updated)} rows")

print(f"\n🔄 Transformations Applied:")
print(f"  ✓ Renamed {len(column_mapping)} columns to warehouse conventions")
print(f"  ✓ Dropped zModificationTimestamp column")

if BACKUP_BUCKET and df_existing is not None:
    print(f"\n💾 Backup:")
    print(f"  Location: gs://{BACKUP_BUCKET}/{BACKUP_PREFIX}/")
    print(f"  Status: ✓ Created before update")

print(f"\n✅ Update completed successfully!")
print("=" * 60)


## Rollback Instructions (If Needed)

If you need to roll back to the previous version, use the backup created in the "Backup Existing Table" step of this notebook.

```python
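# To roll back, load the backup files directly from GCS (pandas cannot expand the * wildcard):
# backup_uri = f"gs://{BACKUP_BUCKET}/{BACKUP_PREFIX}/YYYYMMDD_HHMMSS/*.csv"
# rollback_config = bigquery.LoadJobConfig(
#     source_format=bigquery.SourceFormat.CSV, skip_leading_rows=1,  # skip the exported header row
#     write_disposition="WRITE_TRUNCATE", schema=bq_client.get_table(BQ_TABLE_ID).schema)
# bq_client.load_table_from_uri(backup_uri, BQ_TABLE_ID, job_config=rollback_config).result()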
```

The backup location was printed in the backup cell above.
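
If that output is no longer available, the newest backup folder can be recovered from the object names under the backup prefix, since each backup is written to a `YYYYMMDD_HHMMSS` folder. A minimal sketch, assuming the names have already been listed from the bucket; `latest_backup_folder` is a hypothetical helper and the names below are made up.

```python
from datetime import datetime


def latest_backup_folder(object_names, prefix):
    """Return the newest '<prefix>/<YYYYMMDD_HHMMSS>' folder among object names, or None."""
    stamps = set()
    for name in object_names:
        if not name.startswith(prefix + "/"):
            continue
        # First path component after the prefix is the candidate timestamp folder
        folder = name[len(prefix) + 1:].split("/", 1)[0]
        try:
            stamps.add(datetime.strptime(folder, "%Y%m%d_%H%M%S"))
        except ValueError:
            continue  # not a timestamp folder
    if not stamps:
        return None
    return f"{prefix}/{max(stamps).strftime('%Y%m%d_%H%M%S')}"


names = [
    "backups/20250901_093000/000000000000.csv",
    "backups/20250916_102500/000000000000.csv",
    "backups/notes.txt",
]
print(latest_backup_folder(names, "backups"))  # backups/20250916_102500
```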
