# Update gridVeg Additional Species in BigQuery

This notebook appends new additional species records to the BigQuery table from a CSV file stored in GCS.

**Operation**: APPEND new rows (not replace entire table)

## Requirements
- Google Cloud credentials configured
- Configuration file: copy `config.example.yml` to `config.yml` and fill in your values
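- Required packages: google-cloud-bigquery, google-cloud-storage, pandas, pyyaml, plus gcsfs (so pandas can read `gs://` URLs), pyarrow (for loading DataFrames into BigQuery), and db-dtypes (for BigQuery date columns in DataFrames)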


In [1]:
# Import required libraries
import yaml
import pandas as pd
from pathlib import Path
from google.cloud import bigquery
from google.cloud import storage
from datetime import datetime

print("Libraries imported successfully")


Libraries imported successfully


## Load Configuration

**TODO**: Add configuration section to config.yml for this table
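As a rough guide to the section this notebook expects, the sketch below validates a parsed configuration dictionary using the same nested keys the loading cell reads. The helper name `validate_config` and the example values are hypothetical and stand in for whatever `yaml.safe_load` returns from your own `config.yml`.

```python
# Hypothetical helper: checks the nested keys this notebook reads from config.yml.
def validate_config(config: dict) -> dict:
    section = config.get("gridveg_additional_species") or {}
    gcs = section.get("gcs") or {}
    bq = section.get("bigquery") or {}

    csv_url = gcs.get("csv_url")
    table_id = bq.get("table_id")

    # Reject missing values and untouched placeholders from config.example.yml
    if not csv_url or csv_url.startswith("gs://your-"):
        raise ValueError("gridveg_additional_species.gcs.csv_url is not configured")
    if not table_id or "your-project" in table_id:
        raise ValueError("gridveg_additional_species.bigquery.table_id is not configured")

    return {
        "csv_url": csv_url,
        "table_id": table_id,
        "project": bq.get("project"),              # optional
        "backup_bucket": gcs.get("backup_bucket"),  # optional
        "backup_prefix": gcs.get("backup_prefix", "backups/gridveg_additional_species"),
    }


# Placeholder values for illustration only; these are not real resources.
example = {
    "gridveg_additional_species": {
        "gcs": {"csv_url": "gs://example-bucket/src/additional_species.csv"},
        "bigquery": {"table_id": "example-project.example_dataset.additional_species"},
    }
}
settings = validate_config(example)
print(settings["backup_prefix"])
```

Only `csv_url` and `table_id` are treated as required here, mirroring the checks in the loading cell; the backup and project keys fall back to defaults or `None`.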


In [2]:
# Load configuration from YAML file
config_path = Path("../config.yml")

if not config_path.exists():
    raise FileNotFoundError(
        f"Configuration file not found: {config_path}\n"
        "Please copy config.example.yml to config.yml and fill in your values."
    )

with open(config_path, 'r') as f:
    config = yaml.safe_load(f)

# Extract configuration values for gridVeg additional species
# TODO: Update these config keys once added to config.yml
GCS_CSV_URL = config['gridveg_additional_species']['gcs']['csv_url']
BACKUP_BUCKET = config['gridveg_additional_species']['gcs'].get('backup_bucket')
BACKUP_PREFIX = config['gridveg_additional_species']['gcs'].get('backup_prefix', 'backups/gridveg_additional_species')
BQ_TABLE_ID = config['gridveg_additional_species']['bigquery']['table_id']
BQ_PROJECT = config['gridveg_additional_species']['bigquery'].get('project')

# Verify required config values
if not GCS_CSV_URL or GCS_CSV_URL.startswith('gs://your-'):
    raise ValueError("Please configure gridveg_additional_species.gcs.csv_url in config.yml")
if not BQ_TABLE_ID or 'your-project' in BQ_TABLE_ID:
    raise ValueError("Please configure gridveg_additional_species.bigquery.table_id in config.yml")

print("✓ Configuration loaded successfully")
print(f"  CSV URL: {GCS_CSV_URL[:60]}..." if len(GCS_CSV_URL) > 60 else f"  CSV URL: {GCS_CSV_URL}")
print(f"  Table ID: {BQ_TABLE_ID}")
print(f"  Backup: gs://{BACKUP_BUCKET}/{BACKUP_PREFIX}" if BACKUP_BUCKET else "  Backup: Not configured")


✓ Configuration loaded successfully
  CSV URL: gs://mpg-data-warehouse/gridVeg/src/2025/2025-09-18_gridVeg_...
  Table ID: mpg-data-warehouse.vegetation_point_intercept_gridVeg.gridVeg_additional_species
  Backup: gs://mpg-data-warehouse/gridVeg/bak


In [3]:
# Initialize clients
bq_client = bigquery.Client(project=BQ_PROJECT) if BQ_PROJECT else bigquery.Client()
storage_client = storage.Client(project=BQ_PROJECT) if BQ_PROJECT else storage.Client()

print(f"✓ Clients initialized")
print(f"  Project: {bq_client.project}")


✓ Clients initialized
  Project: mpg-data-warehouse


## Load CSV Data from GCS

Read the source CSV file containing new additional species records.
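Reading a `gs://` URL directly with pandas relies on gcsfs being installed. If you instead need the bucket and object name separately (for example, to use the storage client), one possible way to split the URL is sketched below. The function name `split_gcs_url` and the example URL are illustrative assumptions, not part of the notebook.

```python
from urllib.parse import urlparse


def split_gcs_url(url: str) -> tuple[str, str]:
    """Split 'gs://bucket/path/to/file.csv' into (bucket, object_name)."""
    parsed = urlparse(url)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"Not a gs:// URL: {url!r}")
    # netloc holds the bucket; path holds the object name with a leading slash
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical URL for illustration
bucket, blob_name = split_gcs_url("gs://example-bucket/src/2025/species.csv")
print(bucket)
print(blob_name)
```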


In [4]:
# Read CSV from GCS (new data)
print("Reading CSV from GCS...")
df_new = pd.read_csv(GCS_CSV_URL)

print(f"✓ CSV loaded successfully:")
print(f"  Rows: {len(df_new)}")
print(f"  Columns: {list(df_new.columns)}")
print(f"\nFirst few rows:")
df_new.head()


Reading CSV from GCS...
✓ CSV loaded successfully:
  Rows: 390
  Columns: ['Survey Data::__kp_Survey', 'Survey Data::_kf_Site', 'Survey Data::SurveyDate', 'Survey Data::SurveyYear', '_kf_Species_serial']

First few rows:


| | Survey Data::__kp_Survey | Survey Data::_kf_Site | Survey Data::SurveyDate | Survey Data::SurveyYear | _kf_Species_serial |
|---|---|---|---|---|---|
| 0 | B45700C5-D391-4679-8579-217DCB1385A2 | 227 | 5/21/25 | 2025 | 492 |
| 1 | B45700C5-D391-4679-8579-217DCB1385A2 | 227 | 5/21/25 | 2025 | 496 |
| 2 | B45700C5-D391-4679-8579-217DCB1385A2 | 227 | 5/21/25 | 2025 | 230 |
| 3 | B45700C5-D391-4679-8579-217DCB1385A2 | 227 | 5/21/25 | 2025 | 287 |
| 4 | B45700C5-D391-4679-8579-217DCB1385A2 | 227 | 5/21/25 | 2025 | 303 |


## Transform CSV Data

Apply schema transformations to match BigQuery table:
- Rename columns to match destination schema
- Parse dates from m/d/yy strings (for example, 5/21/25) into date values, which display in ISO format (YYYY-MM-DD)
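The two-digit year in the source dates is resolved by the `%y` directive. The sketch below uses the standard library's `datetime.strptime` rather than pandas to show the same format string and where the century pivot falls; the helper name `parse_survey_date` is hypothetical.

```python
from datetime import date, datetime


def parse_survey_date(text: str) -> date:
    """Parse an m/d/yy string such as '5/21/25' into a date."""
    return datetime.strptime(text, "%m/%d/%y").date()


print(parse_survey_date("5/21/25").isoformat())  # 2025-05-21
print(parse_survey_date("1/1/68").year)          # 2068
print(parse_survey_date("1/1/69").year)          # 1969
```

Because years 69 to 99 map to the 1900s, any survey dated before 1969 or after 2068 would need a four-digit year in the source file.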


In [5]:
# Define column mapping from CSV to BigQuery
column_mapping = {
    'Survey Data::__kp_Survey': 'survey_ID',
    'Survey Data::_kf_Site': 'grid_point',
    'Survey Data::SurveyDate': 'date',
    'Survey Data::SurveyYear': 'year',
    '_kf_Species_serial': 'key_plant_species'
}

print("Column mapping:")
for csv_col, bq_col in column_mapping.items():
    print(f"  {csv_col:35s} → {bq_col}")


Column mapping:
  Survey Data::__kp_Survey            → survey_ID
  Survey Data::_kf_Site               → grid_point
  Survey Data::SurveyDate             → date
  Survey Data::SurveyYear             → year
  _kf_Species_serial                  → key_plant_species


In [6]:
# Verify CSV columns match expected schema
expected_csv_columns = set(column_mapping.keys())
actual_csv_columns = set(df_new.columns)

if actual_csv_columns == expected_csv_columns:
    print("✓ CSV columns match expected schema")
else:
    print("⚠ CSV column differences detected:")
    if actual_csv_columns - expected_csv_columns:
        print(f"  Unexpected columns: {actual_csv_columns - expected_csv_columns}")
    if expected_csv_columns - actual_csv_columns:
        print(f"  Missing columns: {expected_csv_columns - actual_csv_columns}")
    
print(f"\nCSV columns: {list(df_new.columns)}")


✓ CSV columns match expected schema

CSV columns: ['Survey Data::__kp_Survey', 'Survey Data::_kf_Site', 'Survey Data::SurveyDate', 'Survey Data::SurveyYear', '_kf_Species_serial']


In [7]:
# Apply transformation: rename columns
df_transformed = df_new.copy()
df_transformed = df_transformed.rename(columns=column_mapping)

print("✓ Columns renamed")
print(f"  Transformed columns: {list(df_transformed.columns)}")


✓ Columns renamed
  Transformed columns: ['survey_ID', 'grid_point', 'date', 'year', 'key_plant_species']


In [8]:
# Convert date from m/d/yy to proper datetime/date format
# Explicitly specify format to avoid parsing warnings and ensure consistency
# Note: %y handles 2-digit years (00-68 = 2000-2068, 69-99 = 1969-1999)
df_transformed['date'] = pd.to_datetime(df_transformed['date'], format='%m/%d/%y').dt.date

print("✓ Date format converted to date type")
print(f"  Sample dates: {df_transformed['date'].head().tolist()}")


✓ Date format converted to date type
  Sample dates: [datetime.date(2025, 5, 21), datetime.date(2025, 5, 21), datetime.date(2025, 5, 21), datetime.date(2025, 5, 21), datetime.date(2025, 5, 21)]


In [9]:
# Display transformed data info
print("Transformed Data Info:")
df_transformed.info()
print(f"\nTransformed data preview:")
df_transformed.head()


Transformed Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 390 entries, 0 to 389
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   survey_ID          390 non-null    object
 1   grid_point         390 non-null    int64 
 2   date               390 non-null    object
 3   year               390 non-null    int64 
 4   key_plant_species  390 non-null    int64 
dtypes: int64(3), object(2)
memory usage: 15.4+ KB

Transformed data preview:


| | survey_ID | grid_point | date | year | key_plant_species |
|---|---|---|---|---|---|
| 0 | B45700C5-D391-4679-8579-217DCB1385A2 | 227 | 2025-05-21 | 2025 | 492 |
| 1 | B45700C5-D391-4679-8579-217DCB1385A2 | 227 | 2025-05-21 | 2025 | 496 |
| 2 | B45700C5-D391-4679-8579-217DCB1385A2 | 227 | 2025-05-21 | 2025 | 230 |
| 3 | B45700C5-D391-4679-8579-217DCB1385A2 | 227 | 2025-05-21 | 2025 | 287 |
| 4 | B45700C5-D391-4679-8579-217DCB1385A2 | 227 | 2025-05-21 | 2025 | 303 |


## Read Existing BigQuery Table

Load the current data from BigQuery to compare with the new data.


In [10]:
# Read existing data from BigQuery
from google.api_core.exceptions import NotFound

print(f"Reading existing data from {BQ_TABLE_ID}...")
query = f"SELECT * FROM `{BQ_TABLE_ID}`"

try:
    df_existing = bq_client.query(query).to_dataframe()
    print(f"✓ Existing table loaded:")
    print(f"  Rows: {len(df_existing)}")
    print(f"  Columns: {list(df_existing.columns)}")
    print(f"\nExisting data preview:")
    display(df_existing.head())
except NotFound as e:
    # Only a missing table counts as "no existing data". Other errors
    # (permissions, network) propagate so they are not mistaken for a new table.
    print(f"⚠ Table not found: {e}")
    print("  Continuing as a new table creation.")
    df_existing = None


Reading existing data from mpg-data-warehouse.vegetation_point_intercept_gridVeg.gridVeg_additional_species...
✓ Existing table loaded:
  Rows: 13662
  Columns: ['survey_ID', 'grid_point', 'date', 'year', 'key_plant_species']

Existing data preview:


| | survey_ID | grid_point | date | year | key_plant_species |
|---|---|---|---|---|---|
| 0 | 308 | 324 | 2011-05-10 | 2011 | 69 |
| 1 | 308 | 324 | 2011-05-10 | 2011 | 72 |
| 2 | 308 | 324 | 2011-05-10 | 2011 | 5 |
| 3 | 308 | 324 | 2011-05-10 | 2011 | 82 |
| 4 | 308 | 324 | 2011-05-10 | 2011 | 75 |


In [11]:
# Display existing data info (if available)
if df_existing is not None:
    print("Existing Data Info:")
    df_existing.info()


Existing Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13662 entries, 0 to 13661
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   survey_ID          13662 non-null  object
 1   grid_point         13662 non-null  Int64 
 2   date               13662 non-null  dbdate
 3   year               13662 non-null  Int64 
 4   key_plant_species  13648 non-null  Int64 
dtypes: Int64(3), dbdate(1), object(1)
memory usage: 573.8+ KB


## Compare New vs Existing Data

Identify which rows in the new data are not already in the existing table.
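The comparison amounts to an anti-join on a composite key. A minimal pure-Python sketch of that idea follows, using tuples as keys instead of concatenated strings. Unlike the notebook cell, it also drops repeats within the new data itself, which is an added assumption about the desired behavior. The function name `rows_to_append` and the sample rows are illustrative only.

```python
def rows_to_append(existing_rows, new_rows, key_fields=("survey_ID", "key_plant_species")):
    """Return rows from new_rows whose composite key is absent from existing_rows."""
    existing_keys = {tuple(row[f] for f in key_fields) for row in existing_rows}
    seen = set()
    result = []
    for row in new_rows:
        key = tuple(row[f] for f in key_fields)
        # Skip keys already in the table, and repeats within the new data
        if key in existing_keys or key in seen:
            continue
        seen.add(key)
        result.append(row)
    return result


# Illustrative rows only
existing = [{"survey_ID": "308", "key_plant_species": 69}]
new = [
    {"survey_ID": "308", "key_plant_species": 69},   # already in the table
    {"survey_ID": "A1", "key_plant_species": 492},
    {"survey_ID": "A1", "key_plant_species": 492},   # repeat within the new data
    {"survey_ID": "A1", "key_plant_species": 496},
]
to_append = rows_to_append(existing, new)
print(len(to_append))  # 2
```

Tuple keys avoid any ambiguity that a string separator could introduce if an identifier ever contained the separator character.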


In [12]:
# Compare datasets
if df_existing is not None:
    print("=== Comparison Summary ===\n")
    
    # Row count comparison
    print(f"Row count:")
    print(f"  Existing: {len(df_existing)}")
    print(f"  New CSV:  {len(df_transformed)}")
    
    # Column comparison
    existing_cols = set(df_existing.columns)
    new_cols = set(df_transformed.columns)
    
    if existing_cols == new_cols:
        print(f"\n✓ Columns match ({len(new_cols)} columns)")
    else:
        print("\n⚠ Column differences detected:")
        if new_cols - existing_cols:
            print(f"  New columns: {new_cols - existing_cols}")
        if existing_cols - new_cols:
            print(f"  Missing columns: {existing_cols - new_cols}")
    
    print(f"\nColumns: {list(df_transformed.columns)}")
else:
    print("No existing data to compare - this will be a new table creation.")


=== Comparison Summary ===

Row count:
  Existing: 13662
  New CSV:  390

✓ Columns match (5 columns)

Columns: ['survey_ID', 'grid_point', 'date', 'year', 'key_plant_species']


In [13]:
# Identify new records (not in existing table)
# Use combination of survey_ID + key_plant_species as the composite key
if df_existing is not None:
    # Create composite key for both dataframes
    df_existing['_composite_key'] = df_existing['survey_ID'].astype(str) + '_' + df_existing['key_plant_species'].astype(str)
    df_transformed['_composite_key'] = df_transformed['survey_ID'].astype(str) + '_' + df_transformed['key_plant_species'].astype(str)
    
    existing_keys = set(df_existing['_composite_key'])
    new_keys = set(df_transformed['_composite_key'])
    
    # Find records in new data that aren't in existing
    keys_to_append = new_keys - existing_keys
    
    if keys_to_append:
        df_to_append = df_transformed[df_transformed['_composite_key'].isin(keys_to_append)].copy()
        # Drop the temporary composite key column
        df_to_append = df_to_append.drop(columns=['_composite_key'])
        
        print(f"✓ Found {len(df_to_append)} new records to append")
        
        # Show year breakdown of new records
        print(f"\nNew records by year:")
        year_counts = df_to_append['year'].value_counts().sort_index()
        for year, count in year_counts.items():
            print(f"  {year}: {count} records")
        
        print(f"\nSample of new records:")
        display(df_to_append.head(10))
    else:
        df_to_append = None
        print("⚠ No new records found - all records already exist in table")
        print("  Nothing to append.")
    
    # Report composite keys in the new data that already exist in the table
    duplicate_keys = existing_keys & new_keys
    if duplicate_keys:
        print(f"\n⚠ Warning: {len(duplicate_keys)} composite keys already exist in table")
        print(f"  These will be skipped during append.")
        # Show up to 10 sample duplicates, however many there are in total
        sample_keys = list(duplicate_keys)[:10]
        df_existing_temp = df_existing[df_existing['_composite_key'].isin(sample_keys)]
        print(f"\n  Sample duplicate records (survey_ID, species):")
        for _, row in df_existing_temp[['survey_ID', 'key_plant_species']].head(10).iterrows():
            print(f"    {row['survey_ID']}, {row['key_plant_species']}")
    
    # Clean up temporary column from df_existing and df_transformed
    df_existing = df_existing.drop(columns=['_composite_key'])
    df_transformed = df_transformed.drop(columns=['_composite_key'])
else:
    # No existing table, so all records are new
    df_to_append = df_transformed.copy()
    print(f"✓ No existing table - will create new table with {len(df_to_append)} records")


✓ Found 390 new records to append

New records by year:
  2025: 390 records

Sample of new records:


| | survey_ID | grid_point | date | year | key_plant_species |
|---|---|---|---|---|---|
| 0 | B45700C5-D391-4679-8579-217DCB1385A2 | 227 | 2025-05-21 | 2025 | 492 |
| 1 | B45700C5-D391-4679-8579-217DCB1385A2 | 227 | 2025-05-21 | 2025 | 496 |
| 2 | B45700C5-D391-4679-8579-217DCB1385A2 | 227 | 2025-05-21 | 2025 | 230 |
| 3 | B45700C5-D391-4679-8579-217DCB1385A2 | 227 | 2025-05-21 | 2025 | 287 |
| 4 | B45700C5-D391-4679-8579-217DCB1385A2 | 227 | 2025-05-21 | 2025 | 303 |
| 5 | C0BD2A75-FF0B-48DC-BB9D-941267BF5838 | 190 | 2025-05-21 | 2025 | 320 |
| 6 | C0BD2A75-FF0B-48DC-BB9D-941267BF5838 | 190 | 2025-05-21 | 2025 | 67 |
| 7 | C0BD2A75-FF0B-48DC-BB9D-941267BF5838 | 190 | 2025-05-21 | 2025 | 156 |
| 8 | C0BD2A75-FF0B-48DC-BB9D-941267BF5838 | 190 | 2025-05-21 | 2025 | 388 |
| 9 | C0BD2A75-FF0B-48DC-BB9D-941267BF5838 | 190 | 2025-05-21 | 2025 | 262 |


## Backup Existing Table

Before making any changes, create a backup of the existing table to GCS.
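The backup destination is a timestamped GCS URI ending in a `*` wildcard, which lets the BigQuery export split its output across several files when needed. The sketch below shows one way such a URI could be assembled; the helper name `build_backup_uri` and the bucket and prefix values are assumptions for illustration.

```python
from datetime import datetime


def build_backup_uri(bucket: str, prefix: str, now: datetime) -> str:
    """Build a timestamped wildcard URI for a BigQuery table export."""
    timestamp = now.strftime("%Y%m%d_%H%M%S")
    # Strip stray slashes so the prefix never produces '//' in the path
    return f"gs://{bucket}/{prefix.strip('/')}/{timestamp}/*.csv"


# Hypothetical bucket and prefix; the fixed datetime keeps the example reproducible
uri = build_backup_uri("example-bucket", "gridVeg/bak/", datetime(2025, 10, 31, 10, 28, 39))
print(uri)  # gs://example-bucket/gridVeg/bak/20251031_102839/*.csv
```

Keeping each backup under its own timestamped folder means repeated runs never overwrite an earlier export.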


In [14]:
# Backup existing table to GCS
if df_existing is not None and BACKUP_BUCKET and df_to_append is not None:
    # Generate backup path with timestamp
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    backup_path = f"gs://{BACKUP_BUCKET}/{BACKUP_PREFIX}/{timestamp}/*.csv"
    
    print(f"Creating backup of existing table...")
    print(f"  Destination: {backup_path}")
    
    # Export table to GCS
    extract_job = bq_client.extract_table(
        BQ_TABLE_ID,
        backup_path,
        location="US"
    )
    
    extract_job.result()  # Wait for job to complete
    
    print(f"✓ Backup completed successfully")
    print(f"  Files: {backup_path}")
elif df_existing is None:
    print("⚠ No existing table to backup (table doesn't exist yet)")
elif not BACKUP_BUCKET:
    print("⚠ Backup bucket not configured in config.yml")
    print("  Set 'gridveg_additional_species.gcs.backup_bucket' to enable automatic backups")
elif df_to_append is None:
    print("⚠ No new records to append, skipping backup")


Creating backup of existing table...
  Destination: gs://mpg-data-warehouse/gridVeg/bak/20251031_102839/*.csv
✓ Backup completed successfully
  Files: gs://mpg-data-warehouse/gridVeg/bak/20251031_102839/*.csv


## Append New Records to BigQuery

⚠️ **IMPORTANT**: This will APPEND new rows to the existing table (not replace).

Review the comparison above before proceeding.


In [15]:
# Append new records to BigQuery
if df_to_append is not None and len(df_to_append) > 0:
    print("=" * 60)
    print("APPENDING TO BIGQUERY TABLE")
    print("=" * 60)
    print(f"\nTable: {BQ_TABLE_ID}")
    print(f"Rows to append: {len(df_to_append)}")
    print(f"Mode: WRITE_APPEND (add to existing table)")
    print(f"\nStarting append at {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}...")
    
    # Configure job to append to existing table
    job_config = bigquery.LoadJobConfig(
        write_disposition="WRITE_APPEND"  # Append to existing table
    )
    
    # Load dataframe to BigQuery
    load_job = bq_client.load_table_from_dataframe(
        df_to_append,
        BQ_TABLE_ID,
        job_config=job_config
    )
    
    # Wait for job to complete
    load_job.result()
    
    print(f"\n✓ Append completed at {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
    print(f"  Rows appended: {load_job.output_rows}")
    print(f"  Job ID: {load_job.job_id}")
else:
    print("=" * 60)
    print("NO RECORDS TO APPEND")
    print("=" * 60)
    print("\nNo new records found or no records to append.")
    print("Table remains unchanged.")


APPENDING TO BIGQUERY TABLE

Table: mpg-data-warehouse.vegetation_point_intercept_gridVeg.gridVeg_additional_species
Rows to append: 390
Mode: WRITE_APPEND (add to existing table)

Starting append at 2025-10-31 10:28:48...

✓ Append completed at 2025-10-31 10:28:52
  Rows appended: 390
  Job ID: d124a00c-280b-43b1-9c2c-aa62d246f359


## Verify Append

Read back the table to verify the append was successful.
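Beyond comparing row counts, it can be useful to confirm that no composite key appears more than once after the append. This extra check is not part of the notebook; the sketch below shows one possible form using `collections.Counter`, with a hypothetical helper name and illustrative rows.

```python
from collections import Counter


def duplicated_keys(rows, key_fields=("survey_ID", "key_plant_species")):
    """Return {composite_key: count} for keys that occur more than once."""
    counts = Counter(tuple(row[f] for f in key_fields) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


# Illustrative rows only
table_rows = [
    {"survey_ID": "A1", "key_plant_species": 492},
    {"survey_ID": "A1", "key_plant_species": 496},
    {"survey_ID": "A1", "key_plant_species": 492},
]
dupes = duplicated_keys(table_rows)
print(dupes)  # {('A1', 492): 2}
```

An empty result would indicate that every (survey_ID, key_plant_species) pair in the table is unique.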


In [16]:
# Read updated table
print("Verifying append...")
query = f"SELECT * FROM `{BQ_TABLE_ID}`"
df_updated = bq_client.query(query).to_dataframe()

print(f"\n✓ Verification complete")
print(f"  Rows in table: {len(df_updated)}")
print(f"  Columns: {list(df_updated.columns)}")

# Show records by year
print(f"\nRecords by year:")
year_counts = df_updated['year'].value_counts().sort_index()
for year, count in year_counts.items():
    print(f"  {year}: {count} records")

print(f"\nUpdated table preview:")
df_updated.tail(10)


Verifying append...

✓ Verification complete
  Rows in table: 14052
  Columns: ['survey_ID', 'grid_point', 'date', 'year', 'key_plant_species']

Records by year:
  2011: 4043 records
  2012: 1747 records
  2013: 209 records
  2015: 485 records
  2016: 4906 records
  2017: 39 records
  2021: 1238 records
  2022: 454 records
  2023: 267 records
  2024: 274 records
  2025: 390 records

Updated table preview:


| | survey_ID | grid_point | date | year | key_plant_species |
|---|---|---|---|---|---|
| 14042 | E71C66DF-7C7A-4834-8034-5731CDDE84C9 | 70 | 2024-07-10 | 2024 | 334 |
| 14043 | E71C66DF-7C7A-4834-8034-5731CDDE84C9 | 70 | 2024-07-10 | 2024 | 558 |
| 14044 | E71C66DF-7C7A-4834-8034-5731CDDE84C9 | 70 | 2024-07-10 | 2024 | 135 |
| 14045 | 11BCE714-4E58-4F75-8238-323BB5D2616C | 420 | 2024-07-11 | 2024 | 308 |
| 14046 | 11BCE714-4E58-4F75-8238-323BB5D2616C | 420 | 2024-07-11 | 2024 | 125 |
| 14047 | 11BCE714-4E58-4F75-8238-323BB5D2616C | 420 | 2024-07-11 | 2024 | 84 |
| 14048 | 11BCE714-4E58-4F75-8238-323BB5D2616C | 420 | 2024-07-11 | 2024 | 20 |
| 14049 | 11BCE714-4E58-4F75-8238-323BB5D2616C | 420 | 2024-07-11 | 2024 | 90 |
| 14050 | 11BCE714-4E58-4F75-8238-323BB5D2616C | 420 | 2024-07-11 | 2024 | 5 |
| 14051 | 11BCE714-4E58-4F75-8238-323BB5D2616C | 420 | 2024-07-11 | 2024 | 113 |


In [17]:
# Verify row counts
if df_to_append is not None and len(df_to_append) > 0:
    expected_rows = len(df_existing) + len(df_to_append) if df_existing is not None else len(df_to_append)
    actual_rows = len(df_updated)
    
    print("Data integrity check:")
    if df_existing is not None:
        print(f"  Previous rows:   {len(df_existing)}")
        print(f"  Rows appended:   {len(df_to_append)}")
        print(f"  Expected total:  {expected_rows}")
        print(f"  Actual total:    {actual_rows}")
    else:
        print(f"  Rows written:    {len(df_to_append)}")
        print(f"  Rows in table:   {actual_rows}")
    
    if expected_rows == actual_rows:
        print(f"\n✓ Row count verified - all {len(df_to_append)} new rows successfully appended")
    else:
        print(f"\n⚠ Row count mismatch!")
        print(f"  Expected: {expected_rows}")
        print(f"  Actual:   {actual_rows}")
        print(f"  Difference: {actual_rows - expected_rows}")
else:
    print("No new records were appended.")


Data integrity check:
  Previous rows:   13662
  Rows appended:   390
  Expected total:  14052
  Actual total:    14052

✓ Row count verified - all 390 new rows successfully appended


## Summary Report

Complete summary of the append operation.


In [18]:
# Generate summary report
print("=" * 60)
print("GRIDVEG ADDITIONAL SPECIES APPEND SUMMARY")
print("=" * 60)

print(f"\n📅 Timestamp: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

print(f"\n📂 Source:")
print(f"  CSV: {GCS_CSV_URL.split('/')[-1]}")
print(f"  Location: {'/'.join(GCS_CSV_URL.split('/')[:-1])}")

print(f"\n🎯 Target:")
print(f"  Table: {BQ_TABLE_ID}")
print(f"  Project: {bq_client.project}")

print(f"\n📊 Data Changes:")
if df_existing is not None:
    print(f"  Previous rows: {len(df_existing)}")
    print(f"  New rows:      {len(df_updated)}")
    print(f"  Rows appended: {len(df_updated) - len(df_existing):+d}")
    
    if df_to_append is not None and len(df_to_append) > 0:
        print(f"\n  Appended records by year:")
        year_counts = df_to_append['year'].value_counts().sort_index()
        for year, count in year_counts.items():
            print(f"    {year}: {count} records")
else:
    print(f"  New table created with {len(df_updated)} rows")

print(f"\n🔄 Transformations Applied:")
print(f"  ✓ Renamed {len(column_mapping)} columns to match BigQuery schema")
print(f"  ✓ Converted date format to ISO (YYYY-MM-DD)")
print(f"  ✓ Used composite key (survey_ID + key_plant_species) for duplicate detection")

if BACKUP_BUCKET and df_existing is not None and df_to_append is not None and len(df_to_append) > 0:
    print(f"\n💾 Backup:")
    print(f"  Location: gs://{BACKUP_BUCKET}/{BACKUP_PREFIX}/")
    print(f"  Status: ✓ Created before append")

if df_to_append is not None and len(df_to_append) > 0:
    print(f"\n✅ Append completed successfully!")
else:
    print(f"\n✅ No changes needed - table is up to date!")
print("=" * 60)


GRIDVEG ADDITIONAL SPECIES APPEND SUMMARY

📅 Timestamp: 2025-10-31 10:29:26

📂 Source:
  CSV: 2025-09-18_gridVeg_additional_species_SOURCE.csv
  Location: gs://mpg-data-warehouse/gridVeg/src/2025

🎯 Target:
  Table: mpg-data-warehouse.vegetation_point_intercept_gridVeg.gridVeg_additional_species
  Project: mpg-data-warehouse

📊 Data Changes:
  Previous rows: 13662
  New rows:      14052
  Rows appended: +390

  Appended records by year:
    2025: 390 records

🔄 Transformations Applied:
  ✓ Renamed 5 columns to match BigQuery schema
  ✓ Converted date format to ISO (YYYY-MM-DD)
  ✓ Used composite key (survey_ID + key_plant_species) for duplicate detection

💾 Backup:
  Location: gs://mpg-data-warehouse/gridVeg/bak/
  Status: ✓ Created before append

✅ Append completed successfully!


## Rollback Instructions (If Needed)

If you need to roll back to the previous version, either remove the appended rows or restore the backup that was created just before the append step.

```python
# Option 1: remove the appended rows, then rewrite the table.
# Caution: this removes every row whose composite key matches an appended row.
# df_rollback = df_updated.copy()
# df_rollback['_composite_key'] = df_rollback['survey_ID'].astype(str) + '_' + df_rollback['key_plant_species'].astype(str)
# keys_to_remove = set(df_to_append['survey_ID'].astype(str) + '_' + df_to_append['key_plant_species'].astype(str))
# df_rollback = df_rollback[~df_rollback['_composite_key'].isin(keys_to_remove)].drop(columns=['_composite_key'])
# job_config = bigquery.LoadJobConfig(write_disposition="WRITE_TRUNCATE")
# bq_client.load_table_from_dataframe(df_rollback, BQ_TABLE_ID, job_config=job_config).result()  # wait for completion

# Option 2: restore from the backup.
# pd.read_csv expects a single file rather than a "*" pattern, so this untested sketch
# loads the exported files with a BigQuery load job and relies on the existing table schema.
# backup_uri = "gs://BACKUP_BUCKET/BACKUP_PREFIX/TIMESTAMP/*.csv"
# job_config = bigquery.LoadJobConfig(
#     source_format=bigquery.SourceFormat.CSV,
#     skip_leading_rows=1,  # assumes the export kept its default header row
#     write_disposition="WRITE_TRUNCATE",
# )
# bq_client.load_table_from_uri(backup_uri, BQ_TABLE_ID, job_config=job_config).result()
```

The backup location was printed in the backup cell above.
