# Update gridVeg Image Metadata in BigQuery

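This notebook reads new image metadata records from a CSV file in Google Cloud Storage (GCS) and appends them to the gridVeg image metadata table in BigQuery. Rows whose `image_ID` already exists in the table are skipped.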

**Operation**: APPEND new rows (not replace entire table)

## Requirements
- Google Cloud credentials configured
- Configuration file: copy `config.example.yml` to `config.yml` and fill in your values
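- Required packages: google-cloud-bigquery, google-cloud-storage, pandas, pyyaml, plus gcsfs (so pandas can read `gs://` URLs), pyarrow (used by `load_table_from_dataframe`), and db-dtypes (used by `to_dataframe` for DATE columns)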


In [1]:
# Import required libraries
import yaml
import pandas as pd
from pathlib import Path
from google.cloud import bigquery
from google.cloud import storage
from datetime import datetime

print("Libraries imported successfully")


Libraries imported successfully


## Load Configuration

**TODO**: Add configuration section to config.yml for this table
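
For reference, the loading cell reads five keys under `gridveg_image_metadata`. The sketch below mirrors that structure as a Python dict (the YAML file would carry the same nesting) and checks the two required values. All values are placeholders, and `EXAMPLE_CONFIG`, `REQUIRED_KEYS`, and `missing_config_keys` are illustrative names that do not appear in the notebook.

```python
# Keys the loading cell reads from config.yml, written as the equivalent dict.
# Every value is a placeholder, not a real resource.
EXAMPLE_CONFIG = {
    "gridveg_image_metadata": {
        "gcs": {
            "csv_url": "gs://your-bucket/path/to/source.csv",    # required
            "backup_bucket": "your-backup-bucket",                # optional
            "backup_prefix": "backups/gridveg_image_metadata",    # optional
        },
        "bigquery": {
            "table_id": "your-project.your_dataset.your_table",   # required
            "project": "your-project",                            # optional
        },
    }
}

# (group, key) pairs that must be present and non-empty.
REQUIRED_KEYS = [("gcs", "csv_url"), ("bigquery", "table_id")]


def missing_config_keys(config, section="gridveg_image_metadata"):
    """Return dotted names of required keys that are absent or empty."""
    table_cfg = (config or {}).get(section) or {}
    missing = []
    for group, key in REQUIRED_KEYS:
        value = (table_cfg.get(group) or {}).get(key)
        if not value:
            missing.append(f"{section}.{group}.{key}")
    return missing


print(missing_config_keys(EXAMPLE_CONFIG))
print(missing_config_keys({}))
```

Running a check like this right after `yaml.safe_load` would report a missing section by name instead of failing with a bare `KeyError`.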


In [2]:
# Load configuration from YAML file
config_path = Path("../config.yml")

if not config_path.exists():
    raise FileNotFoundError(
        f"Configuration file not found: {config_path}\n"
        "Please copy config.example.yml to config.yml and fill in your values."
    )

with open(config_path, 'r') as f:
    config = yaml.safe_load(f)

# Extract configuration values for gridVeg image metadata
# TODO: Update these config keys once added to config.yml
GCS_CSV_URL = config['gridveg_image_metadata']['gcs']['csv_url']
BACKUP_BUCKET = config['gridveg_image_metadata']['gcs'].get('backup_bucket')
BACKUP_PREFIX = config['gridveg_image_metadata']['gcs'].get('backup_prefix', 'backups/gridveg_image_metadata')
BQ_TABLE_ID = config['gridveg_image_metadata']['bigquery']['table_id']
BQ_PROJECT = config['gridveg_image_metadata']['bigquery'].get('project')

# Verify required config values
if not GCS_CSV_URL or GCS_CSV_URL.startswith('gs://your-'):
    raise ValueError("Please configure gridveg_image_metadata.gcs.csv_url in config.yml")
if not BQ_TABLE_ID or 'your-project' in BQ_TABLE_ID:
    raise ValueError("Please configure gridveg_image_metadata.bigquery.table_id in config.yml")

print("✓ Configuration loaded successfully")
print(f"  CSV URL: {GCS_CSV_URL[:60]}..." if len(GCS_CSV_URL) > 60 else f"  CSV URL: {GCS_CSV_URL}")
print(f"  Table ID: {BQ_TABLE_ID}")
print(f"  Backup: gs://{BACKUP_BUCKET}/{BACKUP_PREFIX}" if BACKUP_BUCKET else "  Backup: Not configured")


✓ Configuration loaded successfully
  CSV URL: gs://mpg-data-warehouse/gridVeg/src/2025/2025-09-18_gridVeg_...
  Table ID: mpg-data-warehouse.vegetation_point_intercept_gridVeg.gridVeg_image_metadata
  Backup: gs://mpg-data-warehouse/gridVeg/bak


In [3]:
# Initialize clients
bq_client = bigquery.Client(project=BQ_PROJECT) if BQ_PROJECT else bigquery.Client()
storage_client = storage.Client(project=BQ_PROJECT) if BQ_PROJECT else storage.Client()

print(f"✓ Clients initialized")
print(f"  Project: {bq_client.project}")


✓ Clients initialized
  Project: mpg-data-warehouse


## Load CSV Data from GCS

Read the source CSV file containing new image metadata records.


In [4]:
# Read CSV from GCS (new data)
print("Reading CSV from GCS...")
df_new = pd.read_csv(GCS_CSV_URL)

print(f"✓ CSV loaded successfully:")
print(f"  Rows: {len(df_new)}")
print(f"  Columns: {list(df_new.columns)}")
print(f"\nFirst few rows:")
df_new.head()


Reading CSV from GCS...
✓ CSV loaded successfully:
  Rows: 78
  Columns: ['__kp_Photos', 'Survey Data::__kp_Survey', 'Survey Data::SurveyDate', 'Survey Data::SurveyYear', 'Survey Data::_kf_Site', 'Direction']

First few rows:


Unnamed: 0,__kp_Photos,Survey Data::__kp_Survey,Survey Data::SurveyDate,Survey Data::SurveyYear,Survey Data::_kf_Site,Direction
0,3E0C8814-CA4F-4370-B998-0FF321625FEF,B45700C5-D391-4679-8579-217DCB1385A2,5/21/25,2025,227,North
1,D39F4604-8B98-414A-AAE1-91880E10083B,B45700C5-D391-4679-8579-217DCB1385A2,5/21/25,2025,227,West
2,4CAA802B-D9DF-43DD-931F-1A55BB114FC6,C0BD2A75-FF0B-48DC-BB9D-941267BF5838,5/21/25,2025,190,North
3,3CCC329F-6956-4E7F-83F7-71E324FB733E,C0BD2A75-FF0B-48DC-BB9D-941267BF5838,5/21/25,2025,190,West
4,7B70FE85-49B4-4AA2-85D6-2C85F4B85542,38A8FE64-8769-474C-BC25-01CBF006BFCC,5/22/25,2025,331,North


## Transform CSV Data

Apply schema transformations to match BigQuery table:
- Rename columns to match destination schema
- Convert date format from mm/dd/yy to ISO format (YYYY-MM-DD)
- Generate image_url from image_ID (https://storage.cloud.google.com/gridveg-reference-images/{image_ID}.jpg)
- Clean up Direction field (handle invisible character issue in "North")
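
The Direction cleanup in this notebook relies on `str.strip()`, which removes leading and trailing whitespace but leaves zero-width characters in place. The source does not say which invisible character affects "North", so the following is only a sketch of a stricter cleanup under the assumption that it is a Unicode format character (such as a zero-width space or byte-order mark). `clean_label` is a hypothetical helper, not part of the notebook.

```python
import unicodedata


def clean_label(value):
    """Normalize a text label: drop invisible format characters, then strip.

    Non-string values (for example None or NaN) are returned unchanged.
    """
    if not isinstance(value, str):
        return value
    # NFKC folds compatibility forms, e.g. a no-break space becomes a plain space.
    normalized = unicodedata.normalize("NFKC", value)
    # Category "Cf" covers zero-width spaces, joiners, and the byte-order mark.
    visible = "".join(ch for ch in normalized if unicodedata.category(ch) != "Cf")
    return visible.strip()


samples = ["North", "North\u200b", " North", "\ufeffNorth\u00a0"]
print(sorted({clean_label(s) for s in samples}))
```

In the notebook this could be applied with `df_transformed['image_direction'].map(clean_label)`; the value counts printed afterwards would still be the place to confirm that only one "North" level remains.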


In [5]:
# Define column mapping from CSV to BigQuery
column_mapping = {
    '__kp_Photos': 'image_ID',
    'Survey Data::__kp_Survey': 'survey_ID',
    'Survey Data::SurveyDate': 'date',
    'Survey Data::SurveyYear': 'year',
    'Survey Data::_kf_Site': 'grid_point',
    'Direction': 'image_direction'
}

print("Column mapping:")
for csv_col, bq_col in column_mapping.items():
    print(f"  {csv_col:35s} → {bq_col}")


Column mapping:
  __kp_Photos                         → image_ID
  Survey Data::__kp_Survey            → survey_ID
  Survey Data::SurveyDate             → date
  Survey Data::SurveyYear             → year
  Survey Data::_kf_Site               → grid_point
  Direction                           → image_direction


In [7]:
# Verify CSV columns match expected schema
expected_csv_columns = set(column_mapping.keys())
actual_csv_columns = set(df_new.columns)

if actual_csv_columns == expected_csv_columns:
    print("✓ CSV columns match expected schema")
else:
    print("⚠ CSV column differences detected:")
    if actual_csv_columns - expected_csv_columns:
        print(f"  Unexpected columns: {actual_csv_columns - expected_csv_columns}")
    if expected_csv_columns - actual_csv_columns:
        print(f"  Missing columns: {expected_csv_columns - actual_csv_columns}")
    
print(f"\nCSV columns: {list(df_new.columns)}")


✓ CSV columns match expected schema

CSV columns: ['__kp_Photos', 'Survey Data::__kp_Survey', 'Survey Data::SurveyDate', 'Survey Data::SurveyYear', 'Survey Data::_kf_Site', 'Direction']


In [8]:
# Apply transformation: rename columns
df_transformed = df_new.copy()
df_transformed = df_transformed.rename(columns=column_mapping)

print("✓ Columns renamed")
print(f"  Transformed columns: {list(df_transformed.columns)}")


✓ Columns renamed
  Transformed columns: ['image_ID', 'survey_ID', 'date', 'year', 'grid_point', 'image_direction']


In [9]:
# Convert date from m/d/yy to proper datetime/date format
# Explicitly specify format to avoid parsing warnings and ensure consistency
# Note: %y handles 2-digit years (00-68 = 2000-2068, 69-99 = 1969-1999)
df_transformed['date'] = pd.to_datetime(df_transformed['date'], format='%m/%d/%y').dt.date

print("✓ Date format converted to date type")
print(f"  Sample dates: {df_transformed['date'].head().tolist()}")


✓ Date format converted to date type
  Sample dates: [datetime.date(2025, 5, 21), datetime.date(2025, 5, 21), datetime.date(2025, 5, 21), datetime.date(2025, 5, 21), datetime.date(2025, 5, 22)]
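
The two-digit-year pivot mentioned in the cell comment can be checked with a small example. The sketch uses the standard-library parser so that it runs without pandas; the notebook's comment describes the same pivot for `pd.to_datetime`. `parse_survey_date` is an illustrative name.

```python
from datetime import date, datetime


def parse_survey_date(text: str) -> date:
    """Parse an m/d/yy string such as '5/21/25' into a date."""
    return datetime.strptime(text.strip(), "%m/%d/%y").date()


# 68 and 69 sit on either side of the century pivot for %y.
for raw in ("5/21/25", "1/1/68", "1/1/69"):
    print(raw, "->", parse_survey_date(raw).isoformat())
```

For survey data collected from 2010 onward the pivot is harmless, but a source file with four-digit years would need `%Y` instead.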


In [10]:
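# Clean up Direction field: strip leading and trailing whitespace
# The source mentions an "invisible difference in North" that displays as two levels.
# str.strip() removes surrounding whitespace only, so the value counts below are the
# check that a single 'North' level remains.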
df_transformed['image_direction'] = df_transformed['image_direction'].str.strip()

# Check for unique values and any issues
print("✓ Direction field cleaned")
print(f"  Unique directions: {sorted(df_transformed['image_direction'].dropna().unique())}")
print(f"  Direction counts:")
for direction, count in df_transformed['image_direction'].value_counts().items():
    print(f"    {repr(direction):12s}: {count}")


✓ Direction field cleaned
  Unique directions: ['East', 'North', 'West']
  Direction counts:
    'North'     : 39
    'West'      : 38
    'East'      : 1


In [11]:
# Generate image_url column by concatenating base URL with image_ID and .jpg extension
# Format: https://storage.cloud.google.com/gridveg-reference-images/{image_ID}.jpg
base_url = "https://storage.cloud.google.com/gridveg-reference-images/"
df_transformed['image_url'] = base_url + df_transformed['image_ID'] + ".jpg"

print("✓ Generated image_url column")
print(f"  Base URL: {base_url}")
print(f"  Sample URLs:")
for url in df_transformed['image_url'].head(3):
    print(f"    {url}")


✓ Generated image_url column
  Base URL: https://storage.cloud.google.com/gridveg-reference-images/
  Sample URLs:
    https://storage.cloud.google.com/gridveg-reference-images/3E0C8814-CA4F-4370-B998-0FF321625FEF.jpg
    https://storage.cloud.google.com/gridveg-reference-images/D39F4604-8B98-414A-AAE1-91880E10083B.jpg
    https://storage.cloud.google.com/gridveg-reference-images/4CAA802B-D9DF-43DD-931F-1A55BB114FC6.jpg


In [12]:
# Reorder columns to match destination schema
expected_column_order = ['image_ID', 'image_url', 'survey_ID', 'date', 'year', 'grid_point', 'image_direction']
df_transformed = df_transformed[expected_column_order]

print("✓ Columns reordered to match destination schema")
print(f"  Final columns: {list(df_transformed.columns)}")


✓ Columns reordered to match destination schema
  Final columns: ['image_ID', 'image_url', 'survey_ID', 'date', 'year', 'grid_point', 'image_direction']


In [13]:
# Display transformed data info
print("Transformed Data Info:")
df_transformed.info()
print(f"\nTransformed data preview:")
df_transformed.head()


Transformed Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 78 entries, 0 to 77
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   image_ID         78 non-null     object
 1   image_url        78 non-null     object
 2   survey_ID        78 non-null     object
 3   date             78 non-null     object
 4   year             78 non-null     int64 
 5   grid_point       78 non-null     int64 
 6   image_direction  78 non-null     object
dtypes: int64(2), object(5)
memory usage: 4.4+ KB

Transformed data preview:


Unnamed: 0,image_ID,image_url,survey_ID,date,year,grid_point,image_direction
0,3E0C8814-CA4F-4370-B998-0FF321625FEF,https://storage.cloud.google.com/gridveg-refer...,B45700C5-D391-4679-8579-217DCB1385A2,2025-05-21,2025,227,North
1,D39F4604-8B98-414A-AAE1-91880E10083B,https://storage.cloud.google.com/gridveg-refer...,B45700C5-D391-4679-8579-217DCB1385A2,2025-05-21,2025,227,West
2,4CAA802B-D9DF-43DD-931F-1A55BB114FC6,https://storage.cloud.google.com/gridveg-refer...,C0BD2A75-FF0B-48DC-BB9D-941267BF5838,2025-05-21,2025,190,North
3,3CCC329F-6956-4E7F-83F7-71E324FB733E,https://storage.cloud.google.com/gridveg-refer...,C0BD2A75-FF0B-48DC-BB9D-941267BF5838,2025-05-21,2025,190,West
4,7B70FE85-49B4-4AA2-85D6-2C85F4B85542,https://storage.cloud.google.com/gridveg-refer...,38A8FE64-8769-474C-BC25-01CBF006BFCC,2025-05-22,2025,331,North


## Read Existing BigQuery Table

Load the current data from BigQuery to compare with the new data.
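
The read cell treats every exception as "table may not exist yet", which would also hide permission or network errors and let the notebook proceed as if it were creating a new table. A narrower variant, sketched here, returns `None` only when BigQuery reports the table as not found and lets other errors surface. `read_existing_table` is a hypothetical helper; the fallback `NotFound` class exists only so the sketch can run where the Google client libraries are not installed.

```python
try:
    from google.api_core.exceptions import NotFound
except ImportError:
    # Stand-in so the sketch runs without the Google client libraries.
    class NotFound(Exception):
        pass


def read_existing_table(client, table_id):
    """Return the table as a DataFrame, or None if the table does not exist.

    Any other error (permissions, quota, network) is allowed to propagate.
    """
    query = f"SELECT * FROM `{table_id}`"
    try:
        return client.query(query).to_dataframe()
    except NotFound:
        return None
```

With the notebook's names, the call would be `df_existing = read_existing_table(bq_client, BQ_TABLE_ID)`.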


In [14]:
# Read existing data from BigQuery
print(f"Reading existing data from {BQ_TABLE_ID}...")
query = f"SELECT * FROM `{BQ_TABLE_ID}`"

try:
    df_existing = bq_client.query(query).to_dataframe()
    print(f"✓ Existing table loaded:")
    print(f"  Rows: {len(df_existing)}")
    print(f"  Columns: {list(df_existing.columns)}")
    print(f"\nExisting data preview:")
    display(df_existing.head())
except Exception as e:
    print(f"⚠ Error reading table: {e}")
    print("  This may be expected if the table doesn't exist yet.")
    df_existing = None


Reading existing data from mpg-data-warehouse.vegetation_point_intercept_gridVeg.gridVeg_image_metadata...
✓ Existing table loaded:
  Rows: 2756
  Columns: ['image_ID', 'image_url', 'survey_ID', 'date', 'year', 'grid_point', 'image_direction']

Existing data preview:


Unnamed: 0,image_ID,image_url,survey_ID,date,year,grid_point,image_direction
0,3A72E85E-A527-4DBA-96F4-82BF71D12414,https://gridveg-reference-images.s3.amazonaws....,447,2011-05-11,2011,230,West
1,80DB0FC1-4F0A-4071-9DEB-5461BC23D93A,https://gridveg-reference-images.s3.amazonaws....,447,2011-05-11,2011,230,North
2,BC4E63C7-3AD5-4D02-9753-A4229DA141C5,https://gridveg-reference-images.s3.amazonaws....,474,2011-05-11,2011,250,
3,177E28B0-9602-4A01-AFE5-D50AFED0CD63,https://gridveg-reference-images.s3.amazonaws....,474,2011-05-11,2011,250,
4,8F092C5F-16A1-43DF-ABEF-EE600E6972AD,https://gridveg-reference-images.s3.amazonaws....,478,2011-05-11,2011,268,West


In [15]:
# Display existing data info (if available)
if df_existing is not None:
    print("Existing Data Info:")
    df_existing.info()


Existing Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2756 entries, 0 to 2755
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   image_ID         2756 non-null   object
 1   image_url        2756 non-null   object
 2   survey_ID        2756 non-null   object
 3   date             2756 non-null   dbdate
 4   year             2756 non-null   Int64 
 5   grid_point       2756 non-null   Int64 
 6   image_direction  2725 non-null   object
dtypes: Int64(2), dbdate(1), object(4)
memory usage: 156.2+ KB


## Compare New vs Existing Data

Identify which rows in the new data are not already in the existing table.
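
The comparison treats `image_ID` as a unique key, but a set difference alone will not reveal an ID that occurs twice inside the new CSV. The sketch below separates the three cases using only the standard library; `classify_ids` is an illustrative helper and accepts any iterable of IDs, including a pandas column.

```python
from collections import Counter


def classify_ids(new_ids, existing_ids):
    """Split incoming IDs into new, already present, and repeated-in-batch."""
    existing = set(existing_ids)
    counts = Counter(new_ids)
    return {
        # IDs that are not in the existing table yet.
        "to_append": sorted(i for i in counts if i not in existing),
        # IDs that the existing table already contains.
        "already_present": sorted(i for i in counts if i in existing),
        # IDs that occur more than once within the incoming batch itself.
        "repeated_in_batch": sorted(i for i, n in counts.items() if n > 1),
    }


result = classify_ids(["a", "b", "b", "c"], ["c", "d"])
print(result)
```

In the notebook, `classify_ids(df_transformed['image_ID'], df_existing['image_ID'])` would give the same split; a non-empty `repeated_in_batch` list is a reason to stop before the append, since those rows would all be loaded.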


In [16]:
# Compare datasets
if df_existing is not None:
    print("=== Comparison Summary ===\n")
    
    # Row count comparison
    print(f"Row count:")
    print(f"  Existing: {len(df_existing)}")
    print(f"  New CSV:  {len(df_transformed)}")
    
    # Column comparison
    existing_cols = set(df_existing.columns)
    new_cols = set(df_transformed.columns)
    
    if existing_cols == new_cols:
        print(f"\n✓ Columns match ({len(new_cols)} columns)")
    else:
        print("\n⚠ Column differences detected:")
        if new_cols - existing_cols:
            print(f"  New columns: {new_cols - existing_cols}")
        if existing_cols - new_cols:
            print(f"  Missing columns: {existing_cols - new_cols}")
    
    print(f"\nColumns: {list(df_transformed.columns)}")
else:
    print("No existing data to compare - this will be a new table creation.")


=== Comparison Summary ===

Row count:
  Existing: 2756
  New CSV:  78

✓ Columns match (7 columns)

Columns: ['image_ID', 'image_url', 'survey_ID', 'date', 'year', 'grid_point', 'image_direction']


In [17]:
# Identify new records (not in existing table)
# Use image_ID as the unique key (described as unique in source schema)
if df_existing is not None:
    existing_ids = set(df_existing['image_ID'])
    new_ids = set(df_transformed['image_ID'])
    
    # Find records in new data that aren't in existing
    ids_to_append = new_ids - existing_ids
    
    if ids_to_append:
        df_to_append = df_transformed[df_transformed['image_ID'].isin(ids_to_append)].copy()
        
        print(f"✓ Found {len(df_to_append)} new records to append")
        
        # Show year breakdown of new records
        print(f"\nNew records by year:")
        year_counts = df_to_append['year'].value_counts().sort_index()
        for year, count in year_counts.items():
            print(f"  {year}: {count} records")
        
        # Show direction breakdown
        print(f"\nNew records by direction:")
        direction_counts = df_to_append['image_direction'].value_counts()
        for direction, count in direction_counts.items():
            print(f"  {direction}: {count} records")
        
        print(f"\nSample of new records:")
        display(df_to_append.head(10))
    else:
        df_to_append = None
        print("⚠ No new records found - all records already exist in table")
        print("  Nothing to append.")
    
    # Report records that already exist in the table
    duplicate_ids = existing_ids & new_ids
    if duplicate_ids:
        print(f"\n⚠ Warning: {len(duplicate_ids)} records already exist in table")
        print(f"  These will be skipped during append.")
        # Show at most 10 IDs, sorted so the sample is reproducible
        print(f"\n  Sample duplicate image_IDs:")
        for img_id in sorted(duplicate_ids)[:10]:
            print(f"    {img_id}")
else:
    # No existing table, so all records are new
    df_to_append = df_transformed.copy()
    print(f"✓ No existing table - will create new table with {len(df_to_append)} records")


✓ Found 78 new records to append

New records by year:
  2025: 78 records

New records by direction:
  North: 39 records
  West: 38 records
  East: 1 records

Sample of new records:


Unnamed: 0,image_ID,image_url,survey_ID,date,year,grid_point,image_direction
0,3E0C8814-CA4F-4370-B998-0FF321625FEF,https://storage.cloud.google.com/gridveg-refer...,B45700C5-D391-4679-8579-217DCB1385A2,2025-05-21,2025,227,North
1,D39F4604-8B98-414A-AAE1-91880E10083B,https://storage.cloud.google.com/gridveg-refer...,B45700C5-D391-4679-8579-217DCB1385A2,2025-05-21,2025,227,West
2,4CAA802B-D9DF-43DD-931F-1A55BB114FC6,https://storage.cloud.google.com/gridveg-refer...,C0BD2A75-FF0B-48DC-BB9D-941267BF5838,2025-05-21,2025,190,North
3,3CCC329F-6956-4E7F-83F7-71E324FB733E,https://storage.cloud.google.com/gridveg-refer...,C0BD2A75-FF0B-48DC-BB9D-941267BF5838,2025-05-21,2025,190,West
4,7B70FE85-49B4-4AA2-85D6-2C85F4B85542,https://storage.cloud.google.com/gridveg-refer...,38A8FE64-8769-474C-BC25-01CBF006BFCC,2025-05-22,2025,331,North
5,9A9DF2C4-19E7-441D-87C4-35EB052F2E0F,https://storage.cloud.google.com/gridveg-refer...,38A8FE64-8769-474C-BC25-01CBF006BFCC,2025-05-22,2025,331,West
6,3BD38821-711D-4690-B440-BDE67BB4A483,https://storage.cloud.google.com/gridveg-refer...,147224CA-F0FC-4E02-B2DE-8B17F5553B29,2025-05-26,2025,45,North
7,4EBE4821-A5E0-48D8-9856-7A6A166F59B8,https://storage.cloud.google.com/gridveg-refer...,147224CA-F0FC-4E02-B2DE-8B17F5553B29,2025-05-26,2025,45,East
8,53B56252-7CC8-4E7D-AC9C-C38B2CA960AB,https://storage.cloud.google.com/gridveg-refer...,CD7E5294-F7D8-4CD6-B35A-EDB356A88A73,2025-05-26,2025,165,North
9,D86644A3-6FCE-45C4-8FB7-0DF110F2C444,https://storage.cloud.google.com/gridveg-refer...,CD7E5294-F7D8-4CD6-B35A-EDB356A88A73,2025-05-26,2025,165,West


## Backup Existing Table

Before making any changes, create a backup of the existing table to GCS.
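
Because a rollback needs the exact timestamped folder, it can help to build the backup URI in one place and keep the returned value. A minimal sketch, assuming the same `gs://bucket/prefix/timestamp/*.csv` layout the backup cell uses; `build_backup_uri` is a hypothetical helper, and it defaults to UTC, whereas the notebook cell uses local time.

```python
from datetime import datetime, timezone


def build_backup_uri(bucket, prefix, when=None):
    """Return a wildcard GCS URI for a timestamped table export."""
    when = when or datetime.now(timezone.utc)
    stamp = when.strftime("%Y%m%d_%H%M%S")
    # Strip stray slashes so the parts join with exactly one separator.
    return f"gs://{bucket.strip('/')}/{prefix.strip('/')}/{stamp}/*.csv"


print(build_backup_uri("your-backup-bucket", "backups/gridveg_image_metadata"))
```

The returned string can be passed to `bq_client.extract_table` as the destination and reused later as the source URI for a restore.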


In [18]:
# Backup existing table to GCS
if df_existing is not None and BACKUP_BUCKET and df_to_append is not None:
    # Generate backup path with timestamp
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    backup_path = f"gs://{BACKUP_BUCKET}/{BACKUP_PREFIX}/{timestamp}/*.csv"
    
    print(f"Creating backup of existing table...")
    print(f"  Destination: {backup_path}")
    
    # Export table to GCS
    extract_job = bq_client.extract_table(
        BQ_TABLE_ID,
        backup_path,
        location="US"
    )
    
    extract_job.result()  # Wait for job to complete
    
    print(f"✓ Backup completed successfully")
    print(f"  Files: {backup_path}")
elif df_existing is None:
    print("⚠ No existing table to backup (table doesn't exist yet)")
elif not BACKUP_BUCKET:
    print("⚠ Backup bucket not configured in config.yml")
    print("  Set 'gridveg_image_metadata.gcs.backup_bucket' to enable automatic backups")
elif df_to_append is None:
    print("⚠ No new records to append, skipping backup")


Creating backup of existing table...
  Destination: gs://mpg-data-warehouse/gridVeg/bak/20251031_124607/*.csv
✓ Backup completed successfully
  Files: gs://mpg-data-warehouse/gridVeg/bak/20251031_124607/*.csv


## Append New Records to BigQuery

⚠️ **IMPORTANT**: This will APPEND new rows to the existing table (not replace).

Review the comparison above before proceeding.
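
The append cell lets the client infer column types from the DataFrame. One optional safeguard is to pass an explicit schema, so that a type mismatch fails the load job instead of being coerced. In the sketch below the column names come from this notebook, but the types are assumptions inferred from the `df_existing.info()` output; they should be checked against `bq_client.get_table(BQ_TABLE_ID).schema` (including field modes) before use. `ASSUMED_SCHEMA` and `build_append_job_config` are illustrative names.

```python
# Column names are from the notebook; types are ASSUMED, not read from the table.
ASSUMED_SCHEMA = [
    ("image_ID", "STRING"),
    ("image_url", "STRING"),
    ("survey_ID", "STRING"),
    ("date", "DATE"),
    ("year", "INTEGER"),
    ("grid_point", "INTEGER"),
    ("image_direction", "STRING"),
]


def build_append_job_config(schema=ASSUMED_SCHEMA):
    """Build a WRITE_APPEND load config with an explicit schema."""
    # Imported lazily so the schema list can be inspected without the client library.
    from google.cloud import bigquery

    return bigquery.LoadJobConfig(
        schema=[bigquery.SchemaField(name, field_type) for name, field_type in schema],
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
```

The result would replace `job_config` in the append cell: `bq_client.load_table_from_dataframe(df_to_append, BQ_TABLE_ID, job_config=build_append_job_config())`.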


In [19]:
# Append new records to BigQuery
if df_to_append is not None and len(df_to_append) > 0:
    print("=" * 60)
    print("APPENDING TO BIGQUERY TABLE")
    print("=" * 60)
    print(f"\nTable: {BQ_TABLE_ID}")
    print(f"Rows to append: {len(df_to_append)}")
    print(f"Mode: WRITE_APPEND (add to existing table)")
    print(f"\nStarting append at {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}...")
    
    # Configure job to append to existing table
    job_config = bigquery.LoadJobConfig(
        write_disposition="WRITE_APPEND"  # Append to existing table
    )
    
    # Load dataframe to BigQuery
    load_job = bq_client.load_table_from_dataframe(
        df_to_append,
        BQ_TABLE_ID,
        job_config=job_config
    )
    
    # Wait for job to complete
    load_job.result()
    
    print(f"\n✓ Append completed at {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
    print(f"  Rows appended: {load_job.output_rows}")
    print(f"  Job ID: {load_job.job_id}")
else:
    print("=" * 60)
    print("NO RECORDS TO APPEND")
    print("=" * 60)
    print("\nNo new records found or no records to append.")
    print("Table remains unchanged.")


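============================================================
APPENDING TO BIGQUERY TABLE
============================================================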

Table: mpg-data-warehouse.vegetation_point_intercept_gridVeg.gridVeg_image_metadata
Rows to append: 78
Mode: WRITE_APPEND (add to existing table)

Starting append at 2025-10-31 12:46:37...

✓ Append completed at 2025-10-31 12:46:41
  Rows appended: 78
  Job ID: 216170b6-044a-4c99-afd3-e61d2c5a03e9


## Verify Append

Read back the table to verify the append was successful.
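
A `SELECT *` without `ORDER BY` does not guarantee row order, so a `tail()` preview may not show the rows that were just appended. A more direct check is to count how many of the appended `image_ID` values are now in the table. This is a sketch using a parameterized query; `build_verification_sql` and `count_appended_ids` are hypothetical helpers.

```python
def build_verification_sql(table_id):
    """Return SQL that counts how many of the given image_IDs are in the table."""
    return (
        "SELECT COUNT(DISTINCT image_ID) AS n_found "
        f"FROM `{table_id}` "
        "WHERE image_ID IN UNNEST(@image_ids)"
    )


def count_appended_ids(client, table_id, image_ids):
    """Run the verification query and return the number of IDs found."""
    # Imported lazily so the SQL builder works without the client library.
    from google.cloud import bigquery

    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ArrayQueryParameter("image_ids", "STRING", list(image_ids))
        ]
    )
    rows = client.query(build_verification_sql(table_id), job_config=job_config).result()
    return next(iter(rows))["n_found"]


print(build_verification_sql("your-project.your_dataset.your_table"))
```

With the notebook's names, the append is confirmed when `count_appended_ids(bq_client, BQ_TABLE_ID, df_to_append['image_ID'])` equals `len(df_to_append)`.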


In [20]:
# Read updated table
print("Verifying append...")
query = f"SELECT * FROM `{BQ_TABLE_ID}`"
df_updated = bq_client.query(query).to_dataframe()

print(f"\n✓ Verification complete")
print(f"  Rows in table: {len(df_updated)}")
print(f"  Columns: {list(df_updated.columns)}")

# Show records by year
print(f"\nRecords by year:")
year_counts = df_updated['year'].value_counts().sort_index()
for year, count in year_counts.items():
    print(f"  {year}: {count} records")

# Show records by direction
print(f"\nRecords by direction:")
direction_counts = df_updated['image_direction'].value_counts()
for direction, count in direction_counts.items():
    print(f"  {direction}: {count} records")

print(f"\nUpdated table preview:")
df_updated.tail(10)


Verifying append...

✓ Verification complete
  Rows in table: 2834
  Columns: ['image_ID', 'image_url', 'survey_ID', 'date', 'year', 'grid_point', 'image_direction']

Records by year:
  2010: 4 records
  2011: 696 records
  2012: 310 records
  2013: 52 records
  2015: 108 records
  2016: 1140 records
  2017: 12 records
  2021: 190 records
  2022: 110 records
  2023: 72 records
  2024: 62 records
  2025: 78 records

Records by direction:
  North: 1401 records
  West: 1395 records
  East: 7 records

Updated table preview:


Unnamed: 0,image_ID,image_url,survey_ID,date,year,grid_point,image_direction
2824,5AC961D3-31AE-445C-93C5-CF56CDAC36F1,https://gridveg-reference-images.s3.amazonaws....,4FBE3C92-0867-4E1D-8A8B-3AE4F519CBC6,2022-07-11,2022,538,West
2825,1C134E8B-FE82-451D-8E32-1B0DE0524E42,https://gridveg-reference-images.s3.amazonaws....,4FBE3C92-0867-4E1D-8A8B-3AE4F519CBC6,2022-07-11,2022,538,North
2826,C5D5D9DD-88AA-48C1-AD32-3AEA43F20DF0,https://gridveg-reference-images.s3.amazonaws....,F764CD20-83CC-40DC-ABE7-E18E5F8B7C86,2022-07-11,2022,326,West
2827,12F57EF8-F898-443E-9196-0D5BAD6A4B57,https://gridveg-reference-images.s3.amazonaws....,F764CD20-83CC-40DC-ABE7-E18E5F8B7C86,2022-07-11,2022,326,North
2828,C8B92783-BBEF-4F67-9A11-C8E36B3338D7,https://gridveg-reference-images.s3.amazonaws....,1812FABC-B40F-4EBB-915D-47D081AE094E,2022-07-12,2022,572,North
2829,11578321-EF63-4703-B872-6D7DFAC98A0A,https://gridveg-reference-images.s3.amazonaws....,1812FABC-B40F-4EBB-915D-47D081AE094E,2022-07-12,2022,572,West
2830,9415977E-2F66-4F02-91D3-2A1BBE4E8B18,https://gridveg-reference-images.s3.amazonaws....,597BCC48-254F-4750-8D11-77F8594104C1,2022-07-12,2022,478,West
2831,7ED77A90-2619-4149-B728-50028B50FA46,https://gridveg-reference-images.s3.amazonaws....,597BCC48-254F-4750-8D11-77F8594104C1,2022-07-12,2022,478,North
2832,C0DC347D-00D9-42E3-8D5D-5D579A18F294,https://gridveg-reference-images.s3.amazonaws....,BB722297-A24C-4DEF-9A2C-1E886E08500A,2022-07-13,2022,582,West
2833,82A2CDD5-2199-4B09-8EAB-92DEDEAADC66,https://gridveg-reference-images.s3.amazonaws....,BB722297-A24C-4DEF-9A2C-1E886E08500A,2022-07-13,2022,582,North


In [21]:
# Verify row counts
if df_to_append is not None and len(df_to_append) > 0:
    expected_rows = len(df_existing) + len(df_to_append) if df_existing is not None else len(df_to_append)
    actual_rows = len(df_updated)
    
    print("Data integrity check:")
    if df_existing is not None:
        print(f"  Previous rows:   {len(df_existing)}")
        print(f"  Rows appended:   {len(df_to_append)}")
        print(f"  Expected total:  {expected_rows}")
        print(f"  Actual total:    {actual_rows}")
    else:
        print(f"  Rows written:    {len(df_to_append)}")
        print(f"  Rows in table:   {actual_rows}")
    
    if expected_rows == actual_rows:
        print(f"\n✓ Row count verified - all {len(df_to_append)} new rows successfully appended")
    else:
        print(f"\n⚠ Row count mismatch!")
        print(f"  Expected: {expected_rows}")
        print(f"  Actual:   {actual_rows}")
        print(f"  Difference: {actual_rows - expected_rows}")
else:
    print("No new records were appended.")


Data integrity check:
  Previous rows:   2756
  Rows appended:   78
  Expected total:  2834
  Actual total:    2834

✓ Row count verified - all 78 new rows successfully appended


## Summary Report

Complete summary of the append operation.


In [22]:
# Generate summary report
print("=" * 60)
print("GRIDVEG IMAGE METADATA APPEND SUMMARY")
print("=" * 60)

print(f"\n📅 Timestamp: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

print(f"\n📂 Source:")
print(f"  CSV: {GCS_CSV_URL.split('/')[-1]}")
print(f"  Location: {'/'.join(GCS_CSV_URL.split('/')[:-1])}")

print(f"\n🎯 Target:")
print(f"  Table: {BQ_TABLE_ID}")
print(f"  Project: {bq_client.project}")

print(f"\n📊 Data Changes:")
if df_existing is not None:
    print(f"  Previous rows: {len(df_existing)}")
    print(f"  New rows:      {len(df_updated)}")
    print(f"  Rows appended: {len(df_updated) - len(df_existing):+d}")
    
    if df_to_append is not None and len(df_to_append) > 0:
        print(f"\n  Appended records by year:")
        year_counts = df_to_append['year'].value_counts().sort_index()
        for year, count in year_counts.items():
            print(f"    {year}: {count} records")
        
        print(f"\n  Appended records by direction:")
        direction_counts = df_to_append['image_direction'].value_counts()
        for direction, count in direction_counts.items():
            print(f"    {direction}: {count} records")
else:
    print(f"  New table created with {len(df_updated)} rows")

print(f"\n🔄 Transformations Applied:")
print(f"  ✓ Renamed {len(column_mapping)} columns to match BigQuery schema")
print(f"  ✓ Converted date format to ISO (YYYY-MM-DD)")
print(f"  ✓ Cleaned up Direction field (stripped whitespace)")
print(f"  ✓ Generated image_url from image_ID and base URL")
print(f"  ✓ Used image_ID as unique key for duplicate detection")

if BACKUP_BUCKET and df_existing is not None and df_to_append is not None and len(df_to_append) > 0:
    print(f"\n💾 Backup:")
    print(f"  Location: gs://{BACKUP_BUCKET}/{BACKUP_PREFIX}/")
    print(f"  Status: ✓ Created before append")

if df_to_append is not None and len(df_to_append) > 0:
    print(f"\n✅ Append completed successfully!")
else:
    print(f"\n✅ No changes needed - table is up to date!")
print("=" * 60)


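============================================================
GRIDVEG IMAGE METADATA APPEND SUMMARY
============================================================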

📅 Timestamp: 2025-10-31 12:47:10

📂 Source:
  CSV: 2025-09-18_gridVeg_ref_image_metadata_SOURCE.csv
  Location: gs://mpg-data-warehouse/gridVeg/src/2025

🎯 Target:
  Table: mpg-data-warehouse.vegetation_point_intercept_gridVeg.gridVeg_image_metadata
  Project: mpg-data-warehouse

📊 Data Changes:
  Previous rows: 2756
  New rows:      2834
  Rows appended: +78

  Appended records by year:
    2025: 78 records

  Appended records by direction:
    North: 39 records
    West: 38 records
    East: 1 records

🔄 Transformations Applied:
  ✓ Renamed 6 columns to match BigQuery schema
  ✓ Converted date format to ISO (YYYY-MM-DD)
  ✓ Cleaned up Direction field (stripped whitespace)
  ✓ Generated image_url from image_ID and base URL
  ✓ Used image_ID as unique key for duplicate detection

💾 Backup:
  Location: gs://mpg-data-warehouse/gridVeg/bak/
  Status: ✓ Created before append

✅ Append completed successfully!
============================================================


## Rollback Instructions (If Needed)

If you need to roll back to the previous version, use the backup created in the "Backup Existing Table" step, which ran before the append.

```python
# Option 1: remove only the appended rows (needs df_to_append and df_updated from this session)
# ids_to_remove = set(df_to_append['image_ID'])
# df_rollback = df_updated[~df_updated['image_ID'].isin(ids_to_remove)]
# job_config = bigquery.LoadJobConfig(write_disposition="WRITE_TRUNCATE")
# bq_client.load_table_from_dataframe(df_rollback, BQ_TABLE_ID, job_config=job_config).result()

# Option 2: restore from backup.
# pd.read_csv reads a single file and does not combine "*.csv" wildcard matches,
# so load the exported files directly from GCS with a load job instead:
# backup_uri = "gs://BACKUP_BUCKET/BACKUP_PREFIX/TIMESTAMP/*.csv"
# job_config = bigquery.LoadJobConfig(
#     source_format=bigquery.SourceFormat.CSV,
#     skip_leading_rows=1,  # exported CSV files include a header row by default
#     write_disposition="WRITE_TRUNCATE",
# )
# bq_client.load_table_from_uri(backup_uri, BQ_TABLE_ID, job_config=job_config).result()
```

The backup location was printed in the backup cell above.
