# Append GridVeg Species Richness to BigQuery

This notebook appends gridVeg species richness data from a CSV file in Google Cloud Storage (GCS) to an existing BigQuery table.

## Features
- ✅ Reads CSV from Google Cloud Storage
- ✅ Creates backup of existing BigQuery table before appending
- ✅ Appends new data to existing table (WRITE_APPEND mode)
- ✅ Validates data integrity after append
- ✅ Provides detailed summary report

## Requirements
- Google Cloud credentials configured
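- Configuration file: `config.yml` with the `gridveg_species_richness_append` section filled in
- Required packages: google-cloud-bigquery, google-cloud-storage, pandas, pyyaml, plus gcsfs (so pandas can read `gs://` paths) and pyarrow (used by the BigQuery DataFrame load)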

## Configuration
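Update `config.yml` with your specific values in the `gridveg_species_richness_append` section (the section name the notebook reads via `CONFIG_SECTION`):
```yaml
gridveg_species_richness_append: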
  gcs:
    csv_url: "gs://your-bucket/path/to/data.csv"
    backup_bucket: "your-bucket"
    backup_prefix: "backups/csv_append"
  bigquery:
    table_id: "your-project.your_dataset.your_table"
    project: "your-project-id"
```
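The required and optional keys in this layout can be checked before any cloud call is made. The following is a minimal sketch, not the notebook's own code: the function name `read_append_settings` and the rule that any value containing `your-` is an unfilled placeholder are assumptions made for illustration.

```python
def read_append_settings(config: dict, section: str) -> dict:
    """Return the settings for one append job, failing early on gaps.

    `config` is the dict produced by yaml.safe_load(); `section` is the
    top-level key that holds the `gcs` and `bigquery` blocks.
    """
    section_cfg = config.get(section)
    if not section_cfg:
        raise KeyError(f"Section '{section}' not found (or empty) in config.yml")

    gcs = section_cfg.get("gcs") or {}
    bq = section_cfg.get("bigquery") or {}

    settings = {
        "csv_url": gcs.get("csv_url"),                         # required
        "backup_bucket": gcs.get("backup_bucket"),             # optional
        "backup_prefix": gcs.get("backup_prefix", "backups"),  # optional
        "table_id": bq.get("table_id"),                        # required
        "project": bq.get("project"),                          # optional
    }

    # Assumption: template values such as "gs://your-bucket/..." contain "your-".
    required = {"csv_url": "gcs.csv_url", "table_id": "bigquery.table_id"}
    for key, path in required.items():
        value = settings[key]
        if not value or "your-" in value:
            raise ValueError(f"Please set {section}.{path} in config.yml")
    return settings
```

Called as `read_append_settings(config, "gridveg_species_richness_append")`, it would return one dictionary instead of five module-level constants, which can make the later cells easier to reuse for other tables.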

In [1]:
# Import required libraries
import yaml
import pandas as pd
from pathlib import Path
from google.cloud import bigquery
from google.cloud import storage
from datetime import datetime

print("Libraries imported successfully")
print(f"Timestamp: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")


An error occurred: module 'importlib.metadata' has no attribute 'packages_distributions'
Libraries imported successfully
Timestamp: 2025-11-06 11:40:05




## Load Configuration

Load settings from `config.yml` including:
- CSV source URL in GCS
- BigQuery table information
- Backup location settings


In [2]:
# Load configuration from YAML file
config_path = Path("../config.yml")

if not config_path.exists():
    raise FileNotFoundError(
        f"Configuration file not found: {config_path}\n"
        "Please copy config.example.yml to config.yml and fill in your values."
    )

with open(config_path, 'r') as f:
    config = yaml.safe_load(f)

# Configuration section for GridVeg Species Richness
CONFIG_SECTION = 'gridveg_species_richness_append'

# Extract configuration values
GCS_CSV_URL = config[CONFIG_SECTION]['gcs']['csv_url']
BACKUP_BUCKET = config[CONFIG_SECTION]['gcs'].get('backup_bucket')
BACKUP_PREFIX = config[CONFIG_SECTION]['gcs'].get('backup_prefix', 'backups')
BQ_TABLE_ID = config[CONFIG_SECTION]['bigquery']['table_id']
BQ_PROJECT = config[CONFIG_SECTION]['bigquery'].get('project')

# Verify required config values
if not GCS_CSV_URL or GCS_CSV_URL.startswith('gs://your-'):
    raise ValueError(f"Please configure {CONFIG_SECTION}.gcs.csv_url in config.yml")
if not BQ_TABLE_ID or 'your-project' in BQ_TABLE_ID:
    raise ValueError(f"Please configure {CONFIG_SECTION}.bigquery.table_id in config.yml")

print("✓ Configuration loaded successfully")
print(f"  Config section: {CONFIG_SECTION}")
print(f"  CSV URL: {GCS_CSV_URL[:50]}..." if len(GCS_CSV_URL) > 50 else f"  CSV URL: {GCS_CSV_URL}")
print(f"  Table ID: {BQ_TABLE_ID}")
print(f"  Backup: gs://{BACKUP_BUCKET}/{BACKUP_PREFIX}" if BACKUP_BUCKET else "  Backup: Not configured")


✓ Configuration loaded successfully
  Config section: gridveg_species_richness_append
  CSV URL: gs://mpg-data-warehouse/gridVeg/src/2025/gridVeg_s...
  Table ID: mpg-data-warehouse.vegetation_gridVeg_summaries.gridVeg_species_richness
  Backup: gs://mpg-data-warehouse-backups/backups/vegetation_gridVeg_summaries/gridVeg_species_richness


In [3]:
# Initialize Google Cloud clients
bq_client = bigquery.Client(project=BQ_PROJECT) if BQ_PROJECT else bigquery.Client()
storage_client = storage.Client(project=BQ_PROJECT) if BQ_PROJECT else storage.Client()

print("✓ Clients initialized")
print(f"  Project: {bq_client.project}")


✓ Clients initialized
  Project: mpg-data-warehouse


## Read Existing BigQuery Table

Load the current table to:
- Verify it exists
- Get row count before append
- Prepare for backup
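If only the existence check and the row count are needed, the table metadata may be enough, without downloading every row. A minimal sketch, assuming `client` is a `google.cloud.bigquery.Client`; the helper name `table_row_count` is hypothetical:

```python
def table_row_count(client, table_id: str) -> int:
    """Return the row count recorded in the table's metadata.

    get_table() raises google.api_core.exceptions.NotFound when the table
    does not exist, so this also serves as the existence check.
    """
    table = client.get_table(table_id)  # metadata lookup; no rows are downloaded
    return table.num_rows


# Possible use in this notebook (names taken from the configuration cell):
# rows_before = table_row_count(bq_client, BQ_TABLE_ID)
```

Note that `num_rows` does not count rows still in a streaming buffer, so for tables that receive streaming inserts a `COUNT(*)` query is the safer figure.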


In [4]:
# Read existing data from BigQuery
print(f"Reading existing data from {BQ_TABLE_ID}...")
query = f"SELECT * FROM `{BQ_TABLE_ID}`"

try:
    df_existing = bq_client.query(query).to_dataframe()
    print(f"✓ Existing table loaded:")
    print(f"  Rows: {len(df_existing)}")
    print(f"  Columns: {list(df_existing.columns)}")
    print(f"\nFirst few rows:")
    display(df_existing.head())
except Exception as e:
    print(f"✗ Error reading table: {e}")
    print("  The table must exist before appending data.")
    raise


Reading existing data from mpg-data-warehouse.vegetation_gridVeg_summaries.gridVeg_species_richness...
✓ Existing table loaded:
  Rows: 38056
  Columns: ['survey_ID', 'grid_point', 'year', 'key_plant_species', 'detection_type']

First few rows:


| | survey_ID | grid_point | year | key_plant_species | detection_type |
|---|---|---|---|---|---|
| 0 | 69 | 329 | 2011 | 435 | point_intercept |
| 1 | 69 | 329 | 2011 | 82 | point_intercept |
| 2 | 69 | 329 | 2011 | 12 | point_intercept |
| 3 | 69 | 329 | 2011 | 496 | point_intercept |
| 4 | 69 | 329 | 2011 | 497 | point_intercept |


## Backup Existing Table to GCS

⚠️ **CRITICAL STEP**: Create a backup of the existing table before appending new data.

This backup can be used to restore the table if needed.
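Since a later restore depends on the exported files, it can be worth confirming that they actually exist in the bucket once the export job finishes. The sketch below assumes `storage_client` is a `google.cloud.storage.Client`; the helper name `list_backup_files` is hypothetical.

```python
def list_backup_files(storage_client, bucket_name: str, prefix: str) -> list:
    """Return (object name, size in bytes) for each CSV under a backup prefix."""
    blobs = storage_client.list_blobs(bucket_name, prefix=prefix)
    return [(blob.name, blob.size) for blob in blobs if blob.name.endswith(".csv")]


# Possible use after the export job completes (names from the backup cell):
# files = list_backup_files(
#     storage_client, BACKUP_BUCKET, f"{BACKUP_PREFIX}/{BACKUP_TIMESTAMP}/"
# )
# if not files:
#     raise RuntimeError("Backup export produced no CSV files; do not append.")
```

The prefix is the folder part of the export URI, without the `*.csv` wildcard.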


In [5]:
# Backup existing table to GCS
if BACKUP_BUCKET:
    # Generate backup path with timestamp
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    backup_path = f"gs://{BACKUP_BUCKET}/{BACKUP_PREFIX}/{timestamp}/*.csv"
    
    print(f"Creating backup of existing table...")
    print(f"  Source table: {BQ_TABLE_ID}")
    print(f"  Destination: {backup_path}")
    print(f"  Rows to backup: {len(df_existing)}")
    
    # Export table to GCS
    extract_job = bq_client.extract_table(
        BQ_TABLE_ID,
        backup_path,
        location="US"
    )
    
    # Wait for job to complete
    extract_job.result()
    
    print(f"\n✓ Backup completed successfully")
    print(f"  Job ID: {extract_job.job_id}")
    print(f"  Backup location: {backup_path}")
    
    # Store backup info for later reference
    BACKUP_LOCATION = backup_path
    BACKUP_TIMESTAMP = timestamp
else:
    print("⚠ Backup bucket not configured in config.yml")
    print(f"  Set '{CONFIG_SECTION}.gcs.backup_bucket' to enable automatic backups")
    print("  Proceeding without backup...")
    BACKUP_LOCATION = None
    BACKUP_TIMESTAMP = None


Creating backup of existing table...
  Source table: mpg-data-warehouse.vegetation_gridVeg_summaries.gridVeg_species_richness
  Destination: gs://mpg-data-warehouse-backups/backups/vegetation_gridVeg_summaries/gridVeg_species_richness/20251106_114025/*.csv
  Rows to backup: 38056

✓ Backup completed successfully
  Job ID: dd027685-778d-44fc-994a-a829ad115c11
  Backup location: gs://mpg-data-warehouse-backups/backups/vegetation_gridVeg_summaries/gridVeg_species_richness/20251106_114025/*.csv


## Read CSV from GCS

Load the new data to be appended to the BigQuery table.
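The read step tries UTF-8 first and falls back to latin-1. The same idea can be sketched on raw bytes, which makes the behaviour easy to test in isolation; the helper name `decode_csv_bytes` is hypothetical.

```python
def decode_csv_bytes(raw: bytes) -> tuple:
    """Decode CSV bytes as UTF-8, falling back to latin-1.

    Returns (text, encoding_used).
    """
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # latin-1 maps every byte to a character, so this branch never raises.
        # A file in some other encoding would still decode, but to the wrong
        # characters, so the result deserves a visual check.
        return raw.decode("latin-1"), "latin-1"
```

The decoded text could then be handed to pandas through `io.StringIO`. Reporting which encoding was used makes it visible when the fallback was taken.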


In [6]:
# Read CSV from GCS
print(f"Reading CSV from GCS...")
print(f"  Source: {GCS_CSV_URL}")

try:
    # Try UTF-8 first, fallback to latin-1 if needed
    df_new = pd.read_csv(GCS_CSV_URL, encoding='utf-8')
except UnicodeDecodeError:
    print("  Note: Using latin-1 encoding to handle special characters")
    df_new = pd.read_csv(GCS_CSV_URL, encoding='latin-1')

print(f"\n✓ CSV loaded successfully:")
print(f"  Rows: {len(df_new)}")
print(f"  Columns: {list(df_new.columns)}")
print(f"\nFirst few rows:")
df_new.head()


Reading CSV from GCS...
  Source: gs://mpg-data-warehouse/gridVeg/src/2025/gridVeg_species_richness_WRANGLE-251104.csv

✓ CSV loaded successfully:
  Rows: 2597
  Columns: ['survey_ID', 'grid_point', 'year', 'key_plant_species', 'detection_type']

First few rows:


| | survey_ID | grid_point | year | key_plant_species | detection_type |
|---|---|---|---|---|---|
| 0 | 27869B01-61AE-4DB3-A2AC-AD4D8C9A7ECD | 3 | 2023 | 529 | point_intercept |
| 1 | 27869B01-61AE-4DB3-A2AC-AD4D8C9A7ECD | 3 | 2023 | 232 | point_intercept |
| 2 | 27869B01-61AE-4DB3-A2AC-AD4D8C9A7ECD | 3 | 2023 | 320 | point_intercept |
| 3 | 27869B01-61AE-4DB3-A2AC-AD4D8C9A7ECD | 3 | 2023 | 265 | point_intercept |
| 4 | 27869B01-61AE-4DB3-A2AC-AD4D8C9A7ECD | 3 | 2023 | 80 | point_intercept |


## Validate Schema Compatibility

Verify that the CSV columns match the existing BigQuery table schema.
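The comparison can also be expressed as a small function that returns its findings instead of printing them. This is a sketch under one loudly stated assumption: dtype names are compared case-insensitively, so pandas' nullable `Int64` and NumPy's `int64` count as the same type. That shortcut does not cover every nullable/non-nullable pair (for example `boolean` versus `bool`). The function name `compare_schemas` is hypothetical.

```python
def compare_schemas(existing: dict, new: dict) -> dict:
    """Compare two {column name: dtype name} mappings.

    Returns the columns only in `new`, the columns only in `existing`,
    and the shared columns whose dtype names differ.
    """
    existing_cols, new_cols = set(existing), set(new)
    shared = existing_cols & new_cols
    return {
        "extra_in_new": sorted(new_cols - existing_cols),
        "missing_in_new": sorted(existing_cols - new_cols),
        # Assumption: "Int64" and "int64" describe compatible integer columns.
        "dtype_mismatches": sorted(
            col for col in shared if existing[col].lower() != new[col].lower()
        ),
    }


# Possible use with the DataFrames in this notebook:
# report = compare_schemas(
#     df_existing.dtypes.astype(str).to_dict(),
#     df_new.dtypes.astype(str).to_dict(),
# )
```

Returning a dictionary lets a later cell decide whether to stop, rather than relying on an interactive `input()` prompt.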


In [7]:
# Validate schema compatibility
print("=== Schema Validation ===\n")

# Check column names
existing_cols = set(df_existing.columns)
new_cols = set(df_new.columns)

if existing_cols == new_cols:
    print(f"✓ Column names match ({len(new_cols)} columns)")
else:
    print("⚠ Column differences detected:")
    if new_cols - existing_cols:
        print(f"  Extra columns in CSV: {new_cols - existing_cols}")
    if existing_cols - new_cols:
        print(f"  Missing columns in CSV: {existing_cols - new_cols}")
    
    user_input = input("\nContinue anyway? (yes/no): ")
    if user_input.lower() != 'yes':
        raise ValueError("Schema mismatch - aborting append operation")

print(f"\nColumns: {list(df_new.columns)}")

# Check data types
print(f"\nData type comparison:")
for col in df_new.columns:
    if col in df_existing.columns:
        existing_type = str(df_existing[col].dtype)
        new_type = str(df_new[col].dtype)
        match_symbol = "✓" if existing_type == new_type else "⚠"
        print(f"  {match_symbol} {col:30s} existing: {existing_type:10s} → new: {new_type:10s}")

print(f"\n✓ Schema validation complete")


=== Schema Validation ===

✓ Column names match (5 columns)

Columns: ['survey_ID', 'grid_point', 'year', 'key_plant_species', 'detection_type']

Data type comparison:
  ✓ survey_ID                      existing: object     → new: object
  ⚠ grid_point                     existing: Int64      → new: int64
  ⚠ year                           existing: Int64      → new: int64
  ⚠ key_plant_species              existing: Int64      → new: int64
  ✓ detection_type                 existing: object     → new: object

✓ Schema validation complete


## Convert Data Types

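Convert data types in the new data to match the existing table schema. For this table no conversion is applied: the `survey_sequence` column handled in the cell is not among its columns. The remaining `Int64` versus `int64` flags only distinguish pandas' nullable integer type from NumPy's integer type; both describe integer columns, and the append further down completes with these types.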


In [8]:
# Convert data types to match existing table
print("Converting data types to match existing schema...")

# Convert survey_sequence to string to match existing table.
# This table has no survey_sequence column, so the branch is skipped here.
if 'survey_sequence' in df_new.columns:
    df_new['survey_sequence'] = df_new['survey_sequence'].astype(str)
    print("  ✓ survey_sequence: int64 → object (string)")

print("\n✓ Data type conversions complete")
print("\nUpdated data types:")
for col in df_new.columns:
    if col in df_existing.columns:
        existing_type = str(df_existing[col].dtype)
        new_type = str(df_new[col].dtype)
        match_symbol = "✓" if existing_type == new_type else "⚠"
        print(f"  {match_symbol} {col:30s} existing: {existing_type:10s} → new: {new_type:10s}")


Converting data types to match existing schema...

✓ Data type conversions complete

Updated data types:
  ✓ survey_ID                      existing: object     → new: object
  ⚠ grid_point                     existing: Int64      → new: int64
  ⚠ year                           existing: Int64      → new: int64
  ⚠ key_plant_species              existing: Int64      → new: int64
  ✓ detection_type                 existing: object     → new: object


## Preview Data Comparison

Compare existing and new data before appending.
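Because `WRITE_APPEND` adds rows unconditionally, running the notebook twice with the same CSV would duplicate its rows. One possible pre-append check is to look for `survey_ID` values that already exist in the table. This sketch assumes that a `survey_ID` identifies one survey across loads, which the source does not state; the helper name `overlapping_survey_ids` is hypothetical.

```python
def overlapping_survey_ids(existing_ids, new_ids) -> list:
    """Return the survey IDs present in both the table and the new CSV.

    IDs are compared as strings, matching the object dtype of survey_ID.
    """
    return sorted(set(map(str, existing_ids)) & set(map(str, new_ids)))


# Possible use before the append cell:
# overlap = overlapping_survey_ids(df_existing["survey_ID"], df_new["survey_ID"])
# if overlap:
#     print(f"{len(overlap)} survey IDs already exist in the table, e.g. {overlap[:3]}")
```

An overlap is a prompt to look closer rather than proof of duplication, since the same survey could legitimately contribute further rows.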


In [9]:
# Display data summary
print("=== Data Summary ===\n")
print(f"Existing table:")
print(f"  Rows: {len(df_existing)}")
print(f"  Columns: {len(df_existing.columns)}")

print(f"\nNew data to append:")
print(f"  Rows: {len(df_new)}")
print(f"  Columns: {len(df_new.columns)}")

print(f"\nAfter append:")
print(f"  Expected total rows: {len(df_existing) + len(df_new)}")

print("\n--- Existing Data Sample ---")
display(df_existing.head(3))

print("\n--- New Data Sample ---")
display(df_new.head(3))


=== Data Summary ===

Existing table:
  Rows: 38056
  Columns: 5

New data to append:
  Rows: 2597
  Columns: 5

After append:
  Expected total rows: 40653

--- Existing Data Sample ---


| | survey_ID | grid_point | year | key_plant_species | detection_type |
|---|---|---|---|---|---|
| 0 | 69 | 329 | 2011 | 435 | point_intercept |
| 1 | 69 | 329 | 2011 | 82 | point_intercept |
| 2 | 69 | 329 | 2011 | 12 | point_intercept |



--- New Data Sample ---


| | survey_ID | grid_point | year | key_plant_species | detection_type |
|---|---|---|---|---|---|
| 0 | 27869B01-61AE-4DB3-A2AC-AD4D8C9A7ECD | 3 | 2023 | 529 | point_intercept |
| 1 | 27869B01-61AE-4DB3-A2AC-AD4D8C9A7ECD | 3 | 2023 | 232 | point_intercept |
| 2 | 27869B01-61AE-4DB3-A2AC-AD4D8C9A7ECD | 3 | 2023 | 320 | point_intercept |


## Append Data to BigQuery Table

⚠️ **IMPORTANT**: This will append new rows to the existing table.

The backup has been created. Review the data above before proceeding.


In [10]:
# Append data to BigQuery table
print("=" * 60)
print("APPENDING DATA TO BIGQUERY TABLE")
print("=" * 60)
print(f"\nTable: {BQ_TABLE_ID}")
print(f"Rows to append: {len(df_new)}")
print(f"Current rows: {len(df_existing)}")
print(f"Mode: WRITE_APPEND (add to existing table)")
print(f"\nStarting append at {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}...")

# Configure job to append to existing table
job_config = bigquery.LoadJobConfig(
    write_disposition="WRITE_APPEND"  # Append to existing table
)

# Load dataframe to BigQuery
load_job = bq_client.load_table_from_dataframe(
    df_new,
    BQ_TABLE_ID,
    job_config=job_config
)

# Wait for job to complete
load_job.result()

print(f"\n✓ Append completed at {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print(f"  Rows appended: {load_job.output_rows}")
print(f"  Job ID: {load_job.job_id}")


APPENDING DATA TO BIGQUERY TABLE

Table: mpg-data-warehouse.vegetation_gridVeg_summaries.gridVeg_species_richness
Rows to append: 2597
Current rows: 38056
Mode: WRITE_APPEND (add to existing table)

Starting append at 2025-11-06 11:41:03...





✓ Append completed at 2025-11-06 11:41:06
  Rows appended: 2597
  Job ID: d16b927f-2408-462d-947e-cd3e88fc5b55


## Verify Append Operation

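Read back the table to verify the append was successful. A `SELECT *` without `ORDER BY` returns rows in no guaranteed order, so the last rows displayed are not necessarily the newly appended ones; the row-count check in the following cell is the actual verification.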
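One way to confirm that the new rows landed, independent of row order, is to count rows per `year` and compare the result with the years in the appended CSV. A minimal sketch, assuming `client` is a `google.cloud.bigquery.Client` and that the table has the `year` column listed in the output above; the helper name `rows_per_year` is hypothetical.

```python
def rows_per_year(client, table_id: str) -> dict:
    """Return {year: row count} for the given table."""
    sql = (
        f"SELECT year, COUNT(*) AS n FROM `{table_id}` "
        "GROUP BY year ORDER BY year"
    )
    # Each result row supports lookup by column name.
    return {row["year"]: row["n"] for row in client.query(sql).result()}


# Possible use: run once before and once after the append, then compare.
# counts_after = rows_per_year(bq_client, BQ_TABLE_ID)
```

Running it before and after the append shows which years gained rows, which a total row count alone cannot show.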

In [11]:
# Read updated table
print("Verifying append operation...")
query = f"SELECT * FROM `{BQ_TABLE_ID}`"
df_updated = bq_client.query(query).to_dataframe()

print(f"\n✓ Verification complete")
print(f"  Rows in table: {len(df_updated)}")
print(f"  Columns: {list(df_updated.columns)}")
print(f"\nLast few rows of updated table (should include new data):")
df_updated.tail()


Verifying append operation...

✓ Verification complete
  Rows in table: 40653
  Columns: ['survey_ID', 'grid_point', 'year', 'key_plant_species', 'detection_type']

Last few rows of updated table (should include new data):


| | survey_ID | grid_point | year | key_plant_species | detection_type |
|---|---|---|---|---|---|
| 40648 | FFFC121E-C275-4271-B8C6-F8AA7503225C | 240 | 2016 | 530 | supplemental_obs |
| 40649 | FFFC121E-C275-4271-B8C6-F8AA7503225C | 240 | 2016 | 545 | supplemental_obs |
| 40650 | FFFC121E-C275-4271-B8C6-F8AA7503225C | 240 | 2016 | 561 | supplemental_obs |
| 40651 | FFFC121E-C275-4271-B8C6-F8AA7503225C | 240 | 2016 | 520 | supplemental_obs |
| 40652 | FFFC121E-C275-4271-B8C6-F8AA7503225C | 240 | 2016 | 522 | supplemental_obs |


In [12]:
# Verify row counts
print("Data integrity check:")
print(f"  Rows before append:  {len(df_existing)}")
print(f"  Rows appended:       {len(df_new)}")
print(f"  Expected total:      {len(df_existing) + len(df_new)}")
print(f"  Actual rows in table: {len(df_updated)}")

if len(df_updated) == len(df_existing) + len(df_new):
    print(f"\n✓ Row count verified - all {len(df_new)} rows successfully appended")
else:
    print(f"\n⚠ Row count mismatch!")
    print(f"  Expected: {len(df_existing) + len(df_new)}")
    print(f"  Actual:   {len(df_updated)}")
    print(f"  Difference: {len(df_updated) - (len(df_existing) + len(df_new))}")


Data integrity check:
  Rows before append:  38056
  Rows appended:       2597
  Expected total:      40653
  Actual rows in table: 40653

✓ Row count verified - all 2597 rows successfully appended


## Summary Report

Complete summary of the append operation.


In [13]:
# Generate summary report
print("=" * 60)
print("CSV APPEND TO BIGQUERY - SUMMARY REPORT")
print("=" * 60)

print(f"\n📅 Timestamp: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

print(f"\n📂 Source:")
print(f"  CSV: {GCS_CSV_URL.split('/')[-1]}")
print(f"  Location: {'/'.join(GCS_CSV_URL.split('/')[:-1])}")

print(f"\n🎯 Target:")
print(f"  Table: {BQ_TABLE_ID}")
print(f"  Project: {bq_client.project}")

print(f"\n📊 Data Changes:")
print(f"  Rows before:  {len(df_existing)}")
print(f"  Rows added:   {len(df_new)}")
print(f"  Rows after:   {len(df_updated)}")
print(f"  Net change:   +{len(df_updated) - len(df_existing)}")

if BACKUP_LOCATION:
    print(f"\n💾 Backup:")
    print(f"  Location: {BACKUP_LOCATION}")
    print(f"  Timestamp: {BACKUP_TIMESTAMP}")
    print(f"  Status: ✓ Created before append")
else:
    print(f"\n💾 Backup:")
    print(f"  Status: ⚠ No backup created")

print(f"\n✅ Append completed successfully!")
print("=" * 60)


CSV APPEND TO BIGQUERY - SUMMARY REPORT

📅 Timestamp: 2025-11-06 11:41:21

📂 Source:
  CSV: gridVeg_species_richness_WRANGLE-251104.csv
  Location: gs://mpg-data-warehouse/gridVeg/src/2025

🎯 Target:
  Table: mpg-data-warehouse.vegetation_gridVeg_summaries.gridVeg_species_richness
  Project: mpg-data-warehouse

📊 Data Changes:
  Rows before:  38056
  Rows added:   2597
  Rows after:   40653
  Net change:   +2597

💾 Backup:
  Location: gs://mpg-data-warehouse-backups/backups/vegetation_gridVeg_summaries/gridVeg_species_richness/20251106_114025/*.csv
  Timestamp: 20251106_114025
  Status: ✓ Created before append

✅ Append completed successfully!


## Rollback Instructions (If Needed)

If you need to rollback to the previous version, restore from the backup created at the beginning of this notebook.

### Option 1: Restore from the GCS backup

```python
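# Replace the table contents with the backup files stored in GCS.
# The export may be split across several files, so load them with a
# BigQuery load job, which accepts the wildcard URI, rather than a
# single pd.read_csv call.
backup_uri = "gs://BUCKET/PREFIX/TIMESTAMP/*.csv"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,  # exported CSV files start with a header row
    schema=bq_client.get_table(BQ_TABLE_ID).schema,  # keep current column types
    write_disposition="WRITE_TRUNCATE",  # Replace entire table
)

load_job = bq_client.load_table_from_uri(
    backup_uri,
    BQ_TABLE_ID,
    job_config=job_config,
)
load_job.result()
print("✓ Table restored from backup")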
```

### Option 2: Query to remove appended rows

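If you can identify the appended rows, you can delete them with SQL. This table has no load-timestamp column, so the condition has to come from the data itself, for example `survey_ID` values that occur only in the appended CSV: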

```sql
DELETE FROM `project.dataset.table`
WHERE condition_to_identify_new_rows;
```
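As one possible way to fill in that condition, the sketch below deletes by `survey_ID` using a query parameter instead of string-built SQL. It rests on the assumption that none of the appended `survey_ID` values existed in the table before the append; if that does not hold, it would also delete older rows. The function names are hypothetical, and `client` is expected to be a `google.cloud.bigquery.Client`.

```python
def build_delete_statement(table_id: str, survey_ids) -> tuple:
    """Return (sql, unique ids) for a parameterized DELETE by survey_ID."""
    ids = sorted({str(s) for s in survey_ids})
    if not ids:
        raise ValueError("No survey IDs given; refusing to build a DELETE")
    sql = f"DELETE FROM `{table_id}` WHERE survey_ID IN UNNEST(@survey_ids)"
    return sql, ids


def delete_appended_rows(client, table_id: str, survey_ids) -> int:
    """Run the DELETE and return the number of rows BigQuery reports removed."""
    from google.cloud import bigquery  # imported here so the builder stays dependency-free

    sql, ids = build_delete_statement(table_id, survey_ids)
    job_config = bigquery.QueryJobConfig(
        query_parameters=[bigquery.ArrayQueryParameter("survey_ids", "STRING", ids)]
    )
    job = client.query(sql, job_config=job_config)
    job.result()  # wait for the DML statement to finish
    return job.num_dml_affected_rows


# Possible use, with df_new still in memory:
# removed = delete_appended_rows(bq_client, BQ_TABLE_ID, df_new["survey_ID"].unique())
```

Comparing the returned count with the number of appended rows gives a quick check that the delete removed exactly what the append added.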

The backup location was printed in the backup cell above.
