# Fix NA Rows in gridVeg Additional Species

This notebook investigates and fixes NA/NULL rows in the BigQuery table `mpg-data-warehouse.vegetation_point_intercept_gridVeg.gridVeg_additional_species`.

**Operation**: Identify rows with NULL values in the critical field `key_plant_species` and remove them

## Requirements
- Google Cloud credentials configured
- Configuration file: copy `config.example.yml` to `config.yml` and fill in your values
- Required packages: google-cloud-bigquery, pandas, pyyaml


In [1]:
# Import required libraries
import yaml
import pandas as pd
from pathlib import Path
from google.cloud import bigquery
from datetime import datetime

print("Libraries imported successfully")


An error occurred: module 'importlib.metadata' has no attribute 'packages_distributions'
Libraries imported successfully


## Load Configuration


In [2]:
# Load configuration from YAML file
config_path = Path("../config.yml")

if not config_path.exists():
    raise FileNotFoundError(
        f"Configuration file not found: {config_path}\n"
        "Please copy config.example.yml to config.yml and fill in your values."
    )

with open(config_path, 'r') as f:
    config = yaml.safe_load(f)

# Extract configuration values for gridVeg additional species
BQ_TABLE_ID = config['gridveg_additional_species']['bigquery']['table_id']
BQ_PROJECT = config['gridveg_additional_species']['bigquery'].get('project')
BACKUP_BUCKET = config['gridveg_additional_species']['gcs'].get('backup_bucket')
BACKUP_PREFIX = config['gridveg_additional_species']['gcs'].get('backup_prefix', 'backups/gridveg_additional_species')

# Verify required config values
if not BQ_TABLE_ID or 'your-project' in BQ_TABLE_ID:
    raise ValueError("Please configure gridveg_additional_species.bigquery.table_id in config.yml")

print("✓ Configuration loaded successfully")
print(f"  Table ID: {BQ_TABLE_ID}")
print(f"  Backup: gs://{BACKUP_BUCKET}/{BACKUP_PREFIX}" if BACKUP_BUCKET else "  Backup: Not configured")


✓ Configuration loaded successfully
  Table ID: mpg-data-warehouse.vegetation_point_intercept_gridVeg.gridVeg_additional_species
  Backup: gs://mpg-data-warehouse/gridVeg/bak
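
The configuration cell reads nested keys directly, so a missing `gcs` section would surface as a bare `KeyError`. A minimal sketch of an up-front check is shown below. The key names mirror the lookups in the cell above; the helper `validate_config` itself is hypothetical and not part of the original notebook.

```python
from __future__ import annotations

from typing import Any

# Top-level section name taken from the configuration cell above.
REQUIRED_SECTION = "gridveg_additional_species"


def validate_config(config: dict[str, Any]) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK.

    Hypothetical helper: it only checks the keys the notebook reads.
    """
    section = config.get(REQUIRED_SECTION)
    if not isinstance(section, dict):
        return [f"missing top-level section '{REQUIRED_SECTION}'"]

    problems: list[str] = []
    bigquery_cfg = section.get("bigquery") or {}
    table_id = bigquery_cfg.get("table_id")
    if not table_id or "your-project" in table_id:
        problems.append("bigquery.table_id is missing or still a placeholder")
    elif table_id.count(".") != 2:
        problems.append("bigquery.table_id should look like project.dataset.table")

    # The notebook indexes ['gcs'] directly, so the section must exist
    # even when no backup bucket is configured.
    if "gcs" not in section:
        problems.append("gcs section is missing (backup_bucket may be left empty)")
    return problems


example = {
    REQUIRED_SECTION: {
        "bigquery": {"table_id": "my-project.my_dataset.my_table"},
        "gcs": {"backup_bucket": None},
    }
}
print(validate_config(example))
```

Calling this right after `yaml.safe_load` and raising if the list is non-empty would report every problem at once rather than stopping at the first missing key.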


In [3]:
# Initialize BigQuery client
bq_client = bigquery.Client(project=BQ_PROJECT) if BQ_PROJECT else bigquery.Client()

print(f"✓ BigQuery client initialized")
print(f"  Project: {bq_client.project}")


✓ BigQuery client initialized
  Project: mpg-data-warehouse


## Investigate Current Table State


In [4]:
# Get table schema and basic info
table = bq_client.get_table(BQ_TABLE_ID)

print("Table Schema:")
for field in table.schema:
    print(f"  {field.name}: {field.field_type} (nullable: {field.mode != 'REQUIRED'})")

print(f"\nTotal rows in table: {table.num_rows}")


Table Schema:
  survey_ID: STRING (nullable: True)
  grid_point: INTEGER (nullable: True)
  date: DATE (nullable: True)
  year: INTEGER (nullable: True)
  key_plant_species: INTEGER (nullable: True)

Total rows in table: 14052


In [5]:
# Query to get all data from the table
query = f"SELECT * FROM `{BQ_TABLE_ID}`"

print("Loading current table data...")
df_current = bq_client.query(query).to_dataframe()

print(f"✓ Data loaded: {len(df_current)} rows")
print(f"  Columns: {list(df_current.columns)}")

# Display info
df_current.info()


Loading current table data...
✓ Data loaded: 14052 rows
  Columns: ['survey_ID', 'grid_point', 'date', 'year', 'key_plant_species']
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14052 entries, 0 to 14051
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   survey_ID          14052 non-null  object
 1   grid_point         14052 non-null  Int64 
 2   date               14052 non-null  dbdate
 3   year               14052 non-null  Int64 
 4   key_plant_species  14038 non-null  Int64 
dtypes: Int64(3), dbdate(1), object(1)
memory usage: 590.2+ KB


## Analyze NULL/NA Values


In [6]:
# Check for NULL values in each column
print("NULL Value Analysis:")
print("=" * 60)

null_counts = df_current.isnull().sum()
null_percentages = (df_current.isnull().sum() / len(df_current) * 100)

for col in df_current.columns:
    null_count = null_counts[col]
    null_pct = null_percentages[col]
    if null_count > 0:
        print(f"  {col:20s}: {null_count:5d} nulls ({null_pct:5.2f}%)")
    else:
        print(f"  {col:20s}: No nulls")

print("\n" + "=" * 60)


NULL Value Analysis:
  survey_ID           : No nulls
  grid_point          : No nulls
  date                : No nulls
  year                : No nulls
  key_plant_species   :    14 nulls ( 0.10%)
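
The per-column counts above are computed in pandas after downloading the whole table. For larger tables the same figures could be obtained server-side. The sketch below only builds the SQL text; the function name `build_null_count_query` is hypothetical, and it assumes the column names are plain identifiers like the ones in this schema.

```python
from __future__ import annotations


def build_null_count_query(table_id: str, columns: list[str]) -> str:
    """Build a query returning total rows plus one NULL count per column."""
    if not columns:
        raise ValueError("at least one column is required")
    # COUNTIF counts the rows for which the condition is TRUE.
    counts = ",\n  ".join(
        f"COUNTIF(`{col}` IS NULL) AS `{col}_nulls`" for col in columns
    )
    return f"SELECT\n  COUNT(*) AS total_rows,\n  {counts}\nFROM `{table_id}`"


sql = build_null_count_query(
    "my-project.my_dataset.my_table",  # placeholder table ID
    ["survey_ID", "grid_point", "date", "year", "key_plant_species"],
)
print(sql)
```

Passing the resulting string to `bq_client.query(sql).to_dataframe()` should return a single row with one count per column, which avoids transferring every record just to count NULLs.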



In [7]:
# Identify rows with any NULL values
rows_with_nulls = df_current[df_current.isnull().any(axis=1)]

print(f"Rows with at least one NULL value: {len(rows_with_nulls)}")

if len(rows_with_nulls) > 0:
    print(f"\nBreakdown by column with NULL:")
    for col in df_current.columns:
        null_in_col = df_current[df_current[col].isnull()]
        if len(null_in_col) > 0:
            print(f"  {col}: {len(null_in_col)} rows")
    
    print(f"\nSample of rows with NULL values:")
    display(rows_with_nulls.head(20))


Rows with at least one NULL value: 14

Breakdown by column with NULL:
  key_plant_species: 14 rows

Sample of rows with NULL values:


| | survey_ID | grid_point | date | year | key_plant_species |
|---|---|---|---|---|---|
| 11606 | E064CE0F-3978-4BE4-BA3A-6575034446A2 | 87 | 2021-05-21 | 2021 | |
| 11613 | E064CE0F-3978-4BE4-BA3A-6575034446A2 | 87 | 2021-05-21 | 2021 | |
| 11852 | DEE675D7-3B6B-430C-87F5-E773015AC5CD | 4 | 2021-06-01 | 2021 | |
| 11863 | 092A30B6-A394-4E9E-B384-167592BE1A38 | 28 | 2021-06-02 | 2021 | |
| 12182 | 24AFCAFA-4F92-493D-8363-36F7C370D320 | 94 | 2021-06-18 | 2021 | |
| 12398 | 92A3F539-A77E-4D70-A7C3-EA92B16C9845 | 486 | 2021-06-29 | 2021 | |
| 12414 | 92A3F539-A77E-4D70-A7C3-EA92B16C9845 | 486 | 2021-06-29 | 2021 | |
| 12450 | 3EBCFA0E-1714-4764-8C6B-E0C768FF56FD | 227 | 2021-06-30 | 2021 | |
| 12754 | 9D272F52-DDBB-4CC1-8DC5-C12EEB2D4EBA | 45 | 2022-05-18 | 2022 | |
| 12833 | 12C3CD1F-672B-4DF4-A290-6FB8BB18413E | 180 | 2022-06-01 | 2022 | |


In [8]:
# Check specifically for NULL in key_plant_species (the critical field)
null_species = df_current[df_current['key_plant_species'].isnull()]

print(f"Rows with NULL key_plant_species: {len(null_species)}")

if len(null_species) > 0:
    print(f"\nDistribution by year:")
    year_dist = null_species['year'].value_counts().sort_index()
    for year, count in year_dist.items():
        print(f"  {year}: {count} rows")
    
    print(f"\nSample records with NULL key_plant_species:")
    display(null_species.head(20))


Rows with NULL key_plant_species: 14

Distribution by year:
  2021: 8 rows
  2022: 3 rows
  2023: 1 rows
  2024: 2 rows

Sample records with NULL key_plant_species:


| | survey_ID | grid_point | date | year | key_plant_species |
|---|---|---|---|---|---|
| 11606 | E064CE0F-3978-4BE4-BA3A-6575034446A2 | 87 | 2021-05-21 | 2021 | |
| 11613 | E064CE0F-3978-4BE4-BA3A-6575034446A2 | 87 | 2021-05-21 | 2021 | |
| 11852 | DEE675D7-3B6B-430C-87F5-E773015AC5CD | 4 | 2021-06-01 | 2021 | |
| 11863 | 092A30B6-A394-4E9E-B384-167592BE1A38 | 28 | 2021-06-02 | 2021 | |
| 12182 | 24AFCAFA-4F92-493D-8363-36F7C370D320 | 94 | 2021-06-18 | 2021 | |
| 12398 | 92A3F539-A77E-4D70-A7C3-EA92B16C9845 | 486 | 2021-06-29 | 2021 | |
| 12414 | 92A3F539-A77E-4D70-A7C3-EA92B16C9845 | 486 | 2021-06-29 | 2021 | |
| 12450 | 3EBCFA0E-1714-4764-8C6B-E0C768FF56FD | 227 | 2021-06-30 | 2021 | |
| 12754 | 9D272F52-DDBB-4CC1-8DC5-C12EEB2D4EBA | 45 | 2022-05-18 | 2022 | |
| 12833 | 12C3CD1F-672B-4DF4-A290-6FB8BB18413E | 180 | 2022-06-01 | 2022 | |


## Analysis Summary


In [9]:
# Generate summary report
print("=" * 60)
print("DATA QUALITY ANALYSIS SUMMARY")
print("=" * 60)

print(f"\nTotal records in table: {len(df_current)}")
print(f"Records with NULL values: {len(rows_with_nulls)} ({len(rows_with_nulls)/len(df_current)*100:.2f}%)")
print(f"Clean records: {len(df_current) - len(rows_with_nulls)} ({(len(df_current) - len(rows_with_nulls))/len(df_current)*100:.2f}%)")

print(f"\nNULL values by column:")
for col in df_current.columns:
    null_count = df_current[col].isnull().sum()
    if null_count > 0:
        print(f"  {col}: {null_count} ({null_count/len(df_current)*100:.2f}%)")

print("\n" + "=" * 60)


DATA QUALITY ANALYSIS SUMMARY

Total records in table: 14052
Records with NULL values: 14 (0.10%)
Clean records: 14038 (99.90%)

NULL values by column:
  key_plant_species: 14 (0.10%)



## Backup Existing Table

Before making any changes, create a backup of the existing table to GCS.


In [10]:
# Backup existing table to GCS
if BACKUP_BUCKET:
    # Generate backup path with timestamp
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    backup_path = f"gs://{BACKUP_BUCKET}/{BACKUP_PREFIX}/fix_na_rows_{timestamp}/*.csv"
    
    print(f"Creating backup of existing table...")
    print(f"  Destination: {backup_path}")
    
    # Export table to GCS
    extract_job = bq_client.extract_table(
        BQ_TABLE_ID,
        backup_path,
        location="US"
    )
    
    extract_job.result()  # Wait for job to complete
    
    print(f"✓ Backup completed successfully")
    print(f"  Files: {backup_path}")
else:
    print("⚠ Backup bucket not configured in config.yml")
    print("  Set 'gridveg_additional_species.gcs.backup_bucket' to enable automatic backups")


Creating backup of existing table...
  Destination: gs://mpg-data-warehouse/gridVeg/bak/fix_na_rows_20251106_135136/*.csv
✓ Backup completed successfully
  Files: gs://mpg-data-warehouse/gridVeg/bak/fix_na_rows_20251106_135136/*.csv
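
The timestamped backup URI is the value a later restore needs, so it can help to build it in one place and keep it for reuse. The sketch below reproduces the path format printed above; `build_backup_uri` and its `label` parameter are hypothetical names introduced for illustration.

```python
from datetime import datetime


def build_backup_uri(
    bucket: str, prefix: str, when: datetime, label: str = "fix_na_rows"
) -> str:
    """Return the wildcard GCS URI for a timestamped backup folder."""
    timestamp = when.strftime("%Y%m%d_%H%M%S")
    # Strip stray slashes so the prefix joins cleanly with the bucket name.
    return f"gs://{bucket}/{prefix.strip('/')}/{label}_{timestamp}/*.csv"


uri = build_backup_uri(
    "mpg-data-warehouse", "gridVeg/bak", datetime(2025, 11, 6, 13, 51, 36)
)
print(uri)
# gs://mpg-data-warehouse/gridVeg/bak/fix_na_rows_20251106_135136/*.csv
```

Recording this exact string (for example in the summary report) would make it unambiguous which backup folder belongs to which run.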


## Prepare Clean Data

Remove rows with NULL values in the `key_plant_species` field.


In [12]:
# Create clean dataset by removing rows with NULL key_plant_species
df_clean = df_current[df_current['key_plant_species'].notna()].copy()

print("Clean Dataset Preparation:")
print(f"  Original rows:    {len(df_current)}")
print(f"  Rows with NULL:   {len(df_current) - len(df_clean)}")
print(f"  Clean rows:       {len(df_clean)}")
print(f"  Rows to remove:   {len(df_current) - len(df_clean)}")

# Show what will be removed
if len(df_current) - len(df_clean) > 0:
    print(f"\nRows to be removed (by year):")
    removed_rows = df_current[df_current['key_plant_species'].isna()]
    year_dist = removed_rows['year'].value_counts().sort_index()
    for year, count in year_dist.items():
        print(f"  {year}: {count} rows")

# Verify data integrity
print(f"\nData Integrity Check:")
print(f"  NULL key_plant_species in clean data: {df_clean['key_plant_species'].isna().sum()}")
print(f"  All rows have species?: {df_clean['key_plant_species'].notna().all()}")


Clean Dataset Preparation:
  Original rows:    14052
  Rows with NULL:   14
  Clean rows:       14038
  Rows to remove:   14

Rows to be removed (by year):
  2021: 8 rows
  2022: 3 rows
  2023: 1 rows
  2024: 2 rows

Data Integrity Check:
  NULL key_plant_species in clean data: 0
  All rows have species?: True


## Replace Table with Clean Data

⚠️ **IMPORTANT**: This will REPLACE the entire table with the clean dataset (no NULL rows).

Review the summary above before proceeding.


In [13]:
# Replace table with clean data
print("=" * 60)
print("REPLACING BIGQUERY TABLE WITH CLEAN DATA")
print("=" * 60)
print(f"\nTable: {BQ_TABLE_ID}")
print(f"Current rows: {len(df_current)}")
print(f"New rows (clean): {len(df_clean)}")
print(f"Rows removed: {len(df_current) - len(df_clean)}")
print(f"Mode: WRITE_TRUNCATE (replace entire table)")
print(f"\nStarting replacement at {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}...")

# Configure job to replace existing table
job_config = bigquery.LoadJobConfig(
    write_disposition="WRITE_TRUNCATE"  # Replace entire table
)

# Load clean dataframe to BigQuery
load_job = bq_client.load_table_from_dataframe(
    df_clean,
    BQ_TABLE_ID,
    job_config=job_config
)

# Wait for job to complete
load_job.result()

print(f"\n✓ Replacement completed at {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print(f"  Rows written: {load_job.output_rows}")
print(f"  Job ID: {load_job.job_id}")


REPLACING BIGQUERY TABLE WITH CLEAN DATA

Table: mpg-data-warehouse.vegetation_point_intercept_gridVeg.gridVeg_additional_species
Current rows: 14052
New rows (clean): 14038
Rows removed: 14
Mode: WRITE_TRUNCATE (replace entire table)

Starting replacement at 2025-11-06 13:52:21...





✓ Replacement completed at 2025-11-06 13:52:25
  Rows written: 14038
  Job ID: 37f56469-df08-4cf8-9a4a-79f250e3cc4f
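
This notebook replaces the whole table with `WRITE_TRUNCATE`. A possible alternative, not the method used here, is a DML `DELETE` that removes only the NULL rows in place. The sketch below only builds the statement; `build_delete_null_rows_query` is a hypothetical helper, and it assumes DML statements are permitted on the target table.

```python
import re

# Accept only simple column names so the value is safe to interpolate.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_delete_null_rows_query(table_id: str, column: str) -> str:
    """Build a DELETE statement removing rows where `column` is NULL."""
    if not _IDENTIFIER.match(column):
        raise ValueError(f"unexpected column name: {column!r}")
    return f"DELETE FROM `{table_id}` WHERE `{column}` IS NULL"


sql = build_delete_null_rows_query(
    "my-project.my_dataset.my_table",  # placeholder table ID
    "key_plant_species",
)
print(sql)
# DELETE FROM `my-project.my_dataset.my_table` WHERE `key_plant_species` IS NULL
```

Running the statement with `bq_client.query(sql).result()` would leave the other rows untouched, and the job's `num_dml_affected_rows` should then match the expected 14. The backup step remains advisable either way.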


## Verify Fix

Read back the table to verify NA rows have been removed.


In [14]:
# Read updated table
print("Verifying fix...")
query = f"SELECT * FROM `{BQ_TABLE_ID}`"
df_updated = bq_client.query(query).to_dataframe()

print(f"\n✓ Verification query complete")
print(f"  Rows in table: {len(df_updated)}")
print(f"  Columns: {list(df_updated.columns)}")

# Check for NULL values
print(f"\nNULL Value Check:")
null_counts_after = df_updated.isnull().sum()
for col in df_updated.columns:
    null_count = null_counts_after[col]
    if null_count > 0:
        print(f"  {col}: {null_count} NULLs (⚠️ UNEXPECTED)")
    else:
        print(f"  {col}: No NULLs ✓")

# Show records by year
print(f"\nRecords by year:")
year_counts = df_updated['year'].value_counts().sort_index()
for year, count in year_counts.items():
    print(f"  {year}: {count} records")


Verifying fix...

✓ Verification query complete
  Rows in table: 14038
  Columns: ['survey_ID', 'grid_point', 'date', 'year', 'key_plant_species']

NULL Value Check:
  survey_ID: No NULLs ✓
  grid_point: No NULLs ✓
  date: No NULLs ✓
  year: No NULLs ✓
  key_plant_species: No NULLs ✓

Records by year:
  2011: 4043 records
  2012: 1747 records
  2013: 209 records
  2015: 485 records
  2016: 4906 records
  2017: 39 records
  2021: 1230 records
  2022: 451 records
  2023: 266 records
  2024: 272 records
  2025: 390 records


In [15]:
# Verify row counts
expected_rows = len(df_clean)
actual_rows = len(df_updated)

print("\nData integrity check:")
print(f"  Expected rows:  {expected_rows}")
print(f"  Actual rows:    {actual_rows}")
print(f"  Rows removed:   {len(df_current) - actual_rows}")

if expected_rows == actual_rows:
    print(f"\n✓ Row count verified - table successfully cleaned")
else:
    print(f"\n⚠ Row count mismatch!")
    print(f"  Difference: {actual_rows - expected_rows}")

# Check if any NULL key_plant_species remain
null_species_after = df_updated[df_updated['key_plant_species'].isna()]
if len(null_species_after) == 0:
    print(f"\n✓ SUCCESS: No NULL key_plant_species values found in updated table")
else:
    print(f"\n⚠ WARNING: {len(null_species_after)} NULL key_plant_species values still exist!")



Data integrity check:
  Expected rows:  14038
  Actual rows:    14038
  Rows removed:   14

✓ Row count verified - table successfully cleaned

✓ SUCCESS: No NULL key_plant_species values found in updated table
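
The figures printed in this notebook can be reconciled with simple arithmetic: the per-year removals should sum to the difference in row counts, and the per-year totals after the fix should sum to the new table size. The short check below hard-codes the numbers from the outputs above purely for illustration.

```python
# Values copied from the notebook outputs above.
rows_before = 14052
rows_after = 14038
removed_by_year = {2021: 8, 2022: 3, 2023: 1, 2024: 2}
records_by_year_after = {
    2011: 4043, 2012: 1747, 2013: 209, 2015: 485, 2016: 4906, 2017: 39,
    2021: 1230, 2022: 451, 2023: 266, 2024: 272, 2025: 390,
}

removed_total = sum(removed_by_year.values())

# Removed rows must account for the whole difference in row counts.
assert rows_before - rows_after == removed_total

# Per-year totals after the fix must add up to the new table size.
assert sum(records_by_year_after.values()) == rows_after

share = removed_total / rows_before * 100
print(f"{removed_total} rows removed ({share:.2f}% of {rows_before})")
# 14 rows removed (0.10% of 14052)
```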


## Summary Report

Complete summary of the fix operation.


In [16]:
# Generate summary report
print("=" * 60)
print("FIX NA ROWS SUMMARY")
print("=" * 60)

print(f"\n📅 Timestamp: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

print(f"\n🎯 Target:")
print(f"  Table: {BQ_TABLE_ID}")
print(f"  Project: {bq_client.project}")

print(f"\n📊 Data Changes:")
print(f"  Original rows:  {len(df_current)}")
print(f"  Cleaned rows:   {len(df_updated)}")
print(f"  Rows removed:   {len(df_current) - len(df_updated)}")

if len(df_current) - len(df_updated) > 0:
    removed_rows = df_current[df_current['key_plant_species'].isna()]
    print(f"\n  Removed rows by year:")
    year_counts = removed_rows['year'].value_counts().sort_index()
    for year, count in year_counts.items():
        print(f"    {year}: {count} rows")

print(f"\n🔄 Operations Performed:")
print(f"  ✓ Backed up table to GCS")
print(f"  ✓ Removed rows with NULL key_plant_species")
print(f"  ✓ Replaced table with clean data")
print(f"  ✓ Verified data integrity")

if BACKUP_BUCKET:
    print(f"\n💾 Backup:")
    print(f"  Location: gs://{BACKUP_BUCKET}/{BACKUP_PREFIX}/")
    print(f"  Status: ✓ Created before fix")

# Final validation
null_check = df_updated['key_plant_species'].isna().sum()
if null_check == 0:
    print(f"\n✅ Fix completed successfully!")
    print(f"   No NULL key_plant_species values remain in table")
else:
    print(f"\n⚠️ WARNING: {null_check} NULL values still exist")

print("=" * 60)


FIX NA ROWS SUMMARY

📅 Timestamp: 2025-11-06 13:52:53

🎯 Target:
  Table: mpg-data-warehouse.vegetation_point_intercept_gridVeg.gridVeg_additional_species
  Project: mpg-data-warehouse

📊 Data Changes:
  Original rows:  14052
  Cleaned rows:   14038
  Rows removed:   14

  Removed rows by year:
    2021: 8 rows
    2022: 3 rows
    2023: 1 rows
    2024: 2 rows

🔄 Operations Performed:
  ✓ Backed up table to GCS
  ✓ Removed rows with NULL key_plant_species
  ✓ Replaced table with clean data
  ✓ Verified data integrity

💾 Backup:
  Location: gs://mpg-data-warehouse/gridVeg/bak/
  Status: ✓ Created before fix

✅ Fix completed successfully!
   No NULL key_plant_species values remain in table
   No NULL key_plant_species values remain in table


## Rollback Instructions (If Needed)

If you need to roll back to the previous version, restore the table from the backup created before the table was replaced.

```python
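# To roll back, load the backup CSV files from GCS straight into BigQuery.
# (pd.read_csv cannot expand the "*.csv" wildcard and would lose column types.)
# backup_uri = "gs://BACKUP_BUCKET/BACKUP_PREFIX/fix_na_rows_TIMESTAMP/*.csv"
# job_config = bigquery.LoadJobConfig(
#     source_format=bigquery.SourceFormat.CSV,
#     skip_leading_rows=1,  # the exported CSV files start with a header row
#     schema=bq_client.get_table(BQ_TABLE_ID).schema,  # reuse the current schema
#     write_disposition="WRITE_TRUNCATE",
# )
# bq_client.load_table_from_uri(backup_uri, BQ_TABLE_ID, job_config=job_config).result()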
```

The backup location was printed in the backup cell above.
