# Fix NA Rows in gridVeg Point Intercept Ground

This notebook investigates and fixes NA/NULL rows in the BigQuery table `mpg-data-warehouse.vegetation_point_intercept_gridVeg.gridVeg_point_intercept_ground`.

**Operation**: Identify and remove rows with NULL values in the `intercept_ground_code` field.

## Requirements
- Google Cloud credentials configured
- Configuration file: copy `config.example.yml` to `config.yml` and fill in your values
- Required packages: google-cloud-bigquery, pandas, pyyaml
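
As a rough guide to what `config.yml` needs to contain, the sketch below shows the parsed structure as a Python dict, inferred from the keys the configuration cell reads. All values are placeholders, and `missing_sections` is a hypothetical helper, not part of the notebook.

```python
# Shape of the parsed config.yml that the configuration cell indexes into.
# Every value here is a placeholder, not a real setting.
example_config = {
    "gridveg_point_intercepts": {
        "bigquery": {
            "table_ground": "your-project.your_dataset.your_table",  # required
            "project": None,  # optional; read with .get()
        },
        "gcs": {
            "backup_bucket": None,  # optional; the backup cell is skipped if unset
            "backup_prefix": "backups/gridveg_point_intercepts",  # optional
        },
    }
}


def missing_sections(config):
    """Return dotted paths of the nested keys the notebook indexes directly."""
    required = [
        ("gridveg_point_intercepts", "bigquery", "table_ground"),
        ("gridveg_point_intercepts", "gcs"),
    ]
    missing = []
    for path in required:
        node = config
        for key in path:
            if not isinstance(node, dict) or key not in node:
                missing.append(".".join(path))
                break
            node = node[key]
    return missing


print(missing_sections(example_config))
```

Running a check like this before the configuration cell would turn a bare `KeyError` into a list of the sections that still need to be filled in.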


In [1]:
# Import required libraries
import yaml
import pandas as pd
from pathlib import Path
from google.cloud import bigquery
from datetime import datetime

print("Libraries imported successfully")


An error occurred: module 'importlib.metadata' has no attribute 'packages_distributions'
Libraries imported successfully




## Load Configuration


In [2]:
# Load configuration from YAML file
config_path = Path("../config.yml")

if not config_path.exists():
    raise FileNotFoundError(
        f"Configuration file not found: {config_path}\n"
        "Please copy config.example.yml to config.yml and fill in your values."
    )

with open(config_path, 'r') as f:
    config = yaml.safe_load(f)

# Extract configuration values for gridVeg point intercepts
BQ_TABLE_ID = config['gridveg_point_intercepts']['bigquery']['table_ground']
BQ_PROJECT = config['gridveg_point_intercepts']['bigquery'].get('project')
BACKUP_BUCKET = config['gridveg_point_intercepts']['gcs'].get('backup_bucket')
BACKUP_PREFIX = config['gridveg_point_intercepts']['gcs'].get('backup_prefix', 'backups/gridveg_point_intercepts')

# Verify required config values
if not BQ_TABLE_ID or 'your-project' in BQ_TABLE_ID:
    raise ValueError("Please configure gridveg_point_intercepts.bigquery.table_ground in config.yml")

print("✓ Configuration loaded successfully")
print(f"  Table ID: {BQ_TABLE_ID}")
print(f"  Backup: gs://{BACKUP_BUCKET}/{BACKUP_PREFIX}" if BACKUP_BUCKET else "  Backup: Not configured")


✓ Configuration loaded successfully
  Table ID: mpg-data-warehouse.vegetation_point_intercept_gridVeg.gridVeg_point_intercept_ground
  Backup: gs://mpg-data-warehouse/gridVeg/bak


In [3]:
# Initialize BigQuery client
bq_client = bigquery.Client(project=BQ_PROJECT) if BQ_PROJECT else bigquery.Client()

print("✓ BigQuery client initialized")
print(f"  Project: {bq_client.project}")


✓ BigQuery client initialized
  Project: mpg-data-warehouse


## Investigate Current Table State


In [4]:
# Get table schema and basic info
table = bq_client.get_table(BQ_TABLE_ID)

print("Table Schema:")
for field in table.schema:
    print(f"  {field.name}: {field.field_type} (nullable: {field.mode != 'REQUIRED'})")

print(f"\nTotal rows in table: {table.num_rows}")


Table Schema:
  survey_ID: STRING (nullable: True)
  grid_point: INTEGER (nullable: True)
  date: DATE (nullable: True)
  year: INTEGER (nullable: True)
  transect_point: STRING (nullable: True)
  intercept_1: INTEGER (nullable: True)
  intercept_ground_code: STRING (nullable: True)

Total rows in table: 298844


In [5]:
# Query to get all data from the table
query = f"SELECT * FROM `{BQ_TABLE_ID}`"

print("Loading current table data...")
df_current = bq_client.query(query).to_dataframe()

print(f"✓ Data loaded: {len(df_current)} rows")
print(f"  Columns: {list(df_current.columns)}")

# Display info
df_current.info()


Loading current table data...
✓ Data loaded: 298844 rows
  Columns: ['survey_ID', 'grid_point', 'date', 'year', 'transect_point', 'intercept_1', 'intercept_ground_code']
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 298844 entries, 0 to 298843
Data columns (total 7 columns):
 #   Column                 Non-Null Count   Dtype 
---  ------                 --------------   ----- 
 0   survey_ID              298844 non-null  object
 1   grid_point             298844 non-null  Int64 
 2   date                   298844 non-null  dbdate
 3   year                   298844 non-null  Int64 
 4   transect_point         298844 non-null  object
 5   intercept_1            298843 non-null  Int64 
 6   intercept_ground_code  298805 non-null  object
dtypes: Int64(3), dbdate(1), object(3)
memory usage: 16.8+ MB


## Analyze NULL/NA Values


In [6]:
# Check for NULL values in each column
print("NULL Value Analysis:")
print("=" * 60)

null_counts = df_current.isnull().sum()
null_percentages = (df_current.isnull().sum() / len(df_current) * 100)

for col in df_current.columns:
    null_count = null_counts[col]
    null_pct = null_percentages[col]
    if null_count > 0:
        print(f"  {col:20s}: {null_count:5d} nulls ({null_pct:5.2f}%)")
    else:
        print(f"  {col:20s}: No nulls")

print("\n" + "=" * 60)


NULL Value Analysis:
  survey_ID           : No nulls
  grid_point          : No nulls
  date                : No nulls
  year                : No nulls
  transect_point      : No nulls
  intercept_1         :     1 nulls ( 0.00%)
  intercept_ground_code:    39 nulls ( 0.01%)



In [7]:
# Identify rows with any NULL values
rows_with_nulls = df_current[df_current.isnull().any(axis=1)]

print(f"Rows with at least one NULL value: {len(rows_with_nulls)}")

if len(rows_with_nulls) > 0:
    print(f"\nBreakdown by column with NULL:")
    for col in df_current.columns:
        null_in_col = df_current[df_current[col].isnull()]
        if len(null_in_col) > 0:
            print(f"  {col}: {len(null_in_col)} rows")
    
    print(f"\nSample of rows with NULL values:")
    display(rows_with_nulls.head(20))

# Check specifically for NULL in intercept_ground_code (the critical field)
if 'intercept_ground_code' in df_current.columns:
    null_ground_code = df_current[df_current['intercept_ground_code'].isnull()]
    
    print(f"\n\nRows with NULL intercept_ground_code: {len(null_ground_code)}")
    
    if len(null_ground_code) > 0:
        print(f"\nSample records with NULL intercept_ground_code:")
        display(null_ground_code.head(20))


Rows with at least one NULL value: 40

Breakdown by column with NULL:
  intercept_1: 1 rows
  intercept_ground_code: 39 rows

Sample of rows with NULL values:


Unnamed: 0,survey_ID,grid_point,date,year,transect_point,intercept_1,intercept_ground_code
0,0B3C78C3-5612-4BC3-A14F-B4D7A7041C35,586,2022-05-16,2022,E29,360,
1,0B3C78C3-5612-4BC3-A14F-B4D7A7041C35,586,2022-05-16,2022,E41,360,
2,0B3C78C3-5612-4BC3-A14F-B4D7A7041C35,586,2022-05-16,2022,E43,360,
3,0B3C78C3-5612-4BC3-A14F-B4D7A7041C35,586,2022-05-16,2022,E46,360,
4,0B3C78C3-5612-4BC3-A14F-B4D7A7041C35,586,2022-05-16,2022,E44,360,
5,0B3C78C3-5612-4BC3-A14F-B4D7A7041C35,586,2022-05-16,2022,E40,360,
6,0B3C78C3-5612-4BC3-A14F-B4D7A7041C35,586,2022-05-16,2022,E28,360,
7,0B3C78C3-5612-4BC3-A14F-B4D7A7041C35,586,2022-05-16,2022,E24,360,
8,0B3C78C3-5612-4BC3-A14F-B4D7A7041C35,586,2022-05-16,2022,E22,360,
9,0B3C78C3-5612-4BC3-A14F-B4D7A7041C35,586,2022-05-16,2022,E31,360,




Rows with NULL intercept_ground_code: 39

Sample records with NULL intercept_ground_code:


Unnamed: 0,survey_ID,grid_point,date,year,transect_point,intercept_1,intercept_ground_code
0,0B3C78C3-5612-4BC3-A14F-B4D7A7041C35,586,2022-05-16,2022,E29,360,
1,0B3C78C3-5612-4BC3-A14F-B4D7A7041C35,586,2022-05-16,2022,E41,360,
2,0B3C78C3-5612-4BC3-A14F-B4D7A7041C35,586,2022-05-16,2022,E43,360,
3,0B3C78C3-5612-4BC3-A14F-B4D7A7041C35,586,2022-05-16,2022,E46,360,
4,0B3C78C3-5612-4BC3-A14F-B4D7A7041C35,586,2022-05-16,2022,E44,360,
5,0B3C78C3-5612-4BC3-A14F-B4D7A7041C35,586,2022-05-16,2022,E40,360,
6,0B3C78C3-5612-4BC3-A14F-B4D7A7041C35,586,2022-05-16,2022,E28,360,
7,0B3C78C3-5612-4BC3-A14F-B4D7A7041C35,586,2022-05-16,2022,E24,360,
8,0B3C78C3-5612-4BC3-A14F-B4D7A7041C35,586,2022-05-16,2022,E22,360,
9,0B3C78C3-5612-4BC3-A14F-B4D7A7041C35,586,2022-05-16,2022,E31,360,


## Analysis Summary


In [8]:
# Generate summary report
print("=" * 60)
print("DATA QUALITY ANALYSIS SUMMARY")
print("=" * 60)

print(f"\nTotal records in table: {len(df_current)}")
print(f"Records with NULL values: {len(rows_with_nulls)} ({len(rows_with_nulls)/len(df_current)*100:.2f}%)")
print(f"Clean records: {len(df_current) - len(rows_with_nulls)} ({(len(df_current) - len(rows_with_nulls))/len(df_current)*100:.2f}%)")

print(f"\nNULL values by column:")
for col in df_current.columns:
    null_count = df_current[col].isnull().sum()
    if null_count > 0:
        print(f"  {col}: {null_count} ({null_count/len(df_current)*100:.2f}%)")

print("\n" + "=" * 60)


DATA QUALITY ANALYSIS SUMMARY

Total records in table: 298844
Records with NULL values: 40 (0.01%)
Clean records: 298804 (99.99%)

NULL values by column:
  intercept_1: 1 (0.00%)
  intercept_ground_code: 39 (0.01%)



## Backup Existing Table

Before making any changes, create a backup of the existing table to GCS.


In [9]:
# Backup existing table to GCS
if BACKUP_BUCKET:
    # Generate backup path with timestamp
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    backup_path = f"gs://{BACKUP_BUCKET}/{BACKUP_PREFIX}/fix_na_rows_{timestamp}/*.csv"
    
    print(f"Creating backup of existing table...")
    print(f"  Destination: {backup_path}")
    
    # Export table to GCS
    extract_job = bq_client.extract_table(
        BQ_TABLE_ID,
        backup_path,
        location="US"
    )
    
    extract_job.result()  # Wait for job to complete
    
    print(f"✓ Backup completed successfully")
    print(f"  Files: {backup_path}")
else:
    print("⚠ Backup bucket not configured in config.yml")
    print("  Set 'gridveg_point_intercepts.gcs.backup_bucket' to enable automatic backups")


Creating backup of existing table...
  Destination: gs://mpg-data-warehouse/gridVeg/bak/fix_na_rows_20251107_095205/*.csv
✓ Backup completed successfully
  Files: gs://mpg-data-warehouse/gridVeg/bak/fix_na_rows_20251107_095205/*.csv


## Prepare Clean Data

Remove rows with NULL values in the `intercept_ground_code` field. Rows with NULLs in other columns are left in place.
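
To illustrate the difference between dropping rows by one column and dropping rows with a NULL anywhere, here is a small sketch on made-up data. The transect points and ground codes below are invented for the example and do not come from the table.

```python
import pandas as pd

# Toy data (invented): one row lacks intercept_1, another lacks the ground code.
df = pd.DataFrame({
    "transect_point": ["E1", "E2", "E3", "E4"],
    "intercept_1": [360, None, 120, 45],
    "intercept_ground_code": ["BG", "L", None, "M"],
})

# Drop only rows whose ground code is missing (the approach used in this notebook).
by_ground_code = df.dropna(subset=["intercept_ground_code"])

# Drop rows with a missing value in any column (a stricter alternative).
by_any_column = df.dropna()

print(len(by_ground_code), len(by_any_column))
```

With the column-specific filter, the row that is only missing `intercept_1` survives, which is why a NULL in that column can still appear after the fix.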


In [10]:
# Create clean dataset by removing rows with NULL intercept_ground_code
df_clean = df_current[df_current['intercept_ground_code'].notna()].copy()

print("Clean Dataset Preparation:")
print(f"  Original rows:    {len(df_current)}")
print(f"  Rows with NULL intercept_ground_code: {len(df_current) - len(df_clean)}")
print(f"  Clean rows:       {len(df_clean)}")
print(f"  Rows to remove:   {len(df_current) - len(df_clean)}")

# Verify data integrity
print(f"\nData Integrity Check:")
print(f"  NULL intercept_ground_code in clean data: {df_clean['intercept_ground_code'].isna().sum()}")
print(f"  All rows have ground code?: {df_clean['intercept_ground_code'].notna().all()}")


Clean Dataset Preparation:
  Original rows:    298844
  Rows with NULL intercept_ground_code: 39
  Clean rows:       298805
  Rows to remove:   39

Data Integrity Check:
  NULL intercept_ground_code in clean data: 0
  All rows have ground code?: True


## Replace Table with Clean Data

⚠️ **IMPORTANT**: This will REPLACE the entire table with the clean dataset (no NULL `intercept_ground_code` rows).

Review the summary above before proceeding.
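
One way to make that review step explicit is a small guard that refuses to continue when the row counts look wrong. The sketch below is an assumption about how such a check could look; `check_replacement` is a hypothetical helper, and the numbers are the row counts reported earlier in this notebook.

```python
def check_replacement(original_rows, clean_rows, expected_removed):
    """Raise if the clean dataset would drop a different number of rows than expected."""
    if clean_rows <= 0:
        raise ValueError("Clean dataset is empty; refusing to replace the table")
    removed = original_rows - clean_rows
    if removed != expected_removed:
        raise ValueError(
            f"Expected to remove {expected_removed} rows, but would remove {removed}"
        )
    return removed


# Row counts reported by the analysis and preparation cells above.
removed = check_replacement(298844, 298805, expected_removed=39)
print(f"OK to proceed: {removed} rows will be removed")
```

In the notebook this could be called with `len(df_current)` and `len(df_clean)` just before the load job is submitted.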


In [11]:
# Replace table with clean data
print("=" * 60)
print("REPLACING BIGQUERY TABLE WITH CLEAN DATA")
print("=" * 60)
print(f"\nTable: {BQ_TABLE_ID}")
print(f"Current rows: {len(df_current)}")
print(f"New rows (clean): {len(df_clean)}")
print(f"Rows removed: {len(df_current) - len(df_clean)}")
print(f"Mode: WRITE_TRUNCATE (replace entire table)")
print(f"\nStarting replacement at {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}...")

# Configure job to replace existing table
job_config = bigquery.LoadJobConfig(
    write_disposition="WRITE_TRUNCATE"  # Replace entire table
)

# Load clean dataframe to BigQuery
load_job = bq_client.load_table_from_dataframe(
    df_clean,
    BQ_TABLE_ID,
    job_config=job_config
)

# Wait for job to complete
load_job.result()

print(f"\n✓ Replacement completed at {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print(f"  Rows written: {load_job.output_rows}")
print(f"  Job ID: {load_job.job_id}")


REPLACING BIGQUERY TABLE WITH CLEAN DATA

Table: mpg-data-warehouse.vegetation_point_intercept_gridVeg.gridVeg_point_intercept_ground
Current rows: 298844
New rows (clean): 298805
Rows removed: 39
Mode: WRITE_TRUNCATE (replace entire table)

Starting replacement at 2025-11-07 09:52:27...





✓ Replacement completed at 2025-11-07 09:52:30
  Rows written: 298805
  Job ID: f8a39721-59f6-49a6-92a8-fee191c19b22


## Verify Fix

Read the table back to verify that the rows with NULL `intercept_ground_code` have been removed.
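
If downloading the whole table just to count NULLs is undesirable, the counts can be computed inside BigQuery instead. The sketch below only builds the SQL text, using BigQuery's `COUNTIF` aggregate; `null_count_query` is a hypothetical helper, not something the notebook defines.

```python
def null_count_query(table_id, columns):
    """Build a query returning the total row count and one NULL count per column."""
    counts = ",\n  ".join(
        f"COUNTIF(`{col}` IS NULL) AS `{col}_nulls`" for col in columns
    )
    return f"SELECT\n  COUNT(*) AS total_rows,\n  {counts}\nFROM `{table_id}`"


sql = null_count_query(
    "mpg-data-warehouse.vegetation_point_intercept_gridVeg.gridVeg_point_intercept_ground",
    ["intercept_1", "intercept_ground_code"],
)
print(sql)
```

The resulting string could be passed to `bq_client.query(sql).to_dataframe()` in the same way as the `SELECT *` queries in this notebook, returning a single summary row.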


In [12]:
# Read updated table
print("Verifying fix...")
query = f"SELECT * FROM `{BQ_TABLE_ID}`"
df_updated = bq_client.query(query).to_dataframe()

print(f"\n✓ Verification query complete")
print(f"  Rows in table: {len(df_updated)}")
print(f"  Columns: {list(df_updated.columns)}")

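# Check for NULL values
# Note: this check flags NULLs in every column, but the fix only targeted
# intercept_ground_code, so the pre-existing intercept_1 NULL is still reported.
print(f"\nNULL Value Check:")
null_counts_after = df_updated.isnull().sum()
for col in df_updated.columns:
    null_count = null_counts_after[col]
    if null_count > 0:
        print(f"  {col}: {null_count} NULLs (⚠️ UNEXPECTED)")
    else:
        print(f"  {col}: No NULLs ✓")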


Verifying fix...

✓ Verification query complete
  Rows in table: 298805
  Columns: ['survey_ID', 'grid_point', 'date', 'year', 'transect_point', 'intercept_1', 'intercept_ground_code']

NULL Value Check:
  survey_ID: No NULLs ✓
  grid_point: No NULLs ✓
  date: No NULLs ✓
  year: No NULLs ✓
  transect_point: No NULLs ✓
  intercept_1: 1 NULLs (⚠️ UNEXPECTED)
  intercept_ground_code: No NULLs ✓


In [13]:
# Verify row counts
expected_rows = len(df_clean)
actual_rows = len(df_updated)

print("\nData integrity check:")
print(f"  Expected rows:  {expected_rows}")
print(f"  Actual rows:    {actual_rows}")
print(f"  Rows removed:   {len(df_current) - actual_rows}")

if expected_rows == actual_rows:
    print(f"\n✓ Row count verified - table successfully cleaned")
else:
    print(f"\n⚠ Row count mismatch!")
    print(f"  Difference: {actual_rows - expected_rows}")

# Check if any NULL intercept_ground_code values remain
null_ground_code_after = df_updated[df_updated['intercept_ground_code'].isna()]
if len(null_ground_code_after) == 0:
    print(f"\n✓ SUCCESS: No NULL intercept_ground_code values found in updated table")
else:
    print(f"\n⚠ WARNING: {len(null_ground_code_after)} NULL intercept_ground_code values still exist!")



Data integrity check:
  Expected rows:  298805
  Actual rows:    298805
  Rows removed:   39

✓ Row count verified - table successfully cleaned

✓ SUCCESS: No NULL intercept_ground_code values found in updated table


## Summary Report

Complete summary of the fix operation.


In [14]:
# Generate summary report
print("=" * 60)
print("FIX NA ROWS SUMMARY")
print("=" * 60)

print(f"\n📅 Timestamp: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

print(f"\n🎯 Target:")
print(f"  Table: {BQ_TABLE_ID}")
print(f"  Project: {bq_client.project}")

print(f"\n📊 Data Changes:")
print(f"  Original rows:  {len(df_current)}")
print(f"  Cleaned rows:   {len(df_updated)}")
print(f"  Rows removed:   {len(df_current) - len(df_updated)}")

if len(df_current) - len(df_updated) > 0:
    print(f"\n  Removed rows had NULL values in intercept_ground_code")

print(f"\n🔄 Operations Performed:")
print(f"  ✓ Backed up table to GCS")
print(f"  ✓ Removed rows with NULL intercept_ground_code")
print(f"  ✓ Replaced table with clean data")
print(f"  ✓ Verified data integrity")

if BACKUP_BUCKET:
    print(f"\n💾 Backup:")
    print(f"  Location: gs://{BACKUP_BUCKET}/{BACKUP_PREFIX}/")
    print(f"  Status: ✓ Created before fix")

# Final validation
null_check = df_updated['intercept_ground_code'].isna().sum()
if null_check == 0:
    print(f"\n✅ Fix completed successfully!")
    print(f"   No NULL intercept_ground_code values remain in table")
else:
    print(f"\n⚠️ WARNING: {null_check} NULL values still exist")

print("=" * 60)


FIX NA ROWS SUMMARY

📅 Timestamp: 2025-11-07 09:53:03

🎯 Target:
  Table: mpg-data-warehouse.vegetation_point_intercept_gridVeg.gridVeg_point_intercept_ground
  Project: mpg-data-warehouse

📊 Data Changes:
  Original rows:  298844
  Cleaned rows:   298805
  Rows removed:   39

  Removed rows had NULL values in intercept_ground_code

🔄 Operations Performed:
  ✓ Backed up table to GCS
  ✓ Removed rows with NULL intercept_ground_code
  ✓ Replaced table with clean data
  ✓ Verified data integrity

💾 Backup:
  Location: gs://mpg-data-warehouse/gridVeg/bak/
  Status: ✓ Created before fix

✅ Fix completed successfully!
   No NULL intercept_ground_code values remain in table
   No NULL intercept_ground_code values remain in table


## Rollback Instructions (If Needed)

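If you need to roll back to the previous version, restore from the backup created earlier in this notebook. The backup is a set of CSV files matched by a `*` wildcard, which `pd.read_csv` does not expand, so the sketch below loads them into BigQuery directly from GCS.

```python
# To roll back, load the backup CSVs straight from GCS:
# backup_uri = "gs://BACKUP_BUCKET/BACKUP_PREFIX/fix_na_rows_TIMESTAMP/*.csv"
# job_config = bigquery.LoadJobConfig(
#     source_format=bigquery.SourceFormat.CSV,
#     skip_leading_rows=1,  # extract jobs write a header row by default
#     schema=bq_client.get_table(BQ_TABLE_ID).schema,  # reuse the existing column types
#     write_disposition="WRITE_TRUNCATE",
# )
# bq_client.load_table_from_uri(backup_uri, BQ_TABLE_ID, job_config=job_config).result()
```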

The backup location was printed in the backup cell above.
