# GDELT Demo Data Prep

This notebook demonstrates working with GDELT (Global Database of Events, Language, and Tone) data for graph analysis.


## 🔧 Configuration Setup

This cell defines all the essential configuration variables for the GDELT data analysis project:

### **Project Settings**
- **GCP_PROJECT_ID**: Your Google Cloud Platform project identifier
- **PROJECT_REGION**: Target region for BigQuery operations (us-central1)
- **BIGQUERY_DATASET**: Name of the dataset in your project where the copied GDELT data is stored

### **GDELT Source Settings**  
- **GDELT_PROJECT_ID**: Public GDELT BigQuery project (gdelt-bq)
- **GDELT_DATASET**: Public GDELT dataset (gdeltv2)
- **GDELT_REGION**: Source region for GDELT data (US)

### **Data Tables**
- **BIGQUERY_TABLES**: List of GDELT tables to copy for analysis
  - `gkg_partitioned`: Global Knowledge Graph data
  - `events_partitioned`: Event data
  - `eventmentions_partitioned`: Event mentions data

### **Storage**
- **GCS_BUCKET**: Google Cloud Storage bucket for data exports

⚠️ **Important**: Update GCP_PROJECT_ID with your actual project ID before running.

In [2]:
# Configuration variables
GCP_PROJECT_ID = "graph-demo-471710"  # Replace with your actual GCP project ID
PROJECT_REGION = "us-central1"
BIGQUERY_DATASET = "gdelt"  # Replace with your actual BigQuery dataset name
BIGQUERY_TABLES = ["gkg_partitioned", "events_partitioned","eventmentions_partitioned"]  # List of tables to copy
GDELT_PROJECT_ID = "gdelt-bq"
GDELT_DATASET = "gdeltv2"  
GDELT_REGION = "us"
GCS_BUCKET = "gdelt_graph"

# Echo the configuration so the values used by later cells are visible
print("Configuration loaded:")
print(f"  GCP Project: {GCP_PROJECT_ID}")
print(f"  BigQuery Dataset: {BIGQUERY_DATASET}")
print(f"  BigQuery Tables: {BIGQUERY_TABLES}")
print(f"  GDELT Project: {GDELT_PROJECT_ID}")
print(f"  GDELT Dataset: {GDELT_DATASET}")
print(f"  GDELT Region: {GDELT_REGION}")
print(f"  GCS Bucket: {GCS_BUCKET}")

Configuration loaded:
  GCP Project: graph-demo-471710
  BigQuery Dataset: gdelt
  BigQuery Tables: ['gkg_partitioned', 'events_partitioned', 'eventmentions_partitioned']
  GDELT Project: gdelt-bq
  GDELT Dataset: gdeltv2
  GDELT Region: us
  GCS Bucket: gdelt_graph
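
The fully qualified source and destination table IDs used by later cells can be derived from these settings. The following is a minimal sketch, assuming the `project.dataset.table` naming that the copy cell relies on; the helper name `qualified_table_ids` is illustrative and not part of the notebook.

```python
# Illustrative helper (not part of the notebook): derive fully qualified
# BigQuery table IDs from the configuration values shown above.
GCP_PROJECT_ID = "graph-demo-471710"
BIGQUERY_DATASET = "gdelt"
BIGQUERY_TABLES = ["gkg_partitioned", "events_partitioned", "eventmentions_partitioned"]
GDELT_PROJECT_ID = "gdelt-bq"
GDELT_DATASET = "gdeltv2"


def qualified_table_ids(table_name):
    """Return (source_id, destination_id) for one table name."""
    source_id = f"{GDELT_PROJECT_ID}.{GDELT_DATASET}.{table_name}"
    destination_id = f"{GCP_PROJECT_ID}.{BIGQUERY_DATASET}.{table_name}"
    return source_id, destination_id


for name in BIGQUERY_TABLES:
    source_id, destination_id = qualified_table_ids(name)
    print(f"{source_id} -> {destination_id}")
```

Keeping this derivation in one place avoids rebuilding the same f-strings in every cell that touches a table.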


## 📚 Library Imports

This cell imports all necessary Python libraries for the GDELT data analysis workflow:

### **Google Cloud Services**
- `google.cloud.bigquery`: For querying and managing BigQuery data
- `google.cloud.storage`: For Google Cloud Storage operations
- `google.auth`: For GCP authentication handling

### **Data Processing**
- `pandas`: For data manipulation and analysis
- `json`: For JSON data handling
- `datetime`: For date/time operations

### **Network Analysis & Visualization**
- `networkx`: For creating and analyzing graph networks
- `matplotlib.pyplot`: For creating visualizations and plots

### **System & File Operations**
- `os`, `pathlib.Path`: For file system operations
- `subprocess`: For running system commands
- `shutil`: For file operations

The cell ends by printing a confirmation message. If any import fails, Python raises `ImportError` before that message appears.

In [3]:
# Import required libraries
import json
import os
import shutil
import subprocess
from datetime import datetime
from pathlib import Path

import matplotlib.pyplot as plt
import networkx as nx
import pandas as pd
from google.auth import default
from google.auth.exceptions import DefaultCredentialsError
from google.cloud import bigquery
from google.cloud import storage

print("✅ All libraries imported successfully!")
print("   - BigQuery and Cloud Storage clients ready")
print("   - NetworkX and Matplotlib ready for visualization")
print("   - Pandas ready for data processing")


✅ All libraries imported successfully!
   - BigQuery and Cloud Storage clients ready
   - NetworkX and Matplotlib ready for visualization
   - Pandas ready for data processing
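
To see which dependency is missing instead of stopping at the first failed import, each module can be probed individually. This is a minimal sketch using the standard library's `importlib`; the helper name `check_imports` is hypothetical, and the module list simply mirrors the third-party imports in the cell above.

```python
import importlib

# Third-party module names taken from the import cell above.
REQUIRED_MODULES = [
    "pandas",
    "networkx",
    "matplotlib.pyplot",
    "google.cloud.bigquery",
    "google.cloud.storage",
    "google.auth",
]


def check_imports(module_names):
    """Try to import each module and return {name: True/False}."""
    status = {}
    for name in module_names:
        try:
            importlib.import_module(name)
            status[name] = True
        except Exception:
            # Any failure during import counts as "not available".
            status[name] = False
    return status


for name, ok in check_imports(REQUIRED_MODULES).items():
    print(f"{'✅' if ok else '❌'} {name}")
```

A missing entry usually points at the package to install, for example `google-cloud-bigquery` for `google.cloud.bigquery`.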


## 🔐 GCP Authentication Setup

This cell defines `setup_gcp_authentication()`, which reuses existing application default credentials when they already point at the target project and otherwise re-authenticates through the gcloud CLI:

### **Authentication Process**
1. **Credential Check**: Verifies existing Google Cloud credentials
2. **Project Validation**: Ensures credentials match the target project
3. **Credential Reset**: Deletes the cached application default credentials file if credentials are missing or belong to a different project
4. **Project Configuration**: Sets the correct GCP project using gcloud CLI
5. **Re-authentication**: Initiates browser-based OAuth flow if needed
6. **Quota Project**: Sets quota project to avoid billing warnings
7. **Verification**: Confirms successful authentication

### **Error Handling**
- Handles missing credentials gracefully
- Provides manual fallback instructions
- Manages project mismatches automatically
- Shows detailed error messages and troubleshooting tips

### **Environment Setup**
- Sets `GOOGLE_CLOUD_PROJECT` environment variable
- Configures application default credentials
- Prepares credentials for BigQuery and Cloud Storage clients

⚡ **Note**: This function may open a browser window for OAuth authentication.


In [4]:
# GCP Authentication Setup


def setup_gcp_authentication():
    """Complete GCP authentication setup with error handling"""
    print("🔐 Setting up GCP Authentication...")
    
    try:
        # Step 1: Try to use existing credentials first
        print("🔍 Checking for existing credentials...")
        try:
            credentials, default_project = default()
            print(f"✅ Found existing credentials for project: {default_project}")
            
            # If the project matches, we're good
            if default_project == GCP_PROJECT_ID:
                print(f"🎯 Project matches target project: {GCP_PROJECT_ID}")
                os.environ['GOOGLE_CLOUD_PROJECT'] = GCP_PROJECT_ID
                return credentials, GCP_PROJECT_ID
            else:
                print(f"⚠️  Project mismatch: {default_project} vs {GCP_PROJECT_ID}")
                print("🔄 Will re-authenticate with correct project...")
        except DefaultCredentialsError:
            print("❌ No existing credentials found")
            print("🔄 Will authenticate from scratch...")
        
        # Step 2: Clear old credentials if needed
        print("🗑️  Clearing old credentials...")
        adc_path = os.path.expanduser("~/.config/gcloud/application_default_credentials.json")
        if os.path.exists(adc_path):
            os.remove(adc_path)
            print("✅ Removed old application default credentials")
        
        # Step 3: Set the correct project
        print(f"🎯 Setting gcloud project to: {GCP_PROJECT_ID}")
        result = subprocess.run(['gcloud', 'config', 'set', 'project', GCP_PROJECT_ID], 
                              capture_output=True, text=True, check=True)
        print("✅ Project set successfully")
        
        # Step 4: Re-authenticate
        print("🔄 Re-authenticating with application default credentials...")
        print("   This will open a browser window for authentication...")
        
        result = subprocess.run(['gcloud', 'auth', 'application-default', 'login'], 
                              check=True)
        print("✅ Re-authentication successful")
        
        # Step 5: Set quota project to avoid warnings
        print("💰 Setting quota project...")
        try:
            subprocess.run(['gcloud', 'auth', 'application-default', 'set-quota-project', GCP_PROJECT_ID], 
                          capture_output=True, text=True, check=True)
            print("✅ Quota project set successfully")
        except Exception:
            print("⚠️  Could not set quota project (this is usually fine)")
        
        # Step 6: Verify the setup
        print("🧪 Verifying authentication...")
        credentials, project = default()
        print(f"✅ Authentication successful - Project: {project}")
        
        # Set environment variable
        os.environ['GOOGLE_CLOUD_PROJECT'] = GCP_PROJECT_ID
        print(f"🌍 Set GOOGLE_CLOUD_PROJECT environment variable to: {GCP_PROJECT_ID}")
        
        return credentials, GCP_PROJECT_ID
        
    except subprocess.CalledProcessError as e:
        print(f"❌ Command failed: {e}")
        print("💡 Manual steps required:")
        print(f"   1. gcloud config set project {GCP_PROJECT_ID}")
        print("   2. gcloud auth application-default login")
        print(f"   3. gcloud auth application-default set-quota-project {GCP_PROJECT_ID}")
        return None, None
    except Exception as e:
        print(f"❌ Error: {e}")
        return None, None

# Run authentication setup
credentials, authenticated_project = setup_gcp_authentication()


🔐 Setting up GCP Authentication...
🔍 Checking for existing credentials...
✅ Found existing credentials for project: graph-demo-471710
🎯 Project matches target project: graph-demo-471710
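
The authentication function hard-codes the Linux/macOS location of the application default credentials file. A more portable lookup might look like the sketch below. It assumes the usual gcloud conventions: the `CLOUDSDK_CONFIG` environment variable overrides the configuration directory, and Windows keeps the configuration under `%APPDATA%\gcloud`. The helper name `adc_file_path` is hypothetical.

```python
import os
from pathlib import Path

ADC_FILENAME = "application_default_credentials.json"


def adc_file_path(environ=None, os_name=None):
    """Return the expected path of the gcloud ADC file (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    os_name = os.name if os_name is None else os_name

    # Assumption: an explicit gcloud config directory takes precedence.
    config_dir = environ.get("CLOUDSDK_CONFIG")
    if config_dir:
        return Path(config_dir) / ADC_FILENAME

    # Assumption: Windows keeps the gcloud configuration under %APPDATA%.
    if os_name == "nt" and environ.get("APPDATA"):
        return Path(environ["APPDATA"]) / "gcloud" / ADC_FILENAME

    # Linux and macOS default, the same path the notebook uses.
    return Path.home() / ".config" / "gcloud" / ADC_FILENAME


path = adc_file_path()
print(f"ADC file: {path} (exists: {path.exists()})")
```

Passing `environ` and `os_name` explicitly makes the lookup easy to exercise without touching the real environment.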


In [5]:
# Test GCP connectivity
def test_gcp_connectivity():
    """Test basic connectivity to GCP services"""
    print("🔍 Testing GCP connectivity...")
    
    # Check if authentication was successful
    if not credentials or not authenticated_project:
        print("❌ Authentication required - please run the authentication cell first")
        return False
    
    print(f"✅ Using authenticated project: {authenticated_project}")
    
    # Test 1: Test BigQuery connectivity
    try:
        # Use explicit credentials and project
        client = bigquery.Client(credentials=credentials, project=authenticated_project)
        print(f"🔗 BigQuery client created for project: {client.project}")
        
        # Simple query to test connectivity
        query = "SELECT 1 as test_value"
        result = client.query(query).result()
        for row in result:
            print(f"✅ BigQuery connectivity successful - Test query result: {row.test_value}")
            break  # Only need first row
    except Exception as e:
        error_str = str(e)
        if "has been deleted" in error_str or "USER_PROJECT_DENIED" in error_str:
            print(f"❌ BigQuery connectivity failed: Project mismatch detected")
            print(f"   Error: {e}")
            print(f"🔧 This usually means your credentials are cached for a different project")
            print(f"   💡 Try running the authentication cell again")
            print(f"   📋 Or manually run: gcloud auth application-default login")
            return False
        else:
            print(f"❌ BigQuery connectivity failed: {e}")
            return False
    
    # Test 2: Test BigQuery dataset access
    try:
        client = bigquery.Client(credentials=credentials, project=authenticated_project)
        dataset_ref = bigquery.DatasetReference(authenticated_project, BIGQUERY_DATASET)
        dataset = client.get_dataset(dataset_ref)
        print(f"✅ BigQuery dataset '{BIGQUERY_DATASET}' accessible")
        
        # List tables in the dataset
        tables = list(client.list_tables(dataset_ref))
        print(f"📊 Found {len(tables)} tables in dataset")
        for table in tables[:5]:  # Show first 5 tables
            print(f"   - {table.table_id}")
        if len(tables) > 5:
            print(f"   ... and {len(tables) - 5} more tables")
            
    except Exception as e:
        print(f"❌ BigQuery dataset access failed: {e}")
        print(f"   Make sure dataset '{BIGQUERY_DATASET}' exists in project '{authenticated_project}'")
        return False
    
    # Test 3: Test Cloud Storage connectivity
    try:
        storage_client = storage.Client(credentials=credentials, project=authenticated_project)
        # List buckets to test connectivity
        buckets = list(storage_client.list_buckets())
        print(f"✅ Cloud Storage connectivity successful - Found {len(buckets)} buckets")
    except Exception as e:
        print(f"❌ Cloud Storage connectivity failed: {e}")
        return False
    
    print("🎉 All GCP connectivity tests passed!")
    return True

# Run the connectivity test
test_gcp_connectivity()


🔍 Testing GCP connectivity...
✅ Using authenticated project: graph-demo-471710
🔗 BigQuery client created for project: graph-demo-471710
✅ BigQuery connectivity successful - Test query result: 1
✅ BigQuery dataset 'gdelt' accessible
📊 Found 12 tables in dataset
   - article
   - event
   - event_participant
   - eventmentions_partitioned
   - events_partitioned
   ... and 7 more tables
✅ Cloud Storage connectivity successful - Found 1 buckets
🎉 All GCP connectivity tests passed!


True

In [6]:
# Ready for GDELT analysis!
print("🎉 Setup complete! Ready to work with GDELT data.")
print(f"📊 Project: {GCP_PROJECT_ID}")
print(f"🗄️  Dataset: {BIGQUERY_DATASET}")
print("🚀 You can now run queries against your GDELT data!")


🎉 Setup complete! Ready to work with GDELT data.
📊 Project: graph-demo-471710
🗄️  Dataset: gdelt
🚀 You can now run queries against your GDELT data!


## 📊 GDELT Dataset Discovery

This cell explores the public GDELT BigQuery project to understand available datasets and tables:

### **Dataset Exploration**
- Connects to the public GDELT project (`gdelt-bq`)
- Lists all available datasets in the GDELT project
- Provides metadata for each dataset (creation date, location, description)
- Counts tables within each dataset

### **Information Gathered**
- **Dataset Names**: All available GDELT datasets
- **Table Counts**: Number of tables in each dataset
- **Creation/Modification Dates**: When each dataset was created and last modified
- **Geographic Location**: Where datasets are stored (typically US region)
- **Sample Tables**: Preview of table names in each dataset

### **Purpose**
- Helps understand the structure of public GDELT data
- Identifies which datasets contain the tables we need
- Provides context for the data import process
- Assists in troubleshooting data access issues

📋 **Output**: Detailed listing of all GDELT datasets with metadata and table information.
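
Once table names have been collected per dataset, finding the dataset that holds the required tables is a simple subset check. The sketch below uses a hand-written mapping for illustration: the `covid19` entry mirrors the discovery output in this notebook, and the `gdeltv2` entry lists only the three tables named in `BIGQUERY_TABLES`. The helper name `datasets_with_tables` is hypothetical.

```python
# Illustrative only: in practice the table lists would come from list_tables().
REQUIRED_TABLES = {"gkg_partitioned", "events_partitioned", "eventmentions_partitioned"}

tables_by_dataset = {
    "covid19": ["onlinenews", "onlinenewsgeo", "tvnews"],
    "gdeltv2": ["gkg_partitioned", "events_partitioned", "eventmentions_partitioned"],
}


def datasets_with_tables(tables_by_dataset, required):
    """Return dataset IDs whose table list contains every required table."""
    return [
        dataset_id
        for dataset_id, tables in tables_by_dataset.items()
        if required.issubset(tables)
    ]


print(datasets_with_tables(tables_by_dataset, REQUIRED_TABLES))
```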


In [7]:
# List datasets in the GDELT_PROJECT_ID project
def list_gdelt_datasets():
    """List all datasets in the GDELT_PROJECT_ID project"""
    print(f"🔍 Listing datasets in GDELT project: {GDELT_PROJECT_ID}")
    
    try:
        # Create BigQuery client for the GDELT project
        gdelt_client = bigquery.Client(project=GDELT_PROJECT_ID)
        print(f"✅ Connected to GDELT project: {gdelt_client.project}")
        
        # List all datasets in the project
        datasets = list(gdelt_client.list_datasets())
        
        if not datasets:
            print("📭 No datasets found in the GDELT project")
            return []
        
        print(f"📊 Found {len(datasets)} datasets in {GDELT_PROJECT_ID}:")
        print("-" * 60)
        
        dataset_info = []
        for dataset in datasets:
            # Get dataset details
            dataset_ref = bigquery.DatasetReference(GDELT_PROJECT_ID, dataset.dataset_id)
            full_dataset = gdelt_client.get_dataset(dataset_ref)
            
            # Count tables in the dataset
            tables = list(gdelt_client.list_tables(dataset_ref))
            
            info = {
                'dataset_id': dataset.dataset_id,
                'description': full_dataset.description or 'No description',
                'created': full_dataset.created,
                'modified': full_dataset.modified,
                'location': full_dataset.location,
                'table_count': len(tables)
            }
            dataset_info.append(info)
            
            print(f"📁 Dataset: {dataset.dataset_id}")
            print(f"   Description: {info['description']}")
            print(f"   Created: {info['created']}")
            print(f"   Modified: {info['modified']}")
            print(f"   Location: {info['location']}")
            print(f"   Tables: {info['table_count']}")
            
            # Show first few tables if any
            if tables:
                print(f"   Sample tables:")
                for table in tables[:5]:
                    print(f"     - {table.table_id}")
                if len(tables) > 5:
                    print(f"     ... and {len(tables) - 5} more")
            print()
        
        return dataset_info
        
    except Exception as e:
        print(f"❌ Error listing datasets: {e}")
        return []

# Run the function to list datasets
gdelt_datasets = list_gdelt_datasets()


🔍 Listing datasets in GDELT project: gdelt-bq
✅ Connected to GDELT project: gdelt-bq
📊 Found 8 datasets in gdelt-bq:
------------------------------------------------------------
📁 Dataset: covid19
   Description: No description
   Created: 2020-03-30 00:36:39.040000+00:00
   Modified: 2024-11-19 22:08:34.013000+00:00
   Location: US
   Tables: 3
   Sample tables:
     - onlinenews
     - onlinenewsgeo
     - tvnews

📁 Dataset: extra
   Description: No description
   Created: 2014-12-03 05:22:28.037000+00:00
   Modified: 2024-11-19 22:09:01.418000+00:00
   Location: US
   Tables: 6
   Sample tables:
     - countries_by_media_50pct
     - countrygeolookup
     - countryinfo
     - countryinfo2
     - sourcesbycountry
     ... and 1 more

📁 Dataset: full
   Description: No description
   Created: 2014-04-23 23:04:48.708000+00:00
   Modified: 2024-11-19 22:05:19.124000+00:00
   Location: US
   Tables: 3
   Sample tables:
     - crosswalk_geocountrycodetohuman
     - events
     - events_pa

## Cross-Region GDELT Data Copy Function

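This function copies one day of GDELT data from the public tables in the US multi-region into your own dataset in the US-CENTRAL1 region. Because a query cannot write its results to a dataset in a different location, the data is staged in a temporary table in the US region first: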

### 🎯 **Purpose**
- Copies GDELT data for a specific date (September 11, 2025) from the public GDELT dataset
- Handles cross-region data transfer from US region to US-CENTRAL1 region
- Avoids repeated work by skipping tables that already exist and reusing leftover temporary tables

### 🔄 **Process Flow**
1. **Dataset Setup**: Creates the destination dataset in US-CENTRAL1 and a staging dataset (`{BIGQUERY_DATASET}_us`) in the US region if they do not exist
2. **Destination Check**: For each table, skips the copy if the destination table already exists
3. **Temporary Table Check**: Reuses the temporary table in the US region if one is already present
4. **Data Query**: Otherwise queries the GDELT partition and writes the result to the temporary table in the US region
5. **Cross-Region Copy**: Copies the temporary table from the US region to the US-CENTRAL1 region
6. **Cleanup**: Deletes the temporary table and verifies the row count of the final table

### ⚡ **Optimizations**
- **Caching**: Skips the query and copy when the destination table already exists
- **Cost Efficient**: Reuses an existing temporary table instead of re-running the GDELT query
- **Error Resilient**: Records a failure on one table and continues with the next
- **Progress Tracking**: Logs each step and prints a per-table summary at the end

### 📊 **Output**
- Creates one table per entry in `BIGQUERY_TABLES`, for example `{GCP_PROJECT_ID}.gdelt.gkg_partitioned`, in the US-CENTRAL1 region
- Shows row counts and verification details
- Provides troubleshooting tips if errors occur
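
The skip-and-reuse logic described above can be summarized as a small decision function. This is a sketch of the control flow only, with no BigQuery calls; the function name `plan_copy_steps` and the step labels are hypothetical.

```python
def plan_copy_steps(destination_exists, temp_table_exists):
    """Return the ordered steps the copy would run for one table."""
    if destination_exists:
        # Destination table is already present: nothing to query or copy.
        return ["verify_destination"]

    steps = []
    if not temp_table_exists:
        # Only run the GDELT query when no staged result is available.
        steps.append("query_to_temp_table")
    steps += ["copy_temp_to_destination", "delete_temp_table", "verify_destination"]
    return steps


for dest, temp in [(True, False), (False, True), (False, False)]:
    print(dest, temp, plan_copy_steps(dest, temp))
```

Note that reusing a temporary table assumes it holds a complete result for the intended date; a table left over from an interrupted or differently dated run would be copied as-is.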


In [None]:
# Copy GDELT data for specific partition (September 11, 2025) - Cross-region approach for multiple tables


def copy_gdelt_partition_cross_region():
    """
    Copy data from GDELT tables (US region) to local tables (US-CENTRAL1 region).
    Uses a temporary table approach to handle cross-region data access.
    Processes each table in BIGQUERY_TABLES list.
    """
    print("🚀 Starting cross-region GDELT data copy for multiple tables...")
    print(f" Target date: September 11, 2025")
    print(f" Source: GDELT tables in {GDELT_PROJECT_ID}.{GDELT_DATASET} (US region)")
    print(f" Destination: {GCP_PROJECT_ID}.{BIGQUERY_DATASET} (US-CENTRAL1 region)")
    print(f" Tables to process: {BIGQUERY_TABLES}")
    print("-" * 70)
    
    results = {}
    
    try:
        # Create BigQuery client
        local_client = bigquery.Client(project=GCP_PROJECT_ID)
        print("✅ BigQuery client created")
        
        # Step 0: Create dataset if it doesn't exist (same for all tables)
        print(f" Checking if dataset '{BIGQUERY_DATASET}' exists...")
        dataset_ref = bigquery.DatasetReference(GCP_PROJECT_ID, BIGQUERY_DATASET)
        
        try:
            dataset = local_client.get_dataset(dataset_ref)
            print(f"✅ Dataset '{BIGQUERY_DATASET}' already exists")
        except Exception:
            print(f"📝 Dataset '{BIGQUERY_DATASET}' doesn't exist, creating it...")
            
            # Create dataset with proper location
            dataset = bigquery.Dataset(dataset_ref)
            dataset.location = "US-CENTRAL1"  # Specify the region
            dataset.description = "GDELT data for graph analysis"
            
            dataset = local_client.create_dataset(dataset, timeout=30)
            print(f"✅ Dataset '{BIGQUERY_DATASET}' created successfully in us-central1")
        
        # Step 1: Create dataset in US region for temporary tables
        print("📝 Creating dataset in US region for temporary tables...")
        us_dataset_name = f"{BIGQUERY_DATASET}_us"
        us_dataset_ref = bigquery.DatasetReference(GCP_PROJECT_ID, us_dataset_name)
        
        try:
            us_dataset = local_client.get_dataset(us_dataset_ref)
            print(f"✅ Dataset '{us_dataset_name}' already exists in US region")
        except Exception as e:
            if "notFound" in str(e) or "404" in str(e):
                print(f"📝 Creating dataset '{us_dataset_name}' in US region...")
                us_dataset = bigquery.Dataset(us_dataset_ref)
                us_dataset.location = "US"
                us_dataset.description = "GDELT data for graph analysis (US region - temporary)"
                try:
                    us_dataset = local_client.create_dataset(us_dataset, timeout=30)
                    print(f"✅ Dataset '{us_dataset_name}' created in US region")
                except Exception as create_error:
                    if "Already Exists" in str(create_error) or "409" in str(create_error):
                        print(f"✅ Dataset '{us_dataset_name}' already exists in US region (created by another process)")
                    else:
                        raise create_error
            else:
                print(f"⚠️  Unexpected error checking dataset in US region: {e}")
                raise e
        
        # Process each table
        for i, table_name in enumerate(BIGQUERY_TABLES, 1):
            print(f"\n{'='*80}")
            print(f"📋 Processing table {i}/{len(BIGQUERY_TABLES)}: {table_name}")
            print(f"{'='*80}")
            
            try:
                # Create GDELT table reference for this table
                gdelt_table = f"{GDELT_PROJECT_ID}.{GDELT_DATASET}.{table_name}"
                
                # Check if destination table already exists
                print(f"🔍 Checking if destination table '{table_name}' already exists...")
                dest_table_ref = f"{GCP_PROJECT_ID}.{BIGQUERY_DATASET}.{table_name}"
                try:
                    existing_dest_table = local_client.get_table(dest_table_ref)
                    print(f"✅ Destination table already exists: {existing_dest_table.full_table_id}")
                    print(f"   Rows: {existing_dest_table.num_rows:,}")
                    print("⏭️  Skipping data copy process, destination table already has data")
                    
                    # Optional: report the row count of the existing table
                    # (this does not check which partition date the rows belong to)
                    print("🔍 Verifying existing data...")
                    try:
                        simple_query = f"SELECT COUNT(*) as row_count FROM `{dest_table_ref}`"
                        result = local_client.query(simple_query, location="US-CENTRAL1").result()
                        for row in result:
                            print("📊 Existing data summary:")
                            print(f"   Total rows: {row.row_count:,}")
                            print("✅ Data verification completed")
                    except Exception as verify_error:
                        print(f"⚠️  Could not verify existing data: {verify_error}")
                    
                    results[table_name] = {"status": "skipped", "reason": "already_exists"}
                    continue
                    
                except Exception as e:
                    if "notFound" in str(e) or "404" in str(e):
                        print("📝 Destination table doesn't exist, proceeding with data copy...")
                    else:
                        print(f"⚠️  Error checking destination table: {e}")
                        print("📝 Proceeding with data copy...")
                
                # Check if temporary table already exists, if not query GDELT data
                temp_table_ref = bigquery.DatasetReference(GCP_PROJECT_ID, us_dataset_name).table(f"temp_{table_name}")
                
                print("🔍 Checking if temporary table already exists...")
                try:
                    existing_temp_table = local_client.get_table(temp_table_ref)
                    print(f"✅ Temporary table already exists: {existing_temp_table.full_table_id}")
                    print(f"   Rows: {existing_temp_table.num_rows:,}")
                    print("⏭️  Skipping data query, using existing temporary table")
                except Exception as e:
                    if "notFound" in str(e) or "404" in str(e):
                        print("📊 Temporary table doesn't exist, querying GDELT data and saving to temporary table...")
                        
                        # Configure the query job to save to temporary table in US region
                        job_config = bigquery.QueryJobConfig()
                        job_config.destination = temp_table_ref
                        job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE
                        job_config.create_disposition = bigquery.CreateDisposition.CREATE_IF_NEEDED
                        
                        # Query the GDELT table
                        query = f"""
                        SELECT *
                        FROM `{gdelt_table}`
                        WHERE _PARTITIONTIME = TIMESTAMP('2025-09-11')
                        """
                        
                        print("📊 Executing query...")
                        print(f"🔍 Query: {query}")
                        print(f"🎯 Destination: {GCP_PROJECT_ID}.{us_dataset_name}.temp_{table_name}")
                        
                        # Run the query in the US location; the source and the temporary table are both in US,
                        # so the cross-region transfer happens later, in the copy job
                        query_job = local_client.query(
                            query,
                            job_config=job_config,
                            location="US"  # Query in US region where GDELT table exists
                        )
                        
                        print(f"⏳ Query job started: {query_job.job_id}")
                        print("⏳ Waiting for query to complete...")
                        query_job.result()  # Wait for job to complete
                        print("✅ Data copied to temporary table in US region")
                    else:
                        print(f"⚠️  Unexpected error checking temporary table: {e}")
                        raise e
                
                # Define source table reference
                source_table_ref = bigquery.TableReference.from_string(f"{GCP_PROJECT_ID}.{us_dataset_name}.temp_{table_name}")
                
                # Copy data from US region temp table to US-CENTRAL1 region
                print("🔄 Copying data from US region to US-CENTRAL1 region...")
                
                # Configure the copy job
                copy_job_config = bigquery.CopyJobConfig()
                copy_job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE
                copy_job_config.create_disposition = bigquery.CreateDisposition.CREATE_IF_NEEDED
                
                # Destination table (in US-CENTRAL1 region)
                dest_table_ref = bigquery.DatasetReference(GCP_PROJECT_ID, BIGQUERY_DATASET).table(table_name)
                
                # Copy the data - need to specify source location
                copy_job = local_client.copy_table(
                    source_table_ref,
                    dest_table_ref,
                    job_config=copy_job_config,
                    location="US"  # Source is in US region
                )
                
                print(f"⏳ Copy job started: {copy_job.job_id}")
                print("⏳ Waiting for copy to complete...")
                copy_job.result()  # Wait for job to complete
                print("✅ Data copied to US-CENTRAL1 region successfully")
                
                # Clean up temporary table
                print("🧹 Cleaning up temporary table...")
                try:
                    local_client.delete_table(source_table_ref)
                    print("✅ Temporary table deleted")
                except Exception as e:
                    print(f"⚠️  Could not delete temporary table: {e}")
                
                # Verify the data
                print("🔍 Verifying imported data...")
                
                try:
                    table = local_client.get_table(f"{GCP_PROJECT_ID}.{BIGQUERY_DATASET}.{table_name}")
                    print(f"✅ Table found: {table.full_table_id}")
                    print(f"   Rows: {table.num_rows:,}")
                    print(f"   Columns: {len(table.schema)}")
                    
                    # Try a simple count query first
                    simple_verification_query = f"""
                    SELECT COUNT(*) as row_count
                    FROM `{GCP_PROJECT_ID}.{BIGQUERY_DATASET}.{table_name}`
                    """
                    
                    verification_result = local_client.query(simple_verification_query, location="US-CENTRAL1").result()
                    for row in verification_result:
                        print(f"📊 Imported data summary:")
                        print(f"   Total rows: {row.row_count:,}")
                        
                    results[table_name] = {"status": "success", "rows": table.num_rows}
                    
                except Exception as e:
                    print(f"❌ Error during verification: {e}")
                    print("💡 Table may have been created but verification failed")
                    results[table_name] = {"status": "partial_success", "error": str(e)}
                
            except Exception as e:
                print(f"❌ Error processing table '{table_name}': {e}")
                results[table_name] = {"status": "failed", "error": str(e)}
                print(f"⚠️  Continuing with next table...")
        
        # Print final summary
        print(f"\n{'='*80}")
        print("📊 FINAL SUMMARY")
        print(f"{'='*80}")
        
        success_count = sum(1 for r in results.values() if r["status"] == "success")
        skipped_count = sum(1 for r in results.values() if r["status"] == "skipped")
        failed_count = sum(1 for r in results.values() if r["status"] == "failed")
        partial_count = sum(1 for r in results.values() if r["status"] == "partial_success")
        
        print(f"✅ Successfully processed: {success_count}/{len(BIGQUERY_TABLES)} tables")
        print(f"⏭️  Skipped (already exists): {skipped_count}/{len(BIGQUERY_TABLES)} tables")
        print(f"⚠️  Partial success: {partial_count}/{len(BIGQUERY_TABLES)} tables")
        print(f"❌ Failed: {failed_count}/{len(BIGQUERY_TABLES)} tables")
        
        print(f"\nDetailed results:")
        for table_name, result in results.items():
            status_emoji = {
                "success": "✅",
                "skipped": "⏭️",
                "partial_success": "⚠️",
                "failed": "❌"
            }.get(result["status"], "❓")
            
            print(f"  {status_emoji} {table_name}: {result['status']}")
            if "rows" in result:
                print(f"      Rows: {result['rows']:,}")
            if "error" in result:
                print(f"      Error: {result['error']}")
        
        print(f"\n🎉 Multi-table cross-region data copy completed!")
        return results
        
    except Exception as e:
        print(f"❌ Critical error during data copy: {e}")
        print("💡 Troubleshooting tips:")
        print("   - Check if the GDELT tables exist and are accessible")
        print("   - Verify your project has BigQuery API enabled")
        print("   - Ensure you have the necessary permissions")
        return results

# Run the cross-region copy process for multiple tables
copy_results = copy_gdelt_partition_cross_region()

## 👥 Person Co-occurrence Analysis

This cell runs a person co-occurrence query against the copied GKG table to find people who are mentioned in the same articles:

### **Query Functionality**
- **Person Extraction**: Parses person names from GDELT V2Persons field
- **Name Cleaning**: Strips the `,offset` suffix that V2Persons attaches to each name
- **Co-occurrence Detection**: Finds people mentioned together in the same articles
- **Relationship Scoring**: Counts frequency of co-appearances
- **Flexible Filtering**: Can search for specific person or analyze all relationships

### **SQL Analysis Process**
1. **Data Extraction**: Unnests person lists from V2Persons field
2. **Name Standardization**: Drops the character offset so repeated mentions collapse to one name per article
3. **Self-Join**: Matches articles to find person pairs
4. **Aggregation**: Counts co-occurrence frequencies
5. **Ranking**: Orders results by relationship strength
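
The five steps above can be mirrored in plain Python on a toy input, which makes the pairing logic easier to check before running it on BigQuery. This is a minimal sketch, not the notebook's query: `count_person_pairs` and the sample records are hypothetical, and unlike the SQL it skips empty entries.

```python
from collections import Counter
from itertools import combinations


def count_person_pairs(articles):
    """articles maps a record id to a raw V2Persons string ("Name,offset;Name,offset")."""
    pair_counts = Counter()
    for v2persons in articles.values():
        # Steps 1-2: split on ';' and drop the ',offset' suffix.
        # The set plays the role of SELECT DISTINCT within one article.
        names = {entry.split(",", 1)[0] for entry in v2persons.split(";") if entry}
        # Step 3: sorting first means every unordered pair is emitted once,
        # with name1 < name2, like the self-join condition in the SQL.
        for name1, name2 in combinations(sorted(names), 2):
            pair_counts[(name1, name2)] += 1
    # Steps 4-5: most_common() returns pairs ordered by count, highest first.
    return pair_counts.most_common()


# Hypothetical sample data for illustration only.
sample = {
    "rec-1": "Alice Smith,10;Bob Jones,52;Alice Smith,140",
    "rec-2": "Bob Jones,7;Alice Smith,33",
    "rec-3": "Carol White,5;Bob Jones,90",
}
print(count_person_pairs(sample))
```

Alice Smith appears twice in `rec-1` but contributes a single pair there, which is the effect `SELECT DISTINCT` has in the query.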

### **Output Format**
- **Person Pairs**: Two-person combinations (`name1`, `name2`), with `name1` sorting before `name2`
- **Co-occurrence Count**: `pair_count`, the number of articles mentioning both people
- **Relationship Strength**: `pair_count` doubles as the strength score; there is no separate score column
- **Top Results**: Limited to the 25,000 most frequent pairs

### **Use Cases**
- Political network analysis
- Media relationship mapping
- Influence pattern detection
- Social network construction

🔍 **Configurable**: Set the `search_person` variable to focus on a specific individual, or leave it empty to analyze all relationships.
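
Because `search_person` ends up inside the SQL text, a name containing a quote (for example O'Brien) can break the query. One possible alternative, shown here only as a sketch, is a named BigQuery query parameter. `build_person_filter` and `to_job_config` are hypothetical helpers, not part of the notebook; `bigquery.QueryJobConfig` and `bigquery.ScalarQueryParameter` come from the `google-cloud-bigquery` client library.

```python
def build_person_filter(search_person=""):
    """Return (where_clause, params) using a named @search_person placeholder."""
    if not search_person:
        return "", {}
    # The value travels separately from the SQL text, so quotes in a name
    # cannot terminate the string literal or alter the query.
    where_clause = "WHERE V2Persons LIKE CONCAT('%', @search_person, '%')"
    return where_clause, {"search_person": search_person}


def to_job_config(params):
    """Turn the params dict into a query job config (requires google-cloud-bigquery)."""
    from google.cloud import bigquery  # lazy import keeps build_person_filter dependency-free

    return bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter(name, "STRING", value)
            for name, value in params.items()
        ]
    )


where_clause, params = build_person_filter("O'Brien")
print(where_clause)
print(params)
```

The config would then be passed as `client.query(query, job_config=to_job_config(params), location=...)`. Note that `%` and `_` inside the search term still act as `LIKE` wildcards.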

In [None]:
# Query to find person co-occurrence patterns
def query_person_cooccurrence(search_person=""):
    """
    Query to find person co-occurrence patterns for a specified person in GDELT data.
    This query identifies which people appear together with the specified person in the same articles.
    If search_person is empty, returns all person co-occurrence patterns.
    """
    
    if search_person:
        print(f"🔍 Querying person co-occurrence patterns for '{search_person}'...")
        where_clause = f"WHERE V2Persons LIKE '%{search_person}%'"
    else:
        print("🔍 Querying all person co-occurrence patterns...")
        where_clause = ""
    
    # The BigQuery SQL query
    query = f"""
    WITH ArticleNames AS (
      SELECT DISTINCT  -- one row per (article, person), even if a name repeats
        GKGRECORDID,
        REGEXP_REPLACE(person, r',.*', '') AS name -- drop the ',offset' suffix after each name
      FROM
        `{GCP_PROJECT_ID}.{BIGQUERY_DATASET}.gkg_partitioned`,
        UNNEST(SPLIT(V2Persons, ';')) AS person
      {where_clause}
    )
    -- This section creates the pairs by joining the table to itself
    SELECT
      a.name AS name1,
      b.name AS name2,
      COUNT(*) AS pair_count
    FROM
      ArticleNames AS a
    JOIN
      ArticleNames AS b ON a.GKGRECORDID = b.GKGRECORDID
    WHERE
      a.name < b.name -- This avoids duplicates and self-pairs
    GROUP BY
      1, 2
    ORDER BY
      pair_count DESC
    LIMIT 25000;  
    """
    
    try:
        # Create BigQuery client
        client = bigquery.Client(project=GCP_PROJECT_ID)
        print(f"✅ Connected to BigQuery project: {GCP_PROJECT_ID}")
        
        # Execute the query
        print("📊 Executing query...")
        if search_person:
            print(f"🔍 Query: Finding person co-occurrence patterns for '{search_person}'")
        else:
            print("🔍 Query: Finding all person co-occurrence patterns")
        
        query_job = client.query(query, location=PROJECT_REGION)
        results = query_job.result()
        
        # Process results manually (simple approach)
        print("📋 Processing results...")
        rows = []
        for row in results:
            rows.append({
                'name1': row.name1,
                'name2': row.name2,
                'pair_count': row.pair_count
            })
        
        # Create DataFrame manually
        df = pd.DataFrame(rows)
        print("✅ Results processed successfully")
        
        print(f"✅ Query completed successfully!")
        print(f"📊 Found {len(df)} person co-occurrence pairs")
        print("\n" + "="*80)
        if search_person:
            print(f"📈 TOP PERSON CO-OCCURRENCE PATTERNS WITH '{search_person.upper()}'")
        else:
            print("📈 TOP PERSON CO-OCCURRENCE PATTERNS")
        print("="*80)
        
        if len(df) > 0:
            # Display the results
            print(df.to_string(index=False))
            
            # Show some statistics
            print(f"\n📊 Summary Statistics:")
            print(f"   Total pairs found: {len(df)}")
            print(f"   Highest co-occurrence count: {df['pair_count'].max()}")
            print(f"   Average co-occurrence count: {df['pair_count'].mean():.2f}")
            
            # Show the top 10 most frequent co-occurrences
            print(f"\n🏆 TOP 10 MOST FREQUENT CO-OCCURRENCES:")
            print("-" * 60)
            top_10 = df.head(10)
            for idx, row in top_10.iterrows():
                print(f"{idx+1:2d}. {row['name1']} & {row['name2']} - {row['pair_count']} times")
        else:
            print(f"❌ No co-occurrence patterns found for '{search_person}'")
            print("💡 This could mean:")
            print(f"   - No articles contain '{search_person}' in the V2Persons field")
            print("   - The data might not be loaded for the target date")
            print("   - There might be a spelling variation in the data")
        
        return df
        
    except Exception as e:
        print(f"❌ Error executing query: {e}")
        print("💡 Troubleshooting tips:")
        print("   - Check if the gkg_partitioned table exists and has data")
        print("   - Verify the table has V2Persons column")
        print("   - Ensure you have the necessary BigQuery permissions")
        return None

# Execute the query
search_person = ""  # Define the search person here
cooccurrence_results = query_person_cooccurrence(search_person)


## 🕸️ Network Graph Visualization

This cell draws a static network visualization from the person co-occurrence data using NetworkX and Matplotlib:

### **Graph Construction**
- **Node Creation**: Each person becomes a network node
- **Edge Creation**: Co-occurrence relationships become weighted edges
- **Weight Scaling**: Divides co-occurrence counts by 10 to get edge weights for the plot
- **Layout Algorithm**: Uses the NetworkX spring layout (`k=3`, 50 iterations) to position nodes

### **Visualization Features**
- **Large Canvas**: 30x20 inch figure for detailed viewing
- **Node Styling**: Light blue nodes at a fixed size (`node_size=500`)
- **Edge Styling**: Gray edges with transparency for clarity
- **Label Display**: Shows person names on nodes
- **Weight Labels**: Displays the scaled weight on edges whose weight exceeds 5

### **Network Analytics**
- **Node Count**: Total number of people in the network
- **Edge Count**: Total number of relationships
- **Average Degree**: Mean connections per person
- **Centrality Analysis**: Identifies most connected individuals
- **Top Nodes Ranking**: Shows most influential people by connections
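
The analytics listed above reduce to simple counting. As a minimal sketch in plain Python (the `degree_stats` helper and the toy edge list are hypothetical), average degree is the mean number of neighbours, and degree centrality is a node's degree divided by `n - 1`, which is the ratio `nx.degree_centrality` reports for a simple graph with at least two nodes.

```python
from collections import defaultdict


def degree_stats(pairs):
    """pairs is an iterable of (name1, name2) edges; returns (average_degree, centrality)."""
    neighbors = defaultdict(set)
    for name1, name2 in pairs:
        neighbors[name1].add(name2)
        neighbors[name2].add(name1)
    n = len(neighbors)
    degrees = {node: len(adjacent) for node, adjacent in neighbors.items()}
    average_degree = sum(degrees.values()) / n if n else 0.0
    # Degree centrality: share of the other n - 1 nodes this node is linked to.
    centrality = {
        node: (degree / (n - 1) if n > 1 else 0.0) for node, degree in degrees.items()
    }
    return average_degree, centrality


# Hypothetical toy edges for illustration only.
edges = [("Alice", "Bob"), ("Alice", "Carol"), ("Alice", "Dave"), ("Bob", "Carol")]
average, centrality = degree_stats(edges)
print(f"average degree: {average:.2f}")
for node, score in sorted(centrality.items(), key=lambda item: item[1], reverse=True):
    print(f"{node}: {score:.3f}")
```

Here Alice is linked to all three other people, so her centrality is 1.0, while Dave, with a single link, scores one third.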

### **Customization Options**
- **Search Focus**: The `search_person` value appears in the plot title
- **Title Customization**: Dynamic titles based on analysis focus
- **Threshold Filtering**: Labels only edges above the weight threshold; all edges are still drawn
- **Size Scaling**: Node size and font size are set in the `nx.draw` call and can be edited there

📊 **Output**: High-resolution network graph with statistical summary and centrality rankings.


In [None]:
# Create network visualization from co-occurrence results
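def create_person_network_graph(df, search_person="Rayner", title="Person Co-occurrence Network"):
    """
    Build and draw a weighted NetworkX graph from the co-occurrence DataFrame.

    df needs the columns name1, name2 and pair_count; search_person is used only
    in the plot title. Returns the graph, or None when df is empty.
    """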
    print("🕸️  Creating network graph from co-occurrence data...")
    
    if df is None or len(df) == 0:
        print("❌ No data available to create network graph")
        return None
    
    # Create the graph
    g = nx.Graph()
    
    # Add edges with weights based on co-occurrence count
    for _, row in df.iterrows():
        name1 = row['name1']
        name2 = row['name2']
        weight = row['pair_count']
            
        # Add edge with weight (scaled down for visualization)
        g.add_edge(name1, name2, weight=weight/10)
    
    print(f"✅ Graph created with {g.number_of_nodes()} nodes and {g.number_of_edges()} edges")
    
    if not search_person:
        search_person = "All"
    # Create the visualization
    plt.figure(figsize=(30, 20))
    plt.title(f'GDELT Project: {title}\nPerson Co-occurrence Network for "{search_person}"', 
              y=0.97, fontsize=20, fontweight='bold')
    
    # Draw the network
    pos = nx.spring_layout(g, k=3, iterations=50)  # Layout algorithm
    nx.draw(g, pos, 
            with_labels=True, 
            node_color='lightblue',
            node_size=500,
            font_size=8,
            font_weight='bold',
            edge_color='gray',
            alpha=0.7)
    
    # Add edge labels for weights (only for top edges to avoid clutter)
    edge_labels = {}
    for (u, v, d) in g.edges(data=True):
        if d['weight'] > 5:  # Only show labels for significant connections
            edge_labels[(u, v)] = f"{d['weight']:.1f}"
    
    nx.draw_networkx_edge_labels(g, pos, edge_labels, font_size=6)
    
    plt.tight_layout()
    plt.show()
    
    # Print some network statistics
    print(f"\n📊 Network Statistics:")
    print(f"   Nodes: {g.number_of_nodes()}")
    print(f"   Edges: {g.number_of_edges()}")
    print(f"   Average degree: {sum(dict(g.degree()).values()) / g.number_of_nodes():.2f}")
    
    # Find the most connected nodes
    degree_centrality = nx.degree_centrality(g)
    top_nodes = sorted(degree_centrality.items(), key=lambda x: x[1], reverse=True)[:5]
    print(f"\n🏆 Most Connected Nodes:")
    for i, (node, centrality) in enumerate(top_nodes, 1):
        print(f"   {i}. {node}: {centrality:.3f}")
    
    return g

# Create the network graph from the co-occurrence results
network_graph = create_person_network_graph(cooccurrence_results, search_person)


## 📁 Network Export for Gephi

This cell exports the NetworkX graph to GEXF format for advanced analysis in Gephi software:

### **Export Process**
- **Directory Detection**: Automatically finds suitable output directory
- **Fallback Strategy**: Collects candidate directories (current, script, home, /tmp) and uses the first one found
- **GEXF Format**: Exports to Graph Exchange XML Format
- **File Management**: Creates organized export directory structure

### **Directory Strategy**
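1. **Current Directory**: Tries the notebook's working directory first
2. **Script Directory**: Added only when `__file__` is defined, which is usually not the case inside a notebook
3. **Home Directory**: Falls back to the user's home directory
4. **Temporary Directory**: Uses /tmp as a last resort
5. **Export Folder**: Creates a `gephi_exports` subdirectory in the first candidate found; if that fails, the file is written to the candidate directory itself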
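
A stricter variant of this strategy would test each candidate in turn instead of settling on the first one collected. The sketch below is an assumption about how that could look, not the notebook's implementation; `pick_export_dir` is a hypothetical helper name, and the demo runs inside a temporary directory so it leaves nothing behind.

```python
import os
import tempfile
from pathlib import Path


def pick_export_dir(candidates, subdir="gephi_exports"):
    """Return the first candidate where `subdir` can be created and written to, else None."""
    for base in candidates:
        target = Path(base) / subdir
        try:
            target.mkdir(parents=True, exist_ok=True)
        except OSError:
            continue  # cannot create the folder here, so try the next candidate
        if os.access(target, os.W_OK):
            return target
    return None


with tempfile.TemporaryDirectory() as scratch:
    # A regular file stands in for an unusable candidate: mkdir beneath it fails.
    blocked = Path(scratch) / "not_a_dir"
    blocked.write_text("placeholder")
    chosen = pick_export_dir([blocked, Path(scratch)])
    print(chosen)
```

In the notebook the candidate list would be something like `[Path.cwd(), Path.home(), Path("/tmp")]`.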

### **Gephi Integration**
- **File Format**: Standard GEXF format compatible with Gephi
- **Metadata Preservation**: Maintains node and edge attributes
- **Import Instructions**: Provides step-by-step Gephi import guide
- **Layout Recommendations**: Suggests Force Atlas 2 algorithm

### **Advanced Analysis Capabilities**
Once imported into Gephi, you can:
- Apply sophisticated layout algorithms
- Perform community detection
- Calculate advanced centrality measures
- Create publication-quality visualizations
- Export to various formats (PNG, PDF, SVG)

### **File Output**
- **Filename**: `gdelt_person_cooccurrence.gexf`
- **Location**: `gephi_exports/` under the current working directory (or the first alternative directory found)
- **Format**: XML-based graph exchange format

🎯 **Purpose**: Enables professional-grade network analysis and visualization in Gephi software.


In [None]:
# Export NetworkX graph to Gephi file formats

def export_graph_to_gephi(graph, filename_prefix="gdelt_person_network"):
    """
    Export NetworkX graph to GEXF format for Gephi
    """
    if graph is None:
        print("❌ No graph available to export")
        return None
    
    print("📁 Exporting NetworkX graph to GEXF format...")
    print(f"   Graph: {graph.number_of_nodes()} nodes, {graph.number_of_edges()} edges")
    
    # Use a more robust approach to get the output directory
    # Try multiple fallback options
    possible_dirs = []
    
    # Option 1: Try to get current working directory
    try:
        current_dir = Path.cwd()
        possible_dirs.append(current_dir)
        print(f"📂 Found current directory: {current_dir}")
    except (FileNotFoundError, OSError) as e:
        print(f"⚠️  Current directory not accessible: {e}")
    
    # Option 2: Use the notebook directory (where this notebook is located)
    try:
        notebook_dir = Path(__file__).parent if '__file__' in globals() else None
        if notebook_dir and notebook_dir.exists():
            possible_dirs.append(notebook_dir)
            print(f"📂 Found notebook directory: {notebook_dir}")
    except Exception:
        pass
    
    # Option 3: Use home directory as fallback
    try:
        home_dir = Path.home()
        possible_dirs.append(home_dir)
        print(f"📂 Using home directory as fallback: {home_dir}")
    except Exception:
        pass
    
    # Option 4: Use /tmp as last resort
    try:
        tmp_dir = Path("/tmp")
        if tmp_dir.exists():
            possible_dirs.append(tmp_dir)
            print(f"📂 Using /tmp directory as last resort: {tmp_dir}")
    except Exception:
        pass
    
    # Select the first available directory
    if not possible_dirs:
        print("❌ No suitable directory found for export")
        return None
    
    base_dir = possible_dirs[0]
    output_dir = base_dir / "gephi_exports"
    
    # Create output directory if it doesn't exist
    try:
        output_dir.mkdir(exist_ok=True)
        print(f"📂 Output directory: {output_dir}")
    except Exception as e:
        print(f"❌ Could not create output directory: {e}")
        # Fall back to base directory
        output_dir = base_dir
        print(f"📂 Using base directory: {output_dir}")
    
    try:
        # Export to GEXF format only
        gexf_filename = output_dir / f"{filename_prefix}.gexf"
        nx.write_gexf(graph, str(gexf_filename))
        print(f"✅ GEXF file exported: {gexf_filename}")
        
        print(f"\n🎉 Graph export completed successfully!")
        print(f"📂 File created: {gexf_filename}")
        
        print(f"\n💡 To import into Gephi:")
        print(f"   1. Open Gephi")
        print(f"   2. File → Open → Select '{gexf_filename.name}'")
        print(f"   3. Choose appropriate layout algorithm (e.g., Force Atlas 2)")
        print(f"   4. Adjust node sizes and colors as needed")
        
        return str(gexf_filename)
        
    except Exception as e:
        print(f"❌ Error exporting graph: {e}")
        print(f"💡 Base directory: {base_dir}")
        print(f"💡 Output directory: {output_dir}")
        print(f"💡 Directory exists: {output_dir.exists()}")
        print(f"💡 Directory writable: {os.access(output_dir, os.W_OK)}")
        return None

# Export the network graph to GEXF format
if 'network_graph' in locals() and network_graph is not None:
    export_file = export_graph_to_gephi(network_graph, "gdelt_person_cooccurrence")
else:
    print("⚠️  No network graph available. Please run the network creation cell first.")

## 🏗️ Graph Database Schema Creation

This cell creates a normalized graph database schema optimized for GDELT network analysis:

### **Schema Design**
The schema implements a proper graph database structure with dedicated tables for:

#### **Node Tables (Entities)**
- **person**: Individual people with name parsing and mention statistics
- **organization**: Organizations with type classification and geographic info
- **location**: Geographic locations with coordinates and country codes
- **event**: GDELT events with categorization and descriptions
- **article**: Article metadata with tone scores and publication info

#### **Relationship Tables (Edges)**
- **person_cooccurrence**: Person-to-person relationships with strength scores
- **person_organization**: Person-organization affiliations
- **person_location**: Person-location associations
- **event_participant**: Event participation relationships

### **Schema Features**
- **UUID Primary Keys**: Unique identifiers for all entities, stored as `STRING` and declared `NOT ENFORCED`, so BigQuery does not check uniqueness
- **Clustering**: Optimized for query performance on key columns
- **Temporal Tracking**: First/last seen dates for relationship evolution
- **Array Fields**: Supports multiple values (name variations, article IDs)
- **Metadata Fields**: Creation and update timestamps
- **Flexible Relationships**: Support for different relationship types
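
The schema only declares the key columns; it does not say how the identifiers are produced. One possible approach, shown as a sketch under that assumption, is to derive deterministic UUIDs with `uuid.uuid5` so the same name always maps to the same `person_id`. The namespace URL and the helpers `person_id` and `cooccurrence_id` are hypothetical.

```python
import uuid

# Hypothetical namespace for this demo; any fixed UUID works as long as it never changes.
GDELT_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.com/gdelt-graph-demo")


def person_id(name):
    """Stable person_id: the same normalized name always yields the same UUID string."""
    normalized = " ".join(name.split()).lower()
    return str(uuid.uuid5(GDELT_NAMESPACE, f"person:{normalized}"))


def cooccurrence_id(person1_id, person2_id):
    """Stable relationship_id that does not depend on argument order."""
    low, high = sorted((person1_id, person2_id))
    return str(uuid.uuid5(GDELT_NAMESPACE, f"cooccurrence:{low}|{high}"))


alice = person_id("Alice Smith")
alice_again = person_id("  alice   SMITH ")
bob = person_id("Bob Jones")
print(alice == alice_again)
print(cooccurrence_id(alice, bob) == cooccurrence_id(bob, alice))
```

Deterministic IDs make reloads idempotent: re-running the load produces the same keys instead of new random ones, which matters because the `NOT ENFORCED` constraints will not catch duplicates.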

### **Performance Optimizations**
- **Clustered Tables**: Organized by most frequently queried columns
- **No Partitioning**: Simplified structure to avoid BigQuery complexity
- **Efficient Joins**: Designed for fast relationship queries
- **Clustering Order**: Each table lists its ID column first in `CLUSTER BY`, since BigQuery sorts data by clustering columns in the order given

🎯 **Purpose**: Creates a foundation for sophisticated graph queries and analysis on GDELT data.

In [None]:
# Create Graph Schema Tables (Fixed - No Partitioning)

def create_graph_schema_tables_fixed():
    """
    Create normalized graph schema tables for GDELT data analysis.
    Fixed version without partitioning to avoid BigQuery errors.
    """
    print("🏗️  Creating Graph Schema Tables (Fixed Version)...")
    print(f"📊 Project: {GCP_PROJECT_ID}")
    print(f"🗄️  Dataset: {BIGQUERY_DATASET}")
    print("-" * 70)
    
    try:
        # Create BigQuery client
        client = bigquery.Client(project=GCP_PROJECT_ID)
        print("✅ BigQuery client created")
        
        # Define all table creation queries (without partitioning)
        table_definitions = {
            "person": """
            CREATE TABLE IF NOT EXISTS `{project_id}.{dataset}.person` (
              person_id STRING NOT NULL OPTIONS(description="Logical Primary Key. Unique identifier for the person."),
              name STRING NOT NULL OPTIONS(description="The common name of the person."),
              first_name STRING,
              last_name STRING,
              full_name STRING,
              name_variations ARRAY<STRING> OPTIONS(description="Known variations or aliases of the person's name."),
              first_seen_date DATE OPTIONS(description="The date the person was first mentioned."),
              last_seen_date DATE OPTIONS(description="The date the person was last mentioned."),
              total_mentions INT64 OPTIONS(description="A total count of the person's mentions."),
              created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP() OPTIONS(description="Timestamp when the record was created."),
              updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP() OPTIONS(description="Timestamp when the record was last updated."),
              PRIMARY KEY (person_id) NOT ENFORCED
            )
            CLUSTER BY person_id, name
            OPTIONS(
              description="Node table containing master list of all identified individuals."
            )
            """,
            
            "organization": """
            CREATE TABLE IF NOT EXISTS `{project_id}.{dataset}.organization` (
              org_id STRING NOT NULL OPTIONS(description="Logical Primary Key. Unique identifier for the organization."),
              name STRING NOT NULL OPTIONS(description="The common name of the organization."),
              org_type STRING OPTIONS(description="The type or category of the organization."),
              country_code STRING OPTIONS(description="ISO country code where the organization is based."),
              first_seen_date DATE OPTIONS(description="The date the organization was first mentioned."),
              last_seen_date DATE OPTIONS(description="The date the organization was last mentioned."),
              total_mentions INT64 OPTIONS(description="A total count of the organization's mentions."),
              created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP() OPTIONS(description="Timestamp when the record was created."),
              updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP() OPTIONS(description="Timestamp when the record was last updated."),
              PRIMARY KEY (org_id) NOT ENFORCED
            )
            CLUSTER BY org_id, name
            OPTIONS(
              description="Node table containing master list of all identified organizations."
            )
            """,
            
            "location": """
            CREATE TABLE IF NOT EXISTS `{project_id}.{dataset}.location` (
              location_id STRING NOT NULL OPTIONS(description="Logical Primary Key. Unique identifier for the location."),
              name STRING NOT NULL OPTIONS(description="The common name of the location."),
              location_type STRING OPTIONS(description="The type of location (city, country, region, etc.)."),
              country_code STRING OPTIONS(description="ISO country code for the location."),
              latitude FLOAT64 OPTIONS(description="Geographic latitude coordinate."),
              longitude FLOAT64 OPTIONS(description="Geographic longitude coordinate."),
              first_seen_date DATE OPTIONS(description="The date the location was first mentioned."),
              last_seen_date DATE OPTIONS(description="The date the location was last mentioned."),
              total_mentions INT64 OPTIONS(description="A total count of the location's mentions."),
              created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP() OPTIONS(description="Timestamp when the record was created."),
              updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP() OPTIONS(description="Timestamp when the record was last updated."),
              PRIMARY KEY (location_id) NOT ENFORCED
            )
            CLUSTER BY location_id, name
            OPTIONS(
              description="Node table containing master list of all identified geographic locations."
            )
            """,
            
            "event": """
            CREATE TABLE IF NOT EXISTS `{project_id}.{dataset}.event` (
              event_id STRING NOT NULL OPTIONS(description="Logical Primary Key. Unique identifier for the event."),
              event_code STRING OPTIONS(description="GDELT event code classification."),
              event_description STRING OPTIONS(description="Human-readable description of the event."),
              event_category STRING OPTIONS(description="High-level category classification of the event."),
              first_seen_date DATE OPTIONS(description="The date the event was first mentioned."),
              last_seen_date DATE OPTIONS(description="The date the event was last mentioned."),
              total_mentions INT64 OPTIONS(description="A total count of the event's mentions."),
              created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP() OPTIONS(description="Timestamp when the record was created."),
              updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP() OPTIONS(description="Timestamp when the record was last updated."),
              PRIMARY KEY (event_id) NOT ENFORCED
            )
            CLUSTER BY event_id, event_code
            OPTIONS(
              description="Node table containing master list of all identified events."
            )
            """,
            
            "person_cooccurrence": """
            CREATE TABLE IF NOT EXISTS `{project_id}.{dataset}.person_cooccurrence` (
              relationship_id STRING NOT NULL OPTIONS(description="Logical Primary Key. Unique identifier for this specific co-occurrence."),
              person1_id STRING NOT NULL OPTIONS(description="Logical Foreign Key referencing person_id in the person table."),
              person2_id STRING NOT NULL OPTIONS(description="Logical Foreign Key referencing person_id in the person table."),
              cooccurrence_count INT64 OPTIONS(description="The number of times these two people were mentioned together."),
              first_cooccurrence_date DATE OPTIONS(description="The date of the first joint mention."),
              last_cooccurrence_date DATE OPTIONS(description="The date of the most recent joint mention."),
              article_ids ARRAY<STRING> OPTIONS(description="A list of article IDs where the co-occurrence was found."),
              themes ARRAY<STRING> OPTIONS(description="A list of themes associated with their joint mentions."),
              themes_summary STRING OPTIONS(description="A summary of the common themes in their co-occurrence."),
              strength_score FLOAT64 OPTIONS(description="A calculated score representing the strength of the relationship."),
              created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP() OPTIONS(description="Timestamp when the record was created."),
              updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP() OPTIONS(description="Timestamp when the record was last updated."),
              PRIMARY KEY (relationship_id) NOT ENFORCED,
              CONSTRAINT fk_person1 FOREIGN KEY (person1_id) REFERENCES `{project_id}.{dataset}.person` (person_id) NOT ENFORCED,
              CONSTRAINT fk_person2 FOREIGN KEY (person2_id) REFERENCES `{project_id}.{dataset}.person` (person_id) NOT ENFORCED
            )
            CLUSTER BY person1_id, person2_id
            OPTIONS(
              description="Edge table storing the relationships (co-occurrences) between individuals from the person table."
            )
            """,
            
            "person_organization": """
            CREATE TABLE IF NOT EXISTS `{project_id}.{dataset}.person_organization` (
              relationship_id STRING NOT NULL OPTIONS(description="Logical Primary Key. Unique identifier for the person-organization relationship."),
              person_id STRING NOT NULL OPTIONS(description="Logical Foreign Key referencing person.person_id."),
              org_id STRING NOT NULL OPTIONS(description="Logical Foreign Key referencing organization.org_id."),
              relationship_type STRING OPTIONS(description="Type of relationship between the person and organization."),
              mention_count INT64 OPTIONS(description="Number of mentions linking this person to the organization."),
              first_mention_date DATE OPTIONS(description="Date of the first mention linking them."),
              last_mention_date DATE OPTIONS(description="Date of the most recent mention linking them."),
              article_ids ARRAY<STRING> OPTIONS(description="List of article IDs where this relationship was mentioned."),
              created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP() OPTIONS(description="Timestamp when the record was created."),
              updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP() OPTIONS(description="Timestamp when the record was last updated."),
              PRIMARY KEY (relationship_id) NOT ENFORCED,
              CONSTRAINT fk_po_person FOREIGN KEY (person_id) REFERENCES `{project_id}.{dataset}.person` (person_id) NOT ENFORCED,
              CONSTRAINT fk_po_org FOREIGN KEY (org_id) REFERENCES `{project_id}.{dataset}.organization` (org_id) NOT ENFORCED
            )
            CLUSTER BY person_id, org_id
            OPTIONS(
              description="Edge table for person-to-organization affiliations and mentions."
            )
            """,
            
            "person_location": """
            CREATE TABLE IF NOT EXISTS `{project_id}.{dataset}.person_location` (
              relationship_id STRING NOT NULL OPTIONS(description="Logical Primary Key. Unique identifier for the person-location relationship."),
              person_id STRING NOT NULL OPTIONS(description="Logical Foreign Key referencing person.person_id."),
              location_id STRING NOT NULL OPTIONS(description="Logical Foreign Key referencing location.location_id."),
              relationship_type STRING OPTIONS(description="Type of relationship between the person and location."),
              mention_count INT64 OPTIONS(description="Number of mentions linking this person to the location."),
              first_mention_date DATE OPTIONS(description="Date of the first mention linking them."),
              last_mention_date DATE OPTIONS(description="Date of the most recent mention linking them."),
              article_ids ARRAY<STRING> OPTIONS(description="List of article IDs where this relationship was mentioned."),
              created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP() OPTIONS(description="Timestamp when the record was created."),
              updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP() OPTIONS(description="Timestamp when the record was last updated."),
              PRIMARY KEY (relationship_id) NOT ENFORCED,
              CONSTRAINT fk_pl_person FOREIGN KEY (person_id) REFERENCES `{project_id}.{dataset}.person` (person_id) NOT ENFORCED,
              CONSTRAINT fk_pl_location FOREIGN KEY (location_id) REFERENCES `{project_id}.{dataset}.location` (location_id) NOT ENFORCED
            )
            CLUSTER BY person_id, location_id
            OPTIONS(
              description="Edge table for person-to-location associations and mentions."
            )
            """,
            
            "event_participant": """
            CREATE TABLE IF NOT EXISTS `{project_id}.{dataset}.event_participant` (
              relationship_id STRING NOT NULL OPTIONS(description="Logical Primary Key. Unique identifier for the event-participant relationship."),
              event_id STRING NOT NULL OPTIONS(description="Logical Foreign Key referencing event.event_id."),
              participant_id STRING NOT NULL OPTIONS(description="Identifier of the participant (person, organization, or location)."),
              participant_type STRING OPTIONS(description="Type of participant: PERSON, ORGANIZATION, or LOCATION."),
              role STRING OPTIONS(description="Role of the participant in the event (e.g., initiator, target)."),
              mention_count INT64 OPTIONS(description="Number of mentions linking this participant to the event."),
              first_mention_date DATE OPTIONS(description="Date of the first mention linking them."),
              last_mention_date DATE OPTIONS(description="Date of the most recent mention linking them."),
              article_ids ARRAY<STRING> OPTIONS(description="List of article IDs where this participation was mentioned."),
              created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP() OPTIONS(description="Timestamp when the record was created."),
              updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP() OPTIONS(description="Timestamp when the record was last updated."),
              PRIMARY KEY (relationship_id) NOT ENFORCED,
              CONSTRAINT fk_ep_event FOREIGN KEY (event_id) REFERENCES `{project_id}.{dataset}.event` (event_id) NOT ENFORCED
            )
            CLUSTER BY event_id, participant_id
            OPTIONS(
              description="Edge table for event participation by persons, organizations, or locations."
            )
            """,
            
            "article": """
            CREATE TABLE IF NOT EXISTS `{project_id}.{dataset}.article` (
              article_id STRING NOT NULL OPTIONS(description="Logical Primary Key. Unique identifier for the article."),
              gkg_record_id STRING OPTIONS(description="Reference to the original GDELT GKG record."),
              url STRING OPTIONS(description="URL of the original article."),
              title STRING OPTIONS(description="Title of the article."),
              publish_date DATE OPTIONS(description="Date when the article was published."),
              source_name STRING OPTIONS(description="Name of the media source."),
              language STRING OPTIONS(description="Language code of the article."),
              tone_score FLOAT64 OPTIONS(description="Sentiment tone score of the article."),
              created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP() OPTIONS(description="Timestamp when the record was created."),
              PRIMARY KEY (article_id) NOT ENFORCED
            )
            CLUSTER BY article_id, publish_date
            OPTIONS(
              description="Node table containing master list of all articles from GDELT data."
            )
            """
        }
        
        # Create tables
        results = {}
        for table_name, query_template in table_definitions.items():
            print(f"\n📝 Creating table: {table_name}")
            
            try:
                # Format the query with project and dataset
                query = query_template.format(
                    project_id=GCP_PROJECT_ID,
                    dataset=BIGQUERY_DATASET
                )
                
                # Execute the query
                query_job = client.query(query, location="US-CENTRAL1")
                query_job.result()  # Wait for completion
                
                print(f"✅ Table '{table_name}' created successfully")
                results[table_name] = "success"
                
            except Exception as e:
                error_msg = str(e)
                if "already exists" in error_msg.lower() or "409" in error_msg:
                    print(f"⏭️  Table '{table_name}' already exists")
                    results[table_name] = "already_exists"
                else:
                    print(f"❌ Error creating table '{table_name}': {e}")
                    results[table_name] = f"error: {e}"
        
        # Print summary
        print(f"\n{'='*70}")
        print("📊 GRAPH SCHEMA CREATION SUMMARY (FIXED)")
        print(f"{'='*70}")
        
        success_count = sum(1 for r in results.values() if r == "success")
        exists_count = sum(1 for r in results.values() if r == "already_exists")
        error_count = sum(1 for r in results.values() if r.startswith("error"))
        
        print(f"✅ Successfully created: {success_count}/{len(table_definitions)} tables")
        print(f"⏭️  Already existed: {exists_count}/{len(table_definitions)} tables")
        print(f"❌ Errors: {error_count}/{len(table_definitions)} tables")
        
        print(f"\nDetailed results:")
        for table_name, result in results.items():
            if result == "success":
                print(f"  ✅ {table_name}: Created")
            elif result == "already_exists":
                print(f"  ⏭️  {table_name}: Already exists")
            else:
                print(f"  ❌ {table_name}: {result}")
        
        print(f"\n🎉 Graph schema tables creation completed!")
        print(f"\n💡 Next steps:")
        print(f"   1. Populate node tables with data from gkg_partitioned")
        print(f"   2. Create relationship tables from co-occurrence analysis")
        print(f"   3. Run graph queries on the normalized schema")
        
        return results
        
    except Exception as e:
        print(f"❌ Critical error during table creation: {e}")
        print("💡 Troubleshooting tips:")
        print("   - Check if you have BigQuery admin permissions")
        print("   - Verify the dataset exists")
        print("   - Ensure BigQuery API is enabled")
        return None

# Execute the fixed table creation
schema_results_fixed = create_graph_schema_tables_fixed()


🏗️  Creating Graph Schema Tables (Fixed Version)...
📊 Project: graph-demo-471710
🗄️  Dataset: gdelt
----------------------------------------------------------------------
✅ BigQuery client created

📝 Creating table: person
✅ Table 'person' created successfully

📝 Creating table: organization
✅ Table 'organization' created successfully

📝 Creating table: location
✅ Table 'location' created successfully

📝 Creating table: event
✅ Table 'event' created successfully

📝 Creating table: person_cooccurrence
✅ Table 'person_cooccurrence' created successfully

📝 Creating table: person_organization
✅ Table 'person_organization' created successfully

📝 Creating table: person_location
✅ Table 'person_location' created successfully

📝 Creating table: event_participant
✅ Table 'event_participant' created successfully

📝 Creating table: article
✅ Table 'article' created successfully

📊 GRAPH SCHEMA CREATION SUMMARY (FIXED)
✅ Successfully created: 9/9 tables
⏭️  Already existed: 0/9 tables
❌ Errors: 0

## 👥 Person Entity Population

This cell populates the person table with cleaned and deduplicated person data from GDELT:

### **Data Processing Pipeline**
1. **Name Extraction**: Extracts person names from V2Persons field
2. **Name Cleaning**: Strips the comma-delimited suffix from each entry and removes a leading "A " prefix
3. **Name Parsing**: Separates first and last names using string splitting
4. **Deduplication**: Ensures each unique person appears only once
5. **Mention Counting**: Calculates total mentions across all articles

### **Name Cleaning Rules**
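- **Suffix Removal**: Strips everything from the first comma onward (in V2Persons this is typically the character offset appended to each name)
- **Prefix Removal**: Removes a leading "A " (regex `^A `)
- **Standardization**: Applies the same two regex rules to every entry so names compare consistently; whitespace and letter case are left unchanged
- **Validation**: Filters out names that are empty after cleaning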
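
The cleaning and splitting rules above can be mirrored in plain Python, which is handy for checking edge cases before running the BigQuery job. This is a minimal sketch rather than the notebook's implementation: the helper names `clean_person` and `split_name` and the sample strings are hypothetical, and only the two regular expressions are taken from the query in the next cell.

```python
import re

def clean_person(raw: str) -> str:
    """Drop everything from the first comma onward, then a leading 'A '."""
    without_suffix = re.sub(r",.*", "", raw)
    return re.sub(r"^A ", "", without_suffix)

def split_name(clean_name: str):
    """Return (first_name, last_name) as the first and last space-separated tokens.

    last_name is None for single-token names, matching the SQL CASE branches.
    """
    parts = clean_name.split(" ")
    first = parts[0] if parts else None
    last = parts[-1] if len(parts) > 1 else None
    return first, last

# Hypothetical V2Persons entries in "name,offset" form
print(clean_person("Jane Doe,1042"))                     # Jane Doe
print(split_name(clean_person("A John Q Public,87")))    # ('John', 'Public')
```

Note that middle tokens are discarded by this rule, so multi-part surnames end up with only their final token in `last_name`.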

### **Data Enrichment**
- **UUID Generation**: Creates unique identifiers for each person
- **Name Components**: Extracts first_name and last_name fields
- **Full Name**: Preserves complete name for display
- **Statistics**: Counts total mentions across all articles
- **Temporal Data**: Fills `first_seen_date` and `last_seen_date` with the load date (`CURRENT_DATE()`) as placeholders

### **Quality Assurance**
- **Duplicate Prevention**: Uses DISTINCT to avoid person duplicates
- **Null Filtering**: Excludes empty or null person entries
- **Validation**: Ensures clean_name is not empty after processing
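
Deduplication and mention counting can be pictured as a single pass over the V2Persons values. The sketch below is an in-memory illustration only, assuming each value is a semicolon-separated string of `name,offset` entries; `build_person_rows` and the sample data are hypothetical, and `uuid.uuid4()` stands in for BigQuery's `GENERATE_UUID()`.

```python
import re
import uuid
from collections import Counter

def clean_person(raw):
    # Same two regex rules as the SQL query
    return re.sub(r"^A ", "", re.sub(r",.*", "", raw))

def build_person_rows(v2persons_values):
    """Return one row per unique cleaned name with its total mention count."""
    counts = Counter()
    for value in v2persons_values:
        if not value:                 # skip NULL or empty V2Persons
            continue
        for raw in value.split(";"):
            name = clean_person(raw)
            if name:                  # skip names that are empty after cleaning
                counts[name] += 1     # every occurrence counts, as with COUNT(*)
    return [
        {"person_id": str(uuid.uuid4()), "name": name, "total_mentions": n}
        for name, n in counts.items()
    ]

# Hypothetical sample: three records with data, two without
sample = ["Jane Doe,10;John Roe,55;Jane Doe,200", "", None, "A Jane Doe,7"]
rows = build_person_rows(sample)
mentions = {row["name"]: row["total_mentions"] for row in rows}
print(mentions)   # {'Jane Doe': 3, 'John Roe': 1}
```

Repeated mentions of the same person inside one record are each counted, which is also how the query behaves after `UNNEST`.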

📊 **Expected Output**: ~195,000 unique persons with cleaned names and mention statistics.

In [None]:
# Step 1: Populate Person Table

def populate_person_table():
    """
    Populate the person table with unique persons from GDELT data.
    Removes duplicates and properly parses first/last names.
    """
    print("👥 Populating Person Table...")
    print(f"📊 Project: {GCP_PROJECT_ID}")
    print(f"🗄️  Dataset: {BIGQUERY_DATASET}")
    print("-" * 50)
    
    try:
        # Create BigQuery client
        client = bigquery.Client(project=GCP_PROJECT_ID)
        print("✅ BigQuery client created")
        
        person_query = f"""
        INSERT INTO `{GCP_PROJECT_ID}.{BIGQUERY_DATASET}.person` 
        (person_id, name, first_name, last_name, full_name, first_seen_date, last_seen_date, total_mentions)
        WITH CleanedPersons AS (
          SELECT DISTINCT
            REGEXP_REPLACE(REGEXP_REPLACE(person, r',.*', ''), r'^A ', '') as clean_name,
            CASE 
              WHEN ARRAY_LENGTH(SPLIT(REGEXP_REPLACE(REGEXP_REPLACE(person, r',.*', ''), r'^A ', ''), ' ')) > 0 
              THEN SPLIT(REGEXP_REPLACE(REGEXP_REPLACE(person, r',.*', ''), r'^A ', ''), ' ')[OFFSET(0)]
              ELSE NULL
            END as first_name,
            CASE 
              WHEN ARRAY_LENGTH(SPLIT(REGEXP_REPLACE(REGEXP_REPLACE(person, r',.*', ''), r'^A ', ''), ' ')) > 1 
              THEN SPLIT(REGEXP_REPLACE(REGEXP_REPLACE(person, r',.*', ''), r'^A ', ''), ' ')[OFFSET(ARRAY_LENGTH(SPLIT(REGEXP_REPLACE(REGEXP_REPLACE(person, r',.*', ''), r'^A ', ''), ' ')) - 1)]
              ELSE NULL
            END as last_name,
            REGEXP_REPLACE(REGEXP_REPLACE(person, r',.*', ''), r'^A ', '') as full_name
          FROM `{GCP_PROJECT_ID}.{BIGQUERY_DATASET}.gkg_partitioned`,
          UNNEST(SPLIT(V2Persons, ';')) AS person
          WHERE V2Persons IS NOT NULL AND V2Persons != '' AND REGEXP_REPLACE(REGEXP_REPLACE(person, r',.*', ''), r'^A ', '') != ''
        ),
        PersonCounts AS (
          SELECT 
            REGEXP_REPLACE(REGEXP_REPLACE(person, r',.*', ''), r'^A ', '') as clean_name,
            COUNT(*) as total_mentions
          FROM `{GCP_PROJECT_ID}.{BIGQUERY_DATASET}.gkg_partitioned`,
          UNNEST(SPLIT(V2Persons, ';')) AS person
          WHERE V2Persons IS NOT NULL AND V2Persons != '' AND REGEXP_REPLACE(REGEXP_REPLACE(person, r',.*', ''), r'^A ', '') != ''
          GROUP BY REGEXP_REPLACE(REGEXP_REPLACE(person, r',.*', ''), r'^A ', '')
        )
        SELECT 
          GENERATE_UUID() as person_id,
          cp.clean_name as name,
          cp.first_name,
          cp.last_name,
          cp.full_name,
          CURRENT_DATE() as first_seen_date,
          CURRENT_DATE() as last_seen_date,
          pc.total_mentions
        FROM CleanedPersons cp
        JOIN PersonCounts pc ON cp.clean_name = pc.clean_name
        """
        
        query_job = client.query(person_query, location="US-CENTRAL1")
        result = query_job.result()
        print("✅ Person table populated successfully")
        
        # Verify the data
        verification_query = f"SELECT COUNT(*) as count FROM `{GCP_PROJECT_ID}.{BIGQUERY_DATASET}.person`"
        query_job = client.query(verification_query, location="US-CENTRAL1")
        result = query_job.result()
        for row in result:
            print(f"📊 Total persons: {row.count:,}")
        
        return True
        
    except Exception as e:
        print(f"❌ Error populating person table: {e}")
        return False

# Execute person table population
person_success = populate_person_table()


## 📰 Article Metadata Population

This cell populates the article table with metadata from GDELT Global Knowledge Graph records:

### **Article Data Extraction**
- **Unique IDs**: Generates UUID for each article record
- **GKG Record ID**: Links to original GDELT GKG record
- **URL Mapping**: Uses DocumentIdentifier as article URL
- **Source Information**: Extracts source collection identifiers

### **Tone Analysis**
- **Tone Score Parsing**: Extracts numeric tone values from V2Tone field
- **Sentiment Indication**: Tone scores indicate article sentiment
  - Positive values: Positive tone
  - Negative values: Negative tone
  - Values near zero: Neutral tone
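
The tone extraction amounts to taking the first comma-separated value and converting it to a float, returning nothing when the conversion fails. A minimal Python sketch of that behavior follows; `parse_tone_score` is a hypothetical helper and the sample V2Tone string is made up for illustration.

```python
from typing import Optional

def parse_tone_score(v2tone: Optional[str]) -> Optional[float]:
    """Return the first comma-separated value as a float, or None if unusable.

    Roughly mirrors SAFE_CAST(SPLIT(V2Tone, ',')[OFFSET(0)] AS FLOAT64).
    """
    if v2tone is None:
        return None
    first = v2tone.split(",")[0]
    try:
        return float(first)
    except ValueError:
        return None

print(parse_tone_score("-2.5,3.1,5.6,8.7,21.3,0.4,312"))   # -2.5
print(parse_tone_score(""))                                # None
```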

### **Data Fields**
- **article_id**: Unique identifier (UUID)
- **gkg_record_id**: Original GDELT record reference
- **url**: Article web address
- **publish_date**: Publication date (set to current date)
- **source_name**: Source collection identifier cast to STRING (a collection code, not a publisher name)
- **tone_score**: Numerical sentiment score

### **Limitations & Notes**
- Uses current date as placeholder for publish_date
- `title` is filled with the raw V2Tone string and `language` with DocumentIdentifier, both as placeholders
- Tone score is first value from comma-separated V2Tone field
- Designed for demonstration purposes with simplified mapping
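
The co-occurrence query in this notebook derives a real date from the GKG `DATE` column with `PARSE_DATE('%Y%m%d', SUBSTR(CAST(DATE AS STRING), 1, 8))`. The same conversion could replace the `publish_date` placeholder. Below is a small Python sketch of that conversion, assuming `DATE` is an integer in `YYYYMMDDHHMMSS` form; the helper name and sample value are hypothetical.

```python
from datetime import date, datetime

def gkg_date_to_publish_date(gkg_date: int) -> date:
    """Keep the first 8 digits (YYYYMMDD) and parse them as a calendar date."""
    return datetime.strptime(str(gkg_date)[:8], "%Y%m%d").date()

print(gkg_date_to_publish_date(20250115123000))   # 2025-01-15
```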

📊 **Expected Output**: ~382,000 article records with basic metadata and tone scores.

In [None]:
# Step 2: Populate Article Table

def populate_article_table():
    """
    Populate the article table with article metadata from GDELT data.
    """
    print("📰 Populating Article Table...")
    print(f"📊 Project: {GCP_PROJECT_ID}")
    print(f"🗄️  Dataset: {BIGQUERY_DATASET}")
    print("-" * 50)
    
    try:
        # Create BigQuery client
        client = bigquery.Client(project=GCP_PROJECT_ID)
        print("✅ BigQuery client created")
        
        article_query = f"""
        INSERT INTO `{GCP_PROJECT_ID}.{BIGQUERY_DATASET}.article` 
        (article_id, gkg_record_id, url, title, publish_date, source_name, language, tone_score)
        SELECT 
          GENERATE_UUID() as article_id,
          GKGRECORDID as gkg_record_id,
          DocumentIdentifier as url,
          V2Tone as title,  -- Using V2Tone as placeholder for title
          CURRENT_DATE() as publish_date,
          CAST(SourceCollectionIdentifier AS STRING) as source_name,
          DocumentIdentifier as language,  -- Using DocumentIdentifier as placeholder
          SAFE_CAST(SPLIT(V2Tone, ',')[OFFSET(0)] AS FLOAT64) as tone_score
        FROM `{GCP_PROJECT_ID}.{BIGQUERY_DATASET}.gkg_partitioned`
        WHERE GKGRECORDID IS NOT NULL
        """
        
        query_job = client.query(article_query, location="US-CENTRAL1")
        result = query_job.result()
        print("✅ Article table populated successfully")
        
        # Verify the data
        verification_query = f"SELECT COUNT(*) as count FROM `{GCP_PROJECT_ID}.{BIGQUERY_DATASET}.article`"
        query_job = client.query(verification_query, location="US-CENTRAL1")
        result = query_job.result()
        for row in result:
            print(f"📊 Total articles: {row.count:,}")
        
        return True
        
    except Exception as e:
        print(f"❌ Error populating article table: {e}")
        return False

# Execute article table population
article_success = populate_article_table()


## 🔗 Person Relationship Population

This cell creates the person co-occurrence relationship table by analyzing which people appear together in articles:

### **Relationship Detection Process**
1. **Person Pair Generation**: Creates all possible person combinations within each article
2. **Name Standardization**: Applies same cleaning rules as person table
3. **Duplicate Prevention**: Uses `a < b` condition to avoid duplicate pairs
4. **Foreign Key Matching**: Links to person table using cleaned names
5. **Aggregation**: Counts co-occurrence frequencies across all articles
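
The steps above can be sketched in plain Python on a handful of records. This is an illustration under stated assumptions, not the notebook's implementation: `cooccurrence_counts` and the sample records are hypothetical, names stand in for the person IDs that the query obtains by joining to the person table, and the `a < b` comparison is applied to the raw entries as in the SQL.

```python
import re
from collections import Counter
from itertools import product

def clean_person(raw):
    # Same two regex rules as the person table
    return re.sub(r"^A ", "", re.sub(r",.*", "", raw))

def cooccurrence_counts(records):
    """records: iterable of (record_id, v2persons) tuples.

    Returns a Counter mapping (name1, name2) to the number of records
    in which both names appear.
    """
    counts = Counter()
    for _record_id, v2persons in records:
        entries = v2persons.split(";")
        pairs = set()                                  # DISTINCT within one record
        for a, b in product(entries, repeat=2):        # mirrors the double UNNEST
            if a < b:                                  # drop mirrored and self pairs
                n1, n2 = clean_person(a), clean_person(b)
                if n1 and n2 and n1 != n2:
                    pairs.add((n1, n2))
        counts.update(pairs)
    return counts

# Hypothetical records in "name,offset;name,offset" form
records = [
    ("r1", "Jane Doe,10;John Roe,55;Jane Doe,200"),
    ("r2", "John Roe,3;Jane Doe,9"),
    ("r3", "Jane Doe,4"),
]
counts = cooccurrence_counts(records)
print(counts[("Jane Doe", "John Roe")])   # 2
```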

### **Relationship Metrics**
- **Cooccurrence Count**: Total number of shared article mentions
- **Temporal Tracking**: First and last co-occurrence dates
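- **Article References**: Array of GKG record IDs (`GKGRECORDID`) for articles that mention both people
- **Themes**: Up to 20 of the most frequent V2Themes codes across the pair's shared articles
- **Relationship Strength**: Implicit in co-occurrence frequency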
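
The query in the cell below also fills a `themes` column: its `PersonThemes` CTE counts theme codes per pair and keeps the 20 most frequent. A rough Python equivalent for a single pair is sketched here; `top_themes` is a hypothetical helper, the theme codes are illustrative, and each V2Themes value is assumed to be a semicolon-separated list of `THEME,offset` entries.

```python
import re
from collections import Counter

def top_themes(v2themes_values, limit=20):
    """Return the most frequent theme codes across the given V2Themes strings."""
    counts = Counter()
    for value in v2themes_values:
        for raw in value.split(";"):
            if raw:                                     # skip empty entries
                counts[re.sub(r",.*", "", raw)] += 1    # drop the offset suffix
    return [theme for theme, _ in counts.most_common(limit)]

# Hypothetical V2Themes values from three articles shared by one pair
shared = ["TAX_FNCACT,12;ELECTION,40;", "ELECTION,5;LEADER,9", "ELECTION,77"]
result = top_themes(shared, limit=2)
print(result[0])   # ELECTION
```

Themes with equal counts have no guaranteed order in the SQL version, so the tail of the list may differ between runs.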

### **Data Quality Features**
- **Self-Pair Exclusion**: Prevents person from being paired with themselves
- **Empty Name Filtering**: Excludes invalid or empty person names
- **Referential Integrity**: Ensures both people exist in person table
- **Unique Relationships**: Each person pair appears only once

### **Performance Considerations**
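- **Cross Join**: The double UNNEST of V2Persons produces every ordered pair of entries per article before filtering, so cost grows quadratically with the number of people mentioned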
- **Aggregation**: Groups by person IDs for relationship counting
- **Array Collection**: Gathers all article IDs for each relationship
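
To get a feel for the cost, the arithmetic can be worked out per article: unnesting the same list of n entries twice yields n² rows, of which n(n-1)/2 survive the `a < b` filter when all entries are distinct strings. The helper names below are hypothetical.

```python
from math import comb

def cross_join_rows(n_entries: int) -> int:
    """Rows produced by unnesting the same n-entry list twice."""
    return n_entries * n_entries

def candidate_pairs(n_entries: int) -> int:
    """Rows left after a < b, assuming all n entries are distinct strings."""
    return comb(n_entries, 2)

for n in (5, 50, 500):
    print(n, cross_join_rows(n), candidate_pairs(n))
# 5 25 10
# 50 2500 1225
# 500 250000 124750
```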

### **Schema Integration**
- Links to person table via foreign keys
- Supports graph traversal queries
- Enables network analysis and visualization
- Foundation for influence and community detection

📊 **Expected Output**: ~1.2 million person-to-person relationships with co-occurrence statistics.

In [31]:
# Step 3: Populate Person Co-occurrence Table

def populate_person_cooccurrence_table():
    """
    Populate the person_cooccurrence table with relationships between people
    who appear together in the same articles.
    """
    print("🔗 Populating Person Co-occurrence Table...")
    print(f"📊 Project: {GCP_PROJECT_ID}")
    print(f"🗄️  Dataset: {BIGQUERY_DATASET}")
    print("-" * 50)
    
    try:
        # Create BigQuery client
        client = bigquery.Client(project=GCP_PROJECT_ID)
        print("✅ BigQuery client created")
        
        cooccurrence_query = f"""
        INSERT INTO `{GCP_PROJECT_ID}.{BIGQUERY_DATASET}.person_cooccurrence` 
        (relationship_id, person1_id, person2_id, cooccurrence_count, first_cooccurrence_date, last_cooccurrence_date, article_ids, themes)
        WITH PersonPairs AS (
          SELECT DISTINCT
            g.GKGRECORDID,
            g.V2Themes,
            REGEXP_REPLACE(REGEXP_REPLACE(a, r',.*', ''), r'^A ', '') AS name1,
            REGEXP_REPLACE(REGEXP_REPLACE(b, r',.*', ''), r'^A ', '') AS name2,
            PARSE_DATE('%Y%m%d', SUBSTR(CAST(DATE AS STRING), 1, 8)) as cooccurrence_date
          FROM `{GCP_PROJECT_ID}.{BIGQUERY_DATASET}.gkg_partitioned` g,
          UNNEST(SPLIT(V2Persons, ';')) AS a,
          UNNEST(SPLIT(V2Persons, ';')) AS b
          WHERE a < b  -- Avoid duplicates and self-pairs
            AND REGEXP_REPLACE(REGEXP_REPLACE(a, r',.*', ''), r'^A ', '') != ''
            AND REGEXP_REPLACE(REGEXP_REPLACE(b, r',.*', ''), r'^A ', '') != ''
        ),
        PersonCooccurrence AS (
          SELECT 
            p1.person_id as person1_id,
            p2.person_id as person2_id,
            COUNT(*) as cooccurrence_count,
            MIN(cooccurrence_date) as first_cooccurrence_date,
            MAX(cooccurrence_date) as last_cooccurrence_date,
            ARRAY_AGG(DISTINCT GKGRECORDID) as article_ids
          FROM PersonPairs pp
          JOIN `{GCP_PROJECT_ID}.{BIGQUERY_DATASET}.person` p1 ON pp.name1 = p1.name
          JOIN `{GCP_PROJECT_ID}.{BIGQUERY_DATASET}.person` p2 ON pp.name2 = p2.name
          GROUP BY p1.person_id, p2.person_id
        ),
        PersonThemes AS (
          SELECT 
            p1.person_id as person1_id,
            p2.person_id as person2_id,
            ARRAY_AGG(
              theme
              ORDER BY theme_count DESC
              LIMIT 20
            ) as themes
          FROM (
            SELECT 
              pp.name1,
              pp.name2,
              REGEXP_REPLACE(theme, r',.*', '') as theme,
              COUNT(*) as theme_count
            FROM PersonPairs pp,
            UNNEST(SPLIT(V2Themes, ';')) AS theme
            WHERE theme IS NOT NULL AND theme != ''
            GROUP BY pp.name1, pp.name2, REGEXP_REPLACE(theme, r',.*', '')
          ) theme_counts
          JOIN `{GCP_PROJECT_ID}.{BIGQUERY_DATASET}.person` p1 ON theme_counts.name1 = p1.name
          JOIN `{GCP_PROJECT_ID}.{BIGQUERY_DATASET}.person` p2 ON theme_counts.name2 = p2.name
          GROUP BY p1.person_id, p2.person_id
        )
        SELECT 
          GENERATE_UUID() as relationship_id,
          pc.person1_id,
          pc.person2_id,
          pc.cooccurrence_count,
          pc.first_cooccurrence_date,
          pc.last_cooccurrence_date,
          pc.article_ids,
          COALESCE(pt.themes, []) as themes
        FROM PersonCooccurrence pc
        LEFT JOIN PersonThemes pt ON pc.person1_id = pt.person1_id AND pc.person2_id = pt.person2_id
        WHERE pc.person1_id != pc.person2_id  -- Ensure person1 ≠ person2
        """
        
        query_job = client.query(cooccurrence_query, location="US-CENTRAL1")
        result = query_job.result()
        print("✅ Person cooccurrence table populated successfully")
        
        # Verify the data
        verification_query = f"SELECT COUNT(*) as count FROM `{GCP_PROJECT_ID}.{BIGQUERY_DATASET}.person_cooccurrence`"
        query_job = client.query(verification_query, location="US-CENTRAL1")
        result = query_job.result()
        for row in result:
            print(f"📊 Total co-occurrence relationships: {row.count:,}")
        
        return True
        
    except Exception as e:
        print(f"❌ Error populating person_cooccurrence table: {e}")
        return False

# Execute person cooccurrence table population
cooccurrence_success = populate_person_cooccurrence_table()


🔗 Populating Person Co-occurrence Table...
📊 Project: graph-demo-471710
🗄️  Dataset: gdelt
--------------------------------------------------
✅ BigQuery client created
✅ Person cooccurrence table populated successfully
📊 Total co-occurrence relationships: 1,173,279


In [None]:
# Create network visualization from co-occurrence results
import matplotlib.pyplot as plt
import networkx as nx

def create_person_network_graph(df, search_person="Rayner", title="Person Co-occurrence Network"):
    """
    Create a network graph from the co-occurrence DataFrame
    """
    print("🕸️  Creating network graph from co-occurrence data...")
    
    if df is None or len(df) == 0:
        print("❌ No data available to create network graph")
        return None
    
    # Create the graph
    g = nx.Graph()
    
    # Add edges with weights based on co-occurrence count
    for _, row in df.iterrows():
        name1 = row['name1']
        name2 = row['name2']
        weight = row['pair_count']
            
        # Add edge with weight (scaled down for visualization)
        g.add_edge(name1, name2, weight=weight/10)
    
    print(f"✅ Graph created with {g.number_of_nodes()} nodes and {g.number_of_edges()} edges")
    
    if not search_person:
        search_person = "All"
    # Create the visualization
    plt.figure(figsize=(30, 20))
    plt.title(f'GDELT Project: {title}\nPerson Co-occurrence Network for "{search_person}"', 
              y=0.97, fontsize=20, fontweight='bold')
    
    # Draw the network
    pos = nx.spring_layout(g, k=3, iterations=50)  # Layout algorithm
    nx.draw(g, pos, 
            with_labels=True, 
            node_color='lightblue',
            node_size=500,
            font_size=8,
            font_weight='bold',
            edge_color='gray',
            alpha=0.7)
    
    # Add edge labels for weights (only for top edges to avoid clutter)
    edge_labels = {}
    for (u, v, d) in g.edges(data=True):
        if d['weight'] > 5:  # weight is pair_count / 10, so this keeps pairs with more than 50 co-occurrences
            edge_labels[(u, v)] = f"{d['weight']:.1f}"
    
    nx.draw_networkx_edge_labels(g, pos, edge_labels, font_size=6)
    
    plt.tight_layout()
    plt.show()
    
    # Print some network statistics
    print(f"\n📊 Network Statistics:")
    print(f"   Nodes: {g.number_of_nodes()}")
    print(f"   Edges: {g.number_of_edges()}")
    print(f"   Average degree: {sum(dict(g.degree()).values()) / g.number_of_nodes():.2f}")
    
    # Find the most connected nodes
    degree_centrality = nx.degree_centrality(g)
    top_nodes = sorted(degree_centrality.items(), key=lambda x: x[1], reverse=True)[:5]
    print(f"\n🏆 Most Connected Nodes:")
    for i, (node, centrality) in enumerate(top_nodes, 1):
        print(f"   {i}. {node}: {centrality:.3f}")
    
    return g

# Create the network graph from the co-occurrence results
network_graph = create_person_network_graph(cooccurrence_results, search_person)
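
The "Most Connected Nodes" ranking relies on degree centrality, which for each node is its number of neighbors divided by n - 1, where n is the number of nodes. A dependency-free sketch of that calculation is shown below for intuition; the function mirrors the idea behind `nx.degree_centrality` for a simple undirected graph without self-loops, and the edge list is made up.

```python
from collections import defaultdict

def degree_centrality(edges):
    """Return {node: degree / (n - 1)} for an undirected edge list."""
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    n = len(neighbors)
    if n <= 1:
        return {node: 1.0 for node in neighbors}
    return {node: len(adj) / (n - 1) for node, adj in neighbors.items()}

# Hypothetical four-person network
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]
centrality = degree_centrality(edges)
top = sorted(centrality.items(), key=lambda item: item[1], reverse=True)
print(top[0])   # ('A', 1.0)
```

Edge weights play no part in this measure, so a pair that co-occurs once counts the same as a pair that co-occurs a hundred times.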
