# GDELT Demo Data Prep


This notebook demonstrates working with GDELT (Global Database of Events, Language, and Tone) data for graph analysis.


## 🔧 Configuration Setup

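This cell defines the core configuration variables for the GDELT data analysis project. The settings used across the project are listed below; the cell itself sets only the project ID, region, dataset, and table list.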

### **Project Settings**
- **GCP_PROJECT_ID**: Your Google Cloud Platform project identifier
- **PROJECT_REGION**: Target region for BigQuery operations (us-central1)
- **BIGQUERY_DATASET**: Name of the dataset in your project where the copied GDELT data is stored

### **GDELT Source Settings**  
- **GDELT_PROJECT_ID**: Public GDELT BigQuery project (gdelt-bq)
- **GDELT_DATASET**: Public GDELT dataset (gdeltv2)
- **GDELT_REGION**: Source region for GDELT data (US)

### **Data Tables**
- **BIGQUERY_TABLES**: List of GDELT tables to copy for analysis
  - `gkg_partitioned`: Global Knowledge Graph data
  - `events_partitioned`: Event data
  - `eventmentions_partitioned`: Event mentions data

### **Storage**
- **GCS_BUCKET**: Google Cloud Storage bucket for data exports

⚠️ **Important**: Update `GCP_PROJECT_ID` with your actual project ID before running.

In [5]:
# Configuration variables
GCP_PROJECT_ID = "graph-demo-471710"  # Replace with your actual GCP project ID
PROJECT_REGION = "us-central1"
BIGQUERY_DATASET = "gdelt"  # Replace with your actual BigQuery dataset name
BIGQUERY_TABLES = ["gkg_partitioned", "events_partitioned", "eventmentions_partitioned"]  # List of tables to copy


# Derived variables are generated later for each table; print the loaded settings
print("Configuration loaded:")
print(f"  BigQuery Dataset: {BIGQUERY_DATASET}")
print(f"  BigQuery Tables: {BIGQUERY_TABLES}")


Configuration loaded:
  BigQuery Dataset: gdelt
  BigQuery Tables: ['gkg_partitioned', 'events_partitioned', 'eventmentions_partitioned']
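
The configuration cell sets only the project, region, dataset, and table list. As a minimal sketch, the GDELT source settings named in the description could be combined with them to form fully qualified table IDs. The values of `GDELT_PROJECT_ID` and `GDELT_DATASET` come from the description above; the helper `table_ids` is hypothetical and not part of the notebook, and `GDELT_REGION` and `GCS_BUCKET` are left out because the sketch does not need them.

```python
# Sketch only: source settings taken from the description above.
GDELT_PROJECT_ID = "gdelt-bq"   # public GDELT BigQuery project
GDELT_DATASET = "gdeltv2"       # public GDELT dataset

# Values repeated from the configuration cell so the sketch is self-contained.
GCP_PROJECT_ID = "graph-demo-471710"
BIGQUERY_DATASET = "gdelt"
BIGQUERY_TABLES = ["gkg_partitioned", "events_partitioned", "eventmentions_partitioned"]


def table_ids(table):
    """Return (source, destination) table IDs for one table (hypothetical helper)."""
    source = f"{GDELT_PROJECT_ID}.{GDELT_DATASET}.{table}"
    destination = f"{GCP_PROJECT_ID}.{BIGQUERY_DATASET}.{table}"
    return source, destination


for table in BIGQUERY_TABLES:
    source, destination = table_ids(table)
    print(f"{source} -> {destination}")
```

Because both project IDs contain hyphens, these IDs need backticks when they are embedded in a SQL statement.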


## 📚 Library Imports

This cell imports all necessary Python libraries for the GDELT data analysis workflow:

### **Google Cloud Services**
- `google.cloud.bigquery`: For querying and managing BigQuery data
- `google.cloud.storage`: For Google Cloud Storage operations
- `google.auth`: For GCP authentication handling

### **Data Processing**
- `pandas`: For data manipulation and analysis
- `json`: For JSON data handling
- `datetime`: For date/time operations

### **Network Analysis & Visualization**
- `networkx`: For creating and analyzing graph networks
- `matplotlib.pyplot`: For creating visualizations and plots

### **System & File Operations**
- `os`, `pathlib.Path`: For file system operations
- `subprocess`: For running system commands
- `shutil`: For file operations

The cell prints a confirmation message once every import has succeeded; a missing package raises `ImportError` before the message appears.
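
To see which packages are missing without stopping at the first failed import, a check along these lines may help. It is a sketch, not part of the notebook: the helper `missing_modules` and the `REQUIRED_MODULES` list are illustrative names, and the list simply mirrors the third-party libraries described above.

```python
import importlib.util

# Third-party modules described above (illustrative list).
REQUIRED_MODULES = [
    "pandas",
    "networkx",
    "matplotlib",
    "google.auth",
    "google.cloud.bigquery",
    "google.cloud.storage",
]


def missing_modules(names):
    """Return the names that cannot be located in the current environment."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except ImportError:
            # Raised when a parent package (for example "google") is absent.
            found = False
        if not found:
            missing.append(name)
    return missing


print("Missing modules:", missing_modules(REQUIRED_MODULES))
```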

In [6]:
# Import required libraries
import json
import os
import shutil
import subprocess
from datetime import datetime
from pathlib import Path

import matplotlib.pyplot as plt
import networkx as nx
import pandas as pd
from google.auth import default
from google.auth.exceptions import DefaultCredentialsError
from google.cloud import bigquery
from google.cloud import storage

print("✅ All libraries imported successfully!")
print("   - BigQuery and Cloud Storage clients ready")
print("   - NetworkX and Matplotlib ready for visualization")
print("   - Pandas ready for data processing")


✅ All libraries imported successfully!
   - BigQuery and Cloud Storage clients ready
   - NetworkX and Matplotlib ready for visualization
   - Pandas ready for data processing


## 🔐 GCP Authentication Setup

This cell defines `setup_gcp_authentication()`, which reuses existing Application Default Credentials when they match the target project and otherwise re-authenticates through the gcloud CLI:

### **Authentication Process**
1. **Credential Check**: Verifies existing Google Cloud credentials
2. **Project Validation**: Ensures credentials match the target project
3. **Credential Reset**: Clears old credentials if project mismatch detected
4. **Project Configuration**: Sets the correct GCP project using gcloud CLI
5. **Re-authentication**: Initiates browser-based OAuth flow if needed
6. **Quota Project**: Sets quota project to avoid billing warnings
7. **Verification**: Confirms successful authentication

### **Error Handling**
- Handles missing credentials gracefully
- Provides manual fallback instructions
- Manages project mismatches automatically
- Shows detailed error messages and troubleshooting tips

### **Environment Setup**
- Sets `GOOGLE_CLOUD_PROJECT` environment variable
- Configures application default credentials
- Prepares credentials for BigQuery and Cloud Storage clients

⚡ **Note**: This function may open a browser window for OAuth authentication.
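
The project configuration, re-authentication, and quota steps all shell out to the gcloud CLI. A small pre-check such as the following can make a missing installation obvious before those steps run. This is a sketch under that assumption; `gcloud_available` is a hypothetical helper, not part of the notebook.

```python
import shutil


def gcloud_available(executable="gcloud"):
    """Return True if the given executable can be found on PATH."""
    return shutil.which(executable) is not None


if not gcloud_available():
    print("gcloud CLI not found on PATH; install the Google Cloud SDK or authenticate manually.")
```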


In [7]:
# GCP Authentication Setup


def setup_gcp_authentication():
    """Complete GCP authentication setup with error handling"""
    print("üîê Setting up GCP Authentication...")
    
    try:
        # Step 1: Try to use existing credentials first
        print("üîç Checking for existing credentials...")
        try:
            credentials, default_project = default()
            print(f"‚úÖ Found existing credentials for project: {default_project}")
            
            # If the project matches, we're good
            if default_project == GCP_PROJECT_ID:
                print(f"üéØ Project matches target project: {GCP_PROJECT_ID}")
                os.environ['GOOGLE_CLOUD_PROJECT'] = GCP_PROJECT_ID
                return credentials, GCP_PROJECT_ID
            else:
                print(f"‚ö†Ô∏è  Project mismatch: {default_project} vs {GCP_PROJECT_ID}")
                print("üîÑ Will re-authenticate with correct project...")
        except DefaultCredentialsError:
            print("‚ùå No existing credentials found")
            print("üîÑ Will authenticate from scratch...")
        
        # Step 2: Clear old credentials if needed
        print("üóëÔ∏è  Clearing old credentials...")
        adc_path = os.path.expanduser("~/.config/gcloud/application_default_credentials.json")
        if os.path.exists(adc_path):
            os.remove(adc_path)
            print("‚úÖ Removed old application default credentials")
        
        # Step 3: Set the correct project
        print(f"üéØ Setting gcloud project to: {GCP_PROJECT_ID}")
        result = subprocess.run(['gcloud', 'config', 'set', 'project', GCP_PROJECT_ID], 
                              capture_output=True, text=True, check=True)
        print("‚úÖ Project set successfully")
        
        # Step 4: Re-authenticate
        print("üîÑ Re-authenticating with application default credentials...")
        print("   This will open a browser window for authentication...")
        
        result = subprocess.run(['gcloud', 'auth', 'application-default', 'login'], 
                              check=True)
        print("‚úÖ Re-authentication successful")
        
        # Step 5: Set quota project to avoid warnings
        print("üí∞ Setting quota project...")
        try:
            subprocess.run(['gcloud', 'auth', 'application-default', 'set-quota-project', GCP_PROJECT_ID], 
                          capture_output=True, text=True, check=True)
            print("‚úÖ Quota project set successfully")
        except:
            print("‚ö†Ô∏è  Could not set quota project (this is usually fine)")
        
        # Step 6: Verify the setup
        print("üß™ Verifying authentication...")
        credentials, project = default()
        print(f"‚úÖ Authentication successful - Project: {project}")
        
        # Set environment variable
        os.environ['GOOGLE_CLOUD_PROJECT'] = GCP_PROJECT_ID
        print(f"üåç Set GOOGLE_CLOUD_PROJECT environment variable to: {GCP_PROJECT_ID}")
        
        return credentials, GCP_PROJECT_ID
        
    except subprocess.CalledProcessError as e:
        print(f"❌ Command failed: {e}")
        print("💡 Manual steps required:")
        print(f"   1. gcloud config set project {GCP_PROJECT_ID}")
        print("   2. gcloud auth application-default login")
        print(f"   3. gcloud auth application-default set-quota-project {GCP_PROJECT_ID}")
        return None, None
    except Exception as e:
        print(f"❌ Error: {e}")
        return None, None

# Run authentication setup
credentials, authenticated_project = setup_gcp_authentication()


🔐 Setting up GCP Authentication...
🔍 Checking for existing credentials...
✅ Found existing credentials for project: graph-demo-471710
🎯 Project matches target project: graph-demo-471710


## 🕸️ Create Property Graph

This cell executes the DDL to create a Property Graph in BigQuery.

### **Graph Structure**
#### **Nodes**
- **Person**: Derived from `person` table
- **Organization**: Derived from `organization` table
- **Location**: Derived from `location` table
- **Event**: Derived from `event` table
- **Article**: Derived from `article` table

#### **Edges**
- **CO_OCCURS_WITH**: Person to Person (from `person_cooccurrence`)
- **AFFILIATED_WITH**: Person to Organization (from `person_organization`)
- **LOCATED_AT**: Person to Location (from `person_location`)

### **Logic**
1. Initializes the BigQuery client.
2. Defines the `CREATE OR REPLACE PROPERTY GRAPH` SQL statement.
3. Executes the query to build the complete graph schema.


In [None]:
# Create Graph
from google.api_core.exceptions import GoogleAPICallError

# Initialize BigQuery client
client = bigquery.Client(project=GCP_PROJECT_ID, credentials=credentials)

# Define the CREATE PROPERTY GRAPH query
create_graph_query = f"""
CREATE OR REPLACE PROPERTY GRAPH {BIGQUERY_DATASET}.GdeltGraph
  NODE TABLES (
    {BIGQUERY_DATASET}.person
      KEY (person_id)
      LABEL Person
      PROPERTIES (person_id, name, first_name, last_name, full_name, total_mentions, first_seen_date, last_seen_date),
    {BIGQUERY_DATASET}.organization
      KEY (org_id)
      LABEL Organization
      PROPERTIES (org_id, name, org_type, country_code, total_mentions, first_seen_date, last_seen_date),
    {BIGQUERY_DATASET}.location
      KEY (location_id)
      LABEL Location
      PROPERTIES (location_id, name, location_type, country_code, latitude, longitude, total_mentions),
    {BIGQUERY_DATASET}.event
      KEY (event_id)
      LABEL Event
      PROPERTIES (event_id, event_code, event_description, event_category, total_mentions),
    {BIGQUERY_DATASET}.article
      KEY (article_id)
      LABEL Article
      PROPERTIES (article_id, url, title, publish_date, source_name, tone_score)
  )
  EDGE TABLES (
    {BIGQUERY_DATASET}.person_cooccurrence
      KEY (relationship_id)
      SOURCE KEY (person1_id) REFERENCES Person (person_id)
      DESTINATION KEY (person2_id) REFERENCES Person (person_id)
      LABEL CO_OCCURS_WITH
      PROPERTIES (relationship_id, cooccurrence_count, strength_score, first_cooccurrence_date, last_cooccurrence_date, article_ids, themes, countries, states, cities, themes_summary, created_at, updated_at),
    {BIGQUERY_DATASET}.person_organization
      KEY (relationship_id)
      SOURCE KEY (person_id) REFERENCES Person (person_id)
      DESTINATION KEY (org_id) REFERENCES Organization (org_id)
      LABEL AFFILIATED_WITH
      PROPERTIES (relationship_id, relationship_type, mention_count, first_mention_date, last_mention_date),
    {BIGQUERY_DATASET}.person_location
      KEY (relationship_id)
      SOURCE KEY (person_id) REFERENCES Person (person_id)
      DESTINATION KEY (location_id) REFERENCES Location (location_id)
      LABEL LOCATED_AT
      PROPERTIES (relationship_id, relationship_type, mention_count, first_mention_date, last_mention_date)
  );
"""

# Execute the query
print(f"Creating Property Graph '{BIGQUERY_DATASET}.GdeltGraph'...")
try:
    job = client.query(create_graph_query)
    job.result()  # Wait for the job to complete
    print(f"✅ Property Graph '{BIGQUERY_DATASET}.GdeltGraph' created successfully.")
except GoogleAPICallError as e:
    print(f"❌ Error creating graph: {e}")
except Exception as e:
    print(f"❌ Unexpected error: {e}")


Creating Property Graph 'gdelt.GdeltGraph'...
✅ Property Graph 'gdelt.GdeltGraph' created successfully.
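
Once the graph exists, a quick way to sanity-check it is a short graph query over one of the edge labels. The sketch below builds a GQL statement for the strongest `CO_OCCURS_WITH` relationships. It assumes BigQuery's `GRAPH ... MATCH ... RETURN` query syntax and uses the labels and properties from the DDL above; `build_cooccurrence_query` is a hypothetical helper, and the query has not been run against this dataset.

```python
def build_cooccurrence_query(dataset, limit=10):
    """Build a GQL query listing the most frequent person co-occurrences (hypothetical helper)."""
    return f"""
GRAPH {dataset}.GdeltGraph
MATCH (a:Person)-[c:CO_OCCURS_WITH]->(b:Person)
RETURN a.name AS person_a, b.name AS person_b, c.cooccurrence_count AS cooccurrence_count
ORDER BY cooccurrence_count DESC
LIMIT {int(limit)}
"""


query = build_cooccurrence_query("gdelt", limit=5)
print(query)

# To run it with the BigQuery client created in the cell above:
# for row in client.query(query).result():
#     print(row.person_a, row.person_b, row.cooccurrence_count)
```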
