# Base Graph Creation from GeoPackage

This notebook demonstrates the end-to-end process of creating a basic maritime navigation graph using S-57 data stored in a local GeoPackage file.

The workflow covers:
1.  Defining an area of interest (AOI) between two ports.
2.  Filtering Electronic Navigational Charts (ENCs) that cover the AOI.
3.  Generating a navigable sea grid from the ENC data.
4.  Constructing a `networkx` graph from the grid.
5.  Performing a basic pathfinding operation on the resulting graph.

---

## Required Data

This notebook requires:
1. **ENC Data**: S-57 charts converted to GeoPackage format
2. **Data File**:
   - File location: `output/enc_west.gpkg`, the path used by the setup cell below (or your custom path)
   - Required layers: `seaare`, `lndare`, `fairwy`, `drgare`, `tsslpt`, `prcare`
3. **Port Data**: Standard or custom port definitions (included with package)

**Setup Instructions:**
See `docs/SETUP.md` for instructions on converting S-57 charts to GeoPackage format.

**Troubleshooting:**
If you encounter issues, see `docs/TROUBLESHOOTING.md` for common problems and solutions.

In [None]:
import sys
import os
import time
from pathlib import Path
from dotenv import load_dotenv
import plotly.io as pio
import pandas as pd
import plotly.express as px

# --- Setup Python Environment ---
# Add the project root to the Python path so the `src.` package imports below resolve
project_root = Path.cwd().parent.parent
if str(project_root) not in sys.path:
    sys.path.append(str(project_root))

# Load environment variables from .env file at the project root
# This loads configuration settings (optional for GeoPackage backend)
load_dotenv(project_root / ".env")
pio.renderers.default = "notebook_connected"

# Import maritime module components
from src.nautical_graph_toolkit.core.s57_data import ENCDataFactory, S57AdvancedConfig
from src.nautical_graph_toolkit.utils.port_utils import Boundaries, PortData
from src.nautical_graph_toolkit.utils.plot_utils import PlotlyChart

# --- Define Output Directory ---
# Create output directory for saving results
output_dir = Path.cwd() / 'output'
output_dir.mkdir(exist_ok=True)

# --- Define Data Source (GeoPackage File) ---
# For file-based backends, we use a Path object instead of connection dict
# Note: Can be .gpkg or .sqlite - ENCDataFactory auto-detects the format
data_file = Path.cwd() / "output" / "enc_west.gpkg"

print(f"Output directory: {output_dir}")
print(f"Data source: {data_file.name}")

# --- Performance Tracking ---
# Dictionary to store timing metrics for each pipeline step
performance_metrics = {}

## 1. Define Area of Interest (AOI)

This first step defines the geographic scope for our graph. We select two ports and create an expanded bounding box around them to ensure all relevant navigational data is included.
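
To make the idea of an expanded bounding box concrete, here is a minimal sketch using only the standard library. It is not the implementation behind `Boundaries.create_geo_boundary`; the function name `expanded_bbox` and the port coordinates are assumptions made for illustration, and the sketch ignores date-line crossing and polar regions.

```python
import math

def expanded_bbox(points, expansion_nm):
    """Return (min_lon, min_lat, max_lon, max_lat) padded by expansion_nm.

    points: iterable of (lon, lat) pairs in decimal degrees.
    One nautical mile is one minute of latitude (1/60 degree). A degree of
    longitude shrinks with cos(latitude), so the longitude padding is larger.
    """
    lons = [p[0] for p in points]
    lats = [p[1] for p in points]
    pad_lat = expansion_nm / 60.0
    # Scale by the latitude farthest from the equator so the padding
    # is never smaller than requested anywhere in the box.
    max_abs_lat = max(abs(min(lats) - pad_lat), abs(max(lats) + pad_lat))
    pad_lon = expansion_nm / (60.0 * math.cos(math.radians(max_abs_lat)))
    return (min(lons) - pad_lon, min(lats) - pad_lat,
            max(lons) + pad_lon, max(lats) + pad_lat)

# Approximate port coordinates, for illustration only.
los_angeles = (-118.26, 33.73)
san_francisco = (-122.42, 37.81)
bbox = expanded_bbox([los_angeles, san_francisco], expansion_nm=24)
print(bbox)
```

With a 24 NM expansion, the latitude padding is 0.4 degrees, and the longitude padding is somewhat larger because meridians converge away from the equator.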


In [None]:
# --- Define Area of Interest by Selecting Two Ports ---
# Get port data and create a bounding box between Los Angeles and San Francisco
# The expansion parameter adds a buffer around the ports to include surrounding navigable areas
start_time = time.perf_counter()

port = PortData()
bbox = Boundaries()
port1 = port.get_port_by_name('Los Angeles')
port2 = port.get_port_by_name('San Francisco')

# --- Validate Port Selection ---
# Ensure both ports were found in the database before proceeding
if port1.empty or port2.empty:
    raise ValueError("Could not find one or both ports. Please check the names.")
else:
    print(port.format_port_string(port1))
    print(port.format_port_string(port2))
    # Create expanded boundary (24 nautical miles) around the two ports
    # date_line=True handles cases where routes cross the International Date Line
    port_bbox = bbox.create_geo_boundary(geometries=[port1.geometry, port2.geometry],
                                         expansion=24,
                                         date_line=True)

end_time = time.perf_counter()
performance_metrics['Port Selection & Boundary'] = end_time - start_time
print(f"\nPort selection and boundary creation took: {end_time - start_time:.2f}s")
port_bbox

## 2. Visualize the Area of Interest

Here, we plot the selected ports and the calculated boundary on a map to visually confirm our area of interest.

In [None]:
# --- Visualize Ports on Interactive Map ---
# Create a Plotly map and add both ports as markers
# This helps verify port locations before proceeding with graph creation
ply = PlotlyChart()
ply_fig = ply.create_base_map(mapbox_token=os.getenv('MAPBOX_TOKEN'))
ply.plotly_base_config(ply_fig)
port1_df = port.get_port_details_df(port1)
port2_df = port.get_port_details_df(port2)
# Add departure port (Los Angeles) in blue
ply.add_single_port_trace(ply_fig, port1, name=port1['PORT_NAME'], color='blue')
# Add arrival port (San Francisco) in red
ply.add_single_port_trace(ply_fig, port2, name=port2['PORT_NAME'], color='red')
ply_fig.show()

In [None]:
# --- Add Boundary to Map Visualization ---
# Display the expanded boundary box on the map to show our area of interest
ply.add_boundary_trace(ply_fig, port_bbox)
ply_fig.show()

## 3. ENC Data Preparation

With the AOI defined, we now query the GeoPackage file to find all Electronic Navigational Charts (ENCs) that intersect our boundary. This ensures we only process relevant chart data, which is critical for performance.
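
The filtering idea can be sketched with plain bounding boxes. This is a simplified stand-in, not how `get_encs_by_boundary` works internally (it receives a geometry and may test true polygon intersection); the chart names, extents, and helper functions below are hypothetical.

```python
def boxes_intersect(a, b):
    """True if two (min_lon, min_lat, max_lon, max_lat) boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def filter_charts(chart_extents, aoi):
    """Return the names of charts whose extent intersects the AOI box."""
    return [name for name, extent in chart_extents.items()
            if boxes_intersect(extent, aoi)]

# Hypothetical chart names and extents, invented for this example.
chart_extents = {
    "CHART_A": (-123.0, 37.0, -122.0, 38.0),  # overlaps the AOI
    "CHART_B": (-119.0, 33.0, -118.0, 34.0),  # overlaps the AOI
    "CHART_C": (-75.0, 38.0, -74.0, 39.0),    # far outside the AOI
}
aoi = (-122.9, 33.3, -117.8, 38.2)
selected = filter_charts(chart_extents, aoi)
print(selected)
```

Only the charts that pass this cheap test need to be opened and queried, which is where the performance benefit comes from.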

In [None]:
# --- Initialize ENC Data Factory for GeoPackage Backend ---
# The factory provides a unified interface for accessing ENC data
# regardless of backend (PostGIS/GeoPackage/SpatiaLite)
start_time = time.perf_counter()

gpkg_factory = ENCDataFactory(source=data_file)

# --- Filter ENCs by Boundary ---
# Step 1: Get the list of ENC names that intersect with our area of interest
# This ensures we only process relevant chart data, critical for performance
enc_names_in_boundary = gpkg_factory.get_encs_by_boundary(port_bbox.geometry.iloc[0])

# Step 2: Get the bounding box GeoDataFrame for only those filtered ENCs
# This provides geographic extents for visualization
enc_bbox_gdf = gpkg_factory.get_enc_bounding_boxes(enc_names_in_boundary)

# --- Visualize ENC Coverage on Map ---
# Step 3: Add the ENC boundaries to our map to verify coverage
# Different usage bands (1-6) represent different chart scales/detail levels
ply.add_enc_bbox_trace(figure=ply_fig, bbox_df=enc_bbox_gdf, usage_bands=[1,2,3,4,5,6])

end_time = time.perf_counter()
performance_metrics['ENC Filtering'] = end_time - start_time
print(f"ENC filtering took: {end_time - start_time:.2f}s")
ply_fig.show()

## 4. Graph Generation and Pathfinding

#### 4.1 Create Navigable Grid
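
The comments in the following cell describe grid creation as three operations: union the navigable layers, subtract the obstacle layers, and shrink the result by a safety distance. The sketch below mimics those operations on sets of integer grid cells rather than polygons. It is a raster analogy under our own assumptions (the function `build_navigable_cells` and the toy layout are invented), not the code behind `create_base_grid`.

```python
def build_navigable_cells(sea, extra, obstacles, shrink_cells=0):
    """Combine cell sets as (sea | extra) - obstacles, then erode the result.

    Each argument is a set of (col, row) integer cells. Erosion keeps a cell
    only if every cell within `shrink_cells` steps of it is also navigable.
    """
    combined = (sea | extra) - obstacles
    if shrink_cells <= 0:
        return combined
    reach = range(-shrink_cells, shrink_cells + 1)
    return {
        (col, row) for (col, row) in combined
        if all((col + dc, row + dr) in combined for dc in reach for dr in reach)
    }

# A 7x7 block of sea with a single land cell in the middle.
sea = {(col, row) for col in range(7) for row in range(7)}
extra = {(7, 3)}        # e.g. a fairway cell extending past the sea area
obstacles = {(3, 3)}    # land

combined = build_navigable_cells(sea, extra, obstacles)
shrunk = build_navigable_cells(sea, extra, obstacles, shrink_cells=1)
print(len(combined), len(shrunk))
```

Erosion pulls the navigable area back from both the outer edge and the land cell, which is the same effect `reduce_distance_nm` aims for on real geometry.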

In [None]:
# --- Import Graph Creation Module ---
from src.nautical_graph_toolkit.core.graph import BaseGraph

# --- Initialize BaseGraph for GeoPackage Backend ---
# BaseGraph provides the core graph creation functionality and works
# with any data backend (PostGIS, GeoPackage, SpatiaLite).
# It handles:
#   - Querying S-57 layers from the data source
#   - Creating navigable grids from chart data
#   - Building NetworkX graphs with proper connectivity
#   - Saving graphs to various formats
start_time = time.perf_counter()

gpkg_bg = BaseGraph(data_factory=ENCDataFactory(source=data_file),
                  graph_schema_name="graph")

# --- Create Navigable Grid ---
# This step queries the S-57 'seaare' (Sea Area) layer and other navigable
# layers to create a single polygon representing all navigable water within the AOI.
#
# The create_base_grid method:
# 1. Queries 'seaare' layer for primary navigable areas
# 2. Adds supplementary navigable layers (fairways, channels, traffic lanes)
# 3. Subtracts obstacle layers (land, constructions)
# 4. Optionally reduces the navigable area by reduce_distance_nm to maintain
#    safe distance from hazards (3 NM in this example for coastal routing)
#
# Parameters:
#   - port_boundary: Geographic boundary defining the area of interest
#   - departure_port/arrival_port: Used to ensure connectivity near ports
#   - layer_table: Primary navigable layer (typically "seaare")
#   - reduce_distance_nm: Safety buffer to shrink navigable area (3 = 3 NM buffer)
grid = gpkg_bg.create_base_grid(port_boundary=port_bbox,
                              departure_port=port1,
                              arrival_port=port2,
                              layer_table="seaare",
                              reduce_distance_nm=3)

end_time = time.perf_counter()
performance_metrics['Grid Creation'] = end_time - start_time
print(f"Grid creation took: {end_time - start_time:.2f}s")
print(f"Grid components: {len(grid)}")
print(f"Grid type: {type(grid)}")
print(f"Grid keys: {list(grid.keys())}")


#### 4.2 Visualize Grid Components

In [None]:
# --- Visualize Grid Components ---
# We plot the different components of the generated grid to understand coverage:
# - main_grid (red): Primary sea area polygons from 'seaare' layer
# - extra_grids (green): Additional navigable areas (fairways, channels, etc.)
# - combined_grid (blue): Final merged navigable polygon used for graph creation
ply_grid = ply.create_base_map(mapbox_token=os.getenv('MAPBOX_TOKEN'))
ply.plotly_base_config(ply_grid)
ply.add_grid_trace(ply_grid, grid_geojson=grid["main_grid"], color="red")
ply.add_grid_trace(ply_grid, grid_geojson=grid["extra_grids"], color="green")
ply.add_grid_trace(ply_grid, grid_geojson=grid["combined_grid"], color="blue")
ply_grid.show()

#### 4.3 Construct Graph from Grid
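
As a rough illustration of the steps listed in the next cell (grid nodes, 8-connectivity, edge lengths, largest connected component), here is a standard-library sketch. It assumes a flat, square grid and uses plain dictionaries instead of `networkx`; `build_grid_graph` and `largest_component` are hypothetical helpers, not part of `BaseGraph`.

```python
import math
from collections import deque

NEIGHBOR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
                    (0, 1), (1, -1), (1, 0), (1, 1)]

def build_grid_graph(cells, spacing_nm):
    """Return {node: {neighbor: length_nm}} using 8-connectivity.

    Straight moves cost spacing_nm; diagonal moves cost spacing_nm * sqrt(2).
    """
    graph = {cell: {} for cell in cells}
    for (col, row) in cells:
        for dc, dr in NEIGHBOR_OFFSETS:
            neighbor = (col + dc, row + dr)
            if neighbor in graph:
                graph[(col, row)][neighbor] = spacing_nm * math.hypot(dc, dr)
    return graph

def largest_component(graph):
    """Return the subgraph induced by the largest connected component."""
    seen, best = set(), set()
    for start in graph:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        while queue:  # breadth-first search from an unvisited node
            node = queue.popleft()
            for neighbor in graph[node]:
                if neighbor not in component:
                    component.add(neighbor)
                    queue.append(neighbor)
        seen |= component
        if len(component) > len(best):
            best = component
    return {n: {m: w for m, w in graph[n].items() if m in best} for n in best}

# A 3x3 block of cells plus two isolated cells far away.
cells = {(col, row) for col in range(3) for row in range(3)} | {(10, 10), (10, 11)}
graph = build_grid_graph(cells, spacing_nm=0.3)
main = largest_component(graph)
print(len(graph), len(main))
```

On a regular grid the node count grows with the inverse square of the spacing, so moving from 0.3 NM to 0.1 NM multiplies it by 9. That matches the ~40K versus ~360K figures quoted in the cell below.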

In [None]:
# --- Construct Graph from Grid ---
# This is the core graph creation step. It populates the navigable grid polygon 
# with a dense network of nodes and edges.
#
# The create_base_graph method:
# 1. Generates a regular grid of nodes within the navigable polygon
#    - Node spacing: 0.3 nautical miles (configurable for performance vs precision)
#    - Finer spacing = more nodes = better route precision but longer computation
# 2. Creates edges between adjacent nodes (8-connectivity: N, S, E, W, NE, NW, SE, SW)
# 3. Calculates edge lengths in nautical miles for distance-based routing
# 4. Optionally filters to keep only the largest connected component
#    - Removes isolated node clusters that would cause pathfinding failures
#    - Ensures start and end nodes are always in the same connected network
#
# Performance considerations:
#   - 0.3 NM spacing: ~40K nodes for LA-SF route (fast, suitable for ocean routing)
#   - 0.1 NM spacing: ~360K nodes (slower, better for detailed coastal/harbor routing)
#   - keep_largest_component=True: Recommended to prevent routing errors

start_time = time.perf_counter()

G = gpkg_bg.create_base_graph(grid["combined_grid"], 
                              spacing_nm=0.3,
                              keep_largest_component=True)

end_time = time.perf_counter()
performance_metrics['Graph Creation'] = end_time - start_time
print(f"Graph creation took: {end_time - start_time:.2f}s")
print(f"Graph has {G.number_of_nodes():,} nodes and {G.number_of_edges():,} edges")


#### 4.4 Save Graph to File

In [None]:
# --- Save Graph to GeoPackage File ---
# Save the graph in GeoPackage format for portability and offline use.
#
# GeoPackage Advantages:
# ----------------------
# 1. Single-file format: Easy to share, archive, and version control
# 2. No server required: Works offline without database setup
# 3. Cross-platform: Opens in QGIS, ArcGIS, and other GIS software
# 4. Open standard: OGC-compliant format with wide tool support
# 5. Self-contained: All data in one file, no external dependencies
#
# The saved GeoPackage contains two layers:
#   - nodes: Point geometries with attributes (node_id, lon, lat)
#   - edges: LineString geometries with attributes (source, target, length)
#
# Use cases:
#   - Visualization in QGIS for quality assurance
#   - Sharing graphs with collaborators (no database access needed)
#   - Archiving graph snapshots for reproducibility
#   - Loading into other analysis tools (R, Python, QGIS processing)
#   - Working offline in the field or remote locations

start_time = time.perf_counter()

output_file = output_dir / "base_graph_GPKG.gpkg"
gpkg_bg.save_graph_to_gpkg(G, output_file)

end_time = time.perf_counter()
performance_metrics['Save to GPKG'] = end_time - start_time
print(f"Saving to GeoPackage took: {end_time - start_time:.2f}s")
print(f"Saved to: {output_file}")


#### 4.5 Perform Base Routing
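
For readers unfamiliar with A*, the sketch below shows the algorithm on a small grid using only the standard library. It is not the `Route` implementation; `astar`, `grid_graph`, and `euclid` are names we made up for this example, and the grid is a toy layout rather than chart data.

```python
import heapq
import math

def astar(graph, start, goal, heuristic):
    """Return (path, cost) for the cheapest path, or (None, inf) if unreachable.

    graph: {node: {neighbor: edge_cost}}. The result is optimal when the
    heuristic never overestimates the remaining cost (admissible).
    """
    open_heap = [(heuristic(start, goal), 0.0, start)]
    best_cost = {start: 0.0}
    came_from = {}
    while open_heap:
        _, cost, node = heapq.heappop(open_heap)
        if node == goal:
            path = [node]
            while node in came_from:  # walk back to the start
                node = came_from[node]
                path.append(node)
            return path[::-1], cost
        if cost > best_cost.get(node, math.inf):
            continue  # stale heap entry; a cheaper route was found later
        for neighbor, weight in graph[node].items():
            new_cost = cost + weight
            if new_cost < best_cost.get(neighbor, math.inf):
                best_cost[neighbor] = new_cost
                came_from[neighbor] = node
                estimate = new_cost + heuristic(neighbor, goal)
                heapq.heappush(open_heap, (estimate, new_cost, neighbor))
    return None, math.inf

def grid_graph(width, height, blocked):
    """8-connected grid graph with unit spacing, skipping blocked cells."""
    cells = {(c, r) for c in range(width) for r in range(height)} - blocked
    graph = {cell: {} for cell in cells}
    for (c, r) in cells:
        for dc in (-1, 0, 1):
            for dr in (-1, 0, 1):
                neighbor = (c + dc, r + dr)
                if (dc or dr) and neighbor in graph:
                    graph[(c, r)][neighbor] = math.hypot(dc, dr)
    return graph

def euclid(a, b):
    """Straight-line distance between two grid cells."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

# 5x5 grid with a wall in column 2 that leaves a gap only at row 4.
blocked = {(2, 0), (2, 1), (2, 2), (2, 3)}
graph = grid_graph(5, 5, blocked)
path, cost = astar(graph, (0, 0), (4, 0), euclid)
print(path, round(cost, 3))
```

The straight-line heuristic is admissible here because edge costs are themselves Euclidean lengths in the same units. If the heuristic and the edge weights used different units, that guarantee would need to be rechecked.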

In [None]:
# --- Perform Base Routing (Shortest Path Calculation) ---
# With the graph created, compute the optimal maritime route between ports
# using the A* pathfinding algorithm.
#
# A* Algorithm:
# -------------
# A* is an informed search algorithm that finds the shortest path efficiently by:
# 1. Using actual path cost (distance traveled so far)
# 2. Adding heuristic estimate (straight-line distance to goal)
# 3. Always exploring most promising nodes first
# 4. Guaranteeing optimal solution (with admissible heuristic)
#
# The Route class handles:
# ------------------------
# 1. Coordinate mapping: Maps port lat/lon to nearest graph nodes using spatial index
# 2. Validation: Ensures start and end nodes exist and are in same connected component
# 3. Pathfinding: Runs A* with Euclidean distance heuristic
# 4. Geometry creation: Converts node sequence to LineString route geometry
# 5. Distance calculation: Computes total route distance in nautical miles
#
# Current implementation uses base distance weighting (edges weighted by length only).
# Future enhancements can incorporate:
#   - Weather routing (wind, currents, waves)
#   - Traffic separation scheme compliance
#   - Depth restrictions (avoid shallow areas)
#   - Vessel-specific constraints (draft, turning radius)

from src.nautical_graph_toolkit.core.pathfinding_lite import Route

start_time = time.perf_counter()

route = Route(graph=G, data_manager=gpkg_factory.manager)
route_geometry, distance = route.base_route(
    departure_point=port1.geometry,
    arrival_point=port2.geometry
)

end_time = time.perf_counter()
performance_metrics['Pathfinding'] = end_time - start_time
print(f"Pathfinding took: {end_time - start_time:.2f}s")
print(f"Route distance: {distance:.2f} nautical miles")


#### 4.6 Visualize and Save Route

In [None]:
# --- Visualize Computed Route on Map ---
# Display the calculated route as a line on the map along with both ports.
# This provides a visual verification that the routing algorithm produced
# a sensible maritime path between the two locations.
ply_route = ply.create_base_map(mapbox_token=os.getenv('MAPBOX_TOKEN'))
ply.plotly_base_config(ply_route)
# Add the route line
ply.add_route_trace(figure=ply_route,
                    line=route_geometry,
                    name="Base Route")
# Add departure port marker
ply.add_single_port_trace(ply_route, port1, name=port1['PORT_NAME'], color='blue')
# Add arrival port marker
ply.add_single_port_trace(ply_route, port2, name=port2['PORT_NAME'], color='red')
ply_route.show()

In [None]:
# --- Save Route to GeoPackage File ---
# Store the computed route geometry for future reference and analysis.
#
# Route Storage Benefits:
# -----------------------
# 1. Persistence: Routes remain available across sessions
# 2. Comparison: Compare different routing strategies or parameters
# 3. Analysis: Analyze route characteristics (distance, waypoints, segments)
# 4. Visualization: Load routes in GIS tools for presentation
# 5. Integration: Use routes in other workflows (fuel estimation, ETA calculation)
# 6. Portability: Share routes with collaborators via single file
#
# Routes are saved in a dedicated GeoPackage file (maritime_routes.gpkg) with metadata:
#   - route_name: Identifier for retrieval
#   - geometry: LineString route path
#   - distance: Total distance in nautical miles
#   - timestamp: When route was computed
#
# The overwrite parameter controls behavior when a route with the same name exists:
#   - True: Replace existing route with new computation
#   - False: Raise error if route name already exists

gpkg_factory.save_route(route_geom=route_geometry,
                      route_name="base_route_GPKG",
                      table_name="base_route_table",
                      overwrite=True)
print("Route saved successfully to GeoPackage")
print(f"Location: {output_dir / 'maritime_routes.gpkg'}")
print("Table: base_route_table | Route name: base_route_GPKG")


## 5. Performance Summary

This section visualizes the time taken for each step of the pipeline.
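
The benchmark export further down also normalizes each timing to seconds per 100K nodes so that runs with different graph sizes can be compared. A minimal sketch of that arithmetic follows; the helper name `per_100k_nodes` and the timing values are made up for illustration and are not measured results.

```python
def per_100k_nodes(timings, node_count):
    """Scale each timing (seconds) to seconds per 100,000 graph nodes."""
    if node_count <= 0:
        return {}
    return {step: seconds / node_count * 100_000
            for step, seconds in timings.items()}

# Made-up timings, not measurements.
timings = {"Grid Creation": 12.0, "Graph Creation": 8.0}
normalized = per_100k_nodes(timings, node_count=40_000)
print(normalized)
```

A step that takes 12 seconds on a 40,000-node graph works out to roughly 30 seconds per 100K nodes.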

In [None]:
# --- Visualize Pipeline Performance Metrics ---
# Create an interactive bar chart showing time taken for each pipeline step.
# This helps identify bottlenecks and optimize the workflow for larger areas.
if performance_metrics:
    # Convert the dictionary to a pandas DataFrame for easy plotting
    perf_df = pd.DataFrame(list(performance_metrics.items()), columns=['Step', 'Time (seconds)'])
    perf_df = perf_df.sort_values(by='Time (seconds)', ascending=False)

    # Create an interactive bar chart with time values displayed
    fig = px.bar(
         perf_df,
         x='Step',
         y='Time (seconds)',
         title='Base Graph Creation Pipeline Performance (GeoPackage)',
         text_auto='.2f',
         labels={'Step': 'Pipeline Step', 'Time (seconds)': 'Time Taken (seconds)'}
    )
    fig.update_traces(textposition='outside')
    fig.show()
else:
    print("No performance metrics were recorded. Run the notebook cells to generate the summary.")

# --- Export Performance Benchmarks to CSV ---
# Save detailed performance metrics to CSV for long-term tracking and analysis.
# This allows comparison across different runs, configurations, and backends.
if performance_metrics:
    from datetime import datetime
    
    # Get graph statistics
    node_count = G.number_of_nodes()
    edge_count = G.number_of_edges()
    
    # Calculate normalized metrics (per 100K nodes)
    time_per_100k_nodes = {}
    if node_count > 0:
        for step, time_val in performance_metrics.items():
            time_per_100k_nodes[step] = (time_val / node_count) * 100000
    
    # Build benchmark record with metadata and metrics
    benchmark_record = {
        'timestamp': datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
        'workflow': 'graph_GeoPackage_v2',
        'data_source': 'GeoPackage',
        'db_schema': '',  # Not applicable for file-based backends
        'node_count': node_count,
        'edge_count': edge_count,
        'spacing_nm': 0.3,
        'reduce_distance_nm': 3,
        'aoi': 'Los Angeles - San Francisco',
        # Individual timing metrics
        'port_selection_boundary_sec': performance_metrics.get('Port Selection & Boundary', 0),
        'enc_filtering_sec': performance_metrics.get('ENC Filtering', 0),
        'grid_creation_sec': performance_metrics.get('Grid Creation', 0),
        'graph_creation_sec': performance_metrics.get('Graph Creation', 0),
        'save_gpkg_sec': performance_metrics.get('Save to GPKG', 0),
        'save_postgis_sec': 0,  # Not applicable for GeoPackage backend
        'pathfinding_sec': performance_metrics.get('Pathfinding', 0),
        # Normalized metrics (time per 100K nodes)
        'grid_creation_per_100k_nodes': time_per_100k_nodes.get('Grid Creation', 0),
        'graph_creation_per_100k_nodes': time_per_100k_nodes.get('Graph Creation', 0),
        'save_gpkg_per_100k_nodes': time_per_100k_nodes.get('Save to GPKG', 0),
        'save_postgis_per_100k_nodes': 0,  # Not applicable
        'pathfinding_per_100k_nodes': time_per_100k_nodes.get('Pathfinding', 0),
        # Total pipeline time
        'total_pipeline_sec': sum(performance_metrics.values()),
    }
    
    # Convert to DataFrame
    benchmark_df = pd.DataFrame([benchmark_record])
    
    # UNIFIED CSV: Same file across all base graph notebooks for cross-backend comparison
    benchmark_csv = output_dir / 'benchmark_graph_base.csv'
    
    # Append to existing CSV or create new one
    if benchmark_csv.exists():
        existing_df = pd.read_csv(benchmark_csv)
        combined_df = pd.concat([existing_df, benchmark_df], ignore_index=True)
        combined_df.to_csv(benchmark_csv, index=False)
        print(f"\nAppended benchmark to existing file: {benchmark_csv}")
        print(f"Total benchmark records: {len(combined_df)}")
    else:
        benchmark_df.to_csv(benchmark_csv, index=False)
        print(f"\nCreated new benchmark file: {benchmark_csv}")
    
    # Display the current benchmark record
    print("\n=== Current Benchmark Record ===")
    print(f"Timestamp: {benchmark_record['timestamp']}")
    print(f"Workflow: {benchmark_record['workflow']}")
    print(f"Data Source: {benchmark_record['data_source']}")
    print(f"Nodes: {benchmark_record['node_count']:,}")
    print(f"Edges: {benchmark_record['edge_count']:,}")
    print(f"Total Pipeline Time: {benchmark_record['total_pipeline_sec']:.2f}s")
    print("\nMost demanding operations:")
    
    # Show top 3 time-consuming operations
    top_operations = sorted(
        performance_metrics.items(),
        key=lambda x: x[1],
        reverse=True
    )[:3]
    for i, (op, time_val) in enumerate(top_operations, 1):
        print(f"  {i}. {op}: {time_val:.2f}s")
else:
    print("No performance metrics to export.")
