# Day 1, Session 3: Python for Geospatial Data

## CoPhil 4-Day Advanced Training on AI/ML for Earth Observation

**EU-Philippines Copernicus Capacity Support Programme**

---

## Learning Objectives

By the end of this session, you will be able to:

1. **Set up** a Python geospatial environment in Google Colab
2. **Load, inspect, and visualize** vector data using **GeoPandas**
3. **Load, inspect, and visualize** raster data using **Rasterio**
4. **Perform** basic geospatial operations (filtering, clipping, cropping)
5. **Calculate** vegetation indices (NDVI, NDWI) from Sentinel-2 imagery
6. **Combine** vector and raster data for integrated analysis
7. **Apply** these skills to Philippine EO applications (DRR, CCA, NRM)

---

## Why This Session Matters

**Python geospatial skills are the foundation of ALL AI/ML workflows in Earth Observation.**

You cannot:
- Train a model without loading training data ✗
- Preprocess satellite images without raster operations ✗
- Validate results without vector boundaries ✗
- Deploy solutions without understanding data formats ✗

**This session gives you the superpowers to:**
- Handle Sentinel-2 imagery like a pro ✓
- Work with Philippine administrative boundaries ✓
- Prepare analysis-ready datasets ✓
- Build production-ready EO applications ✓

---

## Prerequisites

- Basic Python knowledge (variables, loops, functions)
- Google account for Colab access
- Completion of Sessions 1-2 (Copernicus overview, AI/ML concepts)

---

## Session Structure

**Part 1:** Environment Setup (10 min)
**Part 2:** Python Basics Recap (10 min)
**Part 3:** GeoPandas for Vector Data (40 min)
**Part 4:** Rasterio for Raster Data (50 min)
**Part 5:** Combined Operations (30 min)

**Total:** ~2 hours 20 minutes with exercises

---

## Part 1: Environment Setup

### 1.1 Mount Google Drive

We'll use Google Drive to:
- Access sample datasets
- Save outputs and results
- Share data between sessions

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Create working directory
import os
work_dir = '/content/drive/MyDrive/CoPhil_Training'
os.makedirs(work_dir, exist_ok=True)
os.makedirs(f'{work_dir}/outputs', exist_ok=True)

print("✓ Google Drive mounted successfully!")
print(f"✓ Working directory: {work_dir}")

### 1.2 Install Required Packages

**Core geospatial libraries:**
- **`geopandas`** - Vector data (shapefiles, GeoJSON)
- **`rasterio`** - Raster data (GeoTIFF, satellite imagery)
- **`shapely`** - Geometric operations
- **`pyproj`** - Coordinate reference systems
- **`pystac-client`** - Search satellite imagery catalogs (STAC API)
- **`planetary-computer`** - Access Microsoft Planetary Computer data

**Installation time:** 1-2 minutes

In [None]:
# Install geospatial libraries (suppress output for cleaner notebook)
!pip install geopandas rasterio shapely pyproj matplotlib contextily pystac-client planetary-computer -q

print("✓ All packages installed successfully!")

### 1.3 Import Libraries and Verify Installation

In [None]:
# Core scientific libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.patches import Patch

# Geospatial libraries
import geopandas as gpd
import rasterio
from rasterio.plot import show
from rasterio.mask import mask
from rasterio.warp import calculate_default_transform, reproject, Resampling
from shapely.geometry import Point, Polygon, box
import warnings

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')

# Set visualization defaults for professional-looking plots
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['figure.dpi'] = 100
plt.rcParams['font.size'] = 10
plt.rcParams['axes.labelsize'] = 11
plt.rcParams['axes.titlesize'] = 13
plt.rcParams['xtick.labelsize'] = 9
plt.rcParams['ytick.labelsize'] = 9
plt.rcParams['legend.fontsize'] = 10

# Print versions
print("✓ All libraries imported successfully!\n")
print("Library Versions:")
print(f"  • NumPy: {np.__version__}")
print(f"  • Pandas: {pd.__version__}")
print(f"  • GeoPandas: {gpd.__version__}")
print(f"  • Rasterio: {rasterio.__version__}")
print(f"  • Matplotlib: {plt.matplotlib.__version__}")
print("\n" + "="*60)

---

## Part 2: Python Basics Quick Recap

Before diving into geospatial operations, let's review Python fundamentals you'll encounter throughout this notebook.

**If you're comfortable with Python, feel free to skim this section.**

### 2.1 Data Types and Structures

In [None]:
# Strings - text data
province_name = "Palawan"
region = "MIMAROPA"

# Numbers - integers and floats
population = 1200000  # integer
area_km2 = 14649.73   # float (decimal)

# Lists - ordered collections (can be modified)
philippine_islands = ["Luzon", "Visayas", "Mindanao"]
band_numbers = [2, 3, 4, 8]  # Sentinel-2 bands

# Dictionaries - key-value pairs
province_data = {
    "name": "Palawan",
    "capital": "Puerto Princesa",
    "population": 1200000,
    "area_km2": 14649.73,
    "coordinates": [118.73, 9.85]
}

# Accessing data
print(f"Province: {province_name}")
print(f"First island: {philippine_islands[0]}")
print(f"Capital: {province_data['capital']}")
print(f"Population density: {population / area_km2:.1f} people/km²")

### 2.2 Control Structures - Loops and Conditionals

In [None]:
# For loops - iterate over collections
print("Philippine Island Groups:")
for island in philippine_islands:
    print(f"  • {island}")

# If-elif-else - conditional execution
ndvi_value = 0.65

if ndvi_value < 0:
    vegetation_class = "Water/Bare soil"
elif ndvi_value < 0.2:
    vegetation_class = "Sparse vegetation"
elif ndvi_value < 0.5:
    vegetation_class = "Moderate vegetation"
else:
    vegetation_class = "Dense vegetation"

print(f"\nNDVI = {ndvi_value} → {vegetation_class}")

# List comprehension - compact way to create lists
band_names = [f"Band_{b}" for b in band_numbers]
print(f"\nBand names: {band_names}")

### 2.3 Functions - Reusable Code Blocks

In [None]:
def calculate_ndvi(nir, red):
    """
    Calculate Normalized Difference Vegetation Index.
    
    NDVI = (NIR - Red) / (NIR + Red)
    
    Parameters:
    -----------
    nir : array-like
        Near-infrared band values
    red : array-like
        Red band values
    
    Returns:
    --------
    ndvi : array-like
        NDVI values (-1 to 1)
    """
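    # Convert to float arrays: avoids integer division and unsigned-integer wraparound
    nir = np.asarray(nir, dtype=float)
    red = np.asarray(red, dtype=float)
    
    # Calculate NDVI; pixels where NIR + Red == 0 are set to 0 instead of dividing by zero
    denominator = nir + red
    ndvi = np.divide(nir - red, denominator, out=np.zeros_like(denominator), where=denominator != 0)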
    
    return ndvi

# Test the function
nir_test = np.array([5000, 3000, 1000])
red_test = np.array([1500, 1200, 900])
result = calculate_ndvi(nir_test, red_test)

print("NDVI Calculation Test:")
for i in range(len(result)):
    print(f"  NIR={nir_test[i]}, Red={red_test[i]} → NDVI={result[i]:.3f}")
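
The float conversion inside `calculate_ndvi` matters more than it may seem. Satellite bands are commonly stored as unsigned 16-bit integers, and subtracting such arrays wraps around instead of producing a negative number. The sketch below illustrates this with made-up digital numbers (the array names are hypothetical, not part of the notebook):

```python
import numpy as np

# Hypothetical digital numbers stored as unsigned 16-bit integers,
# the dtype commonly used for Sentinel-2 band files.
nir_dn = np.array([1000, 5000], dtype=np.uint16)
red_dn = np.array([1500, 1500], dtype=np.uint16)

# Unsigned subtraction wraps around instead of going negative:
# the first element becomes 65036 rather than -500.
wrapped = nir_dn - red_dn

# Casting to float first gives the expected signed difference.
correct = nir_dn.astype(float) - red_dn.astype(float)

print("uint16 difference:", wrapped)
print("float difference: ", correct)
```

Any pixel where Red exceeds NIR (typically water) would get a nonsensical NDVI if the cast were skipped.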

---

## Part 3: GeoPandas for Vector Data

**GeoPandas** extends pandas for geospatial vector data (points, lines, polygons).

### Why GeoPandas?

- ✓ Read/write multiple formats (Shapefile, GeoJSON, KML, etc.)
- ✓ Spatial operations (intersection, buffer, union)
- ✓ Coordinate reference system (CRS) transformations
- ✓ Easy visualization
- ✓ Integration with pandas (filtering, grouping, etc.)

### 3.1 Loading Real Philippine Administrative Boundaries

We'll load actual Philippine province boundaries from **Natural Earth Data**, a public domain dataset maintained by cartographers worldwide.

**Data Source:** https://www.naturalearthdata.com/  
**License:** Public Domain  
**Coverage:** Global administrative boundaries

In [None]:
# Load Philippine provinces from Natural Earth Data
print("Loading Philippine administrative boundaries from Natural Earth...")
print("This may take a moment on first load (downloading ~2MB)...\n")

# Natural Earth Admin 1 (provinces/states) - 10m resolution
url = "https://naciscdn.org/naturalearth/10m/cultural/ne_10m_admin_1_states_provinces.zip"

# Read all provinces worldwide
world_provinces = gpd.read_file(url)

# Filter for Philippines only
philippines_gdf = world_provinces[world_provinces['admin'] == 'Philippines'].copy()

# Select and rename relevant columns for clarity
philippines_gdf = philippines_gdf[['name', 'region', 'geometry']].copy()
philippines_gdf.columns = ['Province', 'Region', 'geometry']

# Add approximate population data for major provinces (2020 estimates)
# Source: Philippine Statistics Authority (PSA)
pop_data = {
    'Metropolitan Manila': 13484462,
    'Cebu': 5155000,
    'Pangasinan': 3163000,
    'Bulacan': 3708000,
    'Cavite': 4344000,
    'Laguna': 3382000,
    'Rizal': 3330000,
    'Batangas': 2908000,
    'Pampanga': 2609000,
    'Negros Occidental': 2623000,
    'Palawan': 939594,
    'Davao del Sur': 2804000,
    'Iloilo': 2092000,
    'Cagayan': 1268000
}

# Map population data
philippines_gdf['Population'] = philippines_gdf['Province'].map(pop_data)

# Fill missing population with a rough placeholder value (not census data) so the examples run
avg_pop = 800000
philippines_gdf['Population'] = philippines_gdf['Population'].fillna(avg_pop)

# Calculate area in km² using a projected CRS (UTM Zone 51N; slightly less accurate for provinces far from this zone)
philippines_utm = philippines_gdf.to_crs('EPSG:32651')  # UTM Zone 51N
philippines_gdf['Area_km2'] = philippines_utm.geometry.area / 1e6

# Calculate population density
philippines_gdf['Density'] = philippines_gdf['Population'] / philippines_gdf['Area_km2']

# Add island group classification
def classify_island_group(province_name):
    """Classify provinces into island groups based on location"""
    luzon = ['Metropolitan Manila', 'Bulacan', 'Cavite', 'Laguna', 'Rizal', 'Batangas', 
             'Pampanga', 'Pangasinan', 'Cagayan', 'Nueva Ecija', 'Tarlac', 'Zambales',
             'Bataan', 'Aurora', 'Nueva Vizcaya', 'Quirino', 'Isabela', 'Ifugao',
             'Benguet', 'La Union', 'Ilocos Norte', 'Ilocos Sur', 'Abra', 'Kalinga',
             'Mountain Province', 'Apayao', 'Albay', 'Camarines Norte', 'Camarines Sur',
             'Catanduanes', 'Masbate', 'Sorsogon', 'Marinduque', 'Occidental Mindoro',
             'Oriental Mindoro', 'Palawan', 'Romblon', 'Quezon', 'Batanes']
    
    visayas = ['Cebu', 'Bohol', 'Negros Occidental', 'Negros Oriental', 'Iloilo', 
               'Aklan', 'Antique', 'Capiz', 'Guimaras', 'Leyte', 'Southern Leyte',
               'Samar', 'Eastern Samar', 'Northern Samar', 'Biliran', 'Siquijor']
    
    # Substring match, so e.g. 'Samar' also catches 'Western Samar';
    # anything not listed above falls through to Mindanao
    if any(luzon_prov in province_name for luzon_prov in luzon):
        return 'Luzon'
    elif any(visayas_prov in province_name for visayas_prov in visayas):
        return 'Visayas'
    else:
        return 'Mindanao'

philippines_gdf['Island_Group'] = philippines_gdf['Province'].apply(classify_island_group)

print("✓ Philippine provinces loaded successfully!")
print(f"  Total provinces: {len(philippines_gdf)}")
print(f"  CRS: {philippines_gdf.crs.name}")
print(f"  Island groups: {philippines_gdf['Island_Group'].unique()}")
print(f"\nSample provinces:")
display(philippines_gdf[['Province', 'Region', 'Island_Group', 'Population', 'Area_km2']].head(10))
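
The area calculation above projects everything to EPSG:32651 (UTM Zone 51N). As a rough illustration of where that code comes from, the sketch below derives a UTM EPSG code from a longitude and latitude using the standard 6-degree zone rule. The function name `utm_epsg_from_lonlat` is hypothetical, the coordinates are approximate, and the special-case zones around Norway and Svalbard are ignored.

```python
import math

def utm_epsg_from_lonlat(lon, lat):
    """Return the EPSG code of the WGS 84 / UTM zone containing (lon, lat)."""
    # UTM zones are 6 degrees wide, numbered from 1 starting at 180 degrees West
    zone = int(math.floor((lon + 180.0) / 6.0)) + 1
    # 326xx codes cover the northern hemisphere, 327xx the southern
    base = 32600 if lat >= 0 else 32700
    return base + zone

# Approximate coordinates, for illustration only
print("Manila: ", utm_epsg_from_lonlat(120.98, 14.60))   # 32651
print("Palawan:", utm_epsg_from_lonlat(118.73, 9.85))    # 32650
print("Davao:  ", utm_epsg_from_lonlat(125.60, 7.07))    # 32651
```

Palawan falls in Zone 50N, which is why a single-zone projection for the whole country gives approximate rather than exact areas.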

### 3.2 Inspecting the GeoDataFrame

In [None]:
# Display first few rows
print("First 3 provinces:")
display(philippines_gdf.head(3))

# Check data types
print("\nColumn data types:")
print(philippines_gdf.dtypes)

# Summary statistics
print("\nSummary statistics:")
display(philippines_gdf[['Population', 'Area_km2', 'Density']].describe())

In [None]:
# Coordinate Reference System (CRS) information
print("CRS Details:")
print(f"  Name: {philippines_gdf.crs.name}")
print(f"  EPSG Code: {philippines_gdf.crs.to_epsg()}")
print(f"  Units: {philippines_gdf.crs.axis_info[0].unit_name}")

# Bounds (extent)
bounds = philippines_gdf.total_bounds
print(f"\nGeographic Extent:")
print(f"  Min Longitude: {bounds[0]:.2f}°")
print(f"  Min Latitude:  {bounds[1]:.2f}°")
print(f"  Max Longitude: {bounds[2]:.2f}°")
print(f"  Max Latitude:  {bounds[3]:.2f}°")

### 3.3 Filtering and Querying Vector Data

In [None]:
# Filter by attribute: Select provinces in Mindanao
mindanao = philippines_gdf[philippines_gdf['Island_Group'] == 'Mindanao']
print("Mindanao Provinces:")
print(mindanao[['Province', 'Population', 'Area_km2']])

# Filter by condition: High-density provinces
high_density = philippines_gdf[philippines_gdf['Density'] > 1000]
print("\nHigh Density Provinces (>1000 people/km²):")
print(high_density[['Province', 'Density']].sort_values('Density', ascending=False))

# Multiple conditions: Large AND populous
major_provinces = philippines_gdf[
    (philippines_gdf['Population'] > 1000000) & 
    (philippines_gdf['Area_km2'] > 5000)
]
print("\nMajor Provinces (>1M pop AND >5000 km²):")
print(major_provinces[['Province', 'Population', 'Area_km2']])

### 3.4 Spatial Operations

In [None]:
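# Calculate centroids for all provinces
# (computed in a projected CRS, then converted back to the original geographic CRS)
philippines_gdf['centroid'] = (
    philippines_gdf.to_crs('EPSG:32651').centroid.to_crs(philippines_gdf.crs)
)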

print("Centroid coordinates (first 10 provinces):")
for idx, row in philippines_gdf.head(10).iterrows():
    print(f"  {row['Province']:<30} ({row['centroid'].x:.3f}, {row['centroid'].y:.3f})")

# Create buffer around Metropolitan Manila (50km radius)
manila = philippines_gdf[philippines_gdf['Province'] == 'Metropolitan Manila']

if len(manila) > 0:
    # Project to UTM for accurate buffering (meters)
    manila_utm = manila.to_crs('EPSG:32651')  # UTM Zone 51N
    manila_buffer_utm = manila_utm.buffer(50000)  # 50km buffer in meters
    manila_buffer = manila_buffer_utm.to_crs('EPSG:4326')  # Back to geographic
    
    print(f"\n✓ Created 50km buffer around Metropolitan Manila")
    print(f"  Original area: {manila['Area_km2'].values[0]:.0f} km²")
    
    # Calculate buffer area (approximate)
    buffer_area_km2 = manila_buffer_utm.area.values[0] / 1e6
    print(f"  Buffer area: {buffer_area_km2:.0f} km²")
    
    # Find provinces that intersect with Manila buffer
    philippines_gdf_utm = philippines_gdf.to_crs('EPSG:32651')
    intersects = philippines_gdf_utm.geometry.intersects(manila_buffer_utm.geometry.values[0])
    nearby_provinces = philippines_gdf[intersects]['Province'].tolist()
    
    print(f"\nProvinces within 50km of Manila:")
    for prov in nearby_provinces:
        print(f"  • {prov}")
else:
    print("\n⚠ Metropolitan Manila not found in dataset")
    print("  Proceeding with other spatial operations...")
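
The buffer above is computed in UTM because a geographic CRS measures in degrees, and a degree of longitude covers less ground than a degree of latitude away from the equator. The sketch below makes this concrete with a spherical approximation. The constant `KM_PER_DEGREE` and the helper `km_to_degrees` are illustrative assumptions, not part of GeoPandas.

```python
import math

KM_PER_DEGREE = 111.32  # approximate length of one degree along the equator

def km_to_degrees(distance_km, latitude_deg):
    """Approximate a ground distance as (delta_lon, delta_lat) in degrees."""
    delta_lat = distance_km / KM_PER_DEGREE
    # Meridians converge toward the poles, so one degree of longitude
    # shrinks by cos(latitude)
    delta_lon = distance_km / (KM_PER_DEGREE * math.cos(math.radians(latitude_deg)))
    return delta_lon, delta_lat

# 50 km at roughly the latitude of Manila
dlon, dlat = km_to_degrees(50, 14.6)
print(f"50 km is about {dlon:.3f} degrees of longitude and {dlat:.3f} degrees of latitude")
```

A single buffer distance in degrees would therefore produce a slightly squashed shape, which is the reason for the round trip through a metric CRS.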

### 3.5 Visualizing Vector Data

In [None]:
# Visualization with basemap - Philippine Provinces
import contextily as ctx

fig, ax = plt.subplots(figsize=(14, 12))

# Project to Web Mercator for basemap compatibility
philippines_web = philippines_gdf.to_crs(epsg=3857)

# Plot provinces
philippines_web.plot(
    ax=ax,
    color='lightblue',
    edgecolor='darkblue',
    linewidth=1.5,
    alpha=0.5
)

# Add basemap (OpenStreetMap)
ctx.add_basemap(
    ax,
    source=ctx.providers.OpenStreetMap.Mapnik,
    zoom=6,
    alpha=0.6
)

# Add province labels (for major provinces)
major_provinces = philippines_web[philippines_web['Population'] > 2000000]
for idx, row in major_provinces.iterrows():
    centroid = row['geometry'].centroid
    ax.annotate(
        text=row['Province'],
        xy=(centroid.x, centroid.y),
        ha='center',
        fontsize=9,
        fontweight='bold',
        color='darkred',
        bbox=dict(boxstyle='round,pad=0.3', facecolor='white', edgecolor='darkred', alpha=0.7)
    )

ax.set_title('Philippine Provinces with OpenStreetMap Basemap', 
             fontsize=14, fontweight='bold', pad=20)
ax.set_axis_off()  # Hide axes for a cleaner map (coordinates are Web Mercator meters, not degrees)
plt.tight_layout()
plt.show()

print("✓ Map with basemap created!")
print("  Basemap source: OpenStreetMap (Mapnik)")

In [None]:
# Choropleth map - Population with basemap
fig, ax = plt.subplots(figsize=(14, 12))

# Project to Web Mercator
philippines_web = philippines_gdf.to_crs(epsg=3857)

# Plot choropleth
philippines_web.plot(
    ax=ax,
    column='Population',
    cmap='YlOrRd',
    edgecolor='black',
    linewidth=0.8,
    legend=True,
    alpha=0.7,
    legend_kwds={
        'label': 'Population',
        'orientation': 'vertical',
        'shrink': 0.6,
        'pad': 0.05
    }
)

# Add basemap
ctx.add_basemap(
    ax,
    source=ctx.providers.CartoDB.Positron,  # Light basemap for better contrast
    zoom=6,
    alpha=0.4
)

ax.set_title('Philippine Provinces by Population (with Basemap)', 
             fontsize=14, fontweight='bold', pad=20)
ax.set_axis_off()
plt.tight_layout()
plt.show()

print("✓ Choropleth map with basemap created!")
print("  Basemap: CartoDB Positron (light theme)")

In [None]:
# Categorical map - Island Groups with basemap
fig, ax = plt.subplots(figsize=(14, 12))

# Project to Web Mercator
philippines_web = philippines_gdf.to_crs(epsg=3857)

# Define colors for island groups
island_colors = {'Luzon': '#2ecc71', 'Visayas': '#3498db', 'Mindanao': '#e74c3c'}
philippines_web['color'] = philippines_web['Island_Group'].map(island_colors)

# Plot provinces by island group
philippines_web.plot(
    ax=ax,
    color=philippines_web['color'],
    edgecolor='black',
    linewidth=0.8,
    alpha=0.6
)

# Add basemap (CartoDB Positron without labels)
ctx.add_basemap(
    ax,
    source=ctx.providers.CartoDB.PositronNoLabels,
    zoom=6,
    alpha=0.5
)

# Create custom legend
legend_elements = [
    Patch(facecolor='#2ecc71', edgecolor='black', label='Luzon'),
    Patch(facecolor='#3498db', edgecolor='black', label='Visayas'),
    Patch(facecolor='#e74c3c', edgecolor='black', label='Mindanao')
]
ax.legend(handles=legend_elements, loc='upper left', title='Island Group',
          fontsize=11, title_fontsize=12, frameon=True, fancybox=True, shadow=True)

ax.set_title('Philippine Island Groups with Basemap', 
             fontsize=14, fontweight='bold', pad=20)
ax.set_axis_off()
plt.tight_layout()
plt.show()

print("✓ Island group map with basemap created!")
print("  Basemap: CartoDB Positron (no labels)")

### 📝 Exercise 1: Select and Plot Your Home Province

**Task:** 
1. Select a province from the GeoDataFrame
2. Calculate its population density
3. Create a focused map showing only that province
4. Add informative labels

**Hint:** Use boolean filtering: `gdf[gdf['Province'] == 'YourProvince']`

In [None]:
# YOUR CODE HERE
# Example solution (uncomment and modify):

# my_province = philippines_gdf[philippines_gdf['Province'] == 'Palawan']
# density = my_province['Population'].values[0] / my_province['Area_km2'].values[0]

# fig, ax = plt.subplots(figsize=(10, 8))
# my_province.plot(ax=ax, color='green', edgecolor='black', linewidth=2, alpha=0.6)
# ax.set_title(f"{my_province['Province'].values[0]} Province\nDensity: {density:.1f} people/km²",
#              fontsize=14, fontweight='bold')
# plt.show()

<details>
<summary><b>Click to see solution</b></summary>

```python
# Select Palawan
my_province = philippines_gdf[philippines_gdf['Province'] == 'Palawan']

# Calculate density
pop = my_province['Population'].values[0]
area = my_province['Area_km2'].values[0]
density = pop / area

# Create visualization
fig, ax = plt.subplots(figsize=(10, 8))
my_province.plot(
    ax=ax,
    color='forestgreen',
    edgecolor='darkgreen',
    linewidth=2,
    alpha=0.6
)

# Add info text
info_text = f"Population: {pop:,.0f}\nArea: {area:.0f} km²\nDensity: {density:.1f} people/km²"
ax.text(0.02, 0.98, info_text,
        transform=ax.transAxes,
        fontsize=10,
        verticalalignment='top',
        bbox=dict(boxstyle='round', facecolor='white', alpha=0.8))

ax.set_title(f"{my_province['Province'].values[0]} Province",
             fontsize=14, fontweight='bold', pad=20)
ax.set_xlabel('Longitude (°E)')
ax.set_ylabel('Latitude (°N)')
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
```
</details>

---

## Part 4: Rasterio for Raster Data

**Rasterio** is the go-to library for working with raster/gridded data like satellite imagery.

### Why Rasterio?

- ✓ Read/write GeoTIFF and other raster formats
- ✓ NumPy integration for fast array operations
- ✓ Handles multi-band imagery (Sentinel-2 has 13 bands!)
- ✓ Georeferencing and coordinate transformations
- ✓ Masking, clipping, resampling, reprojection

### 4.1 Loading Real Sentinel-2 Data from Cloud Storage

We'll use **Microsoft Planetary Computer** or **Element84 Earth Search** to access real Sentinel-2 imagery over Palawan.

**Why cloud access?**
- No need to manually download large files
- Access to entire Sentinel-2 archive
- Data is already processed to L2A (bottom-of-atmosphere)
- Cloud-optimized GeoTIFF format for efficient streaming

In [None]:
from pystac_client import Client
import planetary_computer

print("Searching for Sentinel-2 imagery over Palawan, Philippines...")
print("This may take a moment to query the catalog and access cloud data...\n")

# Define area of interest - Palawan bounding box
palawan_bbox = [118.5, 8.5, 119.5, 11.5]  # [min_lon, min_lat, max_lon, max_lat]

# Open Microsoft Planetary Computer STAC catalog
catalog = Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1",
    modifier=planetary_computer.sign_inplace  # Handles authentication
)

# Search for Sentinel-2 L2A imagery
search = catalog.search(
    collections=["sentinel-2-l2a"],
    bbox=palawan_bbox,
    datetime="2024-01-01/2024-12-31",  # Full year 2024
    query={"eo:cloud_cover": {"lt": 10}}  # Less than 10% cloud cover
)

# Get items
items = list(search.items())
print(f"✓ Found {len(items)} Sentinel-2 scenes with <10% cloud cover")

if len(items) == 0:
    print("⚠ No suitable imagery found. Expanding search criteria...")
    # Fallback: relax cloud cover constraint
    search = catalog.search(
        collections=["sentinel-2-l2a"],
        bbox=palawan_bbox,
        datetime="2024-01-01/2024-12-31",
        query={"eo:cloud_cover": {"lt": 30}}  # Less than 30% cloud cover
    )
    items = list(search.items())
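    print(f"✓ Found {len(items)} scenes with <30% cloud cover")

# Stop with a clear message if the search still returned nothing
if not items:
    raise RuntimeError("No Sentinel-2 scenes found; widen the date range or bounding box.")

# Select the least cloudy scene
items_sorted = sorted(items, key=lambda x: x.properties.get("eo:cloud_cover", 100))
selected_item = items_sorted[0]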

# Print scene information
print(f"\n{'='*70}")
print("SELECTED SENTINEL-2 SCENE")
print(f"{'='*70}")
print(f"  Scene ID: {selected_item.id}")
print(f"  Date: {selected_item.datetime.strftime('%Y-%m-%d %H:%M:%S UTC')}")
print(f"  Cloud Cover: {selected_item.properties.get('eo:cloud_cover', float('nan')):.2f}%")
print(f"  Platform: {selected_item.properties.get('platform', 'Sentinel-2')}")
print(f"  Processing Level: L2A (Bottom-of-Atmosphere)")

# Print available bands
print(f"\nAvailable Bands:")
bands_needed = ['B02', 'B03', 'B04', 'B08']
for band in bands_needed:
    if band in selected_item.assets:
        print(f"  • {band}: {selected_item.assets[band].title}")

print(f"{'='*70}\n")

# Store the selected item for next cells
sentinel_item = selected_item

### 4.2 Reading a Sentinel-2 Subset and Writing It to a Local File

We first read a 1000 × 1000 pixel window of the four 10 m bands from cloud storage, then save that subset as a multi-band GeoTIFF for later use.

In [None]:
from rasterio.windows import Window

print("Loading Sentinel-2 bands from cloud storage...")
print("Reading subset of data for performance (1000x1000 pixels)\n")

# Get signed URLs for the bands we need
band_blue_url = sentinel_item.assets["B02"].href  # Blue (10m)
band_green_url = sentinel_item.assets["B03"].href  # Green (10m)
band_red_url = sentinel_item.assets["B04"].href  # Red (10m)
band_nir_url = sentinel_item.assets["B08"].href  # NIR (10m)

# Define window to read (subset for performance)
# We'll read a 1000x1000 pixel window from the center
# This represents approximately 10km x 10km at 10m resolution

# Open one band to get dimensions
with rasterio.open(band_red_url) as src:
    full_height, full_width = src.height, src.width
    
    # Calculate center window
    window_size = min(1000, full_width, full_height)  # 1000x1000 or smaller if image is smaller
    col_off = (full_width - window_size) // 2
    row_off = (full_height - window_size) // 2
    
    window = Window(col_off, row_off, window_size, window_size)
    
    print(f"Full scene dimensions: {full_width} x {full_height} pixels")
    print(f"Reading subset: {window_size} x {window_size} pixels")
    print(f"Window location: row {row_off}, col {col_off}\n")
    
    # Store metadata for later
    transform = src.window_transform(window)
    crs = src.crs
    bounds = src.window_bounds(window)

# Load each band with the same window
print("Loading bands...")

with rasterio.open(band_blue_url) as src:
    band_blue = src.read(1, window=window)
print("  ✓ Blue (B02) loaded")

with rasterio.open(band_green_url) as src:
    band_green = src.read(1, window=window)
print("  ✓ Green (B03) loaded")

with rasterio.open(band_red_url) as src:
    band_red = src.read(1, window=window)
print("  ✓ Red (B04) loaded")

with rasterio.open(band_nir_url) as src:
    band_nir = src.read(1, window=window)
print("  ✓ NIR (B08) loaded")

# Store for convenience
blue = band_blue
green = band_green
red = band_red
nir = band_nir

# Store dimensions
height, width = red.shape

print(f"\n✓ Real Sentinel-2 bands loaded successfully!")
print(f"  Dimensions: {width} x {height} pixels")
print(f"  Bands: Blue (B02), Green (B03), Red (B04), NIR (B08)")
print(f"  Resolution: 10m per pixel")
print(f"  Coverage: ~{(width*10)/1000:.1f}km x {(height*10)/1000:.1f}km")
print(f"  Location: Palawan, Philippines")
print(f"  CRS: {crs}")
print(f"  Bounds: {bounds}")
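
The window arithmetic in the cell above can be checked without touching any imagery. The sketch below reproduces the same offset calculation as a plain function. The name `centered_window` is hypothetical, and 10980 x 10980 is used as the size of a 10 m Sentinel-2 tile.

```python
def centered_window(full_width, full_height, size=1000):
    """Return (col_off, row_off, width, height) of a square window centered in the image."""
    # Shrink the window if the image is smaller than the requested size
    size = min(size, full_width, full_height)
    col_off = (full_width - size) // 2
    row_off = (full_height - size) // 2
    return col_off, row_off, size, size

# A full 10 m Sentinel-2 tile
print(centered_window(10980, 10980))   # (4990, 4990, 1000, 1000)

# An image smaller than the requested window
print(centered_window(800, 600))       # (100, 0, 600, 600)
```

The returned tuple is in the same argument order that `rasterio.windows.Window(col_off, row_off, width, height)` expects.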

In [None]:
# Save real Sentinel-2 subset as multi-band GeoTIFF
raster_path = '/tmp/palawan_sentinel2_real.tif'

# Create raster profile
profile = {
    'driver': 'GTiff',
    'height': height,
    'width': width,
    'count': 4,  # 4 bands
    'dtype': blue.dtype,
    'crs': crs,
    'transform': transform,
    'compress': 'lzw',
    'nodata': 0
}

# Write all bands to file
with rasterio.open(raster_path, 'w', **profile) as dst:
    dst.write(blue, 1)
    dst.write(green, 2)
    dst.write(red, 3)
    dst.write(nir, 4)
    
    # Set band descriptions
    dst.set_band_description(1, 'Blue (B02)')
    dst.set_band_description(2, 'Green (B03)')
    dst.set_band_description(3, 'Red (B04)')
    dst.set_band_description(4, 'NIR (B08)')

print(f"✓ Real Sentinel-2 subset saved: {raster_path}")
print(f"  File size: {os.path.getsize(raster_path) / 1024 / 1024:.2f} MB")
print(f"  Bands: 4 (Blue, Green, Red, NIR)")
print(f"  This file demonstrates saving cloud-streamed data locally")

### 4.3 Saving Real Data Subset as Local GeoTIFF

The previous cell saved the real Sentinel-2 subset as a local GeoTIFF file, both for convenience and to demonstrate file I/O operations.

### 4.4 Opening and Inspecting Raster Metadata

Now let's reopen the saved file and inspect its metadata, demonstrating best practices for raster file handling.

In [None]:
# We already have the band data loaded from cloud storage
# Let's verify the arrays we're working with

print("Band Arrays (loaded from Microsoft Planetary Computer):")
print(f"  Blue (B02):  shape={blue.shape}, dtype={blue.dtype}")
print(f"  Green (B03): shape={green.shape}, dtype={green.dtype}")
print(f"  Red (B04):   shape={red.shape}, dtype={red.dtype}")
print(f"  NIR (B08):   shape={nir.shape}, dtype={nir.dtype}")

print(f"\nData characteristics:")
print(f"  Dimensions: {height} rows × {width} columns")
print(f"  Total pixels: {height * width:,}")
print(f"  Coverage area: ~{(width*10)/1000:.1f}km × {(height*10)/1000:.1f}km")
print(f"  Pixel resolution: 10m")

# Reopen the saved GeoTIFF and inspect its metadata
with rasterio.open(raster_path) as src:
    print(f"\nSaved file metadata ({raster_path}):")
    print(f"  Driver: {src.driver}, bands: {src.count}, dtype: {src.dtypes[0]}")
    print(f"  CRS: {src.crs}")
    print(f"  Band descriptions: {src.descriptions}")
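
A key piece of raster metadata is the affine transform, which maps pixel positions to map coordinates. The sketch below spells out that mapping with plain arithmetic, assuming the six coefficients in the order (a, b, c, d, e, f). The coefficient values and the function name `pixel_to_map` are made up for illustration.

```python
def pixel_to_map(col, row, a, b, c, d, e, f):
    """Apply affine coefficients to a (col, row) pixel position."""
    x = a * col + b * row + c
    y = d * col + e * row + f
    return x, y

# Hypothetical north-up transform: 10 m pixels, upper-left corner at (600000, 1100000).
# e is negative because row numbers increase downward while northing increases upward.
coeffs = (10.0, 0.0, 600000.0, 0.0, -10.0, 1100000.0)

upper_left = pixel_to_map(0, 0, *coeffs)
lower_right = pixel_to_map(1000, 1000, *coeffs)

print("Upper-left corner: ", upper_left)    # (600000.0, 1100000.0)
print("Lower-right corner:", lower_right)   # (610000.0, 1090000.0)
```

With a rasterio dataset the same mapping is available as `transform * (col, row)`, and the pixel size is `abs(transform[0])` by `abs(transform[4])`.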

### 4.5 Reading Raster Data as NumPy Arrays

In [None]:
# Calculate statistics for each band (using real Sentinel-2 data)
bands_dict = {
    'Blue (B02)': blue,
    'Green (B03)': green,
    'Red (B04)': red,
    'NIR (B08)': nir
}

print("="*80)
print("BAND STATISTICS - REAL SENTINEL-2 DATA")
print("(Surface Reflectance, 0-10000 scale)")
print("="*80)
print(f"{'Band':<15} {'Min':>8} {'Max':>8} {'Mean':>10} {'Median':>10} {'Std Dev':>10}")
print("-"*80)

for band_name, band_data in bands_dict.items():
    print(f"{band_name:<15} "
          f"{band_data.min():>8} "
          f"{band_data.max():>8} "
          f"{band_data.mean():>10.1f} "
          f"{np.median(band_data):>10.1f} "
          f"{band_data.std():>10.1f}")

print("="*80)

# Calculate percentiles for Red band
print("\nPercentile Analysis (Red band):")
percentiles = [5, 25, 50, 75, 95]
values = np.percentile(red, percentiles)
for p, v in zip(percentiles, values):
    print(f"  {p}th percentile: {v:.0f}")

print("\n✓ These are real spectral values from Palawan, Philippines!")
print("  Values reflect actual surface conditions on the acquisition date")
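
The statistics above are in scaled digital numbers, which this notebook converts to reflectance by dividing by 10000. Sentinel-2 products from processing baseline 04.00 onward are documented as also carrying an additive offset of -1000, so it is worth checking the scene metadata before treating values as physical reflectance. The sketch below shows both conversions. The function `dn_to_reflectance` and its default values are assumptions for illustration, not a rasterio or Planetary Computer API.

```python
import numpy as np

def dn_to_reflectance(dn, scale=10000.0, offset=0.0, nodata=0):
    """Convert digital numbers to reflectance; nodata pixels become NaN."""
    dn = np.asarray(dn)
    refl = (dn.astype(float) + offset) / scale
    return np.where(dn == nodata, np.nan, refl)

# Made-up digital numbers: one nodata pixel and two valid pixels
dn = np.array([0, 1000, 3500], dtype=np.uint16)

plain = dn_to_reflectance(dn)                    # DN / 10000, as used in this notebook
shifted = dn_to_reflectance(dn, offset=-1000.0)  # assumed offset for newer baselines

print("Without offset:", plain)
print("With offset:   ", shifted)
```

Because the offset is added to both bands, it changes ratio-based indices such as NDVI, not only the absolute reflectance.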

### 4.6 Calculating Band Statistics

Let's examine the real spectral signatures from Palawan:

In [None]:
# Visualize NIR band (grayscale) - using real Sentinel-2 data
fig, ax = plt.subplots(figsize=(12, 10))

# Convert to reflectance (0-1 scale)
nir_refl = nir / 10000.0

im = ax.imshow(nir_refl, cmap='gray', vmin=0, vmax=0.6)
cbar = plt.colorbar(im, ax=ax, shrink=0.8)
cbar.set_label('NIR Reflectance', fontsize=11)

ax.set_title('Real Sentinel-2 Near-Infrared Band (B08) - Palawan', 
             fontsize=14, fontweight='bold', pad=15)
ax.set_xlabel('Column (pixel)', fontsize=11)
ax.set_ylabel('Row (pixel)', fontsize=11)

# Add explanation text
explanation = (
    "NIR (Near-Infrared) - Real Data:\n"
    "• Bright = High reflectance (vegetation)\n"
    "• Dark = Low reflectance (water, bare soil)\n"
    "• Pattern shows actual land cover"
)
ax.text(0.02, 0.98, explanation,
        transform=ax.transAxes,
        fontsize=9,
        verticalalignment='top',
        bbox=dict(boxstyle='round', facecolor='white', alpha=0.9))

plt.tight_layout()
plt.show()

print("✓ This is real NIR data from Microsoft Planetary Computer!")
print("  Bright areas indicate healthy vegetation over Palawan")
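
The plot above uses fixed display limits (`vmin=0`, `vmax=0.6`). A common alternative is to derive the limits from the data with a percentile stretch, sketched below on synthetic values. The function name `percentile_stretch` and the 2nd/98th percentile defaults are illustrative choices, not something the notebook prescribes.

```python
import numpy as np

def percentile_stretch(band, low=2, high=98):
    """Rescale a band to 0-1 between two percentiles, clipping the tails."""
    band = np.asarray(band, dtype=float)
    lo, hi = np.percentile(band, [low, high])
    if hi == lo:
        # Constant image: nothing to stretch
        return np.zeros_like(band)
    return np.clip((band - lo) / (hi - lo), 0.0, 1.0)

# Synthetic stand-in for a band of digital numbers
rng = np.random.default_rng(0)
demo = rng.integers(200, 4000, size=(50, 50))

stretched = percentile_stretch(demo)
print(f"Stretched range: {stretched.min():.1f} to {stretched.max():.1f}")
```

The stretched array can be passed straight to `ax.imshow(..., cmap='gray')`; a few very bright pixels such as clouds no longer compress the rest of the image into dark tones.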

### 4.7 Visualizing Single Bands

### 4.8 Creating RGB True Color Composite

In [None]:
# Visualize all 4 bands in subplots
fig, axes = plt.subplots(2, 2, figsize=(14, 12))
axes = axes.flatten()

bands_to_plot = [
    (blue, 'Blue (B2)', 'Blues'),
    (green, 'Green (B3)', 'Greens'),
    (red, 'Red (B4)', 'Reds'),
    (nir, 'NIR (B8)', 'gray')
]

for idx, (band, title, cmap) in enumerate(bands_to_plot):
    im = axes[idx].imshow(band / 10000.0, cmap=cmap, vmin=0, vmax=0.6)
    axes[idx].set_title(title, fontsize=12, fontweight='bold')
    axes[idx].set_xlabel('Column', fontsize=9)
    axes[idx].set_ylabel('Row', fontsize=9)
    plt.colorbar(im, ax=axes[idx], fraction=0.046, pad=0.04)

plt.suptitle('Sentinel-2 Multispectral Bands - Palawan', 
             fontsize=15, fontweight='bold', y=0.995)
plt.tight_layout()
plt.show()

### 4.9 False Color Composites

False color composites use **non-visible** bands to highlight specific features.
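
As a minimal sketch of the idea, a false color composite can be built by stacking NIR, Red, and Green into the red, green, and blue display channels. The helper `make_composite`, its `gain` brightness factor, and the synthetic arrays are assumptions for illustration. The notebook's own `nir`, `red`, and `green` arrays could be passed in the same way.

```python
import numpy as np

def make_composite(r_band, g_band, b_band, scale=10000.0, gain=3.0):
    """Stack three bands into a (rows, cols, 3) float image in the 0-1 range."""
    stacked = np.dstack([r_band, g_band, b_band]).astype(float)
    # Scale to reflectance, brighten for display, and clip to the valid range
    return np.clip(stacked / scale * gain, 0.0, 1.0)

# Small synthetic bands standing in for real imagery
nir_demo = np.full((4, 4), 3000, dtype=np.uint16)
red_demo = np.full((4, 4), 500, dtype=np.uint16)
green_demo = np.full((4, 4), 800, dtype=np.uint16)

# NIR -> red channel, Red -> green channel, Green -> blue channel
false_color = make_composite(nir_demo, red_demo, green_demo)
print(false_color.shape)
print(false_color[0, 0])
```

Displayed with `plt.imshow(false_color)`, vegetation appears bright red because it reflects strongly in the near-infrared. Passing `(red, green, blue)` instead gives a true color composite.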

In [None]:
# Calculate NDVI using real Sentinel-2 data
ndvi = calculate_ndvi(nir, red)

# Print statistics
print("="*60)
print("NDVI STATISTICS - REAL SENTINEL-2 DATA")
print("="*60)
print(f"Minimum:   {ndvi.min():.4f}")
print(f"Maximum:   {ndvi.max():.4f}")
print(f"Mean:      {ndvi.mean():.4f}")
print(f"Median:    {np.median(ndvi):.4f}")
print(f"Std Dev:   {ndvi.std():.4f}")
print("="*60)

# Calculate area by vegetation class
# Note: For real data we need to calculate pixel area from transform
res_x_meters = abs(transform[0])  # meters per pixel
res_y_meters = abs(transform[4])  # meters per pixel
pixel_area_km2 = (res_x_meters * res_y_meters) / 1e6  # Convert to km¬≤

water_pixels = np.sum(ndvi < 0)
sparse_pixels = np.sum((ndvi >= 0) & (ndvi < 0.2))
moderate_pixels = np.sum((ndvi >= 0.2) & (ndvi < 0.5))
dense_pixels = np.sum((ndvi >= 0.5) & (ndvi < 0.8))
very_dense_pixels = np.sum(ndvi >= 0.8)

print("\nVegetation Cover Analysis (Real Data from Palawan):")
print(f"  Water/Bare (<0):       {water_pixels:>6} pixels ({water_pixels * pixel_area_km2:.1f} km¬≤)")
print(f"  Sparse (0-0.2):        {sparse_pixels:>6} pixels ({sparse_pixels * pixel_area_km2:.1f} km¬≤)")
print(f"  Moderate (0.2-0.5):    {moderate_pixels:>6} pixels ({moderate_pixels * pixel_area_km2:.1f} km¬≤)")
print(f"  Dense (0.5-0.8):       {dense_pixels:>6} pixels ({dense_pixels * pixel_area_km2:.1f} km¬≤)")
print(f"  Very Dense (>0.8):     {very_dense_pixels:>6} pixels ({very_dense_pixels * pixel_area_km2:.1f} km¬≤)")

# Calculate vegetation percentage
veg_pixels = moderate_pixels + dense_pixels + very_dense_pixels
total_pixels = width * height
veg_percentage = (veg_pixels / total_pixels) * 100

print(f"\n‚úì Overall Vegetation Coverage: {veg_percentage:.1f}%")
print(f"  Based on real Sentinel-2 imagery from Microsoft Planetary Computer")

### 4.10 Calculating NDVI (Normalized Difference Vegetation Index)

**NDVI is THE most important vegetation index in remote sensing.**

$$NDVI = \frac{NIR - Red}{NIR + Red}$$

**Interpretation:**
- **-1 to 0**: Water, bare soil, snow
- **0 to 0.2**: Sparse vegetation, rock
- **0.2 to 0.5**: Shrubs, grassland
- **0.5 to 0.8**: Dense vegetation, healthy crops
- **0.8 to 1**: Very dense vegetation (tropical forest)

In [None]:
# False Color Composite: NIR-Red-Green (Vegetation appears bright red)
false_color_nrg = np.dstack([nir, red, green]) / 10000.0

# Apply stretch
p2, p98 = np.percentile(false_color_nrg, (2, 98))
false_color_nrg_stretched = np.clip((false_color_nrg - p2) / (p98 - p2), 0, 1)

# Display
fig, ax = plt.subplots(figsize=(12, 10))

ax.imshow(false_color_nrg_stretched)
ax.set_title('False Color Composite (NIR-R-G) - Vegetation Analysis',
             fontsize=14, fontweight='bold', pad=15)
ax.set_xlabel('Column', fontsize=11)
ax.set_ylabel('Row', fontsize=11)

# Add legend
legend_text = (
    "False Color Interpretation:\n"
    "‚Ä¢ Bright Red = Dense vegetation\n"
    "‚Ä¢ Pink/Light Red = Moderate vegetation\n"
    "‚Ä¢ Dark Blue/Black = Water\n"
    "‚Ä¢ Gray/White = Urban, bare soil\n\n"
    "Band Assignment:\n"
    "R = NIR (B8), G = Red (B4), B = Green (B3)"
)
ax.text(1.02, 0.5, legend_text,
        transform=ax.transAxes,
        fontsize=9,
        verticalalignment='center',
        bbox=dict(boxstyle='round', facecolor='lightgray', alpha=0.9))

plt.tight_layout()
plt.show()

print("Why False Color?")
print("  ‚Ä¢ Vegetation reflects STRONGLY in NIR (invisible to human eye)")
print("  ‚Ä¢ By mapping NIR to Red channel, vegetation appears bright red")
print("  ‚Ä¢ Makes vegetation identification much easier!")
print("  ‚Ä¢ Critical for agriculture, forestry, and NRM applications")

### 4.9 Calculating NDVI (Normalized Difference Vegetation Index)

**NDVI is THE most important vegetation index in remote sensing.**

$$NDVI = \frac{NIR - Red}{NIR + Red}$$

**Interpretation:**
- **-1 to 0**: Water, bare soil, snow
- **0 to 0.2**: Sparse vegetation, rock
- **0.2 to 0.5**: Shrubs, grassland
- **0.5 to 0.8**: Dense vegetation, healthy crops
- **0.8 to 1**: Very dense vegetation (tropical forest)

In [None]:
In [None]:
# Visualize NDVI
fig, ax = plt.subplots(figsize=(12, 10))

# Use diverging colormap (red-yellow-green)
im = ax.imshow(ndvi, cmap='RdYlGn', vmin=-0.2, vmax=0.9)
cbar = plt.colorbar(im, ax=ax, shrink=0.8, extend='both')
cbar.set_label('NDVI', fontsize=12, fontweight='bold')

# Add horizontal lines for class boundaries
cbar.ax.axhline(y=0, color='blue', linewidth=2, linestyle='--', alpha=0.7)
cbar.ax.axhline(y=0.2, color='orange', linewidth=1.5, linestyle='--', alpha=0.7)
cbar.ax.axhline(y=0.5, color='yellow', linewidth=1.5, linestyle='--', alpha=0.7)
cbar.ax.axhline(y=0.8, color='darkgreen', linewidth=1.5, linestyle='--', alpha=0.7)

ax.set_title('NDVI - Normalized Difference Vegetation Index',
             fontsize=14, fontweight='bold', pad=15)
ax.set_xlabel('Column (pixel)', fontsize=11)
ax.set_ylabel('Row (pixel)', fontsize=11)

# Add interpretation legend
legend_text = (
    "NDVI Interpretation:\n\n"
    "< 0 (Red/Brown)\n"
    "  Water, bare soil\n\n"
    "0 - 0.2 (Orange/Yellow)\n"
    "  Sparse vegetation\n\n"
    "0.2 - 0.5 (Light Green)\n"
    "  Moderate vegetation\n\n"
    "0.5 - 0.8 (Green)\n"
    "  Dense vegetation\n\n"
    "> 0.8 (Dark Green)\n"
    "  Very dense vegetation"
)
ax.text(1.15, 0.5, legend_text,
        transform=ax.transAxes,
        fontsize=9,
        verticalalignment='center',
        bbox=dict(boxstyle='round', facecolor='white', alpha=0.9),
        family='monospace')

plt.tight_layout()
plt.show()

### 4.11 NDVI Histogram and Distribution Analysis

In [None]:
# Create comprehensive histogram
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# Histogram
ax1.hist(ndvi.flatten(), bins=100, color='green', alpha=0.7, edgecolor='darkgreen')
ax1.axvline(ndvi.mean(), color='red', linestyle='--', linewidth=2, label=f'Mean: {ndvi.mean():.3f}')
ax1.axvline(np.median(ndvi), color='blue', linestyle='--', linewidth=2, label=f'Median: {np.median(ndvi):.3f}')

# Add class boundary lines
ax1.axvline(0, color='black', linestyle=':', linewidth=1.5, alpha=0.5)
ax1.axvline(0.2, color='orange', linestyle=':', linewidth=1.5, alpha=0.5)
ax1.axvline(0.5, color='yellow', linestyle=':', linewidth=1.5, alpha=0.5)
ax1.axvline(0.8, color='darkgreen', linestyle=':', linewidth=1.5, alpha=0.5)

ax1.set_xlabel('NDVI Value', fontsize=11, fontweight='bold')
ax1.set_ylabel('Frequency (pixel count)', fontsize=11, fontweight='bold')
ax1.set_title('NDVI Distribution', fontsize=13, fontweight='bold')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Box plot
box_data = [ndvi[ndvi < 0].flatten(),
            ndvi[(ndvi >= 0) & (ndvi < 0.2)].flatten(),
            ndvi[(ndvi >= 0.2) & (ndvi < 0.5)].flatten(),
            ndvi[(ndvi >= 0.5) & (ndvi < 0.8)].flatten(),
            ndvi[ndvi >= 0.8].flatten()]

bp = ax2.boxplot(box_data, 
                 labels=['Water\n(<0)', 'Sparse\n(0-0.2)', 'Moderate\n(0.2-0.5)', 
                        'Dense\n(0.5-0.8)', 'Very Dense\n(>0.8)'],
                 patch_artist=True)

# Color boxes
colors = ['brown', 'orange', 'yellow', 'green', 'darkgreen']
for patch, color in zip(bp['boxes'], colors):
    patch.set_facecolor(color)
    patch.set_alpha(0.6)

ax2.set_ylabel('NDVI Value', fontsize=11, fontweight='bold')
ax2.set_title('NDVI by Vegetation Class', fontsize=13, fontweight='bold')
ax2.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()
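
The class boundaries used in the histogram and box plot (0, 0.2, 0.5, 0.8) can also be turned into a classified map in one step. This is a minimal sketch using `np.digitize`. The function name `classify_ndvi`, the label list, and the tiny sample array are illustrative and not part of the notebook.

```python
import numpy as np

NDVI_BINS = [0.0, 0.2, 0.5, 0.8]
NDVI_LABELS = ["Water/Bare", "Sparse", "Moderate", "Dense", "Very Dense"]

def classify_ndvi(ndvi):
    """Return an integer class map: 0 for NDVI < 0, up to 4 for NDVI >= 0.8.

    Note: NaN pixels end up in the last class, so mask them out beforehand.
    """
    # np.digitize returns i where bins[i-1] <= x < bins[i]
    return np.digitize(ndvi, NDVI_BINS)

sample = np.array([[-0.3, 0.0, 0.1],
                   [0.2, 0.49, 0.5],
                   [0.79, 0.8, 0.95]])
classes = classify_ndvi(sample)
print(classes)

# Pixel counts per class; multiply by the pixel area to get an area per class
counts = np.bincount(classes.ravel(), minlength=len(NDVI_LABELS))
for label, count in zip(NDVI_LABELS, counts):
    print(f"{label:<12} {count} pixels")
```

The resulting class map can be passed to `imshow` with a discrete colormap, or used as a simple rule-based land cover layer.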

### 📝 Exercise 2: Calculate and Visualize NDWI (Water Index)

**NDWI (Normalized Difference Water Index)** is used to detect water bodies.

$$NDWI = \frac{Green - NIR}{Green + NIR}$$

**Task:**
1. Write a function to calculate NDWI
2. Calculate NDWI from the Green and NIR bands
3. Create a visualization showing water bodies
4. Calculate statistics (min, max, mean)

**Hints:**
- NDWI > 0.3: Water
- NDWI 0 to 0.3: Wetlands/moist soil
- NDWI < 0: Dry land/vegetation

In [None]:
# YOUR CODE HERE
# Step 1: Write NDWI function

# def calculate_ndwi(green, nir):
#     """
#     Calculate Normalized Difference Water Index.
#     NDWI = (Green - NIR) / (Green + NIR)
#     """
#     # Your code here
#     pass

# Step 2: Calculate NDWI
# ndwi = calculate_ndwi(green, nir)

# Step 3: Visualize
# fig, ax = plt.subplots(figsize=(12, 10))
# im = ax.imshow(ndwi, cmap='Blues', vmin=-0.5, vmax=0.5)
# # Add colorbar, title, labels
# plt.show()

# Step 4: Calculate statistics
# print(f"NDWI Statistics:")
# print(f"  Min: {ndwi.min():.3f}")
# # ... etc

---

## Part 5: Combined Operations - Vector and Raster Integration

**The real power of geospatial analysis comes from combining vector and raster data.**

Common workflows:
- Clip raster to administrative boundaries
- Extract statistics per province/region
- Overlay boundaries on satellite imagery
- Sample raster values at point locations

### 5.1 Clipping Raster to Vector Boundary

In [None]:
from rasterio.mask import mask as rasterio_mask

# Select Palawan province
palawan_gdf = philippines_gdf[philippines_gdf['Province'] == 'Palawan']

# Open raster and clip
with rasterio.open(raster_path) as src:
    # The boundary must be in the raster's CRS; reproject first in case they differ
    palawan_in_raster_crs = palawan_gdf.to_crs(src.crs)

    # Get geometry in format rasterio expects (GeoJSON-like)
    palawan_geom = [palawan_in_raster_crs.geometry.values[0].__geo_interface__]

    # Clip raster to Palawan boundary (pixels outside the polygon are filled with nodata)
    out_image, out_transform = rasterio_mask(src, palawan_geom, crop=True, filled=True)
    out_meta = src.meta.copy()

# Update metadata
out_meta.update({
    "height": out_image.shape[1],
    "width": out_image.shape[2],
    "transform": out_transform
})

print("✓ Raster clipped to Palawan boundary!")
print(f"  Original size: {height} x {width} pixels")
print(f"  Clipped size:  {out_image.shape[1]} x {out_image.shape[2]} pixels")
print(f"  Reduction:     {(1 - (out_image.shape[1] * out_image.shape[2]) / (height * width)) * 100:.1f}%")

# Extract clipped bands
clipped_red = out_image[2, :, :]
clipped_nir = out_image[3, :, :]

# Calculate NDVI for clipped area
clipped_ndvi = calculate_ndvi(clipped_nir, clipped_red)

print(f"\nClipped NDVI statistics:")
print(f"  Mean: {clipped_ndvi.mean():.3f}")
print(f"  Min:  {clipped_ndvi.min():.3f}")
print(f"  Max:  {clipped_ndvi.max():.3f}")
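
One caveat with `filled=True`: pixels outside the polygon are written as the raster's nodata value, or as 0 when no nodata value is set, and they still count toward `clipped_ndvi.mean()`. The following is a minimal sketch of excluding them, assuming a fill value of 0 in every band. `valid_pixel_mask` and the toy arrays are illustrative stand-ins for `out_image` and `clipped_ndvi`.

```python
import numpy as np

def valid_pixel_mask(clipped_bands, fill_value=0):
    """True where at least one band differs from the fill value.

    clipped_bands has shape (bands, rows, cols), like the array returned by
    rasterio.mask.mask.
    """
    return np.any(clipped_bands != fill_value, axis=0)

# Toy stand-in for out_image: 2 bands (red, nir), 2 x 3 pixels,
# with the right-hand column lying outside the polygon (all zeros)
toy = np.array([
    [[1000, 1000, 0], [1000, 1000, 0]],   # red
    [[3000, 3000, 0], [3000, 3000, 0]],   # nir
], dtype="float32")

red_band, nir_band = toy
denom = nir_band + red_band
toy_ndvi = np.divide(nir_band - red_band, denom,
                     out=np.zeros_like(denom), where=denom != 0)

valid = valid_pixel_mask(toy)
print(f"Mean over all pixels:   {toy_ndvi.mean():.3f}")
print(f"Mean over valid pixels: {toy_ndvi[valid].mean():.3f}")
```

Another option is to call `rasterio_mask` with `filled=False`, which returns a masked array whose statistics skip the outside pixels.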

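### 5.2 Overlay Vector Boundaries on Raster

In [None]:
# Create combined visualization using real Sentinel-2 data
fig, ax = plt.subplots(figsize=(14, 12))

# Map extent of the raster, taken from its bounds (left, bottom, right, top).
# The boundaries are drawn in their own coordinates, so both layers must share a CRS.
minx = bounds[0]
maxx = bounds[2]
miny = bounds[1]
maxy = bounds[3]
extent = [minx, maxx, miny, maxy]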

# Display NDVI from real data as background
im = ax.imshow(ndvi, cmap='RdYlGn', vmin=-0.2, vmax=0.9,
               extent=extent, origin='upper')

# Overlay province boundaries
philippines_gdf.boundary.plot(ax=ax, edgecolor='blue', linewidth=2, label='Province Boundaries')

# Highlight Palawan
palawan_gdf.boundary.plot(ax=ax, edgecolor='red', linewidth=3, label='Palawan (highlighted)')

# Add colorbar
cbar = plt.colorbar(im, ax=ax, shrink=0.7, pad=0.02)
cbar.set_label('NDVI (Real Sentinel-2 Data)', fontsize=12)

ax.set_xlabel('Longitude (°E)', fontsize=11)
ax.set_ylabel('Latitude (°N)', fontsize=11)
ax.set_title('Real NDVI Data with Province Boundaries Overlay - Palawan',
             fontsize=14, fontweight='bold', pad=20)
ax.legend(loc='upper right', fontsize=10)
ax.grid(True, alpha=0.3, linestyle='--')

plt.tight_layout()
plt.show()

print("✓ Combined vector-raster visualization created!")
print("  Using real Sentinel-2 NDVI from Microsoft Planetary Computer")
print("  This demonstrates spatial integration of real satellite data with vector boundaries")
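
`imshow` expects `extent` as (left, right, bottom, top), while raster bounds are usually reported as (left, bottom, right, top), which is why the values are reordered above. If only the affine transform and the array shape are at hand, the same extent can be derived from them. This is a small sketch, assuming a north-up raster with no rotation. `extent_from_transform` is a hypothetical helper and the coefficients are made-up example values.

```python
def extent_from_transform(transform, height, width):
    """Return (left, right, bottom, top) for matplotlib's imshow.

    transform is a sequence whose first six values are (a, b, c, d, e, f) in
    the order used by rasterio's Affine: a = pixel width, e = pixel height
    (negative for north-up rasters), c and f = x and y of the top-left corner.
    """
    a, b, c, d, e, f = transform[:6]
    if b != 0 or d != 0:
        raise ValueError("Rotated rasters are not handled by this sketch")
    left, top = c, f
    right = c + a * width
    bottom = f + e * height
    return (left, right, bottom, top)

# Example: 10 m pixels, 1000 x 1000 array, top-left corner at (500000, 1100000)
example_transform = (10.0, 0.0, 500000.0, 0.0, -10.0, 1100000.0)
extent = extent_from_transform(example_transform, height=1000, width=1000)
print(extent)  # (500000.0, 510000.0, 1090000.0, 1100000.0)
```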

### 5.3 Zonal Statistics - Calculate Mean NDVI per Province

In [None]:
from rasterio.transform import rowcol

# Simple approach: sample NDVI at province centroids using real Sentinel-2 data.
# For full zonal statistics over whole polygons, use the rasterstats library
# (not installed by default).

def sample_raster_at_point(lon, lat, raster_array, transform_obj):
    """
    Sample raster value at given coordinates.

    The coordinates must be in the same CRS as the raster transform.
    """
    # Convert map coordinates to pixel (row, col) indices
    row, col = rowcol(transform_obj, lon, lat)
    
    # Check bounds
    if 0 <= row < raster_array.shape[0] and 0 <= col < raster_array.shape[1]:
        return raster_array[row, col]
    else:
        return np.nan

# Sample NDVI at each province centroid
ndvi_values = []
for idx, row in philippines_gdf.iterrows():
    centroid = row['centroid']
    ndvi_val = sample_raster_at_point(centroid.x, centroid.y, ndvi, transform)
    ndvi_values.append(ndvi_val)

philippines_gdf['NDVI_Centroid'] = ndvi_values

print("="*70)
print("NDVI AT PROVINCE CENTROIDS (sampled from real Sentinel-2 data)")
print("="*70)
print(f"{'Province':<25} {'NDVI':>10} {'Vegetation Class':>20}")
print("-"*70)

for idx, row in philippines_gdf.iterrows():
    ndvi_val = row['NDVI_Centroid']
    if np.isnan(ndvi_val):
        veg_class = "Outside raster"
    elif ndvi_val < 0:
        veg_class = "Water/Bare"
    elif ndvi_val < 0.2:
        veg_class = "Sparse"
    elif ndvi_val < 0.5:
        veg_class = "Moderate"
    elif ndvi_val < 0.8:
        veg_class = "Dense"
    else:
        veg_class = "Very Dense"
    
    print(f"{row['Province']:<25} {ndvi_val:>10.3f} {veg_class:>20}")

print("="*70)
print("\n✓ NDVI values sampled from real Microsoft Planetary Computer data")
print("Note: For accurate zonal statistics covering full polygons,")
print("      use the rasterstats library (provides mean, median, min, max per polygon)")

### 5.4 Saving Results

In [None]:
import os

# Save NDVI calculated from real Sentinel-2 data as GeoTIFF
os.makedirs(f'{work_dir}/outputs', exist_ok=True)  # make sure the output folder exists
ndvi_path = f'{work_dir}/outputs/palawan_ndvi_real.tif'

# Create metadata profile
ndvi_meta = {
    'driver': 'GTiff',
    'height': height,
    'width': width,
    'count': 1,
    'dtype': 'float32',
    'crs': crs,
    'transform': transform,
    'compress': 'lzw',
    'nodata': -9999
}

# Replace NaN/inf pixels with the declared nodata value before writing
ndvi_out = np.where(np.isfinite(ndvi), ndvi, ndvi_meta['nodata']).astype('float32')

with rasterio.open(ndvi_path, 'w', **ndvi_meta) as dst:
    dst.write(ndvi_out, 1)
    dst.set_band_description(1, 'NDVI from Real Sentinel-2 L2A')

print(f"✓ Real NDVI saved: {ndvi_path}")

# Save updated GeoDataFrame with NDVI values
vector_path = f'{work_dir}/outputs/provinces_with_real_ndvi.geojson'

# Create a copy for saving (to avoid modifying original)
gdf_to_save = philippines_gdf.copy()

# Convert centroid geometry column to coordinates
if 'centroid' in gdf_to_save.columns:
    gdf_to_save['centroid_lon'] = gdf_to_save['centroid'].x
    gdf_to_save['centroid_lat'] = gdf_to_save['centroid'].y
    gdf_to_save = gdf_to_save.drop(columns=['centroid'])

# Save to GeoJSON
gdf_to_save.to_file(vector_path, driver='GeoJSON')

print(f"✓ Vector data with NDVI saved: {vector_path}")
print("  Attributes saved: Province, Region, Island_Group, Population, Area, Density, NDVI_Centroid")
print("  Centroid coordinates saved as: centroid_lon, centroid_lat")
print(f"\n✓ All outputs saved to: {work_dir}/outputs/")
print("  NDVI values are from REAL Sentinel-2 data acquired from Microsoft Planetary Computer!")

---

## Part 6: Best Practices and Common Pitfalls

### 6.1 Memory Management

In [None]:
print("BEST PRACTICES FOR MEMORY MANAGEMENT:\n")

print("1. ALWAYS use context managers (with statements):")
print("   ✓ with rasterio.open('file.tif') as src:")
print("       data = src.read()")
print("   ✗ src = rasterio.open('file.tif')  # Don't forget to close!\n")

print("2. Read only what you need:")
print("   ✓ band = src.read(1)  # Single band")
print("   ✗ all_bands = src.read()  # All bands (if you only need one)\n")

print("3. Use windowed reading for large files (as we did for Palawan data):")
print("   from rasterio.windows import Window")
print("   window = Window(0, 0, 1000, 1000)  # 1000x1000 subset")
print("   data = src.read(1, window=window)")
print("   ✓ We used this to load only 1000x1000 pixels instead of the full scene!\n")

print("4. Process in chunks for very large datasets:")
print("   for ji, window in src.block_windows(1):")
print("       data = src.read(1, window=window)")
print("       # Process chunk")
print("       # Write result\n")

print("5. Delete large arrays when done:")
print("   del large_array")
print("   import gc; gc.collect()  # Force garbage collection\n")

print("6. Use cloud-optimized data sources:")
print("   ✓ Microsoft Planetary Computer, AWS Open Data")
print("   ✓ Cloud-optimized GeoTIFF (COG) format")
print("   ✓ Stream only the data you need without downloading entire files")
print("   ✓ This is how we accessed real Sentinel-2 data in this notebook!")

### 6.2 CRS Alignment - CRITICAL!

In [None]:
print("CRS (Coordinate Reference System) ALIGNMENT:\n")

print("ALWAYS check CRS before combining data!\n")

# Example: Check and align CRS
print("Step 1: Check CRS")
print(f"  Vector CRS: {philippines_gdf.crs}")
print(f"  Raster CRS: {src.crs}")

print("\nStep 2: Reproject if needed")
print("  if vector.crs != raster.crs:")
print("      vector = vector.to_crs(raster.crs)")
print("      print('Vector reprojected!')\n")

print("COMMON CRS IN PHILIPPINES:")
print("  EPSG:4326  - WGS84 Geographic (lat/lon in degrees)")
print("  EPSG:32650 - WGS84 / UTM Zone 50N (meters, 114-120°E: Palawan)")
print("  EPSG:32651 - WGS84 / UTM Zone 51N (meters, 120-126°E: most of Luzon, Visayas, Mindanao)")
print("  EPSG:32652 - WGS84 / UTM Zone 52N (meters, east of 126°E: easternmost Mindanao)")
print("  EPSG:3123  - PRS92 / Philippines Zone III")
print("  EPSG:3124  - PRS92 / Philippines Zone IV")
print("  EPSG:3125  - PRS92 / Philippines Zone V\n")

print("PRO TIP: Use UTM for accurate area/distance calculations!")
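
Which UTM zone to use depends on longitude: each zone is 6° wide, and the WGS84 northern-hemisphere zones have EPSG codes 32601 to 32660. Below is a small sketch of that arithmetic. `utm_epsg_from_lonlat` is an illustrative helper that ignores the special zones around Norway and Svalbard, and the coordinates are approximate city locations used only as examples.

```python
import math

def utm_epsg_from_lonlat(lon, lat):
    """Return the EPSG code of the WGS84 / UTM zone containing a point."""
    zone = int(math.floor((lon + 180) / 6)) + 1
    zone = min(max(zone, 1), 60)           # keep lon = 180 inside zone 60
    base = 32600 if lat >= 0 else 32700    # 326xx = north, 327xx = south
    return base + zone

# Approximate (lon, lat) values, for illustration only
places = {
    "Puerto Princesa, Palawan": (118.74, 9.74),
    "Manila": (120.98, 14.60),
    "Davao City": (125.61, 7.07),
}
for name, (lon, lat) in places.items():
    print(f"{name:<26} EPSG:{utm_epsg_from_lonlat(lon, lat)}")
```

The result can be passed to `gdf.to_crs(epsg=...)` before computing areas or distances. GeoPandas also provides `GeoDataFrame.estimate_utm_crs()` for the same purpose.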

### 6.3 Handling NoData Values

In [None]:
print("HANDLING NODATA VALUES:\n")

# Check for nodata value
print(f"Current raster nodata value: {src.nodata}")

print("\nMethod 1: Read with masked=True")
print("  data = src.read(1, masked=True)  # Returns np.ma.MaskedArray")
print("  valid_mean = data.mean()  # Automatically ignores nodata")

print("\nMethod 2: Manual masking")
print("  data = src.read(1)")
print("  if src.nodata is not None:")
print("      valid_data = data[data != src.nodata]")
print("      valid_mean = valid_data.mean()")

print("\nMethod 3: NumPy masked arrays")
print("  import numpy.ma as ma")
print("  masked_data = ma.masked_equal(data, src.nodata)")
print("  valid_mean = masked_data.mean()")

print("\nWHY IT MATTERS:")
print("  NoData pixels can skew statistics if not handled!")
print("  Example: mean() of [100, 100, -9999] = -3266 (WRONG!)")
print("           mean() excluding nodata = 100 (CORRECT!)")
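
The arithmetic in the example above is easy to verify with a three-element array. This is a minimal sketch using `numpy.ma`, with -9999 as the nodata value taken from that example.

```python
import numpy as np
import numpy.ma as ma

NODATA = -9999
data = np.array([100, 100, NODATA], dtype="float32")

naive_mean = data.mean()                  # nodata treated as a real value
masked = ma.masked_equal(data, NODATA)    # hide nodata pixels
valid_mean = masked.mean()

print(f"Mean including nodata: {naive_mean:.1f}")
print(f"Mean excluding nodata: {valid_mean:.1f}")
print(f"Valid pixels: {masked.count()} of {data.size}")
```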

### 6.4 Common Errors and Solutions

In [None]:
print("COMMON ERRORS AND SOLUTIONS:\n")
print("="*70)

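print("\n1. 'ValueError: cannot set EPSG:4326 CRS'")
print("   CAUSE: set_crs() called on data that already has a different CRS")
print("   FIX: To transform coordinates, use gdf.to_crs('EPSG:4326')")
print("        Only relabel with gdf.set_crs('EPSG:4326', allow_override=True) if the stored CRS is wrong\n")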

print("2. 'IndexError: index 1 is out of bounds'")
print("   CAUSE: Trying to read band that doesn't exist")
print("   FIX: Check src.count before reading")
print("        bands = src.read([1, 2, 3])  # Read multiple\n")

print("3. 'TypeError: integer argument expected, got float'")
print("   CAUSE: Pixel coordinates must be integers")
print("   FIX: row, col = int(row), int(col)\n")

print("4. 'MemoryError: Unable to allocate array'")
print("   CAUSE: Trying to load massive raster into memory")
print("   FIX: Use windowed reading or downsample")
print("        data = src.read(1, out_shape=(500, 500))\n")

print("5. 'RuntimeWarning: invalid value encountered in divide'")
print("   CAUSE: Division by zero in NDVI/NDWI calculation")
print("   FIX: Divide only where the denominator is non-zero (on float arrays)")
print("        ndvi = np.divide(nir - red, denom, out=np.zeros_like(denom), where=denom != 0)\n")

print("6. 'GeoDataFrame.to_file() slow for large datasets'")
print("   CAUSE: Shapefile format is slow")
print("   FIX: Use GeoPackage (GeoJSON is a text format and also slow for large layers)")
print("        gdf.to_file('data.gpkg', driver='GPKG')  # Faster!\n")

print("="*70)
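
For error 5, note that `np.where` on its own does not silence the warning, because the division is still evaluated for every pixel before `np.where` selects the results. One way around this is `np.divide` with its `where` argument, combined with a cast to float so that unsigned integer bands do not wrap around on subtraction. This is a sketch under those assumptions. `safe_normalized_difference` is an illustrative name, not the notebook's `calculate_ndvi`.

```python
import numpy as np

def safe_normalized_difference(band_a, band_b, fill=np.nan):
    """Compute (a - b) / (a + b), writing `fill` where a + b == 0."""
    a = band_a.astype("float32")   # uint16 inputs would wrap around on a - b
    b = band_b.astype("float32")
    denom = a + b
    out = np.full(denom.shape, fill, dtype="float32")
    np.divide(a - b, denom, out=out, where=denom != 0)
    return out

nir_demo = np.array([[4000, 0], [500, 3000]], dtype="uint16")
red_demo = np.array([[1000, 0], [1500, 3000]], dtype="uint16")

ndvi_demo = safe_normalized_difference(nir_demo, red_demo)
print(ndvi_demo)
```

The same helper covers NDWI by passing the green and NIR bands as `band_a` and `band_b`. Using NaN as the fill value keeps empty pixels out of `np.nanmean` and similar statistics.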

---

## Summary and Key Takeaways

### What You've Learned Today:

#### 1. **GeoPandas for Vector Data**
✓ Loading and inspecting shapefiles/GeoJSON  
✓ Filtering by attributes and spatial queries  
✓ CRS transformations and projections  
✓ Creating professional maps and visualizations  
✓ Spatial operations (buffer, intersection, union)

#### 2. **Rasterio for Raster Data**
✓ Reading multi-band satellite imagery  
✓ Extracting metadata and band information  
✓ Processing bands as NumPy arrays  
✓ Calculating statistics and percentiles  
✓ Creating RGB and false color composites

#### 3. **Vegetation Indices**
✓ NDVI calculation and interpretation  
✓ NDWI for water body detection  
✓ Histogram analysis and thresholding  
✓ Land cover classification based on indices

#### 4. **Integrated Workflows**
✓ Clipping rasters to vector boundaries  
✓ Overlaying vectors on rasters  
✓ Zonal statistics (per-province analysis)  
✓ Saving results in multiple formats

#### 5. **Best Practices**
✓ Memory management techniques  
✓ CRS alignment (CRITICAL!)  
✓ NoData value handling  
✓ Error prevention and debugging

---

### Why This Matters for AI/ML

**These skills are ESSENTIAL for:**

1. **Data Preparation**
   - Loading training data (labeled polygons)
   - Preprocessing satellite imagery
   - Creating feature layers for models

2. **Feature Engineering**
   - Calculating spectral indices (NDVI, NDWI, etc.)
   - Extracting texture features
   - Creating multi-temporal composites

3. **Model Training**
   - Sampling training pixels
   - Creating validation datasets
   - Balancing class distributions

4. **Result Analysis**
   - Visualizing model predictions
   - Calculating accuracy metrics
   - Validating against ground truth

5. **Deployment**
   - Processing new satellite scenes
   - Generating operational products
   - Creating decision support maps

---

### Philippine EO Applications

**You can now build applications for:**

**Disaster Risk Reduction (DRR):**
- Flood extent mapping using NDWI
- Landslide susceptibility analysis
- Typhoon damage assessment

**Climate Change Adaptation (CCA):**
- Vegetation health monitoring (NDVI)
- Drought impact assessment
- Coastal erosion detection

**Natural Resource Management (NRM):**
- Forest cover monitoring
- Agricultural land mapping
- Marine protected area monitoring

---

## Next Session: Google Earth Engine Python API

**Session 4 will cover:**
- Accessing petabytes of satellite data in the cloud
- Processing Sentinel-1 and Sentinel-2 at scale
- Cloud masking and temporal compositing
- Exporting data for ML workflows
- Integrating GEE with local Python analysis

**Preview:**
```python
import ee
ee.Initialize()

# Access entire Sentinel-2 archive
s2 = ee.ImageCollection('COPERNICUS/S2_SR_HARMONIZED') \
    .filterBounds(palawan) \
    .filterDate('2024-01-01', '2024-12-31') \
    .map(mask_clouds)

# Create cloud-free composite
composite = s2.median()

# Calculate NDVI at planetary scale!
ndvi = composite.normalizedDifference(['B8', 'B4'])
```

---

## Additional Resources

### Documentation
- **GeoPandas:** https://geopandas.org/
- **Rasterio:** https://rasterio.readthedocs.io/
- **NumPy:** https://numpy.org/doc/
- **Matplotlib:** https://matplotlib.org/

### Tutorials
- **Carpentries Geospatial Python:** https://carpentries-incubator.github.io/geospatial-python/
- **Earth Data Science:** https://www.earthdatascience.org/
- **Python for Geospatial Analysis:** https://www.tomasbeuzen.com/python-for-geospatial-analysis/

### Philippine Data Sources
- **PhilSA:** https://philsa.gov.ph/
- **NAMRIA Geoportal:** https://www.geoportal.gov.ph/
- **DOST-ASTI DATOS:** https://asti.dost.gov.ph/
- **HDX Philippines:** https://data.humdata.org/group/phl
- **HazardHunterPH:** https://hazardhunter.georisk.gov.ph/

### Books
- *Geoprocessing with Python* (Garrard)
- *Learning Geospatial Analysis with Python* (Lawhead)
- *Python for Data Analysis* (McKinney)

---

## Practice Exercises (Optional Homework)

To reinforce your learning:

### Exercise A: Multi-Province Analysis
Calculate and compare NDVI statistics for all provinces in one island group.

### Exercise B: Time-Series Simulation
Create multiple synthetic images representing different seasons and analyze NDVI changes.

### Exercise C: Custom Index
Research and implement another vegetation index (EVI, SAVI, or MSAVI).

### Exercise D: Real Data
Download actual Sentinel-2 data from Copernicus Data Space and apply these techniques.

### Exercise E: Water Detection
Use NDWI to create a binary water mask and calculate total water area.

---

## Clean Up

In [None]:
# Close raster file
src.close()

# Clean up temporary files (optional)
import os
temp_files = [raster_path]

for f in temp_files:
    if os.path.exists(f):
        os.remove(f)
        print(f"Removed: {f}")

print("\n✓ Cleanup complete!")
print(f"\nYour outputs are saved in: {work_dir}/outputs/")

---

# 🎉 Congratulations!

You've completed **Day 1, Session 3** of the CoPhil AI/ML Training!

### You now have the skills to:
✅ Work with vector data using GeoPandas  
✅ Process satellite imagery with Rasterio  
✅ Calculate vegetation indices (NDVI, NDWI)  
✅ Combine vector and raster data  
✅ Create professional visualizations  
✅ Apply best practices for geospatial Python  

### These are the **foundational skills** for ALL AI/ML work in Earth Observation!

**Ready for Session 4?** We'll take these skills to the cloud with Google Earth Engine!

---

*🤖 Generated with Claude Code for CoPhil Digital Space Campus*

*EU-Philippines Copernicus Capacity Support Programme*

*Data-Centric AI for Earth Observation*

---