# Lab 3.1: Sentinel-2 Data Download from Copernicus Dataspace

**Part of the Iceland ML Course: Sentinel-2 Classification Project**

This notebook demonstrates how to acquire and download Sentinel-2 satellite imagery using the Copernicus Dataspace Ecosystem API for land classification tasks.

---

## Project Milestone Overview

| Lab | Milestone | Status |
|-----|-----------|--------|
| Lab 1 | HPC Access Setup | ✅ Previous |
| Lab 2 | Jupyter-JSC & Git | ✅ Previous |
| Lab 3.1 | **Data Download (Copernicus API)** | 🔄 **Current** |
| Lab 3.2 | Data Preprocessing & Patch Extraction | ⬜ Next |
| Lab 4 | Understanding Transformers | ⬜ Next |
| Lab 4.1 | Training on Sentinel-2 Data | ⬜ Next |
| Lab 5 | Distributed Training (Multi-GPU) | ⬜ Next |
| Lab 6 | Validation & Performance Metrics | ⬜ Next |
| Lab 7 | Foundation Models & TerraTorch | ⬜ Final |

---

## What You'll Learn

By the end of this lab, you will:
- Register for Copernicus Dataspace access
- Authenticate with the Copernicus Dataspace API
- Search for Sentinel-2 Level 2A imagery by date, location, and cloud cover
- Download Sentinel-2 SAFE archives
- Organize downloaded data for preprocessing

## Quick Start

This lab focuses on the **Data Acquisition** phase of the project pipeline:

```
Data Download from Copernicus (Lab 3.1) ← You are here
    ↓
Data Preprocessing & Patch Extraction (Lab 3.2)
    ↓
Model Training (Lab 4+)
```

---

## Overview

This lab focuses on:
1. **Copernicus Dataspace Registration**: Setting up your account and credentials
2. **API Authentication**: Using OAuth2 to access the API
3. **Searching Sentinel-2 Data**: Finding imagery by date, location, and quality
4. **Downloading SAFE Archives**: Getting full Sentinel-2 products
5. **Data Organization**: Structuring files for Lab 3.2 preprocessing

## Part 1: Setup and Registration

First, you need to register for Copernicus Dataspace access and set up authentication credentials.

### Step 1: Register and Create OAuth2 Client

**Important**: The Copernicus Dataspace uses OAuth2 authentication. Follow these steps:

1. Go to **Copernicus Browser Dashboard**: https://browser.dataspace.copernicus.eu/
2. Log in to your account (create one if needed)
3. Click your profile (top right) → **User Settings**
4. Locate the **"OAuth clients"** section
5. Click **"Create"** to register a new OAuth client
6. **Give it a name** (e.g., "Iceland-ML-Lab")
7. Set expiration or select "Never expire" (recommended for HPC)
8. Click **"Create"**
9. Copy your **Client ID** and **Client Secret** and save them securely

⚠️ **Security Note**: Your client secret won't be shown again! Save it immediately.

**Your credentials:**
- **Client ID**: `sh-....`
- **Client Secret**: `<YOUR SECRET>`

In [None]:
import requests
import os
import json
from datetime import datetime
import zipfile
from pathlib import Path

# For visualization later
import matplotlib.pyplot as plt

### Step 2: Configure OAuth2 Client Credentials

OAuth2 Client Credentials is the recommended authentication method for server-to-server communication and HPC jobs. Store credentials securely as environment variables.
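
A minimal sketch of reading the two variables and failing fast when either is missing, so that a job stops before it sends placeholder values to the identity server. The helper name `load_credentials` and its error message are illustrative assumptions, not part of the Copernicus API; only the variable names `COPERNICUS_CLIENT_ID` and `COPERNICUS_CLIENT_SECRET` come from this lab.

```python
import os


def load_credentials(env=None):
    """Return (client_id, client_secret) read from environment variables.

    `load_credentials` is a hypothetical helper for this lab. It raises
    instead of silently falling back to placeholder strings.
    """
    env = os.environ if env is None else env
    names = ("COPERNICUS_CLIENT_ID", "COPERNICUS_CLIENT_SECRET")
    # Treat unset and empty variables the same way
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env[names[0]], env[names[1]]
```

Calling `load_credentials()` with no argument reads from the real environment; passing a plain dictionary makes the helper easy to try out without touching your shell configuration.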

In [None]:
# Configure your Copernicus Dataspace OAuth2 Client Credentials
# Register OAuth client in Dashboard: https://browser.dataspace.copernicus.eu/

# Option 1: Set as environment variables (recommended for HPC)
# export COPERNICUS_CLIENT_ID="your_client_id"
# export COPERNICUS_CLIENT_SECRET="your_client_secret"

# Option 2: Replace the placeholder defaults below (for testing only - DO NOT COMMIT TO GIT!)
COPERNICUS_CLIENT_ID = os.getenv('COPERNICUS_CLIENT_ID', '<Your ID>')
COPERNICUS_CLIENT_SECRET = os.getenv('COPERNICUS_CLIENT_SECRET', '<Your Secret>')

# Copernicus Dataspace API endpoints
AUTH_URL = "https://identity.dataspace.copernicus.eu/auth/realms/CDSE/protocol/openid-connect/token"
SEARCH_URL = "https://catalogue.dataspace.copernicus.eu/odata/v1/Products"
DOWNLOAD_URL = "https://zipper.dataspace.copernicus.eu/odata/v1/Products"

print("✓ Credentials configured")

### Step 3: Set Environment Variables (Optional)

Store your credentials as environment variables instead of hardcoding them.

**On your HPC system:**

```bash
# Add to ~/.bashrc or ~/.bash_profile
export COPERNICUS_CLIENT_ID="<Your ID>"
export COPERNICUS_CLIENT_SECRET="<Your Secret>"

# Then reload
source ~/.bashrc
```

**Verify setup:**

```bash
# Check that the variables are set without printing the secret itself
echo "$COPERNICUS_CLIENT_ID"
[ -n "$COPERNICUS_CLIENT_SECRET" ] && echo "COPERNICUS_CLIENT_SECRET is set"
```

In [None]:
def get_access_token(client_id, client_secret):
    """
    Get OAuth2 access token from Copernicus Dataspace using Client Credentials flow.
    
    This is the recommended method for server-to-server authentication and HPC jobs.
    
    Parameters:
    -----------
    client_id : str
        OAuth2 Client ID (starts with 'sh-')
    client_secret : str
        OAuth2 Client Secret
    
    Returns:
    --------
    str : Access token if successful, None otherwise
    """
    data = {
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
    }
    
    try:
        response = requests.post(AUTH_URL, data=data, timeout=30)
        response.raise_for_status()
        token_data = response.json()
        
        # Extract token and expiration
        access_token = token_data["access_token"]
        expires_in = token_data.get("expires_in", 3600)
        
        print(f"✓ Token obtained (valid for {expires_in//60} minutes)")
        return access_token
        
    except requests.exceptions.HTTPError as e:
        if e.response.status_code == 401:
            print("❌ Authentication failed: Invalid credentials")
            print("   ✗ Check your CLIENT_ID and CLIENT_SECRET")
            print("   ✗ CLIENT_ID should start with 'sh-'")
        else:
            print(f"❌ HTTP {e.response.status_code}: {e.response.text}")
        return None
        
    except Exception as e:
        print(f"❌ Authentication failed: {e}")
        return None

# Get access token
print("Authenticating with Copernicus Dataspace...")
access_token = get_access_token(COPERNICUS_CLIENT_ID, COPERNICUS_CLIENT_SECRET)

if access_token:
    print("✓ Successfully authenticated")
    headers = {"Authorization": f"Bearer {access_token}"}
else:
    print("❌ Authentication failed.")
    print("\nTroubleshooting:")
    print("1. Check environment variables are set:")
    print(f"   COPERNICUS_CLIENT_ID = {COPERNICUS_CLIENT_ID[:15]}...")
    # Never print any part of the secret; only report whether it has been set
    secret_is_set = COPERNICUS_CLIENT_SECRET not in ("", "<Your Secret>")
    print(f"   COPERNICUS_CLIENT_SECRET set: {secret_is_set}")
    print("2. Verify credentials in Copernicus Dashboard")
    print("3. See COPERNICUS_SETUP.md for detailed instructions")
    headers = None

---

## Part 2: Define Search Parameters

Specify your study area, date range, and quality criteria for Sentinel-2 imagery.
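
The region of interest is passed to the API as a WKT polygon in longitude-latitude order, with the first corner repeated at the end to close the ring. A small sketch of building that string from a bounding box is shown here; `bbox_to_wkt` is a hypothetical helper name, and the range checks are an added convenience rather than an API requirement.

```python
def bbox_to_wkt(min_lon, min_lat, max_lon, max_lat):
    """Build a closed WKT POLYGON (lon lat order) from a bounding box."""
    if not (-180 <= min_lon < max_lon <= 180):
        raise ValueError("longitudes must satisfy -180 <= min_lon < max_lon <= 180")
    if not (-90 <= min_lat < max_lat <= 90):
        raise ValueError("latitudes must satisfy -90 <= min_lat < max_lat <= 90")
    # Walk the four corners and repeat the first one to close the ring
    corners = [
        (min_lon, min_lat),
        (max_lon, min_lat),
        (max_lon, max_lat),
        (min_lon, max_lat),
        (min_lon, min_lat),
    ]
    ring = ",".join(f"{lon} {lat}" for lon, lat in corners)
    return f"POLYGON(({ring}))"


print(bbox_to_wkt(8.0, 47.0, 14.0, 49.5))
# POLYGON((8.0 47.0,14.0 47.0,14.0 49.5,8.0 49.5,8.0 47.0))
```

The checks catch reversed minimum and maximum values and out-of-range coordinates. They cannot detect swapped latitude and longitude when both values happen to be valid, so it is still worth confirming the box on a map.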

### Define Region of Interest and Date Range

In [None]:
# Define your Region of Interest (ROI) as a bounding box
# Format: POLYGON((lon lat, lon lat, ...))
# Example: Bavarian region (Germany) - has CORINE coverage
# Note: You will be selecting an MGRS tile within this region

# Bounding box coordinates [min_lon, min_lat, max_lon, max_lat]
# Bavarian region (Central Europe)
min_lon, min_lat = 8.0, 47.0
max_lon, max_lat = 14.0, 49.5

# Create WKT POLYGON for API query
roi_polygon = f"POLYGON(({min_lon} {min_lat},{max_lon} {min_lat},{max_lon} {max_lat},{min_lon} {max_lat},{min_lon} {min_lat}))"

# Define date range
start_date = '2018-03-01T00:00:00.000Z'
end_date = '2018-10-31T23:59:59.999Z'

# Maximum cloud cover percentage (30% to get good data availability)
max_cloud_cover = 30

print(f"Search Parameters:")
print(f"  Region: Bavarian region (Central Europe with CORINE coverage)")
print(f"  Bounding Box: ({min_lon}, {min_lat}) to ({max_lon}, {max_lat})")
print(f"  Date Range: {start_date[:10]} to {end_date[:10]}")
print(f"  Max Cloud Cover: {max_cloud_cover}%")
print(f"\nNote: Sentinel-2 divides the globe into MGRS tiles (100×100 km each).")
print(f"Your search will find all tiles intersecting this region.")

---

## Part 3: Search for Sentinel-2 Data

Query the Copernicus Dataspace catalog for Sentinel-2 Level 2A imagery matching your criteria.
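
The search function in this part requests a single page of up to 1000 results. For a large region and a long date range, more products may match than fit in one page. The sketch below follows the standard OData paging field `@odata.nextLink` to collect further pages. It assumes the catalogue includes that field when more results are available, which should be checked against the current API documentation. The names `iter_products` and `fetch_with_requests` are illustrative.

```python
def iter_products(first_url, params, fetch_json, max_pages=20):
    """Yield products page by page, following '@odata.nextLink'.

    fetch_json(url, params) must return the decoded JSON body as a dict.
    max_pages is a safety cap so a paging bug cannot loop forever.
    """
    url, query = first_url, params
    for _ in range(max_pages):
        page = fetch_json(url, query)
        yield from page.get("value", [])
        url = page.get("@odata.nextLink")
        if not url:
            break
        query = None  # the next link already encodes the full query


def fetch_with_requests(url, params):
    """Fetch one page over HTTP (requests is imported lazily)."""
    import requests

    response = requests.get(url, params=params, timeout=60)
    response.raise_for_status()
    return response.json()
```

With the `SEARCH_URL` and a `params` dictionary like the one built inside `search_sentinel2`, usage would look like `products = list(iter_products(SEARCH_URL, params, fetch_with_requests))`. Passing the fetch function as an argument keeps the paging logic testable without network access.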

### Search Sentinel-2 Collection

In [None]:
def search_sentinel2(start_date, end_date, roi_polygon, max_cloud_cover):
    """
    Search for Sentinel-2 L2A products in Copernicus Dataspace
    """
    # Build OData filter query
    filters = [
        f"Collection/Name eq 'SENTINEL-2'",
        f"Attributes/OData.CSC.StringAttribute/any(att:att/Name eq 'productType' and att/OData.CSC.StringAttribute/Value eq 'S2MSI2A')",
        f"ContentDate/Start gt {start_date}",
        f"ContentDate/Start lt {end_date}",
        f"OData.CSC.Intersects(area=geography'SRID=4326;{roi_polygon}')",
        f"Attributes/OData.CSC.DoubleAttribute/any(att:att/Name eq 'cloudCover' and att/OData.CSC.DoubleAttribute/Value lt {max_cloud_cover})"
    ]
    
    filter_query = " and ".join(filters)
    
    params = {
        "$filter": filter_query,
        "$orderby": "ContentDate/Start asc",
        "$top": 1000  # Up to 1000 results per request; if more products match, the latest dates are cut off
    }
    
    try:
        response = requests.get(SEARCH_URL, params=params, timeout=60)
        response.raise_for_status()
        results = response.json()
        return results.get('value', [])
    except Exception as e:
        print(f"❌ Search failed: {e}")
        return []

print("✓ Search function defined")

In [None]:
print("Searching for Sentinel-2 products...")
products = search_sentinel2(start_date, end_date, roi_polygon, max_cloud_cover)

print(f"\n✓ Found {len(products)} Sentinel-2 L2A products")
print(f"✓ All products have <{max_cloud_cover}% cloud cover (filtered server-side)")
print(f"\nThese products span multiple MGRS tiles over your region.")
print(f"Select one MGRS tile and download ~4 acquisitions.\n")
print(f"First 5 products:")
for i, product in enumerate(products[:5]):
    name = product.get('Name', 'Unknown')
    date = product.get('ContentDate', {}).get('Start', 'N/A')[:10]
    size = product.get('ContentLength', 0) / (1024**3)
    
    print(f"  {i+1}. {name}")
    print(f"     Date: {date}, Size: {size:.2f} GB")

### Group Products by MGRS Tile

Before downloading, let's organize products by MGRS tile to ensure we download multiple acquisitions of the same tile.
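
Sentinel-2 product names encode the mission, processing level, sensing time, relative orbit, and MGRS tile in underscore-separated fields. As an alternative to splitting on `_` and indexing, the sketch below parses the name with a regular expression and returns `None` for names that do not fit the pattern. The helper `parse_product_name` and the example product name are illustrative assumptions, not values taken from a real search result.

```python
import re
from datetime import datetime

# Fields: mission, level, sensing time, baseline, relative orbit, tile
PRODUCT_NAME = re.compile(
    r"^(?P<mission>S2[A-Z])_(?P<level>MSIL\d[A-Z])_(?P<sensing>\d{8}T\d{6})_"
    r"N\d{4}_R(?P<orbit>\d{3})_(?P<tile>T\d{2}[A-Z]{3})_"
)


def parse_product_name(name):
    """Return a dict of name fields, or None if the name does not match."""
    match = PRODUCT_NAME.match(name)
    if match is None:
        return None
    info = match.groupdict()
    # Convert the compact timestamp into a datetime for sorting and filtering
    info["sensing"] = datetime.strptime(info["sensing"], "%Y%m%dT%H%M%S")
    return info


# Hypothetical product name, used only to illustrate the format
example = "S2A_MSIL2A_20180315T102021_N0206_R065_T32UPD_20180315T141523.SAFE"
info = parse_product_name(example)
print(info["tile"], info["sensing"].date())
# T32UPD 2018-03-15
```

Returning `None` for unexpected names makes it explicit which products were skipped, rather than grouping them under a catch-all key.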

In [None]:
# Group products by MGRS tile
from collections import defaultdict

tiles = defaultdict(list)
for product in products:
    product_name = product.get('Name', '')
    # Extract MGRS tile from product name (e.g., T32UPD from S2A_MSIL2A_..._T32UPD_...)
    tile_id = product_name.split('_')[5] if len(product_name.split('_')) > 5 else 'Unknown'
    tiles[tile_id].append(product)

# Display available tiles and their acquisition counts
print("Available MGRS Tiles and Acquisition Counts:")
print("=" * 50)
for tile_id, tile_products in sorted(tiles.items(), key=lambda x: len(x[1]), reverse=True):
    print(f"\nTile {tile_id}: {len(tile_products)} acquisitions")
    for i, product in enumerate(tile_products[:5]):  # Show first 5
        date = product.get('ContentDate', {}).get('Start', 'N/A')[:10]
        size = product.get('ContentLength', 0) / (1024**3)
        print(f"  {i+1}. {date} - {size:.2f} GB")
    if len(tile_products) > 5:
        print(f"  ... and {len(tile_products) - 5} more")

print("\n" + "=" * 50)
print(f"\nRecommendation: Choose a tile with 4+ acquisitions for training data diversity.")

In [None]:
# Select a tile to work with (choose the one with most acquisitions, or specify manually)
# Option 1: Automatic - select tile with most acquisitions
selected_tile = max(tiles.items(), key=lambda x: len(x[1]))[0] if tiles else None

# Option 2: Manual selection - uncomment and specify tile ID
# selected_tile = "T32UPD"  # Replace with your chosen tile

if selected_tile:
    tile_products = tiles[selected_tile]
    num_acquisitions = len(tile_products)
    
    print(f"Selected MGRS Tile: {selected_tile}")
    print(f"Total acquisitions available: {num_acquisitions}")
    
    # Select 4 evenly spaced acquisitions for temporal diversity
    num_to_select = 4
    if num_acquisitions >= num_to_select:
        # Calculate indices for evenly spaced selection
        indices = [int(i * (num_acquisitions - 1) / (num_to_select - 1)) for i in range(num_to_select)]
        selected_products = [tile_products[i] for i in indices]
    else:
        # If fewer than 4 acquisitions, use all of them
        selected_products = tile_products
        indices = list(range(len(tile_products)))
    
    print(f"\nSelected {len(selected_products)} evenly-spaced acquisitions for temporal diversity:")
    print("=" * 70)
    for i, (idx, product) in enumerate(zip(indices, selected_products)):
        name = product.get('Name', 'Unknown')
        date = product.get('ContentDate', {}).get('Start', 'N/A')[:10]
        size = product.get('ContentLength', 0) / (1024**3)
        print(f"  {i+1}. [{idx+1}/{num_acquisitions}] {date} - {size:.2f} GB")
        print(f"      {name}")
    print("=" * 70)
    print("\nThese acquisitions span the full date range for better training data diversity.")
else:
    print("❌ No tiles found. Adjust your search parameters.")
    selected_products = []

---

## Part 4: Download Sentinel-2 Products

Download selected Sentinel-2 SAFE archives to your local or HPC storage.
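
The `download_product` function in this part skips any file that already exists at the target path. If a job is killed mid-transfer, a truncated ZIP could be left behind and later mistaken for a finished download. One common way to guard against that is to write to a temporary `.part` file and rename it only after the last chunk has been written. The sketch below shows the idea in isolation; `save_stream` is a hypothetical helper, not part of the lab code.

```python
import os


def save_stream(chunks, output_file):
    """Write an iterable of byte chunks to output_file via a '.part' file.

    The final name only appears once every chunk has been written, so an
    interrupted transfer never looks like a complete download.
    """
    part_file = output_file + ".part"
    try:
        with open(part_file, "wb") as f:
            for chunk in chunks:
                if chunk:  # skip empty keep-alive chunks
                    f.write(chunk)
        os.replace(part_file, output_file)  # atomic on the same filesystem
    except BaseException:
        # Also covers KeyboardInterrupt; remove the partial file and re-raise
        if os.path.exists(part_file):
            os.remove(part_file)
        raise
    return output_file
```

Inside a streaming request this could be called as `save_stream(response.iter_content(chunk_size=8192), output_file)`. If the process is killed outright, the `.part` file may remain on disk, but the final file name is never created.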

### Download Function

In [None]:
def download_product(product, output_dir, access_token):
    """
    Download a Sentinel-2 product from Copernicus Dataspace
    
    Parameters:
    -----------
    product : dict
        Product metadata from search results
    output_dir : str
        Directory to save downloaded file
    access_token : str
        OAuth2 access token
    
    Returns:
    --------
    str : Path to downloaded file, or None if failed
    """
    product_id = product['Id']
    product_name = product['Name']
    
    # Create download directory if it doesn't exist
    os.makedirs(output_dir, exist_ok=True)
    
    # Output file path
    output_file = os.path.join(output_dir, f"{product_name}.zip")
    
    # Check if already downloaded
    if os.path.exists(output_file):
        print(f"⚠ File already exists: {product_name}.zip")
        return output_file
    
    # Build download URL
    download_url = f"{DOWNLOAD_URL}({product_id})/$value"
    
    headers = {"Authorization": f"Bearer {access_token}"}
    
    try:
        print(f"Downloading: {product_name}")
        print(f"  Size: {product['ContentLength'] / (1024**3):.2f} GB")
        
        # Stream download with progress
        with requests.get(download_url, headers=headers, stream=True, timeout=300) as response:
            response.raise_for_status()
            
            total_size = int(response.headers.get('content-length', 0))
            block_size = 8192
            downloaded = 0
            
            with open(output_file, 'wb') as f:
                for chunk in response.iter_content(chunk_size=block_size):
                    if chunk:
                        f.write(chunk)
                        downloaded += len(chunk)
                        
                        # Print progress every 100 MB
                        if downloaded % (100 * 1024 * 1024) < block_size:
                            progress = (downloaded / total_size) * 100 if total_size > 0 else 0
                            print(f"  Progress: {progress:.1f}% ({downloaded / (1024**3):.2f} GB)")
        
        print(f"✓ Download complete: {output_file}")
        return output_file
        
    except Exception as e:
        print(f"❌ Download failed: {e}")
        # Clean up partial download
        if os.path.exists(output_file):
            os.remove(output_file)
        return None

print("✓ Download function defined")

### Download A Selected Acquisition

In [None]:
# Define download directory (modify for your HPC storage)
download_dir = "/p/scratch/training2600/YOUR_USERNAME/data"

# Download a single product: the first of the acquisitions selected above
if selected_products and access_token:
    selected_product = selected_products[0]
    downloaded_file = download_product(selected_product, download_dir, access_token)
    if downloaded_file:
        print(f"\n✓ Product saved to: {downloaded_file}")
else:
    print("❌ Cannot download: No products selected or authentication failed")

### Download Multiple Acquisitions

In [None]:
# Download the evenly-spaced acquisitions from the selected tile
download_dir = "/p/scratch/training2600/YOUR_USERNAME/data"

if selected_tile and access_token and selected_products:
    print(f"Downloading {len(selected_products)} evenly-spaced acquisitions from tile {selected_tile}...")
    print("=" * 60)
    
    succeeded = 0
    for i, product in enumerate(selected_products):
        date = product.get('ContentDate', {}).get('Start', 'N/A')[:10]
        print(f"\n--- Downloading {i+1}/{len(selected_products)} ({date}) ---")
        # download_product returns a file path on success and None on failure
        if download_product(product, download_dir, access_token):
            succeeded += 1
    
    print("\n" + "=" * 60)
    print(f"✓ {succeeded}/{len(selected_products)} acquisitions available from tile {selected_tile}")
    print(f"✓ Acquisitions are evenly spaced across the time range")
    print(f"✓ Files saved to: {download_dir}")
else:
    print("❌ Cannot download: No tile/products selected or authentication failed")

---

## Part 5: (Optional) Visualize with GEE

To preview your region of interest before downloading, you can use Google Earth Engine (GEE). This step is optional, only for reconnaissance, and requires a Google Earth Engine account.

### Optional: Visualize ROI with GEE

The cell below draws the ROI bounding box on an interactive `geemap` map:

In [None]:
import ee
import geemap

ee.Authenticate()
ee.Initialize()

# Bounding box coordinates [min_lon, min_lat, max_lon, max_lat]
# Bavarian region (Central Europe)
min_lon, min_lat = 8.0, 47.0
max_lon, max_lat = 14.0, 49.5

# Create ee.Geometry.Rectangle (simpler approach)
roi = ee.Geometry.Rectangle([min_lon, min_lat, max_lon, max_lat])

# OR build the same polygon from explicit coordinates
# (ee.Geometry expects GeoJSON-style coordinates, not a WKT string):
# coords = [[min_lon, min_lat], [max_lon, min_lat], [max_lon, max_lat], [min_lon, max_lat], [min_lon, min_lat]]
# roi = ee.Geometry.Polygon([coords], proj='EPSG:4326', geodesic=False)

Map = geemap.Map(center=[48.0, 11.0], zoom=6)
Map.addLayer(roi, {'color': 'FF0000'}, 'ROI')
Map

---

## Part 6: Data Organization

Organize your downloaded data for efficient preprocessing in Lab 3.2.
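
Each downloaded ZIP contains a single top-level `.SAFE` directory. A minimal sketch of unpacking an archive into the `sentinel2_extracted` folder, skipping the work if the top-level directory is already present, might look like the following. The helper name `extract_product` is an assumption for illustration, and Lab 3.2 may handle extraction differently.

```python
import zipfile
from pathlib import Path


def extract_product(zip_path, extract_root):
    """Extract a product ZIP and return the extracted top-level paths."""
    zip_path, extract_root = Path(zip_path), Path(extract_root)
    extract_root.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        # Top-level entries in the archive, e.g. the single .SAFE directory
        top_level = sorted({name.split("/")[0] for name in archive.namelist()})
        already_extracted = all((extract_root / name).exists() for name in top_level)
        if not already_extracted:
            archive.extractall(extract_root)
    return [extract_root / name for name in top_level]
```

The existence check only looks at the top-level directory, so a previously interrupted extraction would be treated as complete. Deleting the partial `.SAFE` directory and rerunning resolves that case.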

In [None]:
# Recommended directory structure for the project
import os

base_dir = "/p/scratch/training2600/YOUR_USERNAME/data"

directory_structure = {
    "sentinel2_data": "Downloaded Sentinel-2 ZIP files",
    "sentinel2_extracted": "Extracted SAFE directories",
    "sentinel2_geotiff": "Converted GeoTIFF files",
    "corine_data": "CORINE land cover maps",
    "training_data": "Preprocessed training patches"
}

print("Recommended Directory Structure:")
print(f"\n{base_dir}/")
for dir_name, description in directory_structure.items():
    dir_path = os.path.join(base_dir, dir_name)
    print(f"  ├── {dir_name}/  # {description}")
    # Uncomment to create directories (after replacing YOUR_USERNAME in base_dir):
    # os.makedirs(dir_path, exist_ok=True)

print("\nTip: Create these directories before starting Lab 3.2")

---

## Summary

This notebook demonstrated how to:

1. **Register for Copernicus Dataspace**: Set up access via the Dataspace Browser
2. **Authenticate via API**: Use OAuth2 to obtain access tokens
3. **Search for Sentinel-2 Data**: Query the catalog by date, location, and cloud cover
4. **Download SAFE Archives**: Retrieve full Sentinel-2 products as ZIP files
5. **Organize Data**: Structure files for efficient preprocessing

The downloaded Sentinel-2 ZIP files are ready for preprocessing in **Lab 3.2**.

---

## What's Next?

### Before Moving to Lab 3.2

**1. Verify Downloads**
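   - Check that Sentinel-2 ZIP files are complete: the size on disk should match the `ContentLength` reported in the search results (a single L2A tile is typically around 1 GB)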
   - Note file paths for Lab 3.2

**2. Extract ZIP Files** (Optional)
   - Extract `.SAFE` directories from ZIP files
   - Or let Lab 3.2 handle extraction automatically

**3. Prepare Storage**
   - Ensure sufficient disk space (~50 GB per tile including extracted data)
   - Create directory structure as shown above
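
The download and storage checks above can be sketched with the standard library alone. `check_zip` tests that a file is a readable ZIP archive and that every member passes its CRC check, and `free_gib` reports the free space on the filesystem holding a path. Both helper names are illustrative assumptions.

```python
import shutil
import zipfile


def check_zip(path):
    """Return (ok, detail) describing whether a downloaded archive is intact."""
    if not zipfile.is_zipfile(path):
        return False, "not a ZIP archive (possibly truncated)"
    with zipfile.ZipFile(path) as archive:
        bad_member = archive.testzip()  # name of first corrupt member, or None
        if bad_member is not None:
            return False, f"corrupt member: {bad_member}"
        return True, f"{len(archive.namelist())} members"


def free_gib(path):
    """Free space in GiB on the filesystem that contains path."""
    return shutil.disk_usage(path).free / (1024 ** 3)
```

`testzip()` reads every member of the archive, so expect it to take a while on multi-gigabyte files. For a quicker first pass, comparing the file size against the catalogue's `ContentLength` is a cheaper check.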

### Next Lab: Lab 3.2 - Data Preprocessing & Patch Extraction

In **Lab 3.2**, you'll:
- Extract Sentinel-2 bands from SAFE format
- Convert data to GeoTIFF for analysis
- Extract 3x3 patches around LUCAS ground truth points
- Create training data for model development

### Data Pipeline Recap

```
# Lab 3.1 produces:
Sentinel-2 ZIP files (SAFE format)
    ↓
# Lab 3.2 processes:
Extract bands from SAFE archives
Convert to GeoTIFF
Extract patches around ground truth points
    ↓
# Lab 4 trains:
Transformer model on patches
```

---

## Resources & References

- **Copernicus Dataspace Browser**: https://browser.dataspace.copernicus.eu/
- **Copernicus Dataspace API Docs**: https://documentation.dataspace.copernicus.eu/APIs.html
- **Sentinel-2 Product Specification**: https://sentinels.copernicus.eu/web/sentinel/missions/sentinel-2
- **OData Protocol**: https://www.odata.org/

---

## Troubleshooting & FAQ

**Q: I can't find the old Copernicus Hub (scihub.copernicus.eu) - where is it?**
- The old SciHub was decommissioned. Use the new **Copernicus Dataspace Ecosystem** at https://dataspace.copernicus.eu/

**Q: How do I get API credentials?**
- Register an account at https://identity.dataspace.copernicus.eu/
- Log in with that account and create an OAuth client as described in Part 1, Step 1
- No separate API key is needed: OAuth2 access tokens are requested on demand from your Client ID and Client Secret

**Q: Download is very slow - what can I do?**
- Use HPC systems with high-bandwidth connections
- Download during off-peak hours
- Consider downloading multiple tiles in parallel with separate jobs
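
Within a single job, a small thread pool is another way to overlap a few transfers. The sketch below is a generic pattern, not a Copernicus-specific API: `download_all` is a hypothetical helper, and `max_workers=2` is a deliberately conservative assumption, because the service limits concurrent connections per user and the current quota should be checked in the documentation.

```python
from concurrent.futures import ThreadPoolExecutor


def download_all(products, download_one, max_workers=2):
    """Run download_one(product) for each product using a small thread pool.

    Returns a list of (product, result_or_exception) pairs in input order,
    so one failed transfer does not abort the others.
    """
    def safe(product):
        try:
            return product, download_one(product)
        except Exception as exc:  # keep going; report the error to the caller
            return product, exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(safe, products))
```

With the functions from this lab, a call could look like `download_all(selected_products, lambda p: download_product(p, download_dir, access_token))`.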

**Q: What if I hit data limits?**
- Copernicus Dataspace has generous free-tier limits
- For large-scale downloads, request increased quotas

**Q: Can I use the browser to download instead of the API?**
- Yes! Use https://browser.dataspace.copernicus.eu/
- Search for products visually
- Download via the web interface
- Better for small numbers of tiles

**Q: How do I know which tile covers my area?**
- Search by coordinates in the Dataspace Browser
- The search results will show the MGRS tile ID

---

**Course Contact**: Refer to course materials for instructor email and office hours  
**Last Updated**: February 2026