# Efficiently Querying and Processing Mapillary Data – Project Overview

## Goal

The goal of this project is to efficiently query, process, and enrich a large volume of Mapillary image data (over 1,200,000 images) using the Mapillary API. The extracted metadata is saved as GeoPackage (GPKG) files for further spatial analysis.

Along the way, I encountered several bottlenecks, mainly caused by API limitations (at most 100 images per query, rate limits) and by the sheer volume of data to process.

## General Setup

I use:

- `ThreadPoolExecutor` for parallel execution
- The Mapillary API:
  - `images_in_bbox()` for spatial queries
  - `graph.mapillary.com/{image_id}` for metadata
- `GeoPandas` for geospatial data processing and saving as GPKG
- Retry logic and HTTP session pooling for stable requests

Input: Bounding box (GeoJSON or coordinates)  
Output: Image and metadata stored as `.json` and `.gpkg`

## Learning Steps – My Iterative Approach

### General Benchmark – Single vs. Parallel Processing

I tested both single-threaded and multi-threaded image processing using Python's `ThreadPoolExecutor`.

| Mode       | Images | Duration     |
|------------|--------|--------------|
| Single     | 98     | 66.95 sec     |
| Parallel   | 100    | 12.33 sec     |

Parallel processing was roughly 5.5 times faster per image (about 0.68 s versus 0.12 s per image). Threading works well here because the task is I/O-bound (API calls). However, rate limits and error handling need careful management. I used the parallel approach for all further steps.
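
As a rough illustration of why threads help with I/O-bound work, the sketch below compares a sequential loop with `ThreadPoolExecutor.map`. The `fake_fetch` function is a hypothetical stand-in that sleeps instead of calling the Mapillary API, so its timings say nothing about real API behaviour.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(image_id, delay=0.01):
    """Hypothetical stand-in for one API call: sleeps instead of doing network I/O."""
    time.sleep(delay)
    return {"id": image_id}


def run_sequential(ids):
    """Fetch one image after the other."""
    return [fake_fetch(image_id) for image_id in ids]


def run_parallel(ids, max_workers=10):
    """Fetch images concurrently; pool.map keeps the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fake_fetch, ids))


if __name__ == "__main__":
    ids = list(range(20))
    for label, runner in (("single", run_sequential), ("parallel", run_parallel)):
        start = time.perf_counter()
        runner(ids)
        print(f"{label}: {time.perf_counter() - start:.2f} s")
```

Because the threads spend most of their time waiting, the speed-up grows with the number of workers until the API's rate limit becomes the bottleneck.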

### 1) Brute Force Grid with 10m Resolution

**What I tried:**  
I generated a fine grid (10m x 10m) and queried `images_in_bbox()` for each cell using multiprocessing.

**What I learned:**  
This approach generated an enormous number of API requests, even in empty areas. Since each query returns at most 100 images, covering the area required a huge number of calls and quickly hit the rate limits.

**Problems encountered:**
- Frequent HTTP 429 (too many requests)
- Many unnecessary queries in empty areas
- API only returns a maximum of 100 results, making fine-grained discovery incomplete
- Extremely slow: would take **days** for the entire area
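
To make the scale of the problem concrete, the following sketch estimates how many cells a regular grid needs and how long sequential requests would take. The 12 km x 11 km extent is an assumption chosen for illustration (it is not a measured value), and both function names are hypothetical.

```python
import math


def grid_cell_count(width_m, height_m, cell_m):
    """Number of square cells needed to cover a width x height rectangle."""
    return math.ceil(width_m / cell_m) * math.ceil(height_m / cell_m)


def estimated_hours(n_requests, seconds_per_request):
    """Wall-clock time in hours if the requests run one after the other."""
    return n_requests * seconds_per_request / 3600.0


if __name__ == "__main__":
    # Assumed extent: 12 km x 11 km rectangle, 10 m cells (illustrative only)
    cells = grid_cell_count(12_000, 11_000, 10)
    hours = estimated_hours(cells, 0.3)
    print(f"{cells} cells, about {hours:.0f} hours at 0.3 s per request")
```

For the assumed rectangle this yields 1,320,000 cells, or about 110 hours at 0.3 s per request, which is in line with the "days" estimate above.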

### 2) Distributed Access Using Multiple API Tokens

**What I tried:**  
I assigned a separate API access token to each worker/thread to try increasing throughput.

**What I learned:**  
This did not help much. Even with multiple tokens, each tile still took ~0.3 seconds. For hundreds of thousands of tiles, the process remained unacceptably slow.

**Problems encountered:**
- Difficult token management
- Rate limits still applied per token
- No significant performance gain
- Still made unnecessary calls to empty areas

### 3) Hierarchical Grid ("WMS Pyramid" Strategy)

**What I tried:**  
Inspired by how WMS tiles work, I implemented a zoom-based grid system:

1. Start with large grids (e.g. 1 km x 1 km)
2. If image data exists: subdivide further (500m, 250m, …)
3. Stop refining when no data is found
4. For empty parent tiles, generate empty `.json` and skip children

**What I learned:**  
This drastically reduced the number of requests. I avoided unnecessary queries in empty areas, while still achieving fine resolution where needed.

**Problems encountered:**
- Slightly more complex to implement
- Need to manage and track already-visited tiles
- Edge tiles may still require careful merging if overlapping
- The process time would be over 1 day
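
The refinement idea can be sketched as a recursive quadtree subdivision. This is a minimal sketch, not the original implementation: `refine` is a hypothetical name, tiles are plain `(min_x, min_y, max_x, max_y)` tuples in metres, and `has_images` is an assumed callback that stands in for one API query per tile.

```python
def refine(bbox, has_images, min_size):
    """Return the smallest non-empty tiles inside bbox.

    bbox is (min_x, min_y, max_x, max_y); has_images(bbox) is a hypothetical
    callback that reports whether a tile contains any image.
    """
    if not has_images(bbox):
        return []  # empty parent: none of its children are queried
    min_x, min_y, max_x, max_y = bbox
    if max_x - min_x <= min_size:
        return [bbox]  # finest resolution reached
    mid_x = (min_x + max_x) / 2
    mid_y = (min_y + max_y) / 2
    children = [
        (min_x, min_y, mid_x, mid_y),
        (mid_x, min_y, max_x, mid_y),
        (min_x, mid_y, mid_x, max_y),
        (mid_x, mid_y, max_x, max_y),
    ]
    leaves = []
    for child in children:
        leaves.extend(refine(child, has_images, min_size))
    return leaves
```

With a single image near the lower-left corner of a 1 km tile and a 250 m minimum size, this sketch issues 9 queries instead of the 16 a flat 250 m grid would need; the saving grows quickly as the area gets larger and emptier.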

### 4) Metadata Download via `graph.mapillary.com`

**What I tried:**  
After further research, I found the `Tiled Dataset` API, which returns basic information such as the image ID and sequence ID for a given area. This was very fast (approx. 1 minute). Once I had this reduced dataset with the geographic coordinates stored in a `.gpkg`, I could request the remaining metadata for each image individually.

**Implementation steps:**

1. Load image IDs from `.gpkg`
2. For each ID, request metadata from  
   `https://graph.mapillary.com/{image_id}?access_token=...`
3. Collect results in a `pandas.DataFrame`
4. Merge metadata with `GeoDataFrame`
5. Save as final GPKG file

**Optimizations:**

- Retry logic with `urllib3` and `HTTPAdapter`
- Parallelized via `ThreadPoolExecutor`
- Request rate control:

```python
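requests_per_minute = 50000              # target request budget per minute
safety_factor = 0.9                      # stay 10 % below the budget
min_delay = 60.0 / requests_per_minute   # minimum spacing between requests, in seconds
# Note: requests_per_minute * min_delay is always 60, so this evaluates to about
# 60 * safety_factor (roughly 54 workers), regardless of requests_per_minute
max_workers = max(1, int(requests_per_minute * safety_factor * min_delay))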
```
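
The snippet above sizes the worker pool; the spacing between requests could additionally be enforced with a small limiter shared by all threads. The following is a minimal sketch under that assumption. `RateLimiter`, its parameters, and the injectable `clock` and `sleep` arguments are illustrative names and not part of the original script.

```python
import threading
import time


class RateLimiter:
    """Hands out evenly spaced time slots so that all threads together
    stay below a given number of requests per minute."""

    def __init__(self, requests_per_minute, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / requests_per_minute
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_slot = clock()

    def wait(self):
        """Block until this caller's slot is due; return the time waited."""
        with self._lock:
            now = self._clock()
            slot = max(self._next_slot, now)
            self._next_slot = slot + self.min_interval
        delay = slot - now
        if delay > 0:
            self._sleep(delay)  # sleep outside the lock so other threads can reserve slots
        return delay
```

Each worker would call `limiter.wait()` immediately before sending its request. Reserving the slot inside the lock and sleeping outside it keeps the limiter itself from becoming a bottleneck.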

**What I learned:**  
Controlling the request rate was essential. Even with 10,000+ threads, I was able to finish metadata downloading in just over 2 hours without hitting rate limits.

**Problems encountered:**
- If rate control was not correctly configured: immediate HTTP 429
- Some image IDs failed to resolve even after retries (handled via logging)
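
Collecting results while recording the IDs that failed could look roughly like the sketch below. It is not the code from `Mapillary_Fetch_Metadata.py`, which is not shown here; `fetch_all` and the `fetch_one` callback are hypothetical names, and `fetch_one` is assumed to raise an exception when an ID cannot be resolved after its retries.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_all(image_ids, fetch_one, max_workers=8):
    """Run fetch_one(image_id) in parallel.

    Returns (results, failed): a dict of successful responses keyed by
    image ID and a sorted list of the IDs whose request raised an error.
    """
    results, failed = {}, []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch_one, image_id): image_id for image_id in image_ids}
        for future in as_completed(futures):
            image_id = futures[future]
            try:
                results[image_id] = future.result()
            except Exception:
                failed.append(image_id)  # keep the ID so it can be logged and retried
    return results, sorted(failed)
```

The `failed` list can then be written to a log file, and the `results` dictionary converted to a table for merging with the spatial data.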



## Final Overview Table – Advantages and Issues per Step

| Step | Strategy                                  | Advantages                                                                 | Problems / Errors Encountered                                                        |
|------|-------------------------------------------|----------------------------------------------------------------------------|--------------------------------------------------------------------------------------|
|  -   | General Benchmark (Parallel vs. Single)   | Fast metadata processing for known images                                  | Only works if image IDs are already available                                       |
| 1    | Brute Force Grid (10m)                    | Simple to implement                                                        | API overload, rate limits, unnecessary calls, incomplete due to 100 image cap       |
| 2    | Multiple Tokens per Worker                | Distributed access, theoretical speed gain                                 | No significant improvement, token management issues, still slow per tile            |
| 3    | Hierarchical Grid (WMS-style refinement)  | Highly efficient, scalable, minimal API usage in empty areas               | More complex logic, need to track hierarchy, still needs fallback for edge cases    |
| 4    | Metadata Download via Graph API           | Rich metadata, stable, fast with parallelization and throttling            | Sensitive to request rate, some failed IDs, needs retry mechanism                   |


## Approximate Processing Times (Ryzen 7900X / RTX 4080 / 64 GB RAM / SSD)

| Step | Strategy                                  | Estimated Time (approx.; extrapolated where the run was canceled early) |
|------|-------------------------------------------|-------------------------------------------|
| -    | General Benchmark (Parallel Mode)         | ~12 seconds for 100 images                |
| 1    | Brute Force Grid (10m tiles, full area)   | Several days (infeasible)                 |
| 2    | Multiple Tokens per Worker                | Still >24 hours for 900,000+ tiles        |
| 3    | Hierarchical Grid (WMS-style refinement)  | ~6-9 hours for full area             |
| 4    | Metadata Download via Graph API           | ~2 hours 8 minutes (actual, measured)     |


## Code Overview – Fetching and Enriching Mapillary Data

This script demonstrates how to use a custom module (`Mapillary_Fetch_Metadata.py`) to:

1. Fetch basic Mapillary image data within a bounding polygon (GeoJSON-style coordinates)
2. Enrich this data by downloading full metadata for each image
3. Save the results as GeoPackages (`.gpkg`) and optionally log failed image IDs


### Notes

* Both steps automatically **skip processing** if the output file already exists
* `failed_ids.txt` logs any image IDs for which metadata download failed (e.g. due to API errors)
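
The skip-if-exists behaviour and the failed-ID log can be sketched as follows. This is only an illustration of the pattern; `run_unless_exists` and `write_failed_ids` are hypothetical helpers and not functions of the custom module.

```python
from pathlib import Path


def run_unless_exists(output_path, produce):
    """Call produce(output_path) only if the file is missing; return True if it ran."""
    output_path = Path(output_path)
    if output_path.exists():
        return False
    output_path.parent.mkdir(parents=True, exist_ok=True)
    produce(output_path)
    return True


def write_failed_ids(path, failed_ids):
    """Write one image ID per line so that failed downloads can be retried later."""
    text = "".join(f"{image_id}\n" for image_id in failed_ids)
    Path(path).write_text(text, encoding="utf-8")
```

Checking for the output file first makes the notebook safe to re-run: expensive steps are only repeated after the corresponding file has been deleted.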

In [1]:
import sys
from pathlib import Path

# Relative path to the script directory
script_path = Path("03_Model/Scripts/2_Feature_Geolocation/2_0_Fetch_Mapillary")
sys.path.append(str(script_path.resolve()))

# Relative path to the export directory
root_path = Path(".").resolve()


import geopandas as gpd
import fiona

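def read_gpkg_limited(path, limit=None):
    """Read a GeoPackage; if limit is given, load only the first `limit` features."""
    if limit is None:
        return gpd.read_file(path)

    # Read feature by feature with fiona and stop after `limit` features
    with fiona.open(path, layer=0) as src:
        features = [feature for _, feature in zip(range(limit), src)]
        return gpd.GeoDataFrame.from_features(features, crs=src.crs)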


In [2]:
from Mapillary_Fetch_Metadata import fetch_and_convert_to_gdf
from Mapillary_Fetch_Metadata import load_and_fetch_metadata
import geopandas as gpd

# Example bounding box coordinates for testing

bbox_extent = [
        [8.457720822843683, 47.390649130286448], 
        [8.456825005154048, 47.37424440909907], 
        [8.513957476598119, 47.334770736179166], 
        [8.538369416730317, 47.347789816857421], 
        [8.566987407296029, 47.345832805990639], 
        [8.578446117577144, 47.351059452649729], 
        [8.615089558171336, 47.363347841537582], 
        [8.593951036370918, 47.38175404740641], 
        [8.598868228275775, 47.407016330860856], 
        [8.574228138587589, 47.412600163192216], 
        [8.55660690453244, 47.419692963652167], 
        [8.556685022028232, 47.437291328914533], 
        [8.514688737203723, 47.435910038955718], 
        [8.476965065062107, 47.42933398525328], 
        [8.466429431361696, 47.419442231388643], 
        [8.457720822843683, 47.390649130286448]
    ]
    
# File paths
basic_output_path = root_path / "data" / "images_bbox_basic.gpkg"
full_output_path = root_path / "data" / "images_bbox_fullmeta.gpkg"
failed_ids_path = root_path / "data" / "failed_ids.txt"

# Fetch basic metadata and save it to a GeoPackage (skipped if the file already exists)
if not basic_output_path.exists():
    print(f"Fetching basic metadata for bounding box: {bbox_extent}")
    fetch_and_convert_to_gdf(
        bbox=bbox_extent,
        output_path=basic_output_path,
    )
else:
    print(f"Basic metadata already exists at {basic_output_path}. Skipping fetch.")

# Fetch full metadata (with skip check and custom failed_ids path)
merged_gdf = load_and_fetch_metadata(
    basic_output_path,
    full_output_path,
    limit=1000,
    failed_ids_path=failed_ids_path,
)

# Free the memory held by the merged GeoDataFrame
del merged_gdf
import gc
gc.collect()

Basic metadata already exists at C:\Users\claud\Documents\Studium\Masterarbeit\03_Model\Scripts\2_Feature_Geolocation\2_0_Fetch_Mapillary\data\images_bbox_basic.gpkg. Skipping fetch.
Metadata file C:\Users\claud\Documents\Studium\Masterarbeit\03_Model\Scripts\2_Feature_Geolocation\2_0_Fetch_Mapillary\data\images_bbox_fullmeta.gpkg already exists - skipping.


0

### Blur Detection of Images

After enriching metadata, I want to **check each image for blur**, using both Laplacian and Sobel methods.

* The images are **downloaded permanently** to an external SSD instead of being held in RAM or a temporary folder.

The following imported script `detect_blury_image.py` provides the blur detection logic:

* `laplacian_variance_blur(...)`: Laplacian-based blur detection
* `sobel_variance_blur(...)`: Sobel-based blur detection
* `detect_blurry_images_in_folder(...)`: Batch blur detection in a folder (optionally with visualization)
* `detect_blury_image_single(...)`: For single image evaluation
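
The idea behind Laplacian-based blur detection is that sharp images contain strong edges, so the Laplacian response varies a lot, while blurry images give a flat response. The sketch below shows this in pure Python on a grayscale image given as a list of rows. It is not the code from `detect_blury_image.py` (which is not shown here); the function names and the threshold are assumptions, and a real pipeline would normally use an image library for speed.

```python
from statistics import pvariance


def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over the interior pixels.

    gray is a list of equally long rows of numeric pixel values.
    """
    height, width = len(gray), len(gray[0])
    if height < 3 or width < 3:
        raise ValueError("image must be at least 3 x 3 pixels")
    responses = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            responses.append(
                gray[y - 1][x] + gray[y + 1][x]
                + gray[y][x - 1] + gray[y][x + 1]
                - 4 * gray[y][x]
            )
    return pvariance(responses)


def is_blurry(gray, threshold):
    """Flag an image as blurry if its Laplacian variance is below the threshold."""
    return laplacian_variance(gray) < threshold
```

A uniformly grey image has a variance of zero and is flagged as blurry for any positive threshold, whereas an image with a hard vertical edge produces a large variance. The threshold itself has to be tuned on sample images.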

**Date:** 2025-06-26  

#### Overview

The first version of the script:
- Loaded a `.gpkg` GeoDataFrame with image URLs (`thumb_1024_url`)
- Processed images in chunks of 1000
- Detected blur using Laplacian and Sobel methods
- Ran **without multiprocessing**: images were downloaded sequentially and blur detection ran in a single-threaded loop

#### Revisions:
- ✅ **Chunk-wise processing** for memory-safe handling of large GeoPackages
- ✅ **Parallel image download** using `ThreadPoolExecutor`
- ✅ **Parallel blur detection** using `ThreadPoolExecutor` inside `detect_blurry_images_in_folder()`
- ✅ **Switch to `aiohttp` + `asyncio`** for faster, asynchronous image downloads (maximizing network throughput)
- ✅ **Temporary folders** per chunk to store images and reduce disk/RAM pressure
- ✅ **Efficient in-place updates** to the GeoDataFrame (`blurry_image = True/False`)
- ✅ **Interim statistics** after each chunk (blurred / not blurred / failed)
- ✅ **Final `.gpkg` export** with updated blur status
- ✅ **Failed download URLs saved** so they can be retried later
- ✅ **Changed blur detection algorithm**: from 39.2 seconds to 27 seconds
- ✅ **Added GPU support**: from 27 seconds to 24 seconds for blur detection
- ✅ **Added asynchronous processing**: from 57 seconds to 27 seconds for downloading **and** blur detection of 10,000 images (~54 min for 1.2 million images)
- ✅ **Optimized GPU-based blur detection pipeline**: the GPU runs in parallel with image downloading, `.loc` replaces `.at` for faster GeoDataFrame updates, and GPU memory is freed after each chunk. This reduced the runtime from 27 seconds to 19 seconds, but the pipeline "starves" after the first iteration
- ✅ **Changed image download from temporary to permanent storage**: this speeds up blur detection considerably
- ✅ **Changed blur detection output**: the computed blur values are now stored in the GPKG
- ✅ **Added a testing mode**: the pipeline can be run on a subsample of the data
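
Two of the revisions listed here, chunk-wise processing and asynchronous downloads with a bounded number of connections, can be sketched with the standard library alone. `iter_chunks` and `bounded_gather` are hypothetical helper names, and `coro_fn` stands in for the actual download coroutine.

```python
import asyncio


def iter_chunks(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


async def bounded_gather(coro_fn, items, max_concurrency):
    """Run coro_fn(item) for all items, with at most max_concurrency running at once."""
    sem = asyncio.Semaphore(max_concurrency)

    async def guarded(item):
        async with sem:
            return await coro_fn(item)

    # gather returns the results in the same order as the inputs
    return await asyncio.gather(*(guarded(item) for item in items))
```

A driver loop would call `asyncio.run(bounded_gather(download, chunk, 100))` once per chunk, so memory use is bounded by the chunk size and network load by the semaphore.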


In [3]:
# === Imports (standard library, third-party, local config) ===
import os
import asyncio
import time
import aiohttp
import aiofiles
import geopandas as gpd
from tqdm.asyncio import tqdm as tqdm_asyncio
from tqdm import tqdm
import nest_asyncio
import importlib
import config
importlib.reload(config)
from config import (
    IMAGE_COLUMN,
    BLURRY_COL,
    GPKG_PATH,
    OUTPUT_GPKG,
    FAILED_DOWNLOADS_PATH,
    TMP_BLUR_PATH,
    IMAGE_DIR,
    TEST_LIMIT,
    LAPLACIAN_THRESHOLD,
    USE_GPU
)
from blur_gpu_utils import compute_laplacian_variance



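# === Initialization ===
nest_asyncio.apply()  # allow asyncio.run() inside the notebook's running event loop
TMP_BLUR_PATH.mkdir(exist_ok=True)

# === Download functions ===
async def download_image(session, sem, image_id, url, folder):
    """Download one image; return None on success or (image_id, url) on failure."""
    filename = f"{image_id}.jpg"
    save_path = os.path.join(folder, filename)
    try:
        async with sem:
            # Pass the timeout as a ClientTimeout object (0.5 s in total per request)
            async with session.get(url, timeout=aiohttp.ClientTimeout(total=0.5)) as resp:
                if resp.status == 200:
                    async with aiofiles.open(save_path, mode='wb') as f:
                        await f.write(await resp.read())
                    return None
    except Exception:
        # Any error (network, timeout, file system) counts as a failed download;
        # unlike a bare except, this does not swallow task cancellation
        pass
    return (image_id, url)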

async def download_all_images(idx_url_list, folder, max_connections=100):
    sem = asyncio.Semaphore(max_connections)
    async with aiohttp.ClientSession(connector=aiohttp.TCPConnector(limit_per_host=max_connections)) as session:
        tasks = [download_image(session, sem, image_id, url, folder) for image_id, url in idx_url_list]
        results = await tqdm_asyncio.gather(*tasks, desc="Download", total=len(tasks))
    return [r for r in results if r is not None]

async def download_all_images_to_directory(gdf, directory):
    """Download every image in gdf that is not yet present in directory."""
    os.makedirs(directory, exist_ok=True)
    # Check which images already exist on disk so they are not downloaded again
    already_downloaded = gdf["id"].map(
        lambda image_id: os.path.exists(os.path.join(directory, f"{image_id}.jpg"))
    )
    print(f"✅ {already_downloaded.sum()} images are already saved.")

    remaining_gdf = gdf[~already_downloaded]

    if remaining_gdf.empty:
        print("✅ All images are already downloaded.")
        return gdf

    idx_url_list = list(zip(remaining_gdf["id"], remaining_gdf[IMAGE_COLUMN]))
    # Note: the list of failed (image_id, url) pairs returned here is currently discarded
    await download_all_images(idx_url_list, directory)

    return gdf

# === Main execution ===
if __name__ == "__main__":
    start_time = time.time()
    gdf = read_gpkg_limited(GPKG_PATH, TEST_LIMIT)
    gdf[BLURRY_COL] = None

    gdf = asyncio.run(download_all_images_to_directory(gdf, IMAGE_DIR))

    elapsed = time.time() - start_time
    print(f"⏱️ Total download time: {elapsed:.2f} seconds")



✅ 1194448 images are already saved.


Download: 100%|██████████| 40103/40103 [00:21<00:00, 1867.87it/s]


⏱️ Total download time: 89.20 seconds


In [4]:
# Load the GeoDataFrame
gdf = read_gpkg_limited(GPKG_PATH, TEST_LIMIT)


print("🔍 Starting blur detection...")
start_time = time.time()

from blur_gpu_utils import batch_compute_blur

def get_image_path(image_id):
    return os.path.join(IMAGE_DIR, f"{image_id}.jpg")

# Prepare (image_id, path) pairs for all images
image_paths = [(str(image_id), get_image_path(image_id)) for image_id in gdf["id"]]

# Compute the blur values in parallel
results = batch_compute_blur(image_paths, use_gpu=USE_GPU, max_workers=16)

# Map the results back onto the GeoDataFrame
blur_map = dict(results)
gdf["blur_value"] = gdf["id"].astype(str).map(blur_map)
gdf["is_blurry"] = gdf["blur_value"] < LAPLACIAN_THRESHOLD


print(f"✅ Blur detection finished in {time.time() - start_time:.2f} seconds.")

# Save
gdf.to_file(OUTPUT_GPKG, driver="GPKG")
print(f"💾 GPKG saved: {OUTPUT_GPKG}")


🔍 Starting blur detection...


📸 Blur Detection (parallel): 100%|██████████| 1234551/1234551 [25:53<00:00, 794.52it/s]


✅ Blur detection finished in 1579.35 seconds.
💾 GPKG saved: data\images_bbox_fullmeta_with_blur.gpkg
