# Create dataset from OpenStreetMap data and Mapbox tiles

## Dependencies

In [None]:
%pip install --quiet git+https://github.com/mozilla-ai/osm-ai-helper.git

## Download data from OpenStreetMap

In [None]:
from osm_ai_helper.download_osm import download_osm

- `TRAIN_AREA` / `VAL_AREA`

Pick values that don't geographically overlap.

Each can be a city, state, country, etc.

Uses the [Nominatim API](https://nominatim.org/release-docs/develop/api/Search/).

- `SELECTOR`

OpenStreetMap tag to select elements.
Check some examples of [OpenStreetMap tags](https://wiki.openstreetmap.org/wiki/Map_features).

Uses the [Overpass API](https://wiki.openstreetmap.org/wiki/Overpass_API/Language_Guide).

The example uses [`leisure=swimming_pool`](https://wiki.openstreetmap.org/wiki/Tag:leisure%3Dswimming_pool).

- `DISCARD`

Elements matching any of the given key/value pairs will be discarded.

The chosen `SELECTOR` might pull in elements that are not relevant for training, so
you can use this to filter out unwanted elements.

The example uses `{"location": "indoor"}` to filter out swimming pools that are not
visible in the satellite image.
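
To make the interplay between `SELECTOR` and `DISCARD` concrete, here is a minimal sketch of the filtering logic described above. The helpers `parse_selector` and `keep_element` are hypothetical and written only for illustration; they are not part of `osm_ai_helper`, which may implement the filtering differently (for example, inside the Overpass query itself).

```python
def parse_selector(selector: str) -> tuple[str, str]:
    """Split a "key=value" selector into its key and value."""
    key, _, value = selector.partition("=")
    return key, value


def keep_element(tags: dict[str, str], selector: str, discard: dict[str, str]) -> bool:
    """Return True when `tags` match the selector and none of the discard pairs."""
    key, value = parse_selector(selector)
    if tags.get(key) != value:
        return False
    # Discard the element as soon as any key/value pair matches.
    return not any(tags.get(k) == v for k, v in discard.items())


# Hypothetical tag sets, as they might appear on OpenStreetMap elements.
outdoor_pool = {"leisure": "swimming_pool"}
indoor_pool = {"leisure": "swimming_pool", "location": "indoor"}
park = {"leisure": "park"}

selected = [
    keep_element(tags, "leisure=swimming_pool", {"location": "indoor"})
    for tags in (outdoor_pool, indoor_pool, park)
]
print(selected)  # [True, False, False]
```

Only the outdoor pool survives: the indoor pool matches the selector but also matches a discard pair, and the park does not match the selector at all.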


In [None]:
TRAIN_AREA = "Galicia"
VAL_AREA = "Viana do Castelo"
SELECTOR = "leisure=swimming_pool"
DISCARD = {"location": "indoor"}
CLASS_NAME = "swimming_pool"
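
One rough way to check that the training and validation areas do not overlap is to compare their bounding boxes. The sketch below assumes you have looked up each area's bounding box yourself (Nominatim search results report one as south, north, west, east) and copied it into the hypothetical `BBox` class; the coordinates shown are illustrative placeholders, not the real extents of any area.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BBox:
    """Latitude/longitude bounding box, in degrees."""

    south: float
    north: float
    west: float
    east: float


def bboxes_overlap(a: BBox, b: BBox) -> bool:
    """Return True when two boxes share any area (antimeridian crossing is ignored)."""
    lat_overlap = a.south < b.north and b.south < a.north
    lon_overlap = a.west < b.east and b.west < a.east
    return lat_overlap and lon_overlap


# Illustrative coordinates only, not the real extents of any area.
train_box = BBox(south=42.0, north=43.5, west=-9.0, east=-7.0)
val_box = BBox(south=41.0, north=41.9, west=-8.9, east=-8.0)
other_box = BBox(south=43.0, north=44.0, west=-8.0, east=-6.0)

print(bboxes_overlap(train_box, val_box))    # False
print(bboxes_overlap(train_box, other_box))  # True
```

This check is conservative: two neighbouring areas can have overlapping bounding boxes without actually sharing any territory, so a `True` result means "look closer" rather than "definitely overlapping".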

In [None]:
download_osm(
    area=TRAIN_AREA,
    output_dir="datasets",
    selector=SELECTOR,
    discard=DISCARD,
)

In [None]:
download_osm(
    area=VAL_AREA,
    output_dir="datasets",
    selector=SELECTOR,
    discard=DISCARD,
)

## Download tiles from Mapbox

In [None]:
from google.colab import userdata

from osm_ai_helper.group_elements_and_download_tiles import (
    group_elements_and_download_tiles,
)

- `ZOOM`

The [`zoom` level](https://docs.mapbox.com/help/glossary/zoom-level/) of the tiles to download.

There is a tradeoff between easier detection (higher zoom levels) and covering a wider area on each tile (lower zoom levels).

The example uses `18` for swimming pools.
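
To get a feel for this tradeoff, the sketch below estimates how much ground a single tile covers at a few zoom levels. It assumes the standard Web Mercator tiling, where the world is split into `2**zoom` tiles per axis; whether the helper uses exactly this scheme is an assumption, and `tile_width_m` is a hypothetical helper written for illustration.

```python
import math

# Equatorial circumference of the Earth used by Web Mercator, in meters.
EARTH_CIRCUMFERENCE_M = 40_075_016.686


def tile_width_m(zoom: int, latitude_deg: float) -> float:
    """Approximate ground width of one tile, in meters, at a given latitude."""
    return EARTH_CIRCUMFERENCE_M * math.cos(math.radians(latitude_deg)) / 2**zoom


# Each extra zoom level halves the ground width covered by a tile.
for zoom in (16, 17, 18, 19):
    print(zoom, round(tile_width_m(zoom, latitude_deg=42.5), 1))
```

Under these assumptions a zoom 18 tile spans roughly 150 m at the equator and less at higher latitudes, which leaves room for several pools per tile while keeping each one large enough to detect.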

In [None]:
ZOOM = 18

You need to set the `MAPBOX_TOKEN` Colab secret:

- Create an account: https://console.mapbox.com/
- Follow this guide to obtain your [Default Public Token](https://docs.mapbox.com/help/getting-started/access-tokens/#your-default-public-token).
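
`google.colab.userdata` is only available inside Colab. If you run this notebook elsewhere, one option is a small fallback that reads the token from an environment variable instead. `get_secret` below is a hypothetical helper, not part of `osm_ai_helper` or Colab.

```python
import os


def get_secret(name: str) -> str:
    """Read a secret from Colab if available, otherwise from the environment."""
    try:
        from google.colab import userdata  # only importable inside Colab
    except ImportError:
        return os.environ[name]  # raises KeyError when the variable is unset
    return userdata.get(name)
```

With this in place, `get_secret("MAPBOX_TOKEN")` could replace `userdata.get("MAPBOX_TOKEN")` in the cells that follow.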

In [None]:
group_elements_and_download_tiles(
    f"datasets/{TRAIN_AREA}.json",
    f"datasets/{TRAIN_AREA}",
    userdata.get("MAPBOX_TOKEN"),
    zoom=ZOOM,
)

In [None]:
group_elements_and_download_tiles(
    f"datasets/{VAL_AREA}.json",
    f"datasets/{VAL_AREA}",
    userdata.get("MAPBOX_TOKEN"),
    zoom=ZOOM,
)

## Convert to YOLO dataset

We are going to use the dataset to train a YOLO model from [Ultralytics](https://www.ultralytics.com/), so we need to convert it to [the expected format](https://docs.ultralytics.com/datasets/detect/).
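
For reference, the Ultralytics detection format stores one text file per image, with one line per object: the class index followed by the box center, width and height, all normalized to the image size. The sketch below shows that arithmetic for a single pixel-space box; `to_yolo_line` is a hypothetical helper for illustration and not necessarily how `convert_to_yolo_dataset` is implemented.

```python
def to_yolo_line(
    class_id: int,
    box: tuple[float, float, float, float],
    image_width: int,
    image_height: int,
) -> str:
    """Format a pixel-space (x_min, y_min, x_max, y_max) box as a YOLO label line."""
    x_min, y_min, x_max, y_max = box
    # Normalize the center and size by the image dimensions.
    x_center = (x_min + x_max) / 2 / image_width
    y_center = (y_min + y_max) / 2 / image_height
    width = (x_max - x_min) / image_width
    height = (y_max - y_min) / image_height
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


# A 128x128 pixel box inside a 512x512 image (illustrative values).
print(to_yolo_line(0, (128, 192, 256, 320), 512, 512))
# 0 0.375000 0.500000 0.250000 0.250000
```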

In [None]:
from osm_ai_helper.convert_to_yolo_dataset import convert_to_yolo_dataset

In [None]:
convert_to_yolo_dataset(f"datasets/{TRAIN_AREA}")

In [None]:
convert_to_yolo_dataset(f"datasets/{VAL_AREA}")

# Check out-of-the-box predictions

In [None]:
from pathlib import Path
from ultralytics import YOLO

In [None]:
yolo = YOLO("yolo11m.pt")

In [None]:
yolo.predict(list(Path(f"datasets/{VAL_AREA}").glob("*.jpg"))[0], save=True)

In [None]:
from PIL import Image

Image.open(list(Path("runs/detect/predict").glob("*.jpg"))[0])

# Upload Dataset

The dataset will be uploaded to the [Hugging Face Hub](https://huggingface.co/docs/hub/datasets) as a dataset repository.

You need to set the `HF_TOKEN` Colab secret:

- Create an account: https://huggingface.co/join
- Follow this guide to create a [User Access Token](https://huggingface.co/docs/hub/security-tokens) with write access.



In [None]:
!rm "datasets/{TRAIN_AREA}"/*.json

In [None]:
!rm "datasets/{VAL_AREA}"/*.json

In [None]:
!zip -r -q train.zip "datasets/{TRAIN_AREA}"

In [None]:
!zip -r -q val.zip "datasets/{VAL_AREA}"

In [None]:
USER = "mozilla-ai"
REPO = "osm-swimming-pools"

Create the YAML config used by YOLO:

In [None]:
Path("yolo_dataset.yaml").write_text(
    f"""
path: .
train: {TRAIN_AREA}
val: {VAL_AREA}

names:
  0: {CLASS_NAME}
"""
)

In [None]:
Path("README.md").write_text(
    f"""
---
task_categories:
- object-detection
---

# {REPO}

Detect {CLASS_NAME}s in satellite images.

Created with [osm-ai-helper](https://github.com/mozilla-ai/osm-ai-helper).

## Ground Truth Bounding Boxes

Downloaded from [OpenStreetMap](https://www.openstreetmap.org). LICENSE: https://www.openstreetmap.org/copyright

Used the `{SELECTOR}` [OpenStreetMap tags](https://wiki.openstreetmap.org/wiki/Map_features). Discarded the elements matching `{DISCARD}`.

## Satellite Images

Downloaded from [Mapbox](https://www.mapbox.com/). LICENSE: https://docs.mapbox.com/data/tilesets/guides/imagery/#trace-satellite-imagery

Used a [zoom level](https://docs.mapbox.com/help/glossary/zoom-level/) of `{ZOOM}`.
"""
)

In [None]:
from huggingface_hub import HfApi

In [None]:
api = HfApi()

In [None]:
api.create_repo(f"{USER}/{REPO}", token=userdata.get("HF_TOKEN"), repo_type="dataset")

In [None]:
api.upload_file(
    token=userdata.get("HF_TOKEN"),
    path_or_fileobj="train.zip",
    path_in_repo="train.zip",
    repo_id=f"{USER}/{REPO}",
    repo_type="dataset",
)

In [None]:
api.upload_file(
    token=userdata.get("HF_TOKEN"),
    path_or_fileobj="val.zip",
    path_in_repo="val.zip",
    repo_id=f"{USER}/{REPO}",
    repo_type="dataset",
)

In [None]:
api.upload_file(
    token=userdata.get("HF_TOKEN"),
    path_or_fileobj="yolo_dataset.yaml",
    path_in_repo="yolo_dataset.yaml",
    repo_id=f"{USER}/{REPO}",
    repo_type="dataset",
)

In [None]:
api.upload_file(
    token=userdata.get("HF_TOKEN"),
    path_or_fileobj="README.md",
    path_in_repo="README.md",
    repo_id=f"{USER}/{REPO}",
    repo_type="dataset",
)