## Residential Areas - Preprocessing

This notebook documents how data on residential areas in Germany is preprocessed for the pharmalink project. \
The goal is to create one compressed GeoPackage per Bundesland (state), each with a single layer containing precise boundaries for all residential areas in that state. \
These files are included in the pharmalink package as an essential part of its internal data.

### Source: [Basemap.de Open Data](https://basemap.de/open_data/) 
The basemap.de ecosystem provides easy access to aggregated federal and state government geodata. \
The Open Data service aggregates the Basis-DLM (basic digital landscape model) datasets of all 16 states and offers them for download.

The source was last accessed on 2024-09-10. \
Please note that the Bremen file is currently affected by an incomplete migration, so no data is available for Bremen for the time being.
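
The download cell further down in this notebook builds one URL per state from the state's two-letter code and the current date. The sketch below isolates that pattern for illustration; the helper name `basisviews_url` and the constant `BASE_URL` are hypothetical, and whether files for dates other than the current day can be retrieved is not confirmed here.

```python
import datetime

# URL prefix as used in the download cell of this notebook
BASE_URL = "https://basemap.de/dienste/opendata/basisviews"


def basisviews_url(state_code: str, day: datetime.date) -> str:
    """Build the download URL for one state's Basis-DLM file (hypothetical helper)."""
    return f"{BASE_URL}/basisviews_bdlm_{state_code}_EPSG:4326_{day.isoformat()}.gpkg"


print(basisviews_url("HH", datetime.date(2024, 9, 10)))
# prints https://basemap.de/dienste/opendata/basisviews/basisviews_bdlm_HH_EPSG:4326_2024-09-10.gpkg
```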

### Licenses and Original Sources

##### **Baden-Württemberg:** Datenquelle: LGL, [www.lgl-bw.de](https://www.lgl-bw.de), [dl-de/by-2-0](https://www.govdata.de/dl-de/by-2-0), (Daten verändert)

##### **Bayern:** Bayerische Vermessungsverwaltung – [www.geodaten.bayern.de](https://www.geodaten.bayern.de), [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/), (Daten verändert)

##### **Berlin:** Geoportal Berlin / ATKIS Basis-DLM, [dl-de/by-2-0](https://www.govdata.de/dl-de/by-2-0), (Daten verändert)

##### **Brandenburg:** © GeoBasis-DE/LGB (2024), [dl-de/by-2-0](https://www.govdata.de/dl-de/by-2-0), Daten geändert

##### **Bremen:** © GeoBasis-DE / Landesamt GeoInformation Bremen (2024), [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/), (Daten verändert)

##### **Hamburg:** Freie und Hansestadt Hamburg, Landesbetrieb Geoinformation und Vermessung (LGV), [dl-de/by-2-0](https://www.govdata.de/dl-de/by-2-0), (Daten verändert)

##### **Hessen:** no source, no license, [§18 Hessisches Vermessungs- und Geoinformationsgesetz - HVGG](https://www.rv.hessenrecht.hessen.de/bshe/document/jlr-VermGeoInfGHEV6P18) (Daten verändert) 

##### **Mecklenburg-Vorpommern:** © GeoBasis-DE/M-V 2024, [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/), (Daten verändert)

##### **Niedersachsen:** LGLN (2024), [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/), (Daten verändert)

##### **Nordrhein-Westfalen:** [dl-de/zero-2-0](https://www.govdata.de/dl-de/zero-2-0), (Daten verändert)

##### **Rheinland-Pfalz:** ©GeoBasis-DE / LVermGeoRP2024, [dl-de/by-2-0](https://www.govdata.de/dl-de/by-2-0), [www.lvermgeo.rlp.de](https://www.lvermgeo.rlp.de) [Daten bearbeitet]

##### **Saarland:** © GeoBasis DE/LVGL-SL (2024), [dl-de/by-2-0](https://www.govdata.de/dl-de/by-2-0), (Daten verändert)

##### **Sachsen:** Landesamt für Geobasisinformation Sachsen (GeoSN), [dl-de/by-2-0](https://www.govdata.de/dl-de/by-2-0), (Daten verändert)

##### **Sachsen-Anhalt:**  © GeoBasis-DE / LVermGeo ST, [dl-de/by-2-0](https://www.govdata.de/dl-de/by-2-0), (Daten verändert)

##### **Schleswig-Holstein:** ©GeoBasis-DE/LVermGeo SH/[CC BY 4.0](https://creativecommons.org/licenses/by/4.0/) (Quelle verändert)

##### **Thüringen:** © GDI-Th, [dl-de/by-2-0](https://www.govdata.de/dl-de/by-2-0), (Daten verändert)

In [1]:
import pathlib as path
import warnings
import time
import datetime
import geopandas as gpd
import aiohttp
import asyncio
from tqdm.asyncio import tqdm
from io import BytesIO
import lzma

In [2]:
# Establish notebook path for handling relative paths in the notebook
notebook_path = path.Path().resolve()

if notebook_path.stem != "residential_areas":
    raise Exception(
        "Notebook file root must be set to parent directory of the notebook. Please resolve and re-run."
    )

In [3]:
# Admin areas GeoPackage in admin_areas sister directory
admin_areas_path = notebook_path.parent.joinpath("admin_areas", "admin_areas.gpkg.xz")

# Filter RuntimeWarnings from pyogrio. The GDAL driver for GeoPackage expects a .gpkg filename,
# but the virtual file it receives from lzma cannot comply with the file standard in this regard.
warnings.filterwarnings("ignore", category=RuntimeWarning, module="pyogrio")

with lzma.open(admin_areas_path, "rb") as archive:
    admin_areas = gpd.read_file(archive, engine="pyogrio")

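# Copy the filtered rows so that adding a column later does not trigger a SettingWithCopyWarning
bundeslaender = admin_areas[admin_areas["level"] == "land"].copy()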

bundeslaender = bundeslaender.set_index("regkey")


two_letters = {
    "Baden-Württemberg": "BW",
    "Bayern": "BY",
    "Berlin": "BE",
    "Brandenburg": "BB",
    "Bremen": "HB",
    "Hamburg": "HH",
    "Hessen": "HE",
    "Mecklenburg-Vorpommern": "MV",
    "Niedersachsen": "NI",
    "Nordrhein-Westfalen": "NW",
    "Rheinland-Pfalz": "RP",
    "Saarland": "SL",
    "Sachsen": "SN",
    "Sachsen-Anhalt": "ST",
    "Schleswig-Holstein": "SH",
    "Thüringen": "TH",
}

# Add two letter abbreviations to the bundeslaender GeoDataFrame
bundeslaender["two_letters"] = bundeslaender["geo_name"].map(two_letters)
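
Mapping a pandas column through a dictionary yields `NaN` for every value that has no key in the dictionary, so a differently spelled state name would go unnoticed until the download URL is built. A minimal sketch of a sanity check for this case follows; the helper `unmapped_names` and the sample data are hypothetical and not part of the notebook.

```python
def unmapped_names(names, mapping):
    """Return the names without an entry in the mapping, in their original order."""
    return [name for name in names if name not in mapping]


# Hypothetical sample: one name is spelled differently from the mapping key
sample_mapping = {"Hamburg": "HH", "Thüringen": "TH"}
sample_names = ["Hamburg", "Thueringen"]

print(unmapped_names(sample_names, sample_mapping))  # prints ['Thueringen']
```

In the notebook, calling `unmapped_names(bundeslaender["geo_name"], two_letters)` would be expected to return an empty list.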

In [4]:
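# Download the files asynchronously to speed up the process and include a progress bar
async def download_file(session, url, file_path):
    async with session.get(url) as response:
        # Fail on HTTP errors (e.g. 404) instead of saving the error page as a .gpkg file
        response.raise_for_status()
        with open(file_path, "wb") as f:
            # Stream the body in 64 KiB chunks to keep memory usage low
            async for chunk in response.content.iter_chunked(64 * 1024):
                f.write(chunk)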


async def download_all_missing_files():
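    # ClientTimeout() without arguments disables aiohttp's default total timeout,
    # which the large source files could otherwise exceed
    async with aiohttp.ClientSession(timeout=aiohttp.ClientTimeout()) as session: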
        tasks = []

        today = datetime.date.today().isoformat()

        # Make sure the download directory exists before any file is written
        sources_path = notebook_path.joinpath("sources")
        sources_path.mkdir(parents=True, exist_ok=True)

        for regkey, land in bundeslaender.iterrows():

            # Get the two-letter abbreviation and the two-digit regkey prefix for the Bundesland
            two_letters = land["two_letters"]
            two_digits = regkey[:2]

            # For source information, see above
            file_path = sources_path.joinpath(f"{two_digits}.gpkg")
            url = f"https://basemap.de/dienste/opendata/basisviews/basisviews_bdlm_{two_letters}_EPSG:4326_{today}.gpkg"

            if not file_path.exists():
                tasks.append(download_file(session, url, file_path))

        for task in tqdm(asyncio.as_completed(tasks), total=len(tasks)):
            await task
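
The cell above starts all missing downloads at once. If the server or the local connection struggles with that many parallel transfers, the number of concurrent downloads can be capped with a semaphore. The following is a minimal sketch using only the standard library; `gather_limited`, `demo`, and `fake_download` are hypothetical names, and the fake download merely sleeps instead of fetching anything.

```python
import asyncio


async def gather_limited(coros, limit=4):
    """Await all coroutines, running at most `limit` of them at the same time."""
    semaphore = asyncio.Semaphore(limit)

    async def run(coro):
        # Each coroutine waits here until one of the `limit` slots is free
        async with semaphore:
            return await coro

    return await asyncio.gather(*(run(coro) for coro in coros))


async def demo():
    """Run six fake downloads with a limit of two and record the peak concurrency."""
    active = 0
    peak = 0

    async def fake_download(index):
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)  # stands in for the network transfer
        active -= 1
        return index

    results = await gather_limited([fake_download(i) for i in range(6)], limit=2)
    return results, peak


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Inside a notebook with autoawait enabled, `await demo()` would be used instead of `asyncio.run(demo())`, because an event loop is already running there.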

In [5]:
%autoawait
# Allow for some time after midnight to avoid downloading old/unavailable data during the nightly update
# Check if after 00:30
if time.localtime().tm_hour == 0 and time.localtime().tm_min < 30:
    raise Exception(
        "For safety reasons, wait until after 00:30 to download data. By then, all files should have been updated."
    )


# Ensure that all source files have been downloaded
await download_all_missing_files()

output_path = notebook_path.joinpath("residential_areas")
output_path.mkdir(parents=True, exist_ok=True)

# Remove all existing files in the output directory
for file in output_path.glob("*.gpkg.xz"):
    file.unlink()

# Extract all residential areas from the source files
for regkey, land in bundeslaender.iterrows():

    land_two_digits = regkey[:2]

    land_geo_name = land["geo_name"]

    input_file = notebook_path.joinpath("sources", f"{land_two_digits}.gpkg")
    output_file = output_path.joinpath(f"{land_two_digits}.gpkg.xz")

    # Query filters for all pure and mixed residential areas
    query = "objektart = 'Wohnbauflaeche' OR objektart = 'FlaecheGemischterNutzung'"

    # Import the source file with the query filter
    land_data = gpd.read_file(
        input_file,
        engine="pyogrio",
        layer="siedlungsflaeche_bdlm",
        where=query,
        use_arrow=True,
    )
    # Keep only the geometry column and reset the index for a cleaner
    # GeoDataFrame upon re-import of the output file
    land_data = land_data[["geometry"]].reset_index(drop=True)

    # Write the GeoDataFrame to a compressed GeoPackage file using a BytesIO buffer
    with BytesIO() as buffer:

        land_data.to_file(buffer, layer=land_geo_name, driver="GPKG", promote_to_multi=True)

        buffer.seek(0)

        with lzma.open(output_file, "wb", preset=9) as archive:
            archive.write(buffer.read())

    print(f"Extracted {len(land_data)} residential areas from {land_geo_name}")

IPython autoawait is `on`, and set to use `asyncio`


0it [00:00, ?it/s]


Extracted 82002 residential areas from Schleswig-Holstein
Extracted 12971 residential areas from Hamburg
Extracted 270259 residential areas from Niedersachsen
Extracted 0 residential areas from Bremen
Extracted 313646 residential areas from Nordrhein-Westfalen
Extracted 133296 residential areas from Hessen
Extracted 120255 residential areas from Rheinland-Pfalz
Extracted 277621 residential areas from Baden-Württemberg
Extracted 505431 residential areas from Bayern
Extracted 24907 residential areas from Saarland
Extracted 17371 residential areas from Berlin
Extracted 76392 residential areas from Brandenburg
Extracted 57799 residential areas from Mecklenburg-Vorpommern
Extracted 119534 residential areas from Sachsen
Extracted 74897 residential areas from Sachsen-Anhalt
Extracted 80878 residential areas from Thüringen
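
The compression step in the cell above writes the in-memory GeoPackage bytes into an `.xz` archive. The sketch below shows the same round trip with the standard library only, using placeholder bytes instead of a real GeoPackage; the helpers `write_xz`, `read_xz`, and `round_trip` are hypothetical names introduced for illustration.

```python
import lzma
import tempfile
from io import BytesIO
from pathlib import Path


def write_xz(buffer: BytesIO, target: Path) -> None:
    """Compress the full content of the buffer into an .xz file."""
    with lzma.open(target, "wb", preset=9) as archive:
        archive.write(buffer.getvalue())


def read_xz(target: Path) -> bytes:
    """Read and decompress an .xz file into bytes."""
    with lzma.open(target, "rb") as archive:
        return archive.read()


def round_trip(payload: bytes) -> bytes:
    """Write the payload to a temporary .xz file and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "example.gpkg.xz"
        write_xz(BytesIO(payload), target)
        return read_xz(target)


# Placeholder bytes, not a real GeoPackage
payload = b"placeholder bytes, not a real GeoPackage" * 100
print(round_trip(payload) == payload)  # prints True
```

To load one of the real output files, the pattern from the admin areas cell applies: open the archive with `lzma.open(..., "rb")` and pass the file object to `gpd.read_file(..., engine="pyogrio")`.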
