## Population Grid - Preprocessing

This notebook describes the preprocessing of census data rendered as a population grid for Germany for the pharmalink project. \
The goal is to create one compressed GeoPackage (`.gpkg.xz`) per Bundesland (state), each containing a single layer with all of that state's grid cells with a population > 0. \
These GeoPackages are included in the pharmalink package as an essential part of its internal data.

### Source: [Zensus 2022](https://www.zensus2022.de) 
The source is available as Open Data from the 2022 census conducted by the federal and state statistical offices. 

© Statistische Ämter des Bundes und der Länder, 2024, [Data license Germany – attribution – version 2.0](https://www.govdata.de/dl-de/by-2-0) (Daten verändert)

The source was last accessed on 2024-09-10.

### Description:
The census data is rendered as an INSPIRE-conforming 100 m x 100 m grid, with an integer value giving the number of people living within each cell. \
For further information, see the "Datensatzbeschreibung_Bevoelkerungszahl_Gitterzellen.xlsx" file.

[Website](https://www.zensus2022.de/DE/Ergebnisse-des-Zensus/_inhalt.html#Gitterdaten2022) and [File](https://www.zensus2022.de/static/Zensus_Veroeffentlichung/Zensus2022_Bevoelkerungszahl.zip)

In [1]:
import pathlib as path
import warnings
import pandas as pd
import geopandas as gpd
from shapely.geometry import Polygon
from io import BytesIO
import lzma

In [2]:
# Establish notebook path for handling relative paths in the notebook
notebook_path = path.Path().resolve()

if notebook_path.stem != "population_grids":
    raise Exception(
        "Notebook file root must be set to parent directory of the notebook. Please resolve and re-run."
    )

In [3]:
# For source information, see above
pop_cells_file = notebook_path.joinpath(
    "Zensus2022_Bevoelkerungszahl", "Zensus2022_Bevoelkerungszahl_100m-Gitter.csv"
)

# Create a DataFrame from the population grid file
pop_cells = pd.read_csv(
    pop_cells_file,
    sep=";",
    header=0,
    names=["id", "x_mp", "y_mp", "population"],
    dtype={"id": str, "x_mp": int, "y_mp": int, "population": int},
    index_col="id",
)

# Add the polygon described by the centroid x and y coordinates to the DataFrame
pop_cells["geometry"] = pop_cells.apply(
    lambda row: Polygon(
        [
            (row["x_mp"] - 50, row["y_mp"] - 50),
            (row["x_mp"] + 50, row["y_mp"] - 50),
            (row["x_mp"] + 50, row["y_mp"] + 50),
            (row["x_mp"] - 50, row["y_mp"] + 50),
        ]
    ),
    axis=1,
)

# Clean up the DataFrame
pop_cells = pop_cells.reset_index(drop=True)
pop_cells = pop_cells[["geometry", "population"]]

# Create a GeoDataFrame from the DataFrame. Data source uses EPSG:3035
pop_cells = gpd.GeoDataFrame(pop_cells, crs="EPSG:3035")

# Transform the GeoDataFrame to EPSG:4326
pop_cells = pop_cells.to_crs("EPSG:4326")
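
The cell above turns each centroid (`x_mp`, `y_mp`) into a square by offsetting 50 m in every direction. As a minimal, library-free sketch of that arithmetic: the helper names `cell_corners` and `shoelace_area` and the sample centroid are hypothetical and chosen only for illustration. Coordinates are assumed to be in metres (EPSG:3035), as stated in the comments above.

```python
def cell_corners(x_mp, y_mp, size=100):
    """Return the corners of a square cell, counter-clockwise from the lower left.

    x_mp / y_mp are the centroid coordinates in metres; size is the edge length.
    """
    half = size // 2
    return [
        (x_mp - half, y_mp - half),
        (x_mp + half, y_mp - half),
        (x_mp + half, y_mp + half),
        (x_mp - half, y_mp + half),
    ]


def shoelace_area(corners):
    """Area of a simple polygon via the shoelace formula."""
    total = 0
    for (x0, y0), (x1, y1) in zip(corners, corners[1:] + corners[:1]):
        total += x0 * y1 - x1 * y0
    return abs(total) / 2


# Hypothetical centroid, used only to exercise the helpers
corners = cell_corners(4334150, 2684050)
area = shoelace_area(corners)
print(corners[0], corners[2], area)
```

Checking that the area comes out as 100 x 100 square metres is a cheap way to confirm the offsets before building millions of polygons.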

In [4]:
# Import the Bundesländer areas from sources/admin_areas
admin_areas_file = notebook_path.parent.joinpath("admin_areas", "admin_areas.gpkg.xz")

# Filter RuntimeWarnings from pyogrio. The GDAL driver for GeoPackage expects a .gpkg filename,
# but the virtual file it receives from lzma cannot comply with the file standard in this regard.
warnings.filterwarnings("ignore", category=RuntimeWarning, module="pyogrio")

# Decompress with lzma, then access with geopandas.
# Output is a GeoDataFrame
with lzma.open(admin_areas_file, "rb") as archive:
    admin_areas = gpd.read_file(archive, engine="pyogrio")

bundeslaender = admin_areas[admin_areas["level"] == "land"]

bundeslaender = bundeslaender.set_index("regkey")
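
The `.gpkg.xz` files in this notebook are plain xz streams wrapped around a GeoPackage. The following sketch shows the compress and decompress round trip with the standard library only. The payload is a stand-in rather than a real GeoPackage: it merely starts with the SQLite header string that GeoPackage files (being SQLite databases) begin with. The file name `example.gpkg.xz` is hypothetical.

```python
import lzma
import tempfile
from io import BytesIO
from pathlib import Path

# Stand-in payload: SQLite header string followed by zero padding (not a valid GeoPackage)
payload = b"SQLite format 3\x00" + bytes(1000)

with tempfile.TemporaryDirectory() as tmp:
    archive_path = Path(tmp) / "example.gpkg.xz"

    # Write: copy an in-memory buffer into an xz archive at the highest preset
    with BytesIO(payload) as buffer:
        with lzma.open(archive_path, "wb", preset=9) as archive:
            archive.write(buffer.read())

    compressed_size = archive_path.stat().st_size

    # Read: lzma.open returns a binary file object that yields the original bytes
    with lzma.open(archive_path, "rb") as archive:
        restored = archive.read()

print(len(payload), compressed_size, restored[:15])
```

Because `lzma.open` yields an ordinary binary file object, it can be handed to readers that accept file-like input, which is how the cell above passes it to `gpd.read_file`.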

In [5]:
# Ensure the output directory exists
output_path = notebook_path.joinpath("population_grids")
output_path.mkdir(parents=True, exist_ok=True)

# Remove all existing files in the output directory
for file in output_path.glob("*.gpkg.xz"):
    file.unlink()


# WARNING: This process can easily take an hour or more to complete, depending on the machine's single-core performance.
for regkey, land in bundeslaender.iterrows():

    # The first two digits of the regional key identify the Bundesland
    land_two_digits = regkey[:2]

    land_geo_name = land["geo_name"]
    land_geometry = land["geometry"]

    # One output file per Bundesland to get around file size limits
    output_file = output_path.joinpath(f"{land_two_digits}.gpkg.xz")

    # Clip the population grid to the Bundesland geometry
    land_cells = pop_cells.clip(mask=land_geometry)

    # Write the GeoDataFrame to a compressed GeoPackage file using a BytesIO buffer
    with BytesIO() as buffer:

        land_cells.to_file(buffer, layer=land_geo_name, driver="GPKG")

        buffer.seek(0)

        with lzma.open(output_file, "wb", preset=9) as archive:
            archive.write(buffer.read())

    print(f"Extracted {len(land_cells)} cells for {land_geo_name}")

Extracted 138894 cells for Schleswig-Holstein
Extracted 29195 cells for Hamburg
Extracted 416520 cells for Niedersachsen
Extracted 12730 cells for Bremen
Extracted 548405 cells for Nordrhein-Westfalen
Extracted 191956 cells for Hessen
Extracted 164209 cells for Rheinland-Pfalz
Extracted 346891 cells for Baden-Württemberg
Extracted 564846 cells for Bayern
Extracted 39650 cells for Saarland
Extracted 41053 cells for Berlin
Extracted 139081 cells for Brandenburg
Extracted 88185 cells for Mecklenburg-Vorpommern
Extracted 175398 cells for Sachsen
Extracted 101430 cells for Sachsen-Anhalt
Extracted 92199 cells for Thüringen
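
As a quick sanity check, the printed counts can be tallied per run. The sketch below parses the log lines shown above (copied verbatim) and sums them. The names `log`, `counts` and `total` are illustrative. Note that cells straddling a state border are clipped into each neighbouring state, so the total may exceed the number of rows in the source grid.

```python
import re

# Log lines copied from the output above
log = """\
Extracted 138894 cells for Schleswig-Holstein
Extracted 29195 cells for Hamburg
Extracted 416520 cells for Niedersachsen
Extracted 12730 cells for Bremen
Extracted 548405 cells for Nordrhein-Westfalen
Extracted 191956 cells for Hessen
Extracted 164209 cells for Rheinland-Pfalz
Extracted 346891 cells for Baden-Württemberg
Extracted 564846 cells for Bayern
Extracted 39650 cells for Saarland
Extracted 41053 cells for Berlin
Extracted 139081 cells for Brandenburg
Extracted 88185 cells for Mecklenburg-Vorpommern
Extracted 175398 cells for Sachsen
Extracted 101430 cells for Sachsen-Anhalt
Extracted 92199 cells for Thüringen
"""

pattern = re.compile(r"Extracted (\d+) cells for (.+)")

# Map each Bundesland name to its extracted cell count
counts = {}
for line in log.splitlines():
    match = pattern.fullmatch(line)
    if match:
        counts[match.group(2)] = int(match.group(1))

total = sum(counts.values())
largest = max(counts, key=counts.get)
print(f"{len(counts)} states, {total} cells in total, largest: {largest}")
```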
