## Administrative Areas - Preprocessing

This notebook describes the preprocessing of both census (Zensus 2022) and geospatial (VG250) data for the pharmalink project. \
The goal is to create a custom GeoPackage containing population and area information for the main administrative levels in Germany (Staat, Land, Kreis, Gemeinde). \
Said GeoPackage is included in the pharmalink package as an essential part of its internal data.

### Source: [Zensus 2022](https://www.zensus2022.de) 
Both sources are available as Open Data from the 2022 census conducted by the federal and state statistical offices. \
Both sources were last accessed on 2024-08-28.

##### Population Data (sources/zensus2022_population):
Population counts broken down by administrative level. \
Units that belong to several levels have one entry per level (e.g. Hamburg: Land, Kreis and Gemeinde). \
The file is an Excel spreadsheet with a sheet (CSV-Einwohnerzahl) designed to be parsed by tools such as pandas.

[Website](https://www.zensus2022.de/DE/Aktuelles/Bevoelkerung_VOE.html) and
[File](https://www.zensus2022.de/static/Zensus_Veroeffentlichung/Regionaltabelle_Bevoelkerung.xlsx)

© Statistische Ämter des Bundes und der Länder 2024, [Data license Germany – attribution – version 2.0](https://www.govdata.de/dl-de/by-2-0) (Daten verändert)

##### Area Data (sources/vg250_shapefile):
Special version of the VG250 dataset published by the Bundesamt für Kartografie und Geodäsie. \
The cut-off date is the same as for the entire census data (2022-05-15) to ensure compatibility.

[Website](https://www.zensus2022.de/DE/Presse/Grafik/shapefile.html) and
[File](https://www.zensus2022.de/static/DE/gitterzellen/Shapefile_Zensus2022.zip)

[© GeoBasis-DE / BKG 2023 (Daten verändert)](https://www.bkg.bund.de), [Data license Germany – attribution – version 2.0](https://www.govdata.de/dl-de/by-2-0).

In [1]:
import pathlib as path
import pandas as pd
import geopandas as gpd
import shapely as shp
import lzma

In [2]:
# Establish notebook path for handling relative paths in the notebook
notebook_path = path.Path().resolve()

if notebook_path.stem != "admin_areas":
    raise Exception(
        "Notebook file root must be set to parent directory of the notebook. Please resolve and re-run."
    )

In [3]:
population_file = notebook_path.joinpath(
    "sources", "zensus2022_population", "Regionaltabelle_Bevoelkerung.xlsx"
)

population = pd.read_excel(
    io=population_file,
    engine="openpyxl",
    sheet_name="CSV-Einwohnerzahl",
    names=[
        "BERICHTSZEITPUNKT",
        "regkey",
        "name",
        "level",
        "population",
        "FORTSCHREIBUNG",
    ],
    # usecols=["regkey", "name", "level", "population"], # use for debugging, contains names and levels
    usecols=["regkey", "population"],
    dtype={
        "regkey": "str",  # String because of leading zeros
        "name": "str",
        "level": "str",
        "population": "int64",
    },
    skiprows=2,  # Skip unnecessary rows distinguishing between Germans in and outside of Germany
)

# Adjust the key for the Bundesrepublik Deutschland from "00" to "000000000000" to match entry in shapefile
population.loc[population["regkey"] == "00", "regkey"] = "000000000000"

# Set the regkey as index
population = population.set_index("regkey")

population

regkey,population
000000000000,82711282
01,2927542
01001,95015
010010000000,95015
01002,249132
...,...
160775051023,1722
160775051036,7023
160775052,14118
160775052003,418
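
The population table identifies each unit by its shortest key (for example `01` for Schleswig-Holstein or `01001` for Flensburg as a Kreis), while the shapefile's `ARS_0` column carries the same key padded to twelve digits. The following is a minimal sketch of that relationship, assuming the padding always consists of trailing zeros; `pad_regkey` is a hypothetical helper and not part of the notebook.

```python
def pad_regkey(short_key: str, width: int = 12) -> str:
    """Right-pad a short regional key with zeros (assumed ARS_0 convention)."""
    if not short_key.isdigit():
        raise ValueError(f"regkey must consist of digits only: {short_key!r}")
    return short_key.ljust(width, "0")


# Keys taken from the tables in this notebook
print(pad_regkey("01"))     # 010000000000 (Land Schleswig-Holstein)
print(pad_regkey("01001"))  # 010010000000 (Kreisfreie Stadt Flensburg)
print(pad_regkey("00"))     # 000000000000 (Bundesrepublik Deutschland)
```

Padding maps `01001` and `010010000000` to the same twelve-digit value, so the padded key alone cannot distinguish Flensburg as a Kreis from Flensburg as a Gemeinde. Matching on the unpadded key avoids that ambiguity.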


In [4]:
# Map the VG250 file suffixes to readable level names (STAat, LANd, KReiS, GEMeinde)
level_names = {"STA": "staat", "LAN": "land", "KRS": "kreis", "GEM": "gemeinde"}

shapes_dir = notebook_path.joinpath("sources", "vg250_shapefile", "EPSG_25832")
level_frames = []

for level, friendly_level in level_names.items():
    # For source information, see above
    level_shapes = gpd.read_file(shapes_dir.joinpath(f"VG250_{level}.shp"))

    # Add a column for the administrative level
    level_shapes["level"] = friendly_level

    level_frames.append(level_shapes)

# Combine all levels into a single GeoDataFrame
shapes = pd.concat(level_frames)


# Set the index to the "ARS_0" column containing the regkeys
shapes = shapes.set_index("ARS_0")

# Rename the index to "regkey"
shapes.index.name = "regkey"

# Filter out bodies of water to get landmass
shapes = shapes[shapes["GF"] == 4]

# Drop all areas which are "Gemeindefreie Gebiete" (IBZ == 65)
shapes = shapes[shapes["IBZ"] != 65]

# Transform all polygons to multipolygons for uniformity
shapes["geometry"] = shapes["geometry"].apply(
    lambda geom: shp.MultiPolygon([geom]) if geom.geom_type == "Polygon" else geom
)

# Transform "NBD" column from German "ja/nein" to boolean
shapes["NBD"] = shapes["NBD"] == "ja"

# Build the complete name for each administrative area ("BEZ GEN" if NBD is True, "GEN" otherwise)
shapes["full_name"] = shapes.apply(
    lambda row: f"{row['BEZ']} {row['GEN']}" if row["NBD"] else row["GEN"], axis=1
)

# Drop all columns which are not needed
# shapes = shapes[["ARS", "ADE", "full_name", "GEN", "BEZ", "NBD", "geometry"]]
shapes = shapes[["ARS", "level", "full_name", "GEN", "BEZ", "NBD", "geometry"]]

# Rename the columns for better readability
shapes = shapes.rename(
    columns={
        "ARS": "shortest_regkey",
        # "ADE": "level",
        "GEN": "geo_name",
        "BEZ": "title",
        "NBD": "titleEname",
    }
)

# Sort by shortest regkey, as it's an easy way to implicitly sort by levels as well
shapes = shapes.sort_values(by="shortest_regkey")


shapes

regkey,shortest_regkey,level,full_name,geo_name,title,titleEname,geometry
000000000000,000000000000,staat,Bundesrepublik Deutschland,Deutschland,Bundesrepublik,True,"MULTIPOLYGON (((702598.302 6006620.218, 702532..."
010000000000,01,land,Land Schleswig-Holstein,Schleswig-Holstein,Land,True,"MULTIPOLYGON (((546232.578 5934903.089, 546341..."
010010000000,01001,kreis,Kreisfreie Stadt Flensburg,Flensburg,Kreisfreie Stadt,True,"MULTIPOLYGON (((526513.753 6075133.412, 526547..."
010010000000,010010000000,gemeinde,Stadt Flensburg,Flensburg,Stadt,True,"MULTIPOLYGON (((526513.753 6075133.412, 526547..."
010020000000,01002,kreis,Kreisfreie Stadt Kiel,Kiel,Kreisfreie Stadt,True,"MULTIPOLYGON (((575841.569 6032148.032, 575869..."
...,...,...,...,...,...,...,...
160775051011,160775051011,gemeinde,Gemeinde Göpfersdorf,Göpfersdorf,Gemeinde,True,"MULTIPOLYGON (((752591.111 5648274.147, 752860..."
160775051023,160775051023,gemeinde,Gemeinde Langenleuba-Niederhain,Langenleuba-Niederhain,Gemeinde,True,"MULTIPOLYGON (((747005.276 5655670.767, 747160..."
160775051036,160775051036,gemeinde,Gemeinde Nobitz,Nobitz,Gemeinde,True,"MULTIPOLYGON (((745477.589 5655903.318, 745520..."
160775052003,160775052003,gemeinde,Gemeinde Dobitschen,Dobitschen,Gemeinde,True,"MULTIPOLYGON (((730841.395 5651399.075, 731066..."
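
The naming rule used for `full_name` can be checked on single values without loading the shapefile. This is a small sketch of the rule as described in the cell above; `parse_nbd` and `build_full_name` are hypothetical helper names, and the second call uses a made-up `"nein"` flag purely to show the other branch.

```python
def parse_nbd(value: str) -> bool:
    """Map the shapefile's "ja"/"nein" flag to a boolean."""
    return value == "ja"


def build_full_name(title: str, geo_name: str, title_in_name: bool) -> str:
    """Prefix the title (BEZ) only when the NBD flag says it belongs to the name."""
    return f"{title} {geo_name}" if title_in_name else geo_name


# Values from the table above; NBD is "ja" for this row
print(build_full_name("Kreisfreie Stadt", "Flensburg", parse_nbd("ja")))
# -> Kreisfreie Stadt Flensburg

# Hypothetical flag value, only to illustrate the branch without a title
print(build_full_name("Stadt", "Flensburg", parse_nbd("nein")))
# -> Flensburg
```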


In [5]:
# Add the population data by matching shortest_regkey against the population index (left join)
admin_areas = shapes.join(population, on="shortest_regkey")

# Drop the shortest_regkey column as it is not needed anymore
admin_areas = admin_areas.drop(columns=["shortest_regkey"])

admin_areas

regkey,level,full_name,geo_name,title,titleEname,geometry,population
000000000000,staat,Bundesrepublik Deutschland,Deutschland,Bundesrepublik,True,"MULTIPOLYGON (((702598.302 6006620.218, 702532...",82711282
010000000000,land,Land Schleswig-Holstein,Schleswig-Holstein,Land,True,"MULTIPOLYGON (((546232.578 5934903.089, 546341...",2927542
010010000000,kreis,Kreisfreie Stadt Flensburg,Flensburg,Kreisfreie Stadt,True,"MULTIPOLYGON (((526513.753 6075133.412, 526547...",95015
010010000000,gemeinde,Stadt Flensburg,Flensburg,Stadt,True,"MULTIPOLYGON (((526513.753 6075133.412, 526547...",95015
010020000000,kreis,Kreisfreie Stadt Kiel,Kiel,Kreisfreie Stadt,True,"MULTIPOLYGON (((575841.569 6032148.032, 575869...",249132
...,...,...,...,...,...,...,...
160775051011,gemeinde,Gemeinde Göpfersdorf,Göpfersdorf,Gemeinde,True,"MULTIPOLYGON (((752591.111 5648274.147, 752860...",215
160775051023,gemeinde,Gemeinde Langenleuba-Niederhain,Langenleuba-Niederhain,Gemeinde,True,"MULTIPOLYGON (((747005.276 5655670.767, 747160...",1722
160775051036,gemeinde,Gemeinde Nobitz,Nobitz,Gemeinde,True,"MULTIPOLYGON (((745477.589 5655903.318, 745520...",7023
160775052003,gemeinde,Gemeinde Dobitschen,Dobitschen,Gemeinde,True,"MULTIPOLYGON (((730841.395 5651399.075, 731066...",418
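
`DataFrame.join` performs a left join by default, so a shape whose `shortest_regkey` has no counterpart in the population table is kept with a missing population rather than dropped. The plain-Python sketch below mirrors that behaviour on a few rows from the tables above. The key `99` is a made-up example of an unmatched unit, and `left_join` is a hypothetical helper, not a pandas function.

```python
# (regkey, shortest_regkey, level) rows copied from the shapes table,
# plus one invented row ("99") that has no population entry
shapes_rows = [
    ("010000000000", "01", "land"),
    ("010010000000", "01001", "kreis"),
    ("010010000000", "010010000000", "gemeinde"),
    ("990000000000", "99", "land"),  # hypothetical, for illustration only
]

# shortest key -> population, copied from the population table
population_by_key = {"01": 2927542, "01001": 95015, "010010000000": 95015}


def left_join(rows, lookup):
    """Attach a population to every row; unmatched rows get None."""
    return [(*row, lookup.get(row[1])) for row in rows]


joined = left_join(shapes_rows, population_by_key)
unmatched = [row for row in joined if row[3] is None]

print(len(joined))  # 4: no row is lost by the join
print(unmatched)    # [('990000000000', '99', 'land', None)]
```

In the notebook itself, a comparable check would be to count missing values in the `population` column after the join.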


In [6]:
output_file = notebook_path.joinpath("admin_areas.gpkg.xz")

if output_file.exists():
    output_file.unlink()

with lzma.open(output_file, "wb", preset=9) as archive:
    admin_areas.to_file(archive, layer="admin_areas", driver="GPKG")
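
The file written above is an XZ-compressed GeoPackage, so a consumer has to decompress it before handing it to a GeoPackage reader. How pharmalink does this internally is not shown in this notebook. The following is a minimal sketch using only the standard library, with `decompress_xz` as a hypothetical helper and placeholder bytes standing in for the real file.

```python
import lzma
import pathlib
import shutil
import tempfile


def decompress_xz(archive_path: pathlib.Path, target_path: pathlib.Path) -> pathlib.Path:
    """Stream an .xz archive into an uncompressed file and return its path."""
    with lzma.open(archive_path, "rb") as source, open(target_path, "wb") as target:
        shutil.copyfileobj(source, target)
    return target_path


# Demonstration with placeholder bytes instead of a real GeoPackage
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = pathlib.Path(tmp)
    archive = tmp_dir / "example.gpkg.xz"
    payload = b"placeholder payload, not a real GeoPackage"

    # Compress with the same preset the notebook uses
    with lzma.open(archive, "wb", preset=9) as handle:
        handle.write(payload)

    restored = decompress_xz(archive, tmp_dir / "example.gpkg")
    roundtrip_ok = restored.read_bytes() == payload

print(roundtrip_ok)  # True
```

With the real archive, the decompressed path could then be passed to `gpd.read_file(path, layer="admin_areas")`.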