## Verwaltungsgebiete Preprocessing

This notebook describes the preprocessing of both census (Zensus 2022) and geospatial (VG250) data for the pharmalink project. \
The goal is to create a custom shapefile containing population and area information for the main administrative levels (Staat, Land, Kreis, Gemeinde). \
This file ships with the pharmalink package as an essential part of its internal data.

### Source: [Zensus 2022](https://www.zensus2022.de) 
Both sources are available as Open Data from the 2022 census conducted by the federal and state statistical offices. \
Both sources are subject to the [Data licence Germany – attribution – version 2.0](https://www.govdata.de/dl-de/by-2-0). \
Both sources were last accessed on 2024-08-28.

##### Population Data (sources/zensus2022_population):
Population count broken down by administrative levels. \
Units that belong to several levels appear once per level (e.g. Hamburg as Land, Kreis and Gemeinde). \
The file is an Excel spreadsheet with a sheet (CSV-Einwohnerzahl) designed to be parsed by tools such as pandas.

[Website](https://www.zensus2022.de/DE/Aktuelles/Bevoelkerung_VOE.html) and
[File](https://www.zensus2022.de/static/Zensus_Veroeffentlichung/Regionaltabelle_Bevoelkerung.xlsx)

© Statistische Ämter des Bundes und der Länder 2024

##### Area Data (sources/vg250_shapefile):
Special version of the VG250 dataset published by the Bundesamt für Kartografie und Geodäsie. \
The cut-off date is the same as for the entire census data (2022-05-15) to ensure compatibility.

[Website](https://www.zensus2022.de/DE/Presse/Grafik/shapefile.html) and
[File](https://www.zensus2022.de/static/DE/gitterzellen/Shapefile_Zensus2022.zip)

[© GeoBasis-DE / BKG 2023 (Daten verändert)](https://www.bkg.bund.de)
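
The shapefiles used below sit in a folder named `EPSG_25832`, which suggests projected UTM coordinates in metres. Under that assumption, polygon areas come out in square metres and can be turned into km² or a population density. The following is only a minimal sketch with a made-up rectangle and population figure; the column names `area_km2` and `density` are hypothetical and the notebook itself does not create them.

```python
import geopandas as gpd
from shapely.geometry import box

# Made-up rectangle of 1000 m x 2000 m in UTM zone 32N coordinates (illustrative only)
toy = gpd.GeoDataFrame(
    {"name": ["Toy area"], "population": [5000]},
    geometry=[box(500000, 5700000, 501000, 5702000)],
    crs="EPSG:25832",  # assumption: same CRS as the VG250 files read below
)

# In a metric CRS, .area is in square metres; divide by 1e6 for km²
toy["area_km2"] = toy.geometry.area / 1_000_000

# Inhabitants per km² (hypothetical derived column)
toy["density"] = toy["population"] / toy["area_km2"]

print(toy[["name", "area_km2", "density"]])
```

If the data were stored in a geographic CRS such as EPSG:4326 instead, `.area` would return values in squared degrees, so reprojecting with `to_crs` first would be necessary.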

In [1]:
import os
import pandas as pd
import geopandas as gpd

In [3]:
# See source information above
population = pd.read_excel(
    io=f"{os.getcwd()}/Quellen/verwaltungsgebiete/sources/zensus2022_population/Regionaltabelle_Bevoelkerung.xlsx",
    engine="openpyxl",
    sheet_name="CSV-Einwohnerzahl",
    names=[
        "BERICHTSZEITPUNKT",
        "regkey",
        "name",
        "level",
        "population",
        "FORTSCHREIBUNG",
    ],
    # usecols=["regkey", "name", "level", "population"], # use for debugging, contains names and levels
    usecols=["regkey", "population"],
    dtype={
        "regkey": "str",  # String because of leading zeros
        "name": "str",
        "level": "str",
        "population": "int64",
    },
    skiprows=3,
)

# Adjust the key for the Bundesrepublik Deutschland from "00" to "000000000000" to match entry in shapefile
population.loc[population["regkey"] == "00", "regkey"] = "000000000000"

# Set the regkey as index
population = population.set_index("regkey")

population

regkey,population
01,2927542
01001,95015
010010000000,95015
01002,249132
010020000000,249132
...,...
160775051023,1722
160775051036,7023
160775052,14118
160775052003,418
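
The output above mixes keys of different lengths (`01`, `01001`, `010010000000`, `160775052`). A rough sketch of how such shortened keys relate to the full 12-digit form may help when reading the join further down. The length-to-level mapping is an assumption based on the general layout of the Amtlicher Regionalschlüssel, the helper names `pad_regkey` and `level_of` are hypothetical, and the special key for Germany (`00`) is deliberately not covered.

```python
ARS_LENGTH = 12

# Assumed mapping from key length to administrative level
LEVEL_BY_KEY_LENGTH = {
    2: "Land",
    3: "Regierungsbezirk",
    5: "Kreis",
    9: "Gemeindeverband",
    12: "Gemeinde",
}


def pad_regkey(key: str) -> str:
    """Right-pad a shortened key with zeros to the full 12 digits."""
    if not key.isdigit() or len(key) > ARS_LENGTH:
        raise ValueError(f"not a valid regional key: {key!r}")
    return key.ljust(ARS_LENGTH, "0")


def level_of(key: str) -> str:
    """Guess the administrative level from the key length (hypothetical helper)."""
    return LEVEL_BY_KEY_LENGTH.get(len(key), "unknown")


for key in ["01", "01001", "010010000000", "160775052"]:
    print(key, pad_regkey(key), level_of(key))
```

Because padding maps `01001` and `010010000000` to the same 12-digit string, the padded key alone cannot tell a Kreis from a Gemeinde, which is why the shortened key is what distinguishes the levels.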


In [126]:
# All relevant administrative levels (STAat, LANd, KReiS, GEMeinde)
levels = ["STA", "LAN", "KRS", "GEM"]

# Read one shapefile per level (see source information above) and stack them.
# Building the list first avoids shadowing the built-in input() and
# concatenating onto an empty GeoDataFrame.
shapes = pd.concat(
    [
        gpd.read_file(
            f"{os.getcwd()}/Quellen/verwaltungsgebiete/sources/vg250_shapefile/EPSG_25832/VG250_{level}.shp"
        )
        for level in levels
    ]
)


# Set the index to the "ARS_0" column containing the regkeys
shapes = shapes.set_index("ARS_0")

# Rename the index to "regkey"
shapes.index.name = "regkey"

# Filter out bodies of water to get landmass (copy so the column assignments below act on an independent frame)
shapes = shapes[shapes["GF"] == 4].copy()

# Transform "NBD" column from German "ja/nein" to boolean
shapes["NBD"] = shapes["NBD"] == "ja"

# Build the complete name for each administrative area ("BEZ GEN" if NBD is True, "GEN" otherwise)
shapes["name"] = shapes.apply(
    lambda row: f"{row['BEZ']} {row['GEN']}" if row["NBD"] else row["GEN"], axis=1
)

# Drop all areas which are "Gemeindefreie Gebiete" (IBZ == 65)
shapes = shapes[shapes["IBZ"] != 65]

# Drop all columns which are not needed
shapes = shapes[["ARS", "ADE", "name", "geometry"]]

# Rename the columns for better readability
shapes = shapes.rename(columns={"ARS": "shortest_regkey", "ADE": "level"})

# Sort by shortest regkey, as it is an easy way to implicitly sort by levels as well
shapes = shapes.sort_values(by="shortest_regkey")

shapes

regkey,shortest_regkey,level,name,geometry
000000000000,000000000000,1,Bundesrepublik Deutschland,"MULTIPOLYGON (((702598.302 6006620.218, 702532..."
010000000000,01,2,Land Schleswig-Holstein,"MULTIPOLYGON (((546232.578 5934903.089, 546341..."
010010000000,01001,4,Kreisfreie Stadt Flensburg,"POLYGON ((526513.753 6075133.412, 526547.941 6..."
010010000000,010010000000,6,Stadt Flensburg,"POLYGON ((526513.753 6075133.412, 526547.941 6..."
010020000000,01002,4,Kreisfreie Stadt Kiel,"POLYGON ((575841.569 6032148.032, 575869.668 6..."
...,...,...,...,...
160775051011,160775051011,6,Gemeinde Göpfersdorf,"POLYGON ((752591.111 5648274.147, 752860.195 5..."
160775051023,160775051023,6,Gemeinde Langenleuba-Niederhain,"POLYGON ((747005.276 5655670.767, 747160.622 5..."
160775051036,160775051036,6,Gemeinde Nobitz,"POLYGON ((745477.589 5655903.318, 745520.885 5..."
160775052003,160775052003,6,Gemeinde Dobitschen,"POLYGON ((730841.395 5651399.075, 731066.733 5..."
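
The `name` column above is built row by row with `apply`. As an aside, the same rule ("BEZ GEN" when NBD is set, otherwise just "GEN") can be written without a Python-level loop. This is a minimal sketch on toy rows; the BEZ/GEN/NBD values are illustrative and not copied from VG250.

```python
import pandas as pd

# Toy rows; values are made up for illustration
toy = pd.DataFrame(
    {
        "BEZ": ["Kreisfreie Stadt", "Stadt", "Gemeinde"],
        "GEN": ["Flensburg", "Hansestadt Beispiel", "Nobitz"],
        "NBD": ["ja", "nein", "ja"],
    }
)

# True where the designation (BEZ) belongs in front of the name
use_prefix = toy["NBD"] == "ja"

# Build the prefixed variant for all rows, then keep it only where use_prefix is True
prefixed = toy["BEZ"] + " " + toy["GEN"]
toy["name"] = prefixed.where(use_prefix, toy["GEN"])

print(toy["name"].tolist())
# → ['Kreisfreie Stadt Flensburg', 'Hansestadt Beispiel', 'Gemeinde Nobitz']
```

For roughly eleven thousand rows the difference in run time is small, so this is mostly a matter of style.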


In [127]:
# Add the population data to the shapes GeoDataFrame by joining on the regkey
verwaltungsgebiete = shapes.join(population, on="shortest_regkey")

# Drop the shortest_regkey column as it is not needed anymore
verwaltungsgebiete = verwaltungsgebiete.drop(columns=["shortest_regkey"])

verwaltungsgebiete

regkey,level,name,geometry,population
000000000000,1,Bundesrepublik Deutschland,"MULTIPOLYGON (((702598.302 6006620.218, 702532...",82711282
010000000000,2,Land Schleswig-Holstein,"MULTIPOLYGON (((546232.578 5934903.089, 546341...",2927542
010010000000,4,Kreisfreie Stadt Flensburg,"POLYGON ((526513.753 6075133.412, 526547.941 6...",95015
010010000000,6,Stadt Flensburg,"POLYGON ((526513.753 6075133.412, 526547.941 6...",95015
010020000000,4,Kreisfreie Stadt Kiel,"POLYGON ((575841.569 6032148.032, 575869.668 6...",249132
...,...,...,...,...
160775051011,6,Gemeinde Göpfersdorf,"POLYGON ((752591.111 5648274.147, 752860.195 5...",215
160775051023,6,Gemeinde Langenleuba-Niederhain,"POLYGON ((747005.276 5655670.767, 747160.622 5...",1722
160775051036,6,Gemeinde Nobitz,"POLYGON ((745477.589 5655903.318, 745520.885 5...",7023
160775052003,6,Gemeinde Dobitschen,"POLYGON ((730841.395 5651399.075, 731066.733 5...",418
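
`DataFrame.join` performs a left join by default, so a shape whose key has no counterpart in the census table would silently end up with a missing population. One way to check for that is sketched below on toy frames; the key `99999` and the name "Made-up area" are invented to force a mismatch.

```python
import pandas as pd

# Toy stand-ins for the two tables; the last shape row has no census entry
shapes_toy = pd.DataFrame(
    {
        "shortest_regkey": ["01", "01001", "99999"],
        "name": ["Land Schleswig-Holstein", "Kreisfreie Stadt Flensburg", "Made-up area"],
    },
    index=pd.Index(["010000000000", "010010000000", "999990000000"], name="regkey"),
)
population_toy = pd.DataFrame(
    {"population": [2927542, 95015]},
    index=pd.Index(["01", "01001"], name="regkey"),
)

# Same join pattern as in the cell above: left column against right index
joined = shapes_toy.join(population_toy, on="shortest_regkey")

# Rows without a match carry NaN in the population column
unmatched = joined[joined["population"].isna()]
print(unmatched.index.tolist())
# → ['999990000000']

# NaN forces a float column; the nullable Int64 dtype keeps whole numbers
joined["population"] = joined["population"].astype("Int64")
```

An empty `unmatched` frame on the real data would indicate that every remaining shape found a population entry.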


In [130]:
# Save the processed data to a new shapefile (driver named explicitly because the path has no file extension)
verwaltungsgebiete.to_file(
    f"{os.getcwd()}/Quellen/verwaltungsgebiete/verwaltungsgebiete",
    driver="ESRI Shapefile",
)

In [131]:
# Test if the shapefile can be read
test = gpd.read_file(
    f"{os.getcwd()}/Quellen/verwaltungsgebiete/verwaltungsgebiete/verwaltungsgebiete.shp"
)

test

Unnamed: 0,regkey,level,name,population,geometry
0,000000000000,1,Bundesrepublik Deutschland,82711282,"MULTIPOLYGON (((702598.302 6006620.218, 702532..."
1,010000000000,2,Land Schleswig-Holstein,2927542,"MULTIPOLYGON (((546232.578 5934903.089, 546341..."
2,010010000000,4,Kreisfreie Stadt Flensburg,95015,"POLYGON ((526513.753 6075133.412, 526547.941 6..."
3,010010000000,6,Stadt Flensburg,95015,"POLYGON ((526513.753 6075133.412, 526547.941 6..."
4,010020000000,4,Kreisfreie Stadt Kiel,249132,"POLYGON ((575841.569 6032148.032, 575869.668 6..."
...,...,...,...,...,...
11196,160775051011,6,Gemeinde Göpfersdorf,215,"POLYGON ((752591.111 5648274.147, 752860.195 5..."
11197,160775051023,6,Gemeinde Langenleuba-Niederhain,1722,"POLYGON ((747005.276 5655670.767, 747160.622 5..."
11198,160775051036,6,Gemeinde Nobitz,7023,"POLYGON ((745477.589 5655903.318, 745520.885 5..."
11199,160775052003,6,Gemeinde Dobitschen,418,"POLYGON ((730841.395 5651399.075, 731066.733 5..."
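
The round trip preserves all column names, which is not guaranteed for shapefiles: attribute names are stored in a DBF table whose field names are limited to 10 characters, and longer names get truncated on write. A small pre-write check could look like the sketch below; `columns_at_risk` is a hypothetical helper, not part of geopandas.

```python
DBF_FIELD_NAME_LIMIT = 10  # maximum field-name length in the DBF attribute table


def columns_at_risk(columns, limit=DBF_FIELD_NAME_LIMIT):
    """Return the column names that would be truncated in a shapefile."""
    return [name for name in columns if len(name) > limit]


# Attribute columns written by this notebook (the index is stored as a column too)
print(columns_at_risk(["regkey", "level", "name", "population"]))
# → []

# The helper column dropped before saving would not have survived intact
print(columns_at_risk(["shortest_regkey", "population"]))
# → ['shortest_regkey']
```

`population` is exactly 10 characters long, so it sits right at the limit.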
