---
title: Research of existing geojson files for Germany on municipality level
date: now
author: Jan Cap
---

We found one source of GeoJSON files for Germany on opendatalab.de: https://opendatalab.de/projects/geojson-utilities/#
Let's try to load the state-level boundaries first. Then we will try the municipality-level boundaries.
The data source is not an official statistical office, so we will compare it with official data.

In [None]:
import geopandas as gpd
import matplotlib.pyplot as plt

## Loading municipality boundaries

As noted above, this data does not come from an official statistical office, so we will compare it with official data.
The dataset contains extensive metadata about each municipality, including the following fields:
- RS (Regional key): 12-digit
- AGS (Official municipality key): 8-digit
- GEN (Geographical name): official name of the administrative unit
- BEZ (Official designation): official designation of the administrative unit like "Stadt", "Landkreis", etc.
- destatis (Destatis data): includes area, population numbers (men, women, total), density, zip codes, etc. in stringified dictionary format
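
Since the `destatis` field is a stringified dictionary, it has to be parsed before its values can be used. The following is a minimal sketch on a toy frame, assuming the strings are valid Python literals with `area` and `population` keys. The sample values and the `parse_destatis` helper are made up for illustration and are not part of the dataset.

```python
import ast

import pandas as pd

# Toy stand-in for the "destatis" column; all values are invented for illustration.
toy = pd.DataFrame(
    {
        "AGS": ["00000001", "00000002", "00000003"],
        "destatis": [
            "{'area': 12.5, 'population': 1000}",
            "{'area': 3.0, 'population': 250}",
            None,  # some rows may carry no metadata
        ],
    }
)


def parse_destatis(value):
    """Parse a stringified dict; return an empty dict for missing values."""
    if pd.isna(value):
        return {}
    return ast.literal_eval(value)


# Parse each string once, then pull out the fields of interest.
parsed = toy["destatis"].apply(parse_destatis)
toy["geo_area"] = parsed.apply(lambda d: d.get("area"))
toy["geo_population"] = parsed.apply(lambda d: d.get("population"))

print(toy[["AGS", "geo_area", "geo_population"]])
```

`ast.literal_eval` only evaluates literals, so it is a safer choice than `eval` for strings that come from an external file.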

In [None]:
# Load municipality-level boundaries (Gemeinden) from the local GeoJSON file

try:
    # Load GeoJSON data using geopandas
    gdf_mun = gpd.read_file("../data/gemeinden_simplify200.geojson")

    print(f"Loaded GeoDataFrame with {len(gdf_mun)} rows")
    print(f"Columns: {list(gdf_mun.columns)}")
    print(f"CRS: {gdf_mun.crs}")

    # Display first few rows
    print("\nFirst 3 rows:")
    display(gdf_mun.head(3))

except Exception as e:
    # The file is read from a local path, not a URL.
    print(f"Error loading GeoJSON file: {e}")
    # Re-raise so that later cells do not run without gdf_mun being defined.
    raise

In [None]:
print("Rows with different RS and RS_0:", len(gdf_mun[gdf_mun["RS"] != gdf_mun["RS_0"]]))
print("Rows with different RS and SDV_RS:", len(gdf_mun[gdf_mun["RS"] != gdf_mun["SDV_RS"]]))
print("Rows with different AGS and AGS_0:", len(gdf_mun[gdf_mun["AGS"] != gdf_mun["AGS_0"]]))

RS_0 and SDV_RS are identical to RS, and AGS_0 is identical to AGS, so we can drop them.
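
The same redundancy check can be written as a small helper that returns the columns that are safe to drop. This is only a sketch on toy data: the `redundant_columns` helper and the sample values are hypothetical. Note that `Series.equals` treats missing values in the same position as equal, whereas the `!=` comparison above counts them as different.

```python
import pandas as pd

# Toy frame; AGS_0 deliberately differs from AGS in the second row.
toy = pd.DataFrame(
    {
        "RS": ["010010000000", "010020000000"],
        "RS_0": ["010010000000", "010020000000"],
        "AGS": ["01001000", "01002000"],
        "AGS_0": ["01001000", "01002999"],
    }
)


def redundant_columns(df, pairs):
    """Return each candidate column whose values equal its reference column."""
    return [dup for ref, dup in pairs if df[ref].equals(df[dup])]


to_drop = redundant_columns(toy, [("RS", "RS_0"), ("AGS", "AGS_0")])
print(to_drop)

# Only drop what was actually verified as a duplicate.
toy = toy.drop(columns=to_drop)
```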

In [None]:
gdf_mun = gdf_mun.drop(columns=["RS_0", "SDV_RS", "AGS_0"])

In [None]:
# Inspect entries whose AGS starts with "0200" (state code 02 is Hamburg)
gdf_mun[gdf_mun["AGS"].str.startswith("0200")]

In [None]:
# Visualize the GeoJSON data
# Create a simple plot
fig, ax = plt.subplots(1, 1, figsize=(12, 8))
gdf_mun.plot(ax=ax, color="lightblue", edgecolor="black", linewidth=0.5)
ax.set_title("German Municipalities (Gemeinden)")
ax.set_axis_off()
plt.tight_layout()
plt.show()

# Show some basic statistics
print("\nGeoDataFrame Info:")
print(f"Shape: {gdf_mun.shape}")
print(f"Geometry type: {gdf_mun.geometry.geom_type.unique()}")
print(f"Bounds: {gdf_mun.total_bounds}")

## Link to municipality data

Now we can compare the loaded GeoJSON data with the municipality data from the 2022 census.
We are mainly interested in whether every municipality in the census data is also present in the GeoJSON data, and vice versa.
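
A two-way presence check of this kind can be done with an outer merge and pandas' `indicator=True` option, which adds a `_merge` column with the values `both`, `left_only` and `right_only`. The sketch below uses two tiny made-up frames; the `key`, `GEN` and `Persons` values are illustrative only.

```python
import pandas as pd

# Toy stand-ins for the GeoJSON table and the census table.
geo = pd.DataFrame({"key": ["A", "B", "C"], "GEN": ["a", "b", "c"]})
census = pd.DataFrame({"key": ["B", "C", "D"], "Persons": [10, 20, 30]})

# An outer merge keeps unmatched rows from both sides; _merge records the origin.
merged = geo.merge(census, on="key", how="outer", indicator=True)
counts = merged["_merge"].value_counts()
print(counts)

# Keys that exist on one side only.
only_geo = merged.loc[merged["_merge"] == "left_only", "key"].tolist()
only_census = merged.loc[merged["_merge"] == "right_only", "key"].tolist()
print(only_geo, only_census)
```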

### Using RS (Regional key) for matching

In [None]:
from geoscore_de.data_flow.municipality import load_municipality_data

df_muni = load_municipality_data("../data/raw/municipalities_2022.csv")

print("Count of unique municipalities from geo data:", gdf_mun["RS"].nunique())
print("Count of unique municipalities from muni data:", df_muni["MU_ID"].nunique())

In [None]:
df_merged = gdf_mun.merge(df_muni, left_on="RS", right_on="MU_ID", how="outer", indicator=True)

In [None]:
df_merged.drop_duplicates(subset=["RS", "MU_ID"])["_merge"].value_counts()

Many municipalities remain unmatched. This is probably because the RS key includes the Verbandsgemeinde (municipal association) level. The data also contains the AGS key, which has no Verbandsgemeinde level, so let's try matching on AGS instead of RS.
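
To illustrate how the two keys relate: in the general layout of the 12-digit regional key, digits 1-2 are the state, digit 3 the government district, digits 4-5 the district, digits 6-9 the municipal association and digits 10-12 the municipality. Removing digits 6-9 gives the 8-digit AGS. The sketch below assumes the `RS` column follows this layout; the `rs_to_ags` helper is hypothetical and the second sample key is made up.

```python
import pandas as pd


def rs_to_ags(rs: str) -> str:
    """Drop the 4-digit municipal association block (digits 6-9) from a 12-digit RS."""
    if len(rs) != 12 or not rs.isdigit():
        raise ValueError(f"Expected a 12-digit key, got {rs!r}")
    return rs[:5] + rs[9:]


# Toy keys kept as strings so that leading zeros survive.
toy = pd.DataFrame({"RS": ["010010000000", "073325001002"]})
toy["AGS_from_RS"] = toy["RS"].apply(rs_to_ags)
print(toy)
```

Comparing such a derived key with the `AGS` column already present in the data would be one way to confirm that the association digits are the cause of the unmatched rows.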

### Using AGS (Official municipality key) for matching

In [None]:
df_merged = gdf_mun.merge(df_muni, on="AGS", how="outer", indicator=True)

In [None]:
df_merged.drop_duplicates(subset=["AGS"])["_merge"].value_counts()

The counts are much better now. There are still 472 municipalities that appear only in the GeoJSON data. We will use the AGS key for further matching.

### Municipalities only in the GeoJSON data

In [None]:
df_merged[df_merged["_merge"] == "left_only"][["RS", "AGS", "GEN", "BEZ"]]

### Municipalities only in the census data

In [None]:
df_merged[df_merged["_merge"] == "right_only"][
    ["AGS", "Municipality", "Persons", "Area", "Population density", "_merge"]
].sort_values("AGS")

## Detailed comparison

Compare the area and population figures from both datasets for the matched municipalities.

In [None]:
import ast

import pandas as pd

# The destatis column holds a stringified dict; parse it and extract area and population
df_merged["geo_area"] = df_merged["destatis"].apply(lambda x: ast.literal_eval(x)["area"] if pd.notnull(x) else None)
df_merged["geo_population"] = df_merged["destatis"].apply(
    lambda x: ast.literal_eval(x)["population"] if pd.notnull(x) else None
)

In [None]:
df_merged[["geo_area", "Area", "geo_population", "Persons"]]

In [None]:
# create difference columns
df_merged["area_diff"] = df_merged["geo_area"] - df_merged["Area"].astype(float)
df_merged["population_diff"] = df_merged["geo_population"] - df_merged["Persons"].astype(float)

In [None]:
import plotnine as gg

# Plot area differences histogram

(
    gg.ggplot(df_merged, gg.aes(x="area_diff"))
    + gg.geom_histogram(binwidth=5, color="black", alpha=0.7)
    + gg.labs(title="Histogram of Area Differences", x="Area Difference (geo_area - Area)", y="Municipality Count")
    + gg.theme_minimal()
).draw()

In [None]:
(
    gg.ggplot(df_merged[abs(df_merged["area_diff"]) > 1], gg.aes(x="area_diff"))
    + gg.geom_histogram(binwidth=5, color="black", alpha=0.7)
    + gg.labs(
        title="Histogram of Area Differences (|area_diff| > 1)",
        x="Area Difference (geo_area - Area)",
        y="Municipality Count",
    )
    + gg.theme_minimal()
).draw()

The areas are essentially identical. The few small differences are probably due to missing values or rounding.
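
The visual impression can be backed by a number, for example the share of municipalities whose area difference stays within a tolerance. This is a sketch on made-up values; the `tolerance` of 0.05 is an arbitrary assumption and is expressed in the same unit as the area columns.

```python
import numpy as np
import pandas as pd

# Toy values; the last row has no area in the GeoJSON metadata.
toy = pd.DataFrame(
    {
        "geo_area": [10.0, 25.31, 7.5, np.nan],
        "Area": [10.0, 25.3, 9.0, 4.2],
    }
)

diff = toy["geo_area"] - toy["Area"]
n_missing = int(diff.isna().sum())  # rows where no comparison is possible
valid = diff.dropna()

tolerance = 0.05  # hypothetical threshold, same unit as the area columns
share_within = (valid.abs() <= tolerance).mean()

print(f"Compared rows: {len(valid)}, missing: {n_missing}")
print(f"Share within tolerance: {share_within:.2%}")
```

Reporting the missing rows separately keeps them from being mistaken for real differences.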

In [None]:
(
    gg.ggplot(df_merged, gg.aes(x="population_diff"))
    + gg.geom_histogram(binwidth=200, color="black", alpha=0.7)
    + gg.labs(
        title="Histogram of Population Differences",
        x="Population Difference (geo_population - Persons)",
        y="Municipality Count",
    )
    + gg.theme_minimal()
).draw()

In [None]:
df_merged[["AGS", "Persons", "geo_population", "population_diff"]].sort_values("population_diff")

There are large differences in the population figures for some municipalities. The GeoJSON data may be outdated and based on the 2011 census rather than the 2022 census.
For that reason, we cannot use the GeoJSON data for population analysis.
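
Absolute differences are dominated by large cities, so a relative difference can make the mismatches easier to compare across municipalities. The sketch below uses made-up numbers; the `population_rel_diff` column name and the 5% threshold are assumptions for illustration, and rows with zero `Persons` would need separate handling.

```python
import pandas as pd

# Toy values, invented for illustration only.
toy = pd.DataFrame(
    {
        "AGS": ["00000001", "00000002", "00000003"],
        "Persons": [1000, 50000, 200],
        "geo_population": [1010, 47000, 300],
    }
)

toy["population_diff"] = toy["geo_population"] - toy["Persons"]
toy["population_rel_diff"] = toy["population_diff"] / toy["Persons"]

threshold = 0.05  # hypothetical cut-off for a noticeable mismatch
flagged = toy[toy["population_rel_diff"].abs() > threshold]
print(flagged[["AGS", "population_diff", "population_rel_diff"]])
```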