## Explore birth data

In [None]:
import pandas as pd


def load_data(path="../../data/raw/features/12612-91-01-5-births.csv"):
    """Load birth data from a CSV file."""
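    df = pd.read_csv(
        path,
        skiprows=5,  # skip the 5 lines above the data
        skipfooter=4,  # skip the 4 lines below the data
        engine="python",  # skipfooter is not supported by the C engine
        encoding="latin-1",
        na_values="-",  # treat "-" as missing
        sep=";",
        header=None,
        names=["MU_ID", "MU_name", "births"],
        dtype={"MU_ID": str},  # keep IDs as strings so leading zeros survive
    )
    # Right-pad regional keys with zeros to the 8-character AGS length.
    df["AGS"] = df["MU_ID"].str.ljust(8, "0")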

    return df
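
The `AGS` column in `load_data` is built by right-padding `MU_ID` with zeros to 8 characters, the length of a municipality key. As a minimal sketch of what that padding does, here is the same operation on plain strings. The helper name `pad_to_ags` and the example keys are hypothetical and used only for illustration.

```python
def pad_to_ags(key: str, width: int = 8) -> str:
    """Right-pad a regional key with zeros, mirroring Series.str.ljust(8, "0")."""
    return key.ljust(width, "0")


# Hypothetical keys, for illustration only (not taken from the CSV).
print(pad_to_ags("01001"))     # shorter key gets padded
print(pad_to_ags("01001000"))  # 8-character key is unchanged
```

Note that `ljust` never truncates, so a key longer than 8 characters passes through unchanged.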

In [None]:
raw_df = load_data()
raw_df

In [None]:
# Load the municipality data and keep only rows whose AGS matches a known municipality
from geoscore_de.data_flow.features.municipality import MunicipalityFeature

muni_df = MunicipalityFeature("../../data/raw/municipalities_2022.csv").load()
filtered_df = raw_df[raw_df["AGS"].isin(muni_df["AGS"])]

In [None]:
import plotnine as gg

(
    gg.ggplot(filtered_df, gg.aes(x="births"))
    + gg.geom_histogram(bins=30)
    + gg.scale_x_log10()
    + gg.labs(x="Number of Births (log scale)", y="Count", title="Distribution of Births")
)

Birth counts differ widely across municipalities, largely because population sizes differ.
To get a more meaningful feature, we normalize the birth counts by population (births per capita).

In [None]:
# merge muni_df with filtered_df to get Persons column
merged_df = filtered_df.merge(muni_df[["AGS", "Persons"]], on="AGS", how="left")

col = "births"
merged_df[f"{col}_per_capita"] = merged_df[col] / merged_df["Persons"]

merged_df

In [None]:
(
    gg.ggplot(merged_df, gg.aes(x=f"{col}_per_capita"))
    + gg.geom_histogram(bins=30)
    + gg.labs(x=f"{col} per capita", y="Count", title=f"Distribution of {col} per capita")
)

The distribution of births per capita looks approximately normal, so it should work well as a feature for our model.
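
The visual impression from the histogram can be backed up with a rough numeric check. The sketch below computes the moment-based (population) skewness, which is close to 0 for a symmetric distribution and positive for a long right tail. The function name `moment_skewness` and the sample values are hypothetical, and skewness near 0 is only a necessary condition for normality, not a proof of it.

```python
import math


def moment_skewness(values):
    """Population skewness g1 = m3 / m2**1.5, ignoring None and NaN entries."""
    xs = [float(v) for v in values if v is not None and not math.isnan(float(v))]
    n = len(xs)
    if n < 3:
        raise ValueError("need at least three non-missing values")
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in xs) / n  # third central moment
    if m2 == 0:
        raise ValueError("skewness is undefined for constant data")
    return m3 / m2**1.5


# Hypothetical per-capita values, for illustration only (not from the dataset).
symmetric = [0.008, 0.009, 0.010, 0.011, 0.012]
right_skewed = [0.008, 0.009, 0.010, 0.011, 0.030]

print(moment_skewness(symmetric))     # close to 0
print(moment_skewness(right_skewed))  # positive: long right tail
```

In this notebook the check would be `moment_skewness(merged_df["births_per_capita"])`. pandas' own `Series.skew()` returns a bias-corrected variant, so its value differs slightly on small samples.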