# Source data: extraction and preprocessing

This notebook is part of the municipal dataset update pipeline. It performs pre-processing of the raw data to make it suitable for transformation in sector-specific notebooks. The notebook outputs intermediate csv files and stores them in the data/intermediate folder.

The flowchart below visualizes the processes in this notebook. 
![Preprocess notebook data flow](config/mermaid_flow_diagrams/preprocessing_overall_data_flow.png)


## Setup

In [1]:
# ───────── LIBRARIES ─────────
import pandas as pd
import numpy as np
import xlwings as xw
from pathlib import Path
import importlib
import src.transform
import src.extract_helper
import src.helper
importlib.reload(src.transform)
importlib.reload(src.extract_helper)
importlib.reload(src.helper)
from src.transform import TransformerV2

## General

We first specify general parameters to be used throughout this notebook.

In [2]:
# Select the parent data set. Make sure to use the geo ID. It should be existing in the Dataset Manager.
parent = "nl"

# Specify the year and the reference year for the ETM
year = 2023
year_etm = 2023

# Specify the CSV-separator (presumably either "," or ";")
sep=","

## CBS

### Extract

#### List of municipalities

We import the list of municipalities from an Excel file provided by the CBS

In [3]:
path_cbs = Path("data", "raw", f"Gemeenten alfabetisch {year}.xlsx")
wb_cbs = xw.Book(str(path_cbs))
ws_cbs_municipalities = wb_cbs.sheets["Gemeenten_alfabetisch"]

df_cbs_source_data = pd.DataFrame(ws_cbs_municipalities.used_range.value)
df_cbs_source_data.columns = df_cbs_source_data.iloc[0] # set column headers
df_cbs_source_data = df_cbs_source_data[1:] # remove superfluous column headers
df_cbs_source_data = df_cbs_source_data.set_index(df_cbs_source_data.columns[1]) # take municipal code as index

# Preview the data
df_cbs_source_data.index

Index(['GM1680', 'GM0358', 'GM0197', 'GM0059', 'GM0482', 'GM0613', 'GM0361',
       'GM0141', 'GM0034', 'GM0484',
       ...
       'GM0355', 'GM0299', 'GM0637', 'GM0638', 'GM1892', 'GM0879', 'GM0301',
       'GM1896', 'GM0642', 'GM0193'],
      dtype='object', name='GemeentecodeGM', length=342)

In [4]:
# Define a list of municipalities
municipalities = df_cbs_source_data.index.to_list()

#### Car data

The CBS also reports data on number of cars. This information is used to calculate CO2 emissions of different types of cars and fuel types. Let's import this data here too. 
Sources:
- Personenauto's actief: https://opendata.cbs.nl/#/CBS/nl/dataset/85237NED/table?dl=C09C3
- Verkeersprestaties bedrijfsvoertuigen: https://opendata.cbs.nl/#/CBS/nl/dataset/85239NED/table?dl=C23E5

In [180]:
# Import CBS - personenauto's actief 2023 from raw folder
path_cbs_pa = Path("data", "raw", f"CBS - personenauto's actief {year}.csv")
df_cbs_pa = pd.read_csv(path_cbs_pa, sep=';', encoding='utf-8', index_col=0)
# Preview the data
df_cbs_pa

Unnamed: 0_level_0,Perioden,Brandstofsoort van personenauto's/Totaal (aantal),Brandstofsoort van personenauto's/Benzine (aantal),Brandstofsoort van personenauto's/Diesel (aantal),Brandstofsoort van personenauto's/LPG (aantal),Brandstofsoort van personenauto's/Elektriciteit (aantal),Brandstofsoort van personenauto's/CNG (aantal),Brandstofsoort van personenauto's/Overig/Onbekend (aantal)
Bouwjaar,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Totaal alle bouwjaren,2023,8917107,6992291,860021,94857,957886,8276,3776


In [181]:
# Import CBS - verkeersprestaties bedrijfsvoertuigen 20233 from raw folder
path_cbs_ba = Path("data", "raw", f"CBS - verkeersprestaties bedrijfsvoertuigen {year}.csv")
df_cbs_ba = pd.read_csv(path_cbs_ba, sep=';', encoding='utf-8', index_col=0)
# Preview the data
df_cbs_ba

Unnamed: 0_level_0,Bouwjaren,Perioden,Bedrijfsvoertuigen naar brandstofsoort/Totaal (aantal),Bedrijfsvoertuigen naar brandstofsoort/Benzine (aantal),Bedrijfsvoertuigen naar brandstofsoort/Diesel (aantal),Bedrijfsvoertuigen naar brandstofsoort/LPG (aantal),Bedrijfsvoertuigen naar brandstofsoort/Elektriciteit (aantal),Bedrijfsvoertuigen naar brandstofsoort/CNG (aantal),Bedrijfsvoertuigen naar brandstofsoort/Geen/overige/onbekende brandstof (aantal)
Voertuigtype,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Totaal bedrijfsvoertuigen,Totaal alle bouwjaren,2023,2456150,36934.0,1119261.0,21729.0,17075.0,4979.0,1256172
Bestelauto,Totaal alle bouwjaren,2023,989841,34062.0,916950.0,19935.0,14945.0,3934.0,15
Vrachtauto (excl. trekker voor oplegger),Totaal alle bouwjaren,2023,60811,724.0,59319.0,278.0,284.0,154.0,52
Trekker voor oplegger,Totaal alle bouwjaren,2023,85679,51.0,84322.0,148.0,72.0,106.0,980
Speciaal voertuig,Totaal alle bouwjaren,2023,55974,2089.0,52123.0,1364.0,76.0,288.0,34
Bus,Totaal alle bouwjaren,2023,8756,8.0,6547.0,4.0,1698.0,497.0,2
Aanhangwagen,Totaal alle bouwjaren,2023,1069827,,,,,,1069827
Oplegger,Totaal alle bouwjaren,2023,185262,,,,,,185262


## ETLocal: prepare empty ETLocal template

We then prepare a template for the ETLocal interface elements that the other notebooks are going to fill. 

### Extract
We first read all ETLocal interface elements from a CSV file.

For now we make two versions of the ETLocal template
- etlocal_template_empty: based on the etlocal_interface_elements from 2023 (Roos' work)
- etlocal_template_empty_2025: based on the latest etlocal_interface_elements

In [7]:
# ────────── ETLocal Interface Elements (2023) ──────────
path = Path("config","etlocal_interface_elements.csv")
empty_template = pd.read_csv(path, header=[0], sep=sep)

# Add columns geo_id, value and commit to the template
for column in ['geo_id', 'value', 'commit']:
    empty_template[column] = pd.NA # pandas native NaN suitable for both strings and floats
    
# Fill the geo_id columns with all relevant municipal geo IDs
templates = []
for municipality in municipalities:
    template_to_add = empty_template.copy()
    template_to_add['geo_id'] = municipality
    templates.append(template_to_add)

template = pd.concat(templates)

# Transform the templates into a multi-index dataframe
index = pd.MultiIndex.from_frame(template[['geo_id', 'group', 'subgroup', 'key']])
template = template.drop(columns=['geo_id', 'group', 'subgroup', 'key'])
template.index = index

# Preview merged template
template.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,unit,value,commit
geo_id,group,subgroup,key,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
GM1680,agriculture,agriculture_energy_demand,agriculture_final_demand_electricity_demand,TJ,,
GM1680,agriculture,agriculture_energy_demand,agriculture_final_demand_network_gas_demand,TJ,,
GM1680,agriculture,agriculture_energy_demand,agriculture_final_demand_steam_hot_water_demand,TJ,,
GM1680,agriculture,agriculture_energy_demand,agriculture_final_demand_wood_pellets_demand,TJ,,
GM1680,agriculture,agriculture_energy_demand,input_agriculture_final_demand_crude_oil_demand,TJ,,


In [8]:
# ────────── ETLocal Interface Elements (2025) ──────────
path = Path("config","250606_etlocal_interface_elements.csv")
empty_template = pd.read_csv(path, header=[0], sep=sep)

# Add columns geo_id, value and commit to the template
for column in ['geo_id', 'value', 'commit']:
    empty_template[column] = pd.NA # pandas native NaN suitable for both strings and floats
    
# Fill the geo_id columns with all relevant municipal geo IDs
templates = []
for municipality in municipalities:
    template_to_add = empty_template.copy()
    template_to_add['geo_id'] = municipality
    templates.append(template_to_add)

template = pd.concat(templates)

# Transform the templates into a multi-index dataframe
index = pd.MultiIndex.from_frame(template[['geo_id', 'group', 'subgroup', 'key']])
template = template.drop(columns=['geo_id', 'group', 'subgroup', 'key'])
template.index = index

# Preview merged template
template.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,unit,value,commit
geo_id,group,subgroup,key,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
GM1680,agriculture,agriculture_heat_chp,agriculture_chp_engine_network_gas_dispatchable_demand,TJ,,
GM1680,agriculture,agriculture_heat_chp,agriculture_chp_engine_biogas_demand,TJ,,
GM1680,agriculture,agriculture_heat_chp,agriculture_chp_wood_pellets_demand,TJ,,
GM1680,agriculture,agriculture_energy_demand,agriculture_final_demand_electricity_demand,TJ,,
GM1680,agriculture,agriculture_energy_demand,agriculture_final_demand_network_gas_demand,TJ,,


### Export

We now export the template to the data/intermediate folder.

In [None]:
# template.to_csv("data/intermediate/etlocal_template_empty.csv", sep=sep, index=True) # no longer used as of August 2025
template.to_csv("data/intermediate/etlocal_template_empty_2025.csv", sep=sep, index=True)

## Klimaatmonitor (municipal data)

### Extract

We start by collecting the relevant data from the Klimaatmonitor data export. This results in a dataframe with key/value combinations for all Dutch municipalities.

At the moment we rely on two KM data dumps:
1. 'Gemeenten van Nederland', containing most energy-related variables
2. 'Gemeenten - Transport, containing emission-related variables

Both data dumps are the 2023 version of the 2019 Klimaatmonitor table (i.e. the same table only with year = 2023). This has a few consequences:
1. Some variables are missing because KM no longer reports them. 
2. We might also not be utilizing any new variables that KM now reports.
3. The original 2019 table already included variables that KM no longer reported on, or were not used in the original pipeline at all. 

TODO: create 1 complete data dump using most up-to-date version of KM, and clean up any unused or unavailable KM variables. Possibly via API

#### Energy data

We import both the data and the relevant meta data for each key (or "topic"). This provides us with information about, among others, the topic and unit for each key.

Source: https://klimaatmonitor.databank.nl/Jive?workspace_guid=2f1f517e-0c72-4b61-8e65-e01832e4ee6a

The 2019 to 2022 data can be downloaded by changing the year in the link above. 

The last download of all 2019 - 2023 data took place on 28-08-2025.

Here we extract and clean the KM data in one go. The `helper.fill_missing_KM_data` method does the following:
1. Import the KM 2023 data (`main_csv_path`) and convert all missing/?/- etc. data to NaN
2. Fill in these NaNs with KM data from 2020-2022 (`backup_csv_paths`)
3. For all remaining NaN, fill with 0

The method exports the result to an Excel file. This Excel file is subsequently imported again to be transformed in the next stage of this pipeline.

In [137]:
# Setup variables
path_output = Path("data", "pre_processing")
output_name = "KM_backfilled_to_2019.csv"
main_csv_path = Path("data", "raw", "Klimaatmonitor - 2023 - Gemeenten van Nederland.xlsx")
backup_csv_paths=[
        Path("data", "raw", "Klimaatmonitor - 2022 - Gemeenten van Nederland.xlsx"),
        Path("data", "raw", "Klimaatmonitor - 2021 - Gemeenten van Nederland.xlsx"),
        Path("data", "raw", "Klimaatmonitor - 2020 - Gemeenten van Nederland.xlsx"),
        Path("data", "raw", "Klimaatmonitor - 2019 - Gemeenten van Nederland.xlsx")
    ]

In [138]:
result = src.helper.fill_missing_KM_data(
    main_csv_path=main_csv_path,
    backup_csv_paths=backup_csv_paths,
    output_path=path_output,
    output_filename="Klimaatmonitor - 2023 - Gemeenten van Nederland - backfilled.xlsx",
    main_year="2023",
    convert_remaining_to_zero=True
)

result

Initial missing: 4,885
Successfully filled: 1,007
Converted to 0: 3,878


{'excel_path': 'data/pre_processing/Klimaatmonitor - 2023 - Gemeenten van Nederland - backfilled.xlsx',
 'initial_missing': np.int64(4885),
 'total_fills': np.int64(1007),
 'zero_conversions': np.int64(3878),
 'final_missing': np.int64(0),
 'fills_by_source': {'Klimaatmonitor - 2022 - Gemeenten van Nederland': np.int64(490),
  'Klimaatmonitor - 2021 - Gemeenten van Nederland': np.int64(279),
  'Klimaatmonitor - 2020 - Gemeenten van Nederland': np.int64(92),
  'Klimaatmonitor - 2019 - Gemeenten van Nederland': np.int64(146)},
 'color_mapping': {'Klimaatmonitor - 2022 - Gemeenten van Nederland': 'CC2828',
  'Klimaatmonitor - 2021 - Gemeenten van Nederland': '7ACC28',
  'Klimaatmonitor - 2020 - Gemeenten van Nederland': '28CCCC',
  'Klimaatmonitor - 2019 - Gemeenten van Nederland': '7A28CC'},
 'legend_created': True}

In [139]:
# re-import the cleaned Klimaatmonitor data
# from the 'Data' tab from Klimaatmonitor - 2023 - Gemeenten van Nederland - backfilled.xlsx
path_km_cleaned = Path("data", "pre_processing", "Klimaatmonitor - 2023 - Gemeenten van Nederland - backfilled.xlsx")
wb_km_cleaned = xw.Book(str(path_km_cleaned))
ws_km_cleaned_municipal_source_data = wb_km_cleaned.sheets["Data"]
df_km_municipal_source_data_cleaned = pd.DataFrame(ws_km_cleaned_municipal_source_data.used_range.value)

# Set the first row as the header and remove it from the data
df_km_municipal_source_data_cleaned.columns = df_km_municipal_source_data_cleaned.iloc[0]
df_km_municipal_source_data_cleaned = df_km_municipal_source_data_cleaned[1:]

# convert 'Code' column to integers to remove decimal zeros. Then convert to strings and set as index
df_km_municipal_source_data_cleaned['Code'] = df_km_municipal_source_data_cleaned['Code'].astype(int).astype(str)
df_km_municipal_source_data_cleaned = df_km_municipal_source_data_cleaned.set_index('Code')

df_km_municipal_source_data_cleaned


Unnamed: 0_level_0,Gebieden,inwoners_2023,woningen_2023,energie_totaal_combi_2023,verk_totaal_2023,elektra_totaal_combi_2023,warm_totaal_combi_2023,zonpvachtermeter_kwh_2023,hern_warm_tot_2023,gas_woningen_tj_2023,...,efacgas_2023,vbrzg_tot_tj_2023,vbrze_tot_tj_2023,zonpvtj_2023,warelektr_2023,hern_warm_tot_ex_groengas_2023,windmw_2023,efacel_2023,warwarmte_2023,ondieptj_2023
Code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1680,Nederland: Aa en Hunze,25724.0,11428.0,3021.0,1282.0,327.0,0.0,13.7,76.0,435.0,...,0.001782,0.0,163.0,142.0,0.0,76.0,62.4,0.00022,0.0,0.0
358,Nederland: Aalsmeer,33063.0,13460.0,2566.0,513.0,762.0,1291.0,15.6,43.0,416.0,...,0.001782,0.0,577.0,154.0,0.0,43.0,0.0,0.00022,0.0,6.0
197,Nederland: Aalten,27244.0,12164.0,1411.0,452.0,367.0,592.0,14.0,52.0,382.0,...,0.001782,0.0,203.0,143.0,0.0,52.0,16.0,0.00022,0.0,0.0
59,Nederland: Achtkarspelen,28149.0,12289.0,1896.0,569.0,445.0,882.0,11.2,87.0,420.0,...,0.001782,395.0,295.0,140.0,0.0,87.0,0.7,0.00022,0.0,0.0
482,Nederland: Alblasserdam,20356.0,8449.0,1512.0,456.0,425.0,631.0,5.3,14.0,211.0,...,0.001782,404.0,332.0,51.0,0.0,14.0,0.0,0.00022,0.0,3.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
301,Nederland: Zutphen,48510.0,23381.0,2410.0,633.0,734.0,1043.0,15.1,37.0,626.0,...,0.001782,379.0,496.0,184.0,0.0,37.0,6.0,0.00022,0.0,1.0
1896,Nederland: Zwartewaterland,23368.0,9196.0,2250.0,453.0,711.0,1086.0,11.6,83.0,293.0,...,0.001782,0.0,581.0,116.0,0.0,83.0,0.0,0.00022,0.0,2.0
642,Nederland: Zwijndrecht,45018.0,20828.0,3304.0,794.0,710.0,1800.0,26.9,25.0,553.0,...,0.001782,1221.0,441.0,278.0,0.0,25.0,0.0,0.00022,0.0,6.0
193,Nederland: Zwolle,132411.0,60475.0,8415.0,2955.0,2290.0,3170.0,50.3,228.0,1509.0,...,0.001782,1399.0,1633.0,581.0,0.0,228.0,9.9,0.00022,0.0,87.0


We also import the metadata based on the 2023 Klimaatmonitor export.

In [140]:
df_km_municipal_meta_data = src.extract_helper.load_km_metadata(main_csv_path)
df_km_municipal_meta_data

Unnamed: 0_level_0,Onderwerp,Eenheid,Bron,Voetnoot,Beschrijving,Gegevenstype,Laatste database wijziging
Code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
inwoners,Aantal inwoners,aantal,CBS - Kerncijfers Wijken en buurten,,,Numeriek,6-8-2025 08:21:16
woningen,Aantal woningen per 1 januari,aantal,CBS - Kerncijfers Wijken en buurten,,,Numeriek,18-7-2025 10:36:01
energie_totaal_combi,"Totaal bekend energieverbruik (aardgas, elektr...",TJ,Berekening (sub)totalen energieverbruik,,,Numeriek,27-8-2025 05:46:19
verk_totaal,Totaal bekend energieverbruik Verkeer en vervo...,TJ,Berekening energieverbruik brandstof,,,Numeriek,18-8-2025 10:53:36
elektra_totaal_combi,"Totaal bekend elektriciteitsverbruik, incl. zo...",TJ,Berekening (sub)totalen energieverbruik,,,Numeriek,27-8-2025 05:46:20
...,...,...,...,...,...,...,...
hern_warm_tot_ex_groengas,Totaal bekende hernieuwbare warmte exclusief a...,TJ,Verdeling regionale gegevens hernieuwbare ener...,,,Numeriek,27-8-2025 05:47:26
windmw,Wind op land fysiek opgesteld vermogen,MW,Windstats,,,Numeriek,6-5-2025 10:31:31
efacel,Emissiefactor elektriciteit (ton/kWh),ton/kWh,CBS - Jaarlijkse publicatie CO₂-emissiefactor ...,,,Gemiddelde,23-7-2025 13:46:52
warwarmte,"Doorgeleverde warmte afvalverbranding (AVI, fo...",TJ,"Werkgroep Afvalregistratie, onderdeel afvalver...",,,Numeriek,20-8-2025 12:05:57


#### Transport data

We import the transport data dump from Klimaatmonitor: both the data itself and relevant meta data.

Source:
https://klimaatmonitor.databank.nl/Jive?workspace_guid=eff703ad-6af3-4939-8439-bdbd272458ef

The last download of this file took place on 28-08-2025.

In [141]:

# Import the 'transport' data dump
path_km_transport = Path("data", "raw", f"Klimaatmonitor - {year} - Gemeenten - Transport.xlsx")
wb_km_transport = xw.Book(str(path_km_transport))
ws_km_municipal_transport_source_data = wb_km_transport.sheets["Data"]
df_km_municipal_transport_source_data = pd.DataFrame(ws_km_municipal_transport_source_data.used_range.value)
df_km_municipal_transport_source_data.columns = df_km_municipal_transport_source_data.iloc[0]
df_km_municipal_transport_source_data = df_km_municipal_transport_source_data[1:]
df_km_municipal_transport_source_data = df_km_municipal_transport_source_data.set_index(df_km_municipal_transport_source_data.columns[1])

# import the metadata
ws_km_municipal_transport_meta_data = wb_km_transport.sheets["Onderwerp Informatie"]
df_km_municipal_transport_meta_data = pd.DataFrame(ws_km_municipal_transport_meta_data.used_range.value)
df_km_municipal_transport_meta_data.columns = df_km_municipal_transport_meta_data.iloc[0]
df_km_municipal_transport_meta_data = df_km_municipal_transport_meta_data[1:]
df_km_municipal_transport_meta_data = df_km_municipal_transport_meta_data.set_index(df_km_municipal_transport_meta_data.columns[0])

# Close the Excel workbook
wb_km_transport.close()

In [142]:
# Preview the source data
df_km_municipal_transport_source_data.head()

Unnamed: 0_level_0,Gebieden,energie_bebouwde_kom_2023,energie_buitenweg_2023,energie_snelweg_2023,co2_bk_pa_2023,co2_bk_ba_2023,co2_bk_ab_2023,co2_bw_pa_2023,co2_bw_ba_2023,co2_bw_ab_2023,co2_sw_pa_2023,co2_sw_ba_2023,co2_sw_ab_2023,efac_benz_g_mj_2023,efac_diesel_g_mj_2023
Code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
1680,Aa en Hunze,122.0,300.0,628.0,5711.0,915.0,807.0,13561.0,3297.0,691.0,21120.0,6879.0,203.0,72.11342,72.61941
358,Aalsmeer,248.0,212.0,0.0,12104.0,1939.0,601.0,8178.0,1989.0,98.0,,,0.0,72.11342,72.61941
197,Aalten,79.0,286.0,0.0,3789.0,607.0,93.0,12436.0,3024.0,82.0,,,0.0,72.11342,72.61941
59,Achtkarspelen,110.0,351.0,0.0,5469.0,876.0,330.0,16154.0,3928.0,255.0,,,0.0,72.11342,72.61941
482,Alblasserdam,107.0,56.0,210.0,5420.0,868.0,254.0,2666.0,648.0,69.0,7099.0,2312.0,33.0,72.11342,72.61941


In [143]:
# Preview the meta data
df_km_municipal_transport_meta_data.tail()

Unnamed: 0_level_0,Onderwerp,Eenheid,Bron,Voetnoot,Beschrijving,Gegevenstype,Laatste database wijziging
Code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
co2_sw_pa,CO₂-uitstoot Uitlaatgassen personenauto auto(s...,ton,Emissieregistratie - CO₂-uitstoot verkeer en v...,,,Numeriek,18-8-2025 10:53:35
co2_sw_ba,CO₂-uitstoot Uitlaatgassen bestelauto auto(sne...,ton,Emissieregistratie - CO₂-uitstoot verkeer en v...,,,Numeriek,18-8-2025 10:53:35
co2_sw_ab,CO₂-uitstoot Uitlaatgassen autobus auto(snel)w...,ton,Emissieregistratie - CO₂-uitstoot verkeer en v...,,,Numeriek,7-2-2025 19:13:08
efac_benz_g_mj,Emissiefactor benzine,ton/TJ,Emissieregistratie - CO₂-uitstoot,,,Gemiddelde,8-2-2025 00:02:13
efac_diesel_g_mj,Emissiefactor diesel,ton/TJ,Emissieregistratie - CO₂-uitstoot,,,Gemiddelde,8-2-2025 00:02:13


Klimaatmonitor reports emission factors for gasoline (benzine) and diesel but not for LPG or CNG. These are available from the RIVM - Methodology report Transport ER 1990-2023, [available here](https://www.emissieregistratie.nl/documentatie/methoderapporten/verkeer-en-vervoer).

We therefore update all emission factors with those reported by the RIVM. See tables 2.2 and 2.7 from the report above

In [144]:
# Define emission factors manually from the RIVM report
# Table 2.2
efac_lpg_g_mj_2023 = 66.7 # g/MJ

# Table 2.7, 2023 values
efac_benz_g_mj_2023 = 72.12023506 # g/MJ
efac_diesel_g_mj_2023 = 74.87000000 # g/MJ
efac_cng_g_mj_2023 = 56.4 # g/MJ

# Update the emission factors in the km transport source data  
df_km_municipal_transport_source_data['efac_lpg_g_mj_2023'] = efac_lpg_g_mj_2023
df_km_municipal_transport_source_data['efac_cng_g_mj_2023'] = efac_cng_g_mj_2023
df_km_municipal_transport_source_data['efac_benz_g_mj_2023'] = efac_benz_g_mj_2023
df_km_municipal_transport_source_data['efac_diesel_g_mj_2023'] = efac_diesel_g_mj_2023

# Preview the data
df_km_municipal_transport_source_data

Unnamed: 0_level_0,Gebieden,energie_bebouwde_kom_2023,energie_buitenweg_2023,energie_snelweg_2023,co2_bk_pa_2023,co2_bk_ba_2023,co2_bk_ab_2023,co2_bw_pa_2023,co2_bw_ba_2023,co2_bw_ab_2023,co2_sw_pa_2023,co2_sw_ba_2023,co2_sw_ab_2023,efac_benz_g_mj_2023,efac_diesel_g_mj_2023,efac_lpg_g_mj_2023,efac_cng_g_mj_2023
Code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1
1680,Aa en Hunze,122.0,300.0,628.0,5711.0,915.0,807.0,13561.0,3297.0,691.0,21120.0,6879.0,203.0,72.120235,74.87,66.7,56.4
358,Aalsmeer,248.0,212.0,0.0,12104.0,1939.0,601.0,8178.0,1989.0,98.0,,,0.0,72.120235,74.87,66.7,56.4
197,Aalten,79.0,286.0,0.0,3789.0,607.0,93.0,12436.0,3024.0,82.0,,,0.0,72.120235,74.87,66.7,56.4
59,Achtkarspelen,110.0,351.0,0.0,5469.0,876.0,330.0,16154.0,3928.0,255.0,,,0.0,72.120235,74.87,66.7,56.4
482,Alblasserdam,107.0,56.0,210.0,5420.0,868.0,254.0,2666.0,648.0,69.0,7099.0,2312.0,33.0,72.120235,74.87,66.7,56.4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
301,Zutphen,235.0,325.0,0.0,11611.0,1860.0,450.0,14793.0,3597.0,45.0,,,0.0,72.120235,74.87,66.7,56.4
1896,Zwartewaterland,110.0,256.0,0.0,5036.0,807.0,149.0,11053.0,2687.0,274.0,,,0.0,72.120235,74.87,66.7,56.4
642,Zwijndrecht,194.0,63.0,379.0,10090.0,1616.0,290.0,3024.0,735.0,21.0,15149.0,4934.0,39.0,72.120235,74.87,66.7,56.4
193,Zwolle,681.0,673.0,1350.0,34966.0,5601.0,2342.0,29890.0,7268.0,312.0,45526.0,14828.0,210.0,72.120235,74.87,66.7,56.4


In [145]:

# Update the meta data for gasoline and diesel with the source of the emission factors
df_km_municipal_transport_meta_data.loc['efac_benz_g_mj', 'Bron'] = "Overwritten manually by Quintel with RIVM data"
df_km_municipal_transport_meta_data.loc['efac_benz_g_mj', 'Beschrijving'] = 'RIVM - Methodology report Transport ER 1990-2023, table 2.7'

df_km_municipal_transport_meta_data.loc['efac_diesel_g_mj', 'Bron'] = "Overwritten manually by Quintel with RIVM data"
df_km_municipal_transport_meta_data.loc['efac_diesel_g_mj', 'Beschrijving'] = 'RIVM - Methodology report Transport ER 1990-2023, table 2.7'

# Add the emission factor meta data for LPG and CNG
df_km_municipal_transport_meta_data.loc['efac_lpg_g_mj', 'Onderwerp'] = "Emission factor LPG"
df_km_municipal_transport_meta_data.loc['efac_lpg_g_mj', 'Eenheid'] = "tonne/TJ"
df_km_municipal_transport_meta_data.loc['efac_lpg_g_mj', 'Bron'] = "Overwritten manually by Quintel with RIVM data"
df_km_municipal_transport_meta_data.loc['efac_lpg_g_mj', 'Beschrijving'] = 'RIVM - Methodology report Transport ER 1990-2023, table 2.2'

df_km_municipal_transport_meta_data.loc['efac_cng_g_mj', 'Onderwerp'] = "Emission factor CNG"
df_km_municipal_transport_meta_data.loc['efac_cng_g_mj', 'Eenheid'] = "tonne/TJ"
df_km_municipal_transport_meta_data.loc['efac_cng_g_mj', 'Bron'] = "Overwritten manually by Quintel with RIVM data"
df_km_municipal_transport_meta_data.loc['efac_cng_g_mj', 'Beschrijving'] = 'RIVM - Methodology report Transport ER 1990-2023, table 2.7'

df_km_municipal_transport_meta_data


Unnamed: 0_level_0,Onderwerp,Eenheid,Bron,Voetnoot,Beschrijving,Gegevenstype,Laatste database wijziging
Code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
energie_bebouwde_kom,Energieverbruik wegverkeer bebouwde kom (diese...,TJ,Berekening energieverbruik brandstof,,,Numeriek,18-8-2025 10:53:36
energie_buitenweg,Energieverbruik wegverkeer buitenwegen (diese...,TJ,Berekening energieverbruik brandstof,,,Numeriek,18-8-2025 10:53:36
energie_snelweg,Energieverbruik wegverkeer auto(snel)wegen (di...,TJ,Berekening energieverbruik brandstof,,,Numeriek,18-8-2025 10:53:36
co2_bk_pa,CO₂-uitstoot Uitlaatgassen personenauto bebouw...,ton,Emissieregistratie - CO₂-uitstoot verkeer en v...,,,Numeriek,18-8-2025 10:53:35
co2_bk_ba,CO₂-uitstoot Uitlaatgassen bestelauto bebouwde...,ton,Emissieregistratie - CO₂-uitstoot verkeer en v...,,,Numeriek,18-8-2025 10:53:35
co2_bk_ab,CO₂-uitstoot Uitlaatgassen autobus bebouwde kom,ton,Emissieregistratie - CO₂-uitstoot verkeer en v...,,,Numeriek,7-2-2025 19:13:07
co2_bw_pa,CO₂-uitstoot Uitlaatgassen personenauto buiten...,ton,Emissieregistratie - CO₂-uitstoot verkeer en v...,,,Numeriek,18-8-2025 10:53:35
co2_bw_ba,CO₂-uitstoot Uitlaatgassen bestelauto buitenwegen,ton,Emissieregistratie - CO₂-uitstoot verkeer en v...,,,Numeriek,18-8-2025 10:53:35
co2_bw_ab,CO₂-uitstoot Uitlaatgassen autobus buitenwegen,ton,Emissieregistratie - CO₂-uitstoot verkeer en v...,,,Numeriek,7-2-2025 19:13:07
co2_sw_pa,CO₂-uitstoot Uitlaatgassen personenauto auto(s...,ton,Emissieregistratie - CO₂-uitstoot verkeer en v...,,,Numeriek,18-8-2025 10:53:35


### Transform

#### Clean and preprocess

We first check whether the Klimaatmonitor data refers to the same municipalities as the CBS municipality list. This list is imported [here](#list-of-municipalities).

In [148]:
# Trim GM from cbs list of municipalities
municipalities_cbs = df_cbs_source_data.index.to_list()
municipalities_cbs = [x.replace("GM", "") for x in municipalities_cbs]

# Ensure all municipalities have consistent formatting (remove leading zeros)
municipalities_cbs = [x.lstrip('0') for x in municipalities_cbs]

# Obtain list of municipalities from transport data
municipalities_transport = df_km_municipal_transport_source_data.index.to_list()
# Make sure they're all strings and remove leading zeros for consistent comparison
municipalities_transport = [str(x).lstrip('0') for x in municipalities_transport]

# Check if there are municipalities in transport data not in CBS data
missing_municipalities_in_cbs = set(municipalities_transport) - set(municipalities_cbs)
print(f"Municipalities in transport data but not in CBS data: {missing_municipalities_in_cbs}")

# Check if there are municipalities in CBS data not in transport data
missing_municipalities_in_transport = set(municipalities_cbs) - set(municipalities_transport)
print(f"Municipalities in CBS data but not in transport data: {missing_municipalities_in_transport}")

# Do the same for the km municipal source data.
municipalities_km = df_km_municipal_source_data_cleaned.index.to_list()

# # Check if there are municipalities in km data not in CBS data
missing_municipalities_in_cbs_km = set(municipalities_km) - set(municipalities_cbs)
print(f"Municipalities in km data but not in CBS data: {missing_municipalities_in_cbs_km}")
# Check if there are municipalities in CBS data not in km data
missing_municipalities_in_km = set(municipalities_cbs) - set(municipalities_km)
print(f"Municipalities in CBS data but not in km data: {missing_municipalities_in_km}")


Municipalities in transport data but not in CBS data: {'2000'}
Municipalities in CBS data but not in transport data: set()
Municipalities in km data but not in CBS data: set()
Municipalities in CBS data but not in km data: set()


The KM exports apparently contain a municipality with code 2000 (GM2000) that does not exist. This outlier needs to be removed.

In [149]:
# remove missing_municipalities_in_cbs from df_km_municipal_transport_source_data
df_km_municipal_transport_source_data_cleaned = df_km_municipal_transport_source_data.copy()
df_km_municipal_transport_source_data_cleaned = df_km_municipal_transport_source_data_cleaned.loc[~df_km_municipal_transport_source_data_cleaned.index.isin(missing_municipalities_in_cbs)]
# df_km_municipal_transport_source_data_cleaned.tail()

# do the same for df_km_municipal_source_data
df_km_municipal_source_data_cleaned = df_km_municipal_source_data_cleaned.loc[~df_km_municipal_source_data_cleaned.index.isin(missing_municipalities_in_cbs)]
df_km_municipal_source_data_cleaned.tail()


Unnamed: 0_level_0,Gebieden,inwoners_2023,woningen_2023,energie_totaal_combi_2023,verk_totaal_2023,elektra_totaal_combi_2023,warm_totaal_combi_2023,zonpvachtermeter_kwh_2023,hern_warm_tot_2023,gas_woningen_tj_2023,...,efacgas_2023,vbrzg_tot_tj_2023,vbrze_tot_tj_2023,zonpvtj_2023,warelektr_2023,hern_warm_tot_ex_groengas_2023,windmw_2023,efacel_2023,warwarmte_2023,ondieptj_2023
Code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
879,Nederland: Zundert,22518.0,9940.0,1545.0,448.0,380.0,717.0,12.1,52.0,347.0,...,0.001782,0.0,231.0,119.0,0.0,52.0,29.0,0.00022,0.0,0.0
301,Nederland: Zutphen,48510.0,23381.0,2410.0,633.0,734.0,1043.0,15.1,37.0,626.0,...,0.001782,379.0,496.0,184.0,0.0,37.0,6.0,0.00022,0.0,1.0
1896,Nederland: Zwartewaterland,23368.0,9196.0,2250.0,453.0,711.0,1086.0,11.6,83.0,293.0,...,0.001782,0.0,581.0,116.0,0.0,83.0,0.0,0.00022,0.0,2.0
642,Nederland: Zwijndrecht,45018.0,20828.0,3304.0,794.0,710.0,1800.0,26.9,25.0,553.0,...,0.001782,1221.0,441.0,278.0,0.0,25.0,0.0,0.00022,0.0,6.0
193,Nederland: Zwolle,132411.0,60475.0,8415.0,2955.0,2290.0,3170.0,50.3,228.0,1509.0,...,0.001782,1399.0,1633.0,581.0,0.0,228.0,9.9,0.00022,0.0,87.0


#### Transport data

This part of the pipeline transforms the Klimaatmonitor transport data to arrive at CO2 emissions per road type, vehicle type and fuel type. These emissions are used as distribution keys in the Transport sector notebook to calculate specific carrier 
demands, among other applications.

The input data for this process is as follows:
- **Municipal identifiers:** Area name, CBS municipality code
- **Energy consumption by road type:** Built environment, rural roads, highways (TJ)
- **CO2 emissions by vehicle × road type:** 9 combinations (PA/BA/AB × BK/BW/SW) in tons
  - **PA** = Passenger cars ('passagierauto'), **BA** = Delivery vans ('bestelauto'), **AB** = Buses ('autobus')
  - **BK** = Built-up areas ('bebouwde kom'), **BW** = Rural roads ('buitenwegen'), **SW** = Highways ('snelwegen')
- **Emission factors:** Gasoline, diesel, LPG (tons CO2/TJ)

The diagram below represents the process of transforming transport data from the Klimaatmonitor. Starting with raw source data, we process it through several key steps to calculate detailed CO2 emissions by road type, vehicle type, and fuel type.

**Note**: the diagram is rendered using the code in `config/mermaid_flow_diagrams/preprocessing_km_transport_data_transformations_code.mmd`

This transformation expands the original aggregated data into a detailed breakdown that allows for more precise analysis of transport-related emissions in each municipality. The process:

1. Aggregates CO2 emissions by road type
2. Calculates vehicle share percentages within each road type
3. Distributes energy consumption using CO2 emission shares
4. Breaks down energy by fuel type using CBS-based fuel mix data
5. Calculates final CO2 emissions by applying RIVM emission factors

![Transport Data Flow](config/mermaid_flow_diagrams/preprocessing_km_transport_data_transformation.png)

First we analyze the dataset

In [150]:
# =====================================================
# FUEL MIX CALCULATION FROM CBS DATA
# =====================================================
def calculate_fuel_mix_from_cbs_improved(df_cbs_pa, df_cbs_ba):
    """Calculate fuel mix percentages from CBS vehicle registration data"""
    fuel_mix = {}
    
    print("Analyzing CBS datasets for fuel mix calculation...")
    
    # PASSENGER CARS (PA) - from improved actief dataset
    row = df_cbs_pa.iloc[0]  # Get the data row
    
    total_all_cars = int(row['Brandstofsoort van personenauto\'s/Totaal (aantal)'])
    gasoline_cars = int(row['Brandstofsoort van personenauto\'s/Benzine (aantal)'])
    diesel_cars = int(row['Brandstofsoort van personenauto\'s/Diesel (aantal)'])
    lpg_cars = int(row['Brandstofsoort van personenauto\'s/LPG (aantal)'])
    electric_cars = int(row['Brandstofsoort van personenauto\'s/Elektriciteit (aantal)'])
    cng_cars = int(row['Brandstofsoort van personenauto\'s/CNG (aantal)'])
    other_cars = int(row['Brandstofsoort van personenauto\'s/Overig/Onbekend (aantal)'])
    
    # Calculate fossil fuel total (excluding electric)
    fossil_total_pa = gasoline_cars + diesel_cars + lpg_cars + cng_cars + other_cars
    
    print(f"\nPASSENGER CARS (PA) analysis:")
    print(f"  Total cars: {total_all_cars:,}")
    print(f"  Electric cars: {electric_cars:,} ({electric_cars/total_all_cars*100:.1f}%)")
    print(f"  Fossil fuel cars: {fossil_total_pa:,} ({fossil_total_pa/total_all_cars*100:.1f}%)")
    
    fuel_mix['pa'] = {
        'gasoline': gasoline_cars / fossil_total_pa,
        'diesel': diesel_cars / fossil_total_pa,
        'lpg': lpg_cars / fossil_total_pa,
        'cng': cng_cars / fossil_total_pa
    }
    
    print(f"  Fuel mix (fossil vehicles only):")
    for fuel, share in fuel_mix['pa'].items():
        print(f"    {fuel.capitalize()}: {share*100:.1f}%")
    
    # DELIVERY VANS (BA) - similar analysis
    ba_row = df_cbs_ba.loc['Bestelauto']
    total_ba = int(ba_row['Bedrijfsvoertuigen naar brandstofsoort/Totaal (aantal)'])
    gasoline_ba = int(ba_row['Bedrijfsvoertuigen naar brandstofsoort/Benzine (aantal)']) if pd.notna(ba_row['Bedrijfsvoertuigen naar brandstofsoort/Benzine (aantal)']) else 0
    diesel_ba = int(ba_row['Bedrijfsvoertuigen naar brandstofsoort/Diesel (aantal)']) if pd.notna(ba_row['Bedrijfsvoertuigen naar brandstofsoort/Diesel (aantal)']) else 0
    lpg_ba = int(ba_row['Bedrijfsvoertuigen naar brandstofsoort/LPG (aantal)']) if pd.notna(ba_row['Bedrijfsvoertuigen naar brandstofsoort/LPG (aantal)']) else 0
    electric_ba = int(ba_row['Bedrijfsvoertuigen naar brandstofsoort/Elektriciteit (aantal)']) if pd.notna(ba_row['Bedrijfsvoertuigen naar brandstofsoort/Elektriciteit (aantal)']) else 0
    cng_ba = int(ba_row['Bedrijfsvoertuigen naar brandstofsoort/CNG (aantal)']) if pd.notna(ba_row['Bedrijfsvoertuigen naar brandstofsoort/CNG (aantal)']) else 0
    
    fossil_total_ba = gasoline_ba + diesel_ba + lpg_ba + cng_ba
    
    print(f"\nDELIVERY VANS (BA) analysis:")
    print(f"  Total vans: {total_ba:,}")
    print(f"  Electric vans: {electric_ba:,} ({electric_ba/total_ba*100:.1f}%)")
    print(f"  Fossil fuel vans: {fossil_total_ba:,} ({fossil_total_ba/total_ba*100:.1f}%)")
    
    fuel_mix['ba'] = {
        'gasoline': gasoline_ba / fossil_total_ba if fossil_total_ba > 0 else 0,
        'diesel': diesel_ba / fossil_total_ba if fossil_total_ba > 0 else 0,
        'lpg': lpg_ba / fossil_total_ba if fossil_total_ba > 0 else 0,
        'cng': cng_ba / fossil_total_ba if fossil_total_ba > 0 else 0
    }
    
    print(f"  Fuel mix (fossil vehicles only):")
    for fuel, share in fuel_mix['ba'].items():
        print(f"    {fuel.capitalize()}: {share*100:.1f}%")
    
    # BUSES (AB) - similar analysis
    ab_row = df_cbs_ba.loc['Bus']
    total_ab = int(ab_row['Bedrijfsvoertuigen naar brandstofsoort/Totaal (aantal)'])
    diesel_ab = int(ab_row['Bedrijfsvoertuigen naar brandstofsoort/Diesel (aantal)']) if pd.notna(ab_row['Bedrijfsvoertuigen naar brandstofsoort/Diesel (aantal)']) else 0
    electric_ab = int(ab_row['Bedrijfsvoertuigen naar brandstofsoort/Elektriciteit (aantal)']) if pd.notna(ab_row['Bedrijfsvoertuigen naar brandstofsoort/Elektriciteit (aantal)']) else 0
    cng_ab = int(ab_row['Bedrijfsvoertuigen naar brandstofsoort/CNG (aantal)']) if pd.notna(ab_row['Bedrijfsvoertuigen naar brandstofsoort/CNG (aantal)']) else 0
    
    fossil_total_ab = diesel_ab + cng_ab
    
    print(f"\nBUSES (AB) analysis:")
    print(f"  Total buses: {total_ab:,}")
    print(f"  Electric buses: {electric_ab:,} ({electric_ab/total_ab*100:.1f}%)")
    print(f"  Fossil fuel buses: {fossil_total_ab:,} ({fossil_total_ab/total_ab*100:.1f}%)")
    
    fuel_mix['ab'] = {
        'diesel': diesel_ab / fossil_total_ab if fossil_total_ab > 0 else 0,
        'cng': cng_ab / fossil_total_ab if fossil_total_ab > 0 else 0
    }
    
    print(f"  Fuel mix (fossil vehicles only):")
    for fuel, share in fuel_mix['ab'].items():
        print(f"    {fuel.capitalize()}: {share*100:.1f}%")
    
    return fuel_mix

# Calculate the fuel mix
fuel_mix = calculate_fuel_mix_from_cbs_improved(df_cbs_pa, df_cbs_ba)

Analyzing CBS datasets for fuel mix calculation...

PASSENGER CARS (PA) analysis:
  Total cars: 8,917,107
  Electric cars: 957,886 (10.7%)
  Fossil fuel cars: 7,959,221 (89.3%)
  Fuel mix (fossil vehicles only):
    Gasoline: 87.9%
    Diesel: 10.8%
    Lpg: 1.2%
    Cng: 0.1%

DELIVERY VANS (BA) analysis:
  Total vans: 989,841
  Electric vans: 14,945 (1.5%)
  Fossil fuel vans: 974,881 (98.5%)
  Fuel mix (fossil vehicles only):
    Gasoline: 3.5%
    Diesel: 94.1%
    Lpg: 2.0%
    Cng: 0.4%

BUSES (AB) analysis:
  Total buses: 8,756
  Electric buses: 1,698 (19.4%)
  Fossil fuel buses: 7,044 (80.4%)
  Fuel mix (fossil vehicles only):
    Diesel: 92.9%
    Cng: 7.1%


**STEP 1: CO2 Aggregation by Road Type**
**Purpose:** Calculate total CO2 emissions per road type across all vehicle types

**Calculations:**
```
total_co2_built_up = co2_bk_pa + co2_bk_ba + co2_bk_ab
total_co2_rural_roads = co2_bw_pa + co2_bw_ba + co2_bw_ab  
total_co2_highways = co2_sw_pa + co2_sw_ba + co2_sw_ab
```

**Output:**
- `total_co2_built_environment_mix`
- `total_co2_road_mix`
- `total_co2_motorway_mix`

In [151]:
# =====================================================
# STEP 1: CO2 AGGREGATION BY ROAD TYPE
# =====================================================
print("STEP 1: Aggregating CO2 emissions by road type...")

df_result = df_km_municipal_transport_source_data_cleaned.copy()

# Calculate total CO2 emissions per road type across all vehicle types
df_result['total_co2_built_environment_mix'] = (
    df_result['co2_bk_pa_2023'] + 
    df_result['co2_bk_ba_2023'] + 
    df_result['co2_bk_ab_2023']
)

df_result['total_co2_road_mix'] = (
    df_result['co2_bw_pa_2023'] + 
    df_result['co2_bw_ba_2023'] + 
    df_result['co2_bw_ab_2023']
)

df_result['total_co2_motorway_mix'] = (
    df_result['co2_sw_pa_2023'] + 
    df_result['co2_sw_ba_2023'] + 
    df_result['co2_sw_ab_2023']
)

print(f"✓ Added 3 total CO2 columns")
print(f"  Built-up areas: {df_result['total_co2_built_environment_mix'].sum():.0f} tons total")
print(f"  Rural roads: {df_result['total_co2_road_mix'].sum():.0f} tons total") 
print(f"  Highways: {df_result['total_co2_motorway_mix'].sum():.0f} tons total")

STEP 1: Aggregating CO2 emissions by road type...
✓ Added 3 total CO2 columns
  Built-up areas: 5354046 tons total
  Rural roads: 6047898 tons total
  Highways: 8512347 tons total


**STEP 2: CO2 Share Calculations by Vehicle Type**
**Purpose:** Calculate relative distribution of CO2 emissions per vehicle type within each road type

**Calculations:**
```
share_co2_bk_pa = co2_bk_pa / total_co2_built_up
share_co2_bk_ba = co2_bk_ba / total_co2_built_up
share_co2_bk_ab = co2_bk_ab / total_co2_built_up
(+ equivalent for rural roads and highways)
```

**Output:**
- **AE-AG:** Built-up area shares (PA, BA, AB)
- **AH-AJ:** Rural road shares (PA, BA, AB)
- **AK-AM:** Highway shares (PA, BA, AB)

In [152]:
# =====================================================
# STEP 2: CO2 SHARE CALCULATIONS BY VEHICLE TYPE
# =====================================================
print("STEP 2: Calculating vehicle type shares within each road type...")

def safe_divide(numerator, denominator):
    """Handle division by zero for municipalities without certain road types"""
    # Use where() instead of replace() to avoid FutureWarning
    return numerator / denominator.where(denominator != 0, np.nan)

# Built-up area shares
df_result['share_co2_bk_pa'] = safe_divide(df_result['co2_bk_pa_2023'], df_result['total_co2_built_environment_mix'])
df_result['share_co2_bk_ba'] = safe_divide(df_result['co2_bk_ba_2023'], df_result['total_co2_built_environment_mix'])
df_result['share_co2_bk_ab'] = safe_divide(df_result['co2_bk_ab_2023'], df_result['total_co2_built_environment_mix'])

# Rural road shares
df_result['share_co2_bw_pa'] = safe_divide(df_result['co2_bw_pa_2023'], df_result['total_co2_road_mix'])
df_result['share_co2_bw_ba'] = safe_divide(df_result['co2_bw_ba_2023'], df_result['total_co2_road_mix'])
df_result['share_co2_bw_ab'] = safe_divide(df_result['co2_bw_ab_2023'], df_result['total_co2_road_mix'])

# Highway shares  
df_result['share_co2_sw_pa'] = safe_divide(df_result['co2_sw_pa_2023'], df_result['total_co2_motorway_mix'])
df_result['share_co2_sw_ba'] = safe_divide(df_result['co2_sw_ba_2023'], df_result['total_co2_motorway_mix'])
df_result['share_co2_sw_ab'] = safe_divide(df_result['co2_sw_ab_2023'], df_result['total_co2_motorway_mix'])

print(f"✓ Added 9 share columns")
print(f"  Shares sum to 1.0 for municipalities with traffic on that road type")
print(f"  NaN for municipalities without traffic on specific road types")

STEP 2: Calculating vehicle type shares within each road type...
✓ Added 9 share columns
  Shares sum to 1.0 for municipalities with traffic on that road type
  NaN for municipalities without traffic on specific road types


**STEP 3: Energy Distribution by Vehicle Type**
**Purpose:** Distribute total energy consumption across vehicle types based on CO2 emission shares

**Calculations:**
```
energy_bk_pa = share_co2_bk_pa × energie_bebouwde_kom_mix
energy_bk_ba = share_co2_bk_ba × energie_bebouwde_kom_mix
energy_bk_ab = share_co2_bk_ab × energie_bebouwde_kom_mix
(+ equivalent for rural roads and highways)
```

**Output:**
- Built-up area energy (PA, BA, AB)
- Rural road energy (PA, BA, AB)
- Highway energy (PA, BA, AB)

In [153]:
# =====================================================
# STEP 3: ENERGY DISTRIBUTION BY VEHICLE TYPE
# =====================================================
print("STEP 3: Distributing energy consumption across vehicle types...")

# Built-up areas
df_result['energy_bk_pa'] = df_result['share_co2_bk_pa'] * df_result['energie_bebouwde_kom_2023']
df_result['energy_bk_ba'] = df_result['share_co2_bk_ba'] * df_result['energie_bebouwde_kom_2023']
df_result['energy_bk_ab'] = df_result['share_co2_bk_ab'] * df_result['energie_bebouwde_kom_2023']

# Rural roads
df_result['energy_bw_pa'] = df_result['share_co2_bw_pa'] * df_result['energie_buitenweg_2023']
df_result['energy_bw_ba'] = df_result['share_co2_bw_ba'] * df_result['energie_buitenweg_2023']
df_result['energy_bw_ab'] = df_result['share_co2_bw_ab'] * df_result['energie_buitenweg_2023']

# Highways
df_result['energy_sw_pa'] = df_result['share_co2_sw_pa'] * df_result['energie_snelweg_2023']
df_result['energy_sw_ba'] = df_result['share_co2_sw_ba'] * df_result['energie_snelweg_2023']
df_result['energy_sw_ab'] = df_result['share_co2_sw_ab'] * df_result['energie_snelweg_2023']

print(f"✓ Added 9 energy distribution columns")
total_energy_distributed = df_result[['energy_bk_pa', 'energy_bk_ba', 'energy_bk_ab', 'energy_bw_pa', 'energy_bw_ba', 'energy_bw_ab', 'energy_sw_pa', 'energy_sw_ba', 'energy_sw_ab']].sum().sum()
print(f"  Total energy distributed: {total_energy_distributed:.0f} TJ")

STEP 3: Distributing energy consumption across vehicle types...
✓ Added 9 energy distribution columns
  Total energy distributed: 369593 TJ


**STEP 4: Energy by Fuel Type**
**Purpose:** Break down energy consumption by specific fuel types per vehicle category

**Fuel Mix (Fixed Percentages):**
- **Passenger cars:** 83.7% gasoline, 14.9% diesel, 1.4% LPG
- **Delivery vans:** Similar mix with different percentages
- **Buses:** 99.7% diesel, 0.3% other

**Calculations:**
```
energy_bk_pa_gasoline = energy_bk_pa × fuel_mix_pa_gasoline
energy_bk_pa_diesel = energy_bk_pa × fuel_mix_pa_diesel
energy_bk_pa_lpg = energy_bk_pa × fuel_mix_pa_lpg
(for all vehicle × road × fuel combinations)
```

**Output:**
24 energy variables covering all combinations of:
- 3 vehicle types × 3 road types × 2-3 fuel types each

In [154]:
# =====================================================
# STEP 4: ENERGY BY FUEL TYPE USING CBS FUEL MIX
# =====================================================
print("STEP 4: Breaking down energy by specific fuel types...")

road_types = ['bk', 'bw', 'sw']
vehicle_types = ['pa', 'ba', 'ab']

energy_columns_added = 0

for road in road_types:
    for vehicle in vehicle_types:
        energy_col = f'energy_{road}_{vehicle}'
        
        if vehicle == 'pa':
            # Passenger cars: gasoline, diesel, lpg, cng
            for fuel in ['gasoline', 'diesel', 'lpg', 'cng']:
                fuel_share = fuel_mix[vehicle][fuel]
                df_result[f'{energy_col}_{fuel}'] = df_result[energy_col] * fuel_share
                energy_columns_added += 1
        elif vehicle == 'ba':
            # Delivery vans: gasoline, diesel, lpg, cng  
            for fuel in ['gasoline', 'diesel', 'lpg', 'cng']:
                fuel_share = fuel_mix[vehicle][fuel]
                df_result[f'{energy_col}_{fuel}'] = df_result[energy_col] * fuel_share
                energy_columns_added += 1
        elif vehicle == 'ab':
            # Buses: diesel, cng
            for fuel in ['diesel', 'cng']:
                fuel_share = fuel_mix[vehicle][fuel]
                df_result[f'{energy_col}_{fuel}'] = df_result[energy_col] * fuel_share
                energy_columns_added += 1

print(f"✓ Added {energy_columns_added} fuel-specific energy columns")
print(f"  Using CBS-derived fuel mix percentages")
print(f"  Excludes electric vehicles to avoid energy loss")

STEP 4: Breaking down energy by specific fuel types...
✓ Added 30 fuel-specific energy columns
  Using CBS-derived fuel mix percentages
  Excludes electric vehicles to avoid energy loss


**STEP 5: CO2 Emissions by Fuel Type**
**Purpose:** Calculate final CO2 emissions per specific fuel using emission factors

**Emission Factors:**
Taken from the df_km_municipal_transport_source_data_cleaned, based on RIVM - Methodology report Transport ER 1990-2023.

**Calculations:**
```
co2_bk_pa_gasoline = energy_bk_pa_gasoline × emission_factor_gasoline
co2_bk_pa_diesel = energy_bk_pa_diesel × emission_factor_diesel
co2_bk_pa_lpg = energy_bk_pa_lpg × emission_factor_lpg
(for all combinations)
```

**Output:**
24 CO2 emission with fuel-specific granularity

In [155]:
# =====================================================
# STEP 5: CO2 EMISSIONS BY FUEL TYPE
# =====================================================
print("STEP 5: Converting energy to CO2 emissions using emission factors...")

# Extract emission factors from klimaatmonitor data
emission_factors = {
    'gasoline': df_km_municipal_transport_source_data_cleaned['efac_benz_g_mj_2023'].iloc[0],
    'diesel': df_km_municipal_transport_source_data_cleaned['efac_diesel_g_mj_2023'].iloc[0],
    'lpg': df_km_municipal_transport_source_data_cleaned['efac_lpg_g_mj_2023'].iloc[0],
    'cng': df_km_municipal_transport_source_data_cleaned['efac_cng_g_mj_2023'].iloc[0]
}

print(f"Emission factors (tons CO2/TJ):")
for fuel, factor in emission_factors.items():
    print(f"  {fuel.capitalize()}: {factor:.2f}")

co2_columns_added = 0
co2_columns = []

# Calculate final CO2 emissions for each fuel type
for road in road_types:
    for vehicle in vehicle_types:
        if vehicle == 'pa':
            # Passenger cars: all fuel types
            for fuel in ['gasoline', 'diesel', 'lpg', 'cng']:
                energy_col = f'energy_{road}_{vehicle}_{fuel}'
                co2_col = f'co2_{road}_{vehicle}_{fuel}_{year}'
                df_result[co2_col] = df_result[energy_col] * emission_factors[fuel]
                co2_columns.append(co2_col)
                co2_columns_added += 1
        elif vehicle == 'ba':
            # Delivery vans: all fuel types
            for fuel in ['gasoline', 'diesel', 'lpg', 'cng']:
                energy_col = f'energy_{road}_{vehicle}_{fuel}'
                co2_col = f'co2_{road}_{vehicle}_{fuel}_{year}'
                df_result[co2_col] = df_result[energy_col] * emission_factors[fuel]
                co2_columns.append(co2_col)
                co2_columns_added += 1
        elif vehicle == 'ab':
            # Buses: diesel and cng only
            for fuel in ['diesel', 'cng']:
                energy_col = f'energy_{road}_{vehicle}_{fuel}'
                co2_col = f'co2_{road}_{vehicle}_{fuel}_{year}'
                df_result[co2_col] = df_result[energy_col] * emission_factors[fuel]
                co2_columns.append(co2_col)
                co2_columns_added += 1

print(f"\n✓ TRANSFORMATION COMPLETE!")
print(f"✓ Added {co2_columns_added} fuel-specific CO2 columns")
print(f"✓ Total columns in result: {len(df_result.columns)}")

# Horizontally append co2_columns from df_result to df_km_municipal_transport_data_transformed
df_km_municipal_transport_data_transformed = df_km_municipal_transport_source_data_cleaned.copy()
df_km_municipal_transport_data_transformed = df_km_municipal_transport_data_transformed.join(df_result[co2_columns])

# Preview result
df_km_municipal_transport_data_transformed.head()

STEP 5: Converting energy to CO2 emissions using emission factors...
Emission factors (tons CO2/TJ):
  Gasoline: 72.12
  Diesel: 74.87
  Lpg: 66.70
  Cng: 56.40

✓ TRANSFORMATION COMPLETE!
✓ Added 30 fuel-specific CO2 columns
✓ Total columns in result: 98


Unnamed: 0_level_0,Gebieden,energie_bebouwde_kom_2023,energie_buitenweg_2023,energie_snelweg_2023,co2_bk_pa_2023,co2_bk_ba_2023,co2_bk_ab_2023,co2_bw_pa_2023,co2_bw_ba_2023,co2_bw_ab_2023,...,co2_sw_pa_gasoline_2023,co2_sw_pa_diesel_2023,co2_sw_pa_lpg_2023,co2_sw_pa_cng_2023,co2_sw_ba_gasoline_2023,co2_sw_ba_diesel_2023,co2_sw_ba_lpg_2023,co2_sw_ba_cng_2023,co2_sw_ab_diesel_2023,co2_sw_ab_cng_2023
Code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1680,Aa en Hunze,122.0,300.0,628.0,5711.0,915.0,807.0,13561.0,3297.0,691.0,...,29797.4921,3804.696113,373.850792,27.580534,385.994145,10787.155187,208.927454,34.863172,314.562248,17.988374
358,Aalsmeer,248.0,212.0,0.0,12104.0,1939.0,601.0,8178.0,1989.0,98.0,...,,,,,,,,,,
197,Aalten,79.0,286.0,0.0,3789.0,607.0,93.0,12436.0,3024.0,82.0,...,,,,,,,,,,
59,Achtkarspelen,110.0,351.0,0.0,5469.0,876.0,330.0,16154.0,3928.0,255.0,...,,,,,,,,,,
482,Alblasserdam,107.0,56.0,210.0,5420.0,868.0,254.0,2666.0,648.0,69.0,...,10001.532637,1277.046814,125.483074,9.25741,129.546838,3620.370586,70.119953,11.700731,51.063207,2.920071


In [157]:
# Data validation: Check missing values in highway-related columns
print("=== DATA VALIDATION: Highway CO2 Missing Values ===")

# Check the pattern of missing highway energy
highway_energy_zero = (df_km_municipal_transport_source_data_cleaned['energie_snelweg_2023'] == 0)
num_zero_highway = highway_energy_zero.sum()
total_municipalities = len(df_km_municipal_transport_source_data_cleaned)

print(f"\nHighway energy analysis:")
print(f"  Total municipalities: {total_municipalities}")
print(f"  Municipalities with 0 highway energy: {num_zero_highway} ({num_zero_highway/total_municipalities*100:.1f}%)")
print(f"  Municipalities with highway energy: {total_municipalities - num_zero_highway} ({(total_municipalities - num_zero_highway)/total_municipalities*100:.1f}%)")

# Check highway CO2 emissions by vehicle type
print(f"\nHighway CO2 emissions (original data):")
for vehicle in ['pa', 'ba', 'ab']:
    col = f'co2_sw_{vehicle}_2023'
    co2_zero = (df_km_municipal_transport_source_data_cleaned[col] == 0).sum()
    print(f"  {vehicle}: {co2_zero} municipalities with 0 CO2 ({co2_zero/total_municipalities*100:.1f}%)")

# Now check the transformed data for missing values
print(f"\nTransformed fuel-specific CO2 missing values:")
for vehicle in ['pa', 'ba', 'ab']:
    for fuel in (['gasoline', 'diesel', 'lpg', 'cng'] if vehicle in ['pa', 'ba'] else ['diesel', 'cng']):
        col = f'co2_sw_{vehicle}_{fuel}'
        if col in df_km_municipal_transport_data_transformed.columns:
            missing_values = df_km_municipal_transport_data_transformed[col].isna().sum()
            if missing_values > 0:
                print(f"  {col}: {missing_values} missing values ({missing_values/total_municipalities*100:.1f}%)")

# Show examples of municipalities without highways
print(f"\nExamples of municipalities without highways:")
no_highway = df_km_municipal_transport_source_data_cleaned[highway_energy_zero]
if len(no_highway) > 0:
    # Get municipality names from the gebieden column or index
    examples = no_highway[['Gebieden', 'energie_snelweg_2023', 'co2_sw_pa_2023', 'co2_sw_ba_2023', 'co2_sw_ab_2023']].head(10)
    print(examples)

# Verify the logic: municipalities with 0 highway energy should have NaN in fuel-specific highway CO2
print(f"\nValidation: Zero highway energy vs NaN fuel-specific CO2:")
sample_col = 'co2_sw_pa_gasoline'  # Use this as representative
if sample_col in df_km_municipal_transport_data_transformed.columns:
    zero_energy_municipalities = df_km_municipal_transport_data_transformed[highway_energy_zero]
    nan_fuel_co2 = zero_energy_municipalities[sample_col].isna().sum()
    total_zero_energy = len(zero_energy_municipalities)
    
    print(f"  Municipalities with 0 highway energy: {total_zero_energy}")
    print(f"  Of these, have NaN {sample_col}: {nan_fuel_co2}")
    print(f"  Logic check: {'✓ CORRECT' if nan_fuel_co2 == total_zero_energy else '✗ ISSUE - some have values when they should be NaN'}")

=== DATA VALIDATION: Highway CO2 Missing Values ===

Highway energy analysis:
  Total municipalities: 342
  Municipalities with 0 highway energy: 65 (19.0%)
  Municipalities with highway energy: 277 (81.0%)

Highway CO2 emissions (original data):
  pa: 0 municipalities with 0 CO2 (0.0%)
  ba: 0 municipalities with 0 CO2 (0.0%)
  ab: 60 municipalities with 0 CO2 (17.5%)

Transformed fuel-specific CO2 missing values:

Examples of municipalities without highways:
0          Gebieden energie_snelweg_2023 co2_sw_pa_2023 co2_sw_ba_2023  \
Code                                                                     
358        Aalsmeer                  0.0           None           None   
197          Aalten                  0.0           None           None   
59    Achtkarspelen                  0.0           None           None   
1723   Alphen-Chaam                  0.0           17.0            5.0   
60          Ameland                  0.0           None           None   
744   Baarle-Nass

Apparently there are still a few municipalities that have no motorway-related energy consumption (probably because they don't have a motorway) but _do have emissions_ related to motorways. This highlights an inconsistency in the Klimaatmonitor data, part of which (the emissions) are ultimately based on Emissieregistratie, judging from the meta data frame.

We therefore have to decide: do we trust the energy consumption data or the emissions data?
1. Energy data: 0 energy consumption = 0 emissions. That means we set the motorway emissions to zero for all municipalities without a motorway.
2. Emissions data: assign nonzero emissions despite there not being a physical source.

We go for option 1 and manually set motorway emissions for municipalities without motorway energy consumption to 0.

In [159]:
# Find municipalities with zero motorway energy consumption
zero_motorway_energy = df_km_municipal_transport_source_data_cleaned['energie_snelweg_2023'] == 0

# For these municipalities, set all highway-related CO2 values to 0 instead of NaN
motorway_co2_columns = [col for col in df_km_municipal_transport_data_transformed.columns if col.startswith('co2_sw_')]

for col in motorway_co2_columns:
    # Replace NaN with 0 for municipalities with zero motorway energy
    df_km_municipal_transport_data_transformed.loc[zero_motorway_energy, col] = 0

# Verify the fix
print(f"Municipalities with zero motorway energy: {zero_motorway_energy.sum()}")
print(f"Now all have zero (not NaN) emissions values: {(df_km_municipal_transport_data_transformed.loc[zero_motorway_energy, motorway_co2_columns] == 0).all().all()}")

# Display sample of fixed data
df_km_municipal_transport_data_transformed.loc[zero_motorway_energy, motorway_co2_columns].head(10)

Municipalities with zero motorway energy: 65
Now all have zero (not NaN) emissions values: True


Unnamed: 0_level_0,co2_sw_pa_2023,co2_sw_ba_2023,co2_sw_ab_2023,co2_sw_pa_gasoline_2023,co2_sw_pa_diesel_2023,co2_sw_pa_lpg_2023,co2_sw_pa_cng_2023,co2_sw_ba_gasoline_2023,co2_sw_ba_diesel_2023,co2_sw_ba_lpg_2023,co2_sw_ba_cng_2023,co2_sw_ab_diesel_2023,co2_sw_ab_cng_2023
Code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
358,0,0,0,0,0,0,0,0,0,0,0,0,0
197,0,0,0,0,0,0,0,0,0,0,0,0,0
59,0,0,0,0,0,0,0,0,0,0,0,0,0
1723,0,0,0,0,0,0,0,0,0,0,0,0,0
60,0,0,0,0,0,0,0,0,0,0,0,0,0
744,0,0,0,0,0,0,0,0,0,0,0,0,0
1945,0,0,0,0,0,0,0,0,0,0,0,0,0
1724,0,0,0,0,0,0,0,0,0,0,0,0,0
373,0,0,0,0,0,0,0,0,0,0,0,0,0
377,0,0,0,0,0,0,0,0,0,0,0,0,0


We export this intermediate transport data for reference to the intermediate folder.

In [None]:
# df_km_municipal_transport_data_transformed.to_csv("data/intermediate/km_municipal_transport_data_transformed.csv", sep=sep, index=True) # no longer used as of August 2025

#### Combine

We now merge the municipal and transport data horizontally into one dataframe. Let's start with the meta data

In [161]:
# Combine the two metadata dataframes vertically
df_km_meta_data = pd.concat([df_km_municipal_meta_data, df_km_municipal_transport_meta_data], axis=0)
# List duplicates in the index and remove them
duplicates = df_km_meta_data.index[df_km_meta_data.index.duplicated(keep=False)]
print(f"Duplicates in the index: {duplicates}")
df_km_meta_data = df_km_meta_data.loc[~df_km_meta_data.index.duplicated(keep='last')]

# Preview data
df_km_meta_data.tail()

Duplicates in the index: Index(['efac_benz_g_mj', 'efac_diesel_g_mj', 'efac_benz_g_mj',
       'efac_diesel_g_mj'],
      dtype='object', name='Code')


Unnamed: 0_level_0,Onderwerp,Eenheid,Bron,Voetnoot,Beschrijving,Gegevenstype,Laatste database wijziging
Code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
co2_sw_ab,CO₂-uitstoot Uitlaatgassen autobus auto(snel)w...,ton,Emissieregistratie - CO₂-uitstoot verkeer en v...,,,Numeriek,7-2-2025 19:13:08
efac_benz_g_mj,Emissiefactor benzine,ton/TJ,Overwritten manually by Quintel with RIVM data,,RIVM - Methodology report Transport ER 1990-20...,Gemiddelde,8-2-2025 00:02:13
efac_diesel_g_mj,Emissiefactor diesel,ton/TJ,Overwritten manually by Quintel with RIVM data,,RIVM - Methodology report Transport ER 1990-20...,Gemiddelde,8-2-2025 00:02:13
efac_lpg_g_mj,Emission factor LPG,tonne/TJ,Overwritten manually by Quintel with RIVM data,,RIVM - Methodology report Transport ER 1990-20...,,
efac_cng_g_mj,Emission factor CNG,tonne/TJ,Overwritten manually by Quintel with RIVM data,,RIVM - Methodology report Transport ER 1990-20...,,


Now we combine the actual data. 

In [162]:
# Check for identical columns between municipal and transport data
# Get the columns of both dataframes
columns_municipal = set(df_km_municipal_source_data_cleaned.columns)
columns_transport = set(df_km_municipal_transport_data_transformed.columns)
# Find the common columns
common_columns = columns_municipal.intersection(columns_transport)
# Print the common columns
print(f"Common columns between municipal and transport data: {common_columns}")
# Remove common columns from the municipal data
df_km_municipal_source_data_cleaned = df_km_municipal_source_data_cleaned.drop(columns=common_columns)

# Merge the two dataframes on the index
df_km_source_data = df_km_municipal_source_data_cleaned.merge(df_km_municipal_transport_data_transformed, left_index=True, right_index=True)
# Move 'Gebieden' to the first column
column = 'Gebieden'
df_km_source_data = df_km_source_data[[column] + [col for col in df_km_source_data.columns if col != column]]
# Preview the data
df_km_source_data.head()   

Common columns between municipal and transport data: {'efac_diesel_g_mj_2023', 'Gebieden', 'efac_benz_g_mj_2023'}


Unnamed: 0_level_0,Gebieden,inwoners_2023,woningen_2023,energie_totaal_combi_2023,verk_totaal_2023,elektra_totaal_combi_2023,warm_totaal_combi_2023,zonpvachtermeter_kwh_2023,hern_warm_tot_2023,gas_woningen_tj_2023,...,co2_sw_pa_gasoline_2023,co2_sw_pa_diesel_2023,co2_sw_pa_lpg_2023,co2_sw_pa_cng_2023,co2_sw_ba_gasoline_2023,co2_sw_ba_diesel_2023,co2_sw_ba_lpg_2023,co2_sw_ba_cng_2023,co2_sw_ab_diesel_2023,co2_sw_ab_cng_2023
Code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1680,Aa en Hunze,25724.0,11428.0,3021.0,1282.0,327.0,0.0,13.7,76.0,435.0,...,29797.4921,3804.696113,373.850792,27.580534,385.994145,10787.155187,208.927454,34.863172,314.562248,17.988374
358,Aalsmeer,33063.0,13460.0,2566.0,513.0,762.0,1291.0,15.6,43.0,416.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
197,Aalten,27244.0,12164.0,1411.0,452.0,367.0,592.0,14.0,52.0,382.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
59,Achtkarspelen,28149.0,12289.0,1896.0,569.0,445.0,882.0,11.2,87.0,420.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
482,Alblasserdam,20356.0,8449.0,1512.0,456.0,425.0,631.0,5.3,14.0,211.0,...,10001.532637,1277.046814,125.483074,9.25741,129.546838,3620.370586,70.119953,11.700731,51.063207,2.920071


We enrich the KM source data by combining it with the CBS municipal data. This way we now which municipality we're working on and we can aggregate municipalities to province level.

We rework the index so that it matches the GM format the ETM recognizes. We then add the
- municipality name
- province code (PV)
- province name

These can be matched using the municipal GM code.

In [163]:
# Fill GM code to match desired area code structure
df_km_source_data_transformed = df_km_source_data.copy()
df_km_source_data_transformed.index = df_km_source_data.index.str.zfill(4).map(lambda x: 'GM' + x)

# Add municipality name, province code and province name
df_km_source_data_transformed = df_km_source_data_transformed.merge(df_cbs_source_data[['Gemeentenaam', 'ProvinciecodePV', 'Provincienaam']], 
                        left_index=True, right_index=True, how='left')

# Reorder columns to move the new columns to the front
cols = ['Gemeentenaam', 'ProvinciecodePV', 'Provincienaam'] + [col for col in df_km_source_data.columns if col not in ['Gemeentenaam', 'ProvinciecodePV', 'Provincienaam']]
df_km_source_data_transformed = df_km_source_data_transformed[cols]

# Remove redundant 'Gebieden' column
df_km_source_data_transformed = df_km_source_data_transformed.drop(columns=['Gebieden'])

# Rename the index to 'GemeenteCode'
df_km_source_data_transformed.index.name = 'GemeenteCode'

# Preview the result
df_km_source_data_transformed


Unnamed: 0_level_0,Gemeentenaam,ProvinciecodePV,Provincienaam,inwoners_2023,woningen_2023,energie_totaal_combi_2023,verk_totaal_2023,elektra_totaal_combi_2023,warm_totaal_combi_2023,zonpvachtermeter_kwh_2023,...,co2_sw_pa_gasoline_2023,co2_sw_pa_diesel_2023,co2_sw_pa_lpg_2023,co2_sw_pa_cng_2023,co2_sw_ba_gasoline_2023,co2_sw_ba_diesel_2023,co2_sw_ba_lpg_2023,co2_sw_ba_cng_2023,co2_sw_ab_diesel_2023,co2_sw_ab_cng_2023
GemeenteCode,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
GM1680,Aa en Hunze,PV22,Drenthe,25724.0,11428.0,3021.0,1282.0,327.0,0.0,13.7,...,29797.4921,3804.696113,373.850792,27.580534,385.994145,10787.155187,208.927454,34.863172,314.562248,17.988374
GM0358,Aalsmeer,PV27,Noord-Holland,33063.0,13460.0,2566.0,513.0,762.0,1291.0,15.6,...,0,0,0,0,0,0,0,0,0,0
GM0197,Aalten,PV25,Gelderland,27244.0,12164.0,1411.0,452.0,367.0,592.0,14.0,...,0,0,0,0,0,0,0,0,0,0
GM0059,Achtkarspelen,PV21,Fryslân,28149.0,12289.0,1896.0,569.0,445.0,882.0,11.2,...,0,0,0,0,0,0,0,0,0,0
GM0482,Alblasserdam,PV28,Zuid-Holland,20356.0,8449.0,1512.0,456.0,425.0,631.0,5.3,...,10001.532637,1277.046814,125.483074,9.25741,129.546838,3620.370586,70.119953,11.700731,51.063207,2.920071
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
GM0879,Zundert,PV30,Noord-Brabant,22518.0,9940.0,1545.0,448.0,380.0,717.0,12.1,...,0,0,0,0,0,0,0,0,0,0
GM0301,Zutphen,PV25,Gelderland,48510.0,23381.0,2410.0,633.0,734.0,1043.0,15.1,...,0,0,0,0,0,0,0,0,0,0
GM1896,Zwartewaterland,PV23,Overijssel,23368.0,9196.0,2250.0,453.0,711.0,1086.0,11.6,...,0,0,0,0,0,0,0,0,0,0
GM0642,Zwijndrecht,PV28,Zuid-Holland,45018.0,20828.0,3304.0,794.0,710.0,1800.0,26.9,...,18078.321057,2308.332447,226.817566,16.733279,234.176269,6544.388792,126.752834,21.15091,51.116781,2.923135


We now convert the Klimaatmonitor variables (KM-vars) to internal variables (ivars) to be used in the separate workbooks. The match between km-var and ivar is specified in a csv in the config folder: 'A_km-var_to_ivar.csv'

This file is created manually in /config/kmvar_to_internal_var_mapping.xlsx:
- It is based on the 2019 update Excel '202110_A_ETLocal_data_transformation_Klimaatmonitor_2019', tab 'unit_conversion_overview'
- The headings are altered to be consistent with terminology in this notebook (e.g. km variable)
- The _2019 are removed to make the values consistent with KM meta data reporting
- Some CO2 emissions for specific fuels and vehicle types are added compared to the 2019 update. This is mainly CNG-related emissions for PA and BA (see [Transform - Transport data] )

We apply the conversion as follows:
1. Import the km-ivar conversion file
2. Retrieve the km-var unit from the km_meta_data dataframe
3. Inspect and preprocess the km-variable units so that they are recognizable for the TransformerV2 class
4. Convert using TransformerV2.convert
5. Write to file



In [164]:
# 1. Import the conversion list
path_conversion = Path("config", "A_km-var_to_ivar.csv")
df_km_ivar_conversion = pd.read_csv(path_conversion, sep=sep)
df_km_ivar_conversion.set_index('km-variable', inplace=True)
df_km_ivar_conversion.head()

# 2. Add km-variable unit to the conversion dataframe from the metadata
df_km_ivar_conversion = df_km_ivar_conversion.merge(df_km_meta_data[['Eenheid','Onderwerp']], 
                        left_index=True, right_index=True, how='left')
df_km_ivar_conversion.reset_index(inplace=True) # Reset index
# Rename 'Eenheid' column to 'km-variable unit'
df_km_ivar_conversion.rename(columns={'Eenheid': 'km-variable unit'}, inplace=True)

# Write the conversion list to a CSV file for inspection in Excel
df_km_ivar_conversion.to_csv("data/intermediate/km_ivar_conversion.csv", sep=sep, index=False)

# List all km-variable units for inspection
df_km_ivar_conversion['km-variable unit'].unique()


array(['aantal', nan, 'TJ', 'GWh', 'miljoen m3', 'GJ', 'm3', 'kWh', 'ton',
       'kW', 'MW', 'ton/TJ', 'tonne/TJ', 'ton/m3', 'ton/kWh'],
      dtype=object)

We see that some of the KM variable units are
1. not yet compatible with Pint (e.g "GWh (miljoen kWh)")
2. require additional specification to be used by the TransformerV2 (e.g. "m3" means "m3 (natural gas)", which is a unit manually specified in the TransformerV2 class).
3. apparently no longer available in KM (unit = nan). 
4. are written differently (e.g. "aantal" = "#" in the conversion list)

That means we need to preprocess these units and flag any nonavailable variables.


TODO: upload list of conversions for TransformerV2 from a csv rather than via hard coding in _init_rules. Update TransformerV2 accordingly.

In [165]:
# We first check which rules are available in the TransformerV2 class
transformer = TransformerV2()
transformer.list_rules()

Unnamed: 0,source,target,factor,description
0,m3 (natural gas),MJ,31.65,"LHV of natural gas, NL standard. Source: Gasunie"
1,miljoen m3,TJ,31.65,"LHV of natural gas, NL standard. Source: Gasunie"
2,tonne/m3,tonne/TJ (natural gas),31595.576619,Equal to [MJ to m3 (natural gas)] * 1e6. Based...
3,tonne/kWh,tonne/TJ (electricity),277777.777778,Equal to 1 / [kWh to TJ] or [TJ to kWh].


In [None]:
# 3. Preprocess the km-variable units
# Define a conversion dictionary for km-variable units
km_unit_conversion_dict = {
    'aantal': '#',
    'GWh (miljoen kWh)': 'GWh',
    'm3': 'm3 (natural gas)',
    'ton': 'tonne', # Note: pint interprets 'ton' as a short ton (US), while we need metric tonnes
    'ton/TJ': 'tonne/TJ',
    'ton/kWh': 'tonne/kWh',
    'ton/m3': 'tonne/m3'
}
# Use the replace method to update the values in the column
df_km_ivar_conversion_clean = df_km_ivar_conversion.replace({'km-variable unit': km_unit_conversion_dict})

# Add column specificying if a conversion is available (False if km-variable unit is NaN)
df_km_ivar_conversion_clean['conversion_available'] = ~df_km_ivar_conversion['km-variable unit'].isna()
# Conversion is needed if conversion is available and km-variable unit is different from ivar unit
df_km_ivar_conversion_clean['conversion_needed'] = (df_km_ivar_conversion_clean['conversion_available'] & 
                                                  (df_km_ivar_conversion_clean['km-variable unit'] != df_km_ivar_conversion_clean['ivar unit']))

# Write the conversion list to a CSV file for inspection in Excel
# df_km_ivar_conversion_clean.to_csv("data/intermediate/km_ivar_conversion_clean.csv", sep=sep, index=False) # no longer used as of August 2025

In [167]:
# 4 Apply conversion to relevant columns in df_km_source_data_transformed via transformer

# Create an empty dataframe with the same index as df_km_source_data_transformed
df_km_source_data_converted = pd.DataFrame(index=df_km_source_data_transformed.index)
# Add the 'Gemeentenaam', 'ProvinciecodePV' and 'Provincienaam' columns to the new dataframe
df_km_source_data_converted['Gemeentenaam'] = df_km_source_data_transformed['Gemeentenaam']
df_km_source_data_converted['ProvinciecodePV'] = df_km_source_data_transformed['ProvinciecodePV']
df_km_source_data_converted['Provincienaam'] = df_km_source_data_transformed['Provincienaam']

# Create a copy of the metadata dataframe to track the applied conversions
df_km_meta_data_converted = df_km_meta_data.copy()
# Add new columns for the ivar, ivar unit and conversion factor
df_km_meta_data_converted[['ivar', 'ivar_unit','conversion_factor']] = pd.NA 

# Define columns to be converted
km_variables_to_convert = df_km_ivar_conversion_clean[df_km_ivar_conversion_clean['conversion_needed']]['km-variable'].to_list()
# print(km_variables_to_convert) # DEBUG

# To boost efficiency we first collect all km-variable columns and their values to be converted
# We ultimately add all converted values to the new dataframe in one go
converted_values = {}

# Loop over the km-variable names and apply the conversion
# counter = 0 # DEBUG
for _, row in df_km_ivar_conversion_clean.iterrows():
    km_var = row['km-variable']
    # DEBUG 
    print(f"Processing {km_var}...")
    km_var_year = f"{km_var}_{year}"
    ivar = row['internal variable (ivar)']
    km_var_unit = row['km-variable unit']
    ivar_unit = row['ivar unit']
    conversion_available = row['conversion_available']
    conversion_needed = row['conversion_needed']
    # DEBUG print values for GM0363
    if km_var_year == 'co2_bk_pa_gasoline_2023':
        print(f"Processing {km_var_year} for GM0363 Amsterdam, current value: {df_km_source_data_transformed[km_var_year].loc['GM0363']}")

    # Define a helper function to only apply TransformerV2.convert to float values
    def safe_convert(val):
        try:
            if val == '?':
                return '?'
            elif pd.isna(val) or val == '':
                return pd.NA
            return transformer.convert(float(val), km_var_unit, ivar_unit, return_factor=True)
        except Exception:
            return pd.NA, None
    
    # If the km-variable requires conversion, apply the conversion function to the km-variable column
    # Else, keep the km-variable as is
    
    if km_var in km_variables_to_convert:
        result = df_km_source_data_transformed[km_var_year].apply(safe_convert) # returns a series of tuples (value, factor)
        # Split the tuples into two separate series
        value = result.apply(lambda x: x[0] if isinstance(x, (tuple, list)) else pd.NA)
        factor = result.apply(lambda x: x[1] if isinstance(x, (tuple, list)) else pd.NA)
        converted_values[ivar] = value
        # Add the conversion factor to the metadata dataframe
        df_km_meta_data_converted.loc[km_var, 'conversion_factor'] = factor.iloc[0]
    else: 
        try:
            converted_values[ivar] = df_km_source_data_transformed[km_var_year]
            # DEBUG print values for GM0363
            if km_var_year == 'co2_bk_pa_gasoline_2023':
                print(f"For GM0363 Amsterdam, setting {ivar} to {df_km_source_data_transformed[km_var_year].loc['GM0363']}")
        except Exception:
            print(f"Column {km_var_year} not found in source data. Setting {ivar} to pd.NA.")
            converted_values[ivar] = pd.NA

    # Add the ivar and ivar unit to the metadata dataframe
    # If the km-variable is already in the metadata dataframe, update the ivar and ivar unit
    # Else, add the km-variable, ivar and ivar unit to the metadata dataframe
    if km_var in df_km_meta_data_converted.index:
        df_km_meta_data_converted.loc[km_var, 'ivar'] = ivar
        df_km_meta_data_converted.loc[km_var, 'ivar_unit'] = ivar_unit
    else: 
        new_row = {col: 'N/A in KM data' for col in df_km_meta_data_converted.columns}
        new_row['ivar'] = ivar
        new_row['ivar_unit'] = ivar_unit
        new_row['conversion_factor'] = pd.NA
        # add new row to the metadata dataframe
        df_km_meta_data_converted.loc[km_var] = new_row
         

    # if counter <10: # DEBUG
    #     print(f"km-variable: {km_var}, ivar: {ivar}, km-variable unit: {km_var_unit}, ivar unit: {ivar_unit}") # DEBUG
    #     print(f"Conversion available: {conversion_available}, conversion needed: {conversion_needed}") # DEBUG
    #     print(f"Conversion factor: {df_km_meta_data_converted.loc[km_var, 'conversion_factor']}") # DEBUG
    # #     print(f"Converting variable: {km_var in km_variables_to_convert}") # DEBUG
        # print(f"Converted values for {ivar}:") # DEBUG
        # print(converted_values[ivar].head()) # DEBUG
    # counter += 1 # DEBUG
    
# Add the converted values to the new dataframe
df_converted_values = pd.DataFrame(converted_values, index=df_km_source_data_transformed.index)
df_km_source_data_converted = pd.concat([df_km_source_data_converted, df_converted_values], axis=1)

Processing inwoners...
Processing woningen...
Processing pautos...
Column pautos_2023 not found in source data. Setting no_cars to pd.NA.
Processing energie_totaal_combi...
Processing verk_totaal...
Processing elektra_totaal_combi...
Processing warm_totaal_combi...
Processing zonpvachtermeter_kwh...
Processing hern_warm_tot...
Processing gas_woningen_tj...
Processing warmwontjcor...
Processing kwbelektotaal...
Processing kwbgastotaal...
Processing warmwontier2...
Processing houtskool...
Processing houtwontj...
Processing gascomdv...
Processing gaspubldv...
Processing elcomdv...
Processing elpubldv...
Processing vbrzg_g1...
Processing vbrzg_h...
Processing vbrzg_i...
Processing vbrzg_j...
Processing vbrzg_k...
Processing vbrzg_l...
Processing vbrzg_m...
Processing vbrzg_n...
Processing vbrzg_o...
Processing vbrzg_p...
Processing vbrzg_q...
Processing vbrzg_r1...
Processing vbrzg_s...
Processing vbrzg_u...
Processing vbrze_g1...
Processing vbrze_h...
Processing vbrze_i...
Processing vbrz

TODO
The km_meta_data_converted contains some km vars that are not matched with an ivar. Maybe because they are new in KM? 
- Inspect these km vars
- Update the Transform part above as needed

### Export

We now export the KM dataframes to the intermediate folder.

TO DO: make sure CO_2 is retained (now saved as CO‚ÇÇ)

In [168]:
# Converted KM data and metadata
df_km_source_data_converted.to_csv("data/intermediate/km_source_data_converted.csv", sep=sep, index=True)
df_km_meta_data_converted.to_csv("data/intermediate/km_meta_data_converted.csv", sep=sep, index=True)

In [None]:
# # Original KM data and metadata
# df_km_source_data.to_csv("data/intermediate/km_source_data.csv", sep=sep, index=True) # no longer used as of August 2025
# df_km_meta_data.to_csv("data/intermediate/km_meta_data.csv", sep=sep, index=True) # no longer used as of August 2025

## Klimaatmonitor (national data)
Here we import a limited number of Klimaatmonitor variables on the national level. These are mainly used to distribute over the municipalities or help define distribution keys for individual municipalities.

Source: https://klimaatmonitor.databank.nl/Jive?workspace_guid=bdd470c1-9e0c-42cb-b0ed-71f20de6efc6

### Extract

First we import the most recent Klimaatmonitor data dumps

In [17]:
# Import both Klimaatmonitor data dumps
# Import the 'Thema's - 2022 - Nederland' data dump
year_km_nl = 2023
path_km_nl = Path("data", "raw", f"Klimaatmonitor - Thema's - {year_km_nl} - Nederland.xlsx")
wb_km = xw.Book(str(path_km_nl))
ws_km_national_source_data = wb_km.sheets["Data"]
ws_km_national_meta_data = wb_km.sheets["Onderwerp Informatie"]

# Read the source data into a DataFrame
# Note: The first row is the header, so we need to set it as the column names
# and remove it from the data
df_km_national_source_data = pd.DataFrame(ws_km_national_source_data.used_range.value)
df_km_national_source_data.columns = df_km_national_source_data.iloc[0]
df_km_national_source_data = df_km_national_source_data[1:]
df_km_national_source_data = df_km_national_source_data.set_index(df_km_national_source_data.columns[1])
# Remove the 'Gebieden' column
df_km_national_source_data = df_km_national_source_data.drop(columns=["Gebieden"])

# Read the metadata into a DataFrame
df_km_national_meta_data = pd.DataFrame(ws_km_national_meta_data.used_range.value)
df_km_national_meta_data.columns = df_km_national_meta_data.iloc[0]
df_km_national_meta_data = df_km_national_meta_data[1:]
df_km_national_meta_data = df_km_national_meta_data.set_index(df_km_national_meta_data.columns[0])

# Close the Excel workbook
wb_km.close()

# Preview data
df_km_national_source_data.head()

Unnamed: 0_level_0,inwoners_2023,woningen_2023,elektra_totaal_combi_2023,ovinfiets_2023,warm_totaal_combi_2023,warmwontjcor_2023,hern_warm_tot_2023,warwarmte_2023,avi_thtjbrut_2023
Code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
1,17811291.0,8125229.0,349356.0,,801169.0,11539.0,61146.0,15909.0,9446.0


The fietskilometer data is missing from Klimaatmonitor because the Klimaatmonitor variable refers to a study completed in 2017. However we still need the bicycle kilometers for our dataset update. 

The national dataset uses 17.8 bln km reported in CBS: https://opendata.cbs.nl/statline/#/CBS/nl/dataset/84687NED/table?dl=5526A.
We manually update the value for ovinfiets_2023 to this value.

In [18]:
df_km_national_source_data['ovinfiets_2023'] = 17.8e9 # km per year
df_km_national_source_data

Unnamed: 0_level_0,inwoners_2023,woningen_2023,elektra_totaal_combi_2023,ovinfiets_2023,warm_totaal_combi_2023,warmwontjcor_2023,hern_warm_tot_2023,warwarmte_2023,avi_thtjbrut_2023
Code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
1,17811291.0,8125229.0,349356.0,17800000000.0,801169.0,11539.0,61146.0,15909.0,9446.0


### Transform
We transform the KM vars to internal variables. For now we manually define the conversion table below.

In [19]:
# Define conversion table
# See metadata for what the variables mean
km_national_to_ivar_conversion = {
    f'inwoners_{year_km_nl}': 'nl_inwoners',
    f'woningen_{year_km_nl}': 'nl_woningen',
    f'elektra_totaal_combi_{year_km_nl}': 'nl_elektra_totaal_combi_tj',
    f'ovinfiets_{year_km_nl}': 'nl_fiets_mlrd_km',
    f'warm_totaal_combi_{year_km_nl}': 'nl_warm_totaal_combi_tj',
    f'warmwontjcor_{year_km_nl}': 'nl_warm_woningen_tj_temp_corrected',
    f'hern_warm_tot_{year_km_nl}': 'nl_hern_warm_tot_tj',
    f'warwarmte_{year_km_nl}': 'nl_doorg_warmte_avi_combi_tj',
    f'avi_thtjbrut_{year_km_nl}': 'nl_doorg_hern_warmte_tj'
}

# Rename the source data columns
df_km_national_source_data_converted = df_km_national_source_data.copy()
df_km_national_source_data_converted.rename(columns=km_national_to_ivar_conversion, inplace=True)
df_km_national_source_data_converted.head()

Unnamed: 0_level_0,nl_inwoners,nl_woningen,nl_elektra_totaal_combi_tj,nl_fiets_mlrd_km,nl_warm_totaal_combi_tj,nl_warm_woningen_tj_temp_corrected,nl_hern_warm_tot_tj,nl_doorg_warmte_avi_combi_tj,nl_doorg_hern_warmte_tj
Code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
1,17811291.0,8125229.0,349356.0,17800000000.0,801169.0,11539.0,61146.0,15909.0,9446.0


In [20]:
# Add the ivar and ivar unit to the metadata dataframe
df_km_national_meta_data_converted = df_km_national_meta_data.copy()
df_km_national_meta_data_converted['ivar'] = df_km_national_meta_data_converted.index.map(lambda x: km_national_to_ivar_conversion.get(f"{x}_{year_km_nl}", 'N/A in KM data'))
df_km_national_meta_data_converted['ivar_unit'] = df_km_national_meta_data_converted['Eenheid']

# update the ovinfiets row from the metadata dataframe to contain the KiM source reference
df_km_national_meta_data_converted.loc['ovinfiets', 'Bron'] = 'Kennisinstituut voor Mobiliteitsbeleid - Kerncijfers mobiliteit 2024, p. 14. Link: https://www.kimnet.nl/site/binaries/site-content/collections/documents/2024/11/18/kerncijfers-mobiliteit-2024/KiM+Kerncijfers_Mobiliteit_2024_DTdef.pdf'
df_km_national_meta_data_converted

Unnamed: 0_level_0,Onderwerp,Eenheid,Bron,Voetnoot,Beschrijving,Gegevenstype,Laatste database wijziging,ivar,ivar_unit
Code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
inwoners,Aantal inwoners,aantal,CBS - Kerncijfers Wijken en buurten,,,Numeriek,6-8-2025 08:21:16,nl_inwoners,aantal
woningen,Aantal woningen per 1 januari,aantal,CBS - Kerncijfers Wijken en buurten,,,Numeriek,18-7-2025 10:36:01,nl_woningen,aantal
elektra_totaal_combi,"Totaal bekend elektriciteitsverbruik, incl. zo...",TJ,Berekening (sub)totalen energieverbruik,,,Numeriek,27-8-2025 05:46:20,nl_elektra_totaal_combi_tj,TJ
ovinfiets,Gereisde kilometers fiets,miljard km,Kennisinstituut voor Mobiliteitsbeleid - Kernc...,,,Numeriek,10-3-2020 13:55:34,nl_fiets_mlrd_km,miljard km
warm_totaal_combi,Totaal bekend warmteverbruik (aardgas en (hern...,TJ,Berekening o.b.v. gegevens meerdere bronnen,,,Numeriek,27-8-2025 05:46:20,nl_warm_totaal_combi_tj,TJ
warmwontjcor,Verbruik stadswarmte woningen (temperatuurgeco...,TJ,Duurzaamheidsrapportage warmtenetten RVO en hi...,,,Numeriek,27-8-2025 05:32:42,nl_warm_woningen_tj_temp_corrected,TJ
hern_warm_tot,Totaal bekende hernieuwbare warmte,TJ,Verdeling regionale gegevens hernieuwbare ener...,,,Numeriek,27-8-2025 05:47:26,nl_hern_warm_tot_tj,TJ
warwarmte,"Doorgeleverde warmte afvalverbranding (AVI, fo...",TJ,"Werkgroep Afvalregistratie, onderdeel afvalver...",,,Numeriek,20-8-2025 12:05:57,nl_doorg_warmte_avi_combi_tj,TJ
avi_thtjbrut,Afvalverbrandingsinstallatie hernieuwbare warmte,TJ,"Werkgroep Afvalregistratie, onderdeel afvalver...",,,Numeriek,20-8-2025 13:28:05,nl_doorg_hern_warmte_tj,TJ


We can define a few new internal variables based on these KM vars:
- nl_gas_totaal: total heat consumption - renewable heat consumption
- nl_biogene_fractie_afval: share of biodegredable waste in total waste

In [21]:
# Calculate variables
df_km_national_source_data_converted['nl_gas_totaal'] = df_km_national_source_data_converted['nl_warm_totaal_combi_tj'] - df_km_national_source_data_converted['nl_hern_warm_tot_tj']
df_km_national_source_data_converted['nl_biogene_fractie_afval'] = df_km_national_source_data_converted['nl_doorg_hern_warmte_tj'] / df_km_national_source_data_converted['nl_doorg_warmte_avi_combi_tj']

# Add information to metadata
df_new_meta_data = pd.DataFrame({
    'ivar': ['nl_gas_totaal', 'nl_biogene_fractie_afval'],
    'ivar_unit': ['TJ', 'factor'],
    'Eenheid': ['TJ', 'factor'],
    'Onderwerp': ['Gasverbruik totaal', 'Biogene fractie in afvalverbranding'],
    'Bron': ['Bewerking Quintel', 'Bewerking Quintel'],
    'Beschrijving': ['Totale warmte - hernieuwbare warmte', 'Hernieuwbare warmte / doorgeleverde warmte afvalverbranding']
}, index=['nl_gas_totaal', 'nl_biogene_fractie_afval'])
df_km_national_meta_data_converted = pd.concat([df_km_national_meta_data_converted, df_new_meta_data], axis=0)

df_km_national_meta_data_converted.tail()

Unnamed: 0,Onderwerp,Eenheid,Bron,Voetnoot,Beschrijving,Gegevenstype,Laatste database wijziging,ivar,ivar_unit
hern_warm_tot,Totaal bekende hernieuwbare warmte,TJ,Verdeling regionale gegevens hernieuwbare ener...,,,Numeriek,27-8-2025 05:47:26,nl_hern_warm_tot_tj,TJ
warwarmte,"Doorgeleverde warmte afvalverbranding (AVI, fo...",TJ,"Werkgroep Afvalregistratie, onderdeel afvalver...",,,Numeriek,20-8-2025 12:05:57,nl_doorg_warmte_avi_combi_tj,TJ
avi_thtjbrut,Afvalverbrandingsinstallatie hernieuwbare warmte,TJ,"Werkgroep Afvalregistratie, onderdeel afvalver...",,,Numeriek,20-8-2025 13:28:05,nl_doorg_hern_warmte_tj,TJ
nl_gas_totaal,Gasverbruik totaal,TJ,Bewerking Quintel,,Totale warmte - hernieuwbare warmte,,,nl_gas_totaal,TJ
nl_biogene_fractie_afval,Biogene fractie in afvalverbranding,factor,Bewerking Quintel,,Hernieuwbare warmte / doorgeleverde warmte afv...,,,nl_biogene_fractie_afval,factor


### Export
We export the national dataframe and metadata to csv

**Note**: some variables in the dataframe are redundant, but for completeness sake we just export everything. It's only a few variables.
TODO: clean up this dataframe with only relevant ivars

In [22]:
# Converted KM data and metadata
df_km_national_source_data_converted.to_csv("data/intermediate/km_national_source_data_converted.csv", sep=sep, index=True)
df_km_national_meta_data_converted.to_csv("data/intermediate/km_national_meta_data_converted.csv", sep=sep, index=True)

## Transport research
Here we import the manual research on transport related parameters. The data is manually adapted from the 2019 dataset update Excel D, tab transport_research.
**NOTE**: the data has not been updated (yet)!

TODO: update data

### Extract

In [47]:
# Import D_transport_research from raw folder
path_transport_research = Path("data", "raw", f"D_transport_research.csv")
df_transport_research = pd.read_csv(path_transport_research, sep=sep, index_col=0)
df_transport_research.head()

Unnamed: 0_level_0,Variable,Value,Unit,Info
Road vehicle,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Car (FEV),electric_car_fev_annual_kms,15000.0,km/y,
Car (FEV),electric_car_fev_annual_kwh,3000.0,kWh/y,(20 kwh per 100 km)
Car (FEV),electric_car_fev_annual_tj,0.0108,TJ/y,
Car (PHEV),electric_car_phev_relative_annual_electric_kms,0.3,factor,
Car (PHEV),electric_car_phev_annual_tj,0.00324,TJ/y,


### Transform
We trim the data slightly so that it's easier to use.

In [48]:
# Remove Info column from df_transport_research
df_transport_research_cleaned = df_transport_research.copy()
df_transport_research_cleaned.drop(columns=['Info'], inplace=True)

# Set Variable column as index
df_transport_research_cleaned.set_index('Variable', inplace=True)
df_transport_research_cleaned.head()

Unnamed: 0_level_0,Value,Unit
Variable,Unnamed: 1_level_1,Unnamed: 2_level_1
electric_car_fev_annual_kms,15000.0,km/y
electric_car_fev_annual_kwh,3000.0,kWh/y
electric_car_fev_annual_tj,0.0108,TJ/y
electric_car_phev_relative_annual_electric_kms,0.3,factor
electric_car_phev_annual_tj,0.00324,TJ/y


We add the prefix 'tres_' to identify the source of the data. This is (hopefully) helpful in the sector notebooks when comnbining internal variables to calculate ETLocal keys.
Also, these prefixes are present in the YAML files at the moment, so they are necessary for the YAML Calculator.


In [49]:
# Add prefix to the index
df_transport_research_cleaned.index = 'tres_' + df_transport_research_cleaned.index
df_transport_research_cleaned

Unnamed: 0_level_0,Value,Unit
Variable,Unnamed: 1_level_1,Unnamed: 2_level_1
tres_electric_car_fev_annual_kms,15000.0,km/y
tres_electric_car_fev_annual_kwh,3000.0,kWh/y
tres_electric_car_fev_annual_tj,0.0108,TJ/y
tres_electric_car_phev_relative_annual_electric_kms,0.3,factor
tres_electric_car_phev_annual_tj,0.00324,TJ/y
tres_electric_bus_annual_kms,60000.0,km/y
tres_electric_bus_annual_kwh,120000.0,kWh/y
tres_electric_bus_annual_tj,0.432,TJ/y
tres_electric_truck_annual_kms,50000.0,km/y
tres_electric_truck_annual_kwh,50000.0,kWh/y


### Export
Export dataframe to transport

In [50]:
df_transport_research_cleaned.to_csv("data/intermediate/transport_research_cleaned.csv", sep=sep, index=True)

## Emissie-registratie for final demand keys industry & agriculture

Here we import the data from emissie-registratie.

Since we only focus on the final demand of industry keys, we want just want to extract the following 'ivars' that are described in Excel B by Michiel:
- steel_co2_scaled
- aluminium_co2_scaled
- fertilizers_co2_scaled
- refineries_co2_scaled
- chemical_other_co2_relative
- food_co2_relative
- paper_co2_relative
- metal_other_co2_relative
- industry_other_co2_relative

For Agriculture we focus on the following 'ivars':
- agriculture_gas_chp_relative
- agriculture_final_demand_relative

The chart below summarizes the process of this part of the notebook. It is generated using the code in `config/mermaid_flow_diagrams/preprocessing_emissieregistratie_data_transformations_code.mmd`.

![emissieregistratie_flow_diagram](preprocessing_emissieregistratie_data_transformations_diagram.png)

### Extract

In [51]:
# Import B_emissie_registratie from raw folder
path_emissie_registratie = Path("data", "raw", f"broeikasgassen_doelgroep_subdoelgroep2023.xlsx")
df_emissie_registratie = pd.read_excel(path_emissie_registratie, sheet_name='2023')
df_emissie_registratie.head()

Unnamed: 0,DATASET,EMISSIEJAAR,CODE,PROCES_OMSCHRIJVING,SUBDOELGROEPCODE,SUBDOELGROEPNAAM,DOELGROEPCODE,DOELGROEPNAAM,GEMEENTECODE,GEMEENTENAAM,STOFCODE,STOFNAAM,EMISSIE_KG
0,ER Reeks 1990-2023 Definitief,2023,910004,SBI 38: Afvalinzameling en -behandeling (afval...,pub_0904,Overige afvalbedrijven,9,Afvalverwijdering,1680,Aa en Hunze,205,Distikstofoxide,72.140339
1,ER Reeks 1990-2023 Definitief,2023,20401,SBI 41-43: Bouwnijverheid,pub_0802,Overig bouw,8,Bouw,1680,Aa en Hunze,205,Distikstofoxide,1.426587
2,ER Reeks 1990-2023 Definitief,2023,12102,"Vuurhaarden consumenten, hoofdverwarming woningen",pub_0701,Energiegebruik Consumenten,7,Consumenten,1680,Aa en Hunze,205,Distikstofoxide,33.544882
3,ER Reeks 1990-2023 Definitief,2023,800700,"Vuurhaarden consumenten, koken",pub_0701,Energiegebruik Consumenten,7,Consumenten,1680,Aa en Hunze,205,Distikstofoxide,1.069947
4,ER Reeks 1990-2023 Definitief,2023,800800,"Vuurhaarden consumenten, warm water voorziening",pub_0701,Energiegebruik Consumenten,7,Consumenten,1680,Aa en Hunze,205,Distikstofoxide,8.280051


In [52]:
# Replace "Nuenen, Gerwen en Nederwetten" with "Nuenen" in the GEMEENTENAAM column

df_emissie_registratie.loc[df_emissie_registratie['GEMEENTENAAM'] == "Nuenen, Gerwen en Nederwetten", 'GEMEENTENAAM'] = "Nuenen"

# Verify the change
print("Instances of 'Nuenen' in GEMEENTENAAM column:")
print(df_emissie_registratie[df_emissie_registratie['GEMEENTENAAM'].str.contains("Nuenen", na=False)]['GEMEENTENAAM'].value_counts())

Instances of 'Nuenen' in GEMEENTENAAM column:
GEMEENTENAAM
Nuenen    336
Name: count, dtype: int64


### Transform

#### steel_co2_scaled

In [53]:
variables_for_steel_co2_scaled = [
    'SBI 24 (per bedrijf): Vervaardiging van metalen in primaire vorm',
    'SBI 24.1-24.3: Basismetaalindustrie, verwerking en vervaardiging ijzer en staal, anode-gebruik bij electrostaal productie',
    'SBI 24.1-24.3: Basismetaalindustrie, verwerking en vervaardiging ijzer en staal, diffuus',
    'SBI 24.1-24.3: Basismetaalindustrie, verwerking en vervaardiging ijzer en staal, kalkgebruik (PBL)',
    'SBI 24.1-24.3 (per bedrijf): Basismetaalindustrie, verwerking en vervaardiging ijzer en staal'
]

In [54]:
df_steel_co2_scaled = df_emissie_registratie[df_emissie_registratie['PROCES_OMSCHRIJVING'].isin(variables_for_steel_co2_scaled)]

In [55]:
# Get all column names except 'GEMEENTENAAM' and 'EMISSIE_KG'
other_columns = [col for col in df_steel_co2_scaled.columns if col not in ['GEMEENTENAAM', 'EMISSIE_KG']]

# Create aggregation dictionary
agg_dict = {'EMISSIE_KG': 'sum'}  # Sum the emissions

# For all other columns, keep the first value (assuming they're the same within each group)
for col in other_columns:
    agg_dict[col] = 'first'

# Apply the groupby with all columns
df_steel_co2_scaled = df_steel_co2_scaled.groupby('GEMEENTENAAM').agg(agg_dict).reset_index()

In [56]:
df_steel_co2_scaled['df_steel_co2_scaled'] = df_steel_co2_scaled['EMISSIE_KG'] / df_steel_co2_scaled['EMISSIE_KG'].sum()
df_steel_co2_scaled_final = df_steel_co2_scaled[['GEMEENTECODE', 'df_steel_co2_scaled']].reset_index()

#### aluminium_co2_scaled

In [57]:

variables_for_aluminium_co2_scaled = [
    'SBI 24.45 (per bedrijf) Vervaardiging van overige non-ferrometalen, aluminium',
    'SBI 24.45: Vervaardiging van overige non-ferrometalen, aluminium',
    'SBI 24.4/24.53/24.54: Vervaardiging en gieten van lichte en overige non-ferrometalen'
]

In [58]:
df_aluminium_co2_scaled = df_emissie_registratie[df_emissie_registratie['PROCES_OMSCHRIJVING'].isin(variables_for_aluminium_co2_scaled)]

In [59]:
df_aluminium_co2_scaled

Unnamed: 0,DATASET,EMISSIEJAAR,CODE,PROCES_OMSCHRIJVING,SUBDOELGROEPCODE,SUBDOELGROEPNAAM,DOELGROEPCODE,DOELGROEPNAAM,GEMEENTECODE,GEMEENTENAAM,STOFCODE,STOFNAAM,EMISSIE_KG
4400,ER Reeks 1990-2023 Definitief,2023,8920100,SBI 24.4/24.53/24.54: Vervaardiging en gieten ...,pub_0201,Basismetaal,2,Overige industrie,307,Amersfoort,205,Distikstofoxide,0.087408
4498,ER Reeks 1990-2023 Definitief,2023,8920100,SBI 24.4/24.53/24.54: Vervaardiging en gieten ...,pub_0201,Basismetaal,2,Overige industrie,307,Amersfoort,204,Koolstofdioxide,49210.454746
4631,ER Reeks 1990-2023 Definitief,2023,8920100,SBI 24.4/24.53/24.54: Vervaardiging en gieten ...,pub_0201,Basismetaal,2,Overige industrie,307,Amersfoort,523,Methaan,4.982231
4632,ER Reeks 1990-2023 Definitief,2023,8914702,SBI 24.45: Vervaardiging van overige non-ferro...,pub_0201,Basismetaal,2,Overige industrie,307,Amersfoort,523,Methaan,0.000000
4695,ER Reeks 1990-2023 Definitief,2023,8914702,SBI 24.45: Vervaardiging van overige non-ferro...,pub_0201,Basismetaal,2,Overige industrie,307,Amersfoort,641,Overige PFK's,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...
113022,ER Reeks 1990-2023 Definitief,2023,8914702,SBI 24.45: Vervaardiging van overige non-ferro...,pub_0201,Basismetaal,2,Overige industrie,301,Zutphen,909,PFK 116 (Perfluorethaan),0.000000
113023,ER Reeks 1990-2023 Definitief,2023,8914702,SBI 24.45: Vervaardiging van overige non-ferro...,pub_0201,Basismetaal,2,Overige industrie,301,Zutphen,908,PFK 14 (Perfluormethaan),0.000000
113383,ER Reeks 1990-2023 Definitief,2023,8920100,SBI 24.4/24.53/24.54: Vervaardiging en gieten ...,pub_0201,Basismetaal,2,Overige industrie,642,Zwijndrecht,205,Distikstofoxide,0.174815
113484,ER Reeks 1990-2023 Definitief,2023,8920100,SBI 24.4/24.53/24.54: Vervaardiging en gieten ...,pub_0201,Basismetaal,2,Overige industrie,642,Zwijndrecht,204,Koolstofdioxide,98420.909492


In [60]:
# Get all column names except 'GEMEENTENAAM' and 'EMISSIE_KG'
other_columns = [col for col in df_aluminium_co2_scaled.columns if col not in ['GEMEENTENAAM', 'EMISSIE_KG']]

# Create aggregation dictionary
agg_dict = {'EMISSIE_KG': 'sum'}  # Sum the emissions

# For all other columns, keep the first value (assuming they're the same within each group)
for col in other_columns:
    agg_dict[col] = 'first'

# Apply the groupby with all columns
df_aluminium_co2_scaled = df_aluminium_co2_scaled.groupby('GEMEENTENAAM').agg(agg_dict).reset_index()

In [61]:
df_aluminium_co2_scaled

Unnamed: 0,GEMEENTENAAM,EMISSIE_KG,DATASET,EMISSIEJAAR,CODE,PROCES_OMSCHRIJVING,SUBDOELGROEPCODE,SUBDOELGROEPNAAM,DOELGROEPCODE,DOELGROEPNAAM,GEMEENTECODE,STOFCODE,STOFNAAM
0,'s-Gravenhage,5.413708e+05,ER Reeks 1990-2023 Definitief,2023,8920100,SBI 24.4/24.53/24.54: Vervaardiging en gieten ...,pub_0201,Basismetaal,2,Overige industrie,518,205,Distikstofoxide
1,Amersfoort,4.921552e+04,ER Reeks 1990-2023 Definitief,2023,8920100,SBI 24.4/24.53/24.54: Vervaardiging en gieten ...,pub_0201,Basismetaal,2,Overige industrie,307,205,Distikstofoxide
2,Amsterdam,1.230388e+05,ER Reeks 1990-2023 Definitief,2023,8920100,SBI 24.4/24.53/24.54: Vervaardiging en gieten ...,pub_0201,Basismetaal,2,Overige industrie,363,205,Distikstofoxide
3,Apeldoorn,3.199009e+06,ER Reeks 1990-2023 Definitief,2023,8920100,SBI 24.4/24.53/24.54: Vervaardiging en gieten ...,pub_0201,Basismetaal,2,Overige industrie,200,205,Distikstofoxide
4,Barneveld,7.382329e+04,ER Reeks 1990-2023 Definitief,2023,8920100,SBI 24.4/24.53/24.54: Vervaardiging en gieten ...,pub_0201,Basismetaal,2,Overige industrie,203,205,Distikstofoxide
...,...,...,...,...,...,...,...,...,...,...,...,...,...
58,Wierden,6.644096e+05,ER Reeks 1990-2023 Definitief,2023,8920100,SBI 24.4/24.53/24.54: Vervaardiging en gieten ...,pub_0201,Basismetaal,2,Overige industrie,189,205,Distikstofoxide
59,Zevenaar,7.136251e+05,ER Reeks 1990-2023 Definitief,2023,8920100,SBI 24.4/24.53/24.54: Vervaardiging en gieten ...,pub_0201,Basismetaal,2,Overige industrie,299,205,Distikstofoxide
60,Zuidplas,4.921552e+04,ER Reeks 1990-2023 Definitief,2023,8920100,SBI 24.4/24.53/24.54: Vervaardiging en gieten ...,pub_0201,Basismetaal,2,Overige industrie,1892,205,Distikstofoxide
61,Zutphen,7.382329e+04,ER Reeks 1990-2023 Definitief,2023,8920100,SBI 24.4/24.53/24.54: Vervaardiging en gieten ...,pub_0201,Basismetaal,2,Overige industrie,301,205,Distikstofoxide


In [62]:
df_aluminium_co2_scaled.sort_values(by='EMISSIE_KG', ascending=False, inplace=True)
df_aluminium_co2_scaled

Unnamed: 0,GEMEENTENAAM,EMISSIE_KG,DATASET,EMISSIEJAAR,CODE,PROCES_OMSCHRIJVING,SUBDOELGROEPCODE,SUBDOELGROEPNAAM,DOELGROEPCODE,DOELGROEPNAAM,GEMEENTECODE,STOFCODE,STOFNAAM
29,Kerkrade,1.345418e+07,ER Reeks 1990-2023 Definitief,2023,T104702,SBI 24.45 (per bedrijf) Vervaardiging van over...,pub_0201,Basismetaal,2,Overige industrie,928,204,Koolstofdioxide
46,Rotterdam,1.294339e+07,ER Reeks 1990-2023 Definitief,2023,8920100,SBI 24.4/24.53/24.54: Vervaardiging en gieten ...,pub_0201,Basismetaal,2,Overige industrie,599,205,Distikstofoxide
25,Heusden,1.156776e+07,ER Reeks 1990-2023 Definitief,2023,T104702,SBI 24.45 (per bedrijf) Vervaardiging van over...,pub_0201,Basismetaal,2,Overige industrie,797,205,Distikstofoxide
52,Vlissingen,7.839917e+06,ER Reeks 1990-2023 Definitief,2023,8920100,SBI 24.4/24.53/24.54: Vervaardiging en gieten ...,pub_0201,Basismetaal,2,Overige industrie,718,205,Distikstofoxide
17,Epe,5.977202e+06,ER Reeks 1990-2023 Definitief,2023,8920100,SBI 24.4/24.53/24.54: Vervaardiging en gieten ...,pub_0201,Basismetaal,2,Overige industrie,232,205,Distikstofoxide
...,...,...,...,...,...,...,...,...,...,...,...,...,...
22,Heerenveen,4.921552e+04,ER Reeks 1990-2023 Definitief,2023,8920100,SBI 24.4/24.53/24.54: Vervaardiging en gieten ...,pub_0201,Basismetaal,2,Overige industrie,74,205,Distikstofoxide
28,Horst aan de Maas,4.921552e+04,ER Reeks 1990-2023 Definitief,2023,8920100,SBI 24.4/24.53/24.54: Vervaardiging en gieten ...,pub_0201,Basismetaal,2,Overige industrie,1507,205,Distikstofoxide
60,Zuidplas,4.921552e+04,ER Reeks 1990-2023 Definitief,2023,8920100,SBI 24.4/24.53/24.54: Vervaardiging en gieten ...,pub_0201,Basismetaal,2,Overige industrie,1892,205,Distikstofoxide
1,Amersfoort,4.921552e+04,ER Reeks 1990-2023 Definitief,2023,8920100,SBI 24.4/24.53/24.54: Vervaardiging en gieten ...,pub_0201,Basismetaal,2,Overige industrie,307,205,Distikstofoxide


In [63]:
df_aluminium_co2_scaled['df_aluminium_co2_scaled'] = df_aluminium_co2_scaled['EMISSIE_KG'] / df_aluminium_co2_scaled['EMISSIE_KG'].sum()
df_aluminium_co2_scaled_final = df_aluminium_co2_scaled[['GEMEENTECODE', 'df_aluminium_co2_scaled']].reset_index()
df_aluminium_co2_scaled_final

Unnamed: 0,index,GEMEENTECODE,df_aluminium_co2_scaled
0,29,928,0.147465
1,46,599,0.141867
2,25,797,0.126789
3,52,718,0.085930
4,17,232,0.065513
...,...,...,...
58,22,74,0.000539
59,28,1507,0.000539
60,60,1892,0.000539
61,1,307,0.000539


#### fertilizers_co2_scaled

In [64]:
variables_for_fertilizer_co2_scaled = [
    'Chemische Industrie kunstmeststoffen'
]


In [65]:
df_fertilizers_co2_scaled = df_emissie_registratie[df_emissie_registratie['SUBDOELGROEPNAAM'].isin(variables_for_fertilizer_co2_scaled)]

In [66]:
# Get all column names except 'GEMEENTENAAM' and 'EMISSIE_KG'
other_columns = [col for col in df_fertilizers_co2_scaled.columns if col not in ['GEMEENTENAAM', 'EMISSIE_KG']]

# Create aggregation dictionary
agg_dict = {'EMISSIE_KG': 'sum'}  # Sum the emissions

# For all other columns, keep the first value (assuming they're the same within each group)
for col in other_columns:
    agg_dict[col] = 'first'

# Apply the groupby with all columns
df_fertilizers_co2_scaled = df_fertilizers_co2_scaled.groupby('GEMEENTENAAM').agg(agg_dict).reset_index()

In [67]:
df_fertilizers_co2_scaled['df_fertilizers_co2_scaled'] = df_fertilizers_co2_scaled['EMISSIE_KG'] / df_fertilizers_co2_scaled['EMISSIE_KG'].sum()
df_fertilizers_co2_scaled_final = df_fertilizers_co2_scaled[['GEMEENTECODE', 'df_fertilizers_co2_scaled']].reset_index()
df_fertilizers_co2_scaled_final


Unnamed: 0,index,GEMEENTECODE,df_fertilizers_co2_scaled
0,0,363,0.006079
1,1,917,0.003476
2,2,715,0.990225
3,3,622,0.00022


#### refineries_co2_scaled

In [68]:
variables_for_refineries_co2_scaled = [
    'Raffinaderijen'
]


In [69]:
df_refineries_co2_scaled = df_emissie_registratie[df_emissie_registratie['DOELGROEPNAAM'].isin(variables_for_refineries_co2_scaled)]

In [70]:
# Get all column names except 'GEMEENTENAAM' and 'EMISSIE_KG'
other_columns = [col for col in df_refineries_co2_scaled.columns if col not in ['GEMEENTENAAM', 'EMISSIE_KG']]

# Create aggregation dictionary
agg_dict = {'EMISSIE_KG': 'sum'}  # Sum the emissions

# For all other columns, keep the first value (assuming they're the same within each group)
for col in other_columns:
    agg_dict[col] = 'first'

# Apply the groupby with all columns
df_refineries_co2_scaled = df_refineries_co2_scaled.groupby('GEMEENTENAAM').agg(agg_dict).reset_index()

In [71]:
df_refineries_co2_scaled['df_refineries_co2_scaled'] = df_refineries_co2_scaled['EMISSIE_KG'] / df_refineries_co2_scaled['EMISSIE_KG'].sum()
df_refineries_co2_scaled_final = df_refineries_co2_scaled[['GEMEENTECODE', 'df_refineries_co2_scaled']].reset_index()
df_refineries_co2_scaled_final


Unnamed: 0,index,GEMEENTECODE,df_refineries_co2_scaled
0,0,518,1.536551e-05
1,1,796,4.368577e-06
2,2,1680,7.003117e-07
3,3,358,9.026759e-07
4,4,197,7.423833e-07
...,...,...,...
339,339,879,6.152394e-07
340,340,301,1.324899e-06
341,341,1896,6.407697e-07
342,342,642,1.234981e-06


#### Variables for other industry
These comprise of:
- chemical_other_co2_relative
- food_co2_relative
- paper_co2_relative
- metal_other_co2_relative
- industry_other_co2_relative


In [72]:


# Definieer de categorieën
other_industry_sector = [
    'Overige industrie'
]

other_industry_subsector = [
    'Chemische Industrie basisproducten',
    'Chemische Industrie bestrijdingsmiddelen', 
    'Chemische Industrie overig',
    'Voedings- en genotmiddelenindustrie',
    'Papier(waren)',
    'Basismetaal'
]

other_industry_emissieoorzaak = [
    'SBI 24.5 (per bedrijf): Gieten van metalen',
    'SBI 24.5: Gieten van metalen',
    'SBI 24.45 (per bedrijf) Vervaardiging van overige non-ferrometalen, zink',
    'SBI 24.45 (per bedrijf) Vervaardiging van overige non-ferrometalen, lood',
    'SBI 24.45 (per bedrijf) Vervaardiging van overige non-ferrometalen, koper'
]


variables_for_aluminium_co2_scaled = [
    'SBI 24.45 (per bedrijf) Vervaardiging van overige non-ferrometalen, aluminium',
    'SBI 24.45: Vervaardiging van overige non-ferrometalen, aluminium',
    'SBI 24.4/24.53/24.54: Vervaardiging en gieten van lichte en overige non-ferrometalen'
]


# Filter op sector "Overige industrie" en CO2
df_other_industry_sector = df_emissie_registratie[
    (df_emissie_registratie['DOELGROEPNAAM'].isin(other_industry_sector)) &
    (df_emissie_registratie['STOFNAAM'] == 'Koolstofdioxide')
].reset_index()[['GEMEENTECODE', 'GEMEENTENAAM', 'EMISSIE_KG']].copy()

# Filter op relevante subsectoren en CO2
df_other_industry_subsector = df_emissie_registratie[
    (df_emissie_registratie['SUBDOELGROEPNAAM'].isin(other_industry_subsector)) &
    (df_emissie_registratie['STOFNAAM'] == 'Koolstofdioxide')
].reset_index()[['GEMEENTECODE', 'GEMEENTENAAM', 'SUBDOELGROEPNAAM', 'EMISSIE_KG']].copy()

# Filter op relevante emissieoorzaken en CO2
df_other_industry_emissieoorzaak = df_emissie_registratie[
    (df_emissie_registratie['PROCES_OMSCHRIJVING'].isin(other_industry_emissieoorzaak)) &
    (df_emissie_registratie['STOFNAAM'] == 'Koolstofdioxide')
].reset_index()[['GEMEENTECODE', 'GEMEENTENAAM', 'PROCES_OMSCHRIJVING', 'EMISSIE_KG']].copy()

# Groepeer subsector data per gemeente
df_subsector_grouped = df_other_industry_subsector.pivot_table(
    index='GEMEENTECODE', 
    columns='SUBDOELGROEPNAAM', 
    values='EMISSIE_KG', 
    fill_value=0, 
    aggfunc='sum'
).reset_index()

# Groepeer emissieoorzaak data per gemeente  
df_emissieoorzaak_grouped = df_other_industry_emissieoorzaak.pivot_table(
    index='GEMEENTECODE',
    columns='PROCES_OMSCHRIJVING',
    values='EMISSIE_KG',
    fill_value=0,
    aggfunc='sum'
).reset_index()

# Groepeer sector data per gemeente en behoud Gebied
df_sector_grouped = df_other_industry_sector.groupby(['GEMEENTECODE', 'GEMEENTENAAM'])['EMISSIE_KG'].sum().reset_index()
df_sector_grouped.columns = ['GEMEENTECODE', 'GEMEENTENAAM', 'Overige industrie']

# Voeg alle data samen
df_other_industry_combined = df_sector_grouped.copy()

# Voeg subsector kolommen toe
for subsector in other_industry_subsector:
    if subsector in df_subsector_grouped.columns:
        df_other_industry_combined = df_other_industry_combined.merge(
            df_subsector_grouped[['GEMEENTECODE', subsector]], 
            on='GEMEENTECODE', 
            how='left'
        )
        df_other_industry_combined[subsector] = df_other_industry_combined[subsector].fillna(0)
    else:
        df_other_industry_combined[subsector] = 0

# Voeg emissieoorzaak kolommen toe
for emissieoorzaak in other_industry_emissieoorzaak:
    if emissieoorzaak in df_emissieoorzaak_grouped.columns:
        df_other_industry_combined = df_other_industry_combined.merge(
            df_emissieoorzaak_grouped[['GEMEENTECODE', emissieoorzaak]], 
            on='GEMEENTECODE', 
            how='left'
        )
        df_other_industry_combined[emissieoorzaak] = df_other_industry_combined[emissieoorzaak].fillna(0)
    else:
        df_other_industry_combined[emissieoorzaak] = 0

# Bereken de afgeleide kolommen volgens de Excel formules
# chemical_other_co2 = SUM van alle chemische industrie subsectoren
df_other_industry_combined['chemical_other_co2'] = (
    df_other_industry_combined['Chemische Industrie basisproducten'] +
    df_other_industry_combined['Chemische Industrie bestrijdingsmiddelen'] +
    df_other_industry_combined['Chemische Industrie overig']
)

# food_co2 = Voedings- en genotmiddelenindustrie
df_other_industry_combined['food_co2'] = df_other_industry_combined['Voedings- en genotmiddelenindustrie']

# paper_co2 = Papier(waren)
df_other_industry_combined['paper_co2'] = df_other_industry_combined['Papier(waren)']

# metal_other_co2 = SUM van Basismetaal + alle SBI codes
df_other_industry_combined['metal_other_co2'] = (
    df_other_industry_combined['SBI 24.5 (per bedrijf): Gieten van metalen'] +
    df_other_industry_combined['SBI 24.5: Gieten van metalen'] +
    df_other_industry_combined['SBI 24.45 (per bedrijf) Vervaardiging van overige non-ferrometalen, zink'] +
    df_other_industry_combined['SBI 24.45 (per bedrijf) Vervaardiging van overige non-ferrometalen, lood'] +
    df_other_industry_combined['SBI 24.45 (per bedrijf) Vervaardiging van overige non-ferrometalen, koper']
)

# industry_other_co2 = Overige industrie - (food_co2 + paper_co2 + Basismetaal)
# Dit is volgens de Excel formule: C2-SUM(G2:I2)
df_other_industry_combined['industry_other_co2'] = (
    df_other_industry_combined['Overige industrie'] - 
    df_other_industry_combined['Voedings- en genotmiddelenindustrie'] -
    df_other_industry_combined['Papier(waren)'] -
    df_other_industry_combined['Basismetaal']
)

# Bereken de relatieve aandelen
total_co2 = (
    df_other_industry_combined['chemical_other_co2'] +
    df_other_industry_combined['food_co2'] +
    df_other_industry_combined['paper_co2'] +
    df_other_industry_combined['metal_other_co2'] +
    df_other_industry_combined['industry_other_co2']
)



# Voorkom deling door nul
total_co2_safe = total_co2.replace(0, 1)  # Vervang 0 door 1 om deling door nul te voorkomen

df_other_industry_combined['chemical_other_co2_relative'] = df_other_industry_combined['chemical_other_co2'] / total_co2_safe
df_other_industry_combined['food_co2_relative'] = df_other_industry_combined['food_co2'] / total_co2_safe
df_other_industry_combined['paper_co2_relative'] = df_other_industry_combined['paper_co2'] / total_co2_safe
df_other_industry_combined['metal_other_co2_relative'] = df_other_industry_combined['metal_other_co2'] / total_co2_safe
df_other_industry_combined['industry_other_co2_relative'] = df_other_industry_combined['industry_other_co2'] / total_co2_safe

df_other_industry_combined['chemical_other_co2_scaled'] = df_other_industry_combined['chemical_other_co2'] / df_other_industry_combined['chemical_other_co2'].sum()
df_other_industry_combined['food_co2_scaled'] = df_other_industry_combined['food_co2'] / df_other_industry_combined['food_co2'].sum()
df_other_industry_combined['paper_co2_scaled'] = df_other_industry_combined['paper_co2'] / df_other_industry_combined['paper_co2'].sum()
df_other_industry_combined['metal_other_co2_scaled'] = df_other_industry_combined['metal_other_co2'] / df_other_industry_combined['metal_other_co2'].sum()
df_other_industry_combined['industry_other_co2_scaled'] = df_other_industry_combined['industry_other_co2'] / df_other_industry_combined['industry_other_co2'].sum()



# Zet relatieve aandelen op 0 waar totaal CO2 daadwerkelijk 0 was
mask_zero_total = (total_co2 == 0)
df_other_industry_combined.loc[mask_zero_total, 'chemical_other_co2_relative'] = 0
df_other_industry_combined.loc[mask_zero_total, 'food_co2_relative'] = 0
df_other_industry_combined.loc[mask_zero_total, 'paper_co2_relative'] = 0
df_other_industry_combined.loc[mask_zero_total, 'metal_other_co2_relative'] = 0
df_other_industry_combined.loc[mask_zero_total, 'industry_other_co2_relative'] = 0

# Het resultaat df_other_industry_combined bevat nu dezelfde kolommen en berekeningen
# als de other_industry tab in de Excel file
print("Analyse voltooid!")
print(f"Aantal gemeenten: {len(df_other_industry_combined)}")
print(f"Totaal CO2 emissies 'Overige industrie': {df_other_industry_combined['Overige industrie'].sum():,.0f}")
print("\nEerste 5 rijen:")
print(df_other_industry_combined.head())

Analyse voltooid!
Aantal gemeenten: 341
Totaal CO2 emissies 'Overige industrie': 11,296,314,622

Eerste 5 rijen:
   GEMEENTECODE GEMEENTENAAM  Overige industrie  \
0            14    Groningen       1.633611e+08   
1            34       Almere       1.278731e+07   
2            37  Stadskanaal       4.290899e+06   
3            47      Veendam       4.330131e+06   
4            50     Zeewolde       3.759252e+06   

   Chemische Industrie basisproducten  \
0                        3.689764e+06   
1                        2.021589e+06   
2                        5.474292e+05   
3                        1.033577e+06   
4                        1.200270e+06   

   Chemische Industrie bestrijdingsmiddelen  Chemische Industrie overig  \
0                                         0                0.000000e+00   
1                                         0                0.000000e+00   
2                                         0                0.000000e+00   
3                                

In [73]:
df_other_industry_combined['chemical_other_co2']

0      3.689764e+06
1      2.021589e+06
2      5.474292e+05
3      1.071317e+07
4      1.200270e+06
           ...     
336    5.348960e+08
337    3.453795e+06
338    3.952111e+05
339    2.865219e+06
340    2.445563e+05
Name: chemical_other_co2, Length: 341, dtype: float64

In [74]:
df_other_industry_combined

Unnamed: 0,GEMEENTECODE,GEMEENTENAAM,Overige industrie,Chemische Industrie basisproducten,Chemische Industrie bestrijdingsmiddelen,Chemische Industrie overig,Voedings- en genotmiddelenindustrie,Papier(waren),Basismetaal,SBI 24.5 (per bedrijf): Gieten van metalen,...,chemical_other_co2_relative,food_co2_relative,paper_co2_relative,metal_other_co2_relative,industry_other_co2_relative,chemical_other_co2_scaled,food_co2_scaled,paper_co2_scaled,metal_other_co2_scaled,industry_other_co2_scaled
0,14,Groningen,1.633611e+08,3.689764e+06,0,0.000000e+00,1.463955e+08,8.442038e+06,0.000000e+00,0.0,...,0.022088,0.876353,0.050536,0.0,0.051024,0.000280,0.042080,0.009533,0.0,0.003842
1,34,Almere,1.278731e+07,2.021589e+06,0,0.000000e+00,7.311246e+06,2.505937e+05,0.000000e+00,0.0,...,0.136512,0.493706,0.016922,0.0,0.352860,0.000153,0.002102,0.000283,0.0,0.002355
2,37,Stadskanaal,4.290899e+06,5.474292e+05,0,0.000000e+00,1.417197e+06,0.000000e+00,0.000000e+00,0.0,...,0.113144,0.292910,0.000000,0.0,0.593945,0.000042,0.000407,0.000000,0.0,0.001295
3,47,Veendam,4.330131e+06,1.033577e+06,0,9.679591e+06,9.933622e+05,0.000000e+00,0.000000e+00,0.0,...,0.712156,0.066034,0.000000,0.0,0.221811,0.000813,0.000286,0.000000,0.0,0.001504
4,50,Zeewolde,3.759252e+06,1.200270e+06,0,0.000000e+00,2.966842e+06,0.000000e+00,0.000000e+00,0.0,...,0.242013,0.598211,0.000000,0.0,0.159776,0.000091,0.000853,0.000000,0.0,0.000357
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
336,1979,Eemsdelta,8.762064e+06,5.182023e+08,0,1.669376e+07,1.165545e+06,7.831053e+04,3.194819e+06,0.0,...,0.989699,0.002157,0.000145,0.0,0.007999,0.040598,0.000335,0.000088,0.0,0.001949
337,1980,Dijk en Waard,1.425266e+07,3.453795e+06,0,0.000000e+00,8.728343e+06,0.000000e+00,0.000000e+00,0.0,...,0.195058,0.492947,0.000000,0.0,0.311995,0.000262,0.002509,0.000000,0.0,0.002490
338,1982,Land van Cuijk,4.405887e+07,3.952111e+05,0,0.000000e+00,8.609119e+06,1.131161e+07,7.381568e+04,0.0,...,0.008905,0.193985,0.254879,0.0,0.542230,0.000030,0.002475,0.012773,0.0,0.010847
339,1991,Maashorst,1.591028e+07,3.827032e+05,0,2.482516e+06,1.003492e+07,1.409590e+05,2.952627e+05,0.0,...,0.155042,0.543008,0.007628,0.0,0.294322,0.000217,0.002884,0.000159,0.0,0.002452


In [75]:
df_other_industry_combined_final = df_other_industry_combined[['GEMEENTECODE', 'GEMEENTENAAM','chemical_other_co2_scaled','food_co2_scaled', 'paper_co2_scaled', 'metal_other_co2_scaled', 'industry_other_co2_scaled']]
df_other_industry_combined_final

Unnamed: 0,GEMEENTECODE,GEMEENTENAAM,chemical_other_co2_scaled,food_co2_scaled,paper_co2_scaled,metal_other_co2_scaled,industry_other_co2_scaled
0,14,Groningen,0.000280,0.042080,0.009533,0.0,0.003842
1,34,Almere,0.000153,0.002102,0.000283,0.0,0.002355
2,37,Stadskanaal,0.000042,0.000407,0.000000,0.0,0.001295
3,47,Veendam,0.000813,0.000286,0.000000,0.0,0.001504
4,50,Zeewolde,0.000091,0.000853,0.000000,0.0,0.000357
...,...,...,...,...,...,...,...
336,1979,Eemsdelta,0.040598,0.000335,0.000088,0.0,0.001949
337,1980,Dijk en Waard,0.000262,0.002509,0.000000,0.0,0.002490
338,1982,Land van Cuijk,0.000030,0.002475,0.012773,0.0,0.010847
339,1991,Maashorst,0.000217,0.002884,0.000159,0.0,0.002452


#### Variables for agriculture

In [76]:

# Definieer de emissieoorzaken voor aardgasverbruik landbouw
agriculture_gas_wkk = [
    'Aardgasverbruik landbouw (WKK)'
]

agriculture_gas_niet_wkk = [
    'Aardgasverbruik landbouw (niet WKK)'
]

# Filter op aardgasverbruik landbouw WKK
df_agriculture_wkk = df_emissie_registratie[
    df_emissie_registratie['PROCES_OMSCHRIJVING'].isin(agriculture_gas_wkk)
].reset_index()[['GEMEENTECODE', 'GEMEENTENAAM', 'EMISSIE_KG']].copy()

# Filter op aardgasverbruik landbouw niet-WKK  
df_agriculture_niet_wkk = df_emissie_registratie[
    df_emissie_registratie['PROCES_OMSCHRIJVING'].isin(agriculture_gas_niet_wkk)
].reset_index()[['GEMEENTECODE', 'GEMEENTENAAM', 'EMISSIE_KG']].copy()

# Groepeer WKK data per gemeente en behoud Gebied
df_wkk_grouped = df_agriculture_wkk.groupby(['GEMEENTECODE', 'GEMEENTENAAM'])['EMISSIE_KG'].sum().reset_index()
df_wkk_grouped.columns = ['GEMEENTECODE', 'GEMEENTENAAM', 'Aardgasverbruik landbouw (WKK)']

# Groepeer niet-WKK data per gemeente en behoud Gebied
df_niet_wkk_grouped = df_agriculture_niet_wkk.groupby(['GEMEENTECODE', 'GEMEENTENAAM'])['EMISSIE_KG'].sum().reset_index()
df_niet_wkk_grouped.columns = ['GEMEENTECODE', 'GEMEENTENAAM', 'Aardgasverbruik landbouw (niet WKK)']

# Voeg beide datasets samen
df_agriculture_combined = df_wkk_grouped.merge(
    df_niet_wkk_grouped, 
    on=['GEMEENTECODE', 'GEMEENTENAAM'], 
    how='outer'
).fillna(0)

# Bereken de relatieve aandelen volgens Excel formules
# agriculture_gas_chp_relative = ROUND(IFERROR(WKK/SUM(WKK:Niet_WKK),0), 3)

# Bereken totaal aardgasverbruik per gemeente
total_gas = (df_agriculture_combined['Aardgasverbruik landbouw (WKK)'] + 
            df_agriculture_combined['Aardgasverbruik landbouw (niet WKK)'])

# Bereken WKK relatief aandeel met IFERROR logica (geeft 0 als deling door 0)
wkk_relative = np.where(
    total_gas > 0,
    df_agriculture_combined['Aardgasverbruik landbouw (WKK)'] / total_gas,
    0
)

# Round naar 3 decimalen (zoals in Excel formule)
df_agriculture_combined['agriculture_gas_chp_relative'] = np.round(wkk_relative, 3)

# agriculture_gas_final_demand_relative = 1 - agriculture_gas_chp_relative
df_agriculture_combined['agriculture_gas_final_demand_relative'] = (
    1 - df_agriculture_combined['agriculture_gas_chp_relative']
)

# Het resultaat df_agriculture_combined bevat nu dezelfde kolommen en berekeningen
# als de agriculture_natural_gas tab in de Excel file
print("Agriculture Natural Gas analyse voltooid!")
print(f"Aantal gemeenten: {len(df_agriculture_combined)}")
print(f"Totaal WKK gasverbruik: {df_agriculture_combined['Aardgasverbruik landbouw (WKK)'].sum():,.0f}")
print(f"Totaal niet-WKK gasverbruik: {df_agriculture_combined['Aardgasverbruik landbouw (niet WKK)'].sum():,.0f}")
print("\nEerste 5 rijen:")
print(df_agriculture_combined.head())

# Controleer of de som van relatieve aandelen gelijk is aan 1 (behalve waar totaal 0 is)
check_sum = (df_agriculture_combined['agriculture_gas_chp_relative'] + 
            df_agriculture_combined['agriculture_gas_final_demand_relative'])
total_gas_check = (df_agriculture_combined['Aardgasverbruik landbouw (WKK)'] + 
                  df_agriculture_combined['Aardgasverbruik landbouw (niet WKK)'])

print(f"\nControle: Aantal rijen waar relatieve aandelen niet optellen tot 1 (exclusief 0-totalen): {len(check_sum[(total_gas_check > 0) & (np.abs(check_sum - 1) > 0.001)])}")

Agriculture Natural Gas analyse voltooid!
Aantal gemeenten: 339
Totaal WKK gasverbruik: 4,479,645,868
Totaal niet-WKK gasverbruik: 1,854,836,539

Eerste 5 rijen:
   GEMEENTECODE GEMEENTENAAM  Aardgasverbruik landbouw (WKK)  \
0            14    Groningen                    1.551009e+05   
1            34       Almere                    1.700638e+07   
2            37  Stadskanaal                    9.335328e+03   
3            47      Veendam                    0.000000e+00   
4            50     Zeewolde                    0.000000e+00   

   Aardgasverbruik landbouw (niet WKK)  agriculture_gas_chp_relative  \
0                         2.696024e+06                         0.054   
1                         1.309482e+05                         0.992   
2                         1.856516e+06                         0.005   
3                         1.333183e+06                         0.000   
4                         6.594316e+06                         0.000   

   agriculture_gas_f

In [77]:
df_agriculture_combined_final = df_agriculture_combined[['GEMEENTECODE', 'GEMEENTENAAM', 'agriculture_gas_chp_relative',
                                                              'agriculture_gas_final_demand_relative' ]]

In [78]:

# Properly combine all dataframes using merge operations
# Start with the first dataframe as the base
er_final_demand_data_combined = df_other_industry_combined_final.copy()

# Add the Code_gebied column as the merge key for all subsequent merges
dataframes_to_merge = [
    df_aluminium_co2_scaled_final,
    df_fertilizers_co2_scaled_final,
    df_refineries_co2_scaled_final,
    df_steel_co2_scaled_final,
  
    df_agriculture_combined_final
]

# Merge each dataframe on 'GEMEENTECODE'
for df in dataframes_to_merge:
    er_final_demand_data_combined = er_final_demand_data_combined.merge(
        df, 
        on='GEMEENTECODE', 
        how='outer',  # Use outer join to keep all municipalities
        suffixes=('', '_dup')  # Handle duplicate column names
    )

# Clean up duplicate 'Gebied' columns if they exist
gebied_columns = [col for col in er_final_demand_data_combined.columns if col.startswith('GEMEENTENAAM')]
if len(gebied_columns) > 1:
    # Keep the first 'Gebied' column and drop the duplicates
    columns_to_drop = [col for col in gebied_columns if col != 'GEMEENTENAAM']
    er_final_demand_data_combined = er_final_demand_data_combined.drop(columns=columns_to_drop)

# Fill any NaN values with 0 (for municipalities that don't have certain industry types)
er_final_demand_data_combined = er_final_demand_data_combined.fillna(0)


In [79]:
er_final_demand_data_combined = er_final_demand_data_combined.rename(columns=
    {
        'df_aluminium_co2_scaled': 'aluminium_co2_scaled',
        'df_fertilizers_co2_scaled': 'fertilizers_co2_scaled',
        'df_refineries_co2_scaled': 'refineries_co2_scaled',
        'df_steel_co2_scaled': 'steel_co2_scaled',
}
)

In [80]:
er_final_demand_data_combined

Unnamed: 0,GEMEENTECODE,GEMEENTENAAM,chemical_other_co2_scaled,food_co2_scaled,paper_co2_scaled,metal_other_co2_scaled,industry_other_co2_scaled,index,aluminium_co2_scaled,index_dup,fertilizers_co2_scaled,index_dup.1,refineries_co2_scaled,index_dup.2,steel_co2_scaled,agriculture_gas_chp_relative,agriculture_gas_final_demand_relative
0,14,Groningen,0.000280,0.042080,0.009533,0.0,0.003842,0.0,0.000000,0.0,0.0,110,6.550055e-06,0.0,0.0,0.054,0.946
1,34,Almere,0.000153,0.002102,0.000283,0.0,0.002355,0.0,0.000000,0.0,0.0,10,6.139524e-06,0.0,0.0,0.992,0.008
2,37,Stadskanaal,0.000042,0.000407,0.000000,0.0,0.001295,0.0,0.000000,0.0,0.0,262,8.574373e-07,0.0,0.0,0.005,0.995
3,47,Veendam,0.000813,0.000286,0.000000,0.0,0.001504,0.0,0.000000,0.0,0.0,289,7.631889e-07,0.0,0.0,0.000,1.000
4,50,Zeewolde,0.000091,0.000853,0.000000,0.0,0.000357,0.0,0.000000,0.0,0.0,333,6.464988e-07,0.0,0.0,0.000,1.000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
339,1982,Land van Cuijk,0.000030,0.002475,0.012773,0.0,0.010847,31.0,0.000809,0.0,0.0,157,2.503773e-06,0.0,0.0,0.240,0.760
340,1991,Maashorst,0.000217,0.002884,0.000159,0.0,0.002452,35.0,0.003237,0.0,0.0,177,1.621228e-06,0.0,0.0,0.184,0.816
341,1992,Voorne aan Zee,0.000019,0.009624,0.000150,0.0,0.001398,0.0,0.000000,0.0,0.0,301,2.031166e-06,0.0,0.0,0.996,0.004
342,9997,0,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.0,29,4.934270e-09,0.0,0.0,1.000,0.000


In [81]:
er_final_demand_data_combined

Unnamed: 0,GEMEENTECODE,GEMEENTENAAM,chemical_other_co2_scaled,food_co2_scaled,paper_co2_scaled,metal_other_co2_scaled,industry_other_co2_scaled,index,aluminium_co2_scaled,index_dup,fertilizers_co2_scaled,index_dup.1,refineries_co2_scaled,index_dup.2,steel_co2_scaled,agriculture_gas_chp_relative,agriculture_gas_final_demand_relative
0,14,Groningen,0.000280,0.042080,0.009533,0.0,0.003842,0.0,0.000000,0.0,0.0,110,6.550055e-06,0.0,0.0,0.054,0.946
1,34,Almere,0.000153,0.002102,0.000283,0.0,0.002355,0.0,0.000000,0.0,0.0,10,6.139524e-06,0.0,0.0,0.992,0.008
2,37,Stadskanaal,0.000042,0.000407,0.000000,0.0,0.001295,0.0,0.000000,0.0,0.0,262,8.574373e-07,0.0,0.0,0.005,0.995
3,47,Veendam,0.000813,0.000286,0.000000,0.0,0.001504,0.0,0.000000,0.0,0.0,289,7.631889e-07,0.0,0.0,0.000,1.000
4,50,Zeewolde,0.000091,0.000853,0.000000,0.0,0.000357,0.0,0.000000,0.0,0.0,333,6.464988e-07,0.0,0.0,0.000,1.000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
339,1982,Land van Cuijk,0.000030,0.002475,0.012773,0.0,0.010847,31.0,0.000809,0.0,0.0,157,2.503773e-06,0.0,0.0,0.240,0.760
340,1991,Maashorst,0.000217,0.002884,0.000159,0.0,0.002452,35.0,0.003237,0.0,0.0,177,1.621228e-06,0.0,0.0,0.184,0.816
341,1992,Voorne aan Zee,0.000019,0.009624,0.000150,0.0,0.001398,0.0,0.000000,0.0,0.0,301,2.031166e-06,0.0,0.0,0.996,0.004
342,9997,0,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.0,29,4.934270e-09,0.0,0.0,1.000,0.000


### Export

In [82]:
# Export the final dataframe to a CSV file
sep = ','
er_final_demand_data_combined.to_csv("data/intermediate/ER_final_demand_data_combined.csv", sep=sep, index=False)

print("Final demand data combined and exported successfully!")
print(f"Total rows in final dataframe: {len(er_final_demand_data_combined)}")
print(f"Columns in final dataframe: {list(er_final_demand_data_combined.columns)}")

Final demand data combined and exported successfully!
Total rows in final dataframe: 344
Columns in final dataframe: ['GEMEENTECODE', 'GEMEENTENAAM', 'chemical_other_co2_scaled', 'food_co2_scaled', 'paper_co2_scaled', 'metal_other_co2_scaled', 'industry_other_co2_scaled', 'index', 'aluminium_co2_scaled', 'index_dup', 'fertilizers_co2_scaled', 'index_dup', 'refineries_co2_scaled', 'index_dup', 'steel_co2_scaled', 'agriculture_gas_chp_relative', 'agriculture_gas_final_demand_relative']


### General

## ETM queries of national dataset
Some ETLocal keys require information that the ETM calculates for an 'empty' national scenario. That means running Sandbox queries on the ETM.

The required queries can be found in the .txt files in `pipelines/input_queries_for_national_dataset`. The ETM output (for the future year or present, should not matter) should be stored in `pipelines/data/raw/...`
* `query_dump_2023_edge_output.txt`
* `D_ETM_demand_query_dump_nl_2023.csv`. **NOTE**: Paste the output in the csv file (in e.g. Excel) to import it correctly in this notebook. TODO: use the src.helper.parse_query_dump_to_dataframe method instead.
* `query_dump_2023_miscellaneous_output.txt`



### Extract

#### Edge queries

In [13]:
path_etm_edge = Path("data", "raw", "query_dump_2023_edge_output.txt")
df_etm_edge = src.helper.parse_query_dump_to_dataframe(path_etm_edge)

# Preview the result
print(f"Parsed {len(df_etm_edge)} query-value pairs")
df_etm_edge.head()


Parsed 3 query-value pairs


Unnamed: 0,query,value
0,rail_electricity_freight_train_electricity,0.101759
1,rail_electricity_passenger_train_electricity,0.820453
2,rail_electricity_tram_electricity,0.077788


#### Demand queries

In [15]:
# Import D_ETM_demand_query_dump_nl_2023 from raw folder
path_etm_demand = Path("data", "raw", f"D_ETM_demand_query_dump_nl_2023.csv")
df_etm_demand = pd.read_csv(path_etm_demand, sep=",")
df_etm_demand.head()

Unnamed: 0,key,demand,unit
0,agriculture_burner_crude_oil,2824.287,tj
1,agriculture_burner_hydrogen,0.0,tj
2,agriculture_burner_network_gas,14315.169388,tj
3,agriculture_burner_wood_pellets,2993.334409,tj
4,agriculture_chp_engine_biogas,3611.21129,tj


#### Miscellaneous queries

In [16]:
path_etm_misc = Path("data", "raw", "query_dump_2023_miscellaneous_output.txt")
df_etm_misc = src.helper.parse_query_dump_to_dataframe(path_etm_misc)

# Preview the result
print(f"Parsed {len(df_etm_misc)} query-value pairs")
df_etm_misc


Parsed 528 query-value pairs


Unnamed: 0,query,value
0,lv_net_total_costs_present,8.388472e+08
1,lv_net_costs_per_capacity_step,1.832000e+05
2,mv_net_total_costs_present,7.208843e+08
3,mv_net_costs_per_capacity_step,1.380000e+06
4,hv_net_total_costs_present,4.236178e+08
...,...,...
523,input_percentage_of_mt_steam_hot_water_househo...,0.000000e+00
524,input_percentage_of_ht_steam_hot_water_househo...,1.000000e+00
525,input_percentage_of_central_mt_steam_hot_water...,0.000000e+00
526,input_percentage_of_central_ht_steam_hot_water...,4.227633e-02


### Transform

We first add prefixes to the various queries and variables to denote their origin. This is (hopefully) helpful in the sector notebooks when comnbining internal variables to calculate ETLocal keys.
Also, these prefixes are present in the YAML files at the moment, so they are necessary for the YAML Calculator.


In [17]:
# Add prefix eq to the values of the first column of the edge query dataframe
df_etm_edge_transformed = df_etm_edge.copy()
df_etm_edge_transformed['query'] = 'eq_' + df_etm_edge_transformed['query']

# Add prefix dm to the values of the first column of the demand query dataframe
df_etm_demand_transformed = df_etm_demand.copy()
df_etm_demand_transformed['key'] = 'dq_' + df_etm_demand_transformed['key']

# Add prefix mq to the values of the first column of the misc query dataframe
df_etm_misc_transformed = df_etm_misc.copy()
df_etm_misc_transformed['query'] = 'mq_' + df_etm_misc_transformed['query']

We then combine the three separate query dataframes into one.

In [18]:
# The ETM demand query dataframe refers to the keys of the ETM graph nodes
# For legibility we rename the columns to 'query' and 'value'
df_etm_demand_transformed.rename(columns={'key': 'query', 'demand': 'value'}, inplace=True)

# The ETM edge query dataframe calculates three parameters
# Also rename the 'parameter' column to 'query' for legibility
df_etm_edge_transformed.rename(columns={'parameter': 'query'}, inplace=True)

# Combine the three dataframes into one
df_etm_combined = pd.concat([df_etm_demand_transformed, df_etm_edge_transformed, df_etm_misc_transformed], axis=0, ignore_index=True)

# Add area code column as first column to make clear that these are based on the national dataset
df_etm_combined.insert(0, 'geo_id', pd.NA)
# Fill the geo_id column with the parent and year_etm
df_etm_combined['geo_id'] = f"{parent}{year_etm}"
df_etm_combined

Unnamed: 0,geo_id,query,value,unit
0,nl2023,dq_agriculture_burner_crude_oil,2824.287000,tj
1,nl2023,dq_agriculture_burner_hydrogen,0.000000,tj
2,nl2023,dq_agriculture_burner_network_gas,14315.169388,tj
3,nl2023,dq_agriculture_burner_wood_pellets,2993.334409,tj
4,nl2023,dq_agriculture_chp_engine_biogas,3611.211290,tj
...,...,...,...,...
1712,nl2023,mq_input_percentage_of_mt_steam_hot_water_hous...,0.000000,
1713,nl2023,mq_input_percentage_of_ht_steam_hot_water_hous...,1.000000,
1714,nl2023,mq_input_percentage_of_central_mt_steam_hot_wa...,0.000000,
1715,nl2023,mq_input_percentage_of_central_ht_steam_hot_wa...,0.042276,


### Export

In [19]:
# Export the combined query dataframe to the intermediate folder
df_etm_combined.to_csv("data/intermediate/etm_query_combined.csv", sep=sep, index=False)


## Data analysis miscellaneous sources
Here we import the analysis of miscellaneous data sources. These mostly consist of distribution keys to distribute e.g. the train traffic of the Netherlands to the municipalities. 
The data analysis can be found in the `Miscelaneous analysis (Excel) folder`. The CSV imported here is taken from the Excel file `step_C.xlsx`, tab `miscellaneous_data_analysis`.

### Extract

In [184]:
# Import C_miscellaneous_data_analysis from raw folder
path_misc = Path("data", "raw", f"C_miscellaneous_data_analysis.csv")
df_misc = pd.read_csv(path_misc, sep=sep, index_col=0)
df_misc.head()

Unnamed: 0_level_0,Gemeentenaam,train_share_in_nl,has_tram_metro,buildings_total,arable_land_km2,number_of_vans_data,number_of_trucks_data,number_of_busses_data,greenhouse_scaled,arable_land_scaled,...,dry_biomass_potential,oily_biomass_potential,number_of_cars_data,total_land_area,coast_line,households_roof_pv_potential,buildings_roof_pv_potential,Unnamed: 19,Unnamed: 20,Unnamed: 21
GemeentecodeGM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
GM1680,Aa en Hunze,0.0,0,3169.492385,187.488427,1826,101,3,0.0,0.008406,...,0.206609,0.015424,15568,276.060323,0.0,1.62186,1.151709,,,
GM0358,Aalsmeer,0.0,0,3228.845692,7.63299,1963,181,17,0.011069,0.000342,...,0.083723,0.019277,17396,20.098982,0.0,1.678781,2.585545,,,
GM0197,Aalten,0.000606,0,2507.548923,80.464678,1763,91,2,0.0,0.003608,...,0.117793,0.016411,14912,96.530352,0.0,1.14036,1.670153,,,
GM0059,Achtkarspelen,0.0,0,2582.520923,85.163223,2554,152,40,8.1e-05,0.003818,...,0.150534,0.016922,15604,102.205957,0.0,1.289278,1.40414,,,
GM0482,Alblasserdam,0.0,0,721.365538,2.917326,1210,83,28,0.0,0.000131,...,0.051743,0.012193,9531,8.772639,0.0,0.455858,0.633341,,,


### Transform
No data transformation required at present

### Export

In [185]:
# Export the miscellaneous data analysis dataframe to the intermediate folder
df_misc.to_csv("data/intermediate/miscellaneous_data_analysis.csv", sep=sep, index=True)

## Fill in empty etlocal template with national values

In [11]:
Path_template = Path("data","intermediate","etlocal_template_empty_2025.csv")
Path_mmisc_queries = Path("data","intermediate","etm_query_combined.csv")
Path_etm_misc = Path("data","raw", "national_dataset_queries.csv")
Path_filled_template = Path("data","intermediate","etlocal_template_for_sector_notebooks.csv")

In [12]:
empty_etlocal_template = pd.read_csv(Path_template)

df_etm_query_combined = pd.read_csv(Path_mmisc_queries,sep=',')

df_etm_misc = pd.read_csv(Path_etm_misc)

In [13]:
# Merge the dataframes to get values for queries in df_etm_query_combined
merged_df = df_etm_misc.merge(df_etm_query_combined, on='query', how='left')
merged_df

Unnamed: 0,key,query,geo_id,value,unit
0,transport_final_demand_for_rail_electricity_tr...,mq_transport_final_demand_for_rail_electricity...,nl2023,0.101759,
1,transport_final_demand_for_rail_electricity_tr...,mq_transport_final_demand_for_rail_electricity...,nl2023,0.077788,
2,transport_final_demand_for_rail_electricity_tr...,mq_transport_final_demand_for_rail_electricity...,nl2023,0.820453,
3,transport_rail_mixer_diesel_transport_freight_...,mq_transport_rail_mixer_diesel_transport_freig...,nl2023,0.518162,
4,transport_rail_mixer_diesel_transport_passenge...,mq_transport_rail_mixer_diesel_transport_passe...,nl2023,0.481838,
...,...,...,...,...,...
441,present_share_of_terraced_houses_before_1945_i...,mq_present_share_of_terraced_houses_before_194...,nl2023,0.042532,
442,present_share_of_terraced_houses_1945_1964_in_...,mq_present_share_of_terraced_houses_1945_1964_...,nl2023,0.039437,
443,present_share_of_terraced_houses_1965_1984_in_...,mq_present_share_of_terraced_houses_1965_1984_...,nl2023,0.112216,
444,present_share_of_terraced_houses_1985_2004_in_...,mq_present_share_of_terraced_houses_1985_2004_...,nl2023,0.076117,


In [14]:
filled_etlocal_template = empty_etlocal_template.copy()

In [15]:
# Create a mapping from key to value from merged_df
key_value_mapping = merged_df.set_index('key')['value'].to_dict()

# Fill the empty template with values from merged_df where keys match
# Also set commit message for rows that get filled

filled_etlocal_template['value'] = empty_etlocal_template['key'].map(key_value_mapping)
filled_etlocal_template.loc[filled_etlocal_template['value'].notna(), 'commit'] = 'Taken from the National 2023 dataset (ETM)'

  filled_etlocal_template.loc[filled_etlocal_template['value'].notna(), 'commit'] = 'Taken from the National 2023 dataset (ETM)'


In [16]:
filled_etlocal_template.to_csv(Path_filled_template, index=False)