## Probabilities of Modes of Transportation - Preprocessing

This notebook preprocesses data on how the choice of transport mode varies with trip length, for the pharmalink project. \
The goal is a simple matrix with two axes: trip length intervals and modes of transport. \
Each cell holds the probability that a given mode of transport is used for a trip whose length falls within the interval. \
The matrix is used to assign modes of transportation to customers in the trip cost calculation.
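
How the matrix might be consumed can be sketched as follows. This is a minimal illustration, not the project's actual trip cost code: the helper names `mode_probabilities` and `sample_mode` are hypothetical, and the values are copied (rounded) from the matrix this notebook produces.

```python
import bisect
import random

# Interval breaks and probability rows, copied from this notebook's result.
BREAKS = [0, 0.5, 1, 2, 5, 10, 20, 50, 100, float("inf")]
ROWS = [
    {"auto": 0.123711, "bicycle": 0.092784, "pedestrian": 0.783505},
    {"auto": 0.301075, "bicycle": 0.182796, "pedestrian": 0.516129},
    {"auto": 0.477273, "bicycle": 0.215909, "pedestrian": 0.306818},
    {"auto": 0.679012, "bicycle": 0.160494, "pedestrian": 0.160494},
    {"auto": 0.855263, "bicycle": 0.092105, "pedestrian": 0.052632},
    {"auto": 0.948052, "bicycle": 0.038961, "pedestrian": 0.012987},
    {"auto": 0.961039, "bicycle": 0.025974, "pedestrian": 0.012987},
    {"auto": 0.972603, "bicycle": 0.027397, "pedestrian": 0.0},
    {"auto": 1.0, "bicycle": 0.0, "pedestrian": 0.0},
]


def mode_probabilities(trip_km):
    """Return the row whose left-closed interval [lo, hi) contains trip_km."""
    if trip_km < BREAKS[0]:
        raise ValueError("trip length must be non-negative")
    # bisect_right counts the breaks <= trip_km; minus 1 gives the row position.
    position = bisect.bisect_right(BREAKS, trip_km) - 1
    # Clamp so that an infinite trip length still maps to the last row.
    return ROWS[min(position, len(ROWS) - 1)]


def sample_mode(trip_km, rng=random):
    """Draw one mode of transport at random, weighted by the matrix row."""
    row = mode_probabilities(trip_km)
    modes = list(row)
    return rng.choices(modes, weights=[row[m] for m in modes], k=1)[0]
```

Because the intervals are closed on the left, a trip of exactly 0.5 km falls into `[0.5, 1.0)`. Passing a seeded `random.Random` instance as `rng` makes the draws reproducible.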

### Source: [Mobilität in Deutschland - MiD 2017](https://www.mobilitaet-in-deutschland.de/archive/index.html) 
MiD is a large-scale study of mobility behaviour in Germany, last conducted in 2017. \
A newer edition, MiD 2023, is still being processed and is not yet available for public use.

© infas, DLR, IVT und infas 360 (2018): Mobilität in Deutschland (im Auftrag des BMVI)

##### Data: [Mobilität in Tabellen - MiT 2017](https://mobilitaet-in-tabellen.dlr.de)
A web app for building tables from the MiT 2017 datasets.

Unfortunately, the table used here cannot be downloaded as-is; it has to be recreated: \
Click "Start MiT", select "Wege" for "Auswertungsebene (Deutschland)", \
then "Verkehrsmittelnutzung am Stichtag" - "Hauptverkehrsmittel (differenziert)" as "Zeile(...)", \
followed by "Entfernung" - "Wegelänge [km] in Gruppen" as "Spalte".

Finally, click "Export" and place the resulting .xlsx spreadsheet in the same folder as this notebook to continue the preprocessing. The code below expects the file to be named `output.xlsx`.

The source was last accessed on 2024-09-10.

In [1]:
import pathlib as path
import pandas as pd
import json
import re
import warnings

In [2]:
# Resolve the working directory so that relative paths in the notebook work
notebook_path = path.Path().resolve()

# Compare the directory name (not the stem, which would cut off anything after a dot)
if notebook_path.name != "transport_modes":
    raise RuntimeError(
        "The working directory must be the notebook's parent directory ('transport_modes'). Please fix this and re-run."
    )

In [3]:
# Read the data from the Excel file and clean it up.
# openpyxl emits a UserWarning because the workbook sets no default cell style; suppress it while reading.

probabilities_file = notebook_path.joinpath("output.xlsx")

with warnings.catch_warnings():
    warnings.filterwarnings(
        "ignore", category=UserWarning, module=re.escape("openpyxl.styles.stylesheet")
    )

    probs = pd.read_excel(
        io=probabilities_file,
        engine="openpyxl",
        sheet_name="MiD Tabellen",
        skiprows=5,
        header=0,
        index_col=0,
        na_values=["-"],
    )

probs = probs.dropna(how="all", axis=0)
probs = probs.drop(index=["Basis ungewichtet", "Basis gewichtet"])
probs = probs.drop(columns=[" "])

# The single "-" cell was read as NaN (see na_values); set it to "0 %" so the float conversion below works
probs.iloc[0, 8] = "0 %"

# Strip the percent sign, convert to float and divide by 100 to get fractions between 0 and 1
for column in probs.columns:
    probs[column] = probs[column].str.rstrip(" %").astype("float") / 100.0

# Rename the rows (modes of transport) to English identifiers
new_labels = {
    "Pkw (Fahrer)": "car_driver",
    "Pkw (Mitfahrer)": "car_passenger",
    "Motorrad/Moped/Mofa": "motorcycle",
    "Taxi": "taxi",
    "Fahrrad": "bicycle",
    "zu Fuß": "pedestrian",
}
probs = probs.rename(index=new_labels)

# Combine all motor-vehicle rows into one (car driver, car passenger, motorcycle, taxi)
probs.loc["auto"] = (
    probs.loc["car_driver"]
    + probs.loc["car_passenger"]
    + probs.loc["motorcycle"]
    + probs.loc["taxi"]
)

# Keep only the rows auto, bicycle and pedestrian; they become columns after the transpose below
probs = probs.loc[["auto", "bicycle", "pedestrian"]]

probs = probs.T

# Replace string-based index with interval-based equivalent
breaks = [0, 0.5, 1, 2, 5, 10, 20, 50, 100, float("inf")]

index = pd.IntervalIndex.from_breaks(breaks, closed="left", name="distance")

probs.index = index

# Normalize each row so that the three remaining modes sum to 1 (the dropped modes left a gap)
probs = probs.div(probs.agg(axis=1, func="sum"), axis=0)
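
The normalization is needed because the rows dropped above take their share of trips with them, so the three remaining values no longer add up to 1 within an interval. A worked sketch in plain Python for a single distance interval; the shares and the `other` entry are made-up illustrative numbers, not MiD values:

```python
# Hypothetical shares for one distance interval, as fractions of all trips.
shares = {
    "car_driver": 0.40,
    "car_passenger": 0.10,
    "motorcycle": 0.01,
    "taxi": 0.01,
    "bicycle": 0.15,
    "pedestrian": 0.13,
    "other": 0.20,  # stands in for every mode the notebook drops
}

# Merge the motor-vehicle modes into "auto", mirroring the cell above.
kept = {
    "auto": sum(
        shares[mode]
        for mode in ("car_driver", "car_passenger", "motorcycle", "taxi")
    ),
    "bicycle": shares["bicycle"],
    "pedestrian": shares["pedestrian"],
}

# The kept shares cover only about 80 % of trips, so rescale them to sum to 1.
total = sum(kept.values())
normalized = {mode: share / total for mode, share in kept.items()}
```

With these numbers, `auto` goes from about 0.52 of all trips to about 0.65 of the trips made by one of the three kept modes.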

In [4]:
output = {}

output["breaks"] = breaks
output["data"] = probs.to_dict(orient="records")

output_file = notebook_path.joinpath("transport_modes.json")

with output_file.open("w") as file:
    json.dump(output, file, indent=4)
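
One detail of this export: `breaks` ends with `float("inf")`, and Python's `json` module writes that value as the bare token `Infinity`, which is not part of the JSON standard. Python reads it back without trouble, but a strict parser in another language may reject the file. A minimal sketch of one possible workaround, assuming every consumer agrees to read `null` as "open-ended"; the helpers `encode_breaks` and `decode_breaks` are hypothetical and not part of this notebook:

```python
import json
import math

breaks = [0, 0.5, 1, 2, 5, 10, 20, 50, 100, float("inf")]

# Default behaviour: json.dumps emits the non-standard token Infinity.
default_text = json.dumps(breaks)


def encode_breaks(values):
    """Replace infinite breaks with None so the output is strict JSON."""
    return [None if math.isinf(value) else value for value in values]


def decode_breaks(values):
    """Reverse encode_breaks: None becomes positive infinity again."""
    return [float("inf") if value is None else value for value in values]


# allow_nan=False makes json.dumps raise ValueError if an inf or NaN slips through.
strict_text = json.dumps(encode_breaks(breaks), allow_nan=False)
restored = decode_breaks(json.loads(strict_text))
```

If the file is only ever read from Python, the default behaviour is sufficient and no change is needed.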

In [5]:
# Read the JSON file back and rebuild the DataFrame for a visual comparison with the original data
test_file = notebook_path.joinpath("transport_modes.json")

with open(test_file, "r") as file:
    data = json.load(file)

test_index = pd.IntervalIndex.from_breaks(
    data["breaks"], closed="left", name="distance"
)

# Use the index rebuilt from the file (test_index), not the in-memory one from the cell above
test_probs = pd.DataFrame(data["data"], index=test_index)

test_probs

| distance | auto | bicycle | pedestrian |
|---|---|---|---|
| [0.0, 0.5) | 0.123711 | 0.092784 | 0.783505 |
| [0.5, 1.0) | 0.301075 | 0.182796 | 0.516129 |
| [1.0, 2.0) | 0.477273 | 0.215909 | 0.306818 |
| [2.0, 5.0) | 0.679012 | 0.160494 | 0.160494 |
| [5.0, 10.0) | 0.855263 | 0.092105 | 0.052632 |
| [10.0, 20.0) | 0.948052 | 0.038961 | 0.012987 |
| [20.0, 50.0) | 0.961039 | 0.025974 | 0.012987 |
| [50.0, 100.0) | 0.972603 | 0.027397 | 0.0 |
| [100.0, inf) | 1.0 | 0.0 | 0.0 |
