# 📝 Notebook Documentation

**Notebook:** `31_scoring__2_get_raster_coeff_v05_germany.ipynb`  
**Description:**  
Computes accessibility scores for several transport modes (`bike`, `my_bike_cycleways`, `cargo_bike`) on a 100 m raster grid, based on hole-punched isochrones (isodonuts) and POI attributes.  
The method combines time weighting, spatial assignment, and category-based limiting of POIs into an intuitive, comparable scoring output.

---

### 📥 Input

- **Isodonut data per PLZ (postal code) and mode**  
  e.g. `{input_folder_isodon}/53925_*_isoDonuts_6x5min__bike_simp0002.parquet`
- **POI data** (loaded as `attr_all` from `data/attractions/attr_germany_all_shapes_25-05-12.parquet`) with:
  - Geometry
  - Score
  - Category (`cat`)
  - POI type (`poi_type`)

---

### 🔧 Processing Steps

1. **Per PLZ and mode:**
   - Load the corresponding isodonut file
   - Spatially join the isodonut zones with the POIs
   - Compute a weighted score (`coeff`) per POI based on its `time_bucket` (e.g. 5 min → 1.0, 10 min → 0.8, ...)

2. **Category-based limiting:**
   - Cap the number of POIs per category and POI type in each raster cell via `set_category_limits()` (e.g. at most 3 schools)
   - Select the most relevant POIs by `coeff`

3. **Aggregation:**
   - Sum the weighted scores (`coeff`) per raster cell (`id`) and mode

4. **Export:**
   - Save the aggregated scores per raster cell as CSV (`acc2raster_`)
   - Optional: save all scored POI assignments as `.parquet` for traceability (`pois_w_score_all`); this is only relevant for the interactive visualization
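
The time weighting in step 1 can be illustrated with a small, pandas-free sketch. The factor table mirrors the one used in `add_access_score` (5 min → 1.0 down to 30 min → 0.1); the helper name `weighted_coeff` and the sample scores are illustrative assumptions, not part of the notebook.

```python
# Time factors per isodonut ring, as used in add_access_score (minutes -> factor).
TIME_FACTORS = {5: 1.0, 10: 0.8, 15: 0.6, 20: 0.4, 25: 0.2, 30: 0.1}


def weighted_coeff(score, time_bucket):
    """Return score * time factor; unknown buckets contribute 0 (hypothetical helper)."""
    return score * TIME_FACTORS.get(time_bucket, 0.0)


# Illustrative values only.
near = weighted_coeff(2.5, 10)     # 2.5 * 0.8
far = weighted_coeff(0.5, 30)      # 0.5 * 0.1
outside = weighted_coeff(2.0, 35)  # bucket not in the table, so factor 0
```

The same POI therefore counts less the further out its ring lies, and a POI outside the six rings does not count at all.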

---

### 📤 Output

- `{output_folder_csv}/{plz}_acc2raster_.csv`  
  → Scoring results per raster cell and mode
- `{output_folder_csv}/{plz}_pois_w_score_all.parquet`  
  → POIs with full scoring details per PLZ and mode
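
A downstream consumer might combine the per-`attr` rows of the CSV into one total per raster cell and mode. The sketch below assumes the column layout implied by the export code (an unnamed index column, then `id`, `attr`, `coeff`, `mode`); the cell id and all values are made up for illustration.

```python
import csv
import io
from collections import defaultdict

# Hypothetical excerpt of a `{plz}_acc2raster_.csv` file; ids and values are invented.
sample = """,id,attr,coeff,mode
0,cell_a,attr_pois,4.5,bike
1,cell_a,pt_stops,1.6,bike
0,cell_a,attr_pois,5.0,cargo_bike
"""

# Sum coeff over the `attr` groups for each (raster cell, mode) pair.
totals = defaultdict(float)
for row in csv.DictReader(io.StringIO(sample)):
    totals[(row["id"], row["mode"])] += float(row["coeff"])
```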

---



In [2]:
import json
import pandas as pd
import geopandas as gpd
import os

import numpy as np

from tqdm import tqdm

In [3]:
# get all pois, calc scoring per raster and ...
attr_all= gpd.read_parquet("data/attractions/attr_germany_all_shapes_25-05-12.parquet")
attr_all

| | poi_index | name | cat | attr | score | geometry | osm_category | poi_type | stops_count |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Papa-Pizza | Freizeit und Kultur | attr_pois | 0.5 | POINT (6.94125 50.91559) | amenity | fast_food | |
| 1 | 1 | Hartis Cafe | Freizeit und Kultur | attr_pois | 0.5 | POINT (6.96393 50.9052) | amenity | restaurant | |
| 2 | 2 | Grundmühle | Freizeit und Kultur | attr_pois | 0.5 | POINT (13.65719 51.11308) | amenity | restaurant | |
| 3 | 3 | Shell | Versorgung (Lebensmittel) | attr_pois | 2.5 | POINT (13.64555 51.01484) | amenity | fuel | |
| 4 | 4 | Aral | Versorgung (Lebensmittel) | attr_pois | 2.5 | POINT (8.3896 48.99517) | amenity | fuel | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4919701 | 4919701 | Klieken, Schule | OePNV Haltepunkt | pt_stops | 1.5 | POINT (12.3765 51.89085) | | pt_stop | 14.0 |
| 4919702 | 4919702 | Coswig(Anh) | OePNV Haltepunkt | pt_stops | 2.0 | POINT (12.45762 51.888) | | pt_stop | 46.0 |
| 4919703 | 4919703 | Griebo | OePNV Haltepunkt | pt_stops | 2.0 | POINT (12.52298 51.88058) | | pt_stop | 39.0 |
| 4919704 | 4919704 | Klieken | OePNV Haltepunkt | pt_stops | 2.0 | POINT (12.37178 51.89453) | | pt_stop | 39.0 |


### PRODUCTION functions

In [5]:
# add scoring for each mode, applying the category limits


def set_category_limits(df):
    # Define how many rows to keep for each category
    category_row_limit = {
        "Bildung (Basis)": 3,
        "Freizeit und Kultur": 1,
        "Freizeit/Erholung Freiraum": 1,
        "OePNV Haltepunkt": 5, 
        "Versorgung (Einzelhandel und Dienstleistungen)": 1,
        "Versorgung (Lebensmittel)": 3,
        "Verwaltung/Behoerden": 1,
        "Gesundheit": 3,
        "Weiterfuehrende/offene Bildung": 2
    }

    # Get the total number of groups for tqdm progress bar
    num_groups = df.groupby(["id", "cat", "poi_type"]).ngroups
    # Initialize an empty list to collect filtered rows
    filtered_rows = []
   
    
    # Iterate over unique combinations of `id` (raster cell), `cat`, and `poi_type`
    for (gid, cat, poi_type), group in tqdm(df.groupby(["id", "cat", "poi_type"]), 
                                        total=num_groups, 
                                        desc="Processing groups"):
        # Get the limit for this category
        row_limit = category_row_limit.get(cat, 0)  # Default to 0 if category is not listed
        # row_limit = 99 for testing
        
        # Sort the group by `coeff` in descending order
        group_sorted = group.sort_values("coeff", ascending=False)

        # Select the top `row_limit` rows
        filtered_group = group_sorted.head(row_limit)

        # Store only row values instead of DataFrames
        filtered_rows.extend(filtered_group.values.tolist())
    
    # Convert raw list of rows to a DataFrame **only once** at the end
    #print("Concatenating filtered rows into final DataFrame...")
    dfsjoin_topN = pd.DataFrame(filtered_rows, columns=df.columns)
    
    return dfsjoin_topN

    
def spatial_join(gdf_iso, attr_all):
    """
    Perform spatial join on `gdf_iso` without chunking.
    
    Parameters:
    - gdf_iso (GeoDataFrame): Input GeoDataFrame to be joined.
    - attr_all (GeoDataFrame): Attribute GeoDataFrame for spatial join.
    
    Returns:
    - GeoDataFrame: Result of spatial join.
    """
    #print("___________")
    #print("Performing spatial join...")
    
    dfsjoin = gpd.sjoin(
        gdf_iso, 
        attr_all[['geometry', 'score', 'attr', 'cat', 'name', 'poi_index', 'poi_type']], 
        how="inner", 
        predicate="intersects"
    )
    
    #print("___________") 
    #print("Spatial join completed.")
    
    return dfsjoin


def add_access_score(gdf_iso, mode):
    #print ("___________") 
    #print("Start processing with spatial join...")
    # Spatial join of isodonut zones with POIs (not chunked, see spatial_join)
    dfsjoin = spatial_join(gdf_iso, attr_all)

    #print ("___________") 
    #print ("Adding coeff based on Score and time factors...") 
    conditions = [
        dfsjoin["time_bucket"] == 5,
        dfsjoin["time_bucket"] == 10,
        dfsjoin["time_bucket"] == 15,
        dfsjoin["time_bucket"] == 20,
        dfsjoin["time_bucket"] == 25,
        dfsjoin["time_bucket"] == 30
    ]
    time_factors = [1, 0.8, 0.6, 0.4, 0.2, 0.1]
    dfsjoin["coeff"] = np.select(conditions, time_factors, default=0) * dfsjoin["score"]

    #print ("___________") 
    #print ("Start with category_limits...")
    dfsjoin_cat_limit=set_category_limits(dfsjoin)

    #print ("___________") 
    #print ("Start grouping by IDs...")
    # Future: `attr` may no longer be needed here once PT stops and the other POIs are treated the same
    df_grouped_ID=dfsjoin_cat_limit.groupby(['id','attr'])['coeff'].sum().reset_index()
    
    df_grouped_ID['mode']=mode
    #df_grouped_ID['attr']=attr

    return df_grouped_ID, dfsjoin_cat_limit #for viz table
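
The selection rule in `set_category_limits` can be illustrated on plain dictionaries. This is a simplified, pandas-free sketch of the same idea, not the production function: the helper name `limit_per_category`, the reduced limit table, and the sample rows (including the `Unlisted` category) are assumptions made for illustration.

```python
from collections import defaultdict

# Subset of the limits from set_category_limits, for illustration.
CATEGORY_LIMITS = {"Bildung (Basis)": 3, "Freizeit und Kultur": 1}


def limit_per_category(rows, limits):
    """Keep the top-N rows by coeff for each (id, cat, poi_type) group.

    N comes from limits[cat]; categories that are not listed get 0 and are
    dropped, which mirrors the default in set_category_limits.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[(row["id"], row["cat"], row["poi_type"])].append(row)

    kept = []
    for (_, cat, _), members in groups.items():
        n = limits.get(cat, 0)
        members.sort(key=lambda r: r["coeff"], reverse=True)
        kept.extend(members[:n])
    return kept


# Hypothetical join result for a single raster cell.
rows = [
    {"id": "cell_a", "cat": "Freizeit und Kultur", "poi_type": "restaurant", "coeff": 0.5},
    {"id": "cell_a", "cat": "Freizeit und Kultur", "poi_type": "restaurant", "coeff": 0.4},
    {"id": "cell_a", "cat": "Freizeit und Kultur", "poi_type": "fast_food", "coeff": 0.3},
    {"id": "cell_a", "cat": "Unlisted", "poi_type": "other", "coeff": 9.0},
]
kept = limit_per_category(rows, CATEGORY_LIMITS)
cell_score = sum(r["coeff"] for r in kept)
```

Because the grouping key includes `poi_type`, the limit applies per POI type within a category: the cell keeps its best restaurant and its best fast-food place even though the limit for `Freizeit und Kultur` is 1. Rows from categories missing in the limit table are discarded regardless of their `coeff`.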

## Generate acc2raster

### What is done here?
#### Input:
* all attractions (OSM POIs and PT stops)
* PLZ areas for which isodonuts are fully available

#### Output:
* CSV storing the coeff for each profile and raster cell
* (Parquet `pois_w_score_all`, for visualization)

In [14]:

### set input and output folder

scenario_name="test_plz_88636"


#input_folder_isodon = "../../storage/isos_ger/isodon/"
input_folder_isodon  = f"isochronen/{scenario_name}/isodon/"

output_folder_csv = f"output/{scenario_name}/"




### get the plz that need to be done (isodon done)

In [15]:
import glob
import re
from collections import Counter

# Get all isodonut files in the input folder
files = glob.glob(input_folder_isodon+"*")

# Extract the PLZ from the start of each file name.
# Matching against the full path would pick up digits in folder names
# (e.g. the scenario name "test_plz_88636") instead of the file's PLZ.
plz_list = []
for file in files:
    match = re.match(r"(\d{5})_", os.path.basename(file))
    if match:
        plz_list.append(match.group(1))

# Count occurrences of each PLZ
plz_counts = Counter(plz_list)

# Filter PLZs that appear at least 3 times
valid_plz_isodon = [plz for plz, count in plz_counts.items() if count >= 3]

# Find PLZs that did not appear at least 3 times
invalid_plz = [plz for plz, count in plz_counts.items() if count < 3]

# Print results
print("PLZs that appeared at least 3 times: (show only the first 100)", valid_plz_isodon[:100])
print("PLZs that appeared at least 3 times, number:", len(valid_plz_isodon))
print("PLZs that did NOT appear 3 times:", invalid_plz)

PLZs that appeared at least 3 times: (show only the first 100) ['88636']
PLZs that appeared at least 3 times, number: 1
PLZs that did NOT appear 3 times: []
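
When extracting the PLZ, it matters whether the pattern is applied to the full path or only to the file name, because folder names such as `test_plz_88636` also contain five-digit numbers. The sketch below shows the difference; the paths and the helper `plz_from_path` are hypothetical, and the file-name layout is assumed to follow the `{plz}_*_isoDonuts_6x5min__{mode}_simp0002.parquet` pattern.

```python
import os
import re
from collections import Counter

MODES = ["bike", "cargo_bike", "my_bike_cycleways"]

# Hypothetical paths: the scenario folder contains a different 5-digit number
# than the files stored inside it.
paths = [
    f"isochronen/test_plz_88636/isodon/53925_25-03-15_isoDonuts_6x5min__{m}_simp0002.parquet"
    for m in MODES
]


def plz_from_path(path):
    """Return the leading 5-digit PLZ of the file name, or None (hypothetical helper)."""
    match = re.match(r"(\d{5})_", os.path.basename(path))
    return match.group(1) if match else None


# Searching the whole path finds the number in the folder name first.
naive = re.search(r"(\d{5})", paths[0]).group(1)

# Counting per file name: a PLZ is complete once all three modes are present.
counts = Counter(plz_from_path(p) for p in paths)
complete = [plz for plz, n in counts.items() if plz is not None and n >= len(MODES)]
```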


### check which plz have already been calculated

In [16]:
import glob
import re
from collections import Counter

# Get all scoring CSVs that already exist in the output folder
files = glob.glob(output_folder_csv + "*.csv")

# Extract the PLZ from the start of each file name (not from the folder path)
plz_list = []
for file in files:
    match = re.match(r"(\d{5})_", os.path.basename(file))
    if match:
        plz_list.append(match.group(1))

# Print results
print("PLZs that already have a scoring raster:", plz_list)
print("PLZs that already have a scoring raster, number:", len(plz_list))


PLZs that already have a scoring raster: []
PLZs that already have a scoring raster, number: 0


### get which plz have to be calculated

In [17]:
# PLZs with complete isodonuts that do not yet have a scoring CSV
plz_difference_to_calc = list(set(valid_plz_isodon) - set(plz_list))

In [18]:
len(plz_difference_to_calc)

1

In [19]:
def read_isodon_file(input_folder, plz, m):
    """
    Finds and reads a .parquet file for the given PLZ and m, ignoring the date in the filename.

    Parameters:
        input_folder (str): Path to the folder containing the files.
        plz (str): Postal code or identifier in the filename.
        m (str): Suffix used in the filename (e.g., a mode or version string).

    Returns:
        GeoDataFrame: The loaded GeoDataFrame from the matched parquet file.

    Raises:
        FileNotFoundError: If no file matching the pattern is found.
    """
    pattern = os.path.join(input_folder, f"{plz}_*_isoDonuts_6x5min__{m}_simp0002.parquet")
    matched_files = glob.glob(pattern)

    if matched_files:
        return gpd.read_parquet(matched_files[0])
    else:
        raise FileNotFoundError(f"No matching file found for pattern: {pattern}")
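
`read_isodon_file` returns the first match from `glob.glob`, whose order is not guaranteed, so the result is arbitrary if several dated files exist for the same PLZ and mode. One possible variant, sketched below, sorts the matches and takes the last one. It assumes the date in the file name is formatted as `YY-MM-DD` so that lexicographic order equals chronological order; the helper name `find_latest_isodon_file` and the placeholder files are hypothetical.

```python
import glob
import os
import tempfile


def find_latest_isodon_file(input_folder, plz, mode):
    """Return the lexicographically last matching path (hypothetical helper)."""
    pattern = os.path.join(input_folder, f"{plz}_*_isoDonuts_6x5min__{mode}_simp0002.parquet")
    matches = sorted(glob.glob(pattern))
    if not matches:
        raise FileNotFoundError(f"No matching file found for pattern: {pattern}")
    return matches[-1]


# Demonstration with empty placeholder files in a temporary folder.
with tempfile.TemporaryDirectory() as folder:
    for date in ("25-03-15", "25-05-12"):
        name = f"88636_{date}_isoDonuts_6x5min__bike_simp0002.parquet"
        open(os.path.join(folder, name), "w").close()
    latest = os.path.basename(find_latest_isodon_file(folder, "88636", "bike"))
```

The returned path could then be passed to `gpd.read_parquet` as in the original function.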

In [22]:


#plz = "50126"

## test one plz only:
#plz_difference_to_calc=["53925"]


# Ensure output directory exists
os.makedirs(output_folder_csv, exist_ok=True)

modes = ["bike", "cargo_bike", "my_bike_cycleways"]

# Total number of PLZs
total_plz = len(plz_difference_to_calc)

#for plz in valid_plz:
#for plz in plz_difference_to_calc:
#    print (plz)
for idx, plz in enumerate(plz_difference_to_calc, 1):
    print(f"\nProcessing PLZ {idx}/{total_plz}: {plz}")
    
    acc2raster_modes=pd.DataFrame()
    pois_w_score_all=pd.DataFrame()
    for m in modes:
        print ("Processing", m, "...")
        #isodons= gpd.read_parquet(input_folder_isodon+plz+"_25-03-15_isoDonuts_6x5min__"+m+"_simp0002.parquet")
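        try:
            isodons = read_isodon_file(input_folder_isodon, plz, m)
        except FileNotFoundError as e:
            # Skip this mode: without `continue`, `isodons` would be undefined
            # or still hold the previous mode's data.
            print(e)
            continue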

        #isodons=isodons[:30].copy()
        acc2raster_mode, pois_w_score=add_access_score(isodons, m)
        pois_w_score["mode"]=m
        pois_w_score["plz"]=plz
        pois_w_score=pois_w_score[["plz","id","mode","poi_index","time_bucket","name","cat","score","coeff","poi_type"]].copy()
                
        acc2raster_modes=pd.concat([acc2raster_modes, acc2raster_mode])
        pois_w_score_all=pd.concat([pois_w_score_all, pois_w_score])
    
    acc2raster_modes.to_csv(output_folder_csv+plz+"_acc2raster_.csv")
    pois_w_score_all.to_parquet(output_folder_csv+plz+"_pois_w_score_all.parquet")
    print ("####__________####")
    print (" ")
    


Processing PLZ 1/1: 88636
Processing bike ...


Processing groups: 100%|██████████| 10028/10028 [00:06<00:00, 1454.12it/s]


Processing cargo_bike ...


Processing groups: 100%|██████████| 11921/11921 [00:08<00:00, 1445.57it/s]


Processing my_bike_cycleways ...


Processing groups: 100%|██████████| 9020/9020 [00:06<00:00, 1456.60it/s]


####__________####
 
