# 01_Datenaufbereitung: Data Preparation (CSV → Parquet)

This notebook:
- creates a **SparkSession**
- defines **schemas** for hub/spoke
- reads the original CSV files
- performs **parsing/cleansing** (e.g. `"-"` → `null`, timestamp parsing, feature-vector parsing)
- saves the cleaned data as **Parquet**

> **Note:** By default, the paths point to the files uploaded in this environment. If you use the notebook in your own repo/cluster, adjust the paths in the paths cell below.
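
As an illustration of the cleansing step listed above, the following plain-Python sketch maps the placeholder `"-"` to `None` before casting to a number. It is not the notebook's Spark code: the helper names `to_float_or_none` and `clean_row` and the sample rows are hypothetical. In Spark, the CSV reader's `nullValue` option can serve a similar purpose.

```python
import csv
import io

NULL_MARKER = "-"  # placeholder for missing values, as named in the list above


def to_float_or_none(value):
    """Return None for the null marker or an empty field, else the value as float."""
    value = value.strip()
    if value in ("", NULL_MARKER):
        return None
    return float(value)


def clean_row(row, numeric_columns):
    """Convert the given columns of a dict row; leave all other columns unchanged."""
    return {
        key: to_float_or_none(val) if key in numeric_columns else val
        for key, val in row.items()
    }


# Hypothetical sample data, not taken from the real files
sample = io.StringIO("id,voltage,temperature\na,4.55,15\nb,-,14\n")
rows = [clean_row(r, {"voltage", "temperature"}) for r in csv.DictReader(sample)]
print(rows)
```

Only a field that consists of the marker alone becomes `None`; a negative number such as `-98` still parses as a float.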


In [37]:
# Imports
import os
import time

import pyspark.sql
# Note: this `round` is Spark's column function and shadows Python's built-in round().
from pyspark.sql.functions import col, from_unixtime, date_format, round, when

# Only needed on the ZHAW cluster
# import swissproc
# sc = swissproc.connect("frommthi", 2)

spark = pyspark.sql.SparkSession.builder.getOrCreate()

In [8]:
# Paths

# --- INPUT: original files ---
HUB_CSV = "data/raw/dm-hub.csv"
SPOKE_CSV = "data/raw/dm-spoke.csv"
METEO_CSV = "data/raw/ogd-smn-precip_leu_t_recent.csv"

# --- OUTPUT: Parquet target directories ---
OUT_ROOT = "data/processed"
HUB_OUT_PARQUET = os.path.join(OUT_ROOT, "hub")
SPOKE_OUT_PARQUET = os.path.join(OUT_ROOT, "spoke")
METEO_OUT_PARQUET = os.path.join(OUT_ROOT, "meteo")

os.makedirs(OUT_ROOT, exist_ok=True)

print("HUB_CSV:", HUB_CSV)
print("SPOKE_CSV:", SPOKE_CSV)
print("OUT_ROOT:", OUT_ROOT)


HUB_CSV: data/raw/dm-hub.csv
SPOKE_CSV: data/raw/dm-spoke.csv
OUT_ROOT: data/processed


## Data Preparation: Hubs
The hub data contains the following columns:
- `hubtracker`: unique ID of the hub
- `timestamp_hubs`: timestamp of the measurement (Unix epoch in seconds, converted to `dd.MM.yyyy HH:mm`)
- `hub_coords`: literal record-type marker `hub-coords` that precedes the coordinates
- `lat_hubs`: latitude of the hub (decimal degrees)
- `lon_hubs`: longitude of the hub (decimal degrees)
- `voltage`: hub voltage in volts
- `temperature_hubs`: hub temperature in degrees Celsius
- `signal`: signal strength of the hub (not needed)
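
The timestamp conversion described above can be sketched in plain Python as follows. The function name `format_epoch` is hypothetical, and UTC is an assumption made for reproducibility: Spark's `from_unixtime` formats in the session time zone, so the notebook's values may be shifted by the local offset.

```python
from datetime import datetime, timezone


def format_epoch(seconds, tz=timezone.utc):
    """Format Unix epoch seconds as 'dd.MM.yyyy HH:mm' in the given time zone."""
    return datetime.fromtimestamp(seconds, tz=tz).strftime("%d.%m.%Y %H:%M")


print(format_epoch(0))           # 01.01.1970 00:00
print(format_epoch(1752302160))  # 12.07.2025 06:36 (UTC)
```

Strings in this format are convenient for display, but they sort by day of month first rather than chronologically, so keeping the numeric epoch alongside may be useful if later steps need ordering.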


In [9]:
# Hub format: <hubtracker> <timestamp> hub-coords <latitude> <longitude> <voltage/V> <temperature/°C> <signal>
# => 8 columns (signal is the 8th column but is not needed)

start = time.time()

hubs = (
    spark.read.format("csv")
    .option("header", "false")
    .option("delimiter", ",")
    .option("inferSchema", "true")
    .load(HUB_CSV)
    .toDF("hubtracker", "timestamp_hubs", "hub_coords", "lat_hubs",
          "lon_hubs", "voltage", "temperature_hubs", "signal")
    .select("hubtracker", "timestamp_hubs", "lat_hubs", "lon_hubs")
    # Epoch seconds -> display string; from_unixtime uses the Spark session time zone
    .withColumn("timestamp_hubs",
                date_format(from_unixtime(col("timestamp_hubs")), "dd.MM.yyyy HH:mm"))
    .withColumn("lat_hubs", col("lat_hubs").cast("double"))
    .withColumn("lon_hubs", col("lon_hubs").cast("double"))
    .dropna()
    # .dropDuplicates(["hubtracker"])  # disabled: would keep only one row per hub
)

hubs.write.mode("overwrite").parquet(HUB_OUT_PARQUET)
hubs_parquet = spark.read.parquet(HUB_OUT_PARQUET)

hubs_parquet.printSchema()
print(f"Hub processing: {time.time() - start:.2f} seconds")
print(f"Number of hub records: {hubs_parquet.count()}")

root
 |-- hubtracker: integer (nullable = true)
 |-- timestamp_hubs: string (nullable = true)
 |-- lat_hubs: double (nullable = true)
 |-- lon_hubs: double (nullable = true)

Hub processing: 7.05 seconds
Number of hub records: 46757


In [10]:
display(hubs_parquet.limit(100).toPandas())


| | hubtracker | timestamp_hubs | lat_hubs | lon_hubs |
|---|---|---|---|---|
| 0 | 937832 | 12.07.2025 06:36 | 46.369462 | 7.673785 |
| 1 | 932400 | 12.07.2025 06:38 | 46.369377 | 7.673778 |
| 2 | 937832 | 12.07.2025 06:36 | 46.369462 | 7.673785 |
| 3 | 937832 | 12.07.2025 06:36 | 46.369462 | 7.673785 |
| 4 | 937832 | 12.07.2025 06:36 | 46.369462 | 7.673785 |
| ... | ... | ... | ... | ... |
| 95 | 931861 | 12.07.2025 07:09 | 46.369495 | 7.674707 |
| 96 | 931861 | 12.07.2025 07:09 | 46.369495 | 7.674707 |
| 97 | 931861 | 12.07.2025 07:09 | 46.369495 | 7.674707 |
| 98 | 931861 | 12.07.2025 07:09 | 46.369495 | 7.674707 |
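
The preview above contains rows that are identical in every column (for example indices 0 and 2 to 4). The following plain-Python sketch contrasts two ways of deduplicating such records: removing exact repeats versus keeping one row per hub ID. The function names are hypothetical, and the last sample row is an invented later reading added only to make the difference visible.

```python
def drop_exact_duplicates(rows):
    """Remove rows identical in every field, keeping first occurrences in order."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


def keep_first_per_key(rows, key_index=0):
    """Keep only the first row seen for each key (here: the hub ID)."""
    seen = set()
    unique = []
    for row in rows:
        key = row[key_index]
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


rows = [
    (937832, "12.07.2025 06:36", 46.369462, 7.673785),
    (932400, "12.07.2025 06:38", 46.369377, 7.673778),
    (937832, "12.07.2025 06:36", 46.369462, 7.673785),  # exact repeat
    (937832, "12.07.2025 06:46", 46.369462, 7.673785),  # hypothetical later reading
]
exact = drop_exact_duplicates(rows)
per_hub = keep_first_per_key(rows)
print(len(exact), len(per_hub))
```

Deduplicating on the hub ID alone discards later readings of the same hub, which is presumably why the `dropDuplicates(["hubtracker"])` call in the cell above is commented out. Removing only exact repeats keeps the time series intact.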


## Data Preparation: Spokes
The spoke data contains the following columns:
- `hubtracker`: unique ID of the hub that received the spoke
- `timestamp_spokes`: timestamp of the measurement (Unix epoch in seconds, converted to `dd.MM.yyyy HH:mm`)
- `spoke_visibility`: literal record-type marker `spoke-visibility` (read as a string, not a Boolean)
- `spoketracker`: unique ID of the spoke (animal)
- `rssi`: signal strength of the spoke in dB
- `device_state`: state of the device (e.g. `noproblem`)
- `voltage`: spoke voltage in volts
- `temperature_spokes`: spoke temperature in degrees Celsius
- `animal_state`: state of the animal (e.g. `walking`, `running`)
- `state_resting`: time in minutes the animal spent resting
- `state_walking`: time in minutes the animal spent walking
- `state_grazing`: time in minutes the animal spent grazing
- `state_running`: time in minutes the animal spent running
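
Each spoke line also carries 15 feature values and one trailing field after the 13 columns listed above, 29 fields in total. The following plain-Python sketch shows one way to split such a line and group the feature values into a single list. The function name `parse_spoke_line` is hypothetical, the sample timestamp and feature values are invented, and treating the features as floats is an assumption.

```python
NAMED_COLS = [
    "hubtracker", "timestamp_spokes", "spoke_visibility", "spoketracker",
    "rssi", "device_state", "voltage", "temperature_spokes", "animal_state",
    "state_resting", "state_walking", "state_grazing", "state_running",
]
FEATURE_COLS = [f"f{i:02d}" for i in range(1, 16)]  # f01 ... f15
SPOKE_COLS = NAMED_COLS + FEATURE_COLS + ["extra"]


def parse_spoke_line(line):
    """Split one comma-separated line and group f01..f15 into a feature list."""
    values = line.rstrip("\n").split(",")
    if len(values) != len(SPOKE_COLS):
        raise ValueError(f"expected {len(SPOKE_COLS)} fields, got {len(values)}")
    record = dict(zip(SPOKE_COLS, values))
    # Assumption: feature values are numeric
    record["features"] = [float(record.pop(name)) for name in FEATURE_COLS]
    return record


# Hypothetical sample line (timestamp, feature values and trailing field are invented)
sample_line = ",".join(
    ["932400", "1753423440", "spoke-visibility", "newspoke-608B7C97", "-98",
     "noproblem", "4.55", "15", "walking", "5151", "8843", "2259", "1695"]
    + [str(i / 10) for i in range(1, 16)]
    + ["x"]
)
record = parse_spoke_line(sample_line)
print(len(SPOKE_COLS), len(record["features"]))
```

Checking the field count up front makes malformed lines fail loudly instead of silently shifting values into the wrong columns.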

In [11]:
# Spoke format: <hubtracker> <timestamp> spoke-visibility <spoketracker> <rssi/dB> <device-state>
#               <voltage/V> <temperature/°C> <animal-state> <state-resting/min> <state-walking/min>
#               <state-grazing/min> <state-running/min> <f01> ... <f15> <extra>
# => 29 columns (15 feature values as individual columns)

start = time.time()

# Column names for all 29 columns
spoke_cols = [
    "hubtracker", "timestamp_spokes", "spoke_visibility", "spoketracker",
    "rssi", "device_state", "voltage", "temperature_spokes", "animal_state",
    "state_resting", "state_walking", "state_grazing", "state_running",
] + [f"f{i:02d}" for i in range(1, 16)] + ["extra"]

spokes = (
    spark.read.format("csv")
    .option("header", "false")
    .option("delimiter", ",")
    .option("inferSchema", "true")
    .load(SPOKE_CSV)
    .toDF(*spoke_cols)
)

# Select only the required columns (without the feature and extra columns)
spokes_selected = spokes.select(
    "hubtracker",
    "timestamp_spokes",
    "spoke_visibility",
    "spoketracker",
    "rssi",
    "device_state",
    "voltage",
    "temperature_spokes",
    "animal_state",
    "state_resting",
    "state_walking",
    "state_grazing",
    "state_running",
    # *[f"f{i:02d}" for i in range(1, 16)]  # feature columns, currently not kept
)

# Adjust data types
spokes_cleaned = (
    spokes_selected
    .withColumn("timestamp_spokes",
                date_format(from_unixtime(col("timestamp_spokes")), "dd.MM.yyyy HH:mm"))
    .withColumn("rssi", col("rssi").cast("double"))
    .withColumn("voltage", col("voltage").cast("double"))
    .withColumn("temperature_spokes", col("temperature_spokes").cast("double"))
    .dropna()
)

spokes_cleaned.write.mode("overwrite").parquet(SPOKE_OUT_PARQUET)
spokes_parquet = spark.read.parquet(SPOKE_OUT_PARQUET)

spokes_parquet.printSchema()
print(f"Spoke processing: {time.time() - start:.2f} seconds")
print(f"Number of spoke records: {spokes_parquet.count()}")


root
 |-- hubtracker: integer (nullable = true)
 |-- timestamp_spokes: string (nullable = true)
 |-- spoke_visibility: string (nullable = true)
 |-- spoketracker: string (nullable = true)
 |-- rssi: double (nullable = true)
 |-- device_state: string (nullable = true)
 |-- voltage: double (nullable = true)
 |-- temperature_spokes: double (nullable = true)
 |-- animal_state: string (nullable = true)
 |-- state_resting: integer (nullable = true)
 |-- state_walking: integer (nullable = true)
 |-- state_grazing: integer (nullable = true)
 |-- state_running: integer (nullable = true)

Spoke processing: 2.00 seconds
Number of spoke records: 187416


In [12]:
display(spokes_parquet.limit(100).toPandas())


| | hubtracker | timestamp_spokes | spoke_visibility | spoketracker | rssi | device_state | voltage | temperature_spokes | animal_state | state_resting | state_walking | state_grazing | state_running |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 932400 | 25.07.2025 06:04 | spoke-visibility | newspoke-608B7C97 | -98.0 | noproblem | 4.55 | 15.0 | walking | 5151 | 8843 | 2259 | 1695 |
| 1 | 932400 | 25.07.2025 06:04 | spoke-visibility | newspoke-67CA7EB0 | -100.0 | noproblem | 4.55 | 14.0 | walking | 4950 | 11511 | 1811 | 512 |
| 2 | 932400 | 25.07.2025 06:04 | spoke-visibility | newspoke-C99FA3B7 | -78.0 | noproblem | 4.55 | 15.0 | walking | 3684 | 10991 | 3142 | 1586 |
| 3 | 932400 | 25.07.2025 06:04 | spoke-visibility | newspoke-2902F0B8 | -77.0 | noproblem | 4.55 | 14.0 | running | 3594 | 8097 | 397 | 6682 |
| 4 | 932400 | 25.07.2025 06:04 | spoke-visibility | newspoke-6AA669CF | -91.0 | noproblem | 4.55 | 16.0 | running | 5892 | 2722 | 5421 | 3163 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 932400 | 25.07.2025 06:19 | spoke-visibility | newspoke-1371664E | -97.0 | noproblem | 4.55 | 13.0 | running | 5310 | 7163 | 500 | 3997 |
| 96 | 932400 | 25.07.2025 06:19 | spoke-visibility | newspoke-240C8E6C | -92.0 | noproblem | 4.55 | 15.0 | running | 4691 | 9097 | 363 | 3210 |
| 97 | 932400 | 25.07.2025 06:19 | spoke-visibility | newspoke-B9A7FF6F | -91.0 | noproblem | 4.55 | 14.0 | walking | 7002 | 8109 | 1165 | 2733 |
| 98 | 932400 | 25.07.2025 06:19 | spoke-visibility | newspoke-EFD1CF8E | -93.0 | noproblem | 4.55 | 13.0 | walking | 3867 | 7487 | 1997 | 3666 |


## Data Preparation: MeteoSwiss Precipitation Data
Precipitation data for the Leukerbad station (CH) from MeteoSwiss (MeteoSchweiz).

Source: https://data.geo.admin.ch/ch.meteoschweiz.ogd-smn-precip/leu/ogd-smn-precip_leu_t_recent.csv

The source file contains the following columns, which the notebook renames on load:
- `station_abbr` (renamed to `station_id`): station ID (Leukerbad: LEU)
- `reference_timestamp` (renamed to `meteo_timestamp`): date/time of the measurement (`DD.MM.YYYY HH:MM`), at a resolution of 10 minutes
- `rre150z0` (renamed to `rain_10min`): precipitation per 10-minute interval in mm

The following columns are added:
- `rain_h`: precipitation rate in mm/h (`rain_10min` * 6, rounded to one decimal place)
- `rain_category`: precipitation classified into six classes:
    - `kein` (none): 0 mm/h
    - `sehr_schwach` (very light): 0.1 - 0.5 mm/h
    - `schwach` (light): 0.6 - 2 mm/h
    - `mässig` (moderate): 2.1 - 5 mm/h
    - `stark` (heavy): 5.1 - 10 mm/h
    - `sehr_stark` (very heavy): > 10 mm/h
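
The classification above can be sketched in plain Python as a chain of upper bounds, mirroring the order in which the categories are listed. The function name `rain_category` is hypothetical. Python's built-in `round` is used here, and it may treat exact ties differently from Spark's `round`, so values on a rounding boundary could in principle land in a different class.

```python
def rain_category(rain_10min):
    """Classify a 10-minute precipitation sum (mm) using the thresholds listed above."""
    rain_h = round(rain_10min * 6.0, 1)  # mm per 10 minutes -> mm/h
    if rain_h == 0:
        return "kein"
    if rain_h <= 0.5:
        return "sehr_schwach"
    if rain_h <= 2:
        return "schwach"
    if rain_h <= 5:
        return "mässig"
    if rain_h <= 10:
        return "stark"
    return "sehr_stark"


for value in (0.0, 0.05, 0.1, 0.5, 1.0, 2.0):
    print(value, rain_category(value))
```

Because `rain_h` is rounded to one decimal place, the ranges in the list (for example 0.1 - 0.5 and 0.6 - 2) cover every possible value without gaps.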

In [38]:
weather = (
    spark.read.format("csv")
    .option("header", "true")
    .option("delimiter", ";")
    .option("inferSchema", "true")
    .load(METEO_CSV)
    .toDF("station_id", "meteo_timestamp", "rain_10min")
    .withColumn("rain_10min", col("rain_10min").cast("double"))
    .withColumn("rain_h", round(col("rain_10min") * 6.0, 1))  # convert the 10-minute sum to mm/h
    .withColumn(
        "rain_category",
        when(col("rain_h") == 0, "kein")
        .when(col("rain_h") <= 0.5, "sehr_schwach")
        .when(col("rain_h") <= 2, "schwach")
        .when(col("rain_h") <= 5, "mässig")
        .when(col("rain_h") <= 10, "stark")
        .otherwise("sehr_stark")
    )
    .dropna()
)

weather.write.mode("overwrite").parquet(METEO_OUT_PARQUET)
weather_parquet = spark.read.parquet(METEO_OUT_PARQUET)

weather_parquet.printSchema()
print(f"Number of weather records: {weather_parquet.count()}")

root
 |-- station_id: string (nullable = true)
 |-- meteo_timestamp: string (nullable = true)
 |-- rain_10min: double (nullable = true)
 |-- rain_h: double (nullable = true)
 |-- rain_category: string (nullable = true)

Number of weather records: 50976


In [39]:
# show() prints the table itself and returns None, so it is not wrapped in display()
weather_parquet.show(1000)

+----------+----------------+----------+------+-------------+
|station_id| meteo_timestamp|rain_10min|rain_h|rain_category|
+----------+----------------+----------+------+-------------+
|       LEU|01.01.2025 00:00|       0.0|   0.0|         kein|
|       LEU|01.01.2025 00:10|       0.0|   0.0|         kein|
|       LEU|01.01.2025 00:20|       0.0|   0.0|         kein|
|       LEU|01.01.2025 00:30|       0.0|   0.0|         kein|
|       LEU|01.01.2025 00:40|       0.0|   0.0|         kein|
|       LEU|01.01.2025 00:50|       0.0|   0.0|         kein|
|       LEU|01.01.2025 01:00|       0.0|   0.0|         kein|
|       LEU|01.01.2025 01:10|       0.0|   0.0|         kein|
|       LEU|01.01.2025 01:20|       0.0|   0.0|         kein|
|       LEU|01.01.2025 01:30|       0.0|   0.0|         kein|
|       LEU|01.01.2025 01:40|       0.0|   0.0|         kein|
|       LEU|01.01.2025 01:50|       0.0|   0.0|         kein|
|       LEU|01.01.2025 02:00|       0.0|   0.0|         kein|
...