# Model Definition and Evaluation
## Table of Contents
1. [Model Selection](#model-selection)
2. [Feature Engineering](#feature-engineering)
3. [Hyperparameter Tuning](#hyperparameter-tuning)
4. [Implementation](#implementation)
5. [Evaluation Metrics](#evaluation-metrics)
6. [Comparative Analysis](#comparative-analysis)


## Setup
### Imports

In [1]:
import numpy as np
import pandas as pd
import random
import tensorflow as tf
import matplotlib.pyplot as pl
import os, json, hashlib

from datetime import datetime
from sklearn.preprocessing import StandardScaler
from tensorflow.keras import layers, models
from tensorflow.keras.preprocessing.sequence import pad_sequences
from huggingface_hub import login, list_repo_files, hf_hub_download, upload_file

### Configuration

In [None]:
REPO_ID = "mttfst/Paulette_Cloud_Tracks"
token = ""

WINDOW_MINUTES =  30 #Minuten
CUTOFF_STEPS = 5

MODEL = "SimpleRNN"
LOSS = "MSE"

BATCH_SIZE = 8
EPOCHS = 2 #40

In [3]:
# =========================================
# Login HuggingFace
# =========================================
login(token)

### Logging

In [4]:
# Du kannst hier ein separates Repo für Logs/Configs nutzen (empfohlen),
# oder du lässt es auf dem Dataset-Repo laufen.
CONFIG_REPO_ID = REPO_ID  # z.B. "thorsten789/hurricane_cloud_runs"

def make_run_id(prefix: str, config: dict) -> str:
    ts = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    cfg_str = json.dumps(config, sort_keys=True)
    h = hashlib.sha1(cfg_str.encode("utf-8")).hexdigest()[:10]
    return f"{prefix}_{ts}_{h}"

def save_json_local(path: str, data: dict) -> str:
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)
    return path

def upload_json_hf(local_path: str, run_id: str, name: str, base_dir: str = "runs"):
    """Lädt JSON als {base_dir}/{run_id}/{name}.json in CONFIG_REPO_ID hoch."""
    try:
        path_in_repo = f"{base_dir}/{run_id}/{name}.json"
        upload_file(
            path_or_fileobj=local_path,
            path_in_repo=path_in_repo,
            repo_id=CONFIG_REPO_ID,
            repo_type="dataset",
            commit_message=f"Add {name}.json for {run_id}",
        )
        print(f"[HF] uploaded: {path_in_repo}")
    except Exception as e:
        print(f"[HF] upload skipped/failed ({name}): {e}")

# --- zentrale Run-Config (die ID basiert auf config -> sinnvoller Run-Name)
RUN_CONFIG = {
    "model": MODEL,
    "optimizer": "adam",
    "units1": 64,
    "units2": 32,
    "lr": 1e-3,
    "batch_size": BATCH_SIZE,
    "epochs": EPOCHS,
    "cutoff_steps": CUTOFF_STEPS,
    "loss": LOSS,
}

RUN_ID = make_run_id("simple_rnn", RUN_CONFIG)
print("RUN_ID:", RUN_ID)

# Setup sofort speichern (damit du schon am Anfang einen Run hast)
setup = {
    "run_id": RUN_ID,
    "config": RUN_CONFIG,
    "data": {"repo_id": REPO_ID},
    "meta": {"notebook": "3_Model/model_definition_evaluation_JS.ipynb"},
}

# +++ Logging: Save Setup +++
local_setup = save_json_local(f"runs_local/{RUN_ID}/setup.json", setup)
upload_json_hf(local_setup, RUN_ID, "setup")

class AutoSaveTrain(tf.keras.callbacks.Callback):
    """Speichert train.json am Ende von model.fit (History + best_val_loss)."""
    def __init__(self, run_id: str):
        super().__init__()
        self.run_id = run_id

    def on_train_end(self, logs=None):
        hist = getattr(self.model, "history", None)
        history_dict = hist.history if hist is not None else {}

        train_data = {
            "run_id": self.run_id,
            "history": history_dict,
            "summary": {
                "best_val_loss": float(min(history_dict["val_loss"])) if "val_loss" in history_dict else None,
                "final_train_loss": float(history_dict["loss"][-1]) if "loss" in history_dict and len(history_dict["loss"]) else None,
            },
        }
        local_train = save_json_local(f"runs_local/{self.run_id}/train.json", train_data)
        upload_json_hf(local_train, self.run_id, "train")


RUN_ID: simple_rnn_2026-02-17_12-35-09_a7afa260af


Error while fetching `HF_TOKEN` secret value from your vault: 'Requesting secret HF_TOKEN timed out. Secrets can only be fetched when running from the Colab UI.'.
You are not authenticated with the Hugging Face Hub in this notebook.
If the error persists, please let us know by opening an issue on GitHub (https://github.com/huggingface/huggingface_hub/issues/new).


[HF] upload skipped/failed (setup): The read operation timed out


### Load Data from HuggingFace

TRACK LENGTH FILE:

Diese CSV enthält Metadaten über alle Wolken:
- filename: Dateiname des Tracks
- track_len: Anzahl Zeitschritte (Lebensdauer)

WARUM WICHTIG?
- Wir trainieren nur auf Wolken mit >= 120 Zeitschritten
- Grund: Kurze Wolken (<60 Minuten) sind zu variabel/chaotisch
- Längere Wolken zeigen klare Lebenszyklen

DATENQUELLE:
Diese Datei wurde vorberechnet aus allen Tracks.
Spart Zeit beim Training-Setup.

In [5]:
tracks_120 = None

# Wenn lokal nicht da: von Hugging Face Dataset-Repo herunterladen
if tracks_120 is None:
    from huggingface_hub import hf_hub_download

    filename_in_repo = "track_len/track_len_exp_1.1.csv"

    print(f"⬇️  Downloading from HF: {REPO_ID}/{filename_in_repo}")
    local_file = hf_hub_download(
        repo_id=REPO_ID,
        repo_type="dataset",
        filename=filename_in_repo,
    )

    print(f"✅ Loading track_len from HF-cached file: {local_file}")
    tracks_120 = pd.read_csv(local_file)

print("tracks_120 shape:", tracks_120.shape)
tracks_120.head()

# =========================================
# Dataset-Split
# =========================================
files = list_repo_files(REPO_ID, repo_type="dataset")
csv_files = [f for f in files if f.startswith("exp_1.1/") and f.endswith(".csv")]

tracks_120 = tracks_120[tracks_120.track_len >= 120]
print("Total CSV tracks with at least 120 timesteps:", len(tracks_120))

tracks_120 = tracks_120.filename.to_list()

csv_files = [
    f for f in csv_files
    if f.split("/")[1] in tracks_120
]

random.seed(42)
random.shuffle(csv_files)

n = len(csv_files)
#train_files = csv_files[: int(0.7 * n)]
#val_files   = csv_files[int(0.7 * n): int(0.85 * n)]
#test_files  = csv_files[int(0.85 * n):]

train_files= csv_files[:150]
val_files= csv_files[100:125]
test_files= csv_files[150:175]

print(f"Train: {len(train_files)}, Val: {len(val_files)}, Test: {len(test_files)}")

⬇️  Downloading from HF: mttfst/Paulette_Cloud_Tracks/track_len/track_len_exp_1.1.csv


track_len_exp_1.1.csv:   0%|          | 0.00/173k [00:00<?, ?B/s]

✅ Loading track_len from HF-cached file: /root/.cache/huggingface/hub/datasets--mttfst--Paulette_Cloud_Tracks/snapshots/f47236a556953abe353bd3aacd54093b322c009a/track_len/track_len_exp_1.1.csv
tracks_120 shape: (9227, 2)
Total CSV tracks with at least 120 timesteps: 1115
Train: 150, Val: 25, Test: 25


## Model Selection

[Discuss the type(s) of models you consider for this task, and justify the selection.]



## Feature Engineering

[Describe any additional feature engineering you've performed beyond what was done for the baseline model.]


In [6]:
PROFILE_PREFIXES = ["qr_", "qc_", "qi_", "qs_", "qg_", "qv_", "roh_", "w_"]

SCALAR_FEATURES = [
    "cape_ml_L00", "cin_ml_L00",
    "lwp_L00",
    "iwp_L00", "rain_gsp_rate_L00",
    "tqc_L00", "tqi_L00", "area_m2"
]

In [7]:
def compute_remaining_lifetime(df, timestep_minutes=5):
    """
    Berechnet verbleibende Lebensdauer pro Zeitschritt
    """
    n = len(df)
    return [(n - i - 1) * timestep_minutes for i in range(n)]

#def compute_future_rain(df, timestep_minutes=5):
#    rain = df["rain_gsp_rate_L00"].values
#    dt = timestep_minutes * 60  # Sekunden

#    future_rain = []
#    for i in range(len(rain)):
#        future_rain.append(rain[i:].sum() * dt)

#    return future_rain

In [8]:
def extract_profile(df, prefix, n_levels=50):
    """
    Extrahiert ein vertikales Profil mit exakt n_levels.
    Fehlende Level werden mit 0 aufgefüllt.
    """
    data = np.zeros((len(df), n_levels), dtype="float32")
    for i in range(n_levels):
        col = f"{prefix}L{i:02d}"
        if col in df.columns:
            data[:, i] = df[col].values
    return data

In [9]:
def extract_ts_features_from_profiles(profiles):
    """
    profiles: (T, Z, F) mit F = len(PROFILE_PREFIXES)

    Z MUSS = 50 sein und der folgenden vertikalen Struktur entsprechen
    (Index 0 = oberste Schicht)
    """

    # --- hard-coded z-levels (physikalische Höhen) ---
    z = np.array([
        3.136780e+04, 2.736595e+04, 2.492369e+04, 2.294698e+04, 2.125334e+04,
        1.975951e+04, 1.841803e+04, 1.719845e+04, 1.607970e+04, 1.504641e+04,
        1.408694e+04, 1.319217e+04, 1.235481e+04, 1.156892e+04, 1.082956e+04,
        1.013258e+04, 9.474455e+03, 8.852140e+03, 8.263009e+03, 7.704765e+03,
        7.175387e+03, 6.673090e+03, 6.196285e+03, 5.743555e+03, 5.313629e+03,
        4.905366e+03, 4.517735e+03, 4.149806e+03, 3.800737e+03, 3.469765e+03,
        3.156199e+03, 2.859414e+03, 2.578843e+03, 2.313976e+03, 2.064356e+03,
        1.829575e+03, 1.609273e+03, 1.403137e+03, 1.210904e+03, 1.032357e+03,
        8.673333e+02, 7.157275e+02, 5.774996e+02, 4.526887e+02, 3.414336e+02,
        2.440081e+02, 1.608839e+02, 9.285786e+01, 4.137239e+01, 1.000000e+01
    ], dtype="float32")

    idx = {name: i for i, name in enumerate(PROFILE_PREFIXES)}

    qc = profiles[:, :, idx["qc_"]]   # (T, Z)
    qi = profiles[:, :, idx["qi_"]]
    qr = profiles[:, :, idx["qr_"]]
    w  = profiles[:, :, idx["w_"]]

    T, Z = qc.shape
    if Z != z.shape[0]:
        raise ValueError(f"Z={Z} passt nicht zu z_levels={z.shape[0]}")

    ts_features = []

    for t in range(T):
        qc_t = qc[t]
        qi_t = qi[t]
        qr_t = qr[t]
        w_t  = w[t]

        cloud_mask = (qc_t + qi_t) > 0

        if not np.any(cloud_mask):
            ts_features.append(np.zeros(12, dtype="float32"))
            continue

        z_cloud = z[cloud_mask]

        # physikalisch korrekt
        cloud_base = float(np.min(z_cloud))
        cloud_top  = float(np.max(z_cloud))
        cloud_thickness = cloud_top - cloud_base

        cloud_mass = float(np.sum(qc_t + qi_t))
        rain_mass  = float(np.sum(qr_t))

        w_in_cloud = w_t[cloud_mask]
        mean_w = float(np.mean(w_in_cloud))
        max_w  = float(np.max(w_in_cloud))

        height_max_qc = float(z[int(np.argmax(qc_t))])
        height_max_w  = float(z[int(np.argmax(w_t))])

        weights = qc_t + qi_t
        center_of_mass = float(np.sum(z * weights) / (cloud_mass + 1e-12))

        ts_features.append([
            cloud_base,
            cloud_top,
            cloud_thickness,
            cloud_mass,
            rain_mass,
            mean_w,
            max_w,
            height_max_qc,
            height_max_w,
            center_of_mass,
            float(np.max(qc_t)),
            float(np.std(w_in_cloud)),
        ])

    return np.array(ts_features, dtype="float32")

In [100]:
def preprocess_cloud(df):
    df = df.sort_values("time")

    # Target
    y_lifetime = compute_remaining_lifetime(df)

    # Profile (T, Z, F)
    profile_features = []
    for prefix in PROFILE_PREFIXES:
        prof = extract_profile(df, prefix, n_levels=50)
        profile_features.append(prof)
    
    profiles = np.stack(profile_features, axis=-1)
    
    # cin and cape only every 5min interpolate the rest
    df['cin_ml_L00'] = df['cin_ml_L00'].interpolate(method='linear').bfill()
    df['cape_ml_L00'] = df['cape_ml_L00'].interpolate(method='linear').bfill()

    scalars = df[SCALAR_FEATURES].values.astype("float32")
    
    # neue TS-Features
    ts_features = extract_ts_features_from_profiles(profiles)

    return {
        "ts_features": ts_features,      # (T, 12)
        "scalars": scalars,              # (T, 8)
        "y": np.array(y_lifetime, dtype="float32")[:, None]
    }

In [101]:
def load_and_preprocess(files):
    samples = []
    for f in files:
        local_file = hf_hub_download(
            repo_id=REPO_ID,
            repo_type="dataset",
            filename=f,
        )
        df = pd.read_csv(local_file)
        
        if len(df) <= CUTOFF_STEPS:
            continue
        
        sample = preprocess_cloud(df)
        samples.append(sample)
    
    return samples

In [102]:
test_sample= hf_hub_download(
            repo_id=REPO_ID,
            repo_type="dataset",
            filename=train_files[0],
        )
df = pd.read_csv(test_sample)
df['cin_ml_L00'].bfill(limit=int(df['cin_ml_L00'].isna().cumprod().sum())).interpolate()


Unnamed: 0,cin_ml_L00
0,0.591320
1,0.591320
2,0.591320
3,0.591320
4,0.591320
...,...
183,28.405456
184,28.405456
185,28.405456
186,28.405456


In [103]:
df['cin_ml_L00'].interpolate(method='linear').bfill()

Unnamed: 0,cin_ml_L00
0,0.591320
1,0.591320
2,0.591320
3,0.591320
4,0.591320
...,...
183,28.386643
184,28.405456
185,28.405456
186,28.405456


In [104]:
df['cin_ml_L00'].isna().cumprod().sum()

np.int64(4)

In [105]:
print("Loading data...")
train_samples = load_and_preprocess(train_files)
val_samples   = load_and_preprocess(val_files)
test_samples  = load_and_preprocess(test_files)

print(f"Loaded: {len(train_samples)} train, {len(val_samples)} val, {len(test_samples)} test")

Loading data...
Loaded: 150 train, 25 val, 25 test


In [106]:
def create_fixed_window_sequences(samples, window_steps=WINDOW_MINUTES):  # 30min
    X, y = [], []
    WINDOW_STEPS = int(WINDOW_MINUTES * 60 / 30)  # 30min * 60s/min / 30s/step = 60

    for s in samples:
        combined = np.concatenate([s["scalars"], s["ts_features"]], axis=1)
        for i in range(len(combined) - WINDOW_STEPS):
            X.append(combined[i:i+WINDOW_STEPS])  # (60, 20)
            y.append(s["y"][i+WINDOW_STEPS])  # RUL am Ende
    return np.array(X), np.array(y)

X_train, y_train = create_fixed_window_sequences(train_samples, WINDOW_MINUTES)
X_val, y_val = create_fixed_window_sequences(val_samples, WINDOW_MINUTES)
X_test, y_test = create_fixed_window_sequences(test_samples, WINDOW_MINUTES)

In [107]:
sample = train_samples[0]
ts_features = sample["ts_features"]

print("cloud_base  > cloud_top?", np.all(ts_features[:,0] >= ts_features[:,1]))
print("thickness > 0?", np.all(ts_features[:,2] > 0))
print("Sample values:\n", ts_features[:3, :5])  # Erste 5 Features

cloud_base  > cloud_top? False
thickness > 0? True
Sample values:
 [[3.4143359e+02 2.1253340e+04 2.0911906e+04 8.6598527e-03 1.0412185e-02]
 [3.4143359e+02 2.1253340e+04 2.0911906e+04 9.4145974e-03 1.1317367e-02]
 [3.4143359e+02 2.1253340e+04 2.0911906e+04 1.0146300e-02 1.2264996e-02]]


In [108]:
sample = train_samples[8]

scalars = sample["scalars"]          # (T, 8)
ts      = sample["ts_features"]      # (T, 12)

# Zu (T, 20) zusammenfügen
combined = np.concatenate([scalars, ts], axis=1)

cols = [
    "cape_ml_L00", "cin_ml_L00",
    "lwp_L00",
    "iwp_L00", "rain_gsp_rate_L00",
    "tqc_L00", "tqi_L00", "area_m2",
    "cloud_base", "cloud_top", "cloud_thickness",
    "cloud_mass", "rain_mass",
    "mean_w", "max_w",
    "height_max_qc", "height_max_w",
    "center_of_mass", "max_qc", "std_w_in_cloud"
]

ts_df = pd.DataFrame(combined, columns=cols)
ts_df

Unnamed: 0,cape_ml_L00,cin_ml_L00,lwp_L00,iwp_L00,rain_gsp_rate_L00,tqc_L00,tqi_L00,area_m2,cloud_base,cloud_top,cloud_thickness,cloud_mass,rain_mass,mean_w,max_w,height_max_qc,height_max_w,center_of_mass,max_qc,std_w_in_cloud
0,604.720398,1.058137,2.507058,0.188360,0.000228,2.034199,0.037669,31360000.0,715.727478,21253.339844,20537.613281,0.008732,0.002572,0.233418,1.020461,3156.198975,1609.272949,2483.903564,0.001669,0.409249
1,604.720398,1.058137,2.567386,0.176616,0.000240,2.066459,0.037015,31360000.0,577.499573,21253.339844,20675.839844,0.008874,0.002705,0.242999,1.030474,3156.198975,1609.272949,2476.058594,0.001720,0.409450
2,604.720398,1.058137,2.617499,0.166323,0.000243,2.089409,0.036390,31360000.0,577.499573,21253.339844,20675.839844,0.008977,0.002832,0.246419,1.034724,3156.198975,1609.272949,2468.230713,0.001759,0.412110
3,604.720398,1.058137,2.656605,0.157891,0.000245,2.102927,0.035792,31360000.0,577.499573,21253.339844,20675.839844,0.009041,0.002950,0.248115,1.033336,3156.198975,1609.272949,2461.224854,0.001787,0.412336
4,604.720398,1.058137,2.683506,0.151085,0.000247,2.106670,0.035219,70560000.0,577.499573,21253.339844,20675.839844,0.009063,0.003056,0.247676,1.024991,3156.198975,1609.272949,2455.364258,0.001802,0.409315
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
229,233.265198,1.563237,7.119512,14.164972,0.009249,3.292426,0.204281,47040000.0,1210.904053,21253.339844,20042.435547,0.012460,0.023017,0.697362,1.854948,5743.555176,4905.366211,5217.223633,0.001781,0.656486
230,233.265198,1.563237,6.451401,13.833088,0.009249,3.027145,0.195926,47040000.0,1210.904053,21253.339844,20042.435547,0.011494,0.020605,0.613873,1.635766,5743.555176,4905.366211,5192.318848,0.001679,0.578922
231,233.265198,1.563237,5.802324,13.519963,0.009249,2.747786,0.192676,47040000.0,1403.136963,21253.339844,19850.203125,0.010495,0.018324,0.545470,1.420383,5743.555176,4905.366211,5172.251953,0.001563,0.505073
232,233.265198,1.563237,5.195273,13.173322,0.009249,2.478987,0.193946,47040000.0,1403.136963,21253.339844,19850.203125,0.009544,0.016275,0.464979,1.213811,5743.555176,4905.366211,5175.997070,0.001439,0.433654


### Normalisierung

In [109]:
print("Start Normalization")

scaler = StandardScaler()
X_train_flat = X_train.reshape(-1, X_train.shape[-1])  # (4500, 12)
X_train_scaled = scaler.fit_transform(X_train_flat).reshape(X_train.shape)

X_val_flat = X_val.reshape(-1, X_val.shape[-1])
X_val_scaled = scaler.transform(X_val_flat).reshape(X_val.shape)

X_test_flat = X_test.reshape(-1, X_test.shape[-1])
X_test_scaled = scaler.transform(X_test_flat).reshape(X_test.shape)

print("TS alle Splits skaliert")

print(f"Final shapes:")
print(f"  Train: TS={X_train_scaled.shape}")
print(f"  Val:   TS={X_val_scaled.shape}")
print(f"  Test:  TS={X_test_scaled.shape}")


Start Normalization
TS alle Splits skaliert
Final shapes:
  Train: TS=(23528, 60, 20)
  Val:   TS=(4580, 60, 20)
  Test:  TS=(3646, 60, 20)


## Hyperparameter Tuning

[Discuss any hyperparameter tuning methods you've applied, such as Grid Search or Random Search, and the rationale behind them.]


In [83]:
# Implement hyperparameter tuning
# Example using GridSearchCV with a DecisionTreeClassifier
# param_grid = {'max_depth': [2, 4, 6, 8]}
# grid_search = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=5)
# grid_search.fit(X_train, y_train)


## Implementation

[Implement the final model(s) you've selected based on the above steps.]


## Evaluation Metrics

[Clearly specify which metrics you'll use to evaluate the model performance, and why you've chosen these metrics.]


## Comparative Analysis

[Compare the performance of your model(s) against the baseline model. Discuss any improvements or setbacks and the reasons behind them.]


In [35]:
# Comparative Analysis code (if applicable)
# Example: comparing accuracy of the baseline model and the new model
# print(f"Baseline Model Accuracy: {baseline_accuracy}, New Model Accuracy: {new_model_accuracy}")
