# Model Definition and Evaluation
## Table of Contents
1. [Model Selection](#model-selection)
2. [Feature Engineering](#feature-engineering)
3. [Hyperparameter Tuning](#hyperparameter-tuning)
4. [Implementation](#implementation)
5. [Evaluation Metrics](#evaluation-metrics)
6. [Comparative Analysis](#comparative-analysis)


In [40]:
# Import necessary libraries
import pandas as pd
import numpy as np
import random
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from huggingface_hub import login, list_repo_files, hf_hub_download, list_repo_tree, upload_file


In [None]:

token = ""  # paste your Hugging Face access token here; never commit a real token
REPO_ID = "mttfst/Paulette_Cloud_Tracks"

login(token)

In [32]:

files = list_repo_files(REPO_ID, repo_type="dataset")

csv_files = [f for f in files if f.startswith("exp_1.1/") and f.endswith(".csv")]


MIN_BYTES = 700000  # roughly one hour
MIN_BYTES = 1500000  # roughly two hours; this second assignment is the value in effect

tree = list_repo_tree(REPO_ID, repo_type="dataset", path_in_repo="exp_1.1", recursive=True)

# dict: filename -> size
size_map = {item.path: item.size for item in tree if item.path.endswith(".csv")}

# Then filter by file size:
kept = [f for f in csv_files if size_map.get(f, 0) >= MIN_BYTES]
dropped = [f for f in csv_files if size_map.get(f, 0) < MIN_BYTES]
print("Kept:", len(kept), "Dropped:", len(dropped))

Kept: 371 Dropped: 8856


In [33]:
# 1) List all files in the repo
files = list_repo_files(REPO_ID, repo_type="dataset")

# 2) Use only the size-filtered track CSVs for exp_1.1
#    (the unfiltered selection is kept below for reference)
#csv_files = [f for f in files if f.startswith("exp_1.1/") and f.endswith(".csv")]
#print("Total CSV tracks:", len(csv_files))
csv_files = kept

# 3) Shuffle reproducibly
random.seed(42)        # fixed seed so the split is always the same
csv_files_shuffled = csv_files.copy()
random.shuffle(csv_files_shuffled)

# 4) 70/15/15 split at track level
n = len(csv_files_shuffled)
n_train = int(0.7 * n)
n_val   = int(0.15 * n)
# The remainder goes to test
n_test  = n - n_train - n_val

train_files = csv_files_shuffled[:n_train]
val_files   = csv_files_shuffled[n_train:n_train + n_val]
test_files  = csv_files_shuffled[n_train + n_val:]

print(f"Train tracks: {len(train_files)}")
print(f"Val tracks:   {len(val_files)}")
print(f"Test tracks:  {len(test_files)}")

# Optional: collect the splits in a dict to keep things tidy
split_files = {
    "train": train_files,
    "val": val_files,
    "test": test_files,
}


Train tracks: 259
Val tracks:   55
Test tracks:  57
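
The printed sizes follow from the integer truncation in the split: with 371 kept tracks, `int(0.7 * 371)` gives 259, `int(0.15 * 371)` gives 55, and the remaining 57 tracks go to the test set. A minimal sketch of that arithmetic; `split_sizes` is a hypothetical helper introduced here for illustration and is not part of the notebook.

```python
def split_sizes(n: int, train_frac: float = 0.7, val_frac: float = 0.15) -> tuple:
    """Return (n_train, n_val, n_test); the test set takes whatever is left over."""
    n_train = int(train_frac * n)  # truncates toward zero
    n_val = int(val_frac * n)
    n_test = n - n_train - n_val   # remainder, so the three parts always sum to n
    return n_train, n_val, n_test


sizes = split_sizes(371)
print(sizes)
```

Because the remainder is assigned to the test set, no track is lost to rounding, and the test share can end up slightly larger than the nominal 15 %.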


In [34]:
def add_stats_features_placeholder(
    df: pd.DataFrame,
    timestep_seconds: float = 30.0,
) -> pd.DataFrame:
    """
    Stats-based feature engineering for Task A.2 (first version).

    Currently adds:
    - track_length_s and age_frac
    - rolling mean / max over the past 10 timesteps for selected variables

    Planned additions:
    - growth rates (e.g. d(area)/dt)
    - integrated column values, etc.

    Note: df is modified in place and also returned.
    """
    # Optional extras. Caution: track_length_s is derived from the final age,
    # so both columns carry information about the track's total lifetime.
    df["track_length_s"] = df["age_s"].iloc[-1] + timestep_seconds
    df["age_frac"] = df["age_s"] / df["track_length_s"]

    variables = [
        "area_m2",
        "rain_gsp_rate_L00",
        "lwp_L00",
        "iwp_L00",
        "tqc_L00",
        "tqi_L00",
    ]

    for var in variables:
        if var not in df.columns:
            continue  # tolerate missing columns

        df[f"{var}_mean_10"] = (
            df[var]
            .rolling(window=10, min_periods=1)
            .mean()
        )

        df[f"{var}_max_10"] = (
            df[var]
            .rolling(window=10, min_periods=1)
            .max()
        )

    return df
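
With `min_periods=1`, the rolling window yields a value from the very first row onward, averaging over however many rows are available so far. This is why, in the first row of a track, the rolling mean and max equal the raw value. A small sketch on made-up numbers, using a window of 3 instead of 10 to keep the output short:

```python
import pandas as pd

# Toy series; the values are illustrative only
s = pd.Series([1.0, 3.0, 5.0, 7.0])

rolled = pd.DataFrame({
    "value": s,
    # Rows 0 and 1 use 1 and 2 values respectively, later rows use the full window
    "mean_3": s.rolling(window=3, min_periods=1).mean(),
    "max_3": s.rolling(window=3, min_periods=1).max(),
})
print(rolled)
```

The early rows of each track are therefore computed over shorter histories than later rows, which is worth keeping in mind when comparing feature values across cloud ages.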


In [35]:
def preprocess_track(
    df: pd.DataFrame,
    timestep_seconds: float = 30.0,
) -> pd.DataFrame:
    """
    Preprocess a single cloud track:
    - ensure correct ordering
    - create local frame index (0..T-1)
    - compute remaining_lifetime_s per timestep
    - optionally drop useless columns (e.g. global time stamp)
    - hook for later stats features (Task A.2)
    """
    df = df.copy()
    
    # Safety: ensure sorted by frame (global frame currently)
    if "frame" in df.columns:
        df = df.sort_values("frame").reset_index(drop=True)
    
    T = len(df)
    
    # 1) Preserve original global frame (for debugging if needed)
    if "frame" in df.columns:
        df["frame_global"] = df["frame"]
    
    # 2) Create local frame index: 0, 1, ..., T-1
    df["frame"] = np.arange(T, dtype=int)
    
    # 3) Age of the cloud at each timestep (could be useful feature)
    df["age_s"] = df["frame"] * timestep_seconds
    
    # 4) Remaining lifetime from each timestep
    #    Last timestep (frame = T-1) → 0 seconds remaining
    df["remaining_lifetime_s"] = (T - 1 - df["frame"]) * timestep_seconds
    
    # 5) Drop irrelevant columns (start minimal; drop 'time')
    cols_to_drop = ["frame"]
    if "time" in df.columns:
        cols_to_drop.append("time")
    
    # Other columns that should always be dropped:
    for c in ["feature", "feature_orig", "cell", "latitude",  "longitude"]:
        if c in df.columns:
            cols_to_drop.append(c)
    
    if cols_to_drop:
        df = df.drop(columns=cols_to_drop)
    
    # 6) Placeholder: add stats-based features for Task A.2 (snapshot + stats)
    df = add_stats_features_placeholder(df, timestep_seconds=timestep_seconds)
    
    return df


In [36]:

def load_track(csv_path_in_repo: str) -> pd.DataFrame:
    """
    Loads a single track (one CSV file) from the HF dataset, sorts it
    by 'frame' and returns the preprocessed pandas DataFrame.
    
    csv_path_in_repo: e.g. "exp_1.1/track_000001.csv"
    """
    # 1) Download the file from HF (cached locally)
    local_path = hf_hub_download(
        repo_id=REPO_ID,
        repo_type="dataset",
        filename=csv_path_in_repo,
    )
    
    # 2) Load the CSV into a DataFrame
    df = pd.read_csv(local_path)
    
    # 3) Sort by 'frame' (or by 'time' if no frame column is present)
    if "frame" in df.columns:
        df = df.sort_values("frame").reset_index(drop=True)
    elif "time" in df.columns:
        df = df.sort_values("time").reset_index(drop=True)
    else:
        raise ValueError("Neither 'frame' nor 'time' column found in track CSV.")
    
    df = preprocess_track(df)

    return df

# Test: load the first training track
example_track_file = train_files[0]
print("Example track file:", example_track_file)

track_df = load_track(example_track_file)
print("Track shape:", track_df.shape)
print("Columns:", track_df.columns[:10])  # only the first few columns
track_df.head()



Example track file: exp_1.1/cell_08094.csv
Track shape: (238, 426)
Columns: Index(['qv_L00', 'qv_L01', 'qv_L02', 'qv_L03', 'qv_L04', 'qv_L05', 'qv_L06',
       'qv_L07', 'qv_L08', 'qv_L09'],
      dtype='object')


Unnamed: 0,qv_L00,qv_L01,qv_L02,qv_L03,qv_L04,qv_L05,qv_L06,qv_L07,qv_L08,qv_L09,...,rain_gsp_rate_L00_mean_10,rain_gsp_rate_L00_max_10,lwp_L00_mean_10,lwp_L00_max_10,iwp_L00_mean_10,iwp_L00_max_10,tqc_L00_mean_10,tqc_L00_max_10,tqi_L00_mean_10,tqi_L00_max_10
0,2e-06,2e-06,2e-06,2e-06,2e-06,2e-06,3e-06,3e-06,3e-06,7e-06,...,0.001203,0.001203,4.824359,4.824359,0.149095,0.149095,4.361375,4.361375,0.054448,0.054448
1,2e-06,2e-06,2e-06,2e-06,2e-06,2e-06,3e-06,3e-06,3e-06,7e-06,...,0.00172,0.002236,5.091748,5.359137,0.149235,0.149374,4.453624,4.545874,0.053976,0.054448
2,2e-06,2e-06,2e-06,2e-06,2e-06,2e-06,3e-06,3e-06,3e-06,7e-06,...,0.002316,0.003509,5.363396,5.906694,0.149298,0.149425,4.519607,4.651571,0.053483,0.054448
3,2e-06,2e-06,2e-06,2e-06,2e-06,2e-06,3e-06,3e-06,3e-06,7e-06,...,0.002973,0.004942,5.636679,6.456526,0.149469,0.14998,4.556423,4.666873,0.053089,0.054448
4,2e-06,2e-06,2e-06,2e-06,2e-06,2e-06,3e-06,3e-06,3e-06,7e-06,...,0.003683,0.006526,5.906835,6.987462,0.149733,0.15079,4.563119,4.666873,0.052831,0.054448


In [17]:
def is_track_long_enough(df, cutoff_steps: int = 5) -> bool:
    """
    Returns True if the track has enough timesteps to be used for Task A
    (remaining lifetime prediction) with the given cutoff at the end.
    """
    T = len(df)
    # we need at least one valid t in [0, T-1-cutoff_steps]
    return T > cutoff_steps
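
The condition `T > cutoff_steps` is equivalent to the interval `[0, T-1-cutoff_steps]` being non-empty. A minimal sketch that makes the valid timesteps explicit; `valid_timesteps` is a hypothetical helper added for illustration, not a function defined in this notebook.

```python
import numpy as np


def valid_timesteps(T: int, cutoff_steps: int = 5) -> np.ndarray:
    """Indices t in [0, T-1-cutoff_steps]; empty when the track is too short."""
    return np.arange(max(T - cutoff_steps, 0))


print(len(valid_timesteps(238)), len(valid_timesteps(5)))
```

For the example track with 238 timesteps this leaves indices 0 to 232, while a track with exactly 5 timesteps has no valid index and is rejected by `is_track_long_enough`.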

In [18]:
example_track_file = train_files[0]
print("Example track file:", example_track_file)

track_df = load_track(example_track_file)

print("Track shape:", track_df.shape)
print(track_df[["frame", "frame_global", "age_s", "remaining_lifetime_s"]].head())
print(track_df[["frame", "age_s", "remaining_lifetime_s"]].tail())


Example track file: exp_1.1/cell_08094.csv
Track shape: (238, 427)
   frame  frame_global  age_s  remaining_lifetime_s
0      0          4975    0.0                7110.0
1      1          4976   30.0                7080.0
2      2          4977   60.0                7050.0
3      3          4978   90.0                7020.0
4      4          4979  120.0                6990.0
     frame   age_s  remaining_lifetime_s
233    233  6990.0                 120.0
234    234  7020.0                  90.0
235    235  7050.0                  60.0
236    236  7080.0                  30.0
237    237  7110.0                   0.0
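
The values above can be checked by hand: with 238 timesteps of 30 s each, the first row has (238 - 1 - 0) * 30 = 7110 s remaining and the last row has 0 s. The sketch below repeats that calculation with plain NumPy, independent of the track data; `T` and `dt` are taken from the output shown above.

```python
import numpy as np

T, dt = 238, 30.0                 # track length and timestep from the example above
frame = np.arange(T)              # local frame index 0..T-1

age_s = frame * dt
remaining_lifetime_s = (T - 1 - frame) * dt

# Age plus remaining lifetime is constant along the track
total = age_s + remaining_lifetime_s
print(remaining_lifetime_s[0], remaining_lifetime_s[-1], total[0])
```

Since age and remaining lifetime always sum to the same constant per track, any feature that encodes the total track length also determines the target.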


In [19]:
track_df

Unnamed: 0,frame,qv_L00,qv_L01,qv_L02,qv_L03,qv_L04,qv_L05,qv_L06,qv_L07,qv_L08,...,rain_gsp_rate_L00_mean_10,rain_gsp_rate_L00_max_10,lwp_L00_mean_10,lwp_L00_max_10,iwp_L00_mean_10,iwp_L00_max_10,tqc_L00_mean_10,tqc_L00_max_10,tqi_L00_mean_10,tqi_L00_max_10
0,0,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000003,0.000003,0.000003,...,0.001203,0.001203,4.824359,4.824359,0.149095,0.149095,4.361375,4.361375,0.054448,0.054448
1,1,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000003,0.000003,0.000003,...,0.001720,0.002236,5.091748,5.359137,0.149235,0.149374,4.453624,4.545874,0.053976,0.054448
2,2,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000003,0.000003,0.000003,...,0.002316,0.003509,5.363396,5.906694,0.149298,0.149425,4.519607,4.651571,0.053483,0.054448
3,3,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000003,0.000003,0.000003,...,0.002973,0.004942,5.636679,6.456526,0.149469,0.149980,4.556423,4.666873,0.053089,0.054448
4,4,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000003,0.000003,0.000003,...,0.003683,0.006526,5.906835,6.987462,0.149733,0.150790,4.563119,4.666873,0.052831,0.054448
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
233,233,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000003,0.000004,0.000004,...,0.018179,0.018942,10.826949,12.529528,15.464530,16.487701,3.675286,4.159562,0.398244,0.443092
234,234,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000003,0.000004,0.000004,...,0.017415,0.018942,10.307621,12.390125,15.251093,16.206348,3.546093,4.097106,0.393209,0.431623
235,235,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000003,0.000004,0.000004,...,0.016651,0.018942,9.728088,12.149268,15.039826,15.964456,3.402563,4.006101,0.389926,0.419364
236,236,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000003,0.000004,0.000004,...,0.015887,0.018942,9.100243,11.815310,14.831469,15.745263,3.247639,3.890110,0.387906,0.403839


### Save config settings

In [None]:

import json
from datetime import datetime

def save_json_with_timestamp(model_name, params_dict):
    """
    Saves a JSON file with a timestamp in the filename.
    
    Args:
        model_name: Name of the model
        params_dict: Dictionary containing hyperparameters and performance statistics
    
    Returns:
        str: Path to the saved file
    """
    # Current date and time in format: YYYY-MM-DD_HH-MM-SS
    timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    
    # Compose filename
    filename = f"{model_name}_{timestamp}.json"
    
    # Use the provided dictionary directly
    data = params_dict
    
    # Save JSON file
    with open(filename, 'w', encoding='utf-8') as f:
        json.dump(data, f, indent=4, ensure_ascii=False)
    
    print(f"Hyperparameters and statistics saved: {filename}")
    return filename






In [None]:
## Test run
params = {
    "Param1": "Value1",
    "Param2": 42,
    "Param3": [1, 2, 3],
}

config_file = save_json_with_timestamp(
    model_name="MyModel",
    params_dict=params,
)


Hyperparameters and statistics saved: MyModel_2026-01-15_19-32-45.json
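
To confirm that a saved config can be restored unchanged, it can be read back with `json.load`. The following is a sketch under assumptions: `load_config` is a hypothetical counterpart to `save_json_with_timestamp`, and the file is written to a temporary directory with a made-up name so the example leaves nothing behind.

```python
import json
import tempfile
from pathlib import Path


def load_config(path) -> dict:
    """Read a JSON config file back into a dictionary."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


params = {"Param1": "Value1", "Param2": 42, "Param3": [1, 2, 3]}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "MyModel_example.json"  # illustrative filename
    # Same serialization settings as in save_json_with_timestamp
    path.write_text(json.dumps(params, indent=4, ensure_ascii=False), encoding="utf-8")
    restored = load_config(path)

print(restored == params)
```

Note that JSON only round-trips basic types; tuples come back as lists and NumPy scalars must be converted to Python numbers before saving.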


### Upload config to Hugging Face

In [44]:
upload_file(
    path_or_fileobj=config_file,
    path_in_repo=f"configs/{config_file}",   # keep configs in their own folder in the repo
    repo_id=REPO_ID,
    repo_type="dataset",
    commit_message=f"Add config {config_file}",
)

CommitInfo(commit_url='https://huggingface.co/datasets/mttfst/Paulette_Cloud_Tracks/commit/ab5d8a92212fc1598146b4359225de8c04259937', commit_message='Add config MyModel_2026-01-15_19-32-45.json', commit_description='', oid='ab5d8a92212fc1598146b4359225de8c04259937', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/mttfst/Paulette_Cloud_Tracks', endpoint='https://huggingface.co', repo_type='dataset', repo_id='mttfst/Paulette_Cloud_Tracks'), pr_revision=None, pr_num=None)

## Model Selection

[Discuss the type(s) of models you consider for this task, and justify the selection.]



## Feature Engineering

[Describe any additional feature engineering you've performed beyond what was done for the baseline model.]


In [None]:
# Load the dataset
# Replace 'your_dataset.csv' with the path to your actual dataset
df = pd.read_csv('your_dataset.csv')

# Perform any feature engineering steps
# Example: df['new_feature'] = df['feature1'] + df['feature2']

# Feature and target variable selection
X = df[['your', 'selected', 'features']]
y = df['target_variable']

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
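
The template cell above splits individual rows at random, whereas the files earlier in this notebook were split at track level, so rows from one cloud never appear in both training and test data. A rough sketch of how per-track tables could be stacked into feature and target arrays under that scheme. Everything here is an assumption for illustration: `make_toy_track` generates synthetic data and `stack_tracks` is a hypothetical helper; in the notebook the tracks would come from `load_track`.

```python
import numpy as np
import pandas as pd


def make_toy_track(T: int, dt: float = 30.0) -> pd.DataFrame:
    """Synthetic stand-in for one preprocessed track (values are made up)."""
    frame = np.arange(T)
    return pd.DataFrame({
        "area_m2": 1e6 + 1e4 * frame,
        "age_s": frame * dt,
        "remaining_lifetime_s": (T - 1 - frame) * dt,
    })


def stack_tracks(tracks, target: str = "remaining_lifetime_s"):
    """Concatenate tracks and separate the features from the target column."""
    table = pd.concat(tracks, keys=range(len(tracks)), names=["track_id", "step"])
    return table.drop(columns=[target]), table[target]


# Each split is built from its own list of tracks, never from shared rows
X_train, y_train = stack_tracks([make_toy_track(4), make_toy_track(6)])
X_test, y_test = stack_tracks([make_toy_track(5)])
print(X_train.shape, X_test.shape)
```

Keeping `track_id` in the index makes it possible to group predictions by cloud later, for example to report errors per track.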


## Hyperparameter Tuning

[Discuss any hyperparameter tuning methods you've applied, such as Grid Search or Random Search, and the rationale behind them.]


In [None]:
# Implement hyperparameter tuning
# Example using GridSearchCV with a RandomForestRegressor (this is a regression task)
# from sklearn.model_selection import GridSearchCV
# param_grid = {'max_depth': [2, 4, 6, 8]}
# grid_search = GridSearchCV(RandomForestRegressor(random_state=42), param_grid, cv=5)
# grid_search.fit(X_train, y_train)


## Implementation

[Implement the final model(s) you've selected based on the above steps.]


In [None]:
# Implement the final model(s)
# Example: model = YourChosenModel(best_hyperparameters)
# model.fit(X_train, y_train)


## Evaluation Metrics

[Clearly specify which metrics you'll use to evaluate the model performance, and why you've chosen these metrics.]


In [None]:
# Evaluate the model using your chosen metrics
# Example for classification
# y_pred = model.predict(X_test)
# print(classification_report(y_test, y_pred))

# Example for regression
# y_pred = model.predict(X_test)
# mae = mean_absolute_error(y_test, y_pred)
# mse = mean_squared_error(y_test, y_pred)

# Your evaluation code here
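
Assuming the metrics imported at the top of the notebook (`mean_absolute_error`, `mean_squared_error`) are the ones to be reported, the sketch below shows what they compute, written with plain NumPy so it runs without a trained model. The target and prediction values are made up for illustration and are not results from this project.

```python
import numpy as np


def mae(y_true, y_pred) -> float:
    """Mean absolute error, in the same unit as the target (seconds here)."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))


def rmse(y_true, y_pred) -> float:
    """Root mean squared error; penalizes large misses more strongly than MAE."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))


# Illustrative remaining lifetimes in seconds (not real model output)
y_true = np.array([7110.0, 3600.0, 900.0, 0.0])
y_pred = np.array([6810.0, 3900.0, 600.0, 300.0])

print(mae(y_true, y_pred), rmse(y_true, y_pred))
```

Both metrics are expressed in seconds, which makes them directly comparable to the 30 s timestep of the tracks. When all errors have the same magnitude, as in this toy case, MAE and RMSE coincide; a gap between them indicates a few large errors.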


## Comparative Analysis

[Compare the performance of your model(s) against the baseline model. Discuss any improvements or setbacks and the reasons behind them.]


In [None]:
# Comparative Analysis code (if applicable)
# Example: comparing the MAE of the baseline model and the new model (regression task)
# print(f"Baseline Model MAE: {baseline_mae}, New Model MAE: {new_model_mae}")
