This notebook cleans the METAR data from OpenSky and saves the cleaned dataset as a CSV. Important: the dataset must NOT be modified beforehand. The raw data must be imported as a CSV.

In a first step, all adjustments are defined as a single function. The individual comments describe what Data Wrangler does in each step.

In [20]:
import pandas as pd

def clean_data(df):
    
    # Derive column 'wind_dir_cleaned' from column: 'wind_dir'
    # Transform based on the following examples:
    # wind_dir         Output
    # 1: "170 degrees" => "170"
    df.insert(8, "wind_dir_cleaned", df["wind_dir"].str.split(" ").str[0])


    # Variable wind is stored as NaN in the dataset, so this new column flags whether the wind is variable (1) or not (0)
    df["is_wind_variable"] = df["wind_dir_cleaned"].isna().astype(int)


    # Drop rows with missing data in column: 'wind_speed'
    df = df.dropna(subset=['wind_speed'])


    # Derive column 'wind_speed_cleaned' from column: 'wind_speed'
    # Transform based on the following examples:
    #    wind_speed    Output
    # 1: "3 knots"  => "3"
    df.insert(10, "wind_speed_cleaned", df["wind_speed"].str.split(" ").str[0])


    # Drop rows with missing data in column: 'vis'
    df = df.dropna(subset=['vis'])


    # Derive column 'vis_cleaned' from column: 'vis'
    def vis_cleaned(vis):
        """
        Transform based on the following examples:
           vis                            Output
        1: "5000 meters"               => "5000"
        2: "greater than 10000 meters" => "99999" -> We use 99999 to indicate "greater than 10000"
        """
        # Count the spaces to tell the two formats apart
        n_spaces = vis.count(" ")
        if n_spaces == 1:
            return vis.split(" ")[0]
        if n_spaces == 3:
            return "99999"
        return None
    df.insert(15, "vis_cleaned", df["vis"].map(vis_cleaned))


    # Derive column 'temp_cleaned' from column: 'temp'
    # Transform based on the following examples:
    #    temp        Output
    # 1: "10.0 C" => "10"
    df.insert(20, "temp_cleaned", df["temp"].str.split(".").str[0])


    # Derive column 'dewpt_cleaned' from column: 'dewpt'
    # Transform based on the following examples:
    #    dewpt      Output
    # 1: "6.0 C" => "6"
    df.insert(22, "dewpt_cleaned", df["dewpt"].str.split(".").str[0])


    # Derive column 'press_cleaned' from column: 'press'
    # Transform based on the following examples:
    #    press          Output
    # 1: "1023.0 mb" => "1023"
    df.insert(24, "press_cleaned", df["press"].str.split(".").str[0])


    # Change column type to int64 for column: 'is_wind_variable'
    df = df.astype({'is_wind_variable': 'int64'})


    # Change column type to float64 for all derived '*_cleaned' columns
    df = df.astype({
        'wind_dir_cleaned': 'float64',
        'wind_speed_cleaned': 'float64',
        'vis_cleaned': 'float64',
        'temp_cleaned': 'float64',
        'dewpt_cleaned': 'float64',
        'press_cleaned': 'float64',
    })
    return df
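
To make the individual transformations easier to follow, here is a minimal sketch that applies the same splitting logic to a small made-up DataFrame. The sample values and the helper name `vis_to_number` are assumptions for illustration only; they are not taken from the real METAR file.

```python
import pandas as pd

# Hypothetical sample rows in the string formats described in the comments above.
sample = pd.DataFrame({
    "wind_dir": ["170 degrees", None],
    "vis": ["5000 meters", "greater than 10000 meters"],
    "temp": ["10.0 C", "-2.0 C"],
})

# Keep the token before the first space; missing directions stay NaN.
sample["wind_dir_cleaned"] = sample["wind_dir"].str.split(" ").str[0]
sample["is_wind_variable"] = sample["wind_dir_cleaned"].isna().astype(int)


def vis_to_number(vis):
    # One space: "<value> meters". Three spaces: "greater than 10000 meters".
    spaces = vis.count(" ")
    if spaces == 1:
        return vis.split(" ")[0]
    if spaces == 3:
        return "99999"
    return None


sample["vis_cleaned"] = sample["vis"].map(vis_to_number)

# Keep the part before the decimal point, as clean_data does.
sample["temp_cleaned"] = sample["temp"].str.split(".").str[0]

# Convert the extracted strings to numbers.
cleaned_cols = ["wind_dir_cleaned", "vis_cleaned", "temp_cleaned"]
sample = sample.astype({col: "float64" for col in cleaned_cols})
print(sample[cleaned_cols + ["is_wind_variable"]])
```

Note that splitting on "." keeps only the integer part of a value, so this approach relies on the raw values having no meaningful decimal places.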



What remains is to define 1) the path the data is read from and 2) where it is then saved as a CSV.

In [21]:
# Path where raw data is stored
df = pd.read_csv('../data/raw/metar_lszh_2023-2025.csv')

# Clean a copy so the raw DataFrame stays untouched
df_clean = clean_data(df.copy())

# Output path where cleaned data should be stored
output_path = "../data/processed/clean_metar_lszh_2023-2025.csv"

# Save cleaned dataframe to CSV
df_clean.to_csv(output_path, index=False)

print(f" Cleaned file saved to: {output_path}")

 Cleaned file saved to: ../data/processed/clean_metar_lszh_2023-2025.csv
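
As an optional follow-up to the save step, the sketch below shows one way to make sure the output folder exists and to read the file back as a quick sanity check. It uses a temporary directory and a small stand-in DataFrame, so the file name `clean_metar_example.csv` and the sample values are assumptions, not part of the actual pipeline.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for df_clean with two of the cleaned columns.
df_clean = pd.DataFrame({
    "wind_speed_cleaned": [3.0, 12.0],
    "is_wind_variable": [0, 1],
})

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical output location; the notebook uses ../data/processed/.
    output_path = Path(tmp) / "processed" / "clean_metar_example.csv"

    # to_csv does not create missing folders, so create them first.
    output_path.parent.mkdir(parents=True, exist_ok=True)
    df_clean.to_csv(output_path, index=False)

    # Read the file back to confirm shape and column types survived.
    df_check = pd.read_csv(output_path)

print(df_check.dtypes)
```

In the notebook itself, the same `mkdir` call on the parent of `output_path` would avoid an error when `../data/processed/` does not exist yet.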
