<a href="https://colab.research.google.com/github/rjanow/Masterarbeit/blob/main/0_DataCleaning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Predicting UVI with LSTMs

[Notebook 0: Data Cleaning](./0_DataCleaning.ipynb)

[Notebook 1: EDA and Cleaning](./1_EDA and Cleaning.ipynb)

[Notebook 2: Modeling and Predictions](./2_Modeling and Predictions.ipynb)

[Notebook 3: Technical Report](./3_Technical_Report.ipynb)

## General Settings

In [4]:
pip install pvlib



In this notebook, the recorded UVI measurements are processed further and prepared for training.

- Read in the UVI values
- Replace missing measurements
- Read in the additional input values
- EDA (exploratory data analysis)

In [5]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [6]:
# Import the required modules

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import pvlib

from datetime import datetime
from datetime import timedelta

import matplotlib
import seaborn as sns

In [7]:
latitude = 50.8
longitude = 7.2

seconds_in_day = 24*60*60
seconds_in_year = (365.2425)*seconds_in_day

In [21]:
# Path to the CSV files with the UVI measurements on Google Drive
drive_path = '/content/drive/My Drive/Colab_Notebooks/CSV_UVI/'
pickle_path = '/content/drive/My Drive/Colab_Notebooks/CSV_Vorhersage/'

## Importing the UVI measurement data

The measurement data are stored in CSV files, which have to be imported first.

In [9]:
# Import the measurement data
file_list = ['22.06', '22.07', '22.08', '22.10', '22.11', '22.12', '23.01', '23.02', '23.03', '23.04', '23.05']  # Months to import
dataframes = []

for filename in file_list:
    file_path = drive_path + filename
    df_import = pd.read_csv(file_path)
    dataframes.append(df_import)

df_UVI_combined = pd.concat(dataframes, ignore_index=True)
df_UVI_combined['Datetime'] = pd.to_datetime(df_UVI_combined['Datetime'])

In [10]:
# Display the DataFrame
df_UVI_combined

Unnamed: 0,Datetime,Datum,Uhrzeit,Messzeitpunkt,erythem,UVI
0,2022-06-15 07:21:00,2022-06-15,07:21:00,26460,0.060209,2.408378
1,2022-06-15 07:23:00,2022-06-15,07:23:00,26580,0.061560,2.462381
2,2022-06-15 07:25:00,2022-06-15,07:25:00,26700,0.061976,2.479048
3,2022-06-15 07:27:00,2022-06-15,07:27:00,26820,0.063588,2.543531
4,2022-06-15 07:29:00,2022-06-15,07:29:00,26940,0.064412,2.576485
...,...,...,...,...,...,...
100757,2023-05-26 09:04:00,2023-05-26,09:04:00,32640,0.117537,4.701465
100758,2023-05-26 09:06:00,2023-05-26,09:06:00,32760,0.118624,4.744953
100759,2023-05-26 09:08:00,2023-05-26,09:08:00,32880,0.094757,3.790279
100760,2023-05-26 09:10:00,2023-05-26,09:10:00,33000,,


## Cleaning the measurement data

The following steps are required to clean the measurement data:

- Missing measurement days must be replaced
  - Check whether the measurements are contiguous

**Here we check whether the measurements are contiguous:**

In [11]:
def insert_missing_rows(df):
    # Sort by 'Datetime'; this returns a new DataFrame, so the caller's data stays untouched
    df = df.sort_values(by='Datetime')

    # Collect the placeholder rows for missing measurements
    rows_to_insert = []

    # Check each measurement day ('Datum') separately
    grouped = df.groupby('Datum')

    for date, group in grouped:
        # The group is already ordered by 'Datetime' because df was sorted above
        for i in range(1, len(group)):
            current_time = group.iloc[i]['Datetime']
            prev_time = group.iloc[i - 1]['Datetime']
            time_diff = current_time - prev_time

            # Fill every gap longer than 2 minutes with zero-valued rows on a 2-minute grid
            if time_diff > timedelta(minutes=2):
                while prev_time + timedelta(minutes=2) < current_time:
                    prev_time += timedelta(minutes=2)
                    new_row = {
                        'Datetime': prev_time,
                        'Datum': date,
                        'Uhrzeit': prev_time.time(),
                        'Messzeitpunkt': (prev_time - prev_time.replace(hour=0, minute=0, second=0, microsecond=0)).total_seconds(),
                        'erythem': 0,
                        'UVI': 0,
                        'DiffGreater2': 1,
                    }
                    rows_to_insert.append(new_row)

    # Append the placeholder rows (DataFrame.append was removed in pandas 2.0)
    if rows_to_insert:
        df = pd.concat([df, pd.DataFrame(rows_to_insert)], ignore_index=True)
    else:
        # No gaps found: create the flag column so the fillna below does not fail
        df = df.assign(DiffGreater2=0.0)

    # Sort again by 'Datetime' so the inserted rows end up in the right place
    df = df.sort_values(by='Datetime')
    df = df.reset_index(drop=True)
    df['DiffGreater2'] = df['DiffGreater2'].fillna(0)

    return df

In [12]:
df_UVI_WRows = insert_missing_rows(df_UVI_combined)
len(df_UVI_WRows)


101741

In [13]:
df_UVI_WRows

Unnamed: 0,Datetime,Datum,Uhrzeit,Messzeitpunkt,erythem,UVI,DiffGreater2
0,2022-06-15 07:21:00,2022-06-15,07:21:00,26460.0,0.060209,2.408378,0.0
1,2022-06-15 07:23:00,2022-06-15,07:23:00,26580.0,0.061560,2.462381,0.0
2,2022-06-15 07:25:00,2022-06-15,07:25:00,26700.0,0.061976,2.479048,0.0
3,2022-06-15 07:27:00,2022-06-15,07:27:00,26820.0,0.063588,2.543531,0.0
4,2022-06-15 07:29:00,2022-06-15,07:29:00,26940.0,0.064412,2.576485,0.0
...,...,...,...,...,...,...,...
101736,2023-05-26 09:04:00,2023-05-26,09:04:00,32640.0,0.117537,4.701465,0.0
101737,2023-05-26 09:06:00,2023-05-26,09:06:00,32760.0,0.118624,4.744953,0.0
101738,2023-05-26 09:08:00,2023-05-26,09:08:00,32880.0,0.094757,3.790279,0.0
101739,2023-05-26 09:10:00,2023-05-26,09:10:00,33000.0,,,0.0
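
The gap filling shown above can also be written without an explicit loop over rows. The following is only a minimal sketch and not the notebook's method: it assumes that the timestamps within a day are unique, the helper name `fill_gaps_on_grid` and the toy data are hypothetical, and the columns `Datum`, `Uhrzeit` and `Messzeitpunkt` are left out for brevity (they could be derived from `Datetime` afterwards).

```python
import pandas as pd

def fill_gaps_on_grid(df, freq="2min"):
    """Insert zero-valued placeholder rows for missing 2-minute steps, day by day."""
    filled_days = []
    for _, day in df.groupby(df["Datetime"].dt.date):
        day = day.set_index("Datetime").sort_index()
        original_index = day.index
        # Regular grid between the first and last measurement of this day
        grid = pd.date_range(original_index.min(), original_index.max(), freq=freq)
        # The union keeps measurements that happen to lie off the grid
        day = day.reindex(grid.union(original_index))
        # Rows that were not in the original data are the inserted placeholders
        inserted = ~day.index.isin(original_index)
        day.loc[inserted, ["erythem", "UVI"]] = 0
        day["DiffGreater2"] = inserted.astype(int)
        filled_days.append(day)
    return pd.concat(filled_days).rename_axis("Datetime").reset_index()

# Toy data (hypothetical values) with a gap between 07:23 and 07:29
toy = pd.DataFrame({
    "Datetime": pd.to_datetime(["2022-06-15 07:21", "2022-06-15 07:23", "2022-06-15 07:29"]),
    "erythem": [0.060, 0.061, 0.064],
    "UVI": [2.41, 2.46, 2.58],
})
filled = fill_gaps_on_grid(toy)
print(filled)
```

In this toy example the rows for 07:25 and 07:27 are added with `UVI = 0` and `DiffGreater2 = 1`, which mirrors the placeholder convention used in `insert_missing_rows`.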


## Adding the solar zenith angle

In [14]:
def calculate_solar_zenith_angle(dataframe, date_column, latitude, longitude, altitude=0):

    # Work on a copy so the original DataFrame is not modified.
    result_df = dataframe.copy()

    # Convert the date column to a datetime dtype if it is not one already.
    if not pd.api.types.is_datetime64_any_dtype(dataframe[date_column]):
        result_df[date_column] = pd.to_datetime(dataframe[date_column])

    # Compute the solar position for all timestamps in a single call instead of row by row.
    # pvlib assumes UTC for timestamps without timezone information.
    times = pd.DatetimeIndex(result_df[date_column])
    solar_position = pvlib.solarposition.get_solarposition(times, latitude, longitude, altitude=altitude)

    # Add the computed solar zenith angle (in degrees) to the DataFrame.
    result_df['SolarZenithAngle'] = solar_position['zenith'].to_numpy()

    return result_df

In [15]:
df_UVI_WRows_SZ = calculate_solar_zenith_angle(df_UVI_WRows, 'Datetime', latitude, longitude)

In [16]:
df_UVI_WRows_SZ

Unnamed: 0,Datetime,Datum,Uhrzeit,Messzeitpunkt,erythem,UVI,DiffGreater2,SolarZenithAngle
0,2022-06-15 07:21:00,2022-06-15,07:21:00,26460.0,0.060209,2.408378,0.0,55.032236
1,2022-06-15 07:23:00,2022-06-15,07:23:00,26580.0,0.061560,2.462381,0.0,54.717711
2,2022-06-15 07:25:00,2022-06-15,07:25:00,26700.0,0.061976,2.479048,0.0,54.403414
3,2022-06-15 07:27:00,2022-06-15,07:27:00,26820.0,0.063588,2.543531,0.0,54.089361
4,2022-06-15 07:29:00,2022-06-15,07:29:00,26940.0,0.064412,2.576485,0.0,53.775570
...,...,...,...,...,...,...,...,...
101736,2023-05-26 09:04:00,2023-05-26,09:04:00,32640.0,0.117537,4.701465,0.0,40.909176
101737,2023-05-26 09:06:00,2023-05-26,09:06:00,32760.0,0.118624,4.744953,0.0,40.644895
101738,2023-05-26 09:08:00,2023-05-26,09:08:00,32880.0,0.094757,3.790279,0.0,40.382418
101739,2023-05-26 09:10:00,2023-05-26,09:10:00,33000.0,,,0.0,40.121787


## Encoding time and date as sine and cosine

In [17]:
def calculate_date_in_sine_cosine(dataframe, day, year):

    result_df = dataframe.copy()

    # Time of day: 'Messzeitpunkt' holds the seconds since midnight
    result_df['time_sin'] = np.sin(2*np.pi*result_df['Messzeitpunkt']/day)
    result_df['time_cos'] = np.cos(2*np.pi*result_df['Messzeitpunkt']/day)

    # Time of year in seconds; the factor 2*pi has to apply to the whole sum,
    # not only to the day-of-year term
    seconds_of_year = (result_df['Datetime'].dt.dayofyear * 24 * 60 * 60
                       + result_df['Datetime'].dt.hour * 60 * 60
                       + result_df['Datetime'].dt.minute * 60)
    result_df['date_sin'] = np.sin(2*np.pi*seconds_of_year/year)
    result_df['date_cos'] = np.cos(2*np.pi*seconds_of_year/year)

    return result_df

In [18]:
df_UVI_SZ_SC = calculate_date_in_sine_cosine(df_UVI_WRows_SZ, seconds_in_day, seconds_in_year)
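
To illustrate why time and date are encoded as sine and cosine, the following minimal sketch maps the time of day onto the unit circle, so that 23:58 ends up close to 00:00 even though the raw second counts differ by almost a full day. The helper `cyclic_encode` and the sample values are hypothetical and only serve as an illustration.

```python
import numpy as np

SECONDS_IN_DAY = 24 * 60 * 60

def cyclic_encode(value, period):
    """Map a periodic quantity onto the unit circle as a (sin, cos) pair."""
    angle = 2 * np.pi * np.asarray(value, dtype=float) / period
    return np.sin(angle), np.cos(angle)

# Seconds since midnight for 00:00, 12:00 and 23:58
seconds = np.array([0, 12 * 3600, 23 * 3600 + 58 * 60])
time_sin, time_cos = cyclic_encode(seconds, SECONDS_IN_DAY)

# Euclidean distance in the (sin, cos) plane relative to midnight
dist_midnight = np.hypot(time_sin[2] - time_sin[0], time_cos[2] - time_cos[0])
dist_noon = np.hypot(time_sin[1] - time_sin[0], time_cos[1] - time_cos[0])

print(f"23:58 vs 00:00: {dist_midnight:.4f}")
print(f"12:00 vs 00:00: {dist_noon:.4f}")
```

Both components are needed: the sine alone cannot distinguish 00:00 from 12:00, because it is zero at both times.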

## Importing the forecast data and deleting unneeded entries

In [22]:
dateiname = 'pickle_Cams_2M'
df_cams_2m = pd.read_pickle(pickle_path + dateiname)

In [23]:
df_cams_2m

time,aod469,aod550,aod670,aod865,uvbed,uvbedcs,hcc,lcc,mcc,tcc,cbh
1900-01-01 00:00:00,6.825462e-310,6.825462e-310,6.825462e-310,6.825462e-310,6.825462e-310,6.825462e-310,6.825462e-310,6.825462e-310,6.825462e-310,6.825462e-310,6.825462e-310
1900-01-01 00:02:00,8.672378e-09,6.988984e-09,5.163832e-09,2.579854e-10,4.310954e-25,4.310954e-25,6.825462e-310,1.394555e-09,1.142189e-08,1.154514e-08,6.582474e-05
1900-01-01 00:04:00,1.734476e-08,1.397797e-08,1.032766e-08,5.159709e-10,8.621907e-25,8.621907e-25,6.825462e-310,2.789111e-09,2.284378e-08,2.309027e-08,1.316495e-04
1900-01-01 00:06:00,2.601713e-08,2.096695e-08,1.549150e-08,7.739563e-10,1.293286e-24,1.293286e-24,6.825461e-310,4.183666e-09,3.426567e-08,3.463541e-08,1.974742e-04
1900-01-01 00:08:00,3.468951e-08,2.795594e-08,2.065533e-08,1.031942e-09,1.724381e-24,1.724381e-24,6.825461e-310,5.578221e-09,4.568757e-08,4.618054e-08,2.632989e-04
...,...,...,...,...,...,...,...,...,...,...,...
2023-09-30 22:52:00,2.211528e-01,1.795828e-01,1.326945e-01,5.868408e-03,0.000000e+00,0.000000e+00,1.656550e-01,0.000000e+00,2.563594e-04,1.658716e-01,
2023-09-30 22:54:00,2.204093e-01,1.789766e-01,1.322410e-01,5.867872e-03,0.000000e+00,0.000000e+00,1.720263e-01,0.000000e+00,2.380480e-04,1.722232e-01,
2023-09-30 22:56:00,2.196659e-01,1.783705e-01,1.317874e-01,5.867335e-03,0.000000e+00,0.000000e+00,1.783977e-01,0.000000e+00,2.197366e-04,1.785747e-01,
2023-09-30 22:58:00,2.189224e-01,1.777643e-01,1.313339e-01,5.866799e-03,0.000000e+00,0.000000e+00,1.847690e-01,0.000000e+00,2.014252e-04,1.849262e-01,


## Saving the DataFrame as a pickle

In [None]:
dateiname = 'pickle_Cams_2M'
df_cams_interpolated.to_pickle(pickle_path + dateiname)

## First plots of the measurement data

In [None]:
# Plot the measurement data, one figure per day
def plot_data_per_day(dataframe, date_column, value_column, x_column, dates, save_path):
    for date in dates:
        subset = dataframe[dataframe[date_column] == date]

        plt.figure(figsize=(10, 6))
        ax = sns.lineplot(data=subset, x=x_column, y=value_column)

        #interval = 2  # Interval in hours
        #ax.xaxis.set_major_locator(mdates.HourLocator(interval=interval))

        plt.xticks(rotation=45)
        plt.title(f'UVI over the course of {date}')
        plt.xlabel('Time (UTC)')
        plt.ylabel('UVI')
        plt.tight_layout()
        # plt.show()

        plot_filename = f'{date}.png'
        plot_filepath = save_path + plot_filename
        plt.savefig(plot_filepath)  # Save the plot
        plt.close()  # Close the figure to free resources

In [None]:
# Build the list of dates that should be plotted
def generate_dates_to_plot(start_date, end_date):
    date_range = []
    current_date = start_date

    while current_date <= end_date:
        date_range.append(current_date.strftime('%Y-%m-%d'))
        current_date += timedelta(days=1)

    return date_range
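
The same list of dates can also be produced with pandas directly. This is a small sketch of an alternative, not the notebook's own helper; the function name `dates_between` is hypothetical.

```python
import pandas as pd

def dates_between(start_date, end_date):
    """Return all days from start_date to end_date (inclusive) as 'YYYY-MM-DD' strings."""
    return pd.date_range(start_date, end_date, freq="D").strftime("%Y-%m-%d").tolist()

example_dates = dates_between("2022-06-15", "2022-06-17")
print(example_dates)
```

The strings have the same `'%Y-%m-%d'` format as the output of `generate_dates_to_plot`, so they can be compared against the `Datum` column in the same way.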

In [None]:
# Create the list of dates
start_date = datetime(2022, 6, 15)
end_date = datetime(2022, 6, 15)

dates_to_plot = generate_dates_to_plot(start_date, end_date)

In [None]:
# Output directory for the daily measurement plots
daily_plots_path = '/content/drive/My Drive/Colab_Notebooks/plot_daily_UVI/'

In [None]:
# Call the plotting function for the measurement data
plot_data_per_day(df_UVI_combined, 'Datum', 'UVI', 'Uhrzeit', dates_to_plot, daily_plots_path)