<a href="https://colab.research.google.com/github/sudo-Oliver/Predictive-Analytics-Private/blob/main/notebooks/LSTM%20Preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**1. Load and prepare the data**
1. Load the data into a DataFrame
2. Convert the time column (Unix timestamp -> datetime)
3. Group by homeid (each household has its own time series)
4. Sort by time within each household

In [5]:
import pandas as pd
import numpy as np
import os
from pathlib import Path
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
import tensorflow as tf
import gdown
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

In [8]:
def load_processed_data():
    """Load preprocessed sensor data with fallback to Drive download"""
    # Google Drive file ID of the preprocessed dataset
    file_id = "1KHQCVfwTxm5bjjITS8WMm9P3M12ETVsR"

    # Create download path
    download_path = Path('data/processed')
    download_path.mkdir(parents=True, exist_ok=True)
    file_path = download_path / 'final_processed_data3.parquet'

    # Download if not exists
    if not file_path.exists():
        print("Downloading from Google Drive...")
        url = f"https://drive.google.com/uc?id={file_id}"
        gdown.download(url, str(file_path), quiet=False)

    # Load and verify data
    if file_path.exists():
        df = pd.read_parquet(file_path)
        print(f"Data loaded successfully: {df.shape} rows")
        return df
    else:
        raise FileNotFoundError("Could not load or download data file")

# Load the data (cleaning happens once clean_data is defined in the cell below)
df = load_processed_data()


Downloading from Google Drive...


Downloading...
From: https://drive.google.com/uc?id=1KHQCVfwTxm5bjjITS8WMm9P3M12ETVsR
To: /content/data/processed/final_processed_data3.parquet
100%|██████████| 104M/104M [00:01<00:00, 84.7MB/s] 


Data loaded successfully: (1641653, 23) rows


In [10]:
def clean_data(df):
    """Clean and preprocess sensor data"""
    # Convert Unix timestamp to datetime
    df['timestamp_local'] = pd.to_datetime(df['timestamp_local'], unit='ms')

    # Set timestamp_local as index
    df.set_index('timestamp_local', inplace=True)

    # Sort by homeid and timestamp_local
    df = df.sort_values(by=['homeid', 'timestamp_local'])

    # Remove specified columns
    columns_to_drop = [
        'sensorid', 'median_temperature', '_room',
        'sensorid_room', 'measured_entity',
        'sensorid_electric', 'sensorid_gas'
    ]
    df = df.drop(columns=columns_to_drop)

    return df

# Load and clean data
df = load_processed_data()
df_clean = clean_data(df.copy())

df_clean.head()

Data loaded successfully: (1641653, 23) rows


timestamp_local,homeid,electric_min_consumption,electric_max_consumption,std_consumption,electric_median_consumption,electric_total_consumption_Wh,gas_mean_consumption,gas_min_consumption,gas_max_consumption,gas_median_consumption,gas_total_consumption_Wh,median_value,roomid,income_band_mid,education_map
2016-09-20 09:00:00,47,0.069,0.335,0.033905,0.194,0.179807,0.112,0.112,0.112,0.112,0.224,20.72,652.0,0.0,8.0
2016-09-20 10:00:00,47,0.068875,0.458375,0.035875,0.187625,0.17669,0.112,0.112,0.112,0.112,0.21,20.695,652.0,0.0,8.0
2016-09-20 11:00:00,47,0.06875,0.58175,0.037846,0.18125,0.173574,0.112,0.112,0.112,0.112,0.196,20.67,652.0,0.0,8.0
2016-09-20 12:00:00,47,0.068625,0.705125,0.039817,0.174875,0.170457,0.112,0.112,0.112,0.112,0.182,20.645,652.0,0.0,8.0
2016-09-20 13:00:00,47,0.0685,0.8285,0.041788,0.1685,0.16734,0.112,0.112,0.112,0.112,0.168,20.62,652.0,0.0,8.0


**2. Feature engineering & data cleaning**
1. Cyclical transformation of time data (hour_sin, hour_cos for the hour of day)
2. Create lag features (for previous electricity and gas values)
3. Rolling-average features (e.g. a moving average over 3 or 7 time steps)
4. Normalize the data (min-max scaling for the LSTM)
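
A minimal sketch of why the cyclical transformation in step 1 helps, using only NumPy. The helper names `encode_hour` and `distance` are hypothetical and not part of the notebook. The raw hour values 23 and 0 are 23 units apart, but after the sine/cosine encoding they are as close as any other pair of adjacent hours.

```python
import numpy as np

def encode_hour(hour):
    """Map an hour of day (0-23) to a point on the unit circle."""
    angle = 2 * np.pi * hour / 24
    return np.array([np.sin(angle), np.cos(angle)])

def distance(hour_a, hour_b):
    """Euclidean distance between two encoded hours."""
    return float(np.linalg.norm(encode_hour(hour_a) - encode_hour(hour_b)))

gap_midnight = distance(23, 0)   # adjacent hours across midnight
gap_morning = distance(0, 1)     # adjacent hours within the same day
gap_opposite = distance(0, 12)   # hours on opposite sides of the day

print(gap_midnight, gap_morning, gap_opposite)
```

The first two distances are equal and the third is the largest possible one, which is the property a plain integer hour column lacks.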

In [11]:
# Compute the full correlation matrix over all columns
correlation_matrix_all = df_clean.corr()

# Correlation of the features with the target variables (electricity and gas consumption)
correlation_target_all = correlation_matrix_all[['electric_total_consumption_Wh', 'gas_total_consumption_Wh']]

# Sort by the strength of the correlation
correlation_target_all_sorted = correlation_target_all.abs().sort_values(by=['electric_total_consumption_Wh', 'gas_total_consumption_Wh'], ascending=False)

# Display the correlation results
print("Full Feature Correlation:")
display(correlation_target_all_sorted)

Full Feature Correlation:


,electric_total_consumption_Wh,gas_total_consumption_Wh
electric_total_consumption_Wh,1.0,0.032214
electric_median_consumption,0.803903,0.025043
std_consumption,0.775032,0.024055
electric_max_consumption,0.693017,0.042803
electric_min_consumption,0.516674,0.039677
income_band_mid,0.154421,0.034544
median_value,0.066262,0.003073
education_map,0.054028,0.012451
gas_total_consumption_Wh,0.032214,1.0
gas_max_consumption,0.028237,0.999275


In [12]:
# Extract hour from timestamp index
df_clean['hour'] = df_clean.index.hour

# Create cyclical features
df_clean['hour_sin'] = np.sin(2 * np.pi * df_clean['hour']/24)
df_clean['hour_cos'] = np.cos(2 * np.pi * df_clean['hour']/24)

# Create lag features for electric consumption (t-1, t-2, t-3)
for lag in range(1, 4):
    df_clean[f'electric_lag_{lag}'] = df_clean.groupby('homeid')['electric_total_consumption_Wh'].shift(lag)
    # Create lag features for gas consumption (t-1, t-2, t-3)
    df_clean[f'gas_lag_{lag}'] = df_clean.groupby('homeid')['gas_total_consumption_Wh'].shift(lag)

# Create rolling means for electric consumption (3 and 7 time steps)
df_clean['electric_rolling_mean_3h'] = df_clean.groupby('homeid')['electric_total_consumption_Wh'].rolling(window=3).mean().reset_index(0, drop=True)
df_clean['electric_rolling_mean_7h'] = df_clean.groupby('homeid')['electric_total_consumption_Wh'].rolling(window=7).mean().reset_index(0, drop=True)

# Create rolling means for gas consumption (3 and 7 time steps)
df_clean['gas_rolling_mean_3h'] = df_clean.groupby('homeid')['gas_total_consumption_Wh'].rolling(window=3).mean().reset_index(0, drop=True)
df_clean['gas_rolling_mean_7h'] = df_clean.groupby('homeid')['gas_total_consumption_Wh'].rolling(window=7).mean().reset_index(0, drop=True)

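# Fill the NaNs introduced by the lag and rolling features within each household,
# so that values never leak from one homeid into the next
fill_columns = [col for col in df_clean.columns if col != 'homeid']
df_clean[fill_columns] = df_clean.groupby('homeid')[fill_columns].ffill()  # Forward fill
df_clean[fill_columns] = df_clean.groupby('homeid')[fill_columns].bfill()  # Backward fill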

# Define features to scale
scaled_features = [
    'electric_total_consumption_Wh', 'gas_total_consumption_Wh',
    'electric_median_consumption', 'electric_max_consumption',
    'electric_min_consumption', 'std_consumption',
    'gas_max_consumption', 'gas_min_consumption', 'gas_median_consumption',
    'median_value', 'hour_sin', 'hour_cos',
    'electric_lag_1', 'electric_lag_2', 'electric_lag_3',
    'gas_lag_1', 'gas_lag_2', 'gas_lag_3',
    'electric_rolling_mean_3h', 'electric_rolling_mean_7h',
    'gas_rolling_mean_3h', 'gas_rolling_mean_7h',
]

# Initialize scaler
scaler = MinMaxScaler()

# Fit and transform the selected features
df_clean[scaled_features] = scaler.fit_transform(df_clean[scaled_features])

#df_clean.to_parquet('lstm_preprocessed_data.parquet')

**3. Create training and test sets**
1. Bring the data for each household into a suitable format
2. Train-test split: 80% training, 20% test
3. Define the time window for the LSTM (e.g. look back 24 hours to predict the next hour)
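
The windowing idea in step 3 can be sketched on toy data as follows. This is an illustration under assumptions, not the notebook's actual pipeline: the function `make_windows`, the toy arrays, and the window length of 3 are hypothetical. Each sample holds `time_steps` consecutive rows, and its target is the value that immediately follows the window. The split is chronological rather than shuffled.

```python
import numpy as np

def make_windows(features, targets, time_steps):
    """Slide a window of `time_steps` rows over the series.

    Returns X with shape (samples, time_steps, n_features) and
    y with shape (samples,), where y[i] follows window X[i].
    """
    X, y = [], []
    for i in range(len(features) - time_steps):
        X.append(features[i:i + time_steps])
        y.append(targets[i + time_steps])
    return np.array(X), np.array(y)

# Toy series: 10 time steps, 2 features per step (hypothetical values)
toy_features = np.arange(20, dtype=float).reshape(10, 2)
toy_targets = np.arange(10, dtype=float) * 10

X_toy, y_toy = make_windows(toy_features, toy_targets, time_steps=3)

# Chronological 80/20 split: the last windows form the test set
split = int(len(X_toy) * 0.8)
X_train_toy, X_test_toy = X_toy[:split], X_toy[split:]
y_train_toy, y_test_toy = y_toy[:split], y_toy[split:]

print(X_toy.shape, y_toy.shape, X_train_toy.shape, X_test_toy.shape)
```

With several households in one frame, the same function would have to be applied per `homeid` so that no window mixes rows from two homes.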

In [None]:
time_steps = 90

# Features and target definition (electricity and gas)
feature_columns = [col for col in df_clean.columns if col not in ['electric_total_consumption_Wh', 'gas_total_consumption_Wh', 'homeid', 'roomid']]
target_column_electric = 'electric_total_consumption_Wh'
target_column_gas = 'gas_total_consumption_Wh'

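# Create LSTM time series data: one window of `time_steps` rows per sample,
# built separately for each household so windows never span two homeids
def create_lstm_sequences(data, target_column, time_steps):
    X, y = [], []
    for _, group in data.groupby('homeid'):
        features = group[feature_columns].to_numpy()
        targets = group[target_column].to_numpy()
        for i in range(len(group) - time_steps):
            X.append(features[i:i + time_steps])  # rows i .. i+time_steps-1
            y.append(targets[i + time_steps])     # value right after the window
    return np.array(X), np.array(y)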

# Prepare data for electric consumption prediction
X_electric, y_electric = create_lstm_sequences(df_clean, target_column_electric, time_steps)
X_train_electric, X_test_electric, y_train_electric, y_test_electric = train_test_split(X_electric, y_electric, test_size=0.2, shuffle=False)

# Prepare data for gas consumption prediction
X_gas, y_gas = create_lstm_sequences(df_clean, target_column_gas, time_steps)
X_train_gas, X_test_gas, y_train_gas, y_test_gas = train_test_split(X_gas, y_gas, test_size=0.2, shuffle=False)

# Show shapes of the data
train_test_summary = {
    'X_train_electric': X_train_electric.shape,
    'X_test_electric': X_test_electric.shape,
    'X_train_gas': X_train_gas.shape,
    'X_test_gas': X_test_gas.shape,
}
train_test_summary





**4. Build the LSTM model**
1. Bring the data into the LSTM format (X_train, y_train)
2. Define the LSTM layers (TensorFlow)
3. Compile and train the model
4. Hyperparameter tuning (e.g. number of neurons, learning rate, ...)
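
A possible starting point for steps 2 and 3, sketched with the Keras layers already imported at the top of the notebook. Everything concrete here is an assumption chosen for illustration: the window length of 24, the 5 features, the 32 LSTM units, the dropout rate, the learning rate, and the random dummy data that stands in for the real `X_train` and `y_train`.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras.layers import LSTM, Dense, Dropout

# Hypothetical dimensions; in the notebook these would come from X_train.shape
time_steps, n_features = 24, 5

model = keras.Sequential([
    keras.Input(shape=(time_steps, n_features)),
    LSTM(32),        # number of units is a tunable hyperparameter
    Dropout(0.2),    # regularization between the LSTM and the output
    Dense(1),        # one regression output: next-step consumption
])

model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-3),  # tunable learning rate
    loss="mse",
    metrics=["mae"],
)

# Random placeholder data, only to show the expected array shapes
X_dummy = np.random.rand(64, time_steps, n_features).astype("float32")
y_dummy = np.random.rand(64).astype("float32")

history = model.fit(
    X_dummy, y_dummy,
    epochs=1, batch_size=16,
    validation_split=0.2, verbose=0,
)
```

The unit count, dropout rate, and learning rate are the natural candidates for the tuning mentioned in step 4.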

**5. Evaluate the model & interpret the predictions**
1. Run predictions on the test data
2. Compute metrics (RMSE, MAE, R^2)
3. Apply XAI with SHAP or LIME
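
The three metrics from step 2 can be computed with plain NumPy, as in the sketch below. The function name `regression_metrics` and the four example values are hypothetical. In the notebook, `y_true` would be a test target such as `y_test_electric` and `y_pred` the model's predictions.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE and R^2 for two equally long 1-D arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred

    rmse = float(np.sqrt(np.mean(errors ** 2)))
    mae = float(np.mean(np.abs(errors)))

    # R^2 = 1 - (residual sum of squares / total sum of squares)
    ss_res = float(np.sum(errors ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot

    return {"rmse": rmse, "mae": mae, "r2": r2}

# Hypothetical values for illustration only
metrics = regression_metrics([3, 5, 7, 9], [2, 5, 8, 9])
print(metrics)
```

Because the targets were min-max scaled earlier, RMSE and MAE computed this way are in scaled units. Reporting them in Wh would require inverting the scaling first.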