# 01 - Training dataset generation

The final dataset combines:
- **Virtual values obtained from the Unreal Engine simulation**, such as the final calibrated irradiance, the geometric solar and integrated ambient components, each of the three RGB channels integrated separately, the solar position, and several indicators of sun occlusion and visibility. These data have a temporal resolution of 2 minutes.
- **External meteorological variables**, obtained from the Solcast service. They include relevant information such as cloud cover and water vapour content. These data have a temporal resolution of 5 minutes.
- **Real irradiance measurements**, recorded by the pyranometers installed at the university, with a temporal resolution of 5 seconds.

Because the three sources have different temporal resolutions, a **temporal alignment** step resamples all variables to a common resolution of **two minutes**: the 5-second measurements are averaged over 2-minute windows, and the 5-minute meteorological data are linearly interpolated.
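
The downsampling half of this alignment can be sketched on synthetic data. All values below are made up, and the column names only mirror those used later in the notebook. The sketch shows that `resample('2min').mean()` labels each window by its left edge, so the row at 10:02 averages the samples from 10:02:00 to 10:03:55, and that an inner join keeps only timestamps present in both sources.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins (made-up values): a 2-minute simulation and a 5-second sensor.
sim = pd.DataFrame(
    {"sim_irradiance_wm2": [300.0, 310.0, 320.0, 330.0]},
    index=pd.date_range("2025-04-03 10:00", periods=4, freq="2min"),
)
real_5s = pd.DataFrame(
    {"P0": np.arange(72, dtype=float)},  # 72 samples = 6 minutes of data
    index=pd.date_range("2025-04-03 10:00", periods=72, freq="5s"),
)

# 5 s -> 2 min: average all samples inside each left-labelled 2-minute window.
real_2min = real_5s.resample("2min").mean()

# Inner join: the 10:06 simulation row has no sensor data and is dropped.
aligned = sim.join(real_2min, how="inner")
print(aligned)
```

With 24 samples per window, the three averaged values are 11.5, 35.5 and 59.5, and the joined frame has three rows instead of four.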

## 1. Library imports and configuration

In [19]:
import pandas as pd
import os

# Paths
DIR_SIM   = '../data_sim'
DIR_REAL  = '../data_real'
DIR_METEO = '../data_meteo'

DIR_OUTPUT = '../data'
OUTPUT_FILE = os.path.join(DIR_OUTPUT, 'dataset_master_tfg.csv')
SOLCAST_FILE = os.path.join(DIR_METEO, 'meteo_utc_2025.csv')

print("Librerías cargadas y listas.")

Librerías cargadas y listas.


## 2. File mapping

In [20]:
FILES_CONFIG = [
    # (sim filename, real filename, real sensor, YYYY-MM-DD)

    # --- 03 april ---
    ('Sim_2025-04-03_ClaseA_2m.csv',    'rad_20250403.csv', 'P0',   '2025-04-03'),
    ('Sim_2025-04-03_Inclinado_2m.csv', 'rad_20250403.csv', 'Pinc', '2025-04-03'),
    ('Sim_2025-04-03_P1_2m.csv',        'rad_20250403.csv', 'P1',   '2025-04-03'),
    ('Sim_2025-04-03_P4_2m.csv',        'rad_20250403.csv', 'P4',   '2025-04-03'),

    # --- 05 april ---
    ('Sim_2025-04-05_ClaseA_2m.csv',    'rad_20250405.csv', 'P0',   '2025-04-05'),
    ('Sim_2025-04-05_Inclinado_2m.csv', 'rad_20250405.csv', 'Pinc', '2025-04-05'),
    ('Sim_2025-04-05_P1_2m.csv',        'rad_20250405.csv', 'P1',   '2025-04-05'),
    ('Sim_2025-04-05_P4_2m.csv',        'rad_20250405.csv', 'P4',   '2025-04-05'),

    # --- 07 april ---
    ('Sim_2025-04-07_ClaseA_2m.csv',    'rad_20250407.csv', 'P0',   '2025-04-07'),
    ('Sim_2025-04-07_Inclinado_2m.csv', 'rad_20250407.csv', 'Pinc', '2025-04-07'),
    ('Sim_2025-04-07_P1_2m.csv',        'rad_20250407.csv', 'P1',   '2025-04-07'),
    ('Sim_2025-04-07_P4_2m.csv',        'rad_20250407.csv', 'P4',   '2025-04-07'),

    # --- 08 april ---
    ('Sim_2025-04-08_ClaseA_2m.csv',    'rad_20250408.csv', 'P0',   '2025-04-08'),
    ('Sim_2025-04-08_Inclinado_2m.csv', 'rad_20250408.csv', 'Pinc', '2025-04-08'),
    ('Sim_2025-04-08_P1_2m.csv',        'rad_20250408.csv', 'P1',   '2025-04-08'),
    ('Sim_2025-04-08_P4_2m.csv',        'rad_20250408.csv', 'P4',   '2025-04-08'),

    # --- 10 april ---
    ('Sim_2025-04-10_ClaseA_2m.csv',    'rad_20250410.csv', 'P0',   '2025-04-10'),
    ('Sim_2025-04-10_Inclinado_2m.csv', 'rad_20250410.csv', 'Pinc', '2025-04-10'),
    ('Sim_2025-04-10_P1_2m.csv',        'rad_20250410.csv', 'P1',   '2025-04-10'),
    ('Sim_2025-04-10_P4_2m.csv',        'rad_20250410.csv', 'P4',   '2025-04-10'),

    # --- 11 april ---
    ('Sim_2025-04-11_ClaseA_2m.csv',    'rad_20250411.csv', 'P0',   '2025-04-11'),
    ('Sim_2025-04-11_Inclinado_2m.csv', 'rad_20250411.csv', 'Pinc', '2025-04-11'),
    ('Sim_2025-04-11_P1_2m.csv',        'rad_20250411.csv', 'P1',   '2025-04-11'),
    ('Sim_2025-04-11_P4_2m.csv',        'rad_20250411.csv', 'P4',   '2025-04-11'),

    # --- 17 april ---
    ('Sim_2025-04-17_ClaseA_2m.csv',    'rad_20250417.csv', 'P0',   '2025-04-17'),
    ('Sim_2025-04-17_Inclinado_2m.csv', 'rad_20250417.csv', 'Pinc', '2025-04-17'),
    ('Sim_2025-04-17_P1_2m.csv',        'rad_20250417.csv', 'P1',   '2025-04-17'),
    ('Sim_2025-04-17_P4_2m.csv',        'rad_20250417.csv', 'P4',   '2025-04-17'),

    # --- 20 april ---
    ('Sim_2025-04-20_ClaseA_2m.csv',    'rad_20250420.csv', 'P0',   '2025-04-20'),
    ('Sim_2025-04-20_Inclinado_2m.csv', 'rad_20250420.csv', 'Pinc', '2025-04-20'),
    ('Sim_2025-04-20_P1_2m.csv',        'rad_20250420.csv', 'P1',   '2025-04-20'),
    ('Sim_2025-04-20_P4_2m.csv',        'rad_20250420.csv', 'P4',   '2025-04-20'),

    # --- 23 april ---
    ('Sim_2025-04-23_ClaseA_2m.csv',    'rad_20250423.csv', 'P0',   '2025-04-23'),
    ('Sim_2025-04-23_Inclinado_2m.csv', 'rad_20250423.csv', 'Pinc', '2025-04-23'),
    ('Sim_2025-04-23_P1_2m.csv',        'rad_20250423.csv', 'P1',   '2025-04-23'),
    ('Sim_2025-04-23_P4_2m.csv',        'rad_20250423.csv', 'P4',   '2025-04-23'),

    # --- 24 april ---
    ('Sim_2025-04-24_ClaseA_2m.csv',    'rad_20250424.csv', 'P0',   '2025-04-24'),
    ('Sim_2025-04-24_Inclinado_2m.csv', 'rad_20250424.csv', 'Pinc', '2025-04-24'),
    ('Sim_2025-04-24_P1_2m.csv',        'rad_20250424.csv', 'P1',   '2025-04-24'),
    ('Sim_2025-04-24_P4_2m.csv',        'rad_20250424.csv', 'P4',   '2025-04-24'),

    # --- 26 april ---
    ('Sim_2025-04-26_ClaseA_2m.csv',    'rad_20250426.csv', 'P0',   '2025-04-26'),
    ('Sim_2025-04-26_Inclinado_2m.csv', 'rad_20250426.csv', 'Pinc', '2025-04-26'),
    ('Sim_2025-04-26_P1_2m.csv',        'rad_20250426.csv', 'P1',   '2025-04-26'),
    ('Sim_2025-04-26_P4_2m.csv',        'rad_20250426.csv', 'P4',   '2025-04-26'),

    # --- 07 october ---
    ('Sim_2025-10-07_ClaseA_2m.csv',    'rad_20251007.csv', 'P0',   '2025-10-07'),
    ('Sim_2025-10-07_Inclinado_2m.csv', 'rad_20251007.csv', 'Pinc', '2025-10-07'),
    ('Sim_2025-10-07_P1_2m.csv',        'rad_20251007.csv', 'P1',   '2025-10-07'),
    ('Sim_2025-10-07_P4_2m.csv',        'rad_20251007.csv', 'P4',   '2025-10-07'),
]

print("Mapeo de archivos realizado.")

Mapeo de archivos realizado.
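
Since every date follows the same four-file pattern, the list above could also be generated instead of typed by hand. The following is only a sketch: `DATES`, `SIM_TO_SENSOR` and `build_files_config` are hypothetical names introduced here, while the file-name patterns and the label-to-sensor pairs are copied from `FILES_CONFIG`.

```python
# Dates and the simulation-label -> sensor pairs, copied from FILES_CONFIG.
DATES = [
    '2025-04-03', '2025-04-05', '2025-04-07', '2025-04-08',
    '2025-04-10', '2025-04-11', '2025-04-17', '2025-04-20',
    '2025-04-23', '2025-04-24', '2025-04-26', '2025-10-07',
]
SIM_TO_SENSOR = {'ClaseA': 'P0', 'Inclinado': 'Pinc', 'P1': 'P1', 'P4': 'P4'}

def build_files_config(dates, sim_to_sensor):
    """Build (sim filename, real filename, real sensor, date) tuples."""
    config = []
    for date_str in dates:
        # The real-data file drops the dashes from the date: rad_YYYYMMDD.csv
        real_file = f"rad_{date_str.replace('-', '')}.csv"
        for sim_label, sensor in sim_to_sensor.items():
            sim_file = f"Sim_{date_str}_{sim_label}_2m.csv"
            config.append((sim_file, real_file, sensor, date_str))
    return config

generated = build_files_config(DATES, SIM_TO_SENSOR)
print(len(generated))
```

Twelve dates times four sensors gives the same 48 entries, in the same order, as the hand-written list.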


## 3. Loading and preprocessing the meteorological data

In [21]:
def process_solcast_full_year(filepath):
    """
    Load the yearly Solcast file, handle time zones, one-hot encode
    the weather type, and resample to 2 minutes.
    """
    if not os.path.exists(filepath):
        raise FileNotFoundError(f"No se encuentra el fichero Solcast: {filepath}")
    
    print(f"Cargando datos Solcast desde: {filepath}...")
    df = pd.read_csv(filepath)
    
    # Parse period_end as UTC, convert to Madrid local time, then drop the tz info
    df['period_end'] = pd.to_datetime(df['period_end'], utc=True)
    df['timestamp'] = df['period_end'].dt.tz_convert('Europe/Madrid')
    df['timestamp'] = df['timestamp'].dt.tz_localize(None)
    
    # Set index to timestamp
    df = df.set_index('timestamp')
    
    # Drop duplicated local timestamps: the repeated hour at the end of daylight
    # saving time (October) becomes ambiguous once the tz info is removed.
    # It falls at night, so irradiance data is unaffected.
    if df.index.duplicated().any():
        df = df[~df.index.duplicated(keep='first')]
        
    # Delete unneeded columns
    drop_cols = ['period', 'period_end'] 
    df = df.drop(columns=[c for c in drop_cols if c in df.columns])
    
    # One-hot encoding weather_type
    if 'weather_type' in df.columns:
        dummies = pd.get_dummies(df['weather_type'], prefix='weather')
        df = pd.concat([df, dummies], axis=1)
        df = df.drop(columns=['weather_type']) 
        
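    # Upsample to a 2-minute grid and interpolate linearly.
    # NOTE: in pandas versions where resample().interpolate() reindexes to the
    # target grid before interpolating, 5-minute samples that fall off the
    # 2-minute grid (odd minutes) are dropped first.
    df_resampled = df.resample('2min').interpolate(method='linear')
    
    # Forward fill any NaNs that interpolation leaves behind.
    # ffill cannot fill leading NaNs at the very start of the series.
    df_resampled = df_resampled.ffill()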
    
    print(f"Solcast procesado. Rango: {df_resampled.index.min()} a {df_resampled.index.max()}")
    print(f"Variables disponibles: {list(df_resampled.columns)}")
    
    return df_resampled

df_meteo_master = process_solcast_full_year(SOLCAST_FILE)

Cargando datos Solcast desde: ../data_meteo\meteo_utc_2025.csv...
Solcast procesado. Rango: 2025-01-01 01:04:00 a 2025-12-24 01:00:00
Variables disponibles: ['air_temp', 'azimuth', 'clearsky_dhi', 'clearsky_dni', 'clearsky_ghi', 'cloud_opacity', 'dewpoint_temp', 'dhi', 'dni', 'ghi', 'precipitable_water', 'precipitation_rate', 'relative_humidity', 'surface_pressure', 'wind_direction_10m', 'wind_speed_10m', 'zenith', 'pm10', 'weather_CLEAR', 'weather_DRIZZLE', 'weather_FROST', 'weather_HAZY', 'weather_INTERMITTENT RAIN', 'weather_MOSTLY CLEAR', 'weather_MOSTLY CLOUDY', 'weather_MOSTLY SUNNY', 'weather_OVERCAST', 'weather_PARTLY CLOUDY', 'weather_RAIN', 'weather_SUNNY', 'weather_THUNDERSTORM', 'weather_WINDY']
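
A 5-minute series only shares every other timestamp with a 2-minute grid (10:00, 10:10, 10:20, ...), which matters when upsampling. The sketch below uses made-up values to compare two strategies: reindexing to the target grid and then interpolating (written explicitly with `asfreq()`, since the behaviour of `resample().interpolate()` itself may depend on the pandas version), and interpolating on the union of both grids before selecting the 2-minute timestamps. The second strategy is an alternative shown for illustration, not the method used in the cell above.

```python
import pandas as pd

# Made-up 5-minute series with spikes at :05 and :15 (off the 2-minute grid).
idx = pd.date_range("2025-04-03 10:00", periods=5, freq="5min")
s = pd.Series([0.0, 100.0, 0.0, 100.0, 0.0], index=idx)

# Reindex-then-interpolate: only samples already on the 2-minute grid survive,
# so both spikes are lost before interpolation starts.
naive = s.resample("2min").asfreq().interpolate(method="linear")

# Union-grid alternative: keep every original sample while interpolating in time,
# then select the 2-minute timestamps.
grid = pd.date_range(s.index.min(), s.index.max(), freq="2min")
aligned = s.reindex(s.index.union(grid)).interpolate(method="time").reindex(grid)

print(naive.max(), aligned.max())
```

Here the first strategy returns a flat series of zeros, while the second recovers values up to 80 next to each spike.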


## 4. Processing a single configuration

In [None]:
def process_line(conf, df_meteo_full):
    """Load and process simulation, real and meteo data for a given configuration line."""
    sim_f, real_f, sensor_real, date_str = conf
    
    path_s = os.path.join(DIR_SIM, sim_f)
    if not os.path.exists(path_s):
        print(f"  [SKIP] No se encuentra simulación: {sim_f}")
        return None
    
    df_s = pd.read_csv(path_s)
    df_s['utc'] = pd.to_datetime(df_s['utc'])
    
    # Treat naive timestamps as UTC, then convert to naive Madrid local time
    if df_s['utc'].dt.tz is None:
        df_s['utc'] = df_s['utc'].dt.tz_localize('UTC')
    
    df_s['local_time'] = df_s['utc'].dt.tz_convert('Europe/Madrid').dt.tz_localize(None)
    
    # Filter by date
    df_s = df_s[df_s['local_time'].dt.strftime('%Y-%m-%d') == date_str].copy()
    
    # Compute the calibrated irradiance (W/m²) from the lux components if it is missing
    if 'sim_irradiance_wm2' not in df_s.columns:
        df_s['sim_irradiance_wm2'] = (
            0.002401 * df_s['sim_comp_direct_lux'] + 
            5.03e-8  * df_s['sim_comp_direct_lux']**2 + 
            0.01305  * df_s['sim_comp_amb_lux']
        ).clip(lower=0)
    
    cols_rename = {
        'sim_irradiance_wm2': 'sim_irradiance_wm2',
        
        'sim_comp_amb_lux': 'sim_comp_amb_lux',
        'sim_comp_direct_lux': 'sim_comp_direct_lux',
        
        'azimuth_deg': 'sim_azimuth',
        'altitude_deg': 'sim_altitude',
        
        'sun_visibility': 'sim_sun_visibility',
        'sun_hit_distance_m': 'sim_center_hit_distance',
        'sky_view_factor': 'sim_sky_view_factor',
        
        'raw_r_lux': 'raw_r_lux',
        'raw_b_lux': 'raw_b_lux',
        'raw_g_lux': 'raw_g_lux',
        
        'geometric_factor': 'sim_geom_factor',
        'clearsky_ghi_wm2': 'sim_cs_ghi',
        'clearsky_dni_wm2': 'sim_cs_dni',
        'clearsky_dhi_wm2': 'sim_cs_dhi',
        
    }
    
    # Rename only existing columns
    actual_rename = {k: v for k, v in cols_rename.items() if k in df_s.columns}
    df_s = df_s.rename(columns=actual_rename).set_index('local_time')
    
    # Load real data
    path_r = os.path.join(DIR_REAL, real_f)
    if not os.path.exists(path_r):
        print(f"  [SKIP] No se encuentra real: {real_f}")
        return None
    
    # Load and filter real data by date
    df_r = pd.read_csv(path_r)
    df_r['timestamp'] = pd.to_datetime(df_r['timestamp'])
    df_r = df_r[df_r['timestamp'].dt.strftime('%Y-%m-%d') == date_str].copy()
    
    # Resample real 5s -> 2min
    df_r = df_r.set_index('timestamp').resample('2min').mean()
    
    if sensor_real not in df_r.columns:
        print(f"  [SKIP] El sensor {sensor_real} no está en {real_f}")
        return None
        
    # Keep only the relevant real sensor column and rename
    df_r = df_r[[sensor_real]].rename(columns={sensor_real: 'real_irradiance'})
    
    # Join simulation + real + meteo
    df_parcial = df_s.join(df_r, how='inner')
    df_final = df_parcial.join(df_meteo_full, how='inner')
    
    # Metadata for training
    df_final['date'] = date_str
    df_final['sensor'] = sensor_real
    
    return df_final
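
The lux-to-irradiance calibration inside `process_line` can be isolated and checked against a known row. In this sketch, `lux_to_wm2` is a hypothetical helper name. The coefficients are the ones used in the function above, and the input values are the `sim_comp_direct_lux` and `sim_comp_amb_lux` entries of the 2025-04-03 10:08 row in the dataset preview.

```python
import pandas as pd

def lux_to_wm2(direct_lux, amb_lux):
    """Quadratic in the direct component, linear in the ambient one, clipped at zero."""
    irradiance = (
        0.002401 * direct_lux
        + 5.03e-8 * direct_lux**2
        + 0.01305 * amb_lux
    )
    return irradiance.clip(lower=0)

# Direct and ambient lux of the 2025-04-03 10:08 preview row
direct = pd.Series([48825.210938])
amb = pd.Series([7211.141113])

result = lux_to_wm2(direct, amb)
print(result.round(3).iloc[0])  # ≈ 331.245, the sim_irradiance_wm2 value of that row
```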

## 5. Generating the full dataset

In [23]:
print("Procesando configuraciones y generando dataset master...")

all_dfs = []
print(f"Procesando {len(FILES_CONFIG)} configuraciones...")

for i, conf in enumerate(FILES_CONFIG):
    if i % 5 == 0:
        print(f"Procesando... {conf[0]}")
    
    # Process line
    df_res = process_line(conf, df_meteo_master)
    
    if df_res is not None and not df_res.empty:
        all_dfs.append(df_res)
    else:
        print(f"  [WARN] Dataframe vacío para {conf[0]}")
        
if all_dfs:
    master_df = pd.concat(all_dfs)
    
    # Clean NaNs
    len_orig = len(master_df)
    master_df = master_df.dropna()
    print(f"Filas limpias: {len(master_df)} (Perdidas: {len_orig - len(master_df)})")
    
    os.makedirs(DIR_OUTPUT, exist_ok=True)  # to_csv does not create missing directories
    master_df.to_csv(OUTPUT_FILE)
    print(f"Guardado en: {OUTPUT_FILE}")
    
    # Preview
    print("Columnas finales:", master_df.columns.tolist())
    display(master_df.head(3))
else:
    print("No se generaron datos.")

Procesando configuraciones y generando dataset master...
Procesando 48 configuraciones...
Procesando... Sim_2025-04-03_ClaseA_2m.csv
Procesando... Sim_2025-04-05_Inclinado_2m.csv
Procesando... Sim_2025-04-07_P1_2m.csv
Procesando... Sim_2025-04-08_P4_2m.csv
Procesando... Sim_2025-04-11_ClaseA_2m.csv
Procesando... Sim_2025-04-17_Inclinado_2m.csv
Procesando... Sim_2025-04-20_P1_2m.csv
Procesando... Sim_2025-04-23_P4_2m.csv
Procesando... Sim_2025-04-26_ClaseA_2m.csv
Procesando... Sim_2025-10-07_Inclinado_2m.csv
Filas limpias: 19992 (Perdidas: 20)
Guardado en: ../data\dataset_master_tfg.csv
Columnas finales: ['sim_irradiance_wm2', 'sim_comp_amb_lux', 'sim_comp_direct_lux', 'sim_azimuth', 'sim_altitude', 'sim_sun_visibility', 'sim_center_hit_distance', 'sim_sky_view_factor', 'raw_r_lux', 'raw_b_lux', 'raw_g_lux', 'sim_geom_factor', 'sim_cs_ghi', 'sim_cs_dni', 'sim_cs_dhi', 'real_irradiance', 'air_temp', 'azimuth', 'clearsky_dhi', 'clearsky_dni', 'clearsky_ghi', 'cloud_opacity', 'dewpoint_tem

| | sim_irradiance_wm2 | sim_comp_amb_lux | sim_comp_direct_lux | sim_azimuth | sim_altitude | sim_sun_visibility | sim_center_hit_distance | sim_sky_view_factor | raw_r_lux | raw_b_lux | ... | weather_MOSTLY CLOUDY | weather_MOSTLY SUNNY | weather_OVERCAST | weather_PARTLY CLOUDY | weather_RAIN | weather_SUNNY | weather_THUNDERSTORM | weather_WINDY | date | sensor |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2025-04-03 10:08:00 | 331.244955 | 7211.141113 | 48825.210938 | 104.263 | 24.009 | 1.0 | -1.0 | 0.99408 | 4087.593994 | 10848.214844 | ... | False | False | False | False | True | False | False | False | 2025-04-03 | P0 |
| 2025-04-03 10:10:00 | 337.046057 | 7259.229004 | 49529.265625 | 104.629 | 24.377 | 1.0 | -1.0 | 0.99408 | 4102.404785 | 10938.510742 | ... | False | False | False | False | True | False | False | False | 2025-04-03 | P0 |
| 2025-04-03 10:14:00 | 348.642971 | 7349.081055 | 50927.757812 | 105.367 | 25.113 | 1.0 | -1.0 | 0.99408 | 4132.727539 | 11102.77832 | ... | False | False | False | False | True | False | False | False | 2025-04-03 | P0 |
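
The preview jumps from 10:10 to 10:14, so some 2-minute steps are missing after the inner joins. A small check like the one below can list such gaps per day and sensor. It is a sketch: `find_gaps` is a hypothetical helper, and the demo frame only reproduces the three preview timestamps with the `date` and `sensor` columns.

```python
import pandas as pd

def find_gaps(df, step="2min"):
    """List rows whose distance to the previous row of the same (date, sensor) exceeds `step`."""
    # Move the timestamp index into a column; timestamps repeat across sensors.
    flat = df.rename_axis("timestamp").reset_index()
    delta = flat.groupby(["date", "sensor"])["timestamp"].diff()
    mask = delta > pd.Timedelta(step)
    return flat.loc[mask, ["date", "sensor", "timestamp"]].assign(gap=delta[mask])

# Demo frame with the three timestamps shown in the preview
demo = pd.DataFrame(
    {"date": "2025-04-03", "sensor": "P0"},
    index=pd.to_datetime(
        ["2025-04-03 10:08", "2025-04-03 10:10", "2025-04-03 10:14"]
    ),
)

gaps = find_gaps(demo)
print(gaps)
```

For the demo frame this reports a single 4-minute gap ending at 10:14. On the full dataset the same call would be `find_gaps(master_df)`.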
