# Cleaning and preparing historical data for the prediction models <a id="Limpieza-y-preparación-de-datos-históricos-para-los-modelos-de-predicción"></a>
This notebook cleans the historical data required by the prediction models. The data sources are:

**Radiation for the day before the call** https://opendata.aemet.es/centrodedescargas/productosAEMET<br>
**Solar radiation from two days before the data retrieval date** http://www.soda-pro.com/web-services/radiation/cams-radiation-service<br>
**Weather data for the five days before the call** https://openweathermap.org/api/one-call-api#history<br>
**Weather forecast for the 48 hours after the call** https://openweathermap.org/api/one-call-api<br>


- [Preparation](#Preparación)<br>

### 1. [Weather data for the previous 5 days](#Datos-climatológicos-de-5-días-anteriores)

### 2. [Final weather function](#Función-final-clima)

### 3. [Weather forecasts for the next 2 days](#Predicciones-climatológicas-de-los-2-días-siguientes)

### 4. [Final forecast function](#Función-final-predicciones)

### 5. [Radiation data for the previous day](#Datos-de-radiación-del-día-anterior)

### 6. [Final previous-day radiation function](#Función-final-radiación-día-anterior)

### 7. [Radiation data from two days before](#Datos-de-radiación-de-dos-días-antes)

### 8. [Final radiation function](#Función-final-de-radiación)

### 9. [Generating the rows with the days as columns](#Generación-de-las-filas-con-los-días-en-columnas)


**WARNING:** This notebook takes many hours to run.

## Preparation <a id="Preparación"></a>

Import the required libraries and datasets.

In [1]:
import numpy as np
import pandas as pd
import random
pd.options.display.max_columns = None
pd.options.display.max_rows = None
import matplotlib.pyplot as plt
plt.style.use("seaborn")  # on matplotlib >= 3.6 this style is named "seaborn-v0_8"
from datetime import datetime, timedelta

Only hours with sunlight are used: from ``hora_ini`` (inclusive) to ``hora_fin`` (exclusive).

In [2]:
hora_ini = 4
hora_fin = 20

Set the working directory.

In [3]:
%cd /home/dsc/git/TFM/

/home/dsc/git/TFM


In [4]:
directorio = '/home/dsc/git/TFM/'

# Weather data for the previous 5 days <a id="Datos-climatológicos-de-5-días-anteriores"></a>
<div style = "float:right"><a style="text-decoration:none" href = "#Limpieza-y-preparación-de-datos-históricos-para-los-modelos-de-predicción">Back to top</a></div>

These data come from the OpenWeather portal (https://openweathermap.org/api/one-call-api#history), using a student licence that allows a large number of calls per day. **Data are in UTC.** Each call returns the hourly weather data for the 5 days before the call. The fields obtained are:

- ``dt``: Time of historical data, Unix, UTC
- ``temp``: Temperature. Units: kelvin
- ``feels_like``:  Temperature. This accounts for the human perception of weather. Units: kelvin
- ``pressure``: Atmospheric pressure on the sea level, hPa
- ``humidity``: Humidity, %
- ``dew_point``: Atmospheric temperature below which water droplets begin to condense and dew can form. Units: kelvin
- ``clouds``: Cloudiness, %
- ``visibility``: Average visibility, metres
- ``wind_speed``: Wind speed. Units: m/s
- ``wind_gust``: Wind gust. Units: m/s
- ``wind_deg``: Wind direction, degrees (meteorological)
- ``rain``: Precipitation volume, mm
- ``snow``: Snow volume, mm
- ``we``: Contains an id that indicates the type of weather condition

Hour X contains the data recorded between X:00 and X:59.
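As a small illustration of the units listed above, the following sketch converts a Unix UTC timestamp into the ``date`` and ``hour`` strings seen in the CSV files, and a temperature from kelvin to degrees Celsius. The sample values and the names ``muestra`` and ``temp_c`` are assumptions made for this example; the notebook itself keeps temperatures in kelvin.

```python
import pandas as pd

# Hypothetical sample shaped like the fields listed above (values are made up)
muestra = pd.DataFrame({
    "dt": [1617494400, 1617498000],   # Unix timestamps, UTC
    "temp": [286.30, 285.94],         # kelvin
})

# Convert the Unix timestamp to a timezone-aware UTC datetime
momento = pd.to_datetime(muestra["dt"], unit="s", utc=True)
muestra["date"] = momento.dt.strftime("%Y-%m-%d")
muestra["hour"] = momento.dt.strftime("%H:%M")

# Kelvin to degrees Celsius
muestra["temp_c"] = muestra["temp"] - 273.15

print(muestra)
```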

### Exploring the data

One of the daily datasets is used first.

In [5]:
# Import the CSV of historical weather data. Its fields are explained in the notebook where the data are retrieved

df_clima = pd.read_csv('./data/Clima_OW/clima_ow_2021-04-05', sep=',')
df_clima.head()

Unnamed: 0,date,hour,fecha prediccion,estacion,temp,feels_like,pressure,humidity,dew_point,clouds,visibility,wind_speed,wind_deg,wind_gust,we,rain.1h
0,2021-04-04,00:00,2021-04-05,0252D,286.3,285.66,1014,82,283.3,0,10000.0,1.03,260,,800,
1,2021-04-04,01:00,2021-04-05,0252D,285.94,284.13,1014,82,282.95,0,10000.0,2.57,290,,800,
2,2021-04-04,02:00,2021-04-05,0252D,285.67,284.98,1014,87,283.57,0,10000.0,1.22,271,1.79,800,
3,2021-04-04,03:00,2021-04-05,0252D,285.53,284.34,1014,82,282.55,0,10000.0,1.54,60,,800,
4,2021-04-04,04:00,2021-04-05,0252D,284.3,282.56,1014,93,283.21,0,10000.0,2.57,280,,800,


In [6]:
df_clima.shape

(41736, 16)

In [7]:
print("There are {} stations".format(len(np.unique(df_clima["estacion"]))))

There are 287 stations


Load the list of stations, which serves as the list of example locations.

In [8]:
df_estaciones = pd.read_csv(directorio + 'data/estaciones.csv')

In [9]:
df_estaciones.head()

Unnamed: 0,latitud,provincia,altitud,indicativo,nombre,longitud
0,413515N,BARCELONA,74,0252D,ARENYS DE MAR,023224E
1,411734N,BARCELONA,4,0076,BARCELONA AEROPUERTO,020412E
2,412506N,BARCELONA,408,0200E,"BARCELONA, FABRA",020727E
3,412326N,BARCELONA,6,0201D,BARCELONA,021200E
4,414312N,BARCELONA,291,0149X,MANRESA,015025E


Find the stations that are not present in both data frames (``df_estaciones`` and ``df_clima``).

In [10]:
estacion_quitar = []
# Stations in the list of stations
serie_indicativos = df_estaciones["indicativo"].unique().astype("str")
# Stations in the weather dataset
serie_estaciones = list(set(df_clima["estacion"].unique().astype("str")))

diferencia = len(serie_indicativos) - len(serie_estaciones)
print("The difference is: {}".format(diferencia))

# Store the codes of the stations in the list that are not in the dataset
for i in range(0, len(serie_indicativos)):
    estacion = serie_indicativos[i]
    if estacion not in serie_estaciones:
        print("Missing: index {}, station {}".format(i, estacion))
        estacion_quitar.append(str(estacion))
        diferencia -= 1
        print("The difference is: {}".format(diferencia))
# Store the codes of the stations in the dataset that are not in the list of stations
for i in range(0, len(serie_estaciones)):
    estacion = serie_estaciones[i]
    if (estacion not in list(serie_indicativos) and estacion not in estacion_quitar):
        print("Missing: index {}, station {}".format(i, estacion))
        estacion_quitar.append(str(estacion))
        diferencia += 1
        print("The difference is: {}".format(diferencia))
print(estacion_quitar)
print("The difference is: {}".format(diferencia))

The difference is: 0
Missing: index 31, station 1249X
The difference is: -1
Missing: index 194, station 1249I
The difference is: 0
['1249X', '1249I']
The difference is: 0
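The comparison above amounts to a symmetric difference between two sets of station codes. A minimal sketch with Python sets follows; the sample codes and the names ``indicativos_lista``, ``indicativos_datos`` and ``estacion_quitar_demo`` are hypothetical and only mirror the situation found in the data.

```python
# Hypothetical station codes, chosen to mirror the situation above
indicativos_lista = {"0252D", "0076", "1249X"}   # codes in the station list
indicativos_datos = {"0252D", "0076", "1249I"}   # codes in the weather dataset

# Codes present in only one of the two sources
solo_en_lista = indicativos_lista - indicativos_datos
solo_en_datos = indicativos_datos - indicativos_lista

# Symmetric difference: everything that is not in both
estacion_quitar_demo = sorted(indicativos_lista ^ indicativos_datos)

print(solo_en_lista, solo_en_datos, estacion_quitar_demo)
```

Set operations avoid the repeated ``in`` lookups on lists, which matters little for a few hundred stations but keeps the intent explicit.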


Show the list of station codes to remove.

In [11]:
estacion_quitar

['1249X', '1249I']

Count the rows that belong to these stations before removing them from the dataset.

In [12]:
# Count the rows whose station is in the removal list
cont = df_clima["estacion"].astype(str).isin(estacion_quitar).sum()
print("{} rows out of {} must be removed. {} will remain".format(cont, len(df_clima), len(df_clima) - cont))

240 rows out of 41736 must be removed. 41496 will remain


Create a column that stores True when the station of that row must be removed.

In [13]:
# True for the rows whose station is in the removal list
estaciones_df = df_clima["estacion"].isin(estacion_quitar).tolist()

In [14]:
df_clima.insert(len(df_clima.columns),"quitar",estaciones_df,True)

In [15]:
df_clima.head()

Unnamed: 0,date,hour,fecha prediccion,estacion,temp,feels_like,pressure,humidity,dew_point,clouds,visibility,wind_speed,wind_deg,wind_gust,we,rain.1h,quitar
0,2021-04-04,00:00,2021-04-05,0252D,286.3,285.66,1014,82,283.3,0,10000.0,1.03,260,,800,,False
1,2021-04-04,01:00,2021-04-05,0252D,285.94,284.13,1014,82,282.95,0,10000.0,2.57,290,,800,,False
2,2021-04-04,02:00,2021-04-05,0252D,285.67,284.98,1014,87,283.57,0,10000.0,1.22,271,1.79,800,,False
3,2021-04-04,03:00,2021-04-05,0252D,285.53,284.34,1014,82,282.55,0,10000.0,1.54,60,,800,,False
4,2021-04-04,04:00,2021-04-05,0252D,284.3,282.56,1014,93,283.21,0,10000.0,2.57,280,,800,,False


In [16]:
df_clima.drop(df_clima[df_clima["quitar"] == True].index, inplace = True)

In [17]:
df_clima.reset_index(drop=True, inplace=True)
df_clima.head()

Unnamed: 0,date,hour,fecha prediccion,estacion,temp,feels_like,pressure,humidity,dew_point,clouds,visibility,wind_speed,wind_deg,wind_gust,we,rain.1h,quitar
0,2021-04-04,00:00,2021-04-05,0252D,286.3,285.66,1014,82,283.3,0,10000.0,1.03,260,,800,,False
1,2021-04-04,01:00,2021-04-05,0252D,285.94,284.13,1014,82,282.95,0,10000.0,2.57,290,,800,,False
2,2021-04-04,02:00,2021-04-05,0252D,285.67,284.98,1014,87,283.57,0,10000.0,1.22,271,1.79,800,,False
3,2021-04-04,03:00,2021-04-05,0252D,285.53,284.34,1014,82,282.55,0,10000.0,1.54,60,,800,,False
4,2021-04-04,04:00,2021-04-05,0252D,284.3,282.56,1014,93,283.21,0,10000.0,2.57,280,,800,,False


In [18]:
print("As expected, the new number of rows is {}".format(len(df_clima["estacion"])))

As expected, the new number of rows is 41496
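The flag-and-drop steps above can also be expressed as a single boolean mask, without a helper column. The sketch below uses a toy frame; ``demo``, ``quitar`` and ``demo_filtrado`` are names assumed for the illustration.

```python
import pandas as pd

# Toy data with two stations that should be removed
demo = pd.DataFrame({
    "estacion": ["0252D", "1249X", "0076", "1249I", "0252D"],
    "temp": [286.3, 285.1, 284.9, 283.7, 285.9],
})
quitar = ["1249X", "1249I"]

# True for the rows to discard
mascara = demo["estacion"].isin(quitar)

# Keep the complement and renumber the index
demo_filtrado = demo[~mascara].reset_index(drop=True)

print("Rows removed:", int(mascara.sum()))
print(demo_filtrado)
```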


### Selecting the relevant columns

Drop the ``quitar`` column.
The remaining columns are meteorological data that will be used to build the model.

In [19]:
df_clima.drop(['quitar'], axis=1, inplace = True)
df_clima.head()

Unnamed: 0,date,hour,fecha prediccion,estacion,temp,feels_like,pressure,humidity,dew_point,clouds,visibility,wind_speed,wind_deg,wind_gust,we,rain.1h
0,2021-04-04,00:00,2021-04-05,0252D,286.3,285.66,1014,82,283.3,0,10000.0,1.03,260,,800,
1,2021-04-04,01:00,2021-04-05,0252D,285.94,284.13,1014,82,282.95,0,10000.0,2.57,290,,800,
2,2021-04-04,02:00,2021-04-05,0252D,285.67,284.98,1014,87,283.57,0,10000.0,1.22,271,1.79,800,
3,2021-04-04,03:00,2021-04-05,0252D,285.53,284.34,1014,82,282.55,0,10000.0,1.54,60,,800,
4,2021-04-04,04:00,2021-04-05,0252D,284.3,282.56,1014,93,283.21,0,10000.0,2.57,280,,800,


In [20]:
df_clima.shape

(41496, 16)

``.info()`` shows the non-null count of each column, which reveals the NAs.

In [21]:
df_clima.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41496 entries, 0 to 41495
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   date              41496 non-null  object 
 1   hour              41496 non-null  object 
 2   fecha prediccion  41496 non-null  object 
 3   estacion          41496 non-null  object 
 4   temp              41496 non-null  float64
 5   feels_like        41496 non-null  float64
 6   pressure          41496 non-null  int64  
 7   humidity          41496 non-null  int64  
 8   dew_point         41496 non-null  float64
 9   clouds            41496 non-null  int64  
 10  visibility        28978 non-null  float64
 11  wind_speed        41496 non-null  float64
 12  wind_deg          41496 non-null  int64  
 13  wind_gust         14129 non-null  float64
 14  we                41496 non-null  int64  
 15  rain.1h           1910 non-null   float64
dtypes: float64(7), int64(5), object(4)
memor

Keep only the useful hours.

In [22]:
df_clima["hour"] = pd.to_numeric([np.nan if pd.isna(c) == True else c[:2] for c in df_clima["hour"]])

In [23]:
df_clima = df_clima[(df_clima["hour"] < hora_fin) & (df_clima["hour"] >= hora_ini)]
df_clima.reset_index(drop=True, inplace=True)
df_clima.head()

Unnamed: 0,date,hour,fecha prediccion,estacion,temp,feels_like,pressure,humidity,dew_point,clouds,visibility,wind_speed,wind_deg,wind_gust,we,rain.1h
0,2021-04-04,4,2021-04-05,0252D,284.3,282.56,1014,93,283.21,0,10000.0,2.57,280,,800,
1,2021-04-04,5,2021-04-05,0252D,284.59,282.57,1014,93,283.5,0,10000.0,3.09,290,,800,
2,2021-04-04,6,2021-04-05,0252D,284.41,283.88,1015,87,282.33,0,10000.0,0.51,0,,800,
3,2021-04-04,7,2021-04-05,0252D,284.99,284.61,1016,87,282.9,0,10000.0,0.51,0,,800,
4,2021-04-04,8,2021-04-05,0252D,286.41,286.17,1016,82,283.41,0,10000.0,0.51,0,,800,
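The two cells above turn the ``HH:MM`` strings into integer hours and then keep a half-open range. A minimal sketch on toy data, where the names ``hora_ini_demo``, ``hora_fin_demo`` and ``horas`` are assumptions of the example:

```python
import pandas as pd

hora_ini_demo, hora_fin_demo = 4, 20   # same bounds as hora_ini and hora_fin

demo = pd.DataFrame({"hour": ["03:00", "04:00", "19:00", "20:00"]})

# Take the first two characters ('HH') and convert them to a number
demo["hour"] = pd.to_numeric(demo["hour"].str[:2])

# Half-open interval: the lower bound is included, the upper bound is not
en_rango = demo[(demo["hour"] >= hora_ini_demo) & (demo["hour"] < hora_fin_demo)]
horas = en_rango["hour"].tolist()
print(horas)
```

With these bounds, 16 of the 24 hours survive per station and day, which agrees with the row counts reported by ``.info()`` before and after the filter (41496 × 16 / 24 = 27664).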


In [24]:
df_clima.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27664 entries, 0 to 27663
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   date              27664 non-null  object 
 1   hour              27664 non-null  int64  
 2   fecha prediccion  27664 non-null  object 
 3   estacion          27664 non-null  object 
 4   temp              27664 non-null  float64
 5   feels_like        27664 non-null  float64
 6   pressure          27664 non-null  int64  
 7   humidity          27664 non-null  int64  
 8   dew_point         27664 non-null  float64
 9   clouds            27664 non-null  int64  
 10  visibility        19627 non-null  float64
 11  wind_speed        27664 non-null  float64
 12  wind_deg          27664 non-null  int64  
 13  wind_gust         8970 non-null   float64
 14  we                27664 non-null  int64  
 15  rain.1h           1380 non-null   float64
dtypes: float64(7), int64(6), object(3)
memor

Build the full dataset by loading all the saved daily files and concatenating them.

In [25]:
# Set the first date (the date of the first daily file)
now = datetime.now()
fecha_inicial = datetime(2021,4,5)

# Set the most recent date (today)
fecha_final = datetime(now.year,now.month,now.day)

# Build the list of days as 'YYYY-MM-DD' strings
rango_fechas = range((fecha_final - fecha_inicial).days + 1)
fechas = [(fecha_inicial + timedelta(days = d)).strftime("%Y-%m-%d") for d in rango_fechas]

fechas

['2021-04-05',
 '2021-04-06',
 '2021-04-07',
 '2021-04-08',
 '2021-04-09',
 '2021-04-10',
 '2021-04-11',
 '2021-04-12',
 '2021-04-13',
 '2021-04-14',
 '2021-04-15',
 '2021-04-16',
 '2021-04-17',
 '2021-04-18',
 '2021-04-19',
 '2021-04-20',
 '2021-04-21',
 '2021-04-22',
 '2021-04-23',
 '2021-04-24',
 '2021-04-25',
 '2021-04-26',
 '2021-04-27',
 '2021-04-28',
 '2021-04-29',
 '2021-04-30',
 '2021-05-01',
 '2021-05-02',
 '2021-05-03',
 '2021-05-04',
 '2021-05-05',
 '2021-05-06',
 '2021-05-07',
 '2021-05-08',
 '2021-05-09',
 '2021-05-10',
 '2021-05-11',
 '2021-05-12',
 '2021-05-13',
 '2021-05-14',
 '2021-05-15',
 '2021-05-16',
 '2021-05-17',
 '2021-05-18',
 '2021-05-19',
 '2021-05-20',
 '2021-05-21',
 '2021-05-22',
 '2021-05-23',
 '2021-05-24',
 '2021-05-25',
 '2021-05-26',
 '2021-05-27',
 '2021-05-28',
 '2021-05-29',
 '2021-05-30',
 '2021-05-31',
 '2021-06-01',
 '2021-06-02',
 '2021-06-03',
 '2021-06-04',
 '2021-06-05']

In [26]:
# For each day, load the saved daily dataset (days without a file are skipped)
dfs_diarios = []
for date in fechas:
    
    try:
        dfs_diarios.append(pd.read_csv(directorio + 'data/Clima_OW/clima_ow_{}'.format(date)))
    except FileNotFoundError:
        continue

# Concatenate once at the end instead of appending inside the loop
df_clima_total = pd.concat(dfs_diarios, ignore_index = True)
    
print(df_clima_total.shape)
df_clima_total.head()

(2142432, 17)


Unnamed: 0,date,hour,fecha prediccion,estacion,temp,feels_like,pressure,humidity,dew_point,clouds,visibility,wind_speed,wind_deg,wind_gust,we,rain.1h,snow.1h
0,2021-04-04,00:00,2021-04-05,0252D,286.3,285.66,1014,82,283.3,0,10000.0,1.03,260,,800,,
1,2021-04-04,01:00,2021-04-05,0252D,285.94,284.13,1014,82,282.95,0,10000.0,2.57,290,,800,,
2,2021-04-04,02:00,2021-04-05,0252D,285.67,284.98,1014,87,283.57,0,10000.0,1.22,271,1.79,800,,
3,2021-04-04,03:00,2021-04-05,0252D,285.53,284.34,1014,82,282.55,0,10000.0,1.54,60,,800,,
4,2021-04-04,04:00,2021-04-05,0252D,284.3,282.56,1014,93,283.21,0,10000.0,2.57,280,,800,,


### Handling NAs

Show the number of NAs per column and the fraction of the total they represent.

In [27]:
total = df_clima_total.isnull().sum().sort_values(ascending = False)
percent = (df_clima_total.isnull().sum() / df_clima_total.isnull().count()).sort_values(ascending = False)
missing_data = pd.concat([total, percent], axis = 1, keys = ['Total', 'Percent'])
missing_data.head(20)

Unnamed: 0,Total,Percent
snow.1h,2141824,0.999716
rain.1h,1973824,0.921301
wind_gust,1420513,0.663038
visibility,639137,0.298323
pressure,0,0.0
hour,0,0.0
fecha prediccion,0,0.0
estacion,0,0.0
temp,0,0.0
feels_like,0,0.0


About 29.8 % of the ``visibility`` values are NA. The missing values are replaced by the column mean.

About 66.3 % of the ``wind_gust`` values are NA. The missing values are replaced by the column mean.

In [28]:
df_clima_total.fillna({'visibility': df_clima_total["visibility"].mean(), 'wind_gust': df_clima_total["wind_gust"].mean()}, inplace = True)
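The NA summary and the mean imputation used above can be reproduced on a toy frame. This is only a sketch: the data are made up, and ``resumen`` and ``demo_imputado`` are names assumed for the example. ``isna().mean()`` gives the same fraction as dividing the NA count by the row count.

```python
import numpy as np
import pandas as pd

# Toy data with missing values in two columns
demo = pd.DataFrame({
    "visibility": [10000.0, np.nan, 8000.0, np.nan],
    "wind_gust": [np.nan, np.nan, np.nan, 2.0],
    "temp": [286.3, 285.9, 285.7, 285.5],
})

# Number and fraction of NAs per column, largest first
resumen = pd.DataFrame({
    "Total": demo.isna().sum(),
    "Percent": demo.isna().mean(),
}).sort_values("Total", ascending=False)
print(resumen)

# Replace the NAs of selected columns by the mean of the observed values
demo_imputado = demo.fillna({
    "visibility": demo["visibility"].mean(),
    "wind_gust": demo["wind_gust"].mean(),
})
print(demo_imputado)
```

Note that when most of a column is missing, as with ``wind_gust``, the imputed column is dominated by a single constant value.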

About 92.1 % of the ``rain.1h`` values are NA. This column will not be used in the model.

About 99.97 % of the ``snow.1h`` values are NA. This column will not be used in the model.

In [29]:
df_clima_total.drop(['rain.1h'], axis=1, inplace = True)
df_clima_total.drop(['snow.1h'], axis=1, inplace = True)

In [30]:
df_clima_total.head()

Unnamed: 0,date,hour,fecha prediccion,estacion,temp,feels_like,pressure,humidity,dew_point,clouds,visibility,wind_speed,wind_deg,wind_gust,we
0,2021-04-04,00:00,2021-04-05,0252D,286.3,285.66,1014,82,283.3,0,10000.0,1.03,260,4.284503,800
1,2021-04-04,01:00,2021-04-05,0252D,285.94,284.13,1014,82,282.95,0,10000.0,2.57,290,4.284503,800
2,2021-04-04,02:00,2021-04-05,0252D,285.67,284.98,1014,87,283.57,0,10000.0,1.22,271,1.79,800
3,2021-04-04,03:00,2021-04-05,0252D,285.53,284.34,1014,82,282.55,0,10000.0,1.54,60,4.284503,800
4,2021-04-04,04:00,2021-04-05,0252D,284.3,282.56,1014,93,283.21,0,10000.0,2.57,280,4.284503,800


In [31]:
df_clima_total.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2142432 entries, 0 to 2142431
Data columns (total 15 columns):
 #   Column            Dtype  
---  ------            -----  
 0   date              object 
 1   hour              object 
 2   fecha prediccion  object 
 3   estacion          object 
 4   temp              float64
 5   feels_like        float64
 6   pressure          int64  
 7   humidity          int64  
 8   dew_point         float64
 9   clouds            int64  
 10  visibility        float64
 11  wind_speed        float64
 12  wind_deg          int64  
 13  wind_gust         float64
 14  we                int64  
dtypes: float64(6), int64(5), object(4)
memory usage: 245.2+ MB


Remove any duplicated rows, that is, rows that share the same date, hour, forecast date and station. The first occurrence is kept.

In [32]:
df_clima_total = df_clima_total.drop_duplicates(['date', 'hour', "fecha prediccion", "estacion"],
                        keep = 'first')
df_clima_total.reset_index(drop = True, inplace = True)
df_clima_total.head()

Unnamed: 0,date,hour,fecha prediccion,estacion,temp,feels_like,pressure,humidity,dew_point,clouds,visibility,wind_speed,wind_deg,wind_gust,we
0,2021-04-04,00:00,2021-04-05,0252D,286.3,285.66,1014,82,283.3,0,10000.0,1.03,260,4.284503,800
1,2021-04-04,01:00,2021-04-05,0252D,285.94,284.13,1014,82,282.95,0,10000.0,2.57,290,4.284503,800
2,2021-04-04,02:00,2021-04-05,0252D,285.67,284.98,1014,87,283.57,0,10000.0,1.22,271,1.79,800
3,2021-04-04,03:00,2021-04-05,0252D,285.53,284.34,1014,82,282.55,0,10000.0,1.54,60,4.284503,800
4,2021-04-04,04:00,2021-04-05,0252D,284.3,282.56,1014,93,283.21,0,10000.0,2.57,280,4.284503,800
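To see what the de-duplication above does, the following sketch counts and removes repeated key combinations on a toy frame. The data and the names ``claves``, ``n_repetidas`` and ``sin_repetidas`` are assumptions of the example.

```python
import pandas as pd

# Toy data: the first two rows share the same key columns
demo = pd.DataFrame({
    "date": ["2021-04-04", "2021-04-04", "2021-04-04"],
    "hour": ["00:00", "00:00", "01:00"],
    "fecha prediccion": ["2021-04-05", "2021-04-05", "2021-04-05"],
    "estacion": ["0252D", "0252D", "0252D"],
    "temp": [286.30, 286.41, 285.94],
})
claves = ["date", "hour", "fecha prediccion", "estacion"]

# Rows whose key combination already appeared earlier
n_repetidas = int(demo.duplicated(subset=claves).sum())

# Keep the first occurrence of each key combination
sin_repetidas = demo.drop_duplicates(subset=claves, keep="first").reset_index(drop=True)

print(n_repetidas)
print(sin_repetidas)
```

Only the key columns decide what counts as a duplicate, so two rows with different measurements but the same keys collapse into the first one.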


# Final weather function <a id="Función-final-clima"></a>
<div style = "float:right"><a style="text-decoration:none" href = "#Limpieza-y-preparación-de-datos-históricos-para-los-modelos-de-predicción">Back to top</a></div>

In [33]:
def clima_ow_clean(df_datos, df_puntos):
    
    # Remove the stations that are not present in both data frames
    
    # Stations in the list of stations
    serie_indicativos = set(df_puntos["indicativo"].unique().astype("str"))
    # Stations in the weather dataset
    serie_estaciones = set(df_datos["estacion"].unique().astype("str"))
    # Station codes that appear in only one of the two sources
    estacion_quitar = serie_indicativos ^ serie_estaciones

    # Work on a copy so that the caller's DataFrame is not modified
    df_datos = df_datos[~df_datos["estacion"].isin(estacion_quitar)].copy()
    
    # Drop the unneeded columns (mostly NAs)
    
    df_datos.drop(['rain.1h', 'snow.1h'], axis=1, inplace = True)
    
    # Convert the hour to a number and keep only the hours with sunlight
    
    df_datos["hour"] = pd.to_numeric([np.nan if pd.isna(c) else c[:2] for c in df_datos["hour"]])
    df_datos = df_datos[(df_datos["hour"] < hora_fin) & (df_datos["hour"] >= hora_ini)].copy()
    
    # Fill the NAs with the column mean
    
    df_datos.fillna({'visibility': df_datos["visibility"].mean(), 'wind_gust': df_datos["wind_gust"].mean()}, inplace = True)
    
    # Remove duplicated rows
    
    df_datos = df_datos.drop_duplicates(['date', 'hour', "fecha prediccion", "estacion"],
                        keep = 'first')
    df_datos.reset_index(drop = True, inplace = True)

    # Save the clean file
    nombre = './data/Historicos_modelo_2/historicos_climaticos_clean.csv'
    df_datos.to_csv(nombre, index = False)
    
    return df_datos
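When a cleaning function filters rows and then fills NAs, it is worth checking that the DataFrame passed in by the caller is left untouched. The sketch below shows one way to do that by returning new objects instead of filling in place; ``limpiar_demo`` is a hypothetical helper written for this illustration, not the function used in the notebook.

```python
import numpy as np
import pandas as pd

def limpiar_demo(df, hora_ini=4, hora_fin=20):
    """Hypothetical helper: filter by hour and fill NAs without touching the input."""
    # Boolean indexing returns a new object with only the selected rows
    filtrado = df[(df["hour"] >= hora_ini) & (df["hour"] < hora_fin)]
    # fillna without inplace returns a new DataFrame
    relleno = filtrado.fillna({"wind_gust": filtrado["wind_gust"].mean()})
    return relleno.reset_index(drop=True)

entrada = pd.DataFrame({
    "hour": [3, 5, 6, 7],
    "wind_gust": [1.0, np.nan, 2.0, 4.0],
})
salida = limpiar_demo(entrada)

print(salida)
print(entrada)   # still has its original rows and its NaN
```

The mean is computed after the hour filter, so the value used for imputation depends on which rows survive the filter.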

In [34]:
import numpy as np
import pandas as pd
import random
pd.options.display.max_columns = None
pd.options.display.max_rows = None
import matplotlib.pyplot as plt
plt.style.use("seaborn")  # on matplotlib >= 3.6 this style is named "seaborn-v0_8"
from datetime import datetime, timedelta
hora_ini = 4
hora_fin = 20

# Read the CSVs
now = datetime.now()
fecha_inicial = datetime(2021,4,5)
fecha_final = datetime(now.year,now.month,now.day)
rango_fechas = range((fecha_final - fecha_inicial).days + 1)
fechas = [(fecha_inicial + timedelta(days = d)).strftime("%Y-%m-%d") for d in rango_fechas]

directorio = '/home/dsc/git/TFM/'
df_estaciones = pd.read_csv(directorio + 'data/estaciones.csv')

# Load every available daily file and concatenate them once at the end
dfs_diarios = []
for date in fechas:
    try:
        dfs_diarios.append(pd.read_csv(directorio + 'data/Clima_OW/clima_ow_{}'.format(date)))
    except FileNotFoundError:
        continue
df_clima_total = pd.concat(dfs_diarios, ignore_index = True)

# Call the function
df_clean = clima_ow_clean(df_clima_total, df_estaciones)
    
df_clean.head()



Unnamed: 0,date,hour,fecha prediccion,estacion,temp,feels_like,pressure,humidity,dew_point,clouds,visibility,wind_speed,wind_deg,wind_gust,we
0,2021-04-04,4,2021-04-05,0252D,284.3,282.56,1014,93,283.21,0,10000.0,2.57,280,4.782415,800
1,2021-04-04,5,2021-04-05,0252D,284.59,282.57,1014,93,283.5,0,10000.0,3.09,290,4.782415,800
2,2021-04-04,6,2021-04-05,0252D,284.41,283.88,1015,87,282.33,0,10000.0,0.51,0,4.782415,800
3,2021-04-04,7,2021-04-05,0252D,284.99,284.61,1016,87,282.9,0,10000.0,0.51,0,4.782415,800
4,2021-04-04,8,2021-04-05,0252D,286.41,286.17,1016,82,283.41,0,10000.0,0.51,0,4.782415,800


In [35]:
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1362000 entries, 0 to 1361999
Data columns (total 15 columns):
 #   Column            Non-Null Count    Dtype  
---  ------            --------------    -----  
 0   date              1362000 non-null  object 
 1   hour              1362000 non-null  int64  
 2   fecha prediccion  1362000 non-null  object 
 3   estacion          1362000 non-null  object 
 4   temp              1362000 non-null  float64
 5   feels_like        1362000 non-null  float64
 6   pressure          1362000 non-null  int64  
 7   humidity          1362000 non-null  int64  
 8   dew_point         1362000 non-null  float64
 9   clouds            1362000 non-null  int64  
 10  visibility        1362000 non-null  float64
 11  wind_speed        1362000 non-null  float64
 12  wind_deg          1362000 non-null  int64  
 13  wind_gust         1362000 non-null  float64
 14  we                1362000 non-null  int64  
dtypes: float64(6), int64(6), object(3)
memory usage: 

### Sanity check

In [36]:
# Parse the dates once (they are stored as 'YYYY-MM-DD' strings)
fechas_dt = pd.to_datetime(df_clean['date'], format='%Y-%m-%d')
print('Distinct years: ', fechas_dt.dt.year.nunique())
print('Distinct months:', fechas_dt.dt.month.nunique())
print('There should be 31 distinct days of the month:', fechas_dt.dt.day.nunique())
print('Distinct hours (all should fall within the filtered range):', df_clean['hour'].nunique())


print('Total length: ', len(df_clean["date"]))
print(df_clean.describe())

Distinct years:  1
Distinct months: 4
There should be 31 distinct days of the month: 31
Distinct hours (all should fall within the filtered range): 16
Total length:  1362000
               hour          temp    feels_like      pressure      humidity  \
count  1.362000e+06  1.362000e+06  1.362000e+06  1.362000e+06  1.362000e+06   
mean   1.150000e+01  2.894646e+02 -1.062627e+13  1.013698e+03 -1.083509e+14   
std    4.609774e+00  5.418496e+00  3.100337e+15  1.001832e+01  3.161249e+16   
min    4.000000e+00  2.664400e+02 -9.082920e+17  8.270000e+02 -9.223372e+18   
25%    7.750000e+00  2.858100e+02  2.830100e+02  1.012000e+03  4.900000e+01   
50%    1.150000e+01  2.893700e+02  2.868500e+02  1.015000e+03  6.600000e+01   
75%    1.525000e+01  2.931500e+02  2.907200e+02  1.018000e+03  8.000000e+01   
max    1.900000e+01  3.091500e+02  3.072300e+02  1.047000e+03  1.000000e+02   

          dew_point        clouds    visibility    wind_speed      wind_deg  \
count  1.362000e+06  1.362000e+06  1.362000
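The ``describe()`` output above reports a minimum of -9.223372e+18 for ``humidity`` and -9.082920e+17 for ``feels_like``, which are not physically possible readings. A possible follow-up check is sketched below. The plausibility bounds in ``limites`` are assumptions chosen for the example, not values taken from the notebook, and the toy data only imitate the kind of outliers seen above.

```python
import numpy as np
import pandas as pd

# Assumed plausibility bounds (illustrative only)
limites = {"humidity": (0, 100), "feels_like": (150.0, 350.0)}

# Toy data imitating the implausible values seen in describe()
demo = pd.DataFrame({
    "humidity": [82, 93, np.iinfo(np.int64).min],
    "feels_like": [285.66, -9.08e17, 284.98],
})

# Flag every row with at least one value outside its bounds
fuera_de_rango = pd.Series(False, index=demo.index)
for columna, (minimo, maximo) in limites.items():
    fuera_de_rango |= ~demo[columna].between(minimo, maximo)

print("Rows with implausible values:", int(fuera_de_rango.sum()))
print(demo[fuera_de_rango])
```

How to treat the flagged rows (drop them, or set the values to NA and impute) is a modelling decision that this sketch leaves open.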

# Weather forecasts for the next 2 days <a id="Predicciones-climatológicas-de-los-2-días-siguientes"></a>
<div style = "float:right"><a style="text-decoration:none" href = "#Limpieza-y-preparación-de-datos-históricos-para-los-modelos-de-predicción">Back to top</a></div>

These data come from the OpenWeather portal (https://openweathermap.org/api/one-call-api), using a student licence that allows a large number of calls per day. **Data are in UTC.** Each call returns the hourly weather forecast for the 48 hours after the call. The fields obtained are:

- ``dt``: Time of the forecasted data, Unix, UTC
- ``temp``: Temperature. Units: kelvin
- ``feels_like``: Temperature. This accounts for the human perception of weather. Units: kelvin
- ``pressure``: Atmospheric pressure on the sea level, hPa
- ``humidity``: Humidity, %
- ``dew_point``: Atmospheric temperature (varying according to pressure and humidity) below which water droplets begin to condense and dew can form. Units: kelvin
- ``uvi``: UV index
- ``clouds``: Cloudiness, %
- ``visibility``: Average visibility, metres
- ``wind_speed``: Wind speed. Units: m/s
- ``wind_gust``: Wind gust. Units: m/s
- ``wind_deg``: Wind direction, degrees (meteorological)
- ``pop``: Probability of precipitation
- ``rain``: Rain volume for last hour, mm
- ``snow``: Snow volume for last hour, mm
- ``weather``: Contains an id and other parameters

Hour X contains the data recorded between X:00 and X:59.

### Exploring the data

In [37]:
# Import the CSV of weather forecasts.

%cd /home/dsc/git/TFM/
df_pred = pd.read_csv('./data/Pred_OW/pred_ow_2021-04-05', sep=',')
df_pred.head()

/home/dsc/git/TFM


Unnamed: 0,date,hour,fecha prediccion,estacion,temp,feels_like,pressure,humidity,dew_point,uvi,clouds,visibility,wind_speed,wind_deg,wind_gust,pop,rain.1h,we,snow.1h
0,2021-04-04,22:00,2021-04-05,0252D,285.63,284.53,1017,61,278.32,0.0,7,10000,1.97,283,1.83,0.0,,800,
1,2021-04-04,23:00,2021-04-05,0252D,285.49,284.48,1018,65,279.11,0.0,10,10000,2.19,272,2.15,0.0,,800,
2,2021-04-05,00:00,2021-04-05,0252D,285.39,284.47,1018,69,279.88,0.0,18,10000,2.02,267,2.1,0.0,,801,
3,2021-04-05,01:00,2021-04-05,0252D,285.27,284.39,1018,71,280.18,0.0,47,10000,1.86,266,2.05,0.0,,802,
4,2021-04-05,02:00,2021-04-05,0252D,285.3,284.42,1017,71,280.21,0.0,54,10000,1.86,263,2.14,0.0,,803,


In [38]:
df_pred.shape

(22656, 19)

Load the list of stations.

In [39]:
df_estaciones = pd.read_csv(directorio + 'data/estaciones.csv')

In [40]:
df_estaciones.head()

Unnamed: 0,latitud,provincia,altitud,indicativo,nombre,longitud
0,413515N,BARCELONA,74,0252D,ARENYS DE MAR,023224E
1,411734N,BARCELONA,4,0076,BARCELONA AEROPUERTO,020412E
2,412506N,BARCELONA,408,0200E,"BARCELONA, FABRA",020727E
3,412326N,BARCELONA,6,0201D,BARCELONA,021200E
4,414312N,BARCELONA,291,0149X,MANRESA,015025E


Find the stations that are not present in both data frames (``df_estaciones`` and ``df_pred``).

In [41]:
estacion_quitar = []
# Stations in the list of stations
serie_indicativos = df_estaciones["indicativo"].unique().astype("str")
# Stations in the forecast dataset
serie_estaciones = list(set(df_pred["estacion"].unique().astype("str")))

diferencia = len(serie_indicativos) - len(serie_estaciones)
print("The difference is: {}".format(diferencia))

# Store the codes of the stations in the list that are not in the dataset
for i in range(0, len(serie_indicativos)):
    estacion = serie_indicativos[i]
    if estacion not in serie_estaciones:
        print("Missing: index {}, station {}".format(i, estacion))
        estacion_quitar.append(str(estacion))
        diferencia -= 1
        print("The difference is: {}".format(diferencia))
# Store the codes of the stations in the dataset that are not in the list of stations
for i in range(0, len(serie_estaciones)):
    estacion = serie_estaciones[i]
    if (estacion not in list(serie_indicativos) and estacion not in estacion_quitar):
        print("Missing: index {}, station {}".format(i, estacion))
        estacion_quitar.append(str(estacion))
        diferencia += 1
        print("The difference is: {}".format(diferencia))
print(estacion_quitar)
print("The difference is: {}".format(diferencia))

The difference is: 0
Missing: index 31, station 1249X
The difference is: -1
Missing: index 194, station 1249I
The difference is: 0
['1249X', '1249I']
The difference is: 0


Show the list of station codes to remove.

In [42]:
estacion_quitar

['1249X', '1249I']

Count the rows that belong to these stations before removing them from the dataset.

In [43]:
# Count the rows whose station is in the removal list
cont = df_pred["estacion"].astype(str).isin(estacion_quitar).sum()
print("{} rows out of {} must be removed. {} will remain".format(cont, len(df_pred), len(df_pred) - cont))

96 rows out of 22656 must be removed. 22560 will remain


Create a column that stores True when the station of that row must be removed.

In [44]:
# True for the rows whose station is in the removal list
estaciones_df = df_pred["estacion"].isin(estacion_quitar).tolist()

In [45]:
df_pred.insert(len(df_pred.columns),"quitar",estaciones_df,True)

In [46]:
df_pred.head()

Unnamed: 0,date,hour,fecha prediccion,estacion,temp,feels_like,pressure,humidity,dew_point,uvi,clouds,visibility,wind_speed,wind_deg,wind_gust,pop,rain.1h,we,snow.1h,quitar
0,2021-04-04,22:00,2021-04-05,0252D,285.63,284.53,1017,61,278.32,0.0,7,10000,1.97,283,1.83,0.0,,800,,False
1,2021-04-04,23:00,2021-04-05,0252D,285.49,284.48,1018,65,279.11,0.0,10,10000,2.19,272,2.15,0.0,,800,,False
2,2021-04-05,00:00,2021-04-05,0252D,285.39,284.47,1018,69,279.88,0.0,18,10000,2.02,267,2.1,0.0,,801,,False
3,2021-04-05,01:00,2021-04-05,0252D,285.27,284.39,1018,71,280.18,0.0,47,10000,1.86,266,2.05,0.0,,802,,False
4,2021-04-05,02:00,2021-04-05,0252D,285.3,284.42,1017,71,280.21,0.0,54,10000,1.86,263,2.14,0.0,,803,,False


In [47]:
df_pred.drop(df_pred[df_pred["quitar"] == True].index, inplace = True)

In [48]:
df_pred.reset_index(drop=True, inplace=True)
df_pred.head()

Unnamed: 0,date,hour,fecha prediccion,estacion,temp,feels_like,pressure,humidity,dew_point,uvi,clouds,visibility,wind_speed,wind_deg,wind_gust,pop,rain.1h,we,snow.1h,quitar
0,2021-04-04,22:00,2021-04-05,0252D,285.63,284.53,1017,61,278.32,0.0,7,10000,1.97,283,1.83,0.0,,800,,False
1,2021-04-04,23:00,2021-04-05,0252D,285.49,284.48,1018,65,279.11,0.0,10,10000,2.19,272,2.15,0.0,,800,,False
2,2021-04-05,00:00,2021-04-05,0252D,285.39,284.47,1018,69,279.88,0.0,18,10000,2.02,267,2.1,0.0,,801,,False
3,2021-04-05,01:00,2021-04-05,0252D,285.27,284.39,1018,71,280.18,0.0,47,10000,1.86,266,2.05,0.0,,802,,False
4,2021-04-05,02:00,2021-04-05,0252D,285.3,284.42,1017,71,280.21,0.0,54,10000,1.86,263,2.14,0.0,,803,,False


In [49]:
print("Efectivamente, la nueva cantidad de filas es {}".format(len(df_pred["estacion"])))

Efectivamente, la nueva cantidad de filas es 22560
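
The station filtering done in the cells above can also be written without a row-by-row loop or a helper column. The following is a minimal sketch on toy data, not the code used to produce the outputs in this notebook: `df_demo`, `indicativos_validos`, `estacion_quitar_demo` and `df_filtrado` are hypothetical names, and the station `9999Z` is invented for illustration.

```python
import pandas as pd

# Toy data standing in for df_pred and the station list (illustrative values)
df_demo = pd.DataFrame({
    "estacion": ["0252D", "1249X", "0252D", "1249I", "0076"],
    "temp": [285.6, 284.1, 285.4, 283.9, 286.0],
})
indicativos_validos = {"0252D", "0076", "9999Z"}

# Identifiers that appear in only one of the two sources (symmetric difference)
estaciones_datos = set(df_demo["estacion"].astype(str))
estacion_quitar_demo = estaciones_datos ^ indicativos_validos

# Boolean mask computed in one step instead of a list filled row by row
mask_quitar = df_demo["estacion"].astype(str).isin(estacion_quitar_demo)
df_filtrado = df_demo[~mask_quitar].reset_index(drop=True)

print(sorted(estacion_quitar_demo))
print(len(df_filtrado))
```

On a DataFrame of this notebook's size, a vectorized mask of this kind should be considerably faster than indexing with `.loc[i]` inside a Python loop.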


### Selecting the relevant columns

Drop the helper column ``quitar``.
The remaining columns are meteorological data that will be used to build the model.

In [50]:
df_pred.drop(['quitar'], axis=1, inplace = True)
df_pred.head()

Unnamed: 0,date,hour,fecha prediccion,estacion,temp,feels_like,pressure,humidity,dew_point,uvi,clouds,visibility,wind_speed,wind_deg,wind_gust,pop,rain.1h,we,snow.1h
0,2021-04-04,22:00,2021-04-05,0252D,285.63,284.53,1017,61,278.32,0.0,7,10000,1.97,283,1.83,0.0,,800,
1,2021-04-04,23:00,2021-04-05,0252D,285.49,284.48,1018,65,279.11,0.0,10,10000,2.19,272,2.15,0.0,,800,
2,2021-04-05,00:00,2021-04-05,0252D,285.39,284.47,1018,69,279.88,0.0,18,10000,2.02,267,2.1,0.0,,801,
3,2021-04-05,01:00,2021-04-05,0252D,285.27,284.39,1018,71,280.18,0.0,47,10000,1.86,266,2.05,0.0,,802,
4,2021-04-05,02:00,2021-04-05,0252D,285.3,284.42,1017,71,280.21,0.0,54,10000,1.86,263,2.14,0.0,,803,


In [51]:
df_pred.shape

(22560, 19)

``.info()`` shows the number of non-null values per column, which reveals the NAs.

In [52]:
df_pred.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22560 entries, 0 to 22559
Data columns (total 19 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   date              22560 non-null  object 
 1   hour              22560 non-null  object 
 2   fecha prediccion  22560 non-null  object 
 3   estacion          22560 non-null  object 
 4   temp              22560 non-null  float64
 5   feels_like        22560 non-null  float64
 6   pressure          22560 non-null  int64  
 7   humidity          22560 non-null  int64  
 8   dew_point         22560 non-null  float64
 9   uvi               22560 non-null  float64
 10  clouds            22560 non-null  int64  
 11  visibility        22560 non-null  int64  
 12  wind_speed        22560 non-null  float64
 13  wind_deg          22560 non-null  int64  
 14  wind_gust         22560 non-null  float64
 15  pop               22560 non-null  float64
 16  rain.1h           521 non-null    float6

Only daylight hours are used.

In [53]:
df_pred["hour"] = pd.to_numeric([np.nan if pd.isna(c) == True else c[:2] for c in df_pred["hour"]])

In [54]:
df_pred = df_pred[(df_pred["hour"] < hora_fin) & (df_pred["hour"] >= hora_ini)]
df_pred.reset_index(drop=True, inplace=True)
df_pred.head()

Unnamed: 0,date,hour,fecha prediccion,estacion,temp,feels_like,pressure,humidity,dew_point,uvi,clouds,visibility,wind_speed,wind_deg,wind_gust,pop,rain.1h,we,snow.1h
0,2021-04-05,4,2021-04-05,0252D,285.47,284.66,1017,73,280.28,0.0,63,10000,0.98,306,1.36,0.0,,803,
1,2021-04-05,5,2021-04-05,0252D,285.54,284.66,1017,70,279.91,0.0,65,10000,1.18,10,1.43,0.0,,803,
2,2021-04-05,6,2021-04-05,0252D,285.31,284.44,1018,71,279.75,0.11,61,10000,1.73,34,1.71,0.0,,803,
3,2021-04-05,7,2021-04-05,0252D,286.09,285.19,1017,67,279.64,0.55,16,10000,1.52,58,1.76,0.0,,801,
4,2021-04-05,8,2021-04-05,0252D,287.24,286.32,1017,62,279.58,1.46,10,10000,0.94,103,1.28,0.0,,800,
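
The hour conversion and the daylight filter can be checked on a small example. This is a sketch under assumptions: `df_horas`, `hora_ini_demo` and `hora_fin_demo` are hypothetical names, and the `.str[:2]` accessor is an alternative to the list comprehension used above, valid when the column holds `"HH:MM"` strings.

```python
import pandas as pd

hora_ini_demo, hora_fin_demo = 4, 20  # same window as hora_ini / hora_fin

df_horas = pd.DataFrame(
    {"hour": ["22:00", "03:00", "04:00", "12:00", "19:00", "20:00"]}
)

# Take the first two characters ("HH") and convert them to integers
df_horas["hour"] = pd.to_numeric(df_horas["hour"].str[:2])

# Keep hora_ini <= hour < hora_fin: the lower bound is included, the upper is not
en_ventana = (df_horas["hour"] >= hora_ini_demo) & (df_horas["hour"] < hora_fin_demo)
df_dia = df_horas[en_ventana].reset_index(drop=True)

print(df_dia["hour"].tolist())
```

Note that hour 20 is excluded, which matches the maximum of 19 reported later by `describe()`.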


In [55]:
df_pred.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15040 entries, 0 to 15039
Data columns (total 19 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   date              15040 non-null  object 
 1   hour              15040 non-null  int64  
 2   fecha prediccion  15040 non-null  object 
 3   estacion          15040 non-null  object 
 4   temp              15040 non-null  float64
 5   feels_like        15040 non-null  float64
 6   pressure          15040 non-null  int64  
 7   humidity          15040 non-null  int64  
 8   dew_point         15040 non-null  float64
 9   uvi               15040 non-null  float64
 10  clouds            15040 non-null  int64  
 11  visibility        15040 non-null  int64  
 12  wind_speed        15040 non-null  float64
 13  wind_deg          15040 non-null  int64  
 14  wind_gust         15040 non-null  float64
 15  pop               15040 non-null  float64
 16  rain.1h           431 non-null    float6

Build the complete dataset by loading all the saved daily files and concatenating them

In [56]:
# Set the first date (date of the first daily file)
now = datetime.now()
fecha_inicial = datetime(2021,4,5)

# Set the most recent date (today)
fecha_final = datetime(now.year,now.month,now.day)

# Build the list of days as 'YYYY-MM-DD' strings
rango_fechas = range((fecha_final - fecha_inicial).days + 1)
fechas = [(fecha_inicial + timedelta(days = d)).strftime("%Y-%m-%d") for d in rango_fechas]

fechas

['2021-04-05',
 '2021-04-06',
 '2021-04-07',
 '2021-04-08',
 '2021-04-09',
 '2021-04-10',
 '2021-04-11',
 '2021-04-12',
 '2021-04-13',
 '2021-04-14',
 '2021-04-15',
 '2021-04-16',
 '2021-04-17',
 '2021-04-18',
 '2021-04-19',
 '2021-04-20',
 '2021-04-21',
 '2021-04-22',
 '2021-04-23',
 '2021-04-24',
 '2021-04-25',
 '2021-04-26',
 '2021-04-27',
 '2021-04-28',
 '2021-04-29',
 '2021-04-30',
 '2021-05-01',
 '2021-05-02',
 '2021-05-03',
 '2021-05-04',
 '2021-05-05',
 '2021-05-06',
 '2021-05-07',
 '2021-05-08',
 '2021-05-09',
 '2021-05-10',
 '2021-05-11',
 '2021-05-12',
 '2021-05-13',
 '2021-05-14',
 '2021-05-15',
 '2021-05-16',
 '2021-05-17',
 '2021-05-18',
 '2021-05-19',
 '2021-05-20',
 '2021-05-21',
 '2021-05-22',
 '2021-05-23',
 '2021-05-24',
 '2021-05-25',
 '2021-05-26',
 '2021-05-27',
 '2021-05-28',
 '2021-05-29',
 '2021-05-30',
 '2021-05-31',
 '2021-06-01',
 '2021-06-02',
 '2021-06-03',
 '2021-06-04',
 '2021-06-05']
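
The same list of dates can be produced with `pd.date_range`. A minimal sketch, assuming a fixed end date so that the result is reproducible; `fecha_inicial_demo`, `fecha_final_demo` and `fechas_demo` are hypothetical names.

```python
from datetime import datetime

import pandas as pd

fecha_inicial_demo = datetime(2021, 4, 5)
fecha_final_demo = datetime(2021, 4, 8)  # fixed instead of datetime.now()

# One entry per calendar day, both endpoints included
fechas_demo = (
    pd.date_range(fecha_inicial_demo, fecha_final_demo, freq="D")
    .strftime("%Y-%m-%d")
    .tolist()
)

print(fechas_demo)
```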

In [57]:
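frames = []

# For each day, load the saved daily file if it exists
for date in fechas:
    try:
        df_pred_ow = pd.read_csv(directorio + 'data/Pred_OW/pred_ow_{}'.format(date))
        frames.append(df_pred_ow)
    except FileNotFoundError:
        # No file was saved for this day
        continue

# DataFrame.append was removed in pandas 2.0; concatenate once at the end instead
df_pred_total = pd.concat(frames, ignore_index = True) if frames else pd.DataFrame()

print(df_pred_total.shape)
df_pred_total.head()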

(857424, 19)


Unnamed: 0,date,hour,fecha prediccion,estacion,temp,feels_like,pressure,humidity,dew_point,uvi,clouds,visibility,wind_speed,wind_deg,wind_gust,pop,rain.1h,we,snow.1h
0,2021-04-04,22:00,2021-04-05,0252D,285.63,284.53,1017,61,278.32,0.0,7,10000,1.97,283,1.83,0.0,,800,
1,2021-04-04,23:00,2021-04-05,0252D,285.49,284.48,1018,65,279.11,0.0,10,10000,2.19,272,2.15,0.0,,800,
2,2021-04-05,00:00,2021-04-05,0252D,285.39,284.47,1018,69,279.88,0.0,18,10000,2.02,267,2.1,0.0,,801,
3,2021-04-05,01:00,2021-04-05,0252D,285.27,284.39,1018,71,280.18,0.0,47,10000,1.86,266,2.05,0.0,,802,
4,2021-04-05,02:00,2021-04-05,0252D,285.3,284.42,1017,71,280.21,0.0,54,10000,1.86,263,2.14,0.0,,803,


### Gestión de los NAs

Show the number of NAs per column and the fraction of the total they represent

In [58]:
total = df_pred_total.isnull().sum().sort_values(ascending = False)
percent = (df_pred_total.isnull().sum() / df_pred_total.isnull().count()).sort_values(ascending = False)
missing_data = pd.concat([total, percent], axis = 1, keys = ['Total', 'Percent'])
missing_data.head(20)

Unnamed: 0,Total,Percent
snow.1h,857031,0.999542
rain.1h,756923,0.882787
dew_point,0,0.0
hour,0,0.0
fecha prediccion,0,0.0
estacion,0,0.0
temp,0,0.0
feels_like,0,0.0
pressure,0,0.0
humidity,0,0.0


About 88 % of the values in ``rain.1h`` are NAs. This column will not be used in the model.

About 99.9 % of the values in ``snow.1h`` are NAs. This column will not be used in the model.

In [59]:
df_pred_total.drop(['rain.1h'], axis=1, inplace = True)
df_pred_total.drop(['snow.1h'], axis=1, inplace = True)
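
Instead of naming the columns by hand, the drop can be driven by the NA fraction. This is only a sketch: the 0.5 threshold and the names `df_na`, `umbral_na`, `columnas_a_quitar` and `df_sin_na` are assumptions for illustration, not values taken from this notebook.

```python
import numpy as np
import pandas as pd

# Toy data: rain.1h is 75 % NA and snow.1h is 100 % NA
df_na = pd.DataFrame({
    "temp": [285.6, 285.5, 285.4, 285.3],
    "rain.1h": [np.nan, np.nan, np.nan, 0.2],
    "snow.1h": [np.nan, np.nan, np.nan, np.nan],
})

umbral_na = 0.5  # hypothetical threshold on the NA fraction

# isna().mean() gives the fraction of missing values in each column
fraccion_na = df_na.isna().mean()
columnas_a_quitar = fraccion_na[fraccion_na > umbral_na].index.tolist()
df_sin_na = df_na.drop(columns=columnas_a_quitar)

print(columnas_a_quitar)
```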

In [60]:
df_pred_total.head()

Unnamed: 0,date,hour,fecha prediccion,estacion,temp,feels_like,pressure,humidity,dew_point,uvi,clouds,visibility,wind_speed,wind_deg,wind_gust,pop,we
0,2021-04-04,22:00,2021-04-05,0252D,285.63,284.53,1017,61,278.32,0.0,7,10000,1.97,283,1.83,0.0,800
1,2021-04-04,23:00,2021-04-05,0252D,285.49,284.48,1018,65,279.11,0.0,10,10000,2.19,272,2.15,0.0,800
2,2021-04-05,00:00,2021-04-05,0252D,285.39,284.47,1018,69,279.88,0.0,18,10000,2.02,267,2.1,0.0,801
3,2021-04-05,01:00,2021-04-05,0252D,285.27,284.39,1018,71,280.18,0.0,47,10000,1.86,266,2.05,0.0,802
4,2021-04-05,02:00,2021-04-05,0252D,285.3,284.42,1017,71,280.21,0.0,54,10000,1.86,263,2.14,0.0,803


In [61]:
df_pred_total.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 857424 entries, 0 to 857423
Data columns (total 17 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   date              857424 non-null  object 
 1   hour              857424 non-null  object 
 2   fecha prediccion  857424 non-null  object 
 3   estacion          857424 non-null  object 
 4   temp              857424 non-null  float64
 5   feels_like        857424 non-null  float64
 6   pressure          857424 non-null  int64  
 7   humidity          857424 non-null  int64  
 8   dew_point         857424 non-null  float64
 9   uvi               857424 non-null  float64
 10  clouds            857424 non-null  int64  
 11  visibility        857424 non-null  int64  
 12  wind_speed        857424 non-null  float64
 13  wind_deg          857424 non-null  int64  
 14  wind_gust         857424 non-null  float64
 15  pop               857424 non-null  float64
 16  we                85

Remove possible duplicate rows

In [62]:
df_pred_total = df_pred_total.drop_duplicates(['date', 'hour', "fecha prediccion", "estacion"],
                        keep = 'first')
df_pred_total.reset_index(drop = True, inplace = True)
df_pred_total.head()

Unnamed: 0,date,hour,fecha prediccion,estacion,temp,feels_like,pressure,humidity,dew_point,uvi,clouds,visibility,wind_speed,wind_deg,wind_gust,pop,we
0,2021-04-04,22:00,2021-04-05,0252D,285.63,284.53,1017,61,278.32,0.0,7,10000,1.97,283,1.83,0.0,800
1,2021-04-04,23:00,2021-04-05,0252D,285.49,284.48,1018,65,279.11,0.0,10,10000,2.19,272,2.15,0.0,800
2,2021-04-05,00:00,2021-04-05,0252D,285.39,284.47,1018,69,279.88,0.0,18,10000,2.02,267,2.1,0.0,801
3,2021-04-05,01:00,2021-04-05,0252D,285.27,284.39,1018,71,280.18,0.0,47,10000,1.86,266,2.05,0.0,802
4,2021-04-05,02:00,2021-04-05,0252D,285.3,284.42,1017,71,280.21,0.0,54,10000,1.86,263,2.14,0.0,803
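
Before dropping duplicates it can be useful to know how many rows share the same key. A small sketch on toy data, where `df_dup`, `claves`, `n_repetidas` and `df_unico` are hypothetical names:

```python
import pandas as pd

claves = ["date", "hour", "fecha prediccion", "estacion"]

# Rows 0 and 1 share the same key; row 2 differs in "fecha prediccion"
df_dup = pd.DataFrame({
    "date": ["2021-04-05", "2021-04-05", "2021-04-05"],
    "hour": ["04:00", "04:00", "04:00"],
    "fecha prediccion": ["2021-04-05", "2021-04-05", "2021-04-06"],
    "estacion": ["0252D", "0252D", "0252D"],
    "temp": [285.47, 285.50, 285.60],
})

# duplicated() marks every repetition after the first occurrence of a key
n_repetidas = int(df_dup.duplicated(subset=claves, keep="first").sum())
df_unico = df_dup.drop_duplicates(subset=claves, keep="first").reset_index(drop=True)

print(n_repetidas, len(df_unico))
```

With `keep="first"` the earliest row for each key survives, so the order in which the daily files are concatenated decides which forecast is kept.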


# Final prediction function <a id="Función-final-predicciones"></a>
<div style = "float:right"><a style="text-decoration:none" href = "#Limpieza-y-preparación-de-datos-históricos-para-los-modelos-de-predicción">

In [63]:
def pred_clean(df_datos, df_puntos):

    # Remove the stations that are not present in both sources
    # Stations in the station list
    serie_indicativos = set(df_puntos["indicativo"].astype("str"))
    # Stations in the forecast dataset
    serie_estaciones = set(df_datos["estacion"].astype("str"))
    # Identifiers that appear in only one of the two
    estacion_quitar = serie_indicativos ^ serie_estaciones

    # Work on a filtered copy: this avoids SettingWithCopyWarning and
    # leaves the caller's DataFrame unmodified
    quitar = df_datos["estacion"].astype("str").isin(estacion_quitar)
    df_datos = df_datos[~quitar].copy()

    # Convert the hour ("HH:MM") to an integer and keep only daylight hours
    df_datos["hour"] = pd.to_numeric(df_datos["hour"].str[:2])
    df_datos = df_datos[(df_datos["hour"] < hora_fin) & (df_datos["hour"] >= hora_ini)]

    # Drop the columns that are almost entirely NA
    df_datos = df_datos.drop(columns = ['rain.1h', 'snow.1h'])

    # Remove possible duplicate rows
    df_datos = df_datos.drop_duplicates(['date', 'hour', "fecha prediccion", "estacion"],
                        keep = 'first')
    df_datos = df_datos.reset_index(drop = True)

    # Save the clean file
    nombre = './data/Historicos_modelo_2/predicciones_climaticas_clean.csv'
    df_datos.to_csv(nombre, index = False)

    return df_datos

In [64]:
import numpy as np
import pandas as pd
import random
pd.options.display.max_columns = None
pd.options.display.max_rows = None
import matplotlib.pyplot as plt
plt.style.use("seaborn")
from datetime import datetime, timedelta
hora_ini = 4
hora_fin = 20

# Read the CSVs
now = datetime.now()
fecha_inicial = datetime(2021,4,5)
fecha_final = datetime(now.year,now.month,now.day)
rango_fechas = range((fecha_final - fecha_inicial).days + 1)
fechas = [(fecha_inicial + timedelta(days = d)).strftime("%Y-%m-%d") for d in rango_fechas]

directorio = '/home/dsc/git/TFM/'
df_estaciones = pd.read_csv(directorio + 'data/estaciones.csv')

frames = []
for date in fechas:
    try:
        frames.append(pd.read_csv(directorio + 'data/Pred_OW/pred_ow_{}'.format(date)))
    except FileNotFoundError:
        # No file was saved for this day
        continue
df_pred_total = pd.concat(frames, ignore_index = True) if frames else pd.DataFrame()


# Call the function
df_clean = pred_clean(df_pred_total, df_estaciones)

df_clean.head()


Unnamed: 0,date,hour,fecha prediccion,estacion,temp,feels_like,pressure,humidity,dew_point,uvi,clouds,visibility,wind_speed,wind_deg,wind_gust,pop,we
0,2021-04-05,4,2021-04-05,0252D,285.47,284.66,1017,73,280.28,0.0,63,10000,0.98,306,1.36,0.0,803
1,2021-04-05,5,2021-04-05,0252D,285.54,284.66,1017,70,279.91,0.0,65,10000,1.18,10,1.43,0.0,803
2,2021-04-05,6,2021-04-05,0252D,285.31,284.44,1018,71,279.75,0.11,61,10000,1.73,34,1.71,0.0,803
3,2021-04-05,7,2021-04-05,0252D,286.09,285.19,1017,67,279.64,0.55,16,10000,1.52,58,1.76,0.0,801
4,2021-04-05,8,2021-04-05,0252D,287.24,286.32,1017,62,279.58,1.46,10,10000,0.94,103,1.28,0.0,800


In [65]:
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 549208 entries, 0 to 549207
Data columns (total 17 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   date              549208 non-null  object 
 1   hour              549208 non-null  int64  
 2   fecha prediccion  549208 non-null  object 
 3   estacion          549208 non-null  object 
 4   temp              549208 non-null  float64
 5   feels_like        549208 non-null  float64
 6   pressure          549208 non-null  int64  
 7   humidity          549208 non-null  int64  
 8   dew_point         549208 non-null  float64
 9   uvi               549208 non-null  float64
 10  clouds            549208 non-null  int64  
 11  visibility        549208 non-null  int64  
 12  wind_speed        549208 non-null  float64
 13  wind_deg          549208 non-null  int64  
 14  wind_gust         549208 non-null  float64
 15  pop               549208 non-null  float64
 16  we                54

### Sanity check

In [66]:
# The dates are stored as 'YYYY-MM-DD', so the format string must use hyphens
fechas_clean = pd.to_datetime(df_clean['date'], format='%Y-%m-%d')
print('Años distintos: ', fechas_clean.dt.year.nunique())
print('Meses diferentes:', fechas_clean.dt.month.nunique())
print('Debe haber 31 días distintos:', fechas_clean.dt.day.nunique())
# If there are 30, it is because the data only spans one month followed by a 30-day month
print('Solo debe haber horas disitintas dentro de las horas de filtrado:', df_clean['hour'].nunique())

print('Longitud total: ', len(df_clean["date"]))
print(df_clean.describe())

Años distintos:  1
Meses diferentes: 3
Debe haber 31 días distintos: 31
Solo debe haber horas disitintas dentro de las horas de filtrado: 16
Longitud total:  549208
                hour           temp     feels_like       pressure  \
count  549208.000000  549208.000000  549208.000000  549208.000000   
mean       11.480051     288.995420     288.223406    1016.102701   
std         4.605758       5.612776       5.941498       5.454355   
min         4.000000     264.670000     259.790000     729.000000   
25%         7.000000     285.220000     284.320000    1013.000000   
50%        11.000000     289.040000     288.330000    1016.000000   
75%        15.000000     292.790000     292.390000    1019.000000   
max        19.000000     309.630000     309.280000    1042.000000   

            humidity      dew_point            uvi         clouds  \
count  549208.000000  549208.000000  549208.000000  549208.000000   
mean       62.840346     280.907067       2.628406      61.187437   
std   
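
The printed checks above can also be turned into assertions, so that a broken dataset stops the run instead of relying on visual inspection. A minimal sketch on toy data; `df_check`, `hora_ini_check`, `hora_fin_check`, `n_meses` and `horas_ok` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for df_clean (illustrative values only)
df_check = pd.DataFrame({
    "date": ["2021-04-05", "2021-04-05", "2021-05-01"],
    "hour": [4, 19, 12],
})
hora_ini_check, hora_fin_check = 4, 20  # same window as hora_ini / hora_fin

fechas_check = pd.to_datetime(df_check["date"], format="%Y-%m-%d")
n_meses = fechas_check.dt.month.nunique()

# Every hour must satisfy hora_ini <= hour < hora_fin
horas_ok = bool(
    ((df_check["hour"] >= hora_ini_check) & (df_check["hour"] < hora_fin_check)).all()
)
assert horas_ok
# The number of distinct hours cannot exceed the width of the window
assert df_check["hour"].nunique() <= hora_fin_check - hora_ini_check

print(n_meses, horas_ok)
```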

# Previous-day radiation data <a id="Datos-de-radiación-del-día-anterior"></a>
<div style = "float:right"><a style="text-decoration:none" href = "#Limpieza-y-preparación-de-datos-históricos-para-los-modelos-de-predicción">

**These data are only available for the radiation stations.**

Hourly accumulated data (**TRUE SOLAR TIME**) for global, direct, diffuse and infrared radiation. They are obtained from the AEMET Opendata portal (https://opendata.aemet.es/centrodedescargas/productosAEMET). The fields obtained for each day are:

- ``Estación``: station name
- ``Indicativo``: climatological station identifier
- ``Tipo``: measured variable (Global/Diffuse/Direct/Erythemal UV/Infrared)
- ``GL/DF/DT``: hourly radiation accumulated between (indicated hour - 1) and (indicated hour), from 5 to 20 true solar time. Variables: Global/Diffuse/Direct (10*kJ/m²)
- ``UVER``: half-hourly radiation accumulated between (indicated hour:minutes - 30 minutes) and (indicated hour:minutes), from 4:30 to 20 true solar time. Variable: erythemal ultraviolet radiation (J/m²)
- ``IR``: hourly radiation accumulated between (indicated hour - 1) and (indicated hour), from 1 to 24 true solar time. Variable: infrared radiation (10*kJ/m²)
- ...

Hour X holds the data accumulated between (X-1):00 and X:00.
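
To make the layout described above concrete, the sketch below reshapes a few hourly global-radiation columns from wide to long format. It rests on assumptions: only three hour columns are included, the `10*kJ/m²` annotation is read as "values stored in units of 10 kJ/m²", and `df_ancho`, `df_largo`, `inicio_intervalo` and `GL_kJ_m2` are hypothetical names.

```python
import pandas as pd

# Two stations and three of the hourly GL columns, in the wide AEMET layout
df_ancho = pd.DataFrame({
    "fecha": ["04-04-21", "04-04-21"],
    "Indicativo": ["1387", "8178D"],
    "5": [0.0, 0.0],
    "6": [3.0, 3.0],
    "7": [47.0, 41.0],
})
columnas_hora = ["5", "6", "7"]

# One row per (station, hour) instead of one column per hour
df_largo = df_ancho.melt(
    id_vars=["fecha", "Indicativo"],
    value_vars=columnas_hora,
    var_name="hora",
    value_name="GL",
)
df_largo["hora"] = df_largo["hora"].astype(int)

# Hour X covers the interval from (X-1):00 to X:00
df_largo["inicio_intervalo"] = df_largo["hora"] - 1

# Assumption: values are in units of 10 kJ/m², so multiply by 10 to get kJ/m²
df_largo["GL_kJ_m2"] = df_largo["GL"] * 10

print(df_largo)
```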

### Exploring the data


In [67]:
# Load one CSV of previous-day radiation data

%cd /home/dsc/git/TFM/
df_rad_aemet = pd.read_csv('./data/Rad_AEMET/rad_aemet_2021-04-05', sep=',')
df_rad_aemet.head()

/home/dsc/git/TFM


Unnamed: 0,fecha,Estación,Indicativo,Tipo,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,SUMA,Tipo.1,5.1,6.1,7.1,8.1,9.1,10.1,11.1,12.1,13.1,14.1,15.1,16.1,17.1,18.1,19.1,20.1,SUMA.1,Tipo.2,5.2,6.2,7.2,8.2,9.2,10.2,11.2,12.2,13.2,14.2,15.2,16.2,17.2,18.2,19.2,20.2,SUMA.2,Tipo.3,4.5,5.3,5.5,6.3,6.5,7.3,7.5,8.3,8.5,9.3,9.5,10.3,10.5,11.3,11.5,12.3,12.5,13.3,13.5,14.3,14.5,15.3,15.5,16.3,16.5,17.3,17.5,18.3,18.5,19.3,19.5,20.3,SUMA.3,Tipo.4,1,2,3,4,5.4,6.4,7.4,8.4,9.4,10.4,11.4,12.4,13.4,14.4,15.4,16.4,17.4,18.4,19.4,20.4,21,22,23,24,SUMA.4
0,04-04-21,A CORUÑA,1387,GL,0.0,3.0,47.0,118.0,189.0,247.0,287.0,308.0,307.0,285.0,243.0,185.0,115.0,46.0,3.0,0.0,2382.0,DF,0.0,2.0,16.0,25.0,29.0,32.0,34.0,36.0,34.0,34.0,35.0,31.0,25.0,15.0,2.0,0.0,348.0,DT,0.0,16.0,166.0,260.0,303.0,324.0,333.0,335.0,339.0,330.0,313.0,293.0,252.0,162.0,12.0,0.0,3437.0,UVB,0.0,0.0,0.0,2.0,9.0,20.0,38.0,62.0,91.0,125.0,161.0,195.0,231.0,262.0,272.0,291.0,291.0,266.0,248.0,226.0,194.0,155.0,127.0,93.0,61.0,37.0,19.0,7.0,1.0,0.0,0.0,0.0,3485.0,IR,96.0,96.0,97.0,96.0,96.0,96.0,96.0,96.0,98.0,100.0,103.0,104.0,106.0,107.0,105.0,106.0,106.0,104.0,101.0,99.0,98.0,98.0,98.0,,
1,04-04-21,ALBACETE,8178D,GL,0.0,3.0,41.0,116.0,190.0,255.0,294.0,156.0,269.0,164.0,238.0,195.0,130.0,46.0,2.0,0.0,2104.0,DF,0.0,2.0,25.0,43.0,55.0,73.0,102.0,111.0,134.0,115.0,61.0,46.0,45.0,26.0,2.0,0.0,843.0,DT,0.0,1.0,77.0,190.0,241.0,258.0,235.0,52.0,153.0,58.0,248.0,259.0,211.0,81.0,0.0,0.0,2065.0,UVB,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,IR,100.0,100.0,99.0,103.0,98.0,97.0,97.0,98.0,100.0,103.0,109.0,124.0,122.0,124.0,109.0,108.0,110.0,117.0,123.0,111.0,119.0,121.0,121.0,124.0,2637.0
2,04-04-21,ALMERÍA AEROPUERTO,6325O,GL,0.0,0.0,18.0,59.0,122.0,174.0,201.0,285.0,262.0,245.0,121.0,94.0,41.0,10.0,1.0,0.0,1632.0,DF,0.0,0.0,19.0,61.0,96.0,123.0,141.0,138.0,100.0,98.0,91.0,65.0,41.0,10.0,1.0,0.0,983.0,DT,,,,,,,,,,,,,,,,,,UVB,0.0,0.0,0.0,0.0,1.0,5.0,16.0,28.0,50.0,61.0,84.0,170.0,159.0,182.0,214.0,289.0,202.0,270.0,228.0,162.0,78.0,106.0,79.0,39.0,30.0,17.0,7.0,2.0,0.0,0.0,0.0,0.0,2479.0,IR,,,,,,,,,,,,,,,,,,,,,,,,,
3,04-04-21,b'Badajoz',4478G,GL,0.0,1.0,22.0,120.0,192.0,259.0,301.0,322.0,319.0,294.0,252.0,193.0,119.0,43.0,2.0,0.0,2439.0,DF,0.0,1.0,21.0,55.0,84.0,61.0,60.0,62.0,67.0,70.0,61.0,51.0,40.0,22.0,2.0,0.0,656.0,DT,0.0,0.0,1.0,154.0,186.0,274.0,293.0,297.0,288.0,272.0,263.0,244.0,195.0,97.0,2.0,1.0,2566.0,UVB,0.0,0.0,0.0,0.0,2.0,9.0,22.0,44.0,70.0,100.0,143.0,183.0,219.0,249.0,271.0,284.0,283.0,271.0,248.0,217.0,182.0,144.0,106.0,72.0,44.0,23.0,10.0,3.0,0.0,0.0,0.0,0.0,3200.0,IR,113.0,111.0,116.0,124.0,127.0,130.0,123.0,106.0,108.0,105.0,108.0,111.0,114.0,116.0,117.0,117.0,117.0,116.0,115.0,114.0,112.0,112.0,112.0,112.0,2755.0
4,04-04-21,BARCELONA,0201D,GL,0.0,1.0,37.0,127.0,195.0,143.0,110.0,136.0,155.0,284.0,230.0,173.0,94.0,36.0,1.0,0.0,1723.0,DF,0.0,1.0,22.0,58.0,84.0,96.0,116.0,136.0,147.0,145.0,117.0,91.0,56.0,23.0,1.0,0.0,1095.0,DT,,,,,,,,,,,,,,,,,,UVB,0.0,0.0,0.0,0.0,3.0,8.0,18.0,30.0,38.0,85.0,98.0,91.0,95.0,89.0,99.0,137.0,134.0,140.0,199.0,175.0,143.0,115.0,84.0,58.0,35.0,19.0,9.0,3.0,0.0,0.0,0.0,0.0,1905.0,IR,,,,,,,,,,,,,,,,,,,,,,,,,


In [68]:
df_rad_aemet.shape

(35, 117)

Load the list of stations

In [69]:
df_estaciones_rad = pd.read_csv(directorio + 'data/estaciones_rad.csv')

In [70]:
df_estaciones_rad.head()

Unnamed: 0,Estación,indicativo,latitud,longitud
0,b'A Coru\xc3\xb1a',1387,432157N,082517W
1,b'Albacete',8178D,390020N,015144W
2,b'Almer\xc3\xada Aeropuerto',6325O,365047N,022125W
3,b'Badajoz',4478G,413800N,005256W
4,b'Barcelona',0201D,412326N,021200E


Find the stations that are not present in both data frames (df_estaciones_rad and df_rad_aemet)

In [71]:
estacion_quitar = []
# Stations in the station list
serie_indicativos = df_estaciones_rad["indicativo"].unique().astype("str")
# Stations in the radiation dataset
serie_estaciones = list(set(df_rad_aemet["Indicativo"].unique().astype("str")))

diferencia = len(serie_indicativos) - len(serie_estaciones)
print("La diferencia es de: {}".format(diferencia))

# Store the identifiers of the stations in the list that are not in the dataset
for i in range(0, len(serie_indicativos)):
    estacion = serie_indicativos[i]
    if estacion not in serie_estaciones:
        print("No esta la {}, estacion {}".format(i, estacion))
        estacion_quitar.append(str(estacion))
        diferencia -= 1
        print("La diferencia es de: {}".format(diferencia))
# Store the identifiers of the stations in the dataset that are not in the station list
for i in range(0, len(serie_estaciones)):
    estacion = serie_estaciones[i]
    if (estacion not in list(serie_indicativos) and estacion not in estacion_quitar):
        print("No esta la {}, estacion {}".format(i, estacion))
        estacion_quitar.append(str(estacion))
        diferencia += 1
        print("La diferencia es de: {}".format(diferencia))
print(estacion_quitar)
print("La diferencia es de: {}".format(diferencia))

La diferencia es de: 0
[]
La diferencia es de: 0


Show the list of station identifiers to remove

In [72]:
estacion_quitar

[]

Remove these stations from the dataset. First, count how many rows are affected

In [73]:
cont = 0
for i in range(0, len(df_rad_aemet["Indicativo"])):
    if str(df_rad_aemet["Indicativo"].loc[i]) in estacion_quitar:
        cont += 1
print("Hay que eliminar {} filas de {}. Quedarán {}".format(cont, len(df_rad_aemet["Indicativo"]), len(df_rad_aemet["Indicativo"])-cont))

Hay que eliminar 0 filas de 35. Quedarán 35


Create a column that stores True if the station in that row must be removed

In [74]:
estaciones_df = []

for i in range(0, len(df_rad_aemet["Indicativo"])):
    estacion = df_rad_aemet["Indicativo"].loc[i]
    if estacion in estacion_quitar:
        estaciones_df.append(True)
    else:
        estaciones_df.append(False)

In [75]:
df_rad_aemet.insert(len(df_rad_aemet.columns), "quitar", estaciones_df, True)

In [76]:
df_rad_aemet.head()

Unnamed: 0,fecha,Estación,Indicativo,Tipo,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,SUMA,Tipo.1,5.1,6.1,7.1,8.1,9.1,10.1,11.1,12.1,13.1,14.1,15.1,16.1,17.1,18.1,19.1,20.1,SUMA.1,Tipo.2,5.2,6.2,7.2,8.2,9.2,10.2,11.2,12.2,13.2,14.2,15.2,16.2,17.2,18.2,19.2,20.2,SUMA.2,Tipo.3,4.5,5.3,5.5,6.3,6.5,7.3,7.5,8.3,8.5,9.3,9.5,10.3,10.5,11.3,11.5,12.3,12.5,13.3,13.5,14.3,14.5,15.3,15.5,16.3,16.5,17.3,17.5,18.3,18.5,19.3,19.5,20.3,SUMA.3,Tipo.4,1,2,3,4,5.4,6.4,7.4,8.4,9.4,10.4,11.4,12.4,13.4,14.4,15.4,16.4,17.4,18.4,19.4,20.4,21,22,23,24,SUMA.4,quitar
0,04-04-21,A CORUÑA,1387,GL,0.0,3.0,47.0,118.0,189.0,247.0,287.0,308.0,307.0,285.0,243.0,185.0,115.0,46.0,3.0,0.0,2382.0,DF,0.0,2.0,16.0,25.0,29.0,32.0,34.0,36.0,34.0,34.0,35.0,31.0,25.0,15.0,2.0,0.0,348.0,DT,0.0,16.0,166.0,260.0,303.0,324.0,333.0,335.0,339.0,330.0,313.0,293.0,252.0,162.0,12.0,0.0,3437.0,UVB,0.0,0.0,0.0,2.0,9.0,20.0,38.0,62.0,91.0,125.0,161.0,195.0,231.0,262.0,272.0,291.0,291.0,266.0,248.0,226.0,194.0,155.0,127.0,93.0,61.0,37.0,19.0,7.0,1.0,0.0,0.0,0.0,3485.0,IR,96.0,96.0,97.0,96.0,96.0,96.0,96.0,96.0,98.0,100.0,103.0,104.0,106.0,107.0,105.0,106.0,106.0,104.0,101.0,99.0,98.0,98.0,98.0,,,False
1,04-04-21,ALBACETE,8178D,GL,0.0,3.0,41.0,116.0,190.0,255.0,294.0,156.0,269.0,164.0,238.0,195.0,130.0,46.0,2.0,0.0,2104.0,DF,0.0,2.0,25.0,43.0,55.0,73.0,102.0,111.0,134.0,115.0,61.0,46.0,45.0,26.0,2.0,0.0,843.0,DT,0.0,1.0,77.0,190.0,241.0,258.0,235.0,52.0,153.0,58.0,248.0,259.0,211.0,81.0,0.0,0.0,2065.0,UVB,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,IR,100.0,100.0,99.0,103.0,98.0,97.0,97.0,98.0,100.0,103.0,109.0,124.0,122.0,124.0,109.0,108.0,110.0,117.0,123.0,111.0,119.0,121.0,121.0,124.0,2637.0,False
2,04-04-21,ALMERÍA AEROPUERTO,6325O,GL,0.0,0.0,18.0,59.0,122.0,174.0,201.0,285.0,262.0,245.0,121.0,94.0,41.0,10.0,1.0,0.0,1632.0,DF,0.0,0.0,19.0,61.0,96.0,123.0,141.0,138.0,100.0,98.0,91.0,65.0,41.0,10.0,1.0,0.0,983.0,DT,,,,,,,,,,,,,,,,,,UVB,0.0,0.0,0.0,0.0,1.0,5.0,16.0,28.0,50.0,61.0,84.0,170.0,159.0,182.0,214.0,289.0,202.0,270.0,228.0,162.0,78.0,106.0,79.0,39.0,30.0,17.0,7.0,2.0,0.0,0.0,0.0,0.0,2479.0,IR,,,,,,,,,,,,,,,,,,,,,,,,,,False
3,04-04-21,b'Badajoz',4478G,GL,0.0,1.0,22.0,120.0,192.0,259.0,301.0,322.0,319.0,294.0,252.0,193.0,119.0,43.0,2.0,0.0,2439.0,DF,0.0,1.0,21.0,55.0,84.0,61.0,60.0,62.0,67.0,70.0,61.0,51.0,40.0,22.0,2.0,0.0,656.0,DT,0.0,0.0,1.0,154.0,186.0,274.0,293.0,297.0,288.0,272.0,263.0,244.0,195.0,97.0,2.0,1.0,2566.0,UVB,0.0,0.0,0.0,0.0,2.0,9.0,22.0,44.0,70.0,100.0,143.0,183.0,219.0,249.0,271.0,284.0,283.0,271.0,248.0,217.0,182.0,144.0,106.0,72.0,44.0,23.0,10.0,3.0,0.0,0.0,0.0,0.0,3200.0,IR,113.0,111.0,116.0,124.0,127.0,130.0,123.0,106.0,108.0,105.0,108.0,111.0,114.0,116.0,117.0,117.0,117.0,116.0,115.0,114.0,112.0,112.0,112.0,112.0,2755.0,False
4,04-04-21,BARCELONA,0201D,GL,0.0,1.0,37.0,127.0,195.0,143.0,110.0,136.0,155.0,284.0,230.0,173.0,94.0,36.0,1.0,0.0,1723.0,DF,0.0,1.0,22.0,58.0,84.0,96.0,116.0,136.0,147.0,145.0,117.0,91.0,56.0,23.0,1.0,0.0,1095.0,DT,,,,,,,,,,,,,,,,,,UVB,0.0,0.0,0.0,0.0,3.0,8.0,18.0,30.0,38.0,85.0,98.0,91.0,95.0,89.0,99.0,137.0,134.0,140.0,199.0,175.0,143.0,115.0,84.0,58.0,35.0,19.0,9.0,3.0,0.0,0.0,0.0,0.0,1905.0,IR,,,,,,,,,,,,,,,,,,,,,,,,,,False


In [77]:
df_rad_aemet.drop(df_rad_aemet[df_rad_aemet["quitar"] == True].index, inplace = True)

In [78]:
df_rad_aemet.reset_index(drop=True, inplace=True)
df_rad_aemet.head()

Unnamed: 0,fecha,Estación,Indicativo,Tipo,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,SUMA,Tipo.1,5.1,6.1,7.1,8.1,9.1,10.1,11.1,12.1,13.1,14.1,15.1,16.1,17.1,18.1,19.1,20.1,SUMA.1,Tipo.2,5.2,6.2,7.2,8.2,9.2,10.2,11.2,12.2,13.2,14.2,15.2,16.2,17.2,18.2,19.2,20.2,SUMA.2,Tipo.3,4.5,5.3,5.5,6.3,6.5,7.3,7.5,8.3,8.5,9.3,9.5,10.3,10.5,11.3,11.5,12.3,12.5,13.3,13.5,14.3,14.5,15.3,15.5,16.3,16.5,17.3,17.5,18.3,18.5,19.3,19.5,20.3,SUMA.3,Tipo.4,1,2,3,4,5.4,6.4,7.4,8.4,9.4,10.4,11.4,12.4,13.4,14.4,15.4,16.4,17.4,18.4,19.4,20.4,21,22,23,24,SUMA.4,quitar
0,04-04-21,A CORUÑA,1387,GL,0.0,3.0,47.0,118.0,189.0,247.0,287.0,308.0,307.0,285.0,243.0,185.0,115.0,46.0,3.0,0.0,2382.0,DF,0.0,2.0,16.0,25.0,29.0,32.0,34.0,36.0,34.0,34.0,35.0,31.0,25.0,15.0,2.0,0.0,348.0,DT,0.0,16.0,166.0,260.0,303.0,324.0,333.0,335.0,339.0,330.0,313.0,293.0,252.0,162.0,12.0,0.0,3437.0,UVB,0.0,0.0,0.0,2.0,9.0,20.0,38.0,62.0,91.0,125.0,161.0,195.0,231.0,262.0,272.0,291.0,291.0,266.0,248.0,226.0,194.0,155.0,127.0,93.0,61.0,37.0,19.0,7.0,1.0,0.0,0.0,0.0,3485.0,IR,96.0,96.0,97.0,96.0,96.0,96.0,96.0,96.0,98.0,100.0,103.0,104.0,106.0,107.0,105.0,106.0,106.0,104.0,101.0,99.0,98.0,98.0,98.0,,,False
1,04-04-21,ALBACETE,8178D,GL,0.0,3.0,41.0,116.0,190.0,255.0,294.0,156.0,269.0,164.0,238.0,195.0,130.0,46.0,2.0,0.0,2104.0,DF,0.0,2.0,25.0,43.0,55.0,73.0,102.0,111.0,134.0,115.0,61.0,46.0,45.0,26.0,2.0,0.0,843.0,DT,0.0,1.0,77.0,190.0,241.0,258.0,235.0,52.0,153.0,58.0,248.0,259.0,211.0,81.0,0.0,0.0,2065.0,UVB,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,IR,100.0,100.0,99.0,103.0,98.0,97.0,97.0,98.0,100.0,103.0,109.0,124.0,122.0,124.0,109.0,108.0,110.0,117.0,123.0,111.0,119.0,121.0,121.0,124.0,2637.0,False
2,04-04-21,ALMERÍA AEROPUERTO,6325O,GL,0.0,0.0,18.0,59.0,122.0,174.0,201.0,285.0,262.0,245.0,121.0,94.0,41.0,10.0,1.0,0.0,1632.0,DF,0.0,0.0,19.0,61.0,96.0,123.0,141.0,138.0,100.0,98.0,91.0,65.0,41.0,10.0,1.0,0.0,983.0,DT,,,,,,,,,,,,,,,,,,UVB,0.0,0.0,0.0,0.0,1.0,5.0,16.0,28.0,50.0,61.0,84.0,170.0,159.0,182.0,214.0,289.0,202.0,270.0,228.0,162.0,78.0,106.0,79.0,39.0,30.0,17.0,7.0,2.0,0.0,0.0,0.0,0.0,2479.0,IR,,,,,,,,,,,,,,,,,,,,,,,,,,False
3,04-04-21,b'Badajoz',4478G,GL,0.0,1.0,22.0,120.0,192.0,259.0,301.0,322.0,319.0,294.0,252.0,193.0,119.0,43.0,2.0,0.0,2439.0,DF,0.0,1.0,21.0,55.0,84.0,61.0,60.0,62.0,67.0,70.0,61.0,51.0,40.0,22.0,2.0,0.0,656.0,DT,0.0,0.0,1.0,154.0,186.0,274.0,293.0,297.0,288.0,272.0,263.0,244.0,195.0,97.0,2.0,1.0,2566.0,UVB,0.0,0.0,0.0,0.0,2.0,9.0,22.0,44.0,70.0,100.0,143.0,183.0,219.0,249.0,271.0,284.0,283.0,271.0,248.0,217.0,182.0,144.0,106.0,72.0,44.0,23.0,10.0,3.0,0.0,0.0,0.0,0.0,3200.0,IR,113.0,111.0,116.0,124.0,127.0,130.0,123.0,106.0,108.0,105.0,108.0,111.0,114.0,116.0,117.0,117.0,117.0,116.0,115.0,114.0,112.0,112.0,112.0,112.0,2755.0,False
4,04-04-21,BARCELONA,0201D,GL,0.0,1.0,37.0,127.0,195.0,143.0,110.0,136.0,155.0,284.0,230.0,173.0,94.0,36.0,1.0,0.0,1723.0,DF,0.0,1.0,22.0,58.0,84.0,96.0,116.0,136.0,147.0,145.0,117.0,91.0,56.0,23.0,1.0,0.0,1095.0,DT,,,,,,,,,,,,,,,,,,UVB,0.0,0.0,0.0,0.0,3.0,8.0,18.0,30.0,38.0,85.0,98.0,91.0,95.0,89.0,99.0,137.0,134.0,140.0,199.0,175.0,143.0,115.0,84.0,58.0,35.0,19.0,9.0,3.0,0.0,0.0,0.0,0.0,1905.0,IR,,,,,,,,,,,,,,,,,,,,,,,,,,False


In [79]:
print("Indeed, the new number of rows is {}".format(len(df_rad_aemet["Indicativo"])))

Indeed, the new number of rows is 35


The complete dataset is generated by loading all the saved daily files and concatenating them.

In [80]:
# Set the first date (date of the first daily file)
now = datetime.now()
fecha_inicial = datetime(2021,4,5)

# Set the most recent date
fecha_final = datetime(now.year,now.month,now.day)

# Build the list of days as 'YYYY-MM-DD' strings
rango_fechas = range((fecha_final - fecha_inicial).days + 1)
fechas = [(fecha_inicial + timedelta(days = d)).strftime("%Y-%m-%d") for d in rango_fechas]

fechas

['2021-04-05',
 '2021-04-06',
 '2021-04-07',
 '2021-04-08',
 '2021-04-09',
 '2021-04-10',
 '2021-04-11',
 '2021-04-12',
 '2021-04-13',
 '2021-04-14',
 '2021-04-15',
 '2021-04-16',
 '2021-04-17',
 '2021-04-18',
 '2021-04-19',
 '2021-04-20',
 '2021-04-21',
 '2021-04-22',
 '2021-04-23',
 '2021-04-24',
 '2021-04-25',
 '2021-04-26',
 '2021-04-27',
 '2021-04-28',
 '2021-04-29',
 '2021-04-30',
 '2021-05-01',
 '2021-05-02',
 '2021-05-03',
 '2021-05-04',
 '2021-05-05',
 '2021-05-06',
 '2021-05-07',
 '2021-05-08',
 '2021-05-09',
 '2021-05-10',
 '2021-05-11',
 '2021-05-12',
 '2021-05-13',
 '2021-05-14',
 '2021-05-15',
 '2021-05-16',
 '2021-05-17',
 '2021-05-18',
 '2021-05-19',
 '2021-05-20',
 '2021-05-21',
 '2021-05-22',
 '2021-05-23',
 '2021-05-24',
 '2021-05-25',
 '2021-05-26',
 '2021-05-27',
 '2021-05-28',
 '2021-05-29',
 '2021-05-30',
 '2021-05-31',
 '2021-06-01',
 '2021-06-02',
 '2021-06-03',
 '2021-06-04',
 '2021-06-05']
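
The list above depends on the day the notebook is run, because the end date comes from `datetime.now()`. As a minimal sketch, assuming a fixed end date is acceptable, the same list can be built by a small helper (the name `lista_dias` is hypothetical and not part of the notebook):

```python
from datetime import datetime, timedelta

def lista_dias(fecha_inicial, fecha_final):
    """Return every day from fecha_inicial to fecha_final (inclusive) as 'YYYY-MM-DD'."""
    n_dias = (fecha_final - fecha_inicial).days + 1
    return [(fecha_inicial + timedelta(days=d)).strftime("%Y-%m-%d") for d in range(n_dias)]

# Same range as the output shown above
fechas_ejemplo = lista_dias(datetime(2021, 4, 5), datetime(2021, 6, 5))
print(len(fechas_ejemplo), fechas_ejemplo[0], fechas_ejemplo[-1])
```

Passing the end date explicitly makes the run reproducible, at the cost of having to update it when new daily files are saved.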

In [81]:
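datos_diarios = []

# For each day, load the saved dataset if it exists
for date in fechas:
    try:
        df_rad_aemet = pd.read_csv(directorio + 'data/Rad_AEMET/rad_aemet_{}'.format(date))
    except (FileNotFoundError, pd.errors.EmptyDataError):
        # No usable file was saved for this day
        continue
    # The file saved on a given day holds the radiation of the previous day
    df_rad_aemet["fecha"] = (pd.to_datetime(date)- timedelta(days = 1)).strftime('%Y-%m-%d')
    datos_diarios.append(df_rad_aemet)

# A single concatenation is faster than appending inside the loop
df_rad_aemet_total = pd.concat(datos_diarios, ignore_index = True)

print(df_rad_aemet_total.shape)
df_rad_aemet_total.head()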

(2240, 117)


Unnamed: 0,fecha,Estación,Indicativo,Tipo,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,SUMA,Tipo.1,5.1,6.1,7.1,8.1,9.1,10.1,11.1,12.1,13.1,14.1,15.1,16.1,17.1,18.1,19.1,20.1,SUMA.1,Tipo.2,5.2,6.2,7.2,8.2,9.2,10.2,11.2,12.2,13.2,14.2,15.2,16.2,17.2,18.2,19.2,20.2,SUMA.2,Tipo.3,4.5,5.3,5.5,6.3,6.5,7.3,7.5,8.3,8.5,9.3,9.5,10.3,10.5,11.3,11.5,12.3,12.5,13.3,13.5,14.3,14.5,15.3,15.5,16.3,16.5,17.3,17.5,18.3,18.5,19.3,19.5,20.3,SUMA.3,Tipo.4,1,2,3,4,5.4,6.4,7.4,8.4,9.4,10.4,11.4,12.4,13.4,14.4,15.4,16.4,17.4,18.4,19.4,20.4,21,22,23,24,SUMA.4
0,2021-04-04,A CORUÑA,1387,GL,0.0,3.0,47.0,118.0,189.0,247.0,287.0,308.0,307.0,285.0,243.0,185.0,115.0,46.0,3.0,0.0,2382.0,DF,0.0,2.0,16.0,25.0,29.0,32.0,34.0,36.0,34.0,34.0,35.0,31.0,25.0,15.0,2.0,0.0,348.0,DT,0.0,16.0,166.0,260.0,303.0,324.0,333.0,335.0,339.0,330.0,313.0,293.0,252.0,162.0,12.0,0.0,3437.0,UVB,0.0,0.0,0.0,2.0,9.0,20.0,38.0,62.0,91.0,125.0,161.0,195.0,231.0,262.0,272.0,291.0,291.0,266.0,248.0,226.0,194.0,155.0,127.0,93.0,61.0,37.0,19.0,7.0,1.0,0.0,0.0,0.0,3485.0,IR,96.0,96.0,97.0,96.0,96.0,96.0,96.0,96.0,98.0,100.0,103.0,104.0,106.0,107.0,105.0,106.0,106.0,104.0,101.0,99.0,98.0,98.0,98.0,,
1,2021-04-04,ALBACETE,8178D,GL,0.0,3.0,41.0,116.0,190.0,255.0,294.0,156.0,269.0,164.0,238.0,195.0,130.0,46.0,2.0,0.0,2104.0,DF,0.0,2.0,25.0,43.0,55.0,73.0,102.0,111.0,134.0,115.0,61.0,46.0,45.0,26.0,2.0,0.0,843.0,DT,0.0,1.0,77.0,190.0,241.0,258.0,235.0,52.0,153.0,58.0,248.0,259.0,211.0,81.0,0.0,0.0,2065.0,UVB,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,IR,100.0,100.0,99.0,103.0,98.0,97.0,97.0,98.0,100.0,103.0,109.0,124.0,122.0,124.0,109.0,108.0,110.0,117.0,123.0,111.0,119.0,121.0,121.0,124.0,2637.0
2,2021-04-04,ALMERÍA AEROPUERTO,6325O,GL,0.0,0.0,18.0,59.0,122.0,174.0,201.0,285.0,262.0,245.0,121.0,94.0,41.0,10.0,1.0,0.0,1632.0,DF,0.0,0.0,19.0,61.0,96.0,123.0,141.0,138.0,100.0,98.0,91.0,65.0,41.0,10.0,1.0,0.0,983.0,DT,,,,,,,,,,,,,,,,,,UVB,0.0,0.0,0.0,0.0,1.0,5.0,16.0,28.0,50.0,61.0,84.0,170.0,159.0,182.0,214.0,289.0,202.0,270.0,228.0,162.0,78.0,106.0,79.0,39.0,30.0,17.0,7.0,2.0,0.0,0.0,0.0,0.0,2479.0,IR,,,,,,,,,,,,,,,,,,,,,,,,,
3,2021-04-04,b'Badajoz',4478G,GL,0.0,1.0,22.0,120.0,192.0,259.0,301.0,322.0,319.0,294.0,252.0,193.0,119.0,43.0,2.0,0.0,2439.0,DF,0.0,1.0,21.0,55.0,84.0,61.0,60.0,62.0,67.0,70.0,61.0,51.0,40.0,22.0,2.0,0.0,656.0,DT,0.0,0.0,1.0,154.0,186.0,274.0,293.0,297.0,288.0,272.0,263.0,244.0,195.0,97.0,2.0,1.0,2566.0,UVB,0.0,0.0,0.0,0.0,2.0,9.0,22.0,44.0,70.0,100.0,143.0,183.0,219.0,249.0,271.0,284.0,283.0,271.0,248.0,217.0,182.0,144.0,106.0,72.0,44.0,23.0,10.0,3.0,0.0,0.0,0.0,0.0,3200.0,IR,113.0,111.0,116.0,124.0,127.0,130.0,123.0,106.0,108.0,105.0,108.0,111.0,114.0,116.0,117.0,117.0,117.0,116.0,115.0,114.0,112.0,112.0,112.0,112.0,2755.0
4,2021-04-04,BARCELONA,0201D,GL,0.0,1.0,37.0,127.0,195.0,143.0,110.0,136.0,155.0,284.0,230.0,173.0,94.0,36.0,1.0,0.0,1723.0,DF,0.0,1.0,22.0,58.0,84.0,96.0,116.0,136.0,147.0,145.0,117.0,91.0,56.0,23.0,1.0,0.0,1095.0,DT,,,,,,,,,,,,,,,,,,UVB,0.0,0.0,0.0,0.0,3.0,8.0,18.0,30.0,38.0,85.0,98.0,91.0,95.0,89.0,99.0,137.0,134.0,140.0,199.0,175.0,143.0,115.0,84.0,58.0,35.0,19.0,9.0,3.0,0.0,0.0,0.0,0.0,1905.0,IR,,,,,,,,,,,,,,,,,,,,,,,,,


Because each column is named after the hour at which its interval ends, the useful hours are defined differently here:

In [82]:
hora_ini = 5
hora_fin = 21
dif = int(int(hora_fin)-int(hora_ini))

A dataset is generated in which each row corresponds to one hour, for each day and station.

In [83]:
df_rad_horas = pd.DataFrame(columns = ["fecha", "hora", "estacion", "indicativo", "GL", "DF", "DT", "UVB", "IR"])

# Note: DataFrame.append was removed in pandas 2.0, so this cell needs pandas < 2.0
for i, fila in df_rad_aemet_total.iterrows():
    
    for j in range(0, dif):
        
        # For each row, the hourly values are stored based on the column names
        # (for each variable, columns are named after the hour of the reading).
        # UVB values come in half-hour steps, so the two halves are added up.
        # Columns are named after the hour that ends the interval, hence hora-1.
        
        hora = hora_ini + j
        col_gl = str(hora)
        col_df = str(hora) + ".1"
        col_dt = str(hora) + ".2"
        col_uvb = str(hora) + ".3"
        col_uvb_2 = str(hora-1) + ".5"
        col_ir = str(hora) + ".4"
        df_rad_horas = df_rad_horas.append({'fecha' : fila["fecha"], 'estacion' : fila["Estación"], 'indicativo' : fila["Indicativo"], 'GL' : fila[col_gl], 'DF' : fila[col_df], 'DT' : fila[col_dt], 'UVB' : (fila[col_uvb] + fila[col_uvb_2]), 'IR' : fila[col_ir], 'hora' : hora-1}, ignore_index = True)

df_rad_horas.head()

Unnamed: 0,fecha,hora,estacion,indicativo,GL,DF,DT,UVB,IR
0,2021-04-04,4,A CORUÑA,1387,0.0,0.0,0.0,0.0,96.0
1,2021-04-04,5,A CORUÑA,1387,3.0,2.0,16.0,2.0,96.0
2,2021-04-04,6,A CORUÑA,1387,47.0,16.0,166.0,29.0,96.0
3,2021-04-04,7,A CORUÑA,1387,118.0,25.0,260.0,100.0,96.0
4,2021-04-04,8,A CORUÑA,1387,189.0,29.0,303.0,216.0,98.0
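
Appending one row at a time is slow for large inputs. The following is a minimal sketch of the same wide-to-hourly reshaping that loops over hours instead of rows, assuming the column naming shown above (`"5"`, `"5.1"`, `"5.2"`, `"5.3"`, `"4.5"`, `"5.4"`). The function name `ancho_a_horas` is hypothetical, and the toy values are copied from the first two stations above, restricted to two hours. Unlike the cell above, it yields an integer `hora` column directly.

```python
import numpy as np
import pandas as pd

def ancho_a_horas(df_ancho, hora_ini, hora_fin):
    """Reshape one-row-per-station-and-day data into one row per hour."""
    bloques = []
    for hora in range(hora_ini, hora_fin):
        # One block per hour, built from whole columns at once
        bloques.append(pd.DataFrame({
            "fecha": df_ancho["fecha"],
            "hora": hora - 1,  # columns are named after the hour that ends
            "estacion": df_ancho["Estación"],
            "indicativo": df_ancho["Indicativo"],
            "GL": df_ancho[str(hora)],
            "DF": df_ancho[str(hora) + ".1"],
            "DT": df_ancho[str(hora) + ".2"],
            # UVB comes in half-hour steps: add both halves
            "UVB": df_ancho[str(hora) + ".3"] + df_ancho[str(hora - 1) + ".5"],
            "IR": df_ancho[str(hora) + ".4"],
        }))
    # A stable sort on the original row index restores the row-by-row order
    return pd.concat(bloques).sort_index(kind="mergesort").reset_index(drop=True)

df_ancho = pd.DataFrame({
    "fecha": ["2021-04-04", "2021-04-04"],
    "Estación": ["A CORUÑA", "ALBACETE"],
    "Indicativo": ["1387", "8178D"],
    "5": [0.0, 0.0], "6": [3.0, 3.0],
    "5.1": [0.0, 0.0], "6.1": [2.0, 2.0],
    "5.2": [0.0, 0.0], "6.2": [16.0, 1.0],
    "4.5": [0.0, np.nan], "5.3": [0.0, np.nan],
    "5.5": [0.0, np.nan], "6.3": [2.0, np.nan],
    "5.4": [96.0, 98.0], "6.4": [96.0, 97.0],
})

df_horas = ancho_a_horas(df_ancho, 5, 7)
print(df_horas)
```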


### Selecting the relevant columns

The direct and diffuse radiation columns, which are derived from the global radiation, are dropped.

In [84]:
df_rad_horas.drop(['DF', 'DT'], axis=1, inplace = True)

In [85]:
df_rad_horas.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35840 entries, 0 to 35839
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   fecha       35840 non-null  object 
 1   hora        35840 non-null  object 
 2   estacion    35840 non-null  object 
 3   indicativo  35840 non-null  object 
 4   GL          34377 non-null  float64
 5   UVB         25373 non-null  float64
 6   IR          22291 non-null  float64
dtypes: float64(3), object(4)
memory usage: 1.9+ MB


### Converting the numeric columns to the correct data type

In [86]:
df_rad_horas["hora"] = pd.to_numeric(df_rad_horas["hora"])

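``GL`` and ``IR`` are converted from 10 kJ/m² to W/m², and ``UVB`` from J/m² to W/m². Each value is energy accumulated over one hour, so it is divided by 3600 s.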

In [87]:
df_rad_horas["GL"] = df_rad_horas["GL"] *10/3.6
df_rad_horas["UVB"] = df_rad_horas["UVB"] *1/(3.6*1000)
df_rad_horas["IR"] = df_rad_horas["IR"] *10/3.6
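
The factors used above can be checked by hand. This is a small sketch, assuming every value is energy accumulated over one hour (3600 s); the helper names are hypothetical. The inputs are the raw hour-7 values of A CORUÑA shown earlier (`GL` = 118 and `UVB` = 100).

```python
SEGUNDOS_POR_HORA = 3600

def decenas_kj_a_w(valor):
    """Hourly energy in units of 10 kJ/m² -> mean power in W/m²."""
    return valor * 10_000 / SEGUNDOS_POR_HORA

def j_a_w(valor):
    """Hourly energy in J/m² -> mean power in W/m²."""
    return valor / SEGUNDOS_POR_HORA

gl_w = decenas_kj_a_w(118.0)   # same as 118 * 10 / 3.6
uvb_w = j_a_w(100.0)           # same as 100 / (3.6 * 1000)
print(round(gl_w, 6), round(uvb_w, 6))
```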

In [88]:
df_rad_horas.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35840 entries, 0 to 35839
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   fecha       35840 non-null  object 
 1   hora        35840 non-null  int64  
 2   estacion    35840 non-null  object 
 3   indicativo  35840 non-null  object 
 4   GL          34377 non-null  float64
 5   UVB         25373 non-null  float64
 6   IR          22291 non-null  float64
dtypes: float64(3), int64(1), object(3)
memory usage: 1.9+ MB


In [89]:
df_rad_horas.head()

Unnamed: 0,fecha,hora,estacion,indicativo,GL,UVB,IR
0,2021-04-04,4,A CORUÑA,1387,0.0,0.0,266.666667
1,2021-04-04,5,A CORUÑA,1387,8.333333,0.000556,266.666667
2,2021-04-04,6,A CORUÑA,1387,130.555556,0.008056,266.666667
3,2021-04-04,7,A CORUÑA,1387,327.777778,0.027778,266.666667
4,2021-04-04,8,A CORUÑA,1387,525.0,0.06,272.222222


### Handling NAs

I show the number of NAs per column and the fraction of the total it represents

In [90]:
total = df_rad_horas.isnull().sum().sort_values(ascending = False)
percent = (df_rad_horas.isnull().sum() / df_rad_horas.isnull().count()).sort_values(ascending = False)
missing_data = pd.concat([total, percent], axis = 1, keys = ['Total', 'Percent'])
missing_data.head(20)

Unnamed: 0,Total,Percent
IR,13549,0.378041
UVB,10467,0.292048
GL,1463,0.04082
indicativo,0,0.0
estacion,0,0.0
hora,0,0.0
fecha,0,0.0


About 29 % of the ``UVB`` values and 38 % of the ``IR`` values are NAs. I replace the missing values with the column mean.

In [91]:
df_rad_horas.fillna({'IR': df_rad_horas["IR"].mean(), "UVB": df_rad_horas["UVB"].mean()}, inplace = True)

About 4.1 % of the ``GL`` values are NAs. I drop the rows without data.

In [92]:
df_rad_horas.dropna(inplace = True)
df_rad_horas.reset_index(drop=True, inplace=True)
df_rad_horas.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34377 entries, 0 to 34376
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   fecha       34377 non-null  object 
 1   hora        34377 non-null  int64  
 2   estacion    34377 non-null  object 
 3   indicativo  34377 non-null  object 
 4   GL          34377 non-null  float64
 5   UVB         34377 non-null  float64
 6   IR          34377 non-null  float64
dtypes: float64(3), int64(1), object(3)
memory usage: 1.8+ MB
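
The two steps above (mean imputation for ``IR`` and ``UVB``, then dropping rows with no ``GL``) can be seen on a toy frame. This is only an illustrative sketch with made-up values, not data from the notebook.

```python
import numpy as np
import pandas as pd

df_ejemplo = pd.DataFrame({
    "GL": [10.0, np.nan, 30.0, 40.0],
    "UVB": [1.0, 2.0, np.nan, 3.0],
    "IR": [np.nan, 200.0, 300.0, 400.0],
})

# Step 1: fill IR and UVB with their column means (GL is left untouched)
medias = {"IR": df_ejemplo["IR"].mean(), "UVB": df_ejemplo["UVB"].mean()}
df_ejemplo = df_ejemplo.fillna(medias)

# Step 2: the only NAs left are in GL, so dropna removes exactly those rows
df_ejemplo = df_ejemplo.dropna().reset_index(drop=True)
print(df_ejemplo)
```

The order matters: calling `dropna` first would also discard every row where only ``IR`` or ``UVB`` is missing.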


Possible duplicate rows are removed

In [93]:
df_rad_horas = df_rad_horas.drop_duplicates(['fecha', 'hora', "indicativo"],
                        keep = 'first')
df_rad_horas.reset_index(drop = True, inplace = True)
df_rad_horas.head()

Unnamed: 0,fecha,hora,estacion,indicativo,GL,UVB,IR
0,2021-04-04,4,A CORUÑA,1387,0.0,0.0,266.666667
1,2021-04-04,5,A CORUÑA,1387,8.333333,0.000556,266.666667
2,2021-04-04,6,A CORUÑA,1387,130.555556,0.008056,266.666667
3,2021-04-04,7,A CORUÑA,1387,327.777778,0.027778,266.666667
4,2021-04-04,8,A CORUÑA,1387,525.0,0.06,272.222222


# Función final radiación día anterior
<div style = "float:right"><a style="text-decoration:none" href = "#Limpieza-y-preparación-de-datos-históricos-para-los-modelos-de-predicción">

In [94]:
def rad_aemet_clean(df_datos, df_puntos):
    
    # Keep only the stations that are present in both datasets
    
    indicativos_validos = set(df_puntos["indicativo"].astype(str))
    en_ambos = df_datos["Indicativo"].astype(str).isin(indicativos_validos)
    df_datos = df_datos[en_ambos].reset_index(drop=True)
    
    # Build the hourly dataset: one row per date, station and hour.
    # hora_ini and hora_fin are the notebook-level variables. Columns are named
    # after the hour that ends the interval, so the stored hour is hora-1.
    
    filas = []

    for _, fila in df_datos.iterrows():

        for hora in range(hora_ini, hora_fin):
            filas.append({'fecha' : fila["fecha"],
                          'hora' : hora-1,
                          'estacion' : fila["Estación"],
                          'indicativo' : fila["Indicativo"],
                          'GL' : fila[str(hora)],
                          # UVB comes in half-hour steps: add both halves
                          'UVB' : fila[str(hora) + ".3"] + fila[str(hora-1) + ".5"],
                          'IR' : fila[str(hora) + ".4"]})

    # The direct (DT) and diffuse (DF) columns are not kept
    df_rad_horas = pd.DataFrame(filas, columns = ["fecha", "hora", "estacion", "indicativo", "GL", "UVB", "IR"])
    
    # Convert the columns to the correct data types and to W/m²
    
    df_rad_horas["hora"] = pd.to_numeric(df_rad_horas["hora"])
    df_rad_horas["GL"] = df_rad_horas["GL"] *10/3.6
    df_rad_horas["UVB"] = df_rad_horas["UVB"] *1/(3.6*1000)
    df_rad_horas["IR"] = df_rad_horas["IR"] *10/3.6
    
    # Handle NAs: mean imputation for IR and UVB, then drop rows without GL
    
    df_rad_horas.fillna({'IR': df_rad_horas["IR"].mean(), "UVB": df_rad_horas["UVB"].mean()}, inplace = True)
    df_rad_horas.dropna(inplace = True)
    df_rad_horas.reset_index(drop=True, inplace=True)
    
    # Remove possible duplicate rows
    
    df_rad_horas = df_rad_horas.drop_duplicates(['fecha', 'hora', "indicativo"], keep = 'first')
    df_rad_horas.reset_index(drop = True, inplace = True)

    # Save the clean file
    nombre = './data/Historicos_modelo_2/rad_aemet_clean.csv'
    df_rad_horas.to_csv(nombre, index = False)
    
    return df_rad_horas

In [95]:
import numpy as np
import pandas as pd
import random
pd.options.display.max_columns = None
pd.options.display.max_rows = None
import matplotlib.pyplot as plt
plt.style.use("seaborn")
from datetime import datetime, timedelta
hora_ini = 5
hora_fin = 21

# Read the daily CSVs
now = datetime.now()
fecha_inicial = datetime(2021,4,5)
fecha_final = datetime(now.year,now.month,now.day)
rango_fechas = range((fecha_final - fecha_inicial).days + 1)
fechas = [(fecha_inicial + timedelta(days = d)).strftime("%Y-%m-%d") for d in rango_fechas]

directorio = '/home/dsc/git/TFM/'
datos_diarios = []

for date in fechas:
    try:
        df_rad_aemet = pd.read_csv(directorio + 'data/Rad_AEMET/rad_aemet_{}'.format(date))
    except (FileNotFoundError, pd.errors.EmptyDataError):
        # No usable file was saved for this day
        continue
    df_rad_aemet["fecha"] = (pd.to_datetime(date)- timedelta(days = 1)).strftime('%Y-%m-%d')
    datos_diarios.append(df_rad_aemet)

df_rad_aemet_total = pd.concat(datos_diarios, ignore_index = True)

df_estaciones_rad = pd.read_csv(directorio + 'data/estaciones_rad.csv')

# Call the function
df_clean = rad_aemet_clean(df_rad_aemet_total, df_estaciones_rad)
    
df_clean.head()

Unnamed: 0,fecha,hora,estacion,indicativo,GL,UVB,IR
0,2021-04-04,4,A CORUÑA,1387,0.0,0.0,266.666667
1,2021-04-04,5,A CORUÑA,1387,8.333333,0.000556,266.666667
2,2021-04-04,6,A CORUÑA,1387,130.555556,0.008056,266.666667
3,2021-04-04,7,A CORUÑA,1387,327.777778,0.027778,266.666667
4,2021-04-04,8,A CORUÑA,1387,525.0,0.06,272.222222


In [96]:
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32106 entries, 0 to 32105
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   fecha       32106 non-null  object 
 1   hora        32106 non-null  int64  
 2   estacion    32106 non-null  object 
 3   indicativo  32106 non-null  object 
 4   GL          32106 non-null  float64
 5   UVB         32106 non-null  float64
 6   IR          32106 non-null  float64
dtypes: float64(3), int64(1), object(3)
memory usage: 1.7+ MB


### Sanity check

In [97]:
fechas_dt = pd.to_datetime(df_clean['fecha'])
print('Distinct years: ', fechas_dt.dt.year.nunique())
print('Distinct months:', fechas_dt.dt.month.nunique())
print('Distinct days:', fechas_dt.dt.day.nunique())

print('There should only be distinct hours within the filtering hours:', df_clean['hora'].nunique())

Distinct years:  1
Distinct months: 3
Distinct days: 31
There should only be distinct hours within the filtering hours: 16
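
Counting distinct hours confirms how many there are, but not which ones. A minimal sketch of a stricter check follows, assuming the stored hours should run from `hora_ini - 1` to `hora_fin - 2` as in the AEMET reshaping above. The function name is hypothetical and the toy frame is made up.

```python
import pandas as pd

def horas_fuera_de_rango(df, hora_ini, hora_fin):
    """Return the sorted hours in df['hora'] that fall outside the expected window."""
    horas_validas = set(range(hora_ini - 1, hora_fin - 1))
    return sorted(set(df["hora"]) - horas_validas)

df_ejemplo = pd.DataFrame({"hora": [4, 5, 19, 20]})

# With hora_ini = 5 and hora_fin = 21 the valid stored hours are 4..19
sobrantes = horas_fuera_de_rango(df_ejemplo, 5, 21)
print(sobrantes)
```

An empty list means every stored hour lies inside the filtering window.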


# Datos de radiación de dos días antes
<div style = "float:right"><a style="text-decoration:none" href = "#Limpieza-y-preparación-de-datos-históricos-para-los-modelos-de-predicción">

These data are obtained from the European Union's CAMS Radiation Service portal (http://www.soda-pro.com/web-services/radiation/cams-radiation-service). **Times are in UTC.** It provides the radiation for any date up to 2 days before the call. The fields obtained for each day are:

- ``Observation period``: Beginning/end of the time period with the format "yyyy-mm-ddTHH:MM:SS.S/yyyy-mm-ddTHH:MM:SS.S"
- ``TOA``: Irradiation on horizontal plane at the top of atmosphere (Wh/m2) computed from Solar Geometry 2
- ``Clear sky GHI``: Clear sky global irradiation on horizontal plane at ground level (Wh/m2)
- ``Clear sky BHI``: Clear sky beam irradiation on horizontal plane at ground level (Wh/m2)
- ``Clear sky DHI``: Clear sky diffuse irradiation on horizontal plane at ground level (Wh/m2)
- ``Clear sky BNI``: Clear sky beam irradiation on mobile plane following the sun at normal incidence (Wh/m2)
- ``GHI``: Global irradiation on horizontal plane at ground level (Wh/m2)
- ``BHI``: Beam irradiation on horizontal plane at ground level (Wh/m2)
- ``DHI``: Diffuse irradiation on horizontal plane at ground level (Wh/m2)
- ``BNI``: Beam irradiation on mobile plane following the sun at normal incidence (Wh/m2)
- ``Reliability``: Proportion of reliable data in the summarization (0-1)

Hour X contains the data recorded between X:00 and X:59
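
As a small sketch of the ``Observation period`` format described above, the string can be split into its beginning and end timestamps with the standard library. The function name is hypothetical, and the sample period is built from the `dateBegins`/`dateEnds` format that appears in the daily files.

```python
from datetime import datetime

FORMATO = "%Y-%m-%dT%H:%M:%S.%f"

def separar_periodo(periodo):
    """Split 'begin/end' into two datetime objects (UTC, as delivered)."""
    inicio, fin = periodo.split("/")
    return datetime.strptime(inicio, FORMATO), datetime.strptime(fin, FORMATO)

inicio, fin = separar_periodo("2021-04-16T04:00:00.0/2021-04-16T05:00:00.0")

# The hour label of a row is the hour at which its period begins
print(inicio.hour, fin - inicio)
```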

### Exploring the data

A single daily file is tested first.

In [121]:
# Import the CSV with the radiation data from two days ago.

%cd /home/dsc/git/TFM/
df_soda = pd.read_csv('./data/Rad_SODA/rad_soda_2021-04-19', sep=',')
df_soda.head()

/home/dsc/git/TFM


Unnamed: 0,dateBegins,dateEnds,toa,cs_ghi,cs_bhi,cs_dhi,cs_bni,ghi,bhi,dhi,bni,reliability,estacion
0,2021-04-16T00:00:00.0,2021-04-16T01:00:00.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0252D
1,2021-04-16T01:00:00.0,2021-04-16T02:00:00.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0252D
2,2021-04-16T02:00:00.0,2021-04-16T03:00:00.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0252D
3,2021-04-16T03:00:00.0,2021-04-16T04:00:00.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0252D
4,2021-04-16T04:00:00.0,2021-04-16T05:00:00.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0252D


In [122]:
df_soda.shape

(6888, 13)

I load the list of stations

In [123]:
df_estaciones = pd.read_csv(directorio + 'data/estaciones.csv')

In [124]:
df_estaciones.head()

Unnamed: 0,latitud,provincia,altitud,indicativo,nombre,longitud
0,413515N,BARCELONA,74,0252D,ARENYS DE MAR,023224E
1,411734N,BARCELONA,4,0076,BARCELONA AEROPUERTO,020412E
2,412506N,BARCELONA,408,0200E,"BARCELONA, FABRA",020727E
3,412326N,BARCELONA,6,0201D,BARCELONA,021200E
4,414312N,BARCELONA,291,0149X,MANRESA,015025E


I look for the stations that are not present in both data frames (estaciones and df_soda)

In [125]:
estacion_quitar = []
# Stations in the station list
serie_indicativos = df_estaciones["indicativo"].unique().astype("str")
# Stations in the radiation dataset
serie_estaciones = list(set(df_soda["estacion"].unique().astype("str")))

diferencia = len(serie_indicativos) - len(serie_estaciones)
print("The difference is: {}".format(diferencia))

# Store the codes of the listed stations that are missing from the dataset
for i in range(0, len(serie_indicativos)):
    estacion = serie_indicativos[i]
    if estacion not in serie_estaciones:
        print("Missing: index {}, station {}".format(i, estacion))
        estacion_quitar.append(str(estacion))
        diferencia -= 1
        print("The difference is: {}".format(diferencia))
# Store the codes of the dataset stations that are missing from the station list
for i in range(0, len(serie_estaciones)):
    estacion = serie_estaciones[i]
    if (estacion not in list(serie_indicativos) and estacion not in estacion_quitar):
        print("Missing: index {}, station {}".format(i, estacion))
        estacion_quitar.append(str(estacion))
        diferencia += 1
        print("The difference is: {}".format(diferencia))
print(estacion_quitar)
print("The difference is: {}".format(diferencia))

The difference is: 0
Missing: index 31, station 1249X
The difference is: -1
Missing: index 194, station 1249I
The difference is: 0
['1249X', '1249I']
The difference is: 0


I show the list of station codes to remove

In [126]:
estacion_quitar

['1249X', '1249I']
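
The stations found by the two loops are those that appear in exactly one of the two collections, which is a symmetric difference. This is a minimal sketch with Python sets, using a handful of station codes taken from the outputs above rather than the full lists.

```python
# Toy subsets of the station list and of the radiation dataset
indicativos_lista = {"0252D", "0076", "1249X"}
indicativos_datos = {"0252D", "0076", "1249I"}

solo_en_lista = indicativos_lista - indicativos_datos   # listed but without data
solo_en_datos = indicativos_datos - indicativos_lista   # data but not listed

# Symmetric difference: codes present in only one of the two sets
estaciones_a_quitar = sorted(indicativos_lista ^ indicativos_datos)
print(solo_en_lista, solo_en_datos, estaciones_a_quitar)
```

Sets ignore order, so the result is sorted here to get a stable list.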

I remove these stations from the dataset, first counting the affected rows

In [127]:
cont = 0
for i in range(0, len(df_soda["estacion"])):
    if str(df_soda["estacion"].loc[i]) in estacion_quitar:
        cont += 1
print("{} rows out of {} must be removed. {} will remain".format(cont, len(df_soda["estacion"]), len(df_soda["estacion"])-cont))

24 rows out of 6888 must be removed. 6864 will remain


I create a column that stores True if the station in that row must be removed

In [128]:
estaciones_df = []

for i in range(0, len(df_soda["estacion"])):
    estacion = df_soda["estacion"].loc[i]
    if estacion in estacion_quitar:
        estaciones_df.append(True)
    else:
        estaciones_df.append(False)

In [129]:
df_soda.insert(len(df_soda.columns),"quitar",estaciones_df,True)

In [130]:
df_soda.head()

Unnamed: 0,dateBegins,dateEnds,toa,cs_ghi,cs_bhi,cs_dhi,cs_bni,ghi,bhi,dhi,bni,reliability,estacion,quitar
0,2021-04-16T00:00:00.0,2021-04-16T01:00:00.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0252D,False
1,2021-04-16T01:00:00.0,2021-04-16T02:00:00.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0252D,False
2,2021-04-16T02:00:00.0,2021-04-16T03:00:00.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0252D,False
3,2021-04-16T03:00:00.0,2021-04-16T04:00:00.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0252D,False
4,2021-04-16T04:00:00.0,2021-04-16T05:00:00.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0252D,False


In [131]:
df_soda.drop(df_soda[df_soda["quitar"] == True].index, inplace = True)

In [132]:
df_soda.reset_index(drop=True, inplace=True)
df_soda.head()

Unnamed: 0,dateBegins,dateEnds,toa,cs_ghi,cs_bhi,cs_dhi,cs_bni,ghi,bhi,dhi,bni,reliability,estacion,quitar
0,2021-04-16T00:00:00.0,2021-04-16T01:00:00.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0252D,False
1,2021-04-16T01:00:00.0,2021-04-16T02:00:00.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0252D,False
2,2021-04-16T02:00:00.0,2021-04-16T03:00:00.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0252D,False
3,2021-04-16T03:00:00.0,2021-04-16T04:00:00.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0252D,False
4,2021-04-16T04:00:00.0,2021-04-16T05:00:00.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0252D,False


In [133]:
print("Indeed, the new number of rows is {}".format(len(df_soda["estacion"])))

Indeed, the new number of rows is 6864
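
The flag column, the drop and the index reset can also be expressed as one boolean mask. This is a sketch on a made-up frame, assuming the station codes are compared as strings; it is an alternative formulation, not the notebook's own cells.

```python
import pandas as pd

df_ejemplo = pd.DataFrame({
    "estacion": ["0252D", "1249I", "0076", "1249I"],
    "ghi": [0.0, 1.5, 2.0, 3.5],
})
estaciones_a_quitar = ["1249X", "1249I"]

# True for the rows whose station must be removed
quitar = df_ejemplo["estacion"].astype(str).isin(estaciones_a_quitar)

# Keep the complement and renumber the rows
df_filtrado = df_ejemplo[~quitar].reset_index(drop=True)
print(int(quitar.sum()), len(df_filtrado))
```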


### Selecting the relevant columns

We will need:
- ``dateBegins``: Date the data correspond to
- ``estacion``: Identifier code of the weather station
- ``ghi``: Global irradiation on horizontal plane at ground level (Wh/m2)

The ``bhi``, ``dhi``, ``bni``, ``toa``, ``reliability`` and ``dateEnds`` columns will not be needed. We also drop the ``quitar`` column and the clear-sky value columns (cs).

In [134]:
df_soda.head()

Unnamed: 0,dateBegins,dateEnds,toa,cs_ghi,cs_bhi,cs_dhi,cs_bni,ghi,bhi,dhi,bni,reliability,estacion,quitar
0,2021-04-16T00:00:00.0,2021-04-16T01:00:00.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0252D,False
1,2021-04-16T01:00:00.0,2021-04-16T02:00:00.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0252D,False
2,2021-04-16T02:00:00.0,2021-04-16T03:00:00.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0252D,False
3,2021-04-16T03:00:00.0,2021-04-16T04:00:00.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0252D,False
4,2021-04-16T04:00:00.0,2021-04-16T05:00:00.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0252D,False


In [135]:
columnas_a_eliminar = ['dateEnds', 'toa', 'cs_ghi', 'cs_bhi', 'cs_dhi', 'cs_bni',
                       'bhi', 'dhi', 'bni', 'reliability', 'quitar']
df_soda.drop(columnas_a_eliminar, axis=1, inplace = True)
df_soda.head()

Unnamed: 0,dateBegins,ghi,estacion
0,2021-04-16T00:00:00.0,0.0,0252D
1,2021-04-16T01:00:00.0,0.0,0252D
2,2021-04-16T02:00:00.0,0.0,0252D
3,2021-04-16T03:00:00.0,0.0,0252D
4,2021-04-16T04:00:00.0,0.0,0252D


In [136]:
df_soda.shape

(6864, 3)

Using ``.info()`` we can check for NAs.

In [137]:
df_soda.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6864 entries, 0 to 6863
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   dateBegins  6864 non-null   object 
 1   ghi         6864 non-null   float64
 2   estacion    6864 non-null   object 
dtypes: float64(1), object(2)
memory usage: 161.0+ KB


The complete dataset is generated by concatenating all the saved daily files

In [138]:
# Set the first date (date of the first daily file)
now = datetime.now()
fecha_inicial = datetime(2021,4,5)

# Set the most recent date
fecha_final = datetime(now.year,now.month,now.day)

# Build the list of days as 'YYYY-MM-DD' strings
rango_fechas = range((fecha_final - fecha_inicial).days + 1)
fechas = [(fecha_inicial + timedelta(days = d)).strftime("%Y-%m-%d") for d in rango_fechas]

fechas

['2021-04-05',
 '2021-04-06',
 '2021-04-07',
 '2021-04-08',
 '2021-04-09',
 '2021-04-10',
 '2021-04-11',
 '2021-04-12',
 '2021-04-13',
 '2021-04-14',
 '2021-04-15',
 '2021-04-16',
 '2021-04-17',
 '2021-04-18',
 '2021-04-19',
 '2021-04-20',
 '2021-04-21',
 '2021-04-22',
 '2021-04-23',
 '2021-04-24',
 '2021-04-25',
 '2021-04-26',
 '2021-04-27',
 '2021-04-28',
 '2021-04-29',
 '2021-04-30',
 '2021-05-01',
 '2021-05-02',
 '2021-05-03',
 '2021-05-04',
 '2021-05-05',
 '2021-05-06',
 '2021-05-07',
 '2021-05-08',
 '2021-05-09',
 '2021-05-10',
 '2021-05-11',
 '2021-05-12',
 '2021-05-13',
 '2021-05-14',
 '2021-05-15',
 '2021-05-16',
 '2021-05-17',
 '2021-05-18',
 '2021-05-19',
 '2021-05-20',
 '2021-05-21',
 '2021-05-22',
 '2021-05-23',
 '2021-05-24',
 '2021-05-25',
 '2021-05-26',
 '2021-05-27',
 '2021-05-28',
 '2021-05-29',
 '2021-05-30',
 '2021-05-31',
 '2021-06-01',
 '2021-06-02',
 '2021-06-03',
 '2021-06-04',
 '2021-06-05']

In [139]:
datos_diarios = []

# For each daily file, add its dataset
for date in fechas:
    try:
        datos_diarios.append(pd.read_csv(directorio + 'data/Rad_SODA/rad_soda_{}'.format(date)))
    except (FileNotFoundError, pd.errors.EmptyDataError):
        # No usable file was saved for this day
        continue

# A single concatenation is faster than appending inside the loop
df_soda_total = pd.concat(datos_diarios, ignore_index = True)

print(df_soda_total.shape)
df_soda_total.head()

(3210699, 13)


Unnamed: 0,dateBegins,dateEnds,toa,cs_ghi,cs_bhi,cs_dhi,cs_bni,ghi,bhi,dhi,bni,reliability,estacion
0,2021-04-17T21:00:00.0,2021-04-17T22:00:00.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0252D
1,2021-04-17T22:00:00.0,2021-04-17T23:00:00.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0252D
2,2021-04-17T23:00:00.0,2021-04-18T00:00:00.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0252D
3,2021-04-05T00:00:00.0,2021-04-05T01:00:00.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0252D
4,2021-04-05T01:00:00.0,2021-04-05T02:00:00.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0252D


### Handling NAs

I show the number of NAs per column and the fraction of the total it represents

In [140]:
total = df_soda_total.isnull().sum().sort_values(ascending = False)
percent = (df_soda_total.isnull().sum() / df_soda_total.isnull().count()).sort_values(ascending = False)
missing_data = pd.concat([total, percent], axis = 1, keys = ['Total', 'Percent'])
missing_data.head(20)

Unnamed: 0,Total,Percent
dhi,16764,0.005221
bhi,16764,0.005221
ghi,16764,0.005221
bni,16380,0.005102
estacion,0,0.0
reliability,0,0.0
cs_bni,0,0.0
cs_dhi,0,0.0
cs_bhi,0,0.0
cs_ghi,0,0.0


For the columns with a small % of NAs, the rows with missing data are removed

In [141]:
df_soda_total.dropna(inplace = True)
df_soda_total.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3193935 entries, 0 to 3210698
Data columns (total 13 columns):
 #   Column       Dtype  
---  ------       -----  
 0   dateBegins   object 
 1   dateEnds     object 
 2   toa          float64
 3   cs_ghi       float64
 4   cs_bhi       float64
 5   cs_dhi       float64
 6   cs_bni       float64
 7   ghi          float64
 8   bhi          float64
 9   dhi          float64
 10  bni          float64
 11  reliability  float64
 12  estacion     object 
dtypes: float64(10), object(3)
memory usage: 341.1+ MB


I obtain the hour column

In [142]:
df_soda_total['dateBegins'] = pd.to_datetime(df_soda_total['dateBegins'])
df_soda_total = df_soda_total.rename(columns={'dateBegins':'date'})
df_soda_total.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3193935 entries, 0 to 3210698
Data columns (total 13 columns):
 #   Column       Dtype         
---  ------       -----         
 0   date         datetime64[ns]
 1   dateEnds     object        
 2   toa          float64       
 3   cs_ghi       float64       
 4   cs_bhi       float64       
 5   cs_dhi       float64       
 6   cs_bni       float64       
 7   ghi          float64       
 8   bhi          float64       
 9   dhi          float64       
 10  bni          float64       
 11  reliability  float64       
 12  estacion     object        
dtypes: datetime64[ns](1), float64(10), object(2)
memory usage: 341.1+ MB


In [143]:
df_soda_total['hora'] = df_soda_total['date'].dt.hour
df_soda_total['fecha'] = df_soda_total['date'].dt.strftime('%Y-%m-%d')

In [144]:
df_soda_total.head()

Unnamed: 0,date,dateEnds,toa,cs_ghi,cs_bhi,cs_dhi,cs_bni,ghi,bhi,dhi,bni,reliability,estacion,hora,fecha
0,2021-04-17 21:00:00,2021-04-17T22:00:00.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0252D,21,2021-04-17
1,2021-04-17 22:00:00,2021-04-17T23:00:00.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0252D,22,2021-04-17
2,2021-04-17 23:00:00,2021-04-18T00:00:00.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0252D,23,2021-04-17
3,2021-04-05 00:00:00,2021-04-05T01:00:00.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0252D,0,2021-04-05
4,2021-04-05 01:00:00,2021-04-05T02:00:00.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0252D,1,2021-04-05


In [145]:
df_soda_total = df_soda_total[(df_soda_total["hora"] < hora_fin) & (df_soda_total["hora"] >= hora_ini)]
df_soda_total.reset_index(drop=True, inplace=True)
df_soda_total.head()

Unnamed: 0,date,dateEnds,toa,cs_ghi,cs_bhi,cs_dhi,cs_bni,ghi,bhi,dhi,bni,reliability,estacion,hora,fecha
0,2021-04-05 05:00:00,2021-04-05T06:00:00.0,0.0,0.0,0.0,0.0,0.0,6.3289,0.4241,5.9048,7.6825,0.9667,0252D,5,2021-04-05
1,2021-04-05 06:00:00,2021-04-05T07:00:00.0,0.0,0.0,0.0,0.0,0.0,120.0964,57.4725,62.6239,268.0484,1.0,0252D,6,2021-04-05
2,2021-04-05 07:00:00,2021-04-05T08:00:00.0,0.0,0.0,0.0,0.0,0.0,303.6657,194.8799,108.7858,509.9916,1.0,0252D,7,2021-04-05
3,2021-04-05 08:00:00,2021-04-05T09:00:00.0,0.0,0.0,0.0,0.0,0.0,484.0124,348.3073,135.7052,640.0478,1.0,0252D,8,2021-04-05
4,2021-04-05 09:00:00,2021-04-05T10:00:00.0,0.0,0.0,0.0,0.0,0.0,637.7261,486.6163,151.1098,720.5049,1.0,0252D,9,2021-04-05


Any duplicated rows are removed

In [146]:
df_soda_total = df_soda_total.drop_duplicates(["date", 'fecha', 'hora', "estacion"],
                        keep = 'first')
df_soda_total.reset_index(drop = True, inplace = True)
df_soda_total.head()

Unnamed: 0,date,dateEnds,toa,cs_ghi,cs_bhi,cs_dhi,cs_bni,ghi,bhi,dhi,bni,reliability,estacion,hora,fecha
0,2021-04-05 05:00:00,2021-04-05T06:00:00.0,0.0,0.0,0.0,0.0,0.0,6.3289,0.4241,5.9048,7.6825,0.9667,0252D,5,2021-04-05
1,2021-04-05 06:00:00,2021-04-05T07:00:00.0,0.0,0.0,0.0,0.0,0.0,120.0964,57.4725,62.6239,268.0484,1.0,0252D,6,2021-04-05
2,2021-04-05 07:00:00,2021-04-05T08:00:00.0,0.0,0.0,0.0,0.0,0.0,303.6657,194.8799,108.7858,509.9916,1.0,0252D,7,2021-04-05
3,2021-04-05 08:00:00,2021-04-05T09:00:00.0,0.0,0.0,0.0,0.0,0.0,484.0124,348.3073,135.7052,640.0478,1.0,0252D,8,2021-04-05
4,2021-04-05 09:00:00,2021-04-05T10:00:00.0,0.0,0.0,0.0,0.0,0.0,637.7261,486.6163,151.1098,720.5049,1.0,0252D,9,2021-04-05


# Función final de radiación
<div style = "float:right"><a style="text-decoration:none" href = "#Limpieza-y-preparación-de-datos-históricos-para-los-modelos-de-predicción">

In [147]:
def soda_clean(df_datos, df_puntos):
    """Clean the raw SoDa radiation data and save the result as a CSV file."""

    # Keep only the stations that appear in both the station list and the data
    indicativos = df_puntos["indicativo"].astype("str").unique()
    df_datos = df_datos[df_datos["estacion"].astype("str").isin(indicativos)].copy()

    # Drop the columns that are not needed
    df_datos = df_datos.drop(columns=['dateEnds', 'toa', 'cs_ghi', 'cs_bhi', 'cs_dhi',
                                      'cs_bni', 'bhi', 'dhi', 'bni', 'reliability'])

    # Convert the columns to the correct data types
    df_datos['dateBegins'] = pd.to_datetime(df_datos['dateBegins'])
    df_datos = df_datos.rename(columns={'dateBegins':'date'})
    df_datos['hora'] = df_datos['date'].dt.hour
    df_datos['fecha'] = df_datos['date'].dt.strftime('%Y-%m-%d')

    # Keep only daylight hours (hora_ini and hora_fin are notebook-level globals)
    df_datos = df_datos[(df_datos["hora"] < hora_fin) & (df_datos["hora"] >= hora_ini)]

    # Drop NAs
    df_datos = df_datos.dropna()

    # Remove any duplicated rows
    df_datos = df_datos.drop_duplicates(["date", 'fecha', 'hora', "estacion"], keep = 'first')
    df_datos = df_datos.reset_index(drop = True)

    # Save the clean file
    nombre = './data/Historicos_modelo_2/rad_soda_clean.csv'
    df_datos.to_csv(nombre, index = False)

    return df_datos

In [148]:
import numpy as np
import pandas as pd
import random
pd.options.display.max_columns = None
pd.options.display.max_rows = None
import matplotlib.pyplot as plt
plt.style.use("seaborn")
from datetime import datetime, timedelta
hora_ini = 4
hora_fin = 20

# Read the CSVs

now = datetime.now()
fecha_inicial = datetime(2021,4,5)
fecha_final = datetime(now.year,now.month,now.day)
rango_fechas = range((fecha_final - fecha_inicial).days + 1)
# One "YYYY-MM-DD" string per day, from the first download up to today
fechas = [(fecha_inicial + timedelta(days = d)).strftime("%Y-%m-%d") for d in rango_fechas]

directorio = '/home/dsc/git/TFM/'
df_estaciones = pd.read_csv(directorio + 'data/estaciones.csv')

# Collect the daily files in a list and concatenate them once at the end
lista_soda = []
for date in fechas:
    try:
        lista_soda.append(pd.read_csv(directorio + 'data/Rad_SODA/rad_soda_{}'.format(date)))
    except FileNotFoundError:
        # No file was downloaded for this date
        continue
df_soda_total = pd.concat(lista_soda, ignore_index = True)

# Call the function
df_clean = soda_clean(df_soda_total, df_estaciones)
    
df_clean.head()

Unnamed: 0,date,ghi,estacion,hora,fecha
0,2021-04-05 04:00:00,0.0,0252D,4,2021-04-05
1,2021-04-05 05:00:00,6.3289,0252D,5,2021-04-05
2,2021-04-05 06:00:00,120.0964,0252D,6,2021-04-05
3,2021-04-05 07:00:00,303.6657,0252D,7,2021-04-05
4,2021-04-05 08:00:00,484.0124,0252D,8,2021-04-05


In [149]:
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 207440 entries, 0 to 207439
Data columns (total 5 columns):
 #   Column    Non-Null Count   Dtype         
---  ------    --------------   -----         
 0   date      207440 non-null  datetime64[ns]
 1   ghi       207440 non-null  float64       
 2   estacion  207440 non-null  object        
 3   hora      207440 non-null  int64         
 4   fecha     207440 non-null  object        
dtypes: datetime64[ns](1), float64(1), int64(1), object(2)
memory usage: 7.9+ MB


### Sanity check

In [150]:
fechas_clean = pd.to_datetime(df_clean['fecha'])
print('Distinct years: ', fechas_clean.dt.year.nunique())
print('Distinct months:', fechas_clean.dt.month.nunique())
print('There should be 31 distinct days:', fechas_clean.dt.day.nunique())


print('Only hours inside the filtering window should appear:', df_clean['hora'].nunique())

Distinct years:  1
Distinct months: 2
There should be 31 distinct days: 30
Only hours inside the filtering window should appear: 16
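
The same kind of check can also be expressed as boolean conditions instead of printed counts. The sketch below is only an illustration on a made-up frame (`df_check` and its values are hypothetical, shaped like `df_clean`); it assumes the daylight window of 4 to 20 used in this notebook.

```python
import pandas as pd

hora_ini, hora_fin = 4, 20   # same daylight window as in the notebook

# Hypothetical toy frame with the columns of df_clean
df_check = pd.DataFrame({
    "date": pd.to_datetime(["2021-04-05 04:00", "2021-04-05 05:00"]),
    "ghi": [0.0, 6.3289],
    "estacion": ["0252D", "0252D"],
})
df_check["hora"] = df_check["date"].dt.hour
df_check["fecha"] = df_check["date"].dt.strftime("%Y-%m-%d")

# Every hour must fall inside [hora_ini, hora_fin)
hours_ok = df_check["hora"].between(hora_ini, hora_fin - 1).all()
# No (date, station) pair may appear twice
no_dupes = not df_check.duplicated(["date", "estacion"]).any()
# No missing values may remain
no_nas = not df_check.isna().any().any()
print(hours_ok, no_dupes, no_nas)
```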


# Generación de las filas con los días en columnas
<div style = "float:right"><a style="text-decoration:none" href = "#Limpieza-y-preparación-de-datos-históricos-para-los-modelos-de-predicción">

For each day on which data are collected, there must be exactly one row per hour and station. For example, for the historical weather data of the previous 5 days, each day's data must become columns attached to the hours of the day on which the data frame was downloaded (for each station, from 4 to 19: columns with the data of the previous day, columns with those of the day before that, and so on).
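
As a minimal sketch of the target layout, the toy example below turns two daily records for the same station and hour into a single row with one column per lag. The frame `df_toy` and its values are made up for illustration, and `pivot` is used here only to show the shape; it is not the method the notebook follows.

```python
import pandas as pd

# Hypothetical toy data: one station, one hour, two previous days
df_toy = pd.DataFrame({
    "date": ["2021-04-04", "2021-04-03"],
    "hour": [4, 4],
    "fecha prediccion": ["2021-04-05", "2021-04-05"],
    "estacion": ["0252D", "0252D"],
    "temp": [284.3, 285.17],
})

# Lag in days between the call date and the date of the record
df_toy["lag"] = (pd.to_datetime(df_toy["fecha prediccion"])
                 - pd.to_datetime(df_toy["date"])).dt.days

# One row per (hour, call date, station); one column per lag
df_wide = df_toy.pivot(index=["hour", "fecha prediccion", "estacion"],
                       columns="lag", values="temp")
df_wide.columns = ["temp_d-{}".format(lag) for lag in df_wide.columns]
df_wide = df_wide.reset_index()
print(df_wide)
```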

# Historical weather data

Open the CSV

In [5]:
df_clima_total = pd.read_csv('./data/Historicos_modelo_2/historicos_climaticos_clean.csv', sep=',')
df_clima_total.head()

Unnamed: 0,date,hour,fecha prediccion,estacion,temp,feels_like,pressure,humidity,dew_point,clouds,visibility,wind_speed,wind_deg,wind_gust,we
0,2021-04-04,4,2021-04-05,0252D,284.3,282.56,1014,93,283.21,0,10000.0,2.57,280,4.782415,800
1,2021-04-04,5,2021-04-05,0252D,284.59,282.57,1014,93,283.5,0,10000.0,3.09,290,4.782415,800
2,2021-04-04,6,2021-04-05,0252D,284.41,283.88,1015,87,282.33,0,10000.0,0.51,0,4.782415,800
3,2021-04-04,7,2021-04-05,0252D,284.99,284.61,1016,87,282.9,0,10000.0,0.51,0,4.782415,800
4,2021-04-04,8,2021-04-05,0252D,286.41,286.17,1016,82,283.41,0,10000.0,0.51,0,4.782415,800


Generate rows containing all the days of each call. The columns are labelled according to the day (d-1, d-2, ...)

In [6]:
columnas_1 = [col for col in df_clima_total.columns[1:4]] + [str(col+"_d-1") for col in df_clima_total.columns[4:]]
columnas_2 = [col for col in df_clima_total.columns[1:4]] + [str(col+"_d-2") for col in df_clima_total.columns[4:]]
columnas_3 = [col for col in df_clima_total.columns[1:4]] + [str(col+"_d-3") for col in df_clima_total.columns[4:]]
columnas_4 = [col for col in df_clima_total.columns[1:4]] + [str(col+"_d-4") for col in df_clima_total.columns[4:]]
columnas_5 = [col for col in df_clima_total.columns[1:4]] + [str(col+"_d-5") for col in df_clima_total.columns[4:]]

**WARNING:** This takes many hours to run

In [7]:
df_clima_dias_1 = pd.DataFrame(columns = columnas_1)
df_clima_dias_2 = pd.DataFrame(columns = columnas_2)
df_clima_dias_3 = pd.DataFrame(columns = columnas_3)
df_clima_dias_4 = pd.DataFrame(columns = columnas_4)
df_clima_dias_5 = pd.DataFrame(columns = columnas_5)

n_filas = len(df_clima_total)

for i, fila in df_clima_total.iterrows():
    
    if (i % 5000 == 0) | (i == n_filas - 1):
        print("Processing row {} of {}".format(i, n_filas))
        print("The number of rows in each dataset (approximately 1/5) is {}".format(len(df_clima_dias_1)))
        with open("./data/Historicos_modelo_2/Datasets_por_fecha/Registro_clima.txt",'a') as outFile:
            outFile.write('\n' + "Row: " + str(i))
    
    # For each hourly row, work out which previous day it belongs to and append it to the matching dataset
    dias = (pd.to_datetime(fila["fecha prediccion"]) - pd.to_datetime(fila["date"])).days
    valores = list(fila)[1:]

    if dias == 1:
        df_clima_dias_1.loc[len(df_clima_dias_1)] = valores
    elif dias == 2:
        df_clima_dias_2.loc[len(df_clima_dias_2)] = valores
    elif dias == 3:
        df_clima_dias_3.loc[len(df_clima_dias_3)] = valores
    elif dias == 4:
        df_clima_dias_4.loc[len(df_clima_dias_4)] = valores
    elif dias == 5:
        df_clima_dias_5.loc[len(df_clima_dias_5)] = valores
        
    # Save the datasets periodically, in case memory runs out
    if (i % 1000 == 0) | (i == n_filas - 1):
        
        # Save the files
        nombre_1 = './data/Historicos_modelo_2/Datasets_por_fecha/clima_por_horas_1.csv'
        nombre_2 = './data/Historicos_modelo_2/Datasets_por_fecha/clima_por_horas_2.csv'
        nombre_3 = './data/Historicos_modelo_2/Datasets_por_fecha/clima_por_horas_3.csv'
        nombre_4 = './data/Historicos_modelo_2/Datasets_por_fecha/clima_por_horas_4.csv'
        nombre_5 = './data/Historicos_modelo_2/Datasets_por_fecha/clima_por_horas_5.csv'
        df_clima_dias_1.to_csv(nombre_1, index = False)
        df_clima_dias_2.to_csv(nombre_2, index = False)
        df_clima_dias_3.to_csv(nombre_3, index = False)
        df_clima_dias_4.to_csv(nombre_4, index = False)
        df_clima_dias_5.to_csv(nombre_5, index = False)
        

df_clima_dias_1.head()

Processing row 0 of 1362000
The number of rows in each dataset (approximately 1/5) is 0
Processing row 5000 of 1362000
The number of rows in each dataset (approximately 1/5) is 1008
Processing row 10000 of 1362000
The number of rows in each dataset (approximately 1/5) is 2000
Processing row 15000 of 1362000
The number of rows in each dataset (approximately 1/5) is 3008
Processing row 20000 of 1362000
The number of rows in each dataset (approximately 1/5) is 4000
Processing row 25000 of 1362000
The number of rows in each dataset (approximately 1/5) is 5008
Processing row 30000 of 1362000
The number of rows in each dataset (approximately 1/5) is 6000
Processing row 35000 of 1362000
The number of rows in each dataset (approximately 1/5) is 7008
Processing row 40000 of 1362000
The number of rows in each dataset (approximately 1/5) is 8000
Processing row 45000 of 1362000
The number of rows in each dataset (approximately 1/5) is 9008
Processi

Unnamed: 0,hour,fecha prediccion,estacion,temp_d-1,feels_like_d-1,pressure_d-1,humidity_d-1,dew_point_d-1,clouds_d-1,visibility_d-1,wind_speed_d-1,wind_deg_d-1,wind_gust_d-1,we_d-1
0,4,2021-04-05,0252D,284.3,282.56,1014,93,283.21,0,10000.0,2.57,280,4.782415,800
1,5,2021-04-05,0252D,284.59,282.57,1014,93,283.5,0,10000.0,3.09,290,4.782415,800
2,6,2021-04-05,0252D,284.41,283.88,1015,87,282.33,0,10000.0,0.51,0,4.782415,800
3,7,2021-04-05,0252D,284.99,284.61,1016,87,282.9,0,10000.0,0.51,0,4.782415,800
4,8,2021-04-05,0252D,286.41,286.17,1016,82,283.41,0,10000.0,0.51,0,4.782415,800
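
Since the row-by-row loop above is slow, a vectorized split may be worth considering. The following is a sketch under assumptions, not the notebook's method: `split_by_lag` and `df_demo` are hypothetical names, and the input is assumed to have the same key columns as `df_clima_total` (`date`, `hour`, `fecha prediccion`, `estacion`).

```python
import pandas as pd

def split_by_lag(df, lags=(1, 2, 3, 4, 5)):
    """Return {lag: frame} with the value columns suffixed by "_d-<lag>"."""
    keys = ["hour", "fecha prediccion", "estacion"]
    # Days between the call date and the date of each record
    lag_days = (pd.to_datetime(df["fecha prediccion"])
                - pd.to_datetime(df["date"])).dt.days
    value_cols = [c for c in df.columns if c not in keys + ["date"]]
    frames = {}
    for lag in lags:
        # Boolean mask instead of a Python loop over rows
        part = df.loc[lag_days == lag, keys + value_cols]
        frames[lag] = part.rename(
            columns={c: "{}_d-{}".format(c, lag) for c in value_cols}
        ).reset_index(drop=True)
    return frames

# Hypothetical toy input
df_demo = pd.DataFrame({
    "date": ["2021-04-04", "2021-04-03", "2021-04-04"],
    "hour": [4, 4, 5],
    "fecha prediccion": ["2021-04-05"] * 3,
    "estacion": ["0252D"] * 3,
    "temp": [284.3, 285.17, 284.59],
})
frames = split_by_lag(df_demo, lags=(1, 2))
print(frames[1])
```

Each lag then costs one pass over the data, rather than one `loc` assignment per row.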


Load the saved datasets

In [8]:
df_clima_dias_1 = pd.read_csv('./data/Historicos_modelo_2/Datasets_por_fecha/clima_por_horas_1.csv', sep=',')

df_clima_dias_2 = pd.read_csv('./data/Historicos_modelo_2/Datasets_por_fecha/clima_por_horas_2.csv', sep=',')

df_clima_dias_3 = pd.read_csv('./data/Historicos_modelo_2/Datasets_por_fecha/clima_por_horas_3.csv', sep=',')

df_clima_dias_4 = pd.read_csv('./data/Historicos_modelo_2/Datasets_por_fecha/clima_por_horas_4.csv', sep=',')

df_clima_dias_5 = pd.read_csv('./data/Historicos_modelo_2/Datasets_por_fecha/clima_por_horas_5.csv', sep=',')
df_clima_dias_5.head()

Unnamed: 0,hour,fecha prediccion,estacion,temp_d-5,feels_like_d-5,pressure_d-5,humidity_d-5,dew_point_d-5,clouds_d-5,visibility_d-5,wind_speed_d-5,wind_deg_d-5,wind_gust_d-5,we_d-5
0,4,2021-04-05,0252D,284.67,282.93,1026,68,278.98,42,9770.795014,1.12,338,1.18,802
1,5,2021-04-05,0252D,284.59,282.86,1026,67,278.69,50,9770.795014,1.02,336,1.1,800
2,6,2021-04-05,0252D,284.61,282.96,1026,68,278.92,41,9770.795014,0.97,324,1.07,802
3,7,2021-04-05,0252D,286.04,284.8,1026,64,279.4,47,9770.795014,0.54,329,0.85,802
4,8,2021-04-05,0252D,287.52,286.42,1026,58,279.37,54,9770.795014,0.32,121,0.67,803


In [9]:
print("Shape of the day-1 dataset: {}".format(df_clima_dias_1.shape))
print("Shape of the day-2 dataset: {}".format(df_clima_dias_2.shape))
print("Shape of the day-3 dataset: {}".format(df_clima_dias_3.shape))
print("Shape of the day-4 dataset: {}".format(df_clima_dias_4.shape))
print("Shape of the day-5 dataset: {}".format(df_clima_dias_5.shape))

Shape of the day-1 dataset: (272464, 14)
Shape of the day-2 dataset: (272416, 14)
Shape of the day-3 dataset: (272384, 14)
Shape of the day-4 dataset: (272384, 14)
Shape of the day-5 dataset: (272352, 14)


They are joined on prediction date, station and hour, to produce one row per call day, hour and station

In [10]:
df_total = pd.merge(df_clima_dias_1, df_clima_dias_2, how = "inner", on = ["hour", "fecha prediccion", "estacion"])
df_total = pd.merge(df_total, df_clima_dias_3, how = "inner", on = ["hour", "fecha prediccion", "estacion"])
df_total = pd.merge(df_total, df_clima_dias_4, how = "inner", on = ["hour", "fecha prediccion", "estacion"])
df_total = pd.merge(df_total, df_clima_dias_5, how = "inner", on = ["hour", "fecha prediccion", "estacion"])
df_total.head()

Unnamed: 0,hour,fecha prediccion,estacion,temp_d-1,feels_like_d-1,pressure_d-1,humidity_d-1,dew_point_d-1,clouds_d-1,visibility_d-1,wind_speed_d-1,wind_deg_d-1,wind_gust_d-1,we_d-1,temp_d-2,feels_like_d-2,pressure_d-2,humidity_d-2,dew_point_d-2,clouds_d-2,visibility_d-2,wind_speed_d-2,wind_deg_d-2,wind_gust_d-2,we_d-2,temp_d-3,feels_like_d-3,pressure_d-3,humidity_d-3,dew_point_d-3,clouds_d-3,visibility_d-3,wind_speed_d-3,wind_deg_d-3,wind_gust_d-3,we_d-3,temp_d-4,feels_like_d-4,pressure_d-4,humidity_d-4,dew_point_d-4,clouds_d-4,visibility_d-4,wind_speed_d-4,wind_deg_d-4,wind_gust_d-4,we_d-4,temp_d-5,feels_like_d-5,pressure_d-5,humidity_d-5,dew_point_d-5,clouds_d-5,visibility_d-5,wind_speed_d-5,wind_deg_d-5,wind_gust_d-5,we_d-5
0,4,2021-04-05,0252D,284.3,282.56,1014,93,283.21,0,10000.0,2.57,280,4.782415,800,285.17,283.96,1012,76,281.08,0,10000.0,1.03,280,4.782415,800,284.75,282.73,1014,76,280.67,0,10000.0,2.06,290,4.782415,800,282.28,280.06,1020,75,278.09,0,10000.0,1.54,310,4.782415,800,284.67,282.93,1026,68,278.98,42,9770.795014,1.12,338,1.18,802
1,5,2021-04-05,0252D,284.59,282.57,1014,93,283.5,0,10000.0,3.09,290,4.782415,800,284.66,283.21,1012,81,281.52,0,10000.0,1.54,320,4.782415,800,283.71,282.39,1014,81,280.59,0,10000.0,1.03,0,4.782415,800,281.3,279.0,1020,81,278.24,0,10000.0,1.7,288,1.76,800,284.59,282.86,1026,67,278.69,50,9770.795014,1.02,336,1.1,800
2,6,2021-04-05,0252D,284.41,283.88,1015,87,282.33,0,10000.0,0.51,0,4.782415,800,285.18,284.34,1012,76,281.09,0,10000.0,0.51,0,4.782415,800,285.04,283.22,1014,71,279.96,0,10000.0,1.54,300,4.782415,800,281.42,279.47,1020,87,279.39,0,10000.0,1.54,280,4.782415,800,284.61,282.96,1026,68,278.92,41,9770.795014,0.97,324,1.07,802
3,7,2021-04-05,0252D,284.99,284.61,1016,87,282.9,0,10000.0,0.51,0,4.782415,800,287.22,286.25,1012,71,282.05,0,10000.0,1.03,0,4.782415,800,285.78,283.43,1015,71,280.67,0,10000.0,2.53,31,2.92,800,284.12,282.78,1021,70,278.87,0,10000.0,0.51,0,4.782415,800,286.04,284.8,1026,64,279.4,47,9770.795014,0.54,329,0.85,802
4,8,2021-04-05,0252D,286.41,286.17,1016,82,283.41,0,10000.0,0.51,0,4.782415,800,288.86,287.74,1013,55,279.85,0,10000.0,0.51,0,4.782415,800,288.53,287.51,1015,58,280.32,0,10000.0,0.51,0,4.782415,800,286.82,285.25,1021,54,277.69,0,10000.0,0.51,150,4.782415,800,287.52,286.42,1026,58,279.37,54,9770.795014,0.32,121,0.67,803
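
The chain of inner joins above can also be written as a fold over a list of frames. This is a minimal sketch on made-up data: the toy frames stand in for `df_clima_dias_1` ... `df_clima_dias_5`, and `df_merged` is a hypothetical name.

```python
from functools import reduce
import pandas as pd

keys = ["hour", "fecha prediccion", "estacion"]

# Hypothetical toy frames, one per lag, sharing the same key columns
frames = [
    pd.DataFrame({"hour": [4, 5],
                  "fecha prediccion": ["2021-04-05"] * 2,
                  "estacion": ["0252D"] * 2,
                  "temp_d-{}".format(lag): [280.0 + lag, 281.0 + lag]})
    for lag in range(1, 4)
]

# Fold the list into one frame, inner-joining on the shared key columns
df_merged = reduce(lambda left, right: pd.merge(left, right, how="inner", on=keys),
                   frames)
print(df_merged.shape)
```

Because the join is `inner`, any (hour, call date, station) combination missing from one of the frames is dropped from the result, which is consistent with the slightly different row counts of the five datasets.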


In [11]:
nombre = './data/Historicos_modelo_2/clima_por_horas.csv'
df_total.to_csv(nombre, index = False)

# Weather forecast data

In [12]:
df_pred_total = pd.read_csv('./data/Historicos_modelo_2/predicciones_climaticas_clean.csv', sep=',')
df_pred_total.head()

Unnamed: 0,date,hour,fecha prediccion,estacion,temp,feels_like,pressure,humidity,dew_point,uvi,clouds,visibility,wind_speed,wind_deg,wind_gust,pop,we
0,2021-04-05,4,2021-04-05,0252D,285.47,284.66,1017,73,280.28,0.0,63,10000,0.98,306,1.36,0.0,803
1,2021-04-05,5,2021-04-05,0252D,285.54,284.66,1017,70,279.91,0.0,65,10000,1.18,10,1.43,0.0,803
2,2021-04-05,6,2021-04-05,0252D,285.31,284.44,1018,71,279.75,0.11,61,10000,1.73,34,1.71,0.0,803
3,2021-04-05,7,2021-04-05,0252D,286.09,285.19,1017,67,279.64,0.55,16,10000,1.52,58,1.76,0.0,801
4,2021-04-05,8,2021-04-05,0252D,287.24,286.32,1017,62,279.58,1.46,10,10000,0.94,103,1.28,0.0,800


Generate rows containing all the days of each call. The days are split into two groups. The first holds the hours of the day on which the data are obtained and of two days later. The second holds the data for the day after the data are obtained. Because the forecasts for the 48 hours following the call are downloaded, each of the two groups has 24 values per day and location (16 after filtering to the useful hours).

In [14]:
columnas_1 = [col for col in df_pred_total.columns[1:4]] + [str(col+"_pred_1") for col in df_pred_total.columns[4:]]
columnas_2 = [col for col in df_pred_total.columns[1:4]] + [str(col+"_pred_2") for col in df_pred_total.columns[4:]]

**WARNING:** This takes many hours to run

In [15]:
df_pred_dias_1 = pd.DataFrame(columns = columnas_1)
df_pred_dias_2 = pd.DataFrame(columns = columnas_2)

n_filas = len(df_pred_total)

for i, fila in df_pred_total.iterrows():
    
    if (i % 5000 == 0) | (i == n_filas - 1):
        print("Processing row {} of {}".format(i, n_filas))
        print("The number of rows in each dataset (approximately 1/2) is {}".format(len(df_pred_dias_1)))
        with open("./data/Historicos_modelo_2/Datasets_por_fecha/Registro_pred.txt",'a') as outFile:
            outFile.write('\n' + "Row: " + str(i))
       
    # For each hourly row, work out which day it belongs to and append it to the matching dataset
    dias = (pd.to_datetime(fila["date"]) - pd.to_datetime(fila["fecha prediccion"])).days
    valores = list(fila)[1:]

    # Group 1: the call day (offset 0) and two days later (offset 2)
    if dias in (0, 2):
        df_pred_dias_1.loc[len(df_pred_dias_1)] = valores
    # Group 2: the day after the call (offset 1)
    elif dias == 1:
        df_pred_dias_2.loc[len(df_pred_dias_2)] = valores

    # Save the datasets periodically, in case memory runs out
    if (i % 1000 == 0) | (i == n_filas - 1):
        # Save the files
        nombre_1 = './data/Historicos_modelo_2/Datasets_por_fecha/pred_por_horas_1.csv'
        nombre_2 = './data/Historicos_modelo_2/Datasets_por_fecha/pred_por_horas_2.csv'
        df_pred_dias_1.to_csv(nombre_1, index = False)
        df_pred_dias_2.to_csv(nombre_2, index = False)
        

df_pred_dias_2.head()

Processing row 0 of 549208
The number of rows in each dataset (approximately 1/2) is 0
Processing row 5000 of 549208
The number of rows in each dataset (approximately 1/2) is 2504
Processing row 10000 of 549208
The number of rows in each dataset (approximately 1/2) is 5824
Processing row 15000 of 549208
The number of rows in each dataset (approximately 1/2) is 8328
Processing row 20000 of 549208
The number of rows in each dataset (approximately 1/2) is 10832
Processing row 25000 of 549208
The number of rows in each dataset (approximately 1/2) is 13328
Processing row 30000 of 549208
The number of rows in each dataset (approximately 1/2) is 15824
Processing row 35000 of 549208
The number of rows in each dataset (approximately 1/2) is 18328
Processing row 40000 of 549208
The number of rows in each dataset (approximately 1/2) is 20832
Processing row 45000 of 549208
The number of rows in each dataset (approximately 1/2) is 23331
Processing r

Unnamed: 0,hour,fecha prediccion,estacion,temp_pred_2,feels_like_pred_2,pressure_pred_2,humidity_pred_2,dew_point_pred_2,uvi_pred_2,clouds_pred_2,visibility_pred_2,wind_speed_pred_2,wind_deg_pred_2,wind_gust_pred_2,pop_pred_2,we_pred_2
0,4,2021-04-05,0252D,284.28,283.56,1011,81,280.72,0.0,0,10000,4.15,31,5.29,0.0,800
1,5,2021-04-05,0252D,284.28,283.64,1011,84,281.19,0.0,3,10000,4.36,33,5.97,0.0,800
2,6,2021-04-05,0252D,284.93,284.36,1011,84,281.82,0.12,6,10000,4.17,46,6.1,0.0,800
3,7,2021-04-05,0252D,285.77,285.2,1012,81,282.26,0.38,49,10000,4.13,78,5.67,0.0,802
4,8,2021-04-05,0252D,285.76,285.19,1012,81,282.09,1.04,58,10000,3.4,90,4.5,0.0,803
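
The two-group classification used in the loop above can be sketched with boolean masks. The example is illustrative only: `df_pred_demo`, `grupo_1`, `grupo_2` and the values are hypothetical, and the frame is assumed to share the key columns of `df_pred_total`.

```python
import pandas as pd

# Hypothetical toy rows with the key columns of df_pred_total
df_pred_demo = pd.DataFrame({
    "date": ["2021-04-05", "2021-04-06", "2021-04-07"],
    "hour": [19, 4, 4],
    "fecha prediccion": ["2021-04-05"] * 3,
    "estacion": ["0252D"] * 3,
    "temp": [285.47, 284.28, 286.10],
})

# Days from the call date to the forecast date: 0, 1 or 2
offset = (pd.to_datetime(df_pred_demo["date"])
          - pd.to_datetime(df_pred_demo["fecha prediccion"])).dt.days

# Group 1: the call day and the call day + 2
grupo_1 = df_pred_demo[offset.isin([0, 2])].drop(columns="date")
# Group 2: the call day + 1
grupo_2 = df_pred_demo[offset == 1].drop(columns="date")
print(len(grupo_1), len(grupo_2))
```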


Load the saved datasets

In [16]:
df_pred_dias_1 = pd.read_csv('./data/Historicos_modelo_2/Datasets_por_fecha/pred_por_horas_1.csv', sep=',')

df_pred_dias_2 = pd.read_csv('./data/Historicos_modelo_2/Datasets_por_fecha/pred_por_horas_2.csv', sep=',')

df_pred_dias_2.head()

Unnamed: 0,hour,fecha prediccion,estacion,temp_pred_2,feels_like_pred_2,pressure_pred_2,humidity_pred_2,dew_point_pred_2,uvi_pred_2,clouds_pred_2,visibility_pred_2,wind_speed_pred_2,wind_deg_pred_2,wind_gust_pred_2,pop_pred_2,we_pred_2
0,4,2021-04-05,0252D,284.28,283.56,1011,81,280.72,0.0,0,10000,4.15,31,5.29,0.0,800
1,5,2021-04-05,0252D,284.28,283.64,1011,84,281.19,0.0,3,10000,4.36,33,5.97,0.0,800
2,6,2021-04-05,0252D,284.93,284.36,1011,84,281.82,0.12,6,10000,4.17,46,6.1,0.0,800
3,7,2021-04-05,0252D,285.77,285.2,1012,81,282.26,0.38,49,10000,4.13,78,5.67,0.0,802
4,8,2021-04-05,0252D,285.76,285.19,1012,81,282.09,1.04,58,10000,3.4,90,4.5,0.0,803


In [17]:
print("Shape of the day/day+2 dataset: {}".format(df_pred_dias_1.shape))
print("Shape of the day+1 dataset: {}".format(df_pred_dias_2.shape))

Shape of the day/day+2 dataset: (276296, 16)
Shape of the day+1 dataset: (272912, 16)


They are joined on prediction date, station and hour, to produce one row per call day, hour and station

In [18]:
# Remove any duplicated rows

df_pred_dias_1 = df_pred_dias_1.drop_duplicates(['hour', "fecha prediccion", "estacion"],
                        keep = 'first')
df_pred_dias_1.reset_index(drop = True, inplace = True)
df_pred_dias_2 = df_pred_dias_2.drop_duplicates(['hour', "fecha prediccion", "estacion"],
                        keep = 'first')
df_pred_dias_2.reset_index(drop = True, inplace = True)

# Join the datasets
df_total_previo = pd.merge(df_pred_dias_1, df_pred_dias_2, how = "inner", on = ["hour", "fecha prediccion", "estacion"])
df_total_previo.head()

Unnamed: 0,hour,fecha prediccion,estacion,temp_pred_1,feels_like_pred_1,pressure_pred_1,humidity_pred_1,dew_point_pred_1,uvi_pred_1,clouds_pred_1,visibility_pred_1,wind_speed_pred_1,wind_deg_pred_1,wind_gust_pred_1,pop_pred_1,we_pred_1,temp_pred_2,feels_like_pred_2,pressure_pred_2,humidity_pred_2,dew_point_pred_2,uvi_pred_2,clouds_pred_2,visibility_pred_2,wind_speed_pred_2,wind_deg_pred_2,wind_gust_pred_2,pop_pred_2,we_pred_2
0,4,2021-04-05,0252D,285.47,284.66,1017,73,280.28,0.0,63,10000,0.98,306,1.36,0.0,803,284.28,283.56,1011,81,280.72,0.0,0,10000,4.15,31,5.29,0.0,800
1,5,2021-04-05,0252D,285.54,284.66,1017,70,279.91,0.0,65,10000,1.18,10,1.43,0.0,803,284.28,283.64,1011,84,281.19,0.0,3,10000,4.36,33,5.97,0.0,800
2,6,2021-04-05,0252D,285.31,284.44,1018,71,279.75,0.11,61,10000,1.73,34,1.71,0.0,803,284.93,284.36,1011,84,281.82,0.12,6,10000,4.17,46,6.1,0.0,800
3,7,2021-04-05,0252D,286.09,285.19,1017,67,279.64,0.55,16,10000,1.52,58,1.76,0.0,801,285.77,285.2,1012,81,282.26,0.38,49,10000,4.13,78,5.67,0.0,802
4,8,2021-04-05,0252D,287.24,286.32,1017,62,279.58,1.46,10,10000,0.94,103,1.28,0.0,800,285.76,285.19,1012,81,282.09,1.04,58,10000,3.4,90,4.5,0.0,803


For each day and station there are two forecast values for each hour (two for 4:00, two for 5:00, ...), corresponding to the forecast for the 48 hours following the call. The dataset with the mean forecast value for each hour is then generated.

In [19]:
columnas_total = [col for col in df_pred_total.columns[1:4]] + [str(col+"_pred") for col in df_pred_total.columns[4:]]
n_valores = len(columnas_total) - 3   # number of forecast variables per group
n_filas = len(df_total_previo)
filas_nuevas = []

for i, fila in df_total_previo.iterrows():
    
    if (i % 5000 == 0) | (i == n_filas - 1):
        print("Processing row {} of {}".format(i, n_filas))
       
    # For each hour, take the mean of the two forecast values
    fila_nueva = []
    for j in range(0, len(columnas_total)):
        if j in [0,1,2]:
            fila_nueva.append(fila.iloc[j]) 
        else:
            fila_nueva.append(np.mean([fila.iloc[j], fila.iloc[j + n_valores]]))
    filas_nuevas.append(fila_nueva)

# Build the dataframe once at the end instead of appending row by row
df_total = pd.DataFrame(filas_nuevas, columns = columnas_total)
    
    
df_total.head()

Processing row 0 of 272912
Processing row 5000 of 272912
Processing row 10000 of 272912
Processing row 15000 of 272912
Processing row 20000 of 272912
Processing row 25000 of 272912
Processing row 30000 of 272912
Processing row 35000 of 272912
Processing row 40000 of 272912
Processing row 45000 of 272912
Processing row 50000 of 272912
Processing row 55000 of 272912
Processing row 60000 of 272912
Processing row 65000 of 272912
Processing row 70000 of 272912
Processing row 75000 of 272912
Processing row 80000 of 272912
Processing row 85000 of 272912
Processing row 90000 of 272912
Processing row 95000 of 272912
Processing row 100000 of 272912
Processing row 105000 of 272912
Processing row 110000 of 272912
Processing row 115000 of 272912
Processing row 120000 of 272912
Processing row 125000 of 272912
Processing row 130000 of 272912
Processing row 135000 of 272912
Processing row 140000 of 272912
Processing row 145000 of 272912
Processing row 150000 of 272912
Pr

Unnamed: 0,hour,fecha prediccion,estacion,temp_pred,feels_like_pred,pressure_pred,humidity_pred,dew_point_pred,uvi_pred,clouds_pred,visibility_pred,wind_speed_pred,wind_deg_pred,wind_gust_pred,pop_pred,we_pred
0,4,2021-04-05,0252D,284.875,284.11,1014.0,77.0,280.5,0.0,31.5,10000.0,2.565,168.5,3.325,0.0,801.5
1,5,2021-04-05,0252D,284.91,284.15,1014.0,77.0,280.55,0.0,34.0,10000.0,2.77,21.5,3.7,0.0,801.5
2,6,2021-04-05,0252D,285.12,284.4,1014.5,77.5,280.785,0.115,33.5,10000.0,2.95,40.0,3.905,0.0,801.5
3,7,2021-04-05,0252D,285.93,285.195,1014.5,74.0,280.95,0.465,32.5,10000.0,2.825,68.0,3.715,0.0,801.5
4,8,2021-04-05,0252D,286.5,285.755,1014.5,71.5,280.835,1.25,34.0,10000.0,2.17,96.5,2.89,0.0,801.5
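
The per-hour mean of the two forecasts can also be computed column-wise in a single step. The sketch below is an illustration on a made-up frame (`df_prev`, shaped like `df_total_previo` but with only two variables); it assumes every `_pred_1` column has a matching `_pred_2` column.

```python
import pandas as pd

# Hypothetical toy frame with the same layout as df_total_previo
df_prev = pd.DataFrame({
    "hour": [4, 5],
    "fecha prediccion": ["2021-04-05", "2021-04-05"],
    "estacion": ["0252D", "0252D"],
    "temp_pred_1": [285.47, 285.54],
    "clouds_pred_1": [63, 65],
    "temp_pred_2": [284.28, 284.28],
    "clouds_pred_2": [0, 3],
})

keys = ["hour", "fecha prediccion", "estacion"]
cols_1 = [c for c in df_prev.columns if c.endswith("_pred_1")]
cols_2 = [c[:-1] + "2" for c in cols_1]   # matching "_pred_2" columns
cols_out = [c[:-2] for c in cols_1]       # "temp_pred", "clouds_pred"

# Element-wise mean of the two forecasts, computed on the underlying arrays
means = (df_prev[cols_1].to_numpy() + df_prev[cols_2].to_numpy()) / 2
df_mean = pd.concat([df_prev[keys], pd.DataFrame(means, columns=cols_out)], axis=1)
print(df_mean)
```

Working on whole columns avoids both the Python-level loop over rows and the repeated construction of one-row data frames.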


In [20]:
nombre = './data/Historicos_modelo_2/pred_por_horas.csv'
df_total.to_csv(nombre, index = False)