# 02 - Defining Targets (Ys) for Time Series Forecasting

## Introduction

We will look at three ways to define the target used when forecasting a time series.

**The Y (or Ys) we create is the value our time series model will have to predict from the remaining data!**

## Imports

In [1]:
import pandas as pd
import numpy as np
%matplotlib inline

## Loading the Data

In [2]:
full = pd.read_csv("data-processed/data_full.csv")
full['DATE_TIME'] = pd.to_datetime(full['DATE_TIME'], format='%Y-%m-%d %H:%M:%S')
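# sort_values returns a new frame, so assign the result (stable sort keeps ties in file order)
full = full.sort_values("DATE_TIME", kind="stable")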
full.iloc[50:55]

Unnamed: 0,DATE_TIME,PLANT_ID,SOURCE_KEY,DC_POWER,AC_POWER,DAILY_YIELD,TOTAL_YIELD,AMBIENT_TEMPERATURE,MODULE_TEMPERATURE,IRRADIATION
50,2020-05-15 00:30:00,4135001,ZoEaEvLYb1n2sOq,0.0,0.0,0.0,7098099.0,24.935753,22.592306,0.0
51,2020-05-15 00:30:00,4135001,adLQvlD726eNBSB,0.0,0.0,0.0,6271355.0,24.935753,22.592306,0.0
52,2020-05-15 00:30:00,4135001,bvBOhCH3iADSZry,0.0,0.0,0.0,6316803.0,24.935753,22.592306,0.0
53,2020-05-15 00:30:00,4135001,iCRJl6heRkivqQ3,0.0,0.0,0.0,7177992.0,24.935753,22.592306,0.0
54,2020-05-15 00:30:00,4135001,ih0vzX44oOqAx2f,0.0,0.0,0.0,6185184.0,24.935753,22.592306,0.0


## Recursive Forecasting Mode

In this mode we try to predict the next day (time period). For example, in a 7-day week, we use the data from days 1 to 3 to predict day 4. To predict day 5, we then use the data from days 1 to 3 plus the prediction made for day 4.

Because predictions are fed back in as inputs, errors can accumulate from step to step; even so, this method can be one of the best options available.
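
The feedback loop described above can be sketched on a toy series. This is a minimal illustration, not the model used in this notebook: the series, the 3-lag window, the least-squares linear model and the helper names (`make_lagged`, `fit_linear`, `predict_one`, `recursive_forecast`) are all assumptions made for the example.

```python
import numpy as np

def make_lagged(series, n_lags):
    """Each row of X holds n_lags consecutive values; y is the value that follows them."""
    X = np.array([series[i:i + n_lags] for i in range(len(series) - n_lags)])
    y = np.array(series[n_lags:])
    return X, y

def fit_linear(X, y):
    """Least-squares linear model with an intercept (stand-in for any one-step regressor)."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_one(coef, window):
    """One-step-ahead prediction from the last n_lags values."""
    return float(np.dot(coef[:-1], window) + coef[-1])

def recursive_forecast(coef, history, n_lags, steps):
    """Forecast `steps` values, feeding each prediction back in as an input."""
    window = list(history[-n_lags:])
    preds = []
    for _ in range(steps):
        y_hat = predict_one(coef, window)
        preds.append(y_hat)
        window = window[1:] + [y_hat]  # drop the oldest value, append the prediction
    return preds

# Hypothetical toy series: a sine wave with a small upward trend
series = [float(np.sin(0.3 * t) + 0.05 * t) for t in range(40)]
X, y = make_lagged(series, n_lags=3)
coef = fit_linear(X, y)
forecast = recursive_forecast(coef, series, n_lags=3, steps=4)
print(forecast)
```

Only the first forecast is computed from observed values alone; every later one depends on earlier predictions, which is where the error accumulation comes from.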

In [3]:
full_recursive = []

# Group the records by power generator (22 in total)
for source_key, source_key_df in full.groupby('SOURCE_KEY'):
    
    source_key_df = source_key_df.copy()
    
    # DAILY_YIELD is the cumulative sum over the day (updated every 15-minute interval)
    # This column stores the energy generated between the last two readings (15 min)
    source_key_df['15M_YIELD'] = source_key_df['DAILY_YIELD'].diff() # t - (t-1)
    
    # Day rollovers need special handling: since DAILY_YIELD resets to zero,
    # the subtraction across midnight can produce a negative value that is not real
    first_data_points_of_day = (source_key_df['DATE_TIME'].dt.hour == 0) & (source_key_df['DATE_TIME'].dt.minute <= 15)
    source_key_df.loc[first_data_points_of_day, '15M_YIELD'] = 0.
    
    # For some generators the first reading of the day is not at 00:15, so to handle that
    # we rank the records within each day for that generator
    record_number = source_key_df.groupby(source_key_df['DATE_TIME'].dt.date)['DATE_TIME'].rank()
    # and then zero the 15-minute diff of each day's first record
    source_key_df.loc[record_number <= 1, '15M_YIELD'] = 0
    
    # Note that some odd values remain in the diffs, but we will not adjust them any further
    # There is probably some anomaly in the data
    #source_key_df['15M_YIELD'].plot()
    
    # Now we define Y: the 15-minute yield of the next interval (one step ahead)
    source_key_df['Y'] = source_key_df['15M_YIELD'].shift(-1)
    source_key_df = source_key_df.iloc[:-1]
    
    full_recursive.append(source_key_df)
    
    #print(source_key)

    #break


full_recursive_df = pd.concat(full_recursive, axis=0, ignore_index=True)
full_recursive_df.iloc[50:60]

Unnamed: 0,DATE_TIME,PLANT_ID,SOURCE_KEY,DC_POWER,AC_POWER,DAILY_YIELD,TOTAL_YIELD,AMBIENT_TEMPERATURE,MODULE_TEMPERATURE,IRRADIATION,15M_YIELD,Y
50,2020-05-15 12:30:00,4135001,1BY6WEcLGh8j5v7,8517.0,832.9625,2860.0,6262419.0,32.147685,52.353255,0.649248,245.428571,202.0
51,2020-05-15 12:45:00,4135001,1BY6WEcLGh8j5v7,8006.285714,783.557143,3062.0,6262621.0,32.39142,50.63124,0.761243,202.0,161.25
52,2020-05-15 13:00:00,4135001,1BY6WEcLGh8j5v7,6089.375,596.5625,3223.25,6262782.25,32.622796,49.610768,0.416035,161.25,161.607143
53,2020-05-15 13:15:00,4135001,1BY6WEcLGh8j5v7,6359.714286,623.042857,3384.857143,6262943.857,32.497064,47.011161,0.489244,161.607143,158.142857
54,2020-05-15 13:30:00,4135001,1BY6WEcLGh8j5v7,7588.0,742.914286,3543.0,6263102.0,32.524621,46.669863,0.574561,158.142857,183.875
55,2020-05-15 13:45:00,4135001,1BY6WEcLGh8j5v7,7471.375,731.5,3726.875,6263285.875,32.678471,47.516884,0.560986,183.875,209.839286
56,2020-05-15 14:00:00,4135001,1BY6WEcLGh8j5v7,9555.0,934.471429,3936.714286,6263495.714,33.763185,49.803904,0.735083,209.839286,254.285714
57,2020-05-15 14:15:00,4135001,1BY6WEcLGh8j5v7,10642.75,1039.35,4191.0,6263750.0,34.13077,55.030613,0.893661,254.285714,207.142857
58,2020-05-15 14:30:00,4135001,1BY6WEcLGh8j5v7,5429.857143,532.228571,4398.142857,6263957.143,34.081384,54.519137,0.466789,207.142857,130.357143
59,2020-05-15 14:45:00,4135001,1BY6WEcLGh8j5v7,6742.25,659.875,4528.5,6264087.5,33.695722,47.618195,0.542138,130.357143,169.214286
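
The target construction in the cell above can be checked by hand on a tiny frame. The sketch below uses made-up readings for a single hypothetical generator whose first record of the second day arrives at 00:30, so only the rank-based reset catches the rollover.

```python
import pandas as pd

# Made-up readings for one generator, spanning a day rollover
toy = pd.DataFrame({
    "DATE_TIME": pd.to_datetime([
        "2020-05-15 23:30:00", "2020-05-15 23:45:00",
        "2020-05-16 00:30:00", "2020-05-16 00:45:00", "2020-05-16 01:00:00",
    ]),
    "DAILY_YIELD": [880.0, 900.0, 0.0, 10.0, 25.0],
})

# Raw differences: the first is NaN and the rollover gives 0 - 900 = -900
toy["15M_YIELD"] = toy["DAILY_YIELD"].diff()

# Zero the diff of the first record of each day
record_number = toy.groupby(toy["DATE_TIME"].dt.date)["DATE_TIME"].rank()
toy.loc[record_number <= 1, "15M_YIELD"] = 0.0

# One-step-ahead target; the last row has no future value, so it is dropped
toy["Y"] = toy["15M_YIELD"].shift(-1)
toy = toy.iloc[:-1]
print(toy)
```

Each row's `Y` equals the `15M_YIELD` of the row that follows it, which is the relationship visible in the output table above.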


## Direct Forecasting Mode

This method predicts exactly one moment further ahead in time. Here we do not predict one step ahead, but n steps ahead. For example, given days 1, 2 and 3, we predict day 5 directly, without going through a prediction for day 4.

If forecasts are needed for n, n+1, n+2, etc., we must build one model per forecast horizon. So we generate datasets whose targets are shifted by each of those lead times and fit one model on each dataset.
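
As a rough sketch of the "one dataset per horizon" idea, the snippet below builds a separate target column for each lead time on a toy frame. The helper `add_direct_target`, the `value` column and the chosen lead times are hypothetical, introduced only for illustration; a separate model would then be fit on each resulting dataset.

```python
import pandas as pd

def add_direct_target(df, value_col, lead_t):
    """Return a copy of df with Y<lead_t> = value_col observed lead_t steps later.

    The last lead_t rows have no future value to use as target, so they are dropped.
    """
    out = df.copy()
    out[f"Y{lead_t}"] = out[value_col].shift(-lead_t)
    return out.iloc[:-lead_t]

# Hypothetical toy series
toy = pd.DataFrame({"value": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]})

# One dataset per forecast horizon
datasets = {lead_t: add_direct_target(toy, "value", lead_t) for lead_t in (1, 2, 3)}

for lead_t, ds in datasets.items():
    print(lead_t, len(ds), ds[f"Y{lead_t}"].tolist())
```

Longer horizons leave fewer usable rows, since more of the tail has no target.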

In [4]:
full_direct = []
lead_t = 4

for source_key, source_key_df in full.groupby('SOURCE_KEY'):
    
    source_key_df = source_key_df.copy()
    source_key_df['15M_YIELD'] = source_key_df['DAILY_YIELD'].diff() # t - t-1
    
    first_data_points_of_day = (source_key_df['DATE_TIME'].dt.hour == 0) & (source_key_df['DATE_TIME'].dt.minute <= 15)
    source_key_df.loc[first_data_points_of_day, '15M_YIELD'] = 0.
    
    record_number = source_key_df.groupby(source_key_df['DATE_TIME'].dt.date)['DATE_TIME'].rank()
    #print(record_number)
    source_key_df.loc[record_number <= 1, '15M_YIELD'] = 0
    #source_key_df['15M_YIELD'].plot()
    
    # Only this part changes: instead of predicting one step ahead, we predict n steps ahead
    source_key_df['Y{}'.format(lead_t)] = source_key_df['15M_YIELD'].shift(-lead_t)
    source_key_df = source_key_df.iloc[:-lead_t]
    
    full_direct.append(source_key_df)
    
    #break
    
    
full_direct_df = pd.concat(full_direct, axis=0, ignore_index=True)
full_direct_df.iloc[50:60]

Unnamed: 0,DATE_TIME,PLANT_ID,SOURCE_KEY,DC_POWER,AC_POWER,DAILY_YIELD,TOTAL_YIELD,AMBIENT_TEMPERATURE,MODULE_TEMPERATURE,IRRADIATION,15M_YIELD,Y4
50,2020-05-15 12:30:00,4135001,1BY6WEcLGh8j5v7,8517.0,832.9625,2860.0,6262419.0,32.147685,52.353255,0.649248,245.428571,158.142857
51,2020-05-15 12:45:00,4135001,1BY6WEcLGh8j5v7,8006.285714,783.557143,3062.0,6262621.0,32.39142,50.63124,0.761243,202.0,183.875
52,2020-05-15 13:00:00,4135001,1BY6WEcLGh8j5v7,6089.375,596.5625,3223.25,6262782.25,32.622796,49.610768,0.416035,161.25,209.839286
53,2020-05-15 13:15:00,4135001,1BY6WEcLGh8j5v7,6359.714286,623.042857,3384.857143,6262943.857,32.497064,47.011161,0.489244,161.607143,254.285714
54,2020-05-15 13:30:00,4135001,1BY6WEcLGh8j5v7,7588.0,742.914286,3543.0,6263102.0,32.524621,46.669863,0.574561,158.142857,207.142857
55,2020-05-15 13:45:00,4135001,1BY6WEcLGh8j5v7,7471.375,731.5,3726.875,6263285.875,32.678471,47.516884,0.560986,183.875,130.357143
56,2020-05-15 14:00:00,4135001,1BY6WEcLGh8j5v7,9555.0,934.471429,3936.714286,6263495.714,33.763185,49.803904,0.735083,209.839286,169.214286
57,2020-05-15 14:15:00,4135001,1BY6WEcLGh8j5v7,10642.75,1039.35,4191.0,6263750.0,34.13077,55.030613,0.893661,254.285714,142.535714
58,2020-05-15 14:30:00,4135001,1BY6WEcLGh8j5v7,5429.857143,532.228571,4398.142857,6263957.143,34.081384,54.519137,0.466789,207.142857,175.892857
59,2020-05-15 14:45:00,4135001,1BY6WEcLGh8j5v7,6742.25,659.875,4528.5,6264087.5,33.695722,47.618195,0.542138,130.357143,121.607143


## Native Forecasting Mode

Here we predict more than one value at once: a single model outputs the targets for all lead times (Y1 to Y4 in the cell below).
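
To make "more than one value at once" concrete, here is a minimal sketch of a single model with several outputs. It assumes a plain least-squares linear model fit with NumPy on a synthetic noisy sine series; the window length, horizon and variable names are illustrative and not taken from this notebook.

```python
import numpy as np

# Synthetic series (assumption for the example): noisy sine wave
rng = np.random.default_rng(0)
series = np.sin(np.arange(60) * 0.3) + 0.1 * rng.standard_normal(60)

n_lags, horizon = 3, 4
n_rows = len(series) - n_lags - horizon + 1

# Inputs: the last n_lags values; targets: the next `horizon` values (Y1..Y4)
X = np.array([series[i:i + n_lags] for i in range(n_rows)])
Y = np.array([series[i + n_lags:i + n_lags + horizon] for i in range(n_rows)])

# A single least-squares fit with a 2-D target: W has one column per lead time
A = np.column_stack([X, np.ones(n_rows)])
W, *_ = np.linalg.lstsq(A, Y, rcond=None)

# One call yields the forecasts for all four lead times
last_window = np.append(series[-n_lags:], 1.0)
forecast = last_window @ W
print(forecast)
```

Unlike the direct mode, there is only one fit and one prediction call; the price is that the model must support multiple outputs.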

In [5]:
full_native = []
min_lead_t = 1
max_lead_t = 4


for source_key, source_key_df in full.groupby('SOURCE_KEY'):
    
    source_key_df = source_key_df.copy()
    source_key_df['15M_YIELD'] = source_key_df['DAILY_YIELD'].diff() # t - t-1

    first_data_points_of_day = (source_key_df['DATE_TIME'].dt.hour == 0) & (source_key_df['DATE_TIME'].dt.minute <= 15)
    source_key_df.loc[first_data_points_of_day, '15M_YIELD'] = 0.
    
    record_number = source_key_df.groupby(source_key_df['DATE_TIME'].dt.date)['DATE_TIME'].rank()
    #print(record_number)
    source_key_df.loc[record_number <= 1, '15M_YIELD'] = 0
    #source_key_df['15M_YIELD'].plot()
    
    for lead_t in range(min_lead_t, max_lead_t+1):
        source_key_df['Y{}'.format(lead_t)] = source_key_df['15M_YIELD'].shift(-lead_t)
    # Trim once, after the loop: only the last max_lead_t rows lack a complete set of targets
    source_key_df = source_key_df.iloc[:-max_lead_t]
    
    full_native.append(source_key_df)
    
    #break
    
    
full_native_df = pd.concat(full_native, axis=0, ignore_index=True)
full_native_df.iloc[50:60]

Unnamed: 0,DATE_TIME,PLANT_ID,SOURCE_KEY,DC_POWER,AC_POWER,DAILY_YIELD,TOTAL_YIELD,AMBIENT_TEMPERATURE,MODULE_TEMPERATURE,IRRADIATION,15M_YIELD,Y1,Y2,Y3,Y4
50,2020-05-15 12:30:00,4135001,1BY6WEcLGh8j5v7,8517.0,832.9625,2860.0,6262419.0,32.147685,52.353255,0.649248,245.428571,202.0,161.25,161.607143,158.142857
51,2020-05-15 12:45:00,4135001,1BY6WEcLGh8j5v7,8006.285714,783.557143,3062.0,6262621.0,32.39142,50.63124,0.761243,202.0,161.25,161.607143,158.142857,183.875
52,2020-05-15 13:00:00,4135001,1BY6WEcLGh8j5v7,6089.375,596.5625,3223.25,6262782.25,32.622796,49.610768,0.416035,161.25,161.607143,158.142857,183.875,209.839286
53,2020-05-15 13:15:00,4135001,1BY6WEcLGh8j5v7,6359.714286,623.042857,3384.857143,6262943.857,32.497064,47.011161,0.489244,161.607143,158.142857,183.875,209.839286,254.285714
54,2020-05-15 13:30:00,4135001,1BY6WEcLGh8j5v7,7588.0,742.914286,3543.0,6263102.0,32.524621,46.669863,0.574561,158.142857,183.875,209.839286,254.285714,207.142857
55,2020-05-15 13:45:00,4135001,1BY6WEcLGh8j5v7,7471.375,731.5,3726.875,6263285.875,32.678471,47.516884,0.560986,183.875,209.839286,254.285714,207.142857,130.357143
56,2020-05-15 14:00:00,4135001,1BY6WEcLGh8j5v7,9555.0,934.471429,3936.714286,6263495.714,33.763185,49.803904,0.735083,209.839286,254.285714,207.142857,130.357143,169.214286
57,2020-05-15 14:15:00,4135001,1BY6WEcLGh8j5v7,10642.75,1039.35,4191.0,6263750.0,34.13077,55.030613,0.893661,254.285714,207.142857,130.357143,169.214286,142.535714
58,2020-05-15 14:30:00,4135001,1BY6WEcLGh8j5v7,5429.857143,532.228571,4398.142857,6263957.143,34.081384,54.519137,0.466789,207.142857,130.357143,169.214286,142.535714,175.892857
59,2020-05-15 14:45:00,4135001,1BY6WEcLGh8j5v7,6742.25,659.875,4528.5,6264087.5,33.695722,47.618195,0.542138,130.357143,169.214286,142.535714,175.892857,121.607143


## Which One to Use?

In our example we will use the direct method, but each method has its pros and cons.

In [6]:
full_direct_df.to_csv('data-processed/full_direct_df.csv')

# End