# Description

## EnerNed Hourly Energy Consumption Data

EnerNed is a regional transmission organization in the United States. It is part of the Eastern Interconnection grid and operates an electric transmission system across several states.

The company has noticed that it is sometimes hard to anticipate future demand for power. It would like a predictive model that helps it prepare for increases in demand and regulate transmission in the system when demand is lower. It would also like to understand how the trends evolve over time.

The hourly power consumption data come from EnerNed's website and are measured in megawatts (MW). Follow the steps below to create a time series model, filling in the code in the cells where indicated. Don't worry about the time: if you don't manage to finish all the exercises, you can always complete them at home and compare your answers with the solutions provided in our repository.

Good luck!


# Dependencies

In [1]:
import pandas as pd
from pycaret.regression import *
from sklearn.model_selection import TimeSeriesSplit
import xgboost as xgb

# Read Raw Data

In [4]:
raw_data = pd.read_csv("PJMW_hourly.csv")

Define constants

In [6]:
date_col = "Datetime"
target = "PJMW_MW"

Transform columns to appropriate data types

In [None]:
# Convert the date column to datetime type
raw_data[date_col] =

# Cast the target column to "float32"
raw_data[target] =

Sort the DataFrame and set the index:

In [None]:
# Sort values by the date column
raw_data = raw_data.

# Set the date column as the index
raw_data = raw_data.
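The conversion, sorting, and indexing steps above follow a common pandas pattern. As a minimal sketch on a toy frame (the column names `timestamp` and `load_mw` and all values are hypothetical, not the notebook's data), one possible way to do it is:

```python
import pandas as pd

# Toy frame with hypothetical column names and values; not the notebook's data
toy = pd.DataFrame({
    "timestamp": ["2024-01-01 02:00:00", "2024-01-01 00:00:00", "2024-01-01 01:00:00"],
    "load_mw": [5300, 5100, 5200],
})

toy["timestamp"] = pd.to_datetime(toy["timestamp"])  # strings -> datetime64
toy["load_mw"] = toy["load_mw"].astype("float32")    # integers -> 32-bit floats

# Put the rows in chronological order, then use the timestamps as the index
toy = toy.sort_values("timestamp").set_index("timestamp")

print(toy.dtypes)
print(toy.index.is_monotonic_increasing)
```

Sorting before setting the index keeps the resulting `DatetimeIndex` in chronological order, which time-based lookups and time series splits generally rely on.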

# Feature Engineering

__ALWAYS THINK OF THE PROBLEM BEFORE CREATING FEATURES!__

In this case, we need to predict hourly energy consumption one year ahead. Therefore, we cannot create features such as lag_1_day or lag_2_months; instead, we must create features such as lag_1_year or lag_14_months. Always think about data availability: at forecast time, a lag or rolling feature can only be computed if it reaches back at least as far as the forecasting horizon, in this case 1 year.
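The availability rule can be sketched as a simple filter. The helper `usable_lags` and the candidate lags below are hypothetical, and the sketch assumes the one-year horizon is approximated as 364 days (52 weeks), the same approximation the lag dictionary later in this notebook uses:

```python
import pandas as pd

def usable_lags(candidate_lags, horizon):
    """Keep only the lags that reach back at least as far as the forecasting horizon.

    Hypothetical helper for illustration; it is not part of the notebook.
    """
    horizon = pd.Timedelta(horizon)
    return {
        name: lag
        for name, lag in candidate_lags.items()
        if pd.Timedelta(lag) >= horizon
    }

# Candidate lags, expressed as timedelta strings (illustrative values)
candidates = {
    "lag_1_day": "1 days",
    "lag_60_days": "60 days",
    "lag_1_year": "364 days",
    "lag_2_years": "728 days",
}

# Assumption: a 1-year horizon treated as 52 weeks
usable = usable_lags(candidates, horizon="364 days")
print(list(usable))
```

Only the lags of 364 days or more survive the filter, because shorter lags would point at timestamps that have not been observed yet for most of the forecast window.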

We could also incorporate external features such as GDP or PCI. However, when forecasting out of sample, that is, when forecasting the future, we would need __forecasts for those features__, because their actual values are not available at the origin time when the forecast is created.

Note that it is still possible to use lag and rolling-window features with a shorter time frame than the forecasting horizon. However, this means training a model with, say, 10 features while making forecasts at the origin with only 5 of them, because the lags and rolling windows are unavailable. This can have several implications:

- __Out-of-Sample Performance__: The model's performance on the test data may look reasonably accurate, because the lag and rolling-window features are available during evaluation. However, this performance does not guarantee that the model will generalize to future periods in which those features cannot be created.
- __Potential Performance Drop-off__: The model's performance may degrade when forecasting into the future without the lag and rolling-window features. The model has learned to rely on those specific features, and without them it might struggle to capture certain patterns or trends.
- __Reduced Feature Space__: In a real-world scenario, if you cannot create lag and rolling-window features, you are limited to the features available at forecast time (e.g., the 5 features in the example above). The model then lacks some of the information it was trained on, which can limit its forecasting accuracy.
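The issue behind these points can be seen on a small scale. The following sketch uses a synthetic daily series and a 5-day horizon (both are assumptions chosen for illustration) and shows that a lag shorter than the horizon is mostly missing for the future rows, while a lag equal to the horizon is fully available:

```python
import numpy as np
import pandas as pd

# Synthetic daily history, for illustration only
history = pd.Series(
    np.arange(10, dtype="float32"),
    index=pd.date_range("2024-01-01", periods=10, freq="D"),
)

# Future timestamps to forecast: the 5 days right after the history ends
future_index = pd.date_range(
    history.index[-1] + pd.Timedelta("1 days"), periods=5, freq="D"
)
future = pd.DataFrame(index=future_index)

# Look up past values by timestamp; timestamps with no observation yield NaN
known = history.to_dict()
future["lag_1_day"] = (future.index - pd.Timedelta("1 days")).map(known)
future["lag_5_days"] = (future.index - pd.Timedelta("5 days")).map(known)

# Count the missing values per lag feature
print(future.isna().sum())
```

Here `lag_1_day` is known only for the first future day, whereas `lag_5_days` is known for all five, which is why a model trained on short lags cannot be fed complete inputs over the whole horizon.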

In [None]:
# Create a copy of the original dataframe
fe_data = 

Create a simple features function

In [None]:
def extract_dt(df):
    """
    Extracts several datetime components from a DatetimeIndex.
    """
    df['hour'] = df.index.hour             # Hour of the day
    df['dayofweek'] = df.index.dayofweek   # Day of the week (Monday = 0)
    df['quarter'] = df.index.quarter       # Quarter
    df['month'] = df.index.month           # Month
    df['year'] = df.index.year             # Year
    df['dayofyear'] = df.index.dayofyear   # Day of the year
    df['dayofmonth'] = df.index.day        # Day of the month
    df['weekofyear'] = df.index.isocalendar().week  # ISO week of the year

    return df

In [None]:
# Run the function to create the dataset (input df = fe_data)

fe_data = 

We could already train and test a model with only these features, because they are derived purely from the timestamps and will therefore be available for any future date. Instead, we are going to add lags as well.
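To illustrate why calendar features are always available, the sketch below builds a frame of hypothetical future timestamps with no consumption values at all and still derives a subset of the same features from the index alone:

```python
import pandas as pd

# Hypothetical future timestamps: no consumption has been observed for them
future_index = pd.date_range("2030-01-01", periods=48, freq=pd.Timedelta(hours=1))
future = pd.DataFrame(index=future_index)

# Calendar features depend only on the timestamps themselves
future["hour"] = future.index.hour
future["dayofweek"] = future.index.dayofweek
future["month"] = future.index.month
future["weekofyear"] = future.index.isocalendar().week

# Total number of missing feature values
print(future.isna().sum().sum())
```

No feature value is missing, because nothing in these columns depends on the target.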

In [8]:
def add_lags(df, target, lags_dict):
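    """
    Adds one lag column per entry in lags_dict. Builds a timestamp -> target
    mapping and, for each row, looks up the target value observed lag_days earlier.
    lags_dict maps a lag label (used in the column name) to a timedelta string.
    """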
    target_map = df[target].to_dict()
    for lag, lag_days in lags_dict.items():
        df[f'lag_{lag}_year'] = (df.index - pd.Timedelta(lag_days)).map(target_map)
    
    return df

In [10]:
lags_dict = {
    1: "364 days",
    2: "728 days",
    3: "1092 days"
}

fe_data = add_lags(fe_data, target, lags_dict)
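As a quick sanity check of the lag logic, the sketch below reruns the same `add_lags` function on a synthetic hourly series (the column name `load_mw` and the values are hypothetical). It also shows one likely reason for using multiples of 364 days rather than 365: 364 days is exactly 52 weeks, so each lagged value falls on the same day of the week as the row it is attached to. That motivation is an assumption; the notebook does not state it.

```python
import numpy as np
import pandas as pd

def add_lags(df, target, lags_dict):
    """Same logic as the notebook's add_lags, repeated so this sketch runs on its own."""
    target_map = df[target].to_dict()
    for lag, lag_days in lags_dict.items():
        df[f"lag_{lag}_year"] = (df.index - pd.Timedelta(lag_days)).map(target_map)
    return df

# Synthetic hourly series covering 400 days; values are just a running counter
idx = pd.date_range("2020-01-01", periods=400 * 24, freq=pd.Timedelta(hours=1))
toy = pd.DataFrame({"load_mw": np.arange(len(idx), dtype="float32")}, index=idx)

toy = add_lags(toy, "load_mw", {1: "364 days"})

# The first 364 days have no value to look back to, so their lag is NaN
n_missing = int(toy["lag_1_year"].isna().sum())
first_valid = toy["lag_1_year"].first_valid_index()

# 364 days = 52 weeks, so the lagged timestamp shares the row's weekday
lagged_index = toy.index - pd.Timedelta("364 days")
same_weekday = bool((toy.index.dayofweek == lagged_index.dayofweek).all())

print(n_missing, first_valid, same_weekday)
```

The rows at the start of the series keep `NaN` in the lag columns; whether to drop them or let the model handle missing values is a separate modelling decision.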