# Energy price forecasting competition

This notebook has two purposes:

1. Explain the data to be used in the energy price forecasting competition 
2. Provide a template for importing the data and uploading results to the evaluation server

## Background

Energy prices in two-day US energy markets are made up of two distinct components:

1. A cost of energy, which is constant across all locations on the power grid within an hour but varies from hour to hour
2. A location-specific variable cost

Under the two-day structure of the market, participants submit bids each morning stating how much power they are willing to buy or sell at specific locations in each hour of the following day

The power grid operator aggregates these individual bids and produces a market clearing price at each hour and location such that total supply and demand can be met

Each participant's bids are either awarded or not, depending on how their bid prices compare with the market clearing price
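
As a rough illustration of the clearing step described above, here is a toy uniform-price auction with a fixed demand. This is a simplification and not the grid operator's actual algorithm: the function `clear_market`, the offer tuples, and all numbers are hypothetical, and location-specific costs are ignored.

```python
def clear_market(offers, demand):
    """Toy uniform-price clearing with fixed demand (illustrative only).

    offers: list of (seller, price, quantity) supply offers
    Returns (clearing_price, awards), where awards maps seller -> quantity sold
    """
    awards = {}
    remaining = demand
    clearing_price = None
    # Accept the cheapest offers first until demand is covered
    for seller, price, quantity in sorted(offers, key=lambda offer: offer[1]):
        if remaining <= 0:
            break
        accepted = min(quantity, remaining)
        awards[seller] = accepted
        remaining -= accepted
        # The last accepted (marginal) offer sets the price for everyone
        clearing_price = price
    if remaining > 0:
        raise ValueError("offers cannot cover demand")
    return clearing_price, awards


# Made-up offers: seller C is priced above the clearing price and is not awarded
offers = [("A", 20.0, 100), ("B", 35.0, 50), ("C", 50.0, 80)]
price, awards = clear_market(offers, demand=120)
print(price, awards)
```

In this sketch sellers A and B are awarded because their offers are at or below the clearing price, while seller C is not.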

The next day, when the bids are "active", unforeseen circumstances often arise and the actual price of electricity at the specific locations will vary from the price announced by the power grid operator

The price declared by the grid operator is referred to as the day-ahead price, and the price that prevails when the bids are active the next day is called the real-time price

Your task is to use fundamental data from a US power grid to formulate 1-day ahead hourly forecasts for the day-ahead and real-time cost of energy (the component of prices that is constant across the entire power grid)

For each day you will report a projected day-ahead marginal cost of energy (acronym `damce`) and a real time marginal cost of energy (acronym `rtmce`) for each of the 24 hours in the next day

In [1]:
import pandas as pd
import requests

Let's now import the data and describe various properties of it

In [7]:
kw = dict(parse_dates=["date"], index_col=["date", "hour"])
train_X = pd.read_csv("train_X.csv", **kw)
train_y = pd.read_csv("train_y.csv", **kw)
test_X = pd.read_csv("test_X.csv", **kw)
weather = pd.read_csv("weather_data.csv", **kw)

In [9]:
train_X.head()

date,hour,load,zone_1_wind_production,zone_2_wind_production,neighbor_region_3_load,wind,zone_4_wind_production,neighbor_region_1_load,zone_5_wind_production,natural_gas,zone_3_wind_production,rtmce,nuclear,damce,neighbor_region_2_load,coal
2018-01-01,1,36094.722,281.333,370.242,25957.52,3199.791667,2123.067,18294.1,607.925,6169.508333,30.267,120.7296,2034.558333,38.2173,42119.1,21181.208333
2018-01-01,2,36045.346,275.667,303.42,26144.27,3174.883333,2073.511,18085.56,506.675,6163.45,13.083,29.2681,2034.008333,38.0748,41849.51,21196.3
2018-01-01,3,36047.534667,275.942,247.975,26361.89,3149.233333,2280.695,18000.78,359.767,6049.808333,3.183,50.5867,2034.416667,38.8437,41929.49,21257.625
2018-01-01,4,36350.962167,257.808,261.5,26689.45,2917.175,2336.192,18014.06,259.0,6681.791667,1.85,29.1771,2034.425,41.6665,42307.9,21143.833333
2018-01-01,5,37008.104333,203.783,228.567,27360.46,2664.925,2222.349,18275.17,201.717,7428.383333,3.875,128.1511,2034.125,42.9746,43089.32,21061.4


Notice that the data is given at an hourly frequency

Let's get more info on all the columns:

In [8]:
train_X.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 10198 entries, (2018-01-01 00:00:00, 1) to (2019-07-31 00:00:00, 24)
Data columns (total 15 columns):
load                      10181 non-null float64
zone_1_wind_production    10181 non-null float64
zone_2_wind_production    10181 non-null float64
neighbor_region_3_load    10198 non-null float64
wind                      10181 non-null float64
zone_4_wind_production    10181 non-null float64
neighbor_region_1_load    10198 non-null float64
zone_5_wind_production    10181 non-null float64
natural_gas               10181 non-null float64
zone_3_wind_production    10181 non-null float64
rtmce                     10198 non-null float64
nuclear                   10181 non-null float64
damce                     10198 non-null float64
neighbor_region_2_load    10198 non-null float64
coal                      10181 non-null float64
dtypes: float64(15)
memory usage: 1.2 MB


The columns are:

- `damce`: day ahead marginal cost of energy (units dollars)
- `rtmce`: real time marginal cost of energy (units dollars)
- `load`: total load (demand for energy) across the power grid (units MWh)
- `zone_1_wind_production`: total production of energy from wind farms in zone 1 (units MWh)
- `zone_2_wind_production`: total production of energy from wind farms in zone 2 (units MWh)
- `zone_3_wind_production`: total production of energy from wind farms in zone 3 (units MWh)
- `zone_4_wind_production`: total production of energy from wind farms in zone 4 (units MWh)
- `zone_5_wind_production`: total production of energy from wind farms in zone 5 (units MWh)
- `neighbor_region_1_load`: total demand for energy in region 1 of a neighboring electricity market (units MWh)
- `neighbor_region_2_load`: total demand for energy in region 2 of a neighboring electricity market (units MWh)
- `neighbor_region_3_load`: total demand for energy in region 3 of a neighboring electricity market (units MWh)
- `wind`: total amount of energy produced from wind farms (units MWh)
- `natural_gas`: total amount of energy produced from natural gas plants (units MWh)
- `nuclear`: total amount of energy produced from nuclear power plants (units MWh)
- `coal`: total production of energy from coal plants (units MWh)


Note that there is some **missing data**.  You WILL have to determine how to handle this
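
A minimal sketch of two common ways to handle the gaps, shown on a made-up DataFrame rather than the real `train_X`. Which approach (if either) suits this dataset is for you to decide; the `toy` frame and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a few rows of train_X (values are made up)
toy = pd.DataFrame(
    {
        "load": [100.0, np.nan, 104.0, 106.0],
        "wind": [10.0, 12.0, np.nan, np.nan],
    }
)

# Count the missing values in each column
missing_counts = toy.isna().sum()

# Option 1: drop every row that has at least one missing value
dropped = toy.dropna()

# Option 2: carry the last observed value forward in time
filled = toy.ffill()

print(missing_counts)
print(filled)
```

If you drop rows from `train_X`, remember to drop the matching rows from `train_y` so the features and targets stay aligned.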

Let's look at the targets:

In [11]:
train_y.head()

date,hour,target1,target2
2018-01-01,1,26.2499,33.1818
2018-01-01,2,25.9304,18.8723
2018-01-01,3,26.6468,28.8056
2018-01-01,4,25.6137,30.0683
2018-01-01,5,29.8101,32.5069


The targets are stored in a two-column DataFrame

target1 is the `damce` and target2 is the `rtmce`

Note that the target data has been shifted forward by two full days to account for the availability of data each morning before the market participants submit their bids

The two day time shift is necessary because if I were submitting bids on 2019-08-06, I would only have access to data through 2019-08-05, but would be submitting bids that are active in the real time market on 2019-08-07
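
To make the time shift concrete, here is a small sketch with one made-up price per day. It assumes the targets were built by aligning each row with the price observed two days later. The real data has 24 hourly rows per day, so this is an illustration of the idea and not the exact code used to build `train_y`.

```python
import pandas as pd

# One made-up price per day, to keep the example small
dates = pd.date_range("2019-08-01", periods=5, freq="D")
prices = pd.DataFrame({"damce": [30.0, 31.0, 32.0, 33.0, 34.0]}, index=dates)

# Align each row with the price observed two days later
target = prices["damce"].shift(-2).rename("target1")
aligned = prices.join(target)

# The last two rows have no target because their future prices are unknown
print(aligned)
```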

There is also another set of data imported into the `weather` variable

Let's take a look at that

In [12]:
weather.info()

weather.head()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 9260 entries, (2018-01-02 00:00:00, 1) to (2019-08-07 00:00:00, 22)
Data columns (total 18 columns):
temp_KC          9260 non-null float64
temp_KS          9260 non-null float64
temp_MT          9260 non-null float64
temp_ND          9260 non-null float64
temp_OK          9260 non-null float64
temp_SD          9260 non-null float64
wind_east_KC     9260 non-null float64
wind_east_KS     9260 non-null float64
wind_east_MT     9260 non-null float64
wind_east_ND     9260 non-null float64
wind_east_OK     9260 non-null float64
wind_east_SD     9260 non-null float64
wind_north_KC    9260 non-null float64
wind_north_KS    9260 non-null float64
wind_north_MT    9260 non-null float64
wind_north_ND    9260 non-null float64
wind_north_OK    9260 non-null float64
wind_north_SD    9260 non-null float64
dtypes: float64(18)
memory usage: 1.3 MB


date,hour,temp_KC,temp_KS,temp_MT,temp_ND,temp_OK,temp_SD,wind_east_KC,wind_east_KS,wind_east_MT,wind_east_ND,wind_east_OK,wind_east_SD,wind_north_KC,wind_north_KS,wind_north_MT,wind_north_ND,wind_north_OK,wind_north_SD
2018-01-02,1,257.61108,261.01688,248.36969,257.56012,263.6771,248.98686,-0.707576,-1.917161,4.305661,-0.680807,-2.297761,3.532831,-0.952418,-2.318179,4.030571,-0.905243,-2.443977,2.949128
2018-01-02,2,257.82318,260.97153,248.57384,257.80292,263.7624,249.35081,-0.519425,-1.586541,4.395525,-0.530371,-2.285378,3.72169,-0.961979,-2.215663,4.149198,-0.959009,-2.335447,2.981026
2018-01-02,3,258.02634,260.73227,248.59917,258.0081,263.95245,249.85002,-0.13207,-1.140821,4.363935,-0.151521,-2.23167,3.71776,-0.707359,-1.859291,4.142098,-0.717786,-2.306572,2.979758
2018-01-02,4,258.15787,260.77277,248.93546,258.13416,264.00702,250.74298,-0.00124,-0.845597,4.414314,-0.001432,-2.244833,3.61895,-0.369424,-1.41402,4.095286,-0.377892,-2.268881,3.007561
2018-01-02,5,258.13266,260.7961,250.16428,258.1163,263.705,251.8965,0.019917,-0.524373,4.698027,0.028193,-1.979061,3.713986,-0.261934,-1.104718,4.439171,-0.25018,-2.242912,3.134491


This DataFrame has hourly weather forecasts for locations in several of the states in the power grid we are studying

The columns are named `(variable)_(XX)` where `variable` is shorthand for the variable and `XX` is the two letter abbreviation of the state

The variables are:

- `temp`: temperature, apparently in Kelvin (the sample values of roughly 248 to 264 in early January are too high to be degrees Fahrenheit)
- `wind_east`: the magnitude of wind flow in the east direction in miles per hour
- `wind_north`: the magnitude of wind flow in the north direction in miles per hour
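
Since the wind is given as east and north components, one possible feature is the overall wind speed, which is the length of the (east, north) vector. The sketch below uses made-up values, and the column name `wind_speed_KS` is a hypothetical choice.

```python
import numpy as np
import pandas as pd

# Made-up wind components for one state
toy_weather = pd.DataFrame(
    {
        "wind_east_KS": [3.0, 0.0, -1.0],
        "wind_north_KS": [4.0, -2.0, 0.0],
    }
)

# Wind speed is the Euclidean norm of the two components
toy_weather["wind_speed_KS"] = np.hypot(
    toy_weather["wind_east_KS"], toy_weather["wind_north_KS"]
)

print(toy_weather)
```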

We did not include the columns of this DataFrame in `train_X` or `test_X` because the forecasts are not available for all hours of the day:

In [14]:
weather.reset_index()["hour"].value_counts().sort_index()

1     579
2     577
3     578
4     579
5     579
6     579
7     579
8     579
9     579
10    578
11    579
12    579
13    389
15    190
16    389
18    190
19    389
21    190
22    389
24    190
Name: hour, dtype: int64

Every hour should appear 579 times (once per day), but this is not the case for two reasons:

1. The hourly forecasts switch to 3-hourly forecasts between 1 and 2 PM each day
2. The time shift that occurs due to daylight saving time causes some hours to appear only in winter months and some to appear only in summer months (e.g. hour 19 shows up in the winter whereas hour 18 appears in the summer)

This data is likely helpful and informative for your task, but if you want to use it you will have to come up with a strategy for handling the hours that are missing from this dataset relative to what is in `train_X` and `test_X`
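
One possible strategy, sketched on a made-up single day, is to reindex the forecasts onto the full set of hours and interpolate linearly between the forecasts that exist. This is only one option under simple assumptions; the `toy` frame and its values are hypothetical.

```python
import pandas as pd

# Toy forecast for one day: hourly through hour 13, then every third hour
day = pd.to_datetime(["2018-01-02"])
hours = [11, 12, 13, 16, 19, 22]
index = pd.MultiIndex.from_product([day, hours], names=["date", "hour"])
toy = pd.DataFrame(
    {"temp_KS": [260.0, 261.0, 262.0, 265.0, 262.0, 259.0]}, index=index
)

# Build the full (date, hour) grid for hours 11 through 22
full_index = pd.MultiIndex.from_product(
    [day, range(11, 23)], names=["date", "hour"]
)

# Reindexing inserts NaN rows for the missing hours; linear interpolation
# then fills each gap from the neighboring forecasts
filled = toy.reindex(full_index).interpolate(method="linear")

print(filled)
```

Linear interpolation treats the rows as equally spaced, which holds here because the reindexed grid has one row per hour.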

Note that because these are weather forecasts, you are permitted to join them with the `train_X` and `test_X` DataFrames (on the `date` and `hour` index levels) and use them without worrying about whether the data would be available by the market participants' bid deadline
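
A minimal sketch of such a join on made-up data. Both frames share a `(date, hour)` index, as the real ones do after the import step above; the frames `toy_X` and `toy_weather` are hypothetical stand-ins.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [pd.to_datetime(["2018-01-02"]), [1, 2, 3]], names=["date", "hour"]
)
toy_X = pd.DataFrame({"load": [100.0, 101.0, 102.0]}, index=index)

# The weather frame is missing hour 3
toy_weather = pd.DataFrame({"temp_KS": [260.0, 261.0]}, index=index[:2])

# A left join keeps every row of toy_X; hours without a forecast become NaN
joined = toy_X.join(toy_weather, how="left")

print(joined)
```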

## Competition rules

Your task is to use data included in `train_X` (and potentially `weather`) to construct a regression model that predicts the day-ahead and real-time marginal cost of energy one day forward

The targets are already computed for you in `train_y`, so you do not need to worry about shifting data yourself

This is inherently a time-series task, but you can apply non-time-series methods without a problem (in fact, time-series methods are more advanced and difficult, so we recommend starting with classic regression algorithms)
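
As a starting point, here is a sketch of an ordinary least squares baseline that predicts both targets at once. It runs on synthetic data and not on `train_X`, and every array and number in it is made up. The validation rows are the last rows in time, which respects the time ordering of the problem.

```python
import numpy as np

# Synthetic features and two targets (stand-ins for train_X and train_y)
rng = np.random.default_rng(0)
n_rows = 200
X = rng.normal(size=(n_rows, 3))
true_coef = np.array([[1.0, -2.0], [0.5, 0.0], [0.0, 3.0]])
y = 30.0 + X @ true_coef + 0.01 * rng.normal(size=(n_rows, 2))

# Add an intercept column, then fit on the first 150 rows only
split = 150
X_design = np.column_stack([np.ones(n_rows), X])
coef, *_ = np.linalg.lstsq(X_design[:split], y[:split], rcond=None)

# Evaluate on the held-out final 50 rows
valid_pred = X_design[split:] @ coef
valid_mse = float(np.mean((y[split:] - valid_pred) ** 2))

print(valid_pred.shape, valid_mse)
```

With the real data you would first need to handle the missing values, since least squares cannot be fit on rows that contain NaN.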

Because of the time series nature of the problem, you could potentially look in `train_X` and find the corresponding values for `train_y`

If you figure out the pattern you could apply it to `test_X` and exactly produce some values for `test_y`

Please do not do this -- you won't learn

We will review all code used to make submissions and will disqualify any submissions that "cheat" in this way

You are permitted (encouraged) to work in teams

There is no limit on the number of responses you can submit

In order to submit responses we have created a function `upload_response` below

Please read the documentation for how this function works

As an example of usage, the code below would make a properly formatted submission:

```python
import numpy as np

# Random predictions with one row per row of test_X and two columns
predictions = np.random.randn(test_X.shape[0], 2)
upload_response("Gryffindor", predictions)
```

The performance of all submitted responses will be evaluated using the mean squared error (MSE) loss function
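
For reference, the MSE is the average of the squared differences between the true and predicted values. The sketch below averages over all rows and both target columns; whether the evaluation server combines the two targets in exactly this way is an assumption, and the helper `mse` is a hypothetical name.

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean squared error over every entry of two equally shaped arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))


# Made-up values: the errors are 1, 0, 2 and -1, so the squares are 1, 0, 4, 1
y_true = [[30.0, 28.0], [32.0, 35.0]]
y_pred = [[31.0, 28.0], [30.0, 36.0]]
score = mse(y_true, y_pred)

print(score)
```

You can use a function like this to score your model on a held-out part of the training data before submitting.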

In [27]:
def upload_response(team_name, predictions):
    """
    Upload a response to evaluation server and return feedback
    
    Parameters
    ==========
    team_name: string
        A string representing your team name. This will appear 
        on the leaderboard and will be used to identify the 
        winning team
    
    predictions: pd.DataFrame or numpy array or list of lists
        A 2-dimensional numpy array, pandas DataFrame, or list
        of lists containing the predictions. This object MUST
        have two columns and the same number of rows as test_X.
    
    Returns
    =======
    rank: int
        The rank of the current submission, relative to all others
        that have been received
        
    leaderboard: pd.DataFrame
        A pandas DataFrame representing a leaderboard of the top
        50 responses received so far
    
    """
    import numpy as np
    import requests
    import pandas as pd
    url = "http://jupyter.valorumdata.com:5000/submit"
    payload = dict(name=team_name, prediction=np.asarray(predictions).tolist())
    res = requests.post(url, json=payload)
    
    if not res.ok:
        # res.text is the decoded response body
        raise ValueError("Failed with message: {}".format(res.text))
    
    print("Response successfully submitted")
    
    # Parse the JSON body once and reuse it below
    data = res.json()
    rank = data["rank"]
    print("Your current rank is {}".format(rank))
    
    leaderboard = pd.DataFrame(data["leaders"])
    leaderboard["timestamp"] = pd.to_datetime(leaderboard["timestamp"])
    return rank, leaderboard

## Workspace

Ok, that's it! 

Let's get to work

Do your best to build the winning model

Good luck!