# Energy price forecasting competition

This notebook has two purposes:

1. Explain the data to be used in the energy price forecasting competition 
2. Provide a template for importing the data and uploading results to the evaluation server

## Background

Energy prices in two-day US energy markets are made up of two distinct components:

1. A cost of energy, which is constant across all locations on the power grid within an hour but varies hour by hour
2. A location-specific variable cost

The two-day structure of the market is such that participants bid each morning how much power they are willing to buy or sell at specific locations in each hour of the following day

The power grid operator aggregates these individual bids and produces a market clearing price at each hour and location such that total supply and demand can be met

Depending on where an individual market participant's bid prices were relative to the market clearing price, they will either be awarded bids or not

The next day, when the bids are "active", unforeseen circumstances often arise and the actual price of electricity at the specific locations will vary from the price announced by the power grid operator

The price that was declared by the grid operator is referred to as the day ahead price and the price that prevails when bids are active in the next day is called the real time price

Your task is to use fundamental data from a US power grid to formulate 1-day ahead hourly forecasts for the day ahead and real time cost of energy (the component of prices that is constant across the entire power grid)

For each day you will report a projected day-ahead marginal cost of energy (acronym `damce`) and a real time marginal cost of energy (acronym `rtmce`) for each of the 24 hours in the next day

In [2]:
import pandas as pd
import requests

Let's now import the data and describe various properties of it

In [3]:
kw = dict(parse_dates=["date"], index_col=["date", "hour"])
train_X = pd.read_csv("train_X.csv", **kw)
train_y = pd.read_csv("train_y.csv", **kw)
test_X = pd.read_csv("test_X.csv", **kw)
weather = pd.read_csv("weather_data.csv", **kw)

In [4]:
train_X.head()

date,hour,load,zone_1_wind_production,zone_2_wind_production,neighbor_region_3_load,wind,zone_4_wind_production,neighbor_region_1_load,zone_5_wind_production,natural_gas,zone_3_wind_production,rtmce,nuclear,damce,neighbor_region_2_load,coal
2018-01-01,1,36094.722,281.333,370.242,25957.52,3199.791667,2123.067,18294.1,607.925,6169.508333,30.267,120.7296,2034.558333,38.2173,42119.1,21181.208333
2018-01-01,2,36045.346,275.667,303.42,26144.27,3174.883333,2073.511,18085.56,506.675,6163.45,13.083,29.2681,2034.008333,38.0748,41849.51,21196.3
2018-01-01,3,36047.534667,275.942,247.975,26361.89,3149.233333,2280.695,18000.78,359.767,6049.808333,3.183,50.5867,2034.416667,38.8437,41929.49,21257.625
2018-01-01,4,36350.962167,257.808,261.5,26689.45,2917.175,2336.192,18014.06,259.0,6681.791667,1.85,29.1771,2034.425,41.6665,42307.9,21143.833333
2018-01-01,5,37008.104333,203.783,228.567,27360.46,2664.925,2222.349,18275.17,201.717,7428.383333,3.875,128.1511,2034.125,42.9746,43089.32,21061.4


Notice that the data is given at an hourly frequency

Let's get more info on all the columns:

In [5]:
train_X.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 10198 entries, (2018-01-01 00:00:00, 1) to (2019-07-31 00:00:00, 24)
Data columns (total 15 columns):
load                      10181 non-null float64
zone_1_wind_production    10181 non-null float64
zone_2_wind_production    10181 non-null float64
neighbor_region_3_load    10198 non-null float64
wind                      10181 non-null float64
zone_4_wind_production    10181 non-null float64
neighbor_region_1_load    10198 non-null float64
zone_5_wind_production    10181 non-null float64
natural_gas               10181 non-null float64
zone_3_wind_production    10181 non-null float64
rtmce                     10198 non-null float64
nuclear                   10181 non-null float64
damce                     10198 non-null float64
neighbor_region_2_load    10198 non-null float64
coal                      10181 non-null float64
dtypes: float64(15)
memory usage: 1.2 MB


In [6]:
train_X.loc[train_X.isna().any(axis=1),:]

date,hour,load,zone_1_wind_production,zone_2_wind_production,neighbor_region_3_load,wind,zone_4_wind_production,neighbor_region_1_load,zone_5_wind_production,natural_gas,zone_3_wind_production,rtmce,nuclear,damce,neighbor_region_2_load,coal
2018-03-04,15,25642.6936,,,16831.14,13828.2,,15103.29,,1791.98,,2.7641,2014.8,10.4669,33201.31,7216.6
2018-03-06,13,,830.533,1853.279,17002.53,,7112.2,17849.6,1325.5,,135.2,16.5613,,16.2972,40651.41,
2018-03-06,14,29164.3835,,,16947.12,11369.15,,17766.16,,4455.55,,22.6869,2020.275,15.6783,40399.8,9531.375
2018-08-21,21,,69.65,1544.831,25097.82,,1095.55,17330.08,93.375,,418.175,23.902,,24.722,43178.74,
2018-08-21,22,,,,23286.65,,,16017.8,,,,21.1776,,21.1776,39870.33,
2018-10-11,21,27237.3769,,,18618.71,2185.16,,16794.86,,7999.38,,32.4558,1215.21,35.6311,36055.63,12831.41
2018-12-02,4,,1012.098,3623.55,14509.04,,6977.388,14678.81,712.725,,177.383,11.2677,,12.5665,30895.77,
2018-12-02,5,27233.752,,,14931.5,12345.1,,14949.73,,2299.5,,22.5862,2030.4,14.8681,31761.44,9942.1
2019-03-29,3,,578.768,2611.493,14009.57,,4020.899,14416.73,275.6,,1205.9,14.4682,,15.1997,30444.32,
2019-03-29,4,,,,14255.26,,,14704.64,,,,14.3985,,15.4058,31744.37,


The columns are:

- `damce`: day ahead marginal cost of energy (units dollars)
- `rtmce`: real time marginal cost of energy (units dollars)
- `load`: total load (demand for energy) across the power grid (units MWh)
- `zone_1_wind_production`: total production of energy from wind farms in zone 1 (units MWh)
- `zone_2_wind_production`: total production of energy from wind farms in zone 2 (units MWh)
- `zone_3_wind_production`: total production of energy from wind farms in zone 3 (units MWh)
- `zone_4_wind_production`: total production of energy from wind farms in zone 4 (units MWh)
- `zone_5_wind_production`: total production of energy from wind farms in zone 5 (units MWh)
- `neighbor_region_1_load`: total demand for energy in region 1 of a neighboring electricity market (units MWh)
- `neighbor_region_2_load`: total demand for energy in region 2 of a neighboring electricity market (units MWh)
- `neighbor_region_3_load`: total demand for energy in region 3 of a neighboring electricity market (units MWh)
- `wind`: total amount of energy produced from wind farms (units MWh)
- `natural_gas`: total amount of energy produced from natural gas plants (units MWh)
- `nuclear`: total amount of energy produced from nuclear power plants (units MWh)
- `coal`: total production of energy from coal plants (units MWh)


Note that there is some **missing data**.  You WILL have to determine how to handle this
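
As a starting point, here is a minimal sketch of two common gap-filling strategies, run on a small made-up frame rather than the competition files. The values and the `limit=2` setting are illustrative assumptions, not a recommendation for this dataset:

```python
import numpy as np
import pandas as pd

# Toy hourly frame with gaps (values are made up for illustration)
toy = pd.DataFrame(
    {
        "load": [100.0, np.nan, np.nan, 130.0, 140.0, np.nan],
        "wind": [10.0, 12.0, np.nan, 16.0, np.nan, np.nan],
    }
)

# Fill forward from the last observed hour, then backward for anything
# still missing; `limit` caps how many consecutive gaps get filled
filled = toy.ffill(limit=2).bfill(limit=2)

# Alternative: linear interpolation between observed neighbors
interpolated = toy.interpolate(method="linear", limit_direction="both")

# Count of values still missing after each strategy
remaining_ffill = int(filled.isna().sum().sum())
remaining_interp = int(interpolated.isna().sum().sum())
```

Forward filling repeats the last known value, while interpolation draws a straight line between the neighbors; which one is more appropriate depends on how quickly the variable changes from hour to hour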

Let's look at the targets:

In [7]:
train_y.head()

date,hour,target1,target2
2018-01-01,1,26.2499,33.1818
2018-01-01,2,25.9304,18.8723
2018-01-01,3,26.6468,28.8056
2018-01-01,4,25.6137,30.0683
2018-01-01,5,29.8101,32.5069


The targets are stored in a two-column DataFrame

target1 is the `damce` and target2 is the `rtmce`

Note that the target data has been shifted forward by two full days to account for the availability of data each morning before the market participants submit their bids

The two day time shift is necessary because if I were submitting bids on 2019-08-06, I would only have access to data through 2019-08-05, but would be submitting bids that are active in the real time market on 2019-08-07
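
To make the alignment concrete, the sketch below builds a two-day-shifted target from a toy price series. It is only meant to illustrate what "shifted forward by two full days" means; the dates and values are made up, and the exact procedure used to build `train_y` is an assumption:

```python
import numpy as np
import pandas as pd

# Toy hourly prices for five consecutive days (made-up values)
dates = pd.date_range("2019-08-01", periods=5, freq="D")
index = pd.MultiIndex.from_product([dates, range(1, 25)], names=["date", "hour"])
prices = pd.Series(np.arange(len(index), dtype=float), index=index, name="damce")

# The target for (date, hour) is the price observed at (date + 2 days, hour).
# Relabel each price with the date two days earlier, then align to the features.
shifted = prices.copy()
shifted.index = pd.MultiIndex.from_arrays(
    [
        prices.index.get_level_values("date") - pd.Timedelta(days=2),
        prices.index.get_level_values("hour"),
    ],
    names=["date", "hour"],
)
target = shifted.reindex(prices.index)

# The last two days have no target yet, because their prices are in the future
n_missing = int(target.isna().sum())
```

In this toy example the row for 2019-08-01, hour 1 is paired with the price from 2019-08-03, hour 1, and the final two days have no target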

There is also another set of data imported into the `weather` variable

Let's take a look at that

In [8]:
weather.info()

weather.head()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 9260 entries, (2018-01-02 00:00:00, 1) to (2019-08-07 00:00:00, 22)
Data columns (total 18 columns):
temp_KC          9260 non-null float64
temp_KS          9260 non-null float64
temp_MT          9260 non-null float64
temp_ND          9260 non-null float64
temp_OK          9260 non-null float64
temp_SD          9260 non-null float64
wind_east_KC     9260 non-null float64
wind_east_KS     9260 non-null float64
wind_east_MT     9260 non-null float64
wind_east_ND     9260 non-null float64
wind_east_OK     9260 non-null float64
wind_east_SD     9260 non-null float64
wind_north_KC    9260 non-null float64
wind_north_KS    9260 non-null float64
wind_north_MT    9260 non-null float64
wind_north_ND    9260 non-null float64
wind_north_OK    9260 non-null float64
wind_north_SD    9260 non-null float64
dtypes: float64(18)
memory usage: 1.3 MB


date,hour,temp_KC,temp_KS,temp_MT,temp_ND,temp_OK,temp_SD,wind_east_KC,wind_east_KS,wind_east_MT,wind_east_ND,wind_east_OK,wind_east_SD,wind_north_KC,wind_north_KS,wind_north_MT,wind_north_ND,wind_north_OK,wind_north_SD
2018-01-02,1,257.61108,261.01688,248.36969,257.56012,263.6771,248.98686,-0.707576,-1.917161,4.305661,-0.680807,-2.297761,3.532831,-0.952418,-2.318179,4.030571,-0.905243,-2.443977,2.949128
2018-01-02,2,257.82318,260.97153,248.57384,257.80292,263.7624,249.35081,-0.519425,-1.586541,4.395525,-0.530371,-2.285378,3.72169,-0.961979,-2.215663,4.149198,-0.959009,-2.335447,2.981026
2018-01-02,3,258.02634,260.73227,248.59917,258.0081,263.95245,249.85002,-0.13207,-1.140821,4.363935,-0.151521,-2.23167,3.71776,-0.707359,-1.859291,4.142098,-0.717786,-2.306572,2.979758
2018-01-02,4,258.15787,260.77277,248.93546,258.13416,264.00702,250.74298,-0.00124,-0.845597,4.414314,-0.001432,-2.244833,3.61895,-0.369424,-1.41402,4.095286,-0.377892,-2.268881,3.007561
2018-01-02,5,258.13266,260.7961,250.16428,258.1163,263.705,251.8965,0.019917,-0.524373,4.698027,0.028193,-1.979061,3.713986,-0.261934,-1.104718,4.439171,-0.25018,-2.242912,3.134491


This DataFrame has hourly weather forecasts for locations in several of the states in the power grid we are studying

The columns are named `(variable)_(XX)` where `variable` is shorthand for the variable and `XX` is the two letter abbreviation of the state

The variables are:

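- `temp`: temperature, stated as degrees Fahrenheit (the sample values above, roughly 248 to 264, look more like kelvin, so check the units before using them)
- `wind_east`: the eastward component of wind velocity in miles per hour (negative values indicate westward flow)
- `wind_north`: the northward component of wind velocity in miles per hour (negative values indicate southward flow)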
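
Since wind farm output depends on wind speed rather than on either component alone, one feature worth considering is the speed implied by the two components. A minimal sketch on made-up values (the `wind_speed_KS` column name is an assumption, not part of the dataset):

```python
import numpy as np
import pandas as pd

# Toy rows using the `wind_east_XX` / `wind_north_XX` naming (made-up values)
toy_weather = pd.DataFrame(
    {"wind_east_KS": [3.0, -1.0, 0.0], "wind_north_KS": [4.0, 0.0, -2.0]}
)

# Speed is the Euclidean norm of the two components
toy_weather["wind_speed_KS"] = np.hypot(
    toy_weather["wind_east_KS"], toy_weather["wind_north_KS"]
)
```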

We did not include the columns of this DataFrame in `train_X` or `test_X` because they are not available for all hours of the day:

In [9]:
weather.reset_index()["hour"].value_counts().sort_index()

1     579
2     577
3     578
4     579
5     579
6     579
7     579
8     579
9     579
10    578
11    579
12    579
13    389
15    190
16    389
18    190
19    389
21    190
22    389
24    190
Name: hour, dtype: int64

There should be 579 observations for every hour, but there are not, for two reasons:

1. The hourly forecasts turn to 3-hourly forecasts between 1 and 2 PM each day
2. The time shift that occurs due to daylight saving time causes some hours to appear only in winter months and some to appear only in summer months (e.g. hour 19 shows up in the winter whereas hour 18 appears in the summer)

This data is likely helpful and informative for your task, but if you desire to use it you will have to come up with a strategy for handling the missing hours in this dataset relative to what is in `train_X` and `test_X`

Note that because these are weather forecasts, you are permitted to join them with the `train_X` and `test_X` DataFrames (on the `date` and `hour` index levels) and use them without worrying about whether the data would be available at the market participants' bid deadline
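
One possible strategy for the missing hours is sketched below on toy data: left-join the weather onto the full hourly index, then interpolate between the available forecasts. The frames and values are made up, and linear interpolation is just one reasonable choice among several:

```python
import numpy as np
import pandas as pd

# Toy feature frame covering every hour of one day (made-up values)
date = pd.Timestamp("2019-01-01")
full_index = pd.MultiIndex.from_product([[date], range(1, 25)], names=["date", "hour"])
features = pd.DataFrame({"load": np.linspace(100.0, 123.0, 24)}, index=full_index)

# Toy weather frame: hourly until hour 13, then every third hour
weather_hours = list(range(1, 14)) + [16, 19, 22]
weather_index = pd.MultiIndex.from_product([[date], weather_hours], names=["date", "hour"])
toy_weather = pd.DataFrame(
    {"temp_KS": np.array(weather_hours, dtype=float)}, index=weather_index
)

# A left join keeps every feature row; hours without a forecast become NaN
joined = features.join(toy_weather, how="left")

# Interpolate linearly between forecasts; ffill covers any trailing hours
joined["temp_KS"] = joined["temp_KS"].interpolate(method="linear").ffill()
```

With a left join the row count of the feature frame never changes, which keeps the features aligned with `train_y`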

## Competition rules

Your task is to use the data included in `train_X` (and potentially `weather`) to construct a regression model that predicts the day-ahead and real-time marginal cost of energy one day forward

The targets are already computed for you in `train_y`, so you do not need to worry about shifting data yourself

This is inherently a time-series task, but you can apply non-time-series methods without a problem (in fact, time series methods are more advanced/difficult, so we recommend starting with classic regression algorithms)
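
Even with classic regression algorithms, the time ordering matters when you estimate how well a model will do on unseen data. A minimal sketch of a chronological holdout on a toy frame (the 80/20 split and all names are illustrative assumptions):

```python
import numpy as np
import pandas as pd

# Toy frame ordered by time (made-up values)
dates = pd.date_range("2019-01-01", periods=10, freq="D")
index = pd.MultiIndex.from_product([dates, range(1, 25)], names=["date", "hour"])
toy_X = pd.DataFrame({"load": np.arange(len(index), dtype=float)}, index=index)

# Hold out the most recent 20% of days instead of a random sample,
# so validation always happens on data later than the training data
unique_dates = toy_X.index.get_level_values("date").unique()
cutoff = unique_dates[int(len(unique_dates) * 0.8)]
is_train = toy_X.index.get_level_values("date") < cutoff

fit_X, valid_X = toy_X[is_train], toy_X[~is_train]
```

Scoring on `valid_X` mimics the competition setup, where the test period comes after the training period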

Because of the time series nature of the problem, you could potentially look in `train_X` and find the corresponding values for `train_y`

If you figure out the pattern you could apply it to `test_X` and exactly produce some values for `test_y`

Please do not do this -- you won't learn

We will review all code used to make submissions and will disqualify any submissions that "cheat" in this way

You are permitted (encouraged) to work in teams

There is no limit on the number of responses you can submit

In order to submit responses we have created a function `upload_response` below

Please read the documentation for how this function works

As an example of usage, the code below would make a properly formatted submission:

```python
import numpy as np

predictions = np.random.randn(test_X.shape[0], 2)
upload_response("Gryffindor", predictions)
```

The performance of all submitted responses will be evaluated using the MSE loss function
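
For reference, here is how the MSE of a two-column prediction can be computed by hand. The numbers are made up, and whether the server averages the two targets with equal weight is an assumption:

```python
import numpy as np

# Made-up targets and predictions: columns are (damce, rtmce)
y_true = np.array([[30.0, 32.0], [25.0, 20.0], [28.0, 35.0]])
y_pred = np.array([[28.0, 30.0], [25.0, 24.0], [31.0, 35.0]])

# Squared error for every row and both columns
squared_errors = (y_true - y_pred) ** 2

# Overall MSE: mean over all rows and both targets
mse = float(squared_errors.mean())

# Per-target breakdown shows which price is harder to predict
mse_per_target = squared_errors.mean(axis=0)
```

Because errors are squared, a handful of large misses (for example, price spikes in the real time market) can dominate the score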

In [10]:
def upload_response(team_name, predictions):
    """
    Upload a response to evaluation server and return feedback
    
    Parameters
    ==========
    team_name: string
        A string representing your team name. This will appear 
        on the leaderboard and will be used to identify the 
        winning team
    
    predictions: pd.DataFrame or numpy array or list of lists
        A 2-dimensional numpy array, pandas DataFrame, or list
        of lists containing the predictions. The shape of this 
        object MUST have two columns and the same number of rows
        as test_X.
    
    Returns
    =======
    rank: int
        The rank of the current submission, relative to all others
        that have been received
        
    leaderboard: pd.DataFrame
        A pandas DataFrame representing a leaderboard of the top
        50 responses received so far
    
    """
    import numpy as np
    import requests
    import pandas as pd
    url = "http://jupyter.valorumdata.com:5000/submit"
    payload = dict(name=team_name, prediction=np.asarray(predictions).tolist())
    res = requests.post(url, json=payload)
    
    if not res.ok:
        msg = res.content
        raise ValueError("Failed with message: {}".format(msg))
    
    print("Response successfully submitted")
    
    data = res.json()
    rank = data["rank"]
    print("Your current rank is {}".format(rank))
    
    leaderboard = pd.DataFrame(res.json()["leaders"])

    leaderboard["timestamp"] = pd.to_datetime(leaderboard["timestamp"])
    return rank, leaderboard
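
Since the docstring requires exactly two columns and one row per row of `test_X`, it can help to check predictions locally before sending them. The helper below is a hypothetical sketch (`check_predictions` is not part of the competition code):

```python
import numpy as np

def check_predictions(predictions, n_rows):
    """Hypothetical helper: validate shape and values before uploading."""
    arr = np.asarray(predictions, dtype=float)
    if arr.ndim != 2 or arr.shape != (n_rows, 2):
        raise ValueError(
            "Expected shape ({}, 2), got {}".format(n_rows, arr.shape)
        )
    if not np.isfinite(arr).all():
        raise ValueError("Predictions contain NaN or infinite values")
    return arr

# Usage with made-up numbers; in the notebook n_rows would be test_X.shape[0]
ok = check_predictions([[30.0, 31.0], [28.0, 27.5]], n_rows=2)
```

A check like this catches NaN predictions that can appear when the test features still contain missing values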

## Workspace

Ok, that's it! 

Let's get to work

Do your best to build the winning model

Good luck!

In [84]:
from sklearn import preprocessing, pipeline, linear_model, metrics, svm, multioutput, neural_network

In [46]:
for _df in [train_X, test_X]:
    _df["nonwind"] = _df.eval("load - wind")


In [47]:
train_X.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
load,10181.0,30932.543091,5420.193054,148.56,27254.9845,29795.34475,33757.3375,50469.541
zone_1_wind_production,10181.0,498.411783,340.523522,0.0,193.908,459.6,766.592,1402.649
zone_2_wind_production,10181.0,2067.981683,1231.93929,2.417,954.519,2021.844,3228.372,4514.66
neighbor_region_3_load,10198.0,20095.783063,3810.796133,13413.03,17366.7675,19044.435,22350.135,32042.85
wind,10181.0,7762.878979,3932.40007,261.125,4324.966667,7585.7,11239.033333,16283.483333
zone_4_wind_production,10181.0,3836.566542,2385.868088,7.758,1652.733,3591.242,5965.033,8786.525
neighbor_region_1_load,10198.0,17144.060494,2418.205068,11751.14,15368.9525,16998.05,18780.1875,26187.54
zone_5_wind_production,10181.0,820.544735,468.297953,2.763,433.2,786.9,1167.9,1938.0
natural_gas,10181.0,6998.58324,3397.027731,1151.091667,4470.591667,6303.625,8925.475,19439.241667
zone_3_wind_production,10181.0,541.485832,357.056763,0.0,235.942,521.983,776.483,1639.358


In [53]:
model1 = linear_model.LinearRegression(fit_intercept=False)
X1 = train_X[["nonwind"]].ffill(limit=4).bfill(limit=4)
X1_test = test_X[["nonwind"]].ffill(limit=1)
model1.fit(X1, train_y)

LinearRegression(copy_X=True, fit_intercept=False, n_jobs=None,
         normalize=False)

In [50]:
metrics.mean_squared_error(model1.predict(X1), train_y)

389.5948392955416

In [57]:
upload_response("sglyon-baseline", model1.predict(X1_test))

Response successfully submitted
Your current rank is 7


(7,            mse                  name                 timestamp
 0   457.263936  RW-LinRegWithWeather 2019-08-19 18:10:19+00:00
 1   458.510467                  boat 2019-08-14 18:08:03+00:00
 2   458.521125                  boat 2019-08-11 23:15:27+00:00
 3   459.352200   RW-LinearRegression 2019-08-15 23:43:48+00:00
 4   472.157265                  boat 2019-08-10 17:57:53+00:00
 5   484.407248        Darwin_Results 2019-08-19 20:24:10+00:00
 6   485.337050       sglyon-baseline 2019-08-19 22:20:45+00:00
 7   487.715763              SudeepNN 2019-08-16 16:37:38+00:00
 8   491.702301                Sudeep 2019-08-16 15:22:20+00:00
 9   491.702301              SudeepLR 2019-08-16 15:26:22+00:00
 10  491.702301              SudeepLR 2019-08-16 15:30:59+00:00
 11  504.323668                  boat 2019-08-14 05:37:02+00:00
 12  505.052174            Gryffindor 2019-08-06 20:12:06+00:00)

In [69]:
X2 = train_X.ffill(limit=4).bfill(limit=4)
X2_test = test_X.ffill(limit=1)

In [62]:
model2 = pipeline.make_pipeline(
    preprocessing.StandardScaler(),
    linear_model.MultiTaskElasticNetCV(cv=12)
)
model2.fit(X2, train_y)



Pipeline(memory=None,
     steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('multitaskelasticnetcv', MultiTaskElasticNetCV(alphas=None, copy_X=True, cv='warn', eps=0.001,
           fit_intercept=True, l1_ratio=0.5, max_iter=1000, n_alphas=100,
           n_jobs=None, normalize=False, random_state=None,
           selection='cyclic', tol=0.0001, verbose=0))])

In [67]:
pd.DataFrame(model2.steps[-1][-1].coef_, columns=list(X2)).T

Unnamed: 0,0,1
load,0.439082,0.068601
zone_1_wind_production,0.0,0.0
zone_2_wind_production,0.150123,0.510022
neighbor_region_3_load,1.161953,1.868369
wind,0.674689,0.605804
zone_4_wind_production,0.400913,0.242184
neighbor_region_1_load,1.657123,1.781968
zone_5_wind_production,0.673363,0.980547
natural_gas,-0.004672,-0.018462
zone_3_wind_production,-0.241113,-0.734746


In [68]:
metrics.mean_squared_error(model2.predict(X2), train_y)

345.92228108228414

In [70]:
upload_response("sglyon-enet_full", model2.predict(X2_test))

Response successfully submitted
Your current rank is 1


(1,            mse                  name                 timestamp
 0   456.634788       sglyon-baseline 2019-08-19 22:24:58+00:00
 1   457.263936  RW-LinRegWithWeather 2019-08-19 18:10:19+00:00
 2   458.510467                  boat 2019-08-14 18:08:03+00:00
 3   458.521125                  boat 2019-08-11 23:15:27+00:00
 4   459.352200   RW-LinearRegression 2019-08-15 23:43:48+00:00
 5   472.157265                  boat 2019-08-10 17:57:53+00:00
 6   484.407248        Darwin_Results 2019-08-19 20:24:10+00:00
 7   485.337050       sglyon-baseline 2019-08-19 22:20:45+00:00
 8   487.715763              SudeepNN 2019-08-16 16:37:38+00:00
 9   491.702301                Sudeep 2019-08-16 15:22:20+00:00
 10  491.702301              SudeepLR 2019-08-16 15:26:22+00:00
 11  491.702301              SudeepLR 2019-08-16 15:30:59+00:00
 12  504.323668                  boat 2019-08-14 05:37:02+00:00
 13  505.052174            Gryffindor 2019-08-06 20:12:06+00:00)

In [73]:
def transform3(_df):
    out = _df.copy()
    out["rtda_mce"] = out.eval("rtmce - damce")
    return out

X3 = transform3(train_X).ffill(limit=4).bfill(limit=4)
X3_test = transform3(test_X).ffill(limit=1)

In [75]:
from copy import deepcopy

In [78]:
model3 = pipeline.make_pipeline(
    preprocessing.StandardScaler(),
    linear_model.MultiTaskElasticNetCV(cv=10)
)
model3.fit(X3, train_y)
metrics.mean_squared_error(model3.predict(X3), train_y)

341.9179707262756

In [79]:
upload_response("sglyon-enet-rtdamce", model3.predict(X3_test))

Response successfully submitted
Your current rank is 3


(3,            mse                        name                 timestamp
 0   456.634788             sglyon-baseline 2019-08-19 22:24:58+00:00
 1   457.263936        RW-LinRegWithWeather 2019-08-19 18:10:19+00:00
 2   457.313194         sglyon-enet-rtdamce 2019-08-19 22:29:48+00:00
 3   458.510467                        boat 2019-08-14 18:08:03+00:00
 4   458.521125                        boat 2019-08-11 23:15:27+00:00
 5   459.352200         RW-LinearRegression 2019-08-15 23:43:48+00:00
 6   470.341087  RW-RandomForestWithWeather 2019-08-19 22:27:13+00:00
 7   472.157265                        boat 2019-08-10 17:57:53+00:00
 8   484.407248              Darwin_Results 2019-08-19 20:24:10+00:00
 9   485.337050             sglyon-baseline 2019-08-19 22:20:45+00:00
 10  487.715763                    SudeepNN 2019-08-16 16:37:38+00:00
 11  491.702301                      Sudeep 2019-08-16 15:22:20+00:00
 12  491.702301                    SudeepLR 2019-08-16 15:26:22+00:00
 13  491.702301  

In [92]:
train_X.index.get_level_values("hour") >= 7

array([False, False, False, ...,  True,  True,  True])

In [123]:
train_X.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
load,10181.0,30932.543091,5420.193054,148.56,27254.9845,29795.34475,33757.3375,50469.541
zone_1_wind_production,10181.0,498.411783,340.523522,0.0,193.908,459.6,766.592,1402.649
zone_2_wind_production,10181.0,2067.981683,1231.93929,2.417,954.519,2021.844,3228.372,4514.66
neighbor_region_3_load,10198.0,20095.783063,3810.796133,13413.03,17366.7675,19044.435,22350.135,32042.85
wind,10181.0,7762.878979,3932.40007,261.125,4324.966667,7585.7,11239.033333,16283.483333
zone_4_wind_production,10181.0,3836.566542,2385.868088,7.758,1652.733,3591.242,5965.033,8786.525
neighbor_region_1_load,10198.0,17144.060494,2418.205068,11751.14,15368.9525,16998.05,18780.1875,26187.54
zone_5_wind_production,10181.0,820.544735,468.297953,2.763,433.2,786.9,1167.9,1938.0
natural_gas,10181.0,6998.58324,3397.027731,1151.091667,4470.591667,6303.625,8925.475,19439.241667
zone_3_wind_production,10181.0,541.485832,357.056763,0.0,235.942,521.983,776.483,1639.358


In [122]:
def transform4(_df):
    out = _df.copy()
    out["is_weekend"] = out.index.get_level_values("date").dayofweek >= 5
    _hr = out.index.get_level_values("hour")
    out["is_peak"] = (_hr >= 7) & (_hr <= 21)
    return out.astype(float)
X4 = transform4(train_X).ffill(limit=4).bfill(limit=4)
X4_test = transform4(test_X).ffill(limit=1)

model4 = pipeline.make_pipeline(
    preprocessing.StandardScaler(),
    linear_model.MultiTaskElasticNetCV(cv=10)
)
model4.fit(X4, train_y)
metrics.mean_squared_error(model4.predict(X4), train_y)

334.97054212673265

In [98]:
upload_response("sglyon-enet-weekend-peak", model4.predict(X4_test))


Response successfully submitted
Your current rank is 1


(1,            mse                        name                 timestamp
 0   454.119476    sglyon-enet-weekend-peak 2019-08-19 22:47:18+00:00
 1   456.634788            sglyon-enet_full 2019-08-19 22:24:58+00:00
 2   457.263936        RW-LinRegWithWeather 2019-08-19 18:10:19+00:00
 3   457.313194         sglyon-enet-rtdamce 2019-08-19 22:29:48+00:00
 4   458.510467                        boat 2019-08-14 18:08:03+00:00
 5   458.521125                        boat 2019-08-11 23:15:27+00:00
 6   459.352200         RW-LinearRegression 2019-08-15 23:43:48+00:00
 7   470.341087  RW-RandomForestWithWeather 2019-08-19 22:27:13+00:00
 8   472.157265                        boat 2019-08-10 17:57:53+00:00
 9   484.407248              Darwin_Results 2019-08-19 20:24:10+00:00
 10  485.337050             sglyon-baseline 2019-08-19 22:20:45+00:00
 11  487.715763                    SudeepNN 2019-08-16 16:37:38+00:00
 12  491.702301                      Sudeep 2019-08-16 15:22:20+00:00
 13  491.702301  

## Nonlinear ML



In [102]:
from sklearn import tree, ensemble, model_selection

In [115]:
def transform5(_df):
    out = _df.copy()
    out["is_weekend"] = out.index.get_level_values("date").dayofweek >= 5
    _hr = out.index.get_level_values("hour")
    out["is_peak"] = (_hr >= 7) & (_hr <= 21)
    return out
# X5 = transform4(train_X).ffill(limit=4).bfill(limit=4)
# X5_test = transform4(test_X).ffill(limit=1)

model5_base = pipeline.Pipeline([
    ("scale", preprocessing.StandardScaler()),
    ("tree", tree.DecisionTreeRegressor(max_depth=6, min_samples_leaf=0.01))
])

param_grid5 = dict(
    tree__max_depth=[2, 6, 10],
    tree__min_samples_leaf=[1, 0.01, 0.05, 0.1]
)
model5 = model_selection.GridSearchCV(model5_base, param_grid5, cv=10)
model5.fit(X2, train_y)
model5_base.fit(X2, train_y)
print(metrics.mean_squared_error(model5.predict(X2), train_y))
print(metrics.mean_squared_error(model5_base.predict(X2), train_y))

340.7100017541681
327.99965793854267


In [116]:
upload_response("sglyon-dtree", model5_base.predict(X2_test))

Response successfully submitted
Your current rank is 10


(10,            mse                        name                 timestamp
 0   454.119476    sglyon-enet-weekend-peak 2019-08-19 22:47:18+00:00
 1   456.634788            sglyon-enet_full 2019-08-19 22:24:58+00:00
 2   457.263936        RW-LinRegWithWeather 2019-08-19 18:10:19+00:00
 3   457.313194         sglyon-enet-rtdamce 2019-08-19 22:29:48+00:00
 4   458.510467                        boat 2019-08-14 18:08:03+00:00
 5   458.521125                        boat 2019-08-11 23:15:27+00:00
 6   459.352200         RW-LinearRegression 2019-08-15 23:43:48+00:00
 7   470.341087  RW-RandomForestWithWeather 2019-08-19 22:27:13+00:00
 8   472.157265                        boat 2019-08-10 17:57:53+00:00
 9   475.660907                sglyon-dtree 2019-08-19 22:58:26+00:00
 10  484.407248              Darwin_Results 2019-08-19 20:24:10+00:00
 11  485.337050             sglyon-baseline 2019-08-19 22:20:45+00:00
 12  487.715763                    SudeepNN 2019-08-16 16:37:38+00:00
 13  491.702301 

In [120]:
def transform6(_df):
    out = _df.copy()
    out["is_weekend"] = out.index.get_level_values("date").dayofweek >= 5
    _hr = out.index.get_level_values("hour")
    out["is_peak"] = (_hr >= 7) & (_hr <= 21)
    return out.astype(float)
X6 = transform6(train_X).ffill(limit=4).bfill(limit=4)
X6_test = transform6(test_X).ffill(limit=1)

model6 = pipeline.Pipeline([
    ("scale", preprocessing.StandardScaler()),
    ("tree", tree.DecisionTreeRegressor(max_depth=6, min_samples_leaf=0.01))
])

model6.fit(X6, train_y)
print(metrics.mean_squared_error(model6.predict(X6), train_y))

319.2390187790837


In [121]:
upload_response("sglyon-dtree-peak-weekend", model6.predict(X6_test))

Response successfully submitted
Your current rank is 11


(11,            mse                        name                 timestamp
 0   454.119476    sglyon-enet-weekend-peak 2019-08-19 22:47:18+00:00
 1   456.634788            sglyon-enet_full 2019-08-19 22:24:58+00:00
 2   457.263936        RW-LinRegWithWeather 2019-08-19 18:10:19+00:00
 3   457.313194         sglyon-enet-rtdamce 2019-08-19 22:29:48+00:00
 4   458.510467                        boat 2019-08-14 18:08:03+00:00
 5   458.521125                        boat 2019-08-11 23:15:27+00:00
 6   459.352200         RW-LinearRegression 2019-08-15 23:43:48+00:00
 7   470.341087  RW-RandomForestWithWeather 2019-08-19 22:27:13+00:00
 8   472.157265                        boat 2019-08-10 17:57:53+00:00
 9   475.660907                sglyon-dtree 2019-08-19 22:58:26+00:00
 10  476.064450   sglyon-dtree-peak-weekend 2019-08-19 22:59:57+00:00
 11  482.634781              Darwin_Results 2019-08-19 22:59:18+00:00
 12  484.407248              Darwin_Results 2019-08-19 20:24:10+00:00
 13  485.337050 

In [125]:
def transform7(_df):
    out = _df.copy()
    out["is_weekend"] = out.index.get_level_values("date").dayofweek >= 5
    _hr = out.index.get_level_values("hour")
    out["is_peak"] = (_hr >= 7) & (_hr <= 21)
    out["rtda"] = out.eval("rtmce - damce")
    return out.astype(float)

X7 = transform7(train_X).ffill(limit=4).bfill(limit=4)
X7_test = transform7(test_X).ffill(limit=1)

model7 = pipeline.make_pipeline(
    preprocessing.StandardScaler(),
    ensemble.RandomForestRegressor(max_depth=6, min_samples_leaf=0.01, max_features="sqrt", n_estimators=600)
)

model7.fit(X7, train_y)
print(metrics.mean_squared_error(model7.predict(X7), train_y))

323.378039751172


In [126]:
upload_response("class-forest-rtda-peak-weekend", model7.predict(X7_test))

Response successfully submitted
Your current rank is 2


(2,            mse                            name                 timestamp
 0   454.119476        sglyon-enet-weekend-peak 2019-08-19 22:47:18+00:00
 1   454.691850  class-forest-rtda-peak-weekend 2019-08-19 23:16:41+00:00
 2   456.634788                sglyon-enet_full 2019-08-19 22:24:58+00:00
 3   457.263936            RW-LinRegWithWeather 2019-08-19 18:10:19+00:00
 4   457.313194             sglyon-enet-rtdamce 2019-08-19 22:29:48+00:00
 5   458.510467                            boat 2019-08-14 18:08:03+00:00
 6   458.521125                            boat 2019-08-11 23:15:27+00:00
 7   459.352200             RW-LinearRegression 2019-08-15 23:43:48+00:00
 8   470.341087      RW-RandomForestWithWeather 2019-08-19 22:27:13+00:00
 9   472.157265                            boat 2019-08-10 17:57:53+00:00
 10  475.660907                    sglyon-dtree 2019-08-19 22:58:26+00:00
 11  476.064450       sglyon-dtree-peak-weekend 2019-08-19 22:59:57+00:00
 12  482.634781                  Da

In [131]:
train_X.shape

(10198, 16)

In [141]:
all_df = train_X.join(weather, how="left")
all_df_test = test_X.join(weather, how="left")

In [157]:
X8_test = transform4(all_df_test).ffill(limit=24).bfill(limit=24)

In [158]:
X8 = transform4(all_df).ffill(limit=30).bfill(limit=30)

In [159]:
model8 = pipeline.make_pipeline(
    preprocessing.StandardScaler(),
    linear_model.MultiTaskElasticNetCV(cv=10)
)
model8.fit(X8, train_y)
metrics.mean_squared_error(model8.predict(X8), train_y)

330.97458543016467

In [226]:
upload_response("class-enet-weather-peak-weekend-real", model8.predict(X8_test))

Response successfully submitted
Your current rank is 1


(1,            mse                                     name  \
 0   453.641651          class-enet-weather-peak-weekend   
 1   453.641651     class-enet-weather-peak-weekend-real   
 2   454.119476                 sglyon-enet-weekend-peak   
 3   454.128997  class-enet-weather-peak-weekend-rolling   
 4   454.691850           class-forest-rtda-peak-weekend   
 5   456.634788                         sglyon-enet_full   
 6   457.263936                     RW-LinRegWithWeather   
 7   457.313194                      sglyon-enet-rtdamce   
 8   458.510467                                     boat   
 9   458.521125                                     boat   
 10  459.352200                      RW-LinearRegression   
 11  470.341087               RW-RandomForestWithWeather   
 12  472.157265                                     boat   
 13  475.660907                             sglyon-dtree   
 14  476.064450                sglyon-dtree-peak-weekend   
 15  482.634781                      

In [207]:
def transform9(df_train, df_test):
    df = pd.concat([df_train, df_test]).reset_index()
    dt = df["date"] + pd.Timedelta(hours=1)*(df["hour"] - 1)
    df_with_dt = df.assign(dt=dt).set_index("dt").sort_index()
    date_hour = df_with_dt[["date", "hour"]]
    df_with_dt = df_with_dt.drop(["date", "hour"], axis=1)
        
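    # trailing 14-day rolling mean of every column (NaNs are skipped)
    rolling_mean = (
        df_with_dt
        .rolling("14D")
        .mean()
    )
    
    # a column that is missing throughout 2018-01-01 has no earlier data, so its
    # rolling mean there is NaN; borrow the 2018-01-02 values shifted back one day
    jan1_filler = rolling_mean.loc["2018-01-02", :].shift(-1, freq="D")
    rolling_mean_full = rolling_mean.fillna(jan1_filler)
    # fill the gaps in the raw data with the rolling mean
    output = df_with_dt.fillna(rolling_mean_full)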
    
    # add back in date and hour columns
    output["date"] = date_hour["date"]
    output["hour"] = date_hour["hour"]
    
    # split into train and test
    out_original_index = (
        output.reset_index(drop=True)
        .set_index(["date", "hour"])
    )
    
    train_X = out_original_index.loc[df_train.index, :]
    test_X = out_original_index.loc[df_test.index, :]
    return train_X, test_X

df_9, df9_test = transform9(train_X, test_X)
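
The core of `transform9` is filling missing values with a trailing time-based rolling mean. The sketch below shows that one step on a small hypothetical daily series (the column name `load` and the 3-day window are illustrative assumptions; the notebook uses a 14-day window on hourly data):

```python
import numpy as np
import pandas as pd

# Hypothetical daily data with two consecutive missing values
idx = pd.date_range("2018-01-01", periods=6, freq="D")
df = pd.DataFrame({"load": [10.0, 12.0, np.nan, np.nan, 20.0, 22.0]}, index=idx)

# Trailing 3-day window: each row averages the non-missing values
# from the current day and the two days before it
rolling_mean = df.rolling("3D").mean()

# Only the missing cells are replaced; observed values are untouched
filled = df.fillna(rolling_mean)
```

Because the window is trailing, each filled value depends only on observations at or before its own timestamp.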

In [208]:
X9 = transform4(df_9)
X9_test = transform4(df9_test)

model9 = pipeline.make_pipeline(
    preprocessing.StandardScaler(),
    linear_model.MultiTaskElasticNetCV(cv=10)
)
model9.fit(X9, train_y)
# in-sample (training) MSE; the leaderboard reports MSE on the test set
metrics.mean_squared_error(train_y, model9.predict(X9))

334.96270250666413

In [209]:
upload_response("class-enet-weather-peak-weekend-rolling", model9.predict(X9_test))

Response successfully submitted
Your current rank is 3


(3,            mse                                     name  \
 0   453.641651          class-enet-weather-peak-weekend   
 1   454.119476                 sglyon-enet-weekend-peak   
 2   454.128997  class-enet-weather-peak-weekend-rolling   
 3   454.691850           class-forest-rtda-peak-weekend   
 4   456.634788                         sglyon-enet_full   
 5   457.263936                     RW-LinRegWithWeather   
 6   457.313194                      sglyon-enet-rtdamce   
 7   458.510467                                     boat   
 8   458.521125                                     boat   
 9   459.352200                      RW-LinearRegression   
 10  470.341087               RW-RandomForestWithWeather   
 11  472.157265                                     boat   
 12  475.660907                             sglyon-dtree   
 13  476.064450                sglyon-dtree-peak-weekend   
 14  482.634781                           Darwin_Results   
 15  484.407248                      

In [225]:
def transform10(df_train, df_test):
    df = pd.concat([df_train, df_test]).reset_index()
        
    monthly_mean = (
        df.reset_index()
        .groupby(pd.Grouper(key="date", freq="M"))
        .mean()
    )
    df["merge_month"] = df.index.get_level_values("date")  - pd.Timedelta(days=1) + MonthEnd(1)
    output = df.merge(monthly_mean, left_on="merge_month", right_index=True)
    
    
    
    train_X = output.loc[df_train.index, :]
    test_X = output.loc[df_test.index, :]
    return train_X, test_X

moms = transform10(train_X, test_X)

KeyError: 'Level date must be same as name (None)'
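
pandas raises this `KeyError` when `get_level_values("date")` is called on an index that has no level with that name. In `transform10` the frame has already been through `reset_index()`, so `date` is an ordinary column and the index is an unnamed integer range. The sketch below shows one possible way to attach monthly means, assuming that is the intent. The helper name `add_monthly_mean`, the `_monthly_mean` suffix, and the toy data are hypothetical and not part of the notebook:

```python
import numpy as np
import pandas as pd
from pandas.tseries.offsets import MonthEnd

# Hypothetical stand-in for train_X: a (date, hour) MultiIndex and one feature
dates = pd.to_datetime(["2018-01-30", "2018-01-31", "2018-02-01", "2018-02-28"])
index = pd.MultiIndex.from_product([dates, [1, 2]], names=["date", "hour"])
frame = pd.DataFrame({"load": np.arange(8, dtype=float)}, index=index)


def add_monthly_mean(df):
    feature_cols = list(df.columns)
    flat = df.reset_index()
    # "date" is now a column, so read it as a column rather than an index level
    flat["merge_month"] = flat["date"] - pd.Timedelta(days=1) + MonthEnd(1)
    # one row per month end, with suffixed names to avoid column clashes
    monthly_mean = (
        flat.groupby("merge_month")[feature_cols]
        .mean()
        .add_suffix("_monthly_mean")
    )
    out = flat.merge(monthly_mean, left_on="merge_month", right_index=True)
    # restore the original (date, hour) index and row order
    return (
        out.drop(columns="merge_month")
        .set_index(["date", "hour"])
        .loc[df.index]
    )


result = add_monthly_mean(frame)
```

Restoring the `(date, hour)` index before returning also matters for the later `output.loc[df_train.index, :]` split, which looks rows up by that MultiIndex.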

In [220]:
from pandas.tseries.offsets import MonthEnd

train_X.index.get_level_values("date")

DatetimeIndex(['2018-01-01', '2018-01-01', '2018-01-01', '2018-01-01',
               '2018-01-01', '2018-01-01', '2018-01-01', '2018-01-01',
               '2018-01-01', '2018-01-01',
               ...
               '2019-07-31', '2019-07-31', '2019-07-31', '2019-07-31',
               '2019-07-31', '2019-07-31', '2019-07-31', '2019-07-31',
               '2019-07-31', '2019-07-31'],
              dtype='datetime64[ns]', name='date', length=10198, freq=None)

DatetimeIndex(['2018-01-31', '2018-01-31', '2018-01-31', '2018-01-31',
               '2018-01-31', '2018-01-31', '2018-01-31', '2018-01-31',
               '2018-01-31', '2018-01-31',
               ...
               '2019-07-31', '2019-07-31', '2019-07-31', '2019-07-31',
               '2019-07-31', '2019-07-31', '2019-07-31', '2019-07-31',
               '2019-07-31', '2019-07-31'],
              dtype='datetime64[ns]', name='date', length=10198, freq=None)
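
The second index above is consistent with the `date - pd.Timedelta(days=1) + MonthEnd(1)` expression used in `transform10`, although the cell that produced it is not shown. Subtracting a day first matters because `MonthEnd(1)` moves a date that is already a month end to the end of the following month. A small check on hypothetical dates:

```python
import pandas as pd
from pandas.tseries.offsets import MonthEnd

dates = pd.DatetimeIndex(["2018-01-01", "2018-01-31", "2018-02-01"])

# Adding MonthEnd(1) directly pushes 2018-01-31 into February
naive = dates + MonthEnd(1)

# Stepping back one day first keeps every date in its own month
shifted = dates - pd.Timedelta(days=1) + MonthEnd(1)

naive_str = list(naive.strftime("%Y-%m-%d"))
shifted_str = list(shifted.strftime("%Y-%m-%d"))
```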

moms