# Abstract

One of the main challenges in electricity markets is forecasting renewable energy loads for generators and prices for customers. Machine learning has improved energy forecasting techniques significantly, but these techniques still need to be adopted more widely. Forecasted load and price data, together with actual price data for the four-year period from 2015 to 2018, were compiled by fellow Kaggle user [Nicholas Jhana](https://www.kaggle.com/nicholasjhana/energy-consumption-generation-prices-and-weather) from ENTSO-E (European Network of Transmission System Operators for Electricity) and the Spanish TSO, Red Eléctrica de España. The goal of this project is to improve upon the day-ahead electricity price forecast using multiple linear regression.

The results are shown below, using root mean squared error (RMSE, in €/MWh) as the method of comparison.

* Multiple Linear Regression: 9.51
* TSO Prediction: 13.44

The results show that a simple multiple linear regression, using the Spanish TSO's price predictions together with forecasted load data, substantially improved the predictions of electricity prices in Spain.

# Problem Statement

Does the current TSO in Spain accurately predict electricity prices?  Can we improve the predictions with the provided data?  I will use multiple linear regression to improve upon the existing electricity predictions.

In [None]:
from math import sqrt

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy
import seaborn as sns
from openpyxl import load_workbook
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import ShuffleSplit, cross_val_score, train_test_split
from sklearn.tree import DecisionTreeRegressor

In [None]:
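data_energy = pd.read_csv('../input/energy-consumption-generation-prices-and-weather/energy_dataset.csv')

# Work on a copy so the raw frame stays untouched.
df = data_energy.copy()

# Characters 11-12 of the raw timestamp string hold the hour ('00'..'23').
# Extract it before parsing, because utc=True below shifts the timestamps to UTC.
df['hour'] = df['time'].str.slice(11, 13).astype(str)

df['time'] = pd.to_datetime(df['time'], utc=True)

# One indicator column per hour of the day, named '00' through '23'.
new_dummies = pd.get_dummies(df['hour'])

df = pd.concat([df, new_dummies], axis='columns')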

# Data Overview

The TSO predictions consistently underestimate the electricity prices: 89% of the predictions are underestimates, with a median underestimate of €7.41/MWh. Interestingly, the distributions of the actual and predicted prices have very similar shapes; the main difference is that the TSO estimates are shifted to the left, toward lower prices.
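
As a minimal sketch of how the two summary figures above could be computed, the snippet below uses a small made-up frame with the same column names as the dataset. The values are invented for illustration and are not taken from the real data. It also assumes the median is taken over the underestimated hours only, which the text does not state explicitly.

```python
import pandas as pd

# Hypothetical toy data; the column names match energy_dataset.csv,
# but the values are invented for illustration.
toy = pd.DataFrame({
    'price day ahead': [50.0, 42.0, 61.0, 55.0, 48.0],
    'price actual':    [58.0, 49.5, 60.0, 63.0, 55.0],
})

# Positive values mean the TSO forecast was below the actual price.
underestimate = toy['price actual'] - toy['price day ahead']

# Share of hours in which the forecast was too low.
share_under = (underestimate > 0).mean()

# Median size of the underestimate, counting only those hours (an assumption).
median_under = underestimate[underestimate > 0].median()

print(f'{share_under:.0%} of forecasts were underestimates')
print(f'median underestimate: {median_under:.2f} EUR/MWh')
```

On the real data, `toy` would be replaced by `df`.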

In [None]:
# Custom hex colours, kept for reference; the plots below use
# matplotlib's single-letter colour codes instead.
blue = '#30b4c9'
red = '#f25c5c'
green = '#69b64f'
yellow = '#f6ab53'
purple = '#ad85d2'

# Overlaid price distributions. histplot replaces the deprecated distplot;
# stat='density' with kde=True gives a normalised histogram plus a KDE curve.
plt.figure(figsize=(10,4))
gr = sns.histplot(x=df['price day ahead'], bins=50, stat='density', kde=True, label='TSO Prediction', color='r')
gr = sns.histplot(x=df['price actual'], bins=50, stat='density', kde=True, label='Actual Price', color='b')

gr.set(xlabel="Price of Electricity (€/MWh)", ylabel="Density")
gr.set_title('Spain Electricity Price Comparison\nActual Price vs. TSO Prediction')
plt.legend()
plt.show()

# Positive values mean the TSO prediction was below the actual price.
p_diff = df['price actual'] - df['price day ahead']

plt.figure(figsize=(10,4))
gr = sns.histplot(x=p_diff, bins=50, stat='density', kde=True, label='Actual Price minus TSO Prediction', color='r')
gr.set(xlabel="Price Difference (€/MWh)", ylabel="Density")
gr.set_title('Spain Electricity Price\nDifference Between Actual Price and TSO Prediction')
plt.legend()
plt.show()

# Method

A multiple linear regression is fitted using an 80%-20% split of the data for training and testing. Below are the independent variables (predictors) that will be used to predict the electricity price.

* Forecast Solar Day Ahead
* Forecast Wind Onshore Day Ahead
* Total Load Forecast
* Hour of Day
* Price Day Ahead (TSO Prediction)

The TSO's predicted price is included as a predictor because it may be advantageous to build on it: the profile of the TSO's predicted prices resembles that of the actual prices.
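
The "Hour of Day" predictor enters the regression as one indicator column per hour. A minimal sketch of that encoding follows, using a hypothetical three-row frame; the timestamp strings are assumed to follow the same layout as the dataset's `time` column, and the load values are made up.

```python
import pandas as pd

# Hypothetical rows; only the string layout of 'time' is assumed to match the dataset.
toy = pd.DataFrame({
    'time': ['2015-01-01 00:00:00+01:00',
             '2015-01-01 01:00:00+01:00',
             '2015-01-01 13:00:00+01:00'],
    'total load forecast': [26118.0, 24934.0, 30653.0],
})

# Characters 11-12 of the timestamp string hold the hour, e.g. '13'.
toy['hour'] = toy['time'].str.slice(11, 13)

# One 0/1 indicator column per distinct hour present in the data.
hour_dummies = pd.get_dummies(toy['hour'], dtype=int)

# Numeric forecast columns and hour indicators side by side form the feature matrix.
features = pd.concat([toy[['total load forecast']], hour_dummies], axis='columns')
print(features)
```

With a full year of hourly data, this yields the 24 columns `'00'` through `'23'`.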

In [None]:
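# Predictors: day-ahead forecasts, 24 hour-of-day indicator columns,
# and the TSO's own day-ahead price forecast.
X = ['forecast solar day ahead','forecast wind onshore day ahead', 'total load forecast','00','01','02','03','04','05',
     '06', '07', '08', '09','10', '11', '12', '13', '14', '15', '16' ,'17', '18', '19', '20', '21', '22', '23','price day ahead']
# 'price actual' is the target; 'price day ahead' is kept alongside it so the
# TSO forecast can be scored on the same test rows.
y = ['price day ahead', 'price actual']

# Drop rows with missing values in any predictor or target column.
df = df.dropna(subset=X)
df = df.dropna(subset=y)

# Make sure every modelling column is numeric.
for col in X + y:
    df[col] = pd.to_numeric(df[col])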

# Results

A multiple linear regression model was fitted and achieved an R² score of 0.546 (54.6%) on the test set. This seems quite low, but it is a significant improvement over the TSO predictions. The median difference between this method's predictions and the actual price is close to zero, whereas the median TSO prediction was around €7/MWh below the actual price.
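
For reference, `LinearRegression.score` returns the coefficient of determination R², so the score quoted above is the share of price variance explained on the test set. The following is a minimal sketch of that calculation with invented numbers; it does not reproduce the project's figure.

```python
import numpy as np
from sklearn.metrics import r2_score

# Invented prices and predictions, for illustration only.
y_true = np.array([50.0, 60.0, 55.0, 70.0, 65.0])
y_pred = np.array([52.0, 58.0, 57.0, 66.0, 63.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot

# scikit-learn's r2_score implements the same definition.
r2_sklearn = r2_score(y_true, y_pred)
print(r2_manual, r2_sklearn)
```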

The multiple linear regression resulted in a lower root mean squared error (RMSE, in €/MWh) than the TSO prediction, as shown below.

* Multiple Linear Regression: 9.51
* TSO Prediction: 13.44
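
The notebook computes both a root mean squared error and a mean absolute error. As a hedged illustration of how the two metrics differ, the sketch below evaluates both on invented numbers; the arrays are hypothetical and unrelated to the results listed above.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Invented values, for illustration only.
actual = np.array([50.0, 60.0, 55.0, 70.0])
forecast = np.array([47.0, 56.0, 55.0, 58.0])

# MAE averages the absolute errors: (3 + 4 + 0 + 12) / 4.
mae = mean_absolute_error(actual, forecast)

# RMSE squares the errors before averaging, so the single large
# error of 12 weighs more heavily than it does in the MAE.
rmse = np.sqrt(mean_squared_error(actual, forecast))

print(f'MAE:  {mae:.2f}')
print(f'RMSE: {rmse:.2f}')
```

Because RMSE penalises large misses more strongly, it is never smaller than MAE on the same data.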

The predicted price distributions from this report and from the TSO are compared with the actual prices. This report's predictions are more in line with the actual prices, and the price differences reinforce the finding that a simple multiple linear regression improved on the predictions of the Spanish TSO.

In [None]:
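# 80%-20% train/test split; y keeps both price columns so the TSO forecast
# can later be scored on the same test rows.
X_train, X_test, y_train, y_test = train_test_split(df[X],df[y],test_size=0.2, random_state=0)

lr_model = LinearRegression()

# Fit on the actual price only.
lr_model.fit(X_train,y_train['price actual'])

# score() returns the R² of the model on the held-out test set.
print(lr_model.score(X_test,y_test['price actual']))

# Shuffle-split cross-validator (defined here but not used in the cells below).
cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)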
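
The `cv` splitter defined above is not used in the remaining cells. One way it could be used, sketched here on synthetic data rather than the real `df[X]` and `df['price actual']`, is to pass it to `cross_val_score` to see how stable the R² is across several random 80%-20% splits. The names `X_demo` and `y_demo` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Synthetic stand-in for the predictors and target; not the real data.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=200)

cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

# With no `scoring` argument, cross_val_score uses the estimator's own
# score method, which is R² for LinearRegression.
scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=cv)

print(scores.mean(), scores.std())
```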

In [None]:
val_predictions = lr_model.predict(X_test)

diff1 = y_test['price actual'] - val_predictions
diff2 = y_test['price actual'] - y_test['price day ahead']

# Distribution of prediction errors. histplot replaces the deprecated distplot;
# stat='density' with kde=True gives a normalised histogram plus a KDE curve.
plt.figure(figsize=(10,4))
gr = sns.histplot(x=diff1, bins=50, stat='density', kde=True, label='Actual Price minus New Predictions', color='g')
gr = sns.histplot(x=diff2, bins=50, stat='density', kde=True, label='Actual Price minus TSO Predictions', color='r')
gr.set(xlabel="Price Difference (€/MWh)", ylabel="Density")
gr.set_title('Price Difference Comparison\nNew Prediction vs. TSO Prediction')
gr.set(xlim=(-60,60))
plt.legend()
plt.show()

# Distribution of the prices themselves on the test set.
plt.figure(figsize=(10,4))
gr = sns.histplot(x=y_test['price actual'], bins=50, stat='density', kde=True, label='Actual Price', color='b')
gr = sns.histplot(x=y_test['price day ahead'], bins=50, stat='density', kde=True, label='TSO Prediction', color='r')
gr = sns.histplot(x=val_predictions, bins=50, stat='density', kde=True, label='My Prediction', color='g')
gr.set(xlabel="Price of Electricity (€/MWh)", ylabel="Density")
gr.set_title('Electricity Price Distribution\nActual Price, New Prediction & TSO Prediction')
gr.set(xlim=(0,None))
plt.legend()
plt.show()

In [None]:
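# Root mean squared error (RMSE) of each forecast against the actual price.
rms_my_pred = sqrt(mean_squared_error(y_test['price actual'], val_predictions))
rms_TSO_pred = sqrt(mean_squared_error(
    y_test['price actual'], y_test['price day ahead']))

# Mean absolute error (MAE) of each forecast against the actual price.
mean_abs_error_my_pred = mean_absolute_error(
    y_test['price actual'], val_predictions)
mean_abs_error_TSO_pred = mean_absolute_error(
    y_test['price actual'], y_test['price day ahead'])

print(f'Multiple Linear Regression RMSE: {rms_my_pred:.2f}')
print(f'TSO RMSE: {rms_TSO_pred:.2f}')
print(f'Multiple Linear Regression MAE: {mean_abs_error_my_pred:.2f}')
print(f'TSO MAE: {mean_abs_error_TSO_pred:.2f}')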

# Conclusion

It was shown that a simple multiple linear regression could improve on the Spanish TSO's predicted electricity prices. Combining the TSO's predicted prices with the solar, wind, and total load forecasts and the hour of the day produced more accurate price predictions than the TSO's predictions alone. Further investigation using more data and a more sophisticated machine learning algorithm could produce better results.
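
As a rough sketch of what such a follow-up comparison might look like, the snippet below fits a linear regression and a decision tree on the same train/test split and reports the RMSE of each. The data is synthetic and the tree depth is an arbitrary choice, so the output says nothing about how either model would perform on the Spanish price data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data with a nonlinear term; a stand-in, not the Spanish price data.
rng = np.random.default_rng(0)
X_demo = rng.uniform(-2, 2, size=(500, 2))
y_demo = X_demo[:, 0] ** 2 + X_demo[:, 1] + rng.normal(scale=0.1, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

# max_depth=5 is an arbitrary illustrative setting, not a tuned value.
models = {
    'linear regression': LinearRegression(),
    'decision tree': DecisionTreeRegressor(max_depth=5, random_state=0),
}

# Fit each model on the training rows and score it on the same test rows.
rmse = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    rmse[name] = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))
    print(f'{name}: RMSE = {rmse[name]:.3f}')
```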