# Comparing Nord Pool and SVK Regional Wind Power Data

Sweden is divided into four price areas (SE1 to SE4). On the day-ahead market, participants bid their expected production or consumption into these areas. The areas are shown on the map below.

![Price areas in Sweden](Areas_Sweden.png)

Public regional wind power data is available for all of these areas, with a lag of some days or weeks from real time. Two providers publish this data:

* [Nord Pool](http://www.nordpoolspot.com/historical-market-data/)
* [SVK](https://mimer.svk.se/ProductionConsumption/ProductionIndex)

In this notebook we compare the two datasets with each other. We expect the data to be identical.

## Import Python libraries

In [1]:
# Import libraries
import numpy as np
import pandas as pd
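# Note: from plotly 4 onwards this module lives in the separate chart-studio package (chart_studio.plotly)
import plotly.plotly as py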
import plotly.graph_objs as go
import plotly.figure_factory as ff

## Import data

### Nord Pool data

In [2]:
# Import Nord Pool wind power data
df_N_2015 = pd.read_excel('data/wp_N_2015.xlsx', header=2, index_col=None)
df_N_2016 = pd.read_excel('data/wp_N_2016.xlsx', header=2, index_col=None)
df_N_2017 = pd.read_excel('data/wp_N_2017.xlsx', header=2, index_col=None)
df_N = pd.concat([df_N_2015, df_N_2016, df_N_2017], axis=0, ignore_index=True)

In [3]:
# Format dataframe
df_N.index = pd.to_datetime(df_N['Date']+' '+df_N['Hours'].apply(lambda x: x.split()[0]), format='%d-%m-%Y %H')
df_N.index.name = 'Time'
df_N = df_N.drop(['Date', 'Hours'], axis=1)
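
To make the timestamp construction concrete, here is a minimal sketch on a two-row toy frame. The layout of the `Hours` values (`'08 - 09'`) is an assumption inferred only from the `split()[0]` call above; the real Nord Pool file may differ.

```python
import pandas as pd

# Toy rows; the 'Hours' layout is an assumption, not taken from the real file
sample = pd.DataFrame({
    'Date': ['03-02-2015', '03-02-2015'],
    'Hours': ['08 - 09', '09 - 10'],
    'SE1': [141.0, 131.0],
})

# Keep only the starting hour of each interval, then parse day-month-year plus hour
start_hour = sample['Hours'].apply(lambda x: x.split()[0])
sample.index = pd.to_datetime(sample['Date'] + ' ' + start_hour, format='%d-%m-%Y %H')
sample.index.name = 'Time'
sample = sample.drop(['Date', 'Hours'], axis=1)
print(sample)
```

Each row is thus stamped with the start of its delivery hour.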

In [4]:
# Remove duplicate indices
print(df_N.index.is_unique)
df_N = df_N[~df_N.index.duplicated(keep='first')]  # keep the first row of each duplicated timestamp
print(df_N.index.is_unique)

False
True
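
Before dropping rows it can be worth checking whether the duplicated timestamps carry the same values. The following is a small sketch on made-up data (the values and the duplicated hour are hypothetical, not taken from the Nord Pool files):

```python
import pandas as pd

idx = pd.to_datetime(['2015-02-03 08:00', '2015-02-03 09:00',
                      '2015-02-03 09:00', '2015-02-03 10:00'])
toy = pd.DataFrame({'SE1': [10.0, 12.0, 13.0, 11.0]}, index=idx)

# keep=False marks every member of a duplicated group, so all candidates are visible
print(toy[toy.index.duplicated(keep=False)])

# keep='first' marks only the later repeats; negating the mask keeps the first row
deduped = toy[~toy.index.duplicated(keep='first')]
print(deduped.index.is_unique)
```

If the repeated rows disagree, as in this toy case, keeping the first one is a choice rather than a neutral cleanup.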


In [5]:
df_N[800:805]

Time,SE1,SE2,SE3,SE4,SE
2015-02-03 08:00:00,141.0,429.0,623.0,144.0,1338.0
2015-02-03 09:00:00,131.0,390.0,608.0,138.0,1266.0
2015-02-03 10:00:00,116.0,375.0,547.0,181.0,1153.0
2015-02-03 11:00:00,114.0,335.0,500.0,184.0,1057.0
2015-02-03 12:00:00,99.0,317.0,463.0,116.0,994.0


### SVK data

In [6]:
# Import SVK wind power data
df_S_SE1 = pd.read_csv('data/wp_S_SE1.csv', sep=';', header=0, names=['Time', 'SE1'], index_col=0, usecols=[0,2], decimal=',')
df_S_SE2 = pd.read_csv('data/wp_S_SE2.csv', sep=';', header=0, names=['Time', 'SE2'], index_col=0, usecols=[0,2], decimal=',')
df_S_SE3 = pd.read_csv('data/wp_S_SE3.csv', sep=';', header=0, names=['Time', 'SE3'], index_col=0, usecols=[0,2], decimal=',')
df_S_SE4 = pd.read_csv('data/wp_S_SE4.csv', sep=';', header=0, names=['Time', 'SE4'], index_col=0, usecols=[0,2], decimal=',')
df_S_SE = pd.read_csv('data/wp_S_SE.csv', sep=';', header=0, names=['Time', 'SE'], index_col=0, usecols=[0, 2], decimal=',')

In [7]:
# Concatenate dataframes
df_S = pd.concat([df_S_SE1, df_S_SE2, df_S_SE3, df_S_SE4, df_S_SE], axis=1)
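
With `axis=1`, `pd.concat` aligns the frames on their index and by default keeps the union of all labels. This is one way `NaN` entries can appear in a column whose series does not cover every hour. A toy sketch with made-up values:

```python
import pandas as pd

a = pd.DataFrame({'SE1': [35.6, 32.9, 49.6]},
                 index=['2014-12-01 00:00', '2014-12-01 01:00', '2014-12-01 02:00'])
b = pd.DataFrame({'SE4': [120.0]}, index=['2014-12-01 02:00'])

# Default join='outer': the result keeps every index label found in either frame
combined = pd.concat([a, b], axis=1)
print(combined)
```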

In [8]:
# Convert to MWh and remove last row
df_S = df_S/10**3
df_S = df_S[:-1] 
df_S.head()

Time,SE1,SE2,SE3,SE4,SE
2014-12-01 00:00,35.631305,411.486402,335.119003,,
2014-12-01 01:00,32.907812,392.75638,348.271535,,
2014-12-01 02:00,49.587869,417.99198,323.013154,,
2014-12-01 03:00,72.84576,440.684582,315.589557,,
2014-12-01 04:00,92.836764,453.357844,323.035225,,


## Time series plot

Comparing the Nord Pool and SVK wind power time series suggests that they are indeed the same data. For the individual areas, however, SVK has about one more month of data than Nord Pool, whereas for the aggregated time series of all Swedish areas Nord Pool seems to have more stored data.
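
The coverage of each source can also be read off programmatically rather than from a plot. The helper below is a hypothetical sketch (the name `coverage` is not part of this notebook) that reports the first and last non-missing timestamp per column, shown here on toy data:

```python
import numpy as np
import pandas as pd

def coverage(df):
    """Return the first and last timestamp with a non-missing value per column."""
    return pd.DataFrame({
        'first': {col: df[col].first_valid_index() for col in df.columns},
        'last': {col: df[col].last_valid_index() for col in df.columns},
    })

idx = pd.to_datetime(['2015-01-01 00:00', '2015-01-01 01:00',
                      '2015-01-01 02:00', '2015-01-01 03:00'])
toy = pd.DataFrame({'SE1': [1.0, 2.0, 3.0, 4.0],
                    'SE': [np.nan, np.nan, 5.0, 6.0]}, index=idx)
print(coverage(toy))
```

Applied to both dataframes, the two resulting tables could be compared side by side.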

In [9]:
def ts_plot(df_N, df_S, area):
    """Build a time series figure comparing Nord Pool and SVK data for one area."""
    trace_N = go.Scatter(x=df_N.index, y=df_N[area], name='Nord Pool')
    trace_S = go.Scatter(x=df_S.index, y=df_S[area], name='SVK')
    data = [trace_N, trace_S]
    
    layout = dict(
        title='Wind Power Production in ' + area,
        xaxis=dict(
            rangeselector=dict(
                buttons=list([
                    dict(count=1,
                         label='1m',
                         step='month',
                         stepmode='backward'),
                    dict(count=6,
                         label='6m',
                         step='month',
                         stepmode='backward'),
                    dict(count=1,
                         label='1y',
                         step='year',
                         stepmode='backward'),
                    dict(step='all')
                ])
            ),
            rangeslider=dict(),
            type='date',
            range = ['2017-06-25','2017-08-21']
        ),
        yaxis=dict(title='Production [MW]'),
    )

    fig = dict(data=data, layout=layout)
    
    return fig

In [10]:
fig = ts_plot(df_N, df_S, area='SE1')
py.iplot(fig)

The draw time for this plot will be slow for clients without much RAM.




In [11]:
fig = ts_plot(df_N, df_S, area='SE2')
py.iplot(fig)

The draw time for this plot will be slow for clients without much RAM.




In [12]:
fig = ts_plot(df_N, df_S, area='SE3')
py.iplot(fig)

The draw time for this plot will be slow for clients without much RAM.




In [13]:
fig = ts_plot(df_N, df_S, area='SE4')
py.iplot(fig)

The draw time for this plot will be slow for clients without much RAM.




In [14]:
fig = ts_plot(df_N, df_S, area='SE')
py.iplot(fig)

The draw time for this plot will be slow for clients without much RAM.




## Histogram

It is hard to tell from the time series plots whether the two sources (Nord Pool and SVK) are indeed providing the same data. We therefore look at the residual between the datasets, computed as Nord Pool minus SVK, in a histogram. The residuals are mostly close to zero and symmetrically distributed around zero, which indicates that there is no bias between the datasets.

In [21]:
# One residual histogram (Nord Pool minus SVK) per price area
areas = ['SE1', 'SE2', 'SE3', 'SE4']
xbins = dict(start=-5, end=5, size=0.2)
data = [go.Histogram(x=df_N[area] - df_S[area], xbins=xbins, histnorm='probability', name=area)
        for area in areas]

layout = go.Layout(title='Residual between Nord Pool and SVK data',
                   xaxis=dict(title='Residual [MW]'),
                   yaxis=dict(title='Probability'))
fig = dict(data=data, layout=layout)
py.iplot(fig)
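
The visual impression of an unbiased residual can be backed up with a few summary statistics. This is a sketch on hypothetical numbers, assuming both series share the same `DatetimeIndex`:

```python
import pandas as pd

idx = pd.to_datetime(['2015-02-03 08:00', '2015-02-03 09:00',
                      '2015-02-03 10:00', '2015-02-03 11:00'])
nord_pool = pd.Series([141.0, 131.0, 116.0, 114.0], index=idx)  # whole-MW values
svk = pd.Series([141.3, 130.8, 116.4, 113.5], index=idx)        # made-up unrounded values

# Drop hours that are missing in either source before summarizing
residual = (nord_pool - svk).dropna()

# A mean and median near zero suggest no systematic offset between the sources
summary = residual.agg(['mean', 'median', 'std'])
print(summary)
```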

## Scatter plot

Another way to look at the residuals is to compare the Nord Pool and SVK production data directly in a scatter plot. Zooming in reveals that Nord Pool rounds its data to whole MW.

In [3]:
def scatter_plot(df_N, df_S, area):
    """Build a scatter figure of Nord Pool against SVK production for one area."""
    df = df_N[[area]].join(df_S[[area]], lsuffix='_N', rsuffix='_S')
    
    trace = go.Scatter(x=df[area+'_N'], y=df[area+'_S'], mode='markers')
    layout = go.Layout(title='Wind Power Correlation in ' + area,
                       xaxis=dict(title='Nord Pool'),
                       yaxis=dict(title='SVK'))
    data = [trace]
    fig = go.Figure(data=data, layout=layout)
    
    return fig

In [4]:
fig = scatter_plot(df_N, df_S, area='SE1')
py.iplot(fig)


In [18]:
fig = scatter_plot(df_N, df_S, area='SE2')
py.iplot(fig)

In [19]:
fig = scatter_plot(df_N, df_S, area='SE3')
py.iplot(fig)

In [20]:
fig = scatter_plot(df_N, df_S, area='SE4')
py.iplot(fig)

In [18]:
fig = scatter_plot(df_N, df_S, area='SE')
py.iplot(fig)
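
The rounding can also be checked without zooming. The sketch below uses hypothetical values and tests two things: whether one series contains only whole numbers, and how often rounding the other series reproduces it.

```python
import pandas as pd

nord_pool = pd.Series([141.0, 131.0, 116.0, 114.0])
svk = pd.Series([141.3, 130.8, 116.4, 112.9])  # made-up values

# True if every value is already a whole number
is_whole = bool((nord_pool == nord_pool.round()).all())

# Share of hours where rounding the unrounded series reproduces the rounded one
match_share = float((svk.round() == nord_pool).mean())
print(is_whole, match_share)
```

Note that `Series.round` rounds exact halves to the nearest even number, so values ending in .5 may not match a source that rounds halves up.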

## Conclusion

From this preliminary analysis of the regional wind power data we conclude that the two datasets (Nord Pool and SVK) are indeed very similar. The main reason for the difference between the datasets seems to be rounding. However, several residuals have an absolute value above 0.5 MW, which rounding to whole MW cannot explain, so there must be other, unknown reasons why the datasets differ. Since the Nord Pool data is rounded and not available for the first month at area level, we expect the SVK data to be of higher quality.
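
The residuals that rounding alone cannot explain are those with an absolute value above 0.5 MW. A hypothetical helper (`count_large_residuals` is an illustrative name, not part of this notebook) could count them per area; it is demonstrated here on toy frames with made-up values:

```python
import pandas as pd

def count_large_residuals(df_a, df_b, areas, threshold=0.5):
    """Count hours per area where |df_a - df_b| exceeds the threshold."""
    residual = (df_a[areas] - df_b[areas]).abs()
    # Comparisons against NaN are False, so missing hours are not counted
    return (residual > threshold).sum()

idx = pd.to_datetime(['2015-02-03 08:00', '2015-02-03 09:00', '2015-02-03 10:00'])
toy_N = pd.DataFrame({'SE1': [141.0, 131.0, 116.0],
                      'SE2': [429.0, 390.0, 375.0]}, index=idx)
toy_S = pd.DataFrame({'SE1': [141.3, 129.0, 116.4],
                      'SE2': [429.2, 390.4, 375.1]}, index=idx)

counts = count_large_residuals(toy_N, toy_S, ['SE1', 'SE2'])
print(counts)
```

Inspecting the timestamps of the flagged hours would be a natural next step in looking for the unknown causes.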