# **Netherlands electricity regional time series**

[1. Introduction](#introduction)<br>
&emsp;[1.1 Hypothesis](#Hypotosis)<br>
&emsp;[1.2 Setting up libraries](#setup)<br>
&emsp;[1.3 Load and Process](#loadnprocess)<br>
[2. Exploratory data analysis](#EPA)<br>
&emsp;[2.1 Aggregated annual statistics](#AAS)<br>
&emsp;[2.2 Data cleaning](#DC)<br>
&emsp;[2.3 Data distribution](#DD)<br>
&emsp;[2.4 Electricity statistics per connection](#ESPC)<br>
&emsp;[2.5 Correcting electricity usage per connection](#CEUPC)<br>
[3. Geospatial-based analysis](#GBA)<br>
&emsp;[3.1 Explaining the environment](#ETE)<br>
&emsp;[3.2 Geospatial-based exploratory data analysis](#GBEDA)<br>
&emsp;[3.3 The most electricity-hungry cities](#TMEHC)<br>
&emsp;[3.4 The least electricity-hungry cities](#LEHC)<br>
[4. Modelling](#M)<br>
&emsp;[4.1 General renewable electricity production](#GREP)<br>
&emsp;[4.2 Largest renewable electricity producers](#LREP)<br>
&emsp;[4.3 Renewable electricity trends](#RET)<br>

<a id="introduction"></a>   
# **1. Introduction**
By 2023 at least 27% of the electricity produced has to be renewable. We aim to research whether this goal is realistic and to examine current electricity consumption in different parts of the Netherlands. We aim to answer questions such as:


**(a)** What percentage of all consumed electricity is renewable?<br>
**(b)** Where is the most renewable electricity produced & how does this grow over time?<br>
**(c)** Will the renewable electricity trend keep up with the electricity usage trend?<br>

<a id="Hypotosis"></a>
## 1.1 Hypothesis

Before starting the research I try to predict its outcome. The research will then confirm or refute these assumptions.

**(a)** What percentage of all consumed electricity is renewable?<br>
I expect renewable electricity production to be very small compared to non-renewable electricity usage.<br>
**(b)** Where is the most renewable electricity produced & how does this grow over time?<br>
I expect the most renewable electricity per connection to be produced outside of major cities, since people in rural areas appear to have bigger houses and thus more space for PV panels.<br>
**(c)** Will the renewable electricity trend keep up with the electricity usage trend?<br>
I suspect that renewable electricity grows faster than net electricity consumption, since electronics are getting more efficient and generating renewable electricity is getting more affordable.<br>

<a id="setup"></a>
## 1.2 Setting up libraries

In [None]:
import os
import glob
import base64
import folium
import imageio

import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
import geopandas as gpd
from IPython import display
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

sns.set_theme()

# Configuration
ELECTRICITY_PATH = '/kaggle/input/dutch-energy/Electricity/'
POSTCODE2_PATH = '/kaggle/input/postalcodes2/postcodes2_2020.csv'

%matplotlib inline

<a id="loadnprocess"></a>
## 1.3 Load and Process
The dataset is split into files by supplier and year. We group these files by year of measurement and then take a quick look at the data to inspect the column names. The last three columns appear to be corrupted.

In [None]:
dataframes = []
for year in range(2009,2021): # 2009 -> 2020
    dfs = [pd.read_csv(x) for x in glob.glob(f'{ELECTRICITY_PATH}*{year}.csv')]
    dataframe = pd.concat(dfs)
    
    # Add a year column to the dataset
    dataframe['year'] = year
    dataframes.append(dataframe)
    
full_dataframe = pd.concat(dataframes)
del dataframes # Free memory
full_dataframe.head()
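
To back up the visual impression that some columns are corrupted, one option is to measure the share of missing values per column. The sketch below is a minimal illustration on a made-up frame; `sample`, `corrupted_col`, `mostly_empty_columns` and the 50% threshold are assumptions for this example and are not part of the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the loaded data (not the real Kaggle files)
sample = pd.DataFrame({
    'annual_consume': [3200.0, 2800.0, 4100.0, 3000.0],
    'num_connections': [20, 18, 25, 22],
    'corrupted_col': [np.nan, np.nan, np.nan, 1.0],
})

def mostly_empty_columns(df, threshold=0.5):
    """Return the names of columns whose share of missing values exceeds `threshold`."""
    missing_share = df.isna().mean()  # fraction of NaN values per column
    return missing_share[missing_share > threshold].index.tolist()

suspect_columns = mostly_empty_columns(sample)
print(suspect_columns)
```

Run against `full_dataframe`, a check like this would list candidate columns to drop; the threshold is a judgment call.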

<a id="EPA"></a>
# **2. Exploratory data analysis**
As with any dataset, before we begin the analysis we don't know how complete the data is. The first step is therefore to validate how the data points develop over time.

<a id="AAS"></a>
## 2.1 Aggregated annual statistics
The first aggregation checks whether the trend is consistent. We sum the annual statistics and plot them against the year of measurement. According to this graph the Netherlands uses only 1.6 TWh per year, which is implausible compared with the figure given by this [source](https://www.cbs.nl/nl-nl/publicatie/2015/07/elektriciteit-in-nederland) (~100 TWh per year). Explaining this gap is the crux of the exploratory data analysis and the geospatial-based analysis.

In [None]:
# Group by year
annual_agg = full_dataframe.groupby('year').aggregate({'annual_consume': 'sum', 'num_connections': 'sum', 'delivery_perc' : 'mean', 'perc_of_active_connections' : 'mean'})
annual_agg.reset_index(inplace=True)

# Create charts
fig, axs = plt.subplots(1, 3, figsize=(20, 5))
annual_agg.plot(x='year', y='annual_consume', label='Electricity consumption', ylabel='Electricity (kWh)', xlabel='Year', ax=axs[0],title='Annual consumption of electricity', style='.-')
annual_agg.plot(x='year', y='num_connections', label='Connection count', ax=axs[1], title="Number of connections", style='.-')
annual_agg.plot(x='year', y='delivery_perc', label='Electricity delivery percentage', ylabel='Percentage (%)', ax=axs[2], title='Electricity bought percentage', style='.-');
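
The chart's y-axis is in kWh while the reference figure is in TWh, so a small unit conversion makes the gap explicit. The sketch below uses only the two numbers quoted in the text (about 1.6 TWh from the chart and about 100 TWh from the CBS source); the helper name `kwh_to_twh` is an assumption for illustration.

```python
KWH_PER_TWH = 1e9  # 1 TWh = 1,000,000,000 kWh

def kwh_to_twh(kwh):
    """Convert kilowatt-hours to terawatt-hours."""
    return kwh / KWH_PER_TWH

aggregated_twh = kwh_to_twh(1.6e9)  # roughly the annual total read from the chart
reference_twh = 100.0               # approximate annual figure from the CBS source
coverage = aggregated_twh / reference_twh

print(f'{aggregated_twh:.1f} TWh is {coverage:.1%} of the reference figure')
```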


<a id="DC"></a>
## 2.2 Data cleaning
Looking at the chart above, we can conclude that 2009 is incomplete since it doesn't follow the same trend. We therefore delete all data points from 2009. In `load & process` we also noticed corrupted or empty columns; these are deleted as well.

In [None]:
# Remove corrupted columns
full_dataframe.drop(columns=['STANDAARDDEVIATIE', 'ï»¿NETBEHEERDER', '%Defintieve aansl (NRM)'], inplace=True)

# Remove incomplete year
full_dataframe = full_dataframe[full_dataframe.year != 2009]
annual_agg = annual_agg[annual_agg['year'] != 2009]

<a id="DD"></a>
## 2.3 Data distribution
Plotting the distribution of the dataset helps us see whether the data is skewed and which preprocessing steps we should apply if we want to use machine learning algorithms later. Here we created an animation to show how the distribution changes over time.

Based on the graphs shown below, we can estimate the mode and median. A substantial share of the measurements falls in the 2000-4000 kWh bin, which can explain the low total electricity consumption per year. According to the [Dutch environmental centre](https://www.milieucentraal.nl/energie-besparen/inzicht-in-je-energierekening/gemiddeld-energieverbruik/#Welke%20cijfers%20gebruikt%20Milieu%20Centraal?), the average household should consume 2547 kWh. The median and mode below lie only slightly above this figure, even though each row in the dataset covers multiple connections.
I therefore assume that each row holds the average over its connections.

In [None]:
# Wrapped in a function so that temporary variables are freed once it returns.
def create_animation(animation_file):
    frames = []
    os.makedirs('/kaggle/working/EDA1/', exist_ok=True)
    
    for year in range(2010,2021):
        fig, axs = plt.subplots(1, 2, figsize=(10, 5))
        selected_year = full_dataframe[full_dataframe['year'] == year]

        # Annual consumption
        working_dataframe = selected_year['annual_consume'].to_numpy()
        working_dataframe.sort()
        working_dataframe = working_dataframe[(np.abs(stats.zscore(working_dataframe)) < 2)]
        axs[0].hist(working_dataframe, bins=20, alpha=1)
        axs[0].set_title(f'Annual consumption of electricity {year}')
        axs[0].set_xlabel('Electricity usage (kWh)')

        # Electricity delivery percentage
        working_dataframe = selected_year['delivery_perc'].to_numpy()
        working_dataframe.sort()
        axs[1].hist(working_dataframe, bins=20, alpha=1)
        axs[1].set_title(f'Annual electricity delivery percentage {year}')
        axs[1].set_xlabel('Percentage delivered (%)')        
        
        # Store plot as image
        path = f'/kaggle/working/EDA1/{year}.png'
        plt.savefig(path);
        frames.append(imageio.imread(path))
        plt.close()
    

    imageio.mimsave(animation_file, frames, fps=1)
    
    with open(animation_file, 'rb') as fd: # Hacky way to display the gif
        b64 = base64.b64encode(fd.read()).decode('ascii')
    return display.HTML(f'<img src="data:image/gif;base64,{b64}"/>')
create_animation('/kaggle/working/EDA1/animation.gif')
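
The histogram of annual consumption above keeps only values whose absolute z-score is below 2. The sketch below reproduces that filter with plain NumPy on made-up readings, as a rough illustration of what gets dropped; `drop_outliers` and the sample values are assumptions, and the population standard deviation is used because that is the default of `scipy.stats.zscore`.

```python
import numpy as np

def drop_outliers(values, max_z=2.0):
    """Keep only the values whose absolute z-score is below `max_z`."""
    values = np.asarray(values, dtype=float)
    std = values.std()  # population standard deviation (ddof=0)
    if std == 0:
        return values   # all values identical, nothing to drop
    z_scores = (values - values.mean()) / std
    return values[np.abs(z_scores) < max_z]

# Hypothetical readings in kWh with one extreme value
readings = np.array([2400, 2600, 2500, 2550, 2450, 2500, 2480, 2520, 2510, 90000])
kept = drop_outliers(readings)
print(len(readings), len(kept))
```

Note that a single extreme value also inflates the mean and standard deviation, so a z-score filter can be lenient on heavily skewed data.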

In [None]:
ax = full_dataframe.boxplot('annual_consume', by='year', showfliers=False, figsize=(10,10))
ax.set_title('Annual consumption boxplot');

<a id="ESPC"></a>
## 2.4 Electricity statistics per connection
Our previous assumption appears to be correct: the electricity consumption in each row is the average over multiple connections, so we should correct for it.

In [None]:
fig, axs = plt.subplots(1, 2, figsize=(15, 5))
connections = pd.DataFrame({'Year' : annual_agg['year'], 'Gross connection': annual_agg['num_connections'], 'Net connection' : annual_agg['num_connections'] * (annual_agg['perc_of_active_connections'] / 100)})
connections.plot(x='Year' ,ylabel='Connection count', title='Difference when inactive connections are removed', style='.-', ax=axs[0])

# Compute Electricity usage per connection
connections['Electricity Consumption / Net connection'] = annual_agg['annual_consume'] / connections['Net connection']
connections['Electricity Consumption / Gross connection'] = annual_agg['annual_consume'] / connections['Gross connection']

# Plot processed connection data
connections.plot(x='Year', y=['Electricity Consumption / Net connection','Electricity Consumption / Gross connection'], style='.-', ylabel='Electricity consumption (kWh)', title='Average Electricity consumption per connection', ax=axs[1]);
del connections;

<a id="CEUPC"></a>
## 2.5 Correcting electricity usage per connection
The assumption that the low electricity consumption is caused by averaging over connections is only partially correct. After the correction we arrive at an electricity usage of 36 TWh, which is still only about a third of the actual electricity usage. I compiled the following list of possible explanations for what is happening with our data:
- Missing regional data.
- Information lost through computational errors, such as averaging, rounding and projection errors.

In [None]:
full_dataframe['annual_consume'] = full_dataframe['annual_consume'] * (full_dataframe['num_connections'] * (full_dataframe['perc_of_active_connections'] / 100))

# Store cleaned data which can be used for further research
full_dataframe.to_csv("/kaggle/working/cleaned.csv", index=False);
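
The correction multiplies each row's average consumption by its number of active connections. The sketch below works through that arithmetic on two made-up rows; the values are assumptions, and the result is written to a separate, hypothetical `total_consume` column so the original average stays available.

```python
import pandas as pd

# Hypothetical rows: annual_consume is the average kWh per connection
rows = pd.DataFrame({
    'annual_consume': [3000.0, 2500.0],
    'num_connections': [20, 40],
    'perc_of_active_connections': [100.0, 50.0],
})

# Number of connections that are actually in use
active_connections = rows['num_connections'] * rows['perc_of_active_connections'] / 100

# Total consumption per row = average per connection * active connections
rows['total_consume'] = rows['annual_consume'] * active_connections
print(rows)
```

Writing to a new column is a design choice for this sketch: overwriting `annual_consume` in place, as the cell above does, applies the multiplication again if the cell is re-run.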

In [None]:
# Displaying change in energy usage
annual_agg = full_dataframe.groupby('year').aggregate({'annual_consume': 'sum', 'num_connections': 'sum', 'delivery_perc' : 'mean', 'perc_of_active_connections' : 'mean'})
annual_agg.reset_index(inplace=True)

# Create charts
fig, axs = plt.subplots(1, 3, figsize=(20, 5))
annual_agg.plot(x='year', y='annual_consume', label='Electricity consumption', ylabel='Electricity (kWh)', xlabel='Year', ax=axs[0],title='Annual consumption of electricity', style='.-')
annual_agg.plot(x='year', y='num_connections', label='Connection count', ax=axs[1], title="Number of connections", style='.-')
annual_agg.plot(x='year', y='delivery_perc', label='Electricity delivery percentage', ylabel='Percentage (%)', ax=axs[2], title='Electricity bought percentage', style='.-');


<a id="GBA"></a>
# **3. Geospatial-based analysis**
The dataset is associated with spatial data. We should therefore verify that this data is correct as well and that we are not missing information.

<a id="ETE"></a>
## 3.1 Explaining the environment
The image below shows the cluster size we use for plotting geospatial information. These clusters are based on two-digit postal codes, i.e. the first two digits of the postcode.

In [None]:
# Testing if the charting works correctly
postal_codes = gpd.read_file(POSTCODE2_PATH, GEOM_POSSIBLE_NAMES="geometry", KEEP_GEOM_COLUMNS="NO")

# Keep only the postcode (converted to int) and the geometry
postal_codes = gpd.GeoDataFrame({ 'postcode' : postal_codes.postcode.astype(int),  'geometry':postal_codes.geometry})
ax = postal_codes.plot(linewidth=1, cmap="prism", figsize=(10,10));
ax.axis('off');

<a id="GBEDA"></a>
## 3.2 Geospatial-based exploratory data analysis
When we plot electricity consumption over time and space, we can see that Zeeland is missing; this could partially explain why the annual electricity usage is lower than it should be.

Even in an idealised scenario where each of the twelve provinces used 1/12th of the total electricity, one missing province is not enough to explain the shortfall in consumed electricity. More regional data than Zeeland alone must therefore be missing, which is in line with our assumption.
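
The 1/12th argument can be checked with a few lines of arithmetic. The sketch below uses only figures quoted in this notebook (about 36 TWh after the correction and about 100 TWh as the national reference); treating every province as an equal consumer is the simplifying assumption stated above.

```python
observed_twh = 36.0    # corrected annual total quoted earlier
reference_twh = 100.0  # approximate national annual figure
provinces = 12

# Consumption of one province if all provinces used an equal share
missing_province_twh = reference_twh / provinces

# Total if the one missing province were added back
with_missing_province_twh = observed_twh + missing_province_twh
still_unexplained_twh = reference_twh - with_missing_province_twh

print(f'With one province added: {with_missing_province_twh:.1f} TWh')
print(f'Still unexplained:       {still_unexplained_twh:.1f} TWh')
```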

In [None]:
post2_agg = full_dataframe

# Convert the six-character zipcode to its first two digits.
post2_agg['zipcode'] = post2_agg['zipcode_from'].apply(lambda x: int(str(x)[:2]))

# Aggregate by zipcode and year so the result can be merged with the geometry data.
post2_agg = post2_agg.groupby(['zipcode', 'year']).aggregate({'annual_consume': 'sum', 'num_connections': 'sum', 'delivery_perc' : 'mean'})
post2_agg.reset_index(inplace=True)

post2_agg = postal_codes.merge(post2_agg, left_on="postcode", right_on="zipcode")

def create_animation_chart():
    frames = []
    
    # Create directory for the frames
    os.makedirs('/kaggle/working/GEO1/', exist_ok=True)
    
    for year in range(2010,2021):
        fig, axs = plt.subplots(1, 2, figsize=(20, 10))
        selected_year = post2_agg[post2_agg['year'] == year]

        postal_codes.plot(color='black', ax=axs[0])
        selected_year.plot('annual_consume', ax=axs[0], linewidth=1, legend=True, cmap='coolwarm')
        axs[0].set_title(f'Annual consumption of electricity {year}')
        axs[0].axis('off')
        
        postal_codes.plot(color='black', ax=axs[1])
        selected_year.plot('delivery_perc', ax=axs[1], linewidth=1, legend=True, cmap='coolwarm')
        axs[1].set_title(f'Annual electricity delivery percentage {year}')
        axs[1].axis('off')

        path = f'/kaggle/working/GEO1/{year}.png'
        plt.savefig(path);
        frames.append(imageio.imread(path))
        plt.close()
    

    imageio.mimsave('/kaggle/working/GEO1/animation.gif', frames, fps=1)
    
    with open('/kaggle/working/GEO1/animation.gif', 'rb') as fd: # Hacky way to display the gif
        b64 = base64.b64encode(fd.read()).decode('ascii')
    return display.HTML(f'<img src="data:image/gif;base64,{b64}"/>')

post2_agg.to_csv("/kaggle/working/post2_agg.csv", index=False)
create_animation_chart()
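
Besides reading the map, missing regions can be listed directly by comparing the postcode regions in the shapefile with those present in the measurements. The sketch below is a minimal illustration on made-up data; `all_regions`, `measurements`, the postcode values and `to_zipcode2` are assumptions for this example.

```python
import pandas as pd

def to_zipcode2(zipcode6):
    """Reduce a six-character zipcode such as '1011AB' to its first two digits."""
    return int(str(zipcode6)[:2])

# Hypothetical two-digit regions known from the geometry file
all_regions = pd.Series([10, 11, 43, 44], name='postcode')

# Hypothetical measurements
measurements = pd.DataFrame({
    'zipcode_from': ['1011AB', '1102CD', '1012EF'],
    'annual_consume': [3000.0, 2800.0, 3100.0],
})
measurements['zipcode'] = measurements['zipcode_from'].map(to_zipcode2)

# Regions that have a geometry but no measurements at all
covered = set(measurements['zipcode'].tolist())
missing_regions = sorted(set(all_regions.tolist()) - covered)
print(missing_regions)
```

Repeating this per year would show whether a region is missing throughout or only in some years.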

<a id="TMEHC"></a>
## 3.3 The most electricity-hungry cities
This chart gives a more in-depth view of the larger cities and their trends, so we can see whether anything strange is happening there. We can use it to identify shortcomings of the dataset and account for them in our computations. In the chart below, Eindhoven looks inconsistent because its electricity consumption doesn't follow the trend of the other cities, especially the drop in 2010.

In [None]:
# Create an aggregation based on year and city name
city_agg = full_dataframe.groupby(['city', 'year']).aggregate({'annual_consume': 'sum', 'num_connections': 'sum', 'delivery_perc' : 'mean', 'perc_of_active_connections' : 'mean'})
city_agg.reset_index(inplace=True)

# Configure plot
fig, ax = plt.subplots(2, 1, figsize=(20,15))
ax[0].set_ylabel('Electricity consumption (kWh)')
ax[1].set_ylabel('Gross connection count')

# Find 10 biggest electricity users based on statistics from 2020
biggest_consumers = city_agg[city_agg['year'] == 2020].sort_values('annual_consume', ascending=False).head(10)

# Plot biggest electricity consumers
for city in biggest_consumers['city']:
    target_city = city_agg[city_agg['city']==city]
    target_city.plot(y='annual_consume', x='year', ax=ax[0], label=city, style='.-', title='Electricity consumption')
    target_city.plot(y='num_connections', x='year', ax=ax[1], label=city, style='.-', title='Number of connections')

When we dive deeper into the measurement count of Eindhoven, we can see that there are no measurements for 2010. A large number of measurements was also lost between 2013 and 2014.

In [None]:
# Configure plot
fig, ax = plt.subplots(1, 1, figsize=(15,5))
ax.set_title('Measurement count most electricity hungry cities')

# Plot measurement count to see if it shows a similar trend as the plot shown above
for city in biggest_consumers['city']:
    target_city = full_dataframe[full_dataframe['city']==city]
    measurement_count = []
    for year in range(2010, 2021):
        # Non-null entries in the first column, selected by position
        count = target_city[target_city['year']==year].count().iloc[0]
        measurement_count.append(count)
    ax.plot(range(2010, 2021), measurement_count, marker='.', label=city)

# Create the legend after plotting so it picks up the city labels
ax.legend()

del biggest_consumers;
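
The nested loop above can also be expressed as a single `groupby`, which makes years without any measurements show up as explicit zeros. The sketch below is one possible formulation on made-up records; the city names and counts are assumptions and do not reflect the real dataset.

```python
import pandas as pd

# Hypothetical measurement rows, one row per measurement
records = pd.DataFrame({
    'city': ['EINDHOVEN', 'EINDHOVEN', 'EINDHOVEN', 'UTRECHT', 'UTRECHT', 'UTRECHT'],
    'year': [2011, 2011, 2012, 2010, 2011, 2012],
})

# Rows per (city, year), pivoted to one column per city.
# Combinations without rows become 0 instead of being absent.
counts = (
    records.groupby(['city', 'year']).size()
    .unstack('city', fill_value=0)
    .reindex(range(2010, 2013), fill_value=0)
)
print(counts)
```

`counts.plot(style='.-')` would then draw one line per city. Unlike `count()` on the first column, `size()` counts all rows, including those with missing values.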

<a id="LEHC"></a>
## 3.4 The least electricity-hungry cities
Sorting the per-city time series in ascending order reveals strange patterns. The smallest electricity consumers show a substantial list of inconsistencies, such as:
- City casing is not consistent.
- The letter `o` in Amsterdam Zuidoost is replaced with zeros.
- Amsterdam Zuid-Oost is not a city but a part of Amsterdam.
- Amsterdam Zuid-Oost existed before 2016.
- Cities that were newly created at different points in time.
- Cities that disappeared from the dataset (not visible on this chart).

These inconsistencies support our assumption of incomplete data.

In [None]:
smallest_consumers = city_agg[city_agg['year'] == 2020].sort_values('annual_consume', ascending=True).head(7)
fig, ax = plt.subplots(2, 1, figsize=(20,15))

for city in smallest_consumers['city']:
    target_city = city_agg[city_agg['city']==city]
    target_city.plot(y='annual_consume', x='year', ax=ax[0], label=city, style='.-', title='Electricity consumption')
    target_city.plot(y='num_connections', x='year', ax=ax[1], label=city, style='.-', title='Number of connections')

del smallest_consumers
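
Some of the naming inconsistencies listed above could be reduced by normalising city names before grouping. The sketch below is a rough illustration only: the example spellings are hypothetical, `normalise_city` is not part of the notebook, and replacing every zero with the letter O assumes that no city name legitimately contains a digit.

```python
import pandas as pd

def normalise_city(name):
    """Collapse whitespace, unify casing and undo zero-for-O substitutions."""
    cleaned = ' '.join(str(name).split()).upper()
    return cleaned.replace('0', 'O')  # assumption: digits never belong in a city name

# Hypothetical spellings of the kind described above
cities = pd.Series(['Amsterdam', 'AMSTERDAM', 'amsterdam  zuid00st', 'AMSTERDAM ZUIDOOST'])
normalised = cities.map(normalise_city)
print(normalised.unique())
```

This does not merge hyphenated variants or districts into their parent city; that would need an explicit mapping.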

<a id="M"></a>
# **4. Modelling**
Even though there are inconsistencies, we should still answer the hypothesis questions from the beginning of this research.

<a id="GREP"></a>
## 4.1 General renewable electricity production
According to our dataset, the percentage of renewable electricity seems to be increasing exponentially. According to [this source](https://www.trade.gov/knowledge-product/netherlands-electricity), the actual renewable electricity production seems to be higher; this confirms the hypothesis about the percentage of renewable electricity production.

In [None]:
fig, axs = plt.subplots(1, 3, figsize=(20, 5))
annual_renewable_electricity = pd.DataFrame({
    'Year' : annual_agg['year'],
    'Net electricity': annual_agg['annual_consume'] / (annual_agg['delivery_perc'] / 100),
    'Gross electricity': annual_agg['annual_consume'],
    'Renewable electricity': annual_agg['annual_consume'] / (annual_agg['delivery_perc'] / 100) - annual_agg['annual_consume']
})
annual_renewable_electricity['Percentage renewable'] = annual_renewable_electricity['Renewable electricity'] / annual_renewable_electricity['Net electricity'] * 100
annual_renewable_electricity.plot(ax=axs[0],x='Year', y=['Net electricity', 'Gross electricity'], style='.-', ylabel='Electricity (kWh)', title='Total electricity consumption')
annual_renewable_electricity.plot(ax=axs[1],x='Year', y='Renewable electricity', ylabel='Electricity (kWh)', style='.-', title='Renewable electricity production')
annual_renewable_electricity.plot(ax=axs[2],x='Year', y='Percentage renewable', ylabel='Percentage (%)', style='.-', title='Renewable electricity production')
plt.show()

annual_renewable_electricity.to_csv("/kaggle/working/annual_renewable_electricity.csv", index=False)
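
The three derived columns follow from one formula each, and the resulting percentage simplifies to `100 - delivery_perc` because the consumption term cancels. The sketch below checks this on made-up numbers; `renewable_share` and its inputs are assumptions for illustration, while the formulas mirror the cell above.

```python
def renewable_share(gross_kwh, delivery_perc):
    """Percentage of net electricity that is renewable, using the formulas above."""
    net_kwh = gross_kwh / (delivery_perc / 100)  # 'Net electricity'
    renewable_kwh = net_kwh - gross_kwh          # 'Renewable electricity'
    return renewable_kwh / net_kwh * 100         # 'Percentage renewable'

# Hypothetical example: 9,000 kWh with a delivery percentage of 90%
share = renewable_share(gross_kwh=9_000.0, delivery_perc=90.0)
print(round(share, 6))

# The consumption value cancels out, so the share depends only on delivery_perc
other_share = renewable_share(gross_kwh=123_456.0, delivery_perc=90.0)
print(round(other_share, 6))
```

Because `annual_agg` holds the unweighted mean of `delivery_perc`, rows with few connections weigh as much as rows with many; weighting by consumption might give a different percentage.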

<a id="LREP"></a>
## 4.2 Largest renewable electricity producers
The charts below contradict my initial hypothesis. Large cities in urban areas produce more renewable electricity than smaller rural places.

In [None]:
city_agg['renewable_electricity'] = city_agg['annual_consume'] / (city_agg['delivery_perc'] / 100) - city_agg['annual_consume']
city_agg['renewable_electricity_perc'] = city_agg['renewable_electricity'] / city_agg['annual_consume'] * 100

# Plot preparation
fig, axs = plt.subplots(2, 1, figsize=(20, 15))
axs[0].set_ylabel('Electricity (kWh)')
axs[1].set_ylabel('Percentage (%)')

largest_producers = city_agg[city_agg['year'] == 2020].sort_values('renewable_electricity', ascending=False).head(10)

for city in largest_producers['city']:
    target_city = city_agg[city_agg['city'] == city]
    target_city.plot( x='year',y='renewable_electricity', ax=axs[0], label=city, style='.-', title='Renewable electricity production')
    target_city.plot(y='renewable_electricity_perc', x='year', ax=axs[1], label=city, style='.-', title='Renewable electricity production')


<a id="RET"></a>
## 4.3 Renewable electricity trends
Based on the small amount of data in this dataset, we predict two years into the future. The prediction looks promising for a future in which more renewable electricity is produced.

In [None]:
fig, axs = plt.subplots(1, 3, figsize=(25, 5))

# Aggregate net electricity consumption
x = annual_renewable_electricity[['Year']]
y = annual_renewable_electricity[['Net electricity']]

# Create polynomial features & labels
poly = PolynomialFeatures(degree=2)
x_poly = poly.fit_transform(x)
future = np.array(range(2021,2023)).reshape(-1,1)
future_x = poly.transform(future)

# Polynomial regression for net electricity usage / year
line = LinearRegression()
line.fit(x_poly,  y)
axs[0].scatter(x, y, label='Net electricity')
axs[0].plot(x, line.predict(x_poly), label='Net electricity trend')
axs[0].scatter(future, line.predict(future_x), marker="x", c='b', label='Predicted net electricity')

# Polynomial regression Gross electricity usage / year
x = annual_renewable_electricity[['Year']]
y = annual_renewable_electricity[['Gross electricity']]
line = LinearRegression()
line.fit(x_poly,  y)
axs[0].scatter(x, y,c='r')
axs[0].plot(x, line.predict(x_poly), label='Gross electricity trend', c='r')
axs[0].scatter(future, line.predict(future_x), marker = "x", c='r', label='Predicted gross electricity')
axs[0].legend()
axs[0].set_title('Prediction electricity consumption')
axs[0].set_ylabel('Electricity (kWh)')


# Polynomial regression renewable electricity production (kWh) / year
x = annual_renewable_electricity[['Year']]
y = annual_renewable_electricity[['Renewable electricity']]
line = LinearRegression()
line.fit(x_poly,  y)
axs[1].scatter(x, y)
axs[1].plot(x, line.predict(x_poly), label='Renewable electricity trend')
axs[1].scatter(future, line.predict(future_x), marker = "x", c='b', label='Renewable electricity production')
axs[1].legend()
axs[1].set_title('Prediction renewable electricity production')
axs[1].set_ylabel('Electricity (kWh)')

# Polynomial regression renewable electricity production vs net electricity production percentage / year
x = annual_renewable_electricity[['Year']]
y = annual_renewable_electricity[['Percentage renewable']]
line = LinearRegression()
line.fit(x_poly,  y)
axs[2].scatter(x, y)
axs[2].plot(x, line.predict(x_poly) , label='Renewable electricity trend')
axs[2].scatter(future, line.predict(future_x), marker = "x", c='b', label='Renewable electricity prediction')
axs[2].set_title('Percentage Renewable electricity production')
axs[2].set_ylabel('Percentage (%)');
axs[2].legend()

del annual_renewable_electricity;
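
The same kind of second-degree trend can be sketched with `numpy.polyfit` instead of scikit-learn's `PolynomialFeatures` plus `LinearRegression`. The example below is a stand-in on synthetic, noise-free data, not the notebook's method; the coefficients 2, 3 and 5 are made up, and the years are shifted to start at zero to keep the fitted numbers small.

```python
import numpy as np

years = np.arange(2010, 2021)  # same span as the cleaned data
t = years - years[0]           # shift so the first year is 0

# Synthetic values following an exact quadratic (assumption for illustration)
observed = 2.0 * t**2 + 3.0 * t + 5.0

# Fit a degree-2 polynomial and extrapolate two years ahead
coefficients = np.polyfit(t, observed, deg=2)
future_years = np.array([2021, 2022])
forecast = np.polyval(coefficients, future_years - years[0])
print(forecast)
```

With only eleven yearly points, a quadratic extrapolation is sensitive to individual years, so forecasts like these are best read as a rough indication.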