# Solar energy generation per country - clustering - part 1 of 3

---

__Description of the data set__

This dataset contains hourly estimates of an area's energy potential for 1986-2015 as a percentage of a power plant's maximum output.

The overall scope of EMHIRES is to allow users to assess the impact of meteorological and climate variability on the generation of solar power in Europe, not to mimic the actual evolution of solar power production in recent decades. For this reason, the hourly solar power generation time series are released for the meteorological conditions of the years 1986-2015 (30 years) without considering any changes in the installed solar capacity. The installed capacity is fixed at the level installed at the end of 2015. Data from EMHIRES should therefore not be compared with actual power generation data, except for the reference year 2015.

__Content__
- The data is available at both the national level and the [NUTS 2 level](https://en.wikipedia.org/wiki/Nomenclature_of_Territorial_Units_for_Statistics). The NUTS 2 system divides the EU into 276 statistical units.
- Please see the manual for the technical details of how these estimates were generated.
- This product is intended for policy analysis over a wide area and is not the best for estimating the output from a single system. Please don't use it commercially.

__Acknowledgements__

This dataset was kindly made available by [the European Commission's SETIS program](https://setis.ec.europa.eu/about-setis). You can find the original dataset here.

__Goal of this 1st step__

This is the first of three parts. Here we study solar generation at the country level in order to build clusters of countries with similar profiles, so that each group can be investigated in more detail later.

In [None]:
# import of needed libraries

import numpy as np
import pandas as pd
from scipy import stats
import seaborn as sns
import matplotlib.pyplot as plt

pd.options.display.max_columns = 300

import warnings
warnings.filterwarnings("ignore")

First let's start with the data set for each country :

In [None]:
# let's see what our data set looks like

df_solar_co = pd.read_csv("../input/30-years-of-european-solar-generation/EMHIRESPV_TSh_CF_Country_19862015.csv")
df_solar_co.head(2)

Each column represents a country. We can list them easily:

In [None]:
df_solar_co.columns

If needed, here is a Python dictionary that converts each two-letter code to the full name of the country:

In [None]:
country_dict = {
'AT': 'Austria',
'BE': 'Belgium',
'BG': 'Bulgaria',
'CH': 'Switzerland',
'CY': 'Cyprus',
'CZ': 'Czech Republic',
'DE': 'Germany',
'DK': 'Denmark',
'EE': 'Estonia',
'ES': 'Spain',
'FI': 'Finland',
'FR': 'France',
'EL': 'Greece',
'UK': 'United Kingdom',
'HU': 'Hungary',
'HR': 'Croatia',
'IE': 'Ireland',
'IT': 'Italy',
'LT': 'Lithuania',
'LU': 'Luxembourg',
'LV': 'Latvia',
'NO': 'Norway',
'NL': 'Netherlands',
'PL': 'Poland',
'PT': 'Portugal',
'RO': 'Romania',
'SE': 'Sweden',
'SI': 'Slovenia',
'SK': 'Slovakia'
}

How many rows and columns do we have?

In [None]:
df_solar_co.shape

Next, let's take a look at the data set at the NUTS 2 level:

In [None]:
df_solar_nu = pd.read_csv("../input/30-years-of-european-solar-generation/EMHIRES_PVGIS_TSh_CF_n2_19862015.csv")
df_solar_nu = df_solar_nu.drop(columns=['time_step'])
df_solar_nu.tail(2)

In [None]:
df_solar_nu.shape

---

# Groups of countries or regions with similar profiles

## Clustering with the k-means model

The objective of clustering is to identify distinct groups in a dataset such that the observations within a group are similar to each other but different from observations in other groups. In k-means clustering, we specify the number of desired clusters k, and the algorithm will assign each observation to exactly one of these k clusters. The algorithm optimizes the groups by minimizing the within-cluster variation (also known as inertia) such that the sum of the within-cluster variations across all k clusters is as small as possible. 
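
Written out, the quantity being minimized is the within-cluster sum of squared Euclidean distances. The notation here is an assumption made for illustration, not taken from the reference: $C_j$ is the set of observations assigned to cluster $j$ and $\mu_j$ is the centroid (mean) of that cluster.

$$
\text{inertia} = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2
$$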

Different runs of k-means can produce slightly different cluster assignments because the algorithm starts from a random initialization. After this initialization, k-means repeatedly reassigns each observation to the nearest cluster center point, or centroid, and recomputes the centroids, which reduces the Euclidean distance between each observation and its cluster's centroid. Since the final result depends on the starting point, two runs with different initializations may not converge to the same clusters.

Typically, the k-means algorithm does several runs and chooses the run that has the best separation, defined as the lowest total sum of within-cluster variations across all k clusters. 

Reference : [Hands-On Unsupervised Learning Using Python](https://www.oreilly.com/library/view/hands-on-unsupervised-learning/9781492035633/)
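
As a minimal sketch of the "several runs, keep the best" idea, the cell below compares single-initialization fits with one fit that tries ten initializations. The toy blobs, the variable names and the seeds are assumptions made for illustration; `n_init`, `init` and `random_state` are parameters of scikit-learn's `KMeans`.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data (an assumption for illustration): three Gaussian blobs in 2-D.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X_toy = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# One random initialization per fit: the final inertia can depend on the seed.
single_runs = [
    KMeans(n_clusters=3, init="random", n_init=1, random_state=seed).fit(X_toy).inertia_
    for seed in range(5)
]

# Ten initializations in a single fit: the run with the lowest inertia is kept.
best_of_ten = KMeans(n_clusters=3, init="random", n_init=10, random_state=0).fit(X_toy)

print("inertia of single runs:", np.round(single_runs, 2))
print("inertia of best of 10 :", round(best_of_ten.inertia_, 2))
```

Fixing `random_state` makes a given fit repeatable, which can help when comparing cluster assignments between notebook runs.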

## Evaluating the cluster quality

The goal here isn’t just to make clusters, but to make good, meaningful clusters. A clustering is of good quality when the data points within a cluster are close together and far from the other clusters.
Two ways to measure cluster quality are described below:
- Inertia: intuitively, inertia tells how far the points within a cluster are from their cluster center. A small inertia is therefore the aim. Its value starts at zero and has no upper bound.
- Silhouette score: the silhouette score compares how close each data point is to the points of its own cluster with how close it is to the points of the nearest other cluster. It ranges from -1 to 1, and values closer to 1 indicate better-separated clusters.

Reference : [Towards Data Science](https://towardsdatascience.com/k-means-clustering-from-a-to-z-f6242a314e9a)
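
To make the two measures concrete, here is a small sketch on synthetic data, assuming two pairs of Gaussian blobs: one pair tight and well separated, the other spread out and overlapping. The data and the helper name `describe` are hypothetical; `inertia_` and `silhouette_score` are the same scikit-learn tools used later in this notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)

# Two toy data sets (assumptions for illustration), each made of two blobs
# centered on (0, 0) and (6, 6), with a small or a large spread.
tight = np.vstack([rng.normal(loc=m, scale=0.3, size=(40, 2)) for m in (0.0, 6.0)])
loose = np.vstack([rng.normal(loc=m, scale=2.5, size=(40, 2)) for m in (0.0, 6.0)])


def describe(name, data):
    """Fit k-means with k=2 and report inertia and silhouette score."""
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
    score = silhouette_score(data, km.labels_)
    print(f"{name}: inertia={km.inertia_:.1f}, silhouette={score:.2f}")
    return km.inertia_, score


tight_inertia, tight_score = describe("tight blobs", tight)
loose_inertia, loose_score = describe("loose blobs", loose)
```

The tight blobs should show the lower inertia and the silhouette score closer to 1.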

__Optimal K: the elbow method__

How many clusters should we choose?

A common empirical approach is the elbow method: plot the mean distance of every point to its cluster center as a function of the number of clusters. When the curve is shaped like an arm, the elbow (the point where adding clusters stops bringing a large improvement) is taken as the optimal k.
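
The sketch below shows what an elbow looks like when the data really do contain a fixed number of groups. It assumes three synthetic blobs and uses the `inertia_` attribute of `KMeans` as the distance measure; the data and variable names are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data (an assumption for illustration): three well-separated blobs.
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [8.0, 0.0], [4.0, 7.0]])
X_toy = np.vstack([c + rng.normal(scale=0.6, size=(60, 2)) for c in centers])

# Inertia for k = 1 .. 7.
ks = list(range(1, 8))
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_toy).inertia_ for k in ks]

# Drop in inertia obtained by adding one more cluster.
gains = -np.diff(inertias)
for k, gain in zip(ks[1:], gains):
    print(f"k={k}: inertia drops by {gain:.1f}")
```

On data like these, the drop is large up to k=3 and small afterwards, which is the elbow. Real data often give a smoother curve where the elbow is harder to read.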

In [None]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

## On the NUTS 2 level

Let's keep the records of the last year and transpose the dataset, because we need one row per region.
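
As a quick check of what the slicing and transposition do, here is a sketch on a miniature table. The frame `toy` and its region names are hypothetical stand-ins for the NUTS 2 data.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the NUTS 2 table: 48 hourly rows, 3 made-up regions.
hours = 48
toy = pd.DataFrame(
    np.random.default_rng(0).random((hours, 3)),
    columns=["R1", "R2", "R3"],
)

# Keep the last 24 rows (one day here, one year in the notebook), then transpose.
toy_transposed = toy[-24:].T

print(toy_transposed.shape)  # (3, 24): one row per region, one column per hour
```

After the transposition, each region is one observation described by its hourly values, which is the layout k-means expects.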

In [None]:
df_solar_transposed = df_solar_nu[-24*365:].T
df_solar_transposed.tail(2)

In [None]:
def plot_elbow_scores(df_, cluster_nb):
    """Plot inertia and silhouette score for k = 2 .. cluster_nb - 1."""
    ks = list(range(2, cluster_nb))
    km_inertias, km_scores = [], []

    for k in ks:
        km = KMeans(n_clusters=k).fit(df_)
        km_inertias.append(km.inertia_)
        km_scores.append(silhouette_score(df_, km.labels_))

    # x and y are passed as keywords: recent seaborn versions reject positional vectors
    sns.lineplot(x=ks, y=km_inertias)
    plt.title('elbow graph / inertia depending on k')
    plt.show()

    sns.lineplot(x=ks, y=km_scores)
    plt.title('scores depending on k')
    plt.show()

plot_elbow_scores(df_solar_transposed, 20)

The best number of clusters k seems to be 7, even though there is no clear elbow on the first plot.

## On the country level

Let's do the same thing, this time at the country level and keeping roughly the last ten years of records:

In [None]:
df_solar_transposed = df_solar_co[-24*365*10:].T
plot_elbow_scores(df_solar_transposed, 20)

The best number of clusters k seems to be 6, even though there is no clear elbow on the first plot.

Finally, we can keep this number k of clusters and retrieve information on each group, such as the number of countries and their names:

In [None]:
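# work on a copy so that the label column is not added to df_solar_transposed
X = df_solar_transposed.copy()

km = KMeans(n_clusters=6).fit(X)
X['label'] = km.labels_
print("Cluster nb / Nb of countries in the cluster")
print(X.label.value_counts())

print("\nCountries grouped by cluster")
for k in range(6):
    # fall back to the raw code if a column is missing from country_dict
    names = [f"{country_dict.get(c, c)} ({c})" for c in X[X.label == k].index]
    print(f'\ncluster nb {k} : ', ", ".join(names))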

---
# Conclusions

In this first part, we've built clusters of countries and regions with similar solar generation profiles. This will be convenient in the second part, where we'll analyze in depth the data for one representative country per cluster instead of every country.

References :
- [the European Commission's SETIS program](https://setis.ec.europa.eu/about-setis)
- [Hands-On Unsupervised Learning Using Python](https://www.oreilly.com/library/view/hands-on-unsupervised-learning/9781492035633/)
- [Towards Data Science](https://towardsdatascience.com/k-means-clustering-from-a-to-z-f6242a314e9a)