## Spain Energy Analysis 
Forecasting is widely used in energy markets to support the transition to renewable energy: it helps anticipate energy demand and energy generation under given weather conditions or seasonal trends. Machine learning on time-series datasets is commonly used to model these dynamics.

This analysis aims to practice data preprocessing, implement algorithms on time-series datasets, and evaluate models using different metrics. This notebook prepares the dataset for further analysis: it explores the energy dataset and interpolates the missing values.

In [None]:
# data analysis, wrangling and preprocessing
import numpy as np 
import pandas as pd
import datetime

# data visualization
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight') # change the style of the plot

# missing data interpolation
from scipy.interpolate import CubicSpline

### Loading data 
We start by loading the dataset into a pandas DataFrame.

In [None]:
energy = pd.read_csv('../input/energy-consumption-generation-prices-and-weather/energy_dataset.csv')
energy.head()

We use the `info` method to check whether the dataset has missing values and to see the data type of each column.

In [None]:
energy.info()

Some takeaways from the output above:
* Two columns (`generation hydro pumped storage aggregated` and `forecast wind offshore eday ahead`, spelled as in the drop list below) don't have any values in them. We will drop these two columns. We will also drop the columns that are not useful to our analysis (`total load forecast`, `forecast solar day ahead`, `forecast wind onshore day ahead`, `price day ahead`).
* Some of the columns may have missing values. We will check the missing rate of each column and decide how to deal with them.
* We will shorten the column names, because the current names are unnecessarily long.

Drop columns that are uninformative for the analysis

In [None]:
dropList = ['generation hydro pumped storage aggregated', 'forecast wind offshore eday ahead',
'total load forecast','forecast solar day ahead','forecast wind onshore day ahead',
'price day ahead']
energy.drop(dropList, axis=1, inplace=True)

Find the missing rate for each column

In [None]:
energy.isnull().mean()

All of the columns that have missing data have a missing rate of less than 1%. We will conduct a missingness analysis to see if those missing values are coming from a specific period of time. 
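
The missing-rate check can also be expressed in percent and sorted, which makes the "less than 1%" observation easier to read off. The following is a minimal sketch on a made-up toy frame; `toy` and its values are assumptions for illustration, not the real data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `energy`; the values are invented for illustration.
toy = pd.DataFrame({
    "solar": [1.0, np.nan, 3.0, 4.0],
    "wind": [5.0, 6.0, 7.0, 8.0],
    "energy demand": [np.nan, np.nan, 2.0, 1.0],
})

# isnull() gives a boolean mask; the mean of booleans is the share of True values.
missing_rate = toy.isnull().mean().mul(100).sort_values(ascending=False)

# Keep only the columns that actually have gaps.
missing_rate = missing_rate[missing_rate > 0]
print(missing_rate)
```

On the real dataset, the same expression applied to `energy` would list only the columns that need attention.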

#### Rename columns
We drop the unnecessary text in the column names and rename `total load actual` to `energy demand`.

In [None]:
energy.rename(columns=lambda x: x[11:] if 'generation' in x else x, inplace=True)
energy.rename(columns={'total load actual':'energy demand'}, inplace=True)
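
The slice `x[11:]` works because the prefix `generation ` (including the trailing space) is 11 characters long. A slightly more defensive variant, sketched below on hypothetical column names, strips the prefix only when a name actually starts with it. The helper `strip_prefix` is an illustrative name, not part of the notebook.

```python
import pandas as pd

# Hypothetical column names in the style of the dataset.
toy = pd.DataFrame(columns=["time", "generation solar",
                            "generation fossil gas", "total load actual"])

prefix = "generation "  # 11 characters, which is where x[11:] comes from


def strip_prefix(name: str) -> str:
    # Only strip when the name really starts with the prefix.
    return name[len(prefix):] if name.startswith(prefix) else name


renamed = (toy.rename(columns=strip_prefix)
              .rename(columns={"total load actual": "energy demand"}))
print(list(renamed.columns))
```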

#### Create different time granularity
To explore the dataset further, we create a `Year-Month` column for later aggregation purposes.

In [None]:
energy['time'] = pd.to_datetime(energy['time'], utc=True)
energy['Year-Month'] = pd.to_datetime(energy['time'].dt.strftime('%Y-%m-01')).dt.date
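
To see what the `Year-Month` bucket looks like, here is a small sketch on two made-up timestamps. It assumes the raw timestamps carry a `+01:00` offset; because the column is converted to UTC first, a timestamp at local midnight on the first of a month falls into the previous month's bucket.

```python
import pandas as pd

# Two invented timestamps with a +01:00 offset (an assumption about the raw format).
raw = ["2015-01-01 00:00:00+01:00", "2015-01-01 01:00:00+01:00"]
times = pd.Series(pd.to_datetime(raw, utc=True))

# Same expression as in the notebook: format as the first day of the month,
# parse again, and keep only the date part.
year_month = pd.to_datetime(times.dt.strftime("%Y-%m-01")).dt.date
print(year_month.tolist())
```

This only matters for rows right at a month boundary, so it should not change the monthly aggregates much.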

### Missingness analysis
Even though the missing rate is below 1% for every column, we still want to check whether the missing values are concentrated in a specific period.

In [None]:
energy[energy.isna().any(axis=1)]

Only 47 records contain a NaN value. We plot these 47 records on a timeline to see if there are any clusters.

In [None]:
sns.swarmplot(x='time', data=energy[energy.isna().any(axis=1)])
plt.xticks(rotation=45)
plt.title('NaN values across time')

In [None]:
energy[energy.isna().any(axis=1)].groupby('Year-Month')['time'].count()

As the timeline and the table of NaN frequency by month above show, the NaN values are scattered across the whole recording period. There are several clusters early on, such as the ones around January 2015 and May 2015, and fewer and fewer records with NaN values more recently. It is possible that the instruments measuring energy generation were still being set up early on, and that a more robust infrastructure for collecting the data is in place now.

### Consolidate different energy sources
We want to consolidate the existing columns into the following columns by summing the related sub-columns for further analysis. Here is a brief description of each column:
* Fossil fuel as `fossil_fuel` (nonrenewable energy formed in the geological past from the remains of living organisms)
* Energy generated from biomass as `biomass` (renewable energy produced by living or once-living organisms)
* Energy generated from hydropower as `hydro` (renewable energy generated by fast-running water)
* Energy generated from nuclear energy as `nuclear` (nonrenewable energy that uses a nuclear reaction to produce electricity)
* Energy generated from wind energy as `wind` (renewable energy that converts kinetic energy in the wind into mechanical power)
* Energy generated from waste as `waste` (waste-to-energy plants make steam and electricity)
* Energy generated from other sources as `others`

In [None]:
fossil_fuel = ['fossil brown coal/lignite', 'fossil gas',
       'fossil hard coal', 'fossil oil']
hydro = ['hydro pumped storage consumption',
       'hydro run-of-river and poundage',
       'hydro water reservoir']
wind = ['wind onshore']
others = ['other', 'other renewable'] 
energy['fossil_fuel'] = energy.loc[:, fossil_fuel].sum(axis=1)
energy['hydro'] = energy.loc[:, hydro].sum(axis=1)
energy['wind'] = energy.loc[:, wind].sum(axis=1)
energy['others'] = energy.loc[:, others].sum(axis=1)
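
One detail of `sum(axis=1)` is worth keeping in mind: by default pandas skips NaN values, so a row whose sub-columns are all NaN sums to 0 rather than NaN. The sketch below shows this on made-up values and contrasts it with `min_count=1`. The toy frame is an assumption for illustration, and the notebook itself uses the default behaviour.

```python
import numpy as np
import pandas as pd

# Invented values; the last row is missing in every sub-column.
toy = pd.DataFrame({
    "fossil gas": [10.0, np.nan, np.nan],
    "fossil oil": [5.0, 2.0, np.nan],
})

# Default: NaN values are skipped, so an all-NaN row becomes 0.
default_sum = toy.sum(axis=1)

# min_count=1: require at least one valid value, otherwise return NaN.
strict_sum = toy.sum(axis=1, min_count=1)

print(pd.DataFrame({"default": default_sum, "min_count=1": strict_sum}))
```

With the default, missing rows show up as zeros in the consolidated columns, which may account for some of the zero values examined in the next step.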

We now have these aggregated columns:
* `biomass`
* `fossil_fuel`
* `hydro`
* `wind`
* `solar`
* `nuclear`
* `waste`
* `others`

All of them are in megawatts (MW).

#### Find patterns in zero-value entries
We want to see if any zero values in this dataset are meaningful. If not, they can be treated as missing values (NaN) and interpolated together with the other NaN values.

In [None]:
energy_list = ['biomass','fossil_fuel','hydro','wind','solar','nuclear','waste','others']
energy_flat = pd.melt(energy, id_vars='Year-Month', value_vars=energy_list)
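# select the zero-valued entries and sort them chronologically
# (sorting a new frame avoids modifying a slice of energy_flat in place)
tmp = energy_flat.loc[energy_flat.value == 0].sort_values(by='Year-Month')
sns.swarmplot(x='Year-Month', y='variable', data=tmp)
plt.title('Zero Values Across Time Grouped By Energy Source')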
plt.ylabel('Energy Sources')
plt.xlabel('Time (Year-Month)')
plt.xticks(rotation=45)

According to the plot above, the zero values of different energy sources coincide in time. Since they are also sparse and appear mostly early in the series, we can view them as missing values. As a result, we replace all the zero values with NaN and interpolate them together with the other NaN values.

In [None]:
energy[energy_list] = energy[energy_list].replace(['0', 0], np.nan)
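
As a small illustration of this step, the sketch below replaces zeros with NaN on a made-up frame and counts how many entries were converted. The toy values are assumptions; the notebook's call additionally covers the string `'0'`.

```python
import numpy as np
import pandas as pd

# Invented values with a few zeros.
toy = pd.DataFrame({"solar": [0.0, 12.5, 0.0], "wind": [3.0, 0.0, 4.0]})

# Treat zeros as missing so they are interpolated with the other NaN values.
cleaned = toy.replace(0, np.nan)

# Number of entries that became NaN.
n_converted = int(cleaned.isna().sum().sum())
print(n_converted)
```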

### Interpolation
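We use two different methods to interpolate the NaN values in the `energy` dataset.
* If there are multiple NaN values in a close span (roughly 10 rows on either side of the missing value), we use pandas' `quadratic` interpolation to account for the movement of the curve. The call also passes `order=4`, but according to the pandas documentation `order` applies to the `polynomial` and `spline` methods, so the fit here is quadratic.
* If there is only one NaN value in the close span, we fit a `CubicSpline` through its nearest valid neighbours and evaluate it at the missing position.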

We wrote a function below to automate this process.

In [None]:
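def interpolation(column_list, energy):
    """Fill NaN values in place; assumes `energy` has the default integer index."""
    for column in column_list:
        index_list = energy[energy[column].isna()].index
        for index in index_list:
            # skip values already filled by an earlier multi-NaN window
            if not pd.isna(energy.loc[index, column]):
                continue
            # window of roughly 10 rows on either side, clamped at the first row
            lb = max(index - 10, 0)
            up = index + 10
            if energy.iloc[lb:up][column].isna().sum() > 1:
                # several NaN values close together: quadratic interpolation
                temp = energy.iloc[lb:up][column]
                nan_indices = temp[temp.isna()].index
                temp = temp.interpolate(method='quadratic', order=4)
                for nan_index in nan_indices:
                    energy.loc[nan_index, column] = temp.loc[nan_index]
            else:
                # isolated NaN: cubic spline through the nearest valid neighbours
                lb = max(index - 2, 0)
                up = index + 2
                temp = energy.iloc[lb:up][column]
                temp = temp[~temp.isna()]
                X = temp.index.values
                y = temp.values
                cs = CubicSpline(X, y)
                energy.loc[index, column] = cs(index).item(0)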

Call the function `interpolation` and pass in a list of columns that we want to interpolate and the `energy` dataframe.

In [None]:
interpolation(energy_list, energy)
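
To make the two interpolation methods more concrete, here is a minimal sketch on made-up series. The data are sampled from simple polynomials so the estimates can be checked by eye. The window sizes and values are assumptions for illustration and do not reproduce the notebook's function exactly.

```python
import numpy as np
import pandas as pd
from scipy.interpolate import CubicSpline

# Case 1: two adjacent gaps in a series sampled from y = x**2.
s = pd.Series([0.0, 1.0, 4.0, np.nan, np.nan, 25.0, 36.0, 49.0])
filled = s.interpolate(method="quadratic")  # requires SciPy
print(filled.iloc[3], filled.iloc[4])

# Case 2: a single gap at position 3 in a series sampled from y = x**3.
x = np.array([1, 2, 4, 5])       # positions of the valid neighbours
y = x.astype(float) ** 3         # their values
cs = CubicSpline(x, y)           # default boundary condition is 'not-a-knot'
estimate = float(cs(3))          # evaluate the spline at the missing position
print(estimate)
```

Both methods only use nearby points, which keeps each estimate local to the gap being filled.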

We want to make sure there are no NaN or zero values in the `energy` dataframe before going into further analysis.

In [None]:
# count NaN values plus zero values per column; every entry should be 0
energy[energy_list].isna().sum() + (energy[energy_list] == 0).sum()

There is no row with either zero values or NA values after interpolation.

In the next notebook, we will explore the time series of the major sources of energy generation and the relationships between some of the time series.
