# Problem Statement
It is January 2020.

You just started your new position: you've joined the Applied Machine Learning team of an Italian utility. 

You have barely had time to meet your new colleagues, and you already have your first task. Up until now, the company has relied on an external service to forecast the power demand of its customers. It is now time to bring this capability in-house.

Power demand forecasting is a critical task for every utility: power storage is neither cheap nor widely available, so the grid must be kept in balance at all times, that is, production must match demand. In Italy, this is ensured by the free power market, where utilities can trade power production, and by *ad-hoc* actions performed by the Transmission System Operator (TSO).

To safeguard the smooth operation of the system, utilities must produce day-ahead hourly forecasts of the power demand of their customers, and they incur financial penalties for forecast errors.

<div class="alert alert-block alert-warning">
<b>Simplification.</b> 

We will consider daily forecasting of the Italian load, instead of hourly forecasting of the demand from the customers of a specific company.
</div>

However, before jumping into day-ahead forecasting, you are required to produce some long-term models. They will serve as baselines to fall back on in case of any issues with the short-term predictors.

Therefore, your initial problem statement is as follows.
<div class="alert alert-block alert-info">
<b>Problem Statement</b> 
    
Given Italian daily power load data from 2006 to 2019, forecast the daily load in 2020. In general, the model shall be able to produce one-year-ahead forecasts.
</div>

# Data
Historical power load data can be retrieved from the [ENTSO-E portal](https://www.entsoe.eu/data/power-stats/), while newer series are available on the websites of the national TSOs. In Italy, the TSO is Terna, and it [publishes power load and its own forecast](https://www.terna.it/en/electric-system/transparency-report/total-load).

Fortunately, Matteo and Gabriele, the ML Engineers in your team, developed an automated pipeline to ensure that the dataset is constantly updated. The pipeline takes care of harmonizing the different data formats, and checks for outliers or missing data, so that you can trust the consistency of the processed dataset.

Therefore, you can simply retrieve the data from the Amazon S3 URI they shared with you.

In [None]:
# To read data from S3
! pip install pandas s3fs --upgrade

Please restart the kernel if this is the first time you run this notebook.

This is necessary to ensure that we can actually import the libraries we've just installed in the previous cell.

In [None]:
import sagemaker
import pandas as pd

from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

In [None]:
# Configuring the default size for matplotlib plots
import matplotlib.pyplot as plt

plt.rcParams["figure.figsize"] = (20,6)

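We care about daily data, so we aggregate the raw series by summing all the load values that fall on the same day. Since it is January 2020 in our scenario, we also keep only the data up to the end of 2019.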

In [None]:
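# "Now" in our scenario: nothing after this timestamp is known yet
NOW = '2019-12-31 23:59'
raw_data_s3_path = "s3://public-workshop/normalized_data/processed/2006_2022_data.parquet"

raw_df = pd.read_parquet(raw_data_s3_path)
# Sum the raw values by calendar day, then drop everything after NOW
load_df = raw_df.resample('D').sum().loc[:NOW].copy()
load_series = load_df.Load
load_series.head()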
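
To make the daily aggregation concrete, here is a minimal sketch on a synthetic hourly series. The values are made up and the hourly granularity is only an assumption about the raw data. The sketch also counts the samples per day, which is one possible way to spot days whose sum is low simply because some samples are missing.

```python
import numpy as np
import pandas as pd

# Synthetic data (not the real dataset): 60 hourly samples with a constant
# value of 1.0, i.e. two complete days followed by half a day.
hourly_index = pd.date_range(
    "2019-01-01", periods=60, freq=pd.Timedelta(hours=1)
)
hourly_load = pd.Series(np.ones(60), index=hourly_index, name="Load")

# Same aggregation as in the notebook: one value per calendar day.
daily_load = hourly_load.resample("D").sum()

# Number of samples that contributed to each daily value.
samples_per_day = hourly_load.resample("D").count()

print(daily_load)
print(samples_per_day)
```

In this toy series the last day only has 12 samples, so its daily sum is half that of the complete days even though the underlying value never changes.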

# Exploratory Data Analysis
You have done your homework: the features of power load series are the subject of an extensive literature, so you already know what to look for.

You start by plotting the series and by zooming in on a fortnight.

In [None]:
load_series.plot();

In [None]:
load_series['2019-10-01':'2019-10-15'].plot();

Some features are immediately visible:
- the trend is decreasing
- there are weekly and yearly seasonalities
- summer is the period of highest consumption, due to air conditioning
- the size of the peak changes from year to year, due to weather conditions
- there are some drops, possibly caused by holidays, which reduce the demand from industrial plants

You further explore the trend with a moving average filter.

In [None]:
load_series.rolling(365).mean().plot();
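
The following sketch illustrates, on synthetic data, what a 365-day moving average does: it averages out a yearly cycle and leaves the trend. The series, its slope and its amplitude are invented for illustration. It also compares the trailing window used above with a centered one (`center=True`), which is a possible alternative when the smoothed curve should line up in time with the original series.

```python
import numpy as np
import pandas as pd

# Synthetic daily series (not the real load): a linear downward trend
# plus a yearly sinusoid with a period of exactly 365 days.
n_days = 3 * 365
t = np.arange(n_days)
trend = 1000.0 - 0.01 * t
yearly = 50.0 * np.sin(2 * np.pi * t / 365)
index = pd.date_range("2017-01-01", periods=n_days, freq="D")
series = pd.Series(trend + yearly, index=index)

# Trailing window: each value is the mean of the previous 365 days.
trailing = series.rolling(365).mean()

# Centered window: each value is the mean of the 365 days around it.
centered = series.rolling(365, center=True).mean()

# Both curves recover the same trend values, but the trailing one
# reports them later in time.
lag = trailing.dropna().index[0] - centered.dropna().index[0]
print(lag)
```

Both versions leave 364 undefined values: the trailing window at the start of the series, the centered one split between start and end.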

You have a look at the autocorrelation and partial autocorrelation functions to further confirm the seasonality.

In [None]:
plot_acf(load_series);

In [None]:
plot_pacf(load_series);

The ACF shows a weekly seasonality, as well as a longer periodicity, which we may assume to be yearly. This can be confirmed by plotting the periodogram, an estimator of the power spectral density of the time series. That analysis is omitted here.
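
For readers who want to see the idea, here is a minimal sketch of a periodogram computed with a plain FFT on a synthetic series. The amplitudes, the noise level and the series length are assumptions chosen for illustration, and this is not the analysis of the real load data. `scipy.signal.periodogram` would be an alternative to the manual computation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic daily series with a yearly and a weekly cycle plus noise.
# The length is 7 * 365 days, so both periods fit an integer number of times.
n_days = 7 * 365
t = np.arange(n_days)
series = (
    50.0 * np.sin(2 * np.pi * t / 365)      # yearly cycle
    + 30.0 * np.sin(2 * np.pi * t / 7)      # weekly cycle
    + rng.normal(scale=5.0, size=n_days)    # noise
)

# Periodogram: squared magnitude of the FFT of the demeaned series.
demeaned = series - series.mean()
power = np.abs(np.fft.rfft(demeaned)) ** 2 / n_days
freqs = np.fft.rfftfreq(n_days, d=1.0)  # in cycles per day

# Skip the zero frequency and pick the two strongest peaks.
strongest = np.argsort(power[1:])[-2:] + 1
dominant_periods = np.sort(1.0 / freqs[strongest])
print(dominant_periods)
```

By construction, the two dominant periods of this toy series are 7 and 365 days. On the real series, the same computation could be applied to `load_series.to_numpy()`.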

Finally, you plot the power demand year-over-year, to better appreciate the effect of holidays and weather.

In [None]:
year_over_year_df = pd.DataFrame({
    'load': load_series,
    'day_in_year': load_series.index.dayofyear,
    'day_in_week': load_series.index.dayofweek, # Monday = 0, Sunday = 6
    'year': load_series.index.year
})
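for year, year_df in year_over_year_df.groupby('year'):
    # Shift each year by the weekday of its first day, so that the same
    # position on the x axis corresponds to the same day of the week
    plt.plot(year_df.day_in_year, year_df.load.shift(year_df.day_in_week.iloc[0]), label=year)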
plt.legend()
plt.xlabel('Adjusted day in year - if each year started on Monday')
plt.show()
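
The weekday alignment used in the plot above can be checked on a toy example. In this sketch the "load" of each day is simply its day of the week, which is a made-up stand-in for the real values, so that the effect of `shift` is easy to read.

```python
import numpy as np
import pandas as pd

# Synthetic "load": each day's value is its day of week (Monday = 0).
index = pd.date_range("2019-01-01", "2019-12-31", freq="D")
weekday_series = pd.Series(index.dayofweek.to_numpy(), index=index)

# Weekday of January 1st: the number of positions to shift by.
first_weekday = int(weekday_series.iloc[0])
aligned = weekday_series.shift(first_weekday)

# After the shift, the value at position k belongs to weekday k % 7,
# as if the year had started on a Monday.
positions = np.arange(len(aligned))
valid = aligned.notna().to_numpy()
print(first_weekday)
print(aligned.head(8))
```

Two side effects are worth keeping in mind: the first `first_weekday` positions become missing values, and the same number of days at the end of the year falls off the plot.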

# EDA conclusions
You have confirmed the most notable features of the power demand:
- the trend is decreasing
- there are strong weekly and yearly seasonalities
- summer is the period of highest consumption, due to air conditioning
- the size of the peak changes from year to year, due to weather conditions
- there are some drops, possibly caused by holidays, which reduce the demand from industrial plants
- trend and seasonal structure explain neither the effect of moving holidays (e.g. Easter) nor the influence of weather
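
To see why a fixed yearly seasonality cannot capture a moving holiday, the sketch below lists where Easter falls in each year. It relies on `dateutil.easter.easter`, which ships with `python-dateutil` (a pandas dependency). The choice of Easter Monday, a public holiday in Italy, as the day of interest is an assumption made for illustration.

```python
from datetime import timedelta

from dateutil.easter import easter

# Easter Sunday (Western calendar) for the years covered by the data,
# plus the forecast year.
easter_sundays = {year: easter(year) for year in range(2006, 2021)}

# Easter Monday is the day after Easter Sunday.
easter_mondays = {
    year: day + timedelta(days=1) for year, day in easter_sundays.items()
}

# Day of year of Easter Monday: it changes from one year to the next.
easter_monday_doy = {
    year: day.timetuple().tm_yday for year, day in easter_mondays.items()
}

for year in (2018, 2019, 2020):
    print(year, easter_mondays[year], easter_monday_doy[year])

# Spread, in days, between the earliest and the latest Easter Monday.
doy_spread = max(easter_monday_doy.values()) - min(easter_monday_doy.values())
print(doy_spread)
```

Because the holiday lands on a different day of the year each time, a model would need an explicit holiday indicator, or a similar feature, to account for the corresponding drop.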

# Next Steps
You are getting familiar with the data, aren't you? But you may wonder: how did Matteo and Gabriele retrieve that data in the first place?