# EDA Capstone Project - Electricity Forecast
---

Information: <br>
The aim of this notebook is to perform a thorough exploratory data analysis (EDA) of the dataset.
<br><br>
Expected Output:
 * Visualization of important features (stored in ../eda_plots)
 * Precise insights for data cleaning and data preparation

---

## Setup

In [1]:
# Main data packages. 
import numpy as np
import pandas as pd

# Statistics, smoothing, and calendar helpers
import statsmodels.formula.api as smf
from statsmodels.tsa.seasonal import seasonal_decompose
from scipy.ndimage import gaussian_filter
from calendar import monthrange, month_name

# Data viz.
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import seaborn as sns
sns.set_style(
    style='darkgrid', 
    rc={'axes.facecolor': 'white', 'grid.color': '.8'}
)
NF_ORANGE = '#ff5a36'
NF_BLUE = '#163251'
cmaps_hex = ['#193251','#FF5A36','#696969', '#7589A2','#FF5A36', '#DB6668']
sns.set_palette(palette=cmaps_hex)
sns_c = sns.color_palette(palette=cmaps_hex)
%matplotlib inline
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()

plt.rcParams['figure.figsize'] = [12, 6]
plt.rcParams['figure.dpi'] = 300

---

## Start EDA

---

### Load and inspect data

In [2]:
# load data
# df = pd.read_csv('')

In [3]:
# inspect data
# df.head()

In [None]:
# Check for dtypes and NAs
# df.info()

In [None]:
# Check numeric features
# df.describe()

In [None]:
# Check duplicates
# print(df.duplicated().sum())

# if df.duplicated().sum() > 0:
#     df = df.drop_duplicates()

First insights:
 * No. of rows:
 * No. of columns:
 * Mismatching dtypes:
 * NAs:
 * Duplicated rows:
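
The checklist above can be filled in programmatically. The following is a minimal sketch, assuming the data has already been loaded into a DataFrame. The helper name `first_insights` and the toy frame are hypothetical and only illustrate the idea; they are not part of the project data.

```python
import numpy as np
import pandas as pd


def first_insights(df: pd.DataFrame) -> dict:
    """Collect the basic facts asked for in the 'First insights' checklist."""
    return {
        'n_rows': df.shape[0],
        'n_columns': df.shape[1],
        'dtypes': df.dtypes.astype(str).to_dict(),
        'na_per_column': df.isna().sum().to_dict(),
        'n_duplicated_rows': int(df.duplicated().sum()),
    }


# Toy data (made up for illustration, not the project dataset)
toy_df = pd.DataFrame({
    'time': ['2020-01-01', '2020-01-02', '2020-01-02', '2020-01-03'],
    'price': [30.0, 32.5, 32.5, np.nan],
})

summary = first_insights(toy_df)
print(summary)
```

In this toy frame `time` is stored as text rather than as a datetime, which is the kind of mismatching dtype worth noting down before cleaning.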

---


### Check features

Suggestion: use Seaborn for plotting.

In [None]:
# Template for visualization over time
fig, ax = plt.subplots()
sns.lineplot(x='time_variable', y='y_variable', data=df, ax=ax)
ax.set(title='Title')
ax.set_xlabel('Time_Name')
ax.set_ylabel('y_variable_name [unit]');

fig.savefig("../eda_plots/plot_name_that_makes_sense.png")

In [4]:
# check more features


In [5]:
# check more features


---
### Check trend

Here we want to see the trend of the price over the year(s). Later on, we can subtract this rolling mean to de-trend the data and focus on seasonality (weekly/daily).

Open question: is de-trending necessary for all variables?
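
The de-trending step described above can be sketched as follows. This is a minimal illustration on a synthetic series, not the project data; the column names `price_trend` and `price_detrended` and the 7-day window are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Synthetic daily series (made up): linear trend plus a repeating weekly pattern
dates = pd.date_range('2020-01-01', periods=10 * 7, freq='D')
trend = np.linspace(20.0, 40.0, len(dates))
weekly = np.tile([0, 1, 2, 3, 2, 1, 0], 10)
toy = pd.DataFrame({'date': dates, 'price': trend + weekly})

# Centered rolling mean as trend estimate; the window spans one full week,
# so the weekly pattern averages out inside each window
window = 7
toy['price_trend'] = toy['price'].rolling(window=window, center=True).mean()

# Subtract the trend to keep only the seasonal (here: weekly) component
toy['price_detrended'] = toy['price'] - toy['price_trend']

# The first and last (window // 2) days have no full window and stay NaN
print(toy[['date', 'price', 'price_trend', 'price_detrended']].head(10))
```

Because the window covers exactly one period of the weekly pattern, the de-trended column recovers that pattern centered around zero. With real data, the window length decides which seasonality survives the subtraction.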

#### Price

In [None]:
# Plot moving averages of different lengths (week, month, year)
ma = [7, 30, 365]

# NOTE: daily_data_df is not defined in this template; it is assumed to hold
# the daily data, indexed by 'date', with a 'price' column.
smooth_daily_data_df = daily_data_df \
    .reset_index() \
    .assign(date=lambda x: pd.to_datetime(x['date']))

# Smooth and plot: one panel per window, plus a final panel for the longest window alone
fig, ax = plt.subplots(len(ma)+1, 1, constrained_layout=True, sharex=True)
plt.suptitle('Price Development (Daily) - Moving Average Smoothing', y=1.02);

for i, m in enumerate(ma):
    # Centered rolling mean over m days
    smooth_daily_data_df[f'price_smooth_ma_{m}'] = smooth_daily_data_df['price'].rolling(window=m, center=True).mean()

    sns.lineplot(x='date', y='price', label='Price (Signal)', data=smooth_daily_data_df, ax=ax[i])
    sns.lineplot(x='date', y=f'price_smooth_ma_{m}', label=f'Price smoothed:\n ma = {m} days', data=smooth_daily_data_df, color=NF_ORANGE, ax=ax[i])

    ax[i].legend(title='', loc='center left', bbox_to_anchor=(1, 0.5))
    ax[i].set(title='', ylabel='$');  # price axis, not temperature

# Last panel: smoothed series of the longest window only
longest = ma[-1]
sns.lineplot(x='date', y=f'price_smooth_ma_{longest}', label=f'Price smoothed:\n ma = {longest} days', data=smooth_daily_data_df, color=sns_c[1], ax=ax[-1])
ax[-1].legend(title='', loc='center left', bbox_to_anchor=(1, 0.5))
ax[-1].set(title='', ylabel='$')

fig.savefig("../eda_plots/Price_MA_Smoothing.png")

#### Temperature

In [None]:
# Plot moving averages of different lengths (week, month, year)
ma = [7, 30, 365]

# NOTE: daily_data_df is not defined in this template; it is assumed to hold
# the daily data, indexed by 'date', with a 'temperature' column.
smooth_daily_data_df = daily_data_df \
    .reset_index() \
    .assign(date=lambda x: pd.to_datetime(x['date']))

# Smooth and plot: one panel per window, plus a final panel for the longest window alone
fig, ax = plt.subplots(len(ma)+1, 1, figsize=(12, 9), constrained_layout=True, sharex=True)
plt.suptitle('Temperature (Daily) - Moving Average Smoothing', y=1.02);

for i, m in enumerate(ma):
    # Centered rolling mean over m days
    smooth_daily_data_df[f'temp_smooth_ma_{m}'] = smooth_daily_data_df['temperature'].rolling(window=m, center=True).mean()

    sns.lineplot(x='date', y='temperature', label='Temperature (Signal)', data=smooth_daily_data_df, ax=ax[i])
    sns.lineplot(x='date', y=f'temp_smooth_ma_{m}', label=f'Temperature smoothed:\n ma = {m} days', data=smooth_daily_data_df, color=NF_ORANGE, ax=ax[i])

    ax[i].legend(title='', loc='center left', bbox_to_anchor=(1, 0.5))
    ax[i].set(title='', ylabel=r'$^\circ$C');

# Last panel: smoothed series of the longest window only
longest = ma[-1]
sns.lineplot(x='date', y=f'temp_smooth_ma_{longest}', label=f'Temperature smoothed:\n ma = {longest} days', data=smooth_daily_data_df, color=sns_c[1], ax=ax[-1])
ax[-1].legend(title='', loc='center left', bbox_to_anchor=(1, 0.5))
ax[-1].set(title='', ylabel=r'$^\circ$C');

fig.savefig("../eda_plots/Temp_MA_Smoothing.png")

---

## Insights

The following steps need to be addressed in the (automated) data cleaning and feature engineering/selection process:
 * Step 1:

 * Step 2:
 
 * Step 3:

---