# Exploring the Data

## Table of Contents
* [Loading the Data](#getting_data)
* [Data Profiling](#data_profiling)
* [Missing values](#missing)
* [Outliers](#outliers)
* [Correlations](#corr)
* [Transformations identification](#transformations)
* [Exploring external data](#extra-data)

In [None]:
# Libraries

%load_ext autoreload
%autoreload 2
%matplotlib inline

import os
import pandas as pd
import numpy as np
import datetime as dt
import gc
import missingno as msno
import pandas_profiling
import statsmodels as sm
from statsmodels.tsa.seasonal import seasonal_decompose
import random
import re

from wind_power_forecasting.nodes import data_exploration as dexp
from wind_power_forecasting.nodes import data_transformation as dtr

#visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly as pty

import plotly.graph_objs as go
from plotly.subplots import make_subplots
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import cufflinks as cf
cf.set_config_file(offline=True)

# Ignore warnings (SciPy issue #5998)
import warnings
warnings.filterwarnings(action="ignore", message="^internal gelsd")

## Loading the data
<a id="getting_data"></a>

In [None]:
# Raw data for WF1
X_train = context.catalog.load("X_train_raw")
y_train = context.catalog.load("y_train_raw")
X_test = context.catalog.load("X_test_raw")

# Data set for EDA
eda_df = context.catalog.load("df_WF1")

## Data Profiling
<a id="data_profiling"></a>

Data profile by `NWP`:

In [None]:
# One profiling report per Numerical Weather Predictor (NWP);
# pandas_profiling is already imported in the libraries cell.
nwps = list(eda_df['NWP'].unique())
profiles = dexp.get_report_by_NWP(eda_df, nwps)

In [None]:
# export data profiles to html
REPORTS_LOC = "../../reports/WF1/"
dexp.export_reports('WF1', profiles, REPORTS_LOC)

## Missing values
<a id="missing"></a>

In [None]:
missing_vals = dexp.get_missing_percentage(
    eda_df.set_index(['NWP', 'fc_day', 'run']),
    ['NWP', 'fc_day', 'run']
)

missing_vals.head()

In [None]:
# Reduced data set: keep only day-D forecasts from the 00h run
eda_df_rced = eda_df[(eda_df.fc_day == 'D') & (eda_df.run == '00h')]

## Outliers
<a id="outliers"></a>

In [None]:
# box-plots

for nwp in [1,2,3,4]:
    eda_df_rced.loc[
        eda_df_rced['NWP'] == nwp, 
        [
            'U',
            'V',
            'T',
            'CLCT',
            'production'
        ]
    ].iplot(
        subplots=True, 
        shape=(2,3),
        kind='box', 
        boxpoints='outliers',
        filename='cufflinks/box-plots'
    )

According to the box plots, the variables that may contain outliers are:
* `U` and `V`
* `production`

Temperature (`T`) and cloud cover (`CLCT`) show no anomalous values in their box plots.
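
As a rough numerical cross-check of the box plots, the sketch below counts the values that fall outside the usual 1.5 × IQR whiskers. The helper name `count_iqr_outliers` and the toy data are hypothetical, and the quartile method may differ slightly from the one plotly uses to draw the whiskers, so the counts are only indicative.

```python
import pandas as pd


def count_iqr_outliers(df: pd.DataFrame, columns, k: float = 1.5) -> pd.Series:
    """Count, per column, the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[columns].quantile(0.25)
    q3 = df[columns].quantile(0.75)
    iqr = q3 - q1
    # Compare each column against its own lower and upper fence.
    below = df[columns].lt(q1 - k * iqr, axis="columns")
    above = df[columns].gt(q3 + k * iqr, axis="columns")
    return (below | above).sum()


# Toy data standing in for one NWP slice of eda_df_rced (values are made up).
toy = pd.DataFrame({
    "U": [1.0, 2.0, 3.0, 4.0, 50.0],
    "T": [280.0, 281.0, 282.0, 283.0, 284.0],
})
outlier_counts = count_iqr_outliers(toy, ["U", "T"])
print(outlier_counts)
```

On the real data, the same helper could be applied to each `NWP` slice of `eda_df_rced` with the columns `['U', 'V', 'T', 'CLCT', 'production']`.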

In [None]:
# Time series visualization
for nwp in [1,2,3,4]:
    eda_df_rced.loc[
        eda_df_rced['NWP'] == nwp,
        ['time', 'U', 'V', 'T', 'CLCT', 'production']
    ].set_index('time').iplot(
        kind='scatter',
        filename='cufflinks/cf-simple-line'
    )

## Correlations
<a id="corr"></a>

Let's have a look at the linear correlations between predictors and the target attribute.

In [None]:
sns.pairplot(
    eda_df_rced,
    vars = ['U','V','T','CLCT','production'],
    diag_kind='kde'
)

In [None]:
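# Keep numeric columns only: pandas 2.0 and later raise an error
# when corr() meets non-numeric columns such as 'fc_day' or 'run'.
eda_df.select_dtypes(include='number').corr().iplot(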
    kind='heatmap', 
    colorscale='spectral',
    filename='cufflinks/simple-heatmap'
)
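
To read the predictor-to-target relationships as numbers rather than colours, a minimal sketch follows. It ranks the predictors by the absolute value of their Pearson correlation with the target. The helper name `target_correlations` and the toy values are hypothetical; on the real data the target column would be `production`.

```python
import pandas as pd


def target_correlations(df: pd.DataFrame, predictors, target: str) -> pd.Series:
    """Pearson correlation of each predictor with the target, strongest first."""
    corr = df[predictors].corrwith(df[target])
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Toy data (made up): U rises with production, T falls, CLCT is unrelated.
toy = pd.DataFrame({
    "U": [1.0, 2.0, 3.0, 4.0, 5.0],
    "T": [5.0, 4.0, 3.0, 2.0, 1.0],
    "CLCT": [1.0, 0.0, 1.0, 0.0, 1.0],
    "production": [2.0, 4.0, 6.0, 8.0, 10.0],
})
ranked = target_correlations(toy, ["U", "T", "CLCT"], "production")
print(ranked)
```

Pearson correlation only measures linear dependence, so a low value here does not rule out a nonlinear relationship; the pair plot remains the better view for that.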

## Transformations identification
<a id="transformations"></a>

Transformations identified:
* Impute missing values
* Create new features:
    - Wind velocity module (`w_vel`)
    - Wind direction (`w_dir`)
    - Wind velocity scaled to the height of the turbine. We can calculate it by using
    
        $$u(z) = u(z_0)\left(\frac{z}{z_0}\right)^{\alpha},$$ 
      
      with $\alpha = 1/7$, $z_0 = 100$ or $10$ meters, depending on the Wind Farm, and $z = 50$ m, the height of the turbines.
    - New features with the mean values of meteorological variables for every Numerical Weather Predictor (`U`, `V`, `T`, `CLCT`)
    - Date-time feature encoding to capture seasonality (`month`, `day_of_month`, `hour`)
    - Binning `CLCT` due to its bimodal distribution (which depends on local weather, i.e., on each Wind Farm).
    - Cyclical encoding for wind direction and date-time features.
* Convert temperature units to $^\circ$C.
* Standard scaling of variables
* Outlier treatment (to be defined, using the extra data to identify anomalies).
* Feature selection
* Feature selection
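
The wind-related features and the cyclical encoding listed above can be sketched as follows. This is only an illustration under stated assumptions: `U` and `V` are taken to be the eastward and northward wind components, `w_dir` follows the meteorological convention (direction the wind blows from, in degrees clockwise from north), and the helper names and the `w_vel_hub`, `*_sin` and `*_cos` column names are hypothetical.

```python
import numpy as np
import pandas as pd


def add_wind_features(df: pd.DataFrame, z0: float = 100.0, z: float = 50.0,
                      alpha: float = 1 / 7) -> pd.DataFrame:
    """Add wind velocity module, wind direction and hub-height velocity."""
    out = df.copy()
    # Wind velocity module from the two horizontal components.
    out["w_vel"] = np.hypot(out["U"], out["V"])
    # Assumed convention: direction the wind comes from, 0 deg = north, 90 deg = east.
    out["w_dir"] = np.degrees(np.arctan2(-out["U"], -out["V"])) % 360
    # Power law u(z) = u(z0) * (z / z0) ** alpha, from height z0 to turbine height z.
    out["w_vel_hub"] = out["w_vel"] * (z / z0) ** alpha
    return out


def add_cyclical(df: pd.DataFrame, column: str, period: float) -> pd.DataFrame:
    """Encode a periodic column as a (sin, cos) pair so the ends of the cycle meet."""
    out = df.copy()
    angle = 2 * np.pi * out[column] / period
    out[f"{column}_sin"] = np.sin(angle)
    out[f"{column}_cos"] = np.cos(angle)
    return out


# Toy rows (made up): wind from the north, wind from the west, and a mixed case.
toy = pd.DataFrame({
    "U": [0.0, 5.0, 3.0],
    "V": [-5.0, 0.0, 4.0],
    "hour": [0, 6, 12],
})
features = add_wind_features(toy)          # z0 = 100 m, z = 50 m, alpha = 1/7
features = add_cyclical(features, "hour", period=24)
features = add_cyclical(features, "w_dir", period=360)
print(features.round(3))
```

For a Wind Farm whose forecasts are given at 10 m, the same function would be called with `z0=10.0`. The (sin, cos) pair keeps hour 23 close to hour 0 and 359° close to 0°, which a plain integer or degree encoding would not.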