# Missing Values Imputation
This notebook loads the preprocessed data and imputes the missing values for each station.

## 0 - Setup

### 0.1 - Imports
Load the necessary dependencies.

In [1]:
from ydata.connectors import LocalConnector, GCSConnector
from ydata.utils.formats import read_json
from ydata.quality.impute.timeseries import TSMissingImputer



### 0.2 - Auxiliary Functions
The auxiliary functions are custom-designed utilities developed for the REE use case.

In [2]:
from ydata.dataset import Dataset

from utils import setting_index_data
from imputation import get_cold_start_meters, resample_station_data, data_boundaries, load_factors

## 1 - Load Data
The training data comprises the preprocessed readings up to August 2021.

In [4]:
# Create the connector for local storage
connector = LocalConnector()

# Read the train data
data = connector.read_file('train_allmeters.csv')

In [5]:
# Load the factors
add_factors = load_factors('df_factors_2018_2021.json')

## 2 - Data Processing

### 2.1 - Data Wrangling
Parse the data into the correct types and with the right format.

In [6]:
# Preprocess data to be ready for imputation
data = setting_index_data(data)

### 2.2 - Cold Start
Cold-start meters (i.e. meters without any observed values) should be handled separately from the rest of the meters, so they are excluded from training here.

In [7]:
# Get a list of cold-start meters
cold_start_meters = get_cold_start_meters(data)
cold_start_meters

['aysha1', 'gode1']
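
The implementation of `get_cold_start_meters` is not shown in this notebook. As a rough sketch of the idea, assuming a cold-start meter is a station whose `speed` and `direction` columns are entirely missing, the check could look like the following in plain pandas. The function name `find_cold_start_stations` and the sample data are hypothetical.

```python
import numpy as np
import pandas as pd


def find_cold_start_stations(df, station_col="station", value_cols=("speed", "direction")):
    """Return the stations for which every value column is entirely missing.

    Hypothetical helper: an assumption about what a cold-start check may do,
    not the notebook's actual get_cold_start_meters.
    """
    # True for rows holding at least one observed reading
    has_observation = df[list(value_cols)].notna().any(axis=1)
    # A station is cold-start when none of its rows has an observation
    observed_per_station = has_observation.groupby(df[station_col]).any()
    return sorted(observed_per_station[~observed_per_station].index)


# Made-up readings: station "b" has no observed values at all
readings = pd.DataFrame({
    "station": ["a", "a", "b", "b", "c"],
    "speed": [3.2, np.nan, np.nan, np.nan, 5.1],
    "direction": [180.0, 90.0, np.nan, np.nan, np.nan],
})
cold = find_cold_start_stations(readings)
print(cold)
```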

In [8]:
# Remove the cold-start meters from the train data.
train_data = data[~data['station'].isin(cold_start_meters)]

### 2.3 - Data Boundaries

In [9]:
train_data = data_boundaries(train_data, replace_na=True)

In [10]:
train_data.isna().sum()

station           0
speed        470005
direction    468570
dtype: int64

## 3 - Imputer
The `TSMissingImputer` is responsible for imputing the missing values of time-series data. It:
- Learns the temporal dynamics from the observed values
- Supports multiple entities with the `partition_by` parameter
- Follows the usual scikit-learn method interfaces (e.g. fit, transform)
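
To make the fit/transform pattern with per-entity partitioning concrete, here is a deliberately simple stand-in written with pandas only. It is not how `TSMissingImputer` works internally: the class `HourlyMeanImputer` is hypothetical and swaps in a much cruder technique, filling each gap with the entity's mean for that hour of the day. It assumes a `DatetimeIndex` and reuses the `partition_by` and `num_cols` parameter names seen in this notebook.

```python
import numpy as np
import pandas as pd


class HourlyMeanImputer:
    """Toy per-entity imputer (hypothetical, not the ydata implementation)."""

    def fit(self, df, partition_by, num_cols):
        self.partition_by = partition_by
        self.num_cols = list(num_cols)
        # Learn one mean per (entity, hour of day) from the observed values
        keyed = df.assign(_hour=df.index.hour.to_numpy())
        self.profiles_ = keyed.groupby([partition_by, "_hour"])[self.num_cols].mean()
        return self

    def transform(self, df):
        out = df.copy()
        # Look up the learned mean for every row's (entity, hour of day) pair
        keys = pd.MultiIndex.from_arrays(
            [out[self.partition_by].to_numpy(), out.index.hour.to_numpy()]
        )
        fill = self.profiles_.reindex(keys)
        for col in self.num_cols:
            # Keep observed values; use the learned mean only where data is missing
            out[col] = out[col].where(out[col].notna(), fill[col].to_numpy())
        return out


# Two days of made-up hourly readings for one station, with one gap
index = pd.date_range("2021-01-01", periods=48, freq="h")
frame = pd.DataFrame({"station": "s1", "speed": np.arange(48, dtype=float)}, index=index)
frame.iloc[30, frame.columns.get_loc("speed")] = np.nan  # hour 6 of the second day

toy_imputer = HourlyMeanImputer().fit(frame, partition_by="station", num_cols=["speed"])
filled = toy_imputer.transform(frame)
```

A gap is only filled if the same station has at least one observed value for that hour of the day, which is one reason a model that learns temporal dynamics is preferable on real data.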

### 3.1 - Train the TSMissingImputer

In [11]:
# Instantiate the Imputer
imputer = TSMissingImputer()

In [12]:
# Train the Imputer
imputer.fit(train_data, partition_by='station', num_cols=['speed'], add_factors=add_factors)

TSMissingImputer()

### 3.2 - Impute for Full Year
Construct a full year of data, on an hourly basis, for devices with observed readings. For each hour, the average wind speed and wind direction are calculated and used as the ground truth for the observed readings.

In [13]:
# Create a DataFrame of a whole year for all the meters with observed values.
whole_year = resample_station_data(train_data)
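
The body of `resample_station_data` is not part of this notebook. A minimal sketch of the hourly reconstruction described above, assuming a `DatetimeIndex` and a `station` column, might average the readings within each hour and then reindex every station onto a complete hourly range so that missing hours appear as NaN rows. The function name `hourly_grid` and the sample data are hypothetical; only the `start_ts` and `end_ts` parameter names come from the notebook.

```python
import numpy as np
import pandas as pd


def hourly_grid(df, start_ts, end_ts, station_col="station"):
    """Average readings per station and hour, then reindex onto a complete hourly range.

    Hypothetical helper: an assumption about the resampling step,
    not the notebook's actual resample_station_data.
    """
    full_range = pd.date_range(start_ts, end_ts, freq="h", name=df.index.name)
    pieces = []
    for station, group in df.groupby(station_col):
        # Mean of all readings falling inside each hour
        hourly = group.drop(columns=station_col).resample("h").mean()
        # Hours without any reading become NaN rows, ready for imputation
        hourly = hourly.reindex(full_range)
        hourly.insert(0, station_col, station)
        pieces.append(hourly)
    return pd.concat(pieces)


# Made-up readings: two in the first hour, none in the second, one in the third
index = pd.to_datetime(["2021-01-01 00:10", "2021-01-01 00:40", "2021-01-01 02:05"])
raw = pd.DataFrame({"station": "s1", "speed": [2.0, 4.0, 6.0]}, index=index)
grid = hourly_grid(raw, "2021-01-01 00:00", "2021-01-01 03:00")
print(grid)
```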

In [14]:
# Apply the missing values imputation to reconstruct a whole year of data.
reconstructed = imputer.transform(whole_year)

### 3.3 - Impute for Holdout
Construct the full holdout period, on an hourly basis, for devices with observed readings.

In [15]:
# Load the holdout data and preprocess it the same way as the train data
holdout = connector.read_file('holdout_allmeters.csv')
holdout = setting_index_data(holdout)

# Remove the cold-start meters from holdout data.
holdout = holdout[~holdout['station'].isin(cold_start_meters)]

# Build the hourly grid for the holdout period and impute the missing values
whole_holdout = resample_station_data(holdout, start_ts='2021-03-01', end_ts='2021-04-30')
holdout_reconstructed = imputer.transform(whole_holdout)

### 3.4 - Data Validation

In [16]:
# After reconstruction, no value should be missing
assert reconstructed.isna().sum().sum() == 0, "The reconstructed dataset contains missing values after reconstruction."
assert holdout_reconstructed.isna().sum().sum() == 0, "The reconstructed dataset of holdout contains missing values after reconstruction."

### 3.5 - Data Postprocessing
Time-series imputation applies to any type of numerical data and is therefore agnostic to the physical boundaries of wind measurements. To keep the imputed values valid, we enforce that wind speed is never negative and that wind direction stays within the range of angles in degrees (between 0 and 360).

In [17]:
# Postprocess the training data
postprocessed = data_boundaries(data=reconstructed)

# Postprocess the holdout data
postprocessed_holdout = data_boundaries(data=holdout_reconstructed)
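
How `data_boundaries` enforces these limits is not shown here. One possible approach, sketched below under stated assumptions, is to raise negative speeds to zero and wrap directions onto the circle with a modulo. The function name `enforce_wind_boundaries` is hypothetical, and wrapping is an assumption: it maps 360 to 0, whereas clipping to the 0 to 360 range would be an equally plausible reading of the rule.

```python
import pandas as pd


def enforce_wind_boundaries(df, speed_col="speed", direction_col="direction"):
    """Keep wind speed non-negative and wind direction within [0, 360).

    Hypothetical helper: an assumption about the postprocessing rule,
    not the notebook's actual data_boundaries.
    """
    out = df.copy()
    # Wind speed cannot be negative: raise negative values to zero
    out[speed_col] = out[speed_col].clip(lower=0)
    # Wrap angles onto the circle, e.g. -10 becomes 350 and 370 becomes 10
    out[direction_col] = out[direction_col] % 360
    return out


# Made-up imputed values that fall outside the physical range
sample = pd.DataFrame({"speed": [-0.4, 3.0], "direction": [-10.0, 370.0]})
bounded = enforce_wind_boundaries(sample)
print(bounded)
```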

## 4 - Store Data
Once the data is fully reconstructed, write it to storage through the connector.

In [19]:
# Store the whole year reconstructed
connector.write_file(data=postprocessed.reset_index(), path='whole_year_reconstructed.csv', index=True)

# Store the holdout
connector.write_file(data=postprocessed_holdout.reset_index(), path='holdout_reconstructed.csv', index=True)
