# Data Preprocessing
This notebook loads the raw data and preprocesses it for missing-value imputation.

## 0 - Setup

### 0.1 - Imports
Load the necessary dependencies.

In [1]:
%%capture
from ydata.connectors import GCSConnector, LocalConnector
from ydata.utils.formats import read_json

### 0.2 - Auxiliary Functions
The auxiliary functions are custom utilities written for this use case. `preprocess_data` is imported from the local `preprocess` module.

In [2]:
from ydata.dataset import Dataset
from preprocess import preprocess_data

## 1 - Load Data
First extraction: 26 November 2020 - 15 July 2021

Second extraction: 15 July 2021 - 29 September 2021

In [3]:
# Load the credentials
credentials = read_json('gcs_credentials.json')

# Create the connector for Google Cloud Storage
connector = GCSConnector('ydatasynthetic', gcs_credentials=credentials)

# Read the first extraction data 26 Nov 2020 to 15 Jul 2021
data = connector.read_file('gs://pipelines_artifacts/wind_measurements_pipeline/data/ethiopia_wind_lot_nan.csv')

## 2 - Data Processing

Train: 26 November 2020 - 31 August 2021

Holdout: 1 September 2021 - 29 September 2021

Steps:
  1. Select the relevant columns
  2. Convert to Pandas DataFrame
  3. Cast to the correct data types
  4. Join the readings from both extractions into a single table
  5. Split into train and holdout
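
The steps above can be sketched in plain pandas as follows. This is only an illustration, not the actual `preprocess_data` implementation: the column names (`timestamp`, `meter_id`, `wind_speed`), the function name `preprocess_sketch`, and the use of two separate input frames are assumptions made for the example. It also assumes step 2 has already happened, so the inputs are pandas DataFrames.

```python
import pandas as pd

# Hypothetical column names; the real ones are defined in preprocess.py.
COLUMNS = ["timestamp", "meter_id", "wind_speed"]

# Holdout starts on 1 September 2021, as stated above.
HOLDOUT_START = pd.Timestamp("2021-09-01")


def preprocess_sketch(first: pd.DataFrame, second: pd.DataFrame):
    """Illustrative version of the five steps, given two pandas frames."""
    frames = []
    for raw in (first, second):
        # Step 1: keep only the relevant columns.
        df = raw[COLUMNS].copy()
        # Step 3: cast to the correct data types.
        df["timestamp"] = pd.to_datetime(df["timestamp"])
        df["wind_speed"] = pd.to_numeric(df["wind_speed"], errors="coerce")
        frames.append(df)

    # Step 4: join both extractions into a single table.
    # Assumption: readings on the shared boundary day (15 July 2021) may
    # appear in both extractions, so exact duplicates are dropped.
    full = (
        pd.concat(frames, ignore_index=True)
        .drop_duplicates(subset=["timestamp", "meter_id"])
        .set_index("timestamp")
        .sort_index()
    )

    # Step 5: split into train and holdout by date.
    train = full[full.index < HOLDOUT_START]
    holdout = full[full.index >= HOLDOUT_START]
    return train, holdout
```

Unparseable readings are coerced to `NaN` rather than dropped, since the downstream task is missing-value imputation.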

In [4]:
train, holdout = preprocess_data(data)

## 3 - Store Data
Once the data is preprocessed, store the train and holdout sets. This notebook writes them to local CSV files; a cloud storage connector could be used instead.

In [5]:
%%capture
local_connector = LocalConnector()

# Store the train
local_connector.write_file(data=train.reset_index(), path='train_allmeters.csv', index=True)

# Store the holdout
local_connector.write_file(data=holdout.reset_index(), path='holdout_allmeters.csv', index=True)
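
One detail worth checking in the stored files is the combination of `reset_index()` and `index=True`. The sketch below uses plain pandas, not `LocalConnector`, to show what that combination produces with `DataFrame.to_csv`. Whether `write_file` forwards `index` to pandas in the same way is an assumption, and the frame contents and the helper `csv_header` are made up for the example.

```python
import io

import pandas as pd

# Made-up frame with a named datetime index, standing in for `train`.
frame = pd.DataFrame(
    {"wind_speed": [3.5, 4.0]},
    index=pd.to_datetime(["2021-09-01", "2021-09-02"]).rename("timestamp"),
)


def csv_header(df: pd.DataFrame, index: bool) -> str:
    """Return the header line pandas writes for `df`."""
    buffer = io.StringIO()
    df.to_csv(buffer, index=index)
    return buffer.getvalue().splitlines()[0]


# After reset_index(), the old index is a regular column and the new
# index is an unnamed integer range.
header_with_index = csv_header(frame.reset_index(), index=True)
header_without_index = csv_header(frame.reset_index(), index=False)

print(header_with_index)
print(header_without_index)
```

In plain pandas the first header starts with an empty field, because the unnamed integer index is written as an extra leading column. If the stored CSVs show such a column and it is not wanted, passing `index=False` would be the usual fix.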