# 1. Dataset balancing with synthetic data

This set of notebooks is meant to serve as a guideline and quickstart for developing a classifier on imbalanced data. The pipeline template includes the following steps and notebooks:

- **Read dataset**
    - Consume a datasource created in the UI
- **Data profiling**
    - Profile the created dataset to extract summary statistics and other metrics that can be used further down the development, either to define constraints or to implement rules
- **Synthetic data generation & merge with real data**
    - Generation of synthetic records from the class identified as being the least represented
- **Classifier training**
    - Training of a classifier based on the prepared & balanced dataset
- **Validation of the results on a holdout set**
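
The balancing step targets the least represented class. As a minimal sketch, assuming a pandas DataFrame with a `Class` target column, that class and the number of synthetic records needed to match the majority class could be identified as follows. The toy values are illustrative only and are not taken from the dataset used in these notebooks.

```python
import pandas as pd

# Toy data standing in for the real dataset (illustrative values only).
df = pd.DataFrame({
    "Amount": [10.0, 25.5, 3.2, 99.9, 12.1, 47.0, 8.8, 150.0],
    "Class": [0, 0, 0, 0, 0, 0, 1, 1],
})

# Count the records per class and compute each class's share of the data.
class_counts = df["Class"].value_counts()
class_share = class_counts / len(df)

# The least represented class is the candidate for synthetic generation.
minority_class = class_counts.idxmin()

# Number of synthetic records needed to match the majority class size.
n_to_generate = int(class_counts.max() - class_counts.min())

print(f"minority class: {minority_class}, records to generate: {n_to_generate}")
```

Matching the majority class exactly is only one possible target; a smaller number of synthetic records may be enough in practice.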

## Read the data

The first step is to read the data. Data scientists most often work with CSV files as their main data source, so we provide a flexible interface to consume CSV data:

- Through the YData SDK `LocalConnector`: this method automatically returns a Dataset object that is ready to be consumed by downstream applications such as the Synthesizer.
- Through pandas, which is available in the YData image: this method requires the data to be converted into a Dataset if the user wants to leverage downstream applications such as the Synthesizers, or the scale of our distributed computing engine.

In this particular example, the data is read from a datasource that was already created in the UI, rather than from a local CSV file.
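
For the pandas route listed above, a minimal sketch of reading CSV data with `read_csv` could look like this. The in-memory CSV content and its column names are hypothetical stand-ins for a file on disk, and the conversion of the resulting DataFrame into a YData Dataset is deliberately left out.

```python
import io

import pandas as pd

# In-memory CSV standing in for a file on disk (hypothetical content).
csv_text = "Time,Amount,Class\n0,149.62,0\n1,2.69,0\n2,378.66,1\n"

# read_csv accepts a path or any file-like object; dtype pins the target
# column to an integer type instead of relying on type inference.
df = pd.read_csv(io.StringIO(csv_text), dtype={"Class": "int64"})

print(df.shape)
print(df.dtypes)
```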

### Import needed packages

In [1]:
import os

import numpy as np
import pandas as pd

from ydata.utils.formats import read_json
from ydata.dataset import Dataset
from ydata.metadata import Metadata
from ydata.labs.datasources import DataSources

from ydata.connectors.filetype import FileType
from ydata.connectors import GCSConnector

from ydata.dataset.holdout import Holdout

### Read the data from an existing source

In [2]:
dataset_id = os.getenv('datasetid', 'b9cc0c94-f9b9-4dd9-9893-113e6f5244a5')

In [11]:
dataset = DataSources.get(uid=dataset_id).read()

metadata = Metadata(dataset)

[########################################] | 100% Completed | 5.02 s
[########################################] | 100% Completed | 4.83 s
[########################################] | 100% Completed | 7.50 s
[########################################] | 100% Completed | 35.86 s


In case any data type is not correctly identified, it is recommended to update the Metadata.
The way data types are chosen can impact the quality of the generated synthetic data.

In [12]:
metadata.update_datatypes({'Class': 'categorical'})
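
One way to spot columns that may deserve a categorical data type is to look for numeric columns with very few distinct values. The helper below, `low_cardinality_columns`, is a hypothetical illustration in plain pandas and is not part of the YData SDK. The threshold of 10 distinct values is an arbitrary assumption.

```python
import pandas as pd


def low_cardinality_columns(df: pd.DataFrame, max_unique: int = 10) -> list:
    """Return the numeric columns with at most `max_unique` distinct values."""
    numeric = df.select_dtypes(include="number")
    return [col for col in numeric.columns if numeric[col].nunique() <= max_unique]


# Toy data: 'Amount' is continuous, 'Class' is a numerically encoded label.
df = pd.DataFrame({
    "Amount": [float(i) * 1.5 for i in range(20)],
    "Class": [0] * 18 + [1] * 2,
})

# Columns returned here are candidates for a manual data type review.
candidates = low_cardinality_columns(df)
print(candidates)
```

The result is only a list of candidates: whether a column is truly categorical remains a judgment call based on what the column means.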

## Preparing the data for the workflow
Now that the dataset is ready, and before any data analysis or preparation, it is always recommended to create a holdout set. Evaluating on data that was never used for profiling, synthesis, or training gives an unbiased measure of the improvements achieved.

In [15]:
print(type(metadata))

<class 'ydata.metadata.metadata.Metadata'>


In [16]:
holdout = Holdout()
train, test = holdout.get_split(dataset, metadata, strategy='stratified')
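
To illustrate what a stratified split preserves, here is a minimal sketch in plain pandas. It is not how `Holdout` is implemented. The 20% holdout fraction, the seed, and the toy data are assumptions made for this example.

```python
import pandas as pd

# Toy imbalanced data: 90 records of class 0 and 10 records of class 1.
df = pd.DataFrame({
    "Amount": range(100),
    "Class": [0] * 90 + [1] * 10,
})

# Sample 20% of each class separately so that the holdout keeps the
# original class proportions.
holdout_df = df.groupby("Class").sample(frac=0.2, random_state=42)

# Everything that was not sampled into the holdout is used for training.
train_df = df.drop(holdout_df.index)

# Share of the minority class in each split.
print(holdout_df["Class"].mean(), train_df["Class"].mean())
```

With a purely random split, a small holdout could end up with very few minority records, or none at all, which would make the validation metrics unreliable.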

In [18]:
metadata

<ydata.metadata.metadata.Metadata at 0x7f2935662980>

In [6]:
metadata_train = Metadata(train)
metadata_train.update_datatypes({'Class':'categorical'})

[########################################] | 100% Completed | 4.87 s
[########################################] | 100% Completed | 4.26 s
[########################################] | 100% Completed | 5.82 s
[########################################] | 100% Completed | 28.42 s


Pipelines are able to output results for each and every step. In this case, we've decided to output the warnings identified in the metadata.

In [7]:
# Flatten the metadata warnings into one record per (warning, column) pair
warnings = []
for warning, val in metadata.warnings.items():
    for col in val:
        try:
            level = col.details['level'].name
            value = round(col.details['value'], 4)
        except Exception:
            # Fall back to None when the level or value is unavailable
            level = None
            value = None
        warnings.append({'warning': warning, 'column': col.column, 'level': level, 'value': value})

df_warnings = pd.DataFrame(warnings)

## Create pipeline outputs
Here we set the pipeline outputs, which are used to share data and artifacts between the different steps of the pipeline, as well as for visualization.

### Outputs
Depending on the data volume, datasets can also be stored in remote storage such as Google Cloud Storage or an RDBMS.
If they are small, they can be saved as temporary pipeline objects (CSV or pickle).

In [8]:
# Save the train metadata as a pickle file
metadata_train.save('metadata.pkl')

# Save the train dataset as a CSV file
train.to_pandas().to_csv('train.csv')

# Save the holdout set as a CSV file
test.to_pandas().to_csv('holdout.csv')
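
When a later step reads these files back with pandas, note that `to_csv` writes the DataFrame index as an extra first column by default. The sketch below shows the effect on toy data, together with a pickle round trip for comparison. The file names and the temporary directory are only for illustration.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"Amount": [1.5, 2.5], "Class": [0, 1]})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "train.csv")
    pkl_path = os.path.join(tmp, "train.pkl")

    # By default, to_csv also writes the index as an unnamed first column.
    df.to_csv(csv_path)
    with_index = pd.read_csv(csv_path)

    # index_col=0 reads that first column back as the index.
    restored_csv = pd.read_csv(csv_path, index_col=0)

    # Pickle keeps dtypes and index as they were, but is Python-specific.
    df.to_pickle(pkl_path)
    restored_pkl = pd.read_pickle(pkl_path)

print(list(with_index.columns))
print(list(restored_csv.columns))
```

Either reading with `index_col=0` or writing with `index=False` avoids feeding the stray index column to a downstream model as if it were a feature.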

### Visualizations

In [9]:
import json

# Named ui_metadata so the Metadata object created earlier is not overwritten
ui_metadata = {
    'outputs': [
        {
            'type': 'table',
            'storage': 'inline',
            'format': 'csv',
            'header': list(df_warnings.columns),
            'source': df_warnings.to_csv(header=False, index=False)
        }
    ]
}

with open('mlpipeline-ui-metadata.json', 'w') as metadata_file:
    json.dump(ui_metadata, metadata_file)
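
As a quick sanity check on this kind of inline table payload, the sketch below builds one from a toy DataFrame, round-trips it through JSON, and verifies that every CSV row has as many fields as the header. The warning names and values in `df_example` are hypothetical and are not output from the dataset above.

```python
import csv
import io
import json

import pandas as pd

# Toy warnings table (hypothetical names and values).
df_example = pd.DataFrame([
    {"warning": "skewness", "column": "Amount", "level": "HIGH", "value": 16.9776},
    {"warning": "imbalance", "column": "Class", "level": None, "value": None},
])

payload = {
    "outputs": [
        {
            "type": "table",
            "storage": "inline",
            "format": "csv",
            "header": list(df_example.columns),
            # The header is sent separately, so it is left out of the CSV body.
            "source": df_example.to_csv(header=False, index=False),
        }
    ]
}

# Round-trip through JSON, as the file written by the pipeline step would be.
decoded = json.loads(json.dumps(payload))
table = decoded["outputs"][0]

# Parse the inline CSV and compare each row's width with the header.
rows = list(csv.reader(io.StringIO(table["source"])))
widths_match = all(len(row) == len(table["header"]) for row in rows)
print(widths_match)
```

Missing values are written as empty fields, so rows with no level or value still have the same number of columns as the header.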