### Dataloader
---

A lightweight Dataset class that supports:
* saving and loading, either as a whole or in chunks
* processing

Planned for the future:
* meta-data management
* automatic processing/clipping
* validation
* weighting
* splitting


The AIF360 dataset objects may serve as inspiration:
* https://github.com/Trusted-AI/AIF360/blob/master/aif360/datasets/dataset.py
* https://github.com/Trusted-AI/AIF360/blob/master/aif360/datasets/structured_dataset.py
* https://github.com/Trusted-AI/AIF360/blob/746e763191ef46ba3ab5c601b96ce3f6dcb772fd/aif360/datasets/binary_label_dataset.py#L6

The Dataset object essentially has 3 stages:
* Load
* Process
* Save

The exact methods should be customizable, but we should provide a basic framework for the pattern. The current idea is to have a generator that produces the paths used to load and save the data, and to apply the provided processing function to each chunk.

This process will be implemented in pandas.
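The load, process, save pattern described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the rdsutils implementation: `iter_chunks` and `run_pipeline` are hypothetical names, and CSV is used on both ends to keep the dependencies minimal.

```python
from pathlib import Path
from typing import Callable, Iterator, List, Tuple

import pandas as pd


def iter_chunks(src_dir: str, suffix: str = "csv") -> Iterator[Tuple[str, pd.DataFrame]]:
    """Yield (file stem, DataFrame) for every matching file, in sorted order."""
    # Hypothetical helper: one file on disk is treated as one chunk.
    for path in sorted(Path(src_dir).glob(f"*.{suffix}")):
        yield path.stem, pd.read_csv(path)


def run_pipeline(
    src_dir: str,
    dst_dir: str,
    process: Callable[[pd.DataFrame], pd.DataFrame],
) -> List[Path]:
    """Load each chunk, apply `process` to it, and save the result in dst_dir."""
    out_dir = Path(dst_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # create the target if missing
    written = []
    for stem, chunk in iter_chunks(src_dir):
        out_path = out_dir / f"{stem}.csv"
        process(chunk).to_csv(out_path, index=False)
        written.append(out_path)
    return written
```

Because only one chunk is held in memory at a time, the same loop should work for data that does not fit in memory as a whole, provided each individual file does.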

In [1]:
!pip install --upgrade pandas smart_open pyarrow tqdm



In [2]:
import sys, warnings, smart_open, shutil
sys.path.insert(1, '..')

import numpy as np
import pandas as pd
import datetime as dt

%load_ext autoreload
%autoreload 2

In [3]:
from rdsutils.datasets import Dataset, StructuredDataset, DataLoader, DataDumper

#### Example Data
---

In [4]:
from data.titanic.enums import features, target, cat_idx, cat_features, num_features, categorical_encoder

train = pd.read_csv('../data/titanic/train.csv', index_col=0)
valid = pd.read_csv('../data/titanic/valid.csv', index_col=0)
test = pd.concat([train, valid])

#### Load

In [5]:
# load data iteratively with a generator
dl = DataLoader("../data/titanic", suffix="csv")
for fname, df__ in dl:
    print(fname, df__.shape)

test (418, 11)
train (623, 13)
valid (268, 13)


In [6]:
# all loaded file paths
display(dl.get_paths())

['/home/ec2-user/SageMaker/projects-framework/rdsutils/examples/../data/titanic/test.csv',
 '/home/ec2-user/SageMaker/projects-framework/rdsutils/examples/../data/titanic/train.csv',
 '/home/ec2-user/SageMaker/projects-framework/rdsutils/examples/../data/titanic/valid.csv']

In [7]:
# load all files and concat
df = dl.get_full()

#### Save
Use pathlib for path handling:
* https://stackoverflow.com/questions/42407976/how-to-load-multiple-text-files-from-a-folder-into-a-python-list-variable

What is needed for the Dumper?
* provide a directory path; create it if it does not exist
* when given a path, save the data as parquet with the provided keyword arguments
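The two requirements above might look like this in a minimal form. `SimpleDumper` and `path_for` are hypothetical names chosen for illustration; this is not the actual `DataDumper` from rdsutils, whose internals are not shown here. The only pandas call used is `DataFrame.to_parquet`, which needs a parquet engine such as pyarrow.

```python
from pathlib import Path

import pandas as pd


class SimpleDumper:
    """Hypothetical stand-in for DataDumper, for illustration only."""

    def __init__(self, directory: str):
        self.directory = Path(directory)
        # Create the target directory (and any parents) if it does not exist.
        self.directory.mkdir(parents=True, exist_ok=True)

    def path_for(self, name: str) -> Path:
        """Build the output path for a file name given without a suffix."""
        return self.directory / f"{name}.parquet"

    def to_parquet(self, df: pd.DataFrame, name: str, **kwargs) -> Path:
        """Write df as parquet, forwarding kwargs to DataFrame.to_parquet."""
        path = self.path_for(name)
        df.to_parquet(path, **kwargs)  # requires pyarrow or fastparquet
        return path
```

Creating the directory in the constructor means every later write can assume the target exists.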

In [8]:
# how to create a pipeline: load and dump
dl = DataLoader("../data/titanic", suffix="csv")
dd = DataDumper("../data/titanic-copy")

for fname, df_ in dl:
    print(f"saving {fname}")
    dt_str = str(int(dt.datetime.now().timestamp()))
    dd.to_parquet(df_, fname+"_"+dt_str)

saving test
saving train
saving valid


In [9]:
# verify that the loaded and dumped data are the same
dl = DataLoader("../data/titanic-copy", suffix="parquet")
display(dl.get_paths())
for fname, df__ in dl:
    print(fname, df__.shape)

['/home/ec2-user/SageMaker/projects-framework/rdsutils/examples/../data/titanic-copy/test_1621275145.parquet',
 '/home/ec2-user/SageMaker/projects-framework/rdsutils/examples/../data/titanic-copy/train_1621275145.parquet',
 '/home/ec2-user/SageMaker/projects-framework/rdsutils/examples/../data/titanic-copy/valid_1621275145.parquet']

test_1621275145 (418, 11)
train_1621275145 (623, 13)
valid_1621275145 (268, 13)


In [10]:
((df_ == df__) | (df_.isna() & df__.isna())).all()

Unnamed: 0     True
PassengerId    True
Survived       True
Pclass         True
Name           True
Sex            True
Age            True
SibSp          True
Parch          True
Ticket         True
Fare           True
Cabin          True
Embarked       True
dtype: bool
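The comparison above combines two masks because `NaN == NaN` evaluates to False, so a plain `==` would report a mismatch wherever both frames have a missing value. A small sketch of the same check as a reusable helper, where `equal_with_nan` is a hypothetical name and the toy frames are made up for illustration:

```python
import numpy as np
import pandas as pd


def equal_with_nan(left: pd.DataFrame, right: pd.DataFrame) -> pd.Series:
    """Per-column equality that treats NaN in the same position as equal."""
    same_value = left == right                  # False wherever either side is NaN
    both_missing = left.isna() & right.isna()   # True where both sides are NaN
    return (same_value | both_missing).all()


a = pd.DataFrame({"x": [1.0, np.nan], "y": ["p", "q"]})
b = a.copy()
print(equal_with_nan(a, b))
```

pandas also offers `DataFrame.equals`, which treats NaNs in the same location as equal but additionally requires matching dtypes, so it can be stricter than the element-wise check after a CSV to parquet round trip.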

#### Test dumping into multiple files

In [11]:
train = pd.read_csv('../data/titanic/train.csv', index_col=0)
valid = pd.read_csv('../data/titanic/valid.csv', index_col=0)
test = pd.read_csv('../data/titanic/test.csv', index_col=0)

train["type"] = "train"
valid["type"] = "valid"
test["type"] = "test"
df_full = pd.concat([train, valid, test])

df_full.shape

(1309, 13)

In [15]:
dfs = [df_full[df_full.type == t] for t in df_full.type.unique()]

dd = DataDumper("../data/titanic-copy2")
dd.to_parquets(dfs, "by_types")

100%|██████████| 3/3 [00:00<00:00, 337.55it/s]


In [16]:
dl = DataLoader("../data/titanic-copy2", suffix="parquet")
df_full_ = dl.get_full()

In [17]:
# verify the reconstructed data is equivalent to the loaded one
df_ = df_full_.sort_index()
df__ = df_full.sort_index()
((df_ == df__) | (df_.isna() & df__.isna())).all()

PassengerId    True
Survived       True
Pclass         True
Name           True
Sex            True
Age            True
SibSp          True
Parch          True
Ticket         True
Fare           True
Cabin          True
Embarked       True
type           True
dtype: bool

#### Clean up

In [33]:
import shutil

shutil.rmtree("../data/titanic-copy/", ignore_errors=True)
shutil.rmtree("../data/titanic-copy2/", ignore_errors=True)