# Data Pipeline Concept

- We want to use PyTorch `DataLoader`s for efficiency (with `num_workers > 0`, batches are prefetched in background worker processes)
- `DataLoader` requires `Dataset` + `Sampler` + `collate_fn`
    - `Sampler`: produces `index` objects when iterated over.
    - `Dataset`: values are looked up via `Dataset[index]`
    - `collate_fn`: receives `[Dataset[next(iSampler)] for _ in range(batch_size)]`, where `iSampler = iter(Sampler)`, and post-processes it as desired.
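
To make the interplay of the three pieces concrete, here is a minimal pure-Python sketch. It is a simplified stand-in for what `DataLoader` iteration does (no prefetching, no workers), not PyTorch's actual implementation. The names `ListDataset`, `ShuffledSampler`, `iter_batches` and `collate_pairs` are hypothetical and exist only for this illustration.

```python
import random


class ListDataset:
    """Map-style dataset: values are looked up via dataset[index]."""

    def __init__(self, values):
        self.values = values

    def __len__(self):
        return len(self.values)

    def __getitem__(self, index):
        return self.values[index]


class ShuffledSampler:
    """Produces every index exactly once per pass, in random order."""

    def __init__(self, size, seed=0):
        self.size = size
        self.rng = random.Random(seed)

    def __iter__(self):
        indices = list(range(self.size))
        self.rng.shuffle(indices)
        return iter(indices)


def iter_batches(dataset, sampler, collate_fn, batch_size):
    """Simplified stand-in for iterating over a DataLoader."""
    samples = []
    for index in sampler:
        samples.append(dataset[index])
        if len(samples) == batch_size:
            yield collate_fn(samples)
            samples = []
    if samples:  # last, possibly smaller, batch
        yield collate_fn(samples)


def collate_pairs(samples):
    """Turn a list of (x, y) pairs into a pair of lists (xs, ys)."""
    xs, ys = zip(*samples)
    return list(xs), list(ys)


dataset = ListDataset([(i, i * i) for i in range(10)])
sampler = ShuffledSampler(len(dataset))
batches = list(iter_batches(dataset, sampler, collate_pairs, batch_size=4))
```

With 10 samples and `batch_size=4`, this yields two full batches and one batch of size 2. Every index is visited exactly once.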
    
The model requires **preprocessed** data (variable encoding, standardization, etc.).
The `preprocessor` must be fit on the training dataset. The sketches below show what the whole pipeline could look like.

- The target loss might be defined only for the original data, not the encoded data.

### Pipeline sketch - encode after sampling

```python
preprocessor.fit(train_data)

# Construct pipeline objects
dataset = make_dataset(train_data)  # original data space
sampler = make_sampler(train_data)  # original data space
collate = make_batcher(train_data)  # original data space
dloader = DataLoader( ... )         # original data space

# post-process the batch
batch = next(iter(dloader))         # original data space
inputs, target = get_items(batch)   # original data space

# get model outputs
model_inputs = preprocessor.encode(inputs)  # model data space
model_target = preprocessor.encode(target)  # model data space
model_output = model(model_inputs)          # model data space
loss = loss_fn(model_output, model_target)  # model data space

# evaluation
predictions = preprocessor.decode(model_output)  # original data space
eval_loss = score_fn(target, predictions)        # original data space
```

### Pipeline sketch - encode before sampling

```python
preprocessor.fit(train_data)
encoded_ds = preprocessor.encode(train_data)

# Construct pipeline objects
dataset = make_dataset(encoded_ds)  # model data space
sampler = make_sampler(encoded_ds)  # model data space
collate = make_batcher(encoded_ds)  # model data space
dloader = DataLoader( ... )         # model data space

# post-process the batch
batch = next(iter(dloader))         # model data space
inputs, target = get_items(batch)   # model data space

# get model outputs (the batch is already encoded, so no further encoding)
model_output = model(inputs)                     # model data space
loss = loss_fn(model_output, target)             # model data space

# evaluation
predictions = preprocessor.decode(model_output)  # original data space
targets = preprocessor.decode(target)            # original data space
eval_loss = score_fn(targets, predictions)       # original data space
```

### Pipeline sketch - both pre + post-processing

```python
preprocessor.fit(train_data)
encoded_ds = preprocessor.encode(train_data)

# Construct pipeline objects
dataset = make_dataset(encoded_ds)  # model data space
sampler = make_sampler(encoded_ds)  # model data space
collate = make_batcher(encoded_ds)  # model data space
dloader = DataLoader( ... )         # model data space

# post-process the batch
batch = next(iter(dloader))         # model data space
inputs, target = get_items(batch)   # model data space

# get model outputs
model_inputs = preprocessor.encode(inputs)  # model data space
model_target = preprocessor.encode(target)  # model data space
model_output = model(model_inputs)          # model data space
loss = loss_fn(model_output, model_target)  # model data space

# evaluation
predictions = preprocessor.decode(model_output)  # original data space
targets = preprocessor.decode(model_target)      # original data space
eval_loss = score_fn(targets, predictions)       # original data space
```

## Conclusions

1. Since inputs and targets may have different modalities, we may need multiple encoders or, at the very least, a way to split off a sub-encoder responsible for the target data only.

2. Alternatively, the encoding could happen \*before\* the data is 

3. Encoding before sampling might be desirable for performance reasons, but impossible due to memory limitations. For example, the full dataset might not fit into GPU memory, or even into RAM. In that case, post-sampling encoding becomes mandatory (the alternative is to preprocess the whole dataset once and store the preprocessed version on disk alongside the original data).

4. The preprocessor might do more than asked for. Maybe it is easier to split off responsibilities?

5. What is the \*exact\* relationship between the preprocessor and sampler / dataset objects?

6. `get_items` should possibly be included in the `collate_fn`.

7. The batch should possibly return more than the inputs and targets, for example:

    - index that was used to look up the data
        - This should probably be the responsibility of the dataloader, but anyhow.
    - original data (not just model inputs) for plotting purposes.
    - extra data that is not used by every model.
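
Points 6 and 7 could be combined roughly as follows. This is a sketch under assumptions, not a settled design: the `Batch` container, `make_collate_fn`, and the convention that the dataset returns `(index, (x, y))` pairs are all hypothetical. The two lambda encoders are placeholders for whatever a fitted preprocessor would provide.

```python
from typing import Callable, NamedTuple


class Batch(NamedTuple):
    index: list      # keys that were used to look up the samples
    inputs: list     # encoded model inputs
    targets: list    # encoded model targets
    originals: list  # un-encoded (x, y) pairs, e.g. for plotting


def make_collate_fn(encode_input: Callable, encode_target: Callable) -> Callable:
    """Build a collate_fn that also performs the get_items step and the encoding."""

    def collate_fn(samples: list) -> Batch:
        # Assumption: each sample is (index, (x, y)) in the original data space.
        index = [key for key, _ in samples]
        originals = [pair for _, pair in samples]
        inputs = [encode_input(x) for x, _ in originals]
        targets = [encode_target(y) for _, y in originals]
        return Batch(index, inputs, targets, originals)

    return collate_fn


# Placeholder encoders standing in for a fitted preprocessor.
collate_fn = make_collate_fn(
    encode_input=lambda x: x / 10,
    encode_target=lambda y: y - 1,
)
batch = collate_fn([(7, (10.0, 3.0)), (2, (20.0, 5.0))])
```

Because encoding happens inside the `collate_fn`, this variant corresponds to encoding after sampling, and the original data stays available for evaluation in the original data space.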


## Idea: TupleEncoder / MappingEncoder

Add a `TupleEncoder` representing a product type, i.e. given `encoder = TupleEncoder(enc1, enc2, enc3)`, then

```python
encoder.encode( (a, b, c) )
```

is equivalent to

```python
(enc1.encode(a), enc2.encode(b), enc3.encode(c))
```

Likewise, we can have a `MappingEncoder` that acts key-wise: `encoder.encode({key: value})` is equivalent to `{key: encoder[key].encode(value)}` or, if the output should be a tuple instead, `(encoder[key].encode(value),)`.
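
A minimal sketch of both ideas, assuming every encoder exposes `encode` and `decode`. The `Scale` class is a toy encoder invented for this example, and the `MappingEncoder` shown here implements the dict-returning variant only.

```python
class Scale:
    """Toy encoder used only for illustration: multiplies by a constant."""

    def __init__(self, factor):
        self.factor = factor

    def encode(self, x):
        return x * self.factor

    def decode(self, z):
        return z / self.factor


class TupleEncoder:
    """Product of encoders: the i-th encoder handles the i-th tuple element."""

    def __init__(self, *encoders):
        self.encoders = encoders

    def __getitem__(self, i):
        return self.encoders[i]

    def encode(self, xs):
        if len(xs) != len(self.encoders):
            raise ValueError("expected one value per encoder")
        return tuple(e.encode(x) for e, x in zip(self.encoders, xs))

    def decode(self, zs):
        if len(zs) != len(self.encoders):
            raise ValueError("expected one value per encoder")
        return tuple(e.decode(z) for e, z in zip(self.encoders, zs))


class MappingEncoder:
    """Key-wise encoder: encoder[key] handles the value stored under key."""

    def __init__(self, encoders):
        self.encoders = dict(encoders)

    def __getitem__(self, key):
        return self.encoders[key]

    def encode(self, data):
        return {key: self.encoders[key].encode(value) for key, value in data.items()}

    def decode(self, data):
        return {key: self.encoders[key].decode(value) for key, value in data.items()}


tuple_encoder = TupleEncoder(Scale(2), Scale(10))
encoded_pair = tuple_encoder.encode((1, 2))

mapping_encoder = MappingEncoder({"x": Scale(2), "y": Scale(10)})
encoded_dict = mapping_encoder.encode({"x": 3, "y": 4})
```

Indexing with `tuple_encoder[1]` or `mapping_encoder["y"]` returns the sub-encoder, which is one way to obtain an encoder for the target data only.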



## Idea: Allow slicing of encoders

For example: Allow slicing of `Standardizer`. Since this encoder works element-wise, it makes sense that we should be able to select a subset of it.

Similarly, a `DataFrameEncoder` should allow getting `SeriesEncoder` from individual columns.

For others, such as `FloatEncoder` or `TensorEncoder`, it shouldn't matter. Should they be allowed to "pass through" the data?
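
As an illustration of slicing an element-wise encoder, here is a toy `Standardizer` over rows given as plain lists of floats. The class is a hypothetical stand-in written for this example. In this sketch a slice copies the fitted parameters, so it would not follow a later re-fit of the original, which is only one of several possible semantics.

```python
import statistics


class Standardizer:
    """Toy column-wise standardizer; rows are lists of floats."""

    def __init__(self, mean=None, stdv=None):
        self.mean = mean
        self.stdv = stdv

    def fit(self, rows):
        columns = list(zip(*rows))
        self.mean = [statistics.fmean(col) for col in columns]
        self.stdv = [statistics.pstdev(col) for col in columns]

    def encode(self, row):
        return [(x - m) / s for x, m, s in zip(row, self.mean, self.stdv)]

    def decode(self, row):
        return [z * s + m for z, m, s in zip(row, self.mean, self.stdv)]

    def __getitem__(self, key):
        """Return a Standardizer restricted to the selected column(s)."""
        if isinstance(key, slice):
            return Standardizer(self.mean[key], self.stdv[key])
        return Standardizer([self.mean[key]], [self.stdv[key]])


rows = [[0.0, 10.0, 100.0], [2.0, 30.0, 300.0]]
full = Standardizer()
full.fit(rows)

target_encoder = full[2:]  # sub-encoder for the last column only
row = rows[1]
encoded_full = full.encode(row)
encoded_target = target_encoder.encode(row[2:])
```

Encoding the last column with the sliced encoder gives the same values as encoding the full row and slicing the result afterwards.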

## Idea: Encode the indices as well and look up the encoded index data!

Not sure about ramifications: If one encodes to float, rounding errors might lead to faulty lookups!
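
The concern can be reproduced without any library. The sketch below assumes a hypothetical index encoder that simply casts integer nanosecond timestamps to float64 (Python's `float`). The timestamp value is arbitrary.

```python
# Two distinct nanosecond timestamps, exactly 1 ns apart.
t0 = 1_600_000_000_000_000_000
t1 = t0 + 1

# float64 has a 53-bit significand, so near 1.6e18 neighbouring
# floats are 256 apart and both timestamps map to the same float.
collides = float(t0) == float(t1)

# Decoding the encoded index does not recover the original key.
roundtrip_ok = int(float(t1)) == t1

# A table keyed by encoded indices returns the wrong row for t1.
lookup = {float(t0): "row at t0"}
found = lookup[float(t1)]
```

Here `collides` is `True` and `roundtrip_ok` is `False`, so the lookup for `t1` silently returns the row stored for `t0`.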

## Idea: Do we really need / want index encoding?

Shouldn't Series encoding be enough?


## Idea: Change DataFrameEncoder's behaviour

- By default, map DataFrame -> DataFrame

Add either an option, or an additional encoder that splits a DataFrame by columns / index into multiple sub-frames / series

## Idea: TupleDataset  / MappingDataset

`TupleDataset` basically duplicates `TensorDataset`, but doesn't require Tensors!
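
A possible shape for such a class, sketched in pure Python so that it runs without PyTorch. In a real pipeline it would presumably subclass `torch.utils.data.Dataset`, which is omitted here. The class only relies on each column supporting `len` and indexing.

```python
class TupleDataset:
    """Zip several equally long, indexable columns into one map-style dataset."""

    def __init__(self, *columns):
        if not columns:
            raise ValueError("at least one column is required")
        if len({len(col) for col in columns}) != 1:
            raise ValueError("all columns must have the same length")
        self.columns = columns

    def __len__(self):
        return len(self.columns[0])

    def __getitem__(self, index):
        # Look up the same index in every column and return the results as a tuple.
        return tuple(col[index] for col in self.columns)


names = ["a", "b", "c"]
values = [1.0, 2.0, 3.0]
extras = [{"id": 0}, {"id": 1}, {"id": 2}]

dataset = TupleDataset(names, values, extras)
sample = dataset[1]
```

Unlike `TensorDataset`, the columns can be lists, arrays, or any other indexable containers, and they may hold values of different types.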

## Idea: TupleSampler / MappingSampler

# ⟹ Algebraify all the data structures!!! ⟸

# Encoder Slicing - The details

1. Slicing ChainedEncoder
    - Try from the top until implemented, then return a new `ChainedEncoder` representing the slice.
        - A new object might be problematic, since it won't get updated when the original changes?
        - What happens to the slice if we re-fit the original? Does the slice change as well?
            - Since the contents are references to the same objects, it should work!
        - What happens when we re-fit a slice? Does the original change?
2. Slicing BaseEncoder
    - Either: Do nothing, or raise error
3. 
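
The re-fitting questions under point 1 can be explored with a small experiment. The sketch below assumes one particular reading of slicing, namely selecting a sub-chain that shares the encoder objects with the original. `Offset`, `Scale` and this `ChainedEncoder` are toy classes written for the example, not an existing implementation.

```python
class Offset:
    """Toy encoder: subtracts the mean learned in fit."""

    def __init__(self):
        self.mean = 0.0

    def fit(self, values):
        self.mean = sum(values) / len(values)

    def encode(self, x):
        return x - self.mean


class Scale:
    """Toy encoder with a fixed factor; fit is a no-op."""

    def __init__(self, factor):
        self.factor = factor

    def fit(self, values):
        pass

    def encode(self, x):
        return x * self.factor


class ChainedEncoder:
    """Applies its encoders in order (assumed convention)."""

    def __init__(self, *encoders):
        self.encoders = list(encoders)

    def fit(self, values):
        for enc in self.encoders:
            enc.fit(values)
            values = [enc.encode(v) for v in values]

    def encode(self, x):
        for enc in self.encoders:
            x = enc.encode(x)
        return x

    def __getitem__(self, key):
        if isinstance(key, slice):
            # New chain object, but the SAME encoder objects inside.
            return ChainedEncoder(*self.encoders[key])
        return self.encoders[key]


chain = ChainedEncoder(Offset(), Scale(2.0))
chain.fit([1.0, 3.0])        # Offset learns mean 2.0
head = chain[:1]             # slice containing only the Offset

before = head.encode(5.0)    # uses mean 2.0
chain.fit([10.0, 20.0])      # re-fit the original: mean becomes 15.0
after = head.encode(5.0)     # the slice sees the new mean

head.fit([0.0, 2.0])         # re-fit the slice: mean becomes 1.0
via_chain = chain.encode(5.0)  # the original sees it too
```

Under this reading, re-fitting propagates in both directions because the slice and the original share state. Whether that is desirable for re-fitting a slice is exactly the open question above.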
