# Batch Generator for MIMIC-III
This notebook reproduces the batch generator, which calls on the normalizer and discretizer to process each subject dataframe into smaller batches. In the benchmark codebase the batch generator is spread across several deeply nested functions; here we fuse it into a self-contained set of functions.

In [1]:
import os
import numpy as np
import argparse
import json
import pandas as pd
import pdb
import random
from pathlib import Path

from preprocessing.mimic import Discretizer


### Data Dependencies
The notebook requires an unprocessed timeseries as generated by the Preprocessor class, which creates the task data.

In [2]:
X_df = pd.read_csv(Path("resources", "10011_episode1_timeseries.csv")).set_index('Hours')
y_df = pd.read_csv(Path("resources", "listfile.csv"))
with open(Path("resources", "discretizer_config.json")) as file: 
    config = json.load(file)

### Reading the Series
Analogously to the benchmark implementation, we read the timeseries into subsampled windows. For each row of the listfile, the window contains all rows of the dataframe up to that row's timestamp `t`, so the window grows with each iteration and is paired with the corresponding label.

In [3]:
def read_timeseries(X_df, y_df):
    Xs = list()
    ys = list()
    ts = list()
    names = list()
    # Column names are identical for every window, so read them once
    header = list(X_df.columns)

    for index in range(len(y_df)):
        # Listfile columns by position: file name, timestamp t, label y
        name = y_df.iloc[index, 0]
        t = y_df.iloc[index, 1]
        y = y_df.iloc[index, 2]
        # Keep all rows up to and including time t (with a small tolerance)
        X = X_df[X_df.index < t + 1e-6]

        Xs.append(X)
        ys.append(y)
        ts.append(t)
        names.append(name)

    return Xs, ys, ts, names, header
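
To make the expanding windows concrete, the following sketch applies the same cut-off rule to a tiny hand-made timeseries. The values, the file name, and the listfile column names (`stay`, `period_length`, `y_true`) are assumptions made up for illustration; only the positional layout (name, timestamp, label) is taken from the function above.

```python
import pandas as pd

# Toy timeseries indexed by hours (values are made up)
X_toy = pd.DataFrame(
    {"Heart Rate": [80.0, 82.0, 85.0, 90.0]},
    index=pd.Index([0.5, 1.0, 2.5, 4.0], name="Hours"),
)
# Toy listfile: file name, timestamp t, label y (column names are assumptions)
y_toy = pd.DataFrame({
    "stay": ["toy_episode1_timeseries.csv"] * 3,
    "period_length": [1.0, 3.0, 5.0],
    "y_true": [0, 0, 1],
})

windows = []
for _, t, y in y_toy.itertuples(index=False):
    # Same cut-off rule as read_timeseries: keep all rows with Hours <= t
    windows.append((X_toy[X_toy.index < t + 1e-6], y))

for (X, y), t in zip(windows, y_toy["period_length"]):
    print(f"t={t}: {len(X)} rows, label={y}")
```

Each later timestamp yields a window that contains the previous one plus the newly covered rows.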

### Processing
Each window that has been read needs to be discretized and scaled. This has been found to be the most time-intensive step.

In [4]:
batch_size = 512
remaining = 10
chunk_size = 12
discretizer = Discretizer(config)


def process_batch(Xs, ts):
    data = [discretizer.transform(X, end=t) for (X, t) in zip(Xs, ts)]
        
    return data  
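
The `Discretizer` class itself is not shown in this notebook. As a rough illustration of what a discretization step of this kind might do, the sketch below bins rows into fixed-width time steps, keeps the last observation per bin, and forward-fills empty bins. The function `discretize_sketch`, its `timestep` parameter, and the imputation strategy are all assumptions and may differ from the actual `Discretizer.transform`.

```python
import numpy as np
import pandas as pd

def discretize_sketch(X, end, timestep=1.0):
    """Hypothetical stand-in for Discretizer.transform (not the real class)."""
    n_bins = max(int(np.ceil(end / timestep)), 1)
    # Assign every row to a bin; clip so a row at exactly t == end
    # lands in the last bin instead of opening a new one
    bins = np.clip((X.index.to_numpy() // timestep).astype(int), 0, n_bins - 1)
    # Keep the last observation within each bin
    binned = X.groupby(bins).last()
    # One row per bin; carry the last observation forward into empty bins
    return binned.reindex(range(n_bins)).ffill()

# Toy input with irregular timestamps (values are made up)
X_toy = pd.DataFrame(
    {"Heart Rate": [80.0, 82.0, 90.0]},
    index=pd.Index([0.5, 0.7, 3.2], name="Hours"),
)
out = discretize_sketch(X_toy, end=4.0)
print(out)
```

The output has one row per time step regardless of how irregularly the input was sampled, which is what allows windows to be stacked into batches later on.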

### Shuffling
Depending on the application, shuffling the data can improve predictive performance. Because the data and labels are stored as parallel lists, they are easy to shuffle together.

In [5]:
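def shuffle(data, batch_size):
    """
    Shuffle parallel lists (X, y, ...) together, sort the samples that fill
    complete batches by sequence length, and shuffle the resulting batches.
    """
    assert len(data) >= 2
    if isinstance(data[0][0], pd.DataFrame):
        data[0] = [x.values for x in data[0]]
    # Passed data is parallel lists [Xs, ys, ts, ...]
    data = list(zip(*data))
    # Data is now a list of per-sample tuples [(X, y, t, ...), ...]
    random.shuffle(data)

    residual_length = len(data) % batch_size
    head = data[:len(data) - residual_length]
    residual = data[len(data) - residual_length:]

    # Sort by length of X so that each batch needs little padding
    head.sort(key=(lambda x: x[0].shape[0]))

    batches = [head[i: i+batch_size] for i in range(0, len(head), batch_size)]

    random.shuffle(batches)

    # Flatten the batches back into a list of samples and append the residual
    samples = [sample for batch in batches for sample in batch]
    samples += residual
    # Transpose back into parallel tuples (Xs, ys, ts, ...)
    return list(zip(*samples))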
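
The sort-then-batch idea used in `shuffle` can be checked on toy data. The sketch below is a simplified illustration with made-up arrays, not the function above: each label is set to the length of its sequence, so it is easy to see that samples and labels stay paired and that every batch groups sequences of similar length.

```python
import random
import numpy as np

random.seed(0)
# Six toy samples with different sequence lengths; label = length
Xs = [np.zeros((n, 2)) for n in [5, 1, 4, 2, 6, 3]]
ys = [x.shape[0] for x in Xs]

# Transpose parallel lists into per-sample tuples, shuffle, sort by length
samples = list(zip(Xs, ys))
random.shuffle(samples)
samples.sort(key=lambda s: s[0].shape[0])

batch_size = 2
batches = [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]
# The order of the batches is random, their content is length-sorted
random.shuffle(batches)

# Flatten and transpose back into parallel tuples
Xs_out, ys_out = zip(*[s for batch in batches for s in batch])
print(ys_out)
```

Sorting before batching keeps the amount of zero padding per batch small, while shuffling the batches keeps their order random.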

In [6]:
def make_sample_zeropadding(data):
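    """
    Pad each array in data with zeros along the first axis to the length of
    the longest array and stack the results into a single array.
    """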
    dtype = data[0].dtype
    max_len = max([x.shape[0] for x in data])
    ret = [np.concatenate([x, np.zeros((max_len - x.shape[0],) + x.shape[1:], dtype=dtype)], axis=0)
           for x in data]
    return np.array(ret)
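
A quick way to see what `make_sample_zeropadding` does is to pad two toy arrays of different lengths. The function is repeated here only so that the snippet runs on its own; the input arrays are made up.

```python
import numpy as np

def make_sample_zeropadding(data):
    # Copy of the function above, repeated so the snippet is self-contained
    dtype = data[0].dtype
    max_len = max([x.shape[0] for x in data])
    ret = [np.concatenate([x, np.zeros((max_len - x.shape[0],) + x.shape[1:], dtype=dtype)], axis=0)
           for x in data]
    return np.array(ret)

short = np.ones((2, 3), dtype=np.float32)   # 2 time steps, 3 features
long = np.ones((4, 3), dtype=np.float32)    # 4 time steps, 3 features

batch = make_sample_zeropadding([short, long])
print(batch.shape)   # (samples, max time steps, features)
print(batch[0])      # the shorter sample, padded with zero rows at the end
```

The shorter sample keeps its original rows and receives trailing rows of zeros, so all samples in the batch share one shape.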

### Generator
Finally, we can plug our functions into the generator.

In [17]:
def generator():
    
    # while True:
    # while remaining > 0:
    #    current_size = min(chunk_size, remaining)
    Xs, ys, ts, names, header = read_timeseries(X_df, y_df)

    Xs = process_batch(Xs, ts)
    (Xs, ys, ts, names) = shuffle([Xs, ys, ts, names], batch_size)
    current_size = len(Xs)

    for i in range(0, current_size, batch_size):
        X = make_sample_zeropadding(Xs[i:i + batch_size])
        y = np.array(ys[i:i + batch_size])
        batch_names = names[i:i+batch_size]
        batch_ts = ts[i:i+batch_size]
        batch_data = (X, y)
        yield batch_data

In [18]:
Xs, ys, ts, names, header = read_timeseries(X_df, y_df)

In [19]:
x = generator()
x

<generator object generator at 0x7fddb7df05f0>
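
Calling a generator function only creates the generator object; none of its body runs until a batch is requested. The sketch below shows how batches could be pulled from such an object. `toy_generator` is a hypothetical stand-in that yields random `(X, y)` pairs, so the shapes and values are assumptions and do not reflect the MIMIC-III data.

```python
import numpy as np

def toy_generator(n_batches=3, batch_size=4):
    """Hypothetical stand-in for generator(); yields random (X, y) pairs."""
    rng = np.random.default_rng(0)
    for _ in range(n_batches):
        X = rng.normal(size=(batch_size, 5, 2))   # (samples, time steps, features)
        y = rng.integers(0, 2, size=batch_size)   # one binary label per sample
        yield X, y

gen = toy_generator()
X, y = next(gen)    # runs the body up to the first yield
print(X.shape, y.shape)

rest = list(gen)    # consumes the remaining batches
print(len(rest))
```

With the notebook's own `generator()`, the same pattern (`next(x)` or a `for` loop over `x`) would trigger reading, discretizing, and shuffling on the first request.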