# Train Test Split: Splitting Consumption Timeseries

The train/test split function is becoming overcrowded. Although the MIMIC-III and smartmeter datasets share some requirements, we need to investigate whether it makes sense to split the function.

Additionally, we want to add the following split functionalities:
* Split by date
* Split by time ratio
* Split by feature

In [1]:
import pandas as pd
from pathlib import Path  # used to build the resource paths below
from utils.mimic import get_sample_size
import numpy as np
import pdb
from utils.IO import *
from sklearn import model_selection
from dateutil.parser import parse

## I. Data

To test our implementation, we will need to construct example data frames. 

We need frames representing:
* Dictionary with subframes of variable length
* Dictionary with subframes of same length
* Frame with datetime index
* Dictionary with datetime indexed frames

In [172]:
label_df = pd.read_csv(Path("smartmeter", "resources", "preprocessed_labels.csv"))
label_df = label_df.set_index("timestamp")
label_df.index = pd.to_datetime(label_df.index)

sample_df = pd.read_csv(Path("smartmeter", "resources", "preprocessed_samples.csv"))
sample_df = sample_df.set_index("timestamp")
sample_df.index = pd.to_datetime(sample_df.index)

timeseries_df = pd.read_csv(Path("smartmeter", "resources", "timeseries.csv"))
timeseries_df = timeseries_df.set_index("timestamp")
timeseries_df.index = pd.to_datetime(timeseries_df.index)

mimic_df = pd.read_csv(Path("mimic", "benchmark", "resources", "10011_episode1_timeseries.csv"))
mimic_target_df = pd.read_csv(Path("mimic", "benchmark", "resources", "10011_target.csv"))
sample_df.head(3)

timestamp,LCLid,holiday,precipType,icon,summary,visibility,windBearing,temperature,dewPoint,pressure,apparentTemperature,windSpeed,humidity,month,weekday
2012-10-12 06:30:00,MAC000002,no holiday,rain,partly-cloudy-night,Mostly Cloudy,12.38,259.0,10.37,8.62,1002.72,10.37,6.11,0.89,10,4
2012-10-12 07:00:00,MAC000002,no holiday,rain,partly-cloudy-day,Mostly Cloudy,12.73,261.0,10.39,8.35,1003.91,10.39,6.71,0.87,10,4
2012-10-12 07:30:00,MAC000002,no holiday,rain,partly-cloudy-day,Mostly Cloudy,12.73,261.0,10.39,8.35,1003.91,10.39,6.71,0.87,10,4


In [3]:
label_df.head()

timestamp,y
2012-10-14 12:00:00,6
2012-10-14 18:00:00,19
2012-10-15 00:00:00,1
2012-10-15 06:00:00,11
2012-10-15 12:00:00,8


## II. Split by Date

In the case of the smartmeter dataset, we want to split the data into timespans. 

This can be done by either:
* Specifying the date by which to split (special event which might induce concept drift)
* Specifying span ratios
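
As a pandas-free illustration of the first option, the sketch below cuts a sorted list of timestamps into consecutive segments at one or more split dates. The helper name `split_by_dates` and the toy timestamps are assumptions made for this example; rows that fall exactly on a split date go to the later segment.

```python
from bisect import bisect_left
from datetime import datetime, timedelta


def split_by_dates(timestamps, split_dates):
    """Cut a sorted list of timestamps into len(split_dates) + 1 segments."""
    segments = []
    start = 0
    for date in sorted(split_dates):
        # bisect_left returns the first position whose timestamp is >= date
        stop = bisect_left(timestamps, date)
        segments.append(timestamps[start:stop])
        start = stop
    # Everything from the last split date onwards forms the final segment
    segments.append(timestamps[start:])
    return segments


# Hypothetical half-hourly readings over four days
first = datetime(2013, 3, 1)
timestamps = [first + timedelta(minutes=30 * i) for i in range(4 * 48)]

before, after = split_by_dates(timestamps, [datetime(2013, 3, 3)])
print(len(before), len(after))  # 96 96
```

Passing two dates returns three segments, which corresponds to a train, test and validation span.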

### Input Parameters

In [4]:
bydate = ["2013-03-03"]
# bydate = ["2013-03-03", "2013-09-03"]

The challenge with by-date splits is that the windows have not yet been generated. Since there is only a single

In [5]:
X = sample_df
y = label_df

bydate = [parse(date) for date in bydate]

label_splits = list()
sample_splits = list()

window_width = y.index[0] - X.index[0]
forecast_horizon = y.index[-1] - X.index[-1]
sample_width = window_width - forecast_horizon

In [6]:
forecast_horizon

Timedelta('0 days 06:00:00')
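
The three quantities can be checked by hand from the first and last timestamps of both frames. The sketch below redoes the arithmetic with the standard library, using the boundary timestamps that appear in the notebook outputs for the full data set. The variable names mirror the cell above, but nothing here reads the actual frames.

```python
from datetime import datetime

# Boundary timestamps copied from the notebook outputs
first_sample = datetime(2012, 10, 12, 6, 30)
first_label = datetime(2012, 10, 14, 12, 0)
last_sample = datetime(2014, 2, 27, 18, 0)
last_label = datetime(2014, 2, 28, 0, 0)

# Distance between the first sample row and the first label
window_width = first_label - first_sample
# Distance between the last sample row and the last label
forecast_horizon = last_label - last_sample
# Time span covered by the sample rows of a single window
sample_width = window_width - forecast_horizon

print(window_width)      # 2 days, 5:30:00
print(forecast_horizon)  # 6:00:00
print(sample_width)      # 1 day, 23:30:00
```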

In [7]:
for date in bydate:
    label_split = y[y.index < date]
    label_splits.append(label_split)
    y = y[y.index >= date]
    sample_splits.append(X[(X.index >= label_split.index[0] - window_width) & (X.index <= label_split.index[-1] - forecast_horizon)])
    
    
label_splits.append(y)
sample_splits.append(X[(X.index >= y.index[0] - window_width) & (X.index <= y.index[-1] - forecast_horizon)])
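
The masks in the loop reduce to two subtractions per split. As a sanity check, the sketch below recomputes the sample bounds of the second split from its label bounds with plain `datetime` arithmetic. `sample_bounds` is a hypothetical helper, and the two widths are hard-coded to the values of the full data set (2 days 05:30 and 6 hours).

```python
from datetime import datetime, timedelta

window_width = timedelta(days=2, hours=5, minutes=30)
forecast_horizon = timedelta(hours=6)


def sample_bounds(first_label, last_label):
    """Return the first and last sample timestamp needed for a label range."""
    # The first window starts one window width before the first label
    start = first_label - window_width
    # The last window ends one forecast horizon before the last label
    end = last_label - forecast_horizon
    return start, end


# Label range of the split that starts on 2013-03-03
start, end = sample_bounds(datetime(2013, 3, 3, 0, 0), datetime(2014, 2, 28, 0, 0))
print(start)  # 2013-02-28 18:30:00
print(end)    # 2014-02-27 18:00:00
```

Because each split reaches back one window width, the sample ranges of neighbouring splits overlap even though their label ranges do not.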

In [8]:
for sample, label in zip(sample_splits, label_splits):
    print(f" sample start: {sample.index[0]}, label start: {label.index[0]},  distance: {label.index[0] - sample.index[0]} \n sample end: {sample.index[-1]},   label end: {label.index[-1]},    distance: {label.index[-1] - sample.index[-1]}")

 sample start: 2012-10-12 06:30:00, label start: 2012-10-14 12:00:00,  distance: 2 days 05:30:00 
 sample end: 2013-03-02 12:00:00,   label end: 2013-03-02 18:00:00,    distance: 0 days 06:00:00
 sample start: 2013-02-28 18:30:00, label start: 2013-03-03 00:00:00,  distance: 2 days 05:30:00 
 sample end: 2014-02-27 18:00:00,   label end: 2014-02-28 00:00:00,    distance: 0 days 06:00:00

## III. Split By Ratio

The by-ratio function can largely reuse the by-date code: once the split dates have been derived from the ratios, they are simply passed to the by-date function.

### Input Parameters

In [9]:
test_size = 0.2
val_size = 0.2

Let's compute the split dates.

In [10]:
total_span = label_df.index[-1] - label_df.index[0]
total_span

Timedelta('501 days 12:00:00')

In [11]:
train_span = total_span * (1 - (test_size + val_size))
train_span

Timedelta('300 days 21:36:00')

In [12]:
split_date = list()
split_date.append(str(label_df.index[0] + train_span))

if val_size:
    split_date.append(str(label_df.index[0] + train_span + val_size*total_span))
split_date

['2013-08-11 09:36:00', '2013-11-19 16:48:00']
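
The three cells above can be folded into one helper. The following is a minimal sketch using only the standard library; the function name `ratio_split_dates` and the input check are assumptions for illustration, not part of the existing module.

```python
from datetime import datetime, timedelta


def ratio_split_dates(start, end, test_size, val_size=0.0):
    """Return the split dates that realise the requested time-span ratios."""
    if not 0 < test_size + val_size < 1:
        raise ValueError("test_size + val_size must lie strictly between 0 and 1")

    total_span = end - start
    train_span = total_span * (1 - (test_size + val_size))

    # The first date closes the training span
    dates = [start + train_span]
    if val_size:
        # The second date closes the span that follows the training data
        dates.append(start + train_span + val_size * total_span)
    return dates


# First label timestamp and total label span taken from the outputs above
start = datetime(2012, 10, 14, 12, 0)
end = start + timedelta(days=501, hours=12)

for date in ratio_split_dates(start, end, test_size=0.2, val_size=0.2):
    print(date)
# 2013-08-11 09:36:00
# 2013-11-19 16:48:00
```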

## IV. Integration
Now that the proof of concept is finished, we can integrate the new functionalities into the existing module and test them.

### By Date Split Function

In [13]:
bydate = ["2013-03-03", "2013-09-03"]

def make_date_split(X, dates):
    """
    Split a datetime-indexed frame into consecutive segments at the given dates.
    """
    if isinstance(dates, str):
        dates = [dates]

    # The last index closes the final segment; the strict `<` below excludes rows at that timestamp
    dates = [parse(date) for date in dates] + [X.index[-1]]

    sample_splits = list()
    
    for date in dates:
        sample_split = X[X.index < date]
        sample_splits.append(sample_split)
        X = X[X.index >= date]
        
    return tuple(sample_splits)

# print(len(make_date_split(sample_df, bydate)))
X, X1, X2 = make_date_split(sample_df, bydate)
total = len(X) + len(X1) + len(X2)
print(f"train: {len(X) / total}, "
      f"test: {len(X1) / total}, "
      f"val: {len(X2) / total}\n")

def make_date_split(X, y, dates):
    """
    Split samples and labels at the given dates. Each sample segment starts one window
    width before its first label and ends one forecast horizon before its last label.
    """
    if isinstance(dates, str):
        dates = [dates]

    # The last label index closes the final segment; the strict `<` below excludes that label
    dates = [parse(date) for date in dates] + [y.index[-1]]

    label_splits = list()
    sample_splits = list()

    window_width = y.index[0] - X.index[0]
    forecast_horizon = y.index[-1] - X.index[-1]
    sample_width = window_width - forecast_horizon
    
    for date in dates:
        label_split = y[y.index < date]
        label_splits.append(label_split)
        y = y[y.index >= date]
        sample_split = X[(X.index >= label_split.index[0] - window_width) & (X.index <= label_split.index[-1] - forecast_horizon)]
        sample_splits.append(sample_split)
        
    return (*sample_splits, *label_splits)

X, X1, X2, y, y1, y2 = make_date_split(sample_df, label_df, bydate)
# print(len(make_date_split(sample_df, label_df, bydate)))
total = len(X) + len(X1) + len(X2)
print(f"train: {len(X) / total}, "
      f"test: {len(X1) / total}, "
      f"val: {len(X2) / total}")


train: 0.2814995655232342, test: 0.3654570281789217, val: 0.35304340629784414

train: 0.27873704982733105, test: 0.36655155402072026, val: 0.3547113961519487


### By Time Ratio Split Function

The by-ratio function is in turn based on the `make_date_split` function.

In [14]:
def make_timeratio_split(X, y, test_size=0.5, val_size=0.):
    """
    Derive the split dates from the time-span ratios and delegate to make_date_split.
    """
    # Use the labels passed in as `y` rather than the global label_df
    total_span = y.index[-1] - y.index[0]
    train_span = total_span * (1 - (test_size + val_size))

    dates = list()
    dates.append(str(y.index[0] + train_span))

    if val_size:
        dates.append(str(y.index[0] + train_span + val_size*total_span))
    
    return make_date_split(X, y, dates)

len(make_timeratio_split(sample_df, label_df, test_size=0.5))

4

In [15]:
X1, X2, y1, y2 = make_timeratio_split(sample_df, label_df, test_size=0.9)
len(y1)/len(y2)

0.1113573407202216

# V. Class implementation

We integrate the newly implemented functions with the legacy code in a single class.

In [16]:
!pip install multipledispatch


In [17]:
from multipledispatch import dispatch

In [165]:
class SplitUtility():
    
    def __init__(self):
        """
        """ 
        self.X_out = {
            "train": None,
            "test": None,
            "val": None
        }
        
        self.y_out = {
            "train": None,
            "test": None,
            "val": None
        }
        
        self.X_buffer = {
            "train": list(),
            "test": list(),
            "val": list()
        }
        
        self.y_buffer = {
            "train": list(),
            "test": list(),
            "val": list()
        }
        
        pass
    
    def train_test_split(self, X, y=None, test_size=0.5, val_size=0., dates=[], method="sample", concatenate=True):
        """
        This function splits the provided data into a test and train data set.
        """
        
        test_size = float(test_size)
        val_size = float(val_size)
        
        if dates:
            method = "date"
            
            if isinstance(dates, str):
                dates = [dates]
                
        print(method)
            
        method_switch = {
            "date": self.make_date_split,
            "time": self.make_timeratio_split,
            "sample": self.make_frame_split
        }
        
        args_switch = {
            "date": {"dates": dates},
            "time": {"test_size": test_size, 
                     "val_size": val_size,
                    },
            "sample": {"test_size": test_size, "val_size": val_size}
        }
                
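        # DataFrames always take this branch; arrays only when split dates are given
        if isinstance(X, pd.DataFrame) or (isinstance(X, np.ndarray) and dates):
            if y is not None: 
                method_switch[method](X, y, **args_switch[method])
            else:
                method_switch[method](X, **args_switch[method])

        elif isinstance(X, dict) and len(X) == 1:
            # Split the single frame; the result is collected by make_returns() below
            X_single = [*X.values()][0]
            if y:
                self.make_frame_split(X_single, [*y.values()][0], test_size=test_size, val_size=val_size)
            else:
                self.make_frame_split(X_single, test_size=test_size, val_size=val_size)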
        
        elif isinstance(X, dict) and method in ["time", "date"]:
            if not y:
                y = {}
            self.make_perdictionary_split(X, y, *args_switch[method].values(), splitter=method_switch[method])
        
        elif isinstance(X, dict):
            if not y:
                y = {}
            self.make_dictionary_split(X, y, test_size, val_size)

        return tuple(self.make_returns())
    
    @dispatch(pd.DataFrame, list) 
    def make_date_split(self, X, dates):
        """
        """
        if isinstance(dates, str):
            dates = [dates]

        dates = [parse(date) for date in dates] + [X.index[-1]]
        sample_splits = list()

        for name, date in zip(self.X_out.keys(), dates):
            self.X_out[name] = X[X.index < date]
            X = X[X.index >= date]
            if not len(X): break
                
        return

    @dispatch(pd.DataFrame, pd.DataFrame, list) 
    def make_date_split(self, X, y, dates):
        """
        """
        if isinstance(dates, str):
            dates = [dates]

        dates = [parse(date) for date in dates] + [y.index[-1]]

        window_width = y.index[0] - X.index[0]
        forecast_horizon = y.index[-1] - X.index[-1]
        sample_width = window_width - forecast_horizon
        
        for name, date in zip(self.y_out.keys(), dates):
            label_split = y[y.index < date]
            self.y_out[name] = label_split
            y = y[y.index >= date]
            self.X_out[name] = X[(X.index >= label_split.index[0] - window_width) & (X.index <= label_split.index[-1] - forecast_horizon)]
            if not len(y): break
            
        return
    
    def make_timeratio_split(self, X, y=None, test_size=0., val_size=0.):
        """
        Derive the split dates from the requested time-span ratios and delegate to make_date_split.
        """
        total_span = X.index[-1] - X.index[0]
        train_span = total_span * (1 - (test_size + val_size))
        dates = list()
        dates.append(str(X.index[0] + train_span))

        if val_size:
            # The segment after train is stored as "test", so it has to span test_size
            dates.append(str(X.index[0] + train_span + test_size*total_span))
        
        if y is not None and len(y):
            self.make_date_split(X, y, dates)
        else:
            self.make_date_split(X, dates)
            
        return
    
    @dispatch(pd.DataFrame)
    def make_frame_split(self, X, test_size, val_size):
        """
        """
        total_size = test_size + val_size
        self.X_out["train"], self.X_out["test"] = model_selection.train_test_split(X, test_size=total_size, shuffle=False)
        
        if val_size:
            val_test_ratio =  val_size / total_size
            
            self.X_out["test"], self.X_out["val"] = model_selection.train_test_split(self.X_out["test"], test_size=val_test_ratio, shuffle=False)
    
        return
    
    @dispatch(pd.DataFrame, pd.DataFrame)
    def make_frame_split(self, X, y, test_size, val_size):
        """
        """
        total_size = test_size + val_size
        (self.X_out["train"], 
         self.X_out["test"], 
         self.y_out["train"], 
         self.y_out["test"]) = model_selection.train_test_split(X, y, test_size=total_size, shuffle=False)
        
        if val_size:
            val_test_ratio = val_size / total_size
            
            (self.X_out["test"], 
             self.X_out["val"], 
             self.y_out["test"], 
             self.y_out["val"]) = model_selection.train_test_split(self.X_out["test"], self.y_out["test"], test_size=val_test_ratio, shuffle=False)          
        
        return

    @dispatch(dict, dict, float, float)
    def make_perdictionary_split(self, X_dict, y_dict, test_size, val_size, splitter=None):
        """
        Apply the given splitter to every frame of the dictionary and collect the results.
        """
        column_length = (int(test_size!=0) + int(val_size!=0) + 1) * (1 + int(len(y_dict) > 0))
        
        for i, (key, X) in enumerate(X_dict.items()):
            if len(y_dict):
                splitter(X, y_dict[key], test_size=test_size, val_size=val_size)
            else:
                splitter(X, test_size=test_size, val_size=val_size)
            self.buffer()
            
        self.save()
    
    @dispatch(dict, dict, list)
    def make_perdictionary_split(self, X_dict, y_dict, dates, splitter=None):
        """
        """
        column_length = len(dates) + 1
        
        for i, (key, X) in enumerate(X_dict.items()):
            if len(y_dict):
                splitter(X, y_dict[key], dates)
            else:
                splitter(X, dates)
            self.buffer()
            
        self.save()
            
    def make_dictionary_split(self, X_dict, y_dict, test_size=0.5, val_size=None, concatenate=True):
        """
        """
        test_subjects, test_size_real = self.make_subject_ratio(X_dict, test_size)

        if test_size > 0 and not len(test_subjects):
            raise ValueError("Test set empty!")

        train_subjects = set(X_dict.keys()) - set(test_subjects)

        if val_size:
            X_train_dict = {key: X_dict[key] for key in train_subjects}
            train_size = (1 - test_size)
            val_train_ratio = val_size/train_size
            val_subjects, val_size_real = self.make_subject_ratio(X_train_dict, val_train_ratio)

            if val_size > 0 and not len(val_subjects):
                raise ValueError("Validation set empty!")

            train_subjects = train_subjects - set(val_subjects)
            self.X_out["val"] = [X_dict[subject] for subject in val_subjects]

        self.X_out["train"] = [X_dict[subject] for subject in list(train_subjects)]
        self.X_out["test"] = [X_dict[subject] for subject in test_subjects]
        
        if len(y_dict):
            self.y_out["train"] = [y_dict[subject] for subject in list(train_subjects)]
            self.y_out["test"] = [y_dict[subject] for subject in test_subjects]
            
            if val_size:
                self.y_out["val"] = [y_dict[subject] for subject in val_subjects] 

        return
    
    def buffer(self):
        """
        """
        for key, value in self.X_out.items():
            self.X_buffer[key].append(self.X_out[key])
        for key, value in self.y_out.items():
            self.y_buffer[key].append(self.y_out[key])
                        
    def save(self):
        """
        """
        for key in self.X_out.keys():
            if self.X_buffer[key][0] is not None:
                self.X_out[key] = self.X_buffer[key]
            else:
                self.X_out[key] = None
                
            if self.y_buffer[key][0] is not None:
                self.y_out[key] = self.y_buffer[key]
            else:
                self.y_out[key] = None
                            
    def make_returns(self):
        """
        """
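        for key in self.X_out.keys():
            if not isinstance(self.X_out[key], list) and self.X_out[key] is not None:
                self.X_out[key] = [self.X_out[key]]
                # Wrap the labels only if they exist; [None] would otherwise leak into the returns
                if self.y_out[key] is not None:
                    self.y_out[key] = [self.y_out[key]]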
                
        rets = (*[value for value in self.X_out.values() if value is not None], 
                *[value for value in self.y_out.values() if value is not None])

        for key in self.X_out.keys():
            self.X_out[key] = None
            self.y_out[key] = None
            
        for key in self.X_buffer.keys():
            self.X_buffer[key] = list()
            self.y_buffer[key] = list()
        
        
        return rets
        
    def make_subject_ratio(self, X_dict, target_size):
        """
        """
        total_samples = get_sample_size(X_dict)

        # Create a dataframe containing the participant id and share on total sample as ratio
        subject_ratios_pairs = [(subject, len(X_dict[subject])/total_samples) for subject in X_dict.keys()]

        # Build dataframe
        ratio_df = pd.DataFrame(subject_ratios_pairs, 
                                columns=['participant', 'ratio'])

        ratio_df = ratio_df.sort_values('ratio')
        ratio_df = ratio_df.set_index('participant')

        best_subjects = list()
        best_size = 0
        best_diff = 1e18
        tolerance = 0.005
        max_iter = 1000
        n_iter = 0

        while best_diff > tolerance and n_iter < max_iter:

            # Start every attempt from an empty selection and a fresh shuffle
            subjects = list()
            current_size = 0
            remaining_pairs_df = ratio_df.sample(frac=1)

            while current_size < target_size:
                current_to_target_diff = target_size - current_size
                # Keep only the subjects that still fit into the remaining budget
                remaining_pairs_df = remaining_pairs_df[remaining_pairs_df['ratio'] < current_to_target_diff]

                if not len(remaining_pairs_df): break

                next_subject = remaining_pairs_df.iloc[-1]

                current_size += next_subject['ratio']
                subject_name = next_subject.name
                subjects.append(subject_name)
                remaining_pairs_df = remaining_pairs_df.drop(subject_name)

            diff = abs(target_size - current_size)

            if diff < best_diff:
                best_subjects = subjects
                best_size = current_size
                best_diff = diff

            n_iter += 1

        return best_subjects, best_size
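
To make the idea behind `make_subject_ratio` easier to follow, here is a simplified, standard-library sketch of the same strategy: shuffle the subjects, greedily add every subject that still fits into the remaining budget, and keep the best attempt. It works on raw sample counts instead of ratios, so it is not a drop-in replacement. The names `pick_subjects` and `sizes` and the toy counts are assumptions for illustration.

```python
import random


def pick_subjects(sizes, target_share, max_iter=100, tolerance=0.005, seed=0):
    """Pick subjects whose combined share of all samples approximates target_share."""
    rng = random.Random(seed)
    total = sum(sizes.values())
    budget = target_share * total

    best_subjects, best_diff = [], float("inf")
    for _ in range(max_iter):
        order = list(sizes)
        rng.shuffle(order)

        # Start every attempt from an empty selection
        chosen, current = [], 0
        for subject in order:
            # Add the subject only if it does not overshoot the target
            if current + sizes[subject] <= budget:
                chosen.append(subject)
                current += sizes[subject]

        diff = abs(budget - current) / total
        if diff < best_diff:
            best_subjects, best_diff = chosen, diff
        if best_diff <= tolerance:
            break

    return best_subjects, sum(sizes[s] for s in best_subjects) / total


# Hypothetical number of samples per household
sizes = {"a": 50, "b": 30, "c": 15, "d": 5}
subjects, share = pick_subjects(sizes, target_share=0.2)
print(sorted(subjects), share)  # ['c', 'd'] 0.2
```

Splitting by whole subjects keeps all windows of one household or patient in the same set, at the cost of only approximating the requested ratio.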


## VI. Testing

Before refactoring the codebase to use the new class, we want to make sure that it behaves as expected.
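
The checks in this section all follow the same pattern: compute the share of the first split and compare it with the requested ratio. A small helper can make that comparison explicit. The sketch below is only an assumption about how such a check could look; `split_shares` is a hypothetical name and the row counts are made-up values chosen to resemble a 0.6/0.2/0.2 split.

```python
import math


def split_shares(*lengths):
    """Return the share of each split given the number of rows per split."""
    total = sum(lengths)
    return [length / total for length in lengths]


# Hypothetical row counts for the train, test and validation split
train_share, test_share, val_share = split_shares(302, 100, 100)

# The realised shares rarely hit the ratio exactly, so compare with a tolerance
assert math.isclose(train_share, 0.6, abs_tol=0.01)
assert math.isclose(test_share, 0.2, abs_tol=0.01)
assert math.isclose(val_share, 0.2, abs_tol=0.01)
print(round(train_share, 4))  # 0.6016
```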

### Dictionaries

We want dictionaries of different lengths.

In [166]:
from preprocessing.smartmeter import Preprocessor

preprocessor = Preprocessor()

X_dic = {}
y_dic = {}

n_samples = len(timeseries_df)

for index in range(20):
    i = index + 1
    X_household, y_household = preprocessor.transform(timeseries_df=timeseries_df.iloc[:int((n_samples/i) - 1), :],
                                                      forecast_horizon="qdaily",
                                                      input_width=48)
    X_household_new = {}
    y_household_new = {}
    
    for key, values in X_household.items():
        X_household_new[f"{key}_{1}"] = values
        
    for key, values in y_household.items():
        y_household_new[f"{key}_{1}"] = values
        
    X_dic.update(X_household_new)
    y_dic.update(y_household_new)

In [167]:
su = SplitUtility()

### Split By Time Ratio

**With targets & val set**

In [148]:
X, X1, X2, y, y1, y2 = su.train_test_split(X_dic, y_dic, test_size=0.2, val_size=0.2, method="time")
len(X[0]) / (len(X[0]) + len(X1[0]) + len(X2[0]))

time


0.601593625498008

**With val set**

In [149]:
X, X1, X2 = su.train_test_split(X_dic, test_size=0.2, val_size=0.2, method="time")
len(X[0]) / (len(X[0]) + len(X1[0]) + len(X2[0]))

time


0.603585657370518

**With target**

In [150]:
X, X1, y, y1 = su.train_test_split(X_dic, y_dic, test_size=0.2, method="time")
len(X[0]) / (len(X[0]) + len(X1[0]))

time


0.8007968127490039

**Blank**

In [151]:
X, X1 = su.train_test_split(X_dic, test_size=0.2, method="time")
len(X[0]) / (len(X[0]) + len(X1[0]))

time


0.8027888446215139

### Split By Households/Subjects

**With targets & val set**

In [152]:
X, X1, X2, y, y1, y2 = su.train_test_split(X_dic, y_dic, test_size=0.2, val_size=0.2)
get_sample_size(X) / (get_sample_size(X) + get_sample_size(X1) + get_sample_size(X2))

sample


0.6055940233236151

**Val set**

In [153]:
X, X1, X2 = su.train_test_split(X_dic, test_size=0.2, val_size=0.2, method="time")
len(X[0]) / (len(X[0]) + len(X1[0]) + len(X2[0]))

time


0.603585657370518

**Target**

In [154]:
X, X1, y, y1 = su.train_test_split(X_dic, y_dic, test_size=0.2, method="time")
len(X[0]) / (len(X[0]) + len(X1[0]))

time


0.8007968127490039

**Blank**

In [155]:
X, X1 = su.train_test_split(X_dic, test_size=0.2, method="time")
len(X[0]) / (len(X[0]) + len(X1[0]))

time


0.8027888446215139

### Split by Dates

In [156]:
bydate = ["2013-03-03", "2013-09-03"]

**With targets & val set**

In [157]:
X, X1, X2, y, y1, y2 = su.train_test_split(X_dic, y_dic, dates=bydate)
get_sample_size(X) / (get_sample_size(X) + get_sample_size(X1) + get_sample_size(X2))

date


0.32415340677274584

**Val set**

In [112]:
X, X1, X2 = su.train_test_split(X_dic, dates=bydate)
len(X[0]) / (len(X[0]) + len(X1[0]) + len(X2[0]))

0.28087649402390436

In [113]:
bydate = ["2013-03-03"]

**Target**

In [114]:
X, X1, y, y1 = su.train_test_split(X_dic, y_dic, dates=bydate)
len(X[0]) / (len(X[0]) + len(X1[0]))

0.2788844621513944

**Blank**

In [115]:
X, X1 = su.train_test_split(X_dic, dates=bydate)
len(X[0]) / (len(X[0]) + len(X1[0]))

0.28087649402390436

### Split by Samples

**Where do I need this then**

In [163]:
X, X1 = su.train_test_split(mimic_df, test_size=0.2)
len(X) / (len(X) + len(X1))

sample


0.7980295566502463

**Val set**

In [170]:
X, X1, X2 = su.train_test_split(mimic_df, test_size=0.2, val_size=0.2)
len(X) / (len(X) + len(X1) + len(X2))

sample


0.5960591133004927
