<a href="https://colab.research.google.com/github/sdsc-bw/DataFactory/blob/develop/demos/04_Finetuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Multiple Input Files

A process is often observed with multiple sensors, and each sensor's data is typically saved to its own file. For training, these files must be correctly combined into a single dataset. This demo shows how to handle multiple input files.

# How to Use in the DataFactory

## Import packages

In [1]:
# if running in colab
import sys
if 'google.colab' in sys.modules:
    !git clone https://github.com/sdsc-bw/DataFactory.git # clone repository for colab
    !ls
    
    !pip install cloudpickle==1.6.0
    !pip install imgaug==0.2.6
    !pip install scipy==1.7.3 # install scipy to use hyperopt, RESTART RUNTIME AFTER THAT
    
    !pip install mlflow # install mlflow to use hyperopt
    
    # install auto-sklearn
    !sudo apt-get install build-essential swig
    !curl https://raw.githubusercontent.com/automl/auto-sklearn/master/requirements.txt | xargs -n 1 -L 1 pip install
    
    !pip install tsai # install tsai

In [2]:
import warnings # ignore irrelevant warnings
warnings.filterwarnings('ignore')

In [3]:
import pandas as pd # library for creating tables
import numpy as np # library for efficient list calculations

# add path to import datafactory 
import sys
if 'google.colab' in sys.modules:
    root = 'DataFactory/'
else:
    root = '../'
sys.path.append(root)

# Time series
from datafactory.ts.preprocessing.loading import load_dataset_from_file
from datafactory.ts.preprocessing.cleaning import combine_dataframes

2022-03-05 12:52:53,928 - init


## Load Multiple Files

First, we need to define the paths to the sensor data.

In [4]:
# build the paths relative to `root` so they also resolve when running in Colab
paths = [root + 'data/sensor_' + str(i) + '.csv' for i in range(1, 4)]

In [5]:
dfs = load_dataset_from_file('csv', paths, shuffle=False, index_col='date')
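
For readers who want to see the idea without DataFactory, the following is a minimal sketch in plain pandas of reading several CSV sources into a list of DataFrames indexed by a parsed `date` column. It is not the implementation of `load_dataset_from_file`. The helper name `load_csvs` and the in-memory sample values are assumptions made for illustration.

```python
import io

import pandas as pd


def load_csvs(sources, index_col="date"):
    """Hypothetical helper: read each CSV source into a DataFrame with a datetime index."""
    # parse_dates=True asks pandas to parse the index column as timestamps
    return [pd.read_csv(src, index_col=index_col, parse_dates=True) for src in sources]


# Two tiny in-memory "files" standing in for real sensor CSVs (made-up values).
csv_a = "date,val1,target\n2015-02-24 00:00:00,0.1,0\n2015-02-24 00:01:00,0.2,1\n"
csv_b = "date,val1,target\n2015-02-24 00:00:00,0.3,1\n2015-02-24 00:02:00,0.4,0\n"

frames = load_csvs([io.StringIO(csv_a), io.StringIO(csv_b)])
print(len(frames), type(frames[0].index).__name__)
```

File paths can be passed in place of the `io.StringIO` objects, since `pd.read_csv` accepts both.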

In [6]:
dfs[0]

date,Unnamed: 0,val1,val2,val3,target
2015-02-24 00:00:00,0,0.107112,0.148927,-0.022579,0
2015-02-24 00:01:00,1,-0.308293,-1.432239,-0.221532,1
2015-02-24 00:02:00,2,1.064629,-1.346123,0.545769,1
2015-02-24 00:03:00,3,-0.91462,1.503748,0.608795,0
2015-02-24 00:04:00,4,-1.33354,-0.39279,0.276805,1


In [7]:
dfs[1]

date,Unnamed: 0,val1,val2,val3,target
2015-02-24 00:00:00,0,-1.553171,-1.56658,-0.50523,1
2015-02-24 00:02:00,1,0.151971,-1.291708,0.839898,1
2015-02-24 00:04:00,2,0.639565,0.691188,-1.38224,0
2015-02-24 00:06:00,3,0.489976,-0.534892,-0.346734,1
2015-02-24 00:08:00,4,-0.974679,1.231506,2.555207,0


In [8]:
dfs[2]

date,Unnamed: 0,val1,val2,val3,target
2015-02-24 00:00:10,0,-0.48463,-0.379543,0.952757,1
2015-02-24 00:01:10,1,-0.857693,0.777133,1.244604,0
2015-02-24 00:02:10,2,-0.056405,0.45756,-1.921964,0
2015-02-24 00:03:10,3,1.057107,1.475887,1.016144,1
2015-02-24 00:04:10,4,-0.820045,0.827722,-1.172019,0


In [11]:
# median sampling interval of each sensor
l = [df.index.to_series().diff().median() for df in dfs]

In [12]:
l

[Timedelta('0 days 00:01:00'),
 Timedelta('0 days 00:02:00'),
 Timedelta('0 days 00:01:00')]

In [13]:
sorted(l)

[Timedelta('0 days 00:01:00'),
 Timedelta('0 days 00:01:00'),
 Timedelta('0 days 00:02:00')]
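
The medians above show that the sensors are sampled at different intervals (one minute and two minutes). One possible way to bring such frames onto a common time grid is to resample each of them to the coarsest interval. The sketch below illustrates this with plain pandas. It is not necessarily what `combine_dataframes` does internally, and the helper names `median_interval` and `to_common_grid`, the `val` column, and the sample values are assumptions made for illustration.

```python
import pandas as pd


def median_interval(df):
    """Median spacing between consecutive timestamps of the index."""
    return df.index.to_series().diff().median()


def to_common_grid(frames):
    """Hypothetical helper: resample every frame to the coarsest median interval."""
    # Using the largest interval means no frame has to be upsampled.
    step = max(median_interval(df) for df in frames)
    # Average all samples that fall into the same time bin.
    return step, [df.resample(step).mean() for df in frames]


# Made-up data: one sensor sampled every minute, one every two minutes.
fast = pd.DataFrame(
    {"val": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]},
    index=pd.date_range("2015-02-24", periods=6, freq="1min"),
)
slow = pd.DataFrame(
    {"val": [10.0, 20.0, 30.0]},
    index=pd.date_range("2015-02-24", periods=3, freq="2min"),
)

step, resampled = to_common_grid([fast, slow])
print(step)
print(resampled[0])
```

Averaging within each bin is only one choice. Depending on the signal, taking the first or last value per bin, or interpolating the slower sensor instead, may be more appropriate.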

## Combine Files