Import the necessary packages.

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_selection import mutual_info_regression

Read in the input and output sensor measurements prepared in the previous step.

In [2]:
df_sensors_prepared = pd.read_parquet('./data_prepared/prepared_sensor_data.parquet')

df_sensors_prepared

Unnamed: 0,IN1,IN2,IN3,IN4,IN5,Out1,Out2,IN6,IN7,IN8
1970-01-01 00:00:00+00:00,0.077744,0.795565,-0.068312,0.879321,0.134419,-0.122686,0.123661,-0.463482,1.135632,0.191626
1970-01-01 00:01:00+00:00,0.080313,0.824595,-0.068312,0.875636,0.134941,-0.122686,0.123661,-0.998246,1.085583,0.503527
1970-01-01 00:02:00+00:00,0.087355,0.776258,-0.068312,0.884105,0.132452,-0.026857,0.123661,-1.171168,0.096908,-0.133069
1970-01-01 00:03:00+00:00,0.091774,0.739149,-0.068312,0.892043,0.131287,-0.026857,0.251010,0.260074,0.332611,-0.151051
1970-01-01 00:04:00+00:00,0.091166,0.761332,-0.068312,0.892752,0.136788,-0.026857,0.251010,-0.941272,0.324632,-0.260873
...,...,...,...,...,...,...,...,...,...,...
1970-01-10 23:56:00+00:00,-2.371841,-1.399878,-2.594168,-0.861832,-1.607387,-2.039274,-1.531870,-2.462220,-0.748116,-1.713229
1970-01-10 23:57:00+00:00,-2.378745,-1.394332,-2.594168,-0.871577,-1.604817,-2.039274,-1.531870,-2.552222,-1.493226,-1.253816
1970-01-10 23:58:00+00:00,-2.385070,-1.388787,-2.594168,-0.880330,-1.579605,-1.943445,-1.659219,-2.273337,-0.411844,-1.692019
1970-01-10 23:59:00+00:00,-2.386119,-1.372902,-2.594168,-0.886744,-1.556761,-1.943445,-1.659219,-2.517728,-1.720201,-1.242460


Here we run a simple feature selection procedure based on mutual information regression, taking each output sensor as the target and the input sensors as predictors. Ideally, the original input sensors (IN1 through IN5) should rank as the five most relevant features for predicting both output sensors, Out1 and Out2.

In [3]:
tags = df_sensors_prepared.columns
targets = ['Out1', 'Out2']
features = [c for c in tags if c not in targets]

sensors_mi = {}
for target in targets:
    mi = mutual_info_regression(X=df_sensors_prepared[features], y=df_sensors_prepared[target], discrete_features=False)
    sensors_mi[target] = mi
    
k = 5
top_features_regression = {}
for target in targets:
    mi = sensors_mi[target]
    idx = np.argpartition(mi, -k)[-k:]
    idx = idx[np.argsort(mi[idx])][::-1]
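    # idx indexes into `features` (the columns passed as X), not `tags`,
    # which also contains the target columns.
    top_features_regression[target] = [features[i] for i in idx]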
    
top_features_regression

{'Out1': ['IN1', 'IN3', 'IN4', 'IN5', 'IN2'],
 'Out2': ['IN1', 'IN3', 'IN4', 'IN2', 'IN5']}
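
The ranking step above can also be sketched on synthetic data, where the relevant features are known in advance. This is a minimal illustration, not the notebook's own code: the helper `top_k_by_mutual_info`, the toy column names, and the fixed `random_state` are assumptions introduced here. Labelling the scores with the feature names before sorting keeps the index-to-name mapping tied to the columns actually passed as `X`.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression


def top_k_by_mutual_info(df, features, target, k, random_state=0):
    """Return the k names in `features` with the highest estimated
    mutual information with `target`, most informative first.

    Hypothetical helper for illustration only.
    """
    mi = mutual_info_regression(
        df[features], df[target],
        discrete_features=False,
        random_state=random_state,  # the estimator is randomized; fix the seed
    )
    # Attach feature names to the scores so sorting cannot misalign them.
    ranked = pd.Series(mi, index=features).sort_values(ascending=False)
    return list(ranked.index[:k])


# Toy data (assumed): y depends on signal_a and signal_b only.
rng = np.random.default_rng(0)
n = 2000
toy = pd.DataFrame({
    'signal_a': rng.normal(size=n),
    'signal_b': rng.normal(size=n),
    'noise_a': rng.normal(size=n),
    'noise_b': rng.normal(size=n),
})
toy['y'] = toy['signal_a'] + toy['signal_b'] ** 2 + 0.1 * rng.normal(size=n)

toy_features = [c for c in toy.columns if c != 'y']
selected = top_k_by_mutual_info(toy, toy_features, target='y', k=2)
print(selected)
```

Because mutual information captures nonlinear dependence, the quadratic term in `signal_b` should be picked up even though its linear correlation with `y` is close to zero.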

Create a new data frame for each output sensor containing that sensor and its selected features, then split it chronologically into training and test sets. Each training and test data frame is saved as a separate parquet file per output sensor.

In [4]:
for target in targets:
    top_features = top_features_regression[target]
    df = df_sensors_prepared[[c for c in df_sensors_prepared.columns if c in top_features + [target]]]
    df_train = df[df.index < '1970-01-10 08:00:00']
    df_test = df[df.index >= '1970-01-10 08:00:00']
    df_train.to_parquet('./data_prepared/train/' + target + '.parquet')
    df_test.to_parquet('./data_prepared/test/' + target + '.parquet')
    print(df_train)
    print(df_test)

                                IN1       IN2       IN3       IN4       IN5  \
1970-01-01 00:00:00+00:00  0.077744  0.795565 -0.068312  0.879321  0.134419   
1970-01-01 00:01:00+00:00  0.080313  0.824595 -0.068312  0.875636  0.134941   
1970-01-01 00:02:00+00:00  0.087355  0.776258 -0.068312  0.884105  0.132452   
1970-01-01 00:03:00+00:00  0.091774  0.739149 -0.068312  0.892043  0.131287   
1970-01-01 00:04:00+00:00  0.091166  0.761332 -0.068312  0.892752  0.136788   
...                             ...       ...       ...       ...       ...   
1970-01-10 07:55:00+00:00 -2.225415 -0.899663 -2.259944 -0.942947 -1.039779   
1970-01-10 07:56:00+00:00 -2.230137 -0.814217 -2.222946 -0.924413 -1.021914   
1970-01-10 07:57:00+00:00 -2.227956 -0.894048 -2.199024 -0.896666 -0.979397   
1970-01-10 07:58:00+00:00 -2.222570 -0.730551 -2.227159 -0.880720 -0.926202   
1970-01-10 07:59:00+00:00 -2.219947 -0.898019 -2.244797 -0.891174 -0.927005   

                               Out1  
1970-01-01 00