Import the necessary packages.

In [1]:
import requests
import pandas as pd
import numpy as np
import random

Download the original datasets and save them locally. We use a public dataset available from [Mendeley Data](https://data.mendeley.com/datasets/kcpnnrn67p/1), obtained from a chemical processing unit. The data and the underlying process are described in [this paper](https://folk.ntnu.no/skoge/prost/proceedings/ifac2002/data/content/01320/1320.pdf).

In [2]:
import os

in_table_url = 'https://data.mendeley.com/public-files/datasets/kcpnnrn67p/files/f132a70d-11b8-43d1-aaef-facbb6e667ac/file_downloaded'
out_table_url = 'https://data.mendeley.com/public-files/datasets/kcpnnrn67p/files/e62b3119-5435-4810-a96f-e40be1904777/file_downloaded'

# Create the target directory if it does not exist yet.
os.makedirs('./data_original', exist_ok=True)

for url, path in [(in_table_url, './data_original/IN_Table.csv'),
                  (out_table_url, './data_original/OUT_Table.csv')]:
    response = requests.get(url)
    response.raise_for_status()  # stop here rather than save an error page as CSV
    with open(path, 'wb') as f:
        f.write(response.content)

Read the dataset containing input sensor measurements.

In [3]:
df_in = pd.read_csv('./data_original/IN_Table.csv')
df_in

Unnamed: 0,IN1,IN2,IN3,IN4,IN5
0,0.077744,0.795565,-0.665503,0.879321,0.134419
1,0.080313,0.824595,-0.655447,0.875636,0.134941
2,0.087355,0.776258,-0.650550,0.884105,0.132452
3,0.091774,0.739149,-0.644934,0.892043,0.131287
4,0.091166,0.761332,-0.648654,0.892752,0.136788
...,...,...,...,...,...
14396,-2.371841,-1.399878,-2.594168,-0.861832,-1.607387
14397,-2.378745,-1.407683,-2.601522,-0.871577,-1.604817
14398,-2.385070,-1.388787,-2.606594,-0.880330,-1.579605
14399,-2.386119,-1.372492,-2.615738,-0.886744,-1.556761


Read the dataset containing output sensor measurements.

In [4]:
df_out = pd.read_csv('./data_original/OUT_Table.csv')
df_out

Unnamed: 0,Out1,Out2
0,-0.122686,0.123661
1,-0.122686,0.123661
2,-0.026857,0.123661
3,-0.026857,0.251010
4,-0.026857,0.251010
...,...,...
14396,-2.039274,-1.531870
14397,-2.039274,-1.531870
14398,-1.943445,-1.659219
14399,-1.943445,-1.659219


Define a function to compute missing value ratios in a data frame.

In [5]:
def compute_missing_ratio(df):
    df_missing = (df.isnull().sum() / len(df)) * 100
    df_missing = df_missing.drop(df_missing[df_missing == 0].index).sort_values(ascending = False)
    display(pd.DataFrame({'Missing Ratio' :df_missing}))
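
To make the computation concrete, here is a minimal sketch on a toy data frame. The helper `missing_ratio` and the frame `toy` are hypothetical names used only for illustration; the helper returns the ratios instead of calling `display`, so it also runs outside a notebook.

```python
import numpy as np
import pandas as pd

def missing_ratio(df):
    """Return the percentage of missing values per column, omitting complete columns."""
    ratio = df.isnull().mean() * 100  # mean of a boolean mask = share of missing values
    return ratio[ratio > 0].sort_values(ascending=False)

# Toy frame (illustrative only): 'a' has 1 of 4 values missing, 'b' has 2 of 4, 'c' none.
toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, 4.0],
    'b': [np.nan, np.nan, 3.0, 4.0],
    'c': [1.0, 2.0, 3.0, 4.0],
})

toy_ratio = missing_ratio(toy)
print(toy_ratio)
```

Columns without missing values are dropped from the result, which is why the checks on the original input and output tables produce an empty table.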

Check for missing values in the input sensor data.

In [6]:
compute_missing_ratio(df_in)

Unnamed: 0,Missing Ratio


Check for missing values in the output sensor data.

In [7]:
compute_missing_ratio(df_out)

Unnamed: 0,Missing Ratio


Create a single data frame with input and output sensor measurements.

In [8]:
df = pd.concat([df_in, df_out], axis=1)
df

Unnamed: 0,IN1,IN2,IN3,IN4,IN5,Out1,Out2
0,0.077744,0.795565,-0.665503,0.879321,0.134419,-0.122686,0.123661
1,0.080313,0.824595,-0.655447,0.875636,0.134941,-0.122686,0.123661
2,0.087355,0.776258,-0.650550,0.884105,0.132452,-0.026857,0.123661
3,0.091774,0.739149,-0.644934,0.892043,0.131287,-0.026857,0.251010
4,0.091166,0.761332,-0.648654,0.892752,0.136788,-0.026857,0.251010
...,...,...,...,...,...,...,...
14396,-2.371841,-1.399878,-2.594168,-0.861832,-1.607387,-2.039274,-1.531870
14397,-2.378745,-1.407683,-2.601522,-0.871577,-1.604817,-2.039274,-1.531870
14398,-2.385070,-1.388787,-2.606594,-0.880330,-1.579605,-1.943445,-1.659219
14399,-2.386119,-1.372492,-2.615738,-0.886744,-1.556761,-1.943445,-1.659219


Check the column names corresponding to input and output sensors.

In [9]:
df.columns

Index(['IN1', ' IN2', ' IN3', ' IN4', ' IN5', 'Out1', 'Out2'], dtype='object')

Create 3 new input sensor measurements by adding random Gaussian noise (mean 0, standard deviation 0.5) to the actual measurements of IN3, IN4 and IN5. These will later be used to simulate a simple feature selection, where ideally the fabricated input sensors won’t be selected as features.

In [10]:
# Use len(df) rather than a hard-coded row count so the noise always matches the data.
df['IN6'] = df[' IN3'] + np.random.normal(0, 0.5, len(df))
df['IN7'] = df[' IN4'] + np.random.normal(0, 0.5, len(df))
df['IN8'] = df[' IN5'] + np.random.normal(0, 0.5, len(df))

df

Unnamed: 0,IN1,IN2,IN3,IN4,IN5,Out1,Out2,IN6,IN7,IN8
0,0.077744,0.795565,-0.665503,0.879321,0.134419,-0.122686,0.123661,-0.463482,1.135632,0.191626
1,0.080313,0.824595,-0.655447,0.875636,0.134941,-0.122686,0.123661,-0.998246,1.085583,0.503527
2,0.087355,0.776258,-0.650550,0.884105,0.132452,-0.026857,0.123661,-1.171168,0.096908,-0.133069
3,0.091774,0.739149,-0.644934,0.892043,0.131287,-0.026857,0.251010,0.260074,0.332611,-0.151051
4,0.091166,0.761332,-0.648654,0.892752,0.136788,-0.026857,0.251010,-0.941272,0.324632,-0.260873
...,...,...,...,...,...,...,...,...,...,...
14396,-2.371841,-1.399878,-2.594168,-0.861832,-1.607387,-2.039274,-1.531870,-2.462220,-0.748116,-1.713229
14397,-2.378745,-1.407683,-2.601522,-0.871577,-1.604817,-2.039274,-1.531870,-2.552222,-1.493226,-1.253816
14398,-2.385070,-1.388787,-2.606594,-0.880330,-1.579605,-1.943445,-1.659219,-2.273337,-0.411844,-1.692019
14399,-2.386119,-1.372492,-2.615738,-0.886744,-1.556761,-1.943445,-1.659219,-2.517728,-1.720201,-1.242460
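
A fabricated sensor built this way stays strongly correlated with its source, which is why a reasonable feature selection should discard it as redundant. The sketch below illustrates this on a synthetic stand-in signal rather than on the real data; the unit-variance signal, the seed and the sample size are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed (an assumption) so the sketch is repeatable

n = 10_000
signal = rng.normal(0, 1, n)                 # stand-in for a real sensor, unit variance assumed
fabricated = signal + rng.normal(0, 0.5, n)  # same construction as IN6, IN7 and IN8

# Empirical correlation between the source and its noisy copy.
corr = np.corrcoef(signal, fabricated)[0, 1]

# For independent noise, corr = sd_signal / sqrt(sd_signal**2 + sd_noise**2).
expected = 1 / np.sqrt(1 + 0.5 ** 2)

print(round(corr, 3), round(expected, 3))
```

Under these assumptions the correlation is close to 0.89, so the noisy copy adds little information beyond the original sensor.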


Reshape the data frame into long format (one row per time step and sensor), which is how sensor data is more commonly collected.

In [11]:
# DataFrame.append was removed in pandas 2.0, so build one frame per sensor
# and concatenate them in a single call.
frames = [
    pd.DataFrame({'time': df.index, 'sensor': col, 'value': df[col]})
    for col in df.columns
]

df1 = pd.concat(frames, ignore_index=True)
df1

Unnamed: 0,time,sensor,value
0,0,IN1,0.077744
1,1,IN1,0.080313
2,2,IN1,0.087355
3,3,IN1,0.091774
4,4,IN1,0.091166
...,...,...,...
144005,14396,IN8,-1.713229
144006,14397,IN8,-1.253816
144007,14398,IN8,-1.692019
144008,14399,IN8,-1.242460
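
The same wide-to-long reshape can also be written with `DataFrame.melt`. This is an alternative sketch on a toy frame, not the approach used in the notebook; the names `wide` and `long_df` are hypothetical.

```python
import pandas as pd

# Toy wide frame (illustrative only): one column per sensor, one row per time step.
wide = pd.DataFrame({'IN1': [0.1, 0.2], 'Out1': [1.0, 2.0]})

long_df = (
    wide.rename_axis('time')   # name the index so it becomes the 'time' column
        .reset_index()
        .melt(id_vars='time', var_name='sensor', value_name='value')
)

print(long_df)
```

Like the loop above, `melt` stacks the columns one after another, so all rows of the first sensor come before those of the second.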


Correct the sensor names by removing the extra leading spaces inherited from the original column names.

In [12]:
df1['sensor'].unique()

array(['IN1', ' IN2', ' IN3', ' IN4', ' IN5', 'Out1', 'Out2', 'IN6',
       'IN7', 'IN8'], dtype=object)

In [13]:
df1['sensor'] = df1['sensor'].str.strip()
df1['sensor'].unique()

array(['IN1', 'IN2', 'IN3', 'IN4', 'IN5', 'Out1', 'Out2', 'IN6', 'IN7',
       'IN8'], dtype=object)
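
The leading spaces come from the CSV header, so they could also be removed from the column names right after reading the file, which avoids keys such as `' IN3'` in later cells. The sketch below shows the idea on an in-memory CSV; `csv_text` and `toy` are made-up names and values for illustration.

```python
import io
import pandas as pd

# In-memory CSV (illustrative only) with a space after each comma in the header.
csv_text = "IN1, IN2, IN3\n0.1,0.2,0.3\n"

toy = pd.read_csv(io.StringIO(csv_text))
print(list(toy.columns))  # names as read, including the leading spaces

# Strip surrounding whitespace from every column name in one step.
toy.columns = toy.columns.str.strip()
print(list(toy.columns))
```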

Artificially insert missing values at random positions in the data, so that a simple missing value imputation strategy can be demonstrated later during data preparation. We introduce 1,000 missing measurements across the input variables; the output variables are left untouched.

In [14]:
col_nan = [c for c in df1['sensor'].unique() if c not in ['Out1', 'Out2']]
idx_nan = random.sample(list(df1[df1['sensor'].isin(col_nan)].index), 1000)

df1.loc[idx_nan, 'value'] = np.nan
df1

Unnamed: 0,time,sensor,value
0,0,IN1,0.077744
1,1,IN1,0.080313
2,2,IN1,0.087355
3,3,IN1,0.091774
4,4,IN1,0.091166
...,...,...,...
144005,14396,IN8,-1.713229
144006,14397,IN8,-1.253816
144007,14398,IN8,-1.692019
144008,14399,IN8,-1.242460
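
Because the positions are drawn without a fixed seed, each run of the cell above removes a different set of values. If repeatable results are needed, one option is a seeded NumPy generator, sketched here on a toy long-format frame. The seed value and the names `toy` and `input_idx` are assumptions made for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # hypothetical seed; any fixed value gives repeatable draws

# Toy long-format frame (illustrative only): two input sensors and one output sensor.
toy = pd.DataFrame({
    'time': [0, 1, 2] * 3,
    'sensor': ['IN1'] * 3 + ['IN2'] * 3 + ['Out1'] * 3,
    'value': np.arange(9, dtype=float),
})

# Candidate rows: every row that belongs to an input sensor.
input_idx = toy.index[toy['sensor'] != 'Out1'].to_numpy()

# Draw 2 distinct row labels and blank out their values.
idx_nan = rng.choice(input_idx, size=2, replace=False)
toy.loc[idx_nan, 'value'] = np.nan

print(toy)
```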


In [15]:
compute_missing_ratio(df1)

Unnamed: 0,Missing Ratio
value,0.694396
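
As a sanity check, the reported ratio follows directly from the counts. The data has 14,401 rows per sensor (the length used when generating the noise columns) and 10 sensors, and 1,000 values were removed. The variable names below are illustrative.

```python
n_rows = 14401     # rows per sensor, as used when generating IN6 to IN8
n_sensors = 10     # IN1 to IN8 plus Out1 and Out2
n_inputs = 8       # only input sensors received missing values
n_missing = 1000   # number of values replaced by NaN

# Share of missing values over the whole long-format table.
ratio_all = n_missing / (n_rows * n_sensors) * 100

# Share of missing values among input-sensor rows only.
ratio_inputs = n_missing / (n_rows * n_inputs) * 100

print(round(ratio_all, 6))
print(round(ratio_inputs, 3))
```

The first figure matches the 0.694396 reported above. Relative to the input sensors alone, the share of missing values is somewhat higher, at roughly 0.87%.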


Save the generated dataset as a Parquet file. The dataset has 8 input variables, 3 of them artificially generated, and 2 output variables. About 0.7% of all values are missing, all of them at random positions in the input variables.

In [16]:
df1.to_parquet('./data_generated/raw_sensor_data.parquet')