# Synopsis

This notebook shows how we can find the bad sensor values, summarizes what we know about them, and suggests a basic way to handle them.

# Setup

In [None]:
import copy
import random

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from pandas.io.formats import style

import matplotlib.pyplot as plt
import seaborn as sns

import plotly.figure_factory as ff
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots

import sklearn
import sklearn.preprocessing as sk_prep
import sklearn.model_selection as sk_ms
import sklearn.feature_selection as sk_fs
import sklearn.pipeline as sk_pipe
import sklearn.compose as sk_comp
import sklearn.base as sk_base
import sklearn.ensemble as sk_ens
import sklearn.metrics as sk_met
import sklearn.linear_model as sk_lm
import sklearn.tree as sk_tree
import sklearn.svm as sk_svm
import sklearn.decomposition as sk_de

from scipy import stats

import catboost as cb

from statsmodels.tsa import seasonal

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
DATA_DIR = '/kaggle/input/tabular-playground-series-jul-2021'
RANDOM_STATE = 88533

np.random.seed(RANDOM_STATE)
random.seed(RANDOM_STATE)

# Load Data

In [None]:
# Train Data

train_set = pd.read_csv(os.path.join(DATA_DIR, 'train.csv'), parse_dates=[0])
train_X = train_set.iloc[:, :-3] # Feature columns
# train_ar = train_X.to_numpy()

target_cols = list(train_set.columns)[-3:]
train_y = train_set[target_cols]

# # Train data with label encoded target
# train_w_targ = train_X.copy()
# train_w_targ[target_cols] = train_y

print(train_set.shape)
train_set

In [None]:
# Training DataFrames with datetime index
# Targets adjusted to log values

train_set_tidx = train_set.iloc[:, 1:].copy()
train_set_tidx.index = train_set['date_time']
train_X_tidx = train_set_tidx.iloc[:, :-3]

for col in train_set_tidx.columns[-3:]:
    train_set_tidx.loc[:, col] = np.log(train_set_tidx[col])

train_y_tidx = train_y.copy()
train_y_tidx.loc[:, :] = np.log(train_y)
train_y_tidx.index = train_set['date_time']

train_set_tidx

In [None]:
# Test Data

test_set = pd.read_csv(os.path.join(DATA_DIR, 'test.csv'), parse_dates=[0])
test_X = test_set
test_ar = test_X.to_numpy()

test_X.shape

In [None]:
# Test dataframes with datetime index

test_set_tidx = test_set.iloc[:, 1:].copy()  # copy so setting the index does not touch test_set
test_set_tidx.index = test_set['date_time']
test_X_tidx = test_set_tidx

test_set_tidx

# Initial Check for Missing Data

In [None]:
# Count nulls
print("Count of NaN's")
print('Train: ', np.sum(np.isnan(train_set_tidx.values)))
print('Test: ', np.sum(np.isnan(test_set_tidx.values)))

We see that there are no missing values, so we might be tempted to think that no repair is needed.  Further exploration reveals that we have bad data instead of missing data.
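To illustrate why a null count alone can be misleading, here is a minimal sketch on made-up readings (not the competition data). The stuck value of 650.0 and the run-length check are assumptions for the example; the real bad readings vary slightly, so this is not the detection method used in this notebook.

```python
import pandas as pd

# Hypothetical readings: the sensor sticks at 650.0 for three samples.
readings = pd.Series([1210.0, 1185.0, 650.0, 650.0, 650.0, 1192.0], name="sensor")

# No value is missing, so a NaN count reports nothing unusual.
nan_count = int(readings.isna().sum())

# Label each run of identical consecutive values, then take the longest run.
run_id = (readings != readings.shift()).cumsum()
longest_run = int(run_id.value_counts().max())

print("NaN count:", nan_count)
print("Longest constant run:", longest_run)
```

The NaN count is zero even though half of the readings are clearly not real measurements.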

# Timelines of Predictors

We will start with time plots of a few of our predictors.  I first found the errors by examining these plots.

In [None]:
# Samples of timelines

for col in train_X.iloc[:, 1:5]:
    fig, axis = plt.subplots(figsize=(15,2))
    sns.lineplot(x='date_time', y=col, data=train_X)
    sns.lineplot(x='date_time', y=col, data=test_X)
    axis.set_title(col)
    plt.show()

The plots show that, from time to time, some sensors jump to a value far from their recent average and may stay there for a while.  This is especially obvious in absolute humidity, where the values in these stretches sit at the edge of the range of the other values.
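The same idea can be checked numerically rather than by eye. The sketch below is an alternative technique, not the approach taken in this notebook: it compares each value with a trailing rolling median on made-up data. The window of 7 and the threshold of 0.5 are arbitrary assumptions and would need tuning on real data.

```python
import pandas as pd

# Made-up series: mostly near 1.0, with a block of three readings stuck at 0.2.
values = pd.Series([1.0, 1.1, 0.9, 1.0, 1.05, 0.95, 0.2, 0.2, 0.2, 1.0, 1.1, 0.9])

# Trailing rolling median as a robust stand-in for the "recent average".
recent_median = values.rolling(window=7, min_periods=1).median()

# Flag samples that sit far from the recent median.
deviation = (values - recent_median).abs()
flagged = deviation > 0.5

print(flagged[flagged].index.tolist())
```

A median is used instead of a mean so that a short bad stretch does not drag the baseline toward itself. A stretch longer than about half the window would still contaminate the baseline.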

# Using Pairplots to Find the Bad Values

We can use pairplots to find samples with unusual combinations of values for two variables.

We will start with an interactive pair plot.  You can zoom in and then select groups of values that fall outside the main group; this will highlight those samples in all frames.  I suggest zooming in on the separated group in absolute humidity (3rd row) vs sensor 4, then highlighting the blob.  When you hover over the plot, the controls appear in the upper right.

In [None]:
# fig = px.scatter_matrix(train_set_tidx.iloc[:, [0,5,6,9,10]])
fig = px.scatter_matrix(train_set_tidx.iloc[:, 1:])
fig.update_traces(selected_marker_color='#ff0000', marker_size=2)
fig.show()

Here is a plot with a smaller number of variables to make the selection process easier.

In [None]:
fig = px.scatter_matrix(train_set_tidx.iloc[:, [0,5,6,9,10]])
fig.update_traces(selected_marker_color='#ff0000', marker_size=2)
fig.show()

We can also determine the position of the blob, use its coordinate ranges to identify the bad values, and then draw the pair plot with two categories (normal and bad).
That later pair plot uses a 20% sample to speed up the plotting; smaller samples caused plotting errors.

# Identifying a Set of Bad Samples

Interactively, I determined that the bad values for the combination of deg_C, sensor 3, and sensor 4 are those within these ranges:

Degrees C: 22 to 26.5

Sensor 3: 1500 to 2100

Sensor 4: 500 to 700

Using Degrees C with either of the others appears to define the set accurately.
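As a small sketch of how these ranges become a flag, the example below builds a boolean mask on made-up rows. The column names follow the competition data, the values are invented, and all three ranges are combined here even though two of them appear to be enough.

```python
import pandas as pd

# Made-up rows for illustration only.
demo = pd.DataFrame({
    "deg_C":    [12.0, 24.0, 25.5, 23.0, 30.1],
    "sensor_3": [900.0, 1800.0, 1650.0, 1100.0, 1700.0],
    "sensor_4": [1600.0, 600.0, 550.0, 1500.0, 620.0],
})

# A row is flagged only when every range matches (between() is inclusive).
is_bad = (
    demo["deg_C"].between(22.0, 26.5)
    & demo["sensor_3"].between(1500, 2100)
    & demo["sensor_4"].between(500, 700)
)

print(is_bad.tolist())
```

Combining the conditions with `&` keeps the flag as a boolean Series, which can be used directly in `.loc` selections.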

In [None]:
# Bad index and predictors for combination of train and test

data_bad = pd.concat([train_X_tidx, test_X_tidx[1:]])
bad_idx = data_bad['deg_C'].between(22.0, 26.5).astype('int')
bad_idx *= data_bad['sensor_3'].between(1500, 2100).astype('int')

for col in data_bad.columns:
    data_bad.loc[bad_idx == 0, col] = np.nan

# Bad index and targets for train

bad_idx_tr = bad_idx[:train_y_tidx.shape[0]]
data_bad_y = train_y_tidx.copy()

for col in data_bad_y:
    data_bad_y.loc[bad_idx_tr == 0, col] = np.nan

In [None]:
# predictor data in bad samples
data_bad.describe()

In [None]:
# target data in bad samples
data_bad_y.describe()

From this description we can see that Benzene has almost no variation in the bad samples.  We will remove it from the following pairplots because it would cause trouble there.

# Pairplot Showing the Bad Values

In [None]:
# Using 20% sample

# Benzene removed because it does not have enough variation in bad values.

plot_df = train_set_tidx.copy()
plot_df.drop(columns='target_benzene', inplace=True)
plot_df['bad_val'] = bad_idx_tr

g = sns.PairGrid(plot_df.sample(frac=0.2), hue='bad_val', diag_sharey=False)
g.map_upper(sns.scatterplot)
g.map_lower(sns.kdeplot)
g.map_diag(sns.kdeplot, multiple='layer', fill=True)

We can see that every variable has a limited range for the samples that we have identified as bad, except target carbon monoxide and target nitrogen oxides.  This leads us to expect that the bad data is in the same samples for all our inputs.  We already saw that benzene has almost no range at all in these samples.
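The limited range can also be read off as numbers. This is a minimal sketch on invented values, with `bad_val` standing in for the flag used in the plot above. It compares the spread (max minus min) of one column in the normal and flagged groups.

```python
import pandas as pd

# Invented readings; bad_val is 1 for flagged samples and 0 otherwise.
demo = pd.DataFrame({
    "sensor_4": [1600.0, 1450.0, 600.0, 610.0, 1700.0, 605.0],
    "bad_val":  [0, 0, 1, 1, 0, 1],
})

# Spread of the column within each group.
grouped = demo.groupby("bad_val")["sensor_4"]
value_range = grouped.max() - grouped.min()

print(value_range)
```

A much smaller spread in the flagged group than in the normal group is the numeric counterpart of the narrow clusters in the pairplot.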

# Timelines with Errors Marked

We can also look at all of our timelines again, this time with the error samples marked.  This also fits with the belief that all predictors and one target are affected at the same time.

In [None]:
for col in train_X_tidx:
    fig, axis = plt.subplots(figsize=(15,2))
    sns.lineplot(x='date_time', y=col, data=train_X_tidx)
    sns.lineplot(x='date_time', y=col, data=test_X_tidx)
    sns.lineplot(x='date_time', y=col, data=data_bad, color='red')
#     sns.scatterplot(x='date_time', y=col, data=data_bad, marker='o', color='red', edgecolors='none', alpha=1.0)
    axis.set_title(col)
    plt.show()

In [None]:
for col in train_y_tidx:
    fig, axis = plt.subplots(figsize=(11,2))
    sns.lineplot(x=train_X_tidx.index, y=train_y_tidx[col])
    sns.lineplot(x=train_X_tidx.index, y=data_bad_y[col], color='red')
    axis.set_title(col)
    plt.show()

# Pairplot of Bad Samples

The next item to check is whether the small amount of variation in the bad readings has any value for predictions.  If it did, we would still have to separate those readings out or rescale them to work with the good values.  Let's use a pairplot to check the possibilities.

In [None]:
# Bad values only

plot_df = plot_df.loc[bad_idx_tr == 1]
# plot_df.drop(columns='target_benzene', inplace=True)

g = sns.PairGrid(plot_df.iloc[:, :-1], diag_sharey=False)
g.map_upper(sns.scatterplot)
g.map_lower(sns.kdeplot)
g.map_diag(sns.histplot)

The only clear correlation is between two of the targets, so we will conclude that none of the variation in the predictors in the bad samples has any value for us.
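A correlation matrix restricted to the bad samples gives a numeric cross-check of what the pairplot suggests. The sketch below uses invented values chosen so that the two targets move together while the sensor does not; it only demonstrates the call and is not a result from the competition data.

```python
import pandas as pd

# Invented values standing in for the rows flagged as bad.
bad_rows = pd.DataFrame({
    "sensor_1": [700.0, 705.0, 698.0, 702.0],
    "target_carbon_monoxide": [0.5, 1.0, 1.5, 2.0],
    "target_nitrogen_oxides": [1.1, 2.0, 3.1, 4.0],
})

# Pearson correlation between every pair of columns.
corr = bad_rows.corr()

print(corr.round(2))
```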

# Corrections

## Preprocessing

There are many ways we could substitute values in place of the bad data.  Here we will simply replace any bad value with the last good value before it, and check that we have some correlation to the target values for those samples.
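The "last good value" idea can be sketched in a vectorized form: blank out the bad rows and forward-fill. The data and the `is_bad` flag below are made up, and the sketch assumes the columns contain no other missing values, since `ffill` would fill those as well.

```python
import numpy as np
import pandas as pd

# Made-up values; is_bad plays the role of the bad-sample flag.
demo = pd.DataFrame({
    "sensor_1": [1000.0, 1010.0, 700.0, 705.0, 1020.0],
    "target_carbon_monoxide": [0.5, 0.6, 0.7, 0.8, 0.9],
})
is_bad = pd.Series([False, False, True, True, False], index=demo.index)

# Columns believed to be unaffected are left untouched.
skip_cols = ["target_carbon_monoxide"]
fix_cols = [c for c in demo.columns if c not in skip_cols]

fixed = demo.copy()
# Blank out the bad rows, then carry the last good value forward.
fixed.loc[is_bad, fix_cols] = np.nan
fixed[fix_cols] = fixed[fix_cols].ffill()

print(fixed)
```

A bad stretch at the very start of the data has no earlier good value, so those rows would stay NaN and need separate handling.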

In [None]:
# Simple data fix--use previous value

# Get list of locations to fix (by number)
bad_idx_2 = bad_idx_tr.copy()
bad_idx_2.index = range(bad_idx_2.shape[0])

bad_idx_list = bad_idx_2[bad_idx_2 == 1].index

In [None]:
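# Fix these locations by using the value before the start of the bad data.
# Rows are processed in order, so the last good value is carried through
# each run of consecutive bad samples.
fixed_data = train_set_tidx.copy()
skip_cols = ['target_carbon_monoxide', 'target_nitrogen_oxides']  # not affected

for i in bad_idx_list:
    if i == 0:
        continue  # no previous row; i-1 would wrap around to the last row
    for j, col in enumerate(fixed_data.columns):
        if col in skip_cols:
            continue
        fixed_data.iloc[i, j] = fixed_data.iloc[i-1, j]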

In [None]:
# Fixed samples only

plot_df = fixed_data.loc[bad_idx_tr == 1]

g = sns.PairGrid(plot_df, diag_sharey=False)
g.map_upper(sns.scatterplot, alpha=0.1)
g.map_lower(sns.kdeplot)
g.map_diag(sns.histplot)

Here we can see that some sensors show strong correlations, so some form of fixed data should be used.

Benzene always has the same value when the sensors are failing, so we might do better setting the final prediction to that value for those samples rather than using a prediction from the model.  This assumes, of course, that the test data is like the training data in this respect.  Be sure to use the original value of Benzene in the bad data and not the log value shown in this notebook.
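A minimal sketch of that override, assuming the model predicts log benzene as in this notebook. The predictions, the flag, and the constant `bad_benzene_value` are placeholders; the real constant should be read from the original (non-log) benzene values in the bad training samples.

```python
import numpy as np

# Hypothetical model output on the log scale.
pred_log_benzene = np.array([1.2, 2.0, 1.7, 0.9])
# Hypothetical flag marking test rows where the sensors are failing.
is_bad = np.array([False, True, True, False])
# Placeholder constant on the ORIGINAL scale, not the log scale.
bad_benzene_value = 0.1

# Convert predictions back to the original scale, then override the bad rows.
pred_benzene = np.exp(pred_log_benzene)
final_benzene = np.where(is_bad, bad_benzene_value, pred_benzene)

print(final_benzene)
```

Applying the override after the inverse transform avoids mixing log-scale and original-scale values in the same array.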