# Data Exploration and Regression
This notebook explores the tabular dataset and contains a simple solution based on regression. The results are not great, but the notebook is intended to provide insight into the dataset and show a basic approach to a complete Kaggle workflow. It is mostly unfiltered and records my process as I go. After some experimentation I might create a summary notebook! I hope you guys like it!

## Initialization

In [None]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt

# Helper variables
sample_data_path = "/kaggle/input/tabular-playground-series-apr-2022/sample_submission.csv"
train_data_path = "/kaggle/input/tabular-playground-series-apr-2022/train.csv"
train_labels_path = "/kaggle/input/tabular-playground-series-apr-2022/train_labels.csv"
test_data_path = "/kaggle/input/tabular-playground-series-apr-2022/test.csv"

# Import data to pandas df
train_data = pd.read_csv(train_data_path)
train_labels = pd.read_csv(train_labels_path)
test_data = pd.read_csv(test_data_path)

# Useful variables
sensor_cols = [col for col in train_data.columns if col not in ["sequence", "subject", "step"]]

## Exploration
First some data exploration, to show the format of the data and some relevant statistics.

In [None]:
train_data.head()

It seems that multiple rows correspond to the same measurement, or sequence. The rows have distinct time steps.

In [None]:
train_labels.head()

In [None]:
train_labels['state'].unique()

The labels are mapped directly to a sequence. Time to evaluate the data a bit more by looking at some summary statistics.

### Sequences and Labels
Let's compare the number of sequences and number of timesteps first. We should verify whether every sequence has an equal number of time steps available.

In [None]:
per_sequence = train_data.groupby("sequence")
per_sequence['step'].count().unique()

That seems to be the case, cool! Now let's check if the training labels fully account for all the sequences in the training data.

In [None]:
# np.array_equal returns False on a length mismatch instead of raising an error
if np.array_equal(train_labels["sequence"].values, train_data["sequence"].unique()):
    print("All sequences accounted for!")
else:
    print("Nope, some sequences differ")
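
The check above depends on both sides listing the sequences in the same order. A set-based comparison is a bit more forgiving and can also report which sequences lack a label. The following is a minimal sketch on toy frames; the helper name `unlabeled_sequences` and the toy values are assumptions for illustration, not part of the competition data.

```python
import pandas as pd

def unlabeled_sequences(data: pd.DataFrame, labels: pd.DataFrame) -> set:
    """Return the sequence ids that occur in `data` but have no row in `labels`."""
    return set(data["sequence"]) - set(labels["sequence"])

# Toy frames (hypothetical): sequence 2 has measurements but no label,
# and the labels are deliberately listed in a different order.
toy_data = pd.DataFrame({"sequence": [0, 0, 1, 1, 2, 2], "step": [0, 1, 0, 1, 0, 1]})
toy_labels = pd.DataFrame({"sequence": [1, 0], "state": [1, 0]})

missing = unlabeled_sequences(toy_data, toy_labels)
print(missing)
```

An empty set means every sequence in the data has a label, regardless of row order.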

How about the number of sequences?

In [None]:
train_data['sequence'].unique().size

Since most of the data is present, we can take a look at some sensor measurement properties.

### Sensor Measurement Properties

We can get a sense of the sensor data by looking at some standard statistics such as the count, mean and quantiles. Luckily pandas has us covered with the `describe` method.

In [None]:
train_data["sensor_00"].describe()

These statistics describe the data aggregated over **both sequences and steps**. We can conclude that either the data has not yet been normalized or it contains big outliers. Since the difference between the quantiles is quite large, it might help to create some plots.

In [None]:
fig, axs = plt.subplots(len(sensor_cols), 1, figsize=(15,100), sharex=True)
plt.xlabel('sequence')
for col, plt_ax in zip(sensor_cols, axs):
    plt_ax.title.set_text(col)
    train_data[col].plot(ax=plt_ax)

Looking at the plots we can conclude that the data has not yet been normalized. We'll get back to preprocessing later. 

It would be interesting to see how many unique subjects are present in the dataset. Lastly, it would be good to see if any of the sensor or subject entries are invalid (NaN). Let's do that now.

In [None]:
train_data["subject"].unique().size

In [None]:
train_data.isnull().sum()

No invalid data! Such a luxury ;). That means we can move on to preprocessing.

## Preprocessing

Since the sensor data will be used to train a simple regression model, we will scale the data so every value fits between 0 and 1. I am ignoring the peculiar data distribution in sensor_02 for now. 

In [None]:
from pandas import DataFrame
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

def normalize_columns(bio_data: DataFrame, cols) -> DataFrame:
    """
    Normalize data in the provided columns between 0 and 1.

    Note: the scaler is refit on every call, so each frame is scaled
    by its own minimum and maximum.
    """
    bio_data_norm = bio_data.copy()
    # Keep the original index so the assignment below aligns row by row
    normalized_cols = pd.DataFrame(scaler.fit_transform(bio_data[cols]), columns=cols, index=bio_data.index)
    bio_data_norm[cols] = normalized_cols
    return bio_data_norm

In [None]:
norm = normalize_columns(train_data, sensor_cols)
norm.describe()
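
One thing to keep in mind: `normalize_columns` calls `fit_transform` on whatever frame it receives, so a test frame would be scaled by its own minimum and maximum rather than by those of the training data. Below is a minimal sketch of the alternative, written in plain pandas so it is self-contained: learn the minimum and range from the training frame once and reuse them. The helper names `fit_min_max` and `apply_min_max` and the toy values are assumptions for illustration. With sklearn, the same effect should come from calling `fit` once on the training data and `transform` on both frames.

```python
import pandas as pd

def fit_min_max(frame: pd.DataFrame, cols):
    """Learn the per-column minimum and range from the training frame."""
    col_min = frame[cols].min()
    col_range = frame[cols].max() - col_min
    # Avoid division by zero for constant columns
    col_range = col_range.replace(0, 1)
    return col_min, col_range

def apply_min_max(frame: pd.DataFrame, cols, col_min, col_range) -> pd.DataFrame:
    """Scale `cols` with previously learned parameters: (x - min) / range."""
    scaled = frame.copy()
    scaled[cols] = (frame[cols] - col_min) / col_range
    return scaled

# Toy frames (hypothetical values)
toy_train = pd.DataFrame({"sensor_00": [0.0, 5.0, 10.0], "sensor_01": [-1.0, 0.0, 1.0]})
toy_test = pd.DataFrame({"sensor_00": [2.5, 20.0], "sensor_01": [0.5, -3.0]})
toy_cols = ["sensor_00", "sensor_01"]

col_min, col_range = fit_min_max(toy_train, toy_cols)
train_scaled = apply_min_max(toy_train, toy_cols, col_min, col_range)
test_scaled = apply_min_max(toy_test, toy_cols, col_min, col_range)
print(test_scaled)
```

With this approach test values can fall outside the 0 to 1 range when they exceed the training extremes, which is expected: the scaling stays identical for both frames.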

For now we can choose a basic feature set for prediction. A straightforward choice is the mean over time, per sequence, of each normalized sensor value. This generates a set of 13 features per sequence that can be used for prediction. Obviously these features ignore how the values are distributed within a sequence, but consider this a nice starting point. We can analyze the correlation of the sensor means with the state by plotting a correlation heatmap.

In [None]:
from pandas import DataFrame

def time_mean_per_sequence(data_frame: DataFrame, cols):
    """
    Return a dataframe with the mean over time steps of chosen columns.
    """
    return data_frame[cols].groupby('sequence').mean()

In [None]:
mean_sensor_readings = time_mean_per_sequence(norm, sensor_cols + ['sequence'])
mean_sensor_readings

In [None]:
import seaborn as sns

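mean_with_labels = mean_sensor_readings.copy()
# mean_sensor_readings is indexed by sequence id, so align the labels on that id
mean_with_labels['state'] = train_labels.set_index('sequence')['state']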

sns.set(rc = {'figure.figsize':(20,8)})
sns.heatmap(mean_with_labels.corr(), annot=True)

It seems that none of the features correlate strongly with the state, which probably means we will have to introduce other features later to improve the model.
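
A full heatmap can be hard to scan for one column. As a complement, the `state` column of the correlation matrix can be pulled out and ranked by absolute value. This is a small sketch on synthetic data; the column names `sensor_a` and `sensor_b` and the random values are made up for illustration and say nothing about the real sensors.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in for the per-sequence feature frame
toy = pd.DataFrame({
    "sensor_a": rng.normal(size=n),  # pure noise
    "sensor_b": rng.normal(size=n),  # carries signal about the state
})
toy["state"] = (toy["sensor_b"] + 0.5 * rng.normal(size=n) > 0).astype(int)

# Correlation of every feature with the state, strongest first
state_corr = toy.corr()["state"].drop("state")
ranked = state_corr.reindex(state_corr.abs().sort_values(ascending=False).index)
print(ranked)
```

The same two lines should work on `mean_with_labels` to list the sensors from most to least correlated with the state.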

## Regression Model

Regression might not be the best tool for the job here, but it is again great to start with because it is simple to implement and understand. I'll be using `sklearn` to fit a multiple linear regression model first. The preprocessor first normalizes all sensor values and then takes the mean over time.

In [None]:
def pre_processor(data_frame):
    return time_mean_per_sequence(normalize_columns(data_frame, sensor_cols), sensor_cols + ['sequence'])

In [None]:
from sklearn.linear_model import LinearRegression

lm = LinearRegression()

features = pre_processor(train_data)
lm.fit(features.values, train_labels['state'].values)

Now we can use the model to predict on the test set. Then it's just a matter of putting the predictions in the right format and handing it in!

In [None]:
test_features = pre_processor(test_data)
test_predictions = lm.predict(test_features.values)
test_predictions_frame = pd.DataFrame(test_predictions, columns=['state'])
test_predictions_frame.describe()

In [None]:
test_predictions_frame['state'] = test_predictions_frame['state'].clip(0, 1)
test_predictions_frame.describe()

In [None]:
# Take the ids from the feature index so every prediction keeps its own sequence id
test_predictions_frame['sequence'] = test_features.index
test_predictions_frame.head()

In [None]:
test_predictions_frame.to_csv('submission.csv', index=False)
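
When attaching sequence ids to predictions, note that `groupby('sequence')` sorts its result by sequence id, whereas `unique()` returns the ids in order of first appearance. The two only agree when the input rows are already sorted by sequence. A toy illustration (the frame below is an assumption, deliberately unsorted):

```python
import pandas as pd

# Toy frame (hypothetical): sequence 7 appears before sequence 3
toy = pd.DataFrame({
    "sequence": [7, 7, 3, 3],
    "sensor_00": [1.0, 3.0, 10.0, 20.0],
})

toy_means = toy.groupby("sequence")["sensor_00"].mean()

ids_from_groupby = toy_means.index.tolist()          # sorted by id
ids_from_unique = toy["sequence"].unique().tolist()  # order of appearance

# Taking the ids from the grouped index keeps each id next to its own value
toy_submission = toy_means.rename("state").reset_index()
print(ids_from_groupby, ids_from_unique)
print(toy_submission)
```

Reading the ids from the index of the grouped features avoids relying on the row order of the input file.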

Interestingly, the linear model predicts negative state values for most of the sequences. This might have to do with the fact that the features lack predictive power. The most logical follow-up would thus be to look for features with better predictive power, and use them for a better prediction.

## Improved Features

Since the previous model does not appear to be better than a coin flip, we can try to improve the performance by introducing lag features: here, the mean of each sensor over three windows of 20 steps. Their correlation with the target state gives a sense of their predictive power. It's also worth plotting the mean of each sensor per time step, taken across all sequences, to try and identify global patterns over time. In this case the per-step means do not seem to contain clearly identifiable trends.

In [None]:
mean_per_step = train_data.groupby('step').mean()
mean_per_step[sensor_cols].plot()

In [None]:
def lag_features(data_frame):
    """
    Return a dataframe with the per-sequence mean of each sensor column over
    three step windows (0-19, 20-39 and 40-59). The subject, sequence and
    step columns are excluded from the features.
    """
    columns = [col for col in data_frame.columns if col not in ['subject', 'sequence', 'step']]
    windows = [(0, 20), (20, 40), (40, 60)]
    # Compute the per-sequence means once per window instead of once per column
    window_means = {
        (start, end): data_frame[(data_frame['step'] >= start) & (data_frame['step'] < end)]
        .groupby('sequence')[columns].mean()
        for start, end in windows
    }
    features = pd.DataFrame()
    for col in columns:
        for start, end in windows:
            features[f'{col}_{start}_{end}'] = window_means[(start, end)][col]
    return features
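
The window means can also be computed in a single pass by binning `step` with `pd.cut` and grouping on the bin. The sketch below uses a toy frame with four steps and two windows instead of the real 60 steps and three windows; the toy values and column labels are assumptions for illustration.

```python
import pandas as pd

# Toy frame (hypothetical): two sequences with four steps each
toy = pd.DataFrame({
    "sequence": [0] * 4 + [1] * 4,
    "step": [0, 1, 2, 3] * 2,
    "sensor_00": [1.0, 3.0, 5.0, 7.0, 2.0, 2.0, 4.0, 8.0],
})

# Bin the steps into two half-open windows: [0, 2) and [2, 4)
window = pd.cut(toy["step"], bins=[0, 2, 4], right=False, labels=["0_2", "2_4"])

# One row per sequence, one column per window
window_means = (
    toy.groupby(["sequence", window], observed=False)["sensor_00"]
    .mean()
    .unstack()
)
window_means.columns = [f"sensor_00_{label}" for label in window_means.columns]
print(window_means)
```

For the real data the bin edges would be `[0, 20, 40, 60]`, matching the windows used in `lag_features`.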

In [None]:
lagged_sensors = lag_features(train_data)
# lagged_sensors is indexed by sequence id, so align the labels on that id
lagged_sensors['state'] = train_labels.set_index('sequence')['state']
sns.heatmap(lagged_sensors.corr(), annot=False)

Still the correlation with the target state looks weak. Let's try using these features to create a regression model.

## Regression With Lag Features

In [None]:
def pre_processor_lag(data_frame):
    return lag_features(normalize_columns(data_frame, sensor_cols))

In [None]:
from sklearn.linear_model import LinearRegression

lm_lag = LinearRegression()

features_with_lag = pre_processor_lag(train_data)
lm_lag.fit(features_with_lag.values, train_labels['state'].values)

In [None]:
test_features_lag = pre_processor_lag(test_data)
test_predictions_lag = lm_lag.predict(test_features_lag.values)
test_predictions_frame_lag = pd.DataFrame(test_predictions_lag, columns=['state'])
# Take the ids from the feature index so every prediction keeps its own sequence id
test_predictions_frame_lag['sequence'] = test_features_lag.index
test_predictions_frame_lag.describe()

In [None]:
test_predictions_frame_lag['state'] = test_predictions_frame_lag['state'].clip(0, 1).round()
test_predictions_frame_lag.describe()

In [None]:
test_predictions_frame_lag.to_csv('submission.csv', index=False)

Since the model is not improving much, it is time for a new notebook with a new approach!