<h1> Tabular Playground Series - April 2022</h1>  
Time series classification  

"You've been provided with thousands of sixty-second sequences of biological sensor data recorded from several hundred participants who could have been in either of two possible activity states. Can you determine what state a participant was in from the sensor data?"


<h2> Frame the problem </h2>

The objective of this month's problem is to predict, for each 60-second sequence of sensor data, the probability that the subject is in one of two states.  
The evaluation is based on the area under the ROC curve between the predicted probability and the observed target.  
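
As a small illustration of the metric, the sketch below scores a handful of made-up labels and probabilities with scikit-learn's `roc_auc_score`. The arrays `y_true` and `y_prob` are invented for the example and are not competition data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted probabilities (not competition data)
y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8])

# AUC depends only on how the predictions rank positives against negatives
auc = roc_auc_score(y_true, y_prob)
print(f"AUC = {auc:.2f}")
```

Three of the four positive/negative pairs are ranked correctly here, which gives an AUC of 0.75. Because only the ranking matters, submitting probabilities rather than hard 0/1 labels generally preserves more of that ranking.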

Problem: supervised classification 

<h2> Import libraries </h2>

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
from sklearn.linear_model import LogisticRegression

<h2> Get the data </h2>

In [None]:
df_train = pd.read_csv('../input/tabular-playground-series-apr-2022/train.csv')
df_train_labels = pd.read_csv('../input/tabular-playground-series-apr-2022/train_labels.csv', index_col = 'sequence')
df_test = pd.read_csv('../input/tabular-playground-series-apr-2022/test.csv')
df_submission_example = pd.read_csv('../input/tabular-playground-series-apr-2022/sample_submission.csv')

In [None]:
df_train.head(1)

In [None]:
df_train_labels.head(1)

In [None]:
df_test.head(1)

In [None]:
df_submission_example.head(1)

<h2> Exploratory Data Analysis </h2>
Feature exploration: correlations and outliers

Target label is a binary feature (0/1)

In [None]:
df_train_labels.state.unique()

Target variable is well balanced in train data

In [None]:
df_train_joined = df_train.merge(df_train_labels, on=['sequence'])

In [None]:
df_train_joined.groupby('state')['sequence'].count()

The train dataset has 1,558,080 rows with 16 columns

In [None]:
df_train.shape

Each subject has multiple sequences, and each sequence has 60 steps. Each step represents one second of sensor data.

In [None]:
df_train[(df_train['sequence']==0)].head()

All sequences have 60 steps

In [None]:
df = df_train.groupby(['sequence'])['step'].count().reset_index()
df.step.unique()

All columns are numerical and do not include any null entries

In [None]:
df_train.isna().sum()

In [None]:
df_test.isna().sum()

In [None]:
df_train.hist(figsize=(20,20), xrot=45)
plt.show()

In [None]:
df_train.describe()

Evaluating the relationship between the target and the sensors

In [None]:
## Are all predictors independent of each other?

https://towardsdatascience.com/13-key-code-blocks-for-eda-classification-task-94890622be57
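
One possible way to approach the independence question is a pairwise correlation check between the sensor columns. The sketch below is only an illustration: `df_demo` is a synthetic stand-in with the same sensor column names as the train data, one pair is made dependent on purpose, and the 0.8 threshold is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df_train with the same sensor column names (values are random)
rng = np.random.default_rng(0)
sensor_cols = [f"sensor_{i:02d}" for i in range(13)]
df_demo = pd.DataFrame(rng.normal(size=(600, 13)), columns=sensor_cols)
# Make one pair deliberately dependent so the check has something to find
df_demo["sensor_01"] = df_demo["sensor_00"] * 0.9 + rng.normal(scale=0.1, size=600)

# Pairwise Pearson correlation between sensors
corr = df_demo[sensor_cols].corr()

# List pairs whose absolute correlation exceeds the threshold (0.8 is arbitrary)
threshold = 0.8
high_pairs = [
    (a, b, round(corr.loc[a, b], 3))
    for i, a in enumerate(sensor_cols)
    for b in sensor_cols[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(high_pairs)
```

On the real data, `df_demo` would be replaced by `df_train`. Pearson correlation only captures linear dependence, so a short list here does not prove the sensors are independent.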

<h2> Prepare the data </h2>  
Apply data transformations identified in the previous step.  <br>
Apply data cleaning, feature selection & engineering, feature scaling for value standardisation/normalisation.
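
As a rough sketch of what feature engineering and scaling could look like for this kind of data, the example below collapses each sequence into summary statistics and then standardises them. It runs on a small synthetic frame (`df_demo`, two sensors, five sequences), and the choice of statistics and of `StandardScaler` is an assumption for illustration, not a tuned recipe.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 5 sequences x 60 steps, two sensors (values are random)
rng = np.random.default_rng(0)
n_seq, n_steps = 5, 60
df_demo = pd.DataFrame({
    "sequence": np.repeat(np.arange(n_seq), n_steps),
    "step": np.tile(np.arange(n_steps), n_seq),
    "sensor_00": rng.normal(size=n_seq * n_steps),
    "sensor_01": rng.normal(loc=5.0, scale=3.0, size=n_seq * n_steps),
})

# One row per sequence, with several summary statistics per sensor
features = df_demo.groupby("sequence")[["sensor_00", "sensor_01"]].agg(
    ["mean", "std", "min", "max"]
)
# Flatten the (sensor, statistic) column MultiIndex into names like sensor_00_mean
features.columns = [f"{sensor}_{stat}" for sensor, stat in features.columns]

# Standardise each feature to zero mean and unit variance
scaler = StandardScaler()
features_scaled = pd.DataFrame(
    scaler.fit_transform(features),
    index=features.index,
    columns=features.columns,
)
print(features_scaled.shape)
```

When applied to real train and test sets, the scaler would normally be fitted on the train features only and reused with `scaler.transform` on the test features.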

In [None]:
# Convert the 60 per-step readings of each sequence into one mean per sensor.
# All 13 sensors are aggregated in a single groupby so that none is overwritten.
sensor_cols = [f"sensor_{i:02d}" for i in range(13)]

def sensor_means(df):
    """Return one row per (sequence, subject) with a <sensor>_mean column per sensor."""
    return (df.groupby(['sequence', 'subject'])[sensor_cols]
              .mean()
              .add_suffix('_mean')
              .reset_index())

In [None]:
df_train = sensor_means(df_train)

In [None]:
# Repeat for test
df_test = sensor_means(df_test)

<h2> Model the data </h2>

#### Logistic Regression 
Predicted variable is binary

In [None]:
df_train.head()

In [None]:
df_train_labels.head()

In [None]:
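from sklearn.linear_model import LogisticRegression

# Use only the aggregated sensor features; sequence and subject are identifiers, not predictors
feature_cols = [c for c in df_train.columns if c.endswith('_mean')]
# Align the labels to the feature rows by sequence id
y_train = df_train_labels.loc[df_train['sequence'], 'state'].to_numpy()

model = LogisticRegression(max_iter=10000)
model.fit(df_train[feature_cols], y_train)
# The competition is scored on ROC AUC, so predict the probability of state 1
pred = model.predict_proba(df_test[feature_cols])[:, 1]

result = pd.DataFrame()
result['sequence'] = df_test.sequence
result['state'] = pred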
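
Before submitting, it may help to estimate the AUC locally. The sketch below is one possible approach, not part of the original notebook: it uses `GroupKFold` so that all sequences from a subject stay in the same fold, on a synthetic table (`demo`) with a made-up target. The column names mirror the aggregated features, but every value is random.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GroupKFold

# Synthetic stand-in for the aggregated training table (values are random)
rng = np.random.default_rng(0)
n = 400
demo = pd.DataFrame({
    "subject": rng.integers(0, 40, size=n),
    "sensor_00_mean": rng.normal(size=n),
    "sensor_01_mean": rng.normal(size=n),
})
# Made-up target that depends on the first feature, so there is signal to learn
y = (demo["sensor_00_mean"] + rng.normal(scale=0.5, size=n) > 0).astype(int).to_numpy()
X = demo[["sensor_00_mean", "sensor_01_mean"]]

# Keep every subject's sequences within a single fold
scores = []
for train_idx, valid_idx in GroupKFold(n_splits=5).split(X, y, groups=demo["subject"]):
    cv_model = LogisticRegression(max_iter=10000)
    cv_model.fit(X.iloc[train_idx], y[train_idx])
    prob = cv_model.predict_proba(X.iloc[valid_idx])[:, 1]
    scores.append(roc_auc_score(y[valid_idx], prob))

print(f"mean AUC over folds: {np.mean(scores):.3f}")
```

Grouping by subject is a design choice that assumes the model should generalise to subjects it has not seen. A plain `KFold` would let sequences from the same subject appear on both sides of a split.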


In [None]:
# Write only the sequence and state columns, without the DataFrame index
result.to_csv('submission.csv', index=False)