## Overview

In this competition, you'll classify 60-second sequences of sensor data, indicating whether a subject was in either of two activity states for the duration of the sequence.

**Files and Field Descriptions**

- **train.csv**: the training set, comprising ~26,000 60-second recordings of thirteen biological sensors for almost one thousand experimental participants
    - *sequence* - a unique id for each sequence
    - *subject* - a unique id for the subject in the experiment
    - *step* - time step of the recording, in one second intervals
    - *sensor_00* to *sensor_12* - the value for each of the thirteen sensors at that time step
- **train_labels.csv**: the class label for each sequence.
    - *sequence* - the unique id for each sequence.
    - *state* - the state associated to each sequence. This is the target which you are trying to predict.
- **test.csv**: the test set. For each of the ~12,000 sequences, you should predict a value for that sequence's state.
- **sample_submission.csv**: a sample submission file in the correct format.
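
The long layout described above (one row per sequence and time step) can be reshaped into a 3-D array with one 60 x 13 block per sequence. The sketch below is only an illustration on synthetic data: `make_toy_long_frame` and `to_3d` are hypothetical helper names, and the random values stand in for the real sensor readings.

```python
import numpy as np
import pandas as pd

N_STEPS, N_SENSORS = 60, 13
sensor_cols = [f"sensor_{i:02d}" for i in range(N_SENSORS)]

def make_toy_long_frame(n_sequences=3, seed=0):
    """Build a small synthetic frame with the same columns as train.csv (assumed layout)."""
    rng = np.random.default_rng(seed)
    frame = pd.DataFrame({
        "sequence": np.repeat(np.arange(n_sequences), N_STEPS),
        "subject": np.repeat(np.arange(n_sequences) % 2, N_STEPS),
        "step": np.tile(np.arange(N_STEPS), n_sequences),
    })
    for col in sensor_cols:
        frame[col] = rng.normal(size=len(frame))
    return frame

def to_3d(frame):
    """Return an array of shape (n_sequences, 60, 13), ordered by sequence and step."""
    ordered = frame.sort_values(["sequence", "step"])
    return ordered[sensor_cols].to_numpy().reshape(-1, N_STEPS, N_SENSORS)

toy = make_toy_long_frame()
cube = to_3d(toy)
print(cube.shape)  # (3, 60, 13)
```

With the real data, `to_3d(train)` would be expected to give one block per training sequence, provided every sequence has exactly 60 rows.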

## Importing packages and loading dataset

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator
import seaborn as sns
from itertools import chain
from sklearn import metrics
import scipy.stats
%matplotlib inline
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import accuracy_score



In [None]:
# Load the data

train = pd.read_csv('../input/tabular-playground-series-apr-2022/train.csv')
train_labels = pd.read_csv('../input/tabular-playground-series-apr-2022/train_labels.csv')

test = pd.read_csv('../input/tabular-playground-series-apr-2022/test.csv')

## Exploratory Data Analysis (EDA) with Pandas and NumPy

Thank you to [AMBROSM](https://www.kaggle.com/code/ambrosm/tpsapr22-eda-which-makes-sense) for his always useful and inspiring EDA notebook.

In [None]:
train

In [None]:
train_labels

In [None]:
train_labels.tail()

In [None]:
test.head()

In [None]:
len(train['subject'].unique())

In [None]:
np.sort(train['subject'].unique())

In [None]:
len(test['subject'].unique())

In [None]:
len(train['sequence'].unique())

In [None]:
len(train_labels['sequence'].unique())

In [None]:
train_labels['sequence'].unique()

In [None]:
len(test['sequence'].unique())

In [None]:
test['sequence'].unique()

In [None]:
train.info()

In [None]:
train_labels.info()

In [None]:
test.info()

In [None]:
print(f'Subject numbering in train: from {train.subject.min()} to {train.subject.max()}')
print(f'Subject numbering in test: from {test.subject.min()} to {test.subject.max()}')
print()

Comments:
- There are **25968 sequences** (labeled **from 0 to 25967**) in the **train** set, from 672 subjects.
- The train data has ***1558080*** rows, which makes sense since each sequence has **60 steps, one step per second** (25968*60=1558080).
- There are no missing values.
- Every sequence has **60 * 13 = 780 features.**
- The **test** data has **12218 sequences** (labeled **from 25968 to 38185**).
- The **train and test subjects are different**, so we cannot use the subject as a feature.
- We need to predict the state of each sequence in the test data (sequences 25968 to 38185).
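
The observations above can be checked programmatically rather than read off the printed outputs. The following is a minimal sketch on toy frames: `summarize_layout` and `toy_frame` are hypothetical helpers (not part of the original notebook), and the same function could be called on the real `train` and `test` frames.

```python
import numpy as np
import pandas as pd

def toy_frame(sequence_ids, subject_ids, steps=60):
    """Synthetic stand-in for train.csv / test.csv with a single sensor column."""
    n = len(sequence_ids)
    return pd.DataFrame({
        "sequence": np.repeat(sequence_ids, steps),
        "subject": np.repeat(subject_ids, steps),
        "step": np.tile(np.arange(steps), n),
        "sensor_00": np.zeros(n * steps),
    })

def summarize_layout(train, test, steps_per_sequence=60):
    """Collect the basic facts about the data layout in one dictionary."""
    rows_per_sequence = train.groupby("sequence").size()
    return {
        "n_train_rows": len(train),
        "n_train_sequences": int(train["sequence"].nunique()),
        "n_train_subjects": int(train["subject"].nunique()),
        "all_sequences_full_length": bool((rows_per_sequence == steps_per_sequence).all()),
        "missing_values": int(train.isna().sum().sum()),
        "shared_subject_ids": len(set(train["subject"]) & set(test["subject"])),
    }

train_toy = toy_frame([0, 1], [0, 1])
test_toy = toy_frame([2], [2])
summary = summarize_layout(train_toy, test_toy)
print(summary)
```

On the real data, `shared_subject_ids` being 0 would confirm that no subject id appears in both train and test.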

## Creating 'state' column

In [None]:
train = train.merge(train_labels, how='left')
train.head(123)

In [None]:
list(train.columns[3:16])

In [None]:
train[list(train.columns[3:16])]

In [None]:
train['median'] = train[list(train.columns[3:16])].median(axis = 1)
train

In [None]:
# counting how many rows per subject (each sequence contributes 60 rows)
# the columns are named explicitly, since the reset_index() defaults differ between pandas versions
count_sub = (train['subject'].value_counts().sort_values()
             .rename_axis('subject').reset_index(name='rows'))
count_sub

In [None]:
count_sub['number of sequences'] = (count_sub['rows']/60).astype(int) #dividing by 60 seconds to obtain the right count
count_sub.drop(['rows'], axis = 1, inplace = True)

In [None]:
count_sub = count_sub[['number of sequences', 'subject']]
count_sub

In [None]:
plt.figure(figsize=(30,8))
plt.bar(count_sub['subject'], count_sub['number of sequences'])

By merging in the train labels, we now know the state of each sequence.
To gather information for classification, it looks useful to group the data by sequence.
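
One way to group by sequence is to collapse each sequence into summary statistics per sensor. This is only a sketch on a tiny hand-made frame: `sequence_features` is a hypothetical helper, and the choice of mean and standard deviation is an assumption, not something the notebook prescribes.

```python
import pandas as pd

def sequence_features(frame, sensor_cols):
    """One row per sequence: mean and standard deviation of each sensor."""
    stats = frame.groupby("sequence")[sensor_cols].agg(["mean", "std"])
    # Flatten the (sensor, statistic) column MultiIndex into names like 'sensor_00_mean'
    stats.columns = [f"{sensor}_{stat}" for sensor, stat in stats.columns]
    return stats

# Toy data: two sequences of three steps each
toy = pd.DataFrame({
    "sequence": [0, 0, 0, 1, 1, 1],
    "step": [0, 1, 2, 0, 1, 2],
    "sensor_00": [1.0, 2.0, 3.0, 4.0, 4.0, 4.0],
    "sensor_01": [0.0, 0.0, 0.0, 1.0, 3.0, 5.0],
})
features = sequence_features(toy, ["sensor_00", "sensor_01"])
print(features)
```

On the notebook's data, the sensor columns could be selected with `[col for col in train.columns if 'sensor_' in col]`, as in the pivot cell.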

In [None]:
train.columns[3:16]

In [None]:
#train_pivoted = train.pivot(index=['subject', 'sequence', 'step'], columns = 'state', values=[col for col in train.columns if 'sensor_' in col])

train_pivoted = train.pivot(index=['subject', 'sequence', 'state'], columns = 'step', values=[col for col in train.columns if 'sensor_' in col])

train_pivoted

In [None]:
train_pivoted.iloc[train_pivoted.index.get_level_values(0) == 437 ] # subject 437 has the largest number of sequences

In [None]:
np.sort(train_pivoted.index.unique('subject'))

In [None]:
subjects = pd.DataFrame(train_pivoted.index.get_level_values(0))
subjects

In [None]:
states = pd.DataFrame(train_pivoted.index.get_level_values(2))
states

In [None]:
count_state = pd.concat([subjects, states], axis = 1)
count_state

In [None]:
# number of sequences per subject, with explicit names so the result does not depend on the pandas version
count_subject = count_state['subject'].value_counts().sort_index().rename('count').to_frame()
count_subject.index.name = 'subject'
count_subject

In [None]:
plt.figure(figsize=(30,8))
sns.countplot(x = count_state['subject'], hue = count_state['state'])#, order = [1, 0])

The graph shows that subjects with more sequences tend to be in state 1.
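
The visual impression can be backed by a number: for each subject, count the sequences and compute the share of them in state 1, then look at the rank correlation between the two. The sketch below uses a toy table; `per_subject_summary` is a hypothetical helper, and the rank correlation is computed by ranking both columns and taking the Pearson correlation of the ranks.

```python
import pandas as pd

def per_subject_summary(sequences):
    """sequences: one row per sequence, with 'subject' and 'state' columns."""
    return sequences.groupby("subject")["state"].agg(
        n_sequences="count", share_state_1="mean"
    )

# Toy data: subject 0 has 1 sequence, subject 1 has 2, subject 2 has 3
toy = pd.DataFrame({
    "subject": [0, 1, 1, 2, 2, 2],
    "state":   [0, 0, 1, 1, 1, 1],
})
summary = per_subject_summary(toy)
# Spearman-style rank correlation: Pearson correlation of the ranks
rho = summary["n_sequences"].rank().corr(summary["share_state_1"].rank())
print(summary)
print(rho)
```

In the notebook, a one-row-per-sequence table could be obtained with `train.drop_duplicates('sequence')[['subject', 'state']]` after the merge with `train_labels`; a clearly positive `rho` would support the statement above.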

## Features correlation

In [None]:
# features correlation

colormap = plt.cm.RdBu
plt.figure(figsize=(18,15));
plt.title('Features correlation', y=1.05, size=20);
features  = [col for col in train.columns if col not in ('sequence','step','subject', 'state')]
sns.heatmap(train[features].corr(),linewidths=0.1, vmax=1.0, square=True, cmap=colormap, linecolor='white', annot=True)

## Visualisations

In [None]:
serie = pd.DataFrame(train.loc[(train['subject'] == 0) & (train['sequence'] == 207)].set_index('step')) 
serie.head()

In [None]:
list(serie.columns[2:15])

In [None]:
serie['sensor_00']

In [None]:
# plot data sensors of subject 0, sequence 207 along steps (state 0)

plt.figure(figsize=(30,8))
for i in list(serie.columns[2:15]):
    plt.plot(serie[i], label = i)
    plt.grid(True)
plt.legend()
plt.show()

In [None]:
row_1=serie.iloc[0]
row_1[2:15]

In [None]:
# Plot all the sensors for each second (subject 0, sequence 207, state 0)

plt.figure(figsize=(30,8))
for i in range(len(serie)):  # one line per step, all 60 steps
    plt.plot((serie.iloc[i])[2:15], label = i)
plt.grid(True)
plt.legend()
plt.show()

In [None]:

plt.figure(figsize=(30,8))
for i in (0,1,2,3):
    plt.plot((serie.iloc[i])[2:15], label = i)
    plt.grid(True)
plt.grid(True)
plt.legend()
plt.show()

In [None]:
# median of all sensors at each step (subject 0, sequence 207, state 0)
plt.figure(figsize=(30,8))
totmed = []
for i in range(len(serie)):  # all 60 steps
    median = serie.iloc[i][2:15].median()
    totmed.append(median) 
plt.plot(totmed)
plt.grid(True)
plt.ylim([-2, 2])
plt.show()

In [None]:
# median of each sensor over all steps (subject 0, sequence 207, state 0)
plt.figure(figsize=(30,8))
totmed = []
for i in serie.columns[2:15]:
    median = serie[i].median()
    totmed.append(median) 
plt.plot(totmed)
plt.grid(True)
plt.ylim([-2, 2])
plt.show()

In [None]:
serie1 = pd.DataFrame(train.loc[(train['subject'] == 327) & (train['sequence'] == 25967)].set_index('step')) 
serie1.head()

In [None]:
# Plot for each second all the sensors  (subject 0, sequence 207, state 0,blue   vs  subject 327, sequence 25967, state 0,red)
plt.figure(figsize=(30,8))
for i in range(len(serie)):  # all 60 steps
    plt.plot((serie.iloc[i])[2:15], label = i, color ='blue')
    plt.plot((serie1.iloc[i])[2:15], label = i, color ='red')
    plt.grid(True)
plt.grid(True)
#plt.legend()
plt.show()

In [None]:
# median of all sensors at each step (subject 0, sequence 207, state 0,blue   vs  subject 327, sequence 25967, state 0,red)
plt.figure(figsize=(30,8))
totmed = []
totmed1 = []
for i in range(len(serie)):  # all 60 steps
    median = serie.iloc[i][2:15].median()
    totmed.append(median) 
    median1 = serie1.iloc[i][2:15].median()
    totmed1.append(median1)
plt.plot(totmed, label = 'median subject 0',color = 'blue')
plt.plot(totmed1,label = 'median subject 327', color = 'red')
plt.legend()
plt.grid(True)
plt.ylim([-2, 2])
plt.show()

In [None]:
# median of each sensor over all steps (subject 327, sequence 25967, state 0)
plt.figure(figsize=(30,8))
totmed1 = []
for i in serie1.columns[2:15]:
    median1 = serie1[i].median()
    totmed1.append(median1)
plt.plot(totmed1)
plt.grid(True)
plt.ylim([-2, 2])
plt.show()

In [None]:
serie2 = pd.DataFrame(train.loc[(train['subject'] == 0) & (train['sequence'] == 1174)].set_index('step')) 
serie2.head()

In [None]:
# Plot for each second all the sensors  (subject 0, sequence 1174, state 1)
plt.figure(figsize=(30,8))
for i in range(len(serie2)):  # all 60 steps
    plt.plot((serie2.iloc[i])[2:15], label = i)
    plt.grid(True)
plt.grid(True)
plt.legend()
plt.show()

In [None]:
# median of all sensors at each step (subject 0, sequence 207, state 0,blue   vs  subject 0, sequence 1174, state 1,green)
plt.figure(figsize=(30,8))
totmed = []
totmed2 = []
for i in range(len(serie)):  # all 60 steps
    median = serie.iloc[i][2:15].median()
    totmed.append(median) 
    median2 = serie2.iloc[i][2:15].median()
    totmed2.append(median2)
plt.plot(totmed, label = 'median subject 0, state 0',color = 'blue')
plt.plot(totmed2,label = 'median subject 0, state 1', color = 'green')
plt.legend()
plt.grid(True)
plt.ylim([-2, 2])
plt.show()

In [None]:
# median of each sensor over all steps (subject 0, sequence 1174, state 1)
plt.figure(figsize=(30,8))
totmed2 = []
for i in serie2.columns[2:15]:
    median2 = serie2[i].median()
    totmed2.append(median2)
plt.plot(totmed2)
plt.grid(True)
plt.ylim([-2, 2])
plt.show()

In [None]:
serie3 = pd.DataFrame(train.loc[(train['subject'] == 0) & (train['sequence'] == 5008)].set_index('step')) 
serie3.head()

In [None]:
# Plot for each second all the sensors  (subject 0, sequence 5008, state 1)
plt.figure(figsize=(30,8))
for i in range(len(serie3)):  # all 60 steps
    plt.plot((serie3.iloc[i])[2:15], label = i)
    plt.grid(True)
plt.grid(True)
plt.legend()
plt.show()

In [None]:
# median of all sensors at each step for subject 0 (sequence 207, state 0, blue  vs  sequence 1174, state 1, green  vs  sequence 5008, state 1, brown)
plt.figure(figsize=(30,8))
totmed = []
totmed2 = []
totmed3 = []
for i in range(len(serie)):  # all 60 steps
    median = serie.iloc[i][2:15].median()
    totmed.append(median) 
    median2 = serie2.iloc[i][2:15].median()
    totmed2.append(median2)
    median3 = serie3.iloc[i][2:15].median()
    totmed3.append(median3)
plt.plot(totmed, label = 'median subject 0, sequence 207, state 0',color = 'blue')
plt.plot(totmed2,label = 'median subject 0, sequence 1174, state 1', color = 'green')
plt.plot(totmed3,label = 'median subject 0, sequence 5008, state 1', color = 'brown')
plt.legend()
plt.grid(True)
plt.ylim([-2, 2])
plt.show()

In [None]:
serie4 = pd.DataFrame(train.loc[(train['subject'] == 0) & (train['sequence'] == 7366)].set_index('step')) 
serie4.head()

In [None]:
# Plot all the sensors for each second (subject 0, sequence 7366)
plt.figure(figsize=(30,8))
for i in range(len(serie4)):  # all 60 steps
    plt.plot((serie4.iloc[i])[2:15], label = i)
    plt.grid(True)
plt.grid(True)

plt.legend()
plt.show()

In [None]:
# median of all sensors at each step (subject 0, sequence 7366)
plt.figure(figsize=(30,8))
totmed = []
for i in range(len(serie4)):  # all 60 steps
    median = serie4.iloc[i][2:15].median()
    totmed.append(median) 
plt.ylim([-2, 2])    
plt.plot(totmed)
plt.grid(True)
plt.show()

In [None]:
serie5 = pd.DataFrame(train.loc[(train['subject'] == 1) & (train['sequence'] == 195)].set_index('step')) 
serie5.head()

In [None]:
# Plot all the sensors for each second (subject 1, sequence 195)
plt.figure(figsize=(30,8))
for i in range(len(serie5)):  # all 60 steps
    plt.plot((serie5.iloc[i])[2:15], label = i)
    plt.grid(True)
plt.grid(True)

plt.legend()
plt.show()

In [None]:
# median of all sensors at each step (subject 1, sequence 195)
plt.figure(figsize=(30,8))
totmed = []
for i in range(len(serie5)):  # all 60 steps
    median = serie5.iloc[i][2:15].median()
    totmed.append(median) 
plt.ylim([-2, 2])
plt.plot(totmed)
plt.grid(True)
plt.show()