# Tabular Playground Series April 2022 - EDA

> This month's challenge is a time series classification problem.

> You've been provided with thousands of sixty-second sequences of biological sensor data recorded from several hundred participants who could have been in either of two possible activity states. Can you determine what state a participant was in from the sensor data?

**References:**

Some of this EDA takes inspiration from AmbrosM - [Notebook](https://www.kaggle.com/code/ambrosm/tpsapr22-eda-which-makes-sense/notebook)

**Still in progress.**

# EDA

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
train_df = pd.read_csv("../input/tabular-playground-series-apr-2022/train.csv")
test_df = pd.read_csv("../input/tabular-playground-series-apr-2022/test.csv")
sample_sub = pd.read_csv("../input/tabular-playground-series-apr-2022/sample_submission.csv")
train_labels = pd.read_csv("../input/tabular-playground-series-apr-2022/train_labels.csv")

In [None]:
display(train_df.head())
display(test_df.head())

In [None]:
#train_df.info()
#test_df.info()
#train_labels.info()

In [None]:
#display(train_labels.head())
display(train_labels.tail())

# EDA

In [None]:
# Total number of missing values across all columns
train_df.isnull().sum().sum()

In [None]:
print("train rows:", train_df.shape[0])
print("test rows:", test_df.shape[0])
print("train has ", train_df.shape[0]/test_df.shape[0], "times more rows than test")

In [None]:
display(train_df[train_df["sequence"] == 0].head())
display(train_df[train_df["sequence"] == 0].tail())

### Sequence

In [None]:
print("Lowest sequence ID:", train_df["sequence"].unique().min())
print("Highest sequence ID:", train_df["sequence"].unique().max())
print("Expected number of sequences:", train_df["sequence"].unique().max() - train_df["sequence"].unique().min() + 1)
print("Number of sequences:", train_df["sequence"].nunique())

In [None]:
train_df["sequence"].value_counts().unique()

In [None]:
train_df["step"].value_counts().unique()

**Observations:**
- There are 25968 sequences in the train data.
- There are no missing sequence IDs.
- All sequences have 60 rows, corresponding to the 60 steps.

**Insights:**
- Each step represents one second. So the row for each step gives the sensor measurements at that second.
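The checks above can be rolled into a single helper. This is a minimal sketch on a small synthetic frame: `sequences_are_complete` is a hypothetical helper name, and the only assumptions taken from the data are the `sequence` and `step` columns, with steps numbered 0 to 59.

```python
import numpy as np
import pandas as pd

def sequences_are_complete(df, n_steps=60):
    """Return True if every sequence contains steps 0..n_steps-1 exactly once."""
    per_seq = df.groupby("sequence")["step"].agg(["size", "nunique", "min", "max"])
    return bool(
        (per_seq["size"] == n_steps).all()
        and (per_seq["nunique"] == n_steps).all()
        and (per_seq["min"] == 0).all()
        and (per_seq["max"] == n_steps - 1).all()
    )

# Synthetic stand-in for train_df: 3 sequences x 60 steps
toy = pd.DataFrame({
    "sequence": np.repeat(np.arange(3), 60),
    "step": np.tile(np.arange(60), 3),
})
print(sequences_are_complete(toy))            # all sequences complete
print(sequences_are_complete(toy.iloc[:-1]))  # one row removed from the last sequence
```

On the real data this would be called as `sequences_are_complete(train_df)`.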

### Subject

In [None]:
print("Lowest subject ID:", train_df["subject"].unique().min())
print("Highest subject ID:", train_df["subject"].unique().max())
print("Expected number of subjects:", train_df["subject"].unique().max() - train_df["subject"].unique().min() + 1)
print("Number of subjects:", train_df["subject"].nunique())

**Observations:**
- There are 672 subjects in the train data.
- There are no missing subject IDs.

In [None]:
temp_df_train = train_df.groupby("subject")["sequence"].nunique().reset_index()
temp_df_train["dataset"] = "Train"
temp_df_test = test_df.groupby("subject")["sequence"].nunique().reset_index()
temp_df_test["dataset"] = "Test"
temp_df_combined = pd.concat([temp_df_train,temp_df_test], ignore_index=True)
f,ax = plt.subplots(figsize=(15,7))
sns.histplot(data = temp_df_combined, x="sequence", bins = 50, hue="dataset");
ax.set_title("Distribution of the number of sequences per subject (train vs test)");
ax.set_xlabel("Number of sequences for participant");

In [None]:
print("Lowest number of sequences per subject in train:", temp_df_train["sequence"].sort_values().head(15).tolist())
print("Lowest number of sequences per subject in test:", temp_df_test["sequence"].sort_values().head(15).tolist())

print("Highest number of sequences per subject in train:", temp_df_train["sequence"].sort_values(ascending=False).head(15).tolist())
print("Highest number of sequences per subject in test:", temp_df_test["sequence"].sort_values(ascending=False).head(15).tolist())

**Observations:**
- The number of sequences per subject varies.
- There are normally around 15-75 sequences per subject.
- The train data has different subjects than the test data.
- The distributions of the number of sequences per subject are relatively similar in train and test.
- Some participants in the train data have very few sequences recorded.
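The claim that train and test contain different subjects can be checked directly with a set intersection. The sketch below uses toy frames; `shared_subjects` is a hypothetical helper, and only the `subject` column is assumed.

```python
import pandas as pd

def shared_subjects(train, test):
    """Return the set of subject IDs that appear in both frames."""
    return set(train["subject"].tolist()) & set(test["subject"].tolist())

# Toy stand-ins for train_df and test_df
toy_train = pd.DataFrame({"subject": [0, 0, 1, 2]})
toy_test = pd.DataFrame({"subject": [3, 3, 4]})

overlap = shared_subjects(toy_train, toy_test)
print("Subjects present in both sets:", len(overlap))
```

On the real data, `shared_subjects(train_df, test_df)` should come back empty if the observation holds.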

### Target labels

This is a binary classification problem where we must classify each sequence. Looking at the target labels:

In [None]:
# Count sequences per state; rename_axis/reset_index(name=...) gives the same column names across pandas versions
target_counts = train_labels["state"].value_counts().sort_index().rename_axis("state").reset_index(name="count")
target_counts["proportion"] = target_counts["count"] / target_counts["count"].sum()
sns.barplot(data = target_counts, x="state", y="count");
display(target_counts)

In [None]:
from statsmodels.stats.proportion import proportions_ztest

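# Perform a one-proportion z-test; the null hypothesis is that the proportion of each class is 0.5
zval, pval = proportions_ztest(count=target_counts.loc[0,"count"], nobs=target_counts.loc[:,"count"].sum(), value=0.5)
print("p-value = {0:0.03f}".format(pval))
if pval > 0.05:
    print("p > 0.05: we fail to reject the null hypothesis that the classes are balanced.")
else:
    print("p <= 0.05: we reject the null hypothesis that the classes are balanced.")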

**Observations:**
- The classes are approximately balanced: the test gives no significant evidence against a 50/50 split.
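To make the test less of a black box, here is the one-proportion z-test written out by hand. This is a sketch using only the standard library and hypothetical counts (not the competition's real numbers). It uses the textbook standard error based on the null proportion, whereas `proportions_ztest` by default uses the sample proportion, so the two can differ slightly.

```python
import math

def one_proportion_ztest(count, nobs, p0=0.5):
    """Two-sided z-test of H0: true proportion == p0."""
    p_hat = count / nobs
    se = math.sqrt(p0 * (1 - p0) / nobs)          # standard error under H0
    z = (p_hat - p0) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))    # equals 2 * (1 - Phi(|z|))
    return z, p_value

# Hypothetical example: 5050 of 10000 sequences in one class
z, p = one_proportion_ztest(count=5050, nobs=10000)
print("z = {0:0.3f}, p-value = {1:0.3f}".format(z, p))
```

A p-value above the chosen threshold means we fail to reject the null hypothesis of balanced classes; it does not prove that the classes are exactly balanced.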

In [None]:
train_df = train_df.merge(train_labels, on="sequence")

In [None]:
states = train_df.groupby(["subject","sequence"])["state"].first()
states = states.groupby("subject").value_counts().rename("state_number").reset_index()
states = states.pivot(index = "subject", columns="state", values="state_number")
states = states.fillna(value=0) # pivot can create missing values when value_count is empty
states["mean"] = ((0 * states[0]) + (1 * states[1])) / (states[0] + states[1])
display(states.tail())

f,ax = plt.subplots(figsize=(14,6))
ax = sns.histplot(data = states, x = "mean", bins = 60);
ax.set_xlabel("Mean sequence state for each subject");
ax.set_title("Distribution of the mean sequence state for each subject");

In [None]:
print("Number of subjects with all 0 activity states:", len(states[states["mean"] == 0]))
print("Number of subjects with all 1 activity states:", len(states[states["mean"] == 1]))
print("Mean of the mean sequence state for each participant:", states["mean"].mean())
print("Mean of the mean sequence state for each participant (excluding 0):", states.loc[states["mean"]!=0,"mean"].mean())

**Observations:**
- Although the sequence state classes are balanced, the average participant has more sequences with activity state 0 than with state 1. This means that some participants must have a lot of sequences, mostly with activity state 1.
- Some participants have all 0 activity states.
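The gap between the overall class balance and the per-participant average is the difference between a weighted and an unweighted mean. A toy illustration with made-up numbers (three subjects with 2 sequences each, one subject with 10):

```python
import pandas as pd

# Made-up data: subjects 0-2 have only state 0, subject 3 has 8 of 10 sequences in state 1
toy = pd.DataFrame({
    "subject": [0, 0, 1, 1, 2, 2] + [3] * 10,
    "state": [0] * 6 + [1] * 8 + [0] * 2,
})

overall_mean = toy["state"].mean()                       # every sequence counts equally
per_subject_mean = toy.groupby("subject")["state"].mean()
mean_of_means = per_subject_mean.mean()                  # every subject counts equally

print("Overall mean state:", overall_mean)
print("Mean of per-subject means:", mean_of_means)
```

Here the classes are perfectly balanced over sequences, yet the mean of the per-subject means is well below 0.5 because the one subject with many state-1 sequences counts only once.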

**Hypothesis:** The number of sequences for each participant has an impact on the mean sequence state for that participant.

In [None]:
temp_df = temp_df_train.merge(states, on="subject")
f,ax = plt.subplots(figsize = (20,7))
sns.scatterplot(data=temp_df, x="sequence", y="mean");
ax.set_xlabel("Number of sequences for participant");
ax.set_ylabel("Mean sequence state for participant");

**Observations:**

- There appears to be a non-linear correlation between the number of sequences for the participant and the mean sequence state.
- The interesting pattern of "mini curves" is just related to how the mean is calculated with the number of sequences for that participant (1/9, 1/10, 1/11, 1/12 etc.) - we can ignore this
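The "mini curves" can be reproduced without any data: a participant with n sequences can only have a mean state of k/n for an integer k, so for each fixed k the points fall on the curve k/n. A small sketch (`possible_means` is a hypothetical helper):

```python
from fractions import Fraction

def possible_means(n_sequences):
    """All mean states a participant with n_sequences sequences can have."""
    return [Fraction(k, n_sequences) for k in range(n_sequences + 1)]

# With few sequences the attainable means are sparse, which creates the visible curves
for n in (9, 10, 11, 12):
    smallest_nonzero = possible_means(n)[1]   # exactly one sequence in state 1
    print(n, "sequences -> smallest non-zero mean:", smallest_nonzero)
```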

**Hypothesis:** The order of the sequences for each participant contains some information (perhaps they haven't been randomised?)

In [None]:
temp_df = train_df.groupby(["subject","sequence"])["state"].mean().reset_index()
display(temp_df)
# Position of each sequence within its subject (0, 1, 2, ...), in order of sequence ID
temp_df["seq"] = temp_df.groupby("subject").cumcount()

seq_lengths = train_df.groupby(["subject"])["sequence"].nunique().rename("seq_length").reset_index()
temp_df = temp_df.merge(seq_lengths, on="subject")
temp_df["seq_perc"] = temp_df["seq"] / temp_df["seq_length"]
display(temp_df.head())

In [None]:
sns.histplot(data=temp_df, x="seq_perc", hue="state", bins = 30);

**Conclusions:**
It's unlikely that the sequence order contains any info, other than the effect of sequence length.

# Sensors

In [None]:
plt.figure(figsize=(20,25))  # plt.subplots would add an extra full-size axes behind the grid
#plt.tight_layout(h_pad=0, w_pad=0)
sensor_cols = ['sensor_00', 'sensor_01', 'sensor_02',
       'sensor_03', 'sensor_04', 'sensor_05', 'sensor_06', 'sensor_07',
       'sensor_08', 'sensor_09', 'sensor_10', 'sensor_11', 'sensor_12']
for i, column in enumerate(sensor_cols):
    plt.subplot(5,3,i+1)
    #sns.histplot(train_df[column]) - too slow
    plt.hist(train_df[column], bins = 100)
    plt.title(column)

**Observations:**
- There are probably some outliers for all sensors.

In [None]:
plt.figure(figsize=(20,25))
for i, column in enumerate(sensor_cols):
    plt.subplot(5,3,i+1)
    plt.hist(train_df[column], bins = 200, range=(train_df[column].quantile(0.015), train_df[column].quantile(0.985) ))
    plt.hist(test_df[column], bins = 200, range=(train_df[column].quantile(0.015), train_df[column].quantile(0.985) ))
    plt.title(column)

In [None]:
plt.hist(train_df["sensor_12"],bins = 300, range=(train_df["sensor_12"].quantile(0.2), train_df["sensor_12"].quantile(0.8) ))
plt.hist(test_df["sensor_12"],bins = 300, range=(train_df["sensor_12"].quantile(0.2), train_df["sensor_12"].quantile(0.8) ))
plt.title("sensor_12");

**Observations:**

- The distributions are roughly symmetrical and centered on 0.
- The distributions for sensors 2, 8 and 12 are unusual.
- Sensor 2 takes mostly discrete values in multiples of 0.33 with some exceptions.
- Sensor 8 takes discrete values in multiples of 0.1.
- Sensor 12 has a very long tailed distribution.
- Train and test sensor distributions are very similar.
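Whether a sensor really takes values on a regular grid can be quantified instead of read off a histogram. The sketch below measures the share of values that are (within a tolerance) integer multiples of a given step; `fraction_on_grid` is a hypothetical helper and the array is made up.

```python
import numpy as np

def fraction_on_grid(values, step, tol=1e-6):
    """Share of values that are integer multiples of `step` (tol is in units of step)."""
    ratio = np.asarray(values, dtype=float) / step
    return float(np.mean(np.abs(ratio - np.round(ratio)) < tol))

# Made-up readings: all multiples of 0.1 except 0.25
toy_sensor = np.array([-0.3, -0.1, 0.0, 0.2, 0.25, 1.1])
on_grid = fraction_on_grid(toy_sensor, step=0.1)
print("Share of values on the 0.1 grid:", on_grid)
```

On the real data this could be tried as `fraction_on_grid(train_df["sensor_08"], step=0.1)`; the tolerance absorbs floating-point rounding.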

In [None]:
plt.subplots(figsize=(12,12))
sns.heatmap(train_df[["state"] + sensor_cols].corr(),annot=True, cmap="RdYlGn", fmt = '0.3f', vmin=-1, vmax=1, cbar=False);

I am concerned that the outliers might be affecting the correlations, so I remove them by dropping every row that contains an outlier in any sensor measurement. Here an outlier is a value outside the 1.5%-98.5% quantile range of its column.

In [None]:
train_nooutliers_df = train_df.drop(columns=["sequence","subject","step"])
len_before = train_nooutliers_df.shape[0]
quantile_df = train_nooutliers_df.quantile([0.015,0.985])
train_nooutliers_df = train_nooutliers_df.apply(lambda x: x[(x >= quantile_df.loc[0.015,x.name]) & (x <= quantile_df.loc[0.985,x.name])], axis=0)
train_nooutliers_df = train_nooutliers_df.dropna()
len_after = train_nooutliers_df.shape[0]
print("Dropped:", (len_before-len_after)/(len_before)*100, "% of rows")
#train_nooutliers_df

Changes in state when we remove outliers:

In [None]:
# Count rows per state after outlier removal (column names are stable across pandas versions)
temp_df = train_nooutliers_df["state"].value_counts().sort_index().rename_axis("state").reset_index(name="count")
temp_df["proportion"] = temp_df["count"] / temp_df["count"].sum()
display(temp_df)
sns.barplot(data = temp_df, x="state", y="count");

Correlations between sensors without outliers:

In [None]:
plt.subplots(figsize=(12,12))
sns.heatmap(train_nooutliers_df.corr(),annot=True, cmap="RdYlGn", fmt = '0.3f', vmin=-1, vmax=1, cbar=False);

In [None]:
sns.pairplot(train_nooutliers_df.sample(1000).loc[:,~train_nooutliers_df.columns.isin(["state","sensor_12"])]);

**Observations:**

- Some sensors are highly correlated with each other.
- At first glance sensor 2 looks like the most important feature for predicting state.
- Removing outliers makes the correlations between sensors much stronger.
- ~20% of rows contain at least 1 outlier sensor measurement.
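As a rough sanity check on the share of dropped rows, we can compute what fraction would be dropped if outliers occurred independently across the 13 sensors. This is back-of-the-envelope arithmetic under that independence assumption and assuming continuous values, so that each sensor trims exactly 3% of rows; `expected_drop_fraction` is a hypothetical helper.

```python
def expected_drop_fraction(n_columns, lower=0.015, upper=0.985):
    """Share of rows with at least one out-of-range value, assuming independent columns."""
    kept_per_column = upper - lower          # 97% of rows survive each column's filter
    return 1 - kept_per_column ** n_columns

independent_estimate = expected_drop_fraction(n_columns=13)
print("Expected share dropped under independence: {0:.1%}".format(independent_estimate))
```

Under these assumptions about a third of the rows would be dropped. The observed ~20% is lower, which would be consistent with outliers tending to occur together in the same rows, or with the discrete sensors keeping more rows because of ties at the quantile boundaries.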

### Value counts

The number of unique values for every sensor:

In [None]:
train_val_counts = [train_df[column].nunique() for column in sensor_cols]
test_val_counts = [test_df[column].nunique() for column in sensor_cols]

pd.DataFrame(index= ["train","test"], columns=sensor_cols, data = [train_val_counts, test_val_counts])

The value counts of every sensor:

In [None]:
# Top 10 most frequent values per sensor; rename_axis/reset_index(name=...) works across pandas versions
value_count_df = train_df["sensor_00"].value_counts().head(10).rename_axis("sensor_00 index").reset_index(name="sensor_00 count")
for sensor in sensor_cols[1:]:
    temp_df = train_df[sensor].value_counts().head(10).rename_axis(sensor + " index").reset_index(name=sensor + " count")
    value_count_df = pd.concat([value_count_df, temp_df], axis=1)
display(value_count_df)

**Observations:**
- Sensors 04, 10 and 12 have the most unique values.
- Sensors 00, 01, 03, 05, 06, 07, 09 and 11 have roughly an order of magnitude fewer unique values.
- Sensors 02 and 08 have the fewest unique values.
- Some values still occur a lot more frequently than we would expect.
- For most sensors one value occurs much more frequently than the others - this value is not necessarily exactly 0.

### Sensor distributions for each state

In [None]:
plt.figure(figsize=(20,25))
for i, column in enumerate(sensor_cols):
    plt.subplot(5,3,i+1)
    sns.histplot(train_nooutliers_df.sample(5000), x=column, hue ="state", bins=200)
    #plt.hist(train_df[column], bins = 100)
    plt.title(column)

**Observations:**

- Sensor 02 will likely be very useful for classification

### Duplicates

In [None]:
train_pivot_df = train_df.pivot(columns="step", index=["sequence","subject","state"], values=sensor_cols)
test_pivot_df = test_df.pivot(columns="step", index=["sequence","subject"], values=sensor_cols)
#train_pivot_df
#test_pivot_df

Duplicated rows in the training data:

In [None]:
train_pivot_df[train_pivot_df.duplicated(keep=False)].sort_values(by=["subject",("sensor_12",59)])

Duplicated rows in the test data:

In [None]:
test_pivot_df[test_pivot_df.duplicated(keep=False)].sort_values(by=["subject",("sensor_12",59)])

**Observations:**


- There are only 22 duplicates in the training data (11 pairs).
- In the training data all duplicated rows come in pairs from the same subject - perhaps this could be from some error when the measurements were recorded.
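The duplicate check works by pivoting the data so that each sequence becomes one wide row, then comparing rows. A minimal sketch on made-up data with a single sensor and three steps (the real data has 13 sensors and 60 steps):

```python
import pandas as pd

# Made-up data: sequences 0 and 1 have identical readings, sequence 2 differs at the last step
toy = pd.DataFrame({
    "sequence":  [0, 0, 0, 1, 1, 1, 2, 2, 2],
    "subject":   [5, 5, 5, 5, 5, 5, 6, 6, 6],
    "step":      [0, 1, 2] * 3,
    "sensor_00": [0.1, 0.2, 0.3, 0.1, 0.2, 0.3, 0.1, 0.2, 0.4],
})

# One row per sequence, one column per step
wide = toy.pivot(index=["sequence", "subject"], columns="step", values="sensor_00")

# keep=False marks every member of a duplicated group, not only the repeats
dupes = wide[wide.duplicated(keep=False)]
duplicated_sequences = dupes.index.get_level_values("sequence").tolist()
print(duplicated_sequences)
```

Note that `duplicated` compares only the column values, not the index, so two sequences with identical readings would be flagged even if they came from different subjects.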

## Time Series

In [None]:
def plot_range(df,sequences):
    plot_df = df.loc[df["sequence"].isin(sequences)]
    state_number = plot_df.loc[plot_df["sequence"]==sequences[0],"state"].values[0]
    color = ["red" if state_number==0 else "blue"]
    axs = plot_df.loc[plot_df["sequence"]==sequences[0],  sensor_cols+["step"]].plot( x="step", ylim=(-3,3), subplots=True, xlim=(0,59), figsize=(25,50), legend=False, color=color, alpha=0.2)
    for sequence in sequences[1:]:
        state_number = plot_df.loc[plot_df["sequence"]==sequence,"state"].values[0]
        color = ["red" if state_number==0 else "blue"]
        axs = plot_df.loc[plot_df["sequence"].isin([sequence]), sensor_cols+["step"]].plot(x = "step", subplots=True, ax = axs, legend=False, color=color, alpha=0.2)
    for n,ax in enumerate(axs):
        if n==0:
            ax.set_title(sensor_cols[n] + "   (red = state 0, blue = state 1)")
        else:
            ax.set_title(sensor_cols[n])

In [None]:
plot_range(train_df, sequences = range(1))

In [None]:
plot_range(train_df, sequences = [1,7])

In [None]:
plot_range(train_df, sequences = range(24,45))

In [None]:
plot_range(train_df, sequences = range(200,400))

In [None]:
plot_range(train_df, sequences = range(2000,3000))

# Test data

We apply some small checks on the test data

In [None]:
display(test_df.head())
display(test_df.tail())

In [None]:
# Total number of missing values across all columns
test_df.isnull().sum().sum()

**Observations:**
- There are no null (missing) values - but we should still check for missing sequences.

### Sequences

In [None]:
print("Lowest sequence ID:", test_df["sequence"].unique().min())
print("Highest sequence ID:", test_df["sequence"].unique().max())
print("Expected number of sequences:", test_df["sequence"].unique().max() - test_df["sequence"].unique().min() + 1)
print("Number of sequences:", test_df["sequence"].nunique())

In [None]:
print("Highest sequence ID in train:", train_df["sequence"].unique().max())

**Observations:**
- There are 12218 sequences in the test data.
- There are no missing sequence IDs.
- The test sequence IDs start immediately after the train sequence IDs end.

In [None]:
print("Lowest subject ID:", test_df["subject"].unique().min())
print("Highest subject ID:", test_df["subject"].unique().max())
print("Expected number of subjects:", test_df["subject"].unique().max() - test_df["subject"].unique().min() + 1)
print("Number of subjects:", test_df["subject"].nunique())

In [None]:
print("Highest subject ID in train:", train_df["subject"].unique().max())

**Observations:**
- There are 319 subjects in the test data.
- There are no IDs missing between 672 and 990.
- The test subject IDs start immediately after the train subject IDs end.
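The "no gaps" and "starts immediately after train" checks for both sequence and subject IDs follow the same pattern, so they can share one helper. A sketch with a hypothetical function name `continues_from`, shown on toy ID ranges that mirror the subject counts reported above (assuming train subject IDs start at 0):

```python
import numpy as np

def continues_from(train_ids, test_ids):
    """True if test IDs have no gaps and start right after the highest train ID."""
    train_ids = np.unique(train_ids)
    test_ids = np.unique(test_ids)
    no_gaps = len(test_ids) == test_ids.max() - test_ids.min() + 1
    return bool(no_gaps and test_ids.min() == train_ids.max() + 1)

# Toy IDs: 672 train subjects followed by 319 test subjects (672..990)
print(continues_from(np.arange(0, 672), np.arange(672, 991)))
# Toy counter-example: ID 5 is skipped between train and test
print(continues_from(np.arange(0, 5), np.arange(6, 9)))
```

On the real data this would be called as `continues_from(train_df["subject"], test_df["subject"])`, and likewise for the `sequence` column.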