<a href="https://www.kaggle.com/code/yaaangzhou/zzz-clean-dataset-for-modeling?scriptVersionId=142572555" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

**Created by Yang Zhou**

**[ZZz]Clean dataset for modeling**

**10 Sep 2023**

# <center style="font-family: consolas; font-size: 32px; font-weight: bold;">[ZZz]Clean dataset for modeling</center>
<p><center style="color:#949494; font-family: consolas; font-size: 20px;">Detect sleep onset and wake from wrist-worn accelerometer data</center></p>

***

# <center style="font-family: consolas; font-size: 32px; font-weight: bold;">Some Insights</center>

1. This dataset is inspired by [@Carl McBride Ellis](https://www.kaggle.com/carlmcbrideellis)'s [notebook](https://www.kaggle.com/code/carlmcbrideellis/zzzs-make-clean-starter-dataset-target). He selected the 37 cleanest series out of the 277 available, namely those with no missing event steps.

2. However, [@Greg Kiar](https://www.kaggle.com/gkiar07) proposed: **detecting when events cannot be found is an important part of the challenge, so be careful about getting too comfortable working on this clean sample.**

3. Based on this idea, I add 20 series that contain missing events to the 37 clean series.

In [1]:
import pandas as pd
import random

In [2]:
train_events = pd.read_csv("/kaggle/input/child-mind-institute-detect-sleep-states/train_events.csv")

In [3]:
series_with_nan = train_events.groupby('series_id')['step'].apply(lambda x: x.isnull().any())
no_nan_series = series_with_nan[~series_with_nan].index.tolist()

In [4]:
series_with_nan.value_counts()

step
True     240
False     37
Name: count, dtype: int64
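
The per-series missing-value check above can be illustrated on a toy events table. This is only a sketch: the frame `toy_events` and its two series ids are made up for the example, and it uses an equivalent `isna().groupby(...).any()` formulation rather than the `apply` call from the notebook.

```python
import pandas as pd

# Hypothetical toy events table: series "a" is fully annotated,
# series "b" has one event with a missing step.
toy_events = pd.DataFrame({
    "series_id": ["a", "a", "b", "b"],
    "event": ["onset", "wakeup", "onset", "wakeup"],
    "step": [100.0, 200.0, None, 400.0],
})

# True for every series that has at least one missing step.
has_nan = toy_events["step"].isna().groupby(toy_events["series_id"]).any()

clean_ids = has_nan[~has_nan].index.tolist()  # series without missing steps
nan_ids = has_nan[has_nan].index.tolist()     # series with missing steps
print(clean_ids, nan_ids)
```

Indexing the boolean result with `has_nan` or `~has_nan` is what separates the two groups; the unfiltered index contains every series.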

I will randomly keep 20 of the 240 series that have missing values.

In [5]:
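# Sample only from the series that contain missing steps, so that no clean series is picked twice
nan_series = series_with_nan[series_with_nan].index.tolist()
random_indexes = random.sample(nan_series, 20)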
final_indexes = no_nan_series+random_indexes
print(len(final_indexes))

57
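
Drawing the extra series from the missing-value group only, with a fixed seed, keeps the selection free of duplicates and repeatable across runs. A minimal sketch with hypothetical ids; the seed value 42 and the list sizes are arbitrary assumptions, not part of the original notebook:

```python
import random

# Hypothetical ids; in the notebook these would be derived from train_events.
clean_ids = ["c1", "c2", "c3"]
nan_ids = [f"n{i}" for i in range(10)]

# A dedicated generator with a fixed seed (arbitrary choice) makes the draw repeatable.
rng = random.Random(42)
extra_ids = rng.sample(nan_ids, 4)  # sampling without replacement

selected = clean_ids + extra_ids
print(len(selected), len(set(selected)))
```

Because the two pools are disjoint and `sample` draws without replacement, the combined list cannot contain a series twice.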


In [6]:
# code source: https://www.kaggle.com/code/carlmcbrideellis/zzzs-make-clean-starter-dataset-target

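def get_train_series(series):
    # Load the accelerometer readings for this series only
    train_series = pd.read_parquet("/kaggle/input/child-mind-institute-detect-sleep-states/train_series.parquet", filters=[('series_id','=',series)])
    # Load the annotated events for the same series
    train_events = pd.read_csv("/kaggle/input/child-mind-institute-detect-sleep-states/train_events.csv").query('series_id == @series')

    # Drop events that have missing values (for example, no recorded step)
    train_events = train_events.dropna()
    train_events["step"]  = train_events["step"].astype("int")
    # An event labels the period that precedes it: awake (1) before an onset, asleep (0) before a wakeup
    train_events["awake"] = train_events["event"].map({"onset": 1, "wakeup": 0})

    train = pd.merge(train_series, train_events[['step','awake']], on='step', how='left')
    # Back-fill so that every step takes the label of the next event
    train["awake"] = train["awake"].bfill()

    # Steps after the last event are treated as awake
    train['awake'] = train['awake'].fillna(1)
    train["awake"] = train["awake"].astype("int")
    return train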
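
The labeling logic (left merge on `step`, back-fill, then default to awake) can be traced on a tiny example. This is a sketch on made-up data: the eight-step series and the two events are hypothetical.

```python
import pandas as pd

# Hypothetical series of 8 steps with one onset at step 2 and one wakeup at step 5.
series = pd.DataFrame({"step": range(8)})
events = pd.DataFrame({"step": [2, 5], "event": ["onset", "wakeup"]})
events["awake"] = events["event"].map({"onset": 1, "wakeup": 0})

# The left merge keeps every step; only the event steps receive a label.
labeled = series.merge(events[["step", "awake"]], on="step", how="left")

# Back-fill propagates each event label to the steps before it,
# and the steps after the last event default to awake (1).
labeled["awake"] = labeled["awake"].bfill().fillna(1).astype(int)
print(labeled["awake"].tolist())  # [1, 1, 1, 0, 0, 0, 1, 1]
```

Note that the event step itself carries the label of the period that ends there: the onset step is still marked awake and the wakeup step is still marked asleep.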

In [7]:
final_train_data = []

for series_id in final_indexes:
    train = get_train_series(series_id)
    final_train_data.append(train)

In [8]:
final_data = pd.concat(final_train_data).reset_index(drop=True)
final_data.head()

| | series_id | step | timestamp | anglez | enmo | awake |
|---|---|---|---|---|---|---|
| 0 | 08db4255286f | 0 | 2018-11-05T10:00:00-0400 | -30.845301 | 0.0447 | 1 |
| 1 | 08db4255286f | 1 | 2018-11-05T10:00:05-0400 | -34.181801 | 0.0443 | 1 |
| 2 | 08db4255286f | 2 | 2018-11-05T10:00:10-0400 | -33.877102 | 0.0483 | 1 |
| 3 | 08db4255286f | 3 | 2018-11-05T10:00:15-0400 | -34.282101 | 0.0680 | 1 |
| 4 | 08db4255286f | 4 | 2018-11-05T10:00:20-0400 | -34.385799 | 0.0768 | 1 |


In [9]:
final_data.to_parquet('57series_data.parquet')