## Data import

In [1]:
import dask.dataframe as dd
import pandas as pd

In [2]:
# Read targets
train_events = pd.read_csv('../data/train_events.csv')

In [3]:
train_events.shape

(14508, 5)

In [4]:
train_events.head(10)

,series_id,night,event,step,timestamp
0,038441c925bb,1,onset,4992.0,2018-08-14T22:26:00-0400
1,038441c925bb,1,wakeup,10932.0,2018-08-15T06:41:00-0400
2,038441c925bb,2,onset,20244.0,2018-08-15T19:37:00-0400
3,038441c925bb,2,wakeup,27492.0,2018-08-16T05:41:00-0400
4,038441c925bb,3,onset,39996.0,2018-08-16T23:03:00-0400
5,038441c925bb,3,wakeup,44400.0,2018-08-17T05:10:00-0400
6,038441c925bb,4,onset,57240.0,2018-08-17T23:00:00-0400
7,038441c925bb,4,wakeup,62856.0,2018-08-18T06:48:00-0400
8,038441c925bb,5,onset,,
9,038441c925bb,5,wakeup,,


**Events data description:**

Sleep logs for the series in the training set, recording onset and wakeup events.

- series_id - Unique identifier for each series of accelerometer data in train_series.parquet.
- night - An enumeration of potential onset / wakeup event pairs. At most one pair of events can occur for each night.
- event - The type of event, whether onset or wakeup.
- step and timestamp - The recorded time of occurrence of the event in the accelerometer series.
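
As a quick sanity check on this description, the sketch below rebuilds the ten rows printed by `train_events.head(10)` in a small frame and verifies two things: that each night holds exactly one onset/wakeup pair, and which nights have no recorded step. The name `sample_events` is made up for illustration and is not part of the notebook.

```python
import numpy as np
import pandas as pd

# The ten rows printed by train_events.head(10), rebuilt by hand for illustration
sample_events = pd.DataFrame({
    "series_id": ["038441c925bb"] * 10,
    "night": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "event": ["onset", "wakeup"] * 5,
    "step": [4992, 10932, 20244, 27492, 39996, 44400, 57240, 62856, np.nan, np.nan],
})

# Each (series_id, night) group should contain one onset followed by one wakeup
pairs = sample_events.groupby(["series_id", "night"])["event"].agg(list)
all_paired = all(p == ["onset", "wakeup"] for p in pairs)

# Nights where the sleep log has no recorded step (and therefore no timestamp)
missing_mask = sample_events["step"].isna()
missing_nights = sorted(sample_events.loc[missing_mask, "night"].unique().tolist())

print(all_paired)
print(missing_nights)
```

Running the same checks on the full `train_events` frame would show how many nights have to be skipped when building the target.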

In [5]:
# Read test data
test_data = pd.read_parquet('../data/test_series.parquet')

In [6]:
test_data.shape

(450, 5)

In [7]:
test_data.head(10)

,series_id,step,timestamp,anglez,enmo
0,038441c925bb,0,2018-08-14T15:30:00-0400,2.6367,0.0217
1,038441c925bb,1,2018-08-14T15:30:05-0400,2.6368,0.0215
2,038441c925bb,2,2018-08-14T15:30:10-0400,2.637,0.0216
3,038441c925bb,3,2018-08-14T15:30:15-0400,2.6368,0.0213
4,038441c925bb,4,2018-08-14T15:30:20-0400,2.6368,0.0215
5,038441c925bb,5,2018-08-14T15:30:25-0400,2.6367,0.0217
6,038441c925bb,6,2018-08-14T15:30:30-0400,2.6367,0.0217
7,038441c925bb,7,2018-08-14T15:30:35-0400,2.6367,0.0218
8,038441c925bb,8,2018-08-14T15:30:40-0400,2.798,0.0223
9,038441c925bb,9,2018-08-14T15:30:45-0400,3.0847,0.0217


Because the training data is too large to load into memory with pandas in one go, we will use Dask to read it lazily and to shape it into the desired format.

In [8]:
# Read train data
train_data = dd.read_parquet('../data/train_series.parquet', engine='pyarrow')

In [9]:
train_data

npartitions=28,series_id,step,timestamp,anglez,enmo
,string,uint32,string,float32,float32
,...,...,...,...,...
...,...,...,...,...,...
,...,...,...,...,...
,...,...,...,...,...


In [10]:
train_data.head(10)

,series_id,step,timestamp,anglez,enmo
0,038441c925bb,0,2018-08-14T15:30:00-0400,2.6367,0.0217
1,038441c925bb,1,2018-08-14T15:30:05-0400,2.6368,0.0215
2,038441c925bb,2,2018-08-14T15:30:10-0400,2.637,0.0216
3,038441c925bb,3,2018-08-14T15:30:15-0400,2.6368,0.0213
4,038441c925bb,4,2018-08-14T15:30:20-0400,2.6368,0.0215
5,038441c925bb,5,2018-08-14T15:30:25-0400,2.6367,0.0217
6,038441c925bb,6,2018-08-14T15:30:30-0400,2.6367,0.0217
7,038441c925bb,7,2018-08-14T15:30:35-0400,2.6367,0.0218
8,038441c925bb,8,2018-08-14T15:30:40-0400,2.798,0.0223
9,038441c925bb,9,2018-08-14T15:30:45-0400,3.0847,0.0217


**Data description:**

Each series is a continuous recording of accelerometer data for a single subject spanning many days.

- series_id - Unique identifier for each accelerometer series.
- step - An integer timestep for each observation within a series.
- timestamp - A corresponding datetime with ISO 8601 format %Y-%m-%dT%H:%M:%S%z.
- anglez - As calculated and described by the GGIR package, z-angle is a metric derived from individual accelerometer components that is commonly used in sleep detection, and refers to the angle of the arm relative to the vertical axis of the body.
- enmo - As calculated and described by the GGIR package, ENMO is the Euclidean Norm Minus One of all accelerometer signals, with negative values rounded to zero. While no standard measure of acceleration exists in this space, this is one of the several commonly computed features.
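
To make the two features concrete, here is a minimal sketch of how they could be computed from raw three-axis accelerometer readings expressed in g. The dataset only provides the already computed `anglez` and `enmo` columns, so the raw axes `x`, `y`, `z` and both helper functions are assumptions for illustration. The z-angle formula used here (arctangent of `z` over the norm of `x` and `y`) is the one commonly attributed to GGIR, and any smoothing or epoch aggregation GGIR applies is omitted.

```python
import numpy as np

def enmo(x, y, z):
    """Euclidean Norm Minus One, in g, with negative values clipped to zero."""
    norm = np.sqrt(x**2 + y**2 + z**2)
    return np.maximum(norm - 1.0, 0.0)

def anglez(x, y, z):
    """z-angle in degrees (assumed formula: atan(z / sqrt(x^2 + y^2)))."""
    return np.degrees(np.arctan2(z, np.sqrt(x**2 + y**2)))

# Hypothetical raw readings in g: at rest, tilted at rest, moving, diagonal
x = np.array([0.0, 0.6, 0.0, 1.0])
y = np.array([0.0, 0.0, 0.0, 0.0])
z = np.array([1.0, 0.8, 1.5, 1.0])

enmo_values = enmo(x, y, z)
anglez_values = anglez(x, y, z)
print(enmo_values)
print(anglez_values)
```

A device at rest measures only gravity, so its norm is close to 1 g and its ENMO is close to zero, which is why low `enmo` values are a useful hint of sleep.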

## Shape data 

Let's start with the first 100k rows of the training dataset to develop the shaping method, so that we can then apply it to the whole dataset.

In [11]:
import numpy as np

In [12]:
train_data_head = train_data.head(100000, compute=True)

In [13]:
train_data_head.shape

(100000, 5)

In [14]:
# Convert the timestamp columns to datetime format
train_data_head['timestamp'] = pd.to_datetime(train_data_head['timestamp'], format='%Y-%m-%dT%H:%M:%S%z', utc=True)
print(f"train_data_head:\n{train_data_head.head()}\n")
test_data['timestamp'] = pd.to_datetime(test_data['timestamp'], format='%Y-%m-%dT%H:%M:%S%z', utc=True)
print(f"test_data:\n{test_data.head()}\n")
# Assign the column directly so that it takes the datetime dtype (.loc[:, ...] can keep the object dtype)
train_events['timestamp'] = pd.to_datetime(train_events['timestamp'], format='%Y-%m-%dT%H:%M:%S%z', utc=True)
print(f"train_events:\n{train_events.head()}\n")

train_data_head:
      series_id  step                 timestamp  anglez    enmo
0  038441c925bb     0 2018-08-14 19:30:00+00:00  2.6367  0.0217
1  038441c925bb     1 2018-08-14 19:30:05+00:00  2.6368  0.0215
2  038441c925bb     2 2018-08-14 19:30:10+00:00  2.6370  0.0216
3  038441c925bb     3 2018-08-14 19:30:15+00:00  2.6368  0.0213
4  038441c925bb     4 2018-08-14 19:30:20+00:00  2.6368  0.0215

test_data:
      series_id  step                 timestamp  anglez    enmo
0  038441c925bb     0 2018-08-14 19:30:00+00:00  2.6367  0.0217
1  038441c925bb     1 2018-08-14 19:30:05+00:00  2.6368  0.0215
2  038441c925bb     2 2018-08-14 19:30:10+00:00  2.6370  0.0216
3  038441c925bb     3 2018-08-14 19:30:15+00:00  2.6368  0.0213
4  038441c925bb     4 2018-08-14 19:30:20+00:00  2.6368  0.0215

train_events:
      series_id  night   event     step                  timestamp
0  038441c925bb      1   onset   4992.0  2018-08-15 02:26:00+00:00
1  038441c925bb      1  wakeup  10932.0  2018-08-15 10
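
The effect of `utc=True` is easier to see on a tiny example. The sketch below parses three values with the same format string: the first onset from `train_events`, a hypothetical second string that writes the same instant with a different UTC offset, and a missing value. The variable names are illustrative only.

```python
import pandas as pd

raw = pd.Series([
    "2018-08-14T22:26:00-0400",  # first onset from train_events, recorded at UTC-4
    "2018-08-14T21:26:00-0500",  # hypothetical: the same instant written at UTC-5
    None,                        # missing event time
])

# utc=True converts every offset to UTC, so the column gets one timezone-aware dtype
parsed = pd.to_datetime(raw, format="%Y-%m-%dT%H:%M:%S%z", utc=True)
print(parsed)
```

Both strings map to the same UTC instant and the missing value becomes `NaT`, which is what the later `pd.isna` checks rely on.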

Let's calculate the target for each row in our training dataframe:

In [15]:
# Create a new column for the target value
train_data_head['awake'] = pd.NA
# Label for each (current event, next event) pair:
# onset -> wakeup is sleep time (0), wakeup -> onset is awake time (1)
labels = {('onset', 'wakeup'): 0, ('wakeup', 'onset'): 1}
# Iterate over consecutive pairs of events and assign the awake status
for i in range(len(train_events) - 1):
    current_event = train_events.iloc[i]
    next_event = train_events.iloc[i + 1]
    series_id = current_event['series_id']
    label = labels.get((current_event['event'], next_event['event']))
    # Check that we are on the same test subject and on a valid transition
    if series_id != next_event['series_id'] or label is None:
        continue
    current_time = current_event['timestamp']
    next_time = next_event['timestamp']
    # Check that we do have data for this time
    if pd.isna(current_time) or pd.isna(next_time):
        continue
    # Label the rows strictly between the two events
    in_window = (
        (train_data_head['series_id'] == series_id)
        & (train_data_head['timestamp'] > current_time)
        & (train_data_head['timestamp'] < next_time)
    )
    train_data_head.loc[in_window, 'awake'] = label
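
The loop above scans the whole frame once per pair of events, which may become slow on the full dataset. As a possible alternative, and not the method used in this notebook, the same labels can be sketched with `pd.merge_asof`: each event is paired with the next event of its series, and each row is matched to the latest interval that starts strictly before it. The function name `label_awake` and the toy data are made up for illustration, and the sketch assumes events are listed in chronological order within each series.

```python
import numpy as np
import pandas as pd

def label_awake(series_df, events_df):
    """Label rows strictly between consecutive events: 0 = asleep, 1 = awake."""
    ev = events_df.copy()
    # Pair every event with the next one of the same series, in file order
    ev["next_event"] = ev.groupby("series_id")["event"].shift(-1)
    ev["next_time"] = ev.groupby("series_id")["timestamp"].shift(-1)
    # Keep onset -> wakeup and wakeup -> onset pairs where both times are known
    ev = ev.dropna(subset=["timestamp", "next_time"])
    ev = ev[ev["event"] != ev["next_event"]]
    ev = ev.assign(awake=(ev["event"] == "wakeup").astype(float))
    ev = ev[["series_id", "timestamp", "next_time", "awake"]].sort_values("timestamp")
    # For each row, find the latest interval that starts strictly before it
    out = pd.merge_asof(
        series_df.sort_values("timestamp"), ev,
        on="timestamp", by="series_id",
        direction="backward", allow_exact_matches=False,
    )
    # Discard the match when the row falls at or after the end of that interval
    out.loc[out["timestamp"] >= out["next_time"], "awake"] = np.nan
    return out.drop(columns="next_time")

# Toy data: one series of 20 steps, 5 seconds apart, and a last night without events
t0 = pd.Timestamp("2018-08-14 19:30:00", tz="UTC")

def to_time(steps):
    return t0 + pd.to_timedelta(np.asarray(steps, dtype=float) * 5, unit="s")

series = pd.DataFrame({
    "series_id": "a",
    "step": np.arange(20),
    "timestamp": to_time(np.arange(20)),
})
events = pd.DataFrame({
    "series_id": "a",
    "event": ["onset", "wakeup", "onset", "wakeup", "onset", "wakeup"],
    "timestamp": to_time([3, 7, 12, 16, np.nan, np.nan]),
})

labeled = label_awake(series, events)
print(labeled[["step", "awake"]].to_string(index=False))
```

Rows that fall exactly on an event, before the first event, or next to a night without recorded times stay unlabeled (`NaN`), mirroring the strict inequalities and the `pd.isna` checks of the loop.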
            

In [16]:
# Number of rows during wake
train_data_head.loc[train_data_head['awake'] == 1,].shape[0]

36792

In [17]:
# Number of rows during sleep
train_data_head.loc[train_data_head['awake'] == 0,].shape[0]

29767

In [18]:
# Drop NaNs and useless columns
train_data_head = train_data_head.dropna().drop(['series_id', 'step'], axis=1)
print(train_data_head.head(), "\n")
print(train_data_head.shape)

                     timestamp     anglez    enmo awake
4993 2018-08-15 02:26:05+00:00 -78.664902  0.0099     0
4994 2018-08-15 02:26:10+00:00 -78.465897  0.0101     0
4995 2018-08-15 02:26:15+00:00 -78.454597  0.0098     0
4996 2018-08-15 02:26:20+00:00 -78.537804  0.0098     0
4997 2018-08-15 02:26:25+00:00 -78.446999  0.0099     0 

(66559, 4)


We went from a training dataframe of shape (100000, 5) and an events dataframe of shape (14508, 5) to a single (66559, 4) dataframe. The 'awake' column is the target value for the deep learning model.
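
The row counts reported by the cells above can be cross-checked with a few lines of arithmetic, which also gives the class balance of the target. The variable names below are illustrative only.

```python
n_rows = 100000    # rows read from the training data
n_awake = 36792    # rows labelled awake = 1
n_asleep = 29767   # rows labelled awake = 0

# Rows kept after dropna() are exactly the labelled ones
n_labelled = n_awake + n_asleep
n_dropped = n_rows - n_labelled

awake_share = n_awake / n_labelled
asleep_share = n_asleep / n_labelled
print(n_labelled, n_dropped)
print(round(awake_share, 3), round(asleep_share, 3))
```

The two classes are reasonably balanced in this sample, although that may not hold for the full dataset.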

Now that we have a method to shape the data into the desired form, we can apply it to the training data in batches and see whether the whole shaped dataset can then be held in a single pandas dataframe without using as much memory.
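
A minimal sketch of what such a batched pass could look like is given below. It is an assumption about the next step, not the notebook's implementation: `shape_in_batches`, `label_fn` and `toy_label` are hypothetical names, and batches are simple row slices of a pandas frame. With Dask, one batch could instead be one of the 28 partitions, for instance materialised with something like `train_data.partitions[i].compute()`.

```python
import numpy as np
import pandas as pd

def shape_in_batches(df, batch_size, label_fn):
    """Label and compact a frame batch by batch, then concatenate the results."""
    parts = []
    for start in range(0, len(df), batch_size):
        batch = label_fn(df.iloc[start:start + batch_size].copy())
        # Keep only labelled rows, as in the single-batch version
        batch = batch.dropna(subset=["awake"])
        # Compact dtypes keep the concatenated result small
        batch = batch.astype({"awake": "int8", "anglez": "float32", "enmo": "float32"})
        parts.append(batch[["timestamp", "anglez", "enmo", "awake"]])
    return pd.concat(parts, ignore_index=True)

def toy_label(batch):
    """Stand-in labelling rule for the demo: no label before step 3, then alternate."""
    batch["awake"] = np.where(batch["step"] < 3, np.nan, batch["step"] % 2)
    return batch

# Toy frame with the same columns as the accelerometer series
steps = np.arange(10)
toy = pd.DataFrame({
    "series_id": "a",
    "step": steps,
    "timestamp": pd.Timestamp("2018-08-14 19:30:00", tz="UTC")
                 + pd.to_timedelta(steps * 5, unit="s"),
    "anglez": np.linspace(-80.0, 10.0, 10),
    "enmo": np.linspace(0.0, 0.09, 10),
})

shaped = shape_in_batches(toy, batch_size=4, label_fn=toy_label)
print(shaped.dtypes)
print(shaped.shape)
```

Storing `awake` as `int8` rather than as a Python object column is one of the simpler ways to reduce the memory footprint of the final frame.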