# CMI-SleepState-Detection
## Child Mind Institute - Detect Sleep States
### Detect sleep onset and wake from wrist-worn accelerometer data
_______________________________________________________________________
# [Kaggle Competition](https://www.kaggle.com/competitions/child-mind-institute-detect-sleep-states/overview)
________________________________________________________________________
# Author Details:
### Name: Najeeb Haider Zaidi
### Email: zaidi.nh@gmail.com
### Profiles: [Github](https://github.com/snajeebz)  [LinkedIn](https://www.linkedin.com/in/najeebz) [Kaggle](https://www.kaggle.com/najeebz)
### License: Private and unlicensed. No file in this repository, under any branch, may be used commercially, personally, communally, or privately without the author's written permission.
### Copyright (c) 2023-2024 Najeeb Haider Zaidi. All rights reserved by the author.
________________________________________________________________________
# Attributions:
## The dataset is provided by the Child Mind Institute through the [Kaggle Competition](https://www.kaggle.com/competitions/child-mind-institute-detect-sleep-states/overview), in which the author is participating. The author is authorized to use the dataset solely for competition purposes.
________________________________________________________________________

In [7]:
import pandas as pd
import numpy as np 
import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt 
import datetime as dt
import string  # imported under its own name so the built-in str is not shadowed
from datetime import datetime as dts
#Disable warning
import warnings
warnings.filterwarnings("ignore")
pd.set_option('display.max_colwidth', None)


In [2]:
# Color printing
# inspired by https://www.kaggle.com/code/ravi20076/sleepstate-eda-baseline
from colorama import Fore, Style, init
from pprint import pprint
def PrintColor(text: str, color=Fore.BLUE, style=Style.BRIGHT):
    """Print text in the given colorama color and style, then reset the style."""
    print(style + color + text + Style.RESET_ALL)
    
# inspired by https://www.kaggle.com/code/rishabh15virgo/cmi-dss-first-impression-data-understanding-eda
def summarize_dataframe(df):
    summary_df = pd.DataFrame(df.dtypes, columns=['dtypes'])
    summary_df['missing#'] = df.isna().sum().values  # raw count of missing values
    summary_df['missing%'] = (df.isna().sum().values*100)/len(df)
    summary_df['uniques'] = df.nunique().values
    summary_df['first_value'] = df.iloc[0].values
    summary_df['last_value'] = df.iloc[len(df)-1].values
    summary_df['count'] = df.count().values
    #sum['skew'] = df.skew().values
    desc = pd.DataFrame(df.describe().T)
    summary_df['min'] = desc['min']
    summary_df['max'] = desc['max']
    summary_df['mean'] = desc['mean']
    return summary_df

# Dataset Description provided by CMI
- #### The dataset comprises about 500 multi-day recordings of wrist-worn accelerometer data annotated with two event types: onset, the beginning of sleep, and wakeup, the end of sleep. 
- #### The task is to detect the occurrence of these two events in the accelerometer series.
- #### Each data series represents this continuous (multi-day/event) recording for a unique experimental subject.


- #### Sleep periods are annotated according to the following rules:
  - A single sleep period must be at least 30 minutes in length
  - A single sleep period can be interrupted by bouts of activity that do not exceed 30 consecutive minutes
  - No sleep windows can be detected unless the watch is deemed to be worn for the duration (elaborated upon, below)
  - The longest sleep window during the night is the only one which is recorded
  - If no valid sleep window is identifiable, neither an onset nor a wakeup event is recorded for that night.
  - Sleep events do not need to straddle the day-line, and therefore there is no hard rule defining how many may occur within a given period. However, no more than one window should be assigned per night. For example, it is valid for an individual to have a sleep window from 01h00–06h00 and 19h00–23h30 in the same calendar day, though assigned to consecutive nights
  - There are roughly as many nights recorded for a series as there are 24-hour periods in that series.

- #### Though each series is a continuous recording, there may be periods in the series when the accelerometer device was removed. 
  - These periods are identified as stretches where suspiciously little variation in the accelerometer signals occurs over an extended period of time, which is unrealistic for typical human participants.
  - Events are not annotated for these periods, and you should refrain from making event predictions during them: an event prediction there will be scored as a false positive.

- #### Note that this is a Code Competition, in which the actual test set is hidden. In this public version, we give some sample data in the correct format to help you author your solutions. The full test set contains about 200 series.
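
The device-removal periods described above are characterized by unusually low signal variation. A minimal sketch of one way to flag such stretches, assuming a rolling standard deviation of `anglez` as the variation measure. The function name `flag_low_variation`, the 30-minute window, and the 0.5-degree threshold are illustrative assumptions, not the organisers' actual criteria.

```python
import numpy as np
import pandas as pd

def flag_low_variation(anglez, window_steps=360, std_threshold=0.5):
    """Flag samples whose trailing window of `anglez` is almost flat.

    window_steps=360 is 30 minutes if steps are 5 s apart (assumption);
    std_threshold is in degrees and purely illustrative.
    """
    rolling_std = anglez.rolling(window_steps, min_periods=window_steps).std()
    # Comparisons with NaN (incomplete windows) evaluate to False.
    return rolling_std < std_threshold

# Synthetic demo: 500 samples of movement followed by 500 perfectly still samples.
rng = np.random.default_rng(0)
signal = pd.Series(np.concatenate([rng.normal(0, 20, 500), np.full(500, -45.0)]))
flags = flag_low_variation(signal, window_steps=100)
print(flags.iloc[:500].any(), flags.iloc[599:].all())
```

A flag at a given step means the whole trailing window was flat, so the start of a flat stretch is only flagged once a full window has elapsed. On real data this would be applied per `series_id`.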

# Files and Field Descriptions
- ## train_series.parquet 
  - Series to be used as training data. Each series is a continuous recording of accelerometer data for a single subject spanning many days.
  - series_id - Unique identifier for each accelerometer series.
  - step - An integer timestep for each observation within a series.
  - timestamp - A corresponding datetime with ISO 8601 format %Y-%m-%dT%H:%M:%S%z.
  - anglez - As calculated and described by the GGIR package, z-angle is a metric derived from individual accelerometer components that is commonly used in sleep detection, and refers to the angle of the arm relative to the vertical axis of the body
  - enmo - As calculated and described by the GGIR package, ENMO is the Euclidean Norm Minus One of all accelerometer signals, with negative values rounded to zero. While no standard measure of acceleration exists in this space, this is one of the several commonly computed features
- ## train_events.csv
  - Sleep logs for series in the training set recording onset and wake events.
  - series_id - Unique identifier for each series of accelerometer data in train_series.parquet.
  - night - An enumeration of potential onset / wakeup event pairs. At most one pair of events can occur for each night.
  - event - The type of event, whether onset or wakeup.
  - step and timestamp - The recorded time of occurrence of the event in the accelerometer series.
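
Because each night holds at most one onset/wakeup pair and `step` indexes the series, the annotated sleep duration of a night can be derived from the two steps. A minimal sketch, assuming consecutive steps are 5 seconds apart (an assumption inferred from the timestamps in this dataset) and that each `(series_id, night, event)` combination appears only once. The helper name `sleep_durations` is hypothetical.

```python
import pandas as pd

STEP_SECONDS = 5  # assumption: sampling interval between consecutive steps

def sleep_durations(events):
    """Return one row per (series_id, night) with the sleep length in minutes."""
    # One column per event type; nights with a missing event get NaN.
    wide = events.pivot(index=["series_id", "night"], columns="event", values="step")
    wide = wide.dropna(subset=["onset", "wakeup"])
    wide["minutes"] = (wide["wakeup"] - wide["onset"]) * STEP_SECONDS / 60
    return wide.reset_index()

# Demo rows copied from the first two nights of series 038441c925bb.
demo = pd.DataFrame({
    "series_id": ["038441c925bb"] * 4,
    "night": [1, 1, 2, 2],
    "event": ["onset", "wakeup", "onset", "wakeup"],
    "step": [4992.0, 10932.0, 20244.0, 27492.0],
})
durations = sleep_durations(demo)
print(durations)
```

The `minutes` column can then be compared against the 30-minute minimum from the annotation rules as a sanity check on the labels.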


In [3]:
#train_series=pd.read_parquet(path="d:/Documents and Settings/Kaggle Competitions/train_series.parquet", engine='auto')
train_events=pd.read_csv("Dataset/train_events.csv")
#summarize_dataframe(train_events)

In [31]:
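def tscv(timestamp):
    """Convert an ISO 8601 string such as 2018-08-14T22:26:00-0400 to Unix epoch seconds."""
    # The parameter is not called `dt`, which would shadow the datetime module alias.
    parsed = dts.strptime(timestamp, "%Y-%m-%dT%H:%M:%S%z")
    return parsed.timestamp()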

In [25]:
train_events.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14508 entries, 0 to 14507
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   series_id  14508 non-null  object 
 1   night      14508 non-null  int64  
 2   event      14508 non-null  object 
 3   step       9585 non-null   float64
 4   timestamp  9585 non-null   object 
dtypes: float64(1), int64(1), object(3)
memory usage: 566.8+ KB


In [32]:
# Epoch seconds for every recorded event; rows without a timestamp stay NaN
train_events['ts'] = train_events['timestamp'].dropna().apply(tscv)
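
The row-wise `tscv` conversion can also be expressed as one vectorized call. A minimal sketch, assuming `pd.to_datetime` with `utc=True` so that rows with different UTC offsets (for example `-0400` and `-0500`) end up in one timezone-aware column. The helper name `to_epoch_seconds` is illustrative.

```python
from datetime import datetime

import pandas as pd

def to_epoch_seconds(timestamps):
    """Parse ISO 8601 strings and return Unix epoch seconds (NaN where missing)."""
    parsed = pd.to_datetime(timestamps, format="%Y-%m-%dT%H:%M:%S%z", utc=True)
    epoch = pd.Timestamp("1970-01-01", tz="UTC")
    return (parsed - epoch).dt.total_seconds()

demo = pd.Series(["2018-08-14T22:26:00-0400", None, "2018-12-18T12:57:25-0500"])
epoch_seconds = to_epoch_seconds(demo)

# Cross-check the first value against the strptime-based approach.
reference = datetime.strptime(demo.iloc[0], "%Y-%m-%dT%H:%M:%S%z").timestamp()
print(epoch_seconds.iloc[0] == reference)
```

Missing timestamps pass through as `NaN`, so no `dropna()` is needed before the call.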


In [35]:
train_events.sort_values(by=['series_id','ts','night'])


| | series_id | night | event | step | timestamp | ts |
|---|---|---|---|---|---|---|
| 0 | 038441c925bb | 1 | onset | 4992.0 | 2018-08-14T22:26:00-0400 | 1.534300e+09 |
| 1 | 038441c925bb | 1 | wakeup | 10932.0 | 2018-08-15T06:41:00-0400 | 1.534330e+09 |
| 2 | 038441c925bb | 2 | onset | 20244.0 | 2018-08-15T19:37:00-0400 | 1.534376e+09 |
| 3 | 038441c925bb | 2 | wakeup | 27492.0 | 2018-08-16T05:41:00-0400 | 1.534412e+09 |
| 4 | 038441c925bb | 3 | onset | 39996.0 | 2018-08-16T23:03:00-0400 | 1.534475e+09 |
| ... | ... | ... | ... | ... | ... | ... |
| 14505 | fe90110788d2 | 34 | wakeup | 581604.0 | 2017-09-07T09:17:00-0400 | 1.504790e+09 |
| 14438 | fe90110788d2 | 1 | onset | NaN | NaN | NaN |
| 14439 | fe90110788d2 | 1 | wakeup | NaN | NaN | NaN |
| 14506 | fe90110788d2 | 35 | onset | NaN | NaN | NaN |


In [40]:
train_events['ts'][(train_events['series_id']=='038441c925bb') & (train_events['night']==1)]

0        1.534300e+09
1        1.534330e+09
2        1.534376e+09
3        1.534412e+09
4        1.534475e+09
             ...     
14319             NaN
14366    1.553651e+09
14367    1.553680e+09
14438             NaN
14439             NaN
Name: ts, Length: 588, dtype: float64

In [9]:
train_events.info()  # info() prints its own report and returns None, so it is not wrapped in print()
print('\n Describe: \n',train_events.describe())
print('\n Head: \n',train_events.head(500))



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14508 entries, 0 to 14507
Data columns (total 12 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   series_id  14508 non-null  object 
 1   night      14508 non-null  int64  
 2   event      14508 non-null  object 
 3   step       9585 non-null   float64
 4   timestamp  9585 non-null   object 
 5   year       9585 non-null   object 
 6   month      9585 non-null   object 
 7   day        9585 non-null   object 
 8   hour       9585 non-null   object 
 9   min        9585 non-null   object 
 10  sec        9585 non-null   object 
 11  tz         9585 non-null   object 
dtypes: float64(1), int64(1), object(10)
memory usage: 1.3+ MB

 Describe: 
               night           step
count  14508.000000    9585.000000
mean      15.120072  214352.123944
std       10.286758  141268.408192
min        1.000000     936.000000
25%        7.000000   95436.000000
50%       14.000000  200604.000

## Analyzing Onset and Wakeup Categories

In [15]:
print('\n Wakeup Entries: \n',train_events[(train_events['event']=='wakeup')].count())
print('\n Onset Entries: \n',train_events[(train_events['event']=='onset')].count())




 Wakeup Entries: 
 series_id    7254
night        7254
event        7254
step         4794
timestamp    4794
dtype: int64

 Onset Entries: 
 series_id    7254
night        7254
event        7254
step         4791
timestamp    4791
dtype: int64


In [36]:
train_events['series_id'].value_counts(ascending=True)

series_id
349c5562ee2c      4
10469f6765bf      8
13b4d6a01d27     10
60e51cad2ffb     10
3a9a9dc2cbd9     12
               ... 
cfeb11428dd7     94
f56824b503a0    100
fb223ed2278c    106
f564985ab692    124
78569a801a38    168
Name: count, Length: 277, dtype: int64

In [14]:
print('\n Wakeup Dropna count: ',train_events.loc[train_events['event'] == 'wakeup', 'step'].dropna().count())
print('\n Wakeup count: ',train_events.loc[train_events['event'] == 'wakeup', 'step'].count())
print('\n Onset Dropna count: ',train_events.loc[train_events['event'] == 'onset', 'step'].dropna().count())
print('\n Onset count: ',train_events.loc[train_events['event'] == 'onset', 'step'].count())


 Wakeup Dropna count:  4794

 Wakeup count:  4794

 Onset Dropna count:  4791

 Onset count:  4791
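
The counts above differ by three (4794 wakeups against 4791 onsets), which suggests that a few nights carry only one of the two events. A minimal sketch for listing such nights, assuming the `train_events` column layout shown earlier. The helper name `unpaired_nights` is hypothetical.

```python
import pandas as pd

def unpaired_nights(events):
    """Return nights where exactly one of onset/wakeup has a recorded step."""
    has_step = events["step"].notna()
    # Boolean table: one row per (series_id, night), one column per event type.
    present = (
        has_step.groupby([events["series_id"], events["night"], events["event"]])
        .any()
        .unstack("event", fill_value=False)
    )
    return present[present["onset"] != present["wakeup"]].reset_index()

# Toy data: night 2 of series "a" has a wakeup but no onset step.
demo = pd.DataFrame({
    "series_id": ["a"] * 4 + ["b"] * 2,
    "night": [1, 1, 2, 2, 1, 1],
    "event": ["onset", "wakeup"] * 3,
    "step": [100.0, 900.0, None, 2500.0, None, None],
})
result = unpaired_nights(demo)
print(result)
```

Nights where both steps are missing (series "b" in the toy data) are not reported, since they are consistently unlabeled rather than half-labeled.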


In [23]:
sample_wakeup = train_events.loc[train_events['event'] == 'wakeup', 'step'].dropna()
sample_onset = train_events.loc[train_events['event'] == 'onset', 'step'].dropna()
sample_wakeup

1         10932.0
3         27492.0
5         44400.0
7         62856.0
11        97860.0
           ...   
14497    511284.0
14499    529104.0
14501    547152.0
14503    560604.0
14505    581604.0
Name: step, Length: 4794, dtype: float64

In [46]:
print(train_events['step'].isna().groupby(train_events['series_id']).sum())
print(train_events['series_id'].value_counts())

series_id
038441c925bb     8
03d92c9f6f8a    58
0402a003dae9    12
04f547b8017d    32
05e1944c3818     2
                ..
fa149c3c4bde    16
fb223ed2278c    96
fbf33b1a2c10     8
fcca183903b7     2
fe90110788d2     4
Name: step, Length: 277, dtype: int64
series_id
78569a801a38    168
f564985ab692    124
fb223ed2278c    106
f56824b503a0    100
cfeb11428dd7     94
               ... 
3a9a9dc2cbd9     12
60e51cad2ffb     10
13b4d6a01d27     10
10469f6765bf      8
349c5562ee2c      4
Name: count, Length: 277, dtype: int64


In [5]:
summarize_dataframe(train_series)


| | dtypes | missing# | missing% | uniques | first_value | last_value | count | min | max | mean |
|---|---|---|---|---|---|---|---|---|---|---|
| series_id | object | 0 | 0.0 | 3 | 038441c925bb | 0402a003dae9 | 450 | NaN | NaN | NaN |
| step | uint32 | 0 | 0.0 | 150 | 0 | 149 | 450 | 0.0 | 149.0 | 74.5 |
| timestamp | object | 0 | 0.0 | 450 | 2018-08-14T15:30:00-0400 | 2018-12-18T12:57:25-0500 | 450 | NaN | NaN | NaN |
| anglez | float32 | 0 | 0.0 | 305 | 2.6367 | 7.0299 | 450 | -88.367996 | 68.460503 | -56.177723 |
| enmo | float32 | 0 | 0.0 | 183 | 0.0217 | 0.0081 | 450 | 0.0 | 0.9802 | 0.030276 |


In [25]:
train_series['series_id'].unique()

array(['038441c925bb', '03d92c9f6f8a', '0402a003dae9'], dtype=object)

# Observation:
- As evident from the summary and the nature of the data, it should have 

In [12]:
train_events.describe()

| | night | step |
|---|---|---|
| count | 14508.0 | 9585.0 |
| mean | 15.120072 | 214352.123944 |
| std | 10.286758 | 141268.408192 |
| min | 1.0 | 936.0 |
| 25% | 7.0 | 95436.0 |
| 50% | 14.0 | 200604.0 |
| 75% | 21.0 | 317520.0 |
| max | 84.0 | 739392.0 |


In [13]:
train_series.describe()

| | step | anglez | enmo |
|---|---|---|---|
| count | 450.0 | 450.0 | 450.0 |
| mean | 74.5 | -56.177723 | 0.030276 |
| std | 43.3485 | 39.331936 | 0.06795 |
| min | 0.0 | -88.367996 | 0.0 |
| 25% | 37.0 | -88.216599 | 0.0 |
| 50% | 74.5 | -79.989449 | 0.0133 |
| 75% | 112.0 | -29.100624 | 0.03525 |
| max | 149.0 | 68.460503 | 0.9802 |


In [14]:
train_series.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 450 entries, 0 to 449
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   series_id  450 non-null    object 
 1   step       450 non-null    uint32 
 2   timestamp  450 non-null    object 
 3   anglez     450 non-null    float32
 4   enmo       450 non-null    float32
dtypes: float32(2), object(2), uint32(1)
memory usage: 12.4+ KB


## Plan:
- There are two categories of events: onset and wakeup.
- We should train two models, Onset Positive/Negative and Wakeup Positive/Negative, each producing a probability, and combine the results.
- To train two models, we need to split the training events by category and create two CSV files in this notebook.
- In the second notebook we will create the two models and train each on its own subset of the data.
- Based on the results we will decide the further plan of action.
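
The split described in the plan could look like the following. A minimal sketch, assuming the events are separated by the values of the `event` column and that rows without a recorded `step` are dropped first. The function name `split_events` and the output file names are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

def split_events(events, out_dir="."):
    """Write one CSV per event type and return a dict of event type -> file path."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    labelled = events.dropna(subset=["step"])  # keep only annotated events
    paths = {}
    for event_type, group in labelled.groupby("event"):
        path = out_dir / f"train_events_{event_type}.csv"  # hypothetical file name
        group.to_csv(path, index=False)
        paths[event_type] = path
    return paths

# Toy data: three onsets (one unlabeled) and two wakeups.
demo = pd.DataFrame({
    "series_id": ["a"] * 5,
    "night": [1, 1, 2, 2, 3],
    "event": ["onset", "wakeup", "onset", "wakeup", "onset"],
    "step": [100.0, 900.0, 2000.0, 2800.0, None],
})
with tempfile.TemporaryDirectory() as tmp:
    paths = split_events(demo, tmp)
    onset_rows = pd.read_csv(paths["onset"])
    wakeup_rows = pd.read_csv(paths["wakeup"])
print(len(onset_rows), len(wakeup_rows))
```

Dropping unlabeled rows before the split is a design choice of this sketch. Keeping them would require a separate decision on how each model treats nights without an annotation.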

In [43]:
train_events['night'].unique()

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
       35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51,
       52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68,
       69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84],
      dtype=int64)