## Introduction
Thanks Kaggle bot, I'll take the tour from here. We have five sets of data prefixed **gamer*N*-** (from five computer gamers) each containing:

- A diary of annotations in **annotations.csv** including:
  - ['sleep-2-peak' reaction time](https://sleep-2-peak.com/) each hour
  - caffeine and food ingress and egress
  - self-assessment [Stanford sleepiness scale (1-7)](https://web.stanford.edu/~dement/sss.html) each hour
- A red light transmission PPG time-series sampled at approx 100Hz
  - this spans two files, **ppg-2000-01-01.csv** and **ppg-2000-01-02.csv**, each about 12hrs long
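
The naming scheme above can be enumerated in a few lines. This is only a sketch: it assumes the `../input` directory used by the cells below and the prefixes `gamer1` to `gamer5`, and the helper `expected_files` is a hypothetical name introduced here for illustration.

```python
from pathlib import Path

INPUT_DIR = Path('../input')  # assumed location, as used by the later cells

def expected_files(gamer_ids=range(1, 6)):
    """Return the file paths implied by the naming scheme described above."""
    suffixes = ['annotations.csv', 'ppg-2000-01-01.csv', 'ppg-2000-01-02.csv']
    return [INPUT_DIR / f'gamer{n}-{suffix}'
            for n in gamer_ids
            for suffix in suffixes]

files = expected_files()
print(len(files))  # one annotations file plus two PPG files per gamer
```

Comparing this list against `os.listdir('../input')` is a quick way to spot a missing or misnamed file.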

## Exploratory Analysis
To begin this exploratory analysis, first import libraries and define functions for plotting the data using `matplotlib`. [Here is a superb catalogue of plots and their statistical usefulness.](https://www.machinelearningplus.com/plots/top-50-matplotlib-visualizations-the-master-plots-python)

In [None]:
from mpl_toolkits.mplot3d import Axes3D
from sklearn.preprocessing import StandardScaler
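import matplotlib as mpl # needed for the mpl.dates locators and formatters used below
import matplotlib.dates # make sure the mpl.dates submodule is loaded
import matplotlib.pyplot as plt # plotting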
import numpy as np # linear algebra
import os # accessing directory structure
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

There are 5 x (1 + 2) = 15 csv files in the dataset:


In [None]:
os.listdir('../input')

### (

The next hidden code cells are Kaggle's default functions for plotting data. Click on the "Code" button in the published kernel to reveal the hidden code. We're not using these for this dataset.

In [None]:
# Distribution graphs (histogram/bar graph) of column data
def plotPerColumnDistribution(df, nGraphShown, nGraphPerRow):
    nunique = df.nunique()
    df = df[[col for col in df if nunique[col] > 1 and nunique[col] < 50]] # For displaying purposes, pick columns that have between 1 and 50 unique values
    nRow, nCol = df.shape
    columnNames = list(df)
    nGraphRow = (nCol + nGraphPerRow - 1) // nGraphPerRow # integer division: plt.subplot needs whole numbers
    plt.figure(num = None, figsize = (6 * nGraphPerRow, 8 * nGraphRow), dpi = 80, facecolor = 'w', edgecolor = 'k')
    for i in range(min(nCol, nGraphShown)):
        plt.subplot(nGraphRow, nGraphPerRow, i + 1)
        columnDf = df.iloc[:, i]
        if (not np.issubdtype(type(columnDf.iloc[0]), np.number)):
            valueCounts = columnDf.value_counts()
            valueCounts.plot.bar()
        else:
            columnDf.hist()
        plt.ylabel('counts')
        plt.xticks(rotation = 90)
        plt.title(f'{columnNames[i]} (column {i})')
    plt.tight_layout(pad = 1.0, w_pad = 1.0, h_pad = 1.0)
    plt.show()


In [None]:
# Correlation matrix
def plotCorrelationMatrix(df, graphWidth):
    filename = df.dataframeName
    df = df.dropna(axis='columns') # drop columns with NaN
    df = df[[col for col in df if df[col].nunique() > 1]] # keep columns where there are more than 1 unique values
    if df.shape[1] < 2:
        print(f'No correlation plots shown: The number of non-NaN or constant columns ({df.shape[1]}) is less than 2')
        return
    corr = df.corr()
    plt.figure(num=None, figsize=(graphWidth, graphWidth), dpi=80, facecolor='w', edgecolor='k')
    corrMat = plt.matshow(corr, fignum = 1)
    plt.xticks(range(len(corr.columns)), corr.columns, rotation=90)
    plt.yticks(range(len(corr.columns)), corr.columns)
    plt.gca().xaxis.tick_bottom()
    plt.colorbar(corrMat)
    plt.title(f'Correlation Matrix for {filename}', fontsize=15)
    plt.show()


In [None]:
# Scatter and density plots
def plotScatterMatrix(df, plotSize, textSize):
    df = df.select_dtypes(include =[np.number]) # keep only numerical columns
    # Remove rows and columns that would lead to df being singular
    df = df.dropna(axis='columns')
    df = df[[col for col in df if df[col].nunique() > 1]] # keep columns where there are more than 1 unique values
    columnNames = list(df)
    if len(columnNames) > 10: # reduce the number of columns for matrix inversion of kernel density plots
        columnNames = columnNames[:10]
    df = df[columnNames]
    ax = pd.plotting.scatter_matrix(df, alpha=0.75, figsize=[plotSize, plotSize], diagonal='kde')
    corrs = df.corr().values
    for i, j in zip(*np.triu_indices_from(ax, k = 1)):
        ax[i, j].annotate('Corr. coef = %.3f' % corrs[i, j], (0.8, 0.2), xycoords='axes fraction', ha='center', va='center', size=textSize)
    plt.suptitle('Scatter and Density Plot')
    plt.show()


### )

## Pick a gamer 'gamer[1-5]'

In [None]:
gamerID = 'gamer5'

### Let's load the annotations file: ../input/gamer*N*-annotations.csv

In [None]:
dateCols = ['Datetime']
anots = pd.read_csv('../input/' + gamerID + '-annotations.csv', parse_dates=dateCols)
anots.dataframeName = gamerID + '-annotations'
anots.shape

Let's take a quick look at what the data looks like:

In [None]:
anots.head(30)

Hopefully there will be some correlation between the Stanford Sleepiness Self-Assessment and the Reaction Time Test results:

In [None]:
#dfxy = anots.pivot(index='Datetime', columns='Event') # Unpack key,value pairs as columns with time x-axis
#dfxy.head()

Compare self-assessment with measured reaction times:

In [None]:
sss = anots[anots.Event == "Stanford Sleepiness Self-Assessment (1-7)"].drop('Event', axis=1).copy()
sss['SelfAssess'] = sss['Value'].astype(float)

rt = anots[anots.Event == "Sleep-2-Peak Reaction Time (ms)"].drop('Event', axis=1).copy()
rt['ReactTime'] = rt['Value'].astype(float)

diary = anots[anots.Event == "Diary Entry (text)"].drop('Event', axis=1).copy()

fatigueplot = plt.figure(figsize=(7,4), dpi= 150)
axsa = fatigueplot.add_subplot(1,1,1)
axsa.set_title('Sleepiness of ' + gamerID + ' through episode')
axsa.set_xlabel('Time of day')
plt.xticks(rotation=90)
axsa.set_xlim(pd.Timestamp('2000-01-01 11:00'), pd.Timestamp('2000-01-02 11:00:00'))
axsa.xaxis.set_major_locator(mpl.dates.HourLocator()) # hourly ticks; a per-minute locator would generate over a thousand ticks across 24 hours
axsa.xaxis.set_major_formatter(mpl.dates.DateFormatter('%d %H:%M'))
#axsa.xaxis.set_minor_formatter(mpl.dates.DateFormatter('%H:%M'))
axsa.set_ylabel('sleepiness self assessment (1-7)', color='b')
axsa.set_ylim(0.9,7.1)
axrt = axsa.twinx()
axrt.set_ylabel('reaction time (ms)', color='r')

axsa.plot('Datetime', 'SelfAssess', 'b-', data=sss)
axrt.plot('Datetime', 'ReactTime', 'r-', data=rt)

for _, s in diary.iterrows(): # s is one diary row as a Series
    axsa.axvline(s.Datetime, linewidth=0.2, color='g')
    axsa.text(s.Datetime, -1.0, s.Value, rotation=90, fontsize='xx-small',
              color='g', alpha=0.5, horizontalalignment='right')

#fatigueplot.autofmt_xdate()
plt.show()

In [None]:
fatigueplot.savefig('./'+ gamerID +'-annotations.png')
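
To put a number on how well the two curves track each other, the hourly self-assessments can be paired with the nearest reaction-time test and rank-correlated. The sketch below uses made-up values in place of the real `sss` and `rt` frames. The 30-minute pairing tolerance and the choice of Spearman correlation (computed here as the Pearson correlation of ranks, since the sleepiness scale is ordinal) are assumptions, not part of the original analysis.

```python
import pandas as pd

# Synthetic stand-ins for the `sss` and `rt` frames built above (values are made up).
sss_demo = pd.DataFrame({
    'Datetime': pd.date_range('2000-01-01 12:00', periods=6, freq='h'),
    'SelfAssess': [1.0, 2.0, 2.0, 3.0, 5.0, 6.0],
})
rt_demo = pd.DataFrame({
    'Datetime': pd.date_range('2000-01-01 12:05', periods=6, freq='h'),
    'ReactTime': [250.0, 262.0, 258.0, 280.0, 310.0, 335.0],
})

# Pair each self-assessment with the closest reaction-time test within 30 minutes.
paired = pd.merge_asof(
    sss_demo.sort_values('Datetime'),
    rt_demo.sort_values('Datetime'),
    on='Datetime',
    direction='nearest',
    tolerance=pd.Timedelta('30min'),
).dropna()

# Spearman correlation as the Pearson correlation of the ranks.
rho = paired['SelfAssess'].rank().corr(paired['ReactTime'].rank())
print(f'Spearman rho = {rho:.3f} over {len(paired)} paired readings')
```

With only about one reading per hour, any such coefficient rests on a couple of dozen points per gamer and should be read with caution.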

### Let's check the PPG time-series files: ../input/gamer*N*-ppg-2000-01-01.csv

In [None]:
nRowsRead = 2000 # set to None to read the whole file
# gamer1-ppg-2000-01-01.csv has 2,996,500 rows (about 12hrs) in reality
# but we are only previewing the first 2000 rows
ppg = pd.read_csv('../input/' + gamerID + '-ppg-2000-01-01.csv', delimiter=',', nrows = nRowsRead)
ppg.dataframeName = gamerID + '-ppg-2000-01-01.csv'
ppg.shape

Let's take a quick look at what the data looks like:

In [None]:
ppg.head(10)

Note that the microsecond timestamps jump in steps of a few tens of microseconds even though the samples are nominally about 10,000 microseconds apart. The timestamps are even less reliable at the start of the file, so let's first plot the samples as if they were simply an equi-spaced array.

This should be a nice simple time-series showing each pulse:

In [None]:
ts = plt.figure(figsize=(7,4), dpi= 150)
ax = ts.add_subplot(1,1,1)
ax.set_title('PPG time series for ' + gamerID)
ax.set_xlabel('Sample')
ax.set_xticklabels([])
ax.set_ylabel('Red transmission', color='r')
ax.plot('Time', 'Red_Signal', 'r-', data=ppg)
ts.show()

This plots the samples side by side, so the step changes in the timestamps have no effect. Let's look at how the timestamp steps affect the detail:

In [None]:
ppg['timestamp'] = pd.to_datetime(ppg['Time'])

ppg.head()

In [None]:
def compare_t_axes(data_range):
    """
    plots time series as a resampled series above the time-stamped version

    Parameters
    ----------
    data_range : pandas.DataFrame
        Pandas dataframe with columns 'Red_Signal' for the y-axis, plus 'Time' and 'timestamp' for the two x-axes
    """
    ts, (ax, tsax) = plt.subplots(2, figsize=(7,4), dpi= 150)
    ts.suptitle('PPG time series')
    ax.set_title('at constant sample rate:')
    ax.set_ylabel('Red tx', color='r')
    ax.xaxis.set_visible(False)
    ax.plot('Time', 'Red_Signal', 'r-', data=data_range)
    
    tsax.set_title('and using timestamps:')
    tsax.set_xlabel('Timestamp Time')
    plt.xticks(rotation = 90)
    tsax.set_ylabel('Red tx', color='r')
    tsax.plot('timestamp', 'Red_Signal', 'r-', data=data_range)
    ts.show()

In [None]:
compare_t_axes(ppg)

So for roughly the first 800 samples after data capture begins, Raspberry Pi Linux serial port buffering makes funny things happen with the timestamps in the Time column. It does look like it settles down after a couple of seconds though, so if we ignore the first 1000 samples it looks better:

In [None]:
compare_t_axes(ppg[1000:])

Let's zoom in to look at the quantisation noise that the jumping timestamp causes close-up:

In [None]:
compare_t_axes(ppg[1000:1100])

For signal processing purposes we should probably treat it as a series of equi-spaced samples and calculate the exact sample rate from a long stretch of the recording to determine accurate R-R intervals. Most standard time-series methods assume a constant sampling rate anyway.
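
One way to do that is to divide the number of sample intervals by the time elapsed between the first and last timestamp, then rebuild an equi-spaced time axis from the result. The sketch below runs on synthetic data: the 100 Hz rate is the nominal figure from the introduction, the 30 ms quantisation step is invented to mimic buffering, and `estimate_sample_rate` is a hypothetical helper.

```python
import numpy as np

# Synthetic capture: true rate of 100 Hz, timestamps quantised to 30 ms steps
# (the 30 ms step is a made-up stand-in for the buffering effect).
true_fs = 100.0
n = 60_000                                   # ten minutes of samples
t_true = np.arange(n) / true_fs              # ideal sample times in seconds
t_stamped = np.floor(t_true / 0.03) * 0.03   # coarse, jumpy timestamps

def estimate_sample_rate(timestamps_s):
    """Mean sample rate in Hz from the first and last timestamps (in seconds)."""
    duration = timestamps_s[-1] - timestamps_s[0]
    return (len(timestamps_s) - 1) / duration

fs = estimate_sample_rate(t_stamped)
t_uniform = t_stamped[0] + np.arange(n) / fs  # rebuilt equi-spaced time axis
print(f'estimated sample rate: {fs:.3f} Hz')
```

The longer the stretch used, the less the timestamp jitter at either end matters. On the real files the garbage timestamps at the start of each capture would need to be skipped first.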

### Finally check 3rd file: ../input/gamer*N*-ppg-2000-01-02.csv

In [None]:
nRowsRead = 2000 # set to None to read the whole file
# gamer1-ppg-2000-01-02.csv has 3,177,175 rows (approx 12hrs) in reality,
# but we are only loading/previewing the first 2000 rows
ppg2 = pd.read_csv('../input/' + gamerID + '-ppg-2000-01-02.csv', delimiter=',', nrows = nRowsRead)
ppg2.dataframeName = gamerID + '-ppg-2000-01-02.csv'
ppg2['timestamp'] = pd.to_datetime(ppg2['Time'])
ppg2.head(20)

This is just the continuation of the first file after midnight. The first 1000 samples show the same garbage-timestamp issue as the start of data capture in the previous day's file:

In [None]:
compare_t_axes(ppg2)

And 1ms jumps in the timestamp are still best ignored on an already noisy signal:

In [None]:
compare_t_axes(ppg2[1000:1200])

It's not a bad signal, and getting R-R intervals accurate enough for Heart Rate Variability analysis should be achievable.
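
As a rough illustration of that next step, beat-to-beat intervals can be read off from the peaks of an equi-spaced signal. The sketch below uses a synthetic pulse wave rather than the real PPG data. The 100 Hz rate, the `find_peaks` settings, and the choice of RMSSD as the variability measure are all assumptions, and a real recording would likely need filtering first.

```python
import numpy as np
from scipy.signal import find_peaks

fs = 100.0                        # assumed nominal sample rate in Hz
t = np.arange(0, 30, 1 / fs)      # 30 seconds of samples

# Synthetic pulse wave at about 72 beats per minute with a slowly varying period.
phase = 2 * np.pi * (1.2 * t + 0.02 * np.sin(2 * np.pi * 0.1 * t))
rng = np.random.default_rng(1)
signal = np.sin(phase) + 0.02 * rng.standard_normal(t.size)

# Keep one peak per beat: at least 0.4 s apart and clearly above the noise.
peaks, _ = find_peaks(signal, distance=int(0.4 * fs), prominence=0.5)

intervals_ms = np.diff(peaks) / fs * 1000.0            # beat-to-beat intervals
mean_hr = 60000.0 / intervals_ms.mean()                # beats per minute
rmssd = np.sqrt(np.mean(np.diff(intervals_ms) ** 2))   # a common HRV measure
print(f'{len(peaks)} beats, mean HR {mean_hr:.1f} bpm, RMSSD {rmssd:.1f} ms')
```

At 100 Hz each interval is only resolved to 10 ms, so interpolating around each peak may be worth considering before trusting short-term variability figures.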