report/results.tex


\section{Results} \label{results}

\subsection{A new efficient \py{} package}
Several algorithms to extract features from univariate time series have already
been implemented in the \py{} package \pyeeg{}\cite{bao_pyeeg:_2011}.
Unfortunately, some of them were critically slow, and could therefore not realistically be used in the present study.
Preliminary investigation of \pyeeg{} source code revealed that runtimes may be improved mainly by vectorising expressions and pre-allocating of temporary arrays.
Therefore, systematic reimplementation of all algorithms in \pyeeg{} was undertaken.
Very significant improvements in performance were achieved for almost all functions(table~\ref{tab:benchmark}).
Critically, sample and approximate entropies\cite{richman_physiological_2000} became usable in
reasonable time.
For instance, 40h of CPU time were originally required to compute sample entropy on a all 5 second epochs in a 24h recording.
The improved implementation cut this down to 55min.
\input{./tables/benchmark}

Importantly, several mathematical inconsistencies between the original code and the mathematical definitions were also detected.
This affected five of the eight reimplemented functions(table~\ref{tab:benchmark}, rightmost column).
Details of the corrections performed are provided, as notes, in the documentation of the new package (see appendix, section 2.3).
Numerical results for the three functions were consistent throughout optimisation.

In order to facilitate feature extraction, several data structures and routines, which were not 
available in \pyeeg{}, were also implemented in a new \py{} package named \pr{}.
Briefly, extensions of \texttt{numpy} arrays\cite{walt_numpy_2011} providing
meta-data, sampling frequency, and other attributes were used to represent time series.
User friendly indexing with strings representing time was also developed.
In addition, a container for time series of discrete annotation levels, each linked to a confidence level, was built.
Importantly, a container for multiple time series, which supports heterogeneous (between time series) sampling frequencies was implemented.
The new package also provides visualization, input/output, and wrappers for resampling and discrete wavelet decomposition.
Finally, unittests were implemented to ensure persistence of mathematical and programmatic validity throughout developmental stage.
A complete documentation of \pr{} is provided as an appendix to the report herein.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{Twenty variables can generate accurate predictions}
Including temporal information (see next section) will result in multiplication of the number of variables, rendering computation difficult, and prediction potentially less accurate.
Therefore, recursive feature elimination\cite{menze_comparison_2009} based on
Gini variable importance was undertaken.
Starting with all 164 variables, random forests were trained, and the number of features was reduced by a factor $1.5$  by eliminating the least important variables.
For each iteration, the stratified cross-validation error (see material and methods) was computed (fig.~\ref{fig:variable_elimination}). Five replicates were performed.

\input{./figures/variable_elimination}
The predictive accuracy globally increases with the number of variables.
Interestingly, using only the two most important variables already results in less that 20\% average error.
Additionally, this increase in predictive accuracy is very moderate for ($p>9$). This indicates that dimensionality can be considerably reduced without largely impacting accuracy.
For further investigation, $p=21$ was considered to be a good compromise between error and computational load. Their relative importance is listed in table~\ref{tab:importances}.
\input{./tables/importances}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{Including temporal information improves prediction}
%~ Manual scorer usually use contextual information in order to make a decision concerning a given state.
%~ For instance, implicit assumptions are made on the temporal consistency of the states.
%~ Therefore, it seems important to account for contextual information when building a predictor.
%~ In this study, two different approaches were pursued.
Two distinct strategies were used to include temporal information in the feature space:
either the local means of features over different ranges were added to the features
of each epochs (eq.~\ref{eq:window}),
or the features of neighbouring time points were included (eq.~\ref{eq:tau}).

A comparison of both approaches is proposed in figure \ref{fig:temporal_integration}.
Significant improvement was achieved by both methods by including even little temporal information ($\tau = 1$, $n=\{1,3\}$, $p-value < 5.10^{-3}$).
Although both methods improve performance, it appears that the inter-recording variablility is systematically lower using convolution method.
For instance, the standard variation in error between all recordings is always lower than 0.03 for convolutions of window size greater than two,
whilst it is systematically above 0.031 for all the values of $\tau$.
The most significant difference with the original set of variables was achieved using $n = \{1,3,7,15,31\}$ ($p-value < 10^{-6}$).
Therefore, this new set of 105 variables was used to train the final predictor.
Combining both convolution and lag approaches did not improve prediction any further (data not shown).

\input{./figures/temporal_integration}


\subsection{Structural differences with ground truth}

\input{./figures/struct_assess}

After selecting important variables and defining new ones, general and structural performance of the classifier was assessed.
For this purpose, random forests were trained on samples accounting for the unbalanced prevalence.
Predictions were generated using the same stratified sampling approach.
That is, predictions on each time series were made by a classifier for which
the training set did not contain any point from this time series.


\input{./tables/confus}

The global confusion matrix is presented in table \ref{tab:confus}. The overall accuracy compared to manual scoring was 0.92.
For \gls{nrem} and wake, both specificity and positive predictive value were above 0.92. However,
for \gls{rem} epochs, the specificity is only 0.74, and the false detection rate is 15\%.

In order to investigate the structural differences between ground truth and the predicted states,
three metrics describing physiological properties of sleep were computed (fig.~\ref{fig:struct_assess}).

Prevalence of states (fig.~\ref{fig:struct_assess}A) is a widely used metric to
describe sleep patterns.
No significant difference was found between the ground truth and predicted prevalences ($p-value > 0.13$ for all, z test on the interaction terms of $\beta$ regression).

Sleep fragmentation is often assessed by a combination of two variables: number of episodes of a given state 
and average duration of all episodes per state\cite{pang_unexpected_2009}. The
effect of prediction on both variables were therefore studied (fig.~\ref{fig:struct_assess}B and C,
respectively).

Large differences were observed between the number of \gls{rem} sleep and awake episodes between the ground truth and the predicted data
($p-value < 10^{-15}$ for both, t test on the interaction terms of Poisson GLMM).

In addition, significant structural differences were found between methods for the average duration of awake ($p-value < 10^{-3}$) episodes 
(t test on the interaction terms of linear mixed model). 

\subsection{Attribution of confidence score to predictions}

Since classification may be inaccurate, it would be interesting to associate `confidence' score to each prediction.
An entropy-based confidence metric $c$ (eq. \ref{eq:entropy}) was defined for this purpose.
In order to validate it, the average cross-validation error was computed for different range of confidence (fig\ref{fig:error}A).

\input{./figures/error}

As expected, the probability of misclassification decreases monotonically with $c$.
In addition, error rates tend to zero when the confidence value is one, and, for confidences close to zero, the predictor is very inaccurate.
These characteristics indicate that $c$ can be used as a supporting value for predictions.
One application of such a confidence level could be to provide users with an overall quality assessment.
In addition, it  makes it possible to display confidence while visually inspecting a recording (fig\ref{fig:error}C) in order to facilitate resolution of ambiguities.