## Morpheo project
# Sleep scoring assignment

*Please execute the cell below to initialize the notebook environment.*

In [1]:
%autosave 0
%matplotlib notebook

from __future__ import division, print_function
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
import keras
import h5py

plt.rcParams.update({'figure.figsize': (4.5, 3.5), 'lines.linewidth': 2.0})

Autosave disabled


Using TensorFlow backend.


## Data visualization

### Import dataset
Investigate the structure of polysomnography records in HDF5 format.

**Suggestions**
* Open HDF5 database `mesa-sleep-0001_s`
* Print table names and their shapes

In [37]:
# insert your code here
path = 'data/mesa-sleep-0001_s'

# Open the HDF5 database read-only and list every table with its shape
f = h5py.File(path, 'r')
print('table'+'\t'+'shape')
for name in f:
    print(str(name),'\t',str(f[name].shape))

table	shape
EEG1 	 (1439, 1920)
EEG2 	 (1439, 1920)
EEG3 	 (1439, 1920)
EKG 	 (1439, 1920)
EMG 	 (1439, 1920)
EOG-L 	 (1439, 1920)
EOG-R 	 (1439, 1920)
stages 	 (1439,)


**EXPECTED OUTPUT**

### Data types
Check data type of tables and their records for EEG tables (`EEG*`) and hypnogram table (`stages`).

The hypnogram is split in 30 s intervals of recording, called *epochs*. Each epoch is assigned a sleep score.
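As a quick sanity check, the table shapes listed earlier imply a sampling rate and a total recording length. The sketch below assumes that each row of an `EEG*` table holds exactly one 30 s epoch, which is an inference from the shapes rather than a documented property of the dataset.

```python
# Shapes taken from the table listing; the one-row-per-epoch layout
# is an assumption inferred from those shapes.
n_epochs, samples_per_epoch = 1439, 1920
epoch_seconds = 30

sampling_rate_hz = samples_per_epoch / epoch_seconds
record_hours = n_epochs * epoch_seconds / 3600

print(sampling_rate_hz)        # 64.0
print(round(record_hours, 2))  # 11.99
```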

Print data type of EEG tables and their records.

In [43]:
print('table\t table type\t\t\t\t record type\n')

# insert your code here
print('EEG1\t',type(f['EEG1']),'\t',f['EEG1'].dtype)
print('EEG2\t',type(f['EEG2']),'\t',f['EEG2'].dtype)
print('EEG3\t',type(f['EEG3']),'\t',f['EEG3'].dtype)
print('stages\t',type(f['stages']),'\t',f['stages'].dtype)



    



table	 table type				 record type

EEG1	 <class 'h5py._hl.dataset.Dataset'> 	 float32
EEG2	 <class 'h5py._hl.dataset.Dataset'> 	 float32
EEG3	 <class 'h5py._hl.dataset.Dataset'> 	 float32
stages	 <class 'h5py._hl.dataset.Dataset'> 	 int32


**EXPECTED OUTPUT**

### Convert data to Numpy arrays
Export EEG channels to array `x` and hypnogram to array `y`.

Print variable type of arrays `x` and `y`, and their contents.  

**Suggestions**
* Concatenate tables `EEG*` into array `x` with shape `(3, 1439, 1920)`
* Save table `stages` into array `y` with shape `(1, 1439)`

In [69]:
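# insert your code here
# Stack the three EEG tables along a new leading channel axis.
# (Concatenating along axis=1 and then reshaping would mix samples
# from different channels and epochs.)
a = np.array(f['EEG1'])
b = np.array(f['EEG2'])
c = np.array(f['EEG3'])
x = np.stack((a, b, c), axis=0)

# Add a leading axis so that the hypnogram has shape (1, 1439)
y = np.array(f['stages']).reshape(1, -1)

print('var \t var type \t\t\t element type \t var shape')
print('x\t',type(x),'\t',x.dtype,'\t',x.shape)
print('y\t',type(y),'\t',y.dtype,'\t',y.shape)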




[0 0 0 ..., 0 0 0]
var 	 var type 			 element type 	 var shape
x	 <class 'numpy.ndarray'> 	 float32 	 (3, 1439, 1920)


**EXPECTED OUTPUT**

### Visualize data
Visualize the EEG channels by plotting epoch 1000, and the hypnogram by plotting its full time course.

**Suggestions**
* Plot first 200 samples of epoch 1000 of array `x`. Add a small value to each channel to separate them vertically.
* Plot all samples from array `y`.
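One possible way to separate the channels vertically is sketched below. The helper names `offset_channels` and `plot_epoch` are hypothetical, and the default gap (the largest peak-to-peak amplitude in the window) is an arbitrary choice, not necessarily the one used for the expected figures.

```python
import numpy as np

def offset_channels(x, epoch, n_samples=200, gap=None):
    """Return the first n_samples of one epoch, shifted vertically per channel.

    x: array with shape (channel, epoch, sample)
    gap: vertical distance between channels; defaults to the largest
         peak-to-peak amplitude in the selected window (arbitrary choice)
    """
    window = x[:, epoch, :n_samples]
    if gap is None:
        gap = np.ptp(window, axis=1).max()
    offsets = gap * np.arange(window.shape[0])[:, np.newaxis]
    return window + offsets

def plot_epoch(x, y, epoch=1000):
    """Plot one EEG epoch (channels separated) next to the full hypnogram."""
    # Imported here so that offset_channels() can be used without Matplotlib
    import matplotlib.pyplot as plt

    fig, (ax_eeg, ax_hyp) = plt.subplots(1, 2, figsize=(9, 3.5))
    ax_eeg.plot(offset_channels(x, epoch).T)
    ax_eeg.set_xlabel('sample')
    ax_eeg.set_ylabel('amplitude + offset')
    ax_hyp.plot(np.ravel(y))
    ax_hyp.set_xlabel('epoch')
    ax_hyp.set_ylabel('sleep score')
    fig.tight_layout()
    return fig
```

With the arrays from the previous cell, `plot_epoch(x, y)` would draw both panels in one figure.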

In [5]:
# insert your code here

**EXPECTED OUTPUT**

<img src="figures/eeg_time.png" style="height: 350px;float: left;">
<img src="figures/hypnogram_time.png" style="height: 350px;float: left;">

## Data pre-processing

### Basic statistical metrics
Print minimum, maximum, mean and standard deviation of EEG channels in array `x`, and plot their histogram.

Print unique elements of array `y` and their proportions.

**Suggestions**
* Use functions `np.min()`, `np.max()`, `np.mean()` and `np.std()` to print statistics of array `x`
* Use function `plt.hist()` from Matplotlib to plot histogram of array `x`
* Print table of sleep stage proportions in `y`
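The suggestions above can be sketched as follows. The function names `channel_stats` and `stage_proportions` are hypothetical, and the demo arrays are synthetic stand-ins for `x` and `y`.

```python
import numpy as np

def channel_stats(x):
    """Return per-channel (min, max, mean, std) for x with shape (channel, epoch, sample)."""
    flat = x.reshape(x.shape[0], -1)
    return np.stack([flat.min(axis=1), flat.max(axis=1),
                     flat.mean(axis=1), flat.std(axis=1)], axis=1)

def stage_proportions(y):
    """Return the unique sleep scores in y and the fraction of epochs with each score."""
    stages, counts = np.unique(y, return_counts=True)
    return stages, counts / counts.sum()

# Synthetic stand-ins for the real arrays
rng = np.random.RandomState(0)
x_demo = rng.randn(3, 10, 1920).astype(np.float32)
y_demo = np.array([[0, 0, 1, 2, 2, 2, 0, 5, 5, 0]])

for channel, row in enumerate(channel_stats(x_demo)):
    print(channel, row)

stages, proportions = stage_proportions(y_demo)
for stage, proportion in zip(stages, proportions):
    print(stage, proportion)
```

For the histogram, passing the flattened channels to `plt.hist()` (for example `plt.hist(x.reshape(3, -1).T, bins=50)`) draws one set of bars per channel; the bin count here is an arbitrary choice.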

In [6]:
print('EEG\t min\t\t max\t\t mean\t\t\t std\n')

# insert your code here

EEG	 min		 max		 mean			 std



**EXPECTED OUTPUT**

<img src="figures/eeg_histogram_pre.png" style="height: 350px;float: left;">

### Remove mean from EEG data
Remove channel mean from EEG channels, print basic statistical metrics and plot histogram.

**Suggestions**
* Reshape matrix `x` into shape (3, ?)
* Use function `np.mean()` with `axis` and `keepdims` keywords to measure mean of EEG channels
* Remove mean of EEG channels from array `x`
* Reshape matrix `x` into original shape
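A minimal sketch of these four steps, assuming `x` has shape (channel, epoch, sample); the function name `remove_channel_mean` is hypothetical.

```python
import numpy as np

def remove_channel_mean(x):
    """Subtract each channel's mean from x with shape (channel, epoch, sample)."""
    shape = x.shape
    flat = x.reshape(shape[0], -1)                   # shape (3, ?)
    flat = flat - flat.mean(axis=1, keepdims=True)   # mean has shape (3, 1) and broadcasts
    return flat.reshape(shape)                       # back to the original shape

# Synthetic example: three channels with different constant offsets
x_demo = np.random.RandomState(0).randn(3, 4, 16) + np.array([1.0, -2.0, 5.0]).reshape(3, 1, 1)
x_centered = remove_channel_mean(x_demo)
print(x_centered.reshape(3, -1).mean(axis=1))  # all values close to zero
```

`keepdims=True` keeps the reduced axis with length 1, which is what lets the subtraction broadcast across all samples of a channel.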

In [7]:
# insert your code here

**EXPECTED OUTPUT**

<img src="figures/eeg_histogram.png" style="height: 350px;float: left;">

## Prepare data for training

### Write data import function
Write function `load_data()` to import EEG channels and hypnogram from HDF5 database.

In [8]:
def load_data(path):
    """Import EEG channels and hypnogram from HDF5 database.
    path: filesystem path of HDF5 database
    
    returns x: array containing EEG channels
            y: array containing hypnogram
    """
    
    x = None
    y = None
    
    # insert your code here
        
    return (x, y)

In [9]:
path = 'data/mesa-sleep-0002_s'

x, y = load_data(path)

if x is not None:
    print(x[0,0,:5])
    print(y[0,1000:1005])

**EXPECTED OUTPUT**

### Split dataset into train and validation sets
Split arrays `x` and `y` into train and validation sets `x_train`, `x_val`, `y_train` and `y_val`. The validation set contains 300 epochs from each HDF5 database.

Print the shapes of the new arrays.

**Note:** the function `np.random.seed(seed=0)` from Numpy is used to replicate the expected output.

**Suggestions**
* Create boolean array `idx` with 1439 elements initialized with `False` values
* Use function `np.random.choice()` to randomly select (without replacement) 300 elements and set them to `True`
* Split `x` into `x_train` and `x_val` according to array `idx`
* Use function `np.random.seed(seed=0)` from Numpy to replicate the expected output
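The boolean-mask split can be sketched as below, assuming `x` has shape (channel, epoch, sample) and `y` has shape (1, epoch). The function name `split_epochs` is hypothetical, and there is no guarantee that this exact sequence of random calls reproduces the expected output.

```python
import numpy as np

def split_epochs(x, y, n_val=300):
    """Randomly move n_val epochs of x and y into a validation set."""
    n_epochs = x.shape[1]
    idx = np.zeros(n_epochs, dtype=bool)
    # Draw n_val distinct epoch indices and mark them as validation epochs
    idx[np.random.choice(n_epochs, size=n_val, replace=False)] = True
    return x[:, ~idx], y[:, ~idx], x[:, idx], y[:, idx]

# Synthetic stand-ins with the same number of epochs as one database
np.random.seed(seed=0)
x_demo = np.zeros((3, 1439, 8))
y_demo = np.zeros((1, 1439), dtype=int)

x_tr, y_tr, x_va, y_va = split_epochs(x_demo, y_demo)
print(x_tr.shape, y_tr.shape)  # (3, 1139, 8) (1, 1139)
print(x_va.shape, y_va.shape)  # (3, 300, 8) (1, 300)
```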

In [10]:
np.random.seed(seed=0)

x_val = None
y_val = None
x_train = None
y_train = None

# insert your code here

**EXPECTED OUTPUT**

### Generate train and validation sets
Create train and validation sets in arrays `x_train`, `y_train`, `x_val` and `y_val` from HDF5 databases `mesa-sleep-0001_s`, `mesa-sleep-0002_s`, `mesa-sleep-0006_s`, `mesa-sleep-0014_s` and `mesa-sleep-0016_s`.

Print the shapes of train and validation datasets. Print basic statistical metrics of array `x_train`.

In [11]:
np.random.seed(seed=0)

paths = ['data/mesa-sleep-0001_s', 'data/mesa-sleep-0002_s',
         'data/mesa-sleep-0006_s', 'data/mesa-sleep-0014_s',
         'data/mesa-sleep-0016_s']

# insert your code here

**EXPECTED OUTPUT**

### Generate test set
Create test set `x_test` and `y_test` from HDF5 database `mesa-sleep-0021_s`.

In [12]:
path = 'data/mesa-sleep-0021_s'

x_test, y_test = load_data(path)

# insert your code here

**EXPECTED OUTPUT**

## Model setup

### Write input/output conversion functions
Write function `to_input()` to convert EEG data into 2-dimensional array by concatenating EEG channels. 

Write function `to_output()` to convert sleep scores into `one-hot-encoding`.
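One way to approach both conversions with NumPy is sketched here. The names `to_input_sketch` and `to_output_sketch` are hypothetical, and inferring the number of classes from the largest label is an assumption that only holds when the scores are integers starting at 0.

```python
import numpy as np

def to_input_sketch(x):
    """Concatenate channels: (channel, batch, data) -> (batch, channel * data)."""
    n_channels, n_batch, n_data = x.shape
    # Move the batch axis first so each row holds all channels of one epoch
    return np.transpose(x, (1, 0, 2)).reshape(n_batch, n_channels * n_data)

def to_output_sketch(y, n_classes=None):
    """One-hot encode integer labels: (1, batch) -> (batch, label)."""
    labels = np.ravel(y).astype(int)
    if n_classes is None:
        n_classes = labels.max() + 1  # assumption: labels are 0 .. n_classes - 1
    return np.eye(n_classes, dtype=np.float32)[labels]

x_demo = np.arange(12).reshape(2, 3, 2)  # 2 channels, 3 epochs, 2 samples
print(to_input_sketch(x_demo))
# [[ 0  1  6  7]
#  [ 2  3  8  9]
#  [ 4  5 10 11]]

y_demo = np.array([[0, 2, 1]])
print(to_output_sketch(y_demo))
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]
```

Passing `n_classes` explicitly is safer when several records are encoded separately, because a record that lacks the highest sleep score would otherwise produce a narrower array.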

In [13]:
from keras import backend as K
from keras.layers import Input, Dense, Layer
from keras.models import Sequential

def to_input(x):
    """Convert data array to shape (batch, data).
    x: array with shape (channel, batch, data)
    
    returns x_out: array x with shape (batch, data)
    """
    
    x_out = None
    
    # insert your code here
    
    return x_out

def to_output(y):
    """Convert label array to one-hot-encoding with shape (batch, label).
    y: label array with shape (1, batch)
    
    returns y_out: array y in one-hot-encoding with shape (batch, label)
    """
    
    y_out = None
    
    # insert your code here
   
    return y_out

In [14]:
if to_input(x_train) is not None:
    
    print('var\t\t shape\n')
    for item, item_name in ([[to_input(x_train), 'to_input(x_train)'],
                             [to_output(y_train), 'to_output(y_train)']]):
        print(item_name, '\t', item.shape)

    print('\n')
    print(to_input(x_train)[:2])
    print(to_output(y_train)[:2])

**EXPECTED OUTPUT**

### Convert datasets to network input/output format
Convert datasets into format compatible with network using functions `to_input()` and `to_output()`.

Print shapes of the new arrays.

In [15]:
input_train = to_input(x_train)
input_val = to_input(x_val)
input_test = to_input(x_test)

output_train = to_output(y_train)
output_val = to_output(y_val)
output_test = to_output(y_test)

if input_train is not None:
    input_shape = input_train.shape[1]
    output_shape = output_train.shape[1]

print('var\t\t shape\n')

# insert your code here

var		 shape



**EXPECTED OUTPUT**

### Softmax model
Implement a simple network with a softmax output layer using the Keras library (https://keras.io/).

Write function `model_softmax()` that returns the compiled model.

Use adadelta optimizer, binary crossentropy loss and categorical accuracy metric.
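To make concrete what a softmax output layer computes, here is a plain NumPy sketch of its forward pass. It is an illustration only, not the Keras model requested above; in Keras the same computation corresponds to a `Dense` layer with `activation='softmax'`. The names `softmax` and `softmax_layer` are hypothetical.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax of z with shape (batch, label)."""
    z = z - z.max(axis=1, keepdims=True)  # subtract row maximum for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def softmax_layer(inputs, weights, bias):
    """Forward pass of a dense layer followed by softmax."""
    return softmax(np.dot(inputs, weights) + bias)

# With zero weights and bias every class gets the same probability
inputs = np.ones((2, 4))
probabilities = softmax_layer(inputs, np.zeros((4, 3)), np.zeros(3))
print(probabilities)              # each row is [1/3, 1/3, 1/3]
print(probabilities.sum(axis=1))  # each row sums to 1
```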

In [16]:
def model_softmax():
    """Define softmax network
    
    returns m: Keras model with softmax output
    """
    
    m = None
    
    # insert your code here
    
    return m

In [17]:
model = model_softmax()

if model is not None:
    model.summary()

**EXPECTED OUTPUT**

### Train softmax model
Train softmax network for 5 training epochs with a batch size of 32 and sample shuffling.

**Suggestions**
* Use method `fit()` with keywords `epochs`, `batch_size` and `shuffle` to train model
* Use method `evaluate()` to evaluate performance metrics in validation and test sets

In [18]:
np.random.seed(seed=0)

n_epochs = 5

model = model_softmax()

# insert your code here

**EXPECTED OUTPUT**

### Performance evaluation with cross-validation
Estimate model performance on unseen data by implementing leave-one-out cross-validation scheme with function `cross_validation()`.
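The fold structure of such a scheme can be sketched independently of the model: each database is held out once for testing while the remaining ones are used for training. The generator name `leave_one_out_folds` and the short path names in the example are hypothetical.

```python
def leave_one_out_folds(paths):
    """Yield (train_paths, test_path) pairs, holding out each path exactly once."""
    paths = list(paths)
    for i, test_path in enumerate(paths):
        train_paths = paths[:i] + paths[i + 1:]
        yield train_paths, test_path

# Hypothetical path names, for illustration only
for train_paths, test_path in leave_one_out_folds(['rec_a', 'rec_b', 'rec_c']):
    print(train_paths, '->', test_path)
# ['rec_b', 'rec_c'] -> rec_a
# ['rec_a', 'rec_c'] -> rec_b
# ['rec_a', 'rec_b'] -> rec_c
```

Inside each fold, a fresh model would be built and trained on the training databases only, so that no information from the held-out record leaks into training.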

In [19]:
def cross_validation(paths, model_ref, n_epochs=5, verbose=True):
    """Leave-one-out cross-validation scheme
    paths: list containing paths of HDF5 databases
    model_ref: Keras model
    n_epochs: number of training epochs
    verbose: print intermediate results
    
    returns models: list with trained Keras models
            metrics: list with training metrics
    """
    
    models = None
    metrics = None

    np.random.seed(seed=0)

    # insert your code here
    
    return (models, metrics)

### Train softmax model with cross-validation
Train softmax model with cross-validation, 5 training epochs, batch size of 32 with sample shuffling.

In [20]:
paths = ['data/mesa-sleep-0001_s', 'data/mesa-sleep-0002_s',
         'data/mesa-sleep-0006_s', 'data/mesa-sleep-0014_s',
         'data/mesa-sleep-0016_s', 'data/mesa-sleep-0021_s']

models, model_test = cross_validation(paths, model_softmax, n_epochs=n_epochs)

# insert your code here

**EXPECTED OUTPUT**

### Shallow ANN model
Implement single hidden layer with 250 ReLU units and softmax output.
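A quick parameter count can help when checking the output of `model.summary()`. The sketch assumes that the network input is the three EEG channels of one epoch concatenated (3 x 1920 values); the helper names are hypothetical, and the number of output classes is left as a parameter because it depends on the label encoding.

```python
def dense_params(n_in, n_out):
    """Number of trainable parameters of a fully connected layer (weights + biases)."""
    return n_in * n_out + n_out

def ann_params(n_input, n_hidden, n_classes):
    """Total parameters of a network with one hidden layer and one output layer."""
    return dense_params(n_input, n_hidden) + dense_params(n_hidden, n_classes)

n_input = 3 * 1920  # assumption: three concatenated EEG channels per epoch
n_hidden = 250

print(dense_params(n_input, n_hidden))  # 1440250
```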

Write function `model_ann()` that returns the compiled model.

Use adadelta optimizer, binary crossentropy loss and categorical accuracy metric.

In [21]:
def model_ann():
    """Define shallow ANN model
    
    returns m: shallow ANN Keras model
    """
    
    m = None
    
    # insert your code here
    
    return m

In [22]:
model = model_ann()

if model is not None:
    model.summary()

**EXPECTED OUTPUT**

### Train shallow ANN model with cross-validation
Train shallow ANN model with cross-validation, 5 training epochs, batch size of 32 with sample shuffling.

In [23]:
models, model_test = cross_validation(paths, model_ann, n_epochs=n_epochs)

# insert your code here

**EXPECTED OUTPUT**

## Additional questions

Make changes to the model and training data and investigate performance impacts.

**Suggestions**
* Try other types of activation units
* Add additional hidden layers 
* Use different backprop optimizer
* Change mini-batch size
* Include additional polysomnograph channels 
* Use different data pre-processing operations
* Use spectral input representation
* Use models from cross-validation to implement committee of networks with majority voting
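For the last suggestion, majority voting over predicted class labels could look like the sketch below. The function name `majority_vote` is hypothetical, and breaking ties in favour of the lowest label is an arbitrary choice.

```python
import numpy as np

def majority_vote(predictions, n_classes):
    """Combine label predictions of several models by majority voting.

    predictions: integer array with shape (model, sample)
    returns: array with shape (sample,) holding the most voted label
    """
    predictions = np.asarray(predictions)
    # counts[c, s] = number of models that predicted class c for sample s
    counts = np.stack([(predictions == c).sum(axis=0) for c in range(n_classes)], axis=0)
    return counts.argmax(axis=0)  # ties go to the lowest label

votes = [[0, 1, 2, 2],
         [0, 1, 1, 2],
         [1, 1, 2, 0]]
print(majority_vote(votes, n_classes=3))  # [0 1 2 2]
```

With Keras models, the per-model labels could be obtained by taking `argmax(axis=1)` of each model's `predict()` output before stacking them into the `(model, sample)` array.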