Looking at the energy distributions of individual sensor channels reveals two notable patterns.

* For all patients, some of the sensor channels exhibit bimodal behaviour with high and low energy states, perhaps corresponding to whether the patient was awake or asleep.
* For patient 2, some of the channels exhibit glitchy behaviour in which the channel energy drops to near zero for extended periods of time (this is distinct from the all-zero dropouts common to all patients). Furthermore, some of this glitchy behaviour is confined to the training set, suggesting that the train and test data sets may cover two disjoint time intervals.

In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
from scipy.io import loadmat
import glob, re, math
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
def natural_key(string_):
    return [int(s) if s.isdigit() else s for s in re.split(r'(\d+)', string_)]

# Generate the feature specified by 'func', with each matching file split into 'num_splits' parts
def generate_feature(file_pattern, num_splits, func):
    files = sorted(glob.glob(file_pattern), key=natural_key)
    n_files = len(files)
    # Start from NaN so that unreadable files and excluded segments stay marked as missing
    feature = np.full((n_files*num_splits, 16), np.nan)
    for i, path in enumerate(files):
        try:
            mat = loadmat(path)
            data = mat['dataStruct']['data'][0, 0]
            # Integer division: slice bounds must be integers under Python 3
            split_length = data.shape[0]//num_splits
            for s in range(num_splits):
                split_start = split_length*s
                split_end = split_start+split_length
                for c in range(16):
                    channel_data = data[split_start:split_end, c]
                    zero_fraction = float(channel_data.size - np.count_nonzero(channel_data))/channel_data.size
                    # Exclude sections with more than 10% dropout (they keep their NaN value)
                    if zero_fraction <= 0.1:
                        feature[i*num_splits+s, c] = func(channel_data)
        except Exception:
            # Treat any file that fails to load or process as entirely missing
            feature[i*num_splits:(i+1)*num_splits, :] = np.nan
    return feature

# Simple log energy feature
def log_energy(data):
    return math.log(np.std(data))

Calculate the features.
Each ten-minute file is split into sixty ten-second intervals.
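
As a rough illustration of the splitting step, the sketch below computes the same log-of-standard-deviation feature for every segment and channel in one vectorised pass. It runs on a small synthetic array rather than a real recording, it omits the dropout check, and `split_log_energy` is a hypothetical helper name that does not appear elsewhere in this notebook.

```python
import numpy as np

def split_log_energy(data, num_splits):
    """Log of the per-channel standard deviation for each equal-length segment."""
    n_samples, n_channels = data.shape
    seg_len = n_samples // num_splits            # integer length; any remainder is dropped
    trimmed = data[:seg_len * num_splits]
    # Consecutive rows are grouped into segments: (num_splits, seg_len, n_channels)
    segments = trimmed.reshape(num_splits, seg_len, n_channels)
    return np.log(segments.std(axis=1))          # shape: (num_splits, n_channels)

# Synthetic stand-in for one recording (sizes are illustrative, not the real file size)
rng = np.random.default_rng(0)
demo = rng.normal(scale=5.0, size=(6000, 16))
demo_feature = split_log_energy(demo, 60)
print(demo_feature.shape)   # (60, 16)
```

Each row of the result corresponds to one segment, matching the row layout that `generate_feature` produces for a single file.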

In [None]:
# '*0.mat' selects training files whose names end in 0 (presumably the negative class)
train1_negative_log_energy = generate_feature('../input/train_1/*0.mat', 60, log_energy)
test1_log_energy = generate_feature('../input/test_1/*.mat', 60, log_energy)

train2_negative_log_energy = generate_feature('../input/train_2/*0.mat', 60, log_energy)
test2_log_energy = generate_feature('../input/test_2/*.mat', 60, log_energy)

## Patient 1 - High/low energy states
All of the sensors for patient 1 seem to show some sort of bimodal behaviour, suggestive of distinct low and high energy states (asleep vs awake?). This effect is particularly pronounced for channel 13, whose energy distribution is shown below for the train (blue) and test (green) sets.
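
If the two modes are taken at face value, one rough way to label each segment as low or high energy is a one-dimensional two-means split. The sketch below is an illustration on synthetic data, not part of the original analysis, and `two_state_threshold` is a hypothetical helper name. A bimodal histogram alone does not prove that there are exactly two underlying states.

```python
import numpy as np

def two_state_threshold(values, max_iter=100):
    """Cut point between the low and high clusters of a 1-D sample (two-means)."""
    v = np.asarray(values, dtype=float)
    v = v[~np.isnan(v)]                      # excluded segments are stored as NaN
    lo, hi = v.min(), v.max()
    if lo == hi:
        return float(lo)                     # no spread, so nothing to split
    for _ in range(max_iter):
        cut = 0.5 * (lo + hi)
        new_lo, new_hi = v[v <= cut].mean(), v[v > cut].mean()
        if new_lo == lo and new_hi == hi:    # cluster means have stopped changing
            break
        lo, hi = new_lo, new_hi
    return float(0.5 * (lo + hi))

# Synthetic bimodal sample standing in for one channel's log energy
rng = np.random.default_rng(0)
demo = np.concatenate([rng.normal(2.0, 0.2, 500), rng.normal(4.0, 0.2, 500), [np.nan]])
cut = two_state_threshold(demo)
high_fraction = np.mean(demo[~np.isnan(demo)] > cut)
print(cut, high_fraction)   # the cut lands between the two synthetic modes
```

Applied to a column such as `train1_negative_log_energy[:,13]`, the returned cut could be used to tag each ten-second segment with a state.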

In [None]:
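# Note: sns.distplot is deprecated in recent seaborn releases (histplot/displot replace it)
sns.distplot(train1_negative_log_energy[:,13][~np.isnan(train1_negative_log_energy[:,13])], label='Train', axlabel='Log energy (Channel 13, Patient 1)')
sns.distplot(test1_log_energy[:,13][~np.isnan(test1_log_energy[:,13])], label='Test')
plt.legend()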

The effect is somewhat less pronounced for channel 5...

In [None]:
sns.distplot(train1_negative_log_energy[:,5][~np.isnan(train1_negative_log_energy[:,5])], axlabel='Log energy (Channel 5, Patient 1)')
sns.distplot(test1_log_energy[:,5][~np.isnan(test1_log_energy[:,5])])

We can also see this bimodal behaviour when we plot the energy of the train/test sets in sorted file order (remembering that each file generates sixty ten-second features). The training set moves between the two states in long runs, whereas the test set shows no such ordering, so it has evidently been shuffled.
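
One hedged way to put a number on "shuffled" is the lag-1 autocorrelation of a per-file summary: consecutive files from a continuous recording should have similar energy, while a shuffled sequence should not. The sketch below uses synthetic data, and `lag1_autocorr` is a hypothetical helper name. Note that dropping NaN entries joins values that were not originally adjacent, so the result is only approximate when many entries are missing.

```python
import numpy as np

def lag1_autocorr(values):
    """Lag-1 autocorrelation of a 1-D series, ignoring NaN entries."""
    v = np.asarray(values, dtype=float)
    v = v[~np.isnan(v)]
    v = v - v.mean()
    return float(np.dot(v[:-1], v[1:]) / np.dot(v, v))

# Synthetic illustration: a series that stays in one of two states for
# 20 steps at a time, and a randomly permuted copy of the same values
rng = np.random.default_rng(0)
states = np.repeat(rng.integers(0, 2, size=50), 20).astype(float)
ordered = states + rng.normal(scale=0.1, size=states.size)
shuffled = rng.permutation(ordered)

print(lag1_autocorr(ordered))    # high: neighbours mostly share a state
print(lag1_autocorr(shuffled))   # near zero: order carries no information
```

For the real features, the comparison should be made on per-file values, for example `np.nanmean(test1_log_energy[:,13].reshape(-1, 60), axis=1)`, because the sixty segments within a single file are always consecutive even if the files themselves are shuffled.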

In [None]:
f, (ax1, ax2) = plt.subplots(2, 1, sharey=True)
ax1.plot(train1_negative_log_energy[:,13], '.', ms=1)
ax1.set_xlabel('Train')
ax2.plot(test1_log_energy[:,13], '.', ms=1)
ax2.set_xlabel('Test')
ax1.set_title('Log energy, Patient 1, Channel 13')
plt.show()

## Patient 2 - Glitchy sensors
Moving on to patient 2, we find that two of the sensor channels seem to exhibit glitchy behaviour.

Channel 3 in both the train and test sets has the following energy distribution:

In [None]:
sns.distplot(train2_negative_log_energy[:,3][~np.isnan(train2_negative_log_energy[:,3])], axlabel='Log energy (Channel 3, Patient 2)')
sns.distplot(test2_log_energy[:,3][~np.isnan(test2_log_energy[:,3])])

At first sight, this looks like it might just be another case of the high/low energy states that we saw before. However, unlike the earlier example, here there is a separation of several orders of magnitude between the high and low energy states, suggesting that it's something different.

When we plot the energy in sorted file order, we see a very different picture from before: the low energy states are confined to a section towards the end of the training set. (Perhaps the sensor developed a fault at some point?) The behaviour also continues into the test set.
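
To locate such a section without reading it off a plot, one option is to flag segments whose log energy falls below a cut-off and find the longest consecutive run. This is a minimal sketch on a toy array; `longest_low_run` is a hypothetical helper name, and the threshold is an assumption that would need to be chosen from the histogram for the real data.

```python
import numpy as np

def longest_low_run(log_energy_values, threshold):
    """Return (start, length) of the longest run of consecutive values below threshold.

    NaN entries never count as low, so they break a run.
    """
    v = np.asarray(log_energy_values, dtype=float)
    low = np.zeros(v.shape, dtype=bool)
    valid = ~np.isnan(v)
    low[valid] = v[valid] < threshold
    best_start, best_len, run_start = 0, 0, None
    for i, flag in enumerate(low):
        if flag and run_start is None:
            run_start = i                      # a new run begins here
        if not flag and run_start is not None:
            if i - run_start > best_len:
                best_start, best_len = run_start, i - run_start
            run_start = None                   # the run has ended
    # Handle a run that extends to the end of the array
    if run_start is not None and len(low) - run_start > best_len:
        best_start, best_len = run_start, len(low) - run_start
    return best_start, best_len

demo = np.array([3.1, 3.0, -4.0, 2.9, -4.2, -4.1, np.nan, -4.3, -4.0, -3.9, 3.2])
print(longest_low_run(demo, threshold=0.0))   # (7, 3)
```

Dividing the returned start index by 60 (the number of segments per file) gives the position of the run in sorted file order.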

In [None]:
f, (ax1, ax2) = plt.subplots(2, 1, sharey=True)
ax1.plot(train2_negative_log_energy[:,3], '.', ms=1)
ax1.set_xlabel('Train')
ax2.plot(test2_log_energy[:,3], '.', ms=1)
ax2.set_xlabel('Test')
ax1.set_title('Log energy, Patient 2, Channel 3')
plt.show()

Channel 9 also exhibits glitchy behaviour. However, interestingly, it's confined to the training set:

In [None]:
sns.distplot(train2_negative_log_energy[:,9][~np.isnan(train2_negative_log_energy[:,9])], axlabel='Log energy (Channel 9, Patient 2)')
sns.distplot(test2_log_energy[:,9][~np.isnan(test2_log_energy[:,9])])

Again, there's a clear ordering to the training set, with all the glitchy behaviour occurring in the middle of the sorted file order. Furthermore, since the glitchy behaviour isn't there at all in the test set, we can surmise that the two sets cover different periods of time.

In [None]:
f, (ax1, ax2) = plt.subplots(2, 1, sharey=True)
ax1.plot(train2_negative_log_energy[:,9], '.', ms=1)
ax1.set_xlabel('Train')
ax2.plot(test2_log_energy[:,9], '.', ms=1)
ax2.set_xlabel('Test')
ax1.set_title('Log energy, Patient 2, Channel 9')
plt.show()

## Further thoughts
* Might these high/low energy states be predictive of seizures?
* Could models be improved by excluding data from channels 3 and 9 for patient 2?
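
For the second question, a simple starting point would be to drop the suspect columns from the feature matrix before fitting anything. The sketch below shows this on a toy array; `drop_channels` is a hypothetical helper name, and whether removing the channels actually helps a model is untested here.

```python
import numpy as np

def drop_channels(feature, channels):
    """Return a copy of a (segments x channels) feature matrix without the given columns."""
    return np.delete(feature, channels, axis=1)

# Toy matrix with 2 segments and 16 channels; column j holds the value j in the first row
demo = np.arange(32, dtype=float).reshape(2, 16)
reduced = drop_channels(demo, [3, 9])
print(reduced.shape)   # (2, 14)
```

The same call on `train2_negative_log_energy` and `test2_log_energy` would keep the two sets aligned, since the same columns are removed from both.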