In [2]:
%pylab inline

from xgboost.sklearn import XGBClassifier
from typing import Tuple
import xgboost as xgb
import pandas as pd
import json
import sklearn.model_selection  # needed for sklearn.model_selection.train_test_split below
import seaborn as sns

dataset = pd.read_csv('data/multisession-eeg.csv')
fromstring = lambda array_str: np.fromstring(array_str, dtype=float, sep=',')
dataset.raw_fft = dataset.raw_fft.apply(fromstring)
# dataset.raw_fft.iloc[0]

Populating the interactive namespace from numpy and matplotlib


# Passthoughts

What if you could simply *think your password*? That's the premise behind *passthoughts*. We'll lay this out as a classification problem:

> Given a reading, and a person, is that person who they claim to be?

We'll structure this problem as follows: for each subject, we'll train a classifier. That subject's readings will be positive examples, and everyone else's readings will be negative examples.

We can make this a little fancier by having people use specific thoughts (e.g. "focus on your breathing," "sing a song in your head," etc.). We'll make sure our methods can handle this case, but for the time being, we'll just use the `"unlabeled"` readings: people doing nothing in particular.

We'll use subject `A` as our "target" individual. We will train on this subject, and train against the other subjects in the corpus (subjects `B` and `C`).
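The one-classifier-per-subject scheme described above can be sketched as a generator that builds a feature matrix and label vector for every subject in turn. This is a minimal sketch, assuming a DataFrame with the same `subject`, `label`, `session`, and already-parsed `raw_fft` columns as `dataset`; the function name `per_subject_datasets` and the toy data are hypothetical. It uses the labelling convention adopted further down (0 = ACCEPT, 1 = REJECT).

```python
import numpy as np
import pandas as pd


def per_subject_datasets(df: pd.DataFrame, task: str, session: int = 0):
    """Yield (subject, X, y) for a one-vs-rest passthought classifier per subject.

    Positives (y = 0, ACCEPT): the subject's readings for `task` in `session`.
    Negatives (y = 1, REJECT): every reading from every other subject.
    """
    for subject in sorted(df["subject"].unique()):
        pos = df[(df["subject"] == subject) & (df["label"] == task) & (df["session"] == session)]
        neg = df[df["subject"] != subject]
        # stack the per-reading spectra into one (n_readings, n_bins) matrix
        X = np.vstack(list(pos["raw_fft"]) + list(neg["raw_fft"]))
        y = np.array([0] * len(pos) + [1] * len(neg))
        yield subject, X, y


# Tiny synthetic stand-in for `dataset`: 3 subjects, 4-bin spectra
toy = pd.DataFrame({
    "subject": ["A", "A", "B", "C"],
    "label": ["unlabeled", "breathe", "unlabeled", "unlabeled"],
    "session": [0, 0, 0, 0],
    "raw_fft": [np.arange(4.0)] * 4,
})
for subject, X, y in per_subject_datasets(toy, "unlabeled"):
    print(subject, X.shape, y.tolist())
```

The rest of this notebook works through the `A` case by hand; a loop like this would repeat it for `B` and `C`.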

In [118]:
def to_matrix (series):
    return np.array([ x for x in series ])

def readings_right_subject_right_task (subj, task, session=0):
    return to_matrix(dataset[
        (dataset['subject'] == subj) &
        (dataset['label'] == task) &
        (dataset['session'] == session)
    ].raw_fft)

def readings_wrong_subj_any_task (subj):
    return to_matrix(dataset[
        (dataset['subject'] != subj)
    ].raw_fft)


def readings_wrong_subj_any_task_or_right_subj_wrong_task (subj, task):
    a = to_matrix(dataset[ 
        (dataset['subject'] == subj) & (dataset['label'] != task)
    ].raw_fft)
    b = readings_wrong_subj_any_task (subj)
    return np.concatenate((a, b))

In [81]:
positive = readings_right_subject_right_task('A', 'unlabeled', 0)
negative = readings_wrong_subj_any_task('A')
positive.shape, negative.shape

Notice how we structured our positive and negative examples:

- *Positive examples*: The right person thinking the right task.

- *Negative examples*: The wrong person thinking any task (whether it is right or wrong).

In the context of passthoughts, we could consider other possibilities for selecting positive and negative examples.


For instance, we could also structure them in the following way:

- *Positive examples*: The right person thinking the right task.

- *Negative examples*: The wrong person thinking any task (whether it is right or wrong), or the right person thinking the wrong task.

Possible consequences of this setup, pros and cons:

- This configuration penalizes the right person for not doing the right task. That is, the inherence factor is taken to be more important than the knowledge factor when classifying; in the original configuration, the two factors are equally important. This can be a good configuration if we want to train the classifier on a person's thinking style ("how the person thinks") rather than on "what the person thinks". It does, however, assume that something like an individual thinking style exists.
- The disadvantage is that, because of nonstationarity, this setup might lead to a higher FRR (False Rejection Rate) for the right subject over time.
- Also, since this setting places so much emphasis on thinking style, the assumption that an individual thinking style exists had better be correct. We can test for this: would we get a higher FAR with my configuration? This is indeed the case (see below). This means the classifier cannot distinguish as well between readings of the right subject doing different (correct and incorrect) tasks. So the inherence factor is very important for correct classification, perhaps enough to suggest an inherent "thinking style".

We might evaluate this selection with the function `readings_wrong_subj_any_task_or_right_subj_wrong_task` above, which makes the set of negative examples bigger. I will do that below.


Now, we'll turn these data into our feature/label matrices `X` and `y`.

In [82]:
X = np.concatenate([positive, negative])

In [83]:
y = np.array([ 0 for x in positive] + [ 1 for x in negative])
assert X.shape[0] == y.shape[0]

Note that we are assigning `0` to "positive" examples, and `1` to "negative" examples. That means `0` will mean "ACCEPT" and `1` will mean "REJECT."

Now, let's train and test a classifier, and estimate its accuracy.


In [None]:

def fresh_clf () -> XGBClassifier:
    return XGBClassifier(
        objective= 'binary:logistic',
        seed=27)

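def xgb_cross_validate (
    X: np.ndarray,
    y: np.ndarray,
    nfold: int=7
) -> Tuple[XGBClassifier, pd.DataFrame]:
    # eval_metrics:
    # http://xgboost.readthedocs.io/en/latest//parameter.html
    metrics = ['error@0.1', 'auc']
#     metrics = [ 'auc' ]
    # we use the @ syntax to override the default of 0.5 as the threshold for 0 / 1 classification
    # when computing the error metric; it changes what cross-validation reports, not what predict() does.
    # With 0 = ACCEPT and 1 = REJECT, a 0.1 threshold favours a low FAR at the expense of FRR.
    # Early stopping watches the last metric in the list ('auc').
    alg = fresh_clf()
    xgtrain = xgb.DMatrix(X, y)
    param = alg.get_xgb_params()
    # newer xgboost versions report n_estimators as None, meaning the default of 100 rounds
    num_rounds = alg.get_params()['n_estimators'] or 100
    cvresults = xgb.cv(param,
                      xgtrain,
                      num_boost_round=num_rounds,
                      nfold=nfold,
                      metrics=metrics,
                      early_stopping_rounds=100
                      )
    # keep as many boosting rounds as cross-validation found useful, then refit on all the data
    alg.set_params(n_estimators=cvresults.shape[0])
    alg.fit(X, y)
    return alg, cvresults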

In [85]:
X_train, X_validate, y_train, y_validate = sklearn.model_selection.train_test_split(
    X, y, 
    test_size=0.33, 
    random_state=42)
clf, cvres = xgb_cross_validate(X_train, y_train)

In [86]:
clf.score(X_validate, y_validate)

0.98329355608591884

For authentication, we care about more than "accuracy." Here are two more specific metrics:

- False Acceptance Rate (FAR): The percentage of readings *not* from subject A incorrectly classified "ACCEPT."
- False Rejection Rate (FRR): The percentage of readings *from* subject A incorrectly classified "REJECT."

For authentication *security*, we want FAR to be as low as possible (so nobody can break in).
For authentication *usability*, we want FRR to be low (so users don't get frustrated constantly re-trying their passthought).
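Because `predict()` applies a fixed 0.5 cut-off, one way to trade security against usability is to threshold the predicted probability of REJECT yourself. The sketch below assumes scores like those from `clf.predict_proba(X)[:, 1]` (the probability of label 1, REJECT); the helper name `far_frr_at_threshold` and the example scores are made up for illustration.

```python
import numpy as np


def far_frr_at_threshold(p_reject, labels, threshold: float):
    """FAR and FRR when a reading is rejected iff P(REJECT) >= threshold.

    Labels follow the notebook's convention: 0 = genuine (ACCEPT), 1 = impostor (REJECT).
    Returns NaN for a rate whose class is absent.
    """
    p_reject = np.asarray(p_reject, dtype=float)
    labels = np.asarray(labels)
    rejected = p_reject >= threshold
    impostor = labels == 1
    genuine = labels == 0
    # FAR: share of impostor readings that were accepted
    far = np.mean(~rejected[impostor]) if impostor.any() else float("nan")
    # FRR: share of genuine readings that were rejected
    frr = np.mean(rejected[genuine]) if genuine.any() else float("nan")
    return far, frr


# Hypothetical scores: two genuine readings, then three impostors
p = np.array([0.05, 0.30, 0.20, 0.60, 0.90])
y_true = np.array([0, 0, 1, 1, 1])
for t in (0.1, 0.5):
    print(t, far_frr_at_threshold(p, y_true, t))
```

With these made-up scores, lowering the threshold from 0.5 to 0.1 drives FAR from 1/3 to 0 while FRR rises from 0 to 1/2: rejecting more readings buys security at the cost of usability.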

In [87]:
def far_frr (classifier, features, labels):
    # predict all the labels
    y_pred = classifier.predict(features)
    false_accepts = 0
    false_rejects = 0
    for predicted, actual in zip(y_pred, labels):
        # if we should have rejected,
        # but in fact accepted,
        if (actual == 1) and (predicted == 0):
            # increment false accepts
            false_accepts += 1
        # if we should have accepted,
        # but in fact rejected,
        if (actual == 0) and (predicted == 1):
            # increment false rejections
            false_rejects += 1
    # calculate proportions for each
    far = false_accepts / len(list(filter(lambda x: x == 0, y_pred)))
    frr = false_rejects / len(list(filter(lambda x: x == 1, y_pred)))
    return far, frr
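The definitions above divide false accepts by the number of impostor readings and false rejects by the number of genuine readings, whereas `far_frr` divides by the number of *predicted* accepts and rejects. A sketch that follows the stated definitions, using scikit-learn's `confusion_matrix` with `labels=[0, 1]` so the matrix stays 2×2 even when one class is missing (as it is when testing on subject `A`'s readings alone); the function name `far_frr_by_definition` is hypothetical:

```python
import numpy as np
from sklearn.metrics import confusion_matrix


def far_frr_by_definition(y_true, y_pred):
    """FAR = impostors accepted / impostors; FRR = genuine rejected / genuine.

    Convention: 0 = genuine (ACCEPT), 1 = impostor (REJECT).
    Returns NaN for a rate whose denominator is zero.
    """
    # rows = actual class, columns = predicted class, both in the order [0, 1]
    cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
    genuine_accepted, genuine_rejected = cm[0]
    impostor_accepted, impostor_rejected = cm[1]
    n_impostor = impostor_accepted + impostor_rejected
    n_genuine = genuine_accepted + genuine_rejected
    far = impostor_accepted / n_impostor if n_impostor else float("nan")
    frr = genuine_rejected / n_genuine if n_genuine else float("nan")
    return far, frr


# Made-up predictions: 2 genuine readings (one rejected), 4 impostors (one accepted)
y_true = [0, 0, 1, 1, 1, 1]
y_pred = [0, 1, 0, 1, 1, 1]
far, frr = far_frr_by_definition(y_true, y_pred)  # FAR = 1/4, FRR = 1/2
```

When the test set holds only subject `A`'s readings, there are no impostors, so FAR comes back as NaN and only FRR is meaningful.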

In [88]:
far, frr = far_frr(clf, X_validate, y_validate)
f'FAR: {far*100}% - FRR: {frr*100}%'

'FAR: 30.0% - FRR: 0.9779951100244498%'

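Now create the positive and negative examples for the alternative configuration, where the negatives also include subject `A` thinking the wrong task.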

In [119]:
positive2 = readings_right_subject_right_task('A', 'unlabeled', 0)
negative2 = readings_wrong_subj_any_task_or_right_subj_wrong_task('A', 'unlabeled')
positive2.shape, negative2.shape

In [120]:
X2 = np.concatenate([positive2, negative2])

In [121]:
y2 = np.array([ 0 for x in positive2] + [ 1 for x in negative2])
assert X2.shape[0] == y2.shape[0]

In [122]:
X_train2, X_validate2, y_train2, y_validate2 = sklearn.model_selection.train_test_split(
    X2, y2, 
    test_size=0.33, 
    random_state=42)
clf2, cvres2 = xgb_cross_validate(X_train2, y_train2)

In [123]:
clf2.score(X_validate2, y_validate2)

0.98601398601398604

In [124]:
far2, frr2 = far_frr(clf2, X_validate2, y_validate2)
f'FAR: {far2*100}% - FRR: {frr2*100}%'

So with this definition of the positive / negative examples, we get a higher FAR and FRR. The higher FAR is especially noteworthy: it could indicate that the classifier takes into account how a subject thinks (i.e. a certain style of EEG over time) more than what a subject thinks (i.e. the task the subject thinks about: unlabeled, breathe, song).

Now, these results might be good. 

But our classifier's accuracy could be misleading.   

Why? 

# Nonstationarity

We are training, and testing, using data recorded over a single session. However, EEG changes over time, a property known as *nonstationarity*. Will our great results still hold a few weeks later?

Let's take subject `A`'s data from sessions 1 and 2, which were recorded a few weeks after session 0.

In [95]:
X_subja_sess1 = readings_right_subject_right_task('A', 'unlabeled', 1)
X_subja_sess2 = readings_right_subject_right_task('A', 'unlabeled', 2)
X_subja_later = np.concatenate([X_subja_sess1, X_subja_sess2])
y_subja_later = [ 0 for x in X_subja_later ]

Now, let's try the classifier we trained on the original data, testing it on the later data.

In [96]:
far, frr = far_frr(clf, X_subja_later, y_subja_later)
f'FAR: {far*100}% - FRR: {frr*100}%'

'FAR: 0.0% - FRR: 100.0%'

Nonstationarity is a problem for us. After all, we can calibrate our target subject, but we then expect them to leave the lab and go use the device later on. If their state changes so much that they can no longer be authenticated, we can't very well claim our system is accurate!
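One way to make the evaluation reflect this is to never test on a session the classifier was trained on, instead of randomly splitting readings from a single session as above. A minimal sketch with scikit-learn's `LeaveOneGroupOut`, where each session is a group; the features are synthetic with an invented session-dependent offset to mimic nonstationarity, and logistic regression stands in for the notebook's XGBoost model only to keep the example small:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(0)

# Synthetic stand-in: 3 sessions x 60 readings x 8 features.
# y = 0 for the target subject (ACCEPT), 1 for everyone else (REJECT).
n_per_session, n_features = 60, 8
X_parts, y_parts, sessions = [], [], []
for session in range(3):
    y = rng.integers(0, 2, n_per_session)
    X = rng.normal(size=(n_per_session, n_features))
    X[y == 0, 0] += 2.0          # the target subject differs on feature 0
    X[:, 1] += 1.5 * session     # session-dependent offset on feature 1
    X_parts.append(X)
    y_parts.append(y)
    sessions += [session] * n_per_session
X_all, y_all, groups = np.vstack(X_parts), np.concatenate(y_parts), np.array(sessions)

# Train on all sessions but one, test on the held-out session
for train_idx, test_idx in LeaveOneGroupOut().split(X_all, y_all, groups):
    clf = LogisticRegression().fit(X_all[train_idx], y_all[train_idx])
    held_out = groups[test_idx][0]
    acc = clf.score(X_all[test_idx], y_all[test_idx])
    print(f"held-out session {held_out}: accuracy {acc:.2f}")
```

With real data, `groups` would be the `session` column, and the per-fold FAR and FRR would be more informative than accuracy.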

So let's quantify and qualify *what* is changing in EEG signals over time.

We could:
- Study subject `A`'s recordings over the three sessions provided here.
- Study one subject's recordings over the course of a year.

Some questions to spur investigation:

- What features of readings cause a classifier that works on earlier recordings to fail on later ones?
- What features remain the same? Are there any?
- What might be the source of these changing features? Changing placement in the EEG device? Changing properties of the brain?
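A first pass at the first two questions is to rank frequency bins by how far their mean moves between an early and a later session, relative to their spread. A minimal sketch, assuming two matrices shaped like the output of `readings_right_subject_right_task(...)` (rows = readings, columns = FFT bins); the effect-size measure (a pooled-standard-deviation mean difference, i.e. Cohen's d), the function name `bin_drift`, and the synthetic data are choices made for illustration:

```python
import numpy as np


def bin_drift(early: np.ndarray, late: np.ndarray) -> np.ndarray:
    """Per-bin standardized mean difference (Cohen's d) between two sessions.

    early, late: arrays of shape (n_readings, n_bins).
    Large |d| marks bins that changed; |d| near 0 marks bins that stayed put.
    Bins with zero variance in both sessions give inf/NaN.
    """
    mean_diff = late.mean(axis=0) - early.mean(axis=0)
    n1, n2 = len(early), len(late)
    pooled_var = ((n1 - 1) * early.var(axis=0, ddof=1)
                  + (n2 - 1) * late.var(axis=0, ddof=1)) / (n1 + n2 - 2)
    return mean_diff / np.sqrt(pooled_var)


rng = np.random.default_rng(1)
early = rng.normal(size=(40, 6))   # stand-in for session 0
late = rng.normal(size=(50, 6))    # stand-in for sessions 1 and 2
late[:, 2] += 3.0                  # simulate drift in bin 2 only

d = bin_drift(early, late)
ranked = np.argsort(-np.abs(d))
print("bins ranked by |d|:", ranked)
print("most stable bin:", ranked[-1])
```

On the real data, the top of the ranking would point at features that changed, and the bottom at candidates for features that remain the same.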

In some situations, we might be interested in passthoughts that change over time; in others, we might not. One way to discover which features cause a classifier that works on earlier recordings to fail on later ones would be to bandpass the signal into the different types of waves (alpha, beta, theta, gamma). However, this is not possible with the current data, since it has already been bandpassed to a single range that includes all of these types of waves. We could reverse-engineer the `to_power_spectrum` function from lab one.
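If the frequency of each FFT bin were known (for example, by reverse-engineering `to_power_spectrum`), band powers could be read straight off the spectrum by summing the bins inside each band, which approximates the effect of bandpassing. A sketch under that assumption: the bin frequencies below are hypothetical (516 bins spread evenly over 0 to 128 Hz), the band edges are common conventions that differ between sources, and `band_powers` is an illustrative name:

```python
import numpy as np

# Conventional EEG band edges in Hz (exact edges differ between sources).
BANDS = {"theta": (4.0, 8.0), "alpha": (8.0, 13.0), "beta": (13.0, 30.0), "gamma": (30.0, 45.0)}


def band_powers(spectra: np.ndarray, freqs: np.ndarray, bands=BANDS) -> dict:
    """Sum spectral power inside each band.

    spectra: (n_readings, n_bins) power values, e.g. rows of raw_fft.
    freqs:   (n_bins,) frequency in Hz of each column (assumed known).
    Returns {band name: (n_readings,) array of summed power}.
    """
    out = {}
    for name, (low, high) in bands.items():
        mask = (freqs >= low) & (freqs < high)
        out[name] = spectra[:, mask].sum(axis=1)
    return out


# Hypothetical bin frequencies: 516 bins evenly spaced from 0 to 128 Hz.
freqs = np.linspace(0.0, 128.0, 516)
spectra = np.ones((3, 516))  # flat dummy spectra for 3 readings
powers = band_powers(spectra, freqs)
print({name: p[0] for name, p in powers.items()})
```

Per-band powers computed this way for each session could then be compared in the same way as individual bins.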

Also, if we had time-domain data, a more advanced project could view the brain as a nonlinear dynamical system. It could test whether the EEG data follow a normal distribution or a power law. Features such as these, which treat the brain as a chaotic system, could shed new light on what properties of EEG signals change over time. This view takes the stance that the source of the changes in EEG signals over time is internal to the system.

A different approach could be as follows: the features of a task might cause the features of readings to change over time. For instance, we might hypothesize that breathing is a very common task that we have done millions of times in our lives, so just breathing (and thinking about it) should not change (much) over time. Compare this to the task of looking at a face: neuroscience has the concept of a grandmother neuron (or Halle Berry neuron), which activates when a person sees their grandmother. Something similar might exist (perhaps as a pattern rather than a single cell) for every new task, in this case for every new face. Furthermore, we might argue that it takes time for this pattern (or single cell) to "lock in", analogous to a muscle that needs to become accustomed to a new movement: in the beginning, you try out different movements until you converge on the perfect movement.
So, the newness of a task might cause the features of readings to change over time.

Let us look at the different tasks:

In [125]:
print('Session 0:', dataset[(dataset['subject']=='A') & (dataset['session'] == 0)]['label'].unique())
print('Session 1:', dataset[(dataset['subject']=='A') & (dataset['session'] == 1)]['label'].unique())
print('Session 2:', dataset[(dataset['subject']=='A') & (dataset['session'] == 2)]['label'].unique())

Session 0: ['unlabeled' 'breathe' 'song' 'song_o' 'sport' 'breathe_o' 'speech' 'face']
Session 1: ['unlabeled' 'calibration' 'word_x' 'phrase_x' 'face_x' 'breatheopen'
 'song_x' 'sport_x']
Session 2: ['unlabeled' 'calibration' 'breatheclosed' 'word_x' 'word_c' 'phrase_x'
 'phrase_c' 'face_x' 'face_c' 'breatheopen' 'song_x' 'song_c' 'sport_x'
 'sport_c']


Surprise, surprise: there are breathing tasks. Let us repeat the above analysis with one of the breathing tasks, training on `breathe_o` in session 0 and testing on `breatheopen` in sessions 1 and 2.

In [126]:
positive3 = readings_right_subject_right_task('A', 'breathe_o', 0)
negative3 = readings_wrong_subj_any_task('A')
positive3.shape, negative3.shape

In [127]:
X3 = np.concatenate([positive3, negative3])

In [128]:
y3 = np.array([ 0 for x in positive3] + [ 1 for x in negative3])
assert X3.shape[0] == y3.shape[0]

In [129]:
X_train3, X_validate3, y_train3, y_validate3 = sklearn.model_selection.train_test_split(
    X3, y3, 
    test_size=0.33, 
    random_state=42)
clf3, cvres3 = xgb_cross_validate(X_train3, y_train3)

In [130]:
clf3.score(X_validate3, y_validate3)

0.99273607748184023

In [131]:
far3, frr3 = far_frr(clf3, X_validate3, y_validate3)
f'FAR: {far3*100}% - FRR: {frr3*100}%'

In [132]:
# non-stationarity
X_subja_sess1 = readings_right_subject_right_task('A', 'breatheopen', 1)
X_subja_sess2 = readings_right_subject_right_task('A', 'breatheopen', 2)
X_subja_later2 = np.concatenate([X_subja_sess1, X_subja_sess2])
y_subja_later2 = [ 0 for x in X_subja_later2 ]

In [133]:
far3, frr3 = far_frr(clf3, X_subja_later2, y_subja_later2)
f'FAR: {far3*100}% - FRR: {frr3*100}%'

The data so far fits my theory. There seems to be a very good classifier for the breathing task, as far as nonstationarity and FAR are concerned: compared to other subjects, it classifies with a very low FAR. Across time, the later sessions contain only subject `A`'s readings, so no false acceptances are possible there and the rate that matters is the FRR.
However, we might want to check other tasks, too.

There are, of course, other possible sources of changing features.
Newness and adapting brain patterns can both be understood under the concept of brain plasticity. The brain's neurons rewire constantly, over time frames ranging from short (as can be seen in conditioning experiments) to long (such as learning a very complex skill). There might be too few repetitions of the tasks for brain or synaptic plasticity to change EEG signals significantly, but we would expect a stronger effect over a longer period.

Another possible source of changing features might be the placement of the EEG device, though not necessarily: if we used multiple electrodes, the classifier should have enough information to filter out (most of) the differences in placement. However, with only one electrode, a different placement might change the signal dramatically. Just think about an eye blink recorded by an electrode at Fp1 versus the same blink recorded by an electrode at P3: that would be a huge difference. In more general terms, the fewer electrodes we use, the more electrode placement should contribute to changes in the features.

And then there are many other possible factors, e.g. stress, which affects neuroplasticity and thus changes the EEG signal, along with many smaller factors to take into account. This view contrasts with the nonlinear dynamical systems view in that it treats the source of the changes in EEG signals over time as external to the system.

I would also have liked to visualize the data, to extend the above analysis beyond the breathing task and eyeball other tasks that look promising for new insights, or that might support or serve as counterexamples to my theory. That will be a task for future endeavours.

In [None]:
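# Long-form table: one row per (reading, FFT bin), so seaborn can aggregate over readings.
# (sns.tsplot is deprecated and removed in later seaborn versions; lineplot replaces it.)
spectra = to_matrix(dataset.raw_fft)          # shape: (n_readings, n_bins)
long_df = (pd.DataFrame(spectra)
           .assign(subject=dataset['subject'].values, label=dataset['label'].values)
           .melt(id_vars=['subject', 'label'], var_name='bin', value_name='power'))

# Plot the mean raw_fft per task with standard-error bands (errorbar='se' needs seaborn >= 0.12)
sns.lineplot(data=long_df, x='bin', y='power', hue='label', errorbar='se')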
