# Dataset
We will explore this dataset: https://archive.ics.uci.edu/ml/datasets/EEG+Eye+State#

> All data is from one continuous EEG measurement with the Emotiv EEG Neuroheadset. The duration of the measurement was 117 seconds. The eye state was detected via a camera during the EEG measurement and added later manually to the file after analysing the video frames. '1' indicates the eye-closed and '0' the eye-open state. All values are in chronological order with the first measured value at the top of the data.

In [3]:
import tensorflow as tf
data_dir = "../../data/raw"
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00264/EEG%20Eye%20State.arff"
datapath = tf.keras.utils.get_file(
    "eeg", origin=url, untar=False, cache_dir=data_dir
)
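
If you would rather not depend on TensorFlow just for the download, the same step can be done with the standard library. This is a minimal sketch, not the notebook's own method: the function name `get_eeg` is borrowed from the exercises further down, and the `filename` default and the skip-if-present behaviour are assumptions.

```python
from pathlib import Path
from urllib.request import urlretrieve

EEG_URL = (
    "https://archive.ics.uci.edu/ml/machine-learning-databases/"
    "00264/EEG%20Eye%20State.arff"
)


def get_eeg(data_dir, url=EEG_URL, filename="eeg.arff"):
    """Download the ARFF file into data_dir and return the local path.

    `filename` is a hypothetical default; the download is skipped
    when the target file already exists.
    """
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    target = data_dir / filename
    if not target.exists():
        urlretrieve(url, str(target))
    return target
```

Calling `get_eeg("../../data/raw")` would then return a path that can be passed to `arff.loadarff`.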

You can load the arff file with scipy

In [4]:
from scipy.io import arff
data = arff.loadarff(datapath)

The data is a tuple of the observations (`data[0]`) and a description (`data[1]`)

In [34]:
len(data), type(data)

(2, tuple)

Description

In [11]:
data[1]

Dataset: EEG_DATA
	AF3's type is numeric
	F7's type is numeric
	F3's type is numeric
	FC5's type is numeric
	T7's type is numeric
	P7's type is numeric
	O1's type is numeric
	O2's type is numeric
	P8's type is numeric
	T8's type is numeric
	FC6's type is numeric
	F4's type is numeric
	F8's type is numeric
	AF4's type is numeric
	eyeDetection's type is nominal, range is ('0', '1')
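
Since `loadarff` returns a pair, unpacking it into two names can be easier to read than indexing with `data[0]` and `data[1]`. A minimal sketch on a tiny in-memory ARFF file; the two-channel content below is a made-up stand-in for illustration, not the real dataset, and the variable names `records` and `meta` are my own.

```python
from io import StringIO

from scipy.io import arff

# Toy ARFF content with the same layout as the EEG file:
# numeric channels followed by a nominal label.
toy_arff = """@relation TOY_EEG
@attribute AF3 numeric
@attribute F7 numeric
@attribute eyeDetection {0,1}
@data
1.0,2.0,0
3.0,4.0,1
"""

# loadarff accepts a file-like object as well as a path.
records, meta = arff.loadarff(StringIO(toy_arff))

print(meta.names())             # attribute names, in file order
print(meta.types())             # 'numeric' or 'nominal' per attribute
print(records["eyeDetection"])  # nominal values come back as bytes
```

Columns of the record array can be selected by attribute name, which avoids relying on positional indices such as `x[14]`.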

There are about 15k observations

In [18]:
len(data[0])

14980

The observations are tuples of 14 floats (one per channel) with a bytes string as the label

In [41]:
data[0][0]

(4329.23, 4009.23, 4289.23, 4148.21, 4350.26, 4586.15, 4096.92, 4641.03, 4222.05, 4238.46, 4211.28, 4280.51, 4635.9, 4393.85, b'0')

In [37]:
labels = []
for x in data[0]:
    labels.append(int(x[14]))

In [39]:
import numpy as np
np.array(labels).mean()

0.4487983978638184

About 45% of the observations are labelled as eyes closed.
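
The loop over the records can also be written without explicit iteration. The sketch below uses a small hand-built structured array as a stand-in for the observations (the values are made up), and separates it into a float feature matrix `X` and an integer label vector `y`; both names are assumptions for illustration.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# Toy stand-in for the record array returned by loadarff:
# two numeric channels and a bytes label.
records = np.array(
    [(1.0, 2.0, b"0"), (3.0, 4.0, b"1"), (5.0, 6.0, b"1")],
    dtype=[("AF3", "<f8"), ("F7", "<f8"), ("eyeDetection", "S1")],
)

# Every field except the label is a feature channel.
feature_names = [n for n in records.dtype.names if n != "eyeDetection"]

# Convert the selected fields into a plain (timesteps, channels) array.
X = rfn.structured_to_unstructured(records[feature_names])

# Bytes labels b"0" / b"1" are parsed into integers 0 / 1.
y = records["eyeDetection"].astype(int)

print(X.shape, y.mean())
```

On the real data, `y.mean()` should match the fraction computed with the loop above.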

# Exercises

- create a get_eeg function that downloads the data to a given path
- build a Dataset that yields an (X, y) tuple of tensors. Every observation should be a sequence of only 0s or only 1s. Remember: a dataset should implement `__getitem__` and `__len__`.
- figure out what the length distribution of your dataset is: how many timesteps do you have for every observation? What are the mean, median, min and max?
- take this into account for your choice of an architecture and for the padding / windowing
- create a dataloader that yields timeseries that are:
    - windowed (easy)
    - windowed and padded (medium)
    - windowed, padded and batched (hard)
- Test appropriate architectures
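
As a starting point for the segmenting, length-statistics and windowing steps, here is one possible sketch in plain NumPy. It is not a reference solution: the helper names `split_constant_runs` and `window_and_pad`, the choice of non-overlapping windows, and zero padding are all assumptions, and the data is a toy example.

```python
import numpy as np


def split_constant_runs(X, y):
    """Split (X, y) into contiguous segments with a constant label."""
    # Positions where the label differs from the previous timestep.
    change_points = np.flatnonzero(np.diff(y) != 0) + 1
    X_runs = np.split(X, change_points)
    y_runs = [int(run[0]) for run in np.split(y, change_points)]
    return X_runs, y_runs


def window_and_pad(segment, window_size):
    """Cut a (timesteps, channels) segment into non-overlapping windows.

    The last window is zero-padded up to window_size.
    """
    n_steps, n_channels = segment.shape
    n_windows = -(-n_steps // window_size)  # ceiling division
    padded = np.zeros((n_windows * window_size, n_channels), dtype=segment.dtype)
    padded[:n_steps] = segment
    return padded.reshape(n_windows, window_size, n_channels)


# Toy data: 10 timesteps, 2 channels.
y = np.array([0, 0, 0, 1, 1, 0, 0, 0, 0, 1])
X = np.arange(len(y) * 2, dtype=float).reshape(len(y), 2)

X_runs, y_runs = split_constant_runs(X, y)
lengths = np.array([len(run) for run in X_runs])
print(lengths.mean(), np.median(lengths), lengths.min(), lengths.max())

# The third run has 4 timesteps, so it needs two windows of size 3.
windows = window_and_pad(X_runs[2], window_size=3)
print(windows.shape)
```

Because every window has the same shape after padding, windows from different segments can be stacked along the first axis to form batches. Whether zero padding is appropriate depends on the architecture you pick.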