# Dataset
We will explore this dataset: https://archive.ics.uci.edu/ml/datasets/EEG+Eye+State#

> All data is from one continuous EEG measurement with the Emotiv EEG Neuroheadset. The duration of the measurement was 117 seconds. The eye state was detected via a camera during the EEG measurement and added later manually to the file after analysing the video frames. '1' indicates the eye-closed and '0' the eye-open state. All values are in chronological order with the first measured value at the top of the data.

In [26]:
import torchtext
torchtext.disable_torchtext_deprecation_warning()
from mads_datasets import datatools
from pathlib import Path
data_dir = Path.home() / ".cache/mads_datasets/egg"
if not data_dir.exists():
    data_dir.mkdir(parents=True)

filename = "EGG.arff"
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00264/EEG%20Eye%20State.arff"
datatools.get_file(data_dir=data_dir, filename=filename, url=url, unzip=False)
datapath = data_dir / filename

2025-02-23 01:29:40.480 | INFO     | mads_datasets.datatools:get_file:95 - File /home/sarmad/.cache/mads_datasets/egg/EGG.arff already exists, skip download


You can load the arff file with scipy

In [2]:
from scipy.io import arff
data = arff.loadarff(datapath)

The data is a tuple of the observations (`data[0]`) and a description (`data[1]`)

In [3]:
len(data), type(data)

(2, tuple)

Description

In [4]:
data[1]

Dataset: EEG_DATA
	AF3's type is numeric
	F7's type is numeric
	F3's type is numeric
	FC5's type is numeric
	T7's type is numeric
	P7's type is numeric
	O1's type is numeric
	O2's type is numeric
	P8's type is numeric
	T8's type is numeric
	FC6's type is numeric
	F4's type is numeric
	F8's type is numeric
	AF4's type is numeric
	eyeDetection's type is nominal, range is ('0', '1')

There are about 15k observations

In [5]:
len(data[0])

14980

The observations are tuples of floats and a byte as label

In [6]:
data[0][0]

np.void((4329.23, 4009.23, 4289.23, 4148.21, 4350.26, 4586.15, 4096.92, 4641.03, 4222.05, 4238.46, 4211.28, 4280.51, 4635.9, 4393.85, b'0'), dtype=[('AF3', '<f8'), ('F7', '<f8'), ('F3', '<f8'), ('FC5', '<f8'), ('T7', '<f8'), ('P7', '<f8'), ('O1', '<f8'), ('O2', '<f8'), ('P8', '<f8'), ('T8', '<f8'), ('FC6', '<f8'), ('F4', '<f8'), ('F8', '<f8'), ('AF4', '<f8'), ('eyeDetection', 'S1')])

In [7]:
for x in data[0][0]:
    print(type(x))

<class 'numpy.float64'>
<class 'numpy.float64'>
<class 'numpy.float64'>
<class 'numpy.float64'>
<class 'numpy.float64'>
<class 'numpy.float64'>
<class 'numpy.float64'>
<class 'numpy.float64'>
<class 'numpy.float64'>
<class 'numpy.float64'>
<class 'numpy.float64'>
<class 'numpy.float64'>
<class 'numpy.float64'>
<class 'numpy.float64'>
<class 'numpy.bytes_'>


Let's cast the byte to int

In [8]:
labels = []
for x in data[0]:
    labels.append(int(x[14]))

In [9]:
import numpy as np
np.array(labels).mean()

np.float64(0.4487983978638184)

About 45% of the data has closed eyes.

# Exercise 1

- Download the data to a given path. You can use the `get_file` method from datatools.py for that, and wrap it with the preprocessing.

<font color='green'>

**Solution:** the dataset is downloaded to the respective path folder.

</font>

- Build a custom Dataset that yields a $X, y$ tuple of tensors. $X$ should be sequential in time. Remember: a dataset should implement `__getitem__` and `__len__`.

<font color='green'>

**Solution:** A custom Dataset named `EEGEyeDataset` is defined; it returns $X$ and $y$ when indexed by the dataloader.

</font>


- You can try to implement your own datafactory. Study all the examples in the `mads_datasets` source code.

<font color='green'>

**Solution:** the custom datafactory is implemented by using `BaseDatastreamer` from the `mads_datasets` library.

</font>

- Note that you could model this either as a classification task or as a sequence-to-sequence task! For this exercise, make it a classification task with consecutive 0s or 1s only.

<font color='green'>

**Solution:** as a first step, the classification task is implemented with the sequence length fixed to 1.

</font>

- Note that, for a training task, a seq2seq model will probably be more realistic. However, the classification is a nice exercise because it is harder to set up.
- Figure out what the length distribution of your dataset is: how many timestamps do you have for every consecutive sequence of 0s and 1s? What are the mean, median, min and max?

<font color='green'>

**Solution:** the provided dataset comes from a single continuous recording, so it is treated here as one time series and no length distribution is computed. Each timestamp has 14 channel values and a label of 0 (eyes open) or 1 (eyes closed) for that moment.

</font>
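
If the recording is instead split wherever the eye state changes, each run of identical labels becomes one sequence, and the statistics asked for in the bullet can be computed. The following is only a minimal sketch on toy labels; `run_lengths` and `toy_labels` are hypothetical names that do not appear elsewhere in the notebook, and the real `labels` list built earlier could be passed in the same way.

```python
from itertools import groupby
from statistics import mean, median


def run_lengths(labels):
    """Return (label, length) for every run of identical consecutive labels."""
    return [(label, sum(1 for _ in group)) for label, group in groupby(labels)]


# Toy labels for illustration only, not taken from the EEG file.
toy_labels = [0, 0, 0, 1, 1, 0, 1, 1, 1, 1]

runs = run_lengths(toy_labels)
lengths = [length for _, length in runs]

print(runs)  # [(0, 3), (1, 2), (0, 1), (1, 4)]
print(mean(lengths), median(lengths), min(lengths), max(lengths))  # 2.5 2.5 1 4
```
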

- Create a dataloader that yields timeseries with (batch, sequence_length). You can implement: windowed, padded and batched.
    1. yielding a windowed item should be the easy level
    2. yielding windowed and padded is medium level
    3. yielding windowed, padded and batched is expert level, because the windowing will cause the timeseries to have different sizes. You will need to buffer before you can yield a batch.

<font color='green'>

**Solution:** The dataloaders are implemented for all the levels. As discussed above, the dataset does not require any padding.

</font>
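
Padding only becomes relevant once windows are cut per run of identical labels, because a run rarely divides evenly by the window size. The sketch below illustrates that idea in plain Python on a toy one-channel signal. The helpers `windows_per_run` and `pad` are hypothetical; in the notebook the padding step would be handled by `pad_sequence` instead.

```python
from itertools import groupby


def windows_per_run(labels, window_size):
    """Cut every run of identical labels into windows of at most `window_size` steps.

    Returns (start, stop, label) index triples; the last window of a run may be shorter.
    """
    spans, start = [], 0
    for label, group in groupby(labels):
        run_len = sum(1 for _ in group)
        for offset in range(0, run_len, window_size):
            stop = min(start + offset + window_size, start + run_len)
            spans.append((start + offset, stop, label))
        start += run_len
    return spans


def pad(window, length, value=0.0):
    """Right-pad a list of timesteps with `value` up to `length`."""
    return window + [value] * (length - len(window))


# Toy data for illustration only: one channel, eight timesteps.
toy_labels = [0, 0, 0, 0, 0, 1, 1, 1]
toy_signal = [float(i) for i in range(len(toy_labels))]

spans = windows_per_run(toy_labels, window_size=3)
padded = [pad(toy_signal[a:b], 3) for a, b, _ in spans]

print(spans)   # [(0, 3, 0), (3, 5, 0), (5, 8, 1)]
print(padded)  # [[0.0, 1.0, 2.0], [3.0, 4.0, 0.0], [5.0, 6.0, 7.0]]
```

The second window covers only two timesteps, so it is the one that receives a padding value.
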


1. Upload this to GitHub.
2. Put your dev notebooks in a separate folder
3. Put all your functions in the src folder
4. Use a formatter & linter
5. Add a single notebook that sources the src folder. Indicate which level you got (1, 2 or 3)
6. and that shows your dataloader works:
    - it should not give errors because it runs out of data! Either let it stop by itself, or run forever.
    - batchsize should be consistent (in case 1 and 2, batchsize is 1)
    - sequence length is allowed to vary
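
One way to meet the "run forever" and "consistent batchsize" requirements is a generator that buffers items and starts a new shuffled pass whenever the data is exhausted. This is a rough sketch, not the `mads_datasets` implementation; the function `stream` and the toy items are assumptions made for illustration.

```python
import random


def stream(items, batch_size, seed=0):
    """Yield batches of exactly `batch_size` items forever, reshuffling on each pass."""
    if not items:
        raise ValueError("cannot stream from an empty dataset")
    rng = random.Random(seed)
    order = list(range(len(items)))
    buffer = []
    while True:
        rng.shuffle(order)
        for i in order:
            buffer.append(items[i])
            # Only yield once the buffer is full, so every batch has the same size.
            if len(buffer) == batch_size:
                yield buffer
                buffer = []


# Toy dataset of 10 items; 5 batches of 4 need 20 items, so the stream wraps around.
toy_items = list(range(10))
gen = stream(toy_items, batch_size=4)
batches = [next(gen) for _ in range(5)]
print([len(b) for b in batches])  # [4, 4, 4, 4, 4]
```

Leftover items from one pass stay in the buffer and are completed by the next pass, which is why no short batch is ever produced.
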


<font color='green'>

**Code added below**

</font>

In [53]:
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv(datapath, comment='@', header=None)  

# Convert last column to target variable
X_data = df.iloc[:, :-1].values
y_data = df.iloc[:, -1].values.astype(int)

# Normalize the values to make it easier for the model to learn
scaler = StandardScaler()
X_data = scaler.fit_transform(X_data)

In [55]:
X_data.shape, y_data.shape

((14980, 14), (14980,))
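
`StandardScaler` standardizes each column by subtracting its mean and dividing by its standard deviation (the population standard deviation, i.e. `ddof=0`). As a small worked example of that formula, here is the same computation in plain Python on a toy column; `standardize` and `toy_column` are hypothetical and the values are not taken from the EEG data.

```python
from statistics import mean, pstdev


def standardize(column):
    """Scale one column to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(column), pstdev(column)
    return [(value - mu) / sigma for value in column]


# Toy column with mean 5.0 and population standard deviation 2.0.
toy_column = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
scaled = standardize(toy_column)
print(scaled)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```
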

In [30]:
import torch
from torch.utils.data import Dataset, DataLoader

# Custom dataset: stores the features and labels as tensors.
# Each observation is reshaped to a sequence of length 1, i.e. an item of shape (1, 14).
class EEGEyeDataset(Dataset):
    def __init__(self, features, labels):
        features = features.reshape(-1, 1, features.shape[1])
        self.features = torch.tensor(features, dtype=torch.float32)
        self.labels = torch.tensor(labels, dtype=torch.long)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]

<font color='green'>

**Dataloader for level 1**

</font>

In [32]:
from torch.nn.utils.rnn import pad_sequence
from typing import List, Tuple

class preprocess_level1:
    # Level 1: no padding, so X stays a tuple of per-item tensors and only Y is stacked
    def __call__(self, batch: List[Tuple]) -> Tuple[Tuple[torch.Tensor, ...], torch.Tensor]:
        X, Y = zip(*batch)
        return X, torch.stack(Y)

dataset = EEGEyeDataset(X_data, y_data)
dataloader_level1 = DataLoader(dataset, batch_size=8, shuffle=True, collate_fn=preprocess_level1())

<font color='green'>

**Dataloader for level 2 with padding**

</font>

In [33]:
class preprocess_level2:
    def __call__(self, batch: List[Tuple]) -> Tuple[torch.Tensor, torch.Tensor]:
        X, Y = zip(*batch)
        X = pad_sequence(X, batch_first=True, padding_value=0)
        return X, torch.stack(Y)

dataset = EEGEyeDataset(X_data, y_data)
dataloader_level2 = DataLoader(dataset, batch_size=8, shuffle=True, collate_fn=preprocess_level2())

<font color='green'>

**Dataloader for level 3 with padding and sorted sequence**

</font>

In [34]:

class preprocess_level3:
    def __call__(self, batch: List[Tuple]) -> Tuple[torch.Tensor, torch.Tensor]:
        batch.sort(key=lambda x: len(x[0]), reverse=True)
        X, Y = zip(*batch)
        X = pad_sequence(X, batch_first=True, padding_value=0)
        return X, torch.stack(Y)

dataset = EEGEyeDataset(X_data, y_data)
dataloader_level3 = DataLoader(dataset, batch_size=8, shuffle=True, collate_fn=preprocess_level3())

In [35]:
import torch.nn as nn
import gin

# Custom LSTM model configurable with custom_models.gin file
gin.enter_interactive_mode()
@gin.configurable
class LSTMModel(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, num_classes):
        super(LSTMModel, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

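    def forward(self, x):
        # x has shape (batch, seq_len, input_size) because batch_first=True
        # initial hidden and cell states, created on the same device as the input
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size, device=x.device)
        c0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size, device=x.device)
        out, _ = self.lstm(x, (h0, c0))
        # classify from the output at the last timestep
        out = self.fc(out[:, -1, :])
        return out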

In [None]:
from mads_datasets.base import BaseDatastreamer

dataset = EEGEyeDataset(X_data, y_data)

# a datafactory is implemented just like 'mads_datasets'
datastreamer = BaseDatastreamer(dataset, batchsize=64, preprocessor=preprocess_level3())

In [37]:
x, y = next(iter(datastreamer.stream()))
x.shape, y.shape

(torch.Size([64, 1, 14]), torch.Size([64]))

In [38]:
from mltrainer.metrics import Metric
import numpy as np
Array = np.ndarray

# a custom Accuracy metric is defined for classification task
class AccuracySeqtoSeq(Metric):
    def __repr__(self) -> str:
        return "Accuracy"

    def __call__(self, y: Array, yhat: Array) -> float:
        yhat = np.argmax(yhat, axis=-1)
        return (yhat == y).mean()

In [74]:
from mltrainer import TrainerSettings, ReportTypes, metrics, Trainer, rnn_models
from torch import optim

# to load the .gin file
gin.parse_config_file("custom_models.gin")

accuracy = AccuracySeqtoSeq()

loss_fn = torch.nn.CrossEntropyLoss()  # loss for classification
log_dir = Path("modellogs/dummy")

settings = TrainerSettings(
    epochs=50,
    metrics=[accuracy],
    logdir=log_dir,
    train_steps=len(datastreamer),
    valid_steps=len(datastreamer),
    reporttypes=[ReportTypes.TENSORBOARD, ReportTypes.GIN,],
    scheduler_kwargs={"factor": 0.5, "patience": 5},
    earlystop_kwargs=None
)

In [None]:
# train on custom LSTM model
model = LSTMModel()

trainer = Trainer(
    model=model,
    settings=settings,
    loss_fn=loss_fn,
    optimizer=optim.Adam,
    traindataloader=datastreamer.stream(),
    validdataloader=datastreamer.stream(),
    scheduler=optim.lr_scheduler.ReduceLROnPlateau,
    device='cpu',
    )

trainer.loop()

2025-02-23 00:48:43.873 | INFO     | mltrainer.trainer:dir_add_timestamp:29 - Logging to modellogs/dummy/20250223-004843
100%|██████████| 1/1 [00:00<00:00, 150.36it/s]
2025-02-23 00:48:43.887 | INFO     | mltrainer.trainer:report:191 - Epoch 0 train 0.6897 test 0.6855 metric ['0.5938']
100%|██████████| 1/1 [00:00<00:00, 167.51it/s]
2025-02-23 00:48:43.897 | INFO     | mltrainer.trainer:report:191 - Epoch 1 train 0.6857 test 0.6817 metric ['0.5938']
100%|██████████| 1/1 [00:00<00:00, 116.68it/s]
2025-02-23 00:48:43.912 | INFO     | mltrainer.trainer:report:191 - Epoch 2 train 0.6820 test 0.6786 metric ['0.5938']
100%|██████████| 1/1 [00:00<00:00, 114.75it/s]
2025-02-23 00:48:43.928 | INFO

# Exercise 2
- build a Dataset that yields sequences of X, y. This time, y is a sequence and can contain both 0s and 1s

<font color='green'>

**Solution:** the Dataset is created to return X and y, both sequences of length 10.

</font>

- create a Dataloader with this

<font color='green'>

**Solution:** The dataloader streamer is defined by `BaseDatastreamer` from the `mads_datasets` library.

</font>

- Test appropriate architectures (RNN, Attention)

<font color='green'>

**Solution:** The `AttentionGRU` is trained for this sequence dataset. The parameters are defined in `custom_models.gin`. The final accuracy with this model is around 93%

</font>

- For the loss, note that you will need a BCELoss instead of a CrossEntropyLoss

<font color='green'>

**Solution:** `BCEWithLogitsLoss` is used because this is now sequence-to-sequence training and the model outputs logits.

</font>
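
PyTorch documents `BCEWithLogitsLoss` as a sigmoid layer combined with `BCELoss` in a single, numerically more stable class. The relationship can be checked by hand for a single prediction. The sketch below uses only the standard library, and the helper names (`sigmoid`, `bce_on_probability`, `bce_on_logit`) are hypothetical.

```python
import math


def sigmoid(z):
    """Map a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def bce_on_probability(p, y):
    """Binary cross-entropy computed from a probability p in (0, 1)."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def bce_on_logit(z, y):
    """The same loss computed directly from the logit z, in a stable form."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))


# Toy (logit, target) pairs for illustration only.
pairs = [(2.0, 1), (-1.5, 0), (0.3, 1), (0.3, 0)]
matches = [
    math.isclose(bce_on_probability(sigmoid(z), y), bce_on_logit(z, y))
    for z, y in pairs
]
print(matches)       # [True, True, True, True]
print(sigmoid(0.0))  # 0.5
```

Because a logit of 0 maps to a probability of 0.5, thresholding raw logits at 0 is equivalent to thresholding probabilities at 0.5.
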

<font color='green'>

**Code added below**

</font>

In [90]:
# Custom dataset for the sequence-to-sequence task: stores the tensors and
# returns overlapping windows of `sequence_length` consecutive timesteps.
class EEGEyeDataset2(Dataset):
    def __init__(self, features, labels):
        self.sequence_length = 10
        self.features = torch.tensor(features, dtype=torch.float32)
        self.labels = torch.tensor(labels, dtype=torch.float32)

    def __len__(self):
        return len(self.features) - self.sequence_length + 1

    def __getitem__(self, idx):
        x = self.features[idx:idx + self.sequence_length]
        y = self.labels[idx:idx + self.sequence_length]
        return x, y

In [91]:
from typing import List, Tuple

# CUSTOM defined preprocess to load the X and y
class preprocess_dummy:
    def __call__(self, batch: List[Tuple]) -> Tuple[torch.Tensor, torch.Tensor]:
        batch.sort(key=lambda x: len(x[0]), reverse=True)
        X, Y = zip(*batch)
        X = pad_sequence(X, batch_first=True, padding_value=0)
        return X, torch.stack(Y)

In [92]:
dataset = EEGEyeDataset2(X_data, y_data)

datastreamer = BaseDatastreamer(dataset, batchsize=64, preprocessor=preprocess_dummy())

In [93]:
x, y = next(iter(datastreamer.stream()))
x.shape, y.shape

(torch.Size([64, 10, 14]), torch.Size([64, 10]))
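
The number of windows follows directly from `__len__`: with overlapping windows and a stride of 1, there are `N - sequence_length + 1` start positions. A quick arithmetic check, under the assumption that the streamer only counts full batches, gives a step count consistent with the 233 steps per epoch in the training log further down.

```python
n_observations = 14980  # len(data[0]) from earlier in the notebook
sequence_length = 10    # EEGEyeDataset2.sequence_length
batchsize = 64          # value passed to BaseDatastreamer

# Same formula as EEGEyeDataset2.__len__
n_windows = n_observations - sequence_length + 1

# Assumption: the incomplete last batch is not counted
full_batches = n_windows // batchsize

print(n_windows, full_batches)  # 14971 233
```
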

In [121]:
from mltrainer.metrics import Metric
import numpy as np
Array = np.ndarray

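# Accuracy metric for binary sequence labels.
# Assumes yhat holds raw logits (the model is trained with BCEWithLogitsLoss);
# a logit of 0 corresponds to a probability of 0.5, so the threshold is 0.
class AccuracySeqtoSeq(Metric):
    def __repr__(self) -> str:
        return "Accuracy"

    def __call__(self, y: Array, yhat: Array) -> float:
        yhat = (yhat >= 0.0).astype(int)
        return (yhat == y).mean()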

In [124]:
accuracy = AccuracySeqtoSeq()

gin.parse_config_file("custom_models.gin")

loss_fn = torch.nn.BCEWithLogitsLoss()   # loss function for seq 2 seq
log_dir = Path("modellogs/dummy")

settings = TrainerSettings(
    epochs=10,
    metrics=[accuracy],
    logdir=log_dir,
    train_steps=len(datastreamer),
    valid_steps=len(datastreamer),
    reporttypes=[ReportTypes.TENSORBOARD, ReportTypes.GIN,],
    scheduler_kwargs={"factor": 0.5, "patience": 5},
    earlystop_kwargs=None
)

In [125]:
model = rnn_models.AttentionGRU()

trainer = Trainer(
    model=model,
    settings=settings,
    loss_fn=loss_fn,
    optimizer=optim.Adam,
    traindataloader=datastreamer.stream(),
    validdataloader=datastreamer.stream(),
    scheduler=optim.lr_scheduler.ReduceLROnPlateau,
    device='cpu',
    )

trainer.loop()

2025-02-23 02:35:50.378 | INFO     | mltrainer.trainer:dir_add_timestamp:29 - Logging to modellogs/dummy/20250223-023550
  0%|          | 0/10 [00:00<?, ?it/s]
100%|██████████| 233/233 [00:11<00:00, 20.51it/s]
2025-02-23 02:36:05.903 | INFO     | mltrainer.trainer:report:191 - Epoch 0 train 0.6660 test 0.6512 metric ['0.5905']
100%|██████████| 233/233 [00:11<00:00, 20.63it/s]
2025-02-23 02:36:20.987 | INFO     | mltrainer.trainer:report:191 - Epoch 1 train 0.6425 test 0.6172 metric ['0.6158']
100%|██████████| 233/233 [00:11<00:00, 21.02it/s]
2025-02-23 02:36:35.005 | INFO     | mltrainer.trainer:report:191 - Epoch 2 train 0.6013 test 0.5503 metric ['0.6580']
100%|██████████| 233/233 [00:11