# Useckit High-Level Example

This notebook provides an introduction to useckit. It focuses on the high-level API that useckit provides in the form of `Evaluators`. To demonstrate it, we download a dataset, inspect it, implement the preprocessing, and run the final evaluation with `useckit`.

To run this notebook, please first create a virtual environment (step 1), activate it (step 2) and install the requirements (step 3).
1. Create virtual environment: `python -m venv venv`
2. On Microsoft Windows with PowerShell: `.\venv\Scripts\Activate.ps1`, on Unix-like systems with Bash: `source ./venv/bin/activate`
3. Install dependencies: `pip install -r requirements.txt`

**Tested with**: `Python 3.10.7 (tags/v3.10.7:6cc6b13, Sep  5 2022, 14:08:36) [MSC v.1933 64 bit (AMD64)] on win32` and tensorflow `2.11.0`.

## Step 1: Essential Imports and Setup

We import several modules from the Python standard library and common dependencies such as numpy, pandas, and matplotlib. We further provide a convenience wrapper for `tqdm`: some terminals (especially on Windows) do not render its default progress bar properly, so we wrap `tqdm()` to always print ASCII output.

In [None]:
import os
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from glob import glob


def tqdm(*args):  # wrapper for tqdm to always print ascii output
    """Provide a progress bar for iterator operations."""
    from tqdm import tqdm as _tqdm
    return _tqdm(args[0], ascii=True)

In [None]:
# append current working directory to path
# this is necessary so that jupyter interpreter can find the useckit lib to import
# when opening this project where the folder `examples` is located beneath another root directory.
sys.path.append(os.path.dirname(os.getcwd()))

## Step 2: Extracting Dataset

In the following, we are going to download an excerpt of a behavioral biometric dataset.

### Origin of dataset

The dataset originates from the following publication:

```
@article{doi:10.1080/10447318.2022.2120845,
  author = {Jonathan Liebers and Sascha Brockel and Uwe Gruenefeld and Stefan Schneegass},
  title = {Identifying Users by Their Hand Tracking Data in Augmented and Virtual Reality},
  journal = {International Journal of Human–Computer Interaction},
  volume = {0},
  number = {0},
  pages = {1-16},
  year  = {2022},
  publisher = {Taylor & Francis},
  doi = {10.1080/10447318.2022.2120845},
  URL = {https://doi.org/10.1080/10447318.2022.2120845},
  eprint = {https://doi.org/10.1080/10447318.2022.2120845}
}
```

We download the dataset from the internet. This script automatically downloads the file, checks the hashsum and extracts it.

### Contents of dataset

The dataset consists of 16 participants who performed a button press in Augmented and Virtual Reality. They repeated the button press 12 times per session. The study consisted of two sessions, each on a different day. We only focus on Virtual Reality here, where the data was recorded with a Meta Quest 2 device. The participants interacted through hand tracking, i.e., they pressed the buttons with their tracked hands and fingers and without any controller.

In [4]:
# Download, verify, and unzip the dataset (skipped if it is already extracted)

dataset_dir = "./dataset-decompressed/"
print("Testing for files and deciding whether to download and extract dataset ...")
download_and_unzip_dataset = len(glob(f'{dataset_dir}*tsv')) != 768

if not os.path.exists(dataset_dir):
    print('Creating dataset_dir', dataset_dir, 'since it does not exist.')
    os.makedirs(dataset_dir)

if download_and_unzip_dataset:
    print("Downloading dataset ...")
    import urllib.request
    urllib.request.urlretrieve("http://download-useckit-dataset.blindforreview.com", "dataset.zip")

    print("Checking hashsum ...")
    import hashlib
    with open("dataset.zip", "rb") as f:
        file_bytes = f.read()  # read entire file as bytes (avoid shadowing the built-in `bytes`)
        readable_hash = hashlib.sha256(file_bytes).hexdigest()
        print("sha256sum of dataset.zip is", readable_hash)
        assert readable_hash == "ffb857f0958ee6de80b740e0e7fc25b3ae334ed9255aa54508fc279e604df06e", \
            'hashsum did not match expectation. Please try the download again.'
    print("sha256sum check of dataset.zip is ok.")

    print("Unzipping dataset ...")
    import zipfile
    with zipfile.ZipFile("dataset.zip", "r") as dataset_zipped:
        dataset_zipped.extractall("dataset-decompressed")

    print("Finished extracting dataset to `dataset-decompressed`.")
else:
    print("Dataset already extracted to ./dataset-decompressed/*tsv. Found 768 files. Skipping extraction of dataset.")

Finished extracting dataset to `dataset-decompressed`.
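
The checksum step above reads the whole archive into memory before hashing it. As a minimal sketch, the same SHA-256 digest can also be computed chunk by chunk. The helper name `sha256_of_file`, the chunk size, and the temporary demo file are illustrative assumptions and not part of useckit or the original notebook.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it chunk by chunk."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops once read() returns an empty bytes object
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstration on a small temporary file (a stand-in for dataset.zip)
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"abc")
    demo_path = tmp.name

demo_hash = sha256_of_file(demo_path)
os.remove(demo_path)
print(demo_hash)
```

For the real archive, the returned string would be compared against the expected hashsum, exactly as the `assert` in the cell above does.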


## Step 3: Load dataset into memory

In the following cell, we load the extracted dataset into the Jupyter interpreter and thus into the computer's memory. We load each file's contents together with some metadata parsed from its file name for easier selection later on.

In [None]:
def load_tsv_sample(path: str):
    df = pd.read_csv(path, sep='\t').drop(columns="Unnamed: 0")

    # extract information from each filename
    basename_split = os.path.basename(path).split('_')
    key_scene = basename_split[0].replace('SCENE-', '')
    key_xr = basename_split[1].replace('XR-', '')
    key_participant_id = int(basename_split[2].replace('PID-', ''))
    key_repetition = int(basename_split[3].replace('REP-', ''))  # repetition number within a session (1 to 12)
    key_session = int(basename_split[4].replace('SESSION-', '').replace('.tsv', ''))  # either '1' or '2' (i.e., first or second study session)

    return {'scene': key_scene, 'xr': key_xr, 'pid': key_participant_id, 'rep': key_repetition, 'sess': key_session, 'df': df}

DATASET_LIST = [load_tsv_sample(s) for s in tqdm(sorted(glob('dataset-decompressed/SCENE-ButtonScene*XR-vr*tsv')))]
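
To make the file name convention explicit, the following sketch parses one name with a regular expression instead of `split('_')`. The example file name is hypothetical and merely follows the pattern that `load_tsv_sample()` assumes; `parse_sample_name` and `FILENAME_PATTERN` are illustrative names.

```python
import re

# Hypothetical file name that follows the pattern assumed by load_tsv_sample()
example_name = "SCENE-ButtonScene1H_XR-vr_PID-3_REP-7_SESSION-2.tsv"

FILENAME_PATTERN = re.compile(
    r"SCENE-(?P<scene>[^_]+)_XR-(?P<xr>[^_]+)_PID-(?P<pid>\d+)"
    r"_REP-(?P<rep>\d+)_SESSION-(?P<sess>\d+)\.tsv"
)


def parse_sample_name(name: str) -> dict:
    """Extract the metadata fields from a file name, or raise ValueError."""
    match = FILENAME_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"Unexpected file name: {name!r}")
    fields = match.groupdict()
    # numeric fields are converted to int, as in load_tsv_sample()
    for key in ("pid", "rep", "sess"):
        fields[key] = int(fields[key])
    return fields


parsed = parse_sample_name(example_name)
print(parsed)
```

Compared to plain string splitting, this variant fails loudly on a file name that does not match the expected pattern.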

## Step 4: Inspect data and check assumptions

In the following, we perform a short inspection of the dataset. Particularly, we create boxplots of the loaded data length and calculate descriptive statistics to better understand the interactions in the dataset.

In [None]:
# check the following assumptions:
# 1. we only load vr-data from the dataset (no ar data)
assert all([x['xr'] == 'vr' for x in DATASET_LIST])

# 2. we only load button-scene data
assert all([x['scene'] == 'ButtonScene1H' for x in DATASET_LIST])

# 3. we load data for 16 participants with 2 sessions with 12 repetitions each
assert len(DATASET_LIST) == 16*2*12
for pid in range(0, 15+1):
    for session in [1, 2]:
        for repetition in range(1, 12+1):
            assert len([d for d in DATASET_LIST if d['pid'] == pid and d['sess'] == session and d['rep'] == repetition]) == 1, \
                   f'Did not find pid {pid}, session {session}, repetition {repetition} in DATASET_LIST.'

# 4. we plot the length of each sample in a boxplot
from useckit.util.utils import analyze_samplelength

analyze_samplelength(DATASET_LIST)
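
The internals of `analyze_samplelength` are not shown in this notebook. As a rough sketch of the kind of descriptive statistics involved, the following computes summary values over toy sample lengths with numpy. The toy lengths are made up, the keys `mean` and `sd` mirror those used later in this notebook, and the use of the sample standard deviation (`ddof=1`) is an assumption rather than useckit's confirmed behavior.

```python
import numpy as np

# Toy stand-in for DATASET_LIST: only the number of rows per sample matters here
toy_lengths = np.array([150, 160, 170, 180, 190])

toy_stats = {
    "min": int(toy_lengths.min()),
    "max": int(toy_lengths.max()),
    "mean": float(toy_lengths.mean()),
    # sample standard deviation (ddof=1); an assumption, see the text above
    "sd": float(toy_lengths.std(ddof=1)),
}
print(toy_stats)
```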

## Step 5: Preprocessing

In the following, we preprocess the data.

The preprocessing consists of the following steps:
a. Coordinate transformation
b. Interaction extraction
c. Right-hand data extraction
d. Handling of tracking loss (NaN replacement)
e. Min-max normalization
f. Padding

### Step 5a: Coordinate Transformation

We perform a coordinate transformation to focus on the human behavior in the data. For this, we subtract the very first value of each position column from all values in that column. This makes the data invariant to its starting position.

For example, body height (denoted by the y-axis of the HMD) is a very strong biometric feature, and humans can be easily distinguished by it. However, it is a physiological biometric feature rather than a behavioral one. By subtracting the first value from all values in the column, we remove the absolute height and keep only the changes in height over time. The same applies to all other position columns.

In [None]:
for d in tqdm(DATASET_LIST):
    for c in d['df'].columns:
        if 'position_' in c:
            d['df'][c] -= d['df'][c].iloc[0]
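
The effect of this transformation can be seen on a toy sample. The column names below are hypothetical; only the `position_` substring matters for the selection, as in the loop above.

```python
import pandas as pd

# Toy sample with hypothetical column names
toy = pd.DataFrame({
    "hmd.position_y": [1.70, 1.72, 1.69],
    "hmd.rotation_y": [0.10, 0.20, 0.30],
})

for col in toy.columns:
    if "position_" in col:
        # subtract the first value so that the column starts at zero
        toy[col] = toy[col] - toy[col].iloc[0]

print(toy)
```

The position column now starts at zero and only encodes the movement relative to the first frame, while the rotation column is left untouched.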

### Step 5b: Extract Interactions

Each interaction has a different length, as seen by each file having a different number of lines. To process these files with deep learning, we need to unify the shape and thus the lengths of the various data files. First, we extract the actual human interaction from each file by searching the `Unity.Event` column for the markers `OnButtonTouchBegin()` and `OnButtonTouchEnd()`. We keep all values between these two markers plus a margin of 1 second before and after, which the code below implements as an offset of 72 samples.

In [None]:
def interaction_extractor_button_scene(df: pd.DataFrame) -> tuple:
    """Find when the button in the button scene was touched in df's `Unity.Event` column and when it was let go. Return indexes of this interaction."""
    # replace NaNs in Unity.Event column with emptystr
    df['Unity.Event'] = df['Unity.Event'].fillna(value='')

    # define start searchstring and determine its index
    start_searchstr = 'OnButtonTouchBegin()'
    startidx = df.loc[df['Unity.Event'].str.contains(start_searchstr, regex=False)].index[0]

    # define end searchstring and determine its index
    end_searchstr = 'OnButtonTouchEnd()'
    endidx = df.loc[df['Unity.Event'].str.contains(end_searchstr, regex=False)].index[0]

    return startidx, endidx

# create a positive and negative offset
start_offset = -72
stop_offset = abs(start_offset)

for d in DATASET_LIST:
    start_idx, stop_idx = interaction_extractor_button_scene(d['df'])
    start = start_idx + start_offset
    stop = stop_idx + stop_offset
    if start < 0:
        start = 0
    assert 0 <= start < stop

    d['df'] = d['df'].iloc[start:stop]

    if len(d['df']) < 10:
        print(f'Warning: sample {d} has a length of {len(d["df"])} with start_idx {start_idx}, stop_idx {stop_idx}')

    del start_idx, stop_idx, start, stop
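
The windowing logic can be traced on a small toy recording. This is a sketch under simplified assumptions: the recording has only 10 frames, the margin is 3 frames instead of 72, and the helper `first_index_of` is an illustrative name.

```python
import pandas as pd

# Toy recording with 10 frames; the event strings follow the notebook's markers
events = [""] * 10
events[4] = "OnButtonTouchBegin()"
events[6] = "OnButtonTouchEnd()"
toy = pd.DataFrame({"Unity.Event": events, "value": range(10)})


def first_index_of(df: pd.DataFrame, marker: str) -> int:
    """Return the index label of the first row whose event contains `marker`."""
    mask = df["Unity.Event"].str.contains(marker, regex=False).to_numpy()
    return int(df.index[mask][0])


margin = 3  # toy margin in frames; the notebook uses 72
begin = first_index_of(toy, "OnButtonTouchBegin()")
end = first_index_of(toy, "OnButtonTouchEnd()")
start = max(begin - margin, 0)  # clamp at the start of the recording
stop = end + margin             # iloc slicing tolerates stop > len(df)
window = toy.iloc[start:stop]
print(start, stop, len(window))
```

Because `iloc` slicing excludes the stop position, the window covers the frames from `start` up to `stop - 1`.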

We check the extracted interactions' lengths by creating a boxplot:

In [None]:
descriptive_stats = analyze_samplelength(DATASET_LIST)
descriptive_stats

### Step 5c: Extract Right Hand Interaction Only

As all participants in the original study were right-handed, we focus only on the right hand's position (Euclidean coordinates) and orientation (quaternions) and remove all other columns.

In [None]:
# determine columns that actually belong to the user's hands
# we only select the right hand here as all participants were right-handed and thus interacted with the right hand with the button
hand_columns = [c for c in DATASET_LIST[0]['df'].columns if 'R_' in c and ('.position' in c or '.rotation.quaternion' in c)]
#print('\n'.join(hand_columns))

# apply these columns, i.e., remove all data that does not have a connection to user's hands
for d in tqdm(DATASET_LIST):
    d['df'] = d['df'][hand_columns]
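
To illustrate which columns pass this filter, the following sketch applies the same condition to a handful of hypothetical column names. The real column names in the dataset may differ.

```python
# Hypothetical column names; the real names in the dataset may differ
columns = [
    "R_IndexTip.position_x",
    "R_IndexTip.rotation.quaternion_w",
    "R_IndexTip.rotation.euler_x",
    "L_IndexTip.position_x",
    "hmd.position_y",
]

# same condition as in the cell above: right hand, position or quaternion only
selected = [
    c for c in columns
    if "R_" in c and (".position" in c or ".rotation.quaternion" in c)
]
print(selected)
```

Only the right-hand position and quaternion columns remain. Left-hand, headset, and Euler rotation columns are dropped.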

### Step 5d: Handle Tracking Loss

If the hand tracking of the Meta Quest 2 fails, no values are recorded, and the missing entries are read as not a number (NaN). We simply replace those values with `0.0`.

In [None]:
# fillna
for d in tqdm(DATASET_LIST):
    d['df'] = d['df'].fillna(value=0.0)

### Step 5e: Apply MinMax normalization

We apply min-max normalization to scale all values of each sample into the interval $[0, 1]$. Note that the scaler is fitted separately for every sample and column.

In [None]:
import sklearn.preprocessing

for d in tqdm(DATASET_LIST):
    d['df'] = sklearn.preprocessing.MinMaxScaler().fit_transform(d['df'])
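
With its default `feature_range=(0, 1)`, `MinMaxScaler` applies the column-wise formula $(x - \min) / (\max - \min)$. The following numpy sketch reproduces that formula on a toy sample. The guard for constant columns is a choice made for this sketch.

```python
import numpy as np

# Toy sample: 3 time steps, 2 features
x = np.array([[0.0, 10.0],
              [5.0, 20.0],
              [10.0, 40.0]])

col_min = x.min(axis=0)
col_max = x.max(axis=0)
# guard against constant columns, where max == min would divide by zero
span = np.where(col_max > col_min, col_max - col_min, 1.0)
x_scaled = (x - col_min) / span
print(x_scaled)
```

Each column is scaled independently, so the smallest value of every column becomes 0 and the largest becomes 1.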

### Step 5f: Apply Padding

Finally, we apply pre-padding to unify the shape of the data, since each sample still has a different length at this point. We choose a `maxlen` of 180, which corresponds to the mean plus one standard deviation of the sample lengths computed above. Samples shorter than `maxlen` are padded with zeros at the beginning, and longer samples are truncated at the beginning. This step also exports the data into numpy arrays for faster processing.


In [None]:
maxlen = int(round(descriptive_stats["mean"] + descriptive_stats["sd"]))  # pad_sequences expects an integer length

In [None]:
from tensorflow.keras.utils import pad_sequences  # https://www.tensorflow.org/api_docs/python/tf/keras/utils/pad_sequences
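# pad at the front and truncate at the front so that all samples share the length `maxlen`
DATA: np.ndarray = pad_sequences(sequences=[d['df'] for d in DATASET_LIST],
                                 maxlen=maxlen,
                                 dtype='float16',
                                 padding='pre',
                                 truncating='pre',
                                 value=0.0)
LABELS: np.ndarray = np.array([d['pid'] for d in DATASET_LIST])    # participant id per sample
SESSION: np.ndarray = np.array([d['sess'] for d in DATASET_LIST])  # study session per sample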
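
To illustrate what `padding='pre'` and `truncating='pre'` do to a single sample, the following numpy sketch mimics this behavior for one `(timesteps, features)` array. The helper `pad_pre` is an illustrative name and not the Keras implementation.

```python
import numpy as np


def pad_pre(sample: np.ndarray, maxlen: int, value: float = 0.0) -> np.ndarray:
    """Pad or truncate a (timesteps, features) array at its beginning."""
    if len(sample) >= maxlen:
        # 'pre' truncation: drop the oldest rows, keep the last `maxlen` rows
        return sample[-maxlen:]
    padding = np.full((maxlen - len(sample), sample.shape[1]), value, dtype=sample.dtype)
    # 'pre' padding: filler rows go in front of the data
    return np.vstack([padding, sample])


short_sample = np.array([[1.0, 1.0], [2.0, 2.0]])
long_sample = np.arange(10, dtype=float).reshape(5, 2)

padded = pad_pre(short_sample, maxlen=3)
truncated = pad_pre(long_sample, maxlen=3)
print(padded)
print(truncated)
```

In both cases the end of the recording is preserved, and only the beginning is filled up or cut off.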

## Step 6: Creating a useckit dataset

Here, we create a useckit `Dataset` object. It lets us specify which data goes into which set. In particular, we use Session 1 for training. Next, we also use it for testing, as the enrollment dataset. To determine the final performance metrics, we put the second session into the matching dataset.

Since we do not specify a separate validation set, the enrollment data is also used for validating the trained models. This is not an issue, since we perform a hold-out validation anyway: the second session is reserved for testing and remains unseen during training.

In [None]:
import useckit

In [None]:
DATASET = useckit.Dataset(trainset_data=DATA[SESSION == 1],
                          trainset_labels=LABELS[SESSION == 1],
                          testset_enrollment_data=DATA[SESSION == 1],
                          testset_enrollment_labels=LABELS[SESSION == 1],
                          testset_matching_data=DATA[SESSION == 2],
                          testset_matching_labels=LABELS[SESSION == 2],)
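
The session-based split relies on numpy boolean masks. The following sketch shows the mechanism on toy arrays; the array contents and the names `train_x` and `match_x` are made up for illustration.

```python
import numpy as np

# Toy arrays: 4 samples, 2 time steps, 1 feature
data = np.arange(8, dtype=float).reshape(4, 2, 1)
labels = np.array([0, 1, 0, 1])    # participant ids
session = np.array([1, 1, 2, 2])   # study session of each sample

# a boolean mask selects whole samples along the first axis
train_x, train_y = data[session == 1], labels[session == 1]
match_x, match_y = data[session == 2], labels[session == 2]
print(train_x.shape, match_x.shape)
```

Because the mask is applied identically to the data and the labels, samples and labels stay aligned in each subset.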

In [None]:
DATA[SESSION == 1][0]

Here, we set some essential useckit parameters: we disable verbose output and train for 10 epochs.

In [None]:
verbose=False
epochs=10

## Step 7: Training and Evaluation with useckit

In the following, we fit the deep learning models of useckit's paradigms to our data. Calling the evaluation methods generates our final metrics. Useckit supports time-series classification (TSC, cf. Step 7a), distance metric learning (cf. Step 7b), and anomaly detection (cf. Step 7c).

### Step 7a: Training the useckit TSC Evaluator

The TSC Evaluator performs closed-set identification using time-series classification techniques.

In [None]:
from useckit.Evaluators import TSCEvaluator

tsc = TSCEvaluator(DATASET, epochs=epochs, verbose=verbose)
tsc.evaluate()
print("Evaluating TSC finished. Check your disk for results.")

### Step 7b: Distance Learning

Next, we train and evaluate the distance learning functions.

In [None]:
from useckit.Evaluators import DistanceLearningEvaluator

dle = DistanceLearningEvaluator(DATASET, epochs=epochs, verbose=verbose)
dle.evaluate()

### Step 7c: Anomaly Detection

Finally, we train and evaluate the autoencoder-based anomaly detectors.

In [None]:
from useckit.Evaluators import AnomalyDetectionEvaluator

ade = AnomalyDetectionEvaluator(DATASET, epochs=epochs, verbose=verbose)
ade.evaluate()

In [None]:
print("Execution finished.")