## <center>TypeLikeU: Keystroke Dynamics Prediction using Deep Neural Networks</center>

<center> COMP576 Final Project Proposal, Fall 2022 </center>

<center>Author: Xingya Wang</center>

## 1. Abstract (TO-DO)

The goal of this project is to develop a deep-learning-based model, `TypeLikeU`, for predicting users' keystroke dynamics. Since the author could find no prior work on this problem in the keystroke dynamics literature, the emphasis of this project is on constructing new features, experimenting with the variables, and achieving a baseline understanding of what works and what does not. In particular, we synthesized some feature engineering methods from the Keystroke Dynamics Authentication (KDA) literature and proposed some new feature engineering methods for encoding keystroke dynamics while preserving the sequential nature of keystrokes as well as the positional information from keyboard layouts. Finally, the results from this project can serve as a baseline for future work on this topic.

## 2. Background and Motivation

Keystroke dynamics refers to the detailed timing information of key presses and releases associated with an individual's typing on a keyboard. Such information can be characteristic enough to be used as biometric data for identifying individuals: dating back to the late 19th century, telegram operators could be identified by their unique tapping rhythm [[1](https://en.wikipedia.org/wiki/Keystroke_dynamics#cite_note-7)]; then, starting from the early 2000s, with the rise of computer usage and global network access, keystroke dynamics has been studied in the context of cybersecurity as a potentially more efficient and reliable method for authenticating users [[1](https://en.wikipedia.org/wiki/Keystroke_dynamics#cite_note-7)].

Traditional authentication methods rely on constructed credentials such as passwords and QR codes, which have no deterministic connection to individual users. In pursuit of a higher security level, authentication methods based on biometric data such as fingerprints, facial recognition, and iris scans were invented. However, the application of these biometric authentication systems is currently limited to specialized tasks and environments such as airports and police agencies, because they require specialized hardware. Keystroke-dynamics-based authentication, on the other hand, requires no special hardware and provides a non-intrusive means of authentication; moreover, depending on the implementation of such a system, keystroke dynamics authentication can be a continuous process, as opposed to the traditional one-time authentication style [[2](https://aip.scitation.org/doi/pdf/10.1063/1.5133925?cookieSet=1)].

Keystroke dynamics authentication (KDA) is intrinsically a __classification problem__: given a sequence of character entries and the corresponding keystroke dynamics, a model aims to match a user in the database or to determine that the data was entered by an imposter. Extensive work has been done within the scope of this classification problem to design and compare different models, provide scalable solutions, and incorporate unseen users. However, another problem naturally related to KDA in the context of cybersecurity has received much less attention: the replicability of individuals' keystroke dynamics -- knowing a certain amount of a user's historical keystroke dynamics data, can _an imposter_ generate similar keystroke patterns to fool the authentication system? More explicitly, this is the __regression problem__ related to keystroke dynamics.

In the context of KDA, the regression problem can serve to test the security level of keystroke-dynamics-based authentication systems (the classification models), or even serve as the generator in a GAN [[X](https://en.wikipedia.org/wiki/Generative_adversarial_network)].
There have been experiments that demonstrate the vulnerability of keystroke-dynamics-based authentication schemes [[3](https://dl-acm-org.ezproxy.rice.edu/doi/10.1145/3372420), [4](https://dl-acm-org.ezproxy.rice.edu/doi/pdf/10.1145/2516960)]; however, the mimicry attacks are generally carried out by human imposters (untrained, or trained, e.g., via intentional practice or with assistive software) [[5](https://flyer.sis.smu.edu.sg/ndss13-tey.pdf), [6](https://ieeexplore.ieee.org/document/8487341), [3](https://dl-acm-org.ezproxy.rice.edu/doi/pdf/10.1145/3372420)], or are launched through algorithmic programs designed around traits studied and extracted through statistical analysis by experts [[4](https://dl-acm-org.ezproxy.rice.edu/doi/10.1145/2516960)]. These mimicry attacks against existing keystroke-dynamics-based authentication schemes are not the most efficient and often require a substantial amount of human intervention; on the other hand, since these authentication schemes are often machine learning or deep learning models, why not have another machine learning or deep learning model carry out the mimicry attacks?

In this project, we want to study the regression problem related to keystroke dynamics by designing and training deep neural networks to predict the users' keystroke dynamics. 

## 3. Challenges and Related Work

Given that the classification problem related to keystroke dynamics has been studied extensively for an extended period of time, it is quite surprising that there has been no prior attempt in the existing literature (that the author can find) to design and train a deep neural network that solves the regression problem. Our final project is thus extremely __experimental__ and requires a substantial amount of effort to be spent on __trying out feasible approaches__. That being said, from feature engineering to model design, the literature on the classification problem will serve as our main inspiration.

### 3.1. Keystroke Features

Feature engineering is a crucial step in considering keystroke dynamics for several reasons. __(1)additional features?__ Generally, the only data that can be collected about the user's typing behavior are the (JavaScript) keycodes and the corresponding timestamps associated with key presses and releases, from which only limited information, such as the latencies (i.e. timestamp differences) between pairs of consecutive keystroke events, can be obtained. __(2)features on different scales?__ On one hand, the keycodes are categorical in nature, with around 116 unique keycodes (QWERTY keyboard with keypad), and thus probably need to be one-hot encoded, whereas the latencies (in milliseconds) generally take integer values in a large range $\approx[0, 10^5]$. __(3)discarding outliers?__ On the other hand, the latencies are very unstable, with large standard deviations and extreme outliers; this is even more so with _transition latencies_ involving timestamps from different keys. __(4)insufficient data per user?__ Finally, there is an intrinsic paradox in the "necessary" number of training samples per user: while deep learning models need more keystroke data per user to learn and distinguish different users, it is not realistic from an application standpoint to require a large amount of keystroke data from a single user. So, our only hope is for the model to learn enough "common" typing patterns from the data of a large pool of users, and to be able to use little data to distinguish an individual user's keystroke pattern and adapt to it.

#### 3.1.1. Choice of Unit Token

There have been many attempts to construct keystroke features from this limited information. Most commonly, the unit token in keystroke data is either a single keystroke (uni-graph) or a pair of consecutive keystrokes (di-graph), and the corresponding timestamps are processed into latencies: differences between the keypress and release timestamps. For a pair of keystrokes, there are six latencies one can compute; however, three of the latencies together uniquely determine the rest [[Alsultan](https://www.researchgate.net/publication/313742321_Keystroke_dynamics_authentication_A_survey_of_free-text), [15](https://arxiv.org/pdf/2101.05570.pdf), [13](https://www.sciencedirect.com/science/article/pii/S0167404820301334)].
<center><img src="../img/latencies.png" alt="Latencies Illustration" width="600"/></center>
We will refer to the four latencies (IL, PL, RL, TL) that involve more than one key as transition latencies.
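
To make the dependence among the six latencies concrete, here is a short derivation. The definitions used are an assumption based on common usage in the keystroke literature and may differ in naming from the figure above: for two consecutive keys with press times $p_1, p_2$ and release times $r_1, r_2$, take the two hold latencies and the press latency as the three independent quantities. The remaining three then follow by simple substitution:

$$
\begin{aligned}
\mathrm{HL}_1 &= r_1 - p_1, \qquad \mathrm{HL}_2 = r_2 - p_2, \qquad \mathrm{PL} = p_2 - p_1, \\
\mathrm{IL} &= p_2 - r_1 = \mathrm{PL} - \mathrm{HL}_1, \\
\mathrm{RL} &= r_2 - r_1 = \mathrm{PL} + \mathrm{HL}_2 - \mathrm{HL}_1, \\
\mathrm{TL} &= r_2 - p_1 = \mathrm{PL} + \mathrm{HL}_2.
\end{aligned}
$$

Under these assumed definitions, any model that predicts $\mathrm{HL}_1$, $\mathrm{HL}_2$, and $\mathrm{PL}$ implicitly predicts the other three latencies as well.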

#### 3.1.2. Input Format: Sequential or Image-like

Traditionally, keystroke data are treated as time series and organized into a sequential format where each time step corresponds to a unit token represented by its feature vector. For example, with the unit token being a _uni-graph_, a common feature vector resembles the following:
<center>
    <img src="../img/unigraph.png" alt="Unigraph with ALL latencies (HL, PL, IL, RL)" width="350" title="Unigraph with ALL latencies (HL, PL, IL, RL)"/>
</center>

or, with the unit token being a _di-graph_, a common feature vector resembles the following:

<center>
     <img src="../img/digraph.png" alt="Digraph with ALL latencies (HL, PL, IL, RL)" width="410" title="Digraph with ALL latencies (HL, PL, IL, RL)"/>
</center>

As briefly mentioned earlier, a sequential input vector has some intrinsic downsides: on one hand, the JavaScript keycodes need to be one-hot encoded, which makes the input vector very sparse; on the other hand, the non-zero entries of a feature vector include the latencies, which take integer values in a range much larger than $[0,1]$.
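
The two downsides can be seen in a small sketch of a uni-graph feature vector. Everything named here is an assumption made for illustration: the five-key vocabulary `KEYCODES`, the helper `encode_unigraph`, and the choice of `log1p` to compress the millisecond range (which only makes sense for non-negative latencies). It is not the encoding used in the project.

```python
import numpy as np

# Hypothetical toy vocabulary of JavaScript keycodes; the real data has roughly 116.
KEYCODES = [16, 32, 65, 66, 72]  # Shift, Space, A, B, H
CODE_TO_SLOT = {code: slot for slot, code in enumerate(KEYCODES)}


def encode_unigraph(keycode, latencies_ms):
    """Return [one-hot keycode | log-scaled latencies] as one float vector."""
    one_hot = np.zeros(len(KEYCODES), dtype=np.float32)
    one_hot[CODE_TO_SLOT[keycode]] = 1.0
    # log1p is only one possible scaling and assumes non-negative latencies.
    scaled = np.log1p(np.asarray(latencies_ms, dtype=np.float32))
    return np.concatenate([one_hot, scaled])


vec = encode_unigraph(65, [95, 180, 85, 170])  # key "A" with four latencies in ms
```

With the full keycode vocabulary, the one-hot part would have around 116 entries of which exactly one is non-zero, while the latency part would remain only a handful of entries.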

With the popularity gained by convolutional neural networks in recent years, there have been attempts to use CNN models to tackle the keystroke dynamics problem. The most immediate adaptation for this purpose is to take sub-dataframes of a fixed time-window length as "images" and pass them directly into a CNN model. Beyond that, many other methods have been proposed to encode the features into more "image"-like features. For example, in [[14](https://arxiv.org/pdf/2107.07009.pdf)], the authors proposed a new feature engineering method, called the keystroke dynamics image (KDI), to encode keystrokes (in a fixed window) into "an image": each "color channel" of the image corresponds to a single feature, and the $(i,j)$-entry of each matrix encodes the feature value corresponding to the (ordered) pair of keystrokes with (encoded) keycodes $(i,j)$.

<center> 
    <img src="../img/KDI.png" alt="Keystroke Dynamics Image (Keycode Modified)" width="350"/>
</center>
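
As a rough illustration of the idea, the following sketch fills one "color channel" from a window of keycodes. It is not the implementation from [14]: the helper name `build_kdi_channel`, the four-key mapping `key_to_slot`, and the choice to average repeated key pairs are all assumptions made here.

```python
import numpy as np


def build_kdi_channel(codes, values, key_to_slot):
    """Average values[t] over every digraph (codes[t], codes[t + 1]) into a matrix."""
    n = len(key_to_slot)
    total = np.zeros((n, n), dtype=np.float64)
    count = np.zeros((n, n), dtype=np.int64)
    for first, second, value in zip(codes[:-1], codes[1:], values):
        if first in key_to_slot and second in key_to_slot:
            i, j = key_to_slot[first], key_to_slot[second]
            total[i, j] += value
            count[i, j] += 1
    # Cells whose key pair never occurs in the window stay at 0.
    return np.divide(total, count, out=np.zeros_like(total), where=count > 0)


key_to_slot = {72: 0, 69: 1, 76: 2, 79: 3}     # H, E, L, O (hypothetical mapping)
codes = [72, 69, 76, 76, 79, 72, 69]           # "hello" followed by "he"
press_latency = [120, 90, 150, 110, 200, 100]  # one made-up value per digraph
channel = build_kdi_channel(codes, press_latency, key_to_slot)
```

Note that the two occurrences of the pair (H, E) collapse into a single cell, so the order in which they were typed is no longer recoverable from the matrix.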

The authors compared this KDI input format against the traditional sequential input format (each row represents a keystroke) using two different models: a CNN model and a CNN-RNN model. Even though the KDI input format outperforms the traditional sequential features in their experiment results, a comparison against the sequential input format trained on a purely RNN model is missing, so it is unclear whether KDI is indeed a better feature encoding method, or whether it is only better than taking sub-dataframes of a fixed time-window length as "images". In [[Piugie](https://hal.archives-ouvertes.fr/hal-03716818/document)], the authors proposed another "image"-like feature engineering method that converts the latency feature vector into a matrix using the _squarenorm()_ function in _Matlab_, which essentially computes the pairwise distances of the latencies; the matrix is then transformed into a 3D tensor using the _imagesc()_ function in _Matlab_. The authors then use multiple state-of-the-art pretrained convnet models to process these feature images, among which GoogleNet seems to perform the best in terms of Equal Error Rate (EER).


These image-like feature engineering methods, however, have a common critical downside compared to the more traditional sequential input formats: they forget the intrinsic sequential ordering of the timestamps (beyond a length of 2 in di-graphs) within the fixed-size window, by either bagging them into a single matrix or separating them into different matrices. This missing piece of information might not be critical in the classification task, but it is intuitively more likely to matter in keystroke timestamp prediction (the regression task), as users tend to make more mistakes later in a continuous typing session.

#### 3.1.3. Additional Features

In addition to keycodes and timestamp latencies, another implicit feature associated with keystroke dynamics that might be useful for training is the position of keys on the keyboard (note: we limit ourselves to the QWERTY keyboard in this project). For example, in [[Singh](https://ieeexplore.ieee.org/abstract/document/5990168)] a keyboard grouping technique was introduced to classify keycodes based on their locations on the keyboard, which was divided into eight sections: left and right halves, with each half divided into four lines representing the rows of the keyboard. In [[Alsultan2](https://ieeexplore.ieee.org/document/6722548)], by considering key pairs on the keyboard within a certain finger distance, the authors incorporated the keyboard layout into their proposed keystroke dynamics authentication system to calculate the similarity between the typed timing features and the user's profile data using the Euclidean distance.

### 3.2. Timing Latencies and Outliers

Keystroke dynamics datasets, especially those created for the purpose of training free-text keystroke systems, generally contain a lot of noise caused by extreme hold latencies or transition latencies involving more than two keys. Preprocessing the noise present in these keystroke dynamics datasets could be crucial, as otherwise the outliers might impair the ability of any model to learn. In [[Alsultan](https://www.researchgate.net/publication/313742321_Keystroke_dynamics_authentication_A_survey_of_free-text)], the authors surveyed past choices made for removing noise from keystroke data: some researchers added to the user pool only those keystrokes whose hold latencies and transition latencies are below a predefined standard deviation [16, 17]; other researchers, as in [18], chose to fix a hard minimum and maximum to limit the range of values that the timing latencies can take.

The outliers and the large range of values taken by the keystroke timing latencies can be caused by "bad" data: a pause between one pressed key and the next could simply mean that the typist got distracted, and in this case, the extreme data point should not be counted as one of the typist's keystroke traits for the model to learn; however, there are also extreme data points which reflect the usual behavior of certain users and thus should be picked up by the model and used as a trait to distinguish those users from others. Thus, one needs to be careful when discarding outliers, as there might be "true" outliers that the model is supposed to learn. For example, consider the following box plot of the distributions of keystroke Press Latencies of six users:

<center>
    <img src="../img/outliers-samples.png" alt="Outlier Samples" width="850" title="Outlier Samples"/>
</center>

Notice that PL $\approx 1200$ seems to be an outlier for User 87; however, it is a completely normal data point for User 83. Similarly, PL $\approx 2800$ seems to be an outlier for User 35, but it is a relatively normal data point for User 83. Moreover, some of the points outside of 4 times the Interquartile Range (IQR) can even be crucial information for distinguishing users: for example, User 83 and User 144 have a relatively small number of points outside of $4\times$IQR, whereas the other four users all have a relatively large number of points outside of $4\times$IQR. Then, consider User 102, who has three data points that are clearly outliers: this could be where the user was distracted and temporarily zoned out of their continuous typing session. Extreme outliers of this type can negatively influence the model training process. Finally, a common pattern that we see here is that for each user, there are generally no more than 5-10 extreme outlier data points.

Based on these observations of individual users' keystroke latencies, we can conclude that dropping outliers on a global level is __not__ going to be very useful, as users sit in different keystroke latency ranges; outlier filters must be applied at the user level. To filter out outliers, we can use quantile ranges, standard deviation thresholds, or hard minima and maxima. However, based on our observation of the outlier samples, it might be more effective to filter out an absolute number of data points for each user: the smallest and largest 3-10 latencies.
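
A user-level filter of this kind could look like the sketch below. The helper name `trim_extremes_per_user`, the toy values, and the choice `k=1` are assumptions for illustration; the column names follow the dataset description.

```python
import pandas as pd


def trim_extremes_per_user(df, column, k, user_col="PARTICIPANT_ID"):
    """Drop each user's k smallest and k largest values of `column`."""
    # Rank within each user from both ends; method="first" breaks ties by row order.
    rank_low = df.groupby(user_col)[column].rank(method="first", ascending=True)
    rank_high = df.groupby(user_col)[column].rank(method="first", ascending=False)
    # Rows with a missing latency get a NaN rank and are dropped as well.
    return df[(rank_low > k) & (rank_high > k)]


toy = pd.DataFrame({
    "PARTICIPANT_ID": [1] * 6 + [2] * 6,
    "PL": [5, 100, 110, 120, 130, 9000, 400, 410, 420, 430, 2, 50000],
})
trimmed = trim_extremes_per_user(toy, "PL", k=1)
```

Because the ranks are computed per user, a latency of 400 ms survives for the second user even though it would look extreme next to the first user's values.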

### 3.3. Model Selection

<!--  User Embedding: TypeNet model, CNN based
Table IV of [[Piugie](https://hal.archives-ouvertes.fr/hal-03716818/document)]: some example of state-of-art pretrained CNN models performance.
-->

Suppose our task was to predict the keystroke dynamics of a __single__ user, and we had a sufficient amount of that user's keystroke data; then the keystroke regression problem would be relatively easy, and a basic RNN should be sufficient to reach good performance: the input of the model can just be the keycodes, and the output the user's keystroke timestamps. However, the reality is that our dataset consists of a large number of users, and for each of the users, the amount of keystroke data is relatively small. So, any model that performs well on the dataset will probably need enough user-specific input data to "embed" and distinguish the different users (e.g., keycodes alone are not sufficient), and will need to learn enough common patterns from the data of all users that it is able to make predictions for an individual user based on only a small amount of data. Based on this observation about our dataset, I constructed the following model architecture, with the precise problem statement on the right:
<center> 
    <img src="../img/model-architecture-statement.png" alt="Problem Statement" width="800"/>
</center>

In particular, our model will consist of a user-embedding layer, a keycode-embedding layer, from which the outputs get concatenated and fed into a common model to make predictions for the user's current keystroke time latencies.

#### 3.3.1. TypeNet-based Variations 

In [15], the authors designed and trained a Siamese network called TypeNet to perform one-shot classification of users based on their keystroke dynamics. In particular, the user embedding layer of TypeNet is an LSTM-based network with the following structure:

<center>
    <img src="../img/TypeNet-structure.png" alt="TypeNet Structure" width="180" title="TypeNet Structure"/>
</center>

The model consists of two LSTM layers of 128 units each (with the tanh activation function). Sandwiched between the LSTM layers are batch normalization and dropout at a rate of 0.5 to avoid overfitting. Additionally, each LSTM layer has a recurrent dropout rate of 0.2.

We want to use this model structure as a base model for the user-embedding layer in our model. However, since we are using Google Colab to perform our training, the addition of recurrent dropout will slow the training down substantially, because the cuDNN kernel cannot be used to speed up training in that case. Moreover, in our past experience working with RNN models, as well as in the experiments conducted in [13], GRU layers generally outperform LSTM layers. Finally, the number of units used in TypeNet is fairly high; even though in theory we should have enough training examples for such a high number of units (see __Section 4.1 The Dataset__ for a description of our dataset), due to limits on training resources and training time, we will only use a small portion of the entire dataset. So, to prevent overfitting, we will probably also need to adjust the number of units in the TypeNet model.

#### 3.3.2. CNN-based Model

In [14], along with the Keystroke Dynamics Image (KDI) feature engineering method, the authors proposed a convnet model with the following architecture:
<center>
    <img src="../img/CNN-KDI.png" alt="CNN Structure in [14]" width="700" title="CNN Structure in [14]"/>
</center>
According to the experiment results recorded in [14], the CNN model structure above outperforms the other CNN-RNN model structure in classifying users on both of the datasets they considered.

We will try to adapt (part of) this model structure to our user embedding layer and keycode embedding layer with the image-like input format. In addition, notice that the above CNN structure has a similar feel to the LeNet-5 model structure that we implemented and experimented with on the CIFAR-5 dataset in assignment-2, in the sense that in both convnet structures the rectangular boxes become thinner and longer deeper into the network; the difference between the two model structures lies in how frequently pooling layers appear among the convolutional layers. We will be creative with the number and position of pooling layers in our experiments.

#### 3.3.3. Keycode-To-Time Model: the average user problem

As we briefly mentioned at the beginning of **Section 3.3**, if our dataset consisted of only a single user, the regression task would be much simpler. To be able to inspect whether certain time latencies are more complicated for a model to predict, we eliminate the distraction of variations coming from different users by taking the average of the time latencies over all users and creating the _average user_ profile. Using this average user profile, we will be able to see if

### 3.4. Loss & Training Performance

#### 3.4.1. Choice of the Loss Function

Here is a list of common regression loss functions:
* Mean Squared Error
* Mean Absolute Error
* LogCosh
* Huber
* Mean Squared Percentage Error

Since our dataset contains many outliers, the Mean Squared Error loss is probably not a very good choice: because the differences are squared, large errors are emphasized and have a large effect on the value of the performance metric. We will experiment with different measures and consider which of the loss functions improves the training performance.
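
The sensitivity of each loss to a single extreme error can be checked numerically. The sketch below uses the standard textbook definitions of four of the listed losses; the error values and the Huber threshold `delta=100.0` are made up for illustration.

```python
import numpy as np


def mse(err):
    return float(np.mean(err ** 2))


def mae(err):
    return float(np.mean(np.abs(err)))


def huber(err, delta=100.0):
    a = np.abs(err)
    # Quadratic for small errors, linear beyond delta.
    return float(np.mean(np.where(a <= delta, 0.5 * a ** 2, delta * (a - 0.5 * delta))))


def log_cosh(err):
    a = np.abs(err)
    # log(cosh(x)) = |x| + log1p(exp(-2|x|)) - log(2), which avoids overflow for large |x|.
    return float(np.mean(a + np.log1p(np.exp(-2.0 * a)) - np.log(2.0)))


clean = np.array([10.0, -20.0, 15.0, -5.0])  # prediction errors in ms
noisy = np.append(clean, 5000.0)             # the same errors plus one extreme outlier

for loss in (mse, mae, huber, log_cosh):
    print(f"{loss.__name__:>8}: clean={loss(clean):.2f}  with outlier={loss(noisy):.2f}")
```

In this toy example a single outlier multiplies the MSE by a factor in the tens of thousands, while the MAE grows by a factor below one hundred; Huber and LogCosh behave like the MAE for large errors.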

#### 3.4.2. Interpreting the Loss

There is no baseline to compare our performance against, so the only way to assess our performance is to compare the loss with the statistics of the original values in the dataset.

* for each of the users: we calculate the following statistics and plot a histogram to understand the distribution.
    * standard deviation 
    * mean
    * min
    * max
    * median
* and we will also obtain the above statistics on the global level (for all users) and compare the loss.
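
These reference statistics can be gathered in one pass, as in the sketch below. The helper names and toy values are assumptions; in addition, `naive_mae` is an extra reference point that is not part of the list above: it is the mean absolute error of always predicting each user's own median, which puts a number on the same scale as an MAE loss.

```python
import pandas as pd

STATS = ["std", "mean", "min", "max", "median"]


def latency_reference_stats(df, column, user_col="PARTICIPANT_ID"):
    """Return (per-user statistics table, global statistics) for one latency column."""
    per_user = df.groupby(user_col)[column].agg(STATS)
    overall = df[column].agg(STATS)
    return per_user, overall


def naive_mae(df, column, user_col="PARTICIPANT_ID"):
    """MAE of a trivial predictor that always outputs the user's median latency."""
    user_median = df.groupby(user_col)[column].transform("median")
    return float((df[column] - user_median).abs().mean())


toy = pd.DataFrame({
    "PARTICIPANT_ID": [1, 1, 1, 2, 2, 2],
    "PL": [100, 120, 140, 300, 400, 500],
})
per_user, overall = latency_reference_stats(toy, "PL")
reference = naive_mae(toy, "PL")
```

Each column of `per_user` can then be plotted as a histogram, for example with `per_user["std"].hist()`, to see how the statistic is distributed across users.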

## 4. The Dataset and Preprocessing

### 4.1. The Raw Dataset

The contexts in which keystroke dynamics is studied are characterized by the following aspects:
* __fixed-text versus free-text systems__: a fixed-text system authenticates users by having them repeatedly type a fixed string of text, whereas a free-text system has no such restriction, and the data gathered from a user usually consists of keystrokes from different sentences.
* __typing device__: in the early days of keystroke dynamics, the most common typing devices were mechanical keyboards; in recent years, keystroke dynamics on touchscreen devices has gained a lot of attention as touchscreen functionality has been incorporated into more and more electronic devices.
* __keystroke features__: the most common type of dataset used in the study of keystroke dynamics consists of only characters with the associated key press and release timestamps, and perhaps some general metadata about the participants. On the other hand, along with the development of artificial intelligence, we have more tools to process and analyze other input data such as visual recordings and keystroke pressure; thus, datasets that include these features have also been constructed.

There are different datasets collected for studying different systems of keystroke dynamics authentication: the two most explored free-text datasets based on mechanical keyboard typing are the _Buffalo Keystroke Dataset_ [[7](https://ieeexplore.ieee.org/document/7823894), [Dataset](https://www.buffalo.edu/cubs/research/datasets.html#title_429116244)], and the _Clarkson II Keystroke Dataset_ [[8](https://ieeexplore.ieee.org/document/8272738), [Dataset](https://citer.clarkson.edu/research-resources/biometric-dataset-collections-2/clarkson-university-keystroke-dataset/)]. There are also datasets collected on mobile devices with touchscreen keyboards, such as [[9](https://userinterfaces.aalto.fi/typing37k/resources/Mobile_typing_study.pdf)]. More traditional fixed-text datasets also exist, such as [[10](http://www.cs.cmu.edu/~keystroke/KillourhyMaxion09.pdf), [Dataset](https://www.cs.cmu.edu/~keystroke/)].

In this project, we will use the open-source dataset proposed in the paper [[11](https://userinterfaces.aalto.fi/136Mkeystrokes/resources/chi-18-analysis.pdf)] which can be accessed online through [[Dataset](https://userinterfaces.aalto.fi/136Mkeystrokes/)]. The dataset consists of 168,000 volunteers typing over 136 million keystrokes (15 sentences per volunteer) via an online typing platform [[Platform](https://typingtest.aalto.fi/)] (here is a screenshot): 

<center> <img src="../img/136m-app-screenshot.png" alt="APP SCREENSHOT" width="400"/> </center>

and the dataset has the following characteristics:
* free-text systems: sentences are not repetitively typed by participants;
* typing devices are mechanical English keyboards (either full keyboard or laptop), and with QWERTY, AZERTY or QWERTZ layout;
* features contain only keystroke timestamps (as opposed to keypress pressure data, or visual recordings of typing actions).

More specifically, the raw dataset is a folder containing, for each participant, a corresponding `<PARTICIPANT_ID>_keystrokes.txt` file with the following information:
* `PARTICIPANT_ID`: unique ID of participant
* `TEST_SECTION_ID`: unique ID of the presented sentence
* `SENTENCE`: sentence shown to the user (each user types 15 sentences)
* `USER_INPUT`: sentence typed by the user after pressing Enter or Next button
* `KEYSTROKE_ID`: unique ID of the keypress 
* `PRESS_TIME`: timestamp of the key down event (in ms) 
* `RELEASE_TIME`: timestamp of the key release event (in ms) 
* `LETTER`: the typed letter
* `KEYCODE`: the javascript keycode of the pressed key

Aside from the keystroke timestamps data from each participant, the dataset also contains a `readme.txt` file and some meta-data about the participants collected in `metadata_participants.txt`:
* `PARTICIPANT_ID`: unique ID of participant
* `AGE`
* `GENDER`
* `HAS_TAKEN_TYPING_COURSE`: whether the participant has taken a typing course (1) or not (0)
* `COUNTRY`
* `KEYBOARD_LAYOUT`: QWERTY, AZERTY or QWERTZ layout of keyboard used
* `NATIVE_LANGUAGE`
* `FINGERS`: choice between 1-2, 3-4, 5-6, 7-8 and 9-10 fingers used for typing
* `TIME_SPENT_TYPING`: number of hours spent typing everyday
* `KEYBOARD_TYPE`: full (desktop), laptop, small physical (e.g on phone) or touch keyboard
* `ERROR_RATE(%)`: uncorrected error rate
* `AVG_WPM_15`: words per minute averaged over the 15 typed sentences
* `AVG_IKI`: average inter-key interval 
* `ECPC`: error corrections per character
* `KSPC`: keystrokes per character
* `AVG_KEYPRESS`: average keypress duration
* `ROR`: rollover ratio

To shorten training time and to work within limited training resources, we only use the users with "PARTICIPANT_ID" $\leq 7001$ in the project (note: not every integer under 7001 is a "PARTICIPANT_ID"). Moreover, for most of the development process, we only use $\leq 500$ users, since this is already equivalent to XXXXX many keystroke samples. The scope of the problem addressed in the project was largely determined once we chose this dataset: **we will consider _free-text_ systems with _QWERTY keyboards_ and investigate the keystroke dynamics based on _purely timestamps_ information.**

### 4.2. Feature Extractions

The users' keystroke data stored in individual _.txt_ files are imported and stacked into a single Pandas dataframe, and then split into train, dev, and test sets at the user level, respecting the original sequential nature of the keystroke data. For example, the first three rows of the trainset dataframe look like:
<center> <img src="../img/train-data-head.png" alt="train_data.head" width="1050"/> </center>

where we added the feature "INDEX", which is the index of the keycode in its current sentence, during the phase of importing the data.
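
The `INDEX` feature and a sequence-respecting split can be sketched as follows. This is one possible reading of the procedure, not the project's code: it assumes `INDEX` is 0-based, that each user's sentences are assigned to train, dev, and test in the order they were typed, and that the sentence counts `n_train` and `n_dev` are chosen by hand. The helper names and the `SPLIT` column are hypothetical.

```python
import pandas as pd


def add_sentence_index(df):
    """Add INDEX: the 0-based position of each keystroke within its sentence."""
    out = df.copy()
    out["INDEX"] = out.groupby(["PARTICIPANT_ID", "TEST_SECTION_ID"], sort=False).cumcount()
    return out


def assign_split(df, n_train, n_dev):
    """Label every row train/dev/test by the order of its sentence for that user."""
    # factorize numbers each user's sentences 0, 1, 2, ... in order of appearance.
    sentence_rank = df.groupby("PARTICIPANT_ID")["TEST_SECTION_ID"].transform(
        lambda s: pd.Series(pd.factorize(s)[0], index=s.index)
    )
    out = df.copy()
    out["SPLIT"] = "test"
    out.loc[sentence_rank < n_train + n_dev, "SPLIT"] = "dev"
    out.loc[sentence_rank < n_train, "SPLIT"] = "train"
    return out


# Toy data: two users, five sentences each, two keystrokes per sentence.
rows = [(u, 100 * u + s, k) for u in (1, 2) for s in range(5) for k in ("a", "b")]
toy = pd.DataFrame(rows, columns=["PARTICIPANT_ID", "TEST_SECTION_ID", "LETTER"])
toy = assign_split(add_sentence_index(toy), n_train=3, n_dev=1)
```

Splitting by whole sentences keeps every sentence intact within one set, so no digraph straddles a train/dev boundary.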

Then, we need to extract the timing latency features from the `PRESS_TIME` and `RELEASE_TIME` columns, as well as incorporate the keyboard layout as keycode distances -- either the keycode-pair distance, or the distance back to the home keys "F" and "J". (Note: we included a screenshot of the resulting dataframe in __Section 3.1.2 Input Format: Sequential or Image-like__.) As simple as this task may sound when summarized in the one sentence above, the feature extraction phase actually consumed the bulk of my project time, so let me briefly describe the difficulties here.

#### 4.2.1. Time Latencies Extraction

The intuitive way to extract time latencies from the `PRESS_TIME` and `RELEASE_TIME` columns is to just loop through pairs of consecutive rows and perform the subtractions. However, one needs to be careful here: 
* __we have a large number of rows, so looping through the entire dataframe every time we need to extract time latencies would take a substantial amount of time__

So we need to find a way to vectorize the process and take advantage of the ability of pandas to perform operations on large dataframes efficiently. One solution is to make use of _shifts_: for example, suppose we named the training set dataframe `train_data`, then
* to obtain the Hold Latency (`HL`): calculate `train_data['RELEASE_TIME'] - train_data['PRESS_TIME']`
* to obtain the Press Latency (`PL`): construct the shifted column `shifted_PRESS = train_data['PRESS_TIME'].shift(-1, fill_value=0)`, i.e. drop the first entry and pad the end of the column with any value, e.g. 0 (using `.shift` rather than slicing with `[1:]` keeps the rows aligned by position, since pandas aligns on index labels when subtracting); then, calculate `shifted_PRESS - train_data['PRESS_TIME']`

The other time latencies, such as `IL` and `RL`, are calculated in a similar fashion. However, there is a serious problem here:
* __the entries of the shifted columns, e.g., `shifted_PRESS`, might not be what we expected__ 

More specifically, if a character 'A' is at the end of a sentence and a character 'B' is at the beginning of the immediately following sentence, then the value computed from `shifted_PRESS` for that row is the difference between the keypress timestamps of 'A' and 'B', which can be a huge number, since distinct sentences are not necessarily typed in one continuous session and could even have been entered by two different users at completely different times!

Here is a snapshot of the statistics of validation datasets ($0.2\%$ of 400 users) __before__ this mistake is fixed:

<center> 
    <img src="../img/digraph-stats-pre-mistake.png" alt="Digraph Statistics (before mistake is fixed)" width="450"/> 
</center>

And here is a snapshot of the statistics of validation datasets ($0.2\%$ of 400 users) __after__ this mistake is fixed:

<center>
    <img src="../img/digraph-stats-post-mistake.png" alt="Digraph Statistics (after mistake is fixed)" width="400"/>
</center>
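
One way to avoid the sentence-boundary problem described above is to shift within each (user, sentence) group, so that the last keystroke of every sentence gets a missing value instead of a bogus latency. The sketch below is a minimal illustration on made-up timestamps; the helper name `add_latencies` is hypothetical, and the definitions of `RL` and `IL` used here (next release minus current release, next press minus current release) are assumptions that may differ from the figure in Section 3.1.1.

```python
import pandas as pd

GROUP = ["PARTICIPANT_ID", "TEST_SECTION_ID"]


def add_latencies(df):
    """Add HL, PL, RL, IL columns without crossing sentence or user boundaries."""
    out = df.copy()
    # Hold latency only involves a single row.
    out["HL"] = out["RELEASE_TIME"] - out["PRESS_TIME"]
    # shift(-1) inside each group: the last keystroke of a sentence gets NaN.
    next_press = out.groupby(GROUP, sort=False)["PRESS_TIME"].shift(-1)
    next_release = out.groupby(GROUP, sort=False)["RELEASE_TIME"].shift(-1)
    out["PL"] = next_press - out["PRESS_TIME"]
    out["RL"] = next_release - out["RELEASE_TIME"]
    out["IL"] = next_press - out["RELEASE_TIME"]
    return out


toy = pd.DataFrame({
    "PARTICIPANT_ID": [1, 1, 1, 1, 1],
    "TEST_SECTION_ID": [10, 10, 10, 11, 11],
    "PRESS_TIME": [0, 100, 250, 10_000_000, 10_000_120],
    "RELEASE_TIME": [80, 190, 330, 10_000_090, 10_000_200],
})
with_latencies = add_latencies(toy)

# For comparison: a global shift produces one artificial, huge latency at the boundary.
naive_pl = toy["PRESS_TIME"].shift(-1) - toy["PRESS_TIME"]
```

The rows with missing latencies can then be dropped or masked before training, depending on how the model consumes the sequence.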

> This was a mistake that I did not realize until __two days after__ the virtual poster presentation on Gathertown: I was looking into ways of discarding the outliers and decided to carefully inspect a couple of instances where the time latencies are as large as $10^8$. Once I did that, I realized that this common pattern of extremely large time latencies all came from sentence transitions.  \
The artificial outliers created by this mistake are, of course, __THE MAIN CONTRIBUTING__ reason why all our models performed relatively poorly in prior training runs: using the mean absolute error as our loss, the best training loss __was around 600__, and the best validation loss __was around 2000__. For our most current training result, refer to __Section 5. Experiment Results and Discussion__.

#### 4.2.2. Keyboard Layout Encoding

Following the key-pair distance metric suggested in [[Alsultan2](https://ieeexplore.ieee.org/document/6722548)], for a pair of keycodes, we want to compute the relative finger distance between them. To do this, the simplest way I could think of is to first encode the QWERTY keyboard key positions with the corresponding JavaScript keycodes as a Pandas dataframe, and then construct a Python dictionary where the keys are keycodes and the corresponding values are the $(i, j)$ positions of the keycodes in the Pandas dataframe. Since we could not find any keycode-keyboard encoding online, I manually constructed the keycode-keyboard encoding:
<center> <img src="../img/keycode-keyboard.png" alt="Trainset Statistics - Digraph Dataframe" width="650"/> </center>

Using this encoding, I define the Python dictionary used to look up the positions of the keycodes. Finally, the key-pair distance metric is just the Manhattan distance between the $(i,j)$ coordinates of a pair of keycodes.
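
The lookup and the distance can be sketched as follows. This is a simplified illustration that covers only the three letter rows and ignores the physical stagger between rows; the names `QWERTY_ROWS`, `KEY_POSITIONS`, and `key_distance` are hypothetical, and the grid is built from plain lists rather than the Pandas dataframe used in the project.

```python
# Simplified QWERTY letter rows; a full layout would also cover digits,
# punctuation, and modifier keys. Row stagger is ignored in this sketch.
QWERTY_ROWS = ["QWERTYUIOP", "ASDFGHJKL", "ZXCVBNM"]

# JavaScript keycodes for letter keys equal the ASCII code of the uppercase
# letter (e.g. 'A' -> 65), so ord() is enough here.
KEY_POSITIONS = {
    ord(ch): (i, j)
    for i, row in enumerate(QWERTY_ROWS)
    for j, ch in enumerate(row)
}


def key_distance(keycode_1: int, keycode_2: int) -> int:
    """Manhattan distance between the grid positions of two keycodes."""
    i1, j1 = KEY_POSITIONS[keycode_1]
    i2, j2 = KEY_POSITIONS[keycode_2]
    return abs(i1 - i2) + abs(j1 - j2)
```

On this grid, `key_distance(ord('Q'), ord('P'))` is 9 and `key_distance(ord('A'), ord('Z'))` is 1.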

#### 4.2.3. Image-like Input

The KDI feature encoding method from [[14](https://arxiv.org/pdf/2107.07009.pdf)] inspired us to consider the following new image-like input. Recall from the discussion of the KDI feature in __Section 3.1.2 Input Format: Sequential or Image-like__ that the KDI feature loses the sequential order of the keycodes when they are encoded into such a matrix. We try to address this disadvantage by adding the `INDEX` (i.e. `I1` or `I2`) as a color channel in the matrix. Moreover, instead of hard-coding the rows and columns of each matrix to correspond to specific letters as in [[14](https://arxiv.org/pdf/2107.07009.pdf)], we extract the X most frequent keycodes, where X is the side length of the matrices. Below is a visualization of the color channels of a matrix:

<center> 
    <img src="../img/KDI-group.png" alt="Keystroke Dynamics Image (Keycode Modified)" width="1000"/>
</center>

We comment on a couple of prominent features:
1. the entries that "light up" in all matrices are the same: this follows from our construction, since the $(i, j)$ entry is non-zero only if both keycodes in the keycode-pair appear in the list of the X most frequent keycodes (here X is 25). Moreover, $i$ is the frequency ranking of the first key in the keycode-pair, and $j$ is the frequency ranking of the second key.
2. the matrices `I1` and `I2` look the same: this is because color represents the value of the entry, and since a keycode-pair has consecutive indices, the values in `I1` and `I2` are also very similar. In training, we omit the `I2` feature and only use `I1`.
3. the matrix corresponding to `IL` has a much lighter background: this is because `IL` contains a relatively large number of negative latencies, caused by the _rollover_ effect in typing, where the second key is pressed before the first key is released; this phenomenon is even more prominent with fast typists [11].
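
The construction described in this section can be sketched in plain Python as follows. The function names and the dict-based data layout are hypothetical, and the rule that the last occurrence of a keycode-pair within a window overwrites earlier ones is an assumption of this sketch rather than something taken from the project code.

```python
from collections import Counter


def top_keycodes(digraphs, x):
    """Return the x most frequent keycodes, most frequent first."""
    counts = Counter()
    for row in digraphs:
        counts.update([row["K1"], row["K2"]])
    return [keycode for keycode, _ in counts.most_common(x)]


def build_image(digraphs, ranking, channels):
    """Build one x-by-x matrix per channel from a window of digraph rows.

    Entry (i, j) is filled only when both keycodes are in `ranking`;
    i and j are the frequency ranks of K1 and K2. If a keycode-pair occurs
    more than once in the window, the last occurrence wins (an assumption).
    """
    rank = {keycode: r for r, keycode in enumerate(ranking)}
    x = len(ranking)
    image = {c: [[0.0] * x for _ in range(x)] for c in channels}
    for row in digraphs:
        if row["K1"] in rank and row["K2"] in rank:
            i, j = rank[row["K1"]], rank[row["K2"]]
            for c in channels:
                image[c][i][j] = float(row[c])
    return image


window = [
    {"K1": 84, "K2": 72, "I1": 0, "HL1": 95, "IL": 40},   # T -> H
    {"K1": 72, "K2": 69, "I1": 1, "HL1": 88, "IL": -12},  # H -> E (rollover)
    {"K1": 69, "K2": 32, "I1": 2, "HL1": 70, "IL": 55},   # E -> space
]
ranking = top_keycodes(window, x=2)
image = build_image(window, ranking, channels=["I1", "HL1", "IL"])
```

Only the H -> E digraph has both keycodes among the two most frequent ones, so it is the only entry that is filled in every channel, which matches the first observation above.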


#### 4.2.4. Incorporate and Streamline ALL the Variables

There are quite a few variables we want to experiment with in this project. Below are a few of the important ones:
* sample size
* unigraph vs. digraph
* use the Average User profile or the original users
* include keycode-keyboard features or not
* whether to apply outlier filters; if yes, which filters to apply
* input format: sequential or image-like
* unit window length (=`n_steps`+1, where `n_steps` is the number of characters the model uses in each training sample as the user-embedding information)
* which time latencies should be included in the input, and which time latencies we will use as output
* for a pair of keycodes in a digraph, whether the one-hot encoded vectors should be concatenated or added

On top of these, there are the common hyperparameters such as batch size, learning rate, regularization, optimizer, and loss that also need to be experimented with. Finally, we need to test out a couple of different model structures. Developing the code to incorporate and streamline all these variables is the reason the preprocessing phase took a substantial share of the effort and time of the entire project.
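
One possible way to keep these variables in one place is a single configuration object that is passed through the preprocessing and training code. The sketch below is only an illustration: apart from `n_steps` and `avg_mode`, the field names and all default values are hypothetical and not taken from the project code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExperimentConfig:
    """One preprocessing/training configuration (field names are illustrative)."""

    n_users: int = 100
    graph_mode: str = "digraph"           # "unigraph" or "digraph"
    avg_mode: Optional[str] = None        # None (original users), "mean", or "median"
    use_keyboard_features: bool = True    # include KD / HD features
    outlier_filter: Optional[str] = None  # hypothetical filter name; None disables filtering
    input_format: str = "sequential"      # "sequential" or "image"
    n_steps: int = 30
    input_latencies: tuple = ("HL", "IL", "RL", "PL")
    output_latencies: tuple = ("HL", "PL")
    onehot_merge: str = "concat"          # "concat" or "add" for a digraph keycode pair

    @property
    def window_length(self) -> int:
        # Unit window length: n_steps history rows plus one target row.
        return self.n_steps + 1


cfg = ExperimentConfig(n_steps=10, avg_mode="median")
```

Because the dataclass is frozen, each experiment is described by one immutable value, which makes it easier to log alongside the results.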

## 5. Experiments, Results and Discussions

> **NOTE: (partial) detailed tabulated records of the results can be found under the `results` folder:**  
> * `results/experiments_tracking_details/older_exp_details.html` 
> * `results/experiments_tracking_details/stage_3_exp_details.html`

> For a detailed description of the variables/parameters we are experimenting with, and their potential values, refer to 
> * `results/experiments_tracking_details/variables_values_dict.html`

> For a quick look at the data, including __visualization of inputs__ using different feature extraction methods, refer to 
> * `notebook/data_exploration.ipynb`

Our experiments have gone through three main stages that we will describe in more detail below. From each stage to the next, the prediction accuracy of our model improved significantly. In particular, the models in our first stage had
* training loss (MSE) $\approx 1.78 \times 10^{12}$ 
    * (MAE) $\approx 2.9 \times 10^4$
* validation loss (MSE) $\approx 8.16 \times 10^{12}$ 
    * (MAE) $\approx 1.3 \times 10^5$

which, even though they are in milliseconds, are still astronomical numbers. In contrast, our most recent best performance on a CNN model has

* training loss (MAE) = 57.72
* validation loss (MAE) = 47.04
* testing loss (MAE) = 46.8

Let us see what we have gone through in the experiments to arrive at our recent model and discuss some next steps.

### 5.1.  A First Look: the Models & the Data (Stage 1)

Refer to the following notebooks in the folder `notebook/stage-1` for model structure code and training details:
* `CNN-1st-attempt.ipynb`
* `TypeNet-1st-attempt.ipynb`

In the first stage of our experiment, we tried two model architectures, one CNN-based and one TypeNet-based, proposed in [14] and [15] respectively. For the CNN model, we constructed KDI input images, where each image used the digraph input features 

<center>
    ['K1', 'K2', 'I1', 'I2', 'HL1', 'IL', 'HL2', 'KD', 'HD']
</center>

from time $T$ to time $T+N$, where $N$ is the `n_steps` window and $T$ shifts by a fixed number of steps; then, we set the output to be the vector ['HL2', 'IL'] at timestep $T+N+1$. The idea was that the keycode corresponding to the latencies at timestep $T+N+1$ is precisely the 'K2' of timestep $T+N$, which is already baked into the image, so perhaps we would not need to enter the keycode separately. 

For the first TypeNet model, we used features

<center>
    ['KEYCODE', 'HL', 'IL', 'RL', 'PL']
</center>

at each timestep $i$ from time $T$ to time $T+N$ as the __first input__ time-series vector, filtered out outliers larger than the $95\%$ quantile, and encoded the keycodes as one-hot vectors; then, we used the one-hot encoded keycode vector at time $T+N+1$ as the __second input__, and the latencies ['HL', 'IL'] at timestep $T+N+1$ as the __output__.
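
The sample construction described above can be sketched as follows. This is a minimal illustration in plain Python, assuming each row is a dict with the feature names listed above and that all rows come from a single sentence of a single user. The function names, the `stride` parameter, and the convention of `n_steps` history rows followed by one target row are assumptions of this sketch, not the project's actual implementation.

```python
def one_hot(keycode, vocabulary):
    """One-hot encode a keycode against an ordered list of known keycodes."""
    return [1 if keycode == k else 0 for k in vocabulary]


def make_samples(rows, n_steps, vocabulary, stride=1):
    """Split a keystroke sequence into (history, next_keycode, target) samples.

    history      : n_steps consecutive rows (first input)
    next_keycode : one-hot keycode of the following row (second input)
    target       : [HL, IL] of that following row (output)
    """
    samples = []
    for start in range(0, len(rows) - n_steps, stride):
        history = rows[start : start + n_steps]
        nxt = rows[start + n_steps]
        samples.append(
            (history, one_hot(nxt["KEYCODE"], vocabulary), [nxt["HL"], nxt["IL"]])
        )
    return samples


# Toy sequence "HELLO" with made-up latencies in milliseconds.
rows = [
    {"KEYCODE": 72, "HL": 90, "IL": 30, "RL": 110, "PL": 120},
    {"KEYCODE": 69, "HL": 80, "IL": 25, "RL": 100, "PL": 105},
    {"KEYCODE": 76, "HL": 75, "IL": 60, "RL": 145, "PL": 135},
    {"KEYCODE": 76, "HL": 85, "IL": -5, "RL": 95, "PL": 80},
    {"KEYCODE": 79, "HL": 100, "IL": 40, "RL": 0, "PL": 0},
]
vocabulary = [69, 72, 76, 79]
samples = make_samples(rows, n_steps=3, vocabulary=vocabulary)
```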


> Both models performed extremely poorly! 

This was especially true when we trained our first CNN model (described above) with Mean-Squared-Error as the loss function. Not only were the training and validation losses astronomical (on the order of $10^{12}$!):

<center>
    <img src="../img/cnn-1-loss-mse.png" alt="cnn-1-loss-MSE" width="850"/>
    <img src="../img/cnn-1-loss-mae.png" alt="cnn-1-loss-MAE" width="850"/>
</center>

moreover, the losses were not improving much at all and plateaued really quickly. Later, when we switched to Mean-Absolute-Error as the loss function, the loss looked more reasonable but was still extremely high ($\approx 10^4$, compared with $\approx 10^6$ for the square root of the MSE loss). And the more prominent problem still remained: the losses were not improving at all. This made us wonder whether the __inputs to the CNN model were not enough__: an input does contain the keycode of the latency currently being predicted; however, this piece of information is baked into the $(i,j)$ index of the last timestep, and even if the model has the index information, it is not a direct piece of information.
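
To see how a handful of extreme latencies can dominate these losses, consider the toy comparison below. The numbers are made up for illustration (99 ordinary samples and one outlier of $10^8$ ms) and are not taken from the dataset.

```python
def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# 99 ordinary latencies (ms) plus one artificial outlier of 10^8 ms.
y_true = [150.0] * 99 + [1e8]
# A model that predicts a typical latency everywhere.
y_pred = [150.0] * 100

print(f"MSE  = {mse(y_true, y_pred):.3e}")
print(f"RMSE = {mse(y_true, y_pred) ** 0.5:.3e}")
print(f"MAE  = {mae(y_true, y_pred):.3e}")
```

A single outlier is enough to push the MSE to astronomical values, and the square root of the MSE stays well above the MAE because squaring weights the outlier much more heavily than the ordinary samples.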

So, we added the one-hot encoded keycode vector at timestep $T+N+1$ to the CNN-based model as an additional input vector, concatenated it with the flattened output from the KDI input vector, and processed the concatenated vector using two Dense layers to generate the time latency predictions (corresponding notebook: `stage-1/cnn_2nd_attempt.ipynb`):

<center>
    <img src="../img/cnn-2-loss-mae.png" alt="cnn-2-loss-MAE" width="850"/>
</center>

Even though the losses were still quite high, they now definitely started to decrease with each epoch. The keycode vector at timestep $T+N+1$ definitely helped; however, there still seemed to be something wrong.

The TypeNet LSTM-based model performed much better out of the box than the CNN-based model: the training and validation losses, even with MSE as the loss function, were $\approx 2 \times 10^4$, and the losses were decreasing with each training epoch. However, we realized something more serious by printing a prediction vector: 

<center>
    <img src="../img/typenet-1-ytrue.png" alt="typenet-1-ytrue" width="150"/>
    <img src="../img/typenet-1-ypred.png" alt="typenet-1-ypred" width="260"/>
</center>

the predicted values were almost the __same for all samples__! It seemed that either there was something wrong with the model, or the data had so much noise that the model was learning the average in some sense.

### 5.2. The Average User Profile & Understanding Abnormalities (Stage 2)

Since the models on the original dataset performed rather poorly and took a long time to train, I went to TA office hours (Wei Ren Gan) to talk about the problems I was encountering and ask for suggestions on what to try that could potentially help. The TA suggested that I should maybe take a look at the simpler problem first -- the average user keystroke behavior, and see if I can train a model that can predict the average keystroke latencies very well.

So for each type of keystroke latency and each keycode, I extracted the average value and built the __Average User__ profile: at first, I used __mean__ values as the averages; later, however, as I thought more about how to lessen the effect of outliers in the data without discarding them outright, I added the option to use __medians__ as the averages instead of the means. 
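
A minimal sketch of building such a profile in plain Python is shown below. It assumes the rows are dicts with a `KEYCODE` field and latency fields such as `HL`; the function name and data layout are illustrative, while the `avg_mode` parameter mirrors the mean/median option used in the experiments.

```python
from collections import defaultdict
from statistics import mean, median


def average_user_profile(rows, latency_columns, avg_mode="mean"):
    """Map each keycode to its average latencies across all users.

    `avg_mode` is "mean" or "median"; the median is less sensitive to
    extreme values without discarding any rows.
    """
    reducer = {"mean": mean, "median": median}[avg_mode]
    grouped = defaultdict(lambda: defaultdict(list))
    for row in rows:
        for col in latency_columns:
            grouped[row["KEYCODE"]][col].append(row[col])
    return {
        keycode: {col: reducer(values) for col, values in cols.items()}
        for keycode, cols in grouped.items()
    }


keystrokes = [
    {"KEYCODE": 65, "HL": 90},
    {"KEYCODE": 65, "HL": 110},
    {"KEYCODE": 65, "HL": 100_000},  # an extreme value
    {"KEYCODE": 66, "HL": 80},
]
mean_profile = average_user_profile(keystrokes, ["HL"], avg_mode="mean")
median_profile = average_user_profile(keystrokes, ["HL"], avg_mode="median")
```

In this toy example the mean for keycode 65 is pulled up to 33400 by the single extreme value, while the median stays at 110.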

I thought I could try to use this "Average User" profile for two purposes:
1. train a keycode-to-time model: this was not possible with the original dataset because, with many users, any model we constructed would need some inputs (i.e. historical data from timesteps $T$ to $T+N$ of the users) to "distinguish" the users. However, since the "Average User" was constructed so that each keycode corresponds to a unique set of time latencies, we can construct a model in which the inputs are just keycodes (one-hot encoded) and the outputs are the desired time latencies.
2. train the original models on the "Average User" profile first, use the result as pretrained weights, and then train on the multi-user dataset: perhaps the training performance can be improved, or at least the training speed?

For the first purpose (corresponding notebook: `stage-2/avg_user_profile.ipynb`), we constructed an extremely simple neural network, consisting of two stacked GRU layers of 128 units each followed by a (`TimeDistributed`-wrapped) Dense layer; the inputs to the model are
<center>
    ['K1', 'K2', 'I1', 'I2', 'HL1_avg', 'IL_avg']
</center>
and the outputs are ['HL2_avg']. We use 70 users.

In the first model, we tried the MSE loss again, and the performance was rather poor (training loss $\approx 2.7 \times 10^3$ and validation loss $\approx 2.1 \times 10^4$), considering the data now consists of the "Average User" only. Then, we tried two other losses:

* MAE (10/10 epochs): training loss = 5.14, validation loss = 13.97

* Huber with $\delta=10$ (6/10 epochs):  training loss = 29.07, validation loss = 134.49

where the epochs ended early when the early-stopping callback (`patience=3`) was triggered. We fixed MAE as the loss function and tried to narrow the gap between the training and validation losses by adding more regularization to each layer; however, the performance was not as good as that of the original, simplest model. On the other hand, we tried adding more users (raised from 70 users to 400 users), and achieved the following performance:

* MAE (10/20 epochs): training loss = 0.16, validation loss = 1.5 (note: these numbers are ALL in __milliseconds__!)
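
For reference when comparing the numbers above, the Huber loss with threshold $\delta$ and the LogCosh loss used in the later experiments have the following standard forms, written in terms of the residual $r = y - \hat{y}$ (the notation here is only for illustration and is not taken from the notebooks):

$$
L_\delta(r) =
\begin{cases}
\tfrac{1}{2} r^2 & \text{if } |r| \le \delta, \\
\delta \left( |r| - \tfrac{1}{2}\delta \right) & \text{otherwise,}
\end{cases}
\qquad
L_{\text{logcosh}}(r) = \log\cosh(r) \approx
\begin{cases}
\tfrac{1}{2} r^2 & \text{for small } |r|, \\
|r| - \log 2 & \text{for large } |r|.
\end{cases}
$$

With $\delta = 10$, a residual well above 10 ms contributes roughly $10\,|r|$ to the Huber loss, so the Huber values are on a different scale from the MAE values and the two should not be compared directly. The LogCosh loss, in contrast, is close to the absolute error for large residuals.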

We then tried out the second purpose (corresponding notebook: `stage-2/preprocess_experiment_1.ipynb`): we pretrained the original TypeNet LSTM-based model on the "Average User" data, and then trained the model on the original data. The results, however, were not very exciting: with 400 users (transformed to "Average User") and digraph sequential inputs,
* in 10/10 epochs with LogCosh loss, the model trained on the "Average User" reached 
    * training loss = 758.06, validation loss = 10769.83
* in 6/10 epochs (triggered EarlyStopping) with LogCosh loss, the model initialized with the weights pretrained on the "Average User" and then trained on the original data reached best
    * training loss = 851.13, validation loss = 11218.40
* in 10/10 epochs with LogCosh loss, the model (not pretrained) trained on the original dataset reached best
    * training loss = 855.85, validation loss = 11218.38

The pretrained model did converge faster (it triggered EarlyStopping within fewer epochs); however, the training and validation losses did not seem to be better than those obtained by training from scratch. Another interesting observation is that:
> Even with the "Average User", the TypeNet based original model performed really poorly.

This is a huge warning sign that there is something wrong with the model structure itself. However, before modifying the model structure, we decided to try the "Average User" profile with `avg_mode = 'median'` instead of `avg_mode = 'mean'`, different sample sizes, etc., but none of them really helped much, except for the input format: with 100 users (transformed to "Average User") and __unigraph__ sequential inputs,
* in 13/13 epochs with LogCosh loss, the model trained on the "Average User" reached: training loss = 28.78, validation loss = 27.33

On the other hand, when we printed the predictions from this model, the problem of the model predicting the same set of numbers for all samples happened again...

So we finally decided to inspect the model structure: the original model (refer to `TypeNet-base_no-keycode_ConcatRNN-base` in the `Model_Architecture_List.ipynb` notebook) uses the LSTM-based TypeNet structure as the User Embedding layer, and concatenates the user-embedded output with the one-hot encoded keycode to __feed into a GRU layer__ and then into a Dense layer to get the output. The reason we used a GRU layer to start with was the good performance we obtained from the keycode-to-time model trained on the "Average User"; however, the GRU layer could be suppressing too much useful information and could be influenced by the sparseness of the one-hot encoded keycode vector. We decided to try replacing the GRU layer with another Dense layer, and __it worked__!!:
* in 10/10 epochs with LogCosh loss, the model trained on the "Average User" reached: training loss = 6.04, validation loss = 10.56

and the model prediction went through the following change:

<center>
    <img src="../img/GRU-to-Dense.png" alt="GRU-to-Dense" width="850"/>
</center>

Finally, we trained this new model structure on 100 users (original data): the best training and validation loss respectively reached: 94.64 and 583.21.

### 5.3. Outliers? (Stage 2+)

Now that our model performance has improved a lot and the losses seem much more reasonable, it is time to understand the gap between the training loss and the validation loss. In general, such a gap is attributed to the model overfitting the training dataset, and to resolve this, one would usually try to increase the regularization parameters or add dropout layers. In our case, however, the training loss is not low either, suggesting that we also have an underfitting problem. So, before trying to tune the hyperparameters of the model, I want to first understand the elephant in the room:
> The OUTLIERS.

If one still recalls the discussion in **Section 4.2.1. Time Latencies Extraction**, this is what we are getting into again. Around the time when I finished the work described above in Stage 2, I decided to dig into the outliers and answer the following questions:
* what do the outliers look like?
* why are they so extreme?
* is it "okay" to straight up discard them?
* what kind of filtering method should I use?

So I pulled up the data statistics for the first 400 users of the `data/keystroke-samples` folder:

<center>
    <img src="../img/400-users-stats-old-extractor.png" alt="400-users-stats-old-extractor" width="450"/>
</center>

The `max` values of all transition latencies 'IL', 'RL', 'PL' look very abnormal, so I decided to restrict to those large values and see what is going on with the users who have huge transition latencies. Looking for the user(s) whose 'PL' equals the maximum 'PL' value in the data, we obtain a single user:

<center>
    <img src="../img/max-outlier-user.png" alt="max-outlier-user" width="450"/>
</center>

Now, let us go back to the un-extracted dataframe and the corresponding keycodes to see what is going on:

<center>
    <img src="../img/max-outlier-user-identify.png" alt="max-outlier-user-identify" width="850"/>
</center>

By calculating the corresponding HL (hold latency), we identify the row index at which the extreme outlier appears in the data. Let us now zoom in to see what is going on locally around that row:

<center>
    <img src="../img/max-outlier-user-local-rows.png" alt="max-outlier-user-local-rows" width="850"/>
</center>

> The problematic row is precisely at a __sentence transition__!!

This immediately rings a bell: when extracting the transition latencies, in order to save time and make use of the efficient vectorized calculations for Pandas dataframes, we performed a __column-wise operation on shifted columns__. The column operation causes no problem for the keycodes in the interior of a sentence, but for keycodes located at the end of a sentence, we end up taking the transition latencies between the user's last keystroke of one sentence and the first keystroke of the next sentence, and there is no guarantee that any two distinct sentences were typed in one continuous session (recall the typing data collection interface mentioned in **Section 4.1. The Raw Dataset**). This column-wise operation is even more problematic when we first split the original dataset into train-dev-test sets and then perform feature extraction.

### 5.4. After The Fix (Stage 3)

Fixing this bug, we finally got into the traditional "deep learning model experiment" phase: we tried out 8 different variations of the model structure and different input formats, looked for the ideal learning rate by plotting the learning rates against the training loss, and also tried the "Average User" again to help interpret the loss. A detailed record of all the experiments that we did after the bug fix can be found in `results/experiments_tracking_details/stage_3_exp_details.html`. 

A couple of the common patterns that we observed are:
1. losses still plateau very quickly
2. model architecture has been by far the most influential factor
3. input format also has a substantial effect:
    * CNN model -- Our New KDI Inputs: recall that in Stage 1, we adapted the feature engineering method named KDI from [14] to package the input data from timesteps $T$ to $T+N$ into a single "image" (with the number of "color channels" equal to the number of features). The first CNN model with this KDI input was not ideal, so we modified the CNN structure by adding the one-hot encoded keycode vector as an additional input. However, that did not seem to help much either. 
    * On the other hand, recall that the sequential input format has two inputs (below), with the output being the two time latencies (generally 'HL' and 'PL'). 
        * `inputA` is generally the data frame containing the information from timesteps $T$ to $T+N$, and
        * `inputB` contains the `INDEX`, the `KEYCODE`, and potentially keyboard features such as `KD` (Keycode-pair Distance) and `HD` (Keycode-to-homekey Distance)
    * So, we decided to make an image version of the sequential input by packaging __BOTH__ `inputA` and `inputB` into two respective input images (instead of only `inputA` as the image, and `KEYCODE` being one-hot encoded). 

The best training and validation losses that we obtained are from the model `KDI_LeNet5_Dense-64-2` (see `notebook/Model_Architecture_List.ipynb` for details), which takes in the new KDI-biInputs described above:
* training loss (MAE) = 57.72
* validation loss (MAE) = 47.04
* testing loss (MAE) = 46.8

Note that this is with 300 users and `train_size`=0.7, `dev_size`=0.15, with
* `inputA`: ['I1', 'HD', 'KD', 'IL', 'RL', 'PL', 'HL']
* `inputB`: ['I1', 'HD', 'KD']
* `output`: ['HL2', 'PL']

### 5.5. Interpreting the Loss & Further Work

After we obtained the above results, we went back to the `Keycode_TO_Time` model (i.e. the simple neural network with two GRU layers and a Dense output layer; see `notebook/Model_Architecture_List.ipynb` for details) in order to understand the value of our current loss. This decision revealed a quite interesting phenomenon.

Recall that we were able to train the `Keycode_TO_Time` model with the "Average User" data to extremely good results: 
* `output='HL'`: training loss = 0.16, validation loss = 1.5  (Stage 2)

In fact, we also tried training the `Keycode_TO_Time` model with 
* `output='PL'`: training loss = 1.12, validation loss = 3.66
* `output=['HL', 'PL']`: training loss = 0.73, validation loss = 2.66

This is in some sense not surprising, because we have a large amount of data and there are only as many distinct outputs as there are distinct keycodes. On the other hand, what if we feed our original data into the `Keycode_TO_Time` model? Then the loss should be **measuring approximately the "randomness" in the data**, and we can compare it against the loss that we achieved on the `KDI_LeNet5_Dense-64-2` model: using
* `output='PL'`: training loss (MAE) = 151.33, validation loss = 148.38
* `output='HL'`: training loss (MAE) = 40.94, validation loss = 41.45

Averaging the MAE over the two outputs, we get
* training loss (MAE) = 96.14
* validation loss (MAE) = 94.92

which is much higher than the losses that we achieved with the `KDI_LeNet5_Dense-64-2` model. So our loss is indeed meaningful, and our model is learning the users' underlying keystroke patterns.
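
The averages above are the unweighted means of the per-output MAEs. The short check below reproduces them and, under the assumption that the two models were evaluated on comparable data, expresses the `KDI_LeNet5_Dense-64-2` losses relative to this baseline. The helper `mean_of` and the variable names are introduced here only for illustration.

```python
def mean_of(values):
    """Unweighted arithmetic mean."""
    return sum(values) / len(values)


# MAE of the Keycode_TO_Time model on the original data (PL, HL), from the text.
baseline_train = mean_of([151.33, 40.94])
baseline_val = mean_of([148.38, 41.45])

# MAE of KDI_LeNet5_Dense-64-2 on the two-dimensional output, from the text.
model_train, model_val = 57.72, 47.04

# Fraction by which the model's MAE is lower than the baseline MAE.
train_reduction = 1 - model_train / baseline_train
val_reduction = 1 - model_val / baseline_val

print(f"baseline: train={baseline_train:.3f}, val={baseline_val:.3f}")
print(f"relative reduction: train={train_reduction:.1%}, val={val_reduction:.1%}")
```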

In fact, we also tried training 'PL' and 'HL' separately using the CNN-based models; the losses are recorded below for reference:
* `output='PL'`: training loss (MAE) = 89.44, validation loss = 61.59  (`KDI-s_KDI-xs_Dense-ReLU`)
* `output='HL'`: training loss (MAE) = 20.64, validation loss = 21.04  (`KDI-s_KDI-xs_Dense-32-2`)

On the other hand, the extreme gap between the **randomness** of `'HL'` and `'PL'` is worth exploring further. On the one hand, one should acknowledge that `'PL'` contains more variation than `'HL'`, because the transition between two keys can be slowed down by random factors, such as the user trying to read the screen or blanking out for a second. On the other hand, there is still quite a bit of room to improve the loss. One idea is to include more features: recall that the dataset also comes with a _.txt_ file `metadata_participants.txt`, which stores the meta information about the users. Perhaps this can help with the model performance.

## Feasibility and Limitations

Training resources are limited: the RAM of my local machine cannot support processing more than 200 users' keystroke data into KDI image-like inputs (the kernel crashed). So, most of our work was done in the Google Colab environment with Colab Pro+ premium GPUs and High-RAM whenever available. However, even with such resources, we can process at most 400 users' keystroke data. Perhaps there is some way to optimize the data preprocessing step further; however, we have not yet had the chance to try it.

On the other hand, even with the Colab Pro+ premium GPUs and the High-RAM runtime shape, it can take a long time to train many epochs for some models, especially those that use `GRU` layers with `recurrent_dropout` set to a non-zero value (this significantly slows down training since _cuDNN kernels_ cannot be used under this setting).

Finally, we would like to experiment with different variables that could potentially influence the model performance, but it is not feasible to do this systematically if we want to complete the project in time:
* Besides the hyperparameters, we have many choices on how to preprocess our data (briefly mentioned in **Section 4.2.4. Incorporate and Streamline ALL the Variables**), so in total there are many variables to experiment with **(for an extensive list, consult the experiment records Excel sheet)**
* Changes to some of the variables require going through a substantial portion of the preprocessing stage again, and thus take up a lot of time and training resources: **for the long term, one should try to find a way to save all the preprocessed versions of the data into files and load them when needed; this should immediately free up a lot of the training resources consumed.**

So, currently we need to be highly selective about what to experiment with. There could be some other important parameter(s) that we have not touched upon but that can have a substantial effect on the model performance.

## 6. Conclusions (TO-DO)

This project is largely experimental since we cannot find any prior work with a similar objective to use as a reference. In particular, we have no expectation for baseline model performance; we are not sure whether the dataset we chose is suitable for our prediction task; and even though the objective of the project is clear, the specific input and output of the model still need to be determined after a couple of quick first experiments. So largely, we are not sure how the project is going to play out at the current stage yet; the important part is to try different approaches and keep records of what works and what does not work.

Finally, we want to mention here another soft application of the keystroke dynamics prediction task: if a model can be trained to reach a reasonable accuracy, it can then be incorporated into a typing website for assisted typing training: a model that types like oneself makes the best opponent, thus the name `TypeLikeU`. This was also the starting point that led the author to consider keystroke dynamics.



## References

[1] Wikipedia contributors. (2022, September 26). _Keystroke dynamics_. Wikipedia. https://en.wikipedia.org/wiki/Keystroke_dynamics

[2] Siti Fairuz Nurr Sadikan, Azizul Azhar Ramli, and Mohd Farhan Md. Fudzee, _A survey paper on keystroke dynamics authentication for current applications_, AIP Conference Proceedings 2173, 020010 (2019). https://doi.org/10.1063/1.5133925

[3] Hassan Khan, Urs Hengartner, and Daniel Vogel. 2020. _Mimicry Attacks on Smartphone Keystroke Authentication_. ACM Trans. Priv. Secur. 23, 1, Article 2 (February 2020), 34 pages. https://doi.org/10.1145/3372420

[4] Abdul Serwadda and Vir V. Phoha. 2013. _Examining a Large Keystroke Biometrics Dataset for Statistical-Attack Openings_. ACM Trans. Inf. Syst. Secur. 16, 2, Article 8 (September 2013), 30 pages. https://doi.org/10.1145/2516960

[5] Tey, Chee Meng, Payas Gupta, and Debin Gao. _I can be you: Questioning the use of keystroke dynamics as biometrics._ (2013): 1.

[6] Y. Sun and S. Upadhyaya, _Synthetic Forgery Attack against Continuous Keystroke Authentication Systems,_ 2018 27th International Conference on Computer Communication and Networks (ICCCN), 2018, pp. 1-7, doi: 10.1109/ICCCN.2018.8487341.

[7] Y. Sun, H. Ceker and S. Upadhyaya, _Shared keystroke dataset for continuous authentication,_ 2016 IEEE International Workshop on Information Forensics and Security (WIFS), 2016, pp. 1-6, doi: 10.1109/WIFS.2016.7823894.

[8] C. Murphy, J. Huang, D. Hou and S. Schuckers, _Shared dataset on natural human-computer interaction to support continuous authentication research,_ 2017 IEEE International Joint Conference on Biometrics (IJCB), 2017, pp. 525-530, doi: 10.1109/BTAS.2017.8272738.

[9] Palin, Kseniia, et al. _How do people type on mobile devices? Observations from a study with 37,000 volunteers._ Proceedings of the 21st International Conference on Human-Computer Interaction with Mobile Devices and Services. 2019.

[10] Killourhy, Kevin S., and Roy A. Maxion. _Comparing anomaly-detection algorithms for keystroke dynamics._ 2009 IEEE/IFIP International Conference on Dependable Systems & Networks. IEEE, 2009.

[11] Dhakal, Vivek, et al. _Observations on typing from 136 million keystrokes._ Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. 2018.

[12] Chang, Han-Chih, et al. _Machine Learning and Deep Learning for Fixed-Text Keystroke Dynamics._ Cybersecurity for Artificial Intelligence. Springer, Cham, 2022. 309-329.

[13] Lu, Xiaofeng, et al. _Continuous authentication by free-text keystroke based on CNN and RNN._ Computers & Security 96 (2020): 101861.

[14] Li, Jianwei, Han-Chih Chang, and Mark Stamp. _Free-text keystroke dynamics for user authentication._ Cybersecurity for Artificial Intelligence. Springer, Cham, 2022. 357-380.

[15] Acien, Alejandro, et al. _TypeNet: Deep learning keystroke biometrics._ IEEE Transactions on Biometrics, Behavior, and Identity Science 4.1 (2021): 57-70.

[16] P. Bours and H. Barghouthi. _Continuous Authentication Using Biometric Keystroke Dynamics._ The Norwegian Information Security Conference, 2009.

[17] Monrose, Fabian, and Aviel Rubin. _Authentication via keystroke dynamics._ Proceedings of the 4th ACM Conference on Computer and Communications Security. 1997.

[18] Dowland, P. _A preliminary investigation of user authentication using continuous keystroke analysis._ Proc IFIP Annual Working Conf on Information Security Management and Small System Security, 2001, pp. 27-28.