## <center>TypeLikeU: Keystroke Dynamics Prediction using Deep Neural Networks</center>

<center> COMP576 Final Project Proposal, Fall 2022 </center>

<center>Author: Xingya Wang</center>

## 1. Abstract

The goal of this project is to develop a deep-learning-based model, `TypeLikeU`, for predicting user keystroke dynamics. Along the way, we propose new feature engineering methods for encoding keystroke dynamics that preserve both the sequential nature of keystrokes and the positional information from keyboard layouts. Finally, we discuss possible extensions of the prediction model `TypeLikeU` that incorporate new users' data through transfer learning and a Siamese network structure.

## 2. Background and Motivation

Keystroke dynamics refers to the detailed timing information of key presses and releases associated with an individual's typing on a keyboard. Such information can be characteristic enough to serve as biometric data for identifying individuals: as early as the late 19th century, telegram operators could be identified by their unique tapping rhythm [[1](https://en.wikipedia.org/wiki/Keystroke_dynamics#cite_note-7)]; then, starting in the early 2000s, with the rise of computer usage and global network access, keystroke dynamics has been studied in the context of cybersecurity as a potentially more efficient and reliable method for authenticating users [[1](https://en.wikipedia.org/wiki/Keystroke_dynamics#cite_note-7)].

Traditional authentication methods require constructive information such as passwords and QR codes, which have no deterministic connection to individual users. In pursuit of a higher security level, authentication methods based on biometric data such as fingerprints, facial recognition, and iris scans were invented. However, the application of these biometric authentication systems is currently limited to specialized tasks and environments, such as airports and police agencies, because they require specialized hardware. Keystroke-dynamics-based authentication, on the other hand, requires no special hardware and provides a non-intrusive means of authentication; moreover, depending on the implementation of such a system, keystroke dynamics authentication can be a continuous process, as opposed to the traditional one-time authentication style [[2](https://aip.scitation.org/doi/pdf/10.1063/1.5133925?cookieSet=1)].

Keystroke dynamics authentication (KDA) is intrinsically a __classification problem__: given a sequence of character entries and the corresponding keystroke dynamics, a model aims to match a user in the database or to determine that the data was entered by an imposter. Extensive work has been done within the scope of this classification problem to design and compare different models, provide scalable solutions, and incorporate unseen users. However, another problem naturally related to KDA in the context of cybersecurity has received much less attention: the replicability of individuals' keystroke dynamics. Knowing a certain amount of a user's historical keystroke dynamics data, can _an imposter_ generate similar keystroke patterns to fool the authentication system? More explicitly, this is the __regression problem__ related to keystroke dynamics.

In the context of KDA, the regression problem can serve to test the security level of keystroke-dynamics-based authentication systems (the classification model), or even act as the generator of a GAN [[X](https://en.wikipedia.org/wiki/Generative_adversarial_network)].
Experiments have been conducted that demonstrate the vulnerability of keystroke-dynamics-based authentication schemes [[3](https://dl-acm-org.ezproxy.rice.edu/doi/10.1145/3372420), [4](https://dl-acm-org.ezproxy.rice.edu/doi/pdf/10.1145/2516960)]; however, the mimicry attacks are generally carried out by human imposters (untrained, or trained e.g. via intentional practice or assistive software) [[5](https://flyer.sis.smu.edu.sg/ndss13-tey.pdf), [6](https://ieeexplore.ieee.org/document/8487341), [3](https://dl-acm-org.ezproxy.rice.edu/doi/pdf/10.1145/3372420)], or are launched by algorithmic programs designed around traits that experts studied and extracted through statistical analysis [[4](https://dl-acm-org.ezproxy.rice.edu/doi/10.1145/2516960)]. These mimicry attacks against existing keystroke-dynamics-based authentication schemes are not the most efficient and often require a substantial amount of human intervention. On the other hand, since these authentication schemes are often machine learning or deep learning models, why not have another machine learning or deep learning model carry out the mimicry attacks?

In this project, we want to study the regression problem related to keystroke dynamics by designing and training deep neural networks to predict the users' keystroke dynamics. 

## 3. Challenges and Related Work

* A broader overview of the existing approaches that solve the problem
* Challenges and Limitations

Given that the classification problem related to keystroke dynamics has been studied extensively for a long time, it is quite surprising that the existing literature contains no prior attempt (that the author can find) to design and train a deep neural network that solves the regression problem. Our final project is thus extremely __experimental__ and requires a substantial amount of effort to be spent on __trying out feasible approaches__. That said, from feature engineering to model design, the literature on the classification problem will serve as our main inspiration.

### 3.1 Keystroke Features

Feature engineering is a crucial step in working with keystroke dynamics, for several reasons:

* __Limited raw features.__ Generally, the only data that can be collected about a user's typing behavior are the (JavaScript) keycodes and the timestamps of key presses and releases, from which only limited information, such as the latencies (i.e. timestamp differences) between pairs of consecutive keystroke events, can be derived.
* __Features on different scales.__ The keycodes are categorical by nature, with around 116 unique keycodes (QWERTY keyboard with keypad), and thus probably need to be one-hot encoded, whereas the latencies (in milliseconds) generally take integer values in a large range, $\approx[0, 10^5]$.
* __Outliers.__ The latencies are very unstable, with large standard deviations and extreme outliers; this is even more so for __transition latencies__, which involve timestamps from different keys.
* __Not enough data per user.__ Finally, there is an intrinsic paradox in the "necessary" number of training samples per user: while deep learning models need more keystroke data per user to learn and distinguish different users, it is not realistic, from an application standpoint, to require a large amount of keystroke data from a single user. So our only hope is for the model to learn enough "common" typing patterns from the data of a large pool of users, and then to use a small amount of data to distinguish an individual user's keystroke pattern and adapt to it.

#### 3.1.1 Choice of Unit Token

There have been many attempts to construct keystroke features from this limited information. Most commonly, the unit token in keystroke data is either a single keystroke (uni-graph) or a pair of consecutive keystrokes (di-graph), and the corresponding timestamps are processed into latencies: differences between the key press and release timestamps. For a pair of keystrokes, there are six latencies one can compute; however, three of the latencies together uniquely determine the rest [[Alsultan](https://www.researchgate.net/publication/313742321_Keystroke_dynamics_authentication_A_survey_of_free-text), [15](https://arxiv.org/pdf/2101.05570.pdf), [13](https://www.sciencedirect.com/science/article/pii/S0167404820301334)].
<center><img src="img/latencies.png" alt="Latencies Illustration" width="600"/></center>
We will refer to the four latencies (IL, PL, RL, ) that involve more than one key as transition latencies.
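
To make the relationship between the latencies concrete, here is a minimal sketch in Python. The `Keystroke` class and the `digraph_latencies` helper are hypothetical names introduced for illustration, and the definitions of HL (hold), IL (inter-key), PL (press) and RL (release) latency follow a commonly used convention that we assume here; the figure above may label them differently.

```python
from dataclasses import dataclass


@dataclass
class Keystroke:
    keycode: int
    press: int    # key-down timestamp in ms
    release: int  # key-up timestamp in ms


def digraph_latencies(first: Keystroke, second: Keystroke) -> dict:
    """Latencies (ms) for two consecutive keystrokes.

    Assumed definitions:
      HL = release - press of a single key
      IL = second.press   - first.release
      PL = second.press   - first.press
      RL = second.release - first.release
    """
    return {
        "HL1": first.release - first.press,
        "HL2": second.release - second.press,
        "IL": second.press - first.release,
        "PL": second.press - first.press,
        "RL": second.release - first.release,
    }


a = Keystroke(keycode=72, press=1000, release=1090)  # 'H'
b = Keystroke(keycode=73, press=1150, release=1230)  # 'I'
lat = digraph_latencies(a, b)

# Three independent latencies (here HL1, IL, HL2) determine the others.
assert lat["PL"] == lat["HL1"] + lat["IL"]
assert lat["RL"] == lat["IL"] + lat["HL2"]
```

Note that IL becomes negative whenever the second key is pressed before the first one is released, so any scaling applied to the latencies has to cope with negative values.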

#### 3.1.2 Input Format: Sequential or Image-like

Traditionally, keystroke data are treated as time series and organized into a sequential format in which each time step corresponds to a unit token represented by its feature vector. For example, depending on whether the unit token is a uni-graph or a di-graph, a common feature vector resembles the following:
<center>
    <img src="img/unigraph-df.png" alt="Unigraph" width="280" title="Unigraph Dataframe"/>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
    <img src="img/digraph-df.png" alt="Digraph" width="280" title="Digraph Dataframe"/>
</center>

As briefly mentioned earlier, a sequential input vector has some intrinsic downsides: on one hand, the JavaScript keycodes need to be one-hot encoded, which makes the input vector very sparse; on the other hand, the non-zero entries of a feature vector include the latencies, which take integer values in a range much larger than $[0,1]$.
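
The sketch below illustrates both downsides on a single time step. It is not the encoding used in this project: `NUM_KEYCODES`, `MAX_LATENCY_MS` and `encode_step` are hypothetical names, and the log scaling is only one possible way (our assumption) to bring latencies into $[0,1]$.

```python
import math

NUM_KEYCODES = 256    # assumed size of the keycode vocabulary
MAX_LATENCY_MS = 1e5  # upper end of the latency range quoted earlier


def encode_step(keycode: int, latencies_ms: list) -> list:
    """Build one time step: one-hot keycode followed by scaled latencies."""
    one_hot = [0.0] * NUM_KEYCODES
    one_hot[keycode] = 1.0
    scaled = []
    for t in latencies_ms:
        # Clamp to [0, MAX_LATENCY_MS]; negative latencies are lost here,
        # which is a simplification of this sketch.
        t = min(max(t, 0), MAX_LATENCY_MS)
        scaled.append(math.log1p(t) / math.log1p(MAX_LATENCY_MS))
    return one_hot + scaled


step = encode_step(keycode=72, latencies_ms=[90, 150])
nonzero = sum(1 for v in step if v != 0.0)
print(len(step), nonzero)  # vector length vs. number of non-zero entries
```

Only three of the 258 entries are non-zero in this example, which is the sparsity issue described above.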

With the popularity gained by convolutional neural networks in recent years, there have been attempts to use CNN models to tackle the keystroke dynamics problem. The most immediate adaptation for this purpose is to take fixed-length time-window sub-dataframes as "images" and pass them directly into a CNN model. Beyond that, many other methods have been proposed to encode the features into more "image"-like inputs. For example, in [[14](https://arxiv.org/pdf/2107.07009.pdf)], the authors proposed a new feature engineering method, called the keystroke dynamics image (KDI), to encode keystrokes (in a fixed window) into "an image": each "color channel" of the image corresponds to a single feature, and the $(i,j)$-entry of each matrix encodes the feature value corresponding to the (ordered) pair of keystrokes with (encoded) keycodes $(i,j)$.

<center> <img src="img/kdi.png" alt="Keystroke Dynamics Image" width="300"/> </center>
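
As a rough illustration of the KDI idea, the following sketch builds a single channel from a window of keystrokes. The function name `keystroke_dynamics_image` is hypothetical, and averaging the values of repeated key pairs is our assumption, not necessarily what the authors of [14] do.

```python
def keystroke_dynamics_image(key_indices, pair_features, num_keys):
    """Build one KDI channel.

    key_indices:   encoded key index of each keystroke in the window
    pair_features: one feature value per consecutive pair of keystrokes
    num_keys:      number of distinct encoded keys (matrix side length)
    """
    sums = [[0.0] * num_keys for _ in range(num_keys)]
    counts = [[0] * num_keys for _ in range(num_keys)]
    pairs = zip(key_indices, key_indices[1:])
    for (i, j), value in zip(pairs, pair_features):
        sums[i][j] += value
        counts[i][j] += 1
    # Entry (i, j) holds the mean feature value over all occurrences of the
    # ordered key pair (i, j); pairs that never occur stay at 0.
    return [
        [sums[i][j] / counts[i][j] if counts[i][j] else 0.0 for j in range(num_keys)]
        for i in range(num_keys)
    ]


image = keystroke_dynamics_image([0, 1, 0, 1], [100, 50, 200], num_keys=2)
reordered = keystroke_dynamics_image([0, 1, 0, 1], [200, 50, 100], num_keys=2)
```

In this toy example `image` and `reordered` are identical, even though the two windows differ in the order in which the latencies occurred.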

The authors compared this KDI input format against the traditional sequential input format (each row represents a keystroke) using two different models: a CNN model and a CNN-RNN model. Even though the KDI input format outperforms the traditional sequential features in their experimental results, a comparison with the sequential input format trained on a purely RNN model is missing, so it is unclear whether KDI is indeed a better feature encoding method or is only better than taking fixed-length time-window sub-dataframes as "images". In [[Piugie](https://hal.archives-ouvertes.fr/hal-03716818/document)], the authors proposed another "image"-like feature engineering method that converts the latency feature vector into a matrix using the _squareform()_ function in _Matlab_, which essentially computes the pair-wise distances of the latencies; the matrix is then transformed into a 3D tensor using the _imagesc()_ function in _Matlab_. The authors then use multiple state-of-the-art pretrained convnet models to process these feature images, among which GoogleNet seems to perform best in terms of Equal Error Rate (EER).
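
The pair-wise distance step can be sketched in a few lines of Python. This is only an analogue of the idea under our own assumptions (absolute difference as the distance, a hypothetical helper name), not a port of the authors' Matlab code, and it omits the color-mapping step that produces the 3D tensor.

```python
def pairwise_distance_matrix(latencies):
    """Symmetric matrix whose (i, j) entry is |latencies[i] - latencies[j]|."""
    return [[abs(a - b) for b in latencies] for a in latencies]


matrix = pairwise_distance_matrix([90, 150, 60])
for row in matrix:
    print(row)
```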


These image-like feature engineering methods, however, share a critical downside compared with the more traditional sequential input formats: they discard the intrinsic sequential ordering of the timestamps (beyond a length of 2 in di-graphs) within the fixed-size window, by bagging them into a single matrix or separating them into different matrices. This missing information might not be critical for the classification task, but it is intuitively more likely to matter for keystroke timestamp prediction (the regression task), as users tend to make more mistakes later in a continuous typing session.


#### 3.1.3 Additional Features

In addition to keycodes and timestamp latencies, another implicit feature of keystroke dynamics that might be useful for training is the position of keys on the keyboard (note: we limit ourselves to QWERTY keyboards in this project). For example, in [[Singh](https://ieeexplore.ieee.org/abstract/document/5990168)] a keyboard grouping technique was introduced to classify keycodes based on their locations on the keyboard, which was divided into eight sections: left and right halves, each further divided into four lines representing the rows of the keyboard. In [[Alsultan2](https://ieeexplore.ieee.org/document/6722548)], by considering key pairs within a certain finger distance on the keyboard, the authors incorporated the keyboard layout into their proposed keystroke dynamics authentication system, which uses the Euclidean distance to calculate the similarity between the typed timing features and the user's profile data.
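
A minimal sketch of how such positional features could be derived for a QWERTY layout is given below. The left/right boundary (`SPLIT`), the row stagger (`ROW_OFFSETS`) and all function names are hypothetical choices made for illustration; they are not taken from the cited papers.

```python
import math

# Four character rows of a US QWERTY layout (lower-case, unshifted).
ROWS = ["1234567890", "qwertyuiop", "asdfghjkl;", "zxcvbnm,./"]
# Hypothetical horizontal stagger of each row, in key widths.
ROW_OFFSETS = [0.0, 0.5, 0.75, 1.25]
# Hypothetical boundary: the first five keys of each row count as "left".
SPLIT = 5


def locate(char: str):
    """Return (row, column) of a single character, or None if not found."""
    if len(char) != 1:
        return None
    for row_idx, row in enumerate(ROWS):
        col = row.find(char.lower())
        if col >= 0:
            return row_idx, col
    return None


def key_section(char: str):
    """Section id in 0..7: four rows times a left/right half."""
    loc = locate(char)
    if loc is None:
        return None
    row_idx, col = loc
    return 2 * row_idx + (0 if col < SPLIT else 1)


def key_distance(a: str, b: str) -> float:
    """Euclidean distance between two keys, in key widths."""
    loc_a, loc_b = locate(a), locate(b)
    if loc_a is None or loc_b is None:
        raise ValueError("key not in layout")
    (row_a, col_a), (row_b, col_b) = loc_a, loc_b
    dx = (col_a + ROW_OFFSETS[row_a]) - (col_b + ROW_OFFSETS[row_b])
    return math.hypot(dx, row_a - row_b)


print(key_section("q"), key_section("p"), key_distance("a", "s"))
```

Either the section id or the distance between consecutive keys could then be appended to each di-graph's feature vector.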

### 3.2 Timing Latencies and Outliers 

Keystroke dynamics datasets, especially those created for training free-text keystroke systems, generally contain considerable noise caused by extreme hold latencies or transition latencies involving more than two keys. Preprocessing the noise present in these datasets can be crucial, as the outliers might otherwise hamper the ability of any model to learn. On the other hand, one needs to be careful, because there might be "true" outliers that the model is supposed to learn. For example, consider the following box plot of the distributions of keystroke Press Latencies of six users:

<center>
    <img src="img/outliers-samples.png" alt="Outlier Samples" width="950" title="Outlier Samples"/>
</center>

Notice that PL $\approx 1200$ looks like an outlier for User 87, yet it is a completely normal data point for User 83. Similarly, PL $\approx 2800$ looks like an outlier for User 35, but it is relatively normal for User 83. Moreover, some of the points beyond 4 times the interquartile range (IQR) can be crucial for distinguishing users: for example, User 83 and User 144 have relatively few points outside $4\times$IQR, whereas the other four users all have relatively many. Then consider User 102, who has three data points that are clearly outliers: these could be moments where the user was distracted and temporarily zoned out of their continuous typing session. Extreme outliers of this type can negatively influence model training. Finally, a common pattern here is that each user generally has no more than 5-10 extreme outlier data points.

Based on these observations on individual users' keystroke latencies, we conclude that dropping outliers at a global level is not going to be very useful, as users occupy different latency ranges. Thus, outlier filters must be applied at the user level. To filter out outliers, we can use either quantile ranges or standard deviations. However, based on the last observation, it seems more effective to simply filter out the 5-10 largest latencies for each user.
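
The two user-level filters discussed above can be sketched as follows. The record layout `(user_id, latency_ms)`, the function names and the default parameters are assumptions made for illustration; the IQR variant only removes points on the upper side.

```python
import statistics
from collections import defaultdict


def _group_by_user(records):
    """Map user id -> list of (latency, original index)."""
    groups = defaultdict(list)
    for idx, (user, latency) in enumerate(records):
        groups[user].append((latency, idx))
    return groups


def drop_top_k_per_user(records, k=5):
    """Remove each user's k largest latencies; keeps the original order."""
    dropped = set()
    for entries in _group_by_user(records).values():
        dropped.update(idx for _, idx in sorted(entries, reverse=True)[:k])
    return [rec for idx, rec in enumerate(records) if idx not in dropped]


def iqr_filter_per_user(records, factor=4.0):
    """Remove latencies above Q3 + factor * IQR, computed per user."""
    dropped = set()
    for entries in _group_by_user(records).values():
        values = [latency for latency, _ in entries]
        if len(values) < 2:
            continue  # quartiles are undefined for a single point
        q1, _, q3 = statistics.quantiles(values, n=4)
        upper = q3 + factor * (q3 - q1)
        dropped.update(idx for latency, idx in entries if latency > upper)
    return [rec for idx, rec in enumerate(records) if idx not in dropped]


records = [
    ("u83", 1200), ("u83", 1100), ("u83", 1300),
    ("u87", 200), ("u87", 210), ("u87", 1200),
]
filtered = drop_top_k_per_user(records, k=1)
```

In this toy example the value 1200 is kept for `u83` but removed for `u87`, which is exactly the behavior a global threshold cannot provide.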

In [P2, [10, 1] - timing-info], only the keystrokes whose hold latencies and transition latencies are below a predefined standard deviation were added to the user pool. On the other hand, in [11 -- timing info], the authors fixed a hard minimum and maximum to limit the range of the timing latencies, where the choice of range (i.e. the minimum and maximum) can depend on quantile ranges or ....

### 3.3 Model Selection

---> User Embedding: TypeNet model, CNN based

Table IV of [[Piugie](https://hal.archives-ouvertes.fr/hal-03716818/document)]: some examples of the performance of state-of-the-art pretrained CNN models.


### 3.4 Loss & Training Performance

## 4. The Dataset and Project Architecture

### 4.1 The Dataset

The contexts in which keystroke dynamics are studied are characterized by the following aspects:
* __fixed-text versus free-text systems__: a fixed-text system authenticates users by having them repeatedly type a fixed string of text, whereas a free-text system has no such restriction, and the data gathered from a user usually consist of keystrokes from different sentences.
* __typing device__: in the early days of keystroke dynamics, the most common typing devices were mechanical keyboards; in recent years, keystroke dynamics on touchscreen devices has gained a lot of attention as touchscreen functionality has been incorporated into more and more electronic devices.
* __keystroke features__: the most common type of dataset used in the study of keystroke dynamics consists only of characters with the associated key press and release timestamps, and perhaps some general metadata about the participants. On the other hand, with the development of artificial intelligence, we have more tools to process and analyze other input data such as visual recordings and keystroke pressure; thus, datasets that include these features have also been constructed.

There are different datasets collected for studying different systems of keystroke dynamics authentication: the two most explored free-text datasets based on mechanical keyboard typing are the _Buffalo Keystroke Dataset_ [[7](https://ieeexplore.ieee.org/document/7823894), [Dataset](https://www.buffalo.edu/cubs/research/datasets.html#title_429116244)], and the _Clarkson II Keystroke Dataset_ [[8](https://ieeexplore.ieee.org/document/8272738), [Dataset](https://citer.clarkson.edu/research-resources/biometric-dataset-collections-2/clarkson-university-keystroke-dataset/)]. There are also datasets collected on mobile devices with touchscreen keyboards, such as [[9](https://userinterfaces.aalto.fi/typing37k/resources/Mobile_typing_study.pdf)]. More traditional fixed-text datasets also exist, such as [[10](http://www.cs.cmu.edu/~keystroke/KillourhyMaxion09.pdf), [Dataset](https://www.cs.cmu.edu/~keystroke/)].

In this project, we will use the open-source dataset proposed in the paper [[11](https://userinterfaces.aalto.fi/136Mkeystrokes/resources/chi-18-analysis.pdf)] which can be accessed online through [[Dataset](https://userinterfaces.aalto.fi/136Mkeystrokes/)]. The dataset consists of 168,000 volunteers typing over 136 million keystrokes (15 sentences per volunteer) via an online typing platform [[Platform](https://typingtest.aalto.fi/)] (here is a screenshot): 

<center> <img src="img/136m-app-screenshot.png" alt="APP SCREENSHOT" width="500"/> </center>

and the dataset has the following characteristics:
* free-text systems: sentences are not repetitively typed by participants;
* typing devices are mechanical English keyboards (either full keyboard or laptop), and with QWERTY, AZERTY or QWERTZ layout;
* features contain only keystroke timestamps (as opposed to keypress pressure data or visual recordings of typing actions).

More specifically, the raw dataset is a folder containing, for each participant, a corresponding `<PARTICIPANT_ID>_keystrokes.txt` file with the following information:
* `PARTICIPANT_ID`: unique ID of participant
* `TEST_SECTION_ID`: unique ID of the presented sentence
* `SENTENCE`: sentence shown to the user (each user types 15 sentences)
* `USER_INPUT`: sentence typed by the user after pressing Enter or Next button
* `KEYSTROKE_ID`: unique ID of the keypress 
* `PRESS_TIME`: timestamp of the key down event (in ms) 
* `RELEASE_TIME`: timestamp of the key release event (in ms) 
* `LETTER`: the typed letter
* `KEYCODE`: the javascript keycode of the pressed key
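
As a sketch of how one such file could be read, the snippet below parses an in-memory sample with the columns listed above and derives hold latencies. We assume here that the files are tab-separated with a header row and that the timestamps are numeric; the `read_keystrokes` helper and the sample rows are made up for illustration.

```python
import csv
import io

SAMPLE = (
    "PARTICIPANT_ID\tTEST_SECTION_ID\tSENTENCE\tUSER_INPUT\tKEYSTROKE_ID\t"
    "PRESS_TIME\tRELEASE_TIME\tLETTER\tKEYCODE\n"
    "5\t1\tHi\tHi\t100\t1000\t1090\tH\t72\n"
    "5\t1\tHi\tHi\t101\t1150\t1230\ti\t73\n"
)


def read_keystrokes(handle):
    """Parse a keystrokes file into a list of dicts with numeric fields."""
    # QUOTE_NONE keeps quotation marks inside sentences untouched.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for row in reader:
        row["PRESS_TIME"] = int(float(row["PRESS_TIME"]))
        row["RELEASE_TIME"] = int(float(row["RELEASE_TIME"]))
        row["KEYCODE"] = int(row["KEYCODE"])
        rows.append(row)
    return rows


rows = read_keystrokes(io.StringIO(SAMPLE))
hold_latencies = [r["RELEASE_TIME"] - r["PRESS_TIME"] for r in rows]
print(hold_latencies)
```

For a real file, the `io.StringIO` object would be replaced by an opened `<PARTICIPANT_ID>_keystrokes.txt` handle; the appropriate text encoding is not specified here and would need to be checked against the dataset's `readme.txt`.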

Aside from the keystroke timestamps data from each participant, the dataset also contains a `readme.txt` file and some meta-data about the participants collected in `metadata_participants.txt`:
* `PARTICIPANT_ID`: unique ID of participant
* `AGE`
* `GENDER`
* `HAS_TAKEN_TYPING_COURSE`: whether the participant has taken a typing course (1) or not (0)
* `COUNTRY`
* `KEYBOARD_LAYOUT`: QWERTY, AZERTY or QWERTZ layout of keyboard used
* `NATIVE_LANGUAGE`
* `FINGERS`: choice between 1-2, 3-4, 5-6, 7-8 and 9-10 fingers used for typing
* `TIME_SPENT_TYPING`: number of hours spent typing every day
* `KEYBOARD_TYPE`: full (desktop), laptop, small physical (e.g on phone) or touch keyboard
* `ERROR_RATE(%)`: uncorrected error rate
* `AVG_WPM_15`: words per minute averaged over the 15 typed sentences
* `AVG_IKI`: average inter-key interval 
* `ECPC`: error corrections per character
* `KSPC`: keystrokes per character
* `AVG_KEYPRESS`: average keypress duration
* `ROR`: rollover ratio
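
To give a feel for two of these summary statistics, the sketch below computes an average inter-key interval and a rollover ratio from raw press and release timestamps. The definitions used (interval between consecutive key presses; share of consecutive key pairs where the next key is pressed before the previous one is released) are our working assumptions and may not match exactly how the dataset's authors computed `AVG_IKI` and `ROR`.

```python
def average_iki(press_times):
    """Mean interval (ms) between consecutive key-press events."""
    gaps = [b - a for a, b in zip(press_times, press_times[1:])]
    return sum(gaps) / len(gaps) if gaps else 0.0


def rollover_ratio(press_times, release_times):
    """Share of consecutive key pairs typed with rollover.

    A pair counts as rollover when the next key is pressed before the
    previous key has been released (assumed definition).
    """
    pairs = list(zip(release_times, press_times[1:]))
    if not pairs:
        return 0.0
    rollovers = sum(1 for prev_release, next_press in pairs if next_press < prev_release)
    return rollovers / len(pairs)


press = [0, 100, 150]
release = [80, 160, 240]
print(average_iki(press), rollover_ratio(press, release))
```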

Since this dataset was chosen for the project, the scope of the problem is largely determined: we will consider free-text systems with mechanical keyboards and investigate keystroke dynamics based purely on timestamp information.

### 4.2 Project Architecture

<center> <img src="img/problem-statement.png" alt="Problem Statement" width="500"/> </center>

Here is an overview of our project architecture.

<center> <img src="img/proj-architecture.png" alt="Project Architecture" width="1100"/> </center>

<center> <img src="img/train-df-stats-digraph.png" alt="Trainset Statistics - Digraph Dataframe" width="500"/> </center>

## 5. Experiment Results and Discussions

## 6. Conclusions

### 6.X Code Development

* splitting on user level and sentence level transition caused high range of latencies.
* There are users who just type really slowly.

Though not mentioned in section _3. Challenges and Related Work_, I want to note here the major challenge I encountered, especially in the early stage of the project: a substantial amount of effort was spent on developing the code, in particular on streamlining the entire pipeline prior to the training phase, from importing the _.txt_ files, to extracting the features and performing encoding or filtering, to generating `tf.data` objects for feeding into the models.

* A lot of parameters to consider, need a uniform way to access and change them.

## Feasibility and Limitations

This project is largely experimental, since we cannot find any prior work with a similar objective to use as a reference. In particular, we have no expectation for baseline model performance; we are not sure whether the dataset we chose is suitable for our prediction task; and even though the objective of the project is clear, the specific input and output of the model still need to be determined after a couple of quick first experiments. So, at the current stage, we are largely unsure how the project is going to play out; the important part is to try different approaches and keep records of what works and what does not.

Finally, we want to mention here another soft application of keystroke dynamics prediction task: if a model can be trained to reach a reasonable accuracy, it then can be incorporated into a typing website for assisted typing training: a model that types like oneself makes the best opponent, thus the name `TypeLikeU`. This was also the starting point that led the author to consider keystroke dynamics.

## Experiments

Table 3 of [[13](https://www.sciencedirect.com/science/article/pii/S0167404820301334)]: CNN models seem to work better than RNN-based ones.

## References

[1] Wikipedia contributors. (2022, September 26). _Keystroke dynamics_. Wikipedia. https://en.wikipedia.org/wiki/Keystroke_dynamics

[2] Siti Fairuz Nurr Sadikan, Azizul Azhar Ramli, and Mohd Farhan Md. Fudzee, _A survey paper on keystroke dynamics authentication for current applications_, AIP Conference Proceedings 2173, 020010 (2019) https://doi.org/10.1063/1.5133925

[3] Hassan Khan, Urs Hengartner, and Daniel Vogel. 2020. _Mimicry Attacks on Smartphone Keystroke Authentication_. ACM Trans. Priv. Secur. 23, 1, Article 2 (February 2020), 34 pages. https://doi-org.ezproxy.rice.edu/10.1145/3372420

[4] Abdul Serwadda and Vir V. Phoha. 2013. _Examining a Large Keystroke Biometrics Dataset for Statistical-Attack Openings_. ACM Trans. Inf. Syst. Secur. 16, 2, Article 8 (September 2013), 30 pages. https://doi-org.ezproxy.rice.edu/10.1145/2516960

[5] Tey, Chee Meng, Payas Gupta, and Debin Gao. _I can be you: Questioning the use of keystroke dynamics as biometrics._ (2013): 1.

[6] Y. Sun and S. Upadhyaya, _Synthetic Forgery Attack against Continuous Keystroke Authentication Systems,_ 2018 27th International Conference on Computer Communication and Networks (ICCCN), 2018, pp. 1-7, doi: 10.1109/ICCCN.2018.8487341.

[7] Y. Sun, H. Ceker and S. Upadhyaya, _Shared keystroke dataset for continuous authentication,_ 2016 IEEE International Workshop on Information Forensics and Security (WIFS), 2016, pp. 1-6, doi: 10.1109/WIFS.2016.7823894.

[8] C. Murphy, J. Huang, D. Hou and S. Schuckers, _Shared dataset on natural human-computer interaction to support continuous authentication research,_ 2017 IEEE International Joint Conference on Biometrics (IJCB), 2017, pp. 525-530, doi: 10.1109/BTAS.2017.8272738.

[9] Palin, Kseniia, et al. _How do people type on mobile devices? Observations from a study with 37,000 volunteers._ Proceedings of the 21st International Conference on Human-Computer Interaction with Mobile Devices and Services. 2019.

[10] Killourhy, Kevin S., and Roy A. Maxion. _Comparing anomaly-detection algorithms for keystroke dynamics._ 2009 IEEE/IFIP International Conference on Dependable Systems & Networks. IEEE, 2009.

[11] Dhakal, Vivek, et al. _Observations on typing from 136 million keystrokes._ Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. 2018.

[12] Chang, Han-Chih, et al. _Machine Learning and Deep Learning for Fixed-Text Keystroke Dynamics._ Cybersecurity for Artificial Intelligence. Springer, Cham, 2022. 309-329.

[13] Lu, Xiaofeng, et al. _Continuous authentication by free-text keystroke based on CNN and RNN._ Computers & Security 96 (2020): 101861.

[14] Li, Jianwei, Han-Chih Chang, and Mark Stamp. _Free-text keystroke dynamics for user authentication._ Cybersecurity for Artificial Intelligence. Springer, Cham, 2022. 357-380.

[15] Acien, Alejandro, et al. _TypeNet: Deep learning keystroke biometrics._ IEEE Transactions on Biometrics, Behavior, and Identity Science 4.1 (2021): 57-70.