<style type="text/css">
	.myimg {
		max-width: 500px !important;
	}
</style>


# Monkey Type Machine Learn

**A Project by Richard McHorgh**


## Introduction


In the middle of October, inspired by YouTube creators, namely [Ben Vallack](https://www.youtube.com/benvallack), I began designing the printed circuit boards (PCBs), keycaps, and case plates that would eventually become the ortholinear, split, and thumb-maximizing keyboard I’m using to type this report. I thought it would be the perfect present for my birthday, but the main reason for starting this project was to protect my hands and wrists from the threat of carpal tunnel syndrome and other diseases like it.


Unlike the traditional row-staggered layout one would find on the majority of laptop and store-bought keyboards, ortholinear keyboards arrange the keys in a strict grid matrix. This allows for a more ergonomic and efficient typing experience, as each key is positioned directly under the finger that would normally press it, rather than back and shifted to the left or right by an arbitrary amount, thereby reducing strain on the fingers and wrists, and the risk of injuries such as carpal tunnel syndrome.


<img class='myimg' alt='A picture of the completed keyboard' src='/Users/richard/Documents/umd/cmsc320/final/img/kb_hat.jpg'/>

*A picture of the completed keyboard*

Furthermore, strain on the fingers, wrists, and back is reduced by splitting the keyboard into halves between the positions of the 5 and 6 keys. Traditional keyboards require their users’ hands to be unnaturally close together, forcing their wrists to bend at an angle, whereas, with a split keyboard, one’s hands can be straight and as far apart as the cable joining the two halves allows. With the arms in their natural position, the hands stay straight and in as comfortable a position as possible, especially during long typing sessions.


<img class='myimg' src='/Users/richard/Documents/umd/cmsc320/final/img/strain.jpg'/>

_Hand angles while using a row staggered keyboard and a split keyboard_

The third ergonomic feature of the design is the pair of programmable thumb clusters found on the last row of each half of the keyboard. Row-staggered keyboards only use the operator’s thumbs for the spacebar, so to perform shortcuts with the Control, Command, Alt, or any other non-alphanumeric keys, hands must be moved off the keyboard. Every time the user’s fingers move off the home row, more strain is introduced, and this strain is compounded when the entire hand must be moved to type a key. With a programmable thumb cluster, all movement (Up, Left, Home, etc.), page management (Alt-Tab), and text manipulation (Undo, Cut, Paste, etc.) shortcuts can be performed without lifting the hands from either half of the keyboard.


All the ergonomic design features of the keyboard would be less effective without a layout that can magnify them, so instead of the traditional QWERTY keyboard layout, my keyboard was programmed to use a slightly modified version of the Colemak-DHM layout that I named SemiColemak. Standard Colemak-DHM places letters on the keyboard by their frequency in English words and the relative strength of the finger that is meant to press them. The more common the letter, the stronger the finger above it. By contrast, QWERTY is organized to minimize the jams of a typewriter, which is nonsensical in an era of electrically controlled keyboards.


However, due to QWERTY’s ubiquity, it was the keyboard layout I learned to type on in elementary school. Text manipulation shortcuts are based on the QWERTY layout as well, so SemiColemak’s proximity to QWERTY, with 11 keys in the same position, including z, x, c, and v, makes it easier to learn than other ergonomic layouts such as Dvorak or Workman.


Although SemiColemak is among the easiest layouts for a transition from QWERTY, the relearning process has been as strenuous for my muscle memory as QWERTY was for my hands and wrists. Even now, in my third week of solely using the keyboard, my typing speed in SemiColemak is barely approaching half of my average on QWERTY.


The objective of this tutorial is to identify the points of error in typing the words that I use most frequently and to use that data to classify words featuring my weaknesses. Practicing this set of words, along with others, will help to increase my speed and accuracy when typing on my keyboard.


## Collecting the raw data


As well as being the inspiration for this project's name, the competitive type racing website Monkey See Monkey Type, generally abbreviated to Monkey Type, is the source of the data used in this project. By pasting the JavaScript keylogger shown below into the developer console of a browser, I was able to record the letter I pressed, the letter I should have pressed, and the source of the text, among other data points.


In [1]:
%pycat scrape.js

[0mvar[0m [0mmonkeyTypeAll[0m [0;34m=[0m [0;34m([0m[0;34m)[0m [0;34m=[0m[0;34m>[0m [0;34m{[0m[0;34m[0m
[0;34m[0m        [0;34m//[0m [0mcall[0m [0mwhen[0m [0mthe[0m [0mpage[0m [0mdetects[0m [0ma[0m [0mkeypress[0m[0;34m[0m
[0;34m[0m        [0mdocument[0m[0;34m.[0m[0maddEventListener[0m[0;34m([0m[0;34m'keyup'[0m[0;34m,[0m [0;32masync[0m [0;34m([0m[0;34m{[0m [0mkey[0m [0;34m}[0m[0;34m)[0m [0;34m=[0m[0;34m>[0m [0;34m{[0m[0;34m[0m
[0;34m[0m                [0;34m//[0m [0mguard[0m [0magainst[0m [0marrow[0m [0mkeys[0m [0metc[0m[0;34m.[0m[0;34m[0m
[0;34m[0m                [0;32mif[0m [0;34m([0m[0mkey[0m[0;34m.[0m[0mlength[0m [0;34m>[0m [0;36m1[0m[0;34m)[0m [0;32mreturn[0m[0;34m;[0m[0;34m[0m
[0;34m[0m[0;34m[0m
[0;34m[0m                [0;34m//[0m [0mselect[0m [0mlanguage[0m[0;34m,[0m [0mlayout[0m [0;32mand[0m [0mpotentially[0m [0mfunbox[0m [0moptions[0m[0;34m

The data scraped by the above function is then sent to the Deno localhost web server transcribed below. Once a POST request is received, the web server appends its contents to a comma-separated table. If you would like to use this code to analyze your own typing, make sure to start the Deno server before sending any data from the keylogger script.
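To illustrate the append step in isolation, here is a minimal Python stand-in. It is not the Deno server used in this project, only a sketch of the same idea: take one decoded JSON record and add it as a row to a CSV file. The function name `append_keystroke` is hypothetical, and the column list is assumed from the tables shown later in this report.

```python
import csv
from pathlib import Path

# Column names assumed from the tables displayed later in this report.
FIELDS = ["timestamp", "activeWord", "lastChar", "correctChar", "source",
          "layout", "type", "length", "language", "funbox"]


def append_keystroke(path: Path, record: dict) -> None:
    """Append one keystroke record to a CSV file (hypothetical helper).

    The header row is written only when the file is new or empty.
    """
    is_new = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as handle:
        # Keys that are not in FIELDS are dropped; missing keys become empty cells.
        writer = csv.DictWriter(handle, fieldnames=FIELDS, extrasaction="ignore")
        if is_new:
            writer.writeheader()
        writer.writerow(record)
```

Calling `append_keystroke(Path("data.csv"), record)` once per received record would grow the table one keystroke at a time, which is the shape of data the rest of this tutorial works with.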


In [2]:
%pycat storage.ts

[0;34m//[0m [0;32mimport[0m [0mthe[0m [0mwebserver[0m [0;32min[0m [0mthe[0m [0mdeno[0m [0mway[0m[0;34m[0m
[0;34m[0m[0;32mimport[0m [0;34m{[0m [0mserve[0m [0;34m}[0m [0;32mfrom[0m [0;34m'https://deno.land/std@0.157.0/http/server.ts'[0m[0;34m;[0m[0;34m[0m
[0;34m[0m[0;34m[0m
[0;34m[0m[0mconst[0m [0mhandler[0m [0;34m=[0m [0;32masync[0m [0;34m([0m[0mreq[0m[0;34m:[0m [0mRequest[0m[0;34m)[0m[0;34m:[0m [0mPromise[0m[0;34m<[0m[0mResponse[0m[0;34m>[0m [0;34m=[0m[0;34m>[0m [0;34m{[0m[0;34m[0m
[0;34m[0m        [0;34m//[0m [0mextract[0m [0mthe[0m [0mjson[0m [0mbody[0m [0;32mfrom[0m [0mthe[0m [0mrequest[0m [0msent[0m [0;32mfrom[0m [0mthe[0m [0mbrowser[0m[0;34m[0m
[0;34m[0m        [0;32mawait[0m [0mreq[0m[0;34m.[0m[0mjson[0m[0;34m([0m[0;34m)[0m[0;34m.[0m[0mthen[0m[0;34m([0m[0;34m[0m
[0;34m[0m                [0;34m//[0m [0mdestructure[0m [0mthe[0m [0mcontents[0m [0

## Understanding the data


### Required Imports 


In [3]:
from pandas import read_csv

In [4]:
df = read_csv("data.csv", keep_default_na=False)

### N-Grams


Although it rarely outputs actual words, the __Pseudolang__ funbox option generates words by algorithmically combining the common n-grams of the selected language. An n-gram is a sequence of _n_ adjacent elements from a collection of tokens; in this context, the elements are letters, and the common n-grams are the letter sequences that occur most frequently. In English, some of the most common bigrams, or 2-grams, include:

-   th
-   he
-   in
-   er
-   an
-   re
-   nd
-   at
-   on
-   nt
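To make the definition concrete, here is a minimal sketch that extracts the n-grams of a word and tallies them over a list of words. The helper names `ngrams` and `count_ngrams` and the sample words are illustrative assumptions; they are not part of the dataset or of Monkey Type.

```python
from collections import Counter


def ngrams(word: str, n: int = 2) -> list[str]:
    """Return every run of n adjacent characters in word, left to right."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def count_ngrams(words: list[str], n: int = 2) -> Counter:
    """Tally the n-grams found across a list of words."""
    counts: Counter = Counter()
    for word in words:
        counts.update(ngrams(word.lower(), n))
    return counts


# Illustrative input, not taken from data.csv
sample_words = ["the", "then", "other", "another"]
bigram_counts = count_ngrams(sample_words)
print(bigram_counts.most_common(3))
```

Words shorter than `n` simply contribute no n-grams, so the same helper works for bigrams, trigrams, and longer sequences.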


Ingraining frequent n-grams into muscle memory is crucial to improving typing accuracy and speed. SemiColemak positions a few of these bigrams near each other, specifically in, he, at, and on, but underneath different fingers of the same hand. By avoiding the use of the same finger to type the n-gram, one can prime the finger assigned to the next letter in the sequence before the key is actually pressed, and as a result, decrease the time needed to type the whole word.


<img class='myimg' src='/Users/richard/Documents/umd/cmsc320/final/img/kb_legend.jpg'/>

_An image of the legend of the keyboard_

In [5]:
# Calculate the percentage of rows (keystrokes) whose active word contains each listed bigram

ngrams = ["er", "re", "th", "in","an", "at", "he", "on", "nd",  "nt"]
for x in ngrams:
    print(
        f"Percentage of rows with {x} bigram:\t{round(len(df[df.activeWord.str.contains(x)]) * 100 / len(df))}%"
    )


Percentage of rows with er bigram:	10%
Percentage of rows with re bigram:	8%
Percentage of rows with th bigram:	7%
Percentage of rows with in bigram:	7%
Percentage of rows with an bigram:	5%
Percentage of rows with at bigram:	5%
Percentage of rows with he bigram:	5%
Percentage of rows with on bigram:	4%
Percentage of rows with nd bigram:	4%
Percentage of rows with nt bigram:	3%


### Languages


Monkey Type can generate typing tests in many different languages. To emulate the words that I use from day to day, [PERCENTAGE] of the tests that I took were composed of English words. Since I also write a lot of Python, JavaScript, and Swift, and, to a lesser degree, French and Rust, those languages were added to the test mix.

Since all of the programming languages included use English-based syntax, they share many n-grams with English. However, regular English does not train your fingers for coding conventions like camel case or snake case, or for the frequent use of parentheses, equals signs, and comparison operators.

In [6]:
# Words generated in all of the languages included in the dataset

df.iloc[[300, 1770, 19386, 9141, 3307, 8967]]

Unnamed: 0,timestamp,activeWord,lastChar,correctChar,source,layout,type,length,language,funbox
300,2022-12-04T10:02:36.901Z,house,s,s,monkeytype,semicolemakdh,words,10,english,
1770,2022-12-04T21:34:37.804Z,rsplit,y,p,monkeytype,semicolemakdh,words,10,code python,
19386,2022-12-10T01:50:44.070Z,Math.pow(),w,w,monkeytype,semicolemakdh,words,25,code javascript,
9141,2022-12-06T06:30:15.753Z,puree,,,monkeytype,semicolemakdh,words,25,wordle,
3307,2022-12-04T22:26:02.941Z,etre,r,r,monkeytype,semicolemakdh,words,10,french,
8967,2022-12-06T06:27:44.742Z,subscript,b,b,monkeytype,semicolemakdh,words,25,code swift,


### Punctuation 


When spoken languages are transcribed, punctuation marks are added. Monkey Type does not add these marks by default, so the **Quote** test mode and the **Wikipedia** and **Poetry** funbox options were used to introduce them.


In [7]:
# Words with punctuation

df[df["activeWord"].str.contains("'")].head()

Unnamed: 0,timestamp,activeWord,lastChar,correctChar,source,layout,type,length,language,funbox
8456,2022-12-06T06:21:26.190Z,bestrewin',,,monkeytype,semicolemakdh,words,25,english,poetry
8457,2022-12-06T06:21:27.085Z,bestrewin',b,b,monkeytype,semicolemakdh,words,25,english,poetry
8458,2022-12-06T06:21:27.438Z,bestrewin',e,e,monkeytype,semicolemakdh,words,25,english,poetry
8459,2022-12-06T06:21:27.609Z,bestrewin',s,s,monkeytype,semicolemakdh,words,25,english,poetry
8460,2022-12-06T06:21:28.211Z,bestrewin',t,t,monkeytype,semicolemakdh,words,25,english,poetry


### Test Length

Depending on the type of test, the values recorded in the length column of the table can represent two kinds of data. When taking a __Words__ test, it represents the cardinality of the set of words generated for the typing test, but in the case of a __Time__ test, it indicates the amount of time allotted to type as many words as possible. 


Monkey Type ranks performance in timed tests daily on their leaderboards by speed and accuracy. As a result, taking timed tests induces more stress, but in turn estimates real-world performance better than a word count test, which is primarily used for improving accuracy.

In [8]:
# Characters typed during a timed test xor a word count test

df.iloc[[6942, 6944]]

Unnamed: 0,timestamp,activeWord,lastChar,correctChar,source,layout,type,length,language,funbox
6942,2022-12-06T06:01:44.527Z,use,u,u,monkeytype,semicolemakdh,time,10,english,
6944,2022-12-06T06:01:44.954Z,use,e,e,monkeytype,semicolemakdh,time,10,english,


## Feature Engineering


### Required Imports 


In [9]:
from datetime import datetime as dt, timedelta
from pandas import NaT

In [31]:
df[df.type == 'time']

Unnamed: 0,timestamp,activeWord,lastChar,correctChar,source,layout,type,length,language,funbox,cpm,testNum
6878,2022-12-06 06:01:19.071,still,s,s,monkeytype,semicolemakdh,time,10,english,,0 days 00:00:27.253000,0
6879,2022-12-06 06:01:19.231,still,t,t,monkeytype,semicolemakdh,time,10,english,,0 days 00:00:00.160000,0
6880,2022-12-06 06:01:20.134,still,i,i,monkeytype,semicolemakdh,time,10,english,,0 days 00:00:00.903000,0
6881,2022-12-06 06:01:20.394,still,l,l,monkeytype,semicolemakdh,time,10,english,,0 days 00:00:00.260000,0
6882,2022-12-06 06:01:20.609,still,l,l,monkeytype,semicolemakdh,time,10,english,,0 days 00:00:00.215000,0
...,...,...,...,...,...,...,...,...,...,...,...,...
7208,2022-12-06 06:04:49.644,real,l,l,monkeytype,semicolemakdh,time,10,english,,0 days 00:00:00.384000,0
7209,2022-12-06 06:04:49.780,that,,,monkeytype,semicolemakdh,time,10,english,,0 days 00:00:00.136000,0
7210,2022-12-06 06:04:50.292,that,t,t,monkeytype,semicolemakdh,time,10,english,,0 days 00:00:00.512000,0
7211,2022-12-06 06:04:50.648,that,h,h,monkeytype,semicolemakdh,time,10,english,,0 days 00:00:00.356000,0


### Speed

Typing speed is calculated in characters per minute (cpm) rather than the customary unit of measurement, words per minute (wpm). Since word length varies widely, using wpm would require some sort of averaging, which would discard information. Measuring characters per minute does not require averaging because it relies on the smallest non-temporal piece of data, the character.

Characters per minute are derived from the difference between the timestamps of two adjacent characters. This strategy works well when every letter typed is correct; however, since I am still learning, correctly typed characters are inevitably followed by mistypes. As a result, instead of an accurate measurement of the useful characters typed per unit of time, the calculation returns raw typing speed. To remedy this, the speed of incorrectly typed characters is replaced with the mean over the active word.
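As a rough illustration of this idea, the sketch below works on a handful of made-up keystrokes using only the standard library. It assumes one particular reading of the replacement rule: the interval of a mistyped character is replaced with the mean interval of the correctly typed characters in the same word. The sample rows and the helper names are hypothetical.

```python
from datetime import datetime
from statistics import mean

# Hypothetical keystrokes: (ISO timestamp, active word, typed char, expected char)
keystrokes = [
    ("2022-12-04T09:44:12.000", "get", "g", "g"),
    ("2022-12-04T09:44:12.200", "get", "e", "e"),
    ("2022-12-04T09:44:13.200", "get", "r", "t"),  # mistyped character
    ("2022-12-04T09:44:13.600", "get", "t", "t"),
]


def intervals_seconds(rows):
    """Seconds elapsed before each keystroke; None for the first row."""
    times = [datetime.fromisoformat(ts) for ts, *_ in rows]
    return [None] + [(b - a).total_seconds() for a, b in zip(times, times[1:])]


def smooth_mistypes(rows, gaps):
    """Replace each mistyped character's interval with the mean interval of the
    correctly typed characters in the same word (an assumed interpretation)."""
    smoothed = list(gaps)
    for i, (_, word, typed, expected) in enumerate(rows):
        if typed != expected:
            good = [g for (_, w, t, e), g in zip(rows, gaps)
                    if w == word and t == e and g is not None]
            if good:
                smoothed[i] = mean(good)
    return smoothed


def to_cpm(gap):
    """Convert a per-character interval in seconds to characters per minute."""
    return None if not gap else 60.0 / gap


gaps = intervals_seconds(keystrokes)
cpm = [to_cpm(g) for g in smooth_mistypes(keystrokes, gaps)]
```

Here the one-second pause before the mistyped character no longer drags the estimate down, while the correctly typed characters keep their measured speed.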

In [10]:
# Convert the ISO 8601 timestamp strings to Python datetime objects (the trailing "Z" is dropped first)

df.timestamp = df.timestamp.apply(lambda x: dt.fromisoformat(x[:-1]))
df.head()

Unnamed: 0,timestamp,activeWord,lastChar,correctChar,source,layout,type,length,language,funbox
0,2022-12-04 09:44:12.191,get,g,g,monkeytype,semicolemakdh,words,10,english,
1,2022-12-04 09:44:12.368,get,e,e,monkeytype,semicolemakdh,words,10,english,
2,2022-12-04 09:44:12.986,get,t,t,monkeytype,semicolemakdh,words,10,english,
3,2022-12-04 09:44:13.154,down,,,monkeytype,semicolemakdh,words,10,english,
4,2022-12-04 09:44:14.400,down,d,d,monkeytype,semicolemakdh,words,10,english,


In [11]:
# Subtract the intervals

df["cpm"] = df.timestamp
df.cpm = df.cpm.sub(df.timestamp.shift())

df[["timestamp", "lastChar", "cpm"]].head()


Unnamed: 0,timestamp,lastChar,cpm
0,2022-12-04 09:44:12.191,g,NaT
1,2022-12-04 09:44:12.368,e,0 days 00:00:00.177000
2,2022-12-04 09:44:12.986,t,0 days 00:00:00.618000
3,2022-12-04 09:44:13.154,,0 days 00:00:00.168000
4,2022-12-04 09:44:14.400,d,0 days 00:00:01.246000


The intervals between letters in the same word and in the same test are correct, but since tests were not taken consecutively, some intervals are much too large.



In [12]:
df.iloc[[6829, 6830]] #[['timestamp', 'activeWord', 'cpm']]

Unnamed: 0,timestamp,activeWord,lastChar,correctChar,source,layout,type,length,language,funbox,cpm
6829,2022-12-06 06:00:30.936,fact,f,f,monkeytype,semicolemakdh,words,10,english,,0 days 00:00:00.610000
6830,2022-12-06 06:00:31.142,fact,a,a,monkeytype,semicolemakdh,words,10,english,,0 days 00:00:00.206000


Although these letters were recorded successively, moving my fingers from the _e_ to _o_ key to begin the word _old_ did not take 7 hours. To ensure accuracy of the data, the beginning of the interval must align with the start of each test.
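One way to picture this alignment is to compute intervals only between keystrokes that belong to the same test. The sketch below assumes that every keystroke already carries a test number (the hypothetical `test_num` value in each pair) and leaves the first keystroke of each test without an interval. The sample rows are invented for illustration.

```python
from datetime import datetime, timedelta


def intervals_within_tests(rows):
    """rows: list of (test_num, datetime) pairs in typing order.

    Returns one entry per row: the time since the previous keystroke of the
    same test, or None for the first keystroke of each test, so that pauses
    between tests are never counted as typing time.
    """
    result = []
    previous_test, previous_time = None, None
    for test_num, stamp in rows:
        if test_num != previous_test:
            result.append(None)
        else:
            result.append(stamp - previous_time)
        previous_test, previous_time = test_num, stamp
    return result


# Invented example: two tests separated by a seven-hour break
start = datetime(2022, 12, 6, 6, 0, 30)
sample_rows = [
    (1, start),
    (1, start + timedelta(seconds=1)),
    (2, start + timedelta(hours=7)),
    (2, start + timedelta(hours=7, milliseconds=500)),
]
sample_intervals = intervals_within_tests(sample_rows)
```

With this approach the seven-hour break never appears in the results, and the first keystroke of every test is treated the same way as the very first row of the dataset.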

Word count tests are separated after the number of completed words reaches the value in the length column. Separating on complete words is important because Monkey Type allows the user to regenerate a new set of words if they decide that they have made too many mistakes with the current set.

In [37]:
df['testNum'] = 0
# starting from the first row of words tests count the completed words
wordCount = df[df.type == 'words']
testLen = wordCount.at[0, 'length']

word = wordCount.at[0, 'activeWord']
queuedWord = word
count = 0

totalTests = 1

rec = ''

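for i, (aw, lc, cc, l) in wordCount[['activeWord', 'lastChar', 'correctChar', 'length']].iterrows():
	# label the current row with the number of the test in progress
	df.at[i, 'testNum'] = totalTests

	# the previous word was completed, so start tracking the new active word
	if word == '':
		word = aw
		queuedWord = aw

	# the active word changed before the tracked word was completed
	if queuedWord != aw:
		rec += '{skip} '
		word = aw
		queuedWord = aw

	# the previous test reached its word count, so a new test begins
	if testLen == -1:
		totalTests += 1
		rec = ''
		testLen = l

	# consume the next expected letter when it is typed correctly
	if lc == word[0]:
		word = word[1:]

	# every letter of the tracked word has been typed
	if word == '':
		rec += f'{queuedWord} '
		count += 1

	# flag the end of the test once enough words have been completed
	if count == testLen:
		testLen = -1
		count = 0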

df.head(90)


Unnamed: 0,timestamp,activeWord,lastChar,correctChar,source,layout,type,length,language,funbox,cpm,testNum
0,2022-12-04 09:44:12.191,get,g,g,monkeytype,semicolemakdh,words,10,english,,NaT,1
1,2022-12-04 09:44:12.368,get,e,e,monkeytype,semicolemakdh,words,10,english,,0 days 00:00:00.177000,1
2,2022-12-04 09:44:12.986,get,t,t,monkeytype,semicolemakdh,words,10,english,,0 days 00:00:00.618000,1
3,2022-12-04 09:44:13.154,down,,,monkeytype,semicolemakdh,words,10,english,,0 days 00:00:00.168000,1
4,2022-12-04 09:44:14.400,down,d,d,monkeytype,semicolemakdh,words,10,english,,0 days 00:00:01.246000,1
...,...,...,...,...,...,...,...,...,...,...,...,...
85,2022-12-04 09:45:55.573,come,e,e,monkeytype,semicolemakdh,words,10,english,,0 days 00:00:00.652000,2
86,2022-12-04 09:45:55.746,would,,,monkeytype,semicolemakdh,words,10,english,,0 days 00:00:00.173000,2
87,2022-12-04 09:45:56.237,would,w,w,monkeytype,semicolemakdh,words,10,english,,0 days 00:00:00.491000,2
88,2022-12-04 09:45:57.021,would,o,o,monkeytype,semicolemakdh,words,10,english,,0 days 00:00:00.784000,2


Timed tests are separated after the allotted time in the length column is reached. Splitting on complete words is not necessary here, as it is with word count tests, because the timer may cut the user off before they have typed all the letters in a word.
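A minimal sketch of this separation is shown below. It rests on two assumptions that should be checked against the real data: the timer starts at the first recorded keystroke of a test, and the length column holds the allotted time in seconds. The function name `number_timed_tests` and the sample rows are hypothetical.

```python
from datetime import datetime, timedelta


def number_timed_tests(rows, first_test=1):
    """rows: list of (datetime, length_seconds) pairs in typing order.

    A new test number is assigned whenever a keystroke falls outside the time
    window opened by the first keystroke of the current test.
    """
    labels = []
    test_num = first_test
    window_end = None
    for stamp, length in rows:
        if window_end is None:
            # the first keystroke opens the first window
            window_end = stamp + timedelta(seconds=length)
        elif stamp > window_end:
            # the allotted time has run out, so this keystroke starts a new test
            test_num += 1
            window_end = stamp + timedelta(seconds=length)
        labels.append(test_num)
    return labels


# Invented example: two 10-second tests typed 30 seconds apart
start = datetime(2022, 12, 6, 6, 1, 19)
sample_rows = [
    (start, 10),
    (start + timedelta(seconds=5), 10),
    (start + timedelta(seconds=9, milliseconds=900), 10),
    (start + timedelta(seconds=30), 10),
    (start + timedelta(seconds=35), 10),
]
sample_labels = number_timed_tests(sample_rows)
```

One limitation of this sketch is that a test restarted before its timer runs out is not detected and would be merged with the test that precedes it.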

### Consistency

### Accuracy

### Reach Length

### Aggregate Typing Score

## Analysis

