ACTCD19: Andrew's C/C++ Token Count Dataset 2019

See paper [] for background. The difference between ACTCD19 and ACTCD16 is that ACTCD19 uses Debian Sid (Unstable) as the base distribution, so its corpus should be about 4 years fresher than that of ACTCD16.

Released Artifacts:

A text file containing lines of the form:


The unique token spellings in the corpus were identified, counted and ranked by frequency. We define a common token as a token whose rank is 65536 or better, that is, one of the 65536 most frequent spellings.
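
As a rough illustration of the ranking step, the sketch below counts token spellings and keeps the most frequent ones. It is not the tooling used to build the dataset. The function name common_tokens and the parameter max_rank are hypothetical, and the code assumes that rank 1 is the most frequent spelling and that common tokens are those within the top 65536 ranks.

```python
from collections import Counter


def common_tokens(tokens, max_rank=65536):
    """Return the set of spellings ranked within the top max_rank by count.

    Assumption: rank 1 is the most frequent spelling. Ties are broken by
    first appearance in the input, which the dataset itself may not do.
    """
    counts = Counter(tokens)
    # most_common() orders spellings by descending count.
    return {spelling for spelling, _ in counts.most_common(max_rank)}
```

With a small max_rank the effect is easy to see: for the tokens a, b, a, c, a, b and max_rank=2, only a and b are kept.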

We then identified, counted and ranked unique consecutive pairs of common tokens. (This is known as a 2-gram language model.)
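
The pair-counting step can be sketched as follows. This is only one plausible reading of the description: the function name count_common_pairs is hypothetical, and the code assumes that a pair is skipped whenever either of its two tokens is not common (rather than uncommon tokens being removed first and the remaining neighbours paired).

```python
from collections import Counter


def count_common_pairs(tokens, common):
    """Count consecutive (token1, token2) pairs where both tokens are common.

    tokens is a list of token spellings in source order;
    common is the set of spellings treated as common tokens.
    """
    pairs = Counter()
    # zip(tokens, tokens[1:]) yields each token with its immediate successor.
    for first, second in zip(tokens, tokens[1:]):
        if first in common and second in common:
            pairs[(first, second)] += 1
    return pairs
```

Calling pairs.most_common() on the result lists the pairs from most to least frequent, which gives the ranking.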

OCCURRENCES is the number of occurrences of each unique token pair. TOKEN1 is the first token of the pair and TOKEN2 is the second.
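
A line of the released file could be read along these lines. The exact layout is an assumption: the sketch supposes three whitespace-separated fields in the order count, first token, second token, and that token spellings contain no whitespace. The function name parse_pair_line is hypothetical; check it against the actual file before relying on it.

```python
def parse_pair_line(line):
    """Parse one line into (occurrences, token1, token2).

    Assumed layout (not confirmed by the dataset description):
    three whitespace-separated fields, count first.
    """
    fields = line.split()
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}: {line!r}")
    count, token1, token2 = fields
    return int(count), token1, token2
```

If tokens in the file can contain spaces (for example string literals), this splitting rule would need to be replaced by whatever delimiter the file actually uses.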

