
tomazos/actcd19

ACTCD19: Andrew's C/C++ Token Count Dataset 2019

See the paper [http://www.tomazos.com/actcd16.pdf] for background. The difference between ACTCD19 and ACTCD16 is that ACTCD19 uses Debian Sid (Unstable) as the base distribution, which should be about 4 years fresher than the one used for ACTCD16.

Released Artifacts:

A text file containing lines of the form:

OCCURENCES TOKEN1 TOKEN2

The unique token spellings in the corpus were identified, counted and ranked by number of occurrences. We define a common token as a token ranked 65536 or better, that is, one of the 65536 most frequent token spellings.

We then identified, counted and ranked unique consecutive pairs of common tokens. (This is known as a 2-gram language model.)
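The counting procedure described above can be sketched roughly as follows. This is a minimal illustration, not the tooling actually used to build the dataset: it assumes the corpus is already available as one flat list of token spellings, that "common" means the `max_rank` most frequent spellings, and that a pair is counted only when both of its tokens are common. The function names `common_tokens` and `count_common_pairs` are hypothetical.

```python
from collections import Counter


def common_tokens(tokens, max_rank):
    """Return the set of the max_rank most frequent token spellings."""
    counts = Counter(tokens)
    return {spelling for spelling, _ in counts.most_common(max_rank)}


def count_common_pairs(tokens, max_rank=65536):
    """Count consecutive token pairs in which both tokens are common.

    Assumption: `tokens` is a single in-order sequence of token spellings.
    How the real dataset treats file boundaries is not stated in this README.
    """
    common = common_tokens(tokens, max_rank)
    pairs = Counter()
    for first, second in zip(tokens, tokens[1:]):
        if first in common and second in common:
            pairs[(first, second)] += 1
    return pairs


# Tiny made-up token stream, for illustration only.
tokens = ["int", "x", "=", "0", ";", "int", "y", "=", "0", ";"]
pairs = count_common_pairs(tokens)

# Ranked output in the same field order as the released text file.
for (token1, token2), occurrences in pairs.most_common():
    print(occurrences, token1, token2)
```

Lowering `max_rank` shrinks the vocabulary, so any pair containing a rarer token drops out of the counts.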

OCCURENCES is the number of occurrences of the token pair in the corpus. TOKEN1 is the first token of the pair, and TOKEN2 is the second token of the pair.
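A reader for this line format might look like the sketch below. It assumes the three fields are separated by whitespace and that no token spelling itself contains whitespace; the README does not confirm either, so lines that do not split into exactly three fields are rejected rather than guessed at. The names `PairCount` and `parse_pair_lines` are hypothetical, and the sample lines are invented, not taken from the dataset.

```python
from typing import Iterable, Iterator, NamedTuple


class PairCount(NamedTuple):
    occurrences: int
    token1: str
    token2: str


def parse_pair_lines(lines: Iterable[str]) -> Iterator[PairCount]:
    """Parse lines of the form 'OCCURENCES TOKEN1 TOKEN2'.

    Assumption: fields are whitespace-separated and tokens contain
    no whitespace. Blank lines are skipped.
    """
    for line_number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue
        if len(fields) != 3:
            raise ValueError(
                f"line {line_number}: expected 3 fields, got {len(fields)}"
            )
        yield PairCount(int(fields[0]), fields[1], fields[2])


# Invented sample lines, for illustration only.
sample = ["120 ) ;", "75 int main"]
records = list(parse_pair_lines(sample))
```

To read the released file, the same function can be passed an open file object, since iterating over a text file yields its lines.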
