ACTCD19: Andrew's C/C++ Token Count Dataset 2019

See the paper [http://www.tomazos.com/actcd16.pdf] for background. The difference between ACTCD19 and ACTCD16 is that ACTCD19 uses Debian Sid (Unstable) as the base distribution, which should be about 4 years fresher than the one used for ACTCD16.

Released Artifacts:

ACTCD19-2TOK.txt.gz (23MB)

A text file containing lines of the form:

OCCURENCES TOKEN1 TOKEN2

The unique token spellings in the corpus were identified, counted and ranked. We define a common token as a token whose rank is 65536 or better, that is, one of the 65536 most frequently occurring spellings.

We then identified, counted and ranked unique consecutive pairs of common tokens. (This is known as a 2-gram language model.)
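As a rough illustration of the two counting steps described above, here is a minimal Python sketch. It assumes the corpus has already been tokenized into one list of token spellings per source file. The function name `count_common_pairs`, its parameters, the choice not to pair tokens across file boundaries, and the choice to skip any pair containing an uncommon token are all assumptions made for illustration; this is not the tooling that produced ACTCD19.

```python
from collections import Counter
from itertools import chain


def count_common_pairs(files, max_common=65536):
    """Count consecutive pairs of common tokens.

    files: iterable of token lists, one per source file (hypothetical input).
    max_common: how many of the most frequent spellings count as common.
    """
    files = [list(tokens) for tokens in files]

    # Step 1: count unique token spellings and keep the top-ranked ones.
    token_counts = Counter(chain.from_iterable(files))
    common = {tok for tok, _ in token_counts.most_common(max_common)}

    # Step 2: count consecutive pairs in which both tokens are common.
    pair_counts = Counter()
    for tokens in files:
        for first, second in zip(tokens, tokens[1:]):
            if first in common and second in common:
                pair_counts[(first, second)] += 1
    return pair_counts


if __name__ == "__main__":
    # Tiny made-up corpus, only to show the call shape.
    demo = [["int", "x", "=", "0", ";"], ["int", "y", "=", "x", ";"]]
    for (first, second), n in count_common_pairs(demo).most_common(3):
        print(n, first, second)
```

Ties at the cutoff rank are broken arbitrarily in this sketch; how the real dataset handles them is not stated here.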

OCCURENCES is the number of occurrences of the unique token pair, TOKEN1 is the first token of the pair, and TOKEN2 is the second token of the pair.
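For readers who want to load the file, the following is a minimal Python sketch of a reader for the line format described above. It assumes the three fields are separated by whitespace, that token spellings contain no whitespace themselves, and that the text is UTF-8; none of these is confirmed by this README. The names `parse_pair_line` and `read_pair_counts` are hypothetical.

```python
import gzip


def parse_pair_line(line):
    """Parse one 'OCCURENCES TOKEN1 TOKEN2' line into (count, token1, token2).

    Returns None when the line does not split into exactly three
    whitespace-separated fields with an integer first field
    (assumed format, not confirmed by the README).
    """
    fields = line.split()
    if len(fields) != 3:
        return None
    try:
        count = int(fields[0])
    except ValueError:
        return None
    return count, fields[1], fields[2]


def read_pair_counts(path="ACTCD19-2TOK.txt.gz"):
    """Yield (count, token1, token2) tuples from the gzip-compressed file."""
    # errors="replace" is a defensive guess; the file's encoding is not stated.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            record = parse_pair_line(line)
            if record is not None:
                yield record
```

Iterating over `read_pair_counts()` streams the file one line at a time, so the full decompressed text never has to sit in memory. Lines that do not fit the assumed shape are skipped silently; a stricter reader might count or log them instead.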

