The dataset was extracted from Public Git Archive (PGA) and consists of:
- 49 million distinct identifiers - 1 GB
- identifiers per language - 1 GB, processed the same way as the full identifier list above but extracted only from files in specific programming languages: Python, JavaScript, C, C++, PHP, Ruby, C#, Java, Shell, Go, Objective-C.
Each identifier record has the following fields:
num_files - number of files in which the identifier was found
num_occ - total number of times the identifier was found
num_repos - number of repositories in which the identifier was found
token - the value of the identifier
token_split - the parts of the identifier after splitting with the sourced-ml heuristics
All the stats correspond to the HEAD revision of each repository in PGA.
- A Jupyter notebook that reads the per-language identifiers and plots the statistics.
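
As a rough illustration of how the fields described above might be consumed, here is a minimal Python sketch. It assumes the records are stored as CSV with a header row naming the five fields, and that token_split holds the parts separated by spaces; neither is confirmed by this description. The sample rows and the load_identifiers helper are hypothetical and exist only to make the example self-contained.

```python
import csv
import io

# Hypothetical sample rows, made up for illustration only.
# Assumption: the real files are CSV with these five columns.
SAMPLE = """num_files,num_occ,num_repos,token,token_split
120,450,35,getUserName,get user name
8,16,2,parse_http_header,parse http header
"""


def load_identifiers(stream):
    """Parse identifier records, converting the count columns to int."""
    records = []
    for row in csv.DictReader(stream):
        records.append({
            "token": row["token"],
            # Assumption: the split parts are separated by whitespace.
            "token_split": row["token_split"].split(),
            "num_files": int(row["num_files"]),
            "num_occ": int(row["num_occ"]),
            "num_repos": int(row["num_repos"]),
        })
    return records


records = load_identifiers(io.StringIO(SAMPLE))
for rec in records:
    # Average number of occurrences per file containing the identifier.
    occ_per_file = rec["num_occ"] / rec["num_files"]
    print(rec["token"], rec["token_split"], round(occ_per_file, 2))
```

To try this on the actual data, the in-memory sample would be replaced by an open file handle, after adjusting the column names and delimiter to whatever the real files use.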