
Identifiers (size: 1.0 GB)

Paper (accepted to ML4P'18).

The dataset was extracted from Public Git Archive (PGA) and consists of:

  1. 49 million distinct identifiers - 1 GB
  2. identifiers per language - 1 GB; processed the same way as (1), but extracted separately from files in each of these programming languages: Python, JavaScript, C, C++, PHP, Ruby, C#, Java, Shell, Go, Objective-C.


CSV, columns:

  • num_files - number of files where the identifier was found
  • num_occ - number of times the identifier was found overall
  • num_repos - number of repositories in which the identifier was found
  • token - the value of the identifier
  • token_split - the parts of the identifier after splitting with the sourced-ml heuristics

All the stats correspond to the HEAD revision of each repository in PGA.
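As an illustration of the column layout above, here is a minimal sketch that reads identifier rows with Python's standard csv module and ranks them by num_occ. It assumes the file has a header row with the column names listed above; the sample rows, the space-separated form of token_split, and the helper names load_identifiers and top_by are made up for this example and are not taken from the dataset.

```python
import csv
import io

# Hypothetical sample rows, for illustration only (not real dataset values).
# The space-separated token_split format is an assumption.
SAMPLE = """num_files,num_occ,num_repos,token,token_split
10,25,3,getValue,get value
4,4,2,parse_args,parse args
7,30,1,idx,idx
"""


def load_identifiers(stream):
    """Read identifier rows from a CSV stream, converting count columns to int."""
    rows = []
    for row in csv.DictReader(stream):
        for col in ("num_files", "num_occ", "num_repos"):
            row[col] = int(row[col])
        rows.append(row)
    return rows


def top_by(rows, column, n=10):
    """Return the n rows with the largest value in the given count column."""
    return sorted(rows, key=lambda r: r[column], reverse=True)[:n]


rows = load_identifiers(io.StringIO(SAMPLE))
for r in top_by(rows, "num_occ", n=2):
    # Occurrences per file: how often the identifier repeats within a file on average.
    print(r["token"], r["num_occ"], r["num_occ"] / r["num_files"])
```

To run this against a downloaded file, replace the in-memory sample with an opened file object, for example open(path, newline="", encoding="utf-8"), where path points to wherever the CSV was saved. Note that load_identifiers keeps every row in memory, so for the full 49 million rows it may be preferable to iterate over csv.DictReader directly.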

Code examples

  • Jupyter notebook which reads the per-language identifiers (2) and plots the statistics.


Open Data Commons Open Database License (ODbL)
