Bags of features extracted in July 2018 from 7.8 million distinct files in PGA (HEAD commits only), using all extractors implemented at the time (identifiers, literals, graphlets, children, node2vec and uast2seq) and all languages parsable by Babelfish (Go, Java, Python, Bash, JavaScript and Ruby). The goal was to try apollo at scale. We hit scipy.sparse limits while trying to merge the sparse matrices for all bags, so this is only one of three BOW models holding the bags.


from sourced.ml.models import BOW

# Load this part of the bags by its model ID.
bow = BOW().load("694c20a0-9b96-4444-80ae-f2fa5bd1395b")
print("Number of documents:", len(bow))
print("Number of tokens:", len(bow.tokens))


ID: 694c20a0-9b96-4444-80ae-f2fa5bd1395b
Uploaded: 2018-07-17 10:28:56.243131
Version: 1.0.0
Size: 26.0 GB
Data collection date: July 2018
Number of distinct documents (files): 3,512,171
Number of distinct features: 6,194,874
Other parts: da8c5dee-b285-4d55-8913-a5209f716564 and 1e0deee4-7dc1-400f-acb6-74c0f4aec471
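
Because the bags are split across three models, using them together means aligning their token vocabularies first. The following is only a minimal sketch of that idea on toy data, in pure Python. It is not the apollo or sourced.ml merge code: the `Part` layout (a token list plus rows mapping column index to weight), the `merge_bags` helper and the sample tokens are all hypothetical, chosen for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for one BOW part: (tokens, rows), where each row
# maps a column index (position in `tokens`) to a feature weight.
Part = Tuple[List[str], List[Dict[int, float]]]


def merge_bags(parts: List[Part]) -> Part:
    """Merge parts into one shared vocabulary and one list of sparse rows."""
    vocab: Dict[str, int] = {}
    merged_rows: List[Dict[int, float]] = []
    for tokens, rows in parts:
        # Map this part's column indices to indices in the shared vocabulary,
        # assigning a new index the first time a token is seen.
        remap = [vocab.setdefault(tok, len(vocab)) for tok in tokens]
        for row in rows:
            merged_rows.append({remap[col]: w for col, w in row.items()})
    # Recover the token list in shared-index order.
    merged_tokens = sorted(vocab, key=vocab.get)
    return merged_tokens, merged_rows


# Toy data (made up): two parts that share the token "i.bar".
part_a: Part = (["i.foo", "i.bar"], [{0: 1.0, 1: 2.0}])
part_b: Part = (["i.bar", "l.42"], [{0: 3.0}, {1: 1.0}])

tokens, rows = merge_bags([part_a, part_b])
print(tokens)
print(rows)
```

Keeping rows as dictionaries sidesteps any matrix size limits in this toy setting, but it would not scale to millions of documents; at that size, processing the three parts one at a time is likely the more practical route.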