BOW

Bags of features extracted in July 2018 from 7.8 million distinct files in PGA (HEAD commit only), using all extractors implemented in sourced.ml at the time (identifiers, literals, graphlets, children, node2vec and uast2seq) and all languages parsable by Babelfish (Go, Java, Python, Bash, JavaScript and Ruby). The goal was to try running apollo at scale. We hit scipy.sparse limits while trying to merge the sparse matrices for all bags, so this is only one of three BOW models holding the bags.
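To make the merging step concrete, here is a minimal sketch of stacking several bag matrices over a shared vocabulary with scipy.sparse. It is not the code that was used to build this model: the helper names `merge_bags` and `fits_int32` and the toy data are hypothetical. Which scipy.sparse limit was actually hit is not recorded here; the check against 32-bit index arrays is only an assumed, commonly encountered candidate.

```python
import numpy as np
from scipy import sparse

INT32_MAX = np.iinfo(np.int32).max


def merge_bags(parts):
    """Stack (matrix, tokens) pairs into one matrix over the union vocabulary.

    Hypothetical helper: each part is a (csr_matrix, list_of_tokens) pair,
    where column j of the matrix holds the counts for tokens[j].
    """
    vocab = sorted({tok for _, tokens in parts for tok in tokens})
    index = {tok: i for i, tok in enumerate(vocab)}
    aligned = []
    for matrix, tokens in parts:
        coo = matrix.tocoo()
        # Map each part-local column index to its column in the shared vocabulary.
        cols = np.array([index[t] for t in tokens], dtype=np.int64)[coo.col]
        aligned.append(
            sparse.csr_matrix(
                (coo.data, (coo.row, cols)),
                shape=(matrix.shape[0], len(vocab)),
            )
        )
    return sparse.vstack(aligned, format="csr"), vocab


def fits_int32(matrices):
    """Assumed limit: the merged non-zero count must fit in a 32-bit index."""
    return sum(m.nnz for m in matrices) <= INT32_MAX


# Toy data standing in for two much larger bags.
part_a = (sparse.csr_matrix(np.array([[1, 0, 2], [0, 3, 0]])), ["foo", "bar", "baz"])
part_b = (sparse.csr_matrix(np.array([[4, 5]])), ["baz", "qux"])

merged, vocab = merge_bags([part_a, part_b])
print(vocab)
print(merged.toarray())
print("fits in int32:", fits_int32([part_a[0], part_b[0]]))
```

At the scale described above, a pre-check like `fits_int32` is cheap compared to the merge itself, so it can be run before deciding whether to keep the bags in separate models.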

Example:

from sourced.ml.models import BOW
bow = BOW().load("694c20a0-9b96-4444-80ae-f2fa5bd1395b")
print("Number of documents:", len(bow))
print("Number of tokens:", len(bow.tokens))

References

ID: 694c20a0-9b96-4444-80ae-f2fa5bd1395b
Uploaded: 2018-07-17 10:28:56.243131
Version: 1.0.0
File: https://storage.googleapis.com/models.cdn.sourced.tech/models%2Fbow%2F694c20a0-9b96-4444-80ae-f2fa5bd1395b.asdf
Size: 26.0 GB
Data collection date: July 2018
Number of distinct documents (files): 3,512,171
Number of distinct features: 6,194,874
Other parts: da8c5dee-b285-4d55-8913-a5209f716564 and 1e0deee4-7dc1-400f-acb6-74c0f4aec471

License

Dependencies