Machine learning models for MLonCode trained using the source{d} stack
Latest commit e4ac410 Oct 3, 2018

source{d} MLonCode models


Weighted bags-of-words, that is, every bag holds the features extracted from a piece of source code, and each feature is associated with a weight obtained by applying TF-IDF.


from import BOW
bow = BOW().load(bow)
print("Number of documents:", len(bow))
print("Number of tokens:", len(bow.tokens))
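
To make the weighting concrete, here is a minimal, self-contained sketch of TF-IDF over toy identifier lists. It does not use the source{d} stack: the function name `tfidf_bags`, the sample documents, and the exact formula (raw term count times `log(N / df)`) are assumptions chosen for illustration, and the published models may use a different TF-IDF variant.

```python
import math
from collections import Counter


def tfidf_bags(documents):
    """Return one {token: weight} dict per document (hypothetical helper)."""
    n_docs = len(documents)
    # Document frequency: in how many documents each token appears.
    docfreq = Counter(token for doc in documents for token in set(doc))
    bags = []
    for doc in documents:
        counts = Counter(doc)  # raw term frequency within this document
        bags.append({
            token: count * math.log(n_docs / docfreq[token])
            for token, count in counts.items()
        })
    return bags


# Toy "documents": identifier tokens extracted from three imaginary files.
docs = [
    ["read", "file", "path", "file"],
    ["write", "file", "buffer"],
    ["parse", "token", "buffer"],
]
bags = tfidf_bags(docs)
print(bags[0])
```

Tokens that occur in fewer documents (such as `read`) get a larger weight per occurrence than tokens shared across documents (such as `file`).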

4 models:


Document frequencies of features extracted from source code, that is, how many documents (repositories, files or functions) contain each tokenized feature.


from import DocumentFrequencies
df = DocumentFrequencies().load(docfreq)
print("Number of tokens:", len(df))
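
The choice of what counts as a document changes the resulting frequencies. The sketch below counts the same tokens at file level and at repository level. The corpus layout and the `docfreq` helper are hypothetical and only illustrate the idea; they are not the code that produced the published models.

```python
from collections import Counter

# Hypothetical toy corpus: repository -> file -> identifier tokens.
corpus = {
    "repo_a": {"io.py": ["read", "file"], "net.py": ["read", "socket"]},
    "repo_b": {"main.py": ["file", "parse"]},
}


def docfreq(documents):
    """Count how many documents contain each token at least once."""
    counts = Counter()
    for tokens in documents:
        counts.update(set(tokens))  # set(): one vote per document
    return counts


# Each file is a document.
file_df = docfreq(
    tokens for files in corpus.values() for tokens in files.values()
)
# Each repository is a document: merge the tokens of all its files.
repo_df = docfreq(
    [token for tokens in files.values() for token in tokens]
    for files in corpus.values()
)

print("file level:", dict(file_df))
print("repo level:", dict(repo_df))
```

Here `read` appears in two files but only one repository, so its document frequency depends on the granularity.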

2 models:


Source code identifier embeddings, that is, every identifier is represented by a dense vector.


from import Id2Vec
id2vec = Id2Vec().load(id2vec)
print("Number of tokens:", len(id2vec))
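
A common use of dense identifier vectors is to look up similar identifiers by cosine similarity. The sketch below shows that lookup on made-up three-dimensional vectors. The `embeddings` dict, its values, and the `nearest` helper are assumptions for illustration and do not reflect the `Id2Vec` model's actual interface or dimensionality.

```python
import math

# Hypothetical toy embeddings; real models use many more dimensions.
embeddings = {
    "read": [0.9, 0.1, 0.0],
    "write": [0.8, 0.2, 0.1],
    "socket": [0.0, 0.3, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(token, vectors):
    """Return the other identifier whose vector is closest to `token`."""
    return max(
        (other for other in vectors if other != token),
        key=lambda other: cosine(vectors[token], vectors[other]),
    )


print(nearest("read", embeddings))
```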

2 models:


Topic modeling of Git repositories. All tokens are identifiers extracted from repositories and seen as indicators for topics. They are used to infer the topic(s) of repositories.


from import Topics
topics = Topics().load(topics)
print("Number of topics:", len(topics))
print("Number of tokens:", len(topics.tokens))
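
As a rough illustration of how identifiers can act as topic indicators, the sketch below scores a repository against two made-up topics by summing per-token weights and normalizing. The topic names, weights, and the `infer_topics` helper are hypothetical; the actual inference used with the `Topics` model may work differently.

```python
from collections import Counter

# Hypothetical topic -> {token: weight} table.
topic_weights = {
    "networking": {"socket": 0.6, "request": 0.3, "buffer": 0.1},
    "parsing": {"token": 0.5, "parse": 0.4, "buffer": 0.1},
}


def infer_topics(tokens, weights_by_topic):
    """Return a normalized score per topic for one repository's tokens."""
    counts = Counter(tokens)
    scores = {
        topic: sum(weight * counts[token] for token, weight in weights.items())
        for topic, weights in weights_by_topic.items()
    }
    total = sum(scores.values())
    if total == 0:
        # No known indicator tokens: every topic scores zero.
        return {topic: 0.0 for topic in scores}
    return {topic: score / total for topic, score in scores.items()}


# Identifiers extracted from an imaginary repository.
repo_tokens = ["socket", "socket", "buffer", "parse"]
scores = infer_topics(repo_tokens, topic_weights)
print(max(scores, key=scores.get), scores)
```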

1 model: