This repository has been archived by the owner on May 22, 2019. It is now read-only.

GLOSSARY.md

Abstract Syntax Tree

An abstract syntax tree is a tree representing the abstract syntactic structure of a program written in a programming language. Because not all details of the real syntax are present in an AST, it is "abstract" rather than "concrete". In the tree, the branches and nodes represent structural relationships between the syntactic elements of the program it is based on. ASTs from different languages will have different features, so they are not language agnostic.
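As a small illustration of what "abstract" means here, the sketch below uses Python's standard `ast` module. That module is chosen only for demonstration and is not part of sourced.ml. It shows that redundant parentheses, which exist in the concrete syntax, leave no trace in the tree.

```python
import ast

# Two spellings of the same statement; only the concrete syntax differs.
with_parens = ast.parse("total = price * (tax)")
without_parens = ast.parse("total = price * tax")

# ast.dump() omits line/column attributes by default, so this compares structure only.
same_structure = ast.dump(with_parens) == ast.dump(without_parens)

# Names of the node types that make up the tree.
node_types = [type(node).__name__ for node in ast.walk(with_parens)]

print(same_structure)
print(node_types)
```

The node names (`Assign`, `BinOp`, `Name`, ...) are specific to Python's grammar, which is one way to see why ASTs are not language agnostic.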

AST

See Abstract Syntax Tree.

Bag-of-words model

A bag-of-words model is a model wherein text is represented as a "bag" of the words it contains. A bag-of-words model discards information about text structure, grammar and order, but preserves multiplicity, or the number of occurrences of each word in the text.
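A minimal sketch of the idea, assuming whitespace tokenization and lowercasing (both are simplifications chosen for this example; `bag_of_words` is a hypothetical helper name):

```python
from collections import Counter


def bag_of_words(text):
    """Count word occurrences, ignoring order and structure."""
    return Counter(text.lower().split())


bag = bag_of_words("the cat saw the other cat")
shuffled = bag_of_words("cat the other saw cat the")

print(bag["the"], bag["cat"], bag["saw"])
print(bag == shuffled)  # word order does not affect the bag
```

Two texts with the same words in a different order produce the same bag, while the per-word counts are preserved.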

A bow model refers to a special type of bag-of-words model, described below.

Weighted bag-of-words model

A bow model is an instance of a weighted bag-of-words model. In a weighted bag-of-words model, each word in the bag is weighted using some algorithm. For the bow model, each word in the bag is a feature extracted from source code and the associated weight is calculated using TF-IDF.

For more information on the bow model, see the documentation here.

Weighted bag-of-X model

A bag-of-words model can be generalized to a bag-of-X model. These models, sometimes called bag-of-feature models, can hold any uniform feature type. For example, it is possible to store information about some feature of a document in a vector, then dump the vectors into a bag-of-vectors. Given document frequencies and identifier embeddings, it is possible to represent a repository as a weighted bag-of-vectors.

Collection frequency

The total number of times a term appears across all documents in a collection, counting repeated occurrences within the same document. See also document frequency.

COOC

See Co-occurrence matrix.

Co-occurrence matrix

Document

Document frequency

The document frequency is defined as the number of documents in some collection of documents that contain a term or a feature. See also collection frequency.
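The difference between document frequency and collection frequency can be sketched as follows. The toy documents and the helper name `frequencies` are assumptions made for this example; this is not the docfreq model's implementation.

```python
from collections import Counter


def frequencies(documents):
    """Return (document frequency, collection frequency) for lists of terms."""
    document_freq = Counter()
    collection_freq = Counter()
    for terms in documents:
        collection_freq.update(terms)      # every occurrence counts
        document_freq.update(set(terms))   # each document counts at most once
    return document_freq, collection_freq


documents = [
    ["open", "file", "read"],
    ["file", "file", "close"],
    ["read", "line"],
]
document_freq, collection_freq = frequencies(documents)

print(document_freq["file"], collection_freq["file"])
```

Here "file" occurs three times in total but in only two documents, so its collection frequency is 3 and its document frequency is 2.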

The docfreq model represents the document frequencies of features extracted from source code; that is, how many documents (repositories, files or functions) contain each tokenized feature.

For more information on the docfreq model, see the documentation here.

Inverse document frequency

The inverse document frequency is defined as log(N/df(t)) where df(t) is the document frequency of a term t and N is the number of documents in a collection. This is used as a way to weight a term by its document frequency.
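The formula can be written directly in code. The sketch below assumes the natural logarithm, since the definition above does not fix a base, and uses a hypothetical function name.

```python
import math


def inverse_document_frequency(n_documents, document_frequency):
    """log(N / df(t)); the natural logarithm is an assumption of this sketch."""
    if document_frequency <= 0:
        raise ValueError("document_frequency must be positive")
    return math.log(n_documents / document_frequency)


common = inverse_document_frequency(100, 100)  # term in every document
rare = inverse_document_frequency(100, 1)      # term in a single document

print(common, rare)
```

A term that appears in every document gets a weight of zero, and the weight grows as the term becomes rarer.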

Features

Generally, a feature refers to any measurable property of data in the domain of a model. In the context of sourced.ml, a feature is a property of the source code sample used as input to a model. Selecting the correct features to use as inputs to a model is essential to the model's performance.

There are a number of relevant feature types used by sourced.ml:

Identifier

Token

The string "atoms" generated by the parsing process, which involves splitting text into words and stemming the resulting words.
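As a rough illustration of the splitting step only, the sketch below breaks identifiers on underscores and camelCase boundaries. The regular expression and the name `split_identifier` are assumptions for this example and do not reproduce sourced.ml's tokenizer; stemming is omitted because it would require an external stemming library.

```python
import re

# Matches, in order of preference: an acronym followed by a capitalized word,
# a (possibly capitalized) lowercase word, a trailing acronym, or a digit run.
_WORD = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+")


def split_identifier(name):
    """Split an identifier into lowercase word "atoms" (no stemming)."""
    return [part.lower() for part in _WORD.findall(name)]


print(split_identifier("getUserName"))
print(split_identifier("max_line_length"))
print(split_identifier("parseHTTPResponse"))
```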

Literal

Graphlet

The graphlet of a UAST node is composed from the node itself, its parent and its children.
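One way to picture a graphlet is as a small tuple built from a node and its immediate neighbours. The `Node` class and the node kinds below are hypothetical stand-ins for UAST nodes, and the tuple layout is an assumption of this sketch rather than the representation used by sourced.ml.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for a UAST node."""
    kind: str
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False)

    def add(self, child):
        """Attach a child and set its parent link."""
        child.parent = self
        self.children.append(child)
        return child


def graphlet(node):
    """Describe a node by (parent kind, own kind, kinds of its children)."""
    parent_kind = node.parent.kind if node.parent else None
    return (parent_kind, node.kind, tuple(c.kind for c in node.children))


root = Node("FunctionDef")
call = root.add(Node("Call"))
call.add(Node("Identifier"))
call.add(Node("Literal"))

print(graphlet(call))
print(graphlet(root))
```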

Feature extraction

Feature extraction is the process of gathering information about features from a set of data.

Identifier embeddings

The id2vec model contains information on source code identifier embeddings; that is, every identifier is represented as a dense vector.

For more information on the id2vec model, see the documentation here.

Model

A model is the artifact from running an analysis pipeline. It is plain data with some methods to access it. A model can be serialized to bytes and deserialized from bytes. The underlying storage format is specific to src-d/modelforge and is currently ASDF with lz4 compression.

Pipeline

A tree of linked sourced.ml.transformers.Transformer objects which can be executed on the PySpark/source{d} engine. The result is often written to disk as Parquet or model files, or to a database.

Quantization

Most generally, quantization is a process which maps a large set of possible inputs onto a smaller set of possible outputs. The values of the large set may be continuous/uncountable. For example, a vector quantizer takes as its input a vector, which encodes some set of features of a document. The vector quantizer maps the input vector onto the nearest vector in a set of vectors. The vectors in the output set may be thought of as the vocabulary of words that can be used; all inputs can be mapped onto one of the words in the output set.
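The vector quantizer described above can be sketched as a nearest-neighbour lookup in a fixed codebook. The codebook values, the use of squared Euclidean distance, and the name `quantize` are assumptions made for this example.

```python
def quantize(vector, codebook):
    """Return the index of the codebook vector nearest to `vector`."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(
        range(len(codebook)),
        key=lambda i: squared_distance(vector, codebook[i]),
    )


# The output "vocabulary": every input is mapped onto one of these vectors.
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]

near_one = quantize((0.9, 1.2), codebook)
near_five = quantize((4.0, 4.5), codebook)

print(near_one, near_five)
```

Any point in the continuous input space is replaced by the index of one of three codebook entries, which is the "large set to small set" mapping in miniature.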

Term frequency

A measure of how many times a term appears in a given document.

Term frequency inverse document frequency

A weighting scheme that combines term frequency with inverse document frequency. It produces a composite weight for each term in each document. The weight assigned by TF-IDF is higher when the term t is highly discriminating. This occurs when t is in relatively few documents, and thus has a high IDF, or when it occurs many times in the relevant document, and thus has a high TF.
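A minimal sketch of the composite weight, assuming raw counts for term frequency and the natural logarithm for inverse document frequency (several other variants exist; `tf_idf` and the toy documents are hypothetical):

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document, with weight = tf * log(N / df)."""
    n_documents = len(documents)
    document_freq = Counter()
    for terms in documents:
        document_freq.update(set(terms))

    weights = []
    for terms in documents:
        term_freq = Counter(terms)
        weights.append({
            term: count * math.log(n_documents / document_freq[term])
            for term, count in term_freq.items()
        })
    return weights


documents = [
    ["file", "file", "parser"],
    ["file", "socket"],
    ["file", "socket", "socket"],
]
weights = tf_idf(documents)

print(weights[0])
```

In this toy collection "file" appears in every document, so its weight is zero everywhere despite its high term frequency, while "parser" appears in only one document and receives the highest inverse document frequency.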

TF-IDF

See Term frequency inverse document frequency.

Topic modeling

In machine learning, topic modeling is a type of modeling used to find abstract "topics" that occur in a collection of documents. The process is often used to identify semantic content from documents or collections of documents automatically.

In the context of sourced.ml, topic modeling is used to identify topics of source code repositories. The topic model can be used to model the topics of a Git repository. All tokens are identifiers extracted from the repository or repositories; they serve as indicators of the abstract "topics" mentioned above and are used to infer the topic(s) of each repository.

For more information on the topic model, see the documentation here.

Transformer

A sourced.ml.transformers.Transformer object, which serves as one of a series of potential steps in transforming source code features from one form into another.

UAST

See Universal Abstract Syntax Tree.

Universal Abstract Syntax Tree

A generalized version of an abstract syntax tree. It is further abstracted away from any concrete details about the parent program, allowing different programming languages to have their programs converted into UASTs, which are language agnostic.

This is achieved using Babelfish, a universal code parser.

Weighted MinHash

An algorithm to approximate the Weighted Jaccard Similarity between all the pairs of source code samples in linear time and space. Described by Sergey Ioffe.
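For reference, the quantity being approximated can be computed exactly for a single pair of samples. The sketch below is not Weighted MinHash itself: it only evaluates the Weighted Jaccard Similarity directly, assuming non-negative weights stored in dictionaries. The function name and the sample weights are hypothetical.

```python
def weighted_jaccard(a, b):
    """Exact Weighted Jaccard Similarity: sum of minima over sum of maxima."""
    keys = set(a) | set(b)
    numerator = sum(min(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    denominator = sum(max(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    # Returning 0.0 for two empty samples is a convention chosen for this sketch.
    return numerator / denominator if denominator else 0.0


sample_a = {"open": 2.0, "file": 1.0}
sample_b = {"open": 1.0, "file": 1.0, "close": 3.0}

similarity = weighted_jaccard(sample_a, sample_b)
print(similarity)
```

Computing this exactly for every pair grows quadratically with the number of samples, which is the cost an approximation scheme such as Weighted MinHash is meant to avoid.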