Documentation
Here, we document the inner workings of vitrivr-engine, introduce the concepts it employs, and aim to provide a good overview of its components.
This chapter introduces common terminology.
In content-based multimedia retrieval, the aim is to search within multimedia collections (e.g. video, image, audio, 3D objects) on a content, and hence semantic, level. This is a non-trivial problem due to the so-called semantic gap: the stark difference in the semantic understanding of content between humans and machines. Recent developments in foundation models have narrowed this gap, yet various techniques are still needed to search efficiently within large collections of multimedia data.
In (multimedia) retrieval, a common distinction is made between two phases. During the ingestion phase (also known as the offline phase), the multimedia content is analysed and representations of the content are stored in an efficient way for later use.
The retrieval phase (also known as the online phase) describes actions performed after ingestion: (user) queries to the system are analysed in the same manner as the multimedia data was, and query and content are compared on the basis of those representations. The outcome is usually a list of results, each with an accompanying similarity score that indicates how similar the result is to the query. Commonly, a similarity score of 1 represents identity, while a similarity score of 0 indicates the greatest dissimilarity.
In multimedia retrieval, a feature is a means of representing the multimedia content.
A very primitive feature is the average colour: given an image (either a still image or a frame from a video), one calculates the average colour by averaging the individual pixels' RGB values. While this is not very expressive on its own, it demonstrates how features work.
During ingestion, the average colour is calculated for all the input data (again, this could be for example a bunch of images or a couple of representative frames from a video) and stored in the database as three-element vectors (R,G,B).
At retrieval time, the query consists of a single three-element vector (R,G,B) and a Nearest Neighbour Search (NNS) is performed on those average colour vectors. The distance is then converted to a similarity score s on the interval [0, 1].
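The average colour example can be sketched in a few lines of Python. This is an illustration only, not vitrivr-engine's implementation: the function names, the brute-force search and the choice to normalise the Euclidean distance by the largest possible RGB distance are all assumptions.

```python
import math

def average_colour(pixels):
    """Average the RGB values of a list of (r, g, b) pixels."""
    n = len(pixels)
    return tuple(sum(p[c] for p in pixels) / n for c in range(3))

# Largest possible Euclidean distance between two RGB vectors (black vs. white).
MAX_DISTANCE = math.sqrt(3 * 255 ** 2)

def similarity(a, b):
    """Map the Euclidean distance of two colour vectors to a score in [0, 1]."""
    return 1.0 - math.dist(a, b) / MAX_DISTANCE

def nearest_neighbours(query, stored, k=2):
    """Brute-force NNS: rank stored vectors by their similarity to the query."""
    scored = [(rid, similarity(query, vec)) for rid, vec in stored.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]

# Ingestion: compute and store one vector per (tiny, made-up) image.
images = {
    "red": [(255, 0, 0), (250, 10, 10)],
    "green": [(0, 255, 0), (10, 250, 10)],
    "grey": [(128, 128, 128), (120, 120, 120)],
}
index = {name: average_colour(pixels) for name, pixels in images.items()}

# Retrieval: the query is a single (R, G, B) vector.
results = nearest_neighbours((255, 0, 0), index)
```

Here `results` is a ranked list of `(id, score)` pairs, which mirrors the list of results with accompanying similarity scores described above.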
- Basics: Wikipedia
- Research: vitrivr
- Book: Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval, ACM Press Books, 1999 (1st edition), 2011 (2nd edition)
There are a lot of (research) publications out there which cover (multimedia) retrieval in great detail.
vitrivr-engine's data model is based on almost a decade of research in multimedia retrieval. Influenced by its predecessor, the retrieval engine Cineast, the aim of the data model is to be as flexible as possible while still providing foundational guidelines for consumers of vitrivr-engine.
In vitrivr-engine, a retrievable is the unit of retrieval and the logical representation of multimedia data. Depending on the type of multimedia, one (e.g. image) or more (e.g. video) retrievables exist.
For an image file, a single retrievable of the type SOURCE:IMAGE is created.
For a video file, a single retrievable of the type SOURCE:VIDEO is created, along with a number of retrievables of the type SEGMENT, depending on the segmentation strategy. A 30s video with a fixed-length segmentation of 1s results in 31 retrievables: one per second plus the one for the file. The one-second-segment retrievables have a partOf relationship towards the source retrievable.
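As a rough sketch of the video case, the following Python snippet builds the retrievables and relationships for a fixed-length segmentation. The `Retrievable` class, the `segment_video` helper and the representation of relationships as triples are hypothetical and only serve to make the counting explicit.

```python
import math
import uuid
from dataclasses import dataclass, field

@dataclass
class Retrievable:
    type: str  # e.g. "SOURCE:VIDEO" or "SEGMENT"
    id: str = field(default_factory=lambda: str(uuid.uuid4()))

def segment_video(duration_s, segment_length_s):
    """Return (retrievables, relationships) for a fixed-length segmentation."""
    source = Retrievable("SOURCE:VIDEO")
    count = math.ceil(duration_s / segment_length_s)
    segments = [Retrievable("SEGMENT") for _ in range(count)]
    # Each relationship is a (subject, predicate, object) triple.
    relationships = [(seg.id, "partOf", source.id) for seg in segments]
    return [source] + segments, relationships

retrievables, relationships = segment_video(duration_s=30, segment_length_s=1)
```

For the 30s video this yields 31 retrievables (one source, 30 segments) and 30 `partOf` relationships.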
The descriptor describes a retrievable in vitrivr-engine. The fundamental concept is that a retrievable's content is represented by descriptors, which are based on features.
For an image file and the average colour example: The source retrievable is described by one average colour descriptor. For a 30s video file and the average colour example: Each of the 30 one-second-segment retrievables is described by one average colour descriptor; the source retrievable is not described.
In vitrivr-engine, there are four distinct high-level types of descriptors:
- Vector descriptors have a type (e.g. float) and a length. Ideal for NNS.
- Struct descriptors have pre-defined sub-fields of various types.
- Scalar descriptors consist of a single typed value.
- Tensor descriptors represent a mathematical tensor. Not yet implemented [June, 2024]
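To make the three implemented descriptor kinds more tangible, here is a minimal sketch using Python dataclasses. The class layout, attribute names and example values are assumptions for illustration and do not mirror vitrivr-engine's actual classes.

```python
from dataclasses import dataclass
from typing import Dict, List, Union

Value = Union[bool, int, float, str]

@dataclass
class ScalarDescriptor:
    retrievable_id: str
    value: Value  # a single typed value

@dataclass
class VectorDescriptor:
    retrievable_id: str
    vector: List[float]  # typed elements, suited for NNS

    @property
    def length(self) -> int:
        return len(self.vector)

@dataclass
class StructDescriptor:
    retrievable_id: str
    fields: Dict[str, Value]  # pre-defined sub-fields of various types

average_colour = VectorDescriptor("segment-1", [0.8, 0.1, 0.1])
file_metadata = StructDescriptor("source-1", {"path": "video.mp4", "size": 10000})
duration = ScalarDescriptor("source-1", 30.0)
```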
vitrivr-engine operates on the notion of a named schema, similar to a database or a collection, which essentially provides, among other things, a namespace.
```json
{
  "schemas": {
    "my-schema": {}
  }
}
```
Each schema has to have a database connection which describes where the schema is persisted (and read from). The database supported by vitrivr-engine is CottontailDB.
```json
{
  "database": "CottontailConnectionProvider",
  "parameters": {
    "Host": "127.0.0.1",
    "port": "1865"
  }
}
```
In vitrivr-engine, the term field represents features which are to be used. In particular, each field is uniquely named and might be parameterised.
Note: In technical terms, each field has to be backed by an Analyser, whose output is a descriptor. During ingestion, the analyser produces the descriptor that represents a retrievable; during retrieval, the analysis step involves the execution of a query using the derived descriptor.
```json
"uniqueName": {
  "factory": "FactoryClass",
  "parameters": {
    "key": "value"
  }
}
```
A note about fields in vitrivr-engine: Owing to its highly modular architecture, vitrivr-engine ships with a handful of features that can be used as fields. The toy example is the AverageColor. Depending on the use case, custom features can be added.
See the analyser / field overview.
In contrast to an analyser / a field, an exporter in vitrivr-engine produces new, derived data.
```json
"uniqueName": {
  "factory": "FactoryClass",
  "resolverName": "resolverName",
  "parameters": {
    "key": "value"
  }
}
```
A resolver is responsible for resolving a physical resource based on information present in a retrievable.
```json
"uniqueName": {
  "factory": "FactoryClass",
  "parameters": {
    "key": "value"
  }
}
```
The schema configuration is the foundation of vitrivr-engine and therefore required on startup. The configuration consists of blocks for the database connection (one), fields (many), exporters (many), and resolvers (many):
```json
{
  "schemas": {
    "schema-name": {
      "connection": {
        "database": "CottontailConnectionProvider",
        "parameters": {
          "Host": "127.0.0.1",
          "port": "1865"
        }
      },
      "fields": {
        "my-field-1": {
          "factory": "AnalyserFactory"
        },
        "my-other-field": {
          "factory": "AnotherAnalyserFactory"
        }
      },
      "resolvers": {
        "my-resolver": {
          "factory": "ResolverFactory",
          "parameters": {
            "key": "value"
          }
        }
      },
      "exporters": {
        "my-exporter": {
          "factory": "ExporterFactory",
          "resolverName": "my-resolver",
          "parameters": {
            "key1": "value1",
            "key2": "value2"
          }
        }
      },
      "extractionPipelines": {
        "my-video-pipeline": {
          "path": "./videos.json"
        },
        "my-image-pipeline": {
          "path": "./images.json"
        }
      }
    }
  }
}
```
The newly introduced property extractionPipelines is a list of named ingestion pipelines, each with the path to the JSON file containing its pipeline configuration.
This is useful if pre-defined ingestion pipelines are to be used. However, the pipeline configuration can also be provided on-the-fly, which is why this property is optional.
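Because exporters refer to resolvers by name, a quick consistency check of a schema configuration can catch dangling references before startup. The following is a minimal sketch, assuming the map-based layout shown in this section; `check_schema` is a hypothetical helper and not part of vitrivr-engine.

```python
import json

def check_schema(schema):
    """Return a list of problems found in one schema block (empty if none)."""
    problems = []
    if "connection" not in schema:
        problems.append("missing database connection")
    resolvers = schema.get("resolvers", {})
    for name, exporter in schema.get("exporters", {}).items():
        if exporter.get("resolverName") not in resolvers:
            problems.append(f"exporter '{name}' references an unknown resolver")
    return problems

config = json.loads("""
{
  "schemas": {
    "schema-name": {
      "connection": {"database": "CottontailConnectionProvider",
                     "parameters": {"Host": "127.0.0.1", "port": "1865"}},
      "fields": {"my-field-1": {"factory": "AnalyserFactory"}},
      "resolvers": {"my-resolver": {"factory": "ResolverFactory"}},
      "exporters": {"my-exporter": {"factory": "ExporterFactory",
                                    "resolverName": "my-resolver"}}
    }
  }
}
""")

problems = {name: check_schema(schema) for name, schema in config["schemas"].items()}
```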
In pgVector we provide the following indexes for the query types FullText, NNS and SCALAR.
The hierarchical navigable small world index (HNSW) can be set up for a VECTOR field by adding the following configuration to the field in the schema config.
Parameters:
- `attributes`: at most one attribute is allowed. The value `vector` is the attribute name in the database.
- `type`: "NNS" describes the query type.
- `distance`: describes the distance metric for this index. The `hnsw` index provides:
  - "manhattan"
  - "euclidean"
  - "cosine"
  - "hamming"
  - "jaccard"
- `m`:
- `efConstruction`:
- `efSearch`:
```json
"indexes": [
  {
    "attributes": [
      "vector"
    ],
    "type": "NNS",
    "parameters": {
      "type": "hnsw",
      "distance": "cosine",
      "m": "4",
      "efConstruction": "10",
      "efSearch": "1000"
    }
  }
]
```
```json
"indexes": [
  {
    "attributes": [
      "value"
    ],
    "type": "FULLTEXT",
    "parameters": {
      "type": "gin",
      "language": "english"
    }
  }
]
```
- FullText `gin`
```json
"indexes": [{"attributes":["value"],"type":"FULLTEXT","parameters":{"type":"gin", "language": "english"}}],
```
```json
"whisperasr": {
  "factory": "ASR",
  "indexes": [{"attributes":["value"],"type":"FULLTEXT","parameters":{"type":"gin", "language": "english"}}],
  "parameters": {
    "host": "http://10.34.64.83:8888/",
    "model": "whisper",
    "timeoutSeconds": "100",
    "retries": "1000"
  }
},
```
```
index create <schema>.<descriptor_field> <descriptor> LUCENE
index rebuild <name-of-index>
```
```
index create warren.ptt.descriptor_asr descriptor LUCENE
index rebuild warren.ptt.descriptor_asr.idx_descriptor_lucene
```
During ingestion, the multimedia data is analysed and features are extracted. Ingestion in vitrivr-engine is based on an ingestion pipeline definition, centred around so-called operators. The previously introduced analysers are one kind of operator; they extract the feature(s) corresponding to their field. Other operators include the previously introduced exporters.
Ingestion is schema-dependent and always directly linked to one specific schema.
The ingestion context provides vital information -- the context -- to an ingestion pipeline.
Specifically, there is a global and a local context.
The former provides key-value pairs for all operators of the pipeline, while the latter provides key-value pairs to specific operators based on their name. Moreover, the local context may override the global one: if there is a global "limit":"100" key-value pair and a certain local context provides a "limit":"50" key-value pair for one operator, this operator will have a limit of 50, provided it supports a limit.
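The override rule can be expressed as a small lookup function. This is a sketch under assumptions: the key name `global`, the nesting of the local context by operator name and the `resolve` helper are illustrative and not taken from the vitrivr-engine code base.

```python
def resolve(context, operator_name, key, default=None):
    """Look up a property for an operator: local values win over global ones."""
    local = context.get("local", {}).get(operator_name, {})
    if key in local:
        return local[key]
    return context.get("global", {}).get(key, default)

context = {
    "global": {"limit": "100"},
    "local": {"my-operator": {"limit": "50"}},
}

limit_for_my_operator = resolve(context, "my-operator", "limit")
limit_for_other_operator = resolve(context, "other-operator", "limit")
```

With this context, `my-operator` sees a limit of 50, while every other operator falls back to the global limit of 100.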
The ingestion context additionally has two essential properties, contentFactory and resolverName:
The ingestion context also includes a mandatory property, contentFactory, which requires the name of a ContentFactoriesFactory class.
The purpose of this factory is to produce ContentFactorys, which in turn produce Content - vitrivr-engine's representation of the media.
vitrivr-engine provides two such factories:
| Class | Description | Local Context Properties |
|---|---|---|
| `InMemoryContentFactory` | Produces content and stores it in-memory, which works fine for small datasets. | |
| `CachedContentFactory` | Produces content and caches the contents on disk. Designed for large datasets with large individual items (e.g. long high-res videos). | `content.location`: The path location for the cache, defaults to a temporary directory called `vitrivr-cache` |
The content.location local context property notation should be read as:
```json
{
  "context": {
    "contentFactory": "CachedContentFactory",
    "resolverName": "<resolver-name>",
    "local": {
      "content": {
        "location": "<path-to-cache>"
      }
    }
  }
}
```
Fill in the placeholders `<resolver-name>` and `<path-to-cache>` as necessary.
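The dotted notation maps directly onto nested JSON objects. As a small illustration, the hypothetical helper below expands a dotted property name into the nested form that goes under `local`; it is not part of vitrivr-engine.

```python
def expand(dotted_key, value):
    """Turn 'content.location' plus a value into {'content': {'location': value}}."""
    nested = value
    for part in reversed(dotted_key.split(".")):
        nested = {part: nested}
    return nested

local_context = expand("content.location", "<path-to-cache>")
```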
Another special ingestion context property is the resolverName property, which has to reference a resolver defined on the schema.
The reason is that certain components may produce data which is relevant for both ingestion and retrieval, and the shared resolver ensures a common path.
The ingestion operator is first defined and then used as one component within a pipeline. Operators come in various types:
- `ENUMERATOR` enumerates sources and therefore serves as the starting point
- `DECODER` decodes the content into consumable elements
- `EXTRACTOR` extracts features and has to be backed by a field
- `EXPORTER` exports derived data from the multimedia data
- `TRANSFORMER` transforms the incoming retrievables to outgoing retrievables, possibly filtering them
The base structure of an ingestion operator is as follows:
```json
{
  "type": "<type>",
  "<addressKey>": "<provider>"
}
```
Where `<type>` represents one of the types introduced above and `<addressKey>` is one of factory (enumerator, decoder, transformer), fieldName (extractor), or exporterName (exporter).
Some operators do have additional key-value configuration.
See Ingestion Operator Overview for further information on concrete implementations.
The enumerator emits elements based on its configuration.
```json
{
  "type": "ENUMERATOR",
  "factory": "FactoryClass",
  "mediaTypes": ["<mt>"]
}
```
Where `<mt>` stands for one of the following mediaTypes: IMAGE (images), VIDEO (videos), AUDIO (audio), MESH (3d objects). An enumerator can emit multiple media types, if necessary.
The decoder segments the media data into content and thereby provides the segmentation to work on.
```json
{
  "type": "DECODER",
  "factory": "FactoryClass"
}
```
The extractor is backed by a field. It analyses the media content and extracts the feature representation.
```json
{
  "type": "EXTRACTOR",
  "fieldName": "my-field"
}
```
`my-field` must be a field name defined on the schema.
An exporter produces derived data, e.g. thumbnails from a video.
```json
{
  "type": "EXPORTER",
  "exporterName": "my-exporter"
}
```
Where `my-exporter` is the name of an exporter defined on the schema.
In vitrivr-engine, a transformer consumes retrievables and emits them, not necessarily one-to-one. For example, a filter transformer might filter retrievables on a property.
```json
{
  "type": "TRANSFORMER",
  "factory": "FactoryClass"
}
```
In the ingestion configuration, the operations define the ingestion operator pipeline / directed graph. An operation is a named node in the graph:
```json
"operation-name": {
  "operator": "operator-name",
  "inputs": ["<input-stages>"],
  "merge": "<merge-strategy>"
}
```
`operator-name` must reference a previously defined operator, and `<input-stages>` must reference other existing operations.
`<merge-strategy>` must be one of MERGE, COMBINE or CONCAT, see below.
The inputs and merge properties are optional with the following rules:
- if the `operator-name` references an enumerator, then no inputs are expected, as the enumerator is the start node of the pipeline graph
- if there is more than one element in the `inputs` list, then the `merge` property is required.
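These rules lend themselves to a simple validation pass over an ingestion configuration. The sketch below is illustrative only: `check_operations`, the operator and operation names, and the assumption that operators are kept in a name-to-definition map are not taken from vitrivr-engine.

```python
MERGE_STRATEGIES = {"MERGE", "COMBINE", "CONCAT"}

def check_operations(operators, operations):
    """Return a list of rule violations for the given operations (empty if none)."""
    problems = []
    for name, op in operations.items():
        operator = operators.get(op["operator"])
        inputs = op.get("inputs", [])
        if operator is None:
            problems.append(f"{name}: unknown operator '{op['operator']}'")
            continue
        if operator["type"] == "ENUMERATOR" and inputs:
            problems.append(f"{name}: an enumerator takes no inputs")
        for stage in inputs:
            if stage not in operations:
                problems.append(f"{name}: unknown input '{stage}'")
        if len(inputs) > 1 and op.get("merge") not in MERGE_STRATEGIES:
            problems.append(f"{name}: multiple inputs require a merge strategy")
    return problems

operators = {
    "enumerator": {"type": "ENUMERATOR", "factory": "FactoryClass", "mediaTypes": ["IMAGE"]},
    "decoder": {"type": "DECODER", "factory": "FactoryClass"},
    "extract-a": {"type": "EXTRACTOR", "fieldName": "my-field-1"},
    "extract-b": {"type": "EXTRACTOR", "fieldName": "my-other-field"},
    "filter": {"type": "TRANSFORMER", "factory": "FactoryClass"},
}
operations = {
    "enumerate": {"operator": "enumerator"},
    "decode": {"operator": "decoder", "inputs": ["enumerate"]},
    "a": {"operator": "extract-a", "inputs": ["decode"]},
    "b": {"operator": "extract-b", "inputs": ["decode"]},
    "joined": {"operator": "filter", "inputs": ["a", "b"], "merge": "COMBINE"},
}

problems = check_operations(operators, operations)
```

In this example, `decode` feeds both `a` and `b`, and `joined` consumes both of them, so it has to name a merge strategy.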
By defining the operations accordingly, two things can happen implicitly.
Branching: If an operation is used as input for multiple other operations, this results in a branching. This is handled automatically by wrapping the associated Operator in a BroadcastOperator.
Merging: If an operation has multiple inputs, this results in a merging, which combines multiple flows of Retrievables into a single flow. The merging strategy (MergeType) must be specified explicitly in the operation.
Currently, vitrivr-engine supports three types of merging strategies:
- `MERGE`: Merges the `Retrievable`s from the input operations in the order they arrive. No deduplication or ordering is performed.
- `COMBINE`: Merges `Retrievable`s from the input operations and emits a `Retrievable` once it has been received on every input.
- `CONCAT`: Collects `Retrievable`s from the incoming flows in order of occurrence, i.e., operation 1, then operation 2, etc.
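The difference between the three strategies is easiest to see on a toy example. The following sketch models each input flow as a list of `(arrival_time, retrievable_id)` pairs; this representation and the three functions are assumptions made for illustration and say nothing about how vitrivr-engine implements merging internally.

```python
def merge(flows):
    """MERGE: interleave by arrival time; duplicates are kept."""
    events = [event for flow in flows for event in flow]
    return [rid for _, rid in sorted(events, key=lambda e: e[0])]

def combine(flows):
    """COMBINE: emit an id once it has been received on every input flow."""
    seen = {}  # retrievable id -> set of input indices it arrived on
    emitted = []
    events = sorted((t, rid, i) for i, flow in enumerate(flows) for t, rid in flow)
    for _, rid, i in events:
        seen.setdefault(rid, set()).add(i)
        if len(seen[rid]) == len(flows) and rid not in emitted:
            emitted.append(rid)
    return emitted

def concat(flows):
    """CONCAT: everything from flow 1, then everything from flow 2, and so on."""
    return [rid for flow in flows for _, rid in flow]

flow_a = [(0, "r1"), (2, "r2"), (4, "r3")]
flow_b = [(1, "r2"), (3, "r1")]

merged = merge([flow_a, flow_b])
combined = combine([flow_a, flow_b])
concatenated = concat([flow_a, flow_b])
```

Note that `r3` never appears in the `COMBINE` result of this sketch because it only arrives on one of the two inputs.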
In order to persist the results of the ingestion, one or more operations have to be specified in the special output property of the ingestion configuration.
If multiple operations are specified as output, then additionally, a mergeType has to be defined, see merging.
An ingestion pipeline is stored as a JSON file.
Its properties are as follows:
- `schema`: The schema the ingestion operates on
- `context`: The global and local ingestion context
- `operators`: The ingestion operators
- `operations`: The ingestion operations
- `output`: The persistence operations
- `mergeType`: Optional merge strategy
For a simple example, see the Getting Started guide's ingestion pipeline. For a more advanced example, see the Example guide's ingestion pipeline.
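To see how these properties fit together, the following sketch assembles a minimal pipeline for the average colour example as a Python dictionary and serialises it to JSON. The factory names (`MyEnumerator`, `MyDecoder`), the field name `averagecolor`, the resolver name `my-resolver`, the `global` context key and the layout of `operators` and `operations` as name-to-definition maps are assumptions made for illustration; the linked guides remain the reference for a working configuration.

```python
import json

pipeline = {
    "schema": "my-schema",
    "context": {
        "contentFactory": "InMemoryContentFactory",
        "resolverName": "my-resolver",  # has to be defined on the schema
        "global": {},
        "local": {},
    },
    "operators": {
        "enumerator": {"type": "ENUMERATOR", "factory": "MyEnumerator", "mediaTypes": ["IMAGE"]},
        "decoder": {"type": "DECODER", "factory": "MyDecoder"},
        "avg-colour": {"type": "EXTRACTOR", "fieldName": "averagecolor"},
    },
    "operations": {
        "enumerate": {"operator": "enumerator"},
        "decode": {"operator": "decoder", "inputs": ["enumerate"]},
        "extract": {"operator": "avg-colour", "inputs": ["decode"]},
    },
    # Only one output operation, so no mergeType is needed.
    "output": ["extract"],
}

pipeline_json = json.dumps(pipeline, indent=2)
```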
During retrieval time, queries are sent to vitrivr-engine with the aim to retrieve information, based on previous ingestion. Centered around retrieval operators, vitrivr-engine comes with its own query language which consists of four core components:
- The inputs define the query payload
- The operations define the order of retrieval operators as a pipeline
- The query context provides, similar to the ingestion context, vital contextual information
- The output specifies which operation is returned
Retrieval in vitrivr-engine is schema dependent and directly linked to one schema.
Similar to the ingestion context, the retrieval context consists of a local and global component.
See ingestion context for more information.
Essentially the payload of the query, the input is a typed, named component of a query. The types supported are:
- `TEXT` for textual input
- `IMAGE` for image input
- `VECTOR` for vector input
- `ID` to query for an ID
- `BOOLEAN` for boolean input
- `NUMERIC` for numerical input
- `DATE` for datetime inputs
See Query Input Overview for further information.
There are three types of query operators, which do have a certain similarity to the ingestion operators by design:
- `RETRIEVER`s are the `EXTRACTOR`s' counterpart: they are backed by a field and perform retrieval
- `TRANSFORMER`s transform the retrievables, similar to ingestion `TRANSFORMER`s
- `AGGREGATOR`s aggregate multiple retrievables.
See Query Operator Overview for further information.
The retriever operator retrieves retrievables from the storage layer based on its analyser's capacity. Retrievers are by definition backed by a field; hence, the semantics very much depend on the field.
```json
{
  "type": "RETRIEVER",
  "field": "fieldname",
  "input": "<inputname>"
}
```
Where `fieldname` is the name of a field defined on the schema and `<inputname>` is the name of an input.
Retrievers may have additional properties set in the local or global query context.
A special notation for StructDescriptors (see Analyser Overview) is in place to formulate simple Boolean queries.
Given an input with a comparison specified, the dot (.) notation as in the following example results in a simple Boolean query on the subfield:
```json
{
  "type": "RETRIEVER",
  "field": "fieldname.subfieldname",
  "input": "input-with-comparison"
}
```
Assuming the `input-with-comparison` is defined as follows:
```json
{
  "type": "NUMERIC",
  "data": "10000",
  "comparison": ">="
}
```
And given that `fieldname.subfieldname` is numerical (e.g. the `FileSourceMetadata.size` subfield), the simple Boolean query reads as:
Give me retrievables of fieldname where the subfield's value is greater than or equal to 10000
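The semantics of such a simple Boolean query can be sketched as a filter over struct descriptors. The helper `boolean_query`, the in-memory descriptor map and the set of comparison operators other than `>=` are assumptions; only `>=` appears in the example above.

```python
import operator

# Only ">=" is taken from the example; the other operators are assumed.
COMPARISONS = {
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
    ">=": operator.ge,
    ">": operator.gt,
}

def boolean_query(descriptors, subfield, query_input):
    """Return ids of retrievables whose subfield satisfies the comparison."""
    compare = COMPARISONS[query_input["comparison"]]
    threshold = float(query_input["data"])
    return [rid for rid, fields in descriptors.items()
            if compare(fields[subfield], threshold)]

# Made-up struct descriptors, keyed by retrievable id.
descriptors = {
    "small": {"size": 5000},
    "exact": {"size": 10000},
    "large": {"size": 20000},
}
query_input = {"type": "NUMERIC", "data": "10000", "comparison": ">="}

matches = boolean_query(descriptors, "size", query_input)
```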
The transformer operator takes retrievables, processes them and emits retrievables again. This is not necessarily a one-to-one operation. Common transformations include, among others, the expansion of relationships as well as the lookup of certain (sub)fields.
```json
{
  "type": "TRANSFORMER",
  "transformerName": "TransformerClass",
  "input": "<input-stage>"
}
```
Transformers may have additional properties set in the global or local context.
See Query Transformer Overview for further information.
The aggregator operator aggregates incoming retrievables based on its aggregation strategy, inherent to the aggregator.
```json
{
  "type": "AGGREGATOR",
  "aggregatorName": "AggregatorClass",
  "inputs": ["<input-operations>"]
}
```
Where the `<input-operations>` are previously defined operations.
Aggregators may have additional properties set in the global or local context.
See Query Aggregator Overview for further information.
The query configuration is provided as JSON. It consists of the following properties:
- `context`: The global and local query context
- `inputs`: The input payloads
- `operations`: The query operators, as named operations
- `output`: The operation name that is eventually emitted to the caller
For a simple example, see the Getting Started guide's query. For a more advanced example, see the Example guide's query.
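As a rough illustration of how the four properties reference each other, the sketch below builds a small query as a Python dictionary and checks that every reference resolves. The helper `check_query`, the input and operation names, the field name `averagecolor`, the `MyTransformer` class name, the `global` context key and the shape of the input payload are hypothetical.

```python
def check_query(query):
    """Return a list of dangling references in a query (empty if none)."""
    problems = []
    inputs = query.get("inputs", {})
    operations = query.get("operations", {})
    for name, op in operations.items():
        if op["type"] == "RETRIEVER" and op.get("input") not in inputs:
            problems.append(f"{name}: unknown input '{op.get('input')}'")
        if op["type"] == "TRANSFORMER" and op.get("input") not in operations:
            problems.append(f"{name}: unknown input operation '{op.get('input')}'")
        if op["type"] == "AGGREGATOR":
            for stage in op.get("inputs", []):
                if stage not in operations:
                    problems.append(f"{name}: unknown input operation '{stage}'")
    if query.get("output") not in operations:
        problems.append(f"output: unknown operation '{query.get('output')}'")
    return problems

query = {
    "inputs": {"my-colour": {"type": "VECTOR", "data": [255, 0, 0]}},
    "operations": {
        "retrieve": {"type": "RETRIEVER", "field": "averagecolor", "input": "my-colour"},
        "lookup": {"type": "TRANSFORMER", "transformerName": "MyTransformer", "input": "retrieve"},
    },
    "context": {"global": {"limit": "100"}, "local": {}},
    "output": "lookup",
}

problems = check_query(query)
```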
Found an issue in the wiki? Post it!
Have a question? Ask it
Disclaimer: Please keep in mind, vitrivr and vitrivr-engine are predominantly research prototypes.