Documentation

Loris Sauter edited this page Jun 26, 2024 · 34 revisions

Here, we document the inner workings of vitrivr-engine, introduce the concepts it employs, and aim to provide a good overview of its components.

Terminology

This chapter introduces common terminology.

Introduction

In content-based multimedia retrieval, the aim is to search within multimedia collections (e.g. video, image, audio, 3D objects) on a content, and hence semantic, level. This is a non-trivial problem due to the so-called semantic gap: the stark difference in semantic understanding of content between humans and machines. Recent developments in foundation models have narrowed this gap, yet various techniques are still employed to search efficiently within large collections of multimedia data.

Ingestion / Offline Phase

In (multimedia) retrieval, there is a common distinction between two phases. The first is the ingestion phase (also known as the offline phase), during which the multimedia content is analysed and representations of the content are stored in an efficient way for later use.

Retrieval / Online Phase

The retrieval phase (also known as the online phase) describes actions performed after ingestion: (user) queries to the system are analysed in the same manner as the multimedia data was, and query and content are compared on the basis of those representations. The outcome is usually a list of results, each with an accompanying similarity score that indicates how similar the result is to the query. Commonly, a similarity score of 1 represents identity, while a similarity score of 0 indicates the greatest dissimilarity.

Feature

In multimedia retrieval, a feature is a means of representing the multimedia content.

Toy Example

A very primitive feature is the average colour: given an image (either a still image or a frame from a video), one calculates the average colour by averaging the individual pixels' RGB values. While this is not very expressive on its own, it demonstrates how features work.

During ingestion, the average colour is calculated for all the input data (again, this could be a set of images or a couple of representative frames from a video) and stored in the database as three-element vectors (R,G,B).

At retrieval time, the query consists of a single three-element vector (R,G,B) and a Nearest Neighbour Search (NNS) is performed on the stored average colour vectors. For every item in the database, the resulting distance is then converted to a similarity score $$s \in [0,1]$$.
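The toy example can be sketched in a few lines of Python. This is only an illustration and not code from vitrivr-engine: the function names, the brute-force search and, in particular, the linear mapping from Euclidean distance to a similarity score are assumptions made for this sketch.

```python
import math

def average_colour(pixels):
    """Average the RGB values of an iterable of (r, g, b) pixels."""
    pixels = list(pixels)
    n = len(pixels)
    if n == 0:
        raise ValueError("an image needs at least one pixel")
    return tuple(sum(p[c] for p in pixels) / n for c in range(3))

# Largest possible Euclidean distance between two 8-bit RGB vectors.
MAX_DISTANCE = math.sqrt(3 * 255 ** 2)

def similarity(a, b):
    """Map the Euclidean distance of two colour vectors to [0, 1].

    Assumption: a simple linear mapping, chosen for illustration only.
    """
    return 1.0 - math.dist(a, b) / MAX_DISTANCE

def nearest_neighbours(query, database, k=3):
    """Brute-force NNS: score every stored vector, keep the k most similar."""
    scored = [(name, similarity(query, vec)) for name, vec in database.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]

# Ingestion: one average colour vector per (tiny, made-up) image.
database = {
    "red.png": average_colour([(255, 0, 0), (250, 10, 10)]),
    "green.png": average_colour([(0, 255, 0), (10, 250, 10)]),
    "grey.png": average_colour([(128, 128, 128)]),
}

# Retrieval: the query is a single (R, G, B) vector.
results = nearest_neighbours((255, 0, 0), database, k=2)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

A real system would replace the linear scan with an index in the database, but the principle stays the same: identical vectors score 1 and the most distant pair of colours scores 0.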

Further Reading

  • Basics: Wikipedia
  • Research: vitrivr
  • Book: Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval, ACM Press Books, 1999 (1st edition), 2011 (2nd edition)

There are a lot of (research) publications out there which cover (multimedia) retrieval in great detail.

Data Model vitrivr-engine

vitrivr-engine's data model is based on almost a decade of research in multimedia retrieval. Influenced by its predecessor, the retrieval engine Cineast, the data model aims to be as flexible as possible while still providing foundational guidelines for consumers of vitrivr-engine.

Retrievable

In vitrivr-engine, a retrievable is the unit of retrieval and the logical representation of multimedia data. Depending on the type of multimedia, one (e.g. image) or more (e.g. video) retrievables exist.

For an image file, a single retrievable of the type SOURCE:IMAGE is created. For a video file, a single retrievable of the type SOURCE:VIDEO is created, together with a number of retrievables of the type SEGMENT that depends on the segmentation strategy. A 30s video with a fixed-length segmentation of 1s results in 31 retrievables: one per second plus one for the file. The one-second-segment retrievables have a partOf relationship towards the source retrievable.
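The 30s video example can be sketched as follows. The Retrievable class, the segment_video function and the tuple-based relationship representation are hypothetical names chosen for this illustration; they do not mirror the actual classes of vitrivr-engine.

```python
from dataclasses import dataclass, field
from typing import Optional
from uuid import UUID, uuid4

@dataclass
class Retrievable:
    """Hypothetical stand-in for a retrievable: a type plus a unique id."""
    type: str
    start_s: Optional[float] = None
    end_s: Optional[float] = None
    id: UUID = field(default_factory=uuid4)

def segment_video(duration_s, segment_s):
    """Create one source retrievable and fixed-length segment retrievables."""
    if segment_s <= 0:
        raise ValueError("segment length must be positive")
    source = Retrievable(type="SOURCE:VIDEO")
    retrievables = [source]
    relationships = []  # (subject id, predicate, object id)
    start = 0.0
    while start < duration_s:
        end = min(start + segment_s, duration_s)
        segment = Retrievable(type="SEGMENT", start_s=start, end_s=end)
        retrievables.append(segment)
        # Every segment points back to the file it was cut from.
        relationships.append((segment.id, "partOf", source.id))
        start = end
    return retrievables, relationships

retrievables, relationships = segment_video(duration_s=30.0, segment_s=1.0)
print(len(retrievables), len(relationships))
```

For a 30s video and 1s segments, this yields 31 retrievables (30 segments plus the source) and 30 partOf relationships.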

Descriptor

The descriptor describes a retrievable in vitrivr-engine. The fundamental concept is that a retrievable's content is represented by descriptors, which are based on features.

For an image file and the average colour example: The source retrievable is described by one average colour descriptor. For a 30s video file and the average colour example: Each of the 30 one-second-segment retrievables is described by one average colour descriptor, while the source retrievable is not described.

Overview of Descriptors

In vitrivr-engine, there are four distinct high-level types of descriptors:

  • Vector descriptors have a type (e.g. float) and a length. Ideal for NNS.
  • Struct descriptors have pre-defined sub-fields of various types.
  • Scalar descriptors consist of a single typed value.
  • Tensor descriptors represent a mathematical tensor. Not yet implemented [June, 2024]
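To make the distinction between the implemented descriptor types more tangible, the following sketch models them as plain Python data classes. The class names, attribute names and example values are assumptions for illustration only and do not reflect the actual type hierarchy of vitrivr-engine.

```python
from dataclasses import dataclass
from typing import Dict, Tuple, Union

# Assumption: the value types a descriptor may hold in this sketch.
Value = Union[bool, int, float, str]

@dataclass(frozen=True)
class ScalarDescriptor:
    """A single typed value."""
    value: Value

@dataclass(frozen=True)
class VectorDescriptor:
    """A fixed-length vector whose elements share one type (here: float)."""
    values: Tuple[float, ...]

    @property
    def length(self) -> int:
        return len(self.values)

@dataclass(frozen=True)
class StructDescriptor:
    """A set of pre-defined, named sub-fields of various types."""
    fields: Dict[str, Value]

# Made-up example values, one per descriptor type.
average_colour = VectorDescriptor(values=(252.5, 5.0, 5.0))
duration = ScalarDescriptor(value=30.0)
file_info = StructDescriptor(fields={"path": "red.png", "size": 1024})

print(average_colour.length, duration.value, file_info.fields["path"])
```

The vector descriptor is the natural fit for the average colour toy example, since its fixed length and single element type are what a nearest neighbour search operates on.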

Schema

vitrivr-engine operates on the notion of a named schema, similar to a database or a collection, which provides, among other things, a namespace.

{
  "name":"my-schema"
}

Database Connection

Each schema requires a database connection, which describes where the schema is persisted (and read from). The database supported by vitrivr-engine is CottontailDB.

{
  "database": "CottontailConnectionProvider",
  "parameters": {
    "Host": "127.0.0.1",
    "port": "1865"
  }
}

Field

In vitrivr-engine, the term field represents a feature that is to be used. Each field is uniquely named and may be parameterised.

Note: In technical terms, each field has to be backed by an Analyser, whose output is a descriptor. During ingestion, the analyser produces the descriptor that represents a retrievable. During retrieval, the analysis step involves the execution of a query using the derived descriptor.

{
  "name": "uniqueName",
  "factory": "FactoryClass",
  "parameters":{
    "key": "value"
  }
}

A note about fields in vitrivr-engine: Due to its highly modular architecture, a handful of features to be used as fields are shipped with vitrivr-engine. The toy example is the AverageColor. Depending on use case, custom features can be added.

See analyser / field overview.

Exporter

In contrast to an analyser / a field, an exporter in vitrivr-engine exports new, derived data.

{
    "name": "uniqueName",
    "factory": "FactoryClass",
    "resolverName": "resolverName",
    "parameters": {
        "key": "value"
    }
}

Resolver

A resolver is responsible for resolving a physical resource based on information present in a retrievable.

{
    "name": "uniqueName",
    "factory": "FactoryClass",
    "parameters": {
        "key": "value"
    }
}

Schema Configuration

The schema configuration is the foundation of vitrivr-engine and therefore required on startup. The configuration consists of blocks for the database connection (one), fields (many), exporters (many), and resolvers (many):

{
    "schemas": [
        {
            "name": "schema-name",
            "connection": {
                "database": "CottontailConnectionProvider",
                "parameters": {
                    "Host": "127.0.0.1",
                    "port": "1865"
                }
            },
            "fields": [
                {
                    "name": "my-field-1",
                    "factory": "AnalyserFactory"
                },
                {
                    "name": "my-other-field",
                    "factory": "AnotherAnalyserFactory"
                }
            ],
            "resolvers": {
                "my-resolver": {
                    "factory": "ResolverFactory",
                    "parameters": {
                        "key": "value"
                    }
                }
            },
            "exporters": [
                {
                    "name": "my-exporter",
                    "factory": "ExporterFactory",
                    "resolverName": "my-resolver",
                    "parameters": {
                        "key1": "value1",
                        "key2": "value2"
                    }
                }
            ],
            "extractionPipelines": [
                {
                    "name": "my-video-pipeline",
                    "path": "./videos.json"
                },
                {
                    "name": "my-image-pipeline",
                    "path": "./images.json"
                }
            ]
        }
    ]
}

The newly introduced property extractionPipelines is a list of named ingestion pipelines, each with the path to the JSON file containing the pipeline configuration. This is useful if pre-defined ingestion pipelines are to be used. However, the pipeline configuration can also be provided on the fly, which is why this property is optional.
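Before starting the engine, a schema configuration of this shape can be sanity-checked with a small script. The sketch below is not part of vitrivr-engine. The two rules it checks are assumptions derived from the text and the example: field names are unique, and an exporter's resolverName refers to a key of the resolvers block. The check_schema function is a hypothetical helper.

```python
import json
from collections import Counter

# A reduced version of the schema configuration, inlined for the sketch.
CONFIG = json.loads("""
{
  "schemas": [
    {
      "name": "schema-name",
      "fields": [
        {"name": "my-field-1", "factory": "AnalyserFactory"},
        {"name": "my-other-field", "factory": "AnotherAnalyserFactory"}
      ],
      "resolvers": {
        "my-resolver": {"factory": "ResolverFactory"}
      },
      "exporters": [
        {"name": "my-exporter", "factory": "ExporterFactory",
         "resolverName": "my-resolver"}
      ]
    }
  ]
}
""")

def check_schema(schema):
    """Return a list of human-readable problems found in one schema block."""
    problems = []
    # Rule 1 (from the text): each field is uniquely named.
    counts = Counter(f["name"] for f in schema.get("fields", []))
    for name, count in sorted(counts.items()):
        if count > 1:
            problems.append(f"field name '{name}' is used {count} times")
    # Rule 2 (assumed from the example): resolverName must be defined.
    resolvers = schema.get("resolvers", {})
    for exporter in schema.get("exporters", []):
        resolver = exporter.get("resolverName")
        if resolver not in resolvers:
            problems.append(
                f"exporter '{exporter['name']}' uses unknown resolver '{resolver}'"
            )
    return problems

for schema in CONFIG["schemas"]:
    print(schema["name"], check_schema(schema) or "ok")
```

Since extractionPipelines is optional, the sketch does not require it to be present.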

Ingestion

During ingestion, the multimedia data is analysed and features are extracted. Ingestion in vitrivr-engine is based on an ingestion pipeline definition, centred around so-called operators. The previously introduced analysers are one kind of operator; they extract the feature(s) corresponding to their field. Other operators include the previously introduced exporters.

Ingestion Context

The ingestion context provides vital information (the context) to an ingestion pipeline. Specifically, there is a global and a local context. The former provides key-value pairs to all operators of the pipeline, while the latter provides key-value pairs to specific operators based on their name. Moreover, the local context may override the global one: if there is a global "limit":"100" key-value pair and the local context provides a "limit":"50" key-value pair for one operator, this operator will have a limit of 50, provided it supports a limit.
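The override rule can be illustrated with two dictionaries. This is a minimal sketch of the lookup semantics described above, not the engine's implementation. The function name resolve_context and the operator names are hypothetical.

```python
def resolve_context(global_ctx, local_ctx, operator_name):
    """Merge global and local context for one operator; local values win."""
    merged = dict(global_ctx)                        # start with the global pairs
    merged.update(local_ctx.get(operator_name, {}))  # local pairs override
    return merged

global_ctx = {"limit": "100"}
local_ctx = {"my-enumerator": {"limit": "50"}}

print(resolve_context(global_ctx, local_ctx, "my-enumerator"))  # local limit
print(resolve_context(global_ctx, local_ctx, "my-decoder"))     # global limit
```

An operator without a local entry simply sees the global key-value pairs.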

Ingestion Operator

An ingestion operator is first defined and then used as one component within a pipeline. Operators come in various types:

  • ENUMERATOR enumerates sources and therefore serves as the starting point
  • DECODER decodes the content into consumable elements
  • EXTRACTOR extracts features and has to be backed by a field
  • EXPORTER exports derived data from the multimedia data
  • TRANSFORMER transforms the incoming retrievables to outgoing retrievables, possibly filtering them

The base structure of an ingestion operator is as follows:

{
  "type": "<type>",
  "<addressKey>":"<provider>"
}

Here, <type> is one of the types introduced above and <addressKey> is one of factory (enumerator, decoder, transformer), fieldName (extractor), or exporterName (exporter). Some operators have additional key-value configuration.
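The mapping from operator type to address key can be written down as a small helper that assembles operator definitions. The helper make_operator and the provider names used below are hypothetical; only the type names and address keys are taken from the description above.

```python
import json

# Address key per operator type, as described in the text.
ADDRESS_KEYS = {
    "ENUMERATOR": "factory",
    "DECODER": "factory",
    "TRANSFORMER": "factory",
    "EXTRACTOR": "fieldName",
    "EXPORTER": "exporterName",
}

def make_operator(op_type, provider, **extra):
    """Build the dictionary for one operator definition.

    `extra` carries optional, operator-specific key-value configuration.
    """
    if op_type not in ADDRESS_KEYS:
        raise ValueError(f"unknown operator type: {op_type}")
    operator = {"type": op_type, ADDRESS_KEYS[op_type]: provider}
    operator.update(extra)
    return operator

# Hypothetical provider names, for illustration only.
extractor = make_operator("EXTRACTOR", "my-field-1")
enumerator = make_operator("ENUMERATOR", "FactoryClass", mediaTypes=["IMAGE", "VIDEO"])

print(json.dumps(extractor, indent=2))
print(json.dumps(enumerator, indent=2))
```

Keeping the mapping in one table makes it harder to, for instance, give an extractor a factory key instead of a fieldName.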

See Ingestion Operator Overview for further information on concrete implementations.

Enumerator

The enumerator emits elements based on its configuration.

{
  "type":"ENUMERATOR",
  "factory":"FactoryClass",
  "mediaTypes":["<mt1>", "<mt2>"]
}

Where <mt1> and <mt2> stand for one of the following mediaTypes: IMAGE (images), VIDEO (videos), AUDIO (audio), MESH (3d objects).

Decoder

The decoder decodes

Extractor

Exporter

Transformer

Retrieval