Permalink
Find file Copy path
288 lines (213 sloc) 12.1 KB

Type Mapping

ZomboDB provides a set of default Postgres<-->Elasticsearch type mappings for Postgres' common data types. ZomboDB also includes a complete set of custom Postgres DOMAINs that represent all of the foreign languages that Elasticsearch supports.

This document attempts to explain how ZomboDB maps Postgres data types to Elasticsearch data types, how they're analyzed (if at all), and how these things can be controlled through a combination of Postgres' type system and ZomboDB-specific functions.

Common Data Types

These are the default mappings:

Postgres Type Elasticsearch JSON Mapping Definition
bytea {"type": "binary"}
boolean {"type": "boolean"}
smallint {"type": "short"}
integer {"type": "integer"}
bigint {"type": "long"}
real {"type": "float"}
double precision {"type": "double"}
character varying {"type": "keyword", "copy_to": "zdb_all", "normalizer": "lowercase", "ignore_above": 10922}
text {"type": "text", "copy_to": "zdb_all", "analyzer": "zdb_standard", "fielddata": true}
time without time zone {"type": "date", "format": "HH:mm:ss.SSSSSS", "copy_to": "zdb_all"}
time with time zone {"type": "date", "format": "HH:mm:ss.SSSSSSZZ", "copy_to": "zdb_all"}
date {"type": "date", "copy_to": "zdb_all"}
timestamp without time zone {"type": "date", "copy_to": "zdb_all"}
timestamp with time zone {"type": "date", "copy_to": "zdb_all"}
json {"type": "nested", "include_in_parent": true}
jsonb {"type": "nested", "include_in_parent": true}
inet {"type": "ip", "copy_to": "zdb_all"}
zdb.fulltext {"type": "text", "copy_to": "zdb_all", "analyzer": "zdb_standard"}
zdb.fulltext_with_shingles {"type": "text", "copy_to": "zdb_all", "analyzer": "fulltext_with_shingles", "search_analyzer": "fulltext_with_shingles_search"}

Some things to note from the above:

  • Columns of type character varying (varchar) are not analyzed by Elasticsearch. They're indexed as whole values, but are converted to lowercase
  • Columns of type text are analyzed by Elasticsearch using its standard analyzer, and the individual terms are converted to lowercase
  • Columns of type json/jsonb are mapped to Elasticsearch's nested object with a dynamic template that treats "string" properties as if they're of type character varying (ie, unanalyzed exact, lowercased values), and treats "date" properties as if they're dates, accepting a wide range of date formats.

In all cases above, arrays of Postgres types are fully supported.

ZomboDB's Custom DOMAIN types

ZomboDB includes a few custom Postgres DOMAIN types which can be used as column types in your tables.

zdb.fulltext works exactly the same way as a column of type text, but exists so as to provide an extra hint of metadata to client applications indicating that the column likely contains a large amount of text.

zdb.fulltext_with_shingles is akin to zdb.fulltext but uses a 2-gram single filter for high-speed right-truncated wildcard support.

Language-specific DOMAIN types

As noted earlier, ZomboDB provide support for all of Elasticsearch's language analyzers, exposed as Postgres DOMAINs. This allows you to create tables with columns of type portuguese or thai, for example. The complete set of language domains is:

arabic, armenian, basque, brazilian, bulgarian, catalan, chinese, cjk, 
czech, danish, dutch, english, finnish, french, galician, german, greek, 
hindi, hungarian, indonesian, irish, italian, latvian, norwegian, persian, 
portuguese, romanian, russian, sorani, spanish, swedish, turkish, thai

More details on how Elasticsearch analyzes each of these can be found in its language analyzers documentation.

Defining Custom Analyzers, Filters, Normalizers, Type Mappings

ZomboDB provides a set of SQL-level functions that allow you to define custom analyzer chains, filters, normalizers, along with custom type mappings.

These are designed to be used with Postgres' CREATE DOMAIN command where the domain name exactly matches the analyzer name.

Analysis Definition Functions

FUNCTION zdb.define_analyzer(name text, definition json)

Allows for the definition of Elasticsearch custom analyzers. Depending on the complexity of the analyzer you need to define, you'll likely first need to define custom filters or tokenizers, as described below.

In order to use a custom analyzer you must make a custom Postgres DOMAIN with the same name, and then you can use that DOMAIN as column type in your tables.

You can also use the custom analyzer in conjunction with a custom field mapping via the zdb.define_field_mapping() function described below.

Note that making changes to any of the analysis definitions will require a REINDEX of any indices that use the things you changed.


FUNCTION zdb.define_filter(name text, definition json)

Allows for the definition of a custom Elasticsearch token filter.


FUNCTION zdb.define_char_filter(name text, definition json) 

Allows for the definition of a custom Elasticsearch character filter.


FUNCTION zdb.define_tokenizer(name text, definition json)

Allows for the definition of a custom Elasticsearch tokenizer.


FUNCTION zdb.define_normalizer(name text, definition json) 

Allows for the definition of a custom Elasticsearch normalizer.


FUNCTION zdb.define_type_mapping(type_name regtype, definition json)

If you need to define a type mapping for a Postgres datatype that isn't included in ZomboDB's defaults (citext, for example), this is the function to use.

Field-Specific Mapping Functions

Rather than using DOMAIN types to map Postgres types to an Elasticsearch analyzer, you can also define field-specific mappings per table and field.

This approach can be quite powerful as you can set, per field, all the mapping properties that Elasticsearch allows, and you don't need to create and manage custom DOMAIN types.

FUNCTION zdb.define_field_mapping(table_name regclass, field_name name, definition json) 

If you need to define a field mapping for a specific field in a specific table, this is the function to use. You can specify any custom mapping definition json that is supported by Elasticsearch.

Creating or changing a field mapping requires a REINDEX of the specified table.


FUNCTION zdb.define_es_only_field(table_name regclass, field_name name, definition json)

If you want a custom field that only exists in the Elasticsearch index (perhaps as a target to the mapping copy_to property, you can use this.

Any field you create here can be searched and used with aggregates, but won't be SELECT-able by Postgres.

Creating or changing an Elasticearch-only field requires a REINDEX of the specified table.

Example

SELECT zdb.define_tokenizer('example_tokenizer', '{
          "type": "pattern",
          "pattern": "_"
        }');
SELECT zdb.define_analyzer('example', '{
          "tokenizer": "example_tokenizer"
        }');
CREATE DOMAIN example AS text;
CREATE TABLE foo (
   id serial8,
   some_field example
);
CREATE INDEX idxfoo ON foo USING zombodb ((foo.*));
INSERT INTO foo (some_field) VALUES ('this_is_a_test');
SELECT * FROM foo WHERE foo ==> 'some_field:this';

Testing Analyzers

ZomboDB provides a few functions that can be used to evaluate how an analyzer actually tokenizes text.

FUNCTION zdb.analyze_with_field(
	index regclass, 
	field name, 
	text text) 
RETURNS TABLE (
	type text, 
	token text, 
	"position" int, 
	start_offset int, 
	end_offset int)

This function allows you to evaluate text analysis using the analyzer already defined for a particular field.

Examples:

SELECT * FROM zdb.analyze_with_field('idxproducts', 'keywords', 'this is a test');
 type |     token      | position | start_offset | end_offset 
------+----------------+----------+--------------+------------
 word | this is a test |        0 |            0 |         14
(1 row)
SELECT * FROM zdb.analyze_with_field('idxproducts', 'long_description', 'this is a test');
    type    | token | position | start_offset | end_offset 
------------+-------+----------+--------------+------------
 <ALPHANUM> | this  |        0 |            0 |          4
 <ALPHANUM> | is    |        1 |            5 |          7
 <ALPHANUM> | a     |        2 |            8 |          9
 <ALPHANUM> | test  |        3 |           10 |         14
(4 rows)

FUNCTION zdb.analyze_text(
	index regclass, 
	analyzer text, 
	text text) 
RETURNS TABLE (
	type text, 
	token text, 
	"position" int, 
	start_offset int, 
	end_offset int)

This function allows you to evaluate analysis using a specific analyzer name, either built-in to Elasticsearch or one of the custom analyzers you may have defined.

Examples:

SELECT * FROM zdb.analyze_text('idxproducts', 'keyword', 'this is a test');
 type |     token      | position | start_offset | end_offset 
------+----------------+----------+--------------+------------
 word | this is a test |        0 |            0 |         14
SELECT * FROM zdb.analyze_text('idxproducts', 'standard', 'this is a test');
    type    | token | position | start_offset | end_offset 
------------+-------+----------+--------------+------------
 <ALPHANUM> | this  |        0 |            0 |          4
 <ALPHANUM> | is    |        1 |            5 |          7
 <ALPHANUM> | a     |        2 |            8 |          9
 <ALPHANUM> | test  |        3 |           10 |         14

FUNCTION zdb.analyze_custom(
	index regclass, 
	text text DEFAULT NULL, 
	tokenizer text DEFAULT NULL, 
	normalizer text DEFAULT NULL, 
	filter text[] DEFAULT NULL, 
	char_filter text[] DEFAULT NULL) 
RETURNS TABLE (
	type text, 
	token text, 
	"position" int, 
	start_offset int, 
	end_offset int)

This function allows you to dynamically define a custom analyzer and test it in real-time.

Example:

SELECT * FROM zdb.analyze_custom(
	index=>'idxproducts', 
	text=>'This is a test, 42 https://www.zombodb.com', 
	tokenizer=>'whitespace', 
	filter=>ARRAY['lowercase']);
 type |          token          | position | start_offset | end_offset 
------+-------------------------+----------+--------------+------------
 word | this                    |        0 |            0 |          4
 word | is                      |        1 |            5 |          7
 word | a                       |        2 |            8 |          9
 word | test,                   |        3 |           10 |         15
 word | 42                      |        4 |           16 |         18
 word | https://www.zombodb.com |        5 |           19 |         42
(6 rows)

About Elasticsearch's _all Field

In short, ZomboDB disables Elasticsearch's _all field and instead configures its own field named zdb_all. By default, all non-numeric field types are added to the zdb_all field.

ZomboDB does this to maintain compatability between Elasticsearch 5 and Elasticsearch 6, where ES 6 deprecates the _all field.