Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Citext data type #133

Closed
budi opened this issue Jul 19, 2016 · 8 comments
Closed

Citext data type #133

budi opened this issue Jul 19, 2016 · 8 comments
Assignees

Comments

@budi
Copy link

budi commented Jul 19, 2016

Good day!

I have an email column that I want to index, and I'm using citext extension to make it easier for case-insensitivity cases (ha ha ha).

But zombodb always gives me elog:

WARNING:  Unrecognized data type citext, pretending it's of type 'text'

Sad. 😢

Any plan to put citext into the domain?

@eeeebbbbrrrr
Copy link
Collaborator

Sure, that's easy enough to do. I'll take care of this in the next version of ZomboDB.

Note that ZDB searches are case-insensitive by default, so there's no need to use citext for ZomboDB. If you're also querying that column directly from SQL, however, then there might be good reasons.

@budi
Copy link
Author

budi commented Jul 19, 2016

Yup!
Suppressing logs when it is way too obvious is a huge deal :)

@budi
Copy link
Author

budi commented Jul 19, 2016

Just to be clear, it just so happens that the citext (email) column is something I want to query with ES.. like so:

...
 USING zombodb(zdb('blah', blah.ctid), zdb(blah))
  WITH ( url='#{ENV['ELASTIC_HOST']}',
         shards=5,
         replicas=10,
         field_lists='search=[other_column, email]' );

so then I can:

SELECT * 
  FROM blah
 WHERE zdb('users', ctid) ==> 'search:"*combo*"';

its_so_fluffy.gif

@eeeebbbbrrrr
Copy link
Collaborator

eeeebbbbrrrr commented Jul 19, 2016

Yep, we're on the same page. When I add support for the citext type, I'll do it exactly the way ZomboDB currently handles text. i.e., I'm going to add it to this block in zdbops.c:

        } else if (strcmp("text", typename) == 0 || strcmp("varchar", typename) == 0 ||
                   strcmp("character", typename) == 0 || strcmp("character varying", typename) == 0 ||
                   strcmp("text[]", typename) == 0 || strcmp("varchar[]", typename) == 0 ||
                   strcmp("character[]", typename) == 0 || strcmp("character varying[]", typename) == 0 ||
                   strcmp("uuid", typename) == 0 || strcmp("uuid[]", typename) == 0) {
            /* string field */
            appendStringInfo(result, "\"type\": \"string\",");
            appendStringInfo(result, "\"norms\": {\"enabled\":false},");
            appendStringInfo(result, "\"index_options\": \"docs\",");
            appendStringInfo(result, "\"ignore_above\":32000,");
            appendStringInfo(result, "\"analyzer\": \"exact\"");

        }

If you need tokenization/analysis then it's up to you to either use a predefined domain type like phrase or fulltext or to use zdb_define_mapping() to set custom mappings.

@budi
Copy link
Author

budi commented Jul 19, 2016

I was just reading the SQL-API and got a eureka moment only to come back and read your comment :))

@eeeebbbbrrrr
Copy link
Collaborator

Should also read the TYPE-MAPPING.md document. :)

eeeebbbbrrrr added a commit that referenced this issue Jul 21, 2016
@eeeebbbbrrrr eeeebbbbrrrr added v3.0.1 and removed v3.1.0 labels Aug 5, 2016
@eeeebbbbrrrr
Copy link
Collaborator

Change of plans... going to release in a v3.0.1

eeeebbbbrrrr added a commit that referenced this issue Aug 5, 2016
eeeebbbbrrrr added a commit that referenced this issue Aug 7, 2016
* bump version to v3.0.1

* a little test to make sure things work with materialized views

* make sure all ubuntu-based Dockerfiles for the build system first update themselves (and --fix-missing).  Boggles my mind why this used to work and now doesn't.

* fix terrible bug where ANDed terms + phrases against the same field would be rewritten as ORs!

* code cleanup

* doc update

* organize query_parser classes into subpackages and cleanup method/member visibility along the way.

* get everything compiling across all distros and pg versions

* fixing issue #129:  support for Debian Jessie as build artifacts and in top-level Dockerfile

* resolving issue #133:  support the 'citext' contrib datatype and treat it just as we do the 'text' datatype.

* docs update

* fixing issue #132

* implementing issue #134:  ZDB now supports querying multi-fields that might be configured via zdb_define_mapping().

* remove unused file

* fixing merge issues

* fixup tests list

* doc fixes
@eeeebbbbrrrr
Copy link
Collaborator

Released in v3.0.1

eeeebbbbrrrr added a commit that referenced this issue Nov 4, 2016
* fix terrible bug where ANDed terms + phrases against the same field would be rewritten as ORs!

cherry-picks 39e6da0

* bump version to 2.6.15

* fixing issue #118

* fix version to 2.6.15

* fix this stupid test so it doesn't sometimes return different results

* bump to v2.6.16

* working on vacuuming the _zdb_xact index as we execute queries that require it.  seems to work but causes autovacuum to cancel itself.  need to rework locking mechanics so autovacuum has a chance to run too

* set the "consistency" parameter on _bulk requests to "all" so that it won't return until all the replicas are updated too.

* fix locking so that autovacuum is still able to run, at least from time to time.  also cleanup code around detecting invisible tuples -- no logical changes there, just re-ordering the conditionals so the common cases happen sooner.

* for now, set the 'consistency' parameter to 'default'.  I think making this an index option is the correct answer

* minor formatting change

* code cleanup

* remove the vacuum-inline stuff.  use ES' "terms from document" query to filter out dead tuples.  implement HTTP compression using deflate (needs more work/cleanup).

* implement a background worker to vacuum _zdb_xact indexes.
However!  It's currently disabled.  Not sure I like the complexity it adds to a ZDB installation

* create a new index option named "compression_level", with a default of zero

* fix unit tests

* get everything compiling across all distros and pg versions

* fixing issue #129:  support for Debian Jessie as build artifacts and in top-level Dockerfile

* bring back the .dynamic index but with 1 shard and zero replicas.  Also set a TTL on the documents of 24hrs so that the index will eventually keep itself clean

* resolving issue #133:  support the 'citext' contrib datatype and treat it just as we do the 'text' datatype.

* docs update

* make sure to free memory holding compressed post data

* skip checking visibility for tuples from xids we know can't yet be vacuumed

* quick hack to use SiREN for excluding tuples -- needs more work and cleanup

* Some minor changes to get minimal Citus support
- support prepared transactions
- allow first argument of zdb(regclass, tid) to be a function:  we evaluate it and use its return value

* hack builtin function pg_catalog.regclass(text) to be immutable so it can be used in CREATE INDEX statements via citus

* update to reflect changes in personal laptop configuration

* code cleanup

* what happened here?

* fixing issue #132

* implementing issue #134:  ZDB now supports querying multi-fields that might be configured via zdb_define_mapping().

* working on vacuuming the _zdb_xact index as we execute queries that require it.  seems to work but causes autovacuum to cancel itself.  need to rework locking mechanics so autovacuum has a chance to run too

* set the "consistency" parameter on _bulk requests to "all" so that it won't return until all the replicas are updated too.

* fix locking so that autovacuum is still able to run, at least from time to time.  also cleanup code around detecting invisible tuples -- no logical changes there, just re-ordering the conditionals so the common cases happen sooner.

* for now, set the 'consistency' parameter to 'default'.  I think making this an index option is the correct answer

* minor formatting change

* code cleanup

* remove the vacuum-inline stuff.  use ES' "terms from document" query to filter out dead tuples.  implement HTTP compression using deflate (needs more work/cleanup).

* implement a background worker to vacuum _zdb_xact indexes.
However!  It's currently disabled.  Not sure I like the complexity it adds to a ZDB installation

* create a new index option named "compression_level", with a default of zero

* fix unit tests

* get everything compiling across all distros and pg versions

* bring back the .dynamic index but with 1 shard and zero replicas.  Also set a TTL on the documents of 24hrs so that the index will eventually keep itself clean

* docs update

* make sure to free memory holding compressed post data

* skip checking visibility for tuples from xids we know can't yet be vacuumed

* quick hack to use SiREN for excluding tuples -- needs more work and cleanup

* Some minor changes to get minimal Citus support
- support prepared transactions
- allow first argument of zdb(regclass, tid) to be a function:  we evaluate it and use its return value

* hack builtin function pg_catalog.regclass(text) to be immutable so it can be used in CREATE INDEX statements via citus

* update to reflect changes in personal laptop configuration

* code cleanup

* what happened here?

* move row visibility checking into Elasticsearch.  requires a NOT NULL column of type int8/bigint/serial8 with a UNIQUE constraint.  Some tests still fail.  More work to do, but things are looking really good

* fix the 'words' regression test by adding a pkey to the words table

* refactoring:  move classes into different packages and rename things.  shift some code out of ZDBvisibilityQuery and into ZDBVisibilityQueryHelper

* make a TODO note

* more changes to support Citus integration.

Allow query formulations that use the Postgres system column named 'tableoid':  SELECT * FROM table WHERE zdb(tableoid, ctid) ==> 'foo';

This negates the need for Citus to rewrite ::regclass values.

* refactoring around visibility resolution.  Introduce a new query node called #visibility (and remove #exclude) and start doing the detection inside VisibilityQueryHelper.

* fix equals/hashcode/tostring methods so query caching does the right thing.

* cleanup

* use a bitset to track pkey values

* cleanup

* track update state in a separate index.

I think all of this is going to be reverted and am going to go back to parent/child indexes.

* working on making visibility rules use _routing

* working on managing visibility inside ES.

* tweak test so results are always ordered

* update the 'deletes' test to reflect that ZDB sees deleted records now in zdb_estimate_count()

* fix numeric casting

* trying to make travis happy again.

* more bugfixing

* change version to v3.1.0_BETA1

* bugfix

* remove dead code

* - ensure that committed xids are added to to each shard (elasticsearch.c#markTransactionCommitted)
- move resolution of committed xids into VisibilityQueryHelper and get rid of the aggregate in QueryRewriter
- move resolution of _prev_ctid values into VisibilityQueryHelper and get rid of the aggreate in QueryRewrite
- remove the _zdb_updated field
- rename the _xid field in the 'committed' type to _zdb_committed_xid

* 1) avoid doing a _refresh if an atomic transaction; 2) consider the current transaction as committed when doing a CREATE INDEX or REINDEX or ALTER TABLE statement that rewrites the index.

* remove dead code

* bugfixes and performance optimizations

* gotta have this refresh

* update docker builds

* 1) keep a cache of known committed transaction ids so we can avoid seeking around in the lucene index
2) pass the user query down into the VisibilityQuery -- it's not used, but it I think it could somehow be used to limit the number of _prev_ctids we need to resolve
3) overload BytesRefBuilder so that it's .toBytesRef() method doesn't make a copy of the byte array.

* code cleanup

* code cleanup

* adjustments to refresh, to eliminate one per atomic transaction

* guard against HOT updated

* some changes to refresh strategies so things work when you set a specific refresh_interval

* don't require tables to have a primary key columns

* document citus support

* note the version of Citus

* fix type-o

* Update CITUS-SUPPORT.md

* resolving issue #148

* fixing issue #141

* bump version to 3.1.0_BETA2

* follow change for issue #148 -- only set the null_value property for 'string' fields

* remove field from mapping -- it's not used

* some changes to enable Citus integration to use zdb_tally(), zdb_estimate_count(), etc functions, with some caveats.

Basically, this commit introduces an 'alias' index option that can be used when creating indexes on distributed Citus tables (or also tables for Postgres' table inheritence scheme), and ZDB will use this alias where appropriate.

If using with Citus, you'll need to set zombodb.ignore_visibility to false; before any of ZDB's aggregate and count estimation functions will actually work.

Also bumps version to _BETA3

* actually change version to 3.1.0_BETA3

* fix a bug where committed xid documents might route to the wrong shard, causing NPEs and possibly incorrect document counts.

this necessitates we brute-force _routing values for each shard.  We cache these in the new ZombodbCommitXIDAction.java, per index, so that they don't need to be re-calculated all the time.

* rejigger algorithm for brute-forcing shard routing values

* rejigger when we mark transaction ids as committed to the XACT_EVENT_PRE_COMMIT event so as to avoid a possible postgres PANIC if that call fails (which is what PG does if an ERROR arrises during XACT_EVENT_COMMIT)

* 1) eliminate a direct call to the _refresh endpoing on transaction commit
2) if the table has a primary key, use that as the _routing/_prev_ctid values
which avoids doing a search for every UPDATE statement.

* implements issue #143

* bump version to 3.1.0_BETA4

* implement issue #150

* alternative regression test output to account for differences between PG 9.3 and 9.4/5

* adjust test so it doesn't bomb on different platforms

* fix upgrade from 3.0 to 3.1 by dropping old zdb_tally() functions

* some optimizations to speed up visibility resolution by nearly 10x when there's hundreds of thousands of docs in the 'state' type.

* bugfix from previous commit

* use a hashset and change sort algo

* do some work around equals/hashcode so that ZomboDB's visibility filter can be cached.

This passes all regression tests along with my manual testing, but I'm not 100% sure it's correct.

That said, it gives a 10x-100x performance increase in resolving visibility, even under high UPDATE load.

* expand the xact_visibility test to test another thing, fix the NPE that exposed, and also favor documents with a higer readerOrd when their xids match -- these are the most current version of a doc in the transaction

* remove dead code

* fix visibility bug by introducing a sequence counter, per transaction, that can be used to break ties when a row was updated more than once in a single transaction

* bump version to 3.1.0_BETA5

* change around vacuum so that we ask ES for all the 'state' ctids (in a binary blob), validate them with PG, and then delete them.

This ensures we don't try to delete docs that don't exist in our ES index and it removes the need for ZDB to copy in the LVRelStats struct from 3 different versions of PG.

* Fix a long standing bug where deleting docs (via vacuum) could, depending on timing when that ctid is reused by PG, cause primary and secondary replicas to become out of sync.

The fix for this is to use an explicit version (value of 2) when creating and deleting docs.

Additionally, set the OpType to CREATE when inserting new docs so we'll catch possible corruption issues earlier.

Also additionally, do some extra validation during the handleDeleteRequests() method to catch possible corruption issues earlier.

* Revert "change around vacuum so that we ask ES for all the 'state' ctids (in a binary blob), validate them with PG, and then delete them."

This reverts commit f16d705.

* follow up fix to 689e9fa

* bump version to v3.1.0_BETA6

* teach ZDB's vacuum support to _optimize?only_expunge_deletes=true after a threshold of deleted documents.

Introduces a new index option called "optimize_after" with a default value of zero.

* fix issue #153

* follow up fix to issue #152 -- wasn't exactly fixed before.  We always have to provide a higher version number, so just use the doc's _xid as that version.  it's guaranteed to always be bigger

* bugfixes and cleanup

* minor changes to all Dockerfiles for the -Prelease stuff so that the current username/uid is used within the container so that files it generates are owned by the current user,  rather than root.

Never saw this on OSX but I do see it on Linux.

* bump verison

* doc fix

* rejigger the local-ci scripts/Dockerfile so they don't leave files behind that are owned by root

* only statically link to curl if building in Docker, which is how ZDB builds binary packages for all supported versions/platforms

* don't cache VisibiltyFilter (it wasn't cachable anyways); better error detection in ZombodbBulkAction; don't allow PG query cancel when marking a transaction as complete

* fix some shell scripts to use bash; bump version to _BETA8

* use /bin/bash

* delete useless files

* change index operation type for docs we add to the "state" type so that existing versions won't.

* bump version to 3.1.0
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants