Recursive updates via triggers cause a document_missing_exception #622
Was this database created and indexed before I fixed #618? If you do a VACUUM FULL on all the tables with ZDB indexes, do the errors go away? It’s kinda unfair to ask me to try and diagnose problems against the develop branch. The only time the develop branch is in a state to be used is the moment I merge it into master.
No, I’m working on a test environment that I destroy and recreate from scratch every time I start the migration script, so the database was created today.
No, same result.
Oh, I’m sorry! No worries, I don’t expect you to prioritize this over your own stuff. Take your time. I’ll continue investigating and updating the issue if I find something that could be worthwhile. Or, if you’d rather not be distracted, I’ll wait until I can work on …
Can you post your entire database schema? Or can you email me a pg_dump of the database?
Okay, with the database dump you emailed me, I can re-create this. It seems to happen whenever the … I think what's happening is you've got recursive updates happening on the … I don't think there's anything I can do about this. You might should look at doing this in a way where you're just changing the columns in … I'll think on this a bit more tho...
More commentary... ZDB installs its own … I might could maybe fix this by queueing the ES update commands to happen after triggers have finished firing, but a) I don't know how to do that, and b) that would require that ZDB track these things in memory, and that's probably a terrible idea if large UPDATEs are happening.
Again, thank you a lot for looking into this so quickly.
Unfortunately I can’t always change NEW, because most of the time NEW is a row from another table, not from the table with the ZDB index. I’ve worked around that by offloading the index update to an asynchronous service instead of doing it in the same transaction. It’s ok in this case because I don’t mind if there is a small delay before the change is visible in ElasticSearch.
I think I understand what’s happening. I didn’t know before that a row's ctid changes each time it's updated. I suppose …
I do. I absolutely do. ZDB's #1 design goal is to have results always be MVCC-correct, even if the query is wholly answered by Elasticsearch (ie, aggregate queries like …).
I'm gonna put together a standalone test case here in a few minutes, but yeah, I think that's what's happening. In the 5+ years ZDB has been around, this is the first time such a thing has come up and I honestly have never put any thought into this.
I'm still thinking about how to handle this. It's necessary for ZDB to update the xmin/xmax/cmin/cmax values in the ES index on the …
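For readers following along, here's a toy model of the MVCC bookkeeping being discussed: one ES document per heap row *version*, keyed by ctid and carrying the row's transaction-visibility fields. This is a sketch of the idea only -- the struct, function, and simplified visibility rule are mine, not ZomboDB's actual document schema, and real Postgres visibility checks also involve commit status and command IDs (cmin/cmax).

```rust
// Toy model of a ZDB-style ES document (sketch, not ZomboDB's schema).
#[derive(Debug)]
struct EsDoc {
    ctid: (u32, u16),  // (block, offset) -- changes on every UPDATE
    xmin: u64,         // transaction id that created this row version
    xmax: Option<u64>, // transaction id that deleted/updated it, if any
}

// Grossly simplified visibility rule: a version is visible to a snapshot
// if it was created at or before the snapshot's xid and not yet deleted
// by a transaction the snapshot can see.
fn visible(doc: &EsDoc, snapshot_xid: u64) -> bool {
    doc.xmin <= snapshot_xid && doc.xmax.map_or(true, |x| x > snapshot_xid)
}

fn main() {
    // An UPDATE in transaction 10 expires the old version and creates a new one.
    let old = EsDoc { ctid: (0, 1), xmin: 5, xmax: Some(10) };
    let new = EsDoc { ctid: (0, 2), xmin: 10, xmax: None };

    assert!(visible(&old, 9));   // a snapshot from before the UPDATE sees the old version
    assert!(!visible(&old, 10)); // the updating transaction no longer sees it
    assert!(visible(&new, 10));  // it sees the new version instead
    println!("ok");
}
```

This is why the ES copies of xmin/xmax must be kept current: if the "expire old version" write is lost or mis-ordered, ES would report both versions as live.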
Ah, I didn't know oids were removed. I don’t know if this can be solved with … I suppose it would be hard to use the primary key because (1) the type is unknown and potentially spread across multiple columns, or (2) it's just absent altogether. If (1) is solvable, I think requiring a primary key is a reasonable requirement. I wouldn’t mind having to add a column just for that, eg: …
I'm not doing that. ;) It's never been a requirement for ZDB before and it's not going to be now.
Got an update here, which I think is positive, although I don't know what to do about it yet!

ZomboDB indexes documents using background threads, fed by the main Postgres thread using a simple channel (queue). When the triggers recurse, what looks like is happening is that, say, Thread1 will get the "update" message for … If I hack ZDB to only use a single thread then everything works just fine, as ordering is properly preserved.

So I'd say this is my bug -- now I just need to figure out what to do about it. I could just blindly say "if the table has any (non-ZDB) triggers, we run in single-thread mode". That would be the quick fix. I don't know any way to predict if the triggers are going to recurse, and otherwise batching everything in memory and re-ordering would essentially make it a single-threaded process anyway.
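To make the ordering hazard concrete, here's a minimal, deterministic simulation of what the comment above describes. The `Cmd` enum and the `HashMap` standing in for the ES index are my invention for illustration; real ZDB sends these as `_bulk` commands from multiple indexing threads, so the "racy" ordering below is one possible interleaving rather than something forced:

```rust
use std::collections::HashMap;

// Toy commands: index a new row version at a ctid, or expire an existing one.
#[derive(Clone, Copy, Debug)]
enum Cmd {
    Insert(u32),
    Update(u32),
}

// Apply one command against a toy "ES index" (ctid -> is_live).
// Updating a ctid that was never inserted is the document_missing case.
fn apply(index: &mut HashMap<u32, bool>, cmd: Cmd) -> Result<(), String> {
    match cmd {
        Cmd::Insert(ctid) => {
            index.insert(ctid, true);
            Ok(())
        }
        Cmd::Update(ctid) => index
            .get_mut(&ctid)
            .map(|live| *live = false)
            .ok_or_else(|| format!("document_missing_exception for ctid {ctid}")),
    }
}

fn run(index: &mut HashMap<u32, bool>, cmds: &[Cmd]) -> Result<(), String> {
    for &c in cmds {
        apply(index, c)?;
    }
    Ok(())
}

fn main() {
    // Recursive trigger moves a row ctid 1 -> 2 -> 3; the queue order is correct:
    let queue = [Cmd::Update(1), Cmd::Insert(2), Cmd::Update(2), Cmd::Insert(3)];
    let mut es = HashMap::from([(1, true)]); // the original row
    assert!(run(&mut es, &queue).is_ok());

    // But if Thread2's half of the queue reaches ES first,
    // Update(2) arrives before Insert(2) and fails:
    let racy = [Cmd::Update(2), Cmd::Insert(3), Cmd::Update(1), Cmd::Insert(2)];
    let mut es = HashMap::from([(1, true)]);
    assert!(run(&mut es, &racy).is_err());
    println!("ordered queue ok; racy interleaving reproduces document_missing");
}
```

With a single consumer the queue order is preserved and everything succeeds, which matches the single-thread observation above.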
I guess I could also defer queueing the "update" command to be sent along with its corresponding "insert" command. As long as those two commands are in the same _bulk ES request (in the proper order), they'd be fine. I wonder what that would look like, code-wise...
This will re-create the problem. The key here is having enough rows in the table (and updating enough rows) to cause ZDB to spin up more than 1 indexing thread.

```sql
CREATE TABLE issue622 (id serial8, oid bigint, by_trigger bool);

CREATE OR REPLACE FUNCTION update_trigger () RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
    IF (round(random())::int)::boolean THEN
        -- recurse. There's a chance we could end up doing this forever, but seems unlikely
        -- don't feel like making this smarter
        UPDATE issue622 SET by_trigger = true WHERE id = OLD.id;
    END IF;
    RETURN NEW;
END
$$;

CREATE TRIGGER update_trigger AFTER UPDATE ON issue622 FOR EACH ROW EXECUTE FUNCTION update_trigger();

INSERT INTO issue622 (oid) SELECT gs FROM generate_series(1, 30000) gs;
CREATE INDEX idxissue622 ON issue622 USING zombodb ((issue622.*));
UPDATE issue622 SET id = id;
```
So that’s why I could not reproduce it manually. Race condition problems are always fun... :/
Could that behavior be the default, with an option in the index creation …
I think I'm going to be able to keep the "update" and "insert" commands together in the same package, and do them at the time of "insert". Then ZDB will be able to scale out threads automatically like it does now. I've almost got this implemented already. |
… However, let's say our table is: …

When we update that row, we'll want to have two versions in ES that look like this (essentially): …

And now those rows are updated/inserted together. But when the trigger recurses we end up with: …
And what we'll have going on concurrently with ZDB is an update/insert pair for (0, 1) and (0, 2), along with an update/insert pair for (0, 2) and (0, 3). If that second pair hits ES, where it wants to update (0, 2), before the first pair hits ES, where it inserts (0, 2), then we get the "document_missing" error. This is the problem. Grouping these operations together is a good change for ZDB -- it halves the number of things that run through ZDB's queue -- but it doesn't actually fix the problem. I'm still thinking...
…ng ErrorKind::Interrupted was causing things to retry, so use ErrorKind::BrokenPipe instead. This was discovered while debugging issue #622.
I think I've got this working. Basically, ZDB now tracks "in-flight" ctids (ctids that aren't known to have finished their _bulk request yet), and if we see we want to "update" a ctid (in ES) that's in-flight, we queue it up to the end of the entire bulk process. And then after the bulk indexing process has completed, we re-process the queued items, which keeps working recursively until we have nothing else queued. With some debug output, this is what that looks like:

```
[v12.3][79378] contrib_regression=# UPDATE issue622 SET id = id;
INFO: requeueing 15573 commands
INFO: requeueing 7774 commands
INFO: requeueing 3852 commands
INFO: requeueing 1898 commands
INFO: requeueing 947 commands
INFO: requeueing 479 commands
INFO: requeueing 239 commands
INFO: requeueing 112 commands
INFO: requeueing 49 commands
INFO: requeueing 19 commands
INFO: requeueing 6 commands
INFO: requeueing 2 commands
INFO: requeueing 0 commands
UPDATE 30000
Time: 7972.490 ms (00:07.972)
```

I think this is the correct way to handle it with minimal code weirdness. I'm gonna think on this another day before I push it.
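Here's a single-threaded sketch of that requeuing loop, under the simplifying assumption that "the target ctid is still in flight" can be approximated as "the target ctid hasn't been indexed yet" (all names are mine, not ZDB's, and the real implementation runs these passes across multiple threads):

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug)]
enum Cmd {
    Insert(u32), // index a new row version at this ctid
    Update(u32), // expire an existing row version at this ctid
}

// Process one "bulk" pass. Instead of failing, updates whose target
// hasn't landed in the index yet are deferred; the leftovers are returned
// so the caller can replay them in another pass.
fn bulk_pass(index: &mut HashMap<u32, bool>, cmds: Vec<Cmd>) -> Vec<Cmd> {
    let mut deferred = Vec::new();
    for c in cmds {
        match c {
            Cmd::Insert(t) => {
                index.insert(t, true);
            }
            Cmd::Update(t) if index.contains_key(&t) => {
                index.insert(t, false); // expire the old version
            }
            Cmd::Update(_) => deferred.push(c), // target still "in flight"
        }
    }
    deferred
}

fn main() {
    // Same recursive-trigger shape as the repro (ctid 1 -> 2 -> 3), but
    // arriving in a bad order: Update(2) shows up before Insert(2).
    let mut es: HashMap<u32, bool> = HashMap::from([(1, true)]);
    let mut queue = vec![Cmd::Update(2), Cmd::Insert(3), Cmd::Update(1), Cmd::Insert(2)];
    while !queue.is_empty() {
        queue = bulk_pass(&mut es, queue);
        println!("requeuing {} commands", queue.len());
    }
}
```

Each pass can only shrink the deferred set (every deferred update eventually finds its insert), which is why the INFO counts above decay toward zero.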
(And I need to learn to spell requeuing, but whatever.)
And all these additional requeuing batches run in parallel themselves, as much as they can, going through the same in-flight detection process.
And this is what it looks like against your database:

```
=# update freelance set id = id;
INFO: requeuing 1400 commands
UPDATE 23105
Time: 81291.085 ms (01:21.291)
```

Only one additional bulk process, but your triggers are more deterministic than mine. I couldn't do that yesterday. So that's great. I'm pretty jazzed! I'm not sure if I like the performance here or not, but I think that's just a factor of ZDB being ZDB. ZDB isn't known for its UPDATE perf relative to not having a ZDB index on your table.
Oh, in --release mode, it's fine:

```
[v12.3][82463] mathroc=# update freelance set id = id;
INFO: requeuing 1276 commands
UPDATE 23105
Time: 47299.639 ms (00:47.300)
[v12.3][82463] mathroc=#
```
Oh, that looks promising! I’m not too worried about the performance; this is kind of an edge case, and such a large-scale update won’t happen often once the migration is over. The time it takes to reindex is actually acceptable. If you commit this on a branch I can test it too (not today, but soon).
I feel pretty good about the changes here. They're really not that intrusive. I'll just include this as part of -alpha3, which I think might actually happen today! |
There are two primary changes here. The first is that during an UPDATE, we postpone queuing the ES BulkRequestCommand until we see the corresponding insert command via the index AM API. The second is that we now track a set of "in flight" ctids (ctids that are either on their way to ES or being waited on for ES to finish indexing); if we see an update command for one of these, we defer it (and its corresponding index command) and play those in after all the other documents have successfully finished. This happens recursively until all deferred commands have been processed. The latter adds a small bit of per-tuple overhead in all cases (including CREATE INDEX), but it seems worth the price to ensure things actually work!
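The first change can be sketched like this (the types, struct, and method names are my invention for illustration, not the actual ZomboDB code): hold back the ES "expire old version" command until the index AM delivers the matching insert of the new version, then emit both into one _bulk request so their relative order is guaranteed.

```rust
// Toy _bulk commands: expire the old row version, index the new one.
#[derive(Clone, Debug, PartialEq)]
enum BulkCmd {
    ExpireOld { old_ctid: u32 },
    IndexNew { new_ctid: u32 },
}

struct Bulker {
    pending_expire: Option<u32>, // held back until the insert arrives
    requests: Vec<Vec<BulkCmd>>, // each inner Vec is one _bulk request
}

impl Bulker {
    fn new() -> Self {
        Bulker { pending_expire: None, requests: Vec::new() }
    }

    // Called when Postgres reports the old version died (UPDATE path):
    // don't send anything yet, just remember it.
    fn on_update(&mut self, old_ctid: u32) {
        self.pending_expire = Some(old_ctid);
    }

    // Called via the index AM when the new version is inserted: ship the
    // postponed expire (if any) and the insert together, in order.
    fn on_insert(&mut self, new_ctid: u32) {
        let mut req = Vec::new();
        if let Some(old) = self.pending_expire.take() {
            req.push(BulkCmd::ExpireOld { old_ctid: old });
        }
        req.push(BulkCmd::IndexNew { new_ctid });
        self.requests.push(req);
    }
}

fn main() {
    let mut b = Bulker::new();
    b.on_update(1); // old ctid dies...
    b.on_insert(2); // ...new ctid is born: one paired request of two commands
    b.on_insert(3); // a plain INSERT still travels alone
    assert_eq!(b.requests.len(), 2);
    assert_eq!(b.requests[0].len(), 2);
    println!("ok");
}
```

Pairing alone guarantees order *within* a request; the in-flight tracking from the second change is still what prevents two overlapping pairs from racing each other.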
ZomboDB version: alpha3 (8ffc83d)
Postgres version: 12.4
Elasticsearch version: 7.10.1
Problem Description:
I have a script to migrate a database to a new one. It first inserts a bunch of rows to be indexed and commits that.
Later it inserts a bunch of rows somewhere else, which triggers an update on the zombodb-indexed rows (because the indexed function uses rows from this other table to build the final document to index in Elasticsearch). When the transaction is committed, zombodb tries to update the document in ElasticSearch but it fails with this error.
Error Message (if any):
There are a few hundred
document_missing_exception
errors, but I truncated the message here. I think none of the documents were updated.
Table Schema/Index Definition:
I don’t have a small enough schema at the moment; I’ll update if I can reproduce the issue with something smaller. However, when I try to do what I expect to be similar manually in postgres (instead of from my php script), I don’t know yet how to reproduce the issue.
If I try to query one of the missing documents:

```sql
select zdb.request('public.idx_es_freelance', '_doc/1859720839210')::jsonb;
```

it’s indeed missing. Yet the index in ES does contain as many documents as the table in postgres, and fetching those by id works fine. So I don’t know where all those ids without a corresponding ES document are coming from.
I restarted with a fresh Elasticsearch database so I only have one index on it, to make sure there wouldn’t be any confusion, on my part or from ZDB, about which index to use.
I was thinking that maybe it was caused by a delay related to the ES wait_for parameter, but the doc says it should be handled properly, and when I tried counting the documents before the update with a

```sql
select * from zdb.cat_indices;
```

everything was indeed fine. And adding a 30-second delay did not help either. I’ll continue investigating this for a while, but I’m open to suggestions on how to debug this further 🙏