Intermittent index out of bounds error on write #750
Interesting. What details can you provide about the configuration of your postgres and es nodes, the number of concurrent postgres sessions, the relative frequency of SELECT vs. UPDATE vs. INSERT vs. DELETE, or a query that's known to cause this error? Anything. This is definitely a problem, but it's unclear right now how it is happening... |
I've spent all morning trying to recreate this and haven't been successful. So I definitely need lots more details from your end. I understand the error and have two theories on how it could be caused, but I can't make it happen... |
We did end up at least avoiding the bug. Contributing factors, as I understand them: we were 4 transactions deep with a circular reference (two rows with a foreign key on each other). The crash occurred as one of the transactions was closed. Both eliminating the circular reference and eliminating a transaction level made the problem go away; we applied both changes. |
I'll try to make a stab at an MWE, but this bug was fairly mysterious to us. |
I don’t quite understand “4 transactions deep”. Are you using subtransactions? Or do you mean concurrent transactions all trying to update the same row? |
Meaning subtransactions. Here is our docker compose section for elastic and postgres. The pgx version is due to getting our ARM system building. Migrating away from it is something we need to address, but it will certainly involve downtime.
I spent an hour trying to build an MWE but was unsuccessful. I am certainly speculating, but here is the function that was called when the error was triggered. The big update at the end was my attempt to avoid many update calls.
Here is the version after the fix, which involved removing recipe_id from work_order.
|
In the abstract, this doesn't really help me very much. If you could provide some general outline of how the failing transaction uses subtransactions, that might be helpful. Alternatively, you could send me (offline at eebbrr @ gmail dot com) a full dump of your database along with the query or queries that tickle this issue. I definitely want to nail this down, but I've got about 12hrs in this now and haven't come up with anything -- I'm just stabbing in the dark. |
Also, can you confirm the output of: SHOW session_preload_libraries;
It ought to look a little like this:
# SHOW session_preload_libraries;
session_preload_libraries
---------------------------
"zombodb.so"
(1 row) |
Middle of a merge, will do some more work on it when I'm done. |
big win here, this mwe gave me the error
btw:
|
wow, fantastic. I know what I'll be working on all day tomorrow. I appreciate your tenacity in working this out! |
hmm. This isn't working quite right for me: ...
WARNING: there is already a transaction in progress
BEGIN
Time: 0.118 ms
SAVEPOINT
Time: 0.027 ms
DELETE 12
Time: 1.141 ms
UPDATE 30
Time: 0.611 ms
UPDATE 27
Time: 0.552 ms
RELEASE
Time: 0.029 ms
COMMIT
Time: 965.411 ms
ERROR: RELEASE SAVEPOINT can only be used in transaction blocks
Time: 0.126 ms
WARNING: there is no transaction in progress
COMMIT |
I pasted the bottom section of inserts etc into psql and the error printed there. same thing happened to me
The error is almost certainly occurring, look at the time on the commit.
|
yeah, same. I mean, I made a |
okay, so I don't see the ZDB error. |
Do you need any more help from me? This is what I saw with the same file when running it through psql.
|
I'm not seeing the error at all. Not in psql nor in my logs. EDIT: And it seems your script is bugged as you (at least) call |
can you try pasting the section of inserts etc into a psql console? |
hmm. That seems to do it. This is truly bizarre... ERROR: code=Some(200), {"error":null,"errors":true,"items":[{"update":{"error":{"type":"illegal_argument_exception","reason":"failed to execute script","caused_by":{"type":"script_exception","reason":"runtime error","script_stack":["java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:64)","java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:70)","java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:266)","java.base/java.util.Objects.checkIndex(Objects.java:359)","java.base/java.util.ArrayList.remove(ArrayList.java:504)","ctx._source.zdb_aborted_xids.remove(ctx._source.zdb_aborted_xids.indexOf(params.XID));"," ^---- HERE"],"script":"ctx._source.zdb_aborted_xids.remove(ctx._source.zdb_aborted_xids.indexOf(params.XID));","lang":"painless","position":{"offset":79,"start":0,"end":86},"caused_by":{"type":"index_out_of_bounds_exception","reason":"Index -1 out of bounds for length 2"}}}}}]} |
well, you're missing a semicolon on the |
I must've grabbed the wrong error run. actually -- we got lucky. this error is intermittent, I ran the example again and it passed. easily could have passed the first time I ran it. So beware while debugging.
|
I agree there's a ZDB bug here somewhere, but I don't understand why when this whole script passes, it still makes this error:
Between that and the warning about already being in a transaction, I'm a little dubious that what you gave me is complete/correct. |
I don't think that actually passed. I think it simply didn't print the zdb error when running the whole file. I saw the same thing. The error about RELEASE SAVEPOINT only being valid in a transaction block occurs because the transaction blew up on the line before.
I get the same thing.
|
It is also certainly possible I did something insane. I am not a seasoned postgres developer; I have only been working with the project for a month or so. |
Let me know if you need anything more from me. |
You've got:
BEGIN;
SAVEPOINT abc;
<bunch of inserts>
BEGIN; <-- this is incorrect, you're still in a transaction
SAVEPOINT asdf;
<delete>
<updates>
RELEASE SAVEPOINT asdf;
COMMIT; <-- this commits the whole transaction
RELEASE SAVEPOINT abc; <-- this is already gone, due to above COMMIT
COMMIT; <-- no longer in a transaction
I agree there's an issue with ZDB somewhere, but your script is bugged. Can you fix it to match your intentions? The above is why, when the script appears to work, we get this output:
$ dropdb issue-750; createdb issue-750 ; psql issue-750 < issue-750.sql
CREATE EXTENSION
DROP INDEX
DROP FUNCTION
DROP TYPE
DROP INDEX
DROP FUNCTION
DROP TYPE
DROP TABLE
CREATE TABLE
DROP TABLE
CREATE TABLE
ALTER TABLE
CREATE TYPE
CREATE FUNCTION
CREATE INDEX
CREATE TYPE
CREATE FUNCTION
CREATE INDEX
<inserts>
BEGIN
SAVEPOINT
<inserts>
WARNING: there is already a transaction in progress
BEGIN
SAVEPOINT
DELETE 12
UPDATE 30
UPDATE 27
RELEASE
COMMIT
ERROR: RELEASE SAVEPOINT can only be used in transaction blocks
WARNING: there is no transaction in progress
COMMIT |
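To make the walkthrough above concrete, here is a minimal Python sketch that replays the same statement sequence while tracking only whether the session is inside a transaction block. It is a toy model, not PostgreSQL itself: the `replay` function and the `script` list are made up for illustration, savepoint names and aborted-transaction state are ignored, and the three notice strings are copied from the psql output quoted in this thread.

```python
def replay(statements):
    """Track only "inside a transaction block or not" and collect the notices."""
    in_block = False
    notices = []
    for stmt in statements:
        verb = stmt.rstrip(";").upper()
        if verb == "BEGIN":
            if in_block:
                # A second BEGIN does not nest; it only warns.
                notices.append("WARNING: there is already a transaction in progress")
            in_block = True
        elif verb == "COMMIT":
            if not in_block:
                notices.append("WARNING: there is no transaction in progress")
            in_block = False
        elif verb.startswith("RELEASE SAVEPOINT") and not in_block:
            notices.append("ERROR: RELEASE SAVEPOINT can only be used in transaction blocks")
    return notices


script = [
    "BEGIN;", "SAVEPOINT abc;",
    "BEGIN;",                    # second BEGIN while still inside the first
    "SAVEPOINT asdf;", "RELEASE SAVEPOINT asdf;",
    "COMMIT;",                   # ends the whole transaction, savepoint abc included
    "RELEASE SAVEPOINT abc;", "COMMIT;",
]
notices = replay(script)
for notice in notices:
    print(notice)
```

The replay yields one warning for the second BEGIN, one error for the RELEASE after the first COMMIT, and one warning for the final COMMIT, in the same order as the output above.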
ok so this is a more legal example. It gives a warning on the double BEGIN but that appears to only be a warning... anyways, we are using it in a much larger framework, and the double transaction was probably a result of that... I was simply attempting to match what I thought the equivalent was.
|
BEGIN
SAVEPOINT
WARNING: pushing xid=24531
DELETE 12
UPDATE 30
UPDATE 27
RELEASE
RELEASE
INFO: xids to commit: Some({24530, 24531})
WARNING: TransactionCommitted: 24530
WARNING: TransactionCommitted: 24531
WARNING: about to remove xid=24530
WARNING: about to remove xid=24531
WARNING: pushing xid=24529 <-- This shouldn't have happened
INFO: xids to commit: Some({24530, 24531, 24529})
WARNING: TransactionCommitted: 24530
WARNING: TransactionCommitted: 24531
WARNING: TransactionCommitted: 24529
WARNING: about to remove xid=24530
WARNING: about to remove xid=24531
WARNING: about to remove xid=24529
ERROR: code=Some(200), {"error":null,"errors":true,"items":[{"update":{"error":{"type":"illegal_argument_exception","reason":"failed to execute script","caused_by":{"type":"script_exception","reason":"runtime error","script_stack":["java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:64)","java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:70)","java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:266)","java.base/java.util.Objects.checkIndex(Objects.java:359)","java.base/java.util.ArrayList.remove(ArrayList.java:504)","ctx._source.zdb_aborted_xids.remove(ctx._source.zdb_aborted_xids.indexOf(params.XID));"," ^---- HERE"],"script":"ctx._source.zdb_aborted_xids.remove(ctx._source.zdb_aborted_xids.indexOf(params.XID));","lang":"painless","position":{"offset":79,"start":0,"end":86},"caused_by":{"type":"index_out_of_bounds_exception","reason":"Index -1 out of bounds for length 0"}}}}}]}
CONTEXT: src/elasticsearch/bulk.rs:1006:17
Yeah, I'm getting close. It's weird that it's intermittent, but I'll figure it out tomorrow. |
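The failing Painless line in the error above is `zdb_aborted_xids.remove(zdb_aborted_xids.indexOf(params.XID))`: `indexOf` returns -1 when the xid is absent, and removing at index -1 throws. The Python sketch below imitates that Java list behaviour to show how a second removal pass over the xids from the trace could produce the same message. The helper names and the starting list contents are assumptions for illustration, and the guarded variant is only a contrast, not the fix that was shipped.

```python
def java_index_of(items, value):
    """Mimic ArrayList.indexOf: return -1 when the value is absent."""
    try:
        return items.index(value)
    except ValueError:
        return -1


def java_remove_at(items, index):
    """Mimic ArrayList.remove(int): a negative index raises instead of wrapping."""
    if index < 0 or index >= len(items):
        raise IndexError(f"Index {index} out of bounds for length {len(items)}")
    return items.pop(index)


def remove_xid_unguarded(aborted_xids, xid):
    # Same shape as the script in the error: remove(indexOf(xid)) with no check.
    java_remove_at(aborted_xids, java_index_of(aborted_xids, xid))


def remove_xid_guarded(aborted_xids, xid):
    # Contrast only: skip the removal when the xid is not present.
    index = java_index_of(aborted_xids, xid)
    if index >= 0:
        java_remove_at(aborted_xids, index)


doc_aborted_xids = [24530, 24531]      # assumed starting contents
for xid in (24530, 24531):             # first commit pass succeeds
    remove_xid_unguarded(doc_aborted_xids, xid)

try:                                   # second pass repeats an xid that is already gone
    remove_xid_unguarded(doc_aborted_xids, 24530)
    message = None
except IndexError as exc:
    message = str(exc)
print(message)
```

Under these assumptions the second pass fails on the first repeated xid, which is consistent with the "length 0" in the error, and also with the error only appearing when a second pass actually happens.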
I will say, pushing a new transaction id when we're not supposed to was not one of my theories. |
Who knows the mad things people will do with your code... |
That’s why we write code. I enjoy debugging these problems. I’m just hoping it’s a simple fix and not a fundamental flaw. I’ll know more in the next few days. |
Alrighty, looks like this is a bug related to issue #622, which was fixed back on the last day of 2020. Basically, ZDB knows which rows haven't finished being updated in ES, and if it sees you trying to update that row again, it'll requeue that update command to run again. The reason this issue is intermittent is that sometimes ES finishes updating the row before we try to update it again (which means things work), and sometimes it doesn't (which means things fail). And the bug here is that when we requeued the command, it would ultimately try to queue up a "commit" for the current transaction id, which, thanks to the use of subtransactions, turned out to be different at that time. Anyways, the fix here is to skip changing out our concept of the "current transaction" when re-processing deferred commands. I'll get the fix pushed here in a bit along with a test (your .sql script from here -- mostly), and can likely go ahead and publish a new release today too. |
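As a rough illustration of the explanation above, the toy Python model below records an xid to commit for each command and treats a requeued (deferred) command as one that runs while a different xid is current. Everything here is hypothetical: the `process` function, the command dictionaries, and the `deferred` flag are not ZDB's actual data structures, and the xid values are simply borrowed from the earlier trace.

```python
def process(commands, fixed):
    """Return the xids queued for commit, in the order they were recorded."""
    to_commit = []
    for cmd in commands:
        if cmd["deferred"] and fixed:
            # Fixed behaviour: a replayed command leaves the xid bookkeeping alone.
            continue
        if cmd["current_xid"] not in to_commit:
            to_commit.append(cmd["current_xid"])
    return to_commit


commands = [
    {"row": 1, "current_xid": 24530, "deferred": False},
    {"row": 1, "current_xid": 24531, "deferred": False},
    # Replay of a deferred update; by now a different xid is "current" (assumed).
    {"row": 1, "current_xid": 24529, "deferred": True},
]

buggy = process(commands, fixed=False)
patched = process(commands, fixed=True)
print(buggy)
print(patched)
```

In this model the unfixed path picks up the extra xid only when a command is actually deferred, which mirrors why the failure depended on whether ES had finished the earlier update in time.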
released in v3000.1.1. Enjoy! |
ZomboDB version:
v3000.0.8
Postgres version:
13.7
Elasticsearch version:
7.17.0
Problem Description:
We have an intermittent error that occurs on write.
Error Message (if any):
Table Schema/Index Definition:
-- complex, not sure of a mwe.
Output from select zdb.index_mapping('index_name');
Here is the output from the two indices I would expect to be responsible.