Intermittent index out of bounds error on write #750

Closed
bhalonen opened this issue Jun 21, 2022 · 33 comments

Comments

@bhalonen
Contributor

bhalonen commented Jun 21, 2022

ZomboDB version:
v3000.0.8
Postgres version:
13.7
Elasticsearch version:
7.17.0
Problem Description:
We have an intermittent error that occurs on write.

Error Message (if any):

error: code=Some(200), {"error":null,"errors":true,"items":[{"update":{"error":{"type":"illegal_argument_exception","reason":"failed to execute script","caused_by":{"type":"script_exception","reason":"runtime error","script_stack":["java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:64)","java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:70)","java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:266)","java.base/java.util.Objects.checkIndex(Objects.java:359)","java.base/java.util.ArrayList.remove(ArrayList.java:504)","ctx._source.zdb_aborted_xids.remove(ctx._source.zdb_aborted_xids.indexOf(params.XID));","                                                                               ^---- HERE"],"script":"ctx._source.zdb_aborted_xids.remove(ctx._source.zdb_aborted_xids.indexOf(params.XID));","lang":"painless","position":{"offset":79,"start":0,"end":86},"caused_by":{"type":"index_out_of_bounds_exception","reason":"Index -1 out of bounds for length 2"}}}}}]}
[1]     at Parser.parseErrorMessage (/home/brent/projects/steel-ui/steelhead-deploy/gosteelhead/node_modules/
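
The failure mode is visible in the stack trace: the Painless script calls `remove(indexOf(...))`, and Java's `List.indexOf` returns -1 when the element is absent, which `ArrayList.remove(-1)` rejects. A minimal Python sketch of the same mechanism (the XID values here are hypothetical, not taken from this report):

```python
# Sketch of the failure mode in the Painless script above, using
# hypothetical values. In Java/Painless, List.indexOf returns -1 for a
# missing element, and ArrayList.remove(-1) throws
# IndexOutOfBoundsException ("Index -1 out of bounds for length 2").
aborted_xids = [101, 102]   # stand-in for zdb_aborted_xids
xid = 999                   # an XID that is not in the list

# Java-style indexOf: -1 when absent (Python's list.index raises instead)
idx = aborted_xids.index(xid) if xid in aborted_xids else -1
print(idx)  # -1: remove(idx) would be out of bounds, as in the error above

# A guarded script would skip the remove when the XID is absent:
if idx >= 0:
    aborted_xids.pop(idx)
print(aborted_xids)  # list unchanged
```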

Table Schema/Index Definition:
-- complex, not sure of an MWE.

CREATE FUNCTION steelhead.search_recipe_node_idx(steelhead.recipe_node) RETURNS steelhead.search_recipe_node_idx_type IMMUTABLE STRICT LANGUAGE sql AS $$
SELECT ROW (
           $1.id,
           $1.name,
           $1.description_markdown::text,
           $1.operator_input_json_schema::text,
           $1.derived_from,
           $1.treatment_id,
           'recipe node'
           )::steelhead.search_recipe_node_idx_type;
$$;

CREATE INDEX search_recipe_node_idx 
      ON steelhead.recipe_node 
      USING zombodb (steelhead.search_recipe_node_idx(recipe_node))
         WITH (url='http://elasticsearch:9200/');
CREATE TYPE steelhead.search_work_order_idx_type AS (
    id integer, 
    name text, 
    id_in_domain text,
    customer_name text,
    received_order_id integer,
    recipe_id integer,
    product_id integer,
    created_at timestamptz,
    creator_id integer,
    custom_inputs text,
    entity text
);  

CREATE FUNCTION steelhead.search_work_order_idx(steelhead.work_order) RETURNS steelhead.search_work_order_idx_type IMMUTABLE STRICT LANGUAGE sql AS $$
SELECT ROW (
           $1.id,
           $1.name,
           concat('WO',$1.id_in_domain::text),
           (select name from steelhead.customer where id = $1.customer_id),
           $1.received_order_id,
           $1.recipe_id,
           $1.product_id,
           $1.created_at,
           $1.creator_id,
            $1.custom_inputs,
           'work orders'
           )::steelhead.search_work_order_idx_type;
$$;

CREATE INDEX search_work_order_idx 
      ON steelhead.work_order 
      USING zombodb (steelhead.search_work_order_idx(work_order))
         WITH (url='http://elasticsearch:9200/');
GRANT EXECUTE ON FUNCTION steelhead.search_work_order_idx to steelhead_authed;

ALTER INDEX search_work_order_idx SET (options='parts_transfer:(id=<steelhead.parts_transfer.search_parts_transfer_idx>to_work_order_id),
        part_number:(parts_transfer.part_number_id=<steelhead.part_number.search_part_number_idx>id),
        creator:(creator_id=<steelhead.user.search_user_idx>id),
        received_order:(received_order_id=<steelhead.received_order.search_received_order_idx>id),
        ro_creator:(received_order.creator_id=<steelhead.user.search_user_idx>id),
        recipe:(recipe_id=<steelhead.recipe_node.search_recipe_node_idx>id),
        product:(product_id=<steelhead.product.search_product_idx>id)');

Output from select zdb.index_mapping('index_name');:
Here is the output from the two indices I would expect to be responsible.

{"122906.123633.124008.130171": {"mappings": {"properties": {"id": {"type": "integer"}, "name": {"type": "text", "copy_to": ["zdb_all"], "analyzer": "zdb_standard", "fielddata": true}, "entity": {"type": "text", "copy_to": ["zdb_all"], "analyzer": "zdb_standard", "fielddata": true}, "zdb_all": {"type": "text", "analyzer": "zdb_all_analyzer"}, "zdb_cmax": {"type": "integer"}, "zdb_cmin": {"type": "integer"}, "zdb_ctid": {"type": "long"}, "zdb_xmax": {"type": "long"}, "zdb_xmin": {"type": "long"}, "derived_from": {"type": "integer"}, "treatment_id": {"type": "integer"}, "zdb_aborted_xids": {"type": "long"}, "description_markdown": {"type": "text", "copy_to": ["zdb_all"], "analyzer": "zdb_standard", "fielddata": true}, "operator_input_json_schema": {"type": "text", "copy_to": ["zdb_all"], "analyzer": "zdb_standard", "fielddata": true}}, "date_detection": false, "dynamic_templates": [{"strings": {"mapping": {"type": "keyword", "copy_to": "zdb_all", "normalizer": "lowercase", "ignore_above": 10922}, "match_mapping_type": "string"}}, {"dates_times": {"mapping": {"type": "keyword", "fields": {"date": {"type": "date", "format": "strict_date_optional_time||epoch_millis||HH:mm:ss.S||HH:mm:ss.SX||HH:mm:ss.SS||HH:mm:ss.SSX||HH:mm:ss.SSS||HH:mm:ss.SSSX||HH:mm:ss.SSSS||HH:mm:ss.SSSSX||HH:mm:ss.SSSSS||HH:mm:ss.SSSSSX||HH:mm:ss.SSSSSS||HH:mm:ss.SSSSSSX"}}, "copy_to": "zdb_all"}, "match_mapping_type": "date"}}, {"objects": {"mapping": {"type": "nested", "include_in_parent": true}, "match_mapping_type": "object"}}], "numeric_detection": false}}}

{"122906.123633.124067.130198": {"mappings": {"properties": {"id": {"type": "integer"}, "name": {"type": "text", "copy_to": ["zdb_all"], "analyzer": "zdb_standard", "fielddata": true}, "entity": {"type": "text", "copy_to": ["zdb_all"], "analyzer": "zdb_standard", "fielddata": true}, "zdb_all": {"type": "text", "analyzer": "zdb_all_analyzer"}, "zdb_cmax": {"type": "integer"}, "zdb_cmin": {"type": "integer"}, "zdb_ctid": {"type": "long"}, "zdb_xmax": {"type": "long"}, "zdb_xmin": {"type": "long"}, "recipe_id": {"type": "integer"}, "created_at": {"type": "keyword", "fields": {"date": {"type": "date"}}, "copy_to": ["zdb_all"]}, "creator_id": {"type": "integer"}, "product_id": {"type": "integer"}, "id_in_domain": {"type": "text", "copy_to": ["zdb_all"], "analyzer": "zdb_standard", "fielddata": true}, "custom_inputs": {"type": "text", "copy_to": ["zdb_all"], "analyzer": "zdb_standard", "fielddata": true}, "customer_name": {"type": "text", "copy_to": ["zdb_all"], "analyzer": "zdb_standard", "fielddata": true}, "zdb_aborted_xids": {"type": "long"}, "received_order_id": {"type": "integer"}}, "date_detection": false, "dynamic_templates": [{"strings": {"mapping": {"type": "keyword", "copy_to": "zdb_all", "normalizer": "lowercase", "ignore_above": 10922}, "match_mapping_type": "string"}}, {"dates_times": {"mapping": {"type": "keyword", "fields": {"date": {"type": "date", "format": "strict_date_optional_time||epoch_millis||HH:mm:ss.S||HH:mm:ss.SX||HH:mm:ss.SS||HH:mm:ss.SSX||HH:mm:ss.SSS||HH:mm:ss.SSSX||HH:mm:ss.SSSS||HH:mm:ss.SSSSX||HH:mm:ss.SSSSS||HH:mm:ss.SSSSSX||HH:mm:ss.SSSSSS||HH:mm:ss.SSSSSSX"}}, "copy_to": "zdb_all"}, "match_mapping_type": "date"}}, {"objects": {"mapping": {"type": "nested", "include_in_parent": true}, "match_mapping_type": "object"}}], "numeric_detection": false}}}
@eeeebbbbrrrr
Collaborator

Interesting. What details can you provide about the configuration of your Postgres and ES nodes, the number of concurrent Postgres sessions, the relative frequency of SELECT vs UPDATE vs INSERT vs DELETE, or a query that's known to cause this error? Anything.

This is definitely a problem, but it's unclear right now how it is happening...

@eeeebbbbrrrr
Collaborator

I've spent all morning trying to recreate this and haven't been successful. So I definitely need lots more details from your end.

I understand the error and have two theories on how it could be caused, but I can't make it happen...

@eeeebbbbrrrr eeeebbbbrrrr self-assigned this Jun 27, 2022
@bhalonen
Contributor Author

We did end up at least avoiding the bug.

Contributing factors from my understanding: we were 4 transactions deep with a circular reference (two rows with a foreign key on each other). The crash occurred as one of the transactions was closed.

Both eliminating the circular reference and eliminating a transaction level made the problem go away, we executed both solutions.

@bhalonen
Contributor Author

I'll take a stab at an MWE, but this bug was fairly mysterious to us.

@eeeebbbbrrrr
Collaborator

eeeebbbbrrrr commented Jun 28, 2022

I don’t quite understand “4 transactions deep”.

Are you using subtransactions?

Or do you mean concurrent transactions all trying to update the same row?

@bhalonen
Contributor Author

Meaning subtransactions.

Here is our docker-compose section for Elasticsearch and Postgres.

The pinned pgx version is from getting our ARM system building. Migrating off of it is something we need to address, but it will certainly involve downtime.

  postgres:
    image: "steelhead:postgres-13"
    ports:
      - "5432:5432"
    build:
      context: "${PWD}/docker/postgres"
      dockerfile: "${PWD}/docker/postgres/Dockerfile"
      args:
        IMAGE: "postgres:13-alpine"
        MIGRATE_VERSION: "1.1"
        WALG_VERSION: "v1.1.1-rc"
        PGX_VERSION: "47309db05f6453a799d85bcde5a61e186145126f"
        ZOMBODB_VERSION: "v3000.0.8"
    restart: always
    env_file: ".env"
    command: [
      "-c", "config_file=/etc/steelhead.conf",
      
      "-c", "wal_level=logical",
      "-c", "max_wal_senders=8",
      "-c", "max_replication_slots=16",
      
      "-c", "archive_mode=${PGARCHIVEMODE}",
      "-c", "archive_command=timeout 600 wal-g wal-push %p",
      "-c", "archive_timeout=600",
    ]
    volumes:
      - "${PWD}/.gmrc:/init/.gmrc:ro"
      - "${PWD}/migrations:/init/migrations"
      - "${PWD}/data:/var/lib/postgresql/data"
      - "${PWD}/backups:/etc/wal-g:ro"
      - "${PWD}/backups/cron.conf:/etc/crontabs/root:ro"
      - "${PWD}/postgresql.conf:/etc/steelhead.conf:ro"
    depends_on:
      - "elasticsearch"
  elasticsearch:
    image: "steelhead:elasticsearch-7"
    build:
      context: "${PWD}/docker/elasticsearch"
      dockerfile: "${PWD}/docker/elasticsearch/Dockerfile"
      args:
        IMAGE: "docker.elastic.co/elasticsearch/elasticsearch:7.17.0"
    restart: always
    env_file: ".env"
    ports:
      - "9200:9200"
    environment:  # if any of these might be deployment-specific, then put them in example.env (and .env)
      "xpack.security.enabled": "false"
      "discovery.type": "single-node"
      "bootstrap.memory_lock": "true"
      "ES_JAVA_OPTS": "-Xms2g -Xmx2g"
      "cluster.max_shards_per_node": "5000"
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - "${PWD}/search:/search"
    entrypoint: "/bin/bash"
    command: [
      "-c",
      "set -eu ; rmdir data &> /dev/null || true ; mkdir -p /search/data && chown 1000:0 /search/* && ln -snf /search/data data && exec /bin/tini -- /usr/local/bin/docker-entrypoint.sh eswrapper",
    ]

I spent an hour trying to build an MWE but was unsuccessful; I am certainly being speculative.

Here is the function that was being called when the error was triggered. The big UPDATE at the end was my attempt to avoid many individual UPDATE calls.

CREATE or REPLACE FUNCTION steelhead.recipe_node_relationship_update(delete_ids int[],to_ids int[], from_ids int[], child_inds int[], wo_id int, node_list int[], index_list int[] ) RETURNS  int  AS $$  
    BEGIN
        DELETE FROM steelhead.recipe_node_relationship as rnr  where rnr.to_id = ANY(delete_ids) and rnr.type = 'is a child of';

        INSERT INTO steelhead.recipe_node_relationship(from_id, to_id, type, child_ind)
        SELECT temptable.from_id, temptable.to_id, 'is a child of', temptable.child_ind FROM (
        SELECT UNNEST(to_ids) as to_id, 
           UNNEST(from_ids) as from_id, 
           UNNEST(child_inds) as child_ind
        ) as temptable;

        
        UPDATE steelhead.work_order set recipe_id = node_list[1] where work_order.id = wo_id;

        UPDATE steelhead.recipe_node set (recipe_ind, work_order_id) = (
           
                SELECT ind, wo_id FROM ( SELECT UNNEST(node_list) as recipe_node_id, UNNEST(index_list) as ind) as temptable2 
                WHERE steelhead.recipe_node.id = temptable2.recipe_node_id
                UNION 
                select null as ind, null as wo_id
                from (select steelhead.recipe_node.id as test_id) as test_field 
                where steelhead.recipe_node.work_order_id = wo_id and not exists (
                    select * 
                    from ( SELECT UNNEST(node_list) as recipe_node_id) as temptable3 
                    where temptable3.recipe_node_id = test_field.test_id
                )
            ) where steelhead.recipe_node.id = ANY(node_list) or steelhead.recipe_node.work_order_id = wo_id;

     return 1;
    END 
$$ LANGUAGE plpgsql strict VOLATILE PARALLEL UNSAFE;
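
(As an aside for readers: the multi-array UNNEST pattern in the INSERT above relies on Postgres expanding set-returning functions in the SELECT list in lockstep, zipping equal-length arrays row-wise. A minimal illustration, with made-up values:)

```sql
-- Parallel UNNEST of equal-length arrays expands element-by-element:
SELECT UNNEST(ARRAY[10, 20]) AS to_id,
       UNNEST(ARRAY[1, 2])   AS from_id,
       UNNEST(ARRAY[0, 1])   AS child_ind;
-- yields two rows: (10, 1, 0) and (20, 2, 1)
```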

Here is the post-fix version, which involved removing recipe_id from work_order:

CREATE or REPLACE FUNCTION steelhead.recipe_node_relationship_update(delete_ids int[],to_ids int[], from_ids int[], child_inds int[], wo_id int, node_list int[], index_list int[] ) RETURNS  void  AS $$  
    BEGIN
        DELETE FROM steelhead.recipe_node_relationship as rnr  where rnr.to_id = ANY(delete_ids) and rnr.type = 'is a child of';

        INSERT INTO steelhead.recipe_node_relationship(from_id, to_id, type, child_ind)
        SELECT temptable.from_id, temptable.to_id, 'is a child of', temptable.child_ind FROM (
        SELECT UNNEST(to_ids) as to_id, 
           UNNEST(from_ids) as from_id, 
           UNNEST(child_inds) as child_ind
        ) as temptable;

        UPDATE steelhead.recipe_node set recipe_ind = null , work_order_id = null where steelhead.recipe_node.work_order_id = wo_id;

        UPDATE steelhead.recipe_node set (recipe_ind, work_order_id) = (
           
                SELECT ind, wo_id FROM ( SELECT UNNEST(node_list) as recipe_node_id, UNNEST(index_list) as ind) as temptable2 
                WHERE steelhead.recipe_node.id = temptable2.recipe_node_id
            ) where steelhead.recipe_node.id = ANY(node_list) ;

     return;
    END 
$$ LANGUAGE plpgsql strict VOLATILE;

@eeeebbbbrrrr
Collaborator

In the abstract, this doesn't really help me very much.

If you could provide some general outline of how the failing transaction uses subtransactions, that might be helpful.

Alternatively, you could send me (offline at eebbrr @ gmail dot com) a full dump of your database along with the query or queries that tickle this issue.

I definitely want to nail this down, but I've got about 12hrs in this now and haven't come up with anything -- I'm just stabbing in the dark.

@eeeebbbbrrrr
Collaborator

also, can you confirm the output of:

SHOW session_preload_libraries;

It ought to look a little like this:

# SHOW session_preload_libraries;
 session_preload_libraries 
---------------------------
 "zombodb.so"
(1 row)

@bhalonen
Contributor Author

Middle of a merge, will do some more work on it when I'm done.

@bhalonen
Contributor Author

bhalonen commented Jun 28, 2022

Big win here: this MWE gave me the error.

DROP INDEX IF EXISTS table_a_idx;
DROP function if exists table_a_idx;
DROP TYPE IF EXISTS table_a_type;


DROP INDEX IF EXISTS table_b_idx;
DROP function if exists table_b_idx;
DROP TYPE IF EXISTS table_b_type;

DROP TABLE IF EXISTS table_a CASCADE;

CREATE TABLE IF NOT EXISTS table_a (
    id serial primary key,
    val text
);
DROP TABLE IF EXISTS table_b;

CREATE TABLE IF NOT EXISTS table_b (
    id serial primary key,
    id_a integer references table_a(id),
    val text
);

alter table table_a add column id_b integer references table_b(id);



CREATE TYPE table_a_type AS (
    id integer,  
    id_b integer,
    val text
);  

CREATE FUNCTION table_a_idx(table_a) RETURNS table_a_type IMMUTABLE STRICT LANGUAGE sql AS $$
SELECT ROW (
           $1.id,
           $1.id_b,
           $1.val
)
$$;

CREATE INDEX table_a_idx 
      ON table_a 
      USING zombodb (table_a_idx(table_a))
         WITH (url='http://elasticsearch:9200/');

CREATE TYPE table_b_type AS (
    id integer,  
    val text
);  

CREATE FUNCTION table_b_idx(table_b) RETURNS table_b_type IMMUTABLE STRICT LANGUAGE sql AS $$
SELECT ROW (
           $1.id,
            $1.val
)
$$;

CREATE INDEX table_b_idx 
      ON table_b 
      USING zombodb (table_b_idx(table_b))
         WITH (url='http://elasticsearch:9200/');
Insert into table_a (val) values ('hello');
Insert into table_a (val) values ('hello');
Insert into table_a (val) values ('hello');
Insert into table_a (val) values ('hello');
Insert into table_a (val) values ('hello');
Insert into table_a (val) values ('hello');
Insert into table_a (val) values ('hello');
Insert into table_a (val) values ('hello');
Insert into table_a (val) values ('hello');
Insert into table_a (val) values ('hello');
Insert into table_a (val) values ('hello');
Insert into table_a (val) values ('hello');
Insert into table_a (val) values ('hello');
Insert into table_a (val) values ('hello');
Insert into table_a (val) values ('hello');
Insert into table_a (val) values ('hello');
Insert into table_a (val) values ('hello');
Insert into table_a (val) values ('hello');
Insert into table_a (val) values ('hello');
Insert into table_a (val) values ('hello');
Insert into table_a (val) values ('hello');
Insert into table_a (val) values ('hello');
Insert into table_a (val) values ('hello');
Insert into table_a (val) values ('hello');
Insert into table_a (val) values ('hello');
Insert into table_a (val) values ('hello');
Insert into table_a (val) values ('hello');
Insert into table_a (val) values ('hello');
Insert into table_a (val) values ('hello');
Insert into table_a (val) values ('hello');
Insert into table_a (val) values ('123');
Insert into table_a (val) values ('123');
Insert into table_a (val) values ('123');
Insert into table_a (val) values ('123');
Insert into table_a (val) values ('123');
Insert into table_a (val) values ('123');
Insert into table_a (val) values ('123');
Insert into table_a (val) values ('123');
Insert into table_a (val) values ('123');
Insert into table_a (val) values ('123');
Insert into table_a (val) values ('123');
Insert into table_a (val) values ('123');

BEGIN;
savepoint abc;
insert into table_b (val) values ('world');
insert into table_b (val) values ('world');
insert into table_b (val) values ('world');
insert into table_b (val) values ('world');
insert into table_b (val) values ('world');
insert into table_b (val) values ('world');
insert into table_b (val) values ('world');
insert into table_b (val) values ('world');
insert into table_b (val) values ('world');
insert into table_b (val) values ('world');
insert into table_b (val) values ('world');
insert into table_b (val) values ('world');
insert into table_b (val) values ('world');
insert into table_b (val) values ('world');
insert into table_b (val) values ('world');
insert into table_b (val) values ('world');
insert into table_b (val) values ('world');
insert into table_b (val) values ('world');
insert into table_b (val) values ('world');
insert into table_b (val) values ('world');
insert into table_b (val) values ('world');
insert into table_b (val) values ('world');
insert into table_b (val) values ('world');
insert into table_b (val) values ('world');
insert into table_b (val) values ('world');
insert into table_b (val) values ('world');
insert into table_b (val) values ('world');
BEGIN;
savepoint asdf;
delete from table_a where val = '123';
UPDATE table_a set id_b = 1 where val ='hello';
UPDATE table_b set id_a = 1 where val = 'world';
release savepoint asdf;
COMMIT;
release savepoint abc;
COMMIT;
ERROR:  code=Some(200), {"error":null,"errors":true,"items":[{"update":{"error":{"type":"illegal_argument_exception","reason":"failed to execute script","caused_by":{"type":"script_exception","reason":"runtime error","script_stack":["java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:64)","java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:70)","java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:266)","java.base/java.util.Objects.checkIndex(Objects.java:359)","java.base/java.util.ArrayList.remove(ArrayList.java:504)","ctx._source.zdb_aborted_xids.remove(ctx._source.zdb_aborted_xids.indexOf(params.XID));","                                                                               ^---- HERE"],"script":"ctx._source.zdb_aborted_xids.remove(ctx._source.zdb_aborted_xids.indexOf(params.XID));","lang":"painless","position":{"offset":79,"start":0,"end":86},"caused_by":{"type":"index_out_of_bounds_exception","reason":"Index -1 out of bounds for length 0"}}}}}]}
CONTEXT:  src/elasticsearch/bulk.rs:1003:17

btw:

steelhead=# SHOW session_preload_libraries;
 session_preload_libraries 
---------------------------
 "zombodb.so"
(1 row)

@eeeebbbbrrrr
Collaborator

wow, fantastic. I know what I'll be working on all day tomorrow. I appreciate your tenacity in working this out!

@eeeebbbbrrrr
Collaborator

eeeebbbbrrrr commented Jun 28, 2022

Hmm. This isn't working quite right for me:

...
WARNING:  there is already a transaction in progress
BEGIN
Time: 0.118 ms
SAVEPOINT
Time: 0.027 ms
DELETE 12
Time: 1.141 ms
UPDATE 30
Time: 0.611 ms
UPDATE 27
Time: 0.552 ms
RELEASE
Time: 0.029 ms
COMMIT
Time: 965.411 ms
ERROR:  RELEASE SAVEPOINT can only be used in transaction blocks
Time: 0.126 ms
WARNING:  there is no transaction in progress
COMMIT

@bhalonen
Contributor Author

bhalonen commented Jun 28, 2022

I pasted the bottom section of inserts etc. into psql and the error printed there. The same thing happened to me:

DELETE 12
steelhead=*# UPDATE table_a set id_b = 1 where val ='hello';
UPDATE 30
steelhead=*# UPDATE table_b set id_a = 1 where val = 'world';
UPDATE 27
steelhead=*# release savepoint asdf;
RELEASE
steelhead=*# COMMIT;
ERROR:  code=Some(200), {"error":null,"errors":true,"items":[{"update":{"error":{"type":"illegal_argument_exception","reason":"failed to execute script","caused_by":{"type":"script_exception","reason":"runtime error","script_stack":["java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:64)","java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:70)","java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:266)","java.base/java.util.Objects.checkIndex(Objects.java:359)","java.base/java.util.ArrayList.remove(ArrayList.java:504)","ctx._source.zdb_aborted_xids.remove(ctx._source.zdb_aborted_xids.indexOf(params.XID));","                                                                               ^---- HERE"],"script":"ctx._source.zdb_aborted_xids.remove(ctx._source.zdb_aborted_xids.indexOf(params.XID));","lang":"painless","position":{"offset":79,"start":0,"end":86},"caused_by":{"type":"index_out_of_bounds_exception","reason":"Index -1 out of bounds for length 0"}}}}}]}
CONTEXT:  src/elasticsearch/bulk.rs:1003:17
steelhead=# release savepoint abc;
ERROR:  RELEASE SAVEPOINT can only be used in transaction blocks

The error is almost certainly occurring; look at the time on the COMMIT.

COMMIT
Time: 965.411 ms

@eeeebbbbrrrr
Collaborator

Yeah, same. I mean, I made an issue-750.sql script and am playing it in via psql.

@eeeebbbbrrrr
Collaborator

Okay, so I don't see the ZDB error.

@bhalonen
Contributor Author

bhalonen commented Jun 28, 2022

Do you need any more help from me?
I mean, is the error occurring silently and you aren't seeing it when playing in a file (which happened to me as well), or are you not seeing it, period?

This is what I saw when running the same file into psql:

[brent@brent-desktop steelhead-deploy]$ ./migrate.sh watch --once
graphile-migrate: Already up to date
[2022-06-28T20:20:59.878Z]: Running current.sql

🛑 Error occurred at line 143, column 1 of "current.sql":
|   UPDATE table_a set id_b = 1 where val ='hello';
|   UPDATE table_b set id_a = 1 where val = 'world';
|   release savepoint asdf
|   COMMIT;
| --^
| 42601: syntax error at or near "COMMIT"

@eeeebbbbrrrr
Collaborator

eeeebbbbrrrr commented Jun 28, 2022

I'm not seeing the error at all. Not in psql nor in my logs.

EDIT: And it seems your script is bugged, as you (at least) call BEGIN; while already in a transaction. So I'm dubious about whether the SQL you pasted here is complete.

@bhalonen
Contributor Author

can you try pasting the section of inserts etc into a psql console?

@eeeebbbbrrrr
Collaborator

can you try pasting the section of inserts etc into a psql console?

Hmm. That seems to do it. This is truly bizarre...

ERROR:  code=Some(200), {"error":null,"errors":true,"items":[{"update":{"error":{"type":"illegal_argument_exception","reason":"failed to execute script","caused_by":{"type":"script_exception","reason":"runtime error","script_stack":["java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:64)","java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:70)","java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:266)","java.base/java.util.Objects.checkIndex(Objects.java:359)","java.base/java.util.ArrayList.remove(ArrayList.java:504)","ctx._source.zdb_aborted_xids.remove(ctx._source.zdb_aborted_xids.indexOf(params.XID));","                                                                               ^---- HERE"],"script":"ctx._source.zdb_aborted_xids.remove(ctx._source.zdb_aborted_xids.indexOf(params.XID));","lang":"painless","position":{"offset":79,"start":0,"end":86},"caused_by":{"type":"index_out_of_bounds_exception","reason":"Index -1 out of bounds for length 2"}}}}}]}

@eeeebbbbrrrr
Collaborator

Do you need any more help from me? I mean, is the error occurring silently and you aren't seeing it when playing in a file (which happened to me as well), or are you not seeing it, period?

This is what I saw when running the same file into psql:

[brent@brent-desktop steelhead-deploy]$ ./migrate.sh watch --once
graphile-migrate: Already up to date
[2022-06-28T20:20:59.878Z]: Running current.sql

🛑 Error occurred at line 143, column 1 of "current.sql":
|   UPDATE table_a set id_b = 1 where val ='hello';
|   UPDATE table_b set id_a = 1 where val = 'world';
|   release savepoint asdf
|   COMMIT;
| --^
| 42601: syntax error at or near "COMMIT"

well, you're missing a semicolon on the release savepoint asdf line.

@bhalonen
Contributor Author

bhalonen commented Jun 28, 2022

I must've grabbed the wrong error run.

Actually, we got lucky: this error is intermittent. I ran the example again and it passed; it easily could have passed the first time I ran it. So beware while debugging.

steelhead=# -- insert into table_b (val) values ('world');
steelhead=# -- insert into table_b (val) values ('world');
steelhead=# -- insert into table_b (val) values ('world');
steelhead=# -- BEGIN;
steelhead=# -- savepoint asdf;
steelhead=# -- delete from table_a where val = '123';
steelhead=# -- UPDATE table_a set id_b = 1 where val ='hello';
steelhead=# -- UPDATE table_b set id_a = 1 where val = 'world';
steelhead=# -- release savepoint asdf;
steelhead=# -- COMMIT;
steelhead=# -- release savepoint abc;
steelhead=# -- COMMIT;
steelhead=# 

@eeeebbbbrrrr
Collaborator

I agree there's a ZDB bug here somewhere, but I don't understand why, when this whole script passes, it still produces this error:

ERROR:  RELEASE SAVEPOINT can only be used in transaction blocks

Between that and the warning about already being in a transaction, I'm a little dubious that what you gave me is complete/correct.

@bhalonen
Contributor Author

bhalonen commented Jun 28, 2022

I don't think that actually passed. I think it simply didn't print the ZDB error when running the whole file. I saw the same thing. The error about RELEASE SAVEPOINT only being valid in a transaction occurs because the transaction blew up on the line before:

...
WARNING:  there is already a transaction in progress
BEGIN
Time: 0.118 ms
SAVEPOINT
Time: 0.027 ms
DELETE 12
Time: 1.141 ms
UPDATE 30
Time: 0.611 ms
UPDATE 27
Time: 0.552 ms
RELEASE
Time: 0.029 ms
COMMIT
Time: 965.411 ms <------------------ this almost certainly failed.
ERROR:  RELEASE SAVEPOINT can only be used in transaction blocks
Time: 0.126 ms
WARNING:  there is no transaction in progress
COMMIT

I get the same thing.

[brent@brent-desktop steelhead-deploy]$ ./migrate.sh watch --once
graphile-migrate: Already up to date
[2022-06-28T20:59:02.734Z]: Running current.sql

🛑 Error occurred whilst processing migration
    error: RELEASE SAVEPOINT can only be used in transaction blocks
        at Parser.parseErrorMessage (/init/node_modules/pg-protocol/dist/parser.js:287:98)
        at Parser.handlePacket (/init/node_modules/pg-protocol/dist/parser.js:126:29)
        at Parser.parse (/init/node_modules/pg-protocol/dist/parser.js:39:38)
        at Socket.<anonymous> (/init/node_modules/pg-protocol/dist/index.js:11:42)
        at Socket.emit (node:events:527:28)
        at addChunk (node:internal/streams/readable:315:12)
        at readableAddChunk (node:internal/streams/readable:289:9)

@bhalonen
Contributor Author

It is also certainly possible I did something insane; I am not a seasoned Postgres developer and have only been working with the project for a month or so.

@bhalonen
Contributor Author

Let me know if you need anything more from me.

@eeeebbbbrrrr
Collaborator

eeeebbbbrrrr commented Jun 28, 2022

You've got:

BEGIN;
SAVEPOINT abc;
<bunch of inserts>
BEGIN;  <-- this is incorrect, you're still in a transaction
SAVEPOINT asdf;
<delete>
<updates>
RELEASE SAVEPOINT asdf;
COMMIT;  <-- this commits the whole transaction
RELEASE SAVEPOINT abc;  <-- this is already gone, due to above COMMIT
COMMIT;  <-- no longer in a transaction

I agree there's an issue with ZDB somewhere, but your script is bugged. Can you fix it to match your intentions?

The above is why, when the script appears to work, we get this output:

$ dropdb issue-750; createdb issue-750 ; psql issue-750 < issue-750.sql
CREATE EXTENSION
DROP INDEX
DROP FUNCTION
DROP TYPE
DROP INDEX
DROP FUNCTION
DROP TYPE
DROP TABLE
CREATE TABLE
DROP TABLE
CREATE TABLE
ALTER TABLE
CREATE TYPE
CREATE FUNCTION
CREATE INDEX
CREATE TYPE
CREATE FUNCTION
CREATE INDEX
<inserts>

BEGIN
SAVEPOINT
<inserts>
WARNING:  there is already a transaction in progress
BEGIN
SAVEPOINT
DELETE 12
UPDATE 30
UPDATE 27
RELEASE
COMMIT
ERROR:  RELEASE SAVEPOINT can only be used in transaction blocks
WARNING:  there is no transaction in progress
COMMIT
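
For what it's worth, one way to express what the script seems to intend, using only savepoints for the nesting (a sketch; the savepoint names are taken from the script above):

```sql
BEGIN;
SAVEPOINT abc;
-- ... inserts into table_b ...
SAVEPOINT asdf;           -- nested level: a savepoint, not a second BEGIN
-- ... delete from table_a, updates ...
RELEASE SAVEPOINT asdf;
RELEASE SAVEPOINT abc;
COMMIT;                   -- the only COMMIT, ending the transaction
```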

@bhalonen
Contributor Author

OK, so this is a more legal example. It gives a warning on the double BEGIN, but that appears to be only a warning...

Anyway, we are using it in a much larger framework, and the double transaction was probably a result of that... I was simply attempting to match what I thought the equivalent was.

DROP INDEX IF EXISTS table_a_idx;
DROP function if exists table_a_idx;
DROP TYPE IF EXISTS table_a_type;


DROP INDEX IF EXISTS table_b_idx;
DROP function if exists table_b_idx;
DROP TYPE IF EXISTS table_b_type;

DROP TABLE IF EXISTS table_a CASCADE;

CREATE TABLE IF NOT EXISTS table_a (
    id serial primary key,
    val text
);
DROP TABLE IF EXISTS table_b;

CREATE TABLE IF NOT EXISTS table_b (
    id serial primary key,
    id_a integer references table_a(id),
    val text
);

alter table table_a add column id_b integer references table_b(id);



CREATE TYPE table_a_type AS (
    id integer,  
    id_b integer,
    val text
);  

CREATE FUNCTION table_a_idx(table_a) RETURNS table_a_type IMMUTABLE STRICT LANGUAGE sql AS $$
SELECT ROW (
           $1.id,
           $1.id_b,
           $1.val
)
$$;

CREATE INDEX table_a_idx 
      ON table_a 
      USING zombodb (table_a_idx(table_a))
         WITH (url='http://elasticsearch:9200/');

CREATE TYPE table_b_type AS (
    id integer,  
    val text
);  

CREATE FUNCTION table_b_idx(table_b) RETURNS table_b_type IMMUTABLE STRICT LANGUAGE sql AS $$
SELECT ROW (
           $1.id,
            $1.val
)
$$;

CREATE INDEX table_b_idx 
      ON table_b 
      USING zombodb (table_b_idx(table_b))
         WITH (url='http://elasticsearch:9200/');
Insert into table_a (val) values ('hello');
Insert into table_a (val) values ('hello');
Insert into table_a (val) values ('hello');
Insert into table_a (val) values ('hello');
Insert into table_a (val) values ('hello');
Insert into table_a (val) values ('hello');
Insert into table_a (val) values ('hello');
Insert into table_a (val) values ('hello');
Insert into table_a (val) values ('hello');
Insert into table_a (val) values ('hello');
Insert into table_a (val) values ('hello');
Insert into table_a (val) values ('hello');
Insert into table_a (val) values ('hello');
Insert into table_a (val) values ('hello');
Insert into table_a (val) values ('hello');
Insert into table_a (val) values ('hello');
Insert into table_a (val) values ('hello');
Insert into table_a (val) values ('hello');
Insert into table_a (val) values ('hello');
Insert into table_a (val) values ('hello');
Insert into table_a (val) values ('hello');
Insert into table_a (val) values ('hello');
Insert into table_a (val) values ('hello');
Insert into table_a (val) values ('hello');
Insert into table_a (val) values ('hello');
Insert into table_a (val) values ('hello');
Insert into table_a (val) values ('hello');
Insert into table_a (val) values ('hello');
Insert into table_a (val) values ('hello');
Insert into table_a (val) values ('hello');
Insert into table_a (val) values ('123');
Insert into table_a (val) values ('123');
Insert into table_a (val) values ('123');
Insert into table_a (val) values ('123');
Insert into table_a (val) values ('123');
Insert into table_a (val) values ('123');
Insert into table_a (val) values ('123');
Insert into table_a (val) values ('123');
Insert into table_a (val) values ('123');
Insert into table_a (val) values ('123');
Insert into table_a (val) values ('123');
Insert into table_a (val) values ('123');

BEGIN;

savepoint abc;

insert into table_b (val) values ('world');
insert into table_b (val) values ('world');
insert into table_b (val) values ('world');
insert into table_b (val) values ('world');
insert into table_b (val) values ('world');
insert into table_b (val) values ('world');
insert into table_b (val) values ('world');
insert into table_b (val) values ('world');
insert into table_b (val) values ('world');
insert into table_b (val) values ('world');
insert into table_b (val) values ('world');
insert into table_b (val) values ('world');
insert into table_b (val) values ('world');
insert into table_b (val) values ('world');
insert into table_b (val) values ('world');
insert into table_b (val) values ('world');
insert into table_b (val) values ('world');
insert into table_b (val) values ('world');
insert into table_b (val) values ('world');
insert into table_b (val) values ('world');
insert into table_b (val) values ('world');
insert into table_b (val) values ('world');
insert into table_b (val) values ('world');
insert into table_b (val) values ('world');
insert into table_b (val) values ('world');
insert into table_b (val) values ('world');
insert into table_b (val) values ('world');

BEGIN;
savepoint asdf;
delete from table_a where val = '123';
UPDATE table_a set id_b = 1 where val ='hello';
UPDATE table_b set id_a = 1 where val = 'world';
release savepoint asdf;
release savepoint abc;
COMMIT;

@eeeebbbbrrrr
Collaborator

BEGIN
SAVEPOINT
WARNING:  pushing xid=24531
DELETE 12
UPDATE 30
UPDATE 27
RELEASE
RELEASE
INFO:  xids to commit: Some({24530, 24531})
WARNING:  TransactionCommitted: 24530
WARNING:  TransactionCommitted: 24531
WARNING:  about to remove xid=24530
WARNING:  about to remove xid=24531
WARNING:  pushing xid=24529          <-- This shouldn't have happened
INFO:  xids to commit: Some({24530, 24531, 24529})
WARNING:  TransactionCommitted: 24530
WARNING:  TransactionCommitted: 24531
WARNING:  TransactionCommitted: 24529
WARNING:  about to remove xid=24530
WARNING:  about to remove xid=24531
WARNING:  about to remove xid=24529
ERROR:  code=Some(200), {"error":null,"errors":true,"items":[{"update":{"error":{"type":"illegal_argument_exception","reason":"failed to execute script","caused_by":{"type":"script_exception","reason":"runtime error","script_stack":["java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:64)","java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:70)","java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:266)","java.base/java.util.Objects.checkIndex(Objects.java:359)","java.base/java.util.ArrayList.remove(ArrayList.java:504)","ctx._source.zdb_aborted_xids.remove(ctx._source.zdb_aborted_xids.indexOf(params.XID));","                                                                               ^---- HERE"],"script":"ctx._source.zdb_aborted_xids.remove(ctx._source.zdb_aborted_xids.indexOf(params.XID));","lang":"painless","position":{"offset":79,"start":0,"end":86},"caused_by":{"type":"index_out_of_bounds_exception","reason":"Index -1 out of bounds for length 0"}}}}}]}
CONTEXT:  src/elasticsearch/bulk.rs:1006:17

Yeah, I'm getting close. It's weird that it's intermittent, but I'll figure it out tomorrow.
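
For reference, the Painless failure above is reproducible with plain Java collections: `List.indexOf` returns -1 for a missing element, and `ArrayList.remove(int)` with -1 then throws exactly the `index_out_of_bounds_exception` in the log. A minimal sketch (illustrative only, not ZDB's actual update script):

```java
import java.util.ArrayList;
import java.util.List;

public class AbortedXidsDemo {
    // Mirrors the failing Painless one-liner:
    //   ctx._source.zdb_aborted_xids.remove(ctx._source.zdb_aborted_xids.indexOf(params.XID));
    // indexOf() returns -1 when the xid is absent, and remove(-1) throws.
    static void removeXidUnguarded(List<Long> abortedXids, long xid) {
        abortedXids.remove(abortedXids.indexOf(xid));
    }

    // Hypothetical guarded variant: only remove when the xid is actually present.
    static void removeXidGuarded(List<Long> abortedXids, long xid) {
        int i = abortedXids.indexOf(xid);
        if (i >= 0) {
            abortedXids.remove(i);
        }
    }

    public static void main(String[] args) {
        List<Long> xids = new ArrayList<>(List.of(24530L, 24531L));
        try {
            removeXidUnguarded(xids, 24529L); // 24529 was never pushed into this document
        } catch (IndexOutOfBoundsException e) {
            System.out.println(e.getMessage()); // "Index -1 out of bounds for length 2"
        }
        removeXidGuarded(xids, 24529L); // safe no-op
    }
}
```

Guarding the script would only mask the symptom, though; the real question is why a commit for an xid that was never pushed gets queued at all.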

@eeeebbbbrrrr
Collaborator

I will say, pushing a new transaction id when we weren't supposed to was not one of my theories.

@bhalonen
Contributor Author

Who knows the mad things people will do with your code...

@eeeebbbbrrrr
Collaborator

That’s why we write code.

I enjoy debugging these problems. I’m just hoping it’s a simple fix and not a fundamental flaw.

I’ll know more in the next few days.

@eeeebbbbrrrr
Collaborator

eeeebbbbrrrr commented Jun 29, 2022

Alrighty, looks like this is a bug related to issue #622, which was fixed back on the last day of 2020.

Basically, ZDB knows which rows haven't finished being updated in ES, and if it sees you trying to update such a row again, it'll requeue that update command to run again.

The reason this issue is intermittent is that sometimes ES finishes updating the row before we try to update it again (which means things work), and sometimes it doesn't (which means things fail).

And the bug here is that when we requeued the command, it would ultimately try to queue up a "commit" for the current transaction id, which, thanks to the use of subtransactions, turned out to be different at that time.

Anyways, the fix here is to skip changing out our concept of the "current transaction" when re-processing deferred commands.

I'll get the fix pushed here in a bit along with a test (your .sql script from here -- mostly), and can likely go ahead and publish a new release today too.
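
As a rough sketch of that failure mode (illustrative Java, not ZomboDB's actual Rust internals; every name here is hypothetical): if a deferred command picks up whichever transaction id is "current" at re-processing time, a subtransaction boundary can make it commit an xid that was never pushed, which matches the stray `pushing xid=24529` in the log above.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical model of deferred-command handling; structure and names are
// illustrative only, not ZomboDB's actual implementation.
public class DeferredCommands {
    long currentXid;                                   // xid of the (sub)transaction now in progress
    final Set<Long> xidsToCommit = new LinkedHashSet<>();
    final Deque<Long> deferred = new ArrayDeque<>();   // xids the deferred commands were queued under

    void queueUpdate(long xid, boolean rowStillInFlight) {
        if (rowStillInFlight) {
            // ES hasn't finished the previous update of this row,
            // so requeue the command to run again later.
            deferred.add(xid);
        } else {
            xidsToCommit.add(xid);
        }
    }

    // Buggy re-processing: swaps in the xid that is current *now*, which can
    // belong to a different subtransaction than the original command.
    void reprocessBuggy() {
        while (!deferred.isEmpty()) {
            deferred.poll();
            xidsToCommit.add(currentXid);
        }
    }

    // Fixed re-processing: keep the xid the command was originally queued under.
    void reprocessFixed() {
        while (!deferred.isEmpty()) {
            xidsToCommit.add(deferred.poll());
        }
    }
}
```

With `currentXid = 24529` and a command deferred under xid 24531, `reprocessBuggy()` would queue a commit for 24529 (never pushed), while `reprocessFixed()` commits 24531 as intended.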

eeeebbbbrrrr added a commit that referenced this issue Jun 29, 2022
eeeebbbbrrrr added a commit that referenced this issue Jun 29, 2022
@eeeebbbbrrrr
Collaborator

released in v3000.1.1. Enjoy!
