[YSQL] [Upgrade] Rolling Upgrade - Invalid argument: Index 13 does not reference a valid sidecar #21229

Closed
1 task done
rjalan-yb opened this issue Feb 28, 2024 · 0 comments
Labels
2024.1_blocker area/ysql Yugabyte SQL (YSQL) kind/bug This issue is a bug priority/highest Highest priority issue qa_itest-system Bugs identified in itest-system automation

Comments

@rjalan-yb
Contributor

rjalan-yb commented Feb 28, 2024

Jira Link: DB-10156

Description

During a rolling upgrade automation run on 2.20.2.0-b143, we get this error:

connection to server at "10.9.215.60", port 5433 failed: FATAL: Invalid argument: Index 13 does not reference a valid sidecar

It started with build b126: https://jenkins.dev.yugabyte.com/job/itest-system-developer/10512/

It does not happen while a query is running; it appears to happen while connecting to the node.

In Postgres logs, we can see this error:

I0228 11:54:38.463984 69708 pg_client.cc:146] Using TServer host_port: 10.9.215.60:9100
I0228 11:54:38.465440 69708 pg_client.cc:153] S 814: Session id acquired
2024-02-28 11:54:39.273 UTC [69708] FATAL:  Invalid argument: Index 13 does not reference a valid sidecar
I0228 11:54:39.274224 69711 poller.cc:66] Poll stopped: Service unavailable (yb/rpc/scheduler.cc:80): Scheduler is shutting down (system error 108)
I0228 11:55:04.134824 71594 mem_tracker.cc:194] Overriding FLAGS_mem_tracker_tcmalloc_gc_release_bytes to 5242880
I0228 11:55:04.143363 71594 thread_pool.cc:170] Starting thread pool { name: pggate_ybclient queue_limit: 10000 max_workers: 1024 }
I0228 11:55:04.145865 71594 pg_client.cc:146] Using TServer host_port: 10.9.215.60:9100
I0228 11:55:04.147545 71594 pg_client.cc:153] S 815: Session id acquired
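
To check other nodes for the same failure, the Postgres logs can be scanned for the FATAL line shown above. The following is a minimal sketch, assuming the log line format matches the excerpt in this issue; the function name `find_sidecar_failures` and the regular expression are illustrative and not part of any YugabyteDB tooling.

```python
import re

# Matches PostgreSQL-style lines such as the one in the excerpt above:
#   2024-02-28 11:54:39.273 UTC [69708] FATAL:  Invalid argument: ...
# Assumption: timestamp, time zone, bracketed PID, then the severity.
FATAL_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) (?P<tz>\S+) "
    r"\[(?P<pid>\d+)\] FATAL:\s+(?P<msg>.*)$"
)

# Error text taken verbatim from the issue (without the index number).
SIDECAR_ERROR = "does not reference a valid sidecar"


def find_sidecar_failures(lines):
    """Return (timestamp, pid, message) for each FATAL sidecar error line."""
    hits = []
    for line in lines:
        match = FATAL_LINE.match(line.rstrip("\n"))
        if match and SIDECAR_ERROR in match.group("msg"):
            hits.append(
                (match.group("ts"), int(match.group("pid")), match.group("msg"))
            )
    return hits
```

Passing an open log file (for example `find_sidecar_failures(open(path))`, where `path` is whatever Postgres log file is being inspected) yields one tuple per affected backend process.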

This issue is not seen in the b125 run: https://jenkins.dev.yugabyte.com/job/itest-system-developer/10509/

This is seen when upgrading from 2.14 or 2.16 to 2.20.2.0.

Logs: https://drive.google.com/file/d/14-q-BgJE1mk0CKrDSqZ2TbwgNTtYe3dJ/view?usp=sharing

Issue Type

kind/bug

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
@rjalan-yb rjalan-yb added area/ysql Yugabyte SQL (YSQL) status/awaiting-triage Issue awaiting triage qa_itest-system Bugs identified in itest-system automation labels Feb 28, 2024
@yugabyte-ci yugabyte-ci added kind/bug This issue is a bug priority/medium Medium priority issue priority/critical Critical issue priority/highest Highest priority issue and removed priority/medium Medium priority issue priority/critical Critical issue labels Feb 28, 2024
karthik-ramanathan-3006 added a commit that referenced this issue Feb 29, 2024
…ed to output of EXPLAIN(ANALYZE, DIST)."

Summary:
This reverts commit b5c632c.

The addition of new fields to the `PgsqlResponsePB` proto seems to have introduced upgrade failures in the 2.14/2.16 --> 2.20.2 path.
Reverting this change until this issue can be analyzed and fixed.
Jira: DB-10156, DB-569

Test Plan: Jenkins

Reviewers: telgersma

Reviewed By: telgersma

Subscribers: aaruj, mihnea, smishra, yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D32727
@yugabyte-ci yugabyte-ci removed the status/awaiting-triage Issue awaiting triage label Feb 29, 2024
karthik-ramanathan-3006 added a commit that referenced this issue Mar 18, 2024
…ed to output of EXPLAIN(ANALYZE, DIST)."

Summary:
This reverts commit b5c632c.

The addition of new fields to the PgsqlResponsePB proto seems to have introduced upgrade failures in the 2.14/2.16 --> 2.20.X path.
Reverting this change until this issue can be analyzed and fixed.

D32727 reverted the change in branch 2.20.2.
This revision reverts the change in mainline 2.20 so that subsequent 2.20.X will not have this change by default.
Jira: DB-10156, DB-569

Test Plan: Jenkins

Reviewers: telgersma, smishra

Reviewed By: smishra

Subscribers: yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D33214
karthik-ramanathan-3006 added a commit that referenced this issue Mar 26, 2024
Summary:
It has been reported that upgrades from 2.14/2.16 to 2.20 (and beyond) fail due to pggate going into a crash loop
upon unpacking `PgsqlResponsePB`. This is caused by the introduction of the 'Scanned Rows' field in 2.20+ (D31111)
which is sent in its own RPC metrics sidecar.
Versions of pggate lower than 2.17.1 are not capable of unpacking response protos that contain RPC
sidecars holding data other than the rows returned by DocDB. During an upgrade, while an un-upgraded pggate
may send a request only to its local un-upgraded tserver, it may have responses proxied back from upgraded
tservers on other nodes. Thus, the RPC infrastructure needs to be forward compatible in order to ensure that
pggate is not broken during upgrades.

This revision introduces a guardrail to check that the receiving pggate is capable of unpacking the RPC metrics sidecar
before sending the 'Scanned Rows' count.
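
The guardrail described above can be illustrated with a small model. This is only a sketch of capability gating in general, not the actual change in D33503: every name in it (`Request`, `Response`, `metrics_sidecar_supported`, `build_response`, and the two unpack functions) is hypothetical, and the old client's failure is simplified to "rejects any response carrying more than the rows sidecar".

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# All names below are hypothetical and do not mirror YugabyteDB identifiers.


@dataclass
class Request:
    # Set only by clients that can unpack a metrics sidecar. An old client
    # never sets it, so the server sees the default value False.
    metrics_sidecar_supported: bool = False


@dataclass
class Response:
    sidecars: List[bytes] = field(default_factory=list)
    rows_sidecar_index: int = 0
    # None means that no metrics sidecar is attached.
    metrics_sidecar_index: Optional[int] = None


def build_response(request: Request, rows: bytes, scanned_rows: int) -> Response:
    """Server side: attach the metrics sidecar only if the client opted in."""
    response = Response(sidecars=[rows], rows_sidecar_index=0)
    if request.metrics_sidecar_supported:
        response.sidecars.append(scanned_rows.to_bytes(8, "little"))
        response.metrics_sidecar_index = len(response.sidecars) - 1
    return response


def old_client_unpack(response: Response) -> bytes:
    """Old client (simplified): accepts only a single sidecar holding rows."""
    if len(response.sidecars) != 1:
        raise ValueError("unexpected sidecar layout")
    return response.sidecars[response.rows_sidecar_index]


def new_client_unpack(response: Response) -> Tuple[bytes, Optional[int]]:
    """New client: reads the rows, plus the scanned-rows count if present."""
    rows = response.sidecars[response.rows_sidecar_index]
    scanned = None
    if response.metrics_sidecar_index is not None:
        raw = response.sidecars[response.metrics_sidecar_index]
        scanned = int.from_bytes(raw, "little")
    return rows, scanned
```

In this model an upgraded server that answers a default `Request()` returns only the rows sidecar, so `old_client_unpack` succeeds. The same old client raises `ValueError` if it is handed a response built for `Request(metrics_sidecar_supported=True)`, which is the situation the guardrail is meant to prevent during a rolling upgrade.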

Required backports: 2.20 (original diff + this fix), 2024.1 (only this fix), 2.21 (if needed, depending on branching)
Jira: DB-10156

Test Plan: Run rolling upgrade itest from 2.14/2.16 to master, 2.20.x, 2.21.x, 2024.1

Reviewers: hsunder, esheng, sergei

Reviewed By: hsunder

Subscribers: ybase, yql, mihnea, smishra

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D33503
karthik-ramanathan-3006 added a commit that referenced this issue Mar 29, 2024
…ding scanned rows.

Summary:
Original commit: 80dc997 / D33503
It has been reported that upgrades from 2.14/2.16 to 2.20 (and beyond) fail due to pggate going into a crash loop
upon unpacking `PgsqlResponsePB`. This is caused by the introduction of the 'Scanned Rows' field in 2.20+ (D31111)
which is sent in its own RPC metrics sidecar.
Versions of pggate lower than 2.17.1 are not capable of unpacking response protos that contain RPC
sidecars holding data other than the rows returned by DocDB. During an upgrade, while an un-upgraded pggate
may send a request only to its local un-upgraded tserver, it may have responses proxied back from upgraded
tservers on other nodes. Thus, the RPC infrastructure needs to be forward compatible in order to ensure that
pggate is not broken during upgrades.

This revision introduces a guardrail to check that the receiving pggate is capable of unpacking the RPC metrics sidecar
before sending the 'Scanned Rows' count.

Required backports: 2.20 (original diff + this fix), 2024.1 (only this fix), 2.21 (if needed, depending on branching)
Jira: DB-10156

Test Plan: Run rolling upgrade itest from 2.14/2.16 to master, 2.20.x, 2.21.x, 2024.1

Reviewers: hsunder, esheng, sergei

Reviewed By: hsunder

Subscribers: smishra, mihnea, yql, ybase

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D33550
karthik-ramanathan-3006 added a commit that referenced this issue Apr 2, 2024
…upgrade fix

Summary:
This revision re-enables the 'Storage Rows Scanned' functionality for `EXPLAIN (ANALYZE, DIST)` in branch 2.20 after applying the upgrade-related fix introduced in
80dc997 (D33503).
This 'Storage Rows Scanned' functionality was originally introduced as part of commit b5c632c (D31931) on 2.20
and reverted as part of commit a54db61 (D32727).

Jira: DB-10156, DB-569

Test Plan:
Run the following tests to validate the 'Storage Rows Scanned' functionality.
```
./yb_build.sh --java-test org.yb.pgsql.TestPgExplainAnalyze
./yb_build.sh --java-test org.yb.pgsql.TestPgExplainAnalyzeColocated
./yb_build.sh --java-test org.yb.pgsql.TestPgExplainAnalyzeScans#testIndexScanConditionAndFilter
```

To validate the upgrade pathways, run rolling upgrade itest from 2.14/2.16 to 2.20.x.

Reviewers: telgersma, hsunder, smishra

Reviewed By: telgersma

Subscribers: mihnea, ybase, yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D33663

3 participants