v1.7.0-rc/nightly-20240201 source throughput down to 0 with non-shared PG CDC sources #14943

Closed
cyliu0 opened this issue Feb 2, 2024 · 18 comments

cyliu0 (Contributor) commented Feb 2, 2024

Describe the bug

Run ch-benchmark with non-shared PG CDC sources on v1.7.0-rc/nightly-20240201:

v1.7.0-rc: https://buildkite.com/risingwave-test/ch-benchmark-pg-cdc/builds/187
nightly-20240201: https://buildkite.com/risingwave-test/ch-benchmark-pg-cdc/builds/188

The Buildkite pipeline jobs failed the data consistency check before the data sync had completed, because the consistency check starts only after the source throughput has stayed at 0 for 60 seconds.
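
For reference, the gate described above (wait until the source throughput has stayed at 0 for 60 seconds, then run the consistency check) could be sketched roughly as follows; the Prometheus endpoint and metric expression below are assumptions for illustration, not the pipeline's actual code.

```python
import time
import requests

# Assumed Prometheus endpoint and metric expression; the real pipeline may use
# a different data source and query.
PROM_URL = "http://prometheus:9090/api/v1/query"
THROUGHPUT_QUERY = 'sum(rate(stream_source_output_rows_counts[1m]))'

def source_throughput() -> float:
    """Return the current aggregate source throughput (rows/s), 0.0 if absent."""
    resp = requests.get(PROM_URL, params={"query": THROUGHPUT_QUERY}, timeout=10)
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

def wait_until_source_idle(idle_seconds: int = 60, poll_interval: int = 5) -> None:
    """Block until throughput has stayed at 0 for `idle_seconds`,
    then return so the data consistency check can start."""
    idle_since = None
    while True:
        if source_throughput() == 0.0:
            idle_since = idle_since or time.monotonic()
            if time.monotonic() - idle_since >= idle_seconds:
                return
        else:
            idle_since = None  # throughput resumed; reset the idle timer
        time.sleep(poll_interval)
```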

Grafana

[Grafana screenshots]

Error message/log

No response

To Reproduce

No response

Expected behavior

No response

How did you deploy RisingWave?

No response

The version of RisingWave

v1.7.0-rc
nightly-20240201

Additional context

nightly-20240201

cyliu0 added the type/bug (Something isn't working) label on Feb 2, 2024
github-actions bot added this to the release-1.7 milestone on Feb 2, 2024
StrikeW (Contributor) commented Feb 4, 2024

Update: reverting #14899 also reproduces the problem; investigating the other PRs in the list.

StrikeW (Contributor) commented Feb 4, 2024

I found that the streaming query in the passed job generated much less data compared with the failed jobs.
[screenshot]

And there is join amplification in the failed jobs:
[screenshot]

I suspected the workload had changed, so I reran the pipeline with nightly-20240131; it also experienced barriers piling up, just like the failed jobs.

So I think the pipeline failure is not caused by the code change. cc @lmatz if you have other information.

lmatz (Contributor) commented Feb 4, 2024

Thanks for the findings. Let us check whether there have been any changes on the pipeline side.

cyliu0 (Contributor, Author) commented Feb 4, 2024

Recently, I added the ch-benchmark q3 back to the pipeline; q3 had been removed since #12777.

So I think it's still a problem?

cyliu0 (Contributor, Author) commented Feb 5, 2024

Reran the queries except q3 with v1.7.0-rc-1 and it passed: https://buildkite.com/risingwave-test/ch-benchmark-pg-cdc/builds/199

lmatz (Contributor) commented Feb 19, 2024

q3 causes barriers to pile up and backpressures the source. Removing it from the blockers for now.

Two things stand out in the failed jobs:

Join amplification
Many L0 files

Join amplification is expected, as it is determined by the nature of the query and the data, but I wonder why there are so many L0 files.
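
To make the "determined by the nature of the query and the data" point concrete, a toy sketch of join amplification (made-up numbers, not the actual q3 plan or ch-benchmark data):

```python
# Toy illustration of join amplification: with k matching rows per key on the
# other side of a streaming join, a single upstream change fans out into k
# downstream rows. The figures below are invented for illustration only.
orders_per_customer = 1_000   # rows in an "orders" side joined to one customer key
lines_per_order = 10          # rows in an "order_line" side joined to one order key

# One update to that customer row re-emits every joined (order, line) pair:
rows_emitted = orders_per_customer * lines_per_order
print(rows_emitted)  # 10000 output rows from a single input change
```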

fuyufjh (Member) commented Apr 8, 2024

Ping, any updates?

lmatz (Contributor) commented Apr 8, 2024

@cyliu0 could you run one more time but with more resources?

I think the point of this test here is just to make sure that non-shared PG CDC sources don't block themselves somehow; if the slowness/freeze is caused by the query, then it does not matter.

cyliu0 (Contributor, Author) commented Apr 9, 2024

> @cyliu0 could you run one more time but with more resources?

Hit this again while running with bigger memory on nightly-20240408:

compactor = { limit = "12Gi", request = "12Gi" }
compute = { limit = "24Gi", request = "24Gi" }

https://grafana.test.risingwave-cloud.xyz/d/EpkBw5W4k/risingwave-dev-dashboard?orgId=1&var-datasource=P2453400D1763B4D9&var-namespace=ch-benchmark-pg-cdc-pipeline&var-instance=benchmark-risingwave&var-pod=All&var-component=All&var-table=All&from=1712630508369&to=1712632212485
[screenshot]

> but if the slowness/freeze is caused by the query, then it does not matter.

It's caused by ch-benchmark q3 in this case.

@StrikeW Shall we keep this issue for future enhancements? Or close it now?

StrikeW (Contributor) commented Apr 26, 2024

> but if the slowness/freeze is caused by the query, then it does not matter.
>
> It's caused by ch-benchmark q3 in this case.
>
> @StrikeW Shall we keep this issue for future enhancements? Or close it now?

Optimizing the query performance should be tracked by another issue; let's close this one.

StrikeW closed this as completed on Apr 26, 2024
cyliu0 (Contributor, Author) commented May 8, 2024

The issue still exists with nightly-20240507. Which issue covers this now? @StrikeW
https://buildkite.com/risingwave-test/ch-benchmark-pg-cdc-shared-source/builds/41#018f5655-3749-4029-a4b8-cc0c321eb18a
[screenshot]

StrikeW (Contributor) commented May 8, 2024

> The issue still exists with nightly-20240507. Which issue covers this now? @StrikeW https://buildkite.com/risingwave-test/ch-benchmark-pg-cdc-shared-source/builds/41#018f5655-3749-4029-a4b8-cc0c321eb18a

There is no new issue for the performance problem.

The source is backpressured. Could you confirm that the CN is configured with 16GB of memory?
[screenshot]

It seems the bottleneck is in the state store: the large number of L0 files leads to a higher sync duration.
https://grafana.test.risingwave-cloud.xyz/d/EpkBw5W4k/risingwave-dev-dashboard?from=1715140652391&orgId=1&to=1715143112711&var-component=All&var-datasource=P2453400D1763B4D9&var-instance=benchmark-risingwave&var-namespace=tpc-20240508-035346&var-pod=All&var-table=All

cyliu0 (Contributor, Author) commented May 8, 2024

It's 13GB for the compute node memory, but that seems to be enough, because the peak memory usage is around 9GB here.
[screenshots]

StrikeW (Contributor) commented May 8, 2024

I think this is a performance issue rather than a functionality bug, so I suggest creating a new issue for it and posting it to the perf working group.
