TiKV OOM during CDC initial scan #16035

Closed
Tracked by #16375
fubinzh opened this issue Nov 21, 2023 · 5 comments · Fixed by #16048
Assignees
Labels
severity/major type/bug Type: Issue - Confirmed a bug

Comments


fubinzh commented Nov 21, 2023

Bug Report

What version of TiKV are you using?

/ # /tikv-server -V
TiKV
Release Version: 6.5.3
Edition: Enterprise
Git Commit Hash: 5165943
Git Commit Branch: heads/refs/tags/v6.5.3-20231116-5165943

What operating system and CPU are you using?

K8S

Steps to reproduce

  1. Deploy a TiDB cluster with 6 TiKV nodes (16c 40g) and 2 TiCDC nodes.
  2. Create a CDC changefeed and pause it.
  3. Run a workload to generate 1 TB of data for CDC to sync.
  4. Resume the CDC changefeed.

What did you expect?

TiKV should not OOM

What happened?

TiKV OOM

[Screenshots attached showing the TiKV OOM]

fubinzh added the type/bug (Type: Issue - Confirmed a bug) label Nov 21, 2023
fubinzh (Author) commented Nov 21, 2023

/assign @hicqu

fubinzh (Author) commented Nov 21, 2023

/severity major

overvenus (Member) commented

The OOM is caused by a surge of pending initial scan tasks. Each initial scan task is roughly 5848 bytes, so 1649639 pending tasks take about 8.9 GB of memory.

[Screenshots: TiKV-2 Pending Tasks; TiKV-2 Memory]
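
A quick back-of-envelope check of the numbers quoted above (the task size and task count come from the comment; the snippet is only an illustrative calculation):

fn main() {
    let task_size_bytes: u64 = 5_848;    // approximate size of one initial scan task
    let pending_tasks: u64 = 1_649_639;  // pending tasks observed on TiKV-2
    let total_bytes = task_size_bytes * pending_tasks;
    // 5848 B * 1,649,639 tasks ≈ 9.65e9 B ≈ 9.0 GiB, consistent with the ~8.9 GB above.
    println!("{:.2} GiB", total_bytes as f64 / 1024f64.powi(3));
}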

// Each CDC register request spawns an initial scan task onto the worker pool;
// nothing here bounds how many such tasks can be pending at once.
self.workers.spawn(async move {
    CDC_SCAN_TASKS.with_label_values(&["total"]).inc();
    match init
        .initialize(change_cmd, raft_router, concurrency_semaphore)
        .await
    {
        Ok(()) => {
            CDC_SCAN_TASKS.with_label_values(&["finish"]).inc();
        }
        Err(e) => {
            CDC_SCAN_TASKS.with_label_values(&["abort"]).inc();
            error!("cdc initialize fail: {}", e; "region_id" => region_id);
            init.deregister_downstream(e)
        }
    }
});


To fix the issue, limit the total number of pending initial scan tasks and reject new requests once the limit is reached.
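
A minimal sketch of such a cap, using hypothetical names (ScanScheduler, MAX_PENDING_SCAN_TASKS) and a plain thread instead of TiKV's async worker pool; the actual fix landed in #16048 and may differ in detail:

use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

const MAX_PENDING_SCAN_TASKS: usize = 1024; // illustrative limit, not TiKV's real value

struct ScanScheduler {
    pending_scan_tasks: Arc<AtomicUsize>,
}

impl ScanScheduler {
    fn try_spawn_initial_scan<F>(&self, scan: F) -> Result<(), String>
    where
        F: FnOnce() + Send + 'static,
    {
        // Reject the request up front instead of queueing it, so memory stays
        // bounded by the cap and the client can retry later.
        let pending = self.pending_scan_tasks.fetch_add(1, Ordering::SeqCst);
        if pending >= MAX_PENDING_SCAN_TASKS {
            self.pending_scan_tasks.fetch_sub(1, Ordering::SeqCst);
            return Err("server is busy: too many pending initial scans".to_owned());
        }
        let counter = Arc::clone(&self.pending_scan_tasks);
        std::thread::spawn(move || {
            scan();
            // Release the slot when the scan finishes or aborts.
            counter.fetch_sub(1, Ordering::SeqCst);
        });
        Ok(())
    }
}

A caller would check the Result and, on rejection, surface a retryable error to the CDC client instead of buffering the task.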

hicqu (Contributor) commented Nov 22, 2023

On TiCDC instances, there are lots of such logs:

[2023/11/19 16:11:13.208 +08:00] [INFO] [client.go:981] ["stream to store closed"] [namespace=default] [changefeed=test1] [addr=tc-tikv-2.tc-tikv-peer.webank-cdc-tps-4590005-1-42.svc:20160] [storeID=8]

After a gRPC stream is re-established, all subscribed regions will be re-sent to TiKV instances. That's why there are only 50K subscribed regions on a TiKV but 15000K pending region tasks.

Limiting the total number of pending tasks is a good idea. However, we still need to resolve the stream-recreation case.
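
One way to handle the stream-recreation case would be to de-duplicate scan tasks per region, so that a re-sent subscription does not queue a second task while one is already pending. The sketch below is purely illustrative (PendingScans and its methods are hypothetical, not TiKV's code):

use std::collections::HashSet;

struct PendingScans {
    regions: HashSet<u64>, // region ids with a pending or running initial scan
}

impl PendingScans {
    fn new() -> Self {
        PendingScans { regions: HashSet::new() }
    }

    /// Returns true if a new scan task should be queued for this region.
    fn try_register(&mut self, region_id: u64) -> bool {
        // `insert` returns false when the region is already tracked, so a
        // re-sent subscription after a stream reconnect does not add a task.
        self.regions.insert(region_id)
    }

    /// Called when the scan for this region finishes or aborts.
    fn finish(&mut self, region_id: u64) {
        self.regions.remove(&region_id);
    }
}

fn main() {
    let mut pending = PendingScans::new();
    assert!(pending.try_register(1));  // first subscription queues a scan
    assert!(!pending.try_register(1)); // re-sent subscription is skipped
    pending.finish(1);
    assert!(pending.try_register(1));  // the region can be scanned again later
}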

hicqu (Contributor) commented Nov 22, 2023

@overvenus pingcap/tiflow#8860 is expected to reduce changefeed initialization time, and it does work. So I suggest just limiting the total number of pending tasks on TiKV.

@overvenus overvenus added affects-4.0 This bug affects 4.0.x versions. affects-5.0 This bug affects 5.0.x versions. affects-5.1 This bug affects 5.1.x versions. affects-5.2 This bug affects 5.2.x versions. affects-5.3 This bug affects 5.3.x versions. affects-5.4 affects-6.0 affects-6.1 affects-6.2 and removed may-affects-5.3 may-affects-5.4 may-affects-6.1 may-affects-6.5 may-affects-7.1 may-affects-7.5 labels Nov 22, 2023
ti-chi-bot pushed a commit to ti-chi-bot/tikv that referenced this issue Nov 27, 2023
close tikv#16035

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
ti-chi-bot bot pushed a commit that referenced this issue Nov 29, 2023
close #16035

When TiCDC starts a changefeed, it may send numerous requests, leading to
the creation of numerous scan tasks. This initial surge of scan tasks may
cause OOM.

This commit aims to resolve the issue by implementing a mechanism that
allows TiKV to reject requests when the number of pending tasks reaches
a certain limit.

Signed-off-by: Neil Shen <overvenus@gmail.com>
ti-chi-bot bot pushed a commit that referenced this issue Dec 5, 2023
ref #16035

return server_is_busy to cdc clients if necessary

Signed-off-by: qupeng <qupeng@pingcap.com>
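
For context, returning server_is_busy implies the CDC client should back off and retry rather than immediately re-sending the request. TiCDC itself is written in Go; the Rust sketch below only illustrates the retry-with-backoff idea, with hypothetical names (subscribe_region, ScanError):

use std::thread::sleep;
use std::time::Duration;

#[derive(Debug)]
enum ScanError {
    ServerIsBusy,
    Other(String),
}

// Placeholder for the real RPC that registers a region for an initial scan.
fn subscribe_region(_region_id: u64) -> Result<(), ScanError> {
    Err(ScanError::ServerIsBusy)
}

fn subscribe_with_backoff(region_id: u64, max_retries: u32) -> Result<(), ScanError> {
    let mut backoff = Duration::from_millis(100);
    for _ in 0..max_retries {
        match subscribe_region(region_id) {
            Ok(()) => return Ok(()),
            // Back off and retry instead of re-sending immediately, so a busy
            // TiKV is not flooded with duplicate scan tasks.
            Err(ScanError::ServerIsBusy) => {
                sleep(backoff);
                backoff = (backoff * 2).min(Duration::from_secs(10));
            }
            Err(e) => return Err(e),
        }
    }
    Err(ScanError::ServerIsBusy)
}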