two tikv oom after inject tikv network-loss and recovery for some time #12255

Closed
Lily2025 opened this issue Mar 24, 2022 · 12 comments · Fixed by #12458
Labels: severity/critical, type/bug

Comments

@Lily2025

Lily2025 commented Mar 24, 2022

Bug Report

What version of TiKV are you using?

[2022/03/24 04:13:50.851 +08:00] [INFO] [client.go:376] ["Cluster version information"] [type=pd] [version=6.1.0-nightly] [git_hash=1ac0ad691260dabb61a25f30359e996a968ed857]
[2022/03/24 04:13:50.851 +08:00] [INFO] [client.go:376] ["Cluster version information"] [type=tikv] [version=6.0.0-alpha] [git_hash=869b953e798cabf29872fd17d526a7061437aec2]
[2022/03/24 04:13:50.851 +08:00] [INFO] [client.go:376] ["Cluster version information"] [type=tidb] [version=6.1.0-nightly] [git_hash=b9bacad6dafabf5e2dfafee8e50ac66785e911b6]

What operating system and CPU are you using?

8 cores, 16 GB
2 TiDB, 3 PD, 5 TiKV (5 replicas)

Steps to reproduce

https://tcms.pingcap.net/dashboard/executions/plan/662849
test data:
{{[tpcc] []} {s3://benchmark/tpcc10000 tpcc10000 10000 64 2013,1213,1105,1205,8022,8027,8028,9004,9007,1062} {s3://benchmark/sysbench_64_7000w sysbench_64_7000w 64 70000000 64 2013,1213,1105,1205,8022,8027,8028,9004,9007,1062} {0} {[]} {false }}

1. Start the workload
[2022/03/24 04:13:51.083 +08:00] [INFO] [cmd.go:124] ["Start remote command"] [cmd="go-tpc tpcc run -D tpcc10000 --host tc-tidb.endless-oltp-tps-662849-1-968 -P4000 --warehouses 10000 -T 64 --time 36000m --ignore-error '2013,1213,1105,1205,8022,8027,8028,9004,9007,1062'"] [nodename=benchtoolset]
2. Inject the fault
[2022/03/24 04:24:51.173 +08:00] [INFO] [chaos.go:86] ["Run chaos"] [name=network-loss] [selectors="[endless-oltp-tps-662849-1-968/tc-tikv-1]"] [experiment="{"Duration":"","Scheduler":null,"Loss":"84","Correlation":"25"}"]
[2022/03/24 04:24:51.175 +08:00] [INFO] [chaos.go:86] ["Run chaos"] [name=network-loss] [selectors="[endless-oltp-tps-662849-1-968/tc-tikv-0]"] [experiment="{"Duration":"","Scheduler":null,"Loss":"84","Correlation":"25"}"]
3. Recover from the fault
[2022/03/24 05:06:51.203 +08:00] [INFO] [chaos.go:151] ["Clean chaos"] [name=network-loss] [chaosId="ns=endless-oltp-tps-662849-1-968,kind=network-loss,name=network-loss-pdhgfxcy,spec=&k8s.ChaosIdentifier{Namespace:"endless-oltp-tps-662849-1-968", Name:"network-loss-pdhgfxcy", Spec:NetworkLossSpec{Duration: "", Scheduler: , Loss: "84", Correlation: "25"}}"]
[2022/03/24 05:06:51.203 +08:00] [INFO] [chaos.go:151] ["Clean chaos"] [name=network-loss] [chaosId="ns=endless-oltp-tps-662849-1-968,kind=network-loss,name=network-loss-zfevalyq,spec=&k8s.ChaosIdentifier{Namespace:"endless-oltp-tps-662849-1-968", Name:"network-loss-zfevalyq", Spec:NetworkLossSpec{Duration: "", Scheduler: , Loss: "84", Correlation: "25"}}"]

What did you expect?

All TiKV instances remain healthy.

What happened?

Two TiKV instances hit OOM at 2022/03/24 06:02 and 06:26.
tikv0 memory started to rise from 2022/03/24 05:10 and hit OOM at 06:26.
tikv1 memory started to rise from 2022/03/24 05:08 and hit OOM at 06:00.
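
To put the timeline in one place, here is a small sketch that computes the fault window and the time each node spent growing before OOM. The timestamps are copied from the logs and notes above, truncated to the minute; the per-node OOM times are used (the summary line gives 06:02 for the first OOM, the per-node line gives 06:00). The helper name `minutes_between` is only for illustration.

```python
from datetime import datetime

FMT = "%Y/%m/%d %H:%M"


def minutes_between(start: str, end: str) -> float:
    """Return the elapsed minutes between two 'YYYY/MM/DD HH:MM' timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds() / 60


# Timestamps taken from the report, truncated to the minute.
fault_injected = "2022/03/24 04:24"
fault_cleaned = "2022/03/24 05:06"
growth = {
    "tikv0": ("2022/03/24 05:10", "2022/03/24 06:26"),
    "tikv1": ("2022/03/24 05:08", "2022/03/24 06:00"),
}

fault_minutes = minutes_between(fault_injected, fault_cleaned)
growth_minutes = {
    name: minutes_between(start, end) for name, (start, end) in growth.items()
}

print(f"fault window: {fault_minutes:.0f} min")
for name, minutes in growth_minutes.items():
    print(f"{name}: memory grew for {minutes:.0f} min before OOM")
```

In both cases the growth starts only a few minutes after the fault is cleaned, not during the fault itself.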

@Lily2025
Author

/type bug
/severity Critical
/assign Connor1996
/assign tabokie

@ti-chi-bot added the type/bug and severity/critical labels on Mar 24, 2022
@Lily2025
Author

/remove-severity critical
/severity major

@Lily2025
Author

/found automation

@Lily2025
Author

/assign 5kbpers

@Lily2025
Author

/remove-severity critical
/severity Moderate

@ti-chi-bot
Member

@Lily2025: These labels are not set on the issue: severity/critical.

In response to this:

/remove-severity critical
/severity Moderate


@Lily2025
Author

/remove-severity major
/severity Moderate

@tabokie
Member

tabokie commented Mar 25, 2022

It appears that the memory growth tracks the active Raft entry count. After restart, the baseline memory usage (4.5 GB) reflects the memory used by Raft Engine.

This happens because during disconnection, log entries cannot be GC-ed by the leader, and the in-memory index inside Raft Engine accumulates indefinitely.

No fix at the moment.
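
The mechanism described above can be illustrated with a toy model. This is a minimal sketch, not Raft Engine code: `ToyLogIndex`, `simulate`, and the round counts are all hypothetical, and the model simply assumes that log GC is skipped while a node is disconnected, as the comment states. It does not model why GC stalls or how TiKV decides what to compact.

```python
class ToyLogIndex:
    """Toy in-memory index: one slot per log entry that has not been GC-ed."""

    def __init__(self):
        self.slots = {}      # entry index -> (file_id, offset), values are dummies
        self.next_index = 1

    def append(self, count):
        """Record `count` new log entries in the index."""
        for _ in range(count):
            self.slots[self.next_index] = (0, self.next_index)
            self.next_index += 1

    def gc(self):
        """Drop every indexed entry (stands in for a successful log GC)."""
        self.slots.clear()


def simulate(rounds, entries_per_round, disconnected_rounds):
    """Return the peak number of index slots held in memory.

    Assumption: GC runs once per round, except in rounds where the node is
    disconnected, in which case it is skipped entirely.
    """
    index = ToyLogIndex()
    peak = 0
    for r in range(rounds):
        index.append(entries_per_round)
        peak = max(peak, len(index.slots))
        if r not in disconnected_rounds:
            index.gc()
    return peak


healthy_peak = simulate(100, 1000, disconnected_rounds=set())
faulty_peak = simulate(100, 1000, disconnected_rounds=set(range(20, 60)))
print(healthy_peak, faulty_peak)
```

With GC running every round the index never holds more than one round of entries; with GC blocked for 40 rounds, the peak scales with the length of the blocked period, which is the unbounded growth the comment describes.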

@Lily2025
Author

/remove-severity Moderate
/severity major

@Lily2025
Author

/remove-severity major
/severity critical

@Lily2025 changed the title from "two tikv oom after inject two tikv network-loss and recovery for some time in 5 replicas scenes" to "two tikv oom after inject tikv network-loss and recovery for some time" on Mar 29, 2022
@Lily2025
Author

/affects-6.0

@Lily2025
Author

Lily2025 commented Apr 1, 2022

/label affects-6.0

ti-chi-bot pushed a commit that referenced this issue May 7, 2022
close #12255

Support setting memory limit for raft engine

Signed-off-by: tabokie <xy.tao@outlook.com>
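
The commit message only says that a memory limit can now be set for Raft Engine; it does not describe how the limit is enforced. As a rough illustration of the general idea of a memory-bounded index, here is a hypothetical sketch in which the oldest index slots are moved to a spill store once a byte budget is exceeded. `BoundedIndex`, `slot_bytes`, and the spill dictionary (standing in for on-disk storage) are assumptions for this example and are not taken from the TiKV or Raft Engine code.

```python
from collections import OrderedDict


class BoundedIndex:
    """Hypothetical index that keeps at most `memory_limit_bytes` resident."""

    def __init__(self, memory_limit_bytes, slot_bytes=64):
        self.limit = memory_limit_bytes
        self.slot_bytes = slot_bytes    # assumed fixed cost per index slot
        self.resident = OrderedDict()   # entry index -> location, oldest first
        self.spilled = {}               # stand-in for slots moved out of memory

    def memory_used(self):
        return len(self.resident) * self.slot_bytes

    def insert(self, entry_index, location):
        self.resident[entry_index] = location
        # Move the oldest slots out until the budget is respected again.
        while self.memory_used() > self.limit:
            old_index, old_location = self.resident.popitem(last=False)
            self.spilled[old_index] = old_location

    def lookup(self, entry_index):
        if entry_index in self.resident:
            return self.resident[entry_index]
        return self.spilled[entry_index]


bounded = BoundedIndex(memory_limit_bytes=64 * 100)
for i in range(1, 1001):
    bounded.insert(i, (0, i))

print(bounded.memory_used(), len(bounded.resident), len(bounded.spilled))
```

The point of the sketch is only that resident memory stays under the configured budget no matter how many entries accumulate, while older entries remain reachable through a slower path.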
3AceShowHand pushed a commit to 3AceShowHand/tikv that referenced this issue May 7, 2022
close tikv#12255

Support setting memory limit for raft engine

Signed-off-by: tabokie <xy.tao@outlook.com>
Signed-off-by: 3AceShowHand <jinl1037@hotmail.com>