receive: cannot load WAL data on restart #1624

Closed
GuyCheung opened this issue Oct 10, 2019 · 41 comments

Comments

@GuyCheung commented Oct 10, 2019

Thanos, Prometheus and Golang version used:

Thanos: self-built from thanos master: a09a4b9
Golang: go version go1.12.7 darwin/amd64
Prometheus: 2.13.0

thanos build command: GOOS=linux GOARCH=amd64 go build -o thanos ./cmd/thanos

What happened:

I'm trying to use thanos receive; it was running as expected for hours before I restarted it. After the restart, thanos receive tries to load the WAL data again and again.

What you expected to happen:

Load the WAL data once, then listen for and receive new data.

How to reproduce it (as minimally and precisely as possible):

I'm not sure whether there is a logic issue when receive starts. Thanos 0.7 can restart successfully, but the master code cannot.

Full logs to relevant components:

Logs

level=info ts=2019-10-10T13:11:00.909450843Z caller=main.go:170 msg="Tracing will be disabled"
level=warn ts=2019-10-10T13:11:00.909646509Z caller=receive.go:145 component=receive msg="setting up receive; the Thanos receive component is EXPERIMENTAL, it may break significantly without notice"
level=info ts=2019-10-10T13:11:00.910759805Z caller=factory.go:39 component=receive msg="loading bucket configuration"
level=info ts=2019-10-10T13:11:00.913326316Z caller=receive.go:432 component=receive msg="starting receiver"
level=info ts=2019-10-10T13:11:00.913810978Z caller=handler.go:160 component=receive component=receive-handler msg="Start listening for connections" address=0.0.0.0:19211
level=info ts=2019-10-10T13:11:00.913638216Z caller=main.go:353 component=receive msg="listening for requests and metrics" component=receive address=0.0.0.0:19210
level=info ts=2019-10-10T13:11:00.913837063Z caller=main.go:257 component=receive msg="disabled TLS, key and cert must be set to enable"
level=info ts=2019-10-10T13:11:00.91392518Z caller=prober.go:143 component=receive msg="changing probe status" status=healthy
level=info ts=2019-10-10T13:11:00.914250195Z caller=repair.go:59 component=receive component=tsdb component=tsdb msg="found healthy block" mint=1570680000000 maxt=1570687200000 ulid=01DPT7F6K1TJJC4XAJJ9WVBZY5
level=info ts=2019-10-10T13:11:00.91435138Z caller=repair.go:59 component=receive component=tsdb component=tsdb msg="found healthy block" mint=1570687200000 maxt=1570694400000 ulid=01DPTEAZDMJJDYSZ5XVQ9Z76XQ
level=info ts=2019-10-10T13:11:00.914334174Z caller=receive.go:293 component=receive msg="hashring has changed; server is not ready to receive web requests."
level=info ts=2019-10-10T13:11:00.914388157Z caller=repair.go:59 component=receive component=tsdb component=tsdb msg="found healthy block" mint=1570694400000 maxt=1570701600000 ulid=01DPTN6PKK3HJA99PH74XZYWPX
level=info ts=2019-10-10T13:11:00.914437502Z caller=repair.go:59 component=receive component=tsdb component=tsdb msg="found healthy block" mint=1570701600000 maxt=1570708800000 ulid=01DPTW2DQ6DQT1YZ9BF40QJS47
level=info ts=2019-10-10T13:11:01.480532567Z caller=head.go:509 component=receive component=tsdb component=tsdb msg="replaying WAL, this may take awhile"
level=info ts=2019-10-10T13:11:54.313611621Z caller=head.go:533 component=receive component=tsdb component=tsdb msg="WAL checkpoint loaded"
level=info ts=2019-10-10T13:11:58.976905831Z caller=head.go:557 component=receive component=tsdb component=tsdb msg="WAL segment loaded" segment=2034 maxSegment=2167
level=info ts=2019-10-10T13:12:03.674326957Z caller=head.go:557 component=receive component=tsdb component=tsdb msg="WAL segment loaded" segment=2035 maxSegment=2167
level=info ts=2019-10-10T13:12:08.825274021Z caller=head.go:557 component=receive component=tsdb component=tsdb msg="WAL segment loaded" segment=2036 maxSegment=2167
level=info ts=2019-10-10T13:12:13.699480427Z caller=head.go:557 component=receive component=tsdb component=tsdb msg="WAL segment loaded" segment=2037 maxSegment=2167
level=info ts=2019-10-10T13:12:18.64123707Z caller=head.go:557 component=receive component=tsdb component=tsdb msg="WAL segment loaded" segment=2038 maxSegment=2167
level=info ts=2019-10-10T13:12:23.570922965Z caller=head.go:557 component=receive component=tsdb component=tsdb msg="WAL segment loaded" segment=2039 maxSegment=2167

.....

level=info ts=2019-10-10T13:22:40.613369573Z caller=head.go:557 component=receive component=tsdb component=tsdb msg="WAL segment loaded" segment=2165 maxSegment=2167
level=warn ts=2019-10-10T13:22:41.261691466Z caller=head.go:492 component=receive component=tsdb component=tsdb msg="unknown series references" count=104
level=info ts=2019-10-10T13:22:41.261809977Z caller=head.go:557 component=receive component=tsdb component=tsdb msg="WAL segment loaded" segment=2166 maxSegment=2167
level=info ts=2019-10-10T13:22:41.26233365Z caller=head.go:557 component=receive component=tsdb component=tsdb msg="WAL segment loaded" segment=2167 maxSegment=2167
level=info ts=2019-10-10T13:23:18.541986663Z caller=head.go:509 component=receive component=tsdb msg="replaying WAL, this may take awhile"
level=info ts=2019-10-10T13:24:06.787360411Z caller=head.go:533 component=receive component=tsdb msg="WAL checkpoint loaded"
level=info ts=2019-10-10T13:24:11.174920806Z caller=head.go:557 component=receive component=tsdb msg="WAL segment loaded" segment=2034 maxSegment=2167
level=info ts=2019-10-10T13:24:15.48524744Z caller=head.go:557 component=receive component=tsdb msg="WAL segment loaded" segment=2035 maxSegment=2167
level=info ts=2019-10-10T13:24:20.140672496Z caller=head.go:557 component=receive component=tsdb msg="WAL segment loaded" segment=2036 maxSegment=2167
level=info ts=2019-10-10T13:24:24.467852131Z caller=head.go:557 component=receive component=tsdb msg="WAL segment loaded" segment=2037 maxSegment=2167
level=info ts=2019-10-10T13:24:28.944796843Z caller=head.go:557 component=receive component=tsdb msg="WAL segment loaded" segment=2038 maxSegment=2167

....

level=info ts=2019-10-10T13:33:51.954823544Z caller=head.go:557 component=receive component=tsdb msg="WAL segment loaded" segment=2136 maxSegment=2167
level=warn ts=2019-10-10T13:33:57.647949531Z caller=head.go:492 component=receive component=tsdb msg="unknown series references" count=6973
level=info ts=2019-10-10T13:33:57.648056531Z caller=head.go:557 component=receive component=tsdb msg="WAL segment loaded" segment=2137 maxSegment=2167
level=warn ts=2019-10-10T13:34:03.485428154Z caller=head.go:492 component=receive component=tsdb msg="unknown series references" count=7034
level=info ts=2019-10-10T13:34:03.485536561Z caller=head.go:557 component=receive component=tsdb msg="WAL segment loaded" segment=2138 maxSegment=2167
level=info ts=2019-10-10T13:34:22.347698132Z caller=main.go:170 msg="Tracing will be disabled"
level=warn ts=2019-10-10T13:34:22.348067504Z caller=receive.go:145 component=receive msg="setting up receive; the Thanos receive component is EXPERIMENTAL, it may break significantly without notice"
level=info ts=2019-10-10T13:34:22.34921212Z caller=factory.go:39 component=receive msg="loading bucket configuration"
level=info ts=2019-10-10T13:34:22.353678479Z caller=receive.go:432 component=receive msg="starting receiver"
level=info ts=2019-10-10T13:34:22.353887745Z caller=main.go:353 component=receive msg="listening for requests and metrics" component=receive address=0.0.0.0:19210
level=info ts=2019-10-10T13:34:22.354027884Z caller=prober.go:143 component=receive msg="changing probe status" status=healthy
level=info ts=2019-10-10T13:34:22.354002076Z caller=handler.go:160 component=receive component=receive-handler msg="Start listening for connections" address=0.0.0.0:19211
level=info ts=2019-10-10T13:34:22.354402573Z caller=main.go:257 component=receive msg="disabled TLS, key and cert must be set to enable"
level=info ts=2019-10-10T13:34:22.35507048Z caller=receive.go:293 component=receive msg="hashring has changed; server is not ready to receive web requests."
level=info ts=2019-10-10T13:34:22.355474967Z caller=repair.go:59 component=receive component=tsdb component=tsdb msg="found healthy block" mint=1570680000000 maxt=1570687200000 ulid=01DPT7F6K1TJJC4XAJJ9WVBZY5
level=info ts=2019-10-10T13:34:22.355602831Z caller=repair.go:59 component=receive component=tsdb component=tsdb msg="found healthy block" mint=1570687200000 maxt=1570694400000 ulid=01DPTEAZDMJJDYSZ5XVQ9Z76XQ
level=info ts=2019-10-10T13:34:22.355642604Z caller=repair.go:59 component=receive component=tsdb component=tsdb msg="found healthy block" mint=1570694400000 maxt=1570701600000 ulid=01DPTN6PKK3HJA99PH74XZYWPX
level=info ts=2019-10-10T13:34:22.355704854Z caller=repair.go:59 component=receive component=tsdb component=tsdb msg="found healthy block" mint=1570701600000 maxt=1570708800000 ulid=01DPTW2DQ6DQT1YZ9BF40QJS47
level=info ts=2019-10-10T13:34:22.840486871Z caller=head.go:509 component=receive component=tsdb component=tsdb msg="replaying WAL, this may take awhile"

Environment:

  • OS (e.g. from /etc/os-release): ubuntu
  • Kernel (e.g. uname -a): 4.15.0
  • Others:
@brancz (Member) commented Oct 10, 2019

On master, receive ensures on startup that it starts with a clean state: it loads the WAL, flushes it to a block, and then uploads it. That's why you're seeing it loaded multiple times.

cc @squat
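
To make the startup sequence above concrete, here is a minimal, hypothetical Go sketch of the idea (replay the WAL via a read-only handle, flush it to a block, then reopen a writable DB). The package path and signatures are assumptions based on the 2019-era prometheus/tsdb read-only API, and uploadFlushedBlocks is a made-up placeholder for the shipper upload; this is not the actual Thanos code path.

```go
package receivesketch

import (
	"os"

	kitlog "github.com/go-kit/kit/log"
	"github.com/prometheus/tsdb" // later moved to github.com/prometheus/prometheus/tsdb
)

// flushAndReopen sketches the startup behaviour described above: replay the
// existing WAL once through a read-only handle, flush the in-memory head to
// an immutable block, and then reopen a writable DB whose WAL has little or
// nothing left to play back.
func flushAndReopen(dataDir string) (*tsdb.DB, error) {
	logger := kitlog.NewLogfmtLogger(os.Stderr)

	// First replay: open the TSDB read-only and flush the WAL into a block.
	roDB, err := tsdb.OpenDBReadOnly(dataDir, logger)
	if err != nil {
		return nil, err
	}
	if err := roDB.FlushWAL(dataDir); err != nil {
		roDB.Close()
		return nil, err
	}
	if err := roDB.Close(); err != nil {
		return nil, err
	}

	// The freshly written block is then picked up and uploaded, e.g. by the
	// shipper. uploadFlushedBlocks is a hypothetical placeholder, not a real
	// Thanos function.
	// uploadFlushedBlocks(dataDir)

	// Second replay: open the writable DB. With the WAL already flushed,
	// this replay should be cheap.
	return tsdb.Open(dataDir, logger, nil, tsdb.DefaultOptions)
}
```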

@squat (Member) commented Oct 10, 2019

@GuyCheung the WAL gets replayed and written to disk in the following cases (sketched below):

  • startup (for cleanup)
  • every time that the hashring configuration changes
  • every two hours
  • shutdown

Are you seeing an error where the WAL data is not successfully getting loaded?
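
Purely as an illustration of those trigger points (not the actual Thanos internals; flushStorage and hashringChanged are hypothetical names), the control flow could look roughly like this in Go:

```go
package receivesketch

import (
	"context"
	"os"
	"os/signal"
	"syscall"
	"time"
)

// runFlushLoop illustrates the four situations listed above in which the WAL
// is replayed and flushed to disk: startup, hashring configuration changes,
// a periodic two-hour timer, and shutdown.
func runFlushLoop(ctx context.Context, flushStorage func() error, hashringChanged <-chan struct{}) error {
	// 1. Startup: flush once for cleanup.
	if err := flushStorage(); err != nil {
		return err
	}

	// 3. Every two hours.
	ticker := time.NewTicker(2 * time.Hour)
	defer ticker.Stop()

	// 4. Shutdown.
	shutdown := make(chan os.Signal, 1)
	signal.Notify(shutdown, syscall.SIGINT, syscall.SIGTERM)

	for {
		select {
		case <-hashringChanged: // 2. hashring configuration changed
			if err := flushStorage(); err != nil {
				return err
			}
		case <-ticker.C: // 3. periodic flush
			if err := flushStorage(); err != nil {
				return err
			}
		case <-shutdown: // 4. flush one last time and stop
			return flushStorage()
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}
```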

@GuyCheung (Author)

@brancz @squat sorry for the late reply. I didn't see any critical errors, except for some unknown-series log lines like caller=head.go:492 component=receive component=tsdb msg="unknown series references" count=6973.

If it loads multiple times, is there an exact number of how many times? It seems like an infinite loop that keeps loading.

@squat (Member) commented Oct 15, 2019

@GuyCheung the WAL will be reloaded twice on startup (only one of the two times actually plays data back) and twice every two hours:

During startup:

  • startup for ReadOnly DB flush (data is played back)
  • startup for Write DB (WAL is empty so nothing is played back)

... then every two hours:

  • once for ReadOnly DB flush (data is played back)
  • again for Write DB open (WAL is empty so nothing is played back)

Are you seeing it more often than that?

@GuyCheung (Author) commented Oct 16, 2019

@squat thanks for the information! I think I know why it keeps loading... the process was OOM-killed. I have some questions based on this; could you help me?

Before my questions, an update: I have changed the Thanos version to v0.8.1.

  • 1st: during multiple replays, is the memory not released?

I assumed the memory usage would not be much bigger than the data folder on disk, but I found the process used about 47G of memory (out of 48G total), while the WAL data folder on disk was about 22G.

The process was using about 47G of memory when it was killed:

[7140727.569918] Out of memory: Kill process 89696 (thanos) score 936 or sacrifice child
[7140727.573522] Killed process 89696 (thanos) total-vm:52478600kB, anon-rss:47156872kB, file-rss:0kB, shmem-rss:0kB
[7140730.168611] oom_reaper: reaped process 89696 (thanos), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

The receiver data folder size is shown in the attached screenshot.

  • 2nd: given this situation, is there a way to limit the memory usage to avoid the OOM?

I tried to find an answer on the internet but unfortunately didn't find anything useful; some say Prometheus 2 cannot limit its memory, but I'm not sure about that.

  • 3rd: regarding the WAL replay, I succeeded on a host with 128G of memory, but it seems to replay more than twice?

I think it was replayed three times during startup, and each replay seemed to take about 10 minutes. Is this expected? Is there something we can do to reduce the replay time?

The key points of my logs are below:

level=info ts=2019-10-15T11:30:13.41402523Z caller=receive.go:432 component=receive msg="starting receiver"
level=info ts=2019-10-15T11:30:13.414148657Z caller=main.go:353 component=receive msg="listening for requests and metrics" component=receive address=0.0.0.0:19210
level=info ts=2019-10-15T11:30:13.414159228Z caller=handler.go:160 component=receive component=receive-handler msg="Start listening for connections" address=0.0.0.0:19211
level=info ts=2019-10-15T11:30:13.414214989Z caller=prober.go:143 component=receive msg="changing probe status" status=healthy
level=info ts=2019-10-15T11:30:13.414218471Z caller=receive.go:293 component=receive msg="hashring has changed; server is not ready to receive web requests."
level=info ts=2019-10-15T11:30:13.414273506Z caller=main.go:257 component=receive msg="disabled TLS, key and cert must be set to enable"
level=info ts=2019-10-15T11:30:13.414441585Z caller=repair.go:59 component=receive component=tsdb component=tsdb msg="found healthy block" mint=1571104800000 maxt=1571112000000 ulid=01DQ6WK24Q2YT0FF2P4X9C24QQ
level=info ts=2019-10-15T11:30:13.414492525Z caller=repair.go:59 component=receive component=tsdb component=tsdb msg="found healthy block" mint=1571112000000 maxt=1571119200000 ulid=01DQ73ESJB1ADEC1AJ53449AXY
level=info ts=2019-10-15T11:30:13.414524465Z caller=repair.go:59 component=receive component=tsdb component=tsdb msg="found healthy block" mint=1571119200000 maxt=1571126400000 ulid=01DQ7AAGNR3YW4KE0JZ16E7VSJ
level=info ts=2019-10-15T11:30:13.4145516Z caller=repair.go:59 component=receive component=tsdb component=tsdb msg="found healthy block" mint=1571126400000 maxt=1571133600000 ulid=01DQ7JCVJMGZKY88TTVF6Z9WX8
level=info ts=2019-10-15T11:30:13.728236906Z caller=head.go:509 component=receive component=tsdb component=tsdb msg="replaying WAL, this may take awhile"
level=info ts=2019-10-15T11:30:47.562607851Z caller=head.go:533 component=receive component=tsdb component=tsdb msg="WAL checkpoint loaded"
level=info ts=2019-10-15T11:30:53.447669188Z caller=head.go:557 component=receive component=tsdb component=tsdb msg="WAL segment loaded" segment=2493 maxSegment=2588
level=info ts=2019-10-15T11:30:58.981511175Z caller=head.go:557 component=receive component=tsdb component=tsdb msg="WAL segment loaded" segment=2494 maxSegment=2588
level=info ts=2019-10-15T11:31:04.501141802Z caller=head.go:557 component=receive component=tsdb component=tsdb msg="WAL segment loaded" segment=2495 maxSegment=2588
level=info ts=2019-10-15T11:31:10.169544723Z caller=head.go:557 component=receive component=tsdb component=tsdb msg="WAL segment loaded" segment=2496 maxSegment=2588

......

level=info ts=2019-10-15T11:40:25.17463596Z caller=head.go:557 component=receive component=tsdb component=tsdb msg="WAL segment loaded" segment=2584 maxSegment=2588
level=info ts=2019-10-15T11:40:31.55063322Z caller=head.go:557 component=receive component=tsdb component=tsdb msg="WAL segment loaded" segment=2585 maxSegment=2588
level=info ts=2019-10-15T11:40:32.134839757Z caller=head.go:557 component=receive component=tsdb component=tsdb msg="WAL segment loaded" segment=2586 maxSegment=2588
level=info ts=2019-10-15T11:40:32.13568359Z caller=head.go:557 component=receive component=tsdb component=tsdb msg="WAL segment loaded" segment=2587 maxSegment=2588
level=info ts=2019-10-15T11:40:32.136520863Z caller=head.go:557 component=receive component=tsdb component=tsdb msg="WAL segment loaded" segment=2588 maxSegment=2588
level=info ts=2019-10-15T11:40:48.050314426Z caller=head.go:509 component=receive component=tsdb msg="replaying WAL, this may take awhile"
level=info ts=2019-10-15T11:41:19.363551827Z caller=head.go:533 component=receive component=tsdb msg="WAL checkpoint loaded"
level=info ts=2019-10-15T11:41:26.103639804Z caller=head.go:557 component=receive component=tsdb msg="WAL segment loaded" segment=2493 maxSegment=2588
level=info ts=2019-10-15T11:41:31.860346412Z caller=head.go:557 component=receive component=tsdb msg="WAL segment loaded" segment=2494 maxSegment=2588
level=info ts=2019-10-15T11:41:37.503849905Z caller=head.go:557 component=receive component=tsdb msg="WAL segment loaded" segment=2495 maxSegment=2588
level=info ts=2019-10-15T11:41:43.180104605Z caller=head.go:557 component=receive component=tsdb msg="WAL segment loaded" segment=2496 maxSegment=2588

.......

level=info ts=2019-10-15T11:51:48.263074931Z caller=head.go:557 component=receive component=tsdb msg="WAL segment loaded" segment=2586 maxSegment=2588
level=info ts=2019-10-15T11:51:48.263503714Z caller=head.go:557 component=receive component=tsdb msg="WAL segment loaded" segment=2587 maxSegment=2588
level=info ts=2019-10-15T11:51:48.263827295Z caller=head.go:557 component=receive component=tsdb msg="WAL segment loaded" segment=2588 maxSegment=2588
level=info ts=2019-10-15T11:55:32.977551161Z caller=compact.go:495 component=receive component=tsdb msg="write block" mint=1571133600003 maxt=1571137701384 ulid=01DQ7M5MPT3ZVP0YGNVWZZBRX3 duration=3m26.870889547s
level=info ts=2019-10-15T11:55:32.977767127Z caller=repair.go:59 component=receive component=tsdb component=tsdb msg="found healthy block" mint=1571104800000 maxt=1571112000000 ulid=01DQ6WK24Q2YT0FF2P4X9C24QQ
level=info ts=2019-10-15T11:55:32.977804379Z caller=repair.go:59 component=receive component=tsdb component=tsdb msg="found healthy block" mint=1571112000000 maxt=1571119200000 ulid=01DQ73ESJB1ADEC1AJ53449AXY
level=info ts=2019-10-15T11:55:32.977831925Z caller=repair.go:59 component=receive component=tsdb component=tsdb msg="found healthy block" mint=1571119200000 maxt=1571126400000 ulid=01DQ7AAGNR3YW4KE0JZ16E7VSJ
level=info ts=2019-10-15T11:55:32.977858102Z caller=repair.go:59 component=receive component=tsdb component=tsdb msg="found healthy block" mint=1571126400000 maxt=1571133600000 ulid=01DQ7JCVJMGZKY88TTVF6Z9WX8
level=info ts=2019-10-15T11:55:32.977885957Z caller=repair.go:59 component=receive component=tsdb component=tsdb msg="found healthy block" mint=1571133600003 maxt=1571137701384 ulid=01DQ7M5MPT3ZVP0YGNVWZZBRX3
level=info ts=2019-10-15T11:55:33.274419251Z caller=head.go:509 component=receive component=tsdb component=tsdb msg="replaying WAL, this may take awhile"
level=info ts=2019-10-15T11:55:43.440394647Z caller=shipper.go:349 component=receive msg="upload new block" id=01DQ7M5MPT3ZVP0YGNVWZZBRX3
level=info ts=2019-10-15T11:56:06.821192263Z caller=head.go:533 component=receive component=tsdb component=tsdb msg="WAL checkpoint loaded"
level=info ts=2019-10-15T11:56:13.313346626Z caller=head.go:557 component=receive component=tsdb component=tsdb msg="WAL segment loaded" segment=2493 maxSegment=2589
level=info ts=2019-10-15T11:56:19.591774203Z caller=head.go:557 component=receive component=tsdb component=tsdb msg="WAL segment loaded" segment=2494 maxSegment=2589
level=info ts=2019-10-15T11:56:25.923561156Z caller=head.go:557 component=receive component=tsdb component=tsdb msg="WAL segment loaded" segment=2495 maxSegment=2589
level=info ts=2019-10-15T11:56:32.700377317Z caller=head.go:557 component=receive component=tsdb component=tsdb msg="WAL segment loaded" segment=2496 maxSegment=2589
level=info ts=2019-10-15T11:56:39.431858684Z caller=head.go:557 component=receive component=tsdb component=tsdb msg="WAL segment loaded" segment=2497 maxSegment=2589

..........

level=info ts=2019-10-15T12:06:15.766904275Z caller=head.go:557 component=receive component=tsdb component=tsdb msg="WAL segment loaded" segment=2586 maxSegment=2589
level=info ts=2019-10-15T12:06:15.767436341Z caller=head.go:557 component=receive component=tsdb component=tsdb msg="WAL segment loaded" segment=2587 maxSegment=2589
level=info ts=2019-10-15T12:06:15.767896116Z caller=head.go:557 component=receive component=tsdb component=tsdb msg="WAL segment loaded" segment=2588 maxSegment=2589
level=info ts=2019-10-15T12:06:15.768469372Z caller=head.go:557 component=receive component=tsdb component=tsdb msg="WAL segment loaded" segment=2589 maxSegment=2589
level=info ts=2019-10-15T12:06:37.956202426Z caller=receive.go:239 component=receive msg="tsdb started"
level=info ts=2019-10-15T12:06:37.9562544Z caller=prober.go:114 component=receive msg="changing probe status" status=ready
level=info ts=2019-10-15T12:06:37.95626502Z caller=receive.go:243 component=receive msg="server is ready to receive web requests."
level=info ts=2019-10-15T12:06:37.956699189Z caller=receive.go:347 component=receive msg="listening for StoreAPI gRPC" address=0.0.0.0:19219

@squat (Member) commented Oct 16, 2019

@GuyCheung, this huge growth in the WAL looks to me like the same issue that #1654 fixes. Every time the receive component triggers a flush, the WAL is replayed and written to a block, but the WAL is never deleted. Only new samples are actually written to the block, yet ALL samples are processed. This means the receive component has to do far more work than necessary. If the receive component has correctly shut down and cleaned the WAL, there should be zero replay cost when starting back up.
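
For illustration only, the flush-then-clean idea reads roughly like the sketch below. This is not the change merged in #1654; the read-only flush API usage and the direct removal of the wal directory are assumptions.

```go
package receivesketch

import (
	"os"
	"path/filepath"

	kitlog "github.com/go-kit/kit/log"
	"github.com/prometheus/tsdb"
)

// flushAndCleanWAL writes everything currently in the WAL out as an immutable
// block and then removes the replayed WAL segments, so later flushes and
// restarts do not re-process the same samples over and over.
func flushAndCleanWAL(dataDir string) error {
	logger := kitlog.NewLogfmtLogger(os.Stderr)

	roDB, err := tsdb.OpenDBReadOnly(dataDir, logger)
	if err != nil {
		return err
	}
	// Flush the WAL contents to a block on disk.
	if err := roDB.FlushWAL(dataDir); err != nil {
		roDB.Close()
		return err
	}
	if err := roDB.Close(); err != nil {
		return err
	}

	// Without this cleanup the old segments stay in <dataDir>/wal and every
	// subsequent flush replays all of them again, doing far more work than
	// necessary and re-reading samples that are already persisted in blocks.
	return os.RemoveAll(filepath.Join(dataDir, "wal"))
}
```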

@squat (Member) commented Oct 16, 2019

@GuyCheung now that #1654 has been merged, could you test again using the image built from that commit?

@GuyCheung (Author)

@squat I have tried the new version, commit 48a8fb6e2f6a476bcffa508d6609a19847c695ef, but I still got an OOM on the 128GB-memory hosts... and the data folder is only about 6GB. Could you help look into this?

(screenshot: data folder size)

I shared the pprof heap file with you: https://drive.google.com/file/d/1iKqfMD9brOXbt7mLqJCX689AhRPuzJ_N/view?usp=sharing

@Victorion

@squat I also tried to test the receive component with this patch; same story, but with a small amount of data: ~80 MB of generated WAL (a single segment) can't be replayed, and I got an OOM with a 4GB memory limit on the container.

stale bot commented Jan 11, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jan 11, 2020
@brancz (Member) commented Jan 13, 2020

Does this continue to occur for you?

@stale stale bot removed the stale label Jan 13, 2020
@ricoberger

We have a similar problem running Thanos in a Kubernetes cluster. When a pod gets restarted, the receiver is OOM-killed. The StatefulSet looks similar to the following example: https://github.com/thanos-io/kube-thanos/blob/master/examples/all/manifests/thanos-receive-statefulSet.yaml

The problem occurs with Thanos v0.10.0 and all earlier versions. We also tested it with up to 16GB of memory (when the pods are running they use around 1GB of memory). These are the logs until the pod gets OOM-killed:

level=info ts=2020-01-17T07:40:38.874473819Z caller=main.go:149 msg="Tracing will be disabled"
level=warn ts=2020-01-17T07:40:38.874629267Z caller=receive.go:170 component=receive msg="setting up receive; the Thanos receive component is EXPERIMENTAL, it may break significantly without notice"
level=info ts=2020-01-17T07:40:38.874722296Z caller=options.go:20 component=receive protocol=HTTP msg="disabled TLS, key and cert must be set to enable"
level=info ts=2020-01-17T07:40:38.902765617Z caller=options.go:80 component=receive msg="TLS client using system certificate pool"
level=info ts=2020-01-17T07:40:38.904426534Z caller=factory.go:43 component=receive msg="loading bucket configuration"
level=info ts=2020-01-17T07:40:39.133165452Z caller=receive.go:476 component=receive msg="starting receiver"
level=info ts=2020-01-17T07:40:39.133391365Z caller=prober.go:127 component=receive msg="changing probe status" status=healthy
level=info ts=2020-01-17T07:40:39.133430517Z caller=http.go:53 component=receive service=http/server component=receive msg="listening for requests and metrics" address=0.0.0.0:10902
level=info ts=2020-01-17T07:40:39.133494173Z caller=options.go:20 component=receive protocol=gRPC msg="disabled TLS, key and cert must be set to enable"
level=info ts=2020-01-17T07:40:39.133543844Z caller=receive.go:322 component=receive msg="hashring has changed; server is not ready to receive web requests."
level=info ts=2020-01-17T07:40:39.13335757Z caller=handler.go:171 component=receive component=receive-handler msg="Start listening for connections" address=0.0.0.0:19291
level=info ts=2020-01-17T07:40:39.394096288Z caller=head.go:583 component=receive component=tsdb msg="replaying WAL, this may take awhile"
level=info ts=2020-01-17T07:40:41.731857492Z caller=head.go:607 component=receive component=tsdb msg="WAL checkpoint loaded"
level=info ts=2020-01-17T07:40:44.053388267Z caller=head.go:631 component=receive component=tsdb msg="WAL segment loaded" segment=15 maxSegment=22
level=info ts=2020-01-17T07:40:45.417218717Z caller=head.go:631 component=receive component=tsdb msg="WAL segment loaded" segment=16 maxSegment=22
level=info ts=2020-01-17T07:40:47.796282924Z caller=head.go:631 component=receive component=tsdb msg="WAL segment loaded" segment=17 maxSegment=22

@squat (Member) commented Jan 17, 2020

This entire code path is simply leveraging Prometheus TSDB packages. Do we see similar spikes in memory when Prometheus restarts and replays a large WAL?

@ricoberger

We are seeing similar spikes in memory when Prometheus restarts and replays a large WAL, but the Prometheus spikes are not as extreme as the Thanos ones.

Thanos gets OOM-killed with 16GB; after that we decreased the limit to 4GB:

(screenshot: Thanos receive memory usage, 2020-01-17 15:00)

Prometheus:

(screenshot: Prometheus memory usage, 2020-01-17 15:00)

@sepich (Contributor) commented Jan 27, 2020

We also see that thanos-receive v0.10.1 spikes in memory usage on start and gets into an OOM loop.
We had to reduce --tsdb.block-duration to 15m to keep the WAL smaller; startup memory usage dropped dramatically.

@rdzimmer-zz

I am also seeing huge spikes in memory after a restart. 9GB becomes 93GB. I have more details in #2107 (which I closed as a duplicate after finding I missed this issue), but sounds pretty similar to what everyone else is seeing.

stale bot commented Mar 8, 2020

This issue/PR has been automatically marked as stale because it has not had recent activity. Please comment on status otherwise the issue will be closed in a week. Thank you for your contributions.

@stale stale bot added the stale label Mar 8, 2020
@mikkeloscar

This still happens with Thanos v0.11.0

@stale stale bot removed the stale label Mar 9, 2020
@brancz (Member) commented Mar 17, 2020

This can largely be attributed to the WAL replay cost of the Prometheus TSDB, which thanos-receive makes use of. This is already being improved in Prometheus itself; once merged there, it'll trickle down to here.

@norbertwnuk

@brancz - can you point us to some existing Prometheus PR?

@brancz (Member) commented Mar 26, 2020

One example: prometheus/prometheus#6679. There are more ideas in the pipeline once that work is completed.

stale bot commented Apr 25, 2020

Hello 👋 Looks like there was no activity on this issue for last 30 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity for next week, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Apr 25, 2020
@sepich (Contributor) commented Apr 26, 2020

Still valid for v0.12.1

@stale stale bot removed the stale label Apr 26, 2020
@brancz (Member) commented Apr 27, 2020

Not necessarily a general solution (as it just pushes the problem out further), but the latest version of the Prometheus TSDB should at least cut this problem in half: prometheus/prometheus#7098

stale bot commented May 27, 2020

Hello 👋 Looks like there was no activity on this issue for last 30 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity for next week, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label May 27, 2020
stale bot commented Jun 3, 2020

Closing for now as promised, let us know if you need this to be reopened! 🤗

@stale stale bot closed this as completed Jun 3, 2020
@daxwang commented Jun 5, 2020

Same problem with v0.12.2: restart receive, then OOM.

@pratap-bhaskar

We are experiencing a similar problem with our setup. We use Thanos Receive to collect metrics from 8 clusters; we run the pods on 64GB-memory VMs, it gets OOM-killed at least once a day, and we lose data for at least 2 hours. Is there a way to circumvent this scenario?

@JoseRIvera07

Still valid for v0.17.2: after restarting the receiver component, it gets OOM-killed.

@thinker0 commented Feb 9, 2021

same problem for v0.18.0, restart receive, then oom

@kakkoyun (Member) commented Feb 9, 2021

I'm reopening this; it seems like there is still an issue. If we decide this is not an issue, let's convert it to a discussion and redirect users there for similar problems.

cc @brancz @squat @bwplotka

@kakkoyun kakkoyun reopened this Feb 9, 2021
stale bot commented Jun 3, 2021

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Jun 3, 2021
@EladDolev

still happening

@stale stale bot removed the stale label Jun 4, 2021
stale bot commented Aug 4, 2021

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Aug 4, 2021
stale bot commented Aug 21, 2021

Closing for now as promised, let us know if you need this to be reopened! 🤗

@stale stale bot closed this as completed Aug 21, 2021
@imranismail

This is still happening on the latest release

@luizrojo (Contributor)

Still happening on 0.24.0

@Boeller666

Still happening on 0.25.1

@Venture200

Still happening on 0.29.0

@praveenchandran

Still happening on 0.31.0. Why is this not being looked into? :(

@gioppoluca

still happening on 0.31.0
