
Examples: Fix the interactive test for MacOS users #4779

Merged (8 commits, Nov 3, 2021)

Conversation

@matej-g (Collaborator) commented Oct 14, 2021

  • I added CHANGELOG entry for this change.
  • Change is not relevant to the end user.

Changes

This PR includes a couple of fixes targeted at macOS users, namely:

  1. Bumping to the latest version of efficientgo/e2e, which includes a fix that allows running the interactive test on macOS.
  2. Adjusting some cp -t usage in the code, which was not compatible with the macOS version of cp.
  3. Making the block generation profile configurable, with a smaller default, so that the test runs with less RAM.
  4. Updating the documentation to include more details.

Verification

Perhaps an Apple user could give the final confirmation? 🍏

Signed-off-by: Matej Gera <matejgera@gmail.com>
@saswatamcode (Member) commented Oct 14, 2021

I encountered this error (after running the interactive test previously, i.e., the interactive/data dir exists), but I'm not sure if this is related to efficientgo/e2e:

=== RUN   TestReadOnlyThanosSetup
23:57:38 Starting cadvisor
23:57:46 Ports for container interactive-cadvisor >> Local ports: map[http:8080] Ports available from host: map[http:59667]
23:57:46 Starting monitoring
23:57:49 cadvisor: W1014 18:27:49.933149       1 sysinfo.go:203] Nodes topology is not available, providing CPU topology
23:57:49 cadvisor: W1014 18:27:49.933514       1 sysfs.go:348] unable to read /sys/devices/system/cpu/cpu0/online: open /sys/devices/system/cpu/cpu0/online: no such file or directory
23:57:49 cadvisor: W1014 18:27:50.014122       1 oomparser.go:173] error reading /dev/kmsg: read /dev/kmsg: broken pipe
23:57:49 cadvisor: E1014 18:27:50.014232       1 oomparser.go:149] exiting analyzeLines. OOM events will not be reported.
23:57:52 monitoring: level=info ts=2021-10-14T18:27:52.176Z caller=main.go:388 msg="No time or size retention was set so using the default time retention" duration=15d
23:57:52 monitoring: level=info ts=2021-10-14T18:27:52.176Z caller=main.go:426 msg="Starting Prometheus" version="(version=2.27.0, branch=HEAD, revision=24c9b61221f7006e87cd62b9fe2901d43e19ed53)"
23:57:52 monitoring: level=info ts=2021-10-14T18:27:52.176Z caller=main.go:431 build_context="(go=go1.16.4, user=root@f27daa3b3fec, date=20210512-18:04:51)"
23:57:52 monitoring: level=info ts=2021-10-14T18:27:52.176Z caller=main.go:432 host_details="(Linux 5.10.25-linuxkit #1 SMP Tue Mar 23 09:27:39 UTC 2021 x86_64 monitoring (none))"
23:57:52 monitoring: level=info ts=2021-10-14T18:27:52.176Z caller=main.go:433 fd_limits="(soft=1048576, hard=1048576)"
23:57:52 monitoring: level=info ts=2021-10-14T18:27:52.177Z caller=main.go:434 vm_limits="(soft=unlimited, hard=unlimited)"
23:57:52 monitoring: level=info ts=2021-10-14T18:27:52.303Z caller=web.go:540 component=web msg="Start listening for connections" address=:9090
23:57:52 monitoring: level=info ts=2021-10-14T18:27:52.316Z caller=main.go:803 msg="Starting TSDB ..."
23:57:52 monitoring: level=info ts=2021-10-14T18:27:52.376Z caller=tls_config.go:191 component=web msg="TLS is disabled." http2=false
23:57:52 monitoring: level=info ts=2021-10-14T18:27:52.488Z caller=head.go:741 component=tsdb msg="Replaying on-disk memory mappable chunks if any"
23:57:52 monitoring: level=info ts=2021-10-14T18:27:52.489Z caller=head.go:755 component=tsdb msg="On-disk memory mappable chunks replay completed" duration=45.5µs
23:57:52 monitoring: level=info ts=2021-10-14T18:27:52.491Z caller=head.go:761 component=tsdb msg="Replaying WAL, this may take a while"
23:57:52 monitoring: level=info ts=2021-10-14T18:27:52.506Z caller=head.go:813 component=tsdb msg="WAL segment loaded" segment=0 maxSegment=0
23:57:52 monitoring: level=info ts=2021-10-14T18:27:52.506Z caller=head.go:818 component=tsdb msg="WAL replay completed" checkpoint_replay_duration=9.6788ms wal_replay_duration=5.0222ms total_replay_duration=17.7563ms
23:57:52 monitoring: level=info ts=2021-10-14T18:27:52.548Z caller=main.go:828 fs_type=65735546
23:57:52 monitoring: level=info ts=2021-10-14T18:27:52.549Z caller=main.go:831 msg="TSDB started"
23:57:52 monitoring: level=info ts=2021-10-14T18:27:52.549Z caller=main.go:957 msg="Loading configuration file" filename=/shared/data/monitoring/prometheus.yml
23:57:52 monitoring: level=info ts=2021-10-14T18:27:52.581Z caller=main.go:988 msg="Completed loading of configuration file" filename=/shared/data/monitoring/prometheus.yml totalDuration=26.8292ms remote_storage=79.4µs web_handler=18.8µs query_engine=17.4µs scrape=4.1701ms scrape_sd=360.6µs notify=14.5µs notify_sd=324.1µs rules=16.3µs
23:57:52 monitoring: level=info ts=2021-10-14T18:27:52.581Z caller=main.go:775 msg="Server is ready to receive web requests."
23:57:53 monitoring: level=info ts=2021-10-14T18:27:53.786Z caller=main.go:957 msg="Loading configuration file" filename=/shared/data/monitoring/prometheus.yml
23:57:53 monitoring: level=info ts=2021-10-14T18:27:53.818Z caller=main.go:988 msg="Completed loading of configuration file" filename=/shared/data/monitoring/prometheus.yml totalDuration=32.7781ms remote_storage=16.1µs web_handler=12.1µs query_engine=13.5µs scrape=189.8µs scrape_sd=300.8µs notify=13.8µs notify_sd=15.2µs rules=13.9µs
23:57:54 Ports for container interactive-monitoring >> Local ports: map[http:9090] Ports available from host: map[http:59705]
    interactive_test.go:107: interactive_test.go:107:
        
         unexpected error: Prometheus failed to scrape local endpoint after 2 minutes, check monitoring Prometheus logs
        
23:59:54 Killing monitoring
23:59:56 Killing cadvisor
--- FAIL: TestReadOnlyThanosSetup (154.81s)
FAIL
FAIL    command-line-arguments  156.287s
FAIL

Also, the logs were the same as above after starting afresh (without an existing interactive/data dir).

@matej-g (Collaborator, Author) commented Oct 14, 2021

I encountered this error (after running the interactive test previously, i.e., the interactive/data dir exists), but I'm not sure if this is related to efficientgo/e2e:

Oof, I think I know this one 🤕. Since we're running the Prometheus instance in Docker, but we want to scrape metrics from the host machine, there needs to be a connection from the container to the host. The framework assumes that the host machine will be reachable on the network's gateway IP, but this does not seem to work in all cases. Since the host metrics never get scraped, the operation times out after 2 minutes.

@saswatamcode (Member) commented Oct 15, 2021

Oh! I think this is due to docker networking being different on macOS.

An easy fix seems to be just replacing this line with d.hostAddr = "gateway.docker.internal" (we would need to detect the OS), which I tried locally and it works (the thanosbench issue remains, though). 🙂
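For illustration, the OS-dependent address choice could be sketched like this (a hypothetical helper, not the real e2e code; the gateway IP value and the function shape are assumptions based on this comment):

```go
package main

import (
	"fmt"
	"runtime"
)

// hostAddr returns the address at which containers can reach the host.
// Docker Desktop on macOS runs containers inside a VM, so the Linux
// convention of dialing the bridge gateway IP does not reach the Mac
// host; Docker instead exposes it via a special DNS name.
func hostAddr(goos, gatewayIP string) string {
	if goos == "darwin" {
		return "gateway.docker.internal"
	}
	// On Linux the bridge network gateway points back at the host.
	return gatewayIP
}

func main() {
	fmt.Println(hostAddr(runtime.GOOS, "172.17.0.1"))
}
```

Taking goos as a parameter instead of reading runtime.GOOS inside the function keeps both branches testable on any platform.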

Maybe I can raise a PR to e2e?

@matej-g (Collaborator, Author) commented Oct 15, 2021

Maybe I can raise a PR to e2e?

Absolutely, go for it! 🥳 I'd be happy to review it.

@saswatamcode (Member) commented Oct 16, 2021

For the thanosbench issue, I don't think it's related to the platform, as createData() uses containers for it.

It seems like createData() should generate 9 blocks for each data dir (i.e., store1, store2, prom1, prom2), but it only generates 5 before erroring out with status code 137. I tested this on both Darwin x86_64 and Linux x86_64 (WSL2) and got the same results, i.e., the same logs and 5 blocks generated before the error.

Changing the block plan profile from continuous-1w-small to continuous-30d-tiny seems to work here, i.e., it actually generates 9 blocks and doesn't error out. But I think it only has 5 series per block.

Edit: Also, status code 137 probably means it's getting OOM-killed in some way, even though docker inspect shows it wasn't:

[
    {
        "Id": "8162c0b62bc2446929407e762e1ddff1e27832c6ce97b883297597eb1848e17e",
        "Created": "2021-10-16T13:45:19.620969732Z",
        "Path": "/bin/thanosbench",
        "Args": [
            "block",
            "gen",
            "--output.dir",
            "/shared"
        ],
        "State": {
            "Status": "exited",
            "Running": false,
            "Paused": false,
            "Restarting": false,
            "OOMKilled": false,
            "Dead": false,
            "Pid": 0,
            "ExitCode": 137,
            "Error": "",
            "StartedAt": "2021-10-16T13:45:22.172899406Z",
            "FinishedAt": "2021-10-16T13:47:11.717758878Z"
        },
        "Image": "sha256:bef7a74fb0aacfaa470ecea92033c38e640e09a2bd2ccae7fcdd176530a834cc",

Maybe this is an issue with thanosbench or I'm doing something wrong? 🤕

@matej-g (Collaborator, Author) commented Oct 18, 2021

Edit: Also, status code 137 probably means it's getting OOM-killed in some way, even though docker inspect shows it wasn't,

Maybe this is an issue with thanosbench or I'm doing something wrong? 🤕

Works fine on my machine ™️. I'm wondering if you're right about the memory, especially if you're running on macOS or in another virtualized environment with a memory limit (on my Linux machine, I seem to be limited only by my host machine's available memory). What does docker stats show you when you run the test? What is the MEM USAGE / LIMIT for the thanosbench container(s)?

@saswatamcode (Member) commented Oct 18, 2021

Yes, I considered that. Docker for macOS limits containers to 2GB of memory and half the number of host CPUs by default, so I bumped memory to 8GB and maxed out the CPUs, but I still get the same result. The test creates two containers for each store (one for block plan and one for block gen). The block gen container's MEM USAGE / LIMIT is at 7.005GiB / 7.774GiB by the end:

        level=info ts=2021-10-18T10:04:51.148618306Z caller=block.go:94 msg="generating block" spec="[1626667200005 - 1626696000004](7h59m59.999s) "
        level=info ts=2021-10-18T10:04:51.185266496Z caller=head.go:644 msg="Replaying on-disk memory mappable chunks if any"
        level=error ts=2021-10-18T10:04:51.21856407Z caller=head.go:649 msg="Loading on-disk chunks failed" err="iterate on on-disk chunks: out of sequence m-mapped chunk for series ref 17"
        level=info ts=2021-10-18T10:04:51.218680786Z caller=head.go:760 msg="Deleting mmapped chunk files"
        level=info ts=2021-10-18T10:04:51.218699391Z caller=head.go:763 msg="Deletion of mmap chunk files failed, discarding chunk files completely" err="cannot handle error: iterate on on-disk chunks: out of sequence m-mapped chunk for series ref 17"
        level=info ts=2021-10-18T10:04:51.218709409Z caller=head.go:658 msg="On-disk memory mappable chunks replay completed" duration=33.361098ms
        level=info ts=2021-10-18T10:04:51.218716063Z caller=head.go:660 msg="WAL not found"
        level=info ts=2021-10-18T10:05:04.207970966Z caller=writer.go:123 msg=flushing series_count=10000 mint=2021-07-19T04:00:15.005Z maxt=2021-07-19T12:00:00.005Z
        level=info ts=2021-10-18T10:05:07.100724541Z caller=compact.go:494 msg="write block" mint=1626667215005 maxt=1626696000006 ulid=01FJ9DS7AGE6YSRBHXEVHF1RC0 duration=2.892528562s
        level=info ts=2021-10-18T10:05:07.226623846Z caller=block.go:100 msg="generated block" path=/shared/01FJ9DS7AGE6YSRBHXEVHF1RC0 count=5
        level=info ts=2021-10-18T10:05:08.7544727Z caller=block.go:94 msg="generating block" spec="[1626494400006 - 1626667200005](47h59m59.999s) "
        level=info ts=2021-10-18T10:05:08.910671352Z caller=head.go:644 msg="Replaying on-disk memory mappable chunks if any"
        level=error ts=2021-10-18T10:05:08.971713339Z caller=head.go:649 msg="Loading on-disk chunks failed" err="iterate on on-disk chunks: out of sequence m-mapped chunk for series ref 17"
        level=info ts=2021-10-18T10:05:08.972019642Z caller=head.go:760 msg="Deleting mmapped chunk files"
        level=info ts=2021-10-18T10:05:08.972106229Z caller=head.go:763 msg="Deletion of mmap chunk files failed, discarding chunk files completely" err="cannot handle error: iterate on on-disk chunks: out of sequence m-mapped chunk for series ref 17"
        level=info ts=2021-10-18T10:05:08.972974055Z caller=head.go:658 msg="On-disk memory mappable chunks replay completed" duration=62.193655ms
        level=info ts=2021-10-18T10:05:08.973001678Z caller=head.go:660 msg="WAL not found"
        : exit status 137

docker stats shows memory increasing right after the last log line until the container is finally stopped with exit code 137. It seems to start generating the sixth block but is killed before flushing series and writing the block. Maybe this has something to do with how blocks are generated in thanosbench?

thanosbench does run on my machine, just not in containers...
@yeya24 had the same logs before, so maybe he can replicate this?

@saswatamcode (Member) commented Oct 27, 2021

The e2e PR was merged so the networking issue should be fixed for macOS in the latest commit!

I think for now the workaround for the createData issue is to run thanosbench locally with the same flags, generate the dataset, and then run the interactive test.

Is there any reason why we do it in Docker? Maybe we could skip containers for this and just grab the latest release binary of thanosbench and run that locally to generate the data? 🤔
However, in that case the release would need to be updated with binaries for all platforms.

Also, another issue I came across is that cp doesn't have the -t flag on macOS, so this line fails. Replacing it with the line below mitigates this (using xargs to run cp once per directory instead of a single invocation):

testutil.Ok(t, exec("sh", "-c", "find "+prom1Data+"/ -maxdepth 1 -type d | tail -5 | xargs -I {} cp -r {} "+promHA1.Dir()))

And it seems like I can't pull the image tag used in the interactive_test for Thanos, i.e., thanos:latest, so I had to change it to quay.io/thanos/thanos:v0.23.1.

Everything else seems to work! 💪🏼

@matej-g (Collaborator, Author) commented Oct 28, 2021

I think for now the workaround for the createData issue is to run thanosbench locally with the same flags, generate the dataset, and then run the interactive test.

Is there any reason why we do it in Docker? Maybe we could skip containers for this and just grab the latest release binary of thanosbench and run that locally to generate the data? 🤔 However, in that case the release would need to be updated with binaries for all platforms.

Hm, but will this bring us any mitigation? This seems to be related to the user's setup, i.e. their machine either does not have enough RAM overall, or they need to bump up their memory limit in case they are using Docker on Mac. We would lose the flexibility of having this in Docker and would need to deal with binaries (both for Linux and macOS, on top of that).
My suggestion would be one of:

  1. Give a warning in the docs about the memory requirements
  2. Choose a smaller profile for thanosbench run which will not consume that much memory (my personal preference)
  3. Address this on the thanosbench side (not sure how feasible)

Also, another issue I came across is that cp doesn't have the -t flag on macOS ...

Thanks for this! I adjusted the command.

And seems like I can't pull the image tag used in the interactive_test for Thanos i.e, thanos:latest, so I had to change it to quay.io/thanos/thanos:v0.23.1.

I think we should specify in the docs that you need to first run make docker in order to build the image locally, since thanos:latest does not exist in the registry.

@saswatamcode (Member) commented

  1. Choose a smaller profile for thanosbench run which will not consume that much memory (my personal preference)

Yes, I think this would be better too! Or we can leave the choice of profiles up to the users (via a const which can be substituted into the docker commands, the default being continuous-1w-small) and mention that within the docs (with the warning about memory requirements). That way anyone can adjust it easily! 🙂

Thanks for this! I adjusted the command.

Thanks!

I think we should specify in the docs that you need to first run make docker in order to build the image locally, since thanos:latest does not exist in the registry.

Oh, I was unaware! But yes, mentioning this as a step would be great! 🙂

@matej-g (Collaborator, Author) commented Oct 28, 2021

I made the changes and updated the docs to explain all of this. I chose continuous-30d-tiny as the default profile (it requires much less memory), which can be overridden by setting the BLOCK_PROFILE environment variable to any other plan profile.

cc @saswatamcode @yeya24 PTAL!

@matej-g matej-g changed the title Examples: Bump efficientgo/e2e to fix the interactive test for MacOS users Examples: Fix the interactive test for MacOS users Oct 28, 2021
@saswatamcode (Member) left a review

Thank you!! It looks awesome now! 💫

@matej-g (Collaborator, Author) commented Nov 2, 2021

@yeya24 when you get a moment, may we get a review / merge here if it looks OK to you? Thanks 😊

@yeya24 (Contributor) left a review
Thanks! Working great for me.

@yeya24 yeya24 merged commit 1c3b984 into thanos-io:main Nov 3, 2021