
Examples: Fix the interactive test for MacOS users #4779

Merged (8 commits, Nov 3, 2021)

Conversation

@matej-g (Collaborator) commented Oct 14, 2021

  • I added CHANGELOG entry for this change.
  • Change is not relevant to the end user.

Changes

This PR includes a couple of fixes targeted at macOS users, namely:

  1. Bumping to the latest version of efficientgo/e2e, which includes a fix that allows running the interactive test on macOS.
  2. Adjusting some cp -t usage in the code, which was not compatible with the macOS version of cp.
  3. Making the block generation profile configurable, with a smaller default, so that the test runs with less RAM.
  4. Updating the documentation to include more details.

Verification

Perhaps an Apple user could give the final confirmation? 🍏

Signed-off-by: Matej Gera <matejgera@gmail.com>
@saswatamcode (Member) commented Oct 14, 2021

I encountered this error (after running the interactive test previously, i.e., the interactive/data dir exists), but I'm not sure if this is related to efficientgo/e2e:

=== RUN   TestReadOnlyThanosSetup
23:57:38 Starting cadvisor
23:57:46 Ports for container interactive-cadvisor >> Local ports: map[http:8080] Ports available from host: map[http:59667]
23:57:46 Starting monitoring
23:57:49 cadvisor: W1014 18:27:49.933149       1 sysinfo.go:203] Nodes topology is not available, providing CPU topology
23:57:49 cadvisor: W1014 18:27:49.933514       1 sysfs.go:348] unable to read /sys/devices/system/cpu/cpu0/online: open /sys/devices/system/cpu/cpu0/online: no such file or directory
23:57:49 cadvisor: W1014 18:27:50.014122       1 oomparser.go:173] error reading /dev/kmsg: read /dev/kmsg: broken pipe
23:57:49 cadvisor: E1014 18:27:50.014232       1 oomparser.go:149] exiting analyzeLines. OOM events will not be reported.
23:57:52 monitoring: level=info ts=2021-10-14T18:27:52.176Z caller=main.go:388 msg="No time or size retention was set so using the default time retention" duration=15d
23:57:52 monitoring: level=info ts=2021-10-14T18:27:52.176Z caller=main.go:426 msg="Starting Prometheus" version="(version=2.27.0, branch=HEAD, revision=24c9b61221f7006e87cd62b9fe2901d43e19ed53)"
23:57:52 monitoring: level=info ts=2021-10-14T18:27:52.176Z caller=main.go:431 build_context="(go=go1.16.4, user=root@f27daa3b3fec, date=20210512-18:04:51)"
23:57:52 monitoring: level=info ts=2021-10-14T18:27:52.176Z caller=main.go:432 host_details="(Linux 5.10.25-linuxkit #1 SMP Tue Mar 23 09:27:39 UTC 2021 x86_64 monitoring (none))"
23:57:52 monitoring: level=info ts=2021-10-14T18:27:52.176Z caller=main.go:433 fd_limits="(soft=1048576, hard=1048576)"
23:57:52 monitoring: level=info ts=2021-10-14T18:27:52.177Z caller=main.go:434 vm_limits="(soft=unlimited, hard=unlimited)"
23:57:52 monitoring: level=info ts=2021-10-14T18:27:52.303Z caller=web.go:540 component=web msg="Start listening for connections" address=:9090
23:57:52 monitoring: level=info ts=2021-10-14T18:27:52.316Z caller=main.go:803 msg="Starting TSDB ..."
23:57:52 monitoring: level=info ts=2021-10-14T18:27:52.376Z caller=tls_config.go:191 component=web msg="TLS is disabled." http2=false
23:57:52 monitoring: level=info ts=2021-10-14T18:27:52.488Z caller=head.go:741 component=tsdb msg="Replaying on-disk memory mappable chunks if any"
23:57:52 monitoring: level=info ts=2021-10-14T18:27:52.489Z caller=head.go:755 component=tsdb msg="On-disk memory mappable chunks replay completed" duration=45.5µs
23:57:52 monitoring: level=info ts=2021-10-14T18:27:52.491Z caller=head.go:761 component=tsdb msg="Replaying WAL, this may take a while"
23:57:52 monitoring: level=info ts=2021-10-14T18:27:52.506Z caller=head.go:813 component=tsdb msg="WAL segment loaded" segment=0 maxSegment=0
23:57:52 monitoring: level=info ts=2021-10-14T18:27:52.506Z caller=head.go:818 component=tsdb msg="WAL replay completed" checkpoint_replay_duration=9.6788ms wal_replay_duration=5.0222ms total_replay_duration=17.7563ms
23:57:52 monitoring: level=info ts=2021-10-14T18:27:52.548Z caller=main.go:828 fs_type=65735546
23:57:52 monitoring: level=info ts=2021-10-14T18:27:52.549Z caller=main.go:831 msg="TSDB started"
23:57:52 monitoring: level=info ts=2021-10-14T18:27:52.549Z caller=main.go:957 msg="Loading configuration file" filename=/shared/data/monitoring/prometheus.yml
23:57:52 monitoring: level=info ts=2021-10-14T18:27:52.581Z caller=main.go:988 msg="Completed loading of configuration file" filename=/shared/data/monitoring/prometheus.yml totalDuration=26.8292ms remote_storage=79.4µs web_handler=18.8µs query_engine=17.4µs scrape=4.1701ms scrape_sd=360.6µs notify=14.5µs notify_sd=324.1µs rules=16.3µs
23:57:52 monitoring: level=info ts=2021-10-14T18:27:52.581Z caller=main.go:775 msg="Server is ready to receive web requests."
23:57:53 monitoring: level=info ts=2021-10-14T18:27:53.786Z caller=main.go:957 msg="Loading configuration file" filename=/shared/data/monitoring/prometheus.yml
23:57:53 monitoring: level=info ts=2021-10-14T18:27:53.818Z caller=main.go:988 msg="Completed loading of configuration file" filename=/shared/data/monitoring/prometheus.yml totalDuration=32.7781ms remote_storage=16.1µs web_handler=12.1µs query_engine=13.5µs scrape=189.8µs scrape_sd=300.8µs notify=13.8µs notify_sd=15.2µs rules=13.9µs
23:57:54 Ports for container interactive-monitoring >> Local ports: map[http:9090] Ports available from host: map[http:59705]
    interactive_test.go:107: interactive_test.go:107:
        
         unexpected error: Prometheus failed to scrape local endpoint after 2 minutes, check monitoring Prometheus logs
        
23:59:54 Killing monitoring
23:59:56 Killing cadvisor
--- FAIL: TestReadOnlyThanosSetup (154.81s)
FAIL
FAIL    command-line-arguments  156.287s
FAIL

Also, the logs were the same as above after starting afresh (without an existing interactive/data dir).

@matej-g (Collaborator, Author) commented Oct 14, 2021

I encountered this error (after running the interactive test previously, i.e., the interactive/data dir exists), but I'm not sure if this is related to efficientgo/e2e:

Oof, I think I know this one 🤕. Since we're running the Prometheus instance in Docker, but we want to scrape metrics from the host machine, there needs to be a connection from the container to the host. The framework assumes that the host machine will be reachable on the network's gateway IP, but this does not seem to work in all cases. Since the host metrics never get scraped, the operation times out after 2 minutes.

@saswatamcode (Member) commented Oct 15, 2021

Oh! I think this is due to docker networking being different on macOS.

An easy fix seems to be just replacing this line with d.hostAddr = "gateway.docker.internal" (we would need to detect the OS), which I tried locally and it works (the thanosbench issue remains, though). 🙂
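For illustration, the OS-dependent address choice could be sketched like this (a hypothetical helper, not the real e2e code; the gateway IP value and the function shape are assumptions based on this comment):

```go
package main

import (
	"fmt"
	"runtime"
)

// hostAddr returns the address at which containers can reach the host.
// Docker Desktop on macOS runs containers inside a VM, so the Linux
// convention of dialing the bridge gateway IP does not reach the Mac
// host; Docker instead exposes it via a special DNS name.
func hostAddr(goos, gatewayIP string) string {
	if goos == "darwin" {
		return "gateway.docker.internal"
	}
	// On Linux the bridge network gateway points back at the host.
	return gatewayIP
}

func main() {
	fmt.Println(hostAddr(runtime.GOOS, "172.17.0.1"))
}
```

Taking goos as a parameter instead of reading runtime.GOOS inside the function keeps both branches testable on any platform.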

Maybe I can raise a PR to e2e?

@matej-g (Collaborator, Author) commented Oct 15, 2021

Maybe I can raise a PR to e2e?

Absolutely, go for it! 🥳 I'd be happy to review it.

@saswatamcode (Member) commented Oct 16, 2021

For the thanosbench issue, I don't think it's related to the platform, as createData() uses containers for it.

It seems like createData() should generate 9 blocks for each data dir (i.e., store1, store2, prom1, prom2), but it only generates 5 before erroring out with status code 137. I tested this on both Darwin x86_64 and Linux x86_64 (WSL2) and got the same results, i.e., the same logs and 5 blocks generated before the error.

Changing the block plan profile from continuous-1w-small to continuous-30d-tiny seems to work here, i.e., it actually generates 9 blocks and doesn't error out. But I think it only has 5 series per block.

Edit: Also, status code 137 probably means it's getting OOM-killed in some way, even though docker inspect shows it wasn't:

[
    {
        "Id": "8162c0b62bc2446929407e762e1ddff1e27832c6ce97b883297597eb1848e17e",
        "Created": "2021-10-16T13:45:19.620969732Z",
        "Path": "/bin/thanosbench",
        "Args": [
            "block",
            "gen",
            "--output.dir",
            "/shared"
        ],
        "State": {
            "Status": "exited",
            "Running": false,
            "Paused": false,
            "Restarting": false,
            "OOMKilled": false,
            "Dead": false,
            "Pid": 0,
            "ExitCode": 137,
            "Error": "",
            "StartedAt": "2021-10-16T13:45:22.172899406Z",
            "FinishedAt": "2021-10-16T13:47:11.717758878Z"
        },
        "Image": "sha256:bef7a74fb0aacfaa470ecea92033c38e640e09a2bd2ccae7fcdd176530a834cc",

Maybe this is an issue with thanosbench or I'm doing something wrong? 🤕

@matej-g (Collaborator, Author) commented Oct 18, 2021

Edit: Also, status code 137 probably means it's getting OOM-killed in some way, even though docker inspect shows it wasn't,

Maybe this is an issue with thanosbench or I'm doing something wrong? 🤕

Works fine on my machine ™️. I'm wondering if you're right about the memory, especially if you're running on macOS or in another virtualized environment with a memory limit (on my Linux machine, I seem to be limited only by my host machine's available memory). What does docker stats show you when you run the test? What is the MEM USAGE / LIMIT for the thanosbench container(s)?

@saswatamcode (Member) commented Oct 18, 2021

Yes, I considered that. Docker for macOS limits containers to 2GB of memory and half the number of host CPUs by default, so I bumped memory to 8GB and maxed out the CPUs, but I still get the same result. The test creates two containers for each store (one for block plan and one for block gen). The block gen container's MEM USAGE / LIMIT is at 7.005GiB / 7.774GiB by the end:

        level=info ts=2021-10-18T10:04:51.148618306Z caller=block.go:94 msg="generating block" spec="[1626667200005 - 1626696000004](7h59m59.999s) "
        level=info ts=2021-10-18T10:04:51.185266496Z caller=head.go:644 msg="Replaying on-disk memory mappable chunks if any"
        level=error ts=2021-10-18T10:04:51.21856407Z caller=head.go:649 msg="Loading on-disk chunks failed" err="iterate on on-disk chunks: out of sequence m-mapped chunk for series ref 17"
        level=info ts=2021-10-18T10:04:51.218680786Z caller=head.go:760 msg="Deleting mmapped chunk files"
        level=info ts=2021-10-18T10:04:51.218699391Z caller=head.go:763 msg="Deletion of mmap chunk files failed, discarding chunk files completely" err="cannot handle error: iterate on on-disk chunks: out of sequence m-mapped chunk for series ref 17"
        level=info ts=2021-10-18T10:04:51.218709409Z caller=head.go:658 msg="On-disk memory mappable chunks replay completed" duration=33.361098ms
        level=info ts=2021-10-18T10:04:51.218716063Z caller=head.go:660 msg="WAL not found"
        level=info ts=2021-10-18T10:05:04.207970966Z caller=writer.go:123 msg=flushing series_count=10000 mint=2021-07-19T04:00:15.005Z maxt=2021-07-19T12:00:00.005Z
        level=info ts=2021-10-18T10:05:07.100724541Z caller=compact.go:494 msg="write block" mint=1626667215005 maxt=1626696000006 ulid=01FJ9DS7AGE6YSRBHXEVHF1RC0 duration=2.892528562s
        level=info ts=2021-10-18T10:05:07.226623846Z caller=block.go:100 msg="generated block" path=/shared/01FJ9DS7AGE6YSRBHXEVHF1RC0 count=5
        level=info ts=2021-10-18T10:05:08.7544727Z caller=block.go:94 msg="generating block" spec="[1626494400006 - 1626667200005](47h59m59.999s) "
        level=info ts=2021-10-18T10:05:08.910671352Z caller=head.go:644 msg="Replaying on-disk memory mappable chunks if any"
        level=error ts=2021-10-18T10:05:08.971713339Z caller=head.go:649 msg="Loading on-disk chunks failed" err="iterate on on-disk chunks: out of sequence m-mapped chunk for series ref 17"
        level=info ts=2021-10-18T10:05:08.972019642Z caller=head.go:760 msg="Deleting mmapped chunk files"
        level=info ts=2021-10-18T10:05:08.972106229Z caller=head.go:763 msg="Deletion of mmap chunk files failed, discarding chunk files completely" err="cannot handle error: iterate on on-disk chunks: out of sequence m-mapped chunk for series ref 17"
        level=info ts=2021-10-18T10:05:08.972974055Z caller=head.go:658 msg="On-disk memory mappable chunks replay completed" duration=62.193655ms
        level=info ts=2021-10-18T10:05:08.973001678Z caller=head.go:660 msg="WAL not found"
        : exit status 137

docker stats shows memory increasing right after the last log line until the container is finally stopped with exit code 137. It seems to start generating the sixth block but is killed before flushing series and writing the block. Maybe this has something to do with how blocks are generated in thanosbench?

thanosbench does run on my machine, just not in containers...
@yeya24 had the same logs before, so maybe he can replicate this?

@saswatamcode (Member) commented Oct 27, 2021

The e2e PR was merged so the networking issue should be fixed for macOS in the latest commit!

I think for now the workaround for the createData issue is to run thanosbench locally with the same flags, generate the dataset, and then run the interactive test.

Is there any reason why we do it in Docker? Maybe we could skip containers for this and just grab the latest release binary of thanosbench and run that locally to generate the data? 🤔
However, in that case the release would need to be updated with binaries for all platforms.

Also, another issue I came across is that cp doesn't have the -t flag on macOS, so this line fails. Replacing it with the line below mitigates this (using xargs to run cp once per directory instead of a single invocation):

testutil.Ok(t, exec("sh", "-c", "find "+prom1Data+"/ -maxdepth 1 -type d | tail -5 | xargs -I {} cp -r {} "+promHA1.Dir()))

And it seems like I can't pull the image tag used in the interactive_test for Thanos, i.e., thanos:latest, so I had to change it to quay.io/thanos/thanos:v0.23.1.

Everything else seems to work! 💪🏼

@matej-g (Collaborator, Author) commented Oct 28, 2021

I think for now the workaround for the createData issue is to run thanosbench locally with the same flags, generate the dataset, and then run the interactive test.

Is there any reason why we do it in Docker? Maybe we could skip containers for this and just grab the latest release binary of thanosbench and run that locally to generate the data? 🤔 However, in that case the release would need to be updated with binaries for all platforms.

Hm, but will this bring us any mitigation? This seems to be related to the user's setup, i.e. their machine either does not have enough RAM overall, or they need to bump up their memory limit in case they are using Docker on Mac. We would lose the flexibility of having this in Docker and would need to deal with binaries (both for Linux and macOS, on top of that).
My suggestion would be one of:

  1. Give a warning in the docs about the memory requirements
  2. Choose a smaller profile for thanosbench run which will not consume that much memory (my personal preference)
  3. Address this on the thanosbench side (not sure how feasible)

Also, another issue I came across is that cp doesn't have the -t flag on macOS ...

Thanks for this! I adjusted the command.

And seems like I can't pull the image tag used in the interactive_test for Thanos i.e, thanos:latest, so I had to change it to quay.io/thanos/thanos:v0.23.1.

I think we should specify in the docs that you need to first run make docker in order to build the image locally, since thanos:latest does not exist in the registry.

@saswatamcode (Member) commented

  1. Choose a smaller profile for thanosbench run which will not consume that much memory (my personal preference)

Yes, I think this would be better too! Or we can leave the choice of profiles up to the users (via a const which can be substituted into the docker commands, the default being continuous-1w-small) and mention that within the docs (with the warning about memory requirements). That way anyone can adjust it easily! 🙂

Thanks for this! I adjusted the command.

Thanks!

I think we should specify in the docs that you need to first run make docker in order to build the image locally, since thanos:latest does not exist in the registry.

Oh, I was unaware! But yes, mentioning this as a step would be great! 🙂

@matej-g (Collaborator, Author) commented Oct 28, 2021

I made the changes and updated the docs to explain all of this. I chose continuous-30d-tiny as the default profile (it requires much less memory), which can be overridden by setting the BLOCK_PROFILE environment variable to any other plan profile.

cc @saswatamcode @yeya24 PTAL!

@matej-g matej-g changed the title Examples: Bump efficientgo/e2e to fix the interactive test for MacOS users Examples: Fix the interactive test for MacOS users Oct 28, 2021
@saswatamcode (Member) left a review

Thank you!! It looks awesome now! 💫

@matej-g (Collaborator, Author) commented Nov 2, 2021

@yeya24 when you get a moment, may we get a review / merge here if it looks OK to you? Thanks 😊

@yeya24 (Contributor) left a review
Thanks! Working great for me.

@yeya24 yeya24 merged commit 1c3b984 into thanos-io:main Nov 3, 2021