Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sinks::prometheus::remote_write::integration_tests::insert_metrics_over_http CI failure #5612

Closed
jszwedko opened this issue Dec 18, 2020 · 6 comments · Fixed by #5928
Closed
Assignees
Labels
domain: processing Anything related to processing Vector's events (parsing, merging, reducing, etc.) sink: prometheus_exporter Anything `prometheus_exporter` sink related type: bug A code related bug.

Comments

@jszwedko
Copy link
Member

jszwedko commented Dec 18, 2020

This is the first time I've seen this failure, but putting it here to track. It passes locally for me.

https://github.com/timberio/vector/runs/1578308476?check_suite_focus=true#step:9:566

test sinks::prometheus::remote_write::integration_tests::insert_metrics_over_http ... Dec 18 18:18:03.124 DEBUG http: vector::internal_events::http_client: Sending HTTP request. uri=http://localhost:8086/api/v1/prom/write?db=testdb0 method=POST version=HTTP/1.1 headers={"x-prometheus-remote-write-version": "0.1.0", "content-encoding": "snappy", "content-type": "application/x-protobuf", "user-agent": "Vector/0.12.0 (x86_64-unknown-linux-gnu)", "accept-encoding": "identity"} body=[188 bytes]
Dec 18 18:18:03.166 DEBUG http: vector::internal_events::http_client: HTTP response. status=404 Not Found version=HTTP/1.1 headers={"content-type": "application/json", "request-id": "5e1c9907-415d-11eb-8003-0242ac120002", "x-influxdb-build": "OSS", "x-influxdb-error": "database not found: \"testdb0\"", "x-influxdb-version": "1.8.3", "x-request-id": "5e1c9907-415d-11eb-8003-0242ac120002", "date": "Fri, 18 Dec 2020 18:18:03 GMT", "content-length": "44"} body=[44 bytes]
Dec 18 18:18:03.167 ERROR request{request_id=0}: vector::sinks::util::retries: Not retriable; dropping the request. reason="response status: 404 Not Found"
Dec 18 18:18:03.168 ERROR request{request_id=0}: vector::sinks::util::sink: Response wasn't successful. response=Response { status: 404, version: HTTP/1.1, headers: {"content-type": "application/json", "request-id": "5e1c9907-415d-11eb-8003-0242ac120002", "x-influxdb-build": "OSS", "x-influxdb-error": "database not found: \"testdb0\"", "x-influxdb-version": "1.8.3", "x-request-id": "5e1c9907-415d-11eb-8003-0242ac120002", "date": "Fri, 18 Dec 2020 18:18:03 GMT", "content-length": "44"}, body: b"{\"error\":\"database not found: \\\"testdb0\\\"\"}\n" }
thread 'main' panicked at 'called `Option::unwrap()` on a `None` value', src/sinks/prometheus/remote_write.rs:241:38
stack backtrace:
   0:     0x557fa290b6a0 - std::backtrace_rs::backtrace::libunwind::trace::h72c2fb8038f1bbee
                               at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/std/src/../../backtrace/src/backtrace/libunwind.rs:96
   1:     0x557fa290b6a0 - std::backtrace_rs::backtrace::trace_unsynchronized::h1e3b084883f1e78c
                               at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/std/src/../../backtrace/src/backtrace/mod.rs:66
   2:     0x557fa290b6a0 - std::sys_common::backtrace::_print_fmt::h3bf6a7ebf7f0394a
                               at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/std/src/sys_common/backtrace.rs:79
   3:     0x557fa290b6a0 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h2e8cb764b7fe02e7
                               at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/std/src/sys_common/backtrace.rs:58
   4:     0x557fa2978e9c - core::fmt::write::h7a1184eaee6a8644
                               at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/core/src/fmt/mod.rs:1080
   5:     0x557fa28fd1d2 - std::io::Write::write_fmt::haeeb374d93a67eac
                               at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/std/src/io/mod.rs:1516
   6:     0x557fa291044d - std::sys_common::backtrace::_print::h1d14a7f6ad632dc8
                               at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/std/src/sys_common/backtrace.rs:61
   7:     0x557fa291044d - std::sys_common::backtrace::print::h301abac8bb2e3e81
                               at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/std/src/sys_common/backtrace.rs:48
   8:     0x557fa291044d - std::panicking::default_hook::{{closure}}::hde0cb80358a6920a
                               at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/std/src/panicking.rs:208
   9:     0x557fa29100f8 - std::panicking::default_hook::h9b1a691049a0ec8f
                               at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/std/src/panicking.rs:227
  10:     0x557fa2910c71 - std::panicking::rust_panic_with_hook::h2bdec87b60580584
                               at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/std/src/panicking.rs:577
  11:     0x557fa2910819 - std::panicking::begin_panic_handler::{{closure}}::h101ca09d9df5db47
                               at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/std/src/panicking.rs:484
  12:     0x557fa290bb0c - std::sys_common::backtrace::__rust_end_short_backtrace::h3bb85654c20113ca
                               at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/std/src/sys_common/backtrace.rs:153
  13:     0x557fa29107d9 - rust_begin_unwind
                               at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/std/src/panicking.rs:483
  14:     0x557fa29759a1 - core::panicking::panic_fmt::h48c31e1e3d550146
                               at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/core/src/panicking.rs:85
  15:     0x557fa29758ed - core::panicking::panic::h184ede6dd822ffb4
                               at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/core/src/panicking.rs:50
  16:     0x557fa0f5487c - core::option::Option<T>::unwrap::h12f5af527d79c71f
                               at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/alloc/src/boxed.rs:1042
  42:     0x557fa168b068 - <std::panic::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once::hc48829c2ba8b55d6
                               at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/std/src/panic.rs:308
  43:     0x557fa168b068 - std::panicking::try::do_call::h95feedb4d9e8dd5b
                               at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/std/src/panicking.rs:381
  44:     0x557fa168b068 - std::panicking::try::hc4c23f5f7d3bdd43
                               at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/std/src/panicking.rs:345
  45:     0x557fa168b068 - std::panic::catch_unwind::hf3a57d903cc5007a
                               at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/std/src/panic.rs:382
  46:     0x557fa168b068 - test::run_test_in_process::h09c598ed2a1f6695
                               at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/test/src/lib.rs:543
  47:     0x557fa168b068 - test::run_test::run_test_inner::{{closure}}::h0723a32c882738ea
                               at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/test/src/lib.rs:449
  48:     0x557fa168a75d - test::run_test::run_test_inner::hc4b9c44ce6ecf9ca
                               at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/test/src/lib.rs:474
  49:     0x557fa16888de - test::run_test::h0db11a6e3cda4ec7
                               at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/test/src/lib.rs:504
  50:     0x557fa16759f0 - test::run_tests::hc96070f51a12d729
                               at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/test/src/lib.rs:283
  51:     0x557fa16759f0 - test::console::run_tests_console::h11e87d1a2904198b
                               at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/test/src/console.rs:280
  52:     0x557fa1684487 - test::test_main::h377dafce3c4130c7
                               at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/test/src/lib.rs:121
  53:     0x557fa1685829 - test::test_main_static::hf9870a0e0ff80b56
                               at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/test/src/lib.rs:140
  54:     0x557fa06f6096 - vector::main::h0692b03911947697
  55:     0x557fa02ed77b - core::ops::function::FnOnce::call_once::ha4f0baac09e9c368
                               at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/core/src/ops/function.rs:227
  56:     0x557fa02d7d2e - std::sys_common::backtrace::__rust_begin_short_backtrace::h3f0dd362ff2ec3e1
                               at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/std/src/sys_common/backtrace.rs:137
  57:     0x557f9fd98c81 - std::rt::lang_start::{{closure}}::h5ed6416a2be9cc5d
                               at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/std/src/rt.rs:66
  58:     0x557fa2911187 - core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once::he179d32a5d10d957
                               at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/core/src/ops/function.rs:259
  59:     0x557fa2911187 - std::panicking::try::do_call::hcb3d5e7be089b2b4
                               at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/std/src/panicking.rs:381
  60:     0x557fa2911187 - std::panicking::try::h7ac93b0cd56fb701
                               at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/std/src/panicking.rs:345
  61:     0x557fa2911187 - std::panic::catch_unwind::h7b40e396c93a4fcd
                               at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/std/src/panic.rs:382
  62:     0x557fa2911187 - std::rt::lang_start_internal::h142b9cc66267fea1
                               at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/std/src/rt.rs:51
  63:     0x557f9fd98c57 - std::rt::lang_start::hcaf5cc05b5811d60
                               at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/std/src/rt.rs:65
  64:     0x557fa06f60ca - main
  65:     0x7f1a321f60b3 - __libc_start_main
  66:     0x557f9fd7f6be - _start
  67:                0x0 - <unknown>
FAILED
@jszwedko jszwedko added the type: bug A code related bug. label Dec 18, 2020
@binarylogic binarylogic added domain: processing Anything related to processing Vector's events (parsing, merging, reducing, etc.) sink: prometheus labels Dec 18, 2020
@binarylogic binarylogic added this to the 2020-12-21 Kryptek Yeti milestone Dec 18, 2020
@bruceg
Copy link
Member

bruceg commented Dec 18, 2020

This looks like a race condition between two tests, where both create the same influx db, and then one deletes it before the other can finish using it for its tests. However, they are supposed to all create their own database, serialized by an atomic, so this should be impossible.

@bruceg
Copy link
Member

bruceg commented Dec 18, 2020

It is a race condition, but not between multiple vector threads as I first thought, but rather creating the test database, populating it, and querying the results. I have reproduced it a few times on my system, but only by running it repeatedly in a loop. I'm trying to figure out what exactly is the ordering problem.

@binarylogic binarylogic added sink: prometheus_exporter Anything `prometheus_exporter` sink related and removed sink: prometheus labels Dec 24, 2020
@bruceg
Copy link
Member

bruceg commented Jan 7, 2021

After running these tests for a long time again, using unique database names instead of fixed ones, I have hit a failure. It appears the query to create the database is returning before the database is actually created, and then the first (and only) insert of metrics to the database fails with 404 Not Found. A simple delay after creating the database should resolve this, but I would like to see why the initialization function returns before initialization is actually done.

Rust debug logs (a couple of newlines added for readability):

test sinks::prometheus::remote_write::integration_tests::insert_metrics_over_https ... 
Jan 06 18:56:26.241 DEBUG http: vector::internal_events::http_client: Sending HTTP request. uri=https://localhost:8087/api/v1/prom/write?db=testdb1609980986183480525 method=POST version=HTTP/1.1 headers={"x-prometheus-remote-write-version": "0.1.0", "content-encoding": "snappy", "content-type": "application/x-protobuf", "user-agent": "Vector/0.12.0 (g43325a3 x86_64-unknown-linux-gnu 2021-01-06)", "accept-encoding": "identity"} body=[184 bytes]
Jan 06 18:56:26.354 DEBUG http: vector::internal_events::http_client: HTTP response. status=404 Not Found version=HTTP/1.1 headers={"content-type": "application/json", "request-id": "2b513593-5083-11eb-92eb-6e1e44399893", "x-influxdb-build": "OSS", "x-influxdb-error": "database not found: \"testdb1609980986183480525\"", "x-influxdb-version": "1.8.3", "x-request-id": "2b513593-5083-11eb-92eb-6e1e44399893", "date": "Thu, 07 Jan 2021 00:56:26 GMT", "content-length": "62"} body=[62 bytes]
Jan 06 18:56:26.354 ERROR request{request_id=0}: vector::sinks::util::retries: Not retriable; dropping the request. reason="response status: 404 Not Found"
Jan 06 18:56:26.354 ERROR request{request_id=0}: vector::sinks::util::sink: Response wasn't successful. response=Response { status: 404, version: HTTP/1.1, headers: {"content-type": "application/json", "request-id": "2b513593-5083-11eb-92eb-6e1e44399893", "x-influxdb-build": "OSS", "x-influxdb-error": "database not found: \"testdb1609980986183480525\"", "x-influxdb-version": "1.8.3", "x-request-id": "2b513593-5083-11eb-92eb-6e1e44399893", "date": "Thu, 07 Jan 2021 00:56:26 GMT", "content-length": "62"}, body: b"{\"error\":\"database not found: \\\"testdb1609980986183480525\\\"\"}\n" }
thread 'main' panicked at 'called `Option::unwrap()` on a `None` value', src/sinks/prometheus/remote_write.rs:243:38

InfluxDB logs:

ts=2021-01-07T00:56:26.205136Z lvl=info msg="Executing query" log_id=0RYa4ES0000 service=query query="DROP DATABASE testdb1609980986183480525"
ts=2021-01-07T00:56:26.231968Z lvl=info msg="Executing query" log_id=0RYa4ES0000 service=query query="CREATE DATABASE testdb1609980986183480525"
[httpd] 127.0.0.1 - - [07/Jan/2021:00:56:26 +0000] "POST /api/v1/prom/write?db=testdb1609980986183480525 HTTP/1.1" 404 62 "-" "Vector/0.12.0 (g43325a3 x86_64-unknown-linux-gnu 2021-01-06)" 2b513593-5083-11eb-92eb-6e1e44399893 91773
[httpd] 127.0.0.1 - - [07/Jan/2021:00:56:26 +0000] "GET /query?q=drop+database+testdb1609980985736051674 HTTP/1.1" 200 61 "-" "-" 2b45ee4b-5083-11eb-92e8-6e1e44399893 165701
[httpd] 127.0.0.1 - - [07/Jan/2021:00:56:26 +0000] "GET /query?q=drop+database+testdb1609980986183480525 HTTP/1.1" 200 193 "-" "-" 2b48a31c-5083-11eb-92e9-6e1e44399893 147995
[httpd] 127.0.0.1 - - [07/Jan/2021:00:56:26 +0000] "GET /query?q=create+database+testdb1609980986183480525 HTTP/1.1" 200 195 "-" "-" 2b4cbb77-5083-11eb-92ea-6e1e44399893 134383
ts=2021-01-07T00:56:26.379093Z lvl=info msg="Executing query" log_id=0RYa4ES0000 service=query query="SELECT \"key\" FROM testdb1609980986183480525.autogen._series"
[httpd] 127.0.0.1 - - [07/Jan/2021:00:56:26 +0000] "GET /query?q=show+series+on+testdb1609980986183480525 HTTP/1.1" 200 33 "-" "-" 2b632c8e-5083-11eb-92ec-6e1e44399893 283

@jszwedko
Copy link
Member Author

jszwedko commented Jan 7, 2021

Interesting, that is very odd and might warrant an issue upstream on influxdb to see if that is expected behavior.

@jszwedko
Copy link
Member Author

jszwedko commented Jan 7, 2021

I don't know if we have a pattern for this in the test suite already, but one way to deal with those eventually consistent issues in integration tests is to retry the requests with a back-off and a cap rather than just a fixed delay.

@bruceg
Copy link
Member

bruceg commented Jan 7, 2021

  1. I'm doubtful this will get much attention from upstream given that I am using the old influxdb v1 server for this test (v2 dropped direct support for the prometheus remote_write protocol).
  2. Where would you suggest I add a retry? The database creation routines are reporting success, and the only error happens in the code paths we want to test.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain: processing Anything related to processing Vector's events (parsing, merging, reducing, etc.) sink: prometheus_exporter Anything `prometheus_exporter` sink related type: bug A code related bug.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants