test: flaky box/net.box_wait_connected_gh-3856 test on FreeBSD #5083

Closed
avtikhon opened this issue Jun 16, 2020 · 0 comments

@avtikhon
Contributor

Tarantool version:
Tarantool 2.5.0-142-ged935572b
Target: FreeBSD-amd64-RelWithDebInfo
Build options: cmake . -DCMAKE_INSTALL_PREFIX=/usr/local -DENABLE_BACKTRACE=OFF
Compiler: /usr/bin/cc /usr/bin/c++
C_FLAGS: -Wno-unknown-pragmas -fexceptions -funwind-tables -fno-common -std=c11 -Wall -Wextra -Wno-strict-aliasing -Wno-char-subscripts -Wno-gnu-alignof-expression -Werror
CXX_FLAGS: -Wno-unknown-pragmas -fexceptions -funwind-tables -fno-common -std=c++11 -Wall -Wextra -Wno-strict-aliasing -Wno-char-subscripts -Wno-invalid-offsetof -Wno-gnu-alignof-expression -Werror

OS version:
FreeBSD 12

Bug description:
https://gitlab.com/tarantool/tarantool/-/jobs/596437958
https://gitlab.com/tarantool/tarantool/-/jobs/596349477
https://gitlab.com/tarantool/tarantool/-/jobs/594179381

 [011] --- box/net.box_wait_connected_gh-3856.result	Mon Jun 15 09:39:49 2020
 [011] +++ box/net.box_wait_connected_gh-3856.reject	Fri May  8 08:23:30 2020
 [011] @@ -12,7 +12,8 @@
 [011]  - opts:
 [011]      wait_connected: false
 [011]    host: 8.8.8.8
 [011] -  state: initial
 [011] +  state: error
 [011] +  error: Invalid argument
 [011]    port: '123456'
 [011]  ...
 [011]  c:close()

@avtikhon avtikhon added qa Issues related to tests or testing subsystem flaky test labels Jun 16, 2020
@avtikhon avtikhon self-assigned this Jun 16, 2020
avtikhon added a commit that referenced this issue Jun 16, 2020
Found issue running test on FreeBSD VBox host:

 [011] --- box/net.box_wait_connected_gh-3856.result	Mon Jun 15 09:39:49 2020
 [011] +++ box/net.box_wait_connected_gh-3856.reject	Fri May  8 08:23:30 2020
 [011] @@ -12,7 +12,8 @@
 [011]  - opts:
 [011]      wait_connected: false
 [011]    host: 8.8.8.8
 [011] -  state: initial
 [011] +  state: error
 [011] +  error: Invalid argument
 [011]    port: '123456'
 [011]  ...
 [011]  c:close()

To avoid this issue and to check whether "wait_connected = false" is
really ignored, the test should wait until the connection state becomes
'initial' and only then check the result.
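
The waiting approach described in this commit message amounts to polling a
condition with a timeout. Below is a minimal sketch in Python, not the actual
test code: the helper name `wait_for` and its parameters are hypothetical and
only illustrate the idea of checking a state repeatedly before comparing the
result.

```python
import time
from typing import Callable


def wait_for(predicate: Callable[[], bool],
             timeout: float = 5.0,
             interval: float = 0.01) -> bool:
    """Poll `predicate` until it returns True or `timeout` seconds pass.

    Hypothetical helper: returns True if the condition was met in time,
    False otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A test would call something like `wait_for(lambda: conn.state == 'initial')`
and only then compare the connection fields, where `conn` stands for whatever
connection object the test holds.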

Closes #5083
@avtikhon avtikhon added this to ON REVIEW in Quality Assurance Jun 16, 2020
avtikhon added a commit that referenced this issue Jun 16, 2020
Found issue running test on FreeBSD VBox host:

 [011] --- box/net.box_wait_connected_gh-3856.result	Mon Jun 15 09:39:49 2020
 [011] +++ box/net.box_wait_connected_gh-3856.reject	Fri May  8 08:23:30 2020
 [011] @@ -12,7 +12,8 @@
 [011]  - opts:
 [011]      wait_connected: false
 [011]    host: 8.8.8.8
 [011] -  state: initial
 [011] +  state: error
 [011] +  error: Invalid argument
 [011]    port: '123456'
 [011]  ...
 [011]  c:close()

The test uses an external Google DNS IP; for information on it, see:
  https://developers.google.com/speed/public-dns/docs/using
The issue appears because the address is external and the connection
may fail from time to time. In this case the test should wait until
the connection state becomes 'initial' and only then continue.

Closes #5083
@avtikhon avtikhon moved this from ON REVIEW to DOING in Quality Assurance Jun 19, 2020
avtikhon pushed a commit that referenced this issue Jun 20, 2020
Found issue running test on FreeBSD VBox host:

 [011] --- box/net.box_wait_connected_gh-3856.result	Mon Jun 15 09:39:49 2020
 [011] +++ box/net.box_wait_connected_gh-3856.reject	Fri May  8 08:23:30 2020
 [011] @@ -12,7 +12,8 @@
 [011]  - opts:
 [011]      wait_connected: false
 [011]    host: 8.8.8.8
 [011] -  state: initial
 [011] +  state: error
 [011] +  error: Invalid argument
 [011]    port: '123456'
 [011]  ...
 [011]  c:close()

The failure occurred because getaddrinfo() returned EAI_SERVICE for an
out-of-range TCP/IP port on FreeBSD, whereas Linux/glibc reduces the
port modulo 65536. Checked with a local script './getaddrinfo':

  (Linux/glibc) $ ./getaddrinfo 8.8.8.8 123456
  ----
  family: AF_INET
  socktype: SOCK_STREAM
  protocol: IPPROTO_TCP
  host: 8.8.8.8
  serv: 57920

  (FreeBSD) $ ./getaddrinfo 8.8.8.8 123456
  getaddrinfo: Service was not recognized for socket type

So the obvious fix is to change 123456 to something less than or equal
to 65535. Say, 1234.
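
The arithmetic behind the two outputs above can be reproduced with a short
Python sketch. This is not the './getaddrinfo' script from the commit message;
the function names `wrapped_port`, `is_valid_port` and `probe` are hypothetical.
`probe` only reports what the local platform's resolver does, so its output
depends on the operating system and libc.

```python
import socket

PORT_MAX = 65535  # largest valid TCP/UDP port number


def wrapped_port(port: int) -> int:
    """Port that results if a resolver reduces the number modulo 65536."""
    return port % (PORT_MAX + 1)


def is_valid_port(port: int) -> bool:
    """True if the number fits into the 16-bit port range."""
    return 0 <= port <= PORT_MAX


def probe(host: str, service: str) -> str:
    """Ask the platform resolver and report the port or the error text.

    AI_NUMERICHOST keeps the call local: no DNS lookup is performed.
    """
    try:
        infos = socket.getaddrinfo(host, service, socket.AF_INET,
                                   socket.SOCK_STREAM, socket.IPPROTO_TCP,
                                   socket.AI_NUMERICHOST)
    except socket.gaierror as exc:
        return f"getaddrinfo: {exc}"
    # For AF_INET the sockaddr is an (address, port) pair.
    return f"serv: {infos[0][4][1]}"


if __name__ == "__main__":
    print(wrapped_port(123456))   # 57920, matching the Linux/glibc output
    print(is_valid_port(123456), is_valid_port(1234))
    print(probe("8.8.8.8", "123456"))  # platform-dependent result
```

Since 123456 = 65536 + 57920, the wrapped value agrees with the `serv: 57920`
line reported on Linux/glibc, while a port such as 1234 passes the range check
and should behave the same on both platforms.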

The test depended on the order in which fibers were scheduled
(net_box.connect() creates a separate fiber for connecting in the
background using fiber.create(), which yields). It is unlikely that our
fiber would not get execution time during the connection attempt, so
this was more of a formality.

But we can reduce the probability of this situation even further if we
grab all connection fields right when net_box.connect() returns, not
after a yield in the console (which happens while waiting for the next
command from test-run).
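
The difference between reading the fields right away and reading them after a
yield can be illustrated with an analogy in Python's asyncio. This is only a
sketch under stated assumptions: asyncio tasks are not Tarantool fibers (for
example, a new asyncio task does not start running until the creator yields),
and the names `Conn`, `connect_in_background` and `main` are hypothetical.

```python
import asyncio


class Conn:
    """Stand-in for a connection object with a mutable state field."""

    def __init__(self) -> None:
        self.state = "initial"


async def connect_in_background(conn: Conn) -> None:
    # Pretend the connection attempt fails as soon as the worker runs.
    conn.state = "error"


async def main() -> tuple:
    conn = Conn()
    task = asyncio.create_task(connect_in_background(conn))

    # Captured before this coroutine yields: the worker has not run yet.
    state_at_return = conn.state

    # A yield, comparable to the console waiting for the next command.
    await asyncio.sleep(0)

    # By now the worker has had a chance to change the state.
    state_after_yield = conn.state

    await task
    return state_at_return, state_after_yield


if __name__ == "__main__":
    print(asyncio.run(main()))
```

The value captured before the yield does not depend on how fast the background
worker progresses, which is the property the commit message aims for.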

Closes #5083

Reviewed-by: Alexander V. Tikhonov <avtikhon@tarantool.org>
avtikhon added a commit that referenced this issue Jun 22, 2020
Found issue running test on FreeBSD VBox host:

 [011] --- box/net.box_wait_connected_gh-3856.result	Mon Jun 15 09:39:49 2020
 [011] +++ box/net.box_wait_connected_gh-3856.reject	Fri May  8 08:23:30 2020
 [011] @@ -12,7 +12,8 @@
 [011]  - opts:
 [011]      wait_connected: false
 [011]    host: 8.8.8.8
 [011] -  state: initial
 [011] +  state: error
 [011] +  error: Invalid argument
 [011]    port: '123456'
 [011]  ...
 [011]  c:close()

A. Turenko investigated in depth and found that the failure occurred
because getaddrinfo() returned EAI_SERVICE for an out-of-range TCP/IP
port on FreeBSD, whereas Linux/glibc reduces the port modulo 65536.
Checked with his local script './getaddrinfo':

  (Linux/glibc) $ ./getaddrinfo 8.8.8.8 123456
  ----
  family: AF_INET
  socktype: SOCK_STREAM
  protocol: IPPROTO_TCP
  host: 8.8.8.8
  serv: 57920

  (FreeBSD) $ ./getaddrinfo 8.8.8.8 123456
  getaddrinfo: Service was not recognized for socket type

So the obvious fix is to change 123456 to something less than or equal
to 65535. Say, 1234.

The test depended on the order in which fibers were scheduled
(net_box.connect() creates a separate fiber for connecting in the
background using fiber.create(), which yields). It is unlikely that our
fiber would not get execution time during the connection attempt, so
this was more of a formality.

But we can reduce the probability of this situation even further if we
grab all connection fields right when net_box.connect() returns, not
after a yield in the console (which happens while waiting for the next
command from test-run).

Closes #5083

Co-authored-by: Alexander Turenko <alexander.turenko@tarantool.org>
avtikhon added a commit that referenced this issue Jun 22, 2020
Closes #5083

Co-authored-by: Alexander Turenko <alexander.turenko@tarantool.org>
Co-authored-by: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>
@avtikhon avtikhon moved this from DOING to ON REVIEW in Quality Assurance Jun 22, 2020
avtikhon added a commit that referenced this issue Jun 23, 2020
Closes #5083

Co-authored-by: Alexander Turenko <alexander.turenko@tarantool.org>
Co-authored-by: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>
kyukhin pushed a commit that referenced this issue Jun 26, 2020
Closes #5083

Co-authored-by: Alexander Turenko <alexander.turenko@tarantool.org>
Co-authored-by: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>
(cherry picked from commit d51be6f)
@avtikhon avtikhon moved this from ON REVIEW to DONE in Quality Assurance Jun 26, 2020
@avtikhon avtikhon removed this from DONE in Quality Assurance Jun 26, 2020