Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AddressSanitizer check failed in untyped_result_set_row::get_blob during dtest auth_test.TestAuth.kill_one_of_the_nodes_with_the_auth_info_test #4226

Closed
bhalevy opened this issue Feb 14, 2019 · 18 comments

Comments

@bhalevy
Copy link
Contributor

commented Feb 14, 2019

(Similar to, yet different than #4168)

Scylla version: e70286a
Test log: #50 auth_test.TestAuth.kill_one_of_the_nodes_with_the_auth_info_test/node1.log

    #12 0x4c4f6d5 in cql3::untyped_result_set_row::get_blob(seastar::basic_sstring<char, unsigned int, 15u, true> const&) const cql3/untyped_result_set.hh:67
    #13 0x6e93718 in bool cql3::untyped_result_set_row::get_as<bool>(seastar::basic_sstring<char, unsigned int, 15u, true> const&) const cql3/untyped_result_set.hh:71
    #14 0xbc17f2b in operator() auth/standard_role_manager.cc:104

auth/standard_role_manager.cc:

 84 static future<std::optional<record>> find_record(cql3::query_processor& qp, std::string_view role_name) {
 85     static const sstring query = format("SELECT * FROM {} WHERE {} = ?",
 86             meta::roles_table::qualified_name(),
 87             meta::roles_table::role_col_name);
 88 
 89     return qp.process(
 90             query,
 91             consistency_for_role(role_name),
 92             internal_distributed_timeout_config(),
 93             {sstring(role_name)},
 94             true).then([](::shared_ptr<cql3::untyped_result_set> results) {
 95         if (results->empty()) {
 96             return std::optional<record>();
 97         }
 98 
 99         const cql3::untyped_result_set_row& row = results->one();
100 
101         return std::make_optional(
102                 record{
103                         row.get_as<sstring>(sstring(meta::roles_table::role_col_name)),
104                         row.get_as<bool>("is_superuser"),

                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

105                         row.get_as<bool>("can_login"),
106                         (row.has("member_of")
107                                  ? row.get_set<sstring>("member_of")
108                                  : role_set())});
109     });
110 }

@bhalevy bhalevy added the dtest label Feb 14, 2019

@slivne slivne added this to the 3.1 milestone Feb 14, 2019

@hakuch

This comment has been minimized.

Copy link
Contributor

commented Feb 14, 2019

This is a very suspicious error. I wonder this these kinds of failures (like with #4168) are due to change in the code elsewhere, since I never encountered these before to the best of my memory.

@hakuch

This comment has been minimized.

Copy link
Contributor

commented Feb 14, 2019

I'll investigate.

@hakuch

This comment has been minimized.

Copy link
Contributor

commented Feb 19, 2019

In "dev" mode, the dtest failed but there was no helpful information in the logs for the node.

I was able to reproduce the leak report from LeakSanitizer in debug mode.

By inserting

assert(row.has("can_login"));

I was able to trigger an abort. I'm assuming that get_as will access an invalid memory location in the absence of the requested column.

The next step is to figure out why the column is missing.

@hakuch

This comment has been minimized.

Copy link
Contributor

commented Feb 26, 2019

@slivne, I've gone back to 6b3a97e and I can't get this dtest to pass. Here, you made a commit which the message says fixes the test. Was the test passing for you at this time?

I'm trying to find a commit where the test passes so I can bisect. Perhaps it never passed?

@slivne

This comment has been minimized.

Copy link
Contributor

commented Mar 10, 2019

@slivne

This comment has been minimized.

Copy link
Contributor

commented Mar 10, 2019

@hakuch

This comment has been minimized.

Copy link
Contributor

commented Mar 11, 2019

A theory that might explain the test failure is that the replication factor for the system_auth keyspace is not set to be equal to the size of the cluster (4, in this case) at the start of the test.

Therefore, when a node is killed, it's possible that it takes down the replica with the needed rows.

I'll try to change the test to correctly set the replication factor and see if it resolves the issue.

@bhalevy

This comment has been minimized.

Copy link
Contributor Author

commented Mar 14, 2019

This run hit the same problem on a different test case: dtest-debug/76/artifact/logs-debug.2/1552439066134_auth_test.TestAuth.transitional_auth_from_pwdauth_test/node1.log:

==7888==AddressSanitizer CHECK failed: ../../../../libsanitizer/asan/asan_descriptions.cc:80 "((0 && "Address is not in memory and not in shadow?")) != (0)" (0x0, 0x0)
...
    #0 0x7f987e2c82b6  (/lib64/libasan.so.5+0xfb2b6)
    #1 0x7f987e2e567d in __sanitizer::CheckFailed(char const*, int, char const*, unsigned long long, unsigned long long) (/lib64/libasan.so.5+0x11867d)
    #2 0x7f987e1fb654  (/lib64/libasan.so.5+0x2e654)
    #3 0x7f987e1fc756  (/lib64/libasan.so.5+0x2f756)
    #4 0x7f987e1fe4b3  (/lib64/libasan.so.5+0x314b3)
    #5 0x7f987e2c7ade  (/lib64/libasan.so.5+0xfaade)
    #6 0x7f987e205e5c in __interceptor_memmove (/lib64/libasan.so.5+0x38e5c)
    #7 0xe76815 in signed char* std::__copy_move<false, true, std::random_access_iterator_tag>::__copy_m<signed char>(signed char const*, signed char const*, signed char*) /usr/include/c++/8/bits/stl_algobase.h:368
    #8 0xe77a56 in signed char* std::__copy_move_a<false, signed char*, signed char*>(signed char*, signed char*, signed char*) /usr/include/c++/8/bits/stl_algobase.h:386
    #9 0xdf8cec in signed char* std::__copy_move_a2<false, signed char*, signed char*>(signed char*, signed char*, signed char*) /usr/include/c++/8/bits/stl_algobase.h:422
    #10 0xd5c2a5 in signed char* std::copy<signed char*, signed char*>(signed char*, signed char*, signed char*) /usr/include/c++/8/bits/stl_algobase.h:455
    #11 0xcf0794 in seastar::basic_sstring<signed char, unsigned int, 31u, false>::basic_sstring(seastar::basic_sstring<signed char, unsigned int, 31u, false> const&) /jenkins/workspace/scylla-master/dtest-debug/scylla/seastar/include/seastar/core/sstring.hh:183
    #12 0x3cf95bb in cql3::untyped_result_set_row::get_blob(seastar::basic_sstring<char, unsigned int, 15u, true> const&) const cql3/untyped_result_set.hh:67
    #13 0x59a47fc in bool cql3::untyped_result_set_row::get_as<bool>(seastar::basic_sstring<char, unsigned int, 15u, true> const&) const cql3/untyped_result_set.hh:71
    #14 0x90c4467 in operator() auth/standard_role_manager.cc:104
    #15 0x910b859 in apply /jenkins/workspace/scylla-master/dtest-debug/scylla/seastar/include/seastar/core/apply.hh:35
    #16 0x910ba02 in apply<auth::find_record(cql3::query_processor&, std::string_view)::<lambda(seastar::shared_ptr<cql3::untyped_result_set>)>&, seastar::shared_ptr<cql3::untyped_result_set>&&> /jenkins/workspace/scylla-master/dtest-debug/scylla/seastar/include/seastar/core/apply.hh:43
    #17 0x910bbb7 in apply<auth::find_record(cql3::query_processor&, std::string_view)::<lambda(seastar::shared_ptr<cql3::untyped_result_set>)>&, seastar::shared_ptr<cql3::untyped_result_set>&&> /jenkins/workspace/scylla-master/dtest-debug/scylla/seastar/include/seastar/core/future.hh:1398
    #18 0x90faf78 in operator() /jenkins/workspace/scylla-master/dtest-debug/scylla/seastar/include/seastar/core/future.hh:989
    #19 0x9121439 in call /jenkins/workspace/scylla-master/dtest-debug/scylla/seastar/include/seastar/util/noncopyable_function.hh:71
    #20 0x91657d2 in seastar::noncopyable_function<seastar::future<std::optional<auth::record> > (seastar::shared_ptr<cql3::untyped_result_set>&&)>::operator()(seastar::shared_ptr<cql3::untyped_result_set>&&) const /jenkins/workspace/scylla-master/dtest-debug/scylla/seastar/include/seastar/util/noncopyable_function.hh:145
    #21 0x9159519 in seastar::apply_helper<seastar::noncopyable_function<seastar::future<std::optional<auth::record> > (seastar::shared_ptr<cql3::untyped_result_set>&&)>, std::tuple<seastar::shared_ptr<cql3::untyped_result_set> >&&, std::integer_sequence<unsigned long, 0ul> >::apply(seastar::noncopyable_function<seastar::future<std::optional<auth::record> > (seastar::shared_ptr<cql3::untyped_result_set>&&)>&&, std::tuple<seastar::shared_ptr<cql3::untyped_result_set> >&&) /jenkins/workspace/scylla-master/dtest-debug/scylla/seastar/include/seastar/core/apply.hh:35
    #22 0x9159658 in auto seastar::apply<seastar::noncopyable_function<seastar::future<std::optional<auth::record> > (seastar::shared_ptr<cql3::untyped_result_set>&&)>, seastar::shared_ptr<cql3::untyped_result_set> >(seastar::noncopyable_function<seastar::future<std::optional<auth::record> > (seastar::shared_ptr<cql3::untyped_result_set>&&)>&&, std::tuple<seastar::shared_ptr<cql3::untyped_result_set> >&&) /jenkins/workspace/scylla-master/dtest-debug/scylla/seastar/include/seastar/core/apply.hh:43
    #23 0x915983b in seastar::future<std::optional<auth::record> > seastar::futurize<seastar::future<std::optional<auth::record> > >::apply<seastar::noncopyable_function<seastar::future<std::optional<auth::record> > (seastar::shared_ptr<cql3::untyped_result_set>&&)>, seastar::shared_ptr<cql3::untyped_result_set> >(seastar::noncopyable_function<seastar::future<std::optional<auth::record> > (seastar::shared_ptr<cql3::untyped_result_set>&&)>&&, std::tuple<seastar::shared_ptr<cql3::untyped_result_set> >&&) /jenkins/workspace/scylla-master/dtest-debug/scylla/seastar/include/seastar/core/future.hh:1478
    #24 0x918789e in auto seastar::future<seastar::shared_ptr<cql3::untyped_result_set> >::then_impl<seastar::noncopyable_function<seastar::future<std::optional<auth::record> > (seastar::shared_ptr<cql3::untyped_result_set>&&)>, seastar::future<std::optional<auth::record> > >(seastar::noncopyable_function<seastar::future<std::optional<auth::record> > (seastar::shared_ptr<cql3::untyped_result_set>&&)>&&)::{lambda(auto:1&&)#1}::operator()<seastar::future_state<seastar::shared_ptr<cql3::untyped_result_set> > >(seastar::noncopyable_function<seastar::future<std::optional<auth::record> > (seastar::shared_ptr<cql3::untyped_result_set>&&)>&&) /jenkins/workspace/scylla-master/dtest-debug/scylla/seastar/include/seastar/core/future.hh:1015
    #25 0x91881ec in seastar::continuation<seastar::future<seastar::shared_ptr<cql3::untyped_result_set> >::then_impl<seastar::noncopyable_function<seastar::future<std::optional<auth::record> > (seastar::shared_ptr<cql3::untyped_result_set>&&)>, seastar::future<std::optional<auth::record> > >(seastar::noncopyable_function<seastar::future<std::optional<auth::record> > (seastar::shared_ptr<cql3::untyped_result_set>&&)>&&)::{lambda(auto:1&&)#1}, seastar::shared_ptr<cql3::untyped_result_set> >::run_and_dispose() /jenkins/workspace/scylla-master/dtest-debug/scylla/seastar/include/seastar/core/future.hh:425
@slivne

This comment has been minimized.

Copy link
Contributor

commented Mar 17, 2019

dtest waiting for review

@slivne

This comment has been minimized.

Copy link
Contributor

commented Mar 17, 2019

@hakuch - the fix in dtest - is strange - we need to make sure a user that boots a cluster will not hit this - I think we need to find the source cause in scylla not change the test

@hakuch

This comment has been minimized.

Copy link
Contributor

commented Mar 18, 2019

@slivne: I'm reasonably sure the error was in dtest, and not Scylla.

Internally, Scylla's auth code is structured differently than Cassandra. They have a single asynchronous process creating the default superuser, but we have two asynchronous processes to create it: one for the role-manage, and one for the authenticator.

The test was previously not waiting for both of these processes to complete; only a single one. I believe that's why the errors happened.

@hakuch

This comment has been minimized.

Copy link
Contributor

commented Mar 18, 2019

@slivne, perhaps we can add a check in the role-manager and authenticator which throws an authentication error if the default superuser is not ready yet. That would prevent an invalid memory access.

Combined with the changes to dtest, that would probably cover all the cases we care about.

@slivne

This comment has been minimized.

Copy link
Contributor

commented Mar 20, 2019

@eliransin - please comment ?

@hakuch
In my view we should not crash in any case

  • its fine not to allow login if it happens to early in the boot process - if this can help
@eliransin

This comment has been minimized.

Copy link
Contributor

commented Mar 21, 2019

@slivne I agree, a crash is never a good option.
@hakuch lets make the change

@eliransin eliransin assigned eliransin and unassigned hakuch Jun 16, 2019

@slivne slivne added the bug label Jun 27, 2019

@slivne

This comment has been minimized.

Copy link
Contributor

commented Jun 27, 2019

@eliransin - lets prioritize this over other additional features

@eliransin

This comment has been minimized.

Copy link
Contributor

commented Jun 30, 2019

on it.

@eliransin

This comment has been minimized.

Copy link
Contributor

commented Jun 30, 2019

I couldn't reproduce locally even not with a branch of dtest which is deated to when the original issue was opened.
I couldn't find anywhere in the code a reason why this specific column value will be missing. The column value is always assigned - in the code at least. They are all written together so there is no reason that some of them will have values while others don't. But the fact is that it happened.
I will add checks for the columns.

eliransin added a commit to eliransin/scylla that referenced this issue Jul 9, 2019

auth: Add resiliancy against potentialy corrupted table
There has been a case where reading from the internal
authentication roles table resulted in a memorry access
violation. The problem was that some columns that was assumed
to have value for each row of the table didn't. According to
the code this situation shouldn't have happened, but it did.
This change adds a check for existance of those columns (can_login
and superuser columns). In the case they don't exist an error
is issued but the defult value for those columns is used.
The reason is that the error will not self correct by crashing and
in a lot of cases those fields are not relevant. One case that
can still happen is that if the corrupte row is indeed the superuser
that can loggin, the administrator will be left out of the system.
A manual correction of the table will be needed which is the case
even before our patch (the system will crash instead of just no logging
the user in), however, after this patch the log will indicate the
problem so we will know it needs fixing by changing the table manually
and from outside scylla.

Fixes scylladb#4226

avikivity added a commit that referenced this issue Jul 11, 2019

auth: Add resiliancy against potentialy corrupted table
There has been a case where reading from the internal
authentication roles table resulted in a memorry access
violation. The problem was that some columns that was assumed
to have value for each row of the table didn't. According to
the code this situation shouldn't have happened, but it did.
This change adds a check for existance of those columns (can_login
and superuser columns). In the case they don't exist an error
is issued but the defult value for those columns is used.
The reason is that the error will not self correct by crashing and
in a lot of cases those fields are not relevant. One case that
can still happen is that if the corrupte row is indeed the superuser
that can loggin, the administrator will be left out of the system.
A manual correction of the table will be needed which is the case
even before our patch (the system will crash instead of just no logging
the user in), however, after this patch the log will indicate the
problem so we will know it needs fixing by changing the table manually
and from outside scylla.

Fixes #4226

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Message-Id: <20190709072443.16318-1-eliransin@scylladb.com>
@bhalevy

This comment has been minimized.

Copy link
Contributor Author

commented Jul 11, 2019

As discussed in our f2f meeting, this issue shouldn't be gating for 3.1
Let's retarget to 3.x (x > 1 :))

@slivne slivne modified the milestones: 3.1, 3.2 Jul 11, 2019

eliransin added a commit to eliransin/scylla that referenced this issue Jul 14, 2019

auth: Prevent race between role_manager and pasword_authenticator
When scylla is started for the first time with PasswordAuthenticator
enabled, there it can be that a record of the default superuser
will be created in the table with the can_login and is_superuser
set to null. This happens because the module in charge of creating
the row is the role manger and the module in charge of setting the
default password is the password authenticator.
Those two modules are started together, it the case when the
password authenticator finish the initialization first, in the
period until the role manager is started completely, the row
contains those null columns and any loging attempt in this period
will cause a memory access violation since those columns are not
expected to ever be null. This patch removes the race by starting
the password authenticator and autorizer only after the role manger
finished starting.

Tests:
  1. Unit tests (release)
  2. Auth and cqlsh dtests.

Fixes scylladb#4226

avikivity added a commit that referenced this issue Jul 15, 2019

auth: Prevent race between role_manager and pasword_authenticator
When scylla is started for the first time with PasswordAuthenticator
enabled, it can be that a record of the default superuser
will be created in the table with the can_login and is_superuser
set to null. It happens because the module in charge of creating
the row is the role manger and the module in charge of setting the
default password salted hash value is the password authenticator.
Those two modules are started together, it the case when the
password authenticator finish the initialization first, in the
period until the role manager completes it initialization, the row
contains those null columns and any loging attempt in this period
will cause a memory access violation since those columns are not
expected to ever be null. This patch removes the race by starting
the password authenticator and autorizer only after the role manger
finished its initialization.

Tests:
  1. Unit tests (release)
  2. Auth and cqlsh auth related dtests.

Fixes #4226

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Message-Id: <20190714124839.8392-1-eliransin@scylladb.com>
(cherry picked from commit 997a146)

avikivity added a commit that referenced this issue Jul 15, 2019

auth: Prevent race between role_manager and pasword_authenticator
When scylla is started for the first time with PasswordAuthenticator
enabled, it can be that a record of the default superuser
will be created in the table with the can_login and is_superuser
set to null. It happens because the module in charge of creating
the row is the role manger and the module in charge of setting the
default password salted hash value is the password authenticator.
Those two modules are started together, it the case when the
password authenticator finish the initialization first, in the
period until the role manager completes it initialization, the row
contains those null columns and any loging attempt in this period
will cause a memory access violation since those columns are not
expected to ever be null. This patch removes the race by starting
the password authenticator and autorizer only after the role manger
finished its initialization.

Tests:
  1. Unit tests (release)
  2. Auth and cqlsh auth related dtests.

Fixes #4226

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Message-Id: <20190714124839.8392-1-eliransin@scylladb.com>
(cherry picked from commit 997a146)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants
You can’t perform that action at this time.