Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Untyped result sets may cause segfaults when parsing disengaged optionals #6915

Closed
psarna opened this issue Jul 24, 2020 · 16 comments
Closed
Assignees
Milestone

Comments

@psarna
Copy link
Contributor

psarna commented Jul 24, 2020

In untyped_result_set.hh, function get_view silently assumes that the optional value kept in its internal map is always engaged. It's not always the case, since the optional values come from the tables (they are often a result of the SELECT statement), and thus they shouldn't assume that the values are always present.
For instance, system_keyspace::load_view_build_progress can fail when a particular column - "first_token" - is not present in the system.view_builds_in_progress table - which is entirely possible, since we cannot assume any particular contents inside a CQL table.
Current code creates an incorrect bytes_view object when it's given a disengaged optional, and it may lead to unexpected segfaults later. It's better to check if the optional value is engaged, and throw if it's not.

@psarna psarna self-assigned this Jul 24, 2020
@slivne slivne modified the milestones: 4.2, 4.3 Jul 26, 2020
psarna added a commit to psarna/scylla that referenced this issue Jul 27, 2020
In untyped_result_set::get_view, there exists a silent assumption
that the underlying data, which is an optional, to always be engaged.
In case the value happens to be disengaged it may lead to creating
an incorrect bytes view from a disengaged optional.
In order to make the code safer (since values parsed by this code
often come from the network and can contain virtually anything)
a segfault is replaced with an exception, by calling optional's
value() function, which throws when called on disengaged optionals.

Fixes scylladb#6915
Message-Id: <090d3f323f693eaa833cad551a044a934b06678d.1595602089.git.sarna@scylladb.com>
@avikivity
Copy link
Member

@psarna should this be backported? I don't think we have a trigger.

@psarna
Copy link
Contributor Author

psarna commented Aug 3, 2020

We don't have a trigger, but corrupt system tables would suffice. I think I'll backport, since it's a bug (segfault instead of graciously failing with a recoverable exception) and the backporting cost is cheap.

@psarna
Copy link
Contributor Author

psarna commented Aug 4, 2020

Neither 4.1 nor 4.2 backports cleanly, because CDC relied on dereferencing disengaged optionals somewhere, and it shows in cdc_static_and_clustered_rows_test... it's definitely fixed already in master, so I guess that this fix, whatever it is, should be backported first? Or is CDC considered experimental in 4.1/4.2, which means no backports? @haaawk any idea which patch fixed the CDC issue?

@haaawk
Copy link
Contributor

haaawk commented Aug 4, 2020

Do you know which optional was the problem in CDC, @psarna ?

@psarna
Copy link
Contributor Author

psarna commented Aug 4, 2020

Not precisely, this is the output of the failed test:

 -- generates 3 rows: preimage(static), delta(static), delta(clustering)
 update ks.t set vs = 2, vc = 2 where pk = 0 and ck = 0;
 {
-	"status" : "ok"
+	"message" : "std::bad_optional_access (bad optional access)",
+	"status" : "error"
 }
 -- generates 4 rows: preimage(static), preimage(clustering), delta(static), delta(clustering)
 update ks.t set vs = 3, vc = 3 where pk = 0 and ck = 0;
 {
-	"status" : "ok"
+	"message" : "std::bad_optional_access (bad optional access)",
+	"status" : "error"
 }
 select "cdc$batch_seq_no", "cdc$operation", ck, vs, vc from ks.t_scylla_cdc_log where pk = 0 and ck = 0 allow filtering;
 {
-	"rows" : 
-	[
-		{
-			"cdc$batch_seq_no" : "2",
-			"cdc$operation" : "1",
-			"ck" : "0",
-			"pk" : "0",
-			"vc" : "2"
-		},
-		{
-			"cdc$batch_seq_no" : "2",
-			"cdc$operation" : "0",
-			"ck" : "0",
-			"pk" : "0",
-			"vc" : "2"
-		},
-		{
-			"cdc$batch_seq_no" : "3",
-			"cdc$operation" : "1",
-			"ck" : "0",
-			"pk" : "0",
-			"vc" : "3"
-		}
-	]
+	"rows" : null
 }

@haaawk
Copy link
Contributor

haaawk commented Aug 4, 2020

Does the test fail every time or are there some special conditions?

@psarna
Copy link
Contributor Author

psarna commented Aug 4, 2020

every single time on 4.1/4.2, never on master

@haaawk
Copy link
Contributor

haaawk commented Aug 4, 2020

Could you try removing "vs = 2" and "vs = 3" from updates and see if this fixes he exception?

@psarna
Copy link
Contributor Author

psarna commented Aug 4, 2020

Well, it definitely fixes std::bad_optional_access (bad optional access), because these were the only two statements that produced the error, but the test still fails after removing these updates (as expected).

@haaawk
Copy link
Contributor

haaawk commented Aug 4, 2020

Cool. That narrows down the search to the code related to static row. If we want to find the patch that fixed the problem.

@eliransin
Copy link
Contributor

@psarna / @haaawk what should be done to find this patch? (besides brute force bisection of course 🙂 )
Can one of you help with that?

@haaawk
Copy link
Contributor

haaawk commented Aug 17, 2020

Do we have to do anything? It seems that the patch is fine on master is backporting necessary?

@eliransin
Copy link
Contributor

@haaawk, according to this #6915 (comment) I was under the impression
that we don't even know what patch is to be backported (cdc patch that fixes relying on disengaged optionals mentioned by @psarna - #6915 (comment))
If I'm wrong and we do know what this patch is, can you please reference it here so I can see if it is a viable candidate for
backport?

@haaawk
Copy link
Contributor

haaawk commented Aug 23, 2020

You're right, @eliransin - we don't know which CDC patch affects the behavior. I was referring to the fact that it's not clear we want to backport this fix at all. @avikivity wrote:

@psarna should this be backported? I don't think we have a trigger.

and then @psarna wrote:

We don't have a trigger [...] and the backporting cost is cheap

This left me with the impression there's no need for backporting and spending time on looking for this CDC patch wouldn't be the best use of our time when we can do something more useful instead.

@eliransin
Copy link
Contributor

Thanks @haaawk for clarification. I will update if we have a trigger.

@avikivity
Copy link
Member

Looks like there was no trigger, so removing the backport candidate label.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants