Support Materialized Views #1141

Open
tzach opened this Issue Mar 29, 2016 · 16 comments

Comments

belliottsmith commented Apr 12, 2016

I recommend being very cautious about Materialized Views - their failure cases are problematic, and poorly understood. Their consistency semantics are similarly challenging, and even assuming all of these things are fine they are quite constrained in capability in their current design (and that is an artefact of the design, not a short term constraint).

nyh (Contributor) commented Apr 13, 2016

Hi @belliottsmith, thanks for the comment. Given that Cassandra's classic "secondary index" also has its share of problems (notably, very poor scalability) and that the newer "SASI" did not fix them, which of these three indexing methods do you think is most valuable to implement in ScyllaDB? Or do you have some fourth technique in mind?
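
For context, here is a minimal sketch of the two constructs being compared, written against the DataStax Python driver. The `demo` keyspace, `users` table and column names are invented for illustration and imply nothing about how Scylla would implement either feature:

```python
from cassandra.cluster import Cluster  # DataStax/Scylla Python driver

session = Cluster(["127.0.0.1"]).connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.users (
        id int PRIMARY KEY,
        name text,
        city text)
""")

# Classic secondary index: each node indexes only its own data, so a query
# by city must fan out across the cluster.
session.execute("CREATE INDEX IF NOT EXISTS users_by_city_idx ON demo.users (city)")

# Materialized view: a separate, server-maintained table partitioned by the
# indexed column, so a query by city hits a single partition.
session.execute("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS demo.users_by_city AS
        SELECT * FROM demo.users
        WHERE city IS NOT NULL AND id IS NOT NULL
        PRIMARY KEY (city, id)
""")
```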

belliottsmith commented Apr 14, 2016

As far as functionality is concerned, the concept of Materialized Views is clearly more powerful than Cassandra’s Secondary Indexes, although they do have some limitations, as low-cardinality data can result in hot spots. If you want to implement the most useful feature, Materialized Views is clearly going to be it - but if you want to implement the most correct feature, it might not be. Secondary Indexes have the advantage of being trivial to reason about, and to make corresponding promises about.

But the concept can be implemented in a multitude of ways. I don’t have nearly enough time to fully marshal all of my thoughts on the topic, and even if I did it would not be enough thought anyway. I do not know of an adequate solution to the problem as it stands, and while I’m confident one exists, I’m not convinced a simple one does, at least not with Cassandra’s multi-master eventually consistent architecture. Certainly it will require a great deal of thought.

As far as I have encountered, there are three basic approaches to this problem: solutions involving local CAS with broadcast of changes to MV replicas, global CAS, or complex tombstones. And none actually describes an algorithm; there are a multitude of instances of each.

Cassandra has done the first of these, but it pairs each base replica with one of the MV replicas for each MV. Thus if that base node is lost, and was the only base replica to receive a given update before dying, the MV update can propagate to all of its replicas without any existing base replica knowing the value exists, so it will never be removed. This inconsistency is difficult to detect without a special MV repair mechanism (which does not currently exist, but is trivial to write and costly to run). See my comment on C-6477 for some more similar problems. To summarise these and some others:

  • Ghost records
  • Availability:
    • A single node loss results in impossibility of QUORUM in the MV for a proportion of all updates (although QUORUM reads may be correctly serviced, until the failing node is restored at which point they may also fail; a CL=ONE query may always return the wrong result, rather than transiently)
    • Two nodes lost means MVs will not reflect any of a proportion of base table updates; if operating in unsafe/performant mode, the changes may never be reflected
  • Consistency:
    • There is no way for a client to know when (or if) an MV is updated; Consistency Levels ignore MVs
    • What do CLs even mean when MVs are involved? CL=ONE on the base table simply means you only know the write reached one replica but eventual consistency will be met; once an MV is involved, CL=ONE may result in eventual _in_consistency
  • Multiple columns for MV keys are impossible, since we cannot know what history occurred in all base replicas, so we do not know what old value(s) to delete from the MV
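
A toy, single-process illustration of the ghost-record scenario described above, where plain Python dicts stand in for replicas (this is not Cassandra or Scylla code, just a sketch of the failure mode):

```python
# Base table replicated on B1..B3; its MV replicated on V1..V3.
base = {"B1": {}, "B2": {}, "B3": {}}
view = {"V1": {}, "V2": {}, "V3": {}}
# Cassandra pairs each base replica with exactly one view replica per MV.
pairing = {"B1": "V1", "B2": "V2", "B3": "V3"}

def write_cl_one(replica, key, value):
    """A CL=ONE write that only ever reaches one base replica."""
    base[replica][key] = value
    # The receiving base replica pushes the derived view row only to its
    # paired view replica.
    view[pairing[replica]][value] = key

write_cl_one("B1", "user:1", "alice")   # only B1 sees this update

del base["B1"]                          # B1 dies before hinting/repairing B2 or B3

# Anti-entropy among the view replicas can still spread the row everywhere...
for v in ("V2", "V3"):
    view[v].update(view["V1"])

# ...but no surviving base replica knows 'alice' exists, so nothing will ever
# issue the delete that removes it from the view: a ghost record.
print(any("user:1" in rows for rows in base.values()))   # False
print(all("alice" in rows for rows in view.values()))    # True
```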

Further, ensuring (some) safety in the current design collapses throughput, since it relies heavily on the batch-log, so it is disabled by default. This means a larger window exists for inconsistencies (ghost records become more likely).

I suspect a correct solution is in the direction of: routing all updates to a base replica (so that coordinators are just stupid proxies*). That base replica would write to its local commit (or batch) log, then tell all of its peers what it is going to do, each writing to its commit log and then responding and proceeding; as soon as any replica knows another replica is making the change, it is happy to proceed. ALL peers that receive the update would send updates to at least 2 MV replicas, possibly all of them, not clearing their local log entry until this has been done. To support consistency levels lower than QUORUM, a random node could be selected to substitute for the base replicas we failed to reach, which would hold batch-log entries for those peers to replay on startup.
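
A rough, single-process sketch of that flow. The class and method names below are invented for illustration; nothing here exists in Cassandra or Scylla:

```python
class MVReplica:
    """A replica of the materialized view."""
    def __init__(self, name):
        self.name = name
        self.rows = {}

    def apply(self, view_key, base_key):
        self.rows[view_key] = base_key


class BaseReplica:
    """A base-table replica that may also act as the write coordinator."""
    def __init__(self, name, mv_replicas):
        self.name = name
        self.peers = []            # the other base replicas for this token range
        self.mv_replicas = mv_replicas
        self.log = []              # local commit/batch log
        self.rows = {}

    def coordinate(self, key, value):
        """Entry point: the coordinator is itself one of the base replicas."""
        entry = (key, value)
        self.log.append(entry)         # 1. log locally first
        for peer in self.peers:        # 2. tell every peer what we intend to do
            peer.accept(entry)         #    (each peer logs, then proceeds)
        self._apply(entry)             # 3. once another replica knows, proceed

    def accept(self, entry):
        self.log.append(entry)
        self._apply(entry)

    def _apply(self, entry):
        key, value = entry
        self.rows[key] = value
        # Every base replica pushes the view update to at least two MV
        # replicas, so losing any single base node cannot strand the view.
        for mv in self.mv_replicas[:2]:
            mv.apply(view_key=value, base_key=key)
        self.log.remove(entry)         # clear the log only after the MV updates went out


mvs = [MVReplica(f"V{i}") for i in range(1, 4)]
b1, b2, b3 = (BaseReplica(f"B{i}", mvs) for i in range(1, 4))
b1.peers, b2.peers, b3.peers = [b2, b3], [b1, b3], [b1, b2]
b1.coordinate("user:1", "alice")       # b1 happens to be the chosen coordinator
```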

This is, as I’m sure you can tell, by no means fully thought out; it’s a research project that requires some simulation IMO before confidence can be built in the safety of any approach. I’ve yet to be paid to think about this full time, and it really requires a great deal of that, since it’s never been solved to my knowledge.

This is what you get for choosing Cassandra to supplant :)

* which is fine if clients are all token-aware, as they mostly are today, although in giant deployments this can be disabled to shrink the number of TCP connections the cluster is managing

nyh referenced this issue Apr 18, 2016: Secondary index #401 (Open)

tzach added this to the Yellowfin milestone Apr 18, 2016

tzach modified the milestones: 1.4, Yellowfin Jun 21, 2016

tzach modified the milestones: 2.0, 1.4 Aug 31, 2016

duarten (Member) commented Oct 16, 2016

I've started to collect my thoughts on this here.

I feel it's barely scratching the surface. As with any distributed systems problem, you can just keep digging and digging - and keep finding ever more subtle problems.

haoyixin commented Oct 24, 2016

+1

dorlaor (Contributor) commented Oct 24, 2016

We're actively working on it. You can google the materialized view early patches.
The plan is to be ready around Dec/Jan

dahankzter commented Feb 16, 2017

How is this coming along? Would be cool to test something.

tzach (Contributor) commented Feb 16, 2017

@dahankzter we plan to have an initial, experimental version in Scylla 1.8

dahankzter commented Mar 10, 2017

Around when is 1.8 slated for release? I would very much like to try this stuff.

dorlaor (Contributor) commented Mar 10, 2017

@duarten can fill you in with the exact state. Today we can do MV table creation and insert. IIUC we have a half-way implementation for deletes and updates. It should make its way to 1.8 as experimental and 1.9 as a fully functional feature.

duarten (Member) commented Mar 10, 2017

Currently, views can be created (and updated and deleted); however, only new data inserted into the base table after view creation will be propagated to the views. We’re working on propagating updates and deletes, and then we’ll work on propagating the entries that were already in the base table.
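
A hedged sketch of what that behaviour means in practice, again with invented names (`demo2` keyspace, `events` table) and assuming an experimental build with view support enabled: rows written before the view exists are not back-filled yet, rows written afterwards are propagated by the server.

```python
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo2
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS demo2.events (
        id int PRIMARY KEY, kind text)
""")

# Written before the view exists: not expected to show up in the view yet.
session.execute("INSERT INTO demo2.events (id, kind) VALUES (1, 'old')")

session.execute("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS demo2.events_by_kind AS
        SELECT * FROM demo2.events
        WHERE kind IS NOT NULL AND id IS NOT NULL
        PRIMARY KEY (kind, id)
""")

# Written after the view exists: propagated to the view by the server.
session.execute("INSERT INTO demo2.events (id, kind) VALUES (2, 'new')")

print([r.id for r in session.execute(
    "SELECT id FROM demo2.events_by_kind WHERE kind = 'new'")])  # expect [2]
print([r.id for r in session.execute(
    "SELECT id FROM demo2.events_by_kind WHERE kind = 'old'")])  # expect [] until back-fill lands
```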

dorlaor (Contributor) commented Mar 10, 2017

@duarten are the update/delete paths in 1.7 in experimental mode? It may provide value to users already.

duarten (Member) commented Mar 10, 2017

@dorlaor not yet, still working on it.

raphaelsc (Contributor) commented Dec 23, 2017

@duarten can this be closed?

duarten (Member) commented Dec 23, 2017

@raphaelsc Nope.
