Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite Snapshotting #4641

Merged
merged 2 commits into from
Jun 8, 2020
Merged

Rewrite Snapshotting #4641

merged 2 commits into from
Jun 8, 2020

Conversation

Zelldon
Copy link
Member

@Zelldon Zelldon commented Jun 1, 2020

Hey Team,

sorry for the big PR but was saw no better way to do this. I think most of the new lines are actually unit tests.

@pihme I think it is just for you to learn how it works don't need to be to check every details.

@npepinpe or @deepthidevaki who ever has time would be nice if you could take a look, especially for raft and zeebe endpoints where we use it. Hope it makes sense to you.

Happy to get your feedback.

Description

Introduces new interfaces for Snapshotting and rewrites and merge existing implementation.

PersistedSnapshotStore can be used to take snapshots and receive
snapshot chunks. After all chunks are received or the snapshot is taken
the volatile snapshot can be persisted. It is always possible
to abort an snapshot, before persisting. The store can also be used
to get the latest snapshot which was persisted. The store allows
to register listeners, which are called on new persisted snapshots.

On receiving snapshot chunks and consuming them, checksums of the chunks
are verified. On persisting an received snapshot the complete snapshot
checksum is verified.

An persisted snapshot can be read from its persisted storage via a
reader and will be chunked in several snapshot chunks, which
then can be transmitted and consumed again by the store.

The StateSnapshotController use by the AsyncSnapshotDirector and is the
only one which takes a snapshot. The ReplicationController is
registered at the SnapshotStore as listener to replicate new persisted
snapshots. The replication will send all chunks over the network.

The ReplicationController will receive the chunks on the follower side and will use
the newReceviedSnapshot method of the store to consume the SnapshotChunks.

On the LeaderRole (in the AbstractAppender) the last snapshot can also
be replicated. This is done via the SnapshotChunkReader, which
reads the chunks from the persisted store.

The Follower (in PassiveRole) will consume the replicated chunks and
will store it similar to the ReplicationController.

On both ReplicationController and PassiveRole the checksum of the
snapshot chunks but also of the complete snapshot is verified before
the received snapshot is persisted.

Related work:

  • Most snapshot implementation details are now in Atomix
  • Zeebe related snapshoting like replication and taking of snapshot is
    part of Broker
  • Use SBE for snapshot chunk replication/consumption
  • StreamProcessorTests moved again to engine module - removed snapshot
    related tests, dont need to be part of that
  • clean up management schema file in broker
  • Wrote a bunch of new unit tests for snapshot implementation

Class Diagram

Generate Class diagram via Intellij:

snapshot

Related issues

closes #4424
closes #4604
closes #4345
closes #4122
closes #4067
closes #3214
closes #2721
closes #4425

Pull Request Checklist

  • All commit messages match our commit message guidelines
  • The submitting code follows our code style
  • If submitting code, please run mvn clean install -DskipTests locally before committing

@Zelldon Zelldon force-pushed the zell-4424 branch 2 times, most recently from 93d1213 to f89f99c Compare June 2, 2020 05:14
Copy link
Contributor

@pihme pihme left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lots of good stuff. I like the unit tests. Several smaller comments.

The important feedback is for the state controller, where I think there is a memory leak and the metrics aren't updated correctly in some cases.

Copy link
Contributor

@deepthidevaki deepthidevaki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here are some initial comments.

Copy link
Contributor

@deepthidevaki deepthidevaki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here are some more minor comments. 🙂

@Zelldon
Copy link
Member Author

Zelldon commented Jun 8, 2020

@pihme @deepthidevaki thanks for reviewing. I applied almost all review hints, except the interface changes. It felt like it make it more complex then necessary and I would need to do more changes. We can discuss them if necessary. Happy to hear your feedback.

Copy link
Contributor

@deepthidevaki deepthidevaki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes. I have just 2 more comments/questions.

Copy link
Contributor

@deepthidevaki deepthidevaki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

 Description:

 Introduces new interfaces for Snapshotting.

 PersistedSnapshotStore can be used to take snapshots and receive
 snapshot chunks. After all chunks are received or the snapshot is taken
 the volatile snapshot can be persisted. It is always possible
 to abort an snapshot, before persisting. The store can also be used
 to get the latest snapshot which was persisted. The store allows
 to register listeners, which are called on new persisted snapshots.

 On receiving snapshot chunks and consuming them, checksums of the chunks
 are verified. On persisting an received snapshot the complete snapshot
 checksum is verified.

 An persisted snapshot can be read from it persisted storage via a
 snapshot reader and will be chunked in several snapshot chunks, which
 then can be transmitted and consumed again by the store.

 The StateSnapshotController use by the AsyncSnapshotDirector and is the
 only one which takes a snapshot. The ReplicationController is
 registered at the SnapshotStore as listener to replicate new persisted
 snapshots. The replication will send all chunks over the network.

 The ReplicationController will receive the chunks on the follower side and will use
 the newReceviedSnapshot method of the store to consume the SnapshotChunks.

 On the LeaderRole (in the AbstractAppender) the last snapshot can also
 be replicated. This is done via the SnapshotChunkReader, which
 reads the chunks from ther persisted store.

 The Follower (in PassiveRole) will consume the replicated chunks and
 will store it similar to the ReplicationController.

 On both ReplicationController and PassiveRole the checksum of the
 snapshot chunks but also of the complete snapshot is verified before
 the received snapshot is persisted.

 Related work:

  * Most implemation details are now in Atomix
  * Zeebe related snapshoting like replication and taking of snapshot is
    part of Broker
  * Use sbe for snapshot chunk replication/consumption
  * StreamProcessorTests moved again to engine module - removed snapshot
    related tests, dont need to be part of that
  * clean up managment schema file in broker

  * Wrote a bunch of new unit tests for snapshot implementation
@Zelldon
Copy link
Member Author

Zelldon commented Jun 8, 2020

bors r+

@zeebe-bors
Copy link
Contributor

zeebe-bors bot commented Jun 8, 2020

Build succeeded

@zeebe-bors zeebe-bors bot merged commit 3ca5f98 into develop Jun 8, 2020
@zeebe-bors zeebe-bors bot deleted the zell-4424 branch June 8, 2020 15:24
@Zelldon
Copy link
Member Author

Zelldon commented Jun 9, 2020

Forget to run the benchmark before merge sorry. I run overnight the benchmarks and they look good.

Benchmark with PR changes:

processing
pvc-usage

Benchmark with current Develop (incl. changes):
long-processing
long-pvc

Benchmark with Develop (excl. changes):

without-proc
without-pvc

zeebe-bors bot added a commit that referenced this pull request Jun 10, 2020
4684: chore(monitor): adjust snapshot duration panel r=Zelldon a=Zelldon

## Description

Hey @deepthidevaki I missed to update the grafana dashboard in #4641 
I changed in the PR the gauge to a Histogram. This PR now changes the panel in the grafana dashboard.

**OLD**
![old-snapshot](https://user-images.githubusercontent.com/2758593/84110685-9fd0e200-aa25-11ea-9e10-9ba837fcf6f4.png)

**NEW**

![new-snapshot](https://user-images.githubusercontent.com/2758593/84110711-a95a4a00-aa25-11ea-88ea-d26535ea6c9a.png)


<!-- Please explain the changes you made here. -->


#

Co-authored-by: Christopher Zell <zelldon91@googlemail.com>
github-merge-queue bot pushed a commit that referenced this pull request Mar 14, 2024
* style: add carbonised skeleton to instance history

* refactor: use default error message in empty panel

* refactor: use separate component for error message

* refactor: simplify passing props
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment