
Storage Engine Redesign for improving scalability #31

Closed
ambud opened this issue Jun 25, 2017 · 2 comments · Fixed by #66

ambud (Member) commented Jun 25, 2017

Summary:

Redesign the storage engine to support hundreds of thousands of independent time series.

Description:

The current Sidewinder Disk Storage engine stores each time series bucket in its own file. The bucket size is configurable; however, this still limits how many unique time series a given server can hold, because the number of open files is limited.

Commit dc4d448 tries to avoid the max-open-files issue by closing each file as soon as its MappedByteBuffer is created, but this only pushes the limit out so far; the fundamental issue remains unresolved.
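
A minimal sketch of the map-then-close idea, assuming a plain `java.nio` `FileChannel` per bucket file. `MapAndClose` and `mapAndClose` are hypothetical names, not Sidewinder's code. It relies on the documented `FileChannel.map` contract that a mapping, once established, is not invalidated by closing the channel that created it:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper illustrating "close the file as soon as the MappedByteBuffer is created".
final class MapAndClose {
    /**
     * Maps the first {@code size} bytes of a bucket file read-write, then closes the channel.
     * The returned buffer stays usable without keeping the channel open.
     */
    static MappedByteBuffer mapAndClose(Path file, long size) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // READ_WRITE mapping beyond the current file length grows the file to `size`.
            return ch.map(FileChannel.MapMode.READ_WRITE, 0, size);
        } // channel closed here; the mapping remains valid
    }
}
```

This still creates one mapping per bucket, which is why it only postpones the scaling problem rather than removing it.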

The LRU-based design I proposed earlier can only mitigate the issue when there are relatively few concurrent writes across time series. When there are many, it causes frequent cache evictions and swapping, which degrades performance.

Proposal
The new storage engine design decouples the compression and persistence responsibilities and combines multiple series into one file, while keeping the concept of time series buckets. The design is built around a memory allocator that grants buffers to series buckets on request; these buffers are slices of a memory-mapped file segment. Once a file reaches a certain size, a new file is created and the existing one is closed. This redesign refactors many components of the StorageEngine while preserving its interface as much as possible, so there is minimal impact on the writer and reader components of the database. Additional tests are added as well to improve the reliability of the system.
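
The allocator described above can be sketched roughly as follows. This is a hypothetical illustration, not the actual Sidewinder implementation: `SegmentAllocator`, `allocate`, the `segment-N.dat` file naming and the fixed `segmentSize` are all assumptions. Buckets request buffers, each buffer is a non-overlapping slice of the current memory-mapped segment, and when the segment cannot fit a request the allocator maps a new file:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical sketch: carves bucket buffers out of memory-mapped segment files. */
final class SegmentAllocator {
    private final Path dir;
    private final int segmentSize;
    private MappedByteBuffer current; // segment currently being carved up
    private int segments = 0;         // number of segment files created so far

    SegmentAllocator(Path dir, int segmentSize) {
        this.dir = dir;
        this.segmentSize = segmentSize;
    }

    /** Grants a buffer of exactly {@code size} bytes to a series bucket. */
    synchronized ByteBuffer allocate(int size) throws IOException {
        if (size <= 0 || size > segmentSize) {
            throw new IllegalArgumentException("size must be in (0, segmentSize]");
        }
        if (current == null || current.remaining() < size) {
            rollOver(); // current segment is full: start a new file
        }
        int start = current.position();
        ByteBuffer view = current.duplicate(); // shares bytes, independent position/limit
        view.limit(start + size);
        ByteBuffer slice = view.slice();       // capacity == size, cannot reach neighbours
        current.position(start + size);
        return slice;
    }

    /** Maps a fresh segment file; the channel is closed immediately, the mapping stays valid. */
    private void rollOver() throws IOException {
        Path file = dir.resolve("segment-" + segments + ".dat");
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            current = ch.map(FileChannel.MapMode.READ_WRITE, 0, segmentSize);
        }
        segments++;
    }

    int fileCount() {
        return segments;
    }
}
```

With this layout the number of files grows with total data volume rather than with the number of series. The sketch simply abandons leftover space at the end of a segment and keeps no record of which slice belongs to which bucket; a real engine would also need to persist those offsets so buckets can be located again during recovery.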

OLD:

Summary:

Limit the maximum number of open files.

Description:

Create a Least Recently Used (LRU) based eviction system that automatically closes data files that are not being written to or read from. Operating systems limit the maximum number of open files; if a user or system exceeds that limit, an exception is thrown that cannot be recovered from unless files are closed. The LRU-based module will prevent this exception by proactively keeping the number of open files below the limit. This feature is especially helpful for series storing historical data.
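
A minimal sketch of such an LRU module, assuming `java.util.LinkedHashMap` in access order: its `removeEldestEntry` hook closes the least recently used handle once a cap is exceeded. `LruFileCache` and `maxOpen` are hypothetical names, and the value type is any `Closeable` (for example a `FileChannel`):

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical LRU cache that closes the least recently used handle beyond maxOpen entries. */
final class LruFileCache<K, V extends Closeable> extends LinkedHashMap<K, V> {
    private static final long serialVersionUID = 1L;
    private final int maxOpen;

    LruFileCache(int maxOpen) {
        super(16, 0.75f, true); // accessOrder = true: get() moves an entry to the most-recent end
        this.maxOpen = maxOpen;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        if (size() <= maxOpen) {
            return false;
        }
        try {
            eldest.getValue().close(); // release the OS file handle before evicting
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return true;
    }
}
```

The sketch is not thread-safe, so concurrent writers would need external locking. It also shows the weakness discussed in the redesign: when many series are written concurrently, handles are evicted and reopened constantly.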

ambud added Bug and removed Enhancement labels Jul 15, 2017
ambud added a commit that referenced this issue Aug 6, 2017
Cleaning up archival moving sql package
ambud added a commit that referenced this issue Aug 7, 2017
ambud added a commit that referenced this issue Aug 9, 2017
time series buckets

Fixing bugs with refactor
ambud self-assigned this Aug 9, 2017
ambud changed the title Limit max open files Storage Engine Redesign for increasing series count Aug 9, 2017
ambud changed the title Storage Engine Redesign for increasing series count Storage Engine Redesign for improving scalability Aug 9, 2017
ambud added a commit that referenced this issue Aug 10, 2017
ambud added a commit that referenced this issue Aug 10, 2017
ambud added a commit that referenced this issue Aug 10, 2017
ambud added a commit that referenced this issue Aug 12, 2017
ambud added a commit that referenced this issue Aug 12, 2017
ambud added a commit that referenced this issue Aug 12, 2017
ambud added a commit that referenced this issue Aug 13, 2017
and diskstorage engine

Validating and fixing unit tests
ambud added a commit that referenced this issue Aug 13, 2017
windows)
ambud added a commit that referenced this issue Aug 13, 2017
adding unit tests for persistent measurement
ambud added a commit that referenced this issue Aug 14, 2017
ambud added a commit that referenced this issue Aug 14, 2017
ambud added a commit that referenced this issue Aug 14, 2017
ambud added a commit that referenced this issue Aug 15, 2017
Adding concurrency fixes #31
ambud (Member, Author) commented Aug 15, 2017

Pending items:

  • Write more consistency tests for the system

ambud added a commit that referenced this issue Aug 15, 2017
Fixing recovery bug
ambud added a commit that referenced this issue Aug 15, 2017
Fixing broken recovery system due to missing offsets, fixing unit tests
for recovery, adding parallel recovery for measurements
ambud added a commit that referenced this issue Aug 16, 2017
Improving test coverage
ambud added a commit that referenced this issue Aug 16, 2017
Fixing minor code issues
ambud closed this as completed in #66 Aug 16, 2017
ambud added a commit that referenced this issue Aug 16, 2017
ambud (Member, Author) commented Aug 16, 2017

Committed to development branch

ambud added a commit that referenced this issue Aug 20, 2017
fixing potential concurrency issue
ambud added a commit that referenced this issue Aug 23, 2017
* Adding file channel close as a temporary solution to max open files limit '#31'

* New storage engine implementation #31 (#66)

* Adding basic GRPC code for writes with disruptor (#67)

Fixing rpc build issue
ambud added a commit that referenced this issue Aug 23, 2017
Fixing pom issue
ambud added a commit that referenced this issue Aug 23, 2017
ambud added a commit that referenced this issue Aug 23, 2017
* Fixing javadoc and javadoc builds
ambud added a commit that referenced this issue Sep 6, 2017
* #60 and #74

Adding increment size

Refactoring code to make default methods in StorageEngine from the
implementations, fixing scripts.

Adding clustering implementations

Changing back to disktagindex

Adding Location Aware Routing Engine

Fixing javadocs

Adding graphite server with support for plaintext protocol #73

Fixing bad javadoc
ambud added a commit that referenced this issue Sep 6, 2017
* Clustering enhancements and Graphite Server (#75)

* [ci skip] Updating build-info file

* [ci skip]prepare release sidewinder-parent-0.0.27

* [ci skip]prepare for next development iteration
ambud added a commit that referenced this issue Sep 6, 2017
ambud added a commit that referenced this issue Feb 10, 2018
ambud added a commit that referenced this issue Feb 10, 2018
ambud added a commit that referenced this issue Feb 10, 2018
ambud added a commit that referenced this issue Feb 10, 2018
ambud added a commit that referenced this issue Feb 10, 2018
ambud added a commit that referenced this issue Feb 10, 2018
Projects: Core (Awaiting triage)