Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Make queries utilise secondary indexes #1925

Merged
merged 99 commits into from
Oct 13, 2023

Conversation

islamaliev
Copy link
Contributor

@islamaliev islamaliev commented Oct 4, 2023

Relevant issue(s)

Resolves #1555

Description

With this change the the secondary indexes are utilised during querying data.

A dedicated Indexer fetcher is implemented to perform fetching of values of indexed fields.

Now there is a separate filter package that houses a lot of methods for working with filters.

A new metric indexesFetched is introduced into @explain to provide information about how many indexes has been fetched.

It also includes an update to the testing framework to allow adding custom asserters.
The new ExplainResultsAsserter is used with this new feature.

Benchmark

I ran some query benchmarks with regular fetching and indexed fetching on 10k docs and here is the summary:

Time per Operation (ns/op)
Indexed: Took approximately 2,196,481 nanoseconds (~2.2 ms) per operation
Original: Took approximately 74,037,628 nanoseconds (~74 ms) per operation
Boost Factor: Indexed is approximately 33.7 times faster than the original.

Memory Allocation (B/op)
Indexed: Allocated approximately 1,053,403 bytes (~1 MB) per operation
Original: Allocated approximately 45,218,327 bytes (~45 MB) per operation
Boost Factor: Indexed uses approximately 42.9 times less memory than the original.

Number of Allocations (allocs/op)
Indexed: Made 13,331 memory allocations per operation
Original: Made 494,494 memory allocations per operation
Boost Factor: Indexed has approximately 37.1 times fewer memory allocations than the original.

Summary
The indexed approach is 33.7 times faster, uses 42.9 times less memory, and has 37.1 times fewer memory allocations compared to the original approach.

Tasks

  • I made sure the code is well commented, particularly hard-to-understand areas.
  • I made sure the repository-held documentation is changed accordingly.
  • I made sure the pull request title adheres to the conventional commit style (the subset used in the project can be found in tools/configs/chglog/config.yml).
  • I made sure to discuss its limitations such as threats to validity, vulnerability to mistake and misuse, robustness to invalidation of assumptions, resource requirements, ...

How has this been tested?

Integration tests

Specify the platform(s) on which this was tested:

  • MacOS

@islamaliev islamaliev self-assigned this Oct 4, 2023
@islamaliev islamaliev added feature New feature or request area/query Related to the query component perf Performance issue or suggestion labels Oct 4, 2023
@islamaliev islamaliev added this to the DefraDB v0.8 milestone Oct 4, 2023
Copy link
Contributor

@AndrewSisley AndrewSisley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

todo: The purpose of indexes is performance gains on read, at the cost of performance on write (and storage space). I really think performance tests should be included in this PR.

I would like to see performance tests for non-indexed vs indexed queries, on both read and write. Without these we are not really testing that this code change is doing anything useful, and could be seen as an explain-feature or a simple refactor.

praise: What I have seen so far of the production code looks good, but I have left a few comments dotted about. Will finish my review a bit later once some of them have been addressed.

db/fetcher/indexer.go Show resolved Hide resolved
planner/select.go Show resolved Hide resolved
planner/scan.go Outdated Show resolved Hide resolved
planner/planner.go Outdated Show resolved Hide resolved
@islamaliev
Copy link
Contributor Author

I would like to see performance tests for non-indexed vs indexed queries, on both read and write. Without these we are not really testing that this code change is doing anything useful, and could be seen as an explain-feature or a simple refactor.

Do you mean adding tests to tests/bench?

@AndrewSisley
Copy link
Contributor

Do you mean adding tests to tests/bench?

Adding to tests/bench would probably be the simplest for now, we could also assert on one being faster than the other (with appropriate protections against variation).

At the moment there is nothing that really tests the behaviour change that the users care about (query speed).

@islamaliev islamaliev force-pushed the feat/islam/query-with-secondary-indexes branch from 8b682c0 to 189f586 Compare October 5, 2023 15:46
@codecov
Copy link

codecov bot commented Oct 9, 2023

Codecov Report

Attention: 121 lines in your changes are missing coverage. Please review.

Comparison is base (bc4c704) 74.67% compared to head (869f867) 74.98%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #1925      +/-   ##
===========================================
+ Coverage    74.67%   74.98%   +0.30%     
===========================================
  Files          234      241       +7     
  Lines        23044    23616     +572     
===========================================
+ Hits         17208    17707     +499     
- Misses        4661     4709      +48     
- Partials      1175     1200      +25     
Flag Coverage Δ
all-tests 74.98% <86.34%> (+0.30%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
client/index.go 100.00% <100.00%> (ø)
datastore/errors.go 100.00% <100.00%> (ø)
db/collection_index.go 97.24% <100.00%> (+0.60%) ⬆️
db/fetcher/fetcher.go 76.54% <100.00%> (+0.23%) ⬆️
db/index.go 97.25% <100.00%> (-0.35%) ⬇️
errors/defraError.go 100.00% <100.00%> (ø)
planner/datasource.go 83.33% <100.00%> (-8.33%) ⬇️
planner/filter/complex.go 100.00% <ø> (ø)
planner/filter/copy_field.go 96.36% <100.00%> (+1.01%) ⬆️
planner/filter/extract_properties.go 100.00% <100.00%> (ø)
... and 14 more

... and 7 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bc4c704...869f867. Read the comment docs.

@islamaliev islamaliev force-pushed the feat/islam/query-with-secondary-indexes branch from 4ad4751 to 4323b69 Compare October 10, 2023 14:14
Copy link
Contributor

@AndrewSisley AndrewSisley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Submitting a few comments now, as they are pilling up a bit and I need a breather.

Overall it looks good, but the tests in particular need so work I think.

client/index.go Outdated Show resolved Hide resolved
db/fetcher/indexer_iterators.go Outdated Show resolved Hide resolved
db/fetcher/indexer_iterators.go Outdated Show resolved Hide resolved
db/fetcher/indexer_iterators.go Outdated Show resolved Hide resolved
db/fetcher/indexer_iterators.go Outdated Show resolved Hide resolved
tests/bench/fixtures/fixtures.go Outdated Show resolved Hide resolved
tests/bench/fixtures/fixtures.go Outdated Show resolved Hide resolved
tests/integration/utils2.go Outdated Show resolved Hide resolved
tests/integration/index/query_performance_test.go Outdated Show resolved Hide resolved
datastore/util.go Outdated Show resolved Hide resolved
datastore/util.go Outdated Show resolved Hide resolved
datastore/util.go Outdated Show resolved Hide resolved
Copy link
Collaborator

@fredcarle fredcarle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Nice work Islam. Just one minor thought on the extra param ion the bench tests.

Copy link
Collaborator

@fredcarle fredcarle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Nice work Islam. Just one minor thought on the extra param ion the bench tests.

Note: Make sure all conversations with Andy are resolved before merging.

Copy link
Contributor

@AndrewSisley AndrewSisley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, please resolve the remaining conversations before merge though.

I have a request for future PRs too - when resolving a conversation, can you please comment in the conversation/thread what you have done to resolve it. It makes it easier to review, as I don't have to then go and re-find the location (which can sometimes be tricky due to github/other changes), and then figure out what has been done to resolve it (also sometimes tricky and occasionally requiring guesswork).

@islamaliev islamaliev merged commit c8bde64 into develop Oct 13, 2023
30 checks passed
@islamaliev islamaliev deleted the feat/islam/query-with-secondary-indexes branch October 13, 2023 22:26
nasdf added a commit to nasdf/defradb that referenced this pull request Oct 13, 2023
commit c8bde64
Author: Islam Aliev <aliev.islam@gmail.com>
Date:   Sat Oct 14 00:26:22 2023 +0200

    feat: Make queries utilise secondary indexes (sourcenetwork#1925)

    ## Relevant issue(s)

    Resolves sourcenetwork#1555

    ## Description

    With this change the the secondary indexes are utilised during querying
    data.

    A dedicated `Indexer` fetcher is implemented to perform fetching of
    values of indexed fields.

    Now there is a separate `filter` package that houses a lot of methods
    for working with filters.

    A new metric `indexesFetched` is introduced into `@explain` to provide
    information about how many indexes has been fetched.

    It also includes an update to the testing framework to allow adding
    custom asserters.
    The new ExplainResultsAsserter is used with this new feature.
shahzadlone pushed a commit to shahzadlone/defradb that referenced this pull request Feb 23, 2024
## Relevant issue(s)

Resolves sourcenetwork#1555

## Description

With this change the the secondary indexes are utilised during querying
data.

A dedicated `Indexer` fetcher is implemented to perform fetching of
values of indexed fields.

Now there is a separate `filter` package that houses a lot of methods
for working with filters.

A new metric `indexesFetched` is introduced into `@explain` to provide
information about how many indexes has been fetched.

It also includes an update to the testing framework to allow adding
custom asserters.
The new ExplainResultsAsserter is used with this new feature.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/query Related to the query component feature New feature or request perf Performance issue or suggestion
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Sec. Indexes: Make queries use indexes for fetching data
5 participants