
refactor: Fetcher filter and field optimization #1500

Merged: 30 commits into develop on Jun 27, 2023

Conversation

@jsimnz (Member) commented May 12, 2023

Relevant issue(s)

Resolves #490
Resolves #1582 (indirectly)

Description

This is a reduced version of #491. It takes a very different approach and tries to keep as much of the existing Fetcher structure as possible.

Basically, it eagerly ignores documents that don't pass the given filter at the fetcher level, which opens up various optimizations that weren't available when the filter was applied at the scanNode level, as before. A rough sketch of the resulting fetch loop is shown below.

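For readers skimming the PR, here is the rough shape of the new fetch loop. It is a simplified fragment, not the actual implementation: the receiver and the fetchNextKVs, decodeFilterFields, and runFilter helpers are hypothetical stand-ins for the real fetcher internals.

```go
// nextFilteredDoc returns the next document that passes the filter, skipping
// non-matching documents inside the fetcher instead of in scanNode.
func (f *fetcher) nextFilteredDoc() (*encodedDocument, error) {
	for {
		doc, err := f.fetchNextKVs() // scan the next document's key/values
		if err != nil || doc == nil {
			return nil, err // error, or no more documents
		}
		// Decode only the fields the filter needs, then evaluate eagerly.
		filterDoc, err := f.decodeFilterFields(doc)
		if err != nil {
			return nil, err
		}
		passed, err := f.runFilter(filterDoc)
		if err != nil {
			return nil, err
		}
		if passed {
			return doc, nil // only matching documents reach the planner
		}
	}
}
```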
There will be two follow-up tasks in future PRs:
A) Proper lexicographical encoding.
B) A proper datastore iterator instead of a query.

There are still some outstanding bugs and failing tests, but I wanted to get the reviews coming in on the core implementation while I continue to squash bugs, to parallelize the time spent.

Also, ignore the lingering Print statements left in for debugging; they will be removed before merging.

Tasks

  • I made sure the code is well commented, particularly hard-to-understand areas.
  • I made sure the repository-held documentation is changed accordingly.
  • I made sure the pull request title adheres to the conventional commit style (the subset used in the project can be found in tools/configs/chglog/config.yml).
  • I made sure to discuss its limitations such as threats to validity, vulnerability to mistake and misuse, robustness to invalidation of assumptions, resource requirements, ...
  • Remove debug print statements.

How has this been tested?

Manually, unit + integration tests, CI

Specify the platform(s) on which this was tested:

  • Ubuntu (WSL2)

@jsimnz jsimnz added area/datastore Related to the datastore / storage engine system area/db-system Related to the core system related components of the DB refactor This issue specific to or requires *notable* refactoring of existing codebases and components action/no-benchmark Skips the action that runs the benchmark. labels May 12, 2023
@jsimnz jsimnz added this to the DefraDB v0.5.1 milestone May 12, 2023
@jsimnz jsimnz self-assigned this May 12, 2023
@jsimnz jsimnz requested a review from AndrewSisley May 12, 2023 12:55
planner/scan.go Outdated
Comment on lines 135 to 190
// Keeping for reference as there is an outstanding
// question.
//
// Don't need to run filter here anymore since the
// fetcher will run the filter for us
//
// Question: How do we want to handle `info.filterMatches`
// now? @shahzadlone
/*
Member:

answer: Technically if len(n.currentValue.Fields) != 0 shouldn't the state still be the same (i.e. doc match was found)? It's not explicitly a filter match, but would be fine to keep it anywhere after the if and before the return true, nil.
Also, open to changing the name if you have any better ideas to represent a "successful document found while scanning / fetching, which matched / passed the filters"
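For illustration, one possible placement of the counter in scanNode.Next(), following the answer above (a sketch only: the execInfo field name is assumed from the surrounding discussion, and the return values mirror the existing scan node):

```go
// Sketch: the fetcher has already applied the filter, so a non-empty
// currentValue means a matching document was found.
if len(n.currentValue.Fields) == 0 {
	return false, nil
}
n.execInfo.filterMatches++ // counted after the length check, before returning
return true, nil
```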

@AndrewSisley (Contributor) left a comment:

Leaving some early comments as I know you are actively working on this and looking for feedback asap. As mentioned over discord, I really like the core approach.

@@ -115,6 +115,18 @@ func (sd SchemaDescription) GetFieldKey(fieldName string) uint32 {
return uint32(0)
}

// GetField returns the field of the given name.
func (sd SchemaDescription) GetField(name string) (FieldDescription, bool) {
Contributor:

todo: There is already such a function on CollectionDescription - I suggest refactoring that one to just call this one.

schemaFields map[uint32]client.FieldDescription
fields []*client.FieldDescription
filter *mapper.Filter
mapping *core.DocumentMapping
Contributor:

todo: This does not appear to be used and should be removed

@@ -178,14 +183,32 @@ func convertToInt(propertyName string, untypedValue any) (int64, error) {

// @todo: Implement Encoded Document type
type encodedDocument struct {
mapping *core.DocumentMapping
Contributor:

suggestion: This is one change that I am really not a fan of. I was under the impression that fetcher was something that 'transcended' the planner system, yet here we are making a planner/mapper type a property on a core fetcher type (instead of a param to a couple of planner-centric funcs).

IIRC the only reason DocumentMapping is in the core package is because it had to be referenced by fetcher.FetchNextDoc and encodedDocument.DecodeToDoc. It is very coupled to the planner system, and if you really want it as a prop on a fetcher package type, I would suggest that we instead move fetcher out of db and into planner - I don't like that very much either, but here we are increasing the level of coupling between the two.

Might be worth having a call about this one.

Member Author:

It's a hard call; I'd say you're both right and wrong here.

It's needed here, as you note, because of fetcher.FetchNextDoc, which returns a core.Doc type that is used by the planner. In reality, the DocumentMapping and core.Doc could be used anywhere, as they're conceptually only tied to the CollectionDescription.

I was almost going to hoist it into the description just to see what that would look like, as that would mean a core.Doc could be used anywhere. There's nothing about core.Doc that is specific to the planner. Its goal is to provide an efficient array-based representation of a document, instead of relying on a map[string]any everywhere. It's primarily used in the planner, but as there are other places where document handling needs to be refactored (#935), it could be the case that core.Doc ends up being used everywhere.

But that's a bigger conversation that we don't need to get into.

Bottom line: the fetcher now needs to be able to run a filter evaluation. It could use the old style of filter evaluation from before the mapper existed, but then we would be doing double work to decode the document - once into a map[string]any to run with the old version of parser.RunFilter, and once to return for FetchNextDoc.

Additionally, when the mapper was implemented, the connor package was migrated to use map[FilterKey]any instead of map[string]any, which is generated in mapper/targetable.go.

So it seems the fetcher is somewhat bound to the mapper system, due to the filter requirements from within it.

Open to suggestions.

One reason to avoid putting fetcher in planner is that it can be run entirely on its own, as you can see in db/fetcher_test.go, and from within the Collection API, which uses the fetcher.

Happy to have a call 📞

Contributor:

We spoke about this a while back, and I don't think we'll see eye-to-eye on this any time soon :) Author view trumps reviewer (ignoring CTO status ;)), so I'll mark this convo as resolved.

I have a plan for the fetcher that I want to try out in the near future anyway, which will pretty much bypass this conflict entirely if it works out.

}

func (encdoc *encodedDocument) decodeToDoc(filter bool) (core.Doc, error) {
if encdoc.mapping == nil {
Contributor:

related to my comment on the mapping prop, I see the fact that we have to check this here (in the hot/usage path) and potentially error as a symptom of a bad constructor/type.

Member Author:

The cost of checking this is minuscule compared to the cost of decoding and running the filter.

Same answer as my thoughts above: the fetcher now needs to run filters, filters need a core.Doc, and you can only get one by using a core.DocumentMapping.

If we use map[string]any to run the filter in the fetcher (like we did in pre-mapper days) we can remove the dependency on the mapping, but then we double the decoding work.

Contributor:

I'm much less worried about performance here - I flagged this as it is an error path that we do not/should not need in the 'doing' part of the code

Collaborator:

If we need to use a pointer's value within the function that doesn't define it, then we need to check that it's not nil. I feel like it makes sense to do that here.

Member Author:

I'm much less worried about performance here - I flagged this as it is an error path that we do not/should not need in the 'doing' part of the code

This doesn't address my point that we must do this here; I see no way around it. Moreover, this is based on how you wrote the mapper/filter/doc.

Tangentially, your objection to these x != nil checks comes up all the time in reviews, and we keep reminding you that it's as common and idiomatic as it gets for Go code.

Contributor:

This chain was only marked as a suggestion, so I guess we can assume that it is staying as is.

Tangentially, I think the insular dogma of 'idiomatic Go' is far worse than any deficiencies in the language, and it very much seems to prevent the language from evolving along with the rest of the programming profession. This problem has been solved for a very long time in pretty much every other major language, yet the Go community is locked by this dogma into relying on the old-fashioned 'everything that is a pointer is optional and you have no way of knowing if it really exists' mentality.

If you make this a function parameter it will never be nil, and that will be pretty obvious even if we refuse to unsubscribe from such dogma, as there are only a couple of references to it (the func and the parent func). Even if you keep the (dead) check in, it will be easier to read the code paths and see that it is dead code.

@jsimnz It felt like most of your comment was directed at me, not the suggestion itself - I hope I have explained why I raised this comment, and why I will continue to do so.

Member Author:

Even if it's a function parameter it can still be nil; that doesn't change anything.

In fact, it is a function parameter, just at a different function / call site to initialize the fetcher.

It should still be checked.

--

Regarding the idiomatic point: ultimately it doesn't matter what other languages do. This is Go, and Go convention wins. The language authors and core maintainers have been very explicit that the goals of the language are not to be fancy, and they will often ignore more modern PL techniques and research in an attempt at "simplicity" and clarity.

The pointer has to be checked regardless. Putting it in something like a homegrown result or option container doesn't change that, nor does making it a function argument.

My comment was directed at the historical nature of this thread, as it comes up often, and the response is often the same on both ends.

PS: word for word, my previous response was 50% directed at you lol. The first half of the message was at your suggestion, as I pointed out there's almost no way around having the doc mapper in the fetcher unless we either refactor the filter stuff or double the decoding work. Neither seems ideal compared to leaving it as is.

Member Author:

There's a reply here in the other comment thread you raised the issue on too, with my same points, from Friday.

#1500 (comment)

Contributor:

Resolving this convo, as it is going nowhere, and is of very low value lol. Leave as-is if preferred.

// we update the bitsets as we collect values
// by clearing the bit for the FieldID
filterSet *bitset.BitSet // filter fields
selectSet *bitset.BitSet // select fields
Contributor:

todo: This appears to be unused and should be removed.

Member Author:

It was used, but was removed, thanks for catching!

Collaborator:

Is this still relevant? I can't see if there were any changes as a result.

// allfields map[uint32]*client.FieldDescription

// static bitsets
filterSet *bitset.BitSet
Contributor:

todo: The usage of this property is not self-explanatory and needs code comments. I am very tired atm, but it should still be easier to understand than it currently is for me :)

Contributor:

praise: Documentation here is really good - thanks John :)

// 0 means we ignore the field
// we update the bitsets as we collect values
// by clearing the bit for the FieldID
filterSet *bitset.BitSet // filter fields
Contributor:

suggestion: I'm not a fan of the fact that this property is only ever used from outside of this type - to me it strongly suggests that it should be moved off the encodedDocument type and held somewhere else, but perhaps I will warm to it as my brain-fog fades and the documentation increases :)

nitpick: It also feels a bit odd that this (and doc, and the encProperty values in Properties) is a pointer.

Member Author:

Most of the properties of encodedDocument are directly accessed outside the type. It's really just a container type for holding state, and it's responsible for combining that state.

The doc filterSet could be on the fetcher directly, which would then have two sets, a "static" and a "dynamic" one, but as this is specific to the other fields within the encodedDocument type, it felt appropriate to keep it here. It also means that if I need to pass the doc around for any reason, the bitset goes with it.

Re nitpick: there's no good reason for the encProperty to be a pointer, but BitSet must be a pointer type as that's how it's defined in its package - the New method returns a pointer, all methods are defined on a pointer receiver, etc.

Contributor:

Most of the properties of encodedDocument are directly accessed outside the type. It's really just a container type for holding state, and it's responsible for combining that state.

Didn't realize you saw it that way - in that case I do think that this should have no funcs defined on it, as that means the type is used both as an object and as a data-struct, which I really dislike. Such a change is well out of scope regardless of what anyone thinks of it here, though. Convo resolved :)

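For readers following the thread, here is a minimal sketch of the bitset gating being discussed, using the bits-and-blooms bitset package. The PR's own comments describe clearing bits as values are collected; this sketch uses the equivalent set-then-compare form, and all names and field IDs here are illustrative rather than the actual DefraDB code:

```go
package main

import (
	"fmt"

	"github.com/bits-and-blooms/bitset"
)

func main() {
	const numFields = 8

	// "Static" set: the fields the filter needs (here, field IDs 1 and 3).
	filterSet := bitset.New(numFields)
	filterSet.Set(1)
	filterSet.Set(3)

	// Per-document set: updated as field values are collected during the scan.
	docSet := bitset.New(numFields)
	for _, fieldID := range []uint{1, 3} { // values arriving from the KV scan
		if filterSet.Test(fieldID) {
			docSet.Set(fieldID)
		}
	}

	// Once the sets match, every field the filter needs has been collected,
	// so the filter can be evaluated eagerly, before decoding the rest.
	if filterSet.Equal(docSet) {
		fmt.Println("all filter fields collected - run the filter now")
	}
}
```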
// seekKV will seek through results/iterator until it reaches
// the target key, or if the target key doesn't exist, the
// next smallest key that is greater than the target.
func (df *DocumentFetcher) seekKV(key string) (bool, *KeyValue, error) {
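As an illustration of the behaviour described by the comment above, here is a sketch against a hypothetical sorted iterator; the interface and function below are stand-ins, not the actual DefraDB datastore API:

```go
// iterator is a hypothetical stand-in for a sorted key/value iterator.
type iterator interface {
	Next() (key string, value []byte, ok bool)
}

// seek advances the iterator until it reaches the target key or, if the
// target doesn't exist, the next smallest key greater than the target.
func seek(it iterator, target string) (found bool, key string, value []byte) {
	for {
		k, v, ok := it.Next()
		if !ok {
			return false, "", nil // exhausted before reaching the target
		}
		if k >= target { // lexicographic comparison on string keys
			return true, k, v
		}
	}
}
```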
Contributor:

todo: It doesn't look like key needs to be a string, and it feels very strange that it is. Is this just because there is no easy-to-use std Go []byte comparator and you want to save immediate dev time? If so I think it is worth documenting that (and very-maybe an issue).

Member Author:

This was mostly just a conceptual-similarity choice, as atm the KV-level keys are treated as strings, but digging further into it there are benefits to using a string.

Firstly, the byte comparator is equivalent to the string comparator.

Also, at the moment core.DataStoreKey.Bytes() just calls []byte(datastorekey.ToString()), which results in an extra allocation anyway ("safe" conversion between strings and bytes allocates and copies).

At the moment there's just a lot of conversion back and forth between strings and bytes (and probably a few more). So until we get organized around proper binary serialization, it doesn't really matter which we use here. The one fact we do know is that datastorekey.ToString() is cheaper than datastorekey.Bytes(). So it's a toss-up.

Collaborator:

I think we can revisit this when we change the key encoding.

Contributor:

At the moment there's just a lot of conversion back and forth between strings and bytes (and probably a few more). So until we get organized around proper binary serialization, it doesn't really matter which we use here. The one fact we do know is that datastorekey.ToString() is cheaper than datastorekey.Bytes(). So it's a toss-up.

Ah fair enough, happy as-is then.

I think we can revisit this when we change the key encoding.

I do see that as quite tangential/separate, and wouldn't recommend doing those two tasks together.

}

// run filter
passed, err := mapper.RunFilter(filterDoc, df.filter)
Contributor:

suggestion: This at first glance looked like a horribly inefficient way of doing this, before I remembered that multiple fields might be involved. I think it is worth documenting this.

(Also, if for whatever reason you strongly disagree with that suggestion, please then remove the // run filter comment as it doesn't add any value - I assume you added it as you were planning out the implementation.)

Member Author:

There was a comment above the if docDone line that said // if docDone xD. Just my process - thanks for flagging the comment.

What seemed inefficient at first glance? It gets run anyway - previously in the scanner, now in the fetcher.

Contributor:

What seemed inefficient at first glance?

I honestly can't remember; I'll reply again if it comes back to me and this code is still relevant (just going through the old comments before reviewing the new code atm).

if err != nil {
return nil, err
}
passed, err := mapper.RunFilter(decodedDoc, df.filter)
Contributor:

todo: You appear to be running the filter twice when docDone is true, please remove that before merge :)

@fredcarle (Collaborator) commented May 13, 2023:

I struggle to see this. Can you point me to where the second run of the filter happens?

Edit: I looked at it some more and I think I see it.

@shahzadlone (Member) left a comment:

Skimmed high-level, I mostly really like the changes, much smaller and simpler than the previous fetcher stuff.

Shared my thoughts on the questions you asked.

Will give it a deeper second look soon.

@@ -25,6 +25,7 @@ const (
errVFetcherFailedToDecodeNode string = "(version fetcher) failed to decode protobuf"
errVFetcherFailedToGetDagLink string = "(version fetcher) failed to get node link from DAG"
errFailedToGetDagNode string = "failed to get DAG Node"
errMissingMap string = "missing document mapper"
Member:

nitpick:

Suggested change
errMissingMap string = "missing document mapper"
errMissingMapper string = "missing document mapper"

Collaborator:

+1

return false, err
}
// @question: Should this be after the length check?
n.execInfo.docFetches++
Member:

answer: This represents the attempts at fetching docs. Since the Fetch call comes before, and we do this after ensuring it was not a failure, it's fine where it is at the moment.

Member Author:

Commenting here for both this answer and your other comment.

Technically it means that docFetches and filterMatches will basically be the same value (or differ by at most 1).

Member:

Technically it means that docFetches and filterMatches will basically be the same value (or differ by at most 1).

I am going to talk from the perspective of what used to happen... The majority of the time, docFetches and iterations would be similar (because for every iteration we would attempt a fetch), but that's not strictly true. Also, docFetches and filterMatches will mostly NOT be the same, and their difference is not bounded by 1. It all depends on the number of docs and how many match the query. Consider these examples from our current execution explain tests.

Consider scan node execute results...

TestExecuteExplainRequestWithOrderFieldOnParent

"iterations":    uint64(3),
"docFetches":    uint64(3),
"filterMatches": uint64(2),

TestExecuteExplainMutationRequestWithCreate

"iterations":    uint64(1),
"docFetches":    uint64(1),
"filterMatches": uint64(1),

TestExecuteExplainMutationRequestWithDeleteUsingFilter

"iterations":    uint64(2),
"docFetches":    uint64(3),
"filterMatches": uint64(1),

TestExecuteExplainMutationRequestWithUpdateUsingFilter

"iterations":    uint64(4),
"docFetches":    uint64(6),
"filterMatches": uint64(2),

TestExecuteExplainRequestWithDocumentsButNoMatches

"iterations":    uint64(1),
"docFetches":    uint64(3),
"filterMatches": uint64(0),

For example, in the last case there were two documents: it fetched twice and neither matched the filter. The third fetch found no more docs, so the total attempts are 3, with 0 matches, and we entered the scan node once.

Fetcher update in this PR

I have to read the implementation in more detail to see if it is still possible to differentiate between the number of fetch attempts made until a match is found vs. the total matches found. We might even just need to return a variable with the result from the fetcher call if we want to keep tracking all the fetch attempts until a match, similar to before.

Member Author:

My whole point is that because of the changes in this PR, the behavior is now different. Previously, the scanner "fetched" a document from the fetcher, then the scanner was responsible for filtering.

One of the core changes in this PR is that we are moving the filtering from the scanner to the fetcher.

So keeping track of both filterFetches and filterMatches in the scanner is meaningless now.

Member:

Yup, makes sense. I did review the implementation in more depth, and it seems that in order to get that info now the fetcher logic would have to store that metric and then communicate it back to the scanner. IMO let's keep the metrics we collect scoped to planner nodes only; the ones we'd have to bend over backwards for in a more complicated manner outside the plan nodes, I would skip for now.
tldr: we can just remove/skip the metric that isn't as relevant now.


for _, field := range col.Schema.Fields {
df.schemaFields[uint32(field.ID)] = field
df.selectFields = make(map[uint32]client.FieldDescription, len(fields))
// @todo: We can probably make this size more accurate, this is an upper bound
Collaborator:

question: How would you make this more accurate? You seem to be using all the allocated KVs in the for loop.

Member Author:

The @todo here is old; the line it refers to has been removed since opening the PR xD

Start(ctx context.Context, txn datastore.Txn, spans core.Spans) error
FetchNext(ctx context.Context) (*encodedDocument, error)
FetchNextDecoded(ctx context.Context) (*client.Document, error)
FetchNextDoc(ctx context.Context, mapping *core.DocumentMapping) ([]byte, core.Doc, error)
Close() error
}

// KeyValue is a KV store response containing the resulting core.Key and byte array value.
type KeyValue struct {
Collaborator:

question: Does this need to be public? Asking because it doesn't seem to be used outside of this file.

Collaborator:

Nevermind. DocumentFetcher.KV() returns this type.

Member Author:

It doesn't need to be public. The NextKV, NextKey, and ProcessKV public methods don't need to exist either; they're old methods mostly used for debugging and can be removed, so KeyValue can be made private.

Collaborator:

Should we open a ticket to action this in the future?

Member Author:

Implemented now

db/fetcher/fetcher.go (outdated; conversation resolved)
@islamaliev (Contributor) left a comment:

I assume our fetcher is already somewhat covered with tests, but I see quite a lot of new or updated logic here (filtering, mapping, seeking...). Isn't it supposed to be covered as well?

@@ -27,6 +28,10 @@ type encProperty struct {
Desc client.FieldDescription
Raw []byte

// Filter flag to determine if this flag
// is needed for eager filter evaluation
Filter bool
Contributor:

question: don't we want to stick to the "normal" convention of naming bool vars with prefixes like "Is", "Has"...? This one could be named ShouldFilter.
The current name is surely shorter, but reading if prop.Filter { feels a bit ambiguous (at least to my eye).
Maybe it's still my C++ vision, where you can if on basically any type :)

Collaborator:

+1

Collaborator:

Just bringing your attention to this one John.

if encdoc.doc.Fields[prop.Desc.ID] != nil { // used cached decoded fields
continue
}
if filter && prop.Filter != filter { // only get filter fields if filter=true
Contributor:

nitpick: could write if filter && !prop.Filter {

Collaborator:

+1

codecov bot commented Jun 10, 2023

Codecov Report

Patch coverage: 79.53% and project coverage change: +0.18% 🎉

Comparison is base (4fe32de) 73.47% compared to head (b45633a) 73.65%.


@@             Coverage Diff             @@
##           develop    #1500      +/-   ##
===========================================
+ Coverage    73.47%   73.65%   +0.18%     
===========================================
  Files          188      188              
  Lines        19320    19608     +288     
===========================================
+ Hits         14195    14442     +247     
- Misses        4073     4102      +29     
- Partials      1052     1064      +12     
Flag Coverage Δ
all-tests 73.65% <79.53%> (+0.18%) ⬆️


Impacted Files Coverage Δ
core/data.go 91.61% <ø> (ø)
db/fetcher/errors.go 20.00% <ø> (ø)
planner/mapper/errors.go 100.00% <ø> (ø)
planner/top.go 68.53% <ø> (ø)
planner/planner.go 76.70% <40.00%> (+0.31%) ⬆️
request/graphql/parser/filter.go 51.61% <61.54%> (+3.34%) ⬆️
planner/type_join.go 75.19% <71.43%> (+0.32%) ⬆️
db/fetcher/encoded_doc.go 62.73% <77.42%> (+0.32%) ⬆️
planner/scan.go 87.70% <78.38%> (-2.37%) ⬇️
db/fetcher/fetcher.go 69.58% <78.57%> (+5.78%) ⬆️
... and 12 more

... and 2 files with indirect coverage changes



Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data

encdoc.Key = nil
if encdoc.mapping != nil {
Contributor:

suggestion: This is a bit odd without a New function. Here mapping should never be nil, no? Yet at the same time, if it ever is, the doc property will never be cleared by reset, leaving this instance in an invalid state.

I'd suggest adding a New(...) EncodedDocument func that ensures mapping is never nil, or, if for whatever reason you really don't want that, adding an else clause that just sets doc to nil.

Member Author:

Wanted a Reset instead of a New so it keeps using the same allocated/defined struct on the fetcher.

As for the else, the doc will already be nil if the mapping is also nil, and the mapping never gets updated once it's set at the beginning, so it's either nil or not for the whole lifetime. I can add it if you prefer, but it's not a code path that will add any meaning/safety imo.

Contributor:

Wanted a Reset instead of a New so it keeps using the same allocated/defined struct on the fetcher.

I think you misunderstand why I mentioned New: by declaring a constructor you are declaring a restricted initial state. There you can enforce that mapping is not nil, and reset loses a dead code branch (the implicit else), as well as the possibility of the invalid doc state that I mentioned. New can even call Reset internally for a stronger initial-state guarantee/description if you want.

Collaborator:

The way the fields of encodedDocument are handled in the fetcher brings me to the conclusion that a Reset function makes more sense than a New function.

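For illustration, here is a minimal sketch of the constructor-plus-Reset approach being debated above, with simplified, hypothetical types standing in for the real ones:

```go
package fetcher

import "errors"

// documentMapping is a simplified stand-in for core.DocumentMapping.
type documentMapping struct{}

// encodedDocument holds per-document state; mapping is required for decoding.
type encodedDocument struct {
	mapping *documentMapping
	key     []byte
	// ... other collected state
}

// newEncodedDocument enforces a non-nil mapping at construction time, so the
// decode path never has to re-check it.
func newEncodedDocument(mapping *documentMapping) (*encodedDocument, error) {
	if mapping == nil {
		return nil, errors.New("missing document mapper")
	}
	doc := &encodedDocument{mapping: mapping}
	doc.Reset()
	return doc, nil
}

// Reset clears the per-document state while reusing the same allocation,
// which is the behaviour the author wants to keep on the fetcher.
func (d *encodedDocument) Reset() {
	d.key = nil
	// ... clear other collected state; the mapping is left untouched
}
```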
)

// Fetcher is the interface for collecting documents from the underlying data store.
// It handles all the key/value scanning, aggregation, and document encoding.
type Fetcher interface {
Init(col *client.CollectionDescription, fields []*client.FieldDescription, reverse bool, showDeleted bool) error
Init(col *client.CollectionDescription, fields []client.FieldDescription,
Contributor:

todo: Breaking the params onto multiple lines, but still with multiple params per line, is really quite hard to read. I'd much prefer each param to have its own line, and not share one with the func name or brackets:

foo(
    a int,
    b int,
    ...
)
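Applied to the existing Init signature quoted above (the new signature in this PR adds parameters that are cut off in the excerpt, so only the old ones are shown), that layout would look roughly like:

```go
Init(
	col *client.CollectionDescription,
	fields []*client.FieldDescription,
	reverse bool,
	showDeleted bool,
) error
```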

Member Author:

Yeah, I didn't like it when I was doing it, but I also hate one-param-per-line for interface definitions. I guess I'll have to, though, since both alternatives suck more 🙃

// only run filter if we've collected all the fields
// required for filtering. This is tracked by the bitsets.
if df.filterSet.Equal(df.doc.filterSet) {
df.ranFilter = true
Contributor:

todo: Minor, this looks like it is set in the wrong location - if errors are produced ranFilter will be true, yet the filter may not have been run. I think this should be set after the error check for mapper.RunFilter (currently between ln 539 and 540)

Member Author:

Ahh, yeah, it should be. I don't think it makes a difference since it will error out anyway, but for sanity 👍

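Concretely, the agreed reordering would look roughly like the sketch below; the return values and the passedFilter field are assumptions, while the RunFilter call mirrors the one quoted earlier on this page:

```go
// Only run the filter once every field it needs has been collected,
// and only mark it as run after RunFilter succeeds.
if df.filterSet.Equal(df.doc.filterSet) {
	passed, err := mapper.RunFilter(decodedDoc, df.filter)
	if err != nil {
		return nil, err
	}
	df.ranFilter = true
	df.passedFilter = passed // hypothetical field holding the result
}
```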
@@ -43,7 +43,7 @@ func (p *Planner) getCollectionScanPlan(parsed *mapper.Select) (planSource, erro
}

scan := p.Scan(parsed)
err = scan.initCollection(colDesc)
err = scan.initCollection(colDesc) // @todo: Should `initCollection` be merged into `p.Scan(...)`
Contributor:

todo: I have a very strong preference for this todo to be done. I guess it is technically out of scope, but initCollection is amusingly horrible (presumably from a time when it was unclear whether it would be shared/re-inited), and it would be a quick todo.

Failing that, please make sure that the todo is either removed or bound to a ticket before you merge this PR.

Member Author:

👍

@jsimnz jsimnz force-pushed the jsimnz/refactor/fetcher-filter-selection-v2 branch from 9ca4ee6 to 8efc47c Compare June 27, 2023 19:51
@jsimnz jsimnz force-pushed the jsimnz/refactor/fetcher-filter-selection-v2 branch from 481162c to b45633a Compare June 27, 2023 20:52
@jsimnz jsimnz dismissed AndrewSisley’s stale review June 27, 2023 21:40

Resolved comments and todos

@jsimnz jsimnz merged commit 6c0c5bd into develop Jun 27, 2023
12 checks passed
@jsimnz jsimnz deleted the jsimnz/refactor/fetcher-filter-selection-v2 branch June 27, 2023 21:41
shahzadlone pushed a commit to shahzadlone/defradb that referenced this pull request Feb 23, 2024

Successfully merging this pull request may close these issues.

  • groupBy item in result set even when not requested
  • Fetcher field selection and filter optimization
5 participants