New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Structural Search: Modify revisions values to match on Zoekt branches #51076
Structural Search: Modify revisions values to match on Zoekt branches #51076
Conversation
Codenotify: Notifying subscribers in CODENOTIFY files for diff 614a0aa...4ab7a17.
|
// Replacement is required typically because the revision is a commit that resolves to the latest indexed commit | ||
// for a branch in Zoekt but the branch name in Zoekt (either "HEAD" or a value defined in experimentalFeatures.search.index.branches) | ||
// is a values that is not the commit. | ||
func (rb *IndexedRepoRevs) GetRepoRevsFromBranchRepos() map[api.RepoID]*search.RepositoryRevisions { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are two checks missing from this new method:
- If
len(branchRepos) > 1
then we should check which branch a given revision belongs to e.g. branch 1 has commits not in the commit history of branch 2, so we don't know where a revision belongs. - Confirming a given
zoektquery.BranchRepos
contains the repo id fromRepoRevs
.
Re: (1.) we don't have this information available so this would require additional information, this is the additional scope I mentioned in the PR.
Re: (2.) I don't believe we need this check since a multiple repository input would usually still have all searches run against HEAD
. There is still the case of specifying multiple commits with the rev:
clause and one of those commits is the latest commit of a non-default indexed branch in which case we could potentially figure out which branch in branchRepos we should use based on the repo ID, but this is not guaranteed because we could still have multiple branches of the same repository in branchRepos
. So I believe it would make more sense to do the work of the follow up ticket instead of adding this RepoID check here.
@@ -191,7 +191,8 @@ func (s *SearchJob) Run(ctx context.Context, clients job.RuntimeClients, stream | |||
|
|||
repoSet := []repoData{UnindexedList(unindexed)} | |||
if indexed != nil { | |||
repoSet = append(repoSet, IndexedMap(indexed.RepoRevs)) | |||
repoRevsFromBranchRepos := indexed.GetRepoRevsFromBranchRepos() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels like we are compensating for a bug in Zoekt or at least in PartitionRepos
? If the query specifies the revision@HEAD, indexed
in line 179 should contain that repo and the revision, shouldn't it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the query specifies the revision@HEAD, indexed in line 179 should contain that repo and the revision, shouldn't it?
Correct, a query where we specify HEAD
as our revision will already have that revision and branch values of HEAD
in the indexed
object, so this won't make any changes. For example
repo:^github\.com/sourcegraph/infrastructure$@HEAD
or
repo:^github\.com/sourcegraph/infrastructure$ rev:HEAD
In most cases GetRepoRevsFromBranchRepos
won't make changes. The main case it does need to make changes is when the following are true for the revision specified in the query:
- the revision is a commit hash and is the latest commit of an indexed branch in Zoekt
- the revision is not the branch name indexed in Zoekt (i.e. it is not a revision specified in
experimentalFeatures.search.index.branches
)
In this case we'll pass the commit hash to searcher as the branch name for our branchRepos
query against Zoekt, but there is no branch whose name is the commit hash so we get zero results. This is the reason why in the original issue the code insights shows 0 results just for the latest commit at the time of processing (see attached screencap). For this reason we do this check (which admittedly is a fix with the hope of not refactoring a lot of code, I'll comment on this in answering your other question) which will replace commit hash with either a branch name in indexed.branchRepos
, or empty string. The branches in indexed.branchRepos
were already confirmed to resolve as branches in Zoekt as part of PartitionRepos()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels like we are compensating for a bug in Zoekt or at least in PartitionRepos?
I believe we are compensating for how the structural search job logic currently works. I believe Zoekt is working okay and PartitionRepos()
works okay for non structural search queries. Those two prefer to have the branchRepos object (i.e. confirmed branches in Zoekt) be used to populate a zoekt BranchRepos
query, e.g. here and here.
Structural search for indexed commit & branch is a bit different since we construct zoekt branchRepos query based on limited information, see here: we use the revision which could be a commit hash that zoekt can't map to an indexed branch name, as described in the previous scenario.
I'm open to investigating further and owning a refactor effort but I'm starting Escalation engineering rotation and am OOO later this week so that's why I'm leaning towards splitting this smaller scoped fix from the larger effort of fixing the branchRepos query construction logic. There's also a query that is still flawed which would be part of a larger refactor, which I mentioned in the PR description, but that use case will not show up in the code insights reported issue, so with this PR we can at least resolve the code insights bug.
for k := range rb.branchRepos { | ||
updated.Revs[i] = k | ||
break | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for k := range rb.branchRepos { | |
updated.Revs[i] = k | |
break | |
} | |
updated.Revs[i] = rb.branchRepos[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@stefanhengl I went with the for loop + break because Goland gives me '0' (type untyped int) cannot be represented by the type string
since it's a map with string keys.
Do you know any way around this besides the loop? Is my ide being silly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yup goland is correct since branchRepos is a map.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is the wrong change. As you mention in your root cause the problem is we pass p.Branch as the branch name to zoekt. But p.Branch means something different to zoekt. IE this code here https://sourcegraph.com/github.com/sourcegraph/sourcegraph/-/blob/cmd/searcher/internal/search/search_structural.go?L318
That code should correctly use the IndexedRepoRevs struct to create the zoekt query.
for k := range rb.branchRepos { | ||
updated.Revs[i] = k | ||
break | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yup goland is correct since branchRepos is a map.
repoRevs := make(map[api.RepoID]*search.RepositoryRevisions, len(rb.RepoRevs)) | ||
|
||
for repoID, repoRev := range rb.RepoRevs { | ||
updated := *repoRev |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this isn't actually a copy since repoRev.Revs is a slice. You need to do something ugly like
updated.Revs = append([]string{}, repoRev.Revs...)
I'd just explicitly declare the struct instead of dereferencing.
} else { | ||
// if there are multiple branches then fall back to HEAD | ||
// clear value to identify to zoekt to utilize branch HEAD regardless of repo ID | ||
updated.Revs[i] = "" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this feels like the wrong assumption. Don't we need to find the branch with the same commit. If that doesn't exist, then use head.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current IndexedRepoRevs
returned by the PartitionRepos()
call does not allow us to map our revisions/commits to a specific branch in the case where the query specifies the commit hash. If there's only one branch in IndexedRepoRevs.branchRepos
then we can make the assumption that I make in my new method. If IndexedRepoRevs
has two branches because of multi branch indexing and two commit hash revisions then we can't map them without more information.
I have a few ideas in mind for resolving this and I'll elaborate in reply to your other PR comment
The current I decided to throw out this PR as a quicker solution after pairing with @camdencheek and talking about how to fix this branchRepos query construction for structural search. Since I'm open to shelving this PR and looking into what we need to add to I might not have time during my escalation engineering rotation to get a fix in. One benefit of this PR's fix is that the missing branch->revision mapping that we discussed does not impact the code insights use case because code insights does not accept revisions in its repo filter. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll be honest, I didn't understand the problem from everything you have written down. I spent quite a bit of time trying to understand this and I think I get it now. I am going to say it back to you to see if I understand.
The root cause of the problem is structural search runs on searcher and the API for searcher takes in RepositoryRevisions. So we can't communicate down IndexedRepoRevs to the zoekt running on searcher.
To fix this problem you are introducing a function which creates a RepositoryRevision that will always work in Zoekt.
This feels like the correct trade-off vs changing the interface of searcher right now. However, the implementation looks very tricky and hard to justify as correct.
As such I am going to request changes for the following: change the implementation to only read from branchRepos. The attempt to try and use existing RepoRevs is part of what I think makes this so confusing? Or is trying to use RepoRevs needed for this to work correctly? Maybe to enrich commit info?
Either way I would make the high level logic follow the name of the function which is saying we are generating reporevs from branchrepos. Additionally maybe adjust the name to warn people to not use this. IE this should only be used as a hack for structural search.
Additionally can you make the doc string more succinct. I think the overly wordy docstring is contributing to confusion. Be explicit in saying that this exists because of searcher not being able to take this structure over its API.
Alternatively: can't we just run partitionrepos on searcher? No idea how insane that idea is.
This aligns with my understanding, and we create a
That's correct; try to use a
We could likely figure out a way to use branchRepos. Using
I haven't considered this. I'll think it over. Some different solutions I had in mind while investigating this, in order of increasing effort/changes:
I'll ponder this some more and we can also discuss in a call next week |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am approving since this fixes the bug for structural search and only affects structural search. We want this to go out in the next patch release. On a call we had there was some suggestions for follow-up where we put the "structural hacks" into just the structural package to make it less coupled.
…#51076) Issue: #50506 The issue manifests for a structural search for any revisions specified in the query that have both of the following: 1) the revision is a commit and is the latest commit of an indexed branch in Zoekt 2) the revision is not the branch name indexed in Zoekt (i.e. it is not a revision specified in `experimentalFeatures.search.index.branches`) The following fix works for all cases where the indexed branch to query is `HEAD`. Therefore this will resolve the code insights use case since code insights repo filter does not accept revisions. The fix in this PR is intended to be something we can ship quickly and therefore I hoped to avoid modifying existing logic as much as possible. For this reason we keep the `SearchJob` arguments and logic unchanged but just modify the repo revisions passed in. Note that there is still a remaining issue where if we specify two revisions for our query and both of those revisions match both criteria described above, then we do not have enough information to determine which branch to use since we'll have two branches and neither of them have a map to their latest commit. To resolve this I believe we'd have to modify [`PartitionRepos()`](https://sourcegraph.com/github.com/sourcegraph/sourcegraph/-/blob/internal/search/zoekt/indexed_search.go?L177:6) and/or [`IndexedRepoRevs`](https://sourcegraph.com/github.com/sourcegraph/sourcegraph/-/blob/internal/search/zoekt/indexed_search.go?L38) which would be a bit more involved so I am planning to spin off to a separate ticket. ## Test plan <!-- All pull requests REQUIRE a test plan: https://docs.sourcegraph.com/dev/background-information/testing_principles --> New method has test cases added (cherry picked from commit 633c37b)
… Zoekt branches (#51419) Issue: #50506 The issue manifests for a structural search for any revisions specified in the query that have both of the following: 1) the revision is a commit and is the latest commit of an indexed branch in Zoekt 2) the revision is not the branch name indexed in Zoekt (i.e. it is not a revision specified in `experimentalFeatures.search.index.branches`) The following fix works for all cases where the indexed branch to query is `HEAD`. Therefore this will resolve the code insights use case since code insights repo filter does not accept revisions. The fix in this PR is intended to be something we can ship quickly and therefore I hoped to avoid modifying existing logic as much as possible. For this reason we keep the `SearchJob` arguments and logic unchanged but just modify the repo revisions passed in. Note that there is still a remaining issue where if we specify two revisions for our query and both of those revisions match both criteria described above, then we do not have enough information to determine which branch to use since we'll have two branches and neither of them have a map to their latest commit. To resolve this I believe we'd have to modify [`PartitionRepos()`](https://sourcegraph.com/github.com/sourcegraph/sourcegraph/-/blob/internal/search/zoekt/indexed_search.go?L177:6) and/or [`IndexedRepoRevs`](https://sourcegraph.com/github.com/sourcegraph/sourcegraph/-/blob/internal/search/zoekt/indexed_search.go?L38) which would be a bit more involved so I am planning to spin off to a separate ticket. ## Test plan <!-- All pull requests REQUIRE a test plan: https://docs.sourcegraph.com/dev/background-information/testing_principles --> New method has test cases added <br> Backport 633c37b from #51076 Co-authored-by: Gary Lee <105310278+gl-srgr@users.noreply.github.com>
Issue: #50506
The issue manifests for a structural search for any revisions specified in the query that have both of the following:
experimentalFeatures.search.index.branches
)The following fix works for all cases where the indexed branch to query is
HEAD
. Therefore this will resolve the code insights use case since code insights repo filter does not accept revisions.The fix in this PR is intended to be something we can ship quickly and therefore I hoped to avoid modifying existing logic as much as possible. For this reason we keep the
SearchJob
arguments and logic unchanged but just modify the repo revisions passed in.Note that there is still a remaining issue where if we specify two revisions for our query and both of those revisions match both criteria described above, then we do not have enough information to determine which branch to use since we'll have two branches and neither of them have a map to their latest commit. To resolve this I believe we'd have to modify
PartitionRepos()
and/orIndexedRepoRevs
which would be a bit more involved so I am planning to spin off to a separate ticket.Test plan
New method has test cases added