Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Structural Search: Modify revisions values to match on Zoekt branches #51076

Merged

Conversation

gl-srgr
Copy link
Contributor

@gl-srgr gl-srgr commented Apr 25, 2023

Issue: #50506

The issue manifests for a structural search for any revisions specified in the query that have both of the following:

  1. the revision is a commit and is the latest commit of an indexed branch in Zoekt
  2. the revision is not the branch name indexed in Zoekt (i.e. it is not a revision specified in experimentalFeatures.search.index.branches)

The following fix works for all cases where the indexed branch to query is HEAD. Therefore this will resolve the code insights use case since code insights repo filter does not accept revisions.

The fix in this PR is intended to be something we can ship quickly and therefore I hoped to avoid modifying existing logic as much as possible. For this reason we keep the SearchJob arguments and logic unchanged but just modify the repo revisions passed in.

Note that there is still a remaining issue where if we specify two revisions for our query and both of those revisions match both criteria described above, then we do not have enough information to determine which branch to use since we'll have two branches and neither of them have a map to their latest commit. To resolve this I believe we'd have to modify PartitionRepos() and/or IndexedRepoRevs which would be a bit more involved so I am planning to spin off to a separate ticket.

Test plan

New method has test cases added

@sourcegraph-bot
Copy link
Contributor

sourcegraph-bot commented Apr 25, 2023

Codenotify: Notifying subscribers in CODENOTIFY files for diff 614a0aa...4ab7a17.

Notify File(s)
@camdencheek internal/search/structural/structural.go
internal/search/zoekt/indexed_search.go
internal/search/zoekt/indexed_search_test.go
@jtibshirani internal/search/structural/structural.go
internal/search/zoekt/indexed_search.go
internal/search/zoekt/indexed_search_test.go
@keegancsmith internal/search/structural/structural.go
internal/search/zoekt/indexed_search.go
internal/search/zoekt/indexed_search_test.go

// Replacement is required typically because the revision is a commit that resolves to the latest indexed commit
// for a branch in Zoekt but the branch name in Zoekt (either "HEAD" or a value defined in experimentalFeatures.search.index.branches)
// is a values that is not the commit.
func (rb *IndexedRepoRevs) GetRepoRevsFromBranchRepos() map[api.RepoID]*search.RepositoryRevisions {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two checks missing from this new method:

  1. If len(branchRepos) > 1 then we should check which branch a given revision belongs to e.g. branch 1 has commits not in the commit history of branch 2, so we don't know where a revision belongs.
  2. Confirming a given zoektquery.BranchRepos contains the repo id from RepoRevs.

Re: (1.) we don't have this information available so this would require additional information, this is the additional scope I mentioned in the PR.

Re: (2.) I don't believe we need this check since a multiple repository input would usually still have all searches run against HEAD. There is still the case of specifying multiple commits with the rev: clause and one of those commits is the latest commit of a non-default indexed branch in which case we could potentially figure out which branch in branchRepos we should use based on the repo ID, but this is not guaranteed because we could still have multiple branches of the same repository in branchRepos. So I believe it would make more sense to do the work of the follow up ticket instead of adding this RepoID check here.

@gl-srgr gl-srgr requested a review from a team April 25, 2023 06:44
@@ -191,7 +191,8 @@ func (s *SearchJob) Run(ctx context.Context, clients job.RuntimeClients, stream

repoSet := []repoData{UnindexedList(unindexed)}
if indexed != nil {
repoSet = append(repoSet, IndexedMap(indexed.RepoRevs))
repoRevsFromBranchRepos := indexed.GetRepoRevsFromBranchRepos()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels like we are compensating for a bug in Zoekt or at least in PartitionRepos? If the query specifies the revision@HEAD, indexed in line 179 should contain that repo and the revision, shouldn't it?

Copy link
Contributor Author

@gl-srgr gl-srgr Apr 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the query specifies the revision@HEAD, indexed in line 179 should contain that repo and the revision, shouldn't it?

Correct, a query where we specify HEAD as our revision will already have that revision and branch values of HEAD in the indexed object, so this won't make any changes. For example

repo:^github\.com/sourcegraph/infrastructure$@HEAD

or

repo:^github\.com/sourcegraph/infrastructure$ rev:HEAD

In most cases GetRepoRevsFromBranchRepos won't make changes. The main case it does need to make changes is when the following are true for the revision specified in the query:

  1. the revision is a commit hash and is the latest commit of an indexed branch in Zoekt
  2. the revision is not the branch name indexed in Zoekt (i.e. it is not a revision specified in experimentalFeatures.search.index.branches)

In this case we'll pass the commit hash to searcher as the branch name for our branchRepos query against Zoekt, but there is no branch whose name is the commit hash so we get zero results. This is the reason why in the original issue the code insights shows 0 results just for the latest commit at the time of processing (see attached screencap). For this reason we do this check (which admittedly is a fix with the hope of not refactoring a lot of code, I'll comment on this in answering your other question) which will replace commit hash with either a branch name in indexed.branchRepos, or empty string. The branches in indexed.branchRepos were already confirmed to resolve as branches in Zoekt as part of PartitionRepos().

image

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels like we are compensating for a bug in Zoekt or at least in PartitionRepos?

I believe we are compensating for how the structural search job logic currently works. I believe Zoekt is working okay and PartitionRepos() works okay for non structural search queries. Those two prefer to have the branchRepos object (i.e. confirmed branches in Zoekt) be used to populate a zoekt BranchRepos query, e.g. here and here.

Structural search for indexed commit & branch is a bit different since we construct zoekt branchRepos query based on limited information, see here: we use the revision which could be a commit hash that zoekt can't map to an indexed branch name, as described in the previous scenario.

I'm open to investigating further and owning a refactor effort but I'm starting Escalation engineering rotation and am OOO later this week so that's why I'm leaning towards splitting this smaller scoped fix from the larger effort of fixing the branchRepos query construction logic. There's also a query that is still flawed which would be part of a larger refactor, which I mentioned in the PR description, but that use case will not show up in the code insights reported issue, so with this PR we can at least resolve the code insights bug.

Comment on lines +64 to +67
for k := range rb.branchRepos {
updated.Revs[i] = k
break
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
for k := range rb.branchRepos {
updated.Revs[i] = k
break
}
updated.Revs[i] = rb.branchRepos[0]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stefanhengl I went with the for loop + break because Goland gives me '0' (type untyped int) cannot be represented by the type string since it's a map with string keys.

Do you know any way around this besides the loop? Is my ide being silly?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup goland is correct since branchRepos is a map.

Copy link
Member

@keegancsmith keegancsmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is the wrong change. As you mention in your root cause the problem is we pass p.Branch as the branch name to zoekt. But p.Branch means something different to zoekt. IE this code here https://sourcegraph.com/github.com/sourcegraph/sourcegraph/-/blob/cmd/searcher/internal/search/search_structural.go?L318

That code should correctly use the IndexedRepoRevs struct to create the zoekt query.

Comment on lines +64 to +67
for k := range rb.branchRepos {
updated.Revs[i] = k
break
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup goland is correct since branchRepos is a map.

repoRevs := make(map[api.RepoID]*search.RepositoryRevisions, len(rb.RepoRevs))

for repoID, repoRev := range rb.RepoRevs {
updated := *repoRev
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this isn't actually a copy since repoRev.Revs is a slice. You need to do something ugly like

updated.Revs = append([]string{}, repoRev.Revs...)

I'd just explicitly declare the struct instead of dereferencing.

} else {
// if there are multiple branches then fall back to HEAD
// clear value to identify to zoekt to utilize branch HEAD regardless of repo ID
updated.Revs[i] = ""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this feels like the wrong assumption. Don't we need to find the branch with the same commit. If that doesn't exist, then use head.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current IndexedRepoRevs returned by the PartitionRepos() call does not allow us to map our revisions/commits to a specific branch in the case where the query specifies the commit hash. If there's only one branch in IndexedRepoRevs.branchRepos then we can make the assumption that I make in my new method. If IndexedRepoRevs has two branches because of multi branch indexing and two commit hash revisions then we can't map them without more information.

I have a few ideas in mind for resolving this and I'll elaborate in reply to your other PR comment

@gl-srgr
Copy link
Contributor Author

gl-srgr commented Apr 26, 2023

I think this is the wrong change. As you mention in your root cause the problem is we pass p.Branch as the branch name to zoekt. But p.Branch means something different to zoekt. IE this code here https://sourcegraph.com/github.com/sourcegraph/sourcegraph/-/blob/cmd/searcher/internal/search/search_structural.go?L318

That code should correctly use the IndexedRepoRevs struct to create the zoekt query.

The current IndexedRepoRevs would need additional information to do this successfully in the case where revisions map only has commit hash values for defining revisions and the branch map only has non-commit values as branch names e.g. HEAD, myFeatureBranch or 5.0. If each map has a single entry then I can make that connection but if there is more than one in each then I can't connect revisions to branches.

I decided to throw out this PR as a quicker solution after pairing with @camdencheek and talking about how to fix this branchRepos query construction for structural search. Since IndexedRepoRevs and PartitionRepos() gets used by non structural search operations it'd be be nice to only modify structural search parts of the code. That being said we're lacking the necessary information in the current IndexedRepoRevs to be 100% certain on which branch's latest commit hash matches one of the query's revisions, so we can pursue a solution that provides that certainty.

I'm open to shelving this PR and looking into what we need to add to IndexedRepoRevs and/or PartitionRepos() to make the missing information available.

I might not have time during my escalation engineering rotation to get a fix in. One benefit of this PR's fix is that the missing branch->revision mapping that we discussed does not impact the code insights use case because code insights does not accept revisions in its repo filter.

Copy link
Member

@keegancsmith keegancsmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll be honest, I didn't understand the problem from everything you have written down. I spent quite a bit of time trying to understand this and I think I get it now. I am going to say it back to you to see if I understand.

The root cause of the problem is structural search runs on searcher and the API for searcher takes in RepositoryRevisions. So we can't communicate down IndexedRepoRevs to the zoekt running on searcher.

To fix this problem you are introducing a function which creates a RepositoryRevision that will always work in Zoekt.

This feels like the correct trade-off vs changing the interface of searcher right now. However, the implementation looks very tricky and hard to justify as correct.

As such I am going to request changes for the following: change the implementation to only read from branchRepos. The attempt to try and use existing RepoRevs is part of what I think makes this so confusing? Or is trying to use RepoRevs needed for this to work correctly? Maybe to enrich commit info?

Either way I would make the high level logic follow the name of the function which is saying we are generating reporevs from branchrepos. Additionally maybe adjust the name to warn people to not use this. IE this should only be used as a hack for structural search.

Additionally can you make the doc string more succinct. I think the overly wordy docstring is contributing to confusion. Be explicit in saying that this exists because of searcher not being able to take this structure over its API.

Alternatively: can't we just run partitionrepos on searcher? No idea how insane that idea is.

@gl-srgr
Copy link
Contributor Author

gl-srgr commented Apr 27, 2023

The root cause of the problem is structural search runs on searcher and the API for searcher takes in RepositoryRevisions. So we can't communicate down IndexedRepoRevs to the zoekt running on searcher.

This aligns with my understanding, and we create a branchRepos zoekt query in searcher from the limited information of the branch name the client sends (which is just the revision), instead of using the branchRepos map constructed in PartitionRepos().

To fix this problem you are introducing a function which creates a RepositoryRevision that will always work in Zoekt.

That's correct; try to use a branchRepos branch name in lieu of a commit hash revision that is not a branch name.

As such I am going to request changes for the following: change the implementation to only read from branchRepos. The attempt to try and use existing RepoRevs is part of what I think makes this so confusing? Or is trying to use RepoRevs needed for this to work correctly? Maybe to enrich commit info?

We could likely figure out a way to use branchRepos. Using RepoRevs reduces the amount of changes since TextSearchJob iterates over each repo, and then iterates over each repo's set of revisions, sending one searcher request per repo+rev pair. Branch Repos alone doesn't gives us a means to iterate by repo but I can see us using the entire IndexedRepoRevs so we can read from branchRepos in TextSearchJob as you described. We already have an indexed field in TextSearchJob and so we could do these extra checks only for indexed search requests. There's another low level implementation detail for mapping branches to revisions to solve the multi-revision, single repo search query scenario I described earlier, but we can discuss that when I put it in a PR.

Alternatively: can't we just run partitionrepos on searcher? No idea how insane that idea is.

I haven't considered this. I'll think it over.

Some different solutions I had in mind while investigating this, in order of increasing effort/changes:

  1. modify revs with values we know work in Zoekt: what this PR tries to do.

  2. pass branchRepos along so we can read that as necessary: This is along the lines of your suggestion unless I misunderstood.

  3. generate branchRepos query directly from indexed.BranchRepos e.g. how we do it here and here. Regardless of whether we generate this in frontend or searcher we'd have to include a non trivial type in the request and I don't know what effort this would look like.

I'll ponder this some more and we can also discuss in a call next week

Copy link
Member

@keegancsmith keegancsmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am approving since this fixes the bug for structural search and only affects structural search. We want this to go out in the next patch release. On a call we had there was some suggestions for follow-up where we put the "structural hacks" into just the structural package to make it less coupled.

@gl-srgr gl-srgr merged commit 633c37b into main May 3, 2023
11 checks passed
@gl-srgr gl-srgr deleted the garyl/structural_search_uses_GetRepoRevsFromBranchRepos branch May 3, 2023 15:37
github-actions bot pushed a commit that referenced this pull request May 3, 2023
…#51076)

Issue: #50506

The issue manifests for a structural search for any revisions specified
in the query that have both of the following:
1) the revision is a commit and is the latest commit of an indexed
branch in Zoekt
2) the revision is not the branch name indexed in Zoekt (i.e. it is not
a revision specified in `experimentalFeatures.search.index.branches`)

The following fix works for all cases where the indexed branch to query
is `HEAD`. Therefore this will resolve the code insights use case since
code insights repo filter does not accept revisions.

The fix in this PR is intended to be something we can ship quickly and
therefore I hoped to avoid modifying existing logic as much as possible.
For this reason we keep the `SearchJob` arguments and logic unchanged
but just modify the repo revisions passed in.

Note that there is still a remaining issue where if we specify two
revisions for our query and both of those revisions match both criteria
described above, then we do not have enough information to determine
which branch to use since we'll have two branches and neither of them
have a map to their latest commit. To resolve this I believe we'd have
to modify
[`PartitionRepos()`](https://sourcegraph.com/github.com/sourcegraph/sourcegraph/-/blob/internal/search/zoekt/indexed_search.go?L177:6)
and/or
[`IndexedRepoRevs`](https://sourcegraph.com/github.com/sourcegraph/sourcegraph/-/blob/internal/search/zoekt/indexed_search.go?L38)
which would be a bit more involved so I am planning to spin off to a
separate ticket.

## Test plan

<!-- All pull requests REQUIRE a test plan:
https://docs.sourcegraph.com/dev/background-information/testing_principles
-->

New method has test cases added

(cherry picked from commit 633c37b)
unknwon pushed a commit that referenced this pull request May 3, 2023
… Zoekt branches (#51419)

Issue: #50506

The issue manifests for a structural search for any revisions specified
in the query that have both of the following:
1) the revision is a commit and is the latest commit of an indexed
branch in Zoekt
2) the revision is not the branch name indexed in Zoekt (i.e. it is not
a revision specified in `experimentalFeatures.search.index.branches`)

The following fix works for all cases where the indexed branch to query
is `HEAD`. Therefore this will resolve the code insights use case since
code insights repo filter does not accept revisions.

The fix in this PR is intended to be something we can ship quickly and
therefore I hoped to avoid modifying existing logic as much as possible.
For this reason we keep the `SearchJob` arguments and logic unchanged
but just modify the repo revisions passed in.

Note that there is still a remaining issue where if we specify two
revisions for our query and both of those revisions match both criteria
described above, then we do not have enough information to determine
which branch to use since we&#39;ll have two branches and neither of
them have a map to their latest commit. To resolve this I believe
we&#39;d have to modify
[`PartitionRepos()`](https://sourcegraph.com/github.com/sourcegraph/sourcegraph/-/blob/internal/search/zoekt/indexed_search.go?L177:6)
and/or
[`IndexedRepoRevs`](https://sourcegraph.com/github.com/sourcegraph/sourcegraph/-/blob/internal/search/zoekt/indexed_search.go?L38)
which would be a bit more involved so I am planning to spin off to a
separate ticket.

## Test plan

&lt;!-- All pull requests REQUIRE a test plan:
https://docs.sourcegraph.com/dev/background-information/testing_principles
--&gt;

New method has test cases added <br> Backport
633c37b from #51076

Co-authored-by: Gary Lee <105310278+gl-srgr@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants