Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Return files modified by specified contributors #53206

Merged
merged 17 commits into from Jun 14, 2023
Merged

Conversation

gl-srgr
Copy link
Contributor

@gl-srgr gl-srgr commented Jun 9, 2023

A user requested being able to use author: keyword along with a pattern to search for file matches. Currently author keyword is valid for commit and diff search types. The use case is that a provided pattern will match a part of the file that was modified by the author and that matching line may or may not be part of the modified lines of the author's commit.

To support this we add a new predicate file:has.contributor(). Some characteristics of the predicate:

  • Contributor is an existing struct.
  • Regex is supported. We use same regex expression logic as author.
  • Case sensitivity can be applied for the contributor criteria of file:has.contributor() with the existing url parameter.
  • Negation is supported.
  • Multiple predicates will be AND'ed together by default if boolean operator not specified. This is how has.owner works, not sure if all of the predicates are aligned on this. General discrepancies are tracked and described in detail here.
  • Filtering results happens after pattern matches are streamed from child job (discussion. For this reason a user might not always find what they're looking for if the pattern matches only an older revision of a file and that older revision is not the revision streamed back from the child job. One of my proposals would resolve this by iterating through batches of commits with commitSearcher but I'm not sure about performance impact of this approach.

GH Issue

Test plan

new tests for FileHasContributorsJob

@cla-bot cla-bot bot added the cla-signed label Jun 9, 2023
@gl-srgr
Copy link
Contributor Author

gl-srgr commented Jun 9, 2023

I'll finish the tests tomorrow

@gl-srgr gl-srgr requested review from camdencheek and a team June 9, 2023 05:58
@camdencheek
Copy link
Member

For this reason a user might not always find what they're looking for if the pattern matches only an older revision of a file and that older revision is not the revision streamed back from the child job.

I'm not quite sure I'm following on this one.

Let's take an example query: repo:myrepo@oldrevision needle file:has.contributor(camden). If I only contributed after oldrevision, I would not expect this to return any results. This seems correct to me because the file result being returned is a file at a specific revision and the file at that revision has not had any contributions by me.

Does that sound right?

@gl-srgr
Copy link
Contributor Author

gl-srgr commented Jun 12, 2023

For this reason a user might not always find what they're looking for if the pattern matches only an older revision of a file and that older revision is not the revision streamed back from the child job.

I'm not quite sure I'm following on this one.

Let's take an example query: repo:myrepo@oldrevision needle file:has.contributor(camden). If I only contributed after oldrevision, I would not expect this to return any results. This seems correct to me because the file result being returned is a file at a specific revision and the file at that revision has not had any contributions by me.

Does that sound right?

That scenario with the query with the revision specified makes sense. I was describing a different use case that might not be something we need to support for this first iteration. It's a bit of a verbose explanation so if you want to discuss in a call let me know. In short I don't believe the below scenario needs to be supported, I just wanted to clarify the gap.

With this current implementation we know a file was modified by a set of contributors but we only visit the file contents at the revisions queried for. If the specific file the user wants only matches the pattern provided for an older revision (e.g. method was renamed/refactored since that older revision) then the file match won't be returned to the filter job in the first place. The user would need to provide a revision that matches the pattern. The commitSearcher approach could get around this because it could confirm not only that some user was a contributor with AuthorMatch, but additionally confirm the commit ID for when that particular user made their changes/worked on the file.

I believe the commitSearcher approach introduces some non trivial trade offs so instead of having to pursue that approach we can just use the post-filter approach in this PR and the users would just try out different tags/revisions for their query. For example, if the user's hazy memory only recalls modifying the file three years ago then pass whatever tags are the major releases at that time (this is in lieu of before: which commit/diff search has). I would expect a user could narrow down that file match in a few attempts.

@camdencheek
Copy link
Member

Got it! Thanks for explaining. Skipping that functionality for now sounds good to me.

@gl-srgr gl-srgr marked this pull request as ready for review June 13, 2023 20:32
@sourcegraph-bot
Copy link
Contributor

sourcegraph-bot commented Jun 13, 2023

Codenotify: Notifying subscribers in CODENOTIFY files for diff 6a0e88c...d7fb512.

Notify File(s)
@camdencheek internal/search/job/jobutil/BUILD.bazel
internal/search/job/jobutil/filter_file_contributor.go
internal/search/job/jobutil/filter_file_contributor_test.go
internal/search/job/jobutil/job.go
internal/search/query/predicate.go
internal/search/query/predicate_test.go
internal/search/query/types.go
@fkling client/shared/src/search/query/completion.test.ts
client/shared/src/search/query/completion.test.ts
client/shared/src/search/query/predicates.ts
client/shared/src/search/query/predicates.ts
@jtibshirani internal/search/job/jobutil/BUILD.bazel
internal/search/job/jobutil/filter_file_contributor.go
internal/search/job/jobutil/filter_file_contributor_test.go
internal/search/job/jobutil/job.go
internal/search/query/predicate.go
internal/search/query/predicate_test.go
internal/search/query/types.go
@keegancsmith internal/search/job/jobutil/BUILD.bazel
internal/search/job/jobutil/filter_file_contributor.go
internal/search/job/jobutil/filter_file_contributor_test.go
internal/search/job/jobutil/job.go
internal/search/query/predicate.go
internal/search/query/predicate_test.go
internal/search/query/types.go

@@ -514,7 +522,7 @@ func getPathRegexpsFromTextPatternInfo(patternInfo *search.TextPatternInfo) (pat

func computeFileMatchLimit(b query.Basic, p search.Protocol) int {
// Temporary fix:
// If doing ownership search, we post-filter results so we may need more than
// If doing ownership or contributor search, we post-filter results so we may need more than
Copy link
Contributor Author

@gl-srgr gl-srgr Jun 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought the missing fix described in this comment meant we wouldn't process count correctly and I started looking into implementing this. Then I realized our LimitJob wraps our fileHasContributorJob so count is handled, we just have some inefficiencies due to child limitJob.child (and related upstream jobs) having count:all.

I'm just following the own approach and setting count:all.

@sourcegraph-bot
Copy link
Contributor

sourcegraph-bot commented Jun 13, 2023

📖 Storybook live preview

@gl-srgr
Copy link
Contributor Author

gl-srgr commented Jun 13, 2023

@camdencheek PR is updated with code review feedback. Let me know if you want to discuss in a call. Thanks!

// We send one fetch contributors request per file path.
// We should quit early on context deadline exceeded.
if errors.Is(ctx.Err(), context.DeadlineExceeded) {
errs = errors.Append(errs, ctx.Err())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be protected by the mutex, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you're correct, will update

@gl-srgr
Copy link
Contributor Author

gl-srgr commented Jun 13, 2023

@camdencheek PR is updated with code review feedback. Let me know if you want to discuss in a call. Thanks!

Not sure about this build failure. I'll take a look rn. Also I forgot about changelog and docs so I'll add those today. You can hold off on review or just approve now (I don't believe the build failure is related to my code)

}

func (f *FileHasContributorPredicate) Unmarshal(params string, negated bool) error {
if _, err := regexp.Compile(params); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

syntax.Parse() is a bit cheaper than a full compile

@gl-srgr gl-srgr merged commit abf62a2 into main Jun 14, 2023
28 checks passed
@gl-srgr gl-srgr deleted the garyl/has_contributors branch June 14, 2023 13:24
ErikaRS pushed a commit that referenced this pull request Jun 22, 2023
A [user
requested](https://sourcegraph.slack.com/archives/C02UM8A9M2R/p1678812044400949)
being able to use `author:` keyword along with a pattern to search for
file matches. Currently `author` keyword is valid for commit and diff
search types. The use case is that a provided pattern will match a part
of the file that was modified by the author and that matching line **may
or may not** be part of the modified lines of the author's commit.

To support this we add a new predicate `file:has.contributor()`. Some
characteristics of the predicate:

- _Contributor_ is an existing
[struct](https://sourcegraph.com/github.com/sourcegraph/sourcegraph/-/blob/internal/gitserver/gitdomain/common.go?L208).
- Regex is supported. We use same regex expression
[logic](https://sourcegraph.com/github.com/sourcegraph/sourcegraph@2a63d0dc6ff8f520ff26a91c65b4a1366cf36712/-/blob/internal/gitserver/search/match_tree.go?L23-25)
as `author`.
- Case sensitivity can be applied for the contributor criteria of
`file:has.contributor()` with the existing url parameter.
- Negation is supported.
- Multiple predicates will be AND'ed together by default if boolean
operator not specified. This is how `has.owner` works, not sure if all
of the predicates are aligned on this. General discrepancies are tracked
and described in detail
[here](#50539).
- Filtering results happens **after** pattern matches are streamed from
child job
([discussion](#50880 (comment)).
For this reason a user might not always find what they're looking for if
the pattern matches only an older revision of a file and that older
revision is not the revision streamed back from the child job. [One of
my
proposals](#50880 (comment))
would resolve this by iterating through batches of commits with
`commitSearcher` but I'm not sure about performance impact of this
approach.

[GH Issue](#50880) 

## Test plan

new tests for `FileHasContributorsJob`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants