Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

diff ingester: Allow configurable wildcards to match ecosystem file names, plus some cleanups #975

Merged
merged 5 commits into from
Sep 25, 2023

Conversation

jhrozek
Copy link
Contributor

@jhrozek jhrozek commented Sep 18, 2023

I started adding wildcard support for the package ecosystem matching in the
diff ingester and ended up doing a little cleanup after the code had been
rushed a bit to me demoable. I ended up adding pagination support, removed
some code that was just needed after a refactoring and added the wildcard
support with a basic test. More tests are incoming, there was only so much
time on the plane :-)

Fixes: #954

Copy link
Contributor

@JAORMX JAORMX left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dig the cleanup and tests, but let's move the github dependencies to one place as we prepare for the pluggable providers to land


"github.com/google/go-github/v53/github"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we not use the github API directly? Instead, let's include it in the github provider implementation. This way, we'll be able to better abstract this in the future.

Copy link
Contributor Author

@jhrozek jhrozek Sep 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not the code calling the GH API directly, but just this line:

	listOpts := &github.ListOptions{PerPage: 30}

I can modify the ListFiles method to accept a generic parameter struct instead, that would get rid of the import, but note that the REST API currently returns GH specific types and there are other methods that accept GH specific types so maybe a bigger cleanup is in order?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are indeed gonna need a bigger cleanup

if err != nil {
return nil, fmt.Errorf("error getting pull request files: %w", err)
}
logger := zerolog.Ctx(ctx).With().
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have a bug for migrating everything to zero log?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so, I think @evankanderson pointed me to code that showed that calling log triggers calling zerolog already? So far I've been just adding zerolog calls to code that I'm writing from scratch or cleaning up, but I don't think we track any systematical cleanup.

evankanderson
evankanderson previously approved these changes Sep 18, 2023
continue
listOpts := &github.ListOptions{PerPage: 30}
for {
prFiles, resp, err := di.cli.ListFiles(ctx, pr.RepoOwner, pr.RepoName, int(pr.Number), listOpts)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit Long term, it would be nice to be able to use a channel to feed the results from reading the paginated list, but this seems fine for now. (Like using a generator in some other languages.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to use channels in the vulnerability evaluator to have a pipeline between the database and the package repository, there is actually seems like it will cut down on the wait time quite a bit and even simplify the code. If you don't mind, in this module I would wait and implement that later (maybe we implement the pagination in the REST API module by returning a channel?)

For the moment a simple loop seems fine because 99% of the list results would be just discarded right away. (most PRs don't add deps..)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, my thought was that later we could implement lazy pagination (rather than the eager pagination here) by pulling from a channel and having the channel's capacity block further pagination calls.

Don't do this before November, though. We have bigger fish to fry.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and more fish :-)
I filed #996 in the meantime linking back here so we are reminded of this discussion later.

Comment on lines 88 to 89
eco := di.getEcosystemForFile(*file.Filename)
if eco == DepEcosystemNone {
logger.Debug().
Str("filename", *file.Filename).
Msg("No ecosystem found, skipping")
continue
}

logger.Debug().
Str("filename", *file.Filename).
Str("package-ecosystem", string(eco)).
Msg("No ecosystem found, skipping")

parser := newEcosystemParser(eco)
if parser == nil {
return nil, fmt.Errorf("no parser found for ecosystem %s", eco)
}

depBatch, err := parser(*file.Patch)
if err != nil {
return nil, fmt.Errorf("error parsing file %s: %w", *file.Filename, err)
}

for i := range depBatch {
dep := depBatch[i]
allDiffs = append(allDiffs, &pb.PrDependencies_ContextualDependency{
Dep: dep,
File: &pb.PrDependencies_ContextualDependency_FilePatch{
Name: file.GetFilename(),
PatchUrl: file.GetRawURL(),
},
})
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to extract this into a separate function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done locally, will push after a round of testing

Comment on lines 96 to 99
logger.Debug().
Str("filename", *file.Filename).
Str("package-ecosystem", string(eco)).
Msg("No ecosystem found, skipping")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you mean to have this message here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops, I fixed this already locally, but /in a different branch/ (I have already been building more tests atop this other branch)
I did mean to leave a debug message there, but it shouldn't say that no ecosystem matched..

Comment on lines 102 to 104
if parser == nil {
return nil, fmt.Errorf("no parser found for ecosystem %s", eco)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need both this and the eco == DepEcosystemNone?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess not at the moment and not strictly needed, but at the same time I think it's better to check each call's return value (or alternatively we might want to encapsulate the two calls into one function)

allDiffs = append(allDiffs, &pb.PrDependencies_ContextualDependency{
Dep: dep,
File: &pb.PrDependencies_ContextualDependency_FilePatch{
Name: file.GetFilename(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we use file.GetFilename() here, but *file.Filename on lines 88, 91, and 97?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no reason except inconsistency. Thanks, fixed.

Comment on lines 142 to 147
if strings.HasPrefix(ecoMapping.Depfile, wildcard) {
strippedDepfile := strings.TrimPrefix(ecoMapping.Depfile, wildcard)
if lastComponent == strippedDepfile {
return DependencyEcosystem(ecoMapping.Name)
}
} else if filename == ecoMapping.Depfile {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really understand what this is supposed to do -- it feels like this should be a glob (and maybe matching with regexp), but the code doesn't look like the code I'd expect for a glob.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the code was difficult to parse, then please block the PR merge that's what reviews are for :-)

Does the current version with filepath.Match read better?
(I wasn't sure initially about fifepatch.Match matching in the middle of the path, seemed like a footgun, but in the end the code is readable and more generic with Match...)

internal/engine/ingester/diff/diff_test.go Show resolved Hide resolved
repeated FilePatch patches = 6; // The list of files changed in the PR. Does not include file contents

int64 author_id = 7; // The author of the PR, will be used to check if we can request changes
int64 author_id = 6; // The author of the PR, will be used to check if we can request changes
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intentional to renumber these proto fields?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, this protobuf struct is only ever used between mediator server and the policy engine and not exposed to the clients, so I think it's better to renumber while removing the FilePatch attribute. When/if we send this struct to the clients, renumbering fields should probably be a hard nack.

@JAORMX JAORMX self-requested a review September 19, 2023 06:38
JAORMX
JAORMX previously approved these changes Sep 19, 2023
continue
listOpts := &github.ListOptions{PerPage: 30}
for {
prFiles, resp, err := di.cli.ListFiles(ctx, pr.RepoOwner, pr.RepoName, int(pr.Number), listOpts)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, my thought was that later we could implement lazy pagination (rather than the eager pagination here) by pulling from a channel and having the channel's capacity block further pagination calls.

Don't do this before November, though. We have bigger fish to fry.

Str("package-ecosystem", string(eco)).
Msg("matched ecosystem")

parser := newEcosystemParser(eco)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Following up from the earlier comment)

My suggestion about the check for eco == DepEcosystemNone was, in fact, to merge the two function calls into one:

if parser == nil {
  logger.Debug().Str("filename", filename).Msg("No ecosystem found, skipping")
  return nil, nil
}

logger.Debug().Str("filename", filename).Str("package-ecosystem", parser.Ecosystem()).
  Msg("matched ecosystem")  // Can also use .Send() IIRC

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

internal/engine/ingester/diff/diff_test.go Show resolved Hide resolved
internal/engine/ingester/diff/diff_test.go Show resolved Hide resolved
evankanderson
evankanderson previously approved these changes Sep 25, 2023
@jhrozek
Copy link
Contributor Author

jhrozek commented Sep 25, 2023

@evankanderson would you mind approving this version again? There are no functional changes, just a rebase to resolve conflicts in the autogenerated protobuf files.

@jhrozek jhrozek merged commit e2a22fe into mindersec:main Sep 25, 2023
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

The package ecosystem ingester should support nested files (maybe a wildcard?)
3 participants