Skip to content

Add support for defining sources via a config file #4172

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 17 commits into from
Jun 6, 2025
Merged

Conversation

mcastorina
Copy link
Collaborator

@mcastorina mcastorina commented May 23, 2025

Description:

Adds support for running sources as specified in "Local config" sections of the docs.

Example for Filesystem:

sources:
- connection:
    '@type': type.googleapis.com/sources.Filesystem
    paths:
    - /home/projects
  name: projects
  type: SOURCE_TYPE_FILESYSTEM
  verify: true
trufflehog multi-scan --config config.yaml

This makes trufflehog easier to use for repeated scans and opens the door for a single invocation to scan multiple sources. It also leverages our existing enterprise-style config format and docs.

Notable changes:

  • Adds a multi-scan subcommand that expects the existing --config flag
  • Allows scanning multiple sources concurrently

@mcastorina mcastorina requested review from a team as code owners May 23, 2025 08:05
Comment on lines 108 to 115
// ConfigurableSource is a Source with most of it's initialization values
// pre-configured and exposes a simplified Init() method. A ConfigurableSource
// can only be configured multiple times but only initialized once.
type ConfigurableSource struct {
Name string
source Source
initFunc func(context.Context, SourceID, JobID) error
}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created this struct to limit the blast-radius of the protobuf configuration structs.

There's a lot of disparate data we use to initialize a source, a lot of which comes from the configuration file. Encapsulating it in this struct helped me reason about it, but if it's confusing I'm happy to go back to the drawing board (or cede and pass the proto config struct around).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As someone who has been repeatedly bitten by using interface objects as internal objects, I am wildly supportive of not doing that here. Consider me a +1000 for limiting the blast radius this way.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I echo @martinlocklear here. I've tried to pinpoint the little "pea under the mattress" feeling I have, and I think it's the hardcoded callback that functions as a partial for the Source's Init call. Like, a different way to do it would be to pull the config info out of the protobuf and into a nested struct in ConfigurableSource, a la:

type ConfigurableSource struct {
  Name string
  source Source
  config struct {
    name string
    verify bool
    conn *anypb.Any
    concurrency int
  }
}

Then inside Init you can just do:

err := c.source.Init(ctx, c.config.name, jobID, sourceID, c.config.verify, c.config.conn, c.config.concurrency)

Also I'm curious why there's slightly different handling of config.GetName setting ConfigurableSource.Name versus using it in initFunc.

Copy link
Collaborator Author

@mcastorina mcastorina May 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can definitely switch it to embed a nested struct.. is that preferable to embedding the protobuf in a private attribute directly?

I'll say that I originally went with the closure as a tool to disallow modifying the initialization parameters. When you create the ConfiguredSource the values are set, but nothing in Go is air-tight so I'm happy to use the embedded struct if it makes the code easier to grok.

Also I'm curious why there's slightly different handling of config.GetName setting ConfigurableSource.Name versus using it in initFunc.

I originally just used config.GetName() but then saw I needed the Name value for other stuff, so I exposed it as a public attribute. So to answer your question, it was an oversight!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is that preferable to embedding the protobuf in a private attribute directly?

I think different people think different things, like the tradeoff is tedium/verbosity vs. weird data getting into your internal systems sometimes (the idea here is like, protocol buffers' type system is only so strong--most of the time further validation is warranted). I tend to lean towards pulling things out of "external" data structures into "internal" ones, but don't feel mega strongly.

I'll say that I originally went with the closure as a tool to disallow modifying the initialization parameters. When you create the ConfiguredSource the values are set, but nothing in Go is air-tight so I'm happy to use the embedded struct if it makes the code easier to grok.

Oh that's a really interesting point. OK yeah then, that makes me 50/50 on it--so whatever you want.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tend to lean towards pulling things out of "external" data structures into "internal" ones, but don't feel mega strongly.

100% agree with this!

Oh that's a really interesting point. OK yeah then, that makes me 50/50 on it--so whatever you want.

I already switched it to the config struct, so let's just go with that :)

Comment on lines 57 to 68
// Convert to configured sources.
var sourceConfigs []sources.ConfigurableSource
for _, pbSource := range inputYAML.Sources {
s, err := instantiateSourceFromType(pbSource.GetType())
if err != nil {
return nil, err
}
src := sources.NewConfigurableSource(s)
src.Configure(pbSource)

sourceConfigs = append(sourceConfigs, src)
}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pre-configuring the sources like this has a few implications:

  1. We lose access to the protobuf configuration (the data is hidden via the ConfigurableSource)
  2. Validation of the config file happens close to parsing it
  3. Because of 2, you can't put unused garbage in the config file (e.g. if you just want to use the config file for custom detectors)

@@ -99,6 +99,7 @@ type Config struct {
// also serves as a multiplier for other worker types (e.g., detector workers, notifier workers)
Concurrency int

ConfiguredSources []sources.ConfigurableSource
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The naming of this makes me realize this type has two states: not configured (configurable) and configured. I might change the type to ConfiguredSource and you have to initialize it with the config. 🤔

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is getting pretty murky pretty fast. You've introduced the new term "configure" to go along with the existing "initialize," and I don't think we can assume that every reader will intuit how the concepts fit together, so if we go down this route we should document the hell out of it.

And your implementation of "configure" invokes Init, adding another layer of stuff to figure out! It might be helpful to start with the language and then see what code organization pops out.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's how my brain works (warning: here lies dragons)

instantiated source (concrete type but empty)
-> configured source (concrete type but knobs tuned and ready for initialization)
-> initialized source (ready to chunk)

the Source interface doesn't quite support this thinking, hence the introduction of a new struct. imo it's easier to reason about the state of a Source object when there's actual types differentiating them.

Also, the Source.Init method imo jumbles together the configuration of the source and some of the "bookkeeping" (runtime stuff like JobID and SourceID). A Source doesn't really need that information to be configured.

Copy link
Collaborator

@rosecodym rosecodym left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am requesting changes specifically because of the "configure"/"configurable"/"init" diction, which I think we need to clear up before we land anything like this.

The concept itself - abstracting the protobuf message behind an injectable callback - I'm genuinely unsure how to feel about. I can see the benefits here, but heuristically, abstracting things behind callbacks has a pretty hefty fixed maintenance cost. I'm going to bounce this around in my team to get some more perspectives, and I'm also curious what @zricethezav thinks.

Comment on lines +11 to +12
repeated sources.LocalSource sources = 9;
repeated custom_detectors.CustomRegex detectors = 13;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm into some pretty obscure numbering schemes, you probably haven't heard of them

Comment on lines 147 to 148
// Init returns the initialized Source. The ConfigurableSource is unusable after
// calling this method.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you make this a one time use only method?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are Source objects allowed to be initialized more than once? I'm not sure that boundary is defined so I wanted to be conservative here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree; I think the answer has to be "we don't know", which is worrying 😅. I'll say for myself I don't super understand the Init pattern, like idiomatic Go is NewThingamabob, and thus Thingamabob can't be "new'd" more than once, so like, if you don't do thatn do you then assume you can "new" it more than once? The only way to be sure for a given Source is to read its Init function, which isn't ideal.

@@ -99,6 +99,7 @@ type Config struct {
// also serves as a multiplier for other worker types (e.g., detector workers, notifier workers)
Concurrency int

ConfiguredSources []sources.ConfigurableSource
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is getting pretty murky pretty fast. You've introduced the new term "configure" to go along with the existing "initialize," and I don't think we can assume that every reader will intuit how the concepts fit together, so if we go down this route we should document the hell out of it.

And your implementation of "configure" invokes Init, adding another layer of stuff to figure out! It might be helpful to start with the language and then see what code organization pops out.

@mcastorina
Copy link
Collaborator Author

abstracting things behind callbacks has a pretty hefty fixed maintenance cost. I'm going to bounce this around in my team to get some more perspectives, and I'm also curious what @zricethezav thinks.

Definitely appreciate more minds on this!

@martinlocklear
Copy link
Contributor

@mcastorina - is there something driving this change other than it would be nice to have? I'm wondering what the "why" is (and can that be added to the PR description and resulting commit?)

@mcastorina
Copy link
Collaborator Author

@mcastorina - is there something driving this change other than it would be nice to have? I'm wondering what the "why" is (and can that be added to the PR description and resulting commit?)

It makes trufflehog easier to use for repeated scans and opens the door for a single invocation to scan multiple sources. It also leverages our existing enterprise-style config format and docs.

@zricethezav
Copy link
Collaborator

@mcastorina can this be flagged as WIP so we don't accidentally merge? We'll want a README entry too explaining the heck out of how this config system works.

@mcastorina mcastorina marked this pull request as draft May 23, 2025 16:27
Copy link
Contributor

@martinlocklear martinlocklear left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah - I think I'm really struggling with the naming here (been puzzling over this PR for a while now). In general the trufflehog project naming is rough - not sure what the right balance is towards making improvements on that front here is.

Comment on lines 108 to 115
// ConfigurableSource is a Source with most of it's initialization values
// pre-configured and exposes a simplified Init() method. A ConfigurableSource
// can only be configured multiple times but only initialized once.
type ConfigurableSource struct {
Name string
source Source
initFunc func(context.Context, SourceID, JobID) error
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As someone who has been repeatedly bitten by using interface objects as internal objects, I am wildly supportive of not doing that here. Consider me a +1000 for limiting the blast radius this way.

@mcastorina mcastorina marked this pull request as ready for review May 29, 2025 04:48
@mcastorina
Copy link
Collaborator Author

Okay, I think I've addressed all of the (great) feedback and this is ready for review.

@mcastorina mcastorina requested a review from camgunz May 29, 2025 14:45
Copy link
Collaborator

@rosecodym rosecodym left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some thoughts on ergonomics but nothing blocking.

@@ -67,7 +67,7 @@ type Source interface {
SourceID() SourceID
// JobID returns the initialized job ID used for tracking relationships in the DB.
JobID() JobID
// Init initializes the source.
// Init initializes the source. Calling this method more than once is undefined behavior.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️


source, err := configuredSource.Init(ctx, sourceID, jobID)
if err != nil {
return refs, err
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this error be enhanced by wrapping? Or does it already have enough information inside it?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I followed the pattern of the other sources in pkg/engine which directly return the Init / EnumerateAndChunk errors.

if err := fileSystemSource.Init(ctx, sourceName, jobID, sourceID, true, &conn, runtime.NumCPU()); err != nil {
return sources.JobProgressRef{}, err
}
return e.sourceManager.EnumerateAndScan(ctx, sourceName, fileSystemSource)

But to answer your question, I do think we could add some information (like initialization failed: %w). For consistency I'm going to refrain from that though, and suggest we do that in another PR.

// Start the scan.
ref, err := e.sourceManager.EnumerateAndScan(ctx, configuredSource.Name, source)
if err != nil {
return refs, err
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same question about this error


// instantiateSourceFromType creates a concrete implementation of
// sources.Source for the provided type.
func instantiateSourceFromType(sourceType string) (sources.Source, error) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not the wildest about having a second instantiation path for source structs, but the only alternative I can think of is to have each single-source command call this function, which feels silly.

@mcastorina mcastorina requested a review from zricethezav June 2, 2025 04:05
@zricethezav
Copy link
Collaborator

lgtm

@mcastorina mcastorina merged commit ce3f2ae into main Jun 6, 2025
13 checks passed
@mcastorina mcastorina deleted the source-configs branch June 6, 2025 14:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants