-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Add support for defining sources via a config file #4172
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
pkg/sources/sources.go
Outdated
// ConfigurableSource is a Source with most of it's initialization values | ||
// pre-configured and exposes a simplified Init() method. A ConfigurableSource | ||
// can only be configured multiple times but only initialized once. | ||
type ConfigurableSource struct { | ||
Name string | ||
source Source | ||
initFunc func(context.Context, SourceID, JobID) error | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I created this struct to limit the blast-radius of the protobuf configuration structs.
There's a lot of disparate data we use to initialize a source, a lot of which comes from the configuration file. Encapsulating it in this struct helped me reason about it, but if it's confusing I'm happy to go back to the drawing board (or cede and pass the proto config struct around).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As someone who has been repeatedly bitten by using interface objects as internal objects, I am wildly supportive of not doing that here. Consider me a +1000 for limiting the blast radius this way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I echo @martinlocklear here. I've tried to pinpoint the little "pea under the mattress" feeling I have, and I think it's the hardcoded callback that functions as a partial for the Source's Init
call. Like, a different way to do it would be to pull the config info out of the protobuf and into a nested struct in ConfigurableSource
, a la:
type ConfigurableSource struct {
Name string
source Source
config struct {
name string
verify bool
conn *anypb.Any
concurrency int
}
}
Then inside Init
you can just do:
err := c.source.Init(ctx, c.config.name, jobID, sourceID, c.config.verify, c.config.conn, c.config.concurrency)
Also I'm curious why there's slightly different handling of config.GetName
setting ConfigurableSource.Name
versus using it in initFunc
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can definitely switch it to embed a nested struct.. is that preferable to embedding the protobuf in a private attribute directly?
I'll say that I originally went with the closure as a tool to disallow modifying the initialization parameters. When you create the ConfiguredSource
the values are set, but nothing in Go is air-tight so I'm happy to use the embedded struct if it makes the code easier to grok.
Also I'm curious why there's slightly different handling of config.GetName setting ConfigurableSource.Name versus using it in initFunc.
I originally just used config.GetName()
but then saw I needed the Name
value for other stuff, so I exposed it as a public attribute. So to answer your question, it was an oversight!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is that preferable to embedding the protobuf in a private attribute directly?
I think different people think different things, like the tradeoff is tedium/verbosity vs. weird data getting into your internal systems sometimes (the idea here is like, protocol buffers' type system is only so strong--most of the time further validation is warranted). I tend to lean towards pulling things out of "external" data structures into "internal" ones, but don't feel mega strongly.
I'll say that I originally went with the closure as a tool to disallow modifying the initialization parameters. When you create the ConfiguredSource the values are set, but nothing in Go is air-tight so I'm happy to use the embedded struct if it makes the code easier to grok.
Oh that's a really interesting point. OK yeah then, that makes me 50/50 on it--so whatever you want.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tend to lean towards pulling things out of "external" data structures into "internal" ones, but don't feel mega strongly.
100% agree with this!
Oh that's a really interesting point. OK yeah then, that makes me 50/50 on it--so whatever you want.
I already switched it to the config struct, so let's just go with that :)
pkg/config/config.go
Outdated
// Convert to configured sources. | ||
var sourceConfigs []sources.ConfigurableSource | ||
for _, pbSource := range inputYAML.Sources { | ||
s, err := instantiateSourceFromType(pbSource.GetType()) | ||
if err != nil { | ||
return nil, err | ||
} | ||
src := sources.NewConfigurableSource(s) | ||
src.Configure(pbSource) | ||
|
||
sourceConfigs = append(sourceConfigs, src) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pre-configuring the sources like this has a few implications:
- We lose access to the protobuf configuration (the data is hidden via the
ConfigurableSource
) - Validation of the config file happens close to parsing it
- Because of 2, you can't put unused garbage in the config file (e.g. if you just want to use the config file for custom detectors)
pkg/engine/engine.go
Outdated
@@ -99,6 +99,7 @@ type Config struct { | |||
// also serves as a multiplier for other worker types (e.g., detector workers, notifier workers) | |||
Concurrency int | |||
|
|||
ConfiguredSources []sources.ConfigurableSource |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The naming of this makes me realize this type has two states: not configured (configurable) and configured. I might change the type to ConfiguredSource
and you have to initialize it with the config. 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this is getting pretty murky pretty fast. You've introduced the new term "configure" to go along with the existing "initialize," and I don't think we can assume that every reader will intuit how the concepts fit together, so if we go down this route we should document the hell out of it.
And your implementation of "configure" invokes Init
, adding another layer of stuff to figure out! It might be helpful to start with the language and then see what code organization pops out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's how my brain works (warning: here lies dragons)
instantiated source (concrete type but empty)
-> configured source (concrete type but knobs tuned and ready for initialization)
-> initialized source (ready to chunk)
the Source
interface doesn't quite support this thinking, hence the introduction of a new struct. imo it's easier to reason about the state of a Source
object when there's actual types differentiating them.
Also, the Source.Init
method imo jumbles together the configuration of the source and some of the "bookkeeping" (runtime stuff like JobID
and SourceID
). A Source
doesn't really need that information to be configured.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am requesting changes specifically because of the "configure"/"configurable"/"init" diction, which I think we need to clear up before we land anything like this.
The concept itself - abstracting the protobuf message behind an injectable callback - I'm genuinely unsure how to feel about. I can see the benefits here, but heuristically, abstracting things behind callbacks has a pretty hefty fixed maintenance cost. I'm going to bounce this around in my team to get some more perspectives, and I'm also curious what @zricethezav thinks.
repeated sources.LocalSource sources = 9; | ||
repeated custom_detectors.CustomRegex detectors = 13; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'm into some pretty obscure numbering schemes, you probably haven't heard of them
pkg/sources/sources.go
Outdated
// Init returns the initialized Source. The ConfigurableSource is unusable after | ||
// calling this method. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did you make this a one time use only method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are Source
objects allowed to be initialized more than once? I'm not sure that boundary is defined so I wanted to be conservative here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree; I think the answer has to be "we don't know", which is worrying 😅. I'll say for myself I don't super understand the Init
pattern, like idiomatic Go is NewThingamabob
, and thus Thingamabob
can't be "new'd" more than once, so like, if you don't do thatn do you then assume you can "new" it more than once? The only way to be sure for a given Source is to read its Init
function, which isn't ideal.
pkg/engine/engine.go
Outdated
@@ -99,6 +99,7 @@ type Config struct { | |||
// also serves as a multiplier for other worker types (e.g., detector workers, notifier workers) | |||
Concurrency int | |||
|
|||
ConfiguredSources []sources.ConfigurableSource |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this is getting pretty murky pretty fast. You've introduced the new term "configure" to go along with the existing "initialize," and I don't think we can assume that every reader will intuit how the concepts fit together, so if we go down this route we should document the hell out of it.
And your implementation of "configure" invokes Init
, adding another layer of stuff to figure out! It might be helpful to start with the language and then see what code organization pops out.
Definitely appreciate more minds on this! |
@mcastorina - is there something driving this change other than it would be nice to have? I'm wondering what the "why" is (and can that be added to the PR description and resulting commit?) |
It makes |
@mcastorina can this be flagged as WIP so we don't accidentally merge? We'll want a README entry too explaining the heck out of how this config system works. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah - I think I'm really struggling with the naming here (been puzzling over this PR for a while now). In general the trufflehog project naming is rough - not sure what the right balance is towards making improvements on that front here is.
pkg/sources/sources.go
Outdated
// ConfigurableSource is a Source with most of it's initialization values | ||
// pre-configured and exposes a simplified Init() method. A ConfigurableSource | ||
// can only be configured multiple times but only initialized once. | ||
type ConfigurableSource struct { | ||
Name string | ||
source Source | ||
initFunc func(context.Context, SourceID, JobID) error | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As someone who has been repeatedly bitten by using interface objects as internal objects, I am wildly supportive of not doing that here. Consider me a +1000 for limiting the blast radius this way.
Okay, I think I've addressed all of the (great) feedback and this is ready for review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have some thoughts on ergonomics but nothing blocking.
@@ -67,7 +67,7 @@ type Source interface { | |||
SourceID() SourceID | |||
// JobID returns the initialized job ID used for tracking relationships in the DB. | |||
JobID() JobID | |||
// Init initializes the source. | |||
// Init initializes the source. Calling this method more than once is undefined behavior. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️
|
||
source, err := configuredSource.Init(ctx, sourceID, jobID) | ||
if err != nil { | ||
return refs, err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this error be enhanced by wrapping? Or does it already have enough information inside it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I followed the pattern of the other sources in pkg/engine
which directly return the Init
/ EnumerateAndChunk
errors.
trufflehog/pkg/engine/filesystem.go
Lines 33 to 36 in 931f008
if err := fileSystemSource.Init(ctx, sourceName, jobID, sourceID, true, &conn, runtime.NumCPU()); err != nil { | |
return sources.JobProgressRef{}, err | |
} | |
return e.sourceManager.EnumerateAndScan(ctx, sourceName, fileSystemSource) |
But to answer your question, I do think we could add some information (like initialization failed: %w
). For consistency I'm going to refrain from that though, and suggest we do that in another PR.
// Start the scan. | ||
ref, err := e.sourceManager.EnumerateAndScan(ctx, configuredSource.Name, source) | ||
if err != nil { | ||
return refs, err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same question about this error
|
||
// instantiateSourceFromType creates a concrete implementation of | ||
// sources.Source for the provided type. | ||
func instantiateSourceFromType(sourceType string) (sources.Source, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not the wildest about having a second instantiation path for source structs, but the only alternative I can think of is to have each single-source command call this function, which feels silly.
lgtm |
Description:
Adds support for running sources as specified in "Local config" sections of the docs.
Example for Filesystem:
This makes
trufflehog
easier to use for repeated scans and opens the door for a single invocation to scan multiple sources. It also leverages our existing enterprise-style config format and docs.Notable changes:
multi-scan
subcommand that expects the existing--config
flag