Add support for defining sources via a config file #4172

mcastorina · 2025-05-23T08:05:54Z

Description:

Adds support for running sources as specified in "Local config" sections of the docs.

sources:
- connection:
    '@type': type.googleapis.com/sources.Filesystem
    paths:
    - /home/projects
  name: projects
  type: SOURCE_TYPE_FILESYSTEM
  verify: true

trufflehog multi-scan --config config.yaml

This makes trufflehog easier to use for repeated scans and opens the door for a single invocation to scan multiple sources. It also leverages our existing enterprise-style config format and docs.

Notable changes:

Adds a multi-scan subcommand that expects the existing --config flag
Allows scanning multiple sources concurrently

mcastorina · 2025-05-23T08:10:50Z

pkg/sources/sources.go

+// ConfigurableSource is a Source with most of it's initialization values
+// pre-configured and exposes a simplified Init() method. A ConfigurableSource
+// can only be configured multiple times but only initialized once.
+type ConfigurableSource struct {
+	Name     string
+	source   Source
+	initFunc func(context.Context, SourceID, JobID) error
+}


I created this struct to limit the blast-radius of the protobuf configuration structs.

There's a lot of disparate data we use to initialize a source, a lot of which comes from the configuration file. Encapsulating it in this struct helped me reason about it, but if it's confusing I'm happy to go back to the drawing board (or cede and pass the proto config struct around).

As someone who has been repeatedly bitten by using interface objects as internal objects, I am wildly supportive of not doing that here. Consider me a +1000 for limiting the blast radius this way.

Yeah I echo @martinlocklear here. I've tried to pinpoint the little "pea under the mattress" feeling I have, and I think it's the hardcoded callback that functions as a partial for the Source's Init call. Like, a different way to do it would be to pull the config info out of the protobuf and into a nested struct in ConfigurableSource, a la:

type ConfigurableSource struct { Name string source Source config struct { name string verify bool conn *anypb.Any concurrency int } }

Then inside Init you can just do:

err := c.source.Init(ctx, c.config.name, jobID, sourceID, c.config.verify, c.config.conn, c.config.concurrency)

Also I'm curious why there's slightly different handling of config.GetName setting ConfigurableSource.Name versus using it in initFunc.

Can definitely switch it to embed a nested struct.. is that preferable to embedding the protobuf in a private attribute directly?

I'll say that I originally went with the closure as a tool to disallow modifying the initialization parameters. When you create the ConfiguredSource the values are set, but nothing in Go is air-tight so I'm happy to use the embedded struct if it makes the code easier to grok.

Also I'm curious why there's slightly different handling of config.GetName setting ConfigurableSource.Name versus using it in initFunc.

I originally just used config.GetName() but then saw I needed the Name value for other stuff, so I exposed it as a public attribute. So to answer your question, it was an oversight!

is that preferable to embedding the protobuf in a private attribute directly?

I think different people think different things, like the tradeoff is tedium/verbosity vs. weird data getting into your internal systems sometimes (the idea here is like, protocol buffers' type system is only so strong--most of the time further validation is warranted). I tend to lean towards pulling things out of "external" data structures into "internal" ones, but don't feel mega strongly.

I'll say that I originally went with the closure as a tool to disallow modifying the initialization parameters. When you create the ConfiguredSource the values are set, but nothing in Go is air-tight so I'm happy to use the embedded struct if it makes the code easier to grok.

Oh that's a really interesting point. OK yeah then, that makes me 50/50 on it--so whatever you want.

I tend to lean towards pulling things out of "external" data structures into "internal" ones, but don't feel mega strongly.

100% agree with this!

Oh that's a really interesting point. OK yeah then, that makes me 50/50 on it--so whatever you want.

I already switched it to the config struct, so let's just go with that :)

pkg/engine/postman.go

mcastorina · 2025-05-23T08:16:03Z

pkg/config/config.go

+	// Convert to configured sources.
+	var sourceConfigs []sources.ConfigurableSource
+	for _, pbSource := range inputYAML.Sources {
+		s, err := instantiateSourceFromType(pbSource.GetType())
+		if err != nil {
+			return nil, err
+		}
+		src := sources.NewConfigurableSource(s)
+		src.Configure(pbSource)
+
+		sourceConfigs = append(sourceConfigs, src)
 	}


Pre-configuring the sources like this has a few implications:

We lose access to the protobuf configuration (the data is hidden via the ConfigurableSource)

Validation of the config file happens close to parsing it

Because of 2, you can't put unused garbage in the config file (e.g. if you just want to use the config file for custom detectors)

mcastorina · 2025-05-23T08:18:17Z

pkg/engine/engine.go

@@ -99,6 +99,7 @@ type Config struct {
 	// also serves as a multiplier for other worker types (e.g., detector workers, notifier workers)
 	Concurrency int

+	ConfiguredSources             []sources.ConfigurableSource


The naming of this makes me realize this type has two states: not configured (configurable) and configured. I might change the type to ConfiguredSource and you have to initialize it with the config. 🤔

Yeah, this is getting pretty murky pretty fast. You've introduced the new term "configure" to go along with the existing "initialize," and I don't think we can assume that every reader will intuit how the concepts fit together, so if we go down this route we should document the hell out of it.

And your implementation of "configure" invokes Init, adding another layer of stuff to figure out! It might be helpful to start with the language and then see what code organization pops out.

Here's how my brain works (warning: here lies dragons)

instantiated source (concrete type but empty) -> configured source (concrete type but knobs tuned and ready for initialization) -> initialized source (ready to chunk)

the Source interface doesn't quite support this thinking, hence the introduction of a new struct. imo it's easier to reason about the state of a Source object when there's actual types differentiating them.

Also, the Source.Init method imo jumbles together the configuration of the source and some of the "bookkeeping" (runtime stuff like JobID and SourceID). A Source doesn't really need that information to be configured.

rosecodym

I am requesting changes specifically because of the "configure"/"configurable"/"init" diction, which I think we need to clear up before we land anything like this.

The concept itself - abstracting the protobuf message behind an injectable callback - I'm genuinely unsure how to feel about. I can see the benefits here, but heuristically, abstracting things behind callbacks has a pretty hefty fixed maintenance cost. I'm going to bounce this around in my team to get some more perspectives, and I'm also curious what @zricethezav thinks.

pkg/engine/postman.go

rosecodym · 2025-05-23T14:21:17Z

proto/config.proto

+    repeated sources.LocalSource sources = 9;
+    repeated custom_detectors.CustomRegex detectors = 13;


i'm into some pretty obscure numbering schemes, you probably haven't heard of them

rosecodym · 2025-05-23T14:23:58Z

pkg/sources/sources.go

+// Init returns the initialized Source. The ConfigurableSource is unusable after
+// calling this method.


Why did you make this a one time use only method?

Are Source objects allowed to be initialized more than once? I'm not sure that boundary is defined so I wanted to be conservative here.

I agree; I think the answer has to be "we don't know", which is worrying 😅. I'll say for myself I don't super understand the Init pattern, like idiomatic Go is NewThingamabob, and thus Thingamabob can't be "new'd" more than once, so like, if you don't do thatn do you then assume you can "new" it more than once? The only way to be sure for a given Source is to read its Init function, which isn't ideal.

rosecodym · 2025-05-23T14:26:19Z

pkg/engine/engine.go

@@ -99,6 +99,7 @@ type Config struct {
 	// also serves as a multiplier for other worker types (e.g., detector workers, notifier workers)
 	Concurrency int

+	ConfiguredSources             []sources.ConfigurableSource


Yeah, this is getting pretty murky pretty fast. You've introduced the new term "configure" to go along with the existing "initialize," and I don't think we can assume that every reader will intuit how the concepts fit together, so if we go down this route we should document the hell out of it.

And your implementation of "configure" invokes Init, adding another layer of stuff to figure out! It might be helpful to start with the language and then see what code organization pops out.

pkg/config/config.go

mcastorina · 2025-05-23T14:52:51Z

abstracting things behind callbacks has a pretty hefty fixed maintenance cost. I'm going to bounce this around in my team to get some more perspectives, and I'm also curious what @zricethezav thinks.

Definitely appreciate more minds on this!

martinlocklear · 2025-05-23T15:41:04Z

@mcastorina - is there something driving this change other than it would be nice to have? I'm wondering what the "why" is (and can that be added to the PR description and resulting commit?)

mcastorina · 2025-05-23T15:49:45Z

@mcastorina - is there something driving this change other than it would be nice to have? I'm wondering what the "why" is (and can that be added to the PR description and resulting commit?)

It makes trufflehog easier to use for repeated scans and opens the door for a single invocation to scan multiple sources. It also leverages our existing enterprise-style config format and docs.

zricethezav · 2025-05-23T16:22:45Z

@mcastorina can this be flagged as WIP so we don't accidentally merge? We'll want a README entry too explaining the heck out of how this config system works.

martinlocklear

Yeah - I think I'm really struggling with the naming here (been puzzling over this PR for a while now). In general the trufflehog project naming is rough - not sure what the right balance is towards making improvements on that front here is.

main.go

martinlocklear · 2025-05-23T16:54:25Z

pkg/sources/sources.go

+// ConfigurableSource is a Source with most of it's initialization values
+// pre-configured and exposes a simplified Init() method. A ConfigurableSource
+// can only be configured multiple times but only initialized once.
+type ConfigurableSource struct {
+	Name     string
+	source   Source
+	initFunc func(context.Context, SourceID, JobID) error
+}


As someone who has been repeatedly bitten by using interface objects as internal objects, I am wildly supportive of not doing that here. Consider me a +1000 for limiting the blast radius this way.

mcastorina · 2025-05-29T05:22:18Z

Okay, I think I've addressed all of the (great) feedback and this is ready for review.

rosecodym

I have some thoughts on ergonomics but nothing blocking.

rosecodym · 2025-05-30T14:25:00Z

pkg/sources/sources.go

@@ -67,7 +67,7 @@ type Source interface {
 	SourceID() SourceID
 	// JobID returns the initialized job ID used for tracking relationships in the DB.
 	JobID() JobID
-	// Init initializes the source.
+	// Init initializes the source. Calling this method more than once is undefined behavior.


pkg/sources/sources.go

main.go

rosecodym · 2025-05-30T14:45:42Z

pkg/engine/scan.go

+
+		source, err := configuredSource.Init(ctx, sourceID, jobID)
+		if err != nil {
+			return refs, err


Could this error be enhanced by wrapping? Or does it already have enough information inside it?

I followed the pattern of the other sources in pkg/engine which directly return the Init / EnumerateAndChunk errors.

trufflehog/pkg/engine/filesystem.go

Lines 33 to 36 in 931f008

if err := fileSystemSource.Init(ctx, sourceName, jobID, sourceID, true, &conn, runtime.NumCPU()); err != nil {

return sources.JobProgressRef{}, err

}

return e.sourceManager.EnumerateAndScan(ctx, sourceName, fileSystemSource)

But to answer your question, I do think we could add some information (like initialization failed: %w). For consistency I'm going to refrain from that though, and suggest we do that in another PR.

rosecodym · 2025-05-30T14:49:09Z

pkg/engine/scan.go

+		// Start the scan.
+		ref, err := e.sourceManager.EnumerateAndScan(ctx, configuredSource.Name, source)
+		if err != nil {
+			return refs, err


same question about this error

rosecodym · 2025-05-30T14:54:03Z

pkg/config/config.go

+
+// instantiateSourceFromType creates a concrete implementation of
+// sources.Source for the provided type.
+func instantiateSourceFromType(sourceType string) (sources.Source, error) {


I am not the wildest about having a second instantiation path for source structs, but the only alternative I can think of is to have each single-source command call this function, which feels silly.

zricethezav · 2025-06-05T20:13:40Z

lgtm

mcastorina added 6 commits May 23, 2025 00:47

Add config.proto

661e2f1

Make protos

208cf94

Add ConfigurableSource implementation

204d0a0

Parse config protoyaml into configured sources

2f3b668

Refactor setting postman keywords into a helper method on the engine

e84b36b

Add scan subcommand and plumb it all together

21428c2

mcastorina requested review from a team as code owners May 23, 2025 08:05

mcastorina commented May 23, 2025

View reviewed changes

rosecodym requested changes May 23, 2025

View reviewed changes

Refactor ConfigurableSource to ConfiguredSource

8e4432f

Export AhoCorasickCoreKeywords and return the set

cac99f9

mcastorina marked this pull request as draft May 23, 2025 16:27

martinlocklear reviewed May 23, 2025

View reviewed changes

mcastorina added 3 commits May 23, 2025 11:45

Disallow source configurations for non-scan subcommands

4f35465

Merge remote-tracking branch 'origin/main' into source-configs

518aea8

Fix stdin scan compilation errors

b36cda2

zricethezav mentioned this pull request May 27, 2025

[GitHub] Add a GraphQL client to the connector #3837

Open

2 tasks

mcastorina added 3 commits May 28, 2025 21:25

Rename command to multi-scan

606e06e

Add config file documentation

22a0ada

Add more documentation for ConfiguredSource

208f2e2

mcastorina marked this pull request as ready for review May 29, 2025 04:48

Replace initFunc closure with an embedded anonymous struct

931f008

mcastorina requested a review from camgunz May 29, 2025 14:45

mcastorina requested review from martinlocklear and rosecodym May 29, 2025 14:45

rosecodym approved these changes May 30, 2025

View reviewed changes

mcastorina added 2 commits June 1, 2025 21:03

Assign directly to refs

b584da0

Fix typo

1455006

mcastorina requested a review from zricethezav June 2, 2025 04:05

kashifkhan0771 approved these changes Jun 3, 2025

View reviewed changes

zricethezav approved these changes Jun 5, 2025

View reviewed changes

abmussani approved these changes Jun 6, 2025

View reviewed changes

mcastorina merged commit ce3f2ae into main Jun 6, 2025
13 checks passed

mcastorina deleted the source-configs branch June 6, 2025 14:46

blsaccess mentioned this pull request Jun 6, 2025

Update trufflehog to 3.89.0 blacklanternsecurity/bbot#2475

Merged

		repeated sources.LocalSource sources = 9;
		repeated custom_detectors.CustomRegex detectors = 13;

		// Init returns the initialized Source. The ConfigurableSource is unusable after
		// calling this method.

	if err := fileSystemSource.Init(ctx, sourceName, jobID, sourceID, true, &conn, runtime.NumCPU()); err != nil {
	return sources.JobProgressRef{}, err
	}
	return e.sourceManager.EnumerateAndScan(ctx, sourceName, fileSystemSource)

Add support for defining sources via a config file #4172

Add support for defining sources via a config file #4172

Uh oh!

Conversation

mcastorina commented May 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description:

Notable changes:

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mcastorina May 29, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rosecodym left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

mcastorina commented May 23, 2025

Uh oh!

martinlocklear commented May 23, 2025

Uh oh!

mcastorina commented May 23, 2025

Uh oh!

zricethezav commented May 23, 2025

Uh oh!

martinlocklear left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mcastorina commented May 29, 2025

Uh oh!

rosecodym left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

zricethezav commented Jun 5, 2025

mcastorina commented May 23, 2025 •

edited

Loading

mcastorina May 29, 2025 •

edited

Loading