Masking rewrite by jpcoenen · Pull Request #267 · secrethub/secrethub-cli

jpcoenen · 2020-03-23T11:09:45Z

Complete rewrite of the masking functionality, with the following goals:

Fix output to stdout and stderr getting mixed when both are written at approximately the same time (issue #196, "stdout and stderr mixed in secrethub run").
Make the code more legible, understandable and robust. The original implementation had different constraints and requirements from what it has since evolved into. This has led to a lot of unnecessary complexity, which can be removed by rethinking the design.
Reduce the data dependency of the masker's timing. The previous implementation had a strong relation between whether the input contains secrets and the time consumed by the masker. Under the original assumption that only the secret's owner controls the masker's input, this was not an issue. However, as the masking functionality has broad applicability, it was decided to reduce this data dependency as much as possible.

TODO:
- Add comments
- Rethink the matching of secrets (flushN() + lookForMatches()). The current mechanism is pretty inefficient by continuously searching the part of the buffer that is being flushed + maxLookback number of bytes before that. This means that if the program that is being masked uses a lot of small writes, we're doing a regexp lookup many times over on the same text. The previous implementation did this better by only checking every byte once. However, this code was much more complex. There must be a middle ground here between complexity and performance.

jpcoenen · 2020-03-23T12:01:54Z

I should have said this in the PR description, but for now I'm mainly looking for a high-level review. So: is this the new structure we want to go ahead with? If so, I'll polish it.

mackenbach · 2020-03-23T16:38:56Z

Small suggestion: @jpcoenen if you want a high-level review, maybe draw a quick diagram of how it works? Even a hand-drawn sketch would help.

jpcoenen · 2020-03-23T16:40:52Z

Will do 👍 That's coming tomorrow.

jpcoenen · 2020-03-26T12:24:56Z

@SimonBarendse @mackenbach I've done quite a rewrite. It might still look very complex. However, I think that it has been reduced to the inherent complexity of the required behaviour. The signature of every function is as simple as I could get it.

A global overview of how it works (also included a diagram; hope it helps):

Ingesting frames

stream.Write() is called whenever the child process writes something to stdout or stderr. Then this happens:
a. IndexedBuffer.Write(): we store the incoming bytes (a frame) in a buffer.
b. multipleMatcher.Write(): we directly check for any secrets.
c. stream.addMatch(): if we find any secrets, we store them in a map with the index of the match (which is incrementing and unique over the lifetime of the masker) and the match's length.
d. Masker.registerFrame(): we also set a timer to flush all the bytes of this frame after BufferDelay has passed.

Flushing frames

Run() (executed in a goroutine) continuously waits for timers to expire. When one expires, it calls stream.flush(), which writes the frame to the destination io.Writer while masking any previously found secrets.
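The two phases above can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: it handles a single secret, finds matches with bytes.Index instead of the incremental matcher, and takes the current time as an explicit argument instead of using timers and a goroutine. The names frame, stream.write, stream.flushDue and demo are made up for the sketch.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"time"
)

const redaction = "<redacted by SecretHub>"

// frame marks the end of one write call and the moment it may be flushed.
type frame struct {
	end      int       // stream index just past the last byte of the frame
	deadline time.Time // the frame may be flushed once this time is reached
}

// stream is a simplified, single-secret stand-in for the masker's stream.
type stream struct {
	dest    io.Writer
	secret  []byte
	delay   time.Duration
	pending []byte      // bytes received but not yet flushed
	base    int         // stream index of pending[0]
	matches map[int]int // stream index where a match starts -> match length
	frames  []frame
}

// write buffers a frame, records any matches and registers a flush deadline.
func (s *stream) write(p []byte, now time.Time) {
	// Only rescan the region in which a match can involve the new bytes.
	from := len(s.pending) - len(s.secret) + 1
	if from < 0 {
		from = 0
	}
	s.pending = append(s.pending, p...)
	for len(s.secret) > 0 {
		i := bytes.Index(s.pending[from:], s.secret)
		if i < 0 {
			break
		}
		s.matches[s.base+from+i] = len(s.secret)
		from += i + 1
	}
	s.frames = append(s.frames, frame{s.base + len(s.pending), now.Add(s.delay)})
}

// flushDue writes every frame whose deadline has passed, masking matches.
func (s *stream) flushDue(now time.Time) error {
	upTo := s.base
	for len(s.frames) > 0 && !s.frames[0].deadline.After(now) {
		upTo = s.frames[0].end
		s.frames = s.frames[1:]
	}
	i := s.base
	for i < upTo {
		if n, ok := s.matches[i]; ok {
			// A match may extend into later frames; skip all of its bytes.
			delete(s.matches, i)
			if _, err := io.WriteString(s.dest, redaction); err != nil {
				return err
			}
			i += n
			continue
		}
		if _, err := s.dest.Write(s.pending[i-s.base : i-s.base+1]); err != nil {
			return err
		}
		i++
	}
	s.pending = s.pending[i-s.base:]
	s.base = i
	return nil
}

// demo splits a secret over two frames and returns the destination's
// contents after flushing at 20ms, 50ms and 60ms.
func demo() []string {
	var out bytes.Buffer
	s := &stream{dest: &out, secret: []byte("s3cret"),
		delay: 50 * time.Millisecond, matches: map[int]int{}}
	t0 := time.Unix(0, 0)
	s.write([]byte("token=s3c"), t0)
	s.write([]byte("ret\n"), t0.Add(10*time.Millisecond))

	var snapshots []string
	for _, ms := range []int{20, 50, 60} {
		if err := s.flushDue(t0.Add(time.Duration(ms) * time.Millisecond)); err != nil {
			panic(err)
		}
		snapshots = append(snapshots, out.String())
	}
	return snapshots
}

func main() {
	for _, snapshot := range demo() {
		fmt.Printf("%q\n", snapshot)
	}
}
```

In the demo the secret is split over two frames. Because the second frame arrives before the first frame's delay has passed, the match is already known when the first frame is flushed, so the whole secret is replaced. If the second half arrived only after the first frame had been flushed, the first half would already have been written unmasked.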

mackenbach

Very comprehensive rewrite/refactor. It looks like you were able to reuse quite a lot of the old logic that was sound and simplified the rest. I have quite a few comments, mostly architectural and naming/documentation.

Note: Last time we touched this functionality, I also made remarks as to the readability and complexity of the code and we then decided to improve that later... which became never. So I hope we can prioritize readability and simplicity now to avoid having to do the same dance a third time.

On the requirements side, I have a few questions that I think are good to tick off explicitly:

Is the secret masking constant-time? If so, I feel we should mention that somewhere.
Is the masking still a best-effort attempt? If so, I feel we should mention that somewhere.
Does it work nicely with the TTY cases where output is written line by line?
How does it handle multi-line secrets? And if multi-line secrets are written line by line, how does that affect the masking behavior?
Does it work nicely with (non-TTY) cases where bytes are not buffered before being written to our writer? If I understand your code correctly, that would mean every frame is one byte long?
Does it not write to both (stderr and stdout) output streams at the same time (at least in the TTY cases)? In other words, does it keep both output streams from interfering?
Finally, and this is one that didn't come up in the requirements discussion but I realized while reviewing (see also comments in the review): can this run without errors for a long time and with a lot of output being processed over time (e.g. a use case where a server is wrapped with secrethub run and runs for a year, writing log lines to stdout)?

Overall, very well done. I'm happy we are able to (finally) fix this and in a pretty elegant way too.

mackenbach · 2020-03-26T12:45:09Z

+// Run continuously flushes the input buffer for each frame for which the buffer delay has passed.
+// This method should be run in a separate goroutine.
+// If a struct is passed on flushAllChan, all pending frames are flushed to the output and the method returns.
+func (m *Masker) Run() {


Is it Run() or Start()? If Start() makes more sense, then I'd expect there to be a Stop() function too probably. Now the stopping behavior seems to be controlled by the flushAllChan, which is private, but is documented in the public docs here.

Also, do 'streams' still accept write calls after Flush has been called? And is that desired (and documented)?

Good suggestion.

> Also, do 'streams' still accept write calls after Flush has been called? And is that desired (and documented)?

// This should be run after all input has been written to the io.Writers of the streams.

I'll see whether I can make this a little more explicit.

You will get a panic if you write to a masked io.Writer after calling Close(). I've documented that behaviour. Catching without a panic requires some messy logic because streams would have to know whether the masker has stopped or not.

Should we catch that error? You could wrap the streams with a very simple io.Writer interface that does exactly that one check. Can be a simple writeFunc or something.

Hmm, interesting approach. Will look into it.

Getting this solid is not as easy as I thought. For example, what happens if you're in the middle of a write when you stop the masker? At the moment the behaviour is undefined because you're using the masker wrong. If we start guarding for this, I feel like we should do it properly, but that again adds some complexity. Is that what we want?

Yes, there's definitely a tradeoff here. Consider adding to the docs that writing to any of the writers (/streams?) after calling Masker.Close() causes a panic. I think that should cover it well enough for now and we can add complexity when we find a valid use case for it.

That's already there https://github.com/secrethub/secrethub-cli/pull/267/files#diff-331cd1ff2b2231141462b60c73817d48R108
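For reference, the wrapper idea discussed in this thread could look roughly like the sketch below. It is not part of this PR, which documents the panic instead. The names guard, wrap, stop and demo are hypothetical; only writeFunc comes from the suggestion above. A mutex is one way to handle the write-in-progress case: stop waits for a running write to finish before rejecting later ones.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
	"sync"
)

// writeFunc adapts a plain function to the io.Writer interface.
type writeFunc func(p []byte) (int, error)

func (f writeFunc) Write(p []byte) (int, error) { return f(p) }

// guard tracks whether the (hypothetical) masker has been stopped.
type guard struct {
	mu      sync.Mutex
	stopped bool
}

// wrap returns a writer that forwards to w until stop has been called and
// returns io.ErrClosedPipe afterwards instead of panicking.
func (g *guard) wrap(w io.Writer) io.Writer {
	return writeFunc(func(p []byte) (int, error) {
		g.mu.Lock()
		defer g.mu.Unlock()
		if g.stopped {
			return 0, io.ErrClosedPipe
		}
		return w.Write(p)
	})
}

// stop waits for any write in progress to finish and rejects later writes.
func (g *guard) stop() {
	g.mu.Lock()
	g.stopped = true
	g.mu.Unlock()
}

// demo writes once before and once after stop and reports what happened.
func demo() (written string, rejected bool) {
	var dest bytes.Buffer
	g := &guard{}
	w := g.wrap(&dest)
	fmt.Fprint(w, "before")
	g.stop()
	_, err := fmt.Fprint(w, "after")
	return dest.String(), errors.Is(err, io.ErrClosedPipe)
}

func main() {
	written, rejected := demo()
	fmt.Printf("written=%q rejected=%v\n", written, rejected)
}
```

The cost is that every write through any wrapped stream takes the same lock, which is the kind of extra complexity weighed against simply documenting the panic.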

SimonBarendse

Are you still looking for only a high level review?

Based on your diagram, the high level implementation looks good! Looks like the two important jobs, masking and buffering/timeouts are nicely separated.

Is there a need for the bytes to go through the masking first and then in the buffer? Seems like going from the stdout write directly to the buffer could simplify it a bit more, because the bytes don't have to pass through the "stream" (masking engine) one more time. But, while this might seem complex in the diagram it might work out pretty smooth in the implementation?

jpcoenen · 2020-03-26T14:02:17Z

> Are you still looking for only a high level review?

Always welcome, but I am pretty convinced now that this architecture is more solid than the previous one.

> Is there a need for the bytes to go through the masking first and then in the buffer? Seems like going from the stdout write directly to the buffer could simplify it a bit more, because the bytes don't have to pass through the "stream" (masking engine) one more time. But, while this might seem complex in the diagram it might work out pretty smooth in the implementation?

I don't really get what you mean here. If you think it's important, let's discuss it face-to-face.

florisvdg · 2020-03-31T17:46:34Z

@SimonBarendse should I focus the review on the code or more on the functionality?

mackenbach · 2020-04-01T07:28:23Z

@jpcoenen one question that just hit me: what happens if you have a secret to match that is max size (~512KB if I recall correctly)? Would that create a lot of detectors? Or does it have to do with the starting sequence of the secret?

jpcoenen · 2020-04-01T07:53:25Z

> @jpcoenen one question that just hit me: what happens if you have a secret to match that is max size (~512KB if I recall correctly)? Would that create a lot of detectors? Or does it have to do with the starting sequence of the secret?

It only checks for repetitions of the start of the secret. So the size should not matter, only the degree of repetition. If it starts with

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

That does indeed create >300 detectors. If it starts with:

caaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

It gets by with just 1 detector.

Would the first scenario be something we want to guard against?
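To illustrate why only the repetition of the secret's start matters, here is a naive model, not the matcher in this PR. It assumes a new "detector" (partial match) is started whenever a byte equals the first byte of the secret, and that a detector is dropped on the first mismatch. The function maxActiveDetectors is made up for this sketch.

```go
package main

import (
	"bytes"
	"fmt"
)

// maxActiveDetectors feeds input byte by byte to a naive streaming matcher
// for secret and returns the highest number of partial matches ("detectors")
// that accepted the same input byte.
func maxActiveDetectors(secret, input []byte) int {
	if len(secret) == 0 {
		return 0
	}
	var active []int // per detector: number of secret bytes matched so far
	highest := 0
	for _, b := range input {
		// Every byte may be the start of a new occurrence of the secret.
		active = append(active, 0)
		kept := active[:0]
		matched := 0
		for _, pos := range active {
			if secret[pos] != b {
				continue // mismatch: this detector is dropped
			}
			matched++
			if pos+1 < len(secret) {
				kept = append(kept, pos+1) // still incomplete: keep going
			}
		}
		active = kept
		if matched > highest {
			highest = matched
		}
	}
	return highest
}

func main() {
	repetitive := bytes.Repeat([]byte("a"), 300)
	prefixed := append([]byte("c"), repetitive...)
	fmt.Println(maxActiveDetectors(repetitive, repetitive)) // 300
	fmt.Println(maxActiveDetectors(prefixed, prefixed))     // 1
}
```

In this model, a secret made up of one repeated byte keeps one detector alive per byte written so far, while a secret whose first byte never recurs needs only one.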

mackenbach · 2020-04-01T07:57:30Z

Great explanation, thanks!

> Would the first scenario be something we want to guard against?

No I think it's fine. It would be a very weird edge case and caused by weird behavior on the part of the user. Although we could implement a limit, e.g. max 100 repeating first characters, and document that if the limit is exceeded it isn't guaranteed to mask all (partial) matches?

It's a tiny edge case though. We could also implement it when someone runs into an issue. Right now, the way I get it is that it still masks fine, just the memory (and CPU?) footprint increases hugely, right? That's acceptable for a very weird edge case for now I'd say.

jpcoenen · 2020-04-01T08:37:15Z

> Great explanation, thanks!

A good question deserves a good answer.

> It's a tiny edge case though. We could also implement it when someone runs into an issue. Right now, the way I get it is that it still masks fine, just the memory (and CPU?) footprint increases hugely, right? That's acceptable for a very weird edge case for now I'd say.

Yep. Probably becomes unusable. I agree with your conclusion.

jpcoenen · 2020-04-01T08:38:25Z

Updated the diagram:

mackenbach · 2020-04-01T11:24:39Z

Solid, with the new diagram you really see how a very complex technical problem can be solved elegantly with no more than necessary complexity. Very good stuff guys.

One tiny typo: I think it's called go masker.Start() now right?

mackenbach · 2020-04-02T09:15:18Z

Okay stupid question that might fuck up your day: how does the masker.Masker depend on the encoding of the sequences [][]byte passed to the constructor AND the encoding of the streams (stdout and stderr)?

For instance, we've encountered a few times UTF16 encoding on Windows. Now I don't know what encoding windows uses for stdout and stderr, but I think it's worth checking for a second:

Is there a real scenario where encoding is different than we're used to?
What would be the impact if that occurred?
Do we want to guard against that, either with code or with documentation?

jpcoenen · 2020-04-02T09:39:27Z

> Okay stupid question that might fuck up your day: how does the masker.Masker depend on the encoding of the sequences [][]byte passed to the constructor AND the encoding of the streams (stdout and stderr)?
>
> For instance, we've encountered a few times UTF16 encoding on Windows. Now I don't know what encoding windows uses for stdout and stderr, but I think it's worth checking for a second:
> 1. Is there a real scenario where encoding is different than we're used to?
> 2. What would be the impact if that occurred?
> 3. Do we want to guard against that, either with code or with documentation?

That's a good point. At the moment, if something is written to stdout in UTF-16, it is not masked. We could cover this by introducing a --masking-encoding flag. But I feel that is overkill for now. You are right that we encountered problems with this on Windows, but I don't think that's relevant in this case. The program itself determines what encoding is used in the output. All programs/languages I tested use ASCII or UTF-8, which both seem to work fine.

Unicode characters themselves are supported. ⓗⓔⓛⓛⓞ ⓣⓗⓔⓡⓔ gets masked normally.

For now, we could add to the docs that only UTF-8 output gets masked.
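A small sketch of why UTF-16 output slips through a byte-level matcher, assuming the secret is held as UTF-8 bytes. The secret "hunter2" and the helpers utf16LE and containsSecret are made up for this illustration.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"unicode/utf16"
)

// utf16LE encodes s as UTF-16 little-endian, without a byte order mark.
func utf16LE(s string) []byte {
	units := utf16.Encode([]rune(s))
	out := make([]byte, 2*len(units))
	for i, u := range units {
		binary.LittleEndian.PutUint16(out[2*i:], u)
	}
	return out
}

// containsSecret reports whether the UTF-8 bytes of secret occur in output.
func containsSecret(output []byte, secret string) bool {
	return bytes.Contains(output, []byte(secret))
}

func main() {
	const secret = "hunter2" // made-up example secret
	line := "password=" + secret

	fmt.Println(containsSecret([]byte(line), secret))              // true
	fmt.Println(containsSecret(utf16LE(line), secret))             // false
	fmt.Println(containsSecret([]byte("say ⓗⓔⓛⓛⓞ"), "ⓗⓔⓛⓛⓞ")) // true
}
```

In UTF-16 every ASCII character is followed by a zero byte, so the UTF-8 byte sequence of the secret never appears contiguously in the output. Non-ASCII characters are not a problem as long as the secret and the output are both UTF-8.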

mackenbach · 2020-04-02T10:38:11Z

Awesome, good balanced response.

> Unicode characters themselves are supported. ⓗⓔⓛⓛⓞ ⓣⓗⓔⓡⓔ gets masked normally.

Do we want to add a test for good measure?

> For now, we could add to the docs that only UTF-8 output gets masked.

Yes let's do that.

SimonBarendse

> @SimonBarendse should I focus the review on the code or more on the functionality?

@florisvdg As discussed in person during standup, I asked mainly for shared understanding and code ownership to ensure we can maintain this properly and smoothly. This is a complex piece of code, which I think deserves the attention of all of us. Any questions you might have can be reflected in the documentation, to further enhance maintainability.

This is good to go 🚀, but let's make sure to go over this soon @florisvdg , while it's still fresh in our heads.

SimonBarendse · 2020-04-02T16:00:25Z

Before shipping, could you still address this @jpcoenen ?

> > Unicode characters themselves are supported. ⓗⓔⓛⓛⓞ ⓣⓗⓔⓡⓔ gets masked normally.
>
> Do we want to add a test for good measure?

mackenbach · 2020-04-02T16:15:03Z

PR descriptions get committed, right? If so, please edit it so it's a bit more helpful for later.

SimonBarendse

jpcoenen · 2020-04-06T09:56:54Z

@SimonBarendse this means you tested it?

SimonBarendse · 2020-04-06T12:37:29Z

> @SimonBarendse this means you tested it?

Yes

SimonBarendse reviewed Mar 23, 2020

Comment thread internals/cli/masker/writer.go Outdated

Comment thread internals/secrethub/run.go Outdated

Comment thread internals/cli/masker/writer_test.go

Comment thread internals/cli/masker/writer.go Outdated

mackenbach reviewed Mar 23, 2020

Comment thread internals/cli/masker/writer.go Outdated

Use old sequenceMatcher to match secrets

44dcc4e

This gets rid of the maxLookback. Every byte is only checked once for a secret. It does require some comments. Looks very complicated at first, but with some proper documentation, it is not at all. Mainly because concerns are strictly separated.

jpcoenen mentioned this pull request Mar 25, 2020

Fix mixing stdout and stderr #265

Closed

jpcoenen added 4 commits March 26, 2020 12:36

Heavily updated new masker implementation

b050778

A lot cleaner now. And we're back to using <redacted by SecretHub> instead of *****.

Add extra comment and a diagram for clarification

6734798

Handle flush error

836afd1

Use cmd.io.Stdout() for testing purposes

ee6dc87

Fix typo in diagram

77c5ebc

jpcoenen requested review from SimonBarendse and mackenbach March 26, 2020 12:27

mackenbach reviewed Mar 26, 2020

SimonBarendse reviewed Mar 26, 2020

jpcoenen added 8 commits March 26, 2020 16:21

Refactor Run() to Start() and Flush() to Stop()

687ca6a

Unexport methods and rename matchers

3758a9a

Simplify findShift() comment and rename currentIndex to index

06bd87e

Minor comment changes

d02c767

Set options to masker using struct

81f8c92

This makes it clear what is configurable and what not.

Register frame with masker after matching secrets

ddee1b9

Otherwise no secrets get matched if buffering is disabled. To compensate for matching logic now affecting the buffer delay, a compensating offset is introduced.

Add missing argument

8a66ed4

Document masker options

a78ad02

golangcibot reviewed Mar 26, 2020

Comment thread internals/cli/masker/stream.go Outdated

Use shorthand time.Until()

0812022

jpcoenen marked this pull request as ready for review March 27, 2020 08:24

mackenbach reviewed Mar 31, 2020

Comment thread internals/cli/masker/matcher_test.go

jpcoenen added 2 commits March 31, 2020 16:39

Add comment to writeByte

b6e2608

Improve matcher tests

469b4d5

Update diagram to latest naming scheme

d622804

jpcoenen added 2 commits April 1, 2020 16:18

Fix missed name update in diagram

226712a

Fix typo in comment

d54d6ed

mackenbach added the bug Something isn't working label Apr 2, 2020

jpcoenen requested a review from SimonBarendse April 2, 2020 13:32

SimonBarendse approved these changes Apr 2, 2020

Add a test for using a Masker with multiple streams

7d596bc

jpcoenen added 2 commits April 2, 2020 18:16

Add test for masking unicode characters

7ac1f5f

Fix race condition in test

3485cec

The check was not really necessary anyway.

SimonBarendse approved these changes Apr 6, 2020

jpcoenen merged commit ef2656e into develop Apr 7, 2020

jpcoenen deleted the feature/mask-rewrite branch April 7, 2020 13:39

florisvdg mentioned this pull request May 6, 2020

Release v0.38.0 #281

Merged
