Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve false negative ratio by detecting keys with hyphens #49

Closed
domanchi opened this issue Jul 3, 2018 · 7 comments
Closed

Improve false negative ratio by detecting keys with hyphens #49

domanchi opened this issue Jul 3, 2018 · 7 comments
Assignees
Labels
false negatives pending The issue still needs to be reviewed by one of the maintainers.

Comments

@domanchi
Copy link
Contributor

domanchi commented Jul 3, 2018

Certain API keys use hyphens.

e.g. blahblah-aaaa-bbbb-cccc-ddddddd

This currently is not caught by the suite of HighEntropyStringPlugins.

@domanchi domanchi added the good first issue The issue can be tackled by someone who has little to no knowledge about the project. label Jul 3, 2018
@KevinHock
Copy link
Collaborator

There are also some AKIAblahblah keys that do not have high-entropy enough, but I suppose that should be a different issue :)

@KevinHock KevinHock added the help wanted Indicates that we would like someone that’s not a maintainer to work on the issue. label Sep 13, 2018
@mohit-surana
Copy link

mohit-surana commented Oct 24, 2018

I wanted to work on this issue. On doing a little bit of digging, I came up with two proposed solutions and would really appreciate any comments regarding the same:

  1. Can we expect the client to include a hyphen within the charset? If yes, then I believe we just need to use re.escape(charset) instead of charset on this line
  2. If clients should continue to use the existing charset, we need to either enrich just the regex, or both the regex and charset (internally). Unless we append the hyphen to the charset in the constructor, the entropy calculation will not use hyphens. So should hyphens be included in the entropy calculation?

Finally, I believe tests should go here, right?
Thanks!

@domanchi
Copy link
Contributor Author

Those are good questions @mohit-surana! The short answers are:

  1. No (we should be able to support both, holistically)
  2. Yes?

Based on the entropy algorithm, it seems that the more characters in the charset, the higher the entropy can be.

Following this logic, it would suggest that a more liberal charset may require a different entropy configuration level, seeing that the same level may produce more false positives.

However, if this is true, then any additions to the charset would require a completely separate plugin (e.g. adding hyphens and percentage signs -%), and the maintenance of these potential plugins could get very messy.

Any thoughts on this?

@mohit-surana
Copy link

Theoretically, yes. It would increase the false positive rate while reducing false negative rates as well. Ultimately it will be a trade-off between false positive and false negatives. Do we have any statistics regarding the current system's false positive rates?

How can we design good tests to measure the new statistics, that have a large coverage to assess the new FP/FN rates?

As for new plugins, it seems to me that ultimately, making changes to the entropy calculation is a big NO as it may affect current clients. And you can make combinatorial number of plugins if we make one for each small difference. Would it be better to allow clients to pass an additional argument indicating whether they would like to include additional preset/client specified symbols?

Bottom line: If FP increases a lot, we need to have the client make a conscious decision to move into a new version that supports hyphens.

@domanchi
Copy link
Contributor Author

I'm in favor of the additional argument, but I don't know how that might look like with the user interface. Certainly would increase the scope of this issue (and perhaps no longer a "good first issue")! If you still wanted to take it on, we'd more than welcome the contribution!

Otherwise, the AKIA prefixed issue that @KevinHock mentioned may be a good start. Though it doesn't strictly find the AWS secret, it gives a good indication that there might be a secret there, in the same principle as "where there's smoke, there's fire".

As for testing FP/FN rates, we are building a large internal collection of various different secrets that we use to experiment with our new plugins. We can certainly run your plugin on our corpus, and help tweak its default sensitivity.

@mohit-surana
Copy link

Hey @domanchi. I am interested in implementing the additional argument version of the solution. I am a bit caught up with stuff at the moment and I'll get back to it as soon as time permits!

The internal corpus sounds like a really good idea, and in general will help attract more users as well. As for the AKIA prefix, I will need to think further to understand how we can incorporate patterns along with the entropy calculation. Let me get back to you!

@KevinHock KevinHock self-assigned this Jun 24, 2019
@KevinHock KevinHock added accuracy false negatives and removed good first issue The issue can be tackled by someone who has little to no knowledge about the project. help wanted Indicates that we would like someone that’s not a maintainer to work on the issue. labels Jun 24, 2019
@lorenzodb1 lorenzodb1 added pending The issue still needs to be reviewed by one of the maintainers. and removed accuracy labels Jun 13, 2022
@lorenzodb1
Copy link
Member

We're going to close this issue as it hasn't received any update in a very long time. Feel free to re-open it if you think it's still relevant.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
false negatives pending The issue still needs to be reviewed by one of the maintainers.
Projects
None yet
Development

No branches or pull requests

4 participants