Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow users to manipulate data before dumping #227

Closed
wants to merge 4 commits into from
Closed

Allow users to manipulate data before dumping #227

wants to merge 4 commits into from

Conversation

mortenlj
Copy link

@mortenlj mortenlj commented Oct 5, 2018

Dumping the request/response is useful for debugging, but that means
the data usually will end up in logfiles or other places. Since many
requests or responses will contain sensitive data (authentication
tokens, username, cookies etc), it is sometimes necessary to redact some
of that information before dumping it. This implementation provides a
way to supply your own sanitizer, or use the provided sanitizer which
will redact information in a number of known sensitive headers.

Closes #226

Dumping the request/response is useful for debugging, but that means
the data usually will end up in logfiles or other places. Since many
requests or responses will contain sensitive data (authentication
tokens, username, cookies etc), it is sometimes necessary to redact some
of that information before dumping it. This implementation provides a
way to supply your own sanitizer, or use the provided sanitizer which
will redact information in a number of known sensitive headers.

Closes #226
"""Sanitize a request header

:param name: The header name
:type name: `compat.basestring`
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this

`compat.basestring` 

notation that is used here and elsewhere? Does Sphinx know this to be some kind of shorthand?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It references the type compat.basestring, which is already in use in this module. However Sphinx doesn't know it, and since I'm using bytes elsewhere I guess the convention is to use Python 3 types. I'll change it to str.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you want to use

:class:`requests.compat.basestring`

Because that would work in Sphinx

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strictly speaking I'm not 100% sure what type is expected here. It depends on what types are used in the headers. I believe it is unicode in Python 2 and str in Python 3, but I'm not sure.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally speaking, it's roughly Union[six.bytes_type, six.text_type] where on Python 2 those are str and unicode respectively and on Python 3 they are bytes and str.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So requests.compat.basestring is probably the most correct thing to use then?


HTTP_VERSIONS = {
9: b'0.9',
10: b'1.0',
11: b'1.1',
}

#: List of sensitive headers copied from
#: https://github.com/google/har-sanitizer
SENSITIVE_HEADERS = {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not remember agreeing to wholesale import a list of headers to automagically sanitize. I thought we had agreed to merely allow users to provide a sanitizer, not give them a built-in one.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My suggestion was to provide a method for providing your own, and one implementation that will do the right thing for most people. I imagine most people who need this will need to support this exact use-case: Redact headers commonly used to send sensitive data.

To me, the value of the feature increases tenfold if this implementation is provided, instead of every user of toolbelt having to implement this on their own in every project where this is needed.

Of course, this is your library, if you don't want it I can remove it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My concern is that this library is basically on life-support. Meaning that as this list grows, who will maintain it? And if people think this list is comprehensive, they will likely leak sensitive data without meaning to because they think this library is maintaining the list themselves.

Alternatively, what if we do the following:

  1. Update HeaderSanitizer to require the headers set be provided by the user.
  2. Document the SENSITIVE_HEADERS list more thoroughly as a point-in-time snapshot from the repository with a Last-Updated date of today or whatever.
  3. Indicate that if there are updates to the list that weren't made here users should use that themselves and/or send a PR to update this list.

requests_toolbelt/utils/dump.py Outdated Show resolved Hide resolved
class Sanitizer(object):
"""Performs no sanitation"""

def request_header(self, name, value):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that we should sanitize headers not individual pairs. How would someone, for example, strip a header name-value pair if they wanted to?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did it like this because iterating over the headers and values is non-trivial, and I thought it would be more useful if you didn't have to figure out how to do that every time you want to implement a sanitizer. Also, since headers is not a dict, what should this method return in order to replace it? Or should it modify the headers object in-place (is that possible)?

With regards to stripping a name-value pair completely, I figured that was a very uncommon use-case that might not be worth supporting.

I can change it to work on the full headers, but then I need help understanding the questions above.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, since headers is not a dict, what should this method return in order to replace it?

So headers is going to be a requests-specific dictionary implementation: CaseInsensitiveDict. We can, do this:

sanitized_headers = headers.copy()
for name in headers:
    if self.should_sanitize_header(name):
        sanitized_headers[name] = REDACTED_VALUE
    elif self.should_strip_header(name):
        del sanitized_headers[name]

This would then allow users to define should_sanitize_header and should_strip_header which only needs to operate on the header name and return True/False. This could be the default implementation for the _header sanitization functions since calling the should_*_header would raise exceptions if not implemented by default.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I think I get the gist of your suggestion. I will try to make some changes to move in this direction, and we can see where that gets us.

_PrefixSettings = collections.namedtuple('PrefixSettings',
['request', 'response'])


class Sanitizer(object):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be a sanitizer interface of sorts that people subclass and override methods without the methods defaulting to unsafe behaviours. Let's have this be a "base class" that raises NotImplementedError() exceptions and then an NoopSanitizer class that implements this behaviour.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand how this is unsafe behavior?

Whenever someone needs to implement a sanitizer, they would need to subclass the NoopSanitizer instead of the "interface", because it's going to be extremely rare that you want to override all four methods. I'm not sure I see the benefit of splitting the default behavior out from the interface class, but I can do it if you think it is important.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unsafe-by-default means no sanitization by default. I don't believe that's the default we should aim to make easy for everyone and if they do want to use that behaviour, as you said, they can use the NoopSanitizer and it's far more explicit that they're doing a potentially unsafe thing. If they try to subclass Sanitizer they'll think it "just works" because it doesn't require them to do anything. I do think that forcing them to do something for the 4 methods is the better option because they have to consider what needs sanitization in all 4 cases.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused. Above you talk about not wanting a list of headers to sanitize by default, yet here you are talking about making the default do sanitizing because that is safer. If you don't have a list of headers to sanitize, how do you intend to do "safe-by-default" sanitizing?

If we say that the default is to not do anything (as today, aka unsafe-by-default), then a HeaderSanitizer that requires the user to provide a list of headers to sanitize makes sense. The user would the create an instance with a list of headers that they consider sensitive, and pass that as the sanitizer argument.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi there, let me be clear:

If the base sanitizer never sanitizes and silently is a NoOp then it's a bad default. The base sanitizer that users should sub-class should provide errors when trying to use the other methods because the user hasn't defined what they want to do. Otherwise Sanitizer in this case is unsafe-by-default. Safe-by-default means making the user think about it in this case.

Further, I'm starting to think this design is too limiting. How can someone say, compose the HeaderSanitizer here and a BodySanitizer elsewhere. They'd have to sub-class and then that becomes a nightmare of copy-pasting code around to make it all play nicely together. Instead, we likely want to accept a list of sanitizers that the user can specify, and call them in the order the user provides.

@@ -71,20 +169,24 @@ def _dump_request_data(request, prefixes, bytearr, proxy_info=None):
bytearr.extend(prefix + b'Host: ' + host_header + b'\r\n')

for name, value in headers.items():
value = sanitizer.request_header(name, value)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since our sanitizer will be sanitizing the entire headers dictionary, let's move this outside the loop

@@ -115,12 +219,17 @@ def _coerce_to_bytes(data):


def dump_response(response, request_prefix=b'< ', response_prefix=b'> ',
data_array=None):
data_array=None, sanitizer=NOOP_SANITIZER):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's leave this as having a default of None and then instantiate a new NoopSanitizer if it is None. (Same for above.)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why? I don't see the upside to be honest, so it would be appreciated if you could enlighten me.

Downsides of doing as you suggest (as I see it):

  • Hide the actual behavior from the signature, since you no longer see that it uses NoopSanitizer
  • Complicate (slightly) the body of the function, since we now have to do a None-check and instantiate a new NoopSanitizer
  • Performance-drop if we instantiate a NoopSantiizer on every call

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hide the actual behavior from the signature, since you no longer see that it uses NoopSanitizer

The behaviour should be documented in the doc-string regardless. I'm not sure the behaviour belongs in the signature at all.

Complicate (slightly) the body of the function, since we now have to do a None-check and instantiate a new NoopSanitizer

I don't believe that sanitizer = sanitizer or NOOP_SANITIZER or even sanitizer = sanitizer or NoopSanitizer() is that big of a complication.

Performance-drop if we instantiate a NoopSantiizer on every call

❯❯❯ python -m timeit -s 'from requests_toolbelt.utils.dump import Sanitizer' -- 'Sanitizer()'    ⏎ feature/redact-sensitive-headers
10000000 loops, best of 3: 0.108 usec per loop

And that's on a sad little old machine. Anything remotely modern or performant should have little trouble with the added instantiation.

Further, comparing it to the rest of the dump_* functions which iterate over potentially large dictionaries, it really can't be such a significant hit as to prefer the minor potential performance improvement over correctness.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't see the upside to be honest, but as you say, most of my objections are really minor points. I'll change it.

return data


def dump_all(response, request_prefix=b'< ', response_prefix=b'> '):
def dump_all(response, request_prefix=b'< ', response_prefix=b'> ',
sanitizer=NOOP_SANITIZER):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above.

NORMAL_HEADERS = (
"Accept", "Accept-Encoding", "Host", "User-Agent", "Accept-Ranges",
"Cache-Control", "Content-Encoding", "Content-Length", "Content-Type",
"Date", "Etag", "Expires", "Last-Modified", "Server", "Vary", "X-Cache"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about custom headers? OpenStack-Identity, X-My-Custom-Header, etc.?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is used in a test to confirm that we are not redacting all headers. A test that covers all possible custom headers would be impossible, since that is a nearly infinite set. I decided to just use a semi-random selection of common headers.

If we are removing the HeaderSanitizer, then this point is moot anyway, since this test would go away too.

@mortenlj
Copy link
Author

I have made some changes, that I believe are in the right direction. I think there are some comments I have yet to address, but I'm not sure if they are still relevant. I think the discussion might have changed the direction a bit?

Anyway, let me know what you think about these latest changes, and where you want further changes (if any). I want to get this done, but need you to tell me where to go so I don't have to guess.

@jdufresne
Copy link
Member

Hi @mortenlj and @sigmavirus24 this is a feature I'm really interested in. Is there anything I can do to help move this along?

@mortenlj
Copy link
Author

mortenlj commented May 2, 2019

@jdufresne: I've moved on to other things, probably not going to circle back to this for the foreseeable future. Feel free to take over.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Redacting sensitive headers when dumping request/response
3 participants