Allow users to manipulate data before dumping #227

mortenlj · 2018-10-05T07:41:46Z

Dumping the request/response is useful for debugging, but that means
the data usually will end up in logfiles or other places. Since many
requests or responses will contain sensitive data (authentication
tokens, usernames, cookies, etc.), it is sometimes necessary to redact some
of that information before dumping it. This implementation provides a
way to supply your own sanitizer, or use the provided sanitizer which
will redact information in a number of known sensitive headers.

Closes #226

sigmavirus24 · 2018-10-05T16:39:47Z

requests_toolbelt/utils/dump.py

+        """Sanitize a request header
+
+        :param name: The header name
+        :type name: `compat.basestring`


What is this `compat.basestring` notation that is used here and elsewhere? Does Sphinx know this to be some kind of shorthand?

It references the type compat.basestring, which is already in use in this module. However, Sphinx doesn't know it, and since I'm using bytes elsewhere, I guess the convention is to use Python 3 types. I'll change it to str.

Did you want to use :class:`requests.compat.basestring`? Because that would work in Sphinx.

Strictly speaking I'm not 100% sure what type is expected here. It depends on what types are used in the headers. I believe it is unicode in Python 2 and str in Python 3, but I'm not sure.

Generally speaking, it's roughly Union[six.binary_type, six.text_type], where on Python 2 those are str and unicode respectively, and on Python 3 they are bytes and str.

So requests.compat.basestring is probably the most correct thing to use then?
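To make the type question concrete: a sanitizer method may receive a header name as either text or bytes, so one option is to normalize it before comparing. A minimal sketch, assuming Python 3 and a hypothetical helper named `to_text` (it is not part of requests or requests-toolbelt):

```python
def to_text(value, encoding="ascii"):
    """Return a header name or value as text.

    Hypothetical helper for illustration only; it is not part of
    requests or requests-toolbelt.
    """
    if isinstance(value, bytes):
        # Bytes are decoded so later comparisons use one type.
        return value.decode(encoding)
    if isinstance(value, str):
        return value
    raise TypeError("expected str or bytes, got %s" % type(value).__name__)
```

With something like this in place, the docstring could document the parameter as accepting both types, and the sanitizer body would only ever deal with text.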

sigmavirus24 · 2018-10-06T23:28:49Z

requests_toolbelt/utils/dump.py


 HTTP_VERSIONS = {
    9: b'0.9',
    10: b'1.0',
    11: b'1.1',
 }

+#: List of sensitive headers copied from
+#: https://github.com/google/har-sanitizer
+SENSITIVE_HEADERS = {


I do not remember agreeing to wholesale import a list of headers to automagically sanitize. I thought we had agreed to merely allow users to provide a sanitizer, not give them a built-in one.

My suggestion was to provide a method for providing your own, and one implementation that will do the right thing for most people. I imagine most people who need this will need to support this exact use-case: Redact headers commonly used to send sensitive data.

To me, the value of the feature increases tenfold if this implementation is provided, instead of every user of toolbelt having to implement this on their own in every project where this is needed.

Of course, this is your library, if you don't want it I can remove it.

My concern is that this library is basically on life-support. Meaning that as this list grows, who will maintain it? And if people think this list is comprehensive, they will likely leak sensitive data without meaning to because they think this library is maintaining the list themselves.

Alternatively, what if we do the following:

Update HeaderSanitizer to require the headers set be provided by the user.

Document the SENSITIVE_HEADERS list more thoroughly as a point-in-time snapshot from the repository with a Last-Updated date of today or whatever.

Indicate that if there are updates to the list that weren't made here users should use that themselves and/or send a PR to update this list.
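The alternative proposed above could look roughly like the following. This is only a sketch: the `sanitize` method name, the `REDACTED_VALUE` placeholder, and the three-entry example set are assumptions made for illustration, and the set is not copied from har-sanitizer.

```python
REDACTED_VALUE = "<redacted>"  # hypothetical placeholder value

# Illustrative subset only: a point-in-time example, not a maintained list.
EXAMPLE_SENSITIVE_HEADERS = frozenset(["authorization", "cookie", "set-cookie"])


class HeaderSanitizer(object):
    """Redact the values of caller-supplied header names."""

    def __init__(self, sensitive_headers):
        # No default argument: the caller must decide what is sensitive.
        if not sensitive_headers:
            raise ValueError("sensitive_headers must not be empty")
        self._sensitive = frozenset(h.lower() for h in sensitive_headers)

    def sanitize(self, headers):
        """Return a copy of headers with sensitive values redacted."""
        return {
            name: REDACTED_VALUE if name.lower() in self._sensitive else value
            for name, value in headers.items()
        }
```

A caller would opt in explicitly, for example `HeaderSanitizer(EXAMPLE_SENSITIVE_HEADERS)`, which makes it visible at the call site that the list is the caller's responsibility.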

sigmavirus24 · 2018-10-06T23:31:48Z

requests_toolbelt/utils/dump.py

+class Sanitizer(object):
+    """Performs no sanitation"""
+
+    def request_header(self, name, value):


I believe that we should sanitize headers, not individual pairs. How would someone, for example, strip a header name-value pair if they wanted to?

I did it like this because iterating over the headers and values is non-trivial, and I thought it would be more useful if you didn't have to figure out how to do that every time you want to implement a sanitizer. Also, since headers is not a dict, what should this method return in order to replace it? Or should it modify the headers object in-place (is that possible)?

With regards to stripping a name-value pair completely, I figured that was a very uncommon use-case that might not be worth supporting.

I can change it to work on the full headers, but then I need help understanding the questions above.

Also, since headers is not a dict, what should this method return in order to replace it?

So headers is going to be a requests-specific dictionary implementation: CaseInsensitiveDict. We can do this:

    sanitized_headers = headers.copy()
    for name in headers:
        if self.should_sanitize_header(name):
            sanitized_headers[name] = REDACTED_VALUE
        elif self.should_strip_header(name):
            del sanitized_headers[name]

This would then allow users to define should_sanitize_header and should_strip_header, which only need to operate on the header name and return True/False. This could be the default implementation for the _header sanitization functions, since calling the should_*_header methods would raise exceptions if not implemented by default.

Ok, I think I get the gist of your suggestion. I will try to make some changes to move in this direction, and we can see where that gets us.
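For reference, the hook-based suggestion in this thread can be sketched as a small class hierarchy. The class names `HookSanitizer` and `AuthSanitizer`, the `request_headers` method name, and the placeholder value are assumptions for illustration, and a plain dict stands in for CaseInsensitiveDict so the example has no dependencies.

```python
REDACTED_VALUE = "<redacted>"  # hypothetical placeholder value


class HookSanitizer(object):
    """Base class: subclasses answer two yes/no questions per header name."""

    def should_sanitize_header(self, name):
        raise NotImplementedError()

    def should_strip_header(self, name):
        raise NotImplementedError()

    def request_headers(self, headers):
        sanitized = headers.copy()
        # Iterate over the original so the copy can be mutated safely.
        for name in headers:
            if self.should_sanitize_header(name):
                sanitized[name] = REDACTED_VALUE
            elif self.should_strip_header(name):
                del sanitized[name]
        return sanitized


class AuthSanitizer(HookSanitizer):
    """Example subclass: redact Authorization, drop Cookie."""

    def should_sanitize_header(self, name):
        return name.lower() == "authorization"

    def should_strip_header(self, name):
        return name.lower() == "cookie"
```

Because the hooks only see the header name, a subclass cannot redact based on the value; that is a limitation of this particular sketch rather than of the proposal itself.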

sigmavirus24 · 2018-10-06T23:33:04Z

requests_toolbelt/utils/dump.py

 _PrefixSettings = collections.namedtuple('PrefixSettings',
                                         ['request', 'response'])


+class Sanitizer(object):


There should be a sanitizer interface of sorts that people subclass and override methods without the methods defaulting to unsafe behaviours. Let's have this be a "base class" that raises NotImplementedError() exceptions and then a NoopSanitizer class that implements this behaviour.

I'm not sure I understand how this is unsafe behavior?

Whenever someone needs to implement a sanitizer, they would need to subclass the NoopSanitizer instead of the "interface", because it's going to be extremely rare that you want to override all four methods. I'm not sure I see the benefit of splitting the default behavior out from the interface class, but I can do it if you think it is important.

Unsafe-by-default means no sanitization by default. I don't believe that's the default we should aim to make easy for everyone and if they do want to use that behaviour, as you said, they can use the NoopSanitizer and it's far more explicit that they're doing a potentially unsafe thing. If they try to subclass Sanitizer they'll think it "just works" because it doesn't require them to do anything. I do think that forcing them to do something for the 4 methods is the better option because they have to consider what needs sanitization in all 4 cases.

I'm confused. Above you talk about not wanting a list of headers to sanitize by default, yet here you are talking about making the default do sanitizing because that is safer. If you don't have a list of headers to sanitize, how do you intend to do "safe-by-default" sanitizing?

If we say that the default is to not do anything (as today, aka unsafe-by-default), then a HeaderSanitizer that requires the user to provide a list of headers to sanitize makes sense. The user would then create an instance with a list of headers that they consider sensitive, and pass that as the sanitizer argument.

Hi there, let me be clear:

If the base sanitizer never sanitizes and silently is a NoOp then it's a bad default. The base sanitizer that users should sub-class should provide errors when trying to use the other methods because the user hasn't defined what they want to do. Otherwise Sanitizer in this case is unsafe-by-default. Safe-by-default means making the user think about it in this case.

Further, I'm starting to think this design is too limiting. How can someone, say, compose the HeaderSanitizer here and a BodySanitizer elsewhere? They'd have to sub-class, and then that becomes a nightmare of copy-pasting code around to make it all play nicely together. Instead, we likely want to accept a list of sanitizers that the user can specify, and call them in the order the user provides.
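The two ideas in this comment, an interface that fails loudly and a user-supplied list of sanitizers applied in order, might be sketched as follows. The four method names, `TokenBodySanitizer`, and `apply_sanitizers` are assumptions made for this example; the thread only establishes that there are four methods to override.

```python
class Sanitizer(object):
    """Interface only: each method must be overridden on purpose.

    The four method names are assumptions made for this sketch.
    """

    def request_headers(self, headers):
        raise NotImplementedError()

    def request_body(self, body):
        raise NotImplementedError()

    def response_headers(self, headers):
        raise NotImplementedError()

    def response_body(self, body):
        raise NotImplementedError()


class NoopSanitizer(Sanitizer):
    """Explicit opt-in to passing everything through unchanged."""

    def request_headers(self, headers):
        return headers

    def request_body(self, body):
        return body

    def response_headers(self, headers):
        return headers

    def response_body(self, body):
        return body


class TokenBodySanitizer(NoopSanitizer):
    """Hypothetical body sanitizer that masks one known secret."""

    def __init__(self, secret):
        self._secret = secret

    def request_body(self, body):
        return body.replace(self._secret, b"***")


def apply_sanitizers(sanitizers, method_name, data):
    """Run data through each sanitizer's method, in the given order."""
    for sanitizer in sanitizers:
        data = getattr(sanitizer, method_name)(data)
    return data
```

Composition then needs no extra subclassing: `apply_sanitizers([header_sanitizer, body_sanitizer], "request_body", body)` calls each one in turn, and an empty list leaves the data unchanged.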

sigmavirus24 · 2018-10-06T23:33:55Z

requests_toolbelt/utils/dump.py

@@ -71,20 +169,24 @@ def _dump_request_data(request, prefixes, bytearr, proxy_info=None):
    bytearr.extend(prefix + b'Host: ' + host_header + b'\r\n')

    for name, value in headers.items():
+        value = sanitizer.request_header(name, value)


Since our sanitizer will be sanitizing the entire headers dictionary, let's move this outside the loop.
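As a rough illustration of that change, the whole mapping is sanitized once and the loop then only writes what is left. `dump_headers` and `drop_cookies` are hypothetical stand-ins, not the real `_dump_request_data`, and the sketch assumes text header names and values that are ASCII-encodable.

```python
def dump_headers(headers, prefix, bytearr, sanitize):
    """Append one 'Name: value' line per header to bytearr."""
    # Sanitize the whole mapping once, before the loop, so the sanitizer
    # can redact values and also drop headers entirely.
    headers = sanitize(headers)
    for name, value in headers.items():
        line = name.encode("ascii") + b": " + value.encode("ascii")
        bytearr.extend(prefix + line + b"\r\n")


def drop_cookies(headers):
    """Hypothetical sanitizer: remove the Cookie header if present."""
    return {k: v for k, v in headers.items() if k.lower() != "cookie"}
```

Sanitizing per name-value pair inside the loop could only change a value; sanitizing the mapping up front is what makes removing a header possible.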

sigmavirus24 · 2018-10-06T23:35:38Z

requests_toolbelt/utils/dump.py

@@ -115,12 +219,17 @@ def _coerce_to_bytes(data):


 def dump_response(response, request_prefix=b'< ', response_prefix=b'> ',
-                  data_array=None):
+                  data_array=None, sanitizer=NOOP_SANITIZER):


Let's leave this as having a default of None and then instantiate a new NoopSanitizer if it is None. (Same for above.)

Why? I don't see the upside to be honest, so it would be appreciated if you could enlighten me.

Downsides of doing as you suggest (as I see it):

Hide the actual behavior from the signature, since you no longer see that it uses NoopSanitizer

Complicate (slightly) the body of the function, since we now have to do a None-check and instantiate a new NoopSanitizer

Performance-drop if we instantiate a NoopSanitizer on every call

Hide the actual behavior from the signature, since you no longer see that it uses NoopSanitizer

The behaviour should be documented in the doc-string regardless. I'm not sure the behaviour belongs in the signature at all.

Complicate (slightly) the body of the function, since we now have to do a None-check and instantiate a new NoopSanitizer

I don't believe that sanitizer = sanitizer or NOOP_SANITIZER or even sanitizer = sanitizer or NoopSanitizer() is that big of a complication.

Performance-drop if we instantiate a NoopSanitizer on every call

    $ python -m timeit -s 'from requests_toolbelt.utils.dump import Sanitizer' -- 'Sanitizer()'
    10000000 loops, best of 3: 0.108 usec per loop

And that's on a sad little old machine. Anything remotely modern or performant should have little trouble with the added instantiation.

Further, comparing it to the rest of the dump_* functions which iterate over potentially large dictionaries, it really can't be such a significant hit as to prefer the minor potential performance improvement over correctness.

I still don't see the upside to be honest, but as you say, most of my objections are really minor points. I'll change it.
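The pattern being requested is the usual `None` sentinel default. A minimal sketch, using a hypothetical one-method `NoopSanitizer` and a hypothetical `dump_example` function in place of the real `dump_response` and `dump_all`:

```python
class NoopSanitizer(object):
    """Pass headers through unchanged (stand-in for the real class)."""

    def request_headers(self, headers):
        return headers


def dump_example(headers, sanitizer=None):
    """Return the headers that would be dumped.

    :param sanitizer: object used to sanitize the headers; when omitted,
        a NoopSanitizer is used and nothing is redacted.
    """
    if sanitizer is None:
        # Resolve the default at call time rather than in the signature.
        sanitizer = NoopSanitizer()
    return sanitizer.request_headers(headers)
```

The `sanitizer = sanitizer or NoopSanitizer()` spelling mentioned above is shorter, but it would also replace any sanitizer object that happens to be falsy; the explicit `is None` check avoids that edge case.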

sigmavirus24 · 2018-10-06T23:35:58Z

requests_toolbelt/utils/dump.py

    return data


-def dump_all(response, request_prefix=b'< ', response_prefix=b'> '):
+def dump_all(response, request_prefix=b'< ', response_prefix=b'> ',
+             sanitizer=NOOP_SANITIZER):


Same comment as above.

sigmavirus24 · 2018-10-06T23:36:30Z

tests/test_dump.py

+NORMAL_HEADERS = (
+    "Accept", "Accept-Encoding", "Host", "User-Agent", "Accept-Ranges",
+    "Cache-Control", "Content-Encoding", "Content-Length", "Content-Type",
+    "Date", "Etag", "Expires", "Last-Modified", "Server", "Vary", "X-Cache"


What about custom headers? OpenStack-Identity, X-My-Custom-Header, etc.?

This is used in a test to confirm that we are not redacting all headers. A test that covers all possible custom headers would be impossible, since that is a nearly infinite set. I decided to just use a semi-random selection of common headers.

If we are removing the HeaderSanitizer, then this point is moot anyway, since this test would go away too.
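If a header sanitizer with a caller-supplied list does survive, the custom-header question could be covered by a pair of small tests rather than an exhaustive list. The `redact` function below is a hypothetical stand-in for whatever the final sanitizer looks like; only the header name `X-My-Custom-Header` comes from the review comment.

```python
import unittest

REDACTED = "<redacted>"  # hypothetical placeholder value


def redact(headers, sensitive):
    """Return a copy of headers with the listed names redacted."""
    wanted = {name.lower() for name in sensitive}
    return {
        name: REDACTED if name.lower() in wanted else value
        for name, value in headers.items()
    }


class RedactTests(unittest.TestCase):
    def test_unlisted_headers_are_untouched(self):
        headers = {"Accept": "*/*", "X-My-Custom-Header": "value"}
        self.assertEqual(redact(headers, ["Authorization"]), headers)

    def test_listed_custom_header_is_redacted(self):
        headers = {"X-My-Custom-Header": "value"}
        result = redact(headers, ["x-my-custom-header"])
        self.assertEqual(result, {"X-My-Custom-Header": REDACTED})
```

The first test checks the property the original `NORMAL_HEADERS` test was after (not everything is redacted), and the second shows that a custom header is handled like any other once the caller lists it.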

mortenlj · 2018-10-16T08:56:29Z

I have made some changes that I believe are in the right direction. I think there are some comments I have yet to address, but I'm not sure if they are still relevant. I think the discussion might have changed the direction a bit?

Anyway, let me know what you think about these latest changes, and where you want further changes (if any). I want to get this done, but need you to tell me where to go so I don't have to guess.

jdufresne · 2019-05-02T18:39:16Z

Hi @mortenlj and @sigmavirus24, this is a feature I'm really interested in. Is there anything I can do to help move this along?

mortenlj · 2019-05-02T19:04:52Z

@jdufresne: I've moved on to other things, probably not going to circle back to this for the foreseeable future. Feel free to take over.

sigmavirus24 requested changes Oct 6, 2018

mortenlj added 3 commits October 10, 2018 12:36

Review fixes, part 1

c26cb3c

Further fixes from review

790d487

Separate interface definition and No-op implementation Sanitize full headers

Merge branch 'master' into feature/redact-sensitive-headers

b11bba0

jdufresne mentioned this pull request May 2, 2019

Allow users to manipulate request/response data before dumping #263

Open

mortenlj closed this Aug 9, 2019
