In Python 3, dogpile's own key mangler can't mangle the output of its default key generators #159

Closed
AdamWill opened this issue Aug 9, 2019 · 6 comments

AdamWill commented Aug 9, 2019

So I only just ran into this and I may be getting the wrong end of a stick, but I don't think so. It seems to me that, in Python 3, dogpile's sha1_mangle_key cannot mangle the keys produced by dogpile's function_key_generator and other key generators - the ones that are used by default for new cache regions.

The output format of function_key_generator and function_multi_key_generator is defined by a kwarg (to_str) whose default value is dogpile.util.compat.string_type, which on Python 3 is str. That is the Unicode string type, like unicode on Python 2.

sha1_mangle_key basically just calls hashlib.sha1() on whatever it's fed...and hashlib.sha1() will not accept "Unicode-objects", which in Python 3 means str instances. It requires them to be encoded to bytes:

Python 3.7.4 (default, Jul 27 2019, 01:48:07) 
[GCC 9.1.1 20190605 (Red Hat 9.1.1-2)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from hashlib import sha1
>>> sha1('foo')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: Unicode-objects must be encoded before hashing
>>> sha1('foo'.encode('utf-8'))
<sha1 HASH object @ 0x7fd436d720c0>
>>> 

What this means is that if you have Python 3 code that uses a dogpile cache region with the default key generators, and sets the key mangler to the dogpile-provided sha1_mangle_key, it just doesn't work. Here's a minimal reproducer:

import dogpile.cache
import dogpile.cache.util

def make_cached_method(cache):

    @cache.cache_on_arguments()
    def cached_method(key):
        print(key)

    return cached_method

ourcache = dogpile.cache.make_region(key_mangler=dogpile.cache.util.sha1_mangle_key)
ourcache.configure(
    "dogpile.cache.dbm",
    expiration_time=300,
    arguments={
        "filename":"file.dbm"
    }
)
ourmethod = make_cached_method(ourcache)
ourmethod('foo')

On Python 2 this works fine. On Python 3 it blows up:

Traceback (most recent call last):
  File "/tmp/test.py", line 21, in <module>
    ourmethod('foo')
  File "</home/adamw/local/tahrir/tahrir-venv/lib/python3.7/site-packages/decorator.py:decorator-gen-1>", line 2, in cached_method
  File "/home/adamw/local/tahrir/tahrir-venv/lib/python3.7/site-packages/dogpile/cache/region.py", line 1272, in get_or_create_for_user_func
    should_cache_fn, (arg, kw))
  File "/home/adamw/local/tahrir/tahrir-venv/lib/python3.7/site-packages/dogpile/cache/region.py", line 823, in get_or_create
    key = self.key_mangler(key)
  File "/home/adamw/local/tahrir/tahrir-venv/lib/python3.7/site-packages/dogpile/cache/util.py", line 123, in sha1_mangle_key
    return sha1(key).hexdigest()
TypeError: Unicode-objects must be encoded before hashing

surely the stock mangler should work with the stock and default key generators?

AdamWill added a commit to AdamWill/tahrir that referenced this issue Aug 9, 2019
This does enough Python 3 porting to make Tahrir run and do some
basic stuff under Python 3 - I've tested creating badges and
series, issuing badges, clicking around in Leaderboard and
Explore, looking at RSS and JSON views of things. This does not
break Python 2 compatibility - I'd rather not do that yet so we
can test things easily both ways and identify any differences.
We could remove Python 2 compat later.

Most of the changes are based on 2to3 suggestions and are pretty
self-explanatory. Some less obvious ones:

* The str_to_bytes and dogpile stuff: well, see
sqlalchemy/dogpile.cache#159 . The
`sha1_mangle_key` mangler that we're using, which is provided by
dogpile, needs input as a bytestring. This is pretty awkward. It
obviously caused *some* problems even in Python 2 (as this app
explicitly uses unicodes in some places), but in Python 3 it's
worse; everywhere you see `str_to_bytes` being called is a place
where I found a crash because we wound up sending a non-encoded
`str` to `sha1_mangle_key` (or, in the case of `email_md5` and
`email_sha1`, to hashlib directly).

* map moved in Python 3; 2to3 suggests handling it with a six
move, but I preferred just replacing all the `map` uses with
comprehensions.

* 2to3 recommended a change to strip_tags, but I noticed it is
not actually used any more. It was used to sanitize HTML input
to the admin route back when it was added, but the admin route
was entirely rewritten later and the use of strip_tags was taken
out. So I just removed strip_tags and its supporting players.

* merge_dicts is used in places where we were merging two dicts
in a single expression by converting them to lists, combining
the lists, and turning the combined list back into a dict again.
You can still do this in Python 3 but you have to add extra
`list()` calls and it gets really ugly. Per
https://stackoverflow.com/questions/38987/how-to-merge-two-dictionaries-in-a-single-expression
it's also not resource-efficient, so this seems like a better
approach - it's informed by the code in that SO question but I
wrote the function myself rather than taking one from that page
to avoid technically having a tiny bit of CC-BY-SA code in this
AGPL project.

Signed-off-by: Adam Williamson <awilliam@redhat.com>

zzzeek commented Aug 10, 2019

> surely the stock mangler should work with the stock and default key generators?

Well, yes; I'm not sure why it seems like I need to be convinced. This is a mostly forgotten function that was written before dogpile supported Python 3, and it has no test coverage, so there's the bug.

@sqla-tester

Mike Bayer has proposed a fix for this issue in the master branch:

Encode string key for sha1 on python 3 https://gerrit.sqlalchemy.org/1402

@AdamWill

ah, OK. It didn't seem like the app I was working on is doing anything particularly unusual so I was surprised it hadn't been spotted till now, I guess.

On the fix - maybe it would be better to only encode it if it's actually a string type? Seems like you're not using six, but then you could just check if it's a unicode for Python 2 or a str for Python 3 and encode it if so. Otherwise just use it as-is. This would avoid it blowing up if someone has already set something up to pass it an encoded value (like I did, for our project).
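
A minimal sketch of the "only encode if it's a string type" idea, for Python 3 only. The function name sha1_mangle_key_checked is hypothetical and this is not the code from the proposed fix; UTF-8 as the encoding is also an assumption.

```python
from hashlib import sha1


def sha1_mangle_key_checked(key):
    """Hypothetical variant: encode text keys, pass bytes keys through as-is."""
    if isinstance(key, str):
        # Assumption: UTF-8 is an acceptable encoding for cache keys.
        key = key.encode("utf-8")
    return sha1(key).hexdigest()


# A str key and the equivalent pre-encoded bytes key hash to the same value,
# so callers that already encode their keys keep working.
text_digest = sha1_mangle_key_checked("foo")
bytes_digest = sha1_mangle_key_checked(b"foo")
```

With this shape, a project that already passes encoded keys to the mangler would not need to change anything.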

zzzeek commented Aug 10, 2019

> ah, OK. It didn't seem like the app I was working on is doing anything particularly unusual so I was surprised it hadn't been spotted till now, I guess.
>
> On the fix - maybe it would be better to only encode it if it's actually a string type? Seems like you're not using six, but then you could just check if it's a unicode for Python 2 or a str for Python 3 and encode it if so. Otherwise just use it as-is. This would avoid it blowing up if someone has already set something up to pass it an encoded value (like I did, for our project).

I thought of this, but I don't much like the performance overhead of isinstance(). I would imagine that if you worked around this issue, you just did your own sha1 call, as it's only a one-liner. Cache keys weren't expected to be bytes. Anyway, in this case we'd catch unicode under Python 2 also, I guess.
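
For illustration, the kind of one-liner meant here might look like the sketch below on Python 3. The name encode_then_sha1 is hypothetical, and the function assumes every key is a str; it also shows why a caller who already passes bytes would break if the encoding were unconditional.

```python
from hashlib import sha1


def encode_then_sha1(key):
    # Assumption: key is always a str, so it is encoded unconditionally.
    return sha1(key.encode("utf-8")).hexdigest()


digest = encode_then_sha1("foo")

# bytes objects have no .encode() method on Python 3, so a pre-encoded
# key fails here instead of being hashed.
try:
    encode_then_sha1(b"foo")
    bytes_accepted = True
except AttributeError:
    bytes_accepted = False
```

A function like this could be passed as key_mangler to make_region in the same way the reproducer passes sha1_mangle_key.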

zzzeek commented Aug 10, 2019

of course under Python 2 you can pass u'' or '' and it works equally well, that's annoying

AdamWill commented Aug 10, 2019

The project I'm working on was actually already working around this problem to some extent even before I started porting it to Python 3 - it explicitly uses u"" literals in some places so it had to care. Specifically, this bit has been there since 2014.

Would just doing a try/except be faster than an isinstance? It's less strictly correct but the difference is pretty academic...
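
A rough sketch of the try/except idea, again Python 3 only and with a hypothetical function name. It relies on hashlib.sha1() raising TypeError for str input, as in the traceback earlier in this issue; whether it is actually faster than isinstance() is not measured here.

```python
from hashlib import sha1


def sha1_mangle_key_eafp(key):
    """Hypothetical variant: hash directly, encode only if hashlib refuses."""
    try:
        return sha1(key).hexdigest()
    except TypeError:
        # Reached for str keys on Python 3; assumes UTF-8 is acceptable.
        return sha1(key.encode("utf-8")).hexdigest()


text_digest = sha1_mangle_key_eafp("foo")
bytes_digest = sha1_mangle_key_eafp(b"foo")
```

Note that if most keys are str, as the default key generators produce, the except branch runs on nearly every call, so trying key.encode() first and falling back for bytes may be the better ordering.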

AdamWill added a commit to AdamWill/tahrir that referenced this issue Aug 11, 2019
cverna pushed a commit to fedora-infra/tahrir that referenced this issue Aug 12, 2019