
Decouple serialization logic from CacheBackend classes #191

Closed
wants to merge 18 commits into from

Conversation

youtux
Contributor

@youtux youtux commented Sep 29, 2020

This PR is a continuation of #190.

The idea is to decouple the CacheBackend subclasses from the serialization logic. The serializer/deserializer can be fully configured by the end user, but each subclass can default to something that makes sense for its backend library.

For example, the Redis client and the file system backend only work with bytes objects. Memcached clients (pylibmc, python-memcached, bmemcached) can work with all sorts of objects, but they have debatable defaults for the pickle protocol they use.
Memory backends can work with all sorts of objects too, and they don't need to pickle at all.
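The decoupling described above can be sketched with a toy backend. This is illustrative only: `SketchBackend` and its argument names are placeholders, not dogpile.cache's actual API.

```python
import json
import pickle

class SketchBackend:
    """Illustrative backend that accepts user-supplied serializer/deserializer
    callables, falling back to a per-backend default (here: pickle)."""

    def __init__(self, arguments):
        # Default to pickle, but let the user override with e.g. json.
        self.serializer = arguments.get("serializer", pickle.dumps)
        self.deserializer = arguments.get("deserializer", pickle.loads)
        self._store = {}

    def set(self, key, value):
        # The backend itself only ever sees the serialized form.
        self._store[key] = self.serializer(value)

    def get(self, key):
        return self.deserializer(self._store[key])

# Pickle by default:
b = SketchBackend({})
b.set("k", {"a": 1})
assert b.get("k") == {"a": 1}

# Or pickle ditched entirely in favor of json:
jb = SketchBackend({"serializer": lambda v: json.dumps(v).encode(),
                    "deserializer": lambda s: json.loads(s)})
jb.set("k", [1, 2, 3])
assert jb.get("k") == [1, 2, 3]
```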

Member

@zzzeek zzzeek left a comment

seems like it's not done yet.

let's use serializer / deserializer. proof of concept is that pickle can be ditched entirely and json can be used, if someone wanted.

@@ -75,7 +79,29 @@ def __init__(self, arguments): # pragma NO COVERAGE
passed to :func:`.make_registry`.

"""
raise NotImplementedError()
# TODO: We can probably use a better name than pickler/unpickler,
Member

serializer / deserializer

Contributor Author

Done.

@@ -140,6 +140,13 @@ def release_write_lock(self):
"""

def __init__(self, arguments):
super().__init__(
{
"pickler": pickle.dumps,
Member

can I send my own pickler here? seems like not?

Contributor Author

yes, the idea here is that if "pickler" is not in arguments, then we default it to pickle.dumps:

>>> arguments = {}  # No arguments from user
>>> {
...     "pickler": pickle.dumps,
...     "unpickler": pickle.loads,
...     **arguments
... }
{'pickler': <built-in function dumps>, 'unpickler': <built-in function loads>}

>>> def noop(x): return x
...
>>> arguments = {"pickler": noop, "unpickler": noop}  # User overrides
>>> {
...     "pickler": pickle.dumps,
...     "unpickler": pickle.loads,
...     **arguments
... }
{'pickler': <function noop at 0x102b7a0d0>, 'unpickler': <function noop at 0x102b7a0d0>}

@@ -112,6 +112,11 @@ class GenericMemcachedBackend(CacheBackend):
to the :meth:`set` method."""

def __init__(self, arguments):
# No need to override the serializer, as all the memcached libraries will do it themselves.
# Still, we support customizing the serializer/deserializer to use a better default pickle protocol
Member

but we want to use our own pickler here don't we ?

Contributor Author

@youtux youtux Sep 29, 2020

Well, this can potentially break backwards compatibility (I have to double-check that), but probably in a way that nobody will ever notice.

It seems to me that all the memcached libraries use flags that represent the type of the stored value. For example, this is how python-memcached does it. Other memcached implementations (even the C ones) seem to do the same. If we suddenly start always passing bytes to the memcached client, it will start using no flag, to signify that bytes are stored.
This can potentially be a problem for users that expect Client._FLAG_TEXT to be set for some values.
But given that memcached returns None on GET if there are errors, it may not be such a big deal after all.

That is the only reason why I left the default as a noop: to let the memcached library determine what to do.

Member

OK so then.... we don't support any means of changing the serializer right now? They can default to None for serializer/deserializer but we still need to run our "(de)serializer" inside the get/set methods in the backend right?

{
"pickler": pickle.dumps,
"unpickler": pickle.loads,
**arguments
Member

oh, you are using some flashy new python 3 syntax thing. that syntax is a little icky since it is in my 2 minutes experience pretty hard to spot, but that's on them not you : )

Contributor Author

We can also use ChainMap, as I think it was created exactly for this purpose.
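The same defaulting behavior with ChainMap looks like this (a minimal sketch of the idea, not the PR's actual code): lookups are tried against the user's arguments first and fall through to the defaults.

```python
import pickle
from collections import ChainMap

defaults = {"pickler": pickle.dumps, "unpickler": pickle.loads}

# No user arguments: every lookup falls through to the defaults.
merged = ChainMap({}, defaults)
assert merged["pickler"] is pickle.dumps

# User overrides win, because the first mapping is searched first.
def noop(x):
    return x

merged = ChainMap({"pickler": noop}, defaults)
assert merged["pickler"] is noop
assert merged["unpickler"] is pickle.loads
```

Unlike the `{**defaults, **arguments}` dict literal, ChainMap does not copy anything; it layers the mappings and resolves keys at lookup time.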

super().__init__(
{
"pickler": pickle.dumps,
"unpickler": pickle.loads,
Member

yeah this syntax is messing with me, because it registers as exactly like the syntax that would fail:

Foo(pickler=pickle.dumps, unpickler=pickle.loads, **arguments)

again my problem not yours...
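The two spellings really do behave differently, which may be worth spelling out: a duplicate key in a dict literal is legal (the later `**` entry wins), while the same duplication in a call raises. A quick illustration (`foo` is a stand-in function, not PR code):

```python
import pickle

def noop(x):
    return x

arguments = {"pickler": noop}

# Dict literal: duplicate keys are allowed; the later **arguments entry wins.
merged = {"pickler": pickle.dumps, "unpickler": pickle.loads, **arguments}
assert merged["pickler"] is noop
assert merged["unpickler"] is pickle.loads

# Call form: the same duplication is an error.
def foo(pickler=None, unpickler=None):
    pass

try:
    foo(pickler=pickle.dumps, **arguments)
except TypeError:
    # TypeError: got multiple values for keyword argument 'pickler'
    pass
```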

@youtux
Contributor Author

youtux commented Sep 29, 2020

Indeed, it's just a draft.

proof of concept is that pickle can be ditched entirely and json can be used, if someone wanted.

Yes!

Also, actually use the configured serializer in the DBMBackend and MemoryBackend
@youtux youtux marked this pull request as ready for review October 4, 2020 12:36
A nice side effect is that we can use custom serializer/deserializer
…kend-serializer

# Conflicts:
#	tests/cache/test_redis_sentinel_backend.py
@youtux
Contributor Author

youtux commented Oct 4, 2020

Tests are now passing (on my machine), and all the backends delegate the serialization logic to the serializer/deserializer attributes.
There are a couple of TODOs that I need help with; otherwise the PR is ready to be reviewed.

Collaborator

@sqla-tester sqla-tester left a comment

OK, this is sqla-tester setting up my work on behalf of zzzeek to try to get revision 9308200 of this pull request into gerrit so we can run tests and reviews and stuff

@sqla-tester
Collaborator

New Gerrit review created for change 9308200: https://gerrit.sqlalchemy.org/c/sqlalchemy/dogpile.cache/+/2283

@zzzeek
Member

zzzeek commented Oct 5, 2020

just FYI if you install and run pre-commit for this repo it will run black +zimports with all our settings and all but E501 will be automatically handled.

Collaborator

@sqla-tester sqla-tester left a comment

mike bayer (zzzeek) wrote:

(3 comments)

I'm not getting the memcached part still and I think this has to be fully generalized to start

View this in Gerrit at https://gerrit.sqlalchemy.org/c/sqlalchemy/dogpile.cache/+/2283

  • dogpile/cache/backends/memcached.py (line 173): need deserialize here... or noop if it's None
  • dogpile/cache/backends/memcached.py (line 184): need serialize here... or noop if it's None

@@ -50,6 +50,22 @@ class RedisTest(_TestRedisConn, _GenericBackendTest):
}


# TODO: This can probably be a test that we should run for every backend
class RedisCustomSerializerTest(_TestRedisConn, _GenericBackendTest):
Collaborator

mike bayer (zzzeek) wrote:

yes, this has to be another _GenericBackendTest, and I think having it run with json loads/dumps would be a good proof. It should have some raw cache get that ensures the value comes out as JSON if the deserializer isn't run. This might mean the actual values we're testing have to change a bit to support json.

View this in Gerrit at https://gerrit.sqlalchemy.org/c/sqlalchemy/dogpile.cache/+/2283

Contributor Author

But would this be a test method implemented in the _GenericBackendTest class? Because I can imagine some backends would not want to support serializer/deserializer explicitly.

Member

most of the backends we have do want to make use of this, so it should be a reusable test fixture. A backend that doesn't include serialization doesn't have to use it in its test suite.

Contributor Author

oh ok, I was under the impression that backend tests from dogpile.cache are supposed to also be run by third-party plugins, just like SQLAlchemy dialects.
Since that's not the case, I will make the serializer/deserializer test part of the _GenericBackendTest test suite.

Member

let me clarify: this should be a new class called _SerializerFixture or something like that; it shouldn't be embedded in the existing _GenericBackendTest. It should include new tests that prove that the custom serializer, for which I suggested JSON, is actually being used, and it should be able to be plugged in as a mixin to a _GenericBackendTest.

that is, we do want third party dialects to use these fixtures and this one would be "opt in" for a backend that does so, and we would want to confirm it in fact is able to serialize/deserialize from bytes on the backend side, since that's the assumption of the new feature we are doing here.

Contributor Author

Could you maybe pick up the testing part? I feel like I am going to spend a long time just to figure out how the class structure should be.

Member

my current days are 100% getting SQLAlchemy 1.4 documentation written so that the betas can come out. if you aren't blocked by this and can wait awhile, I still have these in my queue so I'd have to get around to them (or maybe someone else wants to poke at it).

Contributor Author

I'm looking forward to the first beta!

If you don't mind a more pragmatic approach, I can build test coverage by repeating the tests for each backend (maybe while doing that I'll find a way to generalize them). At least we won't block this, and the tests can be refactored later.

Member

I think this is not as complicated as you are making it. the steps are:

  1. create a class _GenericSerializerTest

  2. add a few test_XYZ methods to this class which round trip values using a JSON serializer, not pickle, to the backend.

  3. the test_XYZ methods should pull out raw data from each backend, making sure the data is in fact bytes when we don't use the serializer, and that the data is in fact JSON serialized

  4. inside of test_redis, test_memcached, test_dbm, place a stub test case that extends from _GenericSerializerTest as well as _GenericBackendTest

  5. that way all the tests in _GenericBackendTest run against a JSON serializer as a proof of concept

  6. I just looked at _GenericBackendTest and all the values it tests with are strings. We should include some tests that use a json-like structure of lists and dictionaries as the values. The pickle serializers and json serializers can both handle these.

Contributor Author

Right, I think I understand now. Can you have a look if that's how you meant it? I still need to add some tests for set_multi/get_multi.

Collaborator

@sqla-tester sqla-tester left a comment

OK, this is sqla-tester setting up my work on behalf of zzzeek to try to get revision bf3ba6a of this pull request into gerrit so we can run tests and reviews and stuff

@sqla-tester
Collaborator

Patchset bf3ba6a added to existing Gerrit review https://gerrit.sqlalchemy.org/c/sqlalchemy/dogpile.cache/+/2283

If not configured, it will just be a noop
@youtux
Contributor Author

youtux commented Oct 5, 2020

I'm not getting the memcached part still and I think this has to be fully generalized to start

I forgot to add the call to self.serializer/deserializer. We should add tests to cover that, but I need to know where those tests should be implemented, and whether they should run for every backend or each backend test suite should decide for itself.

@zzzeek
Member

zzzeek commented Oct 21, 2020

yes. sorry, this has not been a priority for me right now as i am putting all cycles into getting SQLAlchemy 1.4 out. are you blocked by this issue right now? my understanding is you can use a custom cache backend subclass for now.

@youtux
Contributor Author

youtux commented Oct 21, 2020

No worries, it's not a priority for now, but it is going to block the upgrade of some of my projects to Python 3.8 (maybe other users would experience the same?).
I can use a custom cache backend, yes, but I'd rather use this implementation so that I can set up both memcached and Redis backends to use the pickle protocol I need.

Collaborator

@sqla-tester sqla-tester left a comment

OK, this is sqla-tester setting up my work on behalf of zzzeek to try to get revision 06f449c of this pull request into gerrit so we can run tests and reviews and stuff

@sqla-tester
Collaborator

Patchset 06f449c added to existing Gerrit review https://gerrit.sqlalchemy.org/c/sqlalchemy/dogpile.cache/+/2283

@zzzeek
Member

zzzeek commented Oct 21, 2020

Ok it looks like you did all that new testing stuff, so this is far along, OK.

@sqla-tester
Collaborator

mike bayer (zzzeek) wrote:

openstack-recheck

View this in Gerrit at https://gerrit.sqlalchemy.org/c/sqlalchemy/dogpile.cache/+/2283

@zzzeek
Member

zzzeek commented Oct 26, 2020

hey does this supersede #190 ? I'm assuming so and would like to close #190

@youtux
Contributor Author

youtux commented Oct 26, 2020

Yes, this supersedes #190

Collaborator

@sqla-tester sqla-tester left a comment

mike bayer (zzzeek) wrote:

(2 comments)

View this in Gerrit at https://gerrit.sqlalchemy.org/c/sqlalchemy/dogpile.cache/+/2283

@@ -75,7 +79,17 @@ def __init__(self, arguments): # pragma NO COVERAGE
passed to :func:`.make_registry`.

"""
raise NotImplementedError()

serializer = arguments.get("serializer")
Collaborator

mike bayer (zzzeek) wrote:

i'm going to move these up to CacheRegion where they will be similar to KeyMangler, taking the default from the cache backend itself.

View this in Gerrit at https://gerrit.sqlalchemy.org/c/sqlalchemy/dogpile.cache/+/2283

@@ -252,6 +255,7 @@ def f():
time.sleep(0.5)

f()
# TODO: Is this test doing anything useful, given that we return here?
Collaborator

mike bayer (zzzeek) wrote:

this is a mistake that got committed at some point.

View this in Gerrit at https://gerrit.sqlalchemy.org/c/sqlalchemy/dogpile.cache/+/2283

@sqla-tester
Collaborator

mike bayer (zzzeek) wrote:

OK so as you can see, a whole lot happened in my patch here. here is the basic idea:

  1. backends, when they require a serializer, can assume that all values passed in and out are bytes. this simplifies the backends a lot.

  2. the serializer / deserializer are used by the CacheRegion directly, or if None, no serialization is assumed.

  3. when serialization is in use, the CacheRegion applies it only to the payload part, that is, the user-defined part. the other side of it, the "metadata", gets JSON serialized and turned to bytes. these two byte portions are merged together and sent to the backend.

  4. the user-defined serialized value is intended to be visible if you look at the value. this is why it's not a json of a json.

  5. the user-defined serialized value is separate from the metadata, so that people can use any random serialization they want and not have to worry about our metadata.

  6. I did assorted py3 things here notably started putting some type annotations into some new methods so I can keep them straight.

View this in Gerrit at https://gerrit.sqlalchemy.org/c/sqlalchemy/dogpile.cache/+/2283
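The metadata/payload split described in points 3-5 can be sketched like this. This is only an illustration of the idea, not dogpile's actual on-the-wire format: the function names and the `b"|"` separator are placeholders.

```python
import json

SEP = b"|"  # illustrative separator between the metadata and payload portions

def serialize_cached_value_parts(metadata: dict, payload: bytes) -> bytes:
    """Metadata is JSON-encoded to bytes; the user's already-serialized
    payload is appended untouched, so any serializer works for it."""
    return json.dumps(metadata).encode("utf-8") + SEP + payload

def deserialize_cached_value_parts(value: bytes):
    # Split on the FIRST separator only, so the payload may freely
    # contain the separator byte; the metadata JSON must not.
    metadata, _, payload = value.partition(SEP)
    return json.loads(metadata), payload

# The user-serialized payload stays visible and untouched in the stored value.
raw = serialize_cached_value_parts({"ct": 1603500000, "v": 1}, b'{"user": "data"}')
meta, payload = deserialize_cached_value_parts(raw)
assert meta == {"ct": 1603500000, "v": 1}
assert payload == b'{"user": "data"}'
```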

@sqla-tester
Collaborator

mike bayer (zzzeek) wrote:

Workflow+1

hey just a reach out to a few peeps, where we propose a dogpile 1.1 that moves serialization to be a feature of the region. backends now deal with bytes in those cases where an explicit serializer is in use. i think this is a good change but i did it pretty quick yesterday so any concerns would be appreciated.

View this in Gerrit at https://gerrit.sqlalchemy.org/c/sqlalchemy/dogpile.cache/+/2283

Collaborator

@sqla-tester sqla-tester left a comment

mike bayer (zzzeek) wrote:

(21 comments)

View this in Gerrit at https://gerrit.sqlalchemy.org/c/sqlalchemy/dogpile.cache/+/2283

  • dogpile/cache/api.py (line 53): call this BackendValue and it is:

BackendValue = Union[bytes, NoValue, NoneType, CachedValue]

  • dogpile/cache/api.py (line 54): also we can add

KeyType = Any

for now, I'd like to say keys are strings in all cases but I'm not sure that's true

  • dogpile/cache/api.py (line 67): serializer: Optional[Callable] = None
    deserializer: Optional[Callable] = None
  • dogpile/cache/api.py (line 142): annotate

def get(self, key: KeyType) -> BackendValue

  • dogpile/cache/api.py (line 152): def get_multi(self, keys: Sequence[KeyType]) -> Sequence[BackendValue]
  • dogpile/cache/api.py (line 163): def set(self, key: KeyType, value: BackendValue) -> None
  • dogpile/cache/api.py (line 175): def set_multi(self, mapping: Mapping[KeyType, BackendValue]) -> None
  • dogpile/cache/api.py (line 198): def delete(self, key: KeyType) -> None
  • dogpile/cache/api.py (line 212): def delete_multi(self, keys: Sequence[KeyType]) -> None
  • dogpile/cache/region.py (line 294): typo
  • dogpile/cache/region.py (line 729): this could be a single function self._get_from_backend(key)
  • dogpile/cache/region.py (line 801): this could be a single function self._get_multi_from_backend(keys)
  • dogpile/cache/region.py (line 931): self._get_from_backend(key)
  • dogpile/cache/region.py (line 952): self._set_to_backend(key, value)
  • dogpile/cache/region.py (line 1097): self._get_multi_from_backend(mangled_keys)
  • dogpile/cache/region.py (line 1127): self._set_multi_to_backend(mapping)
  • dogpile/cache/region.py (line 1132): self._set_multi_to_backend({k: v for k, v in values_w_created.items() if should_cache_fn(v)})
  • dogpile/cache/region.py (line 1147): probably rename this to _create_cached_value(), hard to follow these names
  • dogpile/cache/region.py (line 1165): I'm sort of arbitrarily trying to save on method calls here by inlining the serialization and I think I should start out by not doing that, the serialization should be in just one place.
  • dogpile/cache/region.py (line 1184): let's just for now put the format here in a function _serialize_cached_value_parts(self, metadata, payload)

also let's make sure we are using the terms "value" (a CachedValue) and "payload" (the user's data) consistently, or perhaps use the terms "cached_value" and "payload"

  • dogpile/cache/region.py (line 1191): CachePayloadType is not really accurate here. from the backend we can get:
  1. a binary string that has a metadata/payload
  2. a CachedValue
  3. NO_VALUE
  4. None

perhaps roll these up into BackendValue
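Taken together, the suggested annotations might look like the sketch below. The stub classes stand in for dogpile.cache.api's real NoValue/CachedValue; only the alias shapes come from the review comments above.

```python
from typing import Any, Callable, Optional, Union

# Stubs standing in for dogpile.cache.api's real classes.
class NoValue:
    pass

class CachedValue(tuple):
    pass

# "I'd like to say keys are strings in all cases but I'm not sure that's true"
KeyType = Any

# Everything a backend can hand back: raw bytes (serializer in use),
# a CachedValue, NO_VALUE, or None.
BackendValue = Union[bytes, NoValue, None, CachedValue]

# Defaults on the backend class:
serializer: Optional[Callable] = None
deserializer: Optional[Callable] = None
```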

Collaborator

@sqla-tester sqla-tester left a comment

Federico Caselli (CaselIT) wrote:

(1 comment)

Just a minor comment

View this in Gerrit at https://gerrit.sqlalchemy.org/c/sqlalchemy/dogpile.cache/+/2283

  • dogpile/cache/region.py (line 1197): metadata, _, payload = value.partition(b'|')

Collaborator

@sqla-tester sqla-tester left a comment

mike bayer (zzzeek) wrote:

(1 comment)

View this in Gerrit at https://gerrit.sqlalchemy.org/c/sqlalchemy/dogpile.cache/+/2283

  • dogpile/cache/region.py (line 1197): nice

Collaborator

@sqla-tester sqla-tester left a comment

Federico Caselli (CaselIT) wrote:

(1 comment)

View this in Gerrit at https://gerrit.sqlalchemy.org/c/sqlalchemy/dogpile.cache/+/2283

  • dogpile/cache/region.py (line 482): In general I'm not too convinced regarding cast, since it adds a function call at runtime.
    I'm not sure if this was added because mypy is not happy, but maybe we could use "# type: ignore" if it cannot infer it?

@sqla-tester
Collaborator

mike bayer (zzzeek) wrote:

recheck

View this in Gerrit at https://gerrit.sqlalchemy.org/c/sqlalchemy/dogpile.cache/+/2283

Collaborator

@sqla-tester sqla-tester left a comment

mike bayer (zzzeek) wrote:

(1 comment)

View this in Gerrit at https://gerrit.sqlalchemy.org/c/sqlalchemy/dogpile.cache/+/2283

  • dogpile/cache/region.py (line 482): this is for correct use of mypy. I'm not too concerned about fn call overhead. overall mypy is pointing us to all the ambiguous stuff in the API, like arguments that take three types of values.

@sqla-tester
Collaborator

mike bayer (zzzeek) wrote:

Code-Review+2 Workflow+2

View this in Gerrit at https://gerrit.sqlalchemy.org/c/sqlalchemy/dogpile.cache/+/2283

@sqla-tester
Collaborator

Gerrit review https://gerrit.sqlalchemy.org/c/sqlalchemy/dogpile.cache/+/2283 has been merged. Congratulations! :)

sbrunner added a commit to camptocamp/c2cgeoportal that referenced this pull request Jun 13, 2022
After this pull request sqlalchemy/dogpile.cache#191
Pickle is called on memory cache
c2c-bot-gis-ci pushed a commit to camptocamp/c2cgeoportal that referenced this pull request Jun 14, 2022
After this pull request sqlalchemy/dogpile.cache#191
Pickle is called on memory cache