
get serialization to be configurable with redis backend, others ? #18

Open
sqlalchemy-bot opened this issue Jan 17, 2013 · 45 comments

Migrated issue, originally created by Michael Bayer (zzzeek)

Michael Bayer (zzzeek) wrote:

Issue #21 was marked as a duplicate of this issue.

jvanasco (jvanasco) wrote:

"serialization, the API is a loads()/dumps() function call, probably passed as any object/module that has "loads"/"dumps" available - even a named tuple. Using ObjectsWithUpperCaseNames is way too zopey for this."

that works. sorry about #21 -- didn't see this one (I was searching issues with the wrong terms)
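
The quoted loads()/dumps() idea can be sketched in a few lines (purely illustrative, not dogpile's actual interface -- even a namedtuple bundling two plain functions would satisfy it):

```python
import json
from collections import namedtuple

# Hypothetical illustration: any object exposing dumps/loads would do.
Serializer = namedtuple("Serializer", ["dumps", "loads"])

json_serializer = Serializer(dumps=json.dumps, loads=json.loads)

encoded = json_serializer.dumps({"user_id": 10001})
decoded = json_serializer.loads(encoded)
```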

jvanasco (jvanasco) wrote:

coming back to this.

a few more thoughts on making this configurable/modular:

• the ability to store/retrieve raw data in the client, i.e., don't wrap it in api.CachedValue or pickle it. If I pass in a string that was encoded via msgpack, let it be. Don't pickle a number or string -- that actually uses up a lot of space in the cache! These are all things that an advanced user would want to be able to turn off.

• handling the time relevance/metadata as part of the deserialization. In order to give the best amount of control over the data storage/format, the serializer should be able to encode/omit metadata as it sees fit. However, that would require the serializers to then handle the verification of timeliness (or anything else).

A use-case example is storing a username-to-id mapping in 2 different backends:

Given: Jonathan == 10001

• Storing in a dbm, I would need to store the time-relevance in the CachedValue object. The perfect solution is a pickled object that pairs the data with metadata.

data raw: {'data': 10001, 'metadata': {'expires': '1444097265'}}
data pickled: (dp0\nS'data'\np1\nI10001\nsS'metadata'\np2\n(dp3\nS'expires'\np4\nS'1444097265'\np5\nss.

• Storing in redis, the server already handles expiry and memory is a concern. The perfect solution is storing the raw value.

data raw: 10001

The dbm solution doesn't work well for redis in production, though: it takes up 78 bytes to achieve the same goal that 5 bytes could.
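
The overhead is easy to demonstrate (a rough sketch; exact byte counts vary by Python version and pickle protocol):

```python
import pickle

# what redis could store directly for the raw value 10001
raw = b"10001"

# the dbm-style wrapper: data plus expiry metadata, pickled with
# protocol 0 to match the old-style pickle shown above
wrapped = {"data": 10001, "metadata": {"expires": "1444097265"}}
pickled = pickle.dumps(wrapped, protocol=0)

print(len(raw), len(pickled))  # the pickled form is many times larger
```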

jvanasco (jvanasco) wrote:

I've started to fork this here:

https://bitbucket.org/jvanasco/dogpile.cream

Instead of implementing as a region, I'm using a global setting in dogpile.cache.money. This is largely working and I think I've found all the edge-cases for how data gets in and out. The next step will be trying to integrate this into a region as a configuration.

jvanasco (jvanasco) wrote:

dogpile.cream is working on my staging site and on production web spiders. no errors.

Spent an hour integrating the tests into my normal fork (https://bitbucket.org/jvanasco/dogpile.cache/commits/branch/feature-cachecontrol_region)

I think I can pull this off with a few API changes:

  • dogpile.cache.api.CachedSerializer (new): implements loads/dumps. Configured by region; can override backend defaults. This should probably only be either Pickle or RawData.

  • dogpile.cache.api.CachedValueManager (new): this has the actual logic for validating/invalidating data within the CacheRegion (prototyping in region.py). This will allow end-users to access/enter raw data in a datastore instead of 'cache' objects that contain the timestamp and version.

  • The CacheRegion now proxies all the management of timestamps/etc. into the new CachedValueManager. A custom class could implement validation on values that does not use timestamps (i.e., time managed by the server), or store the data in another manner.

Darin Gordon (Dowwie) wrote:

Hey Mike and Jon

I'm interested in using dogpile cache as the standard caching plugin for yosai (http://www.github.com/yosaiproject).
In the project, objects are serialized using marshmallow and then encoded with msgpack/json/userdefined

Within yosai's serialization module is a SerializationManager. In this SerializationManager, I add one layer of metadata to the message payload and then encode the message using whatever scheme it has been configured to use, such as MSGPack or JSON.

Through the region configuration process, I'd like to inject a SerializationManager-like object into dogpile cache that would control serialization/deserialization.

What do you think of me submitting a PR to add a serialization management layer to dogpile.cache?

reference: https://github.com/YosaiProject/yosai/blob/master/yosai/serialize/serialize.py

DG

Michael Bayer (zzzeek) wrote:

I was thinking of just adding a flag to the Redis backend where you can pass in a pickler / serializer object of your choice, or set it to None. The yosai thing sounds like it would be a dogpile cache ProxyBackend (and could be part of yosai).

Michael Bayer (zzzeek) wrote:

that is, I don't think trivial serialization like pickling -- which is already intrinsically part of some client libraries and not in our control, as with all the memcached backends -- should be a "middleware" component by default. This is a little bit of TMTOWTDI, but just like the key mangler, having small injectables for rudimentary key/value mangling is valuable vs. requiring a mediation object in all cases.

jvanasco (jvanasco) wrote:

@Dowwie

So basically, I did that here. Feel free to play with my fork/branch

https://bitbucket.org/zzzeek/dogpile.cache/pull-requests/39/fine-grained-support-for-cache/diff

I had a shower-thought on it the other day about consolidating this further, but have since forgotten it :/ It was a good one too.

The basic premise is that I saw the need for 2 things to be split out from dogpile:

  1. Cache Serialization -- allowing every backend to specify a loads/dumps interface (defaults to cPickle)

  2. Cache Value Management -- dogpile doesn't cache raw values; it wraps cached values into a new object with some metadata for timestamp/version expiry. If this is extended via a class, then developers can customize the format -- either omitting timestamp data (for example, redis and some other potential backends handle this server-side), or using some sort of nested field within the cached value.

BUT

What you're talking about could simply be used with a proxy backend as Mike noted. Mike recently accepted a patch where I illustrated using msgpack as a proxy backend.

The caveat, though, is that dogpile will cache: pickle.dumps(CachedValue(msgpack.serialize(val)))

In order to cache msgpack.serialize(value), an approach (like in my fork) is needed.

Darin Gordon (Dowwie) wrote:

Pickling a well-serialized object is cringe-worthy. Further, over the last few years, I haven't heard a single engineer recommend using pickle for serialization. This is just an assumption based on anecdotal evidence, but it seems that "the industry" has moved on. As Alex Gaynor said, pickles are for delis -- not software. So, why not just remove pickle entirely?

Michael Bayer (zzzeek) wrote:

because all the memcached clients have pickle bolted into them by default. pickles are still extremely relevant.

jvanasco (jvanasco) wrote:

Pickling a well-serialized object is cringe-worthy. It is also unwanted when dealing with raw data (such as caching a raw number vs. a pickled Python object containing a number and metadata).

But dogpile is a caching library, not a "well-serialized object" caching library. It makes sense to support pickle, because oftentimes you may want to cache an unserialized Python object. Since pickle is the historical default, it needs to be maintained for backwards compatibility.

Darin Gordon (Dowwie) wrote:

I'll do as recommended and serialize using a customized ProxyBackend. I just don't want any pickling.

Turning off pickling involves a straightforward update. A default of pickle=True would ensure others wouldn't be impacted by the change.

Mike are you good with moving forward on this approach?

Michael Bayer (zzzeek) wrote:

@Dowwie - we should let the loads/dumps functions be directly injectable, so you can stick json.loads/dumps in there if you wanted, or pickle with a lower protocol, whatever. None I guess would be the setting to turn off loads/dumps.

Darin Gordon (Dowwie) wrote:

setter injected or constructor?

Michael Bayer (zzzeek) wrote:

The cache backends take arguments in their constructor, which get configured through the region via the "arguments" parameter to configure().
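
A toy model of that constructor-injection pattern (the class and argument names here are illustrative, not dogpile's actual backend API):

```python
import json

# The region forwards its "arguments" dict into the backend constructor;
# a backend could pick an optional loads/dumps pair out of it.
class ToyBackend:
    def __init__(self, arguments):
        self.dumps = arguments.get("dumps")  # hypothetical argument names
        self.loads = arguments.get("loads")

backend = ToyBackend({"dumps": json.dumps, "loads": json.loads})
roundtripped = backend.loads(backend.dumps({"user": "Jonathan", "id": 10001}))
```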

jvanasco (jvanasco) wrote:

I don't believe this approach can work as intended.

cache.region (the core logic file) wraps the "value" (which is serialized by any proxy backends) into an instance of the CachedValue object class, and that instance is passed into the backend (e.g. backends/redis.py)

reference:

If you turn off pickling in the backend, you're still stuck with a Python instance (CachedValue) being passed to the backend. That instance contains the data serialized by the custom proxy_backend AND some metadata that the current implementation of region.py requires on deserialization.

reference:

Just to be clear about this: the backend (currently) always receives a Python object, so it requires a (de)serialization hook to work for most backends -- redis included. (There are 2 backends that don't require this -- the memory backend inherently doesn't need to pickle, and the memcached driver accepts Python objects because it pickles them itself.)

The logic flow is like this:

[raw value] -> [proxy backend munging] -> [region.py wraps in CachedValue w/metadata] -> [backend/driver.py serializes]

Without restructuring the logic as I've proposed, the easiest way to achieve @Dowwie's goals would be to re-implement the redis backend in a custom package that performs msgpack serialization on the backend value (instead of the pre-backend value munging that proxy_backend offers); that could also just fake the entire CachedValue wrapper or store the metadata within the msg_pack format.
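
The flow above can be modeled in a few lines (CachedValue here is a simplified stand-in for dogpile.cache.api.CachedValue, and the functions are illustrative, not the real API):

```python
import pickle
from collections import namedtuple

# simplified stand-in for dogpile.cache.api.CachedValue
CachedValue = namedtuple("CachedValue", ["payload", "metadata"])

def proxy_munge(value):
    # a ProxyBackend operates here, on the raw value
    return value

def backend_set(store, key, cached_value):
    # the backend receives the CachedValue *object* and pickles it;
    # this is the step with no serialization hook today
    store[key] = pickle.dumps(cached_value)

store = {}
wrapped = CachedValue(proxy_munge(10001), {"ct": 1444097265, "v": 1})
backend_set(store, "user:jonathan", wrapped)
```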

Michael Bayer (zzzeek) wrote:

If you turn off pickling in the backend, you're still stuck with a Python instance (CachedValue) being passed to the backend. That instance contains the data serialized by the custom proxy_backend AND some metadata that the current implementation of region.py requires on deserialization.

ah. Go you, for reading my source code I never look at! This is true.

But still. CachedValueManager? Why do we need another object, and why such a J2EE-like architecture? CachedValue is our payload. If I don't want to use pickle, then that indicates I want to do the work of writing some other kind of serialization for CachedValue. There is still a loads()/dumps() function. It's just that the approach of allowing "None" is no longer necessary, because you need a serializer no matter what. If you want to serialize CachedValue using json, that's fine.

This just places some extra burdens on the production of loads()/dumps() functions for those people who have some issue with pickling. They have to handle the CachedValue wrapper.

jvanasco (jvanasco) wrote:

I have spent way too much time recently reverse-engineering your code!

The way I implemented this, it's 100% backwards compatible. The backends use a serialization class, but they default to pickle or none (whatever is current). If you want to configure a region to use JSON instead, then you just pass that in during config.

The reason why I came up with CachedValueManager is two-fold:

  1. What if I don't want to store a CachedValue object -- I just want to store the raw data? Backends like Redis support daemon-side cache invalidation and timeouts.

  2. What if I want to customize the cache management logic? The current CachedValue implementation is just a timestamp + version creation and check. This is a great "baseline" bit of logic for cache management, but people often have more elaborate needs. The core benefit/utility/brilliance of the dogpile.cache package is in the handling of the Lock mechanism. Using a "CachedValueManager", virtually any caching logic can be bootstrapped onto dogpile's locking system and caching infrastructure.

The general goal was that this approach allows for the 80% of "basic needs" dogpile users to see no difference in usage, but the 20% of "advanced needs" users would be able to build out much of the custom work and edge-cases that are often requested.

Michael Bayer (zzzeek) wrote:

the backend "timeout" value is not compatible with the dogpile approach, because when the backend "times out", the value is gone - we can't return it to clients while the "new" value is generated.

CacheRegion can be passed a CachedValue subclass as a parameter also.

jvanasco (jvanasco) wrote:

  • If the backend times out, then it's a cache miss. When using backends like Redis, the secondary check in dogpile is often a considerable waste of cache space. The performance hit of using dogpile instead of a direct redis client is negligible. The hit on data size is significant though -- and means running larger (or more) virtual servers to keep everything in memory.

CachedValue is directly imported and used by the region (I don't think you got to writing/committing the feature you're thinking of). All of the validation logic is performed within the CacheRegion as well. Are you possibly thinking about customizing with the should_cache_fn argument?

Darin Gordon (Dowwie) wrote:

What are the key issues being debated here? @jvanasco 's written a fine solution

Michael Bayer (zzzeek) wrote:

I don't agree that the mechanism of CacheRegion is so trivially swappable, and CachedValueManager is too much up-front interface/OOP for an anticipated use case. I propose a plugin for the CachedValue class; not using it at all is not an option -- just use redis directly if you don't want the features of CacheRegion (e.g. dogpile protection). The entire point of CacheRegion is that it does not expose a so-called "cache miss", by returning the previously cached value when expiration is detected.

Michael Bayer (zzzeek) wrote:

But also, for the use case you're looking for, "I don't want pickle", as always: the RedisBackend accepts a loads()/dumps() pair, that's it. You need to determine what to do with the given CachedValue object.

jvanasco (jvanasco) wrote:

  1. My solution is far from fine -- I'll be the first to admit it. It's an early stab at solving this. The intent of my PR was to spark dialog and find a way to the right solution, as this is complex.

  2. @zzzeek Another core benefit of dogpile (perhaps more important) is that it establishes the Lock on value creation (whether there is a miss or an expiration), so the create function will only be called once. While the angle I propose would lessen the utility of still using the old value, the Lock is still incredibly useful and important. The Python-based expiration is also never happening if someone is already using redis or riak as a backend with the server timeouts enabled (I saw a riak plugin, though I forget where).

The idea of turning off the Python-side cache management was also influenced by the fact that this would allow users to hot-swap backends and fine-tune for performance by only changing the dogpile config. redis / riak / memcached could be switched and profiled quickly. Depending on an application's usage, memory (and the ability to hold an entire data-set in working memory) may be much more important than speed and concurrency. It's not pleasant ripping out and re-implementing caching libraries when you need to test this (I sadly know from experience ;/)

  3. The RedisBackend doesn't accept a loads()/dumps() pair. Everything is hardcoded into pulling pickle from dogpile.compat and then piping the CachedValue into it.

So now I've remembered my shower-thought --

If we implement my idea for dogpile.cache.api.CachedSerializer, that would be a good first step. CachedSerializer is just an object class that supports loads/dumps; the implementation I prototyped just has an instance of the class passed into the region during configuration (and defaults to classes that preserve the current behavior). I chose an object instance because that means passing in a single argument (instead of two), and I like nicely defined interfaces (pyramid's use of zope.interface has grown on me).

The reason why this would be a good first step is that it allows for undocumented/unsupported/off-label munging of the data. A developer could repack the CachedValue data into an internal structure and rebuild it as needed... or fake the CachedValue as not having an expiry.

Darin Gordon (Dowwie) wrote:

@zzzeek Makes sense to keep the CacheRegion largely intact. It is the bread and butter of this project.

Why should backends be concerned about load/dump (serialization)? Instead, a backend should expect cache-ready, serialized objects as input. Do we agree on this point?

If we agree on the above two points, then we're heading towards a CacheSerializer managing serialization/de-serialization, and consequently CachedValue creation. Am I correct?

Michael Bayer (zzzeek) wrote:

Why should backends be concerned about load/dump (serialization)?

because, they already are. Here is the source of the most common client, python-memcached: https://github.com/linsomniac/python-memcached/blob/master/memcache.py#L928

that particular memcached client, as it was the first, set the stage for all the other ones. What is extremely interesting to note here is that they are taking advantage of the memcached protocol itself to store additional "flags" which indicate how the value is serialized; these flags are only available on the wire protocol and are not exposed by the client.

So the memcached backends provide very elaborate, wire-protocol-bound, serialization functionality already, hardcoded to pickle.

To be fair, while I don't know how well they document this, they are using those flags to indicate if the value was in fact a Python object to start with and is pickled, and if a plain string was passed, it is not pickled; so there is a way to affect whether or not this occurs.

But the need for serialization itself as well as the best way to do it is a backend-specific affair. When dogpile was written, the memory, file and memcached backends were it. One backend needed no serialization (memory), one had none (file), and another already provided it (memcached). The job of serialization was clearly something that changes per-backend.

If we removed calls to "pickle" from the file / redis backends, those backends will now fail to function correctly with the cache region. If we put pickling into the CacheRegion itself, it would need to be turned on by default, which means it is firing unnecessarily for the memory and memcached backends. The backends should be able to work with a region without additional flags added to the region itself, other than the backend itself. Behaviors that need to travel with the backend should be configured with it.

The overarching theme of this thread is one of vague suggestions of "undocumented/unsupported/offlabel munging of the data" and "virtually any caching logic can be bootstrapped onto dogpile's locking system and caching infrastructure"; I can't work with those. Adding small flags, sure. But rearchitecting the object model, that's a huge job, and one that has to be derived from a set of very specific use cases.

Throughout this thread, I still don't know what either of you want to do. Not serializing isn't an option, and "Alex Gaynor said pickle is dead" is not really sufficient :). @jvanasco 's use cases are even more obscure. Just building out a more exploded API and declaring "now it can do anything!" also is not enough for me to go on (and also seems kind of out of scope for, "we want better control over serialization").

Darin Gordon (Dowwie) wrote:

I eat huge jobs for breakfast. I'll return with my munged fork in a few.

jvanasco (jvanasco) wrote:

First off -- I think it's important to note that technology has changed a lot since the earlier dogpile releases. Memcached used to (in general) be the only option for caching daemons; now it is often considered the worst, with redis, riak, and some other systems being much faster and more feature-filled. Processors are also generally much faster and memory much larger, so the chance of blocking tends to be down quite a bit.

Dogpile may have been intended as the caching system built around the dogpile problem, but it has grown to be much more. Many people are attracted to it because it defines a robust API with hot-swappable backends.

Unfortunately, my application has grown to the point where dogpile has become a huge bottleneck for me. The manner in which dogpile wraps data into a CachedValue object makes my "mapping" region (name to id) about 18x larger, and most of my other objects 3-4x larger. Dogpile essentially creates a situation where I need a dogpile lock addressed, because my caching backend can only hold 1/6 the data and everything is expiring based on FIFO rules.

I do not want to rip out every bit of dogpile in my app and replace it with a direct redis client (I'd lose hot-swap options, meaning I'd have to run a configured redis in every environment). That means I can either maintain a forked dogpile (which I'd like to avoid, but it's the better option) or figure out some patch/API change that will give me enough hooks to accomplish what I need. Being able to specify serializers for each backend does that -- it creates a "hook" at the right spot.

That being said, let me just clarify my suggestion:

jvanasco (jvanasco) wrote:

@Dowwie

dogpile.cream was a first stab at an override. it works, but you should ignore it. I have a fork in a feature branch under dogpile.cache called feature-cachecontrol_region -- https://bitbucket.org/jvanasco/dogpile.cache/branch/feature-cachecontrol_region -- that is a much better place to look.

There are 2 dogpile contexts, and @zzzeek is primarily talking about one of them. Imagine there is a really expensive function that takes 5 minutes to generate a value. In the span of 15 seconds, 15 unique requests are made for that value, once each second. It could cripple a server to handle all 15.

The first dogpile context is the "unprimed cache". If there is no value in the cache: the first request creates a "Lock" on creating the value. The lock keeps the next 14 people from generating a value, but makes them wait for a result.

The second dogpile context is "stale data". If there is a value in the cache but it has expired, the first request creates a new value, while the next 14 requests use the stale data.

"Stale data" is possible because the CachedValue is actually a custom tuple with 2 properties:

 value = actual_value
 metadata = {'ct': foo,  # cache timestamp
             'v': 1,     # dogpile API version
             }

While redis can have a timeout on values, it is not required, and a value may not necessarily have a timeout (keys can have no expiry). Memcached, dbm, memory, and other datastores don't have native timeouts either. Redis, Riak, and a few other "newer" daemons support this (increasingly popular) concept.

In order to support the second context (and generally have timeouts on backends that do not have a native timeout), dogpile needs that expiry timestamp in the cached value.
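
A minimal sketch of that Python-side expiry check (illustrative names; the real logic lives in region.py):

```python
import time

def is_fresh(metadata, expiration_time, now=None):
    # compare the creation timestamp stored in the value's metadata
    # against the configured expiration_time
    now = time.time() if now is None else now
    return (now - metadata["ct"]) < expiration_time

meta = {"ct": 1000.0, "v": 1}
fresh = is_fresh(meta, expiration_time=60, now=1030.0)   # 30s old
stale = is_fresh(meta, expiration_time=60, now=1100.0)   # 100s old
```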

My big issue right now is just raw memory. This old Instagram post illustrates a similar issue [they used hashing to minimize the data stored based on prefixes, my concern is just on the wrapper], but goes into the concerns of server resources. http://instagram-engineering.tumblr.com/post/12202313862/storing-hundreds-of-millions-of-simple-key-value

So my thought is this: if there is a documented loads/dumps hook in the backend, then a developer can deconstruct/rebuild a CachedValue as needed and that solves most problems.

Custom serializers for redis might do this:

  class MsgpackSerializer(DogpileSerializer):
      """`object_hook` would be needed on the msgpack serializer to deal with the CachedValue object"""
      def dumps(self, cached_value):
          return msgpack.dumps(cached_value)

      def loads(self, repacked):
          return msgpack.loads(repacked)

  class MsgpackRepackSerializer(DogpileSerializer):
      """recode the metadata into the data struct, then pull it out; assumes the payload is a dict"""
      def dumps(self, cached_value):
          repacked = dict(cached_value.value)
          repacked['_dogpile'] = cached_value.metadata
          return msgpack.dumps(repacked)

      def loads(self, repacked):
          unpacked = msgpack.loads(repacked)
          metadata = unpacked.pop('_dogpile')
          return CachedValue(unpacked, metadata)

  class RawDataSerializer(DogpileSerializer):
      """This defeats the dogpile lock's Stale Data protections, but lets a very advanced user store lots of data.  Most users would never think of this, so it's generally safe."""
      def dumps(self, cached_value):
          return cached_value.value

      def loads(self, repacked):
          # fake the metadata so the region always sees a fresh value
          return CachedValue(repacked, fake_timestamp())

Michael Bayer (zzzeek) wrote:

The manner in which dogpile wraps data into a CachedValue object makes my "mapping" region (name to id) about 18x larger, and most of my other objects 3-4x larger.

If you have an approach where you'd like to store the information of the CachedValue in a different way, I'm still not understanding why a ProxyBackend is not acceptable. that's what your recipe at http://dogpilecache.readthedocs.org/en/latest/usage.html#encoding-decoding-data-into-another-format does. The "CachedValue makes my data 18x larger" phrase implies that you have very small data values.

I've read your two latest comments, not taking into account the various proposals / code examples because I really need to see only the problem to be solved, and that's the only problem case I can see described. Does the recipe you posted no longer solve this problem?

I'm not reading every word of these very long comments very carefully yet because I still don't have knowledge of the problem to be solved. The two problems so far I see are 1. @Dowwie doesn't want to use pickle because someone said "pickle was dead" and 2. you don't like the way CachedValue serializes because it takes up too much memory.

I'm not seeing how "allow a pickle function and/or None to be passed to redis" doesn't solve problem #1 and how the recipe at http://dogpilecache.readthedocs.org/en/latest/usage.html#encoding-decoding-data-into-another-format doesn't solve problem #2 (and in fact, if you can turn off the pickle aspect of the backend, lets you do anything you want with the data, full blown). Putting aside questions of purity, where things should be, what cache backends are cool now, etc.

jvanasco (jvanasco) wrote:

I'll try to be concise:

  1. proxy_backend does not work because it operates on the raw value - not the "CachedValue". The output of a proxy_backend is wrapped by region.py into a CachedValue via the _value method. That Python object is then passed into the actual backend. There exists no hook within the backends or Region._value to interact with the data.

  2. I am simply proposing that we allow a serialization option to be passed into all backends. It could be dumps/loads functions, but I suggest a class instance (and the code above illustrates an implementation) that implements a loads/dumps interface because that means only 1 argument. My objection was to simply disabling pickle, because the value received by the backend is an object.

In terms of the other stuff: for a simple mapping of usernames to ids (which we use in URL resolution and is our most used function), a "raw value" of 10001 is about 1/18 the size of that number wrapped into a CachedValue and pickled.

Darin Gordon (Dowwie) wrote:

@jvanasco thank you for taking the time to share background. Your explanation belongs in the docs!

Michael Bayer (zzzeek) wrote:

c6913eb

Michael Bayer (zzzeek) wrote:

That Python object is then passed into the actual backend.

I'm not following - the ProxyBackend gets the CachedValue object. Just added a test for this to verify I'm not crazy. See the above cset. Did you mean something else?

jvanasco (jvanasco) wrote:

I'm explaining this wrong, and am also a bit confused/going crazy on this after weeks of obsessing over the internals. The sheer number of tests/branches/approaches I have to get around the Pickling issue is intimidating.

You are correct, and proxy_backend does get a CachedValue instance -- not raw data. Operating on value.payload to strip the CachedValue does work. My tests were picking up the pickled value on the backend.

As my goal was to remove Pickle, I realized that MsgPack worked better as a "Serializer" than a "Proxy". Implementing msgpack as the storage format using a proxy still required disabling pickling on the backend. As a custom serializer passed into backends, though, it is all accomplished with a single hook.

jvanasco (jvanasco) wrote:

I split out the proposed serializer code into https://bitbucket.org/zzzeek/dogpile.cache/pull-requests/42/feature-custom_serializer

jvanasco (jvanasco) wrote:

Several years later... I solved all my needs by simply creating a new backend and overwriting a few methods. I recommend that approach.


zzzeek commented Jun 28, 2019

another year later. do we still want to be able to pass in loads/dumps callables? I guess putting them on a Serializer class is, while a little more J2EE than I am these days, more consistent with how other things in dogpile have been done. the serializer gets the CachedValue and does with it what it wants -- throws it away, adds it to your JSON or whatever, doesn't matter.

@jvanasco

I think it is still an important feature to allow a choice of serializers. Pickle is a great default, but being locked into it isn't so great. Being able to store things in msgpack made one Redis instance about 25% more efficient for me (and some other tricks made it even more efficient!)

Design-wise... there are a lot of ways that I would prefer to implement this, but I think the most backwards-compatible would be changing the regions to accept loads/dumps args that default to None, pass those on to the backend, and let the backend fall back to Pickle if nothing is configured.


zzzeek commented Jun 28, 2019

OK yes, this is what I favored throughout this discussion as I re-read it. I believe moving to github also makes these kinds of things easier to navigate and read, as having to go to other repos on bitbucket is basically something I had a negative emotional reaction towards -- I apologize. Pull requests that illustrate code changes are best, but I guess in this case we just need to implement, right? cc @4383

@jvanasco

as having to go to other repos on bitbucket is basically something I had a negative emotional reaction towards

IMHO that is how everyone who doesn't work for Bitbucket feels when using their system. Sorry for overloading you with so many options; trying to give you an easier choice was really just complicating things.

Generally speaking though... the backends would accept loads/dumps like this:

https://github.com/jvanasco/dogpile_backend_redis_advanced/blob/master/dogpile_backend_redis_advanced/cache/backends/redis_advanced.py#L155-L156

then just call self.loads() and self.dumps()

Looking at the Region code in dogpile, I don't think that would actually need to change; the backend arguments aren't filtered or anything.
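
That pattern might look like the following (a hedged sketch with a toy backend; the loads/dumps argument names mirror the linked fork but are not an existing dogpile.cache option):

```python
import json
import pickle

class ToySerializingBackend:
    def __init__(self, arguments):
        # fall back to pickle when no serializer pair is configured
        self.loads = arguments.get("loads") or pickle.loads
        self.dumps = arguments.get("dumps") or pickle.dumps
        self._store = {}

    def set(self, key, value):
        self._store[key] = self.dumps(value)

    def get(self, key):
        return self.loads(self._store[key])

# default backend: pickle
backend = ToySerializingBackend({})
backend.set("k", {"n": 1})

# configured backend: json
json_backend = ToySerializingBackend({"dumps": json.dumps, "loads": json.loads})
json_backend.set("k2", [1, 2, 3])
```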


zzzeek commented Jun 28, 2019

yeah, it's not a hard change, but they just have to deal with the CachedValue thing explicitly. They can of course not store that info, or store it somewhere else, or use some kind of range-based approach; that's all about the serialization strategy.


zzzeek commented Jun 28, 2019

we can include a basic JSON plugin which will make some basic JSON of the CachedValue then call loads() / dumps() on the value they give us. Folks can use that as a starting point for other kinds of json / serialization or whatever. a msgpack plugin also, sure.
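
A sketch of what such a JSON plugin could look like (CachedValue is again a simplified stand-in; the real one is a tuple subclass):

```python
import json
from collections import namedtuple

# simplified stand-in for dogpile.cache.api.CachedValue
CachedValue = namedtuple("CachedValue", ["payload", "metadata"])

def json_dumps(cached_value):
    # flatten the CachedValue into a plain JSON document
    return json.dumps({"payload": cached_value.payload,
                       "metadata": cached_value.metadata})

def json_loads(serialized):
    document = json.loads(serialized)
    return CachedValue(document["payload"], document["metadata"])

cv = CachedValue(10001, {"ct": 1444097265, "v": 1})
restored = json_loads(json_dumps(cv))
```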

@jvanasco

IMHO, the easiest way to handle that is to bundle the (de)serialization into the backend's loads/dumps functions.

These examples, in a public repo, use msgpack, but I do the same with json in private projects.

msgpack -- dumps is fine as-is, but loads needs to wrap the tuple into a CachedValue

https://github.com/jvanasco/dogpile_backend_redis_advanced/blob/master/tests/cache/test_redis_backend.py#L252-L272

This example will amuse and possibly anger you... The CachedValue's metadata is tossed before serialization. When reading it out, it is recreated with the current time so the value never expires as far as dogpile is concerned.

https://github.com/jvanasco/dogpile_backend_redis_advanced/blob/master/tests/cache/test_redis_backend.py#L252-L272
