get serialization to be configurable with redis backend, others ? #18
jvanasco (jvanasco) wrote: "serialization, the API is a loads()/dumps() function call, probably passed as any object/module that has "loads"/"dumps" available - even a named tuple. Using ObjectsWithUpperCaseNames is way too zopey for this." that works. sorry about #21 -- didn't see this one (was searching issues with the wrong terms)
jvanasco (jvanasco) wrote: coming back to this. a few more thoughts on making this configurable/modular:
• the ability to store/retrieve raw data in the client. i.e., don't wrap it in
• handling the time relevance/metadata as part of the deserialization. in order to give the best amount of control over the data storage/format, the serializer should be able to encode/omit
A use-case example is storing a username-to-id mapping in 2 different backends. Given: Jonathan == 10001
• Storing in a dbm, I would need to store the time-relevance in the CachedValue object. The perfect solution is a pickled object that pairs the data with metadata.
• Storing in redis, the server already handles expiry and memory is a concern. The perfect solution is storing the raw value.
The dbm solution doesn't work well for redis in production though: it takes 78 bytes to achieve the same goal that 5 bytes could.
jvanasco (jvanasco) wrote: I've started to fork this here: https://bitbucket.org/jvanasco/dogpile.cream Instead of implementing as a region, I'm using a global setting in
jvanasco (jvanasco) wrote: dogpile.cream is working on my staging site and on production web spiders. no errors. Spent an hour integrating the tests into my normal fork (https://bitbucket.org/jvanasco/dogpile.cache/commits/branch/feature-cachecontrol_region) I think I can pull this off with a few API changes:
Darin Gordon (Dowwie) wrote: Hey Mike and Jon. I'm interested in using dogpile cache as the standard caching plugin for yosai (http://www.github.com/yosaiproject). Within yosai's serialization module is a SerializationManager. In this SerializationManager, I add one layer of metadata to the message payload and then encode the message using whatever scheme it has been configured to use, such as MSGPack or JSON. Through the region configuration process, I'd like to inject a SerializationManager-like object into dogpile cache that would control serialization/deserialization. What do you think of me submitting a PR to add a serialization management layer to dogpile.cache? reference: https://github.com/YosaiProject/yosai/blob/master/yosai/serialize/serialize.py DG
Michael Bayer (zzzeek) wrote: I was thinking of just adding a flag to the Redis backend where you can pass in a pickler / serializer object of your choice, or set it to None. The yosai thing sounds like it would be a dogpile cache ProxyBackend (and could be part of yosai).
Michael Bayer (zzzeek) wrote: that is, I don't think trivial serialization like pickling, which is already intrinsically part of some client libraries and not in our control (as with all the memcached backends), should be a "middleware" component by default. this is a little bit of TMTOWTDI, but just like the key mangler, having small injectables for rudimentary key/value mangling is valuable vs. requiring a mediation object in all cases.
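A minimal sketch of that "small injectable" idea, assuming a hypothetical backend (`SketchRedisBackend` and its in-memory dict are stand-ins, not the real redis backend): any object or module exposing loads()/dumps() can be dropped in, and None means "store the raw value untouched".

```python
import json
import pickle

class SketchRedisBackend:
    """Hypothetical backend: any object/module with loads()/dumps() works."""

    def __init__(self, serializer=pickle):
        # None means "store the value untouched" -- the raw-data case
        self.serializer = serializer
        self._store = {}  # stands in for a real redis client

    def set(self, key, value):
        data = value if self.serializer is None else self.serializer.dumps(value)
        self._store[key] = data

    def get(self, key):
        data = self._store.get(key)
        if data is None or self.serializer is None:
            return data
        return self.serializer.loads(data)

# swapping serializers is just a constructor argument
backend = SketchRedisBackend(serializer=json)
backend.set("user:jonathan", 10001)
```

With `serializer=None`, the 5-byte raw value from the earlier username-to-id example would be stored as-is rather than wrapped and pickled.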
jvanasco (jvanasco) wrote: So basically, I did that here. Feel free to play with my fork/branch https://bitbucket.org/zzzeek/dogpile.cache/pull-requests/39/fine-grained-support-for-cache/diff I had a shower-thought on it the other day about consolidating this further, but have since forgotten :/ It was a good one too. The basic premise is that I saw the need for 2 things to be split out from dogpile:
BUT what you're talking about could simply be done with a proxy backend, as Mike noted. Mike recently accepted a patch where I illustrated using msgpack as a proxy backend. The caveat, though, is that dogpile will cache: In order to cache
Darin Gordon (Dowwie) wrote: Pickling a well-serialized object is cringe-worthy. Further, over the last few years, I haven't heard a single engineer recommend using pickle for serialization. This is just an assumption based on anecdotal evidence, but it seems that "the industry" has moved on. As Alex Gaynor said, pickles are for delis -- not software. So, why not just remove pickle entirely?
Michael Bayer (zzzeek) wrote: because all the memcached clients have pickle bolted into them by default. pickles are still extremely relevant.
jvanasco (jvanasco) wrote: pickling a well-serialized object is cringe-worthy. it is also unwanted when dealing with raw data (such as caching a raw number vs a pickled python object containing a number and metadata). but dogpile is a caching library, not a "well serialized object" caching library. it makes sense to support pickle, because oftentimes you may want to cache an unserialized python object. since pickle is the historical default, it needs to be maintained for backwards compatibility.
Darin Gordon (Dowwie) wrote: I'll do as recommended and serialize using a customized ProxyBackend. I just don't want any pickling. Turning off pickling involves a straightforward update. A default of pickle=True would ensure others wouldn't be impacted by the change. Mike, are you good with moving forward on this approach?
Darin Gordon (Dowwie) wrote: setter-injected or constructor?
Michael Bayer (zzzeek) wrote: the cache backends take arguments in their constructor, which get configured via the region through the "arguments" parameter to configure().
jvanasco (jvanasco) wrote: I don't believe this approach can work as intended.
reference: If you turn off pickling in the backend, you're still stuck with a Python instance (CachedValue) being passed to the backend. That instance contains the data serialized by the custom proxy_backend AND some metadata that the current implementation of reference:
Just to be clear about this, the logic flow is: [raw value] -> [proxy backend munging] -> [region.py wraps in CachedValue w/metadata] -> [backend/driver.py serializes] Without restructuring the logic as I've proposed, the easiest way to achieve @Dowwie's goals would be to re-implement the redis backend in a custom package that performs msgpack serialization on the backend value (instead of the pre-backend value munging that proxy_backend offers); that could also just fake the entire CachedValue wrapper or store the metadata within the msgpack format.
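That flow can be sketched concretely. CachedValue is modeled here as a plain namedtuple (the real one in dogpile.cache.api is a tuple subclass pairing a payload with a metadata dict); the point is that the backend receives the wrapper, not the munged value:

```python
import time
from collections import namedtuple

# stand-in for dogpile.cache.api.CachedValue: payload paired with metadata
CachedValue = namedtuple("CachedValue", ["payload", "metadata"])

raw_value = 10001                # [raw value]
munged = raw_value               # [proxy backend munging] (identity here)
# [region wraps in CachedValue w/metadata]
wrapped = CachedValue(munged, {"ct": time.time(), "v": 1})

# the backend/driver then serializes `wrapped`, not `munged` -- so turning
# off pickling in the backend still hands the driver a Python object,
# which a raw redis client cannot store as-is
assert not isinstance(wrapped, (bytes, str))
```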
Michael Bayer (zzzeek) wrote:
ah. go you, for reading my source code I never look at! this is true. But still. CachedValueManager? Why do we need another object, and why such a J2EE-like architecture? CachedValue is our payload. If I don't want to use pickle, then that indicates I want to do the work of writing some other kind of serialization for CachedValue. There is still a loads()/dumps() function. It's just that the approach of allowing "None" is no longer necessary, because you need a serializer no matter what. If you want to serialize CachedValue using json, that's fine. This just places some extra burden on the production of loads()/dumps() functions for those people who have some issue with pickling. They have to handle the CachedValue wrapper.
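A loads()/dumps() pair along those lines might look like this -- a sketch, with CachedValue modeled as a namedtuple and the JSON layout an illustrative choice, not dogpile's actual format:

```python
import json
from collections import namedtuple

# stand-in for dogpile.cache.api.CachedValue
CachedValue = namedtuple("CachedValue", ["payload", "metadata"])

def dumps(value):
    # serialize the whole wrapper: the payload and its dogpile metadata
    return json.dumps({"payload": value.payload, "metadata": value.metadata})

def loads(data):
    decoded = json.loads(data)
    return CachedValue(decoded["payload"], decoded["metadata"])

serialized = dumps(CachedValue(10001, {"ct": 1000.0, "v": 1}))
restored = loads(serialized)
```

This is the "extra burden" in practice: the functions must round-trip the wrapper, not just the payload.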
jvanasco (jvanasco) wrote: I have spent way too much time recently reverse-engineering your code! The way I implemented this, it's 100% backwards compatible. The backends use a serialization class, but they default to pickle or none (whatever is current). If you want to configure a region to use JSON instead, then you just pass that in during config. The reason why I came up with CachedValueManager is two-fold:
The general goal was that this approach allows the 80% of "basic needs" dogpile users to see no difference in usage, but the 20% of "advanced needs" users would be able to build out much of the custom work and edge-cases that are often requested.
Michael Bayer (zzzeek) wrote: the backend "timeout" value is not compatible with the dogpile approach, because when the backend "times out", the value is gone - we can't return it to clients while the "new" value is generated. CacheRegion can be passed a CachedValue subclass as a parameter also.
jvanasco (jvanasco) wrote:
Michael Bayer (zzzeek) wrote: I don't agree that the mechanism of CacheRegion is so trivially swappable, and also CachedValueManager is too much up-front interface / OOP for an anticipated use case. I propose a plugin for the CachedValue class; not using it at all is not an option. just use redis directly if you don't want the features of CacheRegion (e.g. dogpile protection). The entire point of CacheRegion is that it does not expose a so-called "cache miss", by returning the previously cached value when expiration is detected.
Michael Bayer (zzzeek) wrote: But also, for the use case you're looking for -- "I don't want pickle" -- as always: the RedisBackend accepts a loads() / dumps() pair, that's it. You need to determine what to do with the given CachedValue object.
jvanasco (jvanasco) wrote:
The idea of turning off the python-side cache management was also influenced by the fact that this would allow users to hot-swap backends and fine-tune for performance by only changing the dogpile config. redis / riak / memcached could be switched and profiled quickly. Depending on an application's usage, memory (and the ability to hold an entire data-set in working memory) may be much more important than speed and concurrency. It's not pleasant ripping out and re-implementing caching libraries when you need to test this (I sadly know from experience ;/)
So now I remembered my shower-thought -- If we implement my idea for The reason why this would be a good first step is that it allows for undocumented/unsupported/offlabel munging of the data. A developer could repack the CachedValue data into an internal structure and rebuild it as needed... or fake the CachedValue as not having an expiry.
Darin Gordon (Dowwie) wrote: @zzzeek Makes sense to keep the CacheRegion largely intact. It is the bread and butter of this project. Why should backends be concerned about load/dump (serialization)? Instead, a backend should expect cache-ready, serialized objects as input. Do we agree on this point? If we agree to the above two points, then we're leading towards a CacheSerializer managing serialization / de-serialization, and consequently CachedValue creation. Am I correct?
Michael Bayer (zzzeek) wrote:
because, they already are. Here is the source of the most common client, python-memcached: https://github.com/linsomniac/python-memcached/blob/master/memcache.py#L928 That particular memcached client, as it was first, set the stage for all the other ones. What is extremely interesting to note here is that they are taking advantage of the memcached protocol itself to store additional "flags" which indicate how the value is serialized; these flags are only available on the wire protocol and are not exposed by the client. So the memcached backends provide very elaborate, wire-protocol-bound serialization functionality already, hardcoded to pickle.

To be fair, while I don't know how well they document this, they are using those flags to indicate whether the value was in fact a Python object to start with and is pickled; if a plain string was passed, it is not pickled, so there is a way to affect whether or not this occurs. But the need for serialization itself, as well as the best way to do it, is a backend-specific affair. When dogpile was written, the memory, file and memcached backends were it. One backend needed no serialization (memory), one had none (file), and another already provided it (memcached). The job of serialization was clearly something that changes per-backend. If we removed calls to "pickle" from the file / redis backends, those backends would fail to function correctly with the cache region. If we put pickling into the CacheRegion itself, it would need to be turned on by default, which means it would fire unnecessarily for the memory and memcached backends. The backends should be able to work with a region without additional flags added to the region itself, other than the backend itself. Behaviors that need to travel with the backend should be configured with it.

The overarching theme of this thread is one of vague suggestions of "undocumented/unsupported/offlabel munging of the data" and "virtually any caching logic can be bootstrapped onto dogpile's locking system and caching infrastructure"; I can't work with those. Adding small flags, sure. But rearchitecting the object model, that's a huge job, and one that has to be derived from a set of very specific use cases. Throughout this thread, I still don't know what either of you want to do. Not serializing isn't an option, and "Alex Gaynor said pickle is dead" is not really sufficient :). @jvanasco's use cases are even more obscure. Just building out a more exploded API and declaring "now it can do anything!" also is not enough for me to go on (and also seems kind of out of scope for "we want better control over serialization").
Darin Gordon (Dowwie) wrote: I eat huge jobs for breakfast. I'll return with my munged fork in a few.
jvanasco (jvanasco) wrote: First off - I think it's important to note that technology has changed a lot since the earlier dogpile releases. Memcached used to (in general) be the only option for caching daemons; now it is often considered the worst -- with redis, riak and some other systems being much faster and more feature-filled. Processor speeds and memory sizes are also generally much higher, so the chance of blocking tends to be down quite a bit. Dogpile may have been intended as the caching system built around the dogpile problem, but it has grown to be much more. Many people are attracted to it because it defines a robust API with hot-swappable backends. Unfortunately, my application has grown to the point where dogpile has become a huge bottleneck for me. The manner in which dogpile wraps data into a CachedValue object makes my "mapping" region (name to id) about 18x larger, and most of my other objects 3-4x larger. Dogpile essentially creates a situation where I need a dogpile lock addressed, because my caching backend can only hold 1/6 of the data and everything is expiring based on FIFO rules. I do not want to rip out every bit of dogpile in my app and replace it with a direct redis client (I lose hot-swap options, meaning I have to run a configured redis in every environment). That means I can either maintain a forked dogpile (which I'd like to avoid, but it is the better option) or figure out some patch/API change that will give me enough hooks to accomplish what I need. Being able to specify serializers for each backend does that -- as it creates a "hook" at the right spot. That being said, let me just clarify my suggestion:
jvanasco (jvanasco) wrote:
There are 2 dogpile contexts, and @zzzeek is primarily talking about one of them. Imagine there is a really expensive function that takes 5 minutes to generate a value. In the span of 15 seconds, 15 unique requests are made for that value, one each second. It could cripple a server to handle all 15. The first dogpile context is the "unprimed cache". If there is no value in the cache, the first request creates a "Lock" on creating the value. The lock keeps the next 14 people from generating a value, but makes them wait for a result. The second dogpile context is "stale data". If there is a value in the cache but it has expired, the first request creates a new value, while the next 14 requests use the stale data. "Stale data" is possible because the
While redis can have a timeout on values, it is not required and a value may not necessarily have a timeout (keys can have no expiry). Memcached, dbm, memory, and other datastores don't have native timeouts either. Redis, Riak, and a few other "newer" daemons support this (increasingly popular) concept. In order to support the second context (and generally have timeouts on backends that do not have a native timeout), dogpile needs that expiry timestamp in the cached value. My big issue right now is just raw memory. This old Instagram post illustrates a similar issue [they used hashing to minimize the data stored based on prefixes; my concern is just the wrapper], but goes into the concerns of server resources: http://instagram-engineering.tumblr.com/post/12202313862/storing-hundreds-of-millions-of-simple-key-value So my thought is this: if there is a documented loads/dumps hook in the backend, then a developer can deconstruct/rebuild a CachedValue as needed, and that solves most problems. Custom serializers for redis might do this:
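A hedged sketch of what such a redis serializer could do (CachedValue is a namedtuple stand-in here, and letting redis's own TTL replace dogpile's expiry metadata is the assumption being illustrated, not dogpile's built-in behavior):

```python
import json
import time
from collections import namedtuple

# stand-in for dogpile.cache.api.CachedValue
CachedValue = namedtuple("CachedValue", ["payload", "metadata"])

def dumps(value):
    # store only the raw payload; redis's TTL handles expiry server-side,
    # so the wrapper and its metadata never hit the wire
    return json.dumps(value.payload)

def loads(data):
    # rebuild the wrapper with a fresh creation time, so dogpile never
    # considers the value stale -- key eviction by redis is the real expiry
    return CachedValue(json.loads(data), {"ct": time.time(), "v": 1})

stored = dumps(CachedValue(10001, {"ct": 1234.0, "v": 1}))
```

For the username-to-id example, `stored` is just the digits of the id: the 5-byte raw value rather than the ~78-byte pickled wrapper.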
Michael Bayer (zzzeek) wrote:
If you have an approach where you'd like to store the information of the CachedValue in a different way, I'm still not understanding why a ProxyBackend is not acceptable. that's what your recipe at http://dogpilecache.readthedocs.org/en/latest/usage.html#encoding-decoding-data-into-another-format does. The "CachedValue makes my data 18x larger" phrase implies that you have very small data values. I've read your two latest comments, not taking into account the various proposals / code examples, because I really need to see only the problem to be solved, and that's the only problem case I can see described. Does the recipe you posted no longer solve this problem? I'm not reading every word of these very long comments very carefully yet because I still don't have knowledge of the problem to be solved. The two problems so far I see are: 1. dowwie doesn't want to use pickle because someone said "pickle was dead", and 2. you don't like the way CachedValue serializes because it takes up too much memory. I'm not seeing how "allow a pickle function and/or None to be passed to redis" doesn't solve problem #1, and how the recipe at http://dogpilecache.readthedocs.org/en/latest/usage.html#encoding-decoding-data-into-another-format doesn't solve problem #2 (and in fact, if you can turn off the pickle aspect of the backend, it lets you do anything you want with the data, full blown). Putting aside questions of purity, where things should be, what cache backends are cool now, etc.
jvanasco (jvanasco) wrote: I'll try to be concise:
In terms of the other stuff: for a simple mapping of usernames to ids (which we use in URL resolution and is our most used function), a "raw value" of
Michael Bayer (zzzeek) wrote:
→ c6913eb |
Michael Bayer (zzzeek) wrote:
I'm not following - the ProxyBackend gets the CachedValue object. Just added a test for this to verify I'm not crazy. See the above cset. Did you mean something else?
jvanasco (jvanasco) wrote: I'm explaining this wrong, and am also a bit confused/going-crazy-on-this after weeks of obsessing over the internals. The sheer number of tests/branches/approaches I have to get around the Pickling issue is intimidating. You are correct and As my goal was to remove Pickle, I realized that MsgPack worked better as a "Serializer" than a "Proxy". Implementing msgpack as the storage format using a proxy still required disabling pickling on the backend. As a custom serializer passed into backends, though, it is all accomplished with a single hook.
jvanasco (jvanasco) wrote: I split out the proposed serializer code into https://bitbucket.org/zzzeek/dogpile.cache/pull-requests/42/feature-custom_serializer
jvanasco (jvanasco) wrote: Several years later... I solved all my needs by simply creating a new backend and overwriting a few methods. I recommend that approach.
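That approach can be sketched without any dogpile imports. `BaseBackend` below is a stand-in for the real redis backend (whose get()/set() are the only points that touch serialization), and JSON is an illustrative replacement scheme:

```python
import json

class BaseBackend:
    # stand-in for a dogpile backend: get()/set() are the only points
    # that would touch serialization in the real thing
    def __init__(self):
        self._store = {}

    def set(self, key, value):
        self._store[key] = value

    def get(self, key):
        return self._store.get(key)

class JSONBackend(BaseBackend):
    # "overwriting a few methods": only the serialization points change
    def set(self, key, value):
        super().set(key, json.dumps(value))

    def get(self, key):
        raw = super().get(key)
        return raw if raw is None else json.loads(raw)

backend = JSONBackend()
backend.set("user:jonathan", 10001)
```

Subclassing keeps the region, locking, and configuration machinery untouched while swapping the on-the-wire format.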
another year later. do we still want to be able to pass in loads/dumps callables? I guess putting them on a Serializer class is, while a little more J2EE than I am these days, more consistent with how other things in dogpile have been done. the serializer gets the CachedValue, and your serializer does with it what it wants: throws it away, adds it to your JSON or whatever, doesn't matter.
I think it is still an important feature to allow a choice of serializers. Design-wise... there are a lot of ways that I would prefer to implement this, but I think the most backwards-compatible would be changing the regions to accept a
OK yes, this is what I favored throughout this discussion as I re-read it. I believe also that moving to github makes these kinds of things easier to navigate and read, as having to go to other repos on bitbucket is basically something I had a negative emotional reaction towards, I apologize. Pull requests that illustrate code changes are best, but I guess in this case we just need to implement, right? cc @4383
IMHO that is how everyone who doesn't work for Bitbucket feels when using their system. Sorry for overloading you with so many options; trying to give you an easier choice was really just complicating things. Generally speaking, though... the backends would accept loads/dumps like this: then just call Looking at the
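As a sketch of the callables-on-the-constructor shape (names and the in-memory store are hypothetical; pickle stays the default so existing behavior is preserved):

```python
import pickle

class SketchBackend:
    # hypothetical: plain loads/dumps callables instead of a manager object
    def __init__(self, loads=pickle.loads, dumps=pickle.dumps):
        self.loads = loads
        self.dumps = dumps
        self._store = {}  # stands in for the real client connection

    def set(self, key, value):
        self._store[key] = self.dumps(value)

    def get(self, key):
        raw = self._store.get(key)
        return raw if raw is None else self.loads(raw)

backend = SketchBackend()  # defaults preserve today's pickle behavior
backend.set("k", {"a": 1})
```

Passing different callables at construction time is the whole configuration surface; no Serializer class is required for the simple case.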
yeah it's not a hard change, but they just have to deal with the CachedValue thing explicitly. They can of course not store that info, or store it somewhere else, or use some kind of range-based approach; that's all about the serialization strategy.
we can include a basic JSON plugin which will make some basic JSON of the CachedValue, then call loads() / dumps() on the value they give us. Folks can use that as a starting point for other kinds of json / serialization or whatever. A msgpack plugin also, sure.
IMHO, the easiest way to handle that is to bundle the (de)serialization into the backend's The example in my public repo uses msgpack, but I do the same with json in private projects. This example will amuse and possibly anger you... The CachedValue's metadata is tossed before serialization. When reading it out, it is recreated with the current time, so the value never expires as far as dogpile is concerned.
Migrated issue, originally created by Michael Bayer (zzzeek)