FilesBackend and RangedFileLock #142
Conversation
OK, so #141 was just about the file lock, so I missed that the file backend was also part of this. Does the ranged file lock apply to the existing DBM backend as well? For the file backend here, storing each key in three files (or two) looks pretty heavy. I would think that a production-grade version of something like this would have a lot of in-memory caching going on to reduce all those file reads; essentially, that it would be a real key/value store. What is the goal of this backend? |
The RangedFileLock can be used for the existing DBM, but I didn't try it and it may require some tweaking. I can reduce the number of files used per entry:
The goal of this backend is to optimize memory usage when big files are generated by the user-space function. This is what my userland code looks like:

```python
import datetime
import tempfile

import dogpile.cache
import flask
from flask import Flask
from sqlalchemy import create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

app = Flask(__name__)
ExportBase = declarative_base()
CACHE_EXPIRATION = datetime.timedelta(seconds=30)

file_region = dogpile.cache.make_region().configure(
    'dogpile.cache.files',
    expiration_time=CACHE_EXPIRATION,
    arguments={
        'base_dir': '/tmp/export-cache',
        'expiration_time': CACHE_EXPIRATION,
        'file_movable': True,
        # 'cache_size': 1024 * 1024 * 1024,  # 1 GB
        'distributed_lock': True,
    }
)


@app.route('/export/<uid>', methods=['GET'])
def export_get(uid):
    f = make_export_temp_file(
        uid=uid,
        some_arg=policy.check('some_policy'),
    )
    response = flask.send_file(f, mimetype=SQLITE3_TYPE)
    return response


@file_region.cache_on_arguments()
def make_export_temp_file(uid, some_arg):
    # If you call this function bypassing dogpile cache, make sure to delete
    # the file yourself, since it won't be deleted automatically.
    # TODO: Implement a NamedTemporaryFile wrapper so that calling .close()
    # won't error in case the file does not exist.
    f = tempfile.NamedTemporaryFile(delete=False)
    make_export_file(uid=uid, some_arg=some_arg, file_path=f.name)
    return f


def make_export_file(uid, some_arg, file_path):
    engine = create_engine("sqlite:///" + file_path)
    ExportBase.metadata.create_all(engine)
    session = sessionmaker(bind=engine)()
    # Inserting a lot of stuff into the sqlite file...
    session.commit()
    engine.dispose()
```
|
OK, so I see a lot of conditionals there around whether we want to pickle the datafile or not. I think this has to shoot for the use case you have exactly and nothing else: you have very large files; let's assume something like a server that returns files which can be multiple gigs (like software downloads), so we definitely want to leave them as is. It looks like in that case you are correctly returning the open file handle and not reading it all into memory, so that's good. If I'm reading correctly, that was the purpose of the "type" file, so let's get rid of that. Finally, for the "metadata" file, which I now agree should be there, that should just be a JSON file, not pickled. dogpile's use of pickle is so that normal Python objects don't need to define JSON serialization, but it would be better if they did; pickle is really just not very good.

Also, this really needs to be broken into two separate PRs. The original purpose of the ranged file lock was so that we wouldn't have to lock the entire dbm file, and whether or not that's something people find useful (it probably would not see that much real-world use, since folks mostly use memcached and redis for caching), the lock should establish itself as functional for any arbitrary cache backend.

so to sum up:
how does that sound? |
note also that I normally really prefer that folks create backends as their own third-party projects, which after all is why the backend system in dogpile is designed to be extremely easy to extend and release plugins for. That you are proposing it for inclusion in the main library means it's going to be there for years, and people will be coming to us with issues and requests for it. So the bar is very high; I normally don't accept additional backends, and in theory all the backends should be third-party projects even. |
Agreed, the "type" file must go; it was just a rushed implementation artifact. I'd keep the pickling of the metadata, so that the userland code does not need to pass JSON-serializable objects, but can pass anything conforming to the standard Python pickling protocol.
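To illustrate the trade-off being discussed (a minimal sketch, not the backend's actual metadata format): pickle round-trips arbitrary Python objects such as `datetime.timedelta`, while `json` raises `TypeError` on them unless the caller defines custom serialization.

```python
import datetime
import json
import pickle

# Hypothetical metadata payload; timedelta/datetime are not
# JSON-serializable by default, but pickle handles them transparently.
metadata = {
    "created": datetime.datetime(2018, 1, 1),
    "ttl": datetime.timedelta(seconds=30),
}

restored = pickle.loads(pickle.dumps(metadata))
assert restored == metadata

# json.dumps fails unless a custom encoder is supplied.
try:
    json.dumps(metadata)
except TypeError as exc:
    print("json refused:", exc)
```

This is the crux of the disagreement: JSON keeps the on-disk format inspectable and language-neutral, while pickle spares userland code from defining serialization.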
I don't think that this type of lock should be functional for any arbitrary cache backend. Here's my reasoning: in a typical deployment there are multiple machines serving requests, each one with its own non-shared file system. Each machine has multiple Python processes serving the requests.
That can be done, but I don't think we should make it the default for DBMBackend now (in case you have a blue-green deployment, some instances may use the old locking mechanism while new instances use the new one, and they are not necessarily compatible).
This is not completely true, as the file backend is a drop-in replacement for DBMBackend, in the sense that it still works with normal values that are not files.
I'd keep it pickled, see my reasoning before.
This backend can receive/return both file handles and any other picklable object (just like any other backend). I think it is quite nice that it can handle both, and it is not an issue, since file objects are not picklable.
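A minimal sketch of the dispatch this implies (hypothetical helper and file-naming scheme, not the backend's actual code): since file objects cannot be pickled, the two cases can be distinguished by type, copying a file-like value into the cache directory and falling back to pickling for everything else.

```python
import io
import os
import pickle
import shutil

def store_value(cache_dir, key, value):
    """Store either an open file (by copying its contents) or any picklable value.

    Illustrative only; the real backend's on-disk layout differs.
    """
    target = os.path.join(cache_dir, key)
    if isinstance(value, io.IOBase) or hasattr(value, "fileno"):
        # File-like object: stream its contents instead of pickling.
        value.seek(0)
        with open(target + ".file", "wb") as out:
            shutil.copyfileobj(value, out)
    else:
        # Ordinary value: pickle it, as the other backends do.
        with open(target + ".pickled", "wb") as out:
            pickle.dump(value, out)
```

The nice property is exactly the one noted above: because file objects are unpicklable, there is no ambiguous case where a value could legitimately take either path.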
I understand your point as a general rule. However I think this could be an exceptional case, in that the proposed backend has a number of benefits over the current DBMBackend solution:
Given this reasoning, would you consider merging this backend into the codebase (once it’s up to your specs, of course)? |
I'm sick today, so don't have much energy to go through the whole thing, but IIRC the metadata is a standard fixed format that comes from the CacheRegion, it isn't user generated. |
Sure, no problem. But I think that all the backends should work even with the userland code (that is, |
but your backend here is actually hardcoding to CachedValue and separating it into its two components explicitly. no other backend does this. there should be a cleaner separation of concerns here, that is, create a non-dogpile service that is just a "store a file with metadata" kind of service, then create a dogpile backend separately that links to that. this is all stuff that would be better as an independent project because I really have no resources to maintain dogpile.cache over the bare essentials right now, I can't take on support for a whole new system. |
also the DBM backend is not supposed to be that amazing. it's supposed to be a barebones, "demo" kind of backend, just like the memory backend is very basic as well. The industrial-strength backends that are less trivial than a simple memcached or redis backend should be first-class, independently maintained projects, because again I don't have the resources to maintain the persistence mechanics on this end. This is similar to how dogpile was ultimately derived from its predecessor, Beaker, which had elaborate SQL backends. I got rid of those too because they were too complicated to maintain; I instead made it very easy for someone who wants a SQL database as a cache backend to make their own. |
Got it, I will create an external project then. |
The plugin is available: https://pypi.org/project/dogpile-filesystem/ |
This PR is the implementation of the solution proposed in #141.

Specifically, I implemented the `RangedFileLock` class, which allows using one lock file for multiple keys (available only on platforms that have `fcntl`, just like `FileLock`). I also introduced a new backend, `FilesBackend`. This backend works just like DBMBackend, with the following differences:

This code is by no means clean yet, but I just wanted early feedback. Also, the test suite is not complete, and it fails when testing the mutex returned by `FilesBackend.get_mutex()`, since this is a re-entrant mutex in the context of the process. I would like to know what you think @zzzeek.
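The core idea of a ranged file lock as described here can be sketched as follows (a minimal illustration under my own assumptions about the design, not the PR's actual implementation): hash each key to a byte offset within a single shared lock file and take an `fcntl` range lock on that one byte, so different keys rarely contend while the same key always maps to the same range.

```python
import fcntl
import hashlib
import os

class RangedFileLockSketch:
    """Lock a 1-byte range of one shared lock file, derived from the key's hash.

    Illustrative sketch only; POSIX-only, since it relies on fcntl.lockf.
    """

    def __init__(self, path, key, num_ranges=1024):
        self.fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        # Map the key to one of num_ranges single-byte regions.
        self.offset = int.from_bytes(digest[:8], "big") % num_ranges

    def acquire(self):
        # Exclusive lock on 1 byte starting at self.offset.
        fcntl.lockf(self.fd, fcntl.LOCK_EX, 1, self.offset, os.SEEK_SET)

    def release(self):
        fcntl.lockf(self.fd, fcntl.LOCK_UN, 1, self.offset, os.SEEK_SET)

    def close(self):
        os.close(self.fd)
```

Note that `fcntl` locks are held per process, which matches the re-entrancy caveat mentioned above: within a single process, re-acquiring the same range succeeds, so the mutex behaves as re-entrant in the context of the process.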