
FilesBackend and RangedFileLock #142

Closed · youtux wants to merge 1 commit
Conversation

@youtux (Contributor) commented Jan 25, 2019

This PR is the implementation of the solution proposed in #141.
Specifically, I implemented the RangedFileLock class, which makes it possible to use a single lock file for multiple keys (available only on platforms that have fcntl, just like FileLock).
I also introduced a new backend, FilesBackend. This backend works just like DBMBackend, with the following differences:

  • it can also be used to store files directly, without needing to keep the file's contents in memory;
  • it does not use a DBM file; it stores 3 files for each key (this could actually be reduced to 2).

This code is by no means clean yet, but I wanted early feedback.
The test suite is also incomplete, and it fails when testing the mutex returned by FilesBackend.get_mutex(), since that is a re-entrant mutex within the context of a single process.
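
For illustration, here is a minimal sketch of the idea behind a ranged file lock (not the PR's actual code; class and attribute names are placeholders): each key is hashed to a byte offset within a single shared lock file, and an fcntl byte-range lock is taken on that offset, so locks on different keys do not block each other.

import fcntl
import os
import zlib


class RangedFileLockSketch:
    # Illustrative byte-range lock: one lock file shared by many keys.

    NUM_RANGES = 2 ** 30  # how many distinct byte offsets keys are spread over

    def __init__(self, path, key):
        self.path = path
        # Map the cache key to a stable byte offset inside the lock file.
        self.offset = zlib.crc32(key.encode('utf-8')) % self.NUM_RANGES
        self.fd = None

    def acquire(self, exclusive=True):
        self.fd = os.open(self.path, os.O_CREAT | os.O_RDWR)
        lock_type = fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH
        # Lock a single byte at the key's offset; other keys map to other
        # offsets, so they are not blocked.
        fcntl.lockf(self.fd, lock_type, 1, self.offset)

    def release(self):
        fcntl.lockf(self.fd, fcntl.LOCK_UN, 1, self.offset)
        os.close(self.fd)
        self.fd = None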

I would like to know what you think @zzzeek.

@zzzeek (Member) commented Jan 25, 2019

OK, so #141 was just about the file lock, so I missed that the file backend was also part of this. Does the ranged file lock apply to the existing DBM backend as well?

For the file backend here, storing each key in three files (or two) looks pretty heavy. I would think that a production-grade version of something like this would have a lot of in-memory caching going on to reduce all those file reads; essentially, it would be a real key/value store. What is the goal of this backend?

@youtux (Contributor, Author) commented Jan 25, 2019

The RangedFileLock can be used for the existing DBM backend, but I didn't try it and it may require some tweaking.

I can reduce the number of files used per entry:

  • if the value being cached is a "normal" value (i.e. not a file), then I can pickle everything into a single file;
  • if the value being cached is a file, then I need at least 2 files: the file itself, and a metadata file.

The goal of this backend is to optimize memory usage when big files are generated by the cached function.
In my situation, I have a slow function generating a SQLite file, and I don't want to keep it in memory, since it can get quite big. Instead, I want to return a file object, and I expect the caching backend to use that file object directly without loading its contents into memory.

This is what my userland code looks like:

import datetime
import tempfile

import dogpile.cache
import flask
from flask import Flask
from sqlalchemy import create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

SQLITE3_TYPE = 'application/vnd.sqlite3'  # MIME type for SQLite database files

app = Flask(__name__)

ExportBase = declarative_base()

CACHE_EXPIRATION = datetime.timedelta(seconds=30)
file_region = dogpile.cache.make_region().configure(
    'dogpile.cache.files',
    expiration_time=CACHE_EXPIRATION,
    arguments={
        'base_dir': '/tmp/export-cache',
        'expiration_time': CACHE_EXPIRATION,
        'file_movable': True,
        # 'cache_size': 1024 * 1024 * 1024,  # 1 GB
        'distributed_lock': True,
    }
)


@app.route('/export/<uid>', methods=['GET'])
def export_get(uid):
    f = make_export_temp_file(
        uid=uid,
        some_arg=policy.check('some_policy'),  # `policy` is application-specific
    )
    response = flask.send_file(f, mimetype=SQLITE3_TYPE)
    return response


@file_region.cache_on_arguments()
def make_export_temp_file(uid, some_arg):
    # If you call this function bypassing dogpile.cache, make sure to delete the
    # file yourself, since it won't be deleted automatically.
    # TODO: Implement a NamedTemporaryFile wrapper so that calling .close() won't
    #       error in case the file does not exist.
    f = tempfile.NamedTemporaryFile(delete=False)
    make_export_file(uid=uid, some_arg=some_arg, file_path=f.name)
    return f


def make_export_file(uid, some_arg, file_path):
    engine = create_engine("sqlite:///" + file_path)
    ExportBase.metadata.create_all(engine)

    session = sessionmaker(bind=engine)()
    # Inserting a lot of stuff into the sqlite file...
    session.commit()
    engine.dispose()

@zzzeek (Member) commented Jan 25, 2019

OK, so I see a lot of conditionals there around whether or not to pickle the data file. I think this has to shoot for exactly the use case you have and nothing else: you have very large files - let's assume a server that returns files which can be multiple gigs (like software downloads) - so we definitely want to leave them as-is. It looks like in that case you are correctly returning the open file handle and not reading it all into memory, so that's good. If I'm reading correctly, that was the purpose of the "type" file, so let's get rid of that. Finally, the "metadata" file, which I now agree should be there, should just be a JSON file, not pickled. dogpile's use of pickle is so that normal Python objects don't need to define JSON serialization, but it would be better if they did; pickle is really just not very good.
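
For concreteness, a two-file-per-key layout with JSON metadata could look roughly like this (paths and field names are hypothetical, not a spec):

import json
import time

# Hypothetical layout for one cache key:
#   <base_dir>/<key_hash>.data       the cached file, stored verbatim
#   <base_dir>/<key_hash>.metadata   a small JSON document
metadata_path = '/tmp/export-cache/abc123.metadata'  # 'abc123' stands in for a key hash
metadata = {
    'created_at': time.time(),  # lets the region compute expiration
}
with open(metadata_path, 'w') as fh:
    json.dump(metadata, fh)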

Also, this really needs to be broken into two separate PRs. Originally the purpose of the ranged file lock was so that we wouldn't have to lock the entire DBM file, and whether or not that's something people find useful (it probably would not see that much real-world use, since folks really use memcached and redis for most caching), the lock should establish itself as functional for any arbitrary cache backend.

so to sum up:

  1. one PR for RangedFileLock that includes showing that it works for the DBM backend

  2. file backend is explicitly for very large files that are unprocessed

  3. store the metadata file as JSON

  4. backend receives / returns file handles, not strings

How does that sound?

@zzzeek (Member) commented Jan 25, 2019

Note also that I normally really prefer that folks create backends as their own third-party projects, which after all is why the backend system in dogpile is designed to be extremely easy to extend and release plugins for. That you are proposing it for inclusion in the main library means it's going to be there for years, and people will be coming to us with issues and requests for it. So the bar is very high; I normally don't accept additional backends, and in theory all the backends should be third-party projects even.
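
For context, a backend shipped as a separate package can be plugged into dogpile.cache with register_backend(); the package, module, and class names below are placeholders:

# In the third-party package (names are placeholders):
from dogpile.cache.region import register_backend

register_backend(
    'yourapp.files',          # name passed to make_region().configure(...)
    'yourapp.files_backend',  # module containing the backend class
    'FilesBackend',           # class implementing dogpile.cache.api.CacheBackend
)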

@youtux (Contributor, Author) commented Jan 25, 2019

> OK, so I see a lot of conditionals there around whether or not to pickle the data file. I think this has to shoot for exactly the use case you have and nothing else: you have very large files - let's assume a server that returns files which can be multiple gigs (like software downloads) - so we definitely want to leave them as-is. It looks like in that case you are correctly returning the open file handle and not reading it all into memory, so that's good. If I'm reading correctly, that was the purpose of the "type" file, so let's get rid of that. Finally, the "metadata" file, which I now agree should be there, should just be a JSON file, not pickled. dogpile's use of pickle is so that normal Python objects don't need to define JSON serialization, but it would be better if they did; pickle is really just not very good.

Agreed, the "type" file must go; it was just a rushed implementation artifact. I'd keep the pickling of the metadata, so that userland code does not need to pass JSON-serializable objects, only objects conforming to the "standard" Python marshalling protocol.

> Also, this really needs to be broken into two separate PRs. Originally the purpose of the ranged file lock was so that we wouldn't have to lock the entire DBM file, and whether or not that's something people find useful (it probably would not see that much real-world use, since folks really use memcached and redis for most caching), the lock should establish itself as functional for any arbitrary cache backend.

I don't think that this type of lock should be functional for any arbitrary cache backend. Here's my reasoning: in a typical deployment there are multiple machines serving requests, each one with its own non-shared file system, and each machine has multiple Python processes serving requests.
The file lock is consistent only within the domain of a single machine; it does not span across multiple machines. Memcached or Redis locks, instead, are centralized, so their domain is the whole deployment.

> so to sum up:
>
>   1. one PR for RangedFileLock that includes showing that it works for the DBM backend

That can be done, but I don't think we should make it the default for DBMBackend right now (with a blue-green deployment, some instances may use the old locking mechanism while new instances use the new one, and the two are not necessarily compatible).

>   2. file backend is explicitly for very large files that are unprocessed

This is not completely true: the file backend is a drop-in replacement for DBMBackend, in the sense that it still works with normal values that are not files.

>   3. store the metadata file as JSON

I'd keep it pickled; see my reasoning above.

>   4. backend receives / returns file handles, not strings

This backend can receive/return both file handles and any other picklable object (just like any other backend). I think it is quite nice that it can handle both, and telling them apart is not an issue, since file objects are not picklable.
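
A rough sketch of how such a backend might branch on the value type (illustrative only, not this PR's code), using duck typing to detect file-like values:

import pickle
import shutil


def write_data_file(value, data_path):
    # Store either a file-like object or an ordinary picklable value.
    if hasattr(value, 'read'):
        # File-like value: stream it to disk without reading it all into memory.
        with open(data_path, 'wb') as out:
            shutil.copyfileobj(value, out)
        return 'file'
    # Ordinary value: fall back to pickling, as the DBM backend does.
    with open(data_path, 'wb') as out:
        pickle.dump(value, out)
    return 'pickled'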

> Note also that I normally really prefer that folks create backends as their own third-party projects, which after all is why the backend system in dogpile is designed to be extremely easy to extend and release plugins for. That you are proposing it for inclusion in the main library means it's going to be there for years, and people will be coming to us with issues and requests for it. So the bar is very high; I normally don't accept additional backends, and in theory all the backends should be third-party projects even.

I understand your point as a general rule. However, I think this could be an exceptional case, in that the proposed backend has a number of benefits over the current DBMBackend solution:

  • finer-grained locking based on the key
  • ability to deal with both normal values and file objects; in the case of file objects, this provides better performance, since the file's contents are never loaded into memory
  • automatic pruning of entries when the max cache size is exceeded

Given this reasoning, would you consider merging this backend into the codebase (once it’s up to your specs, of course)?

@zzzeek (Member) commented Jan 25, 2019

I'm sick today, so I don't have much energy to go through the whole thing, but IIRC the metadata is a standard fixed format that comes from the CacheRegion; it isn't user-generated.

@youtux (Contributor, Author) commented Jan 28, 2019

> I'm sick today, so I don't have much energy to go through the whole thing, but IIRC the metadata is a standard fixed format that comes from the CacheRegion; it isn't user-generated.

Sure, no problem. But I think that all the backends should also work with userland code (that is, Backend.set() should be able to handle any marshallable value, not only CachedValue).

@zzzeek (Member) commented Jan 30, 2019

> I'm sick today, so I don't have much energy to go through the whole thing, but IIRC the metadata is a standard fixed format that comes from the CacheRegion; it isn't user-generated.
>
> Sure, no problem. But I think that all the backends should also work with userland code (that is, Backend.set() should be able to handle any marshallable value, not only CachedValue).

But your backend here is actually hardcoding to CachedValue and separating it into its two components explicitly; no other backend does this. There should be a cleaner separation of concerns here, that is, create a non-dogpile service that is just a "store a file with metadata" kind of service, then create a dogpile backend separately that links to that.
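
A sketch of that separation (all names here are hypothetical): a standalone file-with-metadata store that knows nothing about dogpile, plus a thin dogpile backend that delegates to it.

from dogpile.cache.api import CacheBackend, NO_VALUE


class FileStore:
    # Standalone "store a file with metadata" service; no dogpile knowledge.

    def __init__(self, base_dir):
        self.base_dir = base_dir

    def put(self, key, fileobj, metadata):
        ...  # persist the file plus a small metadata document

    def get(self, key):
        ...  # return (fileobj, metadata), or None if missing/expired


class FilesBackend(CacheBackend):
    # Thin dogpile adapter that delegates all persistence to FileStore.

    def __init__(self, arguments):
        self.store = FileStore(arguments['base_dir'])

    def get(self, key):
        result = self.store.get(key)
        return NO_VALUE if result is None else result

    def set(self, key, value):
        ...  # hand the value (and the region's metadata) to the FileStore

    def delete(self, key):
        ...  # remove the key's files via the FileStore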

This is all stuff that would be better as an independent project, because I really have no resources to maintain dogpile.cache beyond the bare essentials right now; I can't take on support for a whole new system.

@zzzeek (Member) commented Jan 30, 2019

Also, the DBM backend is not supposed to be that amazing. It's supposed to be a barebones, "demo" kind of backend, just like the memory backend is very basic as well. The industrial-strength backends that are less trivial than a simple memcached or redis backend should be first-class, independently maintained projects, because again I don't have the resources to maintain the persistence mechanics on this end. This is similar to how dogpile was ultimately derived from its predecessor, Beaker, which had elaborate SQL backends. I got rid of those too because they were too complicated to maintain; instead, I made it very easy for someone who wants a SQL database as a cache backend to make their own.

@youtux (Contributor, Author) commented Feb 4, 2019

Got it, I will create an external project then.
Thank you for your feedback and your time.

@youtux closed this Feb 4, 2019
@youtux (Contributor, Author) commented Sep 10, 2020

The plugin is available: https://pypi.org/project/dogpile-filesystem/
