Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make disk_cache persist across runs #195

Merged
merged 4 commits into from
Jul 30, 2021

Conversation

lakesare
Copy link
Contributor

@lakesare lakesare commented Jul 5, 2021

No description provided.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.05%) to 86.761% when pulling ddb2d79 on lakesare:make_disk_cache_more_persistent into 7e8635e on scidash:master.

@coveralls
Copy link

coveralls commented Jul 5, 2021

Coverage Status

Coverage increased (+0.06%) to 86.867% when pulling bed2fe1 on lakesare:make_disk_cache_more_persistent into 7e8635e on scidash:master.

if isinstance(from_kwargs, Path) or isinstance(from_kwargs, str):
location = str(from_kwargs)
else:
location = Path(tempfile.mkdtemp()) / "cache"
Copy link
Contributor Author

@lakesare lakesare Jul 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rgerkin, as mentioned in the comment, we'll need to add config.set_to_disk() for the commented-out version to work.

Before I commence with that, however, how about another idea.
We can use Path(os.getcwd()) / 'cache', which will create the cache file in the folder the user executes their script from.
This way we don't need to mess with the config, and the user keeps direct control over the size of their cache file.
Do you see any drawbacks to this approach?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It sounds appealing, but most applications hide their cache somewhere that the user is unlikely to mess with it. Maybe this is undesirable for something like sciunit where the modal user is also a developer. But I could still imagine that cache contents get accidentally git committed, for example.

Also, the user might execute their script from different locations (cwd could be a directory full of notebooks one time and the project root another time) and then we end up with unnecessary computation and two caches!

So I'd prefer to have tempfile decide where the cache goes, and if a user wants to explicitly pass their current directory as the location they can, or if they want to change the default locations used by tempfile they can do that.

Some decision has to be made about how relative paths will be handled. I don't like appending to os.getcwd() but perhaps appending to the .sciunit directory in the user's home directory? That is where the config is already stored I believe.

We could also have explicit cache purging or cache pruning operations (prune cache entries older than N days, for example, though this would require a parallel cache to store dates).

Copy link
Contributor Author

@lakesare lakesare Jul 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some decision has to be made about how relative paths will be handled.

Why, when a user passes their "use_disk_cache": "./relative/path"?

explicit cache purging & delete-cache-entries-every-n-days

Agreed about the cache purging function.

About delete-cache-entries-every-n-days -
tempfile already tries to store files in folders that tend to be cleaned occasionally.

Storing cache location in {home}/.sciunit/config.json

To explain my concerns over this loss of control over the cache location -

  • I'd prefer my cached sim results not to disappear (on reboot or in n hours, what if it took me 30 minutes to run the sim),
  • I'd like to keep an eye on the cache file size (and to be able to clean my cache without reading the docs),
  • I'd like my cache to be per-project and not per-user (what if I .deleteCache() and that deletes the cache of another project of mine too).

This is not the position I'd normally take with cache, this is the position I take knowing netpyne can casually take 10 minutes to run the sim 🤔

In either case, the user can still do the local cache file. If you confirm you certainly prefer the {home}/.sciunit/config.json & tempfile default - let's go this way!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I prefer that, and it will persist until the user deletes. It please do that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To set the record straight - we don't want tempfile to handle the cache location anymore (and thus we don't need to store any paths in $HOME/.sciunit/config.json), we want a persistent cache in $HOME/.sciunit/cache.

Yes, let’s do the .sciunit location (something like $HOME/.sciunit/cache) for disk_cache=True

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose we could still allow it with some special path location argument (e.g. "temp") but I'm not sure what the use case would be for this, so I think you can get rid of it.

sciunit/models/backends.py Outdated Show resolved Hide resolved
if isinstance(from_kwargs, Path) or isinstance(from_kwargs, str):
location = str(from_kwargs)
else:
location = Path(tempfile.mkdtemp()) / "cache"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It sounds appealing, but most applications hide their cache somewhere that the user is unlikely to mess with it. Maybe this is undesirable for something like sciunit where the modal user is also a developer. But I could still imagine that cache contents get accidentally git committed, for example.

Also, the user might execute their script from different locations (cwd could be a directory full of notebooks one time and the project root another time) and then we end up with unnecessary computation and two caches!

So I'd prefer to have tempfile decide where the cache goes, and if a user wants to explicitly pass their current directory as the location they can, or if they want to change the default locations used by tempfile they can do that.

Some decision has to be made about how relative paths will be handled. I don't like appending to os.getcwd() but perhaps appending to the .sciunit directory in the user's home directory? That is where the config is already stored I believe.

We could also have explicit cache purging or cache pruning operations (prune cache entries older than N days, for example, though this would require a parallel cache to store dates).

sciunit/models/backends.py Outdated Show resolved Hide resolved
Copy link
Contributor

@rgerkin rgerkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See slack and below.

@lakesare lakesare marked this pull request as ready for review July 20, 2021 07:42
@lakesare lakesare requested a review from rgerkin July 20, 2021 07:48
Copy link
Contributor

@rgerkin rgerkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few changes suggested

myModel = Model()
backend = Backend()
backend.model = myModel
backend.init_backend(use_disk_cache="/some/good/path", use_memory_cache=False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this fake path possibly work? Can you instead use the tempfile module to just explicitly create an real temporary path? I assume that init_backend doesn't actually try to write to this location (so it doesn't matter if it is real or not), but in that case it isn't much of a test of whether a manual location can work. You should instead have each of these cases actually try writing data to that cache, checking if it is there, etc.

'''A class for creating and deleting a folder in "./unit_test/".
'''

path = Path(__file__).parent / "unit_test" / "delete_after_tests"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are moving to doing everything inside HOME/.sciunit, so maybe the temp test folder should be there as well, i.e. HOME/.sciunit/unit_test. A user could theoretically be running code from a folder in which they don't have write access, for example.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A user could theoretically be running code from a folder in which they don't have write access, for example.

And by the user you mean the sciunit developer who runs the tests? (we only use this util for tests)

Would you prefer your temporary test files to be in HOME/.sciunit, somewhere outside of your git tracker?
If something bad happens and test files don't get deleted I'd certainly like to see them as my unstaged files.
If something doesn't work I'd want to uncomment the line that deletes the file, and check the file myself (without going to my HOME folder).
Do these concerns sound unrelatable?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there are probably arguments both ways so I think it is fine as you have it.

@rgerkin rgerkin merged commit 42f7aa1 into scidash:master Jul 30, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants