Skip to content

Conversation

@pmrowla
Copy link
Contributor

@pmrowla pmrowla commented Apr 6, 2020

  • ❗ I have followed the Contributing to DVC checklist.

  • πŸ“– If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here. If the CLI API is changed, I have updated tab completion scripts.

  • ❌ I will check DeepSource, CodeClimate, and other sanity checks below. (We consider them recommendatory and don't expect everything to be addressed. Please fix things that actually improve code or fix bugs.)

Thank you for the contribution - we'll try to review it as soon as possible. πŸ™

Related to #2147.

New behavior

  • Checksums available on the remote are indexed (cached) locally in .dvc/tmp/index/<filename>.idx where filename will be the SHA256 digest of the remote URL.
  • On any dvc command which queries remote status (status -c/push/pull/fetch/gc -c) index will be updated to include any new checksums which are found on the remote (or are pushed successfully to the remote)
  • --drop-index can be used to delete the index for the remote (and force re-indexing)

Implementation details

  • remote.cache_exists()
    1. Validate our index by querying for .dir checksums on the remote. If a .dir checksum is in our index, but not the remote (i.e. someone else has run gc -c), clear the entire index.
    2. Look for checksums in the index - If all of the checksums being queried are already in the local index, the remote will not be queried at all.
    3. Any checksums which are not in the local index will be queried on the remote as usual, and the local index will be updated to include any new remote cache entries. (So doing a full list/traverse of the remote will is the same as (re)indexing the entire remote)
  • pull/fetch: If any index related mismatch/error occurs (i.e. trying to download a file that is in our index but not on the remote), clear the entire index.
  • gc -c: Queries and re-indexes the entire remote, regardless of what was in the index before running gc -c.
  • Index file will be read once at the start of relevant dvc commands, and written to once at most (only if the index was modified during the dvc command)
  • RemoteIndex class/module can be easily extended/modified in the future to support any other protocols which support the usual dump()/load() interface (pickle, json, msgpack, etc)

Todo

  • Figure out an actual storage solution (currently written in flat binary file format)
  • add --drop-index CLI option to push/pull/status so user can force re-indexing
  • Use .dir checksum existence on remote to validate/invalidate index
  • push: push .dir checksums last
  • gc: remove .dir checksums first
  • unit tests for index module
  • functional tests for new behavior
  • docs

Other ideas:

  • Only index .dir checksums to keep local size to a minimum? This was discussed in perf: remote indexesΒ #2147, if we modify push behavior for .dir checksums to push the directory file contents first, and the actual .dir checksum last, we can treat the dir checksum as an index on its own - if the .dir checksum is on exists on the remote, it follows that the directory contents is also on the remote
  • Index file size as well as checksums? (also related to Add summary to dvc status -c [target] Β #3568)

@pmrowla
Copy link
Contributor Author

pmrowla commented Apr 6, 2020

Regarding the actual index storage format, I looked into using bcolz which was suggested in the remote indexes discussion issue, but it's only available as a source distribution when installing w/pip (no binary wheels for any platform), and trying to build fails on my machine (osx catalina). There are conda binaries available though.

@shcheklein
Copy link
Contributor

One question: what happens if someone else runs dvc gc? do we check for this somehow. (I remember we had some ideas about directories at least - using .dir file as a flag).

@pmrowla
Copy link
Contributor Author

pmrowla commented Apr 7, 2020

One question: what happens if someone else runs dvc gc? do we check for this somehow. (I remember we had some ideas about directories at least - using .dir file as a flag).

Right now we will just clear the entire local index the first time we try to pull something from the remote that isn't actually there. Taking advantage of the .dir files in some way is planned though

@shcheklein
Copy link
Contributor

So, what about push (and e.g. status) in this case? Will it skip files that it thinks (according to the local index cache) that already exists remotely?

@pmrowla
Copy link
Contributor Author

pmrowla commented Apr 7, 2020

As is right now, yes it will skip pushing files that are in the local index (and we would not see that someone else's gc -c operation had removed them from the remote).

One solution to this would be to at least check for .dir files on the remote no matter what. But in the event that there are enough .dir checksums to in our query to make us do the full list/traverse of the remote, we wouldn't actually be improving anything in terms of performance.

@pmrowla
Copy link
Contributor Author

pmrowla commented Apr 7, 2020

One thought I had was that if .dir checksums were stored separately from file checksums in our cache (so that .dir's were under their own prefix), we would be able to fetch them without having to potentially query the entire remote cache, but that would be a pretty significant design change

@pmrowla
Copy link
Contributor Author

pmrowla commented Apr 7, 2020

Discussed this with @efiop , potentially breaking backwards-compatibility and reorganizing our cache structure is something for future discussions

In the meantime we can add a --drop-index (or similarly named) CLI option to allow users to force re-indexing. So in the event that the user knows some other team member has run gc -c, the user can run status -c/push/pull --drop-index and force us to re-index the remote cache.

@shcheklein thoughts?

@shcheklein
Copy link
Contributor

It feels to me that for directories it's totally fine to use exists? It's quite unusual edge-case to have 1M directories for example. It also means 1M DVC-files?

I would not worry about performance of checking existences of .dir files.

I would worry for us not saving some data.

@shcheklein
Copy link
Contributor

In the meantime we can add a --drop-index (or similarly named) CLI option to allow users to force re-indexing. So in the event that the user knows some other team member has run gc -c, the user can run status -c/push/pull --drop-index and force us to re-index the remote cache.

It all feels vary too risky to me. We are playing with data here and need to be conservative.

@pmrowla
Copy link
Contributor Author

pmrowla commented Apr 7, 2020

CI expected to fail pending merge of #3604

pmrowla added 2 commits April 7, 2020 16:53
- use pytest fixtures
- use sha256 digest
- s/fd/fobj/
- check more specific IO errors
- `used_cache()`/`get_used_cache()` in repo/stage/output now return tuples
  of (dir_cache, file_cache) for directories rather than one flat/merged
  cache
- if local index contains directory checksums, they will always be
  checked on the remote. If an expected .dir checksum is missing, the
  local index will be invalidated/cleared
@efiop efiop added p1-important Important, aka current backlog of things to do performance improvement over resource / time consuming tasks labels Apr 7, 2020
try:
with open(self.path, "rb") as fobj:
self._checksums = pickle.load(fobj)
except PermissionError:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we catch only PermissionErrors? not IO in general?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there was an earlier review comment to catch more specific exceptions, I'll fix this to also catch general IOError as well

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should at least show some message (let's not use WARNING though, please - they look ugly) in some cases? To avoid situation when some logic is broken or not enough permissions (can it happen at all?) and we silently degrade performance significantly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the only time IO errors would actually happen here is in the event we can't read/write to the .dvc/tmp directory at all, whether it's due to permissions, lack of disk space, or some other reason (which I think would cause errors elsewhere in dvc).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated this so that on an IO error we explicitly state that the remote will be re-indexed, and the message includes the original exception/error message.

It's still logged as an error though, should this be an info message since it's not fatal?

pmrowla added 2 commits April 9, 2020 14:14
- only write changes if index was actually modified since last save/load
"--drop-index",
action="store_true",
default=False,
help="Drop local index for the specified remote cache.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here - let's not use cache .. @jorgeorpinel could please review messages, btw?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. We prefer "remote storage" or just "remote" as of now.

"--drop-index",
action="store_true",
default=False,
help="Drop local index for the specified remote cache.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.... and here

- format contains simple header w/checksum counts
- checksums are packed as 4-bytes and then compressed w/gzip
@pmrowla
Copy link
Contributor Author

pmrowla commented Apr 9, 2020

Current flat file format:

  • Basic flat file (with a header) where MD5 checksums are stored as 16-byte binary arrays.
  • .dir suffix is not stored in the index file itself for .dir checksums
    • Suffix will be stripped on writing the index file, and appended on reading the index file
    • The # of dir and file checksums is included in the index file header, and the first dir_checksum_count checksums in the file are .dir checksums, and all remaining checksums are regular file checksums.
# (all integer types are little-endian)
#
# Header
# ------
#   protocol_version (32-bit uint): index file protocol version
#   dir_checksum_count (64-bit uint): number of .dir checksums in data section
#   file_checksum_count (64-bit uint): number of file checksums in data section
#   crc - 32-bit CRC32 checksum of entire data section
#
# Data section
# ----------------------
#   array of <dir_checksum_count> 128-bit MD5 .dir checksums
#   array of <file_checksum_count> 128-bit MD5 file checksums

For reading and writing an index with 1M checksums, python timeit reports:

dump(): 0.27s    # write
load(): 0.60s    # read

Note regarding the read time: index is stored in memory as python sets, load() time includes the overhead from reading checksums into sets. When reading/appending into a flat list, load time is under .3s.

@skshetry skshetry self-requested a review April 9, 2020 11:27
Comment on lines +163 to +167
pull_parser.add_argument(
"--drop-index",
action="store_true",
default=False,
help="Drop local index for the specified remote.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about "specified remote" since sync commands don't always receive a -r arg. Maybe just remove the word "specified".

Also, "drop index" may not mean anything to a casual user, should this be a little more descriptive?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same in the other 2 sync commands.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about --clear-index or --reset-index?

jobs=None,
remote=None,
show_checksums=False,
drop_index=False,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add drop_index to docstring Args?
Same in other functions/methods.

Copy link
Contributor

@jorgeorpinel jorgeorpinel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more comments, sorry for mixed review style.

Comment on lines +121 to +125
# We can safely save here, as existing corrupted files will
# be removed upon status, while files corrupted during
# download will not be moved from tmp_file
# (see `RemoteBASE.download()`)
self.repo.state.save(cache_file, checksum)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment getting long πŸ˜‹



def dump(dir_checksums, file_checksums, fobj, protocol=None):
"""Write specified checksums to the open file object ``fobj``."""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: what are double back quotes signaling here? Just curious

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Old habit from writing sphinx/rst docstrings, I'll clean them up

protocol = DEFAULT_PROTOCOL
if protocol not in SUPPORTED_PROTOCOLS:
raise DvcException(
"unsupported remote index protocol version: {}".format(protocol)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"unsupported remote index protocol version: {}".format(protocol)
"unsupported remote index protocol version: '{}'".format(protocol)

AFAIR we use ' around most dynamic output.
There's a few more instances of this.

Comment on lines +143 to +148
Args:
repo: repo for this remote index.
name: name for this index. If name is provided, this index will be
loaded from and saved to ``.dvc/tmp/index/{name}.idx``.
If name is not provided (i.e. for local remotes), this index will
be kept in memory but not saved to disk.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this go in a constructor docstring instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

afaik standard python conventions allow constructor args to be documented in either __init__ or in the class declaration like here. I put them here because that's how it was done in some other dvc classes (see dvc/cache.py::Cache)

@shcheklein
Copy link
Contributor

@pmrowla looks great, but a little bit complicated :)

could you update the description, please to get the up to date list of things that we do with this PR?

pmrowla added 2 commits April 10, 2020 13:29
- there are some places in dvc where we already have separated dir/file
  checksums and some places that we do not, use update()/replace() where
  we already have them separated, and update_all()/replace_all() when we
  have to do the filtering ourself in RemoteIndex
@pmrowla
Copy link
Contributor Author

pmrowla commented Apr 10, 2020

@pmrowla looks great, but a little bit complicated :)

could you update the description, please to get the up to date list of things that we do with this PR?

@shcheklein it's been updated.

I removed the compression stuff from the index file format since it's not needed, so the index file read/write code is simplified now.

Comment on lines 768 to 769
if removed:
self.index.invalidate()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about this more, this is unnecessary.

gc -c will always fetch the full remote listing via all() (regardless of what is in our index). So we can treat gc -c as a full re-indexing of the remote. After gc -c our index should contain the full remote listing minus the unused checksums which were removed.

@pmrowla
Copy link
Contributor Author

pmrowla commented Apr 10, 2020

After discussion with @efiop, we decided that a flat format (whether binary or json/msgpack/etc) won't be ideal for dealing with large file/checksum counts. And we will also ideally want random access support for cases like running push/pull with a small number of files.

A portable database format like sqlite may work for our needs. Will do some testing to see what kind of runtime performance and index file size we would get with sqlite.

repo.cache.local.cache_dir = self.cache.local.cache_dir
with repo.state:
cache = NamedCache()
used_cache = []
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, not sure I understand why this is needed.

@pmrowla
Copy link
Contributor Author

pmrowla commented Apr 13, 2020

This is going to be split into multiple PRs, the .dir checksum push/pull/gc behavior changes will come first

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

p1-important Important, aka current backlog of things to do performance improvement over resource / time consuming tasks

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants