
Added the ability to disable binary validation. #62

Closed
pelson wants to merge 1 commit into master from no-validation

Conversation

@pelson commented Jan 9, 2018

Thanks for the awesome tool!

I'm doing some work in line with #60, and I'm really not that fussed about validating the binaries that are mirrored. I'm happy to let users of the mirrored channel identify issues if the binaries aren't as expected (conda will inform them).

The result is a huge speedup when re-running a sync (now almost instantaneous, even for large repos).

@codecov bot commented Jan 9, 2018

Codecov Report

Merging #62 into master will increase coverage by 0.02%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master      #62      +/-   ##
==========================================
+ Coverage   96.88%   96.91%   +0.02%     
==========================================
  Files           2        2              
  Lines         257      259       +2     
==========================================
+ Hits          249      251       +2     
  Misses          8        8
Impacted Files                  Coverage   Δ
conda_mirror/conda_mirror.py    96.87%     <100%> (+0.02%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bd64ae1...5073ac5.

@ericdill (Contributor) commented Jan 9, 2018

Hi @pelson! Long time no see :)

Thanks for using my "awesome" tool :) It's nice that the community finds it useful.

This turned into a much longer response than I had intended so I'll put a tl;dr at the end.

long version

Thanks for the PR. I've got a few comments. Also, I have not looked at this code base in many months, so there's a possibility that I am wrong about the things I say here :)

  1. I am totally open to the idea of not doing md5 validation. I realize that it is a slow, often painful, process to do this for a large repository such as conda-forge or anaconda (or, I suspect, bioconda too).

  2. All package removals get laundered through the _validate_packages function, so if we skip that function entirely then the blacklist will not be respected when removing packages that the user no longer wants on their local mirror for whatever reason (for me it is usually licensing reasons). Whether or not this is the best design for handling the blacklist and whitelist is up for debate, but that's what we've got right now.

  3. _validate does three things: (1) makes a cursory attempt to ensure that the tar file is valid (this catches a surprising number of download bugs), (2) checks that the size of the file matches what is listed in the repodata file, and (3) computes the md5 hash of each local package and compares it with what is listed in the repodata.json file. In practice the md5 computation is the exceedingly slow part, and doing that every time you run conda-mirror (nightly, if you're like me) can be painful; the sketch after this list shows the three checks.
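
For concreteness, here is a minimal sketch of those three checks; validate_package and its return values are hypothetical illustrations, not the actual _validate implementation:

    import hashlib
    import os
    import tarfile

    def validate_package(pkg_path, repodata_entry):
        # (1) cursory tarball check -- catches truncated or corrupt downloads
        try:
            with tarfile.open(pkg_path, mode='r:bz2') as tar:
                tar.getmembers()
        except tarfile.TarError:
            return 'corrupt tarball'
        # (2) size check against the repodata.json entry
        if os.path.getsize(pkg_path) != repodata_entry['size']:
            return 'size mismatch'
        # (3) md5 check -- by far the slowest step on a large mirror
        md5 = hashlib.md5()
        with open(pkg_path, 'rb') as f:
            for chunk in iter(lambda: f.read(1 << 20), b''):
                md5.update(chunk)
        if md5.hexdigest() != repodata_entry['md5']:
            return 'md5 mismatch'
        return None  # all three checks passed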

If we skip validation entirely then conda-mirror will not be able to catch packages that have been updated upstream but have the same filename. FWICT this happens when Anaconda has updated the metadata of the conda package but did not want to "break" backwards compatibility by bumping the package build number (whether or not they should be doing that is a whole other matter...). And this happens with distressing regularity.

tl;dr

Practically speaking, I would suggest having the --no-validation flag skip only the md5 validation. I would also strongly urge that new downloads still be validated. Admittedly, this will make the first-time download slow, but it will catch issues before your users do. If this is the path you want to go down, then something like this will skip md5 validation:

    if not dry_run:
        if not validate:
            # Removing the md5 hashes from all packages will disable binary validation
            # by implicitly skipping the md5 check in ``_validate`` which takes a very very
            # long time. We still want to respect the blacklist if the user has one set and
            # we do want to make sure we are capturing packages that are
            # removed upstream or have their sizes changed
            for pkg_name, pkg_info in desired_repodata.items():
                del pkg_info['md5']

        # Only validate if we're not doing a dry-run (and we want validation)
        validation_results = _validate_packages(desired_repodata, local_directory, num_threads)
        summary['validating-existing'].update(validation_results)

(I'm like 95% sure that the above code will work. I have not tested it locally.)
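
For context, wiring up such a flag might look like the following argparse sketch; this is hypothetical, not the actual conda-mirror argument parsing:

    import argparse

    parser = argparse.ArgumentParser(description='mirror a conda channel')
    # ``validate`` defaults to True; passing --no-validation flips it off,
    # which (per the snippet above) skips only the md5 check
    parser.add_argument('--no-validation', dest='validate',
                        action='store_false', default=True,
                        help='skip md5 validation of already-mirrored packages')
    args = parser.parse_args()

One defensive tweak to the snippet above: pkg_info.pop('md5', None) instead of del pkg_info['md5'] would avoid a KeyError on any repodata entry that happens to lack an md5 field.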

Finally!

This is an open-source tool. I know there have been a few people asking for the ability to skip package validation (@Tigeraus and @pp-mo). If what the community wants is the ability to do none of the validation that is done in _validate, then this is almost the right approach, but we should adjust this block of code to remove the blacklisted packages so that the blacklist keeps functioning as advertised. If we do that, then this other block of code is probably a no-op.

@pelson (Author) commented Jan 10, 2018

Thanks for the detailed response @ericdill. Hope all is well with fatherhood 😄👶

I started by implementing md5-only validation, but it turned out that we pay a surprisingly high cost simply getting the index.json out of the .tar.bz2. (The benefit of 3.(1), catching download issues, is lost if you skip that, though.)
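
To illustrate where that cost comes from, here is a minimal sketch, assuming the standard conda package layout with the metadata stored at info/index.json. bz2 streams have no random access, so pulling out even one small member means decompressing the archive up to that member:

    import json
    import tarfile

    def read_index(pkg_path):
        # 'r:bz2' opens a streaming decompressor; extractfile() has to walk
        # (and decompress) the archive until it reaches info/index.json
        with tarfile.open(pkg_path, mode='r:bz2') as tar:
            member = tar.extractfile('info/index.json')
            return json.load(member)

Doing that for every package in a large mirror adds up quickly.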

I completely agree with wanting to validate on first download. I'll take a look at doing that. I wouldn't want to lose the md5 sum from the channel index, though. My implicit assumption is that conda will check it when it downloads the package from my mirrored channel.

Hold fire, I'll take a look at doing this today.

> FWICT this happens when Anaconda has updated the metadata of the conda package but did not want to "break" backwards compatibility by bumping the package build number (whether or not they should be doing that is a whole other matter...). And this happens with distressing regularity.

I thought they had stopped doing this many moons ago because of the overwhelmingly negative feedback they got from us about it. I share the distress if they are still doing it - it makes life a whole lot harder if you can't actually rely on the build number... Are you able to find out when you last saw this by any chance?

@ericdill (Contributor) commented

> Hope all is well with fatherhood

Fatherhood is amazing. Also exhausting. But mostly amazing 😁 . Thanks for asking!

> I completely agree with wanting to validate on first download. I'll take a look at doing that.

To keep validating on the first download, just back out the diff at lines 698-702.

> I wouldn't want to lose the md5 sum from the channel index, though

The local channel index is not affected by removing the md5s from the desired_repodata dict. The call to _validate_packages here is the last place the desired_repodata data structure is used. To create the local repodata.json (and the .bz2), we use the upstream repodata.json and then prune all packages that are not available locally, every time the mirror is run (code here). A sketch of that pruning is below.
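
A minimal sketch of that pruning step, with hypothetical names (the real code is at the link above):

    import json
    import os

    def prune_and_write_index(upstream_repodata, local_directory):
        # keep only the package entries whose files actually exist locally
        local_files = set(os.listdir(local_directory))
        pruned = {
            'info': upstream_repodata.get('info', {}),
            'packages': {fname: info
                         for fname, info in upstream_repodata['packages'].items()
                         if fname in local_files},
        }
        # the pruned dict becomes the local repodata.json; the .bz2 variant
        # is just a compressed copy of the same content
        with open(os.path.join(local_directory, 'repodata.json'), 'w') as f:
            json.dump(pruned, f, indent=2)

Because the md5s are taken from the upstream repodata.json at this stage, deleting them from desired_repodata earlier never touches the local index.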

> Hold fire, I'll take a look at doing this today.

Noted. Ping me again when you want more thoughts :)

> Are you able to find out when you last saw this by any chance?

I guess it has been a while. The last time I had to blacklist a package because it was continually failing its size check on download was 2017-05-24. So for all the packages we are mirroring, I have not seen this problem crop up again. My mistake for fearmongering 🙀.

@ericdill (Contributor) commented

@pelson did you want me to step in and push this across the finish line?

@pelson (Author) commented Jan 30, 2018

> @pelson did you want me to step in and push this across the finish line?

Honestly, yes please! 😄

To be completely transparent, though: our release engineering folks recently turned on conda channel proxying in Artifactory, and so far it has been a pretty good choice. It basically means we only end up copying the binaries we use, not the whole channel.

@parente (Contributor) commented Jan 30, 2019

I'm going to close this PR with the honest statement that I don't have the time to help revive it and complete it. If someone would like to get it working with the latest master with appropriate tests, I'll have a look and merge it.

FWIW, --no-validate-target was added to the code base after this PR was submitted, which does provide some runtime relief.
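
For anyone finding this later, a hypothetical invocation using that flag might look like the line below (the other options are from the README; check conda-mirror --help for the authoritative list):

    conda-mirror --upstream-channel conda-forge --target-directory local_mirror \
                 --platform linux-64 --no-validate-target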

@parente closed this Jan 30, 2019
@pelson deleted the no-validation branch February 1, 2019 09:30