Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make downloading more robust #22

Merged
merged 2 commits into from Sep 20, 2016
Merged

Conversation

iainlane
Copy link
Collaborator

Two things

  • We should catch more than just timeout errors
  • There's some thread unsafety when multiple threads try to download the same file - we need to synchronise this.

I'm not sure if the package index is actually accessed by multiple threads - is this bit needed?

Iain Lane added 2 commits September 19, 2016 17:03
Otherwise random things cause downloading to fail when it might work on
a retry.
If multiple threads try to download the same file, then one of them will
get an incomplete file since `downloadFile` will return early if the
dest exists, which is does while the download is in progress.
@ximion
Copy link
Owner

ximion commented Sep 20, 2016

The package-index is accessed from multiple threads, but usually when that happens it already downloaded all the stuff and has a cache of data. So it's not strictly needed.
Be aware that with more synchronized blocks, the performance will drop significantly, because this means that this part of the code will run exclusively, with no other tasks (not even trivial ones) being run in parallel.

@@ -112,7 +114,9 @@ public:
"Translation-%s.%s".format(lang, "%s"));

try {
fname = downloadIfNecessary (rootDir, tmpDir, fullPath);
synchronized (this) {
fname = downloadIfNecessary (rootDir, tmpDir, fullPath);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make this to only synchronize if we actually download anything? Because otherwise this will slow down things even if the file is already downloaded and present on disk.

Copy link
Collaborator Author

@iainlane iainlane Sep 20, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Meh, hardly; the code basically just returns the filename (after checking if it exists) if it's not remote or if it is downloaded already.

If you insist, then I have to look at the filename in debpkgindex.d which makes it more ugly (currently remote file handling is all taken care of in utility methods).

@iainlane
Copy link
Collaborator Author

iainlane commented Sep 20, 2016

On Tue, Sep 20, 2016 at 06:43:37AM -0700, Matthias Klumpp wrote:

The package-index is accessed from multiple threads, but usually when that happens it already downloaded all the stuff and has a cache of data. So it's not strictly needed.
Be aware that with more synchronized blocks, the performance will drop significantly, because this means that this part of the code will run exclusively, with no other tasks (not even trivial ones) being run in parallel.

(argh, you can't use markdown from email?!?!)

Only this synchronized block on this object will be locked out, which is exactly what is required. When downloading the file for a Package, you have to wait for it to be downloaded before you can do anything with it.

If it's so bad to have code which reduces to something like

synchronized (this) {
    if file.exists () {
        return file;
    } else { 
        return null;
    }
}

(I can't believe that the overhead is worth worrying about) in the already-downloaded case, then I could set debFname = path after downloadFile returns.

Iain Lane [ iain@orangesquash.org.uk ]
Debian Developer [ laney@debian.org ]
Ubuntu Developer [ laney@ubuntu.com ]

@ximion
Copy link
Owner

ximion commented Sep 20, 2016

Jap, I forgot that synchronized (this) will only lock the object itself.
This shouldn't be too noticeable.
https://tour.dlang.org/tour/en/multithreading/synchronization-sharing

One thing still exists that I would like to know: When does it happen that the package is accessed twice so that the same file is downloaded twice in parallel? That in itself shouldn't happen...

@iainlane
Copy link
Collaborator Author

I think I only ever saw it for arch:all packages. Can it happen for those in particular?

@ximion ximion merged commit 8b9d17a into ximion:master Sep 20, 2016
@ximion
Copy link
Owner

ximion commented Sep 20, 2016

Yes, this makes sense for arch:all, since those packages are in all architecture-specific files, and we process all of them in parallel - so there will be duplicates for arch:all. They get filtered out later, and we don't process them completely again, since we will detect that the metadata is the same.

But still, this makes me wonder if we should maybe filter the arch:all packages out and deal with them explicitly...
Might even be an optimization with quite some impact.

@iainlane iainlane deleted the more-robust-downloading branch September 20, 2016 16:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants