processes bidding for the same partially downloaded file ... #485

Open
Albretch opened this issue Oct 22, 2012 · 10 comments

Comments

@Albretch commented Oct 22, 2012

~
at the risk of giving you enough reasons to start hating me ;-) ...
~
[use case]
~
the only way I have seen to speed up downloads for many media files is to run a number of processes at the same time. So I have a script that gets a lot of youtube URIs, which I then sort, split into smaller files and reorder (one process works through the list from a-z, the other from z-a ...) to make sure that if a feed is partially downloaded, odds are the other process will pick it up and finish the download
~
[/use case]
~
Now, if a process is downloading a file (so it temporarily suffixes it .part) and another process gets a chance to download the same file, it will 'think' (sorry for the anthropomorphizing) that it was left partially downloaded by some previous process, not that it is being downloaded by a current one
~
Short of checking with the OS to see whether any process is currently writing to that file, I think a delay of, say, 5 seconds without any increase in data would be a good bet that the file is not currently being worked on; otherwise you "say something" and keep going
~
thanks
lbrtchx

@phihag (Contributor) commented Oct 22, 2012

A five-second delay would be wrong, and we can't really check whether another process is writing to the file. What we can do is use cooperative locking on the .part file. As always, patches are welcome, but I see no reason against implementing this feature, probably even by default.
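
For what it's worth, a minimal sketch of what that cooperative locking could look like, assuming a POSIX system (Windows would need msvcrt.locking instead of fcntl); the helper name open_locked_part is hypothetical:

```python
# Hypothetical sketch, not youtube-dl's actual code. POSIX only.
import fcntl

def open_locked_part(part_filename):
    """Open the .part file for appending, or return None if another
    cooperating process already holds the lock on it."""
    f = open(part_filename, 'ab')
    try:
        # Non-blocking exclusive lock: raises immediately if another
        # process has already locked this file.
        fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except IOError:
        f.close()
        return None  # being downloaded elsewhere; skip it
    return f
```

The lock is advisory, so it only helps if every instance goes through the same check, which is exactly what "cooperative" means here.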

@Plaque-fcc commented Oct 22, 2012

Yeah, already hating! (%

Possibly there is a simple, obvious way to feed a single youtube-dl with all of them, then let it work out the number of files to download and fetch them in a specified number of threads.

+1 to «you may hate me, too». ;D

@Plaque-fcc commented Oct 22, 2012

Yes, a five-second delay is quite specific; it won't fit everyone's bandwidth, rate limits, etc. Exclusive locking is not a panacea either, since the download directory may sit on NFS/CIFS/WebDAV or some other filesystem with limited locking support.

Wouldn't it be possible to support multi-threaded downloads, the way cURL can? I mean everything you get from running several youtube-dl instances in parallel, but from a single instance.

@Tailszefox (Contributor) commented Oct 22, 2012

I think it could be a good idea to have a parameter that would tell how many threads you want to launch. If it's only to download one file, it would be split evenly, and if it's to download something like a playlist, it could be made to download multiple videos at once.

Of course that would imply checking whether the video's host allows it, working out how to report progress for multiple videos at once, and deciding whether a hard limit would be wise (so someone doesn't end up opening 100 connections to YouTube), that sort of thing, but it could be interesting.
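
To make the playlist half of that concrete, here's a rough sketch of a wrapper that runs a capped number of youtube-dl processes over a URL list; no such option exists in youtube-dl itself, and the cap of 4 is an arbitrary placeholder:

```python
# Illustrative wrapper only, not an existing youtube-dl feature.
import subprocess
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 4  # hypothetical hard limit on simultaneous downloads

def download(url):
    # One youtube-dl process per URL; the thread just waits on it.
    return subprocess.call(['youtube-dl', url])

with open('playlist_urls.txt') as f:
    urls = [line.strip() for line in f if line.strip()]

with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
    exit_codes = list(pool.map(download, urls))
```

Built-in support would still have to solve the progress-reporting and per-host-limit questions above.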

@Plaque-fcc commented Oct 22, 2012

I'm ready to test, think about and make proposals on this if there
is a branch for it; my bandwidth is low, but that makes it a test
case in itself. And of course I'd love to play with a multithreaded
youtube-dl; it's a good feature, but not as simple as it seems at
first glance.

@Albretch (Author) commented Oct 23, 2012

> A five-second delay would be wrong ...
~
Yeah! That was just brainstorming, so to speak
~
right now I see an easy way to deal with ".part" files left behind
by youtube-dl, most probably because of networking or OS problems
~
You could record the time at which the python process starts as the
first thing in youtube-dl's code; then, if the last mod time of the
".part" file is earlier than that start time, the process should
resume the download, otherwise it should not mind that file and keep
going. Users can always run
~
$ find <download_dir> -type f -iname "*.part" -exec ls -l '{}' \;
~
and check partially downloaded files.
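~
That heuristic fits in a few lines; a sketch, assuming youtube-dl records its own start time once at startup (START_TIME and should_resume are hypothetical names):

```python
# Hypothetical sketch: only resume a .part file that was last touched
# before this process started, i.e. one left behind by a dead run
# rather than one a live process is still writing to.
import os
import time

START_TIME = time.time()  # taken once, at the top of youtube-dl

def should_resume(part_filename):
    try:
        mtime = os.path.getmtime(part_filename)
    except OSError:
        return True  # no .part file at all: nothing to conflict with
    return mtime < START_TIME  # older than this run: safe to resume
```

A live downloader keeps bumping the file's mtime, so a recent mtime is a decent sign that someone else is still working on it.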
~
Strategies for how to download sets of files (say, from a playlist),
how many processes to run, and when to restart youtube-dl could be
handled through scripting and left to the user's discretion. Perhaps
some "best practices" scripts should be kept around
~
lbrtchx

@phihag (Contributor) commented Oct 23, 2012

Let's restrict this discussion to running multiple youtube-dl instances in parallel, ok? In-youtube-dl parallelism is a whole different idea. And I don't see any significant downside to simply locking the .part files.

@Tailszefox (Contributor) commented Oct 23, 2012

Looking around, it seems there's no easy and portable way to lock a file out of the box. I've seen multiple solutions but they all use OS-specific calls or need additional packages. Would that be okay to use or would it bloat the code too much?

I've also seen people use the approach of creating another file next to the .part one, something like (filename).lock. In that case the file wouldn't be truly locked, but youtube-dl could be made to check whether such a file exists, and if it does, assume the download is already in progress and skip it. It's the kind of system Firefox uses, for example: it creates a .parentlock file while a profile is in use so that another instance can't open it. Of course there's the issue of crashes, which risk leaving an orphaned .lock file behind, and the risk of bloating the filesystem with useless temporary files.

If anyone has another technique that is both portable and ships with standard Python, that would be perfect. If not, I think either making OS-specific calls or using the lockfile package would be the best bet, but I'd like to know whether that would be acceptable before trying to implement it.
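
For reference, the sidecar-file variant needs nothing beyond the standard library, because os.open with O_CREAT | O_EXCL creates the file atomically; a sketch, with try_lock/unlock as made-up names:

```python
# Portable sidecar-lock sketch using only the standard library.
# O_EXCL makes creation atomic: exactly one process wins the lock.
import os

def try_lock(part_filename):
    lock_name = part_filename + '.lock'
    try:
        fd = os.open(lock_name, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except OSError:
        return None  # lock already exists: another instance owns the file
    os.write(fd, str(os.getpid()).encode('ascii'))  # helps stale-lock cleanup
    os.close(fd)
    return lock_name

def unlock(lock_name):
    os.remove(lock_name)
```

Writing the PID into the lock file lets a later run detect an orphaned lock (the recorded process no longer exists) and clean it up, which softens the crash problem, though O_EXCL is famously unreliable on old NFS.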

@Albretch (Author) commented Oct 23, 2012

> ... youtube-dl could be made to check if such a file exists or not and if it does, it will assume it's already used and skip it.
~
I think checking for the existence of the file, combined with the
time the running process started, makes a good and simple pair of heuristics
~
lbrtchx

@yan12125 (Collaborator) commented Aug 13, 2016

I'd like to hijack this thread. Since #1562, there is a class locked_file in youtube_dl/utils.py. Implementing this should be much easier now.
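
If I read youtube_dl/utils.py right, wiring it in would look roughly like this; locked_file is real, but the surrounding resume logic is only a sketch, and the current implementation blocks on a contended lock rather than failing fast, so skipping busy files would still need a non-blocking variant:

```python
# Sketch only: take the cooperative lock before touching a .part file.
from youtube_dl.utils import locked_file

def resume_download(part_filename):
    # Mode must be 'r', 'a' or 'w'; non-read modes acquire an
    # exclusive lock in __enter__, so a second youtube-dl instance
    # would wait here instead of clobbering the file.
    with locked_file(part_filename, 'a', encoding='utf-8') as f:
        f.write(u'')  # placeholder for the actual downloader writes
```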
