Parallel downloads and youtube playlist regex fix #129

Closed
wants to merge 7 commits

Conversation

7 participants
@ghost

ghost commented Jul 9, 2011

Hi,

I have added a feature to download multiple videos at the same time, which will help download a playlist faster. Please review the diff and merge. Comments/suggestions are welcome.

Ravi added some commits Jul 9, 2011

Ravi
Added parallel mode to download multiple videos concurrently
Changed youtube playlist regex to match the new format of playlist URLs
Ravi
Add playlist feature
Refactor threading code and change how shutdown is handled
@ghost

ghost commented Jul 11, 2011

Also added a new feature to save the playlist to an m3u file.

Collaborator

phihag commented Jul 11, 2011

join is interruptible with a non-None timeout argument.

Also, try not to write Java programs in Python: Prefer to iterate over values instead of indexes. I'd propose

while True:
    for t in threads:
        t.join(2**32)
    if all(not t.isAlive() for t in threads):
        break

Also, why is the exit signal for the thread loop a dictionary? Couldn't it be just None or _EXIT_LOOP, previously defined to be "exit_loop"?
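phihag's sentinel suggestion can be sketched like this in modern Python 3 (the `_EXIT_LOOP` name comes from his comment; `worker` and the doubling stand-in are illustrative, not from the patch):

```python
import queue
import threading

_EXIT_LOOP = object()  # unique sentinel; cannot collide with any real work item

def worker(q, results):
    while True:
        item = q.get()
        if item is _EXIT_LOOP:
            break  # clean termination, no dictionary comparison needed
        results.append(item * 2)  # stand-in for the real download work

q = queue.Queue()
results = []
t = threading.Thread(target=worker, args=(q, results))
t.start()
for n in (1, 2, 3):
    q.put(n)
q.put(_EXIT_LOOP)
t.join()
```

An `object()` sentinel is safer than `None` when `None` could plausibly appear as a legitimate queue item.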

youtube-dl
@@ -2913,6 +2966,10 @@ if __name__ == '__main__':
facebook_ie = FacebookIE()
generic_ie = GenericIE()
+ playlistfile=None
+ if ( opts.saveplaylist):
+ playlistfile=open("playlist.m3u","w")
@phihag

phihag Jul 11, 2011

Collaborator

Having handles in the options complicates things. Is that really necessary? And won't this overwrite playlist.m3u even if there is no playlist?

@ghost

ghost Jul 11, 2011

I thought this one was a better design compared to the one I used for the threads. Another option is to make the file handle a global variable, just like I did with the threads. But I think global variables are bad design, so I am using this, and maybe I'll move the queue object into options the same way. Tell me which one to use in both cases and I'll use that.

youtube-dl
+ parser.add_option('-P','--parallel',
+ type="int",dest='parallel',help='Number of parallel downloads',default=0)
+ parser.add_option('-s', '--save-playlist',
+ action='store_true', dest='saveplaylist', help='Save file list to a playlist file')
@phihag

phihag Jul 11, 2011

Collaborator

If we make this a store argument instead of a store_true, we get to set the name of the playlist (a useful feature).

@ghost

ghost Jul 11, 2011

I didn't want the user to worry about typing a lot of stuff and wanted to keep it simple, so I went with this option. And yes, this will overwrite the playlist file every time it is run; my thinking was that runs would happen in different directories. The best thing might have been to extract the playlist ID and generate the file name from that, but there is no standard way to extract a playlist ID from the playlist extractors. So I'll go with the playlist file name option here.

youtube-dl
+ """
+
+ def __init__(self):
+ threading.Thread.__init__(self)
@phihag

phihag Jul 11, 2011

Collaborator

This does literally nothing but hide constructor options. Maybe we can convert the entire Thread into just a single function, and call it like

t = threading.Thread(target=queueConsumer)
t.daemon = True  # the daemon= keyword argument only exists from Python 3.3 on
t.start()
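Sketched out in modern Python 3, the subclass-free version phihag proposes might look like this (the queue contents, the `queue_consumer` name, and the two-thread pool are illustrative):

```python
import queue
import threading

def queue_consumer(q, downloaded):
    # a plain function replaces the Thread subclass; no constructor override needed
    while True:
        info = q.get()
        if info is None:  # end-of-queue marker
            break
        downloaded.append(info["filename"])  # stand-in for the real download

q = queue.Queue()
downloaded = []
threads = [threading.Thread(target=queue_consumer, args=(q, downloaded))
           for _ in range(2)]
for t in threads:
    t.daemon = True  # Python 2 spelling; Python >= 3.3 also accepts daemon=True
    t.start()
for name in ("a.mp4", "b.mp4"):
    q.put({"filename": name})
for _ in threads:
    q.put(None)  # exactly one marker per worker thread
for t in threads:
    t.join()
```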
Collaborator

phihag commented Jul 11, 2011

One more thing: can the code be converted to PEP 8, or at least PEP-8-with-tabs (the current state)? For example, no space after `if (`, and spaces before and after every assignment. (Sorry if I'm being overly pedantic, but I'm wary of code smell and broken-window syndrome.)

@ghost

ghost commented Jul 13, 2011

Made all the changes and removed unused code.

@ghost ghost closed this Jul 13, 2011

@ghost ghost reopened this Jul 22, 2011

@ghost

ghost commented Jul 22, 2011

Hey,

I closed this request by accident. Any update on this? Need any more changes to the code?

Collaborator

phihag commented Jul 22, 2011

There are still a lot of possible bugs. For example, the automatic numbering of downloads could subtly fail, and whether it does may depend on the Python interpreter (i.e. GIL or not). Also, the lack of locking implies that file I/O and status output are thread-safe, which may not be the case. In fact, you're explicitly calling flush after writing to the playlist file, which indicates you already ran into race conditions.

Don't get me wrong, I'd love multithreading (currently, I have many instances of youtube-dl running in parallel), but we should really separate interface and multithreaded actual downloading code.

@ghost

ghost commented Jul 22, 2011

For example, the automatic numbering of downloads could subtly fail, and whether it does may depend on the Python interpreter (i.e. GIL or not).

It shouldn't. Generating numbers, filenames etc. is still done in a single thread. Everything except the actual download (do_real_download, which doesn't depend on anything) is the same as before. Only the download itself is passed to the downloader thread pool and on to do_real_download.

Also, the lack of locking implies that file I/O and status output are thread-safe, which may not be the case.

Each thread has separate, independent file handles, so nothing bad should happen. And stdout is thread-safe.

In fact, you're explicitly calling flush after writing to the playlist file, which indicates you already ran into race conditions.

Again, this runs from a single thread (do_download), so it is not a problem. The reason I put flush in there is that the time between getting each file name could be large, and I wanted the user to see the filename as soon as it was available.

Collaborator

phihag commented Jul 22, 2011

Oh, you're totally right, the multithreading issues are indeed non-existent, except for stdout, which afaik is thread-safe on some platforms and not on others. (Plus, a combined output like `4/42 files downloaded (13% overall)` would be nice, but is in no way required.)

Two more nitpicks:

  • At line 3047, there should be a while True and if all(not t.is_alive() for t in threads): break for academic correctness. youtube-dl is unlikely to run for 2**32 seconds, but time can fly by when your timing source is broken.
  • Please don't add .m3u if it's not present. Also, passing a handle in the option dictionary conflicts with my intended changes of JSON option storage and lib-ification, so I'll cherry-pick the parallelization for now.
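The combined status line phihag mentions needs only a lock-guarded counter; a minimal Python 3 sketch (the `Progress` class and the count-based percentage are illustrative, not from the patch):

```python
import threading

class Progress:
    """Thread-safe completed/total counter for a combined status line."""

    def __init__(self, total):
        self.total = total
        self.done = 0
        self._lock = threading.Lock()

    def tick(self):
        # increment and format under the lock so concurrent workers
        # never produce a torn or duplicated status line
        with self._lock:
            self.done += 1
            pct = 100 * self.done // self.total
            return "%d/%d files downloaded (%d%% overall)" % (
                self.done, self.total, pct)

progress = Progress(42)
lines = [progress.tick() for _ in range(4)]
```

Each worker would call `tick()` once per finished download and print the returned line.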
youtube-dl
@@ -47,6 +49,8 @@ std_headers = {
simple_title_chars = string.ascii_letters.decode('ascii') + string.digits.decode('ascii')
+downloadqueue = Queue.Queue()
@phihag

phihag Jul 22, 2011

Collaborator

Tried to merge, some issues coming up:
This should definitely not be a global variable, but a reference you pass around to FileDownloaders. Otherwise, any tests will be extremely flaky.
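Passing the queue into FileDownloader instead of reading a module-level global could look roughly like this (a heavily simplified stand-in for the real class, which takes an options dict and much more):

```python
import queue

class FileDownloader:
    # simplified stand-in: the real class takes params and information extractors
    def __init__(self, download_queue):
        self._queue = download_queue  # injected, so tests can supply their own

    def enqueue(self, info):
        # hand a work item to the downloader thread pool
        self._queue.put(info)

q = queue.Queue()
fd = FileDownloader(q)
fd.enqueue({"filename": "video.mp4"})
```

With the queue injected, each test can construct an isolated `FileDownloader` and inspect exactly what was enqueued, which is what phihag means by avoiding flaky tests.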

@ghost

ghost Jul 23, 2011

Got rid of the global variable and made it a static variable. Try now.

youtube-dl
@@ -602,6 +606,10 @@ class FileDownloader(object):
# Extract information from URL and process it
ie.extract(url)
+ #parallel downloader needs dummy at the end to signal end of queue
@phihag

phihag Jul 22, 2011

Collaborator

Couldn't see that in the diff, but these lines are called in a loop over information extractors, which is almost certainly wrong. Why can't you stop the threads where you create and wait for them, i.e. in the main?

@ghost

ghost Jul 22, 2011

Couldn't see that in the diff, but these lines are called in a loop over
information extractors, which is almost certainly wrong. Why can't you stop
the threads where you create and wait for them, i.e. in the main?

The threads should know when to stop. This can't be when the queue is empty, because the queue is empty right after I create it; I add the items later. So when the queue is empty, I wait on it until some items are available, which means I need a special marker for the end of the queue.

@phihag

phihag Jul 22, 2011

Collaborator

The threads should know when to stop. This can't be when the queue is empty, because the queue is empty right after I create it; I add the items later. So when the queue is empty, I wait on it until some items are available, which means I need a special marker for the end of the queue.

I'm 100% with you there. But what happens if I use --parallel 2 and 3 URLs?
As far as my limited understanding of the code goes, this code would insert:

  • url1
  • None, None
  • url2
  • None, None
  • url3
  • None, None

Assuming I'm not completely missing the point (which could very well be the case), these five lines should be moved out of the loop and into main. Also, what happens if I call download twice?

@ghost

ghost Jul 22, 2011

  • url1
  • None, None
  • url2
  • None, None
  • url3
  • None, None

So if I give -P 3, this creates 3 threads in the thread pool, each waiting on the queue for URLs. Now, assuming the info extractor added 10 URLs, downloadqueue will be:
[url1, url2, ..., url10, None, None, None]

After some thread completes the 10th URL, it will read a None and exit, and the remaining two will read the remaining Nones and exit respectively. In main I don't have the list of URLs yet, so I cannot add the markers at the end of the queue; ie.extract calls the downloader and will already have added the URLs to the queue.

Also got rid of the global queue and the handle in options. Will commit the code when I reach home later.

Ravi Tja

@phihag

phihag Jul 22, 2011

Collaborator

So if I give -P 3, this creates 3 threads in the thread pool, each waiting on the queue for URLs. Now, assuming the info extractor added 10 URLs, downloadqueue will be:
[url1, url2, ..., url10, None, None, None]

Could you elaborate on why that's the case, i.e. why this is the right place to terminate the threads? For me, the loop looks like this (assuming the first IE always matches):

for url in [url1, url2, url3, ...]:
    queue.put(realurl(url))
    for i in range(3):
        queue.put(None)

so the resulting queue should be

[url1, None, None, None, url2, None, None, None, url3, ...]
@ghost

ghost Jul 22, 2011

I didn't notice the for-url loop above; I was thinking ie.extract would add
all the URLs. I have to move it to after the for-url loop, just before the return.
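The fix ghost describes (enqueue every URL first, then append one end marker per worker after the loop) can be sketched as:

```python
import queue

NUM_WORKERS = 3  # illustrative; this would come from --parallel
q = queue.Queue()

urls = ["url1", "url2", "url3", "url4"]
for url in urls:
    q.put(url)            # every URL goes in first...
for _ in range(NUM_WORKERS):
    q.put(None)           # ...then exactly one end marker per worker

# drain the queue (single-threaded here) to show its final ordering
contents = []
while not q.empty():
    contents.append(q.get())
```

This keeps the markers strictly after all work items, so no worker can exit while URLs remain, regardless of how many URLs there are.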

youtube-dl
+ """
+ while True:
+ d = downloadqueue.get()
+ if (d['filename'] == None):
@phihag

phihag Jul 22, 2011

Collaborator

Personally, I'd prefer if d['filename'] is None: here. This looks so Java-y

vStone commented Jul 27, 2011

When a URL in a threaded/parallel download fails, youtube-dl should wait for the downloads that are still running to finish. Currently, it just drops out, leaving a lot of unfinished downloads.
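One hedged way to address this is to catch per-item failures inside the worker, so one bad URL doesn't take down the pool; the `download` stub and the result lists here are illustrative:

```python
import queue
import threading

def download(url):
    # stand-in for the real download; fails for one URL on purpose
    if url == "bad":
        raise IOError("download failed: %s" % url)
    return url

def worker(q, ok, failed):
    while True:
        url = q.get()
        if url is None:  # end-of-queue marker
            break
        try:
            ok.append(download(url))
        except Exception:
            failed.append(url)  # record the failure and keep consuming

q = queue.Queue()
ok, failed = [], []
for u in ("good1", "bad", "good2"):
    q.put(u)
q.put(None)
t = threading.Thread(target=worker, args=(q, ok, failed))
t.start()
t.join()
```

Failures can then be reported once at the end, after all sibling downloads have finished.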

Contributor

rbrito commented Sep 27, 2011

Well, while not exactly a solution to parallel downloads as proposed here, I just implemented a prototype of a way to use the aria2c downloader (which is really awesome) in a branch of mine.

Please, check out https://github.com/rbrito/youtube-dl/tree/use-other-downloaders and let me know what you think.

It is modelled after the rtmpdump download path and is really preliminary, but I think this way is better than bloating youtube-dl, since we can hand an external, well-made downloader ("do one thing and do it well") the list of URLs to grab.

Of course, this is not a self-contained solution usable by people on some operating systems (e.g., Windows), but...

Regards.
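For reference, handing a URL list to aria2c is a small subprocess call; a sketch (the flags `-x` and `-d` are standard aria2c options, and the command is only built, not executed, here):

```python
import subprocess  # needed when actually running the built command

def aria2c_command(urls, out_dir=".", connections=4):
    # -x sets max connections per server; -d sets the download directory
    return ["aria2c", "-x", str(connections), "-d", out_dir] + list(urls)

cmd = aria2c_command(["https://example.com/v.mp4"], out_dir="dl")
# to actually run it: subprocess.check_call(cmd)
```

This is the "delegate to an external downloader" shape rbrito describes, not his actual branch code.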

@phihag phihag referenced this pull request Dec 23, 2011

Closed

More than one download #259

tukkek commented Nov 9, 2013

I didn't comment on this when I saw it earlier today, but since I'm now on an issue streak, here it goes: it would be great to have more downloading options.

From what I understand, the multi-threading here was not accepted due to the two nitpicks? They seem pretty minor to me; if the patch works, why not commit it and solve these issues later?

Having said that, I think @rbrito's is the best solution conceptually, since delegating downloads takes the burden off youtube-dl of having to become a more complex download manager, plus it gives users the ability to choose their personally preferred or better-suited downloader. I haven't seen his actual code, though.

A great idea would be a replacement string like the one in -o that would call the external downloader and wait for it to finish each time a file has to be downloaded. This would be useful because just using the URLs given by -g doesn't offer an easy way to format filenames like -o does.

Some other similar open reports: #182 #350 #527

Contributor

TRox1972 commented Jun 12, 2016

Should this be closed? The file in question no longer exists.

Collaborator

yan12125 commented Jun 13, 2016

The parallel part is still useful, and I guess it's not difficult to rebase.

@TRox1972 TRox1972 referenced this pull request Aug 2, 2016

Closed

multiple videos at same time #10214

0 of 4 tasks complete

Any updates on this?

Collaborator

yan12125 commented Jun 11, 2017

The original author ("Ravi") of this PR has left GitHub, so I'm going to close this. Feel free to open a new PR with an implementation written from scratch or based on Ravi's work.

@yan12125 yan12125 closed this Jun 11, 2017
