
YouTube pagination limit and metadata churn? #22650

Closed
amcgregor opened this issue Oct 9, 2019 · 4 comments

Comments

@amcgregor commented Oct 9, 2019

Checklist

  • I'm asking a question
  • I've looked through the README and FAQ for similar questions
  • I've searched the bugtracker for similar questions including closed ones
  • I've searched the source code for possible clues (there aren't many exit-early conditions in those loops…)
  • I've searched the internet (and Stack Overflow, Reddit, etc.) for similar questions

Question

Is it possible to limit the scope of the YouTube backend's paged search for new videos?

The sheer number of paged requests is substantial compared to the number of new videos discovered on each run (1–4), which are always present on the first page. I ask not because this causes an actual problem (though I do have rate-limit concerns), but because it's actually spending more time fetching pages than fetching video content.

If not, could this be added, as --max-pages or similar? When archiving a still-living YouTube channel, I'd like to keep churn to a minimum. (Why pull in 14 pages when 1 will do? ;)
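For reference, youtube-dl has no page-count flag today, but --playlist-start / --playlist-end (and --playlist-items) bound the number of playlist entries extracted, which may avoid fetching later pages since entries are sliced lazily. A sketch of a partial workaround, assuming new uploads always land on the first page (the 50 is illustrative, roughly one page's worth of entries):

```shell
# Hypothetical workaround: cap extraction at the first ~50 playlist entries,
# which for a channel's uploads list is roughly the first page or two.
youtube-dl --playlist-end 50 --download-archive _archive.ids \
    -o "%(playlist)s/%(upload_date)s--%(id)s--%(title)s--%(resolution)s.%(ext)s" \
    "$@"
```

This caps entries, not requests, so it isn't the --max-pages being asked for; how many pages one fetch covers still depends on YouTube's page size.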

Question

Is it normal to spend large amounts of time re-writing metadata, thumbnails, and subtitles on already-downloaded videos?

I've noted that all discovered videos are re-written on disk to re-apply metadata and subtitles, even when metadata and subtitles are already present. Orders of magnitude more time is spent rewriting already-tagged media than on paged and media fetching combined. You can see this for yourself by running the example invocation below over any channel or playlist with more than one page.

I'm re-testing with the --download-archive option to see whether it alters the rewriting behavior. (Maybe explicit tracking is needed, since it isn't detecting metadata presence?)

Example Invocation

I'm using the following invocation for the purpose of local archiving:

youtube-dl --no-call-home --ignore-errors --restrict-filenames \
    --no-mark-watched --yes-playlist \
    --continue --no-overwrites \
    --write-description --write-info-json --write-thumbnail --write-sub \
    --add-metadata --embed-thumbnail --embed-subs \
    --merge-output-format mp4 --sub-format best --youtube-skip-dash-manifest \
    --format 137+140/bestvideo[ext=mp4]+bestaudio[ext=m4a] \
    -o "%(playlist)s/%(upload_date)s--%(id)s--%(title)s--%(resolution)s.%(ext)s" \
    "$@"

Example Log

[download] Downloading video 9 of 41
[youtube] PajD4X2wu50: Downloading webpage
WARNING: video doesn't have subtitles
[info] Video description is already present
[info] Video description metadata is already present
[youtube] PajD4X2wu50: Thumbnail is already present
[download] Uploads_from_acapellascience/20170623--PajD4X2wu50--LIVE_-_More_Than_Birds_ft._Singing_Chemist_Jason_Hawkins--1920x1080.mp4 has already been downloaded and merged
*** vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv ***
[ffmpeg] Adding metadata to 'Uploads_from_acapellascience/20170623--PajD4X2wu50--LIVE_-_More_Than_Birds_ft._Singing_Chemist_Jason_Hawkins--1920x1080.mp4'
[ffmpeg] There aren't any subtitles to embed
[atomicparsley] Adding thumbnail to "Uploads_from_acapellascience/20170623--PajD4X2wu50--LIVE_-_More_Than_Birds_ft._Singing_Chemist_Jason_Hawkins--1920x1080.mp4"
*** ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ***
[download] Downloading video 10 of 41
[youtube] f8FAJXPBdOg: Downloading webpage
[info] Video description is already present
[info] Writing video subtitles to: Uploads_from_acapellascience/20170609--f8FAJXPBdOg--The_Molecular_Shape_of_You_Ed_Sheeran_Parody_A_Capella_Science--1920x1080.en.vtt
[info] Video description metadata is already present
[youtube] f8FAJXPBdOg: Thumbnail is already present
[download] Uploads_from_acapellascience/20170609--f8FAJXPBdOg--The_Molecular_Shape_of_You_Ed_Sheeran_Parody_A_Capella_Science--1920x1080.mp4 has already been downloaded and merged
*** vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv ***
[ffmpeg] Adding metadata to 'Uploads_from_acapellascience/20170609--f8FAJXPBdOg--The_Molecular_Shape_of_You_Ed_Sheeran_Parody_A_Capella_Science--1920x1080.mp4'
[ffmpeg] Embedding subtitles in 'Uploads_from_acapellascience/20170609--f8FAJXPBdOg--The_Molecular_Shape_of_You_Ed_Sheeran_Parody_A_Capella_Science--1920x1080.mp4'
Deleting original file Uploads_from_acapellascience/20170609--f8FAJXPBdOg--The_Molecular_Shape_of_You_Ed_Sheeran_Parody_A_Capella_Science--1920x1080.en.vtt (pass -k to keep)
[atomicparsley] Adding thumbnail to "Uploads_from_acapellascience/20170609--f8FAJXPBdOg--The_Molecular_Shape_of_You_Ed_Sheeran_Parody_A_Capella_Science--1920x1080.mp4"
*** ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ***
@amcgregor amcgregor added the question label Oct 9, 2019
@amcgregor (Author) commented Oct 9, 2019

Adding --download-archive _archive.ids to the arglist seems to have corrected my primary "churn" issue of rewriting to apply metadata. Edited to add: this also changes the meaning of --max-downloads. Before, it would stop after churning through the first N already-downloaded videos; now it means "download N more videos than you already have".
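For anyone following along: the archive file itself is just plain text, one "extractor video_id" pair per line, so it can be pre-seeded or inspected by hand. A quick sketch using IDs from the log above:

```shell
# The --download-archive file format: one "extractor video_id" pair per line.
printf 'youtube PajD4X2wu50\nyoutube f8FAJXPBdOg\n' > _archive.ids

# youtube-dl skips any video whose pair already appears in this file.
grep -c '^youtube ' _archive.ids   # prints 2
```

Seeding the file with IDs you already hold avoids even the per-video webpage fetches for them.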

@dstftw dstftw closed this Oct 9, 2019
@dstftw dstftw added the duplicate label Oct 9, 2019
@amcgregor (Author) commented Oct 9, 2019

So… no answer as to the first part, regarding downloading unnecessary playlist pages?

In my archival case, the first page is truly the only one that needs to be requested. At 10–100 extra pages per channel (averaging 24), times 157 channels, that's a little shy of 4,000 extra HTTP requests on each pass, plus all the comparisons against the archive of IDs for videos guaranteed to be there.

It adds up, in both time and request limits.

@amcgregor (Author) commented Oct 9, 2019

Ah, #3794 from 2014, which does replicate the title of this request, has nothing to do with actually limiting the number of pages requested; it concerns a bug with a seeming upper bound on the total number of videos collected. (A "limitation" in "YouTube channel pagination", not a "channel pagination limit". ;)

@amcgregor (Author) commented Oct 9, 2019

  • Confused.
  • Saddened.
  • Mildly curious what @dstftw thinks this is actually a duplicate of.
  • Has yet to receive any form of feedback whatsoever.