
Having youtube-dl Grab From Source Instead of URL #10511

Closed
EnginePod opened this issue Aug 31, 2016 · 16 comments

Comments

@EnginePod commented Aug 31, 2016

  • I've verified and I assure that I'm running youtube-dl 2016.08.31

Before submitting an issue make sure you have:

  • At least skimmed through README and most notably FAQ and BUGS sections
  • Searched the bugtracker for similar issues including closed ones

What is the purpose of your issue?

  • Feature request (request for a new functionality)

Description of your issue, suggested solution and other information

This might be the second or third time I've requested this feature, because it would be so extremely useful to have.
There are tons of reasons why this would be useful, but the main one is that it would let youtube-dl extract video URLs on member-restricted pages, such as Facebook videos set to private.

YouTube videos are fine as I can use the username and password parameters to log in, but for other sites (like Facebook and Instagram, among others) I'm out of luck, so I have to resort to installing software to sniff the URLs.

If there were just a simple parameter like --get-from-source C:\location\to\website_source.html which then dumped the video links, I might be free from bundled adware at the end of the day.

@dstftw (Collaborator) commented Aug 31, 2016

Duplicate of #5768.

@dstftw closed this Aug 31, 2016

@EnginePod (Author) commented Aug 31, 2016

I went through #5768, and what @jaimeMF wrote doesn't completely make sense; I'll explain why.
Here are the reasons he posted for why this feature could not be added:

Reason 1) It would unnecessarily complicate the code, we would need to modify the extractors so that they work both by giving a URL and the webpage. (I guess that you are probably only interested in youtube, but we support more sites).

Answer 1) Not really; it would be extremely simple to add, as it would only be a matter of adding an if-statement that checks whether the --get-from-source parameter was set. If it is set, youtube-dl doesn't download the page source but uses the data from the parameter instead. Example:

if get_from_source:
    # Use the specified source
    ...
else:
    # Download the page source as usual
    ...

The main page source is only downloaded once, so this if-statement would also only have to be used once. Most hosts also don't restrict the direct video link, so youtube-dl would have no issue handling everything after that point as usual requests.
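Here is a rough, runnable Python sketch of that single branch (the function and argument names are placeholders of mine, not actual youtube-dl code):

from urllib.request import urlopen

def fetch_main_page(url, get_from_source=None):
    # get_from_source would hold the path passed via the proposed --get-from-source flag
    if get_from_source:
        # Use the locally saved page source instead of downloading it
        with open(get_from_source, encoding='utf-8') as f:
            return f.read()
    # Download the page source as usual
    return urlopen(url).read().decode('utf-8')

Everything after this single branch would keep operating on the returned page source exactly as it does today.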

Reason 2) What happens if we decide to use another page for getting the info? You would get error messages and we would need to explain what has changed, and this is internal behaviour so we shouldn't have to.

Answer 2) As I explained in my previous answer, nothing would change other than adding that if-statement, so no additional errors would have to be caught. Everything would work exactly like a normal request, except that the user would be allowed to supply the page source.

Reason 3) You would need to specify an extractor, which is another source of bug reports because people would use the wrong extractor and we sometimes split extractors or add a special one for handling some URLs.

Answer 3) That's a fair point, but it would be very easy to solve as well. The user would simply pass the URL along with the parameter, just like any other command. For example: youtube-dl http://web.com/video --get-from-source
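To make that concrete, extractor selection could still key off the URL while only the page body comes from the file. A toy illustration (a made-up pattern table, not youtube-dl's internals):

import re

# Made-up mapping of URL patterns to extractor names, just for illustration
EXTRACTORS = {
    r'https?://(www\.)?web\.com/': 'WebCom',
    r'https?://(www\.)?facebook\.com/': 'Facebook',
}

def pick_extractor(url):
    # The URL on the command line still decides which extractor runs...
    for pattern, name in EXTRACTORS.items():
        if re.match(pattern, url):
            return name
    return 'Generic'

# ...while --get-from-source would only replace where the page body comes from.
print(pick_extractor('http://web.com/video'))  # -> WebCom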

@dstftw (Collaborator) commented Aug 31, 2016

  1. Even if it were as simple as your fancy pseudocode suggests (it's actually not; there are lots of places that rely heavily on the URL, and lots of complicated code that would require heavy refactoring), this would require such branching in every extractor, for every source downloaded during the extraction process. That's a lot of mess for nothing, and thus neither feasible nor sensible to maintain.
  2. Do you even understand the problem? An extraction scenario may involve downloading an arbitrary set of webpages/assets/APIs/whatever; that is technically an implementation detail the end user should not even bother about or be aware of. Every item in this set you would have to provide as input to the extractor, in the correct order. But before that, the extractor would somehow have to point out to you which files, from which locations, in which order it accepts as input, and how to obtain all of this stuff. Moreover, this set of files may change over time, since websites' layouts and structures change, leading to changes in extraction scenarios. Each such change will result in a different set of input files with different constraints. Needless to say, providing such an input set may not even be possible at all due to sessioning, authentication, different kinds of protection, URL signing, expiry and so on.
@EnginePod (Author) commented Aug 31, 2016

You're not understanding how this would work and once again I'll explain why.

  1. I was under the impression that youtube-dl first had the core files download the URL and then passed it on to the specific extractor, since nearly every single extractor downloads the URL anyway (as you pointed out yourself). This is a flaw in the youtube-dl structure as the same code/process is repeated over every single extractor (hundreds of files), but it's still possible to fix by removing all the requests to the main URL, moving the part to the core files and then handing it over to the extractors through a variable.
  2. First off, everything would work exactly as it does now if the parameter isn't used. Second, I think you've missed the whole point of how this would work, or you didn't read my answer properly.

I think we can agree that currently youtube-dl usually works something like this:

User provide a URL -> Youtube-dl -> Downloads URL source
-> Downloads other assets which were LOADED THROUGH THE PAGE SOURCE

With the new page source parameter you would only change the "Downloads URL source" part, which leaves the rest of the process untouched. It would look like this:

User provide a URL -> Youtube-dl -> page source PROVIDED THROUGH PARAMETER
-> Downloads other assets which were LOADED THROUGH THE PAGE SOURCE

Now as I said in my PREVIOUS answer, youtube-dl would in most cases get the direct link RIGHT AWAY from the page source (this is the case for Facebook, Instagram and many more, as most pages have the video URL right on the page) WITHOUT having to download any additional assets, or it would download a few assets first. And like I said, the assets usually are NOT restricted by the user session, as they're on static file servers which in very rare cases just check the IP.
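To illustrate why the rest of the chain stays untouched, here is a toy Python example (again my own placeholder code, not youtube-dl's; the regex and file path are just for illustration): once the page source is a plain string, the downstream steps don't care where it came from.

import re
from urllib.request import urlopen

def find_video_links(webpage):
    # Downstream step: works on the page source string no matter how it was obtained
    return re.findall(r'https?://\S+\.mp4\S*', webpage)  # naive pattern, illustration only

# Current flow: the page source comes from the network
# webpage = urlopen('http://web.com/video').read().decode('utf-8')

# Proposed flow: the page source comes from a file saved while logged in
with open(r'C:\location\to\website_source.html', encoding='utf-8') as f:
    webpage = f.read()

print(find_video_links(webpage))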

@remitamine (Collaborator) commented Aug 31, 2016

I was under the impression that youtube-dl first had the core files download the URL and then passed it on to the specific extractor, since nearly every single extractor downloads the URL anyway (as you pointed out yourself). This is a flaw in the youtube-dl structure as the same code/process is repeated over every single extractor (hundreds of files), but it's still possible to fix by removing all the requests to the main URL, moving the part to the core files and then handing it over to the extractors through a variable.

We don't always request the URL passed by the user; sometimes we use other sources to extract the information. These are some examples of extractors that do this: CBSIE, CWTVIE, CTVIE, TouTvIE, WatIE...

@EnginePod (Author) commented Aug 31, 2016

@remitamine: I was expecting this to be mentioned sooner or later. My answer is that this is very, very rare and almost never happens. Take a look at the size of the sites you mentioned, for example; besides, one extra request in these rare cases won't really cause any performance issues.

@dstftw (Collaborator) commented Aug 31, 2016

I was under the impression that youtube-dl first had the core files download the URL and then passed it on to the specific extractor

Did you even read my post? The core does not and should not be aware of which URL or set of URLs an extractor needs, which actions that extractor should take to obtain them, which headers and cookies to provide, and so on.

since nearly every single extractor downloads the URL anyway (as you pointed out yourself)

No, it doesn't, and I did not say that.

This is a flaw in the youtube-dl structure as the same code/process is repeated over every single extractor (hundreds of files), but it's still possible to fix by removing all the requests to the main URL, moving the part to the core files and then handing it over to the extractors through a variable.

No, it isn't. Even if the subset of dumb extractors that pull only the original URL, without any HTTP request tweaking, custom headers or cookies, were generalized, there would still be extractors that do not follow this pattern and can't be generalized this way.

User provide a URL -> Youtube-dl -> Downloads URL source
-> Downloads other assets which were LOADED THROUGH THE PAGE SOURCE

No, not necessarily.

With the new page source parameter you would only change the "Downloads URL source" part, which leaves the rest of the process untouched.

How come? What if somebody wants to provide the first two pages? The solution should be generic.

User provide a URL -> Youtube-dl -> page source PROVIDED THROUGH PARAMETER
-> Downloads other assets which were LOADED THROUGH THE PAGE SOURCE

Fetching the page source sets cookies required for further extraction that are not available with a bare file => extraction won't work.
The page source contains an expired hash that is used in subsequent requests => extraction won't work.
Etc.
This does not scale.

youtube-dl would in most cases get the direct link RIGHT AWAY

No, it wouldn't. Even assuming this were true for a second, what about other extractors, say, those that don't download the source URL at all? The way you suggest implementing this does not fit them at all, so this feature wouldn't be available for them, which leads to even more confusion.

assets usually are NOT restricted by the user session

Everything else may require it.

So this is not worth the effort.

@yan12125 (Collaborator) commented Aug 31, 2016

@dstftw has explained things well. Here are just some interesting statistics:

  • The number of extractors that download the user-input URL: 475
$ grep -r 'self\._download_webpage(url' youtube_dl/extractor/*.py | wc -l
475
  • The total number of extractors: 900 (the first line in this file is the title)
$ wc -l docs/supportedsites.md
901 docs/supportedsites.md
@EnginePod (Author) commented Aug 31, 2016

@yan12125: there's no reason to continue discussing this then, as editing 900 files would take way too much time.
It's unfortunate that this feature couldn't be added, though; it would be really useful for getting the download links on pages that youtube-dl can't reach due to login restrictions.

@yan12125 (Collaborator) commented Aug 31, 2016

Well, the point is not that we have lots of extractors, but that your solution applies to only a few websites. If you want some website to support logging in, open a new issue for each one.

@EnginePod (Author) commented Aug 31, 2016

This is still completely doable, and the solution does actually apply to the 475 websites/extractors you mentioned, but then again you'd have to add the if-statement to every single one of those extractors.

@yan12125 (Collaborator) commented Aug 31, 2016

I counted extractors that download the web page, not extractors that only download the web page. I didn't find a quick way to count the latter, so I skipped it. It should be much less than 475.

@EnginePod (Author) commented Aug 31, 2016

@dstftw: now over to your reply:

Did you even read my post? The core does not and should not be aware of which URL or set of URLs an extractor needs, which actions that extractor should take to obtain them, which headers and cookies to provide, and so on.

Your post mentioned the core AFTER I wrote that, but it doesn't matter much, as the solution could be applied to each extractor individually, although that would be quite a lot of work.

How come? What if somebody wants to provide the first two pages? The solution should be generic.

The solution is generic; the only difference is that it would only work with a single URL.
If someone wants to download a video spread across two URLs, they'd simply have to run youtube-dl twice.

Fetching the page source sets cookies required for further extraction that are not available with a bare file => extraction won't work. The page source contains an expired hash that is used in subsequent requests => extraction won't work. Etc. This does not scale.

I've explained this part twice now.
The source that has been fed through the parameter is almost always the only one that requires cookies sent through headers; the other assets don't require the login or any other cookies.

There is no reason for the page source to contain an expired hash, as the source would have been taken recently.
And like I said, most websites don't require authentication for the direct video URL or the assets on that page, while they do for the video page, so there wouldn't be any issues with the cookies (try it for yourself).

Everything else may require it.

So this is not worth the effort.

"I've explained this part twice now. The source that has been fed through the parameter is almost always the only one that requires cookies sent through headers; the other assets don't require the login or any other cookies."

There is also no reason for there


Again, this is completely doable; the only thing in the way is that the if-statement would have to be added to every single extractor that downloads the page source, and as @yan12125 pointed out, that's a few hundred files.

@yan12125 (Collaborator) commented Sep 1, 2016

The source that has been fed through the parameter is almost always the only one that requires cookies sent through headers; the other assets don't require the login or any other cookies.

This is the key point in deciding whether this is doable or not. Please give one example URL for which you think this is doable.

And like I said, most websites don't require authentication for the direct video URL or the assets on that page, while they do for the video page, so there wouldn't be any issues with the cookies (try it for yourself).

It's your job to give examples to prove that your proposal works, not ours.

@EnginePod (Author) commented Sep 1, 2016

It is indeed my duty to provide examples, but I didn't bother as I don't think anyone is interested in editing hundreds of files just to add this parameter (although it'd be extremely useful to have).

@dstftw (Collaborator) commented Sep 1, 2016

Your post mentioned the core AFTER I wrote that

I pointed out in my very first reply that the extraction process is an implementation detail that should be known only to the extractor itself. You are just playing with terms in order to keep stubbornly arguing against the facts.

The solution is generic; the only difference is that it would only work with a single URL.

A generic solution offers arbitrary control, by definition. With a generic solution one should be able to control any depth of the extraction process, not only the special case of N=1.

The source that has been fed through the parameter is almost always the only one that requires cookies sent through headers; the other assets don't require the login or any other cookies.

"almost always" and "usually" does not mean always. These are concrete extraction scenarios of concrete extractors in youtube-dl codebase.

There is no reason for the page source to contain an expired hash, as the source would have been taken recently.

No, you can't make such assumptions. You are also not allowed to make any assumptions about the IP or the way the page was obtained, which again makes this impossible in the case of an IP-bound source page.

It is indeed my duty to provide examples, but I didn't bother as I don't think anyone is interested in editing hundreds of files just to add this parameter (although it'd be extremely useful to have).

The point is not adding this parameter, but adding support for these unsupported URLs (which you haven't even provided so far) directly, adding support for authentication in extractors that don't support it, adding support for password-protected videos, and so on.

@ytdl-org locked and limited conversation to collaborators Sep 1, 2016