As a last resort, why not have GenericIE check every iframe? (vanityfair.com) #12692
Comments
Yes, this heuristic works most of the time in the wild.
This is the main concern.
As you've already mentioned, this will produce a lot of false positives (in terms of checking non-media iframes) most of the time, since the average webpage contains several iframes, most of which are neither direct media URLs nor embedded media.
This takes time, slows down the extraction, and in most cases has no effect.
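To make the cost concern concrete, here is a minimal sketch of a cheap pre-filter that inspects iframe URLs by file extension without making any extra requests. It is not youtube-dl code: `IframeSrcParser`, `direct_media_iframes`, and the `MEDIA_EXTENSIONS` list are hypothetical names and values chosen for illustration, and it uses only the Python standard library.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumed list of direct-media extensions; not taken from youtube-dl.
MEDIA_EXTENSIONS = ('.mp4', '.webm', '.m4v', '.mov', '.flv', '.mp3', '.m3u8')


class IframeSrcParser(HTMLParser):
    """Collects the src attribute of every <iframe> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == 'iframe':
            src = dict(attrs).get('src')
            if src:
                self.sources.append(src)


def direct_media_iframes(page_url, html):
    """Return absolute iframe URLs whose path ends in a media extension."""
    parser = IframeSrcParser()
    parser.feed(html)
    found = []
    for src in parser.sources:
        # Resolve relative and protocol-relative URLs against the page URL.
        absolute = urljoin(page_url, src)
        # Compare only the path, so query strings do not hide the extension.
        path = urlparse(absolute).path.lower()
        if path.endswith(MEDIA_EXTENSIONS):
            found.append(absolute)
    return found
```

A filter like this only catches iframes that point straight at a media file. Embedded players behind an HTML page would still need a request per iframe, which is where the slowdown described above comes from.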
Well, doing something is better than doing nothing…
Well, the proposal here is to only do it after all other heuristics fail ("last resort"). But, which of these options seems best to you?:
Or some combination of these? What do you think?
Not always.
This does not matter when processing a batch of random URLs, most of which are not likely to contain media. By default only 1 can be applied. For the rest I've already suggested
OK, I will submit a PR.
Ah, I did not see that before, in #6752. It was even more aggressive, not just iframes. Which of the numbered options above would be appropriate for such an option? Or 7., from #6752? I'm not so sure on the naming, but I don't have better proposals. Also, incidentally, by far the dominant iframe search regexp starts with
It seems worth noting that this is defective, since it can catch tags like Maybe they should all be abstracted out to use a common constant ( I dunno.
http://www.vanityfair.com/hollywood/2017/04/louis-ck-snl-monologue-white-privilege has an iframe that goes directly to an mp4:
and
StreamableIE doesn't know anything about this sort of URL
_VALID_URL = r'https?://streamable\.com/(?:e/)?(?P<id>\w+)'
(and there's no discussion of it at https://streamable.com/documentation)
So the generic extractor just fails. But of course the raw .mp4 in the <iframe src> retrieves just fine (with Youtube-Dl or anything else):
My first thought to address this was that, as a last-ditch measure, GenericIE should check for raw video file links in iframes (maybe by file extension, or by HEAD-ing each URL).
But then on further reflection, it seems like there could be plenty of stuff in any of the iframes of a page that Youtube-DL should check. Or iframes within iframes. (It's iframes all the way down.)
So is there a reason we don't do this?:
It certainly works in this case, but I imagine it could cause some problems in others? Maybe?
And I see #6216 exists, but it's a lot more complicated and it has been untouched for a year.
Can someone explain why Youtube-DL doesn't recurse through all the iframes?
I'm happy to submit a pull request, but I suspect there's something wrong with this strategy, so I thought it was better to ask first.
And I guess maybe the above patch should be wrapped in if not found: or perhaps use found instead of matches; I'm not sure what the distinction is trying to convey…
Thank you.
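For reference, the "iframes within iframes" idea could be sketched roughly as follows, with a depth limit and a visited set to keep the recursion from running away. This is only an illustration under stated assumptions, not youtube-dl's code or the patch discussed in this issue: `walk_iframes`, `max_depth`, and the caller-supplied `fetch` function are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _IframeCollector(HTMLParser):
    """Collects the src attribute of every <iframe> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == 'iframe':
            src = dict(attrs).get('src')
            if src:
                self.sources.append(src)


def walk_iframes(url, fetch, max_depth=2, _seen=None):
    """Yield url and every iframe URL reachable from it, depth-first.

    fetch is a hypothetical caller-supplied function mapping a URL to its
    HTML text (or None/'' when the page cannot be read). It stands in for
    whatever download helper a real extractor would use.
    """
    seen = set() if _seen is None else _seen
    if url in seen:
        return  # Already visited: avoids loops between pages.
    seen.add(url)
    yield url
    if max_depth <= 0:
        return  # Depth budget exhausted: do not fetch this page.
    html = fetch(url)
    if not html:
        return
    collector = _IframeCollector()
    collector.feed(html)
    for src in collector.sources:
        yield from walk_iframes(urljoin(url, src), fetch, max_depth - 1, seen)
```

Each yielded URL would then be handed to the normal extraction logic. Even with `max_depth` and the visited set, every level costs one request per iframe, so the number of requests can grow quickly on pages with many frames.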