Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

As a last resort, why not have GenericIE check every iframe? (vanityfair.com) #12692

Open
johnhawkinson opened this issue Apr 9, 2017 · 4 comments

Comments

@johnhawkinson
Copy link
Contributor

@johnhawkinson johnhawkinson commented Apr 9, 2017

http://www.vanityfair.com/hollywood/2017/04/louis-ck-snl-monologue-white-privilege has an iframe that goes directly to an mp4:

<div
  class="embed iframe-embed-component "
  data-type="iframe"
  data-url="https://cdn-e2.streamable.com/video/mp4/wfatk.mp4?token=1492919903_c3707bc86745b37c149c2be0da56b98a275b2b74"
  >
    <iframe height="360" width="640"
      src="https://cdn-e2.streamable.com/video/mp4/wfatk.mp4?token=1492919903_c3707bc86745b37c149c2be0da56b98a275b2b74"
    >
    </iframe>
</div>

and StreamableIE doesn't know anything about this sort of URL
_VALID_URL = r'https?://streamable\.com/(?:e/)?(?P<id>\w+)'
(and there's no discussion of it at https://streamable.com/documentation)

So the generic extractor just fails. But of course the raw .mp4 in the <iframe src> retrieves just fine (with Youtube-Dl or anything else):

curl -I 'https://cdn-e2.streamable.com/video/mp4/wfatk.mp4?token=1492919903_c3707bc86745b37c149c2be0da56b98a275b2b74'
HTTP/1.1 200 OK
x-amz-id-2: deVubJEfE2aanNg7M+Ix3q5KMWx/yTUGX2Hi70bvprc5DRwmMDL+MHlvtQlDLFx58z0KLiStRrE=
x-amz-request-id: 7DD9384BF10DF36C
Last-Modified: Sun, 09 Apr 2017 03:58:16 GMT
ETag: "1f5e254725c8eb94c09a9e3abbf515e6-31"
Cache-Control: max-age=315360000
Content-Type: video/mp4
Server: AmazonS3
Via: 1.1 varnish
Fastly-Debug-Digest: 7877314f0efbb23d37cdde0e9077f4f5133c9d5549b37d1e9bf1f4c07fdc4adc
Access-Control-Allow-Origin: *
Content-Length: 157744192
Accept-Ranges: bytes
Date: Sun, 09 Apr 2017 05:13:50 GMT
Via: 1.1 varnish
Connection: keep-alive
X-Served-By: cache-iad2138-IAD, cache-bos8234-BOS
X-Cache: HIT, HIT
X-Cache-Hits: 0, 0
X-Timer: S1491714830.028254,VS0,VE0

My first thought to address this was that as a last ditch, GenericIE should check for raw video file links in iframes (maybe by file extension, or by HEAD-ing each URL).

But then on further reflection, it seems like there could be plenty of stuff in any of the iframes of a page that Youtube-DL should check. Or iframes within iframes. (It's iframes all the way down.).

So is there a reason we don't do this?:

diff --git a/youtube_dl/extractor/generic.py b/youtube_dl/extractor/generic.py
index 658533cf6..91b4129bc 100644
--- a/youtube_dl/extractor/generic.py
+++ b/youtube_dl/extractor/generic.py
@@ -2678,6 +2678,12 @@ class GenericIE(InfoExtractor):
             if embed_url:
                 return self.url_result(embed_url)
 
+        # Maybe the video is actually in an iframe we don't have special knowledge of.
+        # It could be anywhere, so let's just check all iframes.
+        matches = re.findall(r'<iframe[^>]+?src="([^"]+)"', webpage)
+        if matches:
+            return self.playlist_from_matches(matches, video_id, video_title)
+        
         if not found:
             raise UnsupportedError(url)
 

It certainly works in this case, but I imagine it could cause some problems in others? Maybe?

And I see #6216 exists, but it's a lot more complicated and and it's untouched for a year.

Can someone explain why Youtube-DL doesn't recurse through all the iframes?
I'm happy to submit a pull request, but I suspect there's something wrong with this strategy, so I thought it was better to ask first.


And I guess maybe the above patch should be wrapped in if not found: or perhaps use found instead of matches; I'm not sure what the distinction is trying to convey…

Thank you.

@dstftw
Copy link
Collaborator

@dstftw dstftw commented Apr 9, 2017

check for raw video file links in iframes (maybe by file extension, or by HEAD-ing each URL)
maybe by file extension

Yes, this heuristic works most of the time in the wild.

But then on further reflection, it seems like there could be plenty of stuff in any of the iframes of a page that Youtube-DL should check.

This is the main concern.

matches = re.findall(r'<iframe[^>]+?src="([^"]+)"', webpage)

As you've already mentioned, this will result in a lots of false positives (in terms of checking non-media iframes) most of the time since average webpage contains several iframes most of which are not direct media URLs and not even embedded media.

Can someone explain why Youtube-DL doesn't recurse through all the iframes?

Cause this takes time and will slow down the extraction and in most cases will have no effect.

@johnhawkinson
Copy link
Contributor Author

@johnhawkinson johnhawkinson commented Apr 9, 2017

Well, doing something is better than doing nothing…

As you've already mentioned, this will result in a lots of false positives (in terms of checking non-media iframes) most of the time since average webpage contains several iframes most of which are not direct media URLs and not even embedded media.
> Can someone explain why Youtube-DL doesn't recurse through all the iframes?
Cause this takes time and will slow down the extraction and in most cases will have no effect.

Well, the proposal here is to only do it after all other heuristics fail ("last resort").

But, which of these options seems best to you?:

  1. Check file extension of iframes for ../utils/KNOWN_EXTENSIONS. (This could happen fairly early in GenericIE)

  2. HEAD every iframe (this would happen at the end)

  3. Recurse through all iframes by returning a playlist, as in the above patch

  4. Introduce a commandline option, --search-all-iframes. Probably then the above should happen at the beginning of GenericIE.

  5. Or maybe --search-all-iframes could have 3 values never|last-resort|always

  6. Do nothing.

Or some combination of these? What do you think?

@dstftw
Copy link
Collaborator

@dstftw dstftw commented Apr 9, 2017

Well, doing something is better than doing nothing…

Not always.

Well, the proposal here is to only do it after all other heuristics fail ("last resort").

This does not matter in case of processing a batch or random URLs most of which are not likely to contain media.

By default only 1 can be applied.

For the rest I've already suggested --aggressive-extraction some time ago for those who don't care about extraction speed and false positives.

@johnhawkinson
Copy link
Contributor Author

@johnhawkinson johnhawkinson commented Apr 9, 2017

By default only 1 can be applied.

OK, I will submit a PR.

For the rest I've already suggested --aggressive-extraction some time ago for those who don't care about extraction speed and false positives.

Ah, I did not see that before, in #6752. It was even more aggressive, not just iframes.

Which of the above numbered would be appropriate for such an option? Or 7., from #6752?

I'm not so sure on the naming, but I don't have better proposals.

Also, incidently, by far the dominant iframe search regexp starts with r'<iframe[^>]+:

pb3:extractor jhawk$ sed -nE  's/(iframe.........).*/\1/p' *.py  | sort | uniq -c | sort -nr | head -5
  47             r'<iframe[^>]+src=
  24             r'<iframe[^>]+?src
   7                 r'<iframe[^>]+src=
   5         iframe_url = se
   3         iframe = self._

It seems worth noting that this is defective, since it can catch tags like <iframe-not-really-haha src="whatever"/> and perhaps more concerningly <iframe mobile-postagestamp-src="whatever"/>.
Edit: And the 2nd-most common case (24) catches <iframesrc, I'm not sure why the ? is there at all?

Maybe they should all be abstracted out to use a common constant (IFRAME_RE) and then we should put some effort into making that right (e.g. \b boundary before src)?

I dunno.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
2 participants
You can’t perform that action at this time.