Commit
[ie/francetv] Fix extractors (#9333)
Closes #9323. Authored by: bashonly
Showing 2 changed files with 47 additions and 30 deletions.
9749ac7
Hey @bashonly
I want to thank you for this timely commit. If you have time, I have a question about how you solved this problem.
This IE obviously relies on the hidden URL
'https://player.webservices.francetelevisions.fr/v1/videos/...
to extract the real video information. However, using my browser devtools, no matter which france.tv video I choose, or whether I refresh the page, I never see this JSON URI fetched in the Network tab, which makes no sense to me. How can I see this URL fetched in devtools? Put another way, how did whoever created this IE find that JSON URL in the first place? I ask because, if you hadn't come along, this might have taken longer to get fixed, and I'd like to be able to solve this myself in the future. My Python/JavaScript skills are quite strong; I'm clearly just missing some key ingredient of the underlying JSON fetch that is keeping me from being effective at providing fixes.
You look like a very active contributor here, so I'm hoping this is an easy question for you to answer and won't take up much of your time.
Thanks in advance!
OK, I figured it out. The IE probably needs to be updated, because the hard-coded JSON URL it uses looks like it's being redirected to a different URL. That's why it still works, and also why I don't see the expected hard-coded URL being fetched in the Network tab of devtools.
The new redirect is to:
https://k7.ftven.fr/videos/
A follow-up: is there a way to quickly and easily print the underlying real video ID from the CLI for any site URL? The same goes for the JSON URL that is fetched to get the extracted info.
I ask because it would make "discovery" of the underlying JSON a lot easier when creating a fix, rather than having to revisit the code and figure these out every time you come back to an IE, since different IEs may have different authors and implementations.
Thanks!
Noted, in case of breakage. But since I am geo-blocked from many links on these sites, I thought it unwise to change something that still worked; best to leave that to someone who is in-region.
- `--print id` will print what yt-dlp extracts as the ID, if extraction is successful.
- `--print-traffic` will show you all URLs being requested and redirects followed.
- `--write-pages` will write all responses received during extraction to files.
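These flags can also be combined in a single invocation. As a sketch, here is a hypothetical helper (not part of yt-dlp) that assembles such a command line; the URL argument is a placeholder, and actually running the command requires yt-dlp on PATH:

```python
# Hypothetical helper: build a yt-dlp debugging invocation that prints the
# extracted ID and all HTTP traffic without downloading any media.
def debug_extraction_cmd(url, write_pages=False):
    cmd = ['yt-dlp', '--print', 'id', '--print-traffic', '--skip-download']
    if write_pages:
        cmd.append('--write-pages')  # dump every fetched response to a file
    return cmd + [url]

# e.g. subprocess.run(debug_extraction_cmd('https://www.france.tv/...'))
print(debug_extraction_cmd('https://www.france.tv/some-video.html'))
```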
Thanks for the rapid reply! Those work great. FYI, `--print-traffic` shows the actual redirect that I previously had to figure out by manually pasting links into the browser bar. Some follow-ups:

1. The fix passes a `hostname` parameter to the `_extract_video` function and uses a helper called `smuggled_data` to find it (see below). Can you give me any insight into why this was needed, and what the "smuggle" is all about, generally speaking?

2. I think some IEs in yt-dlp have this notion of using `self._make_url_result` to turn site video URLs into simplified URNs consisting of an IE name key, a colon, and the real video ID. FranceTV seems to use this as well. I gather the rationale is to allow distinct IEs to accommodate newly matched site URLs that require different extraction of the real video ID, but then hand them off to an existing IE that can use the same "backend" to download the extracted info. Does that sound right?

3. Speaking of that URN, I can see from the yt-dlp CLI output that your commit added what looks to be a `#__youtubedl_smuggle=...` fragment to the new URN scheme, and evidently it's needed to make things work now, per point 2 above. So, along with point 1: what do `smuggle_url` and the fragment above do, and why is the fragment needed in the URN at all? Is this needed by a lot of sites/IEs, or why only some? I've checked the related helper function declarations in `utils.py` and it's not clear what their purpose is.

Thanks!
Pasting the URN creation method definition here for reference:
Separate note. Up until your new commit, francetv primarily had HLS streams, or at least that's what I have been using for years. After your new commit, `-F` reports exclusively MPD/DASH.

Back in the day with the old youtube-dl, there was a break in france.tv whose fix likewise resulted in DASH-only extracts, similar to your recent commit. The DASH downloads were very slow, to the point of being unusable compared to HLS. They also resulted in QSM subtitle files, which don't work, versus the VTT files associated with and downloaded alongside HLS. Around the time of that old youtube-dl DASH-only fix, yt-dlp forked, and eventually some new commits brought back the M3U8/HLS + VTT streams. I was using those yt-dlp HLS/VTT streams for francetv up until the break about a week ago.

My initial use of francetv with your fix commit seems very similar to what I described with the old youtube-dl fix years ago: the MPD/DASH-only downloads are very slow, and the accompanying QSM subtitle files do NOT work, versus the VTT that accompanied the old HLS. Is it possible that M3U8/HLS streams and VTT subtitles are out there and just not being extracted? For whatever reason, DASH/QSM is just too slow to be usable, and the QSM subs just don't work.

FYI: the subtitles get downloaded like this but aren't actually subtitle files.
Thanks!
@boulderob see and/or try out #9347

Your other 3 questions are all related:

The FranceTV video API endpoint is now requiring the `domain` query parameter, which the extractor had not been including previously -- this was the actual cause of the "HTTP Error 400"s. Before my commit, the other francetv extractors only handed over the video IDs to `FranceTVIE`. They used the `francetv:` prefix because, while the actual URLs were not suitable for `FranceTVIE` (e.g. `https://francetvinfo.fr/etcetc` was not matched by `FranceTVIE._VALID_URL`), the extraction needed to be performed by `FranceTVIE`. We use `smuggle_url` to append the hostname to the URL in the URL fragment, because it's currently the only way to hand over arbitrary data (such as the referring URL/hostname) to the resolved extractor.
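For intuition, here is a simplified, self-contained sketch of the smuggle/unsmuggle idea. The real helpers live in yt-dlp's utils module and are more careful, so treat these bodies as illustration rather than the actual implementation:

```python
import json
from urllib.parse import quote, unquote

SMUGGLE_MARKER = '#__youtubedl_smuggle='

def smuggle_url(url, data):
    # Stash arbitrary JSON-serializable data in the URL fragment so it
    # survives the hand-off from one extractor to another.
    return url + SMUGGLE_MARKER + quote(json.dumps(data))

def unsmuggle_url(smug_url, default=None):
    # Recover (clean_url, data); return the default when nothing was smuggled.
    if SMUGGLE_MARKER not in smug_url:
        return smug_url, default
    url, _, payload = smug_url.partition(SMUGGLE_MARKER)
    return url, json.loads(unquote(payload))

# Hand-off as described above: a sibling extractor resolves its page to a
# 'francetv:<id>' URN and smuggles the referring hostname along with it.
urn = smuggle_url('francetv:some-video-id', {'hostname': 'www.france.tv'})
url, data = unsmuggle_url(urn)
# url == 'francetv:some-video-id', data == {'hostname': 'www.france.tv'}
```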
@bashonly as usual, thanks for the rapid turnaround time. I'll test #9347 today and reply with any findings on that PR itself. But on initial inspection, the code looks good. Love the cleanup of the old code; it's more manageable and understandable this way.

Regarding the HLS: I was on the right track. I thought that by throttling the browser in devtools I might force the k7 URL to be fetched in `mobile` mode. I presume my thinking there was right in terms of testing? (This must be something the france.tv pages detect when making the call to the k7 URL themselves, though I'm not sure how or what they test for; perhaps the User-Agent header, or just pure speed?) But I missed that their k7 URL was capable of accommodating different `browser` types as well.

Speaking of which, I also presume that the only way to discover the HLS is through trial-and-error inspection and modification of the k7 URL's query string parameters, yes? And the only way to discover that HLS lives under the `safari` browser type is to actually run devtools in Safari and see the different M3U8 vs MPD results revealed in the JSON, yes? And this same logic applies to any IE, not just this france.tv-specific one?

I think if I can figure out the yt-dlp / GitHub automated testing protocol (https://github.com/yt-dlp/yt-dlp/pull/9347/checks), I'll be in pretty good shape with the required PR process. One thing that would make things easier is the ability to "code jump" from the PR commits to definitions in other parts of the codebase, especially the utility functions. "Go to definition" usually works in the repo proper on GitHub, but not always. Unfortunately, I haven't gotten "Go to definition" to work in the PR commits, outside of jumping within the same source file.

Thanks for the confirmation. There might be something a bit more semantic than `smuggle_url`, as far as utility function names go, though.
Mobile Safari does not support DASH (and HLS was developed by Apple)
@boulderob For this kind of reverse engineering, browser dev tools are not the only way.
You can always intercept the HTTPS traffic and trace the exchange between the browser (or a mobile app) and the server using a tool like mitmproxy. This is very helpful, especially when there is a dedicated mobile app that fetches the video information and files from a simple API endpoint, versus a complex API built into the web page.
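A minimal mitmproxy addon along those lines might look like the sketch below. The two hostnames are the endpoints mentioned earlier in this thread; everything else is generic and not tied to yt-dlp:

```python
# Save as trace_video_api.py and run: mitmproxy -s trace_video_api.py
class VideoAPITracer:
    # Hosts of interest (taken from this thread); adjust per site.
    HOSTS = ('player.webservices.francetelevisions.fr', 'k7.ftven.fr')

    def __init__(self):
        self.seen = []  # URLs observed so far

    def response(self, flow):
        # mitmproxy calls this hook once per completed HTTP response.
        if any(h in flow.request.pretty_host for h in self.HOSTS):
            self.seen.append(flow.request.pretty_url)
            print(flow.request.method, flow.request.pretty_url,
                  flow.response.status_code)

addons = [VideoAPITracer()]
```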
Thanks @bashonly for this new fix.
About the mp4 subtitles "linked" to DASH (compared to the VTT "linked" to HLS), I also observe, like @boulderob, that there is an issue with them.
With this command:
```
yt-dlp.exe --allow-u --restrict-filenames --no-overwrites --no-continue -f "hls-522+hls-audio-aacl-96-Audio_Description+hls-audio-aacl-96-Audio_Français" --audio-multistreams -o "TEMP\%(title)s_%(id)s.%(ext)s" --ffmpeg-location "C:\ffmpeg\bin\ffmpeg.exe" --write-subs --sub-format "srt/vtt/ass/best" --sub-langs "qsm" --convert-subs "srt" "https://www.france.tv/france-2/les-petits-meurtres-d-agatha-christie/les-petits-meurtres-d-agatha-christie-saison-3/5699088-mortel-karma.html"
```
This is the yt-dlp log:
This is what I get with ffmpeg:
It's not a big deal, as I was using the HLS streams before and will go back to them now, but if you need tests of these subtitle errors, I can do them.
Regards.
The mp4 subtitles are actual mp4 files containing only a TTML subtitle stream (no video or audio streams). yt-dlp does not know what to do with them, but you could manually merge one into your video/audio mp4 using ffmpeg.
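A sketch of that merge, assuming the TTML track can simply be stream-copied into the target container (the helper name and filenames are placeholders). Building the argument list separately lets you inspect the command before running it:

```python
def mux_subtitle_mp4(video_mp4, subs_mp4, output_mp4):
    """Build an ffmpeg command that copies the A/V streams from the first
    input and the subtitle stream(s) from the subs-only mp4 into one file."""
    return [
        'ffmpeg',
        '-i', video_mp4,   # video + audio
        '-i', subs_mp4,    # ttml-in-mp4 subtitles only
        '-map', '0',       # keep every stream from the first input
        '-map', '1:s',     # plus the subtitle stream(s) from the second
        '-c', 'copy',      # no re-encoding
        output_mp4,
    ]

# e.g. subprocess.run(mux_subtitle_mp4('ep.mp4', 'ep.qsm.mp4', 'merged.mp4'))
```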
@bashonly
Is there a way to convert the mp4 subtitles to a usable text-based subtitle format without merging them into the video container, so that media players can pick them up separately and automatically? I use the subtitle text to study foreign languages, so the convenience of having them just work alongside the video, like HLS mp4 + VTT does, is a great benefit. Merging requires more post-download work, and you lose the separation of concerns.
Also, since the qsm.mp4 TTML container is what the DASH MPD has been configured to return for subtitles, how come a media player in the browser can resolve and play the subtitles back, but a standalone media player can't? What is different between the two?
@Jean-Daniel
9749ac7#commitcomment-139309111
Thanks for the mitmproxy tip. Much appreciated.
IMO this is getting off-topic for a commit comment thread. Please open a new issue if there is something that needs to be fixed, or a question issue if you have a question.
But to resolve the remaining Q:
Not with yt-dlp; it doesn't know what to do with subtitle-only mp4 files. But you can use MP4Box to extract the subtitle stream as WebVTT.
It should be noted that the actual subtitles provided in mp4 via the DASH manifest are the same as the WebVTT provided via HLS. And with the latest nightly, yt-dlp will download the VTT format by default.
@boulderob support for subs in mp4 container is being tracked in #5833