Could someone kindly add a function to download CC subtitle from site "www.tagesschau.de"? #21427

Open
snowmfx opened this issue Jun 17, 2019 · 7 comments

@snowmfx snowmfx commented Jun 17, 2019

Checklist

  • [x] I'm reporting a site feature request
  • [x] I've verified that I'm running youtube-dl version 2019.06.08
  • [x] I've searched the bugtracker for similar site feature requests including closed ones

Description

Some videos on the site "www.tagesschau.de" have subtitles,
e.g. https://www.tagesschau.de/multimedia/sendung/ts-31835.html .
But I can't download them with the latest youtube-dl.
Could someone please add support for this?

@mkg20001 mkg20001 commented Jun 17, 2019

I've found that the page downloads https://www.tagesschau.de/multimedia/video/untertitel-35735.xml (seems like TTS format). The ID comes from https://www.tagesschau.de/multimedia/sendung/ts-31835.html, but it can be acquired dynamically from JS using window.location.host + '/' + mc.getSubtitleUrl()....

...which gets it from the _subtitleUrl property of https://www.tagesschau.de/multimedia/video/video-554109~mediajson_broadcastType-TS2000.json

No clue where the JSON comes from, though. The page seems to load an iframe, which in turn loads that JSON.

Author

@snowmfx snowmfx commented Jun 18, 2019

Thanks.
I searched for the word "untertitel" in the Network tab of the browser's developer tools and got the subtitle address that way.

@mkg20001 mkg20001 commented Jun 19, 2019

@snowmfx That should work as a workaround. I'd suggest leaving this issue open until it's implemented in the extractor. (I tried to add it, but it seems the current extractor isn't even using the JSON. I'll try again once I have the time.)

Contributor

@basicmaster basicmaster commented Jun 20, 2019

(seems like TTS format)

It is TTML, or rather the EBU-TT-D-Basic-DE profile used by the German public broadcasters. If the file is renamed to *.ttml, recent VLC versions render it.

@Reino17 Reino17 commented Jun 23, 2019

For the time being you could have a look at Xidel, an HTML/XML/JSON parser (using CSS, XPath, XQuery, JSONiq and pattern templates).
Extract the json-url:

$ xidel -s "https://www.tagesschau.de/multimedia/sendung/ts-31835.html" -e 'json(//@data-ctrl-iframe)/action/default/src'
/multimedia/video/video-554109~ardplayer_autoplay-true_broadcastType-TS2000.html

$ xidel -s "https://www.tagesschau.de/multimedia/sendung/ts-31835.html" -e 'json(//@data-ctrl-iframe)/action/default/replace(src,"(.+~).+(_broadcast.+\.).+","$1mediajson$2json")'
/multimedia/video/video-554109~mediajson_broadcastType-TS2000.json

"Follow" (-f), i.e. open, the json-url, parse the response explicitly as JSON (it is served with Content-Type: text/html instead of Content-Type: application/json) and extract the subtitle-url:

$ xidel -s "https://www.tagesschau.de/multimedia/sendung/ts-31835.html" -f 'json(//@data-ctrl-iframe)/action/default/replace(src,"(.+~).+(_broadcast.+\.).+","$1mediajson$2json")' -e 'json($raw)/_subtitleUrl'
/multimedia/video/untertitel-35735.xml
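
The same two steps could be sketched in Python, which is the language a youtube-dl extractor would use. This is only an illustration of the URL rewriting and JSON lookup shown above, not the extractor's actual code: the function names are made up, the sample JSON is a trimmed-down stand-in for the real response, and no network request is made.

```python
import json
import re
from urllib.parse import urljoin

PAGE_URL = "https://www.tagesschau.de/multimedia/sendung/ts-31835.html"


def media_json_path(iframe_src):
    """Rewrite the player iframe path into the media JSON path (hypothetical helper)."""
    # Same regex as the xidel query: keep everything up to "~" and the
    # "_broadcast..." part, then swap in "mediajson" and the "json" extension.
    return re.sub(r"(.+~).+(_broadcast.+\.).+", r"\1mediajson\2json", iframe_src)


def subtitle_url(media_json_text, page_url=PAGE_URL):
    """Return the absolute subtitle URL, or None if the JSON has no _subtitleUrl."""
    data = json.loads(media_json_text)
    path = data.get("_subtitleUrl")
    return urljoin(page_url, path) if path else None


if __name__ == "__main__":
    src = "/multimedia/video/video-554109~ardplayer_autoplay-true_broadcastType-TS2000.html"
    print(media_json_path(src))
    # Assumed minimal JSON body; the real response contains more fields.
    sample = '{"_subtitleUrl": "/multimedia/video/untertitel-35735.xml"}'
    print(subtitle_url(sample))
```

Returning None when the property is missing is a design choice for videos without subtitles; the thread does not say how the site signals that case.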

If you want to convert to *.srt (SubRip), then Xidel can do this too:

$ xidel -s "https://www.tagesschau.de/multimedia/sendung/ts-31835.html" -f 'json(//@data-ctrl-iframe)/action/default/replace(src,"(.+~).+(_broadcast.+\.).+","$1mediajson$2json")' -f 'json($raw)/_subtitleUrl' --xquery 'for $x at $i in //tt:p[tt:span] return ($i,replace(concat($x/@begin," --> ",$x/@end),"\.",","),$x/tt:span,"")'
1
00:00:03,040 --> 00:00:06,720
Hier ist das Erste Deutsche Fernsehen
mit der tagesschau.

2
00:00:06,880 --> 00:00:11,120
Herzlich willkommen zur Live-
Untertitelung des NDR (16.06.2019)

3
00:00:15,200 --> 00:00:17,240
Heute im Studio: Jan Hofer

[...]

257
00:15:45,320 --> 00:15:47,400
Copyright Untertitel: NDR 2019
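
For comparison, here is a rough Python equivalent of the XQuery conversion above, using only the standard library. It assumes the structure the query relies on (tt:p elements with begin/end attributes and direct tt:span children, in the http://www.w3.org/ns/ttml namespace). The embedded sample is a hand-written stand-in, not the real subtitle file, and the function name is made up.

```python
import xml.etree.ElementTree as ET

NS = {"tt": "http://www.w3.org/ns/ttml"}

# Hypothetical two-cue sample in the shape the xidel query expects.
SAMPLE = """<tt:tt xmlns:tt="http://www.w3.org/ns/ttml">
  <tt:body><tt:div>
    <tt:p begin="00:00:03.040" end="00:00:06.720">
      <tt:span style="textWhite">Hier ist das Erste Deutsche Fernsehen</tt:span>
      <tt:br/>
      <tt:span style="textWhite">mit der tagesschau.</tt:span>
    </tt:p>
    <tt:p begin="00:00:15.200" end="00:00:17.240">
      <tt:span style="textWhite">Heute im Studio: Jan Hofer</tt:span>
    </tt:p>
  </tt:div></tt:body>
</tt:tt>"""


def ttml_to_srt(ttml_text):
    """Convert TTML text to SubRip text, one line per tt:span."""
    root = ET.fromstring(ttml_text)
    blocks = []
    index = 0
    for p in root.iterfind(".//tt:p", NS):
        spans = p.findall("tt:span", NS)
        if not spans:
            continue  # mirrors the //tt:p[tt:span] filter
        index += 1
        # SubRip uses a comma as the decimal separator in timestamps.
        timing = "{} --> {}".format(p.get("begin"), p.get("end")).replace(".", ",")
        lines = ["".join(s.itertext()).strip() for s in spans]
        blocks.append("\n".join([str(index), timing] + lines))
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    print(ttml_to_srt(SAMPLE), end="")
```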
Contributor

@basicmaster basicmaster commented Jun 24, 2019

The problem with SubRip here is that all the speaker colors are discarded.

@Reino17 Reino17 commented Jun 24, 2019

Then we just have to update the extraction query.
Prettified query:

$ xidel -s "https://www.tagesschau.de/multimedia/sendung/ts-31835.html" \
-f '
  json(//@data-ctrl-iframe)/action/default/replace(
    src,
    "(.+~).+(_broadcast.+\.).+",
    "$1mediajson$2json"
  )
' \
-f 'json($raw)/_subtitleUrl' \
--xquery '
  let $a:={|
    //tt:style[@tts:color]/{
      @xml:id:@tts:color
    }
  |}
  for $x at $i in //tt:p[tt:span]
  return (
    $i,
    replace(
      concat(
        $x/@begin,
        " --> ",
        $x/@end
      ),
      "\.",","
    ),
    $x/tt:span/(
      if (@style="textWhite") then
        .
      else
        concat(
          "&lt;font color=&quot;",
          $a(@style),
          "&quot;&gt;",
          .,
          "&lt;/font&gt;"
        )
    ),
    ""
  )
'
1
00:00:03,040 --> 00:00:06,720
Hier ist das Erste Deutsche Fernsehen
mit der tagesschau.

2
00:00:06,880 --> 00:00:11,120
<font color="#0000FF">Herzlich willkommen zur Live-</font>
<font color="#0000FF">Untertitelung des NDR (16.06.2019)</font>

3
00:00:15,200 --> 00:00:17,240
Heute im Studio: Jan Hofer

[...]

257
00:15:45,320 --> 00:15:47,400
<font color="#0000FF">Copyright Untertitel: NDR 2019</font>

--xquery 'let $a:={|//tt:style[@tts:color]/{@xml:id:@tts:color}|} return $a'
{
  "textBlack": "#000000",
  "textRed": "#FF0000",
  "textGreen": "#00FF00",
  "textYellow": "#FFFF00",
  "textBlue": "#0000FF",
  "textMagenta": "#FF00FF",
  "textCyan": "#00FFFF",
  "textWhite": "#FFFFFF"
}

When we come across <tt:span style="textBlue">, for instance, $a(@style) looks up the "textBlue" key in this JSON object and returns its value.
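
The lookup can also be sketched in Python with a plain dict in place of the JSON object. This assumes the styles are tt:style elements carrying xml:id and tts:color attributes, as the query implies. The sample document and helper names are made up for illustration, and falling back to plain text for an unknown style is an addition that the original query does not make.

```python
import xml.etree.ElementTree as ET

TT = "http://www.w3.org/ns/ttml"
TTS = "http://www.w3.org/ns/ttml#styling"
XML = "http://www.w3.org/XML/1998/namespace"

# Hypothetical sample with two styles and one coloured cue.
SAMPLE = """<tt:tt xmlns:tt="http://www.w3.org/ns/ttml"
       xmlns:tts="http://www.w3.org/ns/ttml#styling">
  <tt:head><tt:styling>
    <tt:style xml:id="textBlue" tts:color="#0000FF"/>
    <tt:style xml:id="textWhite" tts:color="#FFFFFF"/>
  </tt:styling></tt:head>
  <tt:body><tt:div>
    <tt:p begin="00:15:45.320" end="00:15:47.400">
      <tt:span style="textBlue">Copyright Untertitel: NDR 2019</tt:span>
    </tt:p>
  </tt:div></tt:body>
</tt:tt>"""


def color_map(root):
    """Map each style's xml:id to its tts:color, like $a in the query."""
    colors = {}
    for style in root.iter("{%s}style" % TT):
        color = style.get("{%s}color" % TTS)
        if color is not None:
            colors[style.get("{%s}id" % XML)] = color
    return colors


def render_span(span, colors):
    """Wrap non-white span text in a <font> tag; leave white text as-is."""
    text = "".join(span.itertext()).strip()
    style = span.get("style")
    # Assumption: styles missing from the map are emitted without a tag.
    if style == "textWhite" or style not in colors:
        return text
    return '<font color="{}">{}</font>'.format(colors[style], text)


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    colors = color_map(root)
    for span in root.iter("{%s}span" % TT):
        print(render_span(span, colors))
```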


Minified query:

$ xidel -s "https://www.tagesschau.de/multimedia/sendung/ts-31835.html" -f 'json(//@data-ctrl-iframe)/action/default/replace(src,"(.+~).+(_broadcast.+\.).+","$1mediajson$2json")' -f 'json($raw)/_subtitleUrl' --xquery 'let $a:={|//tt:style[@tts:color]/{@xml:id:@tts:color}|} for $x at $i in //tt:p[tt:span] return ($i,replace(concat($x/@begin," --> ",$x/@end),"\.",","),$x/tt:span/(if (@style="textWhite") then . else concat("&lt;font color=&quot;",$a(@style),"&quot;&gt;",.,"&lt;/font&gt;")),"")'

Minified query (Windows):

xidel.exe -s "https://www.tagesschau.de/multimedia/sendung/ts-31835.html" -f "json(//@data-ctrl-iframe)/action/default/replace(src,'(.+~).+(_broadcast.+\.).+','$1mediajson$2json')" -f "json($raw)/_subtitleUrl" --xquery "let $a:={|//tt:style[@tts:color]/{@xml:id:@tts:color}|} for $x at $i in //tt:p[tt:span] return ($i,replace(concat($x/@begin,' --> ',$x/@end),'\.',','),$x/tt:span/(if (@style='textWhite') then . else concat('&lt;font color=&quot;',$a(@style),'&quot;&gt;',.,'&lt;/font&gt;')),'')"
@ytdl-org ytdl-org deleted a comment from korbendallaskoop Jul 6, 2019