[seznamzpravy] Add new extractor #14616

che0 · 2017-10-28T18:26:01Z

Please follow the guide below

You will be asked some questions, please read them carefully and answer honestly
Put an x into all the boxes [ ] relevant to your pull request (like that [x])
Use Preview tab to see how your pull request will actually look like

Before submitting a pull request make sure you have:

At least skimmed through adding new extractor tutorial and youtube-dl coding conventions sections
Searched the bugtracker for similar pull requests
Checked the code with flake8

In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:

I am the original author of this code and I am willing to release it under Unlicense
I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

Bug fix
Improvement
New extractor
New feature

Add new extractor for Seznam Zprávy

This extractor extracts video from Seznam Zprávy, as requested in #14102.

dstftw · 2017-10-28T18:34:17Z

youtube_dl/extractor/seznamzpravy.py

+
+    def _real_extract(self, url):
+        video_id = self._match_id(url)
+        api_url = self._API_URL + 'v1/documents/{}'.format(video_id)


{} won't work on Python 2.6.
Each variable used only once should be inlined.

fixed, hopefully this doesn't make the code too complicated
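
For reference, a minimal sketch of the %-style alternative (not the code from the PR). Python 2.6 rejects auto-numbered `{}` fields with a ValueError, so either explicit indices such as `{0}` or %-formatting is needed. `API_URL` and `build_api_url` are hypothetical names used only for illustration.

```python
# Hypothetical base URL, used only for this illustration.
API_URL = 'https://api.example.invalid/'


def build_api_url(video_id):
    # %-formatting behaves the same on Python 2.6 and later;
    # '{0}'.format(video_id) would also work, a bare '{}' would not on 2.6.
    return API_URL + 'v1/documents/%s' % video_id


print(build_api_url('35990'))
```

The expression is short enough to be passed directly as an argument, which also addresses the "inline variables used only once" remark.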

dstftw · 2017-10-28T18:35:19Z

youtube_dl/extractor/seznamzpravy.py

+            resolution = fmtdata.get('resolution')
+            formats.append({
+                'format_id': fmt,
+                'width': int_or_none(resolution[0]) if resolution is not None else None,


Breaks extraction if resolution is not a list.

fixed; poke me if you prefer some other way than TypeError

dstftw · 2017-10-28T18:37:24Z

youtube_dl/extractor/seznamzpravy.py

+                'format_id': fmt,
+                'width': int_or_none(resolution[0]) if resolution is not None else None,
+                'height': int_or_none(resolution[1]) if resolution is not None else None,
+                'url': urljoin(sdn_url, fmtdata['url']),


Breaks extraction if a format has no url key.
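
One way to avoid the hard failure is to read the key with `.get()` and skip formats that lack it. The following is a standalone sketch under that assumption, not the extractor's actual code; `collect_formats` and the sample data are hypothetical, and it uses the Python 3 standard library `urljoin` in place of youtube-dl's own helper.

```python
from urllib.parse import urljoin


def collect_formats(sdn_url, mp4_data):
    """Build format dicts, silently skipping entries without a usable URL."""
    formats = []
    for format_id, fmtdata in mp4_data.items():
        # Guard against both a missing 'url' key and a non-dict entry.
        relative_url = fmtdata.get('url') if isinstance(fmtdata, dict) else None
        if not relative_url:
            continue
        formats.append({
            'format_id': format_id,
            'url': urljoin(sdn_url, relative_url),
        })
    return formats


sample = {'720p': {'url': 'video_720p.mp4'}, '1080p': {}}
print(collect_formats('https://cdn.example.invalid/v_39/vmd_x/123?fl=a', sample))
```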

dstftw · 2017-10-28T18:37:32Z

youtube_dl/extractor/seznamzpravy.py

+                'url': urljoin(sdn_url, fmtdata['url']),
+            })
+
+        formats.sort(key=lambda x: x['height'])


_sort_formats.

now with _sort_formats
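
As background on why the hand-rolled sort is fragile: the quoted snippet can leave 'height' as None, and on Python 3 comparing None with an int raises TypeError. The sketch below only illustrates that failure mode with a None-tolerant key; it is not what `_sort_formats` does internally, and `sort_by_height` is a hypothetical name.

```python
def sort_by_height(formats):
    # Missing or None heights are treated as -1 so they sort first
    # instead of raising TypeError when compared with integers.
    return sorted(formats, key=lambda f: f.get('height') or -1)


formats = [
    {'format_id': 'b', 'height': 720},
    {'format_id': 'a', 'height': None},
    {'format_id': 'c', 'height': 360},
]
print([f['format_id'] for f in sort_by_height(formats)])
```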

dstftw · 2017-10-28T18:38:11Z

youtube_dl/extractor/seznamzpravy.py

+            'ext': 'mp4',
+            'title': 'Předseda KDU-ČSL Pavel Bělobrádek ve volební Výzvě Seznamu.',
+            'description': 'Předvolební rozhovory s lídry deseti hlavních stran pokračují. Ve Výzvě Jindřicha Šídla odpovídal předseda lidovců Pavel Bělobrádek.',
+        }


All duplicate tests should be removed. All non-duplicate tests must have comments stating what they actually test.

Second test can return at least two videos which are not binary identical, so removing the checksum.

dstftw · 2017-10-28T22:11:51Z

youtube_dl/extractor/seznamzpravy.py

+            'ext': 'mp4',
+            'title': 'Svět bez obalu: Rozhovor s Václavem Marhoulem o zahraničních vojenských misích a aktivních zálohách.',
+            'description': 'O nasazení českých vojáků v zahraničí. Marhoul by na mise posílal i zálohy. „Nejdříve se ale musí vycvičit,“ říká.',
+        }


There are two videos on this page.

dstftw · 2017-10-28T22:14:29Z

youtube_dl/extractor/seznamzpravy.py

+            sdn_url = self._download_json(data['caption']['liveStreamUrl'] + self._MAGIC_SUFFIX, video_id)['Location']
+
+        formats = []
+        for fmt, fmtdata in self._download_json(sdn_url, video_id)['data']['mp4'].items():


HLS and DASH formats should also be extracted.

dstftw · 2017-10-28T22:16:09Z

youtube_dl/extractor/seznamzpravy.py

+
+            try:
+                width, height = fmtdata.get('resolution')
+            except TypeError:


Nothing changed, still may break.
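
One possible reading of this remark: catching only TypeError covers `None`, but unpacking a list of the wrong length raises ValueError, and a two-character string unpacks without any error at all. A defensive sketch that checks the shape explicitly is shown below; `parse_resolution` is a hypothetical helper, and `int_or_none` here is a simplified stand-in for youtube-dl's utility of the same name.

```python
def int_or_none(value):
    # Simplified stand-in for youtube_dl.utils.int_or_none.
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def parse_resolution(resolution):
    """Return (width, height), or (None, None) for any unexpected shape."""
    if isinstance(resolution, (list, tuple)) and len(resolution) == 2:
        return int_or_none(resolution[0]), int_or_none(resolution[1])
    return None, None


print(parse_resolution([1280, 720]))
print(parse_resolution(None))
```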

dstftw · 2017-10-28T22:16:26Z

youtube_dl/extractor/seznamzpravy.py

+
+        return {
+            'id': video_id,
+            'title': data['captionTitle'],


This should be extracted very first.

Includes extension of generic MPD extractor and few more fixes per dstftw.

che0 · 2017-10-29T20:41:52Z

I attempted to extract HLS and DASH. The generic _parse_mpd_formats did not seem to work because the MPD has segment URLs but no SegmentTimeline, so I extended the code to support that. I don't know whether this is up to youtube-dl standards, so I'm posting the patch now to get early feedback, even though it does not yet address the "two videos on one page" issue.

An example of such an MPD is at http://v39-a.sdn.szn.cz/v_39/vmd_5999c902ea707c67d8e267a9/1503250723?fl=mdk,432f65a0|dash2,,0
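
To illustrate the kind of manifest being described, here is a rough standalone sketch that collects fragment paths from a DASH SegmentList with explicit SegmentURL entries. The sample manifest is invented for illustration (it is not the content of the URL above), `segment_list_fragments` is a hypothetical name, and the actual change to `_parse_mpd_formats` in the patch may differ considerably.

```python
import xml.etree.ElementTree as ET

NS = {'mpd': 'urn:mpeg:dash:schema:mpd:2011'}

# Invented sample: segments are listed explicitly, with no SegmentTimeline.
SAMPLE_MPD = '''<MPD xmlns="urn:mpeg:dash:schema:mpd:2011">
  <Period>
    <AdaptationSet mimeType="video/mp4">
      <Representation id="video-1" bandwidth="500000">
        <SegmentList>
          <Initialization sourceURL="init.mp4"/>
          <SegmentURL media="seg-1.m4s"/>
          <SegmentURL media="seg-2.m4s"/>
        </SegmentList>
      </Representation>
    </AdaptationSet>
  </Period>
</MPD>'''


def segment_list_fragments(mpd_xml):
    """Map each Representation id to its ordered list of fragment paths."""
    root = ET.fromstring(mpd_xml)
    result = {}
    for representation in root.iterfind('.//mpd:Representation', NS):
        segment_list = representation.find('mpd:SegmentList', NS)
        if segment_list is None:
            continue
        fragments = []
        initialization = segment_list.find('mpd:Initialization', NS)
        if initialization is not None and initialization.get('sourceURL'):
            fragments.append({'path': initialization.get('sourceURL')})
        for segment_url in segment_list.iterfind('mpd:SegmentURL', NS):
            media = segment_url.get('media')
            if media:
                fragments.append({'path': media})
        result[representation.get('id')] = fragments
    return result


print(segment_list_fragments(SAMPLE_MPD))
```

The sketch always uses a 'path' key for simplicity; it does not resolve BaseURL elements or distinguish absolute from relative segment URLs.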

che0 · 2017-10-29T20:49:00Z

youtube_dl/extractor/seznamzpravy.py

+        'url': 'https://www.seznam.cz/zpravy/clanek/jejich-svet-na-nas-utoci-je-lepsi-branit-se-na-jejich-pisecku-rika-reziser-a-major-v-zaloze-marhoul-35990',
+        'params': {'skip_download': True},
+        # ^ this is here instead of 'file_minsize': 1586, which does not work because
+        #   test_download.py forces expected_minsize to at least 10k when test is running


I'd appreciate some hints on better workarounds for minimal file size in tests. This video has an initializing segment that is 1586 bytes long, but I can't put that here because test_download.py:219 seems to force the size to at least 10k.
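
The behaviour in question can be reproduced in isolation. The sketch below mirrors the `file_minsize` logic of test/test_download.py as quoted in this thread; `effective_minsize` is a hypothetical name and the surrounding test harness is omitted.

```python
def effective_minsize(file_minsize, test_mode, floor=10000):
    """Return the minimum size the download test would actually enforce."""
    if file_minsize is None:
        return None
    if test_mode:
        # max() raises any smaller declared value up to the floor,
        # so a 1586-byte expectation is silently replaced by 10000.
        return max(file_minsize, floor)
    return file_minsize


print(effective_minsize(1586, test_mode=True))
print(effective_minsize(1586, test_mode=False))
```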

breznak · 2017-10-29T20:50:26Z

Thank you very much @che0 for your work on this PR! 👍

Also use primarily title instead of captionTitle

Also removed workaround in seznamzpravy extractor test.

che0 · 2017-11-06T10:44:39Z

I'd like to move a bit forward with this. Is there anything else that should be changed?

dstftw · 2017-11-06T14:11:14Z

test/test_download.py

@@ -216,7 +216,7 @@ def try_rm_tcs_files(tcs=None):
                    expected_minsize = tc.get('file_minsize', 10000)
                    if expected_minsize is not None:
                        if params.get('test'):
-                            expected_minsize = max(expected_minsize, 10000)
+                            expected_minsize = min(expected_minsize, 10000)


dstftw · 2017-11-06T14:39:55Z

youtube_dl/extractor/common.py

+                                fragments.append({
+                                    location_key(segment_url): segment_url,
+                                })
+                            representation_ms_info['fragments'] = fragments


Move to a separate PR. Add a test.

dstftw · 2017-11-06T14:40:19Z

youtube_dl/extractor/seznamzpravy.py

+    def _extract_sdn_formats(self, sdn_url, video_id):
+        sdn_data = self._download_json(sdn_url, video_id)
+        formats = []
+        for fmt, fmtdata in sdn_data.get('data', {}).get('mp4', {}).items():


dstftw · 2017-11-06T14:40:31Z

youtube_dl/extractor/seznamzpravy.py

+            })
+
+        playlists = sdn_data.get('pls', {})
+        dash_rel_url = playlists.get('dash', {}).get('url')


dstftw · 2017-11-06T14:40:55Z

youtube_dl/extractor/seznamzpravy.py

+        if dash_rel_url:
+            formats.extend(self._extract_mpd_formats(urljoin(sdn_url, dash_rel_url), video_id, mpd_id='dash', fatal=False))
+
+        hls_rel_url = playlists.get('hls', {}).get('url')


dstftw · 2017-11-06T15:15:03Z

youtube_dl/extractor/seznamzpravy.py

+            if not sdn_url_part or not title:
+                continue
+
+            entry_id = '%s-%s' % (video_id, num)


That's not an id. An id identifies a video, not an article.
The whole current extraction approach is incorrect. What should be done instead is a separate extractor for the iframe URL from

<iframe src="https://www.seznam.cz/zpravy/iframe/player?duration=241&serviceSlug=zpravy&src=https%3A%2F%2Fv39-a.sdn.szn.cz%2Fv_39%2Fvmd%2F5999c902ea707c67d8e267a9%3Ffl%3Dmdk%2C432f65a0%7C&itemType=video&autoPlay=false&title=Sv%C4%9Bt%20bez%20obalu%3A%20%C4%8Ce%C5%A1t%C3%AD%20voj%C3%A1ci%20na%20mis%C3%ADch%20(kr%C3%A1tk%C3%A1%20verze)&series=Sv%C4%9Bt%20bez%20obalu&serviceName=Seznam%20Zpr%C3%A1vy&poster=%2F%2Fd39-a.sdn.szn.cz%2Fd_39%2Fc_img_F_I%2FR5puJ.jpeg%3Ffl%3Dcro%2C0%2C0%2C1920%2C1080%7Cres%2C1200%2C%2C1%7Cjpg%2C80%2C%2C1&width=1920&height=1080&cutFrom=0&cutTo=0&splVersion=VOD&contentId=170889&contextId=35990&showAdvert=true&collocation=&autoplayPossible=true&embed=&isVideoTooShortForPreroll=false&isVideoTooLongForPostroll=true&videoCommentOpKey=&videoCommentId=&version=4.0.76&dotService=zpravy&gemiusPrismIdentifier=bVc1ZIb_Qax4W2v5xOPGpMeCP31kFfrTzj0SqPTLh_b.Z7&zoneIdPreroll=seznam.pack.videospot&skipOffsetPreroll=5&sectionPrefixPreroll=%2Fzpravy"

and an article extractor that finds iframes and delegates to the iframe extractor (or an addition to the generic extractor instead, if such videos can be embedded on any site).
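
A rough sketch of what the iframe side could read from such a URL, using only the Python 3 standard library. The sample URL is an abbreviated form of the iframe src quoted above, `parse_player_iframe` is a hypothetical name, and the id derivation (last path segment of `src`) is an assumption rather than a confirmed rule of the site.

```python
from urllib.parse import parse_qs, urlparse

# Abbreviated version of the iframe src quoted in the review comment.
SAMPLE_IFRAME_URL = (
    'https://www.seznam.cz/zpravy/iframe/player?duration=241'
    '&src=https%3A%2F%2Fv39-a.sdn.szn.cz%2Fv_39%2Fvmd%2F5999c902ea707c67d8e267a9'
    '%3Ffl%3Dmdk%2C432f65a0%7C'
    '&title=Sv%C4%9Bt%20bez%20obalu&contentId=170889'
)


def parse_player_iframe(iframe_url):
    """Extract the stream source and basic metadata from a player iframe URL."""
    query = parse_qs(urlparse(iframe_url).query)

    def first(name):
        values = query.get(name)
        return values[0] if values else None

    src = first('src')
    duration = first('duration')
    return {
        'src': src,
        # Assumption: the last path segment of src identifies the video.
        'raw_id': urlparse(src).path.split('/')[-1] if src else None,
        'title': first('title'),
        'duration': int(duration) if duration and duration.isdigit() else None,
    }


print(parse_player_iframe(SAMPLE_IFRAME_URL))
```

With this split, an article extractor only has to find such iframe URLs in the page and hand each one to the iframe extractor.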

This reverts commit 87c82b2.

dstftw · 2017-12-17T11:52:52Z

youtube_dl/extractor/seznamzpravy.py

+
+    def _iframe_result(self, info_dict):
+        video_id = info_dict['id'] or self._raw_id(info_dict['src'])
+        url = 'https://www.seznam.cz/zpravy/iframe/player?%s' % compat_urllib_parse_urlencode({


update_url_query.
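
For readers unfamiliar with the helper: `update_url_query` in `youtube_dl.utils` merges a dict of parameters into a URL's query string, which avoids assembling the query by hand with `%` and `urlencode`. The following is a standard-library approximation of that behaviour for illustration only; `update_url_query_sketch` is a hypothetical name and may not match the real helper in every edge case.

```python
from urllib.parse import parse_qs, urlencode, urlparse, urlunparse


def update_url_query_sketch(url, query):
    """Return url with the given parameters merged into its query string."""
    if not query:
        return url
    parsed = urlparse(url)
    merged = parse_qs(parsed.query)  # existing parameters, as lists
    merged.update(query)             # new values override existing ones
    return urlunparse(parsed._replace(query=urlencode(merged, doseq=True)))


print(update_url_query_sketch(
    'https://www.seznam.cz/zpravy/iframe/player',
    {'src': 'https://example.invalid/v?fl=a', 'title': 'x'}))
```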

dstftw · 2017-12-17T11:55:46Z

youtube_dl/extractor/seznamzpravy.py

+
+
+class SeznamZpravyGenericIE(InfoExtractor):
+    _API_URL = 'https://apizpravy.seznam.cz/'


Should go to SeznamZpravyArticleIE.

dstftw · 2017-12-17T11:56:05Z

youtube_dl/extractor/seznamzpravy.py

+
+class SeznamZpravyGenericIE(InfoExtractor):
+    _API_URL = 'https://apizpravy.seznam.cz/'
+    _MAGIC_SUFFIX = 'spl2,2,VOD'


-> SeznamZpravyIframeIE.

dstftw · 2017-12-17T11:56:13Z

youtube_dl/extractor/seznamzpravy.py

+    _API_URL = 'https://apizpravy.seznam.cz/'
+    _MAGIC_SUFFIX = 'spl2,2,VOD'
+
+    def _extract_sdn_formats(self, sdn_url, video_id):


-> SeznamZpravyIframeIE.

dstftw · 2017-12-17T11:59:35Z

youtube_dl/extractor/seznamzpravy.py

+
+    def _extract_content(self, api_data):
+        entries = []
+        for num, item in enumerate(api_data.get('content', [])):


num unused.

dstftw · 2017-12-17T12:00:36Z

youtube_dl/extractor/seznamzpravy.py

+        return compat_urllib_parse_urlparse(src_url).path.split('/')[-1]
+
+
+class SeznamZpravyIframeIE(SeznamZpravyGenericIE):


SeznamZpravyIframeIE -> SeznamZpravyIE.

che0 · 2017-12-19T15:20:54Z

Now with more metadata, courtesy of @oskar456

breznak · 2018-01-28T21:36:10Z

Awesome work, @che0!
I owe you a beer, thanks 👏

[seznamzpravy] Add new extractor

ef5e6d9

che0 mentioned this pull request Oct 28, 2017

Add site-support for Zpravy Seznam.cz (CZech news site) #14102

Closed

4 tasks

dstftw requested changes Oct 28, 2017

che0 added 2 commits October 28, 2017 21:10

[seznamzpravy] Fixes per dstftw

3747cf1

[seznamzpravy] Removed sometimes-failing test md5

4d87299

Second test can return at least two videos which are not binary identical, so removing the checksum.

dstftw requested changes Oct 28, 2017

dstftw added the pending-fixes label Oct 28, 2017

[seznamzpravy] Parse HLS and DASH

defcd75

Includes extension of generic MPD extractor and few more fixes per dstftw.

che0 commented Oct 29, 2017

che0 added 4 commits October 29, 2017 23:44

[seznamzpravy] Parse multiple videos

255491d

Also use primarily title instead of captionTitle

[seznamzpravy] Fixed test

d13cb7d

Merge remote-tracking branch 'upstream/master'

b804f15

[test/test_download] In test we download 10000 bytes at max

87c82b2

Also removed workaround in seznamzpravy extractor test.

dstftw requested changes Nov 6, 2017

che0 added 2 commits November 24, 2017 23:29

[seznamzpravy] Updated API URL

bf4f780

Revert "[test/test_download] In test we download 10000 bytes at max"

8e189bb

This reverts commit 87c82b2.

che0 mentioned this pull request Nov 25, 2017

[extractor/common] parse MPD that has only Segment URLs #14844

Closed

9 tasks

Merge remote-tracking branch 'upstream/master'

3c211cf

che0 changed the title ~~[seznamzpravy] Add new extractor~~ WIP: [seznamzpravy] Add new extractor Nov 25, 2017

che0 changed the title ~~WIP: [seznamzpravy] Add new extractor~~ [WIP] [seznamzpravy] Add new extractor Nov 25, 2017

che0 added 3 commits November 25, 2017 02:34

[seznamzpravy] use try_get

16ca005

[seznamzpravy] Split to article and iframe extractor

548c008

Merge remote-tracking branch 'upstream/master'

65aedb7

che0 changed the title ~~[WIP] [seznamzpravy] Add new extractor~~ [seznamzpravy] Add new extractor Dec 4, 2017

che0 added 2 commits December 12, 2017 19:06

[seznamzpravy] New URL

6f20386

Merge remote-tracking branch 'upstream/master'

a197ff1

dstftw requested changes Dec 17, 2017

che0 and others added 4 commits December 18, 2017 20:10

[seznamzpravy] Fixes per dstftw

7999450

Merge remote-tracking branch 'upstream/master'

d4be9c6

[seznamzpravy] Add more metadata for SDN streams

0b74f2f

[seznamzpravy] A bit shorter duration+tbr extraction

e0f78d0

Merge remote-tracking branch 'upstream/master'

b92da88

dstftw merged commit 27940ca into ytdl-org:master Jan 27, 2018

dstftw added a commit that referenced this pull request Jan 27, 2018

[seznamzpravy] Improve and simplify (closes #14616)

3c3a07e

che0 mentioned this pull request Jan 29, 2018

Ignore missing attributes in MPD manifests. #14648

Closed

3 tasks

dstftw added a commit that referenced this pull request Feb 9, 2018

Credit @che0 for seznamzpravy (#14616) and dvtv (#15442)

cbfbf07
