New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[seznamzpravy] Add new extractor #14616
Conversation
youtube_dl/extractor/seznamzpravy.py
Outdated
|
||
def _real_extract(self, url): | ||
video_id = self._match_id(url) | ||
api_url = self._API_URL + 'v1/documents/{}'.format(video_id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
{}
won't work on python 2.6.
Each variable used only once should be inlined.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed, hopefully this doesn't make the code too complicated
youtube_dl/extractor/seznamzpravy.py
Outdated
resolution = fmtdata.get('resolution') | ||
formats.append({ | ||
'format_id': fmt, | ||
'width': int_or_none(resolution[0]) if resolution is not None else None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Breaks extraction if resolution
is not a list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed; poke me if you prefer some other way than TypeError
youtube_dl/extractor/seznamzpravy.py
Outdated
'format_id': fmt, | ||
'width': int_or_none(resolution[0]) if resolution is not None else None, | ||
'height': int_or_none(resolution[1]) if resolution is not None else None, | ||
'url': urljoin(sdn_url, fmtdata['url']), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Breaks extraction if no url
key for a format.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
youtube_dl/extractor/seznamzpravy.py
Outdated
'url': urljoin(sdn_url, fmtdata['url']), | ||
}) | ||
|
||
formats.sort(key=lambda x: x['height']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_sort_formats
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
now with _sort_formats
youtube_dl/extractor/seznamzpravy.py
Outdated
'ext': 'mp4', | ||
'title': 'Předseda KDU-ČSL Pavel Bělobrádek ve volební Výzvě Seznamu.', | ||
'description': 'Předvolební rozhovory s lídry deseti hlavních stran pokračují. Ve Výzvě Jindřicha Šídla odpovídal předseda lidovců Pavel Bělobrádek.', | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All duplicate tests should be removed. All non duplicate tests must have corresponding comments on what they actually test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
Second test can return at least two videos which are not binary identical, so removing the checksum.
youtube_dl/extractor/seznamzpravy.py
Outdated
'ext': 'mp4', | ||
'title': 'Svět bez obalu: Rozhovor s Václavem Marhoulem o zahraničních vojenských misích a aktivních zálohách.', | ||
'description': 'O nasazení českých vojáků v zahraničí. Marhoul by na mise posílal i zálohy. „Nejdříve se ale musí vycvičit,“ říká.', | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are two videos on this page.
youtube_dl/extractor/seznamzpravy.py
Outdated
sdn_url = self._download_json(data['caption']['liveStreamUrl'] + self._MAGIC_SUFFIX, video_id)['Location'] | ||
|
||
formats = [] | ||
for fmt, fmtdata in self._download_json(sdn_url, video_id)['data']['mp4'].items(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HLS and DASH formats should also be extracted.
youtube_dl/extractor/seznamzpravy.py
Outdated
|
||
try: | ||
width, height = fmtdata.get('resolution') | ||
except TypeError: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nothing changed, still may break.
youtube_dl/extractor/seznamzpravy.py
Outdated
|
||
return { | ||
'id': video_id, | ||
'title': data['captionTitle'], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be extracted very first.
Includes extension of generic MPD extractor and few more fixes per dstftw.
I attempted to extract HLS and DASH. Generic Example of such MPD is at http://v39-a.sdn.szn.cz/v_39/vmd_5999c902ea707c67d8e267a9/1503250723?fl=mdk,432f65a0|dash2,,0 |
youtube_dl/extractor/seznamzpravy.py
Outdated
'url': 'https://www.seznam.cz/zpravy/clanek/jejich-svet-na-nas-utoci-je-lepsi-branit-se-na-jejich-pisecku-rika-reziser-a-major-v-zaloze-marhoul-35990', | ||
'params': {'skip_download': True}, | ||
# ^ this is here instead of 'file_minsize': 1586, which does not work because | ||
# test_download.py forces expected_minsize to at least 10k when test is running |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd appreciate some hints on better workarounds for minimal file size in tests. This video has an initializing segment that is 1586 bytes long, but I can't put that here because test_download.py:219 seems to force the size to at least 10k.
Thank you very much @che0 for your work on this PR! 👍 |
Also use primarily title instead of captionTitle
Also removed workaround in seznamzpravy extractor test.
I'd like to move a bit forward with this. Is there anything else that should be changed? |
test/test_download.py
Outdated
@@ -216,7 +216,7 @@ def try_rm_tcs_files(tcs=None): | |||
expected_minsize = tc.get('file_minsize', 10000) | |||
if expected_minsize is not None: | |||
if params.get('test'): | |||
expected_minsize = max(expected_minsize, 10000) | |||
expected_minsize = min(expected_minsize, 10000) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No.
youtube_dl/extractor/common.py
Outdated
fragments.append({ | ||
location_key(segment_url): segment_url, | ||
}) | ||
representation_ms_info['fragments'] = fragments |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move to a separate PR. Add a test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
youtube_dl/extractor/seznamzpravy.py
Outdated
def _extract_sdn_formats(self, sdn_url, video_id): | ||
sdn_data = self._download_json(sdn_url, video_id) | ||
formats = [] | ||
for fmt, fmtdata in sdn_data.get('data', {}).get('mp4', {}).items(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
try_get
.
youtube_dl/extractor/seznamzpravy.py
Outdated
}) | ||
|
||
playlists = sdn_data.get('pls', {}) | ||
dash_rel_url = playlists.get('dash', {}).get('url') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
try_get
.
youtube_dl/extractor/seznamzpravy.py
Outdated
if dash_rel_url: | ||
formats.extend(self._extract_mpd_formats(urljoin(sdn_url, dash_rel_url), video_id, mpd_id='dash', fatal=False)) | ||
|
||
hls_rel_url = playlists.get('hls', {}).get('url') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
try_get
.
youtube_dl/extractor/seznamzpravy.py
Outdated
if not sdn_url_part or not title: | ||
continue | ||
|
||
entry_id = '%s-%s' % (video_id, num) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's not an id. Id is something that identifies a video not article.
The whole current extraction approach is incorrect. What should be done is a separate extractor for iframe URL from
<iframe src="https://www.seznam.cz/zpravy/iframe/player?duration=241&serviceSlug=zpravy&src=https%3A%2F%2Fv39-a.sdn.szn.cz%2Fv_39%2Fvmd%2F5999c902ea707c67d8e267a9%3Ffl%3Dmdk%2C432f65a0%7C&itemType=video&autoPlay=false&title=Sv%C4%9Bt%20bez%20obalu%3A%20%C4%8Ce%C5%A1t%C3%AD%20voj%C3%A1ci%20na%20mis%C3%ADch%20(kr%C3%A1tk%C3%A1%20verze)&series=Sv%C4%9Bt%20bez%20obalu&serviceName=Seznam%20Zpr%C3%A1vy&poster=%2F%2Fd39-a.sdn.szn.cz%2Fd_39%2Fc_img_F_I%2FR5puJ.jpeg%3Ffl%3Dcro%2C0%2C0%2C1920%2C1080%7Cres%2C1200%2C%2C1%7Cjpg%2C80%2C%2C1&width=1920&height=1080&cutFrom=0&cutTo=0&splVersion=VOD&contentId=170889&contextId=35990&showAdvert=true&collocation=&autoplayPossible=true&embed=&isVideoTooShortForPreroll=false&isVideoTooLongForPostroll=true&videoCommentOpKey=&videoCommentId=&version=4.0.76&dotService=zpravy&gemiusPrismIdentifier=bVc1ZIb_Qax4W2v5xOPGpMeCP31kFfrTzj0SqPTLh_b.Z7&zoneIdPreroll=seznam.pack.videospot&skipOffsetPreroll=5§ionPrefixPreroll=%2Fzpravy"
and an article extractor that finds iframes and delegates to iframe extractor (or an addition to generic extractor instead if such videos can be embedded on any site).
youtube_dl/extractor/seznamzpravy.py
Outdated
|
||
def _iframe_result(self, info_dict): | ||
video_id = info_dict['id'] or self._raw_id(info_dict['src']) | ||
url = 'https://www.seznam.cz/zpravy/iframe/player?%s' % compat_urllib_parse_urlencode({ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
update_url_query
.
youtube_dl/extractor/seznamzpravy.py
Outdated
|
||
|
||
class SeznamZpravyGenericIE(InfoExtractor): | ||
_API_URL = 'https://apizpravy.seznam.cz/' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should go to SeznamZpravyArticleIE
.
youtube_dl/extractor/seznamzpravy.py
Outdated
|
||
class SeznamZpravyGenericIE(InfoExtractor): | ||
_API_URL = 'https://apizpravy.seznam.cz/' | ||
_MAGIC_SUFFIX = 'spl2,2,VOD' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-> SeznamZpravyIframeIE
.
youtube_dl/extractor/seznamzpravy.py
Outdated
_API_URL = 'https://apizpravy.seznam.cz/' | ||
_MAGIC_SUFFIX = 'spl2,2,VOD' | ||
|
||
def _extract_sdn_formats(self, sdn_url, video_id): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-> SeznamZpravyIframeIE
.
youtube_dl/extractor/seznamzpravy.py
Outdated
|
||
def _extract_content(self, api_data): | ||
entries = [] | ||
for num, item in enumerate(api_data.get('content', [])): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
num
unused.
youtube_dl/extractor/seznamzpravy.py
Outdated
return compat_urllib_parse_urlparse(src_url).path.split('/')[-1] | ||
|
||
|
||
class SeznamZpravyIframeIE(SeznamZpravyGenericIE): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SeznamZpravyIframeIE
-> SeznamZpravyIE
.
Now with more metadata, courtesy of @oskar456 |
awesome work @che0 ! |
Please follow the guide below
x
into all the boxes [ ] relevant to your pull request (like that [x])Before submitting a pull request make sure you have:
In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:
What is the purpose of your pull request?
Add new extractor for Seznam Zprávy
This extractor extracts video from Seznam Zprávy, as requested in #14102.