Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[seznamzpravy] Add new extractor #14616

Merged
merged 21 commits into from Jan 27, 2018
Merged

[seznamzpravy] Add new extractor #14616

merged 21 commits into from Jan 27, 2018

Conversation

che0
Copy link
Contributor

@che0 che0 commented Oct 28, 2017

Please follow the guide below

  • You will be asked some questions, please read them carefully and answer honestly
  • Put an x into all the boxes [ ] relevant to your pull request (like that [x])
  • Use Preview tab to see how your pull request will actually look like

Before submitting a pull request make sure you have:

In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:

  • I am the original author of this code and I am willing to release it under Unlicense
  • I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

  • Bug fix
  • Improvement
  • New extractor
  • New feature

Add new extractor for Seznam Zprávy

This extractor extracts video from Seznam Zprávy, as requested in #14102.


def _real_extract(self, url):
video_id = self._match_id(url)
api_url = self._API_URL + 'v1/documents/{}'.format(video_id)
Copy link
Collaborator

@dstftw dstftw Oct 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

{} won't work on python 2.6.
Each variable used only once should be inlined.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed, hopefully this doesn't make the code too complicated

resolution = fmtdata.get('resolution')
formats.append({
'format_id': fmt,
'width': int_or_none(resolution[0]) if resolution is not None else None,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Breaks extraction if resolution is not a list.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed; poke me if you prefer some other way than TypeError

'format_id': fmt,
'width': int_or_none(resolution[0]) if resolution is not None else None,
'height': int_or_none(resolution[1]) if resolution is not None else None,
'url': urljoin(sdn_url, fmtdata['url']),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Breaks extraction if no url key for a format.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

'url': urljoin(sdn_url, fmtdata['url']),
})

formats.sort(key=lambda x: x['height'])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_sort_formats.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now with _sort_formats

'ext': 'mp4',
'title': 'Předseda KDU-ČSL Pavel Bělobrádek ve volební Výzvě Seznamu.',
'description': 'Předvolební rozhovory s lídry deseti hlavních stran pokračují. Ve Výzvě Jindřicha Šídla odpovídal předseda lidovců Pavel Bělobrádek.',
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All duplicate tests should be removed. All non duplicate tests must have corresponding comments on what they actually test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Second test can return at least two videos which are not
binary identical, so removing the checksum.
'ext': 'mp4',
'title': 'Svět bez obalu: Rozhovor s Václavem Marhoulem o zahraničních vojenských misích a aktivních zálohách.',
'description': 'O nasazení českých vojáků v zahraničí. Marhoul by na mise posílal i zálohy. „Nejdříve se ale musí vycvičit,“ říká.',
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two videos on this page.

sdn_url = self._download_json(data['caption']['liveStreamUrl'] + self._MAGIC_SUFFIX, video_id)['Location']

formats = []
for fmt, fmtdata in self._download_json(sdn_url, video_id)['data']['mp4'].items():
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HLS and DASH formats should also be extracted.


try:
width, height = fmtdata.get('resolution')
except TypeError:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing changed, still may break.


return {
'id': video_id,
'title': data['captionTitle'],
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be extracted very first.

Includes extension of generic MPD extractor and few more
fixes per dstftw.
@che0
Copy link
Contributor Author

che0 commented Oct 29, 2017

I attempted to extract HLS and DASH. Generic _parse_mpd_formats did not seem to work, because the MPD has segment URLs, but no SegmentTimeline, so I extended the code to support that. I don't know if this is going to be up to youtube-dl standards, so I'm posting the patch now to get earlier feedback, even though it does not address the "two videos on one page" issue yet.

Example of such MPD is at http://v39-a.sdn.szn.cz/v_39/vmd_5999c902ea707c67d8e267a9/1503250723?fl=mdk,432f65a0|dash2,,0

'url': 'https://www.seznam.cz/zpravy/clanek/jejich-svet-na-nas-utoci-je-lepsi-branit-se-na-jejich-pisecku-rika-reziser-a-major-v-zaloze-marhoul-35990',
'params': {'skip_download': True},
# ^ this is here instead of 'file_minsize': 1586, which does not work because
# test_download.py forces expected_minsize to at least 10k when test is running
Copy link
Contributor Author

@che0 che0 Oct 29, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd appreciate some hints on better workarounds for minimal file size in tests. This video has an initializing segment that is 1586 bytes long, but I can't put that here because test_download.py:219 seems to force the size to at least 10k.

@breznak
Copy link

breznak commented Oct 29, 2017

Thank you very much @che0 for your work on this PR! 👍

@che0
Copy link
Contributor Author

che0 commented Nov 6, 2017

I'd like to move a bit forward with this. Is there anything else that should be changed?

@@ -216,7 +216,7 @@ def try_rm_tcs_files(tcs=None):
expected_minsize = tc.get('file_minsize', 10000)
if expected_minsize is not None:
if params.get('test'):
expected_minsize = max(expected_minsize, 10000)
expected_minsize = min(expected_minsize, 10000)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No.

fragments.append({
location_key(segment_url): segment_url,
})
representation_ms_info['fragments'] = fragments
Copy link
Collaborator

@dstftw dstftw Nov 6, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move to a separate PR. Add a test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

def _extract_sdn_formats(self, sdn_url, video_id):
sdn_data = self._download_json(sdn_url, video_id)
formats = []
for fmt, fmtdata in sdn_data.get('data', {}).get('mp4', {}).items():
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try_get.

})

playlists = sdn_data.get('pls', {})
dash_rel_url = playlists.get('dash', {}).get('url')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try_get.

if dash_rel_url:
formats.extend(self._extract_mpd_formats(urljoin(sdn_url, dash_rel_url), video_id, mpd_id='dash', fatal=False))

hls_rel_url = playlists.get('hls', {}).get('url')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try_get.

if not sdn_url_part or not title:
continue

entry_id = '%s-%s' % (video_id, num)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's not an id. Id is something that identifies a video not article.
The whole current extraction approach is incorrect. What should be done is a separate extractor for iframe URL from

<iframe src="https://www.seznam.cz/zpravy/iframe/player?duration=241&serviceSlug=zpravy&src=https%3A%2F%2Fv39-a.sdn.szn.cz%2Fv_39%2Fvmd%2F5999c902ea707c67d8e267a9%3Ffl%3Dmdk%2C432f65a0%7C&itemType=video&autoPlay=false&title=Sv%C4%9Bt%20bez%20obalu%3A%20%C4%8Ce%C5%A1t%C3%AD%20voj%C3%A1ci%20na%20mis%C3%ADch%20(kr%C3%A1tk%C3%A1%20verze)&series=Sv%C4%9Bt%20bez%20obalu&serviceName=Seznam%20Zpr%C3%A1vy&poster=%2F%2Fd39-a.sdn.szn.cz%2Fd_39%2Fc_img_F_I%2FR5puJ.jpeg%3Ffl%3Dcro%2C0%2C0%2C1920%2C1080%7Cres%2C1200%2C%2C1%7Cjpg%2C80%2C%2C1&width=1920&height=1080&cutFrom=0&cutTo=0&splVersion=VOD&contentId=170889&contextId=35990&showAdvert=true&collocation=&autoplayPossible=true&embed=&isVideoTooShortForPreroll=false&isVideoTooLongForPostroll=true&videoCommentOpKey=&videoCommentId=&version=4.0.76&dotService=zpravy&gemiusPrismIdentifier=bVc1ZIb_Qax4W2v5xOPGpMeCP31kFfrTzj0SqPTLh_b.Z7&zoneIdPreroll=seznam.pack.videospot&skipOffsetPreroll=5&sectionPrefixPreroll=%2Fzpravy"

and an article extractor that finds iframes and delegates to iframe extractor (or an addition to generic extractor instead if such videos can be embedded on any site).

@che0 che0 changed the title [seznamzpravy] Add new extractor WIP: [seznamzpravy] Add new extractor Nov 25, 2017
@che0 che0 changed the title WIP: [seznamzpravy] Add new extractor [WIP] [seznamzpravy] Add new extractor Nov 25, 2017
@che0 che0 changed the title [WIP] [seznamzpravy] Add new extractor [seznamzpravy] Add new extractor Dec 4, 2017

def _iframe_result(self, info_dict):
video_id = info_dict['id'] or self._raw_id(info_dict['src'])
url = 'https://www.seznam.cz/zpravy/iframe/player?%s' % compat_urllib_parse_urlencode({
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update_url_query.



class SeznamZpravyGenericIE(InfoExtractor):
_API_URL = 'https://apizpravy.seznam.cz/'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should go to SeznamZpravyArticleIE.


class SeznamZpravyGenericIE(InfoExtractor):
_API_URL = 'https://apizpravy.seznam.cz/'
_MAGIC_SUFFIX = 'spl2,2,VOD'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-> SeznamZpravyIframeIE.

_API_URL = 'https://apizpravy.seznam.cz/'
_MAGIC_SUFFIX = 'spl2,2,VOD'

def _extract_sdn_formats(self, sdn_url, video_id):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-> SeznamZpravyIframeIE.


def _extract_content(self, api_data):
entries = []
for num, item in enumerate(api_data.get('content', [])):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

num unused.

return compat_urllib_parse_urlparse(src_url).path.split('/')[-1]


class SeznamZpravyIframeIE(SeznamZpravyGenericIE):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SeznamZpravyIframeIE -> SeznamZpravyIE.

@che0
Copy link
Contributor Author

che0 commented Dec 19, 2017

Now with more metadata, courtesy of @oskar456

@dstftw dstftw merged commit 27940ca into ytdl-org:master Jan 27, 2018
dstftw added a commit that referenced this pull request Jan 27, 2018
@breznak
Copy link

breznak commented Jan 28, 2018

awesome work @che0 !
mas u me pivo, dik 👏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants