Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[kan] Add new extractor #27959

Open
wants to merge 9 commits into
base: master
Choose a base branch
from
Open

[kan] Add new extractor #27959

wants to merge 9 commits into from

Conversation

yhager
Copy link

@yhager yhager commented Jan 25, 2021

Please follow the guide below

  • You will be asked some questions, please read them carefully and answer honestly
  • Put an x into all the boxes [ ] relevant to your pull request (like that [x])
  • Use Preview tab to see how your pull request will actually look like

Before submitting a pull request make sure you have:

In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:

  • I am the original author of this code and I am willing to release it under Unlicense
  • I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

  • Bug fix
  • Improvement
  • New extractor
  • New feature

Description of your pull request and other information

New extractor for kan.org.il.
Fixes #26551.

This is my first extractor, I tried to follow the guide the best way I can, please let me know if there are any issues and I will address them.



class KanIE(InfoExtractor):
_VALID_URL = r'https?://(?:www\.)?kan\.org\.il/(?:[iI]tem/\?item[iI]d|program/\?cat[iI]d)=(?P<id>[0-9]+)'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There must be two different extractors: for videos and for playlists.

creator = data.get('author', {}).get('name') or \
self._og_search_property('site_name', webpage, fatal=False)
thumbnail = get_thumbnail(data)
m3u8_url = data.get('content', {}).get('src')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is mandatory. Read coding conventions.

video_id)
title = data.get('title') or \
self._og_search_title(webpage) or \
self._html_search_regex(r'<title>([^<]+)</title>', webpage, 'title')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is never reachable.

self._html_search_regex(r'<title>([^<]+)</title>', webpage, 'title')
description = data.get('summary') or \
self._og_search_description(webpage, fatal=False)
creator = data.get('author', {}).get('name') or \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try_get.

m3u8_url = data.get('content', {}).get('src')
formats = self._extract_m3u8_formats(m3u8_url, video_id, ext='mp4')
return {
'_type': 'video',
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is default.


def _extract_list(self, list_id, webpage):
video_ids = re.findall(r'onclick="playVideo\(.*,\'([0-9]+)\'\)', webpage)
title = self._og_search_title(webpage)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Playlist title is optional.

creator = try_get(data, lambda x: x['author']['name'], str) or \
self._og_search_property('site_name', webpage, fatal=False)
thumbnail = get_thumbnail(data)
m3u8_url = try_get(data, lambda x: x['content']['src'], str)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing changed.

Comment on lines 48 to 49
if not m3u8_url:
raise ExtractorError('Unable to extract m3u8 url')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove.

Comment on lines 33 to 40
data = self._parse_json(
self._search_regex(
r'<script id="kan_app_search_data" type="application/json">([^<]+)</script>',
webpage,
'data',
),
video_id,
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove excessive verbosity. Read coding conventions.

title = data.get('title') or self._og_search_title(webpage)
description = data.get('summary') or \
self._og_search_description(webpage, fatal=False)
creator = try_get(data, lambda x: x['author']['name'], str) or \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

compat_str

'id': video_id,
'title': title,
'thumbnail': thumbnail,
'formats': self._extract_m3u8_formats(m3u8_url, video_id, ext='mp4'),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This must be done right after title extraction.

'description': description,
'creator': creator,
'release_date': unified_strdate(data.get('published')),
'duration': parse_duration(data.get('extensions', {}).get('duration')),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try_get

video_ids = re.findall(r'onclick="playVideo\(.*,\'([0-9]+)\'\)', webpage)
entries = []
for video_id in video_ids:
url = 'https://www.kan.org.il/Item/?itemId=%s' % video_id
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do not shadow input url..

@yhager yhager requested a review from dstftw February 13, 2021 23:53
@yhager
Copy link
Author

yhager commented Apr 10, 2021

Is there anything else I can do in order to get this extension merged?

@aarubui
Copy link

aarubui commented May 19, 2021

For a start, follow "Trailing parentheses" section from the readme. Review get_thumbnail as it's defined outside the class and also only called from one place. skip_download for the tests needing ffmpeg. Unless it's just for me, geo bypass doesn't work so we can get rid of _GEO_COUNTRIES and geo_verification_headers() and skip the tests.

@yhager
Copy link
Author

yhager commented May 19, 2021

@aarubui thanks for your review. I've fixed the trailing parens, added skip_download and removed _GEO_COUNTRIES. I'm not sure what you want to do with get_thumbnail(). Should I move this function to be in the class? or should I just unpack it and extract the thumbnail directly in the calling code?

@aarubui
Copy link

aarubui commented May 19, 2021

  • It's more common to use self.playlist_result() but I don't think you have to.
  • download_webpage is now identical to _download_webpage and redundant.
  • For get_thumbnail, I can see you need it as a method to break out of multiple loops. I've just noticed that, in other extractors, there are examples of functions defined both in and out of the classes. So you can ignore what I said and leave it as it is.
  • "Trailing parentheses" wasn't for the import. It was fine before.
  • If there's a risk of < appearing in the JSON, as part of the summary for example, you can try using (.+?) with flags=re.DOTALL.
  • The site was geo-restricted to Israel yesterday but not today. You'll need skip should that change back. You're fine for now.

Good luck with the real review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Request for site support: kan.org.il
3 participants