[kan] Add new extractor #27959

yhager · 2021-01-25T08:12:22Z

Please follow the guide below

You will be asked some questions, please read them carefully and answer honestly
Put an x into all the boxes [ ] relevant to your pull request (like that [x])
Use the Preview tab to see how your pull request will actually look

Before submitting a pull request make sure you have:

Searched the bugtracker for similar pull requests
Read adding new extractor tutorial
Read youtube-dl coding conventions and adjusted the code to meet them
Covered the code with tests (note that PRs without tests will be REJECTED)
Checked the code with flake8

In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:

I am the original author of this code and I am willing to release it under Unlicense
I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

Bug fix
Improvement
New extractor
New feature

Description of your pull request and other information

New extractor for kan.org.il.
Fixes #26551.

This is my first extractor. I tried to follow the guide as best I could; please let me know if there are any issues and I will address them.

dstftw · 2021-01-27T20:04:53Z

youtube_dl/extractor/kan.py

+
+
+class KanIE(InfoExtractor):
+    _VALID_URL = r'https?://(?:www\.)?kan\.org\.il/(?:[iI]tem/\?item[iI]d|program/\?cat[iI]d)=(?P<id>[0-9]+)'


There must be two different extractors: for videos and for playlists.
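A minimal sketch of what the split could look like, using only the `re` module so it runs without youtube-dl. The names `ITEM_URL_RE`, `PROGRAM_URL_RE` and `classify` are hypothetical, and the IDs in the sample URLs are made up; in the extractor itself each pattern would presumably become the `_VALID_URL` of its own `InfoExtractor` subclass.

```python
import re

# Hypothetical split of the combined pattern: one regex per extractor.
ITEM_URL_RE = re.compile(
    r'https?://(?:www\.)?kan\.org\.il/[iI]tem/\?item[iI]d=(?P<id>[0-9]+)')
PROGRAM_URL_RE = re.compile(
    r'https?://(?:www\.)?kan\.org\.il/program/\?cat[iI]d=(?P<id>[0-9]+)')


def classify(url):
    """Return (kind, id) for a kan.org.il URL, or None if neither pattern matches."""
    for kind, pattern in (('video', ITEM_URL_RE), ('playlist', PROGRAM_URL_RE)):
        m = pattern.match(url)
        if m:
            return kind, m.group('id')
    return None


# Sample URLs with made-up IDs.
print(classify('https://www.kan.org.il/Item/?itemId=12345'))
print(classify('https://www.kan.org.il/program/?catId=678'))
```

With separate patterns, each extractor only ever sees URLs of its own kind, so no branching on the URL shape is needed inside the extraction code.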

dstftw · 2021-01-27T20:05:32Z

youtube_dl/extractor/kan.py

+        creator = data.get('author', {}).get('name') or \
+            self._og_search_property('site_name', webpage, fatal=False)
+        thumbnail = get_thumbnail(data)
+        m3u8_url = data.get('content', {}).get('src')


This is mandatory. Read coding conventions.
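One reading of this comment, based on the distinction the coding conventions draw between mandatory and optional fields, is that the stream URL should not be looked up with a tolerant `.get()` chain, because a missing value there should stop extraction rather than pass `None` along. The sketch below illustrates the difference on a plain dict; the function names and the sample `data` are hypothetical stand-ins, not code from the pull request.

```python
# Stand-in for the parsed JSON; the keys mirror the ones used in the diff.
data = {'title': 'Example', 'author': {'name': 'Kan'}}  # note: no 'content' key


def optional_creator(data):
    # Optional field: tolerate missing keys and return None.
    return (data.get('author') or {}).get('name')


def mandatory_m3u8_url(data):
    # Mandatory field: plain indexing raises KeyError when a key is absent.
    return data['content']['src']


print(optional_creator(data))
try:
    mandatory_m3u8_url(data)
except KeyError as e:
    print('missing mandatory key: %s' % e)
```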

dstftw · 2021-01-27T20:06:57Z

youtube_dl/extractor/kan.py

+            video_id)
+        title = data.get('title') or \
+            self._og_search_title(webpage) or \
+            self._html_search_regex(r'<title>([^<]+)</title>', webpage, 'title')


This is never reachable.
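The likely reason, assuming `_og_search_title` is fatal by default (it raises when nothing matches unless `fatal=False` is passed): the call either returns a title or raises, so the `or` never gets to evaluate the `<title>` fallback. The sketch below reproduces that behaviour with a hypothetical `search` helper and a made-up page; it is a stand-in, not the youtube-dl implementation.

```python
import re


class NotFound(Exception):
    pass


def search(pattern, text, name, fatal=True):
    """Hypothetical stand-in for a fatal-by-default search helper."""
    m = re.search(pattern, text)
    if m:
        return m.group(1)
    if fatal:
        raise NotFound('Unable to extract %s' % name)
    return None


# Made-up page with a <title> but no og:title meta tag.
webpage = '<html><head><title>Fallback title</title></head></html>'
og_title = r'<meta property="og:title" content="([^"]+)"'
html_title = r'<title>([^<]+)</title>'

# Fatal first lookup: it raises before the `or` can evaluate the fallback.
try:
    title = search(og_title, webpage, 'og:title') or search(html_title, webpage, 'title')
except NotFound:
    title = None
print(title)

# Non-fatal first lookup: it returns None, so the fallback runs.
title = search(og_title, webpage, 'og:title', fatal=False) or search(html_title, webpage, 'title')
print(title)
```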

dstftw · 2021-01-27T20:07:16Z

youtube_dl/extractor/kan.py

+            self._html_search_regex(r'<title>([^<]+)</title>', webpage, 'title')
+        description = data.get('summary') or \
+            self._og_search_description(webpage, fatal=False)
+        creator = data.get('author', {}).get('name') or \


dstftw · 2021-01-27T20:08:13Z

youtube_dl/extractor/kan.py

+        m3u8_url = data.get('content', {}).get('src')
+        formats = self._extract_m3u8_formats(m3u8_url, video_id, ext='mp4')
+        return {
+            '_type': 'video',


This is default.

dstftw · 2021-01-27T20:08:46Z

youtube_dl/extractor/kan.py

+
+    def _extract_list(self, list_id, webpage):
+        video_ids = re.findall(r'onclick="playVideo\(.*,\'([0-9]+)\'\)', webpage)
+        title = self._og_search_title(webpage)


Playlist title is optional.

dstftw · 2021-02-13T23:00:11Z

youtube_dl/extractor/kan.py

+        creator = try_get(data, lambda x: x['author']['name'], str) or \
+            self._og_search_property('site_name', webpage, fatal=False)
+        thumbnail = get_thumbnail(data)
+        m3u8_url = try_get(data, lambda x: x['content']['src'], str)


Nothing changed.

dstftw · 2021-02-13T23:00:17Z

youtube_dl/extractor/kan.py

+        if not m3u8_url:
+            raise ExtractorError('Unable to extract m3u8 url')


dstftw · 2021-02-13T23:00:57Z

youtube_dl/extractor/kan.py

+        data = self._parse_json(
+            self._search_regex(
+                r'<script id="kan_app_search_data" type="application/json">([^<]+)</script>',
+                webpage,
+                'data',
+            ),
+            video_id,
+        )


Remove excessive verbosity. Read coding conventions.

dstftw · 2021-02-13T23:01:16Z

youtube_dl/extractor/kan.py

+        title = data.get('title') or self._og_search_title(webpage)
+        description = data.get('summary') or \
+            self._og_search_description(webpage, fatal=False)
+        creator = try_get(data, lambda x: x['author']['name'], str) or \


dstftw · 2021-02-13T23:02:06Z

youtube_dl/extractor/kan.py

+            'id': video_id,
+            'title': title,
+            'thumbnail': thumbnail,
+            'formats': self._extract_m3u8_formats(m3u8_url, video_id, ext='mp4'),


This must be done right after title extraction.

dstftw · 2021-02-13T23:02:20Z

youtube_dl/extractor/kan.py

+            'description': description,
+            'creator': creator,
+            'release_date': unified_strdate(data.get('published')),
+            'duration': parse_duration(data.get('extensions', {}).get('duration')),


dstftw · 2021-02-13T23:03:38Z

youtube_dl/extractor/kan.py

+        video_ids = re.findall(r'onclick="playVideo\(.*,\'([0-9]+)\'\)', webpage)
+        entries = []
+        for video_id in video_ids:
+            url = 'https://www.kan.org.il/Item/?itemId=%s' % video_id


Do not shadow the input url.
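A small sketch of the fix, assuming the loop sits in a method that already receives the page address as `url`: giving the per-item address its own name leaves the original value intact. `build_entries` and `item_url` are hypothetical names, and the entries are plain dicts here; in the extractor they would more likely be built with helpers such as `self.url_result()`.

```python
def build_entries(video_ids):
    """Build url-type playlist entries without reusing the name `url`."""
    entries = []
    for video_id in video_ids:
        # `item_url` is a hypothetical name; any name other than `url` works.
        item_url = 'https://www.kan.org.il/Item/?itemId=%s' % video_id
        entries.append({'_type': 'url', 'url': item_url, 'id': video_id})
    return entries


# Made-up IDs for illustration.
print(build_entries(['101', '102']))
```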

yhager · 2021-04-10T19:12:45Z

Is there anything else I can do in order to get this extension merged?

aarubui · 2021-05-19T01:12:21Z

For a start, follow the "Trailing parentheses" section of the readme. Review get_thumbnail, as it's defined outside the class and only called from one place. Add skip_download for the tests needing ffmpeg. Unless it's just me, geo bypass doesn't work, so we can get rid of _GEO_COUNTRIES and geo_verification_headers() and skip the tests.

yhager · 2021-05-19T14:40:11Z

@aarubui thanks for your review. I've fixed the trailing parens, added skip_download and removed _GEO_COUNTRIES. I'm not sure what you want to do with get_thumbnail(). Should I move this function into the class, or should I just unpack it and extract the thumbnail directly in the calling code?

aarubui · 2021-05-19T21:26:09Z

It's more common to use self.playlist_result() but I don't think you have to.
download_webpage is now identical to _download_webpage and redundant.
For get_thumbnail, I can see you need it as a method to break out of multiple loops. I've just noticed that, in other extractors, there are examples of functions defined both in and out of the classes. So you can ignore what I said and leave it as it is.
"Trailing parentheses" wasn't for the import. It was fine before.
If there's a risk of < appearing in the JSON, as part of the summary for example, you can try using (.+?) with flags=re.DOTALL.
The site was geo-restricted to Israel yesterday but not today. You'll need skip should that change back. You're fine for now.

Good luck with the real review.
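The `(.+?)` with `flags=re.DOTALL` suggestion above can be checked with the standard `re` module alone. This is a minimal sketch on a made-up page whose summary contains a literal `<`; the variable names `strict` and `lazy` are illustrative, and the script tag is the one quoted in the diff.

```python
import json
import re

# Made-up page: the summary contains a literal '<' and the JSON spans lines.
webpage = '''<script id="kan_app_search_data" type="application/json">{
  "title": "Example",
  "summary": "a < b"
}</script>'''

strict = r'<script id="kan_app_search_data" type="application/json">([^<]+)</script>'
lazy = r'<script id="kan_app_search_data" type="application/json">(.+?)</script>'

# The '<' inside the JSON stops [^<]+ before the closing tag, so this finds nothing.
print(re.search(strict, webpage))

# The lazy group stops at the first </script>; DOTALL lets '.' cross newlines.
m = re.search(lazy, webpage, flags=re.DOTALL)
data = json.loads(m.group(1))
print(data['summary'])
```

The lazy pattern would still cut the match short if the JSON itself contained the text `</script>`, so it is a pragmatic improvement rather than a full parser.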

yhager added 3 commits January 25, 2021 00:06

[kan] Add new extractor

fecc1dc

typo fix

5507979

minor fixes

e3a900e

dstftw requested changes Jan 27, 2021

dstftw added the pending-fixes label Jan 27, 2021

code review fixes

e6c7b3c

yhager requested a review from dstftw January 28, 2021 04:11

cypheron mentioned this pull request Feb 3, 2021

Evaluation / overview of new proposed extractors / sites #28054

Open

dstftw requested changes Feb 13, 2021

code review fixes

c0fd80c

yhager requested a review from dstftw February 13, 2021 23:53

fix typo

440aba2

yhager added 3 commits May 18, 2021 18:17

fix trailing parentheses

279539e

add skip_download to tests using ffmpeg

803b071

remove geo_countries

57c3cb4

dirkf force-pushed the master branch from 01bf89e to 4c6fba3 August 26, 2022 07:51
