[duboku] add new extractor #26467

lkho · 2020-08-29T07:38:30Z

Before submitting a pull request make sure you have:

At least skimmed through adding new extractor tutorial and youtube-dl coding conventions sections
Searched the bugtracker for similar pull requests
Checked the code with flake8

In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:

I am the original author of this code and I am willing to release it under Unlicense

What is the purpose of your pull request?

New extractor

Resolves #22125.

dirkf

Passes tests with all suggested changes applied on top of 530f458.

I'll apply them if you like.

dirkf · 2022-06-07T00:17:47Z

youtube_dl/extractor/duboku.py

+    IE_NAME = 'duboku'
+    IE_DESC = 'www.duboku.co'
+
+    _VALID_URL = r'(?:https?://[^/]+\.duboku\.co/vodplay/)(?P<id>[0-9]+-[0-9-]+)\.html.*'


Require n-n-n in id field; no need to match the tail:

Suggested change

-    _VALID_URL = r'(?:https?://[^/]+\.duboku\.co/vodplay/)(?P<id>[0-9]+-[0-9-]+)\.html.*'
+    _VALID_URL = r'(?:https?://[^/]+\.duboku\.co/vodplay/)(?P<id>(?:[0-9]+-){2}[0-9]+)\.html'
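A quick way to see what the tighter pattern changes is to run both against sample URLs. The sketch below uses only the standard `re` module; `match_id` is a hypothetical helper and the two-part URL is made up for illustration.

```python
import re

OLD = r'(?:https?://[^/]+\.duboku\.co/vodplay/)(?P<id>[0-9]+-[0-9-]+)\.html.*'
NEW = r'(?:https?://[^/]+\.duboku\.co/vodplay/)(?P<id>(?:[0-9]+-){2}[0-9]+)\.html'


def match_id(pattern, url):
    """Hypothetical helper: return the 'id' group, or None if no match."""
    m = re.match(pattern, url)
    return m.group('id') if m else None


good = 'https://www.duboku.co/vodplay/1575-1-1.html'
two_part = 'https://www.duboku.co/vodplay/1575-1.html'  # made-up, malformed id

# Both patterns accept the well-formed URL.
old_good = match_id(OLD, good)
new_good = match_id(NEW, good)

# Only the old pattern accepts an id without exactly three numeric parts.
old_two_part = match_id(OLD, two_part)
new_two_part = match_id(NEW, two_part)

# re.match() anchors only at the start, so the trailing '.*' was never needed.
new_with_tail = match_id(NEW, good + '?foo=1')
```

Because the new pattern guarantees exactly three numeric parts, later code can split the id on `-` without extra checks.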

dirkf · 2022-06-07T00:18:17Z

youtube_dl/extractor/duboku.py

+        'url': 'https://www.duboku.co/vodplay/1575-1-1.html',
+        'info_dict': {
+            'id': '1575-1-1',
+            'ext': 'ts',


Fix test:

Suggested change

-            'ext': 'ts',
+            'ext': 'mp4',

dirkf · 2022-06-07T00:18:40Z

youtube_dl/extractor/duboku.py

+        'url': 'https://www.duboku.co/vodplay/1588-1-1.html',
+        'info_dict': {
+            'id': '1588-1-1',
+            'ext': 'ts',


Fix test:

Suggested change

-            'ext': 'ts',
+            'ext': 'mp4',

dirkf · 2022-06-07T00:19:17Z

youtube_dl/extractor/duboku.py

+            'id': '1588-1-1',
+            'ext': 'ts',
+            'series': '亲爱的自己',
+            'title': 'contains:预告片',


Page has changed:

Suggested change

-            'title': 'contains:预告片',
+            'title': '亲爱的自己第1集',

dirkf · 2022-06-07T00:20:06Z

youtube_dl/extractor/duboku.py

+        temp = video_id.split('-')
+        series_id = temp[0]
+        season_id = temp[1]
+        episode_id = temp[2]


Simpler:

Suggested change

-        temp = video_id.split('-')
-        series_id = temp[0]
-        season_id = temp[1]
-        episode_id = temp[2]
+        series_id, season_id, episode_id = video_id.split('-')
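As a small illustration of the unpacking form, the sketch below wraps it in a hypothetical `split_video_id` function. One difference from the indexed version is worth noting: unpacking raises `ValueError` unless the id has exactly three parts, which the stricter `_VALID_URL` suggested earlier in this review would already ensure.

```python
def split_video_id(video_id):
    """Hypothetical helper: split 'series-season-episode' into its parts."""
    # Tuple unpacking fails loudly if the id does not have exactly 3 parts.
    series_id, season_id, episode_id = video_id.split('-')
    return series_id, season_id, episode_id


parts = split_video_id('1575-1-1')

try:
    split_video_id('1575-1')
    malformed_rejected = False
except ValueError:
    malformed_rejected = True
```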

dirkf · 2022-06-07T00:29:15Z

youtube_dl/extractor/duboku.py

+                href = extract_attributes(mobj.group(0)).get('href')
+                if href:
+                    mobj1 = re.search(r'/(\d+)\.html', href)
+                    if mobj1 and mobj1.group(1) == series_id:
+                        series_title = clean_html(mobj.group(0))
+                        series_title = re.sub(r'[\s\r\n\t]+', ' ', series_title)
+                        title = clean_html(html)
+                        title = re.sub(r'[\s\r\n\t]+', ' ', title)
+                        break


Use the resulting match object
Avoid excessive indentation
r'\s' already matches any whitespace, so \r, \n and \t are redundant
Simplify the clean_html() expressions

Suggested change

-                href = extract_attributes(mobj.group(0)).get('href')
-                if href:
-                    mobj1 = re.search(r'/(\d+)\.html', href)
-                    if mobj1 and mobj1.group(1) == series_id:
-                        series_title = clean_html(mobj.group(0))
-                        series_title = re.sub(r'[\s\r\n\t]+', ' ', series_title)
-                        title = clean_html(html)
-                        title = re.sub(r'[\s\r\n\t]+', ' ', title)
-                        break
+                href = extract_attributes(html[mobj.start(0):mobj.start('content')]).get('href')
+                if not href:
+                    continue
+                mobj1 = re.search(r'/(?P<s_id>\d+)\.html', href)
+                if mobj1 and mobj1.group('s_id') == series_id:
+                    series_title = clean_html(re.sub(r'\s+', ' ', mobj.group('content')))
+                    title = clean_html(re.sub(r'\s+', ' ', html))
+                    break
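The points in this comment can be tried out with the standard library alone. In the sketch below the markup, the outer pattern and `normalise` are all assumptions made for illustration; a plain regex stands in for youtube-dl's `extract_attributes()` and `.strip()` stands in for the trimming done by `clean_html()`.

```python
import re

# Hypothetical markup and outer pattern, for illustration only.
html = '<a class="x"\n   href="/voddetail/1575.html">\n\t亲爱的自己  \r\n</a>'
series_id = '1575'

mobj = re.search(r'<a\s[^>]*>(?P<content>.*?)</a>', html, re.S)

# Slice out just the opening tag: from the start of the whole match
# up to the point where the named 'content' group begins.
opening_tag = html[mobj.start(0):mobj.start('content')]

# Stand-in for extract_attributes(): pull href with a plain regex.
href_m = re.search(r'href="([^"]*)"', opening_tag)
href = href_m.group(1) if href_m else None


def normalise(text):
    """Collapse whitespace runs; .strip() stands in for clean_html()."""
    return re.sub(r'\s+', ' ', text).strip()


series_title = None
mobj1 = re.search(r'/(?P<s_id>\d+)\.html', href) if href else None
if mobj1 and mobj1.group('s_id') == series_id:
    series_title = normalise(mobj.group('content'))

# r'\s' already covers \r, \n and \t, so both patterns give the same result.
same_result = re.sub(r'[\s\r\n\t]+', ' ', html) == re.sub(r'\s+', ' ', html)
```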

dirkf · 2022-06-07T00:30:53Z

youtube_dl/extractor/duboku.py

+                'episode_id': episode_id,
+            }
+
+        formats = self._extract_m3u8_formats(data_url, video_id, 'mp4')


Pass Referer header to avoid 403:

Suggested change

-        formats = self._extract_m3u8_formats(data_url, video_id, 'mp4')
+        headers = {'Referer': 'https://www.duboku.co/static/player/videojs.html'}
+        formats = self._extract_m3u8_formats(data_url, video_id, 'mp4', headers=headers)
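Outside youtube-dl, the same idea can be sketched with `urllib.request`: the Referer travels as an ordinary request header. The manifest URL below is a placeholder, not a real duboku address, and the request is only constructed, never sent.

```python
from urllib.request import Request

headers = {'Referer': 'https://www.duboku.co/static/player/videojs.html'}

# Placeholder manifest URL, assumed for illustration only.
data_url = 'https://example.invalid/playlist.m3u8'

# Build (but do not send) a request that carries the Referer header.
req = Request(data_url, headers=headers)
referer = req.get_header('Referer')
```

Keeping the header in a single `headers` dict means the same object can be reused wherever else the Referer is needed.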

dirkf · 2022-06-07T00:31:33Z

youtube_dl/extractor/duboku.py

+            'episode_number': int_or_none(episode_id),
+            'episode_id': episode_id,
+            'formats': formats,
+            'http_headers': {'Referer': 'https://www.duboku.co/static/player/videojs.html'}


Use headers as introduced above:

Suggested change

-            'http_headers': {'Referer': 'https://www.duboku.co/static/player/videojs.html'}
+            'http_headers': headers,

dirkf · 2022-06-07T00:33:18Z

youtube_dl/extractor/duboku.py

+        'url': 'https://www.duboku.co/voddetail/1554.html#playlist2',
+        'info_dict': {
+            'id': '1554#playlist2',


#playlist2 has gone: use #playlist1 instead:

Suggested change

-        'url': 'https://www.duboku.co/voddetail/1554.html#playlist2',
-        'info_dict': {
-            'id': '1554#playlist2',
+        'url': 'https://www.duboku.co/voddetail/1554.html#playlist1',
+        'info_dict': {
+            'id': '1554#playlist1',

dirkf · 2022-06-07T00:34:21Z

youtube_dl/extractor/duboku.py

+        mobj = re.match(self._VALID_URL, url)
+        if mobj is None:
+            raise ExtractorError('Invalid URL: %s' % url)
+        series_id = mobj.group('id')


Simplify:

Suggested change

-        mobj = re.match(self._VALID_URL, url)
-        if mobj is None:
-            raise ExtractorError('Invalid URL: %s' % url)
-        series_id = mobj.group('id')
+        series_id = self._match_id(url)
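For readers unfamiliar with `_match_id`, the sketch below is a simplified stand-in for the base-class helper, not youtube-dl's actual code. The class name and the playlist `_VALID_URL` are assumptions, since the PR's real playlist pattern is not shown in this thread.

```python
import re


class PlaylistIESketch:
    # Assumed pattern; the PR's real playlist _VALID_URL is not shown here.
    _VALID_URL = r'https?://[^/]+\.duboku\.co/voddetail/(?P<id>[0-9]+)\.html'

    @classmethod
    def _match_id(cls, url):
        """Simplified stand-in: match the URL and return its 'id' group."""
        m = re.match(cls._VALID_URL, url)
        assert m, 'URL does not match _VALID_URL'
        return m.group('id')


series_id = PlaylistIESketch._match_id(
    'https://www.duboku.co/voddetail/1554.html#playlist1')
```

An extractor is normally only handed URLs that already matched its `_VALID_URL`, which is presumably why the explicit `ExtractorError` check adds little.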

lkho · 2022-06-16T15:06:42Z

@dirkf can you please help me apply the changes? My original repo was deleted by GitHub.

dirkf · 2022-06-16T22:38:41Z

> I'll apply them if you like.

Unfortunately the GH website logic says "diff is outdated" if I try to do that, presumably because the source branch is blocked.

Contact GH support to get your repo unblocked (mention #27013). Much the easiest.

Or read https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally.

Or perhaps:

delete your blocked repo
fork yt-dl again under the same name
clone it to your system
cd to the youtube-dl directory
git checkout -b same_name_as_your_original_branch 2020.07.28
save patch file https://github.com/ytdl-org/youtube-dl/pull/26467.patch as 26467.patch
git am 26467.patch
git push your_forked_repo
now we hope that GH will treat your new branch as the original PR source.

pukkandan · 2022-06-17T05:29:45Z

> delete your blocked repo

You can't delete a blocked repo without contacting support. You can create a new fork and open a new PR from it, though.

> Contact GH support to get your repo unblocked (mention #27013). Much the easiest.

When I contacted support a while ago to get my fork (not yt-dlp) restored, the only option they offered was to delete it. I had to let them delete it and then re-fork. Luckily, I had a local copy of all the branches.

lkho added 8 commits August 29, 2020 15:33

[duboku] Add new extractor www.duboku.co

503406d

[duboku] add playlist extractor

de4144a

[duboku] add tests

d82b669

[duboku] fix test_no_duplicates

a8f88d2

[duboku] replace import *, fix tests

7cc9d5b

[duboku] fix list results, minor error checking

bf73929

[duboku] add referer header

1b8805f

[duboku] change ext to mp4

b0f5073

dstftw force-pushed the master branch 2 times, most recently from 5e26784 to da2069f · September 13, 2020 13:49

cypheron mentioned this pull request Feb 3, 2021

Evaluation / overview of new proposed extractors / sites #28054

Open

dirkf requested changes Jun 7, 2022
