[extractor/vlive] Replace with archive.org based extractor #6196
Conversation
yt_dlp/extractor/archiveorg.py
Outdated
    def _download_wbm_page(self, url, video_id, timestamp='2', mode='id_', **kwargs):
        for retry in self.RetryManager():
            try:
                return self._download_webpage(f'https://web.archive.org/web/{timestamp}{mode}/' + url, video_id, **kwargs)
            except ExtractorError as err:
                retry.error = err
                continue
Almost same as:
yt-dlp/yt_dlp/extractor/archiveorg.py
Lines 892 to 907 in f14c233
retry_manager = self.RetryManager(fatal=False)
for retry in retry_manager:
    try:
        urlh = self._request_webpage(
            HEADRequest('https://web.archive.org/web/2oe_/http://wayback-fakeurl.archive.org/yt/%s' % video_id),
            video_id, note='Fetching archived video file url', expected_status=True)
    except ExtractorError as e:
        # HTTP Error 404 is expected if the video is not saved.
        if isinstance(e.cause, compat_HTTPError) and e.cause.code == 404:
            self.raise_no_formats(
                'The requested video is not archived, indexed, or there is an issue with web.archive.org (try again later)', expected=True)
        else:
            retry.error = e
if retry_manager.error:
    self.raise_no_formats(retry_manager.error, expected=True, video_id=video_id)
@coletdjnz How feasible is it to make a baseclass for wayback machine?
I think fakeurl only applies to YouTube videos. Potentially, requests could be moved to a more general function in a baseclass.
yeah there's some in common between the two so we could make a base class of some sort. But imo I wouldn't worry about it too much for this PR at least (would require more testing), so up to you.
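For illustration, the pieces the two extractors share (wayback URL construction plus a retry loop around a fetch) could be factored roughly as below. All names here (wbm_url, fetch_with_retries, OSError standing in for ExtractorError) are hypothetical sketches, not actual yt-dlp API:

```python
# Hypothetical sketch of logic a shared wayback base class could host.
# OSError stands in for yt-dlp's ExtractorError; names are illustrative.

WAYBACK_BASE_URL = 'https://web.archive.org/web/'

def wbm_url(url, timestamp='2', mode='id_'):
    # timestamp '2' matches the newest 2xxx snapshot; mode 'id_' asks the
    # wayback machine for the original payload without injected markup.
    return f'{WAYBACK_BASE_URL}{timestamp}{mode}/{url}'

def fetch_with_retries(fetch, url, retries=3):
    # `fetch` is any callable that raises on transient failure.
    last_err = None
    for _ in range(retries):
        try:
            return fetch(url)
        except OSError as err:
            last_err = err
    raise last_err
```

The real implementation would of course reuse InfoExtractor's RetryManager rather than a hand-rolled loop; the sketch only shows where the shared seams are.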
yt_dlp/extractor/archiveorg.py
Outdated
vod_id = traverse_obj(player_info, ('postDetail', 'post', 'officialVideo', 'vodId'))

vod_data = self._parse_json(self._download_wbm_page(f'https://apis.naver.com/rmcnmv/rmcnmv/vod/play/v2.0/{vod_id}', video_id,
line too long. Indent like:
vod_data = self._parse_json(self._download_wbm_page(
url, video_id, note=..., query={
...
}), video_id)
yt_dlp/extractor/archiveorg.py
Outdated
# Code from NaverBaseIE
automatic_captions = {}
subtitles = {}
for caption in traverse_obj(vod_data, ('captions', 'list'), []):
    caption_url = caption.get('source')
    if not caption_url:
        continue
    caption_url = self._WAYBACK_BASE_URL + caption_url
    sub_dict = automatic_captions if caption.get('type') == 'auto' else subtitles
    lang = caption.get('locale') or join_nonempty('language', 'country', from_dict=caption) or 'und'
    if caption.get('type') == 'fan':
        lang += '_fan%d' % next(i for i in itertools.count(1) if f'{lang}_fan{i}' not in sub_dict)
    sub_dict.setdefault(lang, []).append({
        'url': caption_url,
        'name': join_nonempty('label', 'fanName', from_dict=caption, delim=' - '),
    })
Why not subclass from it and call the function?
I can't call _extract_video_info in NaverBaseIE directly, because that function makes API requests and returns all formats rather than just the archived ones. It would take significant work to modify the returned value to match what was archived.
Potentially the subtitle extraction part could be moved to its own function, but since only vtt (not ttml) subtitles were archived, that also seems like more hassle than it's worth.
This should work:
diff --git a/yt_dlp/extractor/naver.py b/yt_dlp/extractor/naver.py
index e2e6e9728..eae4d07fb 100644
--- a/yt_dlp/extractor/naver.py
+++ b/yt_dlp/extractor/naver.py
@@ -21,6 +21,23 @@
class NaverBaseIE(InfoExtractor):
_CAPTION_EXT_RE = r'\.(?:ttml|vtt)'
+ @staticmethod # NB: Used in VLiveWebArchiveIE
+ def process_subtitles(vod_data, process_url):
+ ret = {'subtitles': {}, 'automatic_captions': {}}
+ for caption in traverse_obj(vod_data, ('captions', 'list', ...)):
+ caption_url = caption.get('source')
+ if not caption_url:
+ continue
+ type_ = 'automatic_captions' if caption.get('type') == 'auto' else 'subtitles'
+ lang = caption.get('locale') or join_nonempty('language', 'country', from_dict=caption) or 'und'
+ if caption.get('type') == 'fan':
+ lang += '_fan%d' % next(i for i in itertools.count(1) if f'{lang}_fan{i}' not in ret[type_])
+ ret[type_].setdefault(lang, []).extend({
+ 'url': sub_url,
+ 'name': join_nonempty('label', 'fanName', from_dict=caption, delim=' - '),
+ } for sub_url in process_url(caption_url))
+ return ret
+
def _extract_video_info(self, video_id, vid, key):
video_data = self._download_json(
'http://play.rmcnmv.naver.com/vod/play/v2.0/' + vid,
@@ -79,34 +96,18 @@ def get_subs(caption_url):
]
return [caption_url]
- automatic_captions = {}
- subtitles = {}
- for caption in get_list('caption'):
- caption_url = caption.get('source')
- if not caption_url:
- continue
- sub_dict = automatic_captions if caption.get('type') == 'auto' else subtitles
- lang = caption.get('locale') or join_nonempty('language', 'country', from_dict=caption) or 'und'
- if caption.get('type') == 'fan':
- lang += '_fan%d' % next(i for i in itertools.count(1) if f'{lang}_fan{i}' not in sub_dict)
- sub_dict.setdefault(lang, []).extend({
- 'url': sub_url,
- 'name': join_nonempty('label', 'fanName', from_dict=caption, delim=' - '),
- } for sub_url in get_subs(caption_url))
-
user = meta.get('user', {})
return {
'id': video_id,
'title': title,
'formats': formats,
- 'subtitles': subtitles,
- 'automatic_captions': automatic_captions,
'thumbnail': try_get(meta, lambda x: x['cover']['source']),
'view_count': int_or_none(meta.get('count')),
'uploader_id': user.get('id'),
'uploader': user.get('name'),
'uploader_url': user.get('url'),
+ **self.process_subtitles(video_data, get_subs),
}
**NaverBaseIE.process_subtitles(vod_data, lambda x: [self._WAYBACK_BASE_URL + x])
This looks great. Thank you!
yt_dlp/extractor/archiveorg.py
Outdated
params = {arg.get('name'): arg.get('value') for arg in stream.get('keys', []) if arg.get('type') == 'param'}
m3u8_doc = self._download_wbm_page(max_stream.get('source'), video_id, note='Downloading m3u8', query=params, fatal=False)
if m3u8_doc:
    # M3U8 document is not valid, so it needs to be fixed
    m3u8_doc_lines = m3u8_doc.splitlines()
    modified_m3u8_doc_lines = []
    url_base = max_stream.get('source').rsplit('/', 1)[0]
    first_segment = None
    for line in m3u8_doc_lines:
        if line.startswith('#'):
            modified_m3u8_doc_lines.append(line)
        else:
            modified_line = f'{self._WAYBACK_BASE_URL}{url_base}/{line}?{urllib.parse.urlencode(params)}'
            modified_m3u8_doc_lines.append(modified_line)
            if first_segment is None:
                first_segment = modified_line
    modified_m3u8_doc = '\n'.join(modified_m3u8_doc_lines)

    # Segments may not have been archived. See 101870
    first_segment_req = self._request_webpage(HEADRequest(first_segment), video_id, note='Check first segment availability', errnote=False, fatal=False)
    if first_segment_req:
        formats, _ = self._parse_m3u8_formats_and_subtitles(modified_m3u8_doc, ext='mp4', video_id=video_id)
Could be in a _extract_formats_from_m3u8 function
I don't think that works, as the m3u8 needs to be adjusted to add the query parameter to each segment and a segment needs to be checked to see if the video was actually archived or just the playlist.
Oh, I think I misunderstood. I thought you were suggesting using self._extract_m3u8_formats. Moving this to a separate function seems reasonable.
Yes. I should've said method...
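As a standalone sketch of what such a method would encapsulate (illustrative names, not the PR's exact code): every non-tag line of the playlist is a segment path that must be prefixed with the wayback base plus the stream's base URL, with the stream's query parameters re-attached:

```python
import urllib.parse

WAYBACK_BASE_URL = 'https://web.archive.org/web/2id_/'  # illustrative

def rewrite_m3u8(m3u8_doc, source_url, params):
    # Lines starting with '#' are m3u8 tags and pass through unchanged;
    # everything else is a segment URI that needs the wayback prefix and
    # the original stream's query string.
    url_base = source_url.rsplit('/', 1)[0]
    query = urllib.parse.urlencode(params)
    fixed = []
    for line in m3u8_doc.splitlines():
        if line.startswith('#'):
            fixed.append(line)
        else:
            fixed.append(f'{WAYBACK_BASE_URL}{url_base}/{line}?{query}')
    return '\n'.join(fixed)
```

The availability check (HEAD request against the first rewritten segment) would then live in the same method, since a playlist can be archived while its segments are not.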
yt_dlp/extractor/archiveorg.py
Outdated
formats, _ = self._parse_m3u8_formats_and_subtitles(modified_m3u8_doc, ext='mp4', video_id=video_id)

# For parts of the project MP4 files were archived
max_video = max(traverse_obj(vod_data, ('videos', 'list'), []), key=lambda v: traverse_obj(v, ('bitrate', 'video')), default=None)
- max_video = max(traverse_obj(vod_data, ('videos', 'list'), []), key=lambda v: traverse_obj(v, ('bitrate', 'video')), default=None)
+ max_video = max(
+     traverse_obj(vod_data, ('videos', 'list', ...), default=[None]),
+     key=lambda v: traverse_obj(v, ('bitrate', 'video'), default=0))
max([]) will throw an error. Similarly, a key returning a mix of int and None will also throw an error.
max([], default=None) does not throw an error. Should I still change it?
Edit: Oh, I missed that key might be None. Will change.
Ah, my bad. I read the default as being inside traverse_obj. The default=0 is still needed though. So this would work as well:
- max_video = max(traverse_obj(vod_data, ('videos', 'list'), []), key=lambda v: traverse_obj(v, ('bitrate', 'video')), default=None)
+ max_video = max(
+     traverse_obj(vod_data, ('videos', 'list', ...)),
+     key=lambda v: traverse_obj(v, ('bitrate', 'video'), default=0), default=None)
yt_dlp/extractor/archiveorg.py
Outdated
video_url = self._WAYBACK_BASE_URL + max_video.get('source')
video_req = self._request_webpage(HEADRequest(video_url), video_id, note='Check video availability', errnote=False, fatal=False)
This is the same as _download_wbm_page, no?
Only making a HEAD request to not download the entire video
I assume the lack of retries here is also intentional?
Hmm, sort of. I added the retries because the IA servers are quite unreliable, with requests often timing out or being aborted. Originally, even 404 requests would be retried, which caused these availability checks to need three requests to fail. I did just adjust _download_wbm_page to no longer retry 404ed requests, since that is dumb.
Any suggestions for a nice way to add the option of HEAD requests to _download_wbm_page?
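One possible shape for that, sketched with injectable stand-ins (the head= flag and the do_get/do_head callables are hypothetical; in yt-dlp the HEAD path would go through HEADRequest and _request_webpage):

```python
def download_wbm_page(url, *, head=False, do_get=None, do_head=None,
                      timestamp='2', mode='id_'):
    # head=True only checks availability (no body download); both paths
    # share the wayback URL construction and could share the retry loop.
    wbm_url = f'https://web.archive.org/web/{timestamp}{mode}/{url}'
    return do_head(wbm_url) if head else do_get(wbm_url)
```

That keeps one entry point for all wayback requests, so the retry and 404 handling stay in a single place.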
Co-authored-by: pukkandan <pukkandan.ytdlp@gmail.com>
Co-authored-by: Simon Sawicki <accounts@grub4k.xyz>
Maybe it makes sense to use …
Isn't …
Hmm, indeed, you are correct. We recently discussed the possibility of … The alternative would be to do sth like

def str_non_empty(v):
    if not isinstance(v, str):
        return None
    v = v.strip()
    return v if v else None

and then use …
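For reference, the helper above behaves differently from a plain `x or None` on non-string values (runnable standalone; str_non_empty is copied from the suggestion above):

```python
def str_non_empty(v):
    if not isinstance(v, str):
        return None
    v = v.strip()
    return v if v else None

print(str_non_empty('  hello '))  # 'hello' (stripped)
print(str_non_empty('   '))       # None (whitespace-only)
print(str_non_empty(0))           # None (non-str input rejected)
print(5 or None)                  # 5 -- `x or None` passes non-str values through
```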
When using … Instead of your suggestion, we could create a function …:

diff --git a/yt_dlp/extractor/panopto.py b/yt_dlp/extractor/panopto.py
index 32c103bc1..3b73f8cbe 100644
--- a/yt_dlp/extractor/panopto.py
+++ b/yt_dlp/extractor/panopto.py
@@ -412,7 +412,7 @@ def _real_extract(self, url):
return {
'id': video_id,
'title': delivery.get('SessionName'),
- 'cast': traverse_obj(delivery, ('Contributors', ..., 'DisplayName'), default=[], expected_type=lambda x: x or None),
+ 'cast': traverse_obj(delivery, ('Contributors', ..., 'DisplayName', {truthy_or_none})),
'timestamp': session_start_time - 11640000000 if session_start_time else None,
'duration': delivery.get('Duration'),
'thumbnail': base_url + f'/Services/FrameGrabber.svc/FrameRedirect?objectId={video_id}&mode=Delivery&random={random()}',
diff --git a/yt_dlp/extractor/ruv.py b/yt_dlp/extractor/ruv.py
index 12499d6ca..f3538b6bc 100644
--- a/yt_dlp/extractor/ruv.py
+++ b/yt_dlp/extractor/ruv.py
@@ -176,7 +176,7 @@ def _real_extract(self, url):
'title': traverse_obj(program, ('episodes', 0, 'title'), 'title'),
'description': traverse_obj(
program, ('episodes', 0, 'description'), 'description', 'short_description',
- expected_type=lambda x: x or None),
+ expected_type=truthy_or_none),
'subtitles': subs,
'thumbnail': episode.get('image', '').replace('$$IMAGESIZE$$', '1960') or None,
'timestamp': unified_timestamp(episode.get('firstrun')),
diff --git a/yt_dlp/utils.py b/yt_dlp/utils.py
index 878b2b6a8..56d1e36d1 100644
--- a/yt_dlp/utils.py
+++ b/yt_dlp/utils.py
@@ -2578,6 +2578,10 @@ def str_or_none(v, default=None):
     return default if v is None else str(v)


+def truthy_or_none(v):
+    return v or None
+
+
 def str_to_int(int_str):
     """ A more relaxed version of int_or_none """
     if isinstance(int_str, int):
Alternatively, we could change the behavior of …:

diff --git a/yt_dlp/extractor/tiktok.py b/yt_dlp/extractor/tiktok.py
index cc96de364..9ca508b7e 100644
--- a/yt_dlp/extractor/tiktok.py
+++ b/yt_dlp/extractor/tiktok.py
@@ -395,7 +395,7 @@ def _parse_aweme_video_web(self, aweme_detail, webpage_url):
'artist': str_or_none(music_info.get('authorName')),
'formats': formats,
'thumbnails': thumbnails,
- 'description': str_or_none(aweme_detail.get('desc')),
+ 'description': traverse_obj(aweme_detail, ('desc', {str})),
'http_headers': {
'Referer': webpage_url
}
diff --git a/yt_dlp/utils.py b/yt_dlp/utils.py
index 878b2b6a8..5578a7cfd 100644
--- a/yt_dlp/utils.py
+++ b/yt_dlp/utils.py
@@ -2575,7 +2575,7 @@ def int_or_none(v, scale=1, default=None, get_attr=None, invscale=1):
 def str_or_none(v, default=None):
-    return default if v is None else str(v)
+    return default if v in (None, '') else str(v)
 def str_to_int(int_str):

In any case @seproDev, use the …
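The behavior change in the utils.py hunk above can be shown side by side (both variants reproduced here so the snippet runs standalone):

```python
def str_or_none(v, default=None):
    # current behavior: only None maps to default
    return default if v is None else str(v)

def str_or_none_patched(v, default=None):
    # proposed behavior: the empty string is also treated as missing
    return default if v in (None, '') else str(v)

print(str_or_none(''))          # '' (kept)
print(str_or_none_patched(''))  # None (dropped)
print(str_or_none_patched(0))   # '0' (other falsy values still stringified)
```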
Co-authored-by: pukkandan <pukkandan.ytdlp@gmail.com>
pls verify that I haven't broken anything
Looks good to me! All tests still pass.
vlive has shut down: https://web.archive.org/web/20221031171019/https://www.vlive.tv/notice/4749

Authored by: seproDev
IMPORTANT: PRs without the template will be CLOSED
Description of your pull request and other information
VLive was shut down on 2022-12-31. Most videos were saved by ArchiveTeam and are accessible through the Wayback Machine.
This PR removes the original VLive extractor and adds a new one that allows downloading videos from the Internet Archive.
The extractor closely follows the logic of the ArchiveTeam grab script and this standalone script by OrIdow6, which are licensed under the Unlicense and CC0 respectively.
Template
Before submitting a pull request make sure you have:
In order to be accepted and merged into yt-dlp each piece of code must be in public domain or released under Unlicense. Check one of the following options:
What is the purpose of your pull request?