
[extractor/vlive] Replace with archive.org based extractor #6196

Merged (21 commits, merged Feb 12, 2023)
Conversation

@seproDev (Collaborator) commented Feb 9, 2023:


Description of your pull request and other information

VLive was shut down on 2022-12-31. Most videos were saved by ArchiveTeam and are accessible through the Wayback Machine.
This PR removes the original VLive extractor and adds a new one that downloads the archived videos from the Internet Archive.

The extractor closely follows the logic of the ArchiveTeam grab script and this standalone script by OrIdow6, which are licensed under the Unlicense and CC0, respectively.

Template

Before submitting a pull request, make sure you have:

In order to be accepted and merged into yt-dlp, each piece of code must be in the public domain or released under the Unlicense. Check one of the following options:

  • I am the original author of this code and I am willing to release it under Unlicense
  • I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

Comment on lines 1049 to 1055
def _download_wbm_page(self, url, video_id, timestamp='2', mode='id_', **kwargs):
    for retry in self.RetryManager():
        try:
            return self._download_webpage(f'https://web.archive.org/web/{timestamp}{mode}/' + url, video_id, **kwargs)
        except ExtractorError as err:
            retry.error = err
            continue
@pukkandan (Member) commented Feb 9, 2023:

Almost the same as:

retry_manager = self.RetryManager(fatal=False)
for retry in retry_manager:
    try:
        urlh = self._request_webpage(
            HEADRequest('https://web.archive.org/web/2oe_/http://wayback-fakeurl.archive.org/yt/%s' % video_id),
            video_id, note='Fetching archived video file url', expected_status=True)
    except ExtractorError as e:
        # HTTP Error 404 is expected if the video is not saved.
        if isinstance(e.cause, compat_HTTPError) and e.cause.code == 404:
            self.raise_no_formats(
                'The requested video is not archived, indexed, or there is an issue with web.archive.org (try again later)', expected=True)
        else:
            retry.error = e
if retry_manager.error:
    self.raise_no_formats(retry_manager.error, expected=True, video_id=video_id)

@coletdjnz How feasible is it to make a baseclass for wayback machine?

@seproDev (Collaborator, Author):

I think fakeurl only applies to YouTube videos. Potentially, requests could be moved to a more general function in a baseclass.

Member:

Yeah, there's some overlap between the two, so we could make a base class of some sort. But imo I wouldn't worry about it too much for this PR at least (it would require more testing), so it's up to you.
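
(For context, a minimal sketch of what such a shared Wayback Machine base class might look like; the class name and the 404 handling are illustrative assumptions, not code from this PR:)

class WaybackBaseIE(InfoExtractor):  # hypothetical name
    def _download_wbm_page(self, url, video_id, timestamp='2', mode='id_', **kwargs):
        # Retry, since web.archive.org requests frequently time out or get aborted
        for retry in self.RetryManager():
            try:
                return self._download_webpage(
                    f'https://web.archive.org/web/{timestamp}{mode}/{url}', video_id, **kwargs)
            except ExtractorError as err:
                # A 404 means the page was never archived, so retrying won't help
                if isinstance(err.cause, compat_HTTPError) and err.cause.code == 404:
                    raise
                retry.error = err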


vod_id = traverse_obj(player_info, ('postDetail', 'post', 'officialVideo', 'vodId'))

vod_data = self._parse_json(self._download_wbm_page(f'https://apis.naver.com/rmcnmv/rmcnmv/vod/play/v2.0/{vod_id}', video_id,
Member:

line too long. Indent like:

        vod_data = self._parse_json(self._download_wbm_page(
            url, video_id, note=..., query={
                ...
            }), video_id)

yt_dlp/extractor/archiveorg.py Outdated Show resolved Hide resolved
Comment on lines 1147 to 1162
# Code from NaverBaseIE
automatic_captions = {}
subtitles = {}
for caption in traverse_obj(vod_data, ('captions', 'list'), []):
    caption_url = caption.get('source')
    if not caption_url:
        continue
    caption_url = self._WAYBACK_BASE_URL + caption_url
    sub_dict = automatic_captions if caption.get('type') == 'auto' else subtitles
    lang = caption.get('locale') or join_nonempty('language', 'country', from_dict=caption) or 'und'
    if caption.get('type') == 'fan':
        lang += '_fan%d' % next(i for i in itertools.count(1) if f'{lang}_fan{i}' not in sub_dict)
    sub_dict.setdefault(lang, []).append({
        'url': caption_url,
        'name': join_nonempty('label', 'fanName', from_dict=caption, delim=' - '),
    })
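
(As an aside, the itertools.count idiom above picks the first unused fan-caption suffix; a minimal illustration:)

import itertools

sub_dict = {'en_fan1': [], 'en_fan2': []}
lang = 'en'
# smallest i such that f'{lang}_fan{i}' is not already taken
lang += '_fan%d' % next(i for i in itertools.count(1) if f'{lang}_fan{i}' not in sub_dict)
assert lang == 'en_fan3'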
Member:

Why not subclass from it and call the function?

@seproDev (Collaborator, Author):

I can't call _extract_video_info in NaverBaseIE directly, since that function makes API requests and returns all formats, not just the archived ones. It would take significant work to modify the returned value to match what was archived.
Potentially the subtitle extraction could be moved into its own function, but since only vtt subtitles were archived (not ttml), that also seems like more hassle than it's worth.

Member:

This should work:

diff --git a/yt_dlp/extractor/naver.py b/yt_dlp/extractor/naver.py
index e2e6e9728..eae4d07fb 100644
--- a/yt_dlp/extractor/naver.py
+++ b/yt_dlp/extractor/naver.py
@@ -21,6 +21,23 @@
 class NaverBaseIE(InfoExtractor):
     _CAPTION_EXT_RE = r'\.(?:ttml|vtt)'

+    @staticmethod  # NB: Used in VLiveWebArchiveIE
+    def process_subtitles(vod_data, process_url):
+        ret = {'subtitles': {}, 'automatic_captions': {}}
+        for caption in traverse_obj(vod_data, ('captions', 'list', ...)):
+            caption_url = caption.get('source')
+            if not caption_url:
+                continue
+            type_ = 'automatic_captions' if caption.get('type') == 'auto' else 'subtitles'
+            lang = caption.get('locale') or join_nonempty('language', 'country', from_dict=caption) or 'und'
+            if caption.get('type') == 'fan':
+                lang += '_fan%d' % next(i for i in itertools.count(1) if f'{lang}_fan{i}' not in ret[type_])
+            ret[type_].setdefault(lang, []).extend({
+                'url': sub_url,
+                'name': join_nonempty('label', 'fanName', from_dict=caption, delim=' - '),
+            } for sub_url in process_url(caption_url))
+        return ret
+
     def _extract_video_info(self, video_id, vid, key):
         video_data = self._download_json(
             'http://play.rmcnmv.naver.com/vod/play/v2.0/' + vid,
@@ -79,34 +96,18 @@ def get_subs(caption_url):
                 ]
             return [caption_url]

-        automatic_captions = {}
-        subtitles = {}
-        for caption in get_list('caption'):
-            caption_url = caption.get('source')
-            if not caption_url:
-                continue
-            sub_dict = automatic_captions if caption.get('type') == 'auto' else subtitles
-            lang = caption.get('locale') or join_nonempty('language', 'country', from_dict=caption) or 'und'
-            if caption.get('type') == 'fan':
-                lang += '_fan%d' % next(i for i in itertools.count(1) if f'{lang}_fan{i}' not in sub_dict)
-            sub_dict.setdefault(lang, []).extend({
-                'url': sub_url,
-                'name': join_nonempty('label', 'fanName', from_dict=caption, delim=' - '),
-            } for sub_url in get_subs(caption_url))
-
         user = meta.get('user', {})

         return {
             'id': video_id,
             'title': title,
             'formats': formats,
-            'subtitles': subtitles,
-            'automatic_captions': automatic_captions,
             'thumbnail': try_get(meta, lambda x: x['cover']['source']),
             'view_count': int_or_none(meta.get('count')),
             'uploader_id': user.get('id'),
             'uploader': user.get('name'),
             'uploader_url': user.get('url'),
+            **self.process_subtitles(video_data, get_subs),
         }
and in VLiveWebArchiveIE:

    **NaverBaseIE.process_subtitles(vod_data, lambda x: [self._WAYBACK_BASE_URL + x])

@seproDev (Collaborator, Author):

This looks great. Thank you!

Comment on lines 1114 to 1135
params = {arg.get('name'): arg.get('value') for arg in stream.get('keys', []) if arg.get('type') == 'param'}
m3u8_doc = self._download_wbm_page(max_stream.get('source'), video_id, note='Downloading m3u8', query=params, fatal=False)
if m3u8_doc:
    # M3U8 document is not valid, so it needs to be fixed
    m3u8_doc_lines = m3u8_doc.splitlines()
    modified_m3u8_doc_lines = []
    url_base = max_stream.get('source').rsplit('/', 1)[0]
    first_segment = None
    for line in m3u8_doc_lines:
        if line.startswith('#'):
            modified_m3u8_doc_lines.append(line)
        else:
            modified_line = f'{self._WAYBACK_BASE_URL}{url_base}/{line}?{urllib.parse.urlencode(params)}'
            modified_m3u8_doc_lines.append(modified_line)
            if first_segment is None:
                first_segment = modified_line
    modified_m3u8_doc = '\n'.join(modified_m3u8_doc_lines)

    # Segments may not have been archived. See 101870
    first_segment_req = self._request_webpage(HEADRequest(first_segment), video_id, note='Check first segment availability', errnote=False, fatal=False)
    if first_segment_req:
        formats, _ = self._parse_m3u8_formats_and_subtitles(modified_m3u8_doc, ext='mp4', video_id=video_id)
Member:

Could be in a _extract_formats_from_m3u8 function

@seproDev (Collaborator, Author):

I don't think that works, as the m3u8 needs to be adjusted to add the query parameters to each segment, and a segment needs to be checked to see whether the video was actually archived or just the playlist.

@seproDev (Collaborator, Author):

Oh, I think I misunderstood. I thought you were suggesting using self._extract_m3u8_formats. Moving this to a separate function seems reasonable.

Member:

Yes. I should've said method...
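
(A rough sketch of how that method might look; the name follows the suggestion above, while the exact signature and return value are assumptions:)

def _extract_formats_from_m3u8(self, m3u8_url, params, video_id):
    m3u8_doc = self._download_wbm_page(m3u8_url, video_id, note='Downloading m3u8', query=params, fatal=False)
    if not m3u8_doc:
        return []

    # The archived M3U8 is not valid as-is: each segment URI must be rewritten
    # into an absolute Wayback Machine URL carrying the original query parameters
    url_base = m3u8_url.rsplit('/', 1)[0]
    modified_lines, first_segment = [], None
    for line in m3u8_doc.splitlines():
        if not line.startswith('#'):
            line = f'{self._WAYBACK_BASE_URL}{url_base}/{line}?{urllib.parse.urlencode(params)}'
            first_segment = first_segment or line
        modified_lines.append(line)

    # Sometimes only the playlist was archived, not the segments; probe the first one
    if not first_segment or not self._request_webpage(
            HEADRequest(first_segment), video_id,
            note='Checking first segment availability', errnote=False, fatal=False):
        return []
    formats, _ = self._parse_m3u8_formats_and_subtitles(
        '\n'.join(modified_lines), ext='mp4', video_id=video_id)
    return formats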

formats, _ = self._parse_m3u8_formats_and_subtitles(modified_m3u8_doc, ext='mp4', video_id=video_id)

# For parts of the project, MP4 files were archived
max_video = max(traverse_obj(vod_data, ('videos', 'list'), []), key=lambda v: traverse_obj(v, ('bitrate', 'video')), default=None)
@pukkandan (Member) commented Feb 9, 2023:

Suggested change:
-max_video = max(traverse_obj(vod_data, ('videos', 'list'), []), key=lambda v: traverse_obj(v, ('bitrate', 'video')), default=None)
+max_video = max(
+    traverse_obj(vod_data, ('videos', 'list', ...), default=[None]),
+    key=lambda v: traverse_obj(v, ('bitrate', 'video'), default=0))

max([]) will throw an error. Similarly, a key returning a mix of int and None will also throw an error.

@seproDev (Collaborator, Author) commented Feb 9, 2023:

max([], default=None) does not throw an error. Should I still change it?

Edit: Oh, I missed that the key might return None. Will change.
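
(For reference, the exact behaviors under discussion:)

max([])                               # ValueError: max() arg is an empty sequence
max([], default=None)                 # None, no error
max([3, None], key=lambda v: v)       # TypeError: '<' not supported between NoneType and int
max([3, None], key=lambda v: v or 0)  # 3, None keys are coerced to 0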

Member:

Ah, my bad. I read the default as being inside traverse_obj. The default=0 is still needed though, so this would work as well:

Suggested change:
-max_video = max(traverse_obj(vod_data, ('videos', 'list'), []), key=lambda v: traverse_obj(v, ('bitrate', 'video')), default=None)
+max_video = max(
+    traverse_obj(vod_data, ('videos', 'list', ...)),
+    key=lambda v: traverse_obj(v, ('bitrate', 'video'), default=0), default=None)

Comment on lines 1140 to 1141
video_url = self._WAYBACK_BASE_URL + max_video.get('source')
video_req = self._request_webpage(HEADRequest(video_url), video_id, note='Check video availability', errnote=False, fatal=False)
Member:

this is the same as _download_wbm_page, no?

@seproDev (Collaborator, Author):

It's only making a HEAD request, so as not to download the entire video.

Member:

I assume the lack of retries here is also intentional?

@seproDev (Collaborator, Author):

Hmm, sort of. I added the retries because the IA servers are quite unreliable, with requests often timing out or being aborted. Originally even 404 responses would be retried, which meant these availability checks needed three requests to fail. I just adjusted _download_wbm_page to no longer retry 404ed requests, since that is dumb.

Any suggestions for a nice way to add the option of HEAD requests to _download_wbm_page?
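
(One possibility, sketched under the assumption that a simple flag is acceptable; the head parameter is not part of the merged code:)

def _download_wbm_page(self, url, video_id, timestamp='2', mode='id_', head=False, **kwargs):
    full_url = f'https://web.archive.org/web/{timestamp}{mode}/{url}'
    if head:
        full_url = HEADRequest(full_url)
    downloader = self._request_webpage if head else self._download_webpage
    for retry in self.RetryManager():
        try:
            return downloader(full_url, video_id, **kwargs)
        except ExtractorError as err:
            # Don't retry 404s; the page simply isn't archived
            if isinstance(err.cause, compat_HTTPError) and err.cause.code == 404:
                raise
            retry.error = err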

@Grub4K (Member) commented Feb 9, 2023:

Maybe it makes sense to use {str_or_none} for the fields in the info dict (uploader, ...) instead of {str}? Currently '' is returned, but that is not really useful, is it?

@seproDev (Collaborator, Author) commented Feb 9, 2023:

> Maybe it makes sense to use {str_or_none} for the fields in the info dict (uploader, ...) instead of {str}? Currently '' is returned, but that is not really useful, is it?

Isn't str_or_none('') == ''?
I do agree that the empty string is not really useful.
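
(For reference: per the utils definition quoted in the diff further down, str_or_none only special-cases None:)

str_or_none(None)  # None
str_or_none(42)    # '42'
str_or_none('')    # '' -- the empty string is not None, so it passes through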

@Grub4K (Member) commented Feb 9, 2023:

Hmm, indeed, you are correct.

We recently discussed the possibility of traverse_obj dropping empty strings. It might make sense to bring that discussion up again. Maybe only do it for dictionary values? Would it make sense to keep empty-string values at all?

The alternative would be to do something like

def str_non_empty(v):
    if not isinstance(v, str):
        return None
    v = v.strip()
    return v if v else None

and then use {str_non_empty}. Maybe a function like this could be moved to utils? @pukkandan
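
(Illustrative usage, with hypothetical dict keys:)

uploader = traverse_obj(player_info, ('author', 'nickname', {str_non_empty}))
# '' or '   ' would be dropped as None; a real name passes through unchanged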

@pukkandan (Member) commented:

When using traverse_obj without a dict, this is easily solvable with an 'or None' at the end of each line. But doing {lambda x: x or None} for each entry is ugly. For this specific case, since we don't mind the condition applying to all the fields, we could just do expected_type=lambda x: x or None. But it'd be useful to have some generalized way of doing this in the future.

Instead of your suggestion, we could create a function truthy_or_none, which could be re-used for non-zero int etc. Then it'll be {str}, {truthy_or_none}. Though I'm not sure it's worth it.

diff --git a/yt_dlp/extractor/panopto.py b/yt_dlp/extractor/panopto.py
index 32c103bc1..3b73f8cbe 100644
--- a/yt_dlp/extractor/panopto.py
+++ b/yt_dlp/extractor/panopto.py
@@ -412,7 +412,7 @@ def _real_extract(self, url):
         return {
             'id': video_id,
             'title': delivery.get('SessionName'),
-            'cast': traverse_obj(delivery, ('Contributors', ..., 'DisplayName'), default=[], expected_type=lambda x: x or None),
+            'cast': traverse_obj(delivery, ('Contributors', ..., 'DisplayName', {truthy_or_none})),
             'timestamp': session_start_time - 11640000000 if session_start_time else None,
             'duration': delivery.get('Duration'),
             'thumbnail': base_url + f'/Services/FrameGrabber.svc/FrameRedirect?objectId={video_id}&mode=Delivery&random={random()}',
diff --git a/yt_dlp/extractor/ruv.py b/yt_dlp/extractor/ruv.py
index 12499d6ca..f3538b6bc 100644
--- a/yt_dlp/extractor/ruv.py
+++ b/yt_dlp/extractor/ruv.py
@@ -176,7 +176,7 @@ def _real_extract(self, url):
             'title': traverse_obj(program, ('episodes', 0, 'title'), 'title'),
             'description': traverse_obj(
                 program, ('episodes', 0, 'description'), 'description', 'short_description',
-                expected_type=lambda x: x or None),
+                expected_type=truthy_or_none),
             'subtitles': subs,
             'thumbnail': episode.get('image', '').replace('$$IMAGESIZE$$', '1960') or None,
             'timestamp': unified_timestamp(episode.get('firstrun')),
diff --git a/yt_dlp/utils.py b/yt_dlp/utils.py
index 878b2b6a8..56d1e36d1 100644
--- a/yt_dlp/utils.py
+++ b/yt_dlp/utils.py
@@ -2578,6 +2578,10 @@ def str_or_none(v, default=None):
     return default if v is None else str(v)


+def truthy_or_none(v):
+    return v or None
+
+
 def str_to_int(int_str):
     """ A more relaxed version of int_or_none """
     if isinstance(int_str, int):

Alternatively, we could change the behavior of str_or_none. Looking through tests, I could only find one case where the current behaviour is being used:

diff --git a/yt_dlp/extractor/tiktok.py b/yt_dlp/extractor/tiktok.py
index cc96de364..9ca508b7e 100644
--- a/yt_dlp/extractor/tiktok.py
+++ b/yt_dlp/extractor/tiktok.py
@@ -395,7 +395,7 @@ def _parse_aweme_video_web(self, aweme_detail, webpage_url):
             'artist': str_or_none(music_info.get('authorName')),
             'formats': formats,
             'thumbnails': thumbnails,
-            'description': str_or_none(aweme_detail.get('desc')),
+            'description': traverse_obj(aweme_detail, ('desc', {str})),
             'http_headers': {
                 'Referer': webpage_url
             }
diff --git a/yt_dlp/utils.py b/yt_dlp/utils.py
index 878b2b6a8..5578a7cfd 100644
--- a/yt_dlp/utils.py
+++ b/yt_dlp/utils.py
@@ -2575,7 +2575,7 @@ def int_or_none(v, scale=1, default=None, get_attr=None, invscale=1):


 def str_or_none(v, default=None):
-    return default if v is None else str(v)
+    return default if v in (None, '') else str(v)


 def str_to_int(int_str):

In any case @seproDev, use the expected_type=lambda x: x or None suggestion for now while we figure out how we want to handle this use-case
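
(i.e., something like the following, with an illustrative path:)

'uploader': traverse_obj(player_info, ('author', 'nickname'), expected_type=lambda x: x or None),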

@coletdjnz self-requested a review February 10, 2023 02:36
@pukkandan (Member) commented:

pls verify that I haven't broken anything

@seproDev (Collaborator, Author) commented:

Looks good to me! All tests still pass.
