
[extractor/vlive] Replace with archive.org based extractor #6196

Merged (21 commits, merged Feb 12, 2023)
Conversation

@seproDev (Collaborator) commented Feb 9, 2023:


Description of your pull request and other information

VLive was shut down on 2022-12-31. Most videos were saved by ArchiveTeam and are accessible through the Wayback Machine.
This PR removes the original VLive extractor and adds a new one that downloads the archived videos from the Internet Archive.

The extractor closely follows the logic of the ArchiveTeam grab script and this standalone script by OrIdow6, which are licensed under the Unlicense and CC0, respectively.

Template

Before submitting a pull request, make sure you have:

In order to be accepted and merged into yt-dlp, each piece of code must be in the public domain or released under the Unlicense. Check one of the following options:

  • I am the original author of this code and I am willing to release it under Unlicense
  • I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

Comment on lines 1049 to 1055
def _download_wbm_page(self, url, video_id, timestamp='2', mode='id_', **kwargs):
    for retry in self.RetryManager():
        try:
            return self._download_webpage(f'https://web.archive.org/web/{timestamp}{mode}/' + url, video_id, **kwargs)
        except ExtractorError as err:
            retry.error = err
            continue
@pukkandan (Member) commented Feb 9, 2023:

Almost the same as:

retry_manager = self.RetryManager(fatal=False)
for retry in retry_manager:
    try:
        urlh = self._request_webpage(
            HEADRequest('https://web.archive.org/web/2oe_/http://wayback-fakeurl.archive.org/yt/%s' % video_id),
            video_id, note='Fetching archived video file url', expected_status=True)
    except ExtractorError as e:
        # HTTP Error 404 is expected if the video is not saved.
        if isinstance(e.cause, compat_HTTPError) and e.cause.code == 404:
            self.raise_no_formats(
                'The requested video is not archived, indexed, or there is an issue with web.archive.org (try again later)', expected=True)
        else:
            retry.error = e
if retry_manager.error:
    self.raise_no_formats(retry_manager.error, expected=True, video_id=video_id)

@coletdjnz How feasible is it to make a baseclass for wayback machine?

@seproDev (Collaborator, Author):

I think fakeurl only applies to YouTube videos. Potentially, requests could be moved to a more general function in a baseclass.

Member:

Yeah, there's some overlap between the two, so we could make a base class of some sort. But imo I wouldn't worry about it too much for this PR at least (it would require more testing), so it's up to you.
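
(For context, a minimal sketch of what such a shared Wayback Machine base class might look like; the class name and the 404 handling are illustrative assumptions, not code from this PR:)

class WaybackBaseIE(InfoExtractor):  # hypothetical name
    def _download_wbm_page(self, url, video_id, timestamp='2', mode='id_', **kwargs):
        # Retry, since web.archive.org requests frequently time out or get aborted
        for retry in self.RetryManager():
            try:
                return self._download_webpage(
                    f'https://web.archive.org/web/{timestamp}{mode}/{url}', video_id, **kwargs)
            except ExtractorError as err:
                # A 404 means the page was never archived, so retrying won't help
                if isinstance(err.cause, compat_HTTPError) and err.cause.code == 404:
                    raise
                retry.error = err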


vod_id = traverse_obj(player_info, ('postDetail', 'post', 'officialVideo', 'vodId'))

vod_data = self._parse_json(self._download_wbm_page(f'https://apis.naver.com/rmcnmv/rmcnmv/vod/play/v2.0/{vod_id}', video_id,
Member:

line too long. Indent like:

        vod_data = self._parse_json(self._download_wbm_page(
            url, video_id, note=..., query={
                ...
            }), video_id)

yt_dlp/extractor/archiveorg.py Outdated Show resolved Hide resolved
Comment on lines 1147 to 1162
# Code from NaverBaseIE
automatic_captions = {}
subtitles = {}
for caption in traverse_obj(vod_data, ('captions', 'list'), []):
    caption_url = caption.get('source')
    if not caption_url:
        continue
    caption_url = self._WAYBACK_BASE_URL + caption_url
    sub_dict = automatic_captions if caption.get('type') == 'auto' else subtitles
    lang = caption.get('locale') or join_nonempty('language', 'country', from_dict=caption) or 'und'
    if caption.get('type') == 'fan':
        lang += '_fan%d' % next(i for i in itertools.count(1) if f'{lang}_fan{i}' not in sub_dict)
    sub_dict.setdefault(lang, []).append({
        'url': caption_url,
        'name': join_nonempty('label', 'fanName', from_dict=caption, delim=' - '),
    })
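
(As an aside, the itertools.count idiom above picks the first unused fan-caption suffix; a minimal illustration:)

import itertools

sub_dict = {'en_fan1': [], 'en_fan2': []}
lang = 'en'
# smallest i such that f'{lang}_fan{i}' is not already taken
lang += '_fan%d' % next(i for i in itertools.count(1) if f'{lang}_fan{i}' not in sub_dict)
assert lang == 'en_fan3'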
Member:

Why not subclass from it and call the function?

@seproDev (Collaborator, Author):

I can't call _extract_video_info in NaverBaseIE directly, since that function makes API requests and returns all formats, not just the archived ones. It would take significant work to modify the returned value to match what was archived.
Potentially the subtitle extraction could be moved into its own function, but since only vtt subtitles were archived (not ttml), that also seems like more hassle than it's worth.

Member:

This should work:

diff --git a/yt_dlp/extractor/naver.py b/yt_dlp/extractor/naver.py
index e2e6e9728..eae4d07fb 100644
--- a/yt_dlp/extractor/naver.py
+++ b/yt_dlp/extractor/naver.py
@@ -21,6 +21,23 @@
 class NaverBaseIE(InfoExtractor):
     _CAPTION_EXT_RE = r'\.(?:ttml|vtt)'

+    @staticmethod  # NB: Used in VLiveWebArchiveIE
+    def process_subtitles(vod_data, process_url):
+        ret = {'subtitles': {}, 'automatic_captions': {}}
+        for caption in traverse_obj(vod_data, ('captions', 'list', ...)):
+            caption_url = caption.get('source')
+            if not caption_url:
+                continue
+            type_ = 'automatic_captions' if caption.get('type') == 'auto' else 'subtitles'
+            lang = caption.get('locale') or join_nonempty('language', 'country', from_dict=caption) or 'und'
+            if caption.get('type') == 'fan':
+                lang += '_fan%d' % next(i for i in itertools.count(1) if f'{lang}_fan{i}' not in ret[type_])
+            ret[type_].setdefault(lang, []).extend({
+                'url': sub_url,
+                'name': join_nonempty('label', 'fanName', from_dict=caption, delim=' - '),
+            } for sub_url in process_url(caption_url))
+        return ret
+
     def _extract_video_info(self, video_id, vid, key):
         video_data = self._download_json(
             'http://play.rmcnmv.naver.com/vod/play/v2.0/' + vid,
@@ -79,34 +96,18 @@ def get_subs(caption_url):
                 ]
             return [caption_url]

-        automatic_captions = {}
-        subtitles = {}
-        for caption in get_list('caption'):
-            caption_url = caption.get('source')
-            if not caption_url:
-                continue
-            sub_dict = automatic_captions if caption.get('type') == 'auto' else subtitles
-            lang = caption.get('locale') or join_nonempty('language', 'country', from_dict=caption) or 'und'
-            if caption.get('type') == 'fan':
-                lang += '_fan%d' % next(i for i in itertools.count(1) if f'{lang}_fan{i}' not in sub_dict)
-            sub_dict.setdefault(lang, []).extend({
-                'url': sub_url,
-                'name': join_nonempty('label', 'fanName', from_dict=caption, delim=' - '),
-            } for sub_url in get_subs(caption_url))
-
         user = meta.get('user', {})

         return {
             'id': video_id,
             'title': title,
             'formats': formats,
-            'subtitles': subtitles,
-            'automatic_captions': automatic_captions,
             'thumbnail': try_get(meta, lambda x: x['cover']['source']),
             'view_count': int_or_none(meta.get('count')),
             'uploader_id': user.get('id'),
             'uploader': user.get('name'),
             'uploader_url': user.get('url'),
+            **self.process_subtitles(video_data, get_subs),
         }
and in VLiveWebArchiveIE:

    **NaverBaseIE.process_subtitles(vod_data, lambda x: [self._WAYBACK_BASE_URL + x])

@seproDev (Collaborator, Author):

This looks great. Thank you!

Comment on lines 1114 to 1135
params = {arg.get('name'): arg.get('value') for arg in stream.get('keys', []) if arg.get('type') == 'param'}
m3u8_doc = self._download_wbm_page(max_stream.get('source'), video_id, note='Downloading m3u8', query=params, fatal=False)
if m3u8_doc:
    # M3U8 document is not valid, so it needs to be fixed
    m3u8_doc_lines = m3u8_doc.splitlines()
    modified_m3u8_doc_lines = []
    url_base = max_stream.get('source').rsplit('/', 1)[0]
    first_segment = None
    for line in m3u8_doc_lines:
        if line.startswith('#'):
            modified_m3u8_doc_lines.append(line)
        else:
            modified_line = f'{self._WAYBACK_BASE_URL}{url_base}/{line}?{urllib.parse.urlencode(params)}'
            modified_m3u8_doc_lines.append(modified_line)
            if first_segment is None:
                first_segment = modified_line
    modified_m3u8_doc = '\n'.join(modified_m3u8_doc_lines)

    # Segments may not have been archived. See 101870
    first_segment_req = self._request_webpage(HEADRequest(first_segment), video_id, note='Check first segment availability', errnote=False, fatal=False)
    if first_segment_req:
        formats, _ = self._parse_m3u8_formats_and_subtitles(modified_m3u8_doc, ext='mp4', video_id=video_id)
Member:

Could be in a _extract_formats_from_m3u8 function

@seproDev (Collaborator, Author):

I don't think that works, as the m3u8 needs to be adjusted to add the query parameters to each segment, and a segment needs to be checked to see whether the video was actually archived or just the playlist.

@seproDev (Collaborator, Author):

Oh, I think I misunderstood. I thought you were suggesting using self._extract_m3u8_formats. Moving this to a separate function seems reasonable.

Member:

Yes. I should've said method...
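
(A rough sketch of how that method might look; the name follows the suggestion above, while the exact signature and return value are assumptions:)

def _extract_formats_from_m3u8(self, m3u8_url, params, video_id):
    m3u8_doc = self._download_wbm_page(m3u8_url, video_id, note='Downloading m3u8', query=params, fatal=False)
    if not m3u8_doc:
        return []

    # The archived M3U8 is not valid as-is: each segment URI must be rewritten
    # into an absolute Wayback Machine URL carrying the original query parameters
    url_base = m3u8_url.rsplit('/', 1)[0]
    modified_lines, first_segment = [], None
    for line in m3u8_doc.splitlines():
        if not line.startswith('#'):
            line = f'{self._WAYBACK_BASE_URL}{url_base}/{line}?{urllib.parse.urlencode(params)}'
            first_segment = first_segment or line
        modified_lines.append(line)

    # Sometimes only the playlist was archived, not the segments; probe the first one
    if not first_segment or not self._request_webpage(
            HEADRequest(first_segment), video_id,
            note='Checking first segment availability', errnote=False, fatal=False):
        return []
    formats, _ = self._parse_m3u8_formats_and_subtitles(
        '\n'.join(modified_lines), ext='mp4', video_id=video_id)
    return formats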

formats, _ = self._parse_m3u8_formats_and_subtitles(modified_m3u8_doc, ext='mp4', video_id=video_id)

# For parts of the project, MP4 files were archived
max_video = max(traverse_obj(vod_data, ('videos', 'list'), []), key=lambda v: traverse_obj(v, ('bitrate', 'video')), default=None)
@pukkandan (Member) commented Feb 9, 2023:

Suggested change:
-max_video = max(traverse_obj(vod_data, ('videos', 'list'), []), key=lambda v: traverse_obj(v, ('bitrate', 'video')), default=None)
+max_video = max(
+    traverse_obj(vod_data, ('videos', 'list', ...), default=[None]),
+    key=lambda v: traverse_obj(v, ('bitrate', 'video'), default=0))

max([]) will throw an error. Similarly, a key returning a mix of int and None will also throw an error.

@seproDev (Collaborator, Author) commented Feb 9, 2023:

max([], default=None) does not throw an error. Should I still change it?

Edit: Oh, I missed that the key might return None. Will change.
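
(For reference, the exact behaviors under discussion:)

max([])                               # ValueError: max() arg is an empty sequence
max([], default=None)                 # None, no error
max([3, None], key=lambda v: v)       # TypeError: '<' not supported between NoneType and int
max([3, None], key=lambda v: v or 0)  # 3, None keys are coerced to 0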

Member:

Ah, my bad. I read the default as being inside traverse_obj. The default=0 is still needed though, so this would work as well:

Suggested change:
-max_video = max(traverse_obj(vod_data, ('videos', 'list'), []), key=lambda v: traverse_obj(v, ('bitrate', 'video')), default=None)
+max_video = max(
+    traverse_obj(vod_data, ('videos', 'list', ...)),
+    key=lambda v: traverse_obj(v, ('bitrate', 'video'), default=0), default=None)

Comment on lines 1140 to 1141
video_url = self._WAYBACK_BASE_URL + max_video.get('source')
video_req = self._request_webpage(HEADRequest(video_url), video_id, note='Check video availability', errnote=False, fatal=False)
Member:

this is the same as _download_wbm_page, no?

@seproDev (Collaborator, Author):

It's only making a HEAD request, so as not to download the entire video.

Member:

I assume the lack of retries here is also intentional?

@seproDev (Collaborator, Author):

Hmm, sort of. I added the retries because the IA servers are quite unreliable, with requests often timing out or being aborted. Originally even 404 responses would be retried, which meant these availability checks needed three requests to fail. I just adjusted _download_wbm_page to no longer retry 404ed requests, since that is dumb.

Any suggestions for a nice way to add the option of HEAD requests to _download_wbm_page?
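
(One possibility, sketched under the assumption that a simple flag is acceptable; the head parameter is not part of the merged code:)

def _download_wbm_page(self, url, video_id, timestamp='2', mode='id_', head=False, **kwargs):
    full_url = f'https://web.archive.org/web/{timestamp}{mode}/{url}'
    if head:
        full_url = HEADRequest(full_url)
    downloader = self._request_webpage if head else self._download_webpage
    for retry in self.RetryManager():
        try:
            return downloader(full_url, video_id, **kwargs)
        except ExtractorError as err:
            # Don't retry 404s; the page simply isn't archived
            if isinstance(err.cause, compat_HTTPError) and err.cause.code == 404:
                raise
            retry.error = err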

@Grub4K (Member) commented Feb 9, 2023:

Maybe it makes sense to use {str_or_none} for the fields in the info dict (uploader, ...) instead of {str}? Currently '' is returned, but that is not really useful, is it?

@seproDev (Collaborator, Author) commented Feb 9, 2023:

> Maybe it makes sense to use {str_or_none} for the fields in the info dict (uploader, ...) instead of {str}? Currently '' is returned, but that is not really useful, is it?

Isn't str_or_none('') == ''?
I do agree that the empty string is not really useful.
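
(For reference: per the utils definition quoted in the diff further down, str_or_none only special-cases None:)

str_or_none(None)  # None
str_or_none(42)    # '42'
str_or_none('')    # '' -- the empty string is not None, so it passes through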

@Grub4K (Member) commented Feb 9, 2023:

Hmm, indeed, you are correct.

We recently discussed the possibility of traverse_obj dropping empty strings. It might make sense to bring that discussion up again. Maybe only do it for dictionary values? Would it make sense to keep empty-string values at all?

The alternative would be to do something like

def str_non_empty(v):
    if not isinstance(v, str):
        return None
    v = v.strip()
    return v if v else None

and then use {str_non_empty}. Maybe a function like this could be moved to utils? @pukkandan
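
(Illustrative usage, with hypothetical dict keys:)

uploader = traverse_obj(player_info, ('author', 'nickname', {str_non_empty}))
# '' or '   ' would be dropped as None; a real name passes through unchanged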

@pukkandan (Member) commented:

When using traverse_obj without a dict, this is easily solvable with an 'or None' at the end of each line. But doing {lambda x: x or None} for each entry is ugly. For this specific case, since we don't mind the condition applying to all the fields, we could just do expected_type=lambda x: x or None. But it'd be useful to have some generalized way of doing this in the future.

Instead of your suggestion, we could create a function truthy_or_none, which could be re-used for non-zero int etc. Then it'll be {str}, {truthy_or_none}. Though I'm not sure it's worth it.

diff --git a/yt_dlp/extractor/panopto.py b/yt_dlp/extractor/panopto.py
index 32c103bc1..3b73f8cbe 100644
--- a/yt_dlp/extractor/panopto.py
+++ b/yt_dlp/extractor/panopto.py
@@ -412,7 +412,7 @@ def _real_extract(self, url):
         return {
             'id': video_id,
             'title': delivery.get('SessionName'),
-            'cast': traverse_obj(delivery, ('Contributors', ..., 'DisplayName'), default=[], expected_type=lambda x: x or None),
+            'cast': traverse_obj(delivery, ('Contributors', ..., 'DisplayName', {truthy_or_none})),
             'timestamp': session_start_time - 11640000000 if session_start_time else None,
             'duration': delivery.get('Duration'),
             'thumbnail': base_url + f'/Services/FrameGrabber.svc/FrameRedirect?objectId={video_id}&mode=Delivery&random={random()}',
diff --git a/yt_dlp/extractor/ruv.py b/yt_dlp/extractor/ruv.py
index 12499d6ca..f3538b6bc 100644
--- a/yt_dlp/extractor/ruv.py
+++ b/yt_dlp/extractor/ruv.py
@@ -176,7 +176,7 @@ def _real_extract(self, url):
             'title': traverse_obj(program, ('episodes', 0, 'title'), 'title'),
             'description': traverse_obj(
                 program, ('episodes', 0, 'description'), 'description', 'short_description',
-                expected_type=lambda x: x or None),
+                expected_type=truthy_or_none),
             'subtitles': subs,
             'thumbnail': episode.get('image', '').replace('$$IMAGESIZE$$', '1960') or None,
             'timestamp': unified_timestamp(episode.get('firstrun')),
diff --git a/yt_dlp/utils.py b/yt_dlp/utils.py
index 878b2b6a8..56d1e36d1 100644
--- a/yt_dlp/utils.py
+++ b/yt_dlp/utils.py
@@ -2578,6 +2578,10 @@ def str_or_none(v, default=None):
     return default if v is None else str(v)


+def truthy_or_none(v):
+    return v or None
+
+
 def str_to_int(int_str):
     """ A more relaxed version of int_or_none """
     if isinstance(int_str, int):

Alternatively, we could change the behavior of str_or_none. Looking through tests, I could only find one case where the current behaviour is being used:

diff --git a/yt_dlp/extractor/tiktok.py b/yt_dlp/extractor/tiktok.py
index cc96de364..9ca508b7e 100644
--- a/yt_dlp/extractor/tiktok.py
+++ b/yt_dlp/extractor/tiktok.py
@@ -395,7 +395,7 @@ def _parse_aweme_video_web(self, aweme_detail, webpage_url):
             'artist': str_or_none(music_info.get('authorName')),
             'formats': formats,
             'thumbnails': thumbnails,
-            'description': str_or_none(aweme_detail.get('desc')),
+            'description': traverse_obj(aweme_detail, ('desc', {str})),
             'http_headers': {
                 'Referer': webpage_url
             }
diff --git a/yt_dlp/utils.py b/yt_dlp/utils.py
index 878b2b6a8..5578a7cfd 100644
--- a/yt_dlp/utils.py
+++ b/yt_dlp/utils.py
@@ -2575,7 +2575,7 @@ def int_or_none(v, scale=1, default=None, get_attr=None, invscale=1):


 def str_or_none(v, default=None):
-    return default if v is None else str(v)
+    return default if v in (None, '') else str(v)


 def str_to_int(int_str):

In any case @seproDev, use the expected_type=lambda x: x or None suggestion for now while we figure out how we want to handle this use-case
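
(i.e., something like the following, with an illustrative path:)

'uploader': traverse_obj(player_info, ('author', 'nickname'), expected_type=lambda x: x or None),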

@coletdjnz self-requested a review February 10, 2023 02:36
@pukkandan (Member) commented:

pls verify that I haven't broken anything

@seproDev (Collaborator, Author) commented:

Looks good to me! All tests still pass.
