Support Archive.org playlists download #31309

brauliobo · 2022-10-25T17:50:57Z

Checklist

I'm reporting a site feature request
I've verified that I'm running youtube-dl version 2021.12.17
I've searched the bugtracker for similar site feature requests including closed ones
archive.org playlists are broken #25466, closed as a duplicate with no link to the original issue

Description

Support Archive.org playlists download, e.g. https://archive.org/details/energy-from-the-vacuum/Energy+From+The+Vacuum/

dirkf · 2022-10-26T10:32:49Z

There is a site feature template. I've started to formulate your issue in the template; please check and complete.
With the released yt-dl you can already get the playlist by removing the last part of the URL, e.g. https://archive.org/details/energy-from-the-vacuum, but the item titles are not shown in the --list-formats output. If you have jq installed, this will list them (on Windows, use double quotes "" instead of the single quotes):

youtube-dl -J --flat-playlist 'https://archive.org/details/energy-from-the-vacuum' | jq '.entries[].title'
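
If jq is not available, the same listing can be sketched in Python. This is only an illustration: `entry_titles` is a hypothetical helper name, and it assumes the `-J` dump has the same `entries[].title` layout that the jq filter above reads.

```python
import json


def entry_titles(playlist_json):
    """Return the title of each entry in a `youtube-dl -J` playlist dump.

    Entries without a title yield None, mirroring jq's null.
    """
    info = json.loads(playlist_json)
    return [entry.get('title') for entry in info.get('entries') or []]


# Usage sketch (assumes youtube-dl is on PATH):
#   import subprocess
#   out = subprocess.check_output([
#       'youtube-dl', '-J', '--flat-playlist',
#       'https://archive.org/details/energy-from-the-vacuum'])
#   for title in entry_titles(out.decode('utf-8')):
#       print(title)
```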

It looks like everything needed for https://archive.org/details/{slug}/{playlist_title} is in the XML file https://archive.org/download/{slug}/{slug}_files.xml, including links to the original and transcoded media files. The playlist metadata XML is https://archive.org/download/{slug}/{slug}_meta.xml and the playlist thumbnail is https://archive.org/download/{slug}.thumbs/__ia_thumb.jpg. However the current extractor code uses the /embed/{video_id} page.
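
As a rough sketch, the three URL templates above can be built from the slug as follows. The function name and dictionary keys are hypothetical; only the URL patterns come from the paragraph above, and nothing here checks that the resources exist for a given item.

```python
def archive_org_resource_urls(slug):
    """Build the per-item resource URLs described above from an item slug."""
    base = 'https://archive.org/download/'
    return {
        # file list, including original and transcoded media
        'files_xml': '%s%s/%s_files.xml' % (base, slug, slug),
        # playlist-level metadata
        'meta_xml': '%s%s/%s_meta.xml' % (base, slug, slug),
        # playlist thumbnail
        'thumbnail': '%s%s.thumbs/__ia_thumb.jpg' % (base, slug),
    }
```

For example, `archive_org_resource_urls('energy-from-the-vacuum')['files_xml']` gives https://archive.org/download/energy-from-the-vacuum/energy-from-the-vacuum_files.xml.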
Should the extractor include the transcoded media (a) as-is (b) with lower preference than the originals (c) not at all?
Should the extractor include the original media (a) as-is (b) with higher preference than the transcoded media (c) not at all, as now?
The ArchiveOrgIE extractor needs to be updated from the newer yt-dlp version. The 3rd test in the yt-dlp extractor is a playlist, as here, but only the first item is fetched in the test. Also, the first test in both versions of the extractor is an item with three parts that should be extracted as a multi_video playlist, but only the first part is fetched.

dirkf · 2022-10-26T16:09:57Z

This is a minimal patch to support the problem URL:

--- old/youtube_dl/extractor/archiveorg.py
+++ new/youtube_dl/extractor/archiveorg.py
@@ -1,9 +1,17 @@
+# coding: utf-8
 from __future__ import unicode_literals
 
+import re
+
 from .common import InfoExtractor
+from ..compat import (
+    compat_filter as filter,
+    compat_urllib_parse_unquote_plus,
+)
 from ..utils import (
     clean_html,
     extract_attributes,
+    ExtractorError,
     unified_strdate,
     unified_timestamp,
 )
@@ -11,8 +19,12 @@
 
 class ArchiveOrgIE(InfoExtractor):
     IE_NAME = 'archive.org'
     IE_DESC = 'archive.org videos'
-    _VALID_URL = r'https?://(?:www\.)?archive\.org/(?:details|embed)/(?P<id>[^/?#&]+)'
+    _VALID_URL = r'''(?x)
+                    https?://(?:www\.)?archive\.org/
+                        (?:(?P<det>details)|embed)/
+                        (?P<id>(?(det)[^/]+/)?[^/?#]+)(?:[?#]|/?$)
+                    '''
     _TESTS = [{
         'url': 'http://archive.org/details/XD300-23_68HighlightsAResearchCntAugHumanIntellect',
         'md5': '8af1d4cf447933ed3c7f4871162602db',
@@ -46,8 +58,11 @@
         'only_matching': True,
     }]
 
+
     def _real_extract(self, url):
-        video_id = self._match_id(url)
+        video_id = compat_urllib_parse_unquote_plus(self._match_id(url))
+        video_id, entry_id = (video_id.split('/', 1) + [None])[:2]
+
         webpage = self._download_webpage(
             'http://archive.org/embed/' + video_id, video_id)
 
@@ -67,10 +82,18 @@
         if jwplayer_playlist:
             info = self._parse_jwplayer_data(
                 {'playlist': jwplayer_playlist}, video_id, base_url=url)
+            for entry in info.get('entries') or []:
+                e_id = entry.get('thumbnail')
+                if e_id:
+                    e_id = self._generic_id(e_id).rsplit('/', 1)[-1]
+                    e_id = re.sub(r'(?!^)_\d+$', '', e_id).replace(' ', '_')
+                    e_id = (entry.get('id') or video_id).replace(video_id, '/'.join((video_id, e_id)))
+                    entry['id'] = (entry.get('id') or video_id).replace(video_id, e_id)
         else:
             # HTML5 media fallback
             info = self._parse_html5_media_entries(url, webpage, video_id)[0]
-            info['id'] = video_id
+
+        info.setdefault('id', video_id)
 
         def get_optional(metadata, field):
             return metadata.get(field, [None])[0]
@@ -81,8 +104,23 @@
             })['metadata']
         info.update({
             'title': get_optional(metadata, 'title') or info.get('title'),
-            'description': clean_html(get_optional(metadata, 'description')),
         })
+        if entry_id and info.get('entries') and '.' in entry_id:
+            ext = ''.join(entry_id.rpartition('.')[1:])
+
+            def match_entry(x):
+                if not x.get('id'):
+                    return False
+                return bool(re.search(
+                    r'(?:^|/)%s$' % (re.escape(entry_id), ),
+                    x['id'] + ext))
+
+            info = next(filter(match_entry, info['entries']), None)
+            if not info:
+                raise ExtractorError('Entry %s not found in %s' % (entry_id, video_id))
+
+        if not info.get('description'):
+            info['description'] = clean_html(get_optional(metadata, 'description'))
         if info.get('_type') != 'playlist':
             creator = get_optional(metadata, 'creator')
             info.update({
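
To see how the new _VALID_URL and the ID split in the patch behave, here is a standalone sketch outside the extractor. It assumes Python 3's `urllib.parse.unquote_plus` in place of `compat_urllib_parse_unquote_plus` and plain `re.match` in place of `_match_id`; `split_item_and_entry` is a hypothetical name used only for this illustration.

```python
import re
from urllib.parse import unquote_plus

# Same pattern as in the patch: an optional second path component is
# only allowed for /details/ URLs, via the (?(det)...) conditional.
VALID_URL = r'''(?x)
    https?://(?:www\.)?archive\.org/
        (?:(?P<det>details)|embed)/
        (?P<id>(?(det)[^/]+/)?[^/?#]+)(?:[?#]|/?$)
    '''


def split_item_and_entry(url):
    """Return (video_id, entry_id); entry_id is None if the URL has no entry part."""
    m = re.match(VALID_URL, url)
    if not m:
        return None
    video_id = unquote_plus(m.group('id'))
    # Pad with None so a URL without an entry part still unpacks to two values
    video_id, entry_id = (video_id.split('/', 1) + [None])[:2]
    return video_id, entry_id
```

With the problem URL this yields the item slug plus the decoded entry title; with the shortened URL the entry part is None, so the whole playlist is returned.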

dirkf added the "request" and "Good first issue" (An issue that should be easier to solve) labels on Oct 26, 2022

dirkf mentioned this issue Oct 26, 2022

[archive.org] How do I download this video? yt-dlp/yt-dlp#2759
