Support Archive.org playlists download #31309

Open
2 of 3 tasks
brauliobo opened this issue Oct 25, 2022 · 2 comments
Labels
Good first issue An issue that should be easier to solve request

Comments

brauliobo commented Oct 25, 2022

Checklist

  • I'm reporting a site feature request
  • I've verified that I'm running youtube-dl version 2021.12.17
  • I've searched the bugtracker for similar site feature requests including closed ones
    archive.org playlists are broken #25466, closed as a duplicate with no link to the original issue

Description

Support downloading Archive.org playlists, e.g. https://archive.org/details/energy-from-the-vacuum/Energy+From+The+Vacuum/

Contributor

dirkf commented Oct 26, 2022

  1. There is a site feature template. I've started to formulate your issue in the template; please check and complete.

  2. You can get the playlist with the released yt-dl by removing the last part of the URL, e.g. https://archive.org/details/energy-from-the-vacuum, but the item titles are not shown in the --list-formats output. If you have jq installed, this command lists them (on Windows, use double quotes "" instead of single quotes):

youtube-dl -J --flat-playlist 'https://archive.org/details/energy-from-the-vacuum' | jq '.entries[].title'
  3. It looks like everything needed for https://archive.org/details/{slug}/{playlist_title} is in the XML file https://archive.org/download/{slug}/{slug}_files.xml, including links to the original and transcoded media files. The playlist metadata XML is https://archive.org/download/{slug}/{slug}_meta.xml and the playlist thumbnail is https://archive.org/download/{slug}.thumbs/__ia_thumb.jpg. However, the current extractor code uses the /embed/{video_id} page.

  4. Should the extractor include the transcoded media (a) as-is (b) with lower preference than the originals (c) not at all?

  5. Should the extractor include the original media (a) as-is (b) with higher preference than the transcoded media (c) not at all, as now?

  6. The ArchiveOrgIE extractor needs to be updated from the newer yt-dlp version. The 3rd test in the yt-dlp extractor is a playlist, as here, but only the first item is fetched in the test. Also, the first test in both versions of the extractor is an item with three parts that should be extracted as a multi_video playlist, but only the first part is fetched.
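
As a rough illustration of the URL layout described in the list above, the following sketch builds the three per-item URLs from a slug and splits a files listing into original and transcoded entries. The helper names (`item_urls`, `split_by_source`) are hypothetical, the sample XML is made up, and the `file` element with `name` and `source` attributes is an assumption about the `_files.xml` layout, not something confirmed in this thread. None of this is extractor code.

```python
import xml.etree.ElementTree as ET

DOWNLOAD_BASE = 'https://archive.org/download/'


def item_urls(slug):
    """Build the per-item URLs using the patterns quoted in the comment."""
    return {
        'files': '{0}{1}/{1}_files.xml'.format(DOWNLOAD_BASE, slug),
        'meta': '{0}{1}/{1}_meta.xml'.format(DOWNLOAD_BASE, slug),
        'thumbnail': '{0}{1}.thumbs/__ia_thumb.jpg'.format(DOWNLOAD_BASE, slug),
    }


# Made-up sample; element and attribute names are assumptions.
SAMPLE_FILES_XML = '''<files>
  <file name="Part 1.avi" source="original"><format>Cinepack</format></file>
  <file name="Part 1.mp4" source="derivative"><format>h.264</format></file>
  <file name="example_meta.xml" source="metadata"><format>Metadata</format></file>
</files>'''


def split_by_source(files_xml):
    """Group file names by whether they are originals or derivatives."""
    root = ET.fromstring(files_xml)
    result = {'original': [], 'derivative': []}
    for file_el in root.iter('file'):
        source = file_el.get('source')
        if source in result:  # skip metadata and anything unexpected
            result[source].append(file_el.get('name'))
    return result


urls = item_urls('energy-from-the-vacuum')
groups = split_by_source(SAMPLE_FILES_XML)
```

Keeping originals and derivatives in separate lists leaves the preference questions in the list above open: an extractor could drop either list or rank one below the other.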

@dirkf dirkf added request Good first issue An issue that should be easier to solve labels Oct 26, 2022
Contributor

dirkf commented Oct 26, 2022

This is a minimal patch to support the problem URL:

--- old/youtube_dl/extractor/archiveorg.py
+++ new/youtube_dl/extractor/archiveorg.py
@@ -1,9 +1,17 @@
+# coding: utf-8
 from __future__ import unicode_literals
 
+import re
+
 from .common import InfoExtractor
+from ..compat import (
+    compat_filter as filter,
+    compat_urllib_parse_unquote_plus,
+)
 from ..utils import (
     clean_html,
     extract_attributes,
+    ExtractorError,
     unified_strdate,
     unified_timestamp,
 )
@@ -11,8 +19,12 @@
 
 class ArchiveOrgIE(InfoExtractor):
     IE_NAME = 'archive.org'
     IE_DESC = 'archive.org videos'
-    _VALID_URL = r'https?://(?:www\.)?archive\.org/(?:details|embed)/(?P<id>[^/?#&]+)'
+    _VALID_URL = r'''(?x)
+                    https?://(?:www\.)?archive\.org/
+                        (?:(?P<det>details)|embed)/
+                        (?P<id>(?(det)[^/]+/)?[^/?#]+)(?:[?#]|/?$)
+                    '''
     _TESTS = [{
         'url': 'http://archive.org/details/XD300-23_68HighlightsAResearchCntAugHumanIntellect',
         'md5': '8af1d4cf447933ed3c7f4871162602db',
@@ -46,8 +58,11 @@
         'only_matching': True,
     }]
 
+
     def _real_extract(self, url):
-        video_id = self._match_id(url)
+        video_id = compat_urllib_parse_unquote_plus(self._match_id(url))
+        video_id, entry_id = (video_id.split('/', 1) + [None])[:2]
+
         webpage = self._download_webpage(
             'http://archive.org/embed/' + video_id, video_id)
 
@@ -67,10 +82,18 @@
         if jwplayer_playlist:
             info = self._parse_jwplayer_data(
                 {'playlist': jwplayer_playlist}, video_id, base_url=url)
+            for entry in info.get('entries') or []:
+                e_id = entry.get('thumbnail')
+                if e_id:
+                    e_id = self._generic_id(e_id).rsplit('/', 1)[-1]
+                    e_id = re.sub(r'(?!^)_\d+$', '', e_id).replace(' ', '_')
+                    e_id = (entry.get('id') or video_id).replace(video_id, '/'.join((video_id, e_id)))
+                    entry['id'] = (entry.get('id') or video_id).replace(video_id, e_id)
         else:
             # HTML5 media fallback
             info = self._parse_html5_media_entries(url, webpage, video_id)[0]
-            info['id'] = video_id
+
+        info.setdefault('id', video_id)
 
         def get_optional(metadata, field):
             return metadata.get(field, [None])[0]
@@ -81,8 +104,23 @@
             })['metadata']
         info.update({
             'title': get_optional(metadata, 'title') or info.get('title'),
-            'description': clean_html(get_optional(metadata, 'description')),
         })
+        if entry_id and info.get('entries') and '.' in entry_id:
+            ext = ''.join(entry_id.rpartition('.')[1:])
+
+            def match_entry(x):
+                if not x.get('id'):
+                    return False
+                return bool(re.search(
+                    r'(?:^|/)%s$' % (entry_id, ),
+                    x['id'] + ext))
+
+            info = next(filter(match_entry, info['entries']), None)
+            if not info:
+                raise ExtractorError('Entry %s not found in %s' % (entry_id, video_id))
+
+        if not info.get('description'):
+            info['description'] = clean_html(get_optional(metadata, 'description'))
         if info.get('_type') != 'playlist':
             creator = get_optional(metadata, 'creator')
             info.update({
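
To see how the new `_VALID_URL` and the ID splitting in the patch behave, here is a standalone sketch that copies the pattern and the `split('/', 1)` step outside the extractor. It uses Python 3's `urllib.parse.unquote_plus` in place of `compat_urllib_parse_unquote_plus`. The function names `split_ids` and `find_entry` are hypothetical, and `find_entry` adds `re.escape()`, which the patch does not use, so treat it as an approximation of `match_entry` rather than the patch's exact behaviour.

```python
import re
from urllib.parse import unquote_plus

# Pattern copied from the patch; (?(det)...) only applies after "details".
VALID_URL = r'''(?x)
    https?://(?:www\.)?archive\.org/
        (?:(?P<det>details)|embed)/
        (?P<id>(?(det)[^/]+/)?[^/?#]+)(?:[?#]|/?$)
    '''


def split_ids(url):
    """Return (video_id, entry_id); entry_id is None for a bare item URL."""
    m = re.match(VALID_URL, url)
    if not m:
        return None
    video_id = unquote_plus(m.group('id'))
    video_id, entry_id = (video_id.split('/', 1) + [None])[:2]
    return video_id, entry_id


def find_entry(entries, entry_id):
    """Approximate match_entry: compare entry id plus extension to entry_id."""
    ext = ''.join(entry_id.rpartition('.')[1:])
    # re.escape() is an addition in this sketch, not part of the patch.
    pattern = r'(?:^|/)%s$' % re.escape(entry_id)
    for entry in entries:
        if entry.get('id') and re.search(pattern, entry['id'] + ext):
            return entry
    return None


playlist = split_ids(
    'https://archive.org/details/energy-from-the-vacuum/Energy+From+The+Vacuum/')
item = split_ids('https://archive.org/details/energy-from-the-vacuum')
embed = split_ids(
    'https://archive.org/embed/XD300-23_68HighlightsAResearchCntAugHumanIntellect')
# Hypothetical entry ids, shaped like the "video_id/e_id" ids the patch builds.
found = find_entry([{'id': 'slug/Part_1'}, {'id': 'slug/Part_2'}], 'Part_2.mp4')
```

In the patch, an entry is only looked up when `entry_id` contains a dot, so the problem URL, whose second path component is `Energy From The Vacuum` after unquoting, still resolves to the whole playlist.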
