[ArchiveOrg] Video with (encoded) question mark in file name cannot be downloaded with 404 error #9173

Closed
11 tasks done
nothrem opened this issue Feb 9, 2024 · 2 comments · Fixed by #9279
Labels
patch-available: There is patch available that should fix this issue. Someone needs to make a PR with it
site-bug: Issue with a specific website

Comments

@nothrem

nothrem commented Feb 9, 2024

DO NOT REMOVE OR SKIP THE ISSUE TEMPLATE

  • I understand that I will be blocked if I intentionally remove or skip any mandatory* field

Checklist

Region

Europe/Czechia

Provide a description that is worded well enough to be understood

For example, Monty Python's "Whither Canada?" cannot be downloaded (for free, without an account) from the Internet Archive (http://archive.org) when using the page URL ("details" in the URL), but it can be downloaded with the direct link ("download" in the URL).

Problematic URLs are:

Escaping the question mark results in a message that no videos were found: "[archive.org] Playlist Monty Python's Flying Circus: The Complete Series 1 to 4 [BLU-RAY] [SD]: Downloading 0 items"

The problem is probably visible in the last debug line, [debug] Invoking http downloader on "https://archive.org/download/mpfc-remastered_20210305_1553/01. Whither Canada?.mkv". It indicates that the question mark in the URL was decoded from %3F to ? (or, conversely, was not encoded when the file name was read from the downloaded page). Everything after the ? is then treated as the query string, so the extension .mkv becomes a QUERY parameter and the server cannot find the file without the extension.
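To illustrate the suspected cause, here is a minimal sketch using only Python's standard `urllib.parse.urlsplit`. It is not yt-dlp code; it only shows how a generic URL parser splits the two URLs that appear in the verbose output. The variable names are made up for this illustration.

```python
from urllib.parse import urlsplit

# URL as printed in the failing "[debug] Invoking http downloader" line.
raw = "https://archive.org/download/mpfc-remastered_20210305_1553/01. Whither Canada?.mkv"
# URL of the direct link that downloads correctly.
encoded = "https://archive.org/download/mpfc-remastered_20210305_1553/01.%20Whither%20Canada%3F.mkv"

raw_parts = urlsplit(raw)
encoded_parts = urlsplit(encoded)

# With a literal "?", the path ends before the extension and
# ".mkv" lands in the query component.
print("raw path:     ", raw_parts.path)
print("raw query:    ", raw_parts.query)

# With "%3F", the whole file name stays in the path and the query is empty.
print("encoded path: ", encoded_parts.path)
print("encoded query:", encoded_parts.query)
```

If the server resolves files by path only, the first URL asks for a file named "01. Whither Canada", which would explain the 404.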

The file downloads correctly with yt-dlp when the direct link is looked up manually on the server (Download options - Matroska - "01. Whither Canada?.mkv"):


All other parts can be downloaded from archive.org without a problem, e.g.

Provide verbose output that clearly demonstrates the problem

  • Run your yt-dlp command with -vU flag added (yt-dlp -vU <your command line>)
  • If using API, add 'verbose': True to YoutubeDL params instead
  • Copy the WHOLE output (starting with [debug] Command-line config) and insert it below

Complete Verbose Output

--- Output for page link:

[debug] Command-line config: ['-vU', 'https://archive.org/details/mpfc-remastered_20210305_1553/01.+Whither+Canada%3F.mkv']
[debug] Encodings: locale cp1250, fs utf-8, pref cp1250, out utf-8, error utf-8, screen utf-8
[debug] yt-dlp version nightly@2024.02.05.232712 from yt-dlp/yt-dlp-nightly-builds [05420227a] (win_exe)
[debug] Python 3.8.10 (CPython AMD64 64bit) - Windows-10-10.0.22631-SP0 (OpenSSL 1.1.1k  25 Mar 2021)
[debug] exe versions: ffmpeg 6.1-full_build-www.gyan.dev (setts), ffprobe 6.1-full_build-www.gyan.dev
[debug] Optional libraries: Cryptodome-3.20.0, brotli-1.1.0, certifi-2024.02.02, mutagen-1.47.0, requests-2.31.0, sqlite3-3.35.5, urllib3-2.2.0, websockets-12.0
[debug] Proxy map: {}
[debug] Request Handlers: urllib, requests, websockets
[debug] Loaded 1832 extractors
[debug] Fetching release info: https://api.github.com/repos/yt-dlp/yt-dlp-nightly-builds/releases/latest
Latest version: nightly@2024.02.05.232712 from yt-dlp/yt-dlp-nightly-builds
yt-dlp is up to date (nightly@2024.02.05.232712 from yt-dlp/yt-dlp-nightly-builds)
[archive.org] Extracting URL: https://archive.org/details/mpfc-remastered_20210305_1553/01.+Whither+Canada%3F.mkv
[archive.org] mpfc-remastered_20210305_1553: Downloading webpage
[archive.org] mpfc-remastered_20210305_1553: Downloading JSON metadata
[debug] Sort order given by extractor: source
[debug] Formats sorted by: hasvid, ie_pref, source, lang, quality, res, fps, hdr:12(7), vcodec:vp9.2(10), channels, acodec, size, br, asr, proto, vext, aext, hasaud, id
[debug] Default format spec: bestvideo*+bestaudio/best
[info] mpfc-remastered_20210305_1553/01. Whither Canada?.mkv: Downloading 1 format(s): 3
[debug] Invoking http downloader on "https://archive.org/download/mpfc-remastered_20210305_1553/01. Whither Canada?.mkv"

ERROR: unable to download video data: HTTP Error 404: Not Found
Traceback (most recent call last):
  File "yt_dlp\YoutubeDL.py", line 3417, in process_info
  File "yt_dlp\YoutubeDL.py", line 3138, in dl
  File "yt_dlp\downloader\common.py", line 455, in download
  File "yt_dlp\downloader\http.py", line 364, in real_download
  File "yt_dlp\downloader\http.py", line 120, in establish_connection
  File "yt_dlp\YoutubeDL.py", line 4081, in urlopen
  File "yt_dlp\networking\common.py", line 114, in send
  File "yt_dlp\networking\_helper.py", line 204, in wrapper
  File "yt_dlp\networking\common.py", line 325, in send
  File "yt_dlp\networking\_requests.py", line 348, in _send
yt_dlp.networking.exceptions.HTTPError: HTTP Error 404: Not Found



--- Output for direct link:

[debug] Command-line config: ['-vU', 'https://archive.org/download/mpfc-remastered_20210305_1553/01.%20Whither%20Canada%3F.mkv']
[debug] Encodings: locale cp1250, fs utf-8, pref cp1250, out utf-8, error utf-8, screen utf-8
[debug] yt-dlp version nightly@2024.02.05.232712 from yt-dlp/yt-dlp-nightly-builds [05420227a] (win_exe)
[debug] Python 3.8.10 (CPython AMD64 64bit) - Windows-10-10.0.22631-SP0 (OpenSSL 1.1.1k  25 Mar 2021)
[debug] exe versions: ffmpeg 6.1-full_build-www.gyan.dev (setts), ffprobe 6.1-full_build-www.gyan.dev
[debug] Optional libraries: Cryptodome-3.20.0, brotli-1.1.0, certifi-2024.02.02, mutagen-1.47.0, requests-2.31.0, sqlite3-3.35.5, urllib3-2.2.0, websockets-12.0
[debug] Proxy map: {}
[debug] Request Handlers: urllib, requests, websockets
[debug] Loaded 1832 extractors
[debug] Fetching release info: https://api.github.com/repos/yt-dlp/yt-dlp-nightly-builds/releases/latest
Latest version: nightly@2024.02.05.232712 from yt-dlp/yt-dlp-nightly-builds
yt-dlp is up to date (nightly@2024.02.05.232712 from yt-dlp/yt-dlp-nightly-builds)
[generic] Extracting URL: https://archive.org/download/mpfc-remastered_20210305_1553/01.%20Whither%20Canada%3F.mkv
[generic] 01. Whither Canada?: Downloading webpage
[redirect] Following redirect to https://ia903408.us.archive.org/5/items/mpfc-remastered_20210305_1553/01.%20Whither%20Canada%3F.mkv
[generic] Extracting URL: https://ia903408.us.archive.org/5/items/mpfc-remastered_20210305_1553/01.%20Whither%20Canada%3F.mkv
[generic] 01. Whither Canada?: Downloading webpage
WARNING: [generic] Falling back on generic information extractor
WARNING: [generic] URL could be a direct video link, returning it as such.
[debug] Formats sorted by: hasvid, ie_pref, lang, quality, res, fps, hdr:12(7), vcodec:vp9.2(10), channels, acodec, size, br, asr, proto, vext, aext, hasaud, source, id
[debug] Default format spec: bestvideo*+bestaudio/best
[info] 01. Whither Canada?: Downloading 1 format(s): 0
[debug] Invoking http downloader on "https://ia903408.us.archive.org/5/items/mpfc-remastered_20210305_1553/01.%20Whither%20Canada%3F.mkv"
[debug] File locking is not supported. Proceeding without locking
[download] Destination: 01. Whither Canada? [01. Whither Canada?].mkv
... then downloading continues until file is fully downloaded
nothrem added the site-bug (Issue with a specific website) and triage (Untriaged issue) labels Feb 9, 2024
@dirkf
Contributor

dirkf commented Feb 10, 2024

This also happens with a yt-dlp from before the networking rework, though not with yt-dl master (but the yt-dl extractor can't distinguish between the single video URL and the playlist).

@bashonly
Member

diff --git a/yt_dlp/extractor/archiveorg.py b/yt_dlp/extractor/archiveorg.py
index 3bb6f2e31..c1bc1ba92 100644
--- a/yt_dlp/extractor/archiveorg.py
+++ b/yt_dlp/extractor/archiveorg.py
@@ -300,7 +300,7 @@ def _real_extract(self, url):
             is_logged_in = bool(self._get_cookies('https://archive.org').get('logged-in-sig'))
             if extension in KNOWN_EXTENSIONS and (not f.get('private') or is_logged_in):
                 entry['formats'].append({
-                    'url': 'https://archive.org/download/' + identifier + '/' + f['name'],
+                    'url': 'https://archive.org/download/' + identifier + '/' + urllib.parse.quote(f['name']),
                     'format': f.get('format'),
                     'width': int_or_none(f.get('width')),
                     'height': int_or_none(f.get('height')),
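As a rough check of what the one-line change in the patch does to this particular file name, the sketch below applies the standard `urllib.parse.quote` to the name from the report and builds the URL the same way as the patched line. The `identifier` and `name` values are copied from the verbose output; the other variable names are only for illustration, and nothing here calls into yt-dlp.

```python
import urllib.parse

# Values taken from the verbose output in the report.
identifier = "mpfc-remastered_20210305_1553"
name = "01. Whither Canada?.mkv"

# Before the patch: the file name is concatenated as-is.
unquoted_url = "https://archive.org/download/" + identifier + "/" + name

# After the patch: the file name is percent-encoded first.
# quote() leaves "/" unescaped by default (safe="/"), so a name
# containing a sub-directory would keep its path separators.
quoted_url = "https://archive.org/download/" + identifier + "/" + urllib.parse.quote(name)

print(unquoted_url)
print(quoted_url)
```

For this file name, the quoted result is identical to the direct link that downloaded successfully in the second verbose output.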

bashonly added the patch-available label (There is patch available that should fix this issue. Someone needs to make a PR with it) and removed the triage (Untriaged issue) label Feb 10, 2024
bashonly changed the title from "Video with (encoded) question mark in file name cannot be downloaded with 404 error" to "[ArchiveOrg] Video with (encoded) question mark in file name cannot be downloaded with 404 error" Feb 10, 2024
bashonly added a commit that referenced this issue Feb 24, 2024
aalsuwaidi pushed a commit to aalsuwaidi/yt-dlp that referenced this issue Apr 21, 2024