[Substack] extractor fails ("Unable to extract preloads") #7155

2011 · 2023-05-28T12:05:21Z

DO NOT REMOVE OR SKIP THE ISSUE TEMPLATE

I understand that I will be blocked if I intentionally remove or skip any mandatory* field

Checklist

I'm reporting that yt-dlp is broken on a supported site
I've verified that I'm running yt-dlp version 2023.03.04 (update instructions) or later (specify commit)
I've checked that all provided URLs are playable in a browser with the same IP and same login details
I've checked that all URLs and arguments with special characters are properly quoted or escaped
I've searched known issues and the bugtracker for similar issues including closed ones. DO NOT post duplicates
I've read the guidelines for opening an issue
I've read about sharing account credentials and I'm willing to share them if required

Region

global

Provide a description that is worded well enough to be understood

I ran yt-dlp on an ordinary Substack post and received the error "Unable to extract preloads", along with a request to report the issue.

Provide verbose output that clearly demonstrates the problem

Run your yt-dlp command with -vU flag added (yt-dlp -vU <your command line>)
If using API, add 'verbose': True to YoutubeDL params instead
Copy the WHOLE output (starting with [debug] Command-line config) and insert it below

Complete Verbose Output

[debug] Command-line config: ['--restrict-filenames', '-o', '%(title)s-%(id)s-%(uploader)s.%(ext)s', '-w', '-v', 'https://pharmafiles.substack.com/p/how-big-pharma-calculates-a-patients']
[debug] Encodings: locale UTF-8, fs utf-8, pref UTF-8, out utf-8, error utf-8, screen utf-8
[debug] yt-dlp version stable@2023.03.04 [392389b7d]
[debug] Python 3.11.3 (CPython x86_64 64bit) - Linux-5.15.88-with-glibc2.36 (OpenSSL 1.1.1t  7 Feb 2023, glibc 2.36)
[debug] exe versions: ffmpeg 4.4.3 (setts), ffprobe 4.4.3
[debug] Optional libraries: certifi-3021.03.16, pycrypto-3.17
[debug] Proxy map: {}
[debug] Loaded 1786 extractors
[Substack] Extracting URL: https://pharmafiles.substack.com/p/how-big-pharma-calculates-a-patients
[Substack] how-big-pharma-calculates-a-patients: Downloading webpage
ERROR: [Substack] how-big-pharma-calculates-a-patients: Unable to extract preloads; please report this issue on  https://github.com/yt-dlp/yt-dlp/issues?q= , filling out the appropriate issue template. Confirm you are on the latest version using  yt-dlp -U
  File "/usr/lib/python3.11/site-packages/yt_dlp/extractor/common.py", line 694, in extract
    ie_result = self._real_extract(url)
                ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/yt_dlp/extractor/substack.py", line 80, in _real_extract
    webpage_info = self._search_json(r'<script[^>]*>\s*window\._preloads\s*=', webpage, 'preloads', display_id)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/yt_dlp/extractor/common.py", line 1256, in _search_json
    json_string = self._search_regex(
                  ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/yt_dlp/extractor/common.py", line 1242, in _search_regex
    raise RegexNotFoundError('Unable to extract %s' % _name)


bashonly · 2023-05-28T12:11:54Z

diff --git a/yt_dlp/extractor/substack.py b/yt_dlp/extractor/substack.py
index fa3826388..726513499 100644
--- a/yt_dlp/extractor/substack.py
+++ b/yt_dlp/extractor/substack.py
@@ -2,7 +2,7 @@
 import urllib.parse
 
 from .common import InfoExtractor
-from ..utils import str_or_none, traverse_obj
+from ..utils import js_to_json, str_or_none, traverse_obj
 
 
 class SubstackIE(InfoExtractor):
@@ -77,7 +77,10 @@ def _real_extract(self, url):
         display_id, username = self._match_valid_url(url).group('id', 'username')
         webpage = self._download_webpage(url, display_id)
 
-        webpage_info = self._search_json(r'<script[^>]*>\s*window\._preloads\s*=', webpage, 'preloads', display_id)
+        json_string = self._search_json(
+            r'window\._preloads\s*=\s*JSON\.parse\(', webpage, 'json string',
+            display_id, transform_source=js_to_json, contains_pattern=r'"{(?s:.+)}"')
+        webpage_info = self._parse_json(json_string, display_id)
 
         post_type = webpage_info['post']['type']
         formats, subtitles = [], {}
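
The traceback and the patch together suggest that the page now assigns `window._preloads` through `JSON.parse("...")` rather than as a bare object literal, so the old pattern no longer finds a `{` directly after the `=`. The following is a minimal standalone sketch of that two-stage decode, not yt-dlp's implementation: `SAMPLE_PAGE` is a hypothetical page fragment, `extract_preloads` is an illustrative helper, and `json.loads` stands in for `js_to_json`, which is only safe when the JavaScript string literal is also a valid JSON string. The two patterns approximate what `_search_json` would build from the old and new start patterns with their `contains_pattern` values.

```python
import json
import re

# Hypothetical page fragment; the real Substack markup may differ.
SAMPLE_PAGE = r'''<script>window._preloads = JSON.parse("{\"post\":{\"type\":\"newsletter\",\"id\":123}}")</script>'''

# Approximation of the old lookup: expects an object literal right after "=".
OLD_PATTERN = r'<script[^>]*>\s*window\._preloads\s*=\s*(?P<json>{(?s:.+)})'

# Approximation of the patched lookup: expects a quoted string inside JSON.parse(...).
NEW_PATTERN = r'window\._preloads\s*=\s*JSON\.parse\(\s*(?P<json>"{(?s:.+)}")'


def extract_preloads(webpage):
    """Return the preloads dict from a page that uses JSON.parse("...")."""
    match = re.search(NEW_PATTERN, webpage)
    if match is None:
        raise ValueError('Unable to extract preloads')
    # First decode: the quoted string literal becomes plain JSON text.
    json_text = json.loads(match.group('json'))
    # Second decode: the JSON text becomes a Python dict.
    return json.loads(json_text)


if __name__ == '__main__':
    # The old pattern finds nothing, which mirrors the reported error.
    print(re.search(OLD_PATTERN, SAMPLE_PAGE))
    print(extract_preloads(SAMPLE_PAGE)['post']['type'])
```

The double decode mirrors the patch, where `_search_json` first yields a string and `_parse_json` then turns that string into the `webpage_info` dict.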

Closes #7155 Authored by: bashonly

2011 added the site-bug (Issue with a specific website) and triage (Untriaged issue) labels on May 28, 2023

bashonly added the patch-available label (There is patch available that should fix this issue. Someone needs to make a PR with it) and removed the triage label (Untriaged issue) on May 28, 2023

This was referenced May 30, 2023

Unable to extract preloads (substack site) #7174

Closed

[extractor/substack] Fix extraction #7218

Merged

bashonly closed this as completed in #7218 Jun 4, 2023

bashonly added a commit that referenced this issue Jun 4, 2023

[extractor/substack] Fix extraction (#7218)

12037d8

Closes #7155 Authored by: bashonly

aalsuwaidi pushed a commit to aalsuwaidi/yt-dlp that referenced this issue Apr 21, 2024

[extractor/substack] Fix extraction (yt-dlp#7218)

5d6d311

Closes yt-dlp#7155 Authored by: bashonly
