[extractor/substack] Return canonical URLs #8219

Merged
2 commits merged into yt-dlp:master on Oct 6, 2023

@handlerug handlerug commented Sep 27, 2023

Description of your pull request and other information

The URL passed to _real_extract has the form "{username}.substack.com", but for blogs with custom domains the canonical URL uses the custom domain. Because the extractor reports the wrong URL, the cookies in the resulting info dict come from the wrong domain, which breaks extraction of subscriber-only content.

This isn't really testable because, for whatever reason, yt-dlp itself has no trouble downloading such content; however, consumers of yt-dlp -j output are broken without this change because of how the cookies field is populated.
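The cookie problem described above can be illustrated with a simplified domain check. This is only a sketch of the general cookie-matching idea, not yt-dlp's own logic; the function name and the hostnames are hypothetical, and the matching rule ignores details such as IP addresses and public suffixes.

```python
def domain_matches(cookie_domain, host):
    """Simplified cookie domain match: exact host or a parent domain of it."""
    cookie_domain = cookie_domain.lstrip('.').lower()
    host = host.lower()
    return host == cookie_domain or host.endswith('.' + cookie_domain)


# Hypothetical hosts: the same blog reached via its Substack name and its custom domain
substack_host = 'example.substack.com'
custom_host = 'blog.example.com'

# A cookie set for the custom domain is not selected for the substack.com URL,
# so an info dict built from the substack.com URL would miss it.
print(domain_matches('blog.example.com', substack_host))
print(domain_matches('blog.example.com', custom_host))
```

A consumer that replays requests using the cookies from the JSON output would therefore send the wrong set of cookies if the reported URL and the actual host disagree.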

Template

In order to be accepted and merged into yt-dlp each piece of code must be in public domain or released under Unlicense. Check all of the following options that apply:

  • I am the original author of this code and I am willing to release it under Unlicense
  • I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

Copilot Summary

🤖 Generated by Copilot at fa51465

Summary

🎥🛠️🔗

Enhanced the Substack extractor to support custom domains. Used canonical_url instead of username for video extraction and added more metadata fields.

Substack extractor
canonical_url for videos
Autumn leaves falling

Walkthrough

  • Modify the _extract_video_formats function to take url instead of username and to build the video URL with urllib.parse.urljoin
  • Modify the _real_extract function to read the domain value from webpage_info and use it to set canonical_url to the custom domain, if applicable
  • Pass canonical_url instead of username to _extract_video_formats in _real_extract
  • Add webpage_url and original_url fields to the metadata dictionary in _real_extract, using canonical_url as the value
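The steps in the walkthrough can be sketched roughly as follows. This is a standalone illustration under stated assumptions, not the merged code: the helper names, the example hostnames, and the API path are hypothetical, and the real extractor obtains the custom domain from webpage_info rather than from a function argument.

```python
import urllib.parse


def canonical_url_for(url, custom_domain=None):
    """Return url with its host swapped for custom_domain, if one is given."""
    if not custom_domain:
        return url
    return urllib.parse.urlparse(url)._replace(netloc=custom_domain).geturl()


def video_src_url(canonical_url, video_id):
    # Hypothetical API path, used only for illustration
    return urllib.parse.urljoin(canonical_url, f'/api/v1/video/upload/{video_id}/src')


# Hypothetical blog that is also served from a custom domain
original = 'https://example.substack.com/p/some-post'
canonical = canonical_url_for(original, 'blog.example.com')

print(canonical)
print(video_src_url(canonical, 'abc123'))
```

Joining against the canonical URL keeps every follow-up request on the same host as the page itself, which is what allows the cookies reported in the info dict to line up with the domain that actually serves the content.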

@bashonly bashonly self-requested a review September 28, 2023 15:13
@bashonly bashonly added the site-bug Issue with a specific website label Sep 29, 2023
@bashonly bashonly added the pending-fixes PR has had changes requested label Oct 3, 2023
Co-authored-by: bashonly <88596187+bashonly@users.noreply.github.com>
@bashonly bashonly added pending-review PR needs a review and removed pending-fixes PR has had changes requested labels Oct 3, 2023
@Grub4K Grub4K removed the pending-review PR needs a review label Oct 5, 2023
@bashonly bashonly merged commit 2f2dda3 into yt-dlp:master Oct 6, 2023
16 checks passed
aalsuwaidi pushed a commit to aalsuwaidi/yt-dlp that referenced this pull request Apr 21, 2024
Labels
site-bug Issue with a specific website