[extractor/substack] Return canonical URLs #8219

handlerug · 2023-09-27T17:15:53Z

Description of your pull request and other information

The URL passed to _real_extract has the form "{username}.substack.com", but for blogs with custom domains the canonical URL uses the custom domain. Because the info dict then carries the wrong URL, its cookies come from the wrong domain, which breaks subscriber content extraction.

This isn't really testable because for whatever reason yt-dlp itself doesn't have any trouble downloading such content; however, yt-dlp -j consumers are broken without this change because of how the cookies field is populated.
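To illustrate the failure mode for `-j` consumers, here is a minimal sketch of domain-based cookie selection. It is a simplified model, not yt-dlp's actual cookie handling; the `cookies_for_url` helper, the cookie names, and the `example.com` custom domain are all made up for the example.

```python
from urllib.parse import urlparse


def cookies_for_url(cookies, url):
    """Return the cookies whose domain matches the host of `url`.

    Hypothetical helper: uses a plain suffix match, which is far simpler
    than real cookie domain matching.
    """
    host = urlparse(url).hostname or ''
    selected = {}
    for (domain, name), value in cookies.items():
        bare = domain.lstrip('.')
        if host == bare or host.endswith('.' + bare):
            selected[name] = value
    return selected


# Hypothetical cookie store: the subscriber session lives on the custom domain.
jar = {
    ('.example.com', 'session'): 'subscriber-session',
    ('.substack.com', 'visitor'): 'anon',
}

# Non-canonical URL: the subscriber session cookie is not selected.
print(cookies_for_url(jar, 'https://someblog.substack.com/p/post'))
# Canonical URL on the custom domain: the session cookie is selected.
print(cookies_for_url(jar, 'https://www.example.com/p/post'))
```

Under this model, a consumer that receives the "{username}.substack.com" URL only ever sees the cookies scoped to substack.com, which matches the symptom described above.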

Template

Before submitting a pull request make sure you have:

At least skimmed through contributing guidelines including yt-dlp coding conventions
Searched the bugtracker for similar pull requests
Checked the code with flake8 and ran relevant tests

In order to be accepted and merged into yt-dlp each piece of code must be in public domain or released under Unlicense. Check all of the following options that apply:

I am the original author of this code and I am willing to release it under Unlicense
I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

Fix or improvement to an extractor (Make sure to add/update tests)
New extractor (Piracy websites will not be accepted)
Core bug fix/improvement
New feature (It is strongly recommended to open an issue first)

Copilot Summary

`🤖 Generated by Copilot at fa51465`

Summary

🎥🛠️🔗

Enhanced the Substack extractor to support custom domains. Used canonical_url instead of username for video extraction and added more metadata fields.

Substack extractor
canonical_url for videos
Autumn leaves falling

Walkthrough

Modify the _extract_video_formats function to take url instead of username and to build the video URL with urllib.parse.urljoin
Modify the _real_extract function to read the domain value from webpage_info and use it to set canonical_url to the custom domain, if applicable
Pass canonical_url instead of username to _extract_video_formats in _real_extract
Add webpage_url and original_url fields to the metadata dictionary in _real_extract, using canonical_url as the value
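The walkthrough above could be sketched roughly as follows. This is not the merged code: the `canonical_url_for` and `video_api_url` helpers, the flat `domain` key in `webpage_info`, and the API path are assumptions made for illustration only.

```python
import urllib.parse


def canonical_url_for(url, webpage_info):
    """Swap the host of `url` for the blog's custom domain, if it has one."""
    # Assumption: the custom domain is exposed under a flat 'domain' key.
    domain = webpage_info.get('domain')
    if not domain:
        return url
    return urllib.parse.urlparse(url)._replace(netloc=domain).geturl()


def video_api_url(canonical_url, video_id):
    """Build a video endpoint relative to the canonical URL."""
    # Hypothetical path; the point is that urljoin keeps the canonical host.
    return urllib.parse.urljoin(canonical_url, f'/api/v1/video/{video_id}')


url = 'https://someblog.substack.com/p/post'
canonical = canonical_url_for(url, {'domain': 'www.example.com'})
info = {
    'webpage_url': canonical,
    'original_url': canonical,
}
print(info)
print(video_api_url(canonical, 'abc123'))
```

Because the path passed to urljoin starts with a slash, only the scheme and host of canonical_url are kept, so the video request goes to the same domain that holds the subscriber cookies.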

The URL passed to _real_extract is of format "{username}.substack.com", but for blogs with custom domains the canonical URL would use the custom domain. Because of the wrong URL, the cookies in the resulting info dict come from the wrong domain, which breaks subscriber content extraction.

yt_dlp/extractor/substack.py

Co-authored-by: bashonly <88596187+bashonly@users.noreply.github.com>

Authored by: handlerug

bashonly self-requested a review September 28, 2023 15:13

bashonly added the site-bug (Issue with a specific website) label Sep 29, 2023

bashonly requested changes Oct 3, 2023

Review comment on yt_dlp/extractor/substack.py (outdated, resolved)

Review comment on yt_dlp/extractor/substack.py (outdated, resolved)

bashonly added the pending-fixes (PR has had changes requested) label Oct 3, 2023

Apply suggestions from code review

c7eb0f9

Co-authored-by: bashonly <88596187+bashonly@users.noreply.github.com>

handlerug requested a review from bashonly October 3, 2023 10:09

bashonly approved these changes Oct 3, 2023

bashonly added the pending-review (PR needs a review) label and removed the pending-fixes (PR has had changes requested) label Oct 3, 2023

Grub4K approved these changes Oct 5, 2023

Grub4K removed the pending-review (PR needs a review) label Oct 5, 2023

Grub4K assigned bashonly Oct 5, 2023

bashonly merged commit 2f2dda3 into yt-dlp:master Oct 6, 2023
16 checks passed

aalsuwaidi pushed a commit to aalsuwaidi/yt-dlp that referenced this pull request Apr 21, 2024

[ie/substack] Fix download cookies bug (yt-dlp#8219)

da4da29

Authored by: handlerug
