[extractor/substack] Fix embed URL extraction #8218

handlerug · 2023-09-27T17:06:34Z

Description of your pull request and other information

The format of the JSON payload being extracted has changed to a JSON.parse("...")-style format, in which the inner quotes are escaped with backslashes. This PR fixes the extraction process so that it ignores the added backslashes.

There's a theoretical possibility of the extracted string being too short if it contains an escape sequence (the escape would contain a backslash, which ends the [^\"]+ match early). In practice, a valid DNS name is unlikely to get escaped by a sensible JSON encoder. I have searched for uses of _search_json and _parse_json and found none, so I opted for a fix to the regular expression.
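To illustrate the idea, here is a minimal sketch of a backslash-tolerant pattern. It is not necessarily the expression merged in this PR: `SUBDOMAIN_RE`, `find_subdomain` and the sample payloads are hypothetical names and data chosen for the example. It shows the same pattern matching both the plain and the JSON.parse("...")-style payload, and the truncation caveat described above.

```python
import re

# Hypothetical pattern for illustration only; the real extractor's
# expression may differ. Each quote may be preceded by one backslash,
# and the captured value stops at the first backslash or quote.
SUBDOMAIN_RE = re.compile(
    r'\\?["\']subdomain\\?["\']\s*:\s*\\?["\'](?P<subdomain>[^\\"\']+)'
)


def find_subdomain(webpage):
    """Return the first subdomain value found in the page, or None."""
    mobj = SUBDOMAIN_RE.search(webpage)
    return mobj.group('subdomain') if mobj else None


# Made-up payloads: the plain format and the escaped JSON.parse format.
plain = '{"subdomain":"example","id":1}'
escaped = r'window._preloads = JSON.parse("{\"subdomain\":\"example\",\"id\":1}")'

# Caveat from the description: an escape sequence inside the value
# cuts the capture short at the backslash.
truncated = r'JSON.parse("{\"subdomain\":\"ab\\u0063\"}")'

print(find_subdomain(plain))
print(find_subdomain(escaped))
print(find_subdomain(truncated))
```

Parsing the payload as JSON would avoid the truncation case entirely, but for DNS-safe subdomains the regex approach should behave the same.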

Regarding tests, there are two in generic.py which should have actually been failing. If I try them manually using yt-dlp -j, the result seems to match, but I don't know how to run specific download tests in a generic way.

Template

Before submitting a pull request make sure you have:

At least skimmed through contributing guidelines including yt-dlp coding conventions
Searched the bugtracker for similar pull requests
Checked the code with flake8 and ran relevant tests

In order to be accepted and merged into yt-dlp each piece of code must be in public domain or released under Unlicense. Check all of the following options that apply:

I am the original author of this code and I am willing to release it under Unlicense
I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

Fix or improvement to an extractor (Make sure to add/update tests)
New extractor (Piracy websites will not be accepted)
Core bug fix/improvement
New feature (It is strongly recommended to open an issue first)

Copilot Summary

`🤖 Generated by Copilot at 6cfb6a0`

Summary

🐛🔧🧪

Fixed substack extractor to handle backslashes in subdomain names. Modified regex pattern in yt_dlp/extractor/substack.py.

Substack posts vary
Escape backslashes in regex
Winter bug is fixed

Walkthrough

Escape backslashes in subdomain pattern (link) to fix extraction of some substack posts in yt_dlp/extractor/substack.py

yt_dlp/extractor/substack.py

Co-authored-by: bashonly <88596187+bashonly@users.noreply.github.com>

Authored by: handlerug

[extractor/substack] Fix embed URL extraction

6cfb6a0

bashonly self-requested a review September 28, 2023 15:13

bashonly added the site-bug (Issue with a specific website) label Sep 29, 2023

bashonly requested changes Oct 3, 2023

yt_dlp/extractor/substack.py (outdated, resolved)

bashonly added the pending-fixes (PR has had changes requested) label Oct 3, 2023

Update yt_dlp/extractor/substack.py

fc40cdb

Co-authored-by: bashonly <88596187+bashonly@users.noreply.github.com>

handlerug requested a review from bashonly October 3, 2023 10:23

bashonly approved these changes Oct 3, 2023

bashonly added pending-review (PR needs a review) and removed pending-fixes (PR has had changes requested) labels Oct 3, 2023

Grub4K approved these changes Oct 5, 2023

Grub4K removed the pending-review (PR needs a review) label Oct 5, 2023

Grub4K assigned bashonly Oct 5, 2023

bashonly merged commit fbcc299 into yt-dlp:master Oct 6, 2023
16 checks passed

aalsuwaidi pushed a commit to aalsuwaidi/yt-dlp that referenced this pull request Apr 21, 2024

[ie/substack] Fix embed extraction (yt-dlp#8218)

4e678a1

Authored by: handlerug
