[extractor/substack] Fix embed URL extraction #8218

Merged
merged 2 commits into from Oct 6, 2023

handlerug commented Sep 27, 2023

Description of your pull request and other information

The format of the JSON payload being extracted has changed to a JSON.parse("...")-style format, in which the inner quotes are escaped with backslashes. This PR fixes the extraction so that it tolerates the added backslashes.

There's a theoretical possibility of the extracted string being too short if it contains an escape sequence (which would contain a backslash and therefore end the [^\"]+ match early). In practice, a valid DNS name is unlikely to get escaped by a sensible JSON encoder. I searched for uses of _search_json and _parse_json and found none, so I opted for a fix to the regular expression.
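To illustrate the trade-off described above, here is a minimal sketch. It is not the pattern from yt_dlp/extractor/substack.py: the regular expression, the `find_subdomain` helper, and the sample payloads are all hypothetical and only assume that the new format escapes inner quotes with a backslash.

```python
import re

# Hypothetical page fragments; the real Substack markup is not reproduced here.
OLD_STYLE = 'window._preloads = {"subdomain":"example","id":1}'
NEW_STYLE = r'window._preloads = JSON.parse("{\"subdomain\":\"example\",\"id\":1}")'
# Same as NEW_STYLE, but the value contains a JSON escape sequence (\u002d is "-").
ESCAPED = r'window._preloads = JSON.parse("{\"subdomain\":\"ex\u002dample\"}")'

# Illustrative pattern (an assumption, not the merged one): allow an optional
# backslash before each quote, and stop the value at the first backslash or quote.
SUBDOMAIN_RE = re.compile(r'\\?"subdomain\\?"\s*:\s*\\?"(?P<subdomain>[^\\"]+)')


def find_subdomain(page):
    """Return the captured subdomain, or None if the pattern does not match."""
    match = SUBDOMAIN_RE.search(page)
    return match.group('subdomain') if match else None


print(find_subdomain(OLD_STYLE))  # plain JSON object
print(find_subdomain(NEW_STYLE))  # backslash-escaped quotes
print(find_subdomain(ESCAPED))    # truncated at the escape sequence
```

The first two calls capture the same value, while the third stops at the backslash of the escape sequence. That is the "too short" case mentioned above, and it only arises if the encoder escapes characters inside the name.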

Regarding tests, there are two in generic.py which should have actually been failing. If I try them manually using yt-dlp -j, the result seems to match, but I don't know how to run specific download tests in a generic way.

Template

Before submitting a pull request make sure you have:

In order to be accepted and merged into yt-dlp each piece of code must be in public domain or released under Unlicense. Check all of the following options that apply:

  • I am the original author of this code and I am willing to release it under Unlicense
  • I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

Copilot Summary

馃 Generated by Copilot at 6cfb6a0

Summary

🐛🔧🧪

Fixed substack extractor to handle backslashes in subdomain names. Modified regex pattern in yt_dlp/extractor/substack.py.

Substack posts vary
Escape backslashes in regex
Winter bug is fixed

Walkthrough

  • Escape backslashes in subdomain pattern to fix extraction of some substack posts in yt_dlp/extractor/substack.py

bashonly self-requested a review September 28, 2023 15:13
bashonly added the site-bug (Issue with a specific website) label Sep 29, 2023
Review comment on yt_dlp/extractor/substack.py (outdated, resolved)
bashonly added the pending-fixes (PR has had changes requested) label Oct 3, 2023
Co-authored-by: bashonly <88596187+bashonly@users.noreply.github.com>
bashonly added the pending-review (PR needs a review) label and removed the pending-fixes (PR has had changes requested) label Oct 3, 2023
Grub4K removed the pending-review (PR needs a review) label Oct 5, 2023
bashonly merged commit fbcc299 into yt-dlp:master Oct 6, 2023
16 checks passed
aalsuwaidi pushed a commit to aalsuwaidi/yt-dlp that referenced this pull request Apr 21, 2024
Labels
site-bug Issue with a specific website
3 participants