Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix GoPro extractor. #9019

Merged
merged 5 commits into from Jan 19, 2024
Merged

Fix GoPro extractor. #9019

merged 5 commits into from Jan 19, 2024

Conversation

stilor
Copy link
Contributor

@stilor stilor commented Jan 18, 2024

IMPORTANT: PRs without the template will be CLOSED

The GoPro extractor fails even when used on the URLs recorded in the tests with an error like:

[GoPro] Extracting URL: https://gopro.com/v/KRm6Vgp2peg4e [GoPro] KRm6Vgp2peg4e: Downloading webpage
ERROR: [GoPro] KRm6Vgp2peg4e: KRm6Vgp2peg4e: Failed to parse JSON (caused by JSONDecodeError('Expecting \',\' delimiter in \'"shop_url"=>"https:/\': line 1 column 5120 (char 5119)')); please report this issue on  https://github.com/yt-dlp/yt-dlp/issues?q= , filling out the appropriate issue template. Confirm you are on the latest version using  yt-dlp -U
  File "/storage/avn/git/yt-dlp/yt_dlp/extractor/common.py", line 718, in extract
...
  • The JSON in the webpage now contains semicolons as part of HTML entities such as ". This makes the regexp used for matching truncate the JSON prematurely.
  • The same entities are replaced with the respective characters by clean_html, resulting in an invalid JSON.
  • One of the tests for the extractor fails because of unexpected author and track fields in the returned dictionary.
Template

Before submitting a pull request make sure you have:

In order to be accepted and merged into yt-dlp each piece of code must be in public domain or released under Unlicense. Check all of the following options that apply:

  • I am the original author of this code and I am willing to release it under Unlicense
  • I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

Comment on lines 62 to 64
metadata = self._parse_json(
self._html_search_regex(r'window\.__reflectData\s*=\s*([^;]+)', webpage, 'metadata'), video_id)
self._search_regex(r'<script>window\.__reflectData\s*=\s*(.+)\s*;\s*</script>',
webpage, 'metadata'), video_id)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

        metadata = self._search_json(r'<script>window\.__reflectData\s*=', ...)

should work better

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines 26 to 27
'track': '',
'artist': '',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The track is not supposed to be an empty string. Extractor's returning wrong value. You don't have to fix the code in this PR, but don't add wrong value to test.

Suggested change
'track': '',
'artist': '',

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Split the fix into a separate commit.

@seproDev seproDev added the site-bug Issue with a specific website label Jan 18, 2024
The extractor fails even when used on the URLs recorded in the tests
with an error like:

[GoPro] Extracting URL: https://gopro.com/v/KRm6Vgp2peg4e
[GoPro] KRm6Vgp2peg4e: Downloading webpage
ERROR: [GoPro] KRm6Vgp2peg4e: KRm6Vgp2peg4e: Failed to parse JSON (caused by JSONDecodeError('Expecting \',\' delimiter in \'"shop_url"=>"https:/\': line 1 column 5120 (char 5119)')); please report this issue on  https://github.com/yt-dlp/yt-dlp/issues?q= , filling out the appropriate issue template. Confirm you are on the latest version using  yt-dlp -U
  File "/storage/avn/git/yt-dlp/yt_dlp/extractor/common.py", line 718, in extract
...

- The JSON in the webpage now contains colons as part of HTML entities
  such as &quot;. This makes the regexp used for matching truncate the
  JSON prematurely.
- The same entities are replaced with the respective characters by
  clean_html, resulting in an invalid JSON.

At the reviewer's suggestion, replaced _html_search_regex with
_search_json to resolve both issues.

Signed-off-by: Alexey Neyman <stilor@att.net>
gopro.com returns empty string as author/track for some videos; consider
such attributes to be missing.

Signed-off-by: Alexey Neyman <stilor@att.net>
Copy link
Collaborator

@seproDev seproDev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, please don't force push. The commits will be squashed upon merge.

Comment on lines 101 to 112
'artist': str_or_none(
'artist': nonempty_or_none(
video_info.get('music_track_artist')),
'track': str_or_none(
'track': nonempty_or_none(
video_info.get('music_track_name')),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would just do:

            'artist': str_or_none(
                video_info.get('music_track_artist')) or None,
            'track': str_or_none(
                video_info.get('music_track_name')) or None,

@seproDev seproDev added the pending-fixes PR has had changes requested label Jan 19, 2024
Signed-off-by: Alexey Neyman <stilor@att.net>
@seproDev seproDev added pending-review PR needs a review and removed pending-fixes PR has had changes requested labels Jan 19, 2024
yt_dlp/extractor/gopro.py Outdated Show resolved Hide resolved
@bashonly bashonly removed the pending-review PR needs a review label Jan 19, 2024
@seproDev seproDev merged commit 4a07a45 into yt-dlp:master Jan 19, 2024
5 checks passed
aalsuwaidi pushed a commit to aalsuwaidi/yt-dlp that referenced this pull request Apr 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
site-bug Issue with a specific website
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants