Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improving ThePlatform Metadata extraction #8635

Merged
merged 37 commits into from Dec 12, 2023

Conversation

trainman261
Copy link
Contributor

@trainman261 trainman261 commented Nov 21, 2023

IMPORTANT: PRs without the template will be CLOSED

Description of your pull request and other information

This improves metadata extraction for sites that use ThePlatform and updates the tests accordingly:

  • "standard" metadata fields: tags/keywords and categories.
  • A number of metadata fields have site-specific prefixes, but many of those metadata fields themselves are not site-specific (for instance creator, series, season number in this PR). I had started a discussion on this topic on Discord a while ago, but the discussion stalled IMO without a resolution. Consider this a proposal continuing the discussion that is open to changes. The proposed method could also help solve an issue in the NBC IE where it subclasses ThePlatformIE instead of subclassing InfoExtractor which AFAIK is not best practice (see the comment on line 32).

I have done the same level of regression testing as for my previous ThePlatformIE PR and have fixed all regression issues that appeared.

I am also now proposing to add the media_type field with this PR. This should be more carefully looked at, since I don't know much about the code outside of the extractor classes - I've tried to find where changes are needed but I only found the explanation in the common extractor class. Please add any other changes that may be necessary.

Fixes #

Template

Before submitting a pull request make sure you have:

In order to be accepted and merged into yt-dlp each piece of code must be in public domain or released under Unlicense. Check all of the following options that apply:

  • I am the original author of this code and I am willing to release it under Unlicense
  • I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

@seproDev seproDev added the site-enhancement Feature request for some website label Nov 21, 2023
@bashonly bashonly self-requested a review November 23, 2023 05:51
yt_dlp/extractor/theplatform.py Outdated Show resolved Hide resolved
yt_dlp/extractor/theplatform.py Outdated Show resolved Hide resolved
yt_dlp/extractor/theplatform.py Outdated Show resolved Hide resolved
yt_dlp/extractor/theplatform.py Outdated Show resolved Hide resolved
yt_dlp/extractor/theplatform.py Outdated Show resolved Hide resolved
yt_dlp/extractor/theplatform.py Outdated Show resolved Hide resolved
@bashonly
Copy link
Member

bashonly commented Nov 23, 2023

And, unrelated to my review (out-of-scope for this PR):

The proposed method could also help solve an issue in the NBC IE where it subclasses ThePlatformIE instead of subclassing InfoExtractor

I am thinking these multi-purpose extractors such as ThePlatformIE and AnvatoIE etc should be refactored so that all of their useful methods are in a baseclass, and so that the many extractors improperly subclassing from concrete IEs can properly subclass from the refactored baseclasses instead

@bashonly bashonly added the pending-fixes PR has had changes requested label Nov 23, 2023
trainman261 and others added 4 commits November 23, 2023 19:39
Co-authored-by: bashonly <88596187+bashonly@users.noreply.github.com>
…only's suggestions

Adding the media_type field in common. This has to be reviewed, as I don't know where else this has to be added.
@trainman261
Copy link
Contributor Author

I've implemented the suggested changes and made some slight changes to them.
I've also added the media_type field, however, I don't know what other code has to be updated to properly support this, so please add whatever is necessary - or let me know so I can update those parts myself. I've updated the description of the PR to match.

yt_dlp/extractor/common.py Outdated Show resolved Hide resolved
yt_dlp/extractor/cbc.py Outdated Show resolved Hide resolved
yt_dlp/extractor/mediaset.py Outdated Show resolved Hide resolved
yt_dlp/extractor/theplatform.py Outdated Show resolved Hide resolved
yt_dlp/extractor/theplatform.py Outdated Show resolved Hide resolved
trainman261 and others added 2 commits December 5, 2023 08:13
Added the media_type field
My bad, that should have never made it into the PR.

Co-authored-by: sepro <4618135+seproDev@users.noreply.github.com>
trainman261 and others added 4 commits December 5, 2023 08:33
Co-authored-by: sepro <4618135+seproDev@users.noreply.github.com>
Co-authored-by: sepro <4618135+seproDev@users.noreply.github.com>
Co-authored-by: sepro <4618135+seproDev@users.noreply.github.com>
yt_dlp/extractor/cbc.py Outdated Show resolved Hide resolved
yt_dlp/extractor/mediaset.py Outdated Show resolved Hide resolved
yt_dlp/extractor/mediaset.py Outdated Show resolved Hide resolved
yt_dlp/extractor/nbc.py Outdated Show resolved Hide resolved
yt_dlp/extractor/nbc.py Outdated Show resolved Hide resolved
yt_dlp/extractor/nbc.py Outdated Show resolved Hide resolved
@seproDev
Copy link
Collaborator

seproDev commented Dec 5, 2023

I just noticed that the CWTVIE and AENetworksIE tests also needs to be updated

trainman261 and others added 3 commits December 5, 2023 19:56
…nto theplatform

# Conflicts:
#	yt_dlp/extractor/mediaset.py
Co-authored-by: pukkandan <pukkandan.ytdlp@gmail.com>
…nto theplatform

# Conflicts:
#	yt_dlp/extractor/cbc.py
#	yt_dlp/extractor/mediaset.py
#	yt_dlp/extractor/nbc.py
@trainman261
Copy link
Contributor Author

@seproDev I've updated everything I can - unfortunately, those tests are either geo-restricted, subscriber-only or broken. If anyone has access they can feel free to add them.

@seproDev
Copy link
Collaborator

seproDev commented Dec 5, 2023

I've pushed those changes myself

@seproDev seproDev added pending-review PR needs a review and removed pending-fixes PR has had changes requested labels Dec 5, 2023
@pukkandan
Copy link
Member

Ignore the failing test. Pypy just randomly crashes sometimes

@trainman261
Copy link
Contributor Author

I've merged everything together and addressed all the issues. Anything else?

@pukkandan I've noticed Pypy crashing on my system occasionally as well. Is there any pattern that could be reported as an error (to Pypy devs)?

@bashonly bashonly removed the pending-review PR needs a review label Dec 8, 2023
@bashonly bashonly self-assigned this Dec 8, 2023
@bashonly bashonly merged commit 7e09c14 into yt-dlp:master Dec 12, 2023
13 of 14 checks passed
gonzalezjo pushed a commit to gonzalezjo/yt-dlp that referenced this pull request Dec 12, 2023
aalsuwaidi pushed a commit to aalsuwaidi/yt-dlp that referenced this pull request Apr 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
site-enhancement Feature request for some website
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants