Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RadioComercial] Add extractor #8508

Merged
merged 15 commits into from Nov 11, 2023
Merged

Conversation

SirElderling
Copy link
Contributor

@SirElderling SirElderling commented Nov 3, 2023

IMPORTANT: PRs without the template will be CLOSED

Description of your pull request and other information

This extractor was created specifically for the Portuguese radio station Radio Comercial.
Its main function is to fetch and download podcast episodes.

Presently, it offers two extract functions:

  • Downloading individual podcast episodes by directly using their URL links.
  • Downloading all the episodes of a specified season for a particular podcast, utilizing the `playlist concept.

Valid URLs that are covered by this extractor:

  • Single episode: https://radiocomercial.pt/podcasts/convenca-me-num-minuto/t3/convenca-me-num-minuto-que-os-lobisomens-existem
  • Entire podcast playlist: https://radiocomercial.pt/podcasts/as-minhas-coisas-favoritas
  • Playlist specific to a particular season of a podcast: https://radiocomercial.pt/podcasts/convenca-me-num-minuto/t3
Template

Before submitting a pull request make sure you have:

In order to be accepted and merged into yt-dlp each piece of code must be in public domain or released under Unlicense. Check all of the following options that apply:

  • I am the original author of this code and I am willing to release it under Unlicense
  • I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

Copilot Summary

🤖 Generated by Copilot at bd9904c

Summary

🎙️📻🎧

This pull request adds two new extractors for yt-dlp, RadioComercialIE and RadioComercialPlaylistIE, which enable downloading audio and playlists from the Portuguese radio station Radio Comercial. It also fixes a minor formatting issue in _extractors.py.

Oh we are the coders of the sea
And we write extractors with glee
We pull the audio from RadioComercial
On the count of three, heave ho, heave ho!

Walkthrough

  • Add support for Radio Comercial extractors (link, link)
    • Import new classes RadioComercialIE and RadioComercialPlaylistIE from radiocomercial.py in _extractors.py (link)
    • Define new classes in radiocomercial.py that inherit from RadioComercialBaseExtractor (link)
  • Remove an empty line from _extractors.py for formatting consistency (link)

@Grub4K
Copy link
Member

Grub4K commented Nov 3, 2023

Same as #8507 (comment). Please instead edit the description and push additional commits onto the other branch

@Grub4K Grub4K added the site-request Request to support a new website label Nov 3, 2023
@bashonly bashonly self-requested a review November 3, 2023 20:05
yt_dlp/extractor/_extractors.py Outdated Show resolved Hide resolved
yt_dlp/extractor/radiocomercial.py Outdated Show resolved Hide resolved
yt_dlp/extractor/radiocomercial.py Outdated Show resolved Hide resolved
yt_dlp/extractor/radiocomercial.py Outdated Show resolved Hide resolved
yt_dlp/extractor/radiocomercial.py Outdated Show resolved Hide resolved
yt_dlp/extractor/radiocomercial.py Outdated Show resolved Hide resolved
yt_dlp/extractor/radiocomercial.py Outdated Show resolved Hide resolved
yt_dlp/extractor/radiocomercial.py Outdated Show resolved Hide resolved
yt_dlp/extractor/radiocomercial.py Outdated Show resolved Hide resolved
yt_dlp/extractor/radiocomercial.py Outdated Show resolved Hide resolved
@seproDev seproDev added the pending-fixes PR has had changes requested label Nov 4, 2023
@SirElderling
Copy link
Contributor Author

@seproDev thank you very much for taking the time to look into this, and provide those valuable suggestions. I'm working on those and will update the code once complete.

yt_dlp/extractor/radiocomercial.py Outdated Show resolved Hide resolved
yt_dlp/extractor/radiocomercial.py Outdated Show resolved Hide resolved
yt_dlp/extractor/radiocomercial.py Outdated Show resolved Hide resolved
yt_dlp/extractor/radiocomercial.py Outdated Show resolved Hide resolved
yt_dlp/extractor/radiocomercial.py Outdated Show resolved Hide resolved
yt_dlp/extractor/radiocomercial.py Outdated Show resolved Hide resolved
yt_dlp/extractor/radiocomercial.py Outdated Show resolved Hide resolved
yt_dlp/extractor/radiocomercial.py Outdated Show resolved Hide resolved
yt_dlp/extractor/radiocomercial.py Outdated Show resolved Hide resolved
yt_dlp/extractor/radiocomercial.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@seproDev seproDev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think after this, we should be good from my side.

yt_dlp/extractor/radiocomercial.py Outdated Show resolved Hide resolved
yt_dlp/extractor/radiocomercial.py Outdated Show resolved Hide resolved
yt_dlp/extractor/radiocomercial.py Outdated Show resolved Hide resolved
SirElderling and others added 3 commits November 5, 2023 21:35
Co-authored-by: sepro <4618135+seproDev@users.noreply.github.com>
Co-authored-by: sepro <4618135+seproDev@users.noreply.github.com>
yt_dlp/extractor/radiocomercial.py Outdated Show resolved Hide resolved
@seproDev seproDev added pending-review PR needs a review and removed pending-fixes PR has had changes requested labels Nov 6, 2023
SirElderling and others added 2 commits November 6, 2023 08:42
…ve query parameters or use anchors.

Co-authored-by: sepro <4618135+seproDev@users.noreply.github.com>
yt_dlp/extractor/radiocomercial.py Outdated Show resolved Hide resolved
yt_dlp/extractor/radiocomercial.py Outdated Show resolved Hide resolved
yt_dlp/extractor/radiocomercial.py Outdated Show resolved Hide resolved
yt_dlp/extractor/radiocomercial.py Outdated Show resolved Hide resolved
yt_dlp/extractor/radiocomercial.py Outdated Show resolved Hide resolved
yt_dlp/extractor/radiocomercial.py Outdated Show resolved Hide resolved
yt_dlp/extractor/radiocomercial.py Outdated Show resolved Hide resolved
yt_dlp/extractor/radiocomercial.py Outdated Show resolved Hide resolved
yt_dlp/extractor/radiocomercial.py Outdated Show resolved Hide resolved
yt_dlp/extractor/radiocomercial.py Outdated Show resolved Hide resolved
@bashonly bashonly added pending-fixes PR has had changes requested and removed pending-review PR needs a review labels Nov 9, 2023
@SirElderling
Copy link
Contributor Author

SirElderling commented Nov 9, 2023

I'm currently having an issue extracting playlists from the website because of how it deals with unavailable episodes. At first, I was using a Python set to gather the episode links from the elements with class tm-ouvir-podcast, avoiding the duplicates entries (this class shows up twice for each episode).

The trouble is, any unavailable episode defaults to the same URL (like radiocomercial/podcast/<season>). This behaviour can be seen in test number 4 of the RadioComercialPlaylistIE class, in which two episodes are unavailable on the same page (first).

This situation leads to two problems:

Problem 1: When I was using a Python set and more than one episode was missing, I ended up with fewer episodes than the expected number (_PAGE_SIZE). This stopped the scanning of additional pages.
To fix this, I changed the code to get rid of the duplicate tm-ouvir-podcast classes. Now, I use lists instead of sets, ensuring the episode count always matches the _PAGE_SIZE, by keeping the unavailable episodes entries.

Problem 2: The unavailable episodes cause an error when the downloader tries to process them.:

[RadioComercialPlaylist] Playlist TNT - Todos No Top - Temporada 2023: Downloading 41 items of 41
[download] Downloading item 1 of 41
[RadioComercial] Extracting URL: https://radiocomercial.pt/podcasts/tnt-todos-no-top/2023/t-n-t-29-de-outubro
[RadioComercial] t-n-t-29-de-outubro: Downloading webpage
[info] t-n-t-29-de-outubro: Downloading 1 format(s): 0
[download] Destination: T.N.T 29 de outubro [t-n-t-29-de-outubro].mp3
[download] 100% of   86.69MiB in 00:00:13 at 6.62MiB/s
[download] Downloading item 2 of 41
ERROR: No suitable extractor (RadioComercial) found for URL https://radiocomercial.pt/podcasts/tnt-todos-no-top/2023/
[download] Downloading item 3 of 41
[RadioComercial] Extracting URL: https://radiocomercial.pt/podcasts/tnt-todos-no-top/2023/t-n-t-15-de-outubro
[RadioComercial] t-n-t-15-de-outubro: Downloading webpage`

Copy link
Member

@bashonly bashonly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both of the problems you mentioned mean that a PagedList is not a viable option for this playlist extractor.

Instead we'll need to just use a generator function (e.g. _entries()), and we can use playlist_from_matches() which casts the URLs to an orderedSet and then constructs the url_results for us. We can also use RadioComercialIE.suitable() to ensure the URLs are not bogus

yt_dlp/extractor/radiocomercial.py Outdated Show resolved Hide resolved
yt_dlp/extractor/radiocomercial.py Outdated Show resolved Hide resolved
yt_dlp/extractor/radiocomercial.py Outdated Show resolved Hide resolved
yt_dlp/extractor/radiocomercial.py Outdated Show resolved Hide resolved
yt_dlp/extractor/radiocomercial.py Outdated Show resolved Hide resolved
yt_dlp/extractor/radiocomercial.py Outdated Show resolved Hide resolved
yt_dlp/extractor/radiocomercial.py Outdated Show resolved Hide resolved
yt_dlp/extractor/radiocomercial.py Outdated Show resolved Hide resolved
yt_dlp/extractor/radiocomercial.py Outdated Show resolved Hide resolved
yt_dlp/extractor/radiocomercial.py Outdated Show resolved Hide resolved
@seproDev
Copy link
Collaborator

I think I initially suggested using a PagedList over a generator function 🙃
Sorry about that

@bashonly
Copy link
Member

@seproDev all good, the issues with using a PagedList weren't immediately apparent

@SirElderling
Copy link
Contributor Author

Both of the problems you mentioned mean that a PagedList is not a viable option for this playlist extractor.

Instead we'll need to just use a generator function (e.g. _entries()), and we can use playlist_from_matches() which casts the URLs to an orderedSet and then constructs the url_results for us. We can also use RadioComercialIE.suitable() to ensure the URLs are not bogus

Thank you very much for taking the time and providing the proper solution for this use case. This has been a great learning experience so far.

yt_dlp/extractor/radiocomercial.py Outdated Show resolved Hide resolved
yt_dlp/extractor/radiocomercial.py Outdated Show resolved Hide resolved
@bashonly bashonly added pending-review PR needs a review and removed pending-fixes PR has had changes requested labels Nov 10, 2023


class RadioComercialIE(InfoExtractor):
_VALID_URL = r'https?://(?:www\.)?radiocomercial\.pt/podcasts/[^/?#]+/t?(?P<season>\d+)/(?P<id>[\w-]+)/?(?:$|[?#])'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Read #8535 (comment)
Here I have the same question:

Suggested change
_VALID_URL = r'https?://(?:www\.)?radiocomercial\.pt/podcasts/[^/?#]+/t?(?P<season>\d+)/(?P<id>[\w-]+)/?(?:$|[?#])'
_VALID_URL = r'https?://(?:www\.)?radiocomercial\.pt/podcasts/[^/?#]+/t?(?P<season>\d+)/(?P<id>[\w-]+)'

It's needed in RadioComercialPlaylistIE._VALID_URL, but I don't think we need it here? Unless I'm missing something

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe you are correct bashonly. I removed it from the single episode regex.

@bashonly bashonly removed the pending-review PR needs a review label Nov 11, 2023
@bashonly bashonly self-assigned this Nov 11, 2023
@bashonly bashonly merged commit ef12dbd into yt-dlp:master Nov 11, 2023
16 checks passed
@SirElderling SirElderling deleted the radiocomercial branch November 12, 2023 08:18
aalsuwaidi pushed a commit to aalsuwaidi/yt-dlp that referenced this pull request Apr 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
site-request Request to support a new website
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants