[RadioComercial] Add extractor #8508

SirElderling · 2023-11-03T16:44:44Z

IMPORTANT: PRs without the template will be CLOSED

Description of your pull request and other information

This extractor was created specifically for the Portuguese radio station Radio Comercial.
Its main function is to fetch and download podcast episodes.

Presently, it offers two extraction modes:

Downloading individual podcast episodes directly from their URLs.
Downloading all episodes of a given season of a podcast, using the playlist concept.

Valid URLs that are covered by this extractor:

Single episode: https://radiocomercial.pt/podcasts/convenca-me-num-minuto/t3/convenca-me-num-minuto-que-os-lobisomens-existem
Entire podcast playlist: https://radiocomercial.pt/podcasts/as-minhas-coisas-favoritas
Playlist specific to a particular season of a podcast: https://radiocomercial.pt/podcasts/convenca-me-num-minuto/t3
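
As a rough illustration of how these URL shapes can be told apart, the sketch below applies the single-episode pattern that appears later in this PR's review thread to the three example URLs. Only that pattern comes from the PR; the `EXAMPLES` list and the `classify` helper are hypothetical names for this demo, and the playlist pattern is not reproduced here.

```python
import re

# Single-episode pattern as suggested in this PR's review thread.
EPISODE_RE = re.compile(
    r'https?://(?:www\.)?radiocomercial\.pt/podcasts/[^/?#]+'
    r'/t?(?P<season>\d+)/(?P<id>[\w-]+)')

# The three example URLs from the description above.
EXAMPLES = [
    'https://radiocomercial.pt/podcasts/convenca-me-num-minuto/t3/'
    'convenca-me-num-minuto-que-os-lobisomens-existem',
    'https://radiocomercial.pt/podcasts/as-minhas-coisas-favoritas',
    'https://radiocomercial.pt/podcasts/convenca-me-num-minuto/t3',
]


def classify(url):
    """Return (season, id) for a single-episode URL, or None otherwise."""
    match = EPISODE_RE.match(url)
    return (match.group('season'), match.group('id')) if match else None


for url in EXAMPLES:
    print(classify(url), url)
```

Only the first URL yields a `(season, id)` pair; the podcast and season URLs return `None` and would be left to the playlist extractor.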

Template

Before submitting a pull request make sure you have:

At least skimmed through contributing guidelines including yt-dlp coding conventions
Searched the bugtracker for similar pull requests
Checked the code with flake8 and ran relevant tests

In order to be accepted and merged into yt-dlp each piece of code must be in public domain or released under Unlicense. Check all of the following options that apply:

I am the original author of this code and I am willing to release it under Unlicense
I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

Fix or improvement to an extractor (Make sure to add/update tests)
New extractor (Piracy websites will not be accepted)
Core bug fix/improvement
New feature (It is strongly recommended to open an issue first)

Copilot Summary

`🤖 Generated by Copilot at bd9904c`

Summary

🎙️📻🎧

This pull request adds two new extractors for yt-dlp, RadioComercialIE and RadioComercialPlaylistIE, which enable downloading audio and playlists from the Portuguese radio station Radio Comercial. It also fixes a minor formatting issue in _extractors.py.

Oh we are the coders of the sea
And we write extractors with glee
We pull the audio from RadioComercial
On the count of three, heave ho, heave ho!

Walkthrough

Add support for Radio Comercial extractors (link, link)
- Import new classes RadioComercialIE and RadioComercialPlaylistIE from radiocomercial.py in _extractors.py (link)
- Define new classes in radiocomercial.py that inherit from RadioComercialBaseExtractor (link)
Remove an empty line from _extractors.py for formatting consistency (link)

Grub4K · 2023-11-03T16:47:54Z

Same as #8507 (comment). Please instead edit the description and push additional commits onto the other branch

yt_dlp/extractor/_extractors.py

yt_dlp/extractor/radiocomercial.py

SirElderling · 2023-11-05T09:11:01Z

@seproDev thank you very much for taking the time to look into this and for providing those valuable suggestions. I'm working on them and will update the code once complete.

yt_dlp/extractor/radiocomercial.py

Co-authored-by: sepro <4618135+seproDev@users.noreply.github.com>

seproDev

I think after this, we should be good from my side.

yt_dlp/extractor/radiocomercial.py

Co-authored-by: sepro <4618135+seproDev@users.noreply.github.com>

yt_dlp/extractor/radiocomercial.py

yt_dlp/extractor/radiocomercial.py

SirElderling · 2023-11-09T20:38:41Z

I'm currently having an issue extracting playlists from the website because of how it deals with unavailable episodes. At first, I was using a Python set to gather the episode links from the elements with class tm-ouvir-podcast, avoiding duplicate entries (this class shows up twice for each episode).

The trouble is, any unavailable episode defaults to the same URL (like radiocomercial/podcast/<season>). This behaviour can be seen in test number 4 of the RadioComercialPlaylistIE class, in which two episodes are unavailable on the same (first) page.

This situation leads to two problems:

Problem 1: When I was using a Python set and more than one episode was missing, I ended up with fewer episodes than the expected number (_PAGE_SIZE). This stopped the scanning of additional pages.
To fix this, I changed the code to get rid of the duplicate tm-ouvir-podcast classes. Now, I use lists instead of sets, ensuring the episode count always matches the _PAGE_SIZE by keeping the entries for unavailable episodes.

Problem 2: The unavailable episodes cause an error when the downloader tries to process them:

[RadioComercialPlaylist] Playlist TNT - Todos No Top - Temporada 2023: Downloading 41 items of 41
[download] Downloading item 1 of 41
[RadioComercial] Extracting URL: https://radiocomercial.pt/podcasts/tnt-todos-no-top/2023/t-n-t-29-de-outubro
[RadioComercial] t-n-t-29-de-outubro: Downloading webpage
[info] t-n-t-29-de-outubro: Downloading 1 format(s): 0
[download] Destination: T.N.T 29 de outubro [t-n-t-29-de-outubro].mp3
[download] 100% of   86.69MiB in 00:00:13 at 6.62MiB/s
[download] Downloading item 2 of 41
ERROR: No suitable extractor (RadioComercial) found for URL https://radiocomercial.pt/podcasts/tnt-todos-no-top/2023/
[download] Downloading item 3 of 41
[RadioComercial] Extracting URL: https://radiocomercial.pt/podcasts/tnt-todos-no-top/2023/t-n-t-15-de-outubro
[RadioComercial] t-n-t-15-de-outubro: Downloading webpage
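
Problem 1 can be reproduced in isolation. The following is a minimal sketch, not the extractor's code: the page size of 5, the `ep-N` slugs, the every-other-item slicing used to drop the doubled links, and the `looks_like_last_page` helper are all assumptions made for the demo.

```python
PAGE_SIZE = 5  # stand-in for the extractor's _PAGE_SIZE; the real value is not shown here

# Hypothetical links scraped from one page. Each episode link appears twice
# (the tm-ouvir-podcast class occurs twice per episode), and the two
# unavailable episodes both fall back to the bare season URL.
SEASON_URL = 'https://radiocomercial.pt/podcasts/tnt-todos-no-top/2023/'
scraped = []
for slug in ['ep-1', '', 'ep-3', '', 'ep-5']:
    scraped += [SEASON_URL + slug] * 2

# A set removes the doubled links but also merges both unavailable episodes.
unique_as_set = set(scraped)

# Taking every other item removes the doubling and keeps one entry per
# episode, including the unavailable ones.
every_other = scraped[::2]


def looks_like_last_page(entries):
    """A short page is interpreted as the end of the playlist."""
    return len(entries) < PAGE_SIZE


print(len(unique_as_set), looks_like_last_page(unique_as_set))
print(len(every_other), looks_like_last_page(every_other))
```

With the set, the page appears to hold only four entries, so paging would stop early; with the list, the count stays at the page size, but the bogus season URL is still present, which is what leads to Problem 2.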

bashonly

Both of the problems you mentioned mean that a PagedList is not a viable option for this playlist extractor.

Instead we'll need to just use a generator function (e.g. _entries()), and we can use playlist_from_matches() which casts the URLs to an orderedSet and then constructs the url_results for us. We can also use RadioComercialIE.suitable() to ensure the URLs are not bogus
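
A standard-library sketch of the shape of that suggestion follows. It does not use yt-dlp itself: `dict.fromkeys` stands in for the ordered de-duplication that `orderedSet` performs, a plain regex match stands in for `RadioComercialIE.suitable()`, and `FAKE_PAGES`, `fetch_links`, the `t-n-t-8-de-outubro` slug, and the stop-on-empty-page rule are assumptions for the demo.

```python
import itertools
import re

# Single-episode pattern from this PR's review thread; a stand-in for
# calling RadioComercialIE.suitable() on each URL.
EPISODE_RE = re.compile(
    r'https?://(?:www\.)?radiocomercial\.pt/podcasts/[^/?#]+'
    r'/t?(?P<season>\d+)/(?P<id>[\w-]+)')

BASE = 'https://radiocomercial.pt/podcasts/tnt-todos-no-top/2023/'

# Hypothetical site content: page number -> links found on that page.
# The bare BASE entries represent unavailable episodes.
FAKE_PAGES = {
    1: [BASE + 't-n-t-29-de-outubro', BASE, BASE + 't-n-t-15-de-outubro', BASE],
    2: [BASE + 't-n-t-8-de-outubro'],
}


def fetch_links(page):
    """Pretend to scrape one page; every link shows up twice, as on the site."""
    links = FAKE_PAGES.get(page, [])
    return [link for link in links for _ in range(2)]


def entries():
    """Yield valid episode URLs page by page until an empty page is reached."""
    for page in itertools.count(1):
        links = fetch_links(page)
        if not links:
            break
        # dict.fromkeys de-duplicates while preserving first-seen order.
        for url in dict.fromkeys(links):
            if EPISODE_RE.match(url):
                yield url


print(list(entries()))
```

Because the generator stops on an empty page rather than on a short one, the number of valid links per page no longer matters, and the bogus season URLs are filtered out before they reach the downloader.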

yt_dlp/extractor/radiocomercial.py

seproDev · 2023-11-10T00:21:38Z

I think I initially suggested using a PagedList over a generator function 🙃
Sorry about that

bashonly · 2023-11-10T00:27:03Z

@seproDev all good, the issues with using a PagedList weren't immediately apparent

SirElderling · 2023-11-10T06:47:57Z

Both of the problems you mentioned mean that a PagedList is not a viable option for this playlist extractor.

Instead we'll need to just use a generator function (e.g. _entries()), and we can use playlist_from_matches() which casts the URLs to an orderedSet and then constructs the url_results for us. We can also use RadioComercialIE.suitable() to ensure the URLs are not bogus

Thank you very much for taking the time and providing the proper solution for this use case. This has been a great learning experience so far.

yt_dlp/extractor/radiocomercial.py

bashonly · 2023-11-11T13:25:51Z

yt_dlp/extractor/radiocomercial.py

+
+
+class RadioComercialIE(InfoExtractor):
+    _VALID_URL = r'https?://(?:www\.)?radiocomercial\.pt/podcasts/[^/?#]+/t?(?P<season>\d+)/(?P<id>[\w-]+)/?(?:$|[?#])'


Read #8535 (comment)
Here I have the same question:

Suggested change

-    _VALID_URL = r'https?://(?:www\.)?radiocomercial\.pt/podcasts/[^/?#]+/t?(?P<season>\d+)/(?P<id>[\w-]+)/?(?:$|[?#])'
+    _VALID_URL = r'https?://(?:www\.)?radiocomercial\.pt/podcasts/[^/?#]+/t?(?P<season>\d+)/(?P<id>[\w-]+)'

The trailing /?(?:$|[?#]) is needed in RadioComercialPlaylistIE._VALID_URL, but I don't think we need it here? Unless I'm missing something

I believe you are correct, bashonly. I removed it from the single episode regex.
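
To make the difference between the two patterns concrete, here is a small comparison, assuming `_VALID_URL` is applied with prefix matching (`re.match`). The `CASES` list and the `compare` helper are hypothetical; the `/extra` and query-string variants are made-up inputs, not URLs known to exist on the site.

```python
import re

BASE_PATTERN = (
    r'https?://(?:www\.)?radiocomercial\.pt/podcasts/[^/?#]+'
    r'/t?(?P<season>\d+)/(?P<id>[\w-]+)')

WITH_TAIL = re.compile(BASE_PATTERN + r'/?(?:$|[?#])')  # before the suggestion
WITHOUT_TAIL = re.compile(BASE_PATTERN)                 # after the suggestion

EPISODE = 'https://radiocomercial.pt/podcasts/tnt-todos-no-top/2023/t-n-t-29-de-outubro'
CASES = [
    EPISODE,
    EPISODE + '/',
    EPISODE + '?x=1',
    EPISODE + '#top',
    EPISODE + '/extra',  # extra path segment after the episode slug
    'https://radiocomercial.pt/podcasts/tnt-todos-no-top/2023/',  # unavailable episode
]


def compare(url):
    """Return (matches_with_tail, matches_without_tail) for one URL."""
    return bool(WITH_TAIL.match(url)), bool(WITHOUT_TAIL.match(url))


for url in CASES:
    print(compare(url), url)
```

Under these assumptions the two patterns agree on every case except the one with an extra path segment, which only the shorter pattern accepts, and both extract the same `id`. Neither matches the bare season URL, which is consistent with the "No suitable extractor" error in the log above.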

Authored by: SirElderling

[RadioComercial] Add extractor

bd9904c

Grub4K added the site-request (Request to support a new website) label Nov 3, 2023

bashonly self-requested a review November 3, 2023 20:05

seproDev requested changes Nov 4, 2023

seproDev added the pending-fixes (PR has had changes requested) label Nov 4, 2023

[RadioComercial] Add extractor - applied review recommendations

431858e

SirElderling requested a review from seproDev November 5, 2023 11:10

seproDev requested changes Nov 5, 2023

SirElderling and others added 3 commits November 5, 2023 19:00

Update yt_dlp/extractor/radiocomercial.py

e0c3428

Co-authored-by: sepro <4618135+seproDev@users.noreply.github.com>

[RadioComercial] Add extractor - more review recommendations

c24e8e3

[RadioComercial] Add extractor - add skip to test with inconsistent md5

473860d

SirElderling requested a review from seproDev November 5, 2023 21:00

seproDev reviewed Nov 5, 2023

yt_dlp/extractor/radiocomercial.py (outdated, resolved)

yt_dlp/extractor/radiocomercial.py (outdated, resolved)

yt_dlp/extractor/radiocomercial.py (outdated, resolved)

SirElderling and others added 3 commits November 5, 2023 21:35

Update yt_dlp/extractor/radiocomercial.py

b8445e2

Co-authored-by: sepro <4618135+seproDev@users.noreply.github.com>

Update yt_dlp/extractor/radiocomercial.py

a4dc13c

Co-authored-by: sepro <4618135+seproDev@users.noreply.github.com>

[RadioComercial] Add extractor - fix tests formatting

f8c9d92

SirElderling requested a review from seproDev November 6, 2023 06:32

[RadioComercial] Add extractor - fixed playlist URL regex to not match single episodes

0d2c5c2

seproDev approved these changes Nov 6, 2023

yt_dlp/extractor/radiocomercial.py (outdated, resolved)

seproDev added the pending-review (PR needs a review) label and removed the pending-fixes (PR has had changes requested) label Nov 6, 2023

SirElderling and others added 2 commits November 6, 2023 08:42

[RadioComercial] Add extractor - Playlist regex to match URLs that have query parameters or use anchors.

363047f

Co-authored-by: sepro <4618135+seproDev@users.noreply.github.com>

[RadioComercial] Add extractor - add urljoin function + refactor on date extraction code

c152ce1

bashonly requested changes Nov 9, 2023

bashonly added the pending-fixes (PR has had changes requested) label and removed the pending-review (PR needs a review) label Nov 9, 2023

[RadioComercial] Add extractor - additional fixes based on review suggestions

4a9b211

bashonly requested changes Nov 10, 2023

[RadioComercial] Add extractor - rewritten the playlist logic based on review inputs

86162ce

bashonly approved these changes Nov 10, 2023

yt_dlp/extractor/radiocomercial.py (outdated, resolved)

yt_dlp/extractor/radiocomercial.py (outdated, resolved)

bashonly added the pending-review (PR needs a review) label and removed the pending-fixes (PR has had changes requested) label Nov 10, 2023

Apply suggestions from code review

0e6ad22

bashonly reviewed Nov 11, 2023

[RadioComercial] Add extractor - adjusted the single episode URL regex

7c5c897

bashonly removed the pending-review (PR needs a review) label Nov 11, 2023

bashonly self-assigned this Nov 11, 2023

bashonly merged commit ef12dbd into yt-dlp:master Nov 11, 2023
16 checks passed

SirElderling deleted the radiocomercial branch November 12, 2023 08:18

aalsuwaidi pushed a commit to aalsuwaidi/yt-dlp that referenced this pull request Apr 21, 2024

[ie/radiocomercial] Add extractors (yt-dlp#8508)

af332ca

Authored by: SirElderling



bashonly Nov 11, 2023

SirElderling Nov 11, 2023



