[TheGuardian] Add Extractor for podcasts #8535

SirElderling · 2023-11-06T20:59:36Z

IMPORTANT: PRs without the template will be CLOSED

Description of your pull request and other information

The purpose of this extractor is to download The Guardian podcast playlists and single episodes.

Fixes #8520

Template

Before submitting a pull request make sure you have:

At least skimmed through contributing guidelines including yt-dlp coding conventions
Searched the bugtracker for similar pull requests
Checked the code with flake8 and ran relevant tests

In order to be accepted and merged into yt-dlp each piece of code must be in public domain or released under Unlicense. Check all of the following options that apply:

I am the original author of this code and I am willing to release it under Unlicense
I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

Fix or improvement to an extractor (Make sure to add/update tests)
New extractor (Piracy websites will not be accepted)
Core bug fix/improvement
New feature (It is strongly recommended to open an issue first)

Copilot Summary

`🤖 Generated by Copilot at c8f48ae`

Summary

🎧📰🐍

Add a new extractor TheGuardianPodcastIE for The Guardian podcast pages in theguardian.py and import it in _extractors.py. This allows yt-dlp to download audio files from The Guardian podcast URLs.

TheGuardianPodcastIE
New extractor for autumn podcasts
Inherits InfoExtractor

Walkthrough

Add support for The Guardian podcasts
- Import TheGuardianPodcastIE from theguardian.py in _extractors.py
- Define TheGuardianPodcastIE class in theguardian.py

SirElderling · 2023-11-10T07:05:55Z

Converted this PR to draft in order to add playlist extraction.

bashonly · 2023-11-11T13:04:26Z

yt_dlp/extractor/theguardian.py

+        title = self._generic_title(url, webpage, default='')
+        description = self._og_search_description(webpage) or get_element_by_class(
+            'header__description', webpage)


Might be nice to extract a title without the " | The Guardian" junk at the end. Also use clean_html just in case

Suggested change

-        title = self._generic_title(url, webpage, default='')
-        description = self._og_search_description(webpage) or get_element_by_class(
-            'header__description', webpage)
+        title = clean_html(get_element_by_class(
+            'index-page-header__title', webpage)) or self._generic_title(url, webpage)
+        description = self._og_search_description(webpage) or clean_html(get_element_by_class(
+            'header__description', webpage))

SirElderling · 2023-11-11

If the URL given by the user includes a page number, the webpage title will be something like this: Today in Focus | Page 2 of 66 | News | The Guardian.
Is there a helper function that can be used here to clean the text? Or is it fine to do something like title, _ = title.split('|')?

bashonly · 2023-11-15

remove_end is the helper typically used for this, but it needs a fixed string to remove, so it wouldn't be useful here.
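To make the point about a fixed suffix concrete, here is a minimal sketch. The `remove_end` below is a stand-in written for this example and only assumed to mirror the yt-dlp helper's behaviour; the unpaged title and the `suffix` value are hypothetical, and only the paged title is quoted in this thread.

```python
def remove_end(s, end):
    """Stand-in for illustration (assumed to mirror yt-dlp's helper):
    strip `end` from `s` only when `s` ends with exactly that string."""
    if s is not None and end and s.endswith(end):
        return s[:-len(end)]
    return s


# Hypothetical unpaged title; the paged one is the example from the thread.
plain = 'Today in Focus | News | The Guardian'
paged = 'Today in Focus | Page 2 of 66 | News | The Guardian'
suffix = ' | News | The Guardian'  # assumed fixed suffix

plain_result = remove_end(plain, suffix)
paged_result = remove_end(paged, suffix)
print(plain_result)   # Today in Focus
print(paged_result)   # Today in Focus | Page 2 of 66

# Two-name unpacking of split('|') fails here: the paged title has four parts.
try:
    title, _ = paged.split('|')
    unpack_error = None
except ValueError as exc:
    unpack_error = exc
print(type(unpack_error).__name__)  # ValueError
```

Because the "Page 2 of 66" part changes from page to page, no single fixed suffix covers every title, which is why the thread moves on to reading the title from a dedicated element instead.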

IMO let's try to grab the clean title from one of these elements instead of doing string surgery with the title element

title = clean_html(get_element_by_class('index-page-header__title', webpage) or get_element_by_class('flagship-audio__title', webpage))
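The fallback in that one-liner can be sketched in isolation. Both helpers below are simplified, hypothetical stand-ins written for this example (regex-based, handling only elements without a nested tag of the same name), not yt-dlp's implementations, and the HTML snippets are made up; only the two class names come from the suggestion above.

```python
import html
import re


def get_element_by_class(class_name, webpage):
    """Simplified stand-in: inner HTML of the first element carrying
    class_name, or None when no such element exists."""
    pattern = (
        r'<(?P<tag>[a-zA-Z0-9]+)\b[^>]*\bclass="[^"]*(?<![\w-])'
        + re.escape(class_name)
        + r'(?![\w-])[^"]*"[^>]*>(?P<content>.*?)</(?P=tag)>'
    )
    match = re.search(pattern, webpage, flags=re.DOTALL)
    return match.group('content') if match else None


def clean_html(fragment):
    """Simplified stand-in: drop tags, collapse whitespace, unescape entities."""
    if fragment is None:
        return None
    text = re.sub(r'<[^>]+>', '', fragment)
    return html.unescape(re.sub(r'\s+', ' ', text)).strip()


def extract_title(webpage):
    # `or` falls through to the second class when the first lookup is None.
    return clean_html(
        get_element_by_class('index-page-header__title', webpage)
        or get_element_by_class('flagship-audio__title', webpage))


# Hypothetical page fragments, one per layout.
series_page = ('<div class="index-page-header"><h1 class="index-page-header__title">'
               '<a href="/x">Today in Focus</a></h1></div>')
episode_page = ('<div class="flagship-audio">'
                '<span class="flagship-audio__title">Example &amp; Episode</span></div>')

print(extract_title(series_page))
print(extract_title(episode_page))
print(extract_title('<p>no title element</p>'))
```

Passing the `or` expression into a single `clean_html` call works because a missing element yields `None`, which is falsy, and the cleaning step is assumed to pass `None` through unchanged.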

SirElderling · 2023-11-17

I implemented your suggestion, bashonly. So far, it seems to pick up all the titles correctly.

SirElderling · 2023-11-18T06:45:31Z

Thank you @bashonly for all the suggestions provided. The code has been adjusted with them in mind.

Closes yt-dlp#8520 Authored by: SirElderling

[TheGuardian] Add Extractor

c8f48ae

bashonly added the site-request Request to support a new website label Nov 6, 2023

bashonly self-requested a review November 6, 2023 22:44

[TheGuardian] Add Extractor - additional fix

5f71cc2

SirElderling marked this pull request as draft November 10, 2023 07:05

[TheGuardian] Add Extractor - Add playlist extraction

13a4ddd

SirElderling marked this pull request as ready for review November 10, 2023 09:23

[TheGuardian] Add Extractor - adjusted episodes URL regex

47d6bf7

SirElderling changed the title ~~[TheGuardian] Add Extractor~~ [TheGuardian] Add Extractor for podcasts Nov 11, 2023

bashonly requested changes Nov 11, 2023

bashonly added the pending-fixes PR has had changes requested label Nov 11, 2023

bashonly mentioned this pull request Nov 11, 2023

[RadioComercial] Add extractor #8508

Merged

9 tasks

[TheGuardian] Add Extractor for podcasts - rewritten playlist logic d…

023ebae

…ue to number of episodes per page

bashonly removed the pending-fixes PR has had changes requested label Nov 11, 2023

bashonly self-requested a review November 11, 2023 19:48

bashonly requested changes Nov 15, 2023

bashonly added the pending-fixes PR has had changes requested label Nov 15, 2023

[TheGuardian] Add Extractor for podcasts - apply review suggestions

4635d20

bashonly removed the pending-fixes PR has had changes requested label Nov 17, 2023

bashonly approved these changes Nov 17, 2023

[TheGuardian] Add Extractor for podcasts - apply additional review su…

c99a484

…ggestions

just to be sure

d8ecf72

bashonly self-assigned this Nov 18, 2023

bashonly merged commit 1fa3f24 into yt-dlp:master Nov 18, 2023
15 checks passed

SirElderling deleted the theguardian branch November 19, 2023 08:22

aalsuwaidi pushed a commit to aalsuwaidi/yt-dlp that referenced this pull request Apr 21, 2024

[ie/theguardian] Add extractors (yt-dlp#8535)

21d3ca2

Closes yt-dlp#8520 Authored by: SirElderling
