Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[extractor/antenna] Improvements #7584

Merged
merged 7 commits into from Aug 29, 2023

Conversation

stdedos
Copy link
Contributor

@stdedos stdedos commented Jul 13, 2023

  • Rename base extractor as Antenna, as that's the top-level name
  • Modernize extractor code, including
    • Avoid _og_search_description warning
    • Set format to mp4 (or equivalent, instead of 0)
  • Update tests for bigger images
  • Support extracting also antenna.gr websites (+ add a test)

IMPORTANT: PRs without the template will be CLOSED

Description of your pull request and other information

ADD DESCRIPTION HERE

Fixes #

Template

Before submitting a pull request make sure you have:

In order to be accepted and merged into yt-dlp each piece of code must be in public domain or released under Unlicense. Check all of the following options that apply:

  • I am the original author of this code and I am willing to release it under Unlicense
  • I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

Copilot Summary

🤖 Generated by Copilot at 5b2d77b

Summary

🚚✨♻️

Renamed and refactored the ant1newsgr.py extractor to antenna.py to support both ant1news.gr and antenna.gr websites. Added a new extractor class for antenna.gr watch pages and a base class for common logic. Updated the import statement in _extractors.py.

ant1newsgr.py
Renamed to antenna.py
Snowflakes of refactor

Walkthrough

  • Rename ant1newsgr.py to antenna.py and update import statement in _extractors.py (link)
  • Create abstract base class AntennaWatchGrBaseIE with abstract property _API_PATH and simplify format and subtitle extraction logic (link, link, link)
  • Modify Ant1NewsGrWatchIE to inherit from AntennaWatchGrBaseIE and implement _API_PATH (link)
  • Add new subclass AntennaGrWatchIE to handle videos from antenna.gr and provide test case (link)
  • Modify Ant1NewsGrEmbedIE to inherit from AntennaWatchGrBaseIE and implement _API_PATH (link)

Copy link
Member

@pukkandan pukkandan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All you had to do is

-    _VALID_URL = r'https?://(?P<netloc>(?:www\.)?ant1news\.gr)/watch/(?P<id>\d+)/'
+    _VALID_URL = r'https?://(?P<netloc>(?:www\.)?(?:antenna|ant1news)\.gr)/watch/(?P<id>\d+)/' 

then add test and maybe change IE_NAME/DESC. The refactoring is not welcome and the new class seem pointless

@pukkandan pukkandan added site-enhancement Feature request for some website pending-fixes PR has had changes requested labels Jul 14, 2023
@stdedos
Copy link
Contributor Author

stdedos commented Jul 14, 2023

The refactoring is not welcome and the new class seem pointless

Well - they look like different websites, and I thought changing the names reported might be confusing for the users.

The new class (and the selected test) suffer from

ERROR: [antennagr:watch] 1643812: [antennagr:watch] unable to extract OpenGraph description; please report this issue on  https://github.com/yt-dlp/yt-dlp/issues?q= , filling out the appropriate issue template. Confirm you are on the latest version using  yt-dlp -U; please report this issue on  https://github.com/yt-dlp/yt-dlp/issues?q= , filling out the appropriate issue template. Confirm you are on the latest version using  yt-dlp -U
  File "/home/stdedos/Documents/WorkBulk/Ubuntu/repos/yt-dlp/yt_dlp/extractor/common.py", line 710, in extract
    ie_result = self._real_extract(url)
  File "/home/stdedos/Documents/WorkBulk/Ubuntu/repos/yt-dlp/yt_dlp/extractor/antenna.py", line 65, in _real_extract
    info['description'] = self._og_search_description(webpage)
  File "/home/stdedos/Documents/WorkBulk/Ubuntu/repos/yt-dlp/yt_dlp/extractor/common.py", line 1392, in _og_search_description
    return self._og_search_property('description', html, fatal=False, **kargs)
  File "/home/stdedos/Documents/WorkBulk/Ubuntu/repos/yt-dlp/yt_dlp/extractor/common.py", line 1383, in _og_search_property
    escaped = self._search_regex(og_regexes, html, name, flags=re.DOTALL, **kargs)
  File "/home/stdedos/Documents/WorkBulk/Ubuntu/repos/yt-dlp/yt_dlp/extractor/common.py", line 1260, in _search_regex
    self.report_warning('unable to extract %s' % _name + bug_reports_message())
  File "/home/stdedos/Documents/WorkBulk/Ubuntu/repos/yt-dlp/yt_dlp/extractor/common.py", line 1127, in report_warning
    self._downloader.report_warning(msg, *args, **kwargs)
  File "/home/stdedos/Documents/WorkBulk/Ubuntu/repos/yt-dlp/test/helper.py", line 315, in _report_warning
    real_warning(w, *args, **kwargs)
  File "test/test_download.py", line 51, in report_warning
    raise ExtractorError(message)

E

which I don't know how to fix either 😕

@pukkandan
Copy link
Member

Do we want that warning to be shown to user? If so, add expected_warnings to test. If not, add a default=None to self._og_search_description

@stdedos stdedos force-pushed the impr/extractor/antenna/2023-07-14 branch from 5b2d77b to 3d23a30 Compare July 14, 2023 07:21
@stdedos
Copy link
Contributor Author

stdedos commented Jul 14, 2023

Test should now be working.

Do you still want me to squash the new class to the old one?
If so, which name do I pick?

@stdedos
Copy link
Contributor Author

stdedos commented Jul 14, 2023

Can I somehow make that output better?

[info] 1643812: Downloading 1 format(s): 0

like ... idk? mp4?

That definitely makes sense in the youtube extractor, but not on this extractor 😕

@pukkandan
Copy link
Member

Can I somehow make that output better?

[info] 1643812: Downloading 1 format(s): 0

like ... idk? mp4?

You can set format_ids, but mp4 won't work unless there's only a single format and it's guaranteed to be mp4

@pukkandan
Copy link
Member

@stdedos
Copy link
Contributor Author

stdedos commented Jul 22, 2023

  • The tests are failing

I am not sure which tests are failing 😕

I ran the ones I edited, and the CI didn't give anything anything red

* Rename base extractor as Antenna, as that's the top-level name
* Make `Ant1NewsGrBaseIE` to `AntennaWatchGrBaseIE` as an ABC
  * `_API_PATH` is mandatory, differs per extractor
* Untangle `formats, subs` assignment
* Create a clone of `Ant1NewsGrWatchIE`, `AntennaGrWatchIE`
  (with almost everything similar)
* Convert `Ant1NewsGrArticleIE` to `InfoExtractor`,
  as it shared nothing with its parent class

Signed-off-by: Stavros Ntentos <133706+stdedos@users.noreply.github.com>
@stdedos stdedos force-pushed the impr/extractor/antenna/2023-07-14 branch from 846ec51 to 5c6e8a7 Compare July 22, 2023 07:13
@stdedos stdedos force-pushed the impr/extractor/antenna/2023-07-14 branch from 5c6e8a7 to e52fb83 Compare July 22, 2023 07:15
@pukkandan
Copy link
Member

❯ pytest --tb=short -k AntennaGrWatch
========================================================================================== test session starts ==========================================================================================
platform win32 -- Python 3.11.4, pytest-7.4.0, pluggy-1.2.0 -- REDACTED
cachedir: .pytest_cache
rootdir: REDACTED
configfile: setup.cfg
collected 5882 items / 5880 deselected / 2 selected

test/test_download.py::TestDownload::test_AntennaGrWatch FAILED                                                                                                                                    [ 50%]
test/test_download.py::TestDownload::test_AntennaGrWatch_all PASSED                                                                                                                                [100%]

=============================================================================================== FAILURES ================================================================================================
___________________________________________________________________________________ TestDownload.test_AntennaGrWatch ____________________________________________________________________________________
test\test_download.py:215: in test_template
    expect_info_dict(self, tc_res_dict, tc.get('info_dict', {}))
test\helper.py:233: in expect_info_dict
    expect_dict(self, got_dict, expected_dict)
test\helper.py:186: in expect_dict
    expect_value(self, got, expected, info_field)
test\helper.py:178: in expect_value
    self.assertEqual(
E   AssertionError: 'ΟΙ ΠΡΟΔΟΤΕΣ – ΕΠΕΙΣΟΔΙΟ 01' != 'antennagr-watch video #1643812'
E   - ΟΙ ΠΡΟΔΟΤΕΣ – ΕΠΕΙΣΟΔΙΟ 01
E   + antennagr-watch video #1643812
E    : Invalid value for field title, expected 'ΟΙ ΠΡΟΔΟΤΕΣ – ΕΠΕΙΣΟΔΙΟ 01', got 'antennagr-watch video #1643812'
----------------------------------------------------------------------------------------- Captured stdout call ------------------------------------------------------------------------------------------
[debug] Loaded 1856 extractors
[antennagr:watch] Extracting URL: https://www.antenna.gr/watch/1643812/oi-prodotes-epeisodio-01
[antennagr:watch] 1643812: Downloading webpage
[antennagr:watch] 1643812: Downloading JSON metadata
[debug] Extractor gave empty title. Creating a generic title
[debug] Formats sorted by: hasvid, ie_pref, lang, quality, res, fps, hdr:12(7), vcodec:vp9.2(10), channels, acodec, size, br, asr, proto, vext, aext, hasaud, source, id
[info] 1643812: Downloading 1 format(s): 0
[info] Writing video metadata as JSON to: test_AntennaGrWatch_1643812.info.json
[debug] Invoking http downloader on "https://antennavod.akamaized.net/VODS2/noGR.mp4"
[debug] File locking is not supported. Proceeding without locking[download] Destination: test_AntennaGrWatch_1643812.mp4
[download] 100% of   10.00KiB in 00:00:00 at 53.79KiB/s
======================================================================================== short test summary info ========================================================================================
FAILED test/test_download.py::TestDownload::test_AntennaGrWatch - AssertionError: 'ΟΙ ΠΡΟΔΟΤΕΣ – ΕΠΕΙΣΟΔΙΟ 01' != 'antennagr-watch video #1643812'

Add missed `class` rename

Signed-off-by: Stavros Ntentos <133706+stdedos@users.noreply.github.com>
@stdedos stdedos force-pushed the impr/extractor/antenna/2023-07-14 branch from e52fb83 to 46ea0b8 Compare July 22, 2023 09:31
@stdedos
Copy link
Contributor Author

stdedos commented Jul 22, 2023

I don't have the same failures as you 😕

$ pytest --tb=short -k AntennaGrWatch
=========================================================================================== test session starts ===========================================================================================
platform linux -- Python 3.8.10, pytest-7.2.2, pluggy-1.0.0 -- /usr/bin/python3
cachedir: .pytest_cache
rootdir: ~repos/yt-dlp, configfile: setup.cfg
plugins: pudb-0.7.0
collected 6013 items / 6010 deselected / 3 selected                                                                                                                                                       

test/test_download.py::TestDownload::test_AntennaGrWatch FAILED                                                                                                                                     [ 33%]
test/test_download.py::TestDownload::test_AntennaGrWatch_1 PASSED                                                                                                                                   [ 66%]
test/test_download.py::TestDownload::test_AntennaGrWatch_all PASSED                                                                                                                                 [100%]

================================================================================================ FAILURES =================================================================================================
____________________________________________________________________________________ TestDownload.test_AntennaGrWatch _____________________________________________________________________________________
test/test_download.py:212: in test_template
    expect_info_dict(self, tc_res_dict, tc.get('info_dict', {}))
test/helper.py:233: in expect_info_dict
    expect_dict(self, got_dict, expected_dict)
test/helper.py:186: in expect_dict
    expect_value(self, got, expected, info_field)
test/helper.py:178: in expect_value
    self.assertEqual(
E   AssertionError: 'http[22 chars]e.net/imgHandler/640/26d46bf6-8158-4f02-b197-7096c714b2de.jpg' != 'http[22 chars]e.net/imgHandler/1000/26d46bf6-8158-4f02-b197-7096c714b2de.jpg'
E   - https://ant1media.azureedge.net/imgHandler/640/26d46bf6-8158-4f02-b197-7096c714b2de.jpg
E   ?                                            ^^
E   + https://ant1media.azureedge.net/imgHandler/1000/26d46bf6-8158-4f02-b197-7096c714b2de.jpg
E   ?                                            ^^^
E    : Invalid value for field thumbnail, expected 'https://ant1media.azureedge.net/imgHandler/640/26d46bf6-8158-4f02-b197-7096c714b2de.jpg', got 'https://ant1media.azureedge.net/imgHandler/1000/26d46bf6-8158-4f02-b197-7096c714b2de.jpg'
------------------------------------------------------------------------------------------ Captured stdout call -------------------------------------------------------------------------------------------
[debug] Loaded 1857 extractors
[antenna:watch] Extracting URL: https://www.ant1news.gr/watch/1506168/ant1-news-09112021-stis-18-45
[antenna:watch] 1506168: Downloading webpage
[antenna:watch] 1506168: Downloading JSON metadata
[debug] Formats sorted by: hasvid, ie_pref, lang, quality, res, fps, hdr:12(7), vcodec:vp9.2(10), channels, acodec, size, br, asr, proto, vext, aext, hasaud, source, id
[info] 1506168: Downloading 1 format(s): mp4
[info] Writing video metadata as JSON to: test_AntennaGrWatch_1506168.info.json
[debug] Invoking http downloader on "https://antennavod.akamaized.net/VODS2/3c2ce15b-0749-40cb-853a-869fb34d6fed.mp4"
[download] Destination: test_AntennaGrWatch_1506168.mp4
[download] 100% of   10.00KiB in 00:00:00 at 35.10KiB/s    
========================================================================================= short test summary info =========================================================================================
FAILED test/test_download.py::TestDownload::test_AntennaGrWatch - AssertionError: 'http[22 chars]e.net/imgHandler/640/26d46bf6-8158-4f02-b197-7096c714b2de.jpg' != 'http[22 chars]e.net/imgHandler/1000/26d46bf6-8158-4f02-b197-7096c714b2de.jpg'
============================================================================== 1 failed, 2 passed, 6010 deselected in 3.69s ===============================================================================

Both URLs are valid; they just have difference in size (funnily, /1000/ has an 999px * img) - and the templates/data/player was changed to offer 1000 "since then".
It seems that https://ant1media.azureedge.net/imgHandler/%d/26d46bf6-8158-4f02-b197-7096c714b2de.jpg is an argument`

Not strictly related to this PR, but since we are updating the extractor,
might as well clean it up
@pukkandan pukkandan removed the pending-fixes PR has had changes requested label Jul 22, 2023
@pukkandan
Copy link
Member

I don't have the same failures as you 😕

Must be due to geolocation. If tests are passing for you, we can merge.

Signed-off-by: Stavros Ntentos <133706+stdedos@users.noreply.github.com>
@stdedos
Copy link
Contributor Author

stdedos commented Jul 22, 2023

Somehow I had to update the md5 hash. Took the opportunity to add one for the new test too.

@stdedos
Copy link
Contributor Author

stdedos commented Aug 22, 2023

@pukkandan Are we now ready? 😕

@bashonly
Copy link
Member

shouldn't we revert the AntennaGrWatchIE class name back to Ant1NewsGrWatchIE ?

@stdedos
Copy link
Contributor Author

stdedos commented Aug 22, 2023

I don't see why. "Antenna" is the company, and Ant1 "sounds the same as Antenna", and "ant1news.gr" belongs to "antenna.gr".
Original Author probably focused on the specific place he tested - but the logic and code are exactly the same for both.

"I'd consider that" if we go with the original approach of AntennaGrBaseIE.

@bashonly
Copy link
Member

I don't see why.

It's not just a cosmetic change, it would be to maintain compatibility of --download-archive

Or else you should probably add _old_archive_ids to its returned info dict

Signed-off-by: Stavros Ntentos <133706+stdedos@users.noreply.github.com>
@stdedos
Copy link
Contributor Author

stdedos commented Aug 22, 2023

Is this now complete + correct?

yt_dlp/extractor/antenna.py Outdated Show resolved Hide resolved
@stdedos
Copy link
Contributor Author

stdedos commented Aug 22, 2023

"Are we there yet"? 🙃

Copy link
Member

@bashonly bashonly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs a final review by maintainer-in-chief before it can be merged; it will happen when time permits. Thanks for your patience

@bashonly bashonly added the pending-review PR needs a review label Aug 22, 2023
Copy link
Member

@pukkandan pukkandan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would have been easier imo to not touch the IE class names. It's only IE_NAME that users see. But since it's already done, I won't object

@bashonly bashonly removed the pending-review PR needs a review label Aug 27, 2023
@bashonly bashonly merged commit 6658760 into yt-dlp:master Aug 29, 2023
13 checks passed
aalsuwaidi pushed a commit to aalsuwaidi/yt-dlp that referenced this pull request Apr 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
site-enhancement Feature request for some website
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants