New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ProSiebenSat1] Improve title extraction (#13915) #14128
Conversation
With this commit, og:title titles are preferred over the old extraction. Some tests had to be adjusted, but I have verified the now extracted titles are equally well or better.
if title is None: | ||
self._html_search_regex( | ||
self._TITLE_REGEXES, webpage, 'title', | ||
default=None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
title
must not be None
.
self._TITLE_REGEXES, webpage, 'title', | ||
default=None) or self._og_search_title(webpage) | ||
title = self._og_search_title(webpage) | ||
if title is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
title
can't be None
here. Don't change the order of extraction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changing the order of extraction is the whole point of the PR, because the title extraction with _TITLE_REGEXES
didn't lead to the expected results (see referenced issue). So I decided to keep them as fallback, but prefer the _og_search_title
result.
Looking at it's definition, I do see a return None
. Or is that return path unreachable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For your reference (after line 889 in common.py
):
def _og_search_property(self, prop, html, name=None, **kargs):
if not isinstance(prop, (list, tuple)):
prop = [prop]
if name is None:
name = 'OpenGraph %s' % prop[0]
og_regexes = []
for p in prop:
og_regexes.extend(self._og_regexes(p))
escaped = self._search_regex(og_regexes, html, name, flags=re.DOTALL, **kargs)
if escaped is None:
return None
return unescapeHTML(escaped)
_og_search_property
is called by _og_search_title
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Original order was intentional since _og_search_title
provides incorrect titles for some extractors that have skipped tests now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your fast reply. I checked all tests and found the following:
test_ProSiebenSat1_2
and test_ProSiebenSat1_5
have ERROR
s (can't extract thumbnail), so not related to this.
test_ProSiebenSat1_9
FAIL
No title at all (None
). Could look into this, but at least this is not a wrong title. Not related, too.
test_ProSiebenSat1_3, test_ProSiebenSat1_4, test_ProSiebenSat1_6,test_ProSiebenSat1_11
All FAIL
, because of wrong titles (you are referring to those, I guess):
- Sexy laufen in Ugg Boots
+ Stars & Style - Sexy laufen in Ugg Boots
- Im Interview: Kai Wiesinger
+ Der Rücktritt - Im Interview: Kai Wiesinger
- Schalke: Tönnies möchte Raul zurück
+ Bundesliga - Schalke: Tönnies möchte Raul zurück
- Jetzt erst enthüllt: Das Geheimnis von Emma Stones Oscar-Robe
+ Oscars ® 2017 - Jetzt erst enthüllt: Das Geheimnis von Emma Stones Oscar-Robe
Since I'm german, I can see, that the "wrong" titles are actually better than the expected ones (inconsistent webpage though ^^ ). So at least, for this extractor, _og_search_title
provides very good results. And obviously (#13915), the current way of title extraction doesn't work better than _og_search_title
.
Before submitting a pull request make sure you have:
In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:
What is the purpose of your pull request?
With this commit, og:title titles are preferred over the old extraction.
Some tests had to be adjusted, but I have verified the now extracted titles are equally well or better (the old titles in the tests weren't wrong though).
The relevant tests can be executed by doing:
python test/test_download.py TestDownload.test_ProSiebenSat1{,_7,_8,_10}
This fixes #13915.