Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ProSiebenSat1] Improve title extraction (#13915) #14128

Closed
wants to merge 1 commit into from

Conversation

kayb94
Copy link
Contributor

@kayb94 kayb94 commented Sep 5, 2017

Before submitting a pull request make sure you have:

In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:

  • I am the original author of this code and I am willing to release it under Unlicense

What is the purpose of your pull request?

  • Bug fix
  • Improvement

With this commit, og:title titles are preferred over the old extraction.
Some tests had to be adjusted, but I have verified the now extracted titles are equally well or better (the old titles in the tests weren't wrong though).
The relevant tests can be executed by doing:
python test/test_download.py TestDownload.test_ProSiebenSat1{,_7,_8,_10}

This fixes #13915.

With this commit, og:title titles are preferred over the old extraction.
Some tests had to be adjusted, but I have verified the now extracted
titles are equally well or better.
if title is None:
self._html_search_regex(
self._TITLE_REGEXES, webpage, 'title',
default=None)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

title must not be None.

self._TITLE_REGEXES, webpage, 'title',
default=None) or self._og_search_title(webpage)
title = self._og_search_title(webpage)
if title is None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

title can't be None here. Don't change the order of extraction.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing the order of extraction is the whole point of the PR, because the title extraction with _TITLE_REGEXES didn't lead to the expected results (see referenced issue). So I decided to keep them as fallback, but prefer the _og_search_title result.
Looking at it's definition, I do see a return None. Or is that return path unreachable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For your reference (after line 889 in common.py):

def _og_search_property(self, prop, html, name=None, **kargs):
    if not isinstance(prop, (list, tuple)):
        prop = [prop]
    if name is None:
        name = 'OpenGraph %s' % prop[0]
    og_regexes = []
    for p in prop:
        og_regexes.extend(self._og_regexes(p))
    escaped = self._search_regex(og_regexes, html, name, flags=re.DOTALL, **kargs)
    if escaped is None:
        return None
    return unescapeHTML(escaped)

_og_search_property is called by _og_search_title

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Original order was intentional since _og_search_title provides incorrect titles for some extractors that have skipped tests now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your fast reply. I checked all tests and found the following:

test_ProSiebenSat1_2 and test_ProSiebenSat1_5 have ERRORs (can't extract thumbnail), so not related to this.

test_ProSiebenSat1_9 FAIL
No title at all (None). Could look into this, but at least this is not a wrong title. Not related, too.

test_ProSiebenSat1_3, test_ProSiebenSat1_4, test_ProSiebenSat1_6,test_ProSiebenSat1_11
All FAIL, because of wrong titles (you are referring to those, I guess):

- Sexy laufen in Ugg Boots
+ Stars & Style - Sexy laufen in Ugg Boots

- Im Interview: Kai Wiesinger
+ Der Rücktritt - Im Interview: Kai Wiesinger

- Schalke: Tönnies möchte Raul zurück
+ Bundesliga - Schalke: Tönnies möchte Raul zurück

- Jetzt erst enthüllt: Das Geheimnis von Emma Stones Oscar-Robe
+ Oscars ® 2017 - Jetzt erst enthüllt: Das Geheimnis von Emma Stones Oscar-Robe

Since I'm german, I can see, that the "wrong" titles are actually better than the expected ones (inconsistent webpage though ^^ ). So at least, for this extractor, _og_search_title provides very good results. And obviously (#13915), the current way of title extraction doesn't work better than _og_search_title.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[prosiebensat1] Wrong title extracted from ran.de video
2 participants