[ProSiebenSat1] Improve title extraction (#13915) #14128

kayb94 · 2017-09-05T19:49:32Z

Before submitting a pull request make sure you have:

At least skimmed through adding new extractor tutorial and youtube-dl coding conventions sections
Searched the bugtracker for similar pull requests

In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:

I am the original author of this code and I am willing to release it under Unlicense

What is the purpose of your pull request?

Bug fix
Improvement

With this commit, og:title titles are preferred over the old extraction.
Some tests had to be adjusted, but I have verified that the newly extracted titles are equally good or better (the old titles in the tests weren't wrong, though).
The relevant tests can be run with:
python test/test_download.py TestDownload.test_ProSiebenSat1{,_7,_8,_10}

This fixes #13915.
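
To make the intended order concrete, here is a minimal, self-contained sketch of "og:title first, regexes as fallback". It does not use youtube-dl itself: `OG_TITLE_RE`, `TITLE_REGEXES`, `search_first` and `extract_title` are hypothetical stand-ins invented for this illustration, and the real patterns in prosiebensat1.py differ.

```python
import html
import re

# Hypothetical pattern for the og:title meta tag (simplified; assumes
# double-quoted attributes in this order).
OG_TITLE_RE = r'<meta[^>]+property="og:title"[^>]+content="([^"]+)"'

# Hypothetical stand-in for the extractor's _TITLE_REGEXES list.
TITLE_REGEXES = [
    r'<h2[^>]+class="subtitle"[^>]*>\s*(.+?)</h2>',
]


def search_first(patterns, page):
    """Return the first capture group of the first matching pattern, or None."""
    for pattern in patterns:
        match = re.search(pattern, page, flags=re.DOTALL)
        if match:
            return html.unescape(match.group(1)).strip()
    return None


def extract_title(page):
    """Prefer og:title; fall back to the older regexes only if it is missing."""
    title = search_first([OG_TITLE_RE], page)
    if title is None:
        title = search_first(TITLE_REGEXES, page)
    return title
```

With a page that carries both an og:title tag and the older markup, the og:title value is returned; the regexes are consulted only when the meta tag is absent.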

dstftw · 2017-09-08T14:00:32Z

youtube_dl/extractor/prosiebensat1.py

+        if title is None:
+            self._html_search_regex(
+                self._TITLE_REGEXES, webpage, 'title',
+                default=None) 


title must not be None.

dstftw · 2017-09-08T14:03:16Z

youtube_dl/extractor/prosiebensat1.py

-            self._TITLE_REGEXES, webpage, 'title',
-            default=None) or self._og_search_title(webpage)
+        title = self._og_search_title(webpage)
+        if title is None:


title can't be None here. Don't change the order of extraction.

kayb94 · 2017-09-08

Changing the order of extraction is the whole point of the PR, because title extraction with _TITLE_REGEXES didn't lead to the expected results (see the referenced issue). So I decided to keep the regexes as a fallback, but prefer the _og_search_title result.
Looking at its definition, I do see a return None. Or is that return path unreachable?

kayb94 · 2017-09-12

For your reference (after line 889 in common.py):

    def _og_search_property(self, prop, html, name=None, **kargs):
        if not isinstance(prop, (list, tuple)):
            prop = [prop]
        if name is None:
            name = 'OpenGraph %s' % prop[0]
        og_regexes = []
        for p in prop:
            og_regexes.extend(self._og_regexes(p))
        escaped = self._search_regex(og_regexes, html, name, flags=re.DOTALL, **kargs)
        if escaped is None:
            return None
        return unescapeHTML(escaped)

_og_search_property is called by _og_search_title
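
Whether that `return None` path is reachable depends on how `_search_regex` handles a miss. The sketch below is a simplified stand-in written for this thread, not the code from common.py: `search_regex`, `NO_DEFAULT` and `RegexNotFoundError` are re-implemented here under the assumption that the real helper returns the match, else the caller's `default` if one was passed, else raises when `fatal` is true, else returns None.

```python
import re

# Sentinel meaning "the caller did not pass a default" (assumed to mirror
# the NO_DEFAULT idea in youtube-dl; re-created here for the sketch).
NO_DEFAULT = object()


class RegexNotFoundError(Exception):
    """Stand-in for the error raised when a mandatory regex does not match."""


def search_regex(patterns, string, name, default=NO_DEFAULT, fatal=True, flags=0):
    """Simplified model of the lookup semantics described above."""
    if isinstance(patterns, str):
        patterns = [patterns]
    for pattern in patterns:
        mobj = re.search(pattern, string, flags)
        if mobj:
            return mobj.group(1)
    if default is not NO_DEFAULT:
        # An explicit default (including None) suppresses the error.
        return default
    if fatal:
        raise RegexNotFoundError('Unable to extract %s' % name)
    # Non-fatal miss without a default: the only other way to get None.
    return None
```

Under these assumptions, a plain call with no `default` and no `fatal=False` either returns a string or raises, so a following `if title is None:` check would never be true; None only comes back when the caller opts in with `default=None` or `fatal=False`.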

dstftw · 2017-09-12

Original order was intentional since _og_search_title provides incorrect titles for some extractors that have skipped tests now.

kayb94 · 2017-09-12

Thank you for your fast reply. I checked all tests and found the following:

test_ProSiebenSat1_2 and test_ProSiebenSat1_5 end in ERROR (can't extract thumbnail), so they are not related to this change.

test_ProSiebenSat1_9 FAIL
No title at all (None). I could look into this, but at least it is not a wrong title. Not related either.

test_ProSiebenSat1_3, test_ProSiebenSat1_4, test_ProSiebenSat1_6, test_ProSiebenSat1_11
All FAIL because of wrong titles (I guess these are the ones you are referring to):

- Sexy laufen in Ugg Boots
+ Stars & Style - Sexy laufen in Ugg Boots

- Im Interview: Kai Wiesinger
+ Der Rücktritt - Im Interview: Kai Wiesinger

- Schalke: Tönnies möchte Raul zurück
+ Bundesliga - Schalke: Tönnies möchte Raul zurück

- Jetzt erst enthüllt: Das Geheimnis von Emma Stones Oscar-Robe
+ Oscars ® 2017 - Jetzt erst enthüllt: Das Geheimnis von Emma Stones Oscar-Robe

Since I'm German, I can see that the "wrong" titles are actually better than the expected ones (the webpage is inconsistent, though ^^). So, at least for this extractor, _og_search_title provides very good results. And evidently (#13915), the current way of extracting the title doesn't work better than _og_search_title.

[ProSiebenSat1] Improve title extraction (#13915)

36b93e5

With this commit, og:title titles are preferred over the old extraction. Some tests had to be adjusted, but I have verified the now extracted titles are equally well or better.

dstftw requested changes Sep 8, 2017

dstftw added the pending-fixes label Sep 8, 2017

dstftw requested changes Sep 8, 2017

dstftw force-pushed the master branch from 37318e1 to 65220c3 January 27, 2018 22:49

kayb94 closed this Mar 6, 2018

kayb94 deleted the prosiebensat1 branch March 6, 2018 21:58

kayb94 commented Sep 5, 2017

dstftw Sep 8, 2017

dstftw Sep 8, 2017

kayb94 Sep 8, 2017

kayb94 Sep 12, 2017

dstftw Sep 12, 2017

kayb94 Sep 12, 2017
