Retrieve the URLs of all available thumbnails #2266

Closed
wants to merge 4 commits into from

3 participants

MikeCol Philipp Hagemeister Jaime Marquínez Ferrándiz
MikeCol

The clips on Tumblr pages are usually accompanied by more than one thumbnail. Their URLs are stored in an array in the source code of the page.

This patch decodes the whole array and returns the URLs under the thumbnails key.
It also returns the first URL under the thumbnail key.
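
As a rough illustration of the two keys described above, here is a minimal sketch. The helper name `thumbnail_fields` and the example.com URLs are hypothetical, and the list-of-dicts shape for `thumbnails` is assumed from the last diff in this pull request; it is not the extractor's actual code.

```python
def thumbnail_fields(urls):
    """Build the two thumbnail-related entries of an info dict (hypothetical helper)."""
    return {
        # every poster found on the page, one dict per URL
        'thumbnails': [{'url': u} for u in urls],
        # the first poster, or None when the page lists none
        'thumbnail': urls[0] if urls else None,
    }


# Placeholder URLs, not taken from a real Tumblr page.
fields = thumbnail_fields([
    'http://example.com/frame1.jpg',
    'http://example.com/frame2.jpg',
])
```

With an empty list the helper returns an empty `thumbnails` list and `None` for `thumbnail`, so callers do not need a separate length check.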

youtube_dl/extractor/tumblr.py
@@ -34,11 +34,19 @@ def _real_extract(self, url):
video_url = video.group('video_url')
ext = video.group('ext')
- video_thumbnail = self._search_regex(
- r'posters.*?\[\\x22(.*?)\\x22',
- webpage, 'thumbnail', fatal=False) # We pick the first poster
- if video_thumbnail:
- video_thumbnail = video_thumbnail.replace('\\\\/', '/')
+ # retrieve all available thumbnails
+ thumb_list = []
+ ma = re.search(r'posters.*?\[(?P<thumb>\\x22.*?\\x22)]', webpage)
+ if not ma is None:
+ for t in ma.group('thumb').replace('\\\\/', '/').split(','):
+ t = t.replace('\\x22','"')
+ if (t[0]=='"') and (t[-1]=='"'):
+ thumb_list.append(t[1:-1])
+
+ # take the first, if user only wants one
Philipp Hagemeister Collaborator
phihag added a note

This should not be necessary. Instead, a fitting picture out of thumbnails should be selected automatically. If it's not, we'll fix that.

youtube_dl/extractor/tumblr.py
@@ -34,11 +34,19 @@ def _real_extract(self, url):
video_url = video.group('video_url')
ext = video.group('ext')
- video_thumbnail = self._search_regex(
- r'posters.*?\[\\x22(.*?)\\x22',
- webpage, 'thumbnail', fatal=False) # We pick the first poster
- if video_thumbnail:
- video_thumbnail = video_thumbnail.replace('\\\\/', '/')
+ # retrieve all available thumbnails
+ thumb_list = []
+ ma = re.search(r'posters.*?\[(?P<thumb>\\x22.*?\\x22)]', webpage)
+ if not ma is None:
+ for t in ma.group('thumb').replace('\\\\/', '/').split(','):
+ t = t.replace('\\x22','"')
+ if (t[0]=='"') and (t[-1]=='"'):
Philipp Hagemeister Collaborator
phihag added a note

This parsing looks quite messy. For example, when doesn't this condition hit? In any case, you don't need the parentheses here, but spaces around == are quite idiomatic.

MikeCol
MikeCol added a note

The text to parse looks messy, too:

posters     : [\x22http:\\/\\/media.tumblr.com\\/tumblr_mqf3m3tJrk1rhkyls_frame1.jpg\x22,\x22http:\\/\\/media.tumblr.com\\/tumblr_mqf3m3tJrk1rhkyls_frame2.jpg\ ...

OK, I can get rid of a few backslashes in the replace statements... I'll do that.
And I'll add some spaces :-)

youtube_dl/extractor/tumblr.py
((5 lines not shown))
- r'posters.*?\[\\x22(.*?)\\x22',
- webpage, 'thumbnail', fatal=False) # We pick the first poster
- if video_thumbnail:
- video_thumbnail = video_thumbnail.replace('\\\\/', '/')
+ # retrieve all available thumbnails
+ thumb_list = []
+ ma = re.search(r'posters.*?\[(?P<thumb>\\x22.*?\\x22)]', webpage)
+ if not ma is None:
+ for t in ma.group('thumb').replace('\\\\/', '/').split(','):
+ t = t.replace('\\x22','"')
+ if (t[0]=='"') and (t[-1]=='"'):
+ thumb_list.append(t[1:-1])
+
+ # take the first, if user only wants one
+ single_thumb = None
+ if len(thumb_list)>0:
Philipp Hagemeister Collaborator
phihag added a note

By the way, the pythonic way is to just evaluate thumb_list

MikeCol
MikeCol added a note

Yes, but it's not safe. We don't have any control over what the website is sending and we might end up with something like this (or nastier):

eval("os.system('ls')")
Jaime Marquínez Ferrándiz Collaborator
jaimeMF added a note

He meant to evaluate if thumb_list is true in a boolean check, not to run eval with its content.

if thumb_list:
    ...

Is equivalent to:

if len(thumb_list) > 0:
    ...
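
To make the distinction concrete, the sketch below tests a list whose only element looks like hostile code. The helper name `first_or_none` is hypothetical. The truthiness test only checks whether the list is empty; the string inside it is never executed.

```python
def first_or_none(items):
    # `if items` is true for a non-empty list and false for an empty one;
    # the elements themselves are treated as plain data.
    return items[0] if items else None


suspicious = ["os.system('ls')"]  # just a string, never run
picked = first_or_none(suspicious)
nothing = first_or_none([])
```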
youtube_dl/extractor/tumblr.py
@@ -34,11 +34,14 @@ def _real_extract(self, url):
video_url = video.group('video_url')
ext = video.group('ext')
- video_thumbnail = self._search_regex(
- r'posters.*?\[\\x22(.*?)\\x22',
- webpage, 'thumbnail', fatal=False) # We pick the first poster
- if video_thumbnail:
- video_thumbnail = video_thumbnail.replace('\\\\/', '/')
+ # retrieve all available thumbnails
+ thumb_list = []
+ ma = re.search(r'posters.*?\[(?P<thumb>\\x22.*?\\x22)]', webpage)
+ if not ma is None:
+ for t in ma.group('thumb').replace(r'\\/', '/').split(','):
Philipp Hagemeister Collaborator
phihag added a note

This parses JSON, we should use a proper parser.

MikeCol
MikeCol added a note

Looks like JSON, but it isn't. In JSON the keys have to be in double quotes...
But I think I can get rid of the replace commands by using .decode('string-escape').
I'll do that shortly.
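
One way to reconcile the two views, sketched under assumptions: the surrounding object is not JSON, but the bracketed array on its own becomes valid JSON once the JavaScript string escaping (`\x22` and the doubled backslashes) is undone. The function name `extract_poster_urls`, the regular expression, and the example.com sample are hypothetical and only modelled on the snippet quoted earlier in this thread. Note that the `string-escape` codec exists only in Python 2, so this sketch uses plain `replace` calls instead.

```python
import json
import re


def extract_poster_urls(webpage):
    """Return the poster URLs found in `webpage`, or [] if none can be parsed."""
    match = re.search(r'posters\s*:\s*(\[.*?\])', webpage)
    if match is None:
        return []
    raw = match.group(1)
    # Undo the JavaScript string escaping: \x22 -> " and \\ -> \
    unescaped = raw.replace(r'\x22', '"').replace('\\\\', '\\')
    try:
        # What remains is a JSON array of strings; json handles the \/ escape.
        urls = json.loads(unescaped)
    except ValueError:
        return []
    return [u for u in urls if isinstance(u, str)]


# Hypothetical sample in the same escaped form as the page source.
sample = (r'posters     : [\x22http:\\/\\/example.com\\/frame1.jpg\x22,'
          r'\x22http:\\/\\/example.com\\/frame2.jpg\x22]')
poster_urls = extract_poster_urls(sample)
```

Letting `json.loads` do the unquoting removes the need to split on commas and to check the first and last character of every element by hand.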

Philipp Hagemeister
Collaborator

Sorry, but the multiple thumbnails do not seem to be present anymore, at least not in our tumblr test page.

Philipp Hagemeister phihag closed this
Showing with 8 additions and 6 deletions.
  1. +8 −6 youtube_dl/extractor/tumblr.py
youtube_dl/extractor/tumblr.py
@@ -34,11 +34,13 @@ def _real_extract(self, url):
video_url = video.group('video_url')
ext = video.group('ext')
- video_thumbnail = self._search_regex(
- r'posters.*?\[\\x22(.*?)\\x22',
- webpage, 'thumbnail', fatal=False) # We pick the first poster
- if video_thumbnail:
- video_thumbnail = video_thumbnail.replace('\\\\/', '/')
+ # retrieve all available thumbnails
+ thumb_list = []
+ ma = re.search(r'posters.*?\[(?P<thumb>\\x22.*?\\x22)]', webpage)
+ if not ma is None:
+ for t in ma.group('thumb').decode('string-escape').replace(r'\/',r'/').split(','):
+ if (t[0] == '"') and (t[-1] == '"'):
+ thumb_list.append( {"url": t[1:-1]} )
# The only place where you can get a title, it's not complete,
# but searching in other places doesn't work for all videos
@@ -48,6 +50,6 @@ def _real_extract(self, url):
return [{'id': video_id,
'url': video_url,
'title': video_title,
- 'thumbnail': video_thumbnail,
+ 'thumbnails': thumb_list,
'ext': ext
}]