Retrieve the URLs of all availale thumbnails #2266

Closed
wants to merge 4 commits into
from

Conversation

Projects
None yet
3 participants
Contributor

MikeCol commented Jan 28, 2014

The clips on Tumblr pages are usually accompanied by more than one thumbnail. Their URL is stored in an array in the source code of the page.

This patch decodes this array in its entirety and returns their URL under the thumbnails key.
It also returns the first URL under the thumbnail key.

youtube_dl/extractor/tumblr.py
+ if (t[0]=='"') and (t[-1]=='"'):
+ thumb_list.append(t[1:-1])
+
+ # take the first, if user only wants one
@phihag

phihag Jan 28, 2014

Collaborator

This should not be necessary. Instead, a fitting picture out of thumbnails should automatically selected. If it's not, we'll fix that.

youtube_dl/extractor/tumblr.py
+ if not ma is None:
+ for t in ma.group('thumb').replace('\\\\/', '/').split(','):
+ t = t.replace('\\x22','"')
+ if (t[0]=='"') and (t[-1]=='"'):
@phihag

phihag Jan 28, 2014

Collaborator

This parsing looks quite messy. For example, when doesn't this condition hit? In any case, you don't need the parentheses here, but spaces around == are quite idiomatic.

@MikeCol

MikeCol Jan 28, 2014

Contributor

The text to parse looks messy, too:

posters     : [\x22http:\\/\\/media.tumblr.com\\/tumblr_mqf3m3tJrk1rhkyls_frame1.jpg\x22,\x22http:\\/\\/media.tumblr.com\\/tumblr_mqf3m3tJrk1rhkyls_frame2.jpg\ ...

Ok, I can get rid off a few backslashes in the replace statements... I'll do that.
And I add some space :-)

youtube_dl/extractor/tumblr.py
+
+ # take the first, if user only wants one
+ single_thumb = None
+ if len(thumb_list)>0:
@phihag

phihag Jan 28, 2014

Collaborator

By the way, the pythonic way is to just evaluate thumb_list

@MikeCol

MikeCol Jan 28, 2014

Contributor

Yes, but it's not save. We don't have any control over what the website is sending and we might end up with something like this (or nastier):

eval("os.system('ls')")
@jaimeMF

jaimeMF Jan 28, 2014

Collaborator

He meant to evaluate if thumb_list is true in a boolean check, not to run eval with its content.

if thumb_list:
    ...

Is equivalent to:

if len(thumb_list) > 0:
    ...
youtube_dl/extractor/tumblr.py
+ thumb_list = []
+ ma = re.search(r'posters.*?\[(?P<thumb>\\x22.*?\\x22)]', webpage)
+ if not ma is None:
+ for t in ma.group('thumb').replace(r'\\/', '/').split(','):
@phihag

phihag Jan 29, 2014

Collaborator

This parses JSON, we should use a proper parser.

@MikeCol

MikeCol Jan 29, 2014

Contributor

Looks like JSON, but it isn't. In JSON the keys have to be in double quotes...
But I think, I can get rid of the replace commands using .decode('string-escape').
I'll do that shortly

Collaborator

phihag commented Jan 25, 2015

Sorry, but the multiple thumbnails do not seem to be present anymore, at least not in our tumblr test page.

@phihag phihag closed this Jan 25, 2015

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment