Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generic thumbnailer #111

Closed
nodiscc opened this issue Feb 14, 2015 · 29 comments

Comments

@nodiscc
Copy link
Member

commented Feb 14, 2015

Currently we only support thumbnails for a bunch of sites. Thumbnailers for other sites have been requested at sebsauvage#91, sebsauvage#127, sebsauvage#169, #14. As we cannot possibly add regexes for every site out there, I think a generic thumbnailer is the way to go. IMHO it should

  • Run server-side to allow caching and not burden client bandwidth and CPU
  • Work on every page that shows images via HTML/CSS.
  • Have little impact on performance (little CPU/bandwidth overhead for the server, don't slow down page load)
  • Still allow for regexes for specific sites to speed up the thumbnailing process
  • Be portable (no exotic dependencies like WSGI scripts, node...)
  • Be optional (can be disabled through a checkbox)

@pikzen has a nice prototype of such a thumbnailer in php at https://github.com/pikzen/Shaarli/blob/thumbnailer/ThumbnailExtractor.php (184 lines). You can try it by cloning it and pointing your browser to http://path/to/your/ThumbnailExtractor.php?surface=no&url=http://example.url/with/images.

It works by downloading the page HTML, parsing it to find <img> tags, and assigning a score to each one depending on regexes on it's URL. We are free to use the highest-scored image as a thumbnail. It can then be cached by Shaarli. The ?surface={yes,no} param sets whether the server should download images from the page and calculate their surface, which affects their score. It has a performance impact.

I'd like to see this as a plugin, but there is no way to store the thumbnail URL in the link array from a plugin, so we would have to call tpl/plugins/thumbnailer/thumbnailer.php?url=http://blahblah 50 times per pageload, and it would not be cacheable.

Alternatively we could have an option for this thumbnailer to only get favicons and not try to get thumbnails from the whole page. It has been requested too.

Please comment on how/if you'd like to see this implemented. The code is about the same complexity/size as the previous thumbnailer code, and works with all sites that contain images.

@nodiscc

This comment has been minimized.

Copy link
Member Author

commented Feb 14, 2015

My bad, sebsauvage#91 is about reducing size of thumbnails and should be addressed separately.

@pikzen

This comment has been minimized.

Copy link

commented Feb 14, 2015

A few things about this thumbnailer:

  • Can't be done clientside (in Javascript) because of CORS rules. A project exists, ImageResolver which does it in JS but it basically sends a request to self, which sends a request to a specialized server to handle thumbnailing, which returns the data. It's a bunch of complications, and in the end it still runs on the server.
  • Doing this with surface checking is AWFULLY slow. I'm talking 30s slow on certain websites.
  • Without surface checking, it is still pretty damn slow at 2.5s per request. Or 0.00045 if using specialized codepaths like YouTube's (no need to do a query, just generate the URL). While I'm not a big fan of this, adding specialized codepaths for common websites would make it faster in most cases. I'd hardly call 2.5s to add a link "acceptable".
  • ThumbnailExtractor sends back an array of images on the site, with a score from 0 to 1, 1 being what seems the most important image. One way of speeding this up would be to stop querying once we've reached a score lower than 0.5, and adding maluses the further down the page the image is.

Basically, I'm focused on making it fast right now, but I fear that as soon as we have to make requests, it's going to be 1s+.

You can also test it at http://ec2-54-93-76-97.eu-central-1.compute.amazonaws.com/Shaarli/ThumbnailExtractor.php?url=http://reddit.com/

Note that it totally breaks down when http:// is not present, when mimetypes of the response are not text/html and that warnings may be shown.

@ArthurHoaro

This comment has been minimized.

Copy link
Member

commented Feb 20, 2015

There is a warning if the url doesn't contain any image.

About performance issue, I'm not sure that's a huge problem since the requests will be done once, when the link is added. After that, we just need cache.

Another solution would be to make the requests asynchronous. It can't be done in JS, but we can call the thumbnailer through AJAX.

@pikzen

This comment has been minimized.

Copy link

commented Feb 20, 2015

Yeah, the parameters parts is just here for demo purposes and does no checks or anything.

I'd love to push out an improved version (which doubles performance, down to 1.5s per request WITH surface checking.) but it's currently sitting on a computer with no SSH or physical access :(

@ArthurHoaro ArthurHoaro referenced this issue Feb 20, 2015
@nodiscc

This comment has been minimized.

Copy link
Member Author

commented Feb 20, 2015

@pikzen I'd remove the youtube-specific code and use the <meta property="og:image" content="https://i.ytimg.com/vi/cqCoUvz_nwI/maxresdefault.jpg"> to get thumbnails for youtube. #117 also proposed using og:image. This metatag is described on the Opengraph website. Youtube and other sites make use of it (eg. http://www.heise.de/newsticker/meldung/SIM-Karten-Hack-Die-Kompromittierung-der-Mobilfunknetze-durch-NSA-GCHQ-2555714.html) This may also be faster than parsing the whole page; if the metatag points to a valid image, use this and exit.

@mro

This comment has been minimized.

Copy link

commented Feb 21, 2015

You also may want to have a look at

@nodiscc

This comment has been minimized.

Copy link
Member Author

commented Feb 21, 2015

@mro nice to have you here.

generic vs. site specific and eager GET vs. lazy

I'd rather have it use eager GET requests instead of adding long site-specific rules (as you said in #117 (comment), most of your code is trying to resolve image URLs, which will become a mess if we add more and more sites - removing all site-specific code, including current one would be good). Yes og:image is a gift. We may need to set a timeout for the GET requests to complete.

thumbnail caching

and cache expiration look very good (expiration time may be discussed)

cleaning up broken images

I'd enjoy this. unlink($filepath); makes me cautious though, and we'd better try all possible cases to make sure it doesn't unlink(something_unexpected).

make various relative URLs absolute

Yes I've hit bugs related to relative URLs with @pikzen's thumbnailer in a few cases. This should be sorted out.

implement following http redirects

In what cases is this useful? Can you provide an URL? Presumably if you bookmark a page your shaared link should point to the redirection endpoint. I'm not sure we should follow redirects server-side. Rogue redirects/hijacking? Cross-domain redirects?

Thanks again for your input

@mro

This comment has been minimized.

Copy link

commented Feb 21, 2015

@mro nice to have you here.

that's very kind, thank you, it's very nice to be welcome.

implement following http redirects

In what cases is this useful? Can you provide an URL?

When bookmarking URLs that promise long-term availability. They're often logical URLs redirecting to a 'real' URL. See http://www.w3.org/Provider/Style/URI for the idea of permanent URLs.

example: http://purl.org/dc/terms/ (purl is for permanent url)

As to eager GET to look for og:image: Maybe the 'add link' form might get 1 more field - a (optional) thumbnail image url.
If that gives a 200 OK, then there's no GET to the URL to bookmark needed - only GET the image for resize+caching. Otherwise it may be fine to just probe for a few like <meta, <link rel or just any any (1st) image URL inside the html.

@pikzen

This comment has been minimized.

Copy link

commented Feb 23, 2015

just any any (1st) image URL inside the html.

This usually gives awful results. On most of the websites I tried, it returned icons, or in the best case the website's logo. Even google.com's first image is the 'Download Chrome' logo.

I 100% agree on one point: parsing every single image should be a last resort because it's awfully long. A way to make this invisible for the end user would be to run in on another thread (lol php threads), but in most cases og:image s provide all we need.
I'll be pushing the updated code in a few hours once I get access to it.

file_get_contents does follow the redirect on http://purl.org/dc/terms/

@e2jk

This comment has been minimized.

Copy link

commented Feb 23, 2015

@pikzen

This comment has been minimized.

Copy link

commented Feb 23, 2015

I'm actually looking at releasing my thumbnailer as a small library once I've ironed out a few edge cases. So Shaarli can make use of a thumbnailing system for its links system instead of having it be an integral part of Shaarli :D

(another reason not to parse the entire HTML for images).

After profiling I saw that the actual request (as in the part in file_get_content that actually downloads the page) is 99.02% of all the thumbnailing time, while parsing it is barely anything. And considering that we need to download them anyways to parse the head, it might as well do a small lookup on the images. Surface checking is disabled by default to prevent huge load times.

Major websites like Youtube ,Vimeo and others do provide og:image just because they are often shared on Facebook and others. The websites that don't are the small websites, blogs and others which already don't have a special case (and thus don't have any image displayed on Shaarli).

@mro

This comment has been minimized.

Copy link

commented Feb 23, 2015

RE: #111 (comment)

implement following http redirects

Can you provide an URL?

example: http://purl.org/dc/terms/ (purl is for permanent url)

file_get_contents does follow the redirect on http://purl.org/dc/terms/

Ouch, indeed. But it fails for http://heise.de/-2557279 or http://spiegel.de/article.do?id=1019990.

@pikzen

This comment has been minimized.

Copy link

commented Feb 24, 2015

Are you sure you don't have an issue with your PHP config (or mine has a generous config) ? I can file_get_contents both of those URLs and they return the entire site without any issue.

<?php

echo "<code>";
echo file_get_contents("http://heise.de/-2557279");
echo "<hr><hr><hr><hr><hr><hr>";
echo file_get_contents("http://spiegel.de/article.do?id=1019990");
echo "</code>";
?>

screenshot from 2015-02-24 11 11 19

@nodiscc

This comment has been minimized.

Copy link
Member Author

commented Feb 24, 2015

Same results as @pikzen on Debian stable, apache 2.2.22-13+deb7u4, libapache2-mod-php5 5.4.36-0+deb7u3, mostly vanilla config. file_get_contents seems to follow 302 and 301 redirects for me.

@mro

This comment has been minimized.

Copy link

commented Feb 24, 2015

You both (#111 (comment), #111 (comment)) are absolutely right - what fooled me, was that the $httpstatus is 301 Moved Permanently. So no match here: https://github.com/mro/Shaarli/blob/92cddfd40b72ea97aa042dea487d6e2162fc4b4a/index.php#L2583.

Thank you for your patience!

P.S.: testing if makes no sense anyway - if we have valid image data we won't care about the http status, will we?

@nodiscc

This comment has been minimized.

Copy link
Member Author

commented Feb 26, 2015

@mro I see you're still working on your thumbnailing system. To prevent duplicate work, can you either:

  1. Base your work on @pikzen's thumbnailer branch (git remote add pikzen https://github.com/pikzen/Shaarli; git remote update; git checkout thumbnailer)
  2. Detail and post updates about your thumbnailer implementation here

Regarding client-side caching: what do you mean? Of course your browser caches thumbnails as any other resource. The caching should be done server-side to prevent redoing thumbnailing operations again and again.

Regarding resizing: I see we're not recommending php-gd in the README yet. Without GD resizing won't work.

  • TODO: add php-gd as an optional dependency in the README.
@mro mro referenced this issue Feb 26, 2015
@mro

This comment has been minimized.

Copy link

commented Feb 26, 2015

@ #111 (comment)

To prevent duplicate work

I don't mind (a bit of) duplicate work. Increases choice.

Regarding client-side caching: what do you mean? Of course your browser caches thumbnails as any other resource.

No AFAIK. It's actively disabled:

// Prevent caching on client side or proxy: (yes, it's ugly)
vs. https://github.com/mro/Shaarli/commit/b30a147d3a4ac80571f761b00adb5f963044bb9f

@mro

This comment has been minimized.

Copy link

commented Feb 26, 2015

IMO the core (yet unanswered) question is: what is the default for links - thumbnail or not?

Currently it's no thumb - only after a positive string match the whole process starts. This saves space and clutter (few thumbnails = few server-side cached files) but requires domain/url whitelists or patterns. Or an additional field (thumb source url or at least a flag) in the data store.

Secondary question is whether to create & cache a thumbnail image or steal bandwidth. Again - either whitelists or persistent flag per link.

I hesitated to anticipate any decision and focused on code reduction (~ LOC-=80) & clarifying of the existing code base without introducing any new constructs or fiddling with persistence for now.

What do you guys think?

@nodiscc

This comment has been minimized.

Copy link
Member Author

commented Feb 26, 2015

@mro Your pull request at #128 looks good.

I don't mind (a bit of) duplicate work. Increases choice.

Yes fine! As long as you keep a pull request up to keep track of your advancement.

  • nice $filepath and $blankpath clarification/cleanup.
  • function saveThumbnailToCacheAndRespond() very much needed.
  • generic thumbnailing function using meta looks good (can you detail/explain what the regexes are supposed to return?)

Question: Can we get rid of site-specific thumbnailers?

  • Eg. the youtube thumbnailer can be removed, as your generic thumbnailing code will properly pick up the og:image property on youtube.
  • Can we get rid of all other site-specific code? (youtube.com, youtu.be, pix.toile-libre.org, imgur.com, i.imgur.com, dailymotion.com, .imageshack.us, flickr.com, vimeo.com, ted.com, xkcd.com) And use parsing of <img src= tags to assign points to each image, like https://github.com/pikzen/Shaarli/blob/thumbnailer/ThumbnailExtractor.php does? (eg a .*\/comics\/.*\.png match would get +2 points) Note that this would only apply to sites where there are no Opengraph meta tags.
@mro

This comment has been minimized.

Copy link

commented Feb 26, 2015

can you detail/explain what the regexes are supposed to return?

they return the thumbnail image url candidate. It's 2 matching groups because the attributes can be written <meta property=... content=.../> or <meta content=... property=.../> (<link likewise) and exactly one matching group is non-null in case of a match.

A proper xml (sax/dom) parser would be cleaner, but also increase conceptual complexity. A few plain regexp matches look acceptable to me. Doing more than simple probing for the usual suspects is too much magic for my taste.
As somebody mentioned earlier, shaarli isn't about thumbnails in the first place. Rather about links and simplicity. So having simpler code and the same thumbnails as yesterday is IMO ok, more thumbnails are bonus.
Frankly I'm not a fan of https://github.com/pikzen/Shaarli/blob/thumbnailer/ThumbnailExtractor.php#L173

More important is IMO to explicitly answer #111 (comment).

@nodiscc

This comment has been minimized.

Copy link
Member Author

commented Feb 27, 2015

More important is IMO to explicitly answer #111 (comment).

Default should be enabled, with a toggle-off option. I was initially in favor of a plugin for this (see #129 for a basic plugin system) but in the current form, the plugin system does not allow us to cache files (does it?).

more thumbnails are bonus

Yes your code improves the current thumbnailer, even though I'd still like to remove the ~200 lines of site-specific code, it's better than what we have. We should consider merging (after testing). Thanks for your work.

@nodiscc

This comment has been minimized.

Copy link
Member Author

commented Mar 7, 2015

New thumbnailer Pull Request in #153, based on previous #128 which broke because of a missing character. It's now unbroken, but thumbnailing code itself doesn't work. Help welcome

@nodiscc nodiscc added the help wanted label Mar 7, 2015

@nodiscc nodiscc removed the discussion label Mar 7, 2015

@mro

This comment has been minimized.

Copy link

commented Mar 9, 2015

the thumbnailer does work - you got bitten by #111 (comment). Default as of today: no thumbnail! That's what you see.

Have a look at https://github.com/mro/Shaarli/commit/3d338d499eb36780b9795f6360d00c1acd2d0bab#commitcomment-10094552

@nodiscc nodiscc self-assigned this Mar 9, 2015

@nodiscc

This comment has been minimized.

Copy link
Member Author

commented Mar 9, 2015

Pull Request at #153 is working. Please help rewiewing/testing/adding comments.

@nodiscc nodiscc modified the milestone: 1.1 Mar 13, 2015

@nodiscc

This comment has been minimized.

Copy link
Member Author

commented Mar 18, 2015

Should we move this to the 0.9beta milestone? The generic thumbnailer is working, with a few exceptions (see #153). We could start by whitelisting only a bunch of known working sites and merging this for 0.9beta; then fixing the regex and allowing users to select Generate thumbnails for all domains in the prefs for 1.1.

Ok?

@nodiscc nodiscc added the need info label Mar 18, 2015

@mro

This comment has been minimized.

Copy link

commented Mar 18, 2015

a few exceptions (see #153)

what's the remaining issues?

@nodiscc

This comment has been minimized.

Copy link
Member Author

commented Mar 18, 2015

preg_replace() failing loudly for some links (#153 (comment)), alt text still displayed (https://imgur.com/YtiJhbW)

@mro

This comment has been minimized.

Copy link

commented Mar 19, 2015

you applied #153 (comment) and still get the same errors?

@nodiscc

This comment has been minimized.

Copy link
Member Author

commented Mar 23, 2015

o sorry, I didn't read your message properly. I'll do what you suggest and try again.

Closing this and continuing discussion in #153

@nodiscc nodiscc closed this Mar 23, 2015

ArthurHoaro added a commit to ArthurHoaro/Shaarli that referenced this issue Jun 25, 2015
shaarli#111 thumbnails: gently trade some site specific php code for …
…site specific regexps/configuration. A bit DRYer and cleaner.

TODO: test flickr image links!

see also shaarli#128

@virtualtam virtualtam modified the milestones: 1.1, 0.6.0 Jul 30, 2015

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants
You can’t perform that action at this time.