Blogger Articles show up as untitled #803

anno1337 · 2014-08-05T06:57:14Z

I tried saving the following article to my wallabag setup: http://www.gavinj.net/2012/06/building-python-daemon-process.html

This uses one of those rather silly JavaScript templates, which seem to cause problems with wallabag. No content is saved.

tcitworld · 2014-08-06T09:27:26Z

Yup, I see no easy way to get the content since it isn't in the page source. That's really, really bad from a web standards point of view.

tcitworld · 2014-08-06T12:21:26Z

@fivefilters has been informed.
This might help. https://developers.google.com/webmasters/ajax-crawling/docs/specification#MetaTag

fivefilters · 2014-08-06T12:36:18Z

Yes, I don't understand how this even caught on - it looks like a horrible way to present simple content. Nonetheless, Google has a page for developers who insist on presenting content in this way to help them offer the same content in plain HTML for the benefit of crawlers and other such systems.

Blogspot - which powers this particular site - follows the spec. It contains a meta element <meta content='!' name='fragment'/> which signals to compatible crawlers that the plain HTML content can be found by appending ?_escaped_fragment_= to the URL: http://www.gavinj.net/2012/06/building-python-daemon-process.html?_escaped_fragment_=
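The mapping described above can be sketched in a few lines of Python. This is only an illustration of the AJAX crawling convention, not the code used by Full-Text RSS or wallabag; the function names `has_fragment_meta` and `escaped_fragment_url` are hypothetical, and the regex is a simplification that a real HTML parser would handle more robustly.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Matches <meta name="fragment" content="!"> with the attributes in either
# order and with either quote style (simplified; not a full HTML parser).
META_FRAGMENT_RE = re.compile(
    r"""<meta\b(?=[^>]*\bname\s*=\s*["']fragment["'])(?=[^>]*\bcontent\s*=\s*["']!["'])[^>]*>""",
    re.IGNORECASE,
)


def has_fragment_meta(html: str) -> bool:
    """Return True if the page opts in to the plain HTML snapshot scheme."""
    return META_FRAGMENT_RE.search(html) is not None


def escaped_fragment_url(url: str) -> str:
    """Append _escaped_fragment_= to the query string and drop any #fragment."""
    parts = urlsplit(url)
    if parts.query:
        query = parts.query + "&_escaped_fragment_="
    else:
        query = "_escaped_fragment_="
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


url = "http://www.gavinj.net/2012/06/building-python-daemon-process.html"
if has_fragment_meta("<meta content='!' name='fragment'/>"):
    print(escaped_fragment_url(url))
```

A fetcher would call `has_fragment_meta` on the first response and, if it returns True, request the rewritten URL instead.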

Full-Text RSS also understands this, but we only look in the first 4000 characters of the HTML to find the meta tag which signals that we should fetch the plain HTML URL. In this case, due to a large embedded image, that tag appears after 4000 characters, and so is missed by Full-Text RSS.
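To make the failure mode concrete, here is a small Python sketch of a bounded search. It does not reproduce the HumbleHttpAgent.php code; `marker_in_window` and the synthetic page are assumptions made for illustration, with a 5000-character inline image standing in for the large embedded image.

```python
from typing import Optional

MARKER = "<meta content='!' name='fragment'/>"


def marker_in_window(html: str, limit: Optional[int] = 4000) -> bool:
    """Look for the marker in the first `limit` characters, or everywhere if limit is None."""
    window = html if limit is None else html[:limit]
    return MARKER in window


# Synthetic page: a large inline image pushes the marker past 4000 characters.
big_image = "<meta property='og:image' content='data:image/png;base64," + "A" * 5000 + "'/>"
page = "<html><head>" + big_image + MARKER + "</head><body></body></html>"

print(marker_in_window(page))              # bounded search misses the tag
print(marker_in_window(page, limit=None))  # searching the whole document finds it
```

Raising the limit or passing `None` trades a little extra scanning for not missing tags that sit late in the head.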

We'll try to fix this in a future update. To fix it manually, you can try editing HumbleHttpAgent.php - https://github.com/wallabag/wallabag/blob/master/inc/3rdparty/libraries/humble-http-agent/HumbleHttpAgent.php - and replacing the number 4000 (3 occurrences) with a bigger number, or removing the parameter completely so Full-Text RSS searches for the tag in the entire HTML.

j0k3r · 2016-04-10T19:00:58Z

Thanks for reporting that Blogger website.
Of course, like FTRSS, graby appends _escaped_fragment_ to the URL to get the HTML.

The problem here is that this website uses a data-URI image as the Open Graph image. So the body is huge, and graby (it should be the same for FTRSS) doesn't scan enough of the HTML to detect that fragment.
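One possible way to cope with this, sketched in Python, is to blank out inline data: URIs before running the bounded check. This is an assumption about how such a workaround could look, not a description of what graby actually does; `strip_data_uris` is a hypothetical helper and the page is synthetic.

```python
import re

MARKER = "<meta content='!' name='fragment'/>"

# Matches content="data:..." or src="data:..." with either quote style.
DATA_URI_ATTR_RE = re.compile(r"""\b(content|src)\s*=\s*(["'])data:[^"']*\2""", re.IGNORECASE)


def strip_data_uris(html: str) -> str:
    """Replace inline data: URI attribute values with an empty string."""
    return DATA_URI_ATTR_RE.sub(r"\1=\2\2", html)


# Synthetic head with a 5000-character data-URI Open Graph image before the marker.
page = (
    "<head><meta property='og:image' content='data:image/png;base64,"
    + "A" * 5000
    + "'/>"
    + MARKER
    + "</head>"
)

stripped = strip_data_uris(page)
print(MARKER in page[:4000])      # marker is out of reach in the raw HTML
print(MARKER in stripped[:4000])  # marker is within reach once the image is blanked
```

With the image payload removed, the fragment meta tag falls back inside a small search window without having to scan the whole body.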

It'll be fixed in 2.0.2.

tcitworld added the Bug label Aug 6, 2014

nicosomb added this to the 2.1 milestone Jul 30, 2015

nicosomb added Site Config and removed Bug labels Feb 19, 2016

j0k3r mentioned this issue Apr 10, 2016

Avoid data:image in open graph data j0k3r/graby#47

Merged

j0k3r closed this as completed Apr 10, 2016

j0k3r modified the milestones: 2.0.2, 2.1.0 Apr 10, 2016
