Blogger Articles show up as untitled #803

Closed
anno1337 opened this issue Aug 5, 2014 · 4 comments · Fixed by j0k3r/graby#47
@anno1337
Contributor

anno1337 commented Aug 5, 2014

I tried saving the following article to my wallabag setup: http://www.gavinj.net/2012/06/building-python-daemon-process.html

This uses one of those rather silly JavaScript templates, which seems to cause problems with wallabag. No content is saved.

@tcitworld
Member

Yup, I see no easy way to get the content, since it isn't in the page source. That's really, really bad in terms of web standards.

@tcitworld tcitworld added the Bug label Aug 6, 2014
@tcitworld
Member

@fivefilters

Yes, I don't understand how this even caught on - it looks like a horrible way to present simple content. Nonetheless, Google has a page for developers who insist on presenting content in this way to help them offer the same content in plain HTML for the benefit of crawlers and other such systems.

Blogspot - which powers this particular site - follows the spec. It contains a meta element <meta content='!' name='fragment'/> which signals to compatible crawlers that the plain HTML content can be found by appending ?_escaped_fragment_= to the URL: http://www.gavinj.net/2012/06/building-python-daemon-process.html?_escaped_fragment_=
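As an illustration of the URL rewrite described above, here is a minimal Python sketch. It is not code from Full-Text RSS or wallabag; the function name `escaped_fragment_url` is hypothetical, and the sketch only covers pages that opt in through the meta tag (it does not handle `#!` hashbang URLs).

```python
from urllib.parse import urlsplit, urlunsplit


def escaped_fragment_url(url: str) -> str:
    """Return the plain-HTML variant of `url` (hypothetical helper).

    Appends an empty `_escaped_fragment_=` query parameter, keeping any
    query string that is already present. Any URL fragment is dropped.
    """
    parts = urlsplit(url)
    if parts.query:
        query = parts.query + "&_escaped_fragment_="
    else:
        query = "_escaped_fragment_="
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


print(escaped_fragment_url(
    "http://www.gavinj.net/2012/06/building-python-daemon-process.html"
))
```

For the article in this issue, this prints the same URL as quoted above, ending in `?_escaped_fragment_=`.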

Full-Text RSS also understands this, but we only look in the first 4000 characters of the HTML to find the meta tag which signals that we should fetch the plain HTML URL. In this case, due to a large embedded image, that tag appears after 4000 characters, and so is missed by Full-Text RSS.

We'll try to fix this in a future update. To fix it manually, you can try editing HumbleHttpAgent.php - https://github.com/wallabag/wallabag/blob/master/inc/3rdparty/libraries/humble-http-agent/HumbleHttpAgent.php - and replacing the number 4000 (3 occurrences) with a bigger number, or removing the parameter completely so that Full-Text RSS searches the entire HTML for the tag.
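To make the failure mode concrete, here is a rough Python sketch of a length-limited search for the meta tag. It is an assumption-laden illustration, not a port of HumbleHttpAgent.php: the function name `has_fragment_meta`, the `limit` parameter, and the regular expression are all made up for this example.

```python
import re
from typing import Optional

# Matches a <meta> tag carrying both name="fragment" and content="!",
# in either attribute order (illustrative pattern, not the one FTRSS uses).
META_FRAGMENT = re.compile(
    r"""<meta\b(?=[^>]*\bname\s*=\s*["']fragment["'])"""
    r"""(?=[^>]*\bcontent\s*=\s*["']!["'])[^>]*>""",
    re.IGNORECASE,
)


def has_fragment_meta(html: str, limit: Optional[int] = 4000) -> bool:
    """Look for the fragment meta tag in the first `limit` characters.

    Passing limit=None searches the whole document.
    """
    head = html if limit is None else html[:limit]
    return META_FRAGMENT.search(head) is not None


# A page whose large inline image pushes the tag past 4000 characters.
tag = "<meta content='!' name='fragment'/>"
big_image = "<meta property='og:image' content='data:image/png;base64," + "A" * 5000 + "'/>"
page = "<html><head>" + big_image + tag + "</head></html>"

print(has_fragment_meta(page))              # limited search misses the tag
print(has_fragment_meta(page, limit=None))  # full search finds it
```

Raising the limit or passing `None` corresponds to the two manual fixes suggested above; searching the whole document costs more on large pages, which is presumably why a cap exists in the first place.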

@nicosomb nicosomb added this to the 2.1 milestone Jul 30, 2015
@nicosomb nicosomb added Site Config and removed Bug labels Feb 19, 2016
@j0k3r
Member

j0k3r commented Apr 10, 2016

Thanks for reporting that blogger website.
Of course, like FTRSS, graby appends _escaped_fragment_ to the URL to get the HTML.

The problem here is that this website uses a data-URI image as its Open Graph image. So the body is huge, and graby (it should be the same for FTRSS) doesn't scan enough of the HTML to detect the fragment meta tag.

It'll be fixed in 2.0.2.

@j0k3r j0k3r closed this as completed Apr 10, 2016
@j0k3r j0k3r modified the milestones: 2.0.2, 2.1.0 Apr 10, 2016