Incorrect mimetype when adding a link to a pdf file #4142

teroshan · 2019-10-06T21:17:32Z

Issue details

I'm trying to use the mimetype of an entry to know when to skip the conversion and go straight to a direct download. More specifically, I'm looking for "application/pdf" in order to download the PDF file directly.

When adding links to some pdf files, I get an error in the web interface (fetching content failed) and the mimetype is not set to "application/pdf" as expected. Looking at the logs and doing a manual curl -I on the link both show the correct mimetype. The logs also suggest that pdf parsing was attempted, so the mimetype was correctly detected at the beginning.

When testing on https://f43.me/feed/test, I also get an error at the parsing of the url. However, I expect the mimetype of the entry to still reflect that of the content of the url, regardless of an issue when parsing the link.

Should the mimetype only be updated when the content is correctly parsed? Should I do my own fetching on the original url to get its mimetype? I want to avoid doing that on my client if possible.
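For reference, the client-side fallback asked about above could look roughly like the following Python sketch. It is only an illustration: `media_type` and `head_media_type` are hypothetical helper names, and it assumes the server answers a HEAD request with the same Content-Type it would send for a GET, which not every server does.

```python
import urllib.request
from typing import Optional


def media_type(content_type: Optional[str]) -> Optional[str]:
    """Return the bare media type of a Content-Type header value.

    Parameters such as "; charset=utf-8" are dropped and the result is
    lowercased. Returns None when the header is missing or empty.
    """
    if not content_type:
        return None
    return content_type.split(";", 1)[0].strip().lower() or None


def head_media_type(url: str, timeout: float = 10.0) -> Optional[str]:
    """Send a HEAD request (roughly what `curl -I` does) and read Content-Type."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return media_type(response.headers.get("Content-Type"))
```

A client would then compare `head_media_type(entry_url)` against "application/pdf" before deciding to download the file directly. This costs one extra request per entry, which is exactly what I would like to avoid.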

Thanks for your help! :)

Environment

The issue is reproducible on the demo wallabag instance, as of now version is 2.3.8.

Steps to reproduce/test case

1. Add a custom tagging rule mimetype = "application/pdf" to tag entries as pdf, for example, just to make it easier to spot discrepancies.
2. Add this link as an entry.
3. The entry is not tagged as it should be.

j0k3r · 2019-10-07T07:11:12Z

The issue is reproducible on the demo wallabag instance, as of now version is 2.3.8.

First of all, the demo wallabag instance should NOT be used at all. We didn't update it and it's out of date.

Should the mimetype only be updated when the content is correctly parsed?

Yes, because when the fetch fails we save the failure message, which is obviously HTML and not PDF, JPG or ZIP.

The logs also suggest that pdf parsing was attempted, so the mimetype was correctly detected at the beginning.

We detect the mimetype early to decide how we are going to fetch the content. For a PDF, we just download it and then try to extract content from it. If we can't, we save nothing except the error message.
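To make that flow concrete, here is a rough Python model of it. This is not wallabag's actual code (wallabag is written in PHP): `Entry`, `save_entry` and `extract_text` are hypothetical names, and the assumption that the mimetype stays unset on failure is taken from the behaviour reported in this thread.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Entry:
    url: str
    content: str = ""
    mimetype: Optional[str] = None


def save_entry(
    url: str,
    detected_mimetype: str,
    raw: bytes,
    extract_text: Callable[[bytes], str],
) -> Entry:
    """Hypothetical model: keep the detected mimetype only if extraction works."""
    entry = Entry(url=url)
    try:
        # The early detection only decides which extractor is tried.
        entry.content = extract_text(raw)
        entry.mimetype = detected_mimetype
    except Exception as exc:
        # On failure only the error message is stored, so the entry no
        # longer describes a PDF and the mimetype is left unset.
        entry.content = f"Fetching content failed: {exc}"
    return entry
```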

Add this link as entry.

The PDF can't be parsed because the server replies with "Too Many Requests" and serves us an HTML page describing the failure:

teroshan · 2019-10-07T07:26:36Z

I see, so running the scenario on a public instance may not be the best idea. Still, I get the issue on my self-hosted instance, and the result is the same.

I'm able to wget the file from the same server just fine, so I'm not sure that it's a "too many requests" issue. But what you're telling me is that if there is any error during the parsing at any point, the mimetype will no longer reflect that of the content?

For example, in my case it looks to me like the pdf was downloaded correctly, but its parsing failed and raised an exception:

[2019-10-06 21:21:13] security.DEBUG: Read existing security token from the session. {"key":"_security_secured_area"} []
[2019-10-06 21:21:13] security.DEBUG: User was reloaded from a user provider. {"username":"wallabag","provider":"Symfony\\Bridge\\Doctrine\\Security\\User\\EntityUserProvider"} []
[2019-10-06 21:21:13] app.DEBUG: Restricted access config enabled? {"enabled":1} []
[2019-10-06 21:21:13] graby.INFO: Graby is ready to fetch [] []
[2019-10-06 21:21:13] graby.INFO: . looking for site config for researchgate.net in primary folder {"host":"researchgate.net"} []
[2019-10-06 21:21:13] graby.INFO: Appending site config settings from global.txt [] []
[2019-10-06 21:21:13] graby.INFO: . looking for site config for global in primary folder {"host":"global"} []
[2019-10-06 21:21:13] graby.INFO: ... found site config global.txt {"host":"global.txt"} []
[2019-10-06 21:21:13] graby.INFO: Cached site config with key: researchgate.net {"key":"researchgate.net"} []
[2019-10-06 21:21:13] graby.INFO: . looking for site config for global in primary folder {"host":"global"} []
[2019-10-06 21:21:13] graby.INFO: ... found site config global.txt {"host":"global.txt"} []
[2019-10-06 21:21:13] graby.INFO: Appending site config settings from global.txt [] []
[2019-10-06 21:21:13] graby.INFO: Cached site config with key: global {"key":"global"} []
[2019-10-06 21:21:13] graby.INFO: Cached site config with key: researchgate.net.merged {"key":"researchgate.net.merged"} []
[2019-10-06 21:21:13] graby.INFO: Fetching url: https://www.researchgate.net/profile/Kaiyuan_Yang5/publication/306301849_A2_Analog_Malicious_Hardware/links/5cce6b53458515712e928b3e/A2-Analog-Malicious-Hardware.pdf {"url":"https://www.researchgate.net/profile/Kaiyuan_Yang5/publication/306301849_A2_Analog_Malicious_Hardware/links/5cce6b53458515712e928b3e/A2-Analog-Malicious-Hardware.pdf"} []
[2019-10-06 21:21:13] graby.INFO: Trying using method "get" on url "https://www.researchgate.net/profile/Kaiyuan_Yang5/publication/306301849_A2_Analog_Malicious_Hardware/links/5cce6b53458515712e928b3e/A2-Analog-Malicious-Hardware.pdf" {"method":"get","url":"https://www.researchgate.net/profile/Kaiyuan_Yang5/publication/306301849_A2_Analog_Malicious_Hardware/links/5cce6b53458515712e928b3e/A2-Analog-Malicious-Hardware.pdf"} []
[2019-10-06 21:21:13] graby.INFO: Use default user-agent "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.92 Safari/535.2" for url "https://www.researchgate.net/profile/Kaiyuan_Yang5/publication/306301849_A2_Analog_Malicious_Hardware/links/5cce6b53458515712e928b3e/A2-Analog-Malicious-Hardware.pdf" {"user-agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.92 Safari/535.2","url":"https://www.researchgate.net/profile/Kaiyuan_Yang5/publication/306301849_A2_Analog_Malicious_Hardware/links/5cce6b53458515712e928b3e/A2-Analog-Malicious-Hardware.pdf"} []
[2019-10-06 21:21:13] graby.INFO: Use default referer "http://www.google.co.uk/url?sa=t&source=web&cd=1" for url "https://www.researchgate.net/profile/Kaiyuan_Yang5/publication/306301849_A2_Analog_Malicious_Hardware/links/5cce6b53458515712e928b3e/A2-Analog-Malicious-Hardware.pdf" {"referer":"http://www.google.co.uk/url?sa=t&source=web&cd=1","url":"https://www.researchgate.net/profile/Kaiyuan_Yang5/publication/306301849_A2_Analog_Malicious_Hardware/links/5cce6b53458515712e928b3e/A2-Analog-Malicious-Hardware.pdf"} []
[2019-10-06 21:21:13] graby.DEBUG: Auth: no credentials available for host. {"host":"researchgate.net"} []
[2019-10-06 21:21:14] graby.DEBUG: Auth: no credentials available for host. {"host":"researchgate.net"} []
[2019-10-06 21:21:14] graby.INFO: Data fetched: [array] {"data":{"effective_url":"https://www.researchgate.net/profile/Kaiyuan_Yang5/publication/306301849_A2_Analog_Malicious_Hardware/links/5cce6b53458515712e928b3e/A2-Analog-Malicious-Hardware.pdf","body":"(only length for debug): 2769279","headers":"application/pdf","all_headers":{"date":"Sun, 06 Oct 2019 21:21:13 GMT","content-type":"application/pdf","content-length":"2769279","connection":"keep-alive","set-cookie":"__cfduid=d387c86c9d7bda056551c38b865536c3d1570396873; expires=Mon, 05-Oct-20 21:21:13 GMT; path=/; domain=.researchgate.net; HttpOnly; Secure, sid=5BjBnArz0ggUgb3m68OnQdktJ38vMvfsU1DlcM4ug5iF7gULxYjoMRNbnwk0hAwM4OMSzuSj6rXJw31MxopP1STLa5wR4EOBZchy1J3GAP5P0klhPLAH19Z7RYMJCtQr;Path=/;Domain=.www.researchgate.net;Secure;HttpOnly, did=J4g0uAO8ROM2htvu0YuqP18SfQ5YKD2Z2BlN64VM5kGmSM7LjAFlsngUoya4sfBl; expires=Tue, 06-Oct-2020 21:21:13 GMT; Max-Age=31622400; path=/; domain=.www.researchgate.net; secure; httponly, pl=deleted; expires=Sat, 06-Oct-2018 21:21:12 GMT; Max-Age=0; path=/; domain=.www.researchgate.net; secure; httponly, ptc=RG1.8318582581154698758.1570396873; expires=Tue, 05-Oct-2021 21:21:13 GMT; Max-Age=63072000; path=/; domain=.www.researchgate.net; secure; httponly","cache-control":"must-revalidate, no-cache, no-store, post-check=0, pre-check=0, private","expires":"Thu, 19 Nov 1981 08:52:00 GMT","content-disposition":"inline; filename=\"A2_SP_2016.pdf\"","x-rg-decision-maker":"habibi-service","content-encoding":"identity","link":"<https://www.researchgate.net/publication/306301849_A2_Analog_Malicious_Hardware>; rel=\"canonical\"","access-control-allow-origin":"https://c5.rgstatic.net","access-control-allow-methods":"GET,POST,PUT,DELETE,PATCH,HEAD,OPTIONS","access-control-allow-headers":"Accept,Range,Origin,Content-Type,Authorization","access-control-expose-headers":"Accept-Ranges, Content-Encoding, Content-Length, Content-Range","x-correlation-id":"rgreq-63c9cd0db1634c89ba1c6b93be7794e0","accept-ranges":"bytes","cf-cache-status":"DYNAMIC","expect-ct":"max-age=604800, report-uri=\"https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct\"","server":"cloudflare","cf-ray":"521aaf0cf9c5b7e1-CDG"},"status":200}} []
[2019-10-06 21:21:14] app.ERROR: Error while saving an entry {"exception":"[object] (Exception(code: 0): Object list not found. Possible secured file. at /var/www/wallabag/vendor/smalot/pdfparser/src/Smalot/PdfParser/Parser.php:103)","entry":"[object] (Wallabag\\CoreBundle\\Entity\\Entry: {})"} []
[2019-10-06 21:21:14] app.DEBUG: DownloadImagesSubscriber: disabled. [] []
[2019-10-06 21:21:14] security.DEBUG: Stored the security token in the session. {"key":"_security_secured_area"} []

I just saw the details of how to make the logs more verbose. I'll do that this evening.

j0k3r · 2019-10-07T07:29:41Z

Object list not found. Possible secured file

Looks like the parsing of the PDF fails.
We do not implement PDF parsing ourselves; we use an external library for that: https://github.com/smalot/pdfparser

teroshan · 2019-10-07T22:52:47Z

We do not implement PDF parsing ourselves; we use an external library for that: https://github.com/smalot/pdfparser

I understand that the issue that triggers the error is outside the scope of wallabag.
However, the way the mimetype is updated is not.

What is exactly the purpose of the mimetype attribute of an entry? Is it just supposed to be used for internal logic? Or is it supposed to be user/API facing?

In the first case, I would understand the current result. An application/pdf was detected, parsing was attempted. An error occurred, so nothing can be stored in the entry. Fair enough. mimetype allows me to see that the content wasn't parsed properly I guess?

In the second case, either:

Content is parsed correctly. mimetype reflects the original mimetype of the content, which was correctly parsed and appears as text. This is the ideal case anyway.
Or an error happens: content was not parsed (an error of some kind somewhere). mimetype displays None and no text is stored, even though the initial fetch returned a mimetype that was used to start the parsing.

Wouldn't it make sense to display the initial mimetype in the second case? Or, if that attribute is heavily used internally or has a different purpose than what I understand, would adding an attribute like initial_mimetype be acceptable?

The goal would be to make some tasks easier to automate. For example, I want to be able to tag all application/pdf entries with a given tag in wallabag, and have a script run to archive them and email them somewhere. All of this would make more sense and be easier if I didn't have to request the headers of the content again on the client side, since the server already did it.
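As a sketch of what that automation could look like on the client, assuming entries arrive as plain dictionaries and that the proposed initial_mimetype attribute existed (it does not today; `effective_mimetype` and `select_pdfs` are hypothetical helper names):

```python
from typing import Iterable, List, Mapping, Optional

PDF = "application/pdf"


def effective_mimetype(entry: Mapping[str, Optional[str]]) -> Optional[str]:
    """Prefer the stored mimetype, else fall back to the proposed initial one."""
    return entry.get("mimetype") or entry.get("initial_mimetype")


def select_pdfs(
    entries: Iterable[Mapping[str, Optional[str]]],
) -> List[Mapping[str, Optional[str]]]:
    """Return the entries that point at a PDF, whether or not parsing worked."""
    return [entry for entry in entries if effective_mimetype(entry) == PDF]
```

With only the current mimetype attribute, an entry whose parsing failed would be missed by this filter, which is why the client would otherwise have to request the headers itself.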
