Add default Accept header to the linkcheck #5140
There are situations where the requested server replies with different content (in my particular case, HTTP 404) when there is no Accept header, possibly because its content negotiation treats the request as an API request instead of a browser request. This change adds a default Accept header equal to what my Firefox sends out of the box.
I stumbled upon this when checking a link to https://crates.io/crates/dredd-hooks. While
$ curl -i https://crates.io/crates/dredd-hooks
returns HTTP 404, the following results in the expected HTTP 200 response with an HTML body:
$ curl -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' -i https://crates.io/crates/dredd-hooks
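As a rough illustration of the same idea in Python, the sketch below builds a request object that carries the browser-like Accept value quoted above. It uses only the standard library's urllib.request; the helper name build_check_request and the constant BROWSER_ACCEPT are hypothetical and are not taken from the linkcheck code. Nothing is sent over the network here.

```python
from urllib.request import Request

# The Accept value quoted in the description (what the author's Firefox sends).
BROWSER_ACCEPT = ('text/html,application/xhtml+xml,'
                  'application/xml;q=0.9,*/*;q=0.8')


def build_check_request(url, accept=BROWSER_ACCEPT):
    """Return a GET request object with a browser-like Accept header.

    Hypothetical helper for illustration only; the request is not sent
    until it is passed to urllib.request.urlopen().
    """
    return Request(url, headers={'Accept': accept})


req = build_check_request('https://crates.io/crates/dredd-hooks')
print(req.get_method(), req.full_url)
print('Accept:', req.get_header('Accept'))
```

Passing req to urllib.request.urlopen() would perform the actual check; whether a given server then answers 200 or 404 depends on that server's content negotiation.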
Codecov Report
@@            Coverage Diff             @@
##           master    #5140      +/-   ##
==========================================
- Coverage   82.31%   82.29%   -0.02%
==========================================
  Files         297      292       -5
  Lines       39212    38888     -324
  Branches     6033     5979      -54
==========================================
- Hits        32276    32003     -273
+ Misses       5602     5558      -44
+ Partials     1334     1327       -7
'allow_redirects': True,
'headers': {
    'Accept': ('text/html,application/xhtml+xml,'
               'application/xml;q=0.9,*/*;q=0.8')
I don't understand why application/xml is needed. Can Sphinx recognize XML documents?
I'd have to test it again without application/xml to verify, but in my opinion, even without it, the change would still solve the issue I had with crates.io.
The reason I added the header like this is to mimic a real browser, 1:1. So I just took whatever my own Firefox is sending.
Since the link check verifies whether the external links in the docs are "not dead" and still accessible for humans reading the docs and clicking the links, displaying them in their browsers, I think it is a correct approach. If browsers are generally fine with XML and prefer it over other media types, then I think it should be there.
The linkcheck builder uses HTMLParser to parse the response. I'm okay with it if the builder can handle any XML data. Sending Accept: application/xml means the client can accept XML as a data format, but I don't know whether our builder supports it.
Note: Even when application/xml is not present, crates.io returns its contents:
$ curl -H 'Accept: text/html,application/xhtml+xml;q=0.9,*/*;q=0.8' -i https://crates.io/crates/dredd-hooks
HTTP/1.1 200 OK
...
LGTM with nits
Thank you @tk0miya!
Adds a default Accept header to the linkcheck, as some servers have trouble responding correctly without it. The added Accept header resembles what Firefox sets by default.