
Linkchecker: option to override user agent #1331

Closed
shimizukawa opened this issue Jan 3, 2015 · 12 comments
Labels: extensions, type:enhancement (enhance or introduce a new feature)
Milestone: 1.3

Comments

@shimizukawa
Member

linkcheck.py currently hardcodes a 'Mozilla/5.0' user agent to simulate a browser, which works with most sites.

But Sourceforge resets the connection for that particular string. Interestingly enough, it works OK for other user agents, including 'Mozilla/4.0'.

It may be the case that other websites exhibit similar quirks, and it would be nice if we could specify a string to be used as the user agent in conf.py.
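For illustration, such an option might look like the following in conf.py. This is a hypothetical sketch: 'linkcheck_user_agent' is an assumed option name, not an existing Sphinx setting, and the fallback simply mirrors the string currently hard-coded in linkcheck.py.

# conf.py -- hypothetical sketch; 'linkcheck_user_agent' is an assumed name,
# not an existing Sphinx configuration value.
linkcheck_user_agent = 'Mozilla/4.0 (compatible; Sphinx linkcheck)'

# Inside the builder, the request header could then be built roughly as:
# headers['User-Agent'] = config.linkcheck_user_agent or 'Mozilla/5.0'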


@shimizukawa shimizukawa added this to the 1.3 milestone Jan 3, 2015
@shimizukawa shimizukawa added type:enhancement enhance or introduce a new feature extensions prio:low labels Jan 3, 2015
@shimizukawa
Member Author

From MarioVilas on 2013-12-20 09:55:12+00:00

I'm also getting strange errors from other sites. For example, code.activestate.com throws 405 Method Not Allowed errors (it's not the only site that does so; this may be related to the web server software rather than a specific site's configuration), and WordPress blogs also don't seem to like it (they return empty responses).

The 405 errors appear to be related to the use of HEAD, which is not mandatory in HTTP. Instead of failing, linkcheck.py should retry with the GET method.
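A minimal sketch of that fallback using plain requests (the function name and structure are my own illustration, not linkcheck.py's actual code):

import requests

def check_url(url, **kwargs):
    """Try a cheap HEAD request first; fall back to GET when the server
    rejects HEAD with 405, since HEAD support is optional in HTTP."""
    response = requests.head(url, allow_redirects=True, **kwargs)
    if response.status_code == 405:
        # Stream the GET so only the status line and headers are fetched,
        # not the whole document body.
        response = requests.get(url, allow_redirects=True, stream=True, **kwargs)
    return response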

@shimizukawa
Member Author

From Takayuki Shimizukawa on 2013-12-22 03:50:01+00:00

I confirmed with sourceforge.net:


>>> requests.head('http://docutils.sourceforge.net/docs/ref/rst/directives.html')
<Response [200]>

>>> requests.head('http://docutils.sourceforge.net/docs/ref/rst/directives.html', headers={'User-agent': 'Mozilla/5.0'})
Traceback (most recent call last):
...
requests.exceptions.ConnectionError: HTTPConnectionPool(host='docutils.sourceforge.net', port=80): Max retries exceeded with url: /docs/ref/rst/directives.html
(Caused by <class 'socket.error'>: [Errno 10054] Connection reset by peer.)

>>> requests.head('http://docutils.sourceforge.net/docs/ref/rst/directives.html', headers={'User-agent': 'Mozilla/4.0'})
<Response [200]>

and also confirmed with code.activestate.com:


>>> requests.head('http://code.activestate.com/', headers={'User-agent': 'Mozilla/5.0'})
<Response [200]>
>>> requests.head('http://code.activestate.com/recipes/578788/', headers={'User-agent': 'Mozilla/5.0'})
<Response [405]>
>>> requests.head('http://code.activestate.com/recipes/578788/', headers={'User-agent': 'Mozilla/4.0'})
<Response [405]>
>>> requests.head('http://code.activestate.com/recipes/578788/')
<Response [405]>
>>> requests.get('http://code.activestate.com/recipes/578788/')
<Response [200]>

I think linkcheck should:

  • If a HEAD request receives a 405 error, retry with a GET request.

However, I wonder why the ('User-agent', 'Mozilla/5.0') header causes a "Connection reset by peer" exception.

@shimizukawa
Member Author

From Georg Brandl on 2014-01-12 07:45:48+00:00

This should be in 1.3 as a new feature...

@shimizukawa
Member Author

From Takayuki Shimizukawa on 2014-01-12 08:49:18+00:00

Georg Brandl: Ah, sorry. My mistake.

@shimizukawa
Member Author

From Georg Brandl on 2014-01-12 23:06:39+00:00

Well, I should have fixed it with fa7c50ffb46f by retrying. I don't think making the user agent selectable is necessary anymore.

@dHannasch

dHannasch commented Oct 21, 2019

I'm not sure whether to open a new issue for this or not.

After #6381 added sphinx.util.requests.useragent_header to linkcheck's headers, linkcheck now reports "broken" on links that are not broken.

For example, http://doc.pytest.org/, http://keras.io/, http://www.coxlab.org/, and http://www.cvlibs.net/.

(line    5) broken    http://www.coxlab.org/ - 403 Client Error: Forbidden for url: http://www.coxlab.org/
(line   25) broken    http://keras.io/ - 403 Client Error: Forbidden for url: http://keras.io/
(line   35) broken    http://www.cvlibs.net/datasets/kitti/ - 403 Client Error: Forbidden for url: http://www.cvlibs.net/datasets/kitti/
(line   73) broken    http://keras.io/getting-started/faq/#how-can-i-run-keras-on-gpu - Anchor 'how-can-i-run-keras-on-gpu' not found

The problem is caused by sphinx.util.requests.useragent_header.
This can be most easily seen by just comparing the output of requests.head to the output of sphinx.util.requests.head (which also uses sphinx.util.requests.useragent_header).

python -c "import requests; print(requests.head('http://doc.pytest.org/', allow_redirects=True)); import sphinx.util.requests; print(sphinx.util.requests.head('http://doc.pytest.org/', allow_redirects=True))"
<Response [200]>
<Response [403]>
python -c "import requests; print(requests.head('http://keras.io', allow_redirects=True)); import sphinx.util.requests; print(sphinx.util.requests.head('http://keras.io', allow_redirects=True))"
<Response [200]>
<Response [403]>
python -c "import requests; print(requests.head('http://www.coxlab.org/')); import sphinx.util.requests; print(sphinx.util.requests.head('http://www.coxlab.org/'))"
<Response [200]>
<Response [403]>
python -c "import requests; print(requests.head('http://www.cvlibs.net/')); import sphinx.util.requests; print(sphinx.util.requests.head('http://www.cvlibs.net/'))"
<Response [200]>
<Response [403]>

Reverting #6381 fixes the problem.

(line    5) ok        http://www.coxlab.org/
(line   25) redirect  http://keras.io/ - permanently to https://keras.io/
(line   73) redirect  http://keras.io/getting-started/faq/#how-can-i-run-keras-on-gpu - permanently to https://keras.io/getting-started/faq/#how-can-i-run-keras-on-gpu
(line   35) ok        http://www.cvlibs.net/datasets/kitti/

The reason #6381 made the difference was that prior to #6381, linkcheck overrode the default headers of sphinx.util.requests.head to not use sphinx.util.requests.useragent_header.
(Of course, sphinx.util.requests.head will still fail on those URLs by default, but the sphinx build works with linkcheck overriding the headers to avoid sphinx.util.requests.useragent_header.)
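Roughly the pre-#6381 call pattern, as a sketch (this assumes, per the description above, that passing an explicit headers= argument keeps sphinx.util.requests from adding its default User-Agent):

import sphinx.util.requests

# Send only an Accept header, as linkcheck did before #6381; according to the
# linkcheck output above, these sites then answer 200/redirect instead of 403.
headers = {'Accept': 'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8'}
print(sphinx.util.requests.head('http://doc.pytest.org/',
                                allow_redirects=True, headers=headers))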

It looks like a lot of people are now having to add linkcheck ignores due to sites rejecting the linkcheck user-agent, e.g. mtbc/ome-documentation@41f2e06.

I don't know the best way to handle this.

At maximum complexity, we could have a linkcheck_user_agents option, similar to linkcheck_ignore but a dictionary mapping URLs to the User-Agent strings to use (with None meaning to pass nothing and let requests add whatever User-Agent it sees fit).
https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-the-linkcheck-builder
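A purely hypothetical sketch of that idea, modeled on linkcheck_ignore's use of regular expressions (the option name, patterns, and helper are my own illustration, not an existing API):

# conf.py -- hypothetical; 'linkcheck_user_agents' is not a real Sphinx option.
linkcheck_user_agents = {
    r'https?://.*\.sourceforge\.net/': 'Mozilla/4.0',
    r'https?://keras\.io/': None,   # None: let requests pick its own User-Agent
}

# How the builder might pick a value (helper and default are assumptions;
# the default is the string linkcheck currently hard-codes):
import re

DEFAULT_UA = 'Mozilla/5.0 (X11; Linux x86_64; rv:25.0) Gecko/20100101 Firefox/25.0'

def user_agent_for(uri, user_agents, default=DEFAULT_UA):
    """Return the User-Agent configured for the first pattern matching the URI;
    fall back to the current default when no pattern matches. A configured
    value of None means: send no explicit User-Agent and let requests decide."""
    for pattern, agent in user_agents.items():
        if re.match(pattern, uri):
            return agent
    return default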

We could allow specifying a User-Agent as a command-line option to use for all links in that run of the builder; that may be what @kristian-kolev had in mind.

Really, speaking from my own perspective, an option to just disable the use of sphinx.util.requests.useragent_header would be plenty.

I don't know where the magic string Mozilla/5.0 (X11; Linux x86_64; rv:25.0) Gecko/20100101 Firefox/25.0 came from in the first place. It originated without comment in the commit that added retries: 5406bff#diff-bb67587af94a24f53692228ea27061ed
(I could not locate fa7c50ffb46f: https://github.com/sphinx-doc/sphinx/search?q=fa7c50ffb46f&type=Commits)

It doesn't appear to be documented anywhere, and Google doesn't turn up any indication that there's anything particularly special about that string. Do we even need it? Apparently it fixed #6378; I'm not clear on how or why. When I try the URL from #6378, I get an SSLCertVerificationError:

python -c "import sphinx.util.requests; print(sphinx.util.requests.head('https://wiki.stm32duino.com/index.php?title=Blue_Pill', allow_redirects=True))"
Traceback (most recent call last):
  File "/tmp/prednet/.tox/docs/lib/python3.7/site-packages/urllib3/connectionpool.py", line 662, in urlopen
    self._prepare_proxy(conn)
  File "/tmp/prednet/.tox/docs/lib/python3.7/site-packages/urllib3/connectionpool.py", line 948, in _prepare_proxy
    conn.connect()
  File "/tmp/prednet/.tox/docs/lib/python3.7/site-packages/urllib3/connection.py", line 420, in connect
    _match_hostname(cert, self.assert_hostname or server_hostname)
  File "/tmp/prednet/.tox/docs/lib/python3.7/site-packages/urllib3/connection.py", line 430, in _match_hostname
    match_hostname(cert, asserted_hostname)
  File "/ascldap/users/dahanna/anaconda3/envs/python37env/lib/python3.7/ssl.py", line 334, in match_hostname
    % (hostname, ', '.join(map(repr, dnsnames))))
ssl.SSLCertVerificationError: ("hostname 'wiki.stm32duino.com' doesn't match either of '*.hostgator.com', 'hostgator.com'",)

@tk0miya
Member

tk0miya commented Oct 22, 2019

Note: It seems to succeed from my local machine.

(py37) bash-3.2$ python -c "import requests; print(requests.head('http://doc.pytest.org/', allow_redirects=True)); import sphinx.util.requests; print(sphinx.util.requests.head('http://doc.pytest.org/', allow_redirects=True))"
<Response [200]>
<Response [200]>
(py37) bash-3.2$ python -c "import requests; print(requests.head('http://keras.io', allow_redirects=True)); import sphinx.util.requests; print(sphinx.util.requests.head('http://keras.io', allow_redirects=True))"
<Response [200]>
<Response [200]>
(py37) bash-3.2$ python -c "import requests; print(requests.head('http://www.coxlab.org/')); import sphinx.util.requests; print(sphinx.util.requests.head('http://www.coxlab.org/'))"
<Response [200]>
<Response [200]>
(py37) bash-3.2$ python -c "import requests; print(requests.head('http://www.cvlibs.net/')); import sphinx.util.requests; print(sphinx.util.requests.head('http://www.cvlibs.net/'))"
<Response [200]>
<Response [200]>

But accessing stm32duino.com also fails from my local machine.

(py37) bash-3.2$ python -c "import sphinx.util.requests; print(sphinx.util.requests.head('https://wiki.stm32duino.com/index.php?title=Blue_Pill', allow_redirects=True))"
Traceback (most recent call last):
  File "/Users/tkomiya/work/sphinx/.tox/py37/lib/python3.7/site-packages/urllib3/connectionpool.py", line 672, in urlopen
    chunked=chunked,
  File "/Users/tkomiya/work/sphinx/.tox/py37/lib/python3.7/site-packages/urllib3/connectionpool.py", line 376, in _make_request
    self._validate_conn(conn)
  File "/Users/tkomiya/work/sphinx/.tox/py37/lib/python3.7/site-packages/urllib3/connectionpool.py", line 994, in _validate_conn
    conn.connect()
  File "/Users/tkomiya/work/sphinx/.tox/py37/lib/python3.7/site-packages/urllib3/connection.py", line 420, in connect
    _match_hostname(cert, self.assert_hostname or server_hostname)
  File "/Users/tkomiya/work/sphinx/.tox/py37/lib/python3.7/site-packages/urllib3/connection.py", line 430, in _match_hostname
    match_hostname(cert, asserted_hostname)
  File "/Users/tkomiya/.pyenv/versions/3.7.5/lib/python3.7/ssl.py", line 334, in match_hostname
    % (hostname, ', '.join(map(repr, dnsnames))))
ssl.SSLCertVerificationError: ("hostname 'wiki.stm32duino.com' doesn't match either of '*.hostgator.com', 'hostgator.com'",)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/tkomiya/work/sphinx/.tox/py37/lib/python3.7/site-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/Users/tkomiya/work/sphinx/.tox/py37/lib/python3.7/site-packages/urllib3/connectionpool.py", line 720, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/Users/tkomiya/work/sphinx/.tox/py37/lib/python3.7/site-packages/urllib3/util/retry.py", line 436, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='wiki.stm32duino.com', port=443): Max retries exceeded with url: /index.php?title=Blue_Pill (Caused by SSLError(SSLCertVerificationError("hostname 'wiki.stm32duino.com' doesn't match either of '*.hostgator.com', 'hostgator.com'")))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/Users/tkomiya/work/sphinx/sphinx/util/requests.py", line 131, in head
    return requests.get(url, **kwargs)
  File "/Users/tkomiya/work/sphinx/.tox/py37/lib/python3.7/site-packages/requests/api.py", line 75, in get
    return request('get', url, params=params, **kwargs)
  File "/Users/tkomiya/work/sphinx/.tox/py37/lib/python3.7/site-packages/requests/api.py", line 60, in request
    return session.request(method=method, url=url, **kwargs)
  File "/Users/tkomiya/work/sphinx/.tox/py37/lib/python3.7/site-packages/requests/sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "/Users/tkomiya/work/sphinx/.tox/py37/lib/python3.7/site-packages/requests/sessions.py", line 646, in send
    r = adapter.send(request, **kwargs)
  File "/Users/tkomiya/work/sphinx/.tox/py37/lib/python3.7/site-packages/requests/adapters.py", line 514, in send
    raise SSLError(e, request=request)
requests.exceptions.SSLError: HTTPSConnectionPool(host='wiki.stm32duino.com', port=443): Max retries exceeded with url: /index.php?title=Blue_Pill (Caused by SSLError(SSLCertVerificationError("hostname 'wiki.stm32duino.com' doesn't match either of '*.hostgator.com', 'hostgator.com'")))

@tk0miya
Member

tk0miya commented Oct 22, 2019

I also don't know where the User-Agent string came from. But I agree it is too old for the real world. So +1 for replacing it with a new one, and +0 for adding a configuration option to modify it, for those who want to use the old User-Agent.

@tk0miya tk0miya reopened this Oct 22, 2019
@dHannasch

> It seems to succeed from my local machine.

Huh. I tried at https://www.pythonanywhere.com/try-ipython/ earlier to verify, and it got the 403 too:

In [1]: import sphinx.util.requests
In [2]: print(sphinx.util.requests.head('http://doc.pytest.org/', allow_redirects=True))
<Response [403]>

...but now that I go back and check the regular requests package on https://www.pythonanywhere.com/try-ipython/, that gets a 403 error too. (Only for the URLs above; most URLs give 200 as normal.)

In [5]: print(requests.head('http://doc.pytest.org/', allow_redirects=True))
<Response [403]>
In [7]: print(requests.head('http://github.com/', allow_redirects=True))
<Response [200]>

I... have no idea how to account for that. I assume it must have something to do with User-Agent strings, because reverting that one change makes my build pass locally and on two different CI servers. But it seems that the version of requests on https://www.pythonanywhere.com/try-ipython/ also sends a User-Agent string that gets a 403, or maybe something else in the headers triggers the 403. Technically, the reverting change I made overrides any default headers requests might add, sending just 'Accept': 'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8' (as linkcheck used to). I don't know whether any other headers might be important. But overriding the headers doesn't seem to matter.

In [9]: print(requests.head('http://doc.pytest.org/', allow_redirects=True, headers={'Accept': 'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8'}))
<Response [403]>

tk0miya added a commit to tk0miya/sphinx that referenced this issue Oct 22, 2019
tk0miya added a commit to tk0miya/sphinx that referenced this issue Oct 26, 2019
tk0miya added a commit that referenced this issue Nov 5, 2019
Close #1331: Change default User-Agent header
@tk0miya
Member

tk0miya commented Nov 5, 2019

Now #1331 is merged. It will be released in the next release.
Thank you for reporting!

@tk0miya tk0miya closed this as completed Nov 5, 2019
@dHannasch

Thank you! Looks like this was merged into the 2.0 branch, so I'm installing sphinx@2.0 to make things work. (It's still the same on master. I guess not all changes go into master? I guess I don't really understand https://github.com/sphinx-doc/sphinx/blob/master/CONTRIBUTING.rst#branch-model. In any case, sphinx@2.0 is working great.)

@tk0miya
Member

tk0miya commented Nov 17, 2019

Now I've merged the 2.0 branch into the master branch. We do this by hand sometimes. Sorry for the delay!

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 31, 2021