DOC linkcheck run; https and 301 link updates.
Closes #4359
nyov committed Feb 25, 2020
1 parent caa1dea commit a34c366
Showing 16 changed files with 58 additions and 59 deletions.
1 change: 1 addition & 0 deletions docs/conf.py
@@ -281,6 +281,7 @@

intersphinx_mapping = {
'coverage': ('https://coverage.readthedocs.io/en/stable', None),
'cssselect': ('https://cssselect.readthedocs.io/en/latest', None),
'pytest': ('https://docs.pytest.org/en/latest', None),
'python': ('https://docs.python.org/3', None),
'sphinx': ('https://www.sphinx-doc.org/en/master', None),
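
For context, a minimal sketch (not part of this commit) of how such a mapping is consumed: each key becomes a prefix for cross-references, so the ``:doc:`cssselect <cssselect:index>``` roles added in ``docs/news.rst`` below resolve through the new ``cssselect`` entry. In conf.py terms (URL copied from the hunk above, everything else assumed)::

    # sphinx.ext.intersphinx fetches <url>/objects.inv for each mapping entry
    # (the trailing None means "use the default inventory location") and
    # resolves prefixed roles such as :doc:`cssselect <cssselect:index>`
    # against that external inventory instead of the local docs.
    extensions = ['sphinx.ext.intersphinx']

    intersphinx_mapping = {
        'cssselect': ('https://cssselect.readthedocs.io/en/latest', None),
    }
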
6 changes: 3 additions & 3 deletions docs/contributing.rst
@@ -143,7 +143,7 @@ by running ``git fetch upstream pull/$PR_NUMBER/head:$BRANCH_NAME_TO_CREATE``
(replace 'upstream' with a remote name for scrapy repository,
``$PR_NUMBER`` with an ID of the pull request, and ``$BRANCH_NAME_TO_CREATE``
with a name of the branch you want to create locally).
See also: https://help.github.com/articles/checking-out-pull-requests-locally/#modifying-an-inactive-pull-request-locally.
See also: https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/checking-out-pull-requests-locally#modifying-an-inactive-pull-request-locally.

When writing GitHub pull requests, try to keep titles short but descriptive.
E.g. For bug #411: "Scrapy hangs if an exception raises in start_requests"
@@ -168,7 +168,7 @@ Scrapy:

* Don't put your name in the code you contribute; git provides enough
metadata to identify author of the code.
See https://help.github.com/articles/setting-your-username-in-git/ for
See https://help.github.com/en/github/using-git/setting-your-username-in-git for
setup instructions.

.. _documentation-policies:
@@ -266,5 +266,5 @@ And their unit-tests are in::
.. _tests/: https://github.com/scrapy/scrapy/tree/master/tests
.. _open issues: https://github.com/scrapy/scrapy/issues
.. _PEP 257: https://www.python.org/dev/peps/pep-0257/
.. _pull request: https://help.github.com/en/articles/creating-a-pull-request
.. _pull request: https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/creating-a-pull-request
.. _pytest-xdist: https://github.com/pytest-dev/pytest-xdist
6 changes: 3 additions & 3 deletions docs/faq.rst
@@ -22,8 +22,8 @@ In other words, comparing `BeautifulSoup`_ (or `lxml`_) to Scrapy is like
comparing `jinja2`_ to `Django`_.

.. _BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/
.. _lxml: http://lxml.de/
.. _jinja2: http://jinja.pocoo.org/
.. _lxml: https://lxml.de/
.. _jinja2: https://palletsprojects.com/p/jinja/
.. _Django: https://www.djangoproject.com/

Can I use Scrapy with BeautifulSoup?
@@ -269,7 +269,7 @@ The ``__VIEWSTATE`` parameter is used in sites built with ASP.NET/VB.NET. For
more info on how it works see `this page`_. Also, here's an `example spider`_
which scrapes one of these sites.

.. _this page: http://search.cpan.org/~ecarroll/HTML-TreeBuilderX-ASP_NET-0.09/lib/HTML/TreeBuilderX/ASP_NET.pm
.. _this page: https://metacpan.org/pod/release/ECARROLL/HTML-TreeBuilderX-ASP_NET-0.09/lib/HTML/TreeBuilderX/ASP_NET.pm
.. _example spider: https://github.com/AmbientLighter/rpn-fas/blob/master/fas/spiders/rnp.py
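
As a hedged illustration (not taken from the linked example spider; the URL and field name are placeholders), ``FormRequest.from_response`` is the usual way to carry hidden fields such as ``__VIEWSTATE`` forward, because it pre-populates the form data from the page::

    import scrapy

    class AspNetSpider(scrapy.Spider):
        name = "aspnet_example"  # hypothetical spider
        start_urls = ["https://example.com/page.aspx"]

        def parse(self, response):
            # from_response() copies the hidden inputs (__VIEWSTATE,
            # __EVENTVALIDATION, ...) from the HTML form, so only the
            # visible fields need to be filled in explicitly.
            yield scrapy.FormRequest.from_response(
                response,
                formdata={"ctl00$query": "scrapy"},  # placeholder field name
                callback=self.parse_results,
            )

        def parse_results(self, response):
            self.logger.info("Got %d bytes", len(response.body))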

What's the best way to parse big XML/CSV data feeds?
12 changes: 6 additions & 6 deletions docs/intro/install.rst
@@ -65,7 +65,7 @@ please refer to their respective installation instructions:
* `lxml installation`_
* `cryptography installation`_

.. _lxml installation: http://lxml.de/installation.html
.. _lxml installation: https://lxml.de/installation.html
.. _cryptography installation: https://cryptography.io/en/latest/installation/


@@ -253,11 +253,11 @@ For details, see `Issue #2473 <https://github.com/scrapy/scrapy/issues/2473>`_.
.. _Python: https://www.python.org/
.. _pip: https://pip.pypa.io/en/latest/installing/
.. _lxml: https://lxml.de/index.html
.. _parsel: https://pypi.python.org/pypi/parsel
.. _w3lib: https://pypi.python.org/pypi/w3lib
.. _twisted: https://twistedmatrix.com/
.. _cryptography: https://cryptography.io/
.. _pyOpenSSL: https://pypi.python.org/pypi/pyOpenSSL
.. _parsel: https://pypi.org/project/parsel/
.. _w3lib: https://pypi.org/project/w3lib/
.. _twisted: https://twistedmatrix.com/trac/
.. _cryptography: https://cryptography.io/en/latest/
.. _pyOpenSSL: https://pypi.org/project/pyOpenSSL/
.. _setuptools: https://pypi.python.org/pypi/setuptools
.. _AUR Scrapy package: https://aur.archlinux.org/packages/scrapy/
.. _homebrew: https://brew.sh/
4 changes: 2 additions & 2 deletions docs/intro/tutorial.rst
@@ -306,7 +306,7 @@ with a selector (see :ref:`topics-developer-tools`).
visually selected elements, which works in many browsers.

.. _regular expressions: https://docs.python.org/3/library/re.html
.. _Selector Gadget: http://selectorgadget.com/
.. _Selector Gadget: https://selectorgadget.com/


XPath: a brief intro
@@ -337,7 +337,7 @@ recommend `this tutorial to learn XPath through examples
<http://zvon.org/comp/r/tut-XPath_1.html>`_, and `this tutorial to learn "how
to think in XPath" <http://plasmasturm.org/log/xpath101/>`_.

.. _XPath: https://www.w3.org/TR/xpath
.. _XPath: https://www.w3.org/TR/xpath/all/
.. _CSS: https://www.w3.org/TR/selectors
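
As a hedged aside (the page title below assumes the tutorial's quotes.toscrape.com target), the linked tutorials build toward expressions like these, which can be tried in a ``scrapy shell`` session::

    >>> response.xpath('//title/text()').get()
    'Quotes to Scrape'
    >>> response.css('title::text').get()   # equivalent CSS query
    'Quotes to Scrape'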

Extracting quotes and authors
19 changes: 9 additions & 10 deletions docs/news.rst
@@ -26,7 +26,7 @@ Backward-incompatible changes
* Python 3.4 is no longer supported, and some of the minimum requirements of
Scrapy have also changed:

* cssselect_ 0.9.1
* :doc:`cssselect <cssselect:index>` 0.9.1
* cryptography_ 2.0
* lxml_ 3.5.0
* pyOpenSSL_ 16.2.0
@@ -1616,7 +1616,7 @@ Deprecations and Removals
+ ``scrapy.utils.datatypes.SiteNode``

- The previously bundled ``scrapy.xlib.pydispatch`` library was deprecated and
replaced by `pydispatcher <https://pypi.python.org/pypi/PyDispatcher>`_.
replaced by `pydispatcher <https://pypi.org/project/PyDispatcher/>`_.


Relocations
@@ -2450,7 +2450,7 @@ Other
~~~~~

- Dropped Python 2.6 support (:issue:`448`)
- Add `cssselect`_ python package as install dependency
- Add :doc:`cssselect <cssselect:index>` python package as install dependency
- Drop libxml2 and multi selector's backend support, `lxml`_ is required from now on.
- Minimum Twisted version increased to 10.0.0, dropped Twisted 8.0 support.
- Running test suite now requires ``mock`` python library (:issue:`390`)
@@ -3047,17 +3047,16 @@ Scrapy 0.7
First release of Scrapy.


.. _AJAX crawleable urls: https://developers.google.com/webmasters/ajax-crawling/docs/getting-started?csw=1
.. _AJAX crawleable urls: https://developers.google.com/search/docs/ajax-crawling/docs/getting-started?csw=1
.. _botocore: https://github.com/boto/botocore
.. _chunked transfer encoding: https://en.wikipedia.org/wiki/Chunked_transfer_encoding
.. _ClientForm: http://wwwsearch.sourceforge.net/old/ClientForm/
.. _Creating a pull request: https://help.github.com/en/articles/creating-a-pull-request
.. _cryptography: https://cryptography.io/en/latest/
.. _cssselect: https://github.com/scrapy/cssselect/
.. _docstrings: https://docs.python.org/glossary.html#term-docstring
.. _KeyboardInterrupt: https://docs.python.org/library/exceptions.html#KeyboardInterrupt
.. _docstrings: https://docs.python.org/3/glossary.html#term-docstring
.. _KeyboardInterrupt: https://docs.python.org/3/library/exceptions.html#KeyboardInterrupt
.. _LevelDB: https://github.com/google/leveldb
.. _lxml: http://lxml.de/
.. _lxml: https://lxml.de/
.. _marshal: https://docs.python.org/2/library/marshal.html
.. _parsel.csstranslator.GenericTranslator: https://parsel.readthedocs.io/en/latest/parsel.html#parsel.csstranslator.GenericTranslator
.. _parsel.csstranslator.HTMLTranslator: https://parsel.readthedocs.io/en/latest/parsel.html#parsel.csstranslator.HTMLTranslator
@@ -3068,11 +3067,11 @@ First release of Scrapy.
.. _queuelib: https://github.com/scrapy/queuelib
.. _registered with IANA: https://www.iana.org/assignments/media-types/media-types.xhtml
.. _resource: https://docs.python.org/2/library/resource.html
.. _robots.txt: http://www.robotstxt.org/
.. _robots.txt: https://www.robotstxt.org/
.. _scrapely: https://github.com/scrapy/scrapely
.. _service_identity: https://service-identity.readthedocs.io/en/stable/
.. _six: https://six.readthedocs.io/
.. _tox: https://pypi.python.org/pypi/tox
.. _tox: https://pypi.org/project/tox/
.. _Twisted: https://twistedmatrix.com/trac/
.. _Twisted - hello, asynchronous programming: http://jessenoller.com/blog/2009/02/11/twisted-hello-asynchronous-programming/
.. _w3lib: https://github.com/scrapy/w3lib
2 changes: 1 addition & 1 deletion docs/topics/broad-crawls.rst
@@ -188,7 +188,7 @@ AjaxCrawlMiddleware helps to crawl them correctly.
It is turned OFF by default because it has some performance overhead,
and enabling it for focused crawls doesn't make much sense.

.. _ajax crawlable: https://developers.google.com/webmasters/ajax-crawling/docs/getting-started
.. _ajax crawlable: https://developers.google.com/search/docs/ajax-crawling/docs/getting-started
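
A hedged ``settings.py`` sketch (assumed values, not part of this commit) of opting in for a broad crawl::

    # AjaxCrawlMiddleware ships with Scrapy but is disabled by default;
    # enabling it is a single setting.
    AJAXCRAWL_ENABLED = True

    # Typical companion settings for broad crawls discussed on this page.
    CONCURRENT_REQUESTS = 100
    REACTOR_THREADPOOL_MAXSIZE = 20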

.. _broad-crawls-bfo:

12 changes: 6 additions & 6 deletions docs/topics/downloader-middleware.rst
@@ -709,7 +709,7 @@ HttpCompressionMiddleware
provided `brotlipy`_ is installed.

.. _brotli-compressed: https://www.ietf.org/rfc/rfc7932.txt
.. _brotlipy: https://pypi.python.org/pypi/brotlipy
.. _brotlipy: https://pypi.org/project/brotlipy/

HttpCompressionMiddleware Settings
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -1038,7 +1038,7 @@ Based on `RobotFileParser
* is Python's built-in robots.txt_ parser

* is compliant with `Martijn Koster's 1996 draft specification
<http://www.robotstxt.org/norobots-rfc.txt>`_
<https://www.robotstxt.org/norobots-rfc.txt>`_

* lacks support for wildcard matching

@@ -1061,7 +1061,7 @@ Based on `Reppy <https://github.com/seomoz/reppy/>`_:
<https://github.com/seomoz/rep-cpp>`_

* is compliant with `Martijn Koster's 1996 draft specification
<http://www.robotstxt.org/norobots-rfc.txt>`_
<https://www.robotstxt.org/norobots-rfc.txt>`_

* supports wildcard matching

@@ -1086,7 +1086,7 @@ Based on `Robotexclusionrulesparser <http://nikitathespider.com/python/rerp/>`_:
* implemented in Python

* is compliant with `Martijn Koster's 1996 draft specification
<http://www.robotstxt.org/norobots-rfc.txt>`_
<https://www.robotstxt.org/norobots-rfc.txt>`_

* supports wildcard matching

@@ -1115,7 +1115,7 @@ implementing the methods described below.
.. autoclass:: RobotParser
:members:

.. _robots.txt: http://www.robotstxt.org/
.. _robots.txt: https://www.robotstxt.org/
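
For illustration, a hedged ``settings.py`` sketch of selecting one of the parsers compared above (the import path is an assumption; check ``scrapy.robotstxt`` for the exact class names)::

    ROBOTSTXT_OBEY = True
    # Pick the backend explicitly instead of relying on the default:
    ROBOTSTXT_PARSER = "scrapy.robotstxt.PythonRobotParser"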

DownloaderStats
---------------
@@ -1155,7 +1155,7 @@ AjaxCrawlMiddleware

Middleware that finds 'AJAX crawlable' page variants based
on meta-fragment html tag. See
https://developers.google.com/webmasters/ajax-crawling/docs/getting-started
https://developers.google.com/search/docs/ajax-crawling/docs/getting-started
for more info.

.. note::
6 changes: 3 additions & 3 deletions docs/topics/dynamic-content.rst
@@ -241,12 +241,12 @@ along with `scrapy-selenium`_ for seamless integration.
.. _headless browser: https://en.wikipedia.org/wiki/Headless_browser
.. _JavaScript: https://en.wikipedia.org/wiki/JavaScript
.. _js2xml: https://github.com/scrapinghub/js2xml
.. _json.loads: https://docs.python.org/library/json.html#json.loads
.. _json.loads: https://docs.python.org/3/library/json.html#json.loads
.. _pytesseract: https://github.com/madmaze/pytesseract
.. _regular expression: https://docs.python.org/library/re.html
.. _regular expression: https://docs.python.org/3/library/re.html
.. _scrapy-selenium: https://github.com/clemfromspace/scrapy-selenium
.. _scrapy-splash: https://github.com/scrapy-plugins/scrapy-splash
.. _Selenium: https://www.seleniumhq.org/
.. _Selenium: https://www.selenium.dev/
.. _Splash: https://github.com/scrapinghub/splash
.. _tabula-py: https://github.com/chezou/tabula-py
.. _wget: https://www.gnu.org/software/wget/
4 changes: 2 additions & 2 deletions docs/topics/item-pipeline.rst
@@ -158,8 +158,8 @@ method and how to clean up the resources properly.::
self.db[self.collection_name].insert_one(dict(item))
return item

.. _MongoDB: https://www.mongodb.org/
.. _pymongo: https://api.mongodb.org/python/current/
.. _MongoDB: https://www.mongodb.com/
.. _pymongo: https://api.mongodb.com/python/current/
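
For readers seeing only this hunk, a condensed sketch of the pipeline the two lines above belong to (names follow the documented example; treat it as illustrative)::

    import pymongo

    class MongoPipeline:
        collection_name = "scrapy_items"

        def __init__(self, mongo_uri, mongo_db):
            self.mongo_uri = mongo_uri
            self.mongo_db = mongo_db

        @classmethod
        def from_crawler(cls, crawler):
            # Read connection details from the project settings.
            return cls(
                mongo_uri=crawler.settings.get("MONGO_URI"),
                mongo_db=crawler.settings.get("MONGO_DATABASE", "items"),
            )

        def open_spider(self, spider):
            self.client = pymongo.MongoClient(self.mongo_uri)
            self.db = self.client[self.mongo_db]

        def close_spider(self, spider):
            self.client.close()

        def process_item(self, item, spider):
            # The two lines shown in the diff above: insert and return the item.
            self.db[self.collection_name].insert_one(dict(item))
            return item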


Take screenshot of item
4 changes: 2 additions & 2 deletions docs/topics/items.rst
@@ -166,7 +166,7 @@ If your item contains mutable_ values like lists or dictionaries, a shallow
copy will keep references to the same mutable values across all different
copies.

.. _mutable: https://docs.python.org/glossary.html#term-mutable
.. _mutable: https://docs.python.org/3/glossary.html#term-mutable

For example, if you have an item with a list of tags, and you create a shallow
copy of that item, both the original item and the copy have the same list of
@@ -177,7 +177,7 @@ If that is not the desired behavior, use a deep copy instead.

See the `documentation of the copy module`_ for more information.

.. _documentation of the copy module: https://docs.python.org/library/copy.html
.. _documentation of the copy module: https://docs.python.org/3/library/copy.html
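
A hedged sketch of the behaviour described here (the item class and field are illustrative)::

    import copy
    import scrapy

    class Product(scrapy.Item):
        tags = scrapy.Field()  # holds a mutable list

    item = Product(tags=["books"])

    shallow = item.copy()           # shallow copy: both items share the list
    shallow["tags"].append("used")
    print(item["tags"])             # ['books', 'used'] - the original changed too

    deep = copy.deepcopy(item)      # deep copy: nested values are duplicated
    deep["tags"].append("rare")
    print(item["tags"])             # still ['books', 'used'] - unaffected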

To create a shallow copy of an item, you can either call
:meth:`~scrapy.item.Item.copy` on an existing item
10 changes: 5 additions & 5 deletions docs/topics/leaks.rst
@@ -206,7 +206,7 @@ objects. If this is your case, and you can't find your leaks using ``trackref``,
you still have another resource: the `Guppy library`_.
If you're using Python3, see :ref:`topics-leaks-muppy`.

.. _Guppy library: https://pypi.python.org/pypi/guppy
.. _Guppy library: https://pypi.org/project/guppy/

If you use ``pip``, you can install Guppy with the following command::

@@ -311,9 +311,9 @@ though neither Scrapy nor your project are leaking memory. This is due to a
(not so well) known problem of Python, which may not return released memory to
the operating system in some cases. For more information on this issue see:

* `Python Memory Management <http://www.evanjones.ca/python-memory.html>`_
* `Python Memory Management Part 2 <http://www.evanjones.ca/python-memory-part2.html>`_
* `Python Memory Management Part 3 <http://www.evanjones.ca/python-memory-part3.html>`_
* `Python Memory Management <https://www.evanjones.ca/python-memory.html>`_
* `Python Memory Management Part 2 <https://www.evanjones.ca/python-memory-part2.html>`_
* `Python Memory Management Part 3 <https://www.evanjones.ca/python-memory-part3.html>`_

The improvements proposed by Evan Jones, which are detailed in `this paper`_,
got merged in Python 2.5, but this only reduces the problem, it doesn't fix it
@@ -327,7 +327,7 @@ completely. To quote the paper:
to move to a compacting garbage collector, which is able to move objects in
memory. This would require significant changes to the Python interpreter.*

.. _this paper: http://www.evanjones.ca/memoryallocator/
.. _this paper: https://www.evanjones.ca/memoryallocator/

To keep memory consumption reasonable you can split the job into several
smaller jobs or enable :ref:`persistent job queue <topics-jobs>`
2 changes: 1 addition & 1 deletion docs/topics/request-response.rst
@@ -396,7 +396,7 @@ The FormRequest class extends the base :class:`Request` with functionality for
dealing with HTML forms. It uses `lxml.html forms`_ to pre-populate form
fields with form data from :class:`Response` objects.

.. _lxml.html forms: http://lxml.de/lxmlhtml.html#forms
.. _lxml.html forms: https://lxml.de/lxmlhtml.html#forms
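
A small hedged sketch of the plain ``formdata`` form of the constructor (placeholder URL and field names), as distinct from ``from_response``::

    from scrapy import FormRequest

    # Builds a POST request whose body is URL-encoded form data.
    request = FormRequest(
        url="https://example.com/post/action",
        formdata={"name": "John Doe", "age": "27"},
    )
    print(request.method)  # 'POST'
    print(request.body)    # b'name=John+Doe&age=27'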

.. class:: FormRequest(url, [formdata, ...])

17 changes: 8 additions & 9 deletions docs/topics/selectors.rst
@@ -35,12 +35,11 @@ defines selectors to associate those styles with specific HTML elements.
in speed and parsing accuracy to lxml.

.. _BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/
.. _lxml: http://lxml.de/
.. _lxml: https://lxml.de/
.. _ElementTree: https://docs.python.org/2/library/xml.etree.elementtree.html
.. _cssselect: https://pypi.python.org/pypi/cssselect/
.. _XPath: https://www.w3.org/TR/xpath
.. _XPath: https://www.w3.org/TR/xpath/all/
.. _CSS: https://www.w3.org/TR/selectors
.. _parsel: https://parsel.readthedocs.io/
.. _parsel: https://parsel.readthedocs.io/en/latest/

Using selectors
===============
@@ -255,7 +254,7 @@ that Scrapy (parsel) implements a couple of **non-standard pseudo-elements**:
They will most probably not work with other libraries like
`lxml`_ or `PyQuery`_.

.. _PyQuery: https://pypi.python.org/pypi/pyquery
.. _PyQuery: https://pypi.org/project/pyquery/
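
As a hedged, self-contained illustration of those pseudo-elements (the sample markup is invented here, not taken from the docs)::

    >>> from parsel import Selector
    >>> sel = Selector(text='<a href="/next">Next <strong>page</strong></a>')
    >>> sel.css('a::attr(href)').get()
    '/next'
    >>> sel.css('a::text').get()          # text nodes of <a> itself
    'Next '
    >>> sel.css('a *::text').getall()     # text nodes of descendants
    ['page']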

Examples:

@@ -309,7 +308,7 @@ Examples:
make much sense: text nodes do not have attributes, and attribute values
are string values already and do not have children nodes.

.. _CSS Selectors: https://www.w3.org/TR/css3-selectors/#selectors
.. _CSS Selectors: https://www.w3.org/TR/selectors-3/#selectors

.. _topics-selectors-nesting-selectors:

@@ -504,7 +503,7 @@ Another common case would be to extract all direct ``<p>`` children:
For more details about relative XPaths see the `Location Paths`_ section in the
XPath specification.

.. _Location Paths: https://www.w3.org/TR/xpath#location-paths
.. _Location Paths: https://www.w3.org/TR/xpath/all/#location-paths

When querying by class, consider using CSS
------------------------------------------
@@ -612,7 +611,7 @@ But using the ``.`` to mean the node, works:
>>> sel.xpath("//a[contains(., 'Next Page')]").getall()
['<a href="#">Click here to go to the <strong>Next Page</strong></a>']

.. _`XPath string function`: https://www.w3.org/TR/xpath/#section-String-Functions
.. _`XPath string function`: https://www.w3.org/TR/xpath/all/#section-String-Functions

.. _topics-selectors-xpath-variables:

@@ -764,7 +763,7 @@ Set operations
These can be handy for excluding parts of a document tree before
extracting text elements for example.

Example extracting microdata (sample content taken from http://schema.org/Product)
Example extracting microdata (sample content taken from https://schema.org/Product)
with groups of itemscopes and corresponding itemprops::

>>> doc = u"""
6 changes: 3 additions & 3 deletions docs/topics/shell.rst
@@ -41,7 +41,7 @@ variable; or by defining it in your :ref:`scrapy.cfg <topics-config-settings>`::

.. _IPython: https://ipython.org/
.. _IPython installation guide: https://ipython.org/install.html
.. _bpython: https://www.bpython-interpreter.org/
.. _bpython: https://bpython-interpreter.org/

Launch the shell
================
@@ -142,7 +142,7 @@ Example of shell session
========================

Here's an example of a typical shell session where we start by scraping the
https://scrapy.org page, and then proceed to scrape the https://reddit.com
https://scrapy.org page, and then proceed to scrape the https://old.reddit.com/
page. Finally, we modify the (Reddit) request method to POST and re-fetch it
getting an error. We end the session by typing Ctrl-D (in Unix systems) or
Ctrl-Z in Windows.
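
A hedged sketch of the re-fetch step that paragraph describes (``request`` and ``fetch`` are the shell's injected helpers; the exact error status is not reproduced here)::

    >>> fetch(request.replace(method="POST"))
    >>> response.status   # an error status, since the page rejects POST
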
@@ -182,7 +182,7 @@ After that, we can start playing with the objects:
>>> response.xpath('//title/text()').get()
'Scrapy | A Fast and Powerful Scraping and Web Crawling Framework'

>>> fetch("https://reddit.com")
>>> fetch("https://old.reddit.com/")

>>> response.xpath('//title/text()').get()
'reddit: the front page of the internet'