DOC linkcheck run; https and 301 link updates.
Closes #4359
nyov committed Feb 25, 2020
1 parent caa1dea commit a34c366
Showing 16 changed files with 58 additions and 59 deletions.
1 change: 1 addition & 0 deletions docs/conf.py
@@ -281,6 +281,7 @@

intersphinx_mapping = {
'coverage': ('https://coverage.readthedocs.io/en/stable', None),
'cssselect': ('https://cssselect.readthedocs.io/en/latest', None),
'pytest': ('https://docs.pytest.org/en/latest', None),
'python': ('https://docs.python.org/3', None),
'sphinx': ('https://www.sphinx-doc.org/en/master', None),
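
For context, a minimal sketch (not part of this commit) of how such a mapping is consumed: each key becomes a prefix for cross-references, so the ``:doc:`cssselect <cssselect:index>``` roles added in ``docs/news.rst`` below resolve through the new ``cssselect`` entry. In conf.py terms (URL copied from the hunk above, everything else assumed)::

    # sphinx.ext.intersphinx fetches <url>/objects.inv for each mapping entry
    # (the trailing None means "use the default inventory location") and
    # resolves prefixed roles such as :doc:`cssselect <cssselect:index>`
    # against that external inventory instead of the local docs.
    extensions = ['sphinx.ext.intersphinx']

    intersphinx_mapping = {
        'cssselect': ('https://cssselect.readthedocs.io/en/latest', None),
    }
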
6 changes: 3 additions & 3 deletions docs/contributing.rst
@@ -143,7 +143,7 @@ by running ``git fetch upstream pull/$PR_NUMBER/head:$BRANCH_NAME_TO_CREATE``
(replace 'upstream' with a remote name for scrapy repository,
``$PR_NUMBER`` with an ID of the pull request, and ``$BRANCH_NAME_TO_CREATE``
with a name of the branch you want to create locally).
See also: https://help.github.com/articles/checking-out-pull-requests-locally/#modifying-an-inactive-pull-request-locally.
See also: https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/checking-out-pull-requests-locally#modifying-an-inactive-pull-request-locally.

When writing GitHub pull requests, try to keep titles short but descriptive.
E.g. For bug #411: "Scrapy hangs if an exception raises in start_requests"
@@ -168,7 +168,7 @@ Scrapy:

* Don't put your name in the code you contribute; git provides enough
metadata to identify author of the code.
See https://help.github.com/articles/setting-your-username-in-git/ for
See https://help.github.com/en/github/using-git/setting-your-username-in-git for
setup instructions.

.. _documentation-policies:
@@ -266,5 +266,5 @@ And their unit-tests are in::
.. _tests/: https://github.com/scrapy/scrapy/tree/master/tests
.. _open issues: https://github.com/scrapy/scrapy/issues
.. _PEP 257: https://www.python.org/dev/peps/pep-0257/
.. _pull request: https://help.github.com/en/articles/creating-a-pull-request
.. _pull request: https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/creating-a-pull-request
.. _pytest-xdist: https://github.com/pytest-dev/pytest-xdist
6 changes: 3 additions & 3 deletions docs/faq.rst
@@ -22,8 +22,8 @@ In other words, comparing `BeautifulSoup`_ (or `lxml`_) to Scrapy is like
comparing `jinja2`_ to `Django`_.

.. _BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/
.. _lxml: http://lxml.de/
.. _jinja2: http://jinja.pocoo.org/
.. _lxml: https://lxml.de/
.. _jinja2: https://palletsprojects.com/p/jinja/
.. _Django: https://www.djangoproject.com/

Can I use Scrapy with BeautifulSoup?
@@ -269,7 +269,7 @@ The ``__VIEWSTATE`` parameter is used in sites built with ASP.NET/VB.NET. For
more info on how it works see `this page`_. Also, here's an `example spider`_
which scrapes one of these sites.

.. _this page: http://search.cpan.org/~ecarroll/HTML-TreeBuilderX-ASP_NET-0.09/lib/HTML/TreeBuilderX/ASP_NET.pm
.. _this page: https://metacpan.org/pod/release/ECARROLL/HTML-TreeBuilderX-ASP_NET-0.09/lib/HTML/TreeBuilderX/ASP_NET.pm
.. _example spider: https://github.com/AmbientLighter/rpn-fas/blob/master/fas/spiders/rnp.py
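
As a hedged illustration (not taken from the linked example spider; the URL and field name are placeholders), ``FormRequest.from_response`` is the usual way to carry hidden fields such as ``__VIEWSTATE`` forward, because it pre-populates the form data from the page::

    import scrapy

    class AspNetSpider(scrapy.Spider):
        name = "aspnet_example"  # hypothetical spider
        start_urls = ["https://example.com/page.aspx"]

        def parse(self, response):
            # from_response() copies the hidden inputs (__VIEWSTATE,
            # __EVENTVALIDATION, ...) from the HTML form, so only the
            # visible fields need to be filled in explicitly.
            yield scrapy.FormRequest.from_response(
                response,
                formdata={"ctl00$query": "scrapy"},  # placeholder field name
                callback=self.parse_results,
            )

        def parse_results(self, response):
            self.logger.info("Got %d bytes", len(response.body))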

What's the best way to parse big XML/CSV data feeds?
12 changes: 6 additions & 6 deletions docs/intro/install.rst
@@ -65,7 +65,7 @@ please refer to their respective installation instructions:
* `lxml installation`_
* `cryptography installation`_

.. _lxml installation: http://lxml.de/installation.html
.. _lxml installation: https://lxml.de/installation.html
.. _cryptography installation: https://cryptography.io/en/latest/installation/


@@ -253,11 +253,11 @@ For details, see `Issue #2473 <https://github.com/scrapy/scrapy/issues/2473>`_.
.. _Python: https://www.python.org/
.. _pip: https://pip.pypa.io/en/latest/installing/
.. _lxml: https://lxml.de/index.html
.. _parsel: https://pypi.python.org/pypi/parsel
.. _w3lib: https://pypi.python.org/pypi/w3lib
.. _twisted: https://twistedmatrix.com/
.. _cryptography: https://cryptography.io/
.. _pyOpenSSL: https://pypi.python.org/pypi/pyOpenSSL
.. _parsel: https://pypi.org/project/parsel/
.. _w3lib: https://pypi.org/project/w3lib/
.. _twisted: https://twistedmatrix.com/trac/
.. _cryptography: https://cryptography.io/en/latest/
.. _pyOpenSSL: https://pypi.org/project/pyOpenSSL/
.. _setuptools: https://pypi.python.org/pypi/setuptools
.. _AUR Scrapy package: https://aur.archlinux.org/packages/scrapy/
.. _homebrew: https://brew.sh/
4 changes: 2 additions & 2 deletions docs/intro/tutorial.rst
@@ -306,7 +306,7 @@ with a selector (see :ref:`topics-developer-tools`).
visually selected elements, which works in many browsers.

.. _regular expressions: https://docs.python.org/3/library/re.html
.. _Selector Gadget: http://selectorgadget.com/
.. _Selector Gadget: https://selectorgadget.com/


XPath: a brief intro
@@ -337,7 +337,7 @@ recommend `this tutorial to learn XPath through examples
<http://zvon.org/comp/r/tut-XPath_1.html>`_, and `this tutorial to learn "how
to think in XPath" <http://plasmasturm.org/log/xpath101/>`_.

.. _XPath: https://www.w3.org/TR/xpath
.. _XPath: https://www.w3.org/TR/xpath/all/
.. _CSS: https://www.w3.org/TR/selectors
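
As a hedged aside (the page title below assumes the tutorial's quotes.toscrape.com target), the linked tutorials build toward expressions like these, which can be tried in a ``scrapy shell`` session::

    >>> response.xpath('//title/text()').get()
    'Quotes to Scrape'
    >>> response.css('title::text').get()   # equivalent CSS query
    'Quotes to Scrape'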

Extracting quotes and authors
19 changes: 9 additions & 10 deletions docs/news.rst
@@ -26,7 +26,7 @@ Backward-incompatible changes
* Python 3.4 is no longer supported, and some of the minimum requirements of
Scrapy have also changed:

* cssselect_ 0.9.1
* :doc:`cssselect <cssselect:index>` 0.9.1
* cryptography_ 2.0
* lxml_ 3.5.0
* pyOpenSSL_ 16.2.0
@@ -1616,7 +1616,7 @@ Deprecations and Removals
+ ``scrapy.utils.datatypes.SiteNode``

- The previously bundled ``scrapy.xlib.pydispatch`` library was deprecated and
replaced by `pydispatcher <https://pypi.python.org/pypi/PyDispatcher>`_.
replaced by `pydispatcher <https://pypi.org/project/PyDispatcher/>`_.


Relocations
@@ -2450,7 +2450,7 @@ Other
~~~~~

- Dropped Python 2.6 support (:issue:`448`)
- Add `cssselect`_ python package as install dependency
- Add :doc:`cssselect <cssselect:index>` python package as install dependency
- Drop libxml2 and multi selector's backend support, `lxml`_ is required from now on.
- Minimum Twisted version increased to 10.0.0, dropped Twisted 8.0 support.
- Running test suite now requires ``mock`` python library (:issue:`390`)
@@ -3047,17 +3047,16 @@ Scrapy 0.7
First release of Scrapy.


.. _AJAX crawleable urls: https://developers.google.com/webmasters/ajax-crawling/docs/getting-started?csw=1
.. _AJAX crawleable urls: https://developers.google.com/search/docs/ajax-crawling/docs/getting-started?csw=1
.. _botocore: https://github.com/boto/botocore
.. _chunked transfer encoding: https://en.wikipedia.org/wiki/Chunked_transfer_encoding
.. _ClientForm: http://wwwsearch.sourceforge.net/old/ClientForm/
.. _Creating a pull request: https://help.github.com/en/articles/creating-a-pull-request
.. _cryptography: https://cryptography.io/en/latest/
.. _cssselect: https://github.com/scrapy/cssselect/
.. _docstrings: https://docs.python.org/glossary.html#term-docstring
.. _KeyboardInterrupt: https://docs.python.org/library/exceptions.html#KeyboardInterrupt
.. _docstrings: https://docs.python.org/3/glossary.html#term-docstring
.. _KeyboardInterrupt: https://docs.python.org/3/library/exceptions.html#KeyboardInterrupt
.. _LevelDB: https://github.com/google/leveldb
.. _lxml: http://lxml.de/
.. _lxml: https://lxml.de/
.. _marshal: https://docs.python.org/2/library/marshal.html
.. _parsel.csstranslator.GenericTranslator: https://parsel.readthedocs.io/en/latest/parsel.html#parsel.csstranslator.GenericTranslator
.. _parsel.csstranslator.HTMLTranslator: https://parsel.readthedocs.io/en/latest/parsel.html#parsel.csstranslator.HTMLTranslator
@@ -3068,11 +3067,11 @@ First release of Scrapy.
.. _queuelib: https://github.com/scrapy/queuelib
.. _registered with IANA: https://www.iana.org/assignments/media-types/media-types.xhtml
.. _resource: https://docs.python.org/2/library/resource.html
.. _robots.txt: http://www.robotstxt.org/
.. _robots.txt: https://www.robotstxt.org/
.. _scrapely: https://github.com/scrapy/scrapely
.. _service_identity: https://service-identity.readthedocs.io/en/stable/
.. _six: https://six.readthedocs.io/
.. _tox: https://pypi.python.org/pypi/tox
.. _tox: https://pypi.org/project/tox/
.. _Twisted: https://twistedmatrix.com/trac/
.. _Twisted - hello, asynchronous programming: http://jessenoller.com/blog/2009/02/11/twisted-hello-asynchronous-programming/
.. _w3lib: https://github.com/scrapy/w3lib
2 changes: 1 addition & 1 deletion docs/topics/broad-crawls.rst
@@ -188,7 +188,7 @@ AjaxCrawlMiddleware helps to crawl them correctly.
It is turned OFF by default because it has some performance overhead,
and enabling it for focused crawls doesn't make much sense.

.. _ajax crawlable: https://developers.google.com/webmasters/ajax-crawling/docs/getting-started
.. _ajax crawlable: https://developers.google.com/search/docs/ajax-crawling/docs/getting-started
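
A hedged ``settings.py`` sketch (assumed values, not part of this commit) of opting in for a broad crawl::

    # AjaxCrawlMiddleware ships with Scrapy but is disabled by default;
    # enabling it is a single setting.
    AJAXCRAWL_ENABLED = True

    # Typical companion settings for broad crawls discussed on this page.
    CONCURRENT_REQUESTS = 100
    REACTOR_THREADPOOL_MAXSIZE = 20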

.. _broad-crawls-bfo:

12 changes: 6 additions & 6 deletions docs/topics/downloader-middleware.rst
@@ -709,7 +709,7 @@ HttpCompressionMiddleware
provided `brotlipy`_ is installed.

.. _brotli-compressed: https://www.ietf.org/rfc/rfc7932.txt
.. _brotlipy: https://pypi.python.org/pypi/brotlipy
.. _brotlipy: https://pypi.org/project/brotlipy/

HttpCompressionMiddleware Settings
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -1038,7 +1038,7 @@ Based on `RobotFileParser
* is Python's built-in robots.txt_ parser

* is compliant with `Martijn Koster's 1996 draft specification
<http://www.robotstxt.org/norobots-rfc.txt>`_
<https://www.robotstxt.org/norobots-rfc.txt>`_

* lacks support for wildcard matching

@@ -1061,7 +1061,7 @@ Based on `Reppy <https://github.com/seomoz/reppy/>`_:
<https://github.com/seomoz/rep-cpp>`_

* is compliant with `Martijn Koster's 1996 draft specification
<http://www.robotstxt.org/norobots-rfc.txt>`_
<https://www.robotstxt.org/norobots-rfc.txt>`_

* supports wildcard matching

@@ -1086,7 +1086,7 @@ Based on `Robotexclusionrulesparser <http://nikitathespider.com/python/rerp/>`_:
* implemented in Python

* is compliant with `Martijn Koster's 1996 draft specification
<http://www.robotstxt.org/norobots-rfc.txt>`_
<https://www.robotstxt.org/norobots-rfc.txt>`_

* supports wildcard matching

@@ -1115,7 +1115,7 @@ implementing the methods described below.
.. autoclass:: RobotParser
:members:

.. _robots.txt: http://www.robotstxt.org/
.. _robots.txt: https://www.robotstxt.org/
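
For illustration, a hedged ``settings.py`` sketch of selecting one of the parsers compared above (the import path is an assumption; check ``scrapy.robotstxt`` for the exact class names)::

    ROBOTSTXT_OBEY = True
    # Pick the backend explicitly instead of relying on the default:
    ROBOTSTXT_PARSER = "scrapy.robotstxt.PythonRobotParser"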

DownloaderStats
---------------
@@ -1155,7 +1155,7 @@ AjaxCrawlMiddleware

Middleware that finds 'AJAX crawlable' page variants based
on meta-fragment html tag. See
https://developers.google.com/webmasters/ajax-crawling/docs/getting-started
https://developers.google.com/search/docs/ajax-crawling/docs/getting-started
for more info.

.. note::
6 changes: 3 additions & 3 deletions docs/topics/dynamic-content.rst
@@ -241,12 +241,12 @@ along with `scrapy-selenium`_ for seamless integration.
.. _headless browser: https://en.wikipedia.org/wiki/Headless_browser
.. _JavaScript: https://en.wikipedia.org/wiki/JavaScript
.. _js2xml: https://github.com/scrapinghub/js2xml
.. _json.loads: https://docs.python.org/library/json.html#json.loads
.. _json.loads: https://docs.python.org/3/library/json.html#json.loads
.. _pytesseract: https://github.com/madmaze/pytesseract
.. _regular expression: https://docs.python.org/library/re.html
.. _regular expression: https://docs.python.org/3/library/re.html
.. _scrapy-selenium: https://github.com/clemfromspace/scrapy-selenium
.. _scrapy-splash: https://github.com/scrapy-plugins/scrapy-splash
.. _Selenium: https://www.seleniumhq.org/
.. _Selenium: https://www.selenium.dev/
.. _Splash: https://github.com/scrapinghub/splash
.. _tabula-py: https://github.com/chezou/tabula-py
.. _wget: https://www.gnu.org/software/wget/
4 changes: 2 additions & 2 deletions docs/topics/item-pipeline.rst
@@ -158,8 +158,8 @@ method and how to clean up the resources properly.::
self.db[self.collection_name].insert_one(dict(item))
return item

.. _MongoDB: https://www.mongodb.org/
.. _pymongo: https://api.mongodb.org/python/current/
.. _MongoDB: https://www.mongodb.com/
.. _pymongo: https://api.mongodb.com/python/current/
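
For readers seeing only this hunk, a condensed sketch of the pipeline the two lines above belong to (names follow the documented example; treat it as illustrative)::

    import pymongo

    class MongoPipeline:
        collection_name = "scrapy_items"

        def __init__(self, mongo_uri, mongo_db):
            self.mongo_uri = mongo_uri
            self.mongo_db = mongo_db

        @classmethod
        def from_crawler(cls, crawler):
            # Read connection details from the project settings.
            return cls(
                mongo_uri=crawler.settings.get("MONGO_URI"),
                mongo_db=crawler.settings.get("MONGO_DATABASE", "items"),
            )

        def open_spider(self, spider):
            self.client = pymongo.MongoClient(self.mongo_uri)
            self.db = self.client[self.mongo_db]

        def close_spider(self, spider):
            self.client.close()

        def process_item(self, item, spider):
            # The two lines shown in the diff above: insert and return the item.
            self.db[self.collection_name].insert_one(dict(item))
            return item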


Take screenshot of item
4 changes: 2 additions & 2 deletions docs/topics/items.rst
@@ -166,7 +166,7 @@ If your item contains mutable_ values like lists or dictionaries, a shallow
copy will keep references to the same mutable values across all different
copies.

.. _mutable: https://docs.python.org/glossary.html#term-mutable
.. _mutable: https://docs.python.org/3/glossary.html#term-mutable

For example, if you have an item with a list of tags, and you create a shallow
copy of that item, both the original item and the copy have the same list of
@@ -177,7 +177,7 @@ If that is not the desired behavior, use a deep copy instead.

See the `documentation of the copy module`_ for more information.

.. _documentation of the copy module: https://docs.python.org/library/copy.html
.. _documentation of the copy module: https://docs.python.org/3/library/copy.html
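
A hedged sketch of the behaviour described here (the item class and field are illustrative)::

    import copy
    import scrapy

    class Product(scrapy.Item):
        tags = scrapy.Field()  # holds a mutable list

    item = Product(tags=["books"])

    shallow = item.copy()           # shallow copy: both items share the list
    shallow["tags"].append("used")
    print(item["tags"])             # ['books', 'used'] - the original changed too

    deep = copy.deepcopy(item)      # deep copy: nested values are duplicated
    deep["tags"].append("rare")
    print(item["tags"])             # still ['books', 'used'] - unaffected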

To create a shallow copy of an item, you can either call
:meth:`~scrapy.item.Item.copy` on an existing item
10 changes: 5 additions & 5 deletions docs/topics/leaks.rst
@@ -206,7 +206,7 @@ objects. If this is your case, and you can't find your leaks using ``trackref``,
you still have another resource: the `Guppy library`_.
If you're using Python3, see :ref:`topics-leaks-muppy`.

.. _Guppy library: https://pypi.python.org/pypi/guppy
.. _Guppy library: https://pypi.org/project/guppy/

If you use ``pip``, you can install Guppy with the following command::

@@ -311,9 +311,9 @@ though neither Scrapy nor your project are leaking memory. This is due to a
(not so well) known problem of Python, which may not return released memory to
the operating system in some cases. For more information on this issue see:

* `Python Memory Management <http://www.evanjones.ca/python-memory.html>`_
* `Python Memory Management Part 2 <http://www.evanjones.ca/python-memory-part2.html>`_
* `Python Memory Management Part 3 <http://www.evanjones.ca/python-memory-part3.html>`_
* `Python Memory Management <https://www.evanjones.ca/python-memory.html>`_
* `Python Memory Management Part 2 <https://www.evanjones.ca/python-memory-part2.html>`_
* `Python Memory Management Part 3 <https://www.evanjones.ca/python-memory-part3.html>`_

The improvements proposed by Evan Jones, which are detailed in `this paper`_,
got merged in Python 2.5, but this only reduces the problem, it doesn't fix it
@@ -327,7 +327,7 @@ completely. To quote the paper:
to move to a compacting garbage collector, which is able to move objects in
memory. This would require significant changes to the Python interpreter.*

.. _this paper: http://www.evanjones.ca/memoryallocator/
.. _this paper: https://www.evanjones.ca/memoryallocator/

To keep memory consumption reasonable you can split the job into several
smaller jobs or enable :ref:`persistent job queue <topics-jobs>`
2 changes: 1 addition & 1 deletion docs/topics/request-response.rst
@@ -396,7 +396,7 @@ The FormRequest class extends the base :class:`Request` with functionality for
dealing with HTML forms. It uses `lxml.html forms`_ to pre-populate form
fields with form data from :class:`Response` objects.

.. _lxml.html forms: http://lxml.de/lxmlhtml.html#forms
.. _lxml.html forms: https://lxml.de/lxmlhtml.html#forms
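
A small hedged sketch of the plain ``formdata`` form of the constructor (placeholder URL and field names), as distinct from ``from_response``::

    from scrapy import FormRequest

    # Builds a POST request whose body is URL-encoded form data.
    request = FormRequest(
        url="https://example.com/post/action",
        formdata={"name": "John Doe", "age": "27"},
    )
    print(request.method)  # 'POST'
    print(request.body)    # b'name=John+Doe&age=27'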

.. class:: FormRequest(url, [formdata, ...])

17 changes: 8 additions & 9 deletions docs/topics/selectors.rst
@@ -35,12 +35,11 @@ defines selectors to associate those styles with specific HTML elements.
in speed and parsing accuracy to lxml.

.. _BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/
.. _lxml: http://lxml.de/
.. _lxml: https://lxml.de/
.. _ElementTree: https://docs.python.org/2/library/xml.etree.elementtree.html
.. _cssselect: https://pypi.python.org/pypi/cssselect/
.. _XPath: https://www.w3.org/TR/xpath
.. _XPath: https://www.w3.org/TR/xpath/all/
.. _CSS: https://www.w3.org/TR/selectors
.. _parsel: https://parsel.readthedocs.io/
.. _parsel: https://parsel.readthedocs.io/en/latest/

Using selectors
===============
@@ -255,7 +254,7 @@ that Scrapy (parsel) implements a couple of **non-standard pseudo-elements**:
They will most probably not work with other libraries like
`lxml`_ or `PyQuery`_.

.. _PyQuery: https://pypi.python.org/pypi/pyquery
.. _PyQuery: https://pypi.org/project/pyquery/
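
As a hedged, self-contained illustration of those pseudo-elements (the sample markup is invented here, not taken from the docs)::

    >>> from parsel import Selector
    >>> sel = Selector(text='<a href="/next">Next <strong>page</strong></a>')
    >>> sel.css('a::attr(href)').get()
    '/next'
    >>> sel.css('a::text').get()          # text nodes of <a> itself
    'Next '
    >>> sel.css('a *::text').getall()     # text nodes of descendants
    ['page']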

Examples:

@@ -309,7 +308,7 @@ Examples:
make much sense: text nodes do not have attributes, and attribute values
are string values already and do not have children nodes.

.. _CSS Selectors: https://www.w3.org/TR/css3-selectors/#selectors
.. _CSS Selectors: https://www.w3.org/TR/selectors-3/#selectors

.. _topics-selectors-nesting-selectors:

@@ -504,7 +503,7 @@ Another common case would be to extract all direct ``<p>`` children:
For more details about relative XPaths see the `Location Paths`_ section in the
XPath specification.

.. _Location Paths: https://www.w3.org/TR/xpath#location-paths
.. _Location Paths: https://www.w3.org/TR/xpath/all/#location-paths

When querying by class, consider using CSS
------------------------------------------
@@ -612,7 +611,7 @@ But using the ``.`` to mean the node, works:
>>> sel.xpath("//a[contains(., 'Next Page')]").getall()
['<a href="#">Click here to go to the <strong>Next Page</strong></a>']

.. _`XPath string function`: https://www.w3.org/TR/xpath/#section-String-Functions
.. _`XPath string function`: https://www.w3.org/TR/xpath/all/#section-String-Functions

.. _topics-selectors-xpath-variables:

@@ -764,7 +763,7 @@ Set operations
These can be handy for excluding parts of a document tree before
extracting text elements for example.

Example extracting microdata (sample content taken from http://schema.org/Product)
Example extracting microdata (sample content taken from https://schema.org/Product)
with groups of itemscopes and corresponding itemprops::

>>> doc = u"""
6 changes: 3 additions & 3 deletions docs/topics/shell.rst
@@ -41,7 +41,7 @@ variable; or by defining it in your :ref:`scrapy.cfg <topics-config-settings>`::

.. _IPython: https://ipython.org/
.. _IPython installation guide: https://ipython.org/install.html
.. _bpython: https://www.bpython-interpreter.org/
.. _bpython: https://bpython-interpreter.org/

Launch the shell
================
@@ -142,7 +142,7 @@ Example of shell session
========================

Here's an example of a typical shell session where we start by scraping the
https://scrapy.org page, and then proceed to scrape the https://reddit.com
https://scrapy.org page, and then proceed to scrape the https://old.reddit.com/
page. Finally, we modify the (Reddit) request method to POST and re-fetch it
getting an error. We end the session by typing Ctrl-D (in Unix systems) or
Ctrl-Z in Windows.
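
A hedged sketch of the re-fetch step that paragraph describes (``request`` and ``fetch`` are the shell's injected helpers; the exact error status is not reproduced here)::

    >>> fetch(request.replace(method="POST"))
    >>> response.status   # an error status, since the page rejects POST
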
@@ -182,7 +182,7 @@ After that, we can start playing with the objects:
>>> response.xpath('//title/text()').get()
'Scrapy | A Fast and Powerful Scraping and Web Crawling Framework'

>>> fetch("https://reddit.com")
>>> fetch("https://old.reddit.com/")

>>> response.xpath('//title/text()').get()
'reddit: the front page of the internet'