Documentation linkcheck run, fixing some links. #4361

Merged
merged 1 commit on Feb 25, 2020
1 change: 1 addition & 0 deletions docs/conf.py
@@ -281,6 +281,7 @@

intersphinx_mapping = {
'coverage': ('https://coverage.readthedocs.io/en/stable', None),
'cssselect': ('https://cssselect.readthedocs.io/en/latest', None),
'pytest': ('https://docs.pytest.org/en/latest', None),
'python': ('https://docs.python.org/3', None),
'sphinx': ('https://www.sphinx-doc.org/en/master', None),
6 changes: 3 additions & 3 deletions docs/contributing.rst
@@ -143,7 +143,7 @@ by running ``git fetch upstream pull/$PR_NUMBER/head:$BRANCH_NAME_TO_CREATE``
(replace 'upstream' with a remote name for scrapy repository,
``$PR_NUMBER`` with an ID of the pull request, and ``$BRANCH_NAME_TO_CREATE``
with a name of the branch you want to create locally).
See also: https://help.github.com/articles/checking-out-pull-requests-locally/#modifying-an-inactive-pull-request-locally.
See also: https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/checking-out-pull-requests-locally#modifying-an-inactive-pull-request-locally.

When writing GitHub pull requests, try to keep titles short but descriptive.
E.g. For bug #411: "Scrapy hangs if an exception raises in start_requests"
@@ -168,7 +168,7 @@ Scrapy:

* Don't put your name in the code you contribute; git provides enough
metadata to identify author of the code.
See https://help.github.com/articles/setting-your-username-in-git/ for
See https://help.github.com/en/github/using-git/setting-your-username-in-git for
setup instructions.

.. _documentation-policies:
@@ -266,5 +266,5 @@ And their unit-tests are in::
.. _tests/: https://github.com/scrapy/scrapy/tree/master/tests
.. _open issues: https://github.com/scrapy/scrapy/issues
.. _PEP 257: https://www.python.org/dev/peps/pep-0257/
.. _pull request: https://help.github.com/en/articles/creating-a-pull-request
.. _pull request: https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/creating-a-pull-request
.. _pytest-xdist: https://github.com/pytest-dev/pytest-xdist
6 changes: 3 additions & 3 deletions docs/faq.rst
@@ -22,8 +22,8 @@ In other words, comparing `BeautifulSoup`_ (or `lxml`_) to Scrapy is like
comparing `jinja2`_ to `Django`_.

.. _BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/
.. _lxml: http://lxml.de/
.. _jinja2: http://jinja.pocoo.org/
.. _lxml: https://lxml.de/
.. _jinja2: https://palletsprojects.com/p/jinja/
.. _Django: https://www.djangoproject.com/

Can I use Scrapy with BeautifulSoup?
@@ -269,7 +269,7 @@ The ``__VIEWSTATE`` parameter is used in sites built with ASP.NET/VB.NET. For
more info on how it works see `this page`_. Also, here's an `example spider`_
which scrapes one of these sites.

.. _this page: http://search.cpan.org/~ecarroll/HTML-TreeBuilderX-ASP_NET-0.09/lib/HTML/TreeBuilderX/ASP_NET.pm
.. _this page: https://metacpan.org/pod/release/ECARROLL/HTML-TreeBuilderX-ASP_NET-0.09/lib/HTML/TreeBuilderX/ASP_NET.pm
.. _example spider: https://github.com/AmbientLighter/rpn-fas/blob/master/fas/spiders/rnp.py
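
As a rough illustration of the approach (the URL and form field names below are made up), ``FormRequest.from_response`` copies hidden inputs such as ``__VIEWSTATE`` from the page's form, so a spider along these lines could work::

    import scrapy
    from scrapy.http import FormRequest

    class AspNetExampleSpider(scrapy.Spider):
        # Hypothetical spider; replace the URL and form fields with real ones.
        name = "aspnet_example"
        start_urls = ["https://example.com/search.aspx"]

        def parse(self, response):
            # from_response() pre-populates hidden fields (including
            # __VIEWSTATE and __EVENTVALIDATION) from the response's form,
            # so only the visible fields need to be supplied here.
            yield FormRequest.from_response(
                response,
                formdata={"txtSearch": "scrapy"},
                callback=self.parse_results,
            )

        def parse_results(self, response):
            for row in response.css("table tr"):
                yield {"text": " ".join(row.css("td::text").getall())}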

What's the best way to parse big XML/CSV data feeds?
12 changes: 6 additions & 6 deletions docs/intro/install.rst
@@ -65,7 +65,7 @@ please refer to their respective installation instructions:
* `lxml installation`_
* `cryptography installation`_

.. _lxml installation: http://lxml.de/installation.html
.. _lxml installation: https://lxml.de/installation.html
.. _cryptography installation: https://cryptography.io/en/latest/installation/


@@ -253,11 +253,11 @@ For details, see `Issue #2473 <https://github.com/scrapy/scrapy/issues/2473>`_.
.. _Python: https://www.python.org/
.. _pip: https://pip.pypa.io/en/latest/installing/
.. _lxml: https://lxml.de/index.html
.. _parsel: https://pypi.python.org/pypi/parsel
.. _w3lib: https://pypi.python.org/pypi/w3lib
.. _twisted: https://twistedmatrix.com/
.. _cryptography: https://cryptography.io/
.. _pyOpenSSL: https://pypi.python.org/pypi/pyOpenSSL
.. _parsel: https://pypi.org/project/parsel/
.. _w3lib: https://pypi.org/project/w3lib/
.. _twisted: https://twistedmatrix.com/trac/
.. _cryptography: https://cryptography.io/en/latest/
.. _pyOpenSSL: https://pypi.org/project/pyOpenSSL/
.. _setuptools: https://pypi.python.org/pypi/setuptools
.. _AUR Scrapy package: https://aur.archlinux.org/packages/scrapy/
.. _homebrew: https://brew.sh/
4 changes: 2 additions & 2 deletions docs/intro/tutorial.rst
@@ -306,7 +306,7 @@ with a selector (see :ref:`topics-developer-tools`).
visually selected elements, which works in many browsers.

.. _regular expressions: https://docs.python.org/3/library/re.html
.. _Selector Gadget: http://selectorgadget.com/
.. _Selector Gadget: https://selectorgadget.com/


XPath: a brief intro
@@ -337,7 +337,7 @@ recommend `this tutorial to learn XPath through examples
<http://zvon.org/comp/r/tut-XPath_1.html>`_, and `this tutorial to learn "how
to think in XPath" <http://plasmasturm.org/log/xpath101/>`_.

.. _XPath: https://www.w3.org/TR/xpath
.. _XPath: https://www.w3.org/TR/xpath/all/
.. _CSS: https://www.w3.org/TR/selectors
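
As a quick, self-contained taste of the syntax covered here (the HTML string is invented for illustration), XPath and CSS queries can be tried directly on a ``Selector``::

    from scrapy.selector import Selector

    sel = Selector(
        text="<html><body><h1>Quotes</h1><a href='/page/2/'>Next</a></body></html>"
    )

    print(sel.xpath("//h1/text()").get())   # 'Quotes'
    print(sel.xpath("//a/@href").get())     # '/page/2/'
    print(sel.css("a::attr(href)").get())   # '/page/2/'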

Extracting quotes and authors
19 changes: 9 additions & 10 deletions docs/news.rst
@@ -26,7 +26,7 @@ Backward-incompatible changes
* Python 3.4 is no longer supported, and some of the minimum requirements of
Scrapy have also changed:

* cssselect_ 0.9.1
* :doc:`cssselect <cssselect:index>` 0.9.1
* cryptography_ 2.0
* lxml_ 3.5.0
* pyOpenSSL_ 16.2.0
@@ -1616,7 +1616,7 @@ Deprecations and Removals
+ ``scrapy.utils.datatypes.SiteNode``

- The previously bundled ``scrapy.xlib.pydispatch`` library was deprecated and
replaced by `pydispatcher <https://pypi.python.org/pypi/PyDispatcher>`_.
replaced by `pydispatcher <https://pypi.org/project/PyDispatcher/>`_.


Relocations
@@ -2450,7 +2450,7 @@ Other
~~~~~

- Dropped Python 2.6 support (:issue:`448`)
- Add `cssselect`_ python package as install dependency
- Add :doc:`cssselect <cssselect:index>` python package as install dependency
- Drop libxml2 and multi selector's backend support, `lxml`_ is required from now on.
- Minimum Twisted version increased to 10.0.0, dropped Twisted 8.0 support.
- Running test suite now requires ``mock`` python library (:issue:`390`)
@@ -3047,17 +3047,16 @@ Scrapy 0.7
First release of Scrapy.


.. _AJAX crawleable urls: https://developers.google.com/webmasters/ajax-crawling/docs/getting-started?csw=1
.. _AJAX crawleable urls: https://developers.google.com/search/docs/ajax-crawling/docs/getting-started?csw=1
.. _botocore: https://github.com/boto/botocore
.. _chunked transfer encoding: https://en.wikipedia.org/wiki/Chunked_transfer_encoding
.. _ClientForm: http://wwwsearch.sourceforge.net/old/ClientForm/
.. _Creating a pull request: https://help.github.com/en/articles/creating-a-pull-request
.. _cryptography: https://cryptography.io/en/latest/
.. _cssselect: https://github.com/scrapy/cssselect/
.. _docstrings: https://docs.python.org/glossary.html#term-docstring
.. _KeyboardInterrupt: https://docs.python.org/library/exceptions.html#KeyboardInterrupt
.. _docstrings: https://docs.python.org/3/glossary.html#term-docstring
.. _KeyboardInterrupt: https://docs.python.org/3/library/exceptions.html#KeyboardInterrupt
.. _LevelDB: https://github.com/google/leveldb
.. _lxml: http://lxml.de/
.. _lxml: https://lxml.de/
.. _marshal: https://docs.python.org/2/library/marshal.html
.. _parsel.csstranslator.GenericTranslator: https://parsel.readthedocs.io/en/latest/parsel.html#parsel.csstranslator.GenericTranslator
.. _parsel.csstranslator.HTMLTranslator: https://parsel.readthedocs.io/en/latest/parsel.html#parsel.csstranslator.HTMLTranslator
@@ -3068,11 +3067,11 @@ First release of Scrapy.
.. _queuelib: https://github.com/scrapy/queuelib
.. _registered with IANA: https://www.iana.org/assignments/media-types/media-types.xhtml
.. _resource: https://docs.python.org/2/library/resource.html
.. _robots.txt: http://www.robotstxt.org/
.. _robots.txt: https://www.robotstxt.org/
.. _scrapely: https://github.com/scrapy/scrapely
.. _service_identity: https://service-identity.readthedocs.io/en/stable/
.. _six: https://six.readthedocs.io/
.. _tox: https://pypi.python.org/pypi/tox
.. _tox: https://pypi.org/project/tox/
.. _Twisted: https://twistedmatrix.com/trac/
.. _Twisted - hello, asynchronous programming: http://jessenoller.com/blog/2009/02/11/twisted-hello-asynchronous-programming/
.. _w3lib: https://github.com/scrapy/w3lib
2 changes: 1 addition & 1 deletion docs/topics/broad-crawls.rst
@@ -188,7 +188,7 @@ AjaxCrawlMiddleware helps to crawl them correctly.
It is turned OFF by default because it has some performance overhead,
and enabling it for focused crawls doesn't make much sense.

.. _ajax crawlable: https://developers.google.com/webmasters/ajax-crawling/docs/getting-started
.. _ajax crawlable: https://developers.google.com/search/docs/ajax-crawling/docs/getting-started
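
If a broad crawl does need it, the middleware can be switched on in the project settings; a minimal sketch using the ``AJAXCRAWL_ENABLED`` setting documented for this middleware::

    # settings.py
    # Enable AjaxCrawlMiddleware; it is off by default because of the
    # extra overhead it adds when inspecting HTML responses.
    AJAXCRAWL_ENABLED = True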

.. _broad-crawls-bfo:

12 changes: 6 additions & 6 deletions docs/topics/downloader-middleware.rst
@@ -709,7 +709,7 @@ HttpCompressionMiddleware
provided `brotlipy`_ is installed.

.. _brotli-compressed: https://www.ietf.org/rfc/rfc7932.txt
.. _brotlipy: https://pypi.python.org/pypi/brotlipy
.. _brotlipy: https://pypi.org/project/brotlipy/

HttpCompressionMiddleware Settings
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -1038,7 +1038,7 @@ Based on `RobotFileParser
* is Python's built-in robots.txt_ parser

* is compliant with `Martijn Koster's 1996 draft specification
<http://www.robotstxt.org/norobots-rfc.txt>`_
<https://www.robotstxt.org/norobots-rfc.txt>`_

* lacks support for wildcard matching

@@ -1061,7 +1061,7 @@ Based on `Reppy <https://github.com/seomoz/reppy/>`_:
<https://github.com/seomoz/rep-cpp>`_

* is compliant with `Martijn Koster's 1996 draft specification
<http://www.robotstxt.org/norobots-rfc.txt>`_
<https://www.robotstxt.org/norobots-rfc.txt>`_

* supports wildcard matching

@@ -1086,7 +1086,7 @@ Based on `Robotexclusionrulesparser <http://nikitathespider.com/python/rerp/>`_:
* implemented in Python

* is compliant with `Martijn Koster's 1996 draft specification
<http://www.robotstxt.org/norobots-rfc.txt>`_
<https://www.robotstxt.org/norobots-rfc.txt>`_

* supports wildcard matching
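
A minimal sketch of how a project might pick one of these backends, assuming a Scrapy version where the ``ROBOTSTXT_PARSER`` setting and these class paths are available::

    # settings.py
    ROBOTSTXT_OBEY = True

    # Built-in RobotFileParser-based implementation:
    ROBOTSTXT_PARSER = "scrapy.robotstxt.PythonRobotParser"

    # Alternatives, if the corresponding optional dependency is installed:
    # ROBOTSTXT_PARSER = "scrapy.robotstxt.ReppyRobotParser"
    # ROBOTSTXT_PARSER = "scrapy.robotstxt.RerpRobotParser"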

Expand Down Expand Up @@ -1115,7 +1115,7 @@ implementing the methods described below.
.. autoclass:: RobotParser
:members:

.. _robots.txt: http://www.robotstxt.org/
.. _robots.txt: https://www.robotstxt.org/

DownloaderStats
---------------
@@ -1155,7 +1155,7 @@ AjaxCrawlMiddleware

Middleware that finds 'AJAX crawlable' page variants based
on meta-fragment html tag. See
https://developers.google.com/webmasters/ajax-crawling/docs/getting-started
https://developers.google.com/search/docs/ajax-crawling/docs/getting-started
for more info.

.. note::
6 changes: 3 additions & 3 deletions docs/topics/dynamic-content.rst
@@ -241,12 +241,12 @@ along with `scrapy-selenium`_ for seamless integration.
.. _headless browser: https://en.wikipedia.org/wiki/Headless_browser
.. _JavaScript: https://en.wikipedia.org/wiki/JavaScript
.. _js2xml: https://github.com/scrapinghub/js2xml
.. _json.loads: https://docs.python.org/library/json.html#json.loads
.. _json.loads: https://docs.python.org/3/library/json.html#json.loads
.. _pytesseract: https://github.com/madmaze/pytesseract
.. _regular expression: https://docs.python.org/library/re.html
.. _regular expression: https://docs.python.org/3/library/re.html
.. _scrapy-selenium: https://github.com/clemfromspace/scrapy-selenium
.. _scrapy-splash: https://github.com/scrapy-plugins/scrapy-splash
.. _Selenium: https://www.seleniumhq.org/
.. _Selenium: https://www.selenium.dev/
.. _Splash: https://github.com/scrapinghub/splash
.. _tabula-py: https://github.com/chezou/tabula-py
.. _wget: https://www.gnu.org/software/wget/
4 changes: 2 additions & 2 deletions docs/topics/item-pipeline.rst
@@ -158,8 +158,8 @@ method and how to clean up the resources properly.::
self.db[self.collection_name].insert_one(dict(item))
return item

.. _MongoDB: https://www.mongodb.org/
.. _pymongo: https://api.mongodb.org/python/current/
.. _MongoDB: https://www.mongodb.com/
.. _pymongo: https://api.mongodb.com/python/current/
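
For context, the two lines shown above are the tail of a pipeline roughly like the following condensed sketch (``MONGO_URI`` and ``MONGO_DATABASE`` are example setting names)::

    import pymongo

    class MongoPipeline:
        collection_name = "scrapy_items"

        def __init__(self, mongo_uri, mongo_db):
            self.mongo_uri = mongo_uri
            self.mongo_db = mongo_db

        @classmethod
        def from_crawler(cls, crawler):
            return cls(
                mongo_uri=crawler.settings.get("MONGO_URI"),
                mongo_db=crawler.settings.get("MONGO_DATABASE", "items"),
            )

        def open_spider(self, spider):
            # One client per spider run; closed again in close_spider().
            self.client = pymongo.MongoClient(self.mongo_uri)
            self.db = self.client[self.mongo_db]

        def close_spider(self, spider):
            self.client.close()

        def process_item(self, item, spider):
            self.db[self.collection_name].insert_one(dict(item))
            return item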


Take screenshot of item
4 changes: 2 additions & 2 deletions docs/topics/items.rst
@@ -166,7 +166,7 @@ If your item contains mutable_ values like lists or dictionaries, a shallow
copy will keep references to the same mutable values across all different
copies.

.. _mutable: https://docs.python.org/glossary.html#term-mutable
.. _mutable: https://docs.python.org/3/glossary.html#term-mutable

For example, if you have an item with a list of tags, and you create a shallow
copy of that item, both the original item and the copy have the same list of
@@ -177,7 +177,7 @@ If that is not the desired behavior, use a deep copy instead.

See the `documentation of the copy module`_ for more information.

.. _documentation of the copy module: https://docs.python.org/library/copy.html
.. _documentation of the copy module: https://docs.python.org/3/library/copy.html
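
A short illustration of the difference, using a hypothetical ``Product`` item with a mutable ``tags`` field::

    import copy
    import scrapy

    class Product(scrapy.Item):
        name = scrapy.Field()
        tags = scrapy.Field()

    item = Product(name="book", tags=["fiction"])

    shallow = item.copy()           # the copy shares the same list object
    shallow["tags"].append("cheap")
    print(item["tags"])             # ['fiction', 'cheap'] -- original affected

    deep = copy.deepcopy(item)      # mutable values are copied as well
    deep["tags"].append("hardcover")
    print(item["tags"])             # still ['fiction', 'cheap'] -- unaffected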

To create a shallow copy of an item, you can either call
:meth:`~scrapy.item.Item.copy` on an existing item
10 changes: 5 additions & 5 deletions docs/topics/leaks.rst
@@ -206,7 +206,7 @@ objects. If this is your case, and you can't find your leaks using ``trackref``,
you still have another resource: the `Guppy library`_.
If you're using Python3, see :ref:`topics-leaks-muppy`.

.. _Guppy library: https://pypi.python.org/pypi/guppy
.. _Guppy library: https://pypi.org/project/guppy/

If you use ``pip``, you can install Guppy with the following command::

@@ -311,9 +311,9 @@ though neither Scrapy nor your project are leaking memory. This is due to a
(not so well) known problem of Python, which may not return released memory to
the operating system in some cases. For more information on this issue see:

* `Python Memory Management <http://www.evanjones.ca/python-memory.html>`_
* `Python Memory Management Part 2 <http://www.evanjones.ca/python-memory-part2.html>`_
* `Python Memory Management Part 3 <http://www.evanjones.ca/python-memory-part3.html>`_
* `Python Memory Management <https://www.evanjones.ca/python-memory.html>`_
* `Python Memory Management Part 2 <https://www.evanjones.ca/python-memory-part2.html>`_
* `Python Memory Management Part 3 <https://www.evanjones.ca/python-memory-part3.html>`_

The improvements proposed by Evan Jones, which are detailed in `this paper`_,
got merged in Python 2.5, but this only reduces the problem, it doesn't fix it
@@ -327,7 +327,7 @@ completely. To quote the paper:
to move to a compacting garbage collector, which is able to move objects in
memory. This would require significant changes to the Python interpreter.*

.. _this paper: http://www.evanjones.ca/memoryallocator/
.. _this paper: https://www.evanjones.ca/memoryallocator/

To keep memory consumption reasonable you can split the job into several
smaller jobs or enable :ref:`persistent job queue <topics-jobs>`
2 changes: 1 addition & 1 deletion docs/topics/request-response.rst
@@ -396,7 +396,7 @@ The FormRequest class extends the base :class:`Request` with functionality for
dealing with HTML forms. It uses `lxml.html forms`_ to pre-populate form
fields with form data from :class:`Response` objects.

.. _lxml.html forms: http://lxml.de/lxmlhtml.html#forms
.. _lxml.html forms: https://lxml.de/lxmlhtml.html#forms

.. class:: FormRequest(url, [formdata, ...])

17 changes: 8 additions & 9 deletions docs/topics/selectors.rst
@@ -35,12 +35,11 @@ defines selectors to associate those styles with specific HTML elements.
in speed and parsing accuracy to lxml.

.. _BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/
.. _lxml: http://lxml.de/
.. _lxml: https://lxml.de/
.. _ElementTree: https://docs.python.org/2/library/xml.etree.elementtree.html
.. _cssselect: https://pypi.python.org/pypi/cssselect/
.. _XPath: https://www.w3.org/TR/xpath
.. _XPath: https://www.w3.org/TR/xpath/all/
.. _CSS: https://www.w3.org/TR/selectors
.. _parsel: https://parsel.readthedocs.io/
.. _parsel: https://parsel.readthedocs.io/en/latest/

Using selectors
===============
@@ -255,7 +254,7 @@ that Scrapy (parsel) implements a couple of **non-standard pseudo-elements**:
They will most probably not work with other libraries like
`lxml`_ or `PyQuery`_.

.. _PyQuery: https://pypi.python.org/pypi/pyquery
.. _PyQuery: https://pypi.org/project/pyquery/

Examples:

@@ -309,7 +308,7 @@ Examples:
make much sense: text nodes do not have attributes, and attribute values
are string values already and do not have children nodes.

.. _CSS Selectors: https://www.w3.org/TR/css3-selectors/#selectors
.. _CSS Selectors: https://www.w3.org/TR/selectors-3/#selectors

.. _topics-selectors-nesting-selectors:

@@ -504,7 +503,7 @@ Another common case would be to extract all direct ``<p>`` children:
For more details about relative XPaths see the `Location Paths`_ section in the
XPath specification.

.. _Location Paths: https://www.w3.org/TR/xpath#location-paths
.. _Location Paths: https://www.w3.org/TR/xpath/all/#location-paths
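
As a compact illustration of relative versus absolute XPaths in nested selectors (the HTML is made up)::

    from scrapy.selector import Selector

    sel = Selector(text="<div id='a'><p>one</p></div><div id='b'><p>two</p></div>")
    divs = sel.xpath("//div[@id='a']")

    # An absolute path searches the whole document again:
    print(divs.xpath("//p/text()").getall())    # ['one', 'two']

    # A relative path ('.') stays inside the selected <div>:
    print(divs.xpath(".//p/text()").getall())   # ['one']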

When querying by class, consider using CSS
------------------------------------------
@@ -612,7 +611,7 @@ But using the ``.`` to mean the node, works:
>>> sel.xpath("//a[contains(., 'Next Page')]").getall()
['<a href="#">Click here to go to the <strong>Next Page</strong></a>']

.. _`XPath string function`: https://www.w3.org/TR/xpath/#section-String-Functions
.. _`XPath string function`: https://www.w3.org/TR/xpath/all/#section-String-Functions

.. _topics-selectors-xpath-variables:

@@ -764,7 +763,7 @@ Set operations
These can be handy for excluding parts of a document tree before
extracting text elements for example.

Example extracting microdata (sample content taken from http://schema.org/Product)
Example extracting microdata (sample content taken from https://schema.org/Product)
with groups of itemscopes and corresponding itemprops::

>>> doc = u"""
6 changes: 3 additions & 3 deletions docs/topics/shell.rst
@@ -41,7 +41,7 @@ variable; or by defining it in your :ref:`scrapy.cfg <topics-config-settings>`::

.. _IPython: https://ipython.org/
.. _IPython installation guide: https://ipython.org/install.html
.. _bpython: https://www.bpython-interpreter.org/
.. _bpython: https://bpython-interpreter.org/

Launch the shell
================
@@ -142,7 +142,7 @@ Example of shell session
========================

Here's an example of a typical shell session where we start by scraping the
https://scrapy.org page, and then proceed to scrape the https://reddit.com
https://scrapy.org page, and then proceed to scrape the https://old.reddit.com/
page. Finally, we modify the (Reddit) request method to POST and re-fetch it
getting an error. We end the session by typing Ctrl-D (in Unix systems) or
Ctrl-Z in Windows.
@@ -182,7 +182,7 @@ After that, we can start playing with the objects:
>>> response.xpath('//title/text()').get()
'Scrapy | A Fast and Powerful Scraping and Web Crawling Framework'

>>> fetch("https://reddit.com")
>>> fetch("https://old.reddit.com/")

>>> response.xpath('//title/text()').get()
'reddit: the front page of the internet'