diff --git a/docs/conf.py b/docs/conf.py
index c3418cfb332..6e2399f6610 100644
--- a/docs/conf.py
+++ b/docs/conf.py
@@ -281,6 +281,7 @@ intersphinx_mapping = {
     'coverage': ('https://coverage.readthedocs.io/en/stable', None),
+    'cssselect': ('https://cssselect.readthedocs.io/en/latest', None),
     'pytest': ('https://docs.pytest.org/en/latest', None),
     'python': ('https://docs.python.org/3', None),
     'sphinx': ('https://www.sphinx-doc.org/en/master', None),
diff --git a/docs/contributing.rst b/docs/contributing.rst
index f40a6bba29c..aed5ab92eb8 100644
--- a/docs/contributing.rst
+++ b/docs/contributing.rst
@@ -143,7 +143,7 @@ by running ``git fetch upstream pull/$PR_NUMBER/head:$BRANCH_NAME_TO_CREATE``
 (replace 'upstream' with a remote name for scrapy repository,
 ``$PR_NUMBER`` with an ID of the pull request, and ``$BRANCH_NAME_TO_CREATE``
 with a name of the branch you want to create locally).
-See also: https://help.github.com/articles/checking-out-pull-requests-locally/#modifying-an-inactive-pull-request-locally.
+See also: https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/checking-out-pull-requests-locally#modifying-an-inactive-pull-request-locally.
 
 When writing GitHub pull requests, try to keep titles short but descriptive.
 E.g. For bug #411: "Scrapy hangs if an exception raises in start_requests"
@@ -168,7 +168,7 @@ Scrapy:
 * Don't put your name in the code you contribute; git provides enough
   metadata to identify author of the code.
-  See https://help.github.com/articles/setting-your-username-in-git/ for
+  See https://help.github.com/en/github/using-git/setting-your-username-in-git for
   setup instructions.
 
 .. _documentation-policies:
@@ -266,5 +266,5 @@ And their unit-tests are in::
 .. _tests/: https://github.com/scrapy/scrapy/tree/master/tests
 .. _open issues: https://github.com/scrapy/scrapy/issues
 .. _PEP 257: https://www.python.org/dev/peps/pep-0257/
-.. _pull request: https://help.github.com/en/articles/creating-a-pull-request
+.. _pull request: https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/creating-a-pull-request
 .. _pytest-xdist: https://github.com/pytest-dev/pytest-xdist
diff --git a/docs/faq.rst b/docs/faq.rst
index f72e4cf0157..75a0f4864ff 100644
--- a/docs/faq.rst
+++ b/docs/faq.rst
@@ -22,8 +22,8 @@ In other words, comparing `BeautifulSoup`_ (or `lxml`_) to Scrapy is like
 comparing `jinja2`_ to `Django`_.
 
 .. _BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/
-.. _lxml: http://lxml.de/
-.. _jinja2: http://jinja.pocoo.org/
+.. _lxml: https://lxml.de/
+.. _jinja2: https://palletsprojects.com/p/jinja/
 .. _Django: https://www.djangoproject.com/
 
 Can I use Scrapy with BeautifulSoup?
@@ -269,7 +269,7 @@ The ``__VIEWSTATE`` parameter is used in sites built with ASP.NET/VB.NET. For
 more info on how it works see `this page`_. Also, here's an `example spider`_
 which scrapes one of these sites.
 
-.. _this page: http://search.cpan.org/~ecarroll/HTML-TreeBuilderX-ASP_NET-0.09/lib/HTML/TreeBuilderX/ASP_NET.pm
+.. _this page: https://metacpan.org/pod/release/ECARROLL/HTML-TreeBuilderX-ASP_NET-0.09/lib/HTML/TreeBuilderX/ASP_NET.pm
 .. _example spider: https://github.com/AmbientLighter/rpn-fas/blob/master/fas/spiders/rnp.py
 
 What's the best way to parse big XML/CSV data feeds?
diff --git a/docs/intro/install.rst b/docs/intro/install.rst
index 49968437cd3..871281460e7 100644
--- a/docs/intro/install.rst
+++ b/docs/intro/install.rst
@@ -65,7 +65,7 @@ please refer to their respective installation instructions:
 * `lxml installation`_
 * `cryptography installation`_
 
-.. _lxml installation: http://lxml.de/installation.html
+.. _lxml installation: https://lxml.de/installation.html
 .. _cryptography installation: https://cryptography.io/en/latest/installation/
@@ -253,11 +253,11 @@ For details, see `Issue #2473 `_.
 .. _Python: https://www.python.org/
 .. _pip: https://pip.pypa.io/en/latest/installing/
 .. _lxml: https://lxml.de/index.html
-.. _parsel: https://pypi.python.org/pypi/parsel
-.. _w3lib: https://pypi.python.org/pypi/w3lib
-.. _twisted: https://twistedmatrix.com/
-.. _cryptography: https://cryptography.io/
-.. _pyOpenSSL: https://pypi.python.org/pypi/pyOpenSSL
+.. _parsel: https://pypi.org/project/parsel/
+.. _w3lib: https://pypi.org/project/w3lib/
+.. _twisted: https://twistedmatrix.com/trac/
+.. _cryptography: https://cryptography.io/en/latest/
+.. _pyOpenSSL: https://pypi.org/project/pyOpenSSL/
 .. _setuptools: https://pypi.python.org/pypi/setuptools
 .. _AUR Scrapy package: https://aur.archlinux.org/packages/scrapy/
 .. _homebrew: https://brew.sh/
diff --git a/docs/intro/tutorial.rst b/docs/intro/tutorial.rst
index 798fe4a7a71..1768badbb83 100644
--- a/docs/intro/tutorial.rst
+++ b/docs/intro/tutorial.rst
@@ -306,7 +306,7 @@ with a selector (see :ref:`topics-developer-tools`).
 visually selected elements, which works in many browsers.
 
 .. _regular expressions: https://docs.python.org/3/library/re.html
-.. _Selector Gadget: http://selectorgadget.com/
+.. _Selector Gadget: https://selectorgadget.com/
 
 
 XPath: a brief intro
@@ -337,7 +337,7 @@ recommend `this tutorial to learn XPath through examples
 `_, and `this tutorial to learn "how to think in XPath" `_.
 
-.. _XPath: https://www.w3.org/TR/xpath
+.. _XPath: https://www.w3.org/TR/xpath/all/
 .. _CSS: https://www.w3.org/TR/selectors
 
 Extracting quotes and authors
diff --git a/docs/news.rst b/docs/news.rst
index e4b985c77e1..338b53dc4f5 100644
--- a/docs/news.rst
+++ b/docs/news.rst
@@ -26,7 +26,7 @@ Backward-incompatible changes
 * Python 3.4 is no longer supported, and some of the minimum requirements of
   Scrapy have also changed:
 
-  * cssselect_ 0.9.1
+  * :doc:`cssselect ` 0.9.1
   * cryptography_ 2.0
   * lxml_ 3.5.0
   * pyOpenSSL_ 16.2.0
@@ -1616,7 +1616,7 @@ Deprecations and Removals
   + ``scrapy.utils.datatypes.SiteNode``
 
 - The previously bundled ``scrapy.xlib.pydispatch`` library was deprecated and
-  replaced by `pydispatcher `_.
+  replaced by `pydispatcher `_.
 
 
 Relocations
@@ -2450,7 +2450,7 @@ Other
 ~~~~~
 
 - Dropped Python 2.6 support (:issue:`448`)
-- Add `cssselect`_ python package as install dependency
+- Add :doc:`cssselect ` python package as install dependency
 - Drop libxml2 and multi selector's backend support, `lxml`_ is required from now on.
 - Minimum Twisted version increased to 10.0.0, dropped Twisted 8.0 support.
 - Running test suite now requires ``mock`` python library (:issue:`390`)
@@ -3047,17 +3047,16 @@ Scrapy 0.7
 First release of Scrapy.
 
-.. _AJAX crawleable urls: https://developers.google.com/webmasters/ajax-crawling/docs/getting-started?csw=1
+.. _AJAX crawleable urls: https://developers.google.com/search/docs/ajax-crawling/docs/getting-started?csw=1
 .. _botocore: https://github.com/boto/botocore
 .. _chunked transfer encoding: https://en.wikipedia.org/wiki/Chunked_transfer_encoding
 .. _ClientForm: http://wwwsearch.sourceforge.net/old/ClientForm/
 .. _Creating a pull request: https://help.github.com/en/articles/creating-a-pull-request
 .. _cryptography: https://cryptography.io/en/latest/
-.. _cssselect: https://github.com/scrapy/cssselect/
-.. _docstrings: https://docs.python.org/glossary.html#term-docstring
-.. _KeyboardInterrupt: https://docs.python.org/library/exceptions.html#KeyboardInterrupt
+.. _docstrings: https://docs.python.org/3/glossary.html#term-docstring
+.. _KeyboardInterrupt: https://docs.python.org/3/library/exceptions.html#KeyboardInterrupt
 .. _LevelDB: https://github.com/google/leveldb
-.. _lxml: http://lxml.de/
+.. _lxml: https://lxml.de/
 .. _marshal: https://docs.python.org/2/library/marshal.html
 .. _parsel.csstranslator.GenericTranslator: https://parsel.readthedocs.io/en/latest/parsel.html#parsel.csstranslator.GenericTranslator
 .. _parsel.csstranslator.HTMLTranslator: https://parsel.readthedocs.io/en/latest/parsel.html#parsel.csstranslator.HTMLTranslator
@@ -3068,11 +3067,11 @@ First release of Scrapy.
 .. _queuelib: https://github.com/scrapy/queuelib
 .. _registered with IANA: https://www.iana.org/assignments/media-types/media-types.xhtml
 .. _resource: https://docs.python.org/2/library/resource.html
-.. _robots.txt: http://www.robotstxt.org/
+.. _robots.txt: https://www.robotstxt.org/
 .. _scrapely: https://github.com/scrapy/scrapely
 .. _service_identity: https://service-identity.readthedocs.io/en/stable/
 .. _six: https://six.readthedocs.io/
-.. _tox: https://pypi.python.org/pypi/tox
+.. _tox: https://pypi.org/project/tox/
 .. _Twisted: https://twistedmatrix.com/trac/
 .. _Twisted - hello, asynchronous programming: http://jessenoller.com/blog/2009/02/11/twisted-hello-asynchronous-programming/
 .. _w3lib: https://github.com/scrapy/w3lib
diff --git a/docs/topics/broad-crawls.rst b/docs/topics/broad-crawls.rst
index 4922694ee4b..63b60312ea1 100644
--- a/docs/topics/broad-crawls.rst
+++ b/docs/topics/broad-crawls.rst
@@ -188,7 +188,7 @@ AjaxCrawlMiddleware helps to crawl them correctly.
 It is turned OFF by default because it has some performance overhead,
 and enabling it for focused crawls doesn't make much sense.
 
-.. _ajax crawlable: https://developers.google.com/webmasters/ajax-crawling/docs/getting-started
+.. _ajax crawlable: https://developers.google.com/search/docs/ajax-crawling/docs/getting-started
 
 .. _broad-crawls-bfo:
diff --git a/docs/topics/downloader-middleware.rst b/docs/topics/downloader-middleware.rst
index a83cedcfde1..0297ef3a064 100644
--- a/docs/topics/downloader-middleware.rst
+++ b/docs/topics/downloader-middleware.rst
@@ -709,7 +709,7 @@ HttpCompressionMiddleware
 provided `brotlipy`_ is installed.
 
 .. _brotli-compressed: https://www.ietf.org/rfc/rfc7932.txt
-.. _brotlipy: https://pypi.python.org/pypi/brotlipy
+.. _brotlipy: https://pypi.org/project/brotlipy/
 
 HttpCompressionMiddleware Settings
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -1038,7 +1038,7 @@ Based on `RobotFileParser
 * is Python's built-in robots.txt_ parser
 
 * is compliant with `Martijn Koster's 1996 draft specification
-  `_
+  `_
 
 * lacks support for wildcard matching
@@ -1061,7 +1061,7 @@ Based on `Reppy `_:
   `_
 
 * is compliant with `Martijn Koster's 1996 draft specification
-  `_
+  `_
 
 * supports wildcard matching
@@ -1086,7 +1086,7 @@ Based on `Robotexclusionrulesparser `_:
 * implemented in Python
 
 * is compliant with `Martijn Koster's 1996 draft specification
-  `_
+  `_
 
 * supports wildcard matching
@@ -1115,7 +1115,7 @@ implementing the methods described below.
 .. autoclass:: RobotParser
     :members:
 
-.. _robots.txt: http://www.robotstxt.org/
+.. _robots.txt: https://www.robotstxt.org/
 
 DownloaderStats
 ---------------
@@ -1155,7 +1155,7 @@ AjaxCrawlMiddleware
     Middleware that finds 'AJAX crawlable' page variants based
     on meta-fragment html tag. See
-    https://developers.google.com/webmasters/ajax-crawling/docs/getting-started
+    https://developers.google.com/search/docs/ajax-crawling/docs/getting-started
     for more info.
 
     .. note::
diff --git a/docs/topics/dynamic-content.rst b/docs/topics/dynamic-content.rst
index 1c3607860f1..b981336764c 100644
--- a/docs/topics/dynamic-content.rst
+++ b/docs/topics/dynamic-content.rst
@@ -241,12 +241,12 @@ along with `scrapy-selenium`_ for seamless integration.
 .. _headless browser: https://en.wikipedia.org/wiki/Headless_browser
 .. _JavaScript: https://en.wikipedia.org/wiki/JavaScript
 .. _js2xml: https://github.com/scrapinghub/js2xml
-.. _json.loads: https://docs.python.org/library/json.html#json.loads
+.. _json.loads: https://docs.python.org/3/library/json.html#json.loads
 .. _pytesseract: https://github.com/madmaze/pytesseract
-.. _regular expression: https://docs.python.org/library/re.html
+.. _regular expression: https://docs.python.org/3/library/re.html
 .. _scrapy-selenium: https://github.com/clemfromspace/scrapy-selenium
 .. _scrapy-splash: https://github.com/scrapy-plugins/scrapy-splash
-.. _Selenium: https://www.seleniumhq.org/
+.. _Selenium: https://www.selenium.dev/
 .. _Splash: https://github.com/scrapinghub/splash
 .. _tabula-py: https://github.com/chezou/tabula-py
 .. _wget: https://www.gnu.org/software/wget/
diff --git a/docs/topics/item-pipeline.rst b/docs/topics/item-pipeline.rst
index cdc4953c273..801d48fd51d 100644
--- a/docs/topics/item-pipeline.rst
+++ b/docs/topics/item-pipeline.rst
@@ -158,8 +158,8 @@ method and how to clean up the resources properly.::
         self.db[self.collection_name].insert_one(dict(item))
         return item
 
-.. _MongoDB: https://www.mongodb.org/
-.. _pymongo: https://api.mongodb.org/python/current/
+.. _MongoDB: https://www.mongodb.com/
+.. _pymongo: https://api.mongodb.com/python/current/
 
 
 Take screenshot of item
diff --git a/docs/topics/items.rst b/docs/topics/items.rst
index 15313775b04..44643cb67f9 100644
--- a/docs/topics/items.rst
+++ b/docs/topics/items.rst
@@ -166,7 +166,7 @@ If your item contains mutable_ values like lists or dictionaries, a shallow
 copy will keep references to the same mutable values across all different
 copies.
 
-.. _mutable: https://docs.python.org/glossary.html#term-mutable
+.. _mutable: https://docs.python.org/3/glossary.html#term-mutable
 
 For example, if you have an item with a list of tags, and you create a shallow
 copy of that item, both the original item and the copy have the same list of
@@ -177,7 +177,7 @@ If that is not the desired behavior, use a deep copy instead.
 
 See the `documentation of the copy module`_ for more information.
 
-.. _documentation of the copy module: https://docs.python.org/library/copy.html
+.. _documentation of the copy module: https://docs.python.org/3/library/copy.html
 
 To create a shallow copy of an item, you can either call
 :meth:`~scrapy.item.Item.copy` on an existing item
diff --git a/docs/topics/leaks.rst b/docs/topics/leaks.rst
index 9fee333aca3..c0c83fc84dc 100644
--- a/docs/topics/leaks.rst
+++ b/docs/topics/leaks.rst
@@ -206,7 +206,7 @@ objects. If this is your case, and you can't find your leaks using ``trackref``,
 you still have another resource: the `Guppy library`_. If you're using
 Python3, see :ref:`topics-leaks-muppy`.
 
-.. _Guppy library: https://pypi.python.org/pypi/guppy
+.. _Guppy library: https://pypi.org/project/guppy/
 
 If you use ``pip``, you can install Guppy with the following command::
@@ -311,9 +311,9 @@ though neither Scrapy nor your project are leaking memory. This is due to a
 (not so well) known problem of Python, which may not return released memory to
 the operating system in some cases. For more information on this issue see:
 
-* `Python Memory Management `_
-* `Python Memory Management Part 2 `_
-* `Python Memory Management Part 3 `_
+* `Python Memory Management `_
+* `Python Memory Management Part 2 `_
+* `Python Memory Management Part 3 `_
 
 The improvements proposed by Evan Jones, which are detailed in `this paper`_,
 got merged in Python 2.5, but this only reduces the problem, it doesn't fix it
@@ -327,7 +327,7 @@ completely. To quote the paper:
     to move to a compacting garbage collector, which is able to move objects in
     memory. This would require significant changes to the Python interpreter.*
 
-.. _this paper: http://www.evanjones.ca/memoryallocator/
+.. _this paper: https://www.evanjones.ca/memoryallocator/
 
 To keep memory consumption reasonable you can split the job into several
 smaller jobs or enable :ref:`persistent job queue `
diff --git a/docs/topics/request-response.rst b/docs/topics/request-response.rst
index f009facd62f..c4c2845c953 100644
--- a/docs/topics/request-response.rst
+++ b/docs/topics/request-response.rst
@@ -396,7 +396,7 @@ The FormRequest class extends the base :class:`Request` with functionality for
 dealing with HTML forms. It uses `lxml.html forms`_ to pre-populate form
 fields with form data from :class:`Response` objects.
 
-.. _lxml.html forms: http://lxml.de/lxmlhtml.html#forms
+.. _lxml.html forms: https://lxml.de/lxmlhtml.html#forms
 
 .. class:: FormRequest(url, [formdata, ...])
diff --git a/docs/topics/selectors.rst b/docs/topics/selectors.rst
index c3d431e2a14..1f7802c98f9 100644
--- a/docs/topics/selectors.rst
+++ b/docs/topics/selectors.rst
@@ -35,12 +35,11 @@ defines selectors to associate those styles with specific HTML elements.
    in speed and parsing accuracy to lxml.
 
 .. _BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/
-.. _lxml: http://lxml.de/
+.. _lxml: https://lxml.de/
 .. _ElementTree: https://docs.python.org/2/library/xml.etree.elementtree.html
-.. _cssselect: https://pypi.python.org/pypi/cssselect/
-.. _XPath: https://www.w3.org/TR/xpath
+.. _XPath: https://www.w3.org/TR/xpath/all/
 .. _CSS: https://www.w3.org/TR/selectors
-.. _parsel: https://parsel.readthedocs.io/
+.. _parsel: https://parsel.readthedocs.io/en/latest/
 
 Using selectors
 ===============
@@ -255,7 +254,7 @@ that Scrapy (parsel) implements a couple of **non-standard pseudo-elements**:
 They will most probably not work with other libraries like `lxml`_ or `PyQuery`_.
 
-.. _PyQuery: https://pypi.python.org/pypi/pyquery
+.. _PyQuery: https://pypi.org/project/pyquery/
 
 Examples:
@@ -309,7 +308,7 @@ Examples:
 make much sense: text nodes do not have attributes, and attribute values are
 string values already and do not have children nodes.
 
-.. _CSS Selectors: https://www.w3.org/TR/css3-selectors/#selectors
+.. _CSS Selectors: https://www.w3.org/TR/selectors-3/#selectors
 
 .. _topics-selectors-nesting-selectors:
@@ -504,7 +503,7 @@ Another common case would be to extract all direct ``<p>`` children:
 For more details about relative XPaths see the `Location Paths`_ section in
 the XPath specification.
 
-.. _Location Paths: https://www.w3.org/TR/xpath#location-paths
+.. _Location Paths: https://www.w3.org/TR/xpath/all/#location-paths
 
 When querying by class, consider using CSS
 ------------------------------------------
@@ -612,7 +611,7 @@ But using the ``.`` to mean the node, works:
 >>> sel.xpath("//a[contains(., 'Next Page')]").getall()
 ['Click here to go to the Next Page']
 
-.. _`XPath string function`: https://www.w3.org/TR/xpath/#section-String-Functions
+.. _`XPath string function`: https://www.w3.org/TR/xpath/all/#section-String-Functions
 
 .. _topics-selectors-xpath-variables:
@@ -764,7 +763,7 @@ Set operations
 These can be handy for excluding parts of a document tree before extracting
 text elements for example.
 
-Example extracting microdata (sample content taken from http://schema.org/Product)
+Example extracting microdata (sample content taken from https://schema.org/Product)
 with groups of itemscopes and corresponding itemprops::
 
     >>> doc = u"""
diff --git a/docs/topics/shell.rst b/docs/topics/shell.rst
index 3cf8311a67a..8f7518b19d5 100644
--- a/docs/topics/shell.rst
+++ b/docs/topics/shell.rst
@@ -41,7 +41,7 @@ variable; or by defining it in your :ref:`scrapy.cfg `::
 .. _IPython: https://ipython.org/
 .. _IPython installation guide: https://ipython.org/install.html
-.. _bpython: https://www.bpython-interpreter.org/
+.. _bpython: https://bpython-interpreter.org/
 
 Launch the shell
 ================
@@ -142,7 +142,7 @@ Example of shell session
 ========================
 
 Here's an example of a typical shell session where we start by scraping the
-https://scrapy.org page, and then proceed to scrape the https://reddit.com
+https://scrapy.org page, and then proceed to scrape the https://old.reddit.com/
 page. Finally, we modify the (Reddit) request method to POST and re-fetch it
 getting an error. We end the session by typing Ctrl-D (in Unix systems) or
 Ctrl-Z in Windows.
@@ -182,7 +182,7 @@ After that, we can start playing with the objects:
 >>> response.xpath('//title/text()').get()
 'Scrapy | A Fast and Powerful Scraping and Web Crawling Framework'
 
->>> fetch("https://reddit.com")
+>>> fetch("https://old.reddit.com/")
 
 >>> response.xpath('//title/text()').get()
 'reddit: the front page of the internet'
diff --git a/docs/topics/spiders.rst b/docs/topics/spiders.rst
index b0fb14e2444..e0f33de6655 100644
--- a/docs/topics/spiders.rst
+++ b/docs/topics/spiders.rst
@@ -299,8 +299,8 @@ The spider will not do any parsing on its own.
 If you were to set the ``start_urls`` attribute from the command line,
 you would have to parse it on your own into a list
 using something like
-`ast.literal_eval `_
-or `json.loads `_
+`ast.literal_eval `_
+or `json.loads `_
 and then set it as an attribute.
 Otherwise, you would cause iteration over a ``start_urls`` string
 (a very common python pitfall)
@@ -811,6 +811,6 @@ Combine SitemapSpider with other sources of urls::
 
 .. _Sitemaps: https://www.sitemaps.org/index.html
 .. _Sitemap index files: https://www.sitemaps.org/protocol.html#index
-.. _robots.txt: http://www.robotstxt.org/
+.. _robots.txt: https://www.robotstxt.org/
 .. _TLD: https://en.wikipedia.org/wiki/Top-level_domain
 .. _Scrapyd documentation: https://scrapyd.readthedocs.io/en/latest/