Merge pull request #5432 from Farsene1/master

Recommend Common Crawl instead of Google Cache

wRAR committed Mar 11, 2022
2 parents e264cc3 + ccdbb79, commit 2d6042b
1 changed file, 2 additions and 2 deletions: docs/topics/practices.rst
@@ -262,7 +262,7 @@ Here are some tips to keep in mind when dealing with these kinds of sites:
 * disable cookies (see :setting:`COOKIES_ENABLED`) as some sites may use
   cookies to spot bot behaviour
 * use download delays (2 or higher). See :setting:`DOWNLOAD_DELAY` setting.
-* if possible, use `Google cache`_ to fetch pages, instead of hitting the sites
+* if possible, use `Common Crawl`_ to fetch pages, instead of hitting the sites
   directly
 * use a pool of rotating IPs. For example, the free `Tor project`_ or paid
   services like `ProxyMesh`_. An open source alternative is `scrapoxy`_, a
@@ -277,7 +277,7 @@ If you are still unable to prevent your bot getting banned, consider contacting
 .. _Tor project: https://www.torproject.org/
 .. _commercial support: https://scrapy.org/support/
 .. _ProxyMesh: https://proxymesh.com/
-.. _Google cache: http://www.googleguide.com/cached_pages.html
+.. _Common Crawl: https://commoncrawl.org/
 .. _testspiders: https://github.com/scrapinghub/testspiders
 .. _scrapoxy: https://scrapoxy.io/
 .. _Zyte Smart Proxy Manager: https://www.zyte.com/smart-proxy-manager/
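The first two tips in the hunk above refer to real Scrapy settings, ``COOKIES_ENABLED`` and ``DOWNLOAD_DELAY``. A minimal ``settings.py`` fragment applying them might look like this (the specific values are only the section's suggestions, not project defaults):

```python
# settings.py -- sketch of the anti-ban tips from docs/topics/practices.rst

# Disable cookies, since some sites use them to spot bot behaviour.
COOKIES_ENABLED = False

# Wait at least 2 seconds between requests to the same site
# (the docs recommend a delay of 2 or higher).
DOWNLOAD_DELAY = 2.0
```

With ``RANDOMIZE_DOWNLOAD_DELAY`` left at its default, Scrapy additionally jitters the delay, which makes the request timing look less mechanical.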
