Common Practices
================

- The section documents sommon common practices when using Scrapy. These are
- things that don't often fall into other specific sections, or cover many of
- them.
+ This section documents common practices when using Scrapy. These are things
+ that cover many topics and don't often fall into any other specific section.

.. _run-from-script:

Run Scrapy from a script
========================

- You can use the :ref:`API <topics-api>` to run script from a script, instead of
+ You can use the :ref:`API <topics-api>` to run Scrapy from a script, instead of
the typical way of running Scrapy via ``scrapy crawl``.

What follows is a working example of how to do that, using the `testspiders`_
- project as example. Remember that Scrapy is asynchronous so you need run inside
- the Twisted reactor.
+ project as an example. Remember that Scrapy is built on top of the Twisted
+ asynchronous networking library, so you need to run it inside the Twisted reactor.

::

@@ -36,12 +35,14 @@ the Twisted reactor.
    log.start()
    reactor.run() # the script will block here

+ .. seealso:: `Twisted Reactor Overview`_.
+
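
For reference, here is a minimal sketch of what such a script can look like,
assuming the old-style ``Crawler`` API and the ``FollowAllSpider`` from the
`testspiders`_ project (the spider class and the ``domain`` value are just
placeholders)::

    from twisted.internet import reactor
    from scrapy import log, signals
    from scrapy.crawler import Crawler
    from scrapy.utils.project import get_project_settings

    from testspiders.spiders.followall import FollowAllSpider

    spider = FollowAllSpider(domain='scrapinghub.com')  # placeholder spider/domain
    settings = get_project_settings()
    crawler = Crawler(settings)
    # stop the reactor once the spider finishes, so the script can exit
    crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()
    log.start()
    reactor.run()  # the script will block here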

Running multiple spiders in the same process
============================================

By default, Scrapy runs a single spider per process when you run ``scrapy
- crawl``. However, Scrapy supports running multiple spiders per process if you
- use the :ref:`internal API <topics-api>`.
+ crawl``. However, Scrapy supports running multiple spiders per process using
+ the :ref:`internal API <topics-api>`.

Here is an example, using the `testspiders`_ project:

@@ -65,33 +66,33 @@ Here is an example, using the `testspiders`_ project:
    log.start()
    reactor.run()

- See also: :ref:`run-from-script`.
+ .. seealso:: :ref:`run-from-script`.
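
If you need the script to exit once every spider has finished, one possible
sketch (assuming the same old-style ``Crawler`` API; the domain list and the
closed-spider counting are illustrative, not part of the example above)::

    from twisted.internet import reactor
    from scrapy import log, signals
    from scrapy.crawler import Crawler
    from scrapy.utils.project import get_project_settings

    from testspiders.spiders.followall import FollowAllSpider

    domains = ['scrapinghub.com', 'insophia.com']  # placeholder domains
    pending = {'count': len(domains)}

    def spider_closed(spider):
        # stop the reactor once the last spider closes
        pending['count'] -= 1
        if pending['count'] == 0:
            reactor.stop()

    settings = get_project_settings()
    for domain in domains:
        crawler = Crawler(settings)
        crawler.signals.connect(spider_closed, signal=signals.spider_closed)
        crawler.configure()
        crawler.crawl(FollowAllSpider(domain=domain))
        crawler.start()

    log.start()
    reactor.run()  # the script will block here until all spiders close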

.. _distributed-crawls:

Distributed crawls
==================

- Scrapy doesn't provide any built-in facility to distribute crawls, however
- there are some ways to distribute crawls, depending on what kind of crawling
- you do.
+ Scrapy doesn't provide any built-in facility for running crawls in a distributed
+ (multi-server) manner. However, there are some ways to distribute crawls, which
+ vary depending on how you plan to distribute them.

If you have many spiders, the obvious way to distribute the load is to set up
many Scrapyd instances and distribute spider runs among those.

If you instead want to run a single (big) spider through many machines, what
- you usually do is to partition the urls to crawl and send them to each separate
+ you usually do is partition the urls to crawl and send them to each separate
spider. Here is a concrete example:

- First, you prepare a list of urls to crawl and put them into separate
+ First, you prepare the list of urls to crawl and put them into separate
files/urls::

    http://somedomain.com/urls-to-crawl/spider1/part1.list
    http://somedomain.com/urls-to-crawl/spider1/part2.list
    http://somedomain.com/urls-to-crawl/spider1/part3.list

- Then you would fire a spider run on 3 different Scrapyd servers. The spider
- would receive a spider argument ``part`` with the number of the partition to
+ Then you fire a spider run on 3 different Scrapyd servers. The spider would
+ receive a (spider) argument ``part`` with the number of the partition to
crawl::

    curl http://scrapy1.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=1
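
A sketch of how such a spider might consume the ``part`` argument, assuming
the old-style ``BaseSpider`` API and that each ``.list`` file contains one url
per line (the class name and the parsing logic are placeholders)::

    from scrapy.spider import BaseSpider
    from scrapy.http import Request

    class Spider1(BaseSpider):
        name = 'spider1'

        def __init__(self, part=None, *args, **kwargs):
            super(Spider1, self).__init__(*args, **kwargs)
            self.part = part

        def start_requests(self):
            # fetch the partition file assigned to this run (part=1, 2 or 3)
            url = 'http://somedomain.com/urls-to-crawl/spider1/part%s.list' % self.part
            yield Request(url, callback=self.parse_url_list)

        def parse_url_list(self, response):
            # each non-empty line of the .list file is one url to crawl
            for line in response.body.splitlines():
                if line.strip():
                    yield Request(line.strip(), callback=self.parse)

        def parse(self, response):
            # actual scraping logic goes here
            pass
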
@@ -115,7 +116,7 @@ Here are some tips to keep in mind when dealing with these kind of sites:

* disable cookies (see :setting:`COOKIES_ENABLED`) as some sites may use
  cookies to spot bot behaviour
* use download delays (2 or higher). See the :setting:`DOWNLOAD_DELAY` setting
  and the settings sketch after this list.
- * is possible, use `Google cache`_ to fetch pages, instead of hitting the sites
+ * if possible, use `Google cache`_ to fetch pages, instead of hitting the sites
  directly
* use a pool of rotating IPs. For example, the free `Tor project`_ or paid
  services like `ProxyMesh`_
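
A minimal ``settings.py`` sketch covering the two settings-based tips above
(the exact values are only a starting point and should be tuned per site)::

    # settings.py
    COOKIES_ENABLED = False   # don't let sites use cookies to spot bot behaviour
    DOWNLOAD_DELAY = 2        # seconds to wait between consecutive requests
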
@@ -128,3 +129,4 @@ If you are still unable to prevent your bot getting banned, consider contacting

.. _ProxyMesh: http://proxymesh.com/
.. _Google cache: http://www.googleguide.com/cached_pages.html
.. _testspiders: https://github.com/scrapinghub/testspiders
+ .. _Twisted Reactor Overview: http://twistedmatrix.com/documents/current/core/howto/reactor-basics.html