
Commit 243be84

fixed doc typos
1 parent 1fbb715 commit 243be84

2 files changed, +20 -18 lines changed

docs/topics/broad-crawls.rst

Lines changed: 1 addition & 1 deletion
@@ -43,7 +43,7 @@ Increase concurrency
 ====================
 
 Concurrency is the number of requests that are processed in parallel. There is
-a global limit a per-domain limit.
+a global limit and a per-domain limit.
 
 The default global concurrency limit in Scrapy is not suitable for crawling
 many different domains in parallel, so you will want to increase it. How much
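
For orientation (a sketch, not part of the patch above): both limits are plain
settings, so raising them for a broad crawl is a couple of lines in a project's
``settings.py``. The values below are illustrative only::

    # settings.py (sketch) -- tune values for your hardware and targets.

    # Global limit: maximum number of requests Scrapy processes in parallel
    # across all domains (the default is much lower).
    CONCURRENT_REQUESTS = 100

    # Per-domain limit: caps how many of those requests may target a single
    # domain at the same time.
    CONCURRENT_REQUESTS_PER_DOMAIN = 10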

docs/topics/practices.rst

Lines changed: 19 additions & 17 deletions
@@ -4,21 +4,20 @@
 Common Practices
 ================
 
-The section documents sommon common practices when using Scrapy. These are
-things that don't often fall into other specific sections, or cover many of
-them.
+This section documents common practices when using Scrapy. These are things
+that cover many topics and don't often fall into any other specific section.
 
 .. _run-from-script:
 
 Run Scrapy from a script
 ========================
 
-You can use the :ref:`API <topics-api>` to run script from a script, instead of
+You can use the :ref:`API <topics-api>` to run Scrapy from a script, instead of
 the typical way of running Scrapy via ``scrapy crawl``.
 
 What follows is a working example of how to do that, using the `testspiders`_
-project as example. Remember that Scrapy is asynchronous so you need run inside
-the Twisted reactor.
+project as an example. Remember that Scrapy is built on top of the Twisted
+asynchronous networking library, so you need to run it inside the Twisted reactor.
 
 ::
 
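The body of the example is unchanged by this commit, so the diff elides it. For
orientation, under the ``Crawler`` API of this Scrapy generation the full
script reads roughly like the sketch below; ``FollowAllSpider`` comes from the
`testspiders`_ project, and the exact imports are assumptions based on that
era's API rather than the verbatim elided lines::

    from twisted.internet import reactor
    from scrapy.crawler import Crawler
    from scrapy import log, signals
    from scrapy.utils.project import get_project_settings

    from testspiders.spiders.followall import FollowAllSpider

    spider = FollowAllSpider(domain='scrapinghub.com')
    crawler = Crawler(get_project_settings())
    # stop the reactor once this spider closes
    crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()
    log.start()
    reactor.run()  # the script will block here

Connecting ``reactor.stop`` to the ``spider_closed`` signal is what lets the
blocking ``reactor.run()`` call, visible at the top of the next hunk, return.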

@@ -36,12 +35,14 @@ the Twisted reactor.
     log.start()
     reactor.run() # the script will block here
 
+.. seealso:: `Twisted Reactor Overview`_.
+
 Running multiple spiders in the same process
 ============================================
 
 By default, Scrapy runs a single spider per process when you run ``scrapy
-crawl``. However, Scrapy supports running multiple spiders per process if you
-use the :ref:`internal API <topics-api>`.
+crawl``. However, Scrapy supports running multiple spiders per process using
+the :ref:`internal API <topics-api>`.
 
 Here is an example, using the `testspiders`_ project:
 
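Here too the example body sits outside the hunk. A sketch of the pattern this
section describes, one ``Crawler`` per spider sharing a single process and
reactor, might look like the following; the domains are placeholders and the
shutdown bookkeeping is only noted in a comment::

    from twisted.internet import reactor
    from scrapy.crawler import Crawler
    from scrapy import log
    from scrapy.utils.project import get_project_settings

    from testspiders.spiders.followall import FollowAllSpider

    def setup_crawler(domain):
        # each spider gets its own Crawler, but they all share the process
        spider = FollowAllSpider(domain=domain)
        crawler = Crawler(get_project_settings())
        crawler.configure()
        crawler.crawl(spider)
        crawler.start()

    for domain in ['scrapinghub.com', 'insophia.com']:
        setup_crawler(domain)
    log.start()
    # a real script would count spider_closed signals and call
    # reactor.stop() once every spider has finished
    reactor.run()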

@@ -65,33 +66,33 @@ Here is an example, using the `testspiders`_ project:
     log.start()
     reactor.run()
 
-See also: :ref:`run-from-script`.
+.. seealso:: :ref:`run-from-script`.
 
 .. _distributed-crawls:
 
 Distributed crawls
 ==================
 
-Scrapy doesn't provide any built-in facility to distribute crawls, however
-there are some ways to distribute crawls, depending on what kind of crawling
-you do.
+Scrapy doesn't provide any built-in facility for running crawls in a distributed
+(multi-server) manner. However, there are some ways to distribute crawls, which
+vary depending on how you plan to distribute them.
 
 If you have many spiders, the obvious way to distribute the load is to setup
 many Scrapyd instances and distribute spider runs among those.
 
 If you instead want to run a single (big) spider through many machines, what
-you usually do is to partition the urls to crawl and send them to each separate
+you usually do is partition the urls to crawl and send them to each separate
 spider. Here is a concrete example:
 
-First, you prepare a list of urls to crawl and put them into separate
+First, you prepare the list of urls to crawl and put them into separate
 files/urls::
 
     http://somedomain.com/urls-to-crawl/spider1/part1.list
     http://somedomain.com/urls-to-crawl/spider1/part2.list
     http://somedomain.com/urls-to-crawl/spider1/part3.list
 
-Then you would fire a spider run on 3 different Scrapyd servers. The spider
-would receive a spider argument ``part`` with the number of the partition to
+Then you fire a spider run on 3 different Scrapyd servers. The spider would
+receive a (spider) argument ``part`` with the number of the partition to
 crawl::
 
     curl http://scrapy1.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=1
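
The docs don't show the spider that consumes ``part``. As a hedged sketch,
with a hypothetical class name and this era's ``BaseSpider`` import path, such
a spider could fetch only its own slice of the url list::

    from scrapy.spider import BaseSpider
    from scrapy.http import Request

    class PartitionedSpider(BaseSpider):  # hypothetical name
        name = 'spider1'

        def __init__(self, part=None, *args, **kwargs):
            super(PartitionedSpider, self).__init__(*args, **kwargs)
            self.part = part

        def start_requests(self):
            # fetch only this server's partition of the url list
            url = ('http://somedomain.com/urls-to-crawl/spider1/part%s.list'
                   % self.part)
            yield Request(url, callback=self.parse_url_list)

        def parse_url_list(self, response):
            # one url per line; crawl each with the default parse callback
            for line in response.body.splitlines():
                yield Request(line.strip())
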
@@ -115,7 +116,7 @@ Here are some tips to keep in mind when dealing with these kind of sites:
 * disable cookies (see :setting:`COOKIES_ENABLED`) as some sites may use
   cookies to spot bot behaviour
 * use download delays (2 or higher). See :setting:`DOWNLOAD_DELAY` setting.
-* is possible, use `Google cache`_ to fetch pages, instead of hitting the sites
+* if possible, use `Google cache`_ to fetch pages, instead of hitting the sites
   directly
 * use a pool of rotating IPs. For example, the free `Tor project`_ or paid
   services like `ProxyMesh`_
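
The first two tips map directly onto project settings; a minimal
``settings.py`` sketch with the values the tips themselves suggest::

    # settings.py (sketch): some sites use cookies to spot bot behaviour
    COOKIES_ENABLED = False

    # wait at least 2 seconds between requests to the same website
    DOWNLOAD_DELAY = 2
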
@@ -128,3 +129,4 @@ If you are still unable to prevent your bot getting banned, consider contacting
 .. _ProxyMesh: http://proxymesh.com/
 .. _Google cache: http://www.googleguide.com/cached_pages.html
 .. _testspiders: https://github.com/scrapinghub/testspiders
+.. _Twisted Reactor Overview: http://twistedmatrix.com/documents/current/core/howto/reactor-basics.html
