Fixed minor grammar issues.
mvj3 authored and dangra committed Dec 30, 2015
1 parent b71f677 commit 7f4ddd5d8d2db3b18c983e9a3eb9896748d8f30c
@@ -144,7 +144,7 @@ I get "Filtered offsite request" messages. How can I fix them?
Those messages (logged with ``DEBUG`` level) don't necessarily mean there is a
problem, so you may not need to fix them.

-Those message are thrown by the Offsite Spider Middleware, which is a spider
+Those messages are thrown by the Offsite Spider Middleware, which is a spider
middleware (enabled by default) whose purpose is to filter out requests to
domains outside the ones covered by the spider.

@@ -34,7 +34,7 @@ These are some common properties often found in broad crawls:

As said above, Scrapy default settings are optimized for focused crawls, not
broad crawls. However, due to its asynchronous architecture, Scrapy is very
-well suited for performing fast broad crawls. This page summarize some things
+well suited for performing fast broad crawls. This page summarizes some things
you need to keep in mind when using Scrapy for doing broad crawls, along with
concrete suggestions of Scrapy settings to tune in order to achieve an
efficient broad crawl.
@@ -46,7 +46,7 @@ Concurrency is the number of requests that are processed in parallel. There is
a global limit and a per-domain limit.

The default global concurrency limit in Scrapy is not suitable for crawling
-many different domains in parallel, so you will want to increase it. How much
+many different domains in parallel, so you will want to increase it. How much
to increase it will depend on how much CPU you crawler will have available. A
good starting point is ``100``, but the best way to find out is by doing some
trials and identifying at what concurrency your Scrapy process gets CPU
@@ -17,7 +17,7 @@ Extensions use the :ref:`Scrapy settings <topics-settings>` to manage their
settings, just like any other Scrapy code.

It is customary for extensions to prefix their settings with their own name, to
-avoid collision with existing (and future) extensions. For example, an
+avoid collision with existing (and future) extensions. For example, a
hypothetic extension to handle `Google Sitemaps`_ would use settings like
`GOOGLESITEMAP_ENABLED`, `GOOGLESITEMAP_DEPTH`, and so on.

@@ -145,7 +145,7 @@ Here is the code of such extension::
        self.items_scraped += 1
        if self.items_scraped % self.item_count == 0:
            logger.info("scraped %d items", self.items_scraped)


.. _topics-extensions-ref:

@@ -95,7 +95,7 @@ contain a price::
Write items to a JSON file
--------------------------

-The following pipeline stores all scraped items (from all spiders) into a a
+The following pipeline stores all scraped items (from all spiders) into a
single ``items.jl`` file, containing one item per line serialized in JSON
format::

@@ -61,7 +61,7 @@ the example above.
You can specify any kind of metadata for each field. There is no restriction on
the values accepted by :class:`Field` objects. For this same
reason, there is no reference list of all available metadata keys. Each key
-defined in :class:`Field` objects could be used by a different components, and
+defined in :class:`Field` objects could be used by a different component, and
only those components know about it. You can also define and use any other
:class:`Field` key in your project too, for your own needs. The main goal of
:class:`Field` objects is to provide a way to define all field metadata in one
@@ -97,7 +97,7 @@ subclasses):
A real example
--------------

-Let's see a concrete example of an hypothetical case of memory leaks.
+Let's see a concrete example of a hypothetical case of memory leaks.
Suppose we have some spider with a line similar to this one::

return Request("http://www.somenastyspider.com/product.php?pid=%d" % product_id,
@@ -228,7 +228,7 @@ with varying degrees of sophistication. Getting around those measures can be
difficult and tricky, and may sometimes require special infrastructure. Please
consider contacting `commercial support`_ if in doubt.

-Here are some tips to keep in mind when dealing with these kind of sites:
+Here are some tips to keep in mind when dealing with these kinds of sites:

* rotate your user agent from a pool of well-known ones from browsers (google
around to get a list of them)
@@ -579,7 +579,7 @@ Built-in Selectors reference
is used together with ``text``.

If ``type`` is ``None`` and a ``response`` is passed, the selector type is
-inferred from the response type as follow:
+inferred from the response type as follows:

* ``"html"`` for :class:`~scrapy.http.HtmlResponse` type
* ``"xml"`` for :class:`~scrapy.http.XmlResponse` type
@@ -757,7 +757,7 @@ nodes can be accessed directly by their names::
<Selector xpath='//link' data=u'<link xmlns="http://www.w3.org/2005/Atom'>,
...

-If you wonder why the namespace removal procedure isn't called always by default
+If you wonder why the namespace removal procedure isn't always called by default
instead of having to call it manually, this is because of two reasons, which, in order
of relevance, are:
