Merge pull request #1172 from bagratte/docs
minor corrections in documentation.
kmike committed Apr 19, 2015
2 parents bb4c8c3 + 1312bcd commit 1794a89
Showing 9 changed files with 56 additions and 53 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -10,3 +10,6 @@ venv
build
dist
.idea

+# Windows
+Thumbs.db
4 changes: 2 additions & 2 deletions docs/intro/tutorial.rst
@@ -108,7 +108,7 @@ define the three main mandatory attributes:
listed here. The subsequent URLs will be generated successively from data
contained in the start URLs.

-* :meth:`~scrapy.spider.Spider.parse` a method of the spider, which will
+* :meth:`~scrapy.spider.Spider.parse`: a method of the spider, which will
be called with the downloaded :class:`~scrapy.http.Response` object of each
start URL. The response is passed to the method as the first and only
argument.
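
To make the role of ``parse()`` concrete, here is a minimal sketch of a spider whose ``parse()`` callback only logs each downloaded response; the spider name and start URL below are illustrative, not part of the change above::

    import scrapy

    class ExampleSpider(scrapy.Spider):
        name = "example"                          # illustrative spider name
        start_urls = ["http://www.example.com/"]

        def parse(self, response):
            # called once per start URL with the downloaded Response
            self.log("Visited %s" % response.url)
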
@@ -248,7 +248,7 @@ To start a shell, you must go to the project's top level directory and run::

.. note::

-Remember to always enclose urls with quotes in running Scrapy shell from
+Remember to always enclose urls in quotes when running Scrapy shell from
command-line, otherwise urls containing arguments (ie. ``&`` character)
will not work.

12 changes: 6 additions & 6 deletions docs/topics/commands.rst
@@ -80,8 +80,8 @@ some usage help and the available commands::
fetch Fetch a URL using the Scrapy downloader
[...]

-The first line will print the currently active project, if you're inside a
-Scrapy project. In this, it was run from outside a project. If run from inside
+The first line will print the currently active project if you're inside a
+Scrapy project. In this example it was run from outside a project. If run from inside
a project it would have printed something like this::

Scrapy X.Y - project: myproject
@@ -135,7 +135,7 @@ Available tool commands
=======================

This section contains a list of the available built-in commands with a
-description and some usage examples. Remember you can always get more info
+description and some usage examples. Remember, you can always get more info
about each command by running::

scrapy <command> -h
@@ -196,7 +196,7 @@ genspider

Create a new spider in the current project.

-This is just a convenient shortcut command for creating spiders based on
+This is just a convenience shortcut command for creating spiders based on
pre-defined templates, but certainly not the only way to create spiders. You
can just create the spider source code files yourself, instead of using this
command.
@@ -298,7 +298,7 @@ edit
Edit the given spider using the editor defined in the :setting:`EDITOR`
setting.

-This command is provided only as a convenient shortcut for the most common
+This command is provided only as a convenience shortcut for the most common
case, the developer is of course free to choose any tool or IDE to write and
debug his spiders.

@@ -318,7 +318,7 @@ Downloads the given URL using the Scrapy downloader and writes the contents to
standard output.

The interesting thing about this command is that it fetches the page how the
-spider would download it. For example, if the spider has an ``USER_AGENT``
+spider would download it. For example, if the spider has a ``USER_AGENT``
attribute which overrides the User Agent, it will use that one.

So this command can be used to "see" how your spider would fetch a certain page.
4 changes: 2 additions & 2 deletions docs/topics/feed-exports.rst
@@ -8,7 +8,7 @@ Feed exports

One of the most frequently required features when implementing scrapers is
being able to store the scraped data properly and, quite often, that means
generating a "export file" with the scraped data (commonly called "export
generating an "export file" with the scraped data (commonly called "export
feed") to be consumed by other systems.

Scrapy provides this functionality out of the box with the Feed Exports, which
@@ -21,7 +21,7 @@ Serialization formats
=====================

For serializing the scraped data, the feed exports use the :ref:`Item exporters
-<topics-exporters>` and these formats are supported out of the box:
+<topics-exporters>`. These formats are supported out of the box:

* :ref:`topics-feed-format-json`
* :ref:`topics-feed-format-jsonlines`
10 changes: 5 additions & 5 deletions docs/topics/item-pipeline.rst
@@ -5,14 +5,14 @@ Item Pipeline
=============

After an item has been scraped by a spider, it is sent to the Item Pipeline
-which process it through several components that are executed sequentially.
+which processes it through several components that are executed sequentially.

Each item pipeline component (sometimes referred as just "Item Pipeline") is a
Python class that implements a simple method. They receive an item and perform
an action over it, also deciding if the item should continue through the
pipeline or be dropped and no longer processed.
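
A minimal sketch of such a component, assuming a hypothetical required ``title`` field (``process_item()`` is the method each component implements)::

    from scrapy.exceptions import DropItem

    class RequiredFieldsPipeline(object):
        """Drop items that lack a 'title' field, otherwise pass them on."""

        def process_item(self, item, spider):
            if item.get('title'):
                return item  # continue through the remaining pipeline components
            raise DropItem("Missing title in %s" % item)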

-Typical use for item pipelines are:
+Typical uses of item pipelines are:

* cleansing HTML data
* validating scraped data (checking that the items contain certain fields)
@@ -167,7 +167,7 @@ Duplicates filter
-----------------

A filter that looks for duplicate items, and drops those items that were
-already processed. Let say that our items have an unique id, but our spider
+already processed. Let's say that our items have a unique id, but our spider
returns multiples items with the same id::
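
    # The example itself falls outside this hunk; a sketch of such a filter,
    # with the 'id' field and class name assumed for illustration:
    from scrapy.exceptions import DropItem

    class DuplicatesPipeline(object):

        def __init__(self):
            self.ids_seen = set()

        def process_item(self, item, spider):
            if item['id'] in self.ids_seen:
                raise DropItem("Duplicate item found: %s" % item)
            self.ids_seen.add(item['id'])
            return item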


@@ -198,6 +198,6 @@ To activate an Item Pipeline component you must add its class to the
}

The integer values you assign to classes in this setting determine the
-order they run in- items go through pipelines from order number low to
-high. It's customary to define these numbers in the 0-1000 range.
+order in which they run: items go through from lower valued to higher
+valued classes. It's customary to define these numbers in the 0-1000 range.
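
For instance (the component paths below are hypothetical; the lower-valued class runs first)::

    ITEM_PIPELINES = {
        'myproject.pipelines.RequiredFieldsPipeline': 300,
        'myproject.pipelines.DuplicatesPipeline': 800,
    }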

2 changes: 1 addition & 1 deletion docs/topics/link-extractors.rst
@@ -82,7 +82,7 @@ LxmlLinkExtractor
module.
:type deny_extensions: list

-:param restrict_xpaths: is a XPath (or list of XPath's) which defines
+:param restrict_xpaths: is an XPath (or list of XPath's) which defines
regions inside the response where links should be extracted from.
If given, only the text selected by those XPath will be scanned for
links. See examples below.
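
A rough usage sketch, assuming the link-extractor import path of this era and a made-up page (URL, markup and XPath are illustrative)::

    from scrapy.contrib.linkextractors import LinkExtractor  # LxmlLinkExtractor alias
    from scrapy.http import HtmlResponse

    # a tiny in-memory response so the example is self-contained
    response = HtmlResponse(url='http://www.example.com',
                            body='<div id="content"><a href="/a.html">A</a></div>'
                                 '<div id="footer"><a href="/b.html">B</a></div>')

    extractor = LinkExtractor(restrict_xpaths='//div[@id="content"]')
    print(extractor.extract_links(response))  # only the link inside div#content
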
12 changes: 6 additions & 6 deletions docs/topics/loaders.rst
@@ -9,7 +9,7 @@ Item Loaders

Item Loaders provide a convenient mechanism for populating scraped :ref:`Items
<topics-items>`. Even though Items can be populated using their own
-dictionary-like API, the Item Loaders provide a much more convenient API for
+dictionary-like API, Item Loaders provide a much more convenient API for
populating them from a scraping process, by automating some common tasks like
parsing the raw extracted data before assigning it.

@@ -25,7 +25,7 @@ Using Item Loaders to populate items
====================================

To use an Item Loader, you must first instantiate it. You can either
-instantiate it with an dict-like object (e.g. Item or dict) or without one, in
+instantiate it with a dict-like object (e.g. Item or dict) or without one, in
which case an Item is automatically instantiated in the Item Loader constructor
using the Item class specified in the :attr:`ItemLoader.default_item_class`
attribute.
@@ -67,7 +67,7 @@ and finally the ``last_update`` field is populated directly with a literal value
(``today``) using a different method: :meth:`~ItemLoader.add_value`.

Finally, when all data is collected, the :meth:`ItemLoader.load_item` method is
-called which actually populates and returns the item populated with the data
+called which actually returns the item populated with the data
previously extracted and collected with the :meth:`~ItemLoader.add_xpath`,
:meth:`~ItemLoader.add_css`, and :meth:`~ItemLoader.add_value` calls.
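
Putting those calls together, a sketch of the pattern being described, assuming a hypothetical ``Product`` item and illustrative selectors (loader import path as of this era)::

    from scrapy.contrib.loader import ItemLoader
    from myproject.items import Product  # hypothetical Item with these fields

    # inside your spider
    def parse(self, response):
        l = ItemLoader(item=Product(), response=response)
        l.add_xpath('name', '//div[@class="product_name"]/text()')
        l.add_css('price', '#price::text')
        l.add_value('last_updated', 'today')
        return l.load_item()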

@@ -565,8 +565,8 @@ Here is a list of all built-in processors:
.. class:: Identity

The simplest processor, which doesn't do anything. It returns the original
-values unchanged. It doesn't receive any constructor arguments nor accepts
-Loader contexts.
+values unchanged. It doesn't receive any constructor arguments, nor does it
+accept Loader contexts.

Example::
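
    >>> # a sketch of what the elided example shows (processor import path as of this era)
    >>> from scrapy.contrib.loader.processor import Identity
    >>> proc = Identity()
    >>> proc(['one', 'two', 'three'])
    ['one', 'two', 'three']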

@@ -579,7 +579,7 @@ Here is a list of all built-in processors:

Returns the first non-null/non-empty value from the values received,
so it's typically used as an output processor to single-valued fields.
-It doesn't receive any constructor arguments, nor accept Loader contexts.
+It doesn't receive any constructor arguments, nor does it accept Loader contexts.

Example::

52 changes: 26 additions & 26 deletions docs/topics/selectors.rst
@@ -13,9 +13,9 @@ achieve this:
HTML code and also deals with bad markup reasonably well, but it has one
drawback: it's slow.

-* `lxml`_ is a XML parsing library (which also parses HTML) with a pythonic
-API based on `ElementTree`_ (which is not part of the Python standard
-library).
+* `lxml`_ is an XML parsing library (which also parses HTML) with a pythonic
+API based on `ElementTree`_. (lxml is not part of the Python standard
+library.)

Scrapy comes with its own mechanism for extracting data. They're called
selectors because they "select" certain parts of the HTML document specified
@@ -72,7 +72,7 @@ Constructing from response::
>>> Selector(response=response).xpath('//span/text()').extract()
[u'good']

-For convenience, response objects exposes a selector on `.selector` attribute,
+For convenience, response objects expose a selector on `.selector` attribute,
it's totally OK to use this shortcut when possible::

>>> response.selector.xpath('//span/text()').extract()
@@ -114,17 +114,17 @@ page, let's construct an XPath for selecting the text inside the title tag::
>>> response.selector.xpath('//title/text()')
[<Selector (text) xpath=//title/text()>]

-Querying responses using XPath and CSS is so common that responses includes two
-convenient shortcuts: ``response.xpath()`` and ``response.css()``::
+Querying responses using XPath and CSS is so common that responses include two
+convenience shortcuts: ``response.xpath()`` and ``response.css()``::

>>> response.xpath('//title/text()')
[<Selector (text) xpath=//title/text()>]
>>> response.css('title::text')
[<Selector (text) xpath=//title/text()>]

-As you can see, ``.xpath()`` and ``.css()`` methods returns an
+As you can see, ``.xpath()`` and ``.css()`` methods return a
:class:`~scrapy.selector.SelectorList` instance, which is a list of new
-selectors. This API can be used quickly for selecting nested data::
+selectors. This API can be used for quickly selecting nested data::

>>> response.css('img').xpath('@src').extract()
[u'image1_thumb.jpg',
@@ -196,7 +196,7 @@ Now we're going to get the base URL and some image links::
Nesting selectors
-----------------

-The selection methods (``.xpath()`` or ``.css()``) returns a list of selectors
+The selection methods (``.xpath()`` or ``.css()``) return a list of selectors
of the same type, so you can call the selection methods for those selectors
too. Here's an example::

@@ -221,12 +221,12 @@ too. Here's an example::
Using selectors with regular expressions
----------------------------------------

-:class:`~scrapy.selector.Selector` also have a ``.re()`` method for extracting
+:class:`~scrapy.selector.Selector` also has a ``.re()`` method for extracting
data using regular expressions. However, unlike using ``.xpath()`` or
-``.css()`` methods, ``.re()`` method returns a list of unicode strings. So you
+``.css()`` methods, ``.re()`` returns a list of unicode strings. So you
can't construct nested ``.re()`` calls.

-Here's an example used to extract images names from the :ref:`HTML code
+Here's an example used to extract image names from the :ref:`HTML code
<topics-selectors-htmlcode>` above::

>>> response.xpath('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')
@@ -295,7 +295,7 @@ set \http://exslt.org/sets `set manipulation`_
Regular expressions
~~~~~~~~~~~~~~~~~~~

-The ``test()`` function for example can prove quite useful when XPath's
+The ``test()`` function, for example, can prove quite useful when XPath's
``starts-with()`` or ``contains()`` are not sufficient.

Example selecting links in list item with a "class" attribute ending with a digit::
@@ -440,7 +440,7 @@ you may want to take a look first at this `XPath tutorial`_.
Using text nodes in a condition
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-When you need to use the text content as argument to a `XPath string function`_,
+When you need to use the text content as argument to an `XPath string function`_,
avoid using ``.//text()`` and use just ``.`` instead.

This is because the expression ``.//text()`` yields a collection of text elements -- a *node-set*.
@@ -478,7 +478,7 @@ But using the ``.`` to mean the node, works::

.. _`XPath string function`: http://www.w3.org/TR/xpath/#section-String-Functions

-Beware the difference between //node[1] and (//node)[1]
+Beware of the difference between //node[1] and (//node)[1]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

``//node[1]`` selects all the nodes occurring first under their respective parents.
@@ -559,7 +559,7 @@ Built-in Selectors reference
An instance of :class:`Selector` is a wrapper over response to select
certain parts of its content.

-``response`` is a :class:`~scrapy.http.HtmlResponse` or
+``response`` is an :class:`~scrapy.http.HtmlResponse` or an
:class:`~scrapy.http.XmlResponse` object that will be used for selecting and
extracting data.

@@ -593,7 +593,7 @@ Built-in Selectors reference

.. note::

-For convenience this method can be called as ``response.xpath()``
+For convenience, this method can be called as ``response.xpath()``

.. method:: css(query)

@@ -644,7 +644,7 @@ SelectorList objects

.. class:: SelectorList

-The :class:`SelectorList` class is subclass of the builtin ``list``
+The :class:`SelectorList` class is a subclass of the builtin ``list``
class, which provides a few additional methods.

.. method:: xpath(query)
@@ -680,17 +680,17 @@ Selector examples on HTML response
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Here's a couple of :class:`Selector` examples to illustrate several concepts.
-In all cases, we assume there is already an :class:`Selector` instantiated with
+In all cases, we assume there is already a :class:`Selector` instantiated with
a :class:`~scrapy.http.HtmlResponse` object like this::

sel = Selector(html_response)

-1. Select all ``<h1>`` elements from a HTML response body, returning a list of
+1. Select all ``<h1>`` elements from an HTML response body, returning a list of
:class:`Selector` objects (ie. a :class:`SelectorList` object)::

sel.xpath("//h1")

-2. Extract the text of all ``<h1>`` elements from a HTML response body,
+2. Extract the text of all ``<h1>`` elements from an HTML response body,
returning a list of unicode strings::

sel.xpath("//h1").extract() # this includes the h1 tag
@@ -705,12 +705,12 @@ Selector examples on XML response
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Here's a couple of examples to illustrate several concepts. In both cases we
-assume there is already an :class:`Selector` instantiated with a
+assume there is already a :class:`Selector` instantiated with an
:class:`~scrapy.http.XmlResponse` object like this::

sel = Selector(xml_response)

-1. Select all ``<product>`` elements from a XML response body, returning a list
+1. Select all ``<product>`` elements from an XML response body, returning a list
of :class:`Selector` objects (ie. a :class:`SelectorList` object)::

sel.xpath("//product")
@@ -752,12 +752,12 @@ nodes can be accessed directly by their names::
<Selector xpath='//link' data=u'<link xmlns="http://www.w3.org/2005/Atom'>,
...

-If you wonder why the namespace removal procedure is not always called, instead
-of having to call it manually. This is because of two reasons which, in order
+If you wonder why the namespace removal procedure isn't called always by default
+instead of having to call it manually, this is because of two reasons, which, in order
of relevance, are:

1. Removing namespaces requires to iterate and modify all nodes in the
-    document, which is a reasonably expensive operation to performs for all
+    document, which is a reasonably expensive operation to perform for all
documents crawled by Scrapy

2. There could be some cases where using namespaces is actually required, in
10 changes: 5 additions & 5 deletions docs/topics/spiders.rst
@@ -190,7 +190,7 @@ scrapy.Spider
dicts or :class:`~scrapy.item.Item` objects.

:param response: the response to parse
-:type response: :class:~scrapy.http.Response`
+:type response: :class:`~scrapy.http.Response`

.. method:: log(message, [level, component])

@@ -297,10 +297,10 @@ See `Scrapyd documentation`_.
Generic Spiders
===============

-Scrapy comes with some useful generic spiders that you can use, to subclass
+Scrapy comes with some useful generic spiders that you can use to subclass
your spiders from. Their aim is to provide convenient functionality for a few
common scraping cases, like following all links on a site based on certain
-rules, crawling from `Sitemaps`_, or parsing a XML/CSV feed.
+rules, crawling from `Sitemaps`_, or parsing an XML/CSV feed.

For the examples used in the following spiders, we'll assume you have a project
with a ``TestItem`` declared in a ``myproject.items`` module::
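
    # (The module contents fall outside this hunk; a plausible sketch, field names assumed.)
    import scrapy

    class TestItem(scrapy.Item):
        id = scrapy.Field()
        name = scrapy.Field()
        description = scrapy.Field()
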
@@ -342,7 +342,7 @@ CrawlSpider
.. method:: parse_start_url(response)

This method is called for the start_urls responses. It allows to parse
-the initial responses and must return either a
+the initial responses and must return either an
:class:`~scrapy.item.Item` object, a :class:`~scrapy.http.Request`
object, or an iterable containing any of them.

@@ -417,7 +417,7 @@ Let's now take a look at an example CrawlSpider with rules::
This spider would start crawling example.com's home page, collecting category
links, and item links, parsing the latter with the ``parse_item`` method. For
each item response, some data will be extracted from the HTML using XPath, and
-a :class:`~scrapy.item.Item` will be filled with it.
+an :class:`~scrapy.item.Item` will be filled with it.
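
A sketch of the shape such a spider takes, with domain names, URL patterns and XPaths as placeholders (import paths as of this era)::

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors import LinkExtractor
    from myproject.items import TestItem  # hypothetical item module from earlier

    class ExampleSpider(CrawlSpider):
        name = 'example.com'
        allowed_domains = ['example.com']
        start_urls = ['http://www.example.com']

        rules = (
            # follow category links; hand item pages to parse_item()
            Rule(LinkExtractor(allow=(r'category\.php',))),
            Rule(LinkExtractor(allow=(r'item\.php',)), callback='parse_item'),
        )

        def parse_item(self, response):
            item = TestItem()
            item['name'] = response.xpath('//h1/text()').extract()
            return item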

XMLFeedSpider
-------------