Merge pull request #7 from vladcalin/rework-config
Update docs
vladcalin committed Jan 1, 2018
2 parents 1ce7485 + 4c65928 commit 90f4180
Showing 13 changed files with 410 additions and 33 deletions.
18 changes: 8 additions & 10 deletions README.rst
@@ -8,12 +8,7 @@ crawlster - small and light web crawlers
.. image:: https://travis-ci.org/vladcalin/crawlster.svg?branch=master
:target: https://travis-ci.org/vladcalin/crawlster

A simple crawler framework

.. note::

This is a work in progress

A simple, lightweight web crawling framework


Features:
@@ -26,8 +21,8 @@ Features:
What is crawlster?
------------------

Crawlster is a web crawling library designed to save precious development
time. It is very extensible and provides many shortcuts for the most common
Crawlster is a web crawling library designed to build lightweight and reusable
web crawlers. It is very extensible and provides many shortcuts for the most common
tasks in a web crawler, such as sending HTTP requests, parsing responses
and extracting information.

@@ -59,15 +54,18 @@ Quick example

This is the hello world equivalent for this library:

import crawlster
from crawlster.handlers import JsonLinesHandler

::

    import crawlster
    from crawlster.handlers import JsonLinesHandler


    class MyCrawler(crawlster.Crawlster):
        # items will be saved to items.jsonl
        item_handler = JsonLinesHandler('items.jsonl')

        @crawlster.start
        def step_start(self, url):
            resp = self.http.get(url)
            # we select elements with the expression and we are interested
9 changes: 5 additions & 4 deletions crawlster/__init__.py
@@ -1,8 +1,9 @@
from .config import Configuration
from .core import Crawlster, Job
from .config import Configuration, JsonConfiguration
from .core import Crawlster, Job, start

__all__ = [
    'Crawlster',
    'Job',
    'Configuration'
    'Configuration',
    'JsonConfiguration',
    'start'
]
6 changes: 3 additions & 3 deletions crawlster/core.py
@@ -1,9 +1,7 @@
import datetime
import queue
import threading

import time

import sys
import traceback

@@ -18,6 +16,9 @@


def start(method):
    """Decorator for specifying the start step.
    Must decorate a single method from the crawler class"""
    method._crawlster_start_step = True
    return method

@@ -85,7 +86,6 @@ class Crawlster(object):
    log = LoggingHelper()
    http = RequestsHelper()
    queue = QueueHelper(strategy='lifo')
    # Various utility helpers
    urls = UrlsHelper()
    regex = RegexHelper()
    extract = ExtractHelper()
15 changes: 12 additions & 3 deletions docs/source/howto/configuration.rst
@@ -17,10 +17,8 @@ Configuration keys are populated from all helpers when the crawling starts.

The following configuration keys are available by default:

- ``core.start_step`` - the name of the method where the crawling process
  starts. Defaults to ``start_step``.
- ``core.start_urls`` - a list of URLs that will be processed first in the
  start step.
  start step. This key is required.
- ``core.workers`` - the number of worker threads to be used. Defaults to
  the number of CPU cores.

@@ -53,3 +51,14 @@ Json configuration:
"core.workers": 10,
"core.start_urls": ["http://example.com", "http://example2.com"]
}
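
A hypothetical way to load such a file (``JsonConfiguration`` is assumed
to accept the path to the JSON file; the exact signature is not shown
here):

::

    from crawlster import JsonConfiguration

    # assumption: the constructor takes the path of the JSON file above
    configuration = JsonConfiguration('config.json')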

We then pass the configuration object to the crawler class on initialisation

::

    configuration = Configuration(...)
    crawler = MyCrawlerClass(configuration)
    crawler.start()

This is very useful when we need to reuse the same crawler to crawl multiple
sites where only some configuration options differ, as in the sketch below.
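
For example, a minimal sketch that reuses one crawler class with two
configurations (``Configuration`` is assumed to accept a mapping of the
option names described above; ``MyCrawlerClass`` is the class from the
previous snippet):

::

    # assumption: Configuration takes a mapping of option names to values
    for start_urls in (['http://example.com'], ['http://example2.com']):
        config = Configuration({'core.start_urls': start_urls})
        crawler = MyCrawlerClass(config)
        crawler.start()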
83 changes: 82 additions & 1 deletion docs/source/howto/extending.rst
@@ -1,2 +1,83 @@
Extending the crawler with helpers
==================================
==================================

The ``crawlster`` library makes it very easy to extend the functionality
of the crawler through helpers. A helper is simply a utility class that is
attached to the crawler instance.

Core helpers:

- :py:class:`crawlster.helpers.RequestsHelper` available as ``http``.
- :py:class:`crawlster.helpers.UrlsHelper` available as ``urls``.
- :py:class:`crawlster.helpers.ExtractHelper` available as ``extract``.
- :py:class:`crawlster.helpers.StatsHelper` available as ``stats``.
- :py:class:`crawlster.helpers.LoggingHelper` available as ``log``.
- :py:class:`crawlster.helpers.QueueHelper` available as ``queue``.
- :py:class:`crawlster.helpers.RegexHelper` available as ``regex``.


Create your own helper
----------------------

To create your own helper and enhance your crawler with superpowers,
subclass the :py:class:`crawlster.helpers.BaseHelper` base class.

Then you can start implementing the functionality you need.


Methods
-------

No method is required to be overridden, but some methods can be
overridden to act as hooks. So far the only two available hooks are:

- :py:meth:`crawlster.helpers.BaseHelper.initialize` that performs actions
on crawler start.
- :py:meth:`crawlster.helpers.BaseHelper.finalize` that performs actions
on crawler stop (when there are no more items to process).


Configuration
-------------

Helpers can take advantage of the configuration system the library provides
by declaring the ``config_options`` attribute, a mapping of option names to
option values.


Attributes
----------

The two attributes that are available inside the helper are
``config`` and ``crawler``.

The ``config`` attribute will hold the ``Configuration`` instance used to
initialize the crawler. You can get values from the configuration using
the ``self.config.get(option_name)`` method.

The ``crawler`` attribute holds the current crawler instance, through which
the helper can access other helpers. Although it is recommended to keep
the helper as independent as possible, sometimes you may need to use
functionality already provided by an existing helper (stats aggregation,
logging, etc.). The sketch below ties these pieces together.
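
A minimal sketch of a custom helper (the class name, the empty
``config_options``, and the ``log.info`` call are illustrative
assumptions, not documented API):

::

    from crawlster.helpers import BaseHelper


    class WordCountHelper(BaseHelper):
        """Hypothetical helper that counts words on fetched pages."""

        # assumption: option definitions follow the configuration how-to
        config_options = {}

        def initialize(self):
            # hook: called when the crawling starts
            self.counts = {}

        def count_words(self, url, text):
            self.counts[url] = len(text.split())

        def finalize(self):
            # hook: called when there are no more items to process;
            # uses the core logging helper through the crawler instance
            # (the ``info`` method is assumed)
            self.crawler.log.info('counted %d pages' % len(self.counts))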

Attaching the helper to the crawler
-----------------------------------

In the crawler definition, provide the helper instance as a class attribute


::

    class MyCrawler(Crawlster):

        my_helper = MyHelperClass()

        # ...

        def some_step(self, url):
            # ...
            self.my_helper.do_amazing_things()
            # ...

1 change: 1 addition & 0 deletions docs/source/howto/index.rst
@@ -1,6 +1,7 @@
How to
======

Here you will find some more in-depth guides on various topics.

.. toctree::
    :maxdepth: 2
2 changes: 2 additions & 0 deletions docs/source/howto/parsing.rst
@@ -1,3 +1,5 @@
Parsing requests and extracting data
====================================

We can parse the response data (or basically any string or byte sequence)
using the core ``.extract`` helper (:py:class:`crawlster.helpers.ExtractHelper`).
4 changes: 2 additions & 2 deletions docs/source/howto/requests.rst
@@ -1,5 +1,5 @@
Making HTTP requests
====================

Http requests are made through the ``.http`` helper which is a :
:py:class:`crawlster.helpers.RequestsHelper` instance.
HTTP requests are made through the ``.http`` helper, which is
a :py:class:`crawlster.helpers.RequestsHelper` instance.
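
For example, inside a step method (mirroring the quick example from the
README; only the ``get`` call is shown there, other HTTP verbs are not
assumed):

::

    def step_start(self, url):
        # ``http`` is an attribute of the crawler instance
        resp = self.http.get(url)
        # ... parse ``resp`` and submit items or queue further steps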
10 changes: 10 additions & 0 deletions docs/source/howto/results.rst
@@ -1,3 +1,13 @@
Submitting results
==================

Submitting results is done via the :py:meth:`crawlster.Crawlster.submit_item`
method. The single argument must be a :py:class:`dict` that represents the item.

After being submitted, the item will be passed through all the defined item
handlers.
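
For example, a minimal sketch inside a step method (the field names are
illustrative):

::

    def step_page(self, url):
        # the submitted item must be a plain dict
        self.submit_item({'url': url, 'source': 'example'})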

.. seealso::

    The module reference for :py:mod:`crawlster.handlers` for more details
    and all the available item handler classes.
15 changes: 14 additions & 1 deletion docs/source/modules/crawlster.helpers.rst
@@ -6,7 +6,7 @@ Helpers
Http helpers
------------

.. autoclass:: crawlster.helpers.http.requests.RequestsHelper
.. autoclass:: crawlster.helpers.RequestsHelper
    :members:

Http requests
@@ -24,4 +24,17 @@ Http responses
.. autoclass:: crawlster.helpers.http.response.HttpResponse
    :members:

Extract helpers
---------------

.. autoclass:: crawlster.helpers.ExtractHelper
    :members:

Utility classes
^^^^^^^^^^^^^^^

.. autoclass:: crawlster.helpers.extract.Content
    :members:



1 change: 1 addition & 0 deletions docs/source/modules/crawlster.rst
@@ -7,3 +7,4 @@ The crawlster module
.. autoclass:: crawlster.Crawlster
    :members:

.. autofunction:: crawlster.start
