Merge pull request #7 from vladcalin/rework-config
Update docs
vladcalin committed Jan 1, 2018
2 parents 1ce7485 + 4c65928 commit 90f4180
Showing 13 changed files with 410 additions and 33 deletions.
18 changes: 8 additions & 10 deletions README.rst
@@ -8,12 +8,7 @@ crawlster - small and light web crawlers
.. image:: https://travis-ci.org/vladcalin/crawlster.svg?branch=master
:target: https://travis-ci.org/vladcalin/crawlster

A simple crawler framework

.. note::

This is a work in progress

A simple, lightweight web crawling framework


Features:
@@ -26,8 +21,8 @@ Features:
What is crawlster?
------------------

Crawlster is a web crawling library designed to save precious development
time. It is very extensible and provides many shortcuts for the most common
Crawlster is a web crawling library designed to build lightweight and reusable
web crawlers. It is very extensible and provides many shortcuts for the most common
tasks in a web crawler, such as sending HTTP requests, parsing responses
and extracting information.

@@ -59,15 +54,18 @@ Quick example

This is the hello world equivalent for this library:

import crawlster
from crawlster.handlers import JsonLinesHandler

::

    import crawlster
    from crawlster.handlers import JsonLinesHandler


    class MyCrawler(crawlster.Crawlster):
        # items will be saved to items.jsonl
        item_handler = JsonLinesHandler('items.jsonl')

        @crawlster.start
        def step_start(self, url):
            resp = self.http.get(url)
            # we select elements with the expression and we are interested
9 changes: 5 additions & 4 deletions crawlster/__init__.py
@@ -1,8 +1,9 @@
from .config import Configuration
from .core import Crawlster, Job
from .config import Configuration, JsonConfiguration
from .core import Crawlster, Job, start

__all__ = [
    'Crawlster',
    'Job',
    'Configuration'
    'Configuration',
    'JsonConfiguration',
    'start'
]
6 changes: 3 additions & 3 deletions crawlster/core.py
@@ -1,9 +1,7 @@
import datetime
import queue
import threading

import time

import sys
import traceback

@@ -18,6 +16,9 @@


def start(method):
    """Decorator for specifying the start step.
    Must decorate a single method from the crawler class"""
    method._crawlster_start_step = True
    return method

@@ -85,7 +86,6 @@ class Crawlster(object):
    log = LoggingHelper()
    http = RequestsHelper()
    queue = QueueHelper(strategy='lifo')
    # Various utility helpers
    urls = UrlsHelper()
    regex = RegexHelper()
    extract = ExtractHelper()
15 changes: 12 additions & 3 deletions docs/source/howto/configuration.rst
@@ -17,10 +17,8 @@ Configuration keys are populated from all helpers when the crawling starts.

The following configuration keys are available by default:

- ``core.start_step`` - the name of the method where the crawling process
  starts. Defaults to ``start_step``.
- ``core.start_urls`` - a list of URLs that will be processed first in the
  start step.
  start step. This key is required.
- ``core.workers`` - the number of worker threads to be used. Defaults to
  the number of CPU cores.

@@ -53,3 +51,14 @@ Json configuration:
"core.workers": 10,
"core.start_urls": ["http://example.com", "http://example2.com"]
}
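
A hypothetical way to load such a file (``JsonConfiguration`` is assumed
to accept the path to the JSON file; the exact signature is not shown
here):

::

    from crawlster import JsonConfiguration

    # assumption: the constructor takes the path of the JSON file above
    configuration = JsonConfiguration('config.json')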

We then pass the configuration object to the crawler class on initialisation

::

    configuration = Configuration(...)
    crawler = MyCrawlerClass(configuration)
    crawler.start()

This is very useful when we need to reuse the same crawler to crawl multiple
sites where only some configuration options differ, as in the sketch below.
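
For example, a minimal sketch that reuses one crawler class with two
configurations (``Configuration`` is assumed to accept a mapping of the
option names described above; ``MyCrawlerClass`` is the class from the
previous snippet):

::

    # assumption: Configuration takes a mapping of option names to values
    for start_urls in (['http://example.com'], ['http://example2.com']):
        config = Configuration({'core.start_urls': start_urls})
        crawler = MyCrawlerClass(config)
        crawler.start()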
83 changes: 82 additions & 1 deletion docs/source/howto/extending.rst
@@ -1,2 +1,83 @@
Extending the crawler with helpers
==================================
==================================

The ``crawlster`` library makes it very easy to extend the functionality
of the crawler through helpers. A helper is simply a utility class that is
attached to the crawler instance.

Core helpers:

- :py:class:`crawlster.helpers.RequestsHelper` available as ``http``.
- :py:class:`crawlster.helpers.UrlsHelper` available as ``urls``.
- :py:class:`crawlster.helpers.ExtractHelper` available as ``extract``.
- :py:class:`crawlster.helpers.StatsHelper` available as ``stats``.
- :py:class:`crawlster.helpers.LoggingHelper` available as ``log``.
- :py:class:`crawlster.helpers.QueueHelper` available as ``queue``.
- :py:class:`crawlster.helpers.RegexHelper` available as ``regex``.


Create your own helper
----------------------

To create your own helper and enhance your crawler with superpowers,
subclass the :py:class:`crawlster.helpers.BaseHelper` base class.

Then you can start implementing the functionality you need.


Methods
-------

No method is required to be overridden, but some methods can be
overridden to act as hooks. So far the only two available hooks are:

- :py:meth:`crawlster.helpers.BaseHelper.initialize` that performs actions
on crawler start.
- :py:meth:`crawlster.helpers.BaseHelper.finalize` that performs actions
on crawler stop (when there are no more items to process).


Configuration
-------------

Helpers can take advantage of the configuration system the library provides
by declaring the ``config_options`` attribute, a mapping of option names to
option values.


Attributes
----------

The two attributes that are available inside the helper are
``config`` and ``crawler``.

The ``config`` attribute will hold the ``Configuration`` instance used to
initialize the crawler. You can get values from the configuration using
the ``self.config.get(option_name)`` method.

The ``crawler`` attribute holds the current crawler instance, through which
the helper can access other helpers. Although it is recommended to keep
the helper as independent as possible, sometimes you may need to use
functionality already provided by an existing helper (stats aggregation,
logging, etc.). The sketch below ties these pieces together.
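
A minimal sketch of a custom helper (the class name, the empty
``config_options``, and the ``log.info`` call are illustrative
assumptions, not documented API):

::

    from crawlster.helpers import BaseHelper


    class WordCountHelper(BaseHelper):
        """Hypothetical helper that counts words on fetched pages."""

        # assumption: option definitions follow the configuration how-to
        config_options = {}

        def initialize(self):
            # hook: called when the crawling starts
            self.counts = {}

        def count_words(self, url, text):
            self.counts[url] = len(text.split())

        def finalize(self):
            # hook: called when there are no more items to process;
            # uses the core logging helper through the crawler instance
            # (the ``info`` method is assumed)
            self.crawler.log.info('counted %d pages' % len(self.counts))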

Attaching the helper to the crawler
-----------------------------------

In the crawler definition, provide the helper instance as a class attribute


::

    class MyCrawler(Crawlster):

        my_helper = MyHelperClass()

        # ...

        def some_step(self, url):
            # ...
            self.my_helper.do_amazing_things()
            # ...

1 change: 1 addition & 0 deletions docs/source/howto/index.rst
@@ -1,6 +1,7 @@
How to
======

Here you will find some more in-depth guides on various topics.

.. toctree::
    :maxdepth: 2
2 changes: 2 additions & 0 deletions docs/source/howto/parsing.rst
@@ -1,3 +1,5 @@
Parsing requests and extracting data
====================================

We can parse the response data (or basically any string or byte sequence)
using the core ``.extract`` helper (:py:class:`crawlster.helpers.ExtractHelper`).
4 changes: 2 additions & 2 deletions docs/source/howto/requests.rst
@@ -1,5 +1,5 @@
Making HTTP requests
====================

Http requests are made through the ``.http`` helper which is a :
:py:class:`crawlster.helpers.RequestsHelper` instance.
HTTP requests are made through the ``.http`` helper, which is
a :py:class:`crawlster.helpers.RequestsHelper` instance.
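
For example, inside a step method (mirroring the quick example from the
README; only the ``get`` call is shown there, other HTTP verbs are not
assumed):

::

    def step_start(self, url):
        # ``http`` is an attribute of the crawler instance
        resp = self.http.get(url)
        # ... parse ``resp`` and submit items or queue further steps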
10 changes: 10 additions & 0 deletions docs/source/howto/results.rst
@@ -1,3 +1,13 @@
Submitting results
==================

Submitting results is done via the :py:meth:`crawlster.Crawlster.submit_item`
method. The single argument must be a :py:class:`dict` that represents the item.

After being submitted, the item will be passed through all the defined item
handlers.
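
For example, a minimal sketch inside a step method (the field names are
illustrative):

::

    def step_page(self, url):
        # the submitted item must be a plain dict
        self.submit_item({'url': url, 'source': 'example'})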

.. seealso::

    The module reference for :py:mod:`crawlster.handlers` for more details
    and all the available item handler classes.
15 changes: 14 additions & 1 deletion docs/source/modules/crawlster.helpers.rst
@@ -6,7 +6,7 @@ Helpers
Http helpers
------------

.. autoclass:: crawlster.helpers.http.requests.RequestsHelper
.. autoclass:: crawlster.helpers.RequestsHelper
    :members:

Http requests
@@ -24,4 +24,17 @@ Http responses
.. autoclass:: crawlster.helpers.http.response.HttpResponse
    :members:

Extract helpers
---------------

.. autoclass:: crawlster.helpers.ExtractHelper
    :members:

Utility classes
^^^^^^^^^^^^^^^

.. autoclass:: crawlster.helpers.extract.Content
    :members:



1 change: 1 addition & 0 deletions docs/source/modules/crawlster.rst
@@ -7,3 +7,4 @@ The crawlster module
.. autoclass:: crawlster.Crawlster
    :members:

.. autofunction:: crawlster.start
