More documentation stuff
Erik Bernhardsson committed Mar 15, 2015
1 parent 9b37a0e commit 331dfbf
Showing 4 changed files with 51 additions and 13 deletions.
14 changes: 8 additions & 6 deletions doc/more_info.rst
Expand Up @@ -20,7 +20,7 @@ We learned a lot from our mistakes and some design decisions include:
- A web server that renders the dependency graph and does locking etc for free.
- Trivial to extend with new file systems, file formats and job types.
You can easily write jobs that insert a Tokyo Cabinet into Cassandra.
Adding broad support S3, MySQL or Hive should be a stroll in the park.
Adding support for new systems is generally not very hard.
(Feel free to send us a patch when you're done!)
- Date algebra included.
- Lots of unit tests of the most basic stuff
Expand All @@ -32,12 +32,14 @@ It wouldn't be fair not to mention some limitations with the current design:
- The assumption is that each task is a sizable chunk of work.
While you can probably schedule a few thousand jobs,
it's not meant to scale beyond tens of thousands.
- Luigi maintains a strict separation between scheduling tasks and running them.
Dynamic for-loops and branches are non-trivial to implement.
For instance, it's tricky to iterate a numerical computation task until it converges.
- Luigi does not support distribution of execution.
When you have workers running thousands of jobs daily, this starts to matter,
because the worker nodes get overloaded.
There are some ways to mitigate this (trigger from many nodes, use resources),
but none of them is ideal.
- Luigi does not come with built-in triggering, and you still need to rely on something like
crontab to trigger workflows periodically.

It should actually be noted that all these limitations are not fundamental in any way.
However, it would take some major refactoring work.

Also it should be mentioned that Luigi is named after the world's second most famous plumber.

Expand Down
29 changes: 26 additions & 3 deletions doc/parameters.rst
Expand Up @@ -30,6 +30,11 @@ i.e.
will return the same date that the object was constructed with.
Same goes if you invoke Luigi on the command line.

.. _Parameter-instance-caching:

Instance caching
^^^^^^^^^^^^^^^^

Tasks are uniquely identified by their class name and values of their
parameters.
In fact, within the same worker, two tasks of the same class with
Expand All @@ -55,7 +60,10 @@ parameters of the same values are not just equal, but the same instance:
>>> c is d
True
However, if a parameter is created with ``significant=False``,
Insignificant parameters
^^^^^^^^^^^^^^^^^^^^^^^^

If a parameter is created with ``significant=False``,
it is ignored as far as the Task signature is concerned.
Tasks created with only insignificant parameters differing have the same signature but
are not the same instance:
Expand All @@ -80,11 +88,23 @@ are not the same instance:
>>> hash(c) == hash(d)
True
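The behaviour above can be approximated in plain Python. This is only a sketch under stated assumptions: ``SketchTask`` and its ``significant`` tuple are hypothetical names invented for illustration and are not Luigi's implementation. The signature is built from the significant values only, so two objects that differ in an insignificant value hash equally while remaining distinct instances.

```python
class SketchTask:
    """Hypothetical stand-in for a task; not part of Luigi."""

    # Names of the values that take part in the signature (assumption).
    significant = ("date",)

    def __init__(self, date, verbose=False):
        self.date = date
        self.verbose = verbose  # treated as insignificant in this sketch

    def signature(self):
        # Class name plus the significant values only.
        values = tuple((name, getattr(self, name)) for name in self.significant)
        return (type(self).__name__, values)

    def __hash__(self):
        return hash(self.signature())

    def __eq__(self, other):
        return isinstance(other, SketchTask) and self.signature() == other.signature()


c = SketchTask("2012-05-10", verbose=True)
d = SketchTask("2012-05-10", verbose=False)
same_hash = hash(c) == hash(d)
same_object = c is d
```

Here ``same_hash`` is true and ``same_object`` is false, mirroring the doctest output shown above.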
Parameter types
^^^^^^^^^^^^^^^

In the examples above, the *type* of the parameter is determined by using different
subclasses of :class:`~luigi.parameter.Parameter`. There are a few of them, like
:class:`~luigi.parameter.DateParameter`,
:class:`~luigi.parameter.DateIntervalParameter`,
:class:`~luigi.parameter.IntParameter`,
:class:`~luigi.parameter.FloatParameter`, etc.

Python is not a statically typed language and you don't have to specify the types
of any of your parameters.
You can simply use the base class :class:`~luigi.parameter.Parameter` if you don't care.
In fact, the reason :class:`~luigi.parameter.DateParameter` et al exist is just in order to
support command line interaction and make sure to convert the input to

The reason you would use a subclass like :class:`~luigi.parameter.DateParameter`
is that Luigi needs to know its type for the command line interaction.
That's how it knows how to convert a string provided on the command line to
the corresponding type (i.e. datetime.date instead of a string).
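As a rough illustration of why the type matters on the command line, the sketch below uses only the standard library. Both helper functions are hypothetical and are not part of Luigi; they merely mirror the string-to-``datetime.date`` conversion described above, assuming the input uses the ``YYYY-MM-DD`` format.

```python
import datetime


def parse_date_argument(raw):
    """Convert a command-line string such as '2012-05-10' to a datetime.date.

    Hypothetical helper for illustration only; not Luigi's implementation.
    """
    return datetime.datetime.strptime(raw, "%Y-%m-%d").date()


def parse_untyped_argument(raw):
    """Without type information, the value simply stays a string."""
    return raw


typed = parse_date_argument("2012-05-10")
untyped = parse_untyped_argument("2012-05-10")
```

With the typed helper the task receives a ``datetime.date`` it can do date arithmetic on; with the untyped one it receives the raw string.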

Setting parameter value for other classes
Expand Down Expand Up @@ -120,6 +140,9 @@ For instance, you can put this in the config:
Just as in the previous case, this will set the value of ``TaskA.x`` to 45 on the *class* level.
And likewise, it is still possible to override it inside Python if you instantiate ``TaskA(x=44)``.

Parameter resolution order
^^^^^^^^^^^^^^^^^^^^^^^^^^

Parameters are resolved in the following order of decreasing priority:

1. Any value passed to the constructor, or task level value set on the command line (applies on an instance level)
Expand Down
2 changes: 2 additions & 0 deletions doc/tasks.rst
Expand Up @@ -102,6 +102,7 @@ An example:
g.write('%s\n' % ''.join(reversed(line.strip().split())))
g.close() # needed because files are atomic
.. _Task.input:
Task.input
Expand Down Expand Up @@ -258,3 +259,4 @@ In addition to the stuff mentioned above,
Luigi also does some metaclass logic so that
if e.g. ``DailyReport(datetime.date(2012, 5, 10))`` is instantiated twice in the code,
it will in fact result in the same object.
See :ref:`Parameter-instance-caching` for more info.
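For readers curious what such metaclass logic can look like, here is a minimal sketch. It assumes Python 3 syntax, and ``InstanceCache`` is a hypothetical name; Luigi's actual metaclass may differ in how it builds the cache key and in what it caches.

```python
import datetime


class InstanceCache(type):
    """Metaclass that returns the same object for identical constructor arguments."""

    def __init__(cls, name, bases, namespace):
        super().__init__(name, bases, namespace)
        cls._instances = {}  # one cache per class

    def __call__(cls, *args, **kwargs):
        # Key on positional args plus sorted keyword args; values must be hashable.
        # Note: in this sketch DailyReport(d) and DailyReport(date=d) get different keys.
        key = (args, tuple(sorted(kwargs.items())))
        if key not in cls._instances:
            cls._instances[key] = super().__call__(*args, **kwargs)
        return cls._instances[key]


class DailyReport(metaclass=InstanceCache):
    def __init__(self, date):
        self.date = date


a = DailyReport(datetime.date(2012, 5, 10))
b = DailyReport(datetime.date(2012, 5, 10))
c = DailyReport(datetime.date(2012, 5, 11))
```

In this sketch ``a is b`` holds because both calls share a cache key, while ``c`` is a separate object.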
19 changes: 15 additions & 4 deletions doc/workflows.rst
Expand Up @@ -16,13 +16,21 @@ Actually, the only method that Targets have to implement is the *exists*
method which returns True if and only if the Target exists.

In practice, implementing Target subclasses is rarely needed.
You can probably get pretty far with the :class:`~luigi.file.LocalTarget` and :class:`~luigi.hdfs.HdfsTarget`
classes that are available out of the box.
These directly map to a file on the local drive or a file in HDFS, respectively.
Luigi comes with a toolbox of several useful Targets.
In particular, there are :class:`~luigi.file.LocalTarget` and :class:`~luigi.hdfs.HdfsTarget`,
but there is also support for
:class:`S3 <luigi.s3.S3Target>`,
:class:`SSH <luigi.contrib.ssh.RemoteTarget>`,
:class:`FTP <luigi.ftp.RemoteTarget>`,
:class:`MySQL <luigi.contrib.mysqldb.MySqlTarget>`,
:class:`Redshift <luigi.redshift.RedshiftTarget>`, and several more.

Most of these targets are file system-like.
For instance, :class:`~luigi.file.LocalTarget` and :class:`~luigi.hdfs.HdfsTarget` map to a file on the local drive or a file in HDFS.
In addition, these also wrap the underlying operations to make them atomic.
They both implement the :func:`~luigi.file.LocalTarget.open` method, which returns a stream object that
can be read from (``mode='r'``) or written to (``mode='w'``).
Both :class:`~luigi.file.LocalTarget` and :class:`~luigi.hdfs.HdfsTarget` also optionally take a format parameter.

Luigi comes with Gzip support by providing ``format=format.Gzip``.
Adding support for other formats is pretty simple.
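To make the ideas of ``exists`` and atomic writes concrete, here is a minimal standard-library sketch. ``SketchLocalTarget`` and ``write_atomically`` are hypothetical names and this is not Luigi's ``LocalTarget``; it only assumes the common pattern of writing to a temporary file in the same directory and renaming it into place.

```python
import os
import tempfile


class SketchLocalTarget:
    """Hypothetical file-backed target; not Luigi's LocalTarget."""

    def __init__(self, path):
        self.path = path

    def exists(self):
        # The one method every target has to provide.
        return os.path.exists(self.path)

    def write_atomically(self, text):
        # Write to a temporary file in the same directory, then rename it into
        # place, so readers never observe a half-written file.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        try:
            with os.fdopen(fd, "w") as handle:
                handle.write(text)
            os.replace(tmp_path, self.path)
        except BaseException:
            # Clean up the temporary file if anything went wrong.
            if os.path.exists(tmp_path):
                os.remove(tmp_path)
            raise


workdir = tempfile.mkdtemp()
target = SketchLocalTarget(os.path.join(workdir, "report.txt"))
existed_before = target.exists()
target.write_atomically("done\n")
existed_after = target.exists()
```

Because the final path only appears after the rename, ``exists`` cannot return true for a partially written output in this sketch.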

Expand Down Expand Up @@ -59,6 +67,7 @@ The Task class corresponds to some type of job that is run, but in
general you want to allow some form of parametrization of it.
For instance, if your Task class runs a Hadoop job to create a report every night,
you probably want to make the date a parameter of the class.
See :doc:`/parameters` for more info.

.. figure:: task_parameters.png
:alt: Tasks with parameters
Expand All @@ -78,3 +87,5 @@ For instance, some examples of the dependencies you might encounter:

.. figure:: parameters_enum.png
:alt: Dependencies with enums

(These diagrams are from a `Luigi presentation in late 2014 at NYC Data Science meetup <http://www.slideshare.net/erikbern/luigi-presentation-nyc-data-science>`_.)
