Document DataSources and Datasinks, fix #88
schuderer committed Mar 27, 2020
1 parent 4b5118d commit d56dcc7
Showing 10 changed files with 534 additions and 177 deletions.
3 changes: 3 additions & 0 deletions CHANGELOG.rst
@@ -25,6 +25,9 @@ and this project adheres to `Semantic Versioning <https://semver.org/spec/v2.0.0
Unreleased
------------------------------------------------------------------------------

* |Enhancement| Document DataSources and DataSinks,
`issue #88 <https://github.com/schuderer/mllaunchpad/issues/88>`_,
by `Andreas Schuderer <https://github.com/schuderer>`_.
* |Enhancement| Document configuration,
`issue #67 <https://github.com/schuderer/mllaunchpad/issues/67>`_,
by `Andreas Schuderer <https://github.com/schuderer>`_.
4 changes: 2 additions & 2 deletions README.rst
@@ -45,7 +45,7 @@ management functionality.

It creates a separation between machine learning
models and their environment. This way, you can run your model with
different data sources and on different environments, by just swapping
different :doc:`data sources <datasources>` and on different environments, by just swapping
out the configuration, no code changes required. ML Launchpad makes your
model available as a business-facing *RESTful API*
without extra coding.
@@ -94,7 +94,7 @@ consists of at least three files:
* ``<examplename>.raml``: example’s RESTful API specification.
Used, among others, to parse and validate parameters.

* There are also some extra files, like CSV files to use, or datasource
* There are also some extra files, like CSV files to use, or :doc:`datasource <datasources>`
extensions.

The subfolder ``testserver`` contains an example for running a REST API
2 changes: 1 addition & 1 deletion docs/about.rst
@@ -185,7 +185,7 @@ need, as a minimum:

* Training data and test data for your model (in a format and location
that is
accessible for the built-in DataSources). Side note: Validation data
accessible for the built-in :doc:`DataSources <datasources>`). Side note: Validation data
here counts as a part of training data because validation happens during
the model creation phase.
* A python module (``.py`` file) containing the implementation
27 changes: 24 additions & 3 deletions docs/config.rst
@@ -7,7 +7,7 @@ Configuration
The configuration file is the glue that holds your ML-Launchpad-based application
together. It links the things on the "inside", that is, your model's
implementation, to the things on the "outside", such as the data connection
(``DataSource``, ``DataSink``), as well as the API configuration.
(:doc:`datasources`), as well as the API configuration.

**Sidenote**: You can use this to your advantage when developing and testing your machine learning
algorithm by using different configuration files for different purposes of your
@@ -37,6 +37,8 @@ in code when using ``mllaunchpad`` functionality
**Note**: Besides ``LAUNCHPAD_CFG``, there is also the ``LAUNCHPAD_LOG`` environment
variable, which, if provided, will be used as the `logging configuration file <https://docs.python.org/3.8/library/logging.config.html>`_.

.. _config_file:

Config File
------------------------------------------------------------------------------

@@ -47,6 +49,10 @@ Here's an example configuration with comments:

.. code-block:: yaml

    plugins:  # Optionally specify any additional imports (only external DataSources/-Sinks for now, cf. ``DataSources``)
      - bogusdatasource
      - records_datasource
    datasources:  # This section is optional. Places to get data from, and how.
      petals:  # Name by which you want to refer to the datasource, e.g. using ``data_sources["petals"]``/
        # The properties ``type``, ``expires``, ``options`` and ``tags`` are present
@@ -93,8 +99,23 @@ Here's an example configuration with comments:
Details on how to configure specific types of ``DataSources`` and ``DataSinks`` can be found
in their respective documentation (which still needs to be written, contributions welcome! For now please see the
`examples <https://github.com/schuderer/mllaunchpad/tree/master/examples>`_ [:download:`download <_static/examples.zip>`]).
on the page :doc:`datasources`.

.. _plugins:

Plugins
------------------------------------------------------------------------------

In your :ref:`config_file`, you can optionally use a top-level ``plugins:`` key to
specify (a list of) modules that should be imported by ML Launchpad (currently only used
while initializing the :doc:`datasources`). If any of these plugins are in conflict
with other plugins or built-ins, the last-imported one has precedence over
the previous ones.

For example, if several :doc:`DataSource <datasources>` plugins offer to serve the
same type (e.g. ``csv``), the last one in the ``plugins:`` list will be chosen as the
designated ``csv`` handler, overruling both the built-in :class:`~mllaunchpad.resource.FileDataSource`
as well as any other ``csv``-serving DataSources listed before the one in question.
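The precedence rule described above can be pictured with a small sketch. This is not ML Launchpad's actual lookup code; the plugin names ``first_plugin.CsvSource`` and ``second_plugin.FastCsvSource`` are hypothetical, and the only assumption is that handlers are registered in import order (built-ins first, then the ``plugins:`` list from top to bottom).

```python
# Illustrative sketch only: NOT ML Launchpad's actual lookup code.
# Each entry is (handler name, types it serves), listed in import order:
# built-ins first, then the modules from the ``plugins:`` list, top to bottom.
import_order = [
    ("FileDataSource", ["csv", "euro_csv"]),   # built-in
    ("first_plugin.CsvSource", ["csv"]),       # hypothetical plugin
    ("second_plugin.FastCsvSource", ["csv"]),  # hypothetical plugin, listed last
]


def build_type_registry(handlers):
    """Map each type to the handler that was imported last for it."""
    registry = {}
    for handler_name, served_types in handlers:
        for served_type in served_types:
            # A later import simply overwrites an earlier registration.
            registry[served_type] = handler_name
    return registry


registry = build_type_registry(import_order)
print(registry["csv"])       # second_plugin.FastCsvSource
print(registry["euro_csv"])  # FileDataSource
```

In this sketch, ``euro_csv`` is still handled by the built-in because no plugin claims it, while ``csv`` goes to the plugin that was listed last.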

RAML API Definition
------------------------------------------------------------------------------
144 changes: 144 additions & 0 deletions docs/datasources.rst
@@ -0,0 +1,144 @@
.. highlight:: yaml

==============================================================================
DataSources and DataSinks
==============================================================================

:class:`DataSources <mllaunchpad.resource.DataSource>` and
:class:`DataSinks <mllaunchpad.resource.DataSink>` are what loosely couples
your model's code to the data.

From your model's code, instead of accessing
your data locations directly, you access your data via the
:class:`DataSources <mllaunchpad.resource.DataSource>` and
:class:`DataSinks <mllaunchpad.resource.DataSink>` that are provided by ML Launchpad.
To your code, all :class:`DataSources <mllaunchpad.resource.DataSource>` and
:class:`DataSinks <mllaunchpad.resource.DataSink>` behave the same and are used
the same way for the same data format (``DataFrames``, raw files, etc.).

For example, to obtain a pandas ``DataFrame``,
use the :class:`~mllaunchpad.resource.DataSource`'s
:meth:`~mllaunchpad.resource.DataSource.get_dataframe` method.
Your code does not need to know
whether your data was originally obtained from a database, file, or web service.
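The following minimal sketch illustrates this idea. The two stand-in classes are hypothetical and not part of ML Launchpad: they only mimic the ``get_dataframe`` method, and they return plain Python rows where a real DataSource would return a pandas ``DataFrame``. The name ``petals`` is just an example of a configured DataSource name.

```python
# Hypothetical stand-ins: NOT ML Launchpad classes. They only mimic the
# ``get_dataframe`` method described in the text above.
class FakeFileSource:
    def get_dataframe(self):
        # A real DataSource would read a file and return a pandas DataFrame.
        return [{"petal_length": 1.4, "species": "setosa"}]


class FakeDatabaseSource:
    def get_dataframe(self):
        # A real DataSource would run a query and return a pandas DataFrame.
        return [{"petal_length": 1.4, "species": "setosa"}]


def load_training_data(data_sources):
    """Model-side code: identical no matter where the data comes from."""
    return data_sources["petals"].get_dataframe()


from_file = load_training_data({"petals": FakeFileSource()})
from_database = load_training_data({"petals": FakeDatabaseSource()})
print(from_file == from_database)  # True
```

The model-side function ``load_training_data`` never changes; only the object behind the name ``petals`` does.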

As :class:`DataSources <mllaunchpad.resource.DataSource>` and
:class:`DataSinks <mllaunchpad.resource.DataSink>` are very similar, we will
use the term :class:`~mllaunchpad.resource.DataSource` below meaning either.

Different subclasses of :class:`DataSources <mllaunchpad.resource.DataSource>`
provide you with different kinds of connections. In the following sections, you can find
lists of built-in and external :class:`DataSources <mllaunchpad.resource.DataSource>`
and :class:`DataSinks <mllaunchpad.resource.DataSink>`.

Each subclass of :class:`DataSources <mllaunchpad.resource.DataSource>` (e.g. :class:`~mllaunchpad.resource.FileDataSource`,
:class:`~mllaunchpad.resource.OracleDataSource`)
serves one or several ``types`` (e.g. ``csv``, ``euro_csv``, ``dbms.oracle``).
You specify the ``type`` in your DataSource's :doc:`configuration <config>`.
The same ``type`` can even be served by several different
:class:`~mllaunchpad.resource.DataSource` subclasses, in which case the
:doc:`last-imported plugin <config>` takes precedence.

Here is an example for a configured
:class:`~mllaunchpad.resource.FileDataSource`::

  datasources:
    my_datasource:
      type: csv
      path: ./iris_train.csv
      expires: 0
      options: {}
      tags: train

The parts of this example are:

* ``datasources`` (or ``datasinks``; optional): Can contain
  as many child elements (configured DataSources or DataSinks) as you like.
* ``my_datasource``: The name by which you refer to a specific configured DataSource.
  You choose this name yourself and use it to get data, e.g.:
  ``data_sources["my_datasource"].get_dataframe()``.
* ``type`` (required in every DataSource): The ``type`` that a DataSource needs to
  serve in order to be chosen for you. In this case, when ML Launchpad looks up
  which DataSources serve the ``csv`` type, it finds
  :class:`~mllaunchpad.resource.FileDataSource` and uses it.
* ``path`` (specific to :class:`~mllaunchpad.resource.FileDataSource`):
  The path of the file. Every DataSource has its own specific properties,
  which are described in that DataSource's documentation (see the section on built-ins below).
* ``expires`` (required in every DataSource): Controls caching of the data.
  ``0`` means that the data expires immediately, ``-1`` means that it never expires, and any other
  number specifies the number of seconds after which the cached data expires and
  is fetched afresh from the source.
* ``tags`` (required in every DataSource): One or several of
  the possible tags ``train``, ``test`` and ``predict`` (use [brackets] around
  more than one tag). This determines the model function(s) the DataSource will be
  made available to.
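The three cases of ``expires`` can be summarized in a short sketch. The helper ``is_expired`` is hypothetical and only restates the rule from the list above; it is not taken from ML Launchpad's code, and the exact boundary behavior (here: expired once the full number of seconds has passed) is an assumption.

```python
import time


def is_expired(cached_at, expires, now=None):
    """Return True if data cached at ``cached_at`` should be fetched again.

    Hypothetical helper restating the ``expires`` rule; timestamps are in
    seconds, as returned by ``time.time()``.
    """
    if expires == -1:  # never expires
        return False
    if expires == 0:  # expires immediately: always fetch afresh
        return True
    now = time.time() if now is None else now
    # Any other number: the cache is valid for ``expires`` seconds.
    return now - cached_at >= expires


print(is_expired(cached_at=100.0, expires=0, now=100.0))   # True
print(is_expired(cached_at=100.0, expires=-1, now=1e9))    # False
print(is_expired(cached_at=100.0, expires=60, now=130.0))  # False
print(is_expired(cached_at=100.0, expires=60, now=161.0))  # True
```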

For more complete examples, have a look at the :ref:`tutorial` or at the
`examples <https://github.com/schuderer/mllaunchpad/blob/master/examples/>`_ (:download:`download <_static/examples.zip>`).

Please note that :class:`DataSources <mllaunchpad.resource.DataSource>`
and :class:`DataSinks <mllaunchpad.resource.DataSink>` will be initialized
for you by ML Launchpad depending on your configuration.
Your code will just get "some" DataSource, but won't have to import, initialize, or
even know the name of the DataSource class that is used under the hood.

When you need to use, for example, several tables that reside in the same database, it
is useful to configure the connection only once instead of repeating the
connection details for every DataSource that corresponds to one of those tables.
To do so, specify a separate ``dbms:`` section in your
configuration in which you give each connection a name (e.g. ``my_connection``). You can
then refer to that connection from your ``datasources`` config by using a type such as ``dbms.my_connection``.
See :class:`~mllaunchpad.resource.OracleDataSource` below for an example.
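As a rough sketch of how such a reference could be resolved, the snippet below uses a plain Python dictionary in place of the parsed configuration. The DataSource names ``customers`` and ``orders``, the connection keys ``type`` and ``host`` with their values, and the helper ``resolve_connection`` are all assumptions made for illustration; the real connection properties are listed in the documentation of the respective DataSource.

```python
# Stand-in for a parsed configuration file. The keys inside ``my_connection``
# are placeholders for illustration, not a documented set of properties.
config = {
    "dbms": {
        "my_connection": {
            "type": "oracle",          # assumed value
            "host": "db.example.com",  # placeholder connection detail
        },
    },
    "datasources": {
        "customers": {"type": "dbms.my_connection", "expires": 0, "tags": ["train"]},
        "orders": {"type": "dbms.my_connection", "expires": 0, "tags": ["train"]},
    },
}


def resolve_connection(config, datasource_name):
    """Return the shared connection settings a datasource refers to, or None."""
    ds_type = config["datasources"][datasource_name]["type"]
    prefix, _, connection_name = ds_type.partition(".")
    if prefix != "dbms":
        return None  # e.g. ``csv``: no shared connection involved
    return config["dbms"][connection_name]


# Both datasources share one and the same connection definition.
print(resolve_connection(config, "customers") is resolve_connection(config, "orders"))  # True
```

Both configured DataSources point at the single ``my_connection`` entry, so the connection details are written down only once.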

Built-in DataSources and DataSinks
------------------------------------------------------------------------------
When you ``pip install mllaunchpad``, it comes with a number of built-in
DataSources and DataSinks that are ready to use without needing to specify
any ``plugins: []`` in the :doc:`config`.

Their documentation follows hereunder.

.. autoclass:: mllaunchpad.resource.FileDataSource
   :noindex:
   :members:
   :inherited-members:
   :undoc-members:

.. autoclass:: mllaunchpad.resource.FileDataSink
   :noindex:
   :members:
   :inherited-members:
   :undoc-members:

.. autoclass:: mllaunchpad.resource.OracleDataSource
   :noindex:
   :members:
   :inherited-members:
   :undoc-members:

.. autoclass:: mllaunchpad.resource.OracleDataSink
   :noindex:
   :members:
   :inherited-members:
   :undoc-members:


External DataSources and DataSinks
------------------------------------------------------------------------------
The datasources here are not part of core ML Launchpad.

To be able to use them:

1. install them if they support it, or copy them into your project's source code directory;
2. add their import statement to the ``plugins:`` section of your configuration, e.g.

.. code-block:: yaml

    plugins:  # Add this line if it's not already in your config
      - some_module.my_datasource  # for a file in the 'some_module' directory called 'my_datasource.py'

Their documentation follows hereunder.

.. autoclass:: examples.records_datasource.RecordsDbDataSource
   :members:
   :undoc-members:

1 change: 1 addition & 0 deletions docs/index.rst
@@ -15,6 +15,7 @@ Contents:
usage
mllaunchpad
config
DataSources <datasources>
contributing
Code of Conduct <conduct>
authors
8 changes: 4 additions & 4 deletions docs/usage.rst
@@ -221,8 +221,8 @@ Here, we'll make use of the method arguments ``data_sources`` and ``model``.
See :mod:`~mllaunchpad.model_interface` for details on all available
arguments.

If we call our training :class:`~mllaunchpad.resource.DataSource` ``petals`` and our test
DataSource ``petals_test``, our completed ``tree_model.py`` looks
If we call our training :doc:`DataSource <datasources>` ``petals`` and our test
:doc:`DataSource <datasources>` ``petals_test``, our completed ``tree_model.py`` looks
like this (we highlight changed code with ``#comments``):

.. code-block:: python
@@ -298,7 +298,7 @@ The three methods return the same things as our own functions:

Next, we will configure some extra info about our model,
as well as tell ML Launchpad where to find
the ``petal`` and ``petal_test`` :class:`~mllaunchpad.resource.DataSource` s.
the ``petals`` and ``petals_test`` :doc:`DataSources <datasources>`.

Create a file called ``tree_cfg.yml``::

@@ -348,7 +348,7 @@ Here, we define our ``datasources`` so ML Launchpad knows where to find the
data we refer to from our model. Besides ``csv`` files,
other types of DataSources are supported, and
:ref:`extending DataSources <extending>` is also possible.
(see module :class:`~mllaunchpad.resource` for more information on supported
(see :doc:`datasources` for more information on supported
built-in :class:`DataSources <mllaunchpad.resource.DataSource>`).

The ``model_store`` is just a directory where all trained models will