Big documentation additions
Including fancy-ass logo
stephen-bunn committed Oct 20, 2017
1 parent 1353dcd commit d18ee6c
Showing 9 changed files with 257 additions and 25 deletions.
2 changes: 1 addition & 1 deletion Pipfile
@@ -26,4 +26,4 @@ pyexcel-xlsx = "*"
 ptpython = "*"
 cprofilev = "*"
 sphinx = "*"
-sphinx-readable-theme = "*"
+flask-sphinx-themes = "*"
15 changes: 8 additions & 7 deletions Pipfile.lock


Binary file added docs/source/_static/logo.png
119 changes: 119 additions & 0 deletions docs/source/available-rules.rst
@@ -1,3 +1,122 @@
===============
Available Rules
===============

Below is a list of available rules that can be attached to a :class:`~sandpaper.sandpaper.SandPaper` instance.
Before any of these rules is applied to a value, the value must first pass several optional filters discussed in :ref:`getting_started-rule-filters`.

**In the following examples of these rules the symbol □ represents whitespace.**

lstrip
------
A basic rule that strips all *left* whitespace from a value.

.. code-block:: python

   SandPaper().lstrip()

====== ======
Input  Output
====== ======
□□data data
====== ======


rstrip
------
A basic rule that strips all *right* whitespace from a value.

.. code-block:: python

   SandPaper().rstrip()

====== ======
Input  Output
====== ======
data□□ data
====== ======


strip
-----
A basic rule that strips *all* leading and trailing whitespace from a value.

.. code-block:: python

   SandPaper().strip()

====== ======
Input  Output
====== ======
□data□ data
====== ======


substitute
----------
A substitution rule that replaces regex matches with specified values.

.. code-block:: python

   SandPaper().substitute(
       substitutes={
           r'FL': 'Florida',
           r'NC': 'North Carolina'
       }
   )

====== ==============
Input  Output
====== ==============
FL     Florida
NC     North Carolina
====== ==============


translate_text
--------------
A translation rule that translates regex matches to a specified format.

.. code-block:: python

   SandPaper().translate_text(
       from_regex=r'group_(?P<group_id>\d+)$',
       to_format='{group_id}'
   )

========= ==============
Input     Output
========= ==============
group_47  47
group_123 123
group_0   0
========= ==============
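
The relationship between the named group in ``from_regex`` and the placeholder in ``to_format`` can be sketched with the standard library alone. The function below is a hypothetical stand-in, not the library's code; in particular, passing non-matching values through unchanged is an assumption.

```python
import re


def translate_text_sketch(value, from_regex, to_format):
    """Reformat ``value`` using the named groups captured by ``from_regex``."""
    match = re.match(from_regex, value)
    if match is None:
        # Assumption: values that do not match are returned untouched.
        return value
    # Named groups such as ``group_id`` fill the ``{group_id}`` placeholders.
    return to_format.format(**match.groupdict())


for value in ('group_47', 'group_123', 'group_0'):
    print(translate_text_sketch(value, r'group_(?P<group_id>\d+)$', '{group_id}'))
```

Any name used in ``to_format`` has to exist as a named group in ``from_regex``; otherwise ``str.format`` raises a ``KeyError``.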


translate_date
--------------
A translation rule that translates greedily evaluated dates to a specified datetime format.

.. note:: This rule is very greedy and can potentially evaluate dates incorrectly.
   It is **highly recommended** that, at the very least, a ``column_filter`` is supplied with this rule.

.. code-block:: python

   SandPaper().translate_date(
       from_formats=['%Y-%m-%d', '%Y-%m', '%Y'],
       to_format='%Y'
   )

========== ==============
Input      Output
========== ==============
2017-01-32 2017
2017-01    2017
2017       2017
========== ==============
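
To see why the note above recommends a ``column_filter``, consider a minimal sketch that tries each format in order with ``datetime.strptime``. This is an assumption about the general approach, not the library's actual parsing code, and ``translate_date_sketch`` is a hypothetical name.

```python
from datetime import datetime


def translate_date_sketch(value, from_formats, to_format):
    """Try each input format in order and re-emit the first successful parse."""
    for date_format in from_formats:
        try:
            parsed = datetime.strptime(value, date_format)
        except ValueError:
            continue  # this format does not fit, try the next one
        return parsed.strftime(to_format)
    # Assumption: values that match no format are left unchanged.
    return value


formats = ['%Y-%m-%d', '%Y-%m', '%Y']
print(translate_date_sketch('2017-01-15', formats, '%Y'))  # 2017
print(translate_date_sketch('2017-01', formats, '%Y'))     # 2017

# A plain four-digit number also parses as a year, which is the greedy
# behaviour that a column filter guards against.
print(translate_date_sketch('1999', formats, '%Y-%m-%d'))  # 1999-01-01
```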
14 changes: 9 additions & 5 deletions docs/source/conf.py
@@ -20,7 +20,7 @@
 import os
 import sys

-import sphinx_readable_theme
+import flask_sphinx_themes


sys.path.insert(0, os.path.abspath('../..'))
@@ -80,7 +80,7 @@
 exclude_patterns = []

 # The name of the Pygments (syntax highlighting) style to use.
-pygments_style = 'sphinx'
+pygments_style = 'flask_sphinx_themes.pygments.FlaskyStyle'

# If true, `todo` and `todoList` produce output, else they produce nothing.
todo_include_todos = False
@@ -91,14 +91,18 @@
 # The theme to use for HTML and HTML Help pages. See the documentation for
 # a list of builtin themes.
 #
-html_theme_path = [sphinx_readable_theme.get_html_theme_path()]
-html_theme = 'readable'
+html_theme_path = [flask_sphinx_themes.get_path()]
+html_theme = 'flask'

 # Theme options are theme-specific and customize the look and feel of a theme
 # further. For a list of options available for each theme, see the
 # documentation.
 #
-# html_theme_options = {}
+html_theme_options = {
+    'github_fork': 'stephen-bunn/sandpaper',
+    'index_logo': 'logo.png',
+    'index_logo_height': '200px'
+}

# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
118 changes: 115 additions & 3 deletions docs/source/getting-started.rst
@@ -36,7 +36,7 @@ Now that you have the package installed, you will have to run the ``setup.py`` s
Usage
-----
Using SandPaper is *fairly* simple and straightforward.
-First things first, in order to normalize any data you have to create an instance of the SandPaper object to group together your normalization rules.
+First things first, in order to normalize any data you have to create an instance of the :class:`~sandpaper.sandpaper.SandPaper` object to group together your normalization rules.

.. code-block:: python
@@ -49,6 +49,118 @@ First things first, in order to normalize any data you have to create an instanc
my_sandpaper = SandPaper()
-Now that you have a SandPaper instance, you can start chaining in rules that should be applied in order to normalize the data.
.. _getting_started-chaining-rules:

-.. tip:: For a full list of available rules, check out the list of rules `here <available-rules.html>`_.
Chaining Rules
''''''''''''''

Now that you have a :class:`~sandpaper.sandpaper.SandPaper` instance, you can start chaining in rules that should be applied in order to normalize the data.

.. tip:: For a full list of available rules, check out the list of rules `here <available-rules.html>`__.

Rules can be applied by chaining the ordered normalization processes directly off of a :class:`~sandpaper.sandpaper.SandPaper` instance.

.. code-block:: python

   my_sandpaper.strip()

This will apply the :func:`~sandpaper.sandpaper.SandPaper.strip` rule to the ``my_sandpaper`` instance.
As configured, the ``my_sandpaper`` instance will strip whitespace from every value, since no filters were given.

We can add another rule to ``my_sandpaper`` by simply calling it.

.. code-block:: python

   my_sandpaper.substitute(
       substitutes={
           r'FL': 'Florida',
           r'NC': 'North Carolina'
       },
       column_filter=r'state'
   )

This will apply the :func:`~sandpaper.sandpaper.SandPaper.substitute` rule to the ``my_sandpaper`` instance.

Since the :func:`~sandpaper.sandpaper.SandPaper.strip` rule has already been applied, stripping of all whitespace will occur before this rule is applied.
The :func:`~sandpaper.sandpaper.SandPaper.substitute` rule will substitute the regular expression matches ``FL`` and ``NC`` with the values ``Florida`` and ``North Carolina``, respectively, but only in columns whose name matches the filter ``state``.


The current state of the ``my_sandpaper`` instance could have also been initialized in one go using the chaining feature that rules provide.

.. code-block:: python

   # Parentheses let the chained calls span multiple lines.
   my_sandpaper = (
       SandPaper('my-sandpaper')
       .strip()
       .substitute(
           substitutes={
               r'FL': 'Florida',
               r'NC': 'North Carolina'
           },
           column_filter=r'state'
       )
   )

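
Chaining of this kind typically works because each rule method records the rule and then returns the instance itself. The class below is a hypothetical stand-in written to illustrate that pattern under this assumption; it is not the real ``SandPaper`` class and does not normalize any data.

```python
class MiniPaper:
    """Hypothetical stand-in showing how chainable rule methods can work."""

    def __init__(self, name=None):
        self.name = name
        self.rules = []  # rules are stored in the order they were added

    def strip(self, **filters):
        self.rules.append(('strip', filters))
        return self  # returning the instance is what enables chaining

    def substitute(self, substitutes, **filters):
        self.rules.append(('substitute', dict(filters, substitutes=substitutes)))
        return self


paper = (
    MiniPaper('my-sandpaper')
    .strip()
    .substitute(substitutes={r'FL': 'Florida'}, column_filter=r'state')
)
print([name for name, _ in paper.rules])  # ['strip', 'substitute']
```

The order of the recorded rules mirrors the order of the calls, which matches the statement above that stripping happens before substitution.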
---

In order to run this :class:`~sandpaper.sandpaper.SandPaper` instance you need to call the :func:`~sandpaper.sandpaper.SandPaper.apply` method with a glob pattern that matches the files to normalize.

.. code-block:: python

   my_sandpaper.apply('~/data_*{01..99}.csv')

.. note:: We use fancy brace expansion in our glob evaluation!
   You can take very interesting glob shortcuts with brace expansion, which you can learn about `here <https://pypi.python.org/pypi/braceexpand>`__.

In this instance the whitespace stripping will be applied to all ``.csv`` files starting with ``data_`` and ending with a number between ``01`` and ``99``.
*However*, because :func:`~sandpaper.sandpaper.SandPaper.apply` is actually a generator, in order to run the normalization you need to iterate over the method call.

.. code-block:: python

   for output_filepath in my_sandpaper.apply('~/data_*{01..99}.csv'):
       print(output_filepath)

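
The reason a bare call does nothing is general generator behaviour in Python: the function body only runs while something iterates over it. The sketch below demonstrates this with a hypothetical ``apply_sketch`` function; the ``.sanded`` output suffix is made up for illustration and is not claimed to be what the library produces.

```python
processed = []  # records which files the sketch has actually "normalized"


def apply_sketch(filepaths):
    """Hypothetical stand-in for a generator-based ``apply`` method."""
    for filepath in filepaths:
        processed.append(filepath)   # the work happens only while iterating
        yield filepath + '.sanded'   # made-up output name, for illustration


pending = apply_sketch(['data_01.csv', 'data_02.csv'])
print(processed)         # [] because nothing has run yet

outputs = list(pending)  # iterating drives the generator to completion
print(processed)         # ['data_01.csv', 'data_02.csv']
```

Wrapping the call in ``list(...)`` is a compact alternative to an explicit ``for`` loop when the output paths are needed afterwards.
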
.. _getting_started-rule-filters:

Rule Filters
''''''''''''

An important thing to note about rules is that every value has to first pass several optional filters if the rule is to be applied to that value.

``column_filter`` : regex
    A regular expression filter applied to the column name of the value (*must have a match to pass*)

``value_filter`` : regex
    A regular expression filter applied to the value (*must have a match to pass*)

``callable_filter`` : callable
    A callable reference that is executed for each value (*must evaluate to true to pass*)

    .. note:: This callable should expect to receive the parameters ``record``, ``column`` in that order, as well as any specified rule kwargs.
       The callable should return a boolean value which is True if the rule should be applied, otherwise False.

These filters are processed in the order presented and are completely optional.
**If no filters are specified, then the rule is applied.**
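
The filter order described above can be sketched as a single standalone function. This is an illustration only: ``rule_applies`` and ``only_us_rows`` are hypothetical names, the record is assumed to be a plain dictionary, and ``re.match`` is assumed for both regex filters.

```python
import re


def rule_applies(record, column, column_filter=None, value_filter=None,
                 callable_filter=None, **kwargs):
    """Return True if the value at ``record[column]`` passes every given filter."""
    if column_filter is not None and not re.match(column_filter, column):
        return False
    if value_filter is not None and not re.match(value_filter, str(record[column])):
        return False
    # The callable receives ``record`` and ``column`` first, then any rule kwargs.
    if callable(callable_filter) and not callable_filter(record, column, **kwargs):
        return False
    return True  # no filters given, or all of them passed


def only_us_rows(record, column, **kwargs):
    """Example callable filter: apply the rule only to rows from the US."""
    return record.get('country') == 'US'


record = {'state': 'FL', 'country': 'US'}
print(rule_applies(record, 'state', column_filter=r'state'))        # True
print(rule_applies(record, 'country', column_filter=r'state'))      # False
print(rule_applies(record, 'state', callable_filter=only_us_rows))  # True
print(rule_applies(record, 'state'))                                # True
```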


.. _getting_started-saving-sandpapers:

Saving SandPapers
'''''''''''''''''

It is possible to export a :class:`~sandpaper.sandpaper.SandPaper` instance using the :func:`~sandpaper.sandpaper.SandPaper.export` function.
This exports the configuration of the instance in `JSON <http://www.json.org>`__ format, either to a provided filepath or to stdout.

.. code-block:: python

   # for exporting to a file
   my_sandpaper.export('/home/USER/my-sandpaper.json')

   # for writing the export to stdout
   my_sandpaper.export()

This exported format can be used to bootstrap a new :class:`~sandpaper.sandpaper.SandPaper` instance by providing the filepath where the exported data is stored to the :func:`~sandpaper.sandpaper.SandPaper.load` method.

.. code-block:: python

   new_sandpaper = SandPaper.load('/home/USER/my-sandpaper.json')
10 changes: 3 additions & 7 deletions docs/source/index.rst
@@ -2,24 +2,20 @@
SandPaper
=========

*A simplified table-type data normalization module.*


.. toctree::
   :maxdepth: 2
   :caption: Contents:

   getting-started
   available-rules


.. toctree::
-   :maxdepth: 2
-   :caption: Reference:
+   :maxdepth: 3
+   :caption: Module Reference:

   modules



Indices
=======

2 changes: 1 addition & 1 deletion requirements.txt
@@ -7,6 +7,7 @@ cprofilev==1.0.7 --hash=sha256:8791748b1f3d3468c2c927c3fd5f905080b84d8f2d217ca76
docopt==0.6.2 --hash=sha256:49b3a825280bd66b3aa83585ef59c4a8c82f2c8a522dbe754a8bc8d08c85c491
docutils==0.14 --hash=sha256:7a4bd47eaf6596e1295ecb11361139febe29b084a87bf005bf899f9a42edc3c6 --hash=sha256:02aec4bd92ab067f6ff27a38a38a41173bf01bed8f89157768c1573f53e474a6 --hash=sha256:51e64ef2ebfb29cae1faa133b3710143496eca21c530f3f71424d77687764274
flake8==3.4.1 --hash=sha256:f1a9d8886a9cbefb52485f4f4c770832c7fb569c084a9a314fb1eaa37c0c2c86 --hash=sha256:c20044779ff848f67f89c56a0e4624c04298cd476e25253ac0c36f910a1a11d8
flask-sphinx-themes==1.0.1 --hash=sha256:3a5d94f4a28044641153c9fd4823c2d93f2beecc69127f7be736e9d2e29e1a05 --hash=sha256:a83eebca95fc5b8adbae4e65926961912edd52a1b6a422c0301a750d1ae31747
idna==2.6 --hash=sha256:8c7309c718f94b3a625cb648ace320157ad16ff131ae0af362c9f21b80ef6ec4 --hash=sha256:2c6a5de3089009e3da7c5dde64a141dbc8551d5b7f6cf4ed7c2568d0cc520a8f
imagesize==0.7.1 --hash=sha256:6ebdc9e0ad188f9d1b2cdd9bc59cbe42bf931875e829e7a595e6b3abdc05cdfb --hash=sha256:0ab2c62b87987e3252f89d30b7cedbec12a01af9274af9ffa48108f2c13c6062
jedi==0.11.0 --hash=sha256:3af518490ffcd00a3074c135b42511e081575e9abd115c216a34491411ceebb0 --hash=sha256:f6d5973573e76b1fd2ea75f6dcd6445d02d41ff3af5fc61b275b4e323d1dd396
@@ -24,7 +25,6 @@ requests==2.18.4 --hash=sha256:6a1b267aa90cac58ac3a765d067950e7dbbf75b1da07e895d
six==1.11.0 --hash=sha256:832dc0e10feb1aa2c68dcc57dbb658f1c7e65b9b61af69048abc87a2db00a0eb --hash=sha256:70e8a77beed4562e7f14fe23a786b54f6296e34344c23bc42f07b15018ff98e9
snowballstemmer==1.2.1 --hash=sha256:9f3bcd3c401c3e862ec0ebe6d2c069ebc012ce142cce209c098ccb5b09136e89 --hash=sha256:919f26a68b2c17a7634da993d91339e288964f93c274f1343e3bbbe2096e1128
sphinx==1.6.4 --hash=sha256:3e70eb94f7e81b47e0545ebc26b758193b6c8b222e152ded99b9c972e971c731 --hash=sha256:f101efd87fbffed8d8aca6ef307fec57693334f39d32efcbc2fc96ed129f4a3e
sphinx-readable-theme==1.3.0 --hash=sha256:f5fe65a2e112cb956b366df41e0fc894ff6b6f0e4a4814fcbff692566db47fc0
sphinxcontrib-websupport==1.0.1 --hash=sha256:f4932e95869599b89bf4f80fc3989132d83c9faa5bf633e7b5e0c25dffb75da2 --hash=sha256:7a85961326aa3a400cd4ad3c816d70ed6f7c740acd7ce5d78cd0a67825072eb9
urllib3==1.22 --hash=sha256:06330f386d6e4b195fbfc736b297f58c5a892e4440e54d294d7004e3a9bbea1b --hash=sha256:cc44da8e1145637334317feebd728bd869a35285b93cbb4cca2577da7e62db4f
wcwidth==0.1.7 --hash=sha256:f4ebe71925af7b40a864553f761ed559b43544f8f71746c2d756c7fe788ade7c --hash=sha256:3df37372226d6e63e1b1e1eda15c594bca98a22d33a23832a90998faa96bc65e
2 changes: 1 addition & 1 deletion sandpaper/sandpaper.py
@@ -196,7 +196,7 @@ def _filter_allowed(
         if not value_filter.match(str(value)):
             continue
     if callable(callable_filter):
-        if not callable_filter(**kwargs):
+        if not callable_filter(record, column, **kwargs):
             continue

     yield (column, value,)