Adding limitations to the documentation
stephen-bunn committed Oct 27, 2017
1 parent 58d91cb commit 4129676
Showing 3 changed files with 53 additions and 1 deletion.
1 change: 1 addition & 0 deletions CHANGELOG.rst
@@ -8,6 +8,7 @@ The format is based on `Keep a Changelog <http://keepachangelog.com/en/1.0.0/>`_

*unreleased*
------------
* added enforcement for strict date parsing in ``translate_date`` rule


`0.0.4`_ (*2017-10-26*)
50 changes: 50 additions & 0 deletions docs/source/getting-started.rst
@@ -162,6 +162,9 @@ This exported format can be used to bootstrap a new :class:`~sandpaper.sandpaper
new_sandpaper = SandPaper.load(serialized)
.. _getting_started-be-explicit:

Be Explicit
-----------

@@ -197,3 +200,50 @@ However, when the user tries to apply the :class:`~sandpaper.sandpaper.SandPaper
This is due to how the ``substitutes`` are stored as ``kwargs`` rather than ``args`` to the ``substitute`` function.

**TLDR:** *Be explicit with the parameters of all rules!*


.. _getting_started-limitations:

Limitations
-----------

This module still has several limitations that affect how normalized data is read and written.
They are described in the subsections below.


.. _getting_started-reading-as-records:

Reading as Records
''''''''''''''''''

To provide the filtering (:ref:`getting_started-rule-filters`) that makes specifying advanced normalization rules much easier, SandPaper reads rows of tabular data in as records (:class:`collections.OrderedDict`).
This makes it easy to tie row entries to column names, but it limits the formats of data that can be read properly.
The main limitation is that **table sheets with duplicate column names cannot be read properly**.

Because `pyexcel <https://pyexcel.readthedocs.io/en/latest/>`_ reads records as :class:`~collections.OrderedDict`, only the last column sharing a duplicated name is kept.

For example, the following table data...

========= =========
my_column my_column
========= =========
1         2
3         4
========= =========

will only output the last ``my_column`` column (with values 2 and 4) in the resulting ``sanded`` data.
This is because each record is built by reading the first column and then overwriting its value with the value from the second column.
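
The overwrite can be reproduced with the standard library alone. The snippet below is only an illustration of the mechanism and is not how pyexcel or SandPaper build records internally; the ``header`` and ``rows`` names are made up for this example.

```python
from collections import OrderedDict

# Header and rows mirroring the example table above.
header = ["my_column", "my_column"]
rows = [[1, 2], [3, 4]]

# Pairing column names with values: a repeated key keeps a single entry,
# and its value is replaced by the value from the later column.
records = [OrderedDict(zip(header, row)) for row in rows]

for record in records:
    print(dict(record))
```

Each record ends up with a single ``my_column`` entry holding the value from the second column.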

A fix for this issue is possible; however, it would require a lot of refactoring and additional testing, which (obviously) has not been done.
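
Until such a fix exists, one possible workaround is to make the column names unique before the data is turned into records. The sketch below assumes you can get at the header row yourself; ``dedupe_header`` is a hypothetical helper and is not part of SandPaper or pyexcel.

```python
from collections import OrderedDict


def dedupe_header(names):
    """Return column names with repeats renamed to name_2, name_3, ...

    Hypothetical helper for illustration only; not part of SandPaper.
    """
    used = set()
    unique = []
    for name in names:
        candidate, count = name, 1
        # Keep bumping the suffix until the name has not been used yet.
        while candidate in used:
            count += 1
            candidate = "{}_{}".format(name, count)
        used.add(candidate)
        unique.append(candidate)
    return unique


header = dedupe_header(["my_column", "my_column"])
rows = [[1, 2], [3, 4]]

# With unique names, both columns survive in each record.
records = [OrderedDict(zip(header, row)) for row in rows]
```

The trade-off is that the renamed columns no longer match the names in the source sheet, so any ``column_filter`` patterns need to account for the added suffixes.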


.. _getting_started-translating-dates:

Translating Dates
'''''''''''''''''

The :func:`~sandpaper.sandpaper.SandPaper.translate_date` rule is quite nifty, but it also has a couple of limitations that need to be considered.
We use the `dateparser <https://dateparser.readthedocs.io/en/latest/>`_ library to handle date parsing, and it can be greedy at times.
To counteract this greediness, we pass the ``STRICT_PARSING`` setting, with the intent of limiting matches to the formats provided to the :func:`~sandpaper.sandpaper.SandPaper.translate_date` rule.

However, because this parsing takes a considerable amount of time when executed for many items, it is recommended to also specify at least a ``column_filter`` for every instance of the rule.
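
To see why a column filter helps, consider the rough sketch below. It is a stand-in and not SandPaper's implementation: ``translate_dates`` is a hypothetical function, :func:`datetime.datetime.strptime` is used in place of dateparser, and matching the filter with :func:`re.match` against the column name is an assumption.

```python
import re
from datetime import datetime


def translate_dates(records, column_filter, from_formats, to_format):
    """Reformat dates only in columns whose name matches column_filter.

    Hypothetical stand-in for a date rule; not SandPaper's implementation.
    """
    pattern = re.compile(column_filter)
    for record in records:
        for column, value in list(record.items()):
            # Columns rejected by the filter are skipped entirely,
            # so no parsing cost is paid for them.
            if not pattern.match(column):
                continue
            for date_format in from_formats:
                try:
                    parsed = datetime.strptime(str(value), date_format)
                except ValueError:
                    # The value does not match this format; try the next one.
                    continue
                record[column] = parsed.strftime(to_format)
                break
    return records


records = [{"created_date": "27/10/2017", "notes": "01/02/2003"}]
result = translate_dates(records, r"^.*_date$", ["%d/%m/%Y"], "%Y-%m-%d")
```

Only ``created_date`` is parsed and rewritten here; the ``notes`` column is never handed to the parser even though its value looks like a date.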
3 changes: 2 additions & 1 deletion sandpaper/sandpaper.py
@@ -693,7 +693,8 @@ def translate_date(
 value = record[column]
 parsed_date = dateparser.parse(
     str(value),
-    date_formats=from_formats
+    date_formats=from_formats,
+    settings={'STRICT_PARSING': True}
 )
 if parsed_date is not None:
     return parsed_date.strftime(to_format)
