Cookbook: Levenshtein edit distance. Closes #172.
onyxfish committed Aug 31, 2015
1 parent 0218d41 commit 77897f8
Showing 2 changed files with 71 additions and 12 deletions.
2 changes: 1 addition & 1 deletion agate/exceptions.py
@@ -7,7 +7,7 @@
 class NullCalculationError(Exception):  # pragma: no cover
     """
     Exception raised if a calculation which can not logically
-    account for null values is attempted on a :class:`Column containing
+    account for null values is attempted on a :class:`Column` containing
     nulls.
     """
     pass
81 changes: 70 additions & 11 deletions docs/cookbook/calculations.rst
@@ -1,9 +1,9 @@
-=======================
-Performing calculations
-=======================
+====================
+Computing new values
+====================

-Computing annual change
-========================
+Annual change
+=============

You could use a :class:`.Formula` to calculate percent change yourself, but for convenience agate has a built-in shortcut. For example, if your spreadsheet has a column with values for each year, you could do:

@@ -27,8 +27,8 @@ Or, better yet, compute the whole decade using a loop:
new_table = table.compute(computations)
-Computing annual percent change
-===============================
+Annual percent change
+=====================
Want percent change instead of value change? Just swap out the :class:`.Computation`:
@@ -42,8 +42,8 @@ Want percent change instead of value change? Just swap out the :class:`.Aggregat
new_table = table.compute(computations)
-Computing indexed change
-========================
+Indexed/cumulative change
+=========================
Need your change indexed to a starting year? Just fix the first argument:
@@ -59,8 +59,8 @@ Need your change indexed to a starting year? Just fix the first argument:
Of course you can also use :class:`.PercentChange` if you need percents rather than values.
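For intuition, the arithmetic behind annual and indexed change can be sketched in plain Python. This is a minimal illustration and not agate's implementation: the ``change`` and ``percent_change`` helpers and the sample figures are hypothetical, and it assumes the usual definition of percent change, (after - before) / before * 100.

```python
from decimal import Decimal

# Hypothetical yearly values keyed by year; not taken from any real dataset.
values = {2000: Decimal('200'), 2001: Decimal('220'), 2002: Decimal('253')}
years = sorted(values)


def change(before, after):
    """Value change between two observations."""
    return after - before


def percent_change(before, after):
    """Percent change relative to the earlier observation (assumed definition)."""
    return (after - before) / before * 100


# Annual: each year is compared with the year immediately before it.
annual = {
    year: percent_change(values[prev], values[year])
    for prev, year in zip(years, years[1:])
}

# Indexed: every year is compared with the same fixed starting year.
base = years[0]
indexed = {year: percent_change(values[base], values[year]) for year in years[1:]}
```

The only difference between the two is which value serves as the "before" argument: the previous year for annual change, or one fixed year for indexed change.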
-Rounding to two decimal places
-==============================
+Round to two decimal places
+===========================
agate stores numerical values using Python's :class:`decimal.Decimal` type. This data type ensures numerical precision beyond what the native :func:`float` type supports. Because of this, we cannot use Python's built-in :func:`round` function; instead we must use :meth:`decimal.Decimal.quantize`.
@@ -81,3 +81,62 @@ We can use :meth:`.Table.compute` to apply the quantize to generate a rounded co
])
To round to one decimal place you would simply change :code:`0.01` to :code:`0.1`.
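The behaviour of ``quantize`` can be checked on its own, outside of agate. The following is a small standard-library sketch with arbitrarily chosen sample values. It shows that the exponent of the argument sets the precision, and that the default context rounds halves to the nearest even digit unless another rounding mode is passed.

```python
from decimal import Decimal, ROUND_HALF_UP

value = Decimal('3.14159')

# The exponent of the argument determines how many places are kept.
two_places = value.quantize(Decimal('0.01'))
one_place = value.quantize(Decimal('0.1'))

# Ties are rounded to the nearest even digit under the default context.
tie = Decimal('2.665')
half_even = tie.quantize(Decimal('0.01'))

# Pass a rounding mode explicitly for "schoolbook" rounding of ties.
half_up = tie.quantize(Decimal('0.01'), rounding=ROUND_HALF_UP)
```

If rounded figures will be published, it is worth deciding on the rounding mode deliberately rather than relying on the default.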
+
+Levenshtein edit distance
+=========================
+
+The Levenshtein edit distance is a common measure of string similarity. It can be used, for instance, to check for typos between manually-entered names and a version that is known to be spelled correctly.
+
+Implementing Levenshtein requires writing a custom :class:`.Computation`. To save ourselves building the whole thing from scratch, we will lean on the `python-Levenshtein <https://pypi.python.org/pypi/python-Levenshtein/>`_ library for the actual algorithm.
+
+.. code-block:: python
+
+    import agate
+    from Levenshtein.StringMatcher import StringMatcher
+    import six
+
+    class LevenshteinDistance(agate.Computation):
+        """
+        Computes Levenshtein edit distance between the column and a given string.
+        """
+        def __init__(self, column_name, compare_string):
+            self._column_name = column_name
+            self._matcher = StringMatcher(seq2=six.text_type(compare_string))
+
+        def get_computed_column_type(self, table):
+            """
+            The return value is a numerical distance.
+            """
+            return agate.NumberType()
+
+        def prepare(self, table):
+            """
+            Verify the comparison column is a TextColumn.
+            """
+            column = table.columns[self._column_name]
+
+            if not isinstance(column, agate.TextColumn):
+                raise agate.UnsupportedComputationError(self, column)
+
+        def run(self, row):
+            """
+            Find the distance, returning null when the input column was null.
+            """
+            val = row[self._column_name]
+
+            if val is None:
+                return None
+
+            self._matcher.set_seq1(val)
+
+            return self._matcher.distance()
+
+This code can now be applied to any :class:`.Table` just as any other :class:`.Computation` would be:
+
+.. code-block:: python
+
+    new_table = table.compute([
+        ('distance', LevenshteinDistance('column_name', 'string to compare'))
+    ])
+
+The resulting column will contain an integer measuring the edit distance between the value in the column and the comparison string.
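To get a feel for the values such a column would hold, here is a dependency-free sketch of the edit distance itself. It is a pure-Python stand-in for the python-Levenshtein library, not the code used above: the ``levenshtein`` function and the sample names are hypothetical, and it uses the classic dynamic-programming recurrence, which is far slower than the C implementation.

```python
def levenshtein(a, b):
    """Edit distance between two strings, computed one row at a time."""
    # previous[j] holds the distance between the first i-1 chars of a
    # and the first j chars of b.
    previous = list(range(len(b) + 1))

    for i, char_a in enumerate(a, start=1):
        current = [i]

        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute (free on a match)
            ))

        previous = current

    return previous[-1]


# Hypothetical manually entered names compared against a known spelling.
# Nulls pass through untouched, mirroring the null handling described above.
names = ['Smith', 'Smyth', 'Smithe', None]
distances = [None if name is None else levenshtein(name, 'Smith') for name in names]
```

A distance of zero means an exact match, and small non-zero distances are the likely typos worth reviewing by hand.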