How to Validate Fuzzy Matches

When comparing strings of text, it can sometimes be useful to check that values are similar instead of asserting that they are exactly the same. Datatest provides options for approximate string matching (also called "fuzzy matching").

When checking mappings or sequences of values, you can accept approximate matches with the accepted.fuzzy acceptance:

Using Acceptance

from datatest import validate, accepted

linked_record = {
    'id165': 'Saint Louis',
    'id382': 'Raliegh',
    'id592': 'Austin',
    'id720': 'Cincinatti',
    'id826': 'Philadelphia',
}

master_record = {
    'id165': 'St. Louis',
    'id382': 'Raleigh',
    'id592': 'Austin',
    'id720': 'Cincinnati',
    'id826': 'Philadelphia',
}

with accepted.fuzzy(cutoff=0.6):
    validate(linked_record, master_record)

No Acceptance

from datatest import validate

linked_record = {
    'id165': 'Saint Louis',
    'id382': 'Raliegh',
    'id592': 'Austin',
    'id720': 'Cincinatti',
    'id826': 'Philadelphia',
}

master_record = {
    'id165': 'St. Louis',
    'id382': 'Raleigh',
    'id592': 'Austin',
    'id720': 'Cincinnati',
    'id826': 'Philadelphia',
}

validate(linked_record, master_record)
Traceback (most recent call last):
  File "example.py", line 19, in <module>
    validate(linked_record, master_record)
datatest.ValidationError: does not satisfy mapping requirements (3 differences): {
    'id165': Invalid('Saint Louis', expected='St. Louis'),
    'id382': Invalid('Raliegh', expected='Raleigh'),
    'id720': Invalid('Cincinatti', expected='Cincinnati'),
}
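To get a feel for what a cutoff of 0.6 means, it can help to score the three reported differences by hand. The sketch below uses `SequenceMatcher.ratio()` from the Standard Library's `difflib` module. It assumes datatest measures similarity in a comparable way, which this page does not confirm; the `similarity` helper and the `pairs` dictionary are illustrative names, not part of datatest.

```python
from difflib import SequenceMatcher

# Pairs copied from the three differences in the traceback above:
# (value in linked_record, value in master_record).
pairs = {
    'id165': ('Saint Louis', 'St. Louis'),
    'id382': ('Raliegh', 'Raleigh'),
    'id720': ('Cincinatti', 'Cincinnati'),
}


def similarity(a, b):
    """Return a score from 0.0 to 1.0, where 1.0 means the strings are identical.

    Illustrative helper (an assumption, not datatest's own code).
    """
    return SequenceMatcher(None, a, b).ratio()


# Score every pair so the values can be compared against a chosen cutoff.
scores = {
    key: similarity(actual, expected)
    for key, (actual, expected) in pairs.items()
}

for key, score in scores.items():
    print(f'{key}: {score:.2f}')
```

Under this measure, each of the three pairs scores above 0.6, which is consistent with the differences being accepted at that cutoff.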

If variation is an inherent, natural feature of the data and does not necessarily represent a defect, it may be appropriate to use validate.fuzzy instead of the acceptance shown previously:

from datatest import validate

linked_record = {
    'id165': 'Saint Louis',
    'id382': 'Raliegh',
    'id592': 'Austin',
    'id720': 'Cincinatti',
    'id826': 'Philadelphia',
}

master_record = {
    'id165': 'St. Louis',
    'id382': 'Raleigh',
    'id592': 'Austin',
    'id720': 'Cincinnati',
    'id826': 'Philadelphia',
}

validate.fuzzy(linked_record, master_record, cutoff=0.6)

That said, an acceptance is probably more appropriate for this specific example, because values like 'Raliegh' and 'Cincinatti' are misspellings rather than natural variation.
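Whichever approach is used, the result depends on the cutoff. The following is a rough Standard Library sketch for exploring that choice before settling on a value. `below_cutoff` is a hypothetical helper, not a datatest function, and it assumes that a pair passes when its `difflib` similarity ratio is greater than or equal to the cutoff.

```python
from difflib import SequenceMatcher


def below_cutoff(data, requirement, cutoff=0.6):
    """Hypothetical helper: return the pairs that are NOT similar enough.

    A pair is kept in the result when its similarity ratio falls
    below `cutoff` (assumed comparison rule, not taken from datatest).
    """
    return {
        key: (data[key], requirement[key])
        for key in requirement
        if SequenceMatcher(None, data[key], requirement[key]).ratio() < cutoff
    }


linked_record = {
    'id165': 'Saint Louis',
    'id382': 'Raliegh',
    'id592': 'Austin',
    'id720': 'Cincinatti',
    'id826': 'Philadelphia',
}

master_record = {
    'id165': 'St. Louis',
    'id382': 'Raleigh',
    'id592': 'Austin',
    'id720': 'Cincinnati',
    'id826': 'Philadelphia',
}

# A lenient cutoff lets every pair through; a stricter one may not.
lenient = below_cutoff(linked_record, master_record, cutoff=0.6)
strict = below_cutoff(linked_record, master_record, cutoff=0.85)

print('below 0.6: ', lenient)
print('below 0.85:', strict)
```

Printing the pairs that fall below a candidate cutoff makes it easier to judge whether a stricter value would start rejecting legitimate variations such as abbreviations.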