# Fuzzy Match

This example demonstrates how to use the [SimilarityJoinTransform](https://arc.tripl.ai/transform/#similarityjointransform) stage to perform [approximate string matching](https://en.wikipedia.org/wiki/Approximate_string_matching) (a.k.a. fuzzy matching) to find similar records across two datasets.

In this case the reference dataset is a very small subset of the [PSMA Geocoded National Address File (G-NAF)](https://data.gov.au/dataset/ds-dga-19432f89-dc3a-4ef3-b943-5326ef1dbecc/details). The full G-NAF contains all official Australian addresses.

The second dataset is an example of the kind of address data that could be extracted from your Customer Relationship Management (CRM) system.

The `SimilarityJoinTransform` stage uses [Locality Sensitive Hashing (LSH)](https://databricks.com/blog/2017/05/09/detecting-abuse-scale-locality-sensitive-hashing-uber-engineering.html) to efficiently compare and join the datasets based on their `similarity` value. The `threshold` parameter configures how similar two values must be before they are returned as a 'joined' record in the output. The `shingleLength` parameter sets the length of the character sequences (shingles) the values are split into for comparison, and is worth experimenting with for your dataset.
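
To build some intuition for how `shingleLength` affects the `similarity` value, here is a minimal Python sketch. It is not Arc's implementation: it assumes the similarity is a Jaccard-style overlap between sets of character shingles, and it computes that overlap exactly rather than approximating it with LSH. The helper names `shingles` and `jaccard` and the two sample strings are made up for illustration.

```python
def shingles(text: str, length: int) -> set[str]:
    """Split text into the set of overlapping character sequences of `length`."""
    text = text.lower()
    if len(text) <= length:
        # Text too short to split: treat the whole value as a single shingle.
        return {text}
    return {text[i:i + length] for i in range(len(text) - length + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Shared shingles divided by all distinct shingles across both sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical values, not taken from the example datasets.
left = "12 smith street sydney"
right = "12 smith st sydney"

# Compare the same pair of strings with different shingle lengths.
for length in (2, 3, 4):
    score = jaccard(shingles(left, length), shingles(right, length))
    print(f"shingleLength={length} similarity={score:.2f}")
```

Under these assumptions, longer shingles are stricter because a single differing character affects more of them, so a `threshold` that works for one `shingleLength` may need adjusting for another.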

In [None]:
%env
ETL_CONF_BASE_PATH=/home/jovyan/examples/fuzzy_match

In [None]:
{
  "type": "DelimitedExtract",
  "name": "load Geocoded National Address File extract",
  "environments": [
    "production",
    "test"
  ],
  "inputURI": ${ETL_CONF_BASE_PATH}"/gnaf.csv",
  "outputView": "gnaf",
  "header": true
}

In [None]:
{
  "type": "DelimitedExtract",
  "name": "load addresses from customer master system",
  "environments": [
    "production",
    "test"
  ],
  "inputURI": ${ETL_CONF_BASE_PATH}"/addresses.csv",
  "outputView": "addresses",
  "header": true
}

In [None]:
{
  "type": "SimilarityJoinTransform",
  "name": "look up addresses against the national address database",
  "environments": [
    "production",
    "test"
  ],
  "leftView": "gnaf",
  "leftFields": ["flat_number", "number_first", "street_name", "street_type", "locality_name", "postcode", "state"],
  "rightView": "addresses",
  "rightFields": ["street", "state_postcode_suburb"],
  "outputView": "matches",
  "threshold": 0.50,
  "shingleLength": 3
}
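
The stage above compares seven `gnaf` fields against only two `addresses` fields. The Python sketch below illustrates one way that can work, under the assumption that the listed fields of each record are concatenated into a single string before comparison. The sample rows are invented, the `concat_fields` and `similarity` helpers are hypothetical, and the score is an exact Jaccard overlap rather than the LSH approximation the stage uses.

```python
LEFT_FIELDS = ["flat_number", "number_first", "street_name", "street_type",
               "locality_name", "postcode", "state"]
RIGHT_FIELDS = ["street", "state_postcode_suburb"]

# Invented rows for illustration only; not taken from gnaf.csv or addresses.csv.
gnaf_row = {"flat_number": "", "number_first": "12", "street_name": "SMITH",
            "street_type": "STREET", "locality_name": "SYDNEY",
            "postcode": "2000", "state": "NSW"}
address_row = {"street": "12 Smith St", "state_postcode_suburb": "NSW 2000 Sydney"}


def concat_fields(row: dict, fields: list) -> str:
    """Join the non-empty field values into one lower-cased string (assumed behaviour)."""
    values = [str(row[name]).strip() for name in fields]
    return " ".join(value for value in values if value).lower()


def similarity(left: str, right: str, length: int) -> float:
    """Exact Jaccard overlap of the character shingles of two strings."""
    a = {left[i:i + length] for i in range(max(len(left) - length + 1, 1))}
    b = {right[i:i + length] for i in range(max(len(right) - length + 1, 1))}
    return len(a & b) / len(a | b)


left_key = concat_fields(gnaf_row, LEFT_FIELDS)
right_key = concat_fields(address_row, RIGHT_FIELDS)
score = similarity(left_key, right_key, length=3)  # shingleLength from the stage

print(left_key)
print(right_key)
# Assumes a pair is kept when its score is at least the configured threshold.
print(f"similarity={score:.2f} kept={score >= 0.50}")
```

Because both sides are reduced to a set of shingles, differences in field order and abbreviations such as `St` for `STREET` lower the score rather than preventing a match outright.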