## Fuzzy Match POC with Apache Spark
The objective of this project is to test native Spark functions for string similarity analysis, using several similarity algorithms.

### Approaches

- 1st Approach: Use native Scala Spark SQL fuzzy match algorithms, cross-joining the input dataset with the target dataset. Every target is compared with every candidate, so the running time is quadratic.
- 2nd Approach: Use Term Frequency, Inverse Document Frequency (TF-IDF) first, and only then apply the native Scala Spark SQL fuzzy match algorithms
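
To make the quadratic cost of the 1st approach concrete, here is a minimal sketch of the arithmetic. The helper name `pair_count` is hypothetical; the first pair of sizes is the one reported in the benchmark of this POC.

```python
def pair_count(n_targets: int, n_comparison: int) -> int:
    """Number of (target, candidate) pairs a full cross join has to score."""
    return n_targets * n_comparison


# Sizes reported in the benchmark of this POC
print(pair_count(5_000, 63_891))    # 319455000

# Doubling both inputs quadruples the number of comparisons
print(pair_count(10_000, 127_782))  # 1277820000
```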

References:
- [Josh Taylor: Fuzzy matching at scale](https://towardsdatascience.com/fuzzy-matching-at-scale-84f2bfd0c536#:~:text=The%20problem%20with%20Fuzzy%20Matching%20on%20large%20data&text=In%20computer%20science%2C%20this%20is,that%20works%20in%20quadratic%20time.)

In [1]:
from pyspark import SparkContext
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
import names
import os
import sys
import re

os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

# getOrCreate is a classmethod; calling SparkContext() first fails if a context already exists
sc = SparkContext.getOrCreate()

spark = SparkSession.builder.appName(
    'Fuzzy Match POC').getOrCreate()

### Prepare datasets
`targets` is the list of names that will be looked up in `comparison`

In [2]:
vowels = ['A', 'E', 'I', 'O', 'U']
targets = []
comparison = []
seen_names = set()  # names already in `comparison`, for O(1) duplicate checks

for i in range(5000):
    name = names.get_full_name().upper()
    targets.append({'SimilarityWith': name})
    comparison.append({'Name': name})
    seen_names.add(name)

    # Replace each vowel with every other vowel to generate near-duplicates
    for vowel in vowels:
        for aux_vowel in vowels:
            if vowel != aux_vowel:
                shuffled_name = name.replace(vowel, aux_vowel)

                if shuffled_name not in seen_names:
                    comparison.append({'Name': shuffled_name})
                    seen_names.add(shuffled_name)

targets_df = spark.createDataFrame(targets).alias('t')
comparison_df = spark.createDataFrame(comparison).alias('c')

print(len(targets))
print(len(comparison))

print(f"CrossJoined dataset size: {len(targets) * len(comparison)}")

CrossJoined dataset size: 319455000


### Conventional CrossJoin Approach

In [34]:
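# Levenshtein distance normalized by the longer string's length, turned into a 0..1 similarity
lev = F.expr('1 - (levenshtein(Name, SimilarityWith) / array_max(array(length(Name), length(SimilarityWith))) )')
# balogo_jarowinkler is not a built-in Spark SQL function; it is assumed to be a UDF already registered in this session
jaro = F.expr('balogo_jarowinkler(Name, SimilarityWith)')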
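
For readers who want to check the `lev` expression by hand, the following is a plain-Python sketch of the same normalization: edit distance divided by the length of the longer string, subtracted from 1. It is not Spark's implementation, and the function names are illustrative. It covers only the Levenshtein half of the score, not the Jaro-Winkler half.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via row-by-row dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


def lev_similarity(a: str, b: str) -> float:
    """1 - distance / max(len(a), len(b)), mirroring the `lev` expression."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    return 1 - levenshtein(a, b) / longest


# One substitution in a 13-character name
print(lev_similarity("FLOYD GAMBILL", "FLOYD GEMBILL"))
```

A single changed vowel in a 13-character name gives `1 - 1/13`, so the Levenshtein component alone already clears the 0.8 filter for these generated variants.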

In [35]:
df = targets_df.crossJoin(comparison_df)
df = df.withColumn('Similarity', (lev + jaro) / 2).filter(F.expr('Similarity > 0.8')).cache()

print(f"Filtered Fuzzy Match count: {df.count()}")
df.show()

Filtered Fuzzy Match count: 81992
+--------------+--------------+------------------+
|SimilarityWith|          Name|        Similarity|
+--------------+--------------+------------------+
| FLOYD GAMBILL| FLOYD GAMBILL|               1.0|
| FLOYD GAMBILL| FLOYD GEMBILL|0.9461538461538461|
| FLOYD GAMBILL| FLOYD GIMBILL|0.9336538461538462|
| FLOYD GAMBILL| FLOYD GOMBILL|0.9461538461538461|
| FLOYD GAMBILL| FLOYD GUMBILL|0.9461538461538461|
| FLOYD GAMBILL| FLOYD GAMBALL|0.9461538461538461|
| FLOYD GAMBILL| FLOYD GAMBELL|0.9461538461538461|
| FLOYD GAMBILL| FLOYD GAMBOLL|0.9461538461538461|
| FLOYD GAMBILL| FLOYD GAMBULL|0.9461538461538461|
| FLOYD GAMBILL| FLAYD GAMBILL|0.9132478632478633|
| FLOYD GAMBILL| FLEYD GAMBILL| 0.941025641025641|
| FLOYD GAMBILL| FLIYD GAMBILL| 0.941025641025641|
| FLOYD GAMBILL| FLUYD GAMBILL| 0.941025641025641|
|OZIE HOLSWORTH|OZIE HOLSWORTH|               1.0|
|OZIE HOLSWORTH|OZIA HOLSWORTH|0.9476190476190476|
|OZIE HOLSWORTH|OZII HOLSWORTH|0.947619047619047

Benchmarks - Conventional Approach

|Targets Count|Comparison Count|CrossJoined Dataset Size|Hit Count (Similarity > 80%)|Duration (total including Spark startup)|
|--|--|--|--|--|
|5,000|63,891|319,455,000|81,992|32.9s (2m 2.9s)|

* Spark seems to take about 1m 30s to initialize regardless of the dataset size, on the machine I am currently running the script on
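
One way to keep startup time and job time apart when benchmarking is to time each step separately. The sketch below uses only the standard library; the `timed` helper and the `timings` dictionary are hypothetical names, and the two timed bodies are stand-ins for the session creation and the `df.count()` call.

```python
import time
from contextlib import contextmanager

timings = {}  # label -> elapsed seconds


@contextmanager
def timed(label):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start


with timed("startup"):
    time.sleep(0.01)      # stand-in for SparkSession.builder...getOrCreate()

with timed("job"):
    sum(range(100_000))   # stand-in for df.count()

for label, seconds in timings.items():
    print(f"{label}: {seconds:.3f}s")
```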

### Term Frequency, Inverse Document Frequency (TF-IDF) Approach 

In [28]:
def ngrams(string, n=3):
    # Zip n shifted copies of the string to slide an n-character window over it
    ngs = zip(*[string[i:] for i in range(n)])
    return [''.join(ng) for ng in ngs]

ngs = ngrams(' OZIE HOLSWORTH ')
print(ngs)

[' OZ', 'OZI', 'ZIE', 'IE ', 'E H', ' HO', 'HOL', 'OLS', 'LSW', 'SWO', 'WOR', 'ORT', 'RTH', 'TH ']
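
The trigrams above are the tokens a TF-IDF step would weight. The following is a minimal pure-Python sketch of how that could look, assuming a smoothed IDF, L2-normalized vectors and cosine similarity. These choices, and the names `tfidf_vectors` and `cosine`, are assumptions for illustration, not necessarily what this POC ends up using.

```python
import math
from collections import Counter


def ngrams(string, n=3):
    """Character n-grams, equivalent to the zip-based version above."""
    return [string[i:i + n] for i in range(len(string) - n + 1)]


def tfidf_vectors(names, n=3):
    """One L2-normalized {ngram: weight} dict per name (names padded with spaces)."""
    docs = [Counter(ngrams(f" {name} ", n)) for name in names]
    doc_freq = Counter(gram for doc in docs for gram in doc)
    total = len(docs)
    vectors = []
    for doc in docs:
        # Assumption: smoothed IDF, log((1 + N) / (1 + df)) + 1
        weights = {gram: tf * (math.log((1 + total) / (1 + doc_freq[gram])) + 1)
                   for gram, tf in doc.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({gram: w / norm for gram, w in weights.items()})
    return vectors


def cosine(u, v):
    """Dot product of two normalized sparse vectors."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(gram, 0.0) for gram, w in u.items())


sample = ["OZIE HOLSWORTH", "OZIA HOLSWORTH", "FLOYD GAMBILL"]
vecs = tfidf_vectors(sample)
print(cosine(vecs[0], vecs[1]))  # near-duplicate: many shared trigrams
print(cosine(vecs[0], vecs[2]))  # unrelated name: no shared trigrams
```

Pairs with a low cosine score could be discarded before running the more expensive fuzzy match functions, which is the motivation for the 2nd approach.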
