## Fuzzy Match POC with Apache Spark
The objective of this project is to test native Spark functions for string similarity analysis, using a variety of similarity algorithms.

### Approaches

- 1st Approach: Use native Scala Spark SQL fuzzy-match algorithms, cross joining the input dataset with the target dataset. Every input row is compared with every target row, so the computation time is quadratic.

References:
- [Josh Taylor: Fuzzy matching at scale](https://towardsdatascience.com/fuzzy-matching-at-scale-84f2bfd0c536#:~:text=The%20problem%20with%20Fuzzy%20Matching%20on%20large%20data&text=In%20computer%20science%2C%20this%20is,that%20works%20in%20quadratic%20time.)
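
To make the quadratic cost concrete, here is a minimal pure-Python sketch of the same idea outside Spark: every target is compared with every candidate, and only pairs above a threshold are kept. The helper names (`levenshtein`, `similarity`, `cross_match`) and the sample names are illustrative assumptions, and the score is a normalized Levenshtein similarity rather than the Jaro-Winkler metric used later in the notebook.

```python
from itertools import product


def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with a two-row dynamic programme."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """1 - distance / length of the longer string, in the range [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def cross_match(targets, candidates, min_similarity):
    """Compare every target with every candidate (len(targets) * len(candidates) pairs)."""
    comparisons = 0
    matches = []
    for target, candidate in product(targets, candidates):
        comparisons += 1
        score = similarity(candidate, target)
        if score >= min_similarity:
            matches.append((target, candidate, score))
    return matches, comparisons


# Illustrative sample names, not the generated dataset
targets = ["GEORGE BEAN", "TINA PURSLEY"]
candidates = ["GEORGE BENN", "TINA PURSLIY", "KEN LENNON"]

matches, comparisons = cross_match(targets, candidates, min_similarity=0.7)
print(comparisons)   # 6 pairs for 2 targets x 3 candidates
print(len(matches))  # 2 pairs pass the threshold
```

Doubling both lists quadruples the number of comparisons, which is the scaling problem the referenced article describes.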

In [2]:
import os
import sys
import names
import pyspark.sql.functions as F
from pyspark import SparkContext
from pyspark.sql import SparkSession

os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

# Call getOrCreate on the class so that re-running the cell reuses the existing context
sc = SparkContext.getOrCreate()

spark = SparkSession.builder.appName(
    'Fuzzy Match POC').getOrCreate()

### Prepare datasets
`targets` is the list of names that will be looked up in `comparison`.

In [16]:
shuffle = ['A', 'E', 'I', 'O', 'U', 'H', 'R', 'P', 'B', 'J', 'N', 'M', 'G', '.', '/', './ ', '-', '- LTDA']
targets = []
comparison = []


for i in range(100):
    name = names.get_full_name().upper()
    targets.append({'SimilarityWith': name})
    comparison.append({'Name': name})

    # Generate noisy variants: replace each shuffle token with every other shuffle token
    for shuffle_char in shuffle:
        for shuffle_char_aux in shuffle:
            if shuffle_char != shuffle_char_aux:
                shuffled_name = name.replace(shuffle_char, shuffle_char_aux)
                comparison.append({'Name': shuffled_name})

targets_df = spark.createDataFrame(targets).alias('t')
comparison_df = spark.createDataFrame(comparison).alias('c')

print(len(targets))
print(len(comparison))

print(f"CrossJoined dataset size: {len(targets) * len(comparison)}")

10
3070
CrossJoined dataset size: 30700
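
The printed counts can be checked by hand. Each name contributes the original plus one variant per ordered pair of distinct `shuffle` tokens, and `shuffle` has 18 entries. The sketch below reproduces the arithmetic; the function name `expected_comparison_rows` is a hypothetical helper, not part of the notebook. Note that the output shown above corresponds to 10 generated names, whereas the loop as written uses `range(100)`, which would give 100 targets and 30700 comparison rows.

```python
def expected_comparison_rows(n_names: int, n_tokens: int) -> int:
    """Rows in `comparison`: the original name plus one variant per ordered token pair."""
    variants_per_name = 1 + n_tokens * (n_tokens - 1)
    return n_names * variants_per_name


n_tokens = 18  # length of the `shuffle` list
n_names = 10   # number of names implied by the printed output

comparison_rows = expected_comparison_rows(n_names, n_tokens)
print(comparison_rows)            # 3070
print(n_names * comparison_rows)  # 30700 pairs after the cross join
```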


<hr>

### Conventional CrossJoin Approach

In [13]:
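# Alternative metric: Levenshtein distance normalized by the length of the longer string
#lev = F.expr('1 - (levenshtein(Name, SimilarityWith) / array_max(array(length(Name), length(SimilarityWith))) )')
# `jarowinkler` must be resolvable as a SQL function in this Spark session
jaro = F.expr('jarowinkler(Name, SimilarityWith)')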
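
For readers who want to see what the Jaro-Winkler score measures, here is a minimal pure-Python sketch. It is an illustration written for this notebook, not the code behind the `jarowinkler` SQL function; the parameter names and defaults (`prefix_scale=0.1`, `max_prefix=4`, `boost_threshold=0.7`) follow the commonly described variant of the algorithm and are assumptions, since implementations differ in these details.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: matches within a sliding window, penalized by transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, char in enumerate(s1):
        low = max(0, i - window)
        high = min(len2, i + window + 1)
        for j in range(low, high):
            if not matched2[j] and s2[j] == char:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Walk the matched characters of both strings in order and count mismatches
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1,
                 max_prefix: int = 4, boost_threshold: float = 0.7) -> float:
    """Jaro score boosted for a shared prefix of up to `max_prefix` characters."""
    score = jaro(s1, s2)
    if score < boost_threshold:
        return score
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)


print(round(jaro_winkler("HARRY HENSHAW", "HARRY HJNSHAW"), 4))  # 0.9692
print(round(jaro_winkler("GEORGE BEAN", "GEORGE BENN"), 4))      # 0.9636
```

Both printed values agree with the `Similarity` column in the result table of this notebook for the same name pairs.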

In [14]:
min_similarity = 0.7
df = targets_df.crossJoin(comparison_df)
df = df.withColumn('Similarity', jaro) \
    .filter(F.expr(f'Similarity >= {min_similarity}')) \
    .dropDuplicates() \
    .cache()

print(f"Filtered Fuzzy Match count: {df.count()}")
df.show()

Filtered Fuzzy Match count: 13333
+----------------+--------------------+------------------+
|  SimilarityWith|                Name|        Similarity|
+----------------+--------------------+------------------+
|   HARRY HENSHAW|       HARRY HJNSHAW|0.9692307692307692|
|KATHRYN MCDONALD|KATH- LTDAYN MCDO...|0.8570238095238096|
|  THELMA NEIDICH|      THELUA NEIDICH|0.9714285714285714|
|     GEORGE BEAN|         GEORGE BENN|0.9636363636363636|
|KATHRYN MCDONALD|       RAYMOND NLAIR|0.7233391608391608|
|KATHRYN MCDONALD|         IARY CANNON|0.7009680134680135|
|     WALTER GOOD|       W-LTER T-YLOR|0.7524475524475526|
|      DONNA CURL|         LANA CURIUL|0.7311688311688312|
|    TINA PURSLEY|        TINA PURSLIY|0.9666666666666666|
|    TINA PURSLEY|        TINA PURSLHY|0.9666666666666666|
|   CHARLES WEDEL|       CHRRLES WEDEL| 0.958974358974359|
|      KEN LENNON|          KE- LE--O-|0.7866666666666667|
|    NANCY MILLER|        OAOCY MILLER| 0.888888888888889|
|   MARGARET SHEA|    
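
One reason the `.dropDuplicates()` step matters: `str.replace` returns the string unchanged when the token to replace is absent, so many generated rows are exact copies of the original name. The sketch below counts them for a single name taken from the table above; `variants` is a hypothetical helper that mirrors the nested loop of the dataset cell.

```python
SHUFFLE = ['A', 'E', 'I', 'O', 'U', 'H', 'R', 'P', 'B', 'J', 'N', 'M', 'G',
           '.', '/', './ ', '-', '- LTDA']


def variants(name: str) -> list:
    """The original name plus one replacement per ordered pair of distinct tokens."""
    rows = [name]
    for old in SHUFFLE:
        for new in SHUFFLE:
            if old != new:
                rows.append(name.replace(old, new))
    return rows


name = "HARRY HENSHAW"
rows = variants(name)

print(len(rows))                   # 307 rows generated for one name
print(rows.count(name))            # 222 of them are identical to the original
print(len(set(rows)) < len(rows))  # True: the list contains duplicates
```

Only 5 of the 18 tokens (`A`, `E`, `H`, `R`, `N`) occur in this name, so the other 13 tokens each produce 17 unchanged copies.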

In [15]:
# Collect the matches to the driver as a pandas DataFrame and export them to CSV
pandas_df = df.toPandas()
pandas_df.to_csv('conventional_randomized.csv', index=False)