## Fuzzy Match POC with Apache Spark
The objective of this project is to test native Spark functions for string similarity analysis, using a variety of similarity algorithms.
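To make the `Similarity` scores shown further down easier to interpret, here is a minimal pure-Python sketch of the Jaro-Winkler metric. It assumes the `balogo_jarowinkler` function used later computes standard Jaro-Winkler (prefix scale 0.1, prefix capped at 4 characters); the helper names `jaro` and `jaro_winkler` are hypothetical and are not part of the Spark function's implementation.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity between two strings, in the range [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters count as matching only if they are equal and close enough.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2

    return (matches / len1
            + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str,
                 prefix_scale: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted for strings that share a common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


print(jaro_winkler('Allison Stroot', 'Allason Stroot'))  # ~0.9667
print(jaro_winkler('Allison Stroot', 'Alloson Stroot'))  # ~0.8949
```

Under these assumptions the sketch gives the same values as the sample output for these two pairs. The second pair scores lower because the extra `o` shifts which characters are matched, and those out-of-order matches count as transpositions.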

In [1]:
from pyspark import SparkContext
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
import names
import os
import sys

os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

sc = SparkContext.getOrCreate()

spark = SparkSession.builder.appName(
    'Fuzzy Match POC').getOrCreate()

### Prepare datasets
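`targets` is the list of names to look up inside `comparison`. `comparison` holds each original name plus variants of it in which one vowel has been replaced by another.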
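As a rough illustration of how the search list is built, the sketch below generates the vowel-substituted variants for a single fixed name, without Spark or the `names` package. The names `VOWELS` and `vowel_variants` are hypothetical helpers introduced here, and a set is used for de-duplication instead of rescanning the list.

```python
VOWELS = ['a', 'e', 'i', 'o', 'u']


def vowel_variants(name):
    """Return `name` followed by its distinct vowel-substituted variants."""
    variants = [name]
    seen = {name}
    for vowel in VOWELS:
        for replacement in VOWELS:
            if vowel == replacement:
                continue
            # str.replace swaps every occurrence of the (lowercase) vowel.
            candidate = name.replace(vowel, replacement)
            if candidate not in seen:
                seen.add(candidate)
                variants.append(candidate)
    return variants


for variant in vowel_variants('Allison Stroot'):
    print(variant)
```

For `'Allison Stroot'` this yields nine entries: the original plus four variants for `i` and four for `o`. The uppercase `A` is left untouched because the replacement is case-sensitive.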

In [2]:
vogals = ['a', 'e', 'i', 'o', 'u']
targets = []
comparison = []

for i in range(10):
    name = names.get_full_name()
    targets.append({'SimilarityWith': name})
    comparison.append({'Name': name})

    # Replace each vowel with every other vowel to build near-duplicate names
    for vogal in vogals:
        for aux_vogal in vogals:
            if vogal != aux_vogal:
                shuffled_name = name.replace(vogal, aux_vogal)

                # Skip variants that are already in the search list
                if shuffled_name not in [c['Name'] for c in comparison]:
                    comparison.append({'Name': shuffled_name})

targets_df = spark.createDataFrame(targets).alias('t')
comparison_df = spark.createDataFrame(comparison).alias('c')

print('Targets to be searched:')
targets_df.show(n=2)

print('\nSearch list:')
comparison_df.show(n=2)

df = targets_df.crossJoin(comparison_df)

# `balogo_jarowinkler` is not a built-in Spark SQL function; it must already
# be registered in this session for the expression below to resolve.
df = df.withColumn('Similarity', F.expr(
    'balogo_jarowinkler(SimilarityWith, Name)'))

df.show()

Targets to be searched:
+---------------+
| SimilarityWith|
+---------------+
| Allison Stroot|
|Theresa Joachim|
+---------------+
only showing top 2 rows

Search list:
+--------------+
|          Name|
+--------------+
|Allison Stroot|
|Allason Stroot|
+--------------+
only showing top 2 rows

+--------------+---------------+-------------------+
|SimilarityWith|           Name|         Similarity|
+--------------+---------------+-------------------+
|Allison Stroot| Allison Stroot|                1.0|
|Allison Stroot| Allason Stroot| 0.9666666666666667|
|Allison Stroot| Alleson Stroot| 0.9666666666666667|
|Allison Stroot| Alloson Stroot| 0.8948717948717949|
|Allison Stroot| Alluson Stroot| 0.9666666666666667|
|Allison Stroot| Allisan Straat| 0.9142857142857143|
|Allison Stroot| Allisen Street| 0.9142857142857143|
|Allison Stroot| Allisin Striit| 0.9142857142857143|
|Allison Stroot| Allisun Struut| 0.9142857142857143|
|Allison Stroot|Theresa Joachim| 0.3603174603174603|
|Allison Stro