## Fuzzy Match POC with Apache Spark
The objective of this project is to test native Spark functions for string similarity analysis, using a variety of similarity algorithms.

### Approaches

- 1st Approach: Use native Scala Spark SQL fuzzy match algorithms, cross joining the input dataset with the target dataset. Every input row is compared with every target row, so computation time grows quadratically.

References:
- [Josh Taylor: Fuzzy matching at scale](https://towardsdatascience.com/fuzzy-matching-at-scale-84f2bfd0c536#:~:text=The%20problem%20with%20Fuzzy%20Matching%20on%20large%20data&text=In%20computer%20science%2C%20this%20is,that%20works%20in%20quadratic%20time.)
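
To make the quadratic cost concrete, here is a minimal sketch that counts how many pairs a cross join has to score as both datasets grow. The helper name `cross_join_pairs` is hypothetical and exists only for this illustration; the base sizes (15 and 32) are the ones used by the datasets in this notebook.

```python
def cross_join_pairs(n_targets: int, n_comparison: int) -> int:
    """Number of (target, comparison) pairs a cross join has to score."""
    return n_targets * n_comparison


# Start from the sizes used in this notebook and scale both sides together.
for scale in (1, 10, 100):
    n_t, n_c = 15 * scale, 32 * scale
    print(f"{n_t} x {n_c} = {cross_join_pairs(n_t, n_c)} pairs")
```

Growing both datasets by a factor of 10 multiplies the number of comparisons by 100, which is why this approach becomes expensive on large inputs.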

In [1]:
import os
import sys
import pyspark.sql.functions as F
from pyspark import SparkContext
from pyspark.sql import SparkSession

os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

# Reuse the active SparkContext if one exists instead of constructing a new one
sc = SparkContext.getOrCreate()

spark = SparkSession.builder.appName(
    'Fuzzy Match POC').getOrCreate()

### Prepare datasets
`targets` is the list of names that will be looked up in `comparison`.

In [2]:
targets = []
comparison = []

targets.append({'SimilarityWith': 'HP PARTICIPACOES S/A'})
targets.append({'SimilarityWith': 'SAMSUMG'})
targets.append({'SimilarityWith': 'JOAO PEDRO PAULO'})
targets.append({'SimilarityWith': 'PEDRO CHRISTIAN DA SILVA JR.'})
targets.append({'SimilarityWith': 'LUIS CARLOS JR'})
targets.append({'SimilarityWith': 'JOAO PEDRO FREDERICH HESRV'})
targets.append({'SimilarityWith': 'CAMARGO CORREA SA'})
targets.append({'SimilarityWith': 'SANTOS BRASIL TERMINAL PORTUARIO DE EXPORTACAO'})
targets.append({'SimilarityWith': 'CONCAIS TERMINAL PORTUARIO'})
targets.append({'SimilarityWith': 'JUNIOR RONALD FLINDSMAN'})
targets.append({'SimilarityWith': 'COCA COLA LTDA'})
targets.append({'SimilarityWith': 'BRASTEMP PRODUTOS TECNOLOGICOS LTDA'})
targets.append({'SimilarityWith': 'MINERVA EXPORTACAO DE CARNES'})
targets.append({'SimilarityWith': 'CENTRO DE CONVENCOES JESUS LUZ'})
targets.append({'SimilarityWith': 'JUAN MERCADO DA SILVA'})

#-----------------------------------NAME 2--------------------------------------
comparison.append({'Name': 'HAP PARCTICIPACOES S/A'})
comparison.append({'Name': 'HPPY PARTY S/A'})
comparison.append({'Name': 'HIPPIES PATTERN S/A'})
comparison.append({'Name': 'SAMSUNG'})
comparison.append({'Name': 'JOHN PEDRO PAULO'})
comparison.append({'Name': 'JOAO ROBERTO DA SILVA'})
comparison.append({'Name': 'JOANA PEDROSO FERRAZ'})
comparison.append({'Name': 'PEDRO SILVA JUNIOR'})
comparison.append({'Name': 'PERSIO FERREIRA JUNIOR'})
comparison.append({'Name': 'LUIS CARLOS OLIVERIA JUNIOR'})
comparison.append({'Name': 'LUISA CAROLINA BORGES'})
comparison.append({'Name': 'JOAO FRED HESRV'})
comparison.append({'Name': 'FREDERICO HENRIQUE'})
comparison.append({'Name': 'FREDERICO PEDROZO DE MORAES'})
comparison.append({'Name': 'C CORREA EMPREENDIMENTOS SA'})
comparison.append({'Name': 'CENTRO CORREAS E ACESSORIOS'})
comparison.append({'Name': 'CISNEI CISCORREA EMPREITEIRA SA'})
comparison.append({'Name': 'TRASNPETRO EXPORTACAO'})
comparison.append({'Name': 'CONCAIS TERMINAL'})
comparison.append({'Name': 'RONALD FLINDSMAN'})
comparison.append({'Name': 'RONALDO FLETCHER ARMANDO'})
comparison.append({'Name': 'COCA COLA LTDA'})
comparison.append({'Name': 'COCADA DA MARIA LTDA'})
comparison.append({'Name': 'BRASTEMP TECH LTDA'})
comparison.append({'Name': 'BRASIL TECNOLOGIA LTDA'})
comparison.append({'Name': 'MINERVA COMEX SERVICOS ALIMENTICIOS'})
comparison.append({'Name': 'MINERIO EXPLORACAO E CAVAGEM'})
comparison.append({'Name': 'MINAS GERAIS EXP'})
comparison.append({'Name': 'CENTRO DE CONVENCOES SAO PAULO EXPOCENTER'})
comparison.append({'Name': 'CELTIC CONNECTION SAO PAULO'})
comparison.append({'Name': 'MERCADO DA SILVIA'})
comparison.append({'Name': 'JOANA DA SILVA'})

targets_df = spark.createDataFrame(targets).alias('t')
comparison_df = spark.createDataFrame(comparison).alias('c')

print(len(targets))
print(len(comparison))

print(f"CrossJoined dataset size: {len(targets) * len(comparison)}")

15
32
CrossJoined dataset size: 480


<hr>

### Conventional CrossJoin Approach

In [3]:
#lev = F.expr('1 - (levenshtein(Name, SimilarityWith) / array_max(array(length(Name), length(SimilarityWith))) )')
jaro = F.expr('jarowinkler(Name, SimilarityWith)')
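
As an illustration of what the commented-out `lev` expression computes, here is a minimal pure-Python sketch of the same formula: one minus the edit distance divided by the length of the longer string. It does not implement Jaro-Winkler, and the function names `levenshtein` and `normalized_similarity` are hypothetical helpers for this example, not Spark APIs. Returning 1.0 for two empty strings is a choice made for this sketch only.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with the classic dynamic-programming recurrence."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,          # delete a character from a
                current[j - 1] + 1,       # insert a character into a
                previous[j - 1] + cost,   # substitute (or keep) a character
            ))
        previous = current
    return previous[-1]


def normalized_similarity(a: str, b: str) -> float:
    """1 - distance / length of the longer string, as in the `lev` expression."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption of this sketch: two empty strings are identical
    return 1 - levenshtein(a, b) / longest


print(normalized_similarity('SAMSUMG', 'SAMSUNG'))
print(normalized_similarity('COCA COLA LTDA', 'COCA COLA LTDA'))
```

`SAMSUMG` and `SAMSUNG` differ by a single substitution over 7 characters, so their score is 1 - 1/7, while identical names score 1.0.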

In [5]:
min_similarity = 0.7
df = targets_df.crossJoin(comparison_df)
df = df.withColumn('Similarity', jaro).filter(F.expr(f'Similarity >= {min_similarity}')).cache()

print(f"Filtered Fuzzy Match count: {df.count()}")
# Sort by similarity, lowest score first
df.orderBy("Similarity").show()

Filtered Fuzzy Match count: 22
+--------------------+--------------------+------------------+
|      SimilarityWith|                Name|        Similarity|
+--------------------+--------------------+------------------+
|MINERVA EXPORTACA...|C CORREA EMPREEND...|0.7153278819945487|
|JOAO PEDRO FREDER...|FREDERICO PEDROZO...| 0.718966218966219|
|CONCAIS TERMINAL ...|TRASNPETRO EXPORT...|0.7195131257631258|
|JUNIOR RONALD FLI...|    RONALD FLINDSMAN|0.7318840579710145|
|JUAN MERCADO DA S...|   MERCADO DA SILVIA|0.7343604108309991|
|HP PARTICIPACOES S/A| HIPPIES PATTERN S/A|0.7613815789473684|
|JUAN MERCADO DA S...|      JOANA DA SILVA|0.8181318681318681|
|    JOAO PEDRO PAULO|    JOHN PEDRO PAULO| 0.819047619047619|
|    JOAO PEDRO PAULO|JOANA PEDROSO FERRAZ|0.8191666666666667|
|BRASTEMP PRODUTOS...|  BRASTEMP TECH LTDA|0.8254563492063491|
|MINERVA EXPORTACA...|MINERVA COMEX SER...|0.8348447204968945|
|HP PARTICIPACOES S/A|HAP PARCTICIPACOE...|0.8377272727272728|
|    JOAO PEDRO PAULO|JO

In [5]:
pandas_df = df.toPandas()
pandas_df.to_csv('conventional.csv', index=False)