## Fuzzy Match POC with Apache Spark
The objective of this project is to test native Spark functions for string similarity analysis, using several similarity algorithms.

### Approaches

- 1st Approach: Use native Scala Spark SQL fuzzy match algorithms, cross-joining the input dataset with the target dataset. Every input name is compared with every target, so the running time is quadratic.
- 2nd Approach: Use Term Frequency, Inverse Document Frequency (TF-IDF) first, and only then apply native Scala Spark SQL fuzzy match algorithms

References:
- [Josh Taylor: Fuzzy matching at scale](https://towardsdatascience.com/fuzzy-matching-at-scale-84f2bfd0c536#:~:text=The%20problem%20with%20Fuzzy%20Matching%20on%20large%20data&text=In%20computer%20science%2C%20this%20is,that%20works%20in%20quadratic%20time.)

In [2]:
%%time

import os
import sys
import names
import pyspark.sql.functions as F
import numpy as np
import pandas as pd
from pyspark import SparkContext
from pyspark.sql import SparkSession
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import NearestNeighbors
from sklearn.feature_extraction.text import TfidfVectorizer

os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

# getOrCreate is a classmethod; calling SparkContext() first fails if a context is already running
sc = SparkContext.getOrCreate()

spark = SparkSession.builder.appName(
    'Fuzzy Match POC').getOrCreate()

CPU times: total: 2.83 s
Wall time: 15.3 s


### Prepare datasets
`targets` holds the names to search for. `comparison` holds the names they are matched against, built by substituting characters in each target.

In [3]:
%%time

shuffle = ['A', 'E', 'I', 'O', 'U', 'H' , 'R', 'P' ,'B', 'J', 'N', 'M', 'G', '.', '/', './ ', '-', '- LTDA']
targets = []
comparison = []

targets.append({'SimilarityWith': 'HP PARTICIPACOES S/A'})
comparison.append({'Name': 'HAP PARCTICIPACOES S/A'})

for i in range(2):
    name = names.get_full_name().upper()
    targets.append({'SimilarityWith': name})
    comparison.append({'Name': name})

    # Build variants of the name by replacing each shuffle token with every other token
    for shuffle_char in shuffle:
        for shuffle_char_aux in shuffle:
            if shuffle_char != shuffle_char_aux:
                shuffled_name = name.replace(shuffle_char, shuffle_char_aux)
                comparison.append({'Name': shuffled_name})

targets_df = spark.createDataFrame(targets).alias('t')
comparison_df = spark.createDataFrame(comparison).alias('c')

print(len(targets))
print(len(comparison))

print(f"CrossJoined dataset size: {len(targets) * len(comparison)}")

3
615
CrossJoined dataset size: 1845
CPU times: total: 78.1 ms
Wall time: 2.25 s


<hr>

### Conventional CrossJoin Approach

In [4]:
%%time

lev = F.expr('1 - (levenshtein(Name, SimilarityWith) / array_max(array(length(Name), length(SimilarityWith))) )')
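# bis_brasil_jarowinkler is not defined in this notebook; it is assumed to be a
# Jaro-Winkler UDF already registered in the Spark session
jaro = F.expr('bis_brasil_jarowinkler(Name, SimilarityWith)')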


CPU times: total: 0 ns
Wall time: 102 ms
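
The `lev` expression above turns an edit distance into a score between 0 and 1 by dividing by the length of the longer string. A minimal pure-Python sketch of the same formula follows, useful for checking individual pairs without Spark. The helper names `levenshtein` and `lev_similarity` are hypothetical, and returning `1.0` for two empty strings is an assumption of this sketch (the SQL expression would divide by zero in that case).

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def lev_similarity(name: str, target: str) -> float:
    """1 - distance / length of the longer string, mirroring the `lev` expression."""
    longest = max(len(name), len(target))
    if longest == 0:
        return 1.0  # assumption: treat two empty strings as identical
    return 1 - levenshtein(name, target) / longest


print(lev_similarity('KEREN HILL', 'KAREN HILL'))
print(lev_similarity('HAP PARCTICIPACOES S/A', 'HP PARTICIPACOES S/A'))
```

The final `Similarity` column in the notebook averages this score with the Jaro-Winkler score, so the values printed here will not match that column exactly.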


In [5]:
%%time

df = targets_df.crossJoin(comparison_df)
df = df.withColumn('Similarity', (lev + jaro) / 2).filter(F.expr('Similarity > 0.8')).cache()

print(f"Filtered Fuzzy Match count: {df.count()}")
df.show()

Filtered Fuzzy Match count: 602


+--------------------+--------------------+------------------+
|      SimilarityWith|                Name|        Similarity|
+--------------------+--------------------+------------------+
|HP PARTICIPACOES S/A|HAP PARCTICIPACOE...|0.8734090909090909|
|          KAREN HILL|          KAREN HILL|               1.0|
|          KAREN HILL|          KEREN HILL|0.9033333333333333|
|          KAREN HILL|          KIREN HILL|0.9199999999999999|
|          KAREN HILL|          KOREN HILL|0.9199999999999999|
|          KAREN HILL|          KUREN HILL|0.9199999999999999|
|          KAREN HILL|          KHREN HILL|0.9199999999999999|
|          KAREN HILL|          KRREN HILL|0.9199999999999999|
|          KAREN HILL|          KPREN HILL|0.9199999999999999|
|          KAREN HILL|          KBREN HILL|0.9199999999999999|
|          KAREN HILL|          KJREN HILL|0.9199999999999999|
|          KAREN HILL|          KNREN HILL|             0.895|
|          KAREN HILL|          KMREN HILL|0.9199999999

In [6]:
%%time

pandas_df = df.toPandas()
pandas_df.to_csv('conventional.csv', index=False)

CPU times: total: 109 ms
Wall time: 332 ms


<hr>

### Term Frequency, Inverse Document Frequency (TF-IDF) Approach 


Function that generates the list of character n-grams of a string. The default `n=2` yields 2-character n-grams (bigrams), which is what the vectorizer below uses.

In [7]:
%%time

def ngrams(string, n=2):
    # Zip the string against its own shifted copies to get sliding windows of length n
    ngs = zip(*[string[i:] for i in range(n)])
    return [''.join(chars) for chars in ngs]

CPU times: total: 0 ns
Wall time: 0 ns
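
To make the n-gram split concrete, the short example below repeats the same function and prints its output for a sample name. The sample strings are chosen only for illustration.

```python
def ngrams(string, n=2):
    # Sliding windows of length n over the string
    ngs = zip(*[string[i:] for i in range(n)])
    return [''.join(chars) for chars in ngs]


print(ngrams('KAREN'))       # ['KA', 'AR', 'RE', 'EN']
print(ngrams('KAREN', n=3))  # ['KAR', 'ARE', 'REN']
print(ngrams('K'))           # [] (string shorter than n yields no n-grams)
```

A string shorter than `n` produces an empty list, so such names end up with an all-zero TF-IDF vector.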


In [8]:
%%time

targets_list = list(set(map(lambda x: x['SimilarityWith'], targets)))
comparison_list = list(set(map(lambda x: x['Name'], comparison)))

vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams, lowercase=False)
tfidf = vectorizer.fit_transform(targets_list)
nbrs = NearestNeighbors(n_neighbors=1, n_jobs=-1).fit(tfidf)

def getNearestN(query):
  queryTFIDF_ = vectorizer.transform(query)
  distances, indices = nbrs.kneighbors(queryTFIDF_)
  return distances, indices

distances, indices = getNearestN(comparison_list)
comparison_list = list(comparison_list)

matches = []
for i,j in enumerate(indices):
  temp = [round(distances[i][0],2), targets_list[j[0]], comparison_list[i]]
  matches.append(temp)

CPU times: total: 15.6 ms
Wall time: 22 ms
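
The cell above fits the TF-IDF weights on the targets only, projects each comparison name into that vector space, and keeps the single closest target per name. The sketch below reproduces that idea with the standard library alone, as a way to inspect the steps by hand. It is a simplified illustration, not scikit-learn's implementation: the smoothed IDF formula and L2 normalisation are intended to approximate the `TfidfVectorizer` defaults, the search is a plain loop rather than `NearestNeighbors`, and the names `fit_idf`, `vectorize`, `euclidean` and `nearest_target` are hypothetical.

```python
import math
from collections import Counter


def ngrams(string, n=2):
    return [''.join(chars) for chars in zip(*[string[i:] for i in range(n)])]


def fit_idf(docs, n=2):
    """Smoothed inverse document frequency per n-gram, learned from the targets."""
    doc_freq = Counter(g for doc in docs for g in set(ngrams(doc, n)))
    total = len(docs)
    return {g: math.log((1 + total) / (1 + df)) + 1 for g, df in doc_freq.items()}


def vectorize(doc, idf, n=2):
    """L2-normalised TF-IDF vector; n-grams unseen during fitting are ignored."""
    counts = Counter(g for g in ngrams(doc, n) if g in idf)
    vec = {g: tf * idf[g] for g, tf in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {g: w / norm for g, w in vec.items()} if norm else {}


def euclidean(u, v):
    keys = set(u) | set(v)
    return math.sqrt(sum((u.get(k, 0.0) - v.get(k, 0.0)) ** 2 for k in keys))


def nearest_target(name, targets, idf):
    """Return (distance, target) for the closest target; smaller means more similar."""
    query = vectorize(name, idf)
    return min((euclidean(query, vectorize(t, idf)), t) for t in targets)


targets = ['KAREN HILL', 'HP PARTICIPACOES S/A']
idf = fit_idf(targets)
for name in ['KEREN HILL', 'HAP PARCTICIPACOES S/A', 'KAREN HILL']:
    distance, target = nearest_target(name, targets, idf)
    print(f'{name!r} -> {target!r} (distance {distance:.2f})')
```

Because the vectors are normalised, the distance ranges from 0 (identical n-gram profile) to about 1.41 (no shared n-grams), which is why the notebook sorts `Distance` in ascending order.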


In [9]:
%%time

matches_df = pd.DataFrame(matches, columns=['Distance','Target name','Dataset name'])

#matches_df = matches_df.loc[matches_df['Distance'] > 0.3]
matches_df = matches_df.drop_duplicates().sort_values(
    by=['Distance'], ascending=True)
    
matches_df.to_csv('tf-idf.csv', index=False)
print(len(matches_df))

224
CPU times: total: 15.6 ms
Wall time: 5.02 ms


### Benchmarks

|Approach|Search Targets Count|Dataset Names Count|CrossJoined Dataset Size|Hit Count|Score Filter|Duration|
|--|--|--|--|--|--|--|
|Conventional|5,000|785,000|3,925,000,000|1,257,568|> 80%|~11m (12m 30s total)|
|TFIDF|5,000|785,000|-|728,326|> 0.3|56s|
|Conventional|10,000|126,304|1,263,040,000|199,549|> 80%|~3m 11s (4m 41s total)|
|TFIDF|10,000|126,304|-|91,622|> 0.3|15.7s|
|TFIDF|10,000|126,304|-|126,304|N/A|15s|
|TFIDF|5,000|63,635|-|46,411|> 0.3|4.6s|
|Conventional|5,000|63,635|318,175,000|82,011|> 80%|~31s (2m 1s total)|

* Spark appears to take about 1m 30s to initialize regardless of dataset size on the machine used for these runs. The "total" durations include that startup time.
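
The relative speed of the two approaches can be read off the table with simple arithmetic. The sketch below recomputes the cross-join sizes and the ratio of conventional to TF-IDF duration. It uses the durations without Spark startup and the TF-IDF runs that apply the 0.3 filter; since the conventional timings are approximate, the ratios are rough estimates only. The names `runs` and `speedups` are hypothetical.

```python
# (targets, dataset names, conventional seconds, TF-IDF seconds), taken from the table
runs = [
    (5_000, 785_000, 11 * 60, 56.0),
    (10_000, 126_304, 3 * 60 + 11, 15.7),
    (5_000, 63_635, 31, 4.6),
]

speedups = []
for targets, names, conventional_s, tfidf_s in runs:
    pairs = targets * names  # size of the cross-joined dataset
    ratio = round(conventional_s / tfidf_s, 1)
    speedups.append(ratio)
    print(f'{targets:,} x {names:,} = {pairs:,} pairs, conventional / TF-IDF = {ratio}x')
```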