# Fuzzy Matching

We are faced with a dataset that has names of people, companies, products, etc., all with misspellings or variations. How do we group like items together despite their variations?

Let's explore a few different techniques to group items that are obviously similar.

In [4]:
## import libraries
import pandas as pd
import re


In [3]:
## read dataset
df = pd.read_csv("https://raw.githubusercontent.com/sandeepmj/datasets/main/regex/variant-names.csv")
df

Unnamed: 0,name
0,Albert Einstein
1,"Einsten, Albert"
2,coke
3,"Mr. Einstein, Al"
4,Marie Salomea Skłodowska–Curie
5,Curie Marie
6,coca-cola
7,Marie S. Curie
8,Dr. Marie Curry
9,The Coca-Cola Company


### Pre-process data

It's always a good idea to pre-process text before comparing it.

- Lowercase everything to avoid basic variation in letter casing.
- Remove titles and honorifics (list of <a href="https://gist.github.com/neilhawkins/c7bb94e5b7ae558e826989d330418938">English honorifics</a>).

In [6]:
## But we'll target what we see:
## note that we are leaving company and Inc. untouched for now.

remove_words = [r"dr\.", r"prof\.", r"mrs\.", r"mr\."]
remove_words

['dr\\.', 'prof\\.', 'mrs\\.', 'mr\\.']

In [7]:
## run search and replace with regex 
## to look for any of these terms and remove them regardless of position
df["name"] = df["name"].str.lower().str.replace('|'.join(remove_words), "", regex = True)
df

Unnamed: 0,name
0,albert einstein
1,"einsten, albert"
2,coke
3,"einstein, al"
4,marie salomea skłodowska–curie
5,curie marie
6,coca-cola
7,marie s. curie
8,marie curry
9,the coca-cola company


## Fuzzy Matching

Fuzzy matching is a mathematical technique used to determine the similarity between two strings. 

A very simplified explanation is that it calculates the number of single-character changes (insertions, deletions and substitutions) required to make one string identical to another.

<a href="https://nanonets.com/blog/fuzzy-matching-fuzzy-logic/">Read more</a> about how it works.
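To make the idea of counting changes concrete, here is a minimal sketch of the Levenshtein distance in plain Python. The function name `levenshtein` is only illustrative, and this is a textbook dynamic-programming version, not the code the fuzzy matching packages run internally.

```python
def levenshtein(a, b):
    """Count the single-character edits needed to turn string a into string b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into an empty string takes i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute ch_a with ch_b (free if equal)
            ))
        previous = current
    return previous[-1]


print(levenshtein("einsten", "einstein"))  # one missing letter
print(levenshtein("curie", "curry"))       # two substituted letters
```

A smaller distance means the strings are closer. Similarity ratios rescale this kind of count so that longer strings are not penalized just for being long.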



### pip install FuzzyWuzzy
This package implements a similarity algorithm, the <a href="https://en.wikipedia.org/wiki/Levenshtein_distance">Levenshtein distance similarity ratio</a>, to determine how similar two strings are.

In [8]:
pip install fuzzywuzzy

Note: you may need to restart the kernel to use updated packages.


### pip install python-Levenshtein 
This package is not required, but I recommend it because it can speed up the analysis by up to 10X.

In [9]:
pip install python-Levenshtein

Note: you may need to restart the kernel to use updated packages.


In [10]:
## import various packages

import fuzzywuzzy
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

### ```fuzz.ratio(string1, string2)```

This function returns the similarity ratio between two strings as an integer from 0 (nothing in common) to 100 (identical).

In [11]:
## what's the fuzz ratio between "Al Einten" and "Albert Einstein"
fuzz.ratio("Al Einten", "Albert Einstein")

75

In [12]:
## what's the fuzz ratio between "Albet Einten" and "Albert Einstein"
fuzz.ratio("Albet Einten", "Albert Einstein")

89

In [13]:
## what's the fuzz ratio between "albert einsten", "einsten, albert"
fuzz.ratio("albert einsten", "einsten, albert")

48
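For intuition about where a number like 48 comes from, the sketch below approximates the ratio with Python's built-in `difflib`. The helper name `simple_ratio` is hypothetical, and the assumption is that the score is roughly 100 × 2M / T, where M is the number of matching characters and T is the combined length of both strings. Scores from `fuzz.ratio` can differ slightly, particularly when `python-Levenshtein` is installed.

```python
from difflib import SequenceMatcher


def simple_ratio(s1, s2):
    """Approximate a 0-100 similarity score: 100 * 2M / T, rounded."""
    matcher = SequenceMatcher(None, s1, s2)
    return int(round(100 * matcher.ratio()))


# Only "einsten" (7 characters) lines up in order, so 2 * 7 / 29 is about 0.48
print(simple_ratio("albert einsten", "einsten, albert"))
print(simple_ratio("albert einstein", "albert einstein"))
```

The low score for the reordered name is the motivation for the order-insensitive functions that follow.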

### ```fuzz.partial_ratio(string1, string2)```

This function is ideal for substring matching, which involves checking the shorter string against substrings of the longer string.

This is useful when checking a name that includes the first, middle and last name against just the last name.

In [15]:
## ADD CELLS AS NEEDED
fuzz.partial_ratio("Einstein", "Einstein, Albert")

100

In [16]:
fuzz.partial_ratio("Einstein, Al", "Einstein, Albert")

100
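One way to picture substring matching is a sliding window: compare the shorter string against every same-length slice of the longer one and keep the best score. The sketch below does exactly that with `difflib`. The name `simple_partial_ratio` is hypothetical, and this is a simplification for illustration, not the package's actual implementation.

```python
from difflib import SequenceMatcher


def simple_partial_ratio(s1, s2):
    """Best 0-100 score of the shorter string against any same-length window of the longer one."""
    shorter, longer = sorted((s1, s2), key=len)
    if not shorter:
        return 0
    best = 0.0
    # slide a window the size of the shorter string across the longer string
    for start in range(len(longer) - len(shorter) + 1):
        window = longer[start:start + len(shorter)]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return int(round(100 * best))


print(simple_partial_ratio("Einstein", "Einstein, Albert"))
print(simple_partial_ratio("Einstein, Al", "Einstein, Albert"))
```

Both calls score 100 because the shorter string appears intact inside the longer one.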

In [18]:
fuzz.partial_ratio

46

### ```fuzz.token_sort_ratio(string1, string2)```

This function first tokenizes each string, lowercases each token, removes punctuation and then sorts the tokens alphabetically.

This is most useful when comparing strings with names of people or companies that are in different order.

Note that if the names are spelled differently the ratio decreases.

In [17]:
## ADD CELLS AS NEEDED
fuzz.token_sort_ratio("Albert Einsten", "Einstein, Albert")

97

In [19]:
fuzz.token_sort_ratio("albert einsten", "einsten, albert")

100
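The tokenize, lowercase, strip punctuation and sort steps can be sketched with the standard library to show what actually gets compared. The helper names `sort_tokens` and `simple_token_sort_ratio` are hypothetical, and the scoring uses `difflib` as a stand-in, so results from `fuzz.token_sort_ratio` may differ slightly.

```python
import re
from difflib import SequenceMatcher


def sort_tokens(text):
    """Replace punctuation with spaces, lowercase, then sort the tokens alphabetically."""
    cleaned = re.sub(r"\W+", " ", text).lower()
    return " ".join(sorted(cleaned.split()))


def simple_token_sort_ratio(s1, s2):
    """Score the two strings (0-100) after normalizing token order."""
    a, b = sort_tokens(s1), sort_tokens(s2)
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


print(sort_tokens("Einstein, Albert"))  # albert einstein
print(simple_token_sort_ratio("albert einsten", "einsten, albert"))
print(simple_token_sort_ratio("Albert Einsten", "Einstein, Albert"))
```

Once both names are reduced to the same sorted form, word order no longer matters and only the spelling difference lowers the score.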

## Application

In [20]:
## recall our df
df

Unnamed: 0,name
0,albert einstein
1,"einsten, albert"
2,coke
3,"einstein, al"
4,marie salomea skłodowska–curie
5,curie marie
6,coca-cola
7,marie s. curie
8,marie curry
9,the coca-cola company


### Pre-process to remove company abbreviations

In [21]:
## list of remove words
remove_words = [r"the", r"inc\.", r"incorporated", r"llc", r"limited liability corporation",
                r"assoc\.", r"Association", r"Bros\.", r"Brothers", r"Co\.", r"company",
                r"corp\.", r"Corporation", r"ltd\.", r"limited", r"mfg\.", r"manufacturing",
                r"mfrs\.", r"manufacturers"]



In [23]:
## apply the remove words
df["name"] = df["name"].str.replace('|'.join(remove_words), "", flags = re.I, regex = True)
df

Unnamed: 0,name
0,albert einstein
1,"einsten, albert"
2,coke
3,"einstein, al"
4,marie salomea skłodowska–curie
5,curie marie
6,coca-cola
7,marie s. curie
8,marie curry
9,coca-cola
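One caveat with a plain alternation pattern is that a short word such as "the" is also removed when it appears inside a longer word, and the removal leaves stray spaces behind. The sketch below shows one possible guard using lookarounds and the standard `re` module. The names `COMPANY_WORDS`, `PATTERN` and `strip_company_words` are hypothetical, and the word list is shortened for illustration.

```python
import re

# shortened, illustrative list; extend it as needed
COMPANY_WORDS = [r"the", r"inc\.", r"incorporated", r"llc", r"co\.", r"company",
                 r"corp\.", r"corporation", r"ltd\.", r"limited"]

# (?<!\w) and (?!\w) require that the match is not glued to other letters or digits
PATTERN = re.compile(r"(?<!\w)(?:" + "|".join(COMPANY_WORDS) + r")(?!\w)", flags=re.I)


def strip_company_words(name):
    """Remove standalone company words, collapse extra whitespace and lowercase."""
    cleaned = PATTERN.sub("", name)
    return " ".join(cleaned.split()).lower()


print(strip_company_words("The Coca-Cola Company"))  # coca-cola
print(strip_company_words("Theodore Roosevelt"))     # theodore roosevelt
```

With pandas, the same compiled pattern could be applied through `df["name"].apply(strip_company_words)`.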


## ```fuzz.token_sort_ratio``` in a function

- Write a function that checks a "correct spelling" against a possible variation.
- If it meets a minimally acceptable ratio, add the correct spelling in a new column.
- Also return the calculated ratio in a separate column

In [24]:
## list of known companies and people
known_entities = [
    "albert einstein",
    "marie curie",
    "coca-cola",
    "pepsico"
    
]

In [25]:
## write the function here

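def check_prox(seek_values, string, min_ratio):
    """Return (value, ratio) for the first known value scoring at least min_ratio, else None."""
    for value in seek_values:
        ratio = fuzz.token_sort_ratio(value, string)
        if ratio >= min_ratio:
            ## stop at the first acceptable match, not necessarily the best one
            return value, ratio
    ## no candidate met the threshold
    return None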
        


In [28]:
## try it out
x, y = check_prox(known_entities, "marie salomea curie", 60)


('marie curie', 73)

In [None]:
## list of known companies and people


In [None]:
## store result in a variable


In [29]:
## call the two variables
x, y

('marie curie', 73)
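Two properties of `check_prox` are worth keeping in mind: it returns the first entity that clears the threshold rather than the best one, and it returns nothing when no entity qualifies, so `x, y = check_prox(...)` raises an error for unmatched strings. The sketch below is one possible variant that addresses both. The names `best_match` and `token_sort_score` are hypothetical, and the default scorer is a `difflib` stand-in so the example runs on its own. In this notebook you could pass `scorer=fuzz.token_sort_ratio` instead.

```python
import re
from difflib import SequenceMatcher


def token_sort_score(s1, s2):
    """Stand-in scorer (0-100): normalize token order, then compare with difflib."""
    def norm(text):
        return " ".join(sorted(re.sub(r"\W+", " ", text).lower().split()))
    return int(round(100 * SequenceMatcher(None, norm(s1), norm(s2)).ratio()))


def best_match(seek_values, string, min_ratio, scorer=token_sort_score):
    """Return the highest-scoring (value, ratio) at or above min_ratio, else (None, None)."""
    best_value, best_ratio = None, None
    for value in seek_values:
        ratio = scorer(value, string)
        if ratio >= min_ratio and (best_ratio is None or ratio > best_ratio):
            best_value, best_ratio = value, ratio
    return best_value, best_ratio


known_entities = ["albert einstein", "marie curie", "coca-cola", "pepsico"]
print(best_match(known_entities, "marie salomea curie", 60))  # ('marie curie', 73)
print(best_match(known_entities, "coke", 60))                 # (None, None)
```

Always returning a two-item tuple keeps unpacking safe. The trade-off is that every candidate is scored, which is slower than stopping at the first hit.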

## Apply to our df

In [35]:
## create a new column that contains the associated name (if applicable) and the fuzz ratio
df["fuzzname-ratio"] = df["name"].apply(lambda x: check_prox(known_entities, x, 60))
df

Unnamed: 0,name,fuzzname-ratio
0,albert einstein,"(albert einstein, 100)"
1,"einsten, albert","(albert einstein, 97)"
2,coke,
3,"einstein, al","(albert einstein, 85)"
4,marie salomea skłodowska–curie,
5,curie marie,"(marie curie, 100)"
6,coca-cola,"(coca-cola, 100)"
7,marie s. curie,"(marie curie, 92)"
8,marie curry,"(marie curie, 82)"
9,coca-cola,"(coca-cola, 100)"


In [37]:
## create a 2nd df
df2 = pd.DataFrame(df["fuzzname-ratio"].to_list(), columns = ["fuzz-name", "ratio"])
df2

Unnamed: 0,fuzz-name,ratio
0,albert einstein,100.0
1,albert einstein,97.0
2,,
3,albert einstein,85.0
4,,
5,marie curie,100.0
6,coca-cola,100.0
7,marie curie,92.0
8,marie curie,82.0
9,coca-cola,100.0


In [38]:
## concat and drop extraneous column
dff = pd.concat([df, df2], axis = "columns")
dff

Unnamed: 0,name,fuzzname-ratio,fuzz-name,ratio
0,albert einstein,"(albert einstein, 100)",albert einstein,100.0
1,"einsten, albert","(albert einstein, 97)",albert einstein,97.0
2,coke,,,
3,"einstein, al","(albert einstein, 85)",albert einstein,85.0
4,marie salomea skłodowska–curie,,,
5,curie marie,"(marie curie, 100)",marie curie,100.0
6,coca-cola,"(coca-cola, 100)",coca-cola,100.0
7,marie s. curie,"(marie curie, 92)",marie curie,92.0
8,marie curry,"(marie curie, 82)",marie curie,82.0
9,coca-cola,"(coca-cola, 100)",coca-cola,100.0


In [39]:
dff.drop("fuzzname-ratio", axis = 1, inplace = True)
dff

Unnamed: 0,name,fuzz-name,ratio
0,albert einstein,albert einstein,100.0
1,"einsten, albert",albert einstein,97.0
2,coke,,
3,"einstein, al",albert einstein,85.0
4,marie salomea skłodowska–curie,,
5,curie marie,marie curie,100.0
6,coca-cola,coca-cola,100.0
7,marie s. curie,marie curie,92.0
8,marie curry,marie curie,82.0
9,coca-cola,coca-cola,100.0


In [None]:
## call our new column


In [None]:
## show all nans


## Some manual labor

In [None]:
## .at vs. .loc
## https://stackoverflow.com/questions/37216485/pandas-at-versus-loc


In [None]:
## show all nans


In [None]:
## call our dff


# Python’s Scikit-Learn library

What if we want to compare a list of entities with itself? We want to cluster similar items together but we have no idea what all the items are. How do we do that?

The answer is more complex and involves machine learning using Python’s Scikit-Learn library.