# TheFuzz library
Sometimes a dataset may include typos. Let's consider the following DataFrame that shows restaurant reviews given by some people:

|name|reviews|
|----|------|
|Jussi|good|
|Mary|terrible|
|Alex|excellent|
|Johanna|god|
|Lauri|exellent|

We can easily see that the reviews should fall into three categories: "good", "terrible", and "excellent", while "god" and "exellent" are clearly typos. We could fix the typos manually, but it makes more sense to automate the process, since nothing guarantees that new reviews won't contain other typos. A popular approach to correcting typos is fuzzy string matching. This is where the [TheFuzz](https://github.com/seatgeek/thefuzz) library comes into play. TheFuzz can calculate the similarity of strings $a$ and $b$ using the Levenshtein similarity ratio
$$
\frac{|a|+|b|-\text{lev}(a,b)}{|a|+|b|},
$$
where $|a|$ and $|b|$ are the string lengths and $\text{lev}(a,b)$ is the [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance). In TheFuzz, this ratio is called the "simple ratio" and is used to calculate a similarity score. The score is scaled to a number between 0 and 100, where 100 signals maximal similarity (identical strings).
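
To make the formula concrete, here is a minimal pure-Python sketch that evaluates it for the typos in our table. The helper names `levenshtein` and `similarity_score` are made up for this illustration and are not part of TheFuzz. The sketch uses the textbook Levenshtein distance, where every edit costs 1; the library's own scorer may weight substitutions differently, so treat exact agreement with `fuzz.ratio` as an assumption that holds for these insertion/deletion typos rather than in general.

```python
def levenshtein(a: str, b: str) -> int:
    """Textbook edit distance: insertions, deletions and substitutions cost 1 each."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def similarity_score(a: str, b: str) -> int:
    """Evaluate (|a| + |b| - lev(a, b)) / (|a| + |b|), scaled to 0-100."""
    total = len(a) + len(b)
    if total == 0:
        return 100  # two empty strings are identical
    return round(100 * (total - levenshtein(a, b)) / total)


print(similarity_score("good", "god"))            # (4 + 3 - 1) / 7   -> 86
print(similarity_score("excellent", "exellent"))  # (9 + 8 - 1) / 17  -> 94
print(similarity_score("good", "good"))           # (4 + 4 - 0) / 8   -> 100
```

Each typo is one deletion away from the correct word, so $\text{lev}(a,b)=1$ in both cases and the score depends only on the combined string length.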

Let's now use TheFuzz to correct the typos in the DataFrame.

```python
import pandas as pd
from thefuzz import fuzz, process
```

```python
# Construct the DataFrame
df = pd.DataFrame.from_dict(
    {
        "name": ["Jussi", "Mary", "Alex", "Johanna", "Lauri"],
        "reviews": ["good", "terrible", "excellent", "god", "exellent"],
    }
)
df
```

| |name|reviews|
|-|----|-------|
|0|Jussi|good|
|1|Mary|terrible|
|2|Alex|excellent|
|3|Johanna|god|
|4|Lauri|exellent|


```python
# Use TheFuzz to fix the typos
correct_categories = ["good", "terrible", "excellent"]
matches = []
for entry in df["reviews"]:
    # Find the valid category that is most similar to this entry
    best_match, score = process.extractOne(entry, correct_categories, scorer=fuzz.ratio)
    matches.append(best_match)
    print(f"The similarity score between {best_match} and {entry} is {score}")
# Overwrite the column with the corrected values
df["reviews"] = matches
df
```

```
The similarity score between good and good is 100
The similarity score between terrible and terrible is 100
The similarity score between excellent and excellent is 100
The similarity score between good and god is 86
The similarity score between excellent and exellent is 94
```


| |name|reviews|
|-|----|-------|
|0|Jussi|good|
|1|Mary|terrible|
|2|Alex|excellent|
|3|Johanna|good|
|4|Lauri|excellent|


Let's take a closer look at what happened. 
```python
for entry in df["reviews"]:
    best_match, score = process.extractOne(entry, correct_categories, scorer=fuzz.ratio)
```
For each entry in the "reviews" column, we use the [process.extractOne](https://github.com/seatgeek/thefuzz#process) function to find the single best match for the entry from a list of choices. It returns the best-matching choice together with its similarity score. The `extractOne` function accepts a `scorer` parameter, which is the function used to calculate the similarity score. In our example, we use `fuzz.ratio` to calculate similarity based on the Levenshtein distance.
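
Conceptually, picking the best match boils down to scoring every choice and keeping the highest one. The sketch below illustrates that idea without depending on TheFuzz. Both `extract_one_sketch` and `stand_in_scorer` are hypothetical names invented here, and the scorer is a swapped-in technique: it uses the standard library's `difflib.SequenceMatcher`, not TheFuzz's actual implementation, so details such as tie-breaking and string preprocessing may differ from the real `process.extractOne`.

```python
from difflib import SequenceMatcher


def stand_in_scorer(a: str, b: str) -> int:
    """Hypothetical scorer: difflib's similarity ratio scaled to 0-100."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def extract_one_sketch(query, choices, scorer=stand_in_scorer):
    """Score the query against every choice and return (best_choice, score)."""
    scored = [(choice, scorer(query, choice)) for choice in choices]
    # max() returns the first pair with the highest score, so ties go to
    # the choice that appears earliest in the list
    return max(scored, key=lambda pair: pair[1])


correct_categories = ["good", "terrible", "excellent"]
for entry in ["god", "exellent", "terrible"]:
    best_match, score = extract_one_sketch(entry, correct_categories)
    print(f"{entry!r} -> {best_match!r} (score {score})")
```

Because the scorer is just a function argument, swapping in a different similarity measure changes the ranking without touching the matching loop, which is the same design that the `scorer` parameter exposes.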

TheFuzz also supports other scoring functions. You can find more details in its documentation if you're interested.

You can now go to [the next tutorial](./great_expectations/great_expectations.ipynb). 