# Fuzzy Wuzzy

## Why use Fuzzy Wuzzy?

- Input source cannot be guaranteed 100% accurate
- Optical Character Recognition (OCR) output
- Alias names: Fuzzy Wuzzy is simpler than word embeddings for matching aliases \[Taiwan $\approx$ Taiwan (R.O.C.)\]

![image](images/googlesearch.png)

## Fuzzy Wuzzy facts

- Fuzzy Wuzzy is a library that uses Levenshtein Distance to calculate the differences between sequences
- It is open source on [GitHub](https://github.com/seatgeek/fuzzywuzzy/blob/master/fuzzywuzzy)
- Reportedly can improve a model's accuracy by up to 5%
- Requires Python 2.7 or higher
- python-Levenshtein is optional but recommended (it speeds up string matching considerably)

## Installation

```
pip install fuzzywuzzy
pip install python-Levenshtein
```

## Usage

In [1]:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

## `ratio`

In [2]:
fuzz.ratio("this is a test", "this is a test!")

97

In [3]:
fuzz.ratio('Taiwan', 'Taiwan (R.O.C)')

60

- First parameter is the target word
- Second parameter is a matching string

## `partial_ratio`

In [4]:
fuzz.partial_ratio("this is a test", "this is a test!")

100

## `token_sort_ratio`

In [5]:
fuzz.ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")

91

In [6]:
fuzz.token_sort_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")

100

## `token_set_ratio`

In [7]:
fuzz.token_sort_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear")

84

In [8]:
fuzz.token_set_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear")

100

## `process`

In [9]:
choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
process.extract("new york jets", choices, limit=3)

[('New York Jets', 100), ('New York Giants', 79), ('Atlanta Falcons', 29)]

- First parameter is the target word
- Second parameter is a list of values
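
What `process.extract` does can be approximated in a few lines: score every choice against the query, sort by score, and keep the top `limit` results. The sketch below is an illustration only. It uses `difflib` from the standard library instead of Fuzzy Wuzzy, and `simple_ratio` and `extract_sketch` are hypothetical names. The library's default scorer is more elaborate, so scores for inexact matches will generally differ.

```python
from difflib import SequenceMatcher


def simple_ratio(a, b):
    """Case-insensitive similarity between 0 and 100 (stand-in scorer)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def extract_sketch(query, choices, scorer=simple_ratio, limit=5):
    """Return the `limit` best (choice, score) pairs, best first."""
    scored = [(choice, scorer(query, choice)) for choice in choices]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
best = extract_sketch("new york jets", choices, limit=3)
print(best[0])  # ('New York Jets', 100): identical once lowercased
```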

In [10]:
process.extractOne("cowboys", choices)

('Dallas Cowboys', 90)

In [11]:
import pandas as pd
df = pd.read_csv('data/countries.csv')
df.head(10)

| | Name | Code |
|---|---|---|
| 0 | Afghanistan | AF |
| 1 | Åland Islands | AX |
| 2 | Albania | AL |
| 3 | Algeria | DZ |
| 4 | American Samoa | AS |
| 5 | Andorra | AD |
| 6 | Angola | AO |
| 7 | Anguilla | AI |
| 8 | Antarctica | AQ |
| 9 | Antigua and Barbuda | AG |


In [12]:
countries = df.iloc[:,0].values
countries[:5]

array(['Afghanistan', 'Åland Islands', 'Albania', 'Algeria',
       'American Samoa'], dtype=object)

In [13]:
process.extract('Hong Kong', countries, limit=3)

[('Hong Kong', 100), ('Congo', 57), ('Gabon', 54)]

## `ratio` score algorithm

- A wrapper around `SequenceMatcher` from `difflib`; when python-Levenshtein is installed, a faster matcher based on [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance) is used instead
- Calculates the score based on the number of matched character blocks
- Returns a measure of the sequences' similarity between 0 and 100
- Formula: `100 * 2*(Matched Characters)/(len(String A) + len(String B))`
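
The formula can be checked by hand with `difflib.SequenceMatcher` from the standard library. This is a minimal sketch, not the library's own code: `ratio_sketch` is a hypothetical name, and returning 0 for two empty strings is an assumption made here to avoid dividing by zero.

```python
from difflib import SequenceMatcher


def ratio_sketch(a, b):
    """Similarity between 0 and 100: 100 * 2*M / (len(a) + len(b))."""
    matcher = SequenceMatcher(None, a, b)
    # M = total number of characters inside the matching blocks
    matched = sum(block.size for block in matcher.get_matching_blocks())
    total = len(a) + len(b)
    if total == 0:
        return 0  # assumption: treat two empty strings as "no match"
    return round(100 * 2 * matched / total)


print(ratio_sketch("this is a test", "this is a test!"))  # 2*14/29 -> 97
print(ratio_sketch("Taiwan", "Taiwan (R.O.C)"))           # 2*6/20  -> 60
```

Both values agree with the `fuzz.ratio` outputs shown earlier. Scores for other inputs can differ when python-Levenshtein is installed, because the matched blocks are then found by a different algorithm.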

In [14]:
process.extract('Edward', ['Edwards', 'Edwards2', 'drawdE'], scorer=fuzz.ratio)

[('Edwards', 92), ('Edwards2', 86), ('drawdE', 50)]

In [15]:
fuzz.ratio('Deluxe Room, 1 King Bed', 'Deluxe King Room')

62

The two strings have a similarity score of 62.

In [16]:
fuzz.ratio('Traditional Double Room, 2 Double Beds', 'Double Room with Two Double Beds')

69

In [17]:
fuzz.ratio('Room, 2 Double Beds (19th to 25th Floors)', 'Two Double Beds - Location Room (19th to 25th Floors)')

74

The naive approach is far too sensitive to minor differences in word order, missing or extra words, and other such issues.

## `partial_ratio` compares partial string similarity

In [18]:
fuzz.partial_ratio('Deluxe Room, 1 King Bed', 'Deluxe King Room')

69

In [19]:
fuzz.partial_ratio('Traditional Double Room, 2 Double Beds', 'Double Room with Two Double Beds')

83

In [20]:
fuzz.partial_ratio('Room, 2 Double Beds (19th to 25th Floors)', 'Two Double Beds - Location Room (19th to 25th Floors)')

63
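
The idea behind partial matching is to compare the shorter string against equally long windows of the longer one and keep the best score. The sketch below is a simplified brute-force version written for illustration: `partial_ratio_sketch` is a hypothetical name, and Fuzzy Wuzzy chooses its candidate windows differently, so its scores may not match this sketch on every input.

```python
from difflib import SequenceMatcher


def partial_ratio_sketch(a, b):
    """Best 0-100 similarity of the shorter string against any window of the longer."""
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    if not shorter:
        return 0  # assumption: an empty string matches nothing
    best = 0.0
    # Slide a window of len(shorter) characters across the longer string
    for start in range(len(longer) - len(shorter) + 1):
        window = longer[start:start + len(shorter)]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return round(100 * best)


print(partial_ratio_sketch("this is a test", "this is a test!"))  # 100
print(partial_ratio_sketch("Taiwan", "Taiwan (R.O.C)"))           # 100
```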

## `token_sort_ratio` ignores word order

In [21]:
fuzz.token_sort_ratio('Deluxe Room, 1 King Bed', 'Deluxe King Room')

84

In [22]:
fuzz.token_sort_ratio('Traditional Double Room, 2 Double Beds', 'Double Room with Two Double Beds')

78

In [23]:
fuzz.token_sort_ratio('Room, 2 Double Beds (19th to 25th Floors)', 'Two Double Beds - Location Room (19th to 25th Floors)')

83
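
Ignoring word order amounts to sorting the tokens of each string before comparing them. The following sketch shows the idea using only the standard library. `normalize` and `token_sort_ratio_sketch` are hypothetical names, and the preprocessing (lowercasing and replacing non-word characters with spaces) is an approximation of what the library does.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Replace runs of non-word characters with a space, then lowercase."""
    return re.sub(r"\W+", " ", text).lower().strip()


def token_sort_ratio_sketch(a, b):
    """Sort the tokens of each string alphabetically, then compare (0-100)."""
    sorted_a = " ".join(sorted(normalize(a).split()))
    sorted_b = " ".join(sorted(normalize(b).split()))
    return round(100 * SequenceMatcher(None, sorted_a, sorted_b).ratio())


# Both strings become "a bear fuzzy was wuzzy" after sorting
print(token_sort_ratio_sketch("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear"))  # 100
```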

## `token_set_ratio` ignores duplicated words

It is similar to `token_sort_ratio`, but a little more flexible.

In [24]:
fuzz.token_set_ratio('Deluxe Room, 1 King Bed', 'Deluxe King Room')

100

In [25]:
fuzz.token_set_ratio('Traditional Double Room, 2 Double Beds', 'Double Room with Two Double Beds')

78

In [26]:
fuzz.token_set_ratio('Room, 2 Double Beds (19th to 25th Floors)', 'Two Double Beds - Location Room (19th to 25th Floors)')

97
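
The set-based comparison can be sketched as follows: split both strings into token sets, then compare the shared tokens against each string's "shared plus unique" tokens and keep the highest score. This is an illustrative outline using the standard library only. `token_set_ratio_sketch` and its helpers are hypothetical names, and the preprocessing is an approximation of the library's.

```python
import re
from difflib import SequenceMatcher


def _ratio(a, b):
    return round(100 * SequenceMatcher(None, a, b).ratio())


def _tokens(text):
    """Lowercased set of tokens, with non-word characters treated as spaces."""
    return set(re.sub(r"\W+", " ", text).lower().split())


def token_set_ratio_sketch(a, b):
    tokens_a, tokens_b = _tokens(a), _tokens(b)
    shared = " ".join(sorted(tokens_a & tokens_b))
    only_a = " ".join(sorted(tokens_a - tokens_b))
    only_b = " ".join(sorted(tokens_b - tokens_a))
    combined_a = (shared + " " + only_a).strip()
    combined_b = (shared + " " + only_b).strip()
    # Duplicates vanish in the sets; a string whose tokens all appear in
    # the other one ends up equal to `shared` and therefore scores 100.
    return max(_ratio(shared, combined_a),
               _ratio(shared, combined_b),
               _ratio(combined_a, combined_b))


print(token_set_ratio_sketch("fuzzy was a bear", "fuzzy fuzzy was a bear"))  # 100
print(token_set_ratio_sketch("Deluxe Room, 1 King Bed", "Deluxe King Room"))  # 100
```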

## `WRatio` is the default scorer

- Returns a measure of the sequences' similarity between 0 and 100
- Combines the scores of several of the algorithms above, each with a different weight
- `UWRatio` is the same as `WRatio`, for Unicode strings

In [27]:
# Default scorer is Weighted Ratio (WRatio)
for location in ['Hong Kong', 'jepen', 'United tates']:
    result = process.extract(location, countries, limit=2)
    print(result)

[('Hong Kong', 100), ('Congo', 57)]
[('Japan', 60), ('Yemen', 60)]
[('United States', 96), ('Tanzania, United Republic of', 86)]


## `QRatio` does a quick ratio comparison for strings

- Returns a measure of the sequences' similarity between 0 and 100
- `UQRatio` is the same as `QRatio`, for Unicode strings

In [28]:
# Quick Ratio (QRatio)
process.extract('Hong Kong', countries, scorer=fuzz.QRatio, limit=3)

[('Hong Kong', 100), ('Congo', 57), ('Mongolia', 47)]

## More examples

In [29]:
process.extract("Taiwan", countries)

[('Taiwan, Province of China', 90),
 ('Thailand', 71),
 ('Tajikistan', 62),
 ('Australia', 60),
 ('Azerbaijan', 60)]

In [30]:
choices = ["臺北市", "台北市", "新北市", "桃園市"]
process.extract("北市", choices, limit=2)

[('臺北市', 90), ('台北市', 90)]

In [31]:
items = ['各機關人事費',
 '公務人員退休撫卹金、慰問金及各項補助',
 '各類員工待遇準備',
 '直接提供服務之機關編列部分',
 '一般機關編列部分',
 '公務人員進修及健檢補助等',
 '直接提供服務之機關編列部分',
 '一般機關編列部分',
 '補助特種基金',
 '退休公務人員及遺族三節慰問金等',
 '國家賠償金',
 '第一預備金',
 '第二預備金',
 '災害準備金']

In [32]:
process.extract("退休", items, limit=2)

[('退休公務人員及遺族三節慰問金等', 90), ('公務人員退休撫卹金、慰問金及各項補助', 60)]

In [33]:
process.extract("公務員退休金", items, limit=2)

[('公務人員退休撫卹金、慰問金及各項補助', 75), ('公務人員進修及健檢補助等', 45)]

## More information

https://github.com/seatgeek/fuzzywuzzy

https://en.wikipedia.org/wiki/Levenshtein_distance

https://towardsdatascience.com/how-fuzzy-matching-improve-your-nlp-model-bc617385ad6b

https://towardsdatascience.com/natural-language-processing-for-fuzzy-string-matching-with-python-6632b7824c49

https://www.geeksforgeeks.org/fuzzywuzzy-python-library/