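# FuzzyWuzzy: Fuzzy String Matching
In computer science, fuzzy string matching is a technique for finding strings that match a pattern approximately rather than exactly.<br>
In other words, it is a kind of search that still finds matches when the user misspells a word or types only part of it. For this reason it is also called approximate string matching.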
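To make the idea concrete before installing anything, here is a minimal sketch that uses `difflib.get_close_matches` from the Python standard library rather than FuzzyWuzzy. The word list is a made-up example.

```python
from difflib import get_close_matches

# Hypothetical vocabulary to search in.
words = ["ape", "apple", "peach", "puppy"]

# "appel" is a misspelling of "apple"; candidates come back best match first.
matches = get_close_matches("appel", words)
print(matches)
```

The misspelled query still finds "apple" as its closest candidate, which is the behaviour the rest of this notebook relies on.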

In [62]:
!pip install fuzzywuzzy

Collecting fuzzywuzzy
  Downloading https://files.pythonhosted.org/packages/43/ff/74f23998ad2f93b945c0309f825be92e04e0348e062026998b5eefef4c33/fuzzywuzzy-0.18.0-py2.py3-none-any.whl
Installing collected packages: fuzzywuzzy
Successfully installed fuzzywuzzy-0.18.0


## Matching Room Types with FuzzyWuzzy

In [5]:
import pandas as pd
 
df = pd.read_excel('C:/Users/11004076/Documents/Python Scripts/2_DataAnalysis/room_type.xlsx')
df.head(10)

Unnamed: 0,Expedia,Booking.com
0,"Deluxe Room, 1 King Bed",Deluxe King Room
1,"Standard Room, 1 King Bed, Accessible",Standard King Roll-in Shower Accessible
2,"Grand Corner King Room, 1 King Bed",Grand Corner King Room
3,"Suite, 1 King Bed (Parlor)",King Parlor Suite
4,"High-Floor Premium Room, 1 King Bed",High-Floor Premium King Room
5,"Traditional Double Room, 2 Double Beds",Double Room with Two Double Beds
6,"Room, 1 King Bed, Accessible",King Room - Disability Access
7,"Deluxe Room, 1 King Bed",Deluxe King Room
8,Deluxe Room,Deluxe Room (Non Refundable)
9,"Room, 2 Double Beds (19th to 25th Floors)",Two Double Beds - Location Room (19th to 25th ...


In [None]:
FuzzyWuzzy offers several ways to compare two strings. Let's try them one at a time.

In [None]:
from fuzzywuzzy import fuzz

### ratio: similarity of the whole strings, compared in order

In [70]:
fuzz.ratio('Deluxe Room, 1 King Bed', 'Deluxe King Room')
# Returns 62: "Deluxe Room, 1 King Bed" and "Deluxe King Room" are about 62% similar.

62

In [65]:
fuzz.ratio('Traditional Double Room, 2 Double Beds','Double Room with Two Double Beds')

69

In [66]:
fuzz.ratio('Room, 2 Double Beds (19th to 25th Floors)','Two Double Beds - Location Room (19th to 25th Floors)')

74
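To show roughly where such a score comes from, here is a sketch built on `difflib.SequenceMatcher` from the standard library. It is an approximation under stated assumptions, not FuzzyWuzzy's own code: the function name `sketch_ratio` is made up for illustration, and the library's scores can differ slightly depending on which matching backend is installed.

```python
from difflib import SequenceMatcher


def sketch_ratio(a: str, b: str) -> int:
    """Similarity of two whole strings as an integer from 0 to 100."""
    # SequenceMatcher.ratio() returns 2*M/T, where M is the number of matched
    # characters and T is the combined length of both strings.
    return round(100 * SequenceMatcher(None, a, b).ratio())


print(sketch_ratio("Deluxe Room, 1 King Bed", "Deluxe King Room"))
```

Because the comparison runs over the strings as written, extra words and a different word order both lower the score.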

### partial_ratio: similarity of substrings
We are still using the same pairs of strings:

In [67]:
fuzz.partial_ratio('Deluxe Room, 1 King Bed','Deluxe King Room')

69

In [68]:
fuzz.partial_ratio('Traditional Double Room, 2 Double Beds','Double Room with Two Double Beds')

83

In [69]:
fuzz.partial_ratio('Room, 2 Double Beds (19th to 25th Floors)','Two Double Beds - Location Room (19th to 25th Floors)')

63

In [None]:
These calls return 69, 83, and 63 in turn. On my dataset, comparing substrings does not improve the overall result. Let's try the next one.
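The idea behind partial matching can be sketched as follows: slide the shorter string across the longer one and keep the best window score. This brute-force version is a simplification for illustration (the library chooses its candidate windows differently), and `sketch_partial_ratio` is a hypothetical name.

```python
from difflib import SequenceMatcher


def sketch_partial_ratio(a: str, b: str) -> int:
    """Best similarity between the shorter string and any equally long
    window of the longer string, as an integer from 0 to 100."""
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    if not shorter:
        return 0
    window = len(shorter)
    best = 0.0
    # Try every window of the longer string that has the shorter string's length.
    for start in range(len(longer) - window + 1):
        candidate = longer[start:start + window]
        best = max(best, SequenceMatcher(None, shorter, candidate).ratio())
    return round(100 * best)


print(sketch_partial_ratio("King", "Deluxe King Room"))
```

A string that appears verbatim inside a longer one scores 100, which helps when one name is a fragment of the other but not when the words are reordered.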

### token_sort_ratio: ignores word order

In [71]:
fuzz.token_sort_ratio('Deluxe Room, 1 King Bed','Deluxe King Room')

84

In [72]:
fuzz.token_sort_ratio('Traditional Double Room, 2 Double Beds','Double Room with Two Double Beds')

78

In [73]:
fuzz.token_sort_ratio('Room, 2 Double Beds (19th to 25th Floors)','Two Double Beds - Location Room (19th to 25th Floors)')

83

In [None]:
These calls return 84, 78, and 83 in turn. This is the best result so far.
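Ignoring word order can be sketched as: normalise each string into lowercase alphanumeric tokens, sort the tokens, and compare the rejoined strings. The helper names below are hypothetical, and the tokenising rule is an assumption about the preprocessing rather than the library's exact code.

```python
import re
from difflib import SequenceMatcher


def sorted_tokens(text: str) -> str:
    """Lowercase the text, keep alphanumeric tokens, and sort them."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(sorted(tokens))


def sketch_token_sort_ratio(a: str, b: str) -> int:
    """Whole-string similarity after sorting the tokens of both strings."""
    matcher = SequenceMatcher(None, sorted_tokens(a), sorted_tokens(b))
    return round(100 * matcher.ratio())


# The same words in a different order now compare as identical.
print(sketch_token_sort_ratio("New York Jets", "jets york new"))
```

Sorting removes the penalty for reordered words, but extra words in one string still count against the pair.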

### token_set_ratio: matching on deduplicated token sets
It is similar to token_sort_ratio but more flexible.

In [74]:
fuzz.token_set_ratio('Deluxe Room, 1 King Bed','Deluxe King Room')

100

In [75]:
fuzz.token_set_ratio('Traditional Double Room, 2 Double Beds','Double Room with Two Double Beds')

78

In [76]:
fuzz.token_set_ratio('Room, 2 Double Beds (19th to 25th Floors)','Two Double Beds - Location Room (19th to 25th Floors)')

97

In [None]:
These calls return 100, 78, and 97 in turn. token_set_ratio looks like the best fit for my data.
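The extra flexibility comes from treating each string as a set of tokens. A sketch of the idea, assuming the same tokenising rule as before and using hypothetical helper names: split the tokens into those shared by both strings and those unique to each, then take the best of three pairwise comparisons.

```python
import re
from difflib import SequenceMatcher


def _ratio(a: str, b: str) -> int:
    return round(100 * SequenceMatcher(None, a, b).ratio())


def sketch_token_set_ratio(a: str, b: str) -> int:
    """Best similarity among: shared tokens vs. each full token set, and
    the two full token sets against each other."""
    tokens_a = set(re.findall(r"[a-z0-9]+", a.lower()))
    tokens_b = set(re.findall(r"[a-z0-9]+", b.lower()))
    common = " ".join(sorted(tokens_a & tokens_b))
    only_a = " ".join(sorted(tokens_a - tokens_b))
    only_b = " ".join(sorted(tokens_b - tokens_a))
    # Each combined string is the shared part followed by that side's extras.
    combined_a = (common + " " + only_a).strip()
    combined_b = (common + " " + only_b).strip()
    return max(
        _ratio(common, combined_a),
        _ratio(common, combined_b),
        _ratio(combined_a, combined_b),
    )


print(sketch_token_set_ratio("Deluxe Room, 1 King Bed", "Deluxe King Room"))
```

When every token of one string also appears in the other, the shared part equals that string's full token set, so the pair scores 100 regardless of the extra words.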

### Applying it to the whole dataset

In [None]:
Based on this finding, apply token_set_ratio to the whole dataset.

In [77]:
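def get_ratio(row):
    # Score one Expedia / Booking.com room-name pair with token_set_ratio.
    name1 = row['Expedia']
    name2 = row['Booking.com']
    return fuzz.token_set_ratio(name1, name2)

# Score every row of the DataFrame.
rated = df.apply(get_ratio, axis=1)
rated.head(10)

# Keep the pairs scoring above 70 and compute their share of all pairs.
greater_than_70_percent = df[rated > 70]
greater_than_70_percent.count()
len(greater_than_70_percent) / len(df)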

0.9029126213592233
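The figure above is the fraction of pairs whose score exceeds the threshold. A small stand-alone sketch of that calculation, using only the three token_set_ratio scores reported earlier (the helper name `share_above` is hypothetical), shows how the share reacts to the threshold:

```python
def share_above(scores, threshold):
    """Fraction of scores strictly greater than the threshold."""
    if not scores:
        return 0.0
    return sum(score > threshold for score in scores) / len(scores)


# The three token_set_ratio scores from the examples earlier in this notebook.
sample_scores = [100, 78, 97]

print(share_above(sample_scores, 70))  # all three pairs pass
print(share_above(sample_scores, 80))  # the pair scoring 78 drops out
```

Raising the threshold trades coverage for precision, so it is worth checking a few borderline pairs by hand before settling on a value.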

### Handling multiple candidates: using process

In [None]:
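With the similarity threshold set to > 70, more than 90% of the room pairs exceed that score. Not bad! So far we have only compared two strings at a time.
What if there are many candidates? Use the library's process module, which returns the best fuzzy matches together with their similarity scores.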

In [80]:
from fuzzywuzzy import process

choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
process.extract("new york jets", choices, limit=2)

[('New York Jets', 100), ('New York Giants', 79)]

In [81]:
process.extractOne("cowboys", choices)

('Dallas Cowboys', 90)

In [92]:
which = ["DX 110", "DX-110", "Mouse Genius DX-120", "Mouse Genius USB NetScroll DX-110", "Mouse Optic DX-110"]
process.extract("DX-110", which, limit=4)

[('DX 110', 100),
 ('DX-110', 100),
 ('Mouse Genius USB NetScroll DX-110', 90),
 ('Mouse Optic DX-110', 90)]

In [93]:
whichs = ["HS-930BT-LI", "HS-940BT-LI", "HS-935BT-LI"]
process.extract("HS-930BT", whichs, limit=4)

[('HS-930BT-LI', 95), ('HS-940BT-LI', 74), ('HS-935BT-LI', 74)]

In [94]:
whichs = ["SW-G2.1 1250", "SW-G2.1 1250 II", "SW-G2.1 1250 IILI"]
process.extract("SW-G2.1 1250", whichs, limit=4)

[('SW-G2.1 1250', 100), ('SW-G2.1 1250 II', 95), ('SW-G2.1 1250 IILI', 95)]
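Conceptually, extracting matches means scoring every candidate and keeping the top few. The sketch below shows that ranking step with a plain `difflib` scorer; `simple_score` and `sketch_extract` are hypothetical names, and the library's default scorer is more elaborate, so its scores will not always agree with this one.

```python
from difflib import SequenceMatcher


def simple_score(query: str, choice: str) -> int:
    """Case-insensitive whole-string similarity from 0 to 100."""
    matcher = SequenceMatcher(None, query.lower(), choice.lower())
    return round(100 * matcher.ratio())


def sketch_extract(query, choices, limit=5):
    """Return up to `limit` (choice, score) pairs, best score first."""
    scored = [(choice, simple_score(query, choice)) for choice in choices]
    # The sort is stable, so ties keep the order of the input list.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
top = sketch_extract("new york jets", choices, limit=2)
print(top)
```

Every candidate is scored once, so the cost grows linearly with the number of choices.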

In [None]:
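You can pass extra arguments to extractOne to choose a specific scorer. A typical use is matching file paths (here `songs` stands for a list of file paths that is not defined in this notebook):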

In [None]:
>>> process.extractOne("System of a down - Hypnotize - Heroin", songs)
('/music/library/good/System of a Down/2005 - Hypnotize/01 - Attack.mp3', 86)
>>> process.extractOne("System of a down - Hypnotize - Heroin", songs, scorer=fuzz.token_sort_ratio)
("/music/library/good/System of a Down/2005 - Hypnotize/10 - She's Like Heroin.mp3", 61)
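Why the scorer matters can be seen in a self-contained sketch. It uses two stand-in scorers built on `difflib` (hypothetical names, not the library's functions) and a made-up pair of choices for which the two scorers pick different winners.

```python
import re
from difflib import SequenceMatcher


def plain_ratio(a: str, b: str) -> int:
    """Order-sensitive whole-string similarity from 0 to 100."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def token_sort(a: str, b: str) -> int:
    """Similarity after sorting the lowercase alphanumeric tokens."""
    def norm(text):
        return " ".join(sorted(re.findall(r"[a-z0-9]+", text.lower())))
    return round(100 * SequenceMatcher(None, norm(a), norm(b)).ratio())


def sketch_extract_one(query, choices, scorer=plain_ratio):
    """Return the single (choice, score) pair with the highest score."""
    return max(((c, scorer(query, c)) for c in choices), key=lambda p: p[1])


choices = ["jets new york", "new york giants"]
by_plain = sketch_extract_one("new york jets", choices)
by_tokens = sketch_extract_one("new york jets", choices, scorer=token_sort)
print(by_plain)
print(by_tokens)
```

The order-sensitive scorer prefers the candidate that shares the longest run of characters, while the token-based scorer recognises the reordered candidate as the same three words.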

### Using FuzzyWuzzy with Chinese text
FuzzyWuzzy can also compare Chinese strings:

In [79]:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
 
print(fuzz.ratio("資料探勘", "資料探勘工程師"))
 
title_list = ["資料分析師", "資料探勘工程師", "大資料開發工程師", "機器學習工程師",
              "演算法工程師", "資料庫管理", "商業分析師", "資料科學家", "首席資料官",
              "資料產品經理", "資料運營", "大資料架構師"]
 
print(process.extractOne("資料探勘", title_list))

73
('資料探勘工程師', 90)


In [None]:
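A closer look shows that some problems remain:

- FuzzyWuzzy does not segment Chinese text into words
- It does not filter out Chinese stop words either

Improvement: preprocess the Chinese text before matching:

- Convert between Traditional and Simplified Chinese
- Segment the text into words
- Remove stop words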
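The three preprocessing steps can be sketched with the standard library alone. Everything below is a toy stand-in: the character map, vocabulary, and stop-word set are tiny hypothetical examples, and the segmenter is a simple forward maximum matching routine. A real pipeline would rely on full conversion tables, a proper word segmenter, and a curated stop-word list.

```python
# Hypothetical toy resources, far smaller than anything usable in practice.
TRAD_TO_SIMP = str.maketrans({"資": "资", "師": "师"})
VOCAB = {"资料", "分析", "分析师"}
STOP_WORDS = {"的"}


def segment(text, vocab, max_len=3):
    """Forward maximum matching: at each position take the longest
    vocabulary word, falling back to a single character."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


def preprocess(text):
    """Convert to Simplified Chinese, segment, and drop stop words."""
    simplified = text.translate(TRAD_TO_SIMP)
    tokens = [t for t in segment(simplified, VOCAB) if t not in STOP_WORDS]
    # Joining with spaces gives token-based scorers word boundaries to work with.
    return " ".join(tokens)


print(preprocess("資料的分析師"))
```

Once both strings have been preprocessed this way, the space-separated tokens can be passed to a token-based scorer.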

In [60]:
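# difflib is part of the Python standard library, so no installation is needed and this command fails as shown.
!pip install difflib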

Collecting difflib


  Could not find a version that satisfies the requirement difflib (from versions: )
No matching distribution found for difflib


In [61]:
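import difflib

# difflib_data is not available in this environment, so two short hypothetical
# line lists stand in for its text1_lines and text2_lines.
text1_lines = ['Deluxe Room, 1 King Bed', 'Deluxe Room']
text2_lines = ['Deluxe King Room', 'Deluxe Room']

d = difflib.Differ()
diff = d.compare(text1_lines, text2_lines)
print('\n'.join(diff))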