# Fuzzy matching on Expedia

We try to match room descriptions that refer to the same room type on Expedia and Booking.com.

In [2]:
import pandas as pd

df = pd.read_csv('./data/room_type.txt')

In [3]:
df.head(10)

Unnamed: 0,Expedia,Booking.com
0,"Deluxe Room, 1 King Bed",Deluxe King Room
1,"Standard Room, 1 King Bed, Accessible",Standard King Roll-in Shower Accessible
2,"Grand Corner King Room, 1 King Bed",Grand Corner King Room
3,"Suite, 1 King Bed (Parlor)",King Parlor Suite
4,"High-Floor Premium Room, 1 King Bed",High-Floor Premium King Room
5,"Traditional Double Room, 2 Double Beds",Double Room with Two Double Beds
6,"Room, 1 King Bed, Accessible",King Room - Disability Access
7,"Deluxe Room, 1 King Bed",Deluxe King Room
8,Deluxe Room,Deluxe Room (Non Refundable)
9,"Room, 2 Double Beds (19th to 25th Floors)",Two Double Beds - Location Room (19th to 25th ...


### FuzzyWuzzy

Comparing two strings directly gives a similarity score between 0 and 100:

In [4]:
from fuzzywuzzy import fuzz
print(fuzz.ratio(df.iloc[7,0], df.iloc[7,1]),
fuzz.ratio(df.loc[7,"Expedia"],df.loc[7,"Booking.com"]))

62 62


In [5]:
fuzz.ratio(df.iloc[5,0], df.iloc[5,1])

69
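For intuition about what such a score measures, here is a minimal sketch built on the standard library's `difflib.SequenceMatcher`. The helper name `simple_ratio` is ours and is not part of fuzzywuzzy; it computes twice the number of matched characters divided by the total length of both strings, scaled to 0-100. Depending on whether python-Levenshtein is installed, `fuzz.ratio` may return slightly different values, so treat this as an approximation only.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Similarity on a 0-100 scale: 2 * matched characters / total characters."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


# "bcd" is the common block: 2 * 3 / (4 + 4) = 0.75
print(simple_ratio("abcd", "bcde"))  # 75

# Identical strings always score 100
print(simple_ratio("Deluxe Suite", "Deluxe Suite"))  # 100

# Word order matters: the same words in a different order lower the score
print(simple_ratio("Deluxe Room, 1 King Bed", "Deluxe King Room"))
```

Because the comparison is character-based and order-sensitive, two descriptions of the same room can score well below 100.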

If we keep only the pairs whose `token_set_ratio` score exceeds 90:

In [6]:
def get_ratio(row):
    name = row['Expedia']
    name1 = row['Booking.com']
    return fuzz.token_set_ratio(name, name1)

df[df.apply(get_ratio, axis=1) > 90]

Unnamed: 0,Expedia,Booking.com
0,"Deluxe Room, 1 King Bed",Deluxe King Room
2,"Grand Corner King Room, 1 King Bed",Grand Corner King Room
3,"Suite, 1 King Bed (Parlor)",King Parlor Suite
4,"High-Floor Premium Room, 1 King Bed",High-Floor Premium King Room
7,"Deluxe Room, 1 King Bed",Deluxe King Room
8,Deluxe Room,Deluxe Room (Non Refundable)
9,"Room, 2 Double Beds (19th to 25th Floors)",Two Double Beds - Location Room (19th to 25th ...
10,"Room, 1 King Bed (19th to 25 Floors)",King Bed - Location Room (19th to 25th Floors)
11,Deluxe Room,Deluxe Double Room
12,"Junior Suite, 1 King Bed with Sofa Bed",Junior Suite
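Pairs such as "Deluxe Room, 1 King Bed" / "Deluxe King Room" pass this filter even though their plain ratio is only 62, because a token-set comparison ignores word order and repeated words. The sketch below illustrates the idea with the standard library only; `token_set_sketch` is a hypothetical name, and its tokenization is a simplification of what fuzzywuzzy actually does, so scores may not match `fuzz.token_set_ratio` exactly.

```python
import re
from difflib import SequenceMatcher


def _ratio(a: str, b: str) -> int:
    return round(100 * SequenceMatcher(None, a, b).ratio())


def token_set_sketch(s1: str, s2: str) -> int:
    """Simplified token-set score: compare the shared tokens with each full token set."""
    t1 = set(re.findall(r"\w+", s1.lower()))
    t2 = set(re.findall(r"\w+", s2.lower()))
    common = " ".join(sorted(t1 & t2))
    # Rebuild each string as "shared tokens + tokens unique to that string"
    combined1 = (common + " " + " ".join(sorted(t1 - t2))).strip()
    combined2 = (common + " " + " ".join(sorted(t2 - t1))).strip()
    return max(
        _ratio(common, combined1),
        _ratio(common, combined2),
        _ratio(combined1, combined2),
    )


# Every token of the second string appears in the first one
print(token_set_sketch("Deluxe Room, 1 King Bed", "Deluxe King Room"))  # 100
```

A score of 100 here only means that one description's tokens are a subset of the other's, which is why "Deluxe Room" and "Deluxe Room (Non Refundable)" also pass the filter.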


We can strip leading and trailing whitespace and remove the commas:

In [7]:
df["Expedia"]=df["Expedia"].str.replace(",","").str.strip()
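The same cleanup can be written as a small standalone function, which is easier to test and to apply to both columns. This is only a sketch: `clean_description` is a hypothetical helper, and collapsing repeated spaces is an extra step that the cell above does not perform.

```python
import re


def clean_description(text: str) -> str:
    """Remove commas, collapse runs of whitespace, and strip both ends."""
    text = text.replace(",", "")
    text = re.sub(r"\s+", " ", text)
    return text.strip()


print(clean_description("  Deluxe Room, 1 King Bed "))  # Deluxe Room 1 King Bed
```

With pandas, it could be applied with `df["Expedia"].map(clean_description)`.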

In [8]:
df["distance"]=df.apply(lambda x : fuzz.ratio(x['Expedia'],x['Booking.com']), axis=1)

In [9]:
df.sort_values("distance", ascending=False)

Unnamed: 0,Expedia,Booking.com,distance
47,Deluxe Suite,Deluxe Suite,100
95,Regency Club Ocean View,Regency Club Ocean View,100
94,Regency Club Mountain View,Regency Club Mountain View,100
73,Standard Room Ocean View,Standard Room With Ocean View,91
26,Deluxe Room 2 Queen Beds,Deluxe Room - Two Queen Beds,88
25,Deluxe Room 1 King Bed,Deluxe Room - One King Bed,88
45,Business Double Room 2 Double Beds,Business Double Room With Two Double Beds,88
102,Junior Suite 1 King Bed Accessible (Roll-in Sh...,Junior Suite - Accessible Roll-in Shower,86
78,Room Ocean View,Room With Ocean View,86
72,Standard Room Lagoon View,Standard Room Dolphin Lagoon View,86


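With a `token_set_ratio` threshold of 60, more than 90% of the observations are retained. Note that the `distance` column shown below holds the plain `fuzz.ratio` score, so it can be lower than 60 for rows that pass the filter: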

In [10]:
df[df.apply(lambda row: fuzz.token_set_ratio(row['Expedia'], row['Booking.com']), axis=1) > 60]

Unnamed: 0,Expedia,Booking.com,distance
0,Deluxe Room 1 King Bed,Deluxe King Room,63
1,Standard Room 1 King Bed Accessible,Standard King Roll-in Shower Accessible,70
2,Grand Corner King Room 1 King Bed,Grand Corner King Room,80
3,Suite 1 King Bed (Parlor),King Parlor Suite,52
4,High-Floor Premium Room 1 King Bed,High-Floor Premium King Room,77
5,Traditional Double Room 2 Double Beds,Double Room with Two Double Beds,70
6,Room 1 King Bed Accessible,King Room - Disability Access,51
7,Deluxe Room 1 King Bed,Deluxe King Room,63
8,Deluxe Room,Deluxe Room (Non Refundable),56
9,Room 2 Double Beds (19th to 25th Floors),Two Double Beds - Location Room (19th to 25th ...,75


Here we use difflib to compare each Expedia description with all the Booking.com descriptions and keep the closest one:

In [12]:
import difflib

In [13]:
df['test'] = df['Expedia'].apply(
    lambda x: difflib.get_close_matches(x, 
                                        df['Booking.com'],
                                        n=1,
                                        cutoff=0)[0])
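In `difflib.get_close_matches`, `n` caps the number of results and `cutoff` is the minimum similarity (between 0 and 1) a candidate needs to be returned. With `cutoff=0` every candidate qualifies, so indexing with `[0]` is safe as long as the candidate list is not empty. The sketch below wraps the call in a hypothetical helper, `closest_match`, that returns `None` instead of raising when nothing qualifies.

```python
import difflib


def closest_match(query, candidates, cutoff=0.0):
    """Return the best candidate for `query`, or None if none reaches `cutoff`."""
    matches = difflib.get_close_matches(query, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


rooms = ["King Room", "Deluxe Suite", "Queen Room"]
print(closest_match("Deluxe Suite", rooms))      # Deluxe Suite
print(closest_match("abc", ["xyz"], cutoff=0.9))  # None
```

Raising `cutoff` is one way to avoid forcing a match when no candidate is actually similar.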

In [14]:
df.head()

Unnamed: 0,Expedia,Booking.com,distance,test
0,Deluxe Room 1 King Bed,Deluxe King Room,63,Deluxe Room - One King Bed
1,Standard Room 1 King Bed Accessible,Standard King Roll-in Shower Accessible,70,Standard King Roll-in Shower Accessible
2,Grand Corner King Room 1 King Bed,Grand Corner King Room,80,Grand Corner King Room
3,Suite 1 King Bed (Parlor),King Parlor Suite,52,King Parlor Suite
4,High-Floor Premium Room 1 King Bed,High-Floor Premium King Room,77,High-Floor Premium King Room


In [18]:
from fuzzywuzzy import process
from fuzzywuzzy import fuzz

In [19]:
df['test_fuzz'] = df['Expedia'].apply(lambda x: process.extractOne(x, df['Booking.com'], 
                                                                   scorer= fuzz.ratio,
                                                                   score_cutoff=0)[0])

In [20]:
df.head()

Unnamed: 0,Expedia,Booking.com,distance,test,test_fuzz
0,Deluxe Room 1 King Bed,Deluxe King Room,63,Deluxe Room - One King Bed,Deluxe Room - One King Bed
1,Standard Room 1 King Bed Accessible,Standard King Roll-in Shower Accessible,70,Standard King Roll-in Shower Accessible,Standard King Roll-in Shower Accessible
2,Grand Corner King Room 1 King Bed,Grand Corner King Room,80,Grand Corner King Room,Grand Corner King Room
3,Suite 1 King Bed (Parlor),King Parlor Suite,52,King Parlor Suite,King Parlor Suite
4,High-Floor Premium Room 1 King Bed,High-Floor Premium King Room,77,High-Floor Premium King Room,High-Floor Premium King Room


In [21]:
# check the rows where the two methods disagree
df[df["test"]!=df["test_fuzz"]]

Unnamed: 0,Expedia,Booking.com,distance,test,test_fuzz
14,Signature Room 1 King Bed,Signature One King,70,Signature King,Signature One King
23,Club Room City View (Club Lounge Access for 2 ...,Club Level King Or Queen Room with City View,39,Double Room with Two Double Beds,Business King Room - Exclusive access to Gold ...
24,Club Room Lake View (Club Lounge Access for 2 ...,Club Level King Or Queen Room with Water View,39,Queen Room - Disability Access,Business King Room - Exclusive access to Gold ...
31,Suite 1 King Bed Non Smoking,King Suite,37,Business King Room,Signature One King
32,Room 1 Queen Bed Non Smoking (Fairmont Room),Queen Room,37,Queen Room With Two Queen Beds and Waikiki View,Queen Room With Two Queen Beds and Garden View
37,Signature Room 2 Double Beds Non Smoking,Signature Double,57,Signature Double,Signature One King
38,Signature Room 1 King Bed Non Smoking,Signature King,55,Signature King,Signature One King
42,Double Room,Double Room with Two Double Beds,51,Luxury Double Room,Deluxe Double Room
48,Room 1 Queen Bed Accessible Bathtub,Queen Room - Disability Access,40,Deluxe Room - Two Queen Beds,Rainbow Tower Ocean View With King Bed - Mobil...
57,Deluxe Suite 1 Bedroom,Deluxe One - Bedroom Suite,71,Deluxe One - Bedroom Suite,Deluxe Suite
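One plausible reason for such disagreements, assuming default settings, is preprocessing: `process.extractOne` normalizes strings (lowercasing, replacing non-alphanumeric characters with spaces) before scoring, while `difflib.get_close_matches` compares the raw, case-sensitive strings. The scoring algorithm itself may also differ if python-Levenshtein is installed. The sketch below isolates only the preprocessing effect; `raw_score` and `preprocessed_score` are hypothetical names and use the standard library only.

```python
import re
from difflib import SequenceMatcher


def raw_score(a: str, b: str) -> int:
    """Case-sensitive character similarity on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def preprocessed_score(a: str, b: str) -> int:
    """Same score after lowercasing and replacing punctuation with spaces."""
    def prep(s: str) -> str:
        return re.sub(r"[^0-9a-zA-Z]+", " ", s).lower().strip()
    return raw_score(prep(a), prep(b))


# Case alone changes the raw score drastically, but not the preprocessed one
print(raw_score("KING ROOM", "king room"))
print(preprocessed_score("KING ROOM", "king room"))  # 100
```

When the two methods disagree, inspecting both candidates by hand remains the safest check, since neither score is guaranteed to pick the true counterpart.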
