# Fuzzy Matching Code

The code below fuzzy-matches two large datasets and produces a list of candidate matches, so that a human reviewer can find both matches and discrepancies between them.

The use case shown here is the reconciliation of concepts between WKC/UDF and CCDE. The code uses the `fuzzywuzzy` library in Python to present the best candidate matches between the two datasets, so that reviewers can arrive at a final record for both groups.

First, we import the libraries needed for this work.

In [1]:
import re
import pandas as pd
import requests
import psycopg2
import pymssql
import os
from fuzzywuzzy import fuzz, process
import xlsxwriter
import numpy as np
from tqdm import tqdm

pd.set_option('display.max_rows', 150000)
pd.set_option('display.max_columns', 10)
pd.set_option('display.width', 10000)

Next, we load both the WKC extract and the Stewardship-CCDE file provided by Molly, found here: https://mskcc.sharepoint.com/:x:/t/UnifiedDataFabric85/EQLyh5clRqZPpEKgOTHnY_sBE6ipmXf7gHmGNnZs7FG-fg?e=DfwabG. Both extracts are saved locally for this work.

In other cases, you can query a database directly to retrieve a dataset instead, as in the Regimens Fuzzy Match Python script.
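As a rough illustration of loading a dataset from a database instead of a file, the sketch below reads a query result into a dataframe with `pd.read_sql_query`. It uses an in-memory SQLite database so that it runs on its own. The `concepts` table, its rows, and the `df_from_db` name are hypothetical, and the Regimens script itself may connect differently (for example through `psycopg2` or `pymssql`).

```python
import sqlite3

import pandas as pd

# Hypothetical in-memory database standing in for a real source system
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE concepts (Name TEXT, Description TEXT)")
conn.executemany(
    "INSERT INTO concepts VALUES (?, ?)",
    [
        ("End Date", "Date the event ended"),
        ("Start Date", "Date the event started"),
    ],
)
conn.commit()

# Read the query result straight into a dataframe, in place of pd.read_excel
df_from_db = pd.read_sql_query("SELECT Name, Description FROM concepts", conn)
conn.close()

print(df_from_db)
```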

In [2]:
df = pd.read_excel('WKC-Extract-PROD.xlsx')
df2 = pd.read_excel('./Stewardship/Stewardship-CCDE-full.xlsx')

Finally, we build the fuzzy matching procedure, which runs three checks:
- Fuzzy match between **WKC Concept Name** and **CCDE Concept Definition**
- Fuzzy match between **WKC Concept Name** and **CCDE Concept Name** (each alternative label)
- Fuzzy match between **WKC Working Definition** and **CCDE Concept Definition**

Each check averages four fuzzy match scores (ratio, partial ratio, token sort ratio, and token set ratio). We then average the three checks and keep the highest overall score for each CCDE concept to present to the end user.
Note that we can generate as many fuzzy match scores and averages as we need.
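Each check in the matching loop repeats the same pattern: lowercase both strings, run several scorers, and average the results. A minimal sketch of that pattern as a reusable helper follows. The name `mean_score` and the two stand-in scorers are hypothetical. The stand-ins are built on the standard library's `difflib` so that the example runs on its own; they only approximate the `fuzzywuzzy` scorers and are not that library's implementation.

```python
from difflib import SequenceMatcher


def simple_ratio(a, b):
    """Stand-in scorer: character-level similarity on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def sorted_token_ratio(a, b):
    """Stand-in scorer: sort the words first, so word order is ignored."""
    return simple_ratio(" ".join(sorted(a.split())), " ".join(sorted(b.split())))


def mean_score(a, b, scorers):
    """Lowercase both inputs, apply every scorer, and return the mean score."""
    a, b = str(a).lower(), str(b).lower()
    return sum(scorer(a, b) for scorer in scorers) / len(scorers)


stand_ins = (simple_ratio, sorted_token_ratio)

# Same words in a different order: only the order-insensitive scorer is perfect
print(mean_score("End Date", "date end", stand_ins))

# Same words in the same order: both scorers agree on a perfect match
print(mean_score("End Date", "end date", stand_ins))
```

In the notebook, the same helper could be called with `(fuzz.ratio, fuzz.partial_ratio, fuzz.token_sort_ratio, fuzz.token_set_ratio)` as `scorers` to reproduce the four-scorer average used for each check. Averaging several scorers softens the effect of word order and partial overlap on any single score.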

In [4]:
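# One entry is appended to each list per (WKC row, CCDE alternative label) pair
sc = []    # CCDE domains
conc = []  # WKC concept name
wd = []    # WKC definition
ccde = []  # CCDE primary label
d = []     # CCDE definition
ad = []    # CCDE alternative label
avga = []  # check 1 average: WKC name vs CCDE definition
avgb = []  # check 2 average: WKC name vs CCDE alternative label
avgc = []  # check 3 average: WKC definition vs CCDE definition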
# df3 = pd.DataFrame(columns=['Domain','WKC Concept Name','WKC Description','CCDE Name', 'CCDE Description','CCDE Alt Name', 'AVG Score - Check 1','AVG Score - Check 2','AVG Score - Check 3'])
for idx, row in tqdm(df.iterrows(), total=df.shape[0], desc="Loading..."):
  for idx2, row2 in df2.iterrows():
    alt_list = str(row2['alternativeLabel']).replace('[','').replace(']','').replace("'","").split(',')
    for item in alt_list:
      sc.append(row2['domains'])
      conc.append(row['Name'])
      wd.append(row['Description'])
      ccde.append(row2['primaryLabel'])
      d.append(row2['description'])
      ad.append(item.strip())

      ### check number 1 ####
      ratio = fuzz.ratio(str(row['Name']).lower(), str(row2['description']).lower())
      pratio = fuzz.partial_ratio(str(row['Name']).lower(), str(row2['description']).lower())
      tokensr = fuzz.token_sort_ratio(str(row['Name']).lower(), str(row2['description']).lower())
      tokenstr = fuzz.token_set_ratio(str(row['Name']).lower(), str(row2['description']).lower())

      avg1 = (ratio + pratio + tokensr + tokenstr) / 4
      avga.append(avg1)

      ### check number 2 ####
      ratio2 = fuzz.ratio(str(row['Name']).lower(), item.strip().lower())
      pratio2 = fuzz.partial_ratio(str(row['Name']).lower(), item.strip().lower())
      tokensr2 = fuzz.token_sort_ratio(str(row['Name']).lower(), item.strip().lower())
      tokenstr2 = fuzz.token_set_ratio(str(row['Name']).lower(), item.strip().lower())
      avg2 = (ratio2 + pratio2 + tokensr2 + tokenstr2) / 4
      avgb.append(avg2)

      ### check number 3 ####
      ratio3 = fuzz.ratio(str(row['Description']).lower(), str(row2['description']).lower())
      pratio3 = fuzz.partial_ratio(str(row['Description']).lower(), str(row2['description']).lower())
      tokensr3 = fuzz.token_sort_ratio(str(row['Description']).lower(), str(row2['description']).lower())
      tokenstr3 = fuzz.token_set_ratio(str(row['Description']).lower(), str(row2['description']).lower())
      avg3 = (ratio3 + pratio3 + tokensr3 + tokenstr3) / 4
      avgc.append(avg3)
      # print(f"adding to DF information for {item.strip()}, avg score 1: {avg1}, avg score 2: {avg2}, avg score 3: {avg3}")
      # try:
      # df3.loc[len(df3)] = [row2['domains'],row['Name'],row['Description'],row2['primaryLabel'],row2['description'],item.strip(),avg1, avg2, avg3]

      # except Exception:
      #   print(df3.shape)
      #   print(f"Error adding for {item.strip()}, avg score 1: {avg1}, avg score 2: {avg2}, avg score 3: {avg3} ")

Loading...: 100%|██████████| 2900/2900 [52:18<00:00,  1.08s/it] 


In [5]:
print(len(avgb))

5315700
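The count above follows from the loop structure: one row is produced per (WKC concept, CCDE alternative label) pair. With 2,900 WKC rows (the progress bar total), 5,315,700 rows corresponds to 1,833 alternative labels across the CCDE file. The sketch below reproduces that counting on toy data. The helper name `split_alt_labels` and the sample values are hypothetical; the parsing mirrors the `replace`/`split` chain used in the loop.

```python
def split_alt_labels(raw):
    """Mirror the notebook's parsing of an alternativeLabel cell."""
    cleaned = str(raw).replace('[', '').replace(']', '').replace("'", "")
    return [label.strip() for label in cleaned.split(',')]


# Hypothetical toy inputs: 3 WKC names, 2 CCDE rows holding 2 + 1 labels
wkc_names = ["End Date", "Start Date", "Report Details"]
ccde_alt_labels = ["['end date', 'stop date']", "['report']"]

pairs = [
    (name, alt)
    for name in wkc_names
    for raw in ccde_alt_labels
    for alt in split_alt_labels(raw)
]
print(len(pairs))  # 3 WKC names * 3 alternative labels

# The same arithmetic for the run shown above
print(2_900 * 1_833)

# A missing cell (NaN) still yields one label, the literal string 'nan'
print(split_alt_labels(float("nan")))
```

Because `str(...).split(',')` always returns at least one element, every CCDE row contributes at least one pair, even when its alternative label is empty. A label that itself contains a comma would be split into several entries by this parsing.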


Then we set up a new dataframe to hold the results for the output Excel file.

In [6]:
df3 = pd.DataFrame()
df3['Domain'] = sc
df3['WKC Concept Name'] = conc
df3['WKC Definition'] = wd
df3['CCDE Name - Molly'] = ccde
df3['CCDE Alt Name - Molly'] = ad
df3['CCDE Definition - Molly'] = d
df3['WKC Concept - CCDE Definition - AVG'] = avga
df3['WKC Concept - CCDE Name - AVG'] = avgb
df3['WKC Definition - CCDE Definition - AVG'] = avgc


We average the three check averages, then keep the row (or rows, in case of a tie) with the highest overall average for each CCDE concept in our new dataframe.
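The selection step can be sketched on a toy dataframe as follows. The column names `check1` to `check3`, the sample scores, and the `scores`/`best` names are hypothetical stand-ins for the notebook's score columns. The sketch uses a vectorized row mean and `groupby(...).transform("max")` to mark the best row per CCDE concept.

```python
import pandas as pd

# Hypothetical scores: two candidate WKC concepts for each CCDE name
scores = pd.DataFrame({
    "CCDE Name": ["A", "A", "B", "B"],
    "WKC Concept Name": ["w1", "w2", "w1", "w3"],
    "check1": [70.0, 40.0, 10.0, 90.0],
    "check2": [80.0, 50.0, 20.0, 60.0],
    "check3": [60.0, 30.0, 30.0, 90.0],
})

# Overall score: mean of the three per-check averages
scores["AVG"] = scores[["check1", "check2", "check3"]].mean(axis=1)

# Mark rows whose overall score equals the maximum within their CCDE name
is_best = scores["AVG"] == scores.groupby("CCDE Name")["AVG"].transform("max")
best = scores[is_best].reset_index(drop=True)

print(best[["CCDE Name", "WKC Concept Name", "AVG"]])
```

Comparing against the group maximum keeps every tied row, which matches the filter used in the notebook; picking a single row per group would need an extra tie-breaking rule.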

In [20]:
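# Overall score: mean of the three per-check averages
score_cols = ['WKC Concept - CCDE Definition - AVG', 'WKC Concept - CCDE Name - AVG', 'WKC Definition - CCDE Definition - AVG']
df3['AVG'] = df3[score_cols].mean(axis=1)
# Keep only the row(s) with the highest overall score for each CCDE concept
dfp = df3.groupby(['CCDE Name - Molly'])['AVG'].transform('max') == df3['AVG']
df3 = df3[dfp]
# reset_index() keeps the original row number in an 'index' column;
# no 'level_0' column exists on a fresh run, so there is nothing further to drop
df3 = df3.drop_duplicates().reset_index()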

We inspect the results and save them to an Excel file for further review.

In [21]:
print(df3)

       index                                             Domain                                   WKC Concept Name                                     WKC Definition                                  CCDE Name - Molly  ...                            CCDE Definition - Molly WKC Concept - CCDE Definition - AVG  WKC Concept - CCDE Name - AVG  WKC Definition - CCDE Definition - AVG        AVG
0       1314                                          Pathology                       MMR Deficient Report Details  Details of report determined to be MMR Deficie...                             MMR deficiency details  ...                Mismatch repair deficiency details.                               70.25                          81.50                                   50.25  67.333333
1      27506                                      Adverse event            Letter Of Agreement 9 End Date and Time  Date time 9 when the letter of agreement has e...                          End date of adverse event  

In [22]:
# The context manager saves and closes the file on exit (writer.save() was removed in pandas 2.0)
with pd.ExcelWriter('ccde-test.xlsx', engine='xlsxwriter') as writer:
  df3.to_excel(writer, sheet_name='WKC-Vs-CCDE', index=False)