### Screening Model
#### Ronel Khan

The first step is to import the required libraries.  
We need FuzzyWuzzy to run our fuzzy matching algorithms.  
Pandas is required to read the input and reference data as dataframes.

In [1]:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
from IPython.display import Markdown, display
import pandas as pd

Next, we load our list of client names, using Pandas to read the file into a dataframe. The client names will serve as our input data.  

In [2]:
clients = pd.read_csv(r'C:\Users\ronel\OneDrive\Documents\Python\Client_List.csv')
display(clients)

Unnamed: 0,Client Name
0,JOHN BROWN
1,MIKA ARSHAD
2,ENRIQUE TORRES GOMEZ
3,HUD SALEM
4,AMBER DAVIES
...,...
5768,GAYNOR SMITH
5769,PEPILLO THOMPSON
5770,JARIYAH LONE
5771,WASFIYAH MAHDI


We then load our dataset containing the names of individuals and entities on the Specially Designated Nationals List maintained by the United States Office of Foreign Assets Control (OFAC), more commonly known as the SDN List. The SDN List will serve as our reference data.  
  
The steps taken to derive and cleanse the SDN dataset, so that it contains all primary/legal names and aliases, can be found in 'Data Cleanse.ipynb'.

In [3]:
sdn = pd.read_csv('SDN.csv')
sdn['SDN_Names'] = sdn['SDN_Names'].str.upper()
display(sdn)

Unnamed: 0,SDN_Names
0,AEROCARIBBEAN AIRLINES; AERO-CARIBBEAN;
1,"ANGLO-CARIBBEAN CO., LTD.; AVIA IMPORT;"
2,BANCO NACIONAL DE CUBA; NATIONAL BANK OF CUBA;...
3,BOUTIQUE LA MAISON
4,CASA DE CUBA
...,...
8435,"IBRAHIM, NASREEN HUSSEIN; IBRAHIM, NSRIN; IBRA..."
8436,"IBRAHIM, RANA HUSSEIN; IBRAHIM, RANA HUSSIN;"
8437,"JEDID, MILAD; JADID, MILAD; JADEED, MILAD; JED..."
8438,SYRIAN MINISTRY OF TOURISM


To run our FuzzyWuzzy algorithm effectively, we need the reference data as a list, so we convert our dataframe to a list.

In [4]:
sdnlist = sdn.values.tolist()
sdnlist

[['AEROCARIBBEAN AIRLINES; AERO-CARIBBEAN; '],
 ['ANGLO-CARIBBEAN CO., LTD.; AVIA IMPORT; '],
 ['BANCO NACIONAL DE CUBA; NATIONAL BANK OF CUBA;  BNC.'],
 ['BOUTIQUE LA MAISON '],
 ['CASA DE CUBA '],
 ['CECOEX, S.A. '],
 ['CIMEX '],
 ['CIMEX IBERICA '],
 ['CIMEX, S.A. '],
 ['COMERCIAL IBEROAMERICANA, S.A.; COIBA; '],
 ['COMERCIAL CIMEX, S.A. '],
 ['COMERCIAL DE RODAJES Y MAQUINARIA, S.A.; CRYMSA; '],
 ['COMERCIALIZACION DE PRODUCTOS VARIOS; COPROVA; COPROVA SARL; '],
 ['COMPANIA DE IMPORTACION Y EXPORTACION IBERIA; CIMEX; '],
 ['CORPORACION CIMEX, S.A. '],
 ['COTEI '],
 ['CRUZ, JUAN M. DE LA '],
 ['CRYMSA - ARGENTINA, S.A. '],
 ['CUBACANCUN CIGARS AND GIFT SHOPS '],
 ['CUBAEXPORT '],
 ['CUBAFRUTAS '],
 ['CUBAN CIGARS TRADE '],
 ['CUBANATUR '],
 ['CUBATABACO '],
 ['CUMEXINT, S.A. '],
 ['DELVEST HOLDING, S.A.; DELVEST HOLDING COMPANY; '],
 ['EDICIONES CUBANAS '],
 ['EDYJU, S.A. '],
 ['EMPRESA CUBANA DE AVIACION; CUBANA AIRLINES; '],
 ['EMPRESA CUBANA DE PESCADOS Y MARISCOS; CARIBBEAN EXPO

However, converting the dataframe this way produces a list of lists, so we now need to convert it into a single, flat list.

In [5]:
# Flatten the list of single-item lists into one flat list of names
sdnflat = [name for row in sdnlist for name in row]
print(sdnflat)

['AEROCARIBBEAN AIRLINES; AERO-CARIBBEAN; ', 'ANGLO-CARIBBEAN CO., LTD.; AVIA IMPORT; ', 'BANCO NACIONAL DE CUBA; NATIONAL BANK OF CUBA;  BNC.', 'BOUTIQUE LA MAISON ', 'CASA DE CUBA ', 'CECOEX, S.A. ', 'CIMEX ', 'CIMEX IBERICA ', 'CIMEX, S.A. ', 'COMERCIAL IBEROAMERICANA, S.A.; COIBA; ', 'COMERCIAL CIMEX, S.A. ', 'COMERCIAL DE RODAJES Y MAQUINARIA, S.A.; CRYMSA; ', 'COMERCIALIZACION DE PRODUCTOS VARIOS; COPROVA; COPROVA SARL; ', 'COMPANIA DE IMPORTACION Y EXPORTACION IBERIA; CIMEX; ', 'CORPORACION CIMEX, S.A. ', 'COTEI ', 'CRUZ, JUAN M. DE LA ', 'CRYMSA - ARGENTINA, S.A. ', 'CUBACANCUN CIGARS AND GIFT SHOPS ', 'CUBAEXPORT ', 'CUBAFRUTAS ', 'CUBAN CIGARS TRADE ', 'CUBANATUR ', 'CUBATABACO ', 'CUMEXINT, S.A. ', 'DELVEST HOLDING, S.A.; DELVEST HOLDING COMPANY; ', 'EDICIONES CUBANAS ', 'EDYJU, S.A. ', 'EMPRESA CUBANA DE AVIACION; CUBANA AIRLINES; ', 'EMPRESA CUBANA DE PESCADOS Y MARISCOS; CARIBBEAN EXPORT ENTERPRISE;  CARIBEX.', 'EMPRESA DE TURISMO NACIONAL Y INTERNACIONAL; CUBATUR; ', 'ET
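
An alternative way to flatten a nested list is `itertools.chain.from_iterable` from the standard library. The sketch below uses a three-row sample in the same shape as `sdnlist`; the variable names `sample_rows` and `flat` are illustrative only and do not appear in the notebook.

```python
from itertools import chain

# A small sample shaped like sdnlist: one single-item list per row
sample_rows = [["CASA DE CUBA "], ["CIMEX "], ["CUBAEXPORT "]]

# chain.from_iterable walks each inner list in turn and yields its items
flat = list(chain.from_iterable(sample_rows))
print(flat)  # ['CASA DE CUBA ', 'CIMEX ', 'CUBAEXPORT ']
```

With a dataframe, selecting the single column first, for example `sdn['SDN_Names'].tolist()`, would likely avoid the nested list altogether.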

We define a function, using FuzzyWuzzy, that returns two values: the closest hit and its fuzzy match score. We also add a condition so that we are only informed of potential matches scoring 85 or above, our minimum threshold.
  
We run this algorithm across the four fuzzy match types FuzzyWuzzy offers:  

- Simple Ratio | Compares the similarity between two strings
- Partial Ratio | Compares the similarity between two strings whilst also taking partial matches (substrings) into consideration
- Token Sort Ratio | Compares the similarity between two strings while disregarding the order of the words (tokens) within each string; the tokens are sorted alphabetically before being compared
- Token Set Ratio | Compares the similarity between two strings in a way similar to Token Sort; however, instead of simply being sorted alphabetically and compared, the tokens are split into two groups, the ‘intersection’ (tokens common to both strings) and the ‘remainder’, which are then compared
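
As a rough illustration of how the four comparisons differ, the sketch below approximates each one with the standard library's `difflib.SequenceMatcher`. The function names mirror FuzzyWuzzy's scorers, but these are simplified stand-ins written for this example, not the library's implementation, so their scores may differ from FuzzyWuzzy's.

```python
from difflib import SequenceMatcher

def simple_ratio(a, b):
    """Similarity of the two full strings, scaled to 0-100."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

def partial_ratio(a, b):
    """Best score of the shorter string against any same-length slice of the longer."""
    shorter, longer = sorted((a, b), key=len)
    if not shorter:
        return 0
    window = len(shorter)
    return max(
        simple_ratio(shorter, longer[i:i + window])
        for i in range(len(longer) - window + 1)
    )

def token_sort_ratio(a, b):
    """Sort the words of each string alphabetically, then compare."""
    return simple_ratio(" ".join(sorted(a.split())), " ".join(sorted(b.split())))

def token_set_ratio(a, b):
    """Compare the shared words ('intersection') with each string's remainder."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    common = " ".join(sorted(tokens_a & tokens_b))
    full_a = (common + " " + " ".join(sorted(tokens_a - tokens_b))).strip()
    full_b = (common + " " + " ".join(sorted(tokens_b - tokens_a))).strip()
    return max(
        simple_ratio(common, full_a),
        simple_ratio(common, full_b),
        simple_ratio(full_a, full_b),
    )

print(simple_ratio("JOHN BROWN", "BROWN JOHN"))           # below 100: word order matters
print(token_sort_ratio("JOHN BROWN", "BROWN JOHN"))       # 100: word order is ignored
print(partial_ratio("ARSHAD", "MIKA ARSHAD"))             # 100: substring match
print(token_set_ratio("JOHN BROWN", "JOHN BROWN SMITH"))  # 100: shared words match fully
```

The partial and token set results show why those two scorers tend to produce more hits, and therefore more false positives, than the simple ratio.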

In [6]:
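def fuzz_m(col, sdn, score_t):
    """Return the closest SDN entry and its score, or blanks if below the threshold."""
    # extractOne returns the single best (match, score) pair from the reference list
    potential_hit, score = process.extractOne(col, sdn, scorer=score_t)
    if score<85:
        # Below our minimum threshold: report no potential hit
        return "", ""
    else:
        return potential_hit, score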

clients['Ratio Match'],clients['Ratio Score'] = zip(*clients['Client Name'].apply(fuzz_m, sdn=sdnflat, score_t=fuzz.ratio))
clients['Partial Ratio Match'], clients['Partial Score'] = zip(*clients['Client Name'].apply(fuzz_m, sdn=sdnflat, score_t=fuzz.partial_ratio))
clients['Token Sort Match'], clients['Token Sort Score'] = zip(*clients['Client Name'].apply(fuzz_m, sdn=sdnflat, score_t=fuzz.token_sort_ratio))
clients['Token Set Match'], clients['Token Set Score'] = zip(*clients['Client Name'].apply(fuzz_m, sdn=sdnflat, score_t=fuzz.token_set_ratio))
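
The matching step can also be pictured without the library. The sketch below is a simplified stand-in for `process.extractOne` plus the threshold check, using `difflib` as the scorer; `score`, `best_match` and the sample names are hypothetical and chosen only for illustration. It also shows how `zip(*...)` turns a sequence of `(match, score)` pairs into two parallel sequences, which is what the column assignments above rely on.

```python
from difflib import SequenceMatcher

def score(a, b):
    """Similarity of two strings, scaled to 0-100 (stand-in for a FuzzyWuzzy scorer)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

def best_match(name, reference, threshold=85):
    """Return (closest reference entry, score), or ("", "") if below the threshold."""
    hit = max(reference, key=lambda candidate: score(name, candidate))
    hit_score = score(name, hit)
    if hit_score < threshold:
        return "", ""
    return hit, hit_score

reference = ["CASA DE CUBA", "CUBAEXPORT", "CIMEX IBERICA"]
names = ["CASA DE CUBA", "JOHN BROWN"]

# One (match, score) pair per input name
results = [best_match(n, reference) for n in names]

# zip(*results) regroups the pairs into one tuple of matches and one of scores
matches, scores = zip(*results)
print(matches)  # ('CASA DE CUBA', '')
print(scores)   # (100, '')
```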



Our output should display the input data (Client Name) in the leftmost column, followed by the best match (subject to the threshold score of 85 or above) and its fuzzy match score for each match type in the subsequent columns.

In [7]:
display(clients)

Unnamed: 0,Client Name,Ratio Match,Ratio Score,Partial Ratio Match,Partial Score,Token Sort Match,Token Sort Score,Token Set Match,Token Set Score
0,JOHN BROWN,,,,,,,,
1,MIKA ARSHAD,,,PARSHAD,86,,,,
2,ENRIQUE TORRES GOMEZ,,,,,,,,
3,HUD SALEM,,,,,,,,
4,AMBER DAVIES,,,,,,,,
...,...,...,...,...,...,...,...,...,...
5768,GAYNOR SMITH,,,,,,,,
5769,PEPILLO THOMPSON,,,,,,,,
5770,JARIYAH LONE,,,,,,,,
5771,WASFIYAH MAHDI,,,,,,,,


We can then use the `to_csv` export function to save our new dataframe, which includes the results of our FuzzyWuzzy matches, to a .csv file for our reference.

In [9]:
clients.to_csv('C:\\Users\\ronel\\OneDrive\\Documents\\Python\\Fuzzy Logic 85.csv')

We run the model again, but this time with the similarity score threshold set to 90 in order to limit the scope for false positives (although we could also simply disregard all scores below 90 in our previous results to achieve the same outcome).
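
The alternative mentioned in parentheses, tightening the threshold by filtering results that were already computed, can be sketched as follows. The `(match, score)` pairs are placeholders in the shape `fuzz_m` returns (the second pair mirrors the one hit shown in the output above, the third is made up), and `apply_threshold` is a hypothetical helper.

```python
def apply_threshold(results, threshold):
    """Blank out any (match, score) pair whose score is below the threshold."""
    filtered = []
    for match, score in results:
        if score == "" or score < threshold:
            filtered.append(("", ""))
        else:
            filtered.append((match, score))
    return filtered

# Results as they might look after a run at threshold 85
results_85 = [("", ""), ("PARSHAD", 86), ("EXAMPLE NAME", 93)]

# Raising the threshold to 90 only removes hits; it never adds new ones
results_90 = apply_threshold(results_85, 90)
print(results_90)  # [('', ''), ('', ''), ('EXAMPLE NAME', 93)]
```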

In [11]:
def fuzz_m(col, sdn, score_t):
    potential_hit, score = process.extractOne(col, sdn, scorer=score_t)
    if score<90:
        return "", ""
    else:
        return potential_hit, score

clients['Ratio Match'],clients['Ratio Score'] = zip(*clients['Client Name'].apply(fuzz_m, sdn=sdnflat, score_t=fuzz.ratio))
clients['Partial Ratio Match'], clients['Partial Score'] = zip(*clients['Client Name'].apply(fuzz_m, sdn=sdnflat, score_t=fuzz.partial_ratio))
clients['Token Sort Match'], clients['Token Sort Score'] = zip(*clients['Client Name'].apply(fuzz_m, sdn=sdnflat, score_t=fuzz.token_sort_ratio))
clients['Token Set Match'], clients['Token Set Score'] = zip(*clients['Client Name'].apply(fuzz_m, sdn=sdnflat, score_t=fuzz.token_set_ratio))




In [12]:
display(clients)

Unnamed: 0,Client Name,Ratio Match,Ratio Score,Partial Ratio Match,Partial Score,Token Sort Match,Token Sort Score,Token Set Match,Token Set Score
0,JOHN BROWN,,,,,,,,
1,MIKA ARSHAD,,,,,,,,
2,ENRIQUE TORRES GOMEZ,,,,,,,,
3,HUD SALEM,,,,,,,,
4,AMBER DAVIES,,,,,,,,
...,...,...,...,...,...,...,...,...,...
5768,GAYNOR SMITH,,,,,,,,
5769,PEPILLO THOMPSON,,,,,,,,
5770,JARIYAH LONE,,,,,,,,
5771,WASFIYAH MAHDI,,,,,,,,


In [13]:
clients.to_csv('C:\\Users\\ronel\\OneDrive\\Documents\\Python\\Fuzzy Logic 90.csv')

#### Limitations of FuzzyWuzzy

A significant limitation of the FuzzyWuzzy package is that it doesn’t account for phonetic similarities, something which would undoubtedly assist in identifying individuals whose names have been transliterated from non-Latin scripts such as Arabic and Chinese.  
  
One particular example is the name ‘عُثْمان’, which can be transliterated as Othman, Uthman, Osman or Usmaan, amongst a few other options. If coupled with another name, ‘مايكل’, which could be transliterated as Mika’eel just as easily as Mikhail, we could find ourselves with false negatives when trying to match transliterated combinations against one another.  
  
Despite the two strings representing the same name, every FuzzyWuzzy scorer returns a score below 65, so no potential match would be flagged at our thresholds of 85 or 90.

In [14]:
A = fuzz.ratio("OTHMAN MIKHAIL","USMAAN MIKA'EEL")
B = fuzz.partial_ratio("OTHMAN MIKHAIL","USMAAN MIKA'EEL")
C = fuzz.token_sort_ratio("OTHMAN MIKHAIL","USMAAN MIKA'EEL")
D = fuzz.token_set_ratio("OTHMAN MIKHAIL","USMAAN MIKA'EEL")

print(A,B,C,D)

62 64 55 55


In [15]:
AA = fuzz.ratio("OTHMAN MICHAEL", "USMAAN MIKA'EEL")
BB = fuzz.partial_ratio("OTHMAN MICHAEL","USMAAN MIKA'EEL")
CC = fuzz.token_sort_ratio("OTHMAN MICHAEL","USMAAN MIKA'EEL")
DD = fuzz.token_set_ratio("OTHMAN MICHAEL","USMAAN MIKA'EEL")

print(AA,BB,CC,DD)

62 64 48 48
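
To illustrate what a phonetic comparison could add, the sketch below implements a simplified version of the classic Soundex code in plain Python. It is not part of the screening model above; `soundex` and `CODES` are written here for illustration only, and Soundex is just one of several possible phonetic encodings.

```python
# Soundex digit for each consonant group; vowels, H, W and Y have no digit
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit

def soundex(name):
    """Return a four-character Soundex key (first letter plus three digits)."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    encoded = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate repeated codes
        code = CODES.get(ch, "")
        if code and code != prev:
            encoded += code
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (encoded + "000")[:4]

for name in ("MIKHAIL", "MIKA'EEL", "MICHAEL", "OTHMAN", "USMAAN"):
    print(name, soundex(name))
# MIKHAIL M240
# MIKA'EEL M240
# MICHAEL M240
# OTHMAN O355
# USMAAN U255
```

The three spellings of the second name share a key, while OTHMAN and USMAAN do not, because Soundex keeps the first letter and maps T and S to different digits. This shows both the appeal and the limits of a simple phonetic key.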


__With all screening models, it is important to understand, given the importance of financial crime prevention, that a decrease in false positives should not come at the expense of an increase in false negatives.__