Before onboarding a new customer, we need to check whether the customer already exists in the system.
If the customer's details match exactly on every attribute, it is easy to tell whether this is an existing customer.
Most of the time, however, the attributes are not exactly the same even though it is the same customer.
Implement a Google-like search, i.e., a fuzzy-logic search utility for the CRM,
to search for existing customers on multiple criteria such as Name, Address, etc.

a. Advanced search algorithm (Phonetic algorithms etc.)

b. Integration - real time data search from enterprise application to display results.

c. Performance of algorithm

# My References include GeeksforGeeks.com & Medium.com

#### 1. Import Required Python Libraries including Phonetic & FuzzyWuzzy

In [None]:
pip install soundex



In [None]:
pip install fuzzywuzzy



In [None]:
import pandas as pd
from fuzzywuzzy import fuzz, process
print('Libraries Imported!')

Libraries Imported!




In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


#### 2. Read Dataset & Prepare List of Customer's Full Names

In [None]:
### ###############################################################################
# 1 a. Read Customer Database for information on Existing Customers
cust_info_raw = pd.read_csv(r'/content/drive/MyDrive/CRM_Customer_Search_Algorithm/customer_info.csv')
# ['CustID', 'First Name', 'Middle Name', 'Last Name', 'Email', 'Phone', 'Address Line 1', 'Address Line 2', 'City', 'State', 'PinCode']

# cust_info = cust_info_raw[['first_name', 'last_name', 'email', 'phone', 'address', 'gender']]
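# Take an explicit copy so that adding the 'full_name' column below does not raise SettingWithCopyWarning
cust_info = cust_info_raw[['CustID', 'first_name', 'last_name', 'address']].copy()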

print('**************************************************************************')
print(cust_info.shape, cust_info.columns)

### ###############################################################################
# 1 b. Derive Customer's Full Name & Prepare List of Customer's Full Name
cust_info['full_name']  = cust_info.loc[:, 'first_name'] + ' ' + cust_info.loc[:, 'last_name']
list_fullName = cust_info['full_name'].to_list()
print('*************************************************************************')
print('list_fullName: ', len(list_fullName), '\n', list_fullName)

### ###############################################################################
# 1 c. Derive Customer's Full Address & Prepare List of Customer's Full Address
list_fullAddr = cust_info['address'].to_list()
print('*************************************************************************')
print('list_fullAddr: ', len(list_fullAddr), '\n', list_fullAddr)

### ###############################################################################
# 1 d. Dimensions of Customer's Dataset
print('**************************************************************************')
print(cust_info.shape, cust_info.columns)
print('**************************************************************************')
cust_info.head(6)

**************************************************************************
(1000, 4) Index(['CustID', 'first_name', 'last_name', 'address'], dtype='object')
*************************************************************************
list_fullName:  1000 
 ['Joseph Rice', 'Gary Moore', 'John Walker', 'Eric Carter', 'William Jackson', 'Nicole Jones', 'David Davis', 'Jason Montgomery', 'Kent Weaver', 'Darrell Dillon', 'Jacqueline Wang', 'Jodi Gonzalez', 'Omar Martin', 'Nicole Lara', 'Jason Brown', 'David Benson', 'Kimberly Mora', 'Erik Macias', 'James Johnson', 'Derek Peterson', 'Diane Henson', 'Cody Lyons', 'Margaret Hardy', 'Arthur Hayes', 'Raymond Taylor', 'Tina Moore', 'Kylie White', 'Jonathan Manning', 'Angela Bryant', 'Michael Mann', 'Craig Myers', 'Tiffany Carter', 'Erica Owens', 'Patricia Parker', 'Brian Palmer', 'Melissa Smith', 'Connor Adams', 'Michael Williams', 'Ruth Smith', 'Jonathan Middleton', 'Francis Velez', 'Matthew Velez', 'Kim Shaw', 'Gary Jones', 'Nicholas Clayton', 'An

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  cust_info['full_name']  = cust_info.loc[:, 'first_name'] + ' ' + cust_info.loc[:, 'last_name']


Unnamed: 0,CustID,first_name,last_name,address,full_name
0,10010,Joseph,Rice,"91773 Miller Shoal\r\nDiaztown, FL 38841",Joseph Rice
1,10011,Gary,Moore,"6450 John Lodge\r\nTerriton, KY 95945",Gary Moore
2,10012,John,Walker,"27265 Murray Island\r\nKevinfort, PA 63231",John Walker
3,10013,Eric,Carter,USNS Knight\r\nFPO AA 76532,Eric Carter
4,10014,William,Jackson,"170 Jackson Loaf\r\nKristenland, AS 48876",William Jackson
5,10015,Nicole,Jones,"14354 Baker Harbor Apt. 017\r\nEricville, HI 1...",Nicole Jones


#### 3. Assign Name & Address Values for the Customer to be searched

In [None]:
### ###############################################################################
# 1 c. Assign Customer's Full Name that needs to be Searched and Matched against Existing Customers
# search_firstname = input('Enter First Name of the Customer you want to Search: ')
# search_fullName = 'Caroline Green'
search_fullName = 'sandhya yenugula'
print('**************************************************************************')
print("Name to Search: '", search_fullName, "'")

search_fullAddress = r'16623 Jordan Estate Suite 664 East Stephanie, CT 22452'
print('**************************************************************************')
print("Address to be searched if Name Match Found: '", search_fullAddress, "'")
print('**************************************************************************')

**************************************************************************
Name to Search: ' sandhya yenugula '
**************************************************************************
Address to be searched if Name Match Found: ' 16623 Jordan Estate Suite 664 East Stephanie, CT 22452 '


#### 4. Method to Obtain 'Soundex Phonetic Hash Code'
Call this method with the "Customer Full Name & Address" strings, both for the customer being searched and for each name and address in the dataset.
It splits the string into tokens and returns one hash code per token, which can then be compared for matching.

In [None]:
### ###############################################################################
# 2. a. Find if the Customer already Exists using "Soundex Phonetic Hash Code"
### Method to Generate Soundex Phonetic Hash Code
def soundex_Phonetic_HashCode(search_string):

  lst_stringHashCode = []

  for token in search_string.split():

    # print('**************************************************************************')
    # print('token === ', token)
    # print('lst_stringHashCode === ', lst_stringHashCode)
    ### Convert Token Characters to Upper Case for Hash Code Comparison
    token = token.upper()

    token_hash_code = ''

    # Retain the First Letter
    token_hash_code += token[0]

    # Create Dictionary for Letter & Soundex Phonetic Hash Code Mapping.
    # 'H', 'W', 'Y' & Vowels are mapped to '.' and skipped,
    # because, unlike Consonants, they contribute little to the phonetic distinction of a Word

    dict_hashCode = {"BFPV": "1", "CGJKQSXZ": "2",
                    "DT": "3",
                    "L": "4", "MN": "5", "R": "6",
                    "AEIOUHWY": "."}

    # Derive Soundex Phonetic Hash Code for Characters from 2nd till Last
    for char in token[1:]:
      for key in dict_hashCode.keys():
        if char in key:
          hash_code = dict_hashCode[key]
          # print('CHAR === ', char, 'KEY === ', key, 'hash_code === ', hash_code)
          if hash_code != '.' and hash_code != token_hash_code[-1]:
            token_hash_code += hash_code

    # Trim or Pad to keep the Hash_code 7 Digits to find Matching Tokens
    token_hash_code = token_hash_code[:7].ljust(7, '0')
    # print('token_hash_code === ', token_hash_code)

    lst_stringHashCode.append(token_hash_code)
    # print('lst_stringHashCode === ', lst_stringHashCode)

  return lst_stringHashCode

In [None]:
# lst_stringHashCode = soundex_Phonetic_HashCode('i am learning phonetic matching')
lst_searchFullName_HashCode = soundex_Phonetic_HashCode(search_fullName)
print('lst_searchFullName_HashCode === ', lst_searchFullName_HashCode)

lst_searchFullName_HashCode ===  ['S530000', 'Y524000']
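
To see why phonetic hash codes tolerate spelling differences, here is a minimal standalone sketch that restates the same letter-to-code mapping in compact form. The name `phonetic_code` and the `width` parameter are hypothetical and exist only for this illustration; the sketch is not meant to replace the method above.

```python
# Letter groups and codes copied from the mapping used in this notebook.
CODE_GROUPS = {"BFPV": "1", "CGJKQSXZ": "2", "DT": "3", "L": "4", "MN": "5", "R": "6"}
LETTER_TO_CODE = {letter: code for group, code in CODE_GROUPS.items() for letter in group}

def phonetic_code(token, width=7):
    """Hypothetical helper: hash one non-empty token to a fixed-width code."""
    token = token.upper()
    code = token[0]  # the first character is always retained as-is
    for char in token[1:]:
        digit = LETTER_TO_CODE.get(char)  # vowels, H, W, Y, digits and punctuation map to None
        if digit is not None and digit != code[-1]:
            code += digit  # append unless it repeats the previously appended code
    return code[:width].ljust(width, "0")

print(phonetic_code("Smith"), phonetic_code("Smyth"))          # S530000 S530000
print(phonetic_code("Jon"), phonetic_code("John"))             # J500000 J500000
print(phonetic_code("Catherine"), phonetic_code("Katherine"))  # C365000 K365000
print(phonetic_code("16623"))                                  # 1000000
```

Two limitations are visible here. Because the first letter is kept verbatim, "Catherine" and "Katherine" get different codes. Numeric tokens keep only their first digit, so any two house numbers or PIN codes that start with the same digit hash to the same code.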


In [None]:
list_CustFullName_HashCode = []
for fullName in list_fullName:
  fullName_HashCode = soundex_Phonetic_HashCode(fullName)
  list_CustFullName_HashCode.append(fullName_HashCode)

print('list_CustFullName_HashCode === ', list_CustFullName_HashCode)


list_CustFullName_HashCode ===  [['J210000', 'R200000'], ['G600000', 'M600000'], ['J500000', 'W426000'], ['E620000', 'C636000'], ['W450000', 'J250000'], ['N240000', 'J520000'], ['D130000', 'D120000'], ['J250000', 'M532560'], ['K530000', 'W160000'], ['D640000', 'D450000'], ['J245000', 'W520000'], ['J300000', 'G524200'], ['O560000', 'M635000'], ['N240000', 'L600000'], ['J250000', 'B650000'], ['D130000', 'B525000'], ['K516400', 'M600000'], ['E620000', 'M200000'], ['J520000', 'J525000'], ['D620000', 'P362500'], ['D500000', 'H525000'], ['C300000', 'L520000'], ['M626300', 'H630000'], ['A636000', 'H200000'], ['R530000', 'T460000'], ['T500000', 'M600000'], ['K400000', 'W300000'], ['J535000', 'M520000'], ['A524000', 'B653000'], ['M240000', 'M500000'], ['C620000', 'M620000'], ['T150000', 'C636000'], ['E620000', 'O520000'], ['P362000', 'P626000'], ['B650000', 'P456000'], ['M420000', 'S530000'], ['C560000', 'A352000'], ['M240000', 'W452000'], ['R300000', 'S530000'], ['J535000', 'M343500'], ['F6520
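
On the performance requirement: the hash codes for existing customers only need to be computed once, so one option is to index them in a dictionary and answer exact phonetic matches without scanning the whole list. The following is a sketch under that assumption. The helper names `build_hash_index` and `lookup_exact` are hypothetical, and the sample data is the first five entries of the output above.

```python
from collections import defaultdict

# First five [first-name, last-name] hash-code pairs, copied from the output above.
sample_hash_codes = [
    ['J210000', 'R200000'],  # Joseph Rice
    ['G600000', 'M600000'],  # Gary Moore
    ['J500000', 'W426000'],  # John Walker
    ['E620000', 'C636000'],  # Eric Carter
    ['W450000', 'J250000'],  # William Jackson
]

def build_hash_index(hash_code_lists):
    """Map each (first, last) hash-code pair to the list positions that share it."""
    index = defaultdict(list)
    for position, codes in enumerate(hash_code_lists):
        index[tuple(codes[:2])].append(position)
    return index

def lookup_exact(index, search_codes):
    """Return positions whose first two hash codes equal the search codes exactly."""
    return index.get(tuple(search_codes[:2]), [])

hash_index = build_hash_index(sample_hash_codes)
print(lookup_exact(hash_index, ['J500000', 'W426000']))  # [2]
print(lookup_exact(hash_index, ['S530000', 'Y524000']))  # []
```

This only finds customers whose phonetic codes are identical to the search codes. A similarity threshold below 100 would still need a scan of the remaining candidates, so the index is best seen as a fast first pass.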

In [None]:
list_searchFullAddr_HashCode = soundex_Phonetic_HashCode(search_fullAddress)
print('searchFullAddr_HashCode === ', list_searchFullAddr_HashCode)

searchFullAddr_HashCode ===  ['1000000', 'J635000', 'E230000', 'S300000', '6000000', 'E230000', 'S315000', 'C300000', '2000000']


In [None]:
list_CustFullAddr_HashCode = []
for address in list_fullAddr:
  fullAddr_HashCode = soundex_Phonetic_HashCode(address)
  list_CustFullAddr_HashCode.append(fullAddr_HashCode)

print('list_CustFullAddr_HashCode === ', list_CustFullAddr_HashCode)


list_CustFullAddr_HashCode ===  [['9000000', 'M460000', 'S400000', 'D235000', 'F400000', '3000000'], ['6000000', 'J500000', 'L320000', 'T635000', 'K000000', '9000000'], ['2000000', 'M600000', 'I245300', 'K151630', 'P000000', '6000000'], ['U252000', 'K523000', 'F100000', 'A000000', '7000000'], ['1000000', 'J250000', 'L100000', 'K623545', 'A200000', '4000000'], ['1000000', 'B260000', 'H616000', 'A130000', '0000000', 'E621400', 'H000000', '1000000'], ['0000000', 'K365000', 'M400000', 'J523500', 'D200000', '2000000'], ['1000000', 'S230000', 'L100000', 'A130000', '7000000', 'P630000', 'A240000', 'N000000', '3000000'], ['6000000', 'M324000', 'B620000', 'V236150', 'K200000', '6000000'], ['P200000', '7000000', 'B200000', '9000000', 'A100000', 'A000000', '4000000'], ['1000000', 'S363000', 'C610000', 'S300000', '2000000', 'S300000', 'C565000', 'N300000', '2000000'], ['3000000', 'J525000', 'O140000', 'S300000', 'S320000', 'R000000', '7000000'], ['U530000', '9000000', 'B200000', '4000000', 'D10000

#### 5. First Find whether the New Customer's Name Matches the Name of any Existing Customer (the Address is Checked in the Next Step)
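
The two scorers named in this step differ in how they treat tokens. The sketch below approximates both ideas using only the standard library's `difflib`. It is a simplified illustration, not FuzzyWuzzy's actual implementation, and the function names `ratio`, `token_sort_score` and `token_set_score` are hypothetical.

```python
from difflib import SequenceMatcher

def ratio(a, b):
    """Similarity of two strings on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

def token_sort_score(a, b):
    """Sort the tokens of each string alphabetically, then compare the rejoined strings."""
    return ratio(" ".join(sorted(a.lower().split())), " ".join(sorted(b.lower().split())))

def token_set_score(a, b):
    """Split tokens into intersection and remainders, then take the best pairwise score."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    common = " ".join(sorted(tokens_a & tokens_b))
    combined_a = (common + " " + " ".join(sorted(tokens_a - tokens_b))).strip()
    combined_b = (common + " " + " ".join(sorted(tokens_b - tokens_a))).strip()
    return max(ratio(common, combined_a), ratio(common, combined_b), ratio(combined_a, combined_b))

print(token_sort_score("john walker", "walker john"))         # 100
print(token_set_score("john walker", "john michael walker"))  # 100
print(token_sort_score("john walker", "john michael walker"))
print(token_set_score("S530000", "J210000"))
```

The set-based score ignores the extra middle name because the shared tokens match completely, while the sort-based score drops below 100. When each input is a single hash code, as in the next cell, there is no intersection unless the codes are identical, so the set-based score reduces to a plain character-level comparison of the two codes.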

In [None]:
### ###############################################################################
# Find if the Customer already Exists using fuzz.token_set_ratio()
# It tokenizes each string but, instead of immediately comparing the sorted tokens,
# splits the tokens into two groups: intersection and remainder.
# These groups are then used for string comparison

# # Find if the Customer already Exists using Fuzz.token_sort_ratio
# # Tokenizes each String and sorts in alphabetical order, then compares both strings and returns the Similarity Score

search_found_index = None  # None means no Matching Customer Name has been found

for cust_index, custFullName_hashCode in enumerate(list_CustFullName_HashCode):
  # fuzzy_matches = process.extract(lst_searchFullName_HashCode, list_CustFullName_HashCode, scorer=fuzz.token_sort_ratio, limit=3)

  firstName_Similarity = fuzz.token_set_ratio(lst_searchFullName_HashCode[0], custFullName_hashCode[0])
  lastName_Similarity = fuzz.token_set_ratio(lst_searchFullName_HashCode[1], custFullName_hashCode[1])

  if (firstName_Similarity >= 90) and (lastName_Similarity >= 90):
    search_found_index = cust_index
    print("New Customer Name Match found in Existing Customer Names using 'fuzz.token_set_ratio'")
    print('New Customer Name === ', search_fullName, 'found at Index === ', search_found_index)
    print('First Name & Last Name Similarity is === ', firstName_Similarity, lastName_Similarity)
    break  # Stop at the First Match instead of giving up on the first Non-Matching Customer

# Report "not found" only after every Customer has been compared
if search_found_index is None:
  print('Similarity < 90, Customer does not seem to be found in the dataset!')


Similarity < 90, Customer does not seem to be found in the dataset!


#### 6. Let's also Check whether the Address of the Matched Customer Name is Similar

In [None]:
# Find Address of the Matched-Customer-Name and compare it using fuzz.partial_token_set_ratio
# (None is used as the "not found" marker, so an index of 0 is still treated as a valid match)
if search_found_index is not None:

  custFullAddr_hashCode = list_CustFullAddr_HashCode[search_found_index]
  str_CustFullAdrr_hashCode = ' '.join(custFullAddr_hashCode)
  print('custAddr_hashCode === ', str_CustFullAdrr_hashCode)

  str_searchFullAddr_HashCode = ' '.join(list_searchFullAddr_HashCode)
  print('str_searchFullAddr_HashCode === ', str_searchFullAddr_HashCode)

  addr_similarity = fuzz.partial_token_set_ratio(str_CustFullAdrr_hashCode, str_searchFullAddr_HashCode)
  print('addr_similarity === ', addr_similarity)

  if (addr_similarity >= 90):
    print("New Customer's Address also Found Matching along with Customer's Name using 'fuzz.partial_token_set_ratio'")
    print('Matching Similarity === ', addr_similarity)
    print('\n This Customer we searched seems to be Already Existing! \n Please check further Critical Information before adding the Customer!')
  else:
    print('Address does not seem to be matching!')
else:
  print('Name Similarity < 90, Customer does not seem to be found in the dataset!')


Name Similarity < 90, Customer does not seem to be found in the dataset!
