### Fuzzy matching demo
The `fuzzywuzzy` package
- Uses Levenshtein distance to measure the similarity between two strings
- Library: [link](https://anaconda.org/jkroes/fuzzywuzzy)
- Reference code: [link](https://medium.com/better-programming/fuzzy-string-matching-with-python-cafeff0d29fe)

In [19]:
from fuzzywuzzy import process, fuzz
import pandas as pd

#### Simple ratio
The `ratio` method compares the two strings in full and returns a similarity score from 0 to 100 based on the Levenshtein distance between them:

In [20]:
# Simple ratio
String_Matched = fuzz.ratio('Hello World', 'Hello World!')
print("String Matched:",String_Matched)
String_Matched = fuzz.ratio('Hello World', 'Hello world')
print("String Matched:",String_Matched)
String_Matched = fuzz.ratio('Hello world', 'Hello world')
print("String Matched:",String_Matched)

String Matched: 96
String Matched: 91
String Matched: 100
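
To make these scores concrete, here is a minimal sketch of one way such a ratio can be computed: 2·M/T, where M is the number of matching characters and T is the combined length of both strings. This is an illustration under that assumption, not fuzzywuzzy's actual implementation; `lcs_length` and `simple_ratio` are hypothetical helper names.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def simple_ratio(a: str, b: str) -> int:
    """Similarity score 0-100 computed as 2*M/T (hypothetical helper)."""
    total = len(a) + len(b)
    if total == 0:
        return 100  # two empty strings are treated as identical
    return round(100 * 2 * lcs_length(a, b) / total)


print(simple_ratio('Hello World', 'Hello World!'))  # 96
print(simple_ratio('Hello World', 'Hello world'))   # 91
print(simple_ratio('Hello world', 'Hello world'))   # 100
```

One extra character costs little (11 of 23 characters still match on each side), while a single changed letter removes a matching character from both strings at once, which is why the case difference scores lower.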


#### Partial ratio
The `partial_ratio` method uses “optimal partial” matching. Given a shorter string of length k and a longer string of length m, it scores the shorter string against the best-matching substring of length k in the longer string:

In [23]:
# Partial ratio
Str_Partial_Match = fuzz.partial_ratio('Hello World', 'Hello World!')
print("String Matched:",Str_Partial_Match)
Str_Partial_Match = fuzz.partial_ratio('Hello World', 'Hello world')
print("String Matched:",Str_Partial_Match)

String Matched: 100
String Matched: 91
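
The idea can be sketched as a brute-force sliding window: compare the shorter string with every same-length slice of the longer one and keep the best score. This is a simplified illustration using the standard library's `difflib`, not the search strategy fuzzywuzzy itself uses; `sliding_partial_ratio` is a hypothetical name.

```python
from difflib import SequenceMatcher


def sliding_partial_ratio(a: str, b: str) -> int:
    """Best 0-100 score of the shorter string against any same-length window."""
    short, long_ = (a, b) if len(a) <= len(b) else (b, a)
    k = len(short)
    best = 0.0
    # Slide a window of length k across the longer string
    for start in range(len(long_) - k + 1):
        window = long_[start:start + k]
        best = max(best, SequenceMatcher(None, short, window).ratio())
    return round(100 * best)


print(sliding_partial_ratio('Hello World', 'Hello World!'))  # 100
print(sliding_partial_ratio('Hello World', 'Hello world'))   # 91
```

When both strings have the same length there is only one window, so the result matches the simple ratio.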


#### Token sort ratio
The `token_sort_ratio` method lowercases each string, splits it into tokens, and sorts the tokens alphabetically. The simple `ratio` method is then applied to the rejoined strings, so word order does not affect the score:

In [24]:
# Token sort ratio
Str_Sort_Match = fuzz.token_sort_ratio('Hello World', 'Hello wrld')
print("String Matched:", Str_Sort_Match)
Str_Sort_Match = fuzz.token_sort_ratio('Hello World', 'world Hello')
print("String Matched:", Str_Sort_Match)

String Matched: 95
String Matched: 100
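
A minimal sketch of the sort-then-compare idea, using only the standard library. It assumes tokens are separated by whitespace and skips the punctuation clean-up that fuzzywuzzy performs; `token_sort_sketch` is a hypothetical name.

```python
from difflib import SequenceMatcher


def token_sort_sketch(a: str, b: str) -> int:
    """Lowercase, sort the tokens, rejoin, then score 0-100."""
    sorted_a = ' '.join(sorted(a.lower().split()))
    sorted_b = ' '.join(sorted(b.lower().split()))
    return round(100 * SequenceMatcher(None, sorted_a, sorted_b).ratio())


# Same words in a different order and case compare as identical
print(token_sort_sketch('Hello World', 'world Hello'))  # 100
# A misspelled token still lowers the score
print(token_sort_sketch('Hello World', 'Hello wrld'))   # 95
```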


#### Token set ratio
The `token_set_ratio` method ignores duplicate words. It is similar to `token_sort_ratio` but more flexible: it extracts the tokens common to both strings, compares combinations of those shared tokens and each string's remaining tokens with `fuzz.ratio()`, and keeps the highest score:

In [None]:
# Token set ratio
String_Matched=fuzz.token_set_ratio('Hello World', 'Hello Hello world')
print(String_Matched)

In [156]:
raw = pd.read_csv('data/fuzzy_matching_test.csv')
tickers = pd.read_csv('data/nasdaq_screener_1609274234552.csv')
# Copy the selection so new columns can be added later without a SettingWithCopyWarning
df = raw[['stra', 'strb']].copy()

#### Example: Match tickers to a list of companies
`tickers` is a lookup table of company names and stock tickers. Our goal is to get the stock ticker for each company by fuzzy matching `stra` in dataframe `df` against the standard company names in `tickers`.

In [158]:
# Search the ticker list and return the most similar record
def fuzzy_match(a_str, df, b_str, b_return, topN=1):
    # Score every candidate in column b_str against a_str (case-insensitive);
    # assign() works on a copy, so the caller's dataframe is not modified
    scored = df.assign(score=df[b_str].apply(
        lambda b: fuzz.token_set_ratio(a_str.lower(), b.lower())))
    # Keep the topN records by similarity
    top_record = scored.nlargest(topN, 'score')[[b_str, b_return, 'score']]
    # Return only the best match as a Series
    return top_record.iloc[0]

In [166]:
print(fuzzy_match('facebook inc', tickers, 'Name' ,'Symbol'))

Name      Facebook Inc. Class A Common Stock
Symbol                                    FB
score                                    100
Name: 2397, dtype: object


In [167]:
df[['name', 'ticker', 'score']] = df.apply(
    lambda row: fuzzy_match(row['stra'], tickers, 'Name', 'Symbol'), axis=1)

In [168]:
df

| | stra | strb | name | ticker | score |
|---|---|---|---|---|---|
| 0 | apple inc | Apple | Apple Inc. Common Stock | AAPL | 100 |
| 1 | google inc | Google | New Gold Inc. | NGD | 73 |
| 2 | Hangzhou Alibaba | Alibaba | Alibaba Group Holding Limited American Deposit... | BABA | 61 |
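
The scores in this result vary a lot, and the `google inc` row was matched to New Gold Inc. with a score of 73, which looks like a false match. One possible follow-up is to route low-scoring matches to manual review. The sketch below uses the three rows above; the cutoff of 80 and the name `split_by_score` are assumptions for illustration, not values taken from the notebook.

```python
# Rows copied from the result above: (stra, matched ticker, score)
matches = [
    ("apple inc", "AAPL", 100),
    ("google inc", "NGD", 73),
    ("Hangzhou Alibaba", "BABA", 61),
]


def split_by_score(rows, threshold=80):
    """Separate matches at or above the threshold from those needing review."""
    accepted = [r for r in rows if r[2] >= threshold]
    review = [r for r in rows if r[2] < threshold]
    return accepted, review


accepted, review = split_by_score(matches)
print("accepted:", accepted)
print("needs review:", review)
```

A single cutoff is a blunt tool: here it would also send the Alibaba row to review even though BABA is the right ticker, so the threshold needs tuning against known-good pairs.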
