### Fuzzy matching demo
Fuzzywuzzy package
- Uses Levenshtein distance to measure the similarity between two strings
- Library: [link](https://anaconda.org/jkroes/fuzzywuzzy)
- Reference code: [link](https://medium.com/better-programming/fuzzy-string-matching-with-python-cafeff0d29fe)

In [42]:
from fuzzywuzzy import process, fuzz
import pandas as pd

#### Simple ratio
The `ratio` method compares the two strings in full and returns a Levenshtein-based similarity score from 0 (nothing in common) to 100 (identical). The comparison is case-sensitive:

In [20]:
# Simple ratio
String_Matched = fuzz.ratio('Hello World', 'Hello World!')
print("String Matched:",String_Matched)
String_Matched = fuzz.ratio('Hello World', 'Hello world')
print("String Matched:",String_Matched)
String_Matched = fuzz.ratio('Hello world', 'Hello world')
print("String Matched:",String_Matched)

String Matched: 96
String Matched: 91
String Matched: 100
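
To make the scores above less of a black box, here is a minimal pure-Python sketch of a Levenshtein-style ratio. It assumes the common convention in which a substitution costs 2 (one deletion plus one insertion) and the score is `(len(a) + len(b) - distance) / (len(a) + len(b))`. The function name `levenshtein_ratio` is hypothetical and this is not the library's own code.

```python
def levenshtein_ratio(a: str, b: str) -> int:
    """Similarity score in [0, 100]; assumes a substitution costs 2 edits."""
    # prev[j] = edit distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into an empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if ca == cb else 2)
            curr.append(min(prev[j] + 1,      # delete ca
                            curr[j - 1] + 1,  # insert cb
                            substitution))
        prev = curr
    total = len(a) + len(b)
    if total == 0:
        return 100  # two empty strings are treated as identical here
    return round(100 * (total - prev[-1]) / total)


print(levenshtein_ratio('Hello World', 'Hello World!'))  # 96
print(levenshtein_ratio('Hello World', 'Hello world'))   # 91
print(levenshtein_ratio('Hello world', 'Hello world'))   # 100
```

Under these assumptions the sketch reproduces the three scores shown above: one extra character costs 1 edit out of 23 characters, while one changed letter costs 2 edits out of 22.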


#### Partial ratio
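The `partial_ratio` method uses “optimal partial” logic. Given a shorter string of length k and a longer string of length m, it scores the shorter string against the best-matching length-k substring of the longer one, so extra leading or trailing text in the longer string is not penalized: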

In [23]:
# Partial ratio
Str_Partial_Match = fuzz.partial_ratio('Hello World', 'Hello World!')
print("String Matched:",Str_Partial_Match)
Str_Partial_Match = fuzz.partial_ratio('Hello World', 'Hello world')
print("String Matched:",Str_Partial_Match)

String Matched: 100
String Matched: 91
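
The "best substring" idea can be sketched with the standard library alone. The version below is a simplification: it brute-forces every window of the shorter string's length, which may not be how the library chooses its candidate windows. It also uses `difflib.SequenceMatcher` as a stand-in scorer. The names `ratio` and `partial_ratio_sketch` are hypothetical.

```python
from difflib import SequenceMatcher


def ratio(a: str, b: str) -> int:
    """Stand-in for the simple ratio: 2 * matches / total length, scaled to 0-100."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def partial_ratio_sketch(a: str, b: str) -> int:
    """Score the shorter string against every same-length window of the longer one."""
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    k = len(shorter)
    # Slide a window of length k across the longer string
    windows = (longer[i:i + k] for i in range(len(longer) - k + 1))
    return max(ratio(shorter, window) for window in windows)


print(partial_ratio_sketch('Hello World', 'Hello World!'))  # 100
print(partial_ratio_sketch('Hello World', 'Hello world'))   # 91
```

The first pair scores 100 because `'Hello World'` appears verbatim inside `'Hello World!'`. The second pair has equal lengths, so there is only one window and the result equals the simple ratio.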


#### Token sort ratio
The `token_sort_ratio` method lowercases each string, splits it into tokens, and sorts the tokens alphabetically. The simple `ratio` method is then applied to the rejoined strings, so word order and letter case do not affect the score:

In [24]:
# Token sort ratio
Str_Sort_Match = fuzz.token_sort_ratio('Hello World', 'Hello wrld')
print("String Matched:",Str_Sort_Match)
Str_Sort_Match = fuzz.token_sort_ratio('Hello World', 'world Hello')
print("String Matched:",Str_Sort_Match)

String Matched: 95
String Matched: 100
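
A rough stdlib-only sketch of the sort-then-compare idea follows. It assumes that preprocessing is limited to lowercasing and whitespace splitting (the library's default preprocessing also strips punctuation, which is omitted here) and uses `difflib.SequenceMatcher` as a stand-in for the simple ratio. The name `token_sort_sketch` is hypothetical.

```python
from difflib import SequenceMatcher


def token_sort_sketch(a: str, b: str) -> int:
    """Lowercase, sort the tokens alphabetically, then compare the rejoined strings."""
    def normalize(s: str) -> str:
        return " ".join(sorted(s.lower().split()))

    matcher = SequenceMatcher(None, normalize(a), normalize(b))
    return round(100 * matcher.ratio())


print(token_sort_sketch('Hello World', 'Hello wrld'))   # 95
print(token_sort_sketch('Hello World', 'world Hello'))  # 100
```

Both strings in the second pair normalize to `'hello world'`, which is why reordering the words costs nothing. The missing letter in the first pair is still penalized.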


#### Token set ratio
The `token_set_ratio` method ignores duplicate words. It is similar to `token_sort_ratio` but more flexible: it separates the tokens the two strings share from the tokens unique to each string, applies `fuzz.ratio()` to the resulting combinations, and keeps the best score:

In [None]:
# Token set ratio
String_Matched=fuzz.token_set_ratio('Hello World', 'Hello Hello world')
print(String_Matched)
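
The set-based logic is easier to follow when written out. The sketch below assumes lowercasing and whitespace splitting as the only preprocessing and uses `difflib.SequenceMatcher` as a stand-in for `fuzz.ratio()`. The names `ratio` and `token_set_sketch` are hypothetical.

```python
from difflib import SequenceMatcher


def ratio(a: str, b: str) -> int:
    """Stand-in for the simple ratio, scaled to 0-100."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def token_set_sketch(a: str, b: str) -> int:
    """Compare the shared tokens against 'shared + leftover' strings; keep the best score."""
    tokens_a = set(a.lower().split())  # sets drop duplicate words
    tokens_b = set(b.lower().split())
    common = " ".join(sorted(tokens_a & tokens_b))
    only_a = " ".join(sorted(tokens_a - tokens_b))
    only_b = " ".join(sorted(tokens_b - tokens_a))
    combined_a = (common + " " + only_a).strip()
    combined_b = (common + " " + only_b).strip()
    return max(ratio(common, combined_a),
               ratio(common, combined_b),
               ratio(combined_a, combined_b))


print(token_set_sketch('Hello World', 'Hello Hello world'))  # 100
print(token_set_sketch('Big 5 Sporting Goods',
                       'Big 5 Sporting Goods Corporation Common Stock'))  # 100
```

When every token of one string also appears in the other, the shared-token string equals one of the combined strings and the score is 100. That makes this scorer forgiving of extra words such as a "Common Stock" suffix.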

In [58]:

# Source of tickers: https://www.nasdaq.com/market-activity/stocks/screener?exchange=NASDAQ&render=download
#raw = pd.read_csv('data/fuzzy_matching_test.csv')
tickers = pd.read_csv('data/nasdaq_screener_1609274234552.csv')
df = pd.read_csv('data/company_list.csv')

In [60]:
df.head()

Unnamed: 0,company,val
0,Abercrombie & Fitch,0
1,Aritzia,0
2,At Home Group Inc.,0
3,Betterware,0
4,Big 5 Sporting Goods,0


#### Example: Match tickers to a list of companies
`tickers` is a DataFrame of standard company names (`Name`) and stock tickers (`Symbol`). Our goal is to get the stock ticker for each company by fuzzy matching the `company` column of dataframe `df` against the standard names in `tickers`.

In [10]:
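# Search the ticker list and return the single most similar record
def fuzzy_match(a_str, df, b_str, b_return, topN = 1):
    # Score every candidate name against the query, ignoring case.
    # Note: this adds (or overwrites) a 'score' column on the df passed in.
    df['score'] = df.apply(lambda row : fuzz.token_set_ratio(a_str.lower(),
                                row[b_str].lower()), axis = 1)
    # Keep the topN rows by similarity
    top_record = df.nlargest(topN, 'score')[[b_str, b_return, 'score']]

    # Return only the best match, as a Series
    return top_record.iloc[0]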

In [22]:
print(fuzzy_match('Big 5 Sporting Goods', tickers, 'Name' ,'Symbol'))

Name      Big 5 Sporting Goods Corporation Common Stock
Symbol                                             BGFV
score                                               100
Name: 856, dtype: object


In [61]:
df[['name','ticker','score']] = df.apply(lambda row : fuzzy_match(row['company'], tickers, 'Name' ,'Symbol'), axis = 1)

In [62]:
df

Unnamed: 0,company,val,name,ticker,score
0,Abercrombie & Fitch,0,Abercrombie & Fitch Company Common Stock,ANF,100
1,Aritzia,0,Alight Inc.,ALIT,47
2,At Home Group Inc.,0,At Home Group Inc. Common Stock,HOME,100
3,Betterware,0,Betterware de Mexico S.A.B. de C.V. Ordinary S...,BWMX,100
4,Big 5 Sporting Goods,0,Big 5 Sporting Goods Corporation Common Stock,BGFV,100
5,Big Lots,0,Big Lots Inc. Common Stock,BIG,100
6,"Boot Barn Holdings, Inc.",0,Boot Barn Holdings Inc. Common Stock,BOOT,100
7,Caleres,0,Caleres Inc. Common Stock,CAL,100
8,Citi Trends,0,Citi Trends Inc. Common Stock,CTRN,100
9,"Crocs, Inc.",0,Crocs Inc. Common Stock,CROX,100
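
Row 1 above shows a limitation: the best candidate is always returned, even when its score is low (Aritzia is paired with Alight Inc. at 47). One possible safeguard is a minimum-score cutoff. The sketch below is stdlib-only and illustrative: the threshold of 80 is an arbitrary assumption, `token_set_score` and `best_match` are hypothetical helpers, and `listings` is a three-entry sample taken from the table above rather than the full ticker file.

```python
from difflib import SequenceMatcher


def token_set_score(a: str, b: str) -> int:
    """Simplified token-set scorer (lowercase + whitespace split only)."""
    def ratio(x: str, y: str) -> int:
        return round(100 * SequenceMatcher(None, x, y).ratio())

    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    common = " ".join(sorted(tokens_a & tokens_b))
    combined_a = (common + " " + " ".join(sorted(tokens_a - tokens_b))).strip()
    combined_b = (common + " " + " ".join(sorted(tokens_b - tokens_a))).strip()
    return max(ratio(common, combined_a), ratio(common, combined_b),
               ratio(combined_a, combined_b))


def best_match(query: str, choices: dict, min_score: int = 80):
    """Return (name, ticker, score) for the best candidate, or None if below min_score."""
    scored = [(token_set_score(query, name), name, ticker)
              for name, ticker in choices.items()]
    score, name, ticker = max(scored)
    return (name, ticker, score) if score >= min_score else None


# Small sample of listings copied from the results table (not the full file)
listings = {
    "Alight Inc.": "ALIT",
    "Big Lots Inc. Common Stock": "BIG",
    "Caleres Inc. Common Stock": "CAL",
}

print(best_match("Big Lots", listings))  # strong match is kept
print(best_match("Aritzia", listings))   # weak best match is rejected (None)
```

Rows that come back as `None` can then be reviewed by hand instead of silently receiving the wrong ticker. The right cutoff depends on the data and would need tuning.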
