# Fetch Project: Similarity Search

Prepared and presented by Chen Zhang

## 1. Data Cleaning

The first step is data cleaning. A key part of processing language data is removing noise so that the patterns in the data are easier for a model to detect.

Text data contains a lot of noise, often in the form of special characters, which are difficult for a computer to interpret. We therefore process the data to remove them.

It is also important to pay attention to casing. If both upper-case and lower-case versions of the same word are present, the computer treats them as different entities even though they mean the same thing.

One example is shown in the snapshot of the offer_retailer.csv file below. Note that some items start with a colon character, and that the 2nd and 3rd rows differ only by a capital letter 'L'.

![image.png](attachment:image.png)
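The effect of the two cleaning steps can be sketched on a pair of made-up strings. This is only an illustration: the helper name `normalize` and the sample strings are hypothetical and are not taken from the data set.

```python
import re

def normalize(text):
    """Lowercase the text, then drop everything except a-z, 0-9 and spaces."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower())

# Hypothetical strings that differ only in casing and special characters
a = ":Spend $10 on L'Oreal Paris"
b = "spend $10 on l'oreal paris"

print(normalize(a))
print(normalize(a) == normalize(b))
```

After normalization both strings become identical, so a later duplicate check treats them as the same offer.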

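The code below performs these steps: it lowercases every column, strips special characters from the OFFER column, and removes duplicate offers. The cleaned text replaces the original values in the returned DataFrame; the CSV files themselves are not modified. The output is shown below the code.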



In [89]:
import numpy as np
import pandas as pd
import re

def clean(file): 
    """Read a CSV as strings, lowercase every column, and clean the offer text."""
    df = pd.read_csv(file, dtype = str)

    # change to lower case
    for col in df.columns.values: 
        df[col] = df[col].str.lower()
    
    if file == 'offer_retailer.csv': 
        # remove special characters in the offer column
        df['OFFER'] = df['OFFER'].str.replace(r"[^a-z0-9 ]", "", regex=True)
        # remove duplicate rows
        df.drop_duplicates(subset=['OFFER'], inplace = True)
    return df

The three datasets are shown below after cleaning. Note that a few rows in the offer_retailer file are removed as duplicates. Missing values in the 'RETAILER' column are filled with empty strings. Some columns are also renamed to make the later joins easier.

In [90]:
# Offer-retailer table
offer = clean('offer_retailer.csv')
# fill missing values (e.g. offers without a retailer) with empty strings
offer.fillna('', inplace=True)
print(offer.head())
print(offer.shape)

                                               OFFER            RETAILER  \
0       spend 50 on a fullpriced new club membership           sams club   
1           beyond meat plantbased products spend 25                       
2           good humor viennetta frozen vanilla cake                       
3  butterball select varieties spend 10 at dillon...  dillons food store   
4  gatorade fast twitch 12ounce 12 pack at amazon...              amazon   

         BRAND  
0    sams club  
1  beyond meat  
2   good humor  
3   butterball  
4     gatorade  
(369, 3)


In [91]:
brand = clean('brand_category.csv')
brand.rename(columns={"BRAND_BELONGS_TO_CATEGORY": "CATEGORY"}, inplace=True)
# note: special characters such as '&' are left in place in CATEGORY
print(brand.head())
print(brand.shape)

              BRAND          CATEGORY RECEIPTS
0  caseys gen store  tobacco products  2950931
1  caseys gen store            mature  2859240
2            equate      hair removal   893268
3         palmolive       bath & body   542562
4              dawn       bath & body   301844
(9906, 3)


In [92]:
category = clean('categories.csv')
category.rename(columns={"PRODUCT_CATEGORY": "CATEGORY", 
                      "IS_CHILD_CATEGORY_TO": "PARENT_CATEGORY"}, inplace=True)
# note: special characters such as '&' are left in place in CATEGORY and PARENT_CATEGORY

category.drop(columns=['CATEGORY_ID'], inplace = True)

print(category.head())
print(category.shape)

                      CATEGORY    PARENT_CATEGORY
0              red pasta sauce        pasta sauce
1  alfredo & white pasta sauce        pasta sauce
2             cooking & baking             pantry
3             packaged seafood             pantry
4             feminine hygeine  health & wellness
(118, 2)


## 2. Table Merge

I join all tables into one main table so that they are easier to handle.

In [93]:
# join the brand and category tables with pandas merge, which defaults to an inner join (as in SQL)
brand_cat = brand.merge(category, on = 'CATEGORY')

print(f'After joining the category (shape {category.shape}) and brand table (shape {brand.shape}), the output has shape of {brand_cat.shape}')
print(brand_cat.head())

After joining the category (shape (118, 2)) and brand table (shape (9906, 3)), the output has shape of (9906, 4)
               BRAND          CATEGORY RECEIPTS    PARENT_CATEGORY
0   caseys gen store  tobacco products  2950931             mature
1  rj reynolds vapor  tobacco products       21             mature
2   caseys gen store            mature  2859240             mature
3             equate      hair removal   893268  health & wellness
4           barbasol      hair removal   283926  health & wellness


In [94]:
# inner join of the offers with the brand/category table on the brand name
full = offer.merge(brand_cat, on = 'BRAND')

# rearrange columns
cols = ['OFFER', 'RETAILER', 'BRAND', 'CATEGORY', 'PARENT_CATEGORY', 'RECEIPTS']
full = full[cols]
print(full)
print(full.shape)

                                                 OFFER  \
0             beyond meat plantbased products spend 25   
1             beyond meat plantbased products spend 25   
2             beyond meat plantbased products spend 25   
3    beyond steak plantbased seared tips 10 ounce a...   
4    beyond steak plantbased seared tips 10 ounce a...   
5    beyond steak plantbased seared tips 10 ounce a...   
6    beyond steak plantbased seared tips 10 ounce b...   
7    beyond steak plantbased seared tips 10 ounce b...   
8    beyond steak plantbased seared tips 10 ounce b...   
9    beyond steak plantbased seared tips 10 ounce a...   
10   beyond steak plantbased seared tips 10 ounce a...   
11   beyond steak plantbased seared tips 10 ounce a...   
12   beyond steak plantbased seared tips 10 ounce b...   
13   beyond steak plantbased seared tips 10 ounce b...   
14   beyond steak plantbased seared tips 10 ounce b...   
15            beyond meat plantbased products spend 15   
16            

It is worth noting that the main table has duplicate rows for categories that belong to more than one parent category. For example, frozen pizza falls under both the 'frozen' and 'pantry' parent categories, which makes sense. I also use an inner join to keep the search simple, so rows without a match in the other table are dropped. With the main table ready, I move on to similarity search on the input query.
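The two join effects mentioned above (row duplication and dropped rows) can be reproduced on toy tables. This is a minimal sketch; the brand names and the 'unlisted category' value are hypothetical and only serve to make the behaviour visible.

```python
import pandas as pd

# Hypothetical toy tables; the values are illustrative, not from the data set
brand = pd.DataFrame({
    "BRAND": ["brand a", "brand b"],
    "CATEGORY": ["frozen pizza", "unlisted category"],
})
category = pd.DataFrame({
    "CATEGORY": ["frozen pizza", "frozen pizza"],
    "PARENT_CATEGORY": ["frozen", "pantry"],
})

# merge() performs an inner join unless `how` says otherwise
merged = brand.merge(category, on="CATEGORY")
print(merged)
print(merged.shape)
```

'brand a' appears twice, once per parent category, while 'brand b' disappears because its category has no match.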

## 3. Similarity Search Based on Fuzzy String - Baseline Method

Several criteria are tested: ratio, partial ratio, token sort ratio, token set ratio, and partial token sort ratio. One example is shown below using 'meat' as the input.

In [95]:
from thefuzz import fuzz

name = "meat"
similarity = category.copy()

# map each column label to the thefuzz scorer that produces it
scorers = {
    'Ratio': fuzz.ratio,
    'Partial Ratio': fuzz.partial_ratio,
    'Token Sort Ratio': fuzz.token_sort_ratio,
    'Token Set Ratio': fuzz.token_set_ratio,
    'Partial Token Sort Ratio': fuzz.partial_token_sort_ratio,
}

# score every category against the query with each scorer
for criteria, scorer in scorers.items():
    similarity[criteria] = similarity['CATEGORY'].apply(lambda val: scorer(name, val))

for criteria in scorers: 
    print(f'Top 10 matched category using {criteria}')
    print(similarity.nlargest(10, criteria)[['CATEGORY', criteria]])
    print()

Top 10 matched category using Ratio
             CATEGORY  Ratio
112            mature     60
52                tea     57
90      packaged meat     47
6               cream     44
51              bread     44
71              water     44
23         condiments     43
72   plant-based meat     40
74        fresh pasta     40
92             makeup     40

Top 10 matched category using Partial Ratio
                       CATEGORY  Partial Ratio
47           jerky & dried meat            100
72             plant-based meat            100
90                packaged meat            100
105     frozen plant-based meat            100
16   meal replacement beverages             75
19                 frozen meals             75
23                   condiments             75
24       packaged meals & sides             75
35                sexual health             75
36               malt beverages             75

Top 10 matched category using Token Sort Ratio
               CATEGORY  Token Sort

Although the first few top matches contain the word 'meat', the remaining matches do not make much sense. This is because fuzzy matching is based on edit distance rather than meaning. For example, 'meat' and 'meal' differ by only one letter, so their edit distance is 1.
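To make the point concrete, here is a small standard-library sketch. `levenshtein` is a textbook dynamic-programming edit distance, and `char_ratio` uses `difflib.SequenceMatcher` as a stand-in for a character-overlap score on a 0 to 100 scale. Both helper names are hypothetical, and thefuzz's own scorers may be implemented differently.

```python
from difflib import SequenceMatcher

def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def char_ratio(a, b):
    """Character-overlap score, 0-100 (difflib stand-in, not thefuzz itself)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

print(levenshtein("meat", "meal"))
for cat in ["mature", "tea", "packaged meat"]:
    print(cat, char_ratio("meat", cat))
```

Under this character-level view, 'mature' scores higher than 'packaged meat' simply because it is short and shares the letters m, a and t with the query, which is the kind of ranking seen in the Ratio output above.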

## 4. Similarity Search Using Transformer - Improved Method

Text and sentence similarity is one of the clearest examples of how powerful transformer models can be. Here the NLP solution takes some query text, creates an embedding with a pre-trained model, and computes the similarity between the query and the target field (e.g. the category column). Several well-established models are evaluated, including 'all-MiniLM-L6-v2', 'all-mpnet-base-v2', 'all-distilroberta-v1', and 'bert-base-nli-mean-tokens'. 'bert-base-nli-mean-tokens' is selected for its robustness and high performance.

'bert-base-nli-mean-tokens' is a sentence-transformers model that maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search. Using 'meat' as an example query for the category search, the model clearly generates more meaningful results than the previous fuzzy search approach.
The full model architecture is shown below.

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 
                'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 
                'pooling_mode_mean_sqrt_len_tokens': False})
)
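The Pooling layer printed above has `pooling_mode_mean_tokens` set to True, which means the sentence vector is the average of the token vectors. A minimal NumPy sketch of that step follows, assuming padded positions are excluded through an attention mask; the function name, array names and toy values are hypothetical.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sentence, ignoring padded positions.

    token_embeddings: (sentences, tokens, dim)
    attention_mask:   (sentences, tokens), 1 for real tokens and 0 for padding
    """
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # guard against division by zero
    return summed / counts

# One sentence, three token slots, embedding dimension 2; the last slot is padding
tokens = np.array([[[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]])
mask = np.array([[1, 1, 0]])
pooled = mean_pool(tokens, mask)
print(pooled)
```

The padded token (9, 9) does not influence the result, which is the mean of the first two token vectors.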


First, a model is created using sentence_transformers. 

In [96]:
from sentence_transformers import SentenceTransformer
model_name = 'bert-base-nli-mean-tokens'
# model_name = 'all-MiniLM-L6-v2'
# model_name = 'all-mpnet-base-v2'

model = SentenceTransformer(model_name)

Second, a sentence library is created from the desired field. In this example, the library is the CATEGORY column of the category table.

In [97]:
catDict = category['CATEGORY'].to_list()
catDict_vecs = model.encode(catDict)
print(catDict_vecs.shape)

(118, 768)


In [98]:
catDict_vecs

array([[-0.41283208,  0.19538327, -0.7930067 , ..., -0.17972049,
         0.6738818 ,  0.5252664 ],
       [-0.34384283,  0.20766301, -0.21792002, ..., -0.37926593,
         0.02805232,  0.63666254],
       [ 0.54358596,  0.8228612 ,  1.6034012 , ..., -0.3250906 ,
         0.2267657 ,  0.13139722],
       ...,
       [-0.51006913,  1.0867724 , -0.1981626 , ..., -0.18807282,
         0.77340156,  0.3120817 ],
       [-0.18643278,  0.87110186,  0.6632734 , ..., -0.37362155,
        -0.26959875,  0.02258943],
       [ 0.26655945,  0.2495401 ,  1.5600967 , ...,  0.7145732 ,
         0.29977995,  0.2982676 ]], dtype=float32)

Next, I apply cosine similarity to compute the pairwise similarity between the input string and every item in the category column.
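For reference, cosine similarity is the dot product of two vectors after each has been scaled to unit length. A minimal NumPy sketch on made-up two-dimensional vectors (not the real 768-dimensional embeddings) is given here; the function name `cosine_to_all` is hypothetical.

```python
import numpy as np

def cosine_to_all(query, matrix):
    """Cosine similarity between one query vector and every row of matrix."""
    q = query / np.linalg.norm(query)
    m = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return m @ q

query = np.array([1.0, 0.0])
rows = np.array([[2.0, 0.0],   # same direction as the query
                 [0.0, 3.0],   # orthogonal to the query
                 [1.0, 1.0]])  # 45 degrees from the query
scores = cosine_to_all(query, rows)
print(scores.round(4))
```

Vector length does not matter, only direction. The notebook itself relies on scikit-learn's `cosine_similarity`, as in the next cell.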

In [99]:
from sklearn.metrics.pairwise import cosine_similarity

query = model.encode(['meat'])
print(query.shape)
cos = cosine_similarity(query, catDict_vecs)
similarity = category.copy()
# cos has shape (1, number of categories); take the single row as the new column
similarity['Cosine'] = cos[0]

print(similarity.nlargest(20, 'Cosine')[['CATEGORY', 'Cosine']])

(1, 768)
                       CATEGORY    Cosine
90                packaged meat  0.899200
47           jerky & dried meat  0.781604
72             plant-based meat  0.755113
105     frozen plant-based meat  0.655589
115                 frozen beef  0.632474
62                 food storage  0.621770
73                         eggs  0.609283
84                 dog supplies  0.607903
30                 soup & broth  0.600675
114              frozen chicken  0.570397
24       packaged meals & sides  0.570185
23                   condiments  0.569826
86               prepared meals  0.569633
63                       cheese  0.568969
85                pickled goods  0.568233
64            frozen vegetables  0.564005
17                     pretzels  0.547607
16   meal replacement beverages  0.529368
113               frozen turkey  0.528729
51                        bread  0.522049


The top 20 matched categories make more sense than the fuzzy search results. Not only are the categories containing the word 'meat' identified, but most of the other high-scoring categories are also food related, such as eggs or frozen chicken. I will use this pre-trained model for the actual search tasks.

## 5. Search Tasks

### Task 1. Category Search

The previous experiments show that we can query a category name and get back similar categories. I extend the search to the full table and return all offers under those similar categories in order of relevance, i.e. by similarity score. The next cell defines the search function, which takes a list of queries and adds one similarity-score column per query.

In [250]:
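def categorySearch(queryLst): 
    """
    queryLst: a list of query strings to search in the category column
    return: a copy of the full table with one similarity-score column per query
    """
    catDict = full['CATEGORY'].to_list()
    catDict_vecs = model.encode(catDict)
    similarity = full.copy()
    query = model.encode(queryLst)
    # cos has shape (number of queries, number of rows in full)
    cos = cosine_similarity(query, catDict_vecs).round(4)
    for i, q in enumerate(queryLst): 
        similarity[q] = cos[i]
    return similarity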
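Because `categorySearch` encodes `full['CATEGORY']`, the same category string is embedded once per row it appears in. One possible refinement, sketched here with NumPy and a hypothetical stand-in encoder (`toy_encode` is not the real model), is to encode each distinct string once and expand the result back to row order.

```python
import numpy as np

def encode_with_cache(texts, encode):
    """Encode each distinct string once, then expand back to the original order."""
    uniq, inverse = np.unique(np.asarray(texts, dtype=str), return_inverse=True)
    vecs = np.asarray(encode(uniq.tolist()))
    return vecs[inverse]

calls = []

def toy_encode(batch):
    # Hypothetical stand-in for model.encode: records the batch size
    # and returns a small deterministic vector per string
    calls.append(len(batch))
    return np.array([[len(t), t.count("e")] for t in batch], dtype=float)

cats = ["coffee", "tea", "coffee", "coffee", "tea"]
vecs = encode_with_cache(cats, toy_encode)
print(vecs.shape, calls)
```

Five rows are returned, but the encoder only sees the two distinct strings. Whether this matters in practice depends on how many repeated categories the full table holds.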

In [174]:
queries = ['meat', 'coffee']
res = categorySearch(queries)

To display the result (stored in the variable 'res'), I pick the top N offers by similarity score, from high to low. Duplicated offers are removed, since an offer may belong to multiple categories. In 'res', the last columns are named after the query strings and hold the similarity scores; in the saved output this column is renamed to SCORE. The output is stored in a CSV file.

In [175]:
# Display the result, top N matched offer
def display(queryLst, df, N = 20): 
    # keep one row per offer (the first occurrence) without modifying the caller's table
    df = df.drop_duplicates(subset=['OFFER'])
    for q in queryLst: 
        a = df.nlargest(N, q)[['OFFER', 'RETAILER', 'BRAND', 'CATEGORY', q]]
        a = a.rename(columns={q: "SCORE"})
        # save the top N offers for this query
        fname = f'Category Search Top {N} {q}.csv'
        a.to_csv(fname)

In [176]:
display(queries, res)
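One detail to be aware of: `drop_duplicates` keeps the first row it sees for each offer, which is not necessarily the row with the highest score. A variation worth considering, assuming the intent is to keep each offer's best-scoring category, is to sort by score before removing duplicates. The toy table and scores here are hypothetical.

```python
import pandas as pd

# Hypothetical scores for illustration only
df = pd.DataFrame({
    "OFFER": ["offer x", "offer x", "offer y"],
    "CATEGORY": ["frozen", "pantry", "pantry"],
    "SCORE": [0.40, 0.90, 0.55],
})

# Sort first so that drop_duplicates (which keeps the first row per offer)
# retains the highest-scoring row, then take the top N (here N = 2)
best = (df.sort_values("SCORE", ascending=False)
          .drop_duplicates(subset=["OFFER"])
          .head(2))
print(best)
```

Without the sort, 'offer x' would be represented by its 'frozen' row with score 0.40 instead of its 'pantry' row with score 0.90.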

The top 20 offers returned by searching 'meat' in the category column are shown below.

![image-2.png](attachment:image-2.png)

The top 20 offers returned by searching 'coffee' in the category column are shown below.

![image.png](attachment:image.png)

### Task 2. Brand Search

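The brand search is similar to the category search. The difference is that I first search for the brand in the brand table, and then use the categories of the matched brands to retrieve offers through the category search.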

In [219]:
def brandSearch(query): 
    """
    query: a string input of search in brand
    return: a dataframe with the last column is the similarity score between the query and brand in each row
    """
    
    brandDict = brand['BRAND'].to_list()
    brandDict_vecs = model.encode(brandDict)
    similarity = brand.copy()
    q = model.encode([query])
    cos = cosine_similarity(q, brandDict_vecs).round(4)
    # cos has shape (1, number of brand rows); take the single row as the new column
    similarity[query] = cos[0]
    return similarity

In [258]:
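def brandToCategory(df, query, cutoff = 0.9): 
    """Return the categories (and scores) of brands whose similarity to query is >= cutoff."""
    # copy so the caller's table is left untouched
    df = df.copy()
    df['RECEIPTS'] = pd.to_numeric(df['RECEIPTS'])
    # best brand match first; ties broken by number of receipts
    sortres = df.sort_values(by=[query, 'RECEIPTS'], ascending=[False, False])
    match = sortres[sortres[query] >= cutoff]
    print(match)
    return match['CATEGORY'].to_list(), match[query].to_list()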

In [273]:
# Display the result, top N matched offer
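def displayBrand(df, queryLst,  scores, cutoff = 0.8, N = 20): 
    # note: cutoff is currently unused
    df = df.copy()
    df['RECEIPTS'] = pd.to_numeric(df['RECEIPTS'])
    # weight each category-similarity column by its brand-match score
    df[queryLst] = df[queryLst]*scores
    # an offer's score is its best weighted category similarity
    df['SCORE'] = df[queryLst].max(axis=1)
    df.drop_duplicates(subset=['OFFER'], inplace = True)
    df.drop(columns=queryLst, inplace = True)
    sortdf = df.sort_values(by=['SCORE', 'RECEIPTS'], ascending=[False, False])
    top = sortdf.iloc[:N]
    print(top)
    # save the top N offers (file name chosen here for illustration)
    fname = f'Brand Search Top {N}.csv'
    top.to_csv(fname)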
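The scoring inside `displayBrand` combines two similarities: each category-similarity column is multiplied by the brand-match score of that category, and an offer's final score is the largest of those products. A minimal NumPy sketch with made-up numbers (the array names are hypothetical):

```python
import numpy as np

# Rows are offers, columns are the categories matched for the brand (toy values)
cat_sim = np.array([[0.9, 0.2],
                    [0.4, 0.8],
                    [0.1, 0.3]])
# One brand-match score per matched category
brand_scores = np.array([1.0, 0.5])

weighted = cat_sim * brand_scores   # broadcasts the scores across the columns
final = weighted.max(axis=1)        # best weighted similarity per offer
print(final)
```

The second offer is most similar to the second category (0.8), but that category comes from a weaker brand match, so its weighted score drops to 0.4.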

In [256]:
brandQuery = 'Kroger'
res = brandSearch(brandQuery)

In [274]:
cats, b2catscore = brandToCategory(res, brandQuery)
n = len(cats)
res1 = categorySearch(cats)

       BRAND                             CATEGORY  RECEIPTS  Kroger
6     kroger                               bakery    251276     1.0
430   kroger                   household supplies      5255     1.0
473   kroger                                water      4823     1.0
554   kroger                         fruit juices      3985     1.0
601   kroger               carbonated soft drinks      3617     1.0
772   kroger  cereal, granola, & toaster pastries      2587     1.0
806   kroger                                chips      2439     1.0
1381  kroger                               coffee      1012     1.0
1770  kroger                      pasta & noodles       655     1.0
2079  kroger                          bath & body       475     1.0
2127  kroger                   sauces & marinades       452     1.0
2173  kroger                                bread       437     1.0
2658  kroger                     cooking & baking       300     1.0
2760  kroger                             cracker

In [275]:
# pass N by keyword; a fourth positional argument would be read as cutoff
displayBrand(res1, cats, b2catscore, N = 50)

                                                 OFFER             RETAILER  \
68   dr pepper regular or zero sugar strawberries a...  united supermarkets   
677  mtn dew kickstart 16ounce 12 count select vari...               amazon   
329  pepsico beverage 75ounce 10 pack select variet...               amazon   
330  pepsico variety pack select varieties at amazo...               amazon   
388    dove hand wash select varieties buy 2 at target               target   
392          dove hand wash select varieties at target               target   
713                   cheerios oat crunch berry cereal                        
505                     artesano buns buy 2 at walmart              walmart   
695                      glad trash bags 4 or 8 gallon                        
699             glad forceflex max strength trash bags                        
683                                ballpark buns buy 2                        
79                                 barilla pesto sau


## 6. References

Sentence-Transformer library: 
https://www.sbert.net/docs/usage/semantic_textual_similarity.html

bert-base-nli-mean-tokens model: 
https://huggingface.co/sentence-transformers/bert-base-nli-mean-tokens

all-MiniLM-L6-v2 model: 
https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2

cosine similarity: 
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html#sklearn.metrics.pairwise.cosine_similarity

