This notebook returns offers for the product categories that Fetch has in its database.

Approach:

If a product category has associated offers in the final DataFrame (created by merging brands_df, category_df, and retail_df), we return all non-NaN offers for that category. If a product category has only NaN offers, we follow a two-step process:

First, create a dictionary whose keys are the product categories with non-NaN offers and whose values are the sets of products associated with each of those categories.

Second, calculate the Jaccard similarity between the set of products for the category with all NaN offers and the set of products for each category with non-NaN offers. We then order those categories by similarity score and return the offers from the categories with a non-zero score, most similar first.
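
The similarity step can be illustrated with a small, self-contained sketch. The category names and product sets below are made up for illustration and are not taken from the Fetch data, and `jaccard` and `rank_by_jaccard` are hypothetical helper names rather than functions defined in this notebook.

```python
def jaccard(a, b):
    """Size of the intersection divided by size of the union; 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def rank_by_jaccard(target_products, candidates):
    """Return (category, score) pairs with score > 0, most similar first."""
    scores = {name: jaccard(target_products, products) for name, products in candidates.items()}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(name, score) for name, score in ranked if score > 0]


# Hypothetical data: one category without offers and two categories with offers
no_offer_products = {"chips", "crackers"}
with_offer_products = {
    "snacks": {"chips", "crackers", "cookies", "candy"},
    "beverages": {"coffee", "tea"},
}

print(rank_by_jaccard(no_offer_products, with_offer_products))
```

In this toy case only "snacks" shares products with the target set, so it is the only category kept after the zero-score filter.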

In addition, if a user searches for a product category that is not present in the DataFrame, we use BERT embeddings to represent the input. We then calculate the cosine similarity between this embedding and the embedding of every product category in the DataFrame that has associated offers, and return the most similar categories (the top 5 by default) together with their offers and similarity scores.
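
The ranking by cosine similarity can be sketched without a model. The vectors below are hypothetical 3-dimensional stand-ins (the hidden states of bert-base-uncased have 768 dimensions), and `cosine_similarity` and `top_n_similar` are illustrative helpers written for this sketch, not the scikit-learn function used later in the notebook.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_n_similar(query_vec, category_vecs, top_n=5):
    """Score every category against the query and keep the top_n highest scores."""
    scores = {name: cosine_similarity(query_vec, vec) for name, vec in category_vecs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_n]


# Hypothetical embeddings, chosen only to make the arithmetic easy to follow
category_vecs = {
    "snacks": [1.0, 0.0, 0.0],
    "beverages": [0.0, 1.0, 0.0],
    "dairy": [1.0, 1.0, 0.0],
}
query_vec = [2.0, 0.0, 0.0]

ranked = top_n_similar(query_vec, category_vecs, top_n=2)
print(ranked)
```

Cosine similarity ignores vector length, so the query `[2.0, 0.0, 0.0]` matches "snacks" exactly even though its magnitude differs.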

In [1]:
#Import library
import numpy as np
import pandas as pd

In [2]:
#Mount drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
#Load dataset
brands_df = pd.read_csv('/content/drive/MyDrive/fetch/brand_category.csv')
category_df = pd.read_csv('/content/drive/MyDrive/fetch/categories.csv')
retail_df = pd.read_csv('/content/drive/MyDrive/fetch/offer_retailer.csv')

In [4]:
#brands_df
brands_df['BRAND'] = brands_df['BRAND'].str.lower()
brands_df['BRAND_BELONGS_TO_CATEGORY'] = brands_df['BRAND_BELONGS_TO_CATEGORY'].str.lower()

In [5]:
#categories_df
category_df['PRODUCT_CATEGORY'] = category_df['PRODUCT_CATEGORY'].str.lower()
category_df['IS_CHILD_CATEGORY_TO'] = category_df['IS_CHILD_CATEGORY_TO'].str.lower()

In [6]:
#retail_df
retail_df['RETAILER'] = retail_df['RETAILER'].str.lower()
retail_df['BRAND'] = retail_df['BRAND'].str.lower()

In [7]:
# Left-join brands_df to category_df on the category name, keeping only the required columns
merged_df = brands_df.merge(category_df[['PRODUCT_CATEGORY', 'IS_CHILD_CATEGORY_TO']],
                            left_on='BRAND_BELONGS_TO_CATEGORY',
                            right_on='PRODUCT_CATEGORY',
                            how='left')

# Drop the columns that are not needed
merged_df.drop(columns=['PRODUCT_CATEGORY', 'RECEIPTS'], inplace=True)

# Rename the new column if needed
merged_df.rename(columns={'IS_CHILD_CATEGORY_TO': 'Product_category'}, inplace=True)
merged_df.rename(columns={'BRAND_BELONGS_TO_CATEGORY': 'Product'}, inplace=True)


# Save the merged dataframe to a new CSV file if needed
merged_df.to_csv('merged_brands.csv', index=False)

In [8]:
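# Normalize this brand name, presumably so that it matches the spelling in retail_df for the merge on 'BRAND'
merged_df['BRAND'] = merged_df['BRAND'].replace({'caseys gen store': 'caseys general store'})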

In [9]:
# Merge the dataframes based on the common 'BRAND' column using a left join
merged_with_offers_df = merged_df.merge(retail_df, on='BRAND', how='left')

# Save the merged dataframe with offers to a new CSV file if needed
merged_with_offers_df.to_csv('merged_with_offers.csv', index=False)

In [10]:
ordered_df = merged_with_offers_df[['RETAILER', 'BRAND', 'Product', 'Product_category', 'OFFER']]

In [12]:
import pandas as pd

# Create a list to store categories with all NaN offers
categories_with_all_nan_offers = []

# Create a list to store categories with at least one non-NaN offer
categories_with_non_nan_offers = []

# Iterate over the unique product categories
for category in ordered_df['Product_category'].unique():
    category_offers = ordered_df[ordered_df['Product_category'] == category]['OFFER']

    # A category goes in the first list only if every one of its offers is NaN
    if category_offers.isna().all():
        categories_with_all_nan_offers.append(category)
    else:
        categories_with_non_nan_offers.append(category)

# Now, categories_with_all_nan_offers contains categories with all NaN offers
# categories_with_non_nan_offers contains categories with at least one non-NaN offer

print("Categories with All NaN Offers:")
print(categories_with_all_nan_offers)

print("\nCategories with Non-NaN Offers:")
print(categories_with_non_nan_offers)


Categories with All NaN Offers:
['baby & toddler', 'puffed snacks', 'oral care', 'spirits', 'home & garden']

Categories with Non-NaN Offers:
['mature', 'health & wellness', 'deli & bakery', 'beverages', 'beauty', 'snacks', 'household supplies', 'alcohol', 'dairy', 'sports drinks & enhanced waters', 'pantry', 'frozen', 'pasta & noodles', 'frozen meat', 'candy', 'pasta sauce', 'animals & pet supplies', 'meat & seafood']


In [13]:

# Create a dictionary to store categories with at least one non-NaN offer and their associated products
categories_with_non_nan_offers_dict = {}

# Iterate over the unique product categories
for category in ordered_df['Product_category'].unique():
    category_offers = ordered_df[ordered_df['Product_category'] == category]['OFFER']

    # Check if the category has at least one non-NaN offer
    if not all(pd.isna(offer) for offer in category_offers):
        # Get the associated products for the category
        associated_products = ordered_df[ordered_df['Product_category'] == category]['Product'].unique()

        # Add the category and its associated products to the dictionary
        categories_with_non_nan_offers_dict[category] = associated_products

# Now, categories_with_non_nan_offers_dict contains categories with at least one non-NaN offer as keys
# and their associated products as values

print("Categories with Non-NaN Offers and Their Associated Products:")
print(categories_with_non_nan_offers_dict)

Categories with Non-NaN Offers and Their Associated Products:
{'mature': array(['tobacco products', 'mature'], dtype=object), 'health & wellness': array(['hair removal', 'bath & body', 'medicines & treatments',
       'oral care', 'deodorant & antiperspirant', 'hair care',
       'feminine hygeine', 'skin care', 'adult incontinence', 'first aid',
       'foot care', 'eye care', 'sexual health'], dtype=object), 'deli & bakery': array(['bakery', 'prepared meals', 'leafy salads', 'deli counter'],
      dtype=object), 'beverages': array(['carbonated soft drinks', 'coffee', 'fruit juices', 'tea', 'water',
       'energy drinks', 'drink mixes', 'meal replacement beverages',
       'vegetable juices'], dtype=object), 'beauty': array(['body fragrances', 'makeup', 'nail care', 'cosmetic tools'],
      dtype=object), 'snacks': array(['dips & salsa', 'snack cakes', 'candy', 'chips', 'puffed snacks',
       'nuts & seeds', 'fruit & vegetable snacks', 'pretzels', 'crackers',
       'cookies', 'jerk

In [14]:
# Create an empty dictionary to store product categories and their associated sets of unique products
category_product_dict = {}

# Create a dictionary to keep track of offers for each category
category_offers = {}

# Iterate through the rows in the dataframe
for _, row in ordered_df.iterrows():
    product_category = row['Product_category']
    product = row['Product']
    offer = row['OFFER']

    # Check if an offer exists for the product category (ignore rows where offer is NaN)
    if not pd.isna(offer):
        # Check if the product category has an entry in the dictionary
        if product_category not in category_product_dict:
            category_product_dict[product_category] = set()  # Use a set to store unique products

        # Add the product to the product category's set of unique products
        category_product_dict[product_category].add(product)

    # Keep track of offers for each category
    if product_category not in category_offers:
        category_offers[product_category] = []

    # Add the offer to the category's list of offers
    category_offers[product_category].append(offer)

# Filter categories where all offers are NaN
categories_with_nan_offers = [category for category, offers in category_offers.items() if all(pd.isna(offer) for offer in offers)]

print("Category Product Dictionary:")
print(category_product_dict)

print("\nCategories with All NaN Offers:")
print(categories_with_nan_offers)


Category Product Dictionary:
{'mature': {'mature', 'tobacco products'}, 'beverages': {'drink mixes', 'coffee', 'water', 'carbonated soft drinks', 'fruit juices', 'tea', 'energy drinks', 'meal replacement beverages'}, 'health & wellness': {'sexual health', 'hair care', 'oral care', 'medicines & treatments', 'hair removal', 'deodorant & antiperspirant', 'skin care', 'bath & body'}, 'snacks': {'nuts & seeds', 'chips', 'dips & salsa', 'candy', 'crackers', 'pudding & gelatin', 'trail mix', 'snack cakes', 'cookies', 'puffed snacks', 'jerky & dried meat', 'fruit & vegetable snacks'}, 'deli & bakery': {'leafy salads', 'bakery', 'prepared meals', 'deli counter'}, 'sports drinks & enhanced waters': {'sports drinks'}, 'alcohol': {'malt beverages', 'wine', 'hard seltzers, sodas, waters, lemonades & teas', 'beer', 'spirits'}, 'pantry': {'cooking & baking', 'pasta & noodles', 'sauces & marinades', 'packaged meat', 'bread', 'cereal, granola, & toaster pastries', 'packaged vegetables', 'nut butters & 

In [15]:
def get_similar_categories_and_offers(category_name):
    # Check if the category is in category_product_dict
    if category_name in category_product_dict:
        # If the category has non-NaN offers, return them
        category_offers = ordered_df[(ordered_df['Product_category'] == category_name) & (~ordered_df['OFFER'].isna())]['OFFER'].unique()
        return [(category_name, category_offers)]  # Just return the actual offers

    # Category not found in category_product_dict, find similar categories
    similarity_scores = {}

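    # Get the set of products associated with the category. Every offer for this category
    # is NaN at this point, so we must not filter on OFFER: doing so would always produce
    # an empty set and a Jaccard similarity of 0 against every other category.
    products_for_category = set(ordered_df[ordered_df['Product_category'] == category_name]['Product'].unique())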

    for other_category, other_products in category_product_dict.items():
        # Skip the same category
        if other_category == category_name:
            continue

        # Calculate Jaccard similarity between the two categories
        intersection = len(products_for_category.intersection(other_products))
        union = len(products_for_category.union(other_products))

        # Handle cases where both categories have no associated products (all NaN offers)
        if union == 0:
            similarity = 0.0
        else:
            similarity = intersection / union

        similarity_scores[other_category] = similarity

    # Sort similar categories by similarity score in descending order
    sorted_similar_categories = sorted(similarity_scores.items(), key=lambda x: x[1], reverse=True)

    # Collect the non-NaN offers of each similar category, skipping categories with a similarity score of 0
    similar_category_offers = [
        (similar_category,
         ordered_df[(ordered_df['Product_category'] == similar_category) & (~ordered_df['OFFER'].isna())]['OFFER'].unique())
        for similar_category, similarity_score in sorted_similar_categories
        if similarity_score != 0
    ]

    # If there are similar category offers, return them
    if similar_category_offers:
        return similar_category_offers

    # If there are no similar category offers, return "No offers"
    return [(category_name, ["No offers"])]


In [16]:
# Example usage:
category_name = 'snacks'  # Replace with the desired category name
similar_offers = get_similar_categories_and_offers(category_name)
print("Similar Offers for Category:", category_name)
for similar_category, offers in similar_offers:
    print(similar_category, offers)

Similar Offers for Category: snacks
snacks ["M&M'S®, select sizes, buy 1"
 "M&M'S® chocolate candies, select varieties"
 'SNICKERS®, select sizes, buy 1'
 'SNICKERS® chocolate candy bar, select varieties'
 'Little Bites® Spend $10 at Walmart®' 'Tostitos® Toppers™'
 "Order from Casey's app or Caseys.com" "Spend $25 at Casey's"
 "Spend $5 in-store at Casey's"
 "Select beverages AND prepared food items at Casey's"
 "Visit OR order online from Casey's 7 times"
 "Spend $5 on single-serve prepared food items at Casey's"
 "12 Pack OR 2 Liter AND Whole Pizza Pie at Casey's"
 "Whole Pizza at Casey's" "Wings OR Cheesy Breadsticks at Casey's"
 "Whole Pizza Pie at Casey's" "Frozen OR Fountain Drink at Casey's"
 "Spend $10 in-store at Casey's"
 "Fresh bakery item, select varieties, at Casey's"
 "Whole pizza at Casey's" "12 Pack OR 2 Liter AND Whole Pizza at Casey's"
 "Frozen OR Fountain Drink at Casey's "
 "Visit OR order online from Casey's 3 times"
 "Visit OR order online from Casey's 5 times"
 "

In [19]:
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Define the BERT model and tokenizer
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)


In [18]:
!pip install transformers


In [20]:
# Define the function to compute category similarity and get offers
def get_similar_categories_and_offers_with_similarity(input_category, top_n=5):
    # Tokenize and encode the input category
    input_tokens = tokenizer(input_category, return_tensors='pt', padding=True, truncation=True)

    # Get the BERT embeddings for the input category
    with torch.no_grad():
        input_category_embedding = model(**input_tokens).last_hidden_state.mean(dim=1)  # Mean pooling

    # Calculate cosine similarity with all categories that have offers
    category_similarities = {}
    for category in ordered_df[~ordered_df['OFFER'].isna()]['Product_category'].unique():
        category_tokens = tokenizer(category, return_tensors='pt', padding=True, truncation=True)
        # Disable gradient tracking here as well; the embeddings are only used for inference
        with torch.no_grad():
            category_embedding = model(**category_tokens).last_hidden_state.mean(dim=1)
        similarity = cosine_similarity(input_category_embedding.detach().numpy(), category_embedding.detach().numpy())
        category_similarities[category] = similarity[0][0]

    # Sort categories by similarity score in descending order
    sorted_similar_categories = sorted(category_similarities.items(), key=lambda x: x[1], reverse=True)

    # Select the top N similar categories
    top_similar_categories = sorted_similar_categories[:top_n]

    # Retrieve and return the offers and their associated similarity scores for the selected similar categories
    similar_category_offers = []
    for similar_category, similarity_score in top_similar_categories:
        offers = ordered_df[(ordered_df['Product_category'] == similar_category) & (~ordered_df['OFFER'].isna())]['OFFER'].unique()
        similar_category_offers.append((similar_category, offers.tolist(), similarity_score))

    # Return the list even when it is empty (rather than a message string), so callers
    # can always iterate over (category, offers, similarity_score) tuples
    return similar_category_offers
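
As written, the function re-embeds every category on each call. One possible refinement, sketched below under assumptions, is to compute the category embeddings once and reuse them for every query. `embed_text` is a hypothetical stand-in (a letter-count vector) so that the sketch runs without downloading a model; in this notebook it would be replaced by the tokenizer and model call with mean pooling. `CategoryIndex` and the category list are likewise illustrative names, not part of the original code.

```python
import math
from collections import Counter
from string import ascii_lowercase


def embed_text(text):
    """Placeholder embedding: counts of each letter a-z. This is NOT a BERT embedding."""
    counts = Counter(ch for ch in text.lower() if ch in ascii_lowercase)
    return [float(counts[ch]) for ch in ascii_lowercase]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


class CategoryIndex:
    """Embeds each category once at construction and reuses the vectors per query."""

    def __init__(self, categories, embed=embed_text):
        self.embed = embed
        self.vectors = {name: embed(name) for name in categories}

    def most_similar(self, query, top_n=5):
        query_vec = self.embed(query)
        scores = [(name, cosine(query_vec, vec)) for name, vec in self.vectors.items()]
        return sorted(scores, key=lambda item: item[1], reverse=True)[:top_n]


# Hypothetical category list
index = CategoryIndex(["snacks", "beverages", "dairy"])
print(index.most_similar("snack", top_n=1))
```

With this layout, each query costs one embedding call plus a cheap similarity pass, instead of one model call per category.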

In [21]:
# Example usage:
input_category = 'chips'  # Replace with the desired input category
similar_offers = get_similar_categories_and_offers_with_similarity(input_category)
for similar_category, offers, similarity_score in similar_offers:
    print(f"Similar Category: {similar_category}")
    print(f"Similarity Score: {similarity_score}")
    print(f"Offers: {', '.join(offers)}\n")

Similar Category: snacks
Similarity Score: 0.9346911311149597
Offers: M&M'S®, select sizes, buy 1, M&M'S® chocolate candies, select varieties, SNICKERS®, select sizes, buy 1, SNICKERS® chocolate candy bar, select varieties, Little Bites® Spend $10 at Walmart®, Tostitos® Toppers™, Order from Casey's app or Caseys.com, Spend $25 at Casey's, Spend $5 in-store at Casey's, Select beverages AND prepared food items at Casey's, Visit OR order online from Casey's 7 times, Spend $5 on single-serve prepared food items at Casey's, 12 Pack OR 2 Liter AND Whole Pizza Pie at Casey's, Whole Pizza at Casey's, Wings OR Cheesy Breadsticks at Casey's, Whole Pizza Pie at Casey's, Frozen OR Fountain Drink at Casey's, Spend $10 in-store at Casey's, Fresh bakery item, select varieties, at Casey's, Whole pizza at Casey's, 12 Pack OR 2 Liter AND Whole Pizza at Casey's, Frozen OR Fountain Drink at Casey's , Visit OR order online from Casey's 3 times, Visit OR order online from Casey's 5 times, 12 pack OR 2 liter

In [22]:
# Define the final function that delegates to the appropriate function
def get_similar_categories_and_offers_final(category_name, top_n=5):
    # Check if the category exists in the ordered_df
    if category_name in ordered_df['Product_category'].unique():
        # Call the first function for existing categories
        similar_offers = get_similar_categories_and_offers(category_name)
        for similar_category, offers in similar_offers:
            yield similar_category, offers, None
    else:
        # Call the second function for non-existing categories
        similar_offers = get_similar_categories_and_offers_with_similarity(category_name, top_n)
        for similar_category, offers, similarity_score in similar_offers:
            yield similar_category, offers, similarity_score
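
Because the function uses `yield`, calling it returns a generator, and neither branch runs until the result is iterated. A minimal sketch of that behaviour follows; `lookup_known`, `lookup_unknown`, and `dispatch` are hypothetical stubs standing in for the notebook functions, and the offers are placeholder strings.

```python
calls = []  # records which branch actually ran


def lookup_known(name):
    calls.append("known")
    return [(name, ["offer A"])]


def lookup_unknown(name, top_n):
    calls.append("unknown")
    return [("closest match", ["offer B"], 0.9)][:top_n]


def dispatch(name, known_categories, top_n=5):
    """Yield (category, offers, score) triples; score is None for known categories."""
    if name in known_categories:
        for category, offers in lookup_known(name):
            yield category, offers, None
    else:
        for category, offers, score in lookup_unknown(name, top_n):
            yield category, offers, score


results = dispatch("snacks", {"snacks", "dairy"})
print(calls)              # still empty: the generator body has not started
results = list(results)   # iterating is what triggers the lookup
print(calls, results)
```

If eager evaluation is preferred, wrapping the call in `list(...)` as above, or returning a list instead of yielding, makes any error surface at call time.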



In [23]:
# Example usage:
input_category = 'food'  # Replace with the desired input category
similar_offers = get_similar_categories_and_offers_final(input_category)
for similar_category, offers, similarity_score in similar_offers:
    print(f"Similar Category: {similar_category}")
    if similarity_score is not None:
        print(f"Similarity Score: {similarity_score}")
    print(f"Offers: {', '.join(offers)}\n")

Similar Category: snacks
Similarity Score: 0.9291396141052246
Offers: M&M'S®, select sizes, buy 1, M&M'S® chocolate candies, select varieties, SNICKERS®, select sizes, buy 1, SNICKERS® chocolate candy bar, select varieties, Little Bites® Spend $10 at Walmart®, Tostitos® Toppers™, Order from Casey's app or Caseys.com, Spend $25 at Casey's, Spend $5 in-store at Casey's, Select beverages AND prepared food items at Casey's, Visit OR order online from Casey's 7 times, Spend $5 on single-serve prepared food items at Casey's, 12 Pack OR 2 Liter AND Whole Pizza Pie at Casey's, Whole Pizza at Casey's, Wings OR Cheesy Breadsticks at Casey's, Whole Pizza Pie at Casey's, Frozen OR Fountain Drink at Casey's, Spend $10 in-store at Casey's, Fresh bakery item, select varieties, at Casey's, Whole pizza at Casey's, 12 Pack OR 2 Liter AND Whole Pizza at Casey's, Frozen OR Fountain Drink at Casey's , Visit OR order online from Casey's 3 times, Visit OR order online from Casey's 5 times, 12 pack OR 2 liter