<h1 style="color: blue; font-style: italic; font-family: sans-serif; text-align: center;">Revolutionizing Recommendations as a Personalization Strategy:<p style="color:brown; text-align: center;">Semantic Transformer</p></h1>

### [Article: Revolutionizing Recommendations as a Personalization Strategy: Semantic Transformer](https://medium.com/@shukla.shankar.ravi/revolutionizing-recommendations-as-a-personalization-strategy-semantic-transformer-05d7ba545a45)

# Semantic Transformer

In [1]:
import os
import pandas as pd
import random
import shutil  # Required for removing non-empty directories

import torch
from transformers import BertTokenizer, BertModel
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_core.documents import Document


import warnings
warnings.filterwarnings("ignore")
random.seed(9973)



## Step 1. Initialize Chroma DB

In [2]:
# Path where ChromaDB is stored
chroma_db_path = './chroma_db'

# Function to drop ChromaDB
def reset_chroma_db(chroma_db_path):
    """
    Deletes the existing ChromaDB directory to reset the vector database.

    Args:
        chroma_db_path (str): Filesystem path to the ChromaDB directory.

    Behavior:
        - If the directory exists, it is removed entirely.
        - If it doesn't exist, informs the user that there's nothing to remove.

    Edge Cases:
        - Handles case where the directory does not exist (avoids error).
        - Uses shutil.rmtree instead of os.rmdir to handle non-empty directories.
    """
    
    # Remove ChromaDB directory if it exists to reset the database
    if os.path.exists(chroma_db_path):
        print("Dropping ChromaDB...")
        # shutil.rmtree removes non-empty directories; os.rmdir only works on empty ones
        shutil.rmtree(chroma_db_path)
    else:
        print("ChromaDB not found, starting fresh.")

# Call the function to reset the ChromaDB before initializing a new vector store
reset_chroma_db(chroma_db_path)

ChromaDB not found, starting fresh.


## Step 2. Load the datasets

#### Note: Data generated using association rule mining

[Refer: Revolutionizing Recommendations as a Personalization Strategy: Traditional Approaches](https://github.com/rs-shukla/Revolutionizing-Recommendations-as-a-Personalization-Strategy/blob/main/TraditionalApproaches-AssociationRuleMining(ARM).ipynb)

In [3]:
# Load CSV
fields = ['antecedents', 'consequents', 'support', 'confidence', 'lift']
df = pd.read_csv('./data/rules_apriori_dataset.csv', usecols=fields)
print(df.head(5))

                  antecedents                 consequents  support  \
0      frozenset({'Bananas'})      frozenset({'Avocado'})  0.16774   
1      frozenset({'Avocado'})      frozenset({'Bananas'})  0.16774   
2  frozenset({'Black beans'})      frozenset({'Avocado'})  0.16604   
3      frozenset({'Avocado'})  frozenset({'Black beans'})  0.16604   
4  frozenset({'Blueberries'})      frozenset({'Avocado'})  0.16508   

   confidence      lift  
0    0.408544  0.996352  
1    0.409082  0.996352  
2    0.405986  0.990112  
3    0.404936  0.990112  
4    0.404628  0.986801  
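
The three metric columns are linked by the standard association-rule definitions: confidence(A → C) = support(A ∪ C) / support(A) and lift(A → C) = confidence(A → C) / support(C). As a rough sanity check, the sketch below plugs in rows 0 and 1 from the output above to recover the implied single-item supports. It assumes the `support` column holds the joint support of antecedent and consequent (the identical values in rows 0 and 1 suggest this), and the helper name `implied_supports` is hypothetical, not part of the notebook.

```python
def implied_supports(support, confidence, lift):
    """Recover support(A) and support(C) from a rule's joint support, confidence, and lift."""
    antecedent_support = support / confidence   # from confidence = support(A ∪ C) / support(A)
    consequent_support = confidence / lift      # from lift = confidence / support(C)
    return antecedent_support, consequent_support

# Row 0: Bananas -> Avocado; row 1: Avocado -> Bananas (values copied from the output above)
bananas_0, avocado_0 = implied_supports(0.16774, 0.408544, 0.996352)
avocado_1, bananas_1 = implied_supports(0.16774, 0.409082, 0.996352)

print(f"support(Bananas): {bananas_0:.4f} vs {bananas_1:.4f}")
print(f"support(Avocado): {avocado_0:.4f} vs {avocado_1:.4f}")
```

If the two estimates for each item agree, the columns are internally consistent. A lift slightly below 1, as in these rows, means the pair co-occurs marginally less often than it would if the two items were independent.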


## Step 3. Initialize BERT tokenizer and model for embedding

In [4]:

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

## Step 4. Generate BERT embeddings

In [5]:
def get_bert_embeddings(texts):
    """
    Generates BERT embeddings for a list of input texts using mean pooling.

    Args:
        texts (List[str]): A list of input text strings to encode.

    Returns:
        np.ndarray: A 2D NumPy array where each row is the embedding of a text.

    Logic:
        - Tokenizes the input texts with padding and truncation to handle varying lengths.
        - Runs the tokenized inputs through the BERT model without tracking gradients (inference mode).
        - Applies mean pooling over the token embeddings (last_hidden_state) to produce a single fixed-size vector per text.

    Edge Case Handling:
        - `padding=True`: Ensures all input texts are padded to the same length for batch processing.
        - `truncation=True`: Prevents overly long texts from causing errors by truncating to the model's max length.
        - `with torch.no_grad()`: Reduces memory usage and speeds up inference by disabling gradient calculation.
    """
    
    # Tokenize input texts with padding and truncation to create uniform tensor input
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    
    # Disable gradient tracking for inference to improve efficiency
    with torch.no_grad():
        outputs = model(**inputs)
    
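    # Apply mean pooling across token embeddings, weighting by the attention mask
    # so that padding tokens do not contribute to the sentence vector
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)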
    
    # Convert PyTorch tensor to NumPy array for easier downstream usage
    return embeddings.numpy()
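
When texts of different lengths are batched with `padding=True`, the padded positions still produce hidden states, so an average over every position mixes them into the sentence vector. The sketch below illustrates mean pooling restricted to real tokens via an attention mask. It uses plain Python lists instead of tensors, and the function name `masked_mean_pool` and the toy values are assumptions made for illustration only.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average only the token vectors whose mask value is 1."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            total = [t + v for t, v in zip(total, vector)]
            count += 1
    return [t / max(count, 1) for t in total]

# Toy sequence: two real tokens followed by two padding positions
tokens = [[1.0, 3.0], [3.0, 5.0], [9.0, 9.0], [9.0, 9.0]]
mask = [1, 1, 0, 0]

pooled = masked_mean_pool(tokens, mask)                     # averages the first two vectors only
plain = [sum(col) / len(tokens) for col in zip(*tokens)]    # averages all four positions
print(pooled, plain)
```

The two results differ whenever a sequence contains padding, which is why the mask matters for batches of item lists with varying lengths.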

In [6]:
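# Process antecedents and consequents for text search.
# The CSV stores each frozenset as a string; eval() is restricted to the frozenset
# constructor and should only be used on trusted, self-generated data.
_eval_env = {"__builtins__": {}, "frozenset": frozenset}
parsed_antecedents = df['antecedents'].apply(lambda x: eval(x, _eval_env))
df['antecedent_str'] = parsed_antecedents.apply(str)  # Normalized string form of the antecedent
df['consequent_str'] = df['consequents'].apply(lambda x: str(eval(x, _eval_env)))  # Normalized string form of the consequent
df['antecedent_item_count'] = parsed_antecedents.apply(len)  # Count number of items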
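
`ast.literal_eval` cannot parse these values directly because `frozenset(...)` is a function call rather than a literal. If the rules file might come from an untrusted source, one option is to check the call shape with `ast` and evaluate only the inner set literal. This is a minimal sketch under that assumption; `parse_frozenset` is a hypothetical helper, not part of the notebook.

```python
import ast

def parse_frozenset(text):
    """Parse a string such as "frozenset({'Bananas'})" without calling eval()."""
    node = ast.parse(text.strip(), mode="eval").body
    is_frozenset_call = (
        isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id == "frozenset"
        and not node.keywords
        and len(node.args) <= 1
    )
    if not is_frozenset_call:
        raise ValueError(f"Not a frozenset expression: {text!r}")
    if not node.args:
        return frozenset()
    # literal_eval accepts an AST node and only evaluates plain literals
    return frozenset(ast.literal_eval(node.args[0]))

items = parse_frozenset("frozenset({'Avocado', 'Bananas'})")
print(sorted(items), len(items))
```

Any string that is not a single `frozenset(...)` call around a literal raises an error instead of being executed.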

## Step 5. Create a list of documents for the Chroma vectorstore

In [7]:

documents = []
for idx, row in df.iterrows():
    antecedent = row['antecedent_str']
    consequent = row['consequent_str']
    metadata = {
        'support': row['support'],
        'confidence': row['confidence'],
        'lift': row['lift'],
        'consequent_str': consequent,  # Store consequent as part of metadata
        'antecedent_item_count': row['antecedent_item_count']  # Add item count to metadata
    }
    # Both antecedent and consequent are part of the same document, so we use the same document ID (idx)
    document = Document(
        page_content=antecedent,  # Antecedent will be the main text (page_content)
        metadata=metadata,        # Metadata contains the consequent and other info
        id=str(idx)               # Using the row index as a unique document ID
    )
    documents.append(document)

## Step 6. Use HuggingFaceEmbeddings directly in Chroma

In [8]:

embedding_model = HuggingFaceEmbeddings(model_name="bert-base-uncased")

No sentence-transformers model found with name bert-base-uncased. Creating a new one with mean pooling.


## Step 7. Create Chroma vectorstore

In [9]:

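# Persist to chroma_db_path so that the reset in Step 1 applies to this store
vectorstore = Chroma.from_documents(documents, embedding_model, persist_directory=chroma_db_path)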

## Step 8. Search for similar antecedent rules and return top resulting consequents based on confidence.

In [10]:
def search_antecedent(query, k=5, result_k=5, metadata_filter=None):
    """
    Search for similar antecedent rules and return top resulting consequents based on confidence.

    Parameters:
    - query (str): Comma-separated list of input items to search for.
    - k (int): Number of similar documents to retrieve from vectorstore.
    - result_k (int): Number of top results to return after sorting by confidence.
    - metadata_filter (dict): Optional filter to narrow down search scope.

    Returns:
    - List of tuples: (consequent item(s), confidence score)
    """
    # Perform similarity search against the antecedent texts
    search_results = vectorstore.similarity_search(query, k=k, filter=metadata_filter)
    
    filtered_results = []

    for doc in search_results:
        metadata = doc.metadata
        confidence = metadata.get("confidence", 0)
        
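        # Parse the stored frozenset string to extract the item(s).
        # eval() is acceptable here only because the metadata was generated by this notebook.
        consequent_raw = metadata.get("consequent_str", "")
        try:
            consequent_set = eval(consequent_raw)
            if isinstance(consequent_set, frozenset):
                consequent = ', '.join(sorted(consequent_set))  # sorted for a stable item order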
            else:
                consequent = str(consequent_set)
        except Exception as e:
            print(f"Error parsing consequent: {e}")
            consequent = consequent_raw
        
        filtered_results.append((consequent, confidence))

    # Sort by confidence and limit the number of results
    filtered_results = sorted(filtered_results, key=lambda x: x[1], reverse=True)[:result_k]

    #print("Top Consequents by Confidence:")
    #for item, conf in filtered_results:
        #print(f"{item}: {conf:.2f}")
    
    return filtered_results


## Step 9. Generate Recommendations based on query

In [11]:
# Example: When a user selects a single item, the recommendation system suggests related items based on confidence scores.
# (Different metrics may be used depending on the specific use case.)
query = 'Avocado'  # User query for antecedent

# metadata_filter = {"antecedent_item_count": 1}  # Simpler alternative: filter on a single metadata field
metadata_filter = {
    "$and": [
        {"confidence": {"$gte": 0.4}},          # Filter for confidence >= 0.4
        {"antecedent_item_count": {"$eq": 1}},  # Filter for antecedent_item_count == 1
    ]
}

# Retrieve top-k similar consequents based on the antecedent query
recommended_consequents = search_antecedent(query, k=50, result_k=5, metadata_filter=metadata_filter)
print(f'Recommended Consequents:\n\n {recommended_consequents}')

Recommended Consequents:

 [('Oranges', 0.4115072224113613), ('Russet Potatoes', 0.4099600039020583), ('Bananas', 0.4090820407765096), ('Green Grapes', 0.4089357135889181), ('Strawberries', 0.4084023038575093)]
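
The examples in this step use the same filter shape, with `antecedent_item_count` set to the number of items in the query. A small helper could derive that count from the query string rather than repeating the dictionary in each cell. This is a sketch only; `build_rule_filter` and its `min_confidence` parameter are hypothetical names, not part of the notebook.

```python
def build_rule_filter(query, min_confidence=0.4):
    """Build a metadata filter whose item count matches the comma-separated query."""
    items = [item.strip() for item in query.split(",") if item.strip()]
    return {
        "$and": [
            {"confidence": {"$gte": min_confidence}},
            {"antecedent_item_count": {"$eq": len(items)}},
        ]
    }

print(build_rule_filter("Avocado,Russet Potatoes"))
```

The returned dictionary can be passed as the `metadata_filter` argument of `search_antecedent`.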


In [12]:
# Example: When a user selects two items, the recommendation system suggests related items based on confidence scores.
# (Different metrics may be used depending on the specific use case.)
query = 'Avocado,Russet Potatoes'  # User query for antecedent
# metadata_filter = {"antecedent_item_count": 2}  # Simpler alternative: filter on a single metadata field
metadata_filter = {
    "$and": [
        {"confidence": {"$gte": 0.4}},          # Filter for confidence >= 0.4
        {"antecedent_item_count": {"$eq": 2}},  # Filter for antecedent_item_count == 2
    ]
}

# Retrieve top-k similar consequents based on the antecedent query
recommended_consequents = search_antecedent(query, k=50, result_k=5, metadata_filter=metadata_filter)
print(f'Recommended Consequents:\n\n {recommended_consequents}')

Recommended Consequents:

 [('Bananas', 0.4120458891013384), ('Green Grapes', 0.4115379983138624), ('Eggs', 0.4112093690248566), ('Red Grapes', 0.4097694971933596), ('Potatoes', 0.4090583173996175)]


In [13]:
# Example: When a user selects three items, the recommendation system suggests related items based on confidence scores.
# (Different metrics may be used depending on the specific use case.)
query = 'Avocado,Russet Potatoes,Bananas'  # User query for antecedent
# metadata_filter = {"antecedent_item_count": 3}  # Simpler alternative: filter on a single metadata field
metadata_filter = {
    "$and": [
        {"confidence": {"$gte": 0.4}},          # Filter for confidence >= 0.4
        {"antecedent_item_count": {"$eq": 3}},  # Filter for antecedent_item_count == 3
    ]
}

# Retrieve top-k similar consequents based on the antecedent query
recommended_consequents = search_antecedent(query, k=50, result_k=5, metadata_filter=metadata_filter)
print(f'Recommended Consequents:\n\n {recommended_consequents}')


Recommended Consequents:

 [('Potatoes', 0.4147862232779097), ('Potatoes', 0.4140982606042112), ('Oranges', 0.4127646883388011), ('Bananas', 0.4122728626750074), ('Eggs', 0.4112759643916914)]
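
The output above lists 'Potatoes' twice, presumably because two different antecedent rules share that consequent, and it also includes 'Bananas', which is already part of the query. A possible post-processing step is sketched below: it drops items the user has already selected and keeps the highest confidence per consequent. The helper `clean_recommendations` is hypothetical, and it assumes single-item consequent labels as in this output.

```python
def clean_recommendations(results, query):
    """Drop items already in the query and keep the best confidence per consequent."""
    selected = {item.strip().lower() for item in query.split(",")}
    best = {}
    for consequent, confidence in results:
        if consequent.lower() in selected:
            continue  # the user already has this item
        if consequent not in best or confidence > best[consequent]:
            best[consequent] = confidence
    return sorted(best.items(), key=lambda pair: pair[1], reverse=True)

# Results copied from the output above
results = [('Potatoes', 0.4147862232779097), ('Potatoes', 0.4140982606042112),
           ('Oranges', 0.4127646883388011), ('Bananas', 0.4122728626750074),
           ('Eggs', 0.4112759643916914)]
cleaned = clean_recommendations(results, 'Avocado,Russet Potatoes,Bananas')
print(cleaned)
```

Because this can shrink the list, retrieving more candidates than needed (a larger `result_k`) before cleaning may be worthwhile.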


In [14]:
# Example: When a user selects four items, the recommendation system suggests related items based on confidence scores.
# (Different metrics may be used depending on the specific use case.)
query = 'Avocado,Russet Potatoes,Bananas,Strawberries'  # User query for antecedent
# metadata_filter = {"antecedent_item_count": 4}  # Simpler alternative: filter on a single metadata field
metadata_filter = {
    "$and": [
        {"confidence": {"$gte": 0.4}},          # Filter for confidence >= 0.4
        {"antecedent_item_count": {"$eq": 4}},  # Filter for antecedent_item_count == 4
    ]
}

# Retrieve top-k similar consequents based on the antecedent query
recommended_consequents = search_antecedent(query, k=50, metadata_filter=metadata_filter)
print(f'Recommended Consequents:\n\n {recommended_consequents}')

Recommended Consequents:

 [('Potatoes', 0.4233176838810641), ('Milk', 0.4213313161875945), ('Bananas', 0.4213313161875945), ('Gala Apples', 0.4183055975794251), ('Eggs', 0.4176646706586826)]
