<a href="https://colab.research.google.com/github/razzzeeev/product_alternates/blob/main/Help_shoppers_find_product_alternates.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# !pip install sentence_transformers


In the FindAlternateGroups function, I group similar products from a store based on the cosine similarity between the sentence embeddings of their titles and descriptions.

First, I fetched all the products from the store by paginating through the JSON endpoint. I then extracted the titles and descriptions of all the products and combined them into a single list.

Next, I used the SentenceTransformer model to generate sentence embeddings for all product titles and descriptions. I then computed the pairwise cosine similarity between these sentence embeddings to measure the similarity between products.
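To make the similarity measure concrete, here is a minimal sketch of cosine similarity in plain Python. The helper name `cosine` and the three toy vectors are made up for illustration; real sentence embeddings have hundreds of dimensions, and the notebook itself relies on scikit-learn's `cosine_similarity` rather than this helper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors (hypothetical, not real model output)
a = [1.0, 0.0, 0.0]
b = [2.0, 0.0, 0.0]  # same direction as a, different length
c = [0.0, 1.0, 0.0]  # orthogonal to a

same_direction = cosine(a, b)  # 1.0: vector length does not matter, only direction
orthogonal = cosine(a, c)      # 0.0: no overlap in direction
```

Because the measure depends only on direction, two texts of very different length can still score close to 1 if their embeddings point the same way.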

Finally, I grouped products based on a similarity threshold. Products with a cosine similarity greater than or equal to the specified similarity threshold were grouped together. I then prepared a result JSON containing groups of similar products and returned it from the function.

This approach allowed me to group similar products from a store based on their titles and descriptions. By adjusting the similarity threshold, I could control how similar products needed to be in order to be grouped together.
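The effect of the threshold can be sketched on a small, hand-written similarity matrix. The function `group_by_threshold` below is a hypothetical helper, not part of the notebook: it merges products transitively (if A matches B and B matches C, all three share a group), which is one possible grouping rule and an assumption on my part.

```python
def group_by_threshold(similarity, threshold):
    """Group item indices whose pairwise similarity is >= threshold (transitively)."""
    n = len(similarity)
    parent = list(range(n))  # union-find forest: every item starts in its own group

    def find(x):
        # Follow parent links to the group's root, shortening the path as we go
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if similarity[i][j] >= threshold:
                parent[find(j)] = find(i)  # merge the two groups

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Toy similarity matrix for four products (values made up for illustration)
toy = [
    [1.00, 0.90, 0.75, 0.10],
    [0.90, 1.00, 0.60, 0.20],
    [0.75, 0.60, 1.00, 0.15],
    [0.10, 0.20, 0.15, 1.00],
]

strict = group_by_threshold(toy, 0.8)  # [[0, 1], [2], [3]]
loose = group_by_threshold(toy, 0.7)   # [[0, 1, 2], [3]]
```

Lowering the threshold from 0.8 to 0.7 pulls product 2 into the first group, which is the trade-off the threshold controls: fewer, larger groups versus more, tighter ones.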

In [None]:
import json
import requests
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from bs4 import BeautifulSoup


In [None]:
def extract_text_from_html(html_string: str) -> str:
    """
    Extracts text from an HTML string using BeautifulSoup and collapses whitespace.

    :param html_string: The HTML string to extract text from.
    :return: The extracted text on a single line, with words separated by single spaces.
    """
    soup = BeautifulSoup(html_string, 'html.parser')
    # Separate text nodes with a space so words from adjacent tags do not run together
    text = soup.get_text(separator=' ')
    # Collapse newlines and repeated whitespace into single spaces
    return ' '.join(text.split())


In [None]:
def FindAlternateGroups(store_domain, similarity_threshold=0.8):
    """
    This function takes in a store domain and returns groups of similar products from the store.
    The similarity between products is the cosine similarity between the sentence embeddings
    of each product's combined title and description.
    Products with a cosine similarity greater than or equal to similarity_threshold are grouped together.

    :param store_domain: The domain of the store to fetch products from.
    :type store_domain: str
    :param similarity_threshold: Minimum cosine similarity for two products to share a group.
    :type similarity_threshold: float
    :return: A JSON string containing groups of similar product URLs.
    :rtype: str
    """
    base_url = f"https://{store_domain}"
    page = 1
    all_products = []

    # Fetch all products by paginating through the JSON endpoint
    while True:
        url = f"{base_url}/collections/all/products.json?page={page}"
        response = requests.get(url, timeout=30)
        response.raise_for_status()  # Fail early on HTTP errors
        data = response.json()

        # An empty (or missing) product list means we are past the last page
        if not data.get("products"):
            break

        all_products.extend(data["products"])
        page += 1

    if not all_products:
        return json.dumps([])

    # Combine each product's title and description into one text, so that
    # row i of the embedding matrix corresponds to all_products[i].
    # body_html can be null in the JSON, hence the "or ''" fallback.
    all_text = [
        f"{product['title']}. {extract_text_from_html(product.get('body_html') or '')}"
        for product in all_products
    ]

    # Generate sentence embeddings using SentenceTransformer
    model = SentenceTransformer('distilbert-base-nli-mean-tokens')
    text_features = model.encode(all_text)

    # Compute pairwise similarity between sentence embeddings
    similarity_matrix = cosine_similarity(text_features)

    # Group products greedily: each unassigned product starts a group and pulls in
    # every other unassigned product that meets the threshold, so no product is
    # listed twice and no group is duplicated
    assigned = set()
    result = []
    for i in range(len(all_products)):
        if i in assigned:
            continue
        members = [
            j for j in range(len(all_products))
            if j not in assigned
            and (j == i or similarity_matrix[i][j] >= similarity_threshold)
        ]
        assigned.update(members)
        result.append([f"{base_url}/products/{all_products[j]['handle']}" for j in members])

    return json.dumps(result)


In [None]:
store_domain='www.boysnextdoor-apparel.co'
FindAlternateGroups(store_domain)

**Future Improvements:**
There are several ways we can improve this model:

1. **Use additional product features**: In addition to the titles and descriptions of products, we can use other features such as price, category, and tags to measure the similarity between products. These features need to be encoded numerically and scaled (for example, normalized prices and one-hot encoded categories) before we concatenate them with the sentence embeddings and compute the cosine similarity; otherwise a large raw value such as price would dominate the result.

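2. **Use clustering algorithms**: Instead of grouping products based on a fixed similarity threshold, we can use clustering algorithms such as K-Means or DBSCAN to group similar products together. DBSCAN infers the number of groups from the density of the data and marks isolated products as noise, although it still needs a distance parameter (`eps`). K-Means requires the number of groups to be chosen in advance, for example by comparing silhouette scores across several candidate values.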

3. **Use image processing**: If our products have images, we can use image processing techniques to extract visual features from the images and use them to measure the similarity between products. We can use pre-trained convolutional neural networks (CNNs) to extract these features and concatenate them with the sentence embeddings before computing the cosine similarity.

By incorporating these techniques, we can improve the accuracy of our model in grouping similar products together.
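As a rough illustration of the clustering idea, the sketch below runs scikit-learn's `DBSCAN` with the cosine metric on a handful of toy 2-D vectors. The vectors and the `eps` and `min_samples` values are made up for illustration; on real sentence embeddings these parameters would need tuning.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy 2-D "embeddings" (hypothetical; real sentence embeddings are much larger)
toy_embeddings = np.array([
    [1.0, 0.0],
    [0.9, 0.1],    # points almost the same way as row 0
    [0.0, 1.0],
    [0.1, 0.9],    # points almost the same way as row 2
    [-1.0, -1.0],  # unlike everything else
])

# With metric="cosine", eps is a cosine distance (1 - cosine similarity),
# so eps=0.2 corresponds to a similarity threshold of 0.8.
# min_samples=2 lets a pair of products form a group.
labels = DBSCAN(eps=0.2, min_samples=2, metric="cosine").fit_predict(toy_embeddings)

# Collect product indices per cluster; label -1 means "noise" (no close neighbour)
clusters = {}
for index, label in enumerate(labels):
    if label != -1:
        clusters.setdefault(int(label), []).append(index)
```

Here rows 0 and 1 end up in one cluster, rows 2 and 3 in another, and row 4 is labelled as noise, so the number of groups is not specified up front.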