<a href="https://colab.research.google.com/github/ravindrabharathi/kaggle_genai/blob/main/From_Guesswork_to_AI_Embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import markdown

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

![emb_5.gif](attachment:b2bb6a01-e572-46fa-8b94-cf0e104e045a.gif)

# From Guesswork to AI Embeddings: The Future of Smart Decisions in Retail, E-commerce & Supply Chain Management

## Introduction
I'm not a data scientist. My expertise lies in the trenches of supply chain, retail, and e-commerce. But even from my perspective, the concept of embeddings has become incredibly important. Simply put, an embedding is a way to represent something complex—like a product description, a customer review, or even an entire customer journey—as a string of numbers. These numbers capture the essence and relationships of the original data in a format that computers can easily understand and process.

Understanding embeddings, even at a high level, is crucial for any business leader looking to leverage the power of AI.

**More Than Just Numbers**

Why is this useful? Because computers don't understand words or concepts the way we do. They understand numbers. By converting our valuable, qualitative data into numerical representations, embeddings allow us to find hidden patterns and connections that would be impossible to see otherwise. Think of it like assigning coordinates on a map: similar items will be located closer together in this numerical space.


**Business Impact: Smarter Decisions**

The business applications are vast and transformative. In e-commerce, embeddings power more accurate product recommendations, helping customers discover items they actually want. For retail, they can identify subtle trends in customer behavior, optimizing store layouts or promotional offers. In supply chain, embeddings can categorize and analyze vast amounts of unstructured data, to flag potential risks or opportunities.


**Beyond the Buzzword**
You've likely heard of "AI" and "machine learning" being buzzwords. Embeddings are one of the fundamental building blocks that make these technologies truly powerful and applicable to real-world business challenges. They enable us to unlock the true value of our data, moving beyond simple analysis to predictive insights and automated decision-making.

### Let's Get Practical: Working with a Dataframe

To illustrate, consider the Online Retail dataset at UCI Machine Learning Repository. This dataset contains transactional data from a UK-based online retailer, including product descriptions, quantities, prices, and customer information. We can use embeddings to represent each product description as a vector, allowing us to identify similar products based on their numerical representations.

This could be used for product recommendations, inventory management, or even identifying potentially fraudulent transactions.

In [None]:
# URL of the Online Retail dataset at UCI Machine Learning Repository
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx"

# Read the Excel file into a pandas DataFrame
df = pd.read_excel(url)
df.head()

For this first example I'll use the top 20 products sold, ranked by how many invoice lines they appear on across the whole dataset.
Here are their descriptions. Yes, we will start from product descriptions!

In [None]:
df_product_sold = df.groupby(['Description'])['InvoiceNo'].count().reset_index().sort_values(by="InvoiceNo", ascending=False)
TOP20_list = df_product_sold['Description'][:20].values
TOP20_list

## Diving Deeper: Sentence Transformers

When we talk about turning sentences into those useful numerical embeddings, a powerful tool often used is called Sentence Transformers. Imagine you have hundreds of thousands of customer reviews, supplier agreements, or product descriptions. Instead of just looking for keywords, Sentence Transformers allow us to understand the meaning of an entire sentence or paragraph and convert that meaning into a unique string of numbers. This means that sentences with similar meanings, even if they use different words, will have numerical representations that are very close to each other. This capability is vital for tasks like finding truly relevant documents, grouping similar feedback, or matching questions to the best answers. It's worth noting that the field of embeddings is constantly evolving, and even more powerful algorithms exist that can capture even finer nuances and relationships within data, continuously pushing the boundaries of what's possible.


**The Power of Hugging Face**

So, where do you find these advanced tools and models? This is where Hugging Face comes in. Think of Hugging Face as a massive, collaborative hub for artificial intelligence, particularly for tools that understand and generate human language. It's like a central marketplace and open-source community where researchers and companies share cutting-edge AI models, datasets, and practical tools. For business users, Hugging Face simplifies access to powerful AI capabilities, allowing you to leverage pre-trained models (like those used for Sentence Transformers) without needing to build them from scratch. It dramatically lowers the barrier to entry for implementing sophisticated AI solutions, enabling faster innovation and real-world application in areas like customer service automation, intelligent search, and market intelligence.

The code below shows how easy it is to convert our top 20 product descriptions into embeddings.

In [None]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/all-MiniLM-L12-v2')
embeddings = model.encode(TOP20_list)

Let's now have a look at what has been generated by SentenceTransformer.

* The embeddings array contains 20 rows (one per product) and 384 columns: the numbers that describe each product.
* The code below shows the first 10 values of the representation of the first product in the list, "WHITE HANGING HEART T-LIGHT HOLDER". Not exactly easy for humans to understand.

In [None]:
embeddings.shape

In [None]:
embeddings[0][:10]

## Visualizing Embeddings with PCA

Once we've transformed our text into these numerical embeddings, how do we actually see and understand them? Since these embeddings often have many dimensions (hundreds or even thousands of numbers for each sentence, 384 in our case), it's impossible for us to visualize them directly.
This is where a technique called *Principal Component Analysis (PCA)* becomes incredibly useful.

Think of PCA as a powerful way to simplify complex data. It takes a high-dimensional set of numbers and boils it down to just a few essential dimensions, while trying to keep as much of the original "information" or relationships as possible.

It's like taking a detailed 3D model and intelligently compressing it into a 2D image, retaining the most important features. We'll explore this by reducing our embeddings to just three key components, allowing us to plot and visually interpret their relationships in a way our brains can grasp.
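To make the idea of "keeping as much information as possible" concrete, here is a minimal sketch of PCA done by hand with NumPy. The data is synthetic (random numbers I made up for the demo, not our product embeddings), and the variable names are my own. scikit-learn's `PCA` exposes the same quantity as `explained_variance_ratio_`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for embeddings: 20 points in 10 dimensions, where most of
# the variation lives in the first 3 dimensions (an assumption for this demo).
scales = np.array([5.0, 3.0, 2.0] + [0.1] * 7)
X = rng.normal(size=(20, 10)) * scales

# PCA by hand: center the data, then take the singular value decomposition.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Share of the total variance captured by each principal component.
explained_variance_ratio = S**2 / np.sum(S**2)

# Project onto the first 3 components: 20 rows, 3 columns.
reduced = X_centered @ Vt[:3].T

print(reduced.shape)
print(explained_variance_ratio[:3].sum())
```

If the first three components capture only a small share of the variance, the 3D chart should be read with caution, because much of the structure is hidden in the dimensions that were dropped.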

In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=3)
reduced_embeddings = pca.fit_transform(embeddings)

In the code above I used 3 components in order to reduce the embeddings array to a new array with 20 rows and only 3 columns.

In [None]:
reduced_embeddings.shape

The 3D chart visualizes data points after applying Principal Component Analysis (PCA) to embeddings. PCA reduces the high-dimensional embedding vectors into three components (X, Y, and Z axes), allowing us to plot them.

Here's a breakdown of the clusters we can identify:

**Cluster 1 (Bottom-left)**: This cluster seems to group items related to baking or kitchenware. In a supply chain context, this could represent a group of products that are often purchased together or share similar supply chain characteristics (e.g., storage requirements, supplier).

**Cluster 2 (Bottom-right)**: This cluster contains various types of bags. In a supply chain, these might be grouped based on material, size, or target customer.

**Cluster 3 (Top)**: This cluster includes diverse items and it's harder to categorize at first glance. However, they might represent items that are seasonal, decorative, or used for events.

This visualization demonstrates how embeddings, combined with PCA, can reveal meaningful relationships between data points, which can be extremely valuable in supply chain applications such as product categorization, demand forecasting, and inventory optimization.

In [None]:
fig = plt.figure(figsize=(12, 10))
ax = fig.add_subplot(111, projection='3d')  # Create a 3d chart

# Use colors for the dots
colors = ['r', 'g', 'b', 'c', 'm', 'y', 'k', 'orange', 'purple', 'brown', 'r', 'g', 'b', 'c', 'm', 'y', 'k', 'orange', 'purple', 'brown']
for i, text in enumerate(TOP20_list):
    ax.scatter(reduced_embeddings[i, 0], reduced_embeddings[i, 1], reduced_embeddings[i, 2], c=colors[i], marker='o')
    ax.text(reduced_embeddings[i, 0], reduced_embeddings[i, 1], reduced_embeddings[i, 2], text, fontsize=5)  # add the label next to the point

ax.set_title('Embedding of Product descriptions with PCA (3D)')
plt.show()

## Predicting New Products

The true power of embeddings and this kind of visualization becomes apparent when we introduce new products.

Imagine we add two brand-new items to our inventory: "JUMBO SHOPPER BAG RED RETROSPOT" and "JUMBO SHOPPER BAG GREEN RETROSPOT". By running their descriptions through the same embedding model and applying PCA, we can determine their numerical representations and project them onto this very chart.

Given that these are both types of bags, our expectation is that they would naturally land within the existing "bag" cluster (the purple, pink, black, blue cluster we identified earlier).

This demonstrates how embeddings allow us to not only understand existing data but also to intelligently categorize and predict the relationships of unseen items, leading to more accurate product placement, recommendation strategies, and inventory planning.

In [None]:
new_products = ["JUMBO SHOPPER BAG RED RETROSPOT", "JUMBO SHOPPER BAG GREEN RETROSPOT"]

In [None]:
# Use the same model to create embeddings for the 2 new products.
new_embeddings = model.encode(new_products)

# Project them with the PCA already fitted on the original 20 products
reduced_new_embeddings = pca.transform(new_embeddings)

# Stack the new rows under the previous array (22 rows x 3 columns)
reduced_new_embeddings = np.vstack([reduced_embeddings, reduced_new_embeddings])

# Add the new products to the list of products
product_list = np.append(TOP20_list, new_products)

Indeed we can see in the chart below that the two new products have been correctly placed in the right cluster.

In [None]:
# Plot the 3D embeddings
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')  # Create a 3D axis

# Use different colors for the points
colors = ['r', 'g', 'b', 'c', 'm', 'y', 'k', 'orange', 'purple', 'brown', 'r', 'g', 'b', 'c', 'm', 'y', 'k', 'orange', 'purple', 'brown', 'r', 'g']
for i, text in enumerate(product_list):
    ax.scatter(reduced_new_embeddings[i, 0], reduced_new_embeddings[i, 1], reduced_new_embeddings[i, 2], c=colors[i], marker='o')
    ax.text(reduced_new_embeddings[i, 0], reduced_new_embeddings[i, 1], reduced_new_embeddings[i, 2], text, fontsize=6)  # add the label next to the point

ax.set_xlim(0.2, np.max(reduced_new_embeddings[:, 0]))  # Desired lower bound and the actual maximum on the x axis
ax.set_ylim(0, np.max(reduced_new_embeddings[:, 1]))  # Desired lower bound and the actual maximum on the y axis

ax.set_title('Embedding of Product Descriptions Reduced with PCA (3D)')
plt.show()

## Beyond the Visual: Measuring Product Similarity with Cosine Similarity

While the 3D chart gives us a great visual intuition, seeing clusters is just one piece of the puzzle.

To precisely quantify how similar products are, we use a powerful technique called **Cosine Similarity**.

Imagine our product embeddings as arrows pointing in different directions in that multi-dimensional space. Cosine similarity measures the angle between these arrows. If two product embeddings are very similar in meaning, their arrows will point in almost the same direction, resulting in a high cosine similarity score (close to 1). If they're completely different, their arrows will point in very different directions, and the score will be low (closer to 0).

This method allows us to go beyond just looking at a chart and mathematically identify the top 'n' most similar products to any given input product. This is incredibly valuable for enhancing recommendation engines, identifying cross-selling opportunities, or even suggesting alternative products when an item is out of stock.
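To see the "angle between arrows" idea in numbers, here is a small sketch that computes cosine similarity by hand. The vectors are toy values I chose for illustration, and `cosine` is a helper name of my own, not something from a library.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between vectors a and b."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])    # same direction as a, twice as long
c = np.array([0.0, 0.0, 3.0])    # perpendicular to a
d = np.array([-1.0, -2.0, 0.0])  # opposite direction to a

print(cosine(a, b))  # very close to 1: same direction, length is ignored
print(cosine(a, c))  # 0: nothing in common
print(cosine(a, d))  # very close to -1: opposite direction
```

Note that the score depends only on direction, not on the length of the arrows, and that in general it ranges from -1 to 1. In practice, scores between product descriptions tend to fall between 0 and 1.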

In [None]:
new_embeddings = np.vstack([embeddings, new_embeddings])  # 20 original + 2 new products: 22 rows x 384 columns

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
# Compute the cosine similarity matrix
cosine_similarities = cosine_similarity(new_embeddings)

This array is a list of cosine similarity scores, showing how similar the first product in your list is to every other product (including itself).

Let's break it down:

* 0.9999999 (The first number): This score represents the similarity of the first product to itself. As you'd expect, a product is perfectly similar to itself, so the score is very close to 1.0. This acts as a good check that the calculation is working correctly.

* The rest of the numbers (e.g., 0.17248935, 0.2665741, 0.34575576): Each subsequent number in the array is the cosine similarity score between your first product and another product in your dataset.

Higher numbers (closer to 1.0) indicate a higher degree of similarity in meaning or context between the first product and that specific other product.
Lower numbers (closer to 0.0) indicate less similarity.

So, if you were looking for products similar to your first one, you would scan this array for the highest values (excluding the first 0.9999999), and those would be your most similar products. This is exactly what a "top n products" function would do – it sorts these scores and picks out the items with the highest similarity.
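The "scan for the highest values" step can be sketched in a few lines with NumPy. The product names and scores below are made up for illustration; they are not taken from the dataset.

```python
import numpy as np

# Toy similarity row for the product at position 0 (made-up numbers).
names = ["A", "B", "C", "D", "E"]
row = np.array([1.0, 0.17, 0.62, 0.35, 0.81])
query_index = 0
n = 2

# Sort positions from most to least similar, then drop the query itself.
order = np.argsort(row)[::-1]
order = order[order != query_index]

top_n = [(names[i], float(row[i])) for i in order[:n]]
print(top_n)  # [('E', 0.81), ('C', 0.62)]
```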

In [None]:
cosine_similarities[0]

This Python function, find_similar_items, is a practical tool that leverages the embeddings and cosine similarity we've discussed to actually recommend products.

Here's how it works in simple terms:

**What it does**: Its main job is to find the "top N" most similar products to any specific product you're interested in, based on the underlying meaning captured by the embeddings.

**What you give it (Inputs)**:

* product_list: This is simply a comprehensive list of all your product descriptions.
* similarity_matrix: This is the big table (matrix) we talked about earlier, where every number represents the cosine similarity score between any two products.
* product_description: This is the specific product (by its description) for which you want to find similar items.
* n: This is the number you specify for how many "top" similar products you want to see (e.g., if you want the top 3, n would be 3).

**How it finds the similar items (Step-by-step)**:

1. Locate Your Product: First, the function quickly finds your product_description within the product_list to get its exact position (or "index").
2. Grab Similarities: Using that position, it goes into the similarity_matrix and pulls out the entire row of similarity scores that corresponds to your selected product. This row tells you how similar your product is to every other product in the list.
3. Filter Out Itself: It then intelligently removes the score where your product is compared to itself (which will always be nearly 1.0), because you're interested in other items.
4. Rank by Similarity: Next, it sorts all the remaining product similarities from the highest score (most similar) to the lowest score (least similar).
5. Pick the Top N: Finally, it takes the top n products from this sorted list.
6. Present Results: It then neatly formats these top n product descriptions along with their rounded similarity scores, making them easy to read and use.
   
**The Benefit**: This function automates the process of finding relevant product recommendations. Instead of manually sifting through data or guessing, you can instantly identify which products are most semantically similar to a given item, directly impacting your cross-selling strategies, personalized customer experiences, and inventory management.

In [None]:
def find_similar_items(product_list, similarity_matrix, product_description, n):
    """
    Finds the top n items most similar to a selected product, using cosine similarity.

    Args:
        product_list (list): A list of strings, where each string is the description of a product.
        similarity_matrix (np.ndarray): A square matrix containing the cosine similarity values between all products.
        product_description (str): The description of the selected product.
        n (int): The number of similar products to return.

    Returns:
        list: A list of tuples (similar_item_description, similarity_value),
              sorted in descending order of similarity.
              Returns an error message (str) if the product is not found.
    """
    try:
        product_index = product_list.index(product_description)
    except ValueError:
        return f"Product '{product_description}' not found."

    # Get the similarity values for the selected product
    similarity_product = similarity_matrix[product_index]

    # Create a tuple (index, cosine_similarity) for every other product in the list
    other_product_similarity = [(i, similarity_product[i]) for i in range(len(product_list)) if i != product_index]

    # Sort the list in descending order of similarity
    other_product_similarity_sorted = sorted(other_product_similarity, key=lambda x: x[1], reverse=True)

    # Keep the n most similar products
    top_n_prod = other_product_similarity_sorted[:n]

    results = []
    for index, similarity in top_n_prod:
        description = product_list[index]
        results.append((description, round(float(similarity), 2)))

    # Print a readable summary and also return the list so it can be reused
    print(f"Top {n} recommended products are: {results}")
    return results

Let's try it!

In [None]:
find_similar_items(product_list.tolist(), cosine_similarities, "JUMBO SHOPPER BAG RED RETROSPOT", 3)

**Business Case 1: Targeted Email Suggestions for Engaged Customers**

Imagine a valuable client has just completed a purchase of your "JUMBO SHOPPER BAG RED RETROSPOT" online. This isn't just a transaction; it's an opportunity. Rather than sending a generic follow-up email, we can use our embedding-powered system to send highly personalized suggestions. This proactive, intelligent approach goes beyond simple purchase history; it understands the underlying product attributes and customer preferences. The business benefit is clear: by offering genuinely relevant products, you significantly increase the likelihood of repeat purchases, boost your cross-selling success, and enhance the overall customer experience, strengthening brand loyalty.

**Business Case 2: Smart Alternatives for Out-of-Stock Products**

Now consider a common retail challenge: a popular item, like the "JUMBO SHOPPER BAG RED RETROSPOT," is temporarily out of stock. Without a smart solution, a customer might leave your site frustrated, potentially going to a competitor. This is where embeddings become a sales-saver. Our find_similar_items function can be immediately deployed. By inputting the out-of-stock product's description, the system instantly identifies and presents the most similar available alternatives. This isn't just about suggesting another bag; it's about finding a bag that resonates with the original choice's style, function, or target use case, even if the explicit keywords differ. The business value here is immense: you prevent lost sales, maintain customer satisfaction by offering viable alternatives, and guide customers smoothly towards a purchase even when their first choice isn't available, transforming a potential negative experience into a positive one.

## Forecasting New Products: Leveraging Similarity for Smarter Planning

Now, let's bring it all together for a critical business challenge: forecasting sales for a brand new product. This is often a significant hurdle, as historical data for the new item doesn't exist. This is where our find_similar_items function, powered by embeddings and cosine similarity, becomes invaluable.

We can input the description of our new product into this function and identify the single most similar existing product in our catalog. Once we've found this "twin" product, we can then dive into its historical sales data, specifically looking at its seasonality. Did it sell particularly well during the holiday season? Or perhaps in the spring months? By understanding the sales patterns of its most similar counterpart, we gain crucial insights into how our new product is likely to perform over time.

This allows us to make informed decisions on initial inventory levels, plan targeted marketing campaigns, and allocate resources more effectively, significantly reducing the risk associated with new product launches and leading to more accurate sales forecasts.
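One simple way to borrow the twin's sales pattern is a seasonal index: each week's sales divided by the twin's average week. The sketch below uses made-up weekly numbers and an assumed average weekly volume for the new product (a planning input you would have to choose yourself), so treat it as an illustration rather than a forecasting method.

```python
import numpy as np

# Hypothetical weekly unit sales of the "twin" product over 8 weeks.
twin_weekly_sales = np.array([100, 120, 90, 110, 200, 260, 150, 170], dtype=float)

# Seasonal index: each week's sales relative to the twin's average week.
seasonal_index = twin_weekly_sales / twin_weekly_sales.mean()

# Assumed average weekly volume for the new product (not derived from data).
new_product_avg_weekly = 40.0

# Baseline: the new product's expected level, shaped by the twin's pattern.
new_product_baseline = new_product_avg_weekly * seasonal_index
print(np.round(new_product_baseline, 1))
```

The baseline keeps the twin's peaks and dips but rescales them to the volume we expect for the new item.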

For our new "JUMBO SHOPPER BAG RED RETROSPOT", we found that 'JUMBO BAG RED RETROSPOT' in our catalogue has a cosine similarity of about 0.95. Let's use it for our example.

In [None]:
# Select the most similar item and sum its quantities by invoice date
Top1_df = df[df['Description']=='JUMBO BAG RED RETROSPOT'].groupby('InvoiceDate')['Quantity'].sum().reset_index().set_index('InvoiceDate')
# Resample the DataFrame to get weekly data
weekly_df = Top1_df.resample('W').sum()

For our purposes, after selecting the similar product, we would then examine its yearly sales. Unfortunately, we don't have detailed information about promotions or other events that may have affected those sales. For this time, we will not consider these additional factors, but I invite the reader to deep dive into the importance of incorporating promotions, holidays, and other external factors for more robust forecasting in real-world scenarios.

In [None]:
plt.rcParams["figure.figsize"] = (15,4)
_=weekly_df.plot(color='purple', marker='o', markerfacecolor='lime')
_=plt.title("JUMBO BAG RED RETROSPOT Weekly Sales",fontsize = 12)

To gain an even deeper understanding of how a product's sales might fluctuate throughout the year, especially for items whose seasonal swings tend to grow or shrink with their overall sales volume, we can employ **Prophet's model**.

The Python code we use here prepares our historical sales data for Prophet, ensuring it understands the dates and quantities. The key instruction to Prophet is to look for a yearly pattern (yearly_seasonality=True) but to apply it in a multiplicative way (seasonality_mode='multiplicative'). After the model analyzes the data, it generates a forecast, and from this, we can extract and plot just the multiplicative seasonal component. **It's important to note, however, that having only 52 data points (one year of weekly sales) is not the ideal scenario for robust seasonality detection**. While Prophet is quite capable with limited data, more years of historical data would allow the model to identify these repeating patterns with greater confidence and accuracy, leading to an even more precise seasonal forecast.
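The difference between a multiplicative and an additive seasonal effect is easier to see with numbers. The sketch below uses made-up values and plain NumPy rather than Prophet itself. As far as I understand Prophet's multiplicative mode, the seasonal component is expressed as a fraction of the trend, which is what the first calculation imitates.

```python
import numpy as np

trend = np.array([100.0, 200.0, 400.0])  # growing baseline sales (made up)
seasonal_effect = 0.25                   # a +25% seasonal lift (made up)

# Multiplicative: the seasonal swing scales with the trend.
multiplicative = trend * (1 + seasonal_effect)

# Additive: the seasonal swing is a fixed number of units (here 25).
additive = trend + 25.0

print(multiplicative)  # [125. 250. 500.]
print(additive)        # [125. 225. 425.]
```

Both versions agree when the baseline is 100, but as the baseline grows the multiplicative swing grows with it, which is the behavior described above for products whose seasonal peaks scale with overall volume.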

In [None]:
from prophet import Prophet

# Prophet requires columns named 'ds' for dates and 'y' for values
prophet_df = weekly_df.reset_index()
prophet_df.columns = ['ds', 'y']

# We tell it to look for yearly seasonality.
# A separate variable name keeps the SentenceTransformer stored in `model` intact.
prophet_model = Prophet(yearly_seasonality=True, seasonality_mode='multiplicative')
prophet_model.fit(prophet_df)

# --- Create a future DataFrame for plotting seasonality ---
# We just need enough future dates to see the pattern, e.g., one year
future = prophet_model.make_future_dataframe(periods=52, freq='W', include_history=False) # Only future for seasonality plot

# --- Make predictions ---
forecast = prophet_model.predict(future)

plt.figure(figsize=(10, 3))
plt.plot(forecast['ds'], forecast['yearly'])
plt.title('Prophet: Isolated Yearly Seasonality Component')
plt.xlabel('Date')
plt.ylabel('Seasonality Contribution')
plt.grid(True)
plt.tight_layout()
plt.show()

**Business Interpretation:**

This chart provides a clear, quantitative understanding of your product's typical yearly sales rhythm. For instance, if you see a significant peak around November-December, it strongly suggests a holiday season boost. A dip in, say, January-February might indicate a post-holiday slowdown. By analyzing this isolated seasonality pattern, you can more accurately forecast future sales for both existing and new, similar products, enabling precise inventory adjustments, targeted marketing campaigns for peak seasons, and strategic planning to mitigate dips during slower periods.

## From Product to Customer: Personalized Recommendations at Scale

This brings us to the final, incredibly impactful application of embeddings. Having explored how embeddings allow us to understand and forecast product behavior, we can now turn our attention to the other side of the equation: **our customers**.

Imagine converting every facet of a customer's digital footprint (their browsing history, past purchases, review text, customer service interactions, and even their responses to surveys) into a unique "customer embedding." This numerical representation effectively captures their preferences, interests, and behaviors.

Once we have these customer embeddings, the magic happens. If a customer shows interest in a brand new product, our system can immediately identify other customers whose "customer embeddings" are highly similar. These are individuals who, based on their holistic digital profile, share similar tastes and likely have similar future needs or desires. Even if these "similar customers" haven't yet encountered this new product, our system can confidently recommend it to them.

This approach moves beyond simple "people who bought X also bought Y" logic; it's about understanding the nuanced preferences of individual customers and proactively connecting them with products they're most likely to love, leading to higher conversion rates, increased customer satisfaction, and a truly personalized shopping experience.
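Here is a minimal sketch of the "find similar customers" step. The customer IDs and the 3-dimensional vectors are toy values I invented for illustration, and `most_similar_customers` is a hypothetical helper name. Real customer embeddings would have hundreds of dimensions.

```python
import numpy as np

# Toy customer embeddings (made-up 3-dimensional vectors).
customer_ids = ["C1", "C2", "C3", "C4"]
customer_vecs = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 0.1, 0.9],
    [0.1, 0.9, 0.2],
])

# Normalize each row so that a dot product equals the cosine similarity.
unit = customer_vecs / np.linalg.norm(customer_vecs, axis=1, keepdims=True)
sims = unit @ unit.T

def most_similar_customers(customer_id, k=1):
    """Return the k customers whose embeddings are closest to customer_id."""
    i = customer_ids.index(customer_id)
    order = [j for j in np.argsort(sims[i])[::-1] if j != i]
    return [customer_ids[j] for j in order[:k]]

print(most_similar_customers("C1"))  # ['C2']
```

If C1 shows interest in a new product, C2 would be the first candidate to receive the same recommendation.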

For the next example, which focuses on customer similarity, we will restrict our customer base to those who made a purchase in 2011. This lets us demonstrate the customer embedding and recommendation process with a focused dataset.

In [None]:
customer_importance = df[df["InvoiceDate"].dt.year>2010].groupby("CustomerID")['Description'].nunique().reset_index().sort_values('Description', ascending=False)
customer_importance.columns = ["CustomerID", "nr_products_buought"]

In [None]:
import altair as alt
# Create the histogram
chart = alt.Chart(customer_importance[customer_importance['nr_products_buought']>0]).mark_bar().encode(
    x=alt.X('nr_products_buought:Q', bin=True, title='Number of Products Bought'),
    y=alt.Y('count()', title='Number of Customers'),
    tooltip=[alt.Tooltip('nr_products_buought:Q', bin=True, title='Number of Products Bought'), 'count()']
).properties(
    title='Distribution of Number of Products Bought per Customer'
).interactive()

# Save the chart as a JSON file (optional)
chart.save('distribution_of_products_bought_per_customer_histogram.json')

# To display the chart directly in a Jupyter Notebook or similar environment, you would typically do:
chart.show()

This chart is a histogram that shows how many customers fall into different ranges of unique products bought. From its appearance, we can infer that:

* A large majority of customers bought a relatively small number of distinct products. The bars are highest on the left side of the chart.
* The distribution has a long tail: a very small number of customers bought an exceptionally high number of distinct products, stretching far to the right of the chart. These might be power users, wholesale clients, or even business accounts.

**Why Focus on Customers with 1 to 200 Unique Products?**

Given this distribution, the decision to analyze only customers with more than 0 but fewer than 200 unique products is a practical and strategic one for building customer embeddings for personalized recommendations.

**Filtering Out Extreme Outliers**: The customers who bought several hundred or more distinct products (the far right tail of the distribution) are often anomalies. Their purchasing patterns might be drastically different from the typical individual consumer, perhaps representing bulk business orders rather than personal shopping habits. Including them could skew our understanding of general customer behavior for recommendation purposes.

**Focusing on the Core Customer Base**: By concentrating on the range of 1 to 200 unique products, we are focusing on the vast majority of our individual consumers. This allows our embedding model to learn from the most representative segment of our customer base, ensuring that the insights we gain and the recommendations we generate are relevant and effective for the largest group of our clients.

**Improving Model Performance**: Extreme data points can sometimes confuse machine learning models or make it harder for them to identify patterns that apply broadly. By working with a more consistent segment, we can create more robust and accurate customer embeddings that truly reflect the preferences of our target audience for personalized marketing efforts.

In [None]:
# Create the histogram
chart = alt.Chart(customer_importance[(customer_importance['nr_products_buought']<200) & (customer_importance['nr_products_buought']>0)]).mark_bar().encode(
    x=alt.X('nr_products_buought:Q', bin=True, title='Number of Products Bought'),
    y=alt.Y('count()', title='Number of Customers'),
    tooltip=[alt.Tooltip('nr_products_buought:Q', bin=True, title='Number of Products Bought'), 'count()']
).properties(
    title='Distribution of Number of Products Bought per Customer'
).interactive()

# Save the chart as a JSON file (optional)
chart.save('distribution_of_products_bought_per_customer_histogram.json')

# To display the chart directly in a Jupyter Notebook or similar environment, you would typically do:
chart.show()

To simplify our analysis for this specific demonstration, we will further refine our customer base to include only those clients who purchased between 40 and 60 unique products, and where individual product quantities bought were more than 5 pieces. This precise filtering will allow us to analyze a very specific and engaged customer segment for our customer similarity example.

In [None]:
target_clients = customer_importance[(customer_importance['nr_products_buought']<60) & (customer_importance['nr_products_buought']>40)]['CustomerID'].values
print(f" Selected {len(target_clients)} clients for our analysis")

In [None]:
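# Keep only transactions dated after 2010 (i.e. 2011 onward).
# Note: despite its name, df_2010 contains no 2010 data.
df_2010 = df[df["InvoiceDate"].dt.year > 2010]
df_2010_target_clients = df_2010[df_2010['CustomerID'].isin(target_clients)]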

In [None]:
# Sum quantities per customer and product, then keep only totals above 5
product_bought_target_clients = df_2010_target_clients.groupby(['CustomerID', 'Description'])['Quantity'].sum().reset_index().sort_values(by=['Description','Quantity'], ascending=False)
product_bought_target_clients = product_bought_target_clients[product_bought_target_clients['Quantity']>5]

**This is the final step in preparing your customer data for creating those powerful customer embeddings!**
We'll create a DataFrame where each row represents a selected customer, and importantly, includes a list of all the product descriptions they bought. This list will then serve as the input for your Sentence Transformer model. The field "ln" counts the total number of product descriptions within that list for each client.

To keep the analysis focused on meaningful purchasing behavior, we applied several filters in sequence. First, we selected customers who bought between 40 and 60 unique products. Next, we kept only their transactions dated after 2010 and dropped every customer-product pair with a total quantity of 5 units or fewer. Finally, we kept only clients with more than 10 product descriptions left in their list. This selection ensures the product lists for our chosen customers are rich enough to yield meaningful embeddings.
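To make the shape of the resulting DataFrame concrete, here is a minimal sketch on made-up transactions. The toy data and the `MIN_QTY` and `MIN_PRODUCTS` names are assumptions for illustration only (the notebook uses thresholds of 5 and 10 on the real data); the sketch covers just the quantity filter, the grouping into lists, and the "ln" count.

```python
import pandas as pd

# Toy transactions (made-up data, for illustration only)
toy = pd.DataFrame({
    "CustomerID": [1, 1, 1, 1, 2, 2, 3],
    "Description": ["MUG", "MUG", "LAMP", "CLOCK", "MUG", "LAMP", "CLOCK"],
    "Quantity": [4, 3, 6, 2, 10, 1, 8],
})

MIN_QTY = 5       # keep customer-product pairs with total quantity > 5
MIN_PRODUCTS = 1  # hypothetical threshold for the toy data (the notebook uses 10)

# Total quantity per (customer, product), then drop small totals
totals = toy.groupby(["CustomerID", "Description"], as_index=False)["Quantity"].sum()
totals = totals[totals["Quantity"] > MIN_QTY]

# One row per customer with the list of retained product descriptions
grouped = totals.groupby("CustomerID")["Description"].apply(list).reset_index()
grouped["ln"] = grouped["Description"].str.len()  # number of descriptions per customer
grouped = grouped[grouped["ln"] > MIN_PRODUCTS]

print(grouped)
```

In this toy run only customer 1 survives: the two MUG purchases add up to 7 units and the LAMP purchase is 6, so both pass the quantity filter, while customers 2 and 3 are left with a single product each and are dropped by the list-length filter.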

In [None]:
df_grouped_target_clients = product_bought_target_clients.groupby('CustomerID')['Description'].apply(list).reset_index()
df_grouped_target_clients['ln'] = df_grouped_target_clients['Description'].str.len()
df_grouped_target_clients = df_grouped_target_clients[df_grouped_target_clients['ln']>10]
df_grouped_target_clients

In [None]:
#Apply Sentence transformers to Description field
model = SentenceTransformer('sentence-transformers/all-MiniLM-L12-v2')
client_embeddings = model.encode(df_grouped_target_clients['Description'].values)

In [None]:
#Compute the cosine similarity
client_similarities = cosine_similarity(client_embeddings)

This Python function, **get_recommendations**, is the operational heart of your customer-based recommendation system. It takes the customer similarity insights we've built and translates them into actionable product suggestions for a specific individual.

**Here’s a breakdown of what it does**:

For a chosen customer, this function identifies other customers who are most similar to them (based on their purchasing patterns captured by embeddings). It then looks at what those similar customers have bought and recommends products from their history that your target customer has not yet purchased.

**What you give it (Inputs)**:

* **client_id**: This is the unique identifier for the specific customer you want to generate recommendations for.
* **df_grouped**: This is the DataFrame we just created, containing each customer's ID and the list of product descriptions they purchased.
* **similarity_matrix**: This is the "map" we talked about – the large table showing how similar every client is to every other client (derived from their customer embeddings).
* **n**: This number determines how many of the most similar clients the function should consider when looking for recommendation candidates.

**How it works (Step-by-step)**:

1. Finds the Target Client: First, the function locates your specified client_id within the list of all customers. If the client isn't found, it lets you know.

2. Identifies Most Similar Clients: Using the similarity_matrix (which holds the numerical similarity scores between all clients), it finds the n clients most similar to your target client. It specifically excludes the target client, as we only care about other customers.

3. Collects Products from Similar Clients: It then goes through each of these n similar clients and gathers all the product descriptions in their purchase lists. This creates a combined list of products that "customers like yours" tend to buy.

4. Identifies Products Already Bought by Target Client: Separately, it pulls up the list of all products that your client_id (the target customer) has already purchased.

5. Generates Recommendations: This is the crucial step. It compares the combined list of products from the similar clients with the list of products already bought by your target client. Any product that the similar clients bought, but your target client has not yet bought, is added to the recommended list.

**The Output**:

* The IDs of the n most similar clients found.
* Their cosine similarity scores with your target client (e.g., 0.85; the closer the score is to 1, the more similar the purchase lists).
* The final list of recommended_products.

**Business Value**: This function is a powerful tool for hyper-personalization. It allows you to offer precise product suggestions based on a deep understanding of customer behavior, driving cross-selling, improving customer experience, and increasing sales by presenting relevant items they are likely to want but haven't yet discovered on their own.
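The neighbor-selection and set-difference logic described above can be traced by hand on a toy example. This is a minimal sketch, not the notebook's function: the customer IDs, purchase lists, and similarity values below are all made up for illustration.

```python
import numpy as np

# Toy data: 4 customers and a made-up symmetric similarity matrix
customer_ids = ["A", "B", "C", "D"]
purchases = {
    "A": ["MUG", "LAMP"],
    "B": ["MUG", "CLOCK"],
    "C": ["LAMP", "VASE", "MUG"],
    "D": ["TOY"],
}
sim = np.array([
    [1.00, 0.90, 0.80, 0.10],
    [0.90, 1.00, 0.70, 0.20],
    [0.80, 0.70, 1.00, 0.30],
    [0.10, 0.20, 0.30, 1.00],
])

target, n = "A", 2
t = customer_ids.index(target)

# Rank all customers by similarity to the target, drop the target itself, keep n
order = np.argsort(sim[t])[::-1]
neighbors = [customer_ids[i] for i in order if i != t][:n]

# Pool the neighbors' products and remove what the target already bought
pooled = {p for c in neighbors for p in purchases[c]}
recommended = sorted(pooled - set(purchases[target]))

print(neighbors)    # ['B', 'C']
print(recommended)  # ['CLOCK', 'VASE']
```

Customer A's closest neighbors are B (0.90) and C (0.80). Together they bought MUG, CLOCK, LAMP and VASE; A already has MUG and LAMP, so CLOCK and VASE are what remains to recommend.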

In [None]:
def get_recommendations(client_id, df_grouped, similarity_matrix, n):
    """
    Recommends products for a given client, based on the products purchased by the n most similar clients.

    Args:
        client_id: The ID of the client for whom to generate recommendations.
        df_grouped: DataFrame containing the products purchased by each client,
                    with columns 'CustomerID' and 'Description' (list of descriptions).
                    Its row order must match the rows of similarity_matrix.
        similarity_matrix: Cosine similarity matrix between clients.
        n: Number of most similar clients to consider.

    Returns:
        A list of recommended products for the given client.
        Returns an empty list if the client is not found.
    """
    # 1. Get the list of clients from df_grouped
    clients_list = df_grouped['CustomerID'].unique().tolist()

    if client_id not in clients_list:
        print(f"Client {client_id} not found in the Customer DataFrame.")
        return []

    # 2. Find the n most similar clients, excluding the target client itself
    client_index = clients_list.index(client_id)
    similarities = similarity_matrix[client_index]
    ranked_indices = np.argsort(similarities)[::-1]  # highest similarity first
    similar_clients_indices = [i for i in ranked_indices if i != client_index][:n]
    similar_clients_ids = [clients_list[i] for i in similar_clients_indices]
    similar_clients_values = [similarities[i] for i in similar_clients_indices]  # Get similarity values

    # 3. Find the products purchased by the similar clients
    similar_clients_products = []
    for similar_client_id in similar_clients_ids:
        similar_client_row = df_grouped[df_grouped['CustomerID'] == similar_client_id]
        if not similar_client_row.empty:
            similar_clients_products.extend(similar_client_row['Description'].iloc[0])
        else:
            print(f"Similar client {similar_client_id} not found in df_grouped.")

    # 4. Find the products purchased by the input client
    client_products_row = df_grouped[df_grouped['CustomerID'] == client_id]
    client_products = []
    if not client_products_row.empty:
        client_products = client_products_row['Description'].iloc[0]

    # 5. Recommend products bought by similar clients but not by the input client
    recommended_products = list(set(similar_clients_products) - set(client_products))
    print(f"-The similar Clients are: {similar_clients_ids} \n The Similarity is {np.round(similar_clients_values,2)} \n-The product recommended are: {recommended_products}")
    return recommended_products

In [None]:
get_recommendations(15260.0, df_grouped_target_clients, client_similarities, 2)

A score of 0.97 indicates a very high degree of similarity (nearly identical buying patterns or interests) between our target client and client 15630.0. Similarly, 0.93 signifies a strong match with client 15076.0. These high scores give us confidence that these identified clients truly share comparable preferences.
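For readers curious about what these scores measure, here is a minimal sketch of cosine similarity computed by hand with NumPy. The `cosine` helper and the three toy vectors are illustrative assumptions, not part of the notebook, which relies on a library `cosine_similarity` function applied to the real embeddings.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector norms."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]   # same direction as a, twice the length
c = [3.0, 0.0, -1.0]  # orthogonal to a (dot product is 0)

print(round(cosine(a, b), 2))  # 1.0
print(round(cosine(a, c), 2))  # 0.0
```

The score depends only on the direction of the two vectors, not their length: vectors pointing the same way score 1, unrelated (orthogonal) vectors score 0. A score such as 0.97 therefore means the two customer embeddings point in almost the same direction.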

**This is the core deliverable**: a list of specific products. These are items that the identified similar clients (15630.0 and 15076.0) have purchased, but our target client has not yet added to their own buying history. Because these recommendations are based on a deep understanding of shared preferences, they are highly likely to resonate with our target client. This list can be directly used in personalized email campaigns, on-site product suggestion widgets, or even for sales associates to offer relevant advice.
In essence, this output provides a powerful, data-driven pathway to hyper-personalization, enabling you to proactively suggest products that are highly relevant to individual customer tastes, moving beyond simple demographics or past purchases to predict future desires.

## Conclusions

We've journeyed through the transformative power of semantic embeddings, witnessing how unstructured data—from a simple product description to the intricate tapestry of a customer's purchasing habits—can be converted into quantifiable, actionable insights. No longer are we confined to superficial analysis; embeddings allow us to grasp the true meaning and context, revolutionizing how businesses operate.

The implications are profound. We've seen how identifying truly similar products can fine-tune new product forecasting, mitigate the impact of out-of-stock items, and power hyper-relevant cross-selling initiatives, directly impacting the bottom line. Furthermore, by creating deep "customer embeddings," we move beyond broad demographic strokes to understand the subtle nuances of individual preferences, enabling hyper-personalized marketing campaigns that resonate deeply, foster loyalty, and drive conversion rates previously unattainable. The ability to recommend the perfect new product to a receptive customer, even if they've never seen it before, represents a monumental leap in customer engagement. But the applications extend far beyond recommendations.

The era of merely counting keywords is over. Businesses that embrace the power of semantic embeddings to truly understand their products, their customers, and their operational data will gain an unparalleled competitive advantage. I encourage you to experiment with these powerful techniques, delve deeper into the fascinating world of specialized embedding models, and explore even more advanced algorithms. The journey from raw data to deep business intelligence is just beginning, and the potential for innovation is limitless.