# Week 4 - Systematically Improving Your RAG Application

> If you haven't already, please run `1. Generate Dataset.ipynb` to generate the dataset that we'll be using in this notebook. It'll also help you get familiar with the data that we're working with for this specific case study.

# Why Use Topic Modelling?

When dealing with large-scale RAG applications, understanding performance across different query types becomes challenging. Topic modelling helps solve this by automatically clustering similar queries, allowing us to efficiently identify patterns and potential issues in our system's responses.

We use BERTopic because it provides a few benefits:

- It has a modular architecture that allows us to easily swap out embedding models, clustering algorithms and dimensionality reduction techniques
- It has built-in visualisation and analysis tools
- It offers a large number of extensions that we can use to guide the model to generate topics that we care about

We want to use topic modelling here to identify clusters of related queries that we can then manually inspect. This allows us to understand how the queries we're getting differ from each other and come up with explicit categories for them. We would then validate these categories with a domain expert to ensure we're covering all the ground that we need to.

Topic modelling is just a way to come up with these explicit categories.

# Generating Topics with BERTopic

In this section, we'll walk you through how to generate topics from our dataset. In order to ensure our analysis is reproducible, we'll be fixing the random state of the `UMAP` algorithm as seen below.

## Generating Our Topics

We fix the random state because `UMAP`, which reduces our high-dimensional embeddings to a 2D space that HDBSCAN then clusters, is stochastic. If the embeddings are reduced to the same low-dimensional form on every run, the topics generated below stay consistent. Without a fixed random state, each run of the BERTopic model would produce slightly different results.



In [25]:
import json
import pandas as pd
from umap import UMAP
from bertopic import BERTopic

with open("./data/cleaned.jsonl", "r") as f:
    questions = [json.loads(line) for line in f]
    docs = [item["question"] for item in questions]

df = pd.DataFrame(questions)
df.drop(columns=["citation"], inplace=True)
df.head()

| | question | answer | category | subcategory |
| --- | --- | --- | --- | --- |
| 0 | Hello Klarna Support, I'm having quite a bit o... | If you’re unable to sign in to your account, t... | Account & settings | Login |
| 1 | I'd like to delete all my personal information... | To delete your personal information, you can d... | Fraud & security | Data protection |
| 2 | I recently started using Klarna, and I'm tryin... | To change your email address, chat with our Cu... | Account & settings | Manage account |
| 3 | Can I pay my monthly utility bills with Klarna? | Yes, select Klarna in the checkout of partneri... | Products & services | How to use Klarna |
| 4 | I'm trying to update my email address because ... | To change your email address for your active p... | Account & settings | Manage account |


In [27]:


umap_model = UMAP(random_state=42)
topic_model = BERTopic(
    umap_model=umap_model
)
topic_model.fit_transform(docs)
topic_model.get_topic_info()

| | Topic | Count | Name | Representation | Representative_Docs |
| --- | --- | --- | --- | --- | --- |
| 0 | -1 | 32 | -1_my_account_to_for | [my, account, to, for, klarna, it, and, do, no... | [Hii Klarna, I just got a wierd email saying s... |
| 1 | 0 | 76 | 0_refund_nike_my_for | [refund, nike, my, for, store, returned, to, c... | [I returned a $100 pair of sneakers from Puma ... |
| 2 | 1 | 47 | 1_my_the_payment_for | [my, the, payment, for, to, klarna, pay, it, b... | [Hey Klarna, I made a purchase using your app ... |
| 3 | 2 | 30 | 2_to_email_my_account | [to, email, my, account, the, im, and, klarna,... | [Every time I try to log into my Klarna accoun... |
| 4 | 3 | 16 | 3_card_klarna_it_to | [card, klarna, it, to, need, details, my, call... | [Accidentally gave my card details over a call... |


In [37]:
# Fetch the per-document topic assignments once instead of on every iteration
doc_info = topic_model.get_document_info(docs)

for topic in topic_model.get_topic_info()["Topic"].tolist():
    # Topic -1 holds the outliers that were not assigned to any cluster
    if topic == -1:
        continue

    topic_items = doc_info[doc_info["Topic"] == topic]

    print(f"Topic {topic}")
    for item in topic_items["Document"].tolist()[:10]:
        print(f" - {item}")
    print("\n")


Topic 0
 - I've been waiting for my running shoes from Under Armour valued at $120 for two weeks now, and the store isn't responding. Can Klarna help?
 - I want to return my latest purchase 
 - Hey Klarna, I just returned a $50 backpack I bought for college. What happens to my payment schedule now that I reported the return?
 - Requested a refund for my Nike purchase, how will it change my Pay in 4 schedule now they’re refunding me?
 - Just bought a pair of running shoes for $85, and it looks like I got charged twice. What's going on?
 - I've canceled my order for those $150 sports sneakers last week directly with the store like their policy said. They promised to process it and send back a confirmation to me and told me that they handle it with you guys but I haven't heard anything since. Anything else that I need to do?
 - I haven't gotten my new Nike hoodie yet. How can I track this order using Klarna?
 - I returned my Nike jacket and was expecting a full refund, but got store credi

If we manually eyeball some of these topics, we can see some general themes in the queries that make up each topic:

- Topic 0: Users are asking about refunds and the status of their refunds
- Topic 1: Users are asking about payment methods and why they're not able to use their payment methods
- Topic 2: Users are facing issues logging into their accounts and resetting information
- Topic 3: Users are asking about personal information or reporting potentially fraudulent transactions/activity

We want to keep manually inspecting a sample of these topics because it gives us a concrete sense of what each topic actually contains, what sort of queries we're getting and how well we're doing on them.

## Assigning Synthetic User Satisfaction Scores

We want to look at our topics in closer detail to see if there are classes of queries that we're performing badly for. This helps us to understand how well we're nailing our search queries. To do so, we'll look at topics in terms of the query volume and the user satisfaction score.

We'll be generating user satisfaction scores synthetically for each topic so that we can see how well we're doing on each of them. Each query will be assigned a binary outcome of 1 or 0 representing whether the user was satisfied or not. The outcome is determined by comparing a draw from a uniform distribution against a probability assigned to the query's topic.

In order to demonstrate the different combinations of query volume and satisfaction scores, we've arbitrarily chosen the following probabilities for each topic. Once we've done so, we can start segmenting our queries into specific topics and see how well we're doing on those. This user satisfaction score is very important because it allows us to understand whether our system is able to retrieve relevant documents to answer our users' queries.

Therefore, when it comes to building a real product, you should start collecting user satisfaction scores early so that you can use them to iteratively improve your system over time.
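As a minimal sketch of the sampling scheme described above, the snippet below uses only Python's standard library. The topic ids and probabilities are illustrative placeholders, and `sample_satisfaction` is a hypothetical helper rather than part of the notebook.

```python
import random


def sample_satisfaction(topics, probabilities, seed=21):
    """Return one 0/1 satisfaction flag per topic id in `topics`."""
    # A local, seeded generator keeps the draws reproducible across runs
    rng = random.Random(seed)
    # rng.random() is uniform on [0, 1), so the comparison is True with
    # probability probabilities[topic]
    return [1 if rng.random() < probabilities[topic] else 0 for topic in topics]


# Illustrative placeholders, not the notebook's data
example_probabilities = {0: 0.2, 1: 0.9}
example_topics = [0] * 1000 + [1] * 1000

flags = sample_satisfaction(example_topics, example_probabilities)
rate_topic_0 = sum(flags[:1000]) / 1000
rate_topic_1 = sum(flags[1000:]) / 1000
print(rate_topic_0, rate_topic_1)
```

With enough queries per topic, the observed rates should land close to the chosen probabilities, which is what lets a per-topic breakdown recover them later on.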


In [38]:
probabilities = {
    -1: 0.8,
    0: 0.2,
    1: 0.9,
    2: 0.8,
    3: 0.2,
}

In [39]:
topic_df = topic_model.get_document_info(docs)
topic_df.head()

| | Document | Topic | Name | Representation | Representative_Docs | Top_n_words | Probability | Representative_document |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Hello Klarna Support, I'm having quite a bit o... | 2 | 2_to_email_my_account | [to, email, my, account, the, im, and, klarna,... | [Every time I try to log into my Klarna accoun... | to - email - my - account - the - im - and - k... | 0.911275 | True |
| 1 | I'd like to delete all my personal information... | 3 | 3_card_klarna_it_to | [card, klarna, it, to, need, details, my, call... | [Accidentally gave my card details over a call... | card - klarna - it - to - need - details - my ... | 0.77533 | False |
| 2 | I recently started using Klarna, and I'm tryin... | 2 | 2_to_email_my_account | [to, email, my, account, the, im, and, klarna,... | [Every time I try to log into my Klarna accoun... | to - email - my - account - the - im - and - k... | 0.816783 | False |
| 3 | Can I pay my monthly utility bills with Klarna? | 1 | 1_my_the_payment_for | [my, the, payment, for, to, klarna, pay, it, b... | [Hey Klarna, I made a purchase using your app ... | my - the - payment - for - to - klarna - pay -... | 1.0 | False |
| 4 | I'm trying to update my email address because ... | -1 | -1_my_account_to_for | [my, account, to, for, klarna, it, and, do, no... | [Hii Klarna, I just got a wierd email saying s... | my - account - to - for - klarna - it - and - ... | 0.0 | False |


In [41]:
# Join topic_df with original df on Document/question to get categories
results_df = topic_df.merge(df, left_on="Document", right_on="question", how="left")
results_df = results_df[["question", "Topic", "answer", "category", "subcategory"]]
results_df.head()

| | question | Topic | answer | category | subcategory |
| --- | --- | --- | --- | --- | --- |
| 0 | Hello Klarna Support, I'm having quite a bit o... | 2 | If you’re unable to sign in to your account, t... | Account & settings | Login |
| 1 | I'd like to delete all my personal information... | 3 | To delete your personal information, you can d... | Fraud & security | Data protection |
| 2 | I recently started using Klarna, and I'm tryin... | 2 | To change your email address, chat with our Cu... | Account & settings | Manage account |
| 3 | Can I pay my monthly utility bills with Klarna? | 1 | Yes, select Klarna in the checkout of partneri... | Products & services | How to use Klarna |
| 4 | I'm trying to update my email address because ... | -1 | To change your email address for your active p... | Account & settings | Manage account |


In [42]:
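import numpy as np

# Seed NumPy's generator: the draws below come from NumPy, so seeding
# Python's built-in `random` module would not make them reproducible
rng = np.random.default_rng(21)

# Draw a uniform sample per query and mark it satisfied (1) when the sample
# falls below the probability assigned to the query's topic
results_df["satisfied"] = results_df["Topic"].apply(
    lambda x: 1 if rng.uniform() < probabilities[x] else 0
)

results_df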

| | question | Topic | answer | category | subcategory | satisfied |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Hello Klarna Support, I'm having quite a bit o... | 2 | If you’re unable to sign in to your account, t... | Account & settings | Login | 0 |
| 1 | I'd like to delete all my personal information... | 3 | To delete your personal information, you can d... | Fraud & security | Data protection | 0 |
| 2 | I recently started using Klarna, and I'm tryin... | 2 | To change your email address, chat with our Cu... | Account & settings | Manage account | 1 |
| 3 | Can I pay my monthly utility bills with Klarna? | 1 | Yes, select Klarna in the checkout of partneri... | Products & services | How to use Klarna | 1 |
| 4 | I'm trying to update my email address because ... | -1 | To change your email address for your active p... | Account & settings | Manage account | 1 |
| ... | ... | ... | ... | ... | ... | ... |
| 196 | I got a call saying it was from Klarna and lik... | 3 | You should immediately pause any payments you ... | Fraud & security | Report fraud | 0 |
| 197 | accidentally cvlicked on a suspicious link and... | -1 | Immediately stop communication with the sender... | Fraud & security | Report fraud | 1 |
| 198 | Just got a call from someone claming to be Kla... | 3 | Based on the information provided, you should ... | Fraud & security | Report fraud | 1 |
| 199 | I think i just gafve out my Klarna card detail... | 3 | If you believe you have revealed your card det... | Fraud & security | Report fraud | 0 |


In [43]:
round(float(results_df["satisfied"].sum() / len(results_df)),2)


0.54

## Analyzing Problematic Clusters

If we look at the overall satisfaction score, we might think that we're doing pretty badly because we're getting an average satisfaction score of 54%, just slightly better than flipping a coin.

However, we'll see that this is not the case once we start to look at the topics in closer detail. 

In [75]:
# Calculate mean satisfaction and volume per topic
topic_stats = (
    results_df.groupby("Topic")
    .agg(
        {
            "satisfied": lambda x: round(x.mean(), 2),
            "question": "count",  # Count number of questions per topic
        }
    )
    .rename(columns={"question": "volume"})
)

topic_stats

| Topic | satisfied | volume |
| --- | --- | --- |
| -1 | 0.81 | 32 |
| 0 | 0.16 | 76 |
| 1 | 0.91 | 47 |
| 2 | 0.8 | 30 |
| 3 | 0.19 | 16 |
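To see how the per-topic numbers relate to the overall score, the short sketch below recomputes the overall satisfaction as a volume-weighted mean of the table above. The values are copied from the table by hand, so small rounding differences against the notebook's own figure are possible.

```python
# (satisfaction, volume) per topic, copied from the table above
topic_stats_rows = {
    -1: (0.81, 32),
    0: (0.16, 76),
    1: (0.91, 47),
    2: (0.80, 30),
    3: (0.19, 16),
}

total_volume = sum(volume for _, volume in topic_stats_rows.values())
weighted_sum = sum(sat * volume for sat, volume in topic_stats_rows.values())

# Overall satisfaction is the volume-weighted mean of the per-topic scores
overall = weighted_sum / total_volume

# Share of all queries that belong to Topic 0
topic_0_share = topic_stats_rows[0][1] / total_volume

print(round(overall, 2), round(topic_0_share, 2))
```

Topic 0 alone accounts for more than a third of the queries, so its low score pulls the overall average down even though three of the five groups score 0.8 or higher.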


Let's quickly revisit our original matrix to decide what to prioritize. 

![](./assets/matrix.png)

Mapping our topics onto the 4 categories in the matrix, we get the following:

1. Topic 0 - Danger Zone: This topic has an above-average query volume and a low satisfaction score. We should dig deeper into this topic to understand why this is the case because it's likely that we're going to get a lot of complaints around this topic.
2. Topic 1 - Maintain: This topic has an above-average query volume and a relatively high satisfaction score of 0.91. This is great and we should continuously monitor this topic to ensure we're not regressing.
3. Topic 2 - High ROI: This topic has a low query volume but a high satisfaction score. This is a great topic to prioritize because we're not investing a lot of resources into this topic but we're getting a lot of bang for our buck.
4. Topic 3 - Low ROI: This topic has both a low query volume and a low satisfaction score. We likely shouldn't spend any more resources on this topic and we should consider removing it if we were building a real product.
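The classification above can be sketched as a small function. This is only an illustration under two assumptions that the notebook does not spell out: "above average" volume is taken to mean above the mean volume of the non-outlier topics, and a satisfaction score of 0.5 is used as the high/low cutoff. `classify_topic` is a hypothetical helper.

```python
def classify_topic(volume, satisfaction, volume_cutoff, satisfaction_cutoff=0.5):
    """Place a topic into one of the four quadrants of the matrix."""
    high_volume = volume > volume_cutoff
    # Assumption: 0.5 separates "high" from "low" satisfaction
    high_satisfaction = satisfaction >= satisfaction_cutoff
    if high_volume and not high_satisfaction:
        return "Danger Zone"
    if high_volume and high_satisfaction:
        return "Maintain"
    if high_satisfaction:
        return "High ROI"
    return "Low ROI"


# (satisfaction, volume) per topic from the table above, excluding outliers (-1)
stats = {0: (0.16, 76), 1: (0.91, 47), 2: (0.80, 30), 3: (0.19, 16)}

# Assumption: the volume cutoff is the mean volume across these topics
mean_volume = sum(volume for _, volume in stats.values()) / len(stats)

quadrants = {
    topic: classify_topic(volume, satisfaction, mean_volume)
    for topic, (satisfaction, volume) in stats.items()
}
print(quadrants)
```

In a real product, the cutoffs would likely be agreed with stakeholders rather than derived from the data, since a mean-based volume cutoff shifts whenever the topic mix changes.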

## Digging Deeper into Problematic Topics

Out of the 4 topics, the one we want to dig deeper into first has to be Topic 0 which is in the danger zone. Let's take a look at the queries in this topic and see if we can understand why users are dissatisfied.

We'll do so in the following 4 steps

1. First we'll take a look and aggregate the queries in this topic by category to see if there are specific categories that are causing the most issues within the topic
2. We'll then sample from the queries in this topic and see if we can understand what sort of queries are being asked
3. We'll then see if there are specific categories or query intents that are causing the most issues within the topic itself by manually inspecting the queries
4. Lastly, we'll brainstorm some potential solutions that we can use to address the issues we're seeing

In [46]:
topic_0_df = results_df[results_df["Topic"] == 0]
total_items = topic_0_df.shape[0]

# Calculate mean satisfaction and volume per category within Topic 0
topic_stats = (
    topic_0_df
    .groupby("category")
    .agg(
        {
            "satisfied": lambda x: round(x.mean(), 2),
            "question": "count",  # Count number of questions per category
        }
    )
    .rename(columns={"question": "volume"})
)

# Share of Topic 0's queries that fall into each category
topic_stats["pct"] = round(topic_stats["volume"] / total_items, 3)
topic_stats


| category | satisfied | volume | pct |
| --- | --- | --- | --- |
| Declined purchase | 0.0 | 2 | 0.026 |
| Delivery & returns | 0.23 | 13 | 0.171 |
| Fraud & security | 0.0 | 1 | 0.013 |
| Payments | 0.0 | 4 | 0.053 |
| Products & services | 0.0 | 2 | 0.026 |
| Refunds | 0.17 | 54 | 0.711 |
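As a quick check on the table above, the sketch below works out how much of Topic 0 is covered by its two largest categories and confirms that the per-category scores are consistent with the topic's overall score. The numbers are copied from the table by hand.

```python
# (satisfaction, volume) per category within Topic 0, copied from the table above
category_rows = {
    "Declined purchase": (0.0, 2),
    "Delivery & returns": (0.23, 13),
    "Fraud & security": (0.0, 1),
    "Payments": (0.0, 4),
    "Products & services": (0.0, 2),
    "Refunds": (0.17, 54),
}

total = sum(volume for _, volume in category_rows.values())

# Share of Topic 0 covered by the two largest categories
focus = ("Refunds", "Delivery & returns")
focus_share = sum(category_rows[name][1] for name in focus) / total

# Volume-weighted satisfaction across all categories in Topic 0
topic_0_satisfaction = sum(sat * vol for sat, vol in category_rows.values()) / total

print(round(focus_share, 3), round(topic_0_satisfaction, 2))
```

Since these two categories cover the large majority of the topic, restricting the manual review to them loses very little.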


In [51]:
query_df = results_df[results_df["Topic"] == 0]
query_df = query_df[query_df["category"].isin(["Refunds", "Delivery & returns"])]
query_df = query_df[query_df["satisfied"] == 0]

for item in query_df["question"].tolist()[:50]:
    print(item)


I've been waiting for my running shoes from Under Armour valued at $120 for two weeks now, and the store isn't responding. Can Klarna help?
I want to return my latest purchase 
Requested a refund for my Nike purchase, how will it change my Pay in 4 schedule now they’re refunding me?
I've canceled my order for those $150 sports sneakers last week directly with the store like their policy said. They promised to process it and send back a confirmation to me and told me that they handle it with you guys but I haven't heard anything since. Anything else that I need to do?
I haven't gotten my new Nike hoodie yet. How can I track this order using Klarna?
I returned my Nike jacket and was expecting a full refund, but got store credit instead. Can't I just get back what I paid, I don't really want to store credit.
I sent back a pair of Adidas trainers about a week ago. Should I expect to see any changes to my balance soon?
Still waiting on my pair of sneakers. It’s been a week, and I haven’t go

While most of the queries here revolve around refunds, there are a few broad categories that we can classify user intent into:

1. Users are unhappy with receiving store credit and are asking to be given cash instead
2. Users are in the middle of having their refund processed and are asking for updates or raising issues with their refund request
3. Users are asking about the impact of their refund on their payment schedule
4. The stores are not responding to their refund requests

These are very different types of queries. Taking the first three in turn:

1. The first one is difficult to solve programmatically because stores have full control over their refund policy. Users might also be incredibly unhappy that they still need to pay off their purchase even though they've returned the items and only received store credit in exchange.
2. The second one is an issue of expectation mismatch: users expect a refund to be processed immediately but are unaware that it can take up to 14 days to process. This could perhaps be solved by adding a small disclaimer before users submit their refund requests.
3. The third one is a straightforward issue that we could perhaps solve by adding a small FAQ/visual before users submit refund requests to help them understand the impact of their refund on their payment schedule.

More importantly, by segmenting queries and manually looking at our data, we can start to see where we might be able to improve our system to better answer user queries. We can also see that a single merchant, Nike, has a high volume of dissatisfied queries. If this was happening on a consistent basis in a real product, we might want to look into why this is the case and see if there might be a deeper issue at play here.
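One rough way to back up the merchant observation is to count how often each merchant is mentioned. The sketch below is only an illustration: the `MERCHANTS` list is a hand-written assumption, `count_merchant_mentions` is a hypothetical helper, and simple substring matching will miss misspellings and brand nicknames.

```python
from collections import Counter

# Assumption: a small hand-written list of merchant names to look for
MERCHANTS = ["Nike", "Adidas", "Under Armour", "Puma"]


def count_merchant_mentions(queries, merchants=MERCHANTS):
    """Count how many queries mention each merchant (case-insensitive)."""
    counts = Counter()
    for query in queries:
        lowered = query.lower()
        for merchant in merchants:
            if merchant.lower() in lowered:
                counts[merchant] += 1
    return counts


# A few of the dissatisfied queries printed above
sample_queries = [
    "I haven't gotten my new Nike hoodie yet. How can I track this order using Klarna?",
    "Requested a refund for my Nike purchase, how will it change my Pay in 4 schedule now they’re refunding me?",
    "I sent back a pair of Adidas trainers about a week ago. Should I expect to see any changes to my balance soon?",
    "I want to return my latest purchase",
]

mentions = count_merchant_mentions(sample_queries)
print(mentions.most_common())
```

In the notebook, the same function could be pointed at `query_df["question"].tolist()` to cover every dissatisfied query in the topic rather than a hand-picked sample.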

# Saving Our Model

Now that we've trained our model, we want to save it so that we can use it in our production application as an online topic model. We'd ideally also do some batch inference offline to run these topics on previous time periods to see how well we're doing over time.

We can do so by saving the model into a `safetensors` file. `BERTopic` recommends saving the model this way since it's a relatively safe format and creates a small file size to use in production.

In [73]:
embedding_model = "sentence-transformers/all-MiniLM-L6-v2"

# Save the model
topic_model.save("./models/topic_model", serialization="safetensors", save_ctfidf=True, save_embedding_model=embedding_model)

# Load the model
topic_model = BERTopic.load("./models/topic_model")



We can also push our model to the Hugging Face Hub relatively easily. This allows us to use it in other projects and deploy it in production without much effort.

In [74]:
topic_model.push_to_hf_hub("ivanleomk/rag-topic-model")

topic_embeddings.safetensors: 100%|██████████| 7.77k/7.77k [00:00<00:00, 7.77kB/s]


CommitInfo(commit_url='https://huggingface.co/ivanleomk/rag-topic-model/commit/4e63008618204433f830127bda6e1c764501bcf5', commit_message='Add BERTopic model', commit_description='', oid='4e63008618204433f830127bda6e1c764501bcf5', pr_url=None, repo_url=RepoUrl('https://huggingface.co/ivanleomk/rag-topic-model', endpoint='https://huggingface.co', repo_type='model', repo_id='ivanleomk/rag-topic-model'), pr_revision=None, pr_num=None)

# Conclusion and Next Steps

In this short notebook, we've done a few things.

1. We've generated topics from a dataset of user queries using BERTopic, identifying a few clusters of related queries that each revolved around a specific theme
2. We took a deep dive into a specific topic that had a high volume of queries and a low satisfaction score. We did so by segmenting the queries in the topic by category and then sampling a few queries to understand the user intent. By doing so, we identified a few broad categories of user intents that were present within the topic itself
3. We also covered how we might deploy and use this model in production to generate topics if we wanted to.

In the next notebook, we'll take a look at how we can classify user queries into these categories that we explicitly defined above. We'll look at how we can collaborate with domain experts by using a simple `.yaml` file which allows us to easily define new categories and reuse them in our prompt.

This process of iteratively improving your RAG application is a crucial part of building a successful product. By using topic modelling to identify clusters of queries, and then turning those clusters into explicit categories, you can start to see where your system might be lacking and improve it over time.