# Week 4 - Systematically Improving Your RAG Application

> If you haven't already, please run `1. Generate Dataset.ipynb` to generate the dataset that we'll be using in this notebook. It will also help you get familiar with the data that we're working with in this case study.

# Why Use Topic Modelling?

When dealing with large-scale RAG applications, understanding performance across different query types becomes challenging. Topic modelling helps solve this by automatically clustering similar queries, allowing us to efficiently identify patterns and potential issues in our system's responses.

We use BERTopic because it provides a few benefits:

- It has a modular architecture that allows us to swap out embedding models, clustering algorithms and dimensionality reduction techniques easily 
- It has built in visualisation and analysis tools
- It offers a large number of extensions that we can use to guide the model to generate topics that we care about

When applying these techniques in production, make sure to use dynamic topic modelling so that topics stay consistent over time. It also helps to have domain experts guide topic creation so that the topics are more fine-grained and relevant.

Topic modelling is a technique that allows us to identify clusters of related queries at scale using unsupervised techniques. We can then manually inspect samples from each cluster to understand what sort of queries we're getting and how well we're doing on them.

> It's important here to note that topic modelling is just a way for us to come up with explicit classifications for queries. 

# Topic Modelling

In this section, we'll walk you through how to generate topics from our dataset. We'll be fixing the random state of the `UMAP` so that we can get consistent results in this notebook. 

## Generating Our Topics

UMAP is a dimensionality reduction technique. We'll use it here to reduce the dimensionality of our embeddings before clustering. Because UMAP is stochastic, fixing its random state ensures that the final results are consistent across runs.

In [1]:
import json
import pandas as pd
from umap import UMAP
from bertopic import BERTopic

with open("./data/cleaned.jsonl", "r") as f:
    questions = [json.loads(line) for line in f]
    docs = [item["question"] for item in questions]

df = pd.DataFrame(questions)
df.drop(columns=["citation"], inplace=True)
df.head()



Unnamed: 0,question,answer,category,subcategory
0,I've been waiting 3 weeks for my Klarna Card,The physical Klarna Card is not yet available ...,Products & services,Klarna Card
1,Hmm I can't find my payment plan for some reas...,If you still can't find your payment plan afte...,Payments,Payment issues
2,I can't seem to find the option to pay with Kl...,Klarna cannot be used for utility bill payment...,Products & services,How to use Klarna
3,"I accidentally created a new klarna ccount, ca...",To manage all your payments in one Klarna acco...,Account & settings,Manage account
4,I just noticed a late fee and now Klarna's loc...,Late fees are applied if a payment fails to be...,Payments,Payment issues


In [2]:


umap_model = UMAP(random_state=14)
topic_model = BERTopic(
    umap_model=umap_model
)
topic_model.fit_transform(docs)
topic_model.get_topic_info()



Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,19,-1_the_verification_my_for,"[the, verification, my, for, code, whats, step...",[I got an email saying that my email needs ver...
1,0,98,0_klarna_to_my_and,"[klarna, to, my, and, the, for, im, but, card,...","[Listen, I've been using Klarna for years now,..."
2,1,69,1_my_the_return_store,"[my, the, return, store, still, and, to, but, ...",[I reported a return for some winter boots I b...
3,2,33,2_card_onetime_my_it,"[card, onetime, my, it, for, but, to, and, is,...","[I'm really sorry to bother you, but I'm a bit..."
4,3,24,3_payment_my_due_date,"[payment, my, due, date, the, is, cant, plan, ...",[So I'm approved for financing but now the due...


## Modelling User Satisfaction

We want to look at our topics in closer detail to see if there are classes of queries that we're performing badly for. This helps us to understand how well we're nailing our search queries. To do so, we'll look at topics in terms of the query volume and the user satisfaction score.

We'll be generating user satisfaction scores synthetically for each topic so that we can see how well we're doing on each of them. Each query is assigned a binary outcome, 1 if the user was satisfied and 0 otherwise, by drawing from a uniform distribution and comparing the draw against the satisfaction probability of the query's topic.

In order to demonstrate the different combinations of query volume and satisfaction scores, we've chosen the following probabilities for each topic randomly. Once we've done so, we can start segmenting our queries into specific topics and see how well we're doing on those. This user satisfaction score is very important because it allows us to understand whether our system is able to retrieve relevant documents to answer our users' queries.

Therefore, when it comes to building a real product, you should start collecting user satisfaction scores early so that you can use them to iteratively improve your system over time.
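
The sampling described above amounts to one Bernoulli draw per query. Here is a minimal sketch of that idea, assuming NumPy's `default_rng` generator; the three-topic probability table and the six topic assignments are hypothetical values chosen only for illustration, and `sample_satisfaction` is a helper name introduced here, not part of the notebook.

```python
import numpy as np

# Hypothetical per-topic satisfaction probabilities (illustration only)
example_probabilities = {0: 0.9, 1: 0.2, 2: 0.8}

# Hypothetical topic assignments for six queries
example_topics = [0, 0, 1, 1, 2, 2]


def sample_satisfaction(topics, probabilities, seed):
    """Draw one Bernoulli outcome (1 = satisfied) per query."""
    rng = np.random.default_rng(seed)  # seeded generator, so runs are repeatable
    p = np.array([probabilities[int(t)] for t in topics])
    # A uniform draw in [0, 1) falls below p with probability p
    return (rng.uniform(size=len(p)) < p).astype(int)


first = sample_satisfaction(example_topics, example_probabilities, seed=21)
second = sample_satisfaction(example_topics, example_probabilities, seed=21)
print(first, bool((first == second).all()))
```

Passing the seed to a dedicated generator keeps the draws reproducible regardless of what other code does with NumPy's global random state.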


In [3]:
probabilities = {
    -1: 0.8,
    0: 0.9,
    1: 0.2,
    2: 0.8,
    3: 0.2,
}

In [4]:
topic_df = topic_model.get_document_info(docs)
topic_df.head()

Unnamed: 0,Document,Topic,Name,Representation,Representative_Docs,Top_n_words,Probability,Representative_document
0,I've been waiting 3 weeks for my Klarna Card,0,0_klarna_to_my_and,"[klarna, to, my, and, the, for, im, but, card,...","[Listen, I've been using Klarna for years now,...",klarna - to - my - and - the - for - im - but ...,1.0,False
1,Hmm I can't find my payment plan for some reas...,3,3_payment_my_due_date,"[payment, my, due, date, the, is, cant, plan, ...",[So I'm approved for financing but now the due...,payment - my - due - date - the - is - cant - ...,1.0,False
2,I can't seem to find the option to pay with Kl...,0,0_klarna_to_my_and,"[klarna, to, my, and, the, for, im, but, card,...","[Listen, I've been using Klarna for years now,...",klarna - to - my - and - the - for - im - but ...,1.0,False
3,"I accidentally created a new klarna ccount, ca...",0,0_klarna_to_my_and,"[klarna, to, my, and, the, for, im, but, card,...","[Listen, I've been using Klarna for years now,...",klarna - to - my - and - the - for - im - but ...,0.926805,False
4,I just noticed a late fee and now Klarna's loc...,0,0_klarna_to_my_and,"[klarna, to, my, and, the, for, im, but, card,...","[Listen, I've been using Klarna for years now,...",klarna - to - my - and - the - for - im - but ...,1.0,False


In [5]:
# Join topic_df with original df on Document/question to get categories
results_df = topic_df.merge(df, left_on="Document", right_on="question", how="left")
results_df = results_df[["question", "Topic", "answer", "category", "subcategory"]]
results_df.head()

Unnamed: 0,question,Topic,answer,category,subcategory
0,I've been waiting 3 weeks for my Klarna Card,0,The physical Klarna Card is not yet available ...,Products & services,Klarna Card
1,Hmm I can't find my payment plan for some reas...,3,If you still can't find your payment plan afte...,Payments,Payment issues
2,I can't seem to find the option to pay with Kl...,0,Klarna cannot be used for utility bill payment...,Products & services,How to use Klarna
3,"I accidentally created a new klarna ccount, ca...",0,To manage all your payments in one Klarna acco...,Account & settings,Manage account
4,I just noticed a late fee and now Klarna's loc...,0,Late fees are applied if a payment fails to be...,Payments,Payment issues


In [6]:
import numpy as np

# Seed NumPy's global generator, since np.random.uniform draws the samples below
np.random.seed(21)

# Generate satisfaction scores based on topic probabilities
results_df["satisfied"] = results_df["Topic"].apply(
    lambda x: 1 if np.random.uniform() < probabilities[x] else 0
)

results_df

Unnamed: 0,question,Topic,answer,category,subcategory,satisfied
0,I've been waiting 3 weeks for my Klarna Card,0,The physical Klarna Card is not yet available ...,Products & services,Klarna Card,1
1,Hmm I can't find my payment plan for some reas...,3,If you still can't find your payment plan afte...,Payments,Payment issues,0
2,I can't seem to find the option to pay with Kl...,0,Klarna cannot be used for utility bill payment...,Products & services,How to use Klarna,1
3,"I accidentally created a new klarna ccount, ca...",0,To manage all your payments in one Klarna acco...,Account & settings,Manage account,1
4,I just noticed a late fee and now Klarna's loc...,0,Late fees are applied if a payment fails to be...,Payments,Payment issues,1
...,...,...,...,...,...,...
238,Saw a $175 purchase from an electronics store ...,1,Customer support should verify the specific tr...,Fraud & security,Report fraud,0
239,verification code not working,-1,This could be due to various login issues such...,Account & settings,Login,1
240,Card expired before I could use it - what now?,2,The one-time card expires 24 hours after creat...,Refunds,One-time card,1
241,Just bought some work clothes online and only ...,0,"Yes, Klarna will automatically adjust your pay...",Payments,Payment issues,1


In [19]:
round(float(results_df["satisfied"].mean()), 2)


0.58

## Finding Problematic Clusters

If we look at the overall satisfaction score, we might think that we're doing pretty badly because we're getting an average satisfaction score of 58%, just slightly better than flipping a coin. However, we'll see that this is not the case once we look at the topics in closer detail.

We'll do so in 2 steps:

1. We'll first do a breakdown of overall satisfaction per topic and see how each individual topic is performing
2. We'll then dive into specific topics that have a high volume of queries and a correspondingly low satisfaction score. These are the topics that we want to dig deeper into because they're the ones that we need to prioritize fixing

Once we've done so, we'll talk briefly about steps that we might take next if we were building a real product that we had collected this data for.

In [7]:
# Calculate mean satisfaction and volume per topic
topic_stats = (
    results_df.groupby("Topic")
    .agg(
        {
            "satisfied": lambda x: round(x.mean(), 2),
            "question": "count",  # Count number of questions per topic
        }
    )
    .rename(columns={"question": "volume"})
)

topic_stats

Topic,satisfied,volume
-1,0.68,19
0,0.91,98
1,0.16,69
2,0.76,33
3,0.12,24
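
As a quick sanity check, the overall score of 0.58 is the volume-weighted average of the per-topic scores in the table above. A short sketch in plain Python, with the values copied from that table (the variable names are introduced here for illustration):

```python
# (satisfied, volume) per topic, copied from the per-topic table
per_topic = {
    -1: (0.68, 19),
    0: (0.91, 98),
    1: (0.16, 69),
    2: (0.76, 33),
    3: (0.12, 24),
}

total_volume = sum(volume for _, volume in per_topic.values())
weighted_sum = sum(score * volume for score, volume in per_topic.values())

# Volume-weighted mean satisfaction across all topics
overall = round(weighted_sum / total_volume, 2)
print(overall)  # 0.58
```

The single aggregate number hides the fact that two topics sit near 0.9 and 0.8 while two others sit below 0.2.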


Let's quickly revisit our original matrix to decide what to prioritize. 

![](./assets/matrix.png)

We can classify our topics into the 4 categories as follows:

1. Topic 0 - Maintain: It has an above average query volume and a relatively high satisfaction score of 0.91. This is great and we should continuously monitor this topic to ensure we're not regressing.
2. Topic 1 - Danger Zone: This topic has an above average query volume and a low satisfaction score. We should dig deeper into this topic to understand why this is the case because it's likely that we're going to get a lot of complaints around this topic.
3. Topic 2 - High ROI: This topic has a low query volume but a high satisfaction score. This is a great topic to prioritize because we're not investing a lot of resources into this topic but we're getting a lot of bang for our buck.
4. Topic 3 - Low ROI: This topic has both a low query volume and a low satisfaction score. We likely shouldn't spend any more resources on this topic and we should consider removing it if we were building a real product.
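
The classification above can be sketched as a simple rule over the per-topic statistics. This is a minimal illustration and not a definitive rule: the cut-offs (mean volume across the four topics, and a satisfaction threshold of 0.5) are assumptions made here, and the outlier topic `-1` is left out because it is not classified above.

```python
# (satisfied, volume) per topic from the per-topic table; outlier topic -1 is omitted
stats_by_topic = {0: (0.91, 98), 1: (0.16, 69), 2: (0.76, 33), 3: (0.12, 24)}

# Assumed thresholds: mean volume across topics and a 0.5 satisfaction cut-off
volume_cutoff = sum(v for _, v in stats_by_topic.values()) / len(stats_by_topic)
satisfaction_cutoff = 0.5


def quadrant(satisfied, volume):
    """Map one topic onto the volume/satisfaction matrix."""
    high_volume = volume > volume_cutoff
    high_satisfaction = satisfied >= satisfaction_cutoff
    if high_volume:
        return "Maintain" if high_satisfaction else "Danger Zone"
    return "High ROI" if high_satisfaction else "Low ROI"


labels = {topic: quadrant(s, v) for topic, (s, v) in stats_by_topic.items()}
print(labels)
```

With real data the thresholds would need tuning, for example by using medians or targets agreed with the product team.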

## Digging Deeper into Problematic Topics

Out of the 4 topics, the one we want to dig deeper into first has to be Topic 1 which is in the danger zone. Let's take a look at the queries in this topic and see if we can understand why users are dissatisfied.

We'll do so in the following 3 steps:

1. First we'll aggregate the queries in this topic by category to see if there are specific categories that are causing the most issues within the topic
2. We'll then sample from the queries in this topic and see if we can understand what sort of queries are being asked
3. We'll then brainstorm how we might be able to improve this topic, for example by adding more documents to the index that are relevant to this topic

In [8]:
topic_1_df = results_df[results_df["Topic"] == 1]
total_items = len(topic_1_df)

# Calculate mean satisfaction and volume per category within Topic 1
topic_stats = (
    topic_1_df.groupby("category")
    .agg(
        {
            "satisfied": lambda x: round(x.mean(), 2),
            "question": "count",  # Count number of questions per category
        }
    )
    .rename(columns={"question": "volume"})
)

# Share of Topic 1 queries that fall into each category
topic_stats["pct"] = round(topic_stats["volume"] / total_items, 3)
topic_stats


category,satisfied,volume,pct
Account & settings,0.0,1,0.014
Declined purchase,0.0,1,0.014
Delivery & returns,0.17,41,0.594
Fraud & security,0.17,6,0.087
Payments,0.67,3,0.043
Products & services,0.17,6,0.087
Refunds,0.0,11,0.159


In [21]:
topic_1_items = results_df[results_df["Topic"] == 1]
topic_1_items = topic_1_items[topic_1_items["category"].isin(["Refunds", "Delivery & returns"])]
topic_1_items = topic_1_items[topic_1_items["satisfied"] == 0]
topic_1_items.head(10)


Unnamed: 0,question,Topic,answer,category,subcategory,satisfied
6,Still waiting for my delivery while payment da...,1,You do not need to pay for your order until yo...,Delivery & returns,Deliveries,0
12,Store isn't responding and I keep getting char...,1,It can be frustrating when your return isn’t r...,Delivery & returns,Returns,0
16,"My return investigation got closed, and I am r...",1,It's understandable to be frustrated in this s...,Delivery & returns,Returns,0
18,Why hasn't this unauthorized charge been lifte...,1,It's likely that the charge you are seeing is ...,Delivery & returns,Cancellations,0
29,"I've reported my return 10 days ago, but the s...",1,"After you report a return, your invoice is pau...",Delivery & returns,Returns,0
33,I initiated a return about a week ago but i'm ...,1,You are receiving reminders because your retur...,Delivery & returns,Returns,0
37,"Reported my return 10 days ago, why isn’t ever...",1,"Once you report a return, your invoice is paus...",Delivery & returns,Returns,0
40,The store hasn't gotten back to me for over tw...,1,If the store doesn't respond to your inquiries...,Delivery & returns,Problem resolution,0
55,Store not responding,1,You can report a problem with your order via [...,Delivery & returns,Problem resolution,0
59,Returned my order last week and got a reminder...,1,It's possible that your return is still being ...,Delivery & returns,Returns,0


We can see here that the queries in this topic revolve largely around refunds of items. Specifically, users seem to be facing the following 3 issues:

1. They're not getting their refunds and are unaware that refunds take up to 14 days to process
2. They've initiated a refund from the store but are not getting a response
3. They're not receiving any updates on their refund status at each point of the process

We can see here that (1) and (3) could potentially be improved by adding a small disclaimer before users submit their refund requests. This would help to set expectations for users and reduce the number of complaints that we receive. 

Additionally, we could send users a status update every few days after they've requested a refund so that they're aware that we're working on their request.

If we had more metadata, we could also explore more questions such as whether the issues in (2) might be due to specific products, stores or partners that we work with. This would help us to understand whether there are any systemic issues at play here.
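
If such metadata were available, the same group-by used earlier would extend to it directly. The sketch below assumes a hypothetical `store` column and uses made-up rows purely for illustration; `breakdown` is a helper name introduced here, not something defined elsewhere in this notebook.

```python
import pandas as pd

# Made-up rows for illustration only; `store` is a hypothetical metadata column
toy_df = pd.DataFrame(
    {
        "Topic": [1, 1, 1, 1, 0],
        "store": ["A", "A", "B", "B", "A"],
        "satisfied": [0, 0, 1, 0, 1],
    }
)


def breakdown(df, topic, by):
    """Mean satisfaction and query volume for one topic, grouped by `by`."""
    subset = df[df["Topic"] == topic]
    stats = subset.groupby(by).agg(
        satisfied=("satisfied", "mean"),
        volume=("satisfied", "count"),
    )
    # Share of the topic's queries that fall into each group
    stats["pct"] = (stats["volume"] / len(subset)).round(3)
    return stats.sort_values("satisfied")


store_stats = breakdown(toy_df, topic=1, by="store")
print(store_stats)
```

Sorting by satisfaction puts the worst-performing groups first, which is where a systemic issue with a particular store or partner would show up.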

This wouldn't have been possible if we didn't have the topic modelling layer in place. We would have been limited to only being able to do ad-hoc analysis on individual queries. By getting more granular insights, we're able to make more data-driven decisions on where to focus our efforts - with this short analysis we know now that we should prioritize fixing the refund process and making it more intuitive for users.

# Saving Our Model

Now that we've trained our model, we want to save it so that we can use it in our production application as an online topic model. We'd ideally also do some batch inference offline to run these topics on previous time periods to see how well we're doing over time.

We can do so by saving the model in the `safetensors` format. `BERTopic` recommends saving the model this way since it's a relatively safe format and produces small files that are easy to use in production.

In [27]:
embedding_model = "sentence-transformers/all-MiniLM-L6-v2"

# Save the model
topic_model.save("./models/topic_model", serialization="safetensors", save_ctfidf=True, save_embedding_model=embedding_model)

# Load the model
topic_model = BERTopic.load("./models/topic_model")



We can also use `HuggingFace` to push our model to the hub relatively easily. This allows us to use it in other projects and deploy it in production without much effort.

In [29]:
topic_model.push_to_hf_hub("ivanleomk/rag-topic-model")

topic_embeddings.safetensors: 100%|██████████| 7.77k/7.77k [00:00<00:00, 12.8kB/s]


CommitInfo(commit_url='https://huggingface.co/ivanleomk/rag-topic-model/commit/a9d57c2014f353e1f7116b7dd9fb00d437570bf4', commit_message='Add BERTopic model', commit_description='', oid='a9d57c2014f353e1f7116b7dd9fb00d437570bf4', pr_url=None, repo_url=RepoUrl('https://huggingface.co/ivanleomk/rag-topic-model', endpoint='https://huggingface.co', repo_type='model', repo_id='ivanleomk/rag-topic-model'), pr_revision=None, pr_num=None)

# Conclusion

In this short notebook, we've shown you how to use BERTopic to generate topics from a dataset of user queries. We've also demonstrated how to use these topics to identify and prioritize problematic clusters of queries. In the next notebook, we'll take a look at how we can apply this in production so that we can track how well we're doing over time.