# Clustering Conversations: Discovering User Query Patterns

> **Series Overview**: This is the first notebook in a three-part series on systematically analyzing and improving RAG systems. We'll move from raw user queries to production-ready classifiers that enable data-driven improvements.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/567-labs/kura/blob/main/docs/notebooks/how-to-look-at-data/01_clustering_task.ipynb)

In [None]:
# Install kura in Google Colab
!pip install kura


# Make sure you've set up your `OPENAI_API_KEY`
# os.environ['OPENAI_API_KEY'] = <your api key here> 
import os
from google.colab import userdata
os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')


# Create data directory and download dataset
DATA_DIRECTORY = './data'
CHECKPOINT_DIRECTORY = './checkpoints'

# Curl the Conversation Dataset
os.makedirs(DATA_DIRECTORY, exist_ok=True)
!curl -o {DATA_DIRECTORY}/conversations.json https://usekura.xyz/assets/conversations.json

# Curl the Checkpoints
os.makedirs(CHECKPOINT_DIRECTORY, exist_ok=True)

!curl -o {CHECKPOINT_DIRECTORY}/clusters.jsonl https://usekura.xyz/assets/notebooks/checkpoints/clusters.jsonl
!curl -o {CHECKPOINT_DIRECTORY}/conversations.jsonl https://usekura.xyz/assets/notebooks/checkpoints/conversations.jsonl  
!curl -o {CHECKPOINT_DIRECTORY}/dimensionality.jsonl https://usekura.xyz/assets/notebooks/checkpoints/dimensionality.jsonl
!curl -o {CHECKPOINT_DIRECTORY}/meta_clusters.jsonl https://usekura.xyz/assets/notebooks/checkpoints/meta_clusters.jsonl
!curl -o {CHECKPOINT_DIRECTORY}/summaries.jsonl https://usekura.xyz/assets/notebooks/checkpoints/summaries.jsonl

> **Reproducing Results**: To reproduce the exact results from this notebook, the first cell downloads pre-computed checkpoints from our server. These checkpoints contain the intermediate results from each step of the clustering pipeline, allowing you to follow along without waiting for the computationally expensive embedding and clustering operations to complete.
> 
> To use the precomputed checkpoints, make sure the `curl` commands in that cell have populated the `checkpoints` directory before you run the pipeline.

## Why This Matters

In large-scale RAG applications, you'll encounter thousands of user queries. Manually reviewing each is impossible, and simple keyword counting misses deeper patterns. **Topic modeling helps you systematically identify patterns in user queries**, giving you insights into what users are asking and how well your system serves them.

## What You'll Learn

In this first notebook, you'll discover how to:

1. **Prepare Query Data for Analysis**
   - Format JSON data into Kura conversation objects
   - Structure query-document pairs with proper metadata
   - Set up data for effective clustering

2. **Run Hierarchical Topic Clustering**
   - Use Kura's procedural API for LLM-enhanced clustering
   - Generate meaningful summaries of conversation groups
   - Visualize the topic hierarchies that emerge

3. **Analyze and Interpret Results**
   - Examine cluster themes and distribution patterns
   - Identify high-impact areas for system improvements
   - Recognize limitations in default summarization

## What You'll Discover

**By the end of this notebook, you'll uncover that just three major topics account for over two-thirds of all user queries**, with experiment tracking and logging appearing as dominant themes. However, you'll also discover that default summaries miss crucial details about specific features—a limitation that motivates the custom summarization approach in the next notebook.

## What Makes Kura Different

Traditional topic modeling approaches group documents using statistics alone: BERTopic clusters embeddings, while LDA models word co-occurrence. **Kura enhances this process by leveraging LLMs to**:

1. **Generate Meaningful Summaries** - Create human-readable descriptions rather than just numeric vectors
2. **Build Topic Hierarchies** - Create multi-level trees showing relationships between themes
3. **Provide Procedural API** - Simple functions rather than complex object hierarchies

By using LLMs for summarization before clustering, Kura produces more intuitive, actionable results than pure embedding-based approaches.

## Understanding Topic Modeling

### What is Topic Modeling?

Topic modeling is a technique for automatically discovering themes or patterns in large collections of text. Think of it like sorting a massive pile of documents into folders based on what they're about—except the computer figures out both what the folders should be AND which documents belong in each one.

### The Role of Embeddings

To group similar texts together, we first need to convert them into a format computers can compare. **Embeddings** are numerical representations of text—think of them as coordinates in a high-dimensional space where similar meanings are positioned closer together.

For example:
- "How do I version my model?" and "What's the best way to track model versions?" would have similar embeddings despite using different words
- These queries would be far from "How do I visualize training metrics?" in the embedding space
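To make "closer together" concrete, here is a minimal sketch that compares vectors with cosine similarity. The three-dimensional vectors are hand-picked, hypothetical values for illustration only; real embedding models return hundreds or thousands of dimensions, and the numbers below do not come from any model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings", chosen by hand for this example
toy_embeddings = {
    "How do I version my model?": [0.9, 0.1, 0.0],
    "What's the best way to track model versions?": [0.8, 0.2, 0.1],
    "How do I visualize training metrics?": [0.1, 0.2, 0.9],
}
version_q, track_q, visualize_q = toy_embeddings.values()

sim_close = cosine_similarity(version_q, track_q)
sim_far = cosine_similarity(version_q, visualize_q)
print(f"versioning vs. tracking versions: {sim_close:.2f}")
print(f"versioning vs. visualizing metrics: {sim_far:.2f}")
```

The two versioning questions score much higher than the versioning and visualization pair, which is the property clustering relies on.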

### Making Sense with Dimensionality Reduction

Embeddings typically have hundreds or thousands of dimensions—impossible to visualize directly. **Dimensionality reduction** techniques compress these high-dimensional representations down to 2D or 3D while preserving the important relationships between points.

It's like creating a map of a globe—you lose some information when flattening 3D to 2D, but the relative positions of continents remain meaningful.
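As a rough illustration of the idea, the sketch below projects random 50-dimensional points down to 2D with PCA, computed through NumPy's SVD. PCA is swapped in here only because it needs nothing beyond NumPy; it is not the method the pipeline in this notebook uses, and the random data is a hypothetical stand-in for real embeddings.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
# Hypothetical stand-in for real embeddings: 100 points in 50 dimensions
embeddings = rng.normal(size=(100, 50))


def project_to_2d(vectors):
    """Project rows onto their first two principal components."""
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


points_2d = project_to_2d(embeddings)
print(points_2d.shape)  # (100, 2)
```

Each row of `points_2d` can be plotted directly as an (x, y) coordinate.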

In [1]:
import json

with open("./data/conversations.json") as f:
    conversations_raw = json.load(f)

conversations_raw[0]

{'query_id': '5e878c76-25c1-4bad-8cae-6a40ca4c8138',
 'query': 'experiment tracking',
 'matching_document': '## Track Experiments\n### How it works\nTrack a machine learning experiment with a few lines of code:\n1. Create a W&B run.\n2. Store a dictionary of hyperparameters, such as learning rate or model type, into your configuration (`wandb.config`).\n3. Log metrics (`wandb.log()`) over time in a training loop, such as accuracy and loss.\n4. Save outputs of a run, like the model weights or a table of predictions.  \n\nThe proceeding pseudocode demonstrates a common W&B Experiment tracking workflow:  \n\n```python showLineNumbers\n\n# 1. Start a W&B Run\n\nwandb.init(entity="", project="my-project-name")\n\n# 2. Save mode inputs and hyperparameters\n\nwandb.config.learning\\_rate = 0.01\n\n# Import model and data\n\nmodel, dataloader = get\\_model(), get\\_data()\n\n# Model training code goes here\n\n# 3. Log metrics over time to visualize performance\n\nwandb.log({"loss": loss})\n\n#

This raw format isn't immediately useful for topic modeling. We need to transform it into something that Kura can process effectively. 

To do so, we'll convert each record into the `Conversation` class that Kura exposes. This format allows Kura to:

1. Process the conversation flow (even though we only have single queries in this example)
2. Generate summaries of each conversation
3. Embed and cluster conversations based on content and structure

We'll create a function to convert each query-document pair into a Kura Conversation object with a single user Message that combines both the query and retrieved document.

In [3]:
from kura.types import Message, Conversation
from datetime import datetime
from rich import print


def process_query_obj(obj: dict):
    return Conversation(
        chat_id=obj["query_id"],
        created_at=datetime.now(),
        messages=[
            Message(
                created_at=datetime.now(),
                role="user",
                content=f"""
User Query: {obj["query"]}
Retrieved Information : {obj["matching_document"]}
""",
            )
        ],
        metadata={"query_id": obj["query_id"]},
    )


print(process_query_obj(conversations_raw[0]))

In [17]:
conversations = [process_query_obj(obj) for obj in conversations_raw]

Each `Conversation` object exposes a `metadata` field, which lets us attach additional context that can be valuable for analysis.

Here, we store the query ID in the metadata so that it is preserved for downstream processing. Structuring the data this way means every summary and cluster can later be traced back to the original query.
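One way the preserved query ID can be used downstream is to join analysis results back to the raw records. The sketch below uses plain dictionaries with hypothetical records and IDs (`q-1`, `q-2`); it does not call Kura, and `cluster_query_ids` stands in for IDs you would collect from the conversations in a cluster.

```python
# Hypothetical raw records in the same shape as conversations.json
toy_raw = [
    {"query_id": "q-1", "query": "experiment tracking", "matching_document": "doc A"},
    {"query_id": "q-2", "query": "log a table", "matching_document": "doc B"},
]

# Index raw records by query ID for constant-time lookup
raw_by_id = {obj["query_id"]: obj for obj in toy_raw}

# Stand-in for the query IDs recovered from one cluster's metadata
cluster_query_ids = ["q-2"]
cluster_queries = [raw_by_id[qid]["query"] for qid in cluster_query_ids]
print(cluster_queries)  # ['log a table']
```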

## Running the Clustering Process

Now that we've converted our raw data into Kura's Conversation format, we're ready to run the clustering process.

### The Clustering Pipeline

The hierarchical clustering process follows these systematic steps:

1. **Summarization**: `summarise_conversations()` - Each conversation is summarized by an LLM
2. **Base Clustering**: `generate_base_clusters_from_conversation_summaries()` - Similar conversations are grouped into initial clusters
3. **Hierarchical Merging**: `reduce_clusters_from_base_clusters()` - Similar clusters are progressively combined
4. **Dimensionality Reduction**: `reduce_dimensionality_from_clusters()` - Projects clusters for visualization

Each function handles one step, making it easy to customize individual components and save intermediate results with checkpointing.
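The checkpointing idea can be sketched with the standard library alone: if a JSONL file for a step already exists, load it; otherwise run the step and write its output. This is a simplified illustration of the pattern, not Kura's implementation, and `load_or_compute` and `fake_summarise` are hypothetical names.

```python
import json
import tempfile
from pathlib import Path


def load_or_compute(path, compute):
    """Return cached rows from a JSONL file, or compute and cache them."""
    path = Path(path)
    if path.exists():
        with path.open() as f:
            return [json.loads(line) for line in f if line.strip()]
    rows = compute()
    with path.open("w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    return rows


calls = []


def fake_summarise():
    # Stand-in for an expensive step such as LLM summarization
    calls.append(1)
    return [{"chat_id": "a", "summary": "toy summary"}]


with tempfile.TemporaryDirectory() as tmp:
    checkpoint = Path(tmp) / "summaries.jsonl"
    first = load_or_compute(checkpoint, fake_summarise)   # computes and saves
    second = load_or_compute(checkpoint, fake_summarise)  # loads from disk

print(first == second, len(calls))  # True 1
```

The expensive step runs once; the second call is served from the file, which is why pre-downloaded checkpoints let you skip the slow stages.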

In [21]:
from kura.checkpoints import JSONLCheckpointManager
from kura import (
    summarise_conversations,
    generate_base_clusters_from_conversation_summaries,
    reduce_clusters_from_base_clusters,
    reduce_dimensionality_from_clusters,
)
from kura.summarisation import SummaryModel
from kura.cluster import ClusterDescriptionModel
from kura.meta_cluster import MetaClusterModel
from kura.dimensionality import HDBUMAP


async def analyze_conversations(conversations, checkpoint_manager):
    
    # Set up models
    summary_model = SummaryModel()
    cluster_model = ClusterDescriptionModel()
    meta_cluster_model = MetaClusterModel()
    dimensionality_model = HDBUMAP()

    # Run pipeline steps
    summaries = await summarise_conversations(
        conversations, model=summary_model, checkpoint_manager=checkpoint_manager
    )

    clusters = await generate_base_clusters_from_conversation_summaries(
        summaries, model=cluster_model, checkpoint_manager=checkpoint_manager
    )

    reduced_clusters = await reduce_clusters_from_base_clusters(
        clusters, model=meta_cluster_model, checkpoint_manager=checkpoint_manager
    )

    projected = await reduce_dimensionality_from_clusters(
        reduced_clusters,
        model=dimensionality_model,
        checkpoint_manager=checkpoint_manager,
    )

    return projected


checkpoint_manager = JSONLCheckpointManager(CHECKPOINT_DIRECTORY, enabled=True)
checkpoint_manager.save_checkpoint("conversations", conversations)
clusters = await analyze_conversations(
    conversations, checkpoint_manager=checkpoint_manager
)

In the output, we can see the consolidation process happening in real-time. Kura starts with 56 base clusters, then gradually merges them through multiple rounds until we reach 9 final top-level clusters. Each merge combines similar topics while preserving the essential distinctions between different conversation types.

Now, let's examine these top-level clusters to understand the main themes in our data. 

By looking at the cluster names, descriptions, and sizes, we can quickly identify what users are discussing most frequently and how these topics relate to each other.

In [10]:
# Get top-level clusters (those without parents)
parent_clusters = [cluster for cluster in clusters if cluster.parent_id is None]

# Format each cluster's info with name, description and number of chats
formatted_clusters = []
for cluster in parent_clusters:
    cluster_info = (
        f"[bold]{cluster.name}[/bold] : {cluster.description} : {len(cluster.chat_ids)}"
    )
    formatted_clusters.append(cluster_info)

# Join with newlines and print
print("\n\n".join(formatted_clusters))
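The same `parent_id` links can be followed downwards to print the whole hierarchy rather than only its roots. The sketch below uses a small stand-in dataclass, `ToyCluster`, with hypothetical data; it assumes each cluster has an `id` field that children reference through `parent_id`, which you should confirm against Kura's actual cluster type before adapting it.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyCluster:
    # Stand-in for a cluster record; the `id` field is an assumption
    id: str
    name: str
    parent_id: Optional[str] = None
    chat_ids: list = field(default_factory=list)


toy_clusters = [
    ToyCluster("root", "Logging and visualization", chat_ids=["q1", "q2", "q3"]),
    ToyCluster("c1", "Custom charts", parent_id="root", chat_ids=["q1", "q2"]),
    ToyCluster("c2", "Logging metrics", parent_id="root", chat_ids=["q3"]),
]


def children_by_parent(clusters):
    """Group clusters by their parent_id (None holds the top level)."""
    index = {}
    for cluster in clusters:
        index.setdefault(cluster.parent_id, []).append(cluster)
    return index


def format_tree(index, parent_id=None, depth=0):
    """Return indented lines, one per cluster, depth-first."""
    lines = []
    for cluster in index.get(parent_id, []):
        lines.append("  " * depth + f"{cluster.name} ({len(cluster.chat_ids)})")
        lines.extend(format_tree(index, cluster.id, depth + 1))
    return lines


tree_lines = format_tree(children_by_parent(toy_clusters))
print("\n".join(tree_lines))
```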

## Analysing Our Results

### Understanding Our Top-Level Clusters

Looking at the top-level clusters generated by Kura, we can identify clear patterns in how users are interacting with the documentation.

The three largest clusters account for 69% of all queries:
1. **Streamline ML logging and visualization enhancements** (178 conversations) - Users seeking guidance on integrating W&B for logging and customizing visualizations
2. **Manage and log machine learning experiments efficiently** (123 conversations) - Focus on experiment management and tracking using tools like WandB
3. **Guide me on machine learning and Markdown usage** (84 conversations) - Assistance with Markdown reports and troubleshooting ML tools

What's particularly notable is that **logging and experiment management dominate user concerns**. The top two clusters alone represent 54% of all queries (301 out of 560), both focusing on different aspects of experiment tracking and logging.
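The percentages quoted above follow directly from the cluster sizes and the total of 560 queries. A short check, using only the numbers reported in this section:

```python
# Sizes of the three largest clusters, as reported above
cluster_sizes = {
    "Streamline ML logging and visualization enhancements": 178,
    "Manage and log machine learning experiments efficiently": 123,
    "Guide me on machine learning and Markdown usage": 84,
}
total_queries = 560


def share(count, total):
    """Percentage of `total` represented by `count`."""
    return 100 * count / total


sizes = sorted(cluster_sizes.values(), reverse=True)
top_two_share = share(sum(sizes[:2]), total_queries)
top_three_share = share(sum(sizes[:3]), total_queries)

print(f"Top two: {top_two_share:.2f}%")      # Top two: 53.75%
print(f"Top three: {top_three_share:.2f}%")  # Top three: 68.75%
```

Rounded to whole numbers, these are the 54% and 69% figures cited in the text.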

Additional significant themes include:
- **AWS integration and security** (75 conversations) - IAM roles, SageMaker training, and data storage
- **Team collaboration and data management** (67 conversations) - Table manipulation, collaboration metrics, and project management
- **Model performance optimization** (28 conversations) - Hyperparameter tuning and evaluation

This clustering reveals that the majority of user questions center around **how to effectively use W&B for logging, tracking, and visualizing ML experiments**. Users are consistently trying to figure out how to properly integrate W&B into their workflows, optimize their logging strategies, and create meaningful visualizations of their results.

### Analysing Our Summaries

Let's now examine some of the summaries that Kura generated for our individual query-document pairs.

To do so, we'll load the conversations we started with and look up the summary for each one. This allows us to evaluate how well each summary represents its conversation.

In [32]:
from kura.types import Conversation, ConversationSummary
from kura.checkpoints import JSONLCheckpointManager

checkpoint_manager = JSONLCheckpointManager(CHECKPOINT_DIRECTORY, enabled=True)
summaries = checkpoint_manager.load_checkpoint("summaries", ConversationSummary)
conversations = checkpoint_manager.load_checkpoint("conversations", Conversation)


id_to_conversation = {
    conversation.chat_id: conversation for conversation in conversations
}


for i in range(3):
    print(summaries[i].summary)
    print(id_to_conversation[summaries[i].chat_id].messages[0].content)
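Reading summaries one by one does not scale, so a rough heuristic can help: flag summaries that never mention a specific product feature. The sketch below is an assumption-laden illustration, not part of Kura. `FEATURE_STEMS` and `mentions_feature` are hypothetical names, the stems cover only the W&B features Artifacts, Configs, and Reports, and the two sample strings are illustrative rather than output from this run.

```python
# Hypothetical list of feature stems to look for (lowercase)
FEATURE_STEMS = ("artifact", "config", "report")


def mentions_feature(summary, stems=FEATURE_STEMS):
    """True if the summary contains any feature stem (case-insensitive)."""
    text = summary.lower()
    return any(stem in text for stem in stems)


# Illustrative summaries, one generic and one specific
sample_summaries = [
    "user seeks information about tracking",
    "user is managing W&B Artifacts for model versioning",
]

flags = [mentions_feature(s) for s in sample_summaries]
generic_share = flags.count(False) / len(flags)
print(flags, generic_share)  # [False, True] 0.5
```

Substring matching is crude (for example, "report" also matches "reporting"), so treat the resulting share as a prompt for manual review rather than a metric.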

## Conclusion

### What You Learned

In this notebook, you discovered how to transform raw user queries into actionable insights for RAG system improvements. You learned to:

- **Prepare query data for Kura** by formatting JSON data into Conversation objects with proper metadata
- **Run hierarchical clustering** using Kura's built-in capabilities to group similar conversations
- **Analyze clustering results** to identify the most common user query patterns and pain points

### What We Accomplished

By leveraging Kura's clustering capabilities, we organized 560 user queries into nine top-level clusters that revealed clear patterns in how users interact with the Weights & Biases documentation. The three largest clusters (logging and visualization, experiment management, and guidance on ML tooling and Markdown reports) account for over two-thirds of all queries, and the top two, both centered on experiment tracking and logging, cover 54% on their own.

However, we also identified critical limitations in the default summarization approach. Our generated summaries lacked specificity about the tools users wanted to use and sometimes included irrelevant context from retrieved documents. For example, summaries described queries as "user seeks information about tracking" rather than capturing the specific W&B features involved.

### Next: Better Summaries

While our clustering revealed valuable high-level patterns, the generic summaries limit our ability to understand specific user needs. In the next notebook, "Better Summaries", we'll address this limitation by building a custom summarization model that:

- **Identifies specific W&B features** (Artifacts, Configs, Reports) mentioned in each query
- **Captures precise user intent** rather than generic descriptions  
- **Creates domain-specific summaries** tailored to W&B terminology and workflows

By replacing vague summaries like "user seeks information about tracking" with precise descriptions like "user is managing W&B Artifacts for model versioning", we'll create clusters that better reflect real user needs and provide more targeted, actionable insights for system improvements.