Prerequisites: Make sure that you've downloaded and saved the 560 conversations to a file at the path
./data/conversations.json
-
Clone this repository:
git clone https://github.com/567-labs/how-to-look-at-data.git cd how-to-look-at-data
-
Install dependencies (using uv):
uv pip install -r pyproject.toml
-
Set up your environment variables:
- You will need an
OPENAI_API_KEY
for running this topic modelling proccess. We're using the OpenAI Text-Embedding-3-Small embeddings for clustering and OpenAI models for summarisation (used bykura
).
- You will need an
Discover patterns in 560 real user queries from Weights & Biases documentation using Kura's LLM-enhanced topic modeling.
You'll learn:
- How topic modeling reveals query patterns invisible to keyword analysis
- Converting raw queries into clustered insights using embeddings
- Why default summaries miss critical domain-specific details
Transform generic clustering results into precise, actionable insights with custom summarization models.
You'll learn:
- Building domain-specific summary models for your use case
- How better summaries dramatically improve clustering quality
- Reducing noise to focus on what matters for your users
3. Classifiers
Build production-ready classifiers that achieve 90%+ accuracy through systematic prompt engineering.
You'll learn:
- Creating weak labels for rapid dataset creation
- Iterative prompt improvement techniques
- Deploying classifiers to monitor query patterns in real-time
This repository was created for the AI Engineering Summit. It demonstrates practical techniques for analyzing and improving Retrieval-Augmented Generation (RAG) systems using real-world data and modern topic modeling tools.